Author – Baihong Bao
For decades, human-computer interaction was built on a comfortable assumption: people click, systems respond and the boundaries are clear. A button is a button, a menu is a menu. When things go wrong, you can often spot the missing step, the frozen cursor, the bad modal window.
Then came conversational interfaces, and the assumptions broke down in ways that show up across every domain.
A teacher asks a chatbot to explain photosynthesis to a class. It does so fluently and the students walk away thinking plants breathe oxygen. A patient describes worsening insomnia and a loss of appetite. The chatbot responds with breathing exercises and a reminder to stay positive, missing a pattern that any trained clinician would flag. A person interested in reading the news asks for “balanced coverage” and gets an answer shaped more by hidden defaults than transparent design.
In graphical user interfaces (GUIs), evaluation could rely on predictable elements: information display, navigation flow and error prevention. Conversational user interfaces (CUIs), systems where the primary interaction is through natural language dialogue, blow that up. The interface talks back. It adapts, improvises, speculates and sometimes invents. Evaluating such systems with GUI-era methods is like testing a bridge by shaking only one plank.
This shift is not cosmetic. It’s structural, and it demands an evaluation revolution.
Why GUI-era Evaluation Fails in CUI Worlds
CUIs expose a simple truth: there is no fixed screen anymore. The “interface” is the model’s behaviour. Every answer is a new state. Every user message is a new context. Every turn is a micro-negotiation of goals, expectations and values.
Three assumptions baked into GUI evaluation now collapse in a CUI:
Output space explosion: In GUIs, you can see all the options. In CUIs, you can’t enumerate every possibility, so you can’t pre-check every failure mode. This was already visible in previous LLM alignment work: you can’t “remove” bad outputs from an effectively unbounded output space. You need mechanisms for detection, repair and adaptation in the moment.
Subjective experience becomes first-class data: Tone, humility, perceived respect, teachability. None of these fit neatly into classical usability checklists. Yet they drive trust, reliance and long-term behaviour. A model can be factually correct yet socially wrong. In CUIs, experiential qualities aren’t “nice to have”; they are operational variables.
Multi-turn dynamics break linear assumptions: GUI evaluation often assumes a clean A→B→C flow. CUIs operate in loops. A user’s trust after the third exchange affects how they phrase their seventh message, which shapes the model’s accuracy in the twelfth. Benchmark-centric work rarely documents these feedback loops, and evaluations don’t yet capture the “co-adaptive” dance between users and models. The result: many systems pass benchmark tests while quietly failing users.
The Missing Lens: Interaction, Perception and Alignment
The GUI era was able to get away with measuring capability. The CUI era requires measuring consequences.
We need evaluation methods that link:
- Objective system aspects (e.g. refusal policy strictness, tool access, explanation style)
- Subjective system aspects (perceived clarity, respect, transparency)
- Experience (satisfaction, trust, mental models, effort)
- Interaction behaviour (repair attempts, reliance, abandonment, reframing)
- Personal & situational context (expertise, goals, vulnerability)
This structure echoes work on user-centric evaluation of recommender systems, which has long shown that “being accurate is not enough”. CUIs inherit the same principle but at higher stakes and with more volatile behaviour. The shift from GUI to CUI isn’t just about interface modality. It’s about evaluation scope.
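As a rough illustration of what linking these five aspects could look like in practice, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `SessionRecord` type, its field names, the 1-5 rating scale, the `accuracy` score and the thresholds are hypothetical, not part of any established framework.

```python
from dataclasses import dataclass


@dataclass
class SessionRecord:
    """One evaluated conversation, grouped by the five aspects listed above.

    All field names and scales are hypothetical.
    """
    objective: dict[str, str]   # e.g. refusal policy strictness, explanation style
    perceived: dict[str, int]   # e.g. perceived clarity, respect (1-5 ratings)
    experience: dict[str, int]  # e.g. satisfaction, trust (1-5 ratings)
    behaviour: dict[str, int]   # e.g. counts of repair attempts, abandonment
    context: dict[str, str]     # e.g. expertise, goals, vulnerability
    accuracy: float = 0.0       # share of turns judged factually correct


def accurate_but_failing(records, min_accuracy=0.9, max_trust=2):
    """Return sessions that score well on accuracy but poorly on reported trust."""
    return [
        r for r in records
        if "trust" in r.experience
        and r.accuracy >= min_accuracy
        and r.experience["trust"] <= max_trust
    ]


sessions = [
    SessionRecord({"explanation_style": "terse"}, {"respect": 2}, {"trust": 1},
                  {"repair_attempts": 3}, {"expertise": "novice"}, accuracy=0.95),
    SessionRecord({"explanation_style": "stepwise"}, {"respect": 5}, {"trust": 5},
                  {"repair_attempts": 0}, {"expertise": "novice"}, accuracy=0.92),
]
flagged = accurate_but_failing(sessions)
print(len(flagged))  # number of sessions that are accurate yet distrusted
```

The point of the sketch is only that accuracy and experience sit in the same record, so a session that would pass a benchmark can still be surfaced as a failure from the user’s side.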
Why this Matters Now
Models are no longer passive tools. They are partners: sometimes good, sometimes brittle, always unpredictable. They are entering sensitive domains fast: education, health, news, policy as well as everyday decision-making.
If we continue using GUI-era evaluation strategies, we will misread system quality, overlook subtle harms and mistake smooth conversation for genuine alignment.
Generative models blur the line between output quality and user experience. A confident tone can hide uncertainty. A polite apology can mask repeated failure. A personalised answer can narrow someone’s worldview without them noticing.
In GUI ecosystems, design errors were visible. In CUI ecosystems, design errors sound helpful. They arrive wrapped in fluent language, confident tone and apparent comprehension, making them harder to detect for users and evaluators alike. That makes modern evaluation not just a research problem but an ethical one.
What the Evaluation Revolution Looks Like
A credible CUI evaluation framework needs three upgrades:
- Measure what users perceive, not just what models produce: We need validated scales for perceived transparency, fairness, autonomy support, emotional safety and cognitive explainability. These determine real-world reliance, not benchmark numbers.
- Capture multi-turn causal chains: How does a system change a user’s mental model mid-conversation? When a misunderstanding happens, does the repair restore trust or degrade it? These are dynamic phenomena, not single-shot metrics.
- Define alignment in use, not just alignment in principle: Training pipelines like reinforcement learning from human feedback (RLHF) capture generic preferences. But CUIs require alignment that persists across turns, context shifts and user diversity. A system should behave well even when users deviate from the script.
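To make the second upgrade more concrete, the following Python sketch asks whether trust recovers after a misunderstanding. It assumes a per-turn trust self-report on a 1-5 scale and a manually annotated `misunderstanding` flag; both, along with the `Turn` type and the `repair_effects` function, are hypothetical names introduced for illustration. Comparing the turn before with the turn after is a deliberately crude proxy for a repair effect, not a validated measure.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    index: int                      # position of the turn in the conversation
    trust: int                      # hypothetical per-turn self-report, 1-5
    misunderstanding: bool = False  # annotated: did the system misread the user here?


def repair_effects(turns):
    """For each misunderstanding, compare trust one turn before and one turn after.

    Returns (turn index, trust change) pairs. A change of 0 or more suggests the
    repair restored trust; a negative change suggests it degraded trust.
    Misunderstandings on the first or last turn are skipped because they lack
    a neighbour on one side.
    """
    effects = []
    for i, turn in enumerate(turns):
        if turn.misunderstanding and 0 < i < len(turns) - 1:
            effects.append((turn.index, turns[i + 1].trust - turns[i - 1].trust))
    return effects


conversation = [
    Turn(1, 4),
    Turn(2, 2, misunderstanding=True),
    Turn(3, 4),
    Turn(4, 4),
    Turn(5, 1, misunderstanding=True),
    Turn(6, 2),
]
print(repair_effects(conversation))  # [(2, 0), (5, -2)]
```

In this made-up conversation the first repair brings trust back to its earlier level, while the second leaves it two points lower: the kind of difference a single-shot metric cannot show.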
The Bridge We Need to Build
We are living through the biggest interaction shift since the mouse and keyboard. GUI thinking gave us decades of stable mental models. CUIs replace them with improvisation, ambiguity and co-adaptation.
The evaluation revolution isn’t optional. It’s the only way to match the speed of deployment with the depth of responsibility.
Designing better CUIs starts with asking better questions:
- Did the model help the user understand or just answer?
- Did it widen the user’s view or narrow it?
- Did it signal uncertainty in a way that changed behaviour?
- Did it support autonomy instead of quietly eroding it?
- Did it repair mistakes in ways humans perceive as genuine?
These questions weren’t part of GUI-era evaluation but are now unavoidable.
Closing
Conversational interfaces rewrite the rules. They blur input and output, system and interface, guidance and persuasion. If GUIs were about control, CUIs are about conversation. Yet conversation demands a different kind of accountability.
Being accurate is table stakes. Being aligned, adaptive, respectful and transparent across multi-turn interaction is the real challenge. We won’t get there with GUI-era evaluations.
The interface has changed. Our evaluation mindset must change with it.
References
Bommasani, R., Liang, P., & Lee, T. (2023). Holistic evaluation of language models. Annals of the New York Academy of Sciences, 1525(1), 140-146.
Fragiadakis, G., Diou, C., Kousiouris, G., & Nikolaidou, M. (2024). Evaluating human-AI collaboration: A review and methodological framework. arXiv preprint arXiv:2407.19098.
Gao, J., Gebreegziabher, S. A., Choo, K. T. W., Li, T. J. J., Perrault, S. T., & Malone, T. W. (2024, May). A taxonomy for human-LLM interaction modes: An initial exploration. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (pp. 1-11).
Knijnenburg, B. P., Willemsen, M. C., Gantner, Z., Soncu, H., & Newell, C. (2012). Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction, 22(4), 441-504.
Zhao, M., Simmons, R., & Admoni, H. (2025). The role of adaptation in collective human–AI teaming. Topics in Cognitive Science, 17(2), 291-323.
Image Attribution
Generated by: Nano Banana Pro
Date: 24 October 2025
Prompt: “A minimalist, modern illustration showing the shift from GUI to CUI. On the left, depict classic graphical interfaces: floating windows, buttons, sliders, icons—clean geometric shapes in cool blues and teals. On the right, transition into a warm, organic conversational interface made of abstract speech bubbles, flowing lines and soft gradients, representing dynamic dialogue with AI. A clear bridge or smooth gradient connects the two worlds, symbolising the evolution toward conversational systems and the need for new evaluation methods. Style: flat vector, high contrast, soft shadows, generous negative space. Absolutely no text, no letters, no numbers, no logos, no UI labels. Output resolution: 1920×1080.”