Artificial intelligence tutors are often hailed as the future of education, with the potential to adapt to individual learning styles, provide step-by-step guidance, and offer instant, personalized feedback. But how well do these AI tools actually perform in real classroom settings, especially in complex subjects like law?
The Experiment: SmartTest in Action
Researchers at the University of Wollongong built a specialized chatbot called SmartTest for a criminal law course. Unlike general-purpose chatbots, SmartTest let educators embed specific questions, model answers, and prompts, and it used the Socratic method to encourage critical thinking rather than simply giving away answers.
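The study does not publish SmartTest's internals, but the pattern it describes, an educator-authored question and model answer wrapped in a Socratic system prompt around a general-purpose model, can be sketched roughly as follows. The question text, model name, and function names here are illustrative assumptions, not the researchers' actual code.

```python
# Illustrative sketch only: the prompt structure, example question, model name,
# and helper names below are assumptions, not SmartTest's published code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# An educator-authored item: the question students see and a model answer the
# chatbot can check responses against (both hypothetical examples).
QUESTION = (
    "Dana takes a bicycle from a rack intending to return it the next day. "
    "Has Dana committed larceny?"
)
MODEL_ANSWER = (
    "Larceny requires an intention to permanently deprive the owner; an "
    "intention to return the bicycle means that element is likely not satisfied."
)

SYSTEM_PROMPT = f"""You are a Socratic criminal-law tutor.
The student is working on this question: {QUESTION}
A model answer, for your reference only: {MODEL_ANSWER}
Never reveal the model answer directly. Ask one short guiding question at a
time, point out missing elements of the offence, and confirm correct reasoning
when the student reaches it."""

def tutor_reply(history: list[dict]) -> str:
    """Return the tutor's next message given the conversation so far."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the study tested several ChatGPT models
        messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
    )
    return response.choices[0].message.content

# Example exchange
history = [{"role": "user", "content": "I think Dana is guilty of larceny."}]
print(tutor_reply(history))
```

The key design choice this illustrates is that the educator, not the student, controls the question, the reference answer, and the tutoring behaviour; the model only fills in the conversational turns.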
Over five test cycles, about 35 students per session used SmartTest to work through legal scenarios and short-answer questions. Their interactions and the chatbot’s feedback were carefully recorded and analyzed.
What Went Right—and Wrong
SmartTest showed promise in helping students spot gaps in their understanding. However, the chatbot often gave inaccurate or misleading feedback: in the scenario-based question cycles, between 40% and 54% of conversations contained at least one error. When the format shifted to simpler short-answer questions, the error rate dropped to 6%-27%, but mistakes still occurred.
A major challenge was the amount of effort required to make SmartTest work effectively. Rather than saving time, it demanded extensive prompt engineering and manual oversight from instructors, raising questions about its practicality for busy educators.
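The study does not spell out what that oversight looked like in practice, but one plausible, entirely hypothetical version is instructors periodically sampling logged conversations, checking whether the feedback was legally accurate, and feeding corrections back into the embedded prompts. The file name and record fields below are invented for illustration.

```python
# Hypothetical oversight sketch: the log file name and record fields are
# invented for illustration, not taken from the study.
import json
import random

def sample_transcripts(log_path: str, k: int = 10) -> list[dict]:
    """Randomly sample logged chatbot conversations for instructor review."""
    with open(log_path) as f:
        transcripts = [json.loads(line) for line in f]
    return random.sample(transcripts, min(k, len(transcripts)))

# An instructor reads each sampled exchange and notes whether the chatbot's
# feedback was accurate, then revises the embedded prompts accordingly.
for t in sample_transcripts("smarttest_logs.jsonl"):
    print(t["question_id"])
    for turn in t["messages"]:
        print(f"{turn['role']}: {turn['content']}")
    print("---")
```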
Reliability Issues
Perhaps most concerning was SmartTest’s inconsistency. Even under the same conditions, the chatbot sometimes gave excellent feedback and, at other times, confusing or incorrect information. Upgrading to newer AI models, like ChatGPT-4.5, didn’t consistently improve results.
Student Perspectives
Students appreciated the immediate feedback and conversational style, which helped reduce anxiety and encouraged participation. Most preferred having access to the chatbot over no practice at all. However, only a minority (27%) favored AI feedback over delayed feedback from human tutors, with nearly half still preferring human input.
While AI chatbots like SmartTest show promise for supporting low-stakes learning, they currently lack the reliability and depth needed for more advanced educational settings. The convenience is appealing, but students and teachers still place greater trust in human expertise. The study underscores the need for caution: AI tools should be seen as experimental aids rather than replacements for traditional teaching, at least for now.