edtech · AI · assessment

Building an AI Essay Grading System: Rules-Based Evaluation Meets LLM Intelligence

How we built EmanuelAYCE's AI tutor — combining rules-based grading with LLM intelligence for law school exam questions. Custom rubrics in natural language, personalised feedback, accuracy validation.

Evgeny Smirnov

The problem with AI grading

Most AI grading tools take one of two approaches, and both have problems.

The first is pure rubric scoring — assign points for keyword presence, argument structure, length requirements. This is fast and consistent but misses nuance. A student who uses all the right legal terms but applies them incorrectly gets a good score. A student who demonstrates brilliant reasoning using unexpected terminology gets penalised.

The second is pure LLM evaluation — ask GPT-4 or Claude to read the essay and grade it. This catches nuance but introduces inconsistency. The same essay submitted twice might get different scores. The model might apply its own standards rather than the instructor’s. And there’s no transparency — the student doesn’t know why they got the grade they got.

EmanuelAYCE needed something better. Law school issue-spotting essays require identifying legal issues, applying relevant rules, analysing facts, and reaching conclusions. The grading needs to be both nuanced (understanding legal reasoning) and consistent (applying the instructor’s criteria the same way every time).

The architecture we built

The solution combines three layers. The rules layer lets instructors define grading criteria in natural language. Instead of numeric rubrics, an instructor might write: “The student should identify the due process issue raised by the government’s action. A strong answer will distinguish between substantive and procedural due process. The student should apply the Mathews v. Eldridge balancing test.” These rules are stored as structured text and fed to the LLM as evaluation criteria.
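To make the shape of this concrete, here is a minimal sketch of how such rules might be stored. The `GradingRule` structure and the field names are illustrative assumptions, not EmanuelAYCE's actual schema; the point is that the criterion itself stays as plain instructor prose.

```python
from dataclasses import dataclass

@dataclass
class GradingRule:
    """One instructor-defined criterion, kept as plain natural language."""
    rule_id: str       # stable identifier for tracking and calibration
    text: str          # the criterion, in the instructor's own words
    weight: float = 1.0  # optional relative importance

rules = [
    GradingRule("dp-issue",
                "The student should identify the due process issue "
                "raised by the government's action."),
    GradingRule("dp-distinction",
                "A strong answer will distinguish between substantive "
                "and procedural due process."),
    GradingRule("mathews-test",
                "The student should apply the Mathews v. Eldridge "
                "balancing test."),
]
```

Note that nothing here is numeric except an optional weight; the semantics live entirely in `text`, which the LLM reads as-is.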

The evaluation layer uses the LLM to assess each student response against the instructor’s rules. The prompt is carefully constructed: it includes the question, the rules, a model answer (if provided), and the student’s response. The LLM evaluates each criterion independently, producing a structured assessment — which issues were identified, which rules were correctly applied, where the reasoning went wrong.
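A sketch of that prompt assembly, under the assumption that the LLM is asked to return one JSON verdict per criterion (the exact response format and the function name are illustrative, not the production prompt):

```python
def build_evaluation_prompt(question, rules, model_answer, student_response):
    """Assemble the evaluation prompt: question, numbered criteria,
    an optional model answer, and the student's essay, asking for a
    per-criterion structured verdict."""
    criteria = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(rules))
    parts = [
        f"EXAM QUESTION:\n{question}",
        f"GRADING CRITERIA (evaluate each independently):\n{criteria}",
    ]
    if model_answer:  # model answer is optional
        parts.append(f"MODEL ANSWER:\n{model_answer}")
    parts.append(f"STUDENT RESPONSE:\n{student_response}")
    parts.append(
        "For each criterion, return a JSON object with keys "
        '"criterion", "met" (true/false), and "evidence" '
        "(a quote or explanation). Return a JSON array, "
        "one object per criterion."
    )
    return "\n\n".join(parts)
```

Asking for per-criterion verdicts rather than a single holistic score is what makes the later layers possible: the assessment stays inspectable, and each criterion can be traced back to the instructor's rule.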

The feedback layer transforms the structured assessment into personalised guidance. Rather than just saying “you missed the due process issue,” the system explains: “Your answer addresses the equal protection argument well, but doesn’t identify the due process problem created by the government’s action without a hearing. Consider how the Mathews v. Eldridge factors apply here — what private interest is at stake? What’s the risk of error?” This guided approach helps students learn without giving them the answer directly.
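The transformation itself can be sketched as a simple pass over the structured assessment: acknowledge what was met, and turn each missed criterion into a guiding prompt rather than an answer. The `hint` field and function name are assumptions for illustration.

```python
def feedback_from_assessment(assessment):
    """Turn per-criterion verdicts into guidance: praise criteria that
    were met, and pose guiding questions for criteria that were missed."""
    met = [a for a in assessment if a["met"]]
    missed = [a for a in assessment if not a["met"]]
    lines = []
    if met:
        lines.append(
            "What you did well: "
            + "; ".join(a["criterion"] for a in met) + "."
        )
    for a in missed:
        # Guide with a question, not the answer itself
        lines.append(f"Look again at: {a['criterion']}. {a.get('hint', '')}".strip())
    return "\n".join(lines)
```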

Why natural language rules matter

The decision to express grading criteria in natural language rather than numeric rubrics was one of the most important architectural choices. It means instructors can define exactly what they’re looking for in the same language they’d use to explain it to a colleague. No one needs to translate pedagogical intent into software configuration.

It also means the criteria can be as specific or general as needed. For a torts exam, the instructor might require identification of five specific issues. For a policy essay, the criteria might be more about quality of argumentation and use of evidence. The same system handles both.

We’ve since applied this pattern to other projects — including the grading rules in PlanYourSunset’s document generation, where legal document validity criteria are expressed in plain language. The underlying principle is the same: let domain experts define rules in their own language, and let the AI interpret and apply them.

Consistency and validation

The biggest concern with LLM-based grading is consistency. We address this in three ways.

Multiple evaluation passes: each essay is evaluated twice with slightly different prompt orderings. If the two evaluations disagree significantly, a third pass is triggered, and the final grade uses a consensus approach.
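The consensus logic can be sketched as follows; `evaluate` is a hypothetical callable wrapping an LLM pass that returns a 0–100 score, and the threshold value is illustrative, not the number used in production.

```python
import statistics

def consensus_grade(evaluate, essay, threshold=10):
    """Run two evaluation passes; if they disagree by more than
    `threshold` points, trigger a third pass and take the median."""
    first = evaluate(essay, seed=1)
    second = evaluate(essay, seed=2)
    if abs(first - second) <= threshold:
        return (first + second) / 2
    third = evaluate(essay, seed=3)
    return statistics.median([first, second, third])
```

The median is a natural consensus choice here because it discards a single outlier pass without averaging it into the final grade.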

Calibration against expert grades: before deploying for a new course, we run the system against a set of expert-graded essays and measure agreement. The system is tuned until it reaches acceptable inter-rater reliability with the expert grades.
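One simple agreement measure for that calibration step might look like the sketch below: the fraction of essays where the system's score lands within a tolerance of the expert's. This is an illustrative proxy, not necessarily the reliability statistic EmanuelAYCE uses.

```python
def grade_agreement(system_grades, expert_grades, tolerance=5):
    """Fraction of essays where the system score falls within
    `tolerance` points of the expert score (0-100 scale assumed)."""
    pairs = list(zip(system_grades, expert_grades))
    within = sum(1 for s, e in pairs if abs(s - e) <= tolerance)
    return within / len(pairs)
```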

Student appeal mechanism: students can flag grades they disagree with. These flags feed back into the calibration process, helping identify criteria that the system is applying differently than the instructor intends.

“The insight that made EmanuelAYCE work was treating the AI as a consistent applier of human-defined criteria, not as an independent judge. The instructor’s judgment is encoded in the rules. The AI’s job is to apply those rules the same way to every student. This gives you the nuance of LLM understanding with the consistency of automated scoring.”

— Evgeny Smirnov, CEO and Lead Architect

What we learned

The hardest part wasn’t the AI — it was getting the rules right. Instructors initially wrote criteria that were either too vague (the AI interpreted them inconsistently) or too specific (the AI penalised students who took valid but unexpected approaches). The iteration process — writing rules, testing against student responses, refining — typically took 2–3 rounds per assignment.

The feedback quality surprised everyone. Students consistently rated the AI feedback as more helpful than traditional red-ink annotations, because it explained not just what was wrong but why and what to do about it. Some students told us they understood concepts better from the AI feedback than from class discussion.

Budget for an essay grading system with custom rubrics: $40K–$80K for MVP, 6–10 weeks. Ongoing costs are primarily LLM API ($500–$3,000/month depending on student volume) plus periodic rubric updates.


Building an AI assessment or tutoring system? Contact us — we’ll show you how EmanuelAYCE works and discuss your specific educational context.