संस्कृतकथा

Human Evaluation of Sanskrit Stories

We generated 50,000 synthetic Sanskrit stories using AI. Now we need your expert judgment to evaluate their quality. Your ratings will contribute to an academic paper on Sanskrit and language models.

कथम्

How it works

Read a Sanskrit story

Each story is displayed in Devanagari with context: the required vocabulary, Dharmic principle, and story feature.

Score on 6 dimensions

Rate grammar, vocabulary level, coherence, Dharmic integration, word usage, and cultural authenticity (1–5 scale).

Complete a batch of 10

Each session is a batch of 10 stories and takes about 25 minutes. Review at your own pace and complete as many batches as you like.

6

Quality dimensions

10

Stories per batch

~25 min

Per session

Blind

Review protocol

किमर्थम्

Why this matters

The problem with AI evaluators

LLM-as-judge evaluations have known biases: a self-preference bias of about 0.8 points on a 5-point scale, and a leniency bias that varies by 1 point or more across evaluators. We cannot trust AI to grade its own work.

Human evaluation is the gold standard

Your scores provide the ground truth. By comparing human judgments with LLM scores, we can measure exactly how much AI evaluators can be trusted for Sanskrit content.
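As a rough illustration of that comparison, the sketch below computes how far LLM scores sit above or below human scores for the same stories. The function name `compare_scores` and the sample scores are hypothetical, chosen only for illustration; they are not the project's actual analysis code or data.

```python
from statistics import mean

def compare_scores(human, llm):
    """Return (mean signed difference, mean absolute difference) of LLM vs human scores."""
    if len(human) != len(llm) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    # Positive difference means the LLM rated the story higher than the human did.
    diffs = [l - h for h, l in zip(human, llm)]
    return mean(diffs), mean(abs(d) for d in diffs)

# Hypothetical 1-5 scores for five stories (illustrative, not project data).
human_scores = [3, 4, 2, 5, 3]
llm_scores = [4, 4, 3, 5, 4]

bias, abs_gap = compare_scores(human_scores, llm_scores)
print(f"mean LLM - human difference: {bias:.2f}")
print(f"mean absolute difference:    {abs_gap:.2f}")
```

A positive mean difference would suggest the LLM evaluator is more lenient than human reviewers on that dimension; the absolute difference shows how far apart the two are regardless of direction.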

Contributing to Sanskrit + AI research

This data feeds directly into an academic paper (targeting ACL/EMNLP 2026) on generating pedagogically sound Sanskrit stories. Your contribution advances both Sanskrit education and AI research.

Rigorous methodology

The study uses a blind review protocol, calibration stories, inter-annotator agreement metrics, and trust scoring, following best practices from crowdsourced annotation research.
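For readers curious what an inter-annotator agreement metric looks like, here is a minimal sketch of Cohen's kappa for two reviewers. This is one common agreement measure, shown as an assumption; the page does not say which metric the project uses. The function name `cohen_kappa` and the sample ratings are hypothetical.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators rating the same items (unweighted)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(ratings_a)
    # Fraction of items on which both annotators gave the same score.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each annotator's score distribution.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical score throughout
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 scores from two reviewers on six stories (not project data).
reviewer_1 = [3, 4, 4, 5, 2, 3]
reviewer_2 = [3, 4, 3, 5, 2, 4]
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance: 1 means perfect agreement and 0 means no better than chance. For ordinal 1–5 scores, a weighted variant is often preferred because it gives partial credit for near-misses.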

मूल्याङ्कनम्

What you'll evaluate

Sanskrit Grammar

Correct vibhakti, verb forms, sandhi

Vocabulary Level

Age-appropriate word choices

Story Coherence

Clear beginning, middle, and end

Dharmic Integration

Natural embedding of principle

Word Usage

Required words fit naturally

Cultural Authenticity

Bharatiya setting, names, customs

परिचयः

About this project

SanskritKatha is a research project by Mahesh Ramakrishnan, based in Bengaluru. The goal: understand how well AI can generate pedagogically sound Sanskrit stories, and where it falls short.

The project has produced 50,000 synthetic Sanskrit stories across two difficulty tiers using multiple language models. Automated evaluation gives us numbers, but numbers alone don't capture whether a story feels right to someone who knows Sanskrit. That's where you come in.

Mahesh has spent 26 years in language technology — building NLP products, working with text at scale, and more recently, exploring what AI can and cannot do for classical languages. This project sits at that intersection.

This is not a funded lab or a university department. It is one person's serious attempt at contributing to a field that matters. All reviewer contributions will be acknowledged in the resulting publication.

Apply to be a reviewer

We're looking for Sanskrit scholars, teachers, students, and enthusiasts.