Research Platform
संस्कृतकथा
Human Evaluation of Sanskrit Stories
We generated 50,000 synthetic Sanskrit stories using AI. We now need your expert judgment of their quality. Your ratings will contribute to an academic paper on Sanskrit and language models.
How it works
Read a Sanskrit story
Each story is displayed in Devanagari with context: the required vocabulary, Dharmic principle, and story feature.
Score on 6 dimensions
Rate each story on a 1–5 scale for grammar, vocabulary level, coherence, Dharmic integration, word usage, and cultural authenticity.
Complete a batch of 10
Each session is a batch of 10 stories and takes about 25 minutes. Review at your own pace and complete as many batches as you like.
6 quality dimensions
10 stories per batch
~25 min per session
Blind review protocol
Why this matters
The problem with AI evaluators
LLM-as-judge evaluations have known biases: a self-preference bias of about 0.8 points on a 5-point scale, and a leniency bias that varies by 1 point or more across evaluators. We cannot rely on AI to grade its own work.
Human evaluation is the gold standard
Your scores provide the ground truth. By comparing human judgments with LLM scores on the same stories, we can measure how far AI evaluators can be trusted for Sanskrit content.
Contributing to Sanskrit + AI research
This data feeds directly into an academic paper (targeting ACL/EMNLP 2026) on generating pedagogically sound Sanskrit stories. Your contribution advances both Sanskrit education and AI research.
Rigorous methodology
The study uses a blind review protocol, calibration stories, inter-annotator agreement metrics, and trust scoring, following best practices from crowdsourced annotation research.
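For readers curious how human and LLM scores might be compared, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the project's actual analysis code: the function names are hypothetical, and the choice of quadratic-weighted Cohen's kappa as the agreement metric is an assumption, since the page does not say which metrics are used.

```python
from collections import Counter


def mean_signed_gap(llm_scores, human_scores):
    """Average of (LLM score - human score) over the same stories.

    A positive value means the LLM judge scored higher than the human
    reviewers on average, i.e. it was more lenient.
    """
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    gaps = [l - h for l, h in zip(llm_scores, human_scores)]
    return sum(gaps) / len(gaps)


def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=5):
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale.

    1.0 is perfect agreement, 0.0 is chance-level agreement, and negative
    values mean the raters agree less often than chance would predict.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(rater_a)
    span = max_score - min_score
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) ** 2) / (span ** 2)
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * (hist_a[i] / n) * (hist_b[j] / n)

    if expected_disagreement == 0:
        # Both raters gave one identical score to every story; kappa is
        # formally undefined here, and this sketch returns 1.0 by convention.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement
```

For example, `mean_signed_gap([5, 4, 4], [4, 3, 4])` measures how much higher an LLM judge scored three stories than a human reviewer did, and `quadratic_weighted_kappa` can be applied to any two reviewers who rated the same stories on one dimension.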
What you'll evaluate
Sanskrit Grammar
Correct vibhakti, verb forms, sandhi
Vocabulary Level
Age-appropriate word choices
Story Coherence
Clear beginning, middle, and end
Dharmic Integration
Natural embedding of principle
Word Usage
Required words fit naturally
Cultural Authenticity
Bharatiya setting, names, customs
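A single review of one story could be represented roughly as follows. This is a hedged sketch only: the class name, field names, and dimension keys are hypothetical and chosen for illustration, while the six dimensions and the 1–5 scale come from the description above.

```python
from dataclasses import dataclass

# Hypothetical keys for the six quality dimensions listed above.
DIMENSIONS = (
    "grammar",
    "vocabulary_level",
    "coherence",
    "dharmic_integration",
    "word_usage",
    "cultural_authenticity",
)


@dataclass(frozen=True)
class StoryRating:
    """One reviewer's scores for one story (illustrative structure only)."""

    story_id: str
    reviewer_id: str
    scores: dict  # dimension key -> integer score from 1 to 5

    def __post_init__(self):
        # Every dimension must be scored exactly once, with no unknown keys.
        missing = set(DIMENSIONS) - set(self.scores)
        extra = set(self.scores) - set(DIMENSIONS)
        if missing or extra:
            raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
        # Each score must be a whole number on the 1-5 scale.
        for name, value in self.scores.items():
            if type(value) is not int or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")

    def mean_score(self) -> float:
        """Unweighted average across the six dimensions."""
        return sum(self.scores.values()) / len(DIMENSIONS)
```

Validating at construction time means an incomplete or out-of-range review is rejected before it can enter any later analysis.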
About this project
SanskritKatha is a research project by Mahesh Ramakrishnan, based in Bengaluru. The goal: understand how well AI can generate pedagogically sound Sanskrit stories, and where it falls short.
The project has produced 50,000 synthetic Sanskrit stories across two difficulty tiers using multiple language models. Automated evaluation gives us numbers, but numbers alone don't capture whether a story feels right to someone who knows Sanskrit. That's where you come in.
Mahesh has spent 26 years in language technology — building NLP products, working with text at scale, and more recently, exploring what AI can and cannot do for classical languages. This project sits at that intersection.
This is not a funded lab or a university department. It is one person's serious attempt at contributing to a field that matters. All reviewer contributions will be acknowledged in the resulting publication.
Apply to be a reviewer
We're looking for Sanskrit scholars, teachers, students, and enthusiasts.