Data
Datasets In Progress
We have not yet released any public datasets. We do have internal data used for training models, and we are preparing Thai benchmark datasets for public release.
Used for Training
Used internally — not yet available for download
CPT Corpus — Thai continual pre-training corpus ~144M tokens (CulturaX + Pantip)
Internal SFT Data — Instruction fine-tuning data (includes legal/math/agent domain variants)
Internal NaraEval-TH — Thai evaluation framework (8 dimensions, 200 questions, dual judge) — evaluation in progress
Upcoming Benchmarks (In Progress)
In preparation — not yet available for download
TH-MMLU: Thai multi-task understanding evaluation (multi-subject MCQs)
TH-MBPP: Thai Python programming evaluation
ThaiDial: Thai conversation dataset
How We Handle Data
Open Sources
We disclose the sources and license of every dataset used in training
Respect Licenses
We only use data that permits use for model training
Gradual Release
We release public datasets only once they are ready and verified, without rushing