Data

Datasets In Progress

We have not yet released any public datasets. We do maintain internal data for training models, and we are preparing Thai benchmark datasets for public release.

Internal Data

Used for Training

Used internally — not yet available for download

CPT Corpus: Thai continual pre-training corpus, ~144M tokens (CulturaX + Pantip) [Internal]
SFT Data: Instruction fine-tuning data, including legal, math, and agent domain variants [Internal]
NaraEval-TH: Thai evaluation framework (8 dimensions, 200 questions, dual judge); evaluation in progress [In Progress]

Planned Releases

Upcoming Benchmarks

In preparation — not yet available for download

TH-MMLU: Thai multi-task understanding evaluation (multi-subject MCQs) [In Preparation]
TH-MBPP: Thai Python programming evaluation [Planned]
ThaiDial: Thai conversation dataset [Planned]

Principles

How We Handle Data

Open Sources

We disclose the source and license of every dataset used in training

Respect Licenses

We only use data that permits use for model training

Gradual Release

We release public datasets only once they are ready and verified, without rushing