Data

Datasets In Progress

We have not yet released any public datasets. We do maintain internal data used for training models, and we are preparing Thai benchmark datasets for public release.

Internal Data

Used for Training

Used internally; not yet available for download

CPT Corpus: Thai continual pre-training corpus, ~144M tokens (CulturaX + Pantip)

Internal

SFT Data: Instruction fine-tuning data (includes legal, math, and agent domain variants)

Internal

NaraEval-TH: Thai evaluation framework (8 dimensions, 200 questions, dual judge); evaluation in progress

In Progress

Planned Releases

Upcoming Benchmarks

In preparation; not yet available for download

TH-MMLU: Thai multi-task understanding evaluation (multi-subject multiple-choice questions)

In Preparation

TH-MBPP: Thai Python programming evaluation

Planned

ThaiDial: Thai conversation dataset

Planned

Principles

How We Handle Data

Open Sources

We disclose the source and license of every dataset used in training

Respect Licenses

We only use data whose license permits use for model training

Gradual Release

We release public datasets only once they are ready and verified, without rushing