Qwen 3.5 covers Amharic. The constraint is tokenizer efficiency, not vocabulary coverage.

Why we even bothered to check this

Amharic is the language of more than 60 million people and the working language of Ethiopia. At Addis AI, we build voice tools, chatbots, and summarizers that need to feel native rather than merely passable.

That makes tokenizer efficiency a real product issue, not a technical footnote. If a model needs too many tokens to read Amharic, cost goes up, context windows shrink faster, and production behavior gets harder to manage.

What tokenization really means here

Tokenization is how a model chops text into pieces it can process. English usually gets bigger pieces. Amharic often gets much smaller ones. Smaller pieces mean more tokens, higher cost, shorter effective context, and more room for quality loss.
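For concreteness, here is a minimal sketch of that difference, assuming the Hugging Face transformers library and a placeholder model id (substitute the exact checkpoint you are testing):

```python
# Minimal sketch: compare token counts for an English and an Amharic sentence.
# "Qwen/Qwen3.5-0.8B" is a placeholder repo name, not a confirmed identifier.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")

english = "I live in Addis Ababa"
amharic = "እኔ አዲስ አበባ ነው የምኖረው"

for text in (english, amharic):
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{len(ids):>3} tokens  <-  {text}")
```

The Amharic line comes back with noticeably more tokens, even though both sentences say the same thing.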

How we tested it

We used official model files from Hugging Face and a representative slice of real Amharic speech transcriptions. The goal was not to repeat marketing claims, but to measure what the tokenizer actually does.

What we did step by step

We loaded every small Qwen 3.5 model from 0.8B to 9B, checked the vocabulary for Amharic letters, verified clean encode-decode behavior, and then ran 10,000 real Amharic lines through the tokenizer.
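A hedged sketch of that loop, with repo names assumed from the model sizes above:

```python
# Sketch of the verification loop: load each small model's tokenizer,
# report vocabulary size, and check that a sample Amharic line round-trips.
# The repo names below are assumptions based on the sizes named in the post.
from transformers import AutoTokenizer

MODEL_IDS = [
    "Qwen/Qwen3.5-0.8B",
    "Qwen/Qwen3.5-2B",
    "Qwen/Qwen3.5-4B",
    "Qwen/Qwen3.5-9B",
]

sample = "እኔ አዲስ አበባ ነው የምኖረው"

for model_id in MODEL_IDS:
    tok = AutoTokenizer.from_pretrained(model_id)
    ids = tok.encode(sample, add_special_tokens=False)
    status = "round-trip OK" if tok.decode(ids) == sample else "round-trip FAILED"
    print(f"{model_id}: vocab={len(tok):,}, {len(ids)} tokens, {status}")
```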

The real Amharic we used

We used the amh_asr train split of google/WaxalNLP: 10,000 genuine Ethiopian speech transcriptions rather than synthetic examples or isolated sentences.
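If the split is reachable through the datasets library, pulling the same slice might look roughly like this (the transcription column name is an assumption; check the dataset card):

```python
# Sketch of loading the 10,000-line slice used in this post.
# The column name "text" is an assumption.
from datasets import load_dataset

ds = load_dataset("google/WaxalNLP", "amh_asr", split="train")
lines = [row["text"] for row in ds.select(range(10_000))]
print(len(lines), "transcriptions loaded")
```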

Good news first: the vocabulary is genuinely huge

Qwen 3.5 really does expose a 248,320-token vocabulary, and the number is consistent across all four smaller models we checked.

Model          Vocabulary size   Result
Qwen3.5-0.8B   248,320           Confirmed
Qwen3.5-2B     248,320           Confirmed
Qwen3.5-4B     248,320           Confirmed
Qwen3.5-9B     248,320           Confirmed

What this means

The model family has plenty of vocabulary room, and every Amharic character we tested round-tripped cleanly through encode and decode. The coverage claim is real. The deeper issue appears later, in efficiency.
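A sketch of that per-character round-trip check over the basic Ethiopic block (U+1200 to U+137F, 384 codepoints), with the model id again a placeholder:

```python
# Sketch: encode and decode each basic Ethiopic codepoint and flag any
# character that does not survive the round trip unchanged.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")  # placeholder id

failures = []
for cp in range(0x1200, 0x1380):  # 384 codepoints in the basic Ethiopic block
    ch = chr(cp)
    ids = tok.encode(ch, add_special_tokens=False)
    if tok.decode(ids) != ch:
        failures.append(ch)

print("characters failing the round trip:", failures or "none")
```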

The catch: only 25 of 248,070 loaded tokens are pure Ethiopic

Out of 248,070 loaded tokens, only 25 are pure Ethiopic characters. That is the clearest signal that Amharic is present, but not represented in a tokenizer-native way.

Ethiopic share of loaded vocabulary

Total tokens loaded: 248,070
Pure Ethiopic tokens: 25
Share of loaded vocabulary: 0.0101%
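A vocabulary scan along these lines can reproduce the count, assuming "pure Ethiopic" means every character of a token's decoded text falls in the Ethiopic block:

```python
# Sketch: count vocabulary entries whose decoded text is entirely Ethiopic.
# The model id is a placeholder; byte-level tokens are mapped back to text
# with convert_tokens_to_string before the check.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")  # placeholder id

def is_pure_ethiopic(text: str) -> bool:
    return bool(text) and all(0x1200 <= ord(c) <= 0x137F for c in text)

vocab = tok.get_vocab()  # token string -> id
pure = [t for t in vocab if is_pure_ethiopic(tok.convert_tokens_to_string([t]))]

print(f"{len(pure)} of {len(vocab):,} loaded tokens are pure Ethiopic")
```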

Under the hood: why Amharic words get split so much

We tested 384 Ethiopic codepoints directly. Coverage exists, but most characters still expand into multiple tokens instead of one, which is where the token inflation begins.

Ethiopic character split profile (384 codepoints tested)

Single-token characters: 20 of 384 (5.2%)
Two-token characters: 242 of 384 (63.0%)
Three-token characters: 122 of 384 (31.8%)

Once most characters are multi-token, full sentences become expensive very quickly. That is why the sentence-level test matters more than the vocabulary headline.
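The split profile above can be reproduced with a short loop over the same 384 codepoints; the model id is once more a placeholder:

```python
# Sketch: count how many tokens each basic Ethiopic character expands into.
from collections import Counter
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")  # placeholder id

profile = Counter()
for cp in range(0x1200, 0x1380):  # 384 codepoints tested
    profile[len(tok.encode(chr(cp), add_special_tokens=False))] += 1

for n_tokens, count in sorted(profile.items()):
    print(f"{n_tokens}-token characters: {count} of 384 ({count / 384:.1%})")
```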

The real test: 10,000 everyday Amharic sentences

Theory is useful, but real text is the truth. We ran 10,000 everyday Amharic lines and measured the actual cost.

Example sentence from the dataset

እኔ አዲስ አበባ ነው የምኖረው ("I live in Addis Ababa") -> 9 tokens, instead of the 4-5 you would roughly expect for a comparable English sentence.

Total tokens used: 1,805,886
Total words: 267,768
Tokens per word: 6.74
Compared to English: ~5x more
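The corpus-level numbers come from a measurement like the sketch below, assuming the dataset loads as shown earlier and counting whitespace-separated words:

```python
# Sketch: total tokens and whitespace-separated words over the 10,000 lines,
# then tokens per word. Model id and column name are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")
ds = load_dataset("google/WaxalNLP", "amh_asr", split="train").select(range(10_000))

total_tokens = 0
total_words = 0
for row in ds:
    text = row["text"]  # column name assumed
    total_tokens += len(tok.encode(text, add_special_tokens=False))
    total_words += len(text.split())

print(f"{total_tokens:,} tokens / {total_words:,} words = "
      f"{total_tokens / total_words:.2f} tokens per word")
```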

What the token stream actually looks like

Tokens that were actual Amharic characters: 26.3%
Unique Amharic characters with their own token: 7.6%

Relative token cost impact for the same content volume

English baseline: 1x
Amharic on Qwen 3.5: ~5x

In simple terms, the model understands every letter correctly, but it has to work much harder than it should. That turns into higher API bills, shorter context headroom, and more pressure on latency-sensitive production systems.

Conclusion

The model-level claim is valid: Qwen 3.5 has a large, consistent vocabulary and correctly represents Amharic text. The limiting factor is tokenizer efficiency, not coverage.

On 10,000 real Amharic sentences, the tokenizer averages 6.74 tokens per word, roughly 5x the token cost of a comparable English baseline. For production systems, that directly affects cost, latency, and usable context budget.
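For the arithmetic behind the ~5x figure, one hedged back-of-envelope check: divide the measured tokens per Amharic word by a commonly cited rule of thumb for English, which is an assumption rather than something measured in this post.

```python
# Back-of-envelope check of the ~5x claim.
# The 1.3 tokens-per-English-word baseline is an assumption, not a measurement.
amharic_tokens_per_word = 1_805_886 / 267_768   # ~6.74, from the run above
english_tokens_per_word = 1.3                    # assumed English baseline
print(round(amharic_tokens_per_word / english_tokens_per_word, 1))  # ~5.2
```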

Budget and latency

Plan for roughly 5x higher token volume on Amharic-heavy traffic compared with English baselines.

Context strategy

Apply summarization and chunking earlier in the pipeline so the effective context budget lasts longer.

Evaluation KPI

Track tokens per task alongside accuracy and latency when choosing models for production.

Want to check it yourself?

Qwen model config
Qwen tokenizer file
WaxalNLP dataset info