Datasets / Models / Documentation

Public datasets and model releases behind the work.

Speech datasets, large Amharic corpora, Shook ASR checkpoints, and the Gemma CPT and SFT releases together form the clearest public record of the stack.

wxl_amh

Speech / 3k rows / Waxal Amharic audio transcriptions

A focused Amharic speech dataset built around paired audio and transcriptions.

  • 3k rows
  • 988 audio files in repo tree
  • Speech + transcription

Documentation

  • Best public proof point for the voice-data side of the work.
  • Useful for ASR training, transcription workflows, and speech evaluation in Amharic.
  • Connects directly to the speech-model work in the Shook line.

amharic-combined-corpus

Text / 10.1M rows / combined Amharic corpus

A large-scale Amharic corpus for pretraining, adaptation, retrieval, and downstream evaluation work.

  • 10.1M rows
  • 2.29 GB
  • 50 downloads last month

Documentation

  • The Hugging Face viewer reports 10,056,352 rows in the train split.
  • Most relevant as the text-layer proof behind Amharic model adaptation work.
  • Belongs next to the Gemma CPT line because the story is corpus first, model second.
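
The two size figures on the card support a quick back-of-envelope check of average row size; a sketch, assuming the listed 2.29 GB is decimal gigabytes rather than GiB:

```python
# Rough sanity check from the card's own numbers: 2.29 GB across the
# 10,056,352 train rows reported by the Hugging Face viewer.
rows = 10_056_352
size_bytes = 2.29e9  # assumes decimal GB, not GiB

avg_bytes_per_row = size_bytes / rows
print(round(avg_bytes_per_row))  # ≈ 228 bytes of text per row on average
```

A couple of hundred bytes per row is consistent with short-to-medium text snippets rather than full documents, which is worth knowing before planning pretraining sequence packing.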

wikipedia-amharic

Bilingual text / 55.8k rows / English-Amharic Wikipedia

A bilingual English-Amharic knowledge asset built from Wikipedia content.

  • 55.8k rows
  • Apache-2.0
  • English + Amharic

Documentation

  • Translated by Addis AI - Aleph and documented as a parallel English-Amharic corpus.
  • Useful for translation, retrieval, QA, and knowledge-grounded language work.
  • Good supporting proof because it is both technically relevant and easy to explain.

shook-medium-amharic-2k

ASR / 0.8B params / WER 9.1091

A high-signal Amharic ASR checkpoint with published evaluation numbers and training hyperparameters.

  • 420 downloads last month
  • 0.8B params
  • WER 9.1091
  • LR 1e-5 / batch 16 / grad acc 4 / 2 epochs

Documentation

  • Current Hugging Face card includes a full training-results table and public hyperparameters.
  • Best Shook model to lead with because it is measurable, not just described.
  • Strongest public link between the speech data layer and the deployed ASR story.
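
The card's WER of 9.1091 presumably means 9.11% under the standard definition: word-level edit distance divided by the reference word count. A minimal sketch of that metric (a plain Levenshtein implementation, not the project's actual evaluation script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("a b c d", "a x c"))  # 0.5: one substitution plus one deletion over 4 words
```

Production evaluations typically normalize punctuation and casing before scoring, so exact numbers depend on that preprocessing as much as on the model.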

shook-tiny-amharic-600hr

ASR / 37.8M params / WER 22.1786

The smaller, lighter end of the Shook speech stack, and proof that the line is more than a single flagship checkpoint.

  • 37.8M params
  • Fine-tuned from whisper-tiny
  • WER 22.1786

Documentation

  • Training card exposes hyperparameters including LR 3e-5, batch size 96, and 2 epochs.
  • Good supporting proof for compact speech deployment work.
  • Most useful when shown as the smaller companion to shook-medium-amharic-2k.
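
The two cards expose enough hyperparameters for a quick side-by-side of the effective batch sizes. A sketch using the published numbers, assuming the usual meaning of batch size × gradient accumulation steps, and assuming accumulation of 1 for the tiny model where the card does not state it:

```python
# Published Shook training configs; tiny's grad_acc is an assumption.
medium = {"lr": 1e-5, "batch": 16, "grad_acc": 4, "epochs": 2}
tiny = {"lr": 3e-5, "batch": 96, "grad_acc": 1, "epochs": 2}

def effective_batch(cfg: dict) -> int:
    """Examples seen per optimizer step = per-device batch x accumulation steps."""
    return cfg["batch"] * cfg["grad_acc"]

print(effective_batch(medium))  # 64
print(effective_batch(tiny))    # 96
```

The larger model trades a smaller effective batch and lower LR for its parameter count, which is a common pattern when fine-tuning bigger Whisper variants on fixed hardware.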

gemma-2-27b-amharic-alpaca-sft

Amharic SFT track / 27B class model

The supervised fine-tuning layer in the Gemma Amharic line. It is the clearest public signal that the model work extends beyond speech into instruction-following language systems.

  • 27B class
  • SFT track
  • Instruction-following

Documentation

  • Model card states that it is fine-tuned on top of gemma-2-27b-amharic-cpt.
  • Supports conversational interactions in Amharic.
  • Best framed as the top of the stack after corpus work and continued pretraining.

gemma-2-27b-amharic-cpt

Amharic CPT track / 27B class model

The continued pretraining track in the Gemma Amharic line. It is the right model to highlight when you want to show foundation-model adaptation rather than only application-layer fine-tuning.

  • 27B class
  • ~2B tokens of Amharic
  • Expanded tokenizer

Documentation

  • Model card says the tokenizer was expanded from the original 256k vocabulary to better cover Amharic script.
  • Continual pretraining was done on roughly 2B tokens of Amharic corpus data.
  • This is the cleanest public proof that the stack reaches the foundation-model adaptation layer.

amharic_courpus_tg_1

Text / 477k rows / Amharic corpus release

A useful mid-scale corpus release that broadens the public Amharic text layer beyond the flagship combined corpus.

the-stack-amharic

Code + translation / Amharic developer dataset

A more niche release, but a strategically important one because it points toward developer tooling and code-oriented Amharic infrastructure.

shook-900-base

Base model / 72.6M params

A base Shook checkpoint that helps make the model family visible beyond the tuned Amharic releases.