Aseto.ai Speech-to-Text Model Development: Advancing Voice AI for Cypriot and Greek Languages

Aseto.ai has developed an advanced Speech-to-Text (STT) model purpose-built for the Cypriot dialect and the Greek language, achieving a substantial reduction in transcription errors compared to leading open models. This work addresses a long-standing limitation in Voice AI technology for regional languages, enabling practical deployment of Cypriot and Greek voice agents across business and consumer applications. Early benchmarking against Whisper Large V3 shows a relative Word Error Rate (WER) reduction of over 30% on Cypriot speech and over 40% on standard Greek, while English WER stays within two percentage points of the open baselines. These results indicate that regional optimization can deliver tangible accuracy gains without compromising multilingual capability.

1. The Challenge: Voice AI in Low-Resource Languages

Modern voice technologies rely on Speech-to-Text systems trained primarily on global, English-centric data. While models like OpenAI’s Whisper and NVIDIA’s Canary deliver remarkable results in high-resource languages, their performance drops sharply on dialects and lower-resource languages such as Greek, and even more so on the Cypriot dialect.

The Cypriot dialect presents unique challenges: a mix of distinct phonetics, vocabulary, and intonation patterns that diverge from standard Greek. Because of limited availability of Cypriot-specific data, existing STT systems struggle to interpret real-world speech, producing error rates that render them impractical for production workflows.

For Cypriot users and businesses, this has meant repeated misunderstandings, poor transcription quality, and an inability to leverage the productivity benefits of voice-based automation.

Aseto’s goal was to close this gap by creating a model that understands Cypriot speakers naturally, in their own linguistic context.

2. Aseto’s Approach

Aseto’s STT initiative is grounded in three core pillars:

a. Targeted Data Curation

Aseto compiled a proprietary dataset of Cypriot and Greek speech recorded across diverse speakers, contexts, and environments. Every sample was transcribed by native speakers to ensure linguistic accuracy and authenticity, yielding a rare, high-quality dataset tailored for dialect adaptation.

b. Fine-Tuned Multilingual Training

Starting from an open-source foundation, Aseto fine-tuned on curated Cypriot, standard Greek, and English data. The objective was to strengthen dialectal understanding while preserving generalization across languages, avoiding catastrophic forgetting: a common failure in which a model loses performance in its base languages after adaptation.
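The source does not publish its training recipe or data mixture. One common way to limit catastrophic forgetting is rehearsal: keep sampling base-language data alongside the new dialect during fine-tuning. The minimal Python sketch below assumes that approach and draws each training batch from Cypriot, Greek, and English pools according to mixture weights. The function name `mixed_batches`, the placeholder utterance IDs, and the weights are all hypothetical, not Aseto's configuration.

```python
import random
from typing import Dict, Iterator, List, Tuple


def mixed_batches(
    pools: Dict[str, List[str]],
    weights: Dict[str, float],
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[Tuple[str, str]]]:
    """Endlessly yield batches of (language, example_id) pairs.

    Hypothetical rehearsal-style sampler: each slot in a batch picks a
    language with probability proportional to its weight, then picks an
    example from that language's pool (with replacement). Keeping some
    weight on Greek and English means base-language data stays in every
    stage of fine-tuning instead of being replaced by Cypriot data.
    """
    rng = random.Random(seed)  # seeded so batches are reproducible
    langs = list(pools)
    lang_weights = [weights[lang] for lang in langs]
    while True:
        chosen = rng.choices(langs, weights=lang_weights, k=batch_size)
        yield [(lang, rng.choice(pools[lang])) for lang in chosen]


if __name__ == "__main__":
    # Placeholder utterance IDs and illustrative weights (assumptions).
    pools = {
        "cypriot": ["cy_0001", "cy_0002", "cy_0003"],
        "greek": ["el_0001", "el_0002"],
        "english": ["en_0001", "en_0002"],
    }
    weights = {"cypriot": 0.5, "greek": 0.3, "english": 0.2}
    batch = next(mixed_batches(pools, weights, batch_size=8))
    print(batch)
```

Raising the Greek and English weights would trade some speed of dialect adaptation for stronger retention of the base languages; in practice the balance would be tuned against held-out WER in each language.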

c. Rigorous Normalized Evaluation

To ensure objective and reproducible benchmarking, all transcriptions were normalized to lowercase and stripped of punctuation before Word Error Rate (WER) computation. This isolates the speech recognition accuracy itself, independent of stylistic formatting or punctuation recovery.
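The normalize-then-score procedure above can be sketched in Python as follows. The sketch lowercases, replaces every Unicode punctuation character with a space (which also catches the Greek question mark and ano teleia), splits on whitespace, and counts word-level substitutions, deletions, and insertions with a standard edit-distance table. Two details are assumptions because the source does not specify them: accents (tonos) are left untouched, and errors are pooled over the whole test set (total edits divided by total reference words) rather than averaged per utterance. The function names are illustrative.

```python
import unicodedata
from typing import Iterable, List, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    lowered = text.lower()
    # Unicode punctuation categories all start with "P" (Pc, Pd, Ps, Pe,
    # Pi, Pf, Po). Accented letters are not punctuation, so they are kept.
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in lowered
    )
    return no_punct.split()


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum substitutions + deletions + insertions turning ref into hyp."""
    # prev[j] holds the distance between the reference words processed so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of reference word r
                curr[j - 1] + 1,         # insertion of hypothesis word h
                prev[j - 1] + (r != h),  # substitution (free if words match)
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Pooled WER over (reference, hypothesis) pairs after normalization."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref, hyp = normalize(reference), normalize(hypothesis)
        errors += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("references contain no words")
    return errors / ref_words


if __name__ == "__main__":
    # Punctuation and capitalization differences are not counted as errors.
    print(corpus_wer([("Τι κάνεις;", "τι κάνεις")]))  # 0.0
```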

3. Benchmarking Results

Aseto evaluated its model against Whisper Large V3 and Canary-1b-v2 across three datasets:

  • Internal Cypriot Dataset (collected and transcribed by Aseto)
  • Common Voice Greek (Validated Split)
  • Common Voice English (Validated Split)

The comparison focuses on Word Error Rate (WER), where lower values indicate better performance.

Table 1: Comparative Word Error Rate (WER) Across Models

Dataset           Aseto.ai (Prototype)   Whisper Large V3   Canary-1b-v2   Difference vs. Whisper
Cypriot Dialect   24%                    38%                55%            35% lower WER (relative)
Standard Greek    9%                     16%                25%            45% lower WER (relative)
English           7%                     6%                 5%             Minor difference (+1 point, absolute)

(Lower WER is better)
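The relative figures in Table 1 follow from relative WER reduction = (baseline WER - Aseto WER) / baseline WER. The short Python sketch below recomputes it from the rounded percentages printed in the table. With those rounded inputs, the Cypriot and Greek reductions come out at roughly 37% and 44%; the table's 35% and 45% are within the range that rounding each displayed WER by up to half a point allows. For English the ratio is negative (about 17% more errors than Whisper in relative terms) even though the absolute gap is one point, so the sketch also reports the absolute difference. The names `TABLE_1`, `relative_reduction`, and `absolute_gap` are illustrative.

```python
from typing import Dict

# Word error rates (%) exactly as printed in Table 1 (rounded values).
TABLE_1: Dict[str, Dict[str, float]] = {
    "Cypriot Dialect": {"aseto": 24, "whisper": 38, "canary": 55},
    "Standard Greek": {"aseto": 9, "whisper": 16, "canary": 25},
    "English": {"aseto": 7, "whisper": 6, "canary": 5},
}


def relative_reduction(new_wer: float, baseline_wer: float) -> float:
    """Percentage of the baseline's errors removed (negative = more errors)."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def absolute_gap(new_wer: float, baseline_wer: float) -> float:
    """Difference in WER percentage points (positive = new model is worse)."""
    return new_wer - baseline_wer


if __name__ == "__main__":
    for dataset, row in TABLE_1.items():
        rel = relative_reduction(row["aseto"], row["whisper"])
        gap = absolute_gap(row["aseto"], row["whisper"])
        print(f"{dataset}: {rel:+.1f}% relative, {gap:+.0f} points absolute vs. Whisper")
```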

Analysis:

Aseto’s model outperforms both open baselines on Cypriot and standard Greek by a wide margin (24% vs. 38% WER against Whisper Large V3 on Cypriot, 9% vs. 16% on Greek). English WER (7%) remains within two percentage points of both baselines, indicating that multilingual fine-tuning improved Greek-family comprehension without substantially degrading English capability.

These results make Aseto’s model the most accurate of the tested systems on both Cypriot and standard Greek speech.

4. Real-World Impact

Beyond benchmarks, practical deployment tests revealed the significance of these improvements.

Where existing models frequently misinterpret numbers, addresses, and personal names (crucial elements in business workflows), Aseto’s model produced stable, production-grade transcriptions. This reliability has already enabled:

  • Voice-based Customer Support - accurate transcription of phone calls in mixed Greek–Cypriot speech.
  • Automated Form Filling - reliable capture of spoken data such as IDs, dates, and addresses.
  • Multilingual Operations - seamless performance when users switch between Greek, Cypriot, and English in conversation.

These capabilities unlock use cases that were previously impractical with generic open-source models, bringing the benefits of Voice AI within reach of Cypriot enterprises.

5. Ongoing Development

The current prototype represents an early milestone in Aseto’s Voice AI roadmap. Active areas of research include:

  • Expansion of Cypriot and Greek datasets through semi-supervised learning.
  • Enhancement of model robustness to noise, accents, and telephony-grade audio.
  • Improved punctuation restoration and speaker diarization for transcription use cases.
  • Integration into Aseto’s Voice Agent platform for end-to-end AI communication.

Aseto plans to continue benchmarking and to publish future updates as model performance and dataset scale advance.

6. Conclusion

This research demonstrates that with targeted adaptation and domain expertise, it is possible to achieve substantial gains in low-resource language performance while maintaining multilingual stability.

Aseto.ai’s Speech-to-Text model marks a significant step toward inclusive, regionally fluent Voice AI: technology that understands Cypriot and Greek speakers naturally, in their own linguistic and cultural context.

By solving the core transcription challenge, Aseto is laying the foundation for a new generation of intelligent voice systems that serve, rather than exclude, the languages of the Eastern Mediterranean.

References

  1. OpenAI Whisper Large V3 - https://huggingface.co/openai/whisper-large-v3
  2. NVIDIA Canary-1b-v2 - https://huggingface.co/nvidia/canary-1b-v2
  3. Mozilla Common Voice Datasets - https://commonvoice.mozilla.org/
  4. Understanding Word Error Rate (WER) - https://www.rev.com/