Kitten TTS Chrome Extension#

A locally executed neural text-to-speech system: a Chrome extension interface backed by a local FastAPI server that runs Kitten TTS voice synthesis entirely on-device.

Overview#

This project implements a client-side text-to-speech solution built on Kitten TTS, an open-source neural TTS engine, deployed as a local FastAPI microservice. Because all inference runs on-device, there are no cloud API dependencies, no network round-trip latency, and no privacy trade-offs: the text never leaves your machine.

Architecture#

The system follows a client-server pattern, with the browser extension talking to a local inference server:

┌─────────────────┐     HTTP/WebSocket     ┌──────────────────┐
│  Chrome Client  │ ←─────────────────────→ │  FastAPI Server  │
│  (Extension)    │                         │  (KittenTTS)     │
└─────────────────┘                         └──────────────────┘
      content.js                                   main.py
      popup.js                                   Neural Inference
      background.js                               Voice Synthesis

Components#

Frontend Layer (Chrome Extension)

  • background.js — Service worker for command routing and lifecycle management
  • content.js — DOM traversal, text extraction, and audio playback orchestration
  • popup.{html,css,js} — User interface for voice selection and playback controls

Backend Layer (FastAPI + KittenTTS)

  • main.py — Async REST API server handling TTS requests (a minimal endpoint sketch follows this list)
  • Local neural model inference with eight selectable voices
  • Real-time audio streaming with configurable playback rates (0.5x–2.0x)
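
A minimal sketch of what the TTS endpoint could look like, assuming a /tts route, a JSON body with text and voice fields, and the kittentts package's KittenTTS(...).generate(...) interface; the route name, payload shape, and sample rate below are assumptions rather than the project's actual code:

# main.py (sketch only) - a minimal local TTS endpoint, not the project's actual code.
# Assumes: pip install fastapi uvicorn soundfile kittentts, and that the kittentts
# package exposes KittenTTS(model_id).generate(text, voice=...) returning a NumPy array.
import io

import soundfile as sf
from fastapi import FastAPI, Response
from pydantic import BaseModel

from kittentts import KittenTTS  # assumed package/class name

app = FastAPI()
model = KittenTTS("KittenML/kitten-tts-nano-0.1")  # assumed model id

class TTSRequest(BaseModel):
    text: str
    voice: str = "expr-voice-2-f"  # any of the eight expr-voice-{2-5}-{f,m} IDs

@app.post("/tts")
def synthesize(req: TTSRequest) -> Response:
    # Tokenization and the neural forward pass happen inside generate().
    audio = model.generate(req.text, voice=req.voice)
    # Serialize to an in-memory WAV buffer that the extension can play directly.
    buf = io.BytesIO()
    sf.write(buf, audio, samplerate=24000, format="WAV")  # 24 kHz assumed
    return Response(content=buf.getvalue(), media_type="audio/wav")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8765)

Returning a complete WAV buffer keeps the extension's playback path simple; for very long pages, chunked delivery (e.g. FastAPI's StreamingResponse) would be the natural extension.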

Technical Implementation#

Neural Voice Synthesis#

KittenTTS provides expressive neural voices trained on diverse datasets:

Voice ID             Type     Characteristics
expr-voice-{2-5}-f   Female   Multi-speaker expressive synthesis
expr-voice-{2-5}-m   Male     Multi-speaker expressive synthesis
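
For reference, a standalone sketch that exercises each of these voice IDs directly, assuming the kittentts package interface and a 24 kHz output rate (both assumptions):

# Sketch: synthesize one sample per voice ID (kittentts package API assumed).
import soundfile as sf

from kittentts import KittenTTS  # assumed package/class name

model = KittenTTS("KittenML/kitten-tts-nano-0.1")  # assumed model id

# The eight IDs from the table above: expr-voice-{2-5}-{f,m}.
voices = [f"expr-voice-{n}-{g}" for n in range(2, 6) for g in ("f", "m")]

for voice in voices:
    audio = model.generate("Kitten TTS speaking locally.", voice=voice)
    sf.write(f"sample-{voice}.wav", audio, samplerate=24000)  # 24 kHz assumed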

Command Protocol#

Command          Shortcut                     Action
Read Selection   Ctrl+Shift+K / Cmd+Shift+K   TTS synthesis on highlighted text
Read Page        Ctrl+Shift+P / Cmd+Shift+P   Full document content extraction
Stop Playback    Ctrl+Shift+S / Cmd+Shift+S   Abort current audio stream

Local Inference Pipeline#

Text Input → Tokenization → Neural Forward Pass → Audio Buffer → Browser Playback

Key advantages of local deployment:

  • Zero API costs and rate limits
  • Sub-50ms latency on modern hardware (see the timing sketch after this list)
  • Complete data sovereignty (no text leaves your machine)
  • Offline operation after initial model download
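
The latency figure is hardware-dependent; the sketch below is one way to measure raw synthesis time on your own machine, reusing the assumed kittentts interface from earlier. It excludes HTTP and browser playback overhead.

# Sketch: rough synthesis latency per request (kittentts package API assumed).
import time

from kittentts import KittenTTS  # assumed package/class name

model = KittenTTS("KittenML/kitten-tts-nano-0.1")  # assumed model id
text = "The quick brown fox jumps over the lazy dog."

model.generate(text, voice="expr-voice-2-f")  # warm-up: first call may load weights

start = time.perf_counter()
for _ in range(10):
    model.generate(text, voice="expr-voice-2-f")
mean_ms = (time.perf_counter() - start) * 1000 / 10
print(f"mean synthesis time: {mean_ms:.1f} ms per sentence")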

Setup#

Backend Deployment#

cd backend
bash setup.sh              # uv venv + dependency resolution
source .venv/bin/activate
python main.py             # Spawns server at http://127.0.0.1:8765
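
With the server running, it can be smoke-tested from Python before loading the extension. The /tts route and JSON fields below follow the endpoint sketch earlier in this README and may differ from the actual server:

# Sketch: smoke-test the local server (route and payload follow the /tts sketch above).
import requests

resp = requests.post(
    "http://127.0.0.1:8765/tts",
    json={"text": "Hello from Kitten TTS.", "voice": "expr-voice-2-f"},
    timeout=60,  # the first request may also trigger the model download
)
resp.raise_for_status()

with open("smoke-test.wav", "wb") as f:
    f.write(resp.content)  # open this file to confirm audio output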

Client Installation#

chrome://extensions/ → Developer Mode → Load Unpacked → chrome-extension/

Prerequisite: Generate icon assets (icon16.png, icon48.png, icon128.png) via ImageMagick or online SVG→PNG converters.
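
If ImageMagick is not installed, solid-color placeholder icons can be generated with Pillow instead (a hypothetical shortcut; swap in real artwork before publishing):

# Sketch: solid-color placeholder icons via Pillow (hypothetical ImageMagick alternative).
from PIL import Image

for size in (16, 48, 128):
    icon = Image.new("RGBA", (size, size), (110, 80, 200, 255))
    icon.save(f"chrome-extension/icon{size}.png")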

Use Cases#

  • Accessibility — Screen reader alternative with neural voice quality
  • Content consumption — Listen to articles, documentation, or research papers
  • Multitasking — Absorb written content while performing other tasks
  • Language learning — Hear proper pronunciation with adjustable playback speed

Technical Constraints#

  • Backend server must remain active during extension usage
  • Initial model download occurs on the first inference request (on the order of hundreds of MB)
  • Chrome/Chromium-based browsers required (MV3 service worker architecture)

Leveraging open-source neural TTS for privacy-preserving local synthesis.