Kitten TTS Chrome Extension#

A locally executed neural text-to-speech system: a Chrome extension interface backed by a local FastAPI server that runs Kitten TTS voice synthesis entirely on-device.

Overview#

This project implements a client-side text-to-speech solution built on Kitten TTS, an open-source neural TTS engine, deployed as a local FastAPI microservice. Because all inference runs on-device, there are no cloud API dependencies, no network round-trip latency, and no privacy trade-offs: the text never leaves your machine.

Architecture#

The system follows a client-server pattern, with the browser extension talking to a local inference server:

┌─────────────────┐     HTTP/WebSocket     ┌──────────────────┐
│  Chrome Client  │ ←─────────────────────→ │  FastAPI Server  │
│  (Extension)    │                         │  (KittenTTS)     │
└─────────────────┘                         └──────────────────┘
      content.js                                   main.py
      popup.js                                   Neural Inference
      background.js                               Voice Synthesis

Components#

Frontend Layer (Chrome Extension)

  • background.js — Service worker for command routing and lifecycle management
  • content.js — DOM traversal, text extraction, and audio playback orchestration
  • popup.{html,css,js} — User interface for voice selection and playback controls

Backend Layer (FastAPI + KittenTTS)

  • main.py — Async REST API server handling TTS requests (a minimal endpoint sketch follows this list)
  • Local neural model inference with eight selectable voices
  • Real-time audio streaming with configurable playback rates (0.5x–2.0x)
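
A minimal sketch of what the TTS endpoint could look like, assuming a /tts route, a JSON body with text and voice fields, and the kittentts package's KittenTTS(...).generate(...) interface; the route name, payload shape, and sample rate below are assumptions rather than the project's actual code:

# main.py (sketch only) - a minimal local TTS endpoint, not the project's actual code.
# Assumes: pip install fastapi uvicorn soundfile kittentts, and that the kittentts
# package exposes KittenTTS(model_id).generate(text, voice=...) returning a NumPy array.
import io

import soundfile as sf
from fastapi import FastAPI, Response
from pydantic import BaseModel

from kittentts import KittenTTS  # assumed package/class name

app = FastAPI()
model = KittenTTS("KittenML/kitten-tts-nano-0.1")  # assumed model id

class TTSRequest(BaseModel):
    text: str
    voice: str = "expr-voice-2-f"  # any of the eight expr-voice-{2-5}-{f,m} IDs

@app.post("/tts")
def synthesize(req: TTSRequest) -> Response:
    # Tokenization and the neural forward pass happen inside generate().
    audio = model.generate(req.text, voice=req.voice)
    # Serialize to an in-memory WAV buffer that the extension can play directly.
    buf = io.BytesIO()
    sf.write(buf, audio, samplerate=24000, format="WAV")  # 24 kHz assumed
    return Response(content=buf.getvalue(), media_type="audio/wav")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8765)

Returning a complete WAV buffer keeps the extension's playback path simple; for very long pages, chunked delivery (e.g. FastAPI's StreamingResponse) would be the natural extension.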

Technical Implementation#

Neural Voice Synthesis#

KittenTTS provides expressive neural voices trained on diverse datasets:

Voice ID             Type     Characteristics
expr-voice-{2-5}-f   Female   Multi-speaker expressive synthesis
expr-voice-{2-5}-m   Male     Multi-speaker expressive synthesis
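
For reference, a standalone sketch that exercises each of these voice IDs directly, assuming the kittentts package interface and a 24 kHz output rate (both assumptions):

# Sketch: synthesize one sample per voice ID (kittentts package API assumed).
import soundfile as sf

from kittentts import KittenTTS  # assumed package/class name

model = KittenTTS("KittenML/kitten-tts-nano-0.1")  # assumed model id

# The eight IDs from the table above: expr-voice-{2-5}-{f,m}.
voices = [f"expr-voice-{n}-{g}" for n in range(2, 6) for g in ("f", "m")]

for voice in voices:
    audio = model.generate("Kitten TTS speaking locally.", voice=voice)
    sf.write(f"sample-{voice}.wav", audio, samplerate=24000)  # 24 kHz assumed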

Command Protocol#

Command          Shortcut                     Action
Read Selection   Ctrl+Shift+K / Cmd+Shift+K   TTS synthesis on highlighted text
Read Page        Ctrl+Shift+P / Cmd+Shift+P   Full document content extraction
Stop Playback    Ctrl+Shift+S / Cmd+Shift+S   Abort current audio stream

Local Inference Pipeline#

Text Input → Tokenization → Neural Forward Pass → Audio Buffer → Browser Playback

Key advantages of local deployment:

  • Zero API costs and rate limits
  • Sub-50ms latency on modern hardware (see the timing sketch after this list)
  • Complete data sovereignty (no text leaves your machine)
  • Offline operation after initial model download
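
The latency figure is hardware-dependent; the sketch below is one way to measure raw synthesis time on your own machine, reusing the assumed kittentts interface from earlier. It excludes HTTP and browser playback overhead.

# Sketch: rough synthesis latency per request (kittentts package API assumed).
import time

from kittentts import KittenTTS  # assumed package/class name

model = KittenTTS("KittenML/kitten-tts-nano-0.1")  # assumed model id
text = "The quick brown fox jumps over the lazy dog."

model.generate(text, voice="expr-voice-2-f")  # warm-up: first call may load weights

start = time.perf_counter()
for _ in range(10):
    model.generate(text, voice="expr-voice-2-f")
mean_ms = (time.perf_counter() - start) * 1000 / 10
print(f"mean synthesis time: {mean_ms:.1f} ms per sentence")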

Setup#

Backend Deployment#

cd backend
bash setup.sh              # uv venv + dependency resolution
source .venv/bin/activate
python main.py             # Spawns server at http://127.0.0.1:8765
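
With the server running, it can be smoke-tested from Python before loading the extension. The /tts route and JSON fields below follow the endpoint sketch earlier in this README and may differ from the actual server:

# Sketch: smoke-test the local server (route and payload follow the /tts sketch above).
import requests

resp = requests.post(
    "http://127.0.0.1:8765/tts",
    json={"text": "Hello from Kitten TTS.", "voice": "expr-voice-2-f"},
    timeout=60,  # the first request may also trigger the model download
)
resp.raise_for_status()

with open("smoke-test.wav", "wb") as f:
    f.write(resp.content)  # open this file to confirm audio output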

Client Installation#

chrome://extensions/ → Developer Mode → Load Unpacked → chrome-extension/

Prerequisite: Generate icon assets (icon16.png, icon48.png, icon128.png) via ImageMagick or online SVG→PNG converters.
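
If ImageMagick is not installed, solid-color placeholder icons can be generated with Pillow instead (a hypothetical shortcut; swap in real artwork before publishing):

# Sketch: solid-color placeholder icons via Pillow (hypothetical ImageMagick alternative).
from PIL import Image

for size in (16, 48, 128):
    icon = Image.new("RGBA", (size, size), (110, 80, 200, 255))
    icon.save(f"chrome-extension/icon{size}.png")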

Use Cases#

  • Accessibility — Screen reader alternative with neural voice quality
  • Content consumption — Listen to articles, documentation, or research papers
  • Multitasking — Absorb written content while performing other tasks
  • Language learning — Hear proper pronunciation with adjustable playback speed

Technical Constraints#

  • Backend server must remain active during extension usage
  • Initial model download occurs on the first inference request (on the order of hundreds of MB)
  • Chrome/Chromium-based browsers required (MV3 service worker architecture)

Leveraging open-source neural TTS for privacy-preserving local synthesis.