Kitten TTS Chrome Extension
A locally-executed neural text-to-speech system leveraging transformer-based voice synthesis through a Chrome extension interface with FastAPI backend orchestration.
Overview
This project implements a client-side text-to-speech solution utilizing Kitten TTS—an open-source neural TTS engine—deployed as a local FastAPI microservice. The architecture eliminates API dependencies, cloud latency, and privacy concerns by performing all inference operations on-device.
Architecture
The system follows a client-server pattern in which the Chrome extension talks to an inference server running on the same machine:
    ┌─────────────────┐     HTTP/WebSocket      ┌──────────────────┐
    │  Chrome Client  │ ←─────────────────────→ │  FastAPI Server  │
    │   (Extension)   │                         │   (KittenTTS)    │
    └─────────────────┘                         └──────────────────┘
        content.js                                main.py
        popup.js                                  Neural Inference
        background.js                             Voice Synthesis
Components
Frontend Layer (Chrome Extension)
- background.js: Service worker for command routing and lifecycle management
- content.js: DOM traversal, text extraction, and audio playback orchestration
- popup.{html,css,js}: User interface for voice selection and playback controls
Backend Layer (FastAPI + KittenTTS)
- main.py: Async REST API server handling TTS requests
- Local neural model inference with 8 selectable voices
- Real-time audio streaming with configurable playback rates (0.5x–2.0x)
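A minimal sketch of how a backend handler might normalize an incoming request before synthesis, enforcing the 0.5x–2.0x playback range listed above. The `SpeakRequest` class, its field names, and the choice to clamp rather than reject out-of-range rates are all assumptions made for illustration; they are not taken from main.py.

```python
from dataclasses import dataclass

MIN_RATE, MAX_RATE = 0.5, 2.0  # playback range documented above


@dataclass(frozen=True)
class SpeakRequest:
    """Hypothetical request payload; the field names are assumptions."""
    text: str
    voice: str = "expr-voice-2-f"
    rate: float = 1.0


def normalize(req: SpeakRequest) -> SpeakRequest:
    """Trim the text, reject empty input, and clamp the rate into range."""
    text = req.text.strip()
    if not text:
        raise ValueError("text must not be empty")
    rate = max(MIN_RATE, min(MAX_RATE, float(req.rate)))
    return SpeakRequest(text=text, voice=req.voice, rate=rate)


req = normalize(SpeakRequest(text="  Hello from the browser.  ", rate=3.0))
print(req.text, req.rate)
```

Clamping keeps a slightly out-of-range slider value from failing the whole request; a stricter server could raise an error instead.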
Technical Implementation
Neural Voice Synthesis
KittenTTS provides expressive neural voices trained on diverse datasets:
| Voice ID | Type | Characteristics |
|---|---|---|
| expr-voice-{2-5}-f | Female | Multi-speaker expressive synthesis |
| expr-voice-{2-5}-m | Male | Multi-speaker expressive synthesis |
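The `{2-5}` shorthand in the table expands to eight concrete voice IDs. The sketch below spells that expansion out; the helper names `voice_ids` and `is_valid_voice` are hypothetical and only illustrate how a client or server might validate a requested voice.

```python
def voice_ids() -> list[str]:
    """Expand expr-voice-{2-5}-{m,f} into the eight concrete voice IDs."""
    return [f"expr-voice-{n}-{g}" for n in range(2, 6) for g in ("m", "f")]


def is_valid_voice(voice: str) -> bool:
    """Return True if voice is one of the eight IDs from the table."""
    return voice in voice_ids()


for vid in voice_ids():
    print(vid)
```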
Command Protocol
| Command | Shortcut | Action |
|---|---|---|
| Read Selection | Ctrl+Shift+K / Cmd+Shift+K | TTS synthesis on highlighted text |
| Read Page | Ctrl+Shift+P / Cmd+Shift+P | Full document content extraction |
| Stop Playback | Ctrl+Shift+S / Cmd+Shift+S | Abort current audio stream |
Local Inference Pipeline
Text Input → Tokenization → Neural Forward Pass → Audio Buffer → Browser Playback
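To make the "Audio Buffer → Browser Playback" step concrete, here is a sketch that wraps raw float samples in a WAV container the browser can play. It assumes the model yields mono float samples in [-1.0, 1.0]; the 24 kHz sample rate and the `to_wav_bytes` helper are assumptions for illustration, and a sine tone stands in for real model output.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # assumption: adjust to the model's actual output rate


def to_wav_bytes(samples, sample_rate=SAMPLE_RATE):
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip so the int16 conversion cannot overflow
        pcm += struct.pack("<h", int(s * 32767))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buf.getvalue()


# Stand-in for model output: 0.1 s of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE // 10)]
wav_bytes = to_wav_bytes(tone)
print(len(wav_bytes), "bytes of WAV data")
```

Bytes in this form can be returned from an HTTP endpoint with an audio/wav content type and played by the extension through a standard audio element.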
Key advantages of local deployment:
- Zero API costs and rate limits
- Sub-50ms latency on modern hardware
- Complete data sovereignty (no text leaves your machine)
- Offline operation after initial model download
Setup
Backend Deployment
cd backend
bash setup.sh # uv venv + dependency resolution
source .venv/bin/activate
python main.py # Spawns server at http://127.0.0.1:8765
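Because the extension depends on the backend being up, a quick reachability check can save debugging time. The sketch below only tests whether something accepts TCP connections on the address shown above; `server_is_up` is a hypothetical helper, and it does not verify that the listener is actually the TTS server.

```python
import socket


def server_is_up(host="127.0.0.1", port=8765, timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


print("backend reachable:", server_is_up())
```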
Client Installation
chrome://extensions/ → Developer Mode → Load Unpacked → chrome-extension/
Prerequisite: Generate icon assets (icon16.png, icon48.png, icon128.png) via ImageMagick or online SVG→PNG converters.
Use Cases
- Accessibility — Screen reader alternative with neural voice quality
- Content consumption — Listen to articles, documentation, or research papers
- Multitasking — Absorb written content while performing other tasks
- Language learning — Hear proper pronunciation with adjustable playback speed
Technical Constraints
- Backend server must remain active during extension usage
- Initial model download occurs on first inference request (~hundreds of MB)
- Chrome/Chromium-based browsers required (MV3 service worker architecture)
Leveraging open-source neural TTS for privacy-preserving local synthesis.