Technical Details
This page covers the core technical implementation details of the project.
Backend (Python + FastAPI)
The backend is a FastAPI application that exposes a simple API for the frontend.
Modular Service Architecture
The backend's most significant architectural feature is its modular, pluggable service layer. The core logic is separated into distinct services, each responsible for one part of the voice assistant pipeline:
- SpeechToTextService: Converts audio to text.
- TranslationService: Translates text between languages.
- LlmService: Interacts with a Large Language Model.
- TextToSpeechService: Converts text to audio.
Each service is defined by an interface (a Python Protocol) and has one or more implementations. For example, the TranslationService could have implementations for Google Translate, a local Hugging Face model, or a mock service for testing.
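As a minimal sketch of how such an interface might look, the snippet below defines a Protocol for the TranslationService together with a mock implementation. The translate method name, its parameters, and the MockTranslationService and translate_greeting names are assumptions made for illustration, not taken from the project's code.

```python
from typing import Protocol


class TranslationService(Protocol):
    """Interface a translation backend is expected to satisfy (hypothetical signature)."""

    def translate(self, text: str, source_lang: str, target_lang: str) -> str:
        ...


class MockTranslationService:
    """Test double: tags the text instead of calling a real translator."""

    def translate(self, text: str, source_lang: str, target_lang: str) -> str:
        return f"[{source_lang}->{target_lang}] {text}"


def translate_greeting(service: TranslationService) -> str:
    # The caller depends only on the Protocol, never on a concrete class.
    return service.translate("hello", "en", "fr")


print(translate_greeting(MockTranslationService()))
```

Because Protocols use structural typing, the mock does not need to inherit from TranslationService; having a matching translate method is enough for a type checker to accept it.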
Configuration-based Service Loading
The application uses pydantic-settings to load configuration from environment variables. This allows the administrator to choose which service implementation to use at runtime. For example, setting the TRANSLATION_PROVIDER environment variable to google will load the Google Translate implementation of the TranslationService.
This lets each environment run a different set of implementations without code changes:
- Local Development: Use mock or local, CPU-based models.
- Production: Use powerful cloud-based services like Cohere, Google AI, or private APIs.
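The provider selection described above can be sketched roughly as follows. To stay dependency-free, this sketch reads the environment with a plain dataclass instead of pydantic-settings, so it illustrates the idea rather than the project's actual settings class. The Settings, PROVIDERS, and load_translation_service names, the placeholder provider classes, and the "mock" default are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, Mapping


@dataclass(frozen=True)
class Settings:
    """Stand-in for a pydantic-settings class (assumption: the default is "mock")."""

    translation_provider: str = "mock"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        return cls(translation_provider=env.get("TRANSLATION_PROVIDER", "mock").lower())


class MockTranslation:
    """Hypothetical placeholder for a mock implementation."""


class GoogleTranslation:
    """Hypothetical placeholder for a Google Translate implementation."""


# Registry mapping a provider name to a factory for its implementation.
PROVIDERS: Dict[str, Callable[[], object]] = {
    "mock": MockTranslation,
    "google": GoogleTranslation,
}


def load_translation_service(settings: Settings) -> object:
    try:
        factory = PROVIDERS[settings.translation_provider]
    except KeyError:
        raise ValueError(
            f"Unknown TRANSLATION_PROVIDER: {settings.translation_provider!r}"
        ) from None
    return factory()


service = load_translation_service(Settings.from_env({"TRANSLATION_PROVIDER": "google"}))
print(type(service).__name__)
```

Keeping the name-to-implementation mapping in one registry means a new provider only needs a new entry, and an unknown value fails at startup rather than on the first request.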
API Endpoints
- POST /voice: The main endpoint. It accepts an audio file (wav or mp3), runs the full pipeline, and returns a JSON response with a URL to the generated audio file.
- GET /storage/{file_name}: A static file endpoint that serves the generated audio files.
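To make the flow behind POST /voice more concrete, here is a framework-free sketch of a pipeline function with stubbed services. The step order (speech-to-text, translation, LLM, text-to-speech), the audio_url field name, and every function name below are assumptions for illustration; the real handler and response schema may differ.

```python
from typing import Callable, Dict


def run_voice_pipeline(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    translate: Callable[[str], str],
    ask_llm: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
    save_audio: Callable[[bytes], str],
) -> Dict[str, str]:
    """Chain the four services and return a JSON-like dict (assumed shape)."""
    transcript = speech_to_text(audio)
    prompt = translate(transcript)
    reply = ask_llm(prompt)
    speech = text_to_speech(reply)
    file_name = save_audio(speech)
    # The URL points at the static file endpoint, GET /storage/{file_name}.
    return {"audio_url": f"/storage/{file_name}"}


# In-memory stand-in for the storage directory.
stored: Dict[str, bytes] = {}


def save(data: bytes) -> str:
    name = f"reply-{len(stored)}.wav"
    stored[name] = data
    return name


response = run_voice_pipeline(
    b"fake-audio",
    speech_to_text=lambda audio: "hola",
    translate=lambda text: "hello",
    ask_llm=lambda prompt: prompt.upper(),
    text_to_speech=lambda text: text.encode("utf-8"),
    save_audio=save,
)
print(response)
```

Passing each stage in as a callable mirrors the pluggable service layer: the pipeline itself does not know which provider sits behind a given step.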
Frontend (React + TypeScript)
The frontend is a single-page application built with React and Vite.
API Client Generation
The project uses openapi-typescript to generate a type-safe API client from the backend's OpenAPI specification. The client itself is openapi-fetch.
Because the types are generated from the backend's OpenAPI specification, regenerating them after a backend API change updates the frontend's request and response types, so mismatches between the two surface as type errors at compile time.
State Management & UI
- Component Structure: The core logic is encapsulated in the VoiceAssistant component (src/features/voice/voice-assistant.tsx).
- Voice Recording: A custom hook, useVoiceRecording, abstracts the browser's MediaRecorder API to handle recording, stopping, and accessing the recorded audio data.
- Service Layer: A VoiceService class (src/features/voice/service.ts) acts as an intermediary between the UI components and the API client, handling the request to the backend.