Backend

The backend serves as the central processing unit for the Igbo Voice Assistant. It orchestrates the entire workflow, from receiving audio input from the frontend to delivering the final audio response. Its primary responsibilities include:

  • Receiving recorded audio data from the frontend.
  • Converting audio to text using a Speech-to-Text (STT) model.
  • Translating the transcribed text to English for LLM processing.
  • Interacting with a Large Language Model (LLM) to process user queries.
  • Translating the LLM's English response back to Igbo.
  • Converting the Igbo text response into an audio file using a Text-to-Speech (TTS) model.
  • Storing and providing access to the generated audio files.
  • Sending a link to the audio file back to the frontend.

Overall Architecture

The backend is designed as a modular system, organizing its AI capabilities in a microservices-like fashion. Dependency injection is used to load and manage the different services (STT, Translation, LLM, TTS). The concrete implementation of each service can be swapped via environment variables, allowing deployments with different models (e.g., local vs. cloud-based).
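Provider selection via environment variables can be sketched as a small registry keyed by an environment variable. Note that the class names, the variable name `STT_PROVIDER`, and the default value here are illustrative assumptions, not the project's actual identifiers:

```python
import os

# Hypothetical STT implementations; real ones would wrap a local model
# or a hosted API. Names are placeholders for illustration only.
class LocalWhisperSTT:
    def transcribe(self, audio: bytes) -> str:
        raise NotImplementedError  # would run a local model

class CloudSTT:
    def transcribe(self, audio: bytes) -> str:
        raise NotImplementedError  # would call a hosted API

# Registry mapping a provider name to its implementation class.
STT_PROVIDERS = {"local": LocalWhisperSTT, "cloud": CloudSTT}

def load_stt():
    """Pick the STT implementation named by the STT_PROVIDER env var."""
    name = os.getenv("STT_PROVIDER", "local")
    return STT_PROVIDERS[name]()
```

The same pattern would apply to the translation, LLM, and TTS services, so each stage of the pipeline can be reconfigured without code changes.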

A ProcessVoiceAssistantUseCase class encapsulates the core logic, orchestrating the flow of data through the various services in the pipeline.
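The overall shape of that use case might look like the sketch below: services are injected through the constructor, so any implementation exposing the same methods can be swapped in. The method names (`transcribe`, `to_english`, etc.) are assumptions for illustration, not the project's actual interface:

```python
# Hypothetical sketch of the orchestrating use case; the real class's
# interface may differ.
class ProcessVoiceAssistantUseCase:
    def __init__(self, stt, translator, llm, tts, storage):
        # Each dependency is injected, never constructed here,
        # so implementations stay swappable.
        self.stt = stt
        self.translator = translator
        self.llm = llm
        self.tts = tts
        self.storage = storage

    def execute(self, audio: bytes) -> str:
        """Run one request through the pipeline and return a link
        to the synthesized Igbo audio response."""
        igbo_text = self.stt.transcribe(audio)              # audio -> Igbo text
        english_text = self.translator.to_english(igbo_text)
        english_reply = self.llm.generate(english_text)     # LLM handles the query
        igbo_reply = self.translator.to_igbo(english_reply)
        reply_audio = self.tts.synthesize(igbo_reply)       # Igbo text -> audio
        return self.storage.save(reply_audio)               # store, return a link
```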

Technology Stack

The backend is built using the following technologies:

  • Python: The core programming language.
  • FastAPI: A modern, high-performance web framework for building APIs with Python.
  • Pydantic: Used for data validation and settings management, particularly for loading configuration from environment variables.
  • AI/ML Integration: The backend is designed to integrate with various AI/ML models and services. The specific libraries used depend on the configured providers for each step of the pipeline (e.g., transformers for local models or the cohere SDK for Cohere's API).