1. Receive the input: direct text, a text file, or an SRT subtitle file.
2. Determine backend (Kokoro or Noiz) based on user configuration and availability.
3. If using timeline mode, parse the SRT file and voice map.
4. Configure voice settings (voice ID, language, emotion, etc.).
5. Execute text-to-speech conversion.
6. Write the generated audio to the output file.
7. Report success or failure to the user.
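
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the names `choose_backend`, `parse_srt`, and `Cue` are hypothetical, and the fallback order (Kokoro first, then Noiz) and the error raised when a requested backend is missing are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cue:
    """One SRT subtitle entry (hypothetical structure for this sketch)."""
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _seconds(stamp: str) -> float:
    """Convert an 'HH:MM:SS,mmm' timestamp to seconds."""
    m = _TIME.fullmatch(stamp.strip())
    if m is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    h, mi, s, ms = (int(g) for g in m.groups())
    return h * 3600 + mi * 60 + s + ms / 1000


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues, skipping blocks that are not well formed."""
    cues = []
    normalized = content.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        cues.append(
            Cue(int(lines[0]), _seconds(start), _seconds(end), "\n".join(lines[2:]))
        )
    return cues


def choose_backend(preferred: Optional[str], available: set[str]) -> str:
    """Pick a backend from user preference and availability.

    Assumption: an explicit preference that is unavailable is an error,
    and with no preference Kokoro is tried before Noiz.
    """
    if preferred is not None:
        if preferred not in available:
            raise RuntimeError(f"backend {preferred!r} is not available")
        return preferred
    for name in ("kokoro", "noiz"):
        if name in available:
            return name
    raise RuntimeError("no TTS backend available")
```

In timeline mode, each parsed `Cue` would then be matched against the voice map and synthesized separately, so that every clip can be placed at its `start` time in the final audio.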