Voice Clone / Audio Book Creator
Check out the project here
- To see how I implemented a script that creates an audiobook from any text, see the audiobook project
Tools Used:
NLP, Tokenization, NLTK, Processing and Synthesis of Textual Content for Audio Generation, Text-to-Speech (TTS) Synthesis, Fine-Tuning of Model for Audio Generation, Manipulating Audio Data, PyTorch, Integration of TTS Model for Real-World Application, Manipulation and Concatenation of Audio Waveforms, Playback of Audio using various Sound Libraries, Git, Model Checkpoints, Automation of Text Processing
Summary:
Designed and Implemented Text-to-Speech (TTS) Models: Engineered deep learning models for text-to-speech synthesis, enabling the conversion of textual content into high-quality audio waveforms. Leveraged PyTorch and torchaudio libraries for model development and manipulation of audio data.
Utilized Natural Language Processing (NLP) Techniques: Employed NLTK for text preprocessing tasks, including tokenization of input text into sentences. Applied regex-based sentence splitting to preserve sentence-ending punctuation during text processing.
Deployed Real-time Audio Synthesis: Integrated TTS models into the system for real-time audio synthesis, allowing for immediate playback of synthesized audio.
Audio Manipulation and Playback: Concatenated audio waveforms and adjusted playback rates, using sound libraries such as sounddevice for audio playback.
Scripting and Automation: Automated text processing and synthesis tasks using Python scripts, enhancing efficiency and scalability of the system.
Version Control and Collaboration: Managed model checkpoints and configurations using Git, enabling version control and collaboration on the project.
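As a rough illustration of the text processing and waveform concatenation described in the summary, here is a minimal, dependency-free sketch. It is not the project's actual code: `split_sentences`, `fake_synthesize`, `concatenate`, the sample rate, and the pause length are all hypothetical names and values chosen for this example, and the "synthesizer" is just a sine tone standing in for a real TTS model.

```python
import math
import re

SAMPLE_RATE = 22050  # assumed sample rate in Hz, not taken from the project


def split_sentences(text):
    """Split text into sentences, keeping the ending punctuation attached."""
    # Split on whitespace that directly follows '.', '!' or '?'.
    # The lookbehind keeps the punctuation with the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def fake_synthesize(sentence):
    """Hypothetical stand-in for a TTS model: a 440 Hz tone whose length scales with the text."""
    n_samples = len(sentence) * 100
    return [math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n_samples)]


def concatenate(clips, pause_seconds=0.25):
    """Join clips into one waveform, inserting silence between consecutive clips."""
    silence = [0.0] * int(SAMPLE_RATE * pause_seconds)
    waveform = []
    for index, clip in enumerate(clips):
        if index > 0:
            waveform.extend(silence)
        waveform.extend(clip)
    return waveform


sentences = split_sentences("Move! No, I said. Are you sure?")
clips = [fake_synthesize(sentence) for sentence in sentences]
audio = concatenate(clips)
```

In a PyTorch-based pipeline the clips would more likely be tensors joined with `torch.cat`, and the result could be played with `sounddevice.play(audio, SAMPLE_RATE)`; plain lists are used here only to keep the sketch self-contained.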
Listen to the Reference Audio used to train the Model:
Listen to the voice clone read a story generated by ChatGPT using the prompt:
- Prompt: Can I get a short story that involves space travel, pirates, magical abilities, and crazy blackhole imps?
Example clip showing that the model not only handles punctuation but also conveys exclamations, which heightens immersion in generated audiobooks:
- Key Timestamps: 00:02 “MOVE!”, 00:07 “Move!” “No!” I said, 00:45 - 00:53 “But ohhhh noo, they don’t allow that do they?!” “There are ways” Lillian said. “NO!” the man barked “No! Not nearly enough”