Deep Voice is a production-quality text-to-speech (TTS) system constructed entirely from deep neural networks. Unlike alternative neural TTS systems, Deep Voice runs in real time, synthesizing audio as fast as it needs to be played, which makes it usable for interactive applications such as media and conversational interfaces. By training deep neural networks that learn from large amounts of data and simple features, rather than relying on hand-engineered pipelines, we created a flexible system for high-quality voice synthesis in real time.
Deep Voice lays the groundwork for truly end-to-end speech synthesis without a complex processing pipeline and without relying on hand-engineered features for inputs or pre-training. Synthesizing artificial human speech from text, commonly known as text-to-speech (TTS), is an essential component of many applications, such as speech-enabled devices, navigation systems, and accessibility tools for visually impaired users. Fundamentally, it allows people to interact with technology without requiring a visual interface.