ElevenLabs is launching its own speech-to-text model

ElevenLabs, an AI startup that just raised a $180 million mega funding round, is mainly known for its audio generation prowess. The company has now taken a step in another technical direction by launching its first standalone speech-to-text model, called Scribe.

The startup, valued at $3.3 billion, has helped several other companies provide text-to-speech services through its vast library of voices. However, the company is now turning to speech recognition and competing with the likes of Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI's Whisper models.

At launch, ElevenLabs' Scribe model supports more than 99 languages. The company places more than 25 of them in its excellent accuracy category, where the word error rate is below 5%. The list includes English (97% accuracy rate), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese. Other languages fall into the high accuracy (5% to 10% word error rate), good accuracy (10% to 20% word error rate), and moderate accuracy (25% to 50% word error rate) categories.
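For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a generic illustration of that calculation and of the tiers listed above, not ElevenLabs' evaluation code. The function names are hypothetical, the handling of boundary values is an assumption, and the 20% to 25% range is left unclassified because the published tiers do not cover it.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_tier(wer: float) -> str:
    """Map a WER (0.0 to 1.0) to the tiers named in the article.

    Boundary handling is an assumption; the article only gives ranges.
    """
    if wer < 0.05:
        return "excellent"
    if wer < 0.10:
        return "high"
    if wer <= 0.20:
        return "good"
    if 0.25 <= wer <= 0.50:
        return "moderate"
    return "unclassified"  # includes the 20% to 25% gap and anything above 50%


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}, tier: {accuracy_tier(wer)}")
```

Assuming the quoted 97% accuracy for English corresponds to a word error rate of about 3%, it lands in the excellent tier under this mapping.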

The company said the model outperformed Google Gemini 2.0 Flash and Whisper Large V3 across several languages in FLEURS and Common Voice benchmark tests.

ElevenLabs developed a speech-to-text component for its conversational AI agent platform, which was released last year. However, this is the first time the company is releasing a standalone speech recognition model. In a conversation with TechCrunch last month, CEO Mati Staniszewski spoke about improving speech recognition models.

“We want to be able to understand what is being said in a conversation. We are working on ways to move from only generating content to understanding and transcribing speech as well,” Staniszewski said at the time. “Many people say that speech is a solved problem. But for many languages, it is very bad. We think we can build a better speech detection model because we have in-house teams that annotate data and give us quick feedback.”

The model also offers smart speaker diarization to tell you who is speaking, word-level timestamps for accurate subtitles, and auto-tagging of sound events such as audience laughter. The startup is also giving customers a way to transcribe video content directly in its studio to add subtitles or captions.
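To illustrate why word-level timestamps and speaker labels matter for subtitles, here is a minimal sketch that groups timed words into SubRip (SRT) cues. The `Word` record and its fields are hypothetical and do not represent Scribe's actual output format; the grouping rule (a new cue on speaker change or after a fixed number of words) is likewise an assumption chosen for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical record shape, not Scribe's real response schema.
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    speaker: str


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group words into cues, breaking on speaker change or cue length."""
    cues: list[list[Word]] = []
    current: list[Word] = []
    for word in words:
        if current and (len(current) >= max_words
                        or word.speaker != current[0].speaker):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        timing = f"{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}"
        blocks.append(f"{index}\n{timing}\n[{cue[0].speaker}] {text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4, "A"),
        Word("there", 0.4, 0.8, "A"),
        Word("Hi", 1.0, 1.2, "B"),
    ]
    print(words_to_srt(sample))
```

Because each cue takes its start time from its first word and its end time from its last, the subtitle boundaries follow the speech itself rather than fixed-length windows.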

Scribe currently works only with prerecorded audio. The company said it will soon release a low-latency, real-time version of the model. This means it is not yet suited to use cases such as live meeting transcription or voice note-taking.

ElevenLabs is pricing Scribe at $0.40 per hour of transcribed audio. While the rate is competitive, some of its rivals currently offer lower prices for audio transcription, with some differences in features.
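As a rough guide to what that rate means in practice, the sketch below estimates the cost of a transcription job from its duration. It assumes the $0.40 hourly rate is prorated linearly with no minimum charge, plan discounts, or rounding to whole minutes, none of which the company's announcement confirms; the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Rate quoted in the article; proration rules are an assumption.
PRICE_PER_HOUR = Decimal("0.40")


def transcription_cost(duration_seconds: int) -> Decimal:
    """Estimate the cost in US dollars, rounded to the nearest cent."""
    hours = Decimal(duration_seconds) / Decimal(3600)
    return (hours * PRICE_PER_HOUR).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )


if __name__ == "__main__":
    for label, seconds in [("90-minute meeting", 5400),
                           ("1,000 hours of archive audio", 3600 * 1000)]:
        print(f"{label}: ${transcription_cost(seconds)}")
```

Under these assumptions, a 90-minute recording costs about $0.60 and a 1,000-hour archive about $400.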
