June 5, 2018

New method enables high quality speech separation

by Association for Computing Machinery

People have a natural knack for focusing on what a single person is saying, even when there are competing conversations in the background or other distracting sounds. For instance, people can often make out what is being said by someone at a crowded restaurant, during a noisy party, or while viewing televised debates where multiple pundits are talking over one another. To date, mimicking this natural human ability to isolate speech computationally, and doing so accurately, has been a difficult task.

"Computers are becoming better and better at understanding speech, but still have significant difficulty understanding speech when several people are speaking together or when there is a lot of noise," says Ariel Ephrat, a Ph.D. candidate at the Hebrew University of Jerusalem, Israel, and lead author of the research. (Ephrat developed the new model while interning at Google in the summer of 2017.) "We humans know how to understand speech in such conditions naturally, but we want computers to be able to do it as well as us, maybe even better."

To this end, Ephrat and his colleagues at Google have developed a novel audio-visual model for isolating and enhancing the speech of desired speakers in a video. The team's deep-network-based model incorporates both visual and auditory signals in order to isolate and enhance any speaker in any video, even in challenging real-world scenarios, such as video conferencing, where multiple participants often talk at once, and noisy bars, which can contain a variety of background noise, music, and competing conversations.

The team, which includes Google's Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, and Michael Rubinstein, will present their work at SIGGRAPH 2018, held 12-16 August in Vancouver, British Columbia. The annual conference and exhibition showcases the world's leading professionals, academics, and creative minds at the forefront of computer graphics and interactive techniques.

In this work, the researchers focused not only on auditory cues to separate speech but also on visual cues in the video, namely the subject's lip movements and potentially other facial movements that may relate to what he or she is saying. The visual features are used to "focus" the audio on a single subject who is speaking and to improve the quality of speech separation.

To train their joint audio-visual model, Ephrat and collaborators curated a new dataset, "AVSpeech," composed of thousands of YouTube videos and other online video segments, such as TED Talks, how-to videos, and high-quality lectures. From AVSpeech, the researchers generated a training set of so-called "synthetic cocktail parties": mixtures of face videos with clean speech and other speech audio tracks with background noise. To isolate speech from these videos, the user need only specify the face of the person in the video whose audio is to be singled out.
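As a rough illustration of what a "synthetic cocktail party" mixture might look like on the audio side, the sketch below sums a target track, an interfering track, and scaled background noise. The article does not describe the actual mixing recipe, so everything here is an assumption made for illustration: the function names, the 16 kHz sample rate, the noise gain of 0.3, and the use of pure tones as stand-ins for real speech.

```python
import math
import random


def synth_tone(freq_hz, n_samples, sample_rate=16000):
    """Stand-in for a speech track: a pure tone (illustrative assumption)."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


def mix_cocktail_party(clean, interferer, noise, noise_gain=0.3):
    """Sum a clean track, an interfering track, and scaled noise, sample by sample.

    The mixture is truncated to the shortest input. The gain value is a
    hypothetical choice, not one taken from the paper.
    """
    n = min(len(clean), len(interferer), len(noise))
    return [clean[i] + interferer[i] + noise_gain * noise[i] for i in range(n)]


rng = random.Random(0)           # fixed seed so the example is repeatable
n = 16000                        # one second of audio at the assumed 16 kHz
target = synth_tone(220.0, n)    # the "clean speech" we want to recover
other = synth_tone(330.0, n)     # a competing speaker
noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]

mixture = mix_cocktail_party(target, other, noise)

# A training example would pair the mixture (model input) with the
# clean target track (the desired output).
training_pair = (mixture, target)
```

In a setup like this, the clean target track is known by construction, which is what makes such synthetic mixtures usable as supervised training data.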

In multiple examples detailed in the paper, titled "Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation," the new method produced superior results compared with existing audio-only methods on pure speech mixtures, and showed significant improvements in delivering clear audio from mixtures containing overlapping speech and background noise in real-world scenarios. While the focus of the work is speech separation and enhancement, the team's novel method could also be applied to automatic speech recognition (ASR) and video transcription, that is, closed captioning capabilities on streaming videos and TV. In a demonstration, the new joint audio-visual model produced more accurate captions in scenarios where two or more speakers were involved.

Surprised at first by how well their method worked, the researchers are excited about its future potential.

"We haven't seen speech separation done 'in-the-wild' at such quality before. This is why we see an exciting future for this technology," notes Ephrat. "There is more work needed before this technology lands in consumer hands, but with the promising preliminary results that we've shown, we can certainly see it supporting a range of applications in the future, like video captioning, video conferencing, and even improved hearing aids if such devices could be combined with cameras."

The researchers are currently exploring opportunities for incorporating the technology into various Google products.

Provided by Association for Computing Machinery

Citation: New method enables high quality speech separation (2018, June 5) retrieved 27 April 2024 from https://phys.org/news/2018-06-method-enables-high-quality-speech.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
