Join Amazon Scientists and Engineers at ICASSP 2019!

What are the Alexa and Devices organizations? Amazon Alexa brings voice-driven experiences to our customers: just ask, and Alexa will play music, answer questions, turn on the lights, make calls, and provide information, news, sports scores, weather, and more, instantly. Amazon Devices brings breakthrough technology to our customers, from Kindle and Fire TV Stick to Echo and Dash. We are developing the future of voice and device technology across the globe, and hiring the people who will build it.

Alexa & Devices Research

Our research is customer focused. Your discoveries in speech recognition, natural-language understanding, deep learning, and other disciplines of machine learning can fuel new ideas and applications that have direct impact on people's lives. We also firmly believe that our team must engage deeply with the academic community and be part of the scientific discourse.

Minhua Wu

3 questions with Minhua Wu,
an applied scientist in the Alexa Speech group

 

Your group has three papers at ICASSP this year. What are they about?

Two are about multichannel acoustic modeling. Right now, we are using seven microphones, and we beamform the signal before doing acoustic modeling. We optimize the beamformer based on speech quality, but we optimize the acoustic model based on word error rate.

With our new work, we are incorporating beamforming into the acoustic model, so we are jointly optimizing both the beamformer and the acoustic model, and both of them are optimized toward reducing word error rate.
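Conceptually, joint optimization means the beamformer weights become just another set of trainable parameters in the network, so one word-error-driven loss reaches both modules. Here is a minimal sketch of such a forward pass, with a filter-and-sum beamformer in the frequency domain; all shapes, names, and the toy acoustic-model layer are illustrative, not the actual Alexa architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 7 mics, 257 frequency bins, 100 frames.
n_mics, n_bins, n_frames = 7, 257, 100

# Multichannel STFT features: (mics, bins, frames), complex-valued.
X = rng.standard_normal((n_mics, n_bins, n_frames)) \
    + 1j * rng.standard_normal((n_mics, n_bins, n_frames))

# Trainable beamformer: one complex weight per (mic, bin).  In joint
# training these receive gradients from the ASR loss, not from a
# separate speech-quality objective.
W = rng.standard_normal((n_mics, n_bins)) \
    + 1j * rng.standard_normal((n_mics, n_bins))

# Filter-and-sum: combine the channels into one enhanced spectrum.
Y = np.einsum("mf,mft->ft", np.conj(W), X)         # (bins, frames)

# Log-magnitude features feed a stand-in acoustic-model layer.
feats = np.log(np.abs(Y) + 1e-8).T                 # (frames, bins)
n_senones = 50
W_am = rng.standard_normal((n_bins, n_senones)) * 0.01
logits = feats @ W_am                              # (frames, senones)

print(logits.shape)
```

Because the beamforming step is differentiable, backpropagating a cross-entropy loss on `logits` would update `W` and `W_am` together.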

The other paper is about improving the conventional beamforming system. The challenge we are dealing with is the noise robustness of the system. Under clean conditions, ASR is already performing very well, but we still have problems dealing with noisy conditions. So the main idea is to train a teacher model on the clean data, then artificially add noise: music noise, audio playback noise, household noise.

Then we use the noisy data to train the student model, paired with the teacher’s predictions from the clean data. We are hoping to create a student that performs similarly to the teacher but in the noisy domain.

We also apply a logit selection method. We are using the predictions from the teacher to train the student model, but instead of using all of them, we only use the most confident ones. We’re trying to limit this to the most reliable information that the student can learn from the teacher.
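The selection step can be sketched as a confidence filter on the teacher's per-frame posteriors: keep only frames where the teacher's top softmax probability clears a threshold, and pair the corresponding noisy features with the teacher's soft targets. A toy numpy sketch; the threshold value and all shapes are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_frames, n_classes = 200, 40

# Teacher logits computed on the *clean* signal.
teacher_logits = rng.standard_normal((n_frames, n_classes)) * 3.0
teacher_probs = softmax(teacher_logits)

# Noisy-counterpart features the student will see (same frames).
noisy_feats = rng.standard_normal((n_frames, 64))

# Logit selection: keep only frames where the teacher is confident.
threshold = 0.9  # illustrative value
confident = teacher_probs.max(axis=1) >= threshold

student_inputs = noisy_feats[confident]
student_targets = teacher_probs[confident]  # soft distillation targets

print(confident.sum(), "of", n_frames, "frames kept")
```

The student is then trained on `(student_inputs, student_targets)` pairs, so it only learns from the teacher's most reliable predictions.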

When you survey the field, what new trends excite you?

For machine learning, it’s simplifying the problem. If you look at the conventional ASR system, there are many manual steps. When you extract features, you have to carefully design the way they are extracted. And when you do the modeling, you are using senone targets, which are actually the tied states of triphones, and there is a tree-building process to create those. There are a lot of algorithms involved in that procedure.

Nowadays, in the speech field, people are moving to more end-to-end-based systems. You don’t need the complicated tree-building process to create those tied-state triphone targets. You don’t need to align frames of feature vectors to those training targets. You can just throw in maybe frames of Fourier transform coefficients, and use characters or subword units to do the modeling. The decoding step becomes more straightforward, also.
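The simplification described above can be made concrete: the inputs are just frames of Fourier transform coefficients, and the targets are just character indices from the transcript, with no tree building and no frame alignment. A minimal illustration with a synthetic waveform; the frame sizes, vocabulary, and transcript are made up:

```python
import numpy as np

# Inputs: magnitude STFT frames of a (synthetic) waveform.
rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)            # 1 s at 16 kHz
frame, hop = 400, 160                            # 25 ms window, 10 ms hop
frames = np.stack([waveform[i:i + frame]
                   for i in range(0, len(waveform) - frame, hop)])
spectra = np.abs(np.fft.rfft(frames, axis=1))    # (n_frames, frame//2 + 1)

# Targets: characters of the transcript -- no senones, no alignment.
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz '")}
transcript = "turn on the lights"
targets = [vocab[c] for c in transcript]

print(spectra.shape, len(targets))
```

An end-to-end model (e.g., trained with a CTC or attention objective) maps the `spectra` sequence to the `targets` sequence directly, which is the whole pipeline.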

To me, it’s a big simplification of the system. It doesn’t require you to have much domain knowledge in the speech field to build a speech system, and it encourages more talent from the general machine learning field to join the speech community and optimize the system.

What made you decide to take a job in industry, instead of going the academic route?

Academics do more theoretical work, and the systems you build may or may not end up in real products that impact customers. But here, once you’ve developed a model, you can get it into our products and see how it impacts customers at scale.

As an applied scientist, you must of course do engineering work to build models, but you also have the opportunity to read papers, learn new technologies in the field, and build new systems. That part is exciting and similar to what people are doing in academia.

Philip Hilmes
3 questions with Philip Hilmes,
director of audio technology
for the hardware division of Amazon Devices

 

Is there an overarching theme to your group’s research?

Well, we work in multiple areas, hardware and software. We work on improving speech recognition accuracy; we do all the voice communication for Alexa Calling; we do audio playback on all of our devices; and we build a lot of different detection algorithms and all sorts of things, from health, wellness, and medical applications to karaoke.

But if you had to go down to fundamentals, we basically combine the best of machine learning techniques and the best of signal-processing techniques with innovative algorithms and hardware designs for problems in all those areas.

How do you combine signal processing and machine learning?

Acoustic echoes are handled really well by traditional signal-processing techniques, but we may use machine learning to build a control mechanism that figures out how to adapt that signal processing to a specific situation. So if you have an Echo device that’s playing music, we use acoustic-echo-cancellation filters that know what is playing, and then, listening to the microphones, use that reference to cancel the audio directly. But performance is very dependent on how the filter is adapted to cancel the audio.

So we use machine learning algorithms to understand what’s going on in the environment and then adapt how we cancel the sound. The adaptive filter for echo cancellation changes based on what it’s hearing, but you can tell it to change, or you can tell it, no, don’t change. So we listen to understand when, for instance, somebody is trying to talk to Alexa; then we freeze the filter so that it doesn’t change and try to cancel what the person is saying.
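The freeze behavior can be sketched with a standard NLMS adaptive filter: each step subtracts the filtered playback reference from the microphone signal, and the weight update is simply skipped while near-end speech is flagged. This is a toy single-channel sketch, not Amazon's implementation; the filter length, step size, toy echo path, and `near_end_active` flag are all illustrative:

```python
import numpy as np

def nlms_aec(mic, ref, near_end_active, taps=32, mu=0.5, eps=1e-8):
    """Cancel the playback reference from the mic signal with NLMS.

    near_end_active[n] == True freezes adaptation at sample n, so the
    filter does not try to cancel the talker's speech.
    """
    w = np.zeros(taps)
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]          # recent reference samples
        e = mic[n] - w @ x                 # echo-cancelled output
        out[n] = e
        if not near_end_active[n]:         # freeze while someone talks
            w += mu * e * x / (x @ x + eps)
    return out

rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)            # what the device is playing
echo = 0.6 * np.roll(ref, 5)               # toy echo path: delay + gain
mic = echo.copy()                          # echo-only segment
active = np.zeros(len(mic), dtype=bool)    # nobody talking yet

out = nlms_aec(mic, ref, active)
# After the filter converges, the residual in the output should be
# much smaller than the original echo.
print(np.abs(out[-500:]).mean() < 0.1 * np.abs(echo[-500:]).mean())
```

Setting stretches of `active` to `True` (driven, per the interview, by a machine learning detector) would hold `w` fixed over those samples while still producing output.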

We want to cancel everything that the Echo device is playing, and we also have other processes that try to cancel interference within the room, but we do everything we can not to touch the speech, so that it gets through to the wake word engine and the speech recognition engine.

Another example is in beamforming. You can do traditional beamforming where you combine the microphones with filters that affect the amplitude and phase to focus on sound from a specific direction. But if that’s all you do, then you still have to figure out what direction to point it in. So we train it on the sound of interest. This is the speech that we want to listen to, so when it combines the microphones together, it can combine them in a way that optimizes for the specific direction of the speech and the overall recognition end to end.
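The traditional half of that description can be sketched as delay-and-sum beamforming: delay each microphone so that sound from the chosen direction lines up, then average the channels. A toy two-microphone sketch with made-up, integer-sample geometry (real systems use fractional delays or frequency-domain filters):

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Align each channel by its integer-sample steering delay, then average."""
    aligned = [np.roll(s, -d) for s, d in zip(signals, delays)]
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(0)
src = rng.standard_normal(1024)            # target speech, toy signal

# The source hits mic 0 first and mic 1 three samples later
# (made-up geometry).
mic0 = src
mic1 = np.roll(src, 3)

# Steering toward the source: undo each mic's arrival delay.
out = delay_and_sum([mic0, mic1], delays=[0, 3])

# Because np.roll is circular, undoing the delay is exact in this toy
# setup, so the coherently combined output equals the source.
err = np.abs(out - src).max()
print(err < 1e-12)
```

Choosing `delays` is exactly the "what direction to point it in" problem the answer describes; the learned approach replaces that hand-set steering with weights optimized for the speech of interest and end-to-end recognition.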

When you were describing your team’s projects, did you say … karaoke?

We recently developed this machine learning algorithm that can very accurately extract all the vocals from any music track that you have, so the vocals sound perfectly clean and clear and amazing, and the leftover music is also perfectly clean and clear without any vocals in it. It’s extremely good separation.

For karaoke, we still have to get permission to use this technology, but we don’t have to go get multitrack recordings and have them remixed, and music companies can save a bunch of time and money.

Have you seen how the Echo Show can show you lyrics? By separating out the vocals, we can do a better job of speech recognition, to know exactly which word we’re on in the song and highlight it so that you can sing along better. And then you can also mix in however much of the original vocals you want. So it has several applications.

 

ICASSP 2019 Research Papers

 

Join Us

Are you interested in pioneering and developing the future of voice and device technology? Please send resumes to icassp2019@amazon.com.

 

Find jobs in Amazon @ ICASSP 2019