Amazon Alexa is the Founding Sponsor of Interspeech 2019!
The Alexa team looks forward to meeting you at Interspeech 2019! Come and visit our booth to learn more about our research and career opportunities. Below is more information about our technology and team.
Technologies We Focus On
The Alexa Science team made the magic of Alexa possible, but that was just the beginning. Our goal is to make voice interfaces ubiquitous and as natural as speaking to a human. We have a relentless focus on the customer experience and customer feedback. We use many real-world data sources including customer interactions and a variety of techniques like highly scalable deep learning. Learning at this massive scale requires new research and development. The team is responsible for cutting-edge research and development in virtually all fields of Human Language Technology: Automatic Speech Recognition (ASR), Artificial Intelligence (AI), Natural Language Understanding (NLU), Question Answering, Dialog Management, and Text-to-Speech (TTS). See an interview with VP Rohit Prasad here.
Alexa scientists and developers have large-scale impact on customers’ lives and on the industry-wide shift to voice user interfaces. Scientists and engineers on the Alexa team also invent new tools and APIs to accelerate development of voice services by empowering developers through the Alexa Skills Kit and the Alexa Voice Service. For example, developers can now create a new voice experience by simply providing a few sample sentences.
Your discoveries in speech recognition, natural language understanding, deep learning, and other disciplines of machine learning can fuel new ideas and applications that have direct impact on people’s lives. We firmly believe that our team must engage deeply with the academic community and be part of the scientific discourse. There are many opportunities for presentations at internal machine learning conferences, which act as a springboard for publications at premier conferences. We also partner with universities through the Alexa Prize.
3 questions with Dilek Hakkani-Tür, the senior principal scientist leading Alexa's research on dialogue systems
You joined Alexa a year ago, and your team has three papers at this year’s Interspeech. What are they about?
Two of them are for multidomain task-oriented dialogues. The first paper is about unifying the dialogue-act schemas of multiple task-oriented-dialogue data sets, with the goal of building a universal dialogue-act tagger. Dialogue acts have been studied for a very long time, and there are multiple data sets that are available, each annotated with a set of core dialogue acts that describe interactions at the level of intentions. Unfortunately, there are differences in the annotation schemas used for these data sets.
Our thinking was, to benefit from these data sets, we can try to map them into the same space. After making trivial mappings manually, we train a dialogue-act-tagging model on one corpus to label the other one, and vice versa, to automatically detect alignments between acts of different schemas.
Then we can train the universal dialogue-act tagger to annotate new data, such as human-human task-oriented conversations. Our ultimate goal is to be able to train a complete dialogue system from these human-human conversations.
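The cross-labeling step described above can be illustrated with a minimal Python sketch. It assumes a tagger trained on one corpus has already labeled the utterances of the other corpus, and it maps an act only when one predicted act clearly dominates. The function name `align_acts`, the `min_share` threshold, and the toy act labels are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

def align_acts(gold_acts, predicted_acts, min_share=0.6):
    """Map each act of schema B to the schema-A act most often predicted for it.

    gold_acts: schema-B labels from the corpus annotation.
    predicted_acts: schema-A labels assigned by a tagger trained on corpus A.
    min_share: hypothetical cutoff; weaker alignments are left unmapped.
    """
    counts = defaultdict(Counter)
    for gold, pred in zip(gold_acts, predicted_acts):
        counts[gold][pred] += 1

    mapping = {}
    for gold, preds in counts.items():
        best, n = preds.most_common(1)[0]
        if n / sum(preds.values()) >= min_share:
            mapping[gold] = best
    return mapping

# Toy labels, invented for illustration only.
gold_b = ["ask", "ask", "ask", "confirm", "confirm", "bye"]
pred_a = ["request", "request", "inform", "affirm", "negate", "goodbye"]
mapping = align_acts(gold_b, pred_a)
print(mapping)
```

Running the same function with the corpora swapped gives the reverse direction; keeping only the pairs that agree in both directions would be one conservative way to accept an alignment.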
The other paper is about multidomain dialogue state tracking. After each user turn in the conversation, state tracking aims to estimate what the user’s request is, given all the turns up to that point in the conversation. Let’s say you’re looking for restaurants, and the backend restaurant API has three slots, for cuisine, price, and area. You may say, “I want Indian food in the center.” And the system may say, “Oh, sorry, there’s no Indian restaurant in the center of town. How about north?”
When you agree to this utterance, the constraint for restaurant cuisine, Indian, is still valid, but the area constraint changes from center to north. Dialogue state tracking estimates which information stays the same and which is changed to a new value.
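The bookkeeping in the restaurant example can be sketched in a few lines of Python. This shows only how a state carries over unchanged slots and overwrites changed ones; the hard part, estimating the per-turn slot values from the utterances, is assumed to have been done already. The function name `update_state` and the dictionary layout are illustrative assumptions.

```python
def update_state(state, turn_values):
    """Return a new dialogue state with the slots mentioned in this turn overwritten.

    state: dict mapping slot name to current value (None if unset).
    turn_values: dict of slot values estimated for the latest turn.
    """
    new_state = dict(state)
    for slot, value in turn_values.items():
        if value is not None:
            new_state[slot] = value
    return new_state

# The three slots of the hypothetical restaurant API.
state = {"cuisine": None, "price": None, "area": None}

# Turn 1: "I want Indian food in the center."
state = update_state(state, {"cuisine": "indian", "area": "center"})

# Turn 2: the user accepts the system's suggestion of the north.
state = update_state(state, {"area": "north"})
print(state)
```

After the second turn the cuisine constraint is still `indian`, while the area has changed from `center` to `north`.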
Many previous approaches do not scale to real applications, where one can observe rich natural-language utterances that can include previously unseen slot-value mentions and a large, possibly unlimited space of dialogue states. Our proposed approach is a hybrid one. Building on the previous work, one model forms slot-value vocabularies from the training examples, and at inference time, given the dialogue context and each slot, it estimates a probability for each value in the slot vocabulary.
The other model is an open-vocabulary model, which estimates a list of targets from the dialogue context — for example, considering named-entity results or all possible n-grams in the context — and makes a binary decision for each target to estimate if that should be the value of the slot. We have observed that the first approach performs well on closed-vocabulary slots, and the other one performs well on slots that have a large set of values. The final decision is made by combining these two methods.
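A rough Python sketch of the two-model idea follows. The interview does not say how the two methods are combined, so the rule used here (choosing one model per slot) is an assumption made for illustration, and the scores are hard-coded stand-ins for model outputs. All names (`ngrams`, `pick_closed`, `pick_open`, `hybrid_predict`) are hypothetical.

```python
def ngrams(tokens, max_n=3):
    """List every n-gram of up to max_n tokens as a candidate slot value."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def pick_closed(value_probs):
    """Closed-vocabulary model: take the most probable value in the slot vocabulary."""
    return max(value_probs, key=value_probs.get)

def pick_open(candidate_scores, threshold=0.5):
    """Open-vocabulary model: binary decision per candidate, keep the best accepted one."""
    accepted = {c: s for c, s in candidate_scores.items() if s >= threshold}
    return max(accepted, key=accepted.get) if accepted else None

def hybrid_predict(slot, closed_probs, open_scores, open_slots):
    """Assumed combination rule: use the open model for slots listed in open_slots."""
    if slot in open_slots:
        return pick_open(open_scores)
    return pick_closed(closed_probs)

# Toy context and stand-in scores, invented for illustration.
tokens = "book a table at golden wok".split()
candidates = ngrams(tokens, max_n=2)
open_scores = {c: 0.1 for c in candidates}
open_scores["golden wok"] = 0.9

name = hybrid_predict("name", {}, open_scores, open_slots={"name"})
area = hybrid_predict(
    "area", {"north": 0.7, "center": 0.2, "south": 0.1}, {}, open_slots={"name"}
)
print(name, area)
```

Here the restaurant name, which could take almost any value, is picked from the n-gram candidates, while the area is picked from a small fixed vocabulary.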
That covers the two papers on task-oriented dialogues. How about the third paper?
The third is related to the Alexa Prize. The Alexa Prize is a fantastic framework to engage universities and enable their systems to interact with real users. I’m quite proud that Amazon came up with this idea.
While we have seen significant progress over the last two years of the Alexa Prize Challenge, we are still far away from the grand challenge of 20-minute engaging conversations. We have observed common problems, such as lack of conversational depth (oftentimes, bots run out of things to say and try to change the topic of conversation) or loops in the conversation flows, where the same user can go through a similar flow in the same dialogue. Furthermore, we observed that university teams have difficulty publishing their original work, due to challenges related to repeatable research. To enable research on deeper conversations and enable teams to publish their findings, we decided to release a topical-conversation data set and a set of benchmarks.
We started with a set of entities that Alexa Prize participants like to talk about, and starting with the most common ones, we looked into groups of entities — in this case, three entities. We curated knowledge and reading sets consisting of articles and fun facts about all three entities. Then we paired crowd workers and gave them reading content. Sometimes we gave the two the exact same content; sometimes we created some differences in the reading content — for example, split fun facts between the workers, to create engaging and lively interactions.
Then we asked them to engage in a conversation on the topic of the reading set. We also asked them to write down where they found their facts, so that they are grounding their responses on the loosely structured knowledge in the reading set.
You’ve been a regular attendee at Interspeech since it launched, in 2000. What appeals to you so much about the conference?
I think I only missed one, the time I was pregnant.
One of the reasons I like Interspeech is that it has a breadth of topics about speech, language, and conversation from all kinds of perspectives. I don’t think that richness has changed much over the years. I can find papers interesting to me — for example, on language, dialogue, and spoken conversations — as well as on broader speech-processing areas.
I usually try to submit something to Interspeech so that I can get feedback from others. One of our papers got in as a poster; the other two are orals. I really like the posters because I see them as a way of learning more from other people through interactions when presenting the posters. Yes, you know the problem and solution better than anyone, as you’re the one who worked on it and came up with the ideas. But other participants have all kinds of new ideas or questions, and sometimes some of them are things that you didn’t think of before. I don’t see this as much at some other conferences.
The other thing is you also kind of measure the pulse of the audience. In the oral sessions you can see how many people participated. But when you enter a poster session, you literally see where the interesting ideas are by looking at where all the people are. And as an author, you also get that form of feedback. It’s quite overwhelming — you get tired after these two hours of presentation. But it’s a great experience. I still like it.
Connect with us at Interspeech!
If you would like to meet with us in person at the conference, please contact email@example.com.
Are you ready for your next opportunity? Check out our open positions here, and learn more about the Alexa team here. We have global opportunities available, and speech and machine learning scientists will be available to meet at Interspeech.
2019 Conference Papers
- Multi-Dialect Acoustic Modeling Using Phone Mapping and Online i-Vectors, Harish Arsikere, Ashtosh Sapru, Sri Garimella
- Scalable Multi Corpora Neural Language Models for ASR, Anirudh Raju, Denis Filimonov, Gautam Tiwari, Guitang Lan, Ariya Rastrow
- Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations, Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tür
- Towards Achieving Robust Universal Neural Vocoding, Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal
- Acoustic Model Bootstrapping Using Semi-Supervised Learning, Langzhou Chen, Volker Leutnant
- Two Tiered Distributed Training Algorithm for Acoustic Modeling, Pranav Ladkat, Oleg Rybakov, Radhika Arava, Sree Hari Krishnan Parthasarathi, I-Fan Chen, Nikko Ström
- Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification, Chieh-Chi Kao, Ming Sun, Yixin Gao, Shiv Vitaladevuni, Chao Wang
- Model Compression on Acoustic Event Detection with Quantized Distillation, Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang
- Neural Named Entity Recognition from Subword Units, Abdalghani Abujabal, Judith Gaspers
- Improving ASR Confidence Scores for Alexa Using Acoustic and Hypothesis Embeddings, Prakhar Swarup, Roland Maas, Sri Garimella, Sri Harish Mallidi, Björn Hoffmeister
- HyST: A Hybrid Approach for Flexible and Accurate Dialogue State Tracking, Rahul Goel, Shachi Paul, Dilek Hakkani-Tür
- Towards Universal Dialogue Act Tagging for Task-Oriented Dialogues, Shachi Paul, Rahul Goel, Dilek Hakkani-Tür
- Fine-Grained Robust Prosody Transfer for Single-Speaker Neural Text-to-Speech, Viacheslav Klimkov, Srikanth Ronanki, Jonas Rohnke, Thomas Drugman
- A Study for Improving Device-Directed Speech Detection toward Frictionless Human-Machine Interaction, Che-Wei Huang, Roland Maas, Sri Harish Mallidi, Björn Hoffmeister
- One-Versus-All Models for Asynchronous Training: An Empirical Analysis, Rahul Gupta, Aman Alok, Shankar Ananthakrishnan
- Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion, Alex Sokolov, Tracy Rohlin, Ariya Rastrow
Meet a few members from the Alexa Team at Interspeech
AWS AI Summit 2018: Delivering on the Promise of AI Together
Alexa VP and Head Scientist Rohit Prasad focuses on AI advances that are delighting customers
AWS re:Invent 2017: Alexa State of the Science
Alexa VP and Head Scientist Rohit Prasad presents the state of the science behind Amazon Alexa.
re:MARS 2019: AI Invention That Puts Customers First
Jeff Wilke, CEO, Worldwide Consumer, explains how Amazon has been using AI to develop great customer experiences for nearly 20 years.
"I spoke to the future and it listened" – Gizmodo
Meet the team of world-class scientists behind Alexa.
2018 Alexa Prize Finals
A look into the 2018 Alexa Prize Challenge
Washington Ideas 2017
Rohit Prasad, VP & Head Scientist, Alexa Machine Learning, talks about the future of Alexa & conversational AI with Alexis Madrigal