Efficient Multimodal Neural Networks for Trigger-less Voice Assistants

      The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interactions. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to VAs without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods have limitations, including limited adaptability, scalability, and induced human biases. In this work, we… Read More Apple Machine Learning Research 

​  


Posted

in

by

Tags:

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *