Yi Xie, Jonathan Macoskey, Martin Radfar, Feng-Ju Chang, Brian King, Ariya Rastrow, Athanasios Mouchtaris, Grant P. Strimel. INTERSPEECH 2022.
We present a streaming, Transformer-based end-to-end automatic speech recognition (ASR) architecture which achieves efficient neural inference through compute cost amortization. Our architecture creates sparse computation pathways dynamically at inference time, resulting in selective use of compute resources throughout decoding, enabling significant reductions in compute with minimal impact on accuracy. The fully differentiable architecture is trained end-to-end with an accompanying lightweight arbitrator mechanism operating at the frame level to make dynamic decisions on each input while a tunable loss function is used to regularize the overall level of compute against predictive performance. We report empirical results from experiments using the compute-amortized Transformer-Transducer (T-T) model conducted on LibriSpeech data. Our best model can achieve a 60% compute cost reduction with only a 3% relative word error rate (WER) increase.
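To make the mechanism concrete, here is a minimal sketch of frame-level compute arbitration, assuming a two-branch feed-forward layer; the module names, sizes, and the Gumbel-softmax relaxation are illustrative choices, not details confirmed by the abstract:

```python
import torch.nn as nn
import torch.nn.functional as F

class ArbitratedLayer(nn.Module):
    def __init__(self, dim, small_dim):
        super().__init__()
        # A full-capacity branch and a cheaper branch (sizes illustrative).
        self.big = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.small = nn.Sequential(nn.Linear(dim, small_dim), nn.ReLU(),
                                   nn.Linear(small_dim, dim))
        self.arbitrator = nn.Linear(dim, 2)  # lightweight per-frame gate

    def forward(self, x, tau=1.0):
        # x: (batch, time, dim). A straight-through Gumbel-softmax keeps the
        # per-frame branch choice discrete while remaining differentiable,
        # so the whole pathway trains end-to-end.
        gate = F.gumbel_softmax(self.arbitrator(x), tau=tau, hard=True)
        return gate[..., :1] * self.big(x) + gate[..., 1:] * self.small(x)
```

A compute penalty such as `gate[..., :1].mean()` scaled by a tunable weight can then be added to the training loss to trade accuracy against cost; a deployed decoder would execute only the selected branch per frame instead of masking both, which is where the savings come from.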
Martin Radfar, Rohit Barnwal, Rupak Vignesh Swaminathan, Feng-Ju Chang, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris. INTERSPEECH 2022.
The recurrent neural network transducer (RNN-T) is a prominent streaming end-to-end (E2E) ASR technology. In RNN-T, the acoustic encoder commonly consists of stacks of LSTMs. Very recently, as an alternative to LSTM layers, the Conformer architecture was introduced where the encoder of RNN-T is replaced with a modified Transformer encoder composed of convolutional layers at the frontend and between attention layers. In this paper, we introduce a new streaming ASR model, the Convolutional Augmented Recurrent Neural Network Transducer (ConvRNN-T), in which we augment the LSTM-based RNN-T with a novel convolutional frontend consisting of local and global context CNN encoders. ConvRNN-T takes advantage of causal 1-D convolutional layers, squeeze-and-excitation, dilation, and residual blocks to provide both global and local audio context representation to the LSTM layers. We show ConvRNN-T outperforms RNN-T, Conformer, and ContextNet on LibriSpeech and in-house data. In addition, ConvRNN-T offers less computational complexity compared to Conformer. ConvRNN-T’s superior accuracy along with its low footprint make it a promising candidate for on-device streaming ASR technologies.
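The following is a hedged sketch of one residual block in the spirit of the frontend described above (causal dilated 1-D convolution with squeeze-and-excitation); the hyperparameters are illustrative, not the published configuration:

```python
import torch
import torch.nn as nn

class CausalSEBlock(nn.Module):
    def __init__(self, ch, kernel=5, dilation=2, se_ratio=8):
        super().__init__()
        self.pad = (kernel - 1) * dilation        # left-pad only => causal
        self.conv = nn.Conv1d(ch, ch, kernel, dilation=dilation)
        self.se = nn.Sequential(nn.Linear(ch, ch // se_ratio), nn.ReLU(),
                                nn.Linear(ch // se_ratio, ch), nn.Sigmoid())

    def forward(self, x):                          # x: (batch, ch, time)
        y = torch.relu(self.conv(nn.functional.pad(x, (self.pad, 0))))
        # Squeeze-and-excitation reweights channels by pooled context;
        # pooling over the full sequence is shown for simplicity, whereas
        # a streaming system would use a running (causal) average.
        w = self.se(y.mean(dim=2)).unsqueeze(-1)
        return x + y * w                           # residual connection
```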
Christin Jose, Joe Wang, Grant P. Strimel, Mohammad Omar Khursheed, Yuriy Mishchenko, Brian Kulis. INTERSPEECH 2022.
Conversational agents commonly utilize keyword spotting (KWS) to initiate voice interaction with the user. For user experience and privacy considerations, existing approaches to KWS largely focus on accuracy, which can often come at the expense of introduced latency. To address this tradeoff, we propose a novel approach to control KWS model latency that generalizes to any loss function without explicit knowledge of the keyword endpoint. Through a single, tunable hyperparameter, our approach enables one to balance detection latency and accuracy for the targeted application. Empirically, we show that our approach gives superior performance under latency constraints when compared to existing methods. Namely, we achieve a substantial 25% relative improvement in false accepts at a fixed latency target when compared to the baseline state-of-the-art. We also show that when our approach is used in conjunction with a max-pooling loss, we improve relative false accepts by 25% at a fixed latency when compared to a cross-entropy loss.
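The abstract does not spell out the loss formulation, so the following is only one plausible illustration of a single-hyperparameter latency/accuracy knob, not the paper's actual method:

```python
import torch

def latency_weighted_loss(frame_losses, alpha=0.99):
    # frame_losses: (batch, time) per-frame detection losses on a
    # keyword-containing example. Weighting earlier frames more heavily
    # (alpha**t decays with t) pushes correct detections earlier, with
    # the single knob alpha trading latency against accuracy.
    t = torch.arange(frame_losses.shape[-1],
                     dtype=frame_losses.dtype, device=frame_losses.device)
    return (frame_losses * alpha ** t).sum(dim=-1)
```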
Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Feng-Ju Chang, Jing Liu, Jinru Su, Grant P. Strimel, Athanasios Mouchtaris, Siegfried Kunzmann. ICASSP 2022.
Personal rare word recognition in end-to-end Automatic Speech Recognition (E2E ASR) models is a challenge due to the lack of training data. A standard way to address this issue is with shallow fusion methods at inference time. However, due to their dependence on external language models and the deterministic approach to weight boosting, their performance is limited. In this paper, we propose training neural contextual adapters for personalization in neural transducer-based ASR models. Our approach can not only bias towards user-defined words, but also has the flexibility to work with pretrained ASR models. Using an in-house dataset, we demonstrate that contextual adapters can be applied to any general-purpose pretrained ASR model to improve personalization. Our method outperforms shallow fusion, while retaining functionality of the pretrained models by not altering any of the model weights. We further show that the adapter-style training is superior to full fine-tuning of the ASR models on datasets with user-defined content.
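A minimal sketch of the adapter idea, assuming a cross-attention biasing layer over embeddings of the user's words; names and sizes are illustrative:

```python
import torch.nn as nn

class ContextualAdapter(nn.Module):
    def __init__(self, enc_dim, ctx_vocab, ctx_dim=128, heads=4):
        super().__init__()
        self.embed = nn.Embedding(ctx_vocab, ctx_dim)  # user-word inventory
        self.attn = nn.MultiheadAttention(enc_dim, heads, kdim=ctx_dim,
                                          vdim=ctx_dim, batch_first=True)

    def forward(self, enc_out, ctx_ids):
        # enc_out: (batch, time, enc_dim) from the frozen, pretrained
        # encoder; ctx_ids: (batch, n_words) user-defined word ids.
        ctx = self.embed(ctx_ids)
        bias, _ = self.attn(enc_out, ctx, ctx)
        return enc_out + bias   # only adapter weights are trained
```

Because the pretrained weights are never altered, the same base model can serve users with and without personalized catalogs.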
Xuandi Fu, Feng-Ju Chang, Martin Radfar, Kai Wei, Jing Liu, Grant P. Strimel, Kanthashree Mysore Sathyendra. ICASSP 2022.
End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency when compared to traditionally cascaded pipelines. Existing E2E SLU models usually follow a two-stage configuration where an Automatic Speech Recognition (ASR) network first predicts a transcript which is then passed to a Natural Language Understanding (NLU) module through an interface to infer semantic labels, such as intent and slot tags. This design, however, does not consider the NLU posterior while making transcript predictions, nor does it immediately correct NLU prediction errors by considering the previously predicted word-pieces. In addition, the NLU model in the two-stage system is not streamable, as it must wait for the audio segments to complete processing, which ultimately impacts the latency of the SLU system. In this work, we propose a streamable multi-task semantic transducer model to address these considerations. Our proposed architecture predicts ASR and NLU labels auto-regressively and uses a semantic decoder to ingest both previously predicted word-pieces and slot tags while aggregating them through a fusion network. Using an industry-scale SLU dataset and the public FSC dataset, we show the proposed model outperforms the two-stage E2E SLU model for both ASR and NLU metrics.
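A sketch of the fusion step, assuming the semantic decoder consumes a joint embedding of the last word-piece and slot tag; layer names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    def __init__(self, wp_vocab, slot_vocab, dim):
        super().__init__()
        self.wp_embed = nn.Embedding(wp_vocab, dim)
        self.slot_embed = nn.Embedding(slot_vocab, dim)
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

    def forward(self, wp_ids, slot_ids):
        # Previously predicted word-pieces and slot tags are embedded and
        # merged so the shared semantic decoder conditions on both streams.
        x = torch.cat([self.wp_embed(wp_ids), self.slot_embed(slot_ids)], -1)
        return self.fuse(x)
```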
Anastasios Alexandridis*, Grant P. Strimel*, Ariya Rastrow, Pavel Kveton, Jon Webb, Maurizio Omologo, Siegfried Kunzmann, Athanasios Mouchtaris. ICASSP 2022.
We introduce Caching Networks (CachingNets), a speech recognition network architecture capable of delivering faster, more accurate decoding by leveraging common speech patterns. By explicitly incorporating select sentences unique to each user into the network’s design, we show how to train the model as an extension of the popular sequence transducer architecture through a multitask learning procedure. We further propose and experiment with different phrase caching policies, which are effective for virtual voice-assistant (VA) applications, to complement the architecture. Our results demonstrate that by pivoting between different inference strategies on the fly, CachingNets can deliver significant performance improvements. Specifically, on an industrial-scale, VA ASR task, we observe up to 7.4% relative word error rate (WER) and 11% sentence error rate (SER) improvements with accompanying latency gains.
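As a toy illustration of one possible caching policy (the paper experiments with several; this frequency-based variant is our illustrative choice):

```python
from collections import Counter

class PhraseCachePolicy:
    """Keep each user's k most frequent utterances; the decoder can then
    pivot to the cached pathway whenever an incoming request matches."""

    def __init__(self, k=50):
        self.k, self.counts = k, Counter()

    def observe(self, utterance):
        self.counts[utterance] += 1   # update after each recognized request

    def cache(self):
        return {u for u, _ in self.counts.most_common(self.k)}
```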
Anastasios Alexandridis, Kanthashree Mysore Sathyendra, Grant P. Strimel, Pavel Kveton, Jon Webb, Athanasios Mouchtaris. ICASSP 2022.
On-device spoken language understanding (SLU) offers the potential for significant latency savings compared to cloud-based processing, as the audio stream does not need to be transmitted to a server. We present Tiny Signal-to-Interpretation (TinyS2I), an end-to-end on-device SLU approach focused on heavily resource-constrained devices. TinyS2I brings latency reduction without accuracy degradation, by exploiting use cases where the distribution of utterances that users speak to a device is largely heavy-tailed. The model is tailored to process on-device frequent utterances with support for dynamic contextual content, while deferring all other requests to the cloud. Compared to a powerful baseline, we demonstrate that TinyS2I achieves comparable performance, while offering latency gains due to local processing.
Kai Wei, Dillon Knox, Martin Radfar, Thanh Tran, Markus Mueller, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, Maurizio Omologo. ICASSP 2022.
Dialogue act classification (DAC) is a critical task for spoken language understanding in dialogue systems. Prosodic features such as energy and pitch have been shown to be useful for DAC. Despite their importance, little research has explored neural approaches to integrate prosodic features into end-to-end (E2E) DAC models which infer dialogue acts directly from audio signals. In this work, we propose an E2E neural architecture that takes into account the need for characterizing prosodic phenomena co-occurring at different levels inside an utterance. A novel part of this architecture is a learnable gating mechanism that assesses the importance of prosodic features and selectively retains core information necessary for E2E DAC. Our proposed model improves DAC accuracy by 1.07% absolute across three publicly available benchmark datasets.
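A minimal sketch of such a gate, assuming frame-aligned acoustic and prosodic feature streams; dimensions and names are illustrative:

```python
import torch
import torch.nn as nn

class ProsodicGate(nn.Module):
    def __init__(self, acoustic_dim, prosody_dim):
        super().__init__()
        self.proj = nn.Linear(prosody_dim, acoustic_dim)
        self.gate = nn.Linear(acoustic_dim + prosody_dim, acoustic_dim)

    def forward(self, acoustic, prosody):
        # The sigmoid gate learns how much prosodic evidence (e.g., energy,
        # pitch) to retain at each frame before mixing it into the
        # acoustic stream.
        g = torch.sigmoid(self.gate(torch.cat([acoustic, prosody], -1)))
        return acoustic + g * self.proj(prosody)
```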
Thanh Tran*, Kai Wei*, Weitong Ruan, Ross McGowan, Nathan Susanj, Grant P. Strimel. IAAI 2022.
Recent years have seen significant advances in multi-turn Spoken Language Understanding (SLU), where dialogue contexts are used to guide intent classification and slot filling. However, how to selectively incorporate dialogue contexts, such as previous utterances and dialogue acts, into multi-turn SLU still remains a substantial challenge. In this work, we propose a novel contextual SLU model for multi-turn intent classification and slot filling tasks. We introduce an adaptive global-local context fusion mechanism to selectively integrate dialogue contexts into our model. The local context fusion aligns each dialogue context using multi-head attention, while the global context fusion measures overall context contribution to intent classification and slot filling tasks. Experiments show that on two benchmark datasets, our model achieves absolute F1 score improvements of 2.73% and 2.57% for the slot filling task on Sim-R and Sim-M datasets, respectively. Additional experiments on a large-scale, de-identified, in-house dataset further verify the measurable accuracy gains of our proposed model.
Kai Wei*, Thanh Tran*, Feng-Ju Chang, Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Jing Liu, Anirudh Raju, Ross McGowan, Nathan Susanj, Ariya Rastrow, Grant P. Strimel. ASRU 2022.
Recent years have seen significant advances in end-to-end (E2E) spoken language understanding (SLU) systems, which directly predict intents and slots from spoken audio. While dialogue history has been exploited to improve conventional text-based natural language understanding systems, current E2E SLU approaches have not yet incorporated such critical contextual signals in multi-turn and task-oriented dialogues. In this work, we propose a contextual E2E SLU model architecture that uses a multi-head attention mechanism over encoded previous utterances and dialogue acts (actions taken by the voice assistant) of a multi-turn dialogue. We detail alternative methods to integrate these contexts into the state-of-the-art recurrent and transformer-based models. When applied to a large de-identified dataset of utterances collected by a voice assistant, our method reduces average word and semantic error rates by 10.8% and 12.6%, respectively. We also present results on a publicly available dataset and show that our method significantly improves performance over a non-contextual baseline.
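A sketch of the context-attention component, assuming previous utterances arrive pre-encoded and dialogue acts are embedded; shapes and names are illustrative:

```python
import torch
import torch.nn as nn

class DialogueContextAttention(nn.Module):
    def __init__(self, dim, n_acts, heads=4):
        super().__init__()
        self.act_embed = nn.Embedding(n_acts, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, utt_enc, prev_enc, act_ids):
        # utt_enc: (batch, T, dim) current utterance; prev_enc: (batch, S,
        # dim) encoded previous utterances; act_ids: (batch, A) dialogue
        # acts taken by the assistant.
        ctx = torch.cat([prev_enc, self.act_embed(act_ids)], dim=1)
        fused, _ = self.attn(utt_enc, ctx, ctx)
        return utt_enc + fused   # residual fusion into the SLU encoder
```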
Jonathan Macoskey*, Grant P. Strimel*, Jinru Su, Ariya Rastrow. INTERSPEECH 2021.
We introduce Amortized Neural Networks (AmNets), a compute cost- and latency-aware network architecture particularly well-suited for sequence modeling tasks. We apply AmNets to the Recurrent Neural Network Transducer (RNN-T) to reduce compute cost and latency for an automatic speech recognition (ASR) task. The AmNets RNN-T architecture enables the network to dynamically switch between encoder branches on a frame-by-frame basis. Branches are constructed with variable levels of compute cost and model capacity. Here, we achieve variable compute through two well-known techniques: one using sparse pruning and the other using matrix factorization. Frame-by-frame switching is determined by an arbitrator network that requires negligible compute overhead. We present results using both architectures on LibriSpeech data and show that our proposed architecture can reduce inference cost by up to 45% and latency to nearly real-time without incurring a loss in accuracy.
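In sketch form, the two branch styles named above look like the following (sizes illustrative); the arbitrator routes each frame to one or the other:

```python
import torch.nn as nn

def dense_branch(dim):
    # Full-capacity branch: ~dim**2 multiply-adds per frame.
    return nn.Linear(dim, dim)

def factorized_branch(dim, rank=64):
    # Matrix-factorized branch: W ~= U @ V cuts the cost from dim**2 to
    # 2 * rank * dim multiply-adds per frame.
    return nn.Sequential(nn.Linear(dim, rank, bias=False),
                         nn.Linear(rank, dim))
```

For dim = 512 and rank = 64, the factorized branch needs roughly a quarter of the dense branch's multiply-adds, which is the kind of headroom the arbitrator exploits frame by frame.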
Jonathan Macoskey*, Grant P. Strimel*, Ariya Rastrow. INTERSPEECH 2021.
As more speech processing applications execute locally on edge devices, a set of resource constraints must be considered. In this work we address one of these constraints, namely over-the-network data budgets for transferring models from server to device. We present neural update approaches for release of subsequent speech model generations abiding by a data budget. We detail two architecture-agnostic methods which learn compact representations for transmission to devices. We experimentally validate our techniques on two tasks (automatic speech recognition and spoken language understanding) using open-source data sets, demonstrating that, when applied in succession, our budgeted updates outperform comparable model compression baselines by significant margins.
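As a hypothetical illustration of shipping a compact representation rather than a full model (this is one way to meet a transmission budget, not necessarily the paper's method):

```python
import numpy as np

def lowrank_delta(w_old, w_new, rank):
    # Transmit only a rank-r approximation of the weight change; the
    # device reconstructs the update as w_old + U @ V.
    u, s, vt = np.linalg.svd(w_new - w_old, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]
```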
Rupak Vignesh Swaminathan, Brian King, Grant P. Strimel, Jasha Droppo, Athanasios Mouchtaris. INTERSPEECH 2021.
We propose a simple yet effective method to compress an RNN-Transducer (RNN-T) through the well-known knowledge distillation paradigm. We show that the transducer’s encoder outputs naturally have a high entropy and contain rich information about acoustically similar word-piece confusions. This rich information is suppressed when combined with the lower entropy decoder outputs to produce the joint network logits. Consequently, we introduce an auxiliary loss to distill the encoder logits from a teacher transducer’s encoder, and explore training strategies where this encoder distillation works effectively. We find that tandem training of teacher and student encoders with an in-place encoder distillation outperforms the use of a pre-trained and static teacher transducer. We also report an interesting phenomenon we refer to as implicit distillation, that occurs when the teacher and student encoders share the same decoder. Our experiments show 5.37-8.4% relative word error rate reductions (WERR) on in-house test sets, and 5.05-6.18% relative WERRs on LibriSpeech test sets.
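The abstract does not give the exact loss form, so here is a hedged sketch of what such an auxiliary term could look like, treating the encoder outputs as logits softened by an illustrative temperature T:

```python
import torch.nn.functional as F

def encoder_distillation_loss(student_enc, teacher_enc, T=1.0):
    # KL divergence between teacher and student encoder distributions,
    # averaged over the batch; scaling by T*T keeps gradient magnitudes
    # comparable across temperatures (standard distillation practice).
    p_teacher = F.softmax(teacher_enc / T, dim=-1)
    log_p_student = F.log_softmax(student_enc / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```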
Ross McGowan, Jinru Su, Vince DiCocco, Thejaswi Muniyappa, Grant P. Strimel. INTERSPEECH 2021.
Jonathan Macoskey*, Grant P. Strimel*, Ariya Rastrow. ICASSP 2021.
We present Bifocal RNN-T, a new variant of the Recurrent Neural Network Transducer (RNN-T) architecture designed for improved inference-time latency on speech recognition tasks. The architecture enables a dynamic pivot for its runtime compute pathway, namely taking advantage of keyword spotting to select which component of the network to execute for a given audio frame. To accomplish this, we leverage a recurrent cell we call the Bifocal LSTM (BFLSTM), which we detail in the paper. The architecture is compatible with other optimization strategies such as quantization, sparsification, and applying time-reduction layers, making it especially applicable for deployed, real-time speech recognition settings. We present the architecture and report comparative experimental results on voice-assistant speech recognition tasks. Specifically, we show our proposed Bifocal RNN-T can reduce inference cost by 29.1% with matching word error rates and only a minor increase in memory size.
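The sketch below conveys the pivot (not the BFLSTM cell itself, whose state handling the paper details): a keyword-spotting flag decides per frame whether the full or the reduced pathway runs; sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class BifocalEncoder(nn.Module):
    def __init__(self, feat_dim, dim, small_rank=64):
        super().__init__()
        self.big = nn.LSTMCell(feat_dim, dim)          # post-keyword focus
        self.small_in = nn.Linear(feat_dim, small_rank)
        self.small = nn.LSTMCell(small_rank, dim)      # pre-keyword focus

    def forward(self, frames, keyword_seen):
        # frames: (time, batch, feat_dim); keyword_seen: per-frame bools
        # from the keyword spotter. Both cells share the hidden size, so
        # state carries across the pivot.
        h = c = frames.new_zeros(frames.shape[1], self.big.hidden_size)
        outs = []
        for x, full in zip(frames, keyword_seen):
            h, c = self.big(x, (h, c)) if full else \
                   self.small(self.small_in(x), (h, c))
            outs.append(h)
        return torch.stack(outs)
```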
Grant P. Strimel, Ariya Rastrow, Gautam Tiwari, Adrien Pierard, Jon Webb. INTERSPEECH 2020.
We introduce DashHashLM, an efficient data structure that stores an n-gram language model compactly while making minimal trade-offs on runtime lookup latency. The data structure implements a finite state transducer with a lossless structural compression and outperforms comparable implementations when considering lookup speed in the small-footprint setting. DashHashLM introduces several optimizations to language model compression which are designed to minimize expected memory accesses. We also present variations of DashHashLM appropriate for scenarios with different memory and latency constraints. We detail the algorithm and justify our design choices with comparative experiments on a speech recognition task. Specifically, we show that with roughly a 10% increase in memory size, compared to a highly optimized, compressed baseline n-gram representation, our proposed data structure can achieve up to a 6x query speedup.
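As a toy illustration of the core lookup pattern (not DashHashLM itself, which adds lossless structural compression and a memory-access-aware layout):

```python
def ngram_logprob(table, backoff, ngram):
    # table: {hash(ngram tuple): log-prob}; backoff: {hash(context): weight}.
    # Try the full n-gram, then progressively shorter suffixes, summing
    # backoff weights for every context that had to be abandoned.
    for start in range(len(ngram)):
        key = hash(ngram[start:])
        if key in table:
            penalty = sum(backoff.get(hash(ngram[j:-1]), 0.0)
                          for j in range(start))
            return table[key] + penalty
    return float("-inf")
```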
Joseph McKenna*, Samridhi Choudhary*, Michael Saxon*, Grant P. Strimel, Athanasios Mouchtaris. INTERSPEECH 2020.
End-to-end spoken language understanding (SLU) models are a class of model architectures that predict semantics directly from speech. Because of their input and output types, we refer to them as speech-to-interpretation (STI) models. Previous works have successfully applied STI models to targeted use cases, such as recognizing home automation commands, however no study has yet addressed how these models generalize to broader use cases. In this work, we analyze the relationship between the performance of STI models and the difficulty of the use case to which they are applied. We introduce empirical measures of data set semantic complexity to quantify the difficulty of the SLU tasks. We show that near-perfect performance metrics for STI models reported in the literature were obtained with data sets that have low semantic complexity values. We perform experiments where we vary the semantic complexity of a large, proprietary data set and show that STI model performance correlates with our semantic complexity measures, such that performance increases as complexity values decrease. Our results show that it is important to contextualize an STI model’s performance with the complexity values of its training data set to reveal the scope of its applicability.
Grant P. Strimel, Kanthashree Mysore Sathyendra, Stanislav Peshterliev. INTERSPEECH 2018.
In this paper we investigate statistical model compression applied to natural language understanding (NLU) models. Small-footprint NLU models are important for enabling offline systems on hardware restricted devices, and for decreasing on-demand model loading latency in cloud-based systems. To compress NLU models, we present two main techniques, parameter quantization and perfect feature hashing. These techniques are complementary to existing model pruning strategies such as L1 regularization. We performed experiments on a large scale NLU system. The results show that our approach achieves a 14-fold reduction in memory usage compared to the original models with minimal predictive performance impact.
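A minimal sketch of the quantization half of the approach (bit width and layout are illustrative); perfect feature hashing complements it by replacing the stored feature-string table with a collision-free hash, so only parameter values need be kept:

```python
import numpy as np

def quantize(weights, bits=8):
    # Snap float parameters to 2**bits evenly spaced levels (assumes the
    # weights are not all identical); store one uint8 code per weight
    # plus two floats for reconstruction.
    lo, hi = float(weights.min()), float(weights.max())
    step = (hi - lo) / (2 ** bits - 1)
    codes = np.round((weights - lo) / step).astype(np.uint8)
    return codes, lo, step        # reconstruct as lo + codes * step
```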
Grant P. Strimel and Manuela M. Veloso. IROS 2014.
The robot coverage problem, a common planning problem, consists of finding a motion path for the robot that passes over all points in a given area or space. In many robotic applications involving coverage, e.g., industrial cleaning, mine sweeping, and agricultural operations, the desired coverage area is large and of arbitrary layout. In this work, we address the real problem of planning for coverage when the robot has limited battery or fuel, which restricts the length of travel of the robot before needing to be serviced. We introduce a new sweeping planning algorithm, which builds upon the boustrophedon cellular decomposition coverage algorithm to include a fixed fuel or battery capacity of the robot. We prove the algorithm is complete and show illustrative examples of the planned coverage outcome in a real building floor map.
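A toy sketch of energy-constrained sweeping on an open grid (the paper's algorithm operates on boustrophedon cellular decompositions of arbitrary layouts and is proven complete; this illustrates only the refuel-detour idea, assuming unit-cost moves and a battery large enough to reach any cell and back):

```python
def sweep(rows, cols, battery, charger=(0, 0)):
    path, charge = [charger], battery
    for r in range(rows):
        cols_order = range(cols) if r % 2 == 0 else reversed(range(cols))
        for c in cols_order:                 # boustrophedon (ox-plow) order
            ret = abs(r - charger[0]) + abs(c - charger[1])
            if charge <= ret:                # reserve just enough to get home
                path.append(charger)         # detour: return and recharge
                charge = battery - ret       # spend the trip back out
            path.append((r, c))
            charge -= 1                      # unit cost per cell visited
    return path
```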
Thomas Kollar, Jayant Krishnamurthy, Grant P. Strimel. RSS 2013.
This paper addresses the problem of enabling robots to interactively learn visual and spatial models from multi-modal interactions involving speech, gesture and images. Our approach, called Logical Semantics with Perception (LSP), provides a natural and intuitive interface by significantly reducing the amount of supervision that a human is required to provide. This paper demonstrates LSP in an interactive setting. Given speech and gesture input, LSP is able to learn object and relation classifiers for objects like mugs and relations like left and right. We extend LSP to generate complex natural language descriptions of selected objects using adjectives, nouns and relations, such as “the orange mug to the right of the green book.” Furthermore, we extend LSP to incorporate determiners (e.g., “the”) into its training procedure, enabling the model to generate acceptable relational language 20% more often than the unaugmented model.
Supervisors: Herbert A. Simon University Professor Manuela M. Veloso (chair) & Professor Avrim Blum.
This thesis presented methods for robotic mapping of large structured spaces and coverage planning under finite energy/fuel constraints. The research addressed both the theoretical and practical questions involved, and the techniques were evaluated on real service robots.