Interpreting voice commands in radar systems
Recognizing speech of multiple languages with limited resources in an offline environment

Master's thesis in Computer Science

Johan Andersson
Oskar Dahlberg

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2017

© Johan Andersson, 2017.
© Oskar Dahlberg, 2017.

Supervisor: Aarne Ranta, Chalmers, Department of Computer Science and Engineering
Supervisor: Sven Nilsson, SAAB AB
Examiner: Andreas Abel, Chalmers, Department of Computer Science and Engineering

Master's Thesis 2017
Department of Computer Science and Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2017

Automatic Speech Recognition in Radar Systems
Recognizing speech with limited resources in an offline environment

Abstract

This master's thesis was carried out in collaboration with SAAB and outlines an architecture for a voice command system with a focus on supporting multiple languages, making it easy to introduce new domain knowledge, using resources efficiently, and running in an offline environment. The solution, the decisions made along the way, and their advantages and drawbacks are explained in this report. As a basis for the project we have worked with CMU Sphinx and Grammatical Framework. Furthermore, the US-English community acoustic model available from CMU Sphinx has been adapted to better understand English spoken with a Swedish accent, with the help of a tool provided by CMU Sphinx. The results of the adaptation, as well as recommendations for future development, are also included in this thesis.

Keywords: Speech Recognition, Sphinx, Grammatical Framework, Hidden Markov model, Syntax, Semantics.

Acknowledgements

We would like to thank our company partner SAAB AB for giving us the opportunity to work with such an interesting domain and problem in a meaningful environment. Furthermore, we would like to thank the following people who have helped us carry out this Master's thesis.

Sven Nilsson, for supervising us at SAAB, providing helpful feedback and guidance during the work, and always keeping up a good spirit.

Aarne Ranta, for providing insights from his vast knowledge in the area and for his fast response time when questions emerged during the project.

Christian Fransson, who helped us understand the domain, provided relevant commands and gave feedback on our design choices.

Martina Dinefjäll, for answering general questions about SAAB and for pointing us to the right resources.

Ulf Dahlgren, who helped us with the integration into SAAB's own systems.

Carolina Fred, for helping us with grammar and word choice for the Spanish lexicon of our prototype.
We would also like to mention the following people who took time out of their daily work in order to gather training data for the project: Liza Axenskär, Lina Björnheden, Jeanette Davidsson, Andreas Eklund, Kent Falk, Christian Fransson, Martin Helmersson, Anders Karlsson, Kenneth Karlsson, Thomas Lang, Jakob Lundberg, Sven Nilsson, Esbjörn Rundberg, Elin Rönnow, Carl Schoster, among others.

Johan Andersson, Gothenburg, June 2017
Oskar Dahlberg, Gothenburg, June 2017

Contents

1 Introduction
  1.1 Aim
  1.2 Delimitations
  1.3 Related work
2 Theory & Method
  2.1 Speech recognition
    2.1.1 Preprocessing
    2.1.2 Recognition problem
    2.1.3 Language modeling
    2.1.4 Acoustic modeling
    2.1.5 Hidden Markov model
    2.1.6 Recognition using HMMs
  2.2 Parsing
    2.2.1 Syntax
    2.2.2 Semantics
    2.2.3 Grammar
    2.2.4 Available solutions
3 System
  3.1 Overview
  3.2 Speech-to-text
    3.2.1 Sphinx
  3.3 Grammatical Framework
    3.3.1 Abstract syntax
    3.3.2 Concrete syntax
    3.3.3 Putting it all together
    3.3.4 Usage of GF in our prototype
  3.4 Interpreter
  3.5 Integration
  3.6 Summary of pipeline
  3.7 Multilinguality
  3.8 Model adaptation
4 Evaluation
  4.1 Criteria
  4.2 Experimental setup
    4.2.1 Training step
    4.2.2 Evaluation step
    4.2.3 Software used
    4.2.4 Hardware
  4.3 Results
    4.3.1 Implementing a new language
    4.3.2 Runtime resources
    4.3.3 Experimental results
    4.3.4 Usability
5 Discussion
  5.1 Discussion
    5.1.1 Grammatical Framework
    5.1.2 Experiment discussion
    5.1.3 Precision
  5.2 Critical system safety
  5.3 Future work

1 Introduction

The introduction of speech control in everyday devices has enabled systems to be interacted with in a new way, and has altered the expectations on systems in professional settings as well. Many speech recognition solutions, such as Google Cloud Speech API [1] and Alexa Voice Service [2], require an online connection in order to interpret speech in a time-efficient way. An online solution is not always practical, for example due to a poor internet connection or the security risk involved in connecting a device to the internet. In such cases an offline solution is required. Both of these reasons apply to the system we are targeting with our project, and thus our solution must work in an offline environment.

Another requirement, posed by our company partner SAAB (in the following also referred to as “the company”), is that the solution should be easy to adapt to spoken languages not previously incorporated. For example, if it can handle English, adding Swedish should be easy from a developer's perspective, while still yielding good results. This adaptability and the need to run in an offline environment each pose interesting technical challenges on their own.

1.1 Aim

The aim is to develop and evaluate a prototype application that can transform sounds into commands. The prototype is limited to two spoken languages, must run in an offline environment, and must produce results in real time. The results of the prototype must be evaluated to measure its performance. Furthermore, it should not be overly cumbersome in terms of development time to introduce new domain knowledge or new spoken languages into the application. The parameters we have used for evaluation of the prototype are:

• Multilinguality, in our case how much work is required for adding a new language.
• Sentence/command error rate.
• Training/instructions required for the user to operate the system.
• Resource requirements in terms of computational resources.

The end prototype should be in a state where further development could incorporate it into a fully-fledged assistant application able to aid with domain-specific tasks.

1.2 Delimitations

To prevent the scope of the thesis from becoming too large, it is limited to the development of a prototype that works well with a limited feature set, rather than a larger system with poor performance. For instance, this limited feature set does not cover background noise reduction or speaker identification.

Time is also a limiting factor, which means that third-party libraries have to be used in order to complete the prototype within the timeframe. It is not feasible to create a new speech-to-text system and at the same time gather data in order to train new models. Rather, we use existing models in order to save time and to be able to focus on other parts of the system.

The prototype implementation covers only two spoken languages.
This is due to the limited timeframe in which the prototype has to be developed, but also to the number of languages spoken by the authors. Since the authors are fluent in only two languages, Swedish and English, English is the main focus for the prototype, with a second language added as a proof of concept. The focus is on making it simple for the system to include a new spoken language rather than on implementing as many languages as possible.

1.3 Related work

A similar project was carried out by a Ph.D. student in 2005, in which a multilingual device was developed to help individuals navigate Gothenburg [3]. It could interpret several languages, including English, Swedish, and Spanish. The project consisted of a map representing the tram network; it was used either by saying the names of the stations out loud or by selecting the stations on the map. It also gave responses in the form of speech. This approach also used Grammatical Framework (see section 3.3) as a tool to help with multiple languages, but had other solutions for the remainder of the system. The main difference from our project is the domain and the requirements that come with it. While the project from 2005 worked as a tool for pedestrians taking the tram, our project will work as a tool for a professional operator to use as an addition to an existing control system. There are higher demands on response time and precision. The previous system was limited to very few commands of the same structure, while our project has a broader focus with considerably more command variations.

There has also been a project constructing an automatic speech recognizer for the International Space Station (2005). The authors implemented both a statistics-based model (see section 2.1.3 on language modeling) and a grammar-based model (see section 2.2.3 on grammar). It is a command-based system which uses phrases like “set alarm for five minutes from now” and “please speak up” [4]. This system only understands English, but it extracts the semantics in a way similar to Grammatical Framework. The authors also conclude that the word error rate is lower in a system using a grammar-based approach than in one using a statistics-based approach. The main difference to our project in this case is that of multilinguality, understanding more than one language.

2 Theory & Method

In order to have computers recognize and understand human speech, it is beneficial if the input is converted into a format that the computer can easily process, for example a text string or some other data structure. This chapter briefly explains the theory and available methods for doing natural language processing from speech, to help the reader follow the subsequent chapters.

2.1 Speech recognition

In Figure 2.1 we can see an overview of a speech recognizer pipeline taking an audio signal as input and outputting a transcription. In the following sections we explain the theory behind this process and what different parts are necessary to perform recognition.

Figure 2.1: Overview of recognizer process. Adapted from P. Wiggers and L.J.M. Rothkrantz [5]

2.1.1 Preprocessing

Speech is a continuous signal with infinite possible variation depending on many factors, such as speaker mood, environment, and background noise.
Theory & Method However, not all information present in this variation is needed to perform speech recognition; it is therefore advantageous to reduce the input space to simplify the complexity of the model used to perform recognition. The first step of speech recog- nition is thus a preprocessing step to reduce the input and highlight important features. During the preprocessing step the continuous input is split into a discrete set of observations. In our case these are feature vectors produced by a variant of Mel- frequency cepstral analysis [5]. The process of Mel-frequency cepstral analysis takes audio signal as input and performs a series of transforms and filtering operations and gives a vector as output. First the input audio is split in pieces of, for example, 25ms of snippets sampled every 10ms, giving overlapping snippets. On these snippets Mel- frequency cepstral analysis is performed resulting in speech vectors. Humans are much better at determining changes in pitch in lower frequencies than higher [6]. Mel-frequency cepstral analysis utilizes this knowledge by filtering away redundant parts humans would not perceive. The resulting vector captures the details of human speech while removing noise and other needless parts of the input data. This results in a compact format keeping the important bits. 2.1.2 Recognition problem After the preprocessing step the input has been divided into a sequence of discrete values. Those values, or observations are then used to determine one or many hypotheses on what was said. The recognition problem can then be stated as follows: We have a sequence of observations O = o1o2...on, given these observations and a set of all possible sequences of words spoken, S, we want to determine the most likely sequence of words W . This can be done by solving the equation 2.1, where w is an element of S. W = argmax w P (w|O) (2.1) This problem however is not easily solvable since possible observation sequences, even with the preprocessing step, are very large and S is infinite. Bayes’ theorem allows us to reformulate this equation to the equation 2.2. This leads us to the two main components of a recognizer, the language model P (w) (see section 2.1.3), and the acoustic model P (O|w) (see section 2.1.4).The term P (O) can be ignored since it is simply a scaling factor. If we can solve this new problem, then we can output a hypothesis for a given observation. To output a set of hypotheses is only a matter of changing the argmax to not return a single instance but a list of instances ordered by likelihood where the first item is most likely. P (w|O) = P (w)P (O|w) P (O) (2.2) 2.1.3 Language modeling A language model is a probability distribution over a sequence of words, P (w). Given a sequence of words, a language model can answer how probable that sequence is. 6 2. Theory & Method When it comes to speech recognition, language models can be used in order to determine which word was said if there are many alternatives from the phonetic analysis. For example, for the phase “I will dye my hair” the language model will assign a higher probability for the word “DYE” than the word “DIE” which have very similar pronunciations. Given the amount of variations to write a particular text and the breath of vocabulary in many languages, the probability of observing a particular sequence decreases as the sequence length increases. For example, this particular paragraph is probably unique in its particular choice of word and order. 
Since the probability of longer sequences quickly approaches zero, it is common to limit the probability of any given word to depend only on the n previous words. This is known as an n-gram model [7]. Commonly n is between 1 and 3 to reduce model size, since the amount of data needed to train the model grows very fast as n increases.

2.1.4 Acoustic modeling

An acoustic model answers the question of which observations O are produced when a particular sequence of words w is uttered. Given the nature of speech and the unbounded length of such a sequence, solving this problem requires a statistical model in which answers can be computed on the fly [5]. In other words, an acoustic model can calculate the probability P(O|w) for any O and w. Most acoustic models are instances of Hidden Markov models (HMMs) [8], although it is possible to use other models such as artificial neural networks.

2.1.5 Hidden Markov model

A Hidden Markov model is a variant of the Markov model [9] in which the observation sequence and state sequence are decoupled [5]. Every state has a probability distribution which determines how likely a particular observation is in that state. Given a particular observation it is not possible to determine which state generated it; the state sequence that generated a particular observation sequence cannot be read off from the observations, it is hidden. A Hidden Markov model can be defined as seen in List 2.1. A model with all parameters defined can thus be written in short form as λ = (A, B, π), where A, B, and π have some appropriate values.

Additionally, when it comes to speech recognition and the usage of HMMs there are some things to consider. Speech is modeled as a set of observations ordered in a sequence. To capture this order when using HMMs to recognize speech, the HMMs used are of the left-right type: the states are arranged as a chain and the probability of transitioning backwards in the chain is set to 0. An example of a left-right HMM can be seen in Figure 2.2.

How are HMMs then used to recognize speech? Assume that we have a set of models, in the form of HMMs, already trained to recognize different words. For a particular observation sequence O we want to determine which model is most likely to have generated that sequence, that is, we want to determine argmax_λ P(O|λ). Once this is done we can see which model λ was returned and look up which word that particular model corresponds to.

• Number of distinct observation symbols M.
• An output alphabet V = {v_1, v_2, ..., v_M}.
• Number of states N.
• State space Q = {1, 2, ..., N}.
• Probability distribution of transitions between states A = {a_ij}, where a_ij = P(q_{t+1} = j | q_t = i), 1 ≤ i, j ≤ N.
• Observation symbol probability distribution B = {b_j(k)}, in which b_j(k) = P(o_t = v_k | q_t = j), 1 ≤ k ≤ M, 1 ≤ j ≤ N.
• Initial state distribution π = {π_i}, where π_i = P(q_1 = i), 1 ≤ i ≤ N.

List 2.1: Definition of a Hidden Markov model [5]

To solve this problem, the calculation in equation 2.3, where q is a state sequence, can be evaluated. However, this calculation quickly becomes infeasible to do in real time, as the complexity grows with the number of states and exponentially with the length of the observation sequence, resulting in a complexity of O(T N^T), where T is the length of the observation sequence. For a more detailed explanation of this equation and its complexity, see Wiggers and Rothkrantz [5].
P(O \mid \lambda) = \sum_{q} P(q \mid \pi) \prod_{t=1}^{T} P(o_t \mid q_t, \lambda)    (2.3)

There exists a method to perform this computation in a feasible way, though: the Forward algorithm [9]. The Forward algorithm uses a dynamic programming approach to calculate the values for all states one time step at a time. Its efficiency comes from the fact that many sequences share common subsequences, and performing the calculations in this manner avoids recalculating those subsequences along the way. The complexity of the Forward algorithm is O(N^2 T), which is considerably better than before and can be used in practice.

Given HMMs for different words it is possible to recognize them. But how can such models be created? We have the shorthand for an HMM definition, λ = (A, B, π). The question is now how to estimate A, B and π to create a model that reliably recognizes the word we want. The parameters we are interested in are the transition probabilities between states, how likely each observation is in each state, and the initial distribution of states. This cannot be done without knowing what we want to recognize; thus we require a set of observation data D in order to estimate the parameters. Given that we have data covering what we want to recognize, we want to create a model as in equation 2.4.

\lambda_{trained} = \arg\max_{\lambda} P(D \mid \lambda)    (2.4)

This is not analytically solvable, but there exists an algorithm that solves it iteratively: the Baum-Welch algorithm, also known as the Forward-Backward algorithm [10]. The algorithm is a version of the Expectation Maximization (EM) algorithm adapted to the case of HMMs. The idea is to start with some initial estimate of the parameters, compute expectations over transitions and observation outputs, and then update the estimates of the parameters. This process repeats until convergence. During the process, the estimation step uses the observation data. It is a type of hill-climbing algorithm, which means it does carry the risk of getting stuck in a local maximum.

2.1.6 Recognition using HMMs

When using HMMs for speech recognition there are some things to consider. As previously mentioned, states are usually placed in sequence without any transitions backwards in this sequence. Another design decision is what an HMM should model. Previously, our example for recognition mentioned several HMMs where each of them represented a word on its own. This is not necessary; an HMM in speech recognition can, for example, model a phrase, a word or a sub-word unit. These different kinds come with advantages and disadvantages. While an approach where HMMs represent words can be very useful, as it can capture coarticulation effects, it requires that all words we want to recognize are part of the training data, preferably with many samples [5]. To reduce the required amount of training data it is common to use sub-word units. Different words can contain the same sub-word unit, which means that the HMM representing a specific sub-word unit can receive training data from different words. Training data can be shared, and it is possible to capture words that are not present in the training data if all their sub-word parts are present. For example, consider the words “select” and “celebrity”, which share some parts of their pronunciations; written in a type of phonetic notation they become “/S/ /AH/ /L/ /EH/ /K/ /T/” and “/S/ /AH/ /L/ /EH/ /B/ /R/ /IH/ /T/ /IY/” [11] respectively, with each sub-word unit marked out.
If we train a model on both of these words using sub-word-unit HMMs, the resulting model could be able to recognize the word “celery”, which corresponds to “/S/ /EH/ /L/ /ER/ /IY/”, since most of its sub-word units occur in the other two words, provided that the training data also contains some word with the sub-word unit “/ER/”, which is missing from the other two words.

A typical choice of sub-word unit is the phone [12]. A phone is a unit of sound. A pair of consecutive phones is called a diphone. Along the same line, three consecutive phones are called a triphone. Diphones and triphones are used to capture the properties of the transitions between sounds and their relations. Having a way to model such transitions, as well as the sounds themselves, is necessary to capture the properties of human language. Since different languages use different sounds, or at least a different ordering of sounds, there are different sets of phones, diphones and triphones for each language. Furthermore, a phone is generally considered to have three phases: an on-glide phase, a pure phase and an off-glide phase [13]. The first and last phases consist of the phone overlapping with the preceding and succeeding phones respectively, while the pure phase in between is the pure form of the phone.

How do the sub-word model of phones, the preprocessed signal in terms of feature vectors, and HMMs fit together? Initially, speech vectors were mapped to different phones using K-means clustering, likely a result of limited computing power and memory. Nowadays, feature vectors are used directly in the HMMs.

Figure 2.2: HMM representing the phone /S/

Figure 2.3: HMM representing the word “select”

This means that the observation symbol probability distributions are in the form of multivariate continuous probability density functions, N(·), as seen in equation 2.5, where µ_j is the mean vector and U_j is the covariance matrix.

b_j(o_t) = \mathcal{N}(o_t, \mu_j, U_j)    (2.5)

Given the previous statement that phones have three phases, an HMM representing a phone typically has at least three states, one for each phase. An example of such an HMM can be seen in figure 2.2. It is possible to connect HMMs to each other, creating a new, larger composite HMM. By connecting many HMMs representing different phones we can create an HMM representing a whole word; an example of this can be seen in figure 2.3, where a composite HMM for the word “select” is shown. In a similar way we can connect words together. We mentioned previously that words are usually modeled in a way where a particular word depends on previous words. This can also be viewed as a particular word being followed by a set of words, each with a certain probability. In a way this can be seen as a Markov model where each state is a word and the transitions capture the probability of one word being followed by another. Using this fact we can create a composite HMM consisting of words, where we replace the words by their phone HMMs, to create a model for a whole phrase. An example of such an HMM can be seen in figure 2.4. Note that using such composite HMMs risks creating large models, which can become hard to search. To circumvent this problem it is common to prune away parts of the model along the way. For each time step, sequences which do not meet a certain threshold are terminated and not considered for later time steps. We now have all the building blocks required to transform an audio signal into a transcription of words through the use of HMMs.
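To tie the pieces of this section together, the sketch below shows how isolated words could be recognized with the Forward algorithm. It is not code from the thesis prototype: it assumes discrete observation symbols instead of the Gaussian densities in equation 2.5, uses hypothetical hand-filled word models, and works with plain probabilities rather than the log-probabilities a practical implementation would use to avoid numerical underflow.

import java.util.Map;

/** Minimal discrete-observation HMM, for illustration only. */
class WordHmm {
    final double[] pi;     // initial state distribution, length N
    final double[][] a;    // state transition probabilities, N x N
    final double[][] b;    // observation symbol probabilities, N x M

    WordHmm(double[] pi, double[][] a, double[][] b) {
        this.pi = pi; this.a = a; this.b = b;
    }

    /** Forward algorithm: computes P(O | lambda) in O(N^2 * T) time. */
    double likelihood(int[] o) {
        int n = pi.length;
        double[] alpha = new double[n];
        for (int i = 0; i < n; i++) {          // initialisation, t = 1
            alpha[i] = pi[i] * b[i][o[0]];
        }
        for (int t = 1; t < o.length; t++) {   // induction over the remaining time steps
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    sum += alpha[i] * a[i][j]; // probability of reaching state j at time t
                }
                next[j] = sum * b[j][o[t]];    // times probability of emitting o_t in state j
            }
            alpha = next;
        }
        double total = 0.0;                    // termination: sum over all final states
        for (double v : alpha) {
            total += v;
        }
        return total;
    }

    /** Returns the word whose model most likely generated the observation sequence. */
    static String recognize(Map<String, WordHmm> wordModels, int[] observations) {
        String best = null;
        double bestLikelihood = -1.0;
        for (Map.Entry<String, WordHmm> entry : wordModels.entrySet()) {
            double p = entry.getValue().likelihood(observations);
            if (p > bestLikelihood) {
                bestLikelihood = p;
                best = entry.getKey();
            }
        }
        return best;                           // argmax over lambda of P(O | lambda)
    }
}

In a real recognizer the word models would of course be trained with Baum-Welch rather than filled in by hand, and composite phone-based models such as those in figures 2.3 and 2.4 would be searched with pruning instead of being evaluated one by one.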
Theory & Method Figure 2.4: HMM representing the phrase “select ready/unready surveillance units” 11 2. Theory & Method 2.2 Parsing Through the use of an acoustic model and language model it is possible to convert raw audio signal into text. While a text string of natural language is an improve- ment it is not easily managed format of input to an application. Problems such as ambiguity remain among other concerns such as superfluous information. The string “Please, pass me the salt, kind sir” contains a lot of unnecessary words that are not needed to represent the command given. Doing some processing of previous mentioned sentence could put it in a more structured format such as: {action: move, source: salt, target: speaker, worker: person}. Performing such an extraction and mapping of information is no trivial task however [14]. In order to solve this information extraction and mapping issue there are many approaches available, it could be possible to use a statistical method or machine learning (ML) method with the use of Artifical Neural Networks (NN) for example. In both those cases there is the requirement of previous data and the end result is always probabilistic in nature in some way. When designing a system such as ours where precision is a key element, this approach is undesirable due to the risk of errors. While NNs may perform well it is hard to reason about why they perform well and in what situations they may fail. A more transparent method and hand engineered solution is desirable to easier reason about and explain system performance and reliability. Our chosen approach is to use a parser. Commonly parsers are built using a grammar-approach [15] where the developer specifies a syntax in the form of a gram- mar that the parser then uses to convert raw text into a parse. The representation of a parse is often in the form of a tree as it can then easily represent the order and relation of different tokens [15]. 2.2.1 Syntax The definition of syntax used in this thesis is the one concerning natural language as given by the Oxford dictionary: ‘The arrangement of words and phrases to create well-formed sentences in a language.” [16] To elaborate, syntax describes how a language is built, in what order do the words come, how can a word with the same meaning be said in different ways and when should it be used, for instance plural or masculine/feminine. Syntax is a way to describe these rules e.g., should the adjective come before the subject as in English, or the other way around as in Spanish? Each language has its own set of rules when words should be used which can lead to difficulties when translating from one language to another [17], as well as when trying to develop a software system able to process human language. For each language that should be processed by the system that particular language syntactic rules must be incorporated in the system for it to function properly. 2.2.2 Semantics While syntax describes how something is written, semantics describes what is being written, the meaning of the text [18]. Understanding semantics is the part of nat- 12 2. Theory & Method ural language understanding that is most difficult for computers. For example, the headline “Criminals get nine months in violin case” [19], is easier for a human to understand that the headline is likely about a court case and not a physical violin container, but a computer without more than dictionary knowledge would not be able to tell. 
2.2.3 Grammar

In terms of parsing and text processing, a grammar is a way to express syntactic rules and capture the legal sentences of a language. See Figure 2.5 for a simple grammar in JSpeech Grammar Format covering phrases like “please open a window thanks” and “could you close the file”, among others. The grammar has a starting point in the rule for a basic command, seen in Figure 2.6, which consists of a polite start, the command, and a polite end. Then follow rules that define a command as an action and an object, and so on.

public <basicCmd> = <startPolite> <command> <endPolite>;
<command> = <action> <object>;
<action> = open | close | delete | move;
<object> = [the | a] (window | file | menu);
<startPolite> = please | kindly | could you;
<endPolite> = [ please | thanks | thank you ];

Figure 2.5: Example of JSpeech Grammar Format [20]

<basicCmd> = <startPolite> <command> <endPolite>;

Figure 2.6: Example of a rule in JSpeech Grammar Format

The benefit of using a grammar approach in our case is twofold. First, there is very little development cost in modifying or expanding an existing grammar. While it may not be trivial in all cases, adding a completely new phrase can be as easy as adding a single line in a text file. There is no need to retrain the system, as there would be in a statistical ML approach, or to make other modifications.

Secondly, it is possible to improve the precision of speech-to-text tools through the use of a grammar [4]. Instead of deciding through probability between possible words in the language model, the recognizer can simply disregard words not present in the grammar; the grammar is the language model. The grammar may still be ambiguous, and the output may be several possible parses with a probability score assigned to each of them. The downside is that the system only recognizes the words specified by the grammar. If the user says something similar, the system cannot determine what the user meant, or it tries to fit the sentence into the given grammar, which can result in false positives. For example, if the grammar contains the word “remove” and the user says the similar word “approve”, which the grammar does not include, the system risks matching the input as “remove” and thus misinterpreting the user.
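As an illustration of how cheap such an extension can be, the single hypothetical change below would let the example grammar in Figure 2.5 also accept phrases like “please rename the file”; nothing needs to be retrained, the grammar file is simply reloaded.

<action> = open | close | delete | move | rename;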
2.2.4 Available solutions

There exists a number of different approaches to parsing. One way is to have a regular language described by a set of regular expressions and have a regular expression tool create a parser for that particular language. Another way, common for programming languages, is to have the syntax specified in a deterministic context-free grammar (DCFG). The reason for using a DCFG is that it is possible to write parsers for a DCFG that parse input in linear time [21]. However, using a DCFG comes with limitations; for instance it does not allow ambiguity, which makes it unsuitable for parsing natural language, where ambiguity often is present. There is no need to limit the model to a DCFG if the strings are short (phrases, compared to large code bases); the gain from linear time does not outweigh the gain from a more refined parsing approach that allows ambiguity.

An alternative to classic deterministic grammar-based parsing is to use statistical approaches with the help of machine learning algorithms. For such approaches, data is first gathered and labelled manually as a basis for training the system. Such systems require large sets of data in order to work efficiently and risk suffering from common problems such as overfitting [22]. An alternative approach, which we have explored, is to use a grammar approach that allows ambiguity and allows separation of semantics from syntax; for this purpose we have used Grammatical Framework, which is further explained in section 3.3.

3 System

There are many ways to approach the challenge of building an Automatic Speech Recognition (ASR) system. A good outline is to start with the goal of the system and adjust the approach thereafter, since many available technologies are better or worse suited depending on the situation. This chapter discusses the different techniques used and which ones have been disregarded.

3.1 Overview

The problem at hand, developing an ASR system that interprets natural language into specific, computer-understandable commands, is split into several modules and sub-problems. As can be seen in Figure 3.1, the prototype is divided into four parts that send information from one step to the next. The first step takes speech and turns it into a format that is easier to analyze; in this case the speech-to-text module produces text for the next module to analyze. The second part is the parser, which transforms the given natural language text into a format that is easier to extract information from, a parse tree. The third part takes a parse tree as input, determines whether this particular input is valid in the current context, helps disambiguate if several interpretations are possible, and outputs valid interpretations. The fourth and last part, the application layer, then receives an interpreted command and executes the actions corresponding to that command.

Since there are multiple steps in the process, it is easy to follow the transformation and see where an error occurred if something went wrong. Furthermore, it is easy to replace any part of the pipeline if a better alternative is found. However, there is a risk of reducing performance by keeping the different parts separated. It is possible that more communication and a closer connection between the parts could help performance; perhaps a step used later in the pipeline could be involved earlier for a better result? We deemed the advantages to be limited at best and considered the architecture described in Figure 3.1 to be better suited for our problem.

Figure 3.1: Overview of prototype pipeline

Such a modular approach does not necessarily prevent different parts from providing information to the others; for example, the parser module's model is also used by the speech-to-text module, in a different format. Furthermore, it would be possible to have a probabilistic approach where each step had several outputs paired with confidence scores. For example, the speech-to-text module could output several transcriptions with the same confidence, while later steps could refine this by adding to or subtracting from these scores. Unfortunately, our speech-to-text software tool only outputs one hypothesis when using grammars, which prevented us from taking a probabilistic approach.

3.2 Speech-to-text

The first part of our pipeline is the speech-to-text module; it is responsible for converting a raw audio signal into text. In this section we explain our prototype setup and the techniques and libraries chosen. For more details on the general theory of how this process works, see chapter 2.

There is a limited number of tools to choose from when exploring ASR with the requirement of running in an offline environment in a multiplatform setting.
After some research we filtered out three possible alternatives: the HTK Speech Recognition Toolkit (HTK) [23], Kaldi ASR (Kaldi) [24], and CMUSphinx (Sphinx) [25]. They can all run offline and claim to be more or less resource efficient.

HTK and Kaldi were later disregarded, for different reasons. HTK has a license that does not allow commercial use or redistribution. Furthermore, HTK is implemented in C, which is not compatible with the company partner's requirements, since C is deemed an unsafe language. Kaldi is a toolkit developed as a research tool, mainly for experimenting with the latest technologies rather than for being easy to use in applications. It is implemented in C++, which the authors have limited experience with, and which also does not meet the company partner's requirements. Kaldi has indirect support for grammars, but it would require some manual transformation between formats, from a grammar into a finite state transducer [26].

Sphinx is a speech recognition toolkit focused on practical application development and not on research, designed specifically for low-resource platforms [27]. It seems well established within the software community, it has good documentation, and getting hold of the developers to ask questions has been easy. It is written in Java, which is one of the main languages used by our company partner; this means that any future development and integration of our prototype should be easy. It supports grammars in the JSpeech Grammar Format in the recognizer out of the box, which helps with increasing precision. However, while using grammars it does not, to our knowledge, expose any support for getting confidence scores in a meaningful way. Sphinx also seems to report half-recognized constructs from the grammar rather than reporting an error if the user does not complete a sentence. In the end Sphinx was chosen, as it fulfilled our requirements and best fit our purpose compared to HTK and Kaldi.

3.2.1 Sphinx

For our prototype we use Sphinx4, the Java version of Sphinx's tool stack. We have used the community-created acoustic models for US-English and Spanish for our prototype, as well as the dictionaries provided with them. Alongside the Sphinx community models we use our own grammar exported from Grammatical Framework. All this allows us to run Sphinx from within our Java application, using these models and transforming the audio signal from a recording device connected to the computer into Java strings. The recognizer runs on its own thread, to prevent blocking other operations, and reports its results to a worker queue taken care of by the rest of the application.

For example, the phrase “select ready surveillance units” uttered by the user will go through Sphinx's preprocessing steps of converting the audio to feature vectors and removing background noise. Afterwards the community-created acoustic model and our grammar language model work in tandem, checking the input against the internal HMMs: the first audio snippets are matched against the phones “/S/ /AH/ /L/ /EH/ /K/ /T/”, which corresponds to “select”, without risk of matching against “/S/ /AH/ /L/ /EH/ /B/ /R/ /IH/ /T/ /IY/”, which corresponds to “celebrity” and is not allowed by our grammar language model. When the whole input is processed we receive a hypothesis from Sphinx, “select ready surveillance units”, which is added to a queue for further processing in the pipeline.
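The sketch below illustrates roughly how such a grammar-driven recognizer can be wired up with the Sphinx4 API. It is a simplified stand-alone example rather than the prototype's actual code; the model, dictionary and grammar paths are placeholders that would have to point at the community models and the JSGF grammar exported from GF.

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class GrammarRecognizerSketch {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Placeholder paths: acoustic model, dictionary and JSGF grammar directory.
        configuration.setAcousticModelPath("resource:/models/en-us");
        configuration.setDictionaryPath("resource:/models/cmudict-en-us.dict");
        configuration.setGrammarPath("resource:/grammars");
        configuration.setGrammarName("commands");     // refers to commands.gram in JSGF
        configuration.setUseGrammar(true);

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
        recognizer.startRecognition(true);            // true clears previously cached audio
        SpeechResult result = recognizer.getResult(); // blocks until an utterance has ended
        if (result != null) {
            // In the prototype the hypothesis would be put on the worker queue
            // and handed to the parser module.
            System.out.println("Hypothesis: " + result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}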
While Sphinx in its current state does not expose confidence values for hypotheses when grammars are used, it is possible that such support could be added to the Sphinx code base. Since Sphinx is open source, it would be possible to look into how the grammar speech recognizer works and possibly add support for exposing its confidence values. How easy or difficult this would be is hard to tell, since the limited timeframe and other priorities have not given us the opportunity to explore the Sphinx code base in depth.

3.3 Grammatical Framework

The aim of the prototype is to support multiple languages; thus, adding a new language should not require a completely new system. If a generic system added support for more languages by simply matching specific sentences to commands, the number of matches would grow as k · n, where k is the number of commands and n is the number of languages, since each new command would have to be matched, in its full form, in every supported language. Such growth in development complexity is undesirable, and thus a more general approach is needed.

It is possible to extract the shared semantics of languages into a more abstract form, so that sentences from different languages that share semantics can be transformed into the same abstract form. For instance, the English “turn the lights off” and the Swedish “släck lampan” have the same semantic meaning while their syntax and words differ. If we can extract the semantic meaning in the parser module and output it to the next part, the next part can work independently of the input language.

There is a tool that has aided us in this effort: Grammatical Framework (GF) [28]. GF allows a user to specify an abstract syntax that covers the semantics you want to extract; then, for each language you want to support, you implement a concrete syntax following the abstract syntax's specification. This allows developers to encode their domain in an abstract syntax and only port the language-dependent parts in the concrete syntax. GF can then parse any language having a concrete syntax into an abstract parse tree. To summarize, given any input language with a corresponding concrete syntax, it can produce an abstract syntax tree which is independent of the input language. Thus the phrases “markera redo övervakningsenheter” and “selecciona unidad de vigilancia preparada” are parsed by GF into the same abstract parse tree; examples of parse trees can be seen in figures 3.2 and 3.3, with their abstract text representation visible in figure 3.4. GF with its helper library covers all major languages and helps with working out language-specific syntax, which allows a developer who is not familiar with the linguistic properties of a language to use GF and its library to incorporate that language.

Figure 3.2: Parse tree for Swedish

Figure 3.3: Parse tree for Spanish

TopCmd (CommonToCmd (SelectUnitsCmd Surveillance Ready))

Figure 3.4: Abstract parse tree from GF
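A quick way to check this behaviour is to load the concrete grammars into the GF shell and parse the same command in different languages. The session below is only illustrative; the module names follow the naming visible later in figure 3.13 (the abstract syntax AbstractFuture with concrete syntaxes such as FutureEng and FutureSwe), and the exact file names and linearized output in the prototype may differ.

> import FutureEng.gf FutureSwe.gf
> parse -lang=FutureEng "select ready surveillance units"
TopCmd (CommonToCmd (SelectUnitsCmd Surveillance Ready))
> parse -lang=FutureSwe "markera redo övervakningsenheter"
TopCmd (CommonToCmd (SelectUnitsCmd Surveillance Ready))
> parse -lang=FutureEng "select ready surveillance units" | linearize -lang=FutureSwe
markera redo övervakningsenheter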
-- Unifying constructs
TopCmd : Cmd -> Top ;
CommonToCmd : CommonCmd -> Cmd ;
SensitiveToCmd : SensitiveCmd -> Cmd ;

-- Actual commands
GetDirectionCmd : PID -> CommonCmd ;
GetSpeedCmd : PID -> CommonCmd ;
GetDistanceCmd : PID -> CommonCmd ;
GetClosestOfType : TargetParam -> CommonCmd ;
SelectUnitsCmd : UnitTypeParam -> StatusParam -> CommonCmd ;
ClearSelections : CommonCmd ;
SetModeCmd : Mode -> SensitiveCmd ;
DisplayRangeCmd : ShowHideParam -> PID -> CommonCmd ;
DisplayInformationCmd : PID -> CommonCmd ;

Figure 3.5: Extract of the abstract syntax

There are many advantages to using GF. It allows us to simplify the process of extracting semantics from different languages into a shared, language-independent output of the parser module. GF can output its grammars in many formats, including JSpeech Grammar Format (JSGF), which our speech-to-text module can then use as its language model. Furthermore, given an abstract syntax tree, GF can produce language-specific output, which allows the generic output of our application to be linearized into the user's own language, whichever that may be.

3.3.1 Abstract syntax

For our prototype we have created a small example grammar in GF within our domain, allowing the user to query for information and carry out actions in the prototype application. An extract from the abstract syntax, covering some commands, can be seen in figure 3.5. Here we see different constructors for the different commands, each requiring a varying number of arguments and resulting in a “CommonCmd” or a “SensitiveCmd”, which is unified into our “Top” category by the first three rules in the example. Since there are several types of commands, the system can distinguish them by the label you give them; the system is then able to have different behaviors for different labels of command. For instance, a “SensitiveCmd” might require that you confirm the action before it is executed, since the instruction has a larger effect.

If we take a closer look at the command in figure 3.6, we can see that “SelectUnitsCmd” takes two arguments, a “UnitTypeParam” and a “StatusParam”, which are defined later in the grammar, as seen in figure 3.7, with different constructors. This rule states that in order to construct a meaningful command for selecting units, the type and the status of those units are required. This is all the information we need in the abstract syntax for our example sentence, “select ready surveillance units”.

SelectUnitsCmd : UnitTypeParam -> StatusParam -> CommonCmd ;

Figure 3.6: Abstract construct linked to the “select ready surveillance units” command

Surveillance : UnitTypeParam ;
Ready, Unready, Unarmed, Armed : StatusParam ;

Figure 3.7: Extract of the abstract syntax

3.3.2 Concrete syntax

After the abstract syntax comes the concrete syntax, fulfilling its specification. We have used GF's functor structure to clearly separate the different parts, and have a partial, or in GF terms incomplete, concrete syntax to define word types and general rules for the different rules of the abstract syntax. In figure 3.8 we can see the rule in the concrete syntax for our selection command. First we have the name of the rule, followed by the parameters named u and s; this is followed by an implementation that uses GF's helper library for grammatical rules. The implementation consists of an ordered set of function applications. All commands are utterances and thus have the “mkUtt” construct wrapping them.
For this particular case we want the imperative form of the verb “select”, acting on a particular noun phrase, “ready surveillance units”, which leads to the usage of “mkImp”, “mk_V2” and “mkNP”, where “Imp”, “V” and “N” are short for imperative, verb and noun respectively in the GF helper library API. Since the verb acts on something, we are looking for a two-place verb, and the noun in question has a modifier, “ready”, which leads to the usage of “mk_V2” and “mkCN”. The last function used, “mkNP”, simply lifts nouns to noun phrases. This results in an utterance in imperative form with a verb acting on a particular noun. The “Select_V” construct and the constructs for the status and unit arguments are the only things missing; all other functions are provided by the GF helper library. The argument constructs can be seen in figure 3.9 and are simple delegations to the last part: the lexicon part of the syntax.

The lexicon part consists of rules which describe which constructs map to which words in a particular language. This is the only language-dependent part of our grammar and consists of as little glue as possible, with only lookups, mostly simple unary functions, and some mapping of the GF helper library's language paradigms to a particular language. This can all be seen in the extract in figure 3.10 from our lexicon implementation for English.

SelectUnitsCmd u s = mkUtt (mkImp (mk_V2 Select_V) (mkNP (mkCN s u))) ;

Figure 3.8: Concrete construct linked to the “select ready surveillance units” command

Surveillance = Surveillance_N ;
Ready = Ready_A ;
Unready = Unready_A ;

Figure 3.9: Concrete part delegating to the lexicon part

Ready_A = mkA "ready" ;
Unready_A = mkA "unavailable" ;
Surveillance_N = mkN "surveillance units" ;
Display_V = mkV "display" | mkV "show" ;
Select_V = mkV "select" ;
Clear_V = mkV "clear" | mkV "remove" ;

-- Delegations to GF lib
mk_V2 = mkV2 ;
mk_V3 = mkV3 ;
mk_ID_N = mkN ;
mk_N2 = mkN2 ;

Figure 3.10: Lexicon part of the concrete syntax for the “select ready surveillance units” command in English

The only difference between the English, Swedish and Spanish grammars is the lexicon part; the differences required for our example can be seen in figures 3.11 and 3.12.

3.3.3 Putting it all together

The relations between the different parts are covered partly by GF's inheritance syntax and partly by a functor instance which maps the required interfaces to their respective instances. An example of such wiring can be seen in our Swedish grammar in figure 3.13; it maps the Syntax module required by the GF library functions to the Swedish one in the library, as well as the lexicon of our grammar to the Swedish lexicon we have created. When all these parts are in place, it is possible to invoke GF with a parse instruction and give it any of the example sentences “markera redo övervakningsenheter”, “select ready surveillance units” and “selecciona unidad de vigilancia preparada”, and all of those inputs will produce the abstract tree “TopCmd (CommonToCmd (SelectUnitsCmd Surveillance Ready))” as a result.

Ready_A = mkA "redo" ;
Select_V = mkV "markera" ;
Surveillance_N = mkN "övervaknings" (mkN "enheter") ;

Figure 3.11: Minimal part of the lexicon for the “select ready surveillance units” command in Swedish
Ready_A = mkA "preparado" ;
Select_V = mkV "seleccionar" ;
Surveillance_N = mkN "unidad de vigilancia" ;

Figure 3.12: Minimal part of the lexicon for the “select ready surveillance units” command in Spanish

concrete FutureSwe of AbstractFuture = Future with
  (Syntax = SyntaxSwe),
  (LexFuture = LexFutureSwe) ;

Figure 3.13: Functor instance of the Swedish grammar

While all of the above may seem like a lot of effort in order to parse a line of text, there are benefits to using GF and doing it in this particular way. There is a clear distinction between what you seek in the domain, which is captured by the abstract syntax, and the language-specific implementation, which is captured by the concrete syntax. The concrete syntax uses proper grammatical constructors, with the help of constructors common to all languages, without having to worry about language-specific syntax. Once the abstract grammar and the first part of the concrete grammar are in place, introducing a specific language is as simple as filling out a lexicon file, which is the second part of the concrete grammar, with proper word choices for that specific language. This distinction allows a clear separation of concerns in development: a developer familiar with GF can implement the concrete part, while someone familiar with the target language can fill out the lexicon without needing much knowledge of GF. When adding a new language, the GF compiler makes sure that no rules in the abstract syntax go unhandled in the concrete part, and thus there is no risk of the support for different languages diverging. The unified output, despite input language differences, helps reduce the complexity of the overall system, as the next module only has to handle abstract syntax trees as input.

3.3.4 Usage of GF in our prototype

GF has rudimentary support for Java in terms of a module named PGF (Portable Grammar Format) which has Java bindings. However, we could not use this for the prototype. The reason is that PGF has no executable available for download, and the safety precautions at the company did not allow the compilation steps to be carried out, which forced us to create a workaround. We have an interface in our Java code, seen in figure 3.14, behind which we have hidden our workaround. In the concrete implementation of “getAbstractTreeAsString” we create a script file, invoke the full GF runtime on it, collect the result and return it to the main application. A similar process is used for the other methods, with different script contents. After invoking our GFTool implementation we get the abstract tree as a string, which is used to build a parse tree representation in Java.

public interface GFTool {
    public String getAbstractTreeAsString(String input, String grammar)
            throws IOException;

    public String getTranslation(String tree, String fromGrammar, String toGrammar)
            throws IOException;
}

Figure 3.14: Interface for the GF tool

If it becomes possible to run PGF, the only update needed is to replace the concrete implementation with one that uses PGF. It is possible to export datatypes for many languages from GF, including Haskell, JavaScript and Prolog among others, but not Java at the moment. To gain as much help from GF as possible and to ease development, we have created our own tool which generates Java enums based on the abstract GF grammar files, including code to convert strings to enum instances.
This allows us to convert the strings output by GF into datatypes native to Java and well supported by Java's syntax, such as switch statements, while keeping coherence and reusing the previously constructed grammar. An example of such an enum instance can be seen in figure 3.15.

public enum StatusParam {
    READY("Ready"),
    UNREADY("Unready"),
    UNARMED("Unarmed"),
    ARMED("Armed");

    private String functionString;

    private StatusParam(String s) {
        functionString = s;
    }

    public String getFunctionString() {
        return functionString;
    }

    public static StatusParam getParam(String s) {
        for (StatusParam p : StatusParam.values()) {
            if (p.getFunctionString().equals(s)) {
                return p;
            }
        }
        return null;
    }
}

Figure 3.15: Enumerator example for StatusParam

Creating this Java code from the GF grammar introduces a dependency on the parsing module in a later stage of the pipeline, which is undesirable. While the overarching goal is to have the modules be as independent as possible, introducing this dependency was considered to help development enough to justify its existence.

3.4 Interpreter

The interpreter is the part of the system that makes sure that the right command is executed and checks whether it is legal or not. The previous parts of the system lack context and mechanically do their job given the provided input. The interpreter, however, is aware of the current state of the system, and even though a command is legal with respect to the grammar, it might not be a valid command to execute given the current state of the system. This may be something as simple as an ID not corresponding to any valid target, or ambiguities that can make one input acceptable for several commands.

If the system asks the operator a yes-or-no question, the answer is acceptable for any yes-or-no question, but should only execute commands corresponding to the current context. For example, “Should I display information of target P22?” and “Should I stop tracking target P22?” can both be answered with “yes”, but depending on which question was asked, different actions should be taken by the system. Furthermore, a phrase like “yes” might be uttered at a time when no question was asked at all; the interpreter should then disregard the input, since no question was asked and the statement is thus illegal.

If there is more than one possible parse tree but only one legal interpretation in the current context, only the command corresponding to that interpretation is to be output. If there is more than one valid interpretation, the system handles it by prompting for further input to help disambiguate, or by ignoring the given command. However, our proof-of-concept grammar does not include any ambiguities, and Sphinx gives only one hypothesis when using grammars, which saves us from this problem. Adding the interpreter step helps improve precision, since valid parse trees that are not legal are filtered away.

public interface Interpreter {
    public void executeInstruction(Instruction instruction);
}

Figure 3.16: Interface for the Interpreter, which is provided with recognized instructions
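To make the interpreter's context handling concrete, the following is a hypothetical sketch of how a pending yes-or-no question could be tracked, so that a bare “yes” is only acted upon when a question is actually outstanding. It is not the prototype's implementation, and the methods assumed on the Instruction type (isConfirmation, isPositive, isSensitive, execute) are inventions for the purpose of illustration.

/** Hypothetical context-aware interpreter sketch; not the prototype's implementation. */
public class ContextAwareInterpreter implements Interpreter {

    /** The question the system is currently waiting for an answer to, if any. */
    private PendingQuestion pendingQuestion = null;

    @Override
    public void executeInstruction(Instruction instruction) {
        if (instruction.isConfirmation()) {              // the user said "yes" or "no"
            if (pendingQuestion == null) {
                return;                                   // no question was asked: disregard the input
            }
            if (instruction.isPositive()) {
                pendingQuestion.confirmedAction().run();  // run the action tied to the question
            }
            pendingQuestion = null;                       // the question has been answered
        } else if (instruction.isSensitive()) {           // e.g. a SensitiveCmd from the grammar
            pendingQuestion = askOperator(instruction);   // ask before executing, remember the context
        } else {
            instruction.execute();                        // a common command: execute directly
        }
    }

    private PendingQuestion askOperator(Instruction instruction) {
        // Prompt the operator (via speech or the GUI) and keep the action to run on "yes".
        return new PendingQuestion(instruction::execute);
    }

    /** Minimal holder for the action to run if the operator confirms. */
    public static final class PendingQuestion {
        private final Runnable confirmedAction;

        public PendingQuestion(Runnable confirmedAction) {
            this.confirmedAction = confirmedAction;
        }

        public Runnable confirmedAction() {
            return confirmedAction;
        }
    }
}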
From the company side, all that was necessary was to implement the interface seen in figure 3.16 with appropriate actions for the provided instructions, and to instantiate our recognizer with that interface. The integration was successful: adding our prototype to an existing project for demonstration purposes took just a few hours of work, most of it spent mapping instructions to the existing product's actions. The integration was done with a predetermined command set agreed upon together; in order to introduce new commands, a new runnable jar must be exported with the proper enums generated by our toolchain.

3.6 Summary of pipeline

Figure 3.17 revisits our initial presentation of the architecture on a more concrete level. It shows each part of the pipeline and a representation of its input and output, using the previously presented example of selecting ready surveillance units.

Figure 3.17: Concrete overview of the prototype pipeline

3.7 Multilinguality

Building a system modular enough to change language is a challenge: not only do different languages use different words, their syntax differs as well. Sphinx already has acoustic models for the major languages, and it supports creating new models for languages not yet covered by the community. This allows us to build the prototype as a proof of concept while keeping the possibility of adding new languages in the future. Sphinx is used as the speech-to-text module in our prototype, with English and Spanish as proof-of-concept languages.

Having the input in text format does not solve the syntactic or semantic problem; GF is used to address that. Thanks to the separation between the abstract and the concrete grammar, implementing a new language is simple and flexible regardless of its syntax. The semantics of what was said is extracted to an abstract form that looks the same no matter which input language was used. This means that the modules after the parser in the pipeline do not need to change when a new language is implemented, which helps minimize the effort needed to add one.

3.8 Model adaptation

Sphinx and its toolchain support what is called acoustic model adaptation [29]. Instead of training a new model for a specific use case, it allows adding training data to tweak an existing model, adapting it to the situation at hand. After some initial testing we realized that the US-English model provided by Sphinx did not perform to our expectations for native Swedish speakers and their English accents, and we therefore explored the model adaptation feature. We gathered volunteers who are native Swedish speakers and recorded them speaking English phrases from the grammar used for demonstration purposes. We then used these recordings to run the model adaptation process on the existing US-English model provided by Sphinx.

4 Evaluation

The evaluation of the system has three main parts: the criteria on which the system was evaluated, the experimental setup used when gathering results, and the results themselves.

4.1 Criteria

The evaluation of the project could be based on several criteria. We have chosen to focus on the subset most in line with the aim of the project that can be measured easily. Each such criterion is listed and explained in this section.
Multilinguality

How well does the project support multiple languages? We measure this by how many changes we had to make in order to add a second and a third language to the project, and approximately how many hours of work are needed to incorporate the current set of commands in a new target language. For example, if we support English, how difficult is it to add Spanish support? How many lines of code must be added, and is any other work required, such as documentation or gathering training data? This criterion comes from the aim of having a system that is as easy as possible to adapt to a new language in order to be viable from a business perspective.

Command/sentence error rate

How well does the system understand the user, and how many sentences are correctly recognized? We measure the system's precision by letting different users say two sets of sentences: one set of correct sentences, for which we count correct interpretations, and one set of incorrect sentences, for which we count falsely accepted interpretations.

Gain from model adaptation

We have measured the command/sentence error rate with respect to both an adapted model and the original model. The adapted model comes from using Sphinx's model adaptation feature, as described in section 4.2.

Usage complexity and instructions required

How difficult is it to become proficient in using the system, and how much time does that take? We measure this by the amount of instruction needed to understand how to operate the system, together with a qualified estimate of how long it takes a new user to get used to it.

Runtime computer resource requirements

What resources does the project use while running? Due to the restricted resources available in the target environment, we monitor and measure the resource usage in terms of CPU, memory and storage. This is done under the requirement that responses should be delivered in real time, in other words with as small a delay as possible before an action is carried out.

4.2 Experimental setup

The experiment was done in English, since it is the most commonly known language other than Swedish at the company, and there is an English acoustic model available for Sphinx.

4.2.1 Training step

First we performed a training step using Sphinx model adaptation, for the reasons explained in section 3.8. The volunteers were 17 people, 13 men and 4 women. Each was recorded saying the same 25 commands, resulting in just below 2 minutes of audio per speaker and roughly 30 minutes in total. This data was then used to adapt the model.

To record the data and to simplify creating the correct file structure, a small program was created, whose GUI can be seen in figure 4.1. It presented the commands one by one to the user and allowed them to re-record a command if they felt it necessary. For each user it created a folder with the appropriate structure of sound files and transcription files; an example of the content of a transcription file can be seen in figure 4.2. This reduced the development work, as the file structure required by Sphinx was already in place when it was time to run the training tool. Once all training data was collected, the training was done by running the Sphinx adaptation software and took just a few minutes. The adaptation software takes an existing acoustic model as input and updates its parameters, that is, the probabilities of the underlying HMM, in accordance with the new training data so that input sounds are better mapped to the appropriate phones.
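To illustrate the bookkeeping the recording program performs, the sketch below writes a transcription file in the format shown in figure 4.2. The speaker name, command list and output paths are placeholders rather than the tool's actual code:

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Minimal sketch: write a Sphinx-style transcription file for one speaker.
public class TranscriptionWriter {
    public static void main(String[] args) throws IOException {
        String speaker = "Johan_Andersson";
        List<String> commands = Arrays.asList(
                "select ready surveillance unit",
                "give me the velocity of papa one one");

        Path dir = Paths.get("training", speaker);
        Files.createDirectories(dir);

        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(dir.resolve(speaker + ".transcription")))) {
            int index = 1;
            for (String command : commands) {
                // One line per recording: the spoken text followed by its file id,
                // matching the format shown in figure 4.2.
                out.printf("%s (%s\\%03d)%n", command, speaker, index++);
            }
        }
    }
}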
Figure 4.1: Screenshot of the simple training program that was developed

select ready surveillance unit (Johan_Andersson\001)
give me the velocity of papa one one (Johan_Andersson\002)
select ready surveillance unit (Oskar_Dahlberg\001)
...

Figure 4.2: Example of what a transcription file can look like to adhere to the Sphinx requirements

4.2.2 Evaluation step

Out of the original 17 volunteers, 10 had time to return for the evaluation step, 8 men and 2 women. The sentence error rate was tested with both the adapted model and the original model to see whether there was any difference between the two runs. To minimize bias in the results, 50% of the participants used the adapted model on the first run and the original on the second, and vice versa. The test consisted of 24 legal phrases, to measure the error rate, and 12 incorrect phrases of varying similarity to legal commands, to check for false positives. Both the experiment and the recording of training data took place in a silent room with little interference from background noise. The process of the evaluation program was monitored during the runs.

4.2.3 Software used

To make it possible to recreate the experimental setup, the tools and software used are listed below:
• OS: Windows 7 Enterprise, 64 bit
• Sphinx version: CMU Sphinx 4-5prealpha, released 2016-05-31
• Sphinx model: English, US, 5.2
• GF version: GF-3.8 for Windows
• Resource monitoring: Process Explorer v16.2

4.2.4 Hardware

The following hardware was used:
• Microphone: Konig Electronic, serial number KN-MIC50C
• External sound card: Goobay, model number 68878
• Laptop: Dell laptop with an i7-6820HQ CPU @ 2.70 GHz and 16 GB of RAM

4.3 Results

Here we present the results gathered from the evaluation step described in section 4.2.

4.3.1 Implementing a new language

For the speech-to-text part we downloaded the community model for US-English and adapted it as described in section 3.8. The model does work out of the box, but we wanted to increase performance as well as evaluate the viability of Sphinx's model adaptation feature. For the second language our original plan was Swedish, and we believed we could reuse resources previously created at KTH [30] by converting those models to the Sphinx format. There were tools available for such conversions, but we did not manage to run them successfully on the existing Swedish models. Since creating acoustic models was not a priority of the project, we moved to Spanish as our second language for the speech-to-text part.

For the parser module we started with an English version of our abstract grammar; the second language implemented was Swedish. This was done by creating a lexicon instance for Swedish and going back to the partial concrete grammar to correct a few mistakes. Those mistakes were not noticeable in English but became apparent when the Swedish sentences were not the expected ones after the Swedish lexicon instance had been added. All in all, introducing Swedish support in the parser took less than an hour. The reason we added Swedish after English is partly that it was our original target second language and partly our fluency in the language. The third language added to the parser was Spanish, which took a bit more time since additional personnel fluent in Spanish had to be called in.
In terms of development time, once someone familiar with the language was present the effort was comparable to the Swedish implementation and was done within a few hours. Overall, implementing a new language proved to be very easy as long as an acoustic model is available. The original incomplete concrete grammar took some effort to create, mostly because we needed to familiarize ourselves with GF and the helper library. Once the incomplete concrete grammar was in place, however, creating a lexicon file was a trivial task that anyone familiar with the target language could complete in a short time, without any special knowledge of GF.

As can be seen in table 4.1, for a single language the JSGF code generated by GF requires fewer lines than the corresponding GF code, but once the number of languages increases, the code required to make a new language work in GF is much smaller, since the language-specific rules do not have to be implemented again. These numbers do not take into account the extra line breaks GF generates to increase readability when converting to the JSGF format, or the comments and blank lines we have added to increase readability in the GF code.

                     GF    JSGF
Abstract/Concrete   239       -
English              92     143
Swedish              92     211
Spanish              92     279
Total               515     633

Table 4.1: Number of lines of code required for a new language in GF and in JSGF

In summary, three steps must be carried out to implement a new language in our pipeline. Firstly, obtain an acoustic model for the new language, either by using an existing model created by the community or by creating a new one. Secondly, create a lexicon file adhering to the abstract and incomplete concrete grammars present; in practice, this means creating a file of around 100 lines of code with domain-specific words for the language in question. Thirdly, compile and prepare the grammar for the JSpeech Grammar Format. This is only a matter of running a tool in our toolchain and pointing it to the correct grammar; the tool compiles the grammar using GF, cleans the output and presents it in a Sphinx-compatible format. Performing these three steps can be done within half a workday, given that an acoustic model is available.

4.3.2 Runtime resources

The acoustic models come compressed in an efficient binary format, giving them a very small storage footprint. When running the application, however, the memory usage is noticeable. Sphinx offers less resource-intensive models at the cost of some performance, which may be interesting in an even more restricted setting. The CPU usage reported is the average during command-intensive sessions, with users providing commands almost constantly, spread across several cores. Sphinx itself is single-threaded; however, we run the rest of our prototype's modules, such as the parsing, on separate threads so as not to block the recognition process needlessly.

4.3.3 Experimental results

The results from the experiment are quite interesting. As can be seen in table 4.3, the error rate improved by about 8 percentage points. It can also be noted that there is a large difference between the two groups' performance in their first and second rounds. The group that started with the non-adapted model had an error rate of 41% with the non-adapted model and 27% with the adapted model, which is a huge improvement.
Resources used
CPU                                7.79%
Memory (RAM)                       1.08 GB
Storage (US-EN model)              52.6 MB
Storage (adapted US-EN model)      52.6 MB
Storage (Spanish model)            46.9 MB
Storage (GF source + compiled)     < 1 MB

Table 4.2: Resources required by the prototype

                     Non-adapted model   Adapted model
Adapted first              0.25               0.24
Non-adapted first          0.41               0.27
Total                      0.33               0.25

Table 4.3: Sentence error rate for the adapted model and the original model, for all participants combined and split by which model was used first

The group that started with the adapted model had an error rate of 25% with the non-adapted model and 24% with the adapted model. It can also be noted that participants whose results were good in the first round had good results in the second round as well, while participants whose results were poor in the first round generally improved, regardless of which model they started with.

The false positive rate is around 27% for both models, as can be seen in table 4.4. False positives are sentences that are not legal in the system but were parsed into a legal command regardless. The majority of the false positives come from phrases that share much of their structure with a legal command. For instance, the phrase "give me the velocity of yada yada" would be legal if the words "yada yada" were replaced with a valid ID, and it was parsed into a legal ID almost every time. A sentence completely different from a command, like "order me some McDonalds" or "how are you", rarely parsed into a legal command, though it did happen on occasion. Phrases like those are more likely to be uttered to another person in conversation than something that resembles a command.

An important caveat is that the amount of data gathered is too small to draw confident conclusions about the performance. Only 10 people, each with limited time, were present for the evaluation step, which means that the true performance may be better or worse than indicated by these results.

                     Non-adapted model   Adapted model
Adapted first              0.32               0.27
Non-adapted first          0.25               0.27
Total                      0.28               0.27

Table 4.4: False positive rate for the adapted model and the original model, for all participants combined and split by which model was used first

4.3.4 Usability

The prototype was used in two settings: firstly, a small integration with the existing system, mapping voice commands to existing GUI shortcuts, and secondly our own demo application of a radar system with somewhat more elaborate voice commands, for example to select unit types. The shortcuts added to the real application might not add much to the experience; features like zooming the map or hiding and showing panels are easier to perform with the existing shortcuts. Those commands were chosen because they already existed as shortcuts in the system and were easy to implement. The real improvement was more apparent in the demo environment, where the operator can select interesting targets or show certain information, in other words perform more complicated actions, without having to open panels to execute the command. The commands chosen might not have been the most intuitive ones, in that they shared similar traits, which could lead to confusion about which command used which words.

5 Discussion

5.1 Discussion

In this section we present our thoughts on the project and its outcome.
5.1.1 Grammatical Framework

GF has been a good tool for handling multiple languages. Thanks to the abstract and concrete grammars in combination with the support library, it has been easy to implement new languages. Even though the first language took somewhat more work to implement in GF than in a grammar format dedicated to that specific language, the second language was implemented in less than an hour, which is remarkably little effort for localizing to a new language. As with any new programming language, it does take some effort to learn to navigate the structure and become efficient, especially when it comes to constructing grammatical rules with the helper library. In our case we consider the effort well worth the gain GF provides.

5.1.2 Experiment discussion

There is room for improvement in how the experiment was carried out. However, the results indicate an improvement in accuracy, and some participants commented that they experienced the adapted model as working better. Sphinx's model adaptation feature does work, even on a relatively small amount of data: our training data amounted to just 60 MB, whereas training a new acoustic model may require many gigabytes of data depending on the use case.

It is worth noting that the sample sizes for training and evaluation were rather small. This is due to limitations, for security reasons, on bringing equipment and data out of the company and on bringing external people into the building; it was deemed easier to use employees already working on site, so only employees with time to spare could help with the experiment and the training of the model. Nevertheless, we feel confident that the results point to an improvement from using Sphinx model adaptation, as well as to the usability of our proposed architecture overall.

Regarding the result of the model adaptation, there was a tendency for the second round to score higher, and there could be several reasons for this. The first round contained more hesitation in the volunteers' speech, which decreased as more commands were said. In the second round the phrases had already been said once, so less effort had to be put into reading the same sentence again and understanding its structure. Possibly, this resulted in better pronunciation and less hesitation.

As for the result of the experiment, even though there was an improvement in accuracy after adapting the existing model, an error rate of 25% is not good enough for the end system to rely on. The top performer had an error rate of around 5-10%; however, the data set was not large enough to draw any definite conclusions. There are a few methods that could combat the high error rate. The first is to give the system more training data so that it can better recognize the accent of the user; the question is how much training data has to be gathered for the accuracy to reach an acceptable level. Another option that Sphinx offers is to recognize the contents of a recorded file: since the file contains a finite amount of data, recognition can be run on the entire file rather than guessing when the person is done talking, which could increase the speech-to-text accuracy. If neither of these methods proves accurate enough, it might be a good idea to start looking for another speech-to-text product.
We chose Sphinx partly because of its free license, its support for grammars and the low effort required to set it up and run the system. With more time put into the project, a better speech-to-text engine might be integrated, which could improve the performance.

5.1.3 Precision

The area of the project we are least content with is the precision of the prototype. While it performs well for some users, it fails often enough to become an annoyance. We believe this could partly be remedied by collecting more training data for adaptation with a better-suited test group, and perhaps by revising the recording method. At the moment the recording is continuous and Sphinx tries to determine when a user starts and stops speaking a command. To our knowledge, Sphinx does not expose any way to tweak the thresholds for this detection, and depending on the environment it works more or less well. Throughout the project we noticed that it is very common for Sphinx to cut off the input and start the analysis prematurely, especially if there is any hesitation in the user's speech. It would be interesting to compare our results with a push-to-talk method, where a whole command is recorded and then passed to the Sphinx recognizer with our model, to see if the results would be better.

5.2 Critical system safety

A very important aspect is the reliability of the system. If the system and technology are to be used in close cooperation with armed forces, it is important that it performs as expected and does not produce incorrect results. For example, a misinterpreted command from a user cannot be allowed to put human lives in danger. It is important to be aware of such problems when developing a system like ours, and to provide methods for dealing with situations where a potentially incorrect result could have dire consequences.

In our case we did this by modelling our commands in two categories: common commands and sensitive commands. Common commands require a lower confidence threshold in order to be executed and may not require verification from the user; GUI updates fall in this category. Sensitive commands are of such a nature that a higher level of confidence is required before they are carried out, and they could require the user to verify that the interpretation is correct before continuing. An example of such a command is turning the radar transmission on, which indirectly exposes the position to possible hostiles.

Another important aspect is accountability. What measures are in place to ensure that the system performs as expected? And if it does not, what data is collected and what traceability of the cause exists? It should always be possible to find out why a certain course of action was taken by the system, which is one of the reasons we chose a modular approach with transparent steps and many possible checkpoints for the results produced.

There are likely many more considerations to take into account, especially when introducing a system like ours into a real-world application. However, the ones mentioned above are those we consider important to highlight and be aware of, at least in the early stages of developing an ASR system in a domain such as ours.

5.3 Future work

Something to consider for the future is the choice of words used in the commands.
During the implementation we noticed that Sphinx had problems distinguishing words with similar characteristics. One command had three different states that differed by only one word: the command could be select, accept or remove. Sphinx had some trouble separating accept from select, since they have a similar structure. We solved this by changing accept to approve, which has the same meaning but a different word structure, and Sphinx could then easily distinguish select from approve. However, this created a new problem with approve and remove, which also share similar traits; we eventually solved it by changing remove to decline. This is a problem that will likely emerge again: if work on this prototype continues and more commands are implemented, the choice of words will have to be considered more carefully in order to avoid similar-sounding words that increase the risk of errors when interpreting speech.

To increase the precision of the system, a number of different solutions could be tested. In this project we have tested only one speech-to-text system, and there may be other alternatives that perform better with the same functionality. We did not look much into commercial alternatives or Windows-only systems, which may be an option for our company partner. If alternatives to Sphinx are explored further, we recommend that support for grammars be considered a high priority, since grammars have been noted to increase precision [4] and offer better control for the developer, which is useful in a restricted domain.

If continued work on Sphinx is performed, our recommendation is to investigate whether an alternative way of handling the input data would help. One of the common errors we encountered is that Sphinx cuts off the input and starts its analysis before the user has finished speaking. This could possibly be remedied by changing the input method from continuous listening to a push-to-talk approach or similar, where the whole command is first recorded and then supplied to Sphinx, guaranteeing that the full input is used as the basis for the interpretation. This should not be very difficult or time-consuming to implement.

Furthermore, it would be interesting to have a better test group, with people comfortable with ASR systems and the domain, record training data to a larger extent than 30 minutes. A good starting point would probably be 20 people or more, each recording 5 minutes of commands from the domain, resulting in a total of 100 minutes of recorded data. The evaluation should also be given more room, with a larger number of commands evaluated, for example 50 commands instead of the 24 used here, to better assess the outcome of the training.

Bibliography

[1] Speech API - Speech Recognition | Google Cloud Platform. https://cloud.google.com/speech/. Accessed: 2017-05-10.
[2] Alexa Voice Service. https://developer.amazon.com/alexa-voice-service. Accessed: 2017-05-10.
[3] Björn Bringert. Programming Language Techniques for Natural Language Applications. Department of Computer Science and Engineering, 2008.
[4] Beth Ann Hockey and Manny Rayner. "Comparison of grammar-based and statistical language models trained on the same data". In: Proceedings of the AAAI Workshop on Spoken Language Understanding. 2005.
[5] P. Wiggers and L. J. M. Rothkrantz. "Automatic Speech Recognition Using Hidden Markov Models". 2003.
[6] Stanley S. Stevens and John Volkmann. "The relation of pitch to frequency: A revised scale". In: The American Journal of Psychology 53.3 (1940), pp. 329–353.
[7] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2000.
[8] Naomi Sager. Natural Language Information Processing. Addison-Wesley Publishing Company, Advanced Book Program, 1981.
[9] Lawrence R. Rabiner. "A tutorial on hidden Markov models and selected applications in speech recognition". In: Proceedings of the IEEE 77.2 (1989), pp. 257–286.
[10] Stephen Tu. Derivation of the Baum-Welch Algorithm for Hidden Markov Models. 2015.
[11] CMU Sphinx US-En Dictionary. https://github.com/cmusphinx/sphinx4/blob/master/sphinx4-data/src/main/resources/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict. Accessed: 2017-06-11.
[12] R. Thangarajan, A. M. Natarajan, and M. Selvam. "Word and triphone based approaches in continuous speech recognition for Tamil language". In: WSEAS Transactions on Signal Processing 4.3 (2008), pp. 76–86.
[13] Petra Perner, Patrick Wang, and Azriel Rosenfeld. Advances in Structural and Syntactical Pattern Recognition: 6th International Workshop, SSPR'96, Leipzig, Germany, August 20–23, 1996, Proceedings. Vol. 1121. Springer Science & Business Media, 1996.
[14] Danielle S. McNamara. "Computational methods to extract meaning from text and advance theories of human cognition". In: Topics in Cognitive Science 3.1 (2011), pp. 3–17.
[15] Aarne Ranta. Implementing Programming Languages: An Introduction to Compilers and Interpreters. College Publications, 2012.
[16] syntax - definition of syntax in English. https://en.oxforddictionaries.com/definition/syntax. Accessed: 2017-04-06.
[17] Jane Garry and Carl Rubino. Facts About the World's Languages. H.W. Wilson, 2001.
[18] semantics - definition of semantics in English. https://en.oxforddictionaries.com/definition/semantics. Accessed: 2017-04-06.
[19] Double entendre examples. http://examples.yourdictionary.com/double-entendre-examples.html. Accessed: 2017-04-10.
[20] JSGF Grammars in Sphinx4 [CMUSphinx Wiki]. http://cmusphinx.sourceforge.net/wiki/sphinx4:jsgfsupport. Accessed: 2017-05-10.
[21] Lothar Budach. Fundamentals of Computation Theory: 8th International Conference, FCT'91, Gosen, Germany, September 9–13, 1991, Proceedings. Vol. 529. Springer Science & Business Media, 1991.
[22] Ian H. Witten et al. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2016.
[23] HTK Speech Recognition Toolkit. http://htk.eng.cam.ac.uk/. Accessed: 2017-05-16.
[24] Kaldi ASR. http://kaldi-asr.org/. Accessed: 2017-05-16.
[25] CMUSphinx Open Source Speech Recognition. https://cmusphinx.github.io/. Accessed: 2017-05-16.
[26] Kaldi grammar support. https://sourceforge.net/p/kaldi/mailman/message/31895432/. Accessed: 2017-05-16.
[27] About - CMUSphinx Open Source Speech Recognition.
     https://cmusphinx.github.io/wiki/about/. Accessed: 2017-05-16.
[28] Aarne Ranta. Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Center for the Study of Language and Information, 2011.
[29] Adaptation of the acoustic model. https://cmusphinx.github.io/wiki/tutorialadapt/. Accessed: 2017-05-15.
[30] Giampiero Salvi. "Developing acoustic models for automatic speech recognition in Swedish". In: European Student Journal of Language and Speech, Stockholm, Sweden (1999).