International Journal of Networked and Distributed Computing

Volume 9, Issue 1, January 2021, Pages 10 - 18

The Performance Evaluation of Continuous Speech Recognition Based on Korean Phonological Rules of Cloud-Based Speech Recognition Open API

Authors
Hyun Jae Yoo1, Sungwoong Seo1, Sun Woo Im2, Gwang Yong Gim1, *
1Department of IT Policy and Management, Graduate School, Soongsil University, Seoul, Korea
2Graduate School of Korean Language and Literature, Soongsil University, Seoul, Korea
*Corresponding author. Email: gygim@ssu.ac.kr
Corresponding Author
Gwang Yong Gim
Received 9 October 2020, Accepted 18 November 2020, Available Online 8 January 2021.
DOI
https://doi.org/10.2991/ijndc.k.201218.005How to use a DOI?
Keywords
Speech recognition, pronunciation dictionary, Korean phonological rules, cloud computing, Open API
Abstract

This study compared and analyzed the speech recognition performance of Korean phonological rules for cloud-based Open APIs, and analyzed the speech recognition characteristics of Korean phonological rules. As a result of the experiment, Kakao and MS showed good performance in speech recognition. By phonological rule, Kakao showed good performance in all areas except for nasalization and Flat stop sound formation in final syllable. The performance of speech recognition of Korean phonological rules was good for /l/nasalization and /h/deletion. The speech recognition performance of phonological rule words accounted for a very high percentage of the whole words speech recognition performance, and the speech recognition performance of phonological rule was more different among companies than between speakers. This study hopes to contribute to the improvement of speech recognition system performance of cloud companies for Korean phonological rules and is expected to help speech recognition developers select Open API for application speech recognition system development.

Copyright
© 2021 The Authors. Published by Atlantis Press B.V.
Open Access
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

1. INTRODUCTION

Speech recognition systems have significantly improved performance with cloud computing technology [1] and application of artificial intelligence [2]. The cloud-based speech recognition engine addresses the difficulties of developing speech recognition systems. By collecting large amount of speech data for development of speech recognition system, high performance computer for learning large volume speech data is not needed. Cloud-based speech recognition Open API has saved a lot of time, effort, and money to develop an applied speech recognition system. The improved performance and ease of development of speech recognition systems are being applied in a variety of areas. Speech recognition systems are largely divided into pre-processing and recognition units [3]. The recognition unit makes a word for the extracted speech information of the speech. The process of creating words either uses pronouncing dictionaries according to the characteristics of the speech recognition system [46], using information through deep learning of vocal information without pronouncing dictionaries [7,8]. The speech recognition system should accurately recognize phonological changes regardless of whether a pronouncing dictionary exists. In the speech recognition process, meaningful sentences should be made in terms of syllables by finding the exact morphemes. Therefore, assessing the recognition rate of speech recognition systems for phonological rules will help to understand the characteristics of speech recognition systems. This study aims to explore the characteristics of cloud-based speech recognition system’s application of phonological rules and to present the criteria for selecting a high-performance cloud-based Open API for developing an applied speech recognition system. This study conducted a study on continuous speech recognition performance evaluation in accordance with the Korean phonological rules of the cloud-based speech recognition Open API. The composition of the paper described the related research on Korean phonological rules and cloud-based speech recognition Open API in Chapter 2, and Chapter 3 described experiment methods and test results as experiments. Chapter 4 summarizes the evaluation and meaning of experimental results and describes future research tasks.

2. BACKGROUND

2.1. Speech Recognition Overview

Speech recognition is a technique that converts a person’s pronunciation into meaningful characters. The Korean Telecommunications Technology Association (TTA)’s Information and Communication terms Dictionary describes speech recognition as “automatically identifying linguistic meaning contents from speech, and more specifically, it is a processing process that identifies words or series of words and extracts meanings by entering speech waveforms.” The processing of speech recognition systems is divided into pre-processing and recognition units, as in Figure 1 [3]. In preprocessing, the input speech information is extracted, and in the recognition section, the speech information extracted from preprocessing is converted into words and the sentence is made. The creation of sentences uses pronouncing dictionaries and vocabulary dictionaries made in large TEXT coppers.

Figure 1

Process of speech recognition system.

Factors affecting the performance of speech recognition system include noise elimination method, method of extracting speech characteristics, method of generating sound model, method of generating pronouncing dictionary, method of creating language model, and method of decoding network method. Pronouncing dictionary is a crucial factor in making speech information of speech recognition system a meaningful word. The pronouncing dictionary gives vocal information according to the heading. This vocalization information reflects phonetic rules to create a pronouncing dictionary. In particular, the difficulty of Korean speech recognition is to make a dictionary of pronunciation because there are so many rules that apply to the Korean pronunciation method in the generation of pronunciation.

2.2. Korean Phonological Rule

Phonological rules mean changing the predetermined pronunciation of the morpheme due to changes in phoneme and phenomena of change. The phonological process can be divided into official phonological processes and general phonological changes. The phonological process is divided into five parts [9] from the point of view of the syllabus: replacement, elimination, inclusion, condensation, and metathesis. Moreover, it is divided into essential phonological rules and veterinary rules, depending on the environment. Essential phonological rules are rules that must be applied in all conditioned environments, and optionally rules are rules that are both good and need not be applied in the same phoneme environment. Table 1 shows the classification and division of public phoneme rules by phoneme process. Table 2 describes essential phonological rules and examples of related words in the synchronic phonological process, and Table 3 describes optionally phonological rules and examples of related words in the synchronic phonological process [9].

Synchronic phonological process Formative phonological process Formative phonological process of consonants Replacement Flat stop sound formation in final syllable, Nasalization, Liquidization, Place assimilation, Fortition
Deletion Simplification of Consonant cluster, /h/deletion, Geminate consonants Reduction, /t/deletion, /l/deletion, Nasal deletion
Insertion Gemination, /n/insertion, Homophony insertion
Contraction Aspiration, Fortition
Formal phonological process for vowels and semi-vowels Replacement Vowel harmony, Umlaut, /j/semi-vowelization /w/semi-vowelization, Complete assimilation of vowel, front-vowelization, Vowel rounding
Deletion /ɯ/deletion, Same vowel elision, /j/deletion, /w/deletion
Insertion /j/insertion
Contraction Vowel coalescence
Joint process Formative phonological process of consonants Replacement Implosive formation, Voicing (voicing assimiliation), Lateralization, Palatalization
Formal phonological process for vowels and semi-vowels Replacement w-fronting (w front-vowelization)
Deletion /j/deletion
Table 1

Classification and types of synchronic phonological rules

Phonological rules Explanation Example: Pronunciation variation (Before → After), Symbol (IPA)
Flat stop sound formation in final syllable A phenomenon in which the obstruent changes from the final consonant to one of the final syllable neutralization /p, t, k/. jʌpʰ → jʌp, sotʰ → sot
Nasalization Final syllable neutralization /p, t, k/ assimilate into nasal /m, n, ŋ/ respectively in front of nasal sound. pabman → pamman, padnɯn → pannɯn
Liquidization /n/ encounters /l/ and turns into /l/. mulnoli → mullori, ɕilnɛ → ɕillɛ
Fortition A phonological phenomenon in which a plain consonant among obstruent is changed to a fortis in a certain environment. tsabgo, → tsapkʼo, midgo → mitkʼo
Simplification of consonant cluster In the case of a group of consonants consisting of two consonants, one of the two consonants is dropped out of the final consonant. nʌgtsto, → nʌktʼo, ʌntsnɯn) → ʌnnɯn
/h/deletion The predicate final consonant /h/ is a dropout phenomenon in front of a vowel. nahɯn → naɯn, anha → aˈna
/t/deletion /t/ is dropped in front of /sʼ/. tsʌdzso → tsʌsʼo, os sanda → osʼanda
/l/deletion The predicate final consonant /l/ are dropped in front ofthe first consonant /n/, thefinal consonant /n, l, m, p/, the pre-final ending ‘-ɯsi-, -ɯo-', the sentence-closingending ‘ -ɯo, -ɯma'. mandɯl-m nida → mandɯmnida
Aspiration A phenomenon in which the final syllable neutralization and the flat spirant become aspiration sounds when they meet /h/. nohda → notʰa, anhgo → (ankʰo
Complete assimilation of vowel In Gyeongsang dialect, the mediated vowel /ʌ/ is completely assimilated to the frontvowel. tah-ʌmo → taːmo
/ɯ/deletion /ɯ/ is dropped out under various conditions. sʼɯ-ʌ → sʼʌ, hʰɯ-ʌ → tʰʌ
Same vowel elision When vowel endings are connected after /a, ʌ/ terms, endings /a, ʌ/ are dropped out. ka-ʌ → kaː, sʌ-ʌ → səː
/j/deletion [j] is eliminated after the palatal sound [ɲ, ʃ, ʃʼ, ʎ]. tʌndzi-ʌ → tʌndzʌ, igi-e → iˈkeː
/w/deletion In some dialects, when the conjugated form is connected with a bilabial sound, a tongue front sound and a double vowel /wa, w/ is eliminated. po-a → pwa → paː
Lateralization [r] is the lateralization of [l] at the final consonant or after [l]. oɾɯnparro → oɾɯnballo, tarrara → tallaɾa
Palatalization A phenomenon in which dental sound [n], alveolar sound [s, sʼ, l] change to palatal sound [ɲ, ʃ, ʃʼ, ʎ] in a front sound [i, j, ɥ], respectively. kasʼni/kanni, jʌnlʎo/jʌlʎo
Table 2

Explanation and examples of essential phonological rules

Phonological rules Explanation Example: Pronunciation variation (Before → After), Symbol (IPA)
Place assimilation A Phenomenon that /t, n/ is changed to /p, m/ in front of bilabial, /k, ŋ/ in front of dorsal (back). And /p, m/ is changed to /k, ŋ/ in front of dorsal (back). mitʰpʰan → mibpʰan, tsipʰko → tsikʼo
Geminate consonants Reduction In front of fortis and aspirate of stop and fricative sound, a flat (lax) sound /p, t, k/ is optionally deleted at same place pabpʰul → papʰul, tɯdtsa → tɯtsʼa
Gemination In front of fortis and aspirate of stop and fricative sound, a flat (lax) sound /p, t, k/ is optionally inserted at same place apʼa → appʼa, apʰasə → appʰasə
/n/insertion A phenomenon that occurs optionally when the preceding word ends with a consonant and the word behind it starts with /i, j/ when a compound or derivative word is created pamil → pamɲil, polil → polɲil
Umlaut The back vowel /a, ʌ, o, u/ is changed to front vowel /ɛ, e, ø, y/ due to the influenceof the following front vowel 'i' or glide 'j' pab-i → pɛˈbi, tʼəg-i → tʼegi
/j/semi-vowelization When ending of vowel is connected behind predicate /i/, auslaut of predecate /i/ is optionally changed to semi-vowel /j/ pʰ-ə → pʰjʌ
/w/semi-vowelization When ending of vowel is connected behind predicate /o, u/, auslaut of predicate /o, u/ is optionally changed to semi-vowel /w/ po-a → pwa
Front-vowelization Onset of postposition or ending /ɯ/ is changed to /i/ behind sibilant /s, sʼ, ts, tsʼ, tsʰ/ os-əno → osino
/j/insertion /j/ is inserted optionally when ending onset /ʌ/ is connected to predicate /i, e, ɛ, wi, ø/. pʰi-ə → pʰjʌ
Table 3

Explanation and examples of optionally phonological rules

2.3. Prior Study on the Korean Pronunciation

Factors affecting the degradation of speech recognition rate in speech recognition systems include noise, completed pauses, repeat/repeat speech, pronunciation variation, stammering, and vocabulary diagram. Among these factors that reduce speech recognition is due to non-grammatical vocalization, except for noise factors. Many studies have been conducted on how to match spelling and pronunciation to reduce the error rate of speech recognition by non-grammatical speech. These methods include creating and using Grapheme to Phoneme (G2P) and learning pronunciation. The process of making a pronouncing dictionary is complex and has many maintenance limitations, so studies are being conducted on end-to-end speech recognition that does not require a pronouncing dictionary [10]. The existing pronouncing dictionary is a standard pronouncing dictionary based on linguistic standards, and the phonetic column is hand-written. However, this required professional knowledge of Korean phonological changes and required a lot of time and effort in writing. To solve these problems, we created a phonetic dictionary based on the Korean phonological rules [11]. This method performed particularly well in multi-pronunciation dictionaries. However, there is a problem with multiple pronunciations, which increases the size of the dictionary, increases the ambiguity of the perceived object at the recognition stage, and increases the congestion [12]. Thus, phonetic was extracted from two corpus of syllable unit and morpheme unit in consideration of phonological variation [4], and a pronouncing dictionary was created by establishing a new unit corpus in which morphological phonological variation was considered [5]. The size of the pronouncing dictionary decreased a lot and the error rate of the word also decreased. There was also a study without a pronouncing dictionary. Although the G2P process that requires changes in phoneme and exception processing of Hangeul is necessary [13], there is a study that breaks down the method of recognizing through deep learning without the G2P process [7] and uses it as an output unit of sound model [8] by breaking it down into letters in initial, neutral, and ending. The method of using lettering showed better performance than pronouncing dictionaries. A new set of phonetic phonemes was created by clustering the ignited voices into a common spectral pattern to increase the discriminative power [6]. Pronouncing dictionaries using common spectral patterns had an effect of reducing the relative word error rate of 8.9% in the phonetic speech than phonetic pronouncing dictionaries, and free speech data by about 7.0%. This study will be meaningful in evaluating the pronunciation treatment of the cloud company speech recognition system through the performance evaluation of Korean phonological rules for the cloud company speech recognition system.

2.4. Cloud Speech Recognition Open Application Programming Interface

Cloud-based speech recognition Open Application Programming Interface (API) is an application service in cloud computing environment. Cloud computing is a service that remotely orders and pays for computer resources (such as software, hardware, storage, etc.) and uses them [1,14]. Cloud computing has characteristics such as multitenancy, on-demand usage, usage measurement, elasticity, resilience, and ubiquitous access. The advantages of cloud computing are, first, low investment and lower maintenance costs. Second, the scalability of computer resources is good. Third, the service configuration is short. Fourth, availability and reliability are high. Fifth, rapid decision-making by the organization of the system configuration is reflected. The downside is, first, that it is vulnerable to security. The stability of data should be delegated to external companies. Second, it is difficult to transfer data when changing service provider. Third, data may be required to be disclosed in accordance with local regulations and regulations of the service provider [1,14]. Cloud-based speech recognition Open API is an API that enables speech recognition developers to develop speech recognition systems using the characteristics of cloud computing. The difficulty of developing a speech recognition system should be based on high-performance computers that can collect large-capacity speech data and learn large-scale speech data. However, cloud-based speech recognition Open API addresses the difficulties of developing speech recognition systems. The cloud-based Open API allows application speech recognition developers to implement desired application speech recognition systems quickly and easily. Companies providing cloud-based speech recognition Open API are represented by domestic Kakao Speech-to-Text system [15], SKT NUGU [16], Naver Clova Speech Recognition [17], GiGA Genie Speech Recognition [18], ETRI STT [19], and others, while foreign companies are Microsoft Azure Cognitive Speech Service [20], Amazon Transcribe [21], IBM Watson Speech to Text [22], and Google Cloud Speech-to-Text [23].

2.5. Prior Study on Cloud Speech Recognition Open API

Cloud-based speech recognition Open API supports development of application speech recognition system quickly and easily. Due to the convenience of development using cloud-based speech recognition Open API, applied speech recognition system is being established in various fields. Application speech recognition developers should choose the speech recognition Open API appropriate for their application speech recognition system, depending on the function and performance they want in developing the application speech recognition system. There are many cases of cloud speech recognition Open API performance evaluation studies to provide criteria for this choice. Cloud-based speech recognition Open API shows performance differences depending on the timing of research and the nature of learning data. The March 2017 study found that Google API was the best [24]. In August 2017, a study conducted experiments on numbers, Hangul, and sentences. The numbers were Kakao, and Naver performed well in Hangeul and sentences [25]. The October 2017 study conducted an experiment on sentences, and the main factors in sentences in which recognition errors occurred were words in Portuguese and English, acronyms, names and certain corporate terms. The Google Cloud Speech API had the highest accuracy. However, the speed was found to be the slowest [26]. In the December 2017 study, the Korean people’s standard language and dialect were studied according to gender, age, and region. The accuracy of the sentences was measured based on spacing, props, surveys, and words according to the resulting sentences. Overall accuracy was good for Google, dialect was good for the Chungcheong and Jeolla dialects, and in the Gyeongsang dialects, sentences with large differences in intonation and pitch and unfamiliar Gyeongsang dialect words were not well recognized [27]. In the December 2018 study, Google showed moderate performance, unlike previous studies [28]. In the 2019 study, Korean and English sentences were recorded at a distance of 1, 3 and 5 m [29]. ETRI Open API in Korean, ETRI Open API in 1 m, Naver in 3 m, Naver Clova in 5 m, Microsoft Azure Speech Service in English, Microsoft Azure Speech Service in 1 m, Amazon Transcribe in 3 m, and ETRI Open API in 5 m showed high recognition rates [29]. According to prior research from 2017 to 2019, Google showed good performance in the beginning, but Microsoft Azure Speech Service and ETRI Open API showed good performance in the second half. In the preceding study, there is no case of speech recognition research on changes in Korean phonemes and changes. It would be a meaningful study to evaluate phonetic recognition of phonetic fever in identifying the characteristics of cloud-based speech recognition Open API.

3. EXPERIMENT

3.1. Experimental Method

The experiment tested the speech recognition performance of the cloud-based speech recognition Open API for Korean phonological rules. The Korean phonological rules selected 10 essential phonological rules (nasalization, /t/deletion, palatalization, /h/deletion, simplification of Consonant cluster, fortition, /l/nasalization, flat stop sound formation in final syllable, aspiration, liquidation) that occur during the public phonological process [9]. The cloud-based speech recognition Open API targeted seven domestic and foreign cloud companies (Kakao, ETRI, Naver, Microsoft, Google, IBM, Amazon, IBM). Speech data recorded a total of 100 sentences and 2560 phrases, 10 sentences each for 10 syllables by phonological rule, such as Table 4 [9,30,31]. Five speakers, male and female, participated in the recording, and the recording environment was recorded in a general office without soundproofing facilities. The format of the speech data was 16-bit PCM with a sampling of 16 kHz. Cloud-based speech recognition Open API did not consider any speech recognition options provided by cloud companies. The service method was chosen as a non-streaming method. The experimental equipment used a web program developed using PHP for general desktop computers. The evaluation method was measured in words. The recognition performance was verified by calculating the Word Error Rate (WER) in sentence, as shown in Equation (1). In Equation (1) S means Substitution, I mean Insertion, D means Delete, and N means the whole input phrase.

WorldErrorRate(WER)=(S+I+D)N (1)

In the evaluation, the error rate of the whole words (sentence containing the phonological rule word) and the error rate of the phonological rule word were measured, respectively. Whole words consisted of sentences containing phonological rules (Ex: KOREAN-> "종이를 접는 방법들 배우고 싶다" / IPA->" tsoˈŋiɾɯl tsʌmnɯn paŋbʌpɯl pɛˈugo ɕiptʼa" / ENGLISH->"I would like to learn how to fold paper", the phonological rule word is KOREAN->"접는" / IPA->"tsʌmnɯn"/ ENGLISH->"folding") [32].

Phonological rules Examples: Pronunciation variation (Before → After), Symbol (IPA)
Nasalization tsʌpnɯn → tsʌmnɯn, pʌpman → pʌmman, patnɯn → pannɯn, natsʰman → nanman, magnɯn → maŋnɯn,tʼʌgman → tʼʌŋman, takʼnɯn → taŋnɯn, ipʰman → imman, u:snɯn → uːnnɯn, mitnɯnta → minnɯnta.
/t/deletion tsʌdzsoka → tsʌsʼoka, mitsaoni → misʼaˈoˈni, kasʼsɯbnida → kasʼɯmnida, patsɯbnida → pasʼɯmnida, kɯɾɯsɕʼisgo → kɯˈɾɯt ɕʼikʼo, mutsɯbnida → musʼɯmnida, tɯtsɯbnida → tɯsʼɯmnida, tatsɯbnida → tasʼɯmnida,kətsɯbnida → kəsʼɯmnida, kotsɯbnida → kosʼɯmnida.
Palatalization mitʰi → mitsʰi, patʰita → patsʰita, kuˈti → kuˈdzi, katʰi → katsʰi, hɛˈdotiɾɯl → hɛˈdodziɾɯl, maˈti iˈni → maˈdzi iˈni,kʌtʰi → kʌtsʰi, putʰida → putsʰida, satʰsatʰi → saˈsʼatsʰi, kuthjʌsʼta → kutsʰʌta.
/h/deletion naˈha → naˈa, ɕiˈɾhʌhanda → ɕiˈɾʌɦanda, maːnhɯn → maːnɯn, sʼaˈhida → sʼaˈida, noˈhinda → noˈinda, kʼɯlhidaka → kʼɯlidaka, anhɯn → annɯn, aˈnha → aˈna, aˈɾha → aˈɾa, aɾhɯn → aɾɯn
Simplification of consonant cluster nʌksto → nʌktʼo, ʌndznɯn → ʌnnɯn, kuːlmko → kuːmkʼo, saːlmmani → saːmmani, haltʰtsi → haltsʼi, kʼɯlhnɯnda → kʼɯllɯnda, palktsi → paktsʼi, nʌlbta → nʌltʼa, ɯlpʰko → ɯpkʼo, alhnɯn → allɯn.
Fortition kukputʰʌ → kukpʼutʰʌ, tsapko → tsapkʼo, papto → paptʼo, mittsi → mitsʼi, uːsko → uːtkʼo, aːnko → aːnkʼo,oːmtsi → oːmtsʼi, kaltɯŋɯn → kaltʼɯŋɯn, multsilɯn → multsʼilɯn, solpaŋulɯl → solpʼaŋulɯl.
/l/nasalization nɯŋljʌkdo → nɯŋɲʌkdo, hamljaŋ → hamɲaŋ, ɯˈmunlon → ɯˈmunnon, hjʌpljʌkhajʌ → hjʌmɲʌkhaˈjʌ,homlʌnɯl → hom nʌnɯl, ʌplodɯ → ʌpnodɯ, ɕimlilɯl → ɕimɲilɯl, pjəːŋljʌlɯn → pjəːŋɲʌlɯn, tsikljʌllo → tsiŋɲʌllo, taːmljʌk → taːmɲʌk.
Flat stop sound formation in final syllable jʌpʰ → jʌp, təːpʰko → təːpkʼo, ipʰto → iptʼo, nastwa → natʼo, isʼta → itʼa, pitsʼto → pitʼo, nohnɯn → nonnɯn, takʼnɯn → taknɯn, osman → onman, apʰman → amman.
Aspiration mathjʌŋɯn → matʰjʌŋɯn, nohko → nokʰo, nohtaga → notʰaga, nəhtʌɾa → nətʰʌɾa, maːnhkʌdɯn → maːnkʰʌdɯn,kʼɯl htsido → kʼɯltsʰido, tsohtsinɯn → tsoˈtsʰinɯn, pʌphakkwanɯn → pʌpʰakʼanɯn, palkhjʌtsugi → palkʰjʌtsuˈgi,iphaki → ipʰaki.
Liquidization mulnori → mullori, sʼalnunɰi → sʼallunɰi, pulnɯŋ → pullɯŋ, ɕilnɛhwaɾɯl → ɕillɛɦwaɾɯl, hultʰnɯn → hullɯn, tal nimɯl → talʎiml, alhnɯn → allɯn, sonnanloka → sonnaːlloka, ɕinlaeˈke → ɕillaeˈke, onlain → ollain.
Table 4

Examples of words by experimental phonological rules

3.2. Experiment Result

As a result of the experiment, as shown in Table 5, the error rate for all words on a per cloud company basis was 8.09% for Microsoft and 8.28% for Kakao, showing good performance. IBM 43.38% and Naver 19.02% showed poor performance. As shown in Table 6, the error rates for phonological rules were 18.00% for Kakao and 25.60% for Microsoft, which showed good performance. IBM 71.20% and Amazon 36.00% did not performance well. As shown in Table 5, the error rates of sentences containing phonological rules word on the basis of phonological rules were for /l/nasalization 12.32% and /h/deletion 13.20%, showing good performance. Palatalization 23.48% and aspiration 22.04% showed poor performance. In Table 6, the error rates for phonological rules word were /h/deletion 16.29% and /l/nasalization 20.57%, which showed good performance, while simplification of consonant cluster 61.43% and aspiration 49.14% showed poor performance. Table 7 show the ratio of the number of incorrect words in phonological rules words to the number of incorrect words in whole words. The ratio was as low as 31.34% for IBM and 35.11% for Naver, and as high as 61.84% for Microsoft and 50.57% for Google. As shown in Table 8, the speech recognition error rate for whole words by speaker was 14.79% to 20.34%, and the speech recognition error rate for phonological rule words by speaker was 30.43% to 40.00% as shown in Table 9.

Amazon ETRI Google IBM Kakao MS Naver Sum of wrong words Total number of words WER%
Nasalization 12% 13% 9% 44% 10% 8% 20% 309 1855 16.66
/t/deletion 12% 13% 13% 42% 8% 10% 18% 270 1610 16.77
Palatalization 22% 18% 17% 56% 13% 13% 25% 452 1925 23.48
/h/deletion 13% 4% 8% 45% 4% 4% 15% 231 1750 13.20
Simplification of consonant cluster 21% 22% 16% 43% 14% 11% 22% 415 1960 21.17
Fortition 16% 12% 8% 40% 9% 7% 13% 254 1715 14.81
/l/nasalization 11% 5% 6% 42% 5% 3% 15% 220 1785 12.32
Flat stop sound formation in final syllable 9% 11% 4% 44% 5% 5% 20% 270 1960 13.78
Aspiration 27% 23% 13% 41% 9% 16% 26% 378 1715 22.04
Liquidization 16% 11% 7% 46% 6% 4% 16% 251 1645 15.26
Sum of wrong words 404 343 261 1136 212 207 487 3050 17920 17.02
Total number of words 2560 2560 2560 2560 2560 2560 2560 17920
WER% 15.78 13.40 10.20 44.38 8.28 8.09 19.02 17.02
Table 5

Speech recognition error rate for whole words by company (WER%)

Amazon ETRI Google IBM Kakao MS Naver Sum of wrong words Total number of words WER%
Nasalization 36.00% 32.00% 26.00% 76.00% 24.00% 20.00% 24.00% 119 350 34.00
/t/deletion 46.00% 44.00% 50.00% 92.00% 24.00% 44.00% 40.00% 170 350 48.57
Palatalization 42.00% 34.00% 36.00% 60.00% 8.00% 26.00% 34.00% 120 350 34.29
/h/deletion 14.00% 2.00% 8.00% 60.00% 0.00% 12.00% 18.00% 57 350 16.29
Simplification of consonant cluster 72.00% 56.00% 54.00% 88.00% 44.00% 52.00% 64.00% 215 350 61.43
Fortition 20.00% 26.00% 20.00% 60.00% 16.00% 16.00% 22.00% 90 350 25.71
/l/nasalization 24.00% 12.00% 12.00% 58.00% 6.00% 10.00% 22.00% 72 350 20.57
Flat stop sound formation in final syllable 14.00% 26.00% 6.00% 78.00% 18.00% 20.00% 40.00% 101 350 28.86
Aspiration 52.00% 72.00% 36.00% 60.00% 26.00% 42.00% 56.00% 172 350 49.14
Liquidization 40.00% 18.00% 16.00% 80.00% 14.00% 14.00% 22.00% 102 350 29.14
Sum of wrong words 180 161 132 356 90 128 171 1218 3500 34.80
Total number of words 500 500 500 500 500 500 500 3500
WER% 36.00 32.20 26.40 71.20 18.00 25.60 34.20 34.80
Table 6

Speech recognition error rate for phonological rules by company (WER%)

1st 2nd Average (%)


Company WER% Company WER%
Nasalization MS 20.00 Kakao, Naver 24.00 34.00
/t/deletion Kakao 24.00 Naver 40.00 48.57
Palatalization Kakao 8.00 MS 26.00 34.29
/h/deletion Kakao 0.00 ETRI 2.00 16.29
Simplification of consonant cluster Kakao 44.00 MS 52.00 61.43
Fortition Kakao, MS 16.00 Amazon, Google 20.00 25.71
/l/nasalization Kakao 6.00 MS 10.00 20.57
Flat stop sound formation in final syllable Google 6.00 Amazon 14.00 28.86
Aspiration Kakao 26.00 Google 36.00 49.14
Liquidization Kakao, MS 14.00 Google 16.00 29.14
Total Kakao 18.00 MS 25.60 34.80
Table 7

Ranking of WER% in terms of phonological rules by company

A B C D E Sum of wrong words Total number of words WER%
Nasalization 13.21% 24.26% 14.82% 14.29% 16.71% 309 1855 16.66
/t/deletion 15.53% 19.25% 17.39% 13.35% 18.32% 270 1610 16.77
Palatalization 20.78% 26.49% 19.74% 23.64% 26.75% 452 1925 23.48
/h/deletion 12.29% 16.29% 13.43% 10.00% 14.00% 231 1750 13.20
Simplification of consonant cluster 19.13% 24.49% 20.15% 21.68% 20.41% 415 1960 21.17
Fortition 12.83% 18.95% 11.37% 10.50% 20.41% 254 1715 14.81
/l/nasalization 12.04% 14.29% 13.73% 9.80% 11.76% 220 1785 12.32
Flat stop sound formation in final syllable 10.97% 15.82% 9.44% 12.76% 19.90% 270 1960 13.78
Aspiration 20.41% 24.78% 20.99% 19.53% 24.49% 378 1715 22.04
Liquidization 12.77% 17.93% 12.77% 10.64% 22.19% 251 1645 15.26
Sum of wrong words 539 729 552 530 700 3050 17920 17.02
Total number of words 3584 3584 3584 3584 3584 17920
WER% 15.04 20.34 15.40 14.79 19.53 17.02
Table 8

Speech recognition error rate for whole words by speaker (WER%)

A B C D E Sum of wrong words Total number of words WER%
Nasalization 30.00% 45.71% 31.43% 27.14% 35.71% 119 350 34.00
/t/deletion 45.71% 47.14% 57.14% 44.29% 48.57% 170 350 48.57
Palatalization 31.43% 41.43% 34.29% 34.29% 30.00% 120 350 34.29
/h/deletion 12.86% 24.29% 15.71% 10.00% 18.57% 57 350 16.29
Simplification of consonant cluster 60.00% 70.00% 55.71% 62.86% 58.57% 215 350 61.43
Fortition 27.14% 35.71% 20.00% 15.71% 30.00% 90 350 25.71
/l/nasalization 28.57% 18.57% 22.86% 11.43% 21.43% 72 350 20.57
Flat stop sound formation in final syllable 24.29% 34.29% 27.14% 27.14% 31.43% 101 350 28.86
Aspiration 50.00% 48.57% 50.00% 48.57% 48.57% 172 350 49.14
Liquidization 24.29% 34.29% 27.14% 22.86% 37.14% 102 350 29.14
Sum of wrong words 234 280 239 213 252 1218 3500 34.80
Total number of words 700 700 700 700 700 3500
WER% 33.43 40.00 34.14 30.43 36.00 34.80
Table 9

Speech recognition error rate for words of phonological rules by speaker (WER%)

4. CONCLUSION

In this paper, a study was conducted on continuous speech recognition performance in accordance with the Korean phonological rules of the cloud-based speech recognition Open API. First, the results of the experiment were compared and analyzed the speech recognition performance of the cloud-based speech recognition Open API. In Figure 2, the whole words error rate and phonological rule words error rate by cloud company both showed good performance for Kakao and MS, while IBM and Naver showed low performance. Looking at Table 7’s ranking of error rates for corporate phonological rule phrases, Kakao showed good performance in all areas except nasalization and flat stop sound formation in final syllable, while Microsoft showed good performance in nasalization and Google showed good performance in flat stop sound formation in final syllable. Table 7’s second-place group showed Google performing well in two areas of aspiration and liquidization, Amazon in two areas of fortition and final syllable neutralization, and Naver in /t/deletion and ETRI in /h/deletion. Cloud company’s speech recognition Open API showed good performance for certain phonological rules. Second, the speech recognition characteristics of the Korean phonological rules were analyzed. In Figure 3, the whole words error rate and the phonological rule words error rate were both good for /h/deletion and /l/nasalization, and palatalization, simplification of consonant cluster, and aspiration were poor. Table 10 represents the ratio of the number of wrong words of phonological rules words to the number of wrong words of whole words by company. The ratio is very high, from at least 35.11% to up to 61.84%. Figures 4 and 5 show similar alignments in both speech recognition error rates for whole words and speech recognition error rates for phonological rule words. In other words, the speech recognition performance of the phonological rule words is affecting the speech recognition performance in the whole words. Figures 6 and 7 represent speech recognition error rates for phonological rule words by company and speaker, and the linear shape in Figure 6 shows a more distracting linear form than in Figure 7. In other words, speech recognition performance for phonological rules can be attributed more to the speech recognition engine of cloud companies than to the speaker. Through this study, we confirmed that the speech recognition performance of the cloud-based speech recognition Open API for Korean phonological rule differs between companies, and that the speech recognition system of the same cloud company also has characteristics that show different performance by Korean phonological rule. According to the characteristics of speech recognition by Korean phonological rule, first, there was a difference in speech recognition performance by phonological rule. Second, speech recognition performance of phonological rule words had a significant impact on the overall speech performance. Third, speech recognition performance for phonological rule words was more different between companies than speakers. Therefore, this research will contribute to improving the Korean phonological rule speech recognition performance of the speech recognition engine of the cloud computing company and help speech recognition developers select the Open API to develop an applied speech recognition system.

Figure 2

Comparison of WER% by company.

Figure 3

Comparison of WER% by phonological rules.

Amazon ETRI Google IBM Kakao MS Naver Sum of wrong words
Total number of wrong words (A) 404 343 261 1136 212 207 487 3050
The number of words with wrong phonological rules (B) 180 161 132 356 90 128 171 1218
B/A ratio (%) 44.55 46.94 50.57 31.34 42.45 61.84 35.11 39.93
Table 10

Ratio of phonological rules to whole words by company

Figure 4

Linear comparison of WER% by company.

Figure 5

Linear comparison of WER% by phonological rules.

Figure 6

Comparison of WER% of phonological rules by company.

Figure 7

Comparison of WER% for phonological rules by speaker.

A future task is to evaluate the performance of speech recognition on the optionally phonological rules of the synchronic phonological process. The result is expected to be different from the speech recognition rate of the essential phonological rule because the rules may or may not be applied in the same phoneme environment. Following the essential phonological rules of the synchronic phonological process, the study of the evaluation of speech recognition performance for the optionally phonological rules will be meaningful to improve the performance of speech recognition in Korean.

CONFLICTS OF INTEREST

The authors declare they have no conflicts of interest.

REFERENCES

[1]JH Jeong, Current status and challenges of cloud computing, NARS Issue Rep., Vol. 313, 2017, pp. 17-21.
[3]Trends and Prospects of Voice Recognition Technology, Korea Creative Content Agency’s Cultural Technology (CT) In-depth Report, 11 (2011), 2020. Available from: https://www.kocca.kr/cop/bbs/view/B0000144/1756144.do?menuNo=.
[4]KN Lee and MH Chung, Morphological analysis of spoken Korean based on pseudo-morphemes, Korean Institute of Information Scientists and Engineers, in Proceedings of the Annual Conference on Human and Language Technology (Busan, Korea, 1998), Vol. 10, pp. 396-404.
[9]Jc Bae, Opening of Korean Phonetics, third ed., (Hak)Shingu media & publishing, Gyeonggi Sungnamsi Jungwongu, Korea, 2018.
[11]LG Nim and JM Hwa, Pronunciation dictionary for continuous speech recognition (in Korean), Proc. KIISE. Conf., Vol. 27, 2000, pp. 197-199.
[12]P Younghee and M Chung, Pseudomorpheme-based Korean continuous speech recognition using tagged word bigram, Korean Inst. Inform. Sci. Eng., Vol. 26, 1999, pp. 351-353.
[13]JW Yoo, A study on method of constructing pronunciation unit for continuous speech recognition, The Korean Electronics and Telecommunications Research Institute report, ETRI-94-03295, Vol. 1, 1995.
[14]L Chang-Beom, Legal tasks for safe use and revitalization of cloud computing, Review of The Korea Institute of Information Security and Cryptology (Review of KIISC), Vol. 20, 2010, pp. 32-43.
[15]Guide of Kakao Speech API, 2020. Available from: https://developers.kakao.com/docs/latest/ko/voice/.
[16]Guide of NUGU SDK Developers, 2020. Available from: https://developers-doc.nugu.co.kr/nugu-sdk.
[17]Guide of Clova Speech Recognition, 2020. Available from: https://www.ncloud.com/product/aiService/csr.
[18]Guide of GiGA Genie Speech Recognition API, 2020. Available from: https://apilink.kt.co.kr/api/menu/apiSpcDetail.do?apiSpcId=57.
[19]Guide of aihub Speech Recognition API, 2020. Available from: http://www.aihub.or.kr/ai_software/370#group00.
[20]Guide of Azure Speech to Text, 2020. Available from: https://azure.microsoft.com/ko-kr/services/cognitive-services/speech-to-text/.
[21]Guide of Amazon Transcribe, 2020. Available from: https://aws.amazon.com/ko/transcribe.
[22]Guide of Watson Speech to Text, 2020. Available from: https://www.ibm.com/kr-ko/cloud/watson-speech-to-text.
[23]Guide of Google Speech-to-Text, 2020. Available from: https://cloud.google.com/speech-to-text/.
[24]V Këpuska and G Bohouta, Comparing speech recognition systems (Microsoft API, Google API and CMU Sphinx), Int. J. Eng. Res. Appl., Vol. 7, 2017, pp. 20-24.
[25]SJ Choi and JB Kim, Comparison analysis of speech recognition open APIs’ accuracy, Asia Pac. J. Multim. Serv. Converge. Art Human. Sociol., Vol. 7, 2017, pp. 411-418.
[26]AL Herchonvicz, CR Franco, and MG Jasinski, A comparison of cloud-based speech recognition engines, Computer on the Beach, Vol. 4, 2019, pp. 366-375.
[27]H Roh and K Lee, A basic performance evaluation of the speech recognition APP of standard language and dialect using Google Naver and DaumKAKAO APIs, Asia Pac. J. Multim. Serv. Converge. Art Human. Sociol., Vol. 7, 2017, pp. 819-829.
[28]I Bobriakov, Comparison of the top speech processing APIs, 2018. Available from: https://activewizards.com/blog/comparison-of-the-top-speech-processing-apis.
[29]O Hyun-woo, L Koen-Nyeong, and Y Dongsuk, Performance comparison of open APIs for speech recognition, in Journal of the Acoustical Society of Korea 2019 Spring Conference (Jeju, Korea, 2019), Vol. 5. Volume 38, No 1(s), P256.
[30]J Lee, Lecture on Korean Phonology, Samkyung Munhwa Sa, Seoul Gangbukgu Miadong, Korea, 2014.
[31]Jh Lee, Gh Lee, and Sj Kim, Korean Pronouncing Dictionary, Jigu Publishing Co., Gyoha-eup, Paju-si, Gyeonggi-do, Korea, 2008.
[32]J Laver, Principles of Phonetics, Cambridge University Press, New York, 1994, pp. 561.
Journal
International Journal of Networked and Distributed Computing
Volume-Issue
9 - 1
Pages
10 - 18
Publication Date
2021/01
ISSN (Online)
2211-7946
ISSN (Print)
2211-7938
DOI
https://doi.org/10.2991/ijndc.k.201218.005How to use a DOI?
Copyright
© 2021 The Authors. Published by Atlantis Press B.V.
Open Access
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

Cite this article

TY  - JOUR
AU  - Hyun Jae Yoo
AU  - Sungwoong Seo
AU  - Sun Woo Im
AU  - Gwang Yong Gim
PY  - 2021
DA  - 2021/01
TI  - The Performance Evaluation of Continuous Speech Recognition Based on Korean Phonological Rules of Cloud-Based Speech Recognition Open API
JO  - International Journal of Networked and Distributed Computing
SP  - 10
EP  - 18
VL  - 9
IS  - 1
SN  - 2211-7946
UR  - https://doi.org/10.2991/ijndc.k.201218.005
DO  - https://doi.org/10.2991/ijndc.k.201218.005
ID  - Yoo2021
ER  -