The Application of Computer-Aided Under-Resourced Language Translation for Malay into Kadazandusun

Abstract: Machine translation (MT) is a computer-aided application in which computers (machines) translate text from one natural language to another. There are many online language translation tools, but thus far none offers translation of text sequences for the under-resourced Kadazandusun language. Although web-based and mobile Kadazandusun dictionary applications are available, these systems do not translate more than one word at a time. Hence, this paper presents a preliminary study of translation from Malay to Kadazandusun. Basic word-to-word translation with dictionary alignment, based on Direct Machine Translation (DMT), is selected to begin the exploration of the translation domain, since DMT is one of the earliest translation methods and relies on a word-to-word approach applied across the word sequence. The paper investigates the under-resourced language and the task of translating from the Malay language to the Kadazandusun language or vice versa. It presents the application, the process, and the results of the system according to the basic Kadazandusun word arrangement (Verb-Subject-Object), and it evaluates translation quality using the Bilingual Evaluation Understudy (BLEU) score. Several phases are involved in the process, including data collection (word pair translation), preprocessing, text selection, translation procedures, and performance evaluation. The preliminary approach achieves a BLEU score of up to 0.5, which indicates that the translation is readable but requires post-editing for better comprehension. The findings are significant for the quality of under-resourced language translation and serve as a starting point for other machine translation methodologies such as statistical or deep learning-based translation.


Introduction
There are many web-based MT systems available online. However, based on our observation, although some of these services provide translation for the Malay language, none of them offers text translation for the Kadazandusun language, which is one of the native languages of Sabah and is classified as an under-resourced language. The language is currently taught in several schools and public universities, and it is used in the news section of a local newspaper. Translation for a low-resource language like Kadazandusun is limited to dictionary lookup of a single word; a sentence or a longer text cannot be translated in a single request. As a result, the major purpose of this research is to examine preliminary language translation using the Direct Machine Translation (DMT) technique. Machine translation (MT) refers to computerized translation from one language to another (from a source language, SL, to a target language, TL) [1], and it is a subfield of Artificial Intelligence concerned with automatic language translation. It is also a multidisciplinary domain of research and application with support from different schools of thought such as computer science, artificial intelligence, mathematical modeling, statistics, education, language, linguistics, and many more.

www.aetic.theiaer.org
The goal of MT is to employ software or technology to allow individuals to convert written text from one language to another (a textual pair). In Malaysia, the Malay language has remained the core language, and the emphasis has been on developing improved MT for it. The earliest publication on a translation system for the Malay-English language pair can be found in [2]. Since then, research on MT for the Malay language has slowly gained attention, as summarized in Table 1.

| Reference | Translation | Description/Domain |
| --- | --- | --- |
| Cheong (1986) [2] | English-Malay | The first project on a computer-aided translation system for Malay-English started at Universiti Sains Malaysia, based on a secondary school chemistry textbook. |
| Cheong (1987) [3] | English-Malay | A study on the interrogative model in a Malay-English translation system. |
| Ogura et al. (1999) [4] | Japanese-Malay | Semantic-based transfer MT system applied to a Japanese to Malay translation prototype using 20,000 Malay entries. |
| Yeong et al. (2016) [1] | English-Malay | Statistical MT, applying a dictionary and lemmatizer. |
| Wang et al. (2016) [5] | Malay/Indonesian-English | Improved statistical-based MT for resource-poor MT. |
| Alsaket and Aziz (2014) [6] | Arabic-Malay | A rule-based approach was applied to develop the MT system with quite good human judgment accuracy at 92.3%. |
| Almeshrky and Aziz (2012) [7] | Arabic-Malay | A transfer-based approach was implemented in the MT system with 89.4% accuracy, comparing human judgment and system translation. |
| Lakew et al. (2018) [8] | Varieties/Indonesian-Malay | A neural MT approach study on varieties of language including Indonesian-Malay. |
| Chua et al. (2018) [9] | English-Malay | Example-based MT combined with analogical-based and structural semantics in English-Malay translation. The reported BLEU score is about 37.06%. |
MT converts a text from one language to another while keeping a similar or comparable meaning and grammatical structure. In comparison to Malay, the Kadazandusun language currently has few state-of-the-art MT studies. For the time being, researchers in this field have undertaken preliminary studies that leverage existing methodologies and techniques in order to find a viable translation approach. Online and offline resources written in Kadazandusun do not fully benefit the community because part of the population, especially the younger generation, is not able to fully understand the language. The Kadazandusun are the biggest ethnic community in Sabah, with about 30 percent of the entire population. In response to this issue, this paper proposes using rule-based direct translation to study MT from Kadazandusun texts to Malay or vice versa. The purpose of this research is to assess the source language using text from local newspapers (Malay articles) together with their translation into Kadazandusun as training data.
Direct Machine Translation, Rules, Corpus, Statistical, and Transfer Approach are the five major methodologies in MT. Of these approaches, the last four require a parallel text corpus to generate their model (e.g., rules, analogy, and statistics), while the Direct Machine Translation (DMT) approach is bilingual and uni-directional, from the source to the target language. DMT is a word-by-word sequence translation method that may incorporate structural or grammatical changes. Although DMT is less capable of translating sequences of text effectively, this study uses it for the preliminary MT work on an under-resourced language like Kadazandusun. Furthermore, to the best of our knowledge, this is the first attempt to use computing capabilities to apply MT from Bahasa Melayu to Kadazandusun or vice versa for sentence translation. Thus, the basic MT approach using word-to-word translation is investigated.
The various approaches used in MT have been classified into two groups: the single approach and the hybrid approach. The single approach employs only one method in the translation, while the hybrid approach combines the statistical method with a rule-based approach that includes several frameworks such as syntax, forest, word, and phrase [10].
DMT is regarded as a fundamental method for translating a sequence of words from one language to another without much linguistic processing, using a bilingual dictionary. It is also known as dictionary-driven MT. The translation process for DMT is depicted in Figure 1. In the DMT system, the morphological analysis extracts all words from the text in the source language. It may involve preprocessing (removing unwanted characters, etc.) and root word generation with stemming. The second step is to look up a base word or an original word in the pairwise bilingual dictionary. The dictionary must contain a matching word pair in the target language; otherwise, the translation of that word is unsuccessful. The final step is to perform some degree of syntactic rearrangement of the words according to the predefined rules in the system. This phase rearranges the TL words to match the sentence structure of the target language and outputs the result as TL text.

The quality of MT can be measured using evaluation metrics such as BLEU [12], NIST [13], METEOR [14], Perplexity Matrix [15], and Neural Network [16]. Among these evaluation metrics, BLEU (Bilingual Evaluation Understudy) is the one that appears most often in the MT literature. BLEU evaluates MT automatically and has the following features: quick and inexpensive calculation, easy to understand, language-independent, high correlation with human evaluation, and widely adopted. The following Equation 1 is used to determine the BLEU score.

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases} \qquad (1)$$

In Equation 1, BP is the Brevity Penalty applied in BLEU, where $r$ is the total number of words in the reference, $c$ is the total number of words in the translated output, $n$ is the n-gram order (1-gram, bigram, 3-gram, 4-gram), $w_n$ is the weight of the precision, and $p_n$ is the modified precision.
BLEU calculates the similarity between reference and generated phrases using the fundamental notion of n-gram precision. The BLEU metric ranges from 0 to 1 and is normally presented as a percentage (0%-100%). A score closer to 1 indicates that the translation corresponds more closely to a human translation. A BLEU score of less than 50% (0.5) indicates that the MT engine is performing poorly, so a higher level of post-editing is required before reaching publishable quality, while a score of 50% to 100% (0.5-1.0) is generally considered an average translation with some post-editing required [12].
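To make the scoring concrete, the following is a minimal Python sketch of sentence-level BLEU with the brevity penalty, assuming a single reference, uniform weights, and no smoothing. The function names are illustrative, and the paper does not state which BLEU implementation was used, so this is not a reproduction of the authors' tooling. The example pair reuses a human translation and a DMT output quoted later in the text.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU with uniform weights 1/max_n and a brevity penalty."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; an unsmoothed score collapses to zero
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "minoboros i shakib".split()   # human translation quoted in the text
candidate = "shakib minoboros".split()     # DMT output quoted in the text
print(round(bleu(candidate, reference, max_n=1), 4))  # 0.6065
print(bleu(candidate, reference, max_n=2))            # 0.0
```

With unigrams only, both candidate words appear in the reference, so the score is reduced only by the brevity penalty. Once bigrams are included the score drops to zero, because the word order differs.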

Material and Methods
As stated in the introduction, there are currently no MT systems available for the Kadazandusun language. As a result, this is the first attempt to use the DMT approach to investigate MT for Kadazandusun. Any MT system would include the following steps: 1) source language text input, 2) source language text decoding (also known as the transfer phase), and 3) text translation or encoding to the target language. In step 1, most systems perform text preprocessing with the help of a corpus or a word database. Step 2 uses a particular approach to decode the text, such as pairwise or dictionary-based, sequence-to-sequence, or neural-based decoding. Finally, in step 3, the translation encodes the source text into a target text. The decoder creates a translation structure that may preserve the meaning of the original sentence in the target language, ranging from simple word-to-word translation with some linguistic alignments to complex neural decoding structures.
The preliminary study on MT for the Kadazandusun language is done using DMT word-to-word translation with dictionary and rule alignment. The steps involved in the procedure are depicted in Figure 2 below.
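The word-to-word lookup at the core of the DMT procedure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the system described in this paper. The dictionary holds only four word pairs that are quoted in the discussion of the results, the input sentence is hypothetical, and the function names `preprocess` and `direct_translate` are illustrative. Words without a dictionary entry are passed through unchanged, which is also an assumption.

```python
import re

# Toy Malay -> Kadazandusun word pairs quoted in the text; a real system
# would load the full bilingual word database instead.
BM_TO_KD = {"akan": "atantu", "beli": "boli", "ikan": "sada", "untuk": "montok"}

def preprocess(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def direct_translate(text, dictionary):
    """Word-to-word lookup; words missing from the dictionary pass through unchanged."""
    tokens = preprocess(text)
    translated = [dictionary.get(tok, tok) for tok in tokens]
    missing = [tok for tok in tokens if tok not in dictionary]
    return " ".join(translated), missing

output, missing = direct_translate("Akan beli ikan untuk Nancy.", BM_TO_KD)
print(output)   # atantu boli sada montok nancy
print(missing)  # ['nancy']
```

The sketch stops before the syntactic rearrangement step, so the output keeps the source word order. It also lowercases proper names, which a production system would need to handle separately.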

Source Text Collection
The first step is the gathering of a text corpus. The preliminary corpus is a Kadazandusun newspaper section that contains publicly available news acquired from the New Sabah Times. Table 2 provides an overview of the corpus information. Text preprocessing is applied to the text to obtain the Kadazandusun word list. The first stage of text preprocessing was applied based on the best settings observed in [17]. The filtering procedure removes tokens that consist of integers or alphanumeric strings, leaving only word tokens in the text. Next, tokenization is applied to extract single words using a whitespace tokenizer. This yields approximately 14531 unique words; however, the list also contains several English words, Malay words, and names that need to be removed. After those words were removed, a total of 7079 words remained for the next phase.
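The filtering and tokenization steps above can be sketched as follows. This is a minimal Python sketch, assuming that punctuation is stripped from token edges and that an exclusion list stands in for the manual removal of English words, Malay words, and names. The function name, the exclusion list, and the sample string (assembled from Kadazandusun words quoted elsewhere in this paper, not a real sentence) are all hypothetical.

```python
def extract_wordlist(text, exclude=()):
    """Return the sorted unique alphabetic tokens, minus an exclusion list."""
    excluded = {w.lower() for w in exclude}
    words = set()
    for raw in text.split():  # whitespace tokenizer
        token = raw.strip(".,;:!?\"'()[]").lower()
        # Drop integers and alphanumeric tokens such as "25" or "A1".
        if token.isalpha() and token not in excluded:
            words.add(token)
    return sorted(words)

sample = "Minoboros i Shakib, 25 ruputan A1 nakaramit; i Sabah."
print(extract_wordlist(sample, exclude=["Shakib", "Sabah"]))
# ['i', 'minoboros', 'nakaramit', 'ruputan']
```

Note that `str.isalpha` also rejects tokens containing hyphens or apostrophes, so a real word list would need an explicit policy for such forms.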

Word Translation Database
In this step, the main reference books for creating the word list were the Kadazandusun dictionary [18], Kamus Malay-Dusun-English [19], and [20]. The list of 7079 words from the previous step was used to produce word-to-word matches between the Kadazandusun and the Malay language. Out of the 7079 words, 5663 were not found in the dictionary or were spelled differently. This poses a big challenge for word-to-word matching in the DMT.
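The scale of this challenge can be checked with simple arithmetic on the figures above. The Python sketch below also includes a small helper for splitting a word list by dictionary coverage; the helper name is illustrative and it is not taken from the system described here.

```python
def split_by_coverage(words, dictionary):
    """Separate words that have a dictionary entry from those that do not."""
    found = [w for w in words if w in dictionary]
    missing = [w for w in words if w not in dictionary]
    return found, missing

# Figures reported in the text.
total_words = 7079
not_found = 5663
matched = total_words - not_found
print(matched)                                # 1416
print(round(100 * matched / total_words, 1))  # 20.0
```

In other words, only about one in five of the collected words had an exact dictionary match, which limits what a purely dictionary-driven translation can cover.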

Result and Discussions
The DMT-based language translation algorithm is tested with 34 specific sentences, and sample translations are listed in Table 3. Kadazandusun word order is Verb-Subject-Object (VSO), whereas Malay is Subject-Verb-Object (SVO) [21], and 34 sentences are adequate to cover these sequences in this preliminary study. More sentences will be included in a future study that focuses on advanced phrase patterns.

Because Kadazandusun follows the VSO pattern, many of its sentence structures are the inverse of Malay sentence structures, with the predicate phrase in front of the subject phrase. However, there are certain phrases where the structure may be formed using the SVO order. All the translations in Table 3 that keep the Malay SVO sequence are incorrect in terms of sentence order, except for the Subject-Verb (SV) structure, as shown in examples 8 and 9 respectively (BM: Semua pelajar telah diberi uniform, DMT: Oinsanai susumikul nonuan uniform; and BM: Semua pemimpin masyarakat dijemput hadir, DMT: Oinsanai lalansanon ginumuan alapon rumikot).
Although parts of the language grammar are discussed in this paper, it should be noted that the explanations focus on the capability and output of the preliminary text translation using the algorithm in the DMT system, and on ways to overcome future problems in the translator. Based on the results in Table 3, two areas are explored and examined: the word order structure of the translation, and the quality of the translation from the MT perspective using the BLEU score.

Word Translation Arrangement Rules
First, the investigation of the DMT system from Malay (BM) to Kadazandusun (KD) looks into the basic grammar concept, which is VSO, and its extended usages such as VS, VSAp (Ap is Adverb of place), VSVO, VSOO, and VSOV, as well as two word order sequences that are similar to the Malay grammar structure, SVO and SV. To simplify the notation, the part of speech in the sentence structure for each example is omitted.
Another drawback of the basic word-to-word sequence translation (without linguistic alignment) in the current DMT system is that it tries to translate every word from the source, so incorrect or unsuitable words may be included in the translation. In a BM sentence, the word 'akan' is translated as 'atantu' in Kadazandusun simply because the word is available in the dictionary. In such a literal DMT translation, the meaning of the sentence is still understandable, although the sentence is grammatically wrong. Thus, to upgrade the translation: 1) scan the BM sentence to find the verb (V), which is to be placed at the beginning of the translation; 2) a word such as 'akan' attached to the verb carries the meaning of an action still to be done, so a different word form must be used instead of 'boli' ('bolian' in this case, indicating the future tense 'will buy' or 'akan beli'). In the next part of the sentence, the BM structure '… ikan untuk Nancy' is Object-Object, and KD follows this structure as '… sada i Nancy'. However, the DMT translates it as 'montok Nancy'. Future solutions for the DMT algorithm (grammar rule) are: a. […]

In this structure, DMT is able to provide a literal translation. Notably, the meaning of the translation is out of context if the BM sentence is about moving away from a certain place, and not 'to run'. In the BM sentence, the subject ('kami', we) is asking an object/someone ('dia', he/him) to move, but the translation says the reverse: 'he/him' is asking 'we/us' to run. In linguistic and grammatical terms, this is a critical translation error. Additionally, even if the intended meaning were 'to run', the literal translation by the DMT would still be wrong. Future solutions for the DMT algorithm (grammar rule):
a. Rule 6: extend Rule 1 (VSO) and Rule 3 (VS) to accommodate the VSOO order.
b. Rule 2: extend to check for a ligature or conjunction in the VSOO order.
The structure of the translation follows the SVO order; however, the selection of word pairs may not be correct, because 'pelajar' is translated as 'susumikul' when it should be 'tangaanak sikul'. Moreover, the word 'telah' was not removed from the translation. Future solutions for the DMT algorithm (grammar rule):
a. Rule 7: check for SVO in the BM sentence and match it with SVO in KD.
b. All rules: extend the rules to check whether certain KD words correspond to a combination of two words in BM or vice versa (e.g., pelajar = 'tangaanak sikul').
Again, the structure of the DMT output follows the SV order; however, the selection of words may differ from the correct translation due to word pair availability and the selection algorithm in the current system. As an example, 'pemimpin masyarakat' is translated as 'puru ginumuan', while the text in the Kadazandusun news normally uses 'lalansanon mogiigiyon'. This is one example where a translator from the news company may have used terms in the published news that differ from those of other human translators.
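The verb-first rearrangement that several of the proposed rules rely on can be illustrated with a short Python sketch. It assumes that each word already carries a role tag ("S", "V", or anything else), which would have to come from a separate analysis step that this paper does not describe. The function name and the optional `subject_particle` parameter are hypothetical; the particle handling is modeled only on the 'Minoboros i Shakib' example reported in the BLEU evaluation and is not a general grammar rule.

```python
def reorder_svo_to_vso(tagged, subject_particle=None):
    """Move verb tokens in front of subject tokens; keep all other tokens in order."""
    verbs = [word for word, role in tagged if role == "V"]
    subjects = [word for word, role in tagged if role == "S"]
    rest = [word for word, role in tagged if role not in ("V", "S")]
    # Optionally insert a particle before the subject when a verb precedes it.
    if verbs and subjects and subject_particle:
        subjects = [subject_particle] + subjects
    return verbs + subjects + rest

# Word-to-word DMT output 'Shakib minoboros' with assumed role tags.
dmt_output = [("Shakib", "S"), ("minoboros", "V")]
print(" ".join(reorder_svo_to_vso(dmt_output, subject_particle="i")))
# minoboros i Shakib
```

Because some KD sentences legitimately keep the SV or SVO order, a rule like this could not be applied unconditionally; deciding when it applies is the harder part of the problem.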

Translation Quality Evaluation using BLEU Score
While there are various MT assessment criteria (summarization, complexity, POS tag, frequency itemset, and association relational item), this initial research focuses on the BLEU score to present an overview of the current work. The BLEU score is calculated (according to Equation 1) for each translation against the reference (correct translation), based on translation quality without semantic meaning or detailed linguistic analysis. Table 4 presents the sample evaluation for each sentence. Across the sentence-level BLEU scores, the average is 0.6735, and about 85% of the sentences have a score above 0.5. With this average BLEU score, the DMT performance (limited to the given sample sentences) reaches the level described as average performance in [22]. However, the system still needs to rewrite the sentence structure to match the grammar and semantic meaning. Accordingly, to increase the translation performance, the basic sentence structure orders (as identified in the previous section) need to be applied. The translation is then tested on a longer text from newspapers with a human translation pair, in this case between Malay (Bernama News) and Kadazandusun (published human translation at New Sabah Times). The sample of the Bahasa Melayu news is shown in Figure 3, and the Kadazandusun translation text adapted from New Sabah Times is shown in Figure 4.
Furthermore, sample translation output from Bahasa Melayu to Kadazandusun using the DMT is presented in Figure 5. The purpose of this translation is to examine the performance of the translation on a relatively long text that may contain complex sentences. When the DMT translation is evaluated against the published news, the BLEU score is 0.8182 (taking the text as one whole) and 0.7273 (the average when the text is divided into paragraphs). Again, considering only the given sample, a high BLEU score indicates that the quality of the translation is comparable to human translation. However, it is important to note that BLEU does not evaluate whether the translation delivers the same meaning as the source. Thus, if semantic meaning is the concern, other MT evaluation metrics with advanced linguistic analysis should be used.

Finally, a longer text corpus from the collection (source indicated in Table 2) was prepared to test the capability of the translation system. A set of reference and candidate text files was constructed to evaluate the development of the Malay-Kadazandusun translation system. Both files contain 400 lines of text from Malay (Bernama News) and its Kadazandusun translation (human translation at New Sabah Times). The news content covers drug-related crime, water tariffs, development, health, illegal immigrants, entertainment, events, and politics. The purpose of using a longer text is to investigate the computational load of the language translation system and to explore new words that are not in the database. Based on the prepared text, the summary of information and the results of the translation using the BLEU score are shown in Table 5.

Based on Table 5, the number of words and the sentence length in KD are always greater than in the BM pair. This is because some KD translations add particles such as i, o, do, dot, no, po, and nopo nga. As an example, the word 'do' in 'Pinggisoman popoimagon do Kampung Bukit Giling' is a particle added to complete the translation of the Malay sentence 'Usaha mewartakan Kampung Bukit Giling'. In contrast, there are some instances where the vocabulary size for BM is larger than for KD, and some BM words can be replaced by a single word in KD. For example, 'telah menerima' in 'Kementerian Pembangunan Luar Bandar Sabah telah menerima laporan' is replaced with 'nakaramit', as shown in the translation 'Komontirian Kopotundaan Labus Kakadayan Sabah nakaramit ruputan'. The DMT maximum sentence length is similar to that of BM because the DMT translates word by word. For reference, Figure 6 lists samples with high BLEU scores and Figure 7 lists samples with low BLEU scores. The scores are displayed in Figures 8 and 9.

In Figure 8, the BLEU score is high despite some of the words not being translated (no word translation pairs in the database). This is because, first, the DMT output contains most of the words from the human translation regardless of their position. Therefore, to increase the score, words that have no translation pairs in the dictionary need to be added or updated. Secondly, sentences in the samples follow the SVO and SV structures, except in some parts where the structure is VSO. For example, 'Shakib berkata' is translated by the human translator as 'Minoboros i Shakib', but the DMT output is 'Shakib minoboros'. In another sample, the KD DMT uses different words compared to the human selection, such as 'Boyoon' instead of 'Luguan' for the word 'Ketua', 'Gipan' over 'Udang', and 'Sistom' for 'Sistem'. Sistom, luguan, and gipan are available in the dictionary Daftar Kata Bahasa Kadazandusun - Bahasa Malaysia [18].

The low BLEU scores shown in Figure 9 belong to news headlines. In headline translation, human translators prefer to condense the headline by removing certain words or rephrasing it, resulting in a translation that is different but still conveys a similar meaning. For example, in the sample in line 81 of the human translation, 'Jabatan Kesihatan Sabah' is omitted. In sample line 268, the word 'Sabah' was also dropped in the KD translation, so the BM headline was not translated completely. In the BM headline, the reason for forming the special committee was given, but this was omitted in the KD translation. The translator may have decided to shorten the headline so that readers would read further into the news article. This kind of translation that shortens a sentence is called text paraphrasing or summarizing; it is both a problem and an important domain in advanced text mining for machine translation with more linguistic power.
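One purpose of the 400-line corpus mentioned above is to find words that are not yet in the database. A minimal Python sketch of that search is given below, assuming a whitespace tokenizer and a set of known source words. The function name and the `known` set are hypothetical, and the two input lines are the BM sentences quoted earlier in this section.

```python
from collections import Counter

def count_unknown_words(lines, known_words):
    """Count alphabetic tokens that have no entry in the known word set."""
    counts = Counter()
    for line in lines:
        for token in line.lower().split():
            if token.isalpha() and token not in known_words:
                counts[token] += 1
    return counts

bm_lines = [
    "Usaha mewartakan Kampung Bukit Giling",
    "Kementerian Pembangunan Luar Bandar Sabah telah menerima laporan",
]
# Hypothetical subset of BM words that already have a KD pair in the database.
known = {"usaha", "kampung", "bukit", "telah", "menerima", "laporan"}
unknown = count_unknown_words(bm_lines, known)
print(unknown.most_common())
```

Sorting the result by frequency gives a simple priority list for dictionary updates, since the most frequent missing words are the ones that affect the largest number of sentences.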
In terms of computational load, processing a large text takes longer, in this case up to 32 seconds for 10,204 words. By comparison, Google Translate takes less than a second to process a translation but only accepts 5000 characters. A text of similar length (5000 characters) was also tested with the KD DMT, and the processing time was likewise less than a second. Thus, in future translation system development for public use, 5000 characters should be the maximum number of characters per translation task offered by the service. As for the BLEU score, the DMT achieved a score of 0.5, which is an average translation score that requires post-editing of the sentences.
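A service that enforces such a limit would need to split longer inputs into several requests. The Python sketch below shows one way to do so, assuming that chunks are packed greedily from whole words; the function name and this packing policy are illustrative, and splitting at sentence boundaries may be preferable in practice.

```python
def split_into_requests(text, limit=5000):
    """Greedily pack whole words into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for word in text.split():
        if len(word) > limit:
            raise ValueError("a single word exceeds the character limit")
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

# Small limit used only to make the behavior visible.
print(split_into_requests("satu dua tiga empat", limit=8))
# ['satu dua', 'tiga', 'empat']
```

For orientation, the reported timing of 32 seconds for 10,204 words corresponds to roughly 319 words per second.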

Conclusion
The Kadazandusun language is considered an under-resourced language due to its limited usage. However, the language is now being taught in selected schools and universities, and it is still being used in a local newspaper. In this paper, we discuss a preliminary experiment on MT for an under-resourced language, for the Malay-Kadazandusun pair, using direct machine translation (DMT). The findings in this paper agree with the study presented in [23], where challenges for e-translation tools, specifically for contextual understanding and translation quality, are imminent. This problem is far more difficult for a language with limited resources, where language preservation activities, experienced practitioners, and software development practitioners are still finding the best form of collaboration. In terms of implementation, the DMT system has been successfully implemented; however, the experiment shows that the current system requires upgrades and further development. The current limitations are: 1) the database content of KD words is limited compared to Malay words; 2) KD standard words and their spelling are yet to be confirmed by language experts, because the current system uses a vocabulary that was acquired from a KD newspaper (with existing spelling errors) and certain alignments from available KD dictionaries; 3) the translation is rule-based, with only minimal grammar checks to align the translation with the basic Kadazandusun word arrangement (VSO); 4) checks for word ligatures or conjunctions such as the particles i, no, po, nopo, and nopo nga are implemented only as basic rules; and 5) the BLEU score is relatively high, but post-editing is still required due to limitations 3 and 4.
As mentioned in the Word Translation Arrangement Rules section, seven rules are suggested for implementation in future developments of the DMT system, specifically to correct the word arrangement in the translation and the semantic meaning or equivalence of the sentences. Additionally, work on deep learning-based machine translation is still being investigated for a similar application. Finally, we hope that this paper will contribute to the future direction of multidisciplinary research in computing, language, and language translation.

Figure 2. The process in the translation system investigation

Figure 3. BERNAMA news archive for translation sample

Figure 4. Sample translation of the text in Figure 3 from New Sabah Times news archive

Figure 5.

Figure 6. Sample corpus translation from BM to KD with a high DMT BLEU score

Figure 7. Sample corpus translation from BM to KD with a low DMT BLEU score

Figure 8.

Figure 9. Sample source (BM), human translation (KD), and KD DMT output with low BLEU scores

Table 3. Translation sample

Table 4. BLEU score of the translation system against the human translation reference

Table 5. Summary of the text collection