Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning

The development of natural language processing resources for Albanian has grown steadily in recent years. This paper presents research on unsupervised learning and on the challenges associated with building a dictionary for the Albanian language and creating part-of-speech tagging models. Most languages have their own dictionary, but low-resource languages suffer from a lack of such resources. A dictionary built through natural language processing facilitates the sharing of information and services for users and whole communities. The experimental corpus for the Albanian language includes 250K sentences from different disciplines, together with a proposed part-of-speech tag set that can adequately represent the underlying linguistic phenomena. The purpose of this paper is to contribute to the development of resources for Albanian. Experiments with the Albanian corpus revealed that its use of articles and pronouns resembles that of higher-resource languages. According to this study, the total expected frequency has proven effective as a means of correctly tagging words and populating the Albanian language dictionary.


Introduction
Recent years have witnessed growing interest in low-resource languages and in the need to improve their resources. Albanian is an Indo-European language spoken by around 7 million native speakers, mainly in Albania and Kosovo and elsewhere in the Balkans. Albanian is one of the most diverse and interesting languages to study because of its complex grammar and inflection paradigms, which make morphological tagging extremely challenging [1]. Albanian is thus a low-resource language whose resources have improved gradually year by year. One approach that can be evaluated for Albanian is unsupervised learning, in which methods learn from raw text. Unsupervised learning is key to advancing machine learning methods because it unlocks access to almost unlimited amounts of data that can be used as training resources: it trains a model without pre-tagged or annotated data [2].
The primary task of part-of-speech tagging is to take a text as input and produce an output text in which every word is marked with a tag corresponding to a grammatical category such as noun, verb, adjective, or adverb. The tag assigned depends on the word's meaning and on its context. Each grammatical category contains words that share the same grammatical properties.
The Albanian language has grammatical categories such as nouns, verbs, adjectives, numerals, genders and determinants, person-number indexing, tenses, and active or passive voice. In grammar, a part of speech (also a word class, a lexical class, or a lexical category) is a linguistic category of words, which is generally defined by the syntactic or morphological behavior of the lexical item in question [11]. Tagging the Albanian language is especially challenging since it has extremely rich inflection paradigms, with 100 different inflection patterns for lexemes of the same syntactic category. Research and comparative studies of the Albanian language are rarely conducted in Natural Language Processing. A small annotated morphological corpus of Albanian inflected words, extracted from Wiktionary as part of the Universal Morphology project, was presented by Kirova et al. [14]. Annotations are done at the word level without regard to context and follow the Universal Morphology schema, which associates each inflected word with its lemma and a set of morphological tags. The corpus contains 33,483 word forms for 589 lemmas. Morphological analysis models have been trained and tested on this corpus. However, since the corpus contains individual words without sentence context, the data cannot be used directly to train a part-of-speech or morphological tagger.
For part-of-speech tagging, Kabashi et al. [15] proposed a set of part-of-speech tags after noticing that no tag set of moderate size existed. Building on their own previous work, they achieved around 70% accuracy. More recently, Kote et al. [11] presented an Albanian corpus of 118,000 tokens tagged with part-of-speech and morphological features. The corpus was annotated following the Universal Dependencies guidelines and went through a manual review process. The team also trained a neural morphological tagger and lemmatizer that achieved good results, the best of which was 92.74% in POS tagging. In the present study, 73 sources with a total size of 77 MB were processed. In addition to the source frequency, 631,008 words were tokenized and analyzed individually for frequency and average frequency difference (△M). The resulting Albanian corpus contains 250,152 tokens, making it the largest so far. This paper examines the challenges of unsupervised learning of morphology for low-resource languages, with Albanian as the case study.

Natural Language Processing
The primary purpose of natural language processing (NLP) is to understand and produce natural language at several levels: syntax (a language's structure), semantics, pragmatics, and dialogue. Natural Language Processing involves several steps, including tokenization, normalization, stemming, and part-of-speech tagging [8]. NLP also draws on different approaches and grammatical notions, such as inflection, derivation, tenses, semantic analysis, the lexicon, morphemes, and corpora. All of these approaches and rules were applied to the domain-based corpora for the Albanian language.

Tokenization
Tokenization is one of the steps of Natural Language Processing. Its purpose is to divide long strings into smaller chunks, or tokens [5]. Large chunks of text can be tokenized into sentences, and sentences can in turn be tokenized into words. Further processing is generally carried out only after a piece of text has been appropriately tokenized. Because the Albanian alphabet includes letters such as 'ç' and 'ë', one challenge during tokenization is that the text files must be encoded and read as UTF-8. Otherwise, the files would not be tokenized appropriately.
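The two-stage splitting described above (text into sentences, sentences into words) can be sketched in a few lines of Python. This is a minimal illustration rather than the tokenizer used in the study: the regular expressions, the function names `split_sentences`, `tokenize`, and `read_utf8`, and the decision to keep apostrophe and hyphen compounds as single tokens are all assumptions made for the example.

```python
import re

# Assumed token pattern: a run of letters (ç and ë included, since Python's
# str patterns are Unicode-aware), optionally joined by an apostrophe or a
# hyphen; or a run of digits; or a single punctuation mark.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*|\d+|[^\w\s]")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return the word and punctuation tokens of one sentence."""
    return TOKEN_RE.findall(sentence)


def read_utf8(path):
    """Read a source file explicitly as UTF-8 so that ç and ë survive."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


raw = "Ç'është kjo? Të rinjtë janë nga Tetova."
sentences = split_sentences(raw)
tokens = [tokenize(s) for s in sentences]
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform default encoding, which is where mis-decoded 'ç' and 'ë' typically come from.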

Normalization
The next step of Natural Language Processing is normalization, which puts all text on a level playing field: converting all text to the same case (uppercase or lowercase), removing punctuation, expanding contractions, converting numbers to their word equivalents, and so on [4]. Examples of items that need normalization are Roman numerals such as 'XX' and 'V'; hyphenated words such as 'shoqërore-ekonomike' (Eng. 'socio-economic') and 'juridiko-civile' (Eng. 'civil-legal'); and words with apostrophes such as 'ç'është', contracted from 'çfarë është' (Eng. 'what is'), and 't'i', shortened from 'të i' (Eng. 'to'). Normalization puts all words on a similar footing and allows processing to proceed uniformly.
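A hedged sketch of such a normalization pass is shown below. The contraction table contains only the two contractions mentioned in the text, and the function name `normalize`, the order of the steps, and the choice to drop (rather than spell out) digits and Roman numerals are assumptions made for illustration.

```python
import re

# Only the contractions mentioned in the text; a real table would be longer.
CONTRACTIONS = {"ç'është": "çfarë është", "t'i": "të i"}

# Assumption: a token made only of upper-case Roman digits is a numeral.
# This also drops a capitalised one-letter word such as "I", so a real
# system would need context to decide.
ROMAN_RE = re.compile(r"^[IVXLCDM]+$")


def normalize(tokens):
    """Lowercase tokens, expand contractions, drop numerals and punctuation."""
    normalized = []
    for token in tokens:
        if ROMAN_RE.match(token):
            continue  # Roman numerals such as 'XX' or 'V'
        token = token.lower().replace("’", "'")
        if token in CONTRACTIONS:
            normalized.extend(CONTRACTIONS[token].split())
            continue
        if token.isdigit():
            continue  # plain numbers
        if not any(ch.isalnum() for ch in token):
            continue  # pure punctuation
        normalized.append(token)
    return normalized


sample = ["Ç'është", "kapitulli", "XX", "?", "t'i", "2020", "Shoqërore-ekonomike"]
result = normalize(sample)
```

Hyphenated compounds are left intact here; whether to split them is a separate design decision that depends on how the dictionary entries are meant to be keyed.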

Stemming
In Natural Language Processing, stemming is the process of removing affixes (suffixes, prefixes, infixes, and circumfixes) from a word so that it may be processed further [2]. Examples of stemming in Albanian are shown in Figure 1.
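As a rough illustration of suffix stripping, the sketch below reduces the forms aktrim, aktrimi, aktrime, and aktrimet, which appear later in this paper, to a common stem. The suffix list is an assumption derived only from those four forms and is far from a complete description of Albanian inflection; the function name `strip_suffix` and the minimum stem length are likewise hypothetical.

```python
# Illustrative suffix list only, ordered longest first so that 'et' is
# tried before 'e'. It is derived from aktrim/aktrimi/aktrime/aktrimet.
SUFFIXES = ("et", "e", "i")


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


forms = ["aktrim", "aktrimi", "aktrime", "aktrimet"]
stems = {form: strip_suffix(form) for form in forms}
```

A rule list this small will over-strip many words, which is one reason a morphological component is needed for a language with paradigms as rich as Albanian's.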

Corpus
In Natural Language Processing, a corpus is a collection of texts. Such a collection may consist of texts in a single language or span many languages; multilingual corpora are helpful for numerous reasons. There have been several attempts to build a corpus for NLP tasks in the Albanian language. Still, they have significant drawbacks: small corpus sizes, different formats fitted to individual tasks, and, moreover, a lack of public availability. This paper documents the creation of an Albanian corpus that includes 250k sentences from texts of various fields in electronic sources.

Part-of-speech (POS) Tagging
Part-of-speech tagging is a fundamental step in Natural Language Processing. The most familiar form of part-of-speech tagging is identifying words as nouns, verbs, adjectives, etc. The Albanian language has some properties that make creating a part-of-speech tag set difficult [6]. A challenge in building a dictionary for a low-resource language (in this case Albanian) is that a part-of-speech tag set that can adequately represent the underlying linguistic phenomena is difficult to build. This difficulty is due to linguistic differences and a partial lack of standardization in the use of new or foreign words, especially in unedited electronic sources. Kabashi et al. presented a corpus of approximately 2000 manually annotated sentences [3]. Some of those sentences were selected randomly from texts of different genres; the remaining sentences were selected manually to cover a wider variety of linguistic phenomena in the sample corpus. The corpus presented in this paper includes 250k sentences, which makes it the largest corpus proposed for the Albanian language. A tagger for the Albanian language is yet to be completed, but we report work to date.
Nouns are inflected for case (nominative, accusative, dative, genitive, and ablative); the inflection suffixes for the dative, genitive, and ablative forms are identical, so the distinction between them can only be made from context. Nouns are also marked for definiteness (indefinite and definite): indefinite forms like '(një) vajzë' (Eng. '(one/a) girl') can be distinguished from definite forms like 'vajza' (Eng. '(the) girl') [19].
Albanian nouns are marked for gender (feminine, masculine, and neuter) and number (singular and plural). However, the majority of Albanian nouns change gender in the plural form. The forms aktrim/aktrimi, for example, are masculine in the singular and feminine in the plural (aktrime/aktrimet). Hence they are known as heterogeneous nouns [7]. This phenomenon makes checking agreement difficult without a morphological component. Table 1 shows the proposed noun tags for the Albanian language. As an example, consider the sentence 'Të rinjtë janë nga Tetova.' (Eng. 'The young people are from Tetova.'), which is analyzed as Të\Art rinjtë\NArt janë\V nga\Prep Tetova\Nm .\Punct. In the Albanian language, the same morphological categories also apply to adjectives. Adjectives describe a person or thing in the sentence and must agree with the gender and number of the noun. For instance, compare a masculine and a feminine example: "Ky është Andi, vëllai im." (Eng. "This is Andi, my brother.") for the masculine, and "Kjo është Gea, motra ime." (Eng. "This is Gea, my sister.") for the feminine. The demonstrative differs by gender, 'ky' for masculine singular and 'kjo' for feminine singular, whereas English uses 'this' for both genders [18].
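The word\Tag notation used in the analyzed sentence above is easy to turn into (word, tag) pairs for further processing. The sketch below assumes that tokens are separated by whitespace and that the last backslash in each item separates the word from its tag; the function name `parse_tagged` is hypothetical.

```python
def parse_tagged(line):
    """Split a 'word\\Tag word\\Tag ...' line into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last backslash, so the tag is always
        # the part after it.
        word, separator, tag = item.rpartition("\\")
        if not separator:
            raise ValueError(f"missing tag in {item!r}")
        pairs.append((word, tag))
    return pairs


# A raw string keeps the backslashes literal.
tagged = r"Të\Art rinjtë\NArt janë\V nga\Prep Tetova\Nm .\Punct"
pairs = parse_tagged(tagged)
tags = [tag for _, tag in pairs]
```

The resulting pairs have the same shape as the tagged sentences that most tagger training and evaluation code expects.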
Likewise, the possessive pronoun differs between 'im' in the masculine singular and 'ime' in the feminine singular, both of which English renders with the gender-neutral 'my'. The five proposed tags for adjectives in the Albanian language include separate tags for adjectives that occur before nouns, for adjectives with preposed articles, and for non-inflecting adjectives [20]. Table 2 shows the proposed adjective tags. Adjectives in the Albanian language have three degrees: positive, comparative, and superlative [11]. Gradation is realized by combining the base word with the comparative particle 'më', e.g. (1) positive: 'e bukur' (Eng. 'beautiful'), (2) comparative: 'më e bukur' (Eng. 'more beautiful'), and (3) superlative: 'më e bukura' (Eng. 'most beautiful').
Numerals in the Albanian language are classified into cardinal and ordinal numbers. Ordinal numbers have the same properties as adjectives, except gradation, and are always preceded by an article. As an example, consider the sentence: 'Kjo ishte fitorja e saj e tretë brenda një muaji.' (Eng. 'This was her third victory within one month.'), which is tagged as Sot\Adv ishte\V fitorja\N e\Art tretë\NumO e\Art saj\PossPr brenda\Prep një\NumC muaji\No .\Punc.
In the Albanian language, pronouns are classified into subtypes according to their specificity. For example, a relative pronoun is distinguished from an interrogative pronoun or a personal pronoun [10]. Each distinguished type of pronoun has its own tag, as shown in Table 4.
Like nouns or adjectives, some pronouns can be preceded by a preposed article. For example, the interrogative pronoun 'cili' (Eng. 'who') becomes the relative pronoun 'i cili' (Eng. 'which') when preceded by the article. This suggests that the article and the pronoun need to be treated together.

Evaluation of Experiments
We conducted our experiments using the Natural Language Toolkit (NLTK) to process the Albanian corpus. NLTK is a suite of open-source Python modules for linguistic purposes. It covers both symbolic and statistical natural language processing, is interfaced to annotated corpora, and provides many Natural Language Processing data types, processing tasks, corpus samples, and problem sets [16][17]. We chose Python as an object-oriented language so that data and code could be encapsulated and reused easily. The corpus was built by collecting Albanian texts from various fields, such as computer science, medicine, economy, politics, and tourism. As discussed so far, building a dictionary for the Albanian language is challenging. The corpus has 631,008 words, of which 250,152 unique words were identified after removing duplicates and characters such as hyphens, apostrophes, numbers, and Roman numerals. Table 5 shows some of the words derived from all sources, which were used to calculate the expected frequency and the average differences between them. An analysis of source frequency, expected frequency, and average difference against the total number of appearances showed that the word 'të' ranks highest, with an expected frequency of 0.08033, an average difference of 0.013365, a source frequency of 73, and 588394 total appearances; it is followed by the words 'e', 'në', 'vendet', 'thënë', 'shprehjes', etc. We also compared the results obtained for Albanian with those for English and French. The English corpus includes approximately 229,000 words and 11,150 tokens, while the French corpus has 70,000 words and 9,150 tokens. Table 6 summarizes the most frequent English and French tokens for comparison with the Albanian language.
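The per-word statistics named above can be sketched as follows. The paper does not give formulas for them, so the definitions used here are assumptions made purely for illustration: source frequency is taken to be the number of sources in which a word occurs, expected frequency the mean of the word's relative frequency across sources, and average difference the mean absolute deviation of the per-source relative frequencies from that mean. The function name `word_statistics` and the toy data are hypothetical.

```python
from collections import Counter


def word_statistics(sources):
    """Compute per-word statistics over a list of token lists (one per source).

    All definitions below are assumptions, not the paper's stated formulas.
    """
    counts = [Counter(tokens) for tokens in sources]
    sizes = [len(tokens) for tokens in sources]
    vocabulary = set().union(*counts)
    stats = {}
    for word in vocabulary:
        # Relative frequency of the word in every non-empty source.
        relative = [c[word] / n for c, n in zip(counts, sizes) if n]
        expected = sum(relative) / len(relative)
        stats[word] = {
            "source_frequency": sum(1 for c in counts if c[word] > 0),
            "expected_frequency": expected,
            "average_difference": sum(abs(r - expected) for r in relative)
            / len(relative),
            "total_appearances": sum(c[word] for c in counts),
        }
    return stats


# Toy data: two tiny "sources" standing in for the 73 used in the study.
toy_sources = [["të", "e", "të", "në"], ["të", "vendet"]]
toy_stats = word_statistics(toy_sources)
```

Under these assumed definitions, a word that is both frequent and evenly spread across sources has a high expected frequency and a low average difference, which matches the profile reported for function words such as 'të'.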
The most frequent token in the Albanian corpus is 'të', with a frequency of 2.92%. In the English corpus, the most frequent token is 'the', with 3.24%. In the French corpus, 'de' is the most frequent word, with a frequency of 4.07%.
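A percentage of this kind can be obtained by dividing the count of the most frequent token by the total number of tokens in a corpus. The sketch below assumes the corpus is already tokenized and normalized; the function name `top_token_share` and the toy token list are illustrative only and do not reproduce the figures reported here.

```python
from collections import Counter


def top_token_share(tokens):
    """Return the most frequent token and its share of all tokens in percent."""
    counts = Counter(tokens)
    token, count = counts.most_common(1)[0]
    return token, 100.0 * count / len(tokens)


toy_tokens = ["të", "e", "të", "në", "të", "e", "vendet", "thënë"]
top_token, share = top_token_share(toy_tokens)
```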

Conclusion
This paper discusses the significance of the corpus presented in this research, which places the Albanian language in a different linguistic context. Work toward a morphological tagger and a corpus of 250,152 tokens are presented for the Albanian language. If the Albanian language is incorporated into the European languages group, and more public documents are translated into this language, there will be plenty of scope for improving this corpus. We expect tagger accuracy to improve further with more sources, since we intend to train relatively good tagger models on a larger corpus to allow fully automatic tagging.