Recommender System for Multiple Databases Based on Web Log Mining

Wan Hussain Wan Ishak and Nurul Farhana Ismail, "Recommender System for Multiple Databases Based on Web Log Mining", Annals of Emerging Technologies in Computing (AETiC), Print ISSN: 2516-0281, Online ISSN: 2516-029X, pp. 187-193, Vol. 5, No. 5, 20th March 2021, Published by International Association of Educators and Researchers (IAER), DOI: 10.33166/AETiC.2021.05.023, Available: http://aetic.theiaer.org/archive/v5/v5n5/p23.html. Review Article


Introduction
Information is knowledge stored in various forms, both printed and digital. Digital storage has been widely adopted to reduce dependency on printed materials, especially paper and ink, and to minimise physical storage space. To date, digital technology has had an impact on economic and environmental performance [1]. Digital materials have many advantages: they reduce costs, are environmentally friendly, are easy to disseminate and share through any communication medium, and are easy to print when necessary.
The storing and retrieval of information in digital form is handled by an information retrieval (IR) system [2]. The IR system consists of a database system and a search interface [3]. The search interface is equipped with a search engine that can be used to retrieve information using queries [4,5]. Information enters the database through either a manual or an automatic procedure. The database is typically large because the information accumulates over the years. With the evolution of big data, the traditional IR model needs to be improved [6].
Recommender systems (RS) have been used to improve existing IR systems. An RS can support search by recommending items related to user interests [7,8]. It can assist searchers in finding information and in creating personalized content [9]. For instance, based on the search history, the RS can provide suggestions to the searcher [7,10].
Data mining, specifically web log mining, is one approach that can be used to analyse web logs by considering the history of search activity. The log contains data on search history, database access, and downloads. The log also stores some of the searcher's information, such as Internet Protocol (IP) address, session, date, and time. These data can be utilised by the RS to help other searchers find similar information.
This paper presents an RS model for a multiple independent database system using web log mining. A multiple independent database system is a system that contains a collection of online databases owned or operated by different service providers or publishers. Examples of online databases are ACM Digital Library, EBSCOhost, Emerald, ERIC, JSTOR, and ProQuest. Searching in online databases is a very tedious and time-consuming process [11]. Each database may require a separate login and has its own search system, so users have to switch between the search systems. Additionally, users may not be able to determine which database is the most suitable or contains what they want. Therefore, the RS can be implemented as an interface to the multiple independent database system and provide recommendation assistance, such as which database should be used.

Literature Review
The RS is a user support system for searching and finding information, services, items, or products on the World Wide Web (WWW). It is an application system and method that provides recommendations to help users decide on the best items [12]. It has been used in conjunction with the IR system to give users relevant suggestions for items or products that they might be interested in [7,13]. According to Burke [14], the design of the recommendation engine is based on the domain and the specific features of the available data.
Typically, RS are categorized into three types: collaborative, content-based, and hybrid systems [8,15]. Collaborative filtering generates recommendations by analysing historical interactions. It can be divided into neighbourhood-based and model-based approaches. The neighbourhood-based approach relies on the ratings made by other users, who are selected based on their similarity to the target user. The model-based approach employs a statistical model to generate the recommendation. The pure collaborative model relies heavily on the user rating matrix and treats all users and items as atomic units. This limitation has been overcome by considering the user profile, an approach known as content-based filtering. This approach evaluates the representations that describe the items and compares them with the user's interests. Content-based and collaborative recommenders can be combined to leverage the strengths of both approaches. This technique is called a hybrid RS.
Generally, the RS has been implemented in various fields such as entertainment, e-commerce, and services. E-commerce sites have implemented RS to recommend products to their customers by utilizing the customers' preferences and other customers' purchase histories [16]. According to Herlocker et al. [17], the RS has many advantages in e-commerce. Among the benefits, it helps buyers find the products that best match their needs and reduces the searching and browsing effort.
According to Ricci et al. [12], the development of the RS grew out of a simple observation: a person's decisions usually depend on suggestions given by others. Based on their study, they highlighted the importance of the RS in increasing the number of items sold, selling more diverse items, increasing consumer satisfaction, improving consumer well-being, and better understanding consumers' needs.
In the tourism field, the RS can be used to assist travellers in finding the destination that best suits their interests [18]. Alrasheed et al. [18] proposed an RS that provides users with a set of destinations preferred by similar travellers in order to build a list of recommended destinations. The RS then ranks the list based on user preferences and constraints. Esmaeili et al. [19] proposed a social-hybrid RS for suggesting tourist attractions. Their study utilised several factors, such as similarity of travellers' interests, trust, reputation, relationships, and social communities. These factors can increase the quality of the RS recommendations.
Data mining has good potential in RS research. Data mining is one of the steps in knowledge discovery in databases (KDD) [20]. KDD is the automated exploratory analysis and modelling of big data repositories [21]. It involves a nontrivial process of extracting useful patterns from data [22]. In KDD, data mining is the process of applying data analysis and appropriate algorithms to generate meaningful patterns over a large dataset. For instance, Mohsin et al. [23] demonstrated that mining historical data provides useful patterns that can be used in present decision making and planning. Furthermore, data mining has been shown to generate good-quality knowledge for decision makers [24].
Etzioni [25] introduced the concept of web mining to find, identify, uncover, and explore information, documents, and knowledge from the web. Etzioni also emphasised that web mining can support searchers in browsing, searching, and visualising web content. Web mining comprises two categories: web content mining and web usage mining [20]. Web content mining is the process of obtaining interesting information from web pages [26]. Web content can be classified into four types of data, namely unstructured, structured, semi-structured, and multimedia content [27]. Web usage mining can be applied to find web usage patterns [26]. It can be used to predict searchers' preferences and behaviour [28].
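The neighbourhood-based approach described above can be illustrated with a minimal sketch. It assumes cosine similarity over sparse rating dictionaries and a similarity-weighted average of the neighbours' ratings; the function names, the user and item labels, and the ratings are hypothetical and serve only as an illustration, not as the method of any cited work.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, user, item, k=2):
    """Similarity-weighted average of the k most similar users' ratings of item."""
    neighbours = [
        (cosine_similarity(ratings[user], ratings[other]), ratings[other][item])
        for other in ratings
        if other != user and item in ratings[other]
    ]
    neighbours = sorted(neighbours, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbours)
    if total_sim == 0:
        return None  # no similar user has rated this item
    return sum(sim * rating for sim, rating in neighbours) / total_sim

# Hypothetical ratings: u1 has not rated item "C" yet.
ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 5, "B": 3, "C": 4},
    "u3": {"A": 1, "B": 5, "C": 2},
}
predicted = predict_rating(ratings, "u1", "C")
```

Because u1's ratings resemble u2's more than u3's, the prediction for item "C" is pulled towards u2's rating.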

Methodology
This study involved three main phases: identifying search keywords and databases, frequent keyword analysis, and recommendation modelling.
The first phase comprised three main steps: data selection, data pre-processing, and search keyword detection. Data selection was concerned with selecting appropriate data for this study, while data pre-processing defined a series of actions to be undertaken before the data could be used in the experiment. The search keyword detection aimed to identify the keywords used in each query and the associated database.
The case study was the library of Universiti Utara Malaysia (UUM). Therefore, the web logs of the online databases to which the library subscribed were obtained. These data were secured under the eResources application, which was one of the online services managed by the library. The dataset of server log data from the eResources application under the PSB website consisted of 32 log files. Each log file contained 500,000 lines.
The data were cleaned and the keywords used by the searchers were identified. The keywords were found in Uniform Resource Locators (URLs) containing parameters such as "query" and "keyword". Figure 1 shows an example of the keywords identified in the web log. The keywords were extracted, and the database associated with each keyword was stored. As shown in Figure 1, "sukuk+structure" and "continuous+education+and+retention" were among the keywords identified from the log. In the next step, all stop words were detected and removed.
In the second phase, the search keywords obtained from Phase 1 were analysed. The process involved sorting and grouping the keywords and their associated databases. The frequency analysis was performed using Microsoft Excel to obtain the frequent search keywords.
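The keyword detection and stop-word removal steps described above can be sketched as follows. This is only an illustration under stated assumptions: the parameter names in `KEYWORD_PARAMS` follow the two examples named in the text, the stop-word list is a tiny placeholder rather than the list used in the study, the URL host is used as a stand-in for the database name, and `db.example.org` is a hypothetical address.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed parameter names; the text names "query" and "keyword" among others.
KEYWORD_PARAMS = ("query", "keyword")
# Small illustrative stop-word list, not the one used in the study.
STOP_WORDS = {"and", "or", "the", "of", "in", "a", "an", "to", "for"}

def extract_keywords(url):
    """Return (host, [terms]) for a search URL, or None if no keyword is found."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # decodes '+' in values to a space
    for name in KEYWORD_PARAMS:
        if name in params:
            raw = params[name][0]
            terms = [t for t in raw.lower().split() if t not in STOP_WORDS]
            # The host is only a proxy for the database name (assumption).
            return parts.netloc, terms
    return None

# Hypothetical URL carrying one of the keyword strings quoted in the text.
result = extract_keywords(
    "http://db.example.org/search?keyword=continuous+education+and+retention"
)
```

Here `result` holds the host together with the terms "continuous", "education", and "retention"; the stop word "and" has been removed.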
Finally, the RS model was designed and developed to illustrate the workability of the actual RS. The model was implemented using the PHP (Hypertext Preprocessor) language and a MySQL database. Figure 2 shows the proposed design of the RS interface. The system acted as an interface to the existing eResources system hosted by PSB UUM. The system was able to suggest which databases contained the keywords that the user had entered.
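The idea of suggesting databases for an entered keyword can be sketched as a lookup over (keyword, database) pairs mined from the log. The sketch below is written in Python for brevity and is not the PHP and MySQL implementation reported here; the function names, the ranking by search count, and the sample records are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_index(records):
    """Map each keyword to a Counter of the databases it was searched in."""
    index = defaultdict(Counter)
    for keyword, database in records:
        index[keyword.lower()][database] += 1
    return index

def recommend(index, query, top_n=3):
    """Rank databases by how often the query's terms were searched there."""
    scores = Counter()
    for term in query.lower().split():
        scores.update(index.get(term, {}))
    return [database for database, _ in scores.most_common(top_n)]

# Made-up (keyword, database) pairs; the counts are not from the study.
records = [
    ("management", "IEEE"),
    ("management", "IEEE"),
    ("management", "ERIC"),
    ("theory", "ERIC"),
]
index = build_index(records)
suggestions = recommend(index, "management")
```

With these sample records, "management" was searched more often in IEEE than in ERIC, so IEEE is suggested first. A keyword that never appears in the log yields an empty list.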

Findings
Based on the analysis, 19,146 keywords were identified and retrieved. In addition, 11 databases associated with the keywords were identified. This finding is depicted in Table 1. Table 1 shows that IEEE was the most popular database, with 4,862 keywords. Of these, 3,418 keywords were unique, while the rest were repetitions. Serial Solutions was the second most popular database, with 3,043 keywords. Each of the other databases recorded fewer than 2,000 keywords.
The analysis of keyword frequency together with the associated databases is depicted in Table 2. The table shows that most of the repeated keywords were used twice or thrice. Databases such as IEEE, Serial Solutions, ScienceDirect, EBSCOhost, and ERIC contained keywords that were repeated more than three times. The IEEE database had nine keywords that were repeated more than 10 times, while Serial Solutions had five. Table 3 presents the 30 most frequent keywords identified from the databases. Keywords such as "management", "performance", "entrepreneur", "system", "service", "business", "theory", and "relationship" were used in all databases. The keyword "management" was the most popular, with 68 searches across all databases. By contrast, "methodology" was less sought after by the searchers.
A usability test was also conducted on the recommender search interface. Based on the feedback, most of the respondents stated that they were familiar with the functionalities of online database systems. About 90% of the users agreed that they needed guidance when using the prototype for the first time, while the other 10% said the system was easy to use and user friendly. The feedback also showed that all the users agreed that the existing eResource module did not have all the functions and capabilities they expected. Most of the users expected the eResource module to be capable of helping users search for information. They also agreed on having an RS for this eResource module. Overall, most of the users stated that the proposed RS interface was simple and easy to use. They also agreed that the interface and the information provided were clear and effective in supporting the searchers.
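The per-database counts reported above (total keywords, unique keywords, and how often keywords repeat) could be reproduced programmatically instead of in a spreadsheet. The sketch below is one possible way, assuming that "unique" means distinct keyword strings; the function name and the sample records are hypothetical and the counts are not from the study.

```python
from collections import Counter, defaultdict

def summarise(records):
    """Per database: total keywords, distinct keywords, and repeat distribution."""
    per_db = defaultdict(Counter)
    for keyword, database in records:
        per_db[database][keyword] += 1
    summary = {}
    for database, counts in per_db.items():
        summary[database] = {
            "total": sum(counts.values()),
            "unique": len(counts),  # assumption: unique = distinct keywords
            # maps n to the number of keywords that occurred exactly n times
            "times_repeated": dict(Counter(counts.values())),
        }
    return summary

# Made-up (keyword, database) pairs for illustration only.
records = [
    ("management", "IEEE"),
    ("management", "IEEE"),
    ("system", "IEEE"),
    ("theory", "ERIC"),
]
summary = summarise(records)
```

For the sample data, IEEE has three keywords in total, two of them distinct: one keyword occurred twice and one occurred once.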

Conclusion
This study shows that library users actively use online databases to find and obtain information for academic research purposes. The patterns reveal that users search for information in more than one database. This can be seen from the use of the 30 most popular keywords in most of the databases. The proposed RS interface was found to be beneficial to the users, assisting them in finding information in the "right" databases.
The major constraint faced by this study was the data size. The data were very large and took a great deal of effort to process. Furthermore, the raw data contained a large amount of irrelevant entries ("rubbish"). As a result, the quality of the data was affected, and some of the data might have been lost during processing.
Based on the nature and size of the web log data, an automatic mechanism for extracting useful data needs to be designed and developed. The mechanism should be able to distinguish the major fields such as keywords, database name, session, and user profile. A dedicated study on this automatic extraction is therefore warranted, as such a mechanism would be a useful tool for research in this field.
Future studies should also include the user profile among the information to be extracted. User profiles can improve the recommendations, as users with similar profiles might share the same interests.
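As a starting point for such a mechanism, the sketch below splits one log line into named fields. It assumes that the logs follow the Apache Common Log Format, which is not confirmed here; the regular expression, the field names, and the sample line are illustrative only, and the session and user profile fields would need additional sources.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_line(line):
    """Return the main fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # treat unparseable lines as noise to be discarded
    return {
        "ip": match.group("ip"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "url": match.group("url"),
        "status": int(match.group("status")),
    }

# Hypothetical log line; the IP address is from a documentation range.
sample = (
    '192.0.2.1 - - [20/Mar/2021:10:15:30 +0800] '
    '"GET /search?query=sukuk+structure HTTP/1.1" 200 512'
)
fields = parse_line(sample)
```

The extracted `url` field can then be passed to a keyword extraction step, while lines that do not match the pattern are dropped during cleaning.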