A Formal OLAP Algebra for NoSQL based Data Warehouses

Shreya Banerjee, Sourabh Bhaskar, Anirban Sarkar and Narayan C. Debnath, "A Formal OLAP Algebra for NoSQL based Data Warehouses”, Annals of Emerging Technologies in Computing (AETiC), Print ISSN: 2516-0281, Online ISSN: 2516-029X, pp. 154-161, Vol. 5, No. 5, 20th March 2021, Published by International Association of Educators and Researchers (IAER), DOI: 10.33166/AETiC.2021.05.019, Available: http://aetic.theiaer.org/archive/v5/v5n5/p19.html. Research Article


Introduction
Modern Data Warehouses (DW) solutions demand to act more in internet-style than to enforce the user to act within predefined structures [1]. Consequently, nowadays DWs need to handle a variety of subject areas, diverse data sources and heterogeneous data types like structured, semistructured and unstructured. Accordingly, On Line Analytical Processing (OLAP) operations require dealing with related business queries based on that irregular information [2]. To manage these new characteristics of DWs, business analysts focuses on using of NoSQL databases.
Flexible deployment, high read/write efficiency as well as scaling to very large data sets -these are remarkable features of NoSQL databases [3]. Yet, these databases are categorized based on various data models at physical level such as Document Store, Key-Value stores, Graph databases and Column-Family store [4]. Each physical level data model has their own approach towards handling OLAP algebra. In general, every kind of NoSQL database has a query language of its own. For example, Cassandra database has developed Cassandra Query Language (CQL); MongoDB query language is used in MongoDB database; Neo4j database has Cypher query language etc. [5]. Thus, lack of a common specification of OLAP operations over different NoSQL databases make serious problems when DWs using these databases are required to be portable. This challenge creates a research question, that how to provide a uniform standard towards OLAP operations for distinct types of NoSQL based DWs. www.aetic.theiaer.org This paper is aiming to address the aforementioned research question. The research methodology followed in this paper is described next. Ontology is applied to resolve the challenge. It is defined as an explicit specification of shared conceptualization [6]. Axioms are used to enable the ontology to provide enriched and formal semantics towards related concepts. OLAP operations on different NoSQL based DWs are varied due to both syntactic and semantic differences. These variances need to be decreased to get a standard specifications of OLAP operation over disparate NoSQL based DWs [7]. An ontology based specification can provide common conceptualization towards the elements of DW domain in terms of concepts and related axioms. Thus, syntactic differences can be omitted. In this context, the ontology driven conceptual model described in [8] is adopted to express a set of OLAP operators and operations formally. Further, semantics differences among OLAP operations can also be omitted with the help of ontology. Figure 1 has described the overall process. Although, the proposed conceptualization is implemented in a document-oriented database, it can also be implemented in other NoSQL based DWs.

Related Work
In literature, several research works exist related to formalization and implementation of OLAP queries on NoSQL based DW. In [9], authors have described ways to implement columnar NoSQL DW (CN-DW) and OLAP queries in Hbase. In [10], authors are using OLAP queries to know about the popularity in recent tweeter trends. In [11], authors used dice and drill-down operation to evaluate the performance on different enterprise scenarios of columnar family. In [12], authors have proposed an approach where an ontology serves as superimposed conceptual layer between multidimensional data and business analysts. The Ontology based OLAP is proposed using UML (Unified Modelling Language) diagram. In [13], a model is described for extracting OLAP dimensions from document-oriented SQL database based on parallel similarity techniques. In [14], authors have presented a Personalization System based on three interrelated ontologiesresources ontology, DW ontology, and domain ontology. They presented these three ontology models in UML and in OWL (Web Ontology Language). However, in all these approaches any common formal specifications of OLAP operations over distinct NoSQL DWs are not provided.
Majority of existing works described OLAP operations specific to its physical level implementations. However, very few works have focused on formal representation of OLAP algebra. Moreover, very few proposals have addressed how to adapt flexible data for OLAP in NoSQL based DW systems. In this context, this paper proposed a universal OLAP interface for disparate NoSQL based DWs. The proposed uniform OLAP interface is devised based on formal semantics of OLAP operators and operations and further implemented in a document-oriented NoSQL based DW.

Summarization of Ontology Driven Conceptual Modelling of NoSQL based Data Warehouses
The conceptual model described in [8] has three main layers namely -Collection (Top-Most layer), Family (Intermediate Layer) and Attribute (Bottom-Most Layer). Attribute layer realizes the measure and dimension attributes of DWs. Family layer represents fact and dimension hierarchies in DW. Further, the data cubes based on facts are mapped towards Collection layer. Attribute layer has its construct types -Attribute (AT). Likewise, Family layer has construct type -Family (FA) and Collection layer has construct type Collection (col). AT is the group of all possible instances of a data item. This can be classified in two types namely-Measure Attribute (MAT) and Dimension Attribute FF has single level. A DF can be decomposed into multiple levels to form the dimension hierarchies. Col is created from group of FF those are semantically related. Thus, from the top level a whole DW can be observed as a group of cols. Cube can be created from FF and realized as a col. Further, using different relationships, distinct types of constructs in the conceptual model are linked with each other. These relationships are of two kinds -Inter-layer kind and Intra-layer kind. Containment and Inverse Containment relationships are included towards both intra-layer kind and inter-layer kind relationships category. Further, Association relationship can only be included towards Inter-layer kind relationship group. In addition, different relationships of this conceptual has distinct properties such as Cardinality, Modality, and Ordering. Figure 2 has illustrated the conceptual model described in [8].

Proposed OLAP Algebra for NoSQL based DWs
Proposed OLAP algebra is classified in two groups. Those groups are OLAP operators and OLAP operations. Two operators are included in the first category namely, Select and Aggregate operator. On the other hand, five types of operations are included in the second category. These five operations used those two operators.

Proposed OLAP operators
Formal representations of two OLAP operators are proposed next. (a) Select Operator ( ): This operator will extract the dimension and its hierarchy from dimension family depending on some predicate p. This can be atomic predicate, denoted as p or it can be a composite predicate denoted as 1 < > 2 < >. . . . < > . In the composite predicate, <op> acts as a logical operator such as AND, OR etc. The p can be either dimensional family (DF) or dimension hierarchy (DFH). The algebraic notation of the operator is ( ) = Here DF is the original dimension family on measure and DFo is the output dimension family on measure after the restriction. Null predicate operator will return the original DF. Hence The aggregate operator will perform the grouping function GF on measure attribute ( ) of the specified set of DFs in a cube C. The GF is the relational aggregation function, which will operate on the only. These GFs can be SUM, MIN, MAX, AVG, and COUNT. The algebraic notation of the aggregate operator is

Proposed OLAP Operations
In this section, five OLAP operations are formally specified. The slice operation pick out one specific dimension from an input cube and provides a new sub-cube. The algebraic notation for the slice operation is The dice operation picks two or more dimensions from an input cube and provides a new sub-cube. The algebraic notation for the dice operation is ( ) Here, CON is the condition defined as,

Illustration of Proposed OLAP Algebra Using a Case Study
In this section, proposed OLAP algebra is illustrated using a case study described in [8]. The case study is based on sales and shipping. Sales of different products can be done in sale branches. Branches can be located in multiple locations. Shipping can have multiple shippers who will shipped the product from one location to another.  Distinct features of this described case study is irregular. This requires flexible data representation. Consequently, NoSQL databases are required to demonstrate these data set in DWs. Figure 3 represents the main elements of the case study as described in [8]. Data cubes related to the case study can be realized through distinct cols based on different FF.
Several queries founded on the proposed OLAP algebra are demonstrated next using the described case study.
Query 1: Find the derived dimension of Time for the month "November". Select operator ( ) is required to accomplish this query. The formal expression of the query is as, .
Query 2: The total number of Electronics product type units sold across all of the dimensions (Time, Location, Branch, and Shipper).
Slice operation is required to accomplish this query. The formal expression of the query is as, . _ ( )) Here, slice operation is accomplished for the dimension "Product" based on the criterion Product.ProductType.productType_Name = "Electronics". Query 3: The total unit sold for a particular product type "electronics", city "Durgapur" and month "November".
Dice operation is required to accomplish this query. The formal expression of the query is as, ℎ=" " ( ))) The dice operation is performed on the cube using the following selection criteria. The criteria involves three dimensions -Product Type Name = Electronics, City = Durgapur, and Month = November. Query 4: Find the total unit sold across all product by increasing the aggregation levels of time: from Day to Year (Day→Month→Year).
The roll-up operation is required to accomplish this query.

Implementation of Proposed Algebra
In this section, proposed OLAP operators and operations are implemented using MongoDB. The case study described in section 5 is used to illustrate the implementation. The transformation between the conceptual model and MongoDB is described in [8]. Figure 4(a) represents the general form of Select operator. Figure 4(b) specifies the general form of Aggregate operator. In these figures, Dij is the dimension with related hierarchy. Dimension number is represented through i and j is used for changing hierarchy level in a particular i th dimension. The XYZ represents MAT.
In these general forms, a cube represents a dimension, or a fact with related dimensions or a view. A dimension and a fact with related dimensions can be implemented as "Collection" in MongoDB [8]. Further, views can be implemented in different ways. The first view is created for a cube that includes both facts (shipping and sales) present in the case study. This cube also comprises three shared dimensions between these two facts. Figure 5 has illustrated this view. The second view is created for a cube that includes the fact shipping and corresponding dimension hierarchy. Figure 6 has illustrated this view. Likewise, the third view can be created for a cube that includes the fact sales and corresponding dimension hierarchy. (1). Slice operation (sl): Query1: The total number of Electronics product type units sold across all of the dimensions (Time, Location, Branch, and Shipper). The implementation of the above query is specified in figure  7. This query realizes the slice operation of OLAP algebra. In the above query the "salesView" will be sliced by the predicate productType_Name="electronics".
(2). Dice operation (di): Query2: The total unit sold for a product type electronics, city Durgapur and month "November".
The above query is implemented as specified in figure 8. Results of the above query realize dice operation by restricting two dimensions of data cube. The salesView will be diced by the predicate productType_Name="electronics" and year="2017". This query can be executed as systematic as specified in figure 9. The query is realized by Rollup operation. According to proposed Roll-up operation, Intermediate Cube IC1 and IC2 are generated. IC2 is Roll-up output of IC1, which is Roll-up output of C0.

(4). Drill-down operation (Ddn):
Query 4: Find the total units sold across all product by decreasing the aggregation level on time: from year to day (Year→Month→Day). This query can be executed as systematic as specified in figure 10. The query is realized by Drill-down operation. According to proposed Drill-down operation, Intermediate Cube IC1 and IC2 are generated. IC2 is Drill-down output of IC1, which is Drill-down output of C0.

(5). Pivot operation (pvt):
Query 5: Analyze the total dollars sold in respect to product and Time and vise-versa. This query realizes the pivot operation. It rotates or transposes the data axes to view the data from different perspective. The implementation of the query is represented in figure 11. Figure 11(a) has represented total dollars sold in respect to Product and Time. On the other hand, figure 11(b) has represented total dollars sold in respect to Time and Product.

Conclusion
The lack of uniform representation of OLAP operations over distinct NoSQL based DWs make them less portable. Addressing this challenge, in this paper, an ontology based formal and rigorous specification of OLAP operations are proposed. The main contribution of the proposed work is to provide uniform precise syntax and semantics towards different OLAP operators and operations. These proposed formal specifications are independent of any physical level implementation. Thus, proposed operators are able to be applied in distinct type of NoSQL based DWs. Further, the www.aetic.theiaer.org proposed formal specification is implemented in a document-oriented database MongoDB. Moreover, the proposed approach is suitable for web-scale analytical applications. Future work will include automated query answering through incorporating prescribed formal semantics of OLAP operators in a rule based reasoner. Besides this, another important future work will be automated conversion of formal operators towards specific NoSQL based DWs.