A Unified Conceptual Model for Data Warehouses

Shreya Banerjee, Sourabh Bhaskar, Anirban Sarkar and Narayan C. Debnath, "A Unified Conceptual Model for Data Warehouses”, Annals of Emerging Technologies in Computing (AETiC), Print ISSN: 2516-0281, Online ISSN: 2516-029X, pp. 162169, Vol. 5, No. 5, 20th March 2021, Published by International Association of Educators and Researchers (IAER), DOI: 10.33166/AETiC.2021.05.020, Available: http://aetic.theiaer.org/archive/v5/v5n5/p20.html. Review Article


Introduction
Over the last few years, NoSQL databases have achieved strong popularity. These new generation databases are different from traditional relational databases for possessing several significant features such as persistent and non-relational data, flexible schemas, high availability, dynamic insertion of different kinds of data, replication, massive horizontal scaling and distribution. Modern Data Warehouses (DWs) are competent to cope with and excel with emerging data analysis trends such as fast query expectations from users, data generated from cloud, unstructured or non-relational data, and rapid synthesis of data [1]. Thus, DWs solutions nowadays demand to act more in internet-style than to enforce the user to act within predefined structures [2]. Usually, classical DW and On Line Analytical Processing (OLAP) are comprised of a set of concepts like, facts, dimensions, measures and dimension hierarchies, those are used for structured schema representations [3]. The concept of cube is used for multi-dimensional data visualization. However, in case of web-scale applications, many of the dimensional information may not be available in regular structure. Consequently, decision makers are increasingly using NoSQL databases to implement their business solutions [4].
NoSQL databases are classified based on different physical level data models. Those are Document Store, Graph databases, Key-Value stores, and Column-Family store [5]. This heterogeneity brings several dimensions of challenges in systematic design methodology for NoSQL based DW solutions. Firstly, lack of common conceptual model for different NoSQL www.aetic.theiaer.org databases poses significant research challenges in design of DWs. Secondly, NoSQL based implementation of DWs requires a systematic design methodology, comprises of different levels of abstraction in DW design including, conceptual level, logical level and physical level [6]. A conceptual DW model will isolate the purpose of designer from its execution. Thirdly, representation of agreeable numerical data (DW concepts like, facts and measures) and contextual data (dimensions and its hierarchies) are needed in order to illustrate the effective associations among Fact, Measure and Dimension [1]. Fourthly, De-normalization of both contextual and numerical data is also required to achieve flexible characteristics of NoSQL databases. Fifthly, realizations of data cubes are important for visualizing and executing analytical queries effectively.
The objective of this paper is to address these abovementioned challenges. The research methodology followed in this paper is described next. An ontology driven common conceptual model for NoSQL based DW system is proposed to resolve the mentioned challenges. Ontology is defined as an explicit specification of shared conceptualization of the elements of DW domain in terms of concepts and related axioms [7]. Axioms enable ontology to provide enriched and formal semantics towards different concepts. The proposed conceptual model is capable to represent a generalized and rigorous formal set of concepts at the conceptual level design phase of DW using NoSQL database features. In the proposed conceptual model, several generic concepts of the model described in [8], are extended for DW domain. Further, the proposed conceptual model is implemented in a document-oriented database MongoDB. However, it can be transformed towards other NoSQL based DWs, such as Columnar, Key-Valued and Graph oriented. Figure 1 describes the proposed design methodology of NoSQL based DW system.

Related Work
Several research works exist in representation of formal conceptual model for NoSQL based DW. In [1], [9][10][11] authors have described a conceptual model for NoSQL based OLAP systems that can be mapped towards either Column or Document oriented DW using a set of rules. In [6], an existing benchmark for relational database based DWs is improved towards a generalized benchmark for distinct NoSQL based DWs. This approach is based on Star schema for DW. In [12], physical DW design is investigated over column-oriented databases through Map-Reduce framework using Hbase. In [4] and [13], authors described rules for implementing DW in document-oriented database systems and Hive respectively. In [14], authors have described a method that has used ontology to generate the multidimensional schema from a conceptual formalization of a domain. However, NoSQL databases are not considered in this approach. In [15], a new aggregation operator, known as CN-CUBE (Columnar NoSQL CUBE) is described. Further, it is implemented in column-oriented NoSQL database. Majority of existing works described models for NoSQL based DW system specific to its physical level implementations. These models are transformed towards either columnar or document oriented NoSQL databases using a set of rules. Further, semantics of distinct concepts are not well explored in these approaches. Moreover, illustrations of data cubes are represented by few approaches and are confined towards specific NoSQL solutions.

Proposed Conceptual Model for NoSQL Based DWs
Proposed conceptual model is consisting of group of constructs, relationships and a number of significant properties to unify conceptual level representations of different NoSQL based DW solutions. Ontology is applied for the proposed conceptualization to provide rigorous and formal vocabularies towards distinct facets. The proposed conceptualization is consisting of all details those are necessary for representation the concepts of facts, dimensions and measures in DW. Further, it provides the concepts of data cubes and dimension hierarchies when multi-dimensional data are heterogeneous types, and ranged from structured to semi-structured. The proposed conceptual model is equally useful for traditional DW modelling using relational databases when related dimensional data, fact data and their relationships are strictly structured and homogeneous in nature. All concepts in the proposed model are represented through axioms expressed using mathematical logic. Figure 2 has illustrated the proposed conceptual model.

Constructs and Layers in proposed conceptual model
Proposed conceptual model has a layered organization. This model is consisting of three main layers namely-Collection, Family and Attribute. All these three layers have their respective construct types-Collection (col), Family (FA), and Attribute (AT). Fact and dimension hierarchies in DW map towards Family layer of the proposed conceptualization. The measure and members of dimensions are mapped towards Attribute layer. Further, Collection layer realizes the data cubes based on facts.
(a) Attribute Layer: It is the base layer of the proposed conceptual model. Key construct type of this layer is Attribute (AT) that is the group of all possible instances of a data item. AT is elementary in nature. This can be of two types namely-Measure Attribute (MAT) and Dimension Attribute (DAT). A MAT represents single measure of a fact in a DW. On the other hand, a DAT represents single attribute belonging to a dimension in a DW. Formalization of AT is, F1:∀ ( ( ) → ( ( )⨁ ( ))) Explanation: F1 specifies that an AT instance x can be either MAT type or DAT type. www.aetic.theiaer.org

Explanation:
If there exists an FF then that FF should encapsulates several MAT and associated DF. Here, x, v and y are instances of FF, MAT and DF respectively.
( 1, 2))) Explanation: If there exists a DF then it should encapsulate several DAT. Further, those DF can be encapsulated in another DF dynamically.
(c) Collection Layer: This is the top most layer of the conceptual model. Key construct type of this layer is Collection (col). A col is created from a combination of semantically related FF. Thus, from the top level the whole DW can be seen as set of Collections. Formalization of Collection is, F5: ∀ ∃ ∃ ( ( ) ↔ ( ( ) ∧ ( ) ∧ ( , ))) Explanation: If there exists a col then it should encapsulate several FF. Here, x and v are instances of col and FF respectively.
(d) Cube: Cube is the de-facto logical representation for data visualization. Cube can be created from FF and realized as a col in the proposed conceptual model. If there are multiple FF, then a cube can be devised for each FF or combinations of FF. In the latter case, FF can share DF and related MA. Formalization of cube as, Explanation: Several instances Cube (v) can be realized from each instance of FF (x) including base and apex level cubes.

Relationships in the Proposed Conceptual Model
In the proposed conceptual model, distinct construct types are connected with each another using different relationships. Proposed relationships can be classified in two types. One is inter-layer kind relationships and another is intra-layer kind of relationships. Inter-layer kind relationships exist between disparate construct types of two distinct layers. Whereas, Intra-layer kind relationships exit between analogous construct types of a similar layer.  1)))) (b) Inverse Containment (Icnt): This relationship is intra-layer kind and connects two construct types when one is encapsulated towards another construct type dynamically. Direction of this relationship is opposite to the Cnt relationship. In the proposed conceptual model, lower level DFs are encapsulated towards higher-level DFs using Icnt relationships. This relationship is helpful to represent distinct levels of granularity in dimension hierarchies. It is capable to add different dimensions in distinct granular level on the fly and useful to change granularity level dynamically. 1)))) (c) Association (AS): These relationships are intra-layer kind and connect constructs types anticipated to achieve several goals together. An AS may exist between FF and DF. Further, AS can be present between two different cols. www.aetic.theiaer.org

Properties of various relationships
The proposed set of relationships support several properties like, Cardinality, Ordering, and Modality to handle both structured and flexible nature in the model. ( 1) ∧ ( 2)) → _ ( 1, 2)) In the similar way, Ord for other relationships can be represented.

Illustration of the proposed conceptual model using a case study
Let, a case study related to a DW system based on sales and shipping. Sales of different products can be done in sale branches. Branches can be located in multiple locations. Shipping can have multiple shippers who will ship the product from one location to another.   Figure 3 represents the key elements of the case study. Data cubes related to the case study can be realized through distinct cols based on different FFs. Figure 4 has illustrated Shipping FF along with related DFs and MAT with corresponding cardinality and dimension hierarchy.

Implementation Strategy
In this section, two kinds of strategies are proposed for implementation of data cubes in NoSQL based DW systems. Further, the proposed conceptual model is transformed towards a Document Oriented database MongoDB. In addition, two implementations strategies are illustrated using MongoDB based on the case study specified in section 4. Proposed implementation strategies are useful for visualization of multi-dimensional nature of NoSQL based DW systems. However, there is no binding to use other kinds of NoSQL databases for implementation inline of the proposed strategies.
Single Collection based Implementation Strategy: In this strategy, data cubes will be realized as a single col of a FF. Thus, numbers of data cubes in DW system depend on numbers of FFs. Hence, if there are n numbers of FFs, then there should be n numbers of data cubes. These FFs have nested related DFs, DF Hierarchies, DAT and MAT.
Multiple Collection based Implementation Strategy: In this strategy, a data cube can be realized based on multiple cols of FFs and related DFs. These multiple cols include cols of each DFs related with a FF and a col of the FF itself. These DFs nest related dimension hierarchies and DAT. Further, the MAT are nested in the FF. In this strategy, data cubes will be devised dynamically (on the fly) by associating multiple cols of FF and DF. This strategy is capable of creation of flexible schema for NoSQL based DWs by adding of measure and dimension definitions using Icnt and AS relationships. Table 1 has described the differences between these two different strategies. Table 2 specifies the transformation between constructs of proposed conceptual model and MongoDB. In single collection based Implementation Strategy, data cubes are realized through single"Collection" element of MongoDB that is comprised of "Documents" elements corresponding to a FF. Based on the case study specified in section 4, data cubes are created from Sales fact and realized as a single "Collection" element that nests "Document" element corresponding to Sales fact. Further, Sales fact encapsulates "Documents" elements for related dimension hierarchies, namely,

Multiple Collection based Implementation
Single Collection based Implementation This strategy has less redundancy, because, a fact or shared dimensions are defined once This strategy has high-level redundancy, since a fact or shared dimensions are defined multiple times After defining once, insertion of new data definitions are propagated to other places. Hence, addition of data definitions can be handled easily addition of data definitions is costlier, as newly added dimensions and measures have to be defined multiple times maintenance is inexpensive than single one maintenance is expensive than multiple one due to more data integration policy, query execution time will be higher due to less data integration policy, query execution time will be lower Location, Branch, Product and Time. "Document" elements for Sales fact also encapsulate measures Units Sold and Dollars Sold. Similarly, another data cube can be created from Shipping fact separately. Figure 5 has illustrated the corresponding implementation in MongoDB.
In multiple collections based implementation strategy, data cubes from multiple "Collection" elements in MongoDB are created for all FF and DF elements and further aggregated towards creation of required data cube. Based on the case study specified in section 4, a data cube created for Shipping fact is based on an aggregated "Collection" element. This aggregated "Collection" is implemented by associating "Collection" elements of the fact Shipping and each related dimension hierarchies Location, Shipper, Time, and Product. "Document" element representing Shipping fact also encapsulates measures Units Shipped and Dollars Shipped. In MongoDB, a data cube can be built for Multiple Collections based Implementation Strategy using "aggregate()" function. Figure 6 has specified multiple "Collection" elements. Figure 7 has illustrated a data cube that is created from multiple "Collection" elements (figure 6) using "aggregate" operator.