Parallelizing Computations of Full Disjunctions

In relational databases, the full disjunction operator is an associative extension of the full outerjoin to an arbitrary number of relations. Its goal is to maximize the information we can extract from a database by connecting all tables through all join paths. The use of full disjunctions has been envisaged in several scenarios, such as data integration and knowledge extraction. One of the main obstacles to its adoption in real business scenarios is the long time its computation requires. This paper addresses this limitation by introducing parafd, a novel approach based on parallel computing techniques for implementing the full disjunction operator in both an exact and an approximate version. Our proposal has been compared with state-of-the-art algorithms, which have also been re-implemented to run in parallel. The experiments show that parafd outperforms existing approaches in execution time. Finally, we have experimented with the full disjunction as a collection of documents indexed by a textual search engine. In this way, we provide a simple technique for performing keyword search over relational databases. The results obtained against a benchmark show high precision and recall levels, also in comparison with existing proposals.


Introduction
Due to their capability of managing and storing data effectively and efficiently, relational databases have been widely adopted in business applications. The relational database design methodology, based on normal forms, ensures data integrity and eliminates redundancy by encoding the information into a number of tables connected with each other via foreign key relationships. Nevertheless, in several scenarios, the fragmentation induced by the model may be a serious obstacle for a user who needs to understand and work with the entire database content. The universal relation assumption [1] allows users to address this kind of problem by treating data as if it were all in a single relation over all the attributes. Computing the universal relation requires data in different relations to be integrated in some way, and, to do so, users have to know how the tables are connected in the database. Let us suppose that we want to apply the universal relation assumption to a database composed of two relations. The universal relation can be obtained through a straightforward integration process that generates a relation schema composed of the union of the attributes of the input relations. The universal relation is populated by applying the outerjoin operator (usually on the attributes either sharing the same names, via natural outerjoins, or specified in the foreign key relationships, via equijoins), which avoids loss of data from the source tables.
Populating the universal relation of a database composed of several tables requires particular attention. The outerjoin operator is not associative: its application may generate different results if the tables involved in the join paths are considered in a different order. Moreover, there is no unique order: the tables can be linked through different paths, and cycles may arise. Each path conveys different semantics, and a cycle can be transformed into a number of paths, one for each involved table.
The full disjunction [2] has been proposed to cope with these issues. It is an associative extension of the full outerjoin to an arbitrary number of tables that completely preserves the information content of the data source. This operator is implemented by joining tuples over all possible paths connecting the database tables, which makes its computation a critical task. As described in Section 4, a number of algorithms implementing the operator have been proposed in the literature. Nevertheless, they have proven inadequate for large data sources in real scenarios, due to the execution time required.
The full disjunction is of paramount importance in all scenarios that require a denormalization of relational databases which completely preserves the information. The existing tools able to compute the full disjunction are not easily usable in real environments, due to the computational complexity of the algorithms implemented and the long execution times they require. An efficient computation would therefore have a significant impact in a large number of scenarios. A typical scenario is data integration: different databases can model the same real-world domain in different ways, and de-normalizing the data before the integration eases the process. Another interesting scenario is provided by Data Mining and Machine Learning, where the input typically consists of a single table, so data in relational databases has to be de-normalized to be used with these approaches. Recent research on Big Data has made available efficient data abstractions and structures (e.g., RDD [3]) and MapReduce-based frameworks for supporting parallel computation on massive datasets (e.g., Apache Hadoop, Apache Spark).
In this paper, we leverage such technical advances by introducing parafd (PAR-Allel Full Disjunction), an approach providing an efficient implementation of the full disjunction. Our proposal divides the computation into four phases: a) creation of a database graph representing the database schema; b) computation of all spanning trees over the database graph; c) computation of a full disjunction for each spanning tree; and d) merging of the full disjunctions by removing duplicated and subsumed items. The advantages of this proposal mainly lie in the availability of optimized, low-complexity algorithms for computing spanning trees and in a novel parallel implementation of a multi-relation hash star join algorithm able to reduce the overhead resulting from the distribution of data over the network. Moreover, there are scenarios where we do not need a "complete" full disjunction including all combinations of tuples from the database tables, but only the most significant ones according to some quality metric, which reduces the computation time. parafd can be adapted to create an approximate full disjunction, thanks to the definition of a measure (based on the Pointwise Mutual Information in our implementation) for identifying the "most significant" spanning trees and computing only the full disjunctions associated with them.
We performed an extensive experimental evaluation of parafd, also comparing it against two existing algorithms: IncrementalFD [4] and BiComNLOJ [5]. Since it could be unfair to compare parallel and sequential algorithms, we extended and reimplemented both approaches so that they can run in parallel. Four variants of IncrementalFD, with different levels of parallelization, are presented in Section 4. The experiments highlight the efficiency of our proposal, which reduces the time required for generating all full disjunctions by up to four orders of magnitude. The effectiveness of parafd has been evaluated in the challenging scenario of keyword search over relational databases. We considered the full disjunction as a collection of documents (one for each tuple composing it) to be indexed by a text retrieval engine. We tested this search system on a well-known benchmark [6], obtaining results with high precision levels.
Summarizing, the main contributions of this paper are:
• the development and implementation of an algorithm based on spanning trees and a parallel implementation of a multi-relation hash join algorithm for computing the full disjunctions of a database;
• the re-design of the IncrementalFD and BiComNLOJ algorithms so that they can run in parallel. Four implementations of IncrementalFD, with different levels of parallelism, are described and evaluated in the paper;
• a technique for computing an approximate full disjunction, i.e., a full disjunction with only the most significant tuples, according to a quality measure;
• an extensive experimental evaluation of the approaches with real and large datasets, showing that parafd outperforms the state of the art.
The rest of this paper is organized as follows: Sections 2 and 3 formally define the full disjunction operator and introduce parafd for its computation. In Section 4, we describe two main existing techniques for computing the full disjunction and introduce four parallel implementations. Related work is discussed in Section 5. The experimental evaluation is presented in Section 6 and, finally, in Section 7 we sketch out some conclusions and future work.

Preliminaries
The full disjunction is an associative extension of the outerjoin [2]. Approaches aiming to maximize the capability of joining pieces of data from different relations built full disjunctions upon natural outerjoins [7,4,5] (i.e., equijoins on common attributes). In this paper, we extend those works by introducing a definition of full disjunction based on equijoins between foreign and primary keys. In this way, we take into account only the connections between the tables introduced by the database designer, thus preserving the original semantics of the data.
Let us consider a relational database with $n$ relations $\mathcal{R} = \{R_1, \ldots, R_n\}$, where each relation $R_i$ has a schema $sc(R_i)$ composed of $p_i$ attributes $R_i.A_1, \ldots, R_i.A_{p_i}$, a primary key $R_i.PK \subseteq sc(R_i)$ and possibly multiple foreign keys $R_i.FK \subseteq sc(R_i)$ referring to other relations.
The schema of $\mathcal{R}$, denoted $sc(\mathcal{R})$, is the union of the schemas $sc(R_i)$ of the relations in $\mathcal{R}$. The schema graph of $\mathcal{R}$, denoted $G_{sc(\mathcal{R})} = (V, E)$, is an undirected graph showing the connections between relations generated by foreign key relationships, where $V$ and $E$ are the sets of nodes and edges, respectively. There is a node for each relation $R_i$.
There is an edge $e = (R_i, R_j) \in E$ between the nodes $R_i$ and $R_j$ if the primary key $R_i.PK$ defined on $R_i$ is referenced by the foreign key $R_j.FK$ defined on $R_j$. Note that, in general, there may be multiple edges between the same pair of nodes, generated by different foreign keys on the same relations. For the sake of simplicity, in the following we assume that at most one edge is possible between each pair of nodes. We say that $\mathcal{R}$ is connected if $G_{sc(\mathcal{R})}$ is connected. The relational model allows users to merge data from different relations through the join operator. The joining tree of tuples is the data structure that has been introduced in the literature [8] to represent tuples connected by a join operation.
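As an illustration of the schema graph just defined, the following minimal Python sketch builds an undirected adjacency map from a list of foreign key references and checks connectivity. The function names, the relation names and the list `foreign_keys` (in particular the link between `membership` and `state`) are assumptions made for this sketch and are not taken from the parafd implementation.

```python
def build_schema_graph(relations, foreign_keys):
    """Undirected schema graph: one node per relation, one edge per FK reference."""
    graph = {name: set() for name in relations}
    for referencing, referenced in foreign_keys:
        graph[referencing].add(referenced)
        graph[referenced].add(referencing)
    return graph


def is_connected(graph):
    """Depth-first search from an arbitrary node; True if every node is reached."""
    if not graph:
        return True
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return len(seen) == len(graph)


# Hypothetical schema: (referencing relation, referenced relation) pairs.
relations = ["city", "state", "membership"]
foreign_keys = [("state", "city"), ("membership", "city"), ("membership", "state")]
schema_graph = build_schema_graph(relations, foreign_keys)
print(schema_graph["city"], is_connected(schema_graph))
```

Storing neighbours in sets automatically collapses multiple foreign keys between the same pair of relations into a single edge, which matches the simplifying assumption made above.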
Definition 1 (Joining trees of tuples). Given a database $\mathcal{R} = \{R_1, \ldots, R_n\}$ with schema graph $G_{sc(\mathcal{R})} = (V, E)$, a joining tree of tuples $JT$ is a tree of tuples where each edge $(t_i, t_j)$ in $JT$, with $t_i \in R_i$ and $t_j \in R_j$, satisfies two properties: (1) $e = (R_i, R_j) \in E$, and (2) $(t_i, t_j) \in R_i \bowtie R_j$. The set of tuples of $JT$ is denoted by $Tuples(JT)$.
Join consistent and connected tuple sets are joining trees of tuples that do not contain more than one tuple from the same table. These are the building blocks of the full disjunction.
Definition 2 (Join consistent and connected tuple set). Given a database $\mathcal{R} = \{R_1, \ldots, R_n\}$ with schema graph $G_{sc(\mathcal{R})}$, a tuple set of $\mathcal{R}$ is any set of tuples $T = \{t_1, \ldots, t_m\}$ consisting of at most one tuple from each relation (hence $m \leq n$).
We say that $T$ is join consistent and connected if there exists a joining tree $JT$ such that $Tuples(JT) = T$, i.e., the set of tuples of $JT$ coincides with $T$. We write $JCC(T)$ to denote that a tuple set $T$ is join consistent and connected.
Example 2: With reference to the database shown in Figure 1(a) with schema graph in Figure 1(b), the tuple set $T_0 = \{c_0, s_0, m_0\}$ is join consistent: in this case there are 3 joining trees whose nodes coincide with $T_0$: $\{(c_0, s_0), (s_0, m_0)\}$, $\{(c_0, s_0), (c_0, m_0)\}$ and $\{(m_0, s_0), (c_0, m_0)\}$; $T_1 = \{c_1, s_1, m_1\}$ is also join consistent: in this case there exists only the joining tree $\{(c_1, s_1), (s_1, m_1)\}$ whose nodes coincide with $T_1$; on the other hand,

Definition 3 (Full Disjunction). Let $\mathcal{R}$ be a set of relations. The full disjunction of $\mathcal{R}$, denoted $FD(\mathcal{R})$, is the set of all tuple sets $T$ of $\mathcal{R}$ such that (1) $T$ is join consistent and connected, i.e., $JCC(T)$ holds, and (2) $T$ is maximal, that is, there is no join consistent and connected tuple set of $\mathcal{R}$ that properly contains $T$.
The full disjunction is a set of tuple sets that can be represented as a relation in which every item has the same schema, regardless of the tuples it involves. Let us consider a join consistent and connected tuple set $T$ of $\mathcal{R}$, and denote by $embed_{\mathcal{R}}(T)$ the tuple that is obtained by first joining the tuples of $T$ and then adding columns with null values for the remaining attributes of $sc(\mathcal{R})$. More formally, $embed_{\mathcal{R}}(T)$ is the tuple $t$ over $sc(\mathcal{R})$ such that, for all attributes $A$ of $sc(\mathcal{R})$, if $T$ contains a tuple $t'$ with the attribute $A$, then $t[A] = t'[A]$; otherwise $t[A]$ is null.

Example 3: Table 1(c) shows the full disjunction of the running example database, i.e., the possible "combinations" of the data in the original tables according to the foreign key relationships. Note that only the identifiers of the tuples in the original tables are provided. In the following, we will use the term full disjunction to refer also to its transformation into a full disjunction relation.
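Definitions 2 and 3 can be made concrete with a brute-force Python sketch that enumerates every tuple set, keeps the join consistent and connected ones, and then discards the non-maximal ones. It is exponential and only meant for tiny inputs; the toy rows, the attribute names and the helpers `is_jcc` and `full_disjunction` are assumptions made for illustration, not part of parafd.

```python
from itertools import product


def is_jcc(tuple_set, join_conditions):
    """tuple_set maps relation name -> row (dict), so it holds at most one tuple
    per relation. It is JCC when the tuples, linked by the satisfied join
    conditions, form a connected graph (which then contains a joining tree)."""
    names = list(tuple_set)
    if not names:
        return False
    adjacent = {name: set() for name in names}
    for rel_a, attr_a, rel_b, attr_b in join_conditions:
        if rel_a in tuple_set and rel_b in tuple_set:
            value = tuple_set[rel_a].get(attr_a)
            if value is not None and value == tuple_set[rel_b].get(attr_b):
                adjacent[rel_a].add(rel_b)
                adjacent[rel_b].add(rel_a)
    seen, stack = {names[0]}, [names[0]]
    while stack:
        for nxt in adjacent[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(names)


def full_disjunction(db, join_conditions):
    """Brute force over all tuple sets; tuples are identified as (relation, index)."""
    names = list(db)
    options = [[None] + list(range(len(db[name]))) for name in names]
    candidates = []
    for choice in product(*options):
        picked = {(n, i) for n, i in zip(names, choice) if i is not None}
        rows = {n: db[n][i] for n, i in picked}
        if is_jcc(rows, join_conditions):
            candidates.append(frozenset(picked))
    # Keep only maximal tuple sets (no proper superset among the candidates).
    return {t for t in candidates if not any(t < other for other in candidates)}


# Hypothetical toy data; both foreign keys reference city.name.
toy_db = {
    "city": [{"name": "Rome"}, {"name": "Paris"}],
    "state": [{"capital": "Rome"}, {"capital": "Oslo"}],
    "membership": [{"cityorg": "Rome"}],
}
toy_joins = [("state", "capital", "city", "name"),
             ("membership", "cityorg", "city", "name")]
toy_fd = full_disjunction(toy_db, toy_joins)
print(sorted(sorted(t) for t in toy_fd))
```

In this toy instance the first city joins with the first state and the only membership, while the second city and the second state remain dangling and appear as singleton tuple sets.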

The parafd Approach
The parafd process is based on the idea that the full disjunction of a set of relations $\mathcal{R}$ can be obtained as the union of the full disjunctions of all possible spanning trees of its schema graph $G_{sc(\mathcal{R})}$. This result, which is formally demonstrated in Section 3.1, allows us (1) to compute the full disjunction through the simple application of the full outerjoin operator; and (2) to split the computation process into a number of steps that can be executed in parallel (see Section 3.2).

Computing a Full Disjunction through Spanning Trees
In this section we show that the full disjunction of a set of relations $\mathcal{R}$ with schema graph $G_{sc(\mathcal{R})}$ can be obtained by combining the full disjunctions computed for each possible spanning tree of $G_{sc(\mathcal{R})}$, after removing the tuple sets that have already been generated by other spanning trees or are contained in other tuple sets. Moreover, we show that the full disjunction of a spanning tree can be computed by means of the full outerjoin operator.
To implement this procedure, we need to extend the full disjunction definition to be applied to schema subgraphs.
Definition 5 (Full Disjunction of a Schema Subgraph). Let $\mathcal{R}$ be a set of relations with schema graph $G_{sc(\mathcal{R})}$. Given a connected subgraph $SG = (V_{SG}, E_{SG})$ of $G_{sc(\mathcal{R})}$, the full disjunction of the set of relations $V_{SG}$ is called the full disjunction of $SG$ and denoted by $FD(SG)$.
The subsumption operator [2] allows us to remove duplicated and contained tuple sets.
Definition 6 (Subsumption). Given two tuple sets $T$ and $T'$, we say that $T$ subsumes $T'$ if and only if $T \supseteq T'$. The unary subsumption operator $\downarrow$ denotes the removal of subsumed tuple sets from a set of tuple sets $\mathcal{X}$:

$$\downarrow(\mathcal{X}) = \{\, T \in \mathcal{X} \mid \nexists\, T' \in \mathcal{X} : T' \supsetneq T \,\} \qquad (1)$$

Example 4: In the example of Figure 1, we can build the following three spanning trees: $ST_1 = (\mathcal{R}, \{e_1, e_2\})$, $ST_2 = (\mathcal{R}, \{e_2, e_3\})$, $ST_3 = (\mathcal{R}, \{e_1, e_3\})$. It is easy to verify that $\{c_0, s_0, m_0\}$ is in both $FD(ST_1)$ and $FD(ST_2)$, while which is subsumed by $\{c_1, s_1, m_1\}$. Then, to obtain $FD(\mathcal{R})$ starting from the full disjunctions of its spanning trees, we need to eliminate such subsumed tuple sets, as stated by the following theorem.
Based on Definitions 5 and 6, Theorem 1 demonstrates how we can compute the full disjunction of a set of relations by means of the spanning trees computed on its schema graph.
Theorem 1: Given a database $\mathcal{R} = \{R_1, \ldots, R_n\}$ with schema graph $G_{sc(\mathcal{R})}$, and $\mathcal{ST}$ the set of all spanning trees of $G_{sc(\mathcal{R})}$, the full disjunction $FD(\mathcal{R})$ can be obtained as:

$$FD(\mathcal{R}) = \downarrow\Big(\bigcup_{ST \in \mathcal{ST}} FD(ST)\Big) \qquad (2)$$

Proof. The proof proceeds by demonstrating that the right-hand side (RHS) of Equation 2 includes only tuple sets satisfying both the properties required to belong to the full disjunction of $\mathcal{R}$ (see Definition 3). Then we demonstrate that no tuple set which is part of the full disjunction of $\mathcal{R}$ can be missing from the RHS of Equation 2.
Every tuple set $T$ in the RHS of Equation 2 is join consistent and connected: $T$ is an element of the full disjunction of some spanning tree and hence, by definition, a join consistent and connected tuple set. Moreover, each such tuple set $T$ is maximal, since the subsumption operator removes contained tuple sets.
Finally, we prove by reductio ad absurdum that there cannot exist a tuple set $T \in FD(\mathcal{R})$ that is not contained in the RHS of Equation 2. For such a $T$, there exists by construction at least one spanning tree $ST$ of $G_{sc(\mathcal{R})}$ such that $T \in FD(ST)$. But $ST$ must be in $\mathcal{ST}$; otherwise there would exist a spanning tree of $G_{sc(\mathcal{R})}$ which is not part of the set of all spanning trees of $G_{sc(\mathcal{R})}$.
In the rest of this subsection we summarize the technique for computing the full disjunction of a tree introduced in [5]. Let us use $R_1 ⟗ R_2$ to denote the full outerjoin of $R_1$ and $R_2$. The full outerjoin is not associative: different execution orders generate different results. Left-deep outerjoins allow us to fix a specific order.

Definition 7 (Left-deep Outerjoin). The left-deep outerjoin of $(R_1, \ldots, R_n)$, denoted by $⟗(R_1, \ldots, R_n)$, is defined as follows:

$$⟗(R_1, \ldots, R_n) = (\cdots((R_1 ⟗ R_2) ⟗ R_3) \cdots) ⟗ R_n$$

Proposition 3.1 given in [5] shows that when the schema graph is a tree, a connected-prefix ordering yields a left-deep outerjoin that is equivalent to the full disjunction. We can apply this result to a spanning tree: given

Full Disjunction of a Spanning Tree: Computation by Hash Star Join
We showed in the previous section that the computation of the full disjunction of a set of relations $\mathcal{R}$ with schema graph $G_{sc(\mathcal{R})}$ can be decomposed into a number of computations, one for each spanning tree we can build over the schema graph $G_{sc(\mathcal{R})}$. We then showed that we can use a full outerjoin sequence to compute the full disjunction of a spanning tree.
In this section, we introduce an efficient way to compute full outerjoin sequences, based on a novel parallel algorithm for performing hash star joins, which is described in Algorithm 3. This algorithm can efficiently perform multi-relation joins by reducing the communication costs.
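The full outerjoin sequence recalled above can be sketched in Python as a fold of a binary full outerjoin over a connected-prefix ordering of a tree. Intermediate results are kept as tuple sets (maps from relation name to row); the helper names, the toy rows and the structure of `tree_conditions` are assumptions made for this sketch, which runs sequentially and does not reproduce the parallel implementation.

```python
def matches(tuple_set, row, conditions):
    """conditions: list of (left_relation, left_attr, right_attr) equalities."""
    applicable = [c for c in conditions if c[0] in tuple_set]
    return bool(applicable) and all(
        tuple_set[rel].get(l_attr) is not None
        and tuple_set[rel].get(l_attr) == row.get(r_attr)
        for rel, l_attr, r_attr in applicable
    )


def full_outerjoin(left, right_name, right_rows, conditions):
    """Full outerjoin of a list of tuple sets with the rows of one more relation."""
    result, matched = [], set()
    for tuple_set in left:
        extended = False
        for j, row in enumerate(right_rows):
            if matches(tuple_set, row, conditions):
                result.append({**tuple_set, right_name: row})
                matched.add(j)
                extended = True
        if not extended:
            result.append(dict(tuple_set))  # dangling left tuple set is preserved
    # Dangling right rows are preserved as singleton tuple sets.
    result.extend({right_name: row}
                  for j, row in enumerate(right_rows) if j not in matched)
    return result


def left_deep_outerjoin(order, db, conditions):
    """Fold full_outerjoin over `order`, assumed to be a connected-prefix ordering."""
    result = [{order[0]: row} for row in db[order[0]]]
    for name in order[1:]:
        result = full_outerjoin(result, name, db[name], conditions[name])
    return result


# Hypothetical star-shaped tree centered on city.
tree_db = {
    "city": [{"name": "Rome"}, {"name": "Paris"}],
    "state": [{"capital": "Rome"}, {"capital": "Oslo"}],
    "membership": [{"cityorg": "Rome"}],
}
tree_conditions = {
    "state": [("city", "name", "capital")],
    "membership": [("city", "name", "cityorg")],
}
tree_fd = left_deep_outerjoin(["city", "state", "membership"], tree_db, tree_conditions)
print(sorted(sorted(ts) for ts in tree_fd))
```

Because every relation after the first is attached to a relation already in the prefix, each step has exactly one applicable join condition when the graph is a tree, which is what makes the simple nested-loop matching above sufficient.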
To be able to apply the hash star join, we introduce Theorem 2, where we show how to model the spanning trees as sets of star trees. Given the schema graph $G_{sc(\mathcal{R})}$, let $TR$ be a tree of $G_{sc(\mathcal{R})}$. Let $TR^{*}$ be the star subtree obtained as a subgraph of $TR$ by considering as center vertex the vertex of $TR$ with maximum degree, where $R^{c}$ is the center vertex and $R^{s}_i$, $\forall i = 1, \ldots, n$, are the adjacent satellite vertices.

Theorem 2: Given the schema graph $G_{sc(\mathcal{R})}$, let $TR$ be a tree of $G_{sc(\mathcal{R})}$. Then:

$$FD(TR) = FD(\Gamma(TR, TR^{*})) \qquad (3)$$

where $\Gamma(TR, TR^{*})$ is the tree defined as follows:
1. removing the star subtree $TR^{*}$ from $TR$
2. adding the relation $FD(TR^{*})$ in its place
Proof. The proof is based on the fact that Equation 3 generates a connected-prefix ordering of the vertices of $TR$; therefore, by applying the aforementioned Proposition 3.1 given in [5], the full disjunction can be computed as a left-deep outerjoin sequence. First of all, for a star tree $TR^{*}$, $FD(TR^{*}) = ⟗(R^{c}, R^{s}_1, \ldots, R^{s}_n)$ for any ordering of the satellite vertices $\{R^{s}_1, \ldots, R^{s}_n\}$. Then, by iteratively applying the tree contraction defined by the $\Gamma()$ function, we obtain a connected-prefix ordering of $TR$.
Let us show this result intuitively with the example in Figure 2, where we consider, on the left, a tree $TR_1$ with 12 vertices denoted by $0, 1, \ldots, 11$. $TR^{*}_1$ is the star subtree of $TR_1$ with center vertex 0 and satellite vertices $\{1, 2, 3, 4, 5\}$; we then obtain $TR_2 = \Gamma(TR_1, TR^{*}_1)$, where $R_1 = FD(TR^{*}_1)$. $TR^{*}_2$ is the star subtree with center vertex $R_1$ and satellite vertices $\{6, 8, 10\}$; we then obtain $TR_3 = \Gamma(TR_2, TR^{*}_2)$, where $R_2 = FD(TR^{*}_2)$. $TR^{*}_3$ is the star subtree with center vertex $R_2$ and satellite vertices $\{7, 9, 11\}$: since $TR_3 = TR^{*}_3$, the process stops. It is trivial to prove that $0, 1, 2, 3, 4, 5$ is a connected-prefix ordering of the vertices of $TR^{*}_1$. It is also trivial to prove that $0, 1, 2, 3, 4, 5, 6, 8, 10$ is a connected-prefix ordering of the union of the vertices of $TR^{*}_1$ and $TR^{*}_2$. In the same way, $0, 1, 2, 3, 4, 5, 6, 8, 10, 7, 9, 11$ is a connected-prefix ordering of the union of the vertices of $TR^{*}_1$, $TR^{*}_2$ and $TR^{*}_3$, i.e., of the vertices of $TR_1$.

Algorithm 1: ComputeFD - Full Disjunction of a database

Example 6: The big picture of the parafd approach is shown in Figure 3: the spanning trees of a database schema graph are computed. We obtain the full disjunction of each spanning tree through the application of Hash Star Joins. The full disjunction of the database is generated by collecting the results obtained for each spanning tree and removing subsumed elements.

parafd implementation
Algorithm 1 shows the implementation of parafd. It takes as input a database $\mathcal{R}$ with schema graph $G$ and computes its full disjunction. First of all, the set $\mathcal{X}$ that will contain the resulting full disjunction is initialized (line 1). Then, the algorithm computes all spanning trees of the input schema graph by means of the GetSpanningTrees function (line 2), which is described in Section 3.3.1. The full disjunction of each spanning tree is then computed (lines 3-4). This computation is performed by the function ComputeTS, which is described in Section 3.3.2. At the end of the iteration, $\mathcal{X}$ will contain the full disjunction computed for each spanning tree (line 4). The full disjunction $FD$ of $\mathcal{R}$ is finally generated by removing duplicated and subsumed tuple sets from $TS$ (line 5) and is returned as output (line 6). A description of the tuple set subsumption process is provided in Section 3.3.3.

The computation of all spanning trees of a graph is a problem that has already been addressed in the literature. Among the existing solutions, we adopted the algorithm presented in [9], based on backtracking and depth-first search, which runs in $O(V + E + VN)$ time, where $V$, $E$, and $N$ are the numbers of vertices, edges, and spanning trees, respectively. Since in real databases the number of tables and foreign key relationships is limited, generating and computing the spanning trees is a feasible approach.
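To make the enumeration step concrete, the following Python sketch lists all spanning trees of a small schema graph. It is a naive stand-in, not the backtracking algorithm of [9]: it tries every subset of $|V| - 1$ edges and keeps the acyclic ones, which is adequate only for the small graphs typical of database schemas. The function name and the example edges are assumptions made for this sketch, and parallel edges are assumed to be absent.

```python
from itertools import combinations


def spanning_trees(vertices, edges):
    """Return every subset of len(vertices) - 1 edges that contains no cycle.
    An acyclic set of |V| - 1 edges on |V| vertices is a spanning tree."""
    trees = []
    for subset in combinations(edges, len(vertices) - 1):
        parent = {v: v for v in vertices}  # union-find forest for cycle detection

        def find(v):
            while parent[v] != v:
                v = parent[v]
            return v

        acyclic = True
        for a, b in subset:
            root_a, root_b = find(a), find(b)
            if root_a == root_b:  # both endpoints already connected: cycle
                acyclic = False
                break
            parent[root_a] = root_b
        if acyclic:
            trees.append(subset)
    return trees


# Hypothetical triangle-shaped schema graph over three relations.
triangle_vertices = ["city", "state", "membership"]
triangle_edges = [("city", "state"), ("city", "membership"), ("state", "membership")]
triangle_trees = spanning_trees(triangle_vertices, triangle_edges)
print(len(triangle_trees))
```

A triangle has three spanning trees, one for each edge that is left out, which is consistent with the three spanning trees listed in Example 4.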

Computation of the full disjunction of a spanning tree
Algorithm 2: ComputeFD_ST - Full Disjunction of a Spanning Tree
Input: A spanning tree ST.
Output: The full disjunction of ST.

The procedure HashStarJoin, described in Algorithm 3, computes the full disjunction of X (line 3). Line 4 builds the new spanning tree to be evaluated by means of the Γ function defined in Theorem 2. The process iterates until ST consists of a single vertex (line 5). Finally, the full disjunction is returned (line 6).
Efficiently performing a join operation in a distributed environment is usually a critical task. Communication costs are the main bottleneck, because computation usually takes less time than data distribution and shuffling. The Hash Star Join technique is similar to SHJ, proposed in [10] (see the related work section) for joining, in a parallel/cluster architecture, the fact table of a data warehouse system with its corresponding dimensions. The data to be joined is distributed by applying a shipping function that decides the host cluster of each record in a table. The shipping function is applied to a column (the partitioning key) of the table to be partitioned. When performing a join operation, if the joining key is the same as the partitioning key used for both input tables, then the join can be executed locally within each cluster. Otherwise, the join requires a re-partitioning of the data.
The strategy adopted for partitioning the tables is crucial for obtaining fast computations. In our Hash Star Join approach, we exploit the fact that adjacent vertices in a star tree can share the same joining key with the center vertex, so that all the related joins can be performed locally within a cluster. Based on this consideration, the idea is to analyze the join associations existing between the input tables and partition them on the most frequent join attributes. This leads to a specific execution order of the join operations that maximizes the number of tables involved in each operation and reduces the amount of data to transmit over the network.
The principle behind the standard hash join is to limit the total number of comparisons: only the tuples that fall in the same bucket are checked for join consistency. In our implementation, the main changes made to this basic logic are: (a) the creation of a distributed hash table, which allows us to parallelize the join computation for each bucket among different processes; (b) the involvement of an arbitrary number of tables; and (c) the application of the outerjoin operator within a bucket.

Algorithm 3: HashStarJoin
Input: A star tree $ST$ with center in relation $R_k$ and adjacent relations $Adj_{R_k}$.
Output: A set of tuple sets representing the full disjunction $FD^{*}$ of X.

Algorithm 3 shows the procedure for generating the sets of tuple sets from a star graph centered in $R_k$ and with satellite relations $Adj_{R_k}$. We start by initializing the set $FD^{*}$ that will contain the resulting FD (line 1). In line 2, given a star tree $ST$ as input, the function ClusterByFK finds a sequence of subStars $SS_1, \ldots, SS_n$ such that, for each $SS_i$, the vertices share the same joining key, and $SS_i$ has more vertices than $SS_{i+1}$.
Example 7: If we consider the star tree $ST_2$ (see Example 4) with City as center vertex, the sequence is just $SS_1 = ST_2$. The reason is that both the adjacent vertices Membership and State share the City joining key. If we consider the star tree $ST_3$ with Membership as center vertex, the sequence is $SS_1 = \{M, C\}$ and $SS_2 = \{S, C\}$, as there is no common join attribute.
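The grouping performed by ClusterByFK can be sketched in Python as follows. This is only a guess at the behaviour described above: satellites are grouped by the center attribute they join on and the groups are ordered by decreasing size. The function name `cluster_by_fk`, the triple format and the attribute names `country` and `code` are hypothetical.

```python
from collections import defaultdict


def cluster_by_fk(center, joins):
    """joins: list of (satellite, satellite_attr, center_attr) join associations.
    Returns subStars as (shared center attribute, [center, satellites...]),
    largest first; ties keep their input order because sorted() is stable."""
    groups = defaultdict(list)
    for satellite, _satellite_attr, center_attr in joins:
        groups[center_attr].append(satellite)
    ordered = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    return [(attr, [center] + satellites) for attr, satellites in ordered]


# City as center: both satellites join on city.name, so there is one subStar.
city_substars = cluster_by_fk("city", [
    ("state", "capital", "name"),
    ("membership", "cityorg", "name"),
])

# Membership as center: the satellites join on different attributes
# (the membership/state attributes are invented for this sketch).
membership_substars = cluster_by_fk("membership", [
    ("city", "name", "cityorg"),
    ("state", "code", "country"),
])
print(city_substars, membership_substars)
```

With a single shared key, all tables of the subStar can be partitioned once on that key, which is the situation the hash star join is designed to exploit.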
Then, the algorithm iterates over the obtained sequence of subStars (lines 3-8). At each iteration, a set SubStar is extracted from SubStars. This set of tables is then partitioned through a hash function that is applied to the most common attribute in their join associations. According to this partitioning scheme, we distribute the tuples to different nodes. Tuple sets are thus computed separately by each node through a MapReduce process (lines 5-6). For each partition, the LeftDeepOuterjoin operator joins the tuples following a left outerjoin sequence. When all nodes have completed the computation, the results are collected and stored in FD (line 7). This operation terminates the processing of the current cluster of relations. Finally, the tuple sets computed from the current cluster are merged with the results of the previous clusters via the full outerjoin operator (line 8). All tuple sets are then returned as output (line 9).
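The per-subStar step described above can be illustrated with a single-process Python sketch, in which each bucket stands for the data shipped to one node. It is not Algorithm 3: it handles only one subStar whose satellites all join with the center on the same key value, and it omits the merge with previous clusters. The function name `hash_star_outerjoin`, the triple format and the toy rows are assumptions made for this sketch.

```python
from collections import defaultdict


def hash_star_outerjoin(center, satellites, num_buckets=4):
    """center and satellites are (name, rows, key_attr) triples; every satellite
    is assumed to join with the center on the same key value (one subStar)."""
    center_name = center[0]
    # buckets[b][key_value][relation_name] -> rows shipped to bucket b
    buckets = [defaultdict(lambda: defaultdict(list)) for _ in range(num_buckets)]
    dangling = []
    for name, rows, key in [center] + list(satellites):
        for row in rows:
            value = row.get(key)
            if value is None:
                dangling.append({name: row})  # a null key never joins
            else:
                buckets[hash(value) % num_buckets][value][name].append(row)

    result = dangling
    for bucket in buckets:  # each bucket could be processed by a different worker
        for groups in bucket.values():
            if center_name not in groups:
                # Satellites without a matching center tuple stay alone.
                for name, rows in groups.items():
                    result.extend({name: row} for row in rows)
                continue
            partial = [{center_name: row} for row in groups[center_name]]
            for name, _rows, _key in satellites:
                if name in groups:  # satellite missing here: keep partial as is
                    partial = [{**p, name: row}
                               for p in partial for row in groups[name]]
            result.extend(partial)
    return result


# Hypothetical toy data: both satellites reference city.name.
star_result = hash_star_outerjoin(
    ("city", [{"name": "Rome"}, {"name": "Paris"}], "name"),
    [("state", [{"capital": "Rome"}, {"capital": "Oslo"}], "capital"),
     ("membership", [{"cityorg": "Rome"}], "cityorg")],
)
print(sorted(sorted(ts) for ts in star_result))
```

Since all rows with the same key value land in the same bucket, every join of the subStar is resolved locally, without moving data between buckets.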
In this perspective, computing the full disjunction of each spanning tree in isolation, as done by Algorithm 2, can be computationally very expensive. Spanning trees can differ from each other by only a few edges, and running join operations several times on the same tree portions results in a number of overlapping full disjunctions and an unjustified increase in the overall execution time of the algorithm. To address this issue, we implemented a mechanism for storing the tuple sets generated by sequences of edges of the spanning trees, so that they can be retrieved and reused if the same tree portion is navigated in another computation.
Example 8: Consider the tree shown in Figure 4(b), where the city node has state and membership as adjacent nodes. The join attributes between each pair of tables to be combined by the full outerjoin operator are a) S.capital → C.name, belonging to the tables state and city; b) M.cityorg → C.name, belonging to the tables membership and city. C.name is an attribute common to both join associations. This information is exploited to perform a single partitioning of the three tables considered.

Generating the full disjunction
Our technique for removing subsumed tuple sets relies on a set-trie data structure [11] to store the full disjunction produced by each spanning tree. A set-trie is an extension of a trie [12] which, in addition to simply verifying the membership of an element, supports search operations on subsets and supersets. In particular, set-tries use prefixes of common elements to index the elements, thus enabling the efficient identification of their subsets and supersets. The complexity of these operations is $O(c \cdot |set|)$, where $|set|$ represents the size of the input and $c$ is a constant.

Algorithm 4: RemoveSubsumed
Input: A set of tuple sets $TS$.
Output: A set of tuple sets $TS'$, with no subsumed items.

Algorithm 4 shows the functionality implemented for removing subsumed tuple sets. In line 1, the Coverage function is applied for computing a specific covering set $C$ of $TS$ (i.e., a collection of subsets of $TS$ whose union is $TS$) such that, for each tuple $t$ of $TS$, there is a (unique) item of $C$ containing all and only the tuple sets with $t$. This operation can be easily done with a MapReduce implementation. Then, in parallel for each partition (lines 3-11), the constituting tuple sets are extracted and sorted by size in descending order in the $Q_{max}$ priority queue (lines 4-6). The tuple sets are progressively indexed in a set-trie (line 7) to be analyzed as possible results. Lines 8-11 exploit the set-trie to verify whether tuple sets are subsumed. If the tuple set is not contained in any previous full disjunction (line 10), it is inserted in the set-trie (line 11). Finally, the full disjunctions computed in each partition are merged and returned (lines 12-13).
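The logic of the procedure can be sketched sequentially in Python. The sketch follows the same outline (covering by tuple, largest-first scan within each partition, final merge) but replaces the set-trie with a plain linear scan and runs the partitions one after the other; the function name `remove_subsumed` and the tuple identifiers are assumptions made for illustration.

```python
from collections import defaultdict


def remove_subsumed(tuple_sets):
    """Drop duplicated and subsumed tuple sets from an iterable of tuple sets."""
    # Covering: for each tuple t, the partition of all tuple sets containing t.
    partitions = defaultdict(set)
    for tuple_set in map(frozenset, tuple_sets):
        for t in tuple_set:
            partitions[t].add(tuple_set)

    result = set()
    for members in partitions.values():  # independent, so they could run in parallel
        kept = []
        # Largest first: a tuple set can only be subsumed by one already kept.
        for tuple_set in sorted(members, key=len, reverse=True):
            if not any(tuple_set <= other for other in kept):
                kept.append(tuple_set)
        result.update(kept)
    return result


# Hypothetical input: one duplicate and one subsumed tuple set.
merged = remove_subsumed([
    {"c0", "s0", "m0"},
    {"c0", "s0", "m0"},
    {"c1", "s1"},
    {"c1", "s1", "m1"},
])
print(sorted(sorted(ts) for ts in merged))
```

The partitioning is safe because a tuple set and any superset of it share all the tuples of the smaller set, so they always meet in every partition in which the smaller set appears.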

Approximating the full disjunction
There are scenarios where the computation of the complete set of full disjunctions is not needed. In these situations, approximating the complete set of tuples in a full disjunction with the most "meaningful" ones decreases the computation time, at the cost of some loss of information.
The selection of an approximate set of full disjunctions is based on the analysis of the spanning trees. In particular, we adopted the Pointwise Mutual Information as the measure for weighting the edges of a graph and implemented a well-known algorithm [13] for computing the top-k maximum cost spanning trees.
Pointwise Mutual Information (PMI) has been widely applied in computer science. It provides a correlation measure between two entities by comparing the probability of their joint occurrence with the probability that would be expected if they were statistically independent.
The database research community typically relies on PMI-based measures to weight the cohesion of tables connected via foreign keys in a database [14,15].
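For reference, the standard definition of PMI for two events $x$ and $y$ is $\log \frac{p(x, y)}{p(x)\,p(y)}$. The following Python sketch computes it from raw counts; how table pairs are mapped to events and counts is not specified in this section, so the function signature and the counts in the example are purely illustrative.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) estimated from co-occurrence counts.
    count_xy: observations containing both x and y; total: all observations."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: x and y each occur in 2 of 4 observations.
independent = pmi(count_xy=1, count_x=2, count_y=2, total=4)  # joint = product
correlated = pmi(count_xy=2, count_x=2, count_y=2, total=4)   # always together
print(independent, correlated)
```

A value of zero corresponds to statistical independence, while positive values indicate that the two entities occur together more often than independence would predict.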

Reference techniques for computing the full disjunction
This section provides an overview of two main techniques for computing full disjunctions proposed in the literature: IncrementalFD [4] and BiComNLOJ [5]. These techniques have been extended in this paper to work with equijoins (in the original papers they were designed to work with natural joins) and with parallel computing techniques (four variations of IncrementalFD have been developed). These techniques are used in Section 6 as a baseline for evaluating parafd.

IncrementalFD
IncrementalFD [4] performs an incremental computation of the full disjunction. It iterates over all the tables and, for each of them, it computes a number of tuple sets which are "candidates" to belong to the full disjunction: given a table $R_i$ in a set of relations $\mathcal{R}$, the candidate full disjunction for $R_i$ is the subset of $FD(\mathcal{R})$ that contains tuple sets with a tuple from $R_i$. Candidate tuple sets do not contain subsumed items (i.e., tuple sets are already maximal), so their simple union generates the resulting $FD(\mathcal{R})$.
The candidate full disjunctions are computed by means of two processes: the extension operation, which takes a connected and join consistent tuple set and adds a series of tuples from the tables which have not yet been used, thus creating another connected and join consistent tuple set; and the variation operation, which generates connected and join consistent tuple sets by substituting a tuple in a connected and join consistent tuple set with another tuple from one of the tables that have already been examined.
Algorithm 5 provides more details on IncrementalFD. In line 1, FD, the set that will contain the resulting full disjunctions, is initialized. Line 2 shows that the process iterates over all the tables. For each table $R_i$, Incomplete, the variable that stores the tuple sets under development, is initialized. Then, in line 6, the algorithm iteratively extracts a tuple set $T$ from Incomplete and applies the extension and variation processes.

The BiComNLOJ approach
BiComNLOJ [5] is an approach for the full disjunction computation that exploits the concept of polynomial delay. This concept requires that the time interval between the production of two successive solutions is bounded by a polynomial in the size of the input data, which keeps the wait between results under control. In particular, BiComNLOJ consists of two main components: one for the calculation of sequences of left-deep outerjoins in an acyclic graph (NestedLoopOuterJoin, NLOJ) and one for the calculation of full disjunctions in a general graph, which has a quadratic delay (PDelayFD). These components were initially assembled to compute the full disjunction according to the following steps:
1. calculation of the biconnected components of the schema graph;
2. calculation of the full disjunction for each biconnected component using the PDelayFD algorithm;
3. combination of the full disjunctions deriving from each biconnected component through the execution of the NLOJ algorithm, respecting a specific order.
This computational schema, however, does not guarantee a polynomial delay execution, as the time needed to produce the full disjunction for a single biconnected component is exponential. In order to achieve the polynomial delay property, the combination of the intermediate full disjunctions produced by the different biconnected components, through the NLOJ algorithm, is executed progressively. Firstly, the full disjunction deriving from the first biconnected component is computed. Then, starting from these results, the full disjunction items of the second biconnected component are computed, and so on.
In order to guarantee the correctness of the combination process a specific order must be used: this is called "strong connected-prefix order", and it imposes that two successive biconnected components have to be connected by a connecting relation. This is a relation that either appears in both the biconnected components or is directly connected, through a join condition, with a relation of the previous biconnected component. This new execution flow provides a method for full disjunction computation that is compliant with the polynomial delay property.
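The ordering condition described above can be checked mechanically. The following sketch is a literal reading of that description, not the formal definition given in [5]: the function name follows_connecting_order and the encoding of components as sets of relation names and of join conditions as unordered pairs are assumptions made for illustration.

```python
def follows_connecting_order(components, join_edges):
    """Check that each pair of successive components has a connecting relation.

    components: list of sets of relation names, in the proposed order.
    join_edges: set of frozenset({R, S}) pairs, one per join condition.
    A connecting relation is read here as a relation of the current component
    that also belongs to the previous component, or that is joined directly
    to a relation of the previous component.
    """
    for previous, current in zip(components, components[1:]):
        shared = previous & current
        linked = any(
            frozenset((r, s)) in join_edges
            for r in current
            for s in previous
            if r != s
        )
        if not shared and not linked:
            return False
    return True
```

For instance, with components [{A, B}, {C, D}] and a single join condition between B and C, the order is accepted because C is joined to a relation of the previous component.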

Parallelizing the IncrementalFD algorithm
The logic of the IncrementalFD algorithm can be easily parallelized, as it consists of several independent computations operating on data of significant dimensions (e.g., multiple database tables). This section introduces four approaches to parallelize IncrementalFD by means of a MapReduce strategy.

Table-driven naive parallelization
This approach is a direct variant of the IncrementalFD algorithm, where the first iteration over the database tables (see line 2 of Algorithm 5) is executed in parallel. In this way, the approach simultaneously creates the "candidate" full disjunction relations starting from the different tables. The parallelism adopted here is mainly applied at the "code level": the algorithm does not change, but its flow of execution is altered by inserting multiple workers that generate the "candidate" full disjunctions simultaneously. Once the generation phase is completed, a deduplication task removes the repeated tuple sets. The whole approach, including the deduplication task, can be easily implemented by adapting the IncrementalFD implementation to a MapReduce architecture, where the map task consists of a "candidate" full disjunction discovery process (there is a mapper for each table) and the reduce task removes the duplicated tuple sets in order to produce the full disjunction.
Example 11: Figure 6 exemplifies our approach. For each table, a map task performing the extension and the variation processes is created. In this way, all tuple sets originating from that table are computed. Then, a reducer task responsible for the removal of duplicated items is executed. The remaining tuple sets compose the full disjunction.
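The map and reduce roles of this variant can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the Spark code used in the experiments: a thread pool stands in for the MapReduce workers, and candidate_fd is a hypothetical callable that returns the candidate tuple sets of one table (i.e., the result of the extension and variation processes).

```python
from concurrent.futures import ThreadPoolExecutor

def table_driven_fd(tables, candidate_fd, max_workers=4):
    """Map: one task per table computes its candidate tuple sets.
    Reduce: duplicates produced by different tables are removed."""
    # Map phase: candidate_fd(name, tables) is a placeholder for the
    # extension + variation processes applied to the tuples of one table.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_table = list(pool.map(lambda name: candidate_fd(name, tables), tables))

    # Reduce phase: a tuple set is identified by the set of its tuples,
    # so the same set reached from two tables is kept only once.
    full_disjunction = set()
    for candidates in per_table:
        full_disjunction.update(frozenset(ts) for ts in candidates)
    return full_disjunction
```

Representing each tuple set as a frozenset makes the deduplication independent of the order in which the tuples were collected.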

Two-phase computation
The idea behind this variation is to divide the process for generating the full disjunction into two phases parallelized via a MapReduce implementation, as shown in Figures 7(a) and 7(b). This approach largely improves on the previous table-driven naive parallelization. Firstly, the duplicated tuple sets generated by the first phase are removed before the variation process is applied, thus avoiding the creation of tuple sets that would later be removed. Secondly, the uniform redistribution of the initial data to the mappers in the second phase optimizes performance by balancing the workload of the workers executing the variation process. Finally, the algorithm implements a parallel and progressive mechanism for generating the full disjunction: the tuple sets resulting from the first phase constitute a preliminary result that the user can exploit in advance.
Example 12: One of the advantages introduced with the subdivision of the execution flow into two steps is the ability to remove duplicated tuple sets early, at the end of the first MapReduce task, thus generating an initial set of full disjunction relations. Figure 8 shows the application of the deduplication process to tables State and City of the running example. The first map tasks apply the extension process to the initial tuples. Duplicates are removed by the first reduce tasks. Then, a second MapReduce cycle applies the variation process and removes the duplicates.
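The data flow of the two phases can be summarized with the following sequential sketch. It only mirrors the order of the operations (extension, deduplication, uniform redistribution, variation, deduplication) and runs them in a single process. The callables extend and vary are hypothetical placeholders for the two processes, and returning the union of the tuple sets of both phases as the final result is an assumption of this sketch.

```python
def dedupe(tuple_sets):
    """Reduce step shared by both phases: drop repeated tuple sets."""
    return {frozenset(ts) for ts in tuple_sets}

def partition_evenly(tuple_sets, n_workers):
    """Round-robin redistribution, so every worker gets a similar share."""
    ordered = sorted(tuple_sets, key=sorted)  # deterministic order
    return [ordered[i::n_workers] for i in range(n_workers)]

def two_phase_fd(tables, extend, vary, n_workers=2):
    # Phase 1: one extension task per table, followed by deduplication.
    preliminary = dedupe(ts for name in tables for ts in extend(name, tables))

    # The preliminary result is already usable at this point.
    blocks = partition_evenly(preliminary, n_workers)

    # Phase 2: variation applied to the deduplicated, rebalanced tuple sets.
    varied = dedupe(
        ts for block in blocks for seed in block for ts in vary(seed, tables)
    )
    return preliminary, preliminary | varied
```

Because deduplication happens before partition_evenly, no worker in the second phase spends time on a tuple set that another worker also holds.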

Block-based parallelization
The Block-based parallelization improves the performance of the approach by optimizing the execution of the variation process. Given a tuple set, the variation process replaces its composing tuples with other tuples belonging to the same tables. This task can be optimized if we are able to work on multiple tuple sets simultaneously. In this way, the cache usage is optimized and the total execution time decreases.
Therefore, the Block-based parallelization changes the logical data unit processed by the algorithm: it groups all tuple sets generated from the same tables in the previous extension block and computes the variation of the entire group.
Example 13: Let us consider the three tuple sets generated by the extension task on the tuples of the State table: (1) {c0, s0, m0}, (2) {c0, s1, m1} and (3) {s2, m3}. The variation process will scan all tuples of the City and Membership tables to generate possible variations. Figure 9 shows the process and the results obtained, where the columns represent the input tuple sets and the rows the possible variations. The figure shows that only two valid results are generated: {c0, s0, m1} and {c1, s1, m1}. The other cases generate tuple sets that are not connected and join consistent. Note that when the variation tuple is already included in the considered tuple set, no variation operation is applied ("n.a." in the figure).
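The change of processing unit can be illustrated with a sketch in which every candidate tuple is read once and tried against all tuple sets of a block. The function vary_block and its parameters are hypothetical. Since the contents of the tables of the running example are not reproduced here, the validity test is hard-coded to accept exactly the two results reported in Example 13, which is an assumption made only to keep the sketch self-contained.

```python
def vary_block(block, tables, table_of, is_valid):
    """Apply the variation process to a whole block of tuple sets.

    block:    list of frozensets of tuple identifiers
    tables:   {table name: [tuple identifiers]} scanned for replacements
    table_of: {tuple identifier: table name}
    is_valid: predicate telling whether a tuple set is connected and
              join consistent (supplied by the caller)
    """
    results = set()
    for table, tuple_ids in tables.items():
        for new_id in tuple_ids:          # each candidate tuple is read once
            for tuple_set in block:       # and tried on every set of the block
                if new_id in tuple_set:
                    continue              # "n.a.": tuple already in the set
                old = {t for t in tuple_set if table_of[t] == table}
                if not old:
                    continue              # nothing from this table to replace
                candidate = (tuple_set - old) | {new_id}
                if is_valid(candidate):
                    results.add(frozenset(candidate))
    return results

# Toy data shaped after Example 13 (identifiers only).
block = [frozenset({"c0", "s0", "m0"}),
         frozenset({"c0", "s1", "m1"}),
         frozenset({"s2", "m3"})]
tables = {"City": ["c0", "c1"], "Membership": ["m0", "m1", "m3"]}
prefix = {"c": "City", "s": "State", "m": "Membership"}
table_of = {t: prefix[t[0]] for t in
            ["c0", "c1", "s0", "s1", "s2", "m0", "m1", "m3"]}
# Assumption: validity is looked up in the two results listed in Example 13.
valid = {frozenset({"c0", "s0", "m1"}), frozenset({"c1", "s1", "m1"})}
variations = vary_block(block, tables, table_of, lambda ts: ts in valid)
```

Placing the loop over the block innermost is what lets one scan of City and Membership serve all tuple sets of the group.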

Checkpoint-based version
The block-based version can generate workload imbalance among the workers. To provide a partial solution to this issue, a checkpoint system has been implemented to periodically interrupt the execution of the workers, collect the results produced up to that point and produce a new distribution of the data. The frequency of the re-balancing depends on the number of new full disjunction relations discovered with respect to the number of tuple sets taken as input by the considered worker: when the difference between these two quantities exceeds a fixed threshold, the checkpoint system is triggered.
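A minimal sketch of the trigger and of the redistribution step is given below. The names should_checkpoint and rebalance are hypothetical, the direction of the difference (discovered minus received) is one possible reading of the rule stated above, and the round-robin redistribution is an assumption, since the text does not specify how the new distribution is produced.

```python
def should_checkpoint(discovered, received, threshold):
    """Trigger a checkpoint when a worker has discovered many more tuple sets
    than it received as input (difference above a fixed threshold)."""
    return discovered - received > threshold

def rebalance(worker_queues):
    """Collect the pending tuple sets of all workers and hand them out again
    in round-robin fashion, so the queues end up with similar lengths."""
    pending = [ts for queue in worker_queues for ts in queue]
    n = len(worker_queues)
    return [pending[i::n] for i in range(n)]
```

With three workers holding 5, 1 and 0 pending items, rebalance returns three queues of 2 items each.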

Related Work
Section 4 has introduced the main existing techniques for computing the full disjunction of a relational database. In this section, we introduce other related work on the available technologies for supporting parallel computing, and some approaches for performing the join in distributed and parallel frameworks.

Technology supporting parallel computing
Recently, a large number of technologies and paradigms have been developed to efficiently manage high data volumes at reduced cost. MapReduce [16] was one of the first products developed to address the high scalability and flexibility requirements imposed by large amounts of data. Its main advantage was its ability to hide the implementation details of a parallel environment through the adoption of a distributed programming paradigm based on two primitive functions: Map and Reduce. Its simplicity, flexibility and fault tolerance immediately made it a tool of paramount importance for managing large amounts of data. Several implementations of its paradigm have been developed. Among them, the most used is Apache Hadoop.
However, this model has shown some limitations. The most significant ones are: a) the use of a non-declarative programming paradigm based on its two primitives only; b) the inability to exploit the advantages deriving from the structured organization of some information, i.e., all data have to be organized in a key-value form; and c) the adoption of a rigid dataflow articulated in fixed phases, such as reading data from a distributed file system, applying a MapReduce job and storing the results in a distributed file system. Within an iterative logic, the repeated application of a sequence of MapReduce jobs involves, due to the lack of storage of an intermediate state, an intense use of I/O operations that can easily degrade performance.
In response to these needs, new approaches have been introduced. Apache Hive and Apache Pig were the first tools integrating the MapReduce paradigm into a declarative logic, similar to the SQL one. Several wrappers have also been introduced to use structured data with a schema instead of key-value pairs. Finally, several frameworks have been created to achieve high performance even in the presence of iterative and interactive data processing. Apache Spark is the major exponent of this category of frameworks. It provides the most complete, reliable and performant solution. Thanks to an in-memory computing model, it can be up to 100 times faster than Apache Hadoop, and it supports structured data analysis based on a declarative logic.

Join in a distributed environment
The join operator is a useful tool to integrate information from different data sources. Its operation, however, does not fit well in a distributed scenario, where the information to be integrated can be divided between different nodes and therefore its integration requires high network traffic.
Parallel join processing originates from the work on the early parallel database systems, such as Bubba [17], PRISMA/DB [18] and GAMMA [19], where hash-based partitioning was used to distribute the join arguments to multiple machines in a cluster. Evaluating a multi-join query via hashing in parallel over a shared-nothing environment has also been investigated in the literature. Different parallel processing strategies such as left-deep and right-deep [20], segmented right-deep [21], zigzag tree [22] and other variations [23] have been proposed. However, most of them report their results based on simulations, while we report our results based on a working distributed system.
Within a MapReduce framework, the join operation can be performed in two different ways: Map-side join and Reduce-side join [24]. The first family of joins includes Map-Merge joins and Broadcast joins. In a Map-Merge join, the tables to be integrated are already partitioned in the distributed file system on the join key, and a merge phase is applied through Map functions. The broadcast join instead replicates the smaller table and stores it in the memory of each mapper, in order to carry out the join comparisons more efficiently. In the Reduce-side joins, the most common strategy is the repartition join [25], which labels, in the map phase, the tuples according to the provenance table, partitions the tuples based on the join key and performs the tuple comparison in the reduce phase. This corresponds to the application of a standard hash join in a distributed environment. Although other MapReduce join approaches have been proposed, such as Map-Reduce-Merge [26] and Map-Join-Reduce [27], many implementations provided by higher-level systems built above MapReduce, like the Hadoop-based systems and those integrated within Apache Spark, are available. However, these systems provide implementations that can work only with two tables [28]. Therefore, the application of an integration task joining n data sources, which represents the typical scenario with the full disjunction, is divided into several phases. Concurrent join [29] and Scatter-Gather-Merge [28] algorithms have been proposed to efficiently implement star joins.
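The repartition join described above can be sketched in plain Python. The sketch simulates the map, partition and reduce steps inside a single process and computes an inner equijoin on one attribute; the function name repartition_join and its parameters are illustrative and do not correspond to the API of any specific framework.

```python
from collections import defaultdict

def repartition_join(left, right, key, n_reducers=2):
    """Reduce-side (repartition) join of two lists of row dictionaries."""
    # Map phase: tag each row with its provenance (0 = left, 1 = right) and
    # route it to a reducer chosen by hashing the join key.
    partitions = [defaultdict(lambda: ([], [])) for _ in range(n_reducers)]
    for tag, rows in ((0, left), (1, right)):
        for row in rows:
            k = row[key]
            partitions[hash(k) % n_reducers][k][tag].append(row)

    # Reduce phase: each reducer only compares rows that share a join key.
    joined = []
    for partition in partitions:
        for left_rows, right_rows in partition.values():
            for l in left_rows:
                for r in right_rows:
                    joined.append({**l, **r})
    return joined
```

Rows with the same key always land in the same partition, which is why each reducer can work without seeing the rest of the data. The order of the output rows depends on the hash values and should not be relied upon.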
The need to integrate n tables is also a typical scenario in the field of business intelligence, where star queries require that a central table, i.e., the fact table, is integrated with information extracted from other tables, i.e., the dimension tables. Most star query implementations [30] exploit the different dimensionality of the tables involved (i.e., the fact table is typically larger than the dimension tables), and therefore they do not provide a generic strategy. The work in [10] introduces a technique called star hash join (SHJ), which provides an optimization of a left-deep tree-shaped query plan applied in a data warehouse scenario. SHJ solves a problem similar to the one addressed by our implementation, but applies a less efficient solution based on a sequence of joins. In particular, the dimension tables are partitioned by their primary keys, and the fact table is partitioned by one of its foreign keys. Hence, only one join between one dimension table and the fact table may be performed locally within each cluster.

Experimental evaluation
We conducted a large number of experiments to assess the quality of parafd. First of all, we evaluated its time performance, since efficiency is one of the main problems affecting the existing algorithms. The experiments described in Section 6.1 demonstrate that our implementation is usable in real scenarios. In Section 6.2 we tested the robustness of parafd by varying the dimensionality of the input data and the number of join connections. This evaluation confirms that our approach is scalable. Section 6.3 evaluates our technique for approximating full disjunctions; the experiments show a reduction of the time required for computing the approximation, and a limited loss of information. Finally, in Section 6.4, we evaluated the effectiveness of parafd by using the full disjunction as a collection of documents for a text retrieval search engine.
Implementation, environment. We performed the experiments in a cluster of 6 virtual machines running Ubuntu 12.04. Each machine has 16 processors, 128 GB of RAM and 1 TB of storage. We implemented parafd using the Python interface of the Apache Spark framework.
Dataset descriptions. Three reference datasets [6] with complementary features have been used in our experiments. The Internet Movie Database (IMDB) is a database of cinematographic data including more than 1.6M tuples distributed in 6 relations, with a total size of 459 MB. The Wikipedia dataset is a reduced version of the popular encyclopedia including six relations, more than 200k tuples and a total size of 391 MB. Mondial is a small dataset with a size of 16 MB and 17k tuples (two orders of magnitude smaller than the IMDB dataset), but with a complex schema composed of 28 relations. By virtue of these characteristics, Mondial represents a meaningful test for the validation of full disjunction algorithms, whose complexity mainly depends on the high number of data connections. IMDB can be thought of as a typical business data source composed of a large amount of data and a simple schema. Finally, Wikipedia, which contains the full text of articles, is a good option for the validation of keyword search systems.

Efficiency of the approach
Three experiments have been performed to assess the efficiency of our approach. In a first experiment, we compared the execution time of our approach with respect to the existing algorithms and their parallelized versions described in Section 4. In a second experiment, we evaluated parafd time performance on the three reference datasets. Finally, we evaluated the hash star join implementation against the traditional left-deep join technique.
The comparison of the performance of the existing strategies for computing the full disjunction has been performed by considering three subsets of the Mondial database. Table 1 shows the results of this experiment: Column 1 reports the subset of tables of Mondial used as configuration (input database), Column 2 reports the total number of tuples in the database and Column 3 reports the resulting number of tuples in the full disjunction of such database. For each data configuration, all proposed algorithms have been tested and their execution times are reported. Note that the results reported for BiComNLOJ refer to a parallelized version of the original algorithm that we have implemented. To avoid noise and bias generated by other running applications and network failures, we have repeated the experiment 5 times for each algorithm. The table shows the average time and, in brackets, the standard deviation. A mark "-" shows experiments that did not finish before the timeout (300 hours).
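As a small illustration of how a cell of such a table can be derived from repeated runs, the sketch below formats the average and the standard deviation in the "average (deviation)" style described above. The function name summarize, the use of None to mark a run that hit the timeout, and the choice of the sample standard deviation are assumptions; the text does not state which estimator was used.

```python
from statistics import mean, stdev

def summarize(run_times):
    """Return 'average (standard deviation)' for a list of run times in
    seconds, or '-' if any run did not finish (marked here with None)."""
    if any(t is None for t in run_times):
        return "-"
    return f"{mean(run_times):.2f} ({stdev(run_times):.2f})"
```

For five hypothetical runs of 10, 12, 11, 13 and 14 seconds, the function returns "12.00 (1.58)".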
The experiments show that parafd outperforms the existing techniques, reducing the time required for computing full disjunctions by up to 4 orders of magnitude. Moreover, our implementation is the most scalable when the data increases: the time required has the same order of magnitude for all configurations. This behavior is not shared by the other techniques, where the time for completing the task varies largely with the input size. A further evaluation of parafd scalability is proposed in Section 6.2.
In the second experiment we measured the time to compute full disjunctions in the three reference datasets. As in the previous experiments, we repeated the computation 5 times to avoid bias. Table 2 shows the average time and the standard deviation measured. Despite its small size (17k tuples), Mondial generated the largest number of full disjunction tuples (more than 280 × 10^6). This is due to the high number of foreign keys, which generate more than 4k spanning trees. Our approach is the only technique among the ones tested able to compute this large number of full disjunction tuples (the other approaches did not finish within the 300-hour timeout).
Finally, we evaluated the efficiency of the hash join algorithm implemented in parafd. The experiment consisted in comparing the execution time required by our hash star join algorithm to calculate the outerjoins between n relations with the time required by a sequence of left-deep outerjoins applied in the connected-prefix ordering. The approaches have been evaluated on a single tree selected from the schema graphs of the reference datasets. In particular, the schema graphs of the IMDB and Wikipedia datasets generate only one tree. The spanning tree of maximum weight has been considered for the Mondial schema graph. The results of this evaluation are reported in Table 3. The hash star join technique performs better than a sequence of left-deep outerjoins: the reduction of the execution times varies from about 25% to 40%. Mondial is the dataset on which the improvement of the proposed algorithm over the considered baseline is smallest. This result is explained by the fact that this dataset has fewer tuples and a smaller overall size than the other scenarios (see the dataset descriptions at the beginning of Section 6). As a consequence, the time for distributing the data among the cluster workers has a smaller impact on the entire process.

Robustness of the approach
In this section we show the ability of our system to scale when the size of the data increases. In a first experiment, we executed parafd on the reference datasets by varying the number of active hosts within the considered cluster. The results are reported in Figure 10(a), where, for each cluster configuration, we show the execution times. We observe that similar execution times are required for the IMDB and Wikipedia databases (i.e., a minimum time of about 800 seconds with 6 active hosts and a maximum time exceeding 4000 seconds with a single host), while Mondial shows the highest execution times (i.e., about ten times higher). In all scenarios, the execution times show a fairly linear trend with respect to the growth of the number of active hosts. A more intuitive representation of this trend is shown in Figure 10(b), where the speedup of our approach is reported for the three datasets. As can be seen, only the results obtained with 6 active hosts fall slightly below an exact linear trend. This can be explained by considering that the time for distributing the data between the hosts increases.
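The speedup plotted in this kind of analysis is the ratio between the single-host time and the time measured with n hosts, and the parallel efficiency is the speedup divided by n. The sketch below computes both. The function names are illustrative, and the numbers in the usage note are rounded, illustrative values inspired by the approximate figures quoted above, not measurements taken from Figure 10.

```python
def speedup(times_by_hosts):
    """Speedup S(n) = T(1) / T(n) for each number of hosts n."""
    base = times_by_hosts[1]
    return {n: base / t for n, t in sorted(times_by_hosts.items())}

def efficiency(times_by_hosts):
    """Parallel efficiency E(n) = S(n) / n; 1.0 means exactly linear scaling."""
    return {n: s / n for n, s in speedup(times_by_hosts).items()}
```

With illustrative times of 4000 seconds on one host and 800 seconds on six hosts, speedup gives 5.0 for six hosts, i.e., an efficiency of about 0.83, slightly below the linear value of 1.0.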
In a second experiment we evaluated the execution time by varying the number of joining tuples in the input tables. For this purpose, we generated synthetic datasets consisting of 3 tables, each with a cardinality of 100k tuples, and with different distributions of the join connections. In a first scenario we evaluated the time performance with a balanced dataset, where the tables are connected only via one-to-one relations. In the other scenarios we built unbalanced distributions of joining tuples, where the cardinality of the tuples involved in one of the foreign/primary key relations is lower than in the remaining two. In particular, each pair of tuples deriving from two of the three tables was repeated, with probability 0.9, a number of times equal to 25, 50, 100, 250 and 500 respectively. Table 4 reports the number of full disjunction tuples and the relative computation time for each of the synthetic datasets. By construction, the first scenario produces a number of full disjunction tuples equal to the number of tuples of all three tables. The time taken to calculate the full disjunction in this scenario is about 5 minutes. As the number of join associations between the tuples increases, the number of full disjunction tuples grows proportionally, whereas the execution time grows far more slowly: it goes from 460.94 seconds to generate 40M full disjunction tuples up to 952.19 seconds for 406M, i.e., roughly a twofold increase in time for a tenfold increase in output. This experiment shows that the developed approach is able to scale even as the number of join connections between the tuples changes.

Effectiveness of the full disjunction approximation
This section shows the experiments performed to evaluate the efficiency and the effectiveness of the technique for generating the approximated full disjunction. The efficiency has been evaluated by considering four degrees of approximation (i.e., the full disjunction is computed by considering 10%, 30%, 50% and 70% of the overall number of spanning trees in the dataset) and measuring the number of relations in each approximation and the time required for their computation. Table 5 shows the results of our experiments executed on Mondial: as expected, the number of full disjunction relations and the time required for their computation increase with the size of the approximation.
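The degrees of approximation can be read as the selection of a fraction of the spanning trees, followed by the computation of the full disjunction on the selected trees only. The sketch below illustrates this reading under explicit assumptions: the trees are picked uniformly at random (the text does not state the selection criterion), fd_of_tree is a hypothetical callable returning the full disjunction relation of one tree as a collection of hashable items, and the results of the trees are simply united, without any further subsumption check.

```python
import random

def select_trees(spanning_trees, fraction, seed=0):
    """Pick a fraction of the spanning trees (at least one), uniformly at random."""
    k = max(1, round(fraction * len(spanning_trees)))
    return random.Random(seed).sample(list(spanning_trees), k)

def approximate_fd(spanning_trees, fraction, fd_of_tree, seed=0):
    """Union of the per-tree results computed on the selected trees only."""
    result = set()
    for tree in select_trees(spanning_trees, fraction, seed):
        result |= set(fd_of_tree(tree))
    return result
```

Fixing the seed makes the selection reproducible, which is convenient when the same approximation has to be evaluated for both time and information loss.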
The ability of an approximation to represent the entire full disjunction has been evaluated by checking whether the attribute values appearing in the full disjunction relations also appear in the approximated version. We tested the approach in three scenarios, where 100 attribute values, 100 pairs of values, and 100 triples were randomly extracted. For each scenario we computed the precision, i.e., the fraction of items that have been retrieved in the approximations, and the recall, i.e., the fraction of full disjunction relations containing the items in the approximation with respect to those in the complete set. We performed the experiment by selecting a number of levels of approximation. Table 6 shows the results obtained. The value in brackets is the standard deviation. In all configurations high precision and recall levels are obtained.

A keyword search system on relational databases based on a full disjunction
In this section, we propose to consider the full disjunction as a collection of documents to be indexed by a text retrieval system (i.e., Lucene). In this way, we implement a simple keyword search system on relational databases and we provide a measure of the effectiveness of the full disjunction in a real and challenging scenario. We tested our idea against the benchmark proposed in [6], where IMDB, Wikipedia and Mondial are evaluated against 50 queries per source. The time required for indexing the data, and the average, standard deviation, minimum and maximum time required to solve the queries, are shown in Table 7. We observe that, in most cases, the time required for solving the queries allows the simple approach implemented to work in real time without any optimization. Only in a few cases are the keyword queries complex enough to require a long time to complete.
Figure 11 shows the average precision of the first answer returned by parafd, compared with the other systems, in solving the keyword queries of the benchmark on the three datasets. Our approach largely outperforms the other systems in the scenarios related to the IMDB and Wikipedia datasets, and performs on par with the best of the other approaches on Mondial. The Mondial database schema is complex: the tables are connected via a large number of paths that form cycles. As a result, the same tuple is repeated several times in the full disjunction relations, and the ranking adopted by Lucene is in some cases not consistent with the one expected by the user.
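The idea of treating each full disjunction tuple as a document can be illustrated with a toy inverted index. This sketch is a stand-in written for illustration only: it does not use Lucene and does not reproduce its scoring, and the names tokenize, build_index and search are hypothetical. Documents are ranked by the number of distinct query terms they contain, with ties broken by document position.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"\w+", text.lower())

def build_index(fd_rows):
    """Index every full disjunction tuple (a dict) as one document.
    Null values (None) are skipped."""
    index = defaultdict(set)
    for doc_id, row in enumerate(fd_rows):
        text = " ".join(str(v) for v in row.values() if v is not None)
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return document ids ranked by the number of matched query terms."""
    hits = Counter()
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return [doc for doc, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))]
```

Because a full disjunction tuple already combines the values of all connected tables, a query whose keywords come from different tables can match a single document, which is what makes this simple scheme usable for keyword search.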

Conclusion
In this paper we have presented parafd: a new approach, based on parallel computing techniques, for generating the full disjunction of a relational database. The same approach can also be used for obtaining an approximated full disjunction, where only a limited subset of the complete full disjunction is computed. For comparing our proposal with the state of the art, we have implemented IncrementalFD and BiComNLOJ, two of the main algorithms available in the literature. The experiments demonstrate that parafd outperforms the existing approaches in terms of execution time. The effectiveness of the approximated version has also been evaluated, and we showed that it provides a good representation of the entire full disjunction. Finally, we have used the full disjunction as a collection of documents to be indexed and retrieved by a search engine. The idea is to provide a basic answer to the problem of keyword search on relational databases. Our idea has been evaluated against an existing benchmark, obtaining results with high recall and precision.

Example 5: With reference to the spanning tree ST1 = (ℛ, {e1, e2}), where e1 = (M, S) and e2 = (C, S), the full disjunction relation FD(ST1) can be obtained by computing one of the following two outerjoin sequences: (C ⟗ S) ⟗ M and (M ⟗ S) ⟗ C, where ⟗ denotes the outerjoin.

Figure 2: Example of application of Theorem 2

Figure 3: parafd applied to the running example

Figure 4: Running example schema graph and some derived spanning trees

Figure 5: Candidate full disjunction of State
Figure 6: Table-driven naive parallelization

Figures 7(a) and 7(b) show the two phases. The first phase implements the functionalities introduced in lines 2-8 of Algorithm 5. A mapper is initialized for each relation, with the goal of generating the first set of tuple sets. The subsequent reducers remove the duplicates. The second phase performs the variation process (Figure 7(b)).

Figure 7: The MapReduce phases of the Checkpoint-based version. (a) Full Disjunction Generation - First Phase: Extension Process; (b) Full Disjunction Generation - Second Phase: Variation Process

Figure 8: Two Phase Computation approach applied to the running example

Figure 11: Average precision of the first answer of parafd compared with the benchmark [6]

Table 1(d) reports the full disjunction relation, where the tuple identifiers used in Table 1(c) are substituted with the real tuples, or with null values if missing.

Table 1: Efficiency with fragments of the Mondial Database. A mark "-" is reported for experiments not finished before the timeout (300 hours)

Table 2: Efficiency of parafd in the reference datasets

Table 3: Comparison of extended hash join and left-deep outerjoins

Table 4: Comparison of execution times with different number of join connections

Table 5: Efficiency of the approximated version of parafd

Table 6: Effectiveness of the approximated version of parafd

Table 7: Keyword search response times (in seconds)