Computing inter-document similarity with Context Semantic Analysis

We propose a novel knowledge-based technique for inter-document similarity computation, called Context Semantic Analysis (CSA). Several specialized approaches built on top of specific knowledge bases (e.g., Wikipedia) exist in the literature, but CSA differs from them because it is designed to be portable to any RDF knowledge base. In fact, our technique relies on a generic RDF knowledge base (e.g., DBpedia or Wikidata) to extract a Semantic Context Vector, a novel model for representing the context of a document, which is exploited by CSA to compute inter-document similarity effectively. Moreover, we show how CSA can be effectively applied in the Information Retrieval domain. Experimental results show that: (i) for the general task of inter-document similarity, CSA outperforms baselines built on top of traditional methods, and achieves a performance similar to that of approaches built on top of specific knowledge bases; (ii) for Information Retrieval tasks, enriching documents with context (i.e., employing the Semantic Context Vector model) improves the result quality of the state-of-the-art technique that employs a similar semantic enrichment.


Highlights
• A novel knowledge-based technique for inter-document similarity computation: Context Semantic Analysis (CSA).
• CSA relies on a generic RDF knowledge base (e.g., DBpedia) to extract a Semantic Context Vector able to represent the context of a document.
• CSA can be effectively applied in the Information Retrieval domain.
• Experimental results show that CSA outperforms baselines built on top of traditional methods, and achieves a performance similar to that of approaches built on top of specific knowledge bases.


Introduction
Recent years have seen a growing number of knowledge bases employed in several domains and applications. Besides DBpedia [1], which is the heart of the Linked Open Data (LOD) cloud [2], other important knowledge bases are: Wikidata [3], a collaborative knowledge base; YAGO [4], a huge semantic knowledge base derived from Wikipedia, WordNet and GeoNames; Snomed CT [5], the best-known ontology in the medical domain; and AGROVOC [6], a multilingual agricultural thesaurus we recently used for annotating agricultural resources [7].

In the literature, knowledge-based approaches have been employed for improving existing techniques in the Natural Language Processing (NLP) [8] and Information Retrieval (IR) [9] domains. Yet, there is much room for improvement in order to effectively exploit these rich models in these fields [10]. For instance, in the context of inter-document similarity, which plays an important role in many NLP and IR applications, classic techniques rely solely on syntactic information and are usually based on Vector Space Models [11], where the documents are represented in a vector space having document words as dimensions. Nevertheless, such techniques fail in detecting relationships among concepts in simple scenarios like the following sentences: "The Rolling Stones with the participation of Roger Daltrey opened the concerts' season in Trafalgar Square" and "The bands headed by Mick Jagger with the leader of The Who played in London last week". These two sentences contain highly related concepts (e.g., Roger Daltrey is the leader of The Who) which can be found by exploiting the knowledge network encoded within knowledge bases such as DBpedia.
To overcome the limitation of a purely syntactical approach, in [12] we proposed Context Semantic Analysis (CSA), a novel semantic technique for estimating inter-document similarity that leverages the information contained in a knowledge base. One of the main novelties of CSA w.r.t. other knowledge-based approaches is its applicability over any RDF knowledge base, so that all datasets belonging to the LOD cloud [2] (more than one thousand) can be used. CSA is based on the notion of contextual graph of a document, i.e., a subgraph of the knowledge base that contains the contextual information of the document; the notion of contextual graph is very similar to that of the semantic graph defined in [10]. The contextual graph is then suitably weighted to capture the degree of associativity between its concepts, i.e., the degree of relevance of a property for the entities it connects. The vertices of such a weighted contextual graph are then ranked by using PageRank methods, so obtaining a Semantic Context Vector, a novel model able to represent the context of the document. Thus, the similarity of two documents is computed by comparing their Semantic Context Vectors with general vector comparison methods, such as the cosine similarity. By evaluating our method on a standard benchmark for document similarity (which considers correlations with human judges), we showed how CSA outperforms almost all other methods and how it can exploit any RDF knowledge base. Moreover, we analyzed its scalability in a clustering task with a large corpus of documents, and showed that our approach outperforms the considered baselines.
This paper extends our previous work presented at the SISAP 2016 Conference. The main novel contribution of the extended paper is to test the applicability and effectiveness of Context Semantic Analysis (CSA) in a real-world application domain, namely Information Retrieval (IR). To this purpose, we analysed the semantic-based approaches recently proposed in the Information Retrieval research community. We found that the most effective and general IR framework adopting semantic enrichment of documents is KE4IR [13]. We studied its layered architecture and tried to improve its performance by including CSA as a new semantic layer. The outcome was positive, as we were able to show that KE4IR + CSA outperforms the original KE4IR framework (see Section 5.2).
The paper is structured as follows. Section 2 describes the related work, while Section 3 is devoted to some preliminaries useful for the rest of the paper. Then, CSA is described in Section 4 and Section 5 shows its evaluation. Finally, Section 6 outlines conclusions and future work.

Related Work
Text similarity has been one of the main research areas of the last years, due to the wide range of its applications in tasks such as information retrieval, text classification, document clustering, topic detection, etc. [14]. In this field a lot of techniques have been proposed, but we can group them into two main categories, content-based and knowledge-enriched approaches, where the main difference is that the first group uses only the textual information contained in documents, while the second one enriches these documents by extracting information from other sources, usually knowledge bases.

Content-based Approaches
The standard document representation technique is the Vector Space Model [11]. Each document is expressed as a weighted high-dimensional vector, the dimensions corresponding to individual features such as words. The result is called the bag-of-words model, and it is the first example of a content-based approach. The limitation of this model is that it does not address polysemy (the same word can have multiple meanings) and synonymy (two words can represent the same concept). Another technique belonging to the content-based group is Latent Semantic Analysis (LSA) [15], which assumes that there is a latent semantic structure in the documents it analyzes. Its goal is to extract this latent semantic structure by applying dimensionality reduction to the term-document matrix used for representing the corpus of documents.
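The synonymy limitation of the bag-of-words model can be seen in a few lines of code. The following is a minimal sketch (toy sentences, whitespace tokenization), not the evaluation setup used in the paper:

```python
from collections import Counter
import math

def bow_cosine(doc_a, doc_b):
    """Cosine similarity between simple bag-of-words term-count vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Synonymy problem: no shared surface forms, so the similarity is zero,
# even though the two sentences express the same meaning.
print(bow_cosine("the film was great", "that movie is wonderful"))  # -> 0.0
```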
Finally, in the context of Information Retrieval, probabilistic models are employed for ranking documents according to their relevance (similarity) to a given search query, i.e., similarities are computed as probabilities that a document representation matches or satisfies a query.Among them, the most popular are: Okapi BM-25 [16] and language modelling approaches [17].

Knowledge-enriched Approaches
Recently, a lot of effort has been devoted to designing new techniques for text similarity that use information contained in knowledge bases. Explicit Semantic Analysis (ESA) [18] proposes to map the documents to Wikipedia articles, and to represent each document as a vector of features extracted from both the document and the text of the related articles. Thus, the similarity of two documents can be computed through any vector space comparison algorithm.
Another document similarity technique that leverages the information contained in Wikipedia is WikiWalk [19], where the personalized PageRank on Wikipedia pages is used, with a personalization vector based on the ESA weights of the concepts detected in the documents, to produce a vector used for estimating the similarity. A big drawback of this approach is its computational cost; indeed, for each document we first have to execute ESA and then compute the personalized PageRank on the whole of Wikipedia. Another remarkable approach is SSA, i.e., Salient Semantic Analysis [20]. This method starts from Wikipedia to create a corpus where concepts and saliency are explicitly annotated; then, the authors use this corpus to build concept-based word profiles, which are used to measure the semantic relatedness of words and texts. These knowledge-enriched approaches are designed to use only Wikipedia as a source of knowledge, and they are not portable to generic knowledge bases. Our method CSA differs from them because it aims to be a general approach that can use any knowledge base expressed according to the Semantic Web standard, i.e., described in RDF, so that all datasets belonging to the Linked Open Data cloud [2] (more than one thousand) can be used as sources of knowledge. To the best of our knowledge, the only approach portable to generic knowledge bases is the one proposed in [10], where the authors represent the documents belonging to a corpus as graphs extracted from an RDF knowledge base. It differs from CSA because it is based on a Graph Edit Distance (GED) graph matching method to estimate similarity, while in our approach a document is represented as a vector and the similarity can be estimated more effortlessly by using the cosine similarity.
Finally, from the Information Retrieval community, two recent works [21,13] have proposed general information retrieval techniques, based on the Vector Space Model, that work with documents semantically enriched with Linked Open Data. In Section 5.3, we show how CSA can be employed to enhance the IR framework KE4IR [13], which has been experimentally demonstrated to outperform Waitelonis et al. [21]. Our experimental evaluation shows that CSA improves the original KE4IR.

Inter-Document Similarity
The state-of-the-art techniques for estimating inter-document similarity are primarily based on Vector Space Models: a document is represented through a bag-of-words feature vector, which contains information about the presence and absence of words in the document, and the similarity between two documents is calculated as the cosine of the angle between the two respective vectors (i.e., their cosine similarity).
Vector Space Models are generally based on a co-occurrence matrix, a way of representing how often words co-occur; in a term-document matrix, each row represents a word and each column represents a document. Let C be a corpus composed of n documents, where each document d_j is composed of a sequence of terms. Let m be the number of terms in C; the term-document matrix T is an m×n matrix where each cell (i, j) contains the weight t_ij assigned to term i in document j. A document d_j is then represented by the vector d_j = [t_1j, ..., t_mj].
While with the simple bag-of-words representation the weight t_ij is equal to the number of times term i appears in document j, many weighting strategies have been proposed in the literature (see, for example, [22]), such as tf-idf (Term Frequency - Inverse Document Frequency).
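As an illustration, a minimal tf-idf weighting of a toy corpus might look as follows. The token lists and the log(n/df) idf variant are assumptions of this sketch, not the exact weighting used in the paper:

```python
import math
from collections import Counter

def tf_idf_matrix(corpus):
    """corpus: list of token lists -> list of {term: tf-idf weight} per document.
    idf here is the classic log(n / df); terms present in every document get 0."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # document frequency: one count per document
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["rock", "band", "london"], ["rock", "music"], ["london", "weather"]]
W = tf_idf_matrix(docs)
# "music" appears in one of three documents, so it gets a higher weight
# in document 1 than the more widespread "rock".
```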
The novel technique we propose, Context Semantic Analysis (CSA), is based on a matrix T whose columns are associated with documents, and whose rows with concepts of a Knowledge Base KB, such as DBpedia (Section 3.2). The weight assigned to concept i in document j is intuitively defined as follows: first, document j is represented by means of the so-called Contextual Graph (Section 4); the weight of concept i in document j is then computed as the relevance of node i in the Contextual Graph, by using well-known algorithms such as PageRank (Section 3.3).
The aim of the proposed technique is to extend documents with a context extracted from a knowledge base; to show that this extension is useful for estimating document similarity, we chose to use common approaches, such as tf-idf with the cosine similarity. The combination of our technique with more complex weighting schemes represents interesting future work.

Knowledge Base
We focus on RDF knowledge bases; an RDF knowledge base can be considered a set of facts (statements), where each fact is a triple of the form <subject, predicate, object>. A set of such triples is an RDF graph KB = (V, E): a labeled, directed multi-graph, where subjects and objects are vertices and the predicates are labeled edges between vertices. According to [23], vertices are divided into three disjoint sets: URIs U, blank nodes B and literals L. Literals cannot be the subjects of RDF triples. For our experiments we chose two generic-domain knowledge bases, DBpedia [1] and Wikidata [3], due to their large coverage and variety of relationships at the extensional level.
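A minimal way to hold such an A-Box in memory is as a set of <subject, predicate, object> tuples, which directly encodes the labeled, directed multi-graph view. The triples below are illustrative, not taken from an actual DBpedia dump:

```python
# Minimal RDF-style knowledge graph as a set of <subject, predicate, object>
# triples; entity and property names are illustrative only.
triples = {
    ("dbr:The_Rolling_Stones", "dbo:genre", "dbr:Rock_music"),
    ("dbr:The_Who", "dbo:genre", "dbr:Rock_music"),
    ("dbr:Roger_Daltrey", "dbo:associatedBand", "dbr:The_Who"),
}

def out_edges(kb, subject):
    """All (predicate, object) pairs leaving a vertex, as in a labeled multigraph."""
    return [(p, o) for (s, p, o) in kb if s == subject]

print(out_edges(triples, "dbr:The_Who"))  # [('dbo:genre', 'dbr:Rock_music')]
```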

PageRank
PageRank was first proposed to rank web pages [24], but the method is now used in several applications for finding the vertices of a graph that are most relevant for a certain task. Let G be a graph with n vertices and d_i be the outdegree of vertex i; the Standard PageRank algorithm computes the PageRank vector R defined by the equation:

R = c M^T R + (1 − c) v

where M is the transition probability matrix, an n×n matrix given by M_ij = 1/d_i if there exists an edge from i to j and 0 otherwise, c is the damping factor, a scalar value between 0 and 1 (usually between 0.85 and 0.95), and v is the teleport vector, a uniform vector of size n in which each element is 1/n.
In the Standard PageRank configuration the vector v is a stochastic normalized vector where all the values are 1/n, meaning that the random surfer has an equal probability of being teleported to any of the nodes of the graph G. In other words, Standard PageRank uses just the graph topology; on the other hand, many graphs, like the ones in our case, come with weights on nodes and/or edges, which can be used to personalize the PageRank algorithm. The Personalized PageRank [25] uses node weights to define a non-uniform vector v, thus biasing the computation of the PageRank vector R to be more influenced by heavier nodes. Another variant is the Weighted PageRank [26], which uses edge weights to define a custom transition probability matrix for further influencing the computation of the PageRank vector R. In the transition probability matrix of the Weighted PageRank, a weighted outdegree d_i for a node i is used, with d_i = Σ_j A_ij, where A_ij > 0 represents the weight of an edge from node i to node j.
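The three variants differ only in the teleport vector v and in how transition probabilities are built from edge weights, so a single power-iteration routine covers all of them. The following is a simplified sketch (dangling nodes simply leak probability mass, and convergence is assumed after a fixed number of iterations):

```python
def pagerank(nodes, weight, c=0.85, v=None, iters=50):
    """Power iteration for Standard / Personalized / Weighted PageRank.
    weight[(p, q)] is the edge weight from p to q (1.0 for unweighted graphs);
    v is the teleport vector (uniform when None, i.e. Standard PageRank).
    Returns a dict node -> score."""
    n = len(nodes)
    v = v or {x: 1.0 / n for x in nodes}
    r = dict(v)
    # weighted outdegree of each node (plain outdegree when all weights are 1)
    out = {p: sum(w for (a, _), w in weight.items() if a == p) for p in nodes}
    for _ in range(iters):
        nxt = {x: (1 - c) * v[x] for x in nodes}
        for (p, q), w in weight.items():
            if out[p] > 0:
                nxt[q] += c * r[p] * w / out[p]
        r = nxt
    return r

edges = {("a", "b"): 1.0, ("b", "a"): 1.0, ("a", "c"): 1.0}
scores = pagerank(["a", "b", "c"], edges)                        # Standard
personalized = pagerank(["a", "b", "c"], edges,
                        v={"a": 1.0, "b": 0.0, "c": 0.0})        # Personalized
```

Passing a non-uniform v biases the scores toward the teleport nodes, which is exactly how CSA later emphasizes the starting entities of a document.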

Context Semantic Analysis
In this section we introduce our novel technique for estimating inter-document similarity, called Context Semantic Analysis (CSA), which is based on leveraging the information contained in a generic RDF knowledge base. Given a corpus C of documents and an RDF knowledge graph KB, CSA is composed of the following three steps:

1. Contextual Graph Extraction: the Contextual Graph CG(d), containing the contextual information of a document d, is extracted from the KB.

2. Semantic Context Vector Generation: the Semantic Context Vector SCV(d), representing the context of the document d, is generated by analyzing its CG(d).

3. Context Similarity Evaluation: the Context Similarity is evaluated by comparing the context vectors of documents belonging to the corpus C.

Contextual Graph Extraction
Given a document d and a knowledge graph KB, the goal of this first step is to extract a subgraph of KB containing all the information about d. Our method relies only on the extensional knowledge of a knowledge base, i.e., on its A-Box. More precisely, given a knowledge base KB, we consider the subgraph KB_A whose triples are in the A-Box of the KB. We also exclude the triples containing literals, so all the vertices V_A belong to (U ∪ B), i.e., are URIs or blank nodes, and every edge in E_A corresponds to an object property. We made this choice because our previous works showed that the T-Box of several knowledge bases belonging to the LOD cloud is incomplete and sometimes even absent; moreover, information about the structure of a knowledge base can be inferred from its A-Box [27,28]. For example, in Figure 1 we have only 3 triples that belong to KB_A: the ones containing the dbo:genre property.
Given the subgraph KB_A, the extraction of the Contextual Graph CG(d) for a document d is a three-step process: 1. Starting Entities Identification; 2. Contextual Graph Construction; 3. Contextual Graph Weighting.
Such steps are described below.

Starting Entities Identification: the entities of KB_A explicitly mentioned in the document d are identified. Such a set of entities is called the starting entities of d, denoted by SE(d). The problem of finding the set SE(d) is an instance of the well-known Named Entity Recognition problem [29]. Its solution is out of the scope of this work; thus, we empirically evaluated some of the already implemented techniques and, on the basis of the obtained results, we chose DBpedia Spotlight [30] and TextRazor to identify starting entities w.r.t. DBpedia and Wikidata, respectively.

Contextual Graph Construction:
the Contextual Graph of the document d is defined as the subgraph of KB_A composed of all the triples lying on a path of length at most l connecting at least two starting entities in SE(d). More precisely, given a document d and a length l > 0, we define:

CG_l(d) = { <s, p, o> ∈ KB_A | ∃ s_1, s_2 ∈ SE(d), s_1 ≠ s_2 : <s, p, o> ∈ Path(s_1, s_2) ∧ |Path(s_1, s_2)| ≤ l }

where Path(s_1, s_2) is a path on KB_A from s_1 to s_2.
For example, let us consider the two sentences used in the introduction (each sentence is represented as a document; see Figure 2). In Information Retrieval, a keyword query is usually composed of a few words, so in this context it is common for a generic query q to have only a single starting entity (i.e., |SE(q)| = 1). A user, in order to retrieve the documents d_1 or d_2, could use keyword queries like: q_1: Roger Daltrey; q_2: Mick Jagger.
In Figure 3, a portion of the Contextual Graphs extracted starting from these two queries is shown. The contextual graph CG_3(q_1) contains, besides dbr:Roger_Daltrey, the entities dbr:The_Who and dbr:Rock_music, which belong to the contextual graph CG_3(d_2) as well. Then, the query q_1 can retrieve both documents d_1 and d_2. Similar considerations hold for the contextual graph CG_3(q_2) of the query q_2.
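The construction step can be sketched as a bounded path search between pairs of starting entities. The sketch below treats edges as undirected for path finding, which is an assumption of this sketch rather than the paper's definition, and uses a toy knowledge base:

```python
from itertools import combinations

def contextual_graph(triples, starting, l):
    """Keep every triple lying on a path of length <= l between two starting
    entities. Paths are searched ignoring edge direction (sketch assumption)."""
    adj = {}
    for s, p, o in triples:
        adj.setdefault(s, []).append((o, (s, p, o)))
        adj.setdefault(o, []).append((s, (s, p, o)))
    kept = set()

    def dfs(node, target, path_triples, visited):
        if node == target and path_triples:
            kept.update(path_triples)   # every triple on this path is kept
            return
        if len(path_triples) == l:      # depth bound: paths of length <= l
            return
        for nxt, t in adj.get(node, []):
            if nxt not in visited:
                dfs(nxt, target, path_triples + [t], visited | {nxt})

    for a, b in combinations(starting, 2):
        dfs(a, b, [], {a})
    return kept

kb = {("A", "genre", "Rock"), ("B", "genre", "Rock"), ("C", "city", "London")}
cg = contextual_graph(kb, {"A", "B"}, 2)
# A -genre-> Rock <-genre- B is a length-2 path, so both genre triples are kept,
# while the triple about C lies on no path between starting entities.
```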

Contextual Graph weighting:
In the literature, several graph weighting methods have been proposed to capture the degree of associativity between concepts in the graph, i.e., the degree of relevance of a property for the entities it connects [10,31]. The most common way of weighting a property p_i is to compute its Information Content (IC), IC(X = p_i) = −log(P(p_i)), where P(p_i) is the probability that a random variable X exhibits the outcome p_i. This metric makes the hypothesis that specificity is a good proxy for relevance; in our example, an edge labeled with rdf:type will accordingly get an IC which is comparably lower than, say, one labeled with dbo:genre. The metric IC(p_i) measures the specificity of the property p_i, regardless of the entities it actually connects; to take into account that the same property can connect more or less specific entities, the authors in [10] considered IC(obj_i|p_i), computed in a similar way to IC(p_i), where P(obj_i|p_i) is the conditional probability that a node obj_i appears as the object of the property p_i; they then proposed the Joint Information Content weighting function: w_jointIC = IC(obj_i|p_i) + IC(p_i). In our example, with this metric, the rdf:type edge leading to dbo:MusicGenre receives a much higher weight than the one pointing to the far more generic dbo:City. The drawback of this function is that it penalizes infrequent objects that occur with infrequent properties; for example, dbo:Punk_Rock is overall very infrequent, but it gets a high probability when it occurs conditional on dbo:genre. The authors in [10] propose to mitigate this problem by computing the joint information content while making an independence assumption between the predicate and the object; the resulting weights are then computed as the sum of the Information Content of the predicate and the object, so obtaining the Combined Information Content w_combIC = IC(obj_i) + IC(p_i).
The metrics presented so far take into account only the extensional knowledge of a KB, i.e., only the triples of the A-Box; we introduce a new weighting function based on the fact that the importance of a property between two entities also depends on the classes to which such entities belong (each entity in an RDF graph is an instance of at least one class). For example, in Figure 1, most people would agree that, for subjects which are instances of dbo:Band, the importance of dbo:genre increases when the object is an instance of dbo:MusicGenre. In fact, 94% of the dbo:Band instances are subjects of a dbo:genre property that has as object, in 91% of cases, an instance of dbo:MusicGenre, and only 0.002% of the time an instance of dbo:City. Considering the triple <s_i, p_i, o_i>, we measure the correlation between a property p_i, the class of the subject s_i and the class of the object o_i by using the notion of Total Correlation [32], a method for weighting multi-way co-occurrences according to their importance:

w_TotCor = log( P(S_i, p_i, O_i) / (P(S_i) · P(p_i) · P(O_i)) )

where S_i and O_i are the classes associated with the entities s_i and o_i, respectively.
To summarize, for Contextual Graphs we will consider three edge weight functions: Joint Information Content (W_Joint), Combined Information Content (W_Comb), and Total Correlation (W_TotCor).
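Under the assumption that all probabilities are estimated as relative frequencies over the A-Box triples (and that each entity has a single class), the three weighting functions could be computed as follows. The w_TotCor expression here follows the standard total-correlation definition, which is our reading of the paper's formula:

```python
import math
from collections import Counter

def edge_weights(triples, cls):
    """Illustrative computation of w_jointIC, w_combIC and w_TotCor from raw
    A-Box counts; cls maps each entity to its (single) class - a sketch
    assumption, since RDF entities may have several classes."""
    n = len(triples)
    p_cnt = Counter(p for _, p, _ in triples)
    o_cnt = Counter(o for _, _, o in triples)
    po_cnt = Counter((p, o) for _, p, o in triples)
    spo_cls = Counter((cls[s], p, cls[o]) for s, p, o in triples)
    s_cls = Counter(cls[s] for s, _, _ in triples)
    o_cls = Counter(cls[o] for _, _, o in triples)

    def ic(prob):
        return -math.log(prob)

    def w(s, p, o):
        ic_p = ic(p_cnt[p] / n)
        joint = ic(po_cnt[(p, o)] / p_cnt[p]) + ic_p      # w_jointIC
        comb = ic(o_cnt[o] / n) + ic_p                    # w_combIC
        totcor = math.log(                                # w_TotCor (sketch)
            (spo_cls[(cls[s], p, cls[o])] / n)
            / ((s_cls[cls[s]] / n) * (p_cnt[p] / n) * (o_cls[cls[o]] / n)))
        return joint, comb, totcor

    return w

triples = [("b1", "genre", "rock"), ("b2", "genre", "rock"), ("b1", "city", "london")]
cls = {"b1": "Band", "b2": "Band", "rock": "Genre", "london": "City"}
w = edge_weights(triples, cls)
joint, comb, totcor = w("b1", "genre", "rock")
```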

Semantic Context Vectors Generation
At this point we have all the ingredients necessary to define the notion of Semantic Context Vector, a vector representation of documents based on Contextual Graphs. Given a corpus of documents C = {d_1, ..., d_n} and an RDF KB, for each document d ∈ C we build its contextual graph CG_l(d); then we consider the set E = {e_1, ..., e_m} of entities occurring in all the contextual graphs. Similarly to the term-document matrix (see Section 3.1), we consider an entity-document matrix T, an m×n matrix where the cell (i, j) contains the weight s(e_i, d_j) of the entity e_i ∈ E in the document d_j ∈ C. A document d_j is thus represented by the j-th column of such a matrix, called the Semantic Context Vector of d_j and denoted by SCV(d_j):

SCV(d_j) = [s(e_1, d_j), ..., s(e_m, d_j)]

The weight s(e_i, d_j) has to account for the importance of the entity e_i within CG_l(d_j) and, thus, it is defined by considering an edge weight function and a PageRank method.
As edge weight functions for CG_l(d), we consider W_Comb, W_Joint and W_TotCor (defined in the previous section) to set up the transition probability matrix M as a k×k matrix, where k is the number of nodes of CG(d) and

M_pq = w(p, q) / Σ_{z=1}^{k} w(p, z)

where w(p, q) returns the weight if an edge from p to q exists, and 0 otherwise. Moreover, we denote by W_noweight the case when edge weights are not used and the transition probability matrix M is given by M_pq = 1/d_p if there exists an edge from p to q and 0 otherwise (d_p being the outdegree of vertex p).
The PageRank methods we consider are the ones reviewed in Section 3.3:

1. Standard PageRank: in this case (denoted by r) there is no personalization vector, i.e., a uniform vector is considered;

2. Personalized PageRank: in this case (denoted by pr) the personalization vector v = (v_1, ..., v_k) is set up to give an equal probability to the starting entities.

A CSA configuration is thus determined by the following parameters:

1. KB: the knowledge base; we used KB = DBpedia and KB = Wikidata in our tests.

2. CG-L: the length for the Contextual Graph CG_l(d); we tested our method with CG-L = 2 and CG-L = 3.

3. WF: the edge weight function for CG_l(d); we consider W_Comb, W_Joint, W_TotCor and W_noweight.

4. PageRankConfiguration: the damping factor and personalization vector used. With r@df and pr@df we denote Standard and Personalized PageRank, respectively, with a damping factor equal to df.
As an example, for the documents d_1 and d_2 of Figure 2, part of their SCVs is shown in Table 1; the knowledge base is DBpedia and CG-L is equal to 3; both PageRank and Personalized PageRank are considered, with a damping factor equal to 0.75 (i.e., r@75 and pr@75).

We can observe that PageRank tends to spread the weight across all the nodes of the context graph, while with the Personalized PageRank all the weight is concentrated in the neighborhood of the starting entities.
Table 2 shows the different configurations used.


Context Similarity Evaluation
In this last step, the Context Similarity between two documents is evaluated by comparing their context vectors. More precisely, the CSA Similarity, denoted by sim_CSA, between two documents d_1 and d_2 is computed as the cosine similarity between their Semantic Context Vectors:

sim_CSA(d_1, d_2) = (v · s) / (||v|| ||s||)

where v = SCV(d_1) and s = SCV(d_2).
As an example, by considering the Semantic Context Vectors shown in Table 1, the sim_CSA between the two documents d_1 and d_2 of Figure 2 is equal to 0.78 using r@75 vectors and 0.61 using pr@75. In the next section we will evaluate which CSA configuration is more effective in detecting similarities between documents.
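Since Semantic Context Vectors are typically sparse (each document's contextual graph covers only a few entities of E), the cosine comparison is conveniently done on dictionary representations. The entity weights below are made up for illustration:

```python
import math

def csa_similarity(scv_a, scv_b):
    """Cosine similarity between two Semantic Context Vectors represented as
    sparse dicts mapping entity -> PageRank weight."""
    dot = sum(w * scv_b.get(e, 0.0) for e, w in scv_a.items())
    na = math.sqrt(sum(w * w for w in scv_a.values()))
    nb = math.sqrt(sum(w * w for w in scv_b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative weights, not the actual values of Table 1
d1 = {"dbr:The_Who": 0.4, "dbr:Rock_music": 0.3, "dbr:London": 0.3}
d2 = {"dbr:Rock_music": 0.5, "dbr:Mick_Jagger": 0.5}
print(round(csa_similarity(d1, d2), 2))  # -> 0.36
```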

Linear combination of CSA with text similarity measures
The CSA similarity, sim_CSA, is based only on information extracted from a knowledge base; to include in the final similarity measure (sim_f) also the textual information, we consider a linear combination of the CSA similarity with (standard) textual similarity measures sim_TXT (such as LSA [33] and ESA [34]) between two documents:

sim_f(d_1, d_2) = α · sim_CSA(d_1, d_2) + (1 − α) · sim_TXT(d_1, d_2)

where α is the weight parameter used for combining the two measures.

Evaluation
In this section we evaluate CSA: first, we assess CSA's efficacy by considering the correlation with human judges; second, we evaluate how CSA performs in a real-world application, employing it in an Information Retrieval framework; third, we analyze CSA's scalability in a clustering task on a large dataset.
All experiments have been performed on a server running Ubuntu 14.04, with 80 GB of RAM and an Intel Xeon E5-2670 v2 @ 2.50 GHz CPU. CSA has been implemented in Python 2.7 and, for generating the contextual graphs, we imported the DBpedia graph into Neo4j.

Correlation with Human Judges
This experiment compares, on a benchmark dataset [33], the results obtained with CSA against those produced by human judgment.

Experimental Setup
The most common and effective way of evaluating inter-document similarity techniques is to assess how well the produced similarity measure emulates human judgment. To this end, we use the LP50 document dataset [33], which contains 50 documents, selected from the Australian Broadcasting Corporation's news mail service, evaluated by 83 students of the University of Adelaide. Each possible pair of documents (1,225 pairs in total) has 8-12 human judgments. These judgments have been averaged for each document pair, yielding only 67 distinct values for 1,225 similarity scores. For this reason, Gabrilovich et al. [18] and Schuhmacher et al. [10] suggest employing the Pearson linear correlation coefficient (r) between the computed similarities and the ones assigned by human judges. We follow this suggestion, to compare our results with those presented in [18] and [10]; yet, we also consider the Kendall (τ) correlation coefficient, which is typically employed in the Information Retrieval context to measure ordinal associations. As shown in the following, the outcome of our analysis shows that these two measures lead to the same conclusions for this experiment.
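For reference, both correlation coefficients can be computed in a few lines. Kendall's tau is shown here in its simple tau-a form, which ignores tie corrections (an assumption of this sketch; tie-aware variants exist):

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson's linear correlation coefficient r between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs."""
    pairs = list(combinations(range(len(x)), 2))
    s = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    return s / len(pairs)
```

Pearson rewards linear agreement of the raw scores, while Kendall only looks at the ordering of pairs, which is why the paper reports both.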

Results and Discussion
In Table 3, CSA is compared with other techniques from the literature. Bag-of-Words [33] indicates the simple bag-of-words document representation, coupled with term-frequency weighting and cosine similarity. We also considered Okapi BM25 as weighting, coupled with the dot product. Un-Backgrounded LSA means that LSA [33] has been applied considering only the LP50 dataset, differently from Backgrounded LSA, which employs additional documents to perform a better dimensionality reduction (see [33] for the details). The original performance of ESA reported in [18] on the LP50 dataset has been criticized in [34] for being based on a cut-off value used to prune the vectors in order to produce better results on the LP50 dataset and, consequently, overfit the approach to this particular dataset. In fact, a much lower performance has been obtained in [34] and [20] by re-implementing ESA without adapting the cut-off value. We employ this implementation in our experiments.
The main result emerging from this comparison is that our CSA method alone yields results comparable to state-of-the-art techniques (LSA and ESA), and enhances them when used in conjunction; for example, CSA + ESA obtains a correlation r = 0.72 (τ = 0.42), so it attains a 16% improvement. The Graph Edit Distance (GED) based approach of [10], which is the most similar to ours, produces almost identical results, but with GED the similarity measures are obtained in a much more computationally expensive way than in CSA (a deeper comparison is in the next section). Considering other knowledge-enriched techniques built on top of a specific knowledge base (Wikipedia), CSA combined with ESA slightly outperforms SSA, but it does not reach the performance of WikiWalk + ESA.

Table 3 (correlation with human judges):
  Bag-of-Words [33]           r = 0.41, τ = 0.13
  BM25                        r = 0.50, τ = 0.17
  Un-Backgrounded LSA [33]    r = 0.52, τ = 0.18
  Backgrounded LSA [33]       r = 0.59, τ = 0.28
  ESA reimplemented [34]      r = 0.59, τ = 0.30
  GED-based (DBpedia) [10]    r = 0.63, τ = 0.37
  SSA [20]                    r = 0.68, τ = 0.40
  WikiWalk + ESA [19]         r = 0.77, τ = 0.47

As shown in Table 3, the relative performances of the methods are the same whether considering the Pearson r or the Kendall τ. In fact, we observe that these two measures show the same trends in all our experiments; hence, hereafter we present only the results for Pearson's r for the sake of presentation.

Complete results are shown in Figure 4, which shows the Pearson coefficient r between the human gold standard and CSA when varying the parameters that define the Semantic Context Vectors, with the exception of CG-L, which has been kept constant and equal to 3. One of the main results is that, for all the configurations, the Personalized PageRank (pr) outperforms the Standard PageRank (r); another interesting result is that, in almost all the configurations, the novel edge weighting function W_TotCor we proposed slightly outperforms the other ones, W_Joint and W_Comb. We can also appreciate different behaviors w.r.t. the KB: DBpedia is more stable, while Wikidata exhibits a strong performance decay when increasing the damping factor with the Personalized PageRank. In particular, the CSA configuration with DBpedia, W_TotCor and Personalized PageRank with a damping factor ranging from 0.30 to 0.85 is quite stable: it varies by only 2.5% from the minimum (0.605, pr@30) to the maximum (0.62, pr@65); such a CSA configuration is thus almost parameter free. Table 4 shows the Pearson coefficient r for the best CSA configurations we found, varying all the parameters.

Table 4 (best CSA configurations, Pearson r):
  DBpedia,  CG-L = 2:  pr@40 0.57 | pr@40 0.59 | pr@60 0.58 | pr@30 0.59 | best 0.59
  DBpedia,  CG-L = 3:  pr@60 0.59 | pr@65 0.61 | pr@65 0.61 | pr@65 0.62 | best 0.62
  DBpedia,  Jaccard on starting entities: 0.49
  Wikidata, CG-L = 2:  pr@40 0.54 | pr@40 0.56 | pr@40 0.55 | pr@40 0.57 | best 0.57
  Wikidata, CG-L = 3:  pr@40 0.59 | pr@40 0.60 | pr@40 0.60 | pr@40 0.61 | best 0.61
  Wikidata, Jaccard on starting entities: 0.48
  Cosine (bag of words): 0.41
In order to evaluate CSA we produced some baselines:
• Jaccard on starting entities: we used the starting entities collected for each document as a descriptor of the document, and we used the Jaccard similarity for estimating the similarity between documents, namely sim(d1, d2) = |SE(d1) ∩ SE(d2)| / |SE(d1) ∪ SE(d2)|.
• Cosine (bag of words): we model the document corpus in a standard bag-of-words Vector Space Model and we compute the cosine similarity 10.
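The two baselines above can be sketched as follows. This is a minimal illustration assuming starting entities are given as sets of identifiers and documents as raw strings; the function names are ours, introduced only for this sketch.

```python
from collections import Counter
from math import sqrt

def jaccard_on_entities(se1, se2):
    """Jaccard similarity between the starting-entity sets of two documents:
    |SE(d1) ∩ SE(d2)| / |SE(d1) ∪ SE(d2)|."""
    s1, s2 = set(se1), set(se2)
    if not s1 and not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

def cosine_bow(text1, text2):
    """Cosine similarity between raw bag-of-words term-frequency vectors."""
    v1, v2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    dot = sum(c * v2[t] for t, c in v1.items())
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Note that the Jaccard baseline uses only the spotted entities, with no knowledge-base expansion, which is exactly the information CSA starts from.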
CSA is able to outperform both baselines; we obtained a relative improvement of 21% (with both DBpedia and Wikidata) w.r.t. the Jaccard baseline 11; this improvement is particularly significant because it is due solely to the information extracted from the knowledge bases by CSA 12. W.r.t. the Cosine baseline the margins are greater (34% DBpedia and 33% Wikidata); this result is not too surprising because this baseline utilizes only the words contained in the text for estimating the similarity. Table 5 shows the performance of the linear combination of CSA with the standard text similarity measures un-backgrounded LSA [33] 13 and ESA reimplemented [34]. The best performance is obtained with α = 0.5, and we can observe that the best configurations for CSA (i.e., pr@65 for DBpedia and pr@40 for Wikidata) are also the best configurations of CSA combined with LSA and ESA.
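The linear combination used in Table 5 can be sketched as a convex combination of the two similarity scores; the function name is ours, and α = 0.5 is the best-performing value reported above.

```python
def combined_similarity(sim_csa, sim_text, alpha=0.5):
    """Convex combination of the CSA similarity with a textual similarity
    (e.g. LSA or ESA); alpha balances the two contributions."""
    return alpha * sim_csa + (1.0 - alpha) * sim_text
```

With alpha = 1.0 the combination degenerates to pure CSA, and with alpha = 0.0 to the textual measure alone.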

Information Retrieval Application
The goal of this experiment is to evaluate CSA in a real-world Information Retrieval (IR) application. In particular, we integrated CSA in an IR framework (KE4IR [13]) and measured, on a well-known benchmark dataset, the improvement yielded by our technique.

Experimental Setup
Given a text query, the goal of IR is to find the relevant documents in a text collection, ranking them according to their degree of relevance for the query. The relevance of documents is typically measured by means of a similarity measure in the Vector Space Model; hence, employing CSA for this task is straightforward. We consider KE4IR [13], based on the popular IR framework Apache Lucene 14. To the best of our knowledge, KE4IR is the current state of the art in IR for retrieving documents with semantic enrichment 15 (i.e., documents enriched with annotations derived from external knowledge bases, such as DBpedia).
In KE4IR both the documents and the queries are represented as term vectors whose elements are the weights of textual and semantic content extracted from DBpedia 16. The terms derived directly from the text represent the textual-layer. The authors in [13] enriched the textual information with other layers, which are: the uri-layer, the type-layer, the time-layer, and the frame-layer.
• The uri-layer contains the entities of DBpedia related to the document/query text (e.g., dbr:The Rolling Stones), weighted according to the tf-idf of the entities in the documents. KE4IR employs PIKES 17 to annotate and enrich documents/queries.

• The type-layer is composed of the classes of the entities identified (e.g., dbo:Band).
• The time-layer contains the temporal values expressed in the text and matched against DBpedia (e.g., Year, Month, etc.).
• The frame-layer is composed of compact structures capturing relations among entities.
In order to compute the rank for each document d_i given a query q_j, a similarity score for each layer is computed by using a measure sim_dot(d, q) derived from the cosine similarity:

sim_dot(d, q) = Σ_t d_t · q_t

where d_t and q_t are the weights of term t in the document and query vectors of that layer. Then, the similarity scores obtained for each of the layers are linearly combined to produce the final rank. Notice that by dividing sim_dot(d, q) by the product of the norms of the two vectors d and q, we obtain the cosine similarity of the two vectors. Omitting these normalization components is a common practice in the context of IR: this avoids biased results due to the typically small size of the query terms [13].
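The per-layer scoring and its linear combination can be sketched as follows. This is an illustration only, assuming each layer is represented as a dict from terms to weights; `sim_dot`, `rank_score`, and the layer names are our notation, not the KE4IR API.

```python
def sim_dot(d_vec, q_vec):
    """Unnormalized dot product between a document and a query term vector
    (dicts mapping terms to weights); omitting the norm factors avoids the
    bias introduced by very short queries."""
    return sum(w * q_vec.get(t, 0.0) for t, w in d_vec.items())

def rank_score(doc_layers, query_layers, layer_weights):
    """Linear combination of the per-layer sim_dot scores."""
    return sum(layer_weights[name] * sim_dot(vec, query_layers.get(name, {}))
               for name, vec in doc_layers.items())
```

For example, a document with a textual-layer and a uri-layer is scored against a query by weighting each layer's dot product and summing the results.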
We extended KE4IR to support CSA as an independent layer, employed as a substitute of the uri-layer in our experiments. (Notice that both layers are composed of entities extracted from a knowledge base, used to represent the content of a document.) To compare standard KE4IR (i.e., KE4IR with the original uri-layer) and KE4IR with the CSA-layer, we employ the same dataset, experimental setup, and metrics of Corcoglioniti et al. [13] 18.
For this evaluation, we employed the two datasets described in the following: 1. yovisto, which consists of a set of 331 documents from the yovisto blog 19 on history in science, tech, and art. The articles have an average length of 570 words, containing 3 to 255 annotations (83 on average). Moreover, for this dataset the gold standard for the annotation is known, since documents have been manually annotated with DBpedia entities. Hence, employing this dataset, the performance of CSA can be measured while minimizing the error introduced by automatic named entity recognition tools, such as DBpedia Spotlight [30] (which is employed only to spot entities in the queries, as described later).
2. trec2001 [35], which is composed of ∼1.5 × 10^6 documents extracted from the web 20. No gold standard is provided for the annotation of this dataset; thus we employed DBpedia Spotlight to annotate these documents as a pre-processing step.

The generation of the contextual graphs (CG_3(d) for each document d in the datasets) for the yovisto and trec2001 datasets took ∼1 minute and ∼6 days, respectively. This time is required only once per dataset (as pre-processing) and could be significantly reduced by employing data-intensive scalable computing systems, such as MapReduce and Apache Spark. Moreover, note that the contextual graph computation can be incremental: when a new document is added to a collection, only its contextual graph has to be computed.
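The incremental nature of this pre-processing can be sketched as follows. This is a toy illustration: the `build` callback stands in for the expensive CG_3 extraction from the knowledge base, and the function name is ours.

```python
def update_contextual_graphs(cache, corpus, build):
    """Incrementally maintain per-document contextual graphs: only documents
    not yet present in the cache are (expensively) processed, so adding a
    document to the collection costs one graph extraction, not a full rebuild."""
    for doc_id, text in corpus.items():
        if doc_id not in cache:
            cache[doc_id] = build(text)
    return cache
```

Since each document's contextual graph is independent of the others, the same loop also parallelizes trivially across documents (e.g., as a map stage in Spark).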
For the queries, yovisto and trec2001 provide 35 and 50 queries respectively, for which the list of relevance judgments is available. We limit our evaluation to the subset of 25 queries on yovisto and 44 queries on trec2001 for which DBpedia Spotlight can spot entities. Notice that this limitation does not depend on CSA inherently, but rather on the coverage of entities contained in the knowledge base (DBpedia). Moreover, on yovisto, none of the queries contains more than one spotted entity, making this the ideal scenario for testing how CSA behaves with limited context, i.e., when |SE| = 1. (On trec2001 only four queries contain more than one spotted entity.)
For both documents and queries we extracted their CG_3(d/q) using DBpedia as knowledge base and we computed their SCV(d/q) for several configurations; then, we stored the SCVs to be used in the KE4IR framework. The metrics employed for measuring the performance are the Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR), typically used to evaluate IR systems [36]:
• NDCG assesses the overall ranking quality. It takes into account both the relevance of a retrieved document and its position (notice that the relevance of a document is known from the available judgments employed as ground truth). It assumes values in the interval [0.0, 1.0], where 1.0 corresponds to the maximum value, obtained when all the relevant documents are retrieved and their order matches the best ordering possible in terms of relevance.
• MAP assesses the overall precision quality.
It is obtained by averaging the precision measured after each relevant document has been retrieved 21. It assumes values in the interval [0.0, 1.0], where 1.0 is the best value. In contrast to NDCG, it does not take into account false negatives; for instance, if only one document is retrieved and it is relevant, the precision is 1, even when many more relevant documents exist.
• MRR assesses the ranking quality of the first correct result retrieved.
It is computed by averaging the reciprocal ranks over all the queries, where the reciprocal rank (RR) is the reciprocal of the highest ranking position of a correct answer for a given query. It assumes values in the interval [0.0, 1.0], where 1.0 is the best value. 21 The precision is defined as the fraction of retrieved documents that are relevant. For IR approaches, it is common to assume that users only look at the "first page" of the results; hence, we record NDCG and MAP both for the complete result set and for the top ten results (denoted by NDCG@10 and MAP@10).
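The three metrics above can be sketched as follows. This is a simplified illustration of the standard definitions, not the exact evaluation code used in the experiments; function names are ours.

```python
from math import log2

def average_precision(ranked, relevant):
    """Precision averaged over the ranks at which relevant documents appear
    (the per-query quantity that MAP averages over all queries)."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document (0.0 if none is retrieved);
    MRR is this value averaged over all queries."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg(ranked, gains, k=None):
    """Normalized Discounted Cumulative Gain, optionally truncated at k
    (e.g., NDCG@10); `gains` maps each relevant document to its judged gain."""
    if k is not None:
        ranked = ranked[:k]
    dcg = sum(gains.get(d, 0.0) / log2(i + 1) for i, d in enumerate(ranked, 1))
    ideal = sorted(gains.values(), reverse=True)
    if k is not None:
        ideal = ideal[:k]
    idcg = sum(g / log2(i + 1) for i, g in enumerate(ideal, 1))
    return dcg / idcg if idcg else 0.0
```

A ranking that places all relevant documents first, in decreasing gain order, attains NDCG = 1.0; any swap with a less relevant document lowers the score.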

Results and Discussion
The results of our experiment are summarized in Table 6. As far as the exploitation of entity information is concerned, we observe that KE4IR with the CSA-layer (KE4IR w/ CSA) outperforms KE4IR (KE4IR w/ uri) on both the yovisto and trec2001 datasets (Table 6a-b). When considering the contributions of both the textual and the entity information, the advantage of CSA is less evident but still relevant: on yovisto (Table 6a), employing CSA reaches the highest performance, with the only exception of the MRR metric, while on trec2001 (Table 6b), CSA wins on all the metrics. This confirms the results obtained by Corcoglioniti et al. [13]: the textual-layer represents the most important contribution to the final results. In fact, notice that the textual-layer alone achieves higher results than the uri/CSA-layer alone. The improvements of KE4IR with the CSA-layer (KE4IR w/ CSA) over traditional KE4IR (KE4IR w/ uri) are statistically significant according to the paired t-test with a threshold p-value of 0.05.
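The significance test mentioned above can be sketched as follows: a minimal illustration of the paired t statistic computed over per-query metric scores (in practice a library routine such as scipy's `ttest_rel` would also report the p-value to compare against the 0.05 threshold).

```python
from statistics import mean, stdev
from math import sqrt

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired t-test on per-query metric scores of two
    systems; each pair comes from the same query, so the test is on the
    per-query differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)
    # If every query improves by exactly the same amount, the standard
    # deviation of the differences is zero and the statistic diverges.
    return mean(diffs) / (sd / sqrt(n)) if sd else float("inf")
```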
As far as the query-time performance is concerned, we did not record any significant difference in execution time between KE4IR w/ CSA and KE4IR w/ uri.
Figure 5 shows the performance metrics described above (NDCG and NDCG@10, MAP and MAP@10, MRR) on yovisto, varying two of the parameters that define the Semantic Context Vectors, i.e., the Contextual Graph weighting function and the damping factor of the Personalized PageRank (on trec2001 we recorded analogous trends). The length of the Contextual Graph has been kept constant and equal to 3. As in the previous evaluation (Section 5.1), the Personalized PageRank obtains stable results between pr@30 and pr@70 and outperforms the standard PageRank (blue dashed line) on all metrics. Moreover, the novel edge weighting function W_TotCor we proposed slightly outperforms the other ones, W_Joint and W_Comb.

Hierarchical Document Clustering
Here we evaluate the scalability of CSA by adapting our approach to perform hierarchical clustering on a popular benchmark dataset composed of a larger number of documents.

Experimental Setup
We used a dataset (re0) of Reuters 21578 22, a collection of 1504 manually classified documents, which is commonly used for evaluating hierarchical clustering techniques. To build the cluster hierarchy we used a hierarchical clustering algorithm based on a similarity measure and group-average-link [36]. In this test we used only DBpedia, since it was previously shown to produce more stable results. Performance is measured in terms of goodness of fit with the existing categories by using the F measure. As defined in [37], for an entire hierarchy of clusters the F measure of any class is the maximum value it attains at any node in the tree, and an overall value for the F measure is computed by taking the weighted average of all values for the F measure, as given by the following:

F = Σ_i (n_i / n) · max_j F(i, j)

where the max is taken over all clusters at all levels, n_i is the number of documents in class i, n is the total number of documents, and F(i, j) is the F measure for class i and cluster j.
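The overall F measure above can be sketched as follows, assuming the F(i, j) values have already been computed for each class over all clusters at all levels of the hierarchy; the function name and argument layout are ours.

```python
def overall_f_measure(f_values, class_sizes, n):
    """Weighted average over classes of the best F value each class attains
    at any cluster in the hierarchy: sum_i (n_i / n) * max_j F(i, j).
    `f_values` maps class i to the list of its F(i, j) over all clusters j;
    `class_sizes` maps class i to n_i; n is the total number of documents."""
    return sum(class_sizes[i] / n * max(fs) for i, fs in f_values.items())
```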

Results and Discussion
First of all, for each document d we extracted its CG_3(d) and we computed SCV(d) for several configurations; then, we stored both the CGs and the SCVs on a file system. The whole process took just 40 minutes. Table 7 shows a summary of the results; it includes the F measures and the average execution time obtained by running the clustering algorithm 5 times. The configuration of CSA used for obtaining these results is CG-L = 3, W_TotCor, and pr@65, which proves to be the best configuration also in this test. We produced three different baselines: Jaccard on starting entities, LSA [22], and GED-based (DBpedia) [10]. We considered only the GED system since it is the most similar to our approach.
As a first observation, CSA outperforms all the considered baselines in terms of F measure, and the linear combination with LSA brings a 10% improvement.
We were not able to successfully complete the test for GED due to its computational cost. Intuitively, to perform hierarchical clustering we have to compute the inter-document similarity between all pairs of the 1504 documents of the re0 corpus. While for CSA and LSA the cosine similarity is used, the GED similarity is based on a much more expensive graph edit distance algorithm.

Conclusion and Future Work
In this paper, we proposed Context Semantic Analysis (CSA), a novel knowledge-based technique for estimating inter-document similarity. The technique is based on a Semantic Context Vector, which can be extracted from a knowledge base, stored as metadata of a document, and employed to compute inter-document similarity. We showed the consistency of CSA with respect to human judges and how it outperforms standard (i.e., syntactic) inter-document similarity methods. Moreover, we obtained comparable results w.r.t. other approaches built on top of a specific knowledge base for performing semantic enrichment of the documents (i.e., ESA, WikiWalk and SSA). Our method can exploit any generic RDF KB. In order to evaluate CSA we employed two generic-domain knowledge bases, i.e., DBpedia and Wikidata; however, CSA is applicable to any generic RDF knowledge base. To the best of our knowledge, CSA is the first technique that showed its portability across two huge RDF knowledge bases. Moreover, we showed how CSA can be effectively applied in the Information Retrieval domain, even if user queries, typically composed of few words, contain a limited number of entities. We adapted CSA to be used in an existing IR framework and showed how it can improve the performance of this framework. Finally, we experimentally demonstrated its scalability and effectiveness by performing a hierarchical clustering task with a larger corpus of documents.
As future work, the proposed knowledge-based technique for inter-document similarity computation will be applied and tested in the context of keyword searching over relational structures [38,39,40]. The basic idea is to turn tuples of a relational database into documents (by considering joining and/or grouping of tuples) and then apply CSA for computing the similarity between a given document or keyword query and the documents representing the relational database. As another future work, we plan to test the scalability of CSA also within an IR framework. Furthermore, we plan to test CSA with some domain-specific knowledge bases, such as the RDF versions of AGROVOC 23 and Snomed CT, an agricultural and a medical knowledge base, respectively.

Figure 1 :
Figure 1: Example of an RDF KB, with the A-Box and the T-Box. The triples of an RDF knowledge base can usually be divided into A-Box and T-Box; while the A-Box contains instance data (i.e., extensional knowledge), the T-Box contains the formal definition of the terminology (classes and properties) used in the A-Box. As an example, Figure 1 shows an extract of DBpedia 3; in the DBpedia T-Box, the property dbo:genre is defined with rdfs:range dbo:MusicGenre, and the class dbo:Band is defined as a sub-class of both dbo:Organization and dbo:Group. In the DBpedia A-Box, the instance dbr:The Rolling Stones (an instance of the class dbo:Band) is connected by the property dbo:genre to the instance dbr:Rock music (an instance of the class dbo:MusicGenre). For our experiments we chose two generic-domain knowledge bases, DBpedia [1] and Wikidata [3], due to their large coverage and variety of relationships at the extensional level.

d1: "The Rolling Stones with the participation of Roger Daltrey opened the concerts' season in Trafalgar Square".
d2: "The bands headed by Mick Jagger with the leader of The Who played in London last week".
The related starting entities in DBpedia are the following:
SE(d1) = {The Rolling Stones, Roger Daltrey, Trafalgar Square}
SE(d2) = {Mick Jagger, The Who, London}
In this example, by using l = 2 we obtain CG_2(d1) with 5 nodes and CG_2(d2) with 12 nodes; by using l = 3 we obtain CG_3(d1) with 141 nodes and CG_3(d2) with 66 nodes. The most significant portion of information shared between CG_3(d1) and CG_3(d2) is shown in Figure 2; in CG_3(d2) there is a path of length 1 between London and Mick Jagger, while Mick Jagger and The Who are connected by means of two (different) paths, both of length 3.

Figure 2 :
Figure 2: Portion of DBpedia containing the most significant shared contextual information between the two sentences on the left

Figure 3 :
Figure 3: A portion of the Contextual Graph extracted from DBpedia for the two keyword queries Roger Daltrey and Mick Jagger. [...] and 0 otherwise. As damping factor we consider a range of values from 0.10 to 0.95, with a step of 0.05. To summarize, the Semantic Context Vector of a document d, SCV(d), is defined by the following four parameters: 1. KB: the RDF Knowledge Base used to build the contextual graph CG_l(d) of d;

Table 5 :
Best Pearson correlation obtained on the LP50 dataset by combining CSA (l = 3 and Total Correlation as weight function) with LSA and ESA

Table 1 :
Semantic Context Vectors of the two documents of Figure 2

Table 2 :
PageRank and Personalized PageRank configurations

Table 3 :
System comparison on the LP50 dataset

Table 4 :
Results on the LP50 dataset (Pearson r correlation coefficient).

Table 7 :
Results on the Reuters 21578 (re0) dataset (F-measure and execution time for building the cluster hierarchy)