Generalized Supervised Meta-blocking

Entity Resolution is a core data integration task that relies on Blocking to scale to large datasets. Schema-agnostic blocking achieves very high recall, requires no domain knowledge and applies to data of any structuredness and schema heterogeneity. This comes at the cost of many irrelevant candidate pairs (i.e., comparisons between non-matching entities).


INTRODUCTION
Entity Resolution (ER) is the task of identifying entities that describe the same real-world object among different datasets [4,10,22,32]. ER is a core data integration task with many applications that range from Data Cleaning in databases to Link Discovery in Semantic Web data [7,10]. Despite the bulk of works on ER, it remains a challenging task [4,14,22,24]. One of the main reasons is its quadratic time complexity: in the worst case, every entity has to be compared with all others, thus scaling poorly to large volumes of data. To tame its high complexity, Blocking is typically used [5,6,27,28]. Instead of considering all possible pairs of entities, it restricts ER to blocks of entities that have identical or similar signatures. Extensive experimental analyses have demonstrated that schema-agnostic signatures outperform schema-based ones, without requiring domain or schema knowledge [5,21]. As a result, parts of any attribute value in each entity can be used as signatures.
Example 1 (Schema-agnostic blocking). The profiles in Figure 1a contain three duplicate pairs, ⟨e1, e3⟩, ⟨e2, e4⟩ and ⟨e6, e7⟩, and are clustered using Token Blocking (a block is created for every token appearing in at least 2 profiles). The resulting blocks appear in Figure 1b. ER examines all pairs inside each block, detecting all duplicates.
On the downside, the resulting blocks involve high levels of redundancy: every entity is associated with multiple blocks, thus yielding numerous redundant and superfluous comparisons [2,31]. The former are pairs of entities that are repeated across different blocks, while the latter involve non-matching entities. For example, the pair ⟨e1, e3⟩ is redundant in b2, as it is already examined in b1, while the pair ⟨e2, e6⟩ ∈ b3 is superfluous, as the two entities are not duplicates. Both types of comparisons can be skipped, reducing the computational cost of ER without any impact on recall [20,28].
To this end, Meta-blocking [23] discards all redundant comparisons, while significantly reducing the portion of superfluous ones. It relies on two components to achieve this goal: 1) A weighting scheme, which is a function that receives as input a pair of entities along with their associated blocks and returns a score proportional to their matching likelihood. The score is based on the co-occurrence patterns of the entities in the original set of blocks: the more blocks they share and the more distinctive (i.e., infrequent) the corresponding signatures are, the more likely they are to match and the higher their score is.
2) A pruning algorithm, which receives as input all weighted pairs and retains the ones that are most likely to be matching.
Example 2 (Unsupervised Meta-blocking). Unsupervised Meta-blocking builds a blocking graph (Figure 2a) from the blocks in Figure 1b as follows: each entity profile is represented as a node; two nodes are connected by an edge if the corresponding profiles co-occur in at least one block; each edge is weighted according to a weighting scheme (in our example, the number of blocks shared by the adjacent profiles). Finally, the blocking graph is pruned according to a pruning algorithm (in our example, for each node, we discard the edges with a weight lower than the average of its edges). The pruned blocking graph appears in Figure 2b, with the dashed lines representing the superfluous comparisons. A new block is then created for every retained edge. Figure 2c presents the final blocks, which involve significantly fewer pairs without missing the matching ones. This is a schema-agnostic process, just like the original blocking method.
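As a rough illustration, the construction and pruning of such a blocking graph can be sketched in a few lines of Python. This is a simplified sketch, not the paper's implementation: the toy blocks and entity ids are invented, and the pruning rule keeps an edge whose weight reaches the average of either adjacent node.

```python
from collections import defaultdict
from itertools import combinations

def build_blocking_graph(blocks):
    """Weight every co-occurring pair of profiles by the number of blocks
    they share (the example's weighting scheme)."""
    weights = defaultdict(int)
    for block in blocks:
        for u, v in combinations(sorted(block), 2):
            weights[(u, v)] += 1
    return weights

def prune_graph(weights):
    """Discard an edge whose weight is below the average edge weight of
    both adjacent nodes (a reading of the example's per-node rule)."""
    adjacent = defaultdict(list)
    for (u, v), w in weights.items():
        adjacent[u].append(w)
        adjacent[v].append(w)
    avg = {n: sum(ws) / len(ws) for n, ws in adjacent.items()}
    return {e: w for e, w in weights.items()
            if w >= avg[e[0]] or w >= avg[e[1]]}

# Toy blocks (illustrative ids, not the profiles of Figure 1):
blocks = [{"e1", "e3"}, {"e1", "e3", "e4"}, {"e3", "e4"}]
graph = build_blocking_graph(blocks)
pruned = prune_graph(graph)
```

Here the edge between e1 and e4 co-occurs in a single block and falls below the average of both its endpoints, so it is discarded, while the two edges with weight 2 survive.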
Supervised Meta-blocking [25]. It models the restructuring of a set of blocks as a binary classification task. Its goal is to train a model that learns to classify every comparison as positive (i.e., likely to be matching) or negative (i.e., unlikely to be matching). Every pair is associated with a feature vector comprising the most distinctive weighting schemes used by learning-free meta-blocking.
Supervised Meta-blocking involves the overhead of generating a labelled dataset, but by representing each edge with multiple features, it is more accurate in discriminating between matching and non-matching pairs than Unsupervised Meta-blocking, which employs a single weight per edge. Indeed, Supervised Meta-blocking consistently yields better precision and recall than the unsupervised approach [25]. Yet, the binary classifier it employs acts as a learned, unique, global threshold for pruning the edges. Defining a local threshold for each node would allow a finer control over which edges to prune. This is the intuition behind Generalized Supervised Meta-blocking, as illustrated in the following example.
Example 4 (Generalized Supervised Meta-blocking). Our new approach builds a graph where every edge is associated with a feature vector (as Supervised Meta-blocking does in Figure 3a) and trains a probabilistic classifier, which assigns a weight (the matching probability) to each edge (Figure 4a). Then, several weight- and cardinality-based algorithms can be applied. For example, Supervised WNP prunes the graph as follows: for each node, all adjacent edges with a weight lower than 0.5 are discarded; of the remaining edges, only those with a weight greater than the average one are kept. Figure 4b shows the result of this step: two edges may be assigned the same weight by the probabilistic classifier, e.g., ⟨e1, e3⟩ and ⟨e4, e5⟩, but they may be kept (e.g., the matching pair ⟨e1, e3⟩) or discarded (e.g., the non-matching pair ⟨e4, e5⟩) depending on their context, i.e., the weights in their neighborhood. Note that ⟨e4, e5⟩ is not discarded by Supervised Meta-blocking in Figure 3b, which thus underperforms Generalized Supervised Meta-blocking in terms of precision (for the same recall).
Our Contributions. Our work is motivated by a real-world application that aims to deduplicate a legacy customer database. It contains ~7.5 million entries that correspond to electricity supplies and, thus, are associated with an address, a customer name, and other optional attributes (e.g., tax id) that are typically empty. To exploit all available information, the quadratic computational cost of ER is reduced through schema-agnostic blocking. Our goal is to minimize the set of candidate pairs, using Supervised Meta-blocking, while restricting human involvement in the generation of the labelled instances. To this end, we go beyond Supervised Meta-blocking in the following ways: • We generalize it from a binary classification task to a binary probabilistic classification process (Section 3).
• The resulting probabilities are used as comparison weights, on top of which we apply new pruning algorithms that are incompatible with the original approach [25] (Section 4).
• To further improve their performance, we use three new weighting schemes as features (Section 5).
• We perform an extensive experimental study that involves 9 real-world datasets. Its results demonstrate that the new pruning algorithms significantly outperform the existing ones. They also identify the top performing algorithms and feature vectors, showing that 50 labelled instances (25 per class) suffice for high performance.
• We also perform a scalability analysis over 5 synthetic datasets with up to 300,000 entities, proving that our approaches scale well both with respect to effectiveness and time efficiency.
An entity profile e_i is defined as a set of name-value pairs, i.e., e_i = {⟨n_j, v_j⟩}, where both the attribute names and the attribute values are textual. This simple model is flexible and generic enough to seamlessly accommodate a broad range of established data formats, from the structured records in relational databases to the semi-structured entity descriptions in RDF data [21]. In the worst case, ER has quadratic time complexity, O(|E1|·|E2|) and O(|E|²) when matching two entity collections or deduplicating a single one, resp., as every entity profile has to be compared with all possible matches. To reduce this high computational cost, Blocking restricts the search space to similar entities [5,22].
Meta-blocking operates on top of Blocking, refining an existing set of blocks B, a.k.a. block collection, as long as it is redundancy-positive. This means that every entity e_i participates in multiple blocks (i.e., |B_i| > 1, where B_i = {b ∈ B : e_i ∈ b} denotes the set of blocks containing e_i), and the more blocks two entities share, the more likely they are to be matching, because they share a larger portion of their content. Such blocks emanate from Token, Q-Grams and Suffix Arrays Blocking and their variants, among others [20,28].
The redundancy-positive block collections involve a large portion of redundant comparisons, as the same pairs of entities are repeated across different blocks. These can be easily removed by aggregating for every entity e_i ∈ E1 the set of all entities from E2 that share at least one block with it [26]. The union of these individual sets yields the distinct set of comparisons, which is called candidate pairs and is denoted by C. Every non-redundant comparison between e_i and e_j, c_{i,j} ∈ C, belongs to one of the following types: • Positive pair if e_i and e_j are matching: e_i ≡ e_j.
• Negative pair if e_i and e_j are not matching: e_i ≢ e_j. These definitions are independent of Matching: two matching (non-matching) entities form a positive (negative) pair as long as they share at least one block in B [5,21]. The sets of all positive and negative pairs in a block collection B are denoted by P_B and N_B, respectively. The goal of Meta-blocking is to transform a given block collection B into a new one B' with significantly higher precision and a negligible (if any) loss in recall. Supervised Meta-blocking models every pair c_{i,j} ∈ C as a feature vector [s_1, ..., s_n], where each s_k is a weighting scheme score proportional to the matching likelihood of c_{i,j}. The feature vectors for all pairs in C are fed into a binary classifier, which labels them as positive or negative, depending on whether their entities are highly likely to match or not. Its performance is assessed through: (i) the true positive TP(C) and true negative TN(C) pairs, which are correctly classified as positive and negative, resp., and (ii) the incorrectly classified false positive FP(C) and false negative FN(C) pairs. Supervised Meta-blocking discards all candidate pairs labelled as negative, i.e., TN(C) ∪ FN(C), retaining those belonging to TP(C) ∪ FP(C). A new block is created for every positive pair, yielding the new block collection B'. Thus, the effectiveness of Supervised Meta-blocking is assessed through the following measures, defined in [0, 1], with higher values indicating better performance: • Recall, a.k.a. Pairs Completeness, expresses the portion of existing duplicates that are retained: Re = |TP(C)| / (|TP(C)| + |FN(C)|). • Precision, a.k.a. Pairs Quality, is the portion of positive candidate pairs that are matching: Pr = |TP(C)| / (|TP(C)| + |FP(C)|). • F-Measure is the harmonic mean of the two: F1 = 2·Re·Pr / (Re + Pr).
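Under illustrative names, the derivation of the distinct candidate pairs C and the effectiveness measures above can be sketched as follows; the toy block collection and entity ids are invented for the example.

```python
def candidate_pairs(blocks):
    """Distinct comparisons C from a block collection over two sources;
    each block holds a set of entities from E1 and one from E2, and
    redundant (repeated) pairs collapse into a single comparison."""
    C = set()
    for part1, part2 in blocks:
        for ei in part1:
            for ej in part2:
                C.add((ei, ej))
    return C

def evaluate(retained, duplicates):
    """Pairs Completeness (recall), Pairs Quality (precision), F-measure."""
    tp = len(retained & duplicates)
    recall = tp / len(duplicates)
    precision = tp / len(retained)
    f1 = 2 * recall * precision / (recall + precision) if tp else 0.0
    return recall, precision, f1

blocks = [({"a1"}, {"b1", "b2"}), ({"a1", "a2"}, {"b1"})]
C = candidate_pairs(blocks)            # ("a1", "b1") is counted only once
re, pr, f1 = evaluate(C, duplicates={("a1", "b1"), ("a2", "b2")})
```

In this toy run, one of the two duplicates survives in C, so recall is 0.5, while one of the three candidate pairs is a match, so precision is 1/3.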
The time efficiency of Supervised Meta-blocking is assessed through its running time, RT. This includes the time required to: (i) generate the feature vectors for all candidate pairs in C, (ii) train the classification model M, and (iii) apply M to C.
Pruning algorithms. To address Problem 1, three pruning algorithms were introduced in [25]: 1) The Binary Classifier (BCl), which simply retains all pairs classified as positive.
2) Cardinality Edge Pruning (CEP), which retains the top-k weighted candidate pairs, where k is set to half the sum of block sizes in the input block collection B, i.e., k = ½ · Σ_{b∈B} |b|, where |b| stands for the size of block b, i.e., the number of entities it contains [23].
3) Cardinality Node Pruning (CNP), which adapts CEP to a local operation, maintaining the top-k weighted candidates per entity, where k amounts to the average number of blocks per entity [23]. Weighting schemes. These algorithms were mostly combined with the following schemes, which are schema-agnostic and generic enough to cover any redundancy-positive block collection B [25]: 1) Co-occurrence Frequency-Inverse Block Frequency (CF-IBF). Inspired by Information Retrieval's TF-IDF, it assigns high scores to entities that participate in few blocks, but co-occur in many: CF-IBF(c_{i,j}) = |B_i ∩ B_j| · log(|B|/|B_i|) · log(|B|/|B_j|). 2) Reciprocal Aggregate Cardinality of Common Blocks (RACCB). The smaller the blocks shared by a pair of candidates, the more distinctive information they have in common and, thus, the more likely they are to be matching: RACCB(c_{i,j}) = Σ_{b ∈ B_i∩B_j} 1/||b||, where ||b|| is the cardinality of block b, i.e., the number of its candidate pairs.
5) Enhanced Jaccard Scheme (EJS). Based on the same principle as LCP, it enhances JS with the inverse frequency of an entity's candidates in C. The first four schemes formed the feature vector that achieves the best balance between effectiveness and time efficiency in [25]. LCP appears twice in the feature vector of c_{i,j}, as LCP(e_i) and LCP(e_j).
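The two schemes whose definitions are spelled out above can be sketched as follows. The input representation (sets of block ids plus a cardinality map) and the natural logarithm are assumptions for illustration; [25] fixes such details.

```python
import math

def cf_ibf(Bi, Bj, n_blocks):
    """CF-IBF: co-occurrence count discounted by each entity's block
    frequency, in the spirit of TF-IDF."""
    return (len(Bi & Bj)
            * math.log(n_blocks / len(Bi))
            * math.log(n_blocks / len(Bj)))

def raccb(Bi, Bj, cardinality):
    """RACCB: sum of the inverse cardinalities (candidate-pair counts) of
    the blocks shared by the two entities."""
    return sum(1.0 / cardinality[b] for b in Bi & Bj)

# Two entities sharing two blocks of an 8-block collection:
Bi, Bj = {0, 1}, {0, 1, 2}
cf = cf_ibf(Bi, Bj, n_blocks=8)               # 2 * ln(4) * ln(8/3)
ra = raccb(Bi, Bj, {0: 1, 1: 4, 2: 6})        # 1/1 + 1/4
```

Note how RACCB rewards the small shared block (cardinality 1) far more than the larger one, which matches the intuition that small blocks carry distinctive signatures.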

PROBLEM DEFINITION
Generalized Supervised Meta-blocking is a new task that differs from Supervised Meta-blocking in two ways: (i) instead of a binary classifier that assigns class labels, it trains a probabilistic classifier that assigns a weight w_{i,j} ∈ [0, 1] to every candidate pair c_{i,j}. This weight expresses how likely the pair is to belong to the positive class. (ii) The candidate pairs with a probability lower than 0.5 are discarded, but the rest, called valid pairs, are further processed by a pruning algorithm. The ones retained after pruning yield the new block collection B', which contains a new block per retained valid pair.
Hence, the performance evaluation of Generalized Supervised Meta-blocking relies on the following measures: (i) TP'(C), the probabilistic true positive pairs, involve duplicates that are assigned a probability ≥ 0.5 and are retained after pruning; (ii) FP'(C), the probabilistic false positive pairs, entail non-matching entities that are assigned a probability ≥ 0.5 and are retained by the pruning algorithm; (iii) TN'(C), the probabilistic true negative pairs, entail non-matching entities that are assigned a probability < 0.5 or are discarded by the pruning algorithm; (iv) FN'(C), the probabilistic false negative pairs, comprise matching entities that are assigned a probability < 0.5 or are discarded by the pruning algorithm.
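These four probabilistic outcome classes can be computed with elementary set operations once the pruning output is known; the function name and toy data below are illustrative.

```python
def outcome_classes(probabilities, retained, duplicates):
    """Partition the candidate pairs into TP', FP', TN', FN' given the
    classifier's probabilities, the pairs surviving pruning, and the
    ground-truth duplicates."""
    negatives = set(probabilities) - retained      # p < 0.5 or pruned away
    return {"TP": retained & duplicates,
            "FP": retained - duplicates,
            "TN": negatives - duplicates,
            "FN": negatives & duplicates}

probs = {("e1", "e3"): 0.9, ("e1", "e5"): 0.6, ("e4", "e5"): 0.4}
kept = {("e1", "e3")}                 # e.g., the output of a pruning algorithm
classes = outcome_classes(probs, kept, duplicates={("e1", "e3"), ("e4", "e5")})
```

Here the valid pair ⟨e1, e5⟩ is a true negative despite its probability exceeding 0.5, because it did not survive pruning and its entities do not match.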
The run-time of Generalized Supervised Meta-blocking, RT, adds to that of Supervised Meta-blocking the time required to process the assigned probabilities with a pruning algorithm.

PRUNING ALGORITHMS
To address Problem 2, our new supervised pruning algorithms operate as follows: given a specific set of features, they train a probabilistic classifier on the labelled instances. Then, they apply the trained classification model M to each candidate pair, estimating its classification probability. If it exceeds 0.5, the pair is valid and a threshold determines whether it will be retained or not.
Depending on the type of threshold, the pruning algorithms are categorized into two types: (i) The weight-based algorithms determine the weight(s) above which a comparison is retained. (ii) The cardinality-based algorithms determine the number k of top-weighted comparisons to be retained. In both cases, the determined threshold is applied either globally, on all candidate pairs, or locally, on the candidate pairs associated with every individual entity.
We define the following four weight-based pruning algorithms: 1) Weighted Edge Pruning (WEP). It iterates over the set of candidate pairs C twice: first, it applies the trained classifier to each pair in order to estimate the average probability p̄ of the valid ones. Then, it applies the trained classifier to each pair again and retains only those pairs with a probability higher than p̄.
2) Weighted Node Pruning (WNP). It iterates twice over C, too. Yet, instead of a global average probability, it estimates a local average probability per entity. It keeps in memory two arrays: one with the sum of valid probabilities per entity and one with the number of valid candidates per entity. They are populated during the first iteration over C and are used to compute the average probability per entity. Finally, WNP iterates over C and retains a pair c_{i,j} only if its probability p_{i,j} exceeds either of the related average probabilities.
3) Reciprocal Weighted Node Pruning (RWNP). The only difference from WNP is that a comparison is retained if its classification probability exceeds both related average probabilities. This way, it applies a consistently deeper pruning than WNP.
4) BLAST. It is similar to WNP, but uses a different pruning criterion. Instead of the average probability per entity, it relies on the maximum probability per entity. It stores these probabilities in an array that is populated during the first iteration over C. The second iteration over C retains a valid pair c_{i,j} if its probability exceeds a certain portion r of the sum of the related maximum probabilities.
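A minimal sketch of two of these algorithms (WEP and BLAST), assuming the classifier's probabilities are already available as a dict keyed by pair; dictionaries stand in for the paper's arrays, and the probabilities below are invented.

```python
from collections import defaultdict

def wep(probabilities):
    """Weighted Edge Pruning: keep valid pairs (p >= 0.5) whose probability
    exceeds the global average probability of the valid pairs."""
    valid = {pair: p for pair, p in probabilities.items() if p >= 0.5}
    if not valid:
        return set()
    p_bar = sum(valid.values()) / len(valid)
    return {pair for pair, p in valid.items() if p > p_bar}

def blast(probabilities, r=0.35):
    """BLAST-style pruning: keep a valid pair whose probability exceeds the
    portion r of the sum of its two entities' maximum probabilities."""
    valid = {pair: p for pair, p in probabilities.items() if p >= 0.5}
    max_p = defaultdict(float)
    for (i, j), p in valid.items():
        max_p[i] = max(max_p[i], p)
        max_p[j] = max(max_p[j], p)
    return {(i, j) for (i, j), p in valid.items()
            if p > r * (max_p[i] + max_p[j])}

probs = {("e1", "e3"): 0.9, ("e1", "e5"): 0.6, ("e4", "e5"): 0.3}
```

On these probabilities, WEP keeps only the pair above the global average of the valid pairs, while BLAST's local maxima let the weaker valid pair survive as well; the pair below 0.5 is dropped by both.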
We also define a new cardinality-based pruning algorithm: Reciprocal Cardinality Node Pruning (RCNP), which performs a deeper pruning than CNP by retaining only the candidate pairs that are among the top-k weighted ones for both constituent entities.
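The reciprocal condition of RCNP can be sketched as follows; the per-entity top-k bookkeeping via `heapq` and the toy probabilities are illustrative choices, not the paper's implementation.

```python
from collections import defaultdict
import heapq

def rcnp(probabilities, k):
    """Reciprocal CNP: keep a valid pair (p >= 0.5) only if it ranks among
    the top-k weighted candidates of BOTH of its entities."""
    valid = {pair: p for pair, p in probabilities.items() if p >= 0.5}
    per_entity = defaultdict(list)
    for (i, j), p in valid.items():
        per_entity[i].append((p, (i, j)))
        per_entity[j].append((p, (i, j)))
    top = {e: {pair for _, pair in heapq.nlargest(k, cands)}
           for e, cands in per_entity.items()}
    return {pair for pair in valid
            if pair in top[pair[0]] and pair in top[pair[1]]}

probs = {("a", "x"): 0.9, ("a", "y"): 0.8, ("a", "z"): 0.7, ("b", "x"): 0.6}
kept = rcnp(probs, k=2)
```

With k = 2, the pair involving "z" is discarded because it is only the third-best candidate of entity "a", even though it is the single (and thus top) candidate of "z": both entities must rank it highly.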
Please refer to [12] for more detailed descriptions.

WEIGHTING SCHEMES
Among the features of [25], RACCB captures another type of valuable matching evidence in redundancy-positive block collections: the sum of the inverse cardinalities of common blocks. RACCB assumes that the higher this sum is, the more distinctive the information shared by e_i and e_j is, and thus the more likely they are to match. Unlike |B_i ∩ B_j|, RACCB does not need to be combined with any discount factor (as in CF-IBF and EJS), because it produces highly distinctive scores. However, it considers only the blocks shared by e_i and e_j, disregarding the contextual information about the rest of the blocks that contain these two entities.
To address this issue, the Weighted Jaccard Scheme (WJS) normalizes RACCB with the inverse cardinalities of all blocks containing each entity. WJS promotes candidate pairs co-occurring in the most and smallest blocks, i.e., those sharing a larger portion of their distinctive textual content.
Another type of matching evidence in redundancy-positive block collections, which has been overlooked in the literature, is the inverse size of common blocks. Similar to RACCB, the smaller the common blocks are, the more likely the corresponding candidate pairs are to be matching. This is encapsulated by the Reciprocal Sizes Scheme (RS) [1]. Similar to RACCB, RS is context-agnostic, considering exclusively information from the common blocks of two entities. To enhance it, the Normalized Reciprocal Sizes Scheme (NRS) extends RS with the contextual information of all blocks containing the constituent entities of a candidate pair [1]. In general, the normalized weighting schemes yield more distinctive features, because they encompass more information about a given pair of candidate matches. Moreover, the size and cardinality of blocks provide more distinctive information about e_i and e_j than the mere number of blocks they share. For this reason, the new features (i.e., WJS, RS and NRS) are expected to significantly enhance the performance of the pruning algorithms.
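A sketch of RS and NRS follows. The exact normalization of NRS is not reproduced here from [1], so the weighted-Jaccard form below is an assumption consistent with the description above; the block sizes are invented.

```python
def rs(Bi, Bj, block_size):
    """Reciprocal Sizes Scheme: sum of the inverse sizes of the blocks shared
    by the two entities (|b| = number of entities in block b)."""
    return sum(1.0 / block_size[b] for b in Bi & Bj)

def nrs(Bi, Bj, block_size):
    """Normalized RS, written here as a weighted Jaccard over all blocks of
    the two entities; an assumption, as the formula of [1] is not shown."""
    common = rs(Bi, Bj, block_size)
    total = (sum(1.0 / block_size[b] for b in Bi)
             + sum(1.0 / block_size[b] for b in Bj)
             - common)
    return common / total if total else 0.0

sizes = {0: 2, 1: 4, 2: 5}                 # block id -> number of entities
rs_score = rs({0, 1}, {1, 2}, sizes)       # shared block 1 only: 1/4
nrs_score = nrs({0, 1}, {1, 2}, sizes)
```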

EXPERIMENTAL EVALUATION
Hardware and Software. All the experiments were performed on a machine equipped with four Intel Xeon E5-2697 2.40 GHz CPUs (72 cores) and 216 GB of RAM, running Ubuntu 18.04. We employed the SparkER library [13] to perform blocking and feature generation. Unless stated otherwise, we perform the machine learning analysis using Python 3.7 and, in particular, the Support Vector Classification (SVC) model of scikit-learn [30]. We used the default configuration parameters, enabling the generation of probabilities and fixing the random state so as to reproduce the probabilities over several runs. We performed all experiments with logistic regression, too, obtaining almost identical results, but we omit them for brevity.
Datasets. Table 1a lists the 9 real-world datasets employed in our experiments (|E_x| stands for the number of entities in an entity collection, |D| for the number of duplicate pairs). They have different characteristics and cover a variety of domains. Each dataset involves two different, but overlapping data sources, where the ground truth of the real matches is known. AbtBuy matches products extracted from Abt.com and Buy.com [18]. DblpAcm matches scientific articles extracted from dblp.org and dl.acm.org [18]. ScholarDblp matches scientific articles extracted from scholar.google.com and dblp.org [18]. ImdbTmdb, ImdbTvdb and TmdbTvdb match movies and TV series extracted from IMDB, TheMovieDB and TheTVDB [19], as suggested by their names. Movies matches information about films that are extracted from imdb.com and dbpedia.org [21]. WMAmazon matches products from Walmart.com and Amazon.com [8].
Blocking. To each dataset, we apply Token Blocking, the only parameter-free redundancy-positive blocking method [28]. The original blocks are then processed by Block Purging [21], which discards all the blocks that contain more than half of all entity profiles in a parameter-free way. These blocks correspond to highly frequent signatures (e.g., stop-words) that provide no distinguishing information. Finally, we apply Block Filtering [26], removing each entity e_i from the largest 20% of the blocks in which it appears.
The performance of the resulting block collections is reported in Table 1a. We observe that in most cases, the block collections achieve an almost perfect recall that significantly exceeds 90%. The only exception is AmazonGP, where some duplicate entities share no infrequent attribute value token; the recall, though, remains quite satisfactory even in this case. Yet, the precision is consistently quite low, as its highest value is lower than 0.003. As a result, F1 is also quite low, far below 0.1 across all datasets. These settings undoubtedly call for Supervised Meta-blocking.
To apply Generalized Supervised Meta-blocking to these block collections, we performed 10 runs and averaged the values of precision, recall, and F1. In each run, a different seed is used to sample the pairs that compose the training set. Using undersampling, we formed a balanced training set per dataset that comprises 500 labelled instances. Due to space limitations, we mostly report the average performance of every approach over the 9 block collections.
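The balanced undersampling step can be sketched as follows; the function name and synthetic pairs are illustrative, with the seed playing the role of the per-run seed mentioned above.

```python
import random

def balanced_training_set(candidates, duplicates, size, seed):
    """Undersampling: draw size/2 positive and size/2 negative candidate
    pairs with a per-run seed, yielding a balanced labelled set."""
    rng = random.Random(seed)
    positives = sorted(c for c in candidates if c in duplicates)
    negatives = sorted(c for c in candidates if c not in duplicates)
    half = size // 2
    return ([(c, 1) for c in rng.sample(positives, half)]
            + [(c, 0) for c in rng.sample(negatives, half)])

candidates = {(i, i + 100) for i in range(20)}
duplicates = {(i, i + 100) for i in range(10)}
train = balanced_training_set(candidates, duplicates, size=10, seed=42)
```

Sorting before sampling makes the draw reproducible for a fixed seed regardless of set iteration order.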
Pruning Algorithm Selection. We now investigate which are the best-performing weight- and cardinality-based pruning algorithms for Generalized Supervised Meta-blocking among those discussed in Section 4. As baseline methods, we employ the pruning algorithms proposed in [25]: the binary classifier BCl for the weight-based algorithms as well as CEP and CNP for the cardinality-based ones. We fixed the training set size to 500 pairs and used the feature vector proposed as optimal in [25]; every candidate pair c_{i,j} is represented by the vector {CF-IBF(c_{i,j}), RACCB(c_{i,j}), JS(c_{i,j}), LCP(e_i), LCP(e_j)}. Based on preliminary experiments, we set the pruning ratio of BLAST to r = 0.35. The average effectiveness measures of the weight- and cardinality-based algorithms across the 9 block collections of Table 1a are reported in Tables 2a and b, respectively.
Among the weight-based algorithms, we observe that the new pruning algorithms trade slightly lower recall for significantly higher precision and F1. Comparing BCl with WEP, recall drops by 5.9%, while precision rises by 60.8% and F1 by 42.9%. This pattern is more intense in the case of RWNP, which reduces recall by 7.2%, increasing precision by 68.5% and F1 by 46.3%. These two algorithms actually monopolize the highest F1 scores in every case: for ImdbTmdb, ImdbTvdb and TmdbTvdb, WEP ranks first with RWNP second, and vice versa for the rest of the datasets. Their aggressive pruning, though, results in very low recall (<0.8) in four datasets. E.g., in the case of AbtBuy, BCl's recall is 0.852, but WEP and RWNP reduce it to 0.755 and 0.699, respectively.
The remaining algorithms are more robust with respect to recall. Compared to BCl, WNP reduces recall by just 0.2%, while increasing precision by 26.8% and F1 by 19.7%. Yet, BLAST outperforms WEP with respect to all effectiveness measures: recall, precision and F1 rise by 1.3%, 13.8% and 11.5%, respectively. This means that BLAST is able to discard many more non-matching pairs, while also retaining a few more matching ones.
Among the cardinality-based algorithms, we observe that RCNP is a clear winner, outperforming both CEP and CNP. Compared to the former, it reduces recall by 1.1%, while increasing precision by 44% and F1 by 34.4%; compared to the latter, recall drops by 3.5%, but precision and F1 rise by 37.5% and 29.3%, respectively. Overall, RCNP constitutes the best choice among the cardinality-based pruning algorithms, which are crafted for applications that promote precision at the cost of slightly lower recall [23,26]. BLAST is the best among the weight-based pruning algorithms, which are crafted for applications that promote recall at the cost of slightly lower precision [23,26]. Note that their F1 is significantly higher than the original ones in Table 1a, but still far from perfect. The reason is that (Supervised) Meta-blocking merely produces a new block collection, not the end result of ER. This block collection is then processed by a Matching algorithm, whose goal is to raise F1 close to 1.
Feature selection. We now fine-tune the selected algorithms, BLAST and RCNP, by identifying the feature sets that optimize their performance in terms of effectiveness and time efficiency. We adopted a brute-force approach, trying all possible combinations of the eight features presented in Sections 2 and 5. Fixing again the training set size to a random sample of 500 balanced instances, the top-10 feature vectors with respect to F1 for BLAST and RCNP are reported in Tables 3a and b, respectively.
We observe that both algorithms are robust with respect to the top-10 feature sets, as they all achieve practically identical performance, on average. For BLAST, we obtain recall = 0.882, precision = 0.193 and F1 = 0.289 when combining CF-IBF and RACCB with any two features from F = {JS, RS, NRS, WJS}; even RACCB can be replaced with a third feature from F without any noticeable impact. For RCNP, we obtain recall = 0.850, precision = 0.248 and F1 = 0.353 when combining CF-IBF, RACCB and LCP with any pair of features from {JS, RS, NRS, WJS}. In this context, we select the best feature set for each algorithm based on time efficiency. In more detail, we compare the top-10 feature sets per algorithm in terms of their running times. This includes the time required for calculating the features per candidate pair and for retrieving the corresponding classification probability (we exclude the time required for producing the new block collections, because this is a fixed overhead common to all feature sets of the same algorithm). Due to space limitations, we consider only the two datasets with the most candidate pairs, as reported in Table 1a: Movies and WMAmazon. We repeated every experiment 10 times and took the mean time.
In Table 3a, we observe that feature set 78 is consistently the fastest one for BLAST, exhibiting a clear lead. Compared to the second fastest feature sets over Movies (75) and WMAmazon (96), it reduces the average run-time by 11.9% and 16.0%, respectively. For RCNP, the differences are much smaller, yet the same feature set (187) achieves the lowest run-time over both datasets. Compared to the second fastest feature sets over Movies (184) and WMAmazon (239), it reduces the average run-time by 3.3% and 4.8%, respectively.
Comparison with Supervised Meta-blocking [25]. We now compare BLAST and RCNP, in combination with the features selected above, with BCl and CNP, which use the feature set proposed in [25], {CF-IBF, RACCB, JS, LCP}. All algorithms were trained over the same randomly selected set of 500 labeled instances, 250 from each class, and were applied to all datasets in Table 1a. Their average performance is presented in Table 4.
We observe that BLAST outperforms BCl with respect to all effectiveness measures: its recall, precision and F1 are higher by 1.6%, 13.6% and 13%, respectively, on average. Thus, BLAST is much more accurate in the classification of the candidate pairs and more suitable than BCl for recall-intensive applications. Among the cardinality-based algorithms, RCNP trades slightly lower recall than CNP for significantly higher precision and F1: on average, across all datasets, its recall is lower by 4.1%, while its precision and F1 are higher by 34.9% and 33.6%, respectively. As a result, RCNP is more suitable than CNP for precision-intensive applications.
Regarding the running times of these algorithms on the largest datasets, i.e., Movies and WMAmazon, we observe that BCl, CNP and RCNP exhibit similar RT in both cases, since they all employ more complex feature sets that include the time-consuming feature LCP. BLAST is substantially faster than these algorithms, reducing RT by more than 50%. In particular, comparing it with its weight-based competitor, we observe that BLAST is faster than BCl by 2.1 times over Movies and by 3.2 times over WMAmazon.
The effect of training set size. We now explore how the performance of BLAST and RCNP changes when varying the training set size. We used the feature sets selected above (IDs 78 and 187 in Tables 3a and b, respectively) and varied the number of labelled instances, starting from 20 and then ranging from 50 to 500 with a step of 50. Figures 5 and 6 report the results in terms of recall, precision and F1, on average across all datasets, for BLAST and RCNP, respectively.
Notice that both algorithms exhibit the same behavior: as the training set size increases, recall gets higher at the expense of lower precision and F1. However, the increase in recall is much lower than the decrease in the other two measures. More specifically, comparing the largest training set size with the smallest one, the average recall of BLAST rises by 2.4%, while its average precision drops by 29.7% and its average F1 by 24.8%. Similar patterns apply to RCNP: recall rises by 2.1%, but precision and F1 drop by 17.8% and 16.8%, respectively, when increasing the labelled instances from 50 to 500. This might seem counter-intuitive, but it is caused by the distribution of matching probabilities: for small training sets, these probabilities are relatively evenly distributed in [0.5, 1.0], but for larger ones, they are concentrated in higher scores, closer to 1.0, while the pruning threshold remains practically stable. As a result, more true and false positives exceed the threshold as the training set size increases, as explained in detail in [12].
Given that 20 labelled instances yield very low recall, especially for RCNP, but with 50 instances the recall becomes quite satisfactory for both algorithms (≥ 0.85, on average, across all datasets), we can conclude that the optimal training set involves just 50 labelled instances, equally split among positive and negative ones.
Comparison with Supervised Meta-blocking [25]. Table 5a-c reports a full comparison between the main weight-based algorithms, i.e., BLAST and BCl (note that BCl_2 uses the training set specified in [25], i.e., a random sample involving 5% of the positive instances in the ground-truth along with an equal number of randomly selected negative instances). We observe that, on average, BLAST outperforms BCl_2 with respect to all effectiveness measures, increasing the average recall, precision and F1 by 7.1%, 5.0% and 9.9%, respectively. Compared to BCl_1, BLAST increases the average recall by 3.95%, at the cost of slightly lower precision and F1 (5.9% and 2.2%, respectively). Recall drops below 0.8 in four datasets for BCl_1 (and BCl_2), whereas BLAST violates this limit in just two datasets. This should be attributed to duplicate pairs that share just one block in the original block collection, due to missing or erroneous values, as explained in detail in [12]. BCl_1 outperforms BCl_2 in all respects, demonstrating the effectiveness of the new feature set. In terms of run-time, BLAST is slower than BCl_1 by 8.2%, on average, because it iterates once more over all candidate pairs. Compared to BCl_2, BLAST is 6.7 times faster, on average across all datasets, because of LCP and of the large training sets, which learn complex binary classifiers with a time-consuming processing.
Regarding the cardinality-based algorithms, we observe in Tables 5d-f that RCNP typically outperforms both baseline methods with respect to all effectiveness measures. Compared to CNP1 (CNP2), RCNP raises the average recall by 5.3% (9.2%), while achieving the highest precision and F1 across all datasets, except for AbtBuy and ImdbTmdb (and ScholarDblp in the case of CNP1). The relative increase in precision in comparison to CNP1 ranges from 7.5% over TmdbTvdb to 10 and 24 times over Movies and WalmartAmazon, respectively. Compared to CNP2, precision rises from 5.0% over DblpAcm to 49 times over WalmartAmazon. In all cases, F1 increases to a similar extent. These patterns suggest that RCNP is typically more accurate in classifying the positive candidate pairs. In terms of run-time, RCNP is slower than CNP1 by 6.2%, on average, as it retains the candidate pairs that are among the top-k weighted ones for both constituent entities (i.e., it searches for pairs in two lists), whereas CNP1 simply merges the lists of all entities. CNP2 employs a much larger training set, yielding more complicated and time-consuming classifiers than RCNP, which is 3 times faster, on average, across all datasets.
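The difference between the two pruning strategies can be sketched in a few lines (a simplified illustration, not the exact algorithms, which operate on redundancy-positive block collections): CNP keeps a pair that ranks among the top-k candidates of either constituent entity, whereas RCNP requires both. The toy weights are hypothetical matching probabilities.

```python
from collections import defaultdict

def top_k(neighbors, k):
    """IDs of the k highest-weighted candidates of a single entity."""
    return {eid for eid, _ in sorted(neighbors.items(), key=lambda x: -x[1])[:k]}

def prune(weights, k, reciprocal):
    """weights: {(e1, e2): w} over all candidate pairs.
    CNP (reciprocal=False) keeps a pair if it is among the top-k
    candidates of EITHER entity; RCNP (reciprocal=True) requires BOTH."""
    nbrs = defaultdict(dict)
    for (a, b), w in weights.items():
        nbrs[a][b] = w
        nbrs[b][a] = w
    tops = {e: top_k(n, k) for e, n in nbrs.items()}
    keep = set()
    for a, b in weights:
        in_a, in_b = b in tops[a], a in tops[b]
        if (in_a and in_b) if reciprocal else (in_a or in_b):
            keep.add((a, b))
    return keep

w = {("e1", "e3"): 0.9, ("e1", "e2"): 0.4,
     ("e2", "e4"): 0.8, ("e2", "e5"): 0.35}
print(sorted(prune(w, 1, True)))   # [('e1', 'e3'), ('e2', 'e4')]
print(sorted(prune(w, 1, False)))  # [('e1', 'e3'), ('e2', 'e4'), ('e2', 'e5')]
```

Here ("e2", "e5") survives CNP, because e2 is the best candidate of e5, but not RCNP, because e5 is not among the top candidates of e2, illustrating why the reciprocal condition yields higher precision.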
Overall, Generalized Supervised Meta-blocking outperforms Supervised Meta-blocking to a significant extent, despite using a balanced training set of just 50 labelled instances.
Scalability Analysis. We assess the scalability of our approaches as the number of candidate pairs |C| increases, verifying their robustness under versatile settings: instead of the real-world Clean-Clean ER datasets, we now consider the synthetic Dirty ER datasets, and instead of SVC, we train our models using Weka's default implementation of Logistic Regression [17].
The characteristics of the datasets, which are widely used in the literature [5,28], appear in Table 1b.To extract a large block collection from every dataset, we apply Token Blocking.In all cases, the recall is almost perfect, but precision and F1 are extremely low.
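As a reminder of how these block collections are built, Token Blocking can be sketched in a few lines (a minimal, schema-agnostic illustration; the entity profiles and the minimum-frequency cut-off of 2 profiles per token are toy stand-ins):

```python
from collections import defaultdict

def token_blocking(profiles, min_freq=2):
    """profiles: {entity_id: concatenation of all attribute values}.
    Creates one block per token that appears in at least `min_freq`
    distinct profiles (attribute names are ignored)."""
    blocks = defaultdict(set)
    for eid, text in profiles.items():
        for tok in set(text.lower().split()):
            blocks[tok].add(eid)
    return {t: ids for t, ids in blocks.items() if len(ids) >= min_freq}

profiles = {
    "e1": "galaxy s4 black",
    "e2": "iphone 5 white",
    "e3": "samsung galaxy s4",
    "e4": "apple iphone 5",
}
blocks = token_blocking(profiles)
# blocks: 'galaxy' -> {e1, e3}, 's4' -> {e1, e3},
#         'iphone' -> {e2, e4}, '5' -> {e2, e4}
```

Every entity participates in multiple blocks, which is precisely the redundancy that Meta-blocking subsequently prunes.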
We consider four methods: BCl and CNP with the features and the training set size specified in [25], as well as BLAST and RCNP with the features in Tables 5a and 5d, respectively, trained over 50 labelled instances (25 per class). For each dataset, we performed three repetitions per algorithm and considered the average performance.
The effectiveness of the weight- and cardinality-based algorithms over all datasets appears in Figures 7a and 7b, respectively. BLAST significantly outperforms BCl in all cases: on average, it reduces recall by 3.5%, but consistently maintains it above 0.93, while increasing precision and F1 by a whole order of magnitude. Note that BLAST's precision is much higher than expected over D100, due to the effect of random sampling: a different training set is used in every one of the three iterations, with two of them performing a very deep pruning for a minor decrease in recall.
RCNP outperforms CNP to a significant extent: on average, it reduces recall by 7.9%, but maintains it at very high levels, except for D200, where it drops to 0.77 due to the effect of random sampling; yet, precision rises by 2.8 times and F1 by 2.3 times. These results verify the strength of our approaches, even though they require orders of magnitude fewer labelled instances than [25].
Most importantly, our approaches scale better to large datasets, as demonstrated by the speedup in Figure 7c. Given two sets of candidate pairs, C1 and C2, such that |C1| < |C2|, this measure is defined as follows: speedup = |C2|/|C1| × RT1/RT2, where RT1 (RT2) denotes the running time over C1 (C2); in our case, C1 corresponds to D10 and C2 to each of the other datasets. In essence, speedup extrapolates the running time of the smallest dataset to the largest one, with values close to 1 indicating linear scalability, which is the ideal case. We observe that all methods start from very high values, but BCl and CNP deteriorate to a significantly larger extent than BLAST and RCNP, respectively, achieving the lowest values for D300. This should be attributed to their lower accuracy in pruning the non-matching comparisons, which deteriorates as the number of candidate pairs increases. As a result, they end up retaining and processing a much larger number of comparisons, which slows down their operation.
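In code, the speedup measure amounts to a one-liner (the candidate-set sizes and timing figures in the example are hypothetical):

```python
def speedup(c1, c2, rt1, rt2):
    """speedup = |C2|/|C1| * RT1/RT2, with |C1| < |C2|.
    Values close to 1 indicate linear scalability (the ideal case)."""
    assert c1 < c2, "C1 must be the smaller candidate set"
    return (c2 / c1) * (rt1 / rt2)

# Hypothetical example: 1M pairs processed in 2.0s vs 10M pairs in 25.0s.
print(speedup(1_000_000, 10_000_000, 2.0, 25.0))  # 0.8 -> slightly sub-linear
```

A value below 1 means the running time grew faster than the number of candidate pairs; a method whose pruning accuracy degrades on large inputs retains more comparisons and thus drifts further below 1.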
Overall, Generalized Supervised Meta-blocking scales much better to large datasets than Supervised Meta-blocking [25] for both weight- and cardinality-based algorithms. For slightly lower recall, it raises precision and F1 by 2 times and maintains a much higher speedup.

RELATED WORK
The unsupervised pruning algorithms WEP, WNP, CEP, and CNP were introduced in [23]. WNP and CNP were redefined in [26] so that they do not produce block collections with redundant comparisons. Unsupervised Reciprocal WNP and Reciprocal CNP were also coined in [26], while unsupervised BLAST was proposed in [31].
Over the years, more unsupervised pruning algorithms have been proposed in the literature. [36] proposes a variant of CEP that retains the top-weighted candidate pairs with a cumulative weight higher than a specific portion of the total sum of weights. Crafted for Semantic Web data, MinoanER [11] combines meta-blocking evidence from two complementary block collections: the blocks extracted from the names of entities and those extracted from the attribute values of their neighbors. BLAST2 [2] leverages loose schema information in order to boost the performance of Meta-blocking's weighting schemes. Finally, a family of pruning algorithms that focuses on the comparison weights inside individual blocks is presented in [9]. Our approaches can be generalized to these algorithms, too, but their analytical examination lies outside our scope.
The above works consider Meta-blocking in a static context that ignores Matching.A dynamic approach that leverages Meta-blocking is pBlocking [16].After applying Matching to the smallest blocks, intersections of the initial blocks are formed and scored based on their ratio of matching and non-matching entities.Meta-blocking is then applied to produce a new set of candidates to be processed by Matching.This process is iteratively applied until convergence.BEER [15] is an open-source implementation of pBlocking.
On another line of research, BLOSS [3] introduces an active learning approach that reduces the size of the labelled set required by Supervised Meta-blocking. It partitions the unlabelled candidate pairs into similarity levels based on CF-IBF, applies rule-based active sampling inside every level, and cleans the labelled sample from non-matching outliers with high Jaccard weight. Our approaches render BLOSS unnecessary, as they require just 50 labelled instances.

CONCLUSIONS
We presented Generalized Supervised Meta-blocking, which casts Meta-blocking as a probabilistic binary classification task and weights all candidate pairs in a block collection through a trained probabilistic classifier. Its weights are processed by pruning algorithms that are weight-based, promoting recall, or cardinality-based, promoting precision; BLAST and RCNP constitute the best algorithms, respectively. We showed that four new weighting schemes give rise to feature sets that outperform the existing ones [25], while a very small, balanced training set with just 50 labelled instances suffices for high effectiveness, high time efficiency and high scalability. In the future, we will apply our approaches to Progressive ER [29,33-35].

Figure 1: (a) The input entities (smartphone models), and (b) the redundancy-positive blocks produced by Token Blocking.

Figure 2: Unsupervised Meta-blocking example: (a) The blocking graph of the blocks in Figure 1b, using the number of common blocks as edge weights, (b) a possible pruned blocking graph, and (c) the new blocks.

Figure 3: Supervised Meta-blocking example: (a) a graph where each edge is associated with a feature vector, (b) the graph pruned by a binary classier, and (c) the output, which contains a new block per retained edge.

Figure 4: Generalized Supervised Meta-blocking example: (a) a graph weighted with a probabilistic classier, (b) the pruned graph, and (c) the new blocks.

Figure 5: The effect of the training set size on BLAST.

Figure 6: The effect of the training set size on RCNP.

Figure 7: Scalability over the datasets in Table 1b: (a) the weight-based pruning algorithms, (b) the cardinality-based ones, and (c) speedup.

Table 1: The datasets used in the experimental study.

Table 2: The average performance of all pruning algorithms over the block collections of Table 1.

Table 3: The 10 feature sets with the highest F1 per algorithm.

Table 5: Performance of the main weight- and cardinality-based algorithms across all datasets in a-c and d-f, respectively. RT is the mean run-time (in seconds) over 10 repetitions.