Multidimensional Clustering of EU Regions: A Contribution to Orient Public Policies in Reducing Regional Disparities

This paper applies multidimensional clustering to EU-28 regions with regard to their specialisation strategies and socioeconomic characteristics. It builds on an original dataset. Several academic studies discuss the relevant issues to be addressed by innovation and regional development policies, but so far no systematic analysis has linked the different aspects of EU regions' research and innovation strategies (RIS3) and their socio-economic characteristics. This paper intends to fill this gap, with the aim of providing clues for more effective regional and innovation policies. In the dataset analysed in this paper, the socioeconomic and demographic classification associates each region with one categorical variable (with 19 categories), while the classification of the RIS3 priorities was performed separately on the “descriptions” (21 Boolean categories) and the “codes” (11 Boolean categories) of regions’ RIS3. The cluster analysis, implemented on the results of the correspondence analysis on the three sets of categories, returns 9 groups of regions that are similar in terms of priorities and socioeconomic characteristics. Each group has different characteristics that revolve mainly around the concepts of selectivity (the group’s ability to represent a category) and homogeneity (similarity in the group with respect to one category) with respect to the different classifications on which the analysis is based. The policy implications shown in this paper are discussed as a contribution to the current debate on post-2020 European Cohesion Policy, which aims at orienting public policies toward the reduction of regional disparities and the enhancement of complementarities and synergies within macro-regions.


Introduction
The current debate on post-2020 European Cohesion Policy confirms the need for public policies targeting the reduction of regional disparities and the enhancement of complementarities and synergies within macro-regions. Such interventions, supported by the European Structural & Investment Funds, are key instruments for the implementation of EU policies and programmes aimed at fostering cohesion and competitiveness across larger EU spaces, encompassing neighbouring member and non-member States (European Commission 2016). To this end, regions are encouraged to share their best practices, to learn from each other and to exploit the opportunities for joint actions, through dedicated tools created by the European Commission. A specific dimension of such leverages is the set of strategic priorities that regions have outlined in their smart specialisation strategies on research and innovation. The concept stems from academic work on the key drivers of bottom-up policies aiming at the structural changes that are needed to improve job opportunities and the welfare of territories (Foray et al. 2009; Barca 2009; Foray 2018). In the programming period 2014-2020, the European Commission adopted the Research and Innovation Smart Specialisation Strategy (RIS3) as an ex-ante conditionality for regions' access to European Regional Development Funds (ERDFs). Such policies are built on specific guidelines and on a very detailed process of implementation (European Commission 2012, 2017; Foray et al. 2012; McCann and Ortega-Argilés 2015). They identify "strategic areas for intervention, based both on the analysis of the strengths and potential of the regional economies and on a process of entrepreneurial discovery with wide stakeholder involvement. It embraces a broad view of innovation that goes beyond research-oriented and technology-based activities, and requires a sound intervention strategy supported by effective monitoring mechanisms" (European Commission 2017, p. 11).
Although over 65 billion EUR of ERDFs have been allocated to such policies, their impact has not yet been scrutinised and no effective monitoring tool has been implemented. In addition, no systematic information on the list of projects implemented under the various regions' RIS3 priorities is available. For regions aiming at learning from other regions' practices on RIS3, information on regional strategies and goals is shared through online platforms, such as the S3 platform run by EC-JRC. Other loci of interaction among regions are those supported by the EU Interreg programmes, the Interact Initiatives, and the macro-regional strategies. National programmes, too, provide fora for cross-region and cross-country comparison of structural features and policy measures in diverse domains. Several academic studies provide analytical frameworks to support public decision making on subjects such as income disparities (Iammarino et al. 2018) or quality of institutions (Charron et al. 2014). However, no systematic analysis has jointly linked the different aspects of EU regions' specialisation strategies and their socio-economic characteristics. This paper aims to fill this gap by applying a multidimensional clustering of EU-28 regions in order to provide clues for more effective regional policies. The clustering proposed in the paper builds on an original dataset, where the EU-28 regions are classified according to their socioeconomic features and to the strategic features of their research and innovation smart specialisation strategy (RIS3). In the first classification, each region is associated with one categorical variable (with 19 modalities) based on a multidimensional analysis (PCA and CA) of a large dataset, and it provides a perspective focused on regional heterogeneity across EU regions. In the second classification, two clusterings, of the "descriptions" and of the "codes" of RIS3 priorities, were considered (made of 21 and 11 Boolean categories, respectively). This comparative perspective is made possible by an unsupervised Boolean textual classification of priorities using information on RIS3 from the Eye@RIS3 platform (European Commission, Joint Research Centre, JRC).
The paper is structured as follows. Section 2 describes the methods used to obtain a multidimensional classification and the dataset built on the classification of the socioeconomic features of EU-28 regions and on the classification of the priorities pointed out in their smart specialisation strategies. Section 3 presents the main results. Section 4 builds on the results of the analysis and discusses their implications for policy and possible future strands of this research.

Data and Methods
The data analysed in this paper result from the merging of two main datasets. First, we use the classification of regions according to their socioeconomic features by Pagliacci et al. (2019). A socio-economic categorical variable is defined by classifying the 208 territorial entities in EU-28 regions into 19 categories. Second, with regard to smart specialisation strategies, we use the classification defined by Pavone et al. (2019). There, the RIS3 priorities of 216 EU-28 territorial entities are summarised in two multi-class categorical variables: Description (21 categories) and Codes (11 categories). These two categorisations derive from an automatic classification of the priorities specified by each region in terms of free-text descriptions and of codes, which belong to three domains: scientific, economic, and policy objectives. In the dataset, each record refers to a priority defined by the region with a free-text description and with a series of codes in the three domains. Each region could specify one or more priorities. The automatic analysis of the two corpora (descriptions and codes) has allowed the classification of priorities into 21 topics for descriptions and 11 groups for codes. The results of the three classifications can be cross-referenced by using the online tool created ad hoc for such cross-tabulation. Developed within the AlpGov project to map R&I in the Alpine regions, the tool is implemented to query the classifications of all the EU regions. Through an effective visualisation of maps and data, it allows policy makers, researchers and the public to query specific combinations of interest, focusing on the most detailed identification of groups of regions along the three categorisations: economic characteristics, and RIS3 priorities' descriptions and codes.
By merging the two datasets, in this paper we study the multidimensional classification of 191 territorial entities according to the three above-mentioned categorical variables.
The state of the art in clustering is documented in a vast literature (Jain 2010), developed in a variety of scientific fields with different languages and focusing on the most diverse problems: clustering heterogeneous data, the definition of parameters and initialisations (such as the number of iterations in K-means; e.g., MacQueen 1967) and the threshold in hierarchical clustering (Jain and Dubes 1988), as well as the problem of defining the optimal number of groups. Research is increasingly focusing on combining multiple clusterings of the same dataset to produce a single, better clustering (Boulis and Ostendorf 2004).
Without going into the merits of what could be the best method of classification, we put forward a grouping of regions according to their similarity in terms of their socio-economic characteristics and their RIS3 priorities. This enables a comparison of policy strategies in the EU by implementing a factor analysis and a cluster analysis, applied to the matrix Regions × Categorical variables. Given that our case study comprises only one univocal categorical variable (the regions' socio-economic and demographic classification, with 19 categories) and two multi-class categorical variables (Codes and Descriptions of regions' RIS3 priorities, with 11 and 21 categories, respectively), we directly apply a Correspondence Analysis (Benzecri 1992; Greenacre 2007) to the Boolean matrix Regions × Categories (191 × 51), in which the row totals depend on the number of categories in which each region has been classified. Usually, a matrix Units × Categorical variables (univocal classification) is studied through a multiple correspondence analysis that transforms the matrix Units × Variables (m × s) into a Boolean matrix Units × Categories (m × n). This latter matrix is considered as a particular frequency table in which each row total equals the number of categorical variables considered in the analysis, while each column total equals the frequency of the corresponding category in the m units considered (Bolasco 1999). Then a correspondence analysis is applied, after transforming the Boolean data into row and column profiles, looking for their reproduction in factorial subspaces according to the criterion of the best orthogonal projections. In the present analysis, given a multiple categorisation in two out of three dimensions, we adopt a Correspondence Analysis on the Boolean matrix. The factors highlight the configuration of the profiles in a graphical context. The interpretation of each factor through the analysis of the nodes' polarization sheds light on the association structure among regions' profiles. Then a hierarchical agglomerative clustering based on Ward's aggregation method, with Euclidean distance, is applied to the results of the Correspondence Analysis on the dataset of regions.
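The pipeline described above (Boolean matrix, correspondence analysis, Ward clustering on the first factors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are randomly generated stand-ins for the real classifications, the function name `row_coordinates` is hypothetical, and the correspondence analysis is computed through a singular value decomposition of the standardised residuals, which is one common formulation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def row_coordinates(table, n_factors):
    """Row principal coordinates from a simple correspondence analysis.

    Assumes every row of `table` contains at least one non-zero entry.
    """
    table = np.asarray(table, dtype=float)
    table = table[:, table.sum(axis=0) > 0]      # drop empty categories
    P = table / table.sum()                      # correspondence matrix
    r = P.sum(axis=1)                            # row masses
    c = P.sum(axis=0)                            # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)       # standardised residuals
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    coords = (U * sv) / np.sqrt(r)[:, None]      # D_r^{-1/2} U diag(sv)
    return coords[:, :n_factors]


# Synthetic stand-in for the Regions x Categories matrix (NOT the real data).
rng = np.random.default_rng(0)
n_regions = 191

# Univocal variable: exactly one of 19 socio-economic classes per region.
socio = np.zeros((n_regions, 19), dtype=int)
socio[np.arange(n_regions), rng.integers(0, 19, n_regions)] = 1

# Multi-class variables: a region may fall in several categories.
descriptions = (rng.random((n_regions, 21)) < 0.25).astype(int)
codes = (rng.random((n_regions, 11)) < 0.30).astype(int)

boolean_matrix = np.hstack([socio, descriptions, codes])   # 191 x 51

# Keep the first four factors, then apply Ward's method (Euclidean distance).
factors = row_coordinates(boolean_matrix, n_factors=4)
tree = linkage(factors, method="ward")
groups = fcluster(tree, t=9, criterion="maxclust")          # cut into 9 groups
```

Because the row totals differ across regions (one socio-economic class plus a variable number of priority categories), the row masses `r` are not constant, which is the feature that distinguishes this setting from a standard multiple correspondence analysis.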

Results
The correspondence analysis is applied to the Boolean matrix Regions × Categories. In this matrix, each region is classified according to a socio-economic class and to the set of categories of codes and categories of descriptions. The results of this analysis are presented in Figs. 1 and 2, which show the distribution on the f1f2 plane of the 51 categories and of the 191 regions, respectively. "Appendix 1" lists the coordinates of the categories on the first four factors: these figures make it possible to interpret the existing polarizations in each factor. Building on this information, by analysing Fig. 1, we observe that the first factor polarises information on the specialisation of the regional economy, from services (left) to manufacturing (right), while the second factor polarises information on income, from low income (bottom) to high income (top). Figure 2 shows the distribution of the regions relative to the differences highlighted in Fig. 1. Therefore, from left to right there are regions more [...]
In the clustering process applied to such results, each factor represents only a part of the overall set of information and different results can be obtained, according to the number of factors considered. The most appropriate number of factors can be selected by observing the boxplots of the coordinates of regions on each factor. Figure 3 presents the regions' coordinates on the ten factors: the boxplots show different projections of the cloud of points and highlight outliers.
In particular, the 5th factor singles out only the difference between one region (in this case, the Brussels region, BE01) and all the others. The same holds true for the 10th factor (in this case, the Luxembourg region, LU00). When five factors are considered, one cluster contains only this outlier and, as the number of factors under analysis increases, other outliers emerge as single-region clusters. Therefore, in order to avoid the influence of these outlier regions on the clustering process without excluding them from the analysis, we carry out a cluster analysis that considers, for the aggregation criterion, only the coordinates on the first four factors. By analysing the resulting dendrogram (Fig. 4), nine groups of regions have been selected. According to the Calinski and Harabasz index, the optimal number of clusters is five, but in order to single out significant aggregations of regions in terms of dimensions that are relevant for our analysis we adopted a greater number of clusters. The choice of five clusters, although optimal from a statistical point of view, leads to an aggregation that is excessively broad and not relevant for the economic analysis. For example, with the 5-cluster classification we obtain a first cluster that represents 46% of the information and groups 45% of the regions: with regard to its characteristic features, this cluster has the same RIS3 priorities (Manufacturing, Agro-food and Sustainable Energy) associated with very heterogeneous socio-economic conditions. Therefore, the choice of a greater number of clusters aims at obtaining groups with more homogeneous socio-economic characteristics for the various priorities. We have adopted a classification in nine clusters, which is detailed below and summarised in the table embedded in Fig. 7. Figures 5 and 6 show the distribution of regions and groups on the f1f2 plane and the f3f4 plane, respectively.
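To make the role of the Calinski and Harabasz index concrete, the sketch below computes it for successive cuts of a Ward dendrogram. It is an illustration only: the points are synthetic (five hypothetical group centres with Gaussian noise), the function name `calinski_harabasz` is ours, and the index is written out in its usual form, the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, scaled by (n-k)/(k-1)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n = points.shape[0]
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = points.mean(axis=0)
    between = 0.0
    within = 0.0
    for cluster_id in cluster_ids:
        members = points[labels == cluster_id]
        centre = members.mean(axis=0)
        between += len(members) * np.sum((centre - overall_mean) ** 2)
        within += np.sum((members - centre) ** 2)
    return (between / (k - 1)) / (within / (n - k))


# Synthetic factor coordinates (NOT the paper's data): 5 centres in 4 dimensions.
rng = np.random.default_rng(1)
centres = 10.0 * np.vstack([np.zeros(4), np.eye(4)])
points = np.vstack([centre + rng.normal(size=(40, 4)) for centre in centres])

tree = linkage(points, method="ward")
scores = {}
for k in range(2, 13):
    labels = fcluster(tree, t=k, criterion="maxclust")
    scores[k] = calinski_harabasz(points, labels)

best_k = max(scores, key=scores.get)   # statistically preferred number of clusters
```

As the text argues, the value of `best_k` is only a statistical indication; a finer cut of the same dendrogram can be preferred when the resulting groups are easier to interpret.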
For each of the nine clusters, Table 1 lists the characteristic categories, which are defined as those with a test-value greater than 2.1 (they are ranked in decreasing order of their test-value, column 3). The weight of those categories, i.e. the number of times the category occurs in the dataset, is shown in absolute and relative terms in columns 4 and 5, respectively. The ratio of each category in the cluster to all categories in the cluster (column 6) highlights the extent to which the category is characteristic.
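The text does not spell out how the test-value is computed. A common definition in the correspondence analysis literature compares the observed count of a category in a cluster with the count expected under random (hypergeometric) allocation, standardised by the corresponding standard deviation. The sketch below assumes that definition; the function name `v_test` and the example counts are hypothetical and are not taken from Table 1.

```python
import math


def v_test(n_kj, n_k, n_j, n):
    """Normal approximation of the test-value of category j in cluster k.

    n_kj: regions in cluster k that have category j
    n_k:  regions in cluster k
    n_j:  regions in the whole dataset that have category j
    n:    regions in the whole dataset
    Assumed definition: (observed - expected) / hypergeometric std. deviation.
    """
    expected = n_k * n_j / n
    variance = n_k * (n - n_k) / (n - 1) * (n_j / n) * (1 - n_j / n)
    return (n_kj - expected) / math.sqrt(variance)


# Hypothetical counts: a category present in 14 of 191 regions,
# 12 of which fall in a cluster of 31 regions.
value = v_test(n_kj=12, n_k=31, n_j=14, n=191)
is_characteristic = value > 2.1   # threshold used in the text
```

Under this definition a category that occurs in a cluster exactly as often as chance would predict has a test-value of zero, and large positive values flag over-represented categories.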
We observe that not all the codes are characteristic categories associated with the nine clusters: by selecting categories according to their test-value we are focusing only on those presenting a value that is significantly above the average occurrence among the regions in the cluster.
In general, with regard to the three sets of categories under analysis, Table 1 shows that, in seven out of nine cases, the clusters are characterised by a mix of socio-economic categories and classes of priorities. In the case of cluster #3, there are only socio-economic aspects as characteristic categories (it being the most barycentric cluster), while in cluster #7 there is only one priority as characteristic category: this happens because none of the other categories of the regions grouped in this cluster occurs, on average, significantly more often than in the whole dataset. The nine clusters are now described with regard to the selectivity/homogeneity of their characteristic categories. These two elements are of fundamental importance for understanding and interpreting each group. Selectivity represents the group's ability to represent a category: it indicates the percentage of the category's occurrences in the entire dataset that fall within the cluster. Homogeneity, on the other hand, represents the similarity in the group with respect to one category: it indicates the percentage of regions in the cluster that share that category.
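The two measures can be illustrated with a few lines of code. The sketch below is ours, with a hypothetical function name; the counts are not stated in the source but are chosen so that the output matches the percentages reported for the first characteristic category of cluster #1 (a category held by 14 regions overall, 12 of them in a cluster of 31 regions).

```python
def selectivity_and_homogeneity(membership, has_category):
    """Per-cluster selectivity and homogeneity (in %) for one category.

    membership:   cluster id of each region
    has_category: True/False per region, in the same order
    Assumes the category occurs at least once in the dataset.
    """
    total_with_category = sum(has_category)
    result = {}
    for cluster in sorted(set(membership)):
        size = sum(1 for m in membership if m == cluster)
        inside = sum(
            1 for m, h in zip(membership, has_category) if m == cluster and h
        )
        result[cluster] = {
            # share of the category's occurrences that fall in this cluster
            "selectivity": 100.0 * inside / total_with_category,
            # share of the cluster's regions that have the category
            "homogeneity": 100.0 * inside / size,
        }
    return result


# Hypothetical layout: cluster 1 has 31 regions, all other regions are pooled as 0.
membership = [1] * 31 + [0] * 160
has_category = [True] * 12 + [False] * 19 + [True] * 2 + [False] * 158

measures = selectivity_and_homogeneity(membership, has_category)
print(round(measures[1]["selectivity"], 2))   # 85.71
print(round(measures[1]["homogeneity"], 2))   # 38.71
```

The example shows why the two measures can diverge: a category can be almost exclusive to a cluster (high selectivity) while still being held by a minority of its regions (low homogeneity).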
Cluster #1, encompassing 31 regions, is characterised by the socio-economic class High-income; low-population density; tourism (with 85.71% of occurrences in the cluster, which are associated with 38.71% of regions) and the description priority Sustainable Energy (77.42% of regions). The first characteristic category represents an element of selectivity of the category in the cluster, while the second one represents an element of homogeneity within the group.
Cluster #2 comprises 31 regions and is characterised by two distinct socio-economic classes (both characterised by very low income) and by descriptions of priorities associated with Manufacturing (74.2% of regions), Agrofood (77.4% of regions) and Fashion (present at 55.6% in the cluster). The socio-economic classes represent the selectivity features, while Manufacturing and Agrofood represent the homogeneity character of this group.
Cluster #3 encompasses 25 regions and the only distinctive elements of this group are socioeconomic conditions: Medium-income; employment & population imbalances; manufacturing: textile, basic metal, transport; very poorly educated (present at 50% in the cluster and referred to 24% of regions) and Urban regions; high-income; poorer employment conditions; touristic (present at 55.6% in the cluster and referred to 20% of regions): both characters show critical socioeconomic conditions.
Cluster #4 (with 14 regions) is characterised by regions with a low and very low income (respectively 83.3% and 61.5% of occurrences in the cluster, respectively referred to 35.7% and 57.1% of regions). The priorities' descriptions refer to Tourism (100% of regions), Creative industry (92.9% of regions) and Agrofood (85.79% of regions). Also in this case, the socio-economic conditions represent the selectivity features, while priorities' descriptions are the homogeneity character within the group.
Cluster #5 (with 14 regions) is characterised by the socio-economic class High-income; sparsely populated; public sector; highly educated (85.7% of regions) and by priorities' descriptions referring to Social innovation & education (78.6% of regions), Growth & Welfare (64.3% of regions) and Bio economy (71.4% of regions). In this case all the characteristic categories represent the homogeneity character linking the regions in this cluster.
Cluster #6 (with just 5 regions) differs from cluster #5 because of its socio-economic features, characterised by Very-high income; large urban regions; high-employment; highly educated (with 60% of occurrences in the cluster associated with three regions).
Cluster #7 encompasses 18 regions with just one characteristic category: i.e. the marine and maritime priority (55.6% of the regions); other categories associated to regions in the cluster are not significantly higher than the average of the whole dataset.
Cluster #8 comprises 28 regions and is characterised by the socio-economic class High-income; high-employment; low-manufacturing; services & public sector (with 70.83% of occurrences in the cluster, referring to 60.7% of regions) and by the priority descriptions Optics (with 100% of occurrences in the cluster, referred to 17.9% of regions), Transport & Logistics (60.7% of regions) and Energy Production (46.4% of regions). Optics represents a specific element, while the most homogeneous elements are the socioeconomic class and the Transport & Logistics description.
Cluster #9 is composed of 25 regions and is characterised by two different socio-economic classes: Very-high income; manufacturing; population imbalances (with 85.71% of occurrences in the cluster, referred to 48% of regions) and Low-income; high-density; high unemployment; agriculture; food & drinks; very poorly educated (62.5% of occurrences in the cluster, referred to 20% of regions). What unites regions with such different socioeconomic conditions is the set of characteristic categories of description: Healthy Food (present at 76.5% in the cluster and referred to 52% of regions); ICT & Tourism (present at 51.8% in the cluster and referred to 56% of regions); Life Science (68% of regions); Aeronautics, Aerospace & Automotive industry (36% of regions). Cluster #9 has as selectivity elements both socio-economic classes and the Healthy Food priority, while there are no very high values of homogeneity (Life Science, referred to 68% of regions, is the highest value).
Figure 7 maps the nine clusters, with the table in the right panel summarising the homogeneity and selectivity elements characterising the nine clusters under analysis. It is clear from the map that the different clusters do not just capture geographical proximity, but rather the similarity in status (socio-economic and demographic elements) and areas of specialization.

Discussion and Conclusions
In this paper, we aim at interpreting the overall framework of interconnected structural socioeconomic and demographic features and policy programmes on smart specialisation strategy in the EU. By identifying clusters of EU regions, we provide policy makers with a more systematic and informed tool they can use to learn from other regions, when they focus on the projects implemented within the various priorities.
[Figure caption: Maps of clusters of regions, by socioeconomic features and RIS3s' priorities: summary of selectivity and homogeneity characteristic categories]
Clustering of multidimensional categorisation is a multifaceted issue that must be addressed with the awareness that the various methods of clustering are also affected by the data under analysis, for instance by the overall number of observations, the number and type of variables (categorical, non-categorical and mixed variables, multiple vs. single categorisations), the distribution of observations along the various dimensions under analysis, and missing data. In the analysis presented in this paper, we merge two datasets on EU regions. They summarise information on two interrelated sets of issues: the structural features of regions and the RIS3 priorities defined by their policy programmes, respectively. Each dataset is built by using clustering techniques applied to different types of variables: numerical, for data on the 19 socioeconomic and demographic features considered by Pagliacci et al. (2019), and texts, for RIS3 priorities categorised in the automatic text analysis elaborated by Pavone et al. (2019). In each step of the clustering, transparent (i.e. accountable) decisions have been taken: from the general one of defining the number of clusters, to the selection of the principal components, the identification of the socioeconomic categories, and the choice of the number of factors to be used in clustering the groups of co-occurrences in the multidimensional space of priorities' descriptions and priorities' codes. While the process of progressive reduction of multiple categories produces some loss of information, it makes it possible to single out common or singular features that otherwise would not be observable, and to use them for policy analysis. The value added by the multidimensional analysis of both socioeconomic dimensions and priorities of smart specialisation lies precisely in that.
The cluster analysis performed on the results of the correspondence analysis provides complementary indications for the comparative analysis of the EU regions. In the grouping of regions obtained, it is possible to highlight the elements of homogeneity and the elements of selectivity within each of the nine groups: the former are the characteristics common to most of the regions of a group, while the latter are those occurring mainly within a group.
Policy implications emerging from the analysis presented in this paper may be considered at different levels. In particular, macro-regions that aim at designing more focused strategies may leverage complementarities and synergies across the regions that each of them encompasses: these clearly emerge from the homogeneous features and selectivity characters of priorities identified in the cluster analysis.