A Non-intrusive Movie Recommendation System

Several recommendation systems have been developed to support the user in choosing an interesting movie from multimedia repositories. The widely utilized collaborative-filtering systems focus on the analysis of user profiles or user ratings of the items. However, the performance of these systems degrades in the start-up phase and, because of privacy issues, when a user hides most of his personal data. On the other hand, content-based recommendation systems compare movie features to suggest similar multimedia contents; these systems are based on less invasive observations, but they have difficulty supplying tailored suggestions. In this paper, we propose a plot-based recommendation system, which is based upon an evaluation of the similarity between the plot of a video that was watched by the user and a large number of plots stored in a movie database. Since it is independent of the number of user ratings, it is able to propose famous and beloved movies as well as old or little-known movies/programs that are still strongly related to the content of the video the user has watched. We experimented with different methodologies to compare natural language descriptions of movies (plots) and found Latent Semantic Analysis (LSA) to be the superior one in supporting the selection of similar plots. In order to increase the efficiency of LSA, different models have been tested and, in the end, a recommendation system that is able to compare about two hundred thousand movie plots in less than a minute has been developed.


Introduction
Nowadays movie repositories offer datasets of over 100000 items, and their size increases every year by around 5000 items due to newly released movies (according to Screen Digest). Searching for a movie of interest in such a large amount of data is a time-consuming task. Information filtering systems can be a powerful tool in giving assistance to the user. Thus, particularly in a multimedia environment, they are implemented to minimize user effort, to increase user satisfaction and to realize a more pleasant experience. For this purpose, recommendation methodologies have been integrated into customized media content distribution services. At the state of the art, the main methodologies analyze user profiles and user ratings of the data items to compute item similarity. Consequently, they encounter difficulties at start-up, as user preferences are not necessarily available to the system. Moreover, these systems are quite intrusive, as they need active feedback from users or their personal data. Content-based recommendation systems, instead, utilize movie features (such as title, director, year of production, ...) and, by combining similarity measurements, define how similar two movies are. While comparing movie features is quite easy, comparing plots is a challenging task; to our knowledge, no movie recommendation system has so far proposed an algorithm based on the analysis of plots. Moreover, the aim of this work was to offer recommendations that also include shows that are less popular or forgotten, for example because they are too old, but that can still be interesting for the user.
In this paper we propose a plot-based recommendation system which is based upon an evaluation of the similarity between the plot of the movie that was watched by the user and a large number of movie plots stored in a movie database. We exploit state-of-the-art text similarity techniques in order to evaluate the similarity of natural language features such as plots. Then, we combine the similarity of plots with the similarity of non-verbose features, such as release year, crew etc., which are compared by exact matching.
In order to compare natural language features, a vector space model was developed following the approach used in an Information Retrieval environment [8]. Within the vector space model, each text is represented as a vector of keywords with associated weights. These weights depend on the distribution of the keywords in the given training set of plots that are stored in the database. In order to calculate these weights we exploit and compare different techniques: simple weighting techniques and semantic weighting techniques.
Weighting techniques such as Term Frequency-Inverse Document Frequency (tf-idf) and Log Entropy (log) assign a weight to each keyword that has been extracted from a text using lemmatizers and taggers. The vectors that are generated are large, as each of them has as many elements as there are keywords extracted from the whole corpus of texts, and very sparse, as every keyword that does not appear in a text is associated with a zero-value element in the corresponding vector.
To generate small and non-sparse vectors (about 500 elements), the output of the cited weighting techniques is refined by applying LSA [14]. LSA makes it possible to assign non-zero values to keywords that do not appear in a text but that are still related to its contents. The strong correlation between LSA weights and keyword co-occurrences makes it possible to partially deal with synonymy (different keywords having similar meanings) and polysemy (keywords assuming different meanings).
The system has been developed in collaboration between the database group of the University of Modena and Reggio Emilia and vfree.tv, a young and innovative German company focused on creating new ways of distributing television content and generating an unprecedented watching experience for the user. Building upon several decades of experience of the founders in the fields of fixed and mobile telecommunication, video processing and distribution, the company is well equipped with a wealth of capabilities; indeed, in 2010 it won one of the five main awards of the German Federal Ministry of Economics and Technology for innovative new business ideas in the area of multimedia. Its products and services introduce a unique and disruptive technology for individual distribution of individual content. The user receives at any time the content which most likely satisfies his current needs and wants. vfree.tv works with service providers and content providers throughout Europe.
The paper is structured as follows. Section 2 introduces the structure of the local movie database that has been created. Section 3 describes the vector space model that has been used to compute similarity measures of natural language descriptions such as plots. Section 4 compares the results obtained by tf-idf and LSA in selecting the 10 most similar movies to a given one. Moreover, it examines the performance of approximated LSA models. Section 5 presents some related work, whereas conclusions and possible future developments are outlined in Section 6.

The Movie Database
With the aim of generating an extensive and reliable representation of multimedia, video metadata can be imported from external repositories and stored within a local database. Storing the data locally helps to improve the efficiency of the processes that lead to the recommendation results. The local database should make it easy to enter data from different sources and to run queries on a huge amount of data (thousands of movies) in a short time with good results. For these reasons we chose MongoDB.
MongoDB is a non-relational, schema-free database; this feature makes it possible to create databases with a flexible and simple structure without degrading query time. Data is organized into collections, which correspond to tables in relational databases, and documents, the equivalent of tuples. MongoDB documents, like tuples, consist of a set of attributes, but since the database is schema-free the structure of each document is independent of, and potentially different from, the structure of all the other documents in the same collection. It is therefore possible to change the structure of a document just by modifying its attributes. MongoDB supports a query language that can express most of the queries that can be expressed in SQL. Furthermore, tests demonstrated that MongoDB shows better query performance for large collections [28].
In order to structure the local database, an analysis of the major movie repositories has been conducted. We took into consideration the Internet Movie Database (IMDb), DBpedia and the Open Movie Database (TMDb). Information about movies can be classified either as information that is related to multimedia or as information about people who participated in the production of multimedia. This led to the creation of three main databases (as shown in Figure 1). The Movie database comprises data that is solely related to multimedia, such as title, plot and release year. The Person database comprises data related to people (e.g. full name, biography and date of birth). The Castncrew database connects documents of the other two databases. It comprises data about the roles that people cover in the production of multimedia, such as actor, director or producer. This configuration makes it easier to integrate different or new datasets into the system when external sources are shut down or experience a change in their copyright. Information extracted from different online resources can be stored separately as collections of the databases. For example, as both IMDb and DBpedia supply information about movies and persons, they might potentially supply collections for all three databases. However, if we have a repository that manages information about actors only, we can store this information by adding a new collection to the Person database. As MongoDB does not require a fixed schema, different collections in the databases may store different attributes. On the other hand, if we want to aggregate information about movies from different collections, we connect information from the databases based on the names of the actors, the title of a movie, the year of production and other features if available.
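The three-database layout and the aggregation step can be illustrated with plain Python dictionaries standing in for MongoDB documents. This is a minimal sketch under stated assumptions: every field name, identifier and the "first source wins" merge rule below is hypothetical and is not taken from the actual schema of the system.

```python
from collections import defaultdict

# Illustrative documents; every field name and _id here is an assumption.
movies = [
    {"_id": "m1", "title": "The Matrix", "year": 1999, "source": "imdb"},
    # Schema-free: this document carries an attribute the other one lacks.
    {"_id": "m2", "title": "The Matrix", "year": 1999, "source": "dbpedia",
     "language": "en"},
]
persons = [{"_id": "p1", "full_name": "Keanu Reeves"}]
castncrew = [{"movie_id": "m1", "person_id": "p1", "role": "actor"}]


def merge_by_title_year(docs):
    """Aggregate movie documents from different collections by (title, year)."""
    merged = defaultdict(dict)
    for doc in docs:
        key = (doc["title"].strip().lower(), doc["year"])
        for field, value in doc.items():
            if field not in ("_id", "source"):
                merged[key].setdefault(field, value)  # first source wins
        merged[key].setdefault("sources", []).append(doc["source"])
    return dict(merged)


def roles_for(movie_id):
    """Follow Castncrew links from a movie to (person name, role) pairs."""
    by_id = {p["_id"]: p for p in persons}
    return [(by_id[link["person_id"]]["full_name"], link["role"])
            for link in castncrew if link["movie_id"] == movie_id]
```

Calling `merge_by_title_year(movies)` yields a single entry keyed by `("the matrix", 1999)` that carries attributes from both sources, while `roles_for("m1")` resolves the cast link through the Castncrew documents.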

The Vector Space Model
The similarity of two media items depends on the likeness of their features. Hence, for each feature, a specific metric is defined in order to compute a similarity score. Most of the metrics that are adopted are calculated through only a few simple operations. Things are more difficult for plots and, in general, for natural language descriptions. Our approach, which follows the one developed in an Information Retrieval environment [8], is based on a vector space model which is used to compare any pair of plots. Within this model, each text is represented as a vector of keywords with associated weights. These weights depend on the distribution of the keywords in the given training set of plots that are stored in the database. The vectors that represent plots are joined to form a matrix where each row corresponds to a plot and each column corresponds to a keyword extracted from the training set descriptions. Thus, each cell of the matrix represents the weight of a specific keyword in a specific plot. The weights in the matrix are determined in three steps. First, they are defined as the occurrences of keywords in the descriptions; second, the weights are optionally modified by using the tf-idf or log technique (but other suitable weighting techniques could be used as well); third, the matrix is transformed by performing Latent Semantic Analysis (LSA) [12]. LSA is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. Several experiments have demonstrated that LSA has good accuracy in simulating human judgments and behaviors [15].
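The first of the three steps, building the plot-keyword matrix from raw occurrences, can be sketched as follows. The function name and the input shape (one keyword-to-count dictionary per plot) are assumptions made for illustration; a dense list-of-lists is used only for readability, and a corpus of realistic size would need a sparse representation.

```python
def build_count_matrix(plot_keywords):
    """Build a plot-by-keyword matrix of raw occurrences.

    plot_keywords: list of dicts, one per plot, mapping keyword -> count.
    Returns the sorted vocabulary (column order) and the matrix (rows = plots).
    """
    vocabulary = sorted({k for counts in plot_keywords for k in counts})
    column = {keyword: j for j, keyword in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in plot_keywords]
    for i, counts in enumerate(plot_keywords):
        for keyword, n in counts.items():
            matrix[i][column[keyword]] = n
    return vocabulary, matrix


vocabulary, matrix = build_count_matrix([
    {"dream": 2, "thief": 1},
    {"dream": 1, "machine": 3},
])
```

Keywords missing from a plot stay at zero, which is what makes the rows sparse before the weighting and LSA steps are applied.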

Plot-based Similarity
Texts can be compared by using vector operations, such as the cosine similarity that is used as a distance metric in order to compute the similarity score between two texts, based on a vector representation.
Definition - cosine similarity. Given two vectors $v_i$ and $v_j$ that represent two different descriptions, the cosine of the angle between them can be calculated as follows:

$$\cos(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}$$

The value of the cosine is a real number between −1 and 1. If the value is 1 the two compared vectors are equivalent, whereas if the value is −1 the two vectors are opposite.
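The definition above translates directly into a few lines of Python. This is a minimal sketch; returning 0.0 for an all-zero vector is an assumption of this example, since the definition leaves that case undefined.

```python
import math


def cosine_similarity(v_i, v_j):
    """Cosine of the angle between two equally long weight vectors."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm_i = math.sqrt(sum(a * a for a in v_i))
    norm_j = math.sqrt(sum(b * b for b in v_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumption: an empty description is treated as unrelated
    return dot / (norm_i * norm_j)
```

Because the score depends only on the direction of the vectors, a long plot and a short plot that use the same keywords in the same proportions still obtain a score close to 1.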
Thus, to compare descriptions by using vector operations, plots need to be converted into vectors of keywords. As a preliminary operation, before the first step of our method, keyword extraction and discrimination are performed. Keywords correspond to terms within the text that are representative of the text itself and that at the same time are discriminating. Less discriminative words, the so-called stop words, are discarded, and terms are preprocessed and substituted in the vector by their lemmas; this operation is called lemmatization. The goal of lemmatization is to reduce inflectional forms of a word to a common base form (e.g. to transform "running" and "runs" into the corresponding base form "run"). Lemmatization and keyword extraction are performed by using TreeTagger [25]. TreeTagger is a parser that has been developed at the Institute for Computational Linguistics of the University of Stuttgart. This tool can annotate text with part-of-speech and lemma information in both English and German.
Keywords that have been extracted from descriptions, as well as their local frequency (occurrences in the description), are stored as features of the media item in the local database. This happens for two main reasons. First, compared to accessing database values, the keyword extraction process is relatively slow. As the weighting techniques define the values of the weights on the basis of the global distribution of the keywords over the whole corpus of descriptions, it is necessary to generate all the vectors before applying the tf-idf/log technique. Second, while tf-idf/log weights change when new multimedia descriptions are entered into the system, the local keyword occurrences do not.
Two different techniques have been used for computing keyword weights. In the following we briefly describe the tf-idf technique, while we skip the definition of log. For further details we refer the reader to [24] for an explanation of tf-idf and to [9] for log. In both techniques, a weight that represents the relevance of a specific keyword to a specific text depends on the local distribution of the keyword within the text as well as on the global distribution of the keyword in the whole corpus of descriptions.
Definition - tf-idf weight. Given a keyword $k$ that has been extracted from a text $d$, the tf-idf weight is calculated as follows:

$$\text{tf-idf}(k, d) = tf(k, d) \cdot idf(k)$$

where $tf(k, d)$ corresponds to the frequency of the keyword $k$ in the vector that represents $d$, and $idf(k)$ depends on the number $N$ of vector descriptions in the corpus and on the number $df$ of vector descriptions in which the keyword $k$ appears:

$$idf(k) = \log \frac{N}{df}$$

tf-idf weights have a value between 0 and 1. Keywords with a document frequency equal to one are discarded.
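The weighting step can be sketched in Python as follows. Two points are assumptions of this example rather than statements from the text: the idf is taken in its common form log(N/df), and each row is scaled to unit length, which is one way to keep every weight between 0 and 1. The filter that discards keywords with a document frequency of one is not applied here.

```python
import math


def tf_idf_matrix(counts):
    """Turn raw occurrences (rows = plots, columns = keywords) into weights."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: in how many plots each keyword appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    weighted = []
    for row in counts:
        w = [tf * idf[j] for j, tf in enumerate(row)]
        norm = math.sqrt(sum(x * x for x in w))
        # Assumption: unit-length rows keep the weights within [0, 1].
        weighted.append([x / norm for x in w] if norm else w)
    return weighted


weights = tf_idf_matrix([[2, 0, 1], [1, 3, 0]])
```

In this toy corpus the first keyword appears in both plots, so its idf is log(2/2) = 0 and it receives no weight in either row; only the keywords that distinguish the two plots remain.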
The keyword weights are then refined by the use of LSA.
Let us introduce this technique by an example. Suppose we have the following sentences: "There is a mouse below the new Ferrari that is parked in front of the market" and "With one mouse click you can view all available cars and thus renders going to the shop unnecessary". Now, let us compare the corresponding vectors that have been generated by tf-idf and LSA on the above sentences. The analysis of the values in Table 1 shows that tf-idf is able to recognize neither synonyms (e.g. market and shop) nor hyponyms (e.g. Ferrari and cars). In contrast, the use of LSA emphasizes the underlying semantics: in the first sentence a value not equal to zero is assigned to the term car even if this keyword does not appear in vector $v_1$ (the vector that corresponds to the first sentence). This is due to the co-occurrences of the term Ferrari with other terms that also frequently occur in combination with the term car in other vectors of the training set matrix on which LSA has been applied. There is a strong correlation between the values of the matrix and second-order co-occurrences.
LSA consists of a Singular Value Decomposition (SVD) of the vector matrix $T$ (training set matrix) followed by a rank lowering. Each row and column of the resulting matrix $T$ can be represented as a linear combination of the eigenvectors of the matrix $T^{T} T$:

$$v = \sum_{i} coefficient_i \cdot eigenvector_i$$

where the coefficients $coefficient_i$ of the above formula represent how strong the relationship between a keyword (or a description) and a topic $eigenvector_i$ is. The eigenvectors define the so-called topic space; thus, the coefficients related to a vector $v$ represent a topic vector. Each eigenvector is referred to in the following as a topic.
It is the topic representation of the keywords which is used as a natural language model in order to compare texts. Topic vectors are useful for three main reasons: (1) as the number of topics, which is equal to the number of non-zero eigenvectors, is usually significantly lower than the number of keywords, the topic representation of the descriptions is more compact; (2) the topic representation of the keywords makes it possible to add movies that have been released after the definition of the matrix T without recomputing the matrix T; (3) to find similar movies starting from a given one, we just need to compute the topic vector for the plot of the movie and then compare this vector with the ones we have stored in the matrix T, selecting the most relevant ones.
Note that adopting this model we are able to represent each plot with 500 topics, instead of 220000 keywords.
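Reasons (2) and (3) above can be made concrete with the standard folding-in construction for a truncated SVD. The following is a sketch under stated assumptions, not necessarily the exact computation used in the system: $U$, $\Sigma$ and $V$ denote the SVD factors of $T$, the subscript $k$ denotes truncation to the $k$ largest singular values (500 in the figure quoted above), and $d$ is the keyword-weight row vector of a plot that was not in the training set. None of these symbols appears in the original text.

```latex
\begin{align}
T &= U \Sigma V^{\top}, & T_k &= U_k \Sigma_k V_k^{\top}, \\
T V_k &= U_k \Sigma_k, & \hat{d} &= d \, V_k .
\end{align}
```

The columns of $V_k$ are the $k$ leading eigenvectors of $T^{T} T$, that is, the topics. The rows of $U_k \Sigma_k$ are the topic vectors of the stored plots, and $\hat{d}$ is the topic vector of the new plot, obtained with a single vector-matrix product and without recomputing the decomposition. Ranking then reduces to computing the cosine similarity between $\hat{d}$ and each stored row.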

Feature-based Similarity
The similarity of plots can also be combined with the similarity of other features such as directors, genre, producers, release year, cast etc. The similarity of two media items ($m_1$ and $m_2$) is defined as a weighted linear combination of the similarity of the feature values that describe the two items:

$$similarity(m_1, m_2) = \sum_{i=1}^{FN} weight_i \cdot similarity_i(feature_{1,i}, feature_{2,i})$$

where $FN$ is the number of features that have been chosen to describe a media item when computing the media similarity, $similarity_i$ is the metric used to compare the i-th feature, $weight_i$ is the weight of that feature in the combination, and $feature_{j,k}$ is the value assumed by feature k in the j-th media item. The result of each metric is normalized to obtain a value between zero and one, where one denotes equivalence of the values that have been compared and zero means maximum dissimilarity of the values [7].
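A weighted combination of per-feature metrics could look like the sketch below. The metric choices (exact match for single values, Jaccard overlap for sets), the feature names, the weights and the division by the total weight are all illustrative assumptions; the text only states that each metric is normalized to a value between zero and one.

```python
def exact_match(a, b):
    """1.0 when the two feature values are identical, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def jaccard(a, b):
    """Overlap of two sets of values (e.g. cast members), between 0 and 1."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def media_similarity(m1, m2, metrics, weights):
    """Weighted linear combination of per-feature similarity scores."""
    total_weight = sum(weights[name] for name in metrics)
    combined = sum(weights[name] * metric(m1[name], m2[name])
                   for name, metric in metrics.items())
    return combined / total_weight  # assumption: rescale to [0, 1]


metrics = {"director": exact_match, "cast": jaccard, "genre": exact_match}
weights = {"director": 1.0, "cast": 2.0, "genre": 1.0}
item_a = {"director": "D1", "cast": {"a", "b", "c"}, "genre": "sci-fi"}
item_b = {"director": "D1", "cast": {"b", "c", "d"}, "genre": "drama"}
score = media_similarity(item_a, item_b, metrics, weights)
```

For these two items the director matches (1.0), half of the combined cast overlaps (0.5) and the genres differ (0.0), so the score is (1·1 + 2·0.5 + 1·0) / 4 = 0.5. A plot similarity score can be added as one more entry in `metrics` with its own weight.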

Tests
In order to evaluate the developed recommendation system, we loaded data from IMDb into the local database for test purposes. Most of the existing movie recommendation systems are based on collaborative filtering and build the movie proposal set by analysing user ratings. As famous programs are rated more often, the proposed movies are usually the most famous ones. In contrast, we are able to propose shows that are less popular or forgotten but that can still be interesting for the user. As this kind of result is new and there is no single measure to compute the similarity of multimedia plots, the evaluation of the results lacks an absolute benchmark. Nevertheless, the results that we obtained by using different techniques are compared and briefly discussed.
Results obtained by applying the tf-idf and log techniques on the local database show slight differences, and the quality does not seem to be significantly different. Instead, a noticeable quality improvement can be achieved by applying the LSA technique. Plots that are selected as similar by the tf-idf or log techniques usually contain terms and names that appear in the target plot, but they do not necessarily refer to similar topics. LSA makes it possible to select plots that are better related to the themes of the target plot.
We report some manual tests that have been performed in order to evaluate and compare the weighting techniques. Table 2 shows that the quality of the results achieved by the tf-idf and log techniques does not seem to be significantly different. In both ranked lists, eight plots refer to the target plot whereas the other two plots seem to be related to similar content (godfathers). It is, therefore, not reasonable to suggest using one technique rather than the other. Table 3 shows that the outcome of Latent Semantic Analysis is superior to other techniques such as tf-idf. Here, all plots that have been selected by using the LSA technique refer to the theme of dreams and the subconscious, just like the target plot. In contrast, the results originating from tf-idf seem to be connected to the target plot mostly by the surname of the main character of the movie "Inception", which is Cobb.
Finally, Table 4 compares LSA results for the target movie series "Smallville"; the evaluation took into consideration LSA over tf-idf and LSA over log techniques. All the plots that have been selected by both techniques refer to the topic of super-heroes and Superman.
The LSA process is based on the SVD of the plot-keyword matrix, having a complexity of O(m × n), where m is the number of multimedia (rows of the matrix) and n is the number of keywords (columns) and m ≥ n. There are about 200000 multimedia for which a plot value is available in the database after the IMDb data import and almost 220000 different keywords that are extracted from the plots. Thus, the time cost for the decomposition of the matrix is on the order of $3 \cdot 10^{15}$. Furthermore, the decomposition requires random access to the matrix [5], which implies an intensive usage of the central memory. In order to efficiently compute the decomposition of the matrix and to avoid saturating the central memory, we utilized the Gensim framework.
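A back-of-the-envelope calculation shows why the full matrix cannot simply be held in central memory. The sketch below assumes a dense layout with 8-byte floating-point cells, which is an assumption of this example; the row and column counts are the approximate figures quoted above.

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_cell=8):
    """Memory needed to hold a dense matrix, in bytes."""
    return n_rows * n_cols * bytes_per_cell


plots, keywords = 200_000, 220_000  # approximate figures quoted in the text
cells = plots * keywords
gigabytes = dense_matrix_bytes(plots, keywords) / 1e9
print(f"{cells:.1e} cells, about {gigabytes:.0f} GB as dense float64")
```

Under these assumptions the matrix has 4.4 · 10^10 cells and would need about 352 GB, which motivates sparse storage together with a decomposition that processes the corpus in chunks rather than loading the whole matrix at once.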
The computational costs to create the LSA model on the local Linux system (4 AMD Phenom(tm) II X4 695 3.4 GHz processors, 3.6 GB RAM) are the following:
Given a target plot, all the other plots in the database can be ranked according to their similarity in about 42 seconds. To further decrease the time needed for similarity computation, three LSA models have been built using different assumptions. These tests have been conducted on the data extracted from DBpedia. Table 6 summarises the time performance of our recommendation system obtained using the different models. The complete model includes all the movies having a plot (78602 movies) and all the keywords appearing in these plots (133369); in this model the matrix rank is reduced by LSA to 500 (LSA topics). While generating the approximate model, short plots (fewer than 20 keywords) and low-frequency keywords (appearing in fewer than 10 plots) have been ignored. Tf-idf and log weights having a value below 0.09 and LSA weights having a value below 0.001 have been set to 0. Here, the matrix rank has been reduced to 350. The fast model is a further approximation of the approximate model in which the tf-idf and log weights having a value below 0.14 have been set to 0 and the matrix rank has been reduced to 200. With a different parametrization, it is thus possible to significantly cut down the time cost of computing similarity. However, the more the model is approximated, the lower the accuracy becomes.
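The pruning rules of the approximate model can be sketched as two small filters. The default thresholds are the ones quoted above; the function names, the interpretation of "keywords" as distinct keywords per plot, and the choice to count document frequency only over the plots that survive the length filter are assumptions of this example.

```python
from collections import Counter


def prune_counts(plot_keywords, min_keywords=20, min_plots=10):
    """Drop short plots, then drop keywords that appear in too few plots.

    plot_keywords: list of dicts, one per plot, mapping keyword -> count.
    """
    kept = [c for c in plot_keywords if len(c) >= min_keywords]
    df = Counter(k for counts in kept for k in counts)  # document frequency
    return [{k: n for k, n in counts.items() if df[k] >= min_plots}
            for counts in kept]


def threshold(weights, cutoff=0.09):
    """Set tf-idf or log weights below the cutoff to zero."""
    return [w if w >= cutoff else 0.0 for w in weights]


toy_plots = [{"a": 1, "b": 2, "c": 1}, {"a": 1, "b": 1, "d": 1}, {"a": 3}]
pruned = prune_counts(toy_plots, min_keywords=2, min_plots=2)
```

With the toy thresholds used here, the one-keyword plot is removed and the keywords that occur in a single remaining plot are dropped. Raising the cutoff from 0.09 to 0.14 in `threshold` corresponds to the step from the approximate to the fast model, trading accuracy for sparser vectors.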
As described in Section 3, the similarity of plots can be combined with the similarity of other features. We performed an experiment to compare the IMDb recommendation list for the movie "The Matrix" with the recommendations that have been generated by our system (we used the complete LSA model). Table 7 shows both the proposals that have been generated considering the plot only and those generated considering the plot together with other features. As can be observed, the IMDb recommendation includes only movies that have been rated by a high number of users (more than 28000), while in the plot-based and in the feature-based recommendations even movies that are not famous have been proposed to the user. IMDb is not able to suggest movies similar to a selected plot. As a matter of fact, except for the movies of the Matrix trilogy, the IMDb recommendations are not related to the topic of "The Matrix", but are rather the most popular science fiction movies. In contrast to the IMDb suggestions, both our algorithms are able to list a set of movies that are all related to at least one of the topics, such as intelligent machines, computer hackers, computer viruses etc. An interesting observation is that, since the feature-based algorithm combines plots and features to select the recommendations, it is able to select the Matrix trilogy (which are of course the most related items) as well as a list of movies strongly related to the topic of "The Matrix".

Related Work
Information filtering systems can be a powerful tool in giving assistance to the user with the aim of delivering a narrow set of items which might be of interest. Recommendation algorithms are usually classified into content-based, collaborative filtering and hybrid recommendation systems [1]. Collaborative filtering systems are widely utilized in industry, for example by Amazon [18], MovieLens [19] and Netflix [4], and the recommendation is computed by analysing user profiles and user ratings of the items. When user preferences are not available, as in the start-up phase, or not accessible, due to privacy issues, it might be necessary to develop a content-based recommendation algorithm, such as the one proposed in [16]. Collaborative filtering and content-based approaches are combined in hybrid systems, as in [2,6]. Content-based recommendation systems rely on item descriptions that usually consist of punctual data. In [20], information is instead extracted from text to perform a categorization that supports rating-based book recommendation. Here we propose, instead, a recommendation approach that is based on natural language data, such as movie plots. Jinni is a movie recommendation system that also analyses movie plots, but relies on user ratings, manual annotations and machine learning techniques.
Descriptions of multimedia contents can be extracted from suitable data sources. The work in [3] utilizes movie features that have been extracted from IMDb, and [7] shows how the similarity of the features can be combined to define the distance between two movies, but neither of these works involves the analysis of movie plots. Many efforts in other research areas, like schema matching and ontology matching, developed keyword similarity techniques exploiting lexical resources [27].
Evaluating the similarity among movies is closely related to the task of text similarity. Text similarity is essentially the problem of detecting and comparing the features of two texts. One of the earliest approaches to text similarity is the vector-space model [24] with a term frequency / inverse document frequency (tf/idf) weighting. This model, along with the more sophisticated LSA semantic alternative [14], has been found to work well for tasks such as information retrieval and text classification. LSA was shown to perform better than the simpler word and n-gram feature vectors in an interesting study [17] where several types of vector similarity metrics (e.g., binary vs. count vectors, Jaccard vs. cosine vs. overlap distance measure, etc.) have been evaluated and compared.
Newsweeder, a news recommendation system that relies on tf-idf, is described in [16]. Besides weighting systems, mathematical techniques such as LSA can be used to improve similarity results. Thanks to the analysis of word co-occurrences, LSA makes it possible to partially deal with the problem of polysemy and synonymy and to outperform weighting systems [17].
Due to the high computational cost of LSA, there has been much work in the area of approximate matrix factorization; these algorithms maintain the spirit of SVD but are much easier to compute [13]. For example, an effective distributed factorization algorithm based on stochastic gradient descent is shown in [11]. We opted for a scalable implementation of the process that does not require the term-document matrix to be stored in memory and is therefore independent of the corpus size [23].

Conclusions and Future Work
The paper presented a plot-based recommendation system. The system classifies two videos as being similar if their plots are alike. A local movie database with a flexible structure has been created to store a large amount of metadata related to multimedia content coming from different sources with heterogeneous schemata. Three techniques to compare plot similarity have been evaluated: tf-idf, log and LSA. From the results obtained, LSA turned out to be superior in supporting the selection of similar contents. Efficiency tests have been performed to speed up the process of LSA computation. The tests led to the development of a recommendation system able to compare the plot of a movie with a database of 200000 plots. The results are provided as a ranked list of similar movies in less than one minute. An innovative feature of the system is its independence from the movie ratings expressed by users; this allows the system to find strongly related movies that other recommendation systems, such as IMDb, do not consider.
Keyword extraction might benefit from the use of lexical databases such as WordNet [10], as they are particularly helpful in dealing with synonyms and polysemous terms. In WordNet, words (i.e. lemmas) are organized in groups of synonyms called synsets. Synsets are connected by semantic relationships such as hypernymy and hyponymy. Each keyword might be replaced by its meaning (synset) before the application of the weighting techniques. To understand which of the synsets better expresses the meaning of a keyword in a plot, we may adopt Word Sense Disambiguation techniques [21]. The semantic relationships between synsets can be used for enhancing the keyword meaning by adding all its hypernyms and hyponyms [22,26].
In Section 4, we performed some tests using IMDb data and other tests using DBpedia data; however, we have not yet tested the system with a large amount of data coming from different sources. As future work, we will evaluate other movie repositories, such as the Open Movie Database (TMDb) and Rotten Tomatoes.

Table 1. A comparison of the weight vector correspondences obtained by using tf-idf and LSA

Table 2. A comparison between the tf-idf and log weighting techniques on the movie "The Godfather"

Table 3. A comparison between the tf-idf technique application only and the further application of LSA on the movie "Inception"

Table 4. A comparison between the LSA over tf-idf and LSA over log weighting techniques on the movie "The Godfather"
The plots selected for Table 4 refer to the topic of super-heroes and Superman. Just like in Table 2, it is hardly possible to decide whether tf-idf or log yields results of better quality. The tf-idf technique might be preferred as, in contrast to the log technique, it does not require discarding terms that have a document frequency equal to one.

Table 6. Similarity time costs obtained using complete, approximate and fast models

Table 7. A comparison of the different recommendation lists obtained for the movie "The Matrix"