Topic-based recommender: how to discover users’ research interests

In this post I want to present a new kind of recommender recently introduced in Mendeley Suggest. This recommender tries to guess a user's specific topics of interest from what they have read, and provides recommendations for the three most important topics.

Introduction

Recommender systems are software components that try to predict items that a user would like. To address this task, recommender systems usually rely on machine learning algorithms to learn users' preferences from data. Most recommender algorithms are based on Collaborative Filtering (CF) methods that use historical data about interactions between users and items to predict future recommendations. One of the main advantages of collaborative filtering is that it is domain independent: CF algorithms can be applied to any domain, as long as data about past interactions is available. This data could be implicit, e.g. a user bought a product or read an article, or explicit, e.g. a user liked/disliked a video or rated a movie. In both cases, specific algorithms can exploit this data to provide recommendations.

CF algorithms usually provide a single list with the best items for a user. In this blogpost we are going to present the application of a common text mining technique called topic modelling to the CF domain. The advantage of this approach is that we can provide not just a single list of recommendations, but different lists based on specific themes or topics.

Mendeley Suggest

Mendeley Suggest is an article recommender for researchers. It exploits data about Mendeley users (what they add to their library, what they read, which discipline they are in, etc.) to provide recommendations about articles they may want to read.

Mendeley Suggest lists.

At the moment Mendeley Suggest provides different lists to address different information needs: "based on all the articles in your library" gives you recommendations based on your entire library, taking into account your recent and past interests; "popular" and "trending" in your discipline provide non-personalised recommendations, very useful for new users; "based on what you added/read last" is focused on your most recent action, giving you contextual recommendations. For a more detailed explanation of Mendeley Suggest, check out this blogpost.

All these lists should cover the different information needs that a researcher may have. In particular, "based on all the articles in your library" provides the best recommendations because it exploits the full library of a user. However, this list usually contains relevant articles about different topics, mixed together in a single ranking even though they reflect distinct topical interests of the user. For this reason we thought it would be clearer to separate these topics into different lists: this way, not only can we provide more recommendations for a specific topic, but we can also explain the recommendations with the topic itself.

Thematic recommendations

Providing recommendations in separate lists is not a new idea. Netflix has been offering genre-specific lists for a long time.

Netflix user interface with different genres.

In 2014 a blogpost from "The Atlantic" tried to reverse engineer the way Netflix creates micro-genres. The main finding was that Netflix is able to create thousands of micro-genres and automatically label them with descriptions that follow a fixed structure, like "Emotional Independent Sports Movies" or "Mind-bending Cult Horror Movies from the 1980s".

Another clear example of this kind of recommendation is provided by Amazon on their e-commerce website.

Amazon user interface with different categories.

If a user browses a few items, the system provides recommendations based on the items browsed, but it usually does not mix items from different categories in the same list.

Finding themes

Finding themes in a recommender system means finding characteristics of a group of items that are of interest to a group of users. For instance, in Netflix we can find a group of users interested in romantic movies from the '80s. In this case the characteristic is "romantic movies from the '80s", which belongs to a specific set of movies and is liked by a specific set of users. In the Amazon case, a characteristic could be a product category, e.g. shoes or cameras. If we start from the characteristic it is quite easy to define groups, but not every characteristic is useful. Furthermore, these characteristics could be hidden and not easily available. For this reason themes are usually derived from past interactions between users and items: if a group of users repeatedly interacts with a specific group of items, we can create a new theme and assign users and items to it (with a certain degree of confidence).

The methods that solve this problem in the recommender system area are usually called Matrix Factorization (MF) methods. This subclass of collaborative filtering methods tries to decompose the user-item matrix into two smaller matrices. A nice explanation of MF methods (and other methods for recommender systems) can be found here.

The user-item matrix on the left gets factorised into two smaller matrices.

In the example provided, the user-item matrix X represents users' interactions with items (1 means the user read the book). X gets factorised into two smaller matrices, U and V, such that their matrix product is an approximation of the original matrix. Each user and each item is mapped onto k dimensions (3 in the example). This means that we assume there are k hidden themes in the data. With fewer dimensions we get more general themes, while with more dimensions we get finer-grained themes (though with too many we will probably end up with noisy dimensions). Themes can also be obtained from a combination of dimensions. For instance, if I am interested in the dimensions "romantic movies" and "action movies", I could get recommendations for "romantic and action movies".
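As a concrete illustration of this idea (not the algorithm used in production), here is a sketch of non-negative matrix factorisation on a toy user-item matrix, using the standard multiplicative update rules; the matrix, k and the iteration count are made up for the example:

```python
import numpy as np

# Toy user-item matrix: 4 users x 5 items, 1 = the user read the item.
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

def factorise(X, k=2, iters=500, seed=0):
    """Non-negative matrix factorisation with multiplicative updates:
    X (users x items) is approximated by U (users x k) @ V (k x items)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], k))
    V = rng.random((k, X.shape[1]))
    eps = 1e-9  # avoids division by zero
    for _ in range(iters):
        V *= (U.T @ X) / (U.T @ U @ V + eps)
        U *= (X @ V.T) / (U @ V @ V.T + eps)
    return U, V

U, V = factorise(X)
# Each row of U says how strongly a user loads on each of the k hidden
# themes; each row of V says which items belong to each theme.
```

With non-negative factors each dimension can be read directly as a theme, which is exactly the property exploited in the rest of the post.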

There are plenty of matrix factorisation algorithms available that can be used out of the box. For instance, the Spark MLlib library provides an implementation of Collaborative Filtering for Implicit Feedback Datasets. These algorithms produce a factorisation of the matrix with positive and negative values: this means that a user can be negatively associated with some dimensions, and the combination of positive and negative dimensions gives a final estimate of the relevance of an item for a particular user. In this case analysing dimensions independently may not make sense, because they are meant to be considered together.

A different technique that applies a similar strategy, but to extract topics from natural language text, is topic modelling. Topic modelling was first introduced by Blei et al. in 2003 with the Latent Dirichlet Allocation (LDA) model. The approach is very similar to matrix factorisation on the user-item matrix. In this case we consider the term frequency (TF) matrix, with documents on the rows, words on the columns, and the number of occurrences of a word in a document as the content of the matrix. Topic modelling extracts topics from the documents: documents have a probability distribution over topics, and topics have a probability distribution over words.

Topic modelling on the TF matrix.

The topics extracted with this technique are easy to interpret, because they can be represented as lists of words with a coherent semantic meaning.
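To make the mechanics concrete, here is a minimal collapsed Gibbs sampler for LDA on a toy corpus; the vocabulary, hyperparameters and iteration count are illustrative, and a real system would use an optimised library such as Mallet instead:

```python
import numpy as np

def lda_gibbs(docs, n_words, n_topics=2, alpha=0.1, beta=0.01, iters=300, seed=0):
    """Minimal collapsed Gibbs sampler for LDA. docs is a list of lists of
    word ids in [0, n_words). Returns the doc-topic and topic-word count
    matrices; add the priors and normalise to get probability distributions."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, n_words))     # topic-word counts
    nk = np.zeros(n_topics)                 # total words per topic
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):          # initialise counts randomly
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                 # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional probability of each topic for this word
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                 # resample and restore the counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

# Two obvious "research areas": words 0-3 vs words 4-7.
docs = [[0, 1, 2, 3, 0, 1], [1, 2, 3, 0, 2], [2, 0, 1, 3],
        [4, 5, 6, 7, 4, 5], [5, 6, 7, 4, 6], [6, 4, 5, 7]]
ndk, nkw = lda_gibbs(docs, n_words=8)
```

After sampling, the first three documents concentrate on one topic and the last three on the other, mirroring the document-topic and topic-word distributions described above.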

Example of topics extracted from the Mendeley Catalog.

If we want to apply topic modelling on the user-item matrix for recommendations, we can consider users as documents and items as words.

Parallelism between topic modelling on textual data and on CF data.

In this way the extracted topics are lists of items with something in common. What they have in common depends on the domain: for e-commerce products it can be the category, for movies the genre, for research articles the research area.
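In code, the reinterpretation is just a change of input: each library becomes a bag of item ids, exactly the structure a topic model expects (the user names and article ids below are invented):

```python
from collections import Counter

# Treat each user's library as a "document" whose "words" are article ids.
libraries = {
    "alice": ["paper_lda", "paper_plsa", "paper_lda"],
    "bob": ["paper_cnn", "paper_rnn", "paper_lda"],
}

# Build an item vocabulary and a bag-of-items per user.
vocab = sorted({item for lib in libraries.values() for item in lib})
item_id = {item: i for i, item in enumerate(vocab)}
bows = {user: Counter(item_id[item] for item in lib)
        for user, lib in libraries.items()}
```

From here the same LDA machinery used on text can be run unchanged on `bows`.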

To test this approach in the Mendeley domain, we applied LDA to the libraries of computer science users using the RankSys framework, which wraps Mallet for topic modelling in a way that reduces the memory footprint. Mallet is one of the most efficient non-distributed implementations of LDA. We extracted 200 topics from the libraries of computer science users and manually inspected the top 10 articles for each topic.

Example of topics extracted from the libraries of computer science users.

It was quite clear that the extracted topics represent research areas quite well. The number of topics determines their granularity: with a lower number of topics we obtained more general topics, while increasing the number we were able to find more specific themes. Moreover, these topics can naturally be seen as recommendation lists for specific research interests. All these characteristics make topic modelling a very good candidate for a topic-based recommender.

Creating labels

One of the main disadvantages of topic modelling is that topics are not easily labelled. Usually the list of top words is used to interpret and understand a topic, but it is far from a clean label. This problem is not limited to topic modelling: it affects any clustering technique. In our case, topics can be represented with lists of documents, as shown in the previous figure. Looking at the article titles, it is quite clear that some terms or combinations of terms appear frequently in a specific topic, and they can be used to label the topic itself. To automatically create these labels we extracted noun phrases from the top 100 titles for every topic, and combined them with the author-defined keywords that are part of an article's metadata. The most frequent phrases and keywords are then selected to represent the topic.
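A rough sketch of this labelling step, using bigrams of content words as a crude stand-in for proper noun-phrase extraction; the titles, keywords and stopword list are invented for the example:

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with"}

def label_topic(titles, keywords, top_n=3):
    """Count bigrams of content words from a topic's top titles (a crude
    stand-in for real noun-phrase extraction), merge in author keywords,
    and keep the most frequent candidates as the topic label."""
    counts = Counter()
    for title in titles:
        tokens = [t for t in re.findall(r"[a-z]+", title.lower())
                  if t not in STOPWORDS]
        counts.update(" ".join(bg) for bg in zip(tokens, tokens[1:]))
    counts.update(k.lower() for k in keywords)  # author-defined keywords
    return [phrase for phrase, _ in counts.most_common(top_n)]

titles = ["A Topic Model for Text", "Topic Model Evaluation",
          "Latent Topic Model Inference"]
labels = label_topic(titles, keywords=["topic model"])
# → the top label is "topic model" (3 title bigrams + 1 keyword)
```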

Topics with automatically created labels.

Even with this simple technique the labels clearly represent the topics. Further effort could be spent improving the labels, e.g. handling plurals consistently and avoiding a mix of single and composite terms.

Finding the latest topics for a user

If we want to provide topic-based recommendations for a specific user, we need to understand which topics the user is interested in. A research article can be about many topics, but for simplicity we associate each article only with its single most probable topic. We then scan the user's library starting from the most recently added document and select three topics from the latest additions. This simple strategy ensures that we capture the most recent interests of a user and provide recommendations focused on what they are currently working on.
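The selection step can be sketched as follows; the document ids and topic names are invented, and `top_topic` stands for the precomputed article-to-top-topic assignment described above:

```python
def latest_topics(library, top_topic, n=3):
    """library lists document ids, newest first; top_topic maps each
    document to its single most probable topic. Scan from the most
    recent addition and keep the first n distinct topics."""
    topics = []
    for doc in library:
        topic = top_topic.get(doc)
        if topic is not None and topic not in topics:
            topics.append(topic)
        if len(topics) == n:
            break
    return topics

library = ["doc5", "doc4", "doc3", "doc2", "doc1"]  # newest first
top_topic = {"doc5": "deep learning", "doc4": "deep learning",
             "doc3": "recommender systems", "doc2": "databases",
             "doc1": "operating systems"}
print(latest_topics(library, top_topic))
# → ['deep learning', 'recommender systems', 'databases']
```

Because the scan stops at three distinct topics, older interests ("operating systems" here) are deliberately ignored.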

Topic selection from a user library.

Conclusions

In this blogpost I presented the recently introduced topic-based recommender. Topic-based recommendations can be very useful because they group items into specific sublists that are easier to process. Furthermore, describing these lists with meaningful labels makes them easier to identify. At Mendeley we decided to provide these recommendations for users in computer science as a live experiment. By considering only the latest added articles we are able to capture the recent research interests of a user. Up to now, results are encouraging and show that users like topic-based recommendations.

Topic-based recommendations are far from perfect. The main future lines of work are: tuning the topic extraction by selecting the right number of topics; scaling the process to all disciplines; improving the label creation process; and mapping users to the right topics.

Mendeley Research Maps

With this first post I'm going to introduce Mendeley Research Maps. At Mendeley we have monthly hackdays when we can experiment with new technologies, work on side projects or simply learn something new and have fun. During one of my first hackdays I started to work on a two-dimensional visualisation of research interests, inspired by an idea suggested to me by a good friend and future colleague, Davide Magatti. The first hack produced the following visualisation:

First draft of the Discipline Map.

In this picture disciplines are arranged on the map based on how related they are: medicine is very broad and close to many other disciplines, such as biology and psychology. In the opposite corner, computer science is close to engineering and economics. From this first draft I started to build a web application for Mendeley users, using the Mendeley Python SDK to query the Mendeley API, Flask as a web framework, CherryPy as a web server, and several other libraries that I will mention later in the post.

Try Mendeley Research Maps!

Introduction

The idea is to create a map of research interests so that users can locate themselves in relation to other researchers. Within a research community it is quite easy to understand what people are doing, as researchers are usually aware of who publishes in the same conferences and journals. It is not so easy when someone works in a different community: someone who appears to work in a completely different field could actually share many interests and methodologies with you. Mendeley Research Maps is based exactly on this concept: visualise where people's research interests lie, and compare them with your own.

From research articles to topics

The first step is to find a good representation of research interests. At Mendeley we have millions of unique crowd-sourced documents that cover all research areas. In general, the title and the abstract of an article offer a clear representation of its main topics, so I focused on this content for the analysis. A common technique to extract topics from natural language text is topic modelling, first introduced by Blei et al. in 2003 with the Latent Dirichlet Allocation model. The topics extracted with this technique are easy to interpret, because they can be represented as lists of words.

Example of topics extracted from the Mendeley Catalog.

Moreover, topic modelling provides a document-topic mapping. In this way, we can determine that a specific article, for instance the "Latent Dirichlet Allocation" article, is mainly about "Bayesian statistics" (64%) and "numerical simulations" (15%). Since topic modelling is an unsupervised technique, topics can be quite messy, because there is no control over the top words selected for each topic. However, in the literature there are many approaches that try to automatically label topics to make them easier to understand.

Topics for the Latent Dirichlet Allocation article.

Formally, a document can be represented as a vector where the value of each component represents the probability of the corresponding topic:

\vec{d}=(d_1,d_2,...,d_Z)

where:

\sum_{z=1}^{Z}{d_z}=1

Once we have a topic representation for each document, we can easily represent each user in the same way. Given the articles that a researcher has in their library, we can aggregate the topics that appear in these articles and represent the user in the same space.

Formally, user u can be represented by the following vector in the topic space:

\vec{u}=(u_1,u_2,...,u_Z)

where:

u_z=\frac{\sum_{\vec{d} \in D_u}{d_z}}{|D_u|}

and D_u is the set of documents in user u's library.
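Numerically, the user vector is just the mean of the document rows, so it remains a probability distribution over topics; a small sketch with made-up topic vectors:

```python
import numpy as np

def user_vector(doc_vectors):
    """u_z = (1/|D_u|) * sum over the user's documents of d_z.
    Each row of doc_vectors is a topic distribution, so the mean is too."""
    return np.asarray(doc_vectors, dtype=float).mean(axis=0)

# Two documents, three topics; each row sums to 1.
docs = [[0.8, 0.2, 0.0],
        [0.4, 0.4, 0.2]]
u = user_vector(docs)
# The mean is [0.6, 0.3, 0.1] and still sums to 1.
```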

In order to train the topic model we extracted a subsample of the Mendeley catalog such that each subdiscipline is represented and the same number of articles is sampled from each subdiscipline. Since topic modelling is unsupervised, if the training data is too skewed towards a specific discipline, most of the topics will be about that discipline.

Subdisciplines for Computer Science
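The balanced subsampling can be sketched like this; the subdiscipline names, sizes and per-class quota are invented:

```python
import random

def balanced_sample(docs_by_subdiscipline, per_class, seed=0):
    """Draw the same number of articles from every subdiscipline so that no
    single discipline dominates the topic model's training data."""
    rng = random.Random(seed)
    sample = []
    for sub in sorted(docs_by_subdiscipline):
        docs = docs_by_subdiscipline[sub]
        sample.extend(rng.sample(docs, min(per_class, len(docs))))
    return sample

catalog = {"machine learning": [f"ml{i}" for i in range(10)],
           "databases": ["db0", "db1", "db2"]}
sample = balanced_sample(catalog, per_class=3)  # 3 articles per subdiscipline
```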

The topic extraction step was implemented using Mallet, a very efficient Java library for topic modelling. Since I needed to infer topics within a Python environment, I used the gensim library, which offers a nice wrapper around Mallet.

From topics to maps

The next step is to map each user onto a two-dimensional space based on their research interests. There are several techniques for visualising data points in a two- or three-dimensional space, such as singular value decomposition (SVD) and principal component analysis (PCA). In our case we applied self-organising maps (SOM), also known as Kohonen maps, a computational method for the low-dimensional approximation of high-dimensional data. The basic idea is that data points with similar features are mapped onto the same region of the map. In our case a data point is a researcher or an article, the high-dimensional input space is the topic space, and the low-dimensional output is a layer of neurons, usually arranged in a grid. The mapping between the input features and the output neurons can be represented by a matrix W, where each component w_{ij} indicates how strong the connection is between topic i and neuron j.

Mapping of documents on a two dimensional grid
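A production system would use a dedicated library, but the core of a self-organising map is short enough to sketch in plain NumPy; the grid size, learning rate and decay schedule here are arbitrary choices for the example:

```python
import numpy as np

def train_som(data, grid=(4, 4), iters=500, lr0=0.5, sigma0=1.5, seed=0):
    """Tiny self-organising map: one weight vector (in topic space) per
    neuron on a grid; the best-matching neuron and its grid neighbours are
    pulled towards each sample, so similar inputs end up close together."""
    rng = np.random.default_rng(seed)
    h, w = grid
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    W = rng.random((h * w, data.shape[1]))
    for t in range(iters):
        x = data[rng.integers(len(data))]
        lr = lr0 * (1 - t / iters)                   # decaying learning rate
        sigma = sigma0 * (1 - t / iters) + 0.5       # shrinking neighbourhood
        bmu_idx = np.argmin(((W - x) ** 2).sum(axis=1))  # best matching unit
        dist2 = ((coords - coords[bmu_idx]) ** 2).sum(axis=1)
        nb = np.exp(-dist2 / (2 * sigma ** 2))       # neighbourhood kernel
        W += lr * nb[:, None] * (x - W)
    return W, coords

def bmu(W, x):
    """Index of the neuron whose weight vector is closest to x."""
    return int(np.argmin(((W - np.asarray(x)) ** 2).sum(axis=1)))

# Two artificial groups of 3-topic vectors end up on different neurons.
rng = np.random.default_rng(1)
group_a = rng.normal([0.9, 0.05, 0.05], 0.03, (20, 3))
group_b = rng.normal([0.05, 0.05, 0.9], 0.03, (20, 3))
W, coords = train_som(np.vstack([group_a, group_b]))
```

The trained W is exactly the topic-to-neuron matrix described above: its rows live in topic space, one per neuron on the grid.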

For training the self-organising map I used the same dataset prepared for training the topic model. In this way we ensure that the map assigns similarly sized areas to all subdisciplines. However, since some disciplines have more subdisciplines than others, the area assigned to those disciplines will be larger.

The self-organising map model was trained with SOMPY, a Python library that also offers some visualisation functionality.

Once the model is trained, it can be applied to new documents or users to map them onto the two-dimensional space. To map user u onto the map space M, we need to transform the user-topic vector \vec{u} into the user-map vector \vec{u'}, where:

u'_{j}=\sum_{z=1}^{Z}{u_z w_{zj}}
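In NumPy this projection is a single matrix product; the numbers below are invented (2 topics, 3 neurons):

```python
import numpy as np

u = np.array([0.7, 0.3])            # user-topic vector (Z = 2 topics)
W = np.array([[0.9, 0.1, 0.0],      # w_zj: topic z -> neuron j (J = 3)
              [0.0, 0.2, 0.8]])
u_map = u @ W                       # u'_j = sum_z u_z * w_zj
# u_map is [0.63, 0.13, 0.24]: the user's interest spread over the neurons.
```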

The output of this mapping can be represented as a heatmap, such as the one in the following figure.

User’s Research Interests map.

The map indicates strong interest with red-orange, medium interest with yellow-white, and low interest with light blue-blue. Moreover, specific neurons related to the most important topics for the user are highlighted with a black border.

Maps were created using D3.js; in particular, this blog post was very useful, as it explains how to implement hexagonal heatmaps.

Disciplines map

From a single researcher's map it is not easy to understand where the different research areas lie. For this reason, a different map can be shown in which disciplines are highlighted. To compute this map, a discipline-topic vector is computed by aggregating the topic vectors of the documents assigned to each discipline, in the same way as the user vector was computed. Then the discipline-topic vector is mapped to a discipline-map vector.

In order to decide which discipline to assign to each neuron, we pick the discipline with the maximum value for that neuron.
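With the discipline-map vectors stacked in a matrix, this assignment is a column-wise argmax; the two disciplines and three neurons below are invented:

```python
import numpy as np

names = ["medicine", "computer science"]
# Each row is one discipline's activation over the map's neurons.
discipline_map = np.array([
    [0.9, 0.2, 0.1],   # medicine
    [0.1, 0.7, 0.3],   # computer science
])
# Each neuron gets the discipline with the maximum value for that neuron.
assignment = [names[i] for i in discipline_map.argmax(axis=0)]
# → ['medicine', 'computer science', 'computer science']
```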

Disciplines map.

Similar Maps

An interesting use case of Mendeley Research Maps is looking for people with similar research interests. To be precise, the similarity between users is computed on the user-topic vectors.
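A sketch of that lookup, assuming cosine similarity as the measure (the post only says the similarity operates on user-topic vectors; the names and vectors are invented):

```python
import numpy as np

def similar_users(target, others, top_n=2):
    """Rank other users by cosine similarity of their user-topic vectors."""
    t = np.asarray(target, dtype=float)
    scores = []
    for name, vec in others.items():
        v = np.asarray(vec, dtype=float)
        sim = float(t @ v / (np.linalg.norm(t) * np.linalg.norm(v)))
        scores.append((name, sim))
    return sorted(scores, key=lambda s: -s[1])[:top_n]

me = [0.7, 0.3, 0.0]
others = {"ann": [0.8, 0.2, 0.0], "bob": [0.0, 0.1, 0.9]}
ranked = similar_users(me, others)   # "ann" shares my topics, "bob" doesn't
```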

Similar Users.

The interface lets you visit the Mendeley profile pages of people with similar research interests. In this way a user can decide to follow an interesting researcher, or to contact them and discuss opportunities for collaboration.

Conclusions

Mendeley Research Maps projects research interests onto a two-dimensional space that is easy to visualise and understand. Beyond providing a very nice visualisation of users and disciplines, this mapping makes it possible to compare different researchers and to find people in the same research area. Finding new researchers and connecting with them through the Mendeley social network can be a great way to start new collaborations and open new opportunities!

Try Mendeley Research Maps!