With this first post I’m going to introduce Mendeley Research Maps. At Mendeley we have monthly hackdays when we can experiment with new technologies, work on side projects or simply learn something new and have fun. During one of my first hackdays I started working on a two dimensional visualisation of research interests, inspired by an idea suggested to me by a good friend and future colleague, Davide Magatti. The first hack produced the following visualisation:
In this picture disciplines are arranged on the map based on how related they are: medicine is very broad and close to many other disciplines, such as biology and psychology. In the opposite corner computer science is close to engineering and economics. From this first draft I started to build a web application for Mendeley users, using the Mendeley Python SDK for querying the Mendeley API, Flask as a web framework, CherryPy as a web server, and several other libraries that I will mention later in the post.
The idea is to create a map of research interests so that users can locate themselves in relation to other researchers. Within a research community it is quite easy to understand what people are doing, as researchers are usually aware of people who publish in the same conferences and journals. It is much harder when someone is working in a different community. Someone who appears to work in a completely different field could actually share a lot of interests and common methodologies with you. The idea behind Mendeley Research Maps is based exactly on this concept: visualise where people’s research interests lie, and compare them to your own.
From research articles to topics
The first step is to find a good representation of research interests. At Mendeley we have millions of unique crowd-sourced documents that cover all research areas. In general, the title and the abstract of an article offer a clear representation of the main topics discussed in the article, so I focused on this kind of content for this analysis. A common technique to extract topics from natural language text is topic modelling. Topic modelling was first introduced by Blei, Ng and Jordan in 2003 with the Latent Dirichlet Allocation model. The topics extracted with this technique are easy to interpret, because they can be represented as lists of words.
Moreover, topic modelling is able to provide a document-topic mapping. In this way, we are able to determine that a specific article, for instance the “Latent Dirichlet Allocation” article, is mainly about “Bayesian statistics” (64%) and “numerical simulations” (15%). Since topic modelling is an unsupervised technique, topics can be quite messy, because there isn’t any control over the top words selected for each topic. However, in the literature there are many approaches that try to automatically label topics to make them easier to understand.
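To make the document-topic mapping concrete, here is a minimal sketch of fitting an LDA model and reading off one topic-probability vector per document. The post used Mallet (via gensim); scikit-learn’s implementation is only a stand-in here, and the toy corpus and topic count are illustrative:

```python
# Sketch of extracting document-topic mixtures with LDA.
# Toy corpus and n_components=2 are illustrative, not the real setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "bayesian inference with dirichlet priors and gibbs sampling",
    "posterior probability estimation in bayesian statistics",
    "numerical simulation of fluid dynamics on a grid",
    "finite element simulation and numerical methods",
]

# Bag-of-words counts over title/abstract text
X = CountVectorizer().fit_transform(docs)

# Fit a 2-topic model; transform yields one probability vector
# per document (rows sum to 1)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

print(doc_topics.shape)
```

Each row of `doc_topics` is exactly the kind of “64% topic A, 15% topic B” breakdown described above, just unnormalised into named labels.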
Formally, a document d can be represented as a vector in the topic space, where the value of each component represents the probability of the corresponding topic:

t_d = ( p(z_1 | d), p(z_2 | d), …, p(z_K | d) )

where K is the number of topics and p(z_k | d) is the probability of topic z_k in document d.
Once we have a topic representation for each document, we can easily represent each user in the same way. Given the articles that a researcher has in their library, we can aggregate the topics that appear in these articles and represent the user in the same space.
Formally, a user u can be represented by the following vector in the topic space:

t_u = (1 / |D_u|) Σ_{d ∈ D_u} t_d

where D_u is the set of documents in user u’s library.
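Assuming the aggregation is a simple average over the library’s document-topic vectors (the post does not pin down the exact scheme), the user vector can be sketched in a few lines:

```python
import numpy as np

# Toy document-topic vectors for the articles in one user's library
# (each row is a document, each column a topic probability).
library = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
])

# Aggregate into a single user-topic vector by averaging over the
# library; the result still lives in the same topic space.
user_vector = library.mean(axis=0)

print(user_vector)
```

Because averaging preserves normalisation, the user vector is itself a valid topic distribution.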
In order to train the topic model we extracted a subsample of the Mendeley catalog such that each subdiscipline is represented and the same number of articles is sampled from each subdiscipline. In fact, since topic modelling is unsupervised, if the training data is too skewed towards a specific discipline, most of the topics will be about that discipline.
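The balanced subsampling step can be sketched as follows; the catalog contents and subdiscipline names are made up for illustration:

```python
import random
from collections import defaultdict

# Toy catalog of (document_id, subdiscipline) pairs with deliberately
# skewed sizes; names and counts are illustrative only.
catalog = (
    [(f"doc{i}", "genetics") for i in range(100)]
    + [(f"doc{100 + i}", "machine learning") for i in range(30)]
    + [(f"doc{130 + i}", "econometrics") for i in range(50)]
)

def balanced_sample(docs, per_subdiscipline, seed=0):
    """Sample the same number of documents from every subdiscipline."""
    rng = random.Random(seed)
    by_sub = defaultdict(list)
    for doc_id, sub in docs:
        by_sub[sub].append(doc_id)
    sample = []
    for ids in by_sub.values():
        sample.extend(rng.sample(ids, per_subdiscipline))
    return sample

training_set = balanced_sample(catalog, per_subdiscipline=20)
print(len(training_set))  # 3 subdisciplines x 20 documents each
```

Equal per-subdiscipline quotas prevent the dominant discipline from monopolising the learned topics.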
The topic extraction step was implemented using Mallet, a very efficient Java library for topic modelling. Since I needed to infer topics from within a Python environment, I used the gensim library, which offers a nice wrapper around Mallet.
From topics to maps
The next step is to map each user onto a two dimensional space based on their research interests. There are several techniques that can be used to visualise data points in a two or three dimensional space, such as singular value decomposition (SVD) and principal component analysis (PCA). In our case we applied self organizing maps (SOM), also known as Kohonen maps, a computational method for the low-dimensional approximation of high-dimensional data. The basic idea is that data points with similar features are mapped onto the same region of the map. In our case a data point is a researcher or an article, the high-dimensional input space is the topic space, and the low-dimensional output is a layer of neurons, usually arranged in a grid. The mapping between the input features and the output neurons can be represented by a matrix W, where each component w_ij indicates how strong the connection is between topic i and neuron j.
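As a rough sketch of the SOM idea (not the SOMPY implementation actually used), training repeatedly picks a data point, finds its best-matching neuron, and pulls that neuron and its grid neighbours towards the point; all sizes and parameters below are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics = 5                  # input (topic-space) dimensionality
grid_h, grid_w = 6, 6         # output layer: a 6x6 grid of neurons
W = rng.random((grid_h * grid_w, n_topics))  # neuron weight vectors

# Grid coordinates of each neuron, used by the neighbourhood function
coords = np.array([(i, j) for i in range(grid_h) for j in range(grid_w)])

def train(data, epochs=20, lr0=0.5, radius0=2.0):
    global W
    for epoch in range(epochs):
        # Decay learning rate and neighbourhood radius over time
        frac = 1.0 - epoch / epochs
        lr, radius = lr0 * frac, max(radius0 * frac, 0.5)
        for x in data:
            # Best-matching unit: neuron whose weights are closest to x
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))
            # Gaussian neighbourhood on the grid around the BMU
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * radius ** 2))
            # Pull neurons towards x, more strongly near the BMU
            W += lr * h[:, None] * (x - W)

# Toy "topic vectors": two clusters should land in different regions
data = np.vstack([rng.normal(0.1, 0.02, (20, n_topics)),
                  rng.normal(0.9, 0.02, (20, n_topics))])
train(data)

bmu_low = int(np.argmin(((W - data[0]) ** 2).sum(axis=1)))
bmu_high = int(np.argmin(((W - data[-1]) ** 2).sum(axis=1)))
print(bmu_low, bmu_high)
```

After training, the two clusters of toy vectors end up with different best-matching neurons, which is exactly the “similar features map to the same region” behaviour described above.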
For training the self organizing map I used the same dataset prepared for training the topic model. In this way we ensure that the map will assign similarly sized areas to all subdisciplines. However, since some disciplines have more subdisciplines than others, the area assigned to some disciplines will be larger.
The Self Organizing Map model was trained with SOMPY, a Python library that also offers some visualisation functionality.
Once the model is trained, it can be applied to new documents or users to map them onto the two dimensional space. To map a user u into the map space, we transform the user-topic vector t_u into the user-map vector m_u, where:

m_u,j = Σ_i t_u,i · w_ij

so each component of m_u is the activation of the corresponding neuron j given the user’s topic weights.
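Given the topic-to-neuron matrix W described earlier, projecting a user onto the map amounts to one matrix product; the numbers below are toy values:

```python
import numpy as np

# Topic-to-neuron connection matrix W: entry [i, j] is the strength
# of the connection between topic i and neuron j (illustrative values;
# 3 topics, 4 neurons).
W = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.3, 0.1],
    [0.0, 0.1, 0.7, 0.9],
])

# A user-topic vector (this user is mostly interested in topic 0)
user_topics = np.array([0.7, 0.2, 0.1])

# User-map vector: one activation per neuron, m_j = sum_i t_i * w_ij.
# These activations are what gets rendered as the heatmap.
user_map = user_topics @ W
print(user_map)
```

The neuron most strongly connected to the user’s dominant topic receives the highest activation, which is why it shows up red-orange on the heatmap.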
The output of this mapping can be represented as a heatmap such as the one in the following figure.
The map indicates strong interest with red-orange, medium interest with yellow-white, and low interest with light blue-blue. Moreover, specific neurons related to the most important topics for the user are highlighted with a black border.
From a single researcher map it is not easy to understand where different research areas are mapped. For this reason, a different map can be shown where disciplines are highlighted. To compute this map, a discipline-topic vector is computed by aggregating the topic vectors of documents assigned to each discipline, in the same way as the user vector was computed. Then the discipline-topic vector is mapped to a discipline-map vector.
In order to understand which discipline to assign to each neuron, we look for the discipline with the maximum value for the considered neuron.
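The per-neuron assignment is an argmax across disciplines; a small sketch with made-up discipline names and activations:

```python
import numpy as np

# Toy discipline-map matrix: one row per discipline, one column per
# neuron; entry [d, j] is discipline d's activation on neuron j.
disciplines = ["medicine", "computer science", "economics"]
discipline_map = np.array([
    [0.9, 0.6, 0.1, 0.2],
    [0.1, 0.2, 0.8, 0.4],
    [0.2, 0.1, 0.3, 0.7],
])

# Assign each neuron the discipline with the maximum activation there.
labels = [disciplines[d] for d in discipline_map.argmax(axis=0)]
print(labels)
```

Adjacent neurons tend to receive the same label, which is what produces the contiguous discipline regions visible on the overview map.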
An interesting use case of Mendeley Research Maps is the ability to look for people with similar research interests. To be precise, the similarity between users is computed on the user-topic vectors.
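The post does not state which similarity measure is used on the user-topic vectors; cosine similarity is a common choice for such vectors, and ranking candidates by it can be sketched as:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two topic vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy user-topic vectors; names are illustrative.
users = {
    "alice":   np.array([0.7, 0.2, 0.1]),
    "bob":     np.array([0.6, 0.3, 0.1]),
    "charlie": np.array([0.1, 0.1, 0.8]),
}

# Rank everyone else by similarity to one query user, best match first.
query = users["alice"]
ranked = sorted(
    (name for name in users if name != "alice"),
    key=lambda name: cosine(query, users[name]),
    reverse=True,
)
print(ranked)
```

Computing similarity in topic space rather than map space avoids losing information in the two dimensional projection.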
The interface lets users view the Mendeley profile pages of people with similar research interests. In this way a user can decide to follow an interesting researcher, or to contact them to discuss opportunities for collaboration.
Mendeley Research Maps projects research interests onto a two dimensional space that is easy to visualise and understand. This mapping, beyond giving a very nice visualisation of users and disciplines, can be very useful for comparing different researchers and finding people in the same research area. Finding new researchers and connecting with them through the Mendeley social network can be a great way to start new collaborations and open new opportunities!