Many thanks again to my co-presenters Alexey Borisov, Christophe Van Gysel, Mostafa Dehghani, Maarten de Rijke and Bhaskar Mitra, and to everyone attending, for making the NN4IR tutorial at SIGIR 2017 in Tokyo, held in a packed room, a great success.

Please read the excellent blog post on the ACM website. And thanks, everyone, for tweeting.

The final slides are now available on nn4ir.com.
 

I was asked to join the program committee of the 2018 edition of The Web Conference (27th edition of the former WWW conference), in Lyon, France.
Yes, that is right, WWW 2018 is rebranded as The Web Conference this year.
 
Together with Mostafa Dehghani (UvA), Jaap Kamps (UvA), Scott Roy (Google) and Ryen White (Microsoft Research), I joined the program committee of SCAI'17 — Search-Oriented Conversational AI, held October 1 in Amsterdam, and co-located with ICTIR'17.
 
I will be giving a SIGIR 2017 tutorial in Tokyo, Japan, on neural networks for Information Retrieval (NN4IR), together with Alexey Borisov, Christophe Van Gysel, Mostafa Dehghani, Maarten de Rijke and Bhaskar Mitra.
More info in this overview paper and on the NN4IR website.
 
I am honoured to be invited to give a talk at the 14th SIKS/Twente Seminar on Searching and Ranking, Text as social and cultural data. This symposium is organized together with the PhD defense of Dong Nguyen.
 
I have been invited to join the Program Committee of KDD 2017, a premier interdisciplinary conference bringing together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data, which is held in Halifax, Nova Scotia, Canada, August 13-17, 2017.
 
Spring and summer in California again!!! I am going to do a second internship at Google Research in Mountain View, from April until July. I'll be working with Dana Movshovitz-Attias.

 
Wonderful! The full paper Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity, by Hosein Azarbonyad, Mostafa Dehghani, me, Maarten Marx, Jaap Kamps and Maarten de Rijke, is accepted for the 39th European Conference on Information Retrieval (ECIR 2017) in Aberdeen!
 
I gave a talk about Siamese CBOW at SEA, Search Engines Amsterdam, a series of monthly talks, where academia and industry meet. Here are the slides I used.
 
Organising BNAIC 2016 was a lot of fun. I debuted as a session chair, in the Natural Language Processing session. I was also the demo chair on the organising committee. We had a very nice demo session, I think, with "Autonomous Robot Soccer Matches" by Caitlin Lagrand et al. winning the BNAIC SKBS Demo Award.
 
This is the official version of Siamese CBOW: Optimizing Word Embeddings for Sentence Representations, the full paper I wrote with Alexey Borisov and Maarten de Rijke, which I presented last week at ACL 2016 in Berlin.
 
Our workshop paper Design and implementation of ShiCo: Visualising shifting concepts over time, written together with Carlos Martinez-Ortiz, Melvin Wevers, Pim Huijnen, Jaap Verheul and Joris van Eijnatten is accepted to the HistoInformatics2016 workshop held in conjunction with the Digital Humanities 2016 conference.
PDF will follow shortly.
 
Great stuff!! My full paper Siamese CBOW: Optimizing Word Embeddings for Sentence Representations that I wrote together with Alexey Borisov and Maarten de Rijke is accepted for ACL 2016, which is held in Berlin.

Siamese CBOW: Optimizing Word Embeddings for Sentence Representations

We present the Siamese Continuous Bag of Words (Siamese CBOW) model, a neural network for efficient estimation of high-quality sentence embeddings. Averaging the embeddings of words in a sentence has proven to be a surprisingly successful and efficient way of obtaining sentence embeddings. However, word embeddings trained with the methods currently available are not optimized for the task of sentence representation, and, thus, likely to be suboptimal. Siamese CBOW handles this problem by training word embeddings directly for the purpose of being averaged. The underlying neural network learns word embeddings by predicting, from a sentence representation, its surrounding sentences. We show the robustness of the Siamese CBOW model by evaluating it on 20 datasets stemming from a wide variety of sources.
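The core idea above, representing a sentence as the average of its word embeddings and comparing sentences by cosine similarity, can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the embeddings here are random, whereas Siamese CBOW trains them specifically so that these averaged vectors place a sentence close to its neighbouring sentences (via a softmax over cosine similarities).

```python
import numpy as np

def sentence_embedding(sentence, emb):
    """Represent a sentence as the average of its word embeddings."""
    vectors = [emb[w] for w in sentence.lower().split() if w in emb]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy, randomly initialised embeddings; in Siamese CBOW these weights are
# the thing being trained, so that averaging yields good sentence vectors.
rng = np.random.default_rng(42)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
emb = {w: rng.normal(size=16) for w in vocab}

s_a = sentence_embedding("the cat sat on the mat", emb)
s_b = sentence_embedding("the dog sat on the mat", emb)
print(cosine(s_a, s_b))
```

The appeal of this set-up is its efficiency: once the word embeddings are trained, producing a sentence vector is just a single averaging pass over the words.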

Here is the pre-print on arXiv.
 

I am quite thrilled and honoured by this... I was interviewed by the New Scientist.
The interview is titled Will computers ever be able to understand language? (in Dutch). It's about my research on sentence similarity, and also a bit about the state of affairs of natural language processing in general.

Here is the link to the interview on the New Scientist website.
 

Our demo paper "ShiCo: A Visualization Tool for Shifting Concepts Through Time" that I wrote together with Carlos Martinez-Ortiz, Melvin Wevers, Pim Huijnen, Jaap Verheul and Joris van Eijnatten is accepted for DHBenelux 2016.
This is particularly nice, I think, because this is follow-up work of our CIKM paper Ad Hoc Monitoring of Vocabulary Shifts over Time.
 
I'll be the demo chair for BNAIC 2016, the Annual Benelux Conference on Artificial Intelligence.
The conference is jointly organized by the University of Amsterdam and the Vrije Universiteit Amsterdam, under the auspices of the Benelux Association for Artificial Intelligence (BNVKI) and the Dutch Research School for Information and Knowledge Systems (SIKS) and will be held in Amsterdam, Thursday 10 and Friday 11 November, 2016.

BTW, I also designed the logo... ;-)
 

Summer in California!!! I am going to do an internship at Google Research in Mountain View, May until August. I'll be working with Mat Kelcey.

 
The abstract of my CIKM'15 paper Short Text Similarity with Word Embeddings was accepted for the Dutch-Belgian Information Retrieval workshop (DIR2015) in Amsterdam, the Netherlands.
 
I gave two talks at CIKM'15 in Melbourne. Here are the slides:

Short Text Similarity with Word Embeddings

Ad Hoc Monitoring of Vocabulary Shifts over Time
 

My research about sentence semantics and changes in word usage through time made it to the UvA homepage.

 
I went to the Google NLP PhD Summit in Zurich and it was great! I met a lot of very interesting people and had a lot of nice discussions.
Here is a link to the poster I presented.
 
Cool! I will be going to the Google NLP PhD Summit in Zurich in September.


 

Today, Agnes van Belle, an AI master student I supervised, graduated. She wrote a nice thesis called Historical Document Retrieval with Corpus-derived Rewrite Rules.
Spelling changes quite often occur gradually (even when they are government-imposed), and the thesis shows that this continuum of gradual change can be exploited when doing query expansion for historical document retrieval.

 
Here is the final version of my CIKM 2015 paper Short Text Similarity with Word Embeddings with Maarten de Rijke.

 
Here is the final version of my CIKM 2015 paper Ad Hoc Monitoring of Vocabulary Shifts over Time with Melvin Wevers, Pim Huijnen and Maarten de Rijke.

 
The dataset we made for the CIKM 2015 paper "Ad Hoc Monitoring of Vocabulary Shifts over Time" with Melvin Wevers, Pim Huijnen and Maarten de Rijke is now publicly available.
Go and get it here.

Many, many thanks to all annotators who contributed their time and effort!!!
 

Yes! Yes! Nice! Nice! Both full paper submissions to CIKM 2015 are accepted. I am going to Melbourne! These are the papers:

Short Text Similarity with Word Embeddings, with Maarten de Rijke.
Short Text Similarity with Word Embeddings

Determining semantic similarity between texts is important in many tasks in information retrieval such as search, query suggestion, automatic summarization and image finding. Many approaches have been suggested, based on lexical matching, handcrafted patterns, syntactic parse trees, external sources of structured semantic knowledge and distributional semantics. However, lexical features, like string matching, do not capture semantic similarity beyond a trivial level. Furthermore, handcrafted patterns and external sources of structured semantic knowledge cannot be assumed to be available in all circumstances and for all domains. Lastly, approaches depending on parse trees are restricted to syntactically well-formed texts, typically of one sentence in length.
We investigate whether determining short text similarity is possible using only semantic features — where by semantic we mean, pertaining to a representation of meaning — rather than relying on similarity in lexical or syntactic representations. We use word embeddings, vector representations of terms, computed from unlabelled data, that represent terms in a semantic space in which proximity of vectors can be interpreted as semantic similarity.
We propose to go from word-level to text-level semantics by combining insights from methods based on external sources of semantic knowledge with word embeddings. A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated. We derive multiple types of meta-features from the comparison of the word vectors for short text pairs, and from the vector means of their respective word embeddings. The features representing labelled short text pairs are used to train a supervised learning algorithm. We use the trained model at testing time to predict the semantic similarity of new, unlabelled pairs of short texts.
We show on a publicly available evaluation set commonly used for the task of semantic similarity that our method outperforms baseline methods that work under the same conditions.
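As a rough illustration of the meta-feature idea described above, here is a toy sketch that derives a small feature vector from a pair of short texts: the similarity of their mean word vectors, plus summary statistics over all word-pair similarities. The embeddings and feature choices here are illustrative assumptions, not the paper's exact feature set, and the supervised learning step on top is omitted.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_features(text1, text2, emb):
    """Meta-features for a short-text pair: similarity of the mean
    vectors, plus summary statistics of word-pair similarities."""
    v1 = [emb[w] for w in text1.lower().split() if w in emb]
    v2 = [emb[w] for w in text2.lower().split() if w in emb]
    sims = [cos(a, b) for a in v1 for b in v2]
    return np.array([
        cos(np.mean(v1, axis=0), np.mean(v2, axis=0)),  # similarity of means
        max(sims),                                      # best-matching word pair
        float(np.mean(sims)),                           # average pairwise similarity
    ])

# Toy random embeddings; in the paper these come from pre-trained sets,
# and multiple embedding sets can be combined.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in ["cheap", "flights", "low", "cost", "airfare"]}
x = pair_features("cheap flights", "low cost airfare", emb)
```

Feature vectors like `x`, computed for labelled text pairs, are what the supervised model is trained on; at test time the same features are extracted for new pairs.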


Ad Hoc Monitoring of Vocabulary Shifts over Time with Melvin Wevers, Pim Huijnen and Maarten de Rijke.
Ad Hoc Monitoring of Vocabulary Shifts over Time

Word meanings change over time. Detecting shifts in meaning for particular words has been the focus of much research recently. We address the complementary problem of monitoring shifts in vocabulary over time. That is, given a small seed set of words, we are interested in monitoring which terms are used over time to refer to the underlying concept denoted by the seed words.
In this paper, we propose an algorithm for monitoring shifts in vocabulary over time, given a small set of seed terms. We use distributional semantic methods to infer a series of semantic spaces over time from a large body of time-stamped unstructured textual documents. We construct semantic networks of terms based on their representation in those semantic spaces and use graph-based measures to calculate saliency of terms. Based on these graph-based measures we produce ranked lists of terms that represent the concept underlying the initial seed terms over time as final output.
As the task of monitoring shifting vocabularies over time for an ad hoc set of seed words is, to the best of our knowledge, a new one, we construct our own evaluation set. Our main contributions are the introduction of the task of ad hoc monitoring of vocabulary shifts over time, the description of an algorithm for tracking shifting vocabularies over time given a small set of seed words, and a systematic evaluation of results over a substantial period of time (over four decades). Additionally, we make our newly constructed evaluation set publicly available.
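A heavily simplified sketch of the per-time-slice ranking step might look as follows. This toy version connects every candidate term to the seed set in a similarity graph and scores terms by weighted degree; the embeddings below are hand-crafted for illustration, whereas the paper infers a series of semantic spaces from time-stamped text and uses richer graph-based saliency measures.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_terms(seeds, emb, threshold=0.0):
    """Rank terms in one time slice by weighted degree in a similarity
    graph connecting each candidate term to the seed terms."""
    scores = {}
    for term, vec in emb.items():
        if term in seeds:
            continue
        # weighted degree: sum of edge weights to the seed set
        weights = [cos(vec, emb[s]) for s in seeds if cos(vec, emb[s]) > threshold]
        scores[term] = sum(weights)
    return sorted(scores, key=scores.get, reverse=True)

# Hand-crafted 2-d "semantic space" for one time slice.
emb = {
    "automobile": np.array([1.0, 0.0]),
    "car":        np.array([0.9, 0.1]),
    "banana":     np.array([0.0, 1.0]),
}
ranking = rank_terms({"automobile"}, emb)
```

Running this over a sequence of time slices, with the seed set as input, yields one ranked vocabulary list per period, which is the kind of output the algorithm produces.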

Camera-ready PDFs will follow shortly.
 

The IPM paper Evaluating Document Filtering Systems over Time with Krisztian Balog and Maarten de Rijke is online. Here is the official link and it can also be downloaded here.
 
The NLeSc PathFinder grant proposal that I co-wrote is accepted. In the proposal we describe a system for monitoring shifts in vocabulary over time.
For example, in the 1950s people used to say automobile, whereas nowadays everyone would use the word car. It's the same concept, but the vocabulary has changed. Another nice example is the Dutch word propaganda. In the 1950s, this used to refer to commercial activities like advertising, whereas nowadays, in Dutch, one would use the word reclame.

The algorithms I developed to monitor changes in vocabulary over time will be implemented in a tool that discloses a corpus of digitized historical Dutch newspapers (covering the last four centuries) used by digital humanities scholars.
 

Nice! My paper called Evaluating Document Filtering Systems over Time with Krisztian Balog and Maarten de Rijke was accepted for the IPM special issue on Time and IR. PDF will follow soon.
 
The abstract Concepts Through Time: Tracing Concepts In Dutch Newspapers Discourse (1890-1990) Using Word Embeddings which I co-wrote with Melvin Wevers and Pim Huijnen is accepted for Digital Humanities 2015 (DH2015) in Sydney, Australia.
 
Here are some very simple and high-level slides on word2vec that I made for a reading group in our group. Nothing special, just what it is (not) and what it is used for.

 
Last year, I participated in the Cumulative Citation Recommendation task (CCR) of the Knowledge Base Acceleration (KBA) track of the Text REtrieval Conference, TREC 2013. This is the notebook paper describing our approach.
 
Today I presented my work on "Time-Aware Chi-squared for Document Filtering over Time" at CLIN24 in Leiden. This is largely the same presentation I held earlier at the TAIA workshop at SIGIR 2013 in Dublin and at TREC 2013 in Gaithersburg.
Just in case anyone is interested, here are the slides.
 
Presented my poster at ICT.OPEN 2013.


 

Nice! My abstract for CLIN24, called "Time-Aware Chi-squared for Document Filtering over Time" is accepted for presentation.