SEMILAR: A Semantic Similarity Toolkit

Frequently Asked Questions

What are the similarity methods available in SEMILAR?

SEMILAR API offers a variety of similarity methods based on WordNet (Fellbaum, 1998), Latent Semantic Analysis (LSA; Landauer et al., 1998), Latent Dirichlet Allocation (LDA; Blei et al., 2003), BLEU (Papieni et al., 2002), Meteor (Denkowski and Lavie, 2011), Pointwise Mutual Information (PMI) (Church et al., 1990), methods that use syntactic dependencies, optimized lexico-syntactic matching methods based on Quadratic Assignment, methods that incorporate negation handling, etc. Some methods have their own variations which, coupled with parameter settings and user's selection of preprocessing steps, could result in a huge methods space.

What is the granularity of similarity methods?

SEMILAR contains methods to measure the semantic similarity at word level, i.e. word-to-word measures, sentence level or sentence-to-sentence measures, paragraph-to-paragraph measures, and document-to-document measures. In addition, the methods can be applied to compute the semantic similarity of texts of different granularities such as word-to-sentence similarity, sentence-to-document similarity, etc. For instance, in summarization a paragraph-to-document similarity measure could be useful to compare the summary paragraph to the source document. Please note that in order to expand from word-to-word similarity measures to larger texts such as sentence-to-sentence some methods require a special, additional step whereas some other methods are directly applicable between texts of any granularity. For example, there are variations of LDA-based similarity methods that work at word-level and others that can be used to compute sentence-level similarity. Expanding the word-level LDA-based measures to sentence level require a special step. On the other hand, LSA can be directly used to compute the similarity of two words or two sentences by simply relying on the same cosine similarity operation between two LSA vectors (an LSA vector must be obtained for a sentence using vector algebra, i.e. simply adding up the individual words' LSA vectors).

What are the word-to-word similarity methods available in SEMILAR?

Please go to word-to-word similarity section. Note that there are sentence level or larger text level similarity measures that are expanded from word-to-word similarity measures. Also, keep in mind that some similarity methods are resource-demanding as they are backed by large data-driven models (e.g. LSA, LDA). In order to reduce the memory requirements, we recommend employing them in separate runs.

Where can I find the details about the methods/algorithms available in SEMILAR?

This document only describes the API of the most important methods implemented in SEMILAR. We do not have (yet) a single document describing in detail all the methods and algorithms available in SEMILAR. We offer a comprehensive list of references to the original publications that introduced the various methods. Please find the reference section for more details.

Similarity and Relatedness are quite different things. Have you categorized them?

Though similarity and relatedness are quite different concepts, we refer them as similarity in general. Some of the methods measure similarity more than others, but all of them can be considered measures of relatedness. We should refer to their descriptions in details to characterize them. Corpus based models usually measure the relatedness.

I am not just doing word-to-word or sentence-to-sentence similarity research, my research is on text classification, clustering, text mining, information retrieval (or something related), machine translation evaluation, etc. How can I best utilize this tool?

Certainly there are many ways of using word-to-word, sentence-to-sentence similarity or relatedness measures in information retrieval, text mining, clustering, classification, machine translation evaluation, etc. You may consider using word-to-word and/or sentence-to-sentence level similarity to document level or use document level similarity methods (The currently available document level similarity method is LDA based method. Work is on progress; please keep on visiting SEMILAR website for the recent updates) available in SEMILAR API itself. The typical use of word-to-word similarity method for document similarity would be to pair up the words from two documents and normalize the score (such as dividing the score by document lengths).

What are the recent updates to the SEMILAR API?

This document covers the methods available in the SEMILAR API as of June 2013. Please visit for more details and latest updates.

Which programming language is used to create SEMILAR API?

Java (jdk 1.7). And we plan to continue the development in this language.

How large is the SEMILAR package?

It is a large library and application because it relies on large models and packages. Most of the basic NLP tools on which we rely come with big models such as the Stanford NLP or the Open NLP parser models. Other resources are relatively large as well (a couple of hundred MBs) such as the WordNet lexical database. In addition, our similarity methods also require pre-built models. For example, LSA spaces, LDA models, and Wikipedia PMI data are large components by themselves. If a user wants to utilize selective methods, there is no need to download everything. SEMILAR can be downloaded in separate zipped files for ease of customization and setup that fits various needs. The whole SEMILAR package, which includes everything, has about 1.1 Gb.

Which corpora or data sets are needed?

We have generated LSA spaces and LDA models using the TASA corpus and the Wikipedia (early spring 2013 version) corpus, whereas for PMI calculations we have used only Wikipedia. These models can be downloaded from the SEMILAR website. The user may generate new models based on different corpora, with different preprocessing steps, or other settings. The SEMILAR API allows the user to specify new, non-default model names and paths. Please see the corpora details section for more details about the corpora we are using.

I want to generate and use LSA/LDA models using different corpora or requirements. Is it possible to generate and use my models in SEMILAR?

Yes, you can develop LSA/LDA models using your own corpus. But you have to follow the format of the model files, and certain file naming requirements for your model name, etc. We also offer a facility to generate LSA/LDA models (in progress; please see the section Using LDA for more details on creating LDA models and the References section about the tools we are using). It is recommended to generate models using the SEMILAR API as the output follows the format used by the other SEMILAR components.

What preprocessing steps are usually executed and what are the tools you are using?

Some similarity methods require certain kinds of preprocessing such as POS tagging, parsing, stemming/lemmatization, etc. Tokenization is needed by all methods. The SEMILAR preprocessor gives users the option of selecting either StanfordNLP or OpenNLP tools for tokenization, tagging, and parsing. For stemming, Porter's stemmer or WordNet based stemmer (the latter guarantees the stem is an actual word, i.e. dictionary entry) are available.

Can I skip preprocessing or do it myself?

You may preprocess your texts without using the SEMILAR preprocessor but your responsibility would be to create certain objects and populate corresponding field values in these objects. You may not need to do any preprocessing or just do the basics to use the semantic similarity methods. Please check the preprocessing requirements for particular methods you may want to use.

How much time consuming methods are there in SEMILAR?

Most of the methods are quite fast. Some optimization methods that also rely on syntactic or other types of information may be slower. If you want to run it in a typical workstation, then running different methods together can be quite heavy or may not work. Please consider running it in a more powerful machine where you can run multiple methods together.

How much memory is consumed?

It depends on the particular method. Most of the implemented methods should work on regular desktop and laptop computers (having at least 6GB memory). Running many methods together can be problematic in the regular PC.

Which WordNet version is used?

WordNet 3.0.

Can SEMILAR be used for languages other than English?

Similarity measures that are available in SEMILAR have been developed for English and there are no models included for languages other than English. But it is possible to adapt the methods to other languages. For instance, you can generate LSA/LDA models using texts from a target, other-than-English language (remember that you can develop LSA/LDA models using interface/functions available in SEMILAR API itself) and then use the SEMILAR LSA and LDA similarity measures to compute the similarity of texts in the target language.

Where can I find references and citation information?

Please find below the sections - SEMILAR Citation info and References for more details.

What are the licensing terms of using SEMILAR?

The SEMILAR library and application are free to use for non-commercial, academic and research purposes. Please check the Licensing terms on the SEMILAR website at Please note that we provide licensing information about third-party components that are being used in SEMILAR in the reference section in on the website at Complying with the SEMILAR licensing terms implies complying with license agreements issued by the third parties for the corresponding components included in SEMILAR. Please read the license agreement first before downloading and installing SEMILAR.

Are there any examples for quick start using it?

Yes. Example code is available in the SEMILAR package in the extracted SEMILAR folder. There are different methods, options, possible values of parameters, different models you can choose from, etc. You have to go through this guide and latest information from the website to make best use of SEMILAR.

I want to run all methods at once, is it possible?

Yes, it is. Some of the preprocessing tools and similarity methods have to load huge models in memory. You can probably run all other methods at once if your machine has at least 8 GB of memory.

What platforms (OS) does it support?

SEMILAR API can be used in Linux and Windows. Jdk 1.7 or higher is required.

I have encountered some error or exception, how can I diagnose?

We try to provide as much support as possible. Please provide enough details when you encounter any issues. It is possible that errors can be caused by missing required files, misplaced folders (especially after extracting the zipped files), misspelling, incorrect input format, corrupted downloaded file, etc. What is the similarity score if the given word(s) is not in the vocabulary? For example: Your LSA model may not have some of the words I would like to see the similarity score for. We are scrutinizing all the methods to make sure that users will not get confused (for example: if a similarity score of zero is returned for a given word pair it may be because one or both of the words are not available in the model). Many of the similarity measures return a value of -9999 in cased when it was not possible to calculate the similarity for the given word or sentence pairs.

I have little background on one or more similarity methods. Rather than description of methods, can I find solid steps (or examples) to get the results?

We recommend users understand the fundamentals of each method before using their implementations available in SEMILAR. To make it easy (at the extent this is possible) for users to start using the SEMILAR methods with minimal leading effort, we have different resources including this manual and the package including example codes.

I have few questions, issues about using it, can I get some help?

Yes sure. Please feel free to write to Rajendra Banjade ( and Dr. Vasile Rus (