SEMILAR: A Semantic Similarity Toolkit

LSA models generated from whole Wikipedia articles and the TASA corpus

If you use these models, please cite the following paper:

Stefanescu, D., Banjade, R., and Rus, V. (2014). Latent Semantic Analysis Models on Wikipedia and TASA. The 9th Language Resources and Evaluation Conference (LREC 2014), 26-31 May, Reykjavik, Iceland. Available here

The online demo for word-to-word similarity is available here.


Download LSA models

Every zip file contains two files: “lsaModel” and “voc”. Each line of the “voc” file holds a single word, and each line of the “lsaModel” file holds a 300-dimensional vector in the latent space. The lines of the two files correspond one to one, so the vector on the i-th line of the “lsaModel” file belongs to the word on the i-th line of the “voc” file. The similarity between two words is usually computed as the cosine similarity between their corresponding vectors.
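The line-by-line pairing and the cosine computation described above can be sketched in Python as follows. This is a minimal illustration, not part of the SEMILAR toolkit: the function names load_lsa_model and cosine_similarity are hypothetical, and the sketch assumes that both files are UTF-8 text and that the components of each vector in “lsaModel” are separated by whitespace. Check a few lines of the unzipped files before relying on either assumption.

```python
import math


def load_lsa_model(voc_path, model_path, encoding="utf-8"):
    """Pair the word on line i of `voc` with the vector on line i of `lsaModel`."""
    vectors = {}
    with open(voc_path, encoding=encoding) as voc, \
            open(model_path, encoding=encoding) as model:
        for word, row in zip(voc, model):
            # Assumption: vector components are whitespace-separated numbers.
            vectors[word.strip()] = [float(x) for x in row.split()]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_u * norm_v)
```

With an unzipped model in the current directory, usage would look like model = load_lsa_model("voc", "lsaModel") followed by cosine_similarity(model["cat"], model["dog"]), provided both words occur in the “voc” file. Holding every vector in a Python dictionary is only practical for the smaller archives; for the largest models a memory-mapped or array-based representation is likely a better fit.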

For details on the downloadable LSA models listed below, please see the LREC 2014 paper.

Wikipedia models: Wiki 1.zip (2.64 GB)  Wiki 2.zip (179 MB)   Wiki 3.zip (192 MB)  Wiki 4.zip (167 MB)  Wiki 5.zip (147 MB)  Wiki 6.zip (127 MB) 

TASA models: TASA 1.zip (143 MB)  TASA 2.zip (78.7 MB)   TASA 3.zip (85.5 MB)  TASA 4.zip (60.7 MB)  TASA 5.zip (64.3 MB)  TASA 6.zip (184 MB)  TASA 7.zip (185 MB) 


Any problems? E-mail Rajendra Banjade at rbanjade@ memphis.edu and Dr. Vasile Rus at vrus @ memphis. edu.