SimLex-999 is a gold standard resource for the evaluation of models that learn the meaning of words and concepts.
SimLex-999 provides a way of measuring how well models capture similarity, rather than relatedness or association. The scores in SimLex-999 therefore differ from other well-known evaluation datasets such as WordSim-353 (Finkelstein et al. 2002). The following two example pairs illustrate the difference - note that clothes are not similar to closets (different materials, function etc.), even though they are very much related:
|Pair||Simlex-999 rating||WordSim-353 rating|
|coast - shore||9.00||9.10|
|clothes - closet||1.96||8.00|
Our experiments indicate that SimLex-999 is challenging for computational models to replicate because, in order to perform well, they must learn to capture similarity independently of relatedness/association. This is hard because most language-based representation-learning models infer connections between concepts from their co-occurrence in corpora, and co-occurrence primarity reflects relatedness not similarity.
In addition to general-purpose evaluations of semantic models, SimLex-999 is structured to facilitate focused evaluations based around the following conceptual distinctions:
Download SimLex-999 by clicking here. All design details are outlined in the following paper (click to access). Please cite it if you use SimLex-999:
SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. 2014. Felix Hill, Roi Reichart and Anna Korhonen. Computational Linguistics. 2015
Contact Felix Hill (email@example.com) if your questions are not addressed in the paper.
SimLex-999 is now in German, Italian and Russian thanks to Ira Leviant and Roi Reichart. See this page for more information.
The well-known Skipgram (Word2Vec) model trained on 1bn words of Wikipedia text achieves a Spearman Correlation of 0.37 with SimLex-999 .
The best performance of a model trained on running monolingual text is a Spearman Correlation of 0.56 .
A Neural Machine Translation Model (En->Fr) trained on a relatively small bilingual corpus achieves a Spearman Correlation of 0.52 .
A model that exploits curated knowledge-bases (WordNet, Framenet etc) can reach a Spearman Correlation of 0.58 .
NEW: A model that uses rich paraphrase data for training can reach a Spearman Correlation of 0.68 .
NEWER: A hybrid model trained on features from various word embeddings and two lexical databases achieves a Spearman Correlation of 0.76 .
NEWERER: Counterfitting works well .
The average pairwise Spearman correlation between two human raters is 0.67. However, it may be fairer to compare the performance of models with the average correlation of a human rater with the average of all the other raters. This number is 0.78.
Please email firstname.lastname@example.org if you know of better performing models.
 Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. arXiv preprint arXiv:1301.3781. 2013.
 Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction. Roy Schwarz, Roi Reichart and Ari Rappoport, CoNLL 2015.
 Embedding Word Similarity with Neural Machine Translation. Felix Hill, KyungHyun Cho, Sebastien Jean, Coline Devin and Yoshua Bengio. ICLR. 2015.
 Non-Distributional Word Vector Representations. Manaal Faruqui and Chris Dyer. ACL. 2015.
 From Paraphrase Database to Compositional Paraphrase Model and Back John Weiting, Mohit Bansal, Kevin Gimpel, Karen Livescu and Dan Roth. TACL 2015.
 Measuring semantic similarity of words using concept networks. Gabor Recski and Eszter Iklodi and Katalin Pajkossy and Andras Kornai. To appear in RepL4NLP 2016.
 Measuring semantic similarity of words using concept networks. Nikola Mrksic et al. Counter-fitting Word Vectors to Linguistic Constraints. EMNLP 2016.
SimLex-999 was produced by mining the opinions of 500 annotators via Amazon Mechanical Turk. See below for annotator instructions.