SimLex-999 is a gold standard resource for the evaluation of models that learn the meaning of words and concepts.

SimLex-999 provides a way of measuring how well models capture similarity, rather than relatedness or association. The scores in SimLex-999 therefore differ from other well-known evaluation datasets such as WordSim-353 (Finkelstein et al. 2002). The following two example pairs illustrate the difference - note that clothes are not similar to closets (different materials, function etc.), even though they are very much related:

Pair Simlex-999 rating WordSim-353 rating
coast - shore 9.00 9.10
clothes - closet 1.96 8.00

Our experiments indicate that SimLex-999 is challenging for computational models to replicate because, in order to perform well, they must learn to capture similarity independently of relatedness/association. This is hard because most language-based representation-learning models infer connections between concepts from their co-occurrence in corpora, and co-occurrence primarity reflects relatedness not similarity.

In addition to general-purpose evaluations of semantic models, SimLex-999 is structured to facilitate focused evaluations based around the following conceptual distinctions:

Download the Dataset

Download SimLex-999 by clicking here. All design details are outlined in the following paper (click to access). Please cite it if you use SimLex-999:

SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. 2014. Felix Hill, Roi Reichart and Anna Korhonen. Computational Linguistics. 2015

Contact Felix Hill ( if your questions are not addressed in the paper.

NEW: SimLex in Other Languages

SimLex-999 is now in German, Italian and Russian thanks to Ira Leviant and Roi Reichart. See this page for more information.


The well-known Skipgram (Word2Vec) model trained on 1bn words of Wikipedia text achieves a Spearman Correlation of 0.37 with SimLex-999 [1].

The best performance of a model trained on running monolingual text is a Spearman Correlation of 0.56 [2].

A Neural Machine Translation Model (En->Fr) trained on a relatively small bilingual corpus achieves a Spearman Correlation of 0.52 [3].

A model that exploits curated knowledge-bases (WordNet, Framenet etc) can reach a Spearman Correlation of 0.58 [4].

NEW: A model that uses rich paraphrase data for training can reach a Spearman Correlation of 0.68 [5].

NEWER: A hybrid model trained on features from various word embeddings and two lexical databases achieves a Spearman Correlation of 0.76 [6].

NEWERER: Counterfitting works well [7].

The average pairwise Spearman correlation between two human raters is 0.67. However, it may be fairer to compare the performance of models with the average correlation of a human rater with the average of all the other raters. This number is 0.78.

Please email if you know of better performing models.

[1] Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. arXiv preprint arXiv:1301.3781. 2013.

[2] Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction. Roy Schwarz, Roi Reichart and Ari Rappoport, CoNLL 2015.

[3] Embedding Word Similarity with Neural Machine Translation. Felix Hill, KyungHyun Cho, Sebastien Jean, Coline Devin and Yoshua Bengio. ICLR. 2015.

[4] Non-Distributional Word Vector Representations. Manaal Faruqui and Chris Dyer. ACL. 2015.

[5] From Paraphrase Database to Compositional Paraphrase Model and Back John Weiting, Mohit Bansal, Kevin Gimpel, Karen Livescu and Dan Roth. TACL 2015.

[6] Measuring semantic similarity of words using concept networks. Gabor Recski and Eszter Iklodi and Katalin Pajkossy and Andras Kornai. To appear in RepL4NLP 2016.

[7] Measuring semantic similarity of words using concept networks. Nikola Mrksic et al. Counter-fitting Word Vectors to Linguistic Constraints. EMNLP 2016.

Annotator Instructions

SimLex-999 was produced by mining the opinions of 500 annotators via Amazon Mechanical Turk. See below for annotator instructions.