The very informative tutorial by Vlad Niculae on Word Mover’s Distance in Python includes this step:
"We could train the embeddings ourselves, but for meaningful results we would need tons of documents, and that might take a while. So let's just use the ones from the word2vec team."
I couldn’t have asked for a better justification for ConceptNet and Luminoso in two sentences.
When we present new results from Conceptnet Numberbatch, which works much better than word2vec alone, one objection we hear is that the embeddings are pre-computed and aren't based on your data. (Luminoso is a SaaS platform that retrains them to your data, in the cases where you do need that.)
Pre-baked embeddings are useful. People are resigning themselves to using word2vec's pre-baked embeddings because they don't know they can have better ones. I dream of the day when someone writing a new tutorial like this says "So let's just use Conceptnet Numberbatch."
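For anyone who wants to try that swap: Numberbatch is distributed in the same plain word2vec text format, so the loading step of a tutorial barely changes. Here is a minimal sketch of a loader with no third-party dependencies; the tiny inline file and its three-dimensional vectors are made up for illustration, standing in for a real downloaded Numberbatch file.

```python
import io
import math

# Hypothetical stand-in for a real embeddings file. The word2vec text
# format (which Numberbatch also uses) starts with a "count dimensions"
# header, followed by one "word v1 v2 ... vN" line per entry.
FAKE_EMBEDDINGS = """\
2 3
cat 1.0 0.0 0.0
dog 1.0 1.0 0.0
"""

def load_word2vec_text(fileobj):
    """Parse word2vec-format text into a dict of word -> vector."""
    count, dim = (int(x) for x in fileobj.readline().split())
    vectors = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        assert len(vectors[parts[0]]) == dim
    assert len(vectors) == count
    return vectors

def cosine(u, v):
    """Cosine similarity, the usual comparison for word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = load_word2vec_text(io.StringIO(FAKE_EMBEDDINGS))
print(round(cosine(vectors["cat"], vectors["dog"]), 4))  # 0.7071
```

In a real tutorial you would open the downloaded Numberbatch file instead of the inline string (or hand it to a library loader that understands word2vec format); everything downstream, like Word Mover's Distance, works the same.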