Wikipedia category embeddings - Node2Vec, Poincare, Elmo

  • Shubhanshu Mishra (Creator)

Dataset

Description

Wikipedia category tree embeddings based on the Wikipedia SQL dump dated 2017-09-20 (<a href="https://archive.org/download/enwiki-20170920">https://archive.org/download/enwiki-20170920</a>), created using the following algorithms:

* Node2vec
* Poincare embedding
* Elmo model on the category title

The following files are present:

* wiki_cat_elmo.txt.gz (15G) - Elmo embeddings. Format: category_name (spaces replaced with "_") <tab> 300-dimensional space-separated embedding.
* wiki_cat_elmo.txt.w2v.gz (15G) - Elmo embeddings. Format: word2vec format, which can be loaded using Gensim Word2VecKeyedVectors.load_word2vec_format.
* elmo_keyedvectors.tar.gz - Gensim Word2VecKeyedVectors format of Elmo embeddings. Nodes are indexed using
* node2vec.tar.gz (3.4G) - Gensim word2vec model which has the node2vec embedding for each category, identified using its position (starting from 0) in categories.txt
* poincare.tar.gz (1.8G) - Gensim poincare embedding model which has the poincare embedding for each category, identified using its position (starting from 0) in categories.txt
* wiki_category_random_walks.txt.gz (1.5G) - Random walks generated by the node2vec algorithm (https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark), each category identified using its position (starting from 0) in categories.txt
* categories.txt - One category name per line (with spaces). The line number (starting from 0) is used as the category ID in many other files.
* category_edges.txt - Category edges based on category names (with spaces). Format: from_category <tab> to_category
* category_edges_ids.txt - Category edges based on category IDs, each category identified using its position (starting from 1) in categories.txt
* wiki_cats-G.json - NetworkX format of the category graph, each category identified using its position (starting from 1) in categories.txt
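
The file formats above can be read with the standard library alone. The following is a minimal sketch, assuming the files are UTF-8 encoded and follow the layout described in the list; the function names are hypothetical and are not part of the dataset's own code. Because the description gives IDs starting from 0 for some files and from 1 for others, the lookup takes an explicit `offset` instead of assuming one base.

```python
import gzip
from typing import Iterator, List, Tuple


def parse_embedding_line(line: str) -> Tuple[str, List[float]]:
    """Split one line 'category_name<tab>v1 v2 ...' into (name, vector)."""
    name, values = line.rstrip("\n").split("\t", 1)
    return name, [float(x) for x in values.split(" ") if x]


def iter_embeddings(path: str) -> Iterator[Tuple[str, List[float]]]:
    """Stream (name, vector) pairs from a gzip-compressed embedding file.

    Streaming avoids holding the whole multi-gigabyte file in memory.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_embedding_line(line)


def load_categories(path: str) -> List[str]:
    """Read categories.txt; list index i corresponds to line number i (from 0)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def category_for_id(categories: List[str], category_id: int, offset: int = 0) -> str:
    """Look up a category name by ID.

    Use offset=0 for files whose IDs start from 0 and offset=1 for files
    whose IDs start from 1 (an assumption based on the descriptions above).
    """
    index = category_id - offset
    if index < 0 or index >= len(categories):
        raise IndexError(f"category id {category_id} out of range")
    return categories[index]
```

For example, `category_for_id(load_categories("categories.txt"), 0)` would return the name on the first line of categories.txt, and category names in the Elmo text file can be matched to it by replacing "_" with a space.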



Software used:

* <a href="https://github.com/napsternxg/WikiUtils">https://github.com/napsternxg/WikiUtils</a> - Processing sql dumps
* <a href="https://github.com/napsternxg/node2vec">https://github.com/napsternxg/node2vec</a> - Generate random walks for node2vec
* <a href="https://github.com/RaRe-Technologies/gensim">https://github.com/RaRe-Technologies/gensim</a> (version 3.4.0) - Generate node2vec embeddings from the random walks generated using the node2vec algorithm
* <a href="https://github.com/allenai/allennlp">https://github.com/allenai/allennlp</a> (version 0.8.2) - Generate elmo embeddings for each category title


Code used:
* wiki_cat_node2vec_commands.sh - Commands used to
* wiki_cat_generate_elmo_embeddings.py - generate elmo embeddings
* wiki_cat_poincare_embedding.py - generate poincare embeddings

Date made available: Jul 8 2019
Publisher: University of Illinois Urbana-Champaign

Keywords

  • Embeddings
  • Wikipedia Category Tree
  • Poincare
  • Node2Vec
  • Elmo
  • Wikipedia
