Description
Word2Vec is a widely used model for training word embeddings, and numerous variants of it incorporate sub-word information with the goal of improving the semantic expressiveness of the embeddings. Most existing implementations of these models face a trade-off between training speed and computational flexibility. Those that prioritize speed are usually written in statically typed, compiled languages (often reusing the original Word2Vec code) and implement gradient computation and parameter updates by hand (e.g., FastText and the Character-enhanced Word Embedding (CWE) model). Those that prioritize flexibility build models with a framework in a high-level language (e.g., PyTorch), sacrificing training speed (e.g., the Distilled Sentence Embedding (DSE) model). This study addresses this issue by proposing a computational framework named Easy-embs, in which the key vector-based numeric operations are implemented in a compiled language, so that relatively complex sub-word-incorporated embedding models can be built on top of them with little effort. The ultimate goal is to let researchers implement flexible architectures while maintaining fast training speed. In our tests, Easy-embs achieves strong speed performance and comparable performance on intrinsic evaluation tasks for embeddings. Along with Easy-embs, we propose a new model, Substring Embedding with Attention (SEA). We evaluate the model on the CA8 dataset and a Wikipedia dump, achieving word analogy accuracies of 0.482 / 0.435 on CA8, higher than the CWE model's 0.431 / 0.358.
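
To make the idea concrete, below is a minimal Python sketch of the general pattern behind sub-word models with attention, such as SEA: a word vector is composed as an attention-weighted sum of its substring (character n-gram) vectors. The function names, the n-gram extraction, and the particular attention form are assumptions for illustration only; the description above does not specify SEA's exact architecture, and in Easy-embs such vector operations would be handled by the compiled-language kernels rather than NumPy.

```python
# Illustrative sketch only: composing a word vector from substring vectors
# via softmax attention. All names and the attention form are hypothetical.
import numpy as np

def char_ngrams(word, n_min=1, n_max=3):
    """All character n-grams of the word within the given length range."""
    return [word[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]

def compose_word_vector(word, sub_vecs, attn_vecs, dim=100, rng=None):
    """Attention-weighted sum of substring vectors (assumed scheme).

    sub_vecs / attn_vecs: dicts mapping substrings to value / key vectors;
    unseen substrings are initialized on the fly for this toy example.
    """
    rng = rng or np.random.default_rng(0)
    grams = char_ngrams(word)
    for g in grams:
        sub_vecs.setdefault(g, rng.normal(scale=0.1, size=dim))
        attn_vecs.setdefault(g, rng.normal(scale=0.1, size=dim))
    V = np.stack([sub_vecs[g] for g in grams])    # (n_grams, dim) value vectors
    K = np.stack([attn_vecs[g] for g in grams])   # (n_grams, dim) key vectors
    query = V.mean(axis=0)                        # stand-in query vector
    scores = K @ query                            # raw attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over substrings
    return weights @ V                            # (dim,) composed word vector

sub_vecs, attn_vecs = {}, {}
vec = compose_word_vector("embedding", sub_vecs, attn_vecs)
print(vec.shape)  # (100,)
```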