Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Learning Patent Speak: Investigating Domain-Specific Word Embeddings

Here, we publish word embeddings trained on more than 38 billion tokens of patent documents as part of our paper "Learning Patent Speak: Investigating Domain-Specific Word Embeddings", published at the International Conference on Digital Information Management (ICDIM) 2018. An extended version of the conference paper, titled "Domain-Specific Word Embeddings for Patent Classification", was published in the journal Data Technologies and Applications.

  • 100-dimensional embeddings (.vec file) Link
  • 100-dimensional embeddings (.bin file) Link
  • Embeddings with 200 and 300 dimensions can be provided upon request
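
The .vec file is assumed to follow the standard word2vec/fastText text format: a header line with the vocabulary size and dimensionality, then one word and its vector per line. The following sketch parses that format and finds nearest neighbors by cosine similarity; the inline toy snippet merely stands in for the real file.

```python
import math

# Toy stand-in for the published .vec file, assumed to use the standard
# word2vec/fastText text format: "vocab_size dim" header, then one word
# followed by its vector components per line.
VEC_TEXT = """4 3
patent 1.0 0.0 0.0
invention 0.9 0.1 0.0
claim 0.0 1.0 0.0
apple 0.0 0.0 1.0"""

def load_vec(text):
    """Parse word2vec text format into a {word: vector} dict."""
    lines = text.strip().split("\n")
    vocab_size, dim = map(int, lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == vocab_size
    return vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, vectors):
    """Return the closest other word by cosine similarity."""
    v = vectors[word]
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(v, vectors[w]))

vectors = load_vec(VEC_TEXT)
print(nearest("patent", vectors))  # "invention" in this toy example
```

For the real file, the same format can also be loaded with standard tooling such as gensim's `KeyedVectors.load_word2vec_format` (use `binary=True` for the .bin variant).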

If you use our work, please cite our paper:

@article{risch2019domainspecific,
  author = {Risch, Julian and Krestel, Ralf},
  journal = {Data Technologies and Applications},
  number = 1,
  pages = {108--122},
  title = {Domain-specific word embeddings for patent classification},
  volume = 53,
  year = 2019
}

Visualization

For visualization purposes, we created a subset of 10,000 words and their vectors. This subset can be loaded into TensorFlow's Embedding Projector to explore the embedding space interactively (search for words and display their nearest neighbors):

  • Web browser-based visualization of the embeddings Link
  • 10,000 words and their vectors to download for your own visualization purposes Link 
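
The Embedding Projector accepts two tab-separated files: one with a vector per line and one with the corresponding word per line. A minimal sketch of that conversion, with toy rows standing in for the downloadable 10,000-word subset:

```python
# Toy rows standing in for the 10,000-word subset; each entry pairs a
# word with its embedding vector.
ROWS = [
    ("patent", [1.0, 0.0, 0.0]),
    ("invention", [0.9, 0.1, 0.0]),
    ("claim", [0.0, 1.0, 0.0]),
]

def to_projector_tsv(rows):
    """Return (vectors_tsv, metadata_tsv) strings in the format the
    TensorFlow Embedding Projector expects: tab-separated vector
    components, and one word per line, in matching order."""
    vectors_tsv = "\n".join(
        "\t".join(str(x) for x in vec) for _, vec in rows)
    metadata_tsv = "\n".join(word for word, _ in rows)
    return vectors_tsv, metadata_tsv

vectors_tsv, metadata_tsv = to_projector_tsv(ROWS)
# Save e.g. as vectors.tsv and metadata.tsv, then upload both via
# "Load data" at https://projector.tensorflow.org
print(metadata_tsv.splitlines()[0])  # patent
```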

Abstract

A patent examiner needs domain-specific knowledge to classify a patent application according to its field of invention. Standardized classification schemes help to compare a patent application to previously granted patents and thereby check its novelty. Due to the large volume of patents, automatic patent classification would be highly beneficial to patent offices and other stakeholders in the patent domain. However, a challenge for the automation of this costly manual task is the patent-specific language use. To facilitate this task, we present domain-specific pre-trained word embeddings for the patent domain. We trained our model on a very large dataset of more than 5 million patents to learn the language use in this domain. We evaluated the quality of the resulting embeddings in the context of patent classification. To this end, we propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings. Experiments on a standardized evaluation dataset show that our approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches.
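
The classification approach described above can be illustrated with a minimal forward pass: each token is mapped to its embedding, a gated recurrent unit consumes the sequence, and the final hidden state is projected to class probabilities. All dimensions, weights, and the class count below are made-up toy values for illustration; the paper's actual architecture and hyperparameters may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, illustrative only (the paper uses 100-dimensional
# embeddings trained on patents and its own classifier settings).
vocab_size, embed_dim, hidden_dim, num_classes = 50, 100, 16, 8

E = rng.normal(size=(vocab_size, embed_dim))  # embedding matrix
Wz, Wr, Wh = (rng.normal(size=(hidden_dim, embed_dim)) * 0.1 for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(hidden_dim, hidden_dim)) * 0.1 for _ in range(3))
Wo = rng.normal(size=(num_classes, hidden_dim)) * 0.1  # output projection

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_classify(token_ids):
    """Run a single-layer GRU over the embedded tokens and return a
    softmax distribution over classes from the final hidden state."""
    h = np.zeros(hidden_dim)
    for t in token_ids:
        x = E[t]
        z = sigmoid(Wz @ x + Uz @ h)           # update gate
        r = sigmoid(Wr @ x + Ur @ h)           # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
        h = (1 - z) * h + z * h_tilde          # blended hidden state
    logits = Wo @ h
    probs = np.exp(logits - logits.max())      # stable softmax
    return probs / probs.sum()

probs = gru_classify([3, 17, 42, 8])  # a toy token-id sequence
```

In practice the embedding matrix would be initialized from the published patent embeddings rather than at random, which is the evaluation setting the abstract describes.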

Project-Related Publications

  1. Risch, J., Alder, N., Hewel, C., Krestel, R.: PatentMatch: A Dataset for Matching Patent Claims & Prior Art. Proceedings of the 2nd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech@SIGIR) (2021).
  2. Risch, J., Alder, N., Hewel, C., Krestel, R.: PatentMatch: A Dataset for Matching Patent Claims with Prior Art. ArXiv e-prints 2012.13919 (2020).
  3. Risch, J., Garda, S., Krestel, R.: Hierarchical Document Classification as a Sequence Generation Task. Proceedings of the Joint Conference on Digital Libraries (JCDL). pp. 147–155 (2020).
  4. Risch, J., Krestel, R.: Domain-Specific Word Embeddings for Patent Classification. Data Technologies and Applications. 53, 108–122 (2019).
  5. Risch, J., Krestel, R.: Learning Patent Speak: Investigating Domain-Specific Word Embeddings. Proceedings of the Thirteenth International Conference on Digital Information Management (ICDIM). pp. 63–68 (2018).