DOI: 10.14489/vkit.2021.10.pp.032-039

Tikhonov N. I.
PAPER2VEC AND CITE2VEC METHODS FOR ANALYZING COLLECTIONS OF SCIENTIFIC PUBLICATIONS
(pp. 32-39)

Annotation. Procedures for visualizing collections of scientific publications are used to better understand data sets and to form certain assessments. Various methods of text collection analysis can be used to build such visualizations. The article considers the Paper2vec and Cite2vec analysis methods, which use citation information to obtain vector representations of documents. Visualization procedures are described to demonstrate how the methods work.

Abstract. Collections of scientific publications are growing rapidly, and scientists have access to portals containing large numbers of documents. Such volumes of data are difficult to investigate. Document visualization methods are used to reduce labor costs, to search for relevant and similar documents, to evaluate the scientific contribution of particular publications, and to reveal hidden links between documents. These methods can be based on various models of document representation. In recent years, word embedding methods for natural language processing have become extremely popular, and methods for analyzing text collections that produce vector representations of documents have followed. Although many document analysis systems exist, new methods can offer new insight into collections, perform better on large collections of documents, or find new relationships between documents. This article discusses two methods, Paper2vec and Cite2vec, that obtain vector representations of documents using citation information. It briefly describes the two methods and reports experiments with them, including visualizations of their results and the problems that arise.

Keywords: Visualization of document collections; Vector representation of documents; Citation networks; Citation contexts.

N. I. Tikhonov (Novosibirsk State University, Novosibirsk, Russia)

This article is available in electronic format (PDF).

The cost of a single article is 450 rubles (including 18% VAT). Within a few days of placing an order, you will receive an invoice and a bank payment receipt at the e-mail address you specified.

Once the payment reaches the publisher's account, the article file will be sent to you by e-mail.

To order, copy the article DOI:

10.14489/vkit.2021.10.pp.032-039

and fill out the order form.