Comparison of Latent Semantic Analysis (LSA) and Doc2Vec Algorithms of Thesis Similarity Detection

Authors

  • Rita Wahyuni Arifin, Universitas Bina Insani
  • Mardi Yudhi Putra, Universitas Bina Insani
  • Dwi Ismiyana Putri, Universitas Bina Insani

DOI:

https://doi.org/10.33558/piksel.v12i2.9954

Keywords:

Algorithm, Doc2Vec, Latent Semantic Analysis, Plagiarism Detection, CRISP-DM

Abstract

This study aims to develop a system that detects similarity in thesis titles and content, in order to prevent plagiarism and support student originality. The high level of similarity among final projects is a significant concern in academic environments. Two text vectorization methods, Latent Semantic Analysis (LSA) and Doc2Vec, were compared as the basis for measuring document similarity. LSA produced a very high cosine similarity (99.94%), which the authors attribute to dimensionality reduction that preserves semantic correlations. In contrast, Doc2Vec produced lower similarity scores: 7.17% for the Distributed Memory variant (PV-DM) and 39.07% for the Distributed Bag of Words variant (PV-DBOW), which the authors interpret as indicating richer text representations. The study followed the CRISP-DM (Cross-Industry Standard Process for Data Mining) model, which includes Business Understanding, Data Understanding, Data Preparation, Modelling, and Evaluation. The resulting model is expected to strengthen academic integrity and encourage valuable scientific contributions.
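The similarity percentages reported in the abstract are cosine similarities between document vectors. As a minimal sketch of that final comparison step only, the Python below computes cosine similarity over plain term-count vectors. This is a stand-in representation, not LSA or Doc2Vec, and not the authors' implementation; the function names and the three thesis titles are hypothetical, invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical thesis titles (assumptions, not data from the study).
title_a = "thesis similarity detection using latent semantic analysis"
title_b = "thesis similarity detection using doc2vec"
title_c = "inventory information system for a retail store"

sim_ab = cosine_similarity(bag_of_words(title_a), bag_of_words(title_b))
sim_ac = cosine_similarity(bag_of_words(title_a), bag_of_words(title_c))

# Report as percentages, the form used in the abstract.
print(f"a vs b: {sim_ab * 100:.2f}%")
print(f"a vs c: {sim_ac * 100:.2f}%")
```

In a pipeline like the one the abstract describes, the term-count vectors would presumably be replaced by the LSA or Doc2Vec vectors of each document, with the cosine step left unchanged. Titles that share no terms score zero under raw counts, which is one motivation for using semantic representations instead.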

Published

2024-09-30

How to Cite

Arifin, R. W., Putra, M. Y., & Putri, D. I. (2024). Comparison of Latent Semantic Analysis (LSA) and Doc2Vec Algorithms of Thesis Similarity Detection. PIKSEL : Penelitian Ilmu Komputer Sistem Embedded and Logic, 12(2), 425–434. https://doi.org/10.33558/piksel.v12i2.9954