Comparison of Latent Semantic Analysis (LSA) and Doc2Vec Algorithms of Thesis Similarity Detection
DOI:
https://doi.org/10.33558/piksel.v12i2.9954
Keywords:
Algorithm, Doc2Vec, Latent Semantic Analysis, Plagiarism Detection, CRISP-DM
Abstract
This study aims to develop a system for detecting similarities in thesis titles and content to prevent plagiarism and support student originality. The high level of similarity among final projects is a significant concern in academic environments. Two text vectorization methods, Latent Semantic Analysis (LSA) and Doc2Vec, were compared for measuring document similarity. LSA achieved a very high cosine similarity (99.94%), attributed to dimensionality reduction that preserved semantic correlations. In contrast, Doc2Vec produced lower similarity scores, 7.17% for PV-DM and 39.07% for PV-DBOW, interpreted as indicating richer text representations. The study followed the CRISP-DM model, covering the Business Understanding, Data Understanding, Data Preparation, Modelling, and Evaluation phases. The resulting system is expected to strengthen academic integrity and encourage valuable scientific contributions.
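The LSA side of the comparison can be illustrated with a minimal sketch: build a term-document matrix, reduce it with a truncated SVD, and compare documents by cosine similarity in the reduced space. Everything specific here is an assumption made for illustration and is not taken from the study: the three example titles, the whitespace tokenizer, the use of raw term counts (the study may have used TF-IDF or another weighting), and the choice of k = 2 latent dimensions.

```python
import numpy as np

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def term_document_matrix(docs):
    # Rows are documents, columns are vocabulary terms, entries are raw counts.
    vocab = sorted({tok for d in docs for tok in tokenize(d)})
    index = {term: i for i, term in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for row, d in enumerate(docs):
        for tok in tokenize(d):
            X[row, index[tok]] += 1.0
    return X, vocab

def lsa_embed(X, k):
    # Truncated SVD: keep the k largest singular values and project
    # each document onto the corresponding latent dimensions.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero length.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Made-up thesis titles, for illustration only.
titles = [
    "thesis similarity detection using latent semantic analysis",
    "detection of thesis plagiarism using doc2vec",
    "sales forecasting using linear regression",
]

X, vocab = term_document_matrix(titles)
Z = lsa_embed(X, k=2)

# Similarity between the first two titles in the reduced LSA space.
sim = cosine(Z[0], Z[1])
print(f"LSA cosine similarity (titles 0 and 1): {sim:.4f}")
```

For the Doc2Vec side, the gensim library exposes both variants named in the abstract through the `dm` parameter of its `Doc2Vec` class (`dm=1` for PV-DM, `dm=0` for PV-DBOW). Whether the study used gensim is not stated in the abstract.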