A Software Code Infringement Detection Scheme Based on Integration Learning

Qin, Meng

doi:10.3233/ATDE231264

Abstract

We design a software code plagiarism detection scheme based on ensemble learning to address the low accuracy of traditional software code infringement detection methods based on abstract syntax trees (ASTs). The scheme builds on the AST structure of the code, integrating domain partitioning in IR with the AST, and uses a weighted, simplified AST to design feature extraction and similarity calculation methods. This enables partial detection of semantic plagiarism and the calculation of similarity between text and source code. The feature set of a training set with known classifications is then used to train a random forest based ensemble classifier, and an association between the error rate and the classification performance of the decision trees in the random forest is proposed in order to match feature nodes against the features in the code base. Experimental results show that our scheme achieves higher accuracy than traditional AST-based detection methods. It not only detects code similarity but also reports the type of plagiarism, giving it better overall identification performance.
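The feature extraction and similarity steps can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Python source as input, uses the standard `ast` module, and the node weights, the set of ignored node types, and the choice of cosine similarity are all hypothetical stand-ins for the weighted simplified AST described above. The random forest classification stage is not shown.

```python
import ast
import math
from collections import Counter

# Hypothetical weights (an assumption, not from the paper):
# control-flow and definition nodes count more than other nodes.
NODE_WEIGHTS = {"FunctionDef": 3.0, "For": 2.0, "While": 2.0, "If": 2.0, "Return": 1.5}

# Node types dropped to "simplify" the tree (also an assumption):
# identifiers, literals, and context markers carry little structural information.
IGNORED = {"Module", "Name", "Constant", "Load", "Store", "Del", "arg", "arguments"}


def weighted_features(source: str) -> Counter:
    """Map source code to a weighted count of its AST node types."""
    features = Counter()
    for node in ast.walk(ast.parse(source)):
        kind = type(node).__name__
        if kind in IGNORED:
            continue
        features[kind] += NODE_WEIGHTS.get(kind, 1.0)
    return features


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
# Same structure with every identifier renamed, a common plagiarism disguise.
renamed = "def acc(values):\n    result = 0\n    for v in values:\n        result += v\n    return result\n"
unrelated = "print('hello')\n"

sim_renamed = cosine_similarity(weighted_features(original), weighted_features(renamed))
sim_unrelated = cosine_similarity(weighted_features(original), weighted_features(unrelated))
```

Because identifier names are discarded, the renamed copy yields the same feature vector as the original, while the unrelated snippet shares no node types with it. In a pipeline like the one the abstract describes, vectors of this kind would serve as input to the ensemble classifier.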
