
Duplicate code is a pair of matching or similar code fragments in the source code. Duplicates are typically created when existing code is copied and pasted, for whatever reason.

Not all duplicate code is a problem, but software with a large amount of duplicate code is generally fragile when it is changed or extended. To fix a bug, it is often not enough to fix only the code in which the bug appears. You also need to find every duplicate that was copied from that code and judge whether the same correction is needed in each copy. As the software grows, this task becomes more difficult.

When you copy and paste code, you may forget to make the necessary corrections in the pasted copy. At other times, when one duplicate is edited, all the other duplicates need to be found and modified in the same way, but the modification is sometimes left out of some of them. Missed corrections like these, caused by duplicate code, are called "omission of modification".
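As an illustration of an omission of modification, the following minimal sketch shows two copy-pasted functions where a bug fix reached only one of them. The function names and data shapes are hypothetical and chosen only for this example; Python is used purely for illustration.

```python
# Hypothetical example: two copy-pasted fragments, only one of which was fixed.

def average_order_value(orders):
    """Original fragment, later fixed to handle an empty list."""
    if not orders:  # the bug fix was applied here
        return 0.0
    return sum(o["total"] for o in orders) / len(orders)


def average_refund_value(refunds):
    """Copy of the fragment above; the empty-list fix was never applied."""
    return sum(r["total"] for r in refunds) / len(refunds)


def safe_call(func, items):
    """Return the function's result, or the name of the exception it raised."""
    try:
        return func(items)
    except ZeroDivisionError as exc:
        return type(exc).__name__
```

Calling `safe_call(average_refund_value, [])` reports a `ZeroDivisionError`, while `average_order_value([])` returns `0.0`. The duplicate still carries the bug that was fixed in the original.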

The source code you provide is analyzed entirely in your local environment, and no connections are made to any external servers. Your data will not be exposed.

Siderscan currently supports Java/JavaScript/TypeScript/PHP/C/C++/C#/Ruby/CUDA/

In addition to the above languages, FPGA description languages (extensions: vhd, vhdl, v, sv) and Objective-C (extensions: h, m) are also analyzed. However, since these files are analyzed as C/C++, language-specific syntax cannot be taken into account.

The importance index is derived heuristically and turned into an algorithm based on analysis of our own open-source projects and on user interviews; it is not absolute. In addition, the index is still under development, so its definition and the associated algorithm may change.

The current version factors in the following in order to calculate the importance index:

  • The number of lines in the duplicate: The number of lines in the code block that was considered a duplicate.
  • Similarity score: This shows how many parts of the logic are the same but the strings are different, such as different names for variables and functions.
  • Same file factor: If the code exists in multiple files, it is deemed more important.
  • The complexity of logic: The greater the complexity in the duplicate portion, the more important it is considered. We do this by analyzing its control structure.
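To make the idea of combining these factors concrete, here is a minimal sketch of a weighted score. It is not Siderscan's algorithm: the `Duplicate` fields, the weights, the caps, and the 0.0 to 1.0 similarity scale are all assumptions made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Duplicate:
    line_count: int     # lines in the duplicated block
    similarity: float   # assumed scale: 0.0 (unrelated) to 1.0 (same logic)
    file_count: int     # number of distinct files containing the block
    branch_count: int   # control structures (if/for/while) in the block


def importance(d: Duplicate) -> float:
    """Combine the four factors into one score (illustrative weights only)."""
    size = min(d.line_count / 50.0, 1.0)          # longer blocks score higher, capped
    spread = 1.0 if d.file_count > 1 else 0.5     # duplicates across files score higher
    complexity = min(d.branch_count / 10.0, 1.0)  # more control flow scores higher
    score = 0.3 * size + 0.3 * d.similarity + 0.2 * spread + 0.2 * complexity
    return round(score, 3)
```

Under these assumed weights, a long, branch-heavy duplicate spread across several files ranks above a short, simple one confined to a single file.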

When important duplicate code or an omission of modification is detected, the result is sent by email. You can share the information by forwarding the email or by sending the URL in the email to a team member.

The initial analysis takes time because all source code in the target repository is analyzed. The exact time depends on the total size of the source code and the number of duplicates detected. For example, when the Linux kernel was analyzed on an AWS EC2 t3.medium instance (4 GB RAM, 2-core CPU), the entire analysis took 56 minutes. The second and subsequent analyses usually take less than 10 minutes, since only files that have changed since the previous analysis are analyzed.
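The reason later runs are faster can be sketched as follows. This is only an illustration of selecting changed files by content hash; how Siderscan actually detects changes is not described here, and the function names are hypothetical.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a SHA-256 hex digest of a file's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def files_to_analyze(current: dict[str, str], previous: dict[str, str]) -> list[str]:
    """Return paths whose content is new or changed since the previous run.

    current maps path -> file content.
    previous maps path -> fingerprint recorded by the last run.
    """
    return sorted(
        path
        for path, content in current.items()
        if previous.get(path) != fingerprint(content)
    )
```

On the first run `previous` is empty, so every file is selected; on later runs only new or modified files are returned.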

All plans, including the free plan, include technical support via email and implementation assistance for on-premise environments.

By default, Siderscan excludes directories and folders containing the following strings from the analysis:

test, sample, proto, example, 3rdparty

By default, Siderscan also excludes directories and folders that match the following strings from the analysis:

dist, vendor, vendors, node_modules

Please make sure that your source code is not included in the above directories.
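The difference between the two rule sets can be sketched like this. The sketch assumes that matching is case-sensitive and applied to each directory name in a file's path; those details, and the function name `is_excluded`, are assumptions for illustration and not a description of Siderscan's implementation.

```python
from pathlib import PurePosixPath

# Directory names CONTAINING any of these strings are excluded.
CONTAINS = ("test", "sample", "proto", "example", "3rdparty")
# Directory names that MATCH one of these strings exactly are excluded.
EXACT = ("dist", "vendor", "vendors", "node_modules")


def is_excluded(path: str) -> bool:
    """Check each directory component of a file path against both rule sets."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name
    for name in directories:
        if name in EXACT:
            return True
        if any(fragment in name for fragment in CONTAINS):
            return True
    return False
```

Under these assumptions, `src/vendor/lib.js` is excluded but `src/vendored/lib.js` is not, while a directory such as `src/latest/` is excluded because its name contains `test`.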
