Machine Learning Algorithm That De-anonymizes Programmers From Source Code And Binaries
Researchers have found that machine learning can be used to help identify pieces of code, binaries, and exploits written by anonymous programmers, according to Wired. In other words, machine learning can ‘de-anonymize’ programmers from either source code or compiled binaries.
The study was presented by Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt’s former Ph.D. student and now an assistant professor at George Washington University, at the DefCon hacking conference.
How To De-Anonymize Code
According to the researchers, code written in a programming language is not completely anonymous. Its abstract syntax trees contain stylistic fingerprints that can potentially be used to identify programmers from both source code and binaries.
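To make the idea of a syntax-tree fingerprint concrete, here is a minimal sketch using Python’s standard `ast` module: it counts how often each node type appears in a piece of source code, giving a crude stylistic profile. This is only an illustration; the study itself parsed C++ and used a far richer feature set than node-type frequencies.

```python
import ast
from collections import Counter

def ast_node_histogram(source: str) -> Counter:
    """Count AST node types in a piece of Python source.

    Node-type frequencies are one simple kind of stylistic
    fingerprint; real code stylometry also uses AST depth,
    node bigrams, and lexical and layout features.
    """
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

sample = "def add(a, b):\n    total = a + b\n    return total\n"
print(ast_node_histogram(sample))
```

Two programmers solving the same problem tend to produce measurably different histograms (e.g. one favors nested comprehensions, another explicit loops), which is what makes such counts usable as features.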
To build the system, the researchers ran code samples through machine learning algorithms and extracted a wide range of features, such as choice of words, how the code is organized, and its length. They then narrowed the features down to only those that actually differentiate developers from each other.
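The narrowing step can be illustrated with a simple filter that keeps the features whose per-author averages differ the most; a feature that looks the same for every author carries no attribution signal. The authors, feature vectors, and variance-based ranking below are all invented for illustration (the study used information gain to prune a much larger feature set).

```python
from statistics import mean, pvariance

# Invented per-author feature vectors: rows are code samples,
# columns are candidate stylistic features.
samples = {
    "alice": [[12.0, 3.0, 0.9], [13.0, 3.1, 0.8]],
    "bob":   [[25.0, 3.0, 0.2], [24.0, 3.2, 0.3]],
}

def discriminative_features(samples, top_k=2):
    """Rank features by the spread of their per-author means:
    a feature whose average differs a lot between authors is
    more useful for telling them apart."""
    per_author_means = [
        [mean(col) for col in zip(*vecs)] for vecs in samples.values()
    ]
    spread = [pvariance(col) for col in zip(*per_author_means)]
    ranked = sorted(range(len(spread)), key=lambda i: spread[i], reverse=True)
    return ranked[:top_k]

print(discriminative_features(samples))
```

Here feature 1 (the middle column) is nearly identical for both authors, so it is the one dropped.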
Examples of a programmer’s work are fed into the algorithm, which learns to recognize that programmer’s coding structure from the samples. New code can then be attributed to whichever known programmer it most closely resembles.
For the testing, Caliskan and the other researchers used code samples from Google’s annual Code Jam competition. The algorithm correctly identified the programmer behind a sample 83% of the time.
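The train-then-attribute pipeline described above can be sketched with a toy nearest-centroid classifier: known samples per author are averaged into a centroid, and an unknown sample is assigned to the closest one. The authors and feature vectors are made up, and the study itself used a random forest rather than this simplification.

```python
import math

# Invented training data: each author has a few feature vectors,
# e.g. [mean line length, AST depth, fraction of short identifiers].
training = {
    "alice": [[12.0, 4.0, 0.10], [13.0, 4.5, 0.12]],
    "bob":   [[25.0, 7.0, 0.55], [24.0, 6.5, 0.60]],
}

def attribute(sample):
    """Assign the sample to the author whose centroid is nearest."""
    best_author, best_dist = None, math.inf
    for author, vecs in training.items():
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        dist = math.dist(sample, centroid)
        if dist < best_dist:
            best_author, best_dist = author, dist
    return best_author

print(attribute([24.5, 6.8, 0.58]))
```

With realistic features the decision boundary is far less clean, which is why the study’s 83% accuracy on Code Jam data, where each author solves the same problems, was a notable result.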
Where can it be used?
This approach could be used for identifying malware creators or investigating hacks. It could also be used to find out whether students studying programming copied code from others, or whether a developer violated a non-compete clause in their employment contract.
However, this approach could have privacy implications, especially for the thousands of developers who contribute open-source code to the world and choose to remain anonymous for a variety of reasons.
Greenstadt and Caliskan plan to study how other factors might affect a person’s coding style. For instance: what happens when members of the same organization collaborate on a project? Do people from different countries code in distinguishable ways? And can the same attribution methods be applied uniformly across different programming languages?
“We’re still trying to understand what makes something really attributable and what doesn’t,” says Greenstadt. “There’s enough here to say it should be a concern, but I hope it doesn’t cause anybody to not contribute publicly on things.”