
AI deobfuscators: Why AI won’t help hackers deobfuscate code (yet)

By Dr. Anton Tkachenko May 08, 2024 04:06 pm

Generative AI could enable malicious actors to steal your source code — but only after passing one significant technical hurdle.

Why code obfuscation is so important

In 2003, the Executive Vice Chairman of Cisco took a long flight from California to Shenzhen to speak with the CEO of Huawei, a budding telecommunications company.

It wasn’t a friendly visit.

Years earlier, Cisco had played an integral role in building the “Great Firewall,” China’s nationwide internet censorship apparatus. With privileged access to its devices, Huawei could obtain and reverse engineer Cisco’s hardware and copy its source code verbatim into its own brand of routers.

It turned out Huawei was doing the same thing to Nortel Networks. And in the years since, Chinese source code theft has become something of a recurring nightmare for technology vendors across both hemispheres, from Adobe to Morgan Stanley to Dow Chemical.

For the average mobile app developer, the priority tends to be deploying the app as soon as possible. But for those invested in security, there’s a daunting frontier to manage: The nation-states, cybercriminals (including corporate insiders), and competitors looking to get their hands on valuable source code.


That’s where obfuscation comes in. By renaming variables, rearranging statements, splitting and merging variables, altering loop structures, inserting nonsense code, and much more, automated tools can make spaghetti out of intellectual property, so that reading it becomes far more difficult for those who’d use it for the wrong reasons.

Unless, perhaps, they use artificial intelligence. After all, AI/ML has supercharged programming and cybersecurity: It’s used to detect and remediate threats, refactor code, and suss out bugs, and software like Copilot and Codex can even generate entire scripts (legitimate and malicious) in an instant.

It follows, then, that this same technology could be applied to the problem of deobfuscating source code. If so, bad actors would gain the ability to reconstruct the most sensitive, valuable software out there, with all of the legal, economic, and geopolitical ramifications that entails.

Except there’s a huge technical barrier standing in their way.

Figure: An unobfuscated control flow vs. an obfuscated control flow.

The limits of AI in deobfuscating code

Reverse engineers have had a while now to explore how ChatGPT can help undo obfuscated code. They haven’t made much progress … yet.

Putting aside all of the more practical considerations — discerning file format, decompiling, etc. — deobfuscation requires extracting and interpreting different kinds of information at different levels of a program: The purpose of a variable, how a function works, the control flow of a whole sequence. Then there’s the work of modeling all of that information, and testing the model. Along the way, there’s plenty of room for error: A detail overlooked, data incorrectly interpreted as code, and so on.

Let’s say we want to train a cutting-edge machine learning algorithm to simply recognize obfuscation in the first place. First, it needs to have context for understanding the code in front of it — a model trained on one language, obviously, won’t know where to begin with another.

We have the technology to train a machine learning algorithm to read and “understand” code (in a certain sense), but here it needs to recognize obfuscated code. Those are considerably different challenges since, unlike a language, obfuscation doesn’t necessarily follow a clear ruleset. In its most basic forms it might involve arbitrarily renaming variables, splicing in useless code or data, encryption, compression, stripping metadata, changing the value of data by scrambling, substituting, or shuffling it, and so on and so forth.
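Scrambling data, for instance, can be as simple as the following sketch, which hides a string constant behind a single-byte XOR. The key, the helper names, and the hostname are all assumptions made up for this example; real tools layer far stronger schemes on top of one another.

```python
KEY = 0x5A  # hypothetical single-byte key, chosen arbitrarily for this sketch


def encode(text):
    """XOR every UTF-8 byte with KEY so the plaintext never appears in the file."""
    return bytes(b ^ KEY for b in text.encode("utf-8"))


def decode(blob):
    """XOR is its own inverse, so applying the same key restores the text."""
    return bytes(b ^ KEY for b in blob).decode("utf-8")


# What ships in the program is the scrambled blob, not the readable string.
SECRET = encode("api.example.com")

if __name__ == "__main__":
    print(SECRET)          # unreadable bytes
    print(decode(SECRET))  # the original string, recovered at runtime
```

Nothing in the scrambled bytes announces that they are an encoded hostname rather than ordinary binary data, which is part of what makes obfuscation hard to recognize from rules alone.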

This isn’t to imply, though, that code must be heavily altered in order to confuse an algorithm. Imagine a bad human programmer: one who, instead of writing a nice, clean function, creates many for the same purpose and layers them inside one another. In some sense, this is a form of obfuscation, since it makes the code’s underlying purpose difficult to discern. How can we teach a machine to distinguish a normal script from a poorly written one, and each from an intentionally convoluted one?
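As a hypothetical illustration of that bad programmer, compare the two functions below. Both are invented for this example and both simply test whether a number is even, but the second buries that purpose under layers of nested helpers, with no obfuscation tool involved.

```python
def is_even(n):
    """The nice, clean version."""
    return n % 2 == 0


def check_parity_flag(n):
    """The same check, written badly: helpers nested inside helpers."""

    def get_remainder(value):
        def divide_leftover(v, d):
            # Re-implements the % operator by hand.
            return v - (v // d) * d

        return divide_leftover(value, 2)

    def flag_from(rem):
        # A verbose way of returning rem == 0.
        if rem == 0:
            return True
        else:
            return False

    return flag_from(get_remainder(n))


if __name__ == "__main__":
    print(all(is_even(n) == check_parity_flag(n) for n in range(-5, 6)))
```

No one set out to hide anything here, yet a classifier looking for needless indirection could easily flag the second function as deliberately convoluted.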

It all comes down to training data. Like any classification problem, an algorithm trained to recognize obfuscation would need to have been exposed to massive datasets with unobfuscated and obfuscated software. It would require repeated exposure to all the various forms obfuscation could take, and all of this data would need to be labeled — probably by hand — because even the best AI still can’t grasp concepts quite like a human engineer can.
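To show the shape of that classification problem, and nothing more, here is a toy sketch using only Python's standard library. Everything in it is an assumption made for illustration: the two features (average identifier length and the share of very short or hex-like names), the four hand-labeled samples, and the nearest-centroid rule. A real system would need vastly more data and far richer features.

```python
import ast
import math


def features(source):
    """Return (mean identifier length, share of very short or hex-like names)."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.append(node.id)
        elif isinstance(node, ast.arg):
            names.append(node.arg)
        elif isinstance(node, ast.FunctionDef):
            names.append(node.name)
    if not names:
        return (0.0, 0.0)
    mean_length = sum(len(n) for n in names) / len(names)
    suspicious = sum(1 for n in names if len(n) <= 2 or n.startswith("_0x"))
    return (mean_length, suspicious / len(names))


def train(labeled):
    """Average the feature vectors of each label into one centroid per label."""
    by_label = {}
    for source, label in labeled:
        by_label.setdefault(label, []).append(features(source))
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(model, source):
    """Pick the label whose centroid is closest to this sample's features."""
    point = features(source)
    return min(model, key=lambda label: math.dist(point, model[label]))


# Hand-labeled toy dataset (hypothetical samples, labeled by a human).
LABELED = [
    ("def area(width, height):\n    return width * height\n", "clean"),
    ("def greet(name):\n    message = 'hi ' + name\n    return message\n", "clean"),
    ("def a(b, c):\n    return b * c\n", "obfuscated"),
    ("def _0x1(_0x2):\n    x = 'hi ' + _0x2\n    return x\n", "obfuscated"),
]

if __name__ == "__main__":
    model = train(LABELED)
    print(predict(model, "def f(q):\n    z = q + 1\n    return z\n"))
```

A sketch like this only catches renamed identifiers. It would also mislabel any honest programmer who happens to like one-letter variable names, which is exactly why the labeled data has to be broad and carefully curated.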

Note that, at this point, we haven’t even gotten past the preliminary step of teaching AI to recognize obfuscation to begin with! If this alone is a hard problem to solve, then identifying and reversing obfuscations at the level of individual functions will be far more complicated and labor-intensive.

How soon will AI be able to deobfuscate code?

Despite the shortage of training data, some open source developers have been chipping away at the edges of AI-automated deobfuscation.

For example, GptHidra and G-3PO, new plug-ins for the NSA-built reverse engineering software Ghidra, use ChatGPT to provide high-level descriptions of functions. Projects like these take advantage of the strengths of current AI models, like their ability to generate clean natural language. Each represents a step forward in the field, saving engineers a great deal of time they’d otherwise have spent analyzing code piece by piece.

These ancillary tools are a harbinger of bigger, more meaningful progress in the not-too-distant future. Available, labeled training data can only grow over time, so the question is not whether AI will be able to deobfuscate code, but when. Tomorrow? A year from now? Perhaps only in a generation.

There’s no way to prevent the rise of AI deobfuscators. What developers can do instead, in anticipation, is use this very same technology to improve obfuscation techniques in equal or greater proportion. AI has the potential to create more complex obfuscations than anything we’ve ever seen before, making the job of reversing them all the more difficult.

The race between those who’d protect code with AI, and those who’d use AI to steal it, is already underway. The entire future of cyberspace hangs in the balance.