Effective Representation Learning for Binary Code
Sponsored by WTD81 through APERITIF: Analysis Pipeline for Effective vulneRability IdenTIfication through Fuzzing and by the Royal Holloway Centre for Doctoral Training in Cyber Security.
Deep learning has revolutionized natural language processing and computer vision by moving away from manual feature engineering and instead training deep neural networks on large quantities of data with minimal preprocessing. There have been a number of attempts to apply deep learning to tasks related to program understanding; while the results have been encouraging, the real-world performance of these systems is still lacking. Existing systems learn mostly the syntax of programs and derive very little semantic insight.
The goal of this project is to develop a methodology for effective representation learning for binary code by applying preprocessing that generalizes code away from its specific syntax. To this end, we leverage program analysis methods to abstract program semantics. With this methodology, we will train a large-scale, general-purpose model for code that can be deployed across different applications.
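As a rough illustration of what generalizing code away from specific syntax can mean, the following minimal sketch replaces concrete registers and constants in a disassembled instruction with placeholder tokens. It is a purely syntactic stand-in rather than the program-analysis-based abstraction the project aims for; the register list, the `REG`/`IMM` placeholders, and the function name `normalize_instruction` are assumptions made for this example only.

```python
import re

# Hypothetical, incomplete set of x86-64 register names (assumption for this
# sketch; a real pipeline would take operand types from a disassembler).
REGISTERS = {
    "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp",
    "eax", "ebx", "ecx", "edx", "esi", "edi",
}

# Tokens: hex literal, decimal literal, identifier, or any single symbol.
TOKEN = re.compile(r"0x[0-9a-fA-F]+|\d+|[A-Za-z_][A-Za-z0-9_]*|\S")
NUMBER = re.compile(r"0x[0-9a-f]+|\d+")


def normalize_instruction(text):
    """Replace concrete registers and constants with placeholder tokens."""
    mnemonic, _, operands = text.strip().partition(" ")
    out = []
    for tok in TOKEN.findall(operands):
        low = tok.lower()
        if low in REGISTERS:
            out.append("REG")   # any register becomes the same token
        elif NUMBER.fullmatch(low):
            out.append("IMM")   # any constant becomes the same token
        else:
            out.append(low)     # punctuation and unknown names are kept
    return (mnemonic.lower() + " " + "".join(out)).strip()


print(normalize_instruction("mov rax, 0x10"))     # mov REG,IMM
print(normalize_instruction("MOV rbx, 42"))       # mov REG,IMM
print(normalize_instruction("add rax, [rbp+8]"))  # add REG,[REG+IMM]
```

Under this normalization, two instructions that differ only in register allocation or constant values map to the same token sequence, so a model trained on the output sees less compiler-specific surface variation.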
The applications we will investigate include (a) the reconstruction of metadata in binary code to aid reverse engineering and to help target fuzzing campaigns (as part of the APERITIF project on a general-purpose fuzzing pipeline); (b) code similarity analysis to discover derived code across compiler versions and settings; and (c) the detection of previously unseen malicious code.