As a guest user you are not logged in or recognized by your IP address. You have
access to the Front Matter, Abstracts, Author Index, Subject Index and the full
text of Open Access publications.
The vast amount of code available on the web is increasing on a daily basis. Open-source hosting sites such as GitHub contain billions of lines of code. Community question-answering sites provide millions of code snippets with corresponding text and metadata. The amount of code available in executable binaries is even greater. In this lecture series, I will cover recent research trends on leveraging such “big code” for program analysis, program synthesis and reverse engineering. We will consider a range of semantic representations based on symbolic automata [55,63], tracelets [28], numerical abstractions [61,58], and textual descriptions [82,1], as well as different notions of code similarity based on these representations. To leverage these semantic representations, we will consider a number of prediction techniques, including statistical language models [66,73], variable order Markov models [18], and other distance-based and model-based sequence classification techniques. Finally, we discuss applications of these techniques including semantic code search in both source code [55] and stripped binaries [28], code completion and reverse engineering [43].