Analysis and Synthesis with “Big Code”

Yahav, Eran

doi:10.3233/978-1-61499-627-9-244

Abstract

The vast amount of code available on the web is increasing on a daily basis. Open-source hosting sites such as GitHub contain billions of lines of code. Community question-answering sites provide millions of code snippets with corresponding text and metadata. The amount of code available in executable binaries is even greater. In this lecture series, I will cover recent research trends on leveraging such “big code” for program analysis, program synthesis, and reverse engineering. We will consider a range of semantic representations based on symbolic automata [55,63], tracelets [28], numerical abstractions [61,58], and textual descriptions [82,1], as well as different notions of code similarity based on these representations. To leverage these semantic representations, we will consider a number of prediction techniques, including statistical language models [66,73], variable-order Markov models [18], and other distance-based and model-based sequence classification techniques. Finally, we will discuss applications of these techniques, including semantic code search in both source code [55] and stripped binaries [28], code completion, and reverse engineering [43].
