StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow

Stack Overflow (SO) has been a great source of natural language questions and their code solutions (i.e., question-code pairs), which are critical for many tasks including code retrieval and annotation. In most existing research, question-code pairs were collected heuristically and tend to have low quality. In this paper, we investigate a new problem of systematically mining question-code pairs from Stack Overflow (in contrast to heuristically collecting them). It is formulated as predicting whether or not a code snippet is a standalone solution to a question. We propose a novel Bi-View Hierarchical Neural Network which can capture both the programming content and the textual context of a code snippet (i.e., two views) to make a prediction. On two manually annotated datasets in Python and SQL domain, our framework substantially outperforms heuristic methods with at least 15% higher F1 and accuracy. Furthermore, we present StaQC (Stack Overflow Question-Code pairs), the largest dataset to date of ~148K Python and ~120K SQL question-code pairs, automatically mined from SO using our framework. Under various case studies, we demonstrate that StaQC can greatly help develop data-hungry models for associating natural language with programming language

[1]  Ying Zou,et al.  Spotting working code examples , 2014, ICSE.

[2]  Andrew D. Gordon,et al.  Bimodal Modelling of Source Code and Natural Language , 2015, ICML.

[3]  Rico Sennrich,et al.  A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation , 2017, IJCNLP.

[4]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[5]  Ting Liu,et al.  Document Modeling with Gated Recurrent Neural Network for Sentiment Classification , 2015, EMNLP.

[6]  Tomoki Toda,et al.  Learning to Generate Pseudo-Code from Source Code Using Statistical Machine Translation (T) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[7]  Zhi-Hua Zhou,et al.  Tri-training: exploiting unlabeled data using three classifiers , 2005, IEEE Transactions on Knowledge and Data Engineering.

[8]  Livio Robaldo,et al.  The Penn Discourse TreeBank 2.0. , 2008, LREC.

[9]  Yutaka Matsuo,et al.  A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes , 2017, ACL.

[10]  Geoffrey E. Hinton,et al.  Deep Learning , 2015, Nature.

[11]  Jacob Cohen A Coefficient of Agreement for Nominal Scales , 1960 .

[12]  D. Cox The Regression Analysis of Binary Sequences , 1958 .

[13]  Anita Sarma,et al.  ANNE: Improving Source Code Search using Entity Retrieval Approach , 2017, WSDM.

[14]  Anh Tuan Nguyen,et al.  Graph-Based Statistical Language Model for Code , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[15]  Jacob Aristotle,et al.  Stack Overflow , 2012 .

[16]  Charles A. Sutton,et al.  A Convolutional Attention Network for Extreme Summarization of Source Code , 2016, ICML.

[17]  Joelle Pineau,et al.  Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models , 2015, AAAI.

[18]  Yelong Shen,et al.  A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval , 2014, CIKM.

[19]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[20]  Dumitru Erhan,et al.  Deep Neural Networks for Object Detection , 2013, NIPS.

[21]  Christoph Treude,et al.  How do programmers ask and answer questions on the web?: NIER track , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[22]  A. Azzouz 2011 , 2020, City.

[23]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[24]  Ellen M. Voorhees,et al.  The TREC-8 Question Answering Track Report , 1999, TREC.

[25]  Cristina V. Lopes,et al.  From Query to Usable Code: An Analysis of Stack Overflow Code Snippets , 2016, 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR).

[26]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL 2006.

[27]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[28]  Diyi Yang,et al.  Hierarchical Attention Networks for Document Classification , 2016, NAACL.

[29]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[30]  Alessandro Moschitti,et al.  Corpora for Automatically Learning to Map Natural Language Questions into SQL Queries , 2010, LREC.

[31]  Frank Maurer,et al.  What makes a good code example?: A study of programming Q&A in StackOverflow , 2012, 2012 28th IEEE International Conference on Software Maintenance (ICSM).

[32]  Christoph Treude,et al.  NLP2Code: Code Snippet Content Assist via Natural Language Tasks , 2017, 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[33]  Tao Wang,et al.  Convolutional Neural Networks over Tree Structures for Programming Language Processing , 2014, AAAI.

[34]  Wang Ling,et al.  Latent Predictor Networks for Code Generation , 2016, ACL.

[35]  Alvin Cheung,et al.  Summarizing Source Code using a Neural Attention Model , 2016, ACL.

[36]  Rayid Ghani,et al.  Analyzing the effectiveness and applicability of co-training , 2000, CIKM '00.

[37]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[38]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL.

[39]  Phil Blunsom,et al.  Teaching Machines to Read and Comprehend , 2015, NIPS.

[40]  Mukund Raghothaman,et al.  SWIM: Synthesizing What I Mean - Code Search and Idiomatic Snippet Synthesis , 2015, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[41]  Meital Zilberstein,et al.  Leveraging a corpus of natural language descriptions for program similarity , 2016, Onward!.

[42]  Alessandro Moschitti,et al.  Semantic Mapping between Natural Language Questions and SQL Queries via Syntactic Pairing , 2009, NLDB.

[43]  Alberto Bacchelli,et al.  Quality Questions Need Quality Code: Classifying Code Fragments on Stack Overflow , 2015, 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories.

[44]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[45]  Christopher De Sa,et al.  Data Programming: Creating Large Training Sets, Quickly , 2016, NIPS.

[46]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[47]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[48]  Marcelo de Almeida Maia,et al.  Redocumenting APIs with crowd knowledge: a coverage analysis based on question types , 2016, Journal of the Brazilian Computer Society.

[49]  Dan Klein,et al.  Abstract Syntax Networks for Code Generation and Semantic Parsing , 2017, ACL.

[50]  M. Maia,et al.  Ranking crowd knowledge to assist software development , 2014, ICPC 2014.

[51]  Michael Gamon,et al.  Building Natural Language Interfaces to Web APIs , 2017, CIKM.

[52]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[53]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[54]  Daniel Jurafsky,et al.  A Hierarchical Neural Autoencoder for Paragraphs and Documents , 2015, ACL.

[55]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..