Paper Information - StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow

Stack Overflow (SO) has been a great source of natural language questions and their code solutions (i.e., question-code pairs), which are critical for many tasks including code retrieval and annotation. In most existing research, question-code pairs were collected heuristically and tend to have low quality. In this paper, we investigate a new problem of systematically mining question-code pairs from Stack Overflow (in contrast to heuristically collecting them). It is formulated as predicting whether or not a code snippet is a standalone solution to a question. We propose a novel Bi-View Hierarchical Neural Network which can capture both the programming content and the textual context of a code snippet (i.e., two views) to make a prediction. On two manually annotated datasets in the Python and SQL domains, our framework substantially outperforms heuristic methods with at least 15% higher F1 and accuracy. Furthermore, we present StaQC (Stack Overflow Question-Code pairs), the largest dataset to date of ~148K Python and ~120K SQL question-code pairs, automatically mined from SO using our framework. Under various case studies, we demonstrate that StaQC can greatly help develop data-hungry models for associating natural language with programming language.
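The problem formulation in the abstract (deciding, for each code snippet, whether it is a standalone solution to the question) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Candidate` fields, the toy examples, and the `select_all_baseline` heuristic are hypothetical and are not taken from the paper, and the sketch does not implement the Bi-View Hierarchical Neural Network. It only shows the shape of the binary classification task and how accuracy and F1 would be computed for a predictor.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    """One code snippet from an answer, with the two views named in the abstract.

    Field names are hypothetical; they are not the dataset's actual schema.
    """
    question: str        # natural language question
    context_before: str  # answer text preceding the snippet (textual context)
    code: str            # the snippet itself (programming content)
    context_after: str   # answer text following the snippet (textual context)
    label: int           # 1 = standalone solution, 0 = not a standalone solution


def select_all_baseline(candidate: Candidate) -> int:
    """Hypothetical heuristic: treat every snippet as a standalone solution."""
    return 1


def accuracy_and_f1(golds: List[int], preds: List[int]) -> Tuple[float, float]:
    """Compute accuracy and F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(golds, preds) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(golds, preds) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(golds, preds) if g == 1 and p == 0)
    correct = sum(1 for g, p in zip(golds, preds) if g == p)
    accuracy = correct / len(golds)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


def evaluate(predict: Callable[[Candidate], int],
             candidates: List[Candidate]) -> Tuple[float, float]:
    """Run a predictor over annotated candidates and score it."""
    golds = [c.label for c in candidates]
    preds = [predict(c) for c in candidates]
    return accuracy_and_f1(golds, preds)


# Toy, made-up examples (not from the annotated datasets).
toy_candidates = [
    Candidate("How to reverse a list in Python?", "You can use slicing:",
              "xs[::-1]", "", 1),
    Candidate("How to reverse a list in Python?", "Given this input:",
              "xs = [1, 2, 3]", "the result should be reversed.", 0),
    Candidate("How to count rows in a table?", "Try:",
              "SELECT COUNT(*) FROM t;", "", 1),
    Candidate("How to count rows in a table?", "Your schema looks like:",
              "CREATE TABLE t (id INT);", "", 0),
]

if __name__ == "__main__":
    acc, f1 = evaluate(select_all_baseline, toy_candidates)
    print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

Any learned model would plug into `evaluate` in place of `select_all_baseline`, taking both the snippet and its surrounding text as input and returning 0 or 1.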
