Attribution Required: Stack Overflow Code Snippets in GitHub Projects

Abstract

Stack Overflow is the largest Q&A website for developers, providing a huge number of copyable code snippets. Using these snippets raises various maintenance and legal issues. The Stack Overflow license requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt the same license. While there is a heated debate on Stack Overflow's license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from Stack Overflow without proper attribution. In this paper, we present the results of an empirical study that analyzes attributed usages of Stack Overflow code snippets in GitHub projects for the most common programming languages and estimates a lower bound for unattributed usages in Java files. On average, one out of 32 repositories contained a reference to Stack Overflow. Further, we found that developers tend to refer to the whole thread on Stack Overflow rather than to a specific answer. For Java, at least two thirds of the copied snippets are not attributed.
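
To illustrate the distinction between a reference to a whole thread and a reference to a specific answer, the following is a minimal sketch of how Stack Overflow links in source files could be found and classified. It is not the script used in the study (see the supplementary material for that). The function name `find_references`, the regular expressions, and the set of URL shapes they cover (`/questions/<id>/...`, `/q/<id>`, `/a/<id>`, and a trailing answer ID or `#<id>` fragment) are assumptions made for this example.

```python
import re

# Assumed URL shapes for links to Stack Overflow posts. This is an
# illustrative pattern, not the one used in the study.
SO_LINK = re.compile(
    r"https?://(?:www\.)?stackoverflow\.com/"
    r"(?P<kind>questions|q|a)/(?P<post_id>\d+)"
    r"(?P<rest>[^\s<>)]*)"
)

# A long-form question URL points to an answer if it ends in
# "/<answer_id>", "/<answer_id>#<answer_id>", or "#<answer_id>".
ANSWER_SUFFIX = re.compile(r"/(\d+)(?:#\d+)?$|#(?:answer-)?(\d+)$")


def find_references(source_text):
    """Return a list of ("question" | "answer", post_id) tuples."""
    refs = []
    for match in SO_LINK.finditer(source_text):
        kind = match.group("kind")
        post_id = int(match.group("post_id"))
        # Drop punctuation or quotes that directly follow the URL.
        rest = match.group("rest").rstrip(".,;:\"'")

        if kind == "a":
            # Short answer link: /a/<answer_id>[/<user_id>]
            refs.append(("answer", post_id))
            continue

        # Short question links (/q/<id>/<user_id>) never carry an answer ID.
        suffix = ANSWER_SUFFIX.search(rest) if kind == "questions" else None
        if suffix:
            refs.append(("answer", int(suffix.group(1) or suffix.group(2))))
        else:
            refs.append(("question", post_id))
    return refs
```

Under these assumptions, a comment such as `// see https://stackoverflow.com/questions/12345/some-title` would count as a reference to the whole thread, while `// from https://stackoverflow.com/a/67890` would count as a reference to a specific answer.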

Supplementary Material

  1. Preliminary Study:
    We provide the survey codebook, the raw response data, as well as the R script used for analysis: ZIP

  2. Programming Language Ranking:
    We provide instructions to recreate the ranking as well as the ranking itself: ZIP

  3. RQ1: How is content from Stack Overflow referenced in GitHub projects?
    We provide all scripts and data used for RQ1 and RQ2 in one package: ZIP

  4. RQ2: What properties do frequently referenced questions and answers from Stack Overflow possess?
    We provide all scripts and data used for RQ1 and RQ2 in one package: ZIP

  5. RQ3: How often is code from Stack Overflow posts used, but not attributed?
    We provide all scripts and data used for RQ3 in one package: ZIP

  6. Other Sources:
    Stack Exchange Data Dump
    GHTorrent Data Dump


For data retrieved from the BigQuery GitHub data set, see the GitHub Terms of Service.
All content retrieved from Stack Overflow is licensed under CC BY-SA 3.0; see also the Stack Exchange Network Terms of Service.