Science vs. Advocacy: Thoughts on the Felten BitTorrent Study
Princeton computer science professor Edward Felten has posted on his Web site a summary of a study he and Princeton student Sauhard Sahi conducted involving BitTorrent, the peer-to-peer network protocol. Felten and Sahi summarize their study as an investigation into what types of files are available on the system:
BitTorrent is popular because it lets anyone distribute large files at low cost. Which kinds of files are available on BitTorrent? Sauhard Sahi, a Princeton senior, decided to find out. Sauhard’s independent work last semester, under my supervision, set out to measure what was available on BitTorrent. This post, summarizing his results, was co-written by Sauhard and me.
Sahi and Felten chose a random sample of files available “via the trackerless variant of BitTorrent, using the Mainline DHT. The sample comprised 1021 files. He classified the files in the sample by file type, language, and apparent copyright status.” The summary does not clearly identify the time frame (either in length of time, or the time of year) in which Sahi and Felten performed the study.
Summary of the Study Summary
In summary, Sahi and Felten concluded that nearly half the files (46 percent) in the study consisted of non-adult movies and “shows.” (We presume the scholars mean shows that appear on television, whether dramatic serials or game shows.) This category of content would include what the Copyright Act of 1976 defines in Section 101 as “motion pictures” (“Motion pictures are audiovisual works consisting of a series of related images which, when shown in succession, impart an impression of motion, together with accompanying sounds, if any.”). Adult films accounted for 14 percent of the total files, as did computer games and software; music accounted for another 10 percent of the files.
The part of the Sahi-Felten study summary that seemed to garner the most attention was the section entitled “Apparent Copyright Infringement.” Wrote the scholars:
Our final assessment involved determining whether or not each file seemed likely to be copyright-infringing. We classified a file as likely non-infringing if it appeared to be (1) in the public domain, (2) freely available through legitimate channels, or (3) user-generated content. These were judgment calls on our part, based on the contents of the files, together with some external research.
…
Overall, we classified ten of the 1021 files, or approximately 1%, as likely non-infringing. This result should be interpreted with caution, as we may have missed some non-infringing files, and our sample is of files available, not files actually downloaded. Still, the result suggests strongly that copyright infringement is widespread among BitTorrent users.
In other words, the pair have drawn a preliminary conclusion that 99 percent of the files in this BitTorrent study infringed U.S. copyright law.
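The arithmetic behind the “approximately 1%” figure is easy to check, and a rough margin of error can be attached to it. The following is a minimal sketch, not anything taken from the Sahi-Felten summary: it assumes the 1021 files are a simple random sample, and the Wilson score interval is our own choice of method. The function name `wilson_interval` is ours as well.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Figures reported in the summary: 10 likely non-infringing files out of 1021.
# Assumption (ours): the files form a simple random sample.
non_infringing, sample_size = 10, 1021
share = non_infringing / sample_size
low, high = wilson_interval(non_infringing, sample_size)
print(f"share = {share:.2%}, interval = {low:.2%} to {high:.2%}")
```

An interval of this kind speaks only to sampling error. It says nothing about whether the judgment calls used to classify each file were sound, which is the question we raise below.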
It is virtually impossible to discuss this study or its conclusion without reviewing the final paper, the data, and the data analysis that led to the conclusions about “Apparent Copyright Infringement.” We and another reader have requested to review that information. We also specifically asked to see the coding sheets and the variables, and to take a closer look at the variable operationalizations; upon a second glance at the summary, we also would like to review the study design, particularly its sampling design.
(By the way, none of these requests are abnormal for social science studies. It is possible a reviewer may not request coding sheets, for example, but if coding schema are integral to variable operationalizations, then requesting the coding schema is not abnormal either.)
Our Questions
Still, we present some preliminary comments about the summary, and ask some questions about it. (We presume a forthcoming paper will present the study, its data, and findings in more detail.)
First, we would like to know both the time frame and the time span that the study captured. The time frame would identify the time of day and time zone; the time span would identify whether the study covered the entire summer, a month, a week, or a day. Both are important in terms of measurement and potential data skew, especially if only a single temporal element was captured and that element was not compared to a second, third, or fourth temporal element.
Also, we would be interested in knowing whether this study was a longitudinal study, or a snapshot of activity; if it is the latter, both the time frame and time span become much more important.
Second, we hope the final paper identifies why the scholars chose “the trackerless variant of BitTorrent, using the Mainline DHT” as the data source, and what the reasons were for excluding other BitTorrent data sources.
Third, we find the scholars’ operationalization of copyright infringement to be interesting. On this issue, the scholars wrote the following:
Our final assessment involved determining whether or not each file seemed likely to be copyright-infringing. We classified a file as likely non-infringing if it appeared to be (1) in the public domain, (2) freely available through legitimate channels, or (3) user-generated content. These were judgment calls on our part, based on the contents of the files, together with some external research.
Based upon the information in the summary, this operationalization of copyright infringement could be problematic for practical and theoretical reasons because it could skew the findings, or fail to provide proper context. To see why we find this problematic, consider our rationale.
The actual definition of copyright infringement in the Copyright Act of 1976 (Section 501(a)) states the following:
Anyone who violates any of the exclusive rights of the copyright owner as provided by sections 106 through 122 or of the author as provided in section 106A(a) … is an infringer of the copyright or right of the author, as the case may be.
Effectively, this means that any time any person other than the copyright owner or its authorized agent invokes or uses any of the exclusive rights of reproduction, derivative work/adaptation, distribution, public performance or public display, that person is infringing per Section 501(a). As we have outlined in our sister publication Core Copyright, this use or invocation occurs every minute of every hour of every day under the current legal regime.
This finding of infringement, of course, is subject to a raft of limitations or compulsory licenses in Sections 107 through 122. These limitations and licenses may mean that a de facto finding of infringement — which, too, is common and virtually automatic under the current legal regime — ultimately falls away, leaving the alleged infringer without legal liability, for reasons of public or economic policy.
The Importance of Operationalizing Infringement
But let’s return to the finding of infringement under the definition in Section 501(a), using the movies as an example. Since copyright infringement is a strict liability issue (i.e., roughly, liability without fault), any time anyone posts a file on a BitTorrent system, even a digital movie or music file ripped from his or her own collection, there is, arguably, an infringement because
(a) the person who owns the source disc from which the movie or music file was ripped is likely not the person who owns any of the Section 106 exclusive rights in the work embodied on the disc (per Section 202); and
(b) that person therefore has no authority to distribute that file on a digital network.
(The first sale limitation in Section 109 may or may not apply. We will presume for the sake of this argument that it is inapplicable. We also forestall any discussion of reproducing the movies into a digital format in order to get the digital file onto the BitTorrent network in the first place; that activity — which almost certainly occurs by circumventing a digital copy protection technology — likely would violate the Digital Millennium Copyright Act.)
This means that from a legal standpoint, it is possible that any file on such a distributed peer-to-peer network is an infringement under Section 501(a), regardless of whether or not the person who uploads the file owns the source disc. (Again, an ultimate and determinative finding of liability would be subject to the limitations and compulsory licenses in Sections 107 through 122 of the current Act.)
How does the legal definition of infringement affect the scholars’ operationalization of infringement in their study?
First, it could affect the study in a significant way if it does not take into account a variable for actual ownership of the source material from which the traded digital file was ripped. This matters, in turn, because the first sale doctrine may be an applicable limitation. (Again, more analysis would need to be done, but it’s worth an investigation.)
Second, if you can determine, operationalize, and make a variable for source ownership, then the study can probe deeper into what type of infringement is really at issue. Again, the issue is not whether or not there is infringing activity occurring on the network; by virtue of the way Congress wrote the infringement statute, infringement is occurring. (See our reasoning above.) Any normative arguments about the realism of applying that statute in that way in a digital networked economy are worthwhile, but will not be addressed in this specific article.
Context, Evidence-Based Findings & Scientific Method
But what we do not yet know is what type of infringement is occurring in this study. Here we distinguish between technical infringement (i.e., people who post stuff they own in disc form but trade, lend, or make available in digital form, without knowing that what they are doing is, technically, a violation of Section 501(a)) and rogue, behavioral infringement (i.e., people who post stuff they never rightfully purchased or possessed, who never intend to buy the source material, and who merely want to get stuff for free).
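To make the idea of an operationalized variable concrete, here is a minimal sketch of how a coding sheet might record source ownership and derive the type of infringement from it. Everything in it is hypothetical: the names `SourceOwnership`, `CodedFile`, and `infringement_type` are ours, the records are placeholders rather than data from the Sahi-Felten study, and we do not know how (or whether) the scholars coded anything similar.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class SourceOwnership(Enum):
    """Hypothetical coding categories for who holds the source copy."""
    OWNED = "uploader owns a lawful source copy"
    NOT_OWNED = "uploader never lawfully possessed the source"
    UNKNOWN = "could not be determined"

@dataclass
class CodedFile:
    """One row of a hypothetical coding sheet."""
    file_id: str
    likely_infringing: bool
    source_ownership: SourceOwnership

def infringement_type(record):
    """Map a coded record onto the technical/behavioral distinction."""
    if not record.likely_infringing:
        return "likely non-infringing"
    if record.source_ownership is SourceOwnership.OWNED:
        return "technical"
    if record.source_ownership is SourceOwnership.NOT_OWNED:
        return "behavioral"
    return "undetermined"

# Placeholder records for illustration only; these are not study data.
sample = [
    CodedFile("file-a", True, SourceOwnership.OWNED),
    CodedFile("file-b", True, SourceOwnership.NOT_OWNED),
    CodedFile("file-c", True, SourceOwnership.UNKNOWN),
    CodedFile("file-d", False, SourceOwnership.UNKNOWN),
]
tally = Counter(infringement_type(r) for r in sample)
print(dict(tally))
```

The "undetermined" category is kept deliberately. In practice, source ownership may be hard or impossible to observe from the network alone, and a study would need to report how often that happened.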
This distinction is critical for several reasons. First, identifying this factor through an operationalized variable and applicable statistical analysis would help begin to classify what type of behavior is behind the infringing activity. In turn, this is important because it begins to strike at the fit between normal behavior and legal standards. It is the common “speed limit” theory of law: if all people are traveling safely at 65 in a 55 m.p.h. zone, why write a speeding ticket? In contrast, if some are traveling at 95 in a 55 m.p.h. zone, is there any good reason not to write a speeding ticket, regardless of the level of traffic?
Second, this distinction is critical because of a phenomenon that already has begun to occur. For example, there are some who will point to this study as evidence that BitTorrent especially (and peer-to-peer networking, more broadly) is rife with illegal (“piratical”) activity that threatens the livelihood of creators and the companies that help manufacture, distribute, and own the discs that hold the source content (and own the content as well).
Indeed, one commentator already has issued a reflexive and impetuous claim that attempts to link the summary’s findings to a broader policy issue about net neutrality. “Valuable information to keep in mind while debating net neutrality rules and ISPs’ right to manage their networks and fight piracy,” wrote Ben Sheffner of Copyrights and Campaigns last week. In this quote and subsequent responses to reader comments, Sheffner suggested that Internet service providers have a duty to restrict infringing traffic on their networks, and that this duty should manifest itself in a three-strikes/graduated response policy that has been adopted nationwide in France and is beginning to be adopted in other European Union countries.
(There is plenty of background available on three-strikes/graduated response. This article by Canadian attorney Barry Sookman outlines an argument in favor of three-strikes/graduated response. Last year, Sheffner gave his take on what he views as the distinction between “graduated response” and “three-strikes.” EFF posted in November about the Anti-Counterfeiting Trade Agreement (ACTA), which has been negotiated in secret, and allegedly includes a three-strikes provision that would affect U.S. law. Michael Geist did a five-part series (1, 2, 3, 4, 5) about ACTA in January, and wrote a separate column about three strikes.)
It is all the more convenient and useful for an advocacy-driven argument in favor of graduated response that “evidence” of BitTorrent’s transmissions would come from someone like Edward Felten because of his credentials and history. As a tenured computer science professor at Princeton, Felten’s work receives a default presumption of validity and prestige. Additionally, Felten had a high-profile experience with U.S. copyright law in 2000, when the recording industry lobby used the DMCA to squelch a scientific paper Felten and fellow scholars wanted to present about circumventing digital encryption on music files. Contextualizing all this information, an advocate could presume that Felten is hostile to copyright law because of this experience, and that publication of this type of result, on this type of paper, with this type of subject matter helps prove beyond a reasonable doubt, along with his Ivy League credentials, that BitTorrent (and by extension, peer-to-peer networks) are dens of copyright iniquity.
But drawing such correlations at this point — with respect to the summary, the resulting paper (which has not yet been vetted, reviewed or published), or Felten’s perceived or actual personal or professional biases — is premature and careless. At this point, no one can state definitively that the Sahi-Felten study provides any correlation between the level of infringing files and the BitTorrent network because no one has nearly enough information based exclusively upon the summary they presented. We cannot say whether Sahi and Felten considered the issues we have raised, or intentionally chose not to address them because they were deemed to be outside the scope of their study. On the basis of the summary alone, we cannot draw even an indirect correlation between this study summary and any need (or even a lack of need) for a three-strikes approach in the United States.
This is why it is important to read, and understand, the design, the variables, the operationalizations, the data collection methods, and the statistical analyses in a final, peer-reviewed paper before rendering impulsive opinions about potential applicability to a major policy issue. Further, one needs to know enough about statistical analysis and research design to determine whether there is a skew, whether that skew may have been intentional, and whether that skew negatively influences the study’s results. Finally, we need to hear what Sahi and Felten say about the study’s scope, and directions for further research. No matter how well-designed and presented, every study has some limitation, if only because scientific research is not static. Scientists typically live with, and explain, such limitations.
Jumping past this investigation and analysis may be considered acceptable within the context of litigation advocacy, where the objective is to win a specific outcome for one’s client. But it is intellectually sloppy from a scientific and empirical perspective. As law professor Justin Hughes once wrote, “[T]he historian or the scientist is trained to research, to explain, and, we hope, to get to the bottom of things. The lawyer — hence, most legal academics— prepares just enough precedent to convince.”
Empiricism and science are the standards from which Sahi and Felten presented their research summary, and those are the standards any resulting final paper must meet. Our questions above are presented from the perspective of social science. Further, research and empirical support, not blind, unilateral advocacy, should be the bases upon which any information policy (especially three-strikes) is proposed and promulgated.
We can say with a strong level of confidence, however, that the way the current statutes are written, it would have been shocking if anything significantly less than 100% of the files on BitTorrent were technical infringements of copyright law. That reality — and the gap between it and societal norms — is worth continued study.
© Copyright 2010, Copycense. Twitter: @copycense