Berkman Center for Internet and Society
Harvard Law School
Norms in Cyberspace

Methodology

The core of our analysis is a random sample of 1,000 newsgroups drawn from Harvard University’s FAS newsfeed over a 28-day period. Analyzing every group in the newsfeed would have yielded more data points; however, because of Usenet’s distributed architecture, no newsfeed is complete, so any newsfeed is itself a random or not-so-random sample. A message is considered to fall within the interval based on the timestamp it bears. Because message propagation lag is inherent in the Usenet architecture, we allowed a one-week buffer after the end of the sample interval before extracting message properties, to ensure consistent message density.
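
The sampling and interval rules above can be sketched in a few lines. This is an illustrative sketch only, written in Python rather than the Perl used for the project; the function names, the fixed seed, and the assumption that timestamps come from a standard `Date` header are ours, not details taken from the study's source code.

```python
import random
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def sample_groups(all_groups, k=1000, seed=0):
    """Draw k newsgroups uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed (our assumption) keeps the draw repeatable
    names = sorted(set(all_groups))
    return rng.sample(names, min(k, len(names)))


def in_interval(date_header, start, days=28):
    """True if the message's own timestamp falls in [start, start + days)."""
    try:
        stamp = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # missing or unparseable Date header: leave the message out
    if stamp.tzinfo is None:
        # Headers without a usable zone are treated as UTC (an assumption).
        stamp = stamp.replace(tzinfo=timezone.utc)
    return start <= stamp < start + timedelta(days=days)


def ready_to_extract(now, start, days=28, buffer_days=7):
    """True once the propagation buffer after the interval has elapsed."""
    return now >= start + timedelta(days=days + buffer_days)
```

A message is kept only if `in_interval` is true for its `Date` header, and extraction is deferred until `ready_to_extract` is true, so that late-arriving messages stamped inside the interval are still counted.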

We analyzed messages, the atomic unit of this study, in subsets defined by four criteria: hierarchy, moderation status (moderated or unmoderated), the domain of the message poster, and group. The analysis covers both message headers and message bodies. Linguistic analysis is implemented with Perl regular expression matching. Though more penetrating linguistic analysis is certainly possible, the project’s time constraints ruled out the use of Lisp and neural networks. The linguistic portion of the project has been guided by the 80-20 rule: obtain 80% of the meaningful newsgroup statistics with 20% of the relevant education.
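
To make the per-message extraction concrete, here is a minimal sketch of how one message might be reduced to the four subset keys plus a regular-expression count. It is written in Python rather than the project's Perl, and several parts are assumptions: the `ALL_CAPS_WORD` pattern is a hypothetical stand-in for the study's actual expressions, the `moderated_groups` set is assumed to be supplied from elsewhere, and the first group in the `Newsgroups` header is taken as the message's group.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical pattern: words of four or more capital letters ("shouting").
ALL_CAPS_WORD = re.compile(r"\b[A-Z]{4,}\b")


def message_properties(raw, moderated_groups=frozenset()):
    """Reduce one raw message to a flat record of properties."""
    msg = message_from_string(raw)

    # Newsgroups is a comma-separated list; we key on the first entry.
    groups = [g.strip() for g in msg.get("Newsgroups", "").split(",") if g.strip()]
    group = groups[0] if groups else ""

    # Poster domain is whatever follows the last "@" in the From address.
    _, address = parseaddr(msg.get("From", ""))
    domain = address.rpartition("@")[2].lower() if "@" in address else ""

    body = msg.get_payload()
    if not isinstance(body, str):  # multipart messages are skipped in this sketch
        body = ""

    return {
        "group": group,
        "hierarchy": group.split(".")[0] if group else "",
        "moderated": group in moderated_groups,
        "poster_domain": domain,
        "caps_words": len(ALL_CAPS_WORD.findall(body)),
        "body_lines": len(body.splitlines()),
    }
```

Each returned dictionary corresponds to one observation; collecting them for every sampled message gives a table with one row per message and one column per property.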

A matrix of the properties of each message observation is then fed into Stata, a statistical analysis program. Stata tabulates the observations, taking the mean and standard deviation of each variable. The output logs of the Stata runs are passed through a Perl script that converts them into a form readable by the Excel spreadsheet program, which we then use to generate charts.
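
The tabulation step can be illustrated without Stata. The sketch below, in Python, groups observations by one subset key, takes the mean and sample standard deviation of each variable, and writes the result as comma-separated text that a spreadsheet can open. It stands in for the Stata and Perl steps described above and is not a reproduction of them; the function names and column naming scheme are our own.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, stdev


def summarize(rows, by, variables):
    """Mean and sample standard deviation of each variable, per value of `by`."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[by]].append(row)

    summary = []
    for key in sorted(buckets):
        members = buckets[key]
        line = {by: key, "n": len(members)}
        for var in variables:
            values = [float(m[var]) for m in members]
            line[var + "_mean"] = mean(values)
            # A standard deviation needs at least two observations.
            line[var + "_sd"] = stdev(values) if len(values) > 1 else None
        summary.append(line)
    return summary


def to_csv(summary):
    """Render the summary as CSV text; None is written as an empty cell."""
    if not summary:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(summary[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(summary)
    return out.getvalue()
```

For example, `to_csv(summarize(rows, "hierarchy", ["body_lines"]))` yields one line per hierarchy with its observation count, mean body length, and standard deviation.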