Berkman Center for Internet and Society
Harvard Law School
Norms in Cyberspace

Findings (so far)

Usenet message data was analyzed by comparing the average values for "vital properties" (enumerated below) using different aggregation schemes. Messages were aggregated by group hierarchy, by whether or not the message was found in a moderated group, by group (the fundamental unit of Usenet community) and by the top-level domain of the message poster. In this section, we summarize discoveries from our four-tiered analysis and highlight notable characteristics of Usenet self-regulation.


By Hierarchy

The hierarchy of a newsgroup is denoted by the "word" preceding the first dot in its name. Hierarchy is an online social unit with noticeable, distinct norms for three reasons:

1) The architecture of newsreaders organizes messages by hierarchy
When users decide which newsgroups to read and post to, they are constrained by their newsreader. In most cases, users "browsing" for a newsgroup will traverse the offerings sequentially, meaning they will see the messages in a hierarchy together. In most cases they will focus on a hierarchy early to narrow their options, because the choices are otherwise too numerous. Though some news archive portals have attempted to organize groups by content, the real readership of these portals is in question. Posting to online news portals is also a hassle, as it requires users to obtain yet another username and password.

2) The practices of Internet Service Providers

ISPs, which by the nature of real-world marketing and segmentation tend to serve some segment of society, pick and choose at the hierarchy level. The storage and network costs of carrying a newsfeed are significant, and carrying hierarchies that the audience does not demand, some of which are costly, is wasted capital. Hierarchies such as clari are costly (priced based on the number of end-users), and the fj hierarchy, which carries messages written primarily in Japanese script, is not cost-effective for most American news providers.

3) The meanings of the hierarchy prefixes

The hierarchies have meanings, though some are nebulous, given de jure by their names and de facto by their usage. When users create newsgroups, by the various methods of newsgroup creation, they are exercising the first level of Usenet self-regulation. The hierarchies carried by Harvard FAS are:
 
ALT – Alternative to the other main hierarchies
BIONET – relating to biological research
BIT – Archive of listservs
BIZ – Business related
CLARI – Semi-official and official newswire
COMP – relating to computers
CONTROL – Relating to the overall functioning of Usenet (cancel messages, etc.). Peopled by administrators.
FJ – Japanese hierarchy.
GNU – Gnu Software related.
HARVARD – Harvard newsgroups for class and organizations.
HUMANITIES – related to the humanities
K12 – Relating to teaching and rearing primary school aged children.
MICROSOFT - 
MISC – miscellaneous. A catchall like Alt.
NE - Geographically tied topics relating to New England.
NETSCAPE 
NEWS – Self-referential hierarchy with topics relating to Usenet itself.
REC – Related to recreation.
SCI – Science topics
SOC – For socializing
TALK – Place to go to interact with others
UK – Relating to United Kingdom.
VMSNET – Relating to VMS, Digital Equipment Corporation’s operating system.

 

We analyzed the properties of each message and then took averages of the variables for the entire hierarchy.
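The aggregation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a dict with a "newsgroup" key and one numeric variable), the function names and the sample values are hypothetical, and this is not the study's own source code.

```python
from collections import defaultdict

def hierarchy_of(newsgroup: str) -> str:
    """Return the "word" preceding the first dot in a newsgroup name."""
    return newsgroup.split(".", 1)[0].lower()

def mean_by_hierarchy(messages, variable):
    """Average one per-message variable over all messages in each hierarchy."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for msg in messages:
        h = hierarchy_of(msg["newsgroup"])
        totals[h] += msg[variable]
        counts[h] += 1
    return {h: totals[h] / counts[h] for h in totals}

# Hypothetical sample records (not data from the study).
sample = [
    {"newsgroup": "rec.guns", "num_lines": 10},
    {"newsgroup": "rec.toys.cars", "num_lines": 30},
    {"newsgroup": "alt.cascade", "num_lines": 8},
]
print(mean_by_hierarchy(sample, "num_lines"))  # {'rec': 20.0, 'alt': 8.0}
```

The same function could be reused for any of the variables discussed below by passing a different key.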

As its name suggests, the number of lines variable tells us how many lines are in the body of a message. Histograms of the raw variable for a hierarchy uniformly show observations huddled in the lower-valued buckets, with a significant blip in the high number of lines region. The blip is explained by the common practice of periodically posting a newsgroup’s FAQ to the group, as a way of decreasing the number of newbie questions on introductory-level topics in science and how-to newsgroups. These FAQ messages are often extremely long, hence the blip. The statistical remedy is to take the natural logarithm of the number of lines, which shrinks large values substantially and small values much less. After the transformation we see bell-curve shapes in hierarchies such as alt and comp, with a bumpier curve in soc.
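To make the effect of the logarithm concrete, here is a minimal Python sketch that buckets ln(number of lines) into a histogram. The bucket width, the decision to skip empty bodies and the sample line counts are assumptions made for illustration; the study's own bucketing is not specified in this section.

```python
import math
from collections import Counter

def log_line_histogram(line_counts, bucket_width=0.5):
    """Count messages per bucket of ln(number of lines)."""
    hist = Counter()
    for n in line_counts:
        if n < 1:
            continue  # ln(0) is undefined; skipping empty bodies is an assumption
        # Lower edge of the bucket that ln(n) falls into.
        bucket = math.floor(math.log(n) / bucket_width) * bucket_width
        hist[bucket] += 1
    return dict(sorted(hist.items()))

# Hypothetical line counts: mostly short messages plus one very long FAQ.
line_counts = [1, 2, 3, 10, 12, 15, 20, 2000]
for lower_edge, count in log_line_histogram(line_counts).items():
    print(f"{lower_edge:4.1f} {'#' * count}")
```

On the raw scale the 2000-line FAQ sits a hundred times farther out than the 20-line message; on the log scale it is only a few buckets away, which is what lets the bulk of the distribution take a recognizable shape.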

The number of references plots are more interesting than the number of lines plots. The number of references is the number of messages that a particular message responds to. Another way of interpreting this is as the "thread length" when the message was posted. The alt hierarchy has a smoothly decreasing curve with a steep negative slope that decreases steadily. One can imagine a tree structure of messages in which the chance that a conversation will continue decreases steadily as people add responses.
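One plausible way to obtain this variable is to count the message IDs listed in a message's References header. The sketch below assumes that is how the count is defined; the function name and the sample message are hypothetical, and the study's own parser may differ.

```python
import re
from email import message_from_string

# A message ID is a token enclosed in angle brackets.
MESSAGE_ID = re.compile(r"<[^<>\s]+>")

def num_references(raw_message: str) -> int:
    """Count the message IDs in the References header (0 if it is absent)."""
    msg = message_from_string(raw_message)
    refs = msg.get("References", "")
    return len(MESSAGE_ID.findall(refs))

# Hypothetical message whose References header is folded over two lines.
raw = (
    "From: poster@example.com\n"
    "Newsgroups: alt.test\n"
    "References: <1@example.com> <2@example.com>\n"
    " <3@example.com>\n"
    "\n"
    "Body of the reply.\n"
)
print(num_references(raw))  # 3
```

A message that starts a thread has no References header and therefore counts as 0, which corresponds to the "seedling" messages discussed in this section.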

Other hierarchies, such as rec and comp, exhibit distinctly different curvature. In these hierarchies more messages fall in the 1-3 reference range than in the 0 range. This demonstrates greater dialogicality, perhaps showing that some message threads "fan out" widely and rapidly, overpowering the cumulative statistical effect of seedling messages going unsprouted. The Misc hierarchy’s distribution for number of references has a thought-provoking shape. It looks as if an alt-type distribution and a comp-type distribution have been superimposed. We attribute this to Misc’s split personality. The Misc hierarchy lives up to its name as a mixture of different groups peopled by different posters, and hence it exhibits the curvature of both hierarchy types.

The number of lines quoted varies from hierarchy to hierarchy as well. This is a measure of the number of lines quoted from an earlier message somewhere on Usenet. The Talk and Alt hierarchies have the highest means for this variable. Large amounts of quoting do not necessarily imply dialogicality. Newsreaders are sometimes set to quote by default, and users who are not adept at configuration can’t help but quote large amounts of text when responding. Heavy quoting may therefore actually be a measure of newbieness. As expected, the clari hierarchy, which carries information published by "official" and "semi-official" authorities, has almost no quoting.
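This section does not say how quoted lines were detected. A common convention is that newsreaders prefix quoted text with ">", and the following sketch assumes that convention; the function name and the sample body are hypothetical.

```python
def count_quoted_lines(body: str, marker: str = ">") -> int:
    """Count body lines that start with the quote marker (after any indentation)."""
    return sum(1 for line in body.splitlines() if line.lstrip().startswith(marker))

# Hypothetical message body with three quoted lines and two original ones.
body = (
    "Alice wrote:\n"
    "> first quoted line\n"
    ">> a nested quote\n"
    "  > an indented quote\n"
    "My reply\n"
)
print(count_quoted_lines(body))  # 3
```

A newsreader that quotes the entire parent message by default would inflate this count regardless of how much the poster actually engages with the quoted text, which is the caveat raised above.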

Profanity by hierarchy is highest in the News hierarchy. This can be attributed to the small number of groups in this hierarchy and the administrative nature of a large proportion of these groups. The group news.admin.net-abuse.usenet, which we studied, carries recommended cancellations of pornographic content and messages that are considered defamatory, often with a long quote trail of abusive flames.

We hired a human to rate messages to determine whether they were flame or spam. Flame was defined as disciplinary or hostile speech. Flames were found to be most prevalent in the comp, alt and rec hierarchies.

Spam was defined as commercial speech, that is, speech originating from a profit motive. Commercial speech was found to comprise over 50% of all messaging in the biz hierarchy. Misc and Alt were found to have 15% and 13% commercial messaging, respectively. The UK hierarchy was found to have 50% commercial messaging; however, this is attributed to the size of the random sample in this particular hierarchy. Because of the number of groups in the UK hierarchy relative to other hierarchies, it was granted two representative groups in the random sample. One of these was "uk.jobs.offered," hence the apparent British preference for a highly commercialized Internet. Overall spam and flame rates on Usenet were 8% and 2%, respectively.
 
 

By Moderation

A listing of moderated groups was provided by Denis McKeon, who maintains various FAQs about Usenet. A flag signifying moderation was set for the groups listed. Moderation implies the presence of an approval mechanism. For some groups approvals are programmatic, and for others a person does the deed. Programmatic approvals can exclude based on hostile language and other criteria. Another programmatic exclusion criterion is the absence of a particular "code word" from the first posting to the newsgroup by a particular email address. The presence of this word simply ensures that the poster has followed the group long enough to know the code word, which is periodically posted by the group’s human moderator. This confounds newsgroup autoposters as well as newbies who haven’t read the group enough to know its topic or whether their question has already been asked and answered.

Mean values for variables were recalculated for moderated groups and non-moderated groups. Not surprisingly, messages in moderated groups were longer, with much less quoting, less profanity and less excessive capitalization. Somewhat surprisingly, there was also less message depth (mean number of references). Dialog is perhaps discouraged by the presence of an authority. In the absence of moderation, the only thing like moderation is dialog, and so it becomes the prevailing organ of normalization. Further, there is less uncloaked cross-posting. That is, fewer messages are posted to the "neighborhood" of newsgroups comprising the topic of a particular newsgroup.
 
 

By Group

The newsgroup is the basic unit of congregation on Usenet. Different groups have different personalities, which emerge over time through the interactions of posters. The name of a newsgroup is extremely important, as it is the group’s advertisement. Because of the size constraints on the advertisement (a long newsgroup name is difficult to recollect and hence time-consuming to type into a news portal), abbreviated topics are used. Sometimes these cause problems. Take the case of misc.int-property. This group is officially (whatever that means) about intellectual property. However, one will often find postings about international properties for rent or sale on this group. This fundamental flaw in the Usenet architecture dilutes the topicality of groups and Usenet’s consequent usefulness as a medium. In this study, we have tried to get at the personalities of newsgroups by averaging dialogical variables for the groups, knowing full well that personalities on Usenet can be split.

The top ten most popular groups in the random sample in terms of posting frequency are:
 
Number of Observations   Newsgroup
23849                    Rec.collecting.sport.football
19357                    Rec.collecting.sport.baseball
13300                    Rec.sport.pro-wrestling
12716                    Alt.support.depression
12065                    Rec.guns
11699                    Rec.toys.cars
10355                    Uk.jobs.offered
9812                     Alt.test.test
9069                     Talk.origins
8953                     Rec.games.computer.ultima.onli

The top ten groups sorted by Average message depth (Number of References) are:
 
Number of References   Newsgroup
10.94118               Alt.fan.ok-soda
9.944659               Alt.fan.g-gordon-liddy
8.783854               Alt.nuke.europe
8.668245               Talk.atheism
8.605872               Alt.cascade
8.078854               Alt.books.m-lackey
7.70229                Soc.culture.latin-america
7.559889               Alt.kill.the.whales
7.465805               Uk.politics.misc
7.436369               Talk.politics.libertarian

Why the dialog in a newsgroup about a discontinued caffeinated beverage is so involved is elusive. Controversial topics, though, like Alt.nuke.europe invite heated conversation and the ensuing long threads.

The top ten most profane groups are:
 
Profanity Count   Newsgroup
7                 Alt.irc.lamers
3.098485          Rec.pets.dogs.info
1.657143          Alt.lies
1.580645          Alt.misanthropy
1.44047           Alt.music.alternative
1.277778          Alt.flame.faggots
0.7497116         Alt.fan.zoogz-rift
0.7041199         Alt.slack
0.658363          Alt.cascade
0.6414931         Alt.nuke.europe

 

Alt.irc.lamers is a mysterious group. A hunt on Dejanews yielded the message:

cmsg rmgroup alt.irc.lamers
Author: The News Administrator <news@news.ilx.com>
Date: 1995/07/30

Forums: alt.irc.lamers.ctl, control

The time has come for the newsgroup called:
alt.irc.lamers
to go the way of all things. We have not seen any relevant posts to
this group in the past few months. It's time to free up the inodes
and make the active list a bit shorter. Please go along with this
rmgroup message.

Strangely, this message is dated 1995 and nothing more recent is available on Dejanews. I believe that the group must have been killed in ’95 and then revived more recently, only to be carried by very few news servers, one such non-discriminating server being Harvard FAS. The FAS newsfeed also seems to have discontinued the newsgroup recently, but not before we captured a sample of this group. The few messages present (7) appear to be highlights from "chat logs." IRC (Internet Relay Chat) sessions are verbose dialogues between online personae that revel in vulgarity and online "conquest."

Alt.lies seems to be a hotbed of Clinton bashing. The topic of "lies" evokes anger in many users and that anger translates to the high profanity reading in the study.

The rec.pets.dogs.info high-placement is probably caused by the study’s consideration of "bitch" to be a profanity.

The top ten groups in terms of average vocabulary size are:
 
Vocabulary Size   Newsgroup
2135              Alt.filesystems.afs
1322.714          Soc.culture.pakistan.history
1135.083          Rec.pets.dogs.info
828.7071          Comp.mail.maps
710.65            Rec.pets.cats.announce
709.7692          Rec.music.info
698.9375          Alt.comics.lnh
663.3907          Bit.listserv.albanian
642.3462          Rec.games.computer.doom.announ
578.3125          Soc.culture.indian.info

The culture and pet groups have high mean vocabulary sizes, possibly because of the non-standard vocabularies of these areas: the various species names and foreign terms. The vocabulary size measurement doesn’t account for variations in spelling. Diversity of spelling is common for foreign terms represented in the Roman alphabet and for specialized vocabularies.

The high placement of filesystems might be attributed to the common practice of posting directory listings for problematic filesystems to this newsgroup, in hopes that an expert out there might advise.

Excessive Capitalization top 10:
 
# of Words Excessively Capitalized   Newsgroup
308                                  Biz.comp.telebit
216.0588                             Clari.local.north_dakota
159.4                                Soc.culture.pakistan.history
132.2214                             Comp.mail.maps
110.8182                             Clari.sports.local.midwest.wis
96                                   Alt.filesystems.afs
94.16667                             Alt.sources
93.53846                             Rec.games.computer.doom.announ
91.44231                             Rec.music.info
83.54902                             Alt.fan.ok-soda

 

Ranking of newsgroups by average word-length
 
Average Word Length   Newsgroup
14.61612              Alt.flame.faggots
11.8492               Alt.hk.spcc
9.536466              Biz.clarinet.web.xcache.small
7.30458               Comp.security.pgp.test
7.233492              Rec.arts.puppetry
7.224348              Bionet.protista
7.224266              Clari.tw.computers.releases
7.197802              Bit.listserv.psycgrad
7.145779              Bionet.genome.arabidopsis
7.127086              Vmsnet.networks.tcp-ip.ucx

The extreme value for the top group in this ranking is unexplained. A large standard deviation in this variable for this group might be a sign of odd messaging practices and non-traditional ASCII combinations. The hk group might use double-byte character encoding.

Top ten ranking by number of lines quoted
 
Number of Lines Quoted   Newsgroup
401.0764                 Alt.test.test
48.14286                 Alt.Cajun.info
45.34314                 Alt.guinea.pig.conspiracy
44.57009                 Soc.culture.indian.jammu-kashm
42.94118                 Alt.fan.ok-soda
41.95                    Talk.religion.pantheism
39.76957                 Alt.cascade
37.03366                 Soc.culture.greek
36.66667                 Alt.pave.bosnia
30.84828                 Soc.culture.latin-america

Top Ten Ranking by number of words:
 
Number of Words   Newsgroup
8964              Alt.filesystems.afs
3033.8            Soc.culture.pakistan.history
2951.5            Comp.specification.larch
2849.164          Comp.mail.maps
2812.576          Rec.pets.dogs.info
2219.808          Rec.games.computer.doom.announ
2121.167          Alt.sources
1842.415          Rec.music.info
1689.313          Soc.culture.indian.info
1678.188          Alt.comics.lnh

Number of Groups

The mean number of groups to which the messages in a group are posted flags the presence of a "neighborhood" (netscan) of newsgroups that the posters perceive as caring about the topics of posts. A high value can be caused by a dispersion of audience across many groups or by the interdisciplinarity of a message. The top groups in the chart below have very few observations, so the messaging is probably anomalous. However, soc.culture.pakistan.history definitely exhibits consistent uncloaked cross-posting.
 
Number of Groups   Std. Dev.   Newsgroup                        Hierarchy   Number of Observations
26                 .           alt.fan.dr-bronner               alt         1
10                 0           comp.sys.harris                  comp        2
10                 0           alt.mock.the.court               alt         4
10                 .           harvard.course.phys5             harvard     1
10                 0           comp.sys.proteon                 comp        2
10                 0           alt.bonehead.dave-clayton        alt         2
9                  .           comp.networks.noctools.bugs      comp        1
9                  .           alt.flame.tim-gilman             alt         1
8                  3.464102    alt.ccds                         alt         5
7.914286           0.4453306   soc.culture.pakistan.history     soc         35
7.236842           2.235273    clari.news.crime.war             clari       38
7.215827           2.728706    alt.politics.datahighway         alt         139
6.990385           2.110975    clari.biz.top                    clari       104
6.862745           2.289276    alt.fan.ok-soda                  alt         51
6.714286           2.939811    alt.fan.g-gordon-liddy           alt         777
6.701613           2.845564    clari.tw.misc                    clari       124
6.666667           1.21106     alt.sex.not                      alt         6
6.5                4.041452    comp.graphics.apps.data-explor   comp        4
6.5                3.173551    comp.os.ms-windows.apps.winsoc   comp        22
6.444444           3.166667    clari.living.bizarre             clari       9
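The "number of groups" variable in the chart above can be sketched as a count of the names in a message's comma-separated Newsgroups header, summarized per group. This is an illustrative sketch only: the function names and sample headers are hypothetical, and reading the chart's second numeric column as a sample standard deviation (undefined for a single observation) is an assumption.

```python
from statistics import mean, stdev

def num_groups(newsgroups_header: str) -> int:
    """Count the distinct newsgroups named in a comma-separated Newsgroups header."""
    names = {g.strip().lower() for g in newsgroups_header.split(",") if g.strip()}
    return len(names)

def summarize(headers):
    """Return (mean, sample standard deviation) of the per-message group counts.

    The standard deviation is None when there is only one observation.
    """
    counts = [num_groups(h) for h in headers]
    sd = stdev(counts) if len(counts) > 1 else None
    return mean(counts), sd

# Hypothetical Newsgroups headers from two messages in one group.
headers = [
    "soc.culture.pakistan, soc.culture.indian",
    "soc.culture.pakistan,soc.culture.indian,alt.test,talk.origins",
]
print(summarize(headers))
```

A low spread relative to the mean, as reported for soc.culture.pakistan.history, would indicate that nearly every message is cross-posted to a similar number of groups.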

 

The emoticon analysis is based on a series of Perl regular expression matches that look for standard facial expressions represented in ASCII. The regular expressions can be examined in emoticon-mod.pl. A Perl programmer will notice that the expressions presume a rough left-to-right ordering of facial features: brow, eyes, nose (optional) and mouth. Because this emoticon grammar is nearly universal, the number of regular expressions needed was reduced. The groups with the highest percentages of emoticon-charged messages are shown below. Six groups in our random sample exhibit emoticon-charged posting rates greater than or equal to 50%.

145 groups in our random sample exhibited no emoticon-charged messages. These groups are likely to be either highly technical or desolate, with extremely low posting frequencies.
 
Is Emoticon (bool)   Newsgroup
0.67                 alt.flame.mud
0.59                 rec.games.computer.doom.announ
0.55                 alt.fan.brie
0.50                 bit.listserv.xtropy-l
0.50                 alt.znet.aeo
0.50                 alt.fan.jeremy-reimer
0.40                 alt.filesystems.afs
0.37                 alt.music.moxy-fruvous
0.37                 rec.music.info
0.35                 alt.games.omega
0.35                 comp.sys.amiga.datacomm
0.34                 rec.arts.tv.uk.emmerdale
0.33                 alt.guinea.pig.conspiracy
0.33                 alt.bonehead.dave-clayton
0.33                 alt.irc.lamers
0.31                 soc.culture.belarus
0.31                 alt.support.shyness
0.31                 alt.fan.lemurs
0.31                 alt.tv.twin-peaks
0.30                 soc.culture.afghanistan
0.30                 alt.timewasters
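The left-to-right emoticon grammar described above (brow, eyes, optional nose, mouth) can be illustrated with a short Python sketch. It is not a port of emoticon-mod.pl: the character sets, the choice to treat the brow as optional and the requirement that a face be delimited by whitespace are all assumptions made for the example.

```python
import re

# Facial features in left-to-right order. These character sets are
# illustrative assumptions, not the ones used in emoticon-mod.pl.
BROW = r"[>}]?"          # optional (assumed)
EYES = r"[:;=]"
NOSE = r"[-o^]?"         # optional
SMILE_MOUTH = r"[)\]D]"
FROWN_MOUTH = r"[(\[]"

def _face(mouth: str) -> re.Pattern:
    # Require whitespace (or the edge of the text) on both sides so that
    # punctuation inside words and URLs is not counted as a face.
    return re.compile(rf"(?<!\S){BROW}{EYES}{NOSE}{mouth}(?!\S)")

SMILE = _face(SMILE_MOUTH)
FROWN = _face(FROWN_MOUTH)

def count_smiles(body: str) -> int:
    return len(SMILE.findall(body))

def count_frowns(body: str) -> int:
    return len(FROWN.findall(body))

def is_emoticon(body: str) -> bool:
    """True if the message contains at least one smile or frown."""
    return bool(SMILE.search(body) or FROWN.search(body))

text = "Nice try :-) but no ;) >:( sad :("
print(count_smiles(text), count_frowns(text), is_emoticon(text))  # 2 2 True
```

Because every face shares the same feature order, one template with a swappable mouth class covers both smiles and frowns, which mirrors the reduction in the number of expressions noted above.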

 

Smiliest newsgroups (by number of smiles per message):
 
Num Smiles   Newsgroup
2.17         comp.security.pgp.test
1.56         alt.cascade
1.46         alt.guinea.pig.conspiracy
1.34         alt.hk.spcc
1.27         alt.fan.brie
0.93         rec.games.computer.doom.announ
0.90         soc.culture.rep-of-georgia
0.80         alt.filesystems.afs
0.75         alt.games.nomic
0.67         alt.flame.mud
0.66         alt.comics.lnh
0.64         soc.culture.turkish
0.64         alt.sex.fetish.startrek
0.59         alt.tv.twin-peaks
0.55         gnu.cfengine.bug
0.54         soc.culture.french
0.53         alt.music.moxy-fruvous
0.50         bit.listserv.xtropy-l
0.50         alt.znet.aeo
0.50         alt.fan.jeremy-reimer

Frowniest newsgroups (by number of frowns per message):
 
Num Frowns   Newsgroup
0.92         alt.mock.the.court
0.19         news.admin.net-abuse.usenet
0.17         alt.toolkits.xview
0.16         alt.sport.foosball
0.16         comp.graphics.apps.data-explor
0.14         comp.org.isoc.interest
0.10         news.admin.hierarchies
0.10         comp.lsi
0.09         alt.fan.lemurs
0.09         comp.sys.newton.programmer
0.08         alt.technology.obsolete
0.08         comp.sys.amiga.datacomm
0.07         rec.games.computer.doom.announ
0.07         rec.arts.tv.uk.emmerdale
0.07         rec.music.info
0.06         alt.sources.amiga
0.06         alt.tv.mst3k
0.06         alt.timewasters
0.06         alt.radio.networks.cbc
0.06         alt.animation.warner-bros

By Domain of Poster

The top twenty top-level domains by posting frequency are shown in the chart below. There is a clear US dominance of Usenet, but also significant adoption among the other English-speaking countries of the world. Japan has a strong presence, evidenced by the "fj" hierarchy, which carries messages originating in Japan and written in Japanese.
 
Postings   Top-Level Domain
458301     Commercial, mainly USA (com)
204724     Network (net)
54550      USA Educational (edu)
48195      United Kingdom (uk)
38429      Unknown (unknown)
23007      Canada (ca)
17291      Non-Profit Making Organisations (org)
11393      Australia (au)
7974       Germany (de)
5898       Netherlands (nl)
3972       France (fr)
3079       United States (us)
2563       Sweden (se)
2513       New Zealand (nz)
2268       Norway (no)
2063       Japan (jp)
2050       Finland (fi)
1883       Belgium (be)
1849       USA Government (gov)
1841       Spain (es)

 

In the next chart we show the top twenty second level domains by posting frequency:
 
Postings   Second-Level Domain
125172     aol.com
42474      co.uk
27005      my-dejanews.com
23564      hotmail.com
21804      netcom.com
14676      mindspring.com
13762      webtv.net
13583      earthlink.net
12643      att.net
11374      clari.net
9568       cooltest.com
9345       erols.com
7339       yahoo.com
7033       home.com
5893       usa.net
5720       bellsouth.net
5677       geocities.com
5031       com.au
4967       msn.com
4691       ac.uk

Notice that aol.com (America Online) accounts for roughly three times as many postings as co.uk (all British commercial service providers).
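The two charts above imply a simple classification of each poster's address: the last label of the host name is the top-level domain, and the last two labels form the second-level domain (which is why co.uk and com.au appear as single entries). A minimal Python sketch follows; the function names and sample addresses are hypothetical, and mapping unparseable addresses to "unknown" is an assumption about how that category arose.

```python
from collections import Counter

def _host_labels(address: str) -> list:
    """Split the host part of an email address into lowercase DNS labels."""
    if "@" not in address:
        return []
    host = address.rsplit("@", 1)[1].strip().lower().rstrip(".")
    return [label for label in host.split(".") if label]

def top_level_domain(address: str) -> str:
    labels = _host_labels(address)
    return labels[-1] if len(labels) >= 2 else "unknown"

def second_level_domain(address: str) -> str:
    labels = _host_labels(address)
    return ".".join(labels[-2:]) if len(labels) >= 2 else "unknown"

# Hypothetical poster addresses (not taken from the data set).
posters = [
    "jane@mail.example.com",
    "bob@host.example.co.uk",
    "eve@example.com",
    "not-an-address",
]
print(Counter(top_level_domain(a) for a in posters).most_common())
print(Counter(second_level_domain(a) for a in posters).most_common())
```

Taking the last two labels is deliberately naive: it lumps every British commercial provider into co.uk, exactly as the second chart does.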

Of all top-level domains peopled mostly by Americans, EDU has the highest mean number of references. This indicates that Usenet users affiliated with educational institutions engage most enthusiastically in dialog over the Internet.

However, in a global comparison of dialogicality that includes all TLDs represented on Usenet, EDU just barely makes the top 10:
 
Number of References   Top Level Domain
5.15                   Iceland
4.57                   Ireland
4.23                   Yugoslavia
4.21                   Luxembourg
3.72                   France
3.72                   Argentina
3.62                   Chile
3.58                   Belgium
3.25                   Poland
3.25                   USA Educational

This greater dialogicality exhibited by users from other nations might be attributed to the early adopters in the other countries being highly educated.

The US has the most profane posters, with Pakistan following in second place:
 
Average Profanity Count per Message   Top Level Domain
0.75                                  United States (us)
0.48                                  Pakistan (pk)
0.26                                  Yugoslavia (yu)
0.23                                  Non-Profit Making Organisations (org)
0.22                                  Oman (om)
0.20                                  South Korea (kr)
0.19                                  Poland (pl)
0.18                                  Finland (fi)
0.17                                  Jordan (jo)
0.13                                  USA Military (mil)

The "Language Density" is the quotient of the mean vocabulary size and the mean number of words in posts. It is meant as a metric of overall command of the English language combined with the tendency to post content-dense messages. Users from the non-English-speaking world clearly demonstrate greater command of the language than those who claim English as their mother tongue.
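The quotient can be written out as a short Python sketch. Treating a post's vocabulary size as its number of distinct whitespace-separated words (compared case-insensitively) is an assumption, as are the function name and the sample posts; the study's own tokenization is not described in this section.

```python
from statistics import mean

def language_density(posts) -> float:
    """Mean vocabulary size divided by mean number of words over a set of posts."""
    word_lists = [post.split() for post in posts]
    vocab_sizes = [len({w.lower() for w in words}) for words in word_lists]
    word_counts = [len(words) for words in word_lists]
    if not word_counts or sum(word_counts) == 0:
        raise ValueError("language density is undefined without any words")
    return mean(vocab_sizes) / mean(word_counts)

# Hypothetical posts from one top-level domain.
posts = ["the cat saw the dog", "hello world"]
print(round(language_density(posts), 4))  # 0.8571
```

Here the mean vocabulary size is 3 (4 and 2 distinct words) and the mean number of words is 3.5 (5 and 2 words), giving 3 / 3.5. A value of 1 would mean that no word is ever repeated within a post.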
 
"Language Density"   Top Level Domain
0.9117647            Georgia (ge)
0.8947369            Malta (mt)
0.8611111            Kuwait (kw)
0.8554778            Mauritius (mu)
0.8515038            Saudi Arabia (sa)
0.8484398            Uruguay (uy)
0.8329171            Japan (jp)
0.8235294            Papua New Guinea (pg)
0.8220947            Bermuda (bm)
0.819672             Trinidad and Tobago (tt)
0.8184913            Thailand (th)
0.8069299            Turkey (tr)
0.8062131            Oman (om)
0.8053221            Latvia (lv)
0.7925533            Estonia (ee)
0.7848885            India (in)
0.7789474            Bahamas (bs)
0.7740635            Luxembourg (lu)
0.7727274            United Arab Emirates (ae)
0.7704339            Ireland (ie)

I believe that these surprising results can be attributed to early adopters in any nation being the most highly educated contingent of the Internet populace. After all, though the original domestic purpose of the Internet was military, the Internet claims the network of academic institutions as its earliest international backbone.