Media Cloud


Overview

Media Cloud is a project that tracks news content comprehensively, providing open, free, and flexible tools that allow unprecedented quantitative analysis of media trends. For instance, some of our driving questions are:

  1. Do bloggers introduce storylines into mainstream media or the other way around?
  2. What parts of the world are being covered or ignored by different media sources?
  3. Where do stories begin?
  4. How are competing terms for the same event used in different publications?
  5. Can we characterize the overall mix of coverage for a given source?
  6. How do patterns differ between local and national news coverage?
  7. Can we track news cycles for specific issues?
  8. Do online comments shape the news?

You can see some simple visualizations generated by our system on the main Media Cloud site, but the project is under very active development and there is much more under the hood.

[Figure: Media Cloud system flow (Mc-flow.png)]

Our ideas break down roughly into the components of our system.

Ideas

Get News Stories

We currently have a Perl script that acts as a crawler of all news stories. It works by checking the relevant RSS feeds for new stories, requesting the page, and (if necessary) programmatically determining what the "next page" link is and requesting each additional page. It works fairly well in most cases, but with an arbitrary number of new feeds and sites, you can imagine that it sometimes breaks down. There are various ways in which the crawler could be improved, including better fault tolerance, error reporting, and paging. The paging ("next page" detection) is a particularly interesting – and more generalizable – computer science problem; a sketch of one simple detection heuristic appears below the list.

  • better auto-crawler
  • improve paging system
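
Since the crawler is already a Perl script, a Perl sketch fits best here. The following is a minimal, hedged illustration of one URL-pattern heuristic for next-page detection, not the project's actual implementation; the module choices and the pagination regex are assumptions:

  #!/usr/bin/perl
  # Hypothetical sketch of URL-pattern-based "next page" detection;
  # not the actual Media Cloud crawler.
  use strict;
  use warnings;
  use LWP::UserAgent;
  use HTML::LinkExtor;

  # Given a story page URL, return a plausible "next page" URL,
  # or undef if none is found.
  sub find_next_page_link {
      my ($url) = @_;

      my $ua       = LWP::UserAgent->new( timeout => 30 );
      my $response = $ua->get($url);
      return unless $response->is_success;

      # Collect every <a href="..."> on the page; passing $url as the
      # base makes HTML::LinkExtor resolve relative links to absolute.
      my @candidates;
      my $extor = HTML::LinkExtor->new(
          sub {
              my ( $tag, %attrs ) = @_;
              push @candidates, $attrs{href}
                  if $tag eq 'a' && defined $attrs{href};
          },
          $url
      );
      $extor->parse( $response->decoded_content );

      # Crude heuristic: prefer links whose URLs match common
      # pagination patterns (these patterns are assumptions).
      for my $href (@candidates) {
          return "$href"
              if $href =~ m{(?:[?&;]page=\d+|/page/\d+|pagewanted=\d+)}i;
      }
      return;
  }

In practice the anchor text ("next", page numbers) and the link's position in the markup are at least as informative as the URL shape, which is part of what makes generalized paging detection an interesting problem.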

Extract Story Text

  • improve statistical algorithm
  • neural network?
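
Neither bullet above pins down the current algorithm, so the following Perl sketch only illustrates the general "statistical" family of approaches: scoring chunks of HTML by text density. The line-based scoring and both thresholds are assumptions, not Media Cloud's actual extractor:

  # Hypothetical sketch of density-based story-text extraction;
  # not Media Cloud's actual statistical algorithm.
  use strict;
  use warnings;

  # Score each line of HTML by the ratio of visible text to raw
  # markup and keep the dense lines, which tend to be article body.
  sub extract_story_text {
      my ($html) = @_;
      $html =~ s{<(script|style)\b.*?</\1>}{}gis;    # drop code, keep content
      my @kept;
      for my $line ( split /\n/, $html ) {
          my $text = $line;
          $text =~ s/<[^>]*>//g;                     # strip remaining tags
          my $raw_len  = length $line;
          my $text_len = length $text;
          next unless $raw_len;
          # Assumed thresholds; a learned model (the "neural network?"
          # idea above) could replace these hand-tuned constants.
          push @kept, $text
              if $text_len / $raw_len > 0.5 && $text_len > 40;
      }
      return join ' ', @kept;
  }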

Create Term List

  • include rich Calais metadata
  • experiment with alternative engines / algorithms
  • develop clustering / dynamic topic identification
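
The Calais response format isn't documented here, so a sketch of that integration would be guesswork; instead, here is a minimal, hypothetical Perl sketch of naive frequency-based term counting, just to show the shape of this pipeline stage. The stopword list and length threshold are placeholders:

  # Hypothetical sketch of naive frequency-based term extraction;
  # the real pipeline uses richer engines (e.g. Calais) for this.
  use strict;
  use warnings;

  my %STOPWORDS = map { $_ => 1 }
      qw(the a an and or of to in on for is was are that this with);

  # Count term frequencies in one story's extracted text.
  sub term_counts {
      my ($text) = @_;
      my %counts;
      for my $token ( split /\W+/, lc $text ) {
          next if length($token) < 3 || $STOPWORDS{$token};
          $counts{$token}++;
      }
      return \%counts;
  }

  # Usage: list a story's terms, most frequent first.
  my $counts = term_counts('The bailout vote followed the bailout debate.');
  for my $term ( sort { $counts->{$b} <=> $counts->{$a} } keys %$counts ) {
      printf "%-12s %d\n", $term, $counts->{$term};
  }

A real term list would add stemming and named entities, and clustering these per-story counts across sources is one route to the dynamic topic identification mentioned above.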

Allow Rich Queries

We have terabytes of data and millions of archived stories. How can we construct queries that work efficiently on this data set and generate interesting and compelling results? For instance, we are currently experimenting with time-sequence analysis of different terms across different media sources (see, for instance, our experimental charts of coverage of the bailout).
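
As a hedged illustration of the kind of query involved, here is a weekly time-sequence count of a single term per media source, written with Perl's DBI against a hypothetical story_terms table (media_id, publish_date, term, term_count); the table, column names, and database are assumptions, not the actual Media Cloud schema:

  # Hypothetical time-sequence query; the story_terms schema below is
  # an assumption, not Media Cloud's actual database layout.
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect( 'dbi:Pg:dbname=mediacloud', '', '',
      { RaiseError => 1 } );

  # Weekly mention counts of one term per media source, in a shape
  # that charts directly as one time series per source.
  my $sth = $dbh->prepare(q{
      SELECT media_id,
             date_trunc('week', publish_date) AS week,
             sum(term_count) AS mentions
        FROM story_terms
       WHERE term = ?
       GROUP BY media_id, week
       ORDER BY media_id, week
  });
  $sth->execute('bailout');

  while ( my ( $media_id, $week, $mentions ) = $sth->fetchrow_array ) {
      print "$media_id\t$week\t$mentions\n";
  }

Keeping pre-aggregated per-story or per-day counts, rather than scanning raw story text at query time, is one plausible way for a query like this to stay responsive at terabyte scale.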

How would we go about visualizing some of the questions expressed above, given the data we have? We currently use the Google Visualization API to generate our charts.