
Waiting for Browsers that Parse

Jorn Barger 16 December 1998

Conventional net.wisdom holds that websurfers don't like to read long articles online. So publishers respond by loading down their pages with graphic elements, and breaking up long pieces into many short pages.

But in fact, long before there was a Web, a vital community of online readers existed in the Usenet newsgroups. And this community evolved a highly efficient model of article reading, built on subscriptions, killfiles, and threading.

The newsreader of choice for power users was 'trn', Wayne Davison's threaded extension of Larry Wall's 'rn', so optimised that a full session of newsreading could be accomplished simply by hitting the spacebar repeatedly.

Websurfing has to aspire to this level of elegance!

One should subscribe to a website as to a newsgroup, declaring a killfile for it, and the browser should detect and download new articles, deleting them only after they've been checked out.
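Here's a minimal sketch of what that subscription loop might look like in Perl, using LWP::Simple-- the subscription list, the killfile patterns, and the 'seen.db' state file are all invented stand-ins, not a real design:

    #!/usr/bin/perl -w
    # Sketch: poll subscribed sites, kill unwanted links, report only new ones.
    use strict;
    use LWP::Simple;

    my @subscriptions = ('http://www.example.com/news/');  # hypothetical site
    my @killfile      = (qr/sports/i, qr/horoscope/i);     # patterns to suppress
    my %seen;                                              # articles already checked out

    if (open SEEN, 'seen.db') {                            # load prior state
        while (<SEEN>) { chomp; $seen{$_} = 1 }
        close SEEN;
    }

    foreach my $site (@subscriptions) {
        my $page = get($site) or next;
        while ($page =~ /<a\s+href="([^"]+)"/gi) {         # crude link dissection
            my $url = $1;
            next if $seen{$url};                           # old news
            next if grep { $url =~ $_ } @killfile;         # killed
            print "NEW: $url\n";
            $seen{$url} = 1;
        }
    }

    open SEEN, '>seen.db' or die "can't save state: $!";   # save for next session
    print SEEN "$_\n" for keys %seen;
    close SEEN;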

The browser should generate HTML index-pages that present selected meta-information in a readable format. Articles themselves should be reformatted for readability-- recombining multipage presentations into a single page, removing the distractions of multicolumn text, and suppressing banner ads.
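As a rough sketch of that reformatting pass-- the ad-host patterns are illustrative guesses rather than a real blocklist, and the 'next'-link convention is assumed:

    use LWP::Simple;

    # Recombine a multipage presentation by following its 'next' links.
    sub slurp_pages {
        my ($url) = @_;
        my ($whole, %visited) = ('');
        while ($url && !$visited{$url}++) {
            my $page = get($url) or last;
            $whole .= $page;
            ($url) = $page =~ /<a\s+href="([^"]+)"[^>]*>\s*next/i;
        }
        return $whole;
    }

    # Scrub a page for readability.
    sub reformat_page {
        my ($html) = @_;
        # suppress banner ads: drop images served from ad networks
        $html =~ s{<img[^>]+(?:doubleclick|adserver|banner)[^>]*>}{}gis;
        # collapse multicolumn layout tables into plain flowing text
        $html =~ s{</?(?:table|tr|td|th)[^>]*>}{ }gis;
        return $html;
    }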

Bookmarking options should include the capacity to web-publish selected bookmarks in an annotated 'weblog', and others' weblogs should be a primary source of new articles. Postings to newsgroups and mailing lists should be integrated into this system as well.
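A few lines of Perl suggest how the publishing half might work-- the tab-separated bookmark file and its fields are invented for illustration:

    # Sketch: publish annotated bookmarks as a weblog page.
    open LOG,  'bookmarks.txt' or die $!;    # one "url <tab> comment" per line
    open HTML, '>weblog.html'  or die $!;
    print HTML "<html><head><title>weblog</title></head><body>\n";
    while (<LOG>) {
        chomp;
        my ($url, $note) = split /\t/, $_, 2;
        print HTML qq{<p><a href="$url">$url</a> -- $note</p>\n};
    }
    print HTML "</body></html>\n";
    close LOG; close HTML;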

There's no reason to wait for XML to make this automatic-- stylesheets can't do half of what's needed anyway, and who knows when XML will be widely adopted. It should be easy enough to generalise the repeated formatting-patterns of a given site, and build regexps that can dissect and reassemble these. (Perl-- also by Larry Wall-- is the language of choice for experimenting with this, at the moment.)
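For example, if a site wraps every headline in the same boilerplate markup, one regexp can dissect it and a print statement can reassemble it. The table-cell pattern below is made up, standing in for whatever pattern a given site actually repeats:

    # Sketch: dissect a site's repeated headline markup, re-emit a clean index.
    my $page = join '', <STDIN>;
    my $pattern = qr{<td class="headline"><a href="([^"]+)">([^<]+)</a></td>\s*<td class="blurb">([^<]+)</td>}i;

    while ($page =~ /$pattern/g) {
        my ($url, $headline, $blurb) = ($1, $2, $3);
        print qq{<p><a href="$url">$headline</a><br>$blurb</p>\n};
    }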

Regular expressions explained: http://www.lib.uchicago.edu/keith/tcl-course/topics/regexp.html

The simplest user-interface might be a "Reformat" button in the browser that displays the HTML source of a page, broken up into sections, each of which can be re-tagged, via a popup menu, for re-formatting.
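The first step of such a button might be no more than a split on block-level tags-- a guess at the mechanics, not a spec:

    # Sketch: break a page's source into retaggable sections at block-level tags.
    my $source = join '', <STDIN>;
    my @sections = split m{(?=<(?:h[1-6]|p|table|div|ul|ol)\b)}i, $source;
    my $n = 0;
    foreach my $chunk (@sections) {
        printf "--- section %d ---\n%s\n", ++$n, $chunk;
    }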

The software category that comes closest at the moment is "Update Bots". BotSpot has inventoried 50+ of these: http://botspot.com/search/s-update.htm

Thomas Boutell's commercial app "Morning Paper" is one of the leaders in the field: http://www.boutell.com/morning/manual.html

WebPluck is another-- free, and written in Perl-- allowing target fields to be defined via Perl regexps: http://strobe.weeg.uiowa.edu/~edhill/public/webpluck/

MacHeadlines: http://www.macalive.com/macheadlines/features.html

QuickBrowse: http://www.quickbrowse.com/start.html

Twurl: http://www.twurl.com/

InterMute suppresses banner ads and offers a few other limited forms of parsing: [article]

A Hotmail filter: http://www.cwebmail.com/

NewsHub is a Perl-powered website that uses comparable technology, based on Joe McDonald's grommit: http://www.newshub.com

An add-on that parses image names: http://members.aol.com/Trane64/java/SmartBrowser.html

A Perl script called "Daily Update": http://www.cs.virginia.edu/~dwc3q/code/DailyUpdate/index.html

An 'offline browser' called Smart Bookmarks: http://www.zdnet.com/pcmag/features/utility/offbrwsr/uob7.htm

A utility for simplifying pages on Palm devices: http://www.newscientist.com/ns/19990501/newsstory4.html

The AI

The browser should understand that even 'periodical' websites will vary from issue to issue in exactly which day and time the new material appears. It should take a quick peek earlier than expected, and adjust its schedule accordingly. If a new issue has broken links, it should know to check back every few hours (or even email a note to the webmaster!).
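A first approximation of that scheduling logic, in the same Perl idiom-- the 24-hour cycle, the one-hour early peek, and the three-hour retry are invented numbers:

    use LWP::Simple;

    # Sketch: adjust one site's polling schedule around its observed habits.
    my %site = (
        url      => 'http://www.example.com/',   # hypothetical periodical
        expected => time() + 24*60*60,           # when the next issue is due
        last_mod => 0,                           # Last-Modified seen on last visit
    );

    sub new_issue_up {
        my ($s) = @_;
        my $mod = (head($s->{url}))[2];          # LWP::Simple::head's mod-time field
        return 0 unless $mod and $mod != $s->{last_mod};
        $s->{last_mod} = $mod;
        $s->{expected} = $mod + 24*60*60;        # recentre the schedule on this hour
        return 1;
    }

    # take a quick peek an hour before the issue is expected...
    if (time() >= $site{expected} - 60*60) {
        if (new_issue_up(\%site)) {
            print "new issue up at $site{url}\n";
        } else {
            $site{expected} = time() + 3*60*60;  # ...late? check back in a few hours
        }
    }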

