InfoScraper
Tools and techniques to extract information from web pages and newsletters



 

"Data! data! data!" he cried impatiently. "I can't make bricks without clay."
— Sherlock Holmes to Dr. Watson in "The Adventure of the Copper Beeches" by Arthur Conan Doyle. 


"I like deadlines," cartoonist Scott Adams once said. "I especially like the whooshing sound they make as they fly by."
"There is nothing like that feeling of spending days and days banging your head against a wall trying to solve a programming problem then suddenly finding that one tiny obscure and seemingly unrelated piece of the puzzle that unlocks the solution. Oh yeah!"

- Chris Maunder, CodeProject Newsletter 28 Jan 2002
"Management at eSnipe, which is me, is also feeling the pain of the 2002 bear market. So rather than pout about it, I bought some stuff on eBay that I really didn’t need, but made me feel better."

- Tom Campbell, president of eSnipe

 



 

 
 Friday, October 17, 2003
  10:23:33 AM  Generate an RSS file from Exchange. Scott Hanselman, fellow RD and a .NET guru, shares the source to a very cool idea.  The application pulls items and mail out of Outlook or Exchange Public Folders and creates an RSS (rich site summary) XML feed that can be aggregated by any number of Blog/RSS Aggregators.  You can find the ZIP at the RD Code Center.  There are some other cool .NET examples there as well.
[Jon Box's Weblog]
  9:59:25 AM  RSS2Mobile is an RSS-to-WML service. Coool! [Scripting News]
  9:54:19 AM  RSS via Mobile Phone. "If your website is available in RSS format, you can easily make [it] available to mobile phones, PDA's, and WAP-enabled pagers using this tool. Simply enter the URL of your RSS document (0.91, 0.92, 1.0, or 2.0) below to create a .ZIP file containing a WML (Wireless Markup Language) version of the data."... [Lockergnome's RSS Resource]

 
 Thursday, October 09, 2003
  9:24:54 AM  Automatically generate CSS [Lockergnome Windows Daily]
 
 Thursday, October 02, 2003
  7:55:25 AM  Blogging: Design Your Own Weblog Application from Scratch Using ASP.NET, JavaScript, and OLE DB. In this article the author builds a full-featured blog application to illustrate the use of the Repeater and DataList controls that render nested data in a master-detail relationship. [MSDN Just Published]
 
 Monday, September 22, 2003
  3:23:21 PM  CSS is a beautiful thing. The Zengarden site demonstrates "what can be accomplished visually through CSS-based design."
  3:19:29 PM  XML Schema Regular Expressions. While the built-in datatypes and constraining facets are a great start, they are often insufficient, especially for string values. Regular expressions provide a powerful mechanism for restricting data values in XML.
 
 Sunday, September 21, 2003
  8:11:41 PM  Coding4Fun: Developing Priorities: Fun First. Duncan Mackenzie shows how the coolness of a feature can increase its chance of being finished early in a development project, as he creates an application to retrieve and display an RSS feed his way. [MSDN Just Published]
 
 Saturday, September 20, 2003
  7:29:20 PM  Extreme XML: Revamping the RSS Bandit Application. Dare Obasanjo revisits his RSS Bandit C# application and improves on its previous design by using various XML features of the .NET Framework to build a rich .NET client application. [MSDN Just Published]
 
 Sunday, September 07, 2003
  2:59:48 PM  Gary Burd explains how Amazon's RSS feeds work. [Scripting News 7/26/2003]
 
 Saturday, September 06, 2003
  4:38:51 PM  FreeMind: an incredible thought organization tool. I've been using an open-source tool called FreeMind for the past couple of days. This is hands-down the most incredible brainstorming / thought organization tool that I've ever seen. FreeMind is essentially a fancy XML editor. It lets you create single-rooted recursive hierarchies of information. But the presentation and editing is so powerful and intuitive that it pretty much eliminates any friction involved in restructuring your XML document. You also have the ability to link to external URLs and external files. The way I use FreeMind is to initially capture a mind-dump of ideas relating to a single topic. Then I start noticing clusters of ideas that relate to one another and I use FreeMind to reorganize those ideas into their natural order. See for yourself. Surf over to the FreeMind homepage on SourceForge. Make sure that you check out the screenshots of the program. It was worth reinstalling the Java runtime on my computer just to run this application. [iunknown.com 7/28/2003]
 
 Monday, August 11, 2003
  2:45:18 PM  MSDN: Creating an RSS News Aggregator with ASP.NET.
 
 Sunday, May 25, 2003
  8:56:03 PM  Build inexpensive portals using open source Slash. Consultants who are comfortable with Perl, Linux, MySQL, and Apache may find opportunities to set up portals for their clients using Slash, an open source platform that runs the discussion site Slashdot. Learn how it can work for you and your clients.
 
 Wednesday, May 07, 2003
  6:30:39 PM  Ripping Data on the Web - How to recover and repackage information on the World Wide Web
  7:13:50 AM  YARR Yet Another RSS Reader. While preparing for our one day pre-conference tutorial at VS Connections, I decided to build YARR (Yet Another RSS Reader) as a sample application. I felt it was a particularly interesting sample application since it represented the intersection of four different technology spheres: XML, [D]HTML, SQL, and C#. I presented it to our students in the class in about an hour or so. I also presented it as an application that is ripe for code generation since I feel that code generation lets you efficiently capture cross-technology abstractions. I'm placing a drop of the source code for YARR here, for anyone who may be interested in taking a look. To install / setup the code, you need to run the SQL script from solutionsql in the ZIP file. I have a nice little command-line batch file that will run the osql command line utility for you. Just modify the script to point to your SQL server, and optionally change the login that you want to use to create the database schema. The only other bit of tweaking you might want to do is to copy the included default.htm file to the c:\temp directory (or modify the sources to read the default HTML page that is displayed in IE from somewhere else). Comments / feedback would be welcome. Next up tomorrow is to rewrite this sample app using code generation. [IUnknown.com: John Lam's Weblog on Software Development]
 
 Tuesday, April 29, 2003
  7:43:51 AM  Sam R on RSS. Sam R has a lot to say today about RSS: RSS Namespace Proposal / RSS: it's not just for syndication anymore / Ghosts of RSS past / Future of RSS    RSS is becoming a very important part of my day. I agree with Sam that there are a lot of other places RSS can/could/should/will be used. I am 100% in favor of anything that will enable RSS to continue to grow. [ScottW's ASP.NET WebLog]
 
 Friday, April 18, 2003
  7:16:20 AM  OneNote FAQ. A pretty good OneNote FAQ here. [Sean 'Early' Campbell & Scott 'Adopter' Swigart's Radio Weblog]
 
 Monday, April 14, 2003
  8:27:44 PM  SgmlReader (was HtmlReader). Wow, this is even better!! [Don Box's Spoutlet]
  8:27:19 PM  Chris Hollander's comments on SharpReader. All you need is love [Don Box's Spoutlet]
  8:16:02 AM  Automated vanity-googling. Googlert is a new Google API tool (you need to supply a key) that emails you regularly with changes in the first 100 results in a Google search for a term you supply. It's automated vanity-search. (via Megnut) [Boing Boing Blog]
  8:15:03 AM  RSS Heck (not quite Hell). Conformance issues from around the globe [Don Box's Spoutlet]
  8:12:31 AM  Wildgrape NewsDesk is a "simple and fast RSS reader for Microsoft .Net." [Scripting News]
 
 Thursday, April 10, 2003
  7:58:18 AM  Validate With Regular Expressions
Learn how to use the most versatile tool at your disposal for checking user input against a variety of possible formats. Leverage Regular Expressions  /  Analyze Source Code With Regular Expressions  /  Use Declarative Field Validation
 
 Tuesday, April 08, 2003
  9:02:25 AM  How to design a suit-proof P2Pnet. My colleague Fred von Lohmann has revised and condensed his seminal white-paper in which he explains to P2P developers what the law actually says about P2P systems and how to design your technology to minimize your chances of getting (successfully) sued. [Boing Boing Blog]
  8:56:14 AM  Bloglet offers an "email subscription service for your blog." [Scripting News]
  8:49:46 AM  Searching Google Using the Google Web Service ( 03/05/03 ) [4GuysFromRolla.com]
  8:48:32 AM  Office Talk: Build Your Own Research Library with Office 2003 and the Google Web Service API. Chris Kunicki shows you how to build a research library, which is a built-in tool that allows users to access various information sources from within Office. [MSDN Just Published]
 
 Friday, April 04, 2003
  9:36:24 PM  Nice article on Office 2003, Research, and Google API. Only in C# of course because it's new technology :-) [Sean 'Early' Campbell & Scott 'Adopter' Swigart's Radio Weblog]
 
 Friday, March 21, 2003
  9:07:55 AM  George Tsiokos did a chart comparing versions of RSS. [Scripting News]
 
 Wednesday, March 05, 2003
  9:07:19 PM  Chris Pirillo sends a pointer to this MSN article that explains how to build a desktop news aggregator. And they say Microsoft isn't paying attention. [Scripting News]
  8:58:23 PM  Four Models for Aggregating and Publishing RSS Headlines. The State of Utah is reviewing options for creating, aggregating, and publishing news from state agencies. The decision of which technology to use to create RSS feeds can be made independently of the decision regarding a technology for aggregating and publishing (parsing) the feeds. I'll address the latter first and write about the creation/CMS end tomorrow. There seem to me to be at least four models for aggregating and publishing RSS headlines. This lengthy article describes these four models with examples of each. [RSS in Government]
  10:48:31 AM  Marc Barrot: Outlined RSS Comes to the Browser. [Scripting News]
  10:46:24 AM  Jon Udell lists ten things we should know about Microsoft's InfoPath. Here's what it looks like. [Scripting News]
  10:45:39 AM  Jon Udell: "NewsGator is a fabulous hack." [Scripting News]
  10:45:01 AM  Mary Jo: Microsoft Tests the Blogging-Tool Waters. [Scripting News]
  10:44:17 AM  Marc Barrot is rendering RSS in an outline in a browser. [Scripting News]
 
 Tuesday, March 04, 2003
  7:59:26 AM  RSS... Oops! What?. (For RSS weenies only). Sam Ruby thoughtfully pointed me at the RSS Validator, which whined at me that my RSS was broken, which it was, so I fixed it, so if your feed reader is showing everything here unread that may be why. Except for I'm resisting one change that the validator wants.... [ongoing]
 
 Thursday, January 23, 2003
  7:08:52 AM  Mark Pilgrim: Parsing RSS At All Costs. [Scripting News]
 
 Thursday, January 16, 2003
  9:01:50 PM  Syndirella is an RSS aggregator for .Net. [Scripting News]
 
 Friday, January 10, 2003
  12:56:46 PM  

Posted on www.alphaAve.com, the Circus-DTE (Data Transformation Environment) programming language is available for testing. Circus-DTE is intended for environments in which document portals abound and documents and data must move on the Web or in business processes, according to Xerox. The language is intended to provide a middle ground between a general-purpose, low-level language that needed lengthy development of complex algorithms and a high-level, but inflexible, approach to build applications that translate documents and data among different formats so that they can be read by any application or on any device.

  12:42:06 PM  Build Inexpensive Portals Using Open Source Slash. Consultants who are comfortable with Perl, Linux, MySQL, and Apache may find opportunities to set up portals for their clients using Slash, an open source platform that runs the discussion site Slashdot. See how it can work for you and your clients. [TechRepublic - 16 Dec 02]
  8:38:03 AM  Wiki Eases The Burden Of Creating Documentation. WikiWikiWeb is an authoring tool that provides an easy, collaborative way to create browser-based, organic documentation. A Wiki may be the answer to your documentation woes. [Builder.com - 12 Dec 02] [Eric's incoming newsletters]
 
 Thursday, January 09, 2003
  12:41:48 PM  Sumod and Dejan Jelovic have RSS aggregators for .Net. [Scripting News]
 
 Monday, December 23, 2002
  2:04:00 PM  I have some time to spare this morning, and thought of an interesting thing to do. I'm going to figure out which Creative Commons license should apply to the module I designed last week, and then, following Denise Howell's advice (she's a lawyer) apply it. The first thing I did was run the CC license chooser, it suggested the attribution license. My intent is to let people do anything they want with my module, change it, enhance it, commercialize it, but I want credit for originating it. The next step is to get a bit of HTML code to put on the page. The CC site supplies this code here. I added it to a section at the end of the module. Comments, questions and suggestions are welcome. [Scripting News]
 
 Sunday, December 22, 2002
  1:30:57 PM  Washington Post: "Since many bloggers have no background in publishing, they often come to the medium unaware of the rules that apply." [Scripting News]
 
 Monday, December 02, 2002
  3:48:36 PM  

Using VBScript to run a program

VBScript can be used instead of a batch file to perform setup and cleanup operations before and after running another program.

For games, we might need to:

  • load the CD image
  • change controller preferences
  • change screen resolution

The VBScript syntax is:
     object.Run(strCommand, [intWindowStyle], [bWaitOnReturn{=false}])

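Dim oShell
' Create the Windows Script Host shell object
Set oShell = WScript.CreateObject("WScript.Shell")
' Open a command prompt, change to C:\ and list it (/K keeps the window open)
oShell.Run "cmd /K CD C:\ & Dir"
Set oShell = Nothing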

For Paragon CD-ROM Emulator, the command line would be:
    Cdman /command:e'-"T",i-"T=E:cdthe image.cdi"

This "ejects" any CD already in drive T (the ' in e' is an optional apostrophe used to tell it to continue after any errors - like no CD mounted).

 

 
 Thursday, November 28, 2002
  9:59:33 PM  January 2002 | Sue Mosher | Outlook Tips and Techniques | InstantDoc #23147

Outlook Tips--Displaying Multiple Folders Within a View

Can I show more than one folder within a view—such as the Day/Week/Month view that combines Calendar and Tasks, only with different folders? —Brud Rossmann

Aside from the built-in Calendar + Tasks view and the Outlook Today page, Outlook out of the box provides no views that combine data from multiple folders. The solution is to use the Outlook View Control (OVC) in a folder home page. The OVC is an ActiveX control that displays a specific Outlook page. Folder home pages are simply Web pages, and they can host multiple copies of the OVC, each displaying a different folder. You can add the OVC to a Web page, just as you would any other control, and set the necessary properties.

The original version of the OVC had a security vulnerability. For Outlook 2002, visit the Microsoft Office Download Center and download and install the latest update for Outlook 2002. Any update after August 16, 2001, has the more secure OVC.

After you use the OVC to create a Web page, make it the home page for an Outlook folder by bringing up the folder's Properties dialog box and entering the path to the Web page on the Home Page tab.

For more information about the OVC, including sample code, see the Microsoft article "OL2000: General Information About the Outlook View Control"

 
 Thursday, November 21, 2002
  10:33:24 AM  Public development for the enterprise?. Public development sites such as SourceForge.net and Microsoft's forthcoming GotDotNet Workspaces provide many tools for collaborative development. But Larry Seltzer has a hard time imagining enterprises using them. [ZDNet Tech Update Today - 21 Nov 02]
 
 Tuesday, November 19, 2002
  9:15:53 PM  Footbridge is "a lightweight tool to mirror Radio categories to Advogato, LiveJournal, and Blogger API sites." [Scripting News]
  9:07:14 AM  Take A Manageable Approach To Reading Html Page Data. Analyzing a Web site for valid data and errors can be time-consuming, but HTML scraping streamlines the process. Here are some tips and downloads to help you extract data easily from HTML pages. [Builder.com - 19 Nov 02]
  9:02:11 AM  Outlook 2002: Save Your Custom Apps With Redemption. Have Outlook 2002's new security features rendered your custom applications useless? Find out how you can bypass these security measures and breathe new life into your custom applications. [TechRepublic - 18 Nov 02]
 
 Tuesday, November 05, 2002
  4:45:54 PM  

XML News Feeds and weblogs

RealWorldASP Newsletter 10/24/2002 - http://www.chrisg.org is my new weblog where I, um, talk about stuff. If you want the raw news without my spin on it, check out http://chrisg.com/rss where I aggregate several XML news feeds for general consumption :O)

Talking about XML News Feeds and weblogs, if you want to display news from an RSS/RDF XML news source then check out my ASP.NET XML News control. This control is how I created those news pages above really quickly. Just drag the control onto your project and set the url of the feed and you are away. All vb.net source code is included for both normal asp.net projects and IBuySpy portals!

Another control written by little ol' me is my ASP.NET picture gallery! Again, all source is included for both IBuySpy portals and normal asp.net projects. This control automatically displays pictures as thumbnails allowing you to click to get the full view in a new window. See it in action at http://www.amyg.co.uk. Both controls were written in VB.NET but work in C# projects.

ASP Tutorial http://www.aspalliance.com/chrisg/default.asp?article=83

-- Chris Garrett :OD
-- http://www.realworldasp.net/

 
 Thursday, October 31, 2002
  9:59:17 AM  Working in a New Way. For the past few days an ardent discussion on the OSAF design mailing list has revealed there is no consensus... [Mitch Kapor's Weblog]
 
 Wednesday, October 30, 2002
  9:45:01 AM  Phil Hewitt: Blogger XML-RPC Tools in VB. [Scripting News]
 
 Tuesday, October 29, 2002
  10:30:06 PM  Jon Udell: "The Xopus demo is, indeed, an eye-opener." [Scripting News] The Xopus demo is, indeed, an eye-opener. Runs in the browser, without plug-in support, toggling between WYSIWYG and XML modes, enforces schema, has multilingual support both in the UI and the document. Includes a competent table editor. The developers of this open-source project have even built a prototype of the MSIE ContentEditable feature for Mozilla, in advance of official support in Mozilla for that feature. Impressive!
 
 Saturday, October 26, 2002
  8:12:58 AM  

Radio Userland enhancements

News Aggregator

  • Update Now button
  • Check all buttons in this channel (to left of globe)
  • Ability to tag items by category
  • Ability to edit items without posting first (philosophy?)
  • Sort by channel

Categories

  • Edit items without complex dance to home page and back
  • Edit-in-place
  • Combine / copy / delete
  • Edit date/time

Hosting

  • Basic security on a directory level

Macros

  • Root-relative link (127.0.0.1 and/or radio.weblogs.com//)
  • Better support for style sheets

Editing

  • HtmlTidy
 
 Thursday, October 24, 2002
  9:05:22 PM  

Radio UserLand tools and stuff

  3:41:36 PM  

Newsletter-to-RSS update

The Outlook email security update (see Q262700) means you get a popup dialog box every time a script tries to access the email address of a message. We need to know the address to handle email parsing without needing a separate rule for each and every source.

various approaches

OL2002: How to Create a Script for the Rules Wizard

Finally decided to use a rule to Forward the message back to myself (a special pseudo-account). When Outlook forwards a message, it adds the Original Message header:

-----Original Message-----
From: SearchWin2000.com [mailto:searchWin2000-C3C8D20E08C0B7DE@lists.techtarget.com]
Sent: October 18, 2002 11:03 AM
To: SearchWin2000.com
Subject: Tips from our experts, Oct. 18, 2002

Now, the news-to-rss rule can parse the original message header without using programming tips to access it.
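As a rough illustration of parsing that forwarded header, here is a minimal Python sketch. It assumes the header always has the four fields in the order shown above; the function name `parse_forwarded_header` and the group names are made up for this example and are not part of the original rule.

```python
import re

# Matches the header block Outlook adds when forwarding a message.
# Field names follow the sample above; other Outlook versions or
# locales may label them differently (assumption).
HEADER_RE = re.compile(
    r"-----Original Message-----\s*\n"
    r"From:\s*(?P<name>.*?)\s*\[mailto:(?P<address>[^\]]+)\]\s*\n"
    r"Sent:\s*(?P<sent>.*?)\s*\n"
    r"To:\s*(?P<to>.*?)\s*\n"
    r"Subject:\s*(?P<subject>.*?)\s*(?:\n|$)"
)


def parse_forwarded_header(body):
    """Return the original sender, date, and subject, or None if absent."""
    match = HEADER_RE.search(body)
    return match.groupdict() if match else None


sample = """-----Original Message-----
From: SearchWin2000.com [mailto:searchWin2000-C3C8D20E08C0B7DE@lists.techtarget.com]
Sent: October 18, 2002 11:03 AM
To: SearchWin2000.com
Subject: Tips from our experts, Oct. 18, 2002

Newsletter text starts here.
"""

info = parse_forwarded_header(sample)
print(info["address"])
print(info["subject"])
```

The sender address is then available for choosing the parsing rules without touching the guarded Outlook address properties.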

  3:13:17 PM  Dave's Quick Search Taskbar Toolbar. I've added a couple of searches to Dave's Quick Search Taskbar Toolbar. One is for searching the ARIN whois database and another for searching place names at the Getty Thesaurus of Geographic Names. The searches are provided here in the form of an XML file that you just drop in the searches directory. The ARIN XML file is here. The Getty XML file is here. If you are unfamiliar with Dave's Quick Search Taskbar Toolbar, it's an absolute must-have utility on Windows. [TechnoMagician's Weblog]
  3:13:06 PM  

Finding More Channels - a great page on finding RSS feeds from Morbus Iff. It’s geared toward AmphetaDesk users, but it’s useful for users of any RSS reader.

 
 Wednesday, October 23, 2002
  1:36:06 PM  

NewsScraper: check for links of the form xxxx.</link>, where the trailing "." is the period that ended the sentence rather than part of the URL.
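One possible way to do that check, sketched in Python. It assumes that a period sitting directly before </link> is never a legitimate part of the URL, which will be wrong for the occasional address; `fix_link_periods` is a hypothetical helper name.

```python
import re

# A "." immediately before </link> is assumed to be sentence punctuation
# that was swept into the URL, not part of the address itself.
TRAILING_DOT_RE = re.compile(r"\.+(?=\s*</link>)")


def fix_link_periods(rss_text):
    """Remove periods that sit directly before a closing </link> tag."""
    return TRAILING_DOT_RE.sub("", rss_text)


item = "<link>http://www.example.com/story.html.</link>"
print(fix_link_periods(item))
```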

 
 Monday, October 21, 2002
  12:39:18 PM  Book Excerpt: Essential Blogging, Pt. 3. The conclusion of our series of excerpts from this O'Reilly title reflects on some of the advanced features and technologies available to Radio UserLand bloggers; including RSS syndication, XML-RPC, and Upstreaming. 1021 [WebReference News]
 
 Thursday, October 17, 2002
  9:27:17 AM  Creating XML Documents with the DOM in VB6. Sure, .NET's got great support for XML, but what's a VB6 programmer to do? Learn to work with MSXML2's DOM parser, that's what. Lamont Adams helps you weed through the myriad COM classes. [Builder.com newsletters]
 
 Wednesday, October 16, 2002
  10:54:01 AM  

Email translator update

Based on a few weeks' experience, I've come to realize that while email-newsletter-to-RSS is a fantastic idea, a more-or-less-direct regular expression approach is too cumbersome and fragile.

It's better to build a generic filtering and parsing system, and use custom regular expressions to tag the content types for each newsletter.

  • item (title + text + link)
  • list (multiple elements, each possibly with its own link)
  • story (multi-line text block, may have no link or more than one link)
  • ignore
  • end-of-item
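A minimal Python sketch of the tagging idea, under the assumption that each newsletter gets an ordered list of (content type, pattern) rules and that anything left untagged is treated as story text. The patterns themselves are invented for illustration and would differ for every newsletter.

```python
import re

# Hypothetical tagging rules for one newsletter: (content type, pattern).
# The first pattern that matches a line decides its type.
RULES = [
    ("ignore", re.compile(r"unsubscribe|advertisement", re.IGNORECASE)),
    ("end-of-item", re.compile(r"^[-=*]{5,}\s*$")),
    ("list", re.compile(r"^\s*[*-]\s+\S")),
    ("item", re.compile(r"https?://\S+")),
]


def tag_line(line):
    """Return the content type for a line; untagged text is 'story'."""
    for content_type, pattern in RULES:
        if pattern.search(line):
            return content_type
    return "story"


lines = [
    "New XML tools released http://www.example.com/xml",
    "* First tip",
    "This week we look at regular expressions in depth.",
    "To unsubscribe, reply to this message.",
    "----------",
]
tags = [tag_line(line) for line in lines]
print(tags)
```

Keeping the generic filter separate from the per-newsletter patterns means a layout change in one newsletter only touches its own rule list.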

Another problem is that I ended up with a bazillion separate rules and RSS files. It would be much cleaner to have a single rule that looks up the parsing info from a table based on the particular email source.

The RSS specification (1.x, 2) only allows one channel per file.

Finally, it would be convenient to post the results directly to Radio Userland.

 
 Tuesday, October 15, 2002
  7:43:23 AM  Book Excerpt: Essential Blogging, Pt. 2. Just because you're using commercial software doesn't mean your blog must look like everyone else's. Part 2 of this excerpt series shows you how to customize your Radio UserLand blog using themes, templates, and macros. From O'Reilly. 1014 [WebReference News]
 
 Saturday, October 12, 2002
  2:29:02 PM  Little-known Radio feature. The RSS Hotlist, shows you the Top 100 most-subscribed-to feeds. Each entry has a checkbox, it's checked if you're already subscribed, not checked if not. You can click on the boxes to subscribe or unsubscribe. [Scripting News]
 
 Wednesday, October 09, 2002
  3:16:49 PM  Dave's Quick Search Taskbar Toolbar. I've added a couple of searches to Dave's Quick Search Taskbar Toolbar. One is for searching the ARIN whois database and another for searching place names at the Getty Thesaurus of Geographic Names.
The searches are provided here in the form of an XML file that you just drop in the searches directory. The ARIN XML file is here. The Getty XML file is here.
If you are unfamiliar with Dave's Quick Search Taskbar Toolbar, it's an absolute must-have utility on Windows. [TechnoMagician's Weblog]
 
 September 28, 2002
  5:28:36 PM  I started a directory of RSS resources. [Scripting News]
 
 September 26, 2002
  11:22:03 AM  

InfoScraper design

Frustrated by information overload, but encouraged by the aggregation and filtering capabilities of RSS feeds, I've been looking for a tool to convert existing newsletters and other sources into tidy RSS. I expect that pretty soon most newsletters will offer RSS format in addition to Text and HTML, but I don't want to wait.

I tried a number of existing tools, but none of them do everything I want - especially the ability to convert both text and HTML email.

I decided to use regular expressions instead of XSLT because many source pages use poorly formed HTML, so an automatic conversion to XHTML may make things worse. Not to mention that some sources are plain text.

first exercise: comics

Assorted observations:

  • one level of items is not enough - need to have at least 2, conditionally
  • need to strip some HTML formatting (keep basic <H>, <B>, <I> but not Word markup)
  • need to drop some items automatically (Your Feedback is Important)
  • need to combine onChange and filter scripts
  • want to add DHTML hide/show script to limit initially visible length of posts
  • Generic scraper should work with web pages and email (HTML and text)
  • Output should be in some kind of RSS feed format
  • Should be able to run from Radio Userland, but should not require Radio (prefer generic data->XML tool)
  • It's better to use RegEx instead of XSLT because:
    • many source pages use poorly formed HTML, so an automatic conversion to XHTML may make things worse
    • some sources are plain text
  • Patterns should be nestable and/or sequential (and/or)
    • It should be easy to make multiple passes to extract information from different parts of a document
  • Matched text to be included or excluded from extracted info
  • Options to strip styles, tags (or maybe specify tags to retain)
  • Syntax could follow XSLT ...
  • The whole thing needs to be table-driven, starting with the feed identifier, the RSS header info, and the collection of patterns for the items.
  • Naturally, the specification table will be XML. This means we can use an XML parser to search and process the table.
  • For email, the channel can be determined from the "From" and "Subject" fields, and the <pubDate> from the "Sent" field.
  • It might be useful to specify a pattern for items to be ignored.
  • it would be useful to have a way to highlight special keywords, and/or items containing keywords
  • Search for start pattern
  • Search for end pattern
  • Extract body (start-end, inclusive or exclusive).
    If the patterns are included in the body, then this step is a simple regular expression: {start}.*{end}
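The start/end extraction step might look like this in Python. This is a sketch, not the scraper's actual code: `extract_body` is a hypothetical name, and it uses a non-greedy `.*?` with `re.DOTALL` instead of the plain `.*` above, so that the match stops at the first end pattern and can span several lines.

```python
import re


def extract_body(text, start, end, inclusive=True):
    """Return the text between the first `start` match and the next `end` match.

    `start` and `end` are regular expression strings. With inclusive=True
    the matched delimiters are kept, otherwise they are dropped.
    """
    # Non-greedy ".*?" stops at the first end pattern; re.DOTALL lets the
    # body span several lines.
    pattern = re.compile("(?:%s)(?P<body>.*?)(?:%s)" % (start, end), re.DOTALL)
    match = pattern.search(text)
    if match is None:
        return None
    return match.group(0) if inclusive else match.group("body")


sample = "<html><!-- begin story -->Line one\nLine two<!-- end story --></html>"
print(extract_body(sample, r"<!-- begin story -->", r"<!-- end story -->", inclusive=False))
```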
 
 September 23, 2002
  10:35:17 AM  

Scraper continued

  • It might be useful to specify a pattern for items to be ignored.
  • it would be useful to have a way to highlight special keywords, and/or items containing keywords
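Both ideas can be sketched together in a few lines of Python. The ignore pattern, the keyword list, and the choice of <b> tags for highlighting are all assumptions made for this example.

```python
import re

# Hypothetical settings; in practice they would come from the per-feed table.
IGNORE_RE = re.compile(r"your feedback is important", re.IGNORECASE)
KEYWORDS = ["RSS", "XML"]
KEYWORD_RE = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, KEYWORDS)))


def filter_items(items):
    """Drop ignored items and wrap keywords in <b> tags in the rest."""
    kept = []
    for item in items:
        if IGNORE_RE.search(item):
            continue
        kept.append(KEYWORD_RE.sub(r"<b>\1</b>", item))
    return kept


items = [
    "Your Feedback is Important - take our survey",
    "New RSS aggregator for .NET",
    "Office 2003 ships",
]
print(filter_items(items))
```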

Eric's InfoDabble News

 
 September 22, 2002
  8:41:53 PM  

Scraper thoughts

Reference: What is RSS?

The whole thing needs to be table-driven, starting with the feed identifier, the RSS header info, and the collection of patterns for the items.

Naturally, the specification table will be XML. This means we can use an XML parser to search and process the table.

  • For email, the channel can be determined from the "From" and "Subject" fields, and the <pubDate> from the "Sent" field.
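To make the table-driven idea concrete, here is a small Python sketch. The XML layout (element and attribute names such as `feed`, `match`, and `pattern`) is purely hypothetical, as are the helper names; the "Sent" format is assumed to match the Outlook header style "October 18, 2002 11:03 AM", with no time zone information.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime

# Hypothetical specification table: element and attribute names are
# illustrative only.
SPEC_XML = """
<feeds>
  <feed id="searchwin2000">
    <match from="lists.techtarget.com" subject="Tips from our experts"/>
    <channel>
      <title>SearchWin2000 Tips</title>
      <link>http://www.example.com/searchwin2000</link>
    </channel>
    <pattern type="item">^\\* (.+)$</pattern>
  </feed>
</feeds>
"""


def find_feed(spec_root, sender, subject):
    """Return the <feed> element whose <match> fits the email, else None."""
    for feed in spec_root.findall("feed"):
        rule = feed.find("match")
        if rule.get("from") in sender and rule.get("subject") in subject:
            return feed
    return None


def sent_to_pubdate(sent):
    """Convert the "Sent" text to an RFC 822 style date for <pubDate>."""
    # The header carries no time zone, so none is assumed here.
    parsed = datetime.strptime(sent, "%B %d, %Y %I:%M %p")
    return format_datetime(parsed)


root = ET.fromstring(SPEC_XML)
feed = find_feed(
    root,
    "searchWin2000-C3C8D20E08C0B7DE@lists.techtarget.com",
    "Tips from our experts, Oct. 18, 2002",
)
print(feed.get("id"), feed.findtext("channel/title"))
print(sent_to_pubdate("October 18, 2002 11:03 AM"))
```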
 
 September 21, 2002
  8:11:45 PM  

More scraping thoughts ...

  • Generic scraper should work with web pages and email (HTML and text)
  • Output should be in some kind of RSS feed format
  • Should be able to run from Radio Userland, but should not require Radio (prefer generic data->XML tool)
  • It's better to use RegEx instead of XSLT because:
    • many source pages use poorly formed HTML, so an automatic conversion to XHTML may make things worse
    • some sources are plain text
  • Patterns should be nestable and/or sequential (and/or)
    • It should be easy to make multiple passes to extract information from different parts of a document
  • Matched text to be included or excluded from extracted info
  • Options to strip styles, tags (or maybe specify tags to retain)
  • Syntax could follow XSLT ...
 
 September 16, 2002
  8:44:12 AM  

Daily comics

This project locates today's comics on their web pages, and builds a table with the comics all in one place.

Doonesbury: from the web page, look for <a href="http://www.ucomics.com/cgi-bin/shopping/buycomic.cgi and extract all text up to </a>. This gives today's comic with a hyperlink to the order form.
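The Doonesbury rule can be written as a single regular expression. The Python sketch below runs it against a made-up page fragment; the real page markup is not reproduced here and may well have changed, and `extract_comic` is a hypothetical name.

```python
import re

# Start pattern quoted from the note above; everything up to the closing
# </a> is captured. The page markup is assumed, not verified.
COMIC_RE = re.compile(
    r'<a href="http://www\.ucomics\.com/cgi-bin/shopping/buycomic\.cgi.*?</a>',
    re.DOTALL | re.IGNORECASE,
)


def extract_comic(html):
    """Return the linked comic markup, or None if the pattern is missing."""
    match = COMIC_RE.search(html)
    return match.group(0) if match else None


# Made-up page fragment for illustration only.
page = (
    '<td><a href="http://www.ucomics.com/cgi-bin/shopping/buycomic.cgi?id=1">'
    '<img src="/comics/db/today.gif"></a></td>'
)
print(extract_comic(page))
```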

Normally, a web page has no way to directly load another. This is usually done with a COM component.

 

How to make a HTTP connection in VBS
Newsgroups: microsoft.public.inetsdk.programming.scripting.vbscript, microsoft.public.scripting.vbscript
From: Johnny Xia (johnny_xia@wistron.com.cn) Date: 2001-08-20 04:32:17 PST
Is there any component which can make a HTTP request in VBS? I don't need any UI, just want to GET/POST a URL.
 
From: Adrian Forbes (noemail@noemail.xxx) Date: 2001-08-20 06:55:27 PST
set obj = CreateObject("Microsoft.XMLHTTP")
 
From: oxygen (oxygen@swbell.net) Date: 2001-08-27 08:57:16 PST
Yes there is...  You need to have an XML parser installed on the server. There are three that I know of: Microsoft's xmlhttp, ASPTear,
and ASPHTTP. I personally use the Microsoft version just to keep my server all Microsoft.(uniformity I guess)

Here is some code that I wrote to access a remote URL and grab the source code:
<%@ Language=VBScript%>
<%
  Response.Buffer = True
  Dim xml, strURL, strSource
  ' The URL must be a quoted string
  strURL = "http://www.someurl.com"
  ' Create an xmlhttp object:
  Set xml = Server.CreateObject("MSXML2.ServerXMLHTTP")
  ' Opens the connection to the remote server.
  xml.Open "GET", strURL, False
  ' Actually Sends the request and returns the data:
  xml.Send
  ' Move the source of what was returned into a string for later use.
  strSource = xml.responseText
  ' Be clean and clean up
  Set xml = Nothing
  Response.Write strSource
%>

And there you have it.

 


It's trivial in ASP.NET:

03/06/2002 [VB.NET Snippets] (c)Zidler 2002
How to read the content of an external website in a variable
This snippet explains how some search engines 'crawl' your website and cache it in their database.
'VB.Net
'These Imports belong at the top of the source file
Imports System.IO
Imports System.Net

Function readHtmlPage(url As String) As String

   Dim objResponse As WebResponse
   Dim objRequest As WebRequest
   Dim result As String

   objRequest = WebRequest.Create(url)
   objResponse = objRequest.GetResponse()
   Dim sr As New StreamReader(objResponse.GetResponseStream())
   result = sr.ReadToEnd()

   'clean up StreamReader and the response
   sr.Close()
   objResponse.Close()
   Return result

End Function
Source: Dotnet4all


 

 
 September 15, 2002
  9:02:40 AM  

RSS Feeds: Syndic8   NewsIsFree   Meerkat    RSS Info   myRSS  
RSS Info: RSS Tools

  6:55:02 AM  

Experiments with RSSDistiller

Links:  RssDistiller How To    customizing RssDistiller 

September 14, 2002: Initial experiments, using Harrow Technology Report as a source. Comments:

  • one level of items is not enough - need to have at least 2, conditionally
  • need to strip some HTML formatting (keep basic <H>, <B>, <I> but not Word markup)
  • need to drop some items automatically (Your Feedback is Important)
  • need to combine onChange and filter scripts
  • want to add DHTML hide/show script to limit initially visible length of posts
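For the formatting cleanup, a regex-based filter along these lines may be enough. It is a Python sketch that assumes an allowlist of heading, bold, and italic tags and simply discards every other tag, including Word's namespaced ones. It does not try to handle HTML comments or a ">" inside an attribute value.

```python
import re

# Tags to retain; everything else (including Word's <o:p>, <span>, <font>)
# is removed. The allowlist is an assumption based on the note above.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "b", "i"}
TAG_RE = re.compile(r"</?([A-Za-z][\w:-]*)[^>]*>")


def strip_tags(html):
    """Remove every tag not in KEEP, dropping attributes from kept tags."""
    def replace(match):
        name = match.group(1).lower()
        if name not in KEEP:
            return ""
        closing = "/" if match.group(0).startswith("</") else ""
        return "<%s%s>" % (closing, name)
    return TAG_RE.sub(replace, html)


word_html = (
    '<p class="MsoNormal"><b style="mso-bidi-font-weight:normal">Big</b>'
    " news<o:p></o:p></p>"
)
print(strip_tags(word_html))
```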

 

-----Original Message-----
From: ehartwell@exoware.com [mailto:ehartwell@exoware.com]
Sent: September 15, 2002 7:04 AM
To: paolo@evectors.it
Subject: Radio UserLand: Mail from Eric Hartwell

Eric Hartwell sent this email to you through the Radio UserLand community server, re this page - RssDistillerHowTo.

I've been experimenting with RSSDistiller, and I'm ready to create custom filters.

I followed "helpcustom" to add a custom distiller to RSSDistillerDataFile.Distillers, and it shows up in the list of filters on the Edit Feed page.

Problem: the "extract" scripts in RssDistillerSuite.Distillers are compiled code, so I can't read them. Where's the script source?

 
 September 10, 2002
  8:22:12 AM  

Mail-to-Blog:

More ...

system.verbs.apps.blogger.mailToBlog.Checkmail.script:
«Changes
 «8/14/01; 9:07:56 AM by DW
  «Runs in a separate thread, watching a mail account you specify. When a message appears, if its subject is the secret subject, post the contents of the message to your Blogger blog.
  «This script is derived from Radio's mail-to-weblog feature, myUserLandSuite.blog.checkMail.
  «http://frontier.userland.com/blogger#mailToBlog

blogger.init ();
if user.blogger.mailToBlog.enabled {
 msg ("blogger.mailToBlog.checkMail.script");
 try {
  local (msgtable, adrmsg);
  with user.blogger.mailToBlog {
   tcp.getmail (server, account, password, @msgtable, deleteMessages:true, flMessages: false);
   user.blogger.mailToBlog.stats.ctMailToBlogChecks++};
  for adrmsg in @msgtable {
   bundle { //process the text first, allow callbacks to change it
    local (s = adrmsg^.text);
    s = string.quotedPrintableDecode (s);
    s = string.replaceAll (s, "\n", "");
    s = string.trimWhitespace (s);
    adrmsg^.blogPostText = s};
   bundle { //run msg through callbacks
    try {
     if defined (user.blogger.callbacks.mailToBlog) {
      local (adrcallback);
      for adrcallback in @user.blogger.callbacks.mailToBlog {
       while typeOf (adrcallback^) == addressType {
        adrcallback = adrcallback^};
       try {adrcallback^ (adrmsg)}}}}};
   if adrmsg^.subject == string (user.blogger.mailToBlog.secretSubject) {
    blogger.newPost (adrmsg^.blogPostText);
    user.blogger.mailToBlog.stats.ctMailToBlogPosts++}}};
 msg ("")};
bundle { //schedule wakeup for next whole minute
 local (day, month, year, hour, minute, second);
 date.get (clock.now (), @day, @month, @year, @hour, @minute, @second);
 thread.sleepFor (60 - second)} //wake up once a minute, on the minute
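
As a reading aid, the core of the script above (decode quoted-printable, remove linefeeds, trim, post only when the subject matches the secret subject) might look roughly like this in Python. It is a sketch, not a port. The message layout (a dict with "subject" and "text"), the function names, and the Latin-1 encoding are my assumptions. Callbacks and the actual posting call are left out.

```python
import quopri


def prepare_post_text(raw_text):
    """Mirror the text cleanup steps of the script, in the same order."""
    # Assumption: the mail body fits in Latin-1.
    decoded = quopri.decodestring(raw_text.encode("latin-1")).decode("latin-1")
    # Like string.replaceAll (s, "\n", ""): drop linefeeds only,
    # so a CRLF line ending is reduced to a bare carriage return.
    decoded = decoded.replace("\n", "")
    # Like string.trimWhitespace: strip leading and trailing whitespace.
    return decoded.strip()


def select_posts(messages, secret_subject):
    """Return the cleaned text of each message whose subject matches."""
    posts = []
    for message in messages:
        if message["subject"] == secret_subject:
            posts.append(prepare_post_text(message["text"]))
    return posts
```

Messages with any other subject are simply ignored, which is what makes the "secret subject" act as a crude password.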

system.verbs.builtins.radio.weblog.checkMail:
«Changes
 «11/23/01; 3:09:57 PM by JES
  «Save the adrpost returned by radio.weblog.post, and pass it to radio.weblog.updatePagesForPost, to update all of the pages that the post appears in.
 «10/19/01; 4:16:56 PM by DW
  «Testing, debugging, fixing for 7.1.
  «string.trimwhitespace the posts. No need for trailing carriage returns on the posts.
  «Add pref to control how often mail is checked. Initialized in radio.weblog.init.
 «10/19/01; 4:16:42 PM by DW
  «Check out for Radio 7.1.
 «3/25/01; 9:53:05 AM by DW
  «Add callback support. Loop over all the scripts in myUserLandData.callbacks.blogCheckMail passing the address of each message. Ignore any returned value or script error.
  «Comment out assignment to scratchpad.emailText.
 «3/1/01; 5:53:14 PM by JES
  «Convert quoted printable text to ASCII. Untaint text before posting. Remove linefeeds.
 «2/24/01; 7:30:49 AM by DW
  «This is the My.UserLand.On.The.Blackberry feature that Scott Loftesness asked for.
  «If enabled, we check a mail address and see if there are any posts available for us.
  «The subject must match the "secret subject", the body contains the post.

local (adrblog = radio.weblog.init ());
if adrblog^.prefs.mailPosting.enabled {
 if not defined (system.temp.radio.misc.lastMailPostingCheck) {
  system.temp.radio.misc.lastMailPostingCheck = date (0)};
 if clock.now () > (system.temp.radio.misc.lastMailPostingCheck + adrblog^.prefs.mailPosting.ctSecondsBetweenChecks) {
  system.temp.radio.misc.lastMailPostingCheck = clock.now ();
  local (msgtable, adrmsg);
  with adrblog^.prefs.mailPosting {
   tcp.getmail (server, account, password, @msgtable, deleteMessages:true, flMessages: false);
   adrblog^.stats.ctMailToBlogChecks++};
  for adrmsg in @msgtable {
   bundle { //run msg through callbacks
    try {
     if defined (adrblog^.callbacks.weblogCheckMail) {
      local (adrcallback);
      for adrcallback in @adrblog^.callbacks.weblogCheckMail {
       while typeOf (adrcallback^) == addressType {
        adrcallback = adrcallback^};
       try {adrcallback^ (adrmsg)}}}}};
   if adrmsg^.subject == string (adrblog^.prefs.mailPosting.secretSubject) {
    local (s = adrmsg^.text);
    s = string.trimWhitespace (s);
    s = string.quotedPrintableDecode (s);
    s = string.replaceAll (s, "\r\n", "\r");
    s = radio.string.untaint (s);
    local (adrpost = radio.weblog.post (s, adrblog));
    radio.weblog.updatePagesForPost (adrpost);
    adrblog^.stats.ctMailToBlogPosts++}}}}
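
The main difference from the Blogger version is the throttle: instead of sleeping until the next minute, this script returns immediately unless enough seconds have passed since the last check. A minimal Python sketch of that check follows. The class name MailCheckThrottle and its method due are hypothetical. Starting last_check at 0 plays the role of date (0), so the first call always runs.

```python
import time


class MailCheckThrottle:
    """Allow a mail check only every seconds_between_checks seconds."""

    def __init__(self, seconds_between_checks):
        self.seconds_between_checks = seconds_between_checks
        self.last_check = 0.0  # like date (0): the first check is always due

    def due(self, now=None):
        """Return True and record the time if a check is due, else False."""
        now = time.time() if now is None else now
        if now > self.last_check + self.seconds_between_checks:
            self.last_check = now
            return True
        return False
```

The optional now argument is there only so the behaviour can be checked without waiting on the real clock.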

 
 September 9, 2002
  7:35:11 AM  

Newsletters

Instead of reading dozens of email newsletters, it would be a lot handier to consume RSS feeds. I'm experimenting with using myRSS for those that don't already have feeds.

  • ZDNet - AnchorDesk Daily - Tech Update Today - Tech Update Weekly
  • Netsurfer - Netsurfer Digest [newsisfree] - Netsurfer Robotics
  • ITBusiness.ca - Update [myRSS]
  • Sunbelt - W2KNews
  • Spaceflightnow - [myRSS]
  • Internet.com - WebDeveloper Original XML Source
  • ASPToday
  • eWeek


© Copyright 2003 Eric Hartwell.
Last update: 11/15/2003; 7:05:09 PM.
This theme is based on the SoundWaves (blue) Manila theme.