This is a mirror of official site: http://jasper-net.blogspot.com/

Crawling Websites with C# and XPath

Monday, January 30, 2012
Indexing every item of clothing on the internet for Clossit sounds like a difficult task, but C# and the HTML Agility Pack library make it manageable. Our crawler, nicknamed Monocle, is written using both.

HTML is messy and can be inconsistent between sites, but HTML Agility Pack copes with malformed markup and lets you query the result with XPath. In this example we will crawl the Wikipedia page for Superalloy. After adding the .dll as a reference, you can download and load a single page like this:

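using System.Net;
using HtmlAgilityPack;

string url = "http://en.wikipedia.org/wiki/Superalloy";
var wc = new WebClient();
// The Agility Pack class is named HtmlDocument (C# is case-sensitive)
var document = new HtmlDocument();
document.LoadHtml(wc.DownloadString(url));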

At this point the document object holds the HTML content and is ready to receive XPath queries. Let’s say we’d like to select the title of the page, which appears in the markup as:

<h1 id="firstHeading" class="firstHeading">Superalloy</h1>

What makes this node unique is the h1 tag with an id of “firstHeading”. We can select it with XPath:

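// XPath queries run against DocumentNode; SelectSingleNode returns null when nothing matches
string title = document.DocumentNode.SelectSingleNode("//h1[@id='firstHeading']").InnerText;
Console.WriteLine(title);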
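
Real pages do not always contain the node you expect, and SelectSingleNode returns null when the XPath matches nothing. The following is a minimal sketch of a null-safe version of the query, assuming the HtmlAgilityPack package is referenced. The class name TitleExtractor and the helper SelectText are hypothetical names used only for illustration, and the HTML is supplied as an inline string so the example runs without a network connection.

```csharp
using System;
using HtmlAgilityPack; // third-party library (assumed to be referenced)

public static class TitleExtractor
{
    // Hypothetical helper: returns the decoded text of the first node
    // matching the XPath expression, or null when nothing matches.
    public static string SelectText(string html, string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectSingleNode returns null if the query has no match.
        HtmlNode node = document.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }

        // InnerText keeps HTML entities such as &amp; so decode them.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static void Main()
    {
        // Inline sample markup, mirroring the heading shown above.
        string html =
            "<html><body>" +
            "<h1 id=\"firstHeading\" class=\"firstHeading\">Superalloy</h1>" +
            "</body></html>";

        string title = SelectText(html, "//h1[@id='firstHeading']");
        Console.WriteLine(title ?? "(no match)");

        // A query that matches nothing yields null instead of throwing.
        string missing = SelectText(html, "//h2[@id='doesNotExist']");
        Console.WriteLine(missing ?? "(no match)");
    }
}
```

Returning null rather than throwing lets a crawler skip pages whose layout differs from what the query expects and carry on with the next URL.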


Read more: Clossit