Post

The HTML Agility Pack

For a current project I needed to perform a simple screen scrape action. The resulting solution was functional but a bit rough and ready. Luckily I stumbled upon this open-source HTML library project: The HTML Agility Pack, hosted on CodePlex at http://htmlagilitypack.codeplex.com.

It is an excellent little library that makes dealing with HTML a breeze, whether you are screen scraping or just manipulating HTML documents locally. It is very forgivable with regards to malformed HTML documents and supports loading pages directly from the web. You can just parse the HTML or modify it, and it even supports LINQ.  A key benefit of this library is that it doesn’t force you to learn a new object model but instead mirrors the System.XML object model - a huge help for getting up and running quickly, as well as making coding it feel natural.

Download HTML directly via a URL:

1
2
3
HtmlDocument htmlDoc = new HtmlDocument();
HtmlWeb webGet = new HtmlWeb();
htmlDoc = webGet.Load(url);

Or parse an HTML string:

1
2
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlString);

Then you can use XPATH to query the HTML document as you would an XML document:           

1
2
3
4
5
6
// select a <li> where it has an element of <b> with a value of "Name:"
var nameItem = htmlDoc.DocumentNode.SelectSingleNode("//li[b='Name:']");
if (nameItem != null && nameItem.ChildNodes.Count > 1)
{
    name = nameItem.ChildNodes[1].InnerText;
}

You can download it via NuGet here : http://nuget.org/packages/HtmlAgilityPack.

For more examples of it’s use check out these posts: Parsing HTML Documents with the Html Agility Pack and Crawling a web sites with HtmlAgilityPack.

Enjoy.

This post is licensed under CC BY 4.0 by the author.