Monday, August 19, 2013

Create a custom web scraper using Html Agility Pack and XPath

Imagine you want to build a scraper for a website. The tools that come in handy are Html Agility Pack and a bit of knowledge of XPath. Html Agility Pack is a free HTML parser with very few dependencies, the main one being .NET's XPath implementation.
A typical implementation uses the following code to fetch the HTML response from a given URL:
using HtmlAgilityPack;

var url = "http://en.wikipedia.org/wiki/Australia";
var web = new HtmlWeb();
HtmlDocument responseHtmlDoc = web.Load(url);
// Now start interrogating the HtmlDocument using XPath expressions to get the data
var node = responseHtmlDoc.DocumentNode.SelectSingleNode("//div[@class='test']");
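
As a small illustration of the XPath step, here is a minimal sketch that reads the text of the first div with a given class. The NodeReader class, the ReadDivText method and the sample markup are made up for this example. SelectSingleNode returns null when nothing matches, so the result is checked before use.

```csharp
using HtmlAgilityPack;

internal static class NodeReader
{
    // Hypothetical helper: returns the trimmed text of the first <div> whose
    // class attribute is exactly cssClass, or null when no such div exists.
    public static string ReadDivText(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var node = doc.DocumentNode.SelectSingleNode("//div[@class='" + cssClass + "']");
        return node == null ? null : node.InnerText.Trim();
    }
}
```

For example, NodeReader.ReadDivText("<div class='test'> Canberra </div>", "test") gives "Canberra". Note that @class='test' only matches when the whole attribute value is test; a div with several classes would need contains(@class, 'test') or similar.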

But now imagine you first have to post some data to obtain the actual URL that holds the data you need. For example:
You have to mimic entering search criteria and clicking a button to get a set of URLs
You have to enter a user name and password and click the login button (that’s scary..), etc.
If you are targeting an ASP.NET website, it will expect valid event validation and view state values in the initial data post. Let’s create a simple page scraper that works against an ASP.NET Web Forms site and interrogates the HTML response to retrieve the data you want.

The following steps give an idea of how to implement a custom scraper.

Step 1.    Download Html Agility Pack from CodePlex [http://htmlagilitypack.codeplex.com/]. This lets you load HTML (MSHTML or W3C style) into an HtmlDocument data structure.
Step 2.    Create a custom WebClient class that carries cookies (and the referer) from one request to the next.
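using System;
using System.Net;

internal class CookieAwareWebClient : WebClient
{
    // Shared by all requests, so cookies set by one response are sent with the next request.
    private CookieContainer cc = new CookieContainer();
    private string _lastPage;

    protected override WebRequest GetWebRequest(Uri address)
    {
        var r = base.GetWebRequest(address);
        var wr = r as HttpWebRequest;
        if (wr != null)
        {
            wr.CookieContainer = cc;
            // Send the previously requested page as the referer, as a browser would.
            if (_lastPage != null)
            {
                wr.Referer = _lastPage;
            }
        }
        _lastPage = address.ToString();
        return r;
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        // Hook for inspecting the response (headers, cookies); currently passes it through unchanged.
        var response = base.GetWebResponse(request);
        return response;
    }
}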

Step 3.    Inspect the web page's view state and event validation fields, and write a method that builds the post data from them.
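using System.Collections.Specialized;

// Builds the form values to post: the hidden ASP.NET fields copied from the GET response plus the search input.
private static NameValueCollection GetViewState(string getResponse, string postcode)
{
    if (string.IsNullOrEmpty(getResponse)) return null;

    var viewStateIndex = getResponse.IndexOf("__VIEWSTATE");
    var eventValidationIndex = getResponse.IndexOf("__EVENTVALIDATION");
    // Without both hidden fields the offsets below would read the wrong text.
    if (viewStateIndex < 0 || eventValidationIndex < 0) return null;

    var collection = new NameValueCollection { { "__EVENTTARGET", "" },
                                               { "__EVENTARGUMENT", "" } };

    // 37 = length of: __VIEWSTATE" id="__VIEWSTATE" value="
    var viewState = getResponse.Substring(viewStateIndex + 37);
    // Cut at the end of the tag, dropping the closing quote and the space before "/>".
    viewState = viewState.Substring(0, viewState.IndexOf("/>") - 2);
    collection.Add("__VIEWSTATE", viewState);
    collection.Add("__VIEWSTATEENCRYPTED", "");

    // 49 = length of: __EVENTVALIDATION" id="__EVENTVALIDATION" value="
    var eventValidation = getResponse.Substring(eventValidationIndex + 49);
    eventValidation = eventValidation.Substring(0, eventValidation.IndexOf("/>") - 2);
    collection.Add("__EVENTVALIDATION", eventValidation);

    // Site-specific field names: the search text box and the button that was "clicked".
    collection.Add("content_0$contentcolumnmain_0$txtPostcode", postcode);
    collection.Add("content_0$contentcolumnmain_0$btnSearch", "Search");
    return collection;
}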
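
The fixed offsets used above depend on the exact markup ASP.NET renders, so they stop working if the attribute order or spacing changes. A possible alternative, sketched here as an assumption rather than as part of the original approach, is to let Html Agility Pack read every hidden input. HiddenFieldReader and ReadHiddenFields are hypothetical names introduced for this example.

```csharp
using System.Collections.Specialized;
using HtmlAgilityPack;

internal static class HiddenFieldReader
{
    // Hypothetical helper: copies the name/value pair of every hidden <input>
    // on the page into a collection that can be posted back.
    public static NameValueCollection ReadHiddenFields(string html)
    {
        var fields = new NameValueCollection();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var inputs = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
        if (inputs == null) return fields; // SelectNodes returns null when nothing matches

        foreach (var input in inputs)
        {
            var name = input.GetAttributeValue("name", string.Empty);
            if (name.Length == 0) continue; // inputs without a name are never posted
            fields[name] = input.GetAttributeValue("value", string.Empty);
        }
        return fields;
    }
}
```

The site-specific fields (the postcode text box and the search button in this post) would still have to be added to the returned collection before posting.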
Step 4.    Make a call to the web URL and send the automated actions.
using System.IO;
using System.Text;
using HtmlAgilityPack;

var webClient = new CookieAwareWebClient();
var uri = "http://someurl.aspx";
string getResponse = string.Empty;
// GET the page first so we receive its cookies and hidden field values
using (StreamReader reader = new StreamReader(webClient.OpenRead(uri)))
{
    getResponse = reader.ReadToEnd();
}
// Call the custom GetViewState method (it is static, so it is called without "this.")
var viewStateValues = GetViewState(getResponse, "mr. x");
// Upload the NameValueCollection.
byte[] responseArray = webClient.UploadValues(uri, "POST", viewStateValues);
// Save the response string for later use.
// ASCII replaces non-ASCII characters; use the page's actual encoding (e.g. Encoding.UTF8) if you need them.
var responseStringToBeStored = Encoding.ASCII.GetString(responseArray);


Now you can inspect the content of the responseStringToBeStored variable and pull out the links, which can then be scraped as described at the beginning of the post.
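
Pulling the links out of the response could look roughly like the sketch below. It assumes the links you want are ordinary anchor tags; LinkExtractor and ExtractLinks are hypothetical names, and the pageUrl parameter is only used to turn relative links into absolute ones.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

internal static class LinkExtractor
{
    // Hypothetical helper: returns the absolute URL of every <a href="..."> in the HTML.
    public static List<string> ExtractLinks(string html, string pageUrl)
    {
        var links = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links; // SelectNodes returns null when nothing matches

        var baseUri = new Uri(pageUrl);
        foreach (var anchor in anchors)
        {
            var href = anchor.GetAttributeValue("href", string.Empty);
            Uri absolute;
            // Resolve relative links against the page URL; skip anything that is not a valid URI.
            if (Uri.TryCreate(baseUri, href, out absolute))
            {
                links.Add(absolute.ToString());
            }
        }
        return links;
    }
}
```

Calling LinkExtractor.ExtractLinks(responseStringToBeStored, uri) would then give a list of URLs, each of which can be loaded with HtmlWeb or the CookieAwareWebClient from Step 2.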