Monday, August 19, 2013

Create a custom web scraper using Html Agility Pack and XPath

Imagine you want to build a scraper for a website. The tools that come in handy are Html Agility Pack and a bit of knowledge of XPath. Html Agility Pack is a free-to-use HTML parser with very few dependencies; the main one is .NET's XPath implementation.
The general implementation contains the following code to extract the HTML response from the URL provided:
var url = "http://en.wikipedia.org/wiki/Australia";
var web = new HtmlWeb();
HtmlDocument responseHtmlDoc = web.Load(url);
// now interrogate the HtmlDocument using XPath to get the data
responseHtmlDoc.DocumentNode.SelectSingleNode("//div[@class='test']");
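XPath can also select whole collections of nodes, not just a single one. As a quick illustration (a sketch, assuming Html Agility Pack is referenced and the Wikipedia markup still uses the mw-headline class), pulling every section heading from the Australia page might look like:

```csharp
using System;
using HtmlAgilityPack;

var web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/Australia");

// SelectNodes returns all matches, or null when nothing matches,
// so always guard before iterating
var headings = doc.DocumentNode.SelectNodes("//span[@class='mw-headline']");
if (headings != null)
{
    foreach (var heading in headings)
    {
        Console.WriteLine(heading.InnerText);
    }
}
```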

But now imagine you first have to post some data to reach the actual URL containing the data you need. For example:
You have to mimic the action of entering search criteria and clicking a button to get a set of URLs
Entering a user name and password and clicking a login button (that's scary...), etc.
If you are targeting an ASP.NET website, it will have event validation and view state values that must be set correctly to perform the initial data post. Let's try to create a simple page scraper that goes against an ASP.NET forms-based website and interrogates the HTML response to retrieve the data you want.

The following steps give an idea of how to implement a custom scraper:

Step 1.    Download Html Agility Pack from CodePlex [http://htmlagilitypack.codeplex.com/]. This will allow you to load MSHTML/W3C HTML into an HtmlDocument data structure.
Step 2.    Create a custom WebClient class that preserves cookies across requests.
using System;
using System.Net;

internal class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer cc = new CookieContainer();
    private string _lastPage;

    protected override WebRequest GetWebRequest(Uri address)
    {
        var r = base.GetWebRequest(address);
        var wr = r as HttpWebRequest;
        if (wr != null)
        {
            // share one cookie container across requests so the
            // ASP.NET session cookie survives between the GET and the POST
            wr.CookieContainer = cc;
            if (_lastPage != null)
            {
                wr.Referer = _lastPage;
            }
        }
        _lastPage = address.ToString();
        return r;
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        // pass-through; kept as a convenient hook for inspecting responses
        return base.GetWebResponse(request);
    }
}

Step 3.    Inspect the web page's __VIEWSTATE and __EVENTVALIDATION hidden fields and write a method that builds the form values to post.
using System.Collections.Specialized;

// Builds the form fields an ASP.NET page expects on postback.
// The 37/49 offsets skip the rendered markup
//   __VIEWSTATE" id="__VIEWSTATE" value="
// (and the __EVENTVALIDATION equivalent), so this only works while the
// page renders the hidden fields in exactly that shape.
private static NameValueCollection GetViewState(string getResponse, string postcode)
{
    if (string.IsNullOrEmpty(getResponse)) return null;

    var viewStateIndex = getResponse.IndexOf("__VIEWSTATE");
    var eventValidationIndex = getResponse.IndexOf("__EVENTVALIDATION");
    var collection = new NameValueCollection
    {
        { "__EVENTTARGET", "" },
        { "__EVENTARGUMENT", "" },
    };
    var viewState = getResponse.Substring(viewStateIndex + 37);
    viewState = viewState.Substring(0, viewState.IndexOf("/>") - 2);
    collection.Add("__VIEWSTATE", viewState);
    collection.Add("__VIEWSTATEENCRYPTED", "");
    var eventValidation = getResponse.Substring(eventValidationIndex + 49);
    eventValidation = eventValidation.Substring(0, eventValidation.IndexOf("/>") - 2);
    collection.Add("__EVENTVALIDATION", eventValidation);
    // these control names are specific to the target page; inspect the
    // page source to find the real name attributes of your inputs
    collection.Add("content_0$contentcolumnmain_0$txtPostcode", postcode);
    collection.Add("content_0$contentcolumnmain_0$btnSearch", "Search");
    return collection;
}
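The fixed 37/49 offsets break as soon as the rendered markup changes. A more robust variant (a sketch; ViewStateHelper and HiddenValue are hypothetical names introduced here) lets Html Agility Pack itself parse the GET response and read the hidden inputs by name with XPath:

```csharp
using System.Collections.Specialized;
using HtmlAgilityPack;

internal static class ViewStateHelper
{
    // Pulls a hidden input's value attribute by name,
    // returning "" when the field is absent
    private static string HiddenValue(HtmlDocument doc, string name)
    {
        var node = doc.DocumentNode.SelectSingleNode(
            "//input[@type='hidden' and @name='" + name + "']");
        return node == null ? "" : node.GetAttributeValue("value", "");
    }

    public static NameValueCollection GetViewState(string getResponse, string postcode)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(getResponse);
        return new NameValueCollection
        {
            { "__EVENTTARGET", "" },
            { "__EVENTARGUMENT", "" },
            { "__VIEWSTATE", HiddenValue(doc, "__VIEWSTATE") },
            { "__VIEWSTATEENCRYPTED", "" },
            { "__EVENTVALIDATION", HiddenValue(doc, "__EVENTVALIDATION") },
            // site-specific field names, as in the original example
            { "content_0$contentcolumnmain_0$txtPostcode", postcode },
            { "content_0$contentcolumnmain_0$btnSearch", "Search" },
        };
    }
}
```

This survives attribute reordering and whitespace changes because it matches on the name attribute rather than on character offsets.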
Step 4.    Let's make a call to the web URL and send the automated actions.
using System.IO;
using System.Text;

var webClient = new CookieAwareWebClient();
var uri = "http://someurl.aspx";
string getResponse;
// GET the page first so we receive the session cookie and the
// current view state values
using (var reader = new StreamReader(webClient.OpenRead(uri)))
{
    getResponse = reader.ReadToEnd();
}
// call the custom GetViewState method ("2000" is just a sample postcode)
var viewStateValues = GetViewState(getResponse, "2000");
// POST the NameValueCollection back to the same URL
byte[] responseArray = webClient.UploadValues(uri, "POST", viewStateValues);
// save the response string for later inspection
var responseStringToBeStored = Encoding.UTF8.GetString(responseArray);


Now you can inspect the content of the responseStringToBeStored variable and strip out links that can be used to scrape the pages, as mentioned at the beginning of the post.
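Tying it together, a small sketch of that last step (assuming responseStringToBeStored holds the POST response from Step 4) loads the string into an HtmlDocument and collects every link on the page:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(responseStringToBeStored);

// gather every href on the page; SelectNodes returns null on no match
var links = new List<string>();
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (var a in anchors)
    {
        links.Add(a.GetAttributeValue("href", ""));
    }
}
// each link can now be fed back into HtmlWeb.Load for further scraping
```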
