Imagine you want to build a scraper for a website. The tools that come in handy are Html Agility Pack and a bit of knowledge of XPath. Html Agility Pack is a free HTML parser with very few dependencies; the main one is .NET's XPath implementation.
The general implementation uses the following code to fetch the HTML response from the given URL:
var url = "http://en.wikipedia.org/wiki/Australia";
var web = new HtmlWeb();
HtmlDocument responseHtmlDoc = web.Load(url);
// Now start interrogating the HtmlDocument using XPath to get the data
var node = responseHtmlDoc.DocumentNode.SelectSingleNode("//div[@class='test']");
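SelectSingleNode returns null when the XPath matches nothing, so it is worth guarding before reading the node. Here is a minimal sketch, assuming the HtmlAgilityPack package is referenced. The XPathDemo class and FirstDivText method are names made up for illustration, and the HTML is parsed from a string rather than fetched from a URL.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack itself.
public static class XPathDemo
{
    // Returns the trimmed text of the first div with the given class,
    // or null when no such div exists.
    public static string FirstDivText(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of fetching a URL

        var node = doc.DocumentNode.SelectSingleNode("//div[@class='" + cssClass + "']");
        return node == null ? null : node.InnerText.Trim();
    }
}
```

Note that @class='test' is an exact match on the whole attribute, so a div with class="test other" would not be selected by this expression.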
But now imagine you first have to post some data to obtain the URL that holds the data you need. For example:
You have to mimic entering search criteria and clicking a button to get a set of URLs
You have to enter a user name and password and click the login button (that's scary...), etc.
If you are going against an ASP.NET website, it will expect event validation and view state values to be set correctly before the initial post succeeds. Let's create a simple page scraper that goes against an ASP.NET forms-based website and interrogates the HTML response to retrieve the data you want.
The following steps give an idea of how to implement a custom scraper.
Step 1.
Download Html Agility Pack from CodePlex [http://htmlagilitypack.codeplex.com/]. It lets you load HTML, whether MSHTML or W3C style, into an HtmlDocument data structure.
Step 2.
Create a custom WebClient class that keeps cookies (and the referer) between requests.
using System;
using System.Net;

internal class CookieAwareWebClient : WebClient
{
    // Shared by every request so session cookies survive between the GET and the POST
    private CookieContainer cc = new CookieContainer();
    private string _lastPage;

    protected override WebRequest GetWebRequest(Uri address)
    {
        var r = base.GetWebRequest(address);
        var wr = r as HttpWebRequest;
        if (wr != null)
        {
            wr.CookieContainer = cc;
            if (_lastPage != null)
            {
                // Send the previously requested page as the Referer
                wr.Referer = _lastPage;
            }
        }
        _lastPage = address.ToString();
        return r;
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        var response = base.GetWebResponse(request);
        return response;
    }
}
Step 3.
Inspect the web page's view state value and event validation behavior, then write a method that extracts both and builds the form data to post.
using System.Collections.Specialized;

private static NameValueCollection GetViewState(string getResponse, string postcode)
{
    if (string.IsNullOrEmpty(getResponse)) return null;
    var viewStateIndex = getResponse.IndexOf("__VIEWSTATE");
    var eventValidationIndex = getResponse.IndexOf("__EVENTVALIDATION");
    // Bail out if the page carries no view state or event validation field
    if (viewStateIndex < 0 || eventValidationIndex < 0) return null;
    var collection = new NameValueCollection { { "__EVENTTARGET", "" }, { "__EVENTARGUMENT", "" } };
    // 37 = length of: __VIEWSTATE" id="__VIEWSTATE" value="
    var viewState = getResponse.Substring(viewStateIndex + 37);
    // Cut at the end of the tag, dropping the closing quote and the space before "/>"
    viewState = viewState.Substring(0, viewState.IndexOf("/>") - 2);
    collection.Add("__VIEWSTATE", viewState);
    collection.Add("__VIEWSTATEENCRYPTED", "");
    // 49 = length of: __EVENTVALIDATION" id="__EVENTVALIDATION" value="
    var eventValidation = getResponse.Substring(eventValidationIndex + 49);
    eventValidation = eventValidation.Substring(0, eventValidation.IndexOf("/>") - 2);
    collection.Add("__EVENTVALIDATION", eventValidation);
    // These control names are specific to the target page; inspect its form to find yours
    collection.Add("content_0$contentcolumnmain_0$txtPostcode", postcode);
    collection.Add("content_0$contentcolumnmain_0$btnSearch", "Search");
    return collection;
}
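The fixed offsets (37 and 49) in GetViewState depend on the exact markup the page emits for its hidden fields. As a hedged alternative, the sketch below reads a hidden field's value with a regular expression instead. The HiddenFieldReader class and ReadHiddenField method are names made up for illustration. The sketch assumes double-quoted attributes with name appearing before value inside the same input tag, which is how these fields are usually rendered but is not guaranteed.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original scraper.
public static class HiddenFieldReader
{
    // Returns the value attribute of <input ... name="fieldName" ... value="...">,
    // or null when the field is not present.
    // Assumes double-quoted attributes, with name before value in the same tag.
    public static string ReadHiddenField(string html, string fieldName)
    {
        if (string.IsNullOrEmpty(html)) return null;

        var pattern = "<input[^>]*\\bname=\"" + Regex.Escape(fieldName)
                    + "\"[^>]*\\bvalue=\"([^\"]*)\"";
        var match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);

        // Decode HTML entities in case the attribute value contains any
        return match.Success ? WebUtility.HtmlDecode(match.Groups[1].Value) : null;
    }
}
```

With a helper like this, the two Substring blocks could become calls such as ReadHiddenField(getResponse, "__VIEWSTATE") and ReadHiddenField(getResponse, "__EVENTVALIDATION"), with a null check on each result before adding it to the collection.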
Step 4.
Let's make a call to the web URL and send the automated actions.
using System.IO;
using System.Text;
using HtmlAgilityPack;

var webClient = new CookieAwareWebClient();
var uri = "http://someurl.aspx";
string getResponse = string.Empty;
// GET the page first so the view state, event validation and cookies are captured
using (StreamReader reader = new StreamReader(webClient.OpenRead(uri)))
{
    getResponse = reader.ReadToEnd();
}
// Call the custom GetViewState method (it is static, so no "this.")
var viewStateValues = GetViewState(getResponse, "mr. x");
// Upload the NameValueCollection.
byte[] responseArray = webClient.UploadValues(uri, "POST", viewStateValues);
// Save the response string for later use
// (switch to Encoding.UTF8 if the page contains non-ASCII characters)
var responseStringToBeStored = Encoding.ASCII.GetString(responseArray);
Now you can inspect the contents of the responseStringToBeStored variable and strip out the links, which can then be scraped as described at the beginning of the post.
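One way to strip out those links is to load the response string into an HtmlDocument and select every anchor that has an href. This is a sketch under assumptions: the LinkExtractor class and ExtractLinks method are names made up for illustration, the HtmlAgilityPack package is assumed to be referenced, and relative links are resolved against whatever base URI you pass in.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original scraper.
public static class LinkExtractor
{
    // Collects an absolute URL for every <a href="..."> found in the HTML.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", string.Empty);
            Uri absolute;
            // Resolve relative hrefs against the page they came from
            if (Uri.TryCreate(baseUri, href, out absolute))
            {
                links.Add(absolute.ToString());
            }
        }
        return links;
    }
}
```

In the flow above this could be called as LinkExtractor.ExtractLinks(responseStringToBeStored, new Uri(uri)), and each returned URL can then be loaded with HtmlWeb or the CookieAwareWebClient.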