Get All URLs on a Page using C#

In this article, I show a class that can be used to find and display all of the URLs on a web page. What for, you may ask? Well, in my experience as a web developer, I have found a class like this to be very useful. Sometimes, you may want to use this class as a basis for a more complex application that crawls your site checking for bad or broken links. In other cases, you may simply want to check an individual page to make sure your links are formatted correctly, or don't point to any obsolete pages. You could also easily change this class to look for other items within your page, like specific text or tags. Who knows, this may be the start of a specialized spider that crawls sites on the internet looking for something specific.

I think you get the picture. Of course, to make this class do all those wonderful things, you would have to expand on what I am presenting here. However, I believe this is a good start. The class has one public method, RetrieveUrls, which calls two private methods. The RetrieveContent method issues a request to the web page and retrieves the contents. The GetAllUrls method uses a regular expression to find all of the URLs on the page. It writes the matches to the screen and also saves them in a log file. Of course, if you prefer, you could modify the method to save the matches somewhere else, like an array or a database table.
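As a rough illustration of that last point, here is a minimal sketch that collects the matches into a list instead of printing them. The UrlCollector class and its CollectUrls method are hypothetical names introduced only for this example. The pattern is my reading of the one used in GetAllUrls, with an unnamed capture group for the URL, so treat it as an assumption. The sketch also skips empty captures, which this pattern can produce for links it is meant to exclude, such as anchors that start with #.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original GetUrls class.
static class UrlCollector
{
    // Assumed reading of the article's pattern; group 1 captures the URL.
    private const string Pattern =
        @"(?:href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(.*?)(?:[\s>""'])";

    // Returns every non-empty URL captured from the page content.
    public static List<string> CollectUrls(string content)
    {
        List<string> urls = new List<string>();
        foreach (Match match in Regex.Matches(content, Pattern, RegexOptions.IgnoreCase))
        {
            string url = match.Groups[1].Value;
            if (url.Length > 0) // ignore empty captures
            {
                urls.Add(url);
            }
        }
        return urls;
    }
}
```

You could call it as UrlCollector.CollectUrls(html) with the page content as a string, then loop over the result or hand it to whatever storage you prefer.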

Using the code

GetUrls urls = new GetUrls();

urls.RetrieveUrls("http://www.microsoft.com");

The class is listed below. Have fun!

//required namespaces
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace FindAllUrls
{
    class GetUrls
    {
        //public method called from your application
        public void RetrieveUrls(string webPage)
        {
            GetAllUrls(RetrieveContent(webPage));
        }

        //get the content of the web page passed in
        private string RetrieveContent(string webPage)
        {
            //create a request object using the url passed in
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(webPage);
            request.Timeout = 10000; //milliseconds

            //go get a response from the page and wrap it in a streamreader;
            //the using blocks close both objects even if an exception is thrown,
            //and any exception reaches the caller with its stack trace intact
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader respStream = new StreamReader(response.GetResponseStream()))
            {
                //get the contents of the page as a string and return it
                return respStream.ReadToEnd();
            }
        }

        //using a regular expression, find all of the href or urls
        //in the content of the page
        private void GetAllUrls(string content)
        {
            //regular expression; group 1 captures the url itself
            string pattern = @"(?:href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(.*?)(?:[\s>""'])";

            //Set up regex object
            Regex RegExpr = new Regex(pattern, RegexOptions.IgnoreCase);

            //get the first match
            Match match = RegExpr.Match(content);

            //loop through matches
            while (match.Success)
            {
                //output the match info
                Console.WriteLine("href match: " + match.Groups[0].Value);
                WriteToLog(@"C:\matchlog.txt", "href match: " + match.Groups[0].Value + "\r\n");

                Console.WriteLine("Url match: " + match.Groups[1].Value);
                WriteToLog(@"C:\matchlog.txt", "Url | Location | mailto match: " + match.Groups[1].Value + "\r\n");

                //get next match
                match = match.NextMatch();
            }
        }

        //Write to a log file
        private void WriteToLog(string file, string message)
        {
            //the using block closes the writer, so no explicit Close() is needed
            using (StreamWriter w = File.AppendText(file))
            {
                w.WriteLine(DateTime.Now.ToString() + ": " + message);
            }
        }
    }
}
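
One thing to keep in mind if you build the link checker mentioned earlier: the class reports each href exactly as it appears in the page, so relative links stay relative. A possible next step, sketched below under my own assumptions, is to resolve each match against the address of the page it came from. LinkResolver and ToAbsolute are hypothetical names used only for this illustration. The sketch relies on the standard Uri.TryCreate overload that takes a base Uri and a relative string.

```csharp
using System;

// Hypothetical helper, not part of the original GetUrls class.
static class LinkResolver
{
    // Resolves href against the page it was found on.
    // Returns the absolute URL, or null if href cannot be resolved.
    // Throws UriFormatException if pageUrl itself is not a valid absolute URL.
    public static string ToAbsolute(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri resolved;
        return Uri.TryCreate(baseUri, href, out resolved) ? resolved.AbsoluteUri : null;
    }
}
```

For example, resolving "/about" against "http://www.example.com/docs/index.html" gives "http://www.example.com/about", while a link that is already absolute comes back unchanged.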
