Get All URLs on a Page
By John Ginzo

In this article, I present a class that finds and displays all of the URLs on a web page. What for, you may ask? In my experience as a web developer, a class like this is very useful. You might use it as a basis for a more complex application that crawls your site checking for bad or broken links. In other cases, you may simply want to check an individual page to make sure its links are formatted correctly and don't point to obsolete pages. You could also easily change this class to look for other items within a page, like specific text or tags. Who knows, this may be the start of a specialized spider that crawls sites on the internet looking for something specific.

I think you get the picture. Of course, to make this class do all those wonderful things, you would have to expand on what I am presenting here. However, I believe this is a good start. The class has one public method, RetrieveUrls, which calls two private methods. The RetrieveContent method issues a request to the web page and retrieves its contents. The GetAllUrls method uses a regular expression to find all of the URLs on the page, writes each match to the screen, and saves it in a log file through a small WriteToLog helper. Of course, if you prefer, you could modify the method to save the matches somewhere else, like an array or a database table.
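
The last point, saving the matches somewhere other than the screen and the log file, could look roughly like the sketch below. It is not part of the class in this article: UrlExtractor and ExtractUrls are hypothetical names, and the pattern is a deliberately simplified one that only handles quoted href values, not the fuller expression used in the listing.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the GetUrls class in this article.
static class UrlExtractor
{
    // Simplified pattern (an assumption): captures the value of a
    // single- or double-quoted href attribute in a group named "url".
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*[""'](?<url>[^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns every captured href value, in the order it appears in the page.
    public static List<string> ExtractUrls(string content)
    {
        List<string> urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(content))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

The caller can then decide what to do with the list, for example print it, write it to a file, or insert it into a database table, without touching the matching code.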

Using the code

GetUrls urls = new GetUrls();
urls.RetrieveUrls("http://www.microsoft.com");

The class is listed below. Have fun!

//required namespaces
using System; 
using System.Collections.Generic; 
using System.Text; 
using System.Net; 
using System.IO; 
using System.Text.RegularExpressions; 


namespace FindAllUrls 
{ 
 class GetUrls 
 { 

  //public method called from your application 
  public void RetrieveUrls( string webPage ) 
  { 
   GetAllUrls(RetrieveContent(webPage)); 
  } 

  //get the content of the web page passed in 
  private string RetrieveContent(string webPage) 
  { 
   HttpWebResponse response = null;//used to get response 
   StreamReader respStream = null;//used to read response into string 
   try 
   { 
    //create a request object using the url passed in 
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(webPage); 
    request.Timeout = 10000; 

    //go get a response from the page 
    response = (HttpWebResponse)request.GetResponse(); 
    
    //create a streamreader object from the response 
    respStream = new StreamReader(response.GetResponseStream()); 

    //get the contents of the page as a string and return it 
    return respStream.ReadToEnd();   
   } 
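   catch (Exception)//houston we have a problem! 
   { 
    //rethrow without resetting the stack trace 
    throw; 
   } 
   finally 
   { 
    //close it down, we're going home! 
    //either object may still be null if the request failed 
    if (respStream != null) respStream.Close(); 
    if (response != null) response.Close(); 
   } 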
  } 
  
  //using a regular expression, find all of the href or urls 
  //in the content of the page 
  private void GetAllUrls( string content ) 
  { 
   //regular expression; the one capturing group holds the url itself 
   //and is read below as Groups[1] 
   string pattern = @"(?:href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(.*?)(?:[\s>""'])"; 
   
   //Set up regex object 
   Regex RegExpr = new Regex(pattern, RegexOptions.IgnoreCase); 

   //get the first match 
   Match match = RegExpr.Match(content); 

   //loop through matches 
   while (match.Success) 
   { 

    //output the match info 
    Console.WriteLine("href match: " + match.Groups[0].Value); 
    WriteToLog(@"C:\matchlog.txt", "href match: " + match.Groups[0].Value + "\r\n"); 

    Console.WriteLine("Url match: " + match.Groups[1].Value); 
    WriteToLog(@"C:\matchlog.txt", "Url | Location | mailto match: " + match.Groups[1].Value + "\r\n"); 
    
    //get next match 
    match = match.NextMatch(); 
   } 
  } 

  //Write to a log file 
  private void WriteToLog(string file, string message) 
  { 
   using (StreamWriter w = File.AppendText(file)) 
   { 
    //the using block closes the writer, so no explicit Close() is needed 
    w.WriteLine(DateTime.Now.ToString() + ": " + message); 
   } 
  } 
 } 
}
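
If you do grow this class into a link checker or crawler, one thing you will probably need is a way to turn the captured href values into absolute addresses, since many of them will be relative (for example "/about.html"). The sketch below is one possible approach using the System.Uri class. LinkResolver and ToAbsolute are hypothetical names that do not appear in the class above, and the example.com addresses are placeholders.

```csharp
using System;

// Hypothetical helper, not part of the GetUrls class in this article.
static class LinkResolver
{
    // Resolves an href value against the address of the page it was found on.
    // Returns null when the two cannot be combined into a valid URI.
    public static string ToAbsolute(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri resolved;
        return Uri.TryCreate(baseUri, href, out resolved)
            ? resolved.AbsoluteUri
            : null;
    }
}
```

With a page address of "http://www.example.com/docs/index.html", an href of "page2.html" would resolve to "http://www.example.com/docs/page2.html", while an href that is already absolute comes back unchanged.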