C# Strings and Regular Expressions

3.0. Introduction

It would be very rare to create an entire application without using a single string. Strings help make sense of the seemingly random jumble of binary data that applications use to accomplish a task. They appear in all facets of application development, from the smallest system utility to large enterprise services. Their value is so apparent that more and more connected systems are leaning toward string data within their communication protocols by utilizing the Extensible Markup Language (XML) rather than the more cumbersome traditional transmission of large binary data. This book uses strings extensively to examine the internal contents of variables and the results of program flow using Framework Class Library (FCL) methods such as Console.WriteLine and MessageBox.Show.

In this chapter, you will learn how to take advantage of the rich support for strings within the .NET Framework and the C# language. Coverage includes ways to manipulate string contents, programmatically inspect strings and their character attributes, and optimize performance when working with string objects. Furthermore, this chapter uncovers the power of regular expressions and how they allow you to effectively parse and manipulate string data. After reading this chapter, you will be able to use regular expressions in a variety of different situations where their value is apparent.

3.1. Creating and Using String Objects

You want to create and manipulate string data within your application.

Technique

Because string data is so important, the C# language provides a string keyword. Although string is a reference type, it behaves much like a value type: its contents never change, and the equality operator compares values rather than references. To create a string, declare a variable using the string keyword. You can use the assignment operator to initialize the variable using a string literal or an already initialized string variable.

string string1 = "This is a string";
string string2 = string1;

To gain more control over string initialization, declare a variable using the System.String data type and create a new instance using the new keyword. The System.String class contains several constructors that you can use to initialize the string value. For instance, to create a new string that is a small subset of an existing string, use the overloaded constructor, which takes a character array and two integers denoting the beginning index and the number of characters from that index to copy:

using System;

class Class1
{
[STAThread]
static void Main(string[] args)
{
string string1 = "Field1, Field2";
System.String string2 = new System.String( string1.ToCharArray(), 8, 6 );

Console.WriteLine( string2 );

}
}

Finally, if you know a string will be intensively manipulated, use the System.Text.StringBuilder class. Creating a variable of this data type is similar to using the System.String class, and it contains several constructors to initialize the internal string value. The key internal difference between a regular string object and a StringBuilder lies in performance. Whenever a string is manipulated in some manner, a new object has to be created, which leaves the old object to be reclaimed by the garbage collector. For a string that undergoes several transformations, the performance hit associated with frequent object creation and collection can be great. The StringBuilder class, on the other hand, maintains an internal buffer, which expands to make room for more string data should the need arise, thereby reducing the number of object allocations.

Comments

There is no recommendation on whether you use the string keyword or the System.String class. The string keyword is simply an alias for this class, so it is all a matter of taste. We prefer using the string keyword, but this preference is purely aesthetic. For this reason, we simply refer to the System.String class as the string class or data type.

The string class contains many methods, both instance and static, for manipulating strings. If you want to compare strings, you can use the Compare method. If you are just testing for equality, you might want to use the overloaded equality operator (==). The Compare method, however, returns an integer instead of a Boolean value, denoting how the two strings differ. If the return value is 0, the strings are equal. If the return value is greater than 0, as shown in Listing 3.1, the first operand sorts after the second operand. If the return value is less than 0, the opposite is true. By default, Compare reads the characters of both strings from left to right and applies the sorting rules of the current culture. To compare the numeric values of the characters instead, use the CompareOrdinal method.

Listing 3.1 Using the Compare Method in the String Class

using System;

namespace _1_UsingStrings
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
string string1 = "";
String string2 = "";

Console.Write( "Enter string 1: " );
string1 = Console.ReadLine();
Console.Write( "Enter string 2: " );
string2 = Console.ReadLine();

// string and String are the same types
Console.WriteLine( "string1 is a {0}\nstring2 is a {1}",
string1.GetType().FullName, string2.GetType().FullName );

CompareStrings( string1, string2 );
}

public static void CompareStrings( string str1, string str2 )
{
int compare = String.Compare( str1, str2 );

if( compare == 0 )
{
Console.WriteLine( "The strings {0} and {1} are the same.\n",
str1, str2 );
}
else if( compare < 0 )
{
Console.WriteLine( "The string {0} is less than {1}",
str1, str2 );
}
else if( compare > 0 )
{
Console.WriteLine( "The string {0} is greater than {1}",
str1, str2 );
}
}
}
}

As mentioned earlier, the string class contains both instance and static methods. Sometimes you have no choice about whether to use an instance or static method. However, a few of the instance methods have a static version as well. Because calling a static method is a nonvirtual function call, you might see a slight performance gain if you use this version. An example where both instance and static versions exist appears in Listing 3.1. The string comparison uses the static Compare method. You can also perform the comparison with the nonstatic CompareTo method, called on one of the string instances passed in as parameters. In most cases, the performance gain is negligible, but if an application needs to call these methods repeatedly, you might want to consider using the static method over the nonstatic one.
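The comparison choices discussed above can be sketched side by side. This is a minimal illustration, not part of Listing 3.1: the CompareSketch class and its Sign helper are hypothetical names introduced here, and the sample strings are arbitrary. The results are normalized to -1, 0, or 1 because only the sign of a comparison result is meaningful.

```csharp
using System;

static class CompareSketch
{
    // Normalizes any comparison result to -1, 0, or 1.
    public static int Sign( int value )
    {
        return Math.Sign( value );
    }

    static void Main()
    {
        string a = "apple";
        string b = "banana";

        // static method, culture-sensitive comparison
        Console.WriteLine( Sign( String.Compare( a, b ) ) );

        // instance method, same ordering as the static call
        Console.WriteLine( Sign( a.CompareTo( b ) ) );

        // static method, compares the numeric values of the characters
        Console.WriteLine( Sign( String.CompareOrdinal( a, b ) ) );

        // equality only
        Console.WriteLine( a == b );
    }
}
```

For strings that differ only in case, the culture-sensitive and ordinal methods can disagree about the order, so it is worth choosing one deliberately rather than mixing them.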

The string class is immutable. Once a string is created, its contents cannot be changed. Methods within the string class that appear to modify the original string instance actually create and return a new string object, leaving the original to be reclaimed by the garbage collector once nothing references it. It can be expensive to repeatedly call string methods if new objects are created and discarded continuously. To solve this, the .NET Framework contains a StringBuilder class within the System.Text namespace, which is explained later in this chapter.
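A short sketch can make immutability concrete. The ImmutabilitySketch class and its CreatesNewObject helper are hypothetical names used only for illustration; the string methods themselves (ToUpper, Insert) are standard.

```csharp
using System;

static class ImmutabilitySketch
{
    // Returns true when "modifying" a string produced a different object.
    public static bool CreatesNewObject( string original )
    {
        string modified = original.Insert( 0, ">" );
        return !Object.ReferenceEquals( original, modified );
    }

    static void Main()
    {
        string original = "abc";
        string upper = original.ToUpper();

        Console.WriteLine( original );  // still "abc"
        Console.WriteLine( upper );     // the new, uppercase string
        Console.WriteLine( CreatesNewObject( original ) );
    }
}
```

A common mistake that follows from this behavior is calling a method such as ToUpper and discarding the return value, which leaves the original variable unchanged.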

3.2. Formatting Strings

Given one or more objects, you want to create a single formatted string representation.

Technique

You can format strings using numeric and picture formatting within String.Format or within any method that uses string-formatting techniques for parameters such as Console.WriteLine.

Comments

The String class as well as a few other methods within the .NET Framework allow you to format strings to present them in a more ordered and readable format. Up to this point in the book, we used basic formatting when calling the Console.WriteLine method. The first parameter to Console.WriteLine is the format specifier string. This string controls how the remaining parameters to the method should appear when displayed. You use placeholders within the format string to insert the value of a variable. This placeholder uses the syntax {n} where n is the index in the parameter list following the format specifier. Take the following line of code, for instance:

Console.WriteLine( "x={0}, y={1}, {0}+{1}={2}", x, y, x+y );

This line of code has three parameters following the format specifier string. You use placeholders within the format specification, and when this method is called, the appropriate substitutions are made. Although you can do the same thing using string concatenation, the resultant line of code is slightly obfuscated:

string s = "x=" + x + ",y=" + y + ", " + x + "+" + y + "=" + (x+y);
Console.WriteLine( s );

You can further refine the format by applying format attributes on the placeholders themselves. These additional attributes follow the parameter index value and are separated from that index with a : character. There are two types of special formatting available. The first is numeric formatting, which lets you format a numeric parameter into one of nine different numeric formats, as shown in Table 3.1. The format of these specifiers, using the currency format as an example, is Cxx, where xx is an optional precision from 0 to 99. The meaning of the precision depends on the specifier: for the currency format it is the number of decimal places, and for the decimal and hexadecimal formats it is the minimum number of digits to display. Listing 3.2 shows how to display an array of numbers in hexadecimal format. Notice also how you can change the case of the hexadecimal digits A through F by using an uppercase or lowercase format specifier.

Table 3.1 Numeric Formatting Specifiers

Character   Format                            Description
C or c      Currency                          Culturally aware currency format.
D or d      Decimal                           Only supports integral numbers. Displays a string using decimal digits preceded by a minus sign if negative.
E or e      Exponential/scientific notation   Displays numbers in the form ±d.ddddddE±ddd where d is a decimal digit.
F or f      Fixed point                       Displays a series of decimal digits with a decimal point and additional digits.
G or g      General format                    Displays either as a fixed-point or scientific notation based on the size of the number.
N or n      Number format                     Similar to fixed point but uses a separator character (such as ,) for groups of digits.
P or p      Percentage                        Multiplies the number by 100 and displays with a percent symbol.
R or r      Roundtrip                         Formats a floating-point number so that it can be successfully converted back to its original value.
X or x      Hexadecimal                       Displays an integral number using the base-16 number system.


Listing 3.2 Specifying a Different Numeric Format by Adding Format Specifiers on a Parameter Placeholder

using System;

namespace _2_Formatting
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
double[] numArray = {2, 5, 4.5, 45.43, 200000};

// format in lowercase hex
Console.WriteLine( "\n\nHex (lower)\n-----------" );
foreach( double num in numArray )
{
Console.Write( "0x{0:x}\t", (int) num );
}

// format in uppercase hex
Console.WriteLine( "\n\nHex (upper)\n-----------" );
foreach( double num in numArray )
{
Console.Write( "0x{0:X}\t", (int) num );
}
}
}
}
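As a complement to Listing 3.2, the following sketch shows how a precision value after the format character changes the output. It is an illustration under stated assumptions: the NumericFormatSketch class and its Fmt helper are hypothetical, and the invariant culture is passed explicitly so that separators do not depend on the regional settings of the machine.

```csharp
using System;
using System.Globalization;

static class NumericFormatSketch
{
    // Formats a single value with the invariant culture.
    public static string Fmt( string format, object value )
    {
        return String.Format( CultureInfo.InvariantCulture, format, value );
    }

    static void Main()
    {
        Console.WriteLine( Fmt( "{0:X4}", 255 ) );          // 00FF
        Console.WriteLine( Fmt( "{0:x}", 255 ) );           // ff
        Console.WriteLine( Fmt( "{0:D5}", 42 ) );           // 00042
        Console.WriteLine( Fmt( "{0:F2}", 3.14159 ) );      // 3.14
        Console.WriteLine( Fmt( "{0:N2}", 1234567.891 ) );  // 1,234,567.89
    }
}
```

The currency and percentage formats are left out on purpose because their symbols and spacing vary with the culture in use.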

Another type of formatting is picture formatting. Picture formatting allows you to create a custom format specifier using various symbols within the format specifier string. Table 3.2 lists the available picture format characters. Listing 3.3 also shows how to create a custom format specifier. In that code, the digits of the input number are extracted and displayed using a combination of digit placeholders and a decimal-point specifier. Furthermore, you can see that you are free to add characters not listed in the table. This freedom allows you to add literal characters intermixed with the digits.

Table 3.2 Picture Formatting Specifiers

Character   Name                                 Description
0           Zero placeholder                     Copies a digit to the result string if a digit is at the position of the 0. If no digit is present, a 0 is displayed.
#           Digit placeholder                    Copies a digit to the result string if a digit appears at the position of the #. If no digit is present, nothing is displayed.
.           Decimal point                        Represents the location of the decimal point in the resultant string.
,           Group separator and number scaling   Inserts thousands separators if placed between two placeholders or scales a number down by 1,000 per , character when placed directly to the left of a decimal point.
%           Percent                              Multiplies a number by 100 and inserts a % symbol.
E±0, e±0    Exponential notation                 Displays the number in exponential notation using the number of 0s as a placeholder for the exponent value.
\           Escape character                     Causes the next character to be treated as a literal rather than as a formatting character. In a regular C# string literal, the backslash itself must be written as \\.
;           Section separator                    Separates the sections of the format string that apply different formatting rules to positive, negative, and zero numbers.


Listing 3.3 shows how custom formatting can separate a number by its decimal point. Using a foreach loop, each value is printed using three different formats. The first format will output the value's integer portion using the following format string:

0:$#,#

Next, the decimal portion is written. If the value does not explicitly define a decimal portion, zeroes are written instead. The format string to output the decimal value is

$.#0;

Finally, the entire value is displayed up to two decimal places using the following format string:

{0:$#,#.00}

Listing 3.3 Using Picture Format Specifiers to Create Special Formats

using System;

namespace _2_Formatting
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
double[] numArray = {2, 5, 4.5, 45.43, 200000};

// format as custom
Console.WriteLine( "\n\nCustom\n------" );
foreach( double num in numArray )
{
Console.WriteLine( "{0:$#,# + $.#0;} = {0:$#,#.00}", num );
}
}
}
}
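The individual picture characters from Table 3.2 can also be tried in isolation. The following is a small sketch rather than a replacement for Listing 3.3: the PictureFormatSketch class and Fmt helper are hypothetical names, and the invariant culture is an assumption made so the group and decimal separators are predictable.

```csharp
using System;
using System.Globalization;

static class PictureFormatSketch
{
    // Applies a picture (custom) format using the invariant culture.
    public static string Fmt( double value, string picture )
    {
        return value.ToString( picture, CultureInfo.InvariantCulture );
    }

    static void Main()
    {
        Console.WriteLine( Fmt( 5, "000" ) );            // 005
        Console.WriteLine( Fmt( 1234.5, "#,##0.00" ) );  // 1,234.50
        Console.WriteLine( Fmt( 0.5, "0.0%" ) );         // 50.0%

        // three sections: positive;negative;zero
        Console.WriteLine( Fmt( -5, "#;(#);zero" ) );    // (5)
        Console.WriteLine( Fmt( 0, "#;(#);zero" ) );     // zero
    }
}
```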

3.3. Accessing Individual String Characters

You want to process individual characters within a string.

Technique

Use the index operator ([]) by specifying the zero-based index of the character within the string that you want to extract. Furthermore, you can also use the foreach enumerator on the string using a char structure as the enumeration data type.

Comments

The string class is really a collection of objects. These objects are individual characters. You can access each character using the same methods you would use to access an object in most other collections (which are covered in the next chapter).

You use an indexer to specify which object in a collection you want to retrieve. In C#, the first object begins at the 0 index of the string. The objects are individual characters whose data type is System.Char, which is aliased with the char keyword. The indexer for the string class, however, can only access a character and cannot set the value of a character at that position. Because a string is immutable, you cannot change the internal array of characters unless you create and return a new string. If you need the ability to index a string to set individual characters, use a StringBuilder object.
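The difference between the two indexers can be sketched as follows. The IndexerSketch class and its ReplaceAt helper are hypothetical names chosen for this example only.

```csharp
using System;
using System.Text;

static class IndexerSketch
{
    // Returns a copy of the input with the character at index replaced.
    public static string ReplaceAt( string input, int index, char ch )
    {
        StringBuilder sb = new StringBuilder( input );
        sb[index] = ch;   // the StringBuilder indexer can be assigned to
        return sb.ToString();
    }

    static void Main()
    {
        string word = "cat";
        char first = word[0];   // reading through the string indexer is allowed
        // word[0] = 'b';       // would not compile: the string indexer is read-only

        Console.WriteLine( first );
        Console.WriteLine( ReplaceAt( word, 0, 'b' ) );
    }
}
```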

Listing 3.4 shows how to access the characters in a string. One thing to point out is that because the string also implements the IEnumerable interface, you can use the foreach control structure to enumerate through the string.

Listing 3.4 Accessing Characters Using Indexers and Enumeration

using System;
using System.Text;

namespace _3_Characters
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
string str = "abcdefghijklmnopqrstuvwxyz";

str = ReverseString( str );
Console.WriteLine( str );

str = ReverseStringEnum( str );
Console.WriteLine( str );
}

static string ReverseString( string strIn )
{
StringBuilder sb = new StringBuilder(strIn.Length);

for( int i = 0; i < strIn.Length; ++i )
{
sb.Append( strIn[(strIn.Length-1)-i] );
}
return sb.ToString();
}

static string ReverseStringEnum( string strIn )
{
StringBuilder sb = new StringBuilder( strIn.Length );
foreach( char ch in strIn )
{
sb.Insert( 0, ch );
}

return sb.ToString();
}
}
}

3.4. Analyzing Character Attributes

You want to evaluate the individual characters in a string to determine a character's attributes.

Technique

The System.Char structure contains several static functions that let you test individual characters. You can test whether a character is a digit, letter, or punctuation symbol or whether the character is lowercase or uppercase.

Comments

One of the hardest issues to handle when writing software is making sure users input valid data. You can use many different methods, such as restricting input to only digits, but ultimately, you always need an underlying validating test of the input data.

You can use the System.Char structure to perform a variety of text-validation procedures. Listing 3.5 demonstrates validating user input as well as inspecting the characteristics of a character. It begins by displaying a menu and then waiting for user input using the Console.ReadLine method. Once a user enters a command, you make a check using the method ValidateMainMenuInput. This method rejects the input if its first character is a digit, a punctuation mark, or a symbol, or if it is not one of the menu commands. If the validation passes and the user chose to enter a new phrase, the phrase is passed to a method that inspects each character in it. This method simply enumerates through all the characters in the input string and prints descriptive messages based on their characteristics. Not all of the System.Char inspection methods are used in Listing 3.5; Table 3.3 lists the full set and describes what each one tests. The results of running the application in Listing 3.5 appear in Figure 3.1.

Listing 3.5 Using the Static Methods in System.Char to Inspect the Details of a Single Character

using System;

namespace _4_CharAttributes
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
char cmd = 'x';

string input;
do
{
DisplayMainMenu();
input = Console.ReadLine();

if( (input == "" ) ||
ValidateMainMenuInput( Char.ToUpper(input[0]) ) == 0 )
{
Console.WriteLine( "Invalid command!" );
}
else
{
cmd = Char.ToUpper(input[0]);

switch( cmd )
{
case 'Q':
{
break;
}

case 'N':
{
Console.Write( "Enter a phrase to inspect: " );
input = Console.ReadLine();
InspectPhrase( input );
break;
}
}
}
} while ( cmd != 'Q' );
}

private static void InspectPhrase( string input )
{
foreach( char ch in input )
{
Console.Write( ch + " - ");

if( Char.IsDigit(ch) )
Console.Write( "IsDigit " );
if( Char.IsLetter(ch) )
{
Console.Write( "IsLetter " );
Console.Write( "(lowercase={0}, uppercase={1})",
Char.ToLower(ch), Char.ToUpper(ch));
}
if( Char.IsPunctuation(ch) )
Console.Write( "IsPunctuation " );
if( Char.IsWhiteSpace(ch) )
Console.Write( "IsWhitespace" );

Console.Write("\n");

}
}
private static int ValidateMainMenuInput( char input )
{
// a simple check to see if input == 'N' or 'Q' is good enough
// the following is for illustrative purposes
if( Char.IsDigit( input ) == true )
return 0;
else if ( Char.IsPunctuation( input ) )
return 0;
else if( Char.IsSymbol( input ))
return 0;
else if( input != 'N' && input != 'Q' )
return 0;

return (int) input;
}

private static void DisplayMainMenu()
{
Console.WriteLine( "\nPhrase Inspector\n----------------" );
Console.WriteLine( "N)ew Phrase" );
Console.WriteLine( "Q)uit\n" );
Console.Write( ">> " );
}
}
}

Table 3.3 System.Char Inspection Methods

Name              Description
IsControl         Denotes a control character such as a tab or carriage return.
IsDigit           Indicates a single decimal digit.
IsLetter          Used for alphabetic characters.
IsLetterOrDigit   Returns true if the character is a letter or a digit.
IsLower           Used to determine whether a character is lowercase.
IsNumber          Tests whether a character is categorized as a number, which includes decimal digits as well as other numeric characters such as fractions.
IsPunctuation     Denotes whether a character is a punctuation symbol.
IsSeparator       Denotes a character used to separate strings. An example is the space character.
IsSurrogate       Tests whether a character is one half of a Unicode surrogate pair, in which two 16-bit values together represent a single character outside the 16-bit range.
IsSymbol          Used for symbolic characters such as $ or +.
IsUpper           Used to determine whether a character is uppercase.
IsWhiteSpace      Indicates a character classified as whitespace such as a space character, tab, or carriage return.


Figure 3.1
Use the static method in the System.Char class to inspect character attributes.

The System.Char structure is designed to work with a single Unicode character. Because a char is a 2-byte (16-bit) value, the range of a character is from 0 to 0xFFFF. For portability reasons in future systems, you can always check the upper bound of a char by using the MaxValue constant declared in the System.Char structure. One thing to keep in mind when working with characters is to avoid the confusion of mixing char types with integer types. Characters have an ordinal value, which is an integer value used as a lookup into a table of symbols. One example of a table is the ASCII table, which contains 128 characters and includes the digits 0 through 9, letters, punctuation symbols, and control characters. The confusion lies in the fact that the character 6, for instance, has an ordinal char value of 0x36. Therefore, the line of code meant to initialize a character to the number 6

char ch = (char) 6;

is wrong because the actual character in this instance is ^F, the ACK control character used in modem handshaking protocols. Displaying this value in the console would not provide the 6 that you were looking for. You could have chosen two different methods to initialize the variable. The first way is

char ch = (char) 0x36;

which produces the desired result and prints the number 6 to the console if passed to the Console.Write method. However, unless you have the ASCII table memorized, this procedure can be cumbersome. To initialize a char variable, simply place the value between single quotes:

char ch = '6';
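The relationship between a digit character and its ordinal value can be checked with a few lines of code. This is only a sketch: the CharValueSketch class and the DigitValue helper are hypothetical names, while Char.GetNumericValue and Char.MaxValue are members of System.Char.

```csharp
using System;

static class CharValueSketch
{
    // Converts a decimal digit character to its integer value.
    public static int DigitValue( char digit )
    {
        return digit - '0';
    }

    static void Main()
    {
        char six = '6';

        Console.WriteLine( (int) six );                    // 54, which is 0x36
        Console.WriteLine( (char) 0x36 );                  // 6
        Console.WriteLine( DigitValue( six ) );            // 6
        Console.WriteLine( Char.GetNumericValue( six ) );  // 6
        Console.WriteLine( (int) Char.MaxValue );          // 65535
    }
}
```

Subtracting '0' works because the digits 0 through 9 occupy consecutive ordinal values.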

3.5. Case-Insensitive String Comparison

You want to perform case-insensitive string comparison on two strings.

Technique

Use the overloaded Compare method in the System.String class which accepts a Boolean value, ignoreCase, as the last parameter. This parameter specifies whether the comparison should be case insensitive (true) or case sensitive (false). To compare single characters, convert them to uppercase or lowercase, using ToUpper or ToLower, and then perform the comparison.

Comments

Validating user input requires a lot of forethought into the possible values a user can enter. Making sure you cover the range of possible values can be a daunting task, and you might ultimately run into human-computer interaction issues by severely limiting what a user can enter. Case-sensitivity issues increase the possible range of values, leading to greater security with respect to such things as passwords, but this security is usually at the expense of a user's frustration when she forgets whether a character is capitalized. As with many other programming problems, you must weigh the pros and cons.

To perform a case-insensitive comparison, you can use one of the many overloaded Compare methods within the System.String class. The methods that allow you to ignore case issues use a Boolean value as the last parameter in the method. This parameter is named ignoreCase, and when you set it to true, you make a case-insensitive comparison, as demonstrated in Listing 3.6.

Listing 3.6 Performing a Case-Insensitive String Comparison

using System;

namespace _5_CaseComparison
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
string str1 = "This Is A String.";
string str2 = "This is a string.";

Console.WriteLine( "Case sensitive comparison of" +
" str1 and str2 = {0}", String.Compare( str1, str2 ));

Console.WriteLine( "Case insensitive comparison of" +
" str1 and str2 = {0}", String.Compare( str1, str2, true ));
}
}
}
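When only equality matters, a case-insensitive test can also be written without inspecting an integer result. The sketch below assumes a framework version that includes the StringComparison enumeration (introduced in .NET Framework 2.0, so it may be newer than the version this chapter targets). The IgnoreCaseSketch class and its SameIgnoringCase helpers are hypothetical names.

```csharp
using System;

static class IgnoreCaseSketch
{
    // Case-insensitive equality for strings, independent of the current culture.
    public static bool SameIgnoringCase( string a, string b )
    {
        return String.Equals( a, b, StringComparison.OrdinalIgnoreCase );
    }

    // Case-insensitive equality for single characters.
    public static bool SameIgnoringCase( char a, char b )
    {
        return Char.ToUpperInvariant( a ) == Char.ToUpperInvariant( b );
    }

    static void Main()
    {
        string str1 = "This Is A String.";
        string str2 = "This is a string.";

        Console.WriteLine( SameIgnoringCase( str1, str2 ) );  // True
        Console.WriteLine( SameIgnoringCase( 'q', 'Q' ) );    // True
        Console.WriteLine( str1 == str2 );                    // False
    }
}
```

The Compare overload that takes ignoreCase uses the rules of the current culture, so its result can differ between cultures for certain letters, such as the dotted and dotless i in Turkish. The ordinal comparison used in this sketch avoids that dependency.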

3.6. Working with Substrings

You need to change or extract a specific portion of a string.

Technique

To copy a portion of a string into a new string, use the Substring method within the System.String class. You call this method using the string object instance of the source string:

string source = "ABCD1234WXYZ";
string dest = source.Substring( 4, 4 );
Console.WriteLine( "{0}\n", dest );

To copy a substring into an already existing character array, use the CopyTo method. To assign a character array to an existing string object, create a new instance of the string using the new keyword, passing the character array as a parameter to the string constructor as shown in the following code, whose output appears in Figure 3.2:

string source = "ABCD";
char [] dest = { '1', '2', '3', '4', '5', '6', '7', '8' };

Console.Write( "Char array before = " );
Console.WriteLine( dest );

// copy substring into char array
source.CopyTo( 0, dest, 4, source.Length );

Console.Write( "Char array after = " );
Console.WriteLine( dest );

// copy back into source string
source = new String( dest );

Console.WriteLine( "New source = {0}\n", source );

Figure 3.2
Use the CopyTo method to copy a substring into an existing character array.

If you need to remove a substring within a string and replace it with a different substring, use the Replace method. This method accepts two parameters, the substring to replace and the string to replace it with:

string replaceStr = "1234";
string dest = "ABCDEFGHWXYZ";

dest = dest.Replace( "EFGH", replaceStr );

Console.WriteLine( dest );

To extract an array of substrings that are separated from each other by one or more delimiters, use the Split method. This method uses a character array of delimiter characters and returns a string array of each substring within the original string as shown in the following code, whose output appears in Figure 3.3. You can optionally supply an integer specifying the maximum number of substrings to split:

char delim = '\\';
string filePath = "C:\\Windows\\Temp";
string [] directories = null;

directories = filePath.Split( delim );

foreach (string directory in directories)
{
Console.WriteLine("{0}", directory);
}

Figure 3.3
You can use the Split method in the System.String class to place delimited substrings into a string array.

Comments

Parsing strings is not for the faint of heart. However, the job becomes easier if you have a rich set of methods that allow you to perform all types of operations on strings. Substrings are the goal of a majority of these operations, and the string class within the .NET Framework contains many methods that are designed to extract or change just a portion of a string.

The Substring method extracts a portion of a string and places it into a new string object. You have two options with this method. If you pass a single integer, the Substring method extracts the substring that starts at that index and continues until it reaches the end of the string. One thing to keep in mind is that C# indices are 0 based, so the first character within the string has an index of 0. The second Substring overload accepts an additional parameter that denotes the number of characters to copy, not an ending index. It lets you extract a part from the middle of the string.
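The two Substring overloads can be sketched as follows. Because the second argument is a length, a small helper is included for callers who think in terms of start and end positions. The SubstringSketch class and the Between helper are hypothetical names.

```csharp
using System;

static class SubstringSketch
{
    // Returns the characters from start (inclusive) to end (exclusive).
    public static string Between( string source, int start, int end )
    {
        return source.Substring( start, end - start );
    }

    static void Main()
    {
        string source = "ABCD1234WXYZ";

        Console.WriteLine( source.Substring( 8 ) );     // WXYZ
        Console.WriteLine( source.Substring( 4, 4 ) );  // 1234
        Console.WriteLine( Between( source, 4, 8 ) );   // 1234
    }
}
```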

You can create a new character array from a string by using the ToCharArray method of the string class. Furthermore, you can extract a substring from the string and place it into a character array by using the CopyTo method. The difference between these two methods is that the character array used with the CopyTo method must be an already instantiated array. Whereas ToCharArray returns a new character array, the CopyTo method expects an existing character array as a parameter to the method. Furthermore, although methods exist to extract character arrays from a string, there is no instance method available to assign a character array to a string. To do this, you must create a new string object using the new keyword, as opposed to assigning a string literal, and pass the character array as a parameter to the string constructor.

Using the Replace method is a powerful way to alter the contents of a string. This method allows you to find all instances of a specified substring within a string and replace them with a different substring. Additionally, the substring you want to replace does not have to be the same length as the string you are replacing it with. If you recall the number of times you have performed a search and replace in any application, you can see the possible advantages of this method.

One other powerful method is Split. By passing a character array consisting of delimiter characters, you can split a string into a group of substrings and place them into a string array. By passing an additional integer parameter, you can also control how many substrings to extract from the source string. Referring to the code example earlier demonstrating the Split method, you can split a string representing a directory path into individual directory names by passing the \ character as the delimiter. You are not, however, confined to using a single delimiter. If you pass a character array consisting of several delimiters, the Split method extracts substrings based on any of the delimiters that it encounters.
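Splitting on several delimiters and limiting the number of substrings can be sketched like this. The SplitSketch class, the SplitFields helper, and the sample input are hypothetical and exist only for illustration.

```csharp
using System;

static class SplitSketch
{
    // Splits on commas and semicolons, returning at most maxFields substrings.
    public static string[] SplitFields( string input, int maxFields )
    {
        char[] delimiters = { ',', ';' };
        return input.Split( delimiters, maxFields );
    }

    static void Main()
    {
        string input = "a,b;c,d";

        // any of the delimiters ends a substring: a, b, c, d
        foreach( string field in input.Split( new char[] { ',', ';' } ) )
        {
            Console.WriteLine( field );
        }

        // with a limit of 2, the last element holds the unsplit remainder
        string[] limited = SplitFields( input, 2 );
        Console.WriteLine( limited[0] );  // a
        Console.WriteLine( limited[1] );  // b;c,d
    }
}
```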

3.7. Using Verbatim String Syntax

You want to represent a path to a file using a string without using escape characters for path separators.

Technique

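When assigning a literal string to a string object, preface the string with the @ symbol. It turns off escape-character processing, so there is no need to escape path separators. The only character that still needs special treatment is the double quote, which you write as two consecutive double quotes: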

string nonVerbatim = "C:\\Windows\\Temp";
string verbatim = @"C:\Windows\Temp";

Comments

A compiler error that occurs frequently comes from forgetting to escape path separators. Although including hard-coded path strings is a common programming faux pas, you can overlook that rule when testing an application. Visual C# .NET includes verbatim string syntax as a feature to alleviate the frustration of having to escape all the path separators within a file path string, which can be especially cumbersome for long paths.
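A quick sketch confirms that the escaped and verbatim forms produce the same string value. The VerbatimSketch class and its field names are hypothetical.

```csharp
using System;

static class VerbatimSketch
{
    public static readonly string Escaped = "C:\\Windows\\Temp";
    public static readonly string Verbatim = @"C:\Windows\Temp";

    // inside a verbatim string, a double quote is written twice
    public static readonly string Quoted = @"He said ""hi""";

    static void Main()
    {
        Console.WriteLine( Escaped == Verbatim );  // True
        Console.WriteLine( Verbatim.Length );      // 15
        Console.WriteLine( Quoted );               // He said "hi"
    }
}
```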

3.8. Choosing Between Constant and Mutable Strings

You want to choose the correct string data type to best fit your current application design.

Technique

If you know a string's value will not change often, use a string object, which is a constant value. If you need a mutable string, one that can change its value without having to allocate a new object, use a StringBuilder.

Comments

Using a regular string object is best when you know the string will not change or will only change slightly. This change includes the whole gamut of string operations that change the value of the object itself, such as concatenation, insertion, replacement, or removal of characters. The Common Language Runtime (CLR) can use certain properties of strings to optimize performance. If the CLR can determine that two string objects are the same, it can share the memory that these string objects occupy. These strings are then known as interned strings. The CLR contains an intern pool, which is a lookup table of string instances. String literals within code are automatically interned. However, you can also manually place a string within the intern pool by using the Intern method. To test whether a string is interned, use the IsInterned method, as shown in Listing 3.7.

Listing 3.7 Interning a String by Using the Intern Method

using System;

namespace _7_StringBuilder
{
/// <summary>
/// Summary description for Class1.
/// </summary>
class Class1
{
/// <summary>
/// The main entry point for the application.
/// </summary>
[STAThread]
static void Main(string[] args)
{
string sLiteral = "Automatically Interned";
string sNotInterned = "Not " + sLiteral;

TestInterned( sLiteral );
TestInterned( sNotInterned );

String.Intern( sNotInterned );
TestInterned( sNotInterned );
}

static void TestInterned( string str )
{
if( String.IsInterned( str ) != null )
{
Console.WriteLine( "The string \"{0}\" is interned.", str );
}
else
{
Console.WriteLine( "The string \"{0}\" is not interned.", str );
}
}
}
}
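The sharing that interning provides shows up when you compare references rather than values. The following sketch builds two equal strings at run time, so they start out as separate objects, and then interns both. The InternSketch class and its helpers are hypothetical names. Note that String.Intern returns the pooled reference, so the return value is the one to keep.

```csharp
using System;
using System.Text;

static class InternSketch
{
    // Builds a string at run time so that it is not a compile-time literal.
    public static string BuildAtRunTime( string a, string b )
    {
        return new StringBuilder( a ).Append( b ).ToString();
    }

    // Returns true when interning both strings yields the same object.
    public static bool SharesInternedInstance( string a, string b )
    {
        return Object.ReferenceEquals( String.Intern( a ), String.Intern( b ) );
    }

    static void Main()
    {
        string first = BuildAtRunTime( "Not ", "Interned Yet" );
        string second = BuildAtRunTime( "Not ", "Interned Yet" );

        Console.WriteLine( first == second );                           // True: same value
        Console.WriteLine( Object.ReferenceEquals( first, second ) );   // False: two objects
        Console.WriteLine( SharesInternedInstance( first, second ) );   // True: one pooled object
    }
}
```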

A StringBuilder behaves similarly to a regular string object and also contains similar method calls. However, there are no static methods because the StringBuilder class is designed to work on string instances. Method calls on an instance of a StringBuilder object change the internal string of that object, as shown in Listing 3.8. A StringBuilder maintains its mutable appearance by creating a buffer that is large enough to contain a string value plus additional memory should the string need to grow.

Listing 3.8 Manipulating an Internal String Buffer Instead of Returning New String Objects

using System;
using System.Text;

namespace _7_StringBuilder
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            string string1 = "";
            String string2 = "";

            Console.Write( "Enter string 1: " );
            string1 = Console.ReadLine();
            Console.Write( "Enter string 2: " );
            string2 = Console.ReadLine();

            BuildStrings( string1, string2 );
        }

        public static void BuildStrings( string str1, string str2 )
        {
            StringBuilder sb = new StringBuilder( str1 + str2 );
            sb.Insert( str1.Length, " is the first string.\n" );
            sb.Insert( sb.Length, " is the second string.\n" );

            Console.WriteLine( sb );
        }
    }
}

3.9. Optimizing StringBuilder Performance

Knowing that a StringBuilder object can suffer more of a performance hit than a regular string object, you want to optimize the StringBuilder object to minimize performance issues.

Technique

Use the EnsureCapacity method in the StringBuilder class. Pass it an integer that signifies the length of the longest string you may store in this buffer.

Comments

The StringBuilder class contains methods that allow you to expand the memory of the internal buffer based on the size of the string you may store, so that as your string continually grows, the StringBuilder won't have to repeatedly allocate new memory for the internal buffer. In other words, if you attempt to place a string that is longer than the internal buffer of the StringBuilder class can accept, then the class will have to allocate additional memory to accept the new data. If you continuously add strings that increase in size from the last input string, the StringBuilder class will have to allocate a new buffer each time, which it does internally by calling the GetStringForStringBuilder method defined in the System.String class. This method ultimately calls the unmanaged method FastAllocateString. By giving the StringBuilder class a hint using the EnsureCapacity method, you can help alleviate some of this continual memory reallocation, thereby optimizing the StringBuilder performance by reducing the number of memory allocations needed to store a string value.
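
As a minimal sketch of this technique, the following program reserves space before appending a known number of characters. The class name CapacityDemo, the method Build, and the figure of 10,000 characters are assumptions made for illustration; only EnsureCapacity, Append, Length, and Capacity come from the StringBuilder class itself.

```csharp
using System;
using System.Text;

public static class CapacityDemo
{
    // Hypothetical helper: appends `count` digit characters, optionally
    // reserving the space up front, and returns the builder for inspection.
    public static StringBuilder Build( int count, bool reserve )
    {
        StringBuilder sb = new StringBuilder();

        if( reserve )
        {
            // Hint that roughly `count` characters will be stored.
            sb.EnsureCapacity( count );
        }

        for( int i = 0; i < count; i++ )
        {
            sb.Append( (char)('0' + i % 10) );
        }

        return sb;
    }

    public static void Main()
    {
        StringBuilder reserved = Build( 10000, true );

        // Capacity is at least the value requested from EnsureCapacity.
        Console.WriteLine( "Length = {0}, Capacity = {1}",
            reserved.Length, reserved.Capacity );
    }
}
```

If you know the expected size when you create the object, the StringBuilder constructor that takes an integer capacity achieves the same effect without a separate call.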

3.10. Understanding Basic Regular Expression Syntax

You want to create a regular expression.

Technique

Regular expressions consist of a series of characters and quantifiers on those characters. The characters themselves can be literal or can be denoted by using character classes, such as \d, which denotes a digit character class, or \S, which denotes any nonwhitespace character.

Table 3.4 Regular Expression Single Character Classes

Class    Description
\d       Any digit
\D       Any nondigit
\w       Any word character
\W       Any nonword character
\s       Any whitespace character
\S       Any nonwhitespace character


In addition to the single character classes, you can also specify a range or set of characters using ranged and set character classes. This ability allows you to narrow the search for a specified character by limiting characters within a specified range or within a defined set.

Table 3.5 Ranged and Set Character Classes

Format       Description
.            Any character except newline.
\p{uc}       Any character within the Unicode character category uc.
[abcxyz]     Any literal character in the set.
\P{uc}       Any character not within the Unicode character category uc.
[^abcxyz]    Any character not in the set of literal characters.


Quantifiers work on character classes to expand the number of characters the character classes should match. You can specify, for instance, a wildcard character on a character class, which means 0 or more characters within that class. Additionally, you can specify a set number of matches of a class that should occur by using an integer within braces following the character class designation.

Table 3.6 Character Class Quantifiers

Format    Description
*         0 or more characters
+         1 or more characters
?         0 or 1 characters
{n}       Exactly n characters
{n,}      At least n characters
{n,m}     At least n but no more than m characters


You can also specify where a certain regular expression should start within a string. Positional assertions allow you to, for instance, match a certain expression as long as it occurs at the beginning or ending of a string. Furthermore, you can create a regular expression that operates on a set of words within a string by using a positional assertion that continues matching on each subsequent word separated by any nonalphanumeric character.

Table 3.7 Positional (Atomic Zero-Width) Assertions

Format    Description
^         Beginning of the string, or beginning of a line in multiline mode
\z        End of the string only (a trailing newline is not skipped)
$         End of the string or before a final newline; in multiline mode, end of a line
\G        Continues where the last match left off
\A        Beginning of the string
\b        At a word boundary (between a word character and a nonword character)
\Z        End of the string or before a final newline
\B        Not at a word boundary
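
The following sketch exercises a few of these assertions with the static Regex.Matches and Regex.IsMatch methods. The class name AssertionDemo, its two helper methods, and the sample strings are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AssertionDemo
{
    // Counts occurrences of "cat" that stand alone as a word by requiring
    // a word boundary (\b) on both sides.
    public static int CountWholeWord( string input )
    {
        return Regex.Matches( input, @"\bcat\b" ).Count;
    }

    // True only when every character of the input is a digit.
    // \A and \z pin the match to the very start and very end of the string.
    public static bool IsAllDigits( string input )
    {
        return Regex.IsMatch( input, @"\A\d+\z" );
    }

    public static void Main()
    {
        // "concatenate" contains "cat" but not at a word boundary.
        Console.WriteLine( CountWholeWord( "cat concatenate cat." ) );   // 2

        Console.WriteLine( IsAllDigits( "12345" ) );                     // True
        Console.WriteLine( IsAllDigits( "12345\n" ) );                   // False

        // $ tolerates a single trailing newline, unlike \z.
        Console.WriteLine( Regex.IsMatch( "12345\n", @"^\d+$" ) );       // True
    }
}
```

The last two lines show why the choice between $ and \z matters when input may carry a trailing newline.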


Comments

Regular expressions use a variety of characters both symbolic and literal to designate how a particular string of text should be parsed. The act of parsing a string is known as matching, and when applied to a regular expression, the match will be either true or false. In other words, when you use a regular expression to match a series of characters, the match will either succeed or fail. As you can see, this process has powerful applicability in the area of input validation.

You build regular expressions using a series of character classes and quantifiers on those character classes as well as a few miscellaneous regular-expression constructs. You use character classes to match a single character based either on what type of character it is, such as a digit or letter, or whether it belongs within a specified range or set of characters (as shown in Table 3.4). Using this information, you can create a series of character classes to match a certain string of text. For instance, if you want to specify a phone number using character classes, you can use the following regular expression:

\(\d\d\d\)\s\d\d\d-\d\d\d\d

This expression begins by first escaping the left parenthesis. You must escape it because parentheses are used for grouping expressions. Next you can see three digits representing a phone number's area code followed by the closing parenthesis. You use a \s to denote a whitespace character. The remainder of the regular expression contains the remaining digits of the phone number.

In addition to the single character classes, you can also use ranged and set character classes. They give you fine-grain control on exactly the type of characters the regular expression should match. For instance, if you want to match any character as long as it is a vowel, use the following expression:

[aeiou]

This line means that a character should match one of the literal characters within that set of characters. An even more specialized form of single character classes are Unicode character categories. Unicode categories are similar to some of the character-attribute inspection methods shown earlier in this chapter. For instance, you can use Unicode categories to match on uppercase or lowercase characters. Other categories include punctuation characters, currency symbols, and math symbols, to name a few. You can easily find the full list of Unicode categories in MSDN under the topic "Unicode Categories Enumeration."
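
As a brief sketch of set classes and Unicode categories, the following code counts uppercase letters with \p{Lu} and strips vowels with the [aeiou] set. The class name CategoryDemo and its method names are assumptions for illustration; Lu is the Unicode category for uppercase letters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CategoryDemo
{
    // Counts characters in the Unicode category Lu (uppercase letter).
    public static int CountUppercase( string input )
    {
        return Regex.Matches( input, @"\p{Lu}" ).Count;
    }

    // Removes every lowercase vowel by matching the set class [aeiou].
    public static string StripVowels( string input )
    {
        return Regex.Replace( input, "[aeiou]", "" );
    }

    public static void Main()
    {
        Console.WriteLine( CountUppercase( "Hello World" ) );        // 2
        Console.WriteLine( StripVowels( "regular expression" ) );    // rglr xprssn
    }
}
```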

You can optimize the phone-number expression, although it's completely valid, by using quantifiers. Quantifiers specify additional information about the character, character class, or expression to which they apply. Some quantifiers include wildcards such as *, which means 0 or more occurrences, and ?, which means only 0 or 1 occurrences of a pattern. You can also use braces containing an integer to specify how many characters within a given character class to match. Using this quantifier in the phone-number expression, you can specify that the phone number should contain three digits for the area code followed by three digits and four digits separated by a dash:

\(\d{3}\)\s\d{3}-\d{4}

Although the regular expression itself isn't that complicated, you can still see that using quantifiers can simplify regular-expression creation. In addition to character classes and quantifiers, you can also use positional information within a regular expression. For instance, you can specify that given an input string, the regular expression should operate at the beginning of the string. You express it using the ^ character. Likewise, you can also denote the end of a string using the $ symbol. Take note that this doesn't mean start at the end of the string and attempt to make a match, which would be counterintuitive because no characters exist at the end of the string. Rather, by placing the $ character following the rest of the regular expression, it means to match the string with the regular expression as long as the match occurs at the end of the string. For instance, if you want to match a sentence in which a phone number is the last portion of the sentence, you could use the following:

\(\d{3}\)\s\d{3}-\d{4}$
My phone number is (555) 555-5555 = Match
(555) 555-5555 is my phone number = Not a match
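
The two sample sentences above can be checked with a few lines of code. This is only a sketch: the class name PhoneAnchorDemo and the method EndsWithPhone are assumptions for illustration, and the pattern is the one shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneAnchorDemo
{
    // The phone-number pattern from the text, anchored to the end of the input.
    private const string PhoneAtEnd = @"\(\d{3}\)\s\d{3}-\d{4}$";

    // True when the input ends with a phone number in the form (ddd) ddd-dddd.
    public static bool EndsWithPhone( string sentence )
    {
        return Regex.IsMatch( sentence, PhoneAtEnd );
    }

    public static void Main()
    {
        Console.WriteLine( EndsWithPhone( "My phone number is (555) 555-5555" ) );   // True
        Console.WriteLine( EndsWithPhone( "(555) 555-5555 is my phone number" ) );   // False
    }
}
```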

3.11. Validating User Input with Regular Expressions

You want to ensure valid user input by using regular expressions to test for validity.

Technique

Create a Regex object, which exists within the System.Text.RegularExpressions namespace, passing the regular expression in as a parameter to the constructor. Next, call the member method Match using the string you want to validate as a parameter to the method. The method returns a Match object regardless of the outcome. To test whether a match is made, evaluate the Boolean Success property on that Match object, as demonstrated in Listing 3.9. It should also be noted that the backslash (\) character appears frequently in regular expressions. To avoid compilation errors from inadvertently specifying an invalid escape sequence, prefix the string literal with the @ symbol to turn off escape processing.

Listing 3.9 Validating User Input of a Phone Number Using a Regular Expression

using System;
using System.Text.RegularExpressions;

namespace _11_RegularExpressions
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            Regex phoneExp = new Regex( @"^\(\d{3}\)\s\d{3}-\d{4}$" );
            string input;

            Console.Write( "Enter a phone number: " );
            input = Console.ReadLine();

            while( phoneExp.Match( input ).Success == false )
            {
                Console.WriteLine( "Invalid input. Try again." );
                Console.Write( "Enter a phone number: " );
                input = Console.ReadLine();
            }

            Console.WriteLine( "Validated!" );
        }
    }
}

Comments

Earlier in this chapter I mentioned that you could perform data validation using the static methods within the System.Char class. You can inspect each character within the input string to ensure it matches exactly what you are looking for. However, this method of input validation can be extremely cumbersome if you have different input types to validate because it requires custom code for each validation. In other words, using the methods in the System.Char class is not recommended for anything but the simplest of data-validation procedures.

Regular expressions, on the other hand, allow you to perform the most advanced input validation possible, all within a single expression. You are in effect passing the parsing of the input string to the regular-expression engine and offloading all the work that you would normally do.

In Listing 3.9, you can see how you create and use a regular expression to test the validity of a phone number entered by a user. The regular expression is similar to the previous expressions used earlier for phone numbers except for the addition of positional markers. The regular expression is valid if a user enters a phone number and nothing else. A match is successful when the Success property within the Match object, which is returned from the Regex.Match method, is true. The only caveat to using regular expressions for input validation is that even though you know the validation failed, you are unable to query the Regex or Match class to see what part of the string failed.
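
To make the comparison between the two approaches concrete, here is a rough sketch that validates the same phone format twice, once by hand with System.Char and once with the expression from Listing 3.9. The class name ValidationComparison and both method names are assumptions for illustration, and the character positions in the hand-written version are derived from the (ddd) ddd-dddd layout.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ValidationComparison
{
    // Hand-written check for the layout (ddd) ddd-dddd using System.Char.
    // Every position must be handled explicitly.
    public static bool IsPhoneByChar( string s )
    {
        if( s == null || s.Length != 14 )
        {
            return false;
        }

        for( int i = 0; i < s.Length; i++ )
        {
            char c = s[i];
            switch( i )
            {
                case 0: if( c != '(' ) return false; break;
                case 4: if( c != ')' ) return false; break;
                case 5: if( !Char.IsWhiteSpace( c ) ) return false; break;
                case 9: if( c != '-' ) return false; break;
                default: if( !Char.IsDigit( c ) ) return false; break;
            }
        }

        return true;
    }

    // The same rule expressed as a single regular expression.
    public static bool IsPhoneByRegex( string s )
    {
        return s != null && Regex.IsMatch( s, @"^\(\d{3}\)\s\d{3}-\d{4}$" );
    }

    public static void Main()
    {
        string[] samples = { "(555) 555-5555", "555-555-5555", "(555) 555-555" };

        foreach( string sample in samples )
        {
            Console.WriteLine( "{0}: char = {1}, regex = {2}", sample,
                IsPhoneByChar( sample ), IsPhoneByRegex( sample ) );
        }
    }
}
```

Supporting a second input format would mean writing another method like IsPhoneByChar, whereas the regular-expression version only needs a different pattern string.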

3.12. Replacing Substrings Using Regular Expressions

You want to replace all substrings that match a regular expression with a different substring that also uses regular-expression syntax.

Technique

Create a Regex object, passing the regular expression used to match characters in the input string to the Regex constructor. Next, call the Regex method Replace, passing the input string to process and the string to replace each match within the input string, as shown in the last statement of Listing 3.10. You can also use the static Replace method, which takes the input string first, the regular expression second, and the replacement string third.

Listing 3.10 Using Regular Expressions to Replace Numbers in a Credit Card with xs

using System;
using System.Text.RegularExpressions;

namespace _12_RegExpReplace
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            Regex cardExp = new Regex( @"(\d{4})-(\d{4})-(\d{4})-(\d{4})" );
            string safeOutputExp = "$1-xxxx-xxxx-$4";
            string cardNum;

            Console.Write( "Please enter your credit card number: " );
            cardNum = Console.ReadLine();

            while( cardExp.Match( cardNum ).Success == false )
            {
                Console.WriteLine( "Invalid card number. Try again." );
                Console.Write( "Please enter your credit card number: " );

                cardNum = Console.ReadLine();
            }

            Console.WriteLine( "Secure Output Result = {0}",
                cardExp.Replace( cardNum, safeOutputExp ));
        }
    }
}

Comments

Although input validation is an extremely useful feature of regular expressions, they also work well as text parsers. The previous recipe used regular expressions to verify that a particular string matched a regular expression exactly. However, you can also use regular expressions to match substrings within a string and return each of those substrings as a group. Furthermore, you can use a separate regular expression that acts on the result of the regular-expression evaluation to replace substrings within the original input string.

Listing 3.10 creates a regular expression that matches the format for a credit card. In that regular expression, you can see that it will match on four different groups of four digits apiece separated by a dash. However, you might also notice that each one of these groups is surrounded with parentheses. In an earlier recipe, I mentioned that to use a literal parenthesis, you must escape it using a backslash because of the conflict with regular-expression grouping symbols. In this case, you want to use the grouping feature of regular expressions. When you place a portion of a regular expression within parentheses, you are creating a numbered group. Groups are numbered starting with 1 and are incremented for each subsequent group. In this case, there are four numbered groups. These groups are used by the replacement string, which is contained in the string safeOutputExp. To reference a numbered group, use the $ symbol followed by the number of the group to reference. This sequence represents all characters within the input string that match the group expression within the regular expression. Therefore, in the replacement string, you can see that it prints the characters within the first group, replaces the characters in the second and third groups with xs, and finally prints the characters in the fourth group.
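
The same replacement can be sketched with the static Regex.Replace method, which avoids creating a Regex object. The class name MaskDemo and the method MaskCard are assumptions for illustration; the pattern and replacement string are the ones used in Listing 3.10.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MaskDemo
{
    // Keeps groups 1 and 4 and replaces groups 2 and 3 with xs.
    // Static overload: input first, then the pattern, then the replacement.
    public static string MaskCard( string cardNum )
    {
        return Regex.Replace( cardNum,
            @"(\d{4})-(\d{4})-(\d{4})-(\d{4})",
            "$1-xxxx-xxxx-$4" );
    }

    public static void Main()
    {
        Console.WriteLine( MaskCard( "1234-5678-9012-3456" ) );   // 1234-xxxx-xxxx-3456

        // Input without a match is returned unchanged.
        Console.WriteLine( MaskCard( "no card here" ) );          // no card here
    }
}
```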

One thing to note is that you can also use the Regex class to view each four-digit field individually. If you change the regular expression to "\d{4}", you can then use the Matches method to enumerate all the matches using the foreach keyword, as shown in Listing 3.11. In the listing, the program first checks to make sure at least four matches were made. This number corresponds to four groups of four digits. Next, it uses a foreach enumeration on each Match object that is returned from the Matches method. If the match is the second or third field, which the listing identifies by the match's starting index, the value is replaced with xs; otherwise, the Match object's value, the characters within that field, is concatenated to the result string.

Listing 3.11 Enumerating Through the Match Collection to Perform Special Operations on Each Match in a Regular Expression

static void TestManualGrouping()
{
    Regex cardExp = new Regex( @"\d{4}" );
    string cardNum;
    string safeOutputExp = "";

    Console.Write( "Please enter your credit card number: " );
    cardNum = Console.ReadLine();

    if( cardExp.Matches( cardNum ).Count < 4 )
    {
        Console.WriteLine( "Invalid card number" );
        return;
    }

    foreach( Match field in cardExp.Matches( cardNum ))
    {
        if( field.Success == false )
        {
            Console.WriteLine( "Invalid card number" );
            return;
        }

        // Indexes 5 and 10 are where the second and third fields start
        // in input laid out as dddd-dddd-dddd-dddd.
        if( field.Index == 5 )
        {
            safeOutputExp += "-xxxx-";
        }
        else if( field.Index == 10 )
        {
            // The dash before this field was already added above.
            safeOutputExp += "xxxx-";
        }
        else
        {
            safeOutputExp += field.Value;
        }
    }

    Console.WriteLine( "Secure Output Result = {0}", safeOutputExp );
}
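
Listing 3.11 enumerates separate matches. If you keep the grouped expression from Listing 3.10 instead, the numbered groups of a single match are available through the Groups property of the Match object. The following is a minimal sketch; the class name GroupDemo and the method CardFields are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns the four digit fields of a card number, or an empty array
    // when the input does not match the expected layout.
    public static string[] CardFields( string cardNum )
    {
        Match m = Regex.Match( cardNum, @"^(\d{4})-(\d{4})-(\d{4})-(\d{4})$" );

        if( !m.Success )
        {
            return new string[0];
        }

        // Groups[0] is the entire match; numbered groups start at index 1.
        string[] fields = new string[m.Groups.Count - 1];
        for( int i = 1; i < m.Groups.Count; i++ )
        {
            fields[i - 1] = m.Groups[i].Value;
        }

        return fields;
    }

    public static void Main()
    {
        foreach( string field in CardFields( "1234-5678-9012-3456" ) )
        {
            Console.WriteLine( field );
        }
    }
}
```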

3.13. Building a Regular Expression Library

You want to create a library of regular expressions that you can reuse in other projects.

Technique

Use the CompileToAssembly static method within the Regex class to compile a regular expression into an assembly. This method uses an array of RegexCompilationInfo objects that contain any number of regular expressions you want to add to the assembly.

The RegexCompilationInfo class contains a constructor with five parameters that you must fill out. The parameters denote the string for the regular expression; any options for the regular expression, which appear in the RegexOptions enumerated type; a name for the class that is created to hold the regular expression; a corresponding namespace; and a Boolean value specifying whether the created class should have a public access modifier.

After creating the RegexCompilationInfo object, create an AssemblyName object, making sure to reference the System.Reflection namespace, and set the Name property to the name you want the resultant assembly file to have. Because CompileToAssembly creates a DLL, exclude the DLL extension on the assembly name. Finally, place all the RegexCompilationInfo objects within an array and call the CompileToAssembly method. Listing 3.12 demonstrates how to create a RegexCompilationInfo object and how to use that object to compile a regular expression into an assembly using the CompileToAssembly method.

Listing 3.12 Using the CompileToAssembly Regex Method to Save Regular Expressions in a New Assembly for Later Reuse

using System;
using System.Text.RegularExpressions;
using System.Reflection;

namespace _12_RegExpReplace
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            CompileRegex(@"(\d{4})-(\d{4})-(\d{4})-(\d{4})", @"regexlib" );
        }

        static void CompileRegex( string exp, string assemblyName )
        {
            RegexCompilationInfo compInfo =
                new RegexCompilationInfo( exp, 0, "CreditCardExp", "", true );
            AssemblyName assembly = new AssemblyName();
            assembly.Name = assemblyName;

            RegexCompilationInfo[] rciArray = { compInfo };

            Regex.CompileToAssembly( rciArray, assembly );
        }
    }
}

Comments

If you use regular expressions regularly, then you might find it advantageous to create a reusable library of the expressions you tend to use the most. The Regex class contains a method named CompileToAssembly that allows you to compile several regular expressions into an assembly that you can then reference within other projects.

Internally, you will find a class for each regular expression you added, all contained within its corresponding namespace, as specified in the RegexCompilationInfo object when you created it. Furthermore, each of these classes inherits from the Regex class so all the Regex methods are available for you to use. As you can see, creating a library of commonly used regular expressions allows you to reuse and share these expressions in a multitude of different projects. A change in a regular expression simply involves changing one assembly instead of each project that hard-coded the regular expression.
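
Note that CompileToAssembly is a .NET Framework method; on .NET Core and .NET 5 or later it throws a PlatformNotSupportedException. On those runtimes, a similar kind of reuse can be sketched with an ordinary class library that exposes shared Regex instances. This is a substitute technique rather than what Listing 3.12 produces, and the class name RegexLibrary and its field names are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical shared library class: place it in a class library project
// and reference that project wherever the expressions are needed.
public static class RegexLibrary
{
    // The credit card expression from Listing 3.10.
    public static readonly Regex CreditCard =
        new Regex( @"(\d{4})-(\d{4})-(\d{4})-(\d{4})", RegexOptions.Compiled );

    // The phone-number expression from Listing 3.9.
    public static readonly Regex Phone =
        new Regex( @"^\(\d{3}\)\s\d{3}-\d{4}$", RegexOptions.Compiled );

    public static void Main()
    {
        Console.WriteLine( Phone.IsMatch( "(555) 555-5555" ) );
        Console.WriteLine( CreditCard.Replace( "1234-5678-9012-3456",
            "$1-xxxx-xxxx-$4" ) );
    }
}
```

As with the compiled assembly, a change to an expression is made in one place and picked up by every project that references the library.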

