Software Quality: The Horror of Spaghetti Code

[Image: a “Spaghetti Code Spook” looming out of a graveyard, babbling about GoTos]

When scanning the news, I often see articles with titles such as, “The nation needs more programmers”, “The nation needs to increase the number of STEM graduates”, and so on. However, this is a simplistic idea for many reasons. I don’t plan to examine all the reasons in this article, but I do want to discuss the notion that “any kind of programming will do” and explain the reasons to avoid Spaghetti Code.

It seems that most non-programmers simply assume that “all programs are created equal”, in the sense that all code is of equivalent quality. Unfortunately, that is by no means the case. To be clear, the quality of the code has nothing to do with whether the intention of a program is “good” or “bad” (for example, whether it’s malware). A malicious program may be of very high quality, whereas a well-intentioned program may be badly written.

Recognition of the problems caused by poor quality programs over the past few decades has led to advances in the structure of programs. These structures are not necessary for the correct operation of a program, but they make the program much easier for humans to understand and maintain, and thus the code has much greater and more enduring value.

The Evolution of Programming Structure

Largely as a result of bitter experience, the discipline of programming has evolved rapidly over the past few decades. Some code that conforms to what was considered acceptable programming practice in, say, the 1980s would be regarded as appallingly inept today.

In the early days, it was considered sufficient just to write code that produced the desired result. There was very little consideration of how the code was implemented, and in any case the available coding tools were quite primitive.

However, it soon became apparent that merely having code that worked wasn’t adequate, because sooner or later the code would need to be modified. If the code had been implemented in a disorganized way, possibly by a programmer who had since moved on to other projects, the maintenance task became nightmarishly difficult. This realization led to the development of the principles of Structured Programming, then Object-Oriented Programming, and other more sophisticated approaches.

There isn’t space in this article to discuss the detailed history of programming structures and paradigms. For much more detail, see, for example, https://en.wikipedia.org/wiki/Structured_programming. Here, I just want to provide an example of why programming structure is so important.

Early programming languages had very limited control structures. Perhaps the most common situation in program writing is when the code must perform a test, and then take actions based on the result of the test.

The earliest programs were written in machine code. Even the use of Assembly code (which used mnemonics to describe machine instructions) offered limited control structures, which usually consisted of jumping to one code address or another, depending on the result of a test.

“High Level Languages” were created to make programming more efficient, and to offer more sophisticated control structures. However, some high-level languages still retained the GoTo instruction, which permitted unstructured jumps in the control flow.

When an error was discovered, or when it became necessary to change the code’s operation, the existence of all these jumps made it very difficult to trace through the sequence of programming steps. For obvious reasons, such code has come to be known as Spaghetti Code. If you see code that contains instructions such as GoTo or GoSub (used in Visual Basic), then you’re probably looking at Spaghetti Code.

The deficiencies of Spaghetti Code led to the development of Structured Programming, where code is divided up into clearly defined modules, and control flow is determined by looping and branching constructs that make it much easier to follow the operation of a program. Later on, more sophisticated programming paradigms were developed, such as Object-Oriented Programming. These paradigms not only eliminated the Spaghetti Code, but also offered other advantages. The important point is that these paradigms were developed to make programming easier and more productive, so it really isn’t the case that writing Spaghetti Code is somehow simpler.

Not The State of the Art: A Horrifying Example of Spaghetti Code!

I’d like to be able to state that spaghetti code and its attendant nightmares are nothing but memories of the past, and that nobody would dream of writing Spaghetti Code today.

Unfortunately, that’s not universally true; I still sometimes encounter such code today. The following example is written in the Microsoft VBA (Visual Basic for Applications) language, and was intended for processing the content of Microsoft Word documents.

Lest you think that the following example is something I made up (“Surely nobody would really write code like this”), I assure you that this is real code that I excerpted from a VBA program that was being relied on by many users. All that I’ve done is to change some function and label names, to protect the guilty!

The purpose of this function is to find and select the first heading in a Word document (that is, the first paragraph of text with any of the styles Heading 1 through Heading 9). When this heading has been found, the code checks to see whether it contains the text “Heading Before”. If it does, the code jumps to the next heading in the document, and examines the style of that heading. If the style of the following heading is not “Heading 1”, then a new Heading 1-styled heading is inserted, with the text “Heading Inserted”. (Don’t worry about why it would be necessary to do this; rest assured that this was claimed to be necessary in the code on which I based this example!)

Notice particularly the statement GoTo InsertHeadingOne and its associated label InsertHeadingOne.

Sub insert_Heading1()
	Call Select_First_Heading_in_Doc
	With Selection
		If .Paragraphs(1).Range.Text Like "Heading Before" Then
			.GoToNext what:=wdGoToHeading
			GoTo InsertHeadingOne
		Else
InsertHeadingOne:
			If Not .Paragraphs(1).Style = "Heading 1" Then
				.MoveLeft
				.TypeText vbCr & "Heading Inserted" & vbCr
				.Paragraphs(1).Style = "Normal"
				.MoveLeft
				.Paragraphs(1).Style = "Heading 1"
			End If
		End If
	End With
End Sub

This is such a short subroutine that an experienced programmer would think it should be possible to write it without GoTo instructions and labels. That is correct; it is possible, and the result is much more succinct, as I show below.

Let’s examine the subroutine’s control flow. The code selects the first text in the document that has any Word “Heading” style. Then, it evaluates the “If” statement. If the evaluation is true, then the selection is moved to the next heading, following which the “Else” code is evaluated! In other words, the code within the “Else” clause is executed whatever the result of the “If” expression, and thus doesn’t need to be within an “Else” clause at all.

The following code is functionally identical to that above, but does not require either GoTo instructions or the spurious “Else” clause.

Sub insert_Heading1()
	Call Select_First_Heading_in_Doc
	With Selection
		If .Paragraphs(1).Range.Text Like "Heading Before" Then
			.GoToNext what:=wdGoToHeading
		End If
		If Not .Paragraphs(1).Style = "Heading 1" Then
			.MoveLeft
			.TypeText vbCr & "Heading Inserted" & vbCr
			.Paragraphs(1).Style = "Normal"
			.MoveLeft
			.Paragraphs(1).Style = "Heading 1"
		End If
	End With
End Sub

This example gives the lie to the excuse that “we have to write spaghetti code because it’s more compact than structured code”, since the structured and well-organized version is clearly shorter than the spaghetti code version.

This example doesn’t include the use of the GoSub instruction, which is another “relic” from pre-structured programming days. GoSub offers a very primitive form of subroutine calling, but should always be avoided in favor of actual subroutine or function calls.

The Race to the Bottom

The issue of software quality frequently sets up an ongoing conflict between programmers and clients (or between employees and employers).

Clients or employers want to produce working code as quickly and as cheaply as possible. The problem with this is that it’s a short-sighted approach that frequently turns out to be a false economy in the long term. Cheap programmers tend to be inexperienced, and so produce poor quality code. Rushing code development leads to implementations that are not well thought out. The result of the short-sighted approach is that the code requires a disproportionate level of bug fixing and maintenance, or else has to be rewritten completely before the end of its anticipated lifetime.

Avoiding the Horror of Spaghetti Code

In summary, then, the message I want to offer here is this. If you’re going to write software, or hire someone else to write software for you, then you should make it your business to understand what constitutes high-quality software, and then take the time, effort and expense to ensure that that is what gets produced. Unfortunately, I realize that many software producers will continue to ignore this recommendation, but that doesn’t make it any less true.

Converting Between Absolute & Relative Paths in MadCap Flare: Sample C# Code

I regularly use MadCap Flare for the production of technical documentation. Flare is a sophisticated content authoring tool, which stores all its topic and control files using XML. This makes it relatively easy to process the content of the files programmatically, as in the example of CSS class analysis that I described in a previous post.

The Flare software is based on Microsoft’s .NET framework, so the program runs only under Windows. For that reason, this discussion will be restricted to Windows file systems.

In Windows, the “path” to a file consists of a hierarchical list of subfolders beneath a root volume, for example:

c:\MyRoot\MyProject\Content\MyFile.htm

Sometimes, however, it’s convenient to specify a path relative to another location. For example, if the file at:

c:\MyRoot\MyProject\Content\Subsection\MyTopic.htm

contained a link to MyFile.htm as above, the relative path could be specified as:

..\MyFile.htm

In the syntax of relative paths, “..” means “go up one folder level”. Similarly, “.” means “this folder level”, so .\MyFile.htm refers to a file that’s in the same folder as the file containing the relative path.

If you’ve ever examined the markup in Flare files, you’ll have noticed that extensive use is made of “relative paths”. For example, a Flare topic may contain a hyperlink to another topic in the same project, such as:

<MadCap:xref href="..\MyTopic.htm">Linked Topic</MadCap:xref>

Similarly, Flare’s Table-Of-Contents (TOC) files (which have .fltoc extensions) are XML files that contain trees of TocEntry elements. Each TocEntry element has a Link attribute that contains the path to the topic or sub-TOC that appears at that point in the TOC. All the Link attribute paths start at the project’s Content (for linked topics) or Project (for linked sub-TOCs) folder, so in that sense they are relative paths.

An example of a TocEntry element would be:

<TocEntry Title="Sample Topic" Link="/Content/Subsection/MyTopic.htm" />

When I’m writing code to process these files (for example to open and examine each topic in a Flare TOC file), I frequently have to convert Flare’s relative paths into absolute paths (because the XDocument.Load() method, as described in my previous post, will accept only an absolute path), and vice versa if I want to insert a path into a Flare file. Therefore, I’ve found it very useful to create “library” functions in C# to perform these conversions. I can then call the functions AbsolutePathToRelativePath() and RelativePathToAbsolutePath() without having to think again about the details of how to convert from one format to the other.

There are probably similar functions available in other programming languages. For example, I’m told that Python includes a built-in conversion function called os.path.relpath, which would make it unnecessary to create custom code. Anyway, my experience as a programmer suggests that you can never have too many code samples, so I’m offering my own versions here to add to the available set. I have tested both functions extensively, and they do work as listed.

The methods below are designed as static methods for inclusion in a stringUtilities class. You could place them in any class, or make them standalone functions.

AbsolutePathToRelativePath

This static method converts an absolute file path specified by strTargFilepath to its equivalent path relative to strRootDir. strRootDir must be a directory tree only, and must not include a file name.

For example, if the absolute path strTargFilepath is:

c:\folder1\folder2\subfolder1\filename.ext

And the root directory strRootDir is:

c:\folder1\folder2\folder3\folder4

The method returns the relative file path:

..\..\subfolder1\filename.ext

Note that there must be some commonality between the folder tree of strTargFilepath and strRootDir. If there is no commonality, then the method just returns strTargFilepath unchanged.

The path separator character that will be used in the returned relative path is specified by strPreferredSeparator. The default value is correct for Windows.

using System;
using System.Collections.Generic;

public static string AbsolutePathToRelativePath(string strRootDir, string strTargFilepath, string strPreferredSeparator = "\\")
{
	if (strRootDir == null || strTargFilepath == null)
		return null;

	string[] strSeps = new string[] { strPreferredSeparator };

	if (strRootDir.Length == 0 || strTargFilepath.Length == 0)
		return strTargFilepath;

	// Convert to arrays of folder names
	string[] strRootFolders = strRootDir.Split(strSeps, StringSplitOptions.None);
	string[] strTargFolders = strTargFilepath.Split(strSeps, StringSplitOptions.None);
	if (string.Compare(strRootFolders[0], strTargFolders[0], StringComparison.OrdinalIgnoreCase) != 0)
		return strTargFilepath;

	// Count common root folders, guarding against a root path that's deeper than the target path
	int i = 0;
	List<string> listRelFolders = new List<string>();
	for (i = 0; i < strRootFolders.Length && i < strTargFolders.Length; i++)
	{
		if (string.Compare(strRootFolders[i], strTargFolders[i], StringComparison.OrdinalIgnoreCase) != 0)
			break;
	}
	
	for (int k = i; k < strTargFolders.Length; k++)
		listRelFolders.Add(strTargFolders[k]);

	System.Text.StringBuilder sb = new System.Text.StringBuilder();
	if (i > 0)
	{
		// Prepend one ".." for each root folder below the common prefix
		for (int j = 0; j < strRootFolders.Length - i; j++)
		{
			sb.Append("..");
			sb.Append(strPreferredSeparator);
		}
	}

	return sb.Append(string.Join(strPreferredSeparator, listRelFolders.ToArray())).ToString();
}
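
As a quick usage sketch (my own example, assuming the method has been placed in the stringUtilities class mentioned above), the conversion described at the start of this section looks like this:

string strRel = stringUtilities.AbsolutePathToRelativePath(
	@"c:\folder1\folder2\folder3\folder4",
	@"c:\folder1\folder2\subfolder1\filename.ext");
// strRel should now contain "..\..\subfolder1\filename.ext"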

RelativePathToAbsolutePath

This static method converts a relative file path specified by strTargFilepath to its equivalent absolute path using strRootDir. strRootDir must be a directory tree only, and must not include a file name.

For example, if the relative path strTargFilepath is:

..\..\subfolder1\filename.ext

And the root directory strRootDir is:

c:\folder1\folder2\folder3\folder4

The method returns the absolute file path:

c:\folder1\folder2\subfolder1\filename.ext

If strTargFilepath starts with “.\” or “\”, then strTargFilepath is simply appended to strRootDir.

The path separator character that will be used in the returned absolute path is specified by strPreferredSeparator. The default value is correct for Windows.

using System;
using System.Collections.Generic;
using System.Linq;

public static string RelativePathToAbsolutePath(string strRootDir, string strTargFilepath, string strPreferredSeparator = "\\")
{
	if (string.IsNullOrEmpty(strRootDir) || string.IsNullOrEmpty(strTargFilepath))
		return null;
	
	string[] strSeps = new string[] { strPreferredSeparator };

	// Convert to lists
	List<string> listTargFolders = strTargFilepath.Split(strSeps, StringSplitOptions.None).ToList<string>();
	List<string> listRootFolders = strRootDir.Split(strSeps, StringSplitOptions.None).ToList<string>();

	// If strTargFilepath starts with .\ or \, delete the initial item
	if (string.IsNullOrEmpty(listTargFolders[0]) || (listTargFolders[0] == "."))
		listTargFolders.RemoveAt(0);

	// For each leading "..", climb one level up the root path,
	// guarding against running out of folders on either side
	while (listTargFolders.Count > 0 && listRootFolders.Count > 0 && listTargFolders[0] == "..")
	{
		listRootFolders.RemoveAt(listRootFolders.Count - 1);
		listTargFolders.RemoveAt(0);
	}
	if ((listRootFolders.Count == 0) || (listTargFolders.Count == 0))
		return null;

 	// Combine root and subfolders
	System.Text.StringBuilder sb = new System.Text.StringBuilder();
	foreach (string str in listRootFolders)
	{
		sb.Append(str);
		sb.Append(strPreferredSeparator);
	}
	for (int i = 0; i < listTargFolders.Count; i++)
	{
		sb.Append(listTargFolders[i]);
		if (i < listTargFolders.Count - 1)
			sb.Append(strPreferredSeparator);
	}

	return sb.ToString();
}

[7/1/16] Note that the method above does not check for the case where a relative path contains a partial overlap with the specified absolute path. If required, you would need to add code to handle such cases.

For example, if the relative path strTargFilepath is:

folder4\subfolder1\filename.ext

and the root directory strRootDir is:

c:\folder1\folder2\folder3\folder4

the method will not detect that folder4 is actually already part of the root path.
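
Here’s the corresponding usage sketch for the reverse conversion (again my own example, assuming the same stringUtilities class):

string strAbs = stringUtilities.RelativePathToAbsolutePath(
	@"c:\folder1\folder2\folder3\folder4",
	@"..\..\subfolder1\filename.ext");
// strAbs should now contain "c:\folder1\folder2\subfolder1\filename.ext"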

Data Extinction: The Problem of Digital Obsolescence

[Image: a dinosaur-shaped printed circuit board, illustrating digital obsolescence]

I suspect that many of us, as computer users, have had the experience of searching for some computer file that we know we saved somewhere, but can’t seem to find. Even more frustrating is the situation where, having spent time looking for the file and having found it, we discover either that the file has been corrupted, or that it is in a format that our software can no longer read. This is perhaps most likely to happen with digital photographs or videos, but it can also happen with text files, or even programs themselves. This is the problem of Digital Obsolescence.

In an earlier post, I mentioned a vector graphics file format called SVG, and I showed how you can use a text editor to open SVG files and view the individual drawing instructions in the file. I didn’t discuss the reason why it’s possible to do that with SVG files, but not with some other file types. For example, if you try to open an older Microsoft Word file (with a .doc extension) with a text editor, all you’ll see are what appear to be reams of apparently random characters. Some file types, such as SVG, are “text encoded”, whereas other types, such as Word .doc files, are “binary encoded”.

Within the computer industry, there has come to be an increasing acceptance of the desirability of using text-encoded file formats for many applications. The reason for this is the recognition of a serious problem, whereby data that has been stored in a particular binary format eventually becomes unreadable because software is no longer available to support that format. In some cases, the specification defining the data structure is no longer available, so the data can no longer be decoded.

The general problem is one of “data retention”, and it has several major aspects:

  • Storing data on physical media that will remain accessible and readable for as long as required.
  • Storing data in formats that will continue to be readable for as long as required.
  • Where files are encrypted or otherwise secured, ensuring that passwords and keys are kept in some separate but secure location where they can be retrieved when necessary.

Most people who have used computers for a few years are aware of the first problem, as storage methods have evolved from magnetic tapes to optical disks, and so on. However, fewer people consider the second and third problems, which are what I want to discuss in this article.

Digital Obsolescence: The Cost of Storage and XML

In the early days of computers, device storage capacities were very low, and the memory itself was expensive. Thus, it was important to make the most efficient use of all available memory. For that reason, binary-encoded files tended to be preferred over text-encoded files, because binary encoding was generally more efficient.

However, those days are over, and immense quantities of memory are available very cheaply. Thus, even if text-encoding is less efficient than binary-encoding, that’s no longer a relevant concern in most cases.

Many modern text-encoding formats (including SVG and XHTML) are based on XML (eXtensible Markup Language). XML provides a basic structure for the creation of “self-describing data”. Such data can have a very wide range of applications, so, to support particular purposes, most XML files use document models, called Document Type Definitions (DTDs) or schemas. Many XML schemas have now been published, including, for example, Microsoft’s WordML, which is the schema that defines the structure of the content of newer Word files (those with a .docx extension).
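
To illustrate what “self-describing” means, here’s a trivial fragment that I’ve made up for this post (it doesn’t conform to any published schema). The element and attribute names describe the data they contain, so a human, or a future program, could decode the content without the software that created it:

<diaryEntry date="2016-06-20">
	<heading>A Note About Data Formats</heading>
	<body>Text-encoded data can outlive the program that wrote it.</body>
</diaryEntry>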

XML is a huge subject in its own right, and many books have been written about it, even without considering the large number of schemas that have been created for it. I’ll have more to say about aspects of XML in future posts.

Digital Obsolescence: Long Term vs. Short Term Retention

Let’s be clear that the kind of “data retention” I’m talking about here refers to cases where you want to keep your data for the long term, and ensure that your files will still be readable or viewable many years in the future. For example, you may have a large collection of digital family photos, which you’d like your children to be able to view when they have grown up. Similarly, you may have a diary that you’ve been keeping for a long time, and you’ll want to be able to read your diary entries many years from now.

This is a very different problem from short-term data retention, which is a problem commonly faced by businesses. Businesses need to store all kinds of customer and financial information (and are legally required to do so in many cases), but the data only needs to be accessible for a limited period, such as a few years. Much of it becomes outdated very quickly in any case, so very old data is effectively useless.

There are some organizations out there who will be happy to sell you a “solution” to long-term data retention that’s actually useful only for short-term needs, so it’s important to be aware of this distinction.

Digital Obsolescence: Examples from my Personal Experience

In the early “pre-Windows” days of DOS computers, several manufacturers created graphical user interfaces that could be launched from DOS. One of these was the “Graphical Environment Manager” (GEM), created by Digital Research. I began using GEM myself, largely because my employer at the time was using it. One facet of GEM was the “GEM Draw” program, which was (by modern standards) a very crude vector drawing program. I produced many diagrams and saved them in files with the .GEM extension.

A few years later, I wanted to reuse one of those GEM drawing files, but I’d switched to Windows, and GEM was neither installed on my computer nor even available to buy. I soon discovered that there was simply no way to open a GEM drawing file, so the content of those files had become “extinct”.

Similarly, during the 1990s, before high-quality digital cameras became available, I took many photographs on 35mm film, but had the negatives copied to Kodak Photo-CDs. The Photo-CD standard provided excellent digital versions of the photos (by contemporary standards), with each image stored in a PCD file in 5 resolutions. Again, years later, when I tried to open a PCD file with a recent version of Corel Draw, I discovered that the PCD format was no longer supported. Fortunately, in this case, I was able to use an older version of Corel Draw to batch-convert every PCD file to another more modern format, so I was able to save all my pictures.

Digital Obsolescence: Obsolete Data vs. Obsolete Media

As mentioned above, the problem I’m describing here doesn’t relate to the obsolescence of the media that contain the files you want to preserve. For example, there can’t be many operational computers still around that have working drive units for 5.25” floppy disks (or even 3.5” floppy disks), but those small disks were never particularly reliable storage media in any case, so presumably anyone who wanted to preserve files has already moved their data to more modern and robust devices.

I’ll discuss some aspects of media obsolescence further in a future post.

Digital Obsolescence: Survival Practices

So what can you do to ensure that your data won’t go extinct? There are several “best practices”, but unfortunately some of them involve a tradeoff, whereby you trade sophisticated formatting features for data survivability.

  • Never rely on “cloud” storage for the long term. Cloud storage is very convenient for short-term data retention, or to make data available from multiple locations, but it’s a terrible idea for long-term retention. All kinds of bad things could happen to your data over long periods of time: the company hosting the data could have its servers hacked, or it could go out of business, or else you could simply forget where you stored the data, or the passwords you need to access it!
  • Prefer open data formats to proprietary formats.
  • Prefer XML-based formats to binary formats.
  • Try to avoid saving data in encrypted or password-protected forms. If it must be stored securely, ensure that all passwords and encryption keys exist in written form, and that you’ll be able to access that information when you need it! (That is, ensure that the format of the key storage file won’t itself become extinct.)
  • Expect formats to become obsolete, requiring you to convert files to newer formats every few years.
  • Copy all the files to new media every few years, and try opening some of the copied files when you do this. This reduces the danger that the media will become unreadable, either because of corruption or because physical readers are no longer available.

Sometimes you’ll see recommendations for more drastic formatting restrictions, such as storing text files as plain-text only. Personally, I don’t recommend following such practices, unless the data content is extremely critical, and you can live within the restrictions. If you follow the rules above consistently, you should be relatively safe from “data extinction”.

Analyzing CSS Styles in a Large Set of XML Documents

This post explains how I created a small program for the automated processing of text, so there won’t be anything about graphics or spelling here! I’ve created many such programs, large and small, over the years, so this is just intended as a sample of what’s possible. This program analyzes CSS styles applied to a large set of XML-based documents.

In my work, I frequently need to be able to analyze and modify large sets of XML-encoded documents (typically 10,000+ documents in a single project). It’s simply not practical to do this by opening and editing each document individually; the only realistic way is to write a script or program to perform the analysis and editing automatically.

Recently, I needed to update a Cascading Style Sheet (CSS file) that was linked to a large number of XHTML topic documents in a publishing project. The CSS file had been edited by several people over many years, with little overall strategy except to “keep it going”, so I was aware that it probably contained many “junk” style definitions. I wanted to be able to delete all the junk definitions from the CSS file, but how could I determine which styles were needed and which were not? I needed a way to scan through all 10,000+ files to find out exactly which CSS classes were used at least once in the set.

Years ago, I gained a great deal of experience of programming in C++, using Microsoft’s MFC framework. About 8 years ago, for compatibility with other work that was being done by colleagues, I began to transition to programming in C# using Microsoft’s .NET architecture. Thus, I decided to use C#/.NET to create a program that would allow me to scan huge numbers of XHTML files rapidly and create a list of the CSS styles actually found in the topic files.

Until the advent of .NET 3.5, I’d become accustomed to working with the class XmlDocument. While this was a definite improvement over previous “XML” handling classes, it could still be awkward for certain operations, such as, for example, constructing and inserting new XML snippets in an existing document. I was delighted, then, to discover the new XDocument class that was introduced with .NET 3.5, and I now use the newer class almost exclusively. (For some discussion of the differences, see http://stackoverflow.com/questions/1542073/xdocument-or-xmldocument)
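
As a minimal sketch of the difference (this isn’t taken from the program described below), constructing a new XML snippet and inserting it into an existing document takes just a couple of statements with XDocument:

using System.Xml.Linq;

// Parse a trivial document, then build and attach a new element in one step
XDocument xdoc = XDocument.Parse("<html><body><p class='intro'>Hello</p></body></html>");
xdoc.Root.Element("body").Add(
	new XElement("p", new XAttribute("class", "note"), "Inserted paragraph"));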

Analyzing CSS Styles: Code Sample

Here are the critical methods of the class that I created to walk the XML tree in each document. The first method below, walkXMLTree(), executes the tree walking operation. To do that, it obtains the root element of the XML tree, then calls the second method, checkChildElemsRecursive(), which actually traverses the tree in each document.

using System.Collections.Generic;
using System.IO;
using System.Xml.Linq;

// Class fields used by the methods below
private XDocument _xdoc;
private XElement _xelemDocRoot;

public int walkXMLTree(string strRootFolderpath, ref SortedList<string, string> setClasses)
{
    string[] strFilepaths = Directory.GetFiles(strRootFolderpath, "*.htm", SearchOption.AllDirectories);

    List<string> listDocsFailedToLoad = new List<string>();

    int iElemsChecked = 0;

    foreach (string strFilepath in strFilepaths)
    {
        try
        {
            _xdoc = XDocument.Load(strFilepath);
            _xelemDocRoot = _xdoc.Root;
            iElemsChecked += checkChildElemsRecursive(_xelemDocRoot, ref setClasses, strFilepath);
        }
        catch
        {
            listDocsFailedToLoad.Add(strFilepath);
        }
    }

    return iElemsChecked;
}

private int checkChildElemsRecursive(XElement xelemParent, ref SortedList<string, string> setClasses, string strFilename)
{
    int iElemsChecked = 0;
    string strClass;
    XAttribute xattClass;

    IEnumerable<XElement> de = xelemParent.Elements();

    foreach (XElement el in de)
    {
        // Find class attribute if any
        xattClass = el.Attribute("class");
        if (xattClass != null)
        {
            strClass = el.Name + "." + xattClass.Value;
            if (!setClasses.ContainsKey(strClass))
                setClasses.Add(strClass, strFilename);
        }
        iElemsChecked++;

        iElemsChecked += checkChildElemsRecursive(el, ref setClasses, strFilename);
    } 

    return iElemsChecked;
}

[Code correction 6/20/16: I changed xelemParent.Descendants() to xelemParent.Elements(). By using Descendants, I was doing the work twice, because Descendants returns child elements at all sub-levels, instead of just the first level. The code works correctly either way, but if you use Descendants, the recursion is unnecessary.]

The use of the System.IO, System.Xml.Linq, and System.Collections.Generic libraries is declared at the top of the code.

The basic method is walkXMLTree(), which generates a sorted list setClasses of CSS classes used in every one of the XHTML files under the root folder strRootFolderpath. In this implementation, the returned list contains the CSS class name in the first element of each item (for example, “span.link”), and, in the second element, the file path of the first topic file that was found to contain that class.

The method walkXMLTree() contains a loop that examines every strFilepath in the array strFilepaths. Although every topic file under strRootFolderpath is expected to contain valid XML, it’s always possible that a file contains invalid XML markup. In that case, the XDocument.Load() method throws an exception, which stops program execution. To avoid crashing the program in such a case, I wrapped XDocument.Load() in a try-catch block. If the method fails for a particular file, the code adds the path and name of that file to listDocsFailedToLoad, then moves on to the next file. When all the files have been scanned, I can then examine listDocsFailedToLoad to see how many files couldn’t be opened (hopefully not a large number, and usually it isn’t).

For each XHTML topic that it succeeds in opening, walkXMLTree() calls the method checkChildElemsRecursive() to traverse the element tree in that document. Note that the checkChildElemsRecursive() method is indeed recursive, since it calls itself in its own foreach loop. When checkChildElemsRecursive() is initially called from walkXMLTree(), the xelemParent parameter that is passed in is the root element of the XML tree in the document being scanned.

When control finally returns from checkChildElemsRecursive() to walkXMLTree(), the variable iElemsChecked contains the complete number of XML elements that were examined. This is likely to be a huge number; in one recent test, more than 8 million elements were processed.

The final content of setClasses will be a list of every class that’s used in at least one of the topic files. In the example above, I also set each item to show the filepath of the first file that was found that included that class, because I wasn’t expecting too many surprises! To obtain a complete analysis, you could, of course, make the second item in the SortedList a sublist that would include the file path of every topic using that class.
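
Finally, here’s a hypothetical driver for these methods (the class name cssClassScanner and the folder path are my own inventions for illustration):

// Assumes walkXMLTree() and checkChildElemsRecursive() are members of cssClassScanner,
// together with the _xdoc and _xelemDocRoot fields
cssClassScanner scanner = new cssClassScanner();
SortedList<string, string> setClasses = new SortedList<string, string>();
int iElemsChecked = scanner.walkXMLTree(@"C:\MyProject\Content", ref setClasses);

Console.WriteLine("{0} elements examined; {1} distinct classes found", iElemsChecked, setClasses.Count);
foreach (KeyValuePair<string, string> kvp in setClasses)
    Console.WriteLine("{0} (first seen in {1})", kvp.Key, kvp.Value);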