logo
Published on developer.* Blogs (http://www.developerdotstar.com/community)

A Microsoft Visual Studio C++ version of Brian Kernighan's Beautiful (?) Code

By Edward G Nilges
Created 2007-12-30 08:03

Here is the final command line executable version, tested but not comprehensively (yet) for Microsoft C++ Visual studio. I've CHANGED Pike's code to tell the caller the starting index and the length of the string that "satisfies" the regular expression, and I've included a somewhat extensive main() procedure to:

* Display all values in and out
* Enable quick debugging in Visual Studio, by defaulting to easily changed compile-time values for testing when run in debug mode inside the IDE

The main intent of returning the index and length of the satisfier string is to allow reuse of All This Useless Beauty, but it's incumbent on the caller to check for a zero satisfierLength when iterating through a string to find all occurences of a regex. Also, Pike's code follows the lazy shortest-match philosophy, which will return zero length strings when the asterisk is used. It's counterintuitive, but a regex of CC* and a string of CC gets you a match length of 1 for this reason.

In fact, the leftmost shortest match philosophy in use will return a match (correctly) of zero length and index 0 in this code for "C*" as the regex and "dddddCCCC" as the input string. If the caller of this code keeps iterating the call until a nonzero length match is found, incrementing the pointer to the string, a great deal of the time saved in recursion is wasted in iteration (and grey hair) in user land.

The shortest string philosophy is a mistake because it is counterintuitive. It isn't Beautiful.

There's a real divergence between what unix, linux and C programmers think about regular expressions and their mathematics. The user's INTENT is entering a potentially zero regular expression such as C* is two fold, to be cool if it isn't found, and consider that a match, or to use the first nonzero string of Cs. Unfortunately, the mathematical equivalent of C+ (at least one C in a sensible regex), CC*, will return using the code below a length of one for CCd and not two because zero is less than one.

Pike's code has no bugs because it creates its own reality. However, it and its clones define the user's world and gives her no chance to understand what's going on. I can only speculate how often this
English-centric code in use at the CIA failed to catch bad guys!

A Beautiful regular expression parser would take far more than an hour (I adpated the code in about two hours). It would allow the user to select the algorithm left first or right first, and longest or shortest match. It wouldn't assume that all strings are strings of bytes. It would support languages that run right to left, it would support the application of its operators to parenthesised sequences, and it WOULDN'T silently use the stack as its playtime funtime work area.

Above all, it would EXPLAIN its algorithms and provide enough output, as I try to do here, to the user, treating her with dignity and respect.

Comin' right up, time permitting, in C sharp, with hooks to this code to do speed comparisions, a regex that handles parentheses and international strings.

// kernighanRegexC.cpp : main project file.

#include "stdafx.h"
#include
#include
#include

#define REPCOUNT_BACKUP ("1")
#define REGEX_BACKUP ("^C")
#define TESTSTRING_BACKUP ("dC")

using namespace System;

public value class kernighanRegexC
{
public:
/* match: search for regexp anywhere in text */
static int match(char *regexp, char *text, char **start, int
*length)
{
*length = 0;
*start = text;
if (regexp[0] == '^')
{
return matchhere(regexp+1, &text)
?
(*length = text - *start, 1)
:
(*start = 0, 0);
}
do { /* must look even if string is empty */
*start = text;
if (matchhere(regexp, &text))
{
*length = text - *start;
return 1;
}
} while (*text++ != '\0');
*start = 0;
return 0;
}
private:
/* matchhere: search for regexp at beginning of text */
static int matchhere(char *regexp, char **textp)
{
if (regexp[0] == '\0') return 1;
if (regexp[1] == '*')
return matchstar(regexp[0], regexp+2, textp);
if (regexp[0] == '$' && regexp[1] == '\0')
return **textp == '\0';
if (**textp!='\0' && (regexp[0]=='.' || regexp[0]==**textp))
return matchhere(regexp+1, (++(*textp), textp));
return 0;
}

/* matchstar: search for c*regexp at beginning of text */
static int matchstar(int c, char *regexp, char **textp)
{
do { /* a * matches zero or more instance */
if (matchhere(regexp, textp))
return 1;
} while (**textp != '\0' && (*(*textp)++ == c || c == '.'));
return 0;
}

};

int main(int argc, char *argv[] )
{
clock_t start;
int repeats, reps;
int result = 0;
char *regexAddr;
char *stringAddr;
char *satisfierAddr;
int satisfierLength;
if (argc != 4)
{
printf("My syntax is kernighanRegexC
\n");
printf("I'll use my backup test values\n");
repeats = atoi((REPCOUNT_BACKUP));
regexAddr = (REGEX_BACKUP);
stringAddr = (TESTSTRING_BACKUP);
} else
{
repeats = atoi(argv[1]);
regexAddr = (argv[2]);
stringAddr = (argv[3]);
}
printf( "Applying the regular expression '%s' to the string '%s' %i
time(s)\n",
regexAddr, stringAddr, repeats);
start = clock();
for (reps = 0; reps 0)
{
printf( "Satisfier string was %sfound\n", result==0 ? "not " : "");
if (result == 1)
{
printf("Zero-origin index of satisfier string: %i\n", satisfierAddr
- stringAddr);
printf("Length of satisfier string: %i\n", satisfierLength );
printf("Starting address of test string: %x\n", stringAddr);
printf("Starting address of satisfier string: %x\n",
satisfierAddr);
}
printf( "Time taken was about %2.1f seconds\n", duration );
}
return 0;

}


Source URL:
http://www.developerdotstar.com/community/community/node/777