A Microsoft Visual Studio C++ version of Brian Kernighan's Beautiful (?) Code
Here is the final command line executable version, tested but not comprehensively (yet) for Microsoft C++ Visual studio. I've CHANGED Pike's code to tell the caller the starting index and the length of the string that "satisfies" the regular expression, and I've included a somewhat extensive main() procedure to:
* Display all values in and out
* Enable quick debugging in Visual Studio, by defaulting to easily changed compile-time values for testing when run in debug mode inside the IDE
The main intent of returning the index and length of the satisfier string is to allow reuse of All This Useless Beauty, but it's incumbent on the caller to check for a zero satisfierLength when iterating through a string to find all occurences of a regex. Also, Pike's code follows the lazy shortest-match philosophy, which will return zero length strings when the asterisk is used. It's counterintuitive, but a regex of CC* and a string of CC gets you a match length of 1 for this reason.
In fact, the leftmost shortest match philosophy in use will return a match (correctly) of zero length and index 0 in this code for "C*" as the regex and "dddddCCCC" as the input string. If the caller of this code keeps iterating the call until a nonzero length match is found, incrementing the pointer to the string, a great deal of the time saved in recursion is wasted in iteration (and grey hair) in user land.
The shortest string philosophy is a mistake because it is counterintuitive. It isn't Beautiful.
There's a real divergence between what unix, linux and C programmers think about regular expressions and their mathematics. The user's INTENT is entering a potentially zero regular expression such as C* is two fold, to be cool if it isn't found, and consider that a match, or to use the first nonzero string of Cs. Unfortunately, the mathematical equivalent of C+ (at least one C in a sensible regex), CC*, will return using the code below a length of one for CCd and not two because zero is less than one.
Pike's code has no bugs because it creates its own reality. However, it and its clones define the user's world and gives her no chance to understand what's going on. I can only speculate how often this
English-centric code in use at the CIA failed to catch bad guys!
A Beautiful regular expression parser would take far more than an hour (I adpated the code in about two hours). It would allow the user to select the algorithm left first or right first, and longest or shortest match. It wouldn't assume that all strings are strings of bytes. It would support languages that run right to left, it would support the application of its operators to parenthesised sequences, and it WOULDN'T silently use the stack as its playtime funtime work area.
Above all, it would EXPLAIN its algorithms and provide enough output, as I try to do here, to the user, treating her with dignity and respect.
Comin' right up, time permitting, in C sharp, with hooks to this code to do speed comparisions, a regex that handles parentheses and international strings.
// kernighanRegexC.cpp : main project file.
#include "stdafx.h"
#include
#include
#include
#define REPCOUNT_BACKUP ("1")
#define REGEX_BACKUP ("^C")
#define TESTSTRING_BACKUP ("dC")
using namespace System;
public value class kernighanRegexC
{
public:
/* match: search for regexp anywhere in text */
static int match(char *regexp, char *text, char **start, int
*length)
{
*length = 0;
*start = text;
if (regexp[0] == '^')
{
return matchhere(regexp+1, &text)
?
(*length = text - *start, 1)
:
(*start = 0, 0);
}
do { /* must look even if string is empty */
*start = text;
if (matchhere(regexp, &text))
{
*length = text - *start;
return 1;
}
} while (*text++ != '\0');
*start = 0;
return 0;
}
private:
/* matchhere: search for regexp at beginning of text */
static int matchhere(char *regexp, char **textp)
{
if (regexp[0] == '\0') return 1;
if (regexp[1] == '*')
return matchstar(regexp[0], regexp+2, textp);
if (regexp[0] == '$' && regexp[1] == '\0')
return **textp == '\0';
if (**textp!='\0' && (regexp[0]=='.' || regexp[0]==**textp))
return matchhere(regexp+1, (++(*textp), textp));
return 0;
}
/* matchstar: search for c*regexp at beginning of text */
static int matchstar(int c, char *regexp, char **textp)
{
do { /* a * matches zero or more instance */
if (matchhere(regexp, textp))
return 1;
} while (**textp != '\0' && (*(*textp)++ == c || c == '.'));
return 0;
}
};
int main(int argc, char *argv[] )
{
clock_t start;
int repeats, reps;
int result = 0;
char *regexAddr;
char *stringAddr;
char *satisfierAddr;
int satisfierLength;
if (argc != 4)
{
printf("My syntax is kernighanRegexC
\n");
printf("I'll use my backup test values\n");
repeats = atoi((REPCOUNT_BACKUP));
regexAddr = (REGEX_BACKUP);
stringAddr = (TESTSTRING_BACKUP);
} else
{
repeats = atoi(argv[1]);
regexAddr = (argv[2]);
stringAddr = (argv[3]);
}
printf( "Applying the regular expression '%s' to the string '%s' %i
time(s)\n",
regexAddr, stringAddr, repeats);
start = clock();
for (reps = 0; reps 0)
{
printf( "Satisfier string was %sfound\n", result==0 ? "not " : "");
if (result == 1)
{
printf("Zero-origin index of satisfier string: %i\n", satisfierAddr
- stringAddr);
printf("Length of satisfier string: %i\n", satisfierLength );
printf("Starting address of test string: %x\n", stringAddr);
printf("Starting address of satisfier string: %x\n",
satisfierAddr);
}
printf( "Time taken was about %2.1f seconds\n", duration );
}
return 0;
}


A note on "conforming strings"
In wanting to return the start index and the length of the string that conforms I return counterintuitive results such as an index of 0 and a length of 1 for the regex CC* and the string CC, and, even worse, an index of 0 AND A LENGTH OF 0 for the regex C* and the string CC, using Pike's code, because Pike's code gets shortest string.
Independent of programmer implementation, there is a mathematics of regular expressions (that is not necessarily the property of professional mathematicians, any more than a regex processor defines human reality).
A programmer would observe that there isn't one and only one conforming substring, and that the conformant string is defined by the User, assuming you can find one (the "user" is an undefinable buzz word). A mathematician would say that the conformant strings are a plurality which can be thought of as a SET.
Therefore, any real regular expression processor has to by default give the caller the start index and length of the longest conformant string, NOT the shorter: Pike's code is mislabeled.
The longest conformant string can stand for the set since the set of conforming strings is just the set of substrings starting at the index i of the longest string, and then substring (i+1, maxlength-1), (i+2, maxlength-2) .. (i + maxlength - 1, 1). If the regex ends in an asterisk then the set also includes the null string.
This model can be extended to carat and dollar (the characters representing start and end of string) by transforming the input string s to ^s$ in a meta-notation.
Which means that a solid regex would return an object, a class, that allows retrieval of each conformant string from a set model. This class could implement laziness in a sweet fashion: if asked for the first and shortest conformant string, it could get only the conformants just-in-time by using an algorithm where the maximum length of conformant string, relative to the minimum, would be a parameter!
Indeed, a solid regex would be a class with methods that would EXPLAIN regular expressions, producing outputs such as "gets at least one C" for CC*, allowing its use as a real tool as opposed to a rather misleading snippet.
The ability to think in sets as opposed to bytes doesn't seem to transfer across the boundaries between languages, between languages that explicitly handle sets, and languages like C. In Beautiful Code, I don't see the beauty of ontology...making objects that correspond to thought.