Brian Kernighan's Beautiful Code?
When I had started working at Princeton University's Information Centers in 1987, my boss at a reception for alumni student Info Center workers said to me, Ed, this is *Brian Kernighan*.
It was all I could do to avoid making the yi er san kow-tow like Wayne and Garth in Wayne's World when they meet Alice Cooper, saying, I'm not worthy I'm not worthy I'm scum.
I'd been very impressed with Kernighan's work, in particular his literary and critical "take" on programming style in books he wrote in the 1970s.
However, I was disappointed by his essay "A Regular Expression Matcher", in Beautiful Code (O'Reilly Media, Sebastopol CA 2007) and moreover by the book.
Kernighan's essay, and the book in general, is post-Dijkstra, post-Algol, reflective in fact of an American centric anarcho-conservatism in which people retreat behind the flimsy yet obdurate walls of their favorite platforms without acknowledging that this serves corporate needs exclusively.
Mere structs are proudly labeled objects in Web hacks. In general, people are proud to be members of in-groups, not to evolve a global computing language. Dijkstra isn't even in the index despite the fact that he practically invented the very idea that code needed, as what Dijkstra called a matter of life and death, to be elegant syntactically and semantically over a lifetime of versions.
Turning to Kernighan's example of "beautiful" code, written in one hour by Rob Pike in 1998, we find the code I've placed at the bottom of this post.
[Note: Pike isn't in the index either. Oh well, he's just a programmer, right? The "programming style" movement that Kernighan helped start recentered the mere programmer as being a creative and intelligent Subject, but a new conservatism relegates him along with Dijkstra to a common grave.]
The privatized industrial *milieu* in which I fear Kernighan operates unquestioningly, preparing his students at Princeton for a lifetime in which there's never owing to profit pressures enough time to do it right, but plenty of time, and job opportunities, to do it over, is shown by the fact that to Kernighan, it's a Good Thing that Rob only took an hour.
Paradoxically, perhaps dialectically, the programmer's time, by becoming priced so high in an imaginary market, becomes worthless as becomes his authorship. A constant feature of programming is the after hours free labor of "valuable" programmers, where the worthlessness of their free time varies inversely to the market price of their billable time. I discovered at one firm that one of the Most Valuable Players had become an MVP by not ever reporting the extra hours he worked, even though he would have been paid for those hours by company policy. Rob's product is valued because it isn't "pretentious", he didn't have the right to work longer, only harder, producing, as we'll see, a flawed piece of code for posterity.
The most serious flaw is that it uses the test parameter of match as the index to the test "string" (which is only painfully a modern string, as we'll see, whence the scare quotes).
Yes, Brian, I understand that in C the programmer can use a name as a Von Neumann address to a byte now and forever, amen. I understand that test is passed as are all variables in C by value, therefore on a stack implementation (yes, the only possible implementation) the programmer is "free", o happy day, to use the stack copy, test itself, oh frabjous day, as the index itself.
"Freedom, hoy-day, freedom!" - Shakespeare, The Tempest
But I question the right of the C programmer to use something so idiomatic as confusing a value parameter, something which is not normally modifiable, as modifiable.
Brian, you cannot have your cake and eat it too, even on Christmas Day. You want to use C as a way to speak about Beautiful Code to Beautiful Minds.
But this requires the Beautiful Mind to call back to mind the fact that in most, but not all, stack virtual machines, value parameters on the stack can be slyly, in what I think to be an Ugly way, used as "work" variables.
Unless of course the C virtual machine is a piece of hardware, on which value parameters placed on the stack cannot be modified...or can be only slowly, while new values (function end results) can be pushed quickly. Such a machine, implemented in embedded hardware, is conceivable.
In the interests then of using C as you want to use it, as an Algol style publication language, you should have modified Rob's code to copy test to a private workplace. The "waste" of a cycle would save the time of the program reader.
C, as a language for talking about algorithms, is pernicious because it requires the reader to "understand" too much, viz., that it is a high level assembler language which obscures as much as it shows.
I understand your point that in Java and in C Sharp, the programmer would have been compelled (boo hoo) to use a string object, and this would be perhaps more time costly...at least on an unoptimized platform.
But you follow in the exposition after the code is presented the rule of the Duke of Wellington: never apologize, never explain. I looked for an apologetic explanation to people who don't use C that test may be used as an index in the way you use it.
You also fail to mention that the program doesn't even work for modern strings, only for what we call in .Net, sbyte, strings o' bytes. To call Rob's code in .Net, I have to create an Unsafe interface.
In 1998, it was painfully arguable that the world was still interested in scanning strings o' bytes using regular expressions that are also strings o' bytes.
Missing is Bjarne Stroustrup's Danish internationalism. In 2004 in Shenzen I read code which applied regular expressions to Chinese text in double-byte Unicode.
C programmers claim that their language is unicode-aware because they are, and they can hack it. This isn't the point. The point is information hiding, and the Java and C sharp strings can handle international input: Rob's code cannot, and will be keyed in, as Beautiful Code, by tyro programmers and it will fail. As in the case of the use of the test variable, you don't discuss this issue.
As a relatively minor cavil, I know that it has been very hip, since the days of the PDP8, to use lower case. The problem is that matchhere is ugly because it uses double h in a way that doesn't occur in English: it's Klingon, and why not bite the bullet and use matchHere?
In my opinion, camelCase is a thing of beauty in that it avoids the class bias of Proper case, refusing the first letter a special honor merely because of primogeniture, but preserving needed breaks between words. I understand that MATCH_HERE or match_here would be Coyote Ugly in a use of underscore common in the IBM, PL/I and Rexx tradition, but camelCase is to me an excellent compromise.
Also, why couldn't Rob have returned an index to the place where the regular expression returns, and, to be consistent with zero origin indexing, -1 for no hit? It's not elegant in a real timesaving way to do all the work you do, and return such a thin cold response to callers whose next question will be where the heck does the string start!
It's very confusing to have the code return 1 (affirmative) for regular expression C* in string DD until you remember "zero, one, or more" but there's no helping this. The reader has to understand the basics.
I like the way you show how to use recursion simply in the real world; too many programmers say "ooooo recursion, scarey and time wasting, don't make me think, aaaargh".
But in general, I think your C has become a lingua franca and as such a Frankfurt, a walled city with its own shibboleths.
This to me is contrary to the open, humanistic and critical spirit of Kernighan's early writings on programming style, which deconstructed bad Fortran and showed how it could be rewritten even in Fortran itself.
I may not have realized that my (computing) generation was engaged not in liberating computing from the shackles of privatized idiocy, but merely in an Oedipal destruction of the 1960s generation of the IBM fathers, creating, sad to say, its own shibboleths and orthodoxy...in which even such a brilliant man as Kernighan stays within a discourse boundary in which he cannot explain that "we use the value parameter test to index, don't worry, your copy of test will not be changed because test is called by value".
This is a computing culture dominated by American private corporations which use computing to obscure. I have learned, for example, that the securitized mortgages which are creating a global economic crisis cannot be reverse engineered in such a way as to find the original debtors, and reprice the security by determining the best risks.
This was probably because the programmers were under orders to do things in an hour or less, as Rob Pike was praised for, and they appeared to have discarded the idea of keeping keys in new tranches. I'm not an expert on this, but I have enough experience to somehow intuit that during the Big Party, yuppie scum who were getting rich were bullying programmers to emulate Rob, and discard "unnecessary features", whether international strings, or an audit trail.
Dijkstra is rolling in his grave. While he was alive, there was to my knowledge absolutely no two-way communication between Dijkstra and guys like Brian because in the American corporate context, Dijkstra's honest criticism was a speed bump.
Brian's essay, Rob's code, and the rest of the book is a disappointment, I'm afraid.
Here's the code wrapped in a value class for Microsoft C++ Visual Studio with a simple main().
public value class kernighanRegexC
{
public:
/* match: search for regexp anywhere in text */
static int match(char *regexp, char *text)
{
if (regexp[0] == '^')
return matchhere(regexp+1, text);
do { /* must look even if string is empty */
if (matchhere(regexp, text))
return 1;
} while (*text++ != '\0');
return 0;
}
private:
/* matchhere: search for regexp at beginning of text */
static int matchhere(char *regexp, char *text)
{
if (regexp[0] == '\0')
return 1;
if (regexp[1] == '*')
return matchstar(regexp[0], regexp+2, text);
if (regexp[0] == '$' && regexp[1] == '\0')
return *text == '\0';
if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
return matchhere(regexp+1, text+1);
return 0;
}
/* matchstar: search for c*regexp at beginning of text */
static int matchstar(int c, char *regexp, char *text)
{
do { /* a * matches zero or more instance */
if (matchhere(regexp, text))
return 1;
} while (*text != '\0' && (*text++ == c || c == '.'));
return 0;
}
};
int main(array ^args)
{
Console::WriteLine((kernighanRegexC::match("C", "DDC")).ToString());
return 0;
}
Recursive brain fart ((poot)*)
Of course, most users of the Pike code will pass a string literal as the "test" parameter, thereby losing the address of the satisfier.
(poot)*
(((poot)*)*)
Sorry, hopefully only one more error: I imply that you DO get the address IF you pass the test string in a variable. This is incorrect since value parameters in C are copied to the stack and, if modified, that modification is lost.
Which means that the original objection stands. The code isn't at all useful for finding WHERE the string, that satisfies the regular expression, exists in test. It merely finds WHETHER it occurs.
Is such spare functionality Beautiful? I don't know. It sure isn't Reusable.
Nonsense prevails, modesty fails
Grace and virtue turn into stupidity
While the calendar fades almost all barricades to a pale compromise
And our leaders have feasts on the backsides of beasts
They still think they're the gods of antiquity
If something you missed didn't even exist
It was just an ideal -- is it such a surprise?
What shall we do, what shall we do with all this useless beauty?
All this useless beauty
((((poot)*)*)*)
I also failed to realize that the test variable in Pike's "Beautiful" code isn't at all updated in matchhere() or matchstar() on behalf of match(): as in match() itself, the copy of test on the stack is used as a working variable. Basically, no effort at all was made by the Pikester to return the position of the satisfying string, and in the so-called real world, isn't this exactly your next question?
Summary of the flaws I think are in the Kernighan and Pike code
A discussion of the above blog post has started on programming.reddit.com in the usual way, with complaints about literary style rather than technical remarks, because, as one poster says, pointer arithmetic is hard.
Because the reddit posters complained about the political observations, I posted the following technical discussion at http://programming.reddit.com/info/63vc1/comments/.
Here is a summary of what I think are the flaws in the code Rob Pike wrote for Brian Kernighan, and which Brian presents as Beautful Code in a book of that name, published this year by O'Reilly, on page 3 (an unindented copy of Rob's code as keyed by me, and wrapped in a C++ class, is at my developerDotStar blog). I read C but have long since abandoned it because I don't think Beautiful Code can be written in C: the Pike code, I think, is an example of code that only seems Beautiful.
I won't include any postmodern theory in this particular post despite the fact that aesthetics is political.
(1) Pike's code doesn't implement a regular expression interpreter insofar as "regular expressions" have a formal, mathematical meaning. For example, it makes no provision for a character which must occur at least once.
(2) It is idiomatic C, which uses a parameter passed by value as the changeable index into the string it points-at and it expects the user to grok the fact that she has a useful value in that value parameter (the start of the text conforming to the regular expression), something unexpected outside of C. If the user passes the text as a string literal, the useful point at which the regular expression occurs is lost.
(3) While advertised as a string processor, it does not handle modern Unicode or double-byte strings containing international input.
(4) Its comments are unilluminating especially as regards need-to-know the fact that its value parameter contains a useful result (the start position of the regular expression), and Brian's discussion in Beautiful Code is unhelpful. Of course, as noted above, this address is lost when a string literal is passed.
(5) The length of the substring of characters that satisfies the regular expression is lost.
(6) Kernighan does express, in the text of the essay, the concern that the code, which does heavy recursion proportional to the string length, might run slowly or overrun stack limits.
But no consideration exists in the code that a semi-beautiful way exists to avoid most recursion. If the regular expression does not start with a string start carat or string end dollar sign, its first character, if not followed by an asterisk, is a "handle" which which can be scanned-for in a nonrecursive loop or by a strcspn function.
Even if the first character (that's not dollar or carat) is followed by an asterisk, if the asterisked character occurs frequently, the code can search for the first character, and return a match immediately upon finding it, or, entering the recursive code upon failure.
What would be Beautiful about this? The fact that it's not the sort of thing an optimizing compiler would "think" of, whereas the apparent efficiency of Pike's code is often discovered by an optimizing compiler (although an optimizing compiler would not "see" the utility of recursion in an iterative solution, to be sure).
It would, I think, amortize the cost of using a platform, whether Java or C Sharp, that handles real strings through an abstraction.
Please feel free to point out where I've gone off the rails, since I have such respect for Kernighan that I'm astonished that he feels that this snippet is Beautiful Code.
kscaldef's comment on the Kernighan code in reddit
kscaldef points out, with respect to my point (1), that you don't need in a minimal implementation to support a+ where + means one or more times: you just use aa*!
But I add at reddit that the code doesn't support nesting in the regular expression as in (a+c)*.
It's syntactic sugar to insist that Rob should have supported postfixable strings as well as characters (although the failure to support international characters is significant). But by allowing parentheses Rob could have had strings for free as in (abc)*.
I'm on admittedly shaky ground. A pure mathematician would ACCEPT only character processing, equating the char with a symbol on a Turing Machine square, and assume that any string could be represented as any one of an infinity of possible characters, as Chinese ideograms represent thousands of ideas.
A programmer like Rob simply restricts himself to the problem statement, so Rob did good.
What remains for me unforgiveable is that this hack is considered Beautiful Code by Brian, at least without a considerable discussion of its limitations. I'd accept it as neo-Primitive art if only Brian had done so.
Pike and Kernighan: .6 secs: C Sharp: 2.7
...for a long string, AFTER one million iterations.
See main blog post.


Brain Fart
It wasn't until I posted to comp.programming that in an exchange with Malcolm McLean that I realized that you DO know where the string occurs when you use the Rob Pike code.
Because the "test" variable is used as an index-address, it contains the start of the regular expression's occurence on exit from a successful search or a null length "string", in the case of failure.
Is this "elegant"? No, since you're not given the length of the string that satisfies the regular expression, and this cannot be determined from very simple regular expressions such as C*!
Furthermore, although it was sorta dumb of me not to see that the test value is a result as well as input, and to recall to mind that C "value" parameters can contain results, Brian should have spelled this out.
No, Brian, I'm not saying "Don't Make Me Think!" I am saying I would have spelled this out and not assumed a knowledge of C, because shibboleths (using idioms to select an audience) are tribal.
Primitive art linked to magic and ritual is beautiful, but it isn't beautiful to in effect celebrate 1971 as a computing year in which the theory of parameter types was in its infancy.
It is said that elegance is in the mind of the beholder, and a reputable literary theory, reader-response, affirms that in part literary elegance depends on an audience.
But this means that the very word, "elegance", has to be subscripted with the name of the audience.
The C expert no doubt gets a *frisson* of delight when he sees, a lot faster than I, that (1) test is used as a work area and (2) test gives back the starting address of the satisfier string.
However, what's missing is an acknowledgement that there's an audience out there of intelligent professional programmers who don't use C, or who used to, but then got help. What's missing is a true computing cosmopolitanism, that of the Algol publication language, that was short-changed by IBM's development of Fortran, and the Eisenhower administration's post-Marshall Plan desire to let the Western Europeans know that they were junior partners.