Software Development
Blogs and Discussion
developer.*

Does Software Need an Apgar Score?

This morning I was reading the Oct 9, 2006 issue of The New Yorker magazine, which contains an interesting article called "The Score: How childbirth went industrial." The article is about the process of childbirth, and more specifically, the medical techniques and industry, for lack of a better word, that have developed around childbirth. When I started reading the article, I was not expecting to encounter an intriguing idea related to software development--but ideas often spring from unexpected sources.

The article narrates the dramatic improvements in childbirth mortality rates that were achieved in the 20th century. Author Atul Gawande describes a stark situation for childbirth in the U.S. in the 1930s:

But in 1933 the New York Academy of Medicine published a shocking study of 2,041 maternal deaths in childbirth. At least two-thirds, the investigators found, were preventable. There had been no improvement in death rates for mothers in the preceding two decades; newborn deaths from birth injuries had actually increased. Hospital care brought no advantages; mothers were better off delivering at home. The investigators were appalled to find that many physicians simply didn’t know what they were doing: they missed clear signs of hemorrhagic shock and other treatable conditions, violated basic antiseptic standards, tore and infected women with misapplied forceps. The White House followed with a similar national report. Doctors may have had the right tools, but midwives without them did better.

The author then describes how childbirth was improved through standardization of techniques, training, and regulation of who exactly was allowed to perform certain procedures (based on whether they had the training and experience):

These standards reduced the number of maternal deaths substantially. In the mid-thirties, delivering a child had been the single most dangerous event in a woman’s life: one in a hundred and fifty pregnancies ended in the death of the mother. By the fifties, owing in part to the tighter standards, and in part to the discovery of penicillin and other antibiotics, the risk of death for a mother had fallen more than ninety per cent, to just one in two thousand.

But the situation wasn’t so encouraging for newborns: one in thirty still died at birth—odds that were scarcely better than those of the century before—and it wasn’t clear how that could be changed.

Later in the article, the author describes huge improvements that came in the second half of the century:

In the United States today, a full-term baby dies in just one out of five hundred childbirths, and a mother dies in one in ten thousand. If the statistics of 1940 had persisted, fifteen thousand mothers would have died last year (instead of fewer than five hundred)—and a hundred and twenty thousand newborns (instead of one-sixth that number).

How did these huge improvements happen?

...a doctor named Virginia Apgar, who was working in New York, had an idea. It was a ridiculously simple idea, but it transformed obstetrics and the nature of childbirth. ... she took a less direct, but ultimately more powerful, approach: she devised a score.

The Apgar score, as it became known universally, allowed nurses to rate the condition of babies at birth on a scale from zero to ten. An infant got two points if it was pink all over, two for crying, two for taking good, vigorous breaths, two for moving all four limbs, and two if its heart rate was over a hundred. Ten points meant a child born in perfect condition. Four points or less meant a blue, limp baby.

The score was published in 1953, and it transformed child delivery. It turned an intangible and impressionistic clinical concept—the condition of a newly born baby—into a number that people could collect and compare. Using it required observation and documentation of the true condition of every baby. Moreover, even if only because doctors are competitive, it drove them to want to produce better scores—and therefore better outcomes—for the newborns they delivered.

Is anyone else seeing the parallels to software development here? Am I off base thinking that the idea of a simple "score" to assess the "health" or "quality" of a software system could be a powerful tool, as it was, apparently, for obstetrics? Such a score, I think, would have to be as simple as described above for the Apgar score, based on easily observable phenomena, and only marginally dependent on subjective factors.

Typical "ility" descriptors used often in software circles would, I think, be ineffective in this context. Is the software maintainable? Is the software secure? These are too fuzzy, too broad, too open to interpretation. We need things more like, is the baby pink or blue? Is the baby crying?

Does the application have an exception handling scheme, or doesn't it? Is there clean separation of presentation logic and business logic, or isn't there? These are things that can be assessed relatively objectively by someone with the proper knowledge and experience.
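To make the idea a little more concrete, here is a minimal Python sketch of what an Apgar-style score for software might look like. It assumes each trait is a yes/no observation worth two points, mirroring the article's description of the newborn score. The two traits come from the questions above; the names `Criterion`, `CRITERIA`, and `software_apgar` are invented for illustration, and a real score would presumably need more traits than these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    key: str
    question: str
    points: int = 2  # assumption: two points per trait, as in the article


# Hypothetical trait list, taken from the two questions in the text.
CRITERIA = (
    Criterion("exception_handling",
              "Does the application have an exception handling scheme?"),
    Criterion("separation",
              "Is presentation logic cleanly separated from business logic?"),
)


def software_apgar(observations: dict[str, bool]) -> tuple[int, int]:
    """Return (points earned, points possible) for yes/no observations."""
    earned = sum(c.points for c in CRITERIA if observations.get(c.key, False))
    possible = sum(c.points for c in CRITERIA)
    return earned, possible


if __name__ == "__main__":
    earned, possible = software_apgar(
        {"exception_handling": True, "separation": False}
    )
    print(f"{earned} out of {possible}")
```

A trait that the assessor did not record counts as "no", which keeps the score conservative.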

I've spent the last five months on my day job spelunking through our client's entire enterprise of back office and web software systems, reading tons of old code, combing through databases, and trying to make sense of arcane user interfaces. All this has resulted in about 500 pages of maintenance documentation for my client, who recently purchased a company and wanted to know whether the software it purchased would scale up for further growth and acquisition. I've done similar system assessments many times in the past (though not usually this large), and my gut tells me that assigning a score to each of the systems I've examined would be very useful to my client.

On the front side, with a known scoring system in mind, is the novice (or even "senior") developer more likely to produce software that at least attempts to implement the necessary elements to achieve a respectable score? Would developers working in team environments encounter the same peer-related effects as seen in obstetrics?

The Apgar score changed everything. It was practical and easy to calculate, and it gave clinicians at the bedside immediate information on how they were doing. In the rest of medicine, we measure dozens of specific things: blood counts, electrolyte levels, heart rates, viral titers. But we have no measure that puts them together to grade how the patient as a whole is faring. It’s like knowing, during a basketball game, how many blocked shots and assists and free throws you have had, but not whether you are actually winning. We have only an impression of how we’re performing—and sometimes not even that.

More parallels with software development abound in this article about birthing babies:

There’s a paradox here. Ask most research physicians how a profession can advance, and they will talk about the model of “evidence-based medicine”—the idea that nothing ought to be introduced into practice unless it has been properly tested and proved effective by research centers, preferably through a double-blind, randomized controlled trial. But, in a 1978 ranking of medical specialties according to their use of hard evidence from randomized clinical trials, obstetrics came in last. Obstetricians did few randomized trials, and when they did they ignored the results.

...

The question facing obstetrics was this: Is medicine a craft or an industry? If medicine is a craft, then you focus on teaching obstetricians to acquire a set of artisanal skills—the Woods corkscrew maneuver for the baby with a shoulder stuck, the Lovset maneuver for the breech baby, the feel of a forceps for a baby whose head is too big. You do research to find new techniques. You accept that things will not always work out in everyone’s hands.

But if medicine is an industry, responsible for the safest possible delivery of millions of babies each year, then the focus shifts. You seek reliability. You begin to wonder whether forty-two thousand obstetricians in the U.S. could really master all these techniques.

For the record, I've always been skeptical of software engineering metrics and their usefulness in the software shops where I've worked. (See my "Balance in Scoring" comment, below.) But something about the directness, simplicity, and apparent effectiveness of the Apgar score struck me. The author of this article, himself a physician at the Harvard School of Public Health, was, it seems, struck by it also:

In a sense, there is a tyranny to the score. Against the score for a newborn child, the mother’s pain and blood loss and length of recovery seem to count for little. We have no score for how the mother does, beyond asking whether she lived or not—no measure to prod us to improve results for her, too. Yet this imbalance, at least, can surely be righted. If the child’s well-being can be measured, why not the mother’s, too? Indeed, we need an Apgar score for everyone who encounters medicine: the psychiatry patient, the patient on the hospital ward, the person going through an operation, and the mother in childbirth. My research group recently came up with a surgical Apgar score—a ten-point surgical rating based on the amount of blood loss, the lowest heart rate, and the lowest blood pressure that a patient experiences during an operation. We still don’t know if it’s perfect. But all patients deserve a simple measure that indicates how well or badly they have come through—and that pushes the rest of us to innovate.

Even if this is a good idea for software, obviously many details would need to be discussed and worked out. For example, would we really need multiple scoring systems for different kinds of software? However, I'll stop writing at this point and assess whether there is any further interest in this idea. What do you think?

Thanks for reading,
Dan

P.S.
I've quoted heavily from it here, but I recommend the full article as a worthwhile read.

Interesting, yes. Practical? Well....

This is definitely thought-provoking, esp. considered in light of an "After the Gold Rush"-flavored push towards more organization and standards in the field. But as you say at the end, the devil is in the details.

Would you need a basic Apgar score for pretty much all software, and then subsequent industry/context specific additions to it? Would there be a side-effect where more software gets written to meet a pre-determined score level, but still largely misses the wider notion of "quality"? I'm with you on a generalized distaste for metrics (lies and damn lies, if I may paraphrase Twain). And so this notion kinda falls into that hole for me. BUT... if an effective rubric could be developed, one that subtly enforced overall quality via the total impact of each scored component, you might have something.

I guess I think overall, my doubt springs from the fact that very little in software is as simple as "is the baby moving all four limbs". Sure, there are some very basic elements that are pretty much true/false. But so many are contextually variable.

Nonetheless, thought-provoking as always, D!

For any metric you can design, I can write crappy software that meets the metric perfectly. Perhaps a score where software cannot be good without meeting the metric is possible and useful (two points for on time, two points for making it into actual use, two for users NOT crying and having a heart rate over 100).

The problem is that we know how a baby is supposed to be, and variation from that end result comes down to only a few factors, for each of which it is obvious whether it is important or not.

Balance in Scoring

Brad writes:

For any metric you can design, I can write crappy software that meets the metric perfectly.

and Andy writes:

I'm with you on a generalized distaste for metrics (lies and damn lies, if I may paraphrase Twain). And so this notion kinda falls into that hole for me. BUT... if an effective rubric could be developed, one that subtly enforced overall quality via the total impact of each scored component, you might have something.

Fair points. I think the key to making something like this work is balancing objectivity, subjectivity, and observability. Brad has a good point that, "we know how a baby is supposed to be." But, allowing for variations among different classifications of software programs, I don't know if I'm ready to concede that we can't identify some fundamental qualities or inclusions that a software program should have. Once we have that, there may also be a way to allow for a degree of subjectivity in the evaluation.

I also think it's useful to disconnect the concept of "score" from the concept of "metrics." I think people associate "metrics" with things like "how many bugs did the program have," "what percentage over budget are you on that task," and "how many bugs did you open and close today in the bug tracker?" and also with measurements of size and complexity. Counting function points and person-months interests me not at all. It may be important to disconnect the "score" from anything related to process, schedule, size, and complexity.

Perhaps the theme here is clearly defining the limits of the score, since "overall quality" type issues are so large, subjective, and hard to evaluate. On the other end of the scale, perhaps we should also avoid scoring elements that involve counting or "measuring." What I like about all of the Apgar score criteria is that they are all essentially binary, and easy to observe.

I do see a challenge in the relationship between items that are scored and requirements. For example, should we withhold extra points if the programmer did not implement logging as part of the exception handling scheme? Maybe the requirements didn't call for that, or maybe the programmer was expressly told not to add logging. Perhaps, though, if logging were seen as something important by virtue of the fact that it had been included in the scoring, then maybe stakeholders would empower programmers to make sure the system has logging, lest the score suffer.

My immediate reaction to Brad's statement, "For any metric you can design, I can write crappy software that meets the metric perfectly," was that if you included an error handling scheme in your crappy program just to get your score up, then that doesn't seem like such a bad thing. I suspect that the sentiment here is linked to that other association with "metrics" as a means of control in an organization to get people to work harder, produce more, and make fewer mistakes. My thinking about this score concept is, I think, separate from that--more of a bottom-up approach, in which programmers have a way of evaluating whether a piece of software meets some basic criteria; management can catch up with us later.

But I'm ready to let this idea go if it indeed does not have legs...

Dan

The baby and the bathwater

The problem with an Apgar score is that the Apgar score can be checked against and supplements the physical baby as the poor little tyke squirms in the nurse's hands, poor little perisher that he is.

You can't "game" the score.

Whereas software is a pure or *reinen* Idea on which we must focus, and gaming our Apgar score is a distraction.

Ultimately, a work of software must be judged as we judge a literary work, with the added constraint that it must be mathematically correct. This is the hard fact which "software engineers" try to avoid, because it means that the only qualified assessors of software are either collectives of users, or individuals who have been spared educational moronization and tracking into "SAT Verbal" or "SAT Math".

Scorable Traits

Since the topic of my post falls squarely into his area of expertise, I emailed a link to Robert L. Glass. He was kind enough to read it and reply with some comments, which he agreed to let me share:

A thought-provoking post. Your original idea is a good one, but/and your readers have done a good job of identifying the problems.

I think the key to making this work is easily-scored traits. And that's a HUGE problem. As you point out, metrics are different from scores, and I think they're a wrong path to go down. (Metrics are also hugely controversial--far too many are more a gleam in an academic's eye than anything meaningful to practice). Perhaps someone should hold a contest to identify software Apgar candidates. I could do something in The Software Practitioner. Want to put together an article?

You glossed over something else that I think is important, the idea that "nothing ought to be introduced into practice unless it has been properly tested and proved effective by research centers." Implementing that idea alone would make for dramatic changes in how we do computing research; it's an idea I've been arguing for (unsuccessfully!) for far too long. That may not result in an Apgar either, but it would sure improve both how practice functions and how research and practice interact.

I agree with Mr. Glass that coming up with the list of "easily-scored traits" is key, and appears to be the next step in taking this idea further--and should indicate whether there's anywhere further to go. I might try getting a list going with developer.* readers first, and then perhaps we can open it up to Bob's Software Practitioner readers if we have something going.

The issue raised in the last paragraph of Bob's comments is interesting in that Bob and I seemed to draw opposite conclusions. Here was my comment back to Bob:

One of the reasons I included that quote from the article about "nothing ought to be introduced into practice unless it has been properly tested and proved effective by research centers" is that the article seemed to be crediting obstetrics with being successful in spite of (or even because of) this lack of pre-confirmation. This reminded me of the general disconnect between software academe and practice, and also of the "practice often precedes theory" idea that you have discussed in your writing.

Thanks for reading,
Dan

P.S.
I can't resist getting in a plug for the two books by Robert L. Glass published by developer.* Books, Software Conflict 2.0 and Software Creativity 2.0.

Score candidates

Here are some candidates. Assume everything is worth one point, for now.

  • Error Handling
  • Error Logging
  • Comments that make explicit reference to the requirement supported by the code in question
  • Descriptive names
  • Internally consistent naming convention

I picked these because I can imagine taking any non-expert and bringing them up to speed on the scoring system in a couple of hours. Thus, I left out things like global variables, because that might require a level of expertise that may not be available. I'm thinking in terms of a system that the client can use to score the program. I'm not sure about 'descriptive names' because there is a subjectivity about it that may be at odds with objective scoring.

Another reason for picking these is that they isolate the 'code score' from the 'program score' in the same way that Apgar separates physical health from mental and emotional health. Not that we should be ignoring other factors, but other issues probably require different scoring systems.
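As a rough illustration of how a client might record such a score, here is a small Python sketch that turns yes/no answers for the five candidates above into a one-point-each score card. The keys, the `CANDIDATES` table, and the `score_card` function are hypothetical names chosen for this example, not part of any existing tool.

```python
# One point per candidate trait, as proposed in the list above.
CANDIDATES = {
    "error_handling": "Error handling",
    "error_logging": "Error logging",
    "requirement_comments": "Comments that reference the supported requirement",
    "descriptive_names": "Descriptive names",
    "consistent_naming": "Internally consistent naming convention",
}


def score_card(answers: dict[str, bool]) -> str:
    """Render a plain-text score card from the reviewer's yes/no answers."""
    unknown = set(answers) - set(CANDIDATES)
    if unknown:
        raise ValueError(f"Unknown traits: {sorted(unknown)}")

    lines = []
    total = 0
    for key, label in CANDIDATES.items():
        point = 1 if answers.get(key, False) else 0  # unanswered counts as 0
        total += point
        lines.append(f"[{point}] {label}")
    lines.append(f"Total: {total}/{len(CANDIDATES)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(score_card({"error_handling": True, "error_logging": True}))
```

Rejecting unknown trait names is a deliberate choice in this sketch: it keeps a typo in the reviewer's notes from silently lowering the score.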

See my blog if you like on this matter

See my blog post "Apgar score: less than zero"

[Editor's note: for those interested in following the discussion, an interesting thread has also sprung up on Edward's blog post, starting with this comment by regular developer.* reader Ron Porter.]

The Joel Score

The Joel Test is probably at a higher level than you were intending, but I think it is pretty quick and pretty telling.

Yes you can game it and the code can still be horrible, same for ISO-9000 and CMM. But the Apgar doesn't tell you if the baby has a tumor or genetic heart problem. It isn't trying to replace MRI machines.

I think we should improve on The Joel Test (e.g. automated unit tests, quality/quantity of documentation, etc.).

The Joel Test is Process-Oriented

Mark, thank you for pointing out The Joel Test. I had read this essay by Joel a couple times (once online, and once in the Joel on Software book), and it didn't even come to mind in my thinking about this Apgar score idea. Now, thanks to your link, I've read it a third time, and I can see why it did not come to mind: Joel's score is process-oriented. It measures, "how do you go about building software?" as opposed to, "how does this particular piece of software measure up to some idea of minimum quality?"

I don't mean this as a criticism of Joel's Test, nor as a criticism of your suggestion. I do, however, see the software Apgar as more about the *what* than the *how*, if that makes sense. There is a relationship, of course, between the what and the how. As I understand it, in obstetrics the Apgar does not judge hospital procedures, doctor and nurse training, sanitary conditions, equipment quality, etc. But of course, the how influences the what. If a maternity ward is doing a poor job with procedures, training, cleanliness, and equipment, then the babies it brings into the world might have lower Apgar scores. (Though of course it's more complicated than that, with things like the diet and health of the mother for the previous nine months having a huge influence.)

I really like this statement from your comment, Mark:

But the Apgar doesn't tell you if the baby has a tumor or genetic heart problem. It isn't trying to replace MRI machines.

Thanks again for the contribution.

Dan

Other categories

I bet it's best to pick the priorities for your project, and then devise a unique set of tests before you get going. That being said, here are some guidelines that I wish were in place for every project I've worked on.

1. No stupidly big code files, say over 1000 lines.
2. No commented out code.
3. No unused files, functions, or lines of code (this one could be tricky to enforce)
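The first guideline is the easiest of the three to check mechanically. Below is a minimal Python sketch that lists source files exceeding a line limit. The function names, the default 1000-line limit, and the default `.py` suffix filter are assumptions for the example; guidelines 2 and 3 would need more judgment than a simple line count.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count lines in a text file, tolerating odd encodings."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def oversized_files(
    root: str,
    limit: int = 1000,
    suffixes: tuple[str, ...] = (".py",),
) -> list[tuple[str, int]]:
    """Return (path, line count) for each file under root longer than limit."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            line_count = count_lines(path)
            if line_count > limit:
                results.append((str(path), line_count))
    return results


if __name__ == "__main__":
    for name, line_count in oversized_files("."):
        print(f"{name}: {line_count} lines")
```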

Getting Us Started

Thanks, Dave! I hope to post a new blog post soon to summarize some of the things that have been said in these discussions, and to start exploring some actual "score-able traits" (like those in your list), program classifications, and systems for scoring.

Best,
Dan

Balanced scorecard

At a recent management review, I was asked a great question: "Name a feature of the recent release which was directly requested by a customer". A software Apgar should include some indicators of whether the right software was built.

It's true that all metrics can be gamed. This is a direct consequence of the fact that what we do is hard. But it doesn't follow that they are useless.

I use the number of commits per month as an indicator of team morale. It's a very crude one, but it has some utility. Either it confirms what I already knew, but in an objective way that I can quote (my bosses don't want to rely on my gut feelings), or it contains a surprise, in which case I have to do some listening and thinking. Sometimes the conclusion is that the result is an artefact, e.g. due to holidays. But any reality check is good, even an imperfect one.
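For anyone who wants to try the same crude indicator, a minimal Python sketch of the tally might look like this. It assumes the commit dates have already been exported from the repository log by whatever means the version control system provides; the function name `commits_per_month` is made up for the example.

```python
from collections import Counter
from datetime import date


def commits_per_month(commit_dates: list[date]) -> dict[str, int]:
    """Tally commits by calendar month, keyed as 'YYYY-MM' in date order."""
    counts = Counter(d.strftime("%Y-%m") for d in commit_dates)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up sample dates, purely for illustration.
    sample = [date(2006, 10, 1), date(2006, 10, 15), date(2006, 11, 2)]
    for month, count in commits_per_month(sample).items():
        print(month, count)
```

A month with no commits simply does not appear in the result, so a gap in the output is itself worth a second look (holidays, for example).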

One important guideline is not to reward the team for improving the metrics. That is asking for trouble. An objective measure of the fitness for purpose of the product is not possible.

Incidentally, I'm sceptical about the idea that doctors can't game the Apgar score. We'd have to ask an obstetrician, but babies are more complicated than software products. In the UK we have free health care, which also means rationed health care, and we have lots of experience of hospital administrators gaming targets.

CVS - the Vessel of The Awful Truth

I tend to agree with Chris that there's perhaps more value in repository changes than in many other metrics. My personal take on it is that it might be a good idea to look at who writes code that gets delta'ed a lot. Check out my blog entry for a complete musing on the subject.

working software <-> good software

Hi,

Reading your article, Dan, and then the comments, I feel that two things might be getting mixed up here. To me, the original Apgar score sounds like a good way to see whether the baby is actually well and alive, to see whether the baby is 'working'. The Apgar score doesn't tell you whether the brain has any damage, whether the kidneys are working, and many more things. So I think the Apgar score is good at telling whether the baby will survive the next few days, but it cannot tell whether the baby will develop into a healthy grown-up, and I guess it wasn't meant to.

The same goes for software. We already have instruments for knowing whether a program 'works'. Unit testing, for example. Unit tests have the binary quality that you mentioned. They tell you whether that thing is running or not. But of course, they don't tell you whether the design of the software is good, or whether the architecture is good.

In the same way, the Apgar score doesn't tell you whether the baby has a high potential of getting cancer in the future.

To get this information, the baby would have to be checked very intensively, and the resulting report would surely consist of many subjective opinions.

The same goes for analyzing a program in terms of design and architecture.

My 2 cents :o)

Cheers

Sven

All views expressed by authors, bloggers, and commentors are their own and do not necessarily reflect the views of developer.* or its proprietors.

All content copyright ©2000-2005 by the individual specified authors (and where not specified, copyright by Read Media, LLC). Reprint or redistribute only with written permission from the author and/or developer.*.

www.developerdotstar.com