logo
Published on developer.* Blogs (http://www.developerdotstar.com/community)

SQL Server Job Success, Failure, and Everything In-Between

By Donna L Davis
Created 2005-12-14 12:13

If your scheduled SQL Server job fails at 2:00AM when no one's checking, will anyone know, or is it like a tree falling silently in a forest? "No problem," you might reply. You have email notification enabled.

Email notification on SQL Server job status presents at least three dilemmas:

#1 Who receives notification?

If you send notification to an individual, that person could be on vacation or otherwise unable to receive the email. If that concern is addressed by sending notification to a group, who will take action? Each person may assume someone else will respond. If you decide to send notification to a generic email account that can be checked, the timeliness of notification is lost and depends on dutiful employees remembering to check. If you decide to send the notification to a pager (which is rotated among on-call staff), employees may become weary of receiving pages at all hours of the night, when jobs are scheduled to run.

#2 Should you notify on job/step success?

If you set up the job to notify on success, you may be surprised at the results. SQL Server interprets "success" in the broadest sense of the word: something ran and it must be good. No matter that your primary table was truncated but wasn't repopulated. When it comes to interpreting success for jobs that actually *do* something involving data (and not just backups), it seems that SQL Server is a hound dog with its head out the car window and ears flapping in the wind. Plus, sending a notification every time a job succeeds is pretty much crying wolf. Who's going to notice or care that a message *wasn't* received?

#3 Should you notify on job failure?

Notification on job failure seems reasonable, but what happens if something prevents the job from even starting? Is that failure? Not to SQL Server...not unless the condition causing the job not to run (such a the backup job getting hung up, suspending all future operations) is trapped and sends a notification. What if the problem is serious enough to prevent the service responsible for notification to run?

An Approach (or Two)

I understand one approach to this general quandary is to create a single job success/failure report by querying the msdb database for job history information, and email that to the DBA. That at least reduces the number of notifications, but takes us back to dilemma #1, requiring the DBA's email to be forwarded to someone else in her absence, and also depends on a DBA who doesn't mind going blind reviewing what could be a substantial report, trying to pick the needle failure out of a job run haystack.

Yet, as I mentioned before, native definitions of success are often not granular enough to be truly useful or reliable. I haven't looked at SQL Server 2005 enough yet to know if things have improved. Yet, I can imagine a custom solution (that I never seem to have time to work on): Every critical job could conclude with a call to a custom stored procedure that will check the pulse of the primary tables (record counts, etc) affected by the job and update a table with the server, application, job name, date/time, and a health indicator. (This would work conceptually like the exception handling we usually build into our applications.) Then we'd build a simple front-end dashboard application for the Data Center so that the first thing they'd see when they come in is an attractive interface with color-codes based on health indicator flags for all the jobs important enough to track. A red light means something is wrong (for real): call somebody.

I can imagine that these dilemmas can't be uncommon. If someone else has a another solution or a good workaround, I'd love to hear about it.


Source URL:
http://www.developerdotstar.com/community/community/node/305