Sunday, September 28, 2008

I've Tried A! I've Tried B! I've...

I don't do tech stuff any more. There are times that I wish I did, and there are times that I'm glad that I don't. Oddly enough, both of those are occurring right now.

About five minutes into the preparation for brunch, my wife got a call from a disaster test site saying that they were having a serious problem. Most times, calls from DR sites are serious but easily fixable -- a piece of JCL refers to an esoteric that's not defined on that system, or a person's userid doesn't have the authority it needs, something like that. Sometimes, it's more. This is one of those times. In a nutshell, four mainframe systems which share DASD and tape storage are locked up (in operator speak, 'locked up tighter than a drum'). This isn't quite the case; you can enter commands, and the systems will nod and smile and say 'just hang on a sec, I'll get to you when I can.' This can be disconcerting when the idea of a command is a response, and not 'go 'way, you bother me.' You want the system to snap to. When it doesn't, you get an 'oh, crap' feeling. When it's all four machines, you get the same feeling, quadrupled.

We don't know what the problem is, and the odds are that we -- or more accurately they; I'm just an interested observer -- are going to have to punch the systems. They're hoping 'not all of them' (said with a tone of panic on the part of the guy running the drill), but you don't know; the problem could be on any one of the systems, so that your initial odds are 3:1 against picking the bad one, or it could be two systems that aren't playing nice with each other, and the other two are standing around like observers at a knife fight. You know, someone should do...something. Call the cops...something. You really don't like punching a system, particularly when you don't know (or strongly believe) that it's going to fix the problem. And the knowledge that afterward, other people will be second-guessing you (well, heck, I'd have...) doesn't make for joy in Mudville, either.

So I'm glad this isn't my problem. And yet, you know? I almost wish it was. There's just such a joy when you figure it out, you get it to work, you unknot the string without cutting it. Doesn't happen every time, and the weirder a problem gets, the less likely a fix will occur to you. Sometimes, it just unknots all by itself. Hey, did you do something? I didn't do anything, but it looks like it's running now. Sometimes, what you do to get ready to do something abruptly fixes it. Hey, I cancelled the active userids so we could punch the system, and suddenly it cleared up. Sometimes, you find the end of the string -- hey, that task has got 60% of the damn storage, that's why all the others are swapping out, let's cancel it. Sometimes, nothing works. Which is, at the moment, where they are.

So I miss it. And I don't.
=====================
UPDATE: The problem was that on a regular system, the automation subsystem routinely submits jobs to dump SMF. On this system, that wasn't necessary - but the jobs were submitting anyway. They'd hang looking for the production file name, thus tying up JQEs. Eventually, the system ran out of JQEs, and at that point, new jobs couldn't be started, including TSO signons.
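
For the curious, the shape of that failure can be sketched in a few lines of Python. This is a toy model, not how JES actually manages its job queue: the `JobQueue` class, the pool size, and the job names are all made up for illustration. It only shows how jobs that hang and never give back their slot eventually starve everything else, signons included.

```python
class JobQueue:
    """Toy model of a fixed pool of job queue elements (JQEs).

    Hypothetical illustration only; real JES bookkeeping is far more involved.
    """

    def __init__(self, total_jqes):
        self.total_jqes = total_jqes
        self.in_use = set()  # names of jobs currently holding a JQE

    def submit(self, name):
        """Try to start a job; return False if no JQE is free."""
        if len(self.in_use) >= self.total_jqes:
            return False
        self.in_use.add(name)
        return True

    def finish(self, name):
        """Release the JQE held by a job that has completed."""
        self.in_use.discard(name)


def first_failed_signon(total_jqes, ticks):
    """Return the tick at which a signon first fails, or None if none does."""
    queue = JobQueue(total_jqes)
    for tick in range(ticks):
        # Automation submits a dump job that hangs waiting for a file name
        # that does not exist on this system, so it never finishes.
        queue.submit(f"SMFDUMP{tick}")

        # A signon needs a JQE too, but gives it back when it is done.
        user = f"TSOUSER{tick}"
        if not queue.submit(user):
            return tick
        queue.finish(user)
    return None


if __name__ == "__main__":
    print(first_failed_signon(total_jqes=5, ticks=100))
```

Nothing is wrong with any single job here; the trouble is the slow leak, which is why the systems could still nod and smile for a while before they stopped taking new work.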

2 comments:

Unknown said...

DASD, JCL? Race conditions?

Oo, I love race conditions. We used to get them on Oracle databases when they competed (with the OS) for memory.

One time, someone managed to get the database to compete with itself. The results were "interesting". Especially as the decision was made to cut testing and just roll the thing into production. Because... The users were a prickly bunch. Oh, were they ever after that little calamity!

(I seem to remember it was something about using files to store intermediate data, and then recalculating the file, and using a file to store the intermediate data, and so on. The problem was small and unnoticed with 5 testers; with lots of users, it wasn't a small problem!)

I don't miss the 2AM calls, the at-lunch calls or the "but I just got home" calls.

Carolyn Ann

Cerulean Bill said...

DASD - storage. JCL - job control language. I don't know what you mean by 'race conditions', but it sounds like enqueues, or locking conditions.

This is why 'test in production' is a common phrase.