What happened with the Y2K problem, anyway? The world didn’t end. I’m sure I would have noticed if the world had ended — there would have been something on TV.
What happened, then?
One can spend a mildly diverting afternoon or two trying to answer this question with the assistance of the internets.
It seems clear that some large amount of money was spent on preemptive remediation. A November 1999 report[PDF] from the US Department of Commerce puts the American cost at “in the neighborhood of $100 billion”. The typical worldwide figures being thrown around are $300 billion and up, although nobody seems particularly motivated to provide provenance for those numbers. On one hand, this is a whooooole lot of money. On the other hand, nobody gave any of it to me. On the other other hand, assuming they didn’t just take it out and burn it KLF-style, that much money probably represents a lot of employment for a lot of software types. So: hooray? To the extent that employment is generally preferable to unemployment, sure. Compared to other things that might have been done with that money (possibly involving much more exciting software work): not-hooray.
It’s harder to get any sort of picture of Things That Went Wrong, with Wikipedia’s list being about as good as anyone’s. Reported problems due to unfixed Y2K bugs were generally minor – incorrect dates on paperwork, email messages erroneously deleted, and so forth. However, privacy and other organizational barriers to reporting mean that the full scale and cost of experienced problems will likely never be known.
Ok, but it’s in the past now, yeah?
Y2K was an exceptional event in that an enormous number of software makers had independently made the same date representation decisions and so the overflows and their consequences were all due at the same time (modulo time zone). It was unexceptional in that any representation of a date in a real (and thus discrete) system must have some limit, with probably-negative consequences if the limit is exceeded. For any day on the calendar there is a piece of software – extant or potential – that will overflow a date field on that day.
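To make that concrete, here’s a sketch of the kind of arithmetic that goes wrong when only two digits of the year are stored. (Nobody’s actual code, and the real stuff was mostly not in C, but the shape is the same.)

    #include <stdio.h>

    /* Illustrative only: a record that keeps the year as two digits,
     * the way a great deal of 60s and 70s software did to save space. */
    struct account {
        int open_year;   /* two-digit year: 87 means 1987 */
    };

    /* Age of the account in years, computed the "obvious" way. */
    int account_age(const struct account *a, int current_year)
    {
        return current_year - a->open_year;
    }

    int main(void)
    {
        struct account a = { 87 };                          /* opened 1987   */
        printf("age in 1999: %d\n", account_age(&a, 99));   /* 12: fine      */
        printf("age in 2000: %d\n", account_age(&a, 0));    /* -87: not fine */
        return 0;
    }

The limit here happens to be 99; pick a roomier representation and you merely move the limit somewhere else.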
Meanwhile, the “Unix Millennium” approaches: 03:14:07 UTC on Tuesday, 19 January 2038 (no time zone effect this time) is the last second representable by a signed 32-bit time_t counting seconds since the POSIX epoch. Embedded software, with its emphasis on saving space and its generally narrower processors, is particularly likely to be vulnerable.
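You can watch the rollover without waiting for it. This sketch assumes a platform where time_t is a signed 32-bit count of seconds since the epoch (a fair description of plenty of embedded systems) and simulates the wrap with an explicit int32_t:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    static void show(const char *label, time_t t)
    {
        char buf[32];
        struct tm *tm = gmtime(&t);
        if (tm && strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S", tm))
            printf("%s %s UTC\n", label, buf);
        else
            printf("%s (not representable here)\n", label);
    }

    int main(void)
    {
        /* 2^31 - 1 seconds after the epoch... */
        int32_t last = INT32_MAX;

        /* ...and what a 32-bit counter holds one second later: on
         * two's-complement hardware the value wraps to INT32_MIN. */
        int32_t wrapped = INT32_MIN;

        show("last representable second:", (time_t)last);     /* 2038-01-19 03:14:07 */
        show("one second later:         ", (time_t)wrapped);  /* 1901-12-13 20:45:52 */
        return 0;
    }

On a system with a 64-bit time_t the program simply prints both dates; the point is what a field that can only hold 32 bits does with that second one.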
No, really, we care because?
“Your date storage will eventually overflow” is a useful thing to remember, as is the generalized “any monotonically increasing counter will eventually overflow”. However, Y2K is a nice big illustration of something more general still: when you first write software, you don’t know where it will end up.
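The counter version of the lesson turns up in places that have nothing to do with calendars. Plenty of embedded systems keep a 32-bit millisecond uptime counter, which wraps after a little under 49.7 days, and timeout code that compares absolute counter values rather than differences goes quietly wrong at the wrap. A sketch, with made-up names:

    #include <stdio.h>
    #include <stdint.h>

    /* Buggy deadline check: compares absolute tick values, so it misfires
     * when the deadline has wrapped past zero but "now" has not yet. */
    int timed_out_buggy(uint32_t now_ms, uint32_t deadline_ms)
    {
        return now_ms >= deadline_ms;
    }

    /* Wrap-safe check: unsigned subtraction gives the true elapsed time
     * as long as it is under 2^32 ms (~49.7 days). */
    int timed_out_safe(uint32_t now_ms, uint32_t start_ms, uint32_t timeout_ms)
    {
        return (uint32_t)(now_ms - start_ms) >= timeout_ms;
    }

    int main(void)
    {
        uint32_t start    = 0xFFFFFF00u;     /* counter is about to wrap       */
        uint32_t now      = start + 100u;    /* 100 ms later, no wrap yet      */
        uint32_t deadline = start + 500u;    /* 500 ms timeout: wrapped to 244 */

        printf("buggy says timed out: %d (after only 100 ms)\n",
               timed_out_buggy(now, deadline));          /* 1: wrong */
        printf("safe  says timed out: %d\n",
               timed_out_safe(now, start, 500u));        /* 0: right */
        return 0;
    }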
The people who wrote 2-digit-date software in the 60s and 70s might well have imagined that it would be replaced long before the year 2000, or that it wouldn’t matter to them because they’d be off in a moon colony by then anyway. Which is silly, because surely moon colonies would use software too.
The last half-century has given us a bit more information to work with. Firstly, we’re probably not going to get to live in a moon colony. Secondly, software always seems to outlive its expected span. Not only that, but there appears to be some law under which the egregiousness of a hack is proportional to its lifetime. Search your feelings, you know it to be true. So the lesson there is straightforward, if unwelcome: it’s safest to assume indefinite lifetime. This doesn’t just mean that counters can overflow and resources be exhausted, but also that low-probability events have more time in which to occur. (A once-in-a-million failure sounds ignorable; run the thing a thousand times a day for a decade and you can expect to see it three or four times.) It might not make sense to devote a lot of time to beautiful handling in those cases, but if there is no handling at all then someone is going to be sorry later.
But here’s a thought to chill your soul. Some of those 60s and 70s people probably did assume indefinite lifetime and did consider dates beyond 1999 and did have guards and handling and all that good stuff, but then someone came along and wanted some date-related code and oh look here’s some I’ll just copy this bit…
Ah, reuse. It saves money! It saves time! You benefit from prior debugging and testing! These slogans do seem to be at least partly reflected in reality – Mohagheghi and Conradi have published an interesting review[PDF, citation] of several published studies. However, reliability of the reused component is typically lower than in its original deployment. For example, Thomas et al found[citation] that code reused without modification had 0.9 errors/KLOC – substantially better than the 9.5 they found in new code, but substantially worse than zero. Either those problems were not detected originally, or they were detected but judged ‘unlikely’ or ‘impossible’, or they did not exist at all in the original context. Effects can be serious and, again, expensive. Ariane 5 flight 501 is frequently cited as an example software disaster, and reuse is definitely implicated. Leveson gives an excellent accounting[PDF]. (Actually, Leveson does all kinds of interesting things and you should bookmark her homepage now so you can go back and read it all.)
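For the record, the failure chain on flight 501 started when a 64-bit floating-point quantity (the horizontal bias, related to horizontal velocity) was converted to a 16-bit signed integer without a range check, in inertial-reference code carried over from Ariane 4, whose trajectory guaranteed the value would stay in range. Ariane 5’s trajectory didn’t. The real code was Ada; this is only a schematic sketch in C of the shape of the problem, with invented numbers:

    #include <stdio.h>
    #include <stdint.h>

    /* Schematic only.  The values and names are invented; the real code
     * was Ada, inside the inertial reference system. */

    /* Ariane 4-era reasoning: trajectory analysis said this value could
     * never exceed 16 bits, so the conversion was left unprotected.
     * (In C an out-of-range conversion here is undefined behaviour; in
     * the Ada original it raised an unhandled Operand Error.) */
    int16_t unguarded_convert(double horizontal_bias)
    {
        return (int16_t)horizontal_bias;
    }

    /* What a guard might look like, if "can't happen" had been read as
     * "can't happen on Ariane 4". */
    int16_t guarded_convert(double horizontal_bias, int *in_range)
    {
        if (horizontal_bias > INT16_MAX || horizontal_bias < INT16_MIN) {
            *in_range = 0;
            return 0;   /* caller gets to decide how to degrade */
        }
        *in_range = 1;
        return (int16_t)horizontal_bias;
    }

    int main(void)
    {
        int in_range;
        double old_trajectory = 20000.0;   /* fits comfortably in 16 bits      */
        double new_trajectory = 50000.0;   /* the new rocket flies differently */

        printf("Ariane 4-ish value: %d\n", unguarded_convert(old_trajectory));
        guarded_convert(new_trajectory, &in_range);
        printf("Ariane 5-ish value in range? %s\n", in_range ? "yes" : "no");
        return 0;
    }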
So that’s several more unwelcome lessons. Firstly, exercise caution around claims that a problem “can’t happen”. There is a huge continuum, with “no, seriously, check out my diagonalization proof” at one end and “ugh, let’s agree not to worry about it” at the other. Most things are somewhere in between. There is an important difference between cases that can’t ever happen, cases that can’t happen in the current system (which can bite you if you modify the system or reuse any of it in a different context), and cases that can happen but are improbable/ugly/whatever (which can bite you at any time).
Secondly, reused code needs to be evaluated carefully and can’t be assumed to be problem-free. Sorry.
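To put the first of those lessons into code for a moment: where a case “can’t happen” only because of how the current system behaves, a cheap habit is to make the impossible case loud rather than silent, so that whoever later reuses or extends the code finds out promptly. A sketch:

    #include <stdio.h>
    #include <stdlib.h>

    /* Today the system only ever produces these three message kinds... */
    enum msg_kind { MSG_HELLO, MSG_DATA, MSG_BYE };

    /* ...so the default branch "can't happen".  Until somebody adds a
     * fourth kind in a later version, or reuses this dispatcher against
     * a different feed.  A loud impossible case costs two lines now and
     * saves an archaeology session later. */
    void dispatch(enum msg_kind k)
    {
        switch (k) {
        case MSG_HELLO: printf("hello\n"); break;
        case MSG_DATA:  printf("data\n");  break;
        case MSG_BYE:   printf("bye\n");   break;
        default:
            fprintf(stderr, "dispatch: unexpected message kind %d\n", (int)k);
            abort();   /* or log-and-drop; depends how sorry "sorry" is */
        }
    }

    int main(void)
    {
        dispatch(MSG_DATA);
        dispatch((enum msg_kind)42);   /* the "impossible" input, simulated */
        return 0;
    }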
This sounds like a good application for static analysis tools?
Yes, it’s funny you should mention that.