The claim that programs stored dates as two ASCII or similar characters because computers were limited in resources seems wrong to me because it takes more memory than one 8-bit integer would. Also in certain cases it’s also slower.
Storing the year as one 8-bit signed integer would give (assuming 1900 + year) a range of 1772 (1900 – 128) to 2027 (1900 + 127). An unsigned 8-bit integer gives 1900 to 2155 (1900 + 255). In either case the end result is a much wider range and half the storage and RAM requirements.
In addition to perform arithmetic on the years the software would need to convert the two characters back to an integer whereas if it were stored as an integer it could be loaded without conversion. This is where the slower in certain cases comes in. In addition it could make sorting by year slower.
In my view the only reason I can think of for why someone would store a year as two characters is bad programming or using a text based storage format (such as something like XML or JSON which I know those are more contemporary in comparison to programs that had the Y2K bug). Arguably you can say that choosing a text-based storage format is an example of bad programming because it’s not a good choice for a very resource limited computer.
How many programs stored years as two characters and why?
Short Answer: BCD rules over a single byte integer.
The claim that programs stored dates as two ASCII or similar characters because computers were limited in resources seems wrong to me
The point wasn’t about using ASCII or ‘similar’, using only two decimal digits. That can be two characters (not necessary ASCII) or two BCD digits in a single byte.
because it takes more memory than one 8-bit integer would.
Two BCD digits fit quite nicely in 8 bits – in fact, that’s the very reason a byte is made of 8 bit.
Also in certain cases it’s also slower.
Not really. In fact, on quite some machines using a single byte integer will be considerable slower than using a word – or as well BCD.
In addition to perform arithmetic on the years the software would need to convert the two characters back to an integer whereas if it were stored as an integer it could be loaded without conversion.
That is only true and needed if the CPU in question can not handle BCD native.
This is where the slower in certain cases comes in. In addition it could make sorting by year slower.
Why? Sorting is done usually not by year, but as a full date – which in BCD is 3 Byte.
In my view the only reason I can think of for why someone would store a year as two characters is bad programming
Are you intend to say everyone during the first 40 years of IT, up to the year 2000 were idiots?
or using a text based storage format
Now you’re coming close. Ever wondered why your terminal emulation defaults to 80 characters? Exactly, it’s the size of a punch card. And punch cards don’t store bytes or binary information but characters. One column one digit. Storage evolved from there.
And storage on mainframes was always a rare resource – or how much room do you think one can give to data when the job is to
- handle 20k transactions per hour
on a machine with a
- 0.9 MIPS CPU,
- 1.5 megabyte of RAM, some
- 2 GiB of disk storage?
That was all present to
- serve 300-500 concurrent users
Yes, that is what a mainframe in 1980 was.
Load always outgrew increased capabilities. And believe me, shaving off a byte from every date to use only YYMMDD instead of YYYYMMDD was a considerable (25%) gain.
(such as something like XML or JSON which I know those are more contemporary in comparison to programs that had the Y2K bug).
Noone would have ever thought of such bloaty formats back then.
Arguably you can say that choosing a text-based storage format is an example of bad programming because it’s not a good choice for a very resource limited computer.
Been there, done that, and it is. Storing a date in BCD results in overall higher performance.
- optimal storage (3 byte per date)
- No constant conversion from and to binary needed
- Conversion to readable (mostly EBCDIC, not ASCII) is a single machine instruction.
- Calculations can be done without converting and filling by using BCD instructions.
How many programs stored years as two characters and why?
A vast majority of mainframe programs did. Programs that are part of incredibly huge applications in worldwide networks. Chances are close to 100% that any financial transaction you do today still is done at some point by a /370ish mainframe. And these are as well the ones that not only come from punch card age and tight memory situations, but also handle BCD as native data types.
And another story from Grandpa’s vault:
For one rather large mainframe application we solved the Y2K problem without extending records, but by extending the BCD range. So the next year after 99 (for 1999 became A0 (or 2000). This worked as the decade is the topmost digit. Of course, all I/O functions had to be adjusted, but that was a lesser job. Changing data storage formats would have been a gigantic task with endless chances of bugs. Also, any date conversion would have meant to stop the live system maybe for days (there were billions of records to convert) — something not possible for a mission critical system that needs to run 24/7.
We also added a virtual conversation layer, but that did only kick in for a small number of data records for a short time during roll over.
In the end we still had to stop short before midnight MEZ and restart a minute later, as management decided that this would be a good measure to avoid roll over problems. Well, their decision. And as usual a completely useless one, as the system did run multiple time zones (almost around the world, Wladiwostock to French Guiana), so it passed multiple roll over points that night.
I programmed in Cobol for nearly 20 years at the start of my career and it was quite common to see two digit years used. Initially this was due to storage concerns – not necessarily only memory, but disk storage also.
Cobol has no intrinsic date field.
It was very common to store dates in Cobol like:
01 EMP-HIRE-DATE. 03 EMP-HIRE-DATE-YY PIC 99. 03 EMP-HIRE-DATE-MM PIC 99. 03 EMP-HIRE-DATE-DD PIC 99.
If it’s 1970, you may not think your program is going to be used for another 30 years… it was more important to save those two bytes. Multiplied out by multiple fields and millions of rows, it all adds up when storage is a premium.
As 2000 started approaching, people had to start dealing with this. I worked for a company that had software that dealt with contracts and the end dates started stretching into 2000’s in the mid-80’s. As a rookie, I got to deal with fixing that.
There were two ways. First, if it was feasible, I added
03 EMP-HIRE-DATE-CC PIC 99.
To every date field. It ended up looking like:
01 EMP-HIRE-DATE. 03 EMP-HIRE-DATE-YR. 05 EMP-HIRE-DATE-CC. PIC 99. 05 EMP-HIRE-DATE-YY PIC 99. 03 EMP-HIRE-DATE-MM PIC 99. 03 EMP-HIRE-DATE-DD PIC 99.
But in Cobol, you can’t just change that if it is a record description of a file… Cobol has fixed length records, so if you insert two bytes in the middle, you have to create a new file, read all the records in the old format, move them to the new record format, and write to the new file. Then “swap” the file back to its original name. If you are dealing with huge datasets you needed lots of disk space and time to do all that. And also rebuild all the programs.
When it wasn’t feasible, we added code when printing or displaying dates.
IF EMP-HIRE-DATE-YY > 50 MOVE 19 TO PRINT-HIRE-DATE-CC ELSE MOVE 20 TO PRINT-HIRE-DATE-CC.
This bought another 50 years…
To add to Raffzahn’s answer, here’s an example from a 1958 book on data processing:
… The names of authors are coded in columns 11-14. The first four consonants of the name form the code. The last two digits of the year of the reference are punched directly into columns 15-16. The journal, or other source for the reference, is coded in columns 17-20. …
(from Punched Cards Their Applications To Science And Industry. 2nd ed. Edited by Robert S. Casey, James W. Perry, Madeline M. Berry and Allen Kent. New York, Reinhold Publishing Corp; 1958; emphasis mine. The card image shows
52, assumed to mean 1952, in the noted columns)
This book barely touches on computer applications, since unit record usage was more prevalent at the time. Computers inherited these tabular data formats, and the physical inertia of the data format — cardboard boxes containing millions of records — meant that tabular years stuck around for a very long time. The book does have a couple of mentions of dealing with centuries, but the typical outlook can be summarized by the quotation from p.326 “The century is neglected inasmuch as most entries for this particular file are in the 20th century”. Would you spend/waste 2½% of your available fields in a record just to encode something you (probably) won’t use?
Data, especially government data, sticks around for a long time. Consider this: the last person to receive a US Civil War pension died in May 2020. The US Civil War ended in 1865. Tabulating equipment didn’t come in until the 1880s. Bytes didn’t happen until 1956. JSON was this century. So data formats tend to be what the past gives you.
There are actually still RTC chips on the market which store the year only as a pair of BCD digits, and do not have a century field. These are usually packed into one byte when read, in common with the fields for day-of-month, month-of-year, hour, minute, and second. The same format is used for the alarm settings.
Old software using these chips would simply prefix the year digits with 19 for display. Current systems using them would prefix 20. The problem rears its ugly head whenever the system is used in a century that it wasn’t designed for – so potentially also in the year 2100.
If an RTC chip existed that counted in binary instead of BCD, then it could handle years in two and a half centuries without trouble. An old machine adding 1900 to the stored year would not see any rollover trouble until after the year 2155. This would however require the host system to convert binary to decimal for display – not a problem for a PC, a minicomputer, or even an 8-bit microcomputer, but potentially a burden for extremely lightweight embedded systems.
Another point that hasn’t yet been mentioned is that many programs would store data in whatever format it was keyed in. If data was stored as six digits YYMMDD, that’s what data-entry clerks would be expected to type. For a program to store data as YYYYMMDD, it would be necessary to either have clerks enter an extra two digits per record, or have software compute an extra two digits based upon the two digits of the year the person typed. Keeping information as typed, but using modular arithmetic so that 01-99=02 wouldn’t necessarily be any more difficult than having to add extra logic at the data entry stage.
I worked with IBM minicomputers (S/36, S/38, and AS/400) in the 1980s and 1990s. The very ugly programming language RPG was popular. It looked very different from Cobol but had similar abilities. It supported only two data types: fixed length alphanumeric and fixed precision decimal numeric. The numeric variables could be zoned which meant one digit per byte or packed which was two digits per byte. Storage of dates varied quite a lot but the most common was as 6 digit numbers rather than separate year, month, and day. Usually, but not always, these would be YYMMDD in the database so that the sorting would be sensible.
Numeric fields were always signed so the last half of the last byte was reserved for the sign (hex F for + and hex D for -). So, a 6 digit number required 4 bytes rather than 3 and the first half of the first byte was unused. This allowed a trick when year 2000 became an issue: the unused first half of the first byte could be used for a century flag: 0 for 19xx and 1 for 20xx. The format is usually described as CYYMMDD. Systems which use this format will have a 2100 or 2900 problem but the culprits won’t be around to see it. Actually, many systems will have a problem sooner as 6 digit format is still popular for display and entry. Arbitrary cut off dates, e.g. 40, are used to guess the century when 6 digits are entered. So, 010195 is assumed to be 1995 but 010115 is assumed to be 2015. So, a problem is due in next couple of decades for some but it won’t be as bad as the year 2000 since at least the database can cope beyond the limit.
On BCD and performance, BCD is not necessarily much or any slower since machines intended for business use usually have hardware support. Also, conversion to and from human readable format is simpler. Many business programs will display and accept entry of data far more often than perform arithmetic on it so even if there was an arithmetic performance impact it would typically be outweighed by the I/O gains.
Not relevant to dates but BCD is useful for money. Try adding 0.01 up 100 times in a float or double and then comparing it to 1.00. BCD gets this right but float and double don’t. Accountants get very upset if the balance sheet is off even by a cent.
The mainframes have been covered. Let’s look at what led up to the PC which of course is where a lot of business software evolved.
Many PC programmers had previously been programmers for 8-bit systems, and had grown up on the “every cycle counts” mentality. Many programs for MS-DOS had ports for older 8-bit systems, too, so would have often followed the same internal formats.
- Most 8-bit CPUs (and also the 16-bit CPU used in IBM PCs) had specific BCD instructions, which operated on one pair of decimal digits.
- Most 8-bit CPUs did not have multipy or divide (or mod) instructions at all! To convert from binary to base 10 and back was a lot of work and involved loops or tables. OK, you could use two BCD bytes for a 4-digit number, but if you wanted to, e.g. subtract one date from another it would have required maybe five times as many assembly instructions/cycles.
- In terms of RAM, having each date take an extra byte or two is one thing—having 5-10 extra instructions to deal with it was not worth it when you have 64KB of RAM or less. Consider that even by just typing the number “86” the program would have to add 1900 to it.
- Assembly language instructions don’t just take up space—assembly language programming is hard! Every instruction took up a line of screen space, had potential for a bug, and consumed a bit of the programmer’s headspace. (Everyone was confident at working with BCD; it has now fallen by the wayside because of high-level languages.)
While we’re on the topic of screen space, many programs didn’t even display a leading “19” on the screen. (Hello—40 character screen width!) So there was hardly any point at all for the program to keep track of that internally.
Back in olden times (1960s and earlier), when data came on Hollerith cards (80 columns, 12 rows), using 2 extra columns (of the 80 available) to store the fixed string “19” as part of a date seemed useless.
Having more than 80 columns of data meant that one had to use multiple cards, and lost several columns to the overhead needed to say “I’m with him”, plus the extra programming effort to validate the pairing.
Hardly anybody thought their code would still be in use when “19” changed to “20”.
Anybody who stores fixed field size dates as “yyyymmdd” has a Y10K problem pending.
In banking, many computations have historically used binary-coded decimal (BCD) as a compromise between space, speed, and consistent rounding rules. Floating point was uncommon, slow, and difficult to round off consistently. So even things like interest rates might be stored in fixed-point BCD. Something like an interest rate might be stored as four BCD digits, in two byes, with an implicit decimal point after the first digit. Thus, your account could pay interest rates from 0.000% to 9.999%, which seemed perfectly sufficient, because a bank would never pay 10% or more. This thinking left 1970s programmers scrambling to update data structures when high inflation and new types of accounts like Certificates of Deposit crossed that threshold.
When gas prices first cross the $1/gallon threshold, many of the pumps could not be set with the actual value. Though that was largely a mechanical limitation rather than what we would consider a computing problem today, the thinking that led to the situation was largely the same: reduce costs by not bothering to enable values that were unimaginably large at design time.
[Apocryphal] When the first 9-1-1 emergency dispatch service was put into use in Manhattan, legend says its database schema hadn’t anticipated five-digit apartment numbers, leading to delays of paramedics reaching sick or injured high-rise dwellers.
In the 1990s, phone numbers in North America Number Plan were all of the form (AAA) BBB-CCCC, and the second A was always a 0 or 1, and the first B could not be a 1. (There were other restrictions, too, but they aren’t relevant here.) By exploiting the well-known restrictions, the contact management software on your PC could represent a phone number with a 32-bit value, saving disc space and RAM (at the run time cost of a little bit swizzling to encode and decode the values). But the soaring popularity of fax machines, pagers, modems, and cell phones drove demand for phone numbers until North America ran out of area codes, causing software publishers to scramble to update their software to use less elegant space-saving tricks.
It’s not always about memory savings or binary-coded decimal. Sometimes it’s about space on a form or a display. In high-end software for stock trading, display space is at a premium. When it first seemed the DOW Jones Industrials Average could potentially break $10,000.00, there was concern about whether these displays would show the additional digit, and whether there was room for a thousands separator. If the values were truncated or displayed in other confusing ways, would it slow down the traders who relied on these system? Could it trigger a DOW10K panic sell off if it confused them into thinking the value had plummeted?
Microsoft Word did
I distinctly remember looking into the binary .doc format used by Microsoft Word 5.0 for DOS, because documents saved when the year was after 1999 couldn’t be opened again. It turned out that the internal meta-data included the date using two ASCII digits for the year and 2000 was blindly stored as
100, thus overwriting one byte of an adjacent meta-data field, breaking the document.
The NORAD Two Line Element format remains in wide use for cataloging satellite orbits. Originally intended for punched cards, it uses a two character year. Years <57 are defined to be in the 21st century. It may be supplanted soon, though. The immediate problem is not years, but the number of trackable objects in orbit.
Even systems that stored the year as a byte often just printed “19” in front of the value. So the year 2000 (stored as 100) would be printed as 19100 or 1910 or 1900 depending on exactly how the number was converted from a byte to printable characters.
Of course a lot of systems didn’t even both printing the 19, they just printed 100/01/01 or 00/01/01. Similarly date entry systems often only accepted two digits making it impossible to enter years beyond 1999.
You may not remember the enormous market share that IBM occupied in the business computing space in the 20th century. I don’t mean just the mainframe market where S/360, S/370, S/390 & their successors play, but also the much larger (in terms of units) midrange/small business market occupied by the S/3, S/32, S/34, S/36, S/38, AS/400, and their successors. IBM’s midrange systems were ubiquitous in smaller businesses and in branch offices of mega-corporations around the world, and at their peak they brought in $14 billion annually to IBM.
To be successful in this target market, these systems had to be far more affordable than a mainframe. And since “inexpensive” disk storage cost around $10,000 per GB in 1990, disk capacities were small by modern standards. You might only have 13 MB — yes, MEGABYTES — of total disk capacity to work with online on a S/32. That’s to serve a corporate operation with reveneues of $250 million and up (that’s IBM’s definition of “small business”) and running accounting, billing, payroll, inventory control, manufacturing production control, and similar applications, which are all about DOLLARS and DATES.
And these were all EBCDIC machines.
So yeah, absolutely, EVERYONE used 2-digit years on these systems prior to Y2K.
Fortunately, disks got cheaper in the late 1990s, and thousands of contract programmers put out Y2K consulting shingles at about the same time. In the 3 years from 1997 to 1999, most of the mess got cleaned up. I’m sure that here and there, a dental office had billing with dates in the wrong century, or a mortgage amortization schedule showed negative dollars owed after the turn of the millenium, but the truly bad-news applications that could hurt customers on a massive scale got fixed before zero hour on December 31, 1999, 23:59:59 UTC.
Many fixed record text formats (with or without COBOL copybook definitions) used two digit years, and some do still in this year. So it does not only affect storage format (like in databases) but also interchange formats. Those formats deliberately did not use binary representations but did try to be compact.
This happened way into the 90ies, where everybody knew about the upcoming millennial change.
Funny side note, many systems used the year 2020 as cutoff, which results in a increased problem rate this year, again.
Even in standardized EDI Formats like X.12 or EDIFACT you see the 4 digits years optional. The Date or Time Period Format Code 2379 for example starts with code “
DDMMYY. (And back then I was a really bad idea to allow more then one unique format for each instant type). One of the major drivers for switching EDIFACT syntax version from 3 to 4 in 1998 was the 8 digit date stamps in UNB service segment.
Another example would be the VDA fixed record messages used in the (German) automotive sector, things like VDA 4912 from 1991 specify delivery date in 6 digits (or even shorter with week numbers). They have been in used way past 2000.