Four Column ASCII (2017)

I have lived my whole professional life with this being 'beyond obvious'... It's hard to imagine a generation where it's not. But then again, I did work with EBCDIC for awhile and we were reading and translating ASCII log tapes (ITT/Alcatel 1210 switch, phone calls, memory dumps).

I once got drunk with my elderly unix supernerd friend and he was talking about TTYs and how his passwords contained embedded ^S and ^Q characters and he traced the login process to learn they were just stalling the tty not actually used to construct the hash. No one else at the bar got the drift. He patched his system to put do 'raw' instead of 'cooked' mode for login passwords. He also used backspaces ^? ^H as part of his passwords. He was a real security tiger. I miss him.

It doesn't seem to have been mentioned in the comments so far, but as a floppy-disk era developer I remember my mind was blown by the discovery that DEL was all-bits-set because this allowed a character on paper tape and punched card to be deleted by punching any un-punched holes!

For me was interesting that all digits in ASCII starts with 0x3, eg. 0x30 - 0, 0x31 - 1, ..., 0x39 - 9. I thought it was accidental, but in real it was intended. This was giving possibility to build simple counting/accounting machines with minimal circuit logic with BCD (Binary Coded Decimals). That was wow for me ;)

This is by design, so that case conversion and folding is just a bit operation.

The idea that SOH/1 is "Ctrl-A" or ESC/27 is "Ctrl-[" is not part of ASCII; that idea comes from they way terminals provided access to the control characters, by a Ctrl key that just masked out a few bits.

If Unicode had used a full 32 bits from the start, it could have usefully reserved a few bits as flags that would divide it into subspaces, and could be easily tested.

Imagine a Unicode like this:

8:8:16

- 8 bits of flags. - 8 bit script family code: 0 for BMP. - 16 bit plane for every script code and flag combination.

The flags could do usefuil things like indicate character display width, case, and other attributes (specific to a script code).

Unicode peaked too early and applied an economy of encoding which rings false now in an age in which consumer devices have two digit gigabyte memories, multi terabyte of storage, and high definition video is streamed over the internet.

I made an interactive viewer some time ago (scroll down a bit):

https://blog.glyphdrawing.club/the-origins-of-del-0x7f-and-i...

It really helps understand the logic of ASCII.

For whatever reason, there are extraordinarily few references that I come back to over and over, across the years and decades. This is one of them.

I came across this a week ago when I was looking at some LLM generated code for a ToUpper() function. At some point I “knew” this relationship, but I didn’t really “grok” it until I read a function that converted lowercase ascii to uppercase by using a bitwise XOR with 0x20.

It makes sense, but it didn’t really hit me until recently. Now, I’m wondering what other hidden cleverness is there that used to be common knowledge, but is now lost in the abstractions.

Some of this elegance discussed from a programmatic point of view

https://www.pixelbeat.org/docs/utf8_programming.html

I have a command called `ascii-4col.txt` in my personal `bin/` folder that prints this out:

https://github.com/jez/bin/blob/master/ascii-4col.txt

It's neat because it's the only command I have that uses `tail` for the shebang line.

Back in early times, I used to type ctrl-M in some situations because it could be easier to reach than the return key, depending on what I was typing.

Also easy to see why Ctrl-D works for exiting sessions.

This is also why the Teletype layout has parentheses on 8 and 9 unlike modem keyboards that have them on 9 and 0 (a layout popularised by the IBM Selectric). The original Apple IIs had this same layout, with a “bell” on top of the G.

This is why Ctrl+C is 0x03 and Ctrl+G is the bell. The columns aren't arbitrary. They're the control codes with bit 6 flipped. Once you see it, you can't unsee it. Best ASCII explainer I've read.

Related. Others?

Four Column ASCII (2017) - https://news.ycombinator.com/item?id=21073463 - Sept 2019 (40 comments)

Four Column ASCII - https://news.ycombinator.com/item?id=13539552 - Feb 2017 (68 comments)

If Ctrl sets bit 6 to 0, and Shift sets bit 5 to 1, the logical extension is to use Ctrl and Shift together to set the top bits to 01. Surely there must be a system somewhere that maps Ctrl-Shift-A to !, Ctrl-Shift-B to " etc.

I love this stuff. It's the kind of lore that keeps getting forgotten and re-discovered by swathes of curious computer scientists over the years. So easy to assume many of the old artifacts (such as the ASCII table) had no rhyme or reason to them.

I still find weird that they didn't make A,B... just after the digits, that would make binary to hexadecimal conversion more efficient..

credit to William Crosby, "Note on an ASCII-Octal Code Table", CACM 8.10, Oct 1965

https://dl.acm.org/doi/epdf/10.1145/365628.365652

also defined 6-bit ASCII subset

anyone remember 005 ENQ (also called WRU who are you) and its effect on a teletype?

Very cool.

Though the 01 column is a bit unsatisfying because it doesn’t seem to have any connection to its siblings.

first I was like "What but why? You don't save any space or what's that excercise about" then I read it again and it blew my mind. I thought I knew everything about ASCII. What a fool I am, Sokrates was right. Always.

Just wait until someone finally gets why CSI ( aka the “other escape” from the 8-bit ansi realm, which is now eternalized in unicode C1 block ) is written ESC [ in 7-bit systems, such as the equally now eternal utf-8 encoding

On early bit-paired keyboards with parallel 7-bit outputs, possibly going back to mechanical teletypes, I think holding Control literally tied the upper two bits to zero. (citation needed)

Also explains why there is no difference between Ctrl-x and Ctrl-Shift-x.

where does this character set come from? It looks different on xterm.

for x in range(0x0,0x20): print(chr(x),end=" ")

Imho ascii wasted over 20 of its precious 128 values on control characters nobody ever needs (except perhaps the first few years of its lifetime) and could easily have had degree symbol, pilcrow sign, paragraph symbol, forward tick and other useful symbols instead :)

I have a command called `ascii-4col.txt` in my personal `bin/` folder that prints this out:

https://github.com/jez/bin/blob/master/ascii-4col.txt

It's neat because it's the only command I have that uses `tail` for the shebang line.

If Unicode had used a full 32 bits from the start, it could have usefully reserved a few bits as flags that would divide it into subspaces, and could be easily tested.

Imagine a Unicode like this:

8:8:16

- 8 bits of flags. - 8 bit script family code: 0 for BMP. - 16 bit plane for every script code and flag combination.

The flags could do usefuil things like indicate character display width, case, and other attributes (specific to a script code).

Some of this elegance discussed from a programmatic point of view

https://www.pixelbeat.org/docs/utf8_programming.html

Back in early times, I used to type ctrl-M in some situations because it could be easier to reach than the return key, depending on what I was typing.

I made an interactive viewer some time ago (scroll down a bit):

https://blog.glyphdrawing.club/the-origins-of-del-0x7f-and-i...

It really helps understand the logic of ASCII.

Also easy to see why Ctrl-D works for exiting sessions.

Related. Others?

Four Column ASCII (2017) - https://news.ycombinator.com/item?id=21073463 - Sept 2019 (40 comments)

Four Column ASCII - https://news.ycombinator.com/item?id=13539552 - Feb 2017 (68 comments)

This is why Ctrl+C is 0x03 and Ctrl+G is the bell. The columns aren't arbitrary. They're the control codes with bit 6 flipped. Once you see it, you can't unsee it. Best ASCII explainer I've read.

anyone remember 005 ENQ (also called WRU who are you) and its effect on a teletype?

On early bit-paired keyboards with parallel 7-bit outputs, possibly going back to mechanical teletypes, I think holding Control literally tied the upper two bits to zero. (citation needed)

Also explains why there is no difference between Ctrl-x and Ctrl-Shift-x.

Very cool.

Though the 01 column is a bit unsatisfying because it doesn’t seem to have any connection to its siblings.

credit to William Crosby, "Note on an ASCII-Octal Code Table", CACM 8.10, Oct 1965

https://dl.acm.org/doi/epdf/10.1145/365628.365652

also defined 6-bit ASCII subset

Bit-level skeuomorphism! And since NUL is zero, does that mean the program ends wherever you stop punching? I've never used punch cards so I don't know how things were organized.

Regarding ^?: shouldn't that be ^_ instead?

This is by design, so that case conversion and folding is just a bit operation.

I guess it's an age thing, but I thought this was really basic CS knowledge. But I can see why this may be much less relevant nowadays.

Yes, the diagram just shows the ASCII table for the old teletype 6-bit code (and 5-bit code before), with the two most significant bits spread over 4 columns to show the extension that happened while going 5→6→7 bits. It makes obvious what was very simple bit operations on very limited hardware 70–100 years ago.

(I assume everybody knows that on mechanical typewriters and teletypes the "shift" key physically shifted the caret position upwards, so that a different glyph would be printed when hit by a typebar.)

It makes sense, but it didn’t really hit me until recently. Now, I’m wondering what other hidden cleverness is there that used to be common knowledge, but is now lost in the abstractions.

A similar bit-flipping trick was used to swap between numeric row + symbol keys on the keyboard, and the shifted symbols on the same keys. These bit-flips made it easier to construct the circuits for keyboards that output ASCII.

I believe the layout of the shifted symbols on the numeric row were based on an early IBM Selectric typewriter for the US market. Then IBM went and changed it, and the latter is the origin of the ANSI keyboard layout we have now.

For whatever reason, there are extraordinarily few references that I come back to over and over, across the years and decades. This is one of them.

Tangentially related, there is much insight about Unix idioms to be gained from understanding the key layout of the terminal Bill Joy used to create vi

https://news.ycombinator.com/item?id=21586980

I still find weird that they didn't make A,B... just after the digits, that would make binary to hexadecimal conversion more efficient..

where does this character set come from? It looks different on xterm.

for x in range(0x0,0x20): print(chr(x),end=" ")

It's more that shift flips that bit. Also I'd call them bit 0 and 1 and not 5 and 6 as 'normally' you count bits from the right (least significant to most significant). But there are lots of differences for 'normal' of course ('middle endian' :-P )

I guess in this system, you'd also type lowercase letters by holding shift?

Modern keyboards = some keyboards. In the Nordic Countries modern keyboards have parantheses on 8 and 9.

What happened to this block and the keyboard key arrangement?

  ESC  [  {  11011
  FS   \  |  11100
  GS   ]  }  11101

Also curious why the keys open and close braces, but ... the single and double curly quotes don't open and close, but are stacked. Seems nuts every time I type Option-{ and Option-Shift-{ …

Going off the timelines on Wikipedia, the first version of ASCII was published (1963) before the 0-9,A-F hex notation became widely used (>=1966):

- https://en.wikipedia.org/wiki/ASCII#History

- https://en.wikipedia.org/wiki/Hexadecimal#Cultural_history

Currently 'A' is 0x41 and 0101, 'a' is 0x61 and 0141, and '0' is 0x30 and 060. These are fairly simple to remember for converting between alphanumerics and their codepoint. Seems more advantageous, especially if you might be reasonably looking at punchcards.

[0-9A-Z] doesn't fit in 5 bits, which impedes shift/ctrl bits.

I'm not sure if our convention for hexadecimal notation is old enough to have been a consideration.

EDIT: it would need to predate the 6-bit teletype codes that preceded ASCII.

They put : ; immediately after the digits because they were considered the least used of the major punctuation, so that they could be replaced by ‘digits’ 10 and 11 where desired.

(I'm almost reluctant to to spoil the fun for the kids these days, but https://en.wikipedia.org/wiki/%C2%A3sd )

ASCII was started in 1960. A terminal then would have been a mostly-mechanical teletype (keyboard and printer, possibly with paper tape reader/punch), without much by way of "circuit logic". Think of it more as a bit caused a physical shift of a linkage to do something like hit the upper or lower part of a hammer, or a separate set of hammers for the same remaining bits.

Look at the Teletype ASR-33, introduced in 1963.

And this is exactly why I find the usual 16x8 at least as insightful as this proposed 32x4 (well, 4x32, but that's just a rotation).

I still wonder if it wouldn't have been better to let each digit be represented by its exact value, and then use the high end of the scale rather than the low end for the control characters. I suppose by 1970 they were already dealing with the legacy of backwards-compatibility, and people were already accustomed to 0x0 meaning something akin to null?

Smaller, 6-bit code pages existed before and after that. They did not even have space for upper and lower case letters, but had control characters. Those codes directly moved the paper, switched to next punch card or cut the punched tape on the receiving end, so you would want them if you ever had to send more than a single line of text (or a block of data), which most users did.

Even smaller 5-bit Baudot code had already had special characters to shift between two sets and discard the previous character. Murray code, used for typewriter-based devices, introduced CR and LF, so they were quite frequently needed in way more than few years.

Maybe 32 was a bit much, but even fitting a useful set of control characters into, say, 16, would be tricky for me. For example, ^S and ^Q are still useful when text is scrolling by too fast.

On top of the control symbols being useful, providing those symbols would have reduced the motivation for Unicode, right?

ASCII did us all the favor of hitting a good stopping point and leaving the “infinity” solution to the future.

I started using the separator symbols (file, group, record, unit separator, ascii 60-63 ... though mostly the last two) for CSV like data to store in a database. Not looking back!

only that would have broken the whole thing back in the days ;)

It is interesting that, as a guess, we waste an average of ~5% of storage capacity for text (12.5% of Unicode's first byte, but many languages regularly use higher bytes of course).

I don't fault the creators of ASCII - those control characters were probably needed at the time. The fault is ours for not moving on from the legacy technology. I think some non-ASCII/Unicode encodings did reuse the control character bytes. Why didn't Unicode implement that? I assume they were trying to be be compatible with some existing encodings, but couldn't they have chosen the encodings that made use of the control character code points?

If Unicode were to change it now (probably not happening, but imagine ...), what would they do with those 32 code points? We couldn't move other common characters over to them - those already have well-known, heavily used code points in Unicode and also iirc Unicode promises backward compability with prior versions.

There still are scripts and glyphs not in Unicode, but those are mostly quite rare and effectively would continue to waste the space. Is there some set of characters that would be used and be a good fit? Duplicate the most commonly used codepoints above 8 bits, as a form of compression? Duplicate combining characters? Have a contest? Make it a private area - I imagine we could do that anyway, because I doubt most systems interpret those bytes now.

Also, how much old data, which legitimately uses the ASCII control characters, would become unreadable?

What are you trying to achieve, none of those characters are printable, and definetly not going to show up on the web.

    for x in range(0x0,0x20): print(f'({chr(x)})', end =' ')
    (0|) (1|) (2|) (3|) (4|) (5|) (6|) (7|) (8) (9| ) (10|
    ) (11|
          ) (12|
    ) (14|) (15|) (16|) (17|) (18|) (19|) (20|) (21|) (22|) (23|) (24|) (25|)    (26|␦) (27|8|) (29|) (30|) (31|)

Ok this set does not even show on Android, just some boxes. Very strange.

Bit-level skeuomorphism! And since NUL is zero, does that mean the program ends wherever you stop punching? I've never used punch cards so I don't know how things were organized.

Regarding ^?: shouldn't that be ^_ instead?

I guess in this system, you'd also type lowercase letters by holding shift?

Tangentially related, there is much insight about Unix idioms to be gained from understanding the key layout of the terminal Bill Joy used to create vi

https://news.ycombinator.com/item?id=21586980

What happened to this block and the keyboard key arrangement?

  ESC  [  {  11011
  FS   \  |  11100
  GS   ]  }  11101

Also curious why the keys open and close braces, but ... the single and double curly quotes don't open and close, but are stacked. Seems nuts every time I type Option-{ and Option-Shift-{ …

You're no longer talking about ASCII. ASCII has only a double quote, apostrophe (which doubles as a single quote) and backtick/backquote.

Note on your Mac that the Option-{ and Option-}, with and without Shift, produce quotes which are all distinct from the characters produced by your '/" key! They are Unicode characters not in ASCII.

In the ASCII standard (1977 version here: https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-197...) the example table shows a glyph for the double quote which is vertical: it is neither an opening nor closing quote.

The apostrophe is shown as a closing quote, by slanting to the right; approximately a mirror image of the backtick. So it looks as though those two are intended to form an opening and closing pair. Except, in many terminal fonts, the apostrophe is a just vertical tick, like half of a double quote.

The ' being veritcal helps programming language '...' literals not look weird.

> What happened to this block and the keyboard key arrangement?

There's also these:

  | ASCII      | US keyboard |
  |------------+-------------|
  | 041/0x21 ! | 1 !         |
  | 042/0x22 " | 2 @         |
  | 043/0x23 # | 3 #         |
  | 044/0x24 $ | 4 $         |
  | 045/0x25 % | 5 %         |
  |            | 6 ^         |
  | 046/0x26 & | 7 &         |

Modern keyboards = some keyboards. In the Nordic Countries modern keyboards have parantheses on 8 and 9.

According to the layouts on this site, there're more European layouts with parenthesis on 8, 9 than on 9, 0. (I had to zoom out to see the right-side of the comparisons.)

https://www.farah.cl/Keyboardery/A-Visual-Comparison-of-Diff...

Going off the timelines on Wikipedia, the first version of ASCII was published (1963) before the 0-9,A-F hex notation became widely used (>=1966):

- https://en.wikipedia.org/wiki/ASCII#History

- https://en.wikipedia.org/wiki/Hexadecimal#Cultural_history

The alphanumeric codepoints are well placed hexadecimally-speaking though. I don't imagine that was just an accident. For example, they could've put '0' at 050/0x28, but they put it at 060/0x30. That seems to me that they did have hexadecimal in consideration.