

5.4 Internal Representation of Characters

An MIT/GNU Scheme character consists of a code part and a bucky bits part. The MIT/GNU Scheme set of characters can represent more characters than ASCII can; it includes characters with Super and Hyper bucky bits, as well as Control and Meta. Every ASCII character corresponds to some MIT/GNU Scheme character, but not vice versa.[1]

MIT/GNU Scheme uses a 21-bit character code with 4 bucky bits. The character code contains the Unicode code point for the character. This is a change from earlier versions of the system, which used the ISO-8859-1 code point, but it is upwards compatible with previous usage, since ISO-8859-1 is a proper subset of Unicode.

— procedure: make-char code bucky-bits

Builds a character from code and bucky-bits. Both code and bucky-bits must be exact non-negative integers in the appropriate range. Use char-code and char-bits to extract the code and bucky bits from the character. If 0 is specified for bucky-bits, make-char produces an ordinary character; otherwise, bucky-bits is the sum of the values of the bits to be turned on, as follows:

          1               Meta
          2               Control
          4               Super
          8               Hyper

For example,

          (make-char 97 0)                          ⇒  #\a
          (make-char 97 1)                          ⇒  #\M-a
          (make-char 97 2)                          ⇒  #\C-a
          (make-char 97 3)                          ⇒  #\C-M-a

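
The bit values in the table are additive, which is why 3 selects both Control and Meta in the last example. The following is a minimal sketch of that arithmetic; the names bucky-meta, bucky-control, bucky-super, bucky-hyper, and combine-bucky-bits are hypothetical and are not defined by MIT/GNU Scheme.

```scheme
;; Hypothetical names for the bucky-bit values listed in the table;
;; MIT/GNU Scheme does not define these variables.
(define bucky-meta    1)
(define bucky-control 2)
(define bucky-super   4)
(define bucky-hyper   8)

;; Combine bucky-bit values into a single bucky-bits argument.
;; Plain addition is correct only because each value is a distinct
;; power of two and each is passed at most once.
(define (combine-bucky-bits . bits)
  (apply + bits))

(combine-bucky-bits bucky-control bucky-meta)   ; 3, as in (make-char 97 3)
(combine-bucky-bits)                            ; 0, an ordinary character
```

The result of combine-bucky-bits is intended to be passed as the second argument to make-char.
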
— procedure: char-bits char

Returns the exact integer representation of char's bucky bits. For example,

          (char-bits #\a)                           ⇒  0
          (char-bits #\m-a)                         ⇒  1
          (char-bits #\c-a)                         ⇒  2
          (char-bits #\c-m-a)                       ⇒  3

— procedure: char-code char

Returns the character code of char, an exact integer. For example,

          (char-code #\a)                           ⇒  97
          (char-code #\c-a)                         ⇒  97

Note that in MIT/GNU Scheme, the value of char-code is the Unicode code point for char.

— variable: char-code-limit
— variable: char-bits-limit

These variables define the (exclusive) upper limits for the character code and bucky bits (respectively). The character code and bucky bits are always exact non-negative integers, and are strictly less than the value of their respective limit variable.
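
As an illustration, the two limits can be used to check arguments before calling make-char. This is only a sketch: make-char-arguments-in-range? is a hypothetical helper, and it relies on the variables char-code-limit and char-bits-limit described here, so it is specific to MIT/GNU Scheme.

```scheme
;; Hypothetical helper: true if X is an exact non-negative integer.
(define (exact-nonneg? x)
  (and (integer? x) (exact? x) (<= 0 x)))

;; Hypothetical helper: true if CODE and BITS are both below their
;; respective (exclusive) limits.  This is a necessary condition only;
;; it does not check that CODE is a valid Unicode code point.
(define (make-char-arguments-in-range? code bits)
  (and (exact-nonneg? code)
       (exact-nonneg? bits)
       (< code char-code-limit)
       (< bits char-bits-limit)))

(make-char-arguments-in-range? 97 0)                ; in range
(make-char-arguments-in-range? 97 char-bits-limit)  ; out of range: limit is exclusive
```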

— procedure: char->integer char
— procedure: integer->char k

char->integer returns the character code representation for char. integer->char returns the character whose character code representation is k.

In MIT/GNU Scheme, if (char-ascii? char) is true, then the following expression is also true:

          (eqv? (char->ascii char) (char->integer char))

However, this behavior is not required by the Scheme standard, and code that depends on it is not portable to other implementations.

These procedures implement order isomorphisms between the set of characters under the char<=? ordering and some subset of the integers under the <= ordering. That is, if

          (char<=? a b)  ⇒  #t    and    (<= x y)  ⇒  #t

and x and y are in the range of char->integer, then

          (<= (char->integer a)
              (char->integer b))                    ⇒  #t
          (char<=? (integer->char x)
                   (integer->char y))               ⇒  #t
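
The property can be spot-checked for any particular pair of characters. A minimal sketch, in which order-preserved? is a hypothetical name rather than a system procedure:

```scheme
;; Hypothetical helper: true when comparing the characters A and B
;; with char<=? agrees with comparing their integer representations
;; with <=.
(define (order-preserved? a b)
  (eq? (char<=? a b)
       (<= (char->integer a) (char->integer b))))

(order-preserved? #\a #\b)   ; both comparisons are true
(order-preserved? #\b #\a)   ; both comparisons are false
```

Both calls return #t, because in each case the two comparisons agree.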

In MIT/GNU Scheme, the specific relationship implemented by these procedures is as follows:

          (define (char->integer c)
            (+ (* (char-bits c) #x200000)
               (char-code c)))
          
          (define (integer->char n)
            (make-char (remainder n #x200000)
                       (quotient n #x200000)))

This implies that char->integer and char-code produce identical results for characters that have no bucky bits set, and that characters are ordered according to their Unicode code points.
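
The arithmetic behind this relationship can be worked through on plain integers, without constructing any characters. In the sketch below, code-field-size, pack-char-integer, and unpack-char-integer are hypothetical names that mirror the two definitions given here; they are not part of MIT/GNU Scheme.

```scheme
;; #x200000 is 2^21, the first value beyond the 21-bit code field.
(define code-field-size #x200000)

;; Mirrors the char->integer definition: bucky bits occupy the
;; high-order positions, the character code the low-order 21 bits.
(define (pack-char-integer code bits)
  (+ (* bits code-field-size) code))

;; Mirrors the integer->char definition.
;; Returns a two-element list: (code bits).
(define (unpack-char-integer n)
  (list (remainder n code-field-size)
        (quotient n code-field-size)))

(pack-char-integer 97 0)         ; 97, the same as the character code
(pack-char-integer 97 1)         ; 2097249, i.e. #x200061
(unpack-char-integer #x200061)   ; (97 1)
```

Because the bucky bits are the high-order part, under this formula any integer with a bucky bit set is larger than every integer with none set.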

Note: If the argument to char->integer or integer->char is a constant, the compiler will constant-fold the call, replacing it with the corresponding result. This is a very useful way to denote unusual character constants or ASCII codes.
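
For instance, a character that is awkward to type can be named by its code, and a numeric code can be named by its character. This is a sketch only; greek-small-lambda and ascii-capital-a are hypothetical variable names, and whether a given call is folded depends on the compiler as described above.

```scheme
;; A character that may be hard to enter directly, written as a
;; constant code: U+03BB, GREEK SMALL LETTER LAMDA.
(define greek-small-lambda (integer->char #x3BB))

;; A numeric code written as the character it stands for.
(define ascii-capital-a (char->integer #\A))   ; 65
```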

— variable: char-integer-limit

The range of char->integer is defined to be the exact non-negative integers that are strictly less than the value of this variable. Note, however, that there are some holes in this range, because the character code must be a valid Unicode code point.
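
This section does not enumerate the holes. The sketch below assumes they are the values that are not Unicode scalar values, namely the surrogate range #xD800 through #xDFFF and everything above #x10FFFF; that is an assumption about the implementation, and plausible-char-code? is a hypothetical name.

```scheme
;; Hypothetical predicate: true if N could be a character code under
;; the ASSUMPTION that the only holes are the surrogate range and
;; the values above #x10FFFF.  The actual rule may differ.
(define (plausible-char-code? n)
  (and (integer? n)
       (exact? n)
       (<= 0 n #x10FFFF)
       (not (<= #xD800 n #xDFFF))))

(plausible-char-code? 97)        ; #t
(plausible-char-code? #xD800)    ; #f, a surrogate
(plausible-char-code? #x110000)  ; #f, beyond the last code point
```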


Footnotes

[1] Note that the Control bucky bit is different from the ASCII control key. This means that #\SOH (ASCII ctrl-A) is different from #\C-A. In fact, the Control bucky bit is completely orthogonal to the ASCII control key, making possible such characters as #\C-SOH.