A LOCKING SHIFT MECHANISM FOR THE KERMIT FILE TRANSFER PROTOCOL

Christine M. Gianone
Frank da Cruz

Columbia University
New York, NY USA

DRAFT 4.2
October 2, 1991

ABSTRACT

7-bit communication channels remain quite common: they are in use on IBM mainframes, public data networks, in virtual terminal protocols like TCP/IP TELNET, and on any connection in which a device uses parity. The Kermit file transfer protocol achieves transparency over hostile communication environments by encoding all data as printable characters. In the 7-bit communications environment, 8-bit data is encoded in 7-bit form using the single shift; the "&" character acts as a prefix, meaning that the following character should have its 8th bit set to 1 after decoding.

Kermit's single-shift 8th-bit quoting mechanism can add excessive transmission overhead to certain kinds of files, particularly text encoded in character sets like ISO 8859 Cyrillic, Greek, Hebrew, or Arabic, or 8-bit Japanese Kanji codes like EUC in which most data bytes have their 8th bit set to 1, resulting in 8th-bit quoting overhead up to 100%.

A new locking shift mechanism is proposed to allow 8-bit data to be transferred more efficiently. This mechanism is an adaptation of the familiar Shift-In / Shift-Out scheme combined with Kermit's present single-shift technique, with some quoting rules added.

This proposal was prompted not only by the longstanding need for increased efficiency in this area, but by a conference between the authors and Dr. Hirofumi Fujii of the Japan National Laboratory for High Energy Physics regarding the establishment of an official Kermit transfer syntax for Japanese, the subject of a separate proposal, and subsequent meetings in Japan. The algorithm and user interface were designed by Gianone and the detailed protocol design was contributed by da Cruz in the course of programming a trial implementation.
The reader is assumed to be familiar with the Kermit file transfer protocol and with commonly used computer character sets.

TERMINOLOGY

In this proposal, the term "character" refers to an 8-bit byte, or octet, even if the data is encoded in a multibyte character set, or if it is not encoded in any character set at all (such as a binary file). An "8-bit character" is a data byte with its 8th bit set to 1. A "7-bit character" is one whose 8th bit is set to 0. A "control character" is a byte in the range 0-31 or 127 decimal (the "C0" set) or 128-159 or 255 (the "C1" set). A "printable character" is any character that is not a control character.

NOTATION

Numbers are written in decimal. "<XXX>" stands for an ASCII control character. "XXX" is replaced by the character's name, for example "<SOH>" for Start of Heading (Control-A). "<1>X" stands for an 8-bit character. The "X" can be a literal printable character (for example, "<1>A" is the ASCII letter A with its 8th bit set to 1) or a control character (for example "<1><SOH>" is a Control-A with its 8th bit set to 1). Similarly, "<0>X" stands for a 7-bit character.

BACKGROUND

The Kermit protocol presently specifies three separate prefix characters to be used within Kermit packets for transparency, compression, and quoting:

The Control Prefix

For transparency on serial communication links that are sensitive to control characters, the file sender precedes each C0 and C1 control with the control prefix, normally "#" (ASCII 35), and then encodes the control character itself by "exclusive-ORing" it with 64 decimal (i.e. inverting bit 6) to produce a character in the printable ASCII range. For example, Control-C (ASCII 3) becomes "#C" (3 XOR 64 = 67, which is the ASCII code for the letter C). Similarly, NUL becomes "#@", Control-A becomes "#A", Control-Z becomes "#Z", Escape becomes "#[", and DEL becomes "#?". The receiver decodes by discarding the prefix and XORing the character with 64 again.
For example, in "#C", C = ASCII 67, and 67 XOR 64 = 3 = Control-C. Control prefixing is mandatory. The control prefix is also used for quoting prefix characters that occur in the data itself; see "The Prefix Quote" below.

The 8th-bit Prefix

When one or both of the two Kermit programs know that the connection between them is not transparent to the 8th bit (e.g. because the Kermit PARITY variable is not NONE, or because the program always operates that way), a feature called "8th-bit prefixing" is used if the two Kermit programs negotiate an agreement to do so. The 8th-bit prefix is Kermit's single shift, normally the ampersand character "&" (ASCII 38). When the file sender encounters an 8-bit character, it inserts the "&" prefix in front of it, and then inserts the data character itself with its 8th bit set to 0. If the data character is a control character, it is inserted after the 8th-bit prefix in control-prefixed form. Examples: an "A" with its 8th bit set to 1 ("<1>A") becomes "&A"; a Control-A with its 8th bit set to 1 ("<1><SOH>") becomes "&#A".

The Repeat-Count Prefix

The repeat-count prefix provides a simple form of data compression. It is used only when both Kermit programs support this feature and agree to use it. This prefix, normally tilde "~" (ASCII 126), precedes a repeat count, which can range from 0 to 94. The repeat count is encoded as a printable ASCII character in the range SP (32) - tilde (126) by adding 32. For example, a series of 36 G's would be encoded as "~DG" (D = ASCII 68 - 32 = 36). The repeat-count prefix applies to the following prefixed sequence, which may be a single character ("~DG"), an 8th-bit prefixed character ("~D&G" = 36 G's with their 8th bits set to 1), a control-prefixed character ("~D#M" = 36 Control-M's), or an 8th-bit-and-control-prefixed character ("~~&#Z" = 94 Control-Z's with their 8th bits set to 1).
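As an illustration only, and not part of the proposal, the three prefixes can be sketched in Python for a 7-bit connection on which 8th-bit prefixing is in effect. The names encode_char and encode_run are hypothetical, the default prefix characters "#", "&", and "~" are assumed, and literal prefix characters in the data are quoted with "#" as described under "The Prefix Quote" below.

```python
CONTROL_PREFIX = "#"   # default control prefix
EIGHTH_PREFIX = "&"    # default 8th-bit prefix (single shift)
REPEAT_PREFIX = "~"    # default repeat-count prefix


def encode_char(byte: int) -> str:
    """Encode one data byte as printable 7-bit characters (hypothetical helper)."""
    out = ""
    if byte & 0x80:                       # 8-bit character: single-shift it
        out += EIGHTH_PREFIX
        byte &= 0x7F
    if byte < 32 or byte == 127:          # control character: prefix and XOR 64
        out += CONTROL_PREFIX + chr(byte ^ 64)
    elif chr(byte) in (CONTROL_PREFIX, EIGHTH_PREFIX, REPEAT_PREFIX):
        out += CONTROL_PREFIX + chr(byte)  # literal prefix character: quote it
    else:
        out += chr(byte)
    return out


def encode_run(byte: int, count: int) -> str:
    """Encode `count` copies of `byte` with a repeat-count prefix."""
    if not 0 <= count <= 94:
        raise ValueError("repeat count must be in the range 0-94")
    return REPEAT_PREFIX + chr(count + 32) + encode_char(byte)


print(encode_char(3))                     # Control-C
print(encode_char(0x80 | 1))              # Control-A with its 8th bit set
print(encode_run(ord("G"), 36))           # 36 G's
```

The repeat count is placed in front of the whole prefixed sequence, which is why encode_run simply wraps encode_char.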
The Prefix Quote

The control prefix, normally "#", is also used to quote the control prefix itself if it occurs in the data: "##" means that the "#" character should be taken literally. If 8th-bit prefixing is in effect, the control prefix also quotes the 8th-bit prefix: "#&", so "#&D" stands for "&D" rather than "<1>D". If repeat count prefixing is in effect, the control prefix is also used to quote the repeat count prefix: "#~", so "#~CG" stands for "~CG" rather than 35 "G" characters. So the complete meaning of the "#" prefix is: if the value of the following character is 63-95 or 191-223, the prefixed character is to be XORed with 64, otherwise it is to be taken literally. The prefix quote can also be used harmlessly to quote 8th-bit or repeat-count prefixing characters even when these types of prefixing are not in effect.

On a 7-bit connection the file sender, after encoding the data, adds the appropriate parity bit to all characters -- prefixes as well as data -- before transmission, and the file receiver strips the parity bit from all received characters before processing them.

On an 8-bit-clean connection, 8th-bit prefixing need not be (and normally is not) done, and data characters retain their original 8th bit. For example, "A" with its 8th bit set to 1 is transmitted literally, without any prefixing ("<1>A"). Control-A with its 8th bit set to 1 is transmitted as "#" followed by the letter A with its 8th bit set to 1 ("#<1>A") because control prefixing is always in effect.

SINGLE AND LOCKING SHIFTS

The shift key on a typewriter lets the regular keys do "double duty". A given key produces different results depending on whether the shift key is up or down. Kermit's single shift (8th-bit prefix) is like the shift key: just as you must press two keys on the typewriter for every uppercase letter, Kermit must send two 7-bit characters for every 8-bit character when 8th-bit prefixing is in effect. Certain types of files have many 8-bit characters in a row.
When this is the case, the overhead of single shifting could be as high as 100%.

Efficiency could be much improved by the use of "locking shifts": the file sender tells the file receiver "Here comes a sequence of 8-bit characters" and then sends these characters in 7-bit form, relying on the receiver to put their 8th bits back before storing them. The locking shift behaves like the shift-lock key on a typewriter: to type a series of uppercase letters, you press the shift lock key once and then type the letters, one key per letter, rather than two. To go back to lowercase letters, release the shift lock key and then type more letters.

When the data communications "shift-lock" key is active, 7-bit characters are said to be "shifted": they are not what they appear to be, but instead represent 8-bit characters. When the locking shift is not in effect, 7-bit characters stand for themselves; they are "unshifted".

The locking shift characters are SO (Shift Out, Control-N, ASCII 14), and SI (Shift In, Control-O, ASCII 15). SO is sent at the beginning of a shifted sequence, SI is sent to return to normal unshifted operation. For example, on a 7-bit connection, the following string of characters (written using our notation):

  <0>A<0>B<0>C<1>D<1>E<1>F<1>G<1>H<1>I<0>J<0>K<0>L<0>M   (13 characters)

would be transmitted like this with single shifts:

  ABC&D&E&F&G&H&IJKLM   (19 characters)

and like this with locking shifts:

  ABC<SO>DEFGHI<SI>JKLM   (15 characters)

On an 8-bit connection, of course, the string of 13 characters can be transmitted as-is, with no overhead at all.

Now suppose we have the following character sequence:

  <1>A<1>B<1>C<0>D<1>E<1>F<1>G<0>H<1>I<1>J<1>K<0>L<1>M   (13 characters)

Here several isolated 7-bit characters are found in the middle of a long run of 8-bit characters.
Using locking shifts alone, this would be encoded as:

  <SO>ABC<SI>D<SO>EFG<SI>H<SO>IJK<SI>L<SO>M   (20 characters)

But using a combination of locking and single shifts, it can be encoded more compactly, as in this example, in which "&" is the single-shift character:

  <SO>ABC&DEFG&HIJK&LM   (17 characters)

This proposal adds the locking Shift-In/Shift-Out mechanism to the Kermit file transfer protocol in such a way that it can be used in conjunction with single shifts for maximum efficiency.

NEGOTIATION

Locking shifts are, like all new additions, an optional feature of the Kermit protocol. To allow old Kermit programs to interoperate transparently with the new ones that implement locking shifts, the use of this feature must be negotiated and agreed upon by both Kermit programs before it can be used.

Two Kermit programs agree to use the locking shift extension via a new capability bit, together with the existing 8th-bit prefixing (QBIN) field. The capabilities mask is the 10th character in the initialization string. It contains a bit mask encoded as a printable character by adding 32 (ASCII Space). Capability number 1 (bit 5, which until now has been reserved for future use) will be used to indicate the locking shift capability: 1 if enabled, 0 if not. Thus old Kermits automatically disable the use of locking shifts because they never set this bit. The format of Kermit's capability mask is:

   bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0
  +----+----+----+----+----+----+----+----+
  | X  | X  | 1  | 2  | 3  | 4  | 5  | Z  |
  +----+----+----+----+----+----+----+----+

where:

  X = Must not be used
  1 = Locking Shift Capability
  2 = Extra-Long Packet Capability (9025-857374)
  3 = Attribute Packet Capability
  4 = Sliding Window Capability
  5 = Long Packet Capability (95-9024)
  Z = Capability Mask Extension Bit (allows addition of new mask bytes)

The locking shift protocol is used if and only if:

  1. The file sender sets the Locking Shift Capability bit in the S (Send Initialization) packet;
  2.
The file receiver also sets the same bit in its acknowledgement to the S packet; and
  3. The parties have agreed to use single shifts via the QBIN field.

Thus, locking shifts REQUIRE 8th-bit prefixing. This is reasonable because (a) 8th-bit prefixing is easy to program; (b) all the popular Kermit programs already implement it; (c) little is gained by using locking shifts without single shifts; (d) it simplifies the user interface and the negotiation process; and (e) it allows the file receiver as well as the sender to request locking shifts.

ENCODING RULES

Kermit's locking shift protocol uses the C0 control character Shift Out (SO, Control-N, ASCII 14) to precede a sequence of 8-bit characters, and Shift In (SI, Control-O, ASCII 15) to precede a sequence of 7-bit characters. Whether or not locking shift protocol is in effect, all of Kermit's normal or negotiated prefixing rules also remain in effect, so SO appears in the packet as "#N" and SI appears as "#O".

Each Kermit program maintains a SHIFT-STATE, which may be SHIFTED (shifted out) or UNSHIFTED (shifted in). SHIFTED means that 8-bit characters are being transmitted in 7-bit form (preceded by a Shift-Out character) and UNSHIFTED means that 7-bit characters represent themselves. For each file, the initial SHIFT-STATE is defined to be UNSHIFTED, so there is no need for the sender to transmit an initial Shift-In (but it does no harm).

A. When the file sender's SHIFT-STATE is UNSHIFTED and it reads a 7-bit character, it adds the character to the packet according to Kermit's other prefixing rules (control and repeat count), and adds the appropriate parity bits. Thus, any number of 7-bit characters can be transmitted in a row.

B. When the file sender's SHIFT-STATE is UNSHIFTED and it reads an 8-bit data character, there are two possibilities:
   1.
If single-shifting (8th-bit prefixing) is in effect, insert a single-shift character ("&") with the appropriate parity bit before the 8-bit data character, and add the data character itself with its 8th bit replaced by the appropriate parity bit. OR:
   2. Insert a Shift Out (SO) character into the packet (encoded as "#N" with the appropriate parity bits), change the SHIFT-STATE to SHIFTED, and then add the data character with its 8th bit replaced by the appropriate parity bit.

C. When the file sender's SHIFT-STATE is SHIFTED and it reads an 8-bit character, it adds the character to the packet according to Kermit's other prefixing rules (control and repeat count), replacing the character's 8th bit by the appropriate parity bit. Thus, any number of 8-bit characters may be transmitted in a row in 7-bit form after the SO.

D. When the file sender's SHIFT-STATE is SHIFTED and a 7-bit character is encountered, there are two possibilities:
   1. If single-shifting is in effect, insert a single-shift character ("&") before the 7-bit character and add the appropriate parity bits. OR:
   2. Insert a Shift-In (SI) character (encoded as "#O" with the appropriate parity bits) into the packet and change the SHIFT-STATE to UNSHIFTED, and then insert the data character itself with the appropriate parity bit.

E. If a repeated sequence of characters occurs where the shift state changes, the locking shift is encoded BEFORE the repeat-count sequence: #O~xA, not ~x#OA.

F. If the file ends in SHIFTED state, there is no need to issue a Shift-In code at the end of the file, but it does no harm either.

SINGLE AND LOCKING SHIFTS

When locking shifts and single shifts are in effect, the meaning of the single-shift character is reversed when the SHIFT-STATE is SHIFTED. Single shifts can be used to efficiently encode isolated characters that don't fit the current SHIFT-STATE. For example:

     Data                                   Encoding
  1. ABCABC<1>EBCABC                        ABCABC&EBCABC
  2.
<1>A<1>B<1>C<1>A<1>BXY<1>B<1>C<1>A     #NABCAB&X&YBCA

In (1) the single shift "&" sets the 8th bit of "E" to 1 (normal Kermit practice), but in (2) the single shift sets the 8th bit of "X" and "Y" to 0 because the SHIFT-STATE is SHIFTED (#N).

The file sender can decide whether to use single or locking shifts by looking ahead in the input file data. Single shifts are more efficient when there are one, two, or three n-bit characters in a row; locking shifts are more efficient when there are five or more n-bit characters in a row (n is either 7 or 8):

  Single Shift         Locking Shift
  &A          (2)      #OA#N      (5)   (worse)
  &A&B        (4)      #OAB#N     (6)   (worse)
  &A&B&C      (6)      #OABC#N    (7)   (worse)
  &A&B&C&D    (8)      #OABCD#N   (8)   (same)
  &A&B&C&D&E  (10)     #OABCDE#N  (9)   (better)

Thus five-character lookahead is sufficient to make the best decision.

REPEAT COUNTS AND LOCKING SHIFTS

A repeated sequence of 8-bit characters that occurs while in UNSHIFTED state, for example abc<1>X<1>X<1>X<1>X, can be encoded by using a single shift:

  abc~$&X

A repeated sequence of 8-bit characters that occurs while in SHIFTED state, for example:

  abc<1>A<1>B<1>C<1>X<1>X<1>X<1>X<1>X<1>X<1>X<1>X<1>D<1>E<1>F

is encoded using the same repeat-count notation:

  abc#NABC~(XDEF

Just as the # and & prefixes are used as prefixes in both UNSHIFTED and SHIFTED states, so is the repeat-count prefix, ~. The same sequence could also be encoded less efficiently as:

  abc#NABC#O~(&X#NDEF

PREFIX CHARACTERS THAT OCCUR IN THE DATA

Since Kermit prefix characters can occur within file data, they must be prefixed to distinguish them from true prefixes. The following encoding is used:

                      STATE
  CHARACTER    UNSHIFTED    SHIFTED
  #            ##           &##
  &            #&           &#&
  ~            #~           &#~
  <1>#         &##          ##
  <1>&         &#&          #&
  <1>~         &#~          #~

QUOTING THE LOCKING SHIFT CHARACTERS

Since Control-O and Control-N can appear within file data, there has to be a way to distinguish the use of these characters as locking shifts from their use as data characters.
When (and only when) locking shift protocol is in effect, SO and SI characters that appear in the data must be prefixed by Data Link Escape (DLE, Control-P, ASCII 16), normally encoded as "#P". If DLE itself appears in the file, it too must be prefixed by DLE.

The DLE character applies to the ENTIRE PREFIXED SEQUENCE that follows it. This may be a single character, a control-prefixed character, an 8th-bit prefixed character, or a repeat-count-prefixed sequence of any combination of these.

To illustrate the difference between quoting by "#" and DLE, "##O" indicates a literal "#" character followed by the letter "O", whereas "<DLE>#O" (which appears in the packet as "#P#O") indicates a literal Control-O.

In practice, the file sender should use DLE only to prefix SO, SI, and itself, but the receiver should treat DLE as a general "prefixed sequence" quote: it should discard the DLE, decode the following prefixed sequence, and treat the result as data rather than Kermit protocol information.

Should a repeated sequence of SO's, SI's, or DLE's occur within the data, the entire sequence may be encoded with a repeat count and prefixed by a single DLE, which applies to all copies of the repeated character. For example, "#P~A#N" indicates 33 SO characters in a row that are not to be treated as locking shifts.

When locking shift protocol is in effect, we must also handle the C1 counterparts of SO, SI, and DLE (that is, using our notation, <1>SO, <1>SI, and <1>DLE). These characters would be inserted into the packet in their 7-bit form when the SHIFT-STATE is SHIFTED, and the receiver would have no way of distinguishing a data #O from a Shift-In #O, or a data #N from a Shift-Out #N, or a data #P from a quoting #P. Therefore these characters too should be prefixed by DLE when in SHIFTED state.

If a 7-bit SO, SI, or DLE appears in the data during SHIFTED state, the file sender can "single-shift" it in the normal manner, for example "&#O".
The file receiver must treat such sequences as literal data characters, as if they had been prefixed by DLE, not as shifts and quotes.

The rule, therefore, is that if #O, #N, and #P have no prefix of any kind, then they are used for shifting and quoting. When these characters are prefixed by either "&" or DLE, no matter what the SHIFT-STATE is, they are data characters:

  File                    SHIFT-STATE
  Character     UNSHIFTED          SHIFTED
  SI            #P#O               &#O or #P&#O
  <1>SI         &#O or #P&#O       #P#O
  SO            #P#N               &#N or #P&#N
  <1>SO         &#N or #P&#N       #P#N
  DLE           #P#P               &#P or #P&#P
  <1>DLE        &#P or #P&#P       #P#P

The "&#O" form need not be prefixed by "#P", but no harm is done if it is. The packet receiver must respond to these prefixed sequences as follows:

  Packet                    SHIFT-STATE
  Sequence         UNSHIFTED          SHIFTED
  #O               Discard*           Shift In
  #P#O             Literal SI         Literal <1>SI
  &#O or #P&#O     Literal <1>SI      Literal SI
  #N               Shift Out          Discard*
  #P#N             Literal SO         Literal <1>SO
  &#N or #P&#N     Literal <1>SO      Literal SO
  #P               Quote              Quote
  #P#P             Literal DLE        Literal <1>DLE
  &#P or #P&#P     Literal <1>DLE     Literal DLE

The "Discard*" entries are for when a redundant shift is received, for example an unprefixed Shift-Out when the Kermit receiver is already shifted out. Redundant shifts do not affect the current SHIFT-STATE and are not interpreted as data; they are simply ignored and discarded by the receiver.

BOUNDARY CONDITIONS

Although sequences of characters prefixed by "#", "&", or "~" may not be broken across packet boundaries, locking shifts are effective across packet boundaries. However, locking shifts are not effective across file boundaries; when a group of files is being transferred, the SHIFT-STATE must be set to UNSHIFTED at the beginning of each file.

THE FILE RECEIVER

The file receiver has no decisions to make; it is totally driven by the sequence of characters in each packet it receives.
The receiver operates as it does without the locking shift protocol, but with additional rules: it must recognize the locking shift indicators "#N" and "#O", set the SHIFT-STATE to SHIFTED when it sees "#N" and to UNSHIFTED when it sees "#O", and set the value of the 8th bit of each data character according to the current SHIFT-STATE. It must treat #, &, and ~ as prefix characters even when the SHIFT-STATE is SHIFTED, remembering that the meaning of the single-shift prefix "&" is inverted. (The file receiver can also store the shift characters as is -- see the COMMANDS section below.)

COMMANDS

One new command is required:

  SET TRANSFER LOCKING-SHIFT { ON, OFF, FORCED }

The options are as follows:

ON: Enables the use of locking shifts. The Kermit program sets the locking shift capability bit in any S or I packets it sends, or in any acknowledgement to an S or I packet. Locking shifts are actually used if and only if both Kermits set this bit AND single-shifts are successfully negotiated. If a Kermit program implements the locking shift protocol, the default TRANSFER LOCKING-SHIFT setting should be ON.

OFF: Disables the use of locking shifts. The Kermit program sets the locking shift capability bit to zero in all negotiation packets, and treats SO, SI, and DLE as ordinary data characters in Kermit data packets.

FORCED: Forces the use of locking shifts, regardless of the PARITY setting and capability negotiation. The file sender sets the locking shift bit in the capability mask, sets the QBIN (8th-bit prefix) field to "N", and ignores the receiver's reply. The file receiver sets the same values, regardless of the sender's values. A Kermit program that has been given this command acts as if locking shift protocol had been successfully negotiated and single shifts have been disabled.
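The effect of the three settings, combined with the rules in the NEGOTIATION section, can be sketched in Python as follows. This is a hedged illustration rather than the trial implementation: the function names are hypothetical, the outcome of the QBIN negotiation is passed in as a boolean instead of being computed, and FORCED is modeled simply as "always on".

```python
LOCKING_SHIFT_BIT = 0x20   # capability number 1 = bit 5 of the capability mask


def local_capas(setting: str, other_bits: int = 0) -> str:
    """Build the capability-mask character this Kermit would send.

    `setting` is "ON", "OFF", or "FORCED"; `other_bits` carries any other
    capability bits (bits 0-4) the program wants to announce.
    """
    mask = other_bits & 0x1F
    if setting in ("ON", "FORCED"):
        mask |= LOCKING_SHIFT_BIT
    return chr(mask + 32)              # masks are made printable by adding 32


def peer_offers_locking_shift(capas: str) -> bool:
    """True if the peer's capability-mask character has bit 5 set."""
    return bool((ord(capas) - 32) & LOCKING_SHIFT_BIT)


def locking_shifts_used(setting: str, peer_capas: str,
                        single_shifts_agreed: bool) -> bool:
    """Decide whether this Kermit applies locking shift protocol."""
    if setting == "FORCED":
        return True                    # acts as if negotiation had succeeded
    if setting == "OFF":
        return False
    return peer_offers_locking_shift(peer_capas) and single_shifts_agreed


# Both sides ON, single shifts negotiated: locking shifts are used.
print(locking_shifts_used("ON", local_capas("ON"), True))
# Sender FORCED announces QBIN "N", so an ON receiver sees no single shifts.
print(locking_shifts_used("ON", local_capas("FORCED"), False))
```

Passing the QBIN outcome in as a boolean keeps the sketch independent of the details of 8th-bit prefix negotiation, which this proposal does not change.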
With these facilities and defaults in effect, the Kermit user will get locking shift protocol automatically whenever PARITY is not NONE and both Kermits support locking shifts (which implies they also support single shifts and that single shifts were negotiated successfully).

SET TRANSFER LOCKING-SHIFT FORCED can be used to force the file sender to use locking shifts even if the receiver doesn't understand this protocol, or to force the file receiver to treat SO/SI/DLE codes in arriving files as prescribed by this proposal. This allows an 8-bit data file to be sent through a 7-bit connection to a Kermit program that does not implement 8th-bit prefixing or locking shifts. The result can be displayed on terminals or printers that respond appropriately to Shift-In/Shift-Out codes, sent through e-mail, or postprocessed with a simple SO/SI filter to reconstruct it, provided the original file does not contain SO, SI, or DLE characters. If a file containing SO/SI codes is sent to a Kermit program with SET TRANSFER LOCKING-SHIFT FORCED in effect, the data is reconstructed according to the embedded shifts.

The SET TRANSFER LOCKING-SHIFT FORCED option is, of course, risky, and can result in undesired effects if used improperly. For example, if the file contains SO or SI characters as data, the shift state can become inverted. Furthermore, DLE does not serve to "quote" SO or SI characters in ordinary data communication; SO and SI usually act as locking shifts even when preceded by DLE (or any other character). For example, when the sequence "ABC<DLE><SO>DEF" is sent to a VT300 terminal, the DLE is ignored and the characters DEF are shifted.

Here are the possible SET TRANSFER LOCKING-SHIFT combinations and their effects.
The OFF entries also apply to Kermit programs that don't implement locking shift protocol at all:

  Sender   Receiver   Effect
  ON       ON         Locking shift protocol done if single shifts negotiated
  ON       OFF        No locking shifts
  ON       FORCED     SO/SI/DLE in data interpreted as shifts by receiver
  OFF      ON         No locking shifts
  OFF      OFF        No locking shifts
  OFF      FORCED     SO/SI/DLE in data interpreted as shifts by receiver
  FORCED   ON         Sender adds shifts, receiver stores them as data (*)
  FORCED   OFF        Sender adds shifts, receiver stores them as data
  FORCED   FORCED     Locking shift protocol is done with no single shifts

  (*) Sender announces that it WON'T do single shifts, which disables the receiver's locking-shift protocol.

CHARACTER SET TRANSLATION

SET TRANSFER LOCKING-SHIFT FORCED (or any other LOCKING-SHIFT setting) does not affect character set translation. Translation is still done if the user has elected to do it. Here are the possibilities when the sender has SET TRANSFER LOCKING-SHIFT FORCED and has announced an 8-bit transfer character set in the Attribute packet, and the receiver supports character-set translation, but is not doing LS protocol:

  1. Receiver translates the transfer character set into an 8-bit file character set whose first 128 characters are ASCII, such as an IBM code page, KOI-8, the Apple or NeXT character set, etc. In this case, the desired effect is achieved automatically.

  2. Receiver translates the transfer character set into a 7-bit file character set such as an ISO 646 NRC or Short KOI. In this case the result is garbage. Locking shifts should not be used here. For the languages covered by ISO 646 NRCs, single shifts are more efficient.

  3. The receiver does not understand the transfer character set. The situation here is no different with locking shifts than without them.
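Setting character-set translation aside, the encoding rules, the quoting tables, and the receiver rules given in the preceding sections can be combined into a small round-trip sketch in Python. It is an illustration under stated assumptions, not the authors' trial implementation: the names encode and decode are hypothetical, the default prefix characters are assumed, repeat counts, parity, and packet framing are omitted, and the sender uses a simple rule (take a locking shift only when the next five characters all need it) suggested by the cost table in SINGLE AND LOCKING SHIFTS.

```python
SO, SI, DLE = 14, 15, 16        # Shift Out, Shift In, Data Link Escape
PREFIXES = b"#&~"               # control, 8th-bit, and repeat-count prefixes


def quote7(low: int) -> str:
    """Control-prefix or prefix-quote one 7-bit value."""
    if low < 32 or low == 127:
        return "#" + chr(low ^ 64)
    if low in PREFIXES:
        return "#" + chr(low)
    return chr(low)


def encode(data: bytes, lookahead: int = 5) -> str:
    """Encode file data using combined locking and single shifts."""
    out, shifted = [], False                    # each file starts UNSHIFTED
    for i, b in enumerate(data):
        high, low, single = bool(b & 0x80), b & 0x7F, False
        if high != shifted:                     # does not fit the SHIFT-STATE
            window = data[i:i + lookahead]
            if len(window) == lookahead and all(bool(n & 0x80) == high for n in window):
                out.append("#N" if high else "#O")   # locking shift
                shifted = high
            else:
                single = True                   # isolated character
        if low in (SO, SI, DLE) and not single:
            out.append("#P")                    # DLE-quote a data SO/SI/DLE
        out.append(("&" if single else "") + quote7(low))
    return "".join(out)


def decode(text: str) -> bytes:
    """Decode the output of encode(); assumes well-formed input."""
    out, shifted, i = bytearray(), False, 0
    while i < len(text):
        quoted = text.startswith("#P", i)       # unprefixed #P is a quote
        if quoted:
            i += 2
        single = text[i] == "&"
        if single:
            i += 1
        prefixed = text[i] == "#"
        if prefixed:
            i += 1
        c = ord(text[i])
        i += 1
        low = c ^ 64 if prefixed and 63 <= c <= 95 else c
        if prefixed and not quoted and not single and low in (SO, SI):
            shifted = low == SO                 # redundant shifts change nothing
            continue
        out.append(low | 0x80 if shifted != single else low)
    return bytes(out)


sample = bytes(range(256)) + b"ABCABC\xc5BCABC"
print(decode(encode(sample)) == sample)
```

Because an unprefixed "#P" is always consumed as a quote and a data DLE is always sent as "#P#P" or "&#P", the decoder never has to guess whether a given "#N", "#O", or "#P" is protocol or data.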
PERFORMANCE

A preliminary implementation of the shifting algorithms described in this proposal was coded and tested on a large number of text and binary files and worked correctly: the result of encoding and then decoding each file was identical to the original. All combinations of single shift, locking shift, and repeat-count compression were tested successfully in both text and binary file mode.

The following table shows the number of characters required to encode files of different representative types (taken from a much larger sample) using different combinations of single shifts (SS) and locking shifts (LS), but without repeat-count compression (R). For comparison, the final column includes repeat-count compression. The number in parentheses is the "expansion factor" showing how much the data grew in the encoding process. The .TXT files were encoded in text mode, the others were encoded in binary mode.

  File                       Encoding
  Name              Length   SS               LS               LS+SS            LS+SS+R
  ASCII.TXT         190689   202173 (1.06)    202126 (1.06)    202173 (1.06)    194938 (1.02)
  GERMAN.TXT         39611    42159 (1.06)     43336 (1.09)     42169 (1.06)     41558 (1.05)
  FRENCH.TXT        108021   116426 (1.08)    124446 (1.15)    116446 (1.08)    115531 (1.07)

  CYRILL1.TXT        52046    95700 (1.84)     80998 (1.56)     64602 (1.24)     64476 (1.24)
  CYRILL2.TXT        13699    25293 (1.85)     23429 (1.71)     18306 (1.34)     18078 (1.32)
  CYRILL3.TXT        28434    49834 (1.75)     43029 (1.51)     37104 (1.30)     35519 (1.25)
  CYRILL4.TXT        51011    89419 (1.75)     78217 (1.53)     63157 (1.24)     63010 (1.24)
  Cyrillic Totals   145190   260246 (1.79)    225673 (1.55)    183169 (1.26)    181083 (1.25)

  KANJI.TXT          29706    59494 (2.00)     32527 (1.09)     32629 (1.10)     32648 (1.10)
  KANJIA.TXT        106943   157536 (1.47)    122043 (1.14)    121822 (1.14)    118563 (1.11)
  Kanji Totals      136649   217030 (1.59)    154570 (1.13)    154451 (1.13)    151211 (1.11)

  MSVIBM.EXE        146989   247766 (1.69)    302348 (2.06)    248991 (1.69)    210598 (1.43)
  WERMIT            419861   737812 (1.76)    923451 (2.20)    760912 (1.81)    713830 (1.70)
  FILE.ZIP           96911   173145 (1.79)    226407 (2.34)    172627 (1.78)    172841 (1.78)

ASCII.TXT is a plain US ASCII text file containing English prose and no 8-bit characters. GERMAN.TXT and FRENCH.TXT are German- and French-language documents coded in ISO 8859-1 Latin Alphabet 1. CYRILL1.TXT is a chapter from a Russian computer book, containing only a few English words. CYRILL2.TXT is a poem, The Bronze Horseman by Pushkin; its lines are short and there are many blank lines so there is a higher CRLF-to-text ratio. CYRILL3.TXT is "Murphy's Laws" in Russian, in which lines tend to be short, blank, or indented. CYRILL4.TXT is a RussTeX source file in which the TeX commands are ASCII and the text is Cyrillic. The Cyrillic text in all these files is ISO 8859-5 Latin/Cyrillic 8-bit text. KANJI.TXT is a Japanese-language text file encoded in the Japanese EUC code. KANJIA.TXT contains a mixture of ASCII English and Japanese Kanji encoded in EUC. MSVIBM.EXE is an IBM PC binary executable program image. WERMIT is a SUN-4 (Sparc) binary executable program image.
FILE.ZIP is a binary MS-DOS ZIP archive.

ANALYSIS

For binary files, locking (combined) shifts generally provide no benefit over single shifts. These files tend to have a high percentage of bytes in the C0 and C1 ranges, and therefore suffer high overhead from control prefixing. Furthermore, they rarely have long runs of 8-bit characters. The reason the combined shift is less efficient than the single shift is the necessity to quote SO, SI, and DLE characters that occur in the data.

For text files encoded in "left-handed" 8-bit character sets such as ISO 8859 Latin Alphabets 1-4 and 9 (for languages based on Roman characters), 8-bit characters generally occur only in isolation, and so locking (combined) shifts provide no significant benefit over single shifts.

Locking and combined shifts provide a substantial performance improvement over single shifts for text files written in "right-handed" 8-bit character sets like the Latin Arabic, Cyrillic, Greek, and Hebrew alphabets where long sequences of 8-bit bytes predominate, and for certain multibyte character sets like Japanese EUC, in which all Kanji-character bytes have their 8th bits set to 1.

CONCLUSION

The locking shift algorithm is easy to program and is inexpensive in both execution time and code space. Implementation of locking shift protocol is recommended for Kermit programs that must transfer files likely to contain many sequences of 5 or more consecutive 8-bit GR bytes over 7-bit communication channels. Such files tend to be text files encoded in the ISO character sets for non-Roman alphabets and in EUC Kanji codes, but there might be other candidates too: binary image (raster) data, spreadsheet data, etc. For such files, the efficiency improvement can approach 100%.

REFERENCES

Gianone, Christine M., "A Kermit Protocol Extension for International Character Sets", Columbia University (1990).

da Cruz, Frank, "Kermit, A File Transfer Protocol", Digital Press (1987).
ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for Information Interchange".

ISO 2022, "Information processing - ISO 7-bit and 8-bit coded character sets - Code extension techniques" (1985).

ISO 8859, "Information processing - 8-bit single-byte coded graphic character sets", parts 1-9 (1987-present).

"JIS X 0212 Study Group Interim Report".

ACKNOWLEDGEMENTS

Thanks to John Chandler, John Klensin, Paul Placeway, and Konstantin Vinogradov for their detailed comments on this proposal, and to Gisbert W. Selke for the German file, Andre' Pirard for the French, Konstantin Vinogradov and Dimitri Vulis for the Russian files, and Hirofumi Fujii for the Japanese files.

(The End)