Unicode-related Standards

The Methodius.org site has taken on the creation and maintenance of three Unicode-related standards. We hope that in the future this work will be undertaken by the Unicode Consortium, but while we wait for that to happen, we will try some self-help.

The standards we offer address three key problems: conversion from Unicode to an 8-bit code table, text search in a Unicode file, and the display of Unicode symbols that are missing from the font.

We plan to offer our users not only a standard but also software that supports it. We believe it is wasteful for each company to write such software on its own and to maintain it through every change to the standard. For this purpose, we are planning a library that will include all the functions needed for working with Unicode. One example is a function that converts the codes of small letters into the codes of the corresponding capital letters. Another is a function that returns the type of a Unicode symbol (whether it is a letter, a digit or a punctuation mark).
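As an illustration of the two kinds of function mentioned above, here is a minimal sketch using Python's standard unicodedata module. The name describe and the coarse grouping into letter, digit, punctuation and other are our assumptions for this example, not the interface of the planned library.

```python
import unicodedata

def describe(ch):
    """Return (coarse type, capital form) for a single Unicode symbol."""
    cat = unicodedata.category(ch)      # general category, e.g. "Ll", "Nd", "Po"
    if cat.startswith("L"):
        kind = "letter"
    elif cat == "Nd":
        kind = "digit"
    elif cat.startswith("P"):
        kind = "punctuation"
    else:
        kind = "other"
    # str.upper() applies the Unicode case mapping; symbols without one are unchanged.
    return kind, ch.upper()

print(describe("ю"))
print(describe("7"))
print(describe("!"))
```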

Conversion from Unicode to an 8-bit code table

When we convert from Unicode to an 8-bit code table, we face the problem of fitting 96000 symbols into 256. This problem resembles that of fitting an elephant into a matchbox. Usually, the solution to such tasks is to sacrifice part of the elephant. For example, at present most programs perform this conversion by replacing all missing symbols with a question mark.

In this way, the sentence:

На компютъра е инсталирана операционната система Windows Vista.

can turn into the nonsense:

?? ????????? ? ??????????? ????????????? ??????? Windows Vista.

This means that we sacrifice a rather big part of the elephant, when we could get away with a much smaller sacrifice.
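The question-mark behaviour described above is easy to reproduce. The sketch below uses Python's built-in "replace" error handler; Latin-1 is chosen here only as an example of an 8-bit code table without Cyrillic letters.

```python
sentence = "На компютъра е инсталирана операционната система Windows Vista."

# errors="replace" substitutes "?" for every symbol missing from the target table.
lossy = sentence.encode("latin-1", errors="replace").decode("latin-1")
print(lossy)
```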

Let's take for example the word Müller and ask what is the best way to convert it into an 8-bit code table that does not contain the letter ü. We offer three variants:

M?ller Muller Mueller

The first variant is the worst possible and we do not recommend it. The second variant is better and is used by most application programs. The third variant is the best, because it comes from typewriting practice.

In other words, this is not a new problem. People were already looking for a solution well over a century ago, when they made the transition from handwriting to typewriting. They faced the problem of how to represent all possible handwritten signs with about a hundred typewritten symbols. That question was very similar to the one we are looking at now, and it was even more difficult. The transfer from handwriting to typewriting has long been solved, and a tradition exists for replacing handwritten symbols with combinations of typewritten symbols. What is required of us is to collect that tradition and describe it in one standard.

Here you can see the initial variant of this standard. Besides drawing on the typewriting tradition, we have also used transliteration, which could itself be regarded as part of the typewriting tradition. For example, the sentence:

На компютъра е инсталирана операционната система Windows Vista.

will turn into:

Na kompjutyra e instalirana operacionnata sistema Windows Vista.

which is hard to read, but is better than this nonsense:

?? ????????? ? ??????????? ????????????? ??????? Windows Vista.
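The transliteration above can be sketched as a simple per-letter table. The table below is an assumption reconstructed only from the example sentence (for instance ю becomes ju, ъ becomes y and ц becomes c); it is not the full table of the standard, and the name transliterate is ours.

```python
# Hypothetical excerpt: only the letters that occur in the example sentence.
TRANSLIT = {
    "а": "a", "е": "e", "и": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "ц": "c", "ъ": "y", "ю": "ju",
}

def transliterate(text):
    out = []
    for ch in text:
        repl = TRANSLIT.get(ch.lower())
        if repl is None:
            out.append(ch)                  # not in the table: keep as is
        elif ch.isupper():
            out.append(repl.capitalize())   # capital letter: capitalise the replacement
        else:
            out.append(repl)
    return "".join(out)

print(transliterate("На компютъра е инсталирана операционната система Windows Vista."))
```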

For most Unicode symbols only one replacement variant is offered, but this is not a requirement. For example, for the Cyrillic letter И with grave (Ѝ), there is a typewriting tradition of replacing it with Й, but that letter could also be missing from the target table, and then, following the transliteration principle, we replace it with J.
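A replacement list with more than one variant can be tried in order until a variant fits the target table. The following is a sketch under our own assumptions: the FALLBACKS table holds only a few illustrative entries, and the names encodable and convert are hypothetical, not part of any published library.

```python
# Hypothetical excerpt of a replacement table; variants are tried left to right.
FALLBACKS = {
    "ü": ["ue"],
    "Ѝ": ["Й", "J"],
    "Й": ["J"],
}

def encodable(s, encoding):
    """True if every symbol of s exists in the target code table."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def convert(text, encoding):
    out = []
    for ch in text:
        if encodable(ch, encoding):
            out.append(ch)
            continue
        for variant in FALLBACKS.get(ch, []):
            if encodable(variant, encoding):
                out.append(variant)
                break
        else:
            out.append("?")     # last resort when no variant fits
    return "".join(out)

print(convert("Müller", "ascii"))
print(convert("Ѝ", "cp1251"))
print(convert("Ѝ", "ascii"))
```

With Windows-1251 as the target, Ѝ becomes Й because that table contains Й; with ASCII it falls through to J.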

The standard for conversion from Unicode to an 8-bit code table has one good point: it can easily be changed. The Unicode standard, by contrast, can effortlessly be supplemented with new symbols, but the symbols already assigned cannot be changed, because existing fonts and all texts written so far depend on them. The conversion from Unicode to an 8-bit code table can be changed drastically, because it does not affect texts that are already written. This lessens our responsibility in creating this standard. Here is its first variant:

First variant

Applying our standard will be much more difficult than the prevailing practice of replacing all missing symbols with a question mark. Therefore, as we said, besides the standard we shall also provide a software library that supports it.

Text search in a Unicode file

Text search in a Unicode file is a serious problem of which the Unicode Consortium is aware. To solve it, the Consortium has defined two normal forms for Unicode texts (actually, there are four normal forms, but here we shall only examine NFC and NFKC, because the other two, NFD and NFKD, are their decomposed counterparts). From a mathematical point of view, NFC and NFKC correspond to two equivalence relations, and the smaller (finer) of the two corresponds to NFC. In other words, if two texts have the same normal form in NFC, they are very much alike and we can consider them indistinguishable. If they only have the same normal form in NFKC, they are less similar: alike, but not indistinguishable. To a certain extent, NFKC conforms to the typewriting tradition: it treats the symbol ½ as equal to the text 1⁄2, written with the fraction slash (U+2044) rather than the ordinary slash.
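The two relations can be observed with Python's standard unicodedata.normalize function. This is only a small demonstration of existing behaviour, not part of the proposed standard.

```python
import unicodedata

composed = "\u040d"             # Cyrillic capital I with grave
decomposed = "\u0418\u0300"     # Cyrillic capital I + combining grave accent

# NFC: the two spellings get the same normal form.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

half = "\u00bd"                                  # the symbol ½
nfc_half = unicodedata.normalize("NFC", half)    # left unchanged by NFC
nfkc_half = unicodedata.normalize("NFKC", half)  # "1", fraction slash U+2044, "2"

print(same_nfc, nfc_half, nfkc_half)
```

Because NFKC uses the fraction slash, an ordinary typed 1/2 still does not match ½ under NFKC alone.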

What we want is to standardise, on the basis of the typewriting tradition and transliteration, a third equivalence relation that is larger than the one corresponding to NFKC. In other words, under this relation texts would count as equivalent on a weaker similarity than NFKC requires, and a much weaker one than NFC requires.

Some Internet search engines use an analogous correspondence. For example, they match letters with diacritical marks both with and without those marks. So, if you search for Müller, you will also get Muller. We would like to define an even weaker similarity, which would also include Mueller.
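One way to experiment with such a weaker similarity is to derive several search keys from each word and to report a match when the key sets overlap. The sketch below rests on our own assumptions: the TYPEWRITER table is a tiny illustrative excerpt, the names search_keys and matches are hypothetical, and matching by overlapping key sets is a simplification that is not strictly an equivalence relation (Muller and Mueller both match Müller, but not each other).

```python
import unicodedata

# Hypothetical excerpt of typewriter-style replacements.
TYPEWRITER = {"ü": "ue", "ö": "oe", "ä": "ae"}

def strip_marks(s):
    """Drop combining marks after canonical decomposition (ü -> u)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def search_keys(word):
    base = unicodedata.normalize("NFKC", word).casefold()
    typed = "".join(TYPEWRITER.get(c, c) for c in base)
    return {base, strip_marks(base), strip_marks(typed)}

def matches(query, candidate):
    return bool(search_keys(query) & search_keys(candidate))

print(sorted(search_keys("Müller")))
print(matches("Müller", "Muller"), matches("Müller", "Mueller"))
```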

Depicting Unicode symbols missing in the font

At present this is done the crude way: all missing symbols are replaced with the default symbol of the respective font. That is, they are usually replaced by a square.

At the very least, the NFC normal form should be taken into account when symbols are displayed. For example, the Cyrillic symbol И with grave (U+040D) and the sequence of Cyrillic И followed by the combining grave accent (U+0418 U+0300) are equivalent with regard to NFC. So, if the font lacks the symbol И with grave (U+040D), one should see not a square but an automatically generated image of И with grave (though this generated image can be quite ugly). Conversely, if the font does contain И with grave, that symbol should be displayed when the text has the sequence U+0418 U+0300, rather than the automatically generated image, because the symbol in the font is expected to look much better than the generated one.
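The selection rule described above can be sketched as follows. Here glyphs is a hypothetical set of the symbols a font covers and pick_rendering is a name of our own; real text rendering is done by the font and shaping machinery of the operating system, so this only illustrates the order of preference.

```python
import unicodedata

MISSING = "\u25a1"   # white square, standing in for the font's default symbol

def pick_rendering(cluster, glyphs):
    """Choose what to draw for one symbol or one base-plus-marks sequence."""
    composed = unicodedata.normalize("NFC", cluster)
    if all(c in glyphs for c in composed):
        return composed              # prefer the ready-made symbol in the font
    decomposed = unicodedata.normalize("NFD", cluster)
    if all(c in glyphs for c in decomposed):
        return decomposed            # otherwise build it from base letter and mark
    return MISSING                   # nothing usable: fall back to the square

print(pick_rendering("\u040d", {"\u0418", "\u0300"}) == "\u0418\u0300")
print(pick_rendering("\u0418\u0300", {"\u040d"}) == "\u040d")
```

The function only decides what is drawn; the stored text is left untouched.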

There should also be an option for the square of the missing symbol to be replaced with the same symbol from another font, or with the sequence prescribed by the standard for conversion from Unicode to 8-bit. It should be possible to switch this option on and off, because when it is on it helps reading but hinders editing, since it hides the problems.

This change in visualisation should not affect the text itself: after Copy and Paste it should remain the same (i.e., the same sequence of Unicode symbols in the same font).

Write to us: kiril@2-box.net