I promised to finish the series on Unicode and UTF-8, so here is the final instalment: better late than never. Before reading this article I suggest that you read Part 1 and Part 2, which cover some important background. As usual, I’m trying to avoid simply repeating the huge wealth of information already published on this topic, but (hopefully) this article will provide a few additional details which may assist with understanding. Additionally, I’m leaving out a lot of detail and not taking a “rigorous” approach in my explanations, so I’d be grateful to hear from readers whether or not it is useful.

Reminder on code points: The Unicode encoding scheme assigns each character a unique integer in the range 0 to 1,114,111; each integer is called a code point.

The “TF” in UTF-8 stands for Transformation Format so, in essence, you can think of UTF-8 as a “recipe” for converting (transforming) a Unicode code point value into a sequence of 1 to 4 byte-sized chunks. For example, the Arabic letter ش (“sheen”) is allocated the Unicode code point value 0634 (hex) and its representation in UTF-8 is the two-byte sequence D8 B4 (hex). Converting the smallest code points (00 to 7F) to UTF-8 generates 1 byte, whilst the higher code point values (10000 to 10FFFF) generate 4 bytes.

In the remainder of this article I will use examples from the Unicode encoding for Arabic, which is split into 4 blocks within the Basic Multilingual Plane. These include:

- Arabic presentation forms A: FB50 to FDFF
- Arabic presentation forms B: FE70 to FEFF

Aside: refresher on hexadecimal. In technical literature discussing computer storage of numbers you will likely come across binary, octal and hexadecimal number systems. Consider the decimal number 251, which can be written as $251 = 2 \times 10^2 + 5 \times 10^1 + 1 \times 10^0 = 200 + 50 + 1$.

How to work with non-BMP characters

ECMAScript and Duktape support for non-BMP

ECMAScript standard strings are 16-bit only

The ECMAScript standard itself does not support non-BMP characters: all codepoints are strictly 16-bit. Non-BMP characters are intended to be represented using surrogate pairs. ES2015 RegExp patterns having the u flag support non-BMP characters by interpreting the string data as UTF-16. Some ES2015 string bindings also have special handling for non-BMP characters (again interpreting the string as UTF-16).

Duktape strings support up to 32-bit codepoints

Duktape represents strings in an extended UTF-8 format which allows both arbitrary 16-bit codepoints (as required by ECMAScript) and extended codepoints for the full 32-bit range. Arbitrary byte sequences (which are invalid UTF-8) are also allowed.

As a result, Duktape supports characters in the non-BMP range directly:

- C code can push such strings, expressed as (extended) UTF-8, directly.
- Non-BMP characters will mostly work as one expects in ECMAScript code.
- There are some individual ECMAScript bindings which may not work as expected because the standard bindings expect codepoints to be at most 16 bits.

Main approaches for dealing with non-BMP characters

- Using surrogate pairs: standard approach, engine neutral, has some inconveniences like .length of strings counting the surrogate pair characters individually. C code will first encode non-BMP characters into surrogate pairs, with each codepoint in the pair then encoded using CESU-8.
- Using Duktape specific non-BMP strings: more natural for C code, .length will be correct, has some inconveniences like some standard bindings working in unexpected ways. C code will encode strings using UTF-8 directly.
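To make the UTF-8 “recipe” and the two non-BMP approaches a little more concrete, here is a minimal Python sketch. The function names (`utf8_encode`, `surrogate_pair`, `cesu8_encode`) are hypothetical helpers written for this illustration; they are not part of Duktape or of any library. The non-BMP test character U+1F600 is also my own choice rather than something taken from the text above. Note that, unlike a strict UTF-8 encoder, this sketch does not reject surrogate code points (D800 to DFFF), because the CESU-8 path relies on encoding them.

```python
def utf8_encode(cp):
    """Encode one code point using the standard UTF-8 bit layout (1 to 4 bytes).

    Assumption: surrogates (D800 to DFFF) are NOT rejected here, so the same
    helper can be reused for the CESU-8 sketch below.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if cp <= 0x7F:                       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 4 bytes: 11110xxx + three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def surrogate_pair(cp):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only non-BMP code points have surrogate pairs")
    v = cp - 0x10000                     # 20-bit value shared between the pair
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


def cesu8_encode(cp):
    """Surrogate-pair approach: each surrogate becomes its own 3-byte sequence."""
    if cp < 0x10000:
        return utf8_encode(cp)           # BMP characters are unchanged
    high, low = surrogate_pair(cp)
    return utf8_encode(high) + utf8_encode(low)


print(utf8_encode(0x0634).hex())         # d8b4 (the letter sheen)
print(utf8_encode(0x1F600).hex())        # f09f9880 (direct UTF-8, 4 bytes)
print(cesu8_encode(0x1F600).hex())       # eda0bdedb880 (CESU-8, 6 bytes)
```

The last two lines show the trade-off in bytes: encoding the code point directly gives one 4-byte sequence, while the surrogate-pair route gives two 3-byte sequences, which is also why `.length` counts such a character twice.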