Yuchao's blogspot: unicode

Thursday, August 31, 2017

unicode

http://www.diveintopython3.net/strings.html

The default source encoding is ASCII in Python 2 and UTF-8 (Unicode Transformation Format, 8-bit) in Python 3.
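A quick way to check the default on a Python 3 interpreter. This is only a sketch; the variable names are mine, not from the linked chapter.

```python
import sys

# Codec used by str.encode() / bytes.decode() when no codec is named.
default_codec = sys.getdefaultencoding()
print(default_codec)

# With no argument, encode() falls back to that default.
encoded = "\u00e9".encode()  # 'e' with an acute accent
print(encoded)
```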

# Print code points 0-255 with their characters, flagging '0', 'A' and 'a'.
for i in range(256):
    if chr(i) in ('0', 'A', 'a'):
        print("attention!")
    print(i, ":", chr(i), end=" ")

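chr() and ord() convert between a character and its code point: ord('A') gives 65 and chr(65) gives 'A'. In Python 3 they work for any Unicode code point, not just ASCII.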
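A small sketch of the round trip, using one ASCII letter and one CJK character picked for illustration:

```python
# ord() maps a character to its code point; chr() is the inverse.
code_a = ord("A")          # ASCII letter
code_yi = ord("\u4e00")    # CJK ideograph '一'
back = chr(code_yi)        # back to the character

print(code_a, hex(code_yi), back == "\u4e00")
```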

So the characters '0'-'9' are encoded as 48-57, 'A'-'Z' as 65-90, and 'a'-'z' as 97-122. The first 128 code points (0-127, 7 bits) are the same as ASCII, the American Standard Code for Information Interchange, first published in 1963 and revised in 1968.
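These ranges can be checked directly. A sketch using the standard `string` module; the variable names are my own:

```python
import string

digit_codes = [ord(c) for c in string.digits]
upper_codes = [ord(c) for c in string.ascii_uppercase]
lower_codes = [ord(c) for c in string.ascii_lowercase]
print(digit_codes[0], digit_codes[-1])
print(upper_codes[0], upper_codes[-1])
print(lower_codes[0], lower_codes[-1])

# Bytes 0-127 decode to the same text under ASCII and UTF-8.
ascii_bytes = bytes(range(128))
same = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
print(same)
```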

UTF-8 is a variable-width encoding: each character is represented by 1 to 4 bytes, depending on its code point (https://en.wikipedia.org/wiki/UTF-8#Description). It can encode all 1,112,064 valid code points in Unicode: 17*2^16 = 1,114,112 code points from 0 to 0x10FFFF, minus the 2,048 surrogates.
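To see the 1-to-4-byte widths and the code point arithmetic, here is a sketch with four sample characters of my own choosing:

```python
# One sample character per UTF-8 width (1, 2, 3 and 4 bytes).
samples = {
    "A": "A",                 # U+0041
    "e-acute": "\u00e9",      # U+00E9
    "CJK yi": "\u4e00",       # U+4E00
    "emoji": "\U0001F600",    # U+1F600
}
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)

# 17 planes of 2^16 code points, minus the surrogate range U+D800-U+DFFF.
total = 17 * 2**16
surrogates = 0xDFFF - 0xD800 + 1
valid = total - surrogates
print(total, surrogates, valid)
```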

Chinese character encodings

Standard Chinese Telegraph Code (Chinese Telegraphic/Commercial Code, http://code.usvisa-application.com/): 4 digits, from 0000 to 9999.

Chinese national standards:

GB2312 (GB 2312-80, in force since 1981): uses 2 bytes per character, 1st byte 0xA1-0xFE, 2nd byte 0xA1-0xFE, so 94x94 = 8,836 code positions, enough for its 3,755 + 3,008 = 6,763 Chinese characters.
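The GB2312 byte ranges can be counted in a few lines. A sketch using Python's built-in `gb2312` codec; the sample character '哈' is my choice:

```python
# Both bytes of a GB2312 character fall in 0xA1-0xFE.
lead_bytes = range(0xA1, 0xFE + 1)
trail_bytes = range(0xA1, 0xFE + 1)
positions = len(lead_bytes) * len(trail_bytes)
hanzi = 3755 + 3008  # level-1 plus level-2 characters
print(len(lead_bytes), positions, hanzi)

encoded = "\u54c8".encode("gb2312")  # '哈'
in_range = encoded[0] in lead_bytes and encoded[1] in trail_bytes
print(encoded, in_range)
```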

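GBK/GB18030: GBK also uses 2 bytes, but widens the ranges: 1st byte 0x81-0xFE, 2nd byte 0x40-0xFE except 0x7F, so 126x190 = 23,940 code positions. GB18030 extends GBK with 4-byte sequences.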
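A sketch of the GBK two-byte space, taking the lead byte as 0x81-0xFE and the trail byte as 0x40-0xFE without 0x7F. The traditional character '國' is my own example of something GBK covers but GB2312 does not.

```python
# GBK two-byte ranges.
lead_bytes = range(0x81, 0xFE + 1)
trail_bytes = [b for b in range(0x40, 0xFE + 1) if b != 0x7F]
positions = len(lead_bytes) * len(trail_bytes)
print(len(lead_bytes), len(trail_bytes), positions)

traditional = "\u570b"  # '國', traditional form of '国'
try:
    traditional.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

gbk_bytes = traditional.encode("gbk")
print(in_gb2312, len(gbk_bytes))
```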

Unicode 10.0 has 87,882 CJK Unified Ideographs out of a total of 136,755 characters. In Unicode 1.0.1, CJK ideographs took up 20,914 of the 28,359 characters.

Unicode has 17 planes; each plane holds 2^16 = 65,536 code points (a 16-bit range).
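The plane of a character is just its code point divided by 65,536. A sketch, where `plane_of` is a helper name I made up:

```python
import sys

PLANE_SIZE = 2**16  # code points per plane


def plane_of(ch):
    """Return the Unicode plane number (0-16) of a single character."""
    return ord(ch) // PLANE_SIZE


# sys.maxunicode is the highest code point, 0x10FFFF.
planes = sys.maxunicode // PLANE_SIZE + 1
print(planes)
print(plane_of("A"), plane_of("\u4e00"), plane_of("\U0001F600"), plane_of("\U00020000"))
```

Plane 0 is the Basic Multilingual Plane, where the frequently used CJK ideographs live; U+20000 starts the CJK Extension B block in plane 2.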

General literacy requires about 3,000 characters, but reasonably complete coverage requires about 40,000.

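Source separation rule: characters that are kept distinct in a source national standard stay distinct in Unicode, even where Han unification would otherwise merge them. The term "ideograph" may be misleading, because Chinese script is not strictly a picture-writing system.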

Unicode vs UTF-8: https://www.zhihu.com/question/23374078

For a frequently used Chinese character, the Unicode code point fits in 2 bytes (which is what UTF-16 stores), while UTF-8 uses 3 bytes.
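A sketch comparing the two sizes. It uses the `utf-16-be` codec so that no byte order mark is counted; `byte_sizes` is a helper name of my own.

```python
def byte_sizes(ch):
    """Encoded length of one character in UTF-16 (no BOM) and UTF-8."""
    return {codec: len(ch.encode(codec)) for codec in ("utf-16-be", "utf-8")}


common = byte_sizes("\u4e00")      # '一', in the Basic Multilingual Plane
rare = byte_sizes("\U00020000")    # a CJK Extension B character, plane 2
print(common)
print(rare)
```

The 2-byte figure only holds inside the Basic Multilingual Plane; rarer ideographs outside it take 4 bytes in both encodings.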

bytes vs string

Bytes are bytes; characters are an abstraction. An immutable sequence of Unicode characters is called a string. An immutable sequence of numbers between 0 and 255 is called a bytes object. Each byte within a bytes literal can be an ASCII character or a hexadecimal escape from \x00 to \xff (0-255).

In Python 3, string literals are Unicode by default, so there is no need to write u'literal'.

You can't mix bytes and strings; doing so raises a TypeError.
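A minimal sketch of the error and the two ways around it: convert explicitly in one direction or the other before combining. The sample values are mine.

```python
text = "abc"     # str
data = b"def"    # bytes

error = None
try:
    text + data  # str + bytes is not allowed
except TypeError as exc:
    error = exc
print(type(error).__name__)

# Decode the bytes to get a str, or encode the str to get bytes.
combined_text = text + data.decode("ascii")
combined_bytes = text.encode("ascii") + data
print(combined_text, combined_bytes)
```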

# http://graphemica.com/unicode/characters/page/80
u"\u0041"  # 'A'
u"\u4dff"  # '䷿'
u"\u4E00"  # '一'

b = b'abc'  # bytes, immutable
barr = bytearray(b)  # bytearray, mutable
barr[0] = 102  # 102 = 0x66 = 'f'
barr  # bytearray(b'fbc')

# try different encodings
'一'.encode('raw_unicode_escape')  # b'\\u4e00'
u'哈'.encode('utf8')  # b'\xe5\x93\x88'
u'哈'.encode('gbk')   # b'\xb9\xfe'
u'哈'.encode('big5')  # b'\xab\xa2'
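Each of these byte strings decodes back to the same character, as long as the same codec is used both ways. A sketch of the round trip for the three codecs above:

```python
text = "\u54c8"  # '哈'

sizes = {}
round_trips = {}
for codec in ("utf-8", "gbk", "big5"):
    raw = text.encode(codec)
    round_trips[codec] = raw.decode(codec) == text  # decode with the same codec
    sizes[codec] = len(raw)

print(sizes)
print(round_trips)
```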
