Chapter 4: Text versus Bytes
More about str and bytes
Character Issues
The best available definition of a character is its Unicode definition: the Unicode standard explicitly separates the identity of a character from its byte representation.
The identity of a character is its code point, a number written as U+ followed by 4 to 6 hexadecimal digits. For example, the code point of A is U+0041.
The actual byte representation depends on the encoding in use. With UTF-8 the letter A is encoded as the single byte \x41, but with UTF-16LE it becomes the two bytes \x41\x00.
Converting from code points to bytes is called encoding; converting from bytes to code points is called decoding.
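As a minimal sketch, here is the letter A viewed as a code point and in the two encodings mentioned above:
# code point versus byte representation of the letter A
ord('A')                 # 65, i.e. code point U+0041 in decimal
hex(ord('A'))            # '0x41'
'A'.encode('utf_8')      # b'A': a single byte, \x41
'A'.encode('utf_16_le')  # b'A\x00': two bytes, \x41\x00
The example below makes the same round trip with 'café', where the 'é' needs more than one byte in UTF-8.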
s = 'café'
len(s)                      # 4: the str has four Unicode characters
b = s.encode('utf_8')
b                           # b'caf\xc3\xa9': 'é' is encoded as two bytes in UTF-8
len(b), b.decode('utf8')    # (5, 'café'): decoding recovers the original str
cafe = bytes('café', encoding='utf_8')
cafe                        # b'caf\xc3\xa9'
len(cafe)                   # 5
cafe[0]                     # 99: indexing a bytes object returns an int
for byte in cafe:
    print(byte)             # each item is an integer from 0 to 255
cafe[:1]                    # b'c': slicing returns a bytes object, even for a single byte
cafe_arr = bytearray(cafe)
cafe_arr                    # bytearray(b'caf\xc3\xa9')
cafe_arr[-1:]               # bytearray(b'\xa9'): slicing a bytearray gives a bytearray
len(cafe_arr)               # 5
As you can see, even though a binary sequence is really a sequence of integers, its literal notation reflects the fact that ASCII text is often embedded in it: bytes in the printable ASCII range are displayed as ASCII characters, the escape sequences \t, \n, \r, and \\ are used for tab, newline, carriage return, and backslash, and every other byte value is shown as a hexadecimal escape sequence such as \xc3.
Both bytes and bytearray support every str method except those that do formatting (format, format_map) and a few others that depend on Unicode data, including casefold, isdecimal, isidentifier, isnumeric, isprintable, and encode. This means that you can use familiar string methods like endswith, replace, strip, translate, upper, and dozens of others with binary sequences, but only with bytes and not str arguments. In addition, the regular expression functions in the re module also work on binary sequences, if the regex is compiled from a binary sequence instead of a str.
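As a quick sketch, here are a few of those str-like methods and a binary regex applied to a bytes object (the data value is made up for illustration):
import re

data = b'hello world'
data.upper()                       # b'HELLO WORLD'
data.replace(b'world', b'bytes')   # b'hello bytes': arguments must be bytes, not str
data.endswith(b'world')            # True
re.findall(rb'\w+', data)          # [b'hello', b'world']: the pattern is a bytes regex (rb'...')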
A bytes or bytearray instance can be built in several ways, as the code below shows:
- Calling the bytes.fromhex() classmethod with a string of hex digit pairs.
- Calling the constructor with a str and an encoding keyword argument.
- Calling the constructor with an iterable providing items with values from 0 to 255.
- Calling the constructor with an object that implements the buffer protocol (e.g., bytes, bytearray, memoryview, array.array); this copies the bytes from the source object to the newly created binary sequence.
# from a string of hex digit pairs
b = bytes.fromhex('31 4B CE A9')
print(b)    # b'1K\xce\xa9'
# from an object that implements the buffer protocol
import array
# 'h' denotes short integers (2 bytes each)
nums = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(nums)
print(octets)   # b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00' on a little-endian machine: a copy of the raw bytes in nums
# from an iterable of integers in range(256)
b = bytes([65, 66, 67])
print(b)    # b'ABC'
Creating a bytes or bytearray object from a buffer-like source always copies the bytes. In contrast, memoryview objects let you share memory between binary data structures without copying. To extract structured information from binary sequences, use the struct module.
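To make the sharing concrete, here is a small sketch (with a made-up bytearray) showing that slicing a memoryview does not copy bytes, so writes through the view are visible in the source:
# a memoryview slice shares memory with its source instead of copying it
data = bytearray(b'spam eggs')
view = memoryview(data)
tail = view[5:]          # a new memoryview over the last four bytes; still no copy
tail[0] = ord('E')       # writing through the view ...
data                     # bytearray(b'spam Eggs'): ... mutates the underlying bytearray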
Structs and Memory Views
To show the power of struct and memoryview, let's look at an example that inspects the header of a GIF file.
import struct
# struct format: < little-endian; 3s3s two sequences of 3 bytes; HH two 16-bit unsigned integers
fmt = '<3s3sHH'
with open('leo.gif', 'rb') as fp:
    img = memoryview(fp.read())   # a memoryview over the whole file contents
header = img[:10]            # another memoryview over the first 10 bytes; nothing is copied
bytes(header), header        # bytes(header) copies those 10 bytes, just for display
struct.unpack(fmt, header)   # a (type, version, width, height) tuple, e.g. (b'GIF', b'89a', ...)
del header                   # the memoryviews must be deleted before ...
del img                      # ... the memory they reference can be released
# the same str encoded with three different codecs yields very different byte sequences
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')
Problems you might encounter
Python raises UnicodeEncodeError when converting text to bytes fails, and UnicodeDecodeError when reading bytes into text fails. We will look at how to handle both. Python can also raise a SyntaxError when loading a module whose source encoding is unexpected.
UnicodeEncodeError
Most non-UTF codecs handle only a small subset of the Unicode characters. UnicodeEncodeError is raised when converting text to bytes if a character is not defined in the target encoding.
city = 'São Paulo'
city.encode('utf_8')    # b'S\xc3\xa3o Paulo'
city.encode('utf_16')   # works: the UTF encodings can handle any str
city.encode('cp437')    # raises UnicodeEncodeError: cp437 cannot encode 'ã'
The errors argument, however, lets you choose how characters that cannot be encoded are handled.
city.encode('cp437', errors='ignore')             # b'So Paulo': unencodable characters are skipped
city.encode('cp437', errors='replace')            # b'S?o Paulo': substituted with '?'
city.encode('cp437', errors='xmlcharrefreplace')  # b'S&#227;o Paulo': replaced with an XML character reference
UnicodeDecodeError
Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8, so if you assume one of these encodings while converting a binary sequence to text, you will get a UnicodeDecodeError if unexpected bytes are encountered.
However, many legacy 8-bit encodings like cp1252 are able to decode any stream of bytes without reporting errors, so no exception is raised even though the output may be junk.
octets = b'Montr\xe9al'
octets.decode('cp1252')     # 'Montréal': the intended decoding
octets.decode('iso8859_7')  # 'Montrιal': wrong codec, but no error is raised
octets.decode('utf_8')      # raises UnicodeDecodeError: 0xe9 is not valid UTF-8 here
octets.decode('utf_8', errors='replace')  # 'Montr�al': U+FFFD is the official REPLACEMENT CHARACTER
SyntaxError when loading Modules with Unexpected Encodings
UTF-8 is the default source encoding for Python 3. If your source file contains bytes in some other encoding and there is no encoding declaration, loading the module raises a SyntaxError.
You can add a coding comment at the top of the file to declare the encoding the module actually uses:
# coding: cp1252
print('Olá, Mundo!')
Whether to use non-ASCII identifiers in source code is ultimately a personal choice. Give preference to the people who are most likely to read and write the codebase most often, and optimize for them.
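For what it's worth, Python 3 does accept non-ASCII identifiers, so the choice is a real one; the names below are purely illustrative:
# Python 3 allows non-ASCII identifiers in source code
π = 3.14159
raio = 2
área = π * raio ** 2   # 'π' and 'área' are perfectly legal names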
How to Discover the Encoding of a Byte Sequence
In short, you can't: you have to be told. That said, tools like Chardet, the Universal Character Encoding Detector, can guess the encoding by applying heuristics and statistics to the byte stream.
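For example, here is a rough sketch using the third-party chardet package (assuming it is installed, e.g. with pip install chardet); the sample text is made up for illustration:
# guessing an encoding with chardet
import chardet

sample = 'El Niño é um fenómeno climático'.encode('cp1252')
guess = chardet.detect(sample)   # returns a dict with 'encoding' and 'confidence' keys
print(guess['encoding'], guess['confidence'])
# short samples may be misidentified; detection is statistical, not exact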
BOM: A Useful Gremlin
You may notice a couple of extra bytes at the beginning of a UTF-16 encoded sequence. The bytes b'\xff\xfe' are a byte-order mark (BOM), which tells the decoder whether the bytes are in little-endian or big-endian order.
u16 = 'El Niño'.encode('utf_16')
u16   # b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00': starts with the BOM b'\xff\xfe'
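If you use a codec with an explicit byte order, no BOM is prepended; a quick sketch:
# explicit-endian UTF-16 variants do not emit a BOM
'El Niño'.encode('utf_16_le')   # b'E\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
'El Niño'.encode('utf_16_be')   # b'\x00E\x00l\x00 \x00N\x00i\x00\xf1\x00o'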