class: slide-title
Software Design by Example
Binary Data
chapter
--- ## Why Binary? - Operations are much faster - Write addition using textual digits as an exercise - Bits take less space - `"10239472"` is 8 bytes, but `10239472` is just 4 - How would we represent images, audio, or video as characters? ```txt ,_, (O,O) ( ) -"-"- ``` --- ## Integers - If all we have is 1's and 0's, use base-2 - 9
10
is (1×8)+(0×4)+(0×2)+(1×1) = 1001
2
- Can write numbers in binary using `0b` prefix ```py print(0b101101) # (1 * 32) + (1 * 8) + (1 * 4) + 1 ``` ``` 45 ``` --- ## Hexadecimal - More common to use
hexadecimal
(base 16) - Digits are 0123456789ABCDEF - Each digit represents 4 bits (half a byte) ```py print(0x2D) # (2 * 16) + 13 ``` ``` 45 ``` --- ## Negative Numbers - Could use
sign and magnitude
- `0100` is 4 - `1100` is -4 - But: - Gives us two zeroes (one positive, one negative) - Makes the hardware to do arithmetic more complicated --- ## Two's Complement -
Two's complement
wraps around like an odometer
Base 10
Base 2
Base 10
Base 2
3
011
-1
111
2
010
-2
110
1
001
-3
101
0
000
-4
100
--- ## Two's Complement - Can still determine sign by looking at the first bit - But two's complement is asymmetric - No positive number to match the largest negative number --- ## Bitwise Operations - Operate on corresponding bits in representation - `&` (and) is 1 if both bits are 1, 0 otherwise - `0b1100 & 0b1010 == 0b1000` - `12 & 10 == 8` - `|` (or) is 1 if *either* bit is 1, 0 otherwise - `0b1100 | 0b1010 == 0b1110` - `12 | 10 == 14` --- ## Bitwise Operations
Name
Expression
Bitwise
Result
and
12 & 6
1100 & 0110
4 (0100)
or
12 | 6
1100 | 0110
14 (1110)
xor
12 ^ 6
1100 ^ 0110
10 (1010)
not
~ 6
~ 0110
-7(1001)
left shift
12 << 2
1100 << 2
48 (110000)
right shift
12 >> 2
1100 >> 2
3 (0011)
--- class: aside ## This Is Not Arithmetic - Take a closer look at `~6` - We are using two's complement, so 6 is `0b000…00110` - Its bitwise negation is `0b111…11001`, which is -7 - Shifting up and down is *almost* like multiplying or dividing by 2 - But what if the top bit changes? - If we only have 4 bits, `0b1111 >> 1` is `0b0111`, so -1/2 is 7 --- ## Storing Numbers - C and Fortran store numbers as numbers - Python used **boxed values** - Reference count - Type code - Value
--- ## Storing Arrays - The differences are even larger for arrays and lists
--- ## Packing and Unpacking - Operations on unboxed (raw) values are much faster - Most numerical libraries written in C or Fortran - Then wrapped in Python or R - Need to: - Get data from raw bytes into Python structures - Copy data from Python structures into packed bytes - Also do this for efficient storage of large data --- ## The `struct` Module ```py import struct fmt = "ii" # two 32-bit integers x = 31 y = 65 binary = struct.pack(fmt, x, y) print("binary representation:", repr(binary)) normal = struct.unpack(fmt, binary) print("back to normal:", normal) ``` ``` binary representation: b'\x1f\x00\x00\x00A\x00\x00\x00' back to normal: (31, 65) ``` --- class: aside ## Hexadecimal Again - Not all bytes correspond to common characters - So Python uses two-digit hex representation `\xPQ` - `\x00` is a
null byte
(value 0) - Easy to miss the actual `A` between one `\x00` and the next --- ## Packing With Counts ```py from struct import pack print(pack("3i", 1, 2, 3)) print(pack("5s", bytes("hello", "utf-8"))) print(pack("5s", bytes("a longer string", "utf-8"))) ``` ``` b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' b'hello' b'a lon' ``` - Only packs as much as we tell it to --- ## Dynamic Formats - Construct format dynamically ```py text = "hello" print(f"{len(text)}s") ``` ``` 5s ``` --- ## Variable-Length Packing - Pack strings as a fixed-size count and that many bytes - Use `bytes` to convert character string to bytes ```py def pack_string(as_string): as_bytes = bytes(as_string, "utf-8") header = pack("i", len(as_bytes)) format = f"{len(as_bytes)}s" body = pack(format, as_bytes) return header + body ``` --- ## Variable-Length Packing ```py if __name__ == "__main__": result = pack_string("hello!") print(repr(result)) ``` ``` b'\x06\x00\x00\x00hello!' ``` - First four bytes are the 32-bit integer representation of 6 - Next six bytes are our characters --- ## Unpacking ```py def unpack_string(buffer): header, body = buffer[:4], buffer[4:] length = unpack("i", header)[0] format = f"{length}s" result = unpack(format, body)[0] return str(result, "utf-8") buffer = pack_string("hello!") result = unpack_string(buffer) print(result) ``` ``` hello! ``` --- ## Bytes and Text - ASCII originally defined 128 7-bit characters - 0–31 were
control codes
- Since bytes have 8 bits, programmers used the values 128–255 however they wanted - ANSI standard defined (for example) 231
10
to be "ç" - But what about Turkish, Devanagari, kanji, hieroglyphics, …? - Two bytes wouldn't be enough - Four bytes per character would quadruple storage requirements - And would mostly not be needed (by American businesses) --- ## Unicode - Define a
code point
for every character - U+0065 for an upper-case Latin "A" - U+2605 for a black star ★ - Define several
character encodings
- UTF-32 uses 32 bits for every character - Most popular is
UTF-8
- Code points 0–127 are stored in a single byte with a leading 0 - If the top bit is 1, the number of 1's tells UTF-8 how many bytes there are in the character --- ## Unicode - If the first byte is `0b11101101`: - The leading 1 means "multibyte" - The next two bits mean "this is a three-byte character" - The first 0 separates the header from the start of the character - The final `1101` is the first four bits of the character - Every
continuation byte
starts with `10` - So we can tell if a byte is in the middle of a character --- ## Characters as Bytes ```py result = pack_string("こんにちは!") print(repr(result)) ``` ``` b'\x10\x00\x00\x00\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\x \ e3\x81\xaf!' ``` --- class: aside ## Binary Mode - `open(filename, "r")` converts bytes to characters - And converts Windows line endings `\r\n` to Unix `\n` - Use `open(filename, "rb")` to read in
binary mode
--- class: summary ## Summary