COMPSCI 225:
Discrete Structures in Mathematics and Computer Science

Codes

Prefix codes and optimal binary trees

Codewords


To represent symbols, computers use strings of 0’s and 1’s called codewords. For example, in the ASCII (American Standard Code for Information Interchange) code, the letter A is represented by the codeword 01000001, B by 01000010, and C by 01000011. In this system each symbol is represented by some string of eight bits, where a bit is either a 0 or a 1. To translate a long string of 0’s and 1’s into its ASCII symbols we use the following procedure: Find the ASCII symbol represented by the first 8 bits, the ASCII symbol represented by the second 8 bits, etc. For example, 010000110100000101000010 is decoded as CAB.
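The fixed-length decoding procedure above can be sketched in a few lines of Python (the helper name `decode_ascii` is ours, purely for illustration):

```python
def decode_ascii(bits: str) -> str:
    """Decode a string of 0's and 1's by reading 8 bits at a time."""
    assert len(bits) % 8 == 0, "ASCII codewords are exactly 8 bits each"
    # Convert each 8-bit block from binary to its character code.
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

print(decode_ascii("010000110100000101000010"))  # the example string above: CAB
```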

For many purposes this kind of representation works well. However, there are situations, as in large-volume storage, where this is not an efficient method. In a fixed length representation, such as ASCII, every symbol is represented by a codeword of the same length. A more efficient approach is to use codewords of variable lengths, where the symbols used most often have shorter codewords than the symbols used less frequently. For example, in normal English usage the letters E, T, A, and O are used much more frequently than the letters Q, J, X, and Z.


Example 8.1. The most frequently used letters in English are, in order: E, T, A, O, I, N, S, H, R, D, …. The simplest way to assign the shortest codewords to the most frequently used symbols is by the following table.
$$\begin{array}{c|c|c|c|c|c|c|c|c|c} E & T & A & O & I & N & S & H & R & D\\ \hline 0 & 1 & 00 & 01 & 10 & 11 & 000 & 001 & 010 & 011 \end{array}$$

In this way we have assigned the shortest possible codewords to the most frequently used letters and longer codewords to the other letters. This appears to be a more efficient approach than assigning all these letters a codeword of the same fixed length, which would have to be three or more.

But how can we decode a string of 0’s and 1’s? For example, how should the string 0110110 be decoded? Should we start by looking at only the first digit, or the first two, or the first three? Depending upon the number of digits used, the first letter could be E, O, D or something else. We see that in order to use variable length codewords we need to select representations that permit unambiguous decoding.

Prefix property and prefix codes


A way to do this is to construct codewords so that no codeword is the first part of any other codeword. Such a set of codewords is said to have the prefix property. This property is not enjoyed by the above choice of codewords, since the codeword for T is also the first part of the codeword for N. On the other hand, the set of codewords $S = \{000, 001, 01, 10, 11\}$ has the prefix property, since no codeword appears as the first part of another codeword. To decode a string of 0’s and 1’s into codewords having the prefix property, read one digit at a time until the digits read so far form a codeword, then repeat the process starting with the next digit, and continue until the decoding is done.
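The decoding procedure just described can be sketched as follows (a minimal illustration; the name `decode_prefix` is our own):

```python
def decode_prefix(bits, codewords):
    """Read digits until the buffer is a codeword; emit it and start over.

    Correct whenever the set of codewords has the prefix property, since then
    no codeword is the first part of another."""
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in codewords:
            out.append(buffer)
            buffer = ""
    assert buffer == "", "leftover digits: not a concatenation of codewords"
    return out
```

For instance, `decode_prefix("001100100011", {"000", "001", "01", "10", "11"})` carries out Example 8.2.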

Example 8.2. Using the set of codewords $S$ above, decode the string 001100100011.

An efficient method of representation should use codewords such that:

  1. the codewords have the prefix property; and

  2. the symbols used frequently have shorter codewords than those used less often.

Using a binary tree to construct prefix codes


A full binary tree can be used to construct a set of codewords with the prefix property by assigning 0 to each edge from a parent to its left child and 1 to each edge from a parent to its right child. Following the unique directed path from the root to a terminal vertex will give a string of 0’s and 1’s. The set of all strings formed in this way will be a set of codewords with the prefix property, because corresponding to any codeword we can find the unique directed path by working down from the root of the binary tree, going left or right accordingly as each digit is 0 or 1. By definition we finish at a terminal vertex, and so this codeword cannot be the first part of another codeword.
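One way to see this concretely is to generate the codewords by traversal. In this sketch a full binary tree is encoded as nested pairs `(left, right)` with `None` for a terminal vertex (an encoding we chose purely for illustration):

```python
def codewords(tree, prefix=""):
    """Collect the 0/1 strings along all root-to-terminal-vertex paths."""
    if tree is None:                 # terminal vertex: the path so far is a codeword
        return [prefix]
    left, right = tree
    # 0 labels the edge to the left child, 1 the edge to the right child.
    return codewords(left, prefix + "0") + codewords(right, prefix + "1")

# This tree reproduces the prefix code S = {000, 001, 01, 10, 11} from above.
tree = (((None, None), None), (None, None))
print(codewords(tree))  # ['000', '001', '01', '10', '11']
```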

Example 8.3. For the binary tree in the figure below, assign 0’s and 1’s to its edges as implied above. What codewords are produced?

By using a binary tree we have found a way to produce codewords that have the prefix property. It remains to find a method for assigning shorter codewords to the more frequently used symbols.


Example 8.4. For the codewords produced in the previous example, which should be used for the more frequently used symbols?

Notice that these codewords correspond to the terminal vertices that are closest to the root. Thus, to obtain an efficient method for representing symbols by variable length codewords, we can use a binary tree and assign the most frequently used symbols to the terminal vertices that are closest to the root.

Weighted trees


Suppose $w_1, w_2, \ldots, w_k$ are nonnegative real numbers. A binary tree for the weights $w_1, w_2, \ldots, w_k$ is a binary tree with $k$ terminal vertices labeled $w_1, w_2, \ldots, w_k$. Such a tree has total weight $d_1 w_1 + d_2 w_2 + \cdots + d_k w_k$, where $d_i$ is the length of the directed path from the root to the vertex labeled $w_i$ ($i = 1, \ldots, k$).

Example 8.5. What are the total weights of the binary trees shown below?

For the coding problem we want to find a binary tree of smallest possible total weight in which the frequencies of the symbols to be encoded are the weights. A binary tree for the weights $w_1, w_2, \ldots, w_k$ is called an optimal binary tree for those weights when its total weight is as small as possible.

The following algorithm, due to David A. Huffman, produces an optimal binary tree for the weights $w_1, w_2, \ldots, w_k$. The idea is to create a binary tree by using the two smallest weights, say $w_1$ and $w_2$, replace $w_1$ and $w_2$ by $w_1 + w_2$ in the list of weights, and then repeat the process.

Huffman's optimal binary tree algorithm


For nonnegative real numbers $w_1, w_2, \ldots, w_k$, this algorithm constructs an optimal binary tree for the weights $w_1, w_2, \ldots, w_k$.

Step 1 (select smallest weights). If there are two or more weights in the list of weights, select the two smallest weights, say $V$ and $W$. (Ties can be broken arbitrarily.) Otherwise, we are done.

Step 2 (make binary tree). Construct a binary tree with the root assigned the label $V + W$, its left child assigned the label $V$, and its right child assigned the label $W$. Include in this construction any binary trees that have the labels $V$ or $W$ assigned to their roots. Replace $V$ and $W$ in the list of weights by $V + W$. Go to step 1.
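Steps 1 and 2 can be sketched with a priority queue. A minimal Python sketch (the function names and the tuple encoding of trees are our own choices, not part of the algorithm's statement):

```python
import heapq

def huffman_tree(weights):
    """Huffman's algorithm: repeatedly merge the two smallest weights.

    Heap entries are (weight, counter, tree); the counter breaks ties so that
    heapq never tries to compare trees. A leaf is its weight; an internal
    node is a pair (left_subtree, right_subtree)."""
    heap = [(w, i, w) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    counter = len(weights)
    while len(heap) > 1:
        vw, _, v = heapq.heappop(heap)   # smallest weight V
        ww, _, w = heapq.heappop(heap)   # next smallest weight W
        heapq.heappush(heap, (vw + ww, counter, (v, w)))  # root labeled V + W
        counter += 1
    return heap[0][2]

def total_weight(tree, depth=0):
    """Sum of depth * weight over the terminal vertices."""
    if not isinstance(tree, tuple):
        return depth * tree
    left, right = tree
    return total_weight(left, depth + 1) + total_weight(right, depth + 1)
```

Applying `total_weight` to `huffman_tree([2, 3, 4, 7, 8])` computes the smallest possible total weight for those weights.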


Example 8.6. Construct the optimal binary tree for the weights 2, 3, 4, 7, 8. What is the total weight of this tree?

In order to find codewords with the prefix property such that the most frequently used symbols are assigned the shortest codewords, we construct an optimal binary tree with the stated frequencies of the symbols as its weights. Then by assigning 0’s and 1’s to the edges of this tree as described previously, codewords can be efficiently assigned to the various symbols.


Example 8.7. Suppose the characters E, T, A, Q, and Z have expected usage rates of 32, 28, 20, 4, and 1, respectively. Construct an efficient assignment of prefix codes to these characters.

Example 8.8. Given a text, let $w_i$ be the number of occurrences of the $i$-th symbol. Suppose a binary tree for those weights has total weight $W = \sum_i d_i w_i$. If we encode our text using the corresponding prefix code, then the resulting file has length $W$. (So it is shortest when we minimize $W$.)

Introduction to error-correcting codes

When a radio message is sent through space, interference with the transmission of a single letter or character may result in the receipt of an incorrect communication.

If the message NEXT STOP JOME is received, it may be clear that this was not the intended message. Perhaps the actual message was NEXT STOP ROME or NEXT STOP NOME, or something else.

It is often essential that a message be received as intended. For this reason, such messages are rarely sent in English (or French or German or any other common language), since transmission error often results in a misinterpretation.

Alphabets and words


To ensure the accuracy of a message, it is useful to encode it, that is, to use a new language in which it is less likely that two words will be confused with each other.

Definition 8.1 For an integer $b \geq 2$, the set

$A = \{0, 1, \ldots, b-1\}$

of $b$ symbols is an alphabet. An ordered $n$-tuple of symbols in $A$ is called a word of length $n$, or simply an $n$-word (over $A$).

The Hamming distance


Definition 8.2 For two $n$-words $x$ and $y$, the distance $d(x,y)$ between $x$ and $y$, often called the Hamming distance, is the number of coordinates at which $x$ and $y$ differ.


Example 8.9 Let $x = 0101101$, $y = 1011110$ and $z = 0111101$. Then $d(x,y) = 5$, $d(x,z) = 1$, $d(y,z) = 4$.
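The distance computation is a one-liner; the following sketch (the function name `hamming` is ours) checks the three values above:

```python
def hamming(x: str, y: str) -> int:
    """The number of coordinates at which two n-words differ."""
    assert len(x) == len(y), "Hamming distance is defined only between n-words"
    return sum(a != b for a, b in zip(x, y))

x, y, z = "0101101", "1011110", "0111101"
print(hamming(x, y), hamming(x, z), hamming(y, z))  # 5 1 4
```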


Theorem 8.1 The Hamming distance has the following properties. For all $n$-words $u$, $v$ and $w$:

1. $d(u,v) \geq 0$, and $d(u,v) = 0$ if and only if $u = v$;

2. $d(u,v) = d(v,u)$;

3. $d(u,v) \leq d(u,w) + d(w,v)$ (the triangle inequality).

Codes


Definition 8.3 Let $n$ be a positive integer and $A$ an alphabet of $b$ elements. A fixed-length code $C$ is a collection of words of length $n$ over $A$. The fixed-length code $C$ is also called an $(n,b)$-code. If $b = 2$, then $C$ is a binary code.

Example 8.10. Give an example of a $(6,2)$-code.

Distance of a code


Definition 8.4 The distance $d(C)$ of $C$ is

$d(C) = \min\{d(x,y)\},$

where the minimum is taken over all pairs $x, y$ of distinct code words.
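Definition 8.4 translates directly into a short computation over all pairs of distinct code words (the function name `code_distance` is ours):

```python
from itertools import combinations

def code_distance(C):
    """d(C): the minimum Hamming distance over all pairs of distinct code words."""
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))
    return min(hamming(x, y) for x, y in combinations(C, 2))
```

For instance, `code_distance(["000000", "001101", "010011", "100101"])` computes the answer to the next example.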


Example 8.11. What is the distance of the following code?
$C = \{000000, 001101, 010011, 100101\}.$

The challenge in coding theory is to construct a code that

  1. uses as simple an alphabet as possible to facilitate transmission of messages,

  2. uses a sufficiently large $n$ so that many code words are available and so that there is a great enough distance between every two code words, and

  3. uses a sufficiently small $n$ to simplify transmission of messages.


Example 8.12. Suppose a code word of the code in the previous example is transmitted, but that one symbol gets changed. If the received word is
$110011$

then what word was sent?


$t$-error correcting codes

We now turn to the problem of constructing codes. Ideally, a code should be constructed so that even if a small number of errors do occur due to interference, the message can still be understood.

Definition 8.5 A code is $t$-error correcting if, whenever a code word $x$ is transmitted, then $x$ is the unique code word closest to the word received, even when up to $t$ errors have occurred in the transmission of $x$.


Example 8.13. Let $C = \{00000, 11111\}$. Then $C$ is a 2-error correcting code. (This is not an efficient code!)

We assume that if there is a unique code word at minimum distance from the received word, then the received word was intended to be the code word.
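This decoding rule can be sketched as follows (the function name `nearest_codeword` is ours); applied to the received word of Example 8.12, it yields that exercise's answer:

```python
def nearest_codeword(received, C):
    """Return the unique code word at minimum Hamming distance, or None on a tie."""
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))
    ranked = sorted(C, key=lambda c: hamming(received, c))
    if len(ranked) > 1 and hamming(received, ranked[0]) == hamming(received, ranked[1]):
        return None   # two code words equally close: decoding is ambiguous
    return ranked[0]
```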


What can we say about $d(C)$?

Example 8.14. Suppose that an $(n,b)$-code is 1-error correcting. Is it possible for two code words to be at distance 1? Is it possible for two code words to be at distance 2?

Theorem 8.2 A code $C$ is $t$-error correcting if and only if $d(C) \geq 2t + 1$.

Proof: See lecture.

Perfect codes


If an $(n,b)$-code $C$ is $t$-error correcting, there is a limit on how many code words $C$ can contain.

Theorem 8.3 If $C$ is a $t$-error correcting $(n,b)$-code, then

$|C| \leq \left\lfloor \frac{b^n}{s} \right\rfloor,$

where

$s = \binom{n}{0} + \binom{n}{1}(b-1) + \binom{n}{2}(b-1)^2 + \cdots + \binom{n}{t}(b-1)^t.$

Definition 8.6 A $t$-error correcting $(n,b)$-code $C$ is perfect if $|C| = b^n/s$, where $s$ is given by the above theorem. (That is, for every $n$-word $w$ over the alphabet $\{0,1,\ldots,b-1\}$, there is exactly one code word in $C$ within a distance of at most $t$ from $w$.)
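The bound is straightforward to evaluate. A sketch (function names ours) that checks Definition 8.6 against the 2-error correcting code $C = \{00000, 11111\}$ of Example 8.13:

```python
from math import comb

def sphere_size(n, b, t):
    """s: the number of n-words within Hamming distance t of a fixed n-word."""
    return sum(comb(n, i) * (b - 1) ** i for i in range(t + 1))

def is_perfect(num_codewords, n, b, t):
    """A t-error correcting (n,b)-code is perfect exactly when |C| * s = b^n."""
    return num_codewords * sphere_size(n, b, t) == b ** n

# For C = {00000, 11111}: n = 5, b = 2, t = 2 gives s = 1 + 5 + 10 = 16 and 2^5/16 = 2.
print(is_perfect(2, n=5, b=2, t=2))  # True
```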

A graphical representation of codes


Consider the graph $G$ on $\{0,1,\ldots,b-1\}^n$ where there is an edge between vertices $u = (u_1, u_2, \ldots, u_n)$ and $v = (v_1, v_2, \ldots, v_n)$ if and only if $u_i \neq v_i$ for exactly one $i$ ($1 \leq i \leq n$). A code can be considered as a subset of the vertices of $G$.

Example 8.15. How many vertices does $G$ have? Describe the graph when $b = 2$.

The Hamming distance between two code words is the same as the length of the shortest path in $G$ between the two corresponding vertices. For a good code (i.e., one with a large distance), the code words are placed at suitably selected vertices of $G$ so that the minimum distance among all pairs of code words is as large as possible.


RSA public-key cryptosystem

Consider the following problems:

  1. PRIME: Is $n$ a prime number?

  2. COMPOSITE: Is $n$ a composite number?

  3. FACTORIZE: Find $a, b < n$, if possible, such that $n = ab$. Otherwise report that none exist (and that, for $n \geq 2$, $n$ is prime).

We have listed the problems in this form in order to pinpoint the difference between COMPOSITE and FACTORIZE, which at first sight look as if they might be the same. Actually it is the first two, PRIME and COMPOSITE, that are identical in their complexity.

Although FACTORIZE may look the same as COMPOSITE, it is in reality asking for a lot more, namely an explicit factorization. If we can find $a$ and $b$ less than $n$ such that $n = ab$, then we automatically know that $n$ is composite, but the point is that there are circumstances under which we are able to assert that $n$ is composite without any factors being explicitly given.

Fermat's little theorem


A standard result in elementary number theory, known as ‘Fermat’s little theorem’, illustrates how this can happen.

Theorem 8.4 If $p$ is a prime number, then for any integer $a$, $a^p \equiv a \pmod{p}$, and if $p$ does not divide $a$, then $a^{p-1} \equiv 1 \pmod{p}$.


Example 8.16 Suppose that, for some $n$ and $a$, we find that $a^n \not\equiv a \pmod{n}$. It immediately follows from Fermat’s little theorem that $n$ cannot be prime, so (if $n \geq 2$) it must be composite. But the method of showing this is rather indirect, and the computation that $a^{n-1} \not\equiv 1 \pmod{n}$ may not itself tell us anything about the factorization of $n$.
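Python's three-argument `pow` computes modular powers efficiently, so this indirect compositeness test is one line (the function name and the sample values are our own illustration):

```python
def fermat_witness(a: int, n: int) -> bool:
    """True if a proves n composite via Fermat's little theorem (a^n != a mod n)."""
    return pow(a, n, n) != a % n

print(fermat_witness(2, 15))  # True: 2^15 mod 15 = 8 != 2, so 15 is composite
print(fermat_witness(2, 13))  # False: no conclusion from this test (13 is prime)
```

Note that the test proves 15 composite without exhibiting the factorization $15 = 3 \cdot 5$.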


Example 8.17. Show that if $a^{n-1} \equiv 1 \pmod{n}$ then $a^n \equiv a \pmod{n}$. Find a counterexample to the converse of this statement.

The theorem and example show that it is sometimes possible to discover that a number $n$ is composite without actually finding any of its factors. A more elaborate version of these ideas leads to the Miller-Rabin primality test which, assuming the generalized Riemann hypothesis is true, determines whether an integer is prime or composite in polynomial time. A different method, which determines primality in polynomial time without any assumptions, is due to Agrawal, Kayal and Saxena.

Public-key cryptosystems


The apparent difficulty of factorizing large numbers, and the comparative ease of producing large primes, has given rise to one of the most popular ‘public-key cryptosystems’, called the RSA system after its inventors Ronald Rivest, Adi Shamir and Leonard Adleman.

The idea of a public-key cryptosystem is that there is a public key and a private key. Everyone can know the public key, but the private key must be kept secret. Using the public key, anyone can encrypt a message to get a ciphertext. Determining the message in a given ciphertext should be extremely hard without the private key. The important mathematical notion is that of a ‘trapdoor’ or one-way function: namely a function which is easy to compute, but whose inverse is hard to compute without some additional information.

Most importantly, public key cryptography can be used for digital signatures, which solve the authentication problem. For example, how do you know that your automatic software updates are not a virus? The software update comes with a digital signature (generated by the software vendor) which can be verified using a public key “hard-wired” into the operating system/application. Enabling secure automatic updates would have been a real headache without public key cryptography.

The private key


Suppose that $p$ and $q$ are chosen as two distinct ‘large’ prime numbers, and let $n = pq$. Once the encryption procedure has been carried out (that is, input has been stored in a coded, ‘secret’ form), the special knowledge needed for decryption (or recovering the input), called a private key, is a number $d$ between 1 and $n$ which is coprime with $(p-1)(q-1)$, that is, $d$ and $(p-1)(q-1)$ share no common factor.

The public key


As $\gcd(d, (p-1)(q-1)) = 1$, by Euclid’s algorithm there are integers $e$ and $b$ such that $ed + b(p-1)(q-1) = 1$. We assume that $0 \leq e < (p-1)(q-1)$. The public key, the information required for encryption, is then the pair $(e, n)$.

The encryption function


Now we describe how an integer $M$ in the range $0$ to $n-1$ (public-key cryptography is mainly used to transmit symmetric session keys, so there is no loss of generality in assuming the message is a moderate-sized integer) is encrypted using the public key. We write the encrypted version as $f(M)$, so that $f$ is the trapdoor function mentioned above. We let

$f(M) = M^e \bmod n.$

Suppose that we are also in possession of the private key, $d$. This may then be used to ‘decrypt’ the original value $M$ from its encrypted version $f(M)$ by using

$M = (f(M))^d \bmod n.$

To check the truth of this equation, since $0 \leq M \leq n-1$ it is enough to show that $M^{ed} \equiv M \pmod{n}$. Now $p$ and $q$ are distinct primes, and so this amounts to showing that $M^{ed} \equiv M \pmod{p}$ and $M^{ed} \equiv M \pmod{q}$.

Example 8.18. Show that $M^{ed} \equiv M \pmod{p}$ and $M^{ed} \equiv M \pmod{q}$.
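The round trip can be checked with deliberately tiny primes. A sketch, assuming the illustrative values $p = 61$, $q = 53$ (so $n = 3233$) and the key pair $e = 17$, $d = 2753$, which are our own choices and far too small for real use:

```python
p, q = 61, 53                 # distinct primes; insecurely small, for illustration only
n = p * q                     # n = 3233
phi = (p - 1) * (q - 1)       # (p-1)(q-1) = 3120
d = 2753                      # private key: coprime with 3120
e = 17                        # public exponent: e*d = 46801 = 15*3120 + 1

def encrypt(M):
    """f(M) = M^e mod n, computable by anyone holding the public key (e, n)."""
    return pow(M, e, n)

def decrypt(c):
    """M = f(M)^d mod n, computable only with the private key d."""
    return pow(c, d, n)

M = 1234
assert decrypt(encrypt(M)) == M   # the round trip recovers M for any 0 <= M < n
```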

The signature function


Now we describe how to make a digital signature using RSA. A document or file is passed through a “hash function” to obtain an integer $M$. A signature is generated (by the user who knows the private key) as

$S = M^d \bmod n.$

The receiver gets the document and $S$, as well as the public key $(n, e)$. They also use the hash function to compute $M$ and then verify the equation

$M = S^e \bmod n.$
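Signing is the same arithmetic with the roles of the keys swapped. A sketch reusing the toy keys $n = 3233$, $e = 17$, $d = 2753$ (our illustrative values), with a stand-in “hash” that is not cryptographically secure and exists only to make the example self-contained:

```python
n, e, d = 3233, 17, 2753      # toy RSA key pair; insecurely small

def toy_hash(document: bytes) -> int:
    """Stand-in for a real cryptographic hash: any fixed map from files to 0..n-1."""
    return sum(document) % n

def sign(document: bytes) -> int:
    return pow(toy_hash(document), d, n)        # S = M^d mod n (needs private key d)

def verify(document: bytes, S: int) -> bool:
    return toy_hash(document) == pow(S, e, n)   # check M = S^e mod n (public key only)

msg = b"NEXT STOP ROME"
S = sign(msg)
print(verify(msg, S), verify(b"NEXT STOP NOME", S))  # True False
```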

Efficiency and Security


For the method to be useful, two properties are required:

  1. (Efficiency) It must be possible to carry out the arithmetic involved in selecting $p$, $q$ and $d$, and in calculating the encrypted and decrypted numbers, reasonably quickly (by computer), given the relevant keys.

  2. (Security) It must be prohibitively costly to decrypt the encrypted version of the message without possession of the private key.


Selecting $p$, $q$ and $d$

First, one selects $p$ and $q$ of the desired size using the primality test mentioned above. Next, to select $d$, it is sufficient to take a prime number larger than $p$ and $q$ (and below $pq$), since all prime factors of $(p-1)(q-1)$ must be less than any number so chosen.

To perform the encryption and decryption, we have to see how powers $\bmod\ n$ can be rapidly computed.


Example 8.19. Compute $10^7 \bmod 3233$ using just 9 multiplications.

In general, if $e$ has $k$ binary digits, $l$ of them 1s, in its binary expansion, then it will require $k + l - 2$ multiplications $\bmod\ n$ to evaluate $M^e \bmod n$, which is $O(\log n)$.
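The $k + l - 2$ count comes from square-and-multiply: one squaring for each binary digit after the first, plus one extra multiplication for each 1 digit after the first. A sketch that also counts the multiplications (the function name is ours):

```python
def power_mod(M, e, n):
    """Compute M^e mod n by repeated squaring; also return the multiplication count."""
    bits = bin(e)[2:]                  # the k binary digits of e, most significant first
    result, mults = M % n, 0           # the leading 1 digit costs no multiplication
    for bit in bits[1:]:
        result = result * result % n   # square once per remaining digit: k - 1 of these
        mults += 1
        if bit == "1":                 # multiply by M once per remaining 1: l - 1 more
            result = result * M % n
            mults += 1
    return result, mults

# 77 in binary is 1001101, so k = 7 and l = 4: the formula predicts 9 multiplications.
print(power_mod(5, 77, 3233))
```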

Cost to decrypt without private key?


A justification of property 2 is more a matter of faith at present, but certainly no easy method of decryption is known which does not essentially involve factorization of $n$.

Given that the system stands or falls on the difficulty of factorizing large numbers, how large do $p$ and $q$ have to be to make it reasonably secure? At present there are algorithms known which will factorize numbers of over 200 digits using massive distributed computation, so allowing for possible improvements in this it seems safe for the moment to employ primes $p$ and $q$ of at least 1000 bits (i.e., 300 digits; so that $n$ will have about 600 digits).