Lab 6

Chapter 7 - String Manipulation

Synopsis
Chapter 7: String Manipulation covers a variety of functions that are useful for dealing with strings. But instead of providing a list and description of each function, we will explore string manipulation methods in the context of a coding problem. Our goal is to simulate the Infinite Monkey Theorem. We will use many of the key function described in Chapter 7: String Manipulation and many of the tools that we learned in past lectures.


Daily Quiz

Quiz 11


Lab 6: Finite Monkey Simulation

The Infinite Monkey Theorem states that a monkey typing on a typewriter for an infinite amount of time is almost surely (probability 1) going to type a given text. For example, said monkey will almost surely produce the entire Emory Student Handbook. This might sound like a silly scenario, but the underlying mathematics contain some truly powerful ideas. The theorem more generally states that, given an infinite sequence of characters chosen from a random distribution, you will almost surely find any and all possible finite subsequences. This theorem is particularly useful in studies of cryptography and information theory.

In this lab, we will write a simulation of the Infinite Monkey Theorem. As you might expect, we cannot write a program to simulate a monkey typing for an infinite amount of time, but we can simulate a monkey typing for a very long time. Let's begin by defining our goal.

Goal: Write a MATLAB script that will simulate a monkey typing on a typewriter for x number of years, and displays all lines that contain a specified phrase. The line should be displayed in a way that highlights the phrase of interest.

We can begin by putting together an algorithm for such as simulation. We will then attack each step of the algorithm until we have a working script.

Algorithm

  1. Define any constants and parameters required for the simulation (i.e. average number of characters per word, number of years, phrase of interest, typing speed, etc.)
  2. Calculate the total number of lines written by the monkey in the given number of years
  3. Loop over each line and perform the following:
    1. Generate a random sequence of characters
    2. Search for the phrase of interest
    3. If the phrase is found, display the line so that the phrase is visible

Now that we know what is required, let's start coding.

1. Define any constants and parameters required for the simulation

Before we can do any number crunch, we first need to define some variables. We will begin with our simulation parameters. From the Algorithm, we know that we need to define the phrase of interest and the number of years, but we also need to define how fast the monkey will type. For this example, we will be looking for the word code over a modest 10 years. Let's assume that our monkey can type at a respectable 150 words per minute (wpm).

1
2
3
4
%% Define Simulation Parameters
phrase = ' code ';
numYears = 10;
wpm = 150;

Now we need to define some constants to help us calculate the total number of lines that the monkey will produce in 10 years. These constants are not official MATLAB constants like pi, but they are variables that will remain unchanged in our code. So, what constants do we need? Well, we need to convert the wpm and numYears to the total number of lines typed. Therefore, we will need to know how many minutes are in a year, the average number of characters per word, and the number of characters per line. We will see why each constant is necessary in the next steps. For now, we will define the constants as variables with uppercase letters.

1
2
3
4
%% Define Constants
MPY = 525600;   %minutes per year
CPW = 5;        %characters per word
CPL = 72;       %characters per line

2. Calculate the total number of lines written by the monkey in the given number of years

Now that we have our constants and simulation parameters, we can determine how many lines the monkey will type in the time numYears. Let's start by calculating the total number of words written. We can do this by converting words per minute to words per year with the constant MPY. Then we can multiply the words per year rate by the number of years to determine the number of words written over the 10 year period. With this information, we can convert the total number of words to the total number of characters by multiplying with the constant CPW. Lastly, we can determine the total number of lines by dividing the total number of characters by the constant CPL. Since the division may produce a non-integer, we will have to round the final value.

1
2
3
4
5
6
7
8
9
10
%% Calculate Lines

%calculate total number of words
totalWords = MPY*wpm*numYears;

%calculate total number of characters
totalChar = totalWords*CPW;

%calculate total number of lines
totalLines = round(totalChar/CPL);

3. Loop over each line

Here is where the fun starts. We will simulate the monkey typing random lines until we have reached the last line. Along the way, each line will be searched for the phrase of interest. Before we get to the loop, we need to understand the tasks performed within the loop.

3.1. Generate a random sequence of characters

To understand how we can generate a random sequence of characters, we should recognize how MATLAB treats strings. As we've seen in past lectures about character encoding, MATLAB handles strings as vectors. This means that the vector operations that we learned previously apply to strings. This includes vector indexing, vector concatenation, and logical operations. For now, we simply want to create a random sequence of characters of length CPL. We can do this by using the randi function to produce a random vector of integers. If we only produce integers between the values of 97 and 122, we can use character encoding to convert the random integers to characters. But what about spaces? A blank space has the character encoding of 32. To insert blank spaces, we can simply ask randi to produce integers from 97 to 123, and convert any value of 123 to 32. And just like that, we have our randomly generated line.

1
2
3
4
5
6
7
8
%generate line of random characters
line = randi([97 123], 1, CPL);

%change 123 to 32 (i.e. changing character to blank space)
line(line == 123) = 32;

%convert line of digits to string
line = char(line);

3.2. Search for the phrase of interest

Now that we have our randomly generated line, we need to check it for our phrase of interest. MATLAB gives us a convenient way to do this with the strfind function. strfind will search through our string and return the index of any substring that matches our phrase. If there are no substrings that match our phrase, strfind will return an empty vector. Therefore, the number of times the phrase appears in the string is given by the length of the index vector.

1
2
3
%search for phrase in line
idx = strfind(line, phrase);
instances = length(idx);

3.3. If the phrase is found, display the line so that the phrase is visible

Lastly, we need to check if the phrase was found, and print out the line in a way that makes the phrase visible. If the phrase was found, then instances must be greater than zero. In this case, we can print a string where the phrase is capitalized. To replace a substring of a string, we can use the strrep function. strrep will replace all instances of a given substring with a new substring. The new substring will be an uppercase version produced by the upper function.

1
2
3
4
5
6
7
8
9
10
11
%print message if phrase is found
if instances > 0
    line = strrep(line, phrase, upper(phrase));

    %display message
    if instances > 1
        fprintf('Found %d instances:\n%s\n', instances, line);
    else
        fprintf('Found 1 instance:\n%s\n', line);
    end %display
end %instance

As you can see, the phrase substring is replace by an uppercase version of phrase. The display message consists of two different fprintf functions that indicate the number of instances found and the line.

Putting it all together

We finally have all the parts we need. We can now put the entire script together to build our Finite Monkey simulator.

lab6key_20160316.m


Final Words

Phew, that was a lot of coding. This problem required us to use many of the concepts covered in past lectures as well as many new string manipulation methods. I hope that this exercise helped show you how different functions and control structures can be used in practical programming scenarios. In the coming lectures, we will learn about more advanced data structures to help us tackle even larger problems.