In this lab you will enhance your previous WordCounter program to do some very simple computational text analysis, in the style of the digital humanities, on the book you chose. (Or, to put it more simply: word counting.)
As a first exercise in analyzing words in a text, let's calculate approximately how many words there are in a line in this book. We could start by counting how many words there are in the first line.
Read the class documentation for the WordReader class to find a method that can be used to divide up the string containing the first line into a group of words. What type does that method return? Declare a variable of that type, giving it a meaningful name. (Remember, the convention is that variable names start with lower-case letters.) Initialize your new variable to the return value of the method you found. The general template for calling a method and storing its return value in a variable is just:
<type> <variableName> = <object>.<correctMethod>(<parameters>);
There are two ways to get the number of words in the line: you could loop through the list of words, incrementing a counter as you go, or you could use the size method on a list. Using either approach, calculate the number of words in the first line and print that information right after saying what its length is. (You could do this on the same line or on the next line.) For example, you might have
The length of the first line is 76. There are 12 words in the line.
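The two counting approaches can be sketched as follows. This is only an illustration under stated assumptions: `splitIntoWords` is a hypothetical stand-in for the WordReader method you are asked to find, and it splits on whitespace, which may not match how WordReader decides what counts as a word (so your counts may differ slightly from this sketch).

```java
import java.util.ArrayList;
import java.util.List;

class WordsInLineSketch {
    // Hypothetical stand-in for the WordReader method: splits on whitespace.
    static List<String> splitIntoWords(String line) {
        List<String> words = new ArrayList<>();
        for (String piece : line.trim().split("\\s+")) {
            if (!piece.isEmpty()) {   // a blank line yields one empty piece; skip it
                words.add(piece);
            }
        }
        return words;
    }

    // Approach 1: loop through the list, incrementing a counter as you go.
    static int countByLooping(List<String> words) {
        int count = 0;
        for (String word : words) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "summons to Odessa in the case of the Trepoff murder, of his clearing up";
        List<String> wordsInLine = splitIntoWords(line);

        int countedByLoop = countByLooping(wordsInLine);
        int countedBySize = wordsInLine.size();   // Approach 2: ask the list for its size

        System.out.println("Length of this line is " + line.length()
                + ". There are " + countedBySize + " words in the line.");
        System.out.println("Both approaches agree: " + (countedByLoop == countedBySize));
    }
}
```

Both approaches give the same answer; calling size is shorter, while the loop version shows the counting pattern you will reuse later in this lab.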
In the previous Reading From a File lab, you skipped over a number of lines and then printed the next line along with its length. For example, if you printed the first line and then skipped 60 lines, the next line in the text would be the 62nd line. Add code to report on the number of words in that line, as you did for the first line. Remember that if you reuse a variable name, you do not need to re-declare its type; the general template for storing the return value of a method call in a variable that has already been declared is:
<variableName> = <object>.<correctMethod>(<parameters>);
We might wonder whether the number of words on a single line is representative of the book as a whole. Let's calculate the average number of words per line over the extended quote (20 lines or so) you printed in the previous lab.
Create a variable to store the total word count, similar to your variable storing total line lengths. Give it a name that indicates its purpose; for example, wordSum or something like that. Initialize the new variable to zero.
In the loop for your extended quote, determine how many words there are on each line and add that to your sum.
Tip: Don't forget that the ArrayList class has a size method. (See the ArrayList Methods quick reference for more information.)
After the loop, divide the sum by the number of lines in the extended quote, and print that out. To get a floating point number rather than an integer, divide by the floating point version of the number of lines rather than the integer value. For example, using wordSum/20.0 might give you
On average, SherlockHolmes.txt has about 12.275 words per line.
This is better than wordSum/20, which would do integer division, giving you
On average, SherlockHolmes.txt has about 12 words per line.
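To see the difference between the two divisions in isolation, here is a small sketch. The value 245 for wordSum and the method names are made up for illustration; they are not taken from any particular book.

```java
class AverageDivisionSketch {
    // int / int: Java discards the fractional part.
    static int truncatedAverage(int wordSum, int lineCount) {
        return wordSum / lineCount;
    }

    // int / double: the int is promoted, so the fraction is kept.
    static double exactAverage(int wordSum, int lineCount) {
        return wordSum / (double) lineCount;
    }

    public static void main(String[] args) {
        int wordSum = 245;   // illustrative total, not real data

        System.out.println(wordSum / 20);     // prints 12
        System.out.println(wordSum / 20.0);   // prints 12.25

        // The same two results, using a variable line count instead of a literal.
        System.out.println(truncatedAverage(wordSum, 20));
        System.out.println(exactAverage(wordSum, 20));
    }
}
```

When the divisor is a variable rather than a literal like 20, you cannot append .0 to it, which is why the second helper uses a cast to double instead.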
The number of words per line is really about how the book is formatted, and doesn't say anything about the author's word choices or the target audience for the book. It might be more meaningful to look at the average word length.
Add another variable, right after wordSum, to sum up the number of characters read. This time you can't update the variable by calling a single ArrayList method, as you did in the previous exercise. Instead, you will need a nested loop (a loop within your existing loop) to step through the words in the current line, incrementing the character count by the number of characters in each word. So, the outer loop is stepping through lines, and the inner loop is stepping through words in a line. Since you are stepping through all the elements in a collection, you can use either a traditional for loop or the simpler for-each style.
Tip: Don't forget that the String class has a length method. (See the String Methods quick reference for more information.)
After you print the average number of words per line, print the average length of a word, which is the total number of characters divided by the number of words. Is it "rounding" the average down again? (This isn't really rounding, since it always goes down. It is really truncating the fractional part.) As in the Reading From a File lab, you don't have a constant that you can just add .0 to, so modify one of your counters to be a double instead of an int. (Or modify both, if you want.)
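The nested loop and the double counter described above might look like the sketch below. It makes some assumptions: the lines are passed in as an array instead of being read from a file, and `splitIntoWords` is a hypothetical whitespace-based stand-in for the WordReader method, so it may not match how WordReader defines a word.

```java
import java.util.ArrayList;
import java.util.List;

class WordLengthSketch {
    // Hypothetical stand-in for the WordReader method: splits on whitespace.
    static List<String> splitIntoWords(String line) {
        List<String> words = new ArrayList<>();
        for (String piece : line.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                words.add(piece);
            }
        }
        return words;
    }

    static double averageWordLength(String[] lines) {
        int wordSum = 0;
        double charSum = 0.0;   // a double, so the division below is not truncated

        for (String line : lines) {                  // outer loop: lines
            List<String> words = splitIntoWords(line);
            wordSum += words.size();
            for (String word : words) {              // inner loop: words in this line
                charSum += word.length();
            }
        }
        return charSum / wordSum;   // double / int keeps the fractional part
    }

    public static void main(String[] args) {
        String[] quote = { "to be", "or not" };
        System.out.println("Words are about " + averageWordLength(quote)
                + " characters long.");   // 9 characters / 4 words: prints 2.25
    }
}
```

Note that this sketch does not guard against an input with no words at all; in that case the division would produce NaN instead of a meaningful average.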
Test your program.
At this point, your program prints several points of information about the book you chose. Your output might look something like:
Welcome to the Word Counting Mini-Lab.
The first line in SherlockHolmes.txt is:
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
The length of the first line is 76. There are 12 words in the first line.
Skipping 60 lines:
............................................................
Line 62: 'summons to Odessa in the case of the Trepoff murder, of his clearing up'
Length of this line is 71. There are 14 words in the line.
On average, SherlockHolmes.txt has about 12.275 words per line.
Words are about 3.9898167 characters long.
Rather than just calculating the average number of words per line and the average number of characters per word in your extended quote, calculate those two averages for the entire work, including the first line, the lines you "skipped", the extended quote, and the rest of the lines in the text.
Reading to the end of the file:
To read the rest of the lines until you reach the end of the file, you will need to use a slightly different type of loop. (You may have done this already in the optional Reading the Whole Work section of the previous lab.)
We know we have reached the end of the file when the getNextLine method returns null instead of a String containing the line. (The null keyword indicates a non-existent object.) We can write a loop like this:
    for ( nextLine = reader.getNextLine(); nextLine != null; nextLine = reader.getNextLine() ) {
        // Do something in the loop body.
        // For example, print nextLine, skip it, increment a counter, etc.
    }
The initialization in this loop reads in the first line we are going to process in this loop. We then check if it is null (end-of-file). If it isn't, we do whatever is in the loop body, then read in the next line in the step section before checking again to see if it is null. In this case, the loop body will be where you increment the line counter, break the line into words, and add to your word total.
(*There's another style for writing this loop at the end of this lab. If you want to be adventurous, you may use that style instead, but this one is probably easier to understand.)
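Here is a self-contained sketch of a read-until-null for loop that counts lines and words. Because the course's WordReader class is not available outside the lab project, the sketch uses a hypothetical `LineSource` class that models only the getNextLine behavior described in this handout (it returns null at the end of the input). Words are counted by splitting on whitespace, which is also an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class WholeTextSketch {
    // Hypothetical stand-in for WordReader: only getNextLine() is modeled.
    static class LineSource {
        private final BufferedReader in;

        LineSource(String text) {
            in = new BufferedReader(new StringReader(text));
        }

        // Returns the next line, or null when there are no more lines.
        String getNextLine() {
            try {
                return in.readLine();
            } catch (IOException e) {
                return null;   // treat a read failure as end of input in this sketch
            }
        }
    }

    // Returns { number of lines, number of words } for everything left in reader.
    static int[] countLinesAndWords(LineSource reader) {
        int lineCount = 0;
        int wordSum = 0;
        String nextLine;
        for (nextLine = reader.getNextLine(); nextLine != null;
                nextLine = reader.getNextLine()) {
            lineCount++;
            String trimmed = nextLine.trim();
            if (!trimmed.isEmpty()) {   // blank lines count as lines but add no words
                wordSum += trimmed.split("\\s+").length;
            }
        }
        return new int[] { lineCount, wordSum };
    }

    public static void main(String[] args) {
        String text = "A Scandal in Bohemia\n\nTo Sherlock Holmes she is always the woman.";
        int[] totals = countLinesAndWords(new LineSource(text));
        System.out.println("This text has " + totals[1] + " words across "
                + totals[0] + " lines.");
    }
}
```

In your own program the loop body would use your WordReader object and the word-splitting method you found earlier instead of these stand-ins.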
When the loop finishes, your program can report totals for the whole work. For example:
Including the title, TOC, etc., SomeFile.txt has 111298 words across 9633 lines.
(Note: the word and line counts will not be entirely meaningful. Not only will they include the title and table of contents, they may also have hundreds of lines and thousands of words of copyright and license information. For example, a version of The Adventures of Sherlock Holmes downloaded from Project Gutenberg had more than 300 lines and 3000 words of extra information about Project Gutenberg and its license. This will affect your averages also.)
Counting the number of words and the average word length may not seem like it actually provides any useful information, but you will see some interesting results if you compare works written in different centuries, in different genres, in different languages, or for different audiences. For example, if you compare Hamlet written by Shakespeare (1603), Pride and Prejudice written by Jane Austen (1813), the What to the Slave is the Fourth of July? speech by Frederick Douglass (1852), and Alice's Adventures in Wonderland written by Lewis Carroll for children (1865), you will see interesting differences in vocabulary. You can download any of these, or others, from Project Gutenberg. Choose books that are very different from each other, so that you can compare them along the way or at the end. (Don't forget to download Plain Text versions.)
You will be adding to this program, but submitting it at this point will allow you to get feedback before you submit a final version.
YourName_WordCounting). (Do this from the Mac Finder or Windows Explorer, not from within BlueJ.) This will help whoever grades it when they receive a dozen or more projects with similar names.
Rather than repeat the code to get the next line in both the initialization and step parts of a traditional for loop, experienced Java programmers will often write a while loop that combines getting the next line and testing that it is not null in one, more complex expression.
    reader = new WordReader(filename);
    while ( (nextLine = reader.getNextLine()) != null ) {
        // Do something in the loop.
    }
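The same assign-and-test idiom is common with the standard library's BufferedReader, whose readLine method also returns null at the end of the input. The sketch below uses BufferedReader over an in-memory string rather than the course's WordReader, so that it can run on its own; the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class WhileLoopSketch {
    static int countLines(BufferedReader reader) {
        int lineCount = 0;
        String nextLine;
        try {
            // Read the next line and compare it to null in a single expression.
            // The inner parentheses matter: the assignment must happen first.
            while ((nextLine = reader.readLine()) != null) {
                lineCount++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lineCount;
    }

    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("one\ntwo\nthree"));
        System.out.println(countLines(reader));   // prints 3
    }
}
```

Without the inner parentheses, `nextLine = reader.readLine() != null` would try to assign a boolean to a String variable and would not compile.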