In this lab you will construct an object of the WordReader
class (provided to you), giving it the name of a book to read. Over the
course of several labs and mini-labs, you will
do some very simple digital humanities computational
textual analysis of the book.
In this lab, to get started, you will be reading in lines from a piece of
literature, printing some, skipping some, and reporting on the length of
some.
The first step is to choose a book that is available in plain text format. Project Gutenberg is a well-established library of over 60,000 free eBooks, focusing mostly on books published before 1924, whose copyright has expired. A good place to start is with their Top 100 or with their Recently added eBooks. Another interesting source of accessible materials is Wikisource, which has documents in many languages.
Choose one or more books to download from Project Gutenberg or
Wikisource. Clicking on the name of a book should take you to a
page with several download formats (HTML, EPub, Kindle, and Plain
Text). (You may need to choose "Other formats" before you find
Plain Text.) Download the Plain Text version of the book to your
own machine, storing it in the folder where you have your Java
projects for this course. The name of the file will probably end
in .txt (1661-0.txt, for example), although your computer might not
show you the .txt extension.
(If you download several different types of documents, you will be able to see interesting differences among them at later points in this lab series.)
In the first exercise, you will download some useful code and write a program that reads and prints out the first non-empty line in the book you chose. For example, your program's output at the end of this exercise might be something like:
    Welcome to the Word Counting Program.
    The first line in SherlockHolmes.txt is:
    Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
    The length of the first line is 76.
Implementation: A good software development practice is to start by writing the smallest amount of code that you can test, test it, then continue by adding small, incremental changes and testing all along the way. (This is sometimes known as Agile Development, or Iterative, Incremental Development, or "always have working code.")
Your first testable step will be to get the classes you need and to use them to read the first line.
In the MainClassTemplate class, find the line that says public class MainClassTemplate, and change the name of the class from MainClassTemplate to something appropriate (examples: WordCountingApp, WordCounterApp, etc.). When you compile your class, BlueJ will automatically change the name of the file containing this class to match your new class name.
WordReader class along with your previous class in the class diagram. Double-click on it.
WordReader class; if it opens to the source code, switch to the class documentation in the pull-down menu at the top-right corner of the window.
Note: In BlueJ, you can always switch between the source code and the class documentation for a class using the Source Code/Documentation pull-down menu in the top-right of the source code window.
The WordReader constructor expects a String argument, which is a file name for a book.
Below the "reminder" comments in the main method, create a String variable that holds the name of the file you downloaded. (This will be useful again later.) Construct a WordReader object, passing it the filename variable. For example,
    String fileVariableName = "name-of-your-file.txt";
    WordReader readerVariableName = new WordReader(fileVariableName);
Use the WordReader class documentation to find a method you can use to read in a line of text. Add a statement that invokes that method to ask the reader to read in a line of text. Put the line in a String variable. Then print the line you just read in. The output should look similar to the first lines in the example at the beginning of this "Getting Started" section.
Note: Some books might have an "invisible" first line with special codes for eReader software. If your first line appears to be blank, read a second line into the same String variable and print that.
The String class has a length method. (See the String Methods quick reference for more information.)
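The steps above can be sketched roughly as follows. This is only an illustration, not the lab solution: LineSource is a hypothetical stand-in for the provided WordReader class (it serves lines from a String rather than from a book file), and the helper name firstNonEmptyLine and the sample text are assumptions made for the example. It uses a loop to get past blank lines, which generalizes the "read a second line" advice in the note.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class FirstLineSketch {

    // Hypothetical stand-in for the course's WordReader, so the sketch runs
    // without a book file. It serves lines from a String instead of a file.
    static class LineSource {
        private final BufferedReader in;

        LineSource(String text) {
            in = new BufferedReader(new StringReader(text));
        }

        // Returns the next line, or null once the text is used up.
        String getNextLine() {
            try {
                return in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Keeps reading until it finds a line that is not blank.
    // Returns null if the text runs out first.
    static String firstNonEmptyLine(LineSource reader) {
        String line = reader.getNextLine();
        while (line != null && line.trim().isEmpty()) {
            line = reader.getNextLine();
        }
        return line;
    }

    public static void main(String[] args) {
        // Sample text with a blank first line (an assumption for the demo).
        String sampleText = "\nThe Adventures of Sherlock Holmes\nby Arthur Conan Doyle\n";
        LineSource reader = new LineSource(sampleText);
        String firstLine = firstNonEmptyLine(reader);
        System.out.println("Welcome to the Word Counting Program.");
        System.out.println("The first line is:");
        System.out.println(firstLine);
        System.out.println("The length of the first line is " + firstLine.length() + ".");
    }
}
```

In your own program you would construct a WordReader with your file name instead of the stand-in, and call whichever line-reading method its documentation describes.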
The first line is probably not actually a line from the book; it might contain the name of the publisher or the book's title, but it also might be a blank line or contain copyright or publication history information. In fact, many of the lines at the beginning might be such meta-information, including the table of contents, and so forth. To get a more meaningful line of text, skip the first 60 lines, showing the user that you are doing so. To do this you will have to read and ignore the lines you are skipping, because a file is like a cassette tape rather than a CD: you can't just go straight to the section you want; you have to "fast forward" through everything that comes before it.
To skip past 60 lines, create a loop that will repeat the right number of times. Inside the loop, read the next line from the file as you did before (but without printing it) and call System.out.print(".");
Note that you want to use System.out.print, not println, in the loop, but you will want to do a single println after the loop to go to the next line.
For example, if you were skipping 10 lines, you might have:
Skipping 10 lines: ..........
After skipping over those lines, get a new line and then print it and the length of the line, just as you did earlier. (If the printed line doesn't show up, make sure you printed a newline after the ...; it might be that your printed line is at the end of the line of dots and you would have to scroll right to see it.)
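One possible shape for the skipping loop is sketched below. As before, LineSource is a hypothetical stand-in for WordReader that reads from a String, and skipLines and the generated sample text are names invented for this example. The sketch also stops early if the text runs out, which the lab does not require.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class SkipLinesSketch {

    // Hypothetical stand-in for the course's WordReader (reads from a String).
    static class LineSource {
        private final BufferedReader in;

        LineSource(String text) {
            in = new BufferedReader(new StringReader(text));
        }

        // Returns the next line, or null once the text is used up.
        String getNextLine() {
            try {
                return in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Reads and discards up to `count` lines, printing one dot per skipped line.
    // Returns how many lines were actually skipped.
    static int skipLines(LineSource reader, int count) {
        int skipped = 0;
        System.out.print("Skipping " + count + " lines: ");
        for (int i = 0; i < count; i++) {
            String ignored = reader.getNextLine();
            if (ignored == null) {
                break; // ran out of text before skipping `count` lines
            }
            System.out.print(".");   // print, not println: dots stay on one line
            skipped++;
        }
        System.out.println();        // single newline after the row of dots
        return skipped;
    }

    public static void main(String[] args) {
        // Build twelve short sample lines (made up for the demo).
        StringBuilder text = new StringBuilder();
        for (int i = 1; i <= 12; i++) {
            text.append("sample line ").append(i).append("\n");
        }
        LineSource reader = new LineSource(text.toString());

        skipLines(reader, 10);
        String line = reader.getNextLine();
        System.out.println(line);
        System.out.println("The length of this line is " + line.length() + ".");
    }
}
```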
Test your program.
Question: Was 60 lines enough to skip past all of the meta-information (title, table of contents, etc.)? It might not be if your book has about 55 lines of Project Gutenberg information at the beginning. You might choose to skip past more lines.
Starting wherever you are now in the file, print the next 20 lines. (Or you could pick a different number of lines.)
One simple type of digital humanities analysis is to measure the difficulty and variety of the vocabulary used in a work of literature. For example, how many different words appear? How long are the words? Etc. We'll see how to break a line of text up into individual words in a future lab, but for now we could ask simpler questions: how many lines are there, and what is the average length of a line?
Create a new variable near the beginning of your program (before
you have read any lines of text)
to count the number of lines you have read in
and printed or skipped. Give it a name that indicates its purpose
(e.g., counter, lineCounter, numLines, or anything like that). Set it to 0
initially. Then add a line to increment the counter every time you
read in a line.
Modify your program to show the line number for each line you print out. For example, the line in the previous example might now look like:
Line 1: Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
Modify your program to show the average length of the lines you printed out (or, alternatively, the average length of the lines you read in, including the skipped lines). To do this, you will want to introduce yet another variable after your line counter to sum up the lengths of all the lines. Again, initialize it to 0. Then, whenever you write (or read) a line, add the length of that line to your overall length count. After you have printed all of the lines you are going to print, calculate the average by dividing the total length of all lines by the number of lines written (or read). Print that value with appropriate explanatory text, such as "The average line length is …"
NOTE: If your total length variable and your line counter variable are both integers, then your average will always be "rounded" down because you are doing integer division. (This isn't really rounding, since it always goes down. It is really truncating the fractional part.) You can force it to do floating point division if you declare one or both of those variables to be a float or double, instead of an int. (Quick test of understanding: why would it be enough for just one of the variables to be a float or double? What will the compiler do if you have one floating point number and one integer for your division?)
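To see the difference described in the note, here is a small sketch that computes the average both ways. It works on an in-memory array of two made-up lines rather than on a book file, and the method names truncatedAverage and averageLength are assumptions for the example. The first method keeps both variables as int; the second keeps the running total in a double, as the note suggests.

```java
class AverageLengthSketch {

    // Both variables are ints, so the division discards the fractional part.
    static int truncatedAverage(String[] lines) {
        int lineCounter = 0;
        int totalLength = 0;
        for (String line : lines) {
            lineCounter++;
            totalLength += line.length();
        }
        if (lineCounter == 0) {
            return 0; // avoid dividing by zero when there are no lines
        }
        return totalLength / lineCounter;
    }

    // Same bookkeeping, but the running total is declared as a double.
    static double averageLength(String[] lines) {
        int lineCounter = 0;
        double totalLength = 0;
        for (String line : lines) {
            lineCounter++;
            totalLength += line.length();
        }
        if (lineCounter == 0) {
            return 0.0; // avoid dividing by zero when there are no lines
        }
        return totalLength / lineCounter;
    }

    public static void main(String[] args) {
        // Two sample lines of lengths 5 and 6 (made up for the demo).
        String[] lines = { "To be", "or not" };
        for (int i = 0; i < lines.length; i++) {
            System.out.println("Line " + (i + 1) + ": " + lines[i]);
        }
        System.out.println("With int variables the average is " + truncatedAverage(lines));
        System.out.println("The average line length is " + averageLength(lines));
    }
}
```

Try running it and comparing the two printed averages before answering the quick test of understanding.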
At the end of the file, the getNextLine method will return null instead of a String containing the line. (The null keyword indicates a non-existent object.) We can write a loop like this:
    for ( nextLine = reader.getNextLine(); nextLine != null; nextLine = reader.getNextLine() )
    {
        // Do something in the loop body.
        // For example, print nextLine, skip it, increment a counter, etc.
    }

The initialization in this loop reads in the first line we are going to process in this loop. We then check if it is null (end-of-file). If it isn't, we do whatever is in the loop body, then read in the next line in the step section before checking again to see if it is null. The job of the step part of the loop is to make each iteration through the loop different from the one before, which we do in this case by reading in the next line of the file.
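As a concrete instance of this loop, the sketch below counts every remaining line until getNextLine returns null. LineSource is again a hypothetical stand-in for WordReader that reads from a String, and countLines and the sample text are names chosen only for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadToEndSketch {

    // Hypothetical stand-in for the course's WordReader (reads from a String).
    static class LineSource {
        private final BufferedReader in;

        LineSource(String text) {
            in = new BufferedReader(new StringReader(text));
        }

        // Returns the next line, or null once the text is used up.
        String getNextLine() {
            try {
                return in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Counts the remaining lines, stopping when getNextLine returns null.
    static int countLines(LineSource reader) {
        int numLines = 0;
        String nextLine;
        for (nextLine = reader.getNextLine();
             nextLine != null;
             nextLine = reader.getNextLine()) {
            numLines++; // loop body: here we only count the line
        }
        return numLines;
    }

    public static void main(String[] args) {
        // Four sample lines, one of them blank (made up for the demo).
        LineSource reader = new LineSource("first\nsecond\n\nfourth\n");
        System.out.println("Number of lines read: " + countLines(reader));
    }
}
```

Blank lines are counted too, because an empty line comes back as an empty String rather than as null.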
You will be adding to this program, but submitting it at this point will allow you to get feedback before you submit a final version.
YourName_Lab2). (Do this from the Mac Finder or Windows Explorer, not from within BlueJ.) This will help whoever grades it when they receive a dozen or more projects with similar names.