Lab: Using Data

 


Introduction

The purpose of this mini-lab is to practice loading data, plotting data, saving figures, and fitting data to a model.



Loading Data into an Array

  1. Create a new file for the statements and functions you will write in this mini-lab. Save this new file with a name representative of this mini-lab.

  2. It is often useful to plot an array of data that we either generate from our own code or read in from a file. Download the file class-grades.csv and save it in the same folder as this lab file. This data set is a set of grades from a Chemical Engineering course at McMaster University. It contains 99 rows and six columns. The columns give us: a prefix denoting which year the student is in, the assignment grade, the tutorial grade, the midterm grade, the take-home exam grade, and the final exam grade for each student. The following code may be used to read this data into an array of size 99x6:
        import csv
        import numpy as np

        def readGrades():
            # one row per student (99), one column per value (6)
            gradeArray = np.zeros((99, 6))
            with open('class-grades.csv', 'r') as f:
                # read from the file, a line at a time, storing each line as a row of the array
                index = 0
                for grades in csv.reader(f):
                    gradeArray[index] = [float(g) for g in grades[:6]]
                    index += 1

            print(gradeArray)
            return gradeArray
        
    

    Copy this code and test it to see what it does. Notice that the rows in the array correspond to the rows of the data file.

  3. NumPy has a function, loadtxt, that will do this same thing for us, so we don't have to read the data line by line and copy each line into the array ourselves. Our new statement would look like:
    gradeSet = np.loadtxt('class-grades.csv', delimiter = ',')

    (Note: the delimiter is used to specify what separates the items on each line of the data file. Since our data is in a .csv (Comma Separated Values) file, we use a comma.) Copy this statement and print out the array.

  4. One question we can ask about this data is how many homework grades fall into different categories. To create a histogram of the homework (assignment) grades, using the gradeSet array loaded in the previous step, try the following:
        hwGrades = [gradeSet[i][1] for i in range(len(gradeSet))]
        plt.hist(hwGrades, 20)
        
    (Remember, you will need to import matplotlib.pyplot in order to use the hist function.)

  5. Create a histogram of the final exam scores.

  6. The file JanTemps.csv contains the daily high and low temps for the month of January, for each of the years 1980, 2005, 2010, 2020, and 2021. This is the same data that was used in Lab 3: Using Arrays. Download this csv file and read the data into a 31x10 array. Print this array.

  7. Create a histogram for one set of temperatures (either the high temps or low temps for one year).

  8. Plot all 5 sets of high temps in the same figure and all 5 sets of low temps in the same figure. Do you see any patterns?
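
The loading and column-selection steps above can be sketched as follows. This is only an illustration under stated assumptions: the five rows are made-up values in the same six-column layout as class-grades.csv (not the real class data), and the names sample_csv, grades, assignment, and final are hypothetical. It uses array slicing as an alternative to the list comprehension in step 4, and np.histogram to show the bin counts that a histogram plot would draw as bars.

```python
import io
import numpy as np

# Made-up sample rows (NOT the real class data), in the assumed column order:
# year prefix, assignment, tutorial, midterm, take-home, final
sample_csv = io.StringIO(
    "1,60.0,70.0,55.0,80.0,65.0\n"
    "2,72.5,81.0,64.0,90.0,58.5\n"
    "2,85.0,92.5,71.5,88.0,77.0\n"
    "3,91.0,88.0,83.5,95.0,90.0\n"
    "4,48.0,60.0,40.0,70.0,35.5\n"
)

# np.loadtxt accepts an open file-like object as well as a file name
grades = np.loadtxt(sample_csv, delimiter=",")

# Slicing: ":" means every row, the second index picks the column
assignment = grades[:, 1]
final = grades[:, 5]

# Count how many assignment grades fall into each of 4 equal-width bins
counts, edges = np.histogram(assignment, bins=4)

print(grades.shape)
print(counts)
print(edges)
```

The same slicing pattern applies to any rectangular data file, for example picking one column of temperatures out of a 31x10 array. If a file begins with a header line, the skiprows argument of np.loadtxt (for example skiprows=1) skips it.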

Saving Figures

In the previous set of exercises, you created a histogram showing the distribution of the homework grades for the 99 students. When you end your session in Spyder (or whatever development environment you are using), your figures will disappear. If you wrote a script (i.e., code) to create this histogram, you can always recreate the figure. However, you may want to use this figure (or another one) in a report of some kind. Saving figures is straightforward.
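
The text here does not name the function to use, so the following is a minimal sketch that assumes Matplotlib's savefig is the intended tool. The grade values are made up for illustration, and the file name hw_histogram.png and the temporary-folder location are hypothetical choices; in practice you would pick a name and folder that suit your report.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Made-up grades, for illustration only
hw_grades = [55, 62, 68, 71, 74, 77, 80, 83, 88, 95]

fig, ax = plt.subplots()
ax.hist(hw_grades, bins=5)
ax.set_xlabel("Homework grade")
ax.set_ylabel("Number of students")

# Hypothetical output location; the file extension selects the format
out_path = os.path.join(tempfile.gettempdir(), "hw_histogram.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)

print("Saved figure to", out_path)
```

Saving before calling plt.show() (or skipping plt.show() entirely, as here) avoids writing out an empty figure after the window has been closed.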


CHALLENGE EXERCISES (Optional) Fitting HIV Data to a Model*

The viral load is the number of virions in the blood of a patient infected with HIV. We are interested in how it changes after the administration of an antiretroviral drug. One model for the viral load predicts that the concentration V(t) of HIV in the blood at time t after the start of treatment will be
V(t) = A exp(-αt) + B exp(-βt).
The four parameters A, α, B, and β are constants that control the behavior of the model.
In this section, we will use Python to generate plots based on this model, import and plot experimental data, and then fit model parameters to the data+.
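
As a rough sketch of how this formula translates into NumPy, the function below evaluates V(t) for a whole array of times at once. The function name viral_load_model and the parameter values are arbitrary choices for illustration, not values taken from the lab or from the data.

```python
import numpy as np

def viral_load_model(t, A, alpha, B, beta):
    """Evaluate V(t) = A exp(-alpha t) + B exp(-beta t) elementwise."""
    return A * np.exp(-alpha * t) + B * np.exp(-beta * t)

# 101 evenly spaced times from 0 to 10, inclusive
time = np.linspace(0, 10, 101)

# Arbitrary trial values; B = 0 leaves a single decaying exponential
A, alpha, B, beta = 1.0, 0.5, 0.0, 2.0
viralLoad = viral_load_model(time, A, alpha, B, beta)

print(viralLoad[0])    # value at t = 0, which equals A + B
print(viralLoad[-1])   # value at t = 10
```

Because np.exp works on every element of an array, no loop over the time values is needed.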

  1. Create an array of 101 numbers ranging from 0 to 10 using the linspace function. Assign it to the variable time.

  2. Create variables named A, alpha, B, and beta, and give them some initial values. You might start with B as 0 to get an idea of how the function V(t) will work.

  3. Next, create an array called viralLoad of viral load values corresponding to the time values. (Remember, if nums is an array of 10 numbers, to multiply and add something to each of these array values, we could type a statement like newNums = 3 * nums + 7.)

  4. You now have two arrays of the same length, so plot them:
    plt.plot(time, viralLoad)
    Remember that to see just the points, you can specify a color and shape to be used for the points, such as:
    plt.plot(time, viralLoad, 'ro')

  5. Label the axes of your graph "Time" (x-axis) and "Viral Load" (y-axis).

  6. Experiment with different values of A, alpha, B, and beta to see how the shape of the graph changes.

  7. Download the file HIVseries.csv, which contains experimental HIV data. Use the np.loadtxt command to load the data into an array called HIV_Data.

  8. Create an array called time from the first column of the HIV_Data array.

  9. Create an array called viralLoad from the second column of the HIV_Data array.

  10. Plot these time and viral load arrays the same way you plotted the previous time and viralLoad arrays.

  11. Create a new array, called viralLoadFcnArray by computing the values of the viral load function for this new time array. (Create it the same way as you first created the viral load array, by using the function for viral load.)

  12. Plot this new array on the same axis with the experimental data. Plot this function as a curve, and use just the points for the experimental data. You might end up with something such as: (figure: the model curve plotted together with the experimental data points)

  13. The goal now is to tune the four parameters of the viral load function until the model agrees with the data. It is hard to find the right needle in a 4-dimensional haystack! So let's try to be a little more systematic about it. Think about how the initial value V(0) depends on the four constants. What does it tell us about A and B? Now vary your constants (assuming β > α) so that you always get the correct initial value. Next, experiment with α and β so that your long-term behavior matches that of the data. Continue adjusting the four parameters until you are satisfied that your model is a good fit for the data.
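
One possible way to make the tuning in the last step more systematic is sketched below. It is not the lab's prescribed method, only an illustration of two facts about the model: V(0) = A + B, and, when β > α, the B term dies out first, so at late times ln V(t) is approximately ln A - αt, a straight line with slope -α. The data here are synthetic, generated from made-up parameter values rather than read from HIVseries.csv, and the cutoff t >= 5 for "late times" is a judgment call.

```python
import numpy as np

# Synthetic "data" generated from made-up parameters (not the real HIV data)
A_true, alpha_true, B_true, beta_true = 10.0, 0.5, 5.0, 3.0
t = np.linspace(0, 10, 101)
V = A_true * np.exp(-alpha_true * t) + B_true * np.exp(-beta_true * t)

# At late times the fast-decaying B term is negligible,
# so ln V is close to the straight line ln(A) - alpha * t
late = t >= 5.0
slope, intercept = np.polyfit(t[late], np.log(V[late]), 1)

alpha_est = -slope          # slope of the late-time line gives alpha
A_est = np.exp(intercept)   # intercept of the late-time line gives ln(A)
B_est = V[0] - A_est        # the initial value fixes A + B

print(alpha_est, A_est, B_est)
```

With A, α, and B pinned down this way, only β is left to adjust by hand. Real measurements are noisy, so the estimates would be rougher than in this noise-free example and are best treated as starting values.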

*This example is taken from Kinder, J. and Nelson, P., "A Student's Guide to Python for Physical Modeling", Princeton University Press, 2015, pp. 61-63.
+For more details of this model, see Chapter 1 of Nelson, "Physical Models of Living Systems", W.H. Freeman, 2015.
