Child pages
  • Biochemical data analysis using spreadsheet software
Skip to end of metadata
Go to start of metadata

Olli Niemitalo, o@iki.fi
Department of Biochemistry, University of Oulu

Introduction

The most common use for spreadsheets is storage and processing of financial records. You can input many, many numbers and write whatever calculations you want to be carried out on the numbers. Later you can change any of the numbers, and all the dependent calculations are automatically redone for you. Spreadsheet programs also have basic capabilities for graphing the data. All this is very handy, also for processing biochemical data straight from the laboratory. A great thing worth mentioning about computerized calculations is that there will be no human errors in handling the numbers. 

You will learn to:
•  Enter data onto a spreadsheet
•  Write functions to do calculations on the data
•  Do least squares fitting of a curve on the data
•  Graph your data and the fitted curve 

This text can be used for teaching in computer lab sessions and is also suitable for self-study.

Software

There are two good and widely available spreadsheet programs, both installed on university computers. For this spreadsheet excercise, you can choose either :

Microsoft Excel
•  Windows, Mac
•  Microsoft Office Academic is 69 € for students via Lyyra, Microsoft office is 4 € for university staff

OpenOffice.org Calc
•  Windows, Mac, Linux
•  Mimics most features of Excel
•  Free (smile)

For data analysis, there is also available a Windows software Origin, which is free for students from the university download site at https://www.oulu.fi/jakelu/.

Spreadsheet basics 


 
Figure 1. A spreadsheet sheet

A spreadsheet sheet is a grid of cells that contain data. Cells are referred to by their location, for example D3 is on column D, row 3 (Figure 1).

Data in each cell can be one of (with examples):
•  Number  0.3696867843344
•  Text     L-Dopa:
•  Function  =1/A3

A function always begins with the equals sign, and it can make references to other cells, like A3 here. The value of a function is automatically calculated and displayed on the sheet. To view the actual formula, bring the cursor over the cell. You can move the cursor by clicking the target cell, or by keyboard. The function is displayed in a separate edit box (Figure 1).

 
Figure 2. Formatting cells from a popup menu invoked by right-click
 
Depending on the language settings, decimal numbers are entered either as 123,456 or as 123.456. Correctly entered numbers are displayed aligned to the right (text is aligned to the left). The number of decimals displayed on the sheet depends on the formatting of the cells. Sometimes funny things may happen if a cell is automatically and erroneously formatted as a date, in which case the cell should be manually formatted (Figure 2). 

When entering a series of numerical data, it is good practice to write a short title describing the series on one cell and to enter the data on the cells below (Figure 3). It is also not a bad idea to explain somewhere on the sheet to which laboratory experiment the data is related.

 
 
Figure 3. Data series with proper titles
 
The example data in Figure 3 is absorbance readings. Let's say you want to calculate the concentration of a chemical based on the absorbance readings and known absorption coefficient and path length. According to Beer-Lambert law, c= A/(αl), where c is concentration, A is absorbance, α is absorption coefficient and l is path length, the function can be written for Sample 1 as in Figure 4.  
 
Figure 4. A function to calculate concentration for Sample 1
 
Concentration is calculated automatically! Now we would like to repeat the calculation for Samples 2-7. The function can be copy-pasted, by selecting the cell C2, pressing CTRL-C to copy, selecting all the target cells C2 to C8 at once (either by mouse, or holding down SHIFT while moving the cursor using ARROW KEYS. Have a peek at Figure 6 to see how a selection looks) and pressing CTRL-V to paste. Copy and paste functionality is also available from a popup menu invoked by right-click. 
 
Figure 5. A failed attempt to copy-paste a function, with the cause revealed
 
But the result is a failure (Figure 5)! Looking more closely at C3, it appears that the spreadsheet program has been very helpful by incrementing in the pasted function, =B3/(D3*D6), the row from which to get the absorbance, to B3, but it has also incremented the rows from which to get the absorption coefficient and the length, to D3 and D6, which contain no data! The workaround is to start over by rewriting the function in C2 as =B2/($D$2*$D$5). The dollar signs in front of the column and row indices indicate that they should not be automatically and "helpfully" incremented. Then the copy-pasting is repeated (Figure 6). Now everything works! A useful guideline is to put in $'s whenever you are referring to a cell that is not part of a series. Actually, here it would be enough to use $'s in front of the column numbers only. 
 
Figure 6. Correction of the function in C2 and copy-pasting again solves the problem

You may try changing any of the source data, including the absorption coefficient and the path length to see how the calculations are automatically redone. Data can be located anywhere on the sheet, and sometimes it would be useful to move things around. Existing data can be moved by marking it, cutting it (CTRL-X), moving the cursor to the destination location, and pasting it (CTRL-V). If another cell or a graph refers to the moved cell (for example if D2 is
moved here), the reference is automatically updated to point to the new location. Very handy! To erase data, press DELETE. Sometimes it would be useful to paste the displayed results of the calculations, and not the formulas. This can be done using Paste Special, found in the popup menu invoked by right-click. Also, when pasting spreadsheet data to a word processor, there is normally a possibility to paste the data as plain numbers. 

Spreadsheet programs have many built-in functions. For example, the mean of a set of numbers in cells A2 to A7 can be calculated by  =AVERAGE(A2:A7) (Figure 7), where A2 is the upper left corner of the selection and A7 is the lower right corner. When writing the function, you can mark the cells A2 to A7 by mouse rather than typing in the reference. Some functions accept multiple arguments, for example =POWER(6, 2) or =POWER(6; 2), the separator depending on the spreadsheet software. 
 
Figure 7. Average value of a die calculated using the AVERAGE function

The function names may depend on the language version of the spreadsheet program. Translations to different languages can be found by googling for Microsoft Excel function translations. Also the help that comes with the spreadsheet program is a useful reference to the preprogrammed functions. Table I lists some useful functions in both English and Finnish. There are much more functions available that are not listed here. 

Table I. A short reference of useful spreadsheet functions

English

Finnish

Description

ABS
EXP
LN
LOG10
POWER
RAND
SQRT
SUM
AVERAGE
CORREL
COUNT
INTERCEPT
LINEST
MAX
MEDIAN
MIN
SLOPE
STDEV
TREND

ITSEISARVO
EKSPONENTTI
LUONNLOG
LOG10
POTENSSI
SATUNNAISLUKU
NELIÖJUURI
SUMMA
KESKIARVO
KORRELAATIO
LASKE
LEIKKAUSPISTE
LINREGR
MAKS
MEDIAANI
MIN
KULMAKERROIN
KESKIHAJONTA
SUUNTAUS

Returns the absolute value of a number
Returns e raised to the power of a given number
Returns the natural logarithm of a number
Returns the base-10 logarithm of a number
Returns the result of a number raised to a power
Returns a random number between 0 and 1
Returns a positive square root
Adds its arguments
Returns the average of its arguments  
Returns the correlation coefficient between two data sets
Counts how many numbers are in the list of arguments
Returns the intercept of the linear regression line
Returns the parameters of a linear trend  
Returns the maximum value in a list of arguments  
Returns the median of the given numbers  
Returns the minimum value in a list of arguments  
Returns the slope of the linear regression line  
Estimates standard deviation based on a sample  
Returns values along a linear trend 

Graphing

Two types of charts are most commonly needed in biochemistry, a column chart and a scatter chart (Figure 8). On the column chart, each column has a name or an index number, and the height is a continuous variable. On the scatter chart, both the independent variable (horizontal axis, x) and the dependent variable (vertical axis, y) are continuous.

 
Figure 8. a) A column chart b) A scatter chart

To graph your data, you should ideally have a series of x and a series (or multiple series) of y values next to each other with the titles on top. If your x runs 0, 1, 2, 3, 4, 5, 6, 7, 8, ..., or similar, almost to no end, then you shouldn't type in all those numbers by hand. Instead, write a function to increment the value by each row (Figure 9), using a function such as =A4+1, and copy-paste it to cells below. 
 
Figure 9.Automatically incrementing a number by each row using a suitable function

To create a chart, select the data, including the titles. When creating a column chart, in Excel, do not include the x series in the selection, because the program will think it is another y series, and in Open Office, have the x series as the first column and check First column as label in the next step. The next step depends somewhat on the used software, but goes something along the lines of Insert → Chart → Clustered column for a column chart or Insert → Chart → XY Chart for a scatter chart. Formatting the chart is up to your taste. You may have to click, double-click and right-click parts of the charts to be able to change their properties. In Excel, you can set the column labels at this phase using Select data source. 

Linear regression

There is often need to fit a continuous curve to experimental data points. For example, protein concentration and absorbance have a linear relation in a colorimetric protein concentration assay, according to the Beer-Lambert law, as already discussed earlier. To figure out the relation, absorbance measurements from standard samples with known protein concentration are taken (Figure 10).


 
Figure 10. Absorbance measurements of standard samples in a protein assay

Note that the zero samples are treated as part of the data series. The spectrophotometer reading 0.000 is only due to setting the instrument zero level at that absorbance. A linear standard curve is fitted to the data points (Figure 11) an  can be later used to calculate the concentration of an unknown sample from its absorbance. In the spreadsheet software, first select the data setries in the chart and right-click.Then, in Excel, use Add trendline. In Calc, use Object properties→Statistics→Regression curves. The software fits the linear regression line by the least squares method, which minimizes the sum of squared error at each point. The error at each point is the vertical distance from the sample to the regression line. The errors in the dependent variable y are presumed to be random and are due to pipetting, any other disturbances that might influence the measurement, and ultimately the precision of the spectrophotometer.
 
Figure 11. A linear standard curve may give a poor fit due to non-linearity.

Unfortunately, there is nonlinearity in the data that makes a linear fit rather poor. The curve doesn't go through all points! The nonlinearity can be due to scattering of light and other factors, causing a deviation from the ideal Beer-Lambert law. A much better fit is obtained by reducing the concentration range by leaving out standard samples with too high absorbance readings (near 1.0 or higher). The actual samples, the concentration of which is to be determined, may then have to be diluted to bring absorbance into the reduced range.


 
Figure 12. A better linear fit is obtained by using a reduced absorbance range

For the standard curve to be useful, one needs to know its equation. In Excel, it can be displayed on the chart by Format trendline→Trendline options→Display equation on chart, and the visible function can be further formatted to display more decimals in the numbers. A better practice, also available in Calc, is to use spreadsheet functions SLOPE and INTERCEPT (Figure 13).
 
Figure 13. Slope and intercept of linear regression

Spreadsheet software always create graphical regression lines (Figure 12) as y = f(x ), whereas for calculation of concentration from absorbance, x = f(y ) would be more useful. Swapping the chart axes would be a way around, but this is against the common style of showing the dependent variable on the vertical axis. Swapping x and y in calculation of INTERCEPT and SLOPE is neither an option, as this would produce a different regression line to what is displayed on the chart, creating a discrepancy.

The regression line function is:

y = INTERCEPT(...) + x * SLOPE(...).

To be used for calculation of concentration from absorbance (x from y), this must be solved manually for x, the concentration (Figure 14):

x = (y - INTERCEPT(...))/SLOPE(...).


 
Figure 14. Regression line is solved for x to be used for concentration calculation.

Better results than by linear standard curve can be obtained using a 2nd order polynomial as the standard curve (Figure 15). A 2nd order polynomial is a smooth function that can compensate for slight nonlinearity of data. Solving a 2nd order polynomial for x is a bit hard due to there being two solutions, one of which is the correct one. But it can be done. Unfortunately, Calc is not able to display the coefficients of a polynomial fit, but the coefficients could be solved by nonlinear regression, which is the topic of the next section after the exercises.
 
Figure 15. A 2nd order polynomial standard curve

A spreadsheet is not the best tool for this kind of work. In actual research, more advanced graphing and data analysis programs are often used. From undergraduate students, a linear standard curve is usually all that is expected.

Nonlinear regression

Functions that are more complex than polynomials can be fitted to data using nonlinear least squares regression. The Michaelis-Menten model of enzyme action relates the velocity of the catalyzed reaction, v, and the substrate concentration [S], by:

  v = Vmax[S] / ([S] + Km),

where, Vmax is the maximum reaction velocity (at infinite substrate concentration) and Km is the substrate concentration that is required to reach half the maximum velocity. The equation is of form: 

  y = a*x / (x + b),

where a and b are constants to be determined by fitting the model  to data. A free online tool http://www.xuru.org/rt/NLR.asp (or google for online nonlinear regression) is able to do the fitting using nonlinear least squares. Data is pasted directly on the web page (figure 16) as series of x and y values from adjacent spreadsheet columns. The number of parameters should be limited to two (a, and b). 


 
Figure 16. The on-line nonlinear regression tool at http://www.xuru.org/rt/NLR.asp

The tool then fits to the data a number of different functions, listing them sorted by the residual fitting error. One of the functions, usually near the top of the list if the fit was good, has a form familiar to us: y=0.10405995 x / (x + 2.884414923- 10-1)
  
The values for a, and for b (as 2.884414923E-1), can then be copied and used on the spreadsheet (Figure 17). Note that these are directly the Vmax and Km, respectively.  
 
Figure 17.The best-fit Vmax and Km are used for graphing the Michaelis-Menten curve in Excel

Spreadsheet programs do not support graphing of functions directly, but it is still possible through sampling (Figure 17, 18). A series of x values is created for roughly the same x range as the measured data, and the function value is sampled at each point. In Excel, the x and y series of the Michaelis- Menten fit are added into the existing chart by right-clicking the chart background and by Select data→Add. Only the numbers and not the titles should be chosen for the x and y ranges. Calc doesn't support separate x series for the two y series (the original data and the fit), so he two y series must be located in adjacent columns, sharing the same x values, but extra x values may be added for the fit to increase graphing resolution. When creating the chart, the x series and both of the y series are selected
(Figure 18). 

Figure 18. OpenOffice.org Calc requires that the two y series to be graphed share the same x values

The chart can be formatted to be display a smoothed, continuous line on the fit rather than markers (Figure 19). The fit should always be inspected graphically to see if it matches the data well. With nonlinear least squares it cannot be guaranteed that the best fit is found, especially if there are many unknowns in the model. It should be noted that this type of analyses should really be done using specialized software. These also calculate error propagation. For example, it would be interesting to know to what precision the Vmax and Km obtained can be trusted, based on scattering of the data. Such an analysis would be a lot of hard work using a spreadsheet, and is not expected from undergraduate students.
 
Figure 19. The Michaelis-Menten model fitted on enzyme kinetics data

Exercises

The idea here is to play with the concepts until you grasp them. Freely use your imagination!

Part 1 - BASICS
1.  Write some text in some cells and numbers in some other cells.
2.  Write in one cell a function that uses data from other cells.
3.  Modify the data in the referred to cells and see how the value of the function changes.
4.  Try moving (cut-pasting) data to a different location on the sheet and note how the references in functions are automatically updated.
5.  Change the number of decimals displayed in some cells with numerical data.
6.  Create a column series of data that goes 1, 2, 3, 4, 5, ..., 100 using a suitable function in the "2" cell and by copy-pasting it to cells below.
7.  On the next column, take for example the square root of each number or try  other mathematical functions (EXP, LN, POWER, SQRT, ...).
8.  Try statistical functions (SUM, AVERAGE, MEDIAN, ...) on a series of numbers. 
9.  Try moving data around the sheet by keyboard alone, using ARROW KEYS, PAGE UP, PAGE DOWN, CTRL-ARROW KEYS, CTRL-END, SHIFT, CTRL-X (cut) and CTRL-V (paste). Try CTRL-C (copy), too. 

Part 2 - GRAPHING
1.  Create a column chart of data of your own choice.
2.  Create a scatter chart of your own data you want to analyze. Excel can Open text files and Calc can Insert→Sheet from file. You may have to find-replace the ,'s with .'s. (Here is some data for you)
3.  Make the chart beautiful and easy to read! 
 
Part 3 - CURVE FITTING - Read the next section for instructions!
1.  Fit a linear function to your data using the spreadsheet. Is it a good fit? Would another function be better or should the range of data be limited? Use the SLOPE and INTERCEPT functions to get the coefficients of the function. Write a spreadsheet function to calculate the y value of the fit at x value of your own choice.
2.  Fit a nonlinear function to your data using http://www.xuru.org/rt/NLR.asp. Graph the function in the same chart with your data using sampling. 

  • No labels