Material for Session 8: Regression, Correlation and Curve Fitting
General Least Squares Curve Fitting
Some Terminology:
Independent Variable (also called a predictor): an experimental condition that you can arbitrarily set (temperature, concentration of a reagent).
Univariate: a model containing only one independent variable.
Multivariate: a model containing more than one independent variable.
Dependent Variable (also called an outcome): an experimental quantity that is observed as a result of the way an experimental system responds to the conditions you have set.
Model: A formula, or other mathematical relationship between one or more predictors, and an outcome. Will also involve one or more unknown parameters. A model can be visualized as a curve (one predictor) or a surface (two predictors).
Data Point: a set of values for the independent variable(s) and the dependent variable (e.g.: x and y).
Parameters: (also called regression coefficients): variables appearing in a model that can assume specific values to characterize a particular situation. Every distinct curve is defined by the values of its parameters.
Degrees of Freedom: The number of data points minus the number of parameters being estimated.
Linear and Non-linear Models: refer to the way the parameters enter into the model, not the way the independent variables enter into the model. So, fitting a polynomial is considered to be a linear curve-fit. Even though the polynomial is a non-linear function of x, it is a linear function of the parameters.
Curve Fitting (also called parameter estimation, or regression): The process of determining the values of one or more parameters that make the model reflect the observed data.
Least Squares: a criterion that says the best curve is the one for which the sum of the squares of the vertical distances of each point from the fitted curve is minimized.
Standard Error (S.E.): A measure of the uncertainty associated with a quantity. The quantity can be a measurement, a number derived from a measurement, a parameter resulting from a curve-fit, or a quantity calculated from the fitted parameter or parameters.
Confidence Interval (C.I.): a range of values between which you assert, with some specified degree of confidence, that the true value lies. A 95% C.I. should have at least a 95% chance of encompassing the true value.
Confidence Band: A pair of curves placed around a fitted curve. They encompass a region of space in which the true curve is asserted to lie.
Examples:
Straight Line:
y = a + b*x
Parameters: a is the y-intercept; b is the slope.
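As a minimal sketch of the least-squares criterion applied to a straight line y = a + b*x, the Python below computes a and b from the standard closed-form solution and then checks that nudging either parameter increases the sum of squared vertical distances. The data values and function names are made up for illustration.

```python
def sum_of_squares(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def fit_line(xs, ys):
    """Closed-form, unweighted least-squares estimates of intercept a and slope b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

# Hypothetical data, roughly following y = 1 + 2*x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

a, b = fit_line(xs, ys)
best = sum_of_squares(a, b, xs, ys)

# Degrees of freedom: number of data points minus number of parameters
dof = len(xs) - 2

# Any other line has a larger sum of squares than the fitted one
worse_intercept = sum_of_squares(a + 0.1, b, xs, ys)
worse_slope = sum_of_squares(a, b + 0.1, xs, ys)
print(a, b, best, dof)
```

The closed form only exists because the model is linear in a and b; non-linear models need an iterative search instead.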
Radioactive Decay:
A = A0 * exp(-k*t)
Parameters: A0 is the radioactivity at time=0; k is the decay constant.
Resources: Books:
There is a very complete 400-page book on curve-fitting for scientists. It can be downloaded for free at:
The curve-fitting portion of the book can also be viewed online as web pages.
Resources: Software:
All modern statistical analysis and charting programs support linear regression, and many also support non-linear regression: SPSS, SAS, DeltaGraph, Prism, SigmaPlot.
Specialized programs exist for specific areas: WinNonlin (pharmacology), SAAM and Boomer (physiological modeling).
The Nonlinear Least Squares Curve Fitting web page can also do linear and nonlinear curve-fitting:
http://members.aol.com/johnp71/nonlin.html
This web page can be used effectively in conjunction with Excel.
Weighting of Points
A more precise measurement should influence the curve more strongly than an imprecise measurement. A good curve-fitting program will give more "weight" (or curve-pulling power) to measurements with small standard errors. A point with a weight factor of 2 will be equivalent to 2 identical points that have a weight factor of 1.
Weight varies inversely as the square of the standard error:
w = 1 / (S.E.)^2
Weights are relative: If you believe all points are equally precise, you can set the S.E.'s all equal to 1.
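The sketch below illustrates the weighting rules for a straight line y = a + b*x, assuming each point comes with its own standard error. It checks two of the statements above: rescaling all the S.E.'s by a common factor leaves the fit unchanged, and one point with weight 2 acts like two identical points with weight 1. The function names and data are hypothetical.

```python
import math

def weights_from_se(ses):
    """Weight varies inversely as the square of the standard error."""
    return [1.0 / se ** 2 for se in ses]

def fit_line_weighted(xs, ys, ses):
    """Weighted least-squares estimates (a, b) for y = a + b*x."""
    ws = weights_from_se(ses)
    sw = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / sw
    y_mean = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

# Hypothetical data
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.2, 1.9, 3.3]

# Weights are relative: S.E. = 1 everywhere gives the same fit as S.E. = 5 everywhere
fit_se_1 = fit_line_weighted(xs, ys, [1.0] * 4)
fit_se_5 = fit_line_weighted(xs, ys, [5.0] * 4)

# A weight of 2 on the last point (S.E. = 1/sqrt(2)) ...
fit_heavy = fit_line_weighted(xs, ys, [1.0, 1.0, 1.0, 1.0 / math.sqrt(2.0)])
# ... is equivalent to entering that point twice with a weight of 1
fit_twice = fit_line_weighted(xs + [3.0], ys + [3.3], [1.0] * 5)
print(fit_se_1, fit_heavy)
```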
Providing Input:
Model: In general, you must enter the equation describing your model, or select it from a library of "canned" models. You must adhere to the syntax rules for the software you are using.
Data: If using the web page, you can type your data into the page directly, or copy it from Excel and paste it into the web page.
Parameters: For non-linear models, you must provide initial estimates for the parameters.
Interpreting the Output
All general curve fitters produce similar output, which will usually include the following components:
Goodness of Fit: Chi Square or RMS Error for the model.
Parameters: may be referred to as a, b, c, etc.; p1, p2, p3, etc.; b1, b2, b3, etc.; or β1, β2, β3, etc.
Standard Errors of Parameters: an estimate of how uncertain the estimate is, based on the errors in the original data, the nature of the equation being fit, and the number and spacing of the observed points.
p-values: the significance level of the test of whether that parameter is different from zero. If p<0.05, you can be reasonably sure that the parameter is not zero. This may or may not be meaningful, depending on the situation. For a straight line, the significance of a non-zero slope is equivalent to a significant correlation between x and y, but the significance of a non-zero y-intercept may not indicate anything of importance.
Calculated y-values: these, together with the corresponding x-values, can be used to graph the fitted curve.
yo-yc deviations: tell whether the observed point falls above (+) or below (-) the fitted curve. Examining these deviations can give you important information about possible outliers, and about how well the model actually fits the data.
Graphical Output: This will usually show the observed points, the fitted curve, and perhaps confidence bands around the curve.
Confidence Bands: On a graph, these would be indicated by a pair of curves hugging the fitted curve, one above and one below, which indicate the likely region in which the true curve falls. In a table, these might be indicated by a low and high y-value for each point.
Covariance / Error Correlation Matrix: This shows the extent to which the uncertainties in pairs of fitted parameters are correlated. These are important if you are going to use two or more parameters from the same curve-fit in a subsequent calculation.
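To illustrate why the covariance matters in a subsequent calculation, the sketch below propagates the parameter uncertainties of a straight-line fit y = a + b*x into the predicted y at a chosen x0, using the usual first-order formula Var(y0) = Var(a) + x0^2*Var(b) + 2*x0*Cov(a,b). The variance and covariance numbers are invented for the example, not taken from any real fit.

```python
import math

def se_of_prediction(x0, var_a, var_b, cov_ab):
    """Standard error of y0 = a + b*x0, given the parameter covariance terms."""
    variance = var_a + x0 ** 2 * var_b + 2.0 * x0 * cov_ab
    return math.sqrt(variance)

# Hypothetical entries of a covariance matrix reported by a curve fitter
var_a = 0.04      # variance of the intercept, i.e. (S.E. of a)^2
var_b = 0.01      # variance of the slope, i.e. (S.E. of b)^2
cov_ab = -0.015   # covariance between a and b (error correlation of -0.75)

x0 = 2.0
with_cov = se_of_prediction(x0, var_a, var_b, cov_ab)
without_cov = se_of_prediction(x0, var_a, var_b, 0.0)
print(with_cov, without_cov)
```

With these made-up numbers, ignoring the covariance term roughly doubles the apparent standard error of the prediction, which is why the two S.E.'s alone are not enough when both parameters enter the same calculation.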
Nonlinear Functions:
A function is linear in the parameters if each parameter appears in the function as either:
· a constant term, or
· a term multiplying something that doesn't contain any parameters.
Polynomials:
These are curved lines, but are actually linear in the parameters.
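Because a polynomial is linear in its parameters, it can be fitted in a single step with no initial guesses and no iteration. The sketch below assumes a quadratic y = p0 + p1*x + p2*x^2 and solves the normal equations with a small Gaussian-elimination helper; the names and data are illustrative only, and a production program would more likely use a numerically sturdier method.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * p = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are not modified
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients [p0, p1, ...], lowest power first."""
    m = degree + 1
    # Each parameter multiplies a term (x**j) that contains no parameters
    basis = [[x ** j for j in range(m)] for x in xs]
    normal = [[sum(row[i] * row[j] for row in basis) for j in range(m)]
              for i in range(m)]
    rhs = [sum(row[i] * y for row, y in zip(basis, ys)) for i in range(m)]
    return solve_linear_system(normal, rhs)

# Noise-free hypothetical data generated from y = 1 - 2*x + 0.5*x^2
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 - 2.0 * x + 0.5 * x ** 2 for x in xs]

coeffs = fit_polynomial(xs, ys, 2)
print(coeffs)
```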
Transformations to Linear:
By algebraically manipulating the equation, you can sometimes turn it into one that is linear in the parameters:
Take the natural logarithm of each side:
This new equation is nonlinear in a, but is linear in Ln(a)
Weight Factors:
Whenever you transform the Y variable, you must also adjust the weight factors by calculating how an error in Y would propagate through the transformation.
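As a hedged illustration of that adjustment, suppose the transformation is Y' = Ln(Y), as in a log-linearized exponential model (an assumed example; any other transformation needs its own derivative). First-order error propagation gives S.E.(Ln Y) ≈ S.E.(Y) / Y, so points that were equally precise in Y are no longer equally precise in Ln(Y). The helper name and data below are hypothetical.

```python
import math

def transform_point(y, se_y):
    """Return ln(y), its propagated standard error, and the resulting weight."""
    ln_y = math.log(y)
    se_ln_y = se_y / y             # d(ln y)/dy = 1/y, times the S.E. of y
    weight = 1.0 / se_ln_y ** 2    # weight = 1 / (S.E.)^2
    return ln_y, se_ln_y, weight

# Three hypothetical measurements, all with the same S.E. of 2.0 in y
ys = [100.0, 50.0, 10.0]
se_y = 2.0

transformed = [transform_point(y, se_y) for y in ys]
for y, (ln_y, se_ln_y, weight) in zip(ys, transformed):
    print(y, ln_y, se_ln_y, weight)
```

The weights come out at about 2500, 625 and 25: after the log transformation the largest y-value carries roughly 100 times the weight of the smallest, even though all three had the same S.E. originally.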
Truly Nonlinear Functions
No algebraic manipulations can transform this equation into one that is linear in all three parameters.
Iteration
Nonlinear least squares curve fitting is an iterative process:
· Start with initial guesses
· Refine those guesses to produce estimates
· Repeat the refinement until there is no further improvement
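The three steps above can be sketched with one common refinement scheme, Gauss-Newton, applied to a decay model A = A0*exp(-k*t). This is only an illustration under stated assumptions: the notes do not say which algorithm any particular program uses, and the data here are noise-free values generated from A0 = 100 and k = 0.5.

```python
import math

def gauss_newton(ts, ys, a0, k, max_iter=50, tol=1e-9):
    """Refine (a0, k) until the parameter steps become negligible."""
    for iteration in range(1, max_iter + 1):
        s11 = s12 = s22 = g1 = g2 = 0.0
        for t, y in zip(ts, ys):
            e = math.exp(-k * t)
            residual = y - a0 * e   # observed minus calculated
            d_a0 = e                # partial derivative of the model w.r.t. a0
            d_k = -a0 * t * e       # partial derivative of the model w.r.t. k
            s11 += d_a0 * d_a0
            s12 += d_a0 * d_k
            s22 += d_k * d_k
            g1 += d_a0 * residual
            g2 += d_k * residual
        # Solve the 2x2 linearized least-squares system for the step
        det = s11 * s22 - s12 * s12
        if det == 0.0:
            raise ArithmeticError("singular matrix: cannot refine the estimates")
        step_a0 = (s22 * g1 - s12 * g2) / det
        step_k = (s11 * g2 - s12 * g1) / det
        a0 += step_a0
        k += step_k
        if abs(step_a0) <= tol * abs(a0) and abs(step_k) <= tol * abs(k):
            return a0, k, iteration
    raise RuntimeError("no convergence: try better initial guesses")

# Hypothetical noise-free data from A0 = 100, k = 0.5
ts = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [100.0 * math.exp(-0.5 * t) for t in ts]

# Start with initial guesses, then let the loop refine them
a0_fit, k_fit, n_iter = gauss_newton(ts, ys, a0=80.0, k=0.3)
print(a0_fit, k_fit, n_iter)
```

Plain Gauss-Newton takes the full step every time, so it can diverge from poor starting values; more robust programs typically damp or limit the step.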
Things that can go wrong:
Divergence:
Divergence is the failure of the parameters to converge to a meaningful solution.
Divergence usually results from poor initial guesses. More sophisticated software tolerates worse initial guesses than more primitive software does.
Invalid formulation of the function:
You may be using two different parameters to do the job of one:
In this formula, a and c are both the intercept, and are said to be redundant, or degenerate. This will produce a "singular matrix", often indicated as a "division by zero" error.
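A small sketch of how this redundancy leads to a singular matrix, assuming the formula in question has the form y = a + b*x + c (an assumption made for illustration). The parameters a and c each contribute an identical column of ones to the least-squares equations, so the matrix that has to be inverted has a determinant of zero.

```python
def determinant_3x3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def normal_matrix(xs):
    """Normal-equations matrix for the assumed model y = a + b*x + c."""
    # Derivatives of y with respect to a, b and c are 1, x and 1
    rows = [[1, x, 1] for x in xs]
    return [[sum(r[i] * r[j] for r in rows) for j in range(3)]
            for i in range(3)]

xs = [1, 2, 3, 4]   # hypothetical x-values
matrix = normal_matrix(xs)
det = determinant_3x3(matrix)
print(matrix, det)
```

A zero determinant means the equations have no unique solution: only the sum a + c is determined by the data, not a and c separately.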
With nonlinear functions, the redundancy may not be as obvious:
In this formula, a and c are also degenerate, although it is not as obvious why.
Insufficient or poorly-distributed points:
This may show up as very large uncertainties in the estimated parameters, or very high intercorrelations among the parameter error estimates.
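For a straight line with equally precise points, both symptoms can be computed from the x-values alone, using the standard results S.E.(slope) = sigma / sqrt(Sxx) and corr(a, b) = -mean(x) / sqrt(mean(x^2)). The sketch below compares five tightly clustered x-values with five well-spread ones; the numbers are hypothetical and sigma is an assumed per-point standard error.

```python
import math

def slope_standard_error(xs, sigma):
    """S.E. of the fitted slope when every y has standard error sigma."""
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sigma / math.sqrt(sxx)

def intercept_slope_correlation(xs):
    """Correlation between the errors in the fitted intercept and slope."""
    n = len(xs)
    x_mean = sum(xs) / n
    mean_x_squared = sum(x * x for x in xs) / n
    return -x_mean / math.sqrt(mean_x_squared)

sigma = 1.0
clustered = [4.8, 4.9, 5.0, 5.1, 5.2]     # poorly distributed: all near x = 5
spread_out = [0.0, 2.5, 5.0, 7.5, 10.0]   # same count, wide range

se_clustered = slope_standard_error(clustered, sigma)
se_spread = slope_standard_error(spread_out, sigma)
corr_clustered = intercept_slope_correlation(clustered)
corr_spread = intercept_slope_correlation(spread_out)
print(se_clustered, se_spread, corr_clustered, corr_spread)
```

With the clustered design the slope is far less certain and the intercept and slope errors are almost perfectly correlated, which is the pattern described above.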