Segmented regression

Segmented linear regression to detect relations and breakpoints despite scatter

Purpose
In statistics, regression analysis (Snedecor 1986) is done to detect a mathematical relation between several series of measured quantities ("variables"), especially when the relation is obscured by random variation. Linear regression analysis is done to detect a "straight line" relationship between the variables. Linear regression can be done "in one go" with all the data, or by parts ("segments") of the data set. The segments are introduced to see whether there are abrupt changes in the relation under investigation.

The results of regression analysis are used, among other things, in scientific research, technical design, and the practical planning of activities.

Definitions
Segmented regression is a method in statistical regression analysis whereby the independent variable(s) are segmented (divided into sequential groups according to their value) and the regression analysis is performed separately for the segments. The boundaries between the segments may be called breakpoints (break points). The resulting regression equations may show a discontinuity at the breakpoints. A variable is anything that may have varying values (e.g. annual rainfall, length of people).

Segmented linear regression is segmented regression whereby the regression within the segments is linear.

It is not recommended to carry out segmented nonlinear regressions, because it is usually not useful to break up the curved regression function into (discontinuous) bits and pieces. Moreover, by testing the improvement of the segmented regression over the non-segmented regression, one will often find that it is statistically insignificant. For the same reason it is also not recommended to carry out a segmented linear regression with many breakpoints.

Regression analysis is done with at least two variables: the independent variable (say x) and the dependent or response variable (say y). This presupposes that the x value influences the y value and there is a cause-effect relation. In such a case one performs the regression of y upon x using the least squares method, whereby the sum of squares of deviations of the regression function from the observed y values is minimized so that one obtains the closest fit. In symbols: if Yx is the value of y at x according to the regression equation, then the sum of squares of (Yx - y) is minimized.

Representing the linear regression equation by Yx = A.x + B, the factor A is called the regression coefficient and the term B is called the regression constant or intercept. The term intercept means that B is equal to Yx when x = 0: in a graph of the equation, the straight line intercepts the Y axis at an elevation B. In that same graph the coefficient A represents the slope of the line.
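As an illustration (not taken from the cited publications), the least-squares fit of Yx = A.x + B has a well-known closed-form solution that can be sketched in a few lines of Python; all names and the sample data are illustrative:

```python
def linear_fit(x, y):
    """Return (A, B) minimizing the sum of squares of (A*x + B - y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sxy: sum of cross-products of deviations; Sxx: sum of squared x deviations
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    A = sxy / sxx             # regression coefficient (slope)
    B = mean_y - A * mean_x   # regression constant (intercept)
    return A, B

# Sample data scattered around the line y = 2x
A, B = linear_fit([1, 2, 3, 4, 5], [2.1, 4.0, 5.9, 8.1, 10.0])
print(round(A, 3), round(B, 3))   # A is close to 2, B close to 0
```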

The least squares method must be handled with care. It may be that y and x have no causal relation, but that some other factor influences both. For example, car sales and the number of holiday trips to tropical paradises both increased in the years 1950 to 2000, and both were influenced by economic development, but it cannot be concluded that the car industry could promote sales by organizing trips to tropical paradises. On the contrary, the trips may have been made because the travelers' savings were not invested in a new car. In the absence of a causal relation one could do both the regression of y upon x and, reversely, of x upon y, and use the geometric mean of the respective regression coefficients as the "true" coefficient. This may be called a two-way regression.

When the number of independent variables is greater than 1, one speaks of multiple regression or multiple linear regression. It is not recommended to use large numbers of independent variables. On the one hand, when a few of the independent variables (say 2) explain the dependent variable clearly, chances are high that all other independent variables make statistically insignificant contributions. On the other hand, when many independent variables together give a fair degree of explanation while the contribution of each individual variable is insignificant, the regression model obtained is not very useful, especially when there are mutually dependent independent variables or when some of the independent variables have no direct causal relation with the response variable.

Segmented linear regression
with 1 or 2 independent variables using 1 breakpoint

Segmented regression can be useful to detect an abrupt change of the response function at an increase or decrease of an influential factor. The breakpoint can be taken as a critical or safe value beyond or below which (un)desired effects occur. The breakpoint can be important in decision making.
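A minimal brute-force sketch of segmented linear regression with one breakpoint (this is an illustration, not the SegReg program): sort the data by x, try each candidate split, fit a separate line on each side, and keep the split with the smallest total sum of squared residuals. Reporting the last x value of the left segment as the breakpoint is a simplification chosen here for brevity:

```python
def segmented_fit(x, y):
    """Fit two lines separated by one breakpoint, chosen by exhaustive search.
    Assumes distinct x values and at least 2 points on each side."""
    def fit(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((xi - mx) ** 2 for xi in xs)
        slope = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sxx
        inter = my - slope * mx
        sse = sum((slope * xi + inter - yi) ** 2 for xi, yi in zip(xs, ys))
        return slope, inter, sse

    pts = sorted(zip(x, y))
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    best = None
    for k in range(2, len(pts) - 1):       # keep >= 2 points on each side
        a1, b1, sse1 = fit(xs[:k], ys[:k])
        a2, b2, sse2 = fit(xs[k:], ys[k:])
        total = sse1 + sse2
        if best is None or total < best[0]:
            best = (total, xs[k - 1], (a1, b1), (a2, b2))
    total, bp, left, right = best
    return bp, left, right, total
```

In practice one would follow such a search with the significance tests described below; when the improvement over a single regression line is insignificant, the breakpoint should be rejected.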

If the reader wishes to know how linear regression is used in the segmented approach, he or she may consult Chapter 6, "Frequency and Regression Analysis", in ILRI Publication 16, "Drainage Principles and Applications" (R.J. Oosterbaan 1994), which can be viewed in and freely downloaded from the ILRI-Alterra website or from the Articles page of the waterlog.info website.

Further, a free paper, "Crop production and soil salinity: evaluation of field data from India" (R.J. Oosterbaan et al. 1990), can be found on the SegReg page of the waterlog.info website.

The paper "Data analysis in drainage research", to be found on the Articles page of the above-mentioned waterlog.info site, gives examples of the application of SegReg, a computer program designed for segmented linear regression with 1 or 2 independent variables using 1 breakpoint. The paper shows the numerous types of trends that can be detected. In the determination of the most suitable trend, statistical tests must be performed to ensure that the trends are reliable (see also the figures attached here at the right). For example, when no significant breakpoint can be detected, one must fall back on a regression without a breakpoint.

The SegReg program can be freely downloaded from the software page of website waterlog.info.

In the SegReg program the following statistical tests are used to determine the type of trend:


 * 1) significance of the breakpoint (BP), by expressing BP as a function of the regression coefficients A1 and A2, the means Y1 and Y2 of y, and the means X1 and X2 of x (left and right of BP), using the laws of propagation of errors in additions and multiplications to compute the standard error (SE) of BP, and applying Student's t-test
 * 2) significance of A1 and A2, applying Student's t-distribution and the SE of A1 and A2
 * 3) significance of the difference between A1 and A2, applying Student's t-distribution and the SE of the difference
 * 4) significance of the difference between Y1 and Y2, applying Student's t-distribution and the SE of the difference.
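The test of the difference between the slopes A1 and A2 can be sketched as follows (an illustration, not the SegReg code; a complete test would compare the resulting t value against Student's t-distribution with the appropriate degrees of freedom):

```python
import math

def slope_and_se(x, y):
    """OLS slope of one segment and its standard error (needs n > 2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    inter = my - slope * mx
    sse = sum((slope * xi + inter - yi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return slope, se

def slope_difference_t(x1, y1, x2, y2):
    """t-statistic for the difference of the two segment slopes,
    using the standard error of the difference."""
    a1, se1 = slope_and_se(x1, y1)
    a2, se2 = slope_and_se(x2, y2)
    return (a1 - a2) / math.sqrt(se1 ** 2 + se2 ** 2)
```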

In addition, use is made of the correlation coefficient, the coefficient of determination (explanation), confidence intervals of the regression functions, and analysis of variance (ANOVA).

An explanation of ANOVA (analysis of variance) in linear regression can be found in the answer to question 12 on the FAQs page of the earlier mentioned website.

The coefficient of explanation (CE) is found as 1 minus the minimized sums of squares of (Yx - y) at both sides of the BP, added together and divided by the overall sum of squares of deviations of the y values from their mean. In a linear regression, the CE is equal to the squared value of the correlation coefficient. In nonlinear regressions, the correlation coefficient has no meaning, but the CE does. In segmented regression the correlation coefficient has a meaning within the domains to the left and right of BP, but it has no meaning for the overall result, while the CE does.
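In code, this definition of CE is short (a sketch; the function names are illustrative, and the residuals are taken to be the values of Yx - y from the two segment fits):

```python
def coefficient_of_explanation(y_all, residuals_left, residuals_right):
    """CE = 1 - (sum of squared residuals on both sides of the breakpoint)
    / (total sum of squared deviations of y from its overall mean)."""
    my = sum(y_all) / len(y_all)
    ss_total = sum((yi - my) ** 2 for yi in y_all)
    ss_res = (sum(r ** 2 for r in residuals_left)
              + sum(r ** 2 for r in residuals_right))
    return 1 - ss_res / ss_total
```

A perfect two-segment fit (all residuals zero) gives CE = 1; a fit no better than the overall mean gives CE = 0.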