
Regression analysis functions

Mats Alm edited this page Jul 22, 2024 · 7 revisions

On this page we have documented the EPPlus implementations of some of the more complex statistical functions. Note that in some edge cases the functions below might return different results depending on which spreadsheet application (Microsoft Excel, LibreOffice, etc.) you are using.

LINEST

Returns statistics for a straight line that best fits the input data. The function can also calculate the statistics for multiple linear regression when there are several independent variables in the input data (more than one column/row in knownXs). The inputs for LINEST are knownYs (required), knownXs (optional), const (optional) and stats (optional). knownYs contains the y-coordinates and knownXs the x-coordinates. If knownXs is omitted, it is assumed to be the array 1, 2, ..., {length of knownYs}. Const is a boolean flag; if FALSE, it forces the intercept through the origin. Stats determines whether descriptive statistics are included (stats = TRUE).

The statistics calculated with LINEST are the following:

  • Coefficient(s) for all independent variables. Variables deemed collinear are represented by a zero.
  • Standard error for all coefficients, including the intercept. The standard error for the intercept is set to zero if const = FALSE.
  • R-squared (coefficient of determination). This value tells us how well the independent variables explain the dependent variable. This is calculated as the sum of squares regression divided by the sum of squares total.
  • Standard error for the y-estimate.
  • F-statistic. Do keep in mind that LINEST calculates the observed F-value and not an F-test.
  • Degrees of Freedom.
  • The sum of squares regression (ssreg).
  • The residual sum of squares (ssresid).
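As a rough illustration of how these statistics relate to each other, the simple-regression case (one independent variable, const = TRUE) can be sketched in Python/NumPy. The function and variable names here are ours, not EPPlus's, and this is not the library's actual implementation:

```python
import numpy as np

def linest_stats(ys, xs):
    """Sketch of the LINEST statistics for simple linear regression
    (one independent variable, intercept included)."""
    ys = np.asarray(ys, dtype=float)
    n = len(ys)
    # design matrix: one x-column plus a column of ones for the intercept
    X = np.column_stack([np.asarray(xs, dtype=float), np.ones(n)])
    coeffs, *_ = np.linalg.lstsq(X, ys, rcond=None)
    fitted = X @ coeffs
    ssresid = np.sum((ys - fitted) ** 2)      # residual sum of squares
    sstotal = np.sum((ys - ys.mean()) ** 2)   # total sum of squares
    ssreg = sstotal - ssresid                 # sum of squares regression
    df = n - 2                                # observations minus parameters
    r2 = ssreg / sstotal                      # coefficient of determination
    sey = np.sqrt(ssresid / df)               # standard error of the y-estimate
    # standard errors of the coefficients from the inverse moment matrix
    cov = sey ** 2 * np.linalg.inv(X.T @ X)
    se = np.sqrt(np.diag(cov))
    fstat = (ssreg / 1) / (ssresid / df)      # observed F-value, not an F-test
    return coeffs, se, r2, sey, fstat, df, ssreg, ssresid
```

For example, `linest_stats([2, 4, 5, 4, 5], [1, 2, 3, 4, 5])` yields a slope of 0.6, an intercept of 2.2 and an R-squared of 0.6.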

In some cases, an independent variable can be fully explained by another independent variable. This is called collinearity and poses a problem, since the parameter estimation then has infinitely many solutions and the moment matrix X'X cannot be inverted because it is singular. In this situation, LINEST removes collinear columns by transforming the moment matrix to echelon form with complete pivoting. Removing collinear columns affects the degrees of freedom of the regression model and some of the statistics.

Multicollinearity

Multicollinearity occurs when two or more independent variables are highly correlated. For a case with two collinear variables, this implies that one column can be expressed as a function of the other, making one of the columns redundant. LINEST removes redundant variables from the regression model to handle multicollinearity.

EPPlus uses a Gaussian elimination process to identify collinear columns and decide which ones to remove. The idea is to reduce the matrix X^T * X to echelon form using complete pivoting. After the row and column swaps, if a value on the diagonal is zero (the EPPlus LINEST implementation uses a threshold of 2e-13), the corresponding column is deemed redundant. In some rare cases the result might differ from the LINEST implementation in Microsoft Excel.

High multicollinearity between variables is fairly uncommon, and there is no standard practice for choosing which collinear variables to remove. When pivoting between equal columns, EPPlus always chooses the second column.
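The elimination idea described above can be sketched as follows. This is a simplified illustration of the technique, not EPPlus's actual implementation; in particular, the pivot order and tie-breaking between equal columns may differ from what the library does:

```python
import numpy as np

def collinear_columns(X, tol=2e-13):
    """Flag redundant (collinear) columns of X by reducing the moment
    matrix X'X to echelon form with complete pivoting. A column is deemed
    redundant when its pivot falls below the threshold tol."""
    M = (X.T @ X).astype(float)      # moment matrix
    n = M.shape[0]
    cols = list(range(n))            # track column swaps back to original indices
    redundant = []
    for k in range(n):
        # complete pivoting: largest absolute value in the remaining submatrix
        sub = np.abs(M[k:, k:])
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        i += k
        j += k
        if abs(M[i, j]) < tol:
            # all remaining pivots are numerically zero:
            # the remaining (permuted) columns are redundant
            redundant.extend(cols[k:])
            break
        # swap rows and columns to bring the pivot to position (k, k)
        M[[k, i], :] = M[[i, k], :]
        M[:, [k, j]] = M[:, [j, k]]
        cols[k], cols[j] = cols[j], cols[k]
        # eliminate entries below the pivot
        for r in range(k + 1, n):
            M[r, k:] -= (M[r, k] / M[k, k]) * M[k, k:]
    return sorted(redundant)
```

With two identical x-columns, the sketch flags one of them as redundant; with linearly independent columns, it flags none.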

LOGEST

LOGEST is based on the same calculations as LINEST, including how collinearity is handled for multiple regression analysis. However, the function returns statistics for a curve that best fits the data, rather than a straight line. The LOGEST coefficients are calculated as EXP(LINEST(LN(knownYs), knownXs)). The function takes the same arguments and returns the same outputs as LINEST.
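The relationship between LOGEST and LINEST can be illustrated for the single-variable case, where LOGEST fits the curve y = b * m^x. The sketch below uses our own illustrative names, not EPPlus's API:

```python
import numpy as np

def logest_coefficients(ys, xs):
    """Fit y = b * m**x by regressing ln(y) on x:
    ln(y) = ln(b) + x * ln(m), then exponentiate the linear
    coefficients -- the EXP(LINEST(LN(y), x)) idea."""
    ln_ys = np.log(np.asarray(ys, dtype=float))
    X = np.column_stack([np.asarray(xs, dtype=float),
                         np.ones(len(ln_ys))])
    (slope, intercept), *_ = np.linalg.lstsq(X, ln_ys, rcond=None)
    m = np.exp(slope)      # growth factor per unit of x
    b = np.exp(intercept)  # value at x = 0
    return m, b
```

For data that doubles at each step starting from 3, e.g. ys = [3, 6, 12, 24] with xs = [0, 1, 2, 3], the sketch recovers m = 2 and b = 3.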

TREND

This function takes four arguments: knownYs, knownXs, newXs and const. TREND calculates the y-values on the regression line fitted to knownYs and knownXs. If newXs is given, TREND returns the y-values corresponding to those x-values. TREND uses LINEST to construct the regression line and find the necessary coefficients.

The dimensions of newXs must correspond to those of knownXs in terms of variables, but newXs can contain fewer observations. For example, if knownXs is 4 columns wide and 20 rows long, newXs must be 4 columns wide and between 1 and 20 rows long.

If const is set to FALSE, the regression line is forced through the origin. If knownXs or newXs is omitted, it is assumed to be the array 1, 2, ..., {length of knownYs}. For more information on how the regression model works, see the LINEST section above.
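The idea behind TREND can be sketched for the single-variable case as follows. This is an illustration of the technique, not EPPlus's implementation, and the names are ours:

```python
import numpy as np

def trend(known_ys, known_xs, new_xs, const=True):
    """Fit a least-squares line to (known_xs, known_ys) and evaluate
    it at new_xs -- the idea behind the TREND function."""
    ys = np.asarray(known_ys, dtype=float)
    xs = np.asarray(known_xs, dtype=float)
    nx = np.asarray(new_xs, dtype=float)
    if const:
        # slope and intercept
        X = np.column_stack([xs, np.ones(len(ys))])
        N = np.column_stack([nx, np.ones(len(nx))])
    else:
        # const = FALSE: force the line through the origin (slope only)
        X = xs.reshape(-1, 1)
        N = nx.reshape(-1, 1)
    coeffs, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return N @ coeffs
```

For example, fitting to knownYs = [2, 4, 6] with knownXs = [1, 2, 3] and evaluating at newXs = [4, 5] returns [8, 10].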
