What is ridge regression


 

1.1 The general form of linear regression
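The formula itself did not survive extraction; a standard formulation (a sketch using the common convention of m training samples and the 1/2m factor, which varies by author) is:

$$ h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^{T} x $$

$$ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^{2} $$

Training the model means finding the parameters θ that minimize the squared-error loss J(θ).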

1.2 Problems that linear regression may encounter

  • There are two ways to find the minimum of the loss function: gradient descent and the normal equation (a closed-form solution); a minimal sketch comparing the two appears after this list.
  • Feature scaling: normalizing the feature data. Feature scaling has two advantages. First, it speeds up the model's convergence: take two features whose value ranges differ greatly as an example; plotting the loss contours with these two features as the horizontal and vertical axes gives flat ellipses, so gradient descent tends to follow a zigzag path perpendicular to the contours and iterates slowly. After the features are normalized, the contours become roughly circular, the gradient points toward the center, and iteration is much faster. Second, it can improve the model's accuracy.
  • Choice of the learning rate α: if α is too small, the number of iterations grows and convergence is slow; if α is too large, the algorithm may overshoot the optimum and fail to converge at all.
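The following minimal NumPy sketch (not from the original article; the data and variable names are illustrative) contrasts the normal equation with batch gradient descent on standardized features:

import numpy as np

# Toy data: 100 samples, 2 features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 1000, 100)])
y = 3.0 + 2.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.1, 100)

# Feature scaling (standardization) speeds up gradient descent convergence.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
Xb = np.column_stack([np.ones(len(X_scaled)), X_scaled])  # add bias column

# 1) Normal equation: theta = (X^T X)^{-1} X^T y
theta_ne = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# 2) Batch gradient descent with learning rate alpha
theta_gd = np.zeros(Xb.shape[1])
alpha = 0.1
for _ in range(1000):
    grad = Xb.T @ (Xb @ theta_gd - y) / len(y)
    theta_gd -= alpha * grad

print(theta_ne, theta_gd)  # the two solutions should nearly coincide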

1.3 Overfitting and its solutions

If the sample has many features and the number of samples is relatively small, the model tends to overfit. To solve the overfitting problem, there are two methods:

Method 1: Reduce the number of features (manually select key features to keep, or you can do this via the PCA algorithm which discards some of the information).

Method 2: Regularization. All features are retained, but the magnitude of the parameters θ in front of the features is reduced, specifically by modifying the loss function of linear regression, e.g. ridge regression and lasso regression.

1.4 Simple linear regression code example
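The original code example did not survive extraction; below is a minimal scikit-learn sketch of ordinary linear regression (the synthetic dataset and variable names are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 4 + 3x + noise
rng = np.random.default_rng(42)
X = 2 * rng.random((200, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

print("intercept:", model.intercept_)   # should be close to 4
print("coefficient:", model.coef_)      # should be close to 3
print("R^2 on test set:", model.score(X_test, y_test))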

2. Regularization

Regularization implements the strategy of structural risk minimization (loss function + regularization term): a regularization term is added to the empirical risk (the average loss over the training data). The purpose of regularization is to select a model with both low empirical risk and low model complexity.

Why this prevents overfitting: the regularization term is generally a monotonically increasing function of model complexity, while the empirical risk is responsible for minimizing the training error and keeping the model's bias as small as possible. The smaller the empirical risk, the more complex the model tends to be, and the larger the regularization term becomes. Since the regularization term must also be kept small, model complexity is limited, which effectively prevents overfitting.


3. Regularization of the linear regression

In general, regularized linear regression has the following optimization objective:

$$ \min_{\theta} \; J(\theta) + \lambda \, \Omega(\theta) $$

where J(θ) is the empirical risk (the loss function of linear regression), Ω(θ) is the regularization term, and λ is a coefficient that is used to balance the regularization term and the empirical risk.

The L1 regularization and the L2 regularization are defined as follows:

  • L1 regularization refers to the sum of the absolute values of the elements of the weight vector w, usually written as ∣∣w∣∣1.
  • L2 regularization refers to the square root of the sum of the squares of the elements of the weight vector w, usually written as ∣∣w∣∣2 (a quick numerical check of both norms follows this list).
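A small NumPy sketch of the two norms (not from the original article; the vector is illustrative):

import numpy as np

w = np.array([3.0, -4.0, 0.0])

l1 = np.abs(w).sum()            # |3| + |-4| + |0| = 7.0, i.e. ||w||_1
l2 = np.sqrt((w ** 2).sum())    # sqrt(9 + 16 + 0) = 5.0, i.e. ||w||_2

print(l1, np.linalg.norm(w, 1))  # both 7.0
print(l2, np.linalg.norm(w, 2))  # both 5.0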

The role of L1 regularization and L2 regularization:

  • L1 regularization can produce a sparse weight matrix, i.e. a sparse model in which many coefficients are exactly zero, which can be used for feature selection.
  • L2 regularization can prevent the model from overfitting; to a certain extent, L1 regularization can also prevent overfitting.

Both L1 norm regularization and L2 norm regularization help reduce the risk of overfitting. The L2 norm is the square root of the sum of the squared elements of the parameter vector; minimizing it drives each parameter element close to 0 but not exactly equal to 0, whereas minimizing the L1 norm tends to drive some elements exactly to 0.

We consider the simplest linear regression model.
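Written out explicitly (a sketch using the conventions above; λ is the balance coefficient), the plain and regularized objectives are:

Ordinary least squares: $$ \min_{w} \; \frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)} - w^{T}x^{(i)}\right)^{2} $$

Ridge regression (L2 penalty): $$ \min_{w} \; \frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)} - w^{T}x^{(i)}\right)^{2} + \lambda \lVert w \rVert_{2}^{2} $$

LASSO regression (L1 penalty): $$ \min_{w} \; \frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)} - w^{T}x^{(i)}\right)^{2} + \lambda \lVert w \rVert_{1} $$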

4. Ridge regression solution (Ridge)

Ridge regression does not discard any feature; instead it shrinks the regression coefficients. Ridge regression can be solved in the same ways as ordinary linear regression (a closed-form solution or gradient descent).
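For reference, the closed-form (normal-equation style) solution of ridge regression, stated here as a sketch of the standard result, is:

$$ \hat{w} = \left(X^{T}X + \lambda I\right)^{-1} X^{T} y $$

compared with $\hat{w} = (X^{T}X)^{-1}X^{T}y$ for ordinary least squares; the added λI also keeps the matrix invertible when features are strongly correlated.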

Sample ridge regression code
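The original ridge regression code did not survive extraction; a minimal scikit-learn sketch follows (the dataset and the alpha value are illustrative; sklearn's alpha plays the role of λ above):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic data with nearly collinear features, where plain least squares is unstable.
rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 300)
x2 = x1 + rng.normal(0, 0.01, 300)           # almost identical to x1
X = np.column_stack([x1, x2])
y = 1.5 * x1 - 0.5 * x2 + rng.normal(0, 0.1, 300)

# Ridge keeps every feature but shrinks the coefficients.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

print(model.named_steps["ridge"].coef_)      # shrunken, stable coefficients
print(model.score(X, y))                     # R^2 on the training data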

 

5. LASSO regression solution

Because the L1 norm involves absolute values, the LASSO optimization objective is not continuously differentiable, so the closed-form least squares solution, gradient descent, Newton's method, and quasi-Newton methods cannot be applied directly.

The L1-regularized problem can instead be solved with Proximal Gradient Descent (PGD).
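A sketch of the standard PGD update for the L1-regularized least-squares objective (η is the step size; this is the usual soft-thresholding form, not code from the original article):

$$ z = w^{(t)} - \eta \, \nabla J\!\left(w^{(t)}\right), \qquad w^{(t+1)}_{j} = \operatorname{sign}(z_{j}) \cdot \max\!\left(\lvert z_{j} \rvert - \eta\lambda,\; 0\right) $$

where J(w) is the smooth least-squares part. Each iteration takes a gradient step on J and then applies the soft-thresholding (proximal) operator of the L1 term, which is what drives some coefficients exactly to zero.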

Example code for Lasso regression:
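A minimal scikit-learn sketch (note that sklearn's Lasso uses coordinate descent rather than PGD; the dataset and alpha value are illustrative):

import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only 3 of the 20 features are actually informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[[0, 5, 10]] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(0, 0.1, 200)

model = Lasso(alpha=0.1)
model.fit(X, y)

print(model.coef_)                                         # most coefficients are exactly 0
print("selected features:", np.flatnonzero(model.coef_))   # sparsity enables feature selection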