Simple Linear Regression
Linear regression is one of the foundational techniques of statistics, offering a simple yet effective way to understand and predict the relationship between variables.
"Linear Regression: The straight path to predictive success".
The term "regression" was first used in statistics by Sir Francis Galton in the late 19th century. It literally means "passing back" and refers to estimating the average value of one variable for a given value of another. Galton used the term in his 1886 article "Regression towards mediocrity in hereditary stature."
Regression is a statistical method used to analyze the relationship between a dependent variable (also called the response variable or outcome variable) and one or more independent variables (also called predictor variables or explanatory variables). The goal of regression is to understand how changes in the independent variable(s) affect the dependent variable.
Let us consider two continuous variables x and y, where we want to predict the value of y given the value of x. Let us first understand the problem statement with an example. We hypothesize that CO2 emission in a particular city depends on the number of cars in the city. Our goal is to estimate the CO2 emission for a given number of cars. In this case, the number of cars becomes the independent variable, say X, and CO2 emission is the dependent variable, say Y. We need to define a formula that gives us Y when we plug in a value of X.
Y ∝ X, or Y = m·X
The above equation says that Y depends on X by some factor m. Hence, to know the exact value of Y for a given X, we need to estimate m. But there is an issue with this equation: for X = 0, Y is also going to be zero. In the real world this may not be the case; we might see CO2 emission even if there are no cars in the city, or in other words, cars are not the only source of CO2 emission. Hence a bias needs to be added, and we modify the above equation as
Y = m·X + b
We add b to the equation to include the possibility that when X is 0, Y may have some nonzero value. This term is called the bias. The relationship between X and Y in the above equation is linear, because m describes a constant change in Y per unit change in X. If we take X as a series of numbers where two consecutive numbers are infinitesimally close, the points placed in two-dimensional space will form a straight line. In this equation, m is called the slope or weight, and the equation represents a line.
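The relation Y = m·X + b can be sketched directly in code. This is a minimal illustration; the slope m = 2.0 and bias b = 5.0 below are made-up values, not estimates from any real data.

```python
def predict(x, m=2.0, b=5.0):
    """Return y = m*x + b for illustrative values of m and b."""
    return m * x + b

# With no cars (x = 0), the prediction equals the bias b alone.
print(predict(0))   # 5.0
print(predict(10))  # 25.0
```

Note that without the bias term, predict(0) would be forced to 0, which is exactly the limitation discussed above.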
What is Line?
A line is a straight one-dimensional geometric shape that extends infinitely in both directions. It is represented mathematically by the equation mentioned above, y = mx + b, where m is the slope of the line, x is the independent variable, y is the dependent variable, and b is the y-intercept. The slope represents the rate of change of the dependent variable with respect to the independent variable, and the y-intercept represents the expected value of the dependent variable when the independent variable is equal to zero. In linear regression, the goal is to find the best-fit line that describes the relationship between the independent variable and the dependent variable.
What is Slope?
The slope of a line is a measure of the steepness or incline of the line. It is the ratio of the vertical change (rise) to the horizontal change (run) between two points on the line. The slope of a line is also known as its gradient. In the above example, let dx and dy be the change in the x and y directions respectively between two points. The slope m is then dy/dx.
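The rise-over-run definition can be written as a one-line helper. The two points used here are arbitrary examples.

```python
def slope(x1, y1, x2, y2):
    """Slope between two points: rise (dy) over run (dx)."""
    return (y2 - y1) / (x2 - x1)

# Between (1, 3) and (4, 9): y rises by 6 while x runs by 3.
print(slope(1, 3, 4, 9))  # 2.0
```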
Making predictions using line
Below is the scatter plot for two continuous variables X and Y. Each datapoint is a single measurement of each variable. In our example, X is the number of cars and Y is the amount of CO2 emission. Now consider: given the X value of an observation, how do we find its Y value? Imagine we don't know the Y value of point p in figure B below. If we have a line that represents all these observations, then it is easy to find the Y value: we project the x onto the line and read off the estimate for Y. This is done by plugging the X value into the equation of that particular line. But the question is: how do we find the best-fit line for a given dataset?
Least Square method
The least squares method is a statistical technique for finding the best-fitting line through a set of data points. It is the standard approach in linear regression: the goal of fitting the model is to estimate the values of the slope and intercept that minimize the sum of squared differences between the observed data points and the line, known as the "sum of squared errors."
The least squares method works by finding the line that minimizes the difference between the observed values and the values predicted by the line. The predicted values are found by plugging the x-values of the observed data into the equation of the line and solving for y. The difference between the observed and predicted values is then squared (Figure E), and the sum of these squared differences is used to find the best-fitting line. To find the line that minimizes the sum of the squared differences, the method solves a set of simultaneous equations that represent the relationships between the observed data and the parameters of the line (intercept and slope). This can be done using matrix algebra or numerical optimization techniques such as gradient descent.
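For a single predictor, the simultaneous equations mentioned above have a well-known closed-form solution: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − m·x̄. The sketch below implements it in plain Python; the data points are made up so that they lie exactly on y = 2x + 1, which makes the expected result easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor (closed form)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator: sum of cross-deviations; denominator: sum of
    # squared deviations of x. Their ratio is the slope m.
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    m = num / den
    b = y_mean - m * x_mean
    return m, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]      # exactly y = 2x + 1
m, b = fit_line(xs, ys)
print(m, b)                # 2.0 1.0
```

On real, noisy data the fitted line will not pass through every point; it is simply the line with the smallest sum of squared residuals.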