Log Linear Model
A log linear model is not the same thing as logistic regression. Despite the similar name, it is meant for regression analysis, not classification problems.
Table of Contents
Frequently Asked Questions
Implementing a log linear model in R & Python
Linear Model (Regression)
Log Transformation
Independent variables & log linear models
1. Frequently Asked Questions
1.) Why do we use log in linear regression?
The logarithm is often used in linear regression because it "linearizes" the relationship between the predictor (X) and the response (Y) variables. For example, if Y is a power function of X, then the logarithm of Y is a linear function of the logarithm of X.
This is important because it means that we can use linear regression to model the relationship between Y and X without having to worry about non-linearities in the data. In other words, if we use the logarithm to transform our data, then we can safely assume that the relationship between Y and X is linear.
This is a very powerful assumption because it allows us to use standard techniques for modelling linear relationships.
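For example, if the underlying relationship is a power law, taking logs of both sides turns it into a straight line:

$$Y = aX^b \;\Longrightarrow\; \log Y = \log a + b \log X$$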
2.) What does log-linear model do?
A log-linear model is a machine learning model used to model the relationship between two or more variables. The log-linear model is especially useful for capturing relationships that are nonlinear (i.e., not a straight line) in nature.
The linear part of the name refers to the fact that the model assumes a linear relationship between the transformed variables. The log part of the name refers to the fact that the model applies a logarithmic transformation to the data before fitting a linear equation. This transformation helps to account for nonlinear relationships in the data, and results in a more accurate prediction of those relationships.
3.) What does it mean if data is logarithmic?
A logarithmic scale is one in which equal distances represent equal ratios rather than equal differences. So, for example, if you were to graph the number of Facebook users over time on a linear scale, the curve would look almost flat in the early years and then shoot upward as more and more people signed up. But if you graphed the number of Facebook users on a logarithmic scale, that steady percentage growth would show up as a roughly straight line. This is because a logarithmic scale compresses large values and spreads out small ones.
In data science, logarithmic data is often used when there is a power law relationship between two variables. A power law means that one variable is proportional to a power (exponent) of the other, so you will want to use some sort of log transformation (or natural log) to get the data into a usable, linear state.
4.) How do you interpret a log regression coefficient?
A log regression coefficient measures the magnitude of the relationship between a predictor and an outcome when one of them enters the model on a log scale (i.e., in terms of its natural logarithm). In other words, a log regression coefficient indicates how much change in the outcome is associated with a one-unit change in the predictor on the transformed scale, after controlling for all other predictors in the model.
The sign of a log regression coefficient tells you which direction the relationship between the predictor and outcome goes (positive or negative), while its magnitude indicates how strong that relationship is. In general, you want your model to have statistically significant log regression coefficients for all of your predictors (in other words, p-values should be less than 0.05).
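As a concrete reading (a standard interpretation for the case where the outcome is the logged variable): if $\log Y = \beta_0 + \beta_1 X$, then a one-unit increase in X multiplies Y by $e^{\beta_1}$, which for small $\beta_1$ is roughly a $100 \cdot \beta_1$ percent change in Y.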
5.) What is the difference between linear and logarithmic regression?
Linear regression is a way of describing how one variable (the dependent variable) changes with respect to changes in another variable (the independent variable). In linear regression, the equation describing the relationship between the two variables is a straight line.
Logarithmic regression is also a way of describing how one variable changes with respect to changes in another variable, but in logarithmic regression, the equation describing the relationship between the two variables is a curved line. Logarithmic regression is often used when dealing with data that follow a power law.
2. Implementing a log linear model in R & Python
For this example, we will use the opossum dataset. We'll start off by doing a simple log transformation on our data. If the residuals follow a lognormal distribution, this transformation brings them back to an approximately normal distribution. Once we are done with that, we can just continue and run the simple linear regression model as normal without any trouble.
In this case, what we will do is apply the log function to our dependent variable (age), then run a simple linear model on it. Then we will run the predictions, apply the inverse log function, and round off the predicted values. Typically, you will want to apply this technique to data that shows exponential growth.
To run the regression in R, we will use the lm() function; to run it in Python, we'll use sklearn's LinearRegression().
R
Loading up our data
As usual, we'll load up our data using the fread() function from the data.table library.
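Here's a minimal sketch of that step (the file name and path are assumptions; point fread() at wherever your copy of the data lives):

```r
# data.table's fread() reads CSV files quickly
library(data.table)

# The file name here is an assumption; adjust it to your copy of the data
df <- fread("opossum.csv")
head(df)
```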
Transforming our data points
With our data ready to go, let's do a simple train and test split. Then we'll apply the log function to our training set's dependent variable. We will also kick out some independent variables that aren't useful.
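A minimal sketch of the split and transform (the column names case, site, and age are assumptions about this copy of the dataset):

```r
set.seed(42)

# Simple 80/20 train-test split
train_idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train <- df[train_idx]
test  <- df[-train_idx]

# Kick out identifier columns with no predictive signal (assumed names)
train[, c("case", "site") := NULL]
test[,  c("case", "site") := NULL]

# Apply the log transformation to the dependent variable
train[, age := log(age)]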
Running a Log Linear Model
Running the log linear model is basically no different from running a linear regression model in R. Just use the lm() function on your data, point it at the dependent and independent variables, and tweak the model parameters if you have to.
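Something like this, assuming the train table from the previous step (where age is already log-transformed):

```r
# Fit a linear model on the log-transformed response
model <- lm(age ~ ., data = train)
summary(model)
```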
Predicting and evaluating a log linear model
We just point the predict() function at our test data, undo the log transformation on the output, and the job is done. Even though log linear models aren't really meant for this type of problem, you can see this one performed quite well.
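A sketch of the prediction step (mean absolute error is just one reasonable metric choice here):

```r
# Predictions come out on the log scale
log_preds <- predict(model, newdata = test)

# Undo the log transformation with exp(), then round to whole ages
preds <- round(exp(log_preds))

# A simple way to evaluate: mean absolute error
mean(abs(preds - test$age))
```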
Python
Loading up our data
We'll use the .read_csv() function from the pandas library to load up the opossum data set. The goal of this task is to use a log linear model to predict the age of an opossum given a number of measurements about it.
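A minimal sketch (again, the file name and path are assumptions; adjust them to your copy of the data):

```python
import pandas as pd

# The file name here is an assumption; adjust it to your copy of the data
df = pd.read_csv("opossum.csv")
print(df.head())
```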
Transforming our data points
With our data ready to go, let's do a simple train and test split. Then we'll apply the log function to our training set's dependent variable. We will also kick out some independent variables that aren't useful.
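A sketch of the same steps in Python (the column names are assumptions, and we keep only numeric predictors so LinearRegression() can fit them directly):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Column names (age, case, site) are assumptions about this dataset;
# keep only numeric predictors so the linear model can fit them directly
X = df.drop(columns=["age", "case", "site"]).select_dtypes(include="number")
y = df["age"]

# Simple 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Apply the log transformation to the dependent variable
y_train_log = np.log(y_train)
```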
Running a Log Linear Model
Now we will just call the LinearRegression() function and have it fit our training X data and our (log-transformed) training y data.
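Something like this, assuming the variables from the previous step:

```python
from sklearn.linear_model import LinearRegression

# Fit a plain linear regression on the log-transformed response
model = LinearRegression()
model.fit(X_train, y_train_log)
```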
Predicting and evaluating a log linear model
We just point the predict() function at our test data, undo the log transformation on the output, and the job is done. Even though log linear models aren't really meant for this type of problem, you can see this one performed quite well.
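A sketch of the prediction step (mean absolute error is again just one metric choice):

```python
from sklearn.metrics import mean_absolute_error

# Predictions come out on the log scale
log_preds = model.predict(X_test)

# Undo the log transformation with np.exp(), then round to whole ages
preds = np.round(np.exp(log_preds))

# A simple way to evaluate: mean absolute error
print(mean_absolute_error(y_test, preds))
```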
3. Linear Model (Regression)
Recall that multiple linear regression tries to fit a linear model (a straight line) for one response variable using multiple explanatory variables. Here is the equation it tries to solve for:
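$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

Here Y is the response variable, the X's are the explanatory variables, the β's are the coefficients the model solves for, and ε is the error term.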
Unfortunately, this model has a few limitations. What if you were trying to use it to analyze something that is quadratic, cubic, or some sort of exponential function? In that case, this version would fail badly. Here are a few more limitations of a linear model.
Limitations of a linear model
To stay on the topic of log linear models, I will only list the limitations that exist in linear models but not in log linear models.
Linear models struggle to capture multiplicative or exponential relationships, while log linear models handle them naturally
Linear models are less accurate when predicting values far outside the range of the input data
Linear models do not handle non-normalized data well, which is why we typically have to use some sort of scaler from sklearn
Each additional predictor adds a degree of complexity to the model, which makes it much more likely to overfit (in an earlier post, we used p-values to show this)
4. Log Transformation
The log transformation means taking a log function (or the natural log, if you prefer) and applying it to our data. If the data follows some sort of exponential pattern, the transformation converts it into a linear one, which we can then easily fit a trend line to. Observe the sketch below:
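Here's a tiny sketch of that idea with made-up numbers; an exponential curve becomes a straight line once you take logs:

```python
import numpy as np

x = np.arange(1, 11)
y = 2.0 * np.exp(0.5 * x)  # made-up exponential growth

log_y = np.log(y)          # log transform the y values

# A straight line now fits the transformed data almost perfectly
slope, intercept = np.polyfit(x, log_y, 1)
print(slope, intercept)    # ~0.5 and ~log(2.0) = 0.693...
```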
Log Linear Models
In a log linear model, we basically apply the log transformation to our dependent variable (the y values). By doing so, we take an exponential distribution and scale the values down to a more workable linear one.
In other words, our data should now have some sort of linear relationship, so any basic linear regression model should be competent enough to get the job done. This process is sometimes referred to as log linear modeling, or a first form log linear model.
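In equation form (a standard way of writing it), the model is just a linear regression on the logged response:

$$\log Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon$$

which is equivalent to $Y = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon}$, i.e., an exponential (multiplicative) relationship on the original scale.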
Log Linear Model
Now that we've got the theory out of the way, let's talk more about the impact this model has in data science.
The advantage of using a log linear model is that it can better handle datasets with many predictor variables and/or interactions between predictor variables. The logarithmic transformation helps to "smooth" out the relationships between the variables, making them easier to visualize and interpret.
Log linear models are commonly used in data science and machine learning tasks such as predictive modelling, anomaly detection, and clustering.
The basic thought process is that once you've applied a log transformation to your dependent variable, you will have effectively smoothed out the curve. But if, even after smoothing the curve down to a simple linear model, you still have a response value that is just too far away from what should be considered acceptable, then that point is clearly an outlier. In such cases, you will typically want to follow whatever your company's policy is for outlier detection.
5. Independent variables & log linear models
Remember, since you are applying the log transformation to your dependent variable (the one showing exponential growth), you are effectively re-scaling the y values to bring them more in line with what a linear regression model expects. Because of this, when you look at the coefficients for your independent variables, you have to remember to apply the inverse of the transformation function you used in order to get back to the original scale of the data.
For example, if you used the natural log function for your transformation, you will want to exponentiate (raise e, the mathematical constant, to the power of your values) in order to reverse it and get your original scale back. If you used a standard base-10 log function, then you will want to raise 10 to the power of your values instead.
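A quick sketch of both cases with made-up values:

```python
import numpy as np

log_value = 1.5                 # a value on the natural-log scale
original = np.exp(log_value)    # undo ln with e**x

log10_value = 1.5               # a value on the base-10 log scale
original10 = 10 ** log10_value  # undo log10 with 10**x
```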
First Differences
Another way people sometimes use the log linear model is by applying the log transformation to the first differences of a series, as in the sketch below.
The first differences of a time series are the differences between consecutive values in the series, and they are used to remove trend (and sometimes seasonal) components from the data. So if you apply the log transformation to the first differences of a data set, you will remove most of the trend from the data. This can be useful for time-series analysis or for modeling purposes.
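Here's a tiny sketch of the idea with a made-up, steadily growing series (note that the log step requires the differences to be positive):

```python
import numpy as np

y = np.array([100.0, 120.0, 150.0, 190.0, 250.0])  # made-up series

first_diffs = np.diff(y)         # differences between consecutive values
log_diffs = np.log(first_diffs)  # log transform (requires positive diffs)
print(log_diffs)
```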