Log Linear Model
A log linear model is not the same thing as logistic regression. Despite the similar name, it is meant for regression analysis, not classification problems.
Table of Contents
Frequently Asked Questions
Implementing a log linear model in R & Python
Linear Model (Regression)
Log Transformation
Independent variables & log linear models
1. Frequently Asked Questions
1.) Why do we use log in linear regression?
The logarithm is often used in linear regression because it "linearizes" the relationship between the predictor (X) and the response (Y) variables. For example, if Y is a power function of X, then the logarithm of Y is a linear function of the logarithm of X.
This is important because it means that we can use linear regression to model the relationship between Y and X without having to worry about non-linearities in the data. In other words, if we use the logarithm to transform our data, then we can safely assume that the relationship between Y and X is linear.
This is a very powerful assumption because it allows us to use standard techniques for modelling linear relationships.
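For example, if the underlying relationship is a power law, taking logs of both sides turns it into a straight line:

$$Y = aX^b \;\Longrightarrow\; \log Y = \log a + b \log X$$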
2.) What does log-linear model do?
A log-linear model is a machine learning model used to model the relationship between two or more variables. The log-linear model is especially useful for capturing relationships that are nonlinear (i.e., not a straight line) in nature.
The linear part of the name refers to the fact that the model assumes a linear relationship between the transformed variables. The log part of the name refers to the fact that the model applies a logarithmic transformation to the data before fitting a linear equation. This transformation helps to account for nonlinear relationships in the data, and results in a more accurate prediction of those relationships.
3.) What does it mean if data is logarithmic?
A logarithmic scale is one in which equal distances represent equal ratios rather than equal differences. So, for example, if you were to graph the number of Facebook users over time on a linear scale, the curve would look almost flat in the early years and then shoot upward as more and more people signed up. But if you graphed the number of Facebook users on a logarithmic scale, that steady percentage growth would show up as a roughly straight line. This is because a logarithmic scale compresses large values and spreads out small ones.
In data science, logarithmic data is often used when there is a power law relationship between two variables. A power law means that one variable is proportional to a power (exponent) of the other, so you will want to use some sort of log transformation (or natural log) to get the data into a usable, linear state.
4.) How do you interpret a log regression coefficient?
A log regression coefficient measures the magnitude of the relationship between a predictor and an outcome when one of them enters the model on a log scale (i.e., in terms of its natural logarithm). In other words, a log regression coefficient indicates how much change in the outcome is associated with a one-unit change in the predictor on the transformed scale, after controlling for all other predictors in the model.
The sign of a log regression coefficient tells you which direction the relationship between the predictor and outcome goes (positive or negative), while its magnitude indicates how strong that relationship is. In general, you want your model to have statistically significant log regression coefficients for all of your predictors (in other words, p-values should be less than 0.05).
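As a concrete reading (a standard interpretation for the case where the outcome is the logged variable): if $\log Y = \beta_0 + \beta_1 X$, then a one-unit increase in X multiplies Y by $e^{\beta_1}$, which for small $\beta_1$ is roughly a $100 \cdot \beta_1$ percent change in Y.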
5.) What is the difference between linear and logarithmic regression?
Linear regression is a way of describing how one variable (the dependent variable) changes with respect to changes in another variable (the independent variable). In linear regression, the equation describing the relationship between the two variables is a straight line.
Logarithmic regression is also a way of describing how one variable changes with respect to changes in another variable, but in logarithmic regression, the equation describing the relationship between the two variables is a curved line. Logarithmic regression is often used when dealing with data that follow a power law.
2. Implementing a log linear model in R & Python
For this example, we will use the opossum dataset. We'll start off by doing a simple log transformation on our data. If the residuals follow a lognormal distribution, this transformation brings them back to an approximately normal distribution. Once we are done with that, we can just continue and run the simple linear regression model as normal without any trouble.
In this case, what we will do is apply the log function to our dependent variable (age), then run a simple linear model on it. Then we will run the predictions, apply the inverse log function, and round off the predicted values. Typically, you will want to apply this technique to data that shows exponential growth.
To run the regression in R, we will use the lm() function; to run it in Python, we'll use sklearn's LinearRegression().
R
Loading up our data
As usual, we'll load up our data using the fread() function from the data.table library.
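Here's a minimal sketch of that step (the file name and path are assumptions; point fread() at wherever your copy of the data lives):

```r
# data.table's fread() reads CSV files quickly
library(data.table)

# The file name here is an assumption; adjust it to your copy of the data
df <- fread("opossum.csv")
head(df)
```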
Transforming our data points
With our data ready to go, let's do a simple train and test split. Then we'll apply the log function to our training set's dependent variable. We will also kick out some independent variables that aren't useful.
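A minimal sketch of the split and transform (the column names case, site, and age are assumptions about this copy of the dataset):

```r
set.seed(42)

# Simple 80/20 train-test split
train_idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train <- df[train_idx]
test  <- df[-train_idx]

# Kick out identifier columns with no predictive signal (assumed names)
train[, c("case", "site") := NULL]
test[,  c("case", "site") := NULL]

# Apply the log transformation to the dependent variable
train[, age := log(age)]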
Running a Log Linear Model
Running the log linear model is basically no different from running a linear regression model in R. Just use the lm() function on your data, point it at the dependent and independent variables, and tweak the model parameters if you have to.
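Something like this, assuming the train table from the previous step (where age is already log-transformed):

```r
# Fit a linear model on the log-transformed response
model <- lm(age ~ ., data = train)
summary(model)
```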
Predicting and evaluating a log linear model
We just point the predict() function at our test data, undo the log transformation on the output, and the job is done. Even though log linear models aren't really meant for this type of problem, you can see this one performed quite well.
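A sketch of the prediction step (mean absolute error is just one reasonable metric choice here):

```r
# Predictions come out on the log scale
log_preds <- predict(model, newdata = test)

# Undo the log transformation with exp(), then round to whole ages
preds <- round(exp(log_preds))

# A simple way to evaluate: mean absolute error
mean(abs(preds - test$age))
```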
Python
Loading up our data
We'll use the .read_csv() function from the pandas library to load up the opossum data set. The goal of this task is to use a log linear model to predict the age of an opossum given a number of measurements about it.
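A minimal sketch (again, the file name and path are assumptions; adjust them to your copy of the data):

```python
import pandas as pd

# The file name here is an assumption; adjust it to your copy of the data
df = pd.read_csv("opossum.csv")
print(df.head())
```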
Transforming our data points
With our data ready to go, let's do a simple train and test split. Then we'll apply the log function to our training set's dependent variable. We will also kick out some independent variables that aren't useful.
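A sketch of the same steps in Python (the column names are assumptions, and we keep only numeric predictors so LinearRegression() can fit them directly):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Column names (age, case, site) are assumptions about this dataset;
# keep only numeric predictors so the linear model can fit them directly
X = df.drop(columns=["age", "case", "site"]).select_dtypes(include="number")
y = df["age"]

# Simple 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Apply the log transformation to the dependent variable
y_train_log = np.log(y_train)
```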
Running a Log Linear Model
Now we will just call the LinearRegression() function and have it fit our training X data and our (log-transformed) training y data.
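Something like this, assuming the variables from the previous step:

```python
from sklearn.linear_model import LinearRegression

# Fit a plain linear regression on the log-transformed response
model = LinearRegression()
model.fit(X_train, y_train_log)
```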
Predicting and evaluating a log linear model
We just point the predict() function at our test data, undo the log transformation on the output, and the job is done. Even though log linear models aren't really meant for this type of problem, you can see this one performed quite well.
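A sketch of the prediction step (mean absolute error is again just one metric choice):

```python
from sklearn.metrics import mean_absolute_error

# Predictions come out on the log scale
log_preds = model.predict(X_test)

# Undo the log transformation with np.exp(), then round to whole ages
preds = np.round(np.exp(log_preds))

# A simple way to evaluate: mean absolute error
print(mean_absolute_error(y_test, preds))
```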
3. Linear Model (Regression)
Recall that multiple linear regression tries to fit a linear model (a straight line) for one response variable using multiple explanatory variables. Here is the equation it tries to solve for:
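$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

Here Y is the response variable, the X's are the explanatory variables, the β's are the coefficients the model solves for, and ε is the error term.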
Unfortunately, this model has a few limitations. What if you were trying to use it to analyze something that is quadratic, cubic, or some sort of exponential function? In that case, this version would fail badly. Here are a few more limitations of a linear model.
Limitations of a linear model
To stay on the topic of log linear models, I will only list the limitations that exist in linear models but not in log linear models.
Linear models struggle to capture multiplicative or exponential relationships, while log linear models handle them naturally
Linear models are less accurate when predicting values far outside the range of the input data
Linear models do not handle non-normalized data well, which is why we typically have to use some sort of scaler from sklearn
Each additional predictor adds a degree of complexity to the model, which makes it much more likely to overfit (in an earlier post, we used p-values to show this)
4. Log Transformation
The log transformation means taking a log function (or the natural log, if you prefer) and applying it to our data. If the data follows some sort of exponential pattern, the transformation converts it into a linear one, which we can then easily fit a trend line to. Observe the sketch below:
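Here's a tiny sketch of that idea with made-up numbers; an exponential curve becomes a straight line once you take logs:

```python
import numpy as np

x = np.arange(1, 11)
y = 2.0 * np.exp(0.5 * x)  # made-up exponential growth

log_y = np.log(y)          # log transform the y values

# A straight line now fits the transformed data almost perfectly
slope, intercept = np.polyfit(x, log_y, 1)
print(slope, intercept)    # ~0.5 and ~log(2.0) = 0.693...
```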
Log Linear Models
In a log linear model, we basically apply the log transformation to our dependent variable (the y values). By doing so, we take an exponential distribution and scale the values down to a more workable linear one.
In other words, our data should now have some sort of linear relationship, so any basic linear regression model should be competent enough to get the job done. This process is sometimes referred to as log linear modeling, or a first form log linear model.
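In equation form (a standard way of writing it), the model is just a linear regression on the logged response:

$$\log Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon$$

which is equivalent to $Y = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon}$, i.e., an exponential (multiplicative) relationship on the original scale.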
Log Linear Model
Now that we've got the theory out of the way, let's talk more about the impact this model has in data science.
The advantage of using a log linear model is that it can better handle datasets with many predictor variables and/or interactions between predictor variables. The logarithmic transformation helps to "smooth" out the relationships between the variables, making them easier to visualize and interpret.
Log linear models are commonly used in data science and machine learning tasks such as predictive modelling, anomaly detection, and clustering.
The basic thought process is that once you've applied a log transformation to your dependent variable, you will have effectively smoothed out the curve. But if, even after smoothing the curve down to a simple linear model, you still have a response value that is just too far away from what should be considered acceptable, then that point is clearly an outlier. In such cases, you will typically want to follow whatever your company's policy is for outlier detection.
5. Independent variables & log linear models
Remember, since you are applying the log transformation to your dependent variable (the one showing exponential growth), you are effectively re-scaling the y values to bring them more in line with what a linear regression model expects. Because of this, when you look at the coefficients for your independent variables, you have to remember to apply the inverse of the transformation function you used in order to get back to the original scale of the data.
For example, if you used the natural log function for your transformation, you will want to exponentiate (raise e, the mathematical constant, to the power of your values) in order to reverse it and get your original scale back. If you used a standard base-10 log function, then you will want to raise 10 to the power of your values instead.
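A quick sketch of both cases with made-up values:

```python
import numpy as np

log_value = 1.5                 # a value on the natural-log scale
original = np.exp(log_value)    # undo ln with e**x

log10_value = 1.5               # a value on the base-10 log scale
original10 = 10 ** log10_value  # undo log10 with 10**x
```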
First Differences
Another way people sometimes use the log linear model is by applying the log transformation to the first differences of a series, as in the sketch below.
The first differences of a time series are the differences between consecutive values in the series, and they are used to remove trend (and sometimes seasonal) components from the data. So if you apply the log transformation to the first differences of a data set, you will remove most of the trend from the data. This can be useful for time-series analysis or for modeling purposes.
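Here's a tiny sketch of the idea with a made-up, steadily growing series (note that the log step requires the differences to be positive):

```python
import numpy as np

y = np.array([100.0, 120.0, 150.0, 190.0, 250.0])  # made-up series

first_diffs = np.diff(y)         # differences between consecutive values
log_diffs = np.log(first_diffs)  # log transform (requires positive diffs)
print(log_diffs)
```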