Data Science & Machine Learning 101

Log Linear Model

The log linear model is not the same as logistic regression. This one is meant for regression analysis, not classification problems.

BowTied_Raptor
Aug 2, 2022

Table of Contents

  1. Frequently Asked Questions

  2. Implementing a log linear model in R & Python

  3. Linear Model (Regression)

  4. Log Transformation

  5. Independent variables & log linear models


1. Frequently Asked Questions

1.) Why do we use log in linear regression?

A log linear model vs. a linear model, with the log transformation applied. It's essentially just linear regression with a log transformation.

The logarithm is often used in linear regression because it "linearizes" the relationship between the predictor (X) and the response (Y) variables. In a log linear model, the logarithm of Y is a linear function of X (in a log-log model, it is a linear function of the logarithm of X).

This is important because it means we can use linear regression to model the relationship between Y and X without having to worry about that particular non-linearity in the data. In other words, if the log transformation suits the data, then after applying it we can treat the relationship between the transformed Y and X as linear.

This is a powerful assumption because it allows us to use standard techniques for modelling linear relationships.
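
To make this concrete, here's a minimal sketch (my own illustration, not from the original post) of a log transform turning an exponential relationship into a straight line:

```python
# A made-up exponential relationship: y = 2 * e^(1.3 x).
import numpy as np

x = np.linspace(0, 5, 50)
y = 2.0 * np.exp(1.3 * x)

# After the transform, log(y) = log(2) + 1.3 * x, a straight line in x,
# so a degree-1 polynomial fit recovers the slope and intercept.
slope, intercept = np.polyfit(x, np.log(y), 1)
print(slope, intercept)   # ~1.3 and ~0.693 (= log 2)
```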

2.) What does a log-linear model do?
A log-linear model is a machine learning model used to predict the relationship between two or more variables. The log-linear model is especially useful for capturing relationships that are nonlinear (i.e., not a straight line) in nature.

The linear part of the name refers to the fact that the model assumes a linear relationship between the input and output variables. The log part of the name refers to the fact that the model uses a logarithmic transformation of the input data before fitting it to a linear equation. This transformation helps to account for nonlinear relationships in the data, and results in a more accurate prediction of those relationships.

3.) What does it mean if data is logarithmic?
A logarithmic scale is one in which equal distances represent equal ratios rather than equal differences. So, for example, if you were to graph the number of Facebook users over time on a linear scale, the curve would look almost flat in the early years and then shoot upward as growth compounded. But if you graphed the number of Facebook users on a logarithmic scale, steady percentage growth would show up as a roughly constant slope over time. This is because a logarithmic scale compresses large distances and spreads out small distances.

In data science, logarithmic data often shows up when there is a power law relationship between two variables. A power law means one variable is proportional to the other raised to some exponent, so you will want to use some sort of log transformation (or natural log) to get the data into a usable, linear state.
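
As a hedged sketch of that idea (the numbers here are made up), a straight-line fit on the logged data recovers a power-law exponent:

```python
# Made-up power law: y = 5 * x^2.5, so log(y) = log(5) + 2.5 * log(x).
import numpy as np

x = np.linspace(1, 100, 200)
y = 5.0 * x ** 2.5

slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print(slope, np.exp(intercept))   # ~2.5 (the exponent) and ~5.0 (the constant)
```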

4.) How do you interpret a log regression coefficient?
A log regression coefficient measures the magnitude of the relationship between a predictor and the outcome when the outcome has been transformed (here, via its natural logarithm). In other words, the coefficient tells you how much change in log(outcome) is associated with a one-unit change in the predictor, after controlling for all other predictors in the model; for small coefficients, this is approximately a percentage change in the outcome itself.

The sign of a log regression coefficient tells you which direction the relationship between the predictor and outcome goes (positive or negative), while its magnitude indicates how strong that relationship is. In general, you want statistically significant log regression coefficients for all of your predictors (conventionally, p-values less than 0.05).
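
As a worked example (with a made-up coefficient) of how this reads in a log-lin model:

```latex
% Log-lin model: a one-unit change in x multiplies y by e^{beta_1}.
\log y = \beta_0 + \beta_1 x
\quad\Longrightarrow\quad
\frac{y(x+1)}{y(x)} = e^{\beta_1}
% e.g. a made-up \beta_1 = 0.05 gives e^{0.05} \approx 1.051,
% i.e. roughly a 5.1% increase in y per unit of x.
```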

5.) What is the difference between linear and logarithmic regression?
Linear regression is a way of describing how one variable (the dependent variable) changes with respect to changes in another variable (the independent variable). In linear regression, the equation describing the relationship between the two variables is a straight line.

Logarithmic regression is also a way of describing how one variable changes with respect to changes in another variable, but in logarithmic regression, the equation describing the relationship between the two variables is a curved line. Logarithmic regression is often used when dealing with data that follow a power law.

Log linear models come in handy here; you can see what a linear model looks like compared with a log transformation applied to that same model. A natural log works too.

2. Implementing a log linear model in R & Python

For this exercise, we will use the possum dataset. We'll start off by doing a simple log transformation on our dependent variable. Doing so takes residuals that follow a lognormal distribution and pulls them toward a normal distribution. Once we are done with that, we can just continue and run the simple linear regression model as normal without any trouble.

The log function being applied to the dependent variable; this takes skewed, lognormal-looking data and makes it approximately normal.

In this case, what we will do is apply the log function to our dependent variable (age), then run a simple linear model on it. Then we will run the predictions, apply the inverse of the log (exponentiation), and round off the predicted values. Typically, you will want to apply this technique to data that shows exponential growth.

To run the regression in R, we will use the lm() function; to run it in Python, we'll use sklearn's LinearRegression() class.


R

Loading up our data

As usual, we'll load up our data using the fread() function from the data.table library.

Loading up our data set, which has the list of our dependent and independent variables

Transforming our data points

With our data ready to go, let's do a simple train and test split. Then we'll apply the log function to our training data set's dependent variable. We will also kick out some useless independent variables.

Now we apply the log function to our dependent variable to get it ready, along with a simple train/test split.

Running a Log Linear Model

Running the log linear model is basically no different from running a linear regression model in R. Just use the lm() function on your data, specify the dependent and independent variables, and tweak model parameters if you have to.

Running the actual model; remember to specify the dependent and independent variables, and do some model tweaking if you must.

Predicting and evaluating a log linear model

We just use the predict() function, point our test data at it, and the job is done. Even though this isn't the textbook use case for a log linear model, you can see this one performed quite well.

You can see the normal distribution of the residuals for our data is essentially unaffected in this picture.

Python

Loading up our data

We'll use the .read_csv() function from the pandas library to load up the possum data set. The goal of this task is to use a log linear model to predict the age of a possum, given a lot of other measurements about it.

Loading up our data set, which has the list of our dependent and independent variables
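
A minimal sketch of this step is below; the file name possum.csv is a hypothetical local path, not something from the original post:

```python
# Sketch of the loading step ("possum.csv" is a hypothetical path).
import pandas as pd

df = pd.read_csv("possum.csv")
print(df.head())    # peek at the dependent variable (age) and the predictors
```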

Transforming our data points

With our data ready to go, let's do a simple train and test split. Then we'll apply the log function to our training data set's dependent variable. We will also kick out some useless independent variables.

Now we apply the log function to our dependent variable to get it ready, along with a simple train/test split.
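
Here's roughly what that could look like; the dropped column names (case, site, Pop, sex) are my assumptions about which variables get kicked out:

```python
# Sketch: log-transform the dependent variable, then split.
# The file path and dropped columns are assumptions, not from the post.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("possum.csv").dropna()
X = df.drop(columns=["age", "case", "site", "Pop", "sex"], errors="ignore")
y = np.log(df["age"])     # the log function applied to the dependent variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```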

Running a Log Linear Model

Now, we will just call the LinearRegression() class and have it fit our training X data and our training y data.

Running the actual model; remember to specify the dependent and independent variables, and do some model tweaking if you must.
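
A sketch of the fitting step (it repeats the setup above so it runs on its own; same assumed file and columns):

```python
# Sketch: ordinary least squares on log(age).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("possum.csv").dropna()                  # hypothetical path
X = df.drop(columns=["age", "case", "site", "Pop", "sex"], errors="ignore")
y = np.log(df["age"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(dict(zip(X_train.columns, model.coef_.round(4))))  # per-predictor coefficients
```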

Predicting and evaluating a log linear model

We just use the predict() function, point our test data at it, and the job is done. Even though this isn't the textbook use case for a log linear model, you can see this one performed quite well.

Even though linear regression assumes normally distributed residuals, using the log function still works just fine here.
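
Putting the whole pipeline together, the prediction and inverse-transform steps might look like this (file and column names are still assumptions):

```python
# Sketch: predict log(age), undo the log with exp, round off, evaluate.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("possum.csv").dropna()                  # hypothetical path
X = df.drop(columns=["age", "case", "site", "Pop", "sex"], errors="ignore")
y_log = np.log(df["age"])
X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred_age = np.round(np.exp(model.predict(X_test)))       # inverse log, then round
true_age = np.exp(y_test)
print("MAE in years:", mean_absolute_error(true_age, pred_age))
```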

3. Linear Model (regression)

Recall that multiple linear regression tries to fit a linear model for one response variable using multiple explanatory variables. Here is the equation it tries to solve for:

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ + ε

where y is the response variable, the xᵢ are the explanatory variables, the βᵢ are the fitted coefficients, and ε is the error term.

Unfortunately, this model has a few limitations. What if you were trying to use it to analyze something quadratic, cubic, or exponential? In that case, this version would fail badly. Here are a few more limitations of a linear model.

Limitations of a linear model

To stick to the topic of log linear models, I will only list the limitations that exist in linear models but do not exist in the log linear model.

  • Linear models are only able to extrapolate (e.g., easily find the y-intercept), while log linear models can also interpolate

  • Linear models are less accurate when predicting values far outside the input data range

  • Linear models cannot handle non-normalized data that well, hence why we typically have to use some sort of scaler from sklearn (see the sketch after this list)

  • Each additional predictor increases the complexity of the model by one degree, which makes it much easier to overfit; checking the p-values of the coefficients is one way to catch this.
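
On the scaling bullet above, here is a minimal sketch (entirely made-up data, my own illustration) of bolting an sklearn scaler onto a linear fit:

```python
# Sketch: put predictors on comparable scales before the linear fit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) * [1.0, 100.0, 10_000.0]   # wildly different scales
y = X @ np.array([0.5, 0.002, 0.00001]) + rng.normal(0.0, 0.1, 200)

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)
print(round(model.score(X, y), 3))                       # R^2 on the training data
```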

4. Log Transformation

The log transformation means taking a log function (or ln, the natural log, if you prefer) and applying it to our data. If the data followed some sort of exponential trend, applying the function converts it into a linear trend, which we can then easily fit a trend line to. Observe below:

The transformation of a model with several independent variables into a simple linear model, just by applying the log transformation. Note: this is the same thing as a log-lin model.

Log Linear Models

In a log linear model, we apply the log transformation function to our dependent variable (the y values). Doing so lets us take an exponential trend and scale the values down to a more workable linear one.

In other words, our data should now have some sort of a linear relationship, so any basic linear regression model should be competent enough to get the job done. This process is sometimes referred to as log linear modeling, or the log-lin form of the model.
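
In symbols, the standard log-lin form looks like this (a sketch of the textbook notation, not a formula from the original post):

```latex
% Log of the dependent variable, linear in the predictors:
\log y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
% Equivalently, back on the original scale:
% y = e^{\beta_0} \, e^{\beta_1 x_1} \cdots e^{\beta_p x_p} \, e^{\varepsilon}
```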

Log Linear Model

Now that we've got the theory out of the way, let's talk a bit more about the impact this model has in data science.

The advantage of using a log linear model is that it can better handle datasets with many predictor variables and/or interactions between predictor variables. The logarithmic transformation helps to "smooth" out the relationships between the variables, making them easier to visualize and interpret.

Log linear models are commonly used in data science and machine learning tasks such as predictive modelling, anomaly detection, and clustering.

The basic thought process is that once you've applied a log transformation to your dependent variable, you will have effectively smoothed out the curve. But if, even after smoothing things out to a simple linear model, a response value is still too far away from what should be considered acceptable, then that point is clearly an outlier. In such cases, you will typically want to follow whatever your company's policy is for outlier detection. A rough sketch of this idea follows the image below.

Using exponential data in data science with a log linear model for outlier detection (anomaly detection)
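
Here is a small sketch of that idea (entirely synthetic data, my own illustration): fit a straight line on the log scale, then flag points whose residuals sit far from zero:

```python
# Sketch: log-scale residuals as a simple outlier (anomaly) detector.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3.0 * np.exp(0.4 * x) * rng.lognormal(0.0, 0.1, x.size)  # noisy exponential growth
y[50] *= 8.0                                                 # plant one obvious outlier

slope, intercept = np.polyfit(x, np.log(y), 1)               # line on the log scale
resid = np.log(y) - (slope * x + intercept)
outliers = np.where(np.abs(resid) > 3 * resid.std())[0]
print(outliers)    # should flag index 50
```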

5. Independent variables & log linear models

Remember, since you applied the log transformation to your dependent variable (the one with exponential growth), you effectively re-scaled the y values to bring them more in line with what a linear regression model expects. So when you look at the coefficients for your independent variables, remember to apply the inverse of the transformation function you used in order to map everything back to the original scale of the data.

For example, if you used the natural log function for your transformation, you will want to use e (the mathematical constant) raised to the appropriate power in order to reverse it and get your original scale back. If you used a standard base-10 log, then you will want to raise 10 to the power instead.
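
A quick sketch (with made-up coefficient values) of reversing a natural-log transformation on fitted coefficients:

```python
# Sketch: coefficients from a hypothetical log(y) = b0 + b1*x1 + b2*x2 fit.
import numpy as np

coefs = {"intercept": 1.20, "x1": 0.05, "x2": -0.30}   # made-up values

# e**beta turns each coefficient into a multiplicative effect on y
# per one-unit change in that predictor.
for name, b in coefs.items():
    print(name, "multiplies y by", round(np.exp(b), 3))
# x1 -> ~1.051 (about +5.1% per unit); x2 -> ~0.741 (about -25.9% per unit)
```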


First Differences

Another way people sometimes use the log linear model is by combining the log transformation with first differences. See the image below:

First differences of the independent variable and the dependent variable

The first differences of a time series are the differences between consecutive values in the series, and they are used to remove trend and seasonal components from the data. So if you take logs and then first differences of a data set, you will remove most of the trend (the log differences are roughly period-over-period growth rates). This can be useful for time-series analysis or for modeling purposes.
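
As a small sketch (synthetic series, my own illustration) of how log first differences strip out the trend:

```python
# Sketch: log first differences of an exponentially growing monthly series.
import numpy as np
import pandas as pd

t = pd.date_range("2020-01-01", periods=36, freq="MS")        # month starts
y = pd.Series(100.0 * np.exp(0.05 * np.arange(36)), index=t)  # made-up user counts

log_diff = np.log(y).diff().dropna()   # ~ period-over-period growth rate
print(log_diff.head())                 # roughly constant at 0.05 (about 5% per month)
```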
