Coding4: Immutability & String Functions
This post focuses on common functions that anyone working with Natural Language Processing (NLP) will end up using. For now, we'll stick with the basic built-in functions; later on we'll deal with packages.
Quick Recap
As of right now, we know the 3 unique properties that apply to all strings. We also know how to pull individual characters from a string based on their position (index), and how to grab multiple characters at a time.
Now, we’ll focus on some common problems you will encounter in your day-to-day DS life (NLP only) and how to deal with them.
Table of Contents:
Immutable object
Python
R
Upper & Lowercase
Python
R
Removing Junk Whitespace
Python
R
Check if a string starts/ends
Python
R
Exercises
Immutable object
String objects have another property that’s a bit less talked about: strings are immutable objects. In programming, immutable means that once we’ve created a string, we can no longer change it in place. Here is an example:
We create a variable jibby, and assign it the value ‘jobby’
We use string indexing to point to the first character
We now try to assign a different character to that first position, such as ‘g’
Here is a visual example in Python:
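A minimal sketch of the three steps above, assuming a standard Python 3 interpreter. The try/except wrapper is an addition for illustration only, so that the script keeps running and reports the error instead of stopping:

```python
jibby = 'jobby'          # step 1: create the string
print(jibby[0])          # step 2: indexing to read the first character works

try:
    jibby[0] = 'g'       # step 3: assigning to an index is not allowed
except TypeError as err:
    error_type = type(err).__name__
    print(error_type, '-', err)

print(jibby)             # the original string is unchanged
```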
The assignment in the third step fails: Python raises a TypeError, because a character of an existing string cannot be changed in place. But if that doesn’t work, how can we systematize the process of turning ‘jobby’ into ‘gobby’? The answer lies in combining string concatenation and string slicing.
Python
Notice on line 2, we re-assign the value of jibby. When a variable appears on both the left and right side of the = sign, Python first evaluates the right-hand side using the variable's current value, then binds the variable name to the result, replacing the old value. The original string is never modified; a new string is created instead. We will use this pattern a lot when we get to for and while loops.
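The re-assignment described above might look like the following sketch. It assumes the slice `jibby[1:]` is used to keep everything after the first character:

```python
jibby = 'jobby'
jibby = 'g' + jibby[1:]   # build a new string and bind the name jibby to it
print(jibby)              # gobby
```

No character of the original string was modified here; `jibby` now simply refers to a newly built string.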
R
Here is what it looks like in R.
When we are doing concatenation in R using the paste() function, one of the parameters we can use is called sep. Setting sep='' tells R not to put anything between the 2 strings when it combines them. If you leave out the sep='' parameter, R will automatically add a space in between.
Upper and Lowercase
When users enter data in some sort of input form, they sometimes type everything in all caps, sometimes in all lowercase, and sometimes in a bizarre mixture of the two. To ensure quality, we will often want to convert everything to either all uppercase or all lowercase.
Python
We can use the .lower() and .upper() methods in Python to convert a string to lowercase or uppercase. One thing to note is that each method returns a new string and leaves the variable untouched, unless we re-assign the result to it. Observe:
We can use .upper() in the exact same fashion to convert everything to uppercase.
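A small sketch of that behavior; the variable name `shout` and its value are made up for illustration:

```python
shout = 'HeLLo WoRLD'

print(shout.lower())   # hello world
print(shout)           # HeLLo WoRLD (the variable itself did not change)

shout = shout.lower()  # re-assign to keep the lowercase version
print(shout)           # hello world

print(shout.upper())   # HELLO WORLD
```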
R
R lets us use the toupper() and tolower() functions to convert a string to uppercase and lowercase. Similarly to Python, we have to re-assign the result if we want these functions to change the original variable.
Note: Pay close attention to the upper and lower functions. You will use them the most often, as some people type their name in input forms in all lowercase, while others capitalize the first letter.
Removing Junk Whitespace
Sometimes we run into issues with our data when we load it from somewhere. This could happen for one of several reasons: an issue with the function, an issue with the user input data, or an issue with an Extract Transform Load (ETL) step somewhere. At the end of the day, regardless of what caused the issue, you will be expected to fix it and put the data into a usable format. To that end, we look at how to remove extra whitespace from a string.
Python
In Python, we have a series of strip() methods that we can use to remove whitespace. As in the example above, if you do not re-assign the new string to the old variable, the value of the variable will not change.
Python offers us:
.lstrip(): gets rid of the whitespace to the left
.rstrip(): gets rid of the whitespace to the right
.strip(): gets rid of the whitespace to the left & right.
Observe below:
Red Line from the above indicates excess whitespace to be removed on line 1
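As a sketch of the three methods, using a made-up string `messy` with padding on both sides (`repr()` is used only so that the remaining spaces are visible in the output):

```python
messy = '   jobby   '

print(repr(messy.lstrip()))  # 'jobby   '
print(repr(messy.rstrip()))  # '   jobby'
print(repr(messy.strip()))   # 'jobby'
print(repr(messy))           # '   jobby   ' (unchanged until re-assigned)

messy = messy.strip()        # re-assign to keep the cleaned value
print(repr(messy))           # 'jobby'
```

When called with no argument, these methods also remove tabs and newlines, not only spaces.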
R
To achieve a similar result, we will use a function called gsub(). gsub is basically like a find and replace function.
In the above code, we have R look for a string like this: ' '
Then in the next parameter, we replace it with an empty string: ''
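In this specific example it works great. Keep in mind, though, that this pattern removes every space in the string, including the ones between words, so in a real-life scenario you would be better off with a dedicated trimming function or a package.
Note: Recent versions of base R also provide trimws(), which removes only leading and trailing whitespace. Since we have not covered packages yet, we’ll go with the simple solution that works for now.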
Check if a string starts/ends
Sometimes, when we are working with vast amounts of string data, we want to branch our logic based on what a string starts or ends with. Here is a simple demonstration of how to check whether a string starts or ends with a particular substring.
Python
Python lets us use the .startswith() and .endswith() methods to check whether a string starts with a specific substring, or ends with a specific substring.
The reason the 3rd line gives us False is that these checks are case-sensitive: Python treats uppercase letters differently from lowercase letters.
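A short sketch of both methods, using a made-up string `word`. The third line is the case-sensitive check that returns False, and the last line shows one possible way around it, lowercasing before the check:

```python
word = 'Gobby'
print(word.startswith('Go'))           # True
print(word.startswith('go'))           # False: 'G' and 'g' are different characters
print(word.endswith('by'))             # True
print(word.lower().startswith('go'))   # True once the case is normalized
```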
R
R lets us use the startsWith(), and endsWith() functions in order to assess if a string starts/ends with a particular substring. This is demonstrated in the following example.
Exercises
Here are some exercises you can do on string immutability, and string objects:
Write a program that converts 'Weenis', 'Raptah', and 'Deez' to lowercase, then uppercase.
Now repeat exercise 1, but this time also replace the 1st character of each string with 'X'.
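# Exercise 1: convert each string to lowercase, then to uppercase
x = 'Weenis'
y = 'Raptah'
z = 'Deez'
# .lower() returns a new string, so re-assign to keep the result
x = x.lower()
y = y.lower()
z = z.lower()
print(x)
print(y)
print(z)
# same pattern with .upper()
x = x.upper()
y = y.upper()
z = z.upper()
print(x)
print(y)
print(z)
# Exercise 2: replace the 1st character with 'X' using concatenation and slicing
x = 'X' + x[1:]
y = 'X' + y[1:]
z = 'X' + z[1:]
print(x)
print(y)
print(z)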