Python Data Skills 9: Pandas Series
Pandas Series, Series Summary Statistics, Modifying Series Values, Changing Series Values
Let’s skip the fluff, and get right to it.
Table of Contents
Working with a Pandas Series
Summary Statistics for a Pandas Series
Modifying Series Values in Pandas
Changing series values conditionally
1 - Working with a Pandas Series
A Pandas Series is a one-dimensional labeled array that can hold any data type, including text, floating-point numbers, and integers. Alongside the DataFrame, it is one of the fundamental data structures in the Pandas library.
1.1 Creating a Pandas Series
There are several ways to create a Pandas Series. One of the simplest is to pass a list to the Series constructor:
import numpy as np  # needed for np.nan below
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
You can also create a Series from a numpy array or a dictionary. If you create a Series from a dictionary, the keys are then used as index labels:
import numpy as np
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
dict_data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(dict_data)
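As a quick check of the point about dictionary keys, the sketch below builds the same data three ways and inspects the resulting index. The variable names (from_list, from_array, from_dict) are illustrative only.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns a default integer index (0, 1, 2, ...)
from_list = pd.Series([0.0, 1.0, 2.0])

# From a NumPy array with explicit index labels
from_array = pd.Series(np.array([0.0, 1.0, 2.0]), index=['a', 'b', 'c'])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 0.0, 'b': 1.0, 'c': 2.0})

print(list(from_list.index))         # [0, 1, 2]
print(list(from_dict.index))         # ['a', 'b', 'c']
print(from_array.equals(from_dict))  # True
```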
1.2 Slicing a Pandas Series
With the default integer index, you can access and slice a Pandas Series much as you would a Python list:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
# get the first element
print(s[0]) # output: 1.0
# slice from the second to the fourth element
print(s[1:4])
# output:
# 1    3.0
# 2    5.0
# 3    NaN
# dtype: float64
1.3 loc vs iloc
When dealing with a Pandas Series, two important attributes come into play: loc and iloc. These attributes help with accessing the data.
loc is label-based selection: you pass the index label (or labels) of the elements you want to select. A label slice includes its end point. Here's an example:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
# get the value associated with 'a'
print(s.loc['a'])
# get values from 'a' to 'c'
print(s.loc['a':'c'])
iloc is integer position-based selection: you pass integer positions to select elements. Like Python and NumPy slicing, a position slice excludes its end point:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
# get the first element
print(s.iloc[0])
# get the first three elements
print(s.iloc[0:3])
In summary, loc includes the last element of the range, while iloc does not. Both are powerful ways to access and slice your data when you're dealing with Pandas Series.
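To make the end-point difference concrete, here is a minimal sketch using fixed values instead of random ones, so the result is easy to verify by eye. The values and variable names are chosen for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['a':'c']     # label slice: includes 'c'
by_position = s.iloc[0:2]     # position slice: stops before position 2

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```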
2 - Summary Statistics for a Pandas Series
Pandas provides many methods for obtaining summary statistics from a Series. These statistics summarize the central tendency, dispersion, and shape of a dataset's distribution. By default, they exclude NaN values.
2.1 Describe & Quantile
Describe:
The describe method generates descriptive statistics. It summarizes the central tendency, dispersion, and shape of a dataset's distribution:
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
s.describe()
The output includes count, mean and standard deviation. It also includes minimum and maximum values, and the quartiles.
Quantile:
The quantile method returns the value at a given quantile, specified as a number between 0 and 1:
s.quantile(0.5) # returns the median
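Because describe returns a Series, individual statistics can be picked out by label. The sketch below shows that, passes a list of quantiles to quantile, and uses a small made-up Series (with_gap) to illustrate how NaN values are skipped.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

stats = s.describe()          # describe() returns a Series of statistics
print(stats['mean'])          # 5.0
print(stats['50%'])           # 5.0, the median

# quantile() also accepts a list and returns one value per quantile
quartiles = s.quantile([0.25, 0.75])
print(quartiles.tolist())     # [3.0, 7.0]

# Missing values are skipped by default
with_gap = pd.Series([1.0, np.nan, 3.0])
print(with_gap.count())       # 2
print(with_gap.mean())        # 2.0
```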
2.2 Descriptive Statistics for a Subset of a Series
You can also get summary statistics for a subset of the Series. For instance, if you want to get the statistics only for values greater than 5:
s[s > 5].describe()
2.3 Descriptive Statistics for a Subset of a Series, Based on Values in a Different Column
Consider a DataFrame with two columns, 'A' and 'B'. If you want to get summary statistics of column 'A' based on values in column 'B', you can use:
df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9], 'B': ['a','b','a','b','a','b','a','b','a']})
df[df['B'] == 'a']['A'].describe()
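The chained form above works for reading data. An equivalent way to write it, sketched here, is a single .loc call that selects the rows and the column in one step. The variable names subset and stats are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'B': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a'],
})

# Select rows where B == 'a' and column A in a single .loc call
subset = df.loc[df['B'] == 'a', 'A']
print(subset.tolist())                # [1, 3, 5, 7, 9]

stats = subset.describe()
print(stats['count'], stats['mean'])  # 5.0 5.0
```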
2.4 Descriptive Statistics and Frequencies for a Series Containing Categorical Data
If your Series contains categorical data, describe will return different results:
s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c', 'c', 'a'])
s.describe()
The output now includes count, unique, top, and freq. Unique is the number of distinct categories. Top is the most frequent category. Freq is the number of times that top category occurs.
For categorical Series, you can also use the value_counts method. Use this method to get a frequency distribution:
s.value_counts()
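As a small illustration of the frequency distribution, the sketch below reads individual counts by label and uses the normalize=True option of value_counts to get proportions instead of counts.

```python
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c', 'c', 'a'])

counts = s.value_counts()                 # sorted by frequency, descending
shares = s.value_counts(normalize=True)   # proportions instead of counts

print(counts['a'], counts['c'], counts['b'])  # 5 3 2
print(shares['a'])                            # 0.5

summary = s.describe()
print(summary['unique'], summary['top'], summary['freq'])  # 3 a 5
```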
The methods above provide a wide range of tools for extracting essential information from a Pandas Series, which will help you understand your data better.
3 - Modifying Series Values in Pandas
As noted earlier, a Pandas Series is a one-dimensional labeled array and a fundamental data structure of the pandas library. It allows for a broad range of data manipulation. Here we delve into different techniques for changing its values.
3.1 Edit All the Values Based on a Scalar
Often, you may need to apply the same operation to all elements of a series. This is achievable through broadcasting, where an operation with a scalar is applied to each element:
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
s = s + 10 # Adds 10 to each value in the series
Thanks to broadcasting, the above example increments every element in the series s by 10.
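Broadcasting is not limited to addition. The following sketch applies a few other scalar operations and shows that the original series stays unchanged unless the result is assigned back to it. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

doubled = s * 2        # multiply every element by 2
squared = s ** 2       # square every element
is_large = s > 3       # comparisons broadcast too, giving a Boolean Series

print(doubled.tolist())   # [2, 4, 6, 8, 10]
print(squared.tolist())   # [1, 4, 9, 16, 25]
print(is_large.tolist())  # [False, False, False, True, True]
print(s.tolist())         # [1, 2, 3, 4, 5]  (s itself is unchanged)
```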
3.2 Set Values Using Index Labels
One of the core features of pandas is its advanced indexing capabilities. You can change the value of a particular element by referring to its index label:
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
s['a'] = 10 # Sets the value at index 'a' to 10
This flexibility in indexing makes pandas a powerful tool for data manipulation.
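The same assignment can be written with loc, which also accepts a list of labels. The sketch below additionally shows that assigning to a label that does not exist yet adds a new element. The label 'f' is made up for the example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

s.loc['a'] = 10            # same effect as s['a'] = 10
s.loc[['b', 'c']] = 0      # set several labels at once
print(s.tolist())          # [10, 0, 0, 4, 5]

s.loc['f'] = 6             # assigning to a new label adds an element
print(list(s.index))       # ['a', 'b', 'c', 'd', 'e', 'f']
```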
3.3 Set Values Using an Operator on More Than One Series
You can also perform operations on two or more series. Based on matching index labels, the operation applies element-wise.
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([10, 20, 30, 40, 50])
s1 = s1 + s2 # Adds corresponding values from s2 to s1
The above code adds each element in s1 to the corresponding element in s2, which produces a new series.
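Matching happens by index label, not by position, so labels present in only one series produce NaN in the result. The sketch below uses two small series with partly overlapping labels, and the add method with fill_value as one way to treat a missing partner as 0.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = s1 + s2                        # labels 'a' and 'd' have no partner
print(list(total.index))               # ['a', 'b', 'c', 'd']
print(total.isna().tolist())           # [True, False, False, True]

filled = s1.add(s2, fill_value=0)      # treat a missing partner as 0
print(filled.tolist())                 # [1.0, 12.0, 23.0, 30.0]
```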
3.4 Set Values Using Position
Pandas also provides position-based indexing using the iloc attribute. This allows for integer-based indexing:
s = pd.Series([1, 2, 3, 4, 5])
s.iloc[0] = 10 # Sets the value at position 0 (first element) to 10
The iloc indexer is handy when you do not know the index label, or when you want to refer to elements by their integer position.
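Position-based assignment works the same way when the index holds labels rather than integers. Here is a minimal sketch with a string index, using a negative position and a position slice.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

s.iloc[-1] = 50       # negative positions count from the end
s.iloc[0:2] = 0       # position slice: positions 0 and 1 only

print(s.tolist())     # [0, 0, 3, 4, 50]
```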
3.5 Set Values After Filtering
Sometimes, you might want to change values in a series based on certain conditions. Pandas supports this kind of operation using Boolean indexing:
s = pd.Series([1, 2, 3, 4, 5])
s[s > 3] = 10 # All values greater than 3 are set to 10
Here, the condition s > 3 returns a series of Booleans. That Boolean series selects the elements where the condition is True, and the assignment changes only those elements.
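Conditions can be combined with & (and), | (or), and ~ (not). The sketch below stores the combined condition in a variable named mask, a name chosen here for illustration, and reuses it.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

mask = (s > 1) & (s < 5)   # parentheses are required around each comparison
print(mask.tolist())       # [False, True, True, True, False]

s[mask] = 0
print(s.tolist())          # [1, 0, 0, 0, 5]

s[~mask] = -1              # ~ negates the mask
print(s.tolist())          # [-1, 0, 0, 0, -1]
```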
Mastering these techniques will allow you to change a pandas Series with precision and efficiency, enhancing your data wrangling capabilities.
4 - Changing series values conditionally
In data analysis, we often need to change values based on specific conditions. This makes conditional operations a crucial part of our data manipulation toolkit. Pandas provides powerful and flexible ways to perform these operations.
4.1 Boolean Indexing
Boolean indexing is a filtering technique. It lets us specify conditions that return a Boolean series. This Boolean series is then used to index the original series.
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
s[s > 2] = 10
Here, we replace all values greater than 2 with 10, modifying the series in place. This kind of operation is particularly useful when we need to update values that meet certain criteria.
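Because Boolean-indexed assignment changes the series it is applied to, one option when the original values are still needed is to work on a copy. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

original = pd.Series([1, 2, 3, 4, 5])

updated = original.copy()      # work on a copy to keep the original intact
updated[updated > 2] = 10

print(original.tolist())       # [1, 2, 3, 4, 5]
print(updated.tolist())        # [1, 2, 10, 10, 10]
```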
4.2 Using the where() function
The where() function is another powerful feature of pandas. It keeps values where a condition is True and replaces them where it is False. It takes a condition and an optional replacement value, which is used where the condition is False. If you omit the replacement, those values become NaN.
s = pd.Series([1, 2, 3, 4, 5])
s = s.where(s > 2, 10)
In the above example, we replaced all values that are not greater than 2 with 10. The where() function is handy when you need to replace values based on a condition without altering the original data, because it returns a new series.
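The sketch below contrasts where() with and without a replacement value. It also shows the companion mask() method, which replaces values where the condition is True instead. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

kept = s.where(s > 2)            # no replacement given: failing values become NaN
replaced = s.where(s > 2, 10)    # failing values become 10
inverse = s.mask(s > 2, 10)      # mask() replaces where the condition is True

print(kept.isna().tolist())      # [True, True, False, False, False]
print(replaced.tolist())         # [10, 10, 3, 4, 5]
print(inverse.tolist())          # [1, 2, 10, 10, 10]
print(s.tolist())                # [1, 2, 3, 4, 5]  (original unchanged)
```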
4.3 Using the np.where() function
NumPy's where() function is another powerful tool for conditional operations. Unlike Pandas' where(), it takes three arguments: a condition, a value to use where the condition is True, and a value to use where the condition is False. It returns a NumPy array rather than a Series, so wrap the result in pd.Series to keep working with a Series:
import numpy as np
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
# np.where returns a NumPy array, so rebuild the Series with the original index
s = pd.Series(np.where(s > 2, 10, s), index=s.index)
In the example, we replace all values greater than 2 with 10 and keep the original value otherwise. np.where() is an excellent choice when you need control over the replacement for both the True and the False case.
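As an illustration of replacing both branches, the sketch below maps each value to one of two labels. The labels 'high' and 'low' and the string index are made up for the example. The result of np.where is a plain NumPy array, so the index is reattached by hand.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Both branches are replaced: 'high' where True, 'low' where False
labels = np.where(s > 2, 'high', 'low')
print(type(labels).__name__)        # ndarray

# Rebuild a Series so the original index labels are kept
labelled = pd.Series(labels, index=s.index)
print(labelled.tolist())            # ['low', 'low', 'high', 'high', 'high']
print(list(labelled.index))         # ['a', 'b', 'c', 'd', 'e']
```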