2  Data Types (DeepDive)

import pandas as pd
import pyarrow as pa
string_pa = pd.ArrowDtype(pa.string())

2.1 String Type

text_freeform = ['My name is Jeff', 'I like pandas', 
                 'I like programming']
text_with_missing = ['My name is Jeff', None, 'I like programming']

When a Series is built from a list of strings, its dtype is object, because the Series stores references to Python objects. Pandas 1.x stores str values this way because NumPy arrays did not offer a variable-length string type.

pd.Series(text_freeform)
0       My name is Jeff
1         I like pandas
2    I like programming
dtype: object
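To make the "Python objects" point concrete, the following minimal sketch inspects what an object Series actually holds. The names s_obj and element_types are illustrative and not part of the original notes; dtype=object is passed explicitly so the result does not depend on the installed pandas version's string inference.

```python
import pandas as pd

text_freeform = ['My name is Jeff', 'I like pandas', 'I like programming']

# Force the object dtype explicitly (hypothetical variable name)
s_obj = pd.Series(text_freeform, dtype=object)

# Every element is an ordinary Python str, held by reference
element_types = s_obj.map(type).unique().tolist()

print(s_obj.dtype)
print(element_types)
```

Because each value is a full Python object, string operations on such a Series work element by element rather than on a compact string buffer.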

Pandas 2.0 can instead store strings with the PyArrow-backed string type, which represents missing values as <NA>:

tf1 = pd.Series(text_freeform, dtype=string_pa)
tf1
0       My name is Jeff
1         I like pandas
2    I like programming
dtype: string[pyarrow]
pd.Series(text_with_missing, dtype=string_pa)
0       My name is Jeff
1                  <NA>
2    I like programming
dtype: string[pyarrow]

2.2 Categorical Type

When you load data, you can indicate that a column is categorical. If the data is limited to a few distinct values, a categorical type is often a good fit. Categorical values have a few benefits:

  • Use less memory than strings
  • Improve performance
  • Can have an ordering
  • Can perform operations on categories
  • Enforce membership on values
s = pd.Series(['s', 'm', 'l'], dtype='category')
s
0    s
1    m
2    l
dtype: category
Categories (3, object): ['l', 'm', 's']
# By default, a category has no ordering
s.cat.ordered
False
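Two of the benefits listed above, lower memory use and enforced membership, can be checked with a rough sketch like the one below. The repeated sizes data and the variable names are made up for illustration, and the exact exception type raised on an unknown value may differ between pandas versions, so both TypeError and ValueError are caught.

```python
import pandas as pd

# Hypothetical data: three distinct values repeated many times
sizes = pd.Series(['s', 'm', 'l'] * 10_000, dtype=object)
sizes_cat = sizes.astype('category')

# deep=True counts the Python string objects, not just the pointers
obj_bytes = sizes.memory_usage(deep=True)
cat_bytes = sizes_cat.memory_usage(deep=True)
print(cat_bytes < obj_bytes)

# Assigning a value that is not an existing category is rejected
try:
    sizes_cat.iloc[0] = 'xxl'
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)
```

The saving comes from storing small integer codes plus one copy of each category; it shrinks as the number of distinct values approaches the length of the Series.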

2.2.1 Ordered Category

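Passing ordered=True to pd.CategoricalDtype creates an ordered category. Values that are not among the listed categories ('xs' and 'xl' here) become NaN: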
s2 = (
    pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='string')
        .astype(pd.CategoricalDtype(categories=['s','m','l'], ordered=True))
)
s2
0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
s2.cat.ordered
True

Because the categories are ordered, the series supports comparisons:

s2 > "s"
0     True
1     True
2    False
3    False
4    False
dtype: bool
# Boolean Subsetting
s2[s2 > "s"]
0    m
1    l
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
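Beyond comparisons, an ordered category also determines sorting and min/max. The sketch below rebuilds s2 from a plain list (rather than the 'string' dtype used above) to stay self-contained; size_type, sorted_sizes, smallest and largest are illustrative names.

```python
import pandas as pd

size_type = pd.CategoricalDtype(categories=['s', 'm', 'l'], ordered=True)
s2 = pd.Series(['m', 'l', 'xs', 's', 'xl']).astype(size_type)

# Sorting follows the category order (s < m < l); NaN goes last
sorted_sizes = s2.sort_values()

# min and max use the category order and skip NaN by default
smallest, largest = s2.min(), s2.max()

print(sorted_sizes)
print(smallest, largest)
```

Without ordered=True, sort_values would still sort by category position, but min, max and the comparison operators would raise an error.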

2.2.2 Reorder Category

s = pd.Series(['s', 'm', 'l'], dtype='category')
print(s.cat.categories)
Index(['l', 'm', 's'], dtype='object')
s_ordered = (
    s.cat.add_categories(["xs", "xl"])
    .cat.reorder_categories(['xs','s','m','l', 'xl'], ordered=True)
)

s_ordered
0    s
1    m
2    l
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
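One way to see what reordering changes is to look at the integer codes behind the values. This is a minimal sketch; codes and above_small are illustrative names not used elsewhere in these notes.

```python
import pandas as pd

s = pd.Series(['s', 'm', 'l'], dtype='category')
s_ordered = (
    s.cat.add_categories(['xs', 'xl'])
    .cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)
)

# Each code is the position of the value in the category list
codes = s_ordered.cat.codes.tolist()

# The new order now drives comparisons
above_small = s_ordered[s_ordered > 's'].tolist()

print(codes)
print(above_small)
```

The values themselves are unchanged by the reorder; only the mapping from value to position, and therefore the ordering, is different.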

2.3 Date & Times

import datetime as dt
import pandas as pd

dt_list = [dt.datetime(2020, 1, 1, 4, 30), dt.datetime(2020, 1, 2), dt.datetime(2020, 1, 3)]
string_dates = ['2020-01-01 04:30:00', '2020-01-02 00:00:00', '2020-01-03 00:00:00']
string_dates_missing = ['2020-01-01 4:30', None, '2020-01-03']
epoch_dates = [1577836800, 1577923200, 1578009600]
pd.Series(dt_list)
0   2020-01-01 04:30:00
1   2020-01-02 00:00:00
2   2020-01-03 00:00:00
dtype: datetime64[ns]
pd.Series(string_dates, dtype='datetime64[ns]')
0   2020-01-01 04:30:00
1   2020-01-02 00:00:00
2   2020-01-03 00:00:00
dtype: datetime64[ns]
pd.Series(string_dates_missing, dtype='datetime64[ns]')
0   2020-01-01 04:30:00
1                   NaT
2   2020-01-03 00:00:00
dtype: datetime64[ns]
# Convert from seconds since the Unix epoch
pd.Series(epoch_dates, dtype='datetime64[s]') 
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[s]
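An alternative to passing dtype is pd.to_datetime, which makes the unit or the text format explicit. The sketch below is one possible way to write the same conversions, not the approach used above; from_epoch, from_text and hours are illustrative names.

```python
import pandas as pd

epoch_dates = [1577836800, 1577923200, 1578009600]
string_dates = ['2020-01-01 04:30:00', '2020-01-02 00:00:00',
                '2020-01-03 00:00:00']

# Interpret the integers as seconds since the Unix epoch
from_epoch = pd.to_datetime(pd.Series(epoch_dates), unit='s')

# Parse text with an explicit format instead of relying on inference
from_text = pd.to_datetime(pd.Series(string_dates),
                           format='%Y-%m-%d %H:%M:%S')

# The .dt accessor exposes date and time components
hours = from_text.dt.hour.tolist()

print(from_epoch)
print(hours)
```

Stating the unit or format up front avoids ambiguity, for example whether an integer means seconds or nanoseconds.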