import pandas as pd
import pyarrow as pa
string_pa = pd.ArrowDtype(pa.string())
2 Data Types (DeepDive)
2.1 String Type
text_freeform = ['My name is Jeff', 'I like pandas',
'I like programming']
text_with_missing = ['My name is Jeff', None, 'I like programming']
By default, the type of a series of strings is object, because the series stores Python objects. Pandas 1.x stores str values as Python objects because NumPy did not offer a variable-length string type, only fixed-width ones.
pd.Series(text_freeform)
0 My name is Jeff
1 I like pandas
2 I like programming
dtype: object
Pandas 2.0 string type:
tf1 = pd.Series(text_freeform, dtype=string_pa)
tf1
0 My name is Jeff
1 I like pandas
2 I like programming
dtype: string[pyarrow]
pd.Series(text_with_missing, dtype=string_pa)
0 My name is Jeff
1 <NA>
2 I like programming
dtype: string[pyarrow]
2.2 Categorical Type
When you load data, you can indicate that the data is categorical. If we know that our data is limited to a few values, we might want to use categorical data. Categorical values have a few benefits:
- Use less memory than strings
- Improve performance
- Can have an ordering
- Can perform operations on categories
- Enforce membership on values
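Two of these benefits, lower memory use and enforced membership, can be checked with a rough sketch like the one below. The data and the names `sizes`, `sizes_cat`, and `rejected` are made up for illustration, and the exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Hypothetical data: three distinct labels repeated many times
sizes = pd.Series(['s', 'm', 'l'] * 10_000)
sizes_cat = sizes.astype('category')

# deep=True also counts the memory held by the string values themselves
plain_bytes = sizes.memory_usage(deep=True)
category_bytes = sizes_cat.memory_usage(deep=True)
print(f"plain: {plain_bytes:,} bytes, category: {category_bytes:,} bytes")

# Membership is enforced: assigning a label that is not a category is rejected
try:
    sizes_cat.iloc[0] = 'xxl'
    rejected = False
except (TypeError, ValueError):
    rejected = True
print("unknown label rejected:", rejected)
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it shrinks as the number of distinct values approaches the number of rows.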
s = pd.Series(['s', 'm', 'l'], dtype='category')
s
0 s
1 m
2 l
dtype: category
Categories (3, object): ['l', 'm', 's']
# By default, a category has no ordering
s.cat.ordered
False
2.2.1 Ordered Category
# Cast to an ordered categorical; values outside the listed categories ('xs', 'xl') become NaN
s2 = (
    pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='string')
    .astype(pd.CategoricalDtype(categories=['s', 'm', 'l'], ordered=True))
)
s2
0 m
1 l
2 NaN
3 s
4 NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
s2.cat.ordered
True
Ordered categories support comparisons (missing values compare as False):
s2 > "s"
0 True
1 True
2 False
3 False
4 False
dtype: bool
# Boolean Subsetting
s2[s2 > "s"]
0 m
1 l
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
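Beyond comparisons, an ordered category also drives sorting and min/max. The following is a minimal sketch with made-up data; `size_dtype` and `sizes` are names chosen here, not taken from the text above.

```python
import pandas as pd

size_dtype = pd.CategoricalDtype(categories=['s', 'm', 'l'], ordered=True)
sizes = pd.Series(['m', 'l', 's', 'm'], dtype=size_dtype)

# Sorting follows the category order (s < m < l), not alphabetical order
print(sizes.sort_values().tolist())   # ['s', 'm', 'm', 'l']

# min and max are only defined when the category is ordered
print(sizes.min(), sizes.max())       # s l
```

Alphabetical sorting would have put 'l' first, so the declared order is what makes the result meaningful for sizes.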
2.2.2 Reorder Category
s = pd.Series(['s', 'm', 'l'], dtype='category')
print(s.cat.categories)
Index(['l', 'm', 's'], dtype='object')
s_ordered = (
    s.cat.add_categories(["xs", "xl"])
    .cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)
)
s_ordered
0 s
1 m
2 l
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
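The add-then-reorder chain can also be written as a single call to `set_categories`, which replaces the category list in one step. This is a sketch of an alternative rather than the approach used above; `all_sizes`, `two_step`, and `one_step` are names introduced here.

```python
import pandas as pd

s = pd.Series(['s', 'm', 'l'], dtype='category')
all_sizes = ['xs', 's', 'm', 'l', 'xl']

# Two-step version: add the new labels, then reorder
two_step = (
    s.cat.add_categories(['xs', 'xl'])
    .cat.reorder_categories(all_sizes, ordered=True)
)

# One-step alternative: set_categories replaces the category list directly
one_step = s.cat.set_categories(all_sizes, ordered=True)

print(one_step.cat.categories.tolist())
print(one_step.cat.ordered)
```

One caveat: `set_categories` silently turns any existing value that is missing from the new list into NaN, whereas `reorder_categories` raises an error if the new list does not contain exactly the current categories.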
2.3 Dates & Times
import datetime as dt
import pandas as pd
dt_list = [
    dt.datetime(2020, 1, 1, 4, 30),
    dt.datetime(2020, 1, 2),
    dt.datetime(2020, 1, 3),
]
string_dates = ['2020-01-01 04:30:00', '2020-01-02 00:00:00', '2020-01-03 00:00:00']
string_dates_missing = ['2020-01-01 4:30', None, '2020-01-03']
epoch_dates = [1577836800, 1577923200, 1578009600]
pd.Series(dt_list)
0 2020-01-01 04:30:00
1 2020-01-02 00:00:00
2 2020-01-03 00:00:00
dtype: datetime64[ns]
pd.Series(string_dates, dtype='datetime64[ns]')
0 2020-01-01 04:30:00
1 2020-01-02 00:00:00
2 2020-01-03 00:00:00
dtype: datetime64[ns]
pd.Series(string_dates_missing, dtype='datetime64[ns]')
0 2020-01-01 04:30:00
1 NaT
2 2020-01-03 00:00:00
dtype: datetime64[ns]
# Convert from seconds since the Unix epoch
pd.Series(epoch_dates, dtype='datetime64[s]')
0 2020-01-01
1 2020-01-02
2 2020-01-03
dtype: datetime64[s]
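Epoch seconds can also be converted with `pd.to_datetime` and its `unit` argument, after which the `.dt` accessor exposes the date components. The sketch below reuses `epoch_dates` from above; `from_epoch` is a name chosen here for illustration.

```python
import pandas as pd

epoch_dates = [1577836800, 1577923200, 1578009600]

# unit='s' tells pandas the integers are seconds since 1970-01-01 (UTC)
from_epoch = pd.to_datetime(pd.Series(epoch_dates), unit='s')

# The .dt accessor exposes date components of a datetime series
print(from_epoch.dt.year.tolist())       # [2020, 2020, 2020]
print(from_epoch.dt.day_name().tolist()) # ['Wednesday', 'Thursday', 'Friday']
```

Stating the unit explicitly is a reasonable habit when the source might hold milliseconds or nanoseconds instead, since the integers alone do not say which.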