import pandas as pd
import pyarrow as pa
string_pa = pd.ArrowDtype(pa.string())
2 Data Types (DeepDive)
2.1 String Type
text_freeform = ['My name is Jeff', 'I like pandas',
                 'I like programming']
text_with_missing = ['My name is Jeff', None, 'I like programming']
In pandas 1.x, the type of the series is object, because each string is stored as a Python str object. Pandas 1.x builds on NumPy, which at the time offered only fixed-width string types and no variable-length string type, so pandas falls back to Python objects.
pd.Series(text_freeform)
0 My name is Jeff
1 I like pandas
2 I like programming
dtype: object
Pandas 2.0 string type:
tf1 = pd.Series(text_freeform, dtype=string_pa)
tf1
0 My name is Jeff
1 I like pandas
2 I like programming
dtype: string[pyarrow]
pd.Series(text_with_missing, dtype=string_pa)
0 My name is Jeff
1 <NA>
2 I like programming
dtype: string[pyarrow]
2.2 Categorical Type
When you load data, you can indicate that the data is categorical. If we know that our data is limited to a few values, we might want to use categorical data. Categorical values have a few benefits:
- Use less memory than strings
- Improve performance
- Can have an ordering
- Can perform operations on categories
- Enforce membership on values
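Two of these benefits, lower memory use and enforced membership, can be checked with a small sketch. The data here is hypothetical (3,000 labels drawn from three values), and the names `labels`, `as_object`, `as_category`, and `rejected` exist only for this illustration.

```python
import pandas as pd

# Hypothetical data: 3,000 values drawn from only three distinct labels
labels = ['s', 'm', 'l'] * 1000

as_object = pd.Series(labels, dtype=object)
as_category = pd.Series(labels, dtype='category')

# deep=True counts the Python string objects, not just the pointers
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# Membership is enforced: assigning a label outside the categories fails
try:
    as_category.iloc[0] = 'xxl'
    rejected = False
except (TypeError, ValueError):
    rejected = True
```

The categorical series stores one small integer code per row plus a single copy of each label, which is why `category_bytes` comes out smaller than `object_bytes` for repetitive data like this.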
s = pd.Series(['s', 'm', 'l'], dtype='category')
s
0 s
1 m
2 l
dtype: category
Categories (3, object): ['l', 'm', 's']
# By default, a category has no ordering
s.cat.ordered
False
2.2.1 Ordered Category
s2 = (
    pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='string')
    .astype(pd.CategoricalDtype(categories=['s', 'm', 'l'], ordered=True))
)
s2
0 m
1 l
2 NaN
3 s
4 NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
s2.cat.ordered
True
Make a comparison:
s2 > "s"
0 True
1 True
2 False
3 False
4 False
dtype: bool
# Boolean Subsetting
s2[s2 > "s"]
0 m
1 l
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
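Besides comparisons, an ordered category also drives sorting and min/max. The following is a minimal sketch with hypothetical data; `size_type` and `sizes` are names chosen for this example only.

```python
import pandas as pd

# Ordered categorical type: s < m < l
size_type = pd.CategoricalDtype(categories=['s', 'm', 'l'], ordered=True)
sizes = pd.Series(['m', 'l', 's', 'm'], dtype=size_type)

# Sorting follows the category order, not alphabetical order
sorted_sizes = sizes.sort_values()

# min and max are defined because the categories are ordered
smallest = sizes.min()
largest = sizes.max()
```

Alphabetical sorting would put 'l' first; with the ordered type, `sorted_sizes` runs from 's' up to 'l'.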
2.2.2 Reorder Category
s = pd.Series(['s', 'm', 'l'], dtype='category')
print(s.cat.categories)
Index(['l', 'm', 's'], dtype='object')
s_ordered = (
    s.cat.add_categories(["xs", "xl"])
    .cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)
)
s_ordered
0 s
1 m
2 l
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
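As an alternative sketch, not the approach used above, `cat.set_categories` can add the new labels and fix their order in one call. The name `s_set` is introduced here only for illustration.

```python
import pandas as pd

s = pd.Series(['s', 'm', 'l'], dtype='category')

# Add 'xs' and 'xl' and define the full order in a single step
s_set = s.cat.set_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)

# Each value is stored as the integer position of its category
codes = s_set.cat.codes
```

The values are unchanged; only the category list and its ordering differ. Note that `set_categories` turns any value missing from the new list into NaN, so the two-step version is the more explicit choice when no value should be dropped.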
2.3 Date & Times
import datetime as dt
import pandas as pd
dt_list = [dt.datetime(2020, 1, 1, 4, 30), dt.datetime(2020, 1, 2), dt.datetime(2020, 1, 3)]
string_dates = ['2020-01-01 04:30:00', '2020-01-02 00:00:00', '2020-01-03 00:00:00']
string_dates_missing = ['2020-01-01 4:30', None, '2020-01-03']
epoch_dates = [1577836800, 1577923200, 1578009600]
pd.Series(dt_list)
0 2020-01-01 04:30:00
1 2020-01-02 00:00:00
2 2020-01-03 00:00:00
dtype: datetime64[ns]
pd.Series(string_dates, dtype='datetime64[ns]')
0 2020-01-01 04:30:00
1 2020-01-02 00:00:00
2 2020-01-03 00:00:00
dtype: datetime64[ns]
pd.Series(string_dates_missing, dtype='datetime64[ns]')
0 2020-01-01 04:30:00
1 NaT
2 2020-01-03 00:00:00
dtype: datetime64[ns]
# Convert from seconds since the epoch
pd.Series(epoch_dates, dtype='datetime64[s]')
0 2020-01-01
1 2020-01-02
2 2020-01-03
dtype: datetime64[s]
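Another common way to convert epoch seconds, shown here as a sketch rather than the method used above, is `pd.to_datetime` with `unit='s'`. The names `converted`, `days`, and `weekdays` are hypothetical.

```python
import pandas as pd

epoch_dates = [1577836800, 1577923200, 1578009600]

# unit='s' tells pandas the integers count seconds since 1970-01-01 (UTC)
converted = pd.Series(pd.to_datetime(epoch_dates, unit='s'))

# The .dt accessor exposes datetime components
days = converted.dt.day
weekdays = converted.dt.day_name()
```

Once the values are datetimes, the `.dt` accessor gives access to components such as the day of the month or the weekday name.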