Setup
%matplotlib inline
import pandas as pd
import matplotlib
import numpy as np
pd.__version__, matplotlib.__version__, np.__version__
The pandas library is very popular among data scientists, quants, Excel junkies, and Python developers because it allows you to perform data ingestion, exporting, transformation, and visualization with ease. But if you are only familiar with Python, pandas may present some challenges. Since pandas is inspired by Numpy, its syntax conventions can be confusing to Python developers.
Pandas has two main datatypes: a Series and a DataFrame
s = pd.Series([0, 4, 6, 7])
temps = [30, 40, 60, 90]
temp_series = pd.Series(temps)
temp_series.sum()
# Boolean Arrays
hot = pd.Series([False, False, True, True])
temp_series[hot]
# Masking
mask1 = temp_series > 30
mask2 = temp_series < 60
result = temp_series[mask1 & mask2]
dates = pd.date_range('20160101', periods=4)
temp3 = pd.Series(temps, index=dates)
A DataFrame is like a spreadsheet
df = pd.DataFrame({'name': ['Fred', 'Johh', 'Joe', 'Abe'], 'age': s})
%ls data/
nyc = pd.read_csv('data/central-park-raw.csv', parse_dates=[0])
# Columns for DataFrame
nyc.columns
# Get Datatypes for Columns
nyc.dtypes
# DataFrame Info
nyc.info
# Show Only First Bit
nyc.head(10)
df['w'].value_counts()
- Count number of rows with unique value of variabledf.describe()