Using Pandas

Explore, Visualize, and Predict using Pandas & Jupyter

Setup

%matplotlib inline

import pandas as pd
import matplotlib
import numpy as np
pd.__version__, matplotlib.__version__, np.__version__

Pandas Intro

The pandas library is very popular among data scientists, quants, Excel junkies, and Python developers because it allows you to perform data ingestion, exporting, transformation, and visualization with ease. But if you are only familiar with Python, pandas may present some challenges. Since pandas is inspired by Numpy, its syntax conventions can be confusing to Python developers.

Pandas has two main datatypes: a Series and a DataFrame

  • A Series is like a column from a spreadsheet
s = pd.Series([0, 4, 6, 7])

temps = [30, 40, 60, 90]
temp_series = pd.Series(temps)
temp_series.sum()

# Boolean Arrays
hot = pd.Series([False, False, True, True])
temp_series[hot]

# Masking
mask1 = temp_series > 30
mask2 = temp_series < 60
result = temp_series[mask1 & mask2]

dates = pd.date_range('20160101', periods=4)
temp3 = pd.Series(temps, index=dates)

A DataFrame is like a spreadsheet

df = pd.DataFrame({'name': ['Fred', 'Johh', 'Joe', 'Abe'], 'age': s})

Load Data

%ls data/
nyc = pd.read_csv('data/central-park-raw.csv', parse_dates=[0])

Inspecting Data

# Columns for DataFrame
nyc.columns

# Get Datatypes for Columns
nyc.dtypes

# DataFrame Info
nyc.info

# Show Only First Bit
nyc.head(10)

Summarize Data

  • df['w'].value_counts() - Count number of rows with unique value of variable
  • df.describe()

Reference