Saturday, August 6, 2016

IPND, stage 5, data analyst, notebook

update 2017-2-27

It has been almost seven months since the original post, which looks like a real mess given what I know now. I realize there is a huge cognitive gap between the learner and the teacher; this is where the “curse of knowledge” comes in. We all stand on the shoulders of giants, but the “giant” is hugely different for each individual: different background, different culture, different learning pace, different learning style. We build new knowledge on top of what we already know, and the way we construct it is more like a graph database, with knowledge stored at the vertices and edges. Our brain doesn’t store information in SQL-style tables.

Pandas

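import pandas as pd

auto = pd.read_csv('data/auto.csv')
auto.head()                    # first 5 rows
auto.describe()                # summary statistics for every numeric column
auto.mpg.describe()            # one column via attribute access
auto['mpg'].describe()         # the same column via bracket access
auto.mpg.std()                 # standard deviation
auto.price.hist()              # histogram of one column
auto.boxplot(column='price')   # boxplot of one column
# titanic is a second DataFrame (Titanic passengers), assumed to be loaded already
grouped = titanic.groupby('Sex')
grouped.Age.describe()         # age statistics per sex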
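
To see what the groupby lines return without the real file, here is a minimal sketch on a tiny made-up table. The `passengers` frame and its values are my own assumption, not the Titanic data; only the `Sex` and `Age` column names come from the notes above.

```python
import pandas as pd

# Hypothetical stand-in for the titanic DataFrame (values are made up).
passengers = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 38.0, None, 26.0],
})

# Split the rows by sex, then summarise the Age column inside each group.
grouped = passengers.groupby("Sex")
summary = grouped.Age.describe()

# In recent pandas versions the result has one row per group and
# one column per statistic; missing ages are skipped in the count.
print(summary[["count", "mean"]])
```

Missing ages are left out of `count`, which is why the male group counts only one passenger here.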

# add two pd.Series and fill missing values
s = s1.add(s2, fill_value=0)
def fun(x):
    return x**2
s.apply(fun)
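
A small sketch of why `fill_value` matters, assuming two Series whose index labels only partly overlap. The data and the name `square` are invented for illustration.

```python
import pandas as pd

# Two Series with partly overlapping index labels (made-up data).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20], index=["b", "d"])

# Plain "+" aligns on labels and gives NaN wherever one side has no value.
plain = s1 + s2

# add(..., fill_value=0) treats the missing side as 0 instead.
filled = s1.add(s2, fill_value=0)

def square(x):
    return x ** 2

# apply() calls the function on every element of the Series.
squared = filled.apply(square)

print(plain)
print(filled)
print(squared)
```

Only label `b` exists in both Series, so the plain sum keeps a single value and loses the other three to NaN, while the filled version keeps all four labels.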

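df.loc[label]       # label-based selection (label is a placeholder)
df.iloc[position]   # integer-position selection (position is a placeholder)
df.sort_values(by=column, ascending=False)  # a DataFrame needs by=; a Series does not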
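
The difference between `loc` and `iloc` is easiest to see when the index labels are not 0, 1, 2. A minimal sketch, assuming a small made-up frame (the column names are borrowed from the auto example, the values are not real):

```python
import pandas as pd

# Made-up data with non-default index labels 10, 20, 30.
df = pd.DataFrame(
    {"mpg": [21.0, 33.5, 18.2], "price": [4100, 3800, 5200]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "price"]   # row labelled 20, column "price"
by_position = df.iloc[0, 1]      # first row, second column

label_slice = df.loc[10:20]      # label slices include the end label
pos_slice = df.iloc[0:1]         # positional slices exclude the end

# Sort rows by price, highest first.
ranked = df.sort_values(by="price", ascending=False)
print(ranked)
```

Note that `df.loc[10:20]` returns two rows while `df.iloc[0:1]` returns one, because label slices are inclusive at both ends.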

matplotlib

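import numpy as np
import matplotlib.pyplot as plt
# histogram: plt.hist(values) for a list or array, or Series.hist() in pandas
x = np.arange(0, 5, 0.1)
y = np.sin(x)
plt.plot(x, y)
plt.xlabel('x')        # axis labels and title take a string
plt.ylabel('sin(x)')
plt.title('sin(x)')
plt.show()  # display the figure (opens a new window outside a notebook)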

df.plot(kind='hist', title="passengers vs. sex")
plt.xlabel("gender, 0=male, 1=female")
plt.ylabel("number of passengers")

plt.legend(["dead","survived"])

seaborn

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Full customization of the figures will require a sophisticated understanding of matplotlib objects.
  • histogram
  • boxplot
  • kernel density estimation
  • violin plot
  • cumulative distribution function
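
import seaborn as sns
import matplotlib.pyplot as plt
plt.hist(titanic.Age.dropna(), bins=25)             # histogram
sns.boxplot(x=titanic.Age, y=titanic.Sex)           # horizontal boxplot of age per sex
sns.kdeplot(titanic.Age.dropna(), shade=True)       # kernel density estimate; newer seaborn spells this fill=True
sns.distplot(titanic.Age.dropna())                  # density + hist; deprecated in newer seaborn in favour of histplot(..., kde=True)
sns.violinplot(x=titanic.Age.dropna())              # density + box
sns.kdeplot(titanic.Age.dropna(), cumulative=True)  # cumulative distribution function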
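
The cumulative curve in the last line is a smoothed version of the empirical cumulative distribution function. A minimal sketch of the unsmoothed version with NumPy, assuming a handful of made-up ages rather than the real column:

```python
import numpy as np

# Made-up ages standing in for titanic.Age.dropna().
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0])

# Empirical CDF: sort the values; the i-th sorted value has
# cumulative share i / n of the observations at or below it.
sorted_ages = np.sort(ages)
ecdf = np.arange(1, len(sorted_ages) + 1) / len(sorted_ages)

# Reading one point off the curve: share of passengers younger than 30.
share_under_30 = np.mean(ages < 30)

for age, share in zip(sorted_ages, ecdf):
    print(age, share)
```

Plotting `sorted_ages` against `ecdf` with `plt.step` would give the staircase that the cumulative KDE smooths over.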