Saturday, August 6, 2016

IPND, stage 5, data analyst, notebook

update 2017-2-27

It is almost seven months since the original post, which looks like a real mess from what I know now. I realize there is a huge cognitive gap between the learner and the teacher. This is where the “curse of knowledge” comes in. We all stand on the shoulders of giants, but the “giant” is hugely different for each individual: different background, different culture, different learning pace, different learning style. We build new knowledge on top of what we already know. And the way we construct our knowledge is more like a graph database: knowledge is stored at the vertices or edges. Our brain doesn’t store information in SQL-style tables.

Pandas

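import pandas as pd

auto = pd.read_csv('data/auto.csv')
auto.head()                    # first 5 rows
auto.describe()                # summary statistics for every numeric column
auto.mpg.describe()            # one column, attribute access
auto['mpg'].describe()         # same column, bracket access
auto.mpg.std()                 # standard deviation
auto.price.hist()              # histogram of one column
auto.boxplot(column='price')   # box plot of one column

# titanic: another DataFrame (its loading step is not shown here)
grouped = titanic.groupby('Sex')
grouped.Age.describe()         # summary statistics of Age for each sex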
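
Neither data/auto.csv nor the titanic table is included in this post, so here is a minimal sketch on a small made-up DataFrame (the values are hypothetical) to show what describe() and groupby() give back. It assumes a reasonably recent pandas, where describe() on a grouped column returns one row per group.

```python
import pandas as pd

# Hypothetical stand-in for the titanic table; the values are made up.
titanic = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None],
})

# Attribute access and bracket access return the same column.
same_column = titanic.Age.equals(titanic["Age"])

# describe() skips missing values, so count is 4 here, not 5.
summary = titanic.Age.describe()
print(summary["count"], summary["mean"])

# One row of summary statistics per group.
by_sex = titanic.groupby("Sex").Age.describe()
print(by_sex[["count", "mean"]])
```

The missing age is simply left out of every statistic, which is why count can be smaller than the number of rows.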

# add 2 pd.Series; a label missing from one side counts as 0
s = s1.add(s2, fill_value=0)
# apply a function to every element, e.g. def fun(x): return x**2
s.apply(fun)
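
To see why fill_value matters, here is a small sketch with two made-up Series whose labels only partly overlap. The names s1, s2 and fun follow the snippet above; the values are hypothetical.

```python
import pandas as pd

# Two Series with partly different labels (made-up values).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20], index=["b", "d"])

# Plain + aligns on labels and gives NaN wherever one side lacks the label.
plain = s1 + s2

# add() with fill_value=0 treats the missing side as 0 instead.
s = s1.add(s2, fill_value=0)

def fun(x):
    return x ** 2

# apply() calls fun once per element.
squared = s.apply(fun)

print(plain.isna().sum())   # number of labels that ended up as NaN
print(s.to_dict())
print(squared.to_dict())
```

Only label "b" exists on both sides, so the plain sum keeps one real value, while add() keeps all four labels.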

df.loc[row_label, column_label]   # label-based selection
df.iloc[i, j]                     # integer-position-based selection
df.sort_values(by='column_name', ascending=False)  # by= is required for a DataFrame
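
The difference between labels and positions is easiest to see when the index labels are integers that do not match the row positions. A minimal sketch, with a made-up frame and a hypothetical mpg column:

```python
import pandas as pd

# Index labels 10, 20, 30 deliberately differ from positions 0, 1, 2.
df = pd.DataFrame({"mpg": [21.0, 33.5, 18.2]}, index=[10, 20, 30])

by_label = df.loc[20, "mpg"]     # row labelled 20
by_position = df.iloc[0, 0]      # first row, first column

# loc slices include the end label; iloc slices exclude the end position.
label_slice = df.loc[10:20]
position_slice = df.iloc[0:1]

# Sort rows by a column, largest value first.
ranked = df.sort_values(by="mpg", ascending=False)
print(by_label, by_position, list(ranked.index))
```

Note the slicing rule: df.loc[10:20] returns two rows, while df.iloc[0:1] returns one.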

matplotlib

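import numpy as np
import matplotlib.pyplot as plt

plt.hist(values)          # values: a list of numbers; a Series also has .hist()
x = np.arange(0, 5, 0.1)
y = np.sin(x)
plt.plot(x, y)
plt.xlabel('x')           # each label call takes the text to display
plt.ylabel('sin(x)')
plt.title('sine')
plt.show()                # show the plot in a new window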
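
The same sine plot can also be written against explicit figure and axes objects, which is roughly what the plt.* calls do behind the scenes. This is only a sketch: the Agg backend line is an assumption for running without a display, and the label texts are made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 5, 0.1)   # 0.0, 0.1, ..., 4.9
y = np.sin(x)

# One figure holding one set of axes.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("x")         # same role as plt.xlabel("x")
ax.set_ylabel("sin(x)")    # same role as plt.ylabel("sin(x)")
ax.set_title("sine curve") # same role as plt.title("sine curve")
```

In an interactive session, calling plt.show() afterwards opens the window as before.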

df.plot(kind='hist',title="passengers vs. sex")
plt.xlabel("gender, 0=male, 1=female")
plt.ylabel("number of passengers")

plt.legend(["dead","survived"])
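
The DataFrame behind that chart is not shown here, so the sketch below is a guess at one way to get there: count passengers by sex and outcome with pd.crosstab, then draw the counts as bars (a bar chart swapped in for kind='hist'). The column names Sex and Survived and all the rows are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit in a notebook
import pandas as pd

# Made-up rows; the column names Sex and Survived are assumptions.
titanic = pd.DataFrame({
    "Sex": [0, 1, 1, 0, 0, 1],        # 0 = male, 1 = female
    "Survived": [0, 1, 1, 0, 1, 0],   # 0 = dead, 1 = survived
})

# Rows: sex, columns: outcome, cells: number of passengers.
counts = pd.crosstab(titanic["Sex"], titanic["Survived"])

# DataFrame.plot returns the matplotlib axes, so labels can be set on it.
ax = counts.plot(kind="bar", title="passengers vs. sex")
ax.set_xlabel("gender, 0=male, 1=female")
ax.set_ylabel("number of passengers")
ax.legend(["dead", "survived"])
```

Each column of counts becomes one bar series, which is why the legend needs exactly two entries.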

seaborn

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Full customization of the figures will require a sophisticated understanding of matplotlib objects.
  • histogram
  • boxplot
  • kernel density estimation
  • violin plot
  • cumulative distribution function
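
import seaborn as sns
# spelled for recent seaborn releases (keyword x=/y=, fill=, histplot);
# older releases used positional arguments, shade= and distplot
plt.hist(titanic.Age.dropna(), bins=25)
sns.boxplot(x=titanic.Age, y=titanic.Sex)            # horizontal boxes, one per sex
sns.kdeplot(x=titanic.Age.dropna(), fill=True)
sns.histplot(x=titanic.Age.dropna(), kde=True)       # density + hist
sns.violinplot(x=titanic.Age.dropna())               # density + box
sns.kdeplot(x=titanic.Age.dropna(), cumulative=True)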
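
The last call draws a cumulative distribution. As a rough illustration of what that curve means, the sketch below computes an empirical CDF by hand with NumPy on made-up ages. It is not what seaborn does internally (kdeplot works from a smoothed density estimate); it is just the unsmoothed analogue, and ecdf is a hypothetical helper name.

```python
import numpy as np

# Made-up ages, standing in for titanic.Age.dropna().
ages = np.array([22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0])

def ecdf(values):
    """Return sorted values and the fraction of data at or below each one."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf(ages)

# Fraction of passengers aged 27 or younger: last sorted position where x <= 27.
share_27 = y[np.searchsorted(x, 27.0, side="right") - 1]
print(share_27)
```

Plotting x against y as a step line gives the jagged version of the smooth cumulative curve.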
