Data Wrangling
Shannon Bradshaw, director of education at MongoDB, the open-source NoSQL database.
Data scientists spend roughly 70% of their time on data wrangling. If you don't take the time to ensure your data is in good shape before doing any analysis, you run a big risk of wasting a lot of time later on, or of losing the faith of colleagues who depend on your data.
Generally, we should not trust any data we get. Where does it come from? Watch for typos, missing values, inconsistent datetime formats, and outliers.
Data extraction
Extracting data from CSV, Excel, and local/remote JSON files is relatively easy. See the Python code here.
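As a minimal sketch of the CSV and JSON cases (using hypothetical inline data in place of real files; Excel usually needs a third-party library such as pandas):

```python
import csv
import io
import json

# Hypothetical inline data standing in for local files.
csv_text = "name,age\nAda,36\nGrace,45\n"
json_text = '[{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]'

# CSV: DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: loads() turns the document into plain lists/dicts.
records = json.loads(json_text)

print(rows[0]["name"], records[1]["age"])
```

For real files, replace the `io.StringIO(...)` wrapper with `open("file.csv")` and `json.loads` with `json.load(open("file.json"))`.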
Read XML/HTML
XML design goals:
- platform-independent data transfer
- easy to write code to read/write
- document validation
- human readable
- support a wide variety of applications
- robust parsers in most languages.
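The "easy to write code to read" goal shows in practice: Python's standard library parses XML in a few lines. A minimal sketch, using a made-up document:

```python
import xml.etree.ElementTree as ET

# A tiny hypothetical XML document; a real file would be read with ET.parse().
doc = """<catalog>
  <book id="b1"><title>Data Wrangling</title><year>2016</year></book>
  <book id="b2"><title>SQL Basics</title><year>2014</year></book>
</catalog>"""

root = ET.fromstring(doc)
# findall() walks matching child elements; .get() reads attributes, .text reads content.
titles = [(b.get("id"), b.find("title").text) for b in root.findall("book")]
print(titles)
```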
It is very common to fill in a form and make an HTTP POST request to get the desired information. Exactly what information you send in the POST request can be seen in the browser's developer tools -> Network -> Data_Elements.aspx?Data=2 -> Headers -> Form Data. There are 7 parameters you need to pass in the form.
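In Python, those form parameters become a URL-encoded POST body. A sketch with placeholder field names (the real names and values must be copied from the Network tab):

```python
from urllib.parse import urlencode

# Hypothetical form fields copied from the browser's Network tab; the real
# request needs all seven parameters, and these names are placeholders.
form_data = {
    "__EVENTTARGET": "",
    "__VIEWSTATE": "...",   # copied from the page source, elided here
    "CarrierList": "VX",
    "AirportList": "BOS",
}
payload = urlencode(form_data).encode("ascii")

# The encoded bytes go in the POST body, e.g.:
# urllib.request.urlopen("https://example.com/Data_Elements.aspx?Data=2", data=payload)
print(payload[:40])
</antml_code_placeholder>```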
simple comparison:
Difference | XML | HTML |
---|---|---|
birth | 1996 | 1993 |
pre-defined tags? | no | yes |
purpose | store data | display data |
full name | Extensible Markup Language | HyperText Markup Language |
Data Quality
This process is very situation-specific. Common issues include:
- typos
- legacy data
- inconsistent time formats
- outliers, found via statistical analysis, whose causes need to be identified
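Inconsistent time formats are a good illustration. A small sketch (with made-up raw values) that normalizes several legacy date formats to ISO 8601:

```python
from datetime import datetime

# Hypothetical raw values: the same date arriving in three legacy formats.
raw = ["2016-03-01", "03/01/2016", "1 Mar 2016"]
formats = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"]

def normalize(value):
    # Try each known format until one parses; real data also needs
    # logging of the values that match no format.
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass
    return None  # flag for manual review

cleaned = [normalize(v) for v in raw]
print(cleaned)  # → ['2016-03-01', '2016-03-01', '2016-03-01']
```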
SQL
The major content is similar to the free course Intro to Relational Databases. I actually wrote a blog post last August when I was at stage 5 (back end) of the Intro to Programming ND.
The nice improvement made by the Nanodegree program is the setup of a local environment and a set of problems that let you play with the Chinook database.
comparison of different RDBMS
SQL environments:
- MySQL, PostgreSQL, Oracle: code -> network -> server -> data on disk
- SQLite: code -> DB library -> data on disk
RDBMS | born | creator | features | DB-API module |
---|---|---|---|---|
SQLite | 2000 | D. Richard Hipp | fast, free | sqlite3 |
SQL Server | 1989 | Microsoft | SaaS | |
MySQL | 1995 | Oracle | partially free | mysql.connector |
PostgreSQL | 1996 | UC Berkeley | more formats, free | psycopg2 |
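The DB-API modules in the last column all share the same DB-API 2.0 interface, so the connect/cursor/execute pattern learned with sqlite3 carries over to mysql.connector and psycopg2. A minimal sketch:

```python
import sqlite3

# An in-memory database: no server, no file on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
# Parameterized inserts via the DB-API placeholder (? for sqlite3).
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
conn.commit()
cur.execute("SELECT sum(x) FROM t")
total = cur.fetchone()[0]
print(total)  # → 6
conn.close()
```

With psycopg2 or mysql.connector, only `connect()`'s arguments and the placeholder style (`%s`) change.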
I am touched by the story of SQLite's author, Hipp, who keeps it free for everyone:
“It’s very clear to me that if I’d had any business sense whatsoever I could have made a lot of money, but I don’t,” he says. “I like to tell people that we make enough to live fine in Charlotte, North Carolina. We don’t make nearly enough money to live in London or San Francisco, but we don’t live there so that’s OK,” he adds - with a touch, just a touch, of wistfulness.
sqlite
SQLite comes preinstalled on macOS. My version is 3.13, while the latest version is 3.17. I haven't figured out a simple way to update it.
sqlite3 # type in the shell to enter the environment
.open chinook.db
.exit
.quit
.help
.tables
.schema
Alternatively, SQLiteStudio is a visually appealing IDE. You can see how many tables there are, which column is the primary key, and the whole tabular data, all at a click. No black box anymore; it really makes life much easier. I guess the only catch is to be careful about table size — get a sample first.
sqlite datatype
- TEXT
- REAL
- INTEGER
- BLOB
- NULL
There is no class for storing dates/times; TEXT or NUMERIC can be used to store dates, and there are built-in functions to parse them into the right format.
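For example, dates stored as ISO-8601 TEXT can be parsed on the fly with SQLite's built-in `date()` and `strftime()` functions. A small sketch via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Dates stored as TEXT in ISO-8601 format.
cur.execute("CREATE TABLE events (name TEXT, happened TEXT)")
cur.execute("INSERT INTO events VALUES ('launch', '2016-03-01 10:30:00')")
# strftime() extracts components; date() strips the time part.
cur.execute("SELECT strftime('%Y', happened), date(happened) FROM events")
year, day = cur.fetchone()
print(year, day)  # → 2016 2016-03-01
conn.close()
```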
sqlite tricks
change output format
.show # show current settings
.mode list # default output mode; the default separator is a pipe symbol
.separator ","
.mode quote
.mode line
.mode column
.mode tabs
.header off
.mode insert # used to input data elsewhere
inspect and output results
.tables # see a list of tables
.schema
.databases
.output filename.txt # all subsequent results will be written to this file
import/ output table
.mode csv # alignment format
.import somedata.csv table1 # import table
.header on
.mode csv
.once filename.csv
select * from table1;
.system open filename.csv # open the file to display it
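The `.import` dot-command has a straightforward Python equivalent using the csv module, which is handy when the import needs cleaning logic. A sketch with hypothetical inline data standing in for somedata.csv:

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for somedata.csv.
csv_text = "id,name\n1,alpha\n2,beta\n"

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (id INTEGER, name TEXT)")

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
cur.executemany("INSERT INTO table1 VALUES (?, ?)", reader)
conn.commit()

cur.execute("SELECT count(*) FROM table1")
n = cur.fetchone()[0]
print(n)  # → 2
```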
create a new SQLite database:
sqlite3 ex1 # enter sqlite and create the ex1 database
create table tb1(one varchar(10), two smallint);
insert into tb1 values('hello!',10);
insert into tb1 values('goodbye', 20);
CREATE TABLE tb2 (f1 varchar(30) primary key,
f2 text,f3 real);
note: SQL query commands are not case-sensitive, but TEXT values are case-sensitive.
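The statements above can be replayed from Python to see both rules at once — keywords in any case work identically, while value comparisons are case-sensitive. A small sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Keywords in any case work identically...
cur.execute("CREATE TABLE tb1 (one VARCHAR(10), two SMALLINT)")
cur.execute("insert into tb1 values ('hello!', 10)")
# ...but TEXT comparisons are case-sensitive by default:
cur.execute("SELECT count(*) FROM tb1 WHERE one = 'HELLO!'")
misses = cur.fetchone()[0]
cur.execute("SELECT count(*) FROM tb1 WHERE one = 'hello!'")
hits = cur.fetchone()[0]
print(misses, hits)  # → 0 1
```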
chinook database
The sample database at your disposal is called Chinook, hosted at https://chinookdatabase.codeplex.com/. The Chinook data model represents a digital media store, including 11 tables (artists, albums, media tracks, invoices, and customers). It was first released in 2008 and is now available for all major database systems.
example query
.open chinook.db
SELECT Name FROM Track WHERE Composer='AC/DC';
SELECT Composer, sum(Milliseconds) FROM Track WHERE Composer='Johann Sebastian Bach';
SELECT FirstName, LastName, Title, BirthDate FROM Employee;
SELECT Composer, Name FROM Track WHERE Composer = 'Stevie Ray Vaughan';
select composer, count(*) from track group by composer order by count(*) desc limit 10;
select artist.name, album.title from album join artist on artist.artistid = album.artistid where artist.name = 'Iron Maiden' or artist.name = 'Amy Winehouse';
select billingcountry, count(*) from invoice group by billingcountry order by count(*) desc limit 3;
select customer.email, customer.firstname, customer.lastname, sum(invoice.total) from customer join invoice on customer.customerid = invoice.customerid group by invoice.customerid order by sum(invoice.total) desc limit 1;
select customer.email, customer.firstname, customer.lastname, genre.name from customer join invoice on customer.customerid = invoice.customerid join invoiceline on invoice.invoiceid = invoiceline.invoiceid join track on invoiceline.trackid = track.trackid join genre on track.genreid = genre.genreid where genre.name = 'Rock' group by customer.email;
select billingcity, sum(Total) from invoice group by billingcity order by sum(total) desc limit 10;
select billingcity, count(genre.name), genre.name from invoice join invoiceline on invoice.invoiceid = invoiceline.invoiceid join track on invoiceline.trackid = track.trackid join genre on track.genreid = genre.genreid group by billingcity order by sum(invoice.total) desc limit 3;
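The same queries can of course be run from Python via the sqlite3 DB-API. A sketch of the billing-country query against a tiny in-memory stand-in for Chinook's Invoice table (real use would be `sqlite3.connect('chinook.db')`, with made-up rows here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Miniature stand-in for chinook's Invoice table, with hypothetical rows.
cur.execute("CREATE TABLE invoice (BillingCountry TEXT, Total REAL)")
cur.executemany("INSERT INTO invoice VALUES (?, ?)",
                [("USA", 5.94), ("USA", 1.98), ("USA", 3.96),
                 ("Canada", 8.91), ("Canada", 0.99), ("Brazil", 0.99)])
cur.execute("""SELECT BillingCountry, count(*) FROM invoice
               GROUP BY BillingCountry ORDER BY count(*) DESC LIMIT 3""")
top = cur.fetchall()
print(top)  # → [('USA', 3), ('Canada', 2), ('Brazil', 1)]
```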
Excerpts from the New York Times article "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights":
Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
Several start-ups are trying to break through these big data bottlenecks by developing software to automate the gathering, cleaning, and organizing of disparate data, which is plentiful but messy.
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,”
The result, Mr. Weaver said, is being able to see each stage of a business in greater detail than in the past, to tailor product plans and trim inventory. “The more visibility you have, the more intelligent decisions you can make,” he said.
But if the value comes from combining different data sets, so does the headache. Data from sensors, documents, the web and conventional databases all come in different formats. Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.
Data formats are one challenge, but so is the ambiguity of human language. For example, “drowsiness,” “somnolence” and “sleepiness” are all used. A human would know they mean the same thing, but a software algorithm has to be programmed to make that interpretation. That kind of painstaking work must be repeated, time and again, on data projects.
“You can’t do this manually,” Ms. Shahani-Mulligan said. “You’re never going to find enough data scientists and analysts.”
“You prepared your data for a certain purpose, but then you learn something new, and the purpose changes,” said Cathy O’Neil, a data scientist at the Columbia University Graduate School of Journalism, and co-author, with Rachel Schutt, of “Doing Data Science” (O’Reilly Media, 2013).