Data Wrangling
Shannon Bradshaw, director of education at MongoDB, the open-source NoSQL database.
Data scientists spend roughly 70% of their time on data wrangling. If you don't take the time to ensure your data is in good shape before doing any analysis, you run a big risk of wasting a lot of time later on, or of losing the faith of colleagues who depend on your data.
Generally, we should not trust any data we get. Where does it come from? Watch for typos, missing values, inconsistent datetime formats, and outliers.
Data extraction
Extracting data from CSV, Excel, and local/remote JSON files is relatively easy. See the Python code here.
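As a minimal sketch of the CSV and JSON cases (using hypothetical inline data in place of real files; Excel usually needs a third-party library such as pandas):

```python
import csv
import io
import json

# Hypothetical inline data standing in for local files.
csv_text = "name,age\nAda,36\nGrace,45\n"
json_text = '[{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]'

# CSV: DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: loads() turns the document into plain lists/dicts.
records = json.loads(json_text)

print(rows[0]["name"], records[1]["age"])
```

For real files, replace the `io.StringIO(...)` wrapper with `open("file.csv")` and `json.loads` with `json.load(open("file.json"))`.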
Read XML/HTML
XML design goals:
- platform-independent data transfer
- easy to write code to read/write
- document validation
- human readable
- support a wide variety of applications
- robust parsers in most languages.
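The "easy to write code to read" goal shows in practice: Python's standard library parses XML in a few lines. A minimal sketch, using a made-up document:

```python
import xml.etree.ElementTree as ET

# A tiny hypothetical XML document; a real file would be read with ET.parse().
doc = """<catalog>
  <book id="b1"><title>Data Wrangling</title><year>2016</year></book>
  <book id="b2"><title>SQL Basics</title><year>2014</year></book>
</catalog>"""

root = ET.fromstring(doc)
# findall() walks matching child elements; .get() reads attributes, .text reads content.
titles = [(b.get("id"), b.find("title").text) for b in root.findall("book")]
print(titles)
```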
It is very common to fill in a form and make an HTTP POST request to get the desired information. Exactly what information you send in the POST request can be seen in the browser's developer tools -> Network -> Data_Elements.aspx?Data=2 -> Headers -> Form Data. There are 7 parameters you need to pass in the form.
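In Python, those form parameters become a URL-encoded POST body. A sketch with placeholder field names (the real names and values must be copied from the Network tab):

```python
from urllib.parse import urlencode

# Hypothetical form fields copied from the browser's Network tab; the real
# request needs all seven parameters, and these names are placeholders.
form_data = {
    "__EVENTTARGET": "",
    "__VIEWSTATE": "...",   # copied from the page source, elided here
    "CarrierList": "VX",
    "AirportList": "BOS",
}
payload = urlencode(form_data).encode("ascii")

# The encoded bytes go in the POST body, e.g.:
# urllib.request.urlopen("https://example.com/Data_Elements.aspx?Data=2", data=payload)
print(payload[:40])
</antml_code_placeholder>```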
simple comparison:
Difference | XML | HTML |
---|---|---|
birth | 1996 | 1993 |
pre-defined tags? | no | yes |
purpose | store data | display data |
full name | Extensible Markup Language | HyperText Markup Language |
Data Quality
This process is very situation-specific. Common issues include:
- typos
- legacy data
- inconsistent time formats
- outliers, found via statistical analysis, whose causes need to be identified
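Inconsistent time formats are a good illustration. A small sketch (with made-up raw values) that normalizes several legacy date formats to ISO 8601:

```python
from datetime import datetime

# Hypothetical raw values: the same date arriving in three legacy formats.
raw = ["2016-03-01", "03/01/2016", "1 Mar 2016"]
formats = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"]

def normalize(value):
    # Try each known format until one parses; real data also needs
    # logging of the values that match no format.
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass
    return None  # flag for manual review

cleaned = [normalize(v) for v in raw]
print(cleaned)  # → ['2016-03-01', '2016-03-01', '2016-03-01']
```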
SQL
The major content is similar to the free course Intro to Relational Databases. I actually wrote a blog post last August when I was at stage 5 (back end) of the Intro to Programming ND.
The nice improvement made by the Nanodegree program is the setup of a local environment and a set of problems that let you play with the Chinook database.
comparison of different RDBMS
SQL environments:
- MySQL, PostgreSQL, Oracle: code -> network -> server -> data on disk
- SQLite: code -> DB library -> data on disk
RDBMS | born | creator | features | DB-API module |
---|---|---|---|---|
SQLite | 2000 | D. Richard Hipp | fast, free | sqlite3 |
SQL Server | 1989 | Microsoft | SaaS | |
MySQL | 1995 | Oracle | partially free | mysql.connector |
PostgreSQL | 1996 | UC Berkeley | more formats, free | psycopg2 |
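The DB-API modules in the last column all share the same DB-API 2.0 interface, so the connect/cursor/execute pattern learned with sqlite3 carries over to mysql.connector and psycopg2. A minimal sketch:

```python
import sqlite3

# An in-memory database: no server, no file on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
# Parameterized inserts via the DB-API placeholder (? for sqlite3).
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
conn.commit()
cur.execute("SELECT sum(x) FROM t")
total = cur.fetchone()[0]
print(total)  # → 6
conn.close()
```

With psycopg2 or mysql.connector, only `connect()`'s arguments and the placeholder style (`%s`) change.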
I am touched by the story of SQLite's author, Hipp, who keeps it free for everyone:
“It’s very clear to me that if I’d had any business sense whatsoever I could have made a lot of money, but I don’t,” he says. “I like to tell people that we make enough to live fine in Charlotte, North Carolina. We don’t make nearly enough money to live in London or San Francisco, but we don’t live there so that’s OK,” he adds - with a touch, just a touch, of wistfulness.
sqlite
SQLite comes preinstalled on macOS. My version is 3.13, while the latest version is 3.17. I haven't figured out a simple way to update it.
sqlite3 # type in the shell to enter the environment
.open chinook.db
.exit
.quit
.help
.tables
.schema
Alternatively, SQLiteStudio is a visually appealing IDE. You can see how many tables there are, which column is the primary key, and the whole tabular data, all at a click. No black box anymore; it really makes life much easier. I guess the only catch is to be careful about table size — get a sample first.
sqlite datatype
- TEXT
- REAL
- INTEGER
- BLOB
- NULL
There is no class for storing dates/times; TEXT or NUMERIC can be used to store dates, and there are built-in functions to parse them into the right format.
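For example, dates stored as ISO-8601 TEXT can be parsed on the fly with SQLite's built-in `date()` and `strftime()` functions. A small sketch via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Dates stored as TEXT in ISO-8601 format.
cur.execute("CREATE TABLE events (name TEXT, happened TEXT)")
cur.execute("INSERT INTO events VALUES ('launch', '2016-03-01 10:30:00')")
# strftime() extracts components; date() strips the time part.
cur.execute("SELECT strftime('%Y', happened), date(happened) FROM events")
year, day = cur.fetchone()
print(year, day)  # → 2016 2016-03-01
conn.close()
```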
sqlite tricks
change output format
.show # show current settings
.mode list # default output mode; the default separator is a pipe symbol
.separator ","
.mode quote
.mode line
.mode column
.mode tabs
.header off
.mode insert # used to input data elsewhere
inspect and output results
.tables # see a list of tables
.schema
.databases
.output filename.txt # all subsequent results will be written to this file
import/ output table
.mode csv # alignment format
.import somedata.csv table1 # import table
.header on
.mode csv
.once filename.csv
select * from table1;
.system open filename.csv # open the file to display it
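The `.import` dot-command has a straightforward Python equivalent using the csv module, which is handy when the import needs cleaning logic. A sketch with hypothetical inline data standing in for somedata.csv:

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for somedata.csv.
csv_text = "id,name\n1,alpha\n2,beta\n"

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (id INTEGER, name TEXT)")

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
cur.executemany("INSERT INTO table1 VALUES (?, ?)", reader)
conn.commit()

cur.execute("SELECT count(*) FROM table1")
n = cur.fetchone()[0]
print(n)  # → 2
```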
create a new SQLite database:
sqlite3 ex1 # enter sqlite and create the ex1 database
create table tb1(one varchar(10), two smallint);
insert into tb1 values('hello!',10);
insert into tb1 values('goodbye', 20);
CREATE TABLE tb2 (f1 varchar(30) primary key,
f2 text,f3 real);
note: SQL query commands are not case-sensitive, but TEXT values are case-sensitive.
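The statements above can be replayed from Python to see both rules at once — keywords in any case work identically, while value comparisons are case-sensitive. A small sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Keywords in any case work identically...
cur.execute("CREATE TABLE tb1 (one VARCHAR(10), two SMALLINT)")
cur.execute("insert into tb1 values ('hello!', 10)")
# ...but TEXT comparisons are case-sensitive by default:
cur.execute("SELECT count(*) FROM tb1 WHERE one = 'HELLO!'")
misses = cur.fetchone()[0]
cur.execute("SELECT count(*) FROM tb1 WHERE one = 'hello!'")
hits = cur.fetchone()[0]
print(misses, hits)  # → 0 1
```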
chinook database
The sample database at your disposal is called Chinook, hosted at https://chinookdatabase.codeplex.com/. The Chinook data model represents a digital media store, including 11 tables (artists, albums, media tracks, invoices, and customers). It was first released in 2008 and is now available for all major database systems.
example query
.open chinook.db
SELECT Name FROM Track WHERE Composer='AC/DC';
SELECT Composer, sum(Milliseconds) FROM Track WHERE Composer='Johann Sebastian Bach';
SELECT FirstName, LastName, Title, BirthDate FROM Employee;
SELECT Composer, Name FROM Track WHERE Composer = 'Stevie Ray Vaughan';
select composer, count(*) from track group by composer order by count(*) desc limit 10;
select artist.name, album.title from album join artist on artist.artistid = album.artistid where artist.name = 'Iron Maiden' or artist.name = 'Amy Winehouse';
select billingcountry, count(*) from invoice group by billingcountry order by count(*) desc limit 3;
select customer.email, customer.firstname, customer.lastname, sum(invoice.total) from customer join invoice on customer.customerid = invoice.customerid group by invoice.customerid order by sum(invoice.total) desc limit 1;
select customer.email, customer.firstname, customer.lastname, genre.name from customer join invoice on customer.customerid = invoice.customerid join invoiceline on invoice.invoiceid = invoiceline.invoiceid join track on invoiceline.trackid = track.trackid join genre on track.genreid = genre.genreid where genre.name = 'Rock' group by customer.email;
select billingcity, sum(Total) from invoice group by billingcity order by sum(total) desc limit 10;
select billingcity, count(genre.name), genre.name from invoice join invoiceline on invoice.invoiceid = invoiceline.invoiceid join track on invoiceline.trackid = track.trackid join genre on track.genreid = genre.genreid group by billingcity order by sum(invoice.total) desc limit 3;
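The same queries can of course be run from Python via the sqlite3 DB-API. A sketch of the billing-country query against a tiny in-memory stand-in for Chinook's Invoice table (real use would be `sqlite3.connect('chinook.db')`, with made-up rows here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Miniature stand-in for chinook's Invoice table, with hypothetical rows.
cur.execute("CREATE TABLE invoice (BillingCountry TEXT, Total REAL)")
cur.executemany("INSERT INTO invoice VALUES (?, ?)",
                [("USA", 5.94), ("USA", 1.98), ("USA", 3.96),
                 ("Canada", 8.91), ("Canada", 0.99), ("Brazil", 0.99)])
cur.execute("""SELECT BillingCountry, count(*) FROM invoice
               GROUP BY BillingCountry ORDER BY count(*) DESC LIMIT 3""")
top = cur.fetchall()
print(top)  # → [('USA', 3), ('Canada', 2), ('Brazil', 1)]
```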
Excerpts from the New York Times article "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights":
Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
Several start-ups are trying to break through these big data bottlenecks by developing software to automate the gathering, cleaning, and organizing of disparate data, which is plentiful but messy.
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,”
The result, Mr. Weaver said, is being able to see each stage of a business in greater detail than in the past, to tailor product plans and trim inventory. “The more visibility you have, the more intelligent decisions you can make,” he said.
But if the value comes from combining different data sets, so does the headache. Data from sensors, documents, the web and conventional databases all come in different formats. Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.
Data formats are one challenge, but so is the ambiguity of human language. For example, “drowsiness,” “somnolence” and “sleepiness” are all used. A human would know they mean the same thing, but a software algorithm has to be programmed to make that interpretation. That kind of painstaking work must be repeated, time and again, on data projects.
“You can’t do this manually,” Ms. Shahani-Mulligan said. “You’re never going to find enough data scientists and analysts.”
“You prepared your data for a certain purpose, but then you learn something new, and the purpose changes,” said Cathy O’Neil, a data scientist at the Columbia University Graduate School of Journalism, and co-author, with Rachel Schutt, of “Doing Data Science” (O’Reilly Media, 2013).