Wednesday, March 1, 2017

Machine Learning ND 5, dropout and review

I am too busy preparing for an ongoing interview process and have no time to finish the capstone. So now I will do a quick review and then drop out of the nanodegree. Dropout, a powerful technique in deep neural networks, is also useful for making real-life decisions: if you are learning too hard, you may be overfitting. It is time to step back and reflect.
As my previous post said, the deep learning section of this nanodegree is poorly designed. I got stuck here for the last 2 months and made very slow progress. Deep learning really takes time and effort.
Step 4 of digit_recognition is a real challenge. It made me realize that the real-world scenario is much more complicated. I couldn't figure out how to build a localizer that can deal with inputs of different sizes.
It was also my first time using the 1:1 appointment via Zoom. However, the tutor did not seem well prepared. I spent 20 minutes getting him to understand the difficulty of the project. In the end, we turned to the forum for a solution: https://discussions.udacity.com/t/tips-for-svhn-project-with-bounding-boxes/219969. But after that, I became super busy with a data scientist opportunity and had no time to go back.

capstone

The capstone project is a DIY kind of thing. You learn by exploring on your own.
Below are a few suggested problem areas you could explore if you are unsure what your passion is:

Review: is this machine learning ND worth it?

The short answer is: Yes!
Of course, Udacity could do a better job. The point is I learned a lot at my own pace and got prompt feedback. I am now pretty confident talking about machine learning with colleagues. And I know how to keep improving: practice on real-world datasets and share what I learn.
By the way, I didn't realize that Nanodegree is a trademark of Udacity, which applied for it in 2014.

Data Analyst ND 3, data visualization by D3

Visualization Fundamentals

Jonathan: data visualization is about conveying a story or an idea as efficiently as possible. A picture is worth a thousand words.
Ryan: I think The Visual Display of Quantitative Information by Edward Tufte gets to the core: how best to represent underlying data visually, using color, size, and shape to convey information or insight to an audience. And it actually goes a little further than that, incorporating storytelling and narrative elements to share some interesting insight the author discovered and wants to pass on to the audience.
What makes a good visualization depends on its purpose:
  • exploratory: try to get a sense of what the data is and what it can tell you; turn over 100 rocks to find 1 or 2 interesting nuggets.
  • explanatory: once you have found the nuggets, connect things in interesting ways and look at the data from different angles.
    1. Have a really robust understanding of the context: who is your audience, and what do they need to know or do?
    2. Choose an appropriate type of visual: what is the most straightforward form for this audience?
    3. Cut clutter: remove uninformative elements to decrease cognitive load, so the important data stands out more.
    4. Draw attention to where you want the audience to look, using color, size, and placement on the page.
Your greatest insight is only as good as your ability to communicate it, so don't spend too much time on a complicated model.
How much do you identify with each of these roles:
  • designer
  • engineer (computer science)
  • storyteller (communicator)
There is no one place to get all the skills.
The pipeline (acquire, parse), (filter, mine), (represent, refine), (interact) maps onto computer science, statistical learning, graphic design, and infovis/HCI respectively.
Basically 3 steps: data wrangling, data mining, and data visualization.
The retina sends data to the brain at roughly 10 Mbit/s.
Visual encodings: location, color, shape, size.

visualization spectrum

productivity | visual technologies       | metaphor
highest      | RAW, Chartio, Tableau     | Excel
high         | NVD3, Dimple.js, Rickshaw | Python, Ruby
medium       | D3.js                     | C, C++
low          | WebGL, HTML5 Canvas, SVG  | assembly
Strike a balance between abstraction and flexibility.
D3 stands for Data-Driven Documents; it is built on CSS, HTML, JavaScript, and SVG.
The DOM (Document Object Model) is created during page load and is accessed through the JavaScript API. It is both a specification and a hierarchical object.
SVG (Scalable Vector Graphics) is a graphics format whose elements scale without losing quality.

D3 building blocks

D3 was created by Mike Bostock and born on 2011-02-18. Version 4.0 was released on 2016-06-28; the major change is that the namespace is flat rather than nested.

environment setup: loading the D3 library

note: In the Udacity videos they use v3, as shown in the HTML source file: <script src="./lib/d3.v3.min.js"></script>
D3 is a client-side JavaScript library.
  1. Copy the D3.js source code into the browser console.
  2. Alternatively, run this in the console:
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = "https://d3js.org/d3.v4.min.js";
document.head.appendChild(script);
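Once the script has loaded, typing d3.version in the console should print the version string (starting with "4." here), confirming D3 is available.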

use the D3 module to control the DOM

document.getElementById('header-logo');   // plain DOM API: returns a DOM node
document.querySelector('.main');          // plain DOM API: returns a DOM node
var elem = d3.select('.main');            // returns a D3 selection (array-like wrapper around the node)
elem.style('background-color', '#757c81');
d3.selectAll('.navbar');                  // selects all matching elements
d3.select('.main-title').text('china');
d3.select('.navbar-brand.logo');
var parent_el = d3.select('#header-logo'); // select by id using '#'
parent_el.select('img').attr('alt', 'Udacity');
d3.select('#header-logo').select('img').attr('src', './assets/udacity_white.png'); // change the logo
d3.select('.main').html('');              // empty the container
var svg = d3.select('.main').append('svg'); // add an svg element
svg.attr('width', 600).attr('height', 300); // change its size
// scales map a data domain to a display range (v3 API)
var y = d3.scale.linear().domain([15, 90]).range([250, 0]);
var x = d3.scale.log().domain([250, 100000]).range([0, 600]);
var r = d3.scale.sqrt().domain([52070, 138000000]).range([10, 50]);
svg.append('circle').attr('fill', 'red').attr('r', 50).attr('cx', 398).attr('cy', 43);
Some methods are both getters (pass 1 parameter) and setters (pass 2 parameters).
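For example, reusing the svg selection from above in the console:
svg.attr('width');      // getter: returns "600"
svg.attr('width', 800); // setter: updates the attribute and returns the selection for chaining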

let’s make a bar chart
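A minimal sketch in the spirit of Mike Bostock's tutorial of the same name, using the v3 API from the videos; the data array and the '.main' container are placeholders:
var data = [4, 8, 15, 16, 23, 42];   // hypothetical data
var width = 420, barHeight = 20;
// map a data value to a bar width in pixels
var x = d3.scale.linear().domain([0, d3.max(data)]).range([0, width]);
var chart = d3.select('.main').append('svg')
    .attr('width', width)
    .attr('height', barHeight * data.length);
// the data join: one <g> per datum, shifted down by its index
var bar = chart.selectAll('g').data(data).enter().append('g')
    .attr('transform', function(d, i) { return 'translate(0,' + i * barHeight + ')'; });
bar.append('rect')
    .attr('width', x)                // shorthand for function(d) { return x(d); }
    .attr('height', barHeight - 1);
bar.append('text')
    .attr('x', function(d) { return x(d) - 3; })
    .attr('y', barHeight / 2)
    .attr('dy', '.35em')
    .text(function(d) { return d; });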

some free tools

http://app.rawgraphs.io/ Drag your data in and see the magic!

be kind to the color-blind

Color blindness affects about 10% of males and 1% of females.
Edward Tufte: “Indeed, so difficult and subtle that avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.”

something interesting about D3's creator, Mike Bostock

From an “ask me anything” session on Reddit, with many interesting first-hand answers.
What was the defining moment you realized you had to create D3?
The defining moment was when I got the data-join working for the first time. It was magic. I wasn’t even sure I understood how it worked, but it was a blast to use. I realized there could be a practical tool for visualization that didn’t needlessly restrict the types of visualizations you could make.
That was just a brief moment, though. The longer effort started with Protovis, which was a response to the limitations of chart typologies. I wanted something that gave the designer greater control over the output—the kind of control that the early practitioners like Minard, Playfair and Bertin had because they did things by hand. Even within Protovis I felt like I was limited by its mark types; I wanted something that could use all of the DOM and SVG.
what motivated your career choices outside of the academy?
I wish I could pick and choose the best parts of academia and industry, but I’m not sure it’s possible. The primary advantage of academia in my view is that you can afford a long-term perspective (assuming you have funding taken care of): you care about advancing human understanding and not “capturing value.” Yet the danger of academia is that it can easily become too abstract. There are many important, solvable real-world problems that are uninteresting in the academic sense. Ideally you find a way to be productive in the short term whilst moving towards true innovation in the long term.
You mentioned your first steps in programming, but where did you pick up design? Was it a deliberate thing, did someone force you to take courses, did it happen accidentally …?
I studied Human-Computer Interaction as an undergrad, and Don Norman’s book The Design of Everyday Things greatly resonated with me. Once you start thinking about design it becomes impossible to stop, and often greatly frustrating to see so many examples of bad design out in the world.
Tufte’s books were also a huge influence for me. I suspect that the undergraduate (and later graduate) courses were probably the strongest force pushing me to think critically about design, so finding a course you can audit would probably be the best—a little secret about academia is that professors often don’t mind you sitting in on lectures, provided you ask first. A reading list from an introductory HCI course would also be a good place to start.
What’s your daily routine?
Ha. My routine is totally off at the moment because we have a newborn. I get up, feed my daughter, bike her to school, and then come home to help my wife look after the baby, run errands and clean up around the house. (And sometimes, play Hearthstone.) Hopefully… sometime soon I’ll be able to find some quiet space, because as ecstatic as I am about our new family member I still hope to be able to work again.
Before we had children, I would often get excited about ideas and experiments and tinker on them late into the evening and on weekends. I find it to be the easiest thing in the world to work on something if you are passionate about it, and you can break it up into small pieces (like examples) that you can publish and share with others for external validation. So probably, choosing to work on things you are excited about, and then finding space to avoid distractions or interruptions is the key.
“Why? It’s hard to go beyond incremental maintenance of open-source projects while publishing on deadline. Long thoughts take time.” Mike Bostock is the man behind the widely used D3.js.

interaction and animation

draw a world map!
function draw(geo_data) {            // callback: receives the loaded GeoJSON
  var projection = d3.geo.mercator(); // v3 geo API
  var path = d3.geo.path().projection(projection);
  var map = svg.selectAll('path')     // assumes the svg selection created earlier
      .data(geo_data.features).enter()
      .append('path').attr('d', path)
      .style('fill', 'rgb(9,157,217)')
      .style('stroke', 'black')
      .style('stroke-width', 0.5);
  // ...
}
d3.json("world_countries.json", draw); // load the GeoJSON, then draw
Alternatively, R and Python have their own world-map packages, e.g. basemap: https://pypi.python.org/pypi/basemap/1.0.7
The most important thing is to decide what you want your audience to know and how you are going to show it. As a data scientist, it's your job to tell people what they need to know.

Data Analyst ND 2, data wrangling, SQL

Data Wrangling

Shannon Bradshaw, director of education at MongoDB, the open-source NoSQL database.
Data scientists spend about 70% of their time on data wrangling. If you don't take the time to ensure your data is in good shape before doing any analysis, you run a big risk of wasting a lot of time later on, or losing the faith of colleagues who depend on your data.
Generally, we should not trust any data we get. Where does it come from? Typos, missing values, inconsistent datetime formats, outliers.

Data extraction

Extracting data from CSV, Excel, and local/remote JSON files is relatively easy. See my Python code here.

Read XML/HTML

XML design goals:
  • platform-independent data transfer
  • easy to write code to read/write
  • document validation
  • human readable
  • support a wide variety of applications
  • robust parsers in most languages.
It is very common to fill in a form, i.e. make an HTTP POST request, to get the desired information. Exactly what information you send in the POST request can be seen in the browser developer tools -> Network -> Data_Elements.aspx?Data=2 -> Headers -> Form Data. There are 7 parameters you need to pass in this particular form.
Python code here and practice.
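A hedged sketch of such a POST using Python's requests library, assuming the lesson's www.transtats.bts.gov page; the visible field names ('CarrierList', 'AirportList') and values are my guesses from the Form Data panel, and the hidden ASP.NET state fields must be echoed back from a prior GET:
import requests

url = 'https://www.transtats.bts.gov/Data_Elements.aspx?Data=2'
s = requests.Session()
r = s.get(url)                 # GET first to obtain the hidden ASP.NET state fields
# ... parse __VIEWSTATE and __EVENTVALIDATION out of r.text (e.g. with BeautifulSoup) ...
form = {
    'CarrierList': 'VX',       # assumed field name: carrier code
    'AirportList': 'BOS',      # assumed field name: airport code
    # plus the hidden state fields recovered above
}
r = s.post(url, data=form)     # submit the form
print(r.status_code, len(r.text))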
simple comparison:
difference        | XML                        | HTML
birth             | 1996                       | 1993
pre-defined tags? | no                         | yes
purpose           | store data                 | display data
full name         | Extensible Markup Language | HyperText Markup Language

Data Quality

This process is very situation specific.
  • typo
  • legacy data
  • time format
  • statistical analysis, identify causes

SQL

The major content is similar to the free course Intro to Relational Databases. I actually wrote a blog post last August when I was at stage 5 (back end) of the Intro to Programming ND.
The nice improvement in the nanodegree program is the local environment setup and a set of problems that get you playing with the chinook database.

comparison of different RDBMS

SQL environments:
  • mysql, postgresql, oracle: code -> network -> server -> data on disk
  • sqlite: code -> DB library -> data on disk
database   | born | creator         | features           | Python DB-API module
SQLite     | 2000 | D. Richard Hipp | fast, free         | sqlite3
SQL Server | 1989 | Microsoft       | SaaS               |
MySQL      | 1995 | Oracle          | partially free     | mysql.connector
PostgreSQL | 1996 | UC Berkeley     | more formats, free | psycopg2
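To illustrate the DB-API column, a minimal sketch connecting to the chinook database (introduced below) with Python's built-in sqlite3 module:
import sqlite3

conn = sqlite3.connect('chinook.db')   # file-based: no server, just the DB library
cur = conn.cursor()
cur.execute('SELECT Name FROM Track LIMIT 3;')
print(cur.fetchall())
conn.close()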
I am touched by the SQLite author, Hipp, who keeps it free for everyone:
“It’s very clear to me that if I’d had any business sense whatsoever I could have made a lot of money, but I don’t,” he says. “I like to tell people that we make enough to live fine in Charlotte, North Carolina. We don’t make nearly enough money to live in London or San Francisco, but we don’t live there so that’s OK,” he adds - with a touch, just a touch, of wistfulness.

sqlite

SQLite comes preinstalled on macOS. My version is 3.13, while the latest is 3.17; I haven't figured out a simple way to update it. Type sqlite3 in the shell to enter the environment.
.open chinook.db
.exit
.quit
.help
.tables
.schema
Alternatively, SQLiteStudio is a visually appealing IDE. You can see how many tables there are, which column is the primary key, and the whole table, all within a few clicks. No black box anymore; it really makes life much easier. I guess the only catch is to be careful with large tables: look at a sample first.

sqlite datatype

  • TEXT
  • REAL
  • INTEGER
  • BLOB
  • NULL
There is no class for storing dates/times; TEXT or NUMERIC can be used to store dates, and there are built-in functions to parse them into the right format.
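For example, a small sketch with a hypothetical table, using SQLite's built-in date and time functions:
CREATE TABLE events (name TEXT, happened TEXT);
INSERT INTO events VALUES ('release', '2016-06-28 10:30:00');
SELECT name,
       date(happened)                         AS day_only,  -- '2016-06-28'
       strftime('%Y', happened)               AS year,      -- '2016'
       julianday('now') - julianday(happened) AS days_ago
FROM events;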

sqlite tricks

change the output format:
.show # show current settings
.mode list  # default output mode; the default separator is a pipe symbol
.separator ","
.mode quote
.mode line
.mode column
.mode tabs
.header off
.mode insert  # output rows as INSERT statements, to load the data elsewhere
inspect and redirect results:
.tables  # see a list of tables
.schema
.databases
.output filename.txt # all subsequent results will be written to this file
import/export tables:
.mode csv  # switch to CSV format
.import somedata.csv table1 # import a CSV file as a table

.header on
.mode csv
.once filename.csv  # write only the next query's output to the file
select * from table1;
.system open filename.csv  # open the file to display it
create a new SQLite database:
sqlite3 ex1 # start sqlite3 and create the ex1 database
create table tb1(one varchar(10), two smallint);
insert into tb1 values('hello!',10);
insert into tb1 values('goodbye', 20);
CREATE TABLE tb2 (f1 varchar(30) primary key,
f2 text,f3 real);
note: SQL query keywords are not case-sensitive, but text values are.
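A quick illustration against the tb1 table above (COLLATE NOCASE is SQLite's built-in way to make a comparison case-insensitive):
SeLeCt * FrOm tb1 WhErE one = 'hello!';                -- keywords in any case: matches
SELECT * FROM tb1 WHERE one = 'HELLO!';                -- values are case-sensitive: no rows
SELECT * FROM tb1 WHERE one = 'HELLO!' COLLATE NOCASE; -- matches again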

chinook database

The sample database at your disposal is called chinook, hosted at https://chinookdatabase.codeplex.com/. The chinook data model represents a digital media store, including 11 tables (artists, albums, media tracks, invoices, and customers among them). It was first published in 2008 and is now available for all major database systems.
chinook data model
example queries:
.open chinook.db
SELECT Name FROM Track WHERE Composer='AC/DC';
SELECT Composer, sum(Milliseconds) FROM Track WHERE Composer='Johann Sebastian Bach';
SELECT FirstName, LastName, Title, Birthdate FROM Employee;
SELECT Composer, Name FROM Track WHERE Composer = 'Stevie Ray Vaughan';
select composer, count(*) from track group by composer order by count(*) desc limit 10;
select artist.name, album.title from album join artist on artist.artistid = album.artistid where artist.name = 'Iron Maiden' or artist.name = 'Amy Winehouse';
select billingcountry, count(*) from invoice group by billingcountry order by count(*) desc limit 3;
select customer.email, customer.firstname, customer.lastname, sum(invoice.total) from customer join invoice on customer.customerid = invoice.customerid group by invoice.customerid order by sum(invoice.total) desc limit 1;
select customer.email, customer.firstname, customer.lastname, genre.name from customer join invoice on customer.customerid = invoice.customerid join invoiceline on invoice.invoiceid = invoiceline.invoiceid join track on invoiceline.trackid = track.trackid join genre on track.genreid = genre.genreid where genre.name = 'Rock' group by customer.email;
select billingcity, sum(total) from invoice group by billingcity order by sum(total) desc limit 10;
select billingcity, count(genre.name), genre.name from invoice join invoiceline on invoice.invoiceid = invoiceline.invoiceid join track on invoiceline.trackid = track.trackid join genre on track.genreid = genre.genreid group by billingcity order by sum(invoice.total) desc limit 3;

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
Several start-ups are trying to break through these big data bottlenecks by developing software to automate the gathering, cleaning, and organizing of disparate data, which is plentiful but messy.
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,”
The result, Mr. Weaver said, is being able to see each stage of a business in greater detail than in the past, to tailor product plans and trim inventory. “The more visibility you have, the more intelligent decisions you can make,” he said.
But if the value comes from combining different data sets, so does the headache. Data from sensors, documents, the web and conventional databases all come in different formats. Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.
Data formats are one challenge, but so is the ambiguity of human language. For example, “drowsiness,” “somnolence” and “sleepiness” are all used. A human would know they mean the same thing, but a software algorithm has to be programmed to make that interpretation. That kind of painstaking work must be repeated, time and again, on data projects.
“You can’t do this manually,” Ms. Shahani-Mulligan said. “You’re never going to find enough data scientists and analysts.”
“You prepared your data for a certain purpose, but then you learn something new, and the purpose changes,” said Cathy O’Neil, a data scientist at the Columbia University Graduate School of Journalism, and co-author, with Rachel Schutt, of “Doing Data Science” (O’Reilly Media, 2013).