Recap: Data Analyst ND 2, data wrangling, SQL

The previous post focuses on data wrangling by parsing csv/excel/json/xml/html formatted files. And SQL queries, especially in SQLite.

This post now turns attention to NoSQL, especially in MongoDB.

course link: https://classroom.udacity.com/courses/ud032

course syllabus and setup: https://www.udacity.com/wiki/ud032

course material: https://github.com/udacity/ud032.git

course video has 11 sections, syllabus only list 6 lessons, the other 5 are problem sets. These videos are actually the major part of Data Analyst nanodegree in the data wrangling section.

NoSQL

MongoDB was first released in 2009.

Why MongoDB?

flexible schema
oriented toward Programmers
flexible deployment
designed for big data
aggregation framework

document database vs relational database.

here, document means associative array:

JSON object
PHP array
python dict
Ruby Hash

MongoDB store these types of hierarchical data structures directly in a database as individual documents. JSON-like syntax.

install and setup

https://www.udacity.com/wiki/ud032#!#installing-mongodb

brew install mongodb
brew services start mongodb
mkdir -p /data/db    //create default directory
sudo chown -R `id -u` /data/db    // give permission
mongod

practice by pymongo

start the server first by mongod

learning points:

create client
create database
create table by insert one or multiple dictionaries
query and projection by dictionary

from pymongo import MongoClient
import pprint
client = MongoClient('localhost:27017')
db = client.examples    # creat database "examples" if not exist
db.cities.insert_one({"name" : "Chicago"})  # insert a dict to table "cities"
print db.cities.find_one() # note each dict has an id No.
tesla_s = {
    "manufacturer": "Tesla Motors",
    "class": "full-size",
    "production":[2012,2013]
}
db.autos.insert_one(tesla_s)  # insert a dict to table "autos"
for a in db.autos.find():
    pprint.pprint(a)
autos = db.autos.find({"manufacturer":"Tesla Motors"})  # use dict for query
for a in autos:
    pprint.pprint(a)
query = {"manufacturer":"Toyota", "class": "mid-size car"}
projection = {"id":0, "name": 1}  // 0: hide; 1; show
autos = db.autos.find(query, projection) # multiple field queires and projection

mongoimport

Documentation for mongoimport can be found here.

The command used in this video:

mongoimport --help
mongoimport -d dbname -c collectionname --file input-file.json
mongoimport -d examples -c myautos2 --file autos.json
mongo  // start mongo shell locally

If no hostname and credentials are supplied, mongoimport will try to connect to the default localhost:27017

Queries

operator is in subqueries with leading dollar sign

query = {"name": {"$gte":"X","$lt":"Y"}}
query = {"foundingDate":{"$gt":datetime(1837,1,1),
                        "$lte":datetime(1837,12,31)}}
query = {"country":{"$ne":"United States"}}
query = {"govermentType":{"$exists":1}}
db.autos.find(query).count()
db.autos.find(query).pretty()
query = {"motto":{"$regex":"[Ff]riendship|[Pp]ride"}}
query = {"modelYears":{"$in":[1965,1966,1967]}}
query = {"modelYears":{"$all":[1965,1966,1967]}}
query = {"dimensions.weight":{"$gt":10000}}

To my surprise, 21st century begins from 2001.1.1

http://www.pcre.org/ Perl Compatible Regular Expressions

Live RegEx tester at regexpal.com.

MongoDB $regex Manual.

Official Python Regular Expression HOWTO.

Another good Python Regular Expressions page.

update/drop

db.cities.save(city)   // replace or add a dict
db.cities.update(query,{"$set"}:dict,multi= True)
db.cities.update(query,{"$unset"}: {"key":""})
db.ciites.drop()
db.cities.remove(query)

aggregation framework

$group aggregation operator documentation.

db.tweets.aggregate(pipeline)
# 3, group & sort operator
pipeline = [
  {"$group":{"_id":"$user.screen_name",
            "count":{"$sum":1}}},
    {"$sort":{"count":-1}}
]
## 8,match & project operator
pipeline =[
  {"$match": query},
  {"$project": {"ratio":
               {"$divide":["$user.followers_count",                                 "$user.friends_count"]},
              "screen_name":"$user.screen_name"}},
  {"$sort":{"ratio":-1}},
    {"$limit":1}
]
## 10, quize
pipeline =[
  {"$match": {"user.time_zone":"Brasilia",
              "user.statuses_count":{"$gte":100}}},
  {"$project": {"followers":"$user.followers_count",
                "tweets":"$user.statuses_count",
              "screen_name":"$user.screen_name"}},
  {"$sort":{"followers":-1}},
    {"$limit":1}
]
# 11, unwind operator: expand single to multiple instances
pipeline = [
    {"$unwind":"$entities.user_mentions"},
  {"$group":{"_id":"$user.screen_name",
            "count":{"$sum":1}}},
    {"$sort":{"count":-1}}
]
pipeline = [
   {"$match": {"country":"India"}},
    {"$unwind":"$isPartOf"},
    {"$group":{"_id": "$isPartOf",
            "count":{"$sum":1}}
    },
    {"$sort":{"count":-1}},
    {"$limit":1}
    ]

other operator

{"$addToSet":"$text"}  # unique
{"$push":"$text"}  # everything

10 problem set

pipeline = [
    {"$match": {"name":{"$ne":None}}},
    {"$group":{"_id": "$name",
        "count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":1}
    ]

project: OpenStreetMap

SAX (Simple API for XML) is an event-driven online algorithm for parsing XML documents, with an API developed by the XML-DEV mailing list. SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM).

parse XML -> audit dataset -> CSV/JSON-> SQL/MongoDB -> explore

http://www.openstreetmap.org/

OpenStreetMap Documentation

OSM XML Documentation

Open Street Map has customized its own version: OSM XML. Major tags

way: street
node

search tag name use element.find(s), search attribute name use element.get(s)

search() vs. match()

https://docs.python.org/2/library/re.html#search-vs-match

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.

>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<_sre.SRE_Match object at ...>

I didn’t find OpenStreetMap particularly interesting so I paused here.

Yuchao's blogspot

Wednesday, April 26, 2017

Data Wrangling with MongoDB