Wednesday, April 26, 2017

Data Wrangling with MongoDB

The previous post focuses on data wrangling by parsing csv/excel/json/xml/html formatted files. And SQL queries, especially in SQLite.
This post now turns attention to NoSQL, especially in MongoDB.

course syllabus and setup: https://www.udacity.com/wiki/ud032
course video has 11 sections, syllabus only list 6 lessons, the other 5 are problem sets. These videos are actually the major part of Data Analyst nanodegree in the data wrangling section.

NoSQL

MongoDB was first released in 2009.
Why MongoDB?
  • flexible schema
  • oriented toward Programmers
  • flexible deployment
  • designed for big data
  • aggregation framework
document database vs relational database.
here, document means associative array:
  • JSON object
  • PHP array
  • python dict
  • Ruby Hash
MongoDB store these types of hierarchical data structures directly in a database as individual documents. JSON-like syntax.

install and setup

brew install mongodb
brew services start mongodb
mkdir -p /data/db    //create default directory
sudo chown -R `id -u` /data/db    // give permission
mongod

practice by pymongo

start the server first by mongod
learning points:
  1. create client
  2. create database
  3. create table by insert one or multiple dictionaries
  4. query and projection by dictionary
from pymongo import MongoClient
import pprint
client = MongoClient('localhost:27017')
db = client.examples    # creat database "examples" if not exist
db.cities.insert_one({"name" : "Chicago"})  # insert a dict to table "cities"
print db.cities.find_one() # note each dict has an id No.
tesla_s = {
    "manufacturer": "Tesla Motors",
    "class": "full-size",
    "production":[2012,2013]
}
db.autos.insert_one(tesla_s)  # insert a dict to table "autos"
for a in db.autos.find():
    pprint.pprint(a)
autos = db.autos.find({"manufacturer":"Tesla Motors"})  # use dict for query
for a in autos:
    pprint.pprint(a)
query = {"manufacturer":"Toyota", "class": "mid-size car"}
projection = {"id":0, "name": 1}  // 0: hide; 1; show
autos = db.autos.find(query, projection) # multiple field queires and projection

mongoimport

Documentation for mongoimport can be found here.
The command used in this video:
mongoimport --help
mongoimport -d dbname -c collectionname --file input-file.json
mongoimport -d examples -c myautos2 --file autos.json
mongo  // start mongo shell locally
If no hostname and credentials are supplied, mongoimport will try to connect to the default localhost:27017

Queries

operator is in subqueries with leading dollar sign
query = {"name": {"$gte":"X","$lt":"Y"}}
query = {"foundingDate":{"$gt":datetime(1837,1,1),
                        "$lte":datetime(1837,12,31)}}
query = {"country":{"$ne":"United States"}}
query = {"govermentType":{"$exists":1}}
db.autos.find(query).count()
db.autos.find(query).pretty()
query = {"motto":{"$regex":"[Ff]riendship|[Pp]ride"}}
query = {"modelYears":{"$in":[1965,1966,1967]}}
query = {"modelYears":{"$all":[1965,1966,1967]}}
query = {"dimensions.weight":{"$gt":10000}}
To my surprise, 21st century begins from 2001.1.1
http://www.pcre.org/ Perl Compatible Regular Expressions
Live RegEx tester at regexpal.com.
MongoDB $regex Manual.
Official Python Regular Expression HOWTO.
Another good Python Regular Expressions page.

update/drop

db.cities.save(city)   // replace or add a dict
db.cities.update(query,{"$set"}:dict,multi= True)
db.cities.update(query,{"$unset"}: {"key":""})
db.ciites.drop()
db.cities.remove(query)

aggregation framework

db.tweets.aggregate(pipeline)
# 3, group & sort operator
pipeline = [
  {"$group":{"_id":"$user.screen_name",
            "count":{"$sum":1}}},
    {"$sort":{"count":-1}}
]
## 8,match & project operator
pipeline =[
  {"$match": query},
  {"$project": {"ratio":
               {"$divide":["$user.followers_count",                                 "$user.friends_count"]},
              "screen_name":"$user.screen_name"}},
  {"$sort":{"ratio":-1}},
    {"$limit":1}
]
## 10, quize
pipeline =[
  {"$match": {"user.time_zone":"Brasilia",
              "user.statuses_count":{"$gte":100}}},
  {"$project": {"followers":"$user.followers_count",
                "tweets":"$user.statuses_count",
              "screen_name":"$user.screen_name"}},
  {"$sort":{"followers":-1}},
    {"$limit":1}
]
# 11, unwind operator: expand single to multiple instances
pipeline = [
    {"$unwind":"$entities.user_mentions"},
  {"$group":{"_id":"$user.screen_name",
            "count":{"$sum":1}}},
    {"$sort":{"count":-1}}
]
pipeline = [
   {"$match": {"country":"India"}},
    {"$unwind":"$isPartOf"},
    {"$group":{"_id": "$isPartOf",
            "count":{"$sum":1}}
    },
    {"$sort":{"count":-1}},
    {"$limit":1}
    ]
other operator
{"$addToSet":"$text"}  # unique
{"$push":"$text"}  # everything

10 problem set

pipeline = [
    {"$match": {"name":{"$ne":None}}},
    {"$group":{"_id": "$name",
        "count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":1}
    ]

project: OpenStreetMap

SAX (Simple API for XML) is an event-driven online algorithm for parsing XML documents, with an API developed by the XML-DEV mailing list. SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM).
parse XML -> audit dataset -> CSV/JSON-> SQL/MongoDB -> explore
Open Street Map has customized its own version: OSM XML. Major tags
  • way: street
  • node
search tag name use element.find(s), search attribute name use element.get(s)

search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.
>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<_sre.SRE_Match object at ...>
I didn’t find OpenStreetMap particularly interesting so I paused here.