Wednesday, April 26, 2017

Data Wrangling with MongoDB

The previous post focuses on data wrangling by parsing csv/excel/json/xml/html formatted files. And SQL queries, especially in SQLite.
This post now turns attention to NoSQL, especially in MongoDB.

course syllabus and setup:
course video has 11 sections, syllabus only list 6 lessons, the other 5 are problem sets. These videos are actually the major part of Data Analyst nanodegree in the data wrangling section.


MongoDB was first released in 2009.
Why MongoDB?
  • flexible schema
  • oriented toward Programmers
  • flexible deployment
  • designed for big data
  • aggregation framework
document database vs relational database.
here, document means associative array:
  • JSON object
  • PHP array
  • python dict
  • Ruby Hash
MongoDB store these types of hierarchical data structures directly in a database as individual documents. JSON-like syntax.

install and setup

brew install mongodb
brew services start mongodb
mkdir -p /data/db    //create default directory
sudo chown -R `id -u` /data/db    // give permission

practice by pymongo

start the server first by mongod
learning points:
  1. create client
  2. create database
  3. create table by insert one or multiple dictionaries
  4. query and projection by dictionary
from pymongo import MongoClient
import pprint
client = MongoClient('localhost:27017')
db = client.examples    # creat database "examples" if not exist
db.cities.insert_one({"name" : "Chicago"})  # insert a dict to table "cities"
print db.cities.find_one() # note each dict has an id No.
tesla_s = {
    "manufacturer": "Tesla Motors",
    "class": "full-size",
}  # insert a dict to table "autos"
for a in
autos ={"manufacturer":"Tesla Motors"})  # use dict for query
for a in autos:
query = {"manufacturer":"Toyota", "class": "mid-size car"}
projection = {"id":0, "name": 1}  // 0: hide; 1; show
autos =, projection) # multiple field queires and projection


Documentation for mongoimport can be found here.
The command used in this video:
mongoimport --help
mongoimport -d dbname -c collectionname --file input-file.json
mongoimport -d examples -c myautos2 --file autos.json
mongo  // start mongo shell locally
If no hostname and credentials are supplied, mongoimport will try to connect to the default localhost:27017


operator is in subqueries with leading dollar sign
query = {"name": {"$gte":"X","$lt":"Y"}}
query = {"foundingDate":{"$gt":datetime(1837,1,1),
query = {"country":{"$ne":"United States"}}
query = {"govermentType":{"$exists":1}}
query = {"motto":{"$regex":"[Ff]riendship|[Pp]ride"}}
query = {"modelYears":{"$in":[1965,1966,1967]}}
query = {"modelYears":{"$all":[1965,1966,1967]}}
query = {"dimensions.weight":{"$gt":10000}}
To my surprise, 21st century begins from 2001.1.1 Perl Compatible Regular Expressions
Live RegEx tester at
MongoDB $regex Manual.
Official Python Regular Expression HOWTO.
Another good Python Regular Expressions page.

update/drop   // replace or add a dict
db.cities.update(query,{"$set"}:dict,multi= True)
db.cities.update(query,{"$unset"}: {"key":""})

aggregation framework

# 3, group & sort operator
pipeline = [
## 8,match & project operator
pipeline =[
  {"$match": query},
  {"$project": {"ratio":
               {"$divide":["$user.followers_count",                                 "$user.friends_count"]},
## 10, quize
pipeline =[
  {"$match": {"user.time_zone":"Brasilia",
  {"$project": {"followers":"$user.followers_count",
# 11, unwind operator: expand single to multiple instances
pipeline = [
pipeline = [
   {"$match": {"country":"India"}},
    {"$group":{"_id": "$isPartOf",
other operator
{"$addToSet":"$text"}  # unique
{"$push":"$text"}  # everything

10 problem set

pipeline = [
    {"$match": {"name":{"$ne":None}}},
    {"$group":{"_id": "$name",

project: OpenStreetMap

SAX (Simple API for XML) is an event-driven online algorithm for parsing XML documents, with an API developed by the XML-DEV mailing list. SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM).
parse XML -> audit dataset -> CSV/JSON-> SQL/MongoDB -> explore
Open Street Map has customized its own version: OSM XML. Major tags
  • way: street
  • node
search tag name use element.find(s), search attribute name use element.get(s)

search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while checks for a match anywhere in the string.
>>> re.match("c", "abcdef")    # No match
>>>"c", "abcdef")   # Match
<_sre.SRE_Match object at ...>
I didn’t find OpenStreetMap particularly interesting so I paused here.