The previous post focuses on data wrangling by parsing csv/excel/json/xml/html formatted files. And SQL queries, especially in SQLite.
This post now turns attention to NoSQL, especially in MongoDB.
course link: https://classroom.udacity.com/courses/ud032
course syllabus and setup: https://www.udacity.com/wiki/ud032
course material: https://github.com/udacity/ud032.git
course video has 11 sections, syllabus only list 6 lessons, the other 5 are problem sets. These videos are actually the major part of Data Analyst nanodegree in the data wrangling section.
NoSQL
MongoDB was first released in 2009.
Why MongoDB?
- flexible schema
- oriented toward Programmers
- flexible deployment
- designed for big data
- aggregation framework
document database vs relational database.
here, document means associative array:
- JSON object
- PHP array
- python dict
- Ruby Hash
MongoDB store these types of hierarchical data structures directly in a database as individual documents. JSON-like syntax.
install and setup
brew install mongodb
brew services start mongodb
mkdir -p /data/db //create default directory
sudo chown -R `id -u` /data/db // give permission
mongod
practice by pymongo
start the server first by
mongod
learning points:
- create client
- create database
- create table by insert one or multiple dictionaries
- query and projection by dictionary
from pymongo import MongoClient
import pprint
client = MongoClient('localhost:27017')
db = client.examples # creat database "examples" if not exist
db.cities.insert_one({"name" : "Chicago"}) # insert a dict to table "cities"
print db.cities.find_one() # note each dict has an id No.
tesla_s = {
"manufacturer": "Tesla Motors",
"class": "full-size",
"production":[2012,2013]
}
db.autos.insert_one(tesla_s) # insert a dict to table "autos"
for a in db.autos.find():
pprint.pprint(a)
autos = db.autos.find({"manufacturer":"Tesla Motors"}) # use dict for query
for a in autos:
pprint.pprint(a)
query = {"manufacturer":"Toyota", "class": "mid-size car"}
projection = {"id":0, "name": 1} // 0: hide; 1; show
autos = db.autos.find(query, projection) # multiple field queires and projection
mongoimport
Documentation for mongoimport can be found here.
The command used in this video:
mongoimport --help
mongoimport -d dbname -c collectionname --file input-file.json
mongoimport -d examples -c myautos2 --file autos.json
mongo // start mongo shell locally
If no hostname and credentials are supplied,
mongoimport
will try to connect to the default localhost:27017
Queries
operator is in subqueries with leading dollar sign
query = {"name": {"$gte":"X","$lt":"Y"}}
query = {"foundingDate":{"$gt":datetime(1837,1,1),
"$lte":datetime(1837,12,31)}}
query = {"country":{"$ne":"United States"}}
query = {"govermentType":{"$exists":1}}
db.autos.find(query).count()
db.autos.find(query).pretty()
query = {"motto":{"$regex":"[Ff]riendship|[Pp]ride"}}
query = {"modelYears":{"$in":[1965,1966,1967]}}
query = {"modelYears":{"$all":[1965,1966,1967]}}
query = {"dimensions.weight":{"$gt":10000}}
To my surprise, 21st century begins from 2001.1.1
http://www.pcre.org/ Perl Compatible Regular Expressions
Live RegEx tester at regexpal.com.
MongoDB $regex Manual.
Official Python Regular Expression HOWTO.
Another good Python Regular Expressions page.
update/drop
db.cities.save(city) // replace or add a dict
db.cities.update(query,{"$set"}:dict,multi= True)
db.cities.update(query,{"$unset"}: {"key":""})
db.ciites.drop()
db.cities.remove(query)
aggregation framework
db.tweets.aggregate(pipeline)
# 3, group & sort operator
pipeline = [
{"$group":{"_id":"$user.screen_name",
"count":{"$sum":1}}},
{"$sort":{"count":-1}}
]
## 8,match & project operator
pipeline =[
{"$match": query},
{"$project": {"ratio":
{"$divide":["$user.followers_count", "$user.friends_count"]},
"screen_name":"$user.screen_name"}},
{"$sort":{"ratio":-1}},
{"$limit":1}
]
## 10, quize
pipeline =[
{"$match": {"user.time_zone":"Brasilia",
"user.statuses_count":{"$gte":100}}},
{"$project": {"followers":"$user.followers_count",
"tweets":"$user.statuses_count",
"screen_name":"$user.screen_name"}},
{"$sort":{"followers":-1}},
{"$limit":1}
]
# 11, unwind operator: expand single to multiple instances
pipeline = [
{"$unwind":"$entities.user_mentions"},
{"$group":{"_id":"$user.screen_name",
"count":{"$sum":1}}},
{"$sort":{"count":-1}}
]
pipeline = [
{"$match": {"country":"India"}},
{"$unwind":"$isPartOf"},
{"$group":{"_id": "$isPartOf",
"count":{"$sum":1}}
},
{"$sort":{"count":-1}},
{"$limit":1}
]
other operator
{"$addToSet":"$text"} # unique
{"$push":"$text"} # everything
10 problem set
pipeline = [
{"$match": {"name":{"$ne":None}}},
{"$group":{"_id": "$name",
"count":{"$sum":1}}},
{"$sort":{"count":-1}},
{"$limit":1}
]
project: OpenStreetMap
SAX (Simple API for XML) is an event-driven online algorithm for parsing XML documents, with an API developed by the XML-DEV mailing list. SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM).
parse XML -> audit dataset -> CSV/JSON-> SQL/MongoDB -> explore
Open Street Map has customized its own version: OSM XML. Major tags
- way: street
- node
search tag name use
element.find(s)
, search attribute name use element.get(s)
search() vs. match()
Python offers two different primitive operations based on regular expressions:
re.match()
checks for a match only at the beginning of the string, while re.search()
checks for a match anywhere in the string.>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object at ...>
I didn’t find OpenStreetMap particularly interesting so I paused here.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.