Friday, April 28, 2017

Google Prediction API

Google Cloud Prediction API provides a RESTful API for building machine-learning models.
Steps:
  1. create a cloud platform project: predictionapi0
  2. enable billing
  3. enable API
  4. download training data (txt file)
  5. create bucket: jychstar-bucket, upload txt file to bucket
  6. in project predictionapi0, train the model with this request body (see the Python sketch after the list):
    request body: {
      "id": "language-identifier",
      "storageDataLocation": "jychstar_bucket/language_id.txt"
    }
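For reference, here is roughly what the equivalent calls looked like from Python with the google-api-python-client library. This is only a sketch from memory of the v1.6 API, assuming application-default credentials are already set up; the project, model id, and bucket path come from the steps above.

from oauth2client.client import GoogleCredentials
from googleapiclient.discovery import build

# Assumes application-default credentials and that the Prediction API is enabled.
credentials = GoogleCredentials.get_application_default()
service = build('prediction', 'v1.6', credentials=credentials)

# Train: the same request body as above, sent to trainedmodels.insert.
# Training runs asynchronously; trainedmodels().get() reports its status.
service.trainedmodels().insert(
    project='predictionapi0',
    body={'id': 'language-identifier',
          'storageDataLocation': 'jychstar_bucket/language_id.txt'}).execute()

# Predict: classify a new sentence once training has finished.
result = service.trainedmodels().predict(
    project='predictionapi0',
    id='language-identifier',
    body={'input': {'csvInstance': ['bonjour tout le monde']}}).execute()
print(result['outputLabel'])  # e.g. 'French'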
It turns out this language-identifier model is only a toy, with 403 input instances and 3 class labels (English, French, Spanish). It is a black box written for a specific purpose.
The pricing for the Prediction API is $0.50 per 1,000 predictions after 10,000 free predictions, and training is charged as well. I think such an API is application specific; as a black box, it has to generalize well enough to stay useful in a changing world.
Some mature APIs are:
  • Natural language analysis: syntax, entity, sentiment
  • speech to text
  • translation
  • image analysis
  • video analysis

Amazon Redshift

Amazon Redshift was launched in November 2012. It is targeted at big data at the petabyte scale.
An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases. Reserving compute nodes offers significant savings compared to the hourly rates that you pay when you provision compute nodes on demand.
With Amazon Redshift, you can start small for just $0.25 per hour with no commitments and scale out to petabytes of data for $1,000 per terabyte per year, less than a tenth the cost of traditional solutions.
When you launch a cluster, one option you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node. The dense storage (DS) node types are storage optimized; the dense compute (DC) node types are compute optimized. More details are in the Amazon Redshift documentation on node types.

getting started

there are 7 steps to follow:

1, SQL client and driver

SQL Workbench/J is a free, DBMS-independent, cross-platform SQL query tool. It is written in Java and should run on any operating system that provides a Java Runtime Environment.
You can use a JDBC connection to connect to your Amazon Redshift cluster from many third-party SQL client tools. To do this, you need to download a JDBC driver.

2, create an IAM role

Roles -> Create New Role -> AWS Service Role: Amazon Redshift -> Attach Policy: AmazonS3ReadOnlyAccess -> Role Name: myRedshiftRole
Role ARN: arn:aws:iam::992413356070:role/myRedshiftRole
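The same role can also be created programmatically. A minimal boto3 sketch (my own addition, assuming AWS credentials are configured locally):

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets Amazon Redshift assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow",
                   "Principal": {"Service": "redshift.amazonaws.com"},
                   "Action": "sts:AssumeRole"}]
}

role = iam.create_role(RoleName='myRedshiftRole',
                       AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.attach_role_policy(RoleName='myRedshiftRole',
                       PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess')
print(role['Role']['Arn'])  # the Role ARN used by the copy command in step 6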

3, launch a Redshift cluster

There will be a charge of $0.25/hour, so delete the cluster after the tutorial.
Launch Cluster -> Cluster Identifier: examplecluster, Master User Name: masteruser, Master User Password: Ai8920113
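The same launch can be scripted with boto3. A sketch under my assumptions (the dc1.large single-node setup mirrors the console defaults of the time and may differ for you):

import boto3

redshift = boto3.client('redshift', region_name='us-east-1')

# Single-node cluster matching the tutorial settings; the node type is an assumption.
# Associating the IAM role from step 2 here is what lets the copy in step 6 read from S3.
redshift.create_cluster(
    ClusterIdentifier='examplecluster',
    NodeType='dc1.large',
    ClusterType='single-node',
    MasterUsername='masteruser',
    MasterUserPassword='Ai8920113',
    IamRoles=['arn:aws:iam::992413356070:role/myRedshiftRole'])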

4, Authorize Access to the Cluster

Redshift -> Clusters -> examplecluster -> Configuration -> Cluster Properties -> VPC Security Groups -> Inbound -> Edit -> Type: Custom TCP Rule, Protocol: TCP, Port Range: 5439
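The equivalent security-group change with boto3, assuming you know the id of the cluster's VPC security group (the id below is hypothetical):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Open the Redshift port to clients; 0.0.0.0/0 is acceptable only for a throwaway tutorial cluster.
ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',  # hypothetical: use your cluster's VPC security group id
    IpProtocol='tcp',
    FromPort=5439,
    ToPort=5439,
    CidrIp='0.0.0.0/0')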

5, Connect to the Sample Cluster

Redshift -> Clusters -> examplecluster -> Configuration -> Cluster Database Properties, JDBC URL: jdbc:redshift://examplecluster.cl0oz8dhlrae.us-east-1.redshift.amazonaws.com:5439/dev
In SQL Workbench/J: File -> Connect Window -> New Profile -> Manage Drivers -> New Driver, load the JDBC driver, OK -> then fill in the profile (driver, URL, user name, password, autocommit), OK.
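If you prefer Python to a JDBC GUI, Redshift also accepts plain PostgreSQL connections. A minimal sketch with psycopg2 (assuming the driver is installed), using the endpoint from the JDBC URL above:

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host='examplecluster.cl0oz8dhlrae.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='dev',
    user='masteruser',
    password='Ai8920113')
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute('select version()')
    print(cur.fetchone()[0])  # Redshift reports a PostgreSQL-compatible version string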

6, Load Sample Data from Amazon S3

When the cluster is launched, it comes with a default database "dev", which is still empty. So the first thing is to create some tables, such as "users", by writing queries in the statement area of SQL Workbench/J:
create table users(
    userid integer not null distkey sortkey,
    username char(8),
    firstname varchar(30),
    lastname varchar(30),
    city varchar(30),
    state char(2),
    email varchar(100),
    phone char(14),
    likesports boolean,
    liketheatre boolean,
    likeconcerts boolean,
    likejazz boolean,
    likeclassical boolean,
    likeopera boolean,
    likerock boolean,
    likevegas boolean,
    likebroadway boolean,
    likemusicals boolean);
Then we load sample data into the tables with a "copy" from Amazon S3:
copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';
Note that the credentials string <iam-role-arn> is the Role ARN from step 2. Unfortunately, I got S3ServiceException: Access Denied due to my setup at cluster launch.
Now you are ready to write queries like select * from users.
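For instance, reusing the psycopg2 connection opened in step 5, a quick sanity check on the loaded data (the query is only grounded in the users schema above):

# conn is the psycopg2 connection from the step 5 sketch.
with conn.cursor() as cur:
    cur.execute("""
        select state, count(*) as sports_fans
        from users
        where likesports
        group by state
        order by sports_fans desc
        limit 10;
    """)
    for state, fans in cur.fetchall():
        print(state, fans)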
Check your query history at Redshift -> examplecluster -> Queries tab.

7, try something interesting or reset the environment

If you feel ambitious, try "Tutorial: Loading Data from Amazon S3".
Otherwise, revoke access from the VPC security group: Redshift -> Clusters -> examplecluster -> Configuration -> Cluster Properties -> Inbound -> Edit, delete the custom TCP rule, Save.
Then delete the cluster: Redshift -> Clusters -> examplecluster -> Configuration -> Cluster -> Delete, Create Snapshot: No, Delete.
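The cleanup can also be scripted with boto3 (same assumptions as above):

import boto3

redshift = boto3.client('redshift', region_name='us-east-1')

# Skip the final snapshot so nothing keeps accruing storage charges.
redshift.delete_cluster(
    ClusterIdentifier='examplecluster',
    SkipFinalClusterSnapshot=True)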