I was inspired by rud.is's post, Putting It All Together, so I began thinking about building its charts, a line chart and a "dumbbell" chart, in Tableau.

1. Load the data

2. Transform to a Tableau-friendly dataset
To make the data easy to show in Tableau, we need to "flatten" the dataset (a minimal R sketch of this reshaping appears after the dashboard link at the end of this post).

3. Line charts
It's pretty easy to make line charts in Tableau:
Columns: Day (Date)
Rows: AVG(Participation Rate)
Color: Degree
Sheet name: US Labor Force Participation Rate
y-axis: title "Participation Rate"; format the numbers as percentages
x-axis: remove the title

4. "Dumbbell" chart, also called a DNA chart
The DNA chart is a little more difficult, so let's do it step by step:
Drag Degree to Rows -> drag Participation Rate to Columns -> change SUM to MIN.
Drag Participation Rate to the x-axis again and use MAX -> drag Measure Names onto Color -> change the mark from Automatic to Circle.
Drag Measure Values onto Columns again -> change to Dual Axis -> synchronize the x-axes -> go to the second Measure Values pane and change the mark to Line.
Move Measure Names onto Path.
To make it look like the chart in the rud.is blog post: edit the colors, change the line size, then make some minor adjustments. It's done!

To view the full dashboard:
https://public.tableau.com/views/U_S_LaborForceParticipationRate/Dashboard1?:embed=y&:display_count=yes
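As a footnote on step 2, here is a minimal R sketch of the kind of wide-to-long reshaping ("flattening") that makes data Tableau-friendly. The file name and the Day column are assumptions for illustration, not the actual BLS extract used in the post.

## Hypothetical wide file: one column of participation rates per degree level
library(tidyr)
library(readr)

wide <- read_csv("labor_force_participation_wide.csv")   ## assumed file name

## One row per (Day, Degree) pair, which is what Tableau prefers
long <- pivot_longer(wide,
                     cols = -Day,
                     names_to = "Degree",
                     values_to = "Participation Rate")
write_csv(long, "labor_force_participation_long.csv")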
1. Get the data
This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data In total, there are 10 attributes in this dataset:
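As a rough sketch of this step in R, the data can be pulled straight from UCI. The column names below follow the UCI documentation for this file; treat the whole block as an assumption rather than the post's exact code.

## Read the data straight from UCI; "?" marks missing values in Bare Nuclei
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
bc <- read.csv(url, header = FALSE, na.strings = "?")
names(bc) <- c("Id", "ClumpThickness", "CellSize", "CellShape",
               "MarginalAdhesion", "EpithelialSize", "BareNuclei",
               "BlandChromatin", "NormalNucleoli", "Mitoses", "Class")
bc <- na.omit(bc)
bc$Class <- factor(bc$Class, levels = c(2, 4), labels = c("benign", "malignant"))
str(bc)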
2. Variable Identification
Target variable: Cancer Class. Predictor variables: the other attributes. For this case, all the predictor variables can be treated as numeric, because they are essentially numeric measurements that were then binned into ordered levels.

3. Univariate Analysis
At this stage, we explore the variables one by one with histograms, using one dashboard:

4. Bi-variate Analysis
In Tableau, it's easy and quick to add a new 'element', a dimension, into the Univariate Analysis dashboard. Generally speaking, from level 3-4 of each attribute onward, the percentage of malignant cancer increases sharply, sometimes hitting the 100% line.

5. Classification Prediction
Clearly, this is a classification problem. Our goal is to predict whether a tumor is benign or malignant. I will use R for this task and import the prediction results into Tableau.

5.1 Let's begin with KNN
Create 10 groups -> 10-fold cross-validation -> calculate the confusion matrix -> calculate accuracy, precision, recall, and F-score (a rough R sketch of this procedure appears at the end of this post). The results were very good.

5.2 Try one more: Naive Bayes

6. Take Away
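For reference, here is a rough R sketch of the 10-fold cross-validated KNN described in 5.1, continuing from the bc data frame loaded in section 1. The choice of k = 5, the random seed, and the fold assignment are my assumptions, not the post's exact settings.

library(class)

set.seed(42)
folds <- sample(rep(1:10, length.out = nrow(bc)))   ## assign each row to one of 10 folds
preds <- factor(rep(NA, nrow(bc)), levels = levels(bc$Class))

for (i in 1:10) {
  train_x <- bc[folds != i, 2:10]
  test_x  <- bc[folds == i, 2:10]
  preds[folds == i] <- knn(train_x, test_x, cl = bc$Class[folds != i], k = 5)
}

## Confusion matrix and the usual metrics
cm <- table(predicted = preds, actual = bc$Class)
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["malignant", "malignant"] / sum(cm["malignant", ])
recall    <- cm["malignant", "malignant"] / sum(cm[, "malignant"])
f_score   <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f_score = f_score)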
Check the full dashboards here:
https://public.tableau.com/profile/jiawen.qi#!/vizhome/BreastCancer_1/BreastCancerDataset

View this post on RPubs: rpubs.com/JiawenQi/GridSearchIris
View this post on Github: github.com/JiawenQi98/KeepLearningR/blob/master/Grid%20Search%20of%20Iris%20Dataset.Rmd

Introduction
On Dec 10th, Simon Jackson blogged a new article about grid search in the tidyverse: https://drsimonj.svbtle.com/grid-search-in-the-tidyverse As a beginner in this field, grid search was a new term for me.
Decision Tree Example of Iris
The iris dataset is very famous and predictable. Let's predict Species from the other four features.
Let’s do a decision tree.
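Roughly, following the rpart-based approach used in Simon Jackson's post (a sketch, not the exact code):

library(rpart)

## A single decision tree on the full iris data, as a baseline
full_fit <- rpart(Species ~ ., data = iris, method = "class")
print(full_fit)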
Split Training and Testing
80% training, 20% testing.
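A minimal sketch of the split; the seed is an arbitrary assumption:

set.seed(245)
n <- nrow(iris)
train_rows <- sample(n, size = round(0.8 * n))   ## 80% of the rows for training
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]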
Create the Grid
Define the hyperparameter combinations (you can try different values for minsplit and maxdepth):
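For example, with tidyr::crossing; the candidate values below are just an illustration:

library(tidyr)

## Every combination of the two rpart hyperparameters we want to try
gs <- crossing(minsplit = c(2, 5, 10, 20),
               maxdepth = c(1, 2, 3, 5, 10))
gs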
Create a Model Function
Create a function so we can step through our grid of hyperparameter combinations and fit models easily.
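Something along these lines, assuming the train data frame from the split above:

library(rpart)

## Fit one tree for a given (minsplit, maxdepth) pair
fit_one_tree <- function(minsplit, maxdepth) {
  rpart(Species ~ ., data = train, method = "class",
        control = rpart.control(minsplit = minsplit, maxdepth = maxdepth))
}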
Fit the Models
Iterate down the values and fit our models.
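With purrr, one model per row of the grid, stored in a list-column (assuming the gs grid and fit_one_tree function above):

library(purrr)
library(dplyr)

gs <- gs %>%
  mutate(fit = map2(minsplit, maxdepth, fit_one_tree))
gs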
Obtain Accuracy
Create a function to get accuracy easily:
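For example, using the held-out test set from the split above:

## Accuracy of one fitted tree on the held-out test set
compute_accuracy <- function(fit, test_data) {
  predicted <- predict(fit, newdata = test_data, type = "class")
  mean(predicted == test_data$Species)
}

gs <- gs %>%
  mutate(test_accuracy = map_dbl(fit, compute_accuracy, test_data = test))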
Arrange Results
Sort the results by higher accuracy, then larger minsplit and smaller maxdepth:
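Which is a one-liner with dplyr::arrange:

## Best settings first: highest accuracy, then larger minsplit, then smaller maxdepth
gs %>%
  arrange(desc(test_accuracy), desc(minsplit), maxdepth) %>%
  select(minsplit, maxdepth, test_accuracy)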
Take Away
It's the holiday season, and the gift from Tableau Public is a bunch of free and interesting datasets. Merry Vizmas! Follow These Tableau-Themed Advent Calendars: https://public.tableau.com/en-us/s/blog/2016/12/merry-vizmas-follow-these-tableau-themed-advent-calendars The dataset for Dec 1st is Global Heights. The heights in this dataset are the mean heights at age 18 for each country, by sex.
1. Import Data Into Tableau
The data is an xlsx file of about 2.2 MB. Click Excel in the Connect panel, locate the Global Heights file, and open it.

2. Quick View with R
R can give us a quick summary of each field.
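For instance, something like the sketch below; the file name and the column names (Country, Sex, MeanHeight, YearOfBirth) are my assumptions about how the workbook is laid out:

library(readxl)
library(dplyr)

heights <- read_excel("Global Heights.xlsx")   ## assumed file name
summary(heights)

## How many countries?
n_distinct(heights$Country)

## Range of year of birth
range(heights$YearOfBirth)

## Tallest and shortest mean heights, by sex
heights %>% group_by(Sex) %>% slice_max(MeanHeight, n = 1)
heights %>% group_by(Sex) %>% slice_min(MeanHeight, n = 1)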
2.1 How many countries are included?
200 countries are included in this dataset.

2.2 The range of year of birth
The mean heights come from people born from 1896 to 1996, so the years the heights correspond to (at age 18) run from 1914 to 2014, 101 years.

2.3 Tallest man & woman
The highest mean height for men is 182.5673 cm, from the Netherlands, for those born in 1985. For women, it's 169.7979 cm, from Latvia.

2.4 Shortest man & woman
Men from Lao PDR born in 1896 have the lowest mean height, 152.8846 cm. Women from Guatemala born in 1986 are 140.29 cm.

3. Basic Visualizations

3.1 Statistics on Maps
Generally speaking, the height of people in Greenland grew faster than elsewhere. One hundred years later, men in China and nearby countries are taller than the world average. However, people in Africa seem to have been left behind.

3.2 Trend Lines
From 1896 to around 1970, there is an obvious increase in height for the whole world. However, after 1980, there is even a slight decrease.
Interesting results, and also questions to consider.

3.3 Make a Dashboard
When making dashboards, the key thing to consider is what people would like to see in them. Several points come to mind:
https://public.tableau.com/profile/jiawen.qi#!/vizhome/Book1_15239/GlobalHeight

4. Do a simple linear regression
The overall MSE is 2.53. Given that the heights range from about 140 to 183 cm, the prediction is not bad.
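A minimal sketch of such a regression in R, assuming the heights data frame and column names from the quick-view sketch above, and assuming height is modeled from year of birth and sex:

fit <- lm(MeanHeight ~ YearOfBirth + Sex, data = heights)

## Mean squared error of the fit
mean(residuals(fit)^2)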
Andy Kriebel posted an article two weeks ago about a tip for creating a diverging bar chart with a single measure. Simply speaking, a diverging bar chart is good for comparing two categories in Tableau; for example, comparing the age, height, and weight of Tom and Jerry. This post uses the famous iris dataset to redo the tip step by step.

Step 1: Import the data into Tableau 10
The data connection options are in the left blue panel. You need to choose Text File if your iris data is in csv format. Then find your iris data and click Open.

Step 2: Data Overview
After clicking Open in the first step, you will get a general overview of the iris dataset. We have one categorical variable, Species, and four numeric variables: sepal length, sepal width, petal length, and petal width. Species is the variable from which we can pick two classes to compare.

Step 3: Get your first sheet
Just click Sheet1 at the bottom left and you will jump to your first sheet. Let's compare the sepal length of setosa and versicolor.
Now we have a stacked bar chart. But we want the middle to be 0, the left side to be the average sepal length of versicolor, and the right side to be the average sepal length of setosa.
Now it has 0 in the middle and looks close to our target.
Here we go. The first sheet is created.

Step 4: Repeat Step 3 for other measures you want and create a dashboard
It looks good, but we want all the middles aligned at the same center.
In Andy's post, he didn't give a solution for aligning all the middles at the same center. One way that comes to mind is:
To view this dashboard:
https://public.tableau.com/profile/jiawen.qi#!/vizhome/DivergingBarChart/Versicolorvs_Setosa

Tableau 10 was released several weeks ago. Version 10 has more functions, such as cross-database joins and a clustering algorithm. And I believe the San Francisco Crime Dataset is a good dataset to start with. It was released as a Kaggle competition, and you can download the data here: https://www.kaggle.com/c/sf-crime/data?test.csv.zip The competition was designed to predict the category of crimes that occurred in the city by the bay. Here is the description of the competition:

"From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay. From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred."

First, let's look at several lines of this dataset. There are 9 variables:

Dates: timestamp of the crime incident
Category: category of the crime incident
Descript: detailed description of the crime incident
Day Of Week: the day of the week
Pd District: name of the Police Department district
Resolution: how the crime incident was resolved
Address: the approximate street address of the crime incident
X: longitude
Y: latitude

1. Time Series of Crime Records
It is interesting to find that the top three peaks all fall on the first day of 2006, 2011, and 2012. What needs more thought is that most of the other small peaks (from 20 to 60) also fall on the first day of a month. When we sum the crime records for each day of the month (an R sketch of this aggregation appears below), it is easy to understand why the total for the 31st is almost half that of the 1st: not every month has a 31st. Let's pick out the records on the 1st and check the categories. The top 5 categories of crime incidents are LARCENY/THEFT, OTHER OFFENSES, NON-CRIMINAL, ASSAULT, and DRUG/NARCOTIC.

2. Where did crime happen?
With the X and Y variables in the dataset, we can map the data. It is very clear, and interesting, to see that the places where crime happened follow the street pattern. I use the number of records for the circle size and find that the 800 block of Bryant St. is a place with 26,354 crime incidents from 2003 to 2015. Then I remove the spots that have fewer than 1,000 crimes during the 12 years. Can I conclude that the southeast of the San Francisco Bay Area is more dangerous? Then let's see the categories of crime for the 800 block of BRYANT ST. When adding the Police Department districts to the map:

3. How were crime incidents resolved?
Other than the ARREST resolutions, most of the crime incidents were under the NONE resolution, which is understandable but disappointing.
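For readers who prefer R to Tableau, here is a rough sketch of the day-of-month aggregation from section 1; it assumes the Kaggle train.csv is in the working directory:

library(dplyr)
library(lubridate)

crime <- read.csv("train.csv", stringsAsFactors = FALSE)
crime$Dates <- ymd_hms(crime$Dates)

## Crime records per day of the month
crime %>% count(DayOfMonth = day(Dates))

## Top 5 categories on the 1st of the month
crime %>% filter(day(Dates) == 1) %>% count(Category, sort = TRUE) %>% head(5)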
4. Create a dashboard
Rossmann Store Sales is a Kaggle competition. You can find information about this dataset and download it here: https://www.kaggle.com/c/rossmann-store-sales/data
The purpose of this competition was to predict sales using store, promotion, and competitor data. Here is some information about the competition from the Kaggle website:

"Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. In their first Kaggle competition, Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams!"

1. Understand this dataset
There are four csv files in this dataset: store.csv, train.csv, test.csv and sample_submission.csv. In the first step, we just need to look into the store and train files. The store file contains supplemental information about the stores. The train file contains the historical data, including Sales.

## Load store file
store <- read.csv("store.csv", header = TRUE, stringsAsFactors = FALSE)

## Get the dimension of store file
dim(store)
## [1] 1115   10

## Get the summary of store file
summary(store)
##      Store          StoreType          Assortment
##  Min.   :   1.0    Length:1115        Length:1115
##  1st Qu.: 279.5    Class :character   Class :character
##  Median : 558.0    Mode  :character   Mode  :character
##  Mean   : 558.0
##  3rd Qu.: 836.5
##  Max.   :1115.0
##
##  CompetitionDistance  CompetitionOpenSinceMonth  CompetitionOpenSinceYear
##  Min.   :   20.0      Min.   : 1.000             Min.   :1900
##  1st Qu.:  717.5      1st Qu.: 4.000             1st Qu.:2006
##  Median : 2325.0      Median : 8.000             Median :2010
##  Mean   : 5404.9      Mean   : 7.225             Mean   :2009
##  3rd Qu.: 6882.5      3rd Qu.:10.000             3rd Qu.:2013
##  Max.   :75860.0      Max.   :12.000             Max.   :2015
##  NA's   :3            NA's   :354                NA's   :354
##      Promo2        Promo2SinceWeek  Promo2SinceYear  PromoInterval
##  Min.   :0.0000    Min.   : 1.0     Min.   :2009     Length:1115
##  1st Qu.:0.0000    1st Qu.:13.0     1st Qu.:2011     Class :character
##  Median :1.0000    Median :22.0     Median :2012     Mode  :character
##  Mean   :0.5121    Mean   :23.6     Mean   :2012
##  3rd Qu.:1.0000    3rd Qu.:37.0     3rd Qu.:2013
##  Max.   :1.0000    Max.   :50.0     Max.   :2015
##  NA's   :544       NA's   :544
The store file has 1,115 observations and 10 fields.
From the competition website we can find the data field descriptions:
## Load the train file
train <- read.csv("train.csv", header = TRUE, stringsAsFactors = FALSE)

## Get the dimension of train file
dim(train)
## [1] 1017209       9

## Get the summary of train file
summary(train)
##      Store          DayOfWeek        Date               Sales
##  Min.   :   1.0    Min.   :1.000    Length:1017209     Min.   :    0
##  1st Qu.: 280.0    1st Qu.:2.000    Class :character   1st Qu.: 3727
##  Median : 558.0    Median :4.000    Mode  :character   Median : 5744
##  Mean   : 558.4    Mean   :3.998                       Mean   : 5774
##  3rd Qu.: 838.0    3rd Qu.:6.000                       3rd Qu.: 7856
##  Max.   :1115.0    Max.   :7.000                       Max.   :41551
##    Customers          Open              Promo          StateHoliday
##  Min.   :   0.0    Min.   :0.0000    Min.   :0.0000    Length:1017209
##  1st Qu.: 405.0    1st Qu.:1.0000    1st Qu.:0.0000    Class :character
##  Median : 609.0    Median :1.0000    Median :0.0000    Mode  :character
##  Mean   : 633.1    Mean   :0.8301    Mean   :0.3815
##  3rd Qu.: 837.0    3rd Qu.:1.0000    3rd Qu.:1.0000
##  Max.   :7388.0    Max.   :1.0000    Max.   :1.0000
##  SchoolHoliday
##  Min.   :0.0000
##  1st Qu.:0.0000
##  Median :0.0000
##  Mean   :0.1786
##  3rd Qu.:0.0000
##  Max.   :1.0000
In the train file, there are 1,017,209 observations and 9 data fields.
2. Exploration and Visualization

2.1 Load Data in Tableau
Choose Text File -> select train.csv -> drag store.csv to join with train.csv -> edit the join type to "Left join".

2.2 Ask Questions and Answer Them
Q1: Which store has the highest sales?
A1: According to the store file, there are 1,115 stores. We can use a bar chart to show this (sum the sales, sort in descending order, and keep only the top 10 stores). The top store is store 262.
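For reference, roughly the same answer can be computed in R with dplyr, assuming the train data frame loaded earlier:

library(dplyr)

## Top 10 stores by total sales
train %>%
  group_by(Store) %>%
  summarise(TotalSales = sum(Sales)) %>%
  arrange(desc(TotalSales)) %>%
  head(10)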
Q2: Which store type has the highest sales?
A2: Type a has the highest total sales, but type b has the highest average sales. The reason is that, out of nearly 1 million observations, only around 15 thousand belong to type b.
Q3: Which store assortment has the highest sales?
A3: The basic assortment has the highest total sales, but extra-assortment stores have the highest average sales, for the same reason as store type.
Q4: Is there a possible relation between competitor distance and sales?
A4: Using a scatterplot, we can analyze this with trend lines. First, try a linear trend line:
Then try a logarithmic trend line:
In general, a higher R-squared value means a better fit. So the relation between competition distance and sales is better described by the logarithmic model.
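A sketch of the same comparison in R, assuming the train and store data frames loaded earlier; aggregating to average sales per store is my own choice here, not necessarily what Tableau's trend lines do:

library(dplyr)

sales_by_store <- train %>%
  group_by(Store) %>%
  summarise(AvgSales = mean(Sales)) %>%
  inner_join(store, by = "Store") %>%
  filter(!is.na(CompetitionDistance))

fit_linear <- lm(AvgSales ~ CompetitionDistance, data = sales_by_store)
fit_log    <- lm(AvgSales ~ log(CompetitionDistance), data = sales_by_store)

## Compare goodness of fit
summary(fit_linear)$r.squared
summary(fit_log)$r.squared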
Q5: Which day of the week has higher sales?
A5: The surprising thing is that people shopped much less on Sundays.
Q6: Did the Date Influence Sales?
A6: When we aggregate the data by week, the 51st week has higher sales than the others. This might be because, at the end of the year, people need to buy gifts for their families and prepare for Christmas and the new year.
When aggregating by month, the first peak appears in March, the second in July, and the third in December. My guess: spring break, summer vacation, and winter vacation, plus Christmas and the new year.
Tableau is clever: it can do forecasting, although we don't know the underlying algorithm.
Q7: So far we have been curious about sales, sales, and sales... anything else?
A7: How about people's spending per visit? We have the total sales for a given day and the number of customers for that day, so we can create a new attribute: consumption ability = sales / customers. From the line graph, people's consumption ability increases year over year, while fluctuating within each year.
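A minimal sketch of this derived attribute in R, again assuming the train data frame loaded earlier:

library(dplyr)
library(lubridate)

## Average sales per customer, by year
train %>%
  filter(Customers > 0) %>%
  mutate(Year = year(ymd(Date)),
         ConsumptionAbility = Sales / Customers) %>%
  group_by(Year) %>%
  summarise(AvgConsumptionAbility = mean(ConsumptionAbility))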
To let you interact with the dataset, here is a dashboard:
https://public.tableau.com/views/RossmannStoreSales/Dashboard1?:embed=y&:display_count=yes