Error Loading pandas_profiling

Pandas profiling is good, but sometimes installing it and getting it ready to use is a hassle. Here is one error I encountered when using pandas_profiling:

cannot import name 'GridspecLayout' from 'ipywidgets'

[Screenshot: GridspecLayout import error]

According to Stack Overflow, I needed to install ipywidgets version 7.5.
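Pinning the version with pip should look like this (assuming pip is your installer):

pip install ipywidgets==7.5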

After installing it, everything works fine. Here is a sample result from my previous blog post:

Easy Data Understanding Using pandas_profiling

Source:

https://stackoverflow.com/questions/56953612/importerror-cannot-import-name-applayout-from-ipywidgets

Easy Data Understanding Using pandas_profiling

How do you usually explore a new dataset? By checking the file and the columns one by one. The checks include:

a. Number of columns and rows

b. Data type of each column

c. Missing data in each column

d. Distribution of each column

e. Correlation between columns

These checks are tedious, especially when the data has a lot of columns and rows.

I just came across a Python package called “pandas_profiling”. We can easily install the package using pip or conda:

conda install -c anaconda pandas-profiling

pip install pandas-profiling

Here is an example.

Data was downloaded from Kaggle

https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset

Read the data and explore it using pandas_profiling:

# Importing packages
import pandas as pd
from pandas_profiling import ProfileReport

# Reading the csv data
london = pd.read_csv('london_merged.csv')
london.head()

[Output: london.head()]

london.info()

[Output: london.info()]

ProfileReport(london)

[Report sections: Overview, Variables, Correlations, Missing values, Sample]
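If you want to keep the report, it can also be exported to a standalone HTML file via the ProfileReport object (method per the pandas-profiling documentation; availability may vary across versions):

profile = ProfileReport(london)
profile.to_file('london_report.html')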
Source:

https://stackoverflow.com/questions/49314314/unable-to-import-pandas-profiling

https://github.com/pandas-profiling/pandas-profiling

When doesn’t an RFM segmentation work well?

The first step to understanding customer characteristics is to build segments. Many methodologies have been created, and the simplest yet very useful type is RFM segmentation.

R stands for Recency, the number of time units that have passed since the customer’s last purchase. F stands for Frequency, the number of purchases per time unit, and M stands for Monetary, the total amount spent per time unit. The order of R, F, M is also NOT without meaning: a customer who transacted recently (R) is more valuable than one who transacted several time units ago. From these three metrics, cut-off values are determined to decide which cluster a customer belongs to.

This segmentation is very simple, but is there any catch to the methodology? Suppose we have already decided on a time cut-off for R and observe the relationship between frequency and monetary. When the variance of product prices and quantities is low, we expect a high correlation between frequency and monetary. See the chart below for reference (data is simulated and normalized): the variance of unit price and quantity is low, hence the correlation between frequency and monetary is high, 0.92.

[Chart: frequency vs. monetary, correlation 0.92]

By using the medians of frequency and monetary as cut-offs and placing the customers accordingly, we get the customer segmentation in the matrix below: a good segmentation, but not optimal, because the customers are concentrated in two clusters, Low Frequency – Low Monetary and High Frequency – High Monetary.

[Matrix: frequency x monetary segments using median cut-offs]

The final segments are okay, but the strategies that can be derived from such clusters will be limited (not granular). How to improve them? One way is to pick another metric that is related to monetary value but has low correlation with frequency.

In the final case, I take the average spend per transaction of each customer and get a very low correlation with frequency (-0.09). The customers in the final segmentation are spread evenly across the clusters.

[Matrix: frequency x average spend per transaction segments]
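For concreteness, here is a minimal pandas sketch of the approach, assuming a hypothetical transactions table with customer_id, date, and amount columns:

import pandas as pd

# Hypothetical transactions table: one row per purchase
tx = pd.read_csv('transactions.csv', parse_dates=['date'])

now = tx['date'].max()
rfm = tx.groupby('customer_id').agg(
    recency=('date', lambda d: (now - d.max()).days),  # R
    frequency=('date', 'count'),                       # F
    monetary=('amount', 'sum'),                        # M
)

# The alternative metric: average spend per transaction
rfm['avg_spend'] = rfm['monetary'] / rfm['frequency']
print(rfm[['frequency', 'monetary', 'avg_spend']].corr())

# Median cut-offs: label each customer Low/High on the two metrics
rfm['f_seg'] = (rfm['frequency'] >= rfm['frequency'].median()).map({True: 'High F', False: 'Low F'})
rfm['a_seg'] = (rfm['avg_spend'] >= rfm['avg_spend'].median()).map({True: 'High A', False: 'Low A'})
print(rfm.groupby(['f_seg', 'a_seg']).size())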

Do you think that the approach is good? Comment and feedback are welcome.

#muse #3 #customer #marketing #analytics #RFM

How much is a customer worth?

What is the simplest way to measure the value of the customer? Is it by their spending? Or by the frequency of their engagement with the enterprise/brand?
Marketers have been thinking about this for a long time, and besides frequency and spending, the most important attribute of customer value is what is called recency: the last time the customer interacted with the enterprise/brand.
What does it mean? Is customer A, who spent, say, $150 three months ago, less valuable than customer B, who spent $100 last month? I am afraid that is the case.
Consider this chart (data is simulated). I have worked on many projects and seen a lot of customer data, and I always encounter a similar chart (with small differences).
[Chart: probability to transact vs. months since last transaction]
On the x-axis is the gap since the customer’s last transaction, while on the y-axis is the probability of the customer coming back after that many months of inactivity. Since customer B made a transaction last month, his probability of coming back is around 70%, so his expected value in the current month is $100 × 70% = $70. On the other hand, since customer A made her transaction three months ago, her probability of coming back is about 8%, so her expected value in the current month is $150 × 8% = $12.
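As a toy calculation (the probabilities are read off the simulated chart above):

# Expected value = last spend x probability of coming back
p_return = {1: 0.70, 3: 0.08}   # months since last transaction -> probability
value_b = 100 * p_return[1]     # customer B: $70
value_a = 150 * p_return[3]     # customer A: $12
print(value_a, value_b)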
If you are interested, there are books that touch lightly on this and on how to use the information for customer segmentation:
Marketing Analytics: Data-Driven Techniques with Microsoft Excel – Chapter 30
Introduction to Algorithmic Marketing – Chapter 3
#muse #2 #marketing #analytics #customervalue

RStudio: MemoryError: Java heap space

Working with not-so-big data on my laptop, I often run into problems, for example when I want to read thousands of files into RStudio.

The error message can be seen below:

[Screenshot: code execution fails with “java.lang.OutOfMemoryError: Java heap space”]

What is the solution?

options(java.parameters = "-Xmx8g")

I put that line at the top of my script (it must run before any Java-based package is loaded) to increase the allocated memory from the default to 8 GB.

Does it work?

It did not work when I used 8 GB.

How to load Parquet data into R

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
Parquet data files usually come from a Spark system. Two packages in R are commonly used to read parquet files: SparkR and sparklyr.
I use the “sparklyr” package because it is distributed by RStudio and seems to be used by more people than SparkR.
Based on this source, https://spark.rstudio.com/, using the package looks like a real no-brainer: just install it and it will work.
Here is the common code:

# Install the package
install.packages("sparklyr")
library(sparklyr)

# Install the Spark environment
spark_install(version = "2.1.0")

# Connect to Spark
sc <- spark_connect(master = "local")

# Read the parquet directory into a Spark table
spark_tbl_handle <- spark_read_parquet(sc, "tbl_name_in_spark", "/path/to/parquetdir")

# Convert it to a data frame -- note that "dplyr" must be loaded for collect()
library(dplyr)
regular_df <- collect(spark_tbl_handle)

# Close the connection
spark_disconnect(sc)

But this was false hope; I ran into a few errors when I wanted to read the parquet data.
Problem
Here are some of the error messages:
Error: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '-' expecting <EOF> (line 1, pos 6)

org.apache.spark.sql.AnalysisException: invalid view name
I realised the problem was caused by:
a. Installing Spark from within R
b. The parquet files using “-” (dash) in their filenames
Solution:
a. Install Spark on Windows directly, by following this link. I mainly used link 1 and link 2; link 3 is just for additional knowledge.
b. Rename the parquet files: change “-” (dash) to “_” (underscore).
Renaming 1 file is easy, but when I have more than 100,000 files, it becomes tedious to change them one by one. So instead, I use the power of Windows PowerShell to edit the filenames.
Follow this link
Here is the final script:
Get-ChildItem -File -Recurse | % { Rename-Item -Path $_.PSPath -NewName $_.Name.replace("-","_") }

What is Uplift Model? Intro

In the world of customer targeting with classification models, we are usually interested in only one behaviour that we want to predict. In cross-sell/up-sell, for example, we are interested only in the customers who have a high likelihood of liking the product that we (as a business) want to introduce. Thus, we build a classification algorithm to score customers on the likelihood that they will be happy if we show them the product.

Now, the way a customer reacts to an offer depends largely on how we communicate the product to them. There are customers who will buy the product even if we only show it without an offer (via email, or more like Amazon’s “people who buy… also…”). There are people who like the product we showed them but only buy if there is a discount. The last thing we want is the people who react negatively to the communication: the ones who would likely have bought even without an offer, but once we show them the discounted item, they are lost and no longer interested. They become a lost cause.

Customer reactions to an offer can be drawn as a two-by-two matrix (picture taken from here).

The question is: how do we reduce the associated marketing cost (giving discounts only where it really matters) while optimising revenue (avoiding the customers who will react negatively)? The methodology is called Uplift Modeling and has been studied extensively. In my research, I found at least seven methodologies developed to answer this problem alone, including decision trees with different split criteria (instead of gini or entropy), building two response models and calculating the difference, the class variable transformation/XOR methodology, class probability decomposition, etc.
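As an illustration, here is a minimal Python sketch of the two-model approach using scikit-learn (the data frame and its treated/bought column names are hypothetical):

from sklearn.ensemble import RandomForestClassifier

def two_model_uplift(df, features):
    # Fit separate response models on the treated and control groups
    treated = df[df['treated'] == 1]
    control = df[df['treated'] == 0]
    m_t = RandomForestClassifier(n_estimators=200, random_state=0).fit(treated[features], treated['bought'])
    m_c = RandomForestClassifier(n_estimators=200, random_state=0).fit(control[features], control['bought'])

    # Uplift score = P(buy | treated) - P(buy | not treated)
    p_t = m_t.predict_proba(df[features])[:, 1]
    p_c = m_c.predict_proba(df[features])[:, 1]
    return p_t - p_c

Customers with a high positive score are worth the offer; those with a negative score are the “lost cause” group better left alone.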

However, in the real world, I do not see much benefit in using an Uplift Model to create more precisely targeted marketing, for several reasons:

Point 1. Before building the model, a test communication must be sent to a selected group of customers. The reaction of this group is used as the basis of the uplift modelling. This preparation alone adds lead time to the campaign offer.

Point 2. Some businesses communicate very extensively with their customers (due to the nature of their business), sending on average 10 ad hoc campaigns per day (for example). Back to point 1: the preparation for these campaigns will waste a serious chunk of time.

Point 3. Following point 2, it is simply impossible to build 10 sufficiently accurate models per day.

The way I see it, the Uplift Model is only effective for businesses with less frequent customer communication. I do not have experience in such industries, but I think it might be applicable to financial industries such as insurance. Or, for businesses with high-frequency campaigns, it can be used for recurring campaigns, such as churn campaigns.

What do you think?

Predictive Model is Not the End of the Story!

It is definitely not. The sexy term Data Science has lately been skewed toward only one part: predictive analytics. Many articles, books, blogs, podcasts, you name it, have been dedicated to this topic, day and night. Predictive analytics, part of supervised machine learning, is a way for a machine to understand the patterns in data and predict something. The types of prediction a machine can do are generally only two: classification, predicting whether an observation belongs to a category, and regression, predicting how much/how many based on observed variables. Another kind of prediction is forecasting. Very simple, right?

To be a professional data scientist, I believe understanding predictive modeling algorithms is not enough. It is not the end of the story. While understanding them is handy, understanding what should be done before jumping to an algorithm, and what decisions should be made after producing predictions, are the crucial parts.

A data analytics project cycle, in general, consists of three phases: the understanding phase, the analytics (and modeling) phase, and finally the recommendation phase.

Other articles might divide it into four phases, but essentially they are the same. I will explain each part of the cycle and give an example of what each phase looks like and what kind of skills are needed.

Understand

The understanding phase is the most important in an analytics project. Provided the data is available, a data scientist needs to bridge what the executives have in mind: what their biggest challenge is, and what they want to achieve in the short term and the long term. Using the available data, a data scientist should ask the question “What happened?”. To get the answer, the data scientist needs to really understand the data, its context, and the types of data recorded, and then start to explore it by creating histograms, time series charts, correlation charts, etc.

Analytics and Modeling

In the understanding phase, a data scientist might find several business challenges; some interesting, some not. Rather than pursuing the challenge he finds most interesting, it is better to cross-check the challenges against the short-term and long-term goals of the company. By doing this, a data scientist can walk side by side with the executives and, in the end, support the bottom line of the company: making money. Suppose the executives are really concerned about churning customers, while other problems are mathematically interesting but their impact is minuscule; then building a churn model makes more sense. If the short-term goal is to keep customers buying more, a cross-sell and upsell model might be prioritized. In this step, knowing the suitable predictive algorithm for a particular problem is very important.

Recommendation

Once the analysis and the model have been made, the last part is the recommendation. Suppose a churn model has been built: what do the executives need to do to optimize the targeting? Rather than contacting all their customers, they can contact only the customers with a high probability of churning. If money is a concern, the company can target the top 10% or 20% of customers with the highest churn probability.

Another example is targeting customers for cross-sell or upsell. I always like an explanation I heard in a seminar, and it goes like this: from historical data, the average response rate of all customers for a campaign is the blue area in the box below, around 10%.

So we think it is better to send a targeted message to customers with high probability according to the cross-sell model that has been created, and the end result is like the box below.

Based on the model, the suggested customers to send the offer to are about 50% of the total (see the triangle on the right). Congratulations, your response rate just went up from 10% to 19%. This is a massive achievement. The question is, is it worth it? Contacting 50% of the customers at a 19% response rate yields 9.5 responders per 100 customers instead of 10, so to raise the take-up rate from 10% to 19%, the company loses 0.5% of potential customers. With the proliferation of cheap tools to contact customers, losing 0.5% of customers is not something a company can afford: 0.5% of customers can be worth millions. So, does that mean the model is useless? Why did the data scientist need to build the model in the first place? The answer is that the model is still valuable. With a bit of creativity and different messaging, we could create a promotion with a take-up rate as below.

By using this, we could increase the take-up rate and sell more products.
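To make the arithmetic above concrete, here is a toy calculation with the example’s numbers:

# Responders per 100 customers under each strategy
mass_campaign = 100 * 1.00 * 0.10   # contact everyone at a 10% response rate -> 10.0
targeted      = 100 * 0.50 * 0.19   # contact the top 50% at a 19% response rate -> 9.5
print(mass_campaign - targeted)     # 0.5 potential customers lost per 100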

Is that all? Is there any sophisticated way to optimize decisions aside from common sense and creativity? Yes, there are two sophisticated mathematical formulations to assist humans in making optimal decisions: one is optimization, the other is simulation. Optimization algorithms are the bread and butter that decide how much we pay for a hotel room or a plane ticket, and even how orange juice should taste. I might cover it in another article, but for a starter, read this story from Bloomberg on the application of forecasting and optimization to orange juice production: Coke Engineers Its Orange Juice—With an Algorithm.

Understanding Quadratic Optimization Solver in R


I promised to write technical stuff on my blog in a more popular way, just like Olav Laudy. But I have to break that promise this time: this problem has been bugging me and pushed me to write it down as documentation.

I have been trying to solve some problems using optimization in R, especially quadratic programming. Let’s take a look at the standard formulation of a quadratic program:

min over x:  (1/2) x' Q x + c' x
subject to:  A x <= b

where x is the optimal parameter vector to be found, Q is the hessian matrix, and c is the cost vector. The second line is the constraint.

In R, the function solve.QP from the quadprog package can be used to solve a quadratic programming problem. The call looks like this:


solve.QP(Dmat, dvec, Amat, bvec, meq=0, factorized=FALSE)

where Dmat is the hessian matrix, dvec is the cost c, Amat is the A matrix, and bvec is the b vector. meq gives the number of leading constraints treated as equalities. factorized is a logical flag: if TRUE, then Dmat holds R^-1 (where D = R' R) instead of the hessian matrix D itself.

Some example from the CRAN website

##
## Assume we want to minimize: -(0 5 0) %*% b + 1/2 b^T b
## under the constraints: A^T b >= b0
## with b0 = (-8,2,0)^T
##            ( -4  2  0 )
## and    A = ( -3  1 -2 )
##            (  0  0  1 )
## we can use solve.QP as follows:
##
Dmat = matrix(0,3,3)
diag(Dmat) = 1
dvec = c(0,5,0)
Amat = matrix(c(-4,-3,0,2,1,0,0,-2,1),3,3)
bvec = c(-8,2,0)
solve.QP(Dmat,dvec,Amat,bvec=bvec)

If we look closely, it is apparent that the inequality constraint sign is now >= instead of <=. This was my first confusion: why does R change it? Look carefully at the objective function: the linear term is -(0 5 0) %*% b. See the minus sign? That is the answer. quadprog’s standard form is min(-d' b + 1/2 b' D b) subject to A' b >= b0, so the linear coefficients enter negated; flipping the signs of d and of the constraint rows turns the generic <= formulation into this one. In the end it is the same problem.

Let’s see quadratic programming used to solve a regression problem. We know that the optimal regression parameters are found via optimization: we want to minimize the residual sum of squares (RSS), or S(b), in the equation below.

S(b) = ( Y - X b )' ( Y - X b )

And the parameter beta is

b* = argmin over b of S(b)

Since this is a convex problem, setting the first derivative to zero will do the trick. But just for fun, we want to solve it with an optimization algorithm. The RSS can be expanded like this:

RSS = ( Y - X b )' ( Y - X b )
    = Y' Y - 2 Y' X b + b' X' X b

My second confusion came from the optimization approach to the linear regression problem on this website, http://zoonek.free.fr. The code for unconstrained quadratic optimization in R is as follows:


# Sample data
n = 100
x1 = rnorm(n)
x2 = rnorm(n)
y = 1 + x1 + x2 + rnorm(n)
X = cbind( rep(1,n), x1, x2 )

# Regression
r = lm(y ~ x1 + x2)

# Optimization

library(quadprog)
s = solve.QP( t(X) %*% X, t(y) %*% X, matrix(nr=3,nc=0), numeric(), 0 )

coef(r)
s$solution # Identical

See that Dmat is X' X, coded as t(X) %*% X, and dvec comes from the -2 Y' X term but is coded as just t(y) %*% X.

So, where did the factor 2 and the Y' Y term go? Here is the answer: they do not affect the optimizer. Y' Y is a constant, so it can be dropped, and the remaining objective can be divided by the positive constant 2 without changing the minimizer:

min Y' Y - 2 Y' X b + b' X' X b   is equivalent to   min - Y' X b + (1/2) b' X' X b

which is exactly quadprog’s standard form min(-d' b + 1/2 b' D b) with D = X' X and d' = Y' X, exactly like the code from the website.

3 Predictive Models to Optimize AdWords

With the burst of online business, it is hard for a new company without much of a differentiator to get attention on the internet. Even with a good press release, getting the company’s name into consumers’ heads for a certain service is a daunting task. In the travel business, for example, a lot of companies offer their services; some fail, some thrive, and some are too far ahead of the market and end up failures.

Nobody can live without a search engine nowadays. That is why search engines are the default tool for an online business to reach the masses, the customers who will eventually use its services. A big company such as the Priceline Group spends 1.8 billion USD on online advertising, 90% of which goes to Google’s ad platform. But not every company has a massive budget like Priceline’s, and their products are not top of mind for every customer, so they have to optimize their ad-spending budget. There are three predictive models that can be developed to optimize online ad spending. The models are built around the questions below; of course, the questions can be developed further, and alternative models can be created to answer them and support the objective.
[Diagram: 3-step analytics]
Why do we need those three models?
Model 1 Classification – Predicting “Which keywords?”
Not all keywords work. A company can creatively create and bid on as many keywords as people search for on the internet, but not all of them will work. I have audited a keyword campaign for a company, and it is true: they had about 8,000 keywords online and only about 3,500 of them generated traffic. The other 4,500 never got clicked whatsoever, so about 56% of the keywords were wasted. The classification model that was built had high accuracy and reduced keyword wastage to 30%.
Model 2 Regression – Predicting “How much traffic?”
If we know which keywords can drive traffic, the next question is how much traffic a keyword can generate. Some keywords are very good at generating traffic; many others generate only a small amount. This problem can be approached with a regression model. Using the right variables, the digital advertiser can focus on the keywords that generate the desired amount of traffic.
Model 3 Regression – Predicting “What is the conversion rate?”
Even if a keyword generates traffic, not every keyword generates conversions, meaning people actually using the product. Given enough samples, a regression model can be developed to predict conversion. Conversion here is stated as a fraction, so the output of the prediction will be a value between 0 and 1. Focusing on keywords that drive high conversion helps the digital advertiser optimize the online advertising budget.
Now that all of the possible models have been explained, what kinds of variables should be used? The variables can describe when the keyword is advertised, such as the time and date, or be meta-variables derived from the keyword itself: the number of words, whether it contains “free” or “discount”, a specific city, a competitor brand, and the list can go on.
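As an illustration, here is a minimal Python sketch of Model 1 built on such meta-variables, assuming a hypothetical keyword table with a keyword column and a 0/1 got_traffic label:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def keyword_features(df):
    # Meta-variables derived from the keyword text
    out = pd.DataFrame(index=df.index)
    out['n_words'] = df['keyword'].str.split().str.len()
    out['n_chars'] = df['keyword'].str.len()
    out['has_free'] = df['keyword'].str.contains('free').astype(int)
    out['has_discount'] = df['keyword'].str.contains('discount').astype(int)
    return out

def fit_traffic_classifier(df):
    X = keyword_features(df)
    y = df['got_traffic']
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    print('holdout accuracy:', model.score(X_te, y_te))
    return model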