The data was obtained from the UCI repository.
Loading and exploring dataset
First, we will want to take a look at the dataset to see what we have:
# Load the tidyverse library. It contains tools to load and clean/manipulate data
library(tidyverse)
library(ggplot2)
# Load dataset and save it in a data frame object
data <- readxl::read_xls("/home/daniel/Dropbox/Maestria Data Science/Semestre 1/Aprendizaje de Maquina/creditcarddefault/default_of_credit_card_clients.xls", col_names = TRUE) %>% as.data.frame()
# Get dataframe dimension
dim(data)
## [1] 30000 25
# Column names
colnames(data)
## [1] "ID" "LIMIT_BAL"
## [3] "SEX" "EDUCATION"
## [5] "MARRIAGE" "AGE"
## [7] "PAY_0" "PAY_2"
## [9] "PAY_3" "PAY_4"
## [11] "PAY_5" "PAY_6"
## [13] "BILL_AMT1" "BILL_AMT2"
## [15] "BILL_AMT3" "BILL_AMT4"
## [17] "BILL_AMT5" "BILL_AMT6"
## [19] "PAY_AMT1" "PAY_AMT2"
## [21] "PAY_AMT3" "PAY_AMT4"
## [23] "PAY_AMT5" "PAY_AMT6"
## [25] "default payment next month"
We know that our dataset consists of 30,000 records with 25 columns. According to the accompanying documentation on the UCI website, the columns are the following:
- ID
- Limit balance (credit amount in Taiwanese dollars)
- Gender (1 = male; 2 = female)
- Level of education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
- Marital status (1 = married; 2 = single; 3 = others)
- Age (in years)
- PAY_0 through PAY_6: payment status over the last six months (-1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; …; 8 = payment delay for eight months; 9 = payment delay for nine months and above)
- BILL_AMT1 through BILL_AMT6: remaining bill amount for each of the last six months
- PAY_AMT1 through PAY_AMT6: amount paid toward each bill over the last six months
- Indicator variable for default in the following month (this is our variable of interest, which we will want to predict)
We can then take a first preview of the data:
# Print first few lines of dataset
head(data)
## ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
## 1 1 20000 2 2 1 24 2 2 -1 -1 -2
## 2 2 120000 2 2 2 26 -1 2 0 0 0
## 3 3 90000 2 2 2 34 0 0 0 0 0
## 4 4 50000 2 2 1 37 0 0 0 0 0
## 5 5 50000 1 2 1 57 -1 0 -1 0 0
## 6 6 50000 1 1 2 37 0 0 0 0 0
## PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
## 1 -2 3913 3102 689 0 0 0
## 2 2 2682 1725 2682 3272 3455 3261
## 3 0 29239 14027 13559 14331 14948 15549
## 4 0 46990 48233 49291 28314 28959 29547
## 5 0 8617 5670 35835 20940 19146 19131
## 6 0 64400 57069 57608 19394 19619 20024
## PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## 1 0 689 0 0 0 0
## 2 0 1000 1000 1000 0 2000
## 3 1518 1500 1000 1000 1000 5000
## 4 2000 2019 1200 1100 1069 1000
## 5 2000 36681 10000 9000 689 679
## 6 2500 1815 657 1000 1000 800
## default payment next month
## 1 1
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
Data exploration and cleaning
By convention, variable names are written in lowercase with underscores instead of spaces. We will fix this next:
names(data) <- names(data) %>% tolower() %>% stringr::str_replace_all(.," ","_")
colnames(data)
## [1] "id" "limit_bal"
## [3] "sex" "education"
## [5] "marriage" "age"
## [7] "pay_0" "pay_2"
## [9] "pay_3" "pay_4"
## [11] "pay_5" "pay_6"
## [13] "bill_amt1" "bill_amt2"
## [15] "bill_amt3" "bill_amt4"
## [17] "bill_amt5" "bill_amt6"
## [19] "pay_amt1" "pay_amt2"
## [21] "pay_amt3" "pay_amt4"
## [23] "pay_amt5" "pay_amt6"
## [25] "default_payment_next_month"
Afterwards, we can verify whether or not we have missing values in the dataset:
data %>% apply(.,2,is.na) %>% apply(.,2,sum)
## id limit_bal
## 0 0
## sex education
## 0 0
## marriage age
## 0 0
## pay_0 pay_2
## 0 0
## pay_3 pay_4
## 0 0
## pay_5 pay_6
## 0 0
## bill_amt1 bill_amt2
## 0 0
## bill_amt3 bill_amt4
## 0 0
## bill_amt5 bill_amt6
## 0 0
## pay_amt1 pay_amt2
## 0 0
## pay_amt3 pay_amt4
## 0 0
## pay_amt5 pay_amt6
## 0 0
## default_payment_next_month
## 0
And we have no missing values.
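The same check can be written more concisely with base R's colSums, which avoids the double apply (a minimal equivalent sketch):

```r
# Count NA values per column in a single pass
na_counts <- colSums(is.na(data))
# TRUE only if some column contains missing values
any(na_counts > 0)
```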
We can then do some exploration to understand how our variables behave:
Limit balance
hist(data$limit_bal, main = "Histogram of loan amount", xlab = "Loan amount")
Most of the loans belong to the lower end of the spectrum.
data$limit_bal %>% mean()
## [1] 167484.3
data$limit_bal %>% quantile()
## 0% 25% 50% 75% 100%
## 10000 50000 140000 240000 1000000
The mean loan amount is about 167,484 Taiwanese dollars, and half of the loans were for under 140,000 dollars. The maximum loan was for 1,000,000 Taiwanese dollars.
Sex
# change coding to 0 and 1 where now female = 0 and male = 1
data$sex <- ifelse(data$sex == 2, 0,1)
prop.table(table(data$sex))
##
## 0 1
## 0.6037333 0.3962667
We can see that the majority of borrowers are female, making up 60% of the list; males make up the remaining 40%.
Education
data$education %>% table %>% prop.table() %>% plot()
From the plot we can see that there are more groups than were described in the dataset documentation. We could try to infer what they are, but since we cannot be sure and cannot ask anyone, we'll merge groups 0, 5, and 6 into group 4, which already represents 'others'.
data$education <- ifelse(data$education %in% c(0,5,6), 4, data$education)
data$education %>% table %>% prop.table()
## .
## 1 2 3 4
## 0.3528333 0.4676667 0.1639000 0.0156000
From the previous table we can observe that this bank has mainly given loans to people with a university degree, followed by people with a graduate degree and then people with a high school degree. It seems to work mainly with highly educated clients.
Marriage
data$marriage %>% table()
## .
## 0 1 2 3
## 54 13659 15964 323
Again, with marriage we have more groups than were described, so we will add the observations with a ‘0’ to group ‘3’.
data$marriage <- ifelse(data$marriage == 0, 3, data$marriage)
data$marriage %>% table %>% prop.table()
## .
## 1 2 3
## 0.45530000 0.53213333 0.01256667
Customers seem to be fairly evenly divided between married and single. 'Others' represents only about 1.3 percent of the data.
Age
data$age %>% hist(., main = "Histogram of age", xlab = "Age")
Customers mainly belong to the ‘young adult’ (between 25 and 40) segment of the population.
PAY_0 to PAY_6
ggplot(data = data %>% gather(c(7:12), key = "month", value = "pay")) + geom_histogram(aes(x=pay)) + facet_grid(~month) + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the previous plots, we can see that most customers pay duly (on time), and quite a few also appear to pay one or two months early (the -1 and -2 codes). There is another spike of customers who pay three months late, and very few customers pay more than three months late.
Pay amount
ggplot(data = data %>% gather(c(19:24), key = "month", value = "pay")) + geom_histogram(aes(x=pay)) + facet_grid(~month) + xlim(c(0,150000)) + ylim(c(0,10000)) + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we can see, most customers make low payments toward their loan. However, this is likely related to the initial loan amount, so it might be interesting to add a variable capturing the proportion of the bill that was actually paid. I have changed the plot limits so outliers do not distort the panels.
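As a sketch of the proposed feature (the name pay_ratio1 and the clipping at 1 are my own choices, not part of the original dataset), the share of the most recent bill covered by the matching payment could be added like this:

```r
# Hypothetical feature: share of the most recent bill covered by the payment;
# bills of zero or less are left as NA to avoid division problems
data$pay_ratio1 <- ifelse(data$bill_amt1 > 0,
                          pmin(data$pay_amt1 / data$bill_amt1, 1),
                          NA)
summary(data$pay_ratio1)
```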
Credit default
data$default_payment_next_month %>% table() %>% prop.table()
## .
## 0 1
## 0.7788 0.2212
Apparently, about 22% of the clients in the database defaulted. We can explore some interactions with other variables to see whether any trends emerge:
data[,c(3,25)] %>% table() %>% prop.table() %>% plot()
Regarding gender, men default at a somewhat higher rate than women. (The joint proportions in the table are 9.6% of all clients being male defaulters versus 12.5% female defaulters, but since women make up 60% of the portfolio, the within-group default rate is higher for men.)
data[,c(4,25)] %>% table() %>% prop.table() %>% plot()
Proportionally, clients in the 'others' category default significantly less than the rest; however, there are far fewer clients in this category (468 out of 30,000), so this could come down to chance. There also seems to be a trend where less educated clients default more (1 = graduate education, 3 = high school only).
data[,c(5,25)] %>% table() %>% prop.table() %>% plot()
We can't see a big difference among marital status groups, although the proportion of singles who defaulted is slightly lower than that of married clients.
Feature engineering
Since we have several categorical variables, it is a good idea to convert them to dummy variables. We can use the onehot library to achieve this, but first we must convert the categorical columns to factors:
library(onehot)
# sex is a categorial variable but is already formatted as dummy
data$education <- data$education %>% as.factor()
data$marriage <- data$marriage %>% as.factor()
data$pay_0 <- data$pay_0 %>% as.factor()
data$pay_2 <- data$pay_2 %>% as.factor()
data$pay_3 <- data$pay_3 %>% as.factor()
data$pay_4 <- data$pay_4 %>% as.factor()
data$pay_5 <- data$pay_5 %>% as.factor()
data$pay_6 <- data$pay_6 %>% as.factor()
data_onehot <- onehot(data, max_levels = 15)
data_cat <- predict(data_onehot,data) %>% as.data.frame()
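A quick sanity check on the expansion (a sketch): the indicator columns produced for one factor should sum to exactly 1 in every row, and data_cat should have more columns than the original 25.

```r
dim(data_cat)  # expect more than the original 25 columns
# The education indicators should partition each row: exactly one is 1
edu_cols <- grep("^education", colnames(data_cat))
all(rowSums(data_cat[, edu_cols]) == 1)
```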
Afterwards, we shuffle the rows and split the data into train (60%), validation (20%), and test (20%) sets.
set.seed(138176)
# Shuffle rows with a uniform random key, then drop the key
data_cat$unif <- runif(nrow(data_cat))
data_or <- data_cat %>% arrange(unif) %>% select(-unif)
split_train <- nrow(data_or)*0.6
split_val <- nrow(data_or)*0.8
X_train <- data_or[1:split_train,] %>% select(-default_payment_next_month, -id) %>% as.matrix()
y_train <- data_or$default_payment_next_month[1:split_train]
X_val <- data_or[(split_train+1):split_val,] %>% select(-default_payment_next_month, -id) %>% as.matrix()
y_val <- data_or$default_payment_next_month[(split_train+1):split_val]
X_test <- data_or[(split_val+1):nrow(data_or),] %>% select(-default_payment_next_month, -id) %>% as.matrix()
y_test <- data_or$default_payment_next_month[(split_val+1):nrow(data_or)]
rm(data, data_or)
prop.table(table(y_train))
## y_train
## 0 1
## 0.7694444 0.2305556
prop.table(table(y_val))
## y_val
## 0 1
## 0.7966667 0.2033333
We can see that we have managed to maintain a distribution of defaulted customers similar to the original, which is important to make sure we train our model correctly and do not under- or over-represent defaulted customers. It is important to remember that our objective in this project is to devise a model that predicts which loans are more prone to default. In this sense, it might be in the bank's best interest to try to identify all loans that will default, even if this means more false positives (denying loans that were classified as prone to default when they actually would not have defaulted).
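If we wanted to guarantee identical class proportions across the splits rather than rely on a random shuffle, a stratified split in base R (a sketch, sampling 60% within each class for training) would look like:

```r
set.seed(138176)
# Indices of each class
idx0 <- which(data_cat$default_payment_next_month == 0)
idx1 <- which(data_cat$default_payment_next_month == 1)
# Sample 60% from each class so the training default rate
# matches the full-data rate by construction
train_idx <- c(sample(idx0, floor(0.6 * length(idx0))),
               sample(idx1, floor(0.6 * length(idx1))))
mean(data_cat$default_payment_next_month[train_idx])  # close to the overall 0.2212
```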
Logistic regression with regularization
I will implement a logistic regression using the glmnet library:
library(glmnet)
glm_model <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 1, lambda = exp(seq(-10,0,1)), nfolds =10)
plot(glm_model)
# Note: by default, predict() on a glmnet model returns the linear predictor (log-odds), not probabilities
preds_val_log <- predict(glm_model, newx = X_val)
print("Confusion Matrix")
## [1] "Confusion Matrix"
prop.table(table(preds_val_log > 0.5, y_val),2)
## y_val
## 0 1
## FALSE 0.98410042 0.80819672
## TRUE 0.01589958 0.19180328
In this case, using Lasso regularization to reduce the model's complexity, we obtain a model that is much better at identifying non-defaulting loans than defaulting ones, which is the opposite of our main objective. (Note that predict() returned the linear predictor here, so the 0.5 threshold on the log-odds scale actually corresponds to a probability of about 0.62, which makes the model even more conservative about flagging defaults.)
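To see which predictors the Lasso actually kept, we can pull the coefficients at the cross-validated lambda (a sketch using glmnet's coef method; which predictors survive depends on the fitted lambda path):

```r
# Coefficients at the lambda with minimum cross-validation error
coefs <- coef(glm_model, s = "lambda.min")[, 1]
# Predictors the Lasso did not shrink to exactly zero
names(coefs[coefs != 0])
```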
Random Forest
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rand_forest <- tuneRF(X_train, factor(y_train), ntree = 300, stepFactor = 1.3, doBest = TRUE)
## mtry = 9 OOB error = 19.02%
## Searching left ...
## mtry = 7 OOB error = 19.18%
## -0.00817757 0.05
## Searching right ...
## mtry = 11 OOB error = 19.19%
## -0.008761682 0.05
preds_val_rf <- predict(rand_forest, type = "prob", newdata = X_val)
print("Confusion Matrix")
## [1] "Confusion Matrix"
prop.table(table(preds_val_rf[,2] > 0.5, y_val),2)
## y_val
## 0 1
## FALSE 0.96401674 0.64590164
## TRUE 0.03598326 0.35409836
The random forest tuning function found that the optimal number of variables to try at each split was 9, as it produces the lowest out-of-bag error. This model produces a much better confusion matrix with respect to our objective: it correctly classified 35% of the defaulting loans on the validation data.
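Since tuneRF with doBest = TRUE returns a fitted forest, we can also inspect which variables drive its splits (a sketch using randomForest's standard importance accessors):

```r
# Mean decrease in Gini impurity per variable, largest first
imp <- randomForest::importance(rand_forest)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)
# The same ranking as a plot
randomForest::varImpPlot(rand_forest, n.var = 10)
```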
Gradient Boosting Models
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
d_entrena <- xgb.DMatrix(X_train, label = y_train)
d_valida <- xgb.DMatrix(X_val, label = y_val)
watchlist <- list(eval = d_valida, train = d_entrena)
params <- list(booster = "gbtree",
max_depth = 3,
eta = 0.03,
nthread = 8,
subsample = 0.75,
lambda = 0.001,
objective = "binary:logistic",
eval_metric = "error")
bst <- xgb.train(params, d_entrena, nrounds = 300, watchlist = watchlist, verbose=0)
preds_val_bst <- predict(bst, newdata = X_val)
print("Confusion matrix")
## [1] "Confusion matrix"
prop.table(table(preds_val_bst > 0.5, y_val),2)
## y_val
## 0 1
## FALSE 0.96276151 0.63114754
## TRUE 0.03723849 0.36885246
Finally, the boosting model seems slightly better than the random forest at classifying defaulted loans as defaults.
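As with the forest, we can ask the booster which features it relies on (a sketch using xgboost's built-in importance helpers):

```r
# Gain-based importance of each feature in the trained booster
imp <- xgb.importance(feature_names = colnames(X_train), model = bst)
head(imp, 10)
# Bar chart of the top features
xgb.plot.importance(head(imp, 10))
```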
Model evaluation
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred_rocr <- prediction(preds_val_log, y_val)
perf <- performance(pred_rocr, measure = "sens", x.measure = "fpr")
graf_roc_log <- tibble(tfp = perf@x.values[[1]], sens = perf@y.values[[1]],
                       d = perf@alpha.values[[1]])
pred_rocr <- prediction(preds_val_rf[,2], y_val)
perf <- performance(pred_rocr, measure = "sens", x.measure = "fpr")
graf_roc_rf <- tibble(tfp = perf@x.values[[1]], sens = perf@y.values[[1]],
                      d = perf@alpha.values[[1]])
pred_rocr <- prediction(preds_val_bst, y_val)
perf <- performance(pred_rocr, measure = "sens", x.measure = "fpr")
graf_roc_bst <- tibble(tfp = perf@x.values[[1]], sens = perf@y.values[[1]],
                       d = perf@alpha.values[[1]])
graf_roc_log$model <- 'Logistic Regression'
graf_roc_rf$model <- 'Random Forest'
graf_roc_bst$model <- 'Boosting'
graf_roc <- bind_rows(graf_roc_log, graf_roc_rf, graf_roc_bst)
ggplot(graf_roc, aes(x = tfp, y = sens, colour = model)) + geom_point() +
xlab('1-Specificity') + ylab('Sensitivity')
Using the ROC curves, we can confirm what we observed in the confusion matrices. The boosting model is better than the other two, as it achieves a greater area under the curve, and it appears specifically stronger on sensitivity, which is our main objective. It might be interesting to change the cutoff value used to classify a customer as a default, while maintaining a good balance for those classified as non-defaults. We can do this with the confusion matrix reviewed previously.
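The visual comparison can be made precise by computing the area under each ROC curve (a sketch reusing ROCR's prediction/performance machinery on the same validation scores):

```r
# AUC for an arbitrary score vector against the validation labels
auc_of <- function(scores, labels) {
  pred <- ROCR::prediction(scores, labels)
  ROCR::performance(pred, measure = "auc")@y.values[[1]]
}
auc_of(preds_val_log, y_val)      # logistic regression
auc_of(preds_val_rf[, 2], y_val)  # random forest
auc_of(preds_val_bst, y_val)      # boosting
```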
It all depends on the business rules the bank works with, but since in this case I cannot communicate with the bank, I would suggest a cutoff probability of 0.32. Why? At this value we correctly classify about 50% of the defaulting borrowers while only 'losing' 10% of the non-defaulting ones. And instead of losing them outright, a strategy could be defined to further analyze those specific cases/applications.
print("Confusion matrix - Boosting")
## [1] "Confusion matrix - Boosting"
prop.table(table(preds_val_bst > 0.32, y_val),2)
## y_val
## 0 1
## FALSE 0.90062762 0.47950820
## TRUE 0.09937238 0.52049180
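The 0.32 cutoff above was chosen by inspection; a small sweep over candidate cutoffs (a sketch) makes the sensitivity/specificity trade-off explicit and reproducible:

```r
# For each cutoff: share of true defaults caught (sensitivity)
# and share of non-defaults kept (specificity) on the validation set
cutoffs <- seq(0.10, 0.50, by = 0.02)
sweep_tab <- t(sapply(cutoffs, function(k) {
  pred_pos <- preds_val_bst > k
  c(cutoff      = k,
    sensitivity = mean(pred_pos[y_val == 1]),
    specificity = mean(!pred_pos[y_val == 0]))
}))
sweep_tab
```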
Thus, we will use the Boosting model to predict on our test set and see how it performs:
preds_test_bst <- predict(bst, newdata = X_test)
print("Confusion matrix")
## [1] "Confusion matrix"
prop.table(table(preds_test_bst > 0.32, y_test),2)
## y_test
## 0 1
## FALSE 0.90409801 0.49289100
## TRUE 0.09590199 0.50710900
The final result is consistent with what we observed on the validation data. We can conclude that the model is learning correctly from the data it was fed and could prove useful in the bank's loan application decisions. It has the potential to reduce costs by helping the bank avoid granting loans to customers who will default on them.