Principal Component Analysis
and Clustering
Professor Daymond
27-Nov-2016
UNDERSTANDING BORROWER SEGMENTS
• Credit based accounts: The majority of the accounts belong to credit-based borrowers, characterized by revolving utilization and by the most revolving accounts and bankcards.
• Fixed instalment accounts: These accounts mostly carry fixed instalments, such as car loans and student loans. The number of instalment accounts and instalment utilization are the major factors of this segment.
• Past due accounts: These borrowers have past-due records and the most late fees on credit and loan amounts. Given the recent history of delinquency, this segment is medium risk.
• Highly inquired accounts: These borrowers have the most loan inquiries. They exhibit the most credit card purchase behaviour and tend to try every possible loan.
• Debt collections accounts: These accounts hold the largest number of public records, such as tax liens. Collections money owed and tax liens are the major factors of this segment.
• High risk delinquent accounts: The highest delinquency, credit usage beyond the limit and multiple accounts in recent times make this segment high risk.
IDENTIFYING THE PRINCIPAL COMPONENTS
The given dataset has N = 27,000 observations and 77 variables, so it is important to reduce it to a smaller set of variables in order to draw a
workable conclusion. Because of multicollinearity, two or more variables can lie close to the same plane in the n dimensions, i.e. carry largely the
same information. Each row of the data can be pictured as a point in a 77-dimensional space, and when the data are projected onto orthonormal
axes, correlated characteristics of the data are expected to group together as principal components. To identify these principal components,
PROC PRINCOMP is executed with all the variables except the constant variables (recoveries and collection fees), and a plot of the eigenvalues of
all the principal components is produced.
The eigenvalue of each principal component is the variance of that component. The greater the eigenvalue, the more of the total variance the
component explains. Hence the cut-off criteria for retaining components are that the eigenvalue must be greater than 1 and the
cumulative variance should be at least 75%.
From the results (Appendix 1), there are 18 components with eigenvalues greater than 1, and together they contribute approximately
76% of the total variance. The coefficients of the principal components are the eigenvectors (Appendix 2): each component is a linear combination
of the inputs, and its eigenvector gives the direction of the component's axis, while the eigenvalue gives its length.
From the scree plot in Figure 1, the curve is almost flat after eigenvalue 1, implying that the further components contribute very little
to the variance. Hence a total of 18 principal components account for a significant share of the variance in the data.
Figure 1 Figure 2
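The cut-off rule described above (eigenvalue greater than 1, cumulative variance of at least 75%) can be sketched in a few lines of Python. This is only an illustration: the analysis itself was done with PROC PRINCOMP, and the function name `select_components` and the six eigenvalues below are made up for the example, not taken from Appendix 1.

```python
def select_components(eigenvalues, min_eigenvalue=1.0, min_cumulative=0.75):
    """Apply the eigenvalue > 1 rule and check the cumulative variance target.

    Returns (number of components kept, share of total variance they explain,
    whether that share reaches min_cumulative).
    """
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    kept = [ev for ev in ordered if ev > min_eigenvalue]
    cumulative = sum(kept) / total
    return len(kept), cumulative, cumulative >= min_cumulative


# Hypothetical eigenvalues of a 6-variable correlation matrix (they sum to 6).
example = [2.5, 1.5, 1.1, 0.5, 0.3, 0.1]
n_kept, share, target_met = select_components(example)
print(n_kept, round(share, 3), target_met)
```

With 77 standardized variables the eigenvalues sum to 77, so the same function applied to the Appendix 1 values would be expected to report the 18 components and the roughly 76% share quoted above.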
INTERPRETING THE PRINCIPAL COMPONENTS
To interpret the principal components, the correlation matrix of the eigenvectors is examined for the highest correlations with the original
variables. PRINCOMP standardizes the data, and the values in the matrix lie between -1 and 1. Values whose absolute value is close to 1,
whether positive or negative, indicate high correlation with the original variables.
PRINCIPAL COMPONENT 1
From Figure 4, the highest coefficients correspond to the
various counts of accounts, i.e. how valuable the customers
are in terms of usage, and the lowest to the duration since
the most recent account, i.e. how credible the customers are.
Similarly, each principal component is analysed for its
highest and lowest coefficients, and the results are tabulated
for reference.
Figure 3
Figure 4
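Reading off the highest and lowest coefficients of a component, as done for principal component 1, could be automated along the following lines. This is a sketch under assumptions: the function name `extreme_loadings`, the variable names and the coefficient values are hypothetical and are not the figures from the eigenvector table.

```python
def extreme_loadings(coefficients, top=2):
    """Return the `top` highest and `top` lowest coefficients of one component.

    `coefficients` maps a variable name to its coefficient in the eigenvector.
    """
    ranked = sorted(coefficients.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top], ranked[-top:]


# Hypothetical coefficients for one principal component.
pc1 = {
    "num_open_accounts": 0.31,
    "num_revolving_accounts": 0.29,
    "total_late_fees": 0.02,
    "months_since_recent_account": -0.18,
}
highest, lowest = extreme_loadings(pc1, top=1)
print(highest, lowest)
```

Running the same function over every retained component gives the kind of table that the slide refers to.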
IDENTIFYING THE CLUSTERS
Once the principal components are identified, the next step is to feed them into a cluster analysis: after PROC STDIZE, the FASTCLUS procedure is run
with MAXCLUSTERS values ranging from 3 to 20. FASTCLUS uses k-means clustering, an iterative approach, and the aim here is to identify
approximately equal-sized clusters with a decent spread. A set of values is selected as initial seeds, which serve as reference means. Each observation
is assigned to its nearest seed to form temporary clusters, the seeds are replaced with the means of the new clusters, and this is repeated
until the clusters no longer change. ‘Complete convergence is satisfied’ implies that the final seeds are equal to the
cluster means.
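The iteration described above can be illustrated with a plain k-means loop in Python. This is a minimal sketch of the general technique, not a reproduction of FASTCLUS, which has its own seed selection and options. The function names, the six two-dimensional points and the two seeds are invented for the example.

```python
import math


def nearest(point, centres):
    """Index of the centre closest to `point` (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))


def k_means(points, seeds, max_iter=100):
    """Assign points to the nearest centre, recompute the means, and repeat."""
    centres = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        labels = [nearest(p, centres) for p in points]
        new_centres = []
        for k, old in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(old)  # an empty cluster keeps its previous centre
        if new_centres == centres:  # centres already equal the cluster means
            break
        centres = new_centres
    return labels, centres


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centres = k_means(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centres)
```

The loop stops when recomputing the means no longer moves any centre, which corresponds to the situation where the final seeds equal the cluster means.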
Summary
The cluster summary displays the frequency of observations in each
cluster and the root mean square deviation. The next column displays the largest
distance from the seed to an observation, i.e. approximately the total spread of
the cluster. The last column displays the distance from the centre of the cluster
to the centre of the nearest cluster.
Six appropriately sized clusters are obtained from the 14-cluster run, at the 35th iteration.
Clusters 1, 4, 6, 9, 12 and 14 are the identified clusters, and Cluster 1 is observed to be
the nearest cluster for all the other clusters
Goodness-of-fit metrics
Higher values of the pseudo F statistic are preferred when choosing a good number of
clusters
R-square is the proportion of variance accounted for by the clusters
Higher CCC values indicate good clustering; values are generally expected to be more
than 2 or 3.
A higher pseudo F statistic and a higher CCC imply that the clustering solution is good
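Two of these metrics are simple enough to compute by hand. The sketch below uses the usual definitions, R-square = 1 - (within-cluster sum of squares / total sum of squares) and pseudo F = [R-square / (c - 1)] / [(1 - R-square) / (n - c)] for c clusters and n observations. The function name and the four one-dimensional points are illustrative assumptions, and CCC is left out because its formula is considerably more involved.

```python
def fit_statistics(points, labels):
    """Return (R-square, pseudo F) for a clustering of numeric points."""
    n, dims = len(points), len(points[0])
    clusters = sorted(set(labels))
    grand_mean = [sum(p[d] for p in points) / n for d in range(dims)]
    total_ss = sum((p[d] - grand_mean[d]) ** 2 for p in points for d in range(dims))
    within_ss = 0.0
    for k in clusters:
        members = [p for p, lab in zip(points, labels) if lab == k]
        mean = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        within_ss += sum((p[d] - mean[d]) ** 2 for p in members for d in range(dims))
    r_square = 1.0 - within_ss / total_ss
    c = len(clusters)
    # Undefined when the clusters fit perfectly (R-square == 1) or when c == 1.
    pseudo_f = (r_square / (c - 1)) / ((1.0 - r_square) / (n - c))
    return r_square, pseudo_f


# Hypothetical one-dimensional data with an obvious two-cluster structure.
data = [(0.0,), (2.0,), (10.0,), (12.0,)]
good_r2, good_f = fit_statistics(data, [0, 0, 1, 1])  # sensible grouping
poor_r2, poor_f = fit_statistics(data, [0, 1, 0, 1])  # grouping that ignores the structure
print(round(good_r2, 4), round(good_f, 2))
print(round(poor_r2, 4), round(poor_f, 2))
```

The sensible grouping gets a much higher R-square and pseudo F than the poor one, which is the sense in which higher values are preferred when comparing runs with different numbers of clusters.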
IDENTIFYING THE CLUSTERS
Cluster means and standard deviations of the variables are displayed as part of the FASTCLUS output. As with the principal components, each
cluster is analysed for its higher and lower coefficients to understand the relation between the principal components and the cluster segments.
Figure 4
The clusters are analysed and interpreted with respect to the loan data
variables. Figure 4 displays the customer segments identified after
the analysis of the coefficient matrix. These are the major segments
of the loan data:
• Credit based – revolving accounts
• Fixed instalment based loan accounts
• Accounts that are mostly past due on credit, with late fees
• Accounts that are highly inquired
• Accounts with more than 75% usage that create many new accounts
Further, PROC UNIVARIATE is executed on the new cluster dataset,
and the output is approximately consistent with the box plots.
This suggests that the segments are almost correct
Figure 5 Boxplot of instalment accounts over all clusters
Figure 6 Boxplot of Percentage greater than 75 over all clusters
SCORING THE NEW DATA
The new data is then scored with the statistics from the old data, and the segments are identified. Scoring the new dataset consists of the following steps:
• The output statistics from PRINCOMP are used to score the new dataset
• The output from STDIZE is used as input to standardize the newly scored dataset
• The output statistics from FASTCLUS are used as the input statistics for the new dataset
Figure 7 displays the frequency distribution of the means across the new and old datasets for comparison. The clusters are
approximately the same, and the segments have been identified correctly.
OLD DATA
NEW DATA
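The three scoring steps listed above can be pictured with the following Python sketch. It is only an approximation of what the saved SAS statistics do: the dictionary layout, the function name `score_row` and all the numbers are assumptions made for illustration, with two variables and two components instead of the 77 variables and 18 components of the analysis.

```python
import math


def score_row(row, pca_stats, stdize_stats, seeds):
    """Assign one new observation to a cluster using statistics saved from the old data."""
    # Step 1: standardize the raw variables and project them onto the saved eigenvectors.
    z = [(x - m) / s for x, m, s in zip(row, pca_stats["mean"], pca_stats["std"])]
    scores = [sum(w * v for w, v in zip(vec, z)) for vec in pca_stats["eigenvectors"]]
    # Step 2: standardize the component scores with the saved location and scale.
    scaled = [
        (c - loc) / sc
        for c, loc, sc in zip(scores, stdize_stats["location"], stdize_stats["scale"])
    ]
    # Step 3: pick the nearest saved cluster seed.
    return min(range(len(seeds)), key=lambda k: math.dist(scaled, seeds[k]))


# Hypothetical statistics saved from the old data.
r = math.sqrt(0.5)
pca_stats = {"mean": [10.0, 100.0], "std": [2.0, 20.0], "eigenvectors": [[r, r], [r, -r]]}
stdize_stats = {"location": [0.0, 0.0], "scale": [1.3, 0.5]}
seeds = [(-1.0, 0.0), (1.0, 0.0)]

high = score_row([14.0, 140.0], pca_stats, stdize_stats, seeds)
low = score_row([6.0, 60.0], pca_stats, stdize_stats, seeds)
print(high, low)
```

The point of the design is that nothing is re-estimated on the new data: the means, eigenvectors, scaling constants and seeds all come from the old run, so old and new cluster frequencies can be compared directly.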
LEARNINGS
Identifying the principal components is complex; clustering them afterwards
gives a much clearer picture
With very little business knowledge, identifying the clusters and verifying the
segments was difficult
Learnt how to write a macro to run the clusters from 3 to 20 and then identify the
best one from the batch
Using UNIVARIATE was a revelation when my segments matched the box
plots, even though I am not sure whether the segments are correct as such.
APPENDIX 1 – EIGENVALUES WHEN CURVE CHANGES
APPENDIX 2 – EIGENVECTORS OF FIRST 10 PRINCIPAL COMPONENTS
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Professional Content Writing's
 
The-Future-is-Now-Information-Technology-Trends.pptx.pdf
The-Future-is-Now-Information-Technology-Trends.pptx.pdfThe-Future-is-Now-Information-Technology-Trends.pptx.pdf
The-Future-is-Now-Information-Technology-Trends.pptx.pdf
winnt04
 
Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2Introduction to Artificial Intelligence_ Lec 2
Introduction to Artificial Intelligence_ Lec 2
Dalal2Ali
 
DATA ANALYST and Techniques in Kochi Explore cutting-edge analytical skills ...
DATA ANALYST  and Techniques in Kochi Explore cutting-edge analytical skills ...DATA ANALYST  and Techniques in Kochi Explore cutting-edge analytical skills ...
DATA ANALYST and Techniques in Kochi Explore cutting-edge analytical skills ...
aacj102006
 
Introduction to Python_for_machine_learning.pdf
Introduction to Python_for_machine_learning.pdfIntroduction to Python_for_machine_learning.pdf
Introduction to Python_for_machine_learning.pdf
goldenflower34
 
Taking a customer journey with process mining
Taking a customer journey with process miningTaking a customer journey with process mining
Taking a customer journey with process mining
Process mining Evangelist
 
Time series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptxTime series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptx
AsmaaMahmoud89
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
Digital Disruption Use Case_Music Industry_for students.pdf
Digital Disruption Use Case_Music Industry_for students.pdfDigital Disruption Use Case_Music Industry_for students.pdf
Digital Disruption Use Case_Music Industry_for students.pdf
ProsenjitMitra9
 

Principal Component Analysis and Clustering

  • 1. Principal Component Analysis and Clustering Professor Daymond 27-Nov-2016
  • 2. UNDERSTANDING BORROWER SEGMENTS
      • Credit based accounts: The majority of the accounts belong to credit-based borrowers, characterised by revolving utilization and by having the most revolving accounts and bankcards.
      • Fixed instalment accounts: These accounts mostly carry fixed instalments such as car loans and student loans. The number of instalment accounts and instalment utilization are the major factors of this segment.
      • Past due accounts: These are borrowers with past-due records who account for most of the late fees on credit and loan amounts. Given their recent history of delinquency, this segment is medium risk.
      • Highly inquired accounts: These are borrowers with many loan inquiries, who exhibit the most credit card purchase behaviour and apply for many loans in an attempt to secure one.
      • Debt collections accounts: These accounts hold the largest number of public records, such as tax liens. Collections money owed and tax liens are the major factors of this segment.
      • High risk delinquent accounts: The highest delinquency, usage exceeding the credit limit and multiple recently opened accounts make this segment high risk.
  • 3. IDENTIFYING THE PRINCIPAL COMPONENTS
    The dataset has N = 27000 observations and 77 variables, so it must be reduced to a smaller set of variables before a workable conclusion can be drawn. Because of multicollinearity, two or more variables can point in nearly the same direction in the n-dimensional space. Each row of the data can be envisioned as a point in a 77-dimensional space; when the data are projected onto orthonormal axes, related characteristics of the data are expected to group together along the principal components. To identify these components, PROC PRINCOMP is run on all variables except the constant ones (recoveries and collection fees), and the eigenvalues of all principal components are plotted.
    The eigenvalue of a principal component is the variance that component explains: the greater the eigenvalue, the more of the variance the component accounts for. The cut-off criteria are therefore that the eigenvalue must be greater than 1 and that the cumulative variance should be at least 75%. The results (Appendix 1) show 18 components with eigenvalues greater than 1, which together contribute approximately 76% of the total variance. The coefficients of each principal component are its eigenvector (Appendix 2), that is, the weights of the linear combination of the inputs; the eigenvector gives the direction of the component's axis and the eigenvalue its length. The scree plot in Figure 1 is almost flat once the eigenvalues fall below 1, implying that the remaining components contribute very little to the variance. Hence a total of 18 principal components capture a significant share of the variance in the data.
    [Figure 1] [Figure 2]
  • 4. INTERPRETING THE PRINCIPAL COMPONENTS
    To interpret the principal components, the matrix of eigenvector coefficients is inspected for the original variables most strongly associated with each component. PRINCOMP standardizes the data (it analyses the correlation matrix by default), so the coefficients have absolute values of at most 1. Coefficients whose absolute value is close to 1, whether positive or negative, indicate original variables that are highly correlated with the component.
    PRINCIPAL COMPONENT 1: Figure 4 shows that the highest coefficients belong to the variables counting the number of accounts (how valuable the customers are in terms of usage), and the lowest to the duration since the most recent account (how credible the customers are). Each of the remaining principal components is analysed in the same way for its highest and lowest coefficients, and the results are tabulated for reference.
    [Figure 3] [Figure 4]
  • 5. IDENTIFYING THE CLUSTERS
    Once the principal components are identified, the next step is to standardize them with PROC STDIZE and feed them to the FASTCLUS procedure, run with MAXCLUSTERS values ranging from 3 to 20. FASTCLUS uses k-means clustering, an iterative approach used here to find clusters of roughly equal size with a reasonable spread. A set of observations is selected as initial seeds; each observation is assigned to its nearest seed to form temporary clusters; each seed is then replaced by the mean of its cluster; and this is repeated until the clusters no longer change. "Complete convergence is satisfied" implies that the final seeds are equal to the cluster means.
    Summary: The cluster summary displays the frequency of observations in each cluster and the root mean square deviation. The next column displays the largest distance from the seed to an observation, which approximates the total spread of the cluster. The last column displays the distance from the centre of the cluster to the centre of the nearest cluster. With MAXCLUSTERS set to 14, six appropriately sized clusters are obtained, at the 35th iteration. Clusters 1, 4, 6, 9, 12 and 14 are the identified clusters, and Cluster 1 is the nearest cluster to all the others.
    Goodness-of-fit metrics: Higher values of the pseudo F statistic are preferred when choosing the number of clusters. R-square is the proportion of variance accounted for by the clusters. Higher values of the Cubic Clustering Criterion (CCC) indicate good clustering; values above 2 or 3 are generally expected. A higher pseudo F statistic and CCC together imply that the clustering solution is good.
  • 6. IDENTIFYING THE CLUSTERS
    FASTCLUS also displays the cluster means and standard deviations of the variables. As with the principal components, each cluster is analysed for its highest and lowest values to understand the relation between the principal components and the cluster segments. The clusters are then interpreted in terms of the loan data variables. Figure 4 displays the customer segments identified after analysing the coefficient matrix. These are the major segments of the loan data:
      • Credit based, revolving accounts
      • Fixed instalment based loan accounts
      • Accounts that are mostly past due on credit and late fees
      • Accounts that are highly inquired
      • Accounts with more than 75% usage that open many new accounts
    PROC UNIVARIATE is then run on the new cluster dataset, and the box plots agree approximately with the segments above, which suggests that the segments are largely correct.
    Figure 5: Boxplot of instalment accounts over all clusters
    Figure 6: Boxplot of percentage greater than 75 over all clusters
  • 7. SCORING THE NEW DATA
    The new data is scored with the statistics saved from the old data, and the segments are identified. Scoring the new dataset consists of the following steps:
      • The output statistics from PRINCOMP are used to score the new dataset.
      • The output from STDIZE is used to standardize the newly scored dataset.
      • The output statistics from FASTCLUS are used as the input statistics for the new dataset.
    Figure 7 displays the frequency distribution of the means across the old and new datasets for comparison. The clusters are approximately the same in both, which indicates that the segments have been identified consistently.
    [Figure 7: OLD DATA, NEW DATA]
  • 8. LEARNINGS
      • Identifying the principal components is complex, and clustering them afterwards gives a much clearer picture.
      • With very little business knowledge, identifying the clusters and verifying the segments was difficult.
      • Learnt how to write a macro that runs the clustering for 3 to 20 clusters and then identifies the best run from the batch.
      • Using UNIVARIATE was a revelation when my segments matched the box plots, even though I am not sure the segments are correct as such.
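
The retention rule described on slide 3 (keep components whose eigenvalue exceeds 1, then check that they reach the cumulative-variance target) can be sketched in a few lines. This is a minimal illustration in plain Python, not the PROC PRINCOMP output from the deck: the function name `components_to_retain` and the six example eigenvalues are made up for the sketch, and only the two thresholds (1 and 75%) come from the slides.

```python
def components_to_retain(eigenvalues, min_eigenvalue=1.0, min_cumulative=0.75):
    """Apply the eigenvalue > 1 rule and report the cumulative variance share.

    Returns (number of components kept, share of total variance they explain,
    whether that share reaches the cumulative target).
    """
    total = sum(eigenvalues)  # equals the number of variables for a correlation matrix
    ordered = sorted(eigenvalues, reverse=True)
    kept = [ev for ev in ordered if ev > min_eigenvalue]
    share = sum(kept) / total
    return len(kept), share, share >= min_cumulative


# Hypothetical eigenvalues for a 6-variable correlation matrix (they sum to 6).
example_eigenvalues = [2.7, 1.5, 1.1, 0.4, 0.2, 0.1]
n_kept, share, meets_target = components_to_retain(example_eigenvalues)
print(n_kept, round(share, 3), meets_target)
```

On the deck's data the same check would be run on the 77 eigenvalues reported by PRINCOMP, where the slides report 18 components and roughly 76% cumulative variance.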
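
The k-means loop from slide 5 and the "run 3 to 20 clusters and pick the best" macro from slide 8 can be illustrated with a small standard-library sketch. It is only an approximation of the idea under stated assumptions: the seeding below is a simple farthest-first rule, not the seed-selection logic of FASTCLUS; the toy points are randomly generated rather than loan data; only the pseudo F statistic is computed (no CCC); and the sweep covers 2 to 6 clusters to keep the example short. All function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, k, max_iter=100):
    """Plain k-means: assign to nearest seed, replace seeds by cluster means, repeat."""
    # Assumption: farthest-first seeding (deterministic), not the FASTCLUS rule.
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(sq_dist(p, s) for s in seeds)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, seeds[j])) for p in points]
        if new_labels == labels:  # no change in clusters: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous seed
                seeds[j] = centroid(members)
    return labels, seeds


def pseudo_f(points, labels, centres):
    """Pseudo F statistic: (between SS / (k - 1)) / (within SS / (n - k))."""
    n, k = len(points), len(centres)
    grand = centroid(points)
    within = sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
    between = sum(sq_dist(centres[lab], grand) for lab in labels)
    return (between / (k - 1)) / (within / (n - k))


# Toy data: three well-separated groups of 20 points each (illustrative only).
rng = random.Random(0)
blob_centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
          for cx, cy in blob_centres for _ in range(20)]

# Sweep the number of clusters and keep the run with the highest pseudo F.
scores = {}
for k in range(2, 7):
    labels, centres = kmeans(points, k)
    scores[k] = pseudo_f(points, labels, centres)
best_k = max(scores, key=scores.get)
print(best_k)
```

Choosing by pseudo F alone is a simplification; the slides also weigh CCC and cluster sizes before settling on a solution.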