Linear Regression – Frequentist vs Bayesian approach

In this post, I will go over how to use different linear regression techniques to build models for predicting a “payment score” for delinquent accounts. A “payment score” represents the amount a delinquent account is expected to pay back once the debt collection process is initiated. We used two different approaches to build linear regression models […]

Enterprise Data Lake – Data organization on HDFS

Organizing data in HDFS: We built an Enterprise Data Lake on Hadoop. The aim was to provide a common data hub where the entire organization could store data, making it easier to share data across teams and business units (something that was not easily possible earlier). There were multiple steps involved in creating an enterprise-level lake: automating data […]

Machine Learning – Working with text data

While building a predictive model using logistic regression (to predict customer churn), we used data points from multiple systems and sources, including customer details, demographic data, customer billing and payment history, customer usage history, current plan and contract details, and interactions with customer support. One of the challenges was how to utilize rich contextual information captured during end […]

Moving from in-house Hadoop cluster to cloud based cluster – Part 2

In this post, I will talk about various options for workflow orchestration, storage, and cluster configuration/setup for AWS EMR. WORKFLOW ORCHESTRATION (Oozie vs AWS Data Pipeline, AWS SQS): There are two ways to classify workflow orchestration. Workflow related to a pre-defined schedule: there are multiple options available, depending on the type of Hadoop cluster. For […]