Written by Nikhil Smotra, July 4, 2018

Building a Scalable Data Processing Platform for Analytics – Part 3

In Part 2 of this four-part blog series, I described the tools and frameworks we used to build the Data Processing Platform. In this post, I will describe, at a very high level, the unified data processing platform architecture, the different components of the platform, and how they interact with each other. Figure-1: Data Processing Architecture. In the last […]

Written by Nikhil Smotra, June 23, 2018 (updated June 11, 2019)

Building a Scalable Data Processing Platform for Analytics – Part 2

In Part 1 of this four-part blog series, I discussed the key factors that influenced our design and architecture for the Data Processing Platform (DPP). In this post, I will talk about the tools and frameworks we used to build DPP. Initially, we used a small number of mission-critical use cases to design and build a platform […]

Written by Nikhil Smotra, May 31, 2018 (updated July 11, 2018)

Building a Scalable Data Processing Platform for Analytics – Part 1

In this series of four blog posts, I will discuss our journey towards building an enterprise Data Processing Platform (DPP). We use DPP as a central platform to collect high-velocity, high-volume, and high-variety data from multiple sources – namely, all of the ingredients our very diverse business needs to produce information and percolate insights in […]

Written by Nikhil Smotra, October 18, 2017

Density Estimation – Anomaly Detection using Machine Learning

Today, I want to share my thoughts about how we use unsupervised machine learning to detect anomalies in the data feeds we receive from our clients as part of a B2B exchange. The aim is to flag anomalies in any external data feed for additional review before it is processed through our system workflows/processes. […]

Written by Nikhil Smotra, September 30, 2017

Linear Regression – Frequentist vs Bayesian approach

In this post, I will go over how to use different linear regression techniques to build models for predicting a “payment score” for delinquent accounts. A “payment score” represents the amount a delinquent account is expected to pay back once the debt collection process is initiated. We used two different approaches to build linear regression models […]

Written by Nikhil Smotra, June 19, 2017

Enterprise Data Lake – Data organization on HDFS

Organizing data in HDFS. We built an Enterprise Data Lake on Hadoop. The aim was to provide a common data hub where the entire organization could store data and share it across teams and business units (which was not easily possible earlier). There were multiple steps involved in creating an enterprise-level lake – automating data […]

Written by Nikhil Smotra, March 31, 2017 (updated June 11, 2019)

Machine Learning – Working with text data

While building a predictive model using logistic regression (to predict customer churn), we used data points from multiple systems/sources, including customer details, demographic data, customer billing and payment history, customer usage history, current plan and contract details, interactions with customer support, etc. One of the challenges was how to utilize rich contextual information captured during end […]

Written by Nikhil Smotra, July 22, 2016

Spark Cluster – On-premise deployment or Cloud deployment

We had to set up a Spark cluster for a memory-intensive application that is not network-bandwidth intensive. We had the option of either setting up a Spark cluster in a 42U rack in our data center using PowerEdge R730xd rack servers (already available with our IT) or using AWS. Although bandwidth usage, […]

Written by Nikhil Smotra, June 5, 2016

Transition from traditional architecture to Real Time Processing using Apache Spark

Many traditional applications fail to handle the exponential increase in the volume and velocity of business data being generated today. In this post, I want to share my experiences from our journey to move our commercial offering away from a traditional Java RDBMS architecture, shifting core processing logic to the MapReduce programming model (on […]

Written by Nikhil Smotra, September 20, 2015 (updated July 22, 2016)

Moving from in-house Hadoop cluster to cloud based cluster – Part 2

In this post, I will talk about various options for workflow orchestration, storage, and cluster configuration/setup for AWS EMR. WORKFLOW ORCHESTRATION (Oozie vs AWS Data Pipeline, AWS SQS): There are two ways to classify workflow orchestration. Workflow related to a pre-defined schedule: there are multiple options available depending on the type of Hadoop cluster. For […]