BIG DATA ANALYTICS PROJECTS
For data scientists, managing and aggregating data so that it becomes useful still poses many challenges. These challenges can be overcome by learning from fellow data scientists who completed a data analytics project after pivoting from their original idea and still delivered positive results to their organization.
We wanted to understand what key lessons data scientists at large companies have learned while working on their big data analytics projects. We therefore asked AIM Expert Network (AEN) members to share an insight they have recently gained. In this article, AEN members describe what they planned to achieve with big data analytics, the point at which they realized a pivot was required, and the key lesson they took from the process.
This article will help fellow data scientists avoid common mistakes when executing data analysis operations for their organizations.
Convert Data Pipelines built on Traditional Databases to Big Data Platform
Initial Plan: A few years back, when we started transitioning our data pipelines as part of an enterprise-wide "BI Modernization using Big Data" project, we believed that moving the code (stored procedures, macros, and SQL in Teradata) to the big data tool Apache Hive would be:
- Mostly lift and shift
- Only 10% requiring code refactoring
since Hive was SQL-2 compliant and Teradata was SQL-3 compliant.
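To see why that estimate can be tested early, one option is to scan the existing Teradata SQL for constructs classic (pre-ACID) Hive cannot run directly, such as UPDATE, DELETE, MERGE, or Teradata's QUALIFY clause. The sketch below is a hypothetical estimator, not a tool from the project; the construct list is illustrative, not exhaustive.

```python
import re

# Constructs that classic Hive (before ACID tables) does not support directly.
# Illustrative list only; a real audit would cover far more Teradata features.
UNSUPPORTED = {
    "UPDATE": re.compile(r"\bUPDATE\b", re.IGNORECASE),
    "DELETE": re.compile(r"\bDELETE\b", re.IGNORECASE),
    "MERGE": re.compile(r"\bMERGE\b", re.IGNORECASE),
    "QUALIFY": re.compile(r"\bQUALIFY\b", re.IGNORECASE),
}

def refactor_estimate(statements):
    """Return the fraction of SQL statements that would need refactoring."""
    if not statements:
        return 0.0
    flagged = [s for s in statements
               if any(p.search(s) for p in UNSUPPORTED.values())]
    return len(flagged) / len(statements)
```

Running such a scan over the actual codebase would have surfaced the gap between the assumed 10% and the real refactoring effort before the build phase.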
Pivot point: When we went into the build phase of the project, we realized that Hive queries need a lot of tuning in multiple places: for example, leveraging SMB (Sort Merge Bucket) joins, and converting our slowly changing dimension (SCD1 and SCD2) pipelines, which required refactoring more than 40% of the code because Hive does not support updates.
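Since Hive cannot update rows in place, SCD pipelines are typically rewritten to rebuild the dimension table in full (the pattern behind Hive's INSERT OVERWRITE): changed rows are closed out and new versions appended, rather than updated. The following is a minimal Python sketch of that SCD2 rewrite pattern under assumed, simplified record shapes; the function and field names are hypothetical.

```python
def scd2_rebuild(current_dim, incoming, today):
    """Rebuild an SCD2 dimension without UPDATEs, emitting a full new table.

    current_dim: list of dicts with keys id, attr, start, end (end=None if open)
    incoming:    dict mapping id -> latest attribute value
    today:       effective date for any version changes
    """
    rebuilt = []
    seen = set()
    for row in current_dim:
        changed = (row["end"] is None
                   and row["id"] in incoming
                   and row["attr"] != incoming[row["id"]])
        if changed:
            # Close the old version and open a new one, instead of updating.
            rebuilt.append({**row, "end": today})
            rebuilt.append({"id": row["id"], "attr": incoming[row["id"]],
                            "start": today, "end": None})
        else:
            rebuilt.append(row)
        seen.add(row["id"])
    # Brand-new ids become fresh open rows.
    for id_, attr in incoming.items():
        if id_ not in seen:
            rebuilt.append({"id": id_, "attr": attr,
                            "start": today, "end": None})
    return rebuilt
```

In Hive, the same idea is expressed as a join between the current dimension and the incoming batch, with the result written back via INSERT OVERWRITE, which is a structural rewrite of the Teradata UPDATE-based pipeline rather than a lift and shift.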