Overview of Modernizing Analytics Infrastructure at Ebates
Ebates, under the leadership of Mark Stange-Tregear, embarked on a journey to modernize its analytics infrastructure by transitioning from on-premise setups to cloud-based solutions. Through the adoption of guiding principles like minimizing ETL steps and siloing, consolidating business logic, and focusing on stability and efficiency, Ebates successfully navigated challenges with its old data warehouse, implemented cloud data structures, and improved reporting capabilities. The timeline showcases the evolution from system diagnosis to the deployment of core ETL and reporting tools, while emphasizing innovation and adaptability.
Presentation Transcript
FEET ON THE GROUND, HEAD IN THE CLOUDS Mark Stange-Tregear
AGENDA
Introductions
Ebates Timeline
The On-Premise Setup
Why the Cloud, Why Now?
The Future?
INTRODUCTIONS
MARK STANGE-TREGEAR
VP of Analytics, Ebates
10+ years experience in BI and Analytics space
Leads analytics and data science team
Product management of data warehouse
Deploy/build BI and analytics software
Training and empowering business users
EBATES
Leading cashback membership site
10s of millions of members worldwide
Partnered with 2,000+ merchants
~5% of all online buying in US done through Ebates
Acquired by Rakuten in 2015 for $1B
$1B in total cashback paid
WHERE THE JOURNEY STARTED
The Old DW -> What We Needed
Missed SLAs -> Fast ETL and fast reporting to hit SLAs
Inability to scale -> Horizontally scalable to keep up with growth
Inability to incorporate new, often non-relational, data sets -> Ability to incorporate different types of data without pain
Increasing risk of disconnected data silos -> Comprehensive enterprise data warehouse
Struggling to query granular data -> Granular support as needed
GUIDING PRINCIPLES
1. MINIMIZE ETL STEPS & SILOING
Less to break
Less to keep in sync
Less data to blend, merge and audit
2. CONSOLIDATE BUSINESS LOGIC
Less to explain
One question, one answer
Build trust and confidence
TIMELINE (2014 to 2019)
Milestones shown on the slide, in the order extracted: Problem diagnosis + solution brainstorming; POC over & Big Data decision; Phase 2 cluster bought; Phase 3 cluster bought; POC Cloud; Data structure defined & coding frenzy; Cloud Live; Focus on stability, increased functionality and efficiency; Legacy Systems Gone; Drag and drop reporting; POC cluster bought and deployed; Replica Cluster; Phase 1 cluster bought; Core ETL and core reporting deployed (Tableau Server)
2017 SYSTEMS DIAGRAM
Diagram labels: Cloudera; TABLEAU SERVER; @SCALE; TABLEAU DESKTOP; PROD DBs; EVENT FEEDS; FLAT FILE; Sqoop, Flume & Kafka; DATA LAKE & WAREHOUSE; HOMEGROWN TOOLS & FEEDS; APIs; SQL, mostly Impala; SQL; Spark with Scala/Python
~1000 tables
~1.5MM queries a month
Everything auditable back to lake
GROWING PAINS
On-premise hardware, same disks, same memory/CPUs
Additional functionality causes strain @ ~1MM queries/month
Custom SQL and even Tableau unpredictable in timing and workload
Result: slowdowns, failures, spills
2018 SYSTEMS (PRE-CLOUD)
Diagram labels: TABLEAU SERVER; @SCALE; TABLEAU DESKTOP; PROD DBs; EVENT FEEDS; FLAT FILE; DATA LAKE & WAREHOUSE; HOMEGROWN TOOLS & FEEDS; APIs; SQL; distcp; REPLICA LAKE & WAREHOUSE
Direct connections still possible
Protect lake and core business logic
OPTIMIZE FOR USE CASES
~0.5MM queries a month against @scale
Activity ramps down (doesn't stop) overnight
Cloud's promise of scale up/down attractive
Small numbers of tables to replicate (can truncate and load daily)
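The "truncate and load daily" idea on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not Ebates' actual pipeline: two in-memory SQLite databases stand in for the source warehouse and the cloud copy, and the table `daily_summary`, its sample rows, and the function `truncate_and_load` are all hypothetical.

```python
import sqlite3


def truncate_and_load(source, target, table):
    """Replace every row of `table` in `target` with the current rows from `source`."""
    # The table name is interpolated directly, so it must be a trusted constant.
    rows = source.execute(f"SELECT * FROM {table}").fetchall()
    with target:  # single transaction: commits on success, rolls back on error
        # SQLite has no TRUNCATE; DELETE without WHERE plays that role here.
        target.execute(f"DELETE FROM {table}")
        if rows:
            placeholders = ", ".join("?" * len(rows[0]))
            target.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", rows
            )
    return len(rows)


# Stand-in databases (assumption: the real systems would be the lake and the cloud copy).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE daily_summary (day TEXT, orders INTEGER)")

# Made-up sample data, for illustration only.
source.executemany(
    "INSERT INTO daily_summary VALUES (?, ?)",
    [("day-1", 120), ("day-2", 135)],
)
source.commit()
target.execute("INSERT INTO daily_summary VALUES ('stale', 0)")
target.commit()

loaded = truncate_and_load(source, target, "daily_summary")
```

A full reload like this avoids change tracking entirely, which is only practical because the set of tables to replicate is small.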
SIDE-BY-SIDE EVALUATION
Candidates: S3 + Redshift/Spectrum; BigQuery; Snowflake
CRITERIA
Speed of transfer
Speed and ease of load
Concurrency and query performance
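One way a "concurrency and query performance" criterion could be measured is to fire a batch of queries through a fixed number of parallel workers and record wall-clock time and per-query latency. The sketch below is an assumption about how such a test might look, not the method used in the talk; `measure_concurrency` and `fake_query` are hypothetical names, and `fake_query` merely sleeps where a real test would call a warehouse client.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_concurrency(run_query, n_queries, n_workers):
    """Run `n_queries` calls of `run_query` across `n_workers` threads and time them."""

    def timed(i):
        start = time.perf_counter()
        run_query(i)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        latencies = list(pool.map(timed, range(n_queries)))
    wall = time.perf_counter() - wall_start

    return {
        "wall_seconds": wall,
        "mean_latency": sum(latencies) / len(latencies),
        "max_latency": max(latencies),
    }


def fake_query(i):
    # Placeholder workload (assumption): a real test would submit SQL here.
    time.sleep(0.01)


result = measure_concurrency(fake_query, n_queries=8, n_workers=4)
```

Running the same harness against each candidate, with the same queries and worker counts, would give comparable numbers for the side-by-side evaluation.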
2018 SYSTEMS
Diagram labels: SNOWFLAKE; S3; TABLEAU SERVER; @SCALE; TABLEAU DESKTOP; PROD DBs; EVENT FEEDS; FLAT FILE; DATA LAKE & WAREHOUSE; HOMEGROWN TOOLS & FEEDS; APIs; SQL; REPLICA LAKE & WAREHOUSE
THE FUTURE?
API layer for direct application interaction in the Cloud
Move lake and warehouse to Cloud? PII/GDPR/Cost/Functionality
AI/ML processing in the Cloud? Write back to lake?
Breaking core principles, creating silos?