During this Big Data Warehousing Meetup, Caserta Concepts and Databricks addressed the number one operational and analytic goal of nearly every organization today: to have a complete view of every customer. Customer Data Integration (CDI) must be implemented to cleanse and match customer identities within and across various data systems. CDI has been a long-standing data engineering challenge, not just one of logic and complexity but also of performance and scalability. The speakers brought together best practice techniques with Apache Spark to achieve complete CDI.

Speakers:
Joe Caserta, President, Caserta Concepts
Kevin Rasmussen, Big Data Engineer, Caserta Concepts
Vida Ha, Lead Solutions Engineer, Databricks

The sessions covered a series of problems that are adequately solved with Apache Spark, as well as those that require additional technologies to implement correctly.

Topics included:
· Building an end-to-end CDI pipeline in Apache Spark
· What works, what doesn’t, and how do we use Spark as we evolve
· Innovation with Spark, including methods for customer matching from statistical patterns, geolocation, and behavior
· Using PySpark and Python’s rich module ecosystem for data cleansing, standardization, and matching
· Using GraphX for matching and scalable clustering
· Analyzing large data files with Spark
· Using Spark for ETL on large datasets
· Applying Machine Learning & Data Science to large datasets
· Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally

The speakers also touched on data governance, on-boarding new data rapidly, and how to balance rapid agility and time to market with critical decision support and customer interaction. They also shared examples of problems that Apache Spark is not optimized for.

For more information on the services offered by Caserta Concepts, visit our website: http://casertaconcepts.com/