The document outlines Renault's big data initiatives from 2014-2016, including:
1. Starting with a big data sandbox in 2014 using an old HPC infrastructure for data exploration.
2. Implementing a DataLab in 2015 with a new HP infrastructure and establishing a first level of industrialization while improving data protection.
3. Creating a big data platform in 2016 to industrialize hosting both proofs of concept and production projects while ensuring data protection.
The document discusses Rocana Search, a system built by Rocana to enable large scale real-time collection, processing, and analysis of event data. It aims to provide higher indexing throughput and better horizontal scaling than general purpose search systems like Solr. Key features include fully parallelized ingest and query, dynamic partitioning of data, and assigning partitions to nodes to maximize parallelism and locality. Initial benchmarks show Rocana Search can index over 3 times as many events per second as Solr.
This document discusses enterprise-grade big data solutions from HPE. It outlines HPE's reference architecture for big data workloads including components like data lakes, data warehouses, archival storage, event processing, and in-memory analytics. It also discusses HPE's investments in Hortonworks and collaboration to optimize Hadoop for performance. The document promotes attending an HPE session at the Hadoop Summit on modernizing data warehouses and visiting the HPE booth for demos and a trivia game.
The document discusses accelerating Apache Hadoop through high-performance networking and I/O technologies. It describes how technologies like InfiniBand, RoCE, SSDs, and NVMe can benefit big data applications by alleviating bottlenecks. It outlines projects from the High-Performance Big Data project that implement RDMA for Hadoop, Spark, HBase and Memcached to improve performance. Evaluation results demonstrate significant acceleration of HDFS, MapReduce, and other workloads through the high-performance designs.
The document discusses managing Hadoop, HBase and Storm clusters at Yahoo scale. It describes Yahoo's grid infrastructure which includes 3 data centers with over 45k nodes across 18 Hadoop clusters, 9 HBase clusters and 13 Storm clusters. It then provides details on the rolling upgrade processes for HDFS, YARN, HBase and Storm which involve minimizing downtime, upgrading components independently and verifying upgrades. CI/CD processes are used to automate software deployment and upgrades.
This document summarizes a presentation about analyzing small files in HDFS clusters. It outlines the problems small files can cause, such as inefficient data access and slower jobs. It then describes the architecture of the small files analysis solution, which processes the HDFS fsimage to attribute and aggregate file information. This information is stored and used to power dashboards showing metrics like small file counts and distributions over time. Future work includes improving performance and developing a customizable compaction utility.
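A rough sketch of the fsimage-processing step described above (an assumption about how such an analysis could be done, not the speakers' code): the fsimage is first exported with the HDFS offline image viewer, and the resulting delimited dump is aggregated into small-file counts per top-level directory. The 16 MiB threshold and the column names are assumptions based on the Delimited output format.

```python
#!/usr/bin/env python3
"""Aggregate small-file counts from an HDFS fsimage dump.

Assumes the fsimage was first exported with the offline image viewer, e.g.:
    hdfs oiv -p Delimited -i fsimage_000... -o fsimage.tsv
Column names ("Path", "FileSize", "BlocksCount") are looked up from the header
row rather than hardcoded by position.
"""
import csv
from collections import Counter

SMALL_FILE_THRESHOLD = 16 * 1024 * 1024  # 16 MiB; choose relative to the block size

def small_files_by_top_dir(dump_path: str) -> Counter:
    counts = Counter()
    with open(dump_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            size = int(row["FileSize"])
            blocks = int(row["BlocksCount"])
            # Directories (and empty files) report zero blocks; skip them,
            # along with files that are already "large enough".
            if blocks == 0 or size >= SMALL_FILE_THRESHOLD:
                continue
            top = "/" + row["Path"].strip("/").split("/", 1)[0]
            counts[top] += 1
    return counts

if __name__ == "__main__":
    for directory, n in small_files_by_top_dir("fsimage.tsv").most_common(20):
        print(f"{directory}\t{n} small files")
```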
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure or Amazon S3, and on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems for business continuity planning (BCP) and/or supporting hybrid cloud architectures to achieve the required business goals for durability, performance, and coordination.
To resolve this complexity, HDFS-9806 added a PROVIDED storage tier to mount external storage systems in the HDFS NameNode. Building on this functionality, remote namespaces can now be synchronized with HDFS, enabling asynchronous writes to the remote storage and transparent, synchronous reads of remotely stored file data by local applications. In this talk, which corresponds to the work in progress under HDFS-12090, we will present how the Hadoop admin can manage storage tiering between clusters and how that is handled inside HDFS through the snapshotting mechanism and by asynchronously satisfying the storage policy.
Speakers
Chris Douglas, Microsoft, Principal Research Software Engineer
Thomas Demoor, Western Digital, Object Storage Architect
This document discusses hybrid analytics and the movement of data between on-premises and cloud environments. It notes that IT infrastructures now require both traditional and cloud-native approaches. Moving forward, hybrid analytics using both on-premises and cloud resources will become more common. Specifically, automatically moving data between on-premises storage and the cloud based on policies will help break down data silos and enable greater analysis. The document also explores using Dell EMC's OneFS software to enable this type of policy-based data movement to and from the cloud.
Storage Requirements and Options for Running Spark on Kubernetes (DataWorks Summit)
In a world of serverless computing, users tend to be frugal about spending on compute, storage, and other resources, and paying for resources that aren't in use becomes a significant concern. Offering Spark as a service in the cloud presents unique challenges, and running Spark on Kubernetes raises many of them, especially around storage and persistence. Spark workloads have distinctive storage requirements for intermediate data, long-term persistence, and shared file systems, and those requirements become even tighter when the same platform must be offered as an enterprise service that satisfies GDPR and other compliance regimes such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially scenarios related to persistence.
It will help people using Kubernetes or the Docker runtime in production understand the storage options available, which of them are better suited to running Spark workloads on Kubernetes, and what more can be done; a configuration sketch follows below.
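As one concrete illustration of the storage configuration involved (my own sketch, not material from the talk), the snippet below starts a PySpark session on Kubernetes with a persistent volume claim mounted into every executor for scratch and shuffle data; the cluster endpoint, container image, PVC name, and mount path are placeholder assumptions, and the `spark.kubernetes.executor.volumes.*` settings rely on Spark's built-in Kubernetes support.

```python
from pyspark.sql import SparkSession

# Illustrative values only: the master URL, image, and PVC names are placeholders.
spark = (
    SparkSession.builder
    .appName("spark-on-k8s-storage-sketch")
    .master("k8s://https://kubernetes.default.svc:443")
    .config("spark.kubernetes.container.image", "example.registry/spark-py:latest")
    # Mount a PersistentVolumeClaim into each executor for scratch/shuffle data.
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.scratch.options.claimName",
            "spark-scratch-pvc")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.scratch.mount.path", "/scratch")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.scratch.mount.readOnly", "false")
    # Point intermediate (shuffle/spill) data at the mounted volume.
    .config("spark.local.dir", "/scratch")
    .getOrCreate()
)
```

Long-lived results would still be written to a shared filesystem or object store rather than to pod-local volumes, which is exactly where the persistence questions raised in the talk come in.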
This document discusses streaming data ingestion and processing options. It provides an overview of common streaming architectures including Kafka as an ingestion hub and various streaming engines. Spark Streaming is highlighted as a popular and full-featured option for processing streaming data due to its support for SQL, machine learning, and ease of transition from batch workflows. The document also briefly profiles StreamSets Data Collector as a higher-level tool for building streaming data pipelines.
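For readers who want a feel for the Kafka-to-Spark pattern the summary describes, here is a minimal sketch using Spark Structured Streaming (the newer API; the abstract itself centers on Spark Streaming). Broker addresses and the topic name are placeholders, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("kafka-ingest-sketch").getOrCreate()

# Read events from Kafka as an unbounded streaming DataFrame.
events = (
    spark.readStream
    .format("kafka")                                      # needs the spark-sql-kafka package
    .option("kafka.bootstrap.servers", "broker1:9092")    # placeholder broker list
    .option("subscribe", "events")                         # placeholder topic
    .load()
    .selectExpr("CAST(value AS STRING) AS body", "timestamp")
)

# A simple per-minute event count, written to the console for illustration.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```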
The document provides an overview of the Hadoop platform at Yahoo over the past year. It discusses the evolution of the platform infrastructure and metrics including growth in storage from 12PB to 65PB and compute capacity from 23TB to 240TB. It highlights new technologies added to the platform like CaffeOnSpark for distributed deep learning, Apache Storm for streaming analytics, and data sketches algorithms. It also discusses enhancements to existing technologies like HBase for transactions with Omid and improvements to Oozie for data pipelines. The document aims to provide insights on how the Hadoop platform at Yahoo has scaled to support growing analytics needs through consolidation, new services, and ease of use features.
Hadoop Infrastructure @Uber Past, Present and Future (DataWorks Summit)
Uber's mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. Within Uber's data infrastructure, Hadoop plays a central role. We want to share the journey of Hadoop at Uber and our plans for scaling to support billions of trips. We will discuss Uber's most distinctive use cases and how the Hadoop ecosystem we built helped us along the way, how we scaled from 10 to 2,000 nodes, and how we plan to scale to tens of thousands of nodes in the future. We will cover our mistakes, lessons learned, and wins, and how we process billions of events per day. We will also discuss unique challenges and real-world use cases, and how we will co-locate Uber's service architecture with batch workloads (e.g., data pipelines, machine learning, and analytical workloads). Uber has made many improvements to the Hadoop ecosystem and has solved some of its problems in ways that have not been tried before. This presentation should help the audience use our experience as an example and encourage them to enhance the ecosystem, growing the community around these projects and benefiting the big data space as a whole. The intended audience is anyone working on big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes; the talk will help them understand the Hadoop ecosystem and use it efficiently, and will introduce some of the technologies the Uber team is building in the big data space.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity Scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up new directions for Hadoop when metadata is available for tinkering in a mature relational database.
It’s 2017, and big data challenges are as real as they get. Our customers have petabytes of data living in elastic and scalable commodity storage systems such as Azure Data Lake Store and Azure Blob storage.
One of the central questions today is finding insights from data in these storage systems in an interactive manner, at a fraction of the cost.
Interactive Query leverages Hive on LLAP in Apache Hive 2.1 and brings interactivity to your complex, data-warehouse-style queries on large datasets stored on commodity cloud storage.
In this session, you will learn how technologies such as Low Latency Analytical Processing (LLAP) and Hive 2.x make it possible to analyze petabytes of data with sub-second latency over common file formats such as CSV and JSON, without converting them to columnar file formats like ORC or Parquet. We will go deep into LLAP's performance and architecture benefits and how it compares with Spark and Presto in Azure HDInsight. We will also look at how business analysts can use familiar tools such as Microsoft Excel and Power BI to query their data lake interactively, without moving data outside the data lake.
Speaker
Ashish Thapliyal, Principal Program Manager, Microsoft Corp
This document discusses securing Spark applications. It covers encryption to protect data in transit and at rest, authentication using Kerberos to identify users, and authorization for access control through tools like Sentry and a proposed RecordService. While Spark can be secured today by leveraging Hadoop security, continued work is needed for easier encryption, improved Kerberos support for long-running jobs, and row/column-level authorization beyond file permissions.
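As a small, hedged illustration of the encryption settings mentioned above, the sketch below enables Spark's RPC authentication, over-the-wire encryption, and local I/O encryption; the property names are Spark's documented configuration keys, while the application name and the keytab path in the comment are placeholders.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("secured-spark-sketch")
    # Shared-secret authentication of Spark's internal RPC connections.
    .set("spark.authenticate", "true")
    # Encrypt RPC traffic between driver and executors (AES-based, Spark 2.2+).
    .set("spark.network.crypto.enabled", "true")
    # Encrypt shuffle and spill files written to local disk.
    .set("spark.io.encryption.enabled", "true")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Kerberos identity for long-running jobs is normally supplied at submit time, e.g.:
#   spark-submit --principal analyst@EXAMPLE.COM --keytab /path/analyst.keytab ...
```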
This document summarizes improvements made to HDFS to optimize performance, stabilize operations, and improve supportability. Key areas discussed include logging enhancements, metrics and tools for troubleshooting, load management through RPC improvements, and changes to reduce garbage collection overhead and improve liveness detection. Specific optimizations covered range from code changes to reduce logging verbosity to adding batch processing of block reports.
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results (DataWorks Summit)
Apache Spark is increasingly adopted as an alternative processing framework to MapReduce, due to its ability to speed up batch, interactive, and streaming analytics. Spark enables new analytics use cases like machine learning and graph analysis with its rich and easy-to-use programming libraries, and it offers the flexibility to run analytics on data stored in Hadoop, in object stores, and in traditional databases. This makes Spark an ideal platform for accelerating cross-platform analytics on-premises and in the cloud. Building on the success of the Spark 1.x releases, Spark 2.x delivers major improvements in the areas of APIs, performance, and Structured Streaming. In this paper, we will cover a high-level view of the Apache Spark framework and then focus on what we consider the most important improvements in Apache Spark 2.x. We will then share the results of a real-world benchmark effort, detail the Spark and environment configuration changes made in our lab, discuss the benchmark results, and provide a reference architecture example for those interested in taking Spark 2.x for their own test drive. This presentation stresses the value of refreshing Spark 1 deployments with Spark 2, as performance testing shows a 2.3x improvement with SparkSQL workloads similar to TPC Benchmark™ DS (TPC-DS). MARK LOCHBIHLER, Principal Architect, Hortonworks, and VIPLAVA MADASU, Big Data Systems Engineer, Hewlett Packard Enterprise
An agile data fabric powered by Brocade networking solutions provides benefits for business intelligence initiatives. It allows data and applications to be deployed flexibly across distributed infrastructure in a way that is automated, intelligent, and optimized for performance. Case studies demonstrate how Brocade fabrics have enabled multi-tenant data lakes and analytics platforms by integrating diverse storage, computing, and networking resources.
There is increased interest in using Kubernetes, the open-source container orchestration system for modern, stateful Big Data analytics workloads. The promised land is a unified platform that can handle cloud native stateless and stateful Big Data applications. However, stateful, multi-service Big Data cluster orchestration brings unique challenges. This session will delve into the technical gaps and considerations for Big Data on Kubernetes.
Containers offer significant value to businesses, including increased developer agility and the ability to move applications between on-premises servers, cloud instances, and data centers. Organizations have embarked on this journey to containerization with an emphasis on stateless workloads. Stateless applications are usually microservices or containerized applications that don't "store" data. Web services (such as front-end UIs and simple, content-centric experiences) are often great candidates for stateless applications, since HTTP is stateless by nature; there is no dependency on local container storage for a stateless workload.
Stateful applications, on the other hand, are services that require backing storage, and keeping state is critical to running the service. Hadoop and Spark, and to a lesser extent NoSQL and relational platforms such as Cassandra, MongoDB, Postgres, and MySQL, are great examples. They require some form of persistent storage that will survive service restarts...
Speakers
Anant Chintamaneni, VP Products, BlueData
Nanda Vijaydev, Director Solutions, BlueData
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe Satellite Cluster project which allowed us to double the objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse (DataWorks Summit)
As Apache Hadoop clusters become central to an organization’s operations, they have clusters in more than one data center. Historically, this has been largely driven by requirements of business continuity planning or geo localization. It has also recently been gaining a lot of interest from a hybrid cloud perspective, i.e. wherein people are trying to augment their traditional on-prem setup with cloud-based additions as well. A robust replication solution is a fundamental requirement in such cases.
The Apache Hive community has been working on new capabilities for efficient and fault-tolerant replication of data in the Hive warehouse. In this talk, we will discuss these new capabilities, how they work, what replication at Hive scale looks like, what challenges it poses, and what we have done to solve those issues. We will also focus on what to be aware of in your use case to keep replication optimal.
Speaker
Sankar Hariappan, Senior Software Engineer, Hortonworks
The NameNode was experiencing high load and instability after being restarted. Graphs showed unexplained high load between checkpoints on the NameNode. DataNode logs showed repeated 60000 ms timeouts in communication with the NameNode. Thread dumps revealed NameNode server handlers waiting on the same lock, indicating a bottleneck. Source code analysis pointed to repeated block reports from DataNodes to the NameNode as the likely cause of the high load.
Your transformation through Innovation & industrialization - with a Business ... (Capgemini)
The document discusses setting up a Business Information Service Center (BISC) to improve an organization's business intelligence capabilities. A BISC industrializes BI development and support through an offshore factory model to deliver information more quickly, reliably and cost effectively. It involves reviewing the existing BI environment, setting up the BISC infrastructure both onsite and offshore, engaging business and IT stakeholders, and driving continuous innovation. The BISC approach aims to transform BI delivery through industrialization and rightshoring.
Big Data Means Big Business
Big data has the potential to disrupt existing businesses and help create new ones by extracting useful information from huge volumes of structured and unstructured data. To realize this promise, organizations need cheap storage, faster processing, smarter software, and access to larger and more diverse data sets. Big data can unlock new business value by enabling better-informed decisions, discovering hidden insights, and automating business processes. While the technology is available, organizations must also invest in skills, cultural change, and using information as a corporate asset to fully leverage big data.
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo (Databricks)
This document discusses Databricks' goal of democratizing access to Spark. It introduces the Databricks cloud platform, which provides a hosted model for Spark with rapid releases, dynamic scaling, and security controls. The platform is used for just-in-time data warehousing, advanced analytics, and real-time use cases. Many companies struggle with the steep learning curve and costs of big data projects. To empower more developers, Databricks trained thousands on Spark and launched online courses with over 100,000 students. They are announcing the Databricks Community Edition, a free version of their platform, to further democratize access to Spark through mini clusters, notebooks, APIs, and continuous delivery of learning content.
The document discusses Marketo's migration of their SAAS business analytics platform to Hadoop. It describes their requirements of near real-time processing of 1 billion activities per customer per day at scale. They conducted a technology selection process between various Hadoop components and chose HBase, Kafka and Spark Streaming. The implementation involved building expertise, designing and building their first cluster, implementing security including Kerberos, validation through passive testing, deploying the new system through a migration, and ongoing monitoring, patching and upgrading of the new platform. Challenges included managing expertise retention, Zookeeper performance on VMs, Kerberos integration, and capacity planning for the shared Hadoop cluster.
This document discusses Apache Spark, an open-source cluster computing framework. It describes how Spark allows for faster iterative algorithms and interactive data mining by keeping working sets in memory. The document also provides an overview of Spark's ease of use in Scala and Python, built-in modules for SQL, streaming, machine learning, and graph processing, and compares Spark's machine learning library MLlib to other frameworks.
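A tiny sketch of the "working sets in memory" point (my own illustration, with a placeholder input path): the dataset is cached on the first action, so subsequent interactive queries reuse the in-memory copy instead of re-reading storage.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

# Placeholder path; load once and keep the working set in memory.
logs = spark.read.json("hdfs:///data/logs/*.json").cache()

# The first action materializes the dataset and populates the cache...
n_errors = logs.filter(col("level") == "ERROR").count()
# ...subsequent actions on `logs` reuse the cached copy instead of re-reading HDFS.
n_slow = logs.filter(col("latency_ms") > 1000).count()

print(n_errors, n_slow)
```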
Migrating Clinical Data in Various Formats to a Clinical Data Management System (Perficient, Inc.)
This document summarizes the process of migrating clinical data from various legacy formats, including SAS datasets, Word/PDF listings, Excel files, and scans, into a clinical data management system (CDMS). A team was assembled with roles for building the studies, loading the data, quality control, and data entry. The overall process involved building the studies, parsing the source data into a loadable format, quality checks, and loading the data. Lessons learned focused on standardizing the data format versus loading "as is" and improving communication and consistency.
Building a modern Application with DataFrames (Databricks)
The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
Building a modern Application with DataFrames (Spark Summit)
Jump Start into Apache® Spark™ and Databricks (Databricks)
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016.
---
Spark is a fast, easy to use, and unified engine that allows you to solve many Data Sciences and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 (Sameer Farooqui)
Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. The document discusses Spark's architecture including its core abstraction of resilient distributed datasets (RDDs), and demos Spark's capabilities for streaming, SQL, machine learning and graph processing on large clusters.
Scalable And Incremental Data Profiling With Spark (Jen Aman)
This document discusses how Trifacta uses Spark to enable scalable and incremental data profiling. It describes challenges in profiling large datasets, such as performance and generating flexible jobs. Trifacta addresses these by building a Spark profiling job server that takes profiling specifications as JSON, runs jobs on Spark, and outputs results to HDFS. This pay-as-you-go approach allows profiling to scale to large datasets and different user needs in a flexible manner.
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Big Data: beyond the proof of concept and experimentation (Matinale busi...) (Jean-Michel Franco)
Delivering on the promises of Big Data with Hadoop, self-service, data lakes, and machine learning. Which use cases, what lessons learned, which platform?
An introduction to Spark that starts from big data concepts and walks through the emergence of big data analytics platforms (Hadoop) and the background that led to Spark.
The Spark material covers the RDD concept and the Spark SQL library in some detail (with brief explanations of the Tungsten engine and the Catalyst optimizer).
It ends with a short hands-on exercise covering installation and interactive analysis.
The original PPT is publicly available; feel free to adapt and reuse it anywhere as needed, as long as you credit the source.
Figures and references taken from other slides or blogs are credited in small print, but the sources of some materials found while preparing the early version of this PPT remain unclear. If you know the source of any of them, please let me know and I will update the credits.
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0 (Databricks)
The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi... (Databricks)
This document summarizes key aspects of structuring computation and data in Apache Spark using SQL, DataFrames, and Datasets. It discusses how structuring computation and data through these APIs enables optimizations like predicate pushdown and efficient joins. It also describes how data is encoded efficiently in Spark's internal format and how encoders translate between domain objects and Spark's internal representations. Finally, it introduces structured streaming as a high-level streaming API built on top of Spark SQL that allows running the same queries continuously on streaming data.
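To make the predicate pushdown point concrete, here is a small illustration (assuming a Parquet source with country and year columns; the path and column names are placeholders): the filter written against the DataFrame appears as PushedFilters on the scan in the physical plan.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

# Placeholder Parquet dataset of events.
events = spark.read.parquet("hdfs:///warehouse/events")

recent_uk = events.filter((col("country") == "UK") & (col("year") == 2016))

# The physical plan lists PushedFilters on the Parquet scan, i.e. Catalyst pushes
# the predicate into the data source instead of filtering after loading every row.
recent_uk.explain(True)
```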
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh (IanFurlong4)
For organisations to successfully adopt data mesh, setting up and maintaining infrastructure needs to be easy.
We believe the best way to achieve this is to leverage the learnings from building a "central nervous system", commonly used in modern data-streaming ecosystems. This approach formalises and automates the manual parts of building a data mesh.
This presentation introduces SpecMesh; a methodology and supporting developer toolkit to enable business to build the foundations of their data mesh.
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization (Denodo)
Watch here: https://bit.ly/2NGQD7R
In an era increasingly dominated by advancements in cloud computing, AI and advanced analytics it may come as a shock that many organizations still rely on data architectures built before the turn of the century. But that scenario is rapidly changing with the increasing adoption of real-time data virtualization - a paradigm shift in the approach that organizations take towards accessing, integrating, and provisioning data required to meet business goals.
As data analytics and data-driven intelligence takes centre stage in today’s digital economy, logical data integration across the widest variety of data sources, with proper security and governance structure in place has become mission-critical.
Attend this session to learn:
- How you can meet cloud and data science challenges with data virtualization
- Why data virtualization is increasingly finding enterprise-wide adoption
- How customers are reducing costs and improving ROI with data virtualization
Delivering Data Democratization in the Cloud with Snowflake (Kent Graziano)
This is a brief introduction to Snowflake Cloud Data Platform and our revolutionary architecture. It contains a discussion of some of our unique features along with some real world metrics from our global customer base.
Data Virtualization: Challenges, Use Cases & Benefits (Denodo)
Watch full webinar here: https://bit.ly/3oah4ng
Gartner recently described data virtualization as a cornerstone of data integration architectures.
Discover:
- The benefits of a data virtualization platform
- The growing range of use cases: lakehouse, data science, big data, data services & IoT
- How to build a unified view of your data assets without compromising on performance
- How to build an agile data integration architecture: on-premises, in the cloud, or hybrid
A Key to Real-time Insights in a Post-COVID World (ASEAN) (Denodo)
Watch full webinar here: https://bit.ly/2EpHGyd
Presented at Data Champions, Online Asia 2020
Businesses and individuals around the world are experiencing the impact of a global pandemic. With many workers and potential shoppers still sequestered, COVID-19 is proving to have a momentous impact on the global economy. Regardless of the current situation and post-pandemic era, real-time data becomes even more critical to healthcare practitioners, business owners, government officials, and the public at large where holistic and timely information are important to make quick decisions. It enables doctors to make quick decisions about where to focus the care, business owners to alter production schedules to meet the demand, government agencies to contain the epidemic, and the public to be informed about prevention.
In this on-demand session, you will learn about the capabilities of data virtualization as a modern data integration technique and how can organisations:
- Rapidly unify information from disparate data sources to make accurate decisions and analyse data in real-time
- Build a single engine for security that provides audit and control by geographies
- Accelerate delivery of insights from your advanced analytics project
Presentation architecting virtualized infrastructure for big data (solarisyourep)
The document discusses how virtualization can help simplify big data infrastructure and analytics. Key points include:
1) Virtualization can help simplify big data infrastructure by providing a unified analytics cloud platform that allows different data frameworks and workloads to easily share resources.
2) Hadoop performance on virtualization has been proven with studies showing little performance overhead from virtualization.
3) A unified analytics cloud platform using virtualization can provide benefits like better utilization, faster provisioning of elastic resources, and multi-tenancy for secure isolation of analytics workloads.
Watch full webinar here: https://bit.ly/3mdj9i7
You will often hear that "data is the new gold". In this context, data management is one of the areas that has received the most attention from the software community in recent years. From artificial intelligence and machine learning to new ways to store and process data, the data management landscape is in constant evolution. From the privileged perspective of an enterprise middleware platform, we at Denodo have the advantage of seeing many of these changes happen.
In this webinar, we will discuss the technology trends that will drive the enterprise data strategies in the years to come. Don't miss it if you want to keep yourself informed about how to convert your data to strategic assets in order to complete the data-driven transformation in your company.
Watch this on-demand webinar as we cover:
- The most interesting trends in data management
- How to build a data fabric architecture
- How to manage your data integration strategy in the new hybrid world
- Our predictions on how those trends will change the data management world
- How companies can monetize data through data-as-a-service infrastructure
- The role of voice computing in future data analytics
The Synapse IoT Stack: Technology Trends in IoT and Big Data (InMobi Technology)
This is the presentation from Big Data November Bangalore Meetup 2014.
http://technology.inmobi.com/events/bigdata-meetup
Talk Outline:
- What does THE HIVE provide?
- Goals of Synapse Tech Stack
- THE HIVE Startups
- Demystifying IoT Market
- Synapse Stack for IoT
- Big Data Challenge
- Synapse Lambda Architecture
- Synapse Components
- Synapse Internals
- AKILI – Synapse Machine Learning
Big Data International Keynote Speaker Mark van Rijmenam shared his vision on Hadoop Data Lakes during a Zaloni Webinar. What are the Hadoop Data Lake trends for 2016, what are the data lake challenges and how can organizations benefit from data lakes.
Integration intervention: Get your apps and data up to speed (Kenneth Peeples)
SOA has been the defacto methodology for enterprise application and process integration, because loosely coupled components and composite applications are more agile and efficient. The perfect solution? Not quite.
The data’s always been the problem. The most efficient and agile applications and services can be dragged down by the point-to-point data connections of a traditional data integration stack. Virtualized data services can eliminate the friction and get your applications up to speed.
In this webinar we'll show you how to (replay at http://www.redhat.com/en/about/events/integration-intervention-get-your-apps-and-data-speed):
- Quickly and easily create a virtual data services layer to plug data into your SOA infrastructure for an agile and efficient solution
- Derive more business value from your services.
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio... (Denodo)
Watch full webinar here: https://bit.ly/32TT2Uu
Data virtualization is not just for self-service; it's also a first-class citizen when it comes to modern data platform architectures. Technology has forced many businesses to rethink their delivery models. Startups emerged that leveraged the internet and mobile technology to better meet customer needs (like Amazon and Lyft), disrupted entire categories of business, and grew to dominate their categories.
Schedule a complimentary Data Virtualization Discovery Session with g2o.
Traditional companies are still struggling to meet rising customer expectations. During this webinar with the experts from g2o and Denodo we covered the following:
- How modern data platforms enable businesses to address these new customer expectations
- How you can drive value from your investment in a data platform now
- How you can use data virtualization to enable multi-cloud strategies
Leveraging the strategy insights of g2o and the power of the Denodo platform, companies do not need to undergo the costly removal and replacement of legacy systems to modernize their systems. g2o and Denodo can provide a strategy to create a modern data architecture within a company’s existing infrastructure.
Ron Kasabian - Intel Big Data & Cloud Summit 2013 (IntelAPAC)
The document discusses how big data is driving new investments in tools and services to analyze growing volumes of data from sources such as sensors, social media, and mobile devices. It outlines barriers to big data adoption like complexity and cost. Intel aims to address these barriers through its portfolio of hardware, software, and solutions to enable end-to-end big data analytics from edge devices to the data center.
Architecting virtualized infrastructure for big data presentation (Vlad Ponomarev)
This document discusses architecting virtualized infrastructure for big data. It summarizes that big data is growing exponentially and new frameworks like Hadoop are enabling analysis of large, diverse data sets. Virtualization can simplify and optimize big data platforms by providing a unified analytics cloud that elastically provisions various data systems like Hadoop and SQL clusters on shared hardware infrastructure. This improves utilization and makes big data platforms faster and easier to deploy and manage.
The document discusses Cisco's Hadoop as a service offering on their Intercloud platform. Some key points:
- Cisco provides managed Hadoop, including Cloudera's distribution, on optimized instances with local storage and object storage. This offers a scalable, reliable, and secure environment for Hadoop workloads.
- Use cases discussed include predictive maintenance using IoT data and analyzing customer journeys across multiple channels.
- A pilot test showed Cisco's platform could process over 100 million records from production data across various Hadoop jobs.
- Cisco also discusses their data virtualization product CiscoDV, which can integrate data across on-premises and cloud sources on Cisco and AWS.
Customer migration to Azure SQL database, December 2019 (George Walters)
This is a real-life story of how a software-as-a-service application moved to the cloud, to Azure, over a period of two years. We discuss the migration, business drivers, technology, and how it got done, and we talk through more modern ways to refactor or change code to move into the cloud today.
An overview of datonixOne, a new, evolutionary, and effective data preparation platform.
We introduce a new disruptive technology into the data management space: the Data Scanner.
Using the Data Scanner, data science is more accurate and feasible.
datonixOne is a perfect satellite of any enterprise data hub.
Metadata Lakes for Next-Gen AI/ML - Datastrato (Zilliz)
As data catalogs evolve to meet the growing and new demands of high-velocity, unstructured data, we see them taking a new shape as an emergent and flexible way to activate metadata for multiple uses. This talk discusses modern uses of metadata at the infrastructure level for AI-enablement in RAG pipelines in response to the new demands of the ecosystem. We will also discuss Apache (incubating) Gravitino and its open source-first approach to data cataloging across multi-cloud and geo-distributed architectures.
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016 (StampedeCon)
The document discusses using a data lake approach with EMC Isilon storage to address various business use cases. It describes how the solution provides shared storage for multiple workloads through multi-protocol support, enables data protection and isolation of client data, and allows testing applications across Hadoop distributions through a common platform. Examples are given of how this approach supports an enterprise data hub, data warehouse offloading, data integration, and enrichment services.
Splunk App for Stream - Insights into Your Network Traffic (Georg Knon)
The document discusses the Splunk App for Stream, which enables real-time insights into private, public and hybrid cloud infrastructures by capturing and analyzing critical events from wire data not found in logs or with other collection methods. It provides an overview of the app, what's new, important features, architecture and deployment, customer success examples, and FAQs.
Similar to Big Data Platform Industrialization (20)
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
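A condensed sketch of the kind of Spark ML text pipeline described above (tokenization, feature extraction, then a model); the toy two-row dataset and column names are assumptions for illustration only.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-mining-sketch").getOrCreate()

# Toy training data; in practice this would come from a large corpus in HDFS.
train = spark.createDataFrame(
    [("spark makes large scale ML easy", 1.0),
     ("the weather is nice today", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),   # split text into tokens
    HashingTF(inputCol="words", outputCol="tf"),      # term-frequency features
    IDF(inputCol="tf", outputCol="features"),         # down-weight very common terms
    LogisticRegression(maxIter=10),                   # uses default "features"/"label" columns
])

model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show(truncate=False)
```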
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
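As a minimal example of the linear regression task mentioned above (toy data and column names are illustrative only):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("linreg-sketch").getOrCreate()

# Tiny illustrative dataset: predict y from a single feature x.
df = spark.createDataFrame(
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)], ["x", "y"]
)
train = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

model = LinearRegression(featuresCol="features", labelCol="y").fit(train)
print(model.coefficients, model.intercept)
```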
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations process various types of data in many different formats. Most often this data is free-form; as the number of consumers of this data grows, it becomes imperative that the free-flowing data adhere to a schema. A schema gives data consumers an expectation of the type of data they are getting and shields them from immediate impact if an upstream source changes its format. A uniform schema representation also gives the data pipeline an easy way to integrate with and support the various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling that help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, and so on.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm; a small schema-evolution example follows below.
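To make schema evolution concrete, here is a small illustration (not code from the talk): adding a field with a default value is a backward-compatible change, which is exactly the kind of evolution a schema registry is meant to track and enforce. fastavro is used only as a convenient Avro schema parser; the record and field names are made up.

```python
import fastavro

user_v1 = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
    ],
}

# Version 2 adds a field with a default value, a backward-compatible change:
# readers holding the old schema can still project new records, and readers
# holding the new schema can fill in the default for old records.
user_v2 = {
    "type": "record", "name": "User",
    "fields": user_v1["fields"] + [
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

fastavro.parse_schema(user_v1)
fastavro.parse_schema(user_v2)
print("both versions parse; v2 adds 'country' with a default")
```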
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams are commonly believed to be more restricted and hence less accurate than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
Deep learning is not just hype: it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used to detect anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of big data engines like Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning, a domain that current deep learning research treats as an afterthought. As the demo shows, LSTM networks can learn very complex system behavior, in this case data coming from a physical model simulating bearing vibration. One drawback of deep learning is that a very large labeled training data set is normally required; this is what makes the approach particularly interesting, since we show how unsupervised machine learning can be used in conjunction with deep learning, with no labeled data set necessary. We are able to detect anomalies and predict failing bearings with ten-fold confidence. All examples and code will be made publicly available and open source; only open-source components are used.
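A compact sketch of the unsupervised approach described above, written with Keras rather than the DeepLearning4J stack used in the talk: an LSTM autoencoder is trained only on normal vibration windows, and windows whose reconstruction error exceeds a threshold are flagged as anomalous. The window shape, synthetic data, and percentile threshold are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models

TIMESTEPS, FEATURES = 50, 3          # window length and sensor channels (illustrative)

# Stand-in for windows of normal bearing-vibration readings.
normal_windows = np.random.normal(size=(1000, TIMESTEPS, FEATURES)).astype("float32")

# LSTM autoencoder: compress each window, then try to reconstruct it.
model = models.Sequential([
    layers.Input(shape=(TIMESTEPS, FEATURES)),
    layers.LSTM(32),                          # encoder
    layers.RepeatVector(TIMESTEPS),
    layers.LSTM(32, return_sequences=True),   # decoder
    layers.TimeDistributed(layers.Dense(FEATURES)),
])
model.compile(optimizer="adam", loss="mse")
model.fit(normal_windows, normal_windows, epochs=5, batch_size=64, verbose=0)

def anomaly_scores(windows: np.ndarray) -> np.ndarray:
    """Mean squared reconstruction error per window; higher means more anomalous."""
    recon = model.predict(windows, verbose=0)
    return np.mean((windows - recon) ** 2, axis=(1, 2))

# Flag windows whose reconstruction error is far above what was seen in training.
threshold = np.percentile(anomaly_scores(normal_windows), 99)
print("anomaly threshold:", threshold)
```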
QE automation for large systems is a great step forward in increasing system reliability. In the big data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases and cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, false positives relative to actual defects are more frequent and are generally wasteful.
At Hortonworks, we've designed and implemented an automated log analysis system, Mool, using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into the recommendation engine. The system identifies the root cause of test failures by correlating failing test cases with current and historical error records across multiple components. It works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file or reopen past tickets and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for big data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
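One concrete instance of the "intelligent key design" point: salting row keys with a hash-derived prefix so that monotonically increasing keys (such as timestamps) are spread across regions instead of hot-spotting a single region server. A minimal stand-alone sketch; the bucket count and key layout are illustrative choices.

```python
import hashlib

NUM_BUCKETS = 16  # roughly match the number of pre-split regions

def salted_row_key(device_id: str, timestamp_ms: int) -> bytes:
    """Prefix the natural key with a stable hash bucket to spread writes."""
    bucket = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    # Key layout: <2-digit salt>|<device>|<timestamp>; readers scan all buckets.
    return f"{bucket:02d}|{device_id}|{timestamp_ms:013d}".encode()

print(salted_row_key("sensor-42", 1_465_000_000_000))
```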
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components, and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop-cluster-based solutions.
The Hadoop Distributed File System (HDFS) has evolved from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure where HDFS stores all of an organization's data. This new use case presents a new set of challenges to the original HDFS architecture. One challenge is scaling the storage management of HDFS: the centralized scheme within the NameNode becomes the main bottleneck, limiting the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient at handling large amounts of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
2. BIG DATA INITIATIVES @Renault
• 2014: Big Data Sandbox on the old HPC infrastructure. Site: Innovation LAB. POC: Quality Data Exploration.
• 2015: DataLab implementation on a new HP infrastructure. 1st level of industrialization. Data protection: no.
• 2016: Big Data Platform, industrialized to host both POCs and projects in production. Data protection: yes.
3. Big Data Deployment Production Stakes
• One Hadoop cluster with 24/7, always-on visibility of data (instead of siloing it)
• Many possibilities for crossing data sets
• Simplified operations
• Design simplicity
• Charge-back model
• Scalability and isolation
• Isolation of experimental applications from production
4. Deployment of Quality Projects on the DataLake
[Architecture diagram: big data developers and a client server (with the Hadoop clients installed) reach the platform through a DMZ hosting a Knox gateway and a load balancer; web applications and web services use it for import and GUI access; behind the gateway sit the edge node, name node, search nodes, DataStore master and the data nodes, fed by the external data sources.]
5. Big Data Ecosystem @ Renault
[Architecture diagram: producers (Open Data, Internet of Things and internal systems) feed the platform over batch channels (RDBMS and files via NFS Gateway, Sqoop and Spark SQL) and streaming channels (messages, logs and data flows via Flume, Logstash and Kafka producers into Kafka broker topics, consumed by Spark Streaming); data lands in Elasticsearch, HBase, Hive and HDFS and is processed with Spark SQL and Spark RDDs on YARN + HDFS; consumers include Quality, Sales and Marketing, Supply Chain and Engineering.]
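To make the streaming path of this ecosystem concrete, the sketch below shows a minimal PySpark Streaming job in the Spark 1.x style of the period, reading a Kafka topic and persisting each micro-batch. The broker address, topic name and output path are hypothetical, and the real platform would write to Elasticsearch, HBase or Hive through their respective connectors rather than to plain text files.

```python
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 1.x streaming-kafka module

sc = SparkContext(appName="kafka-to-hdfs-sketch")
ssc = StreamingContext(sc, batchDuration=30)  # 30-second micro-batches

# Hypothetical broker and topic; the deck's Kafka broker hosts per-source topics.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["quality-sensor-events"],
    kafkaParams={"metadata.broker.list": "kafka-broker-1:9092"},
)

# Each record is a (key, value) pair; parse the JSON payload of the value.
events = stream.map(lambda kv: json.loads(kv[1]))

# Persist each micro-batch to HDFS; in the Renault stack the same DStream
# would instead be written to Elasticsearch, HBase or Hive via their connectors.
events.saveAsTextFiles("hdfs:///data/streaming/quality-sensor-events")

ssc.start()
ssc.awaitTermination()
```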
6. Data Ingestion Scenarios: RDBMS
Ingestion tools:
• Sqoop: basic data import keyed on an integer id column or a timestamp (example: SOPHIA). ELT architecture: Extract, Load and Transform.
• Spark SQL: non-standard data import with a specific schema (example: BLMS). Supports an ETL architecture (Extract, Transform and Load) and ingestion directly into Elasticsearch.
Target stores:
• Hive: insert only, nightly batch, SQL queries, high latency.
• Flat files (CSV, Parquet, Avro): insert only, nightly batch, data processing.
• HBase: insert and update, NoSQL key-value schema, low latency, for scaling out relational databases on Hadoop.
• Elasticsearch: interactive analytics (Spotfire), very low latency (SSD disks), near-real-time analytics (watching alerts), text search (log analysis), nested and parent/child relationships.
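As a sketch of the Spark SQL ingestion path described above (shown with the Spark 2.x SparkSession API for readability; the deck's Spark 1.6 would use a SQLContext), the snippet below reads an RDBMS table over JDBC in parallel and lands it as Parquet in a Hive table. The JDBC URL, table, credentials and target names are hypothetical.

```python
from pyspark.sql import SparkSession

# Hypothetical connection details; replace with the real RDBMS endpoint.
JDBC_URL = "jdbc:oracle:thin:@//dbhost:1521/SOPHIA"
TABLE = "QUALITY_EVENTS"

spark = (SparkSession.builder
         .appName("rdbms-ingestion-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Extract: read the source table over JDBC, partitioned on an integer id
# column so the read is parallelized across executors.
df = (spark.read.format("jdbc")
      .option("url", JDBC_URL)
      .option("dbtable", TABLE)
      .option("user", "ingest")
      .option("password", "***")
      .option("partitionColumn", "ID")
      .option("lowerBound", 0)
      .option("upperBound", 10000000)
      .option("numPartitions", 16)
      .load())

# Transform: apply project-specific schema adjustments (the ETL path).
cleaned = df.dropDuplicates(["ID"]).withColumnRenamed("TS", "event_ts")

# Load: append tonight's batch as Parquet into a Hive table.
cleaned.write.mode("append").format("parquet").saveAsTable("quality.events")
```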
7. Interactive SQL Data Analytics
Main objectives:
• Speed up BI queries on big data stores
• Hide the complexity of the big data architecture from the end user and provide only one data connector for Spotfire
• Provide an interactive user experience
• Data virtualization (no need to systematically import RDBMS data in order to cross data sets)
Spark SQL (1.6): the emerging solution for interactive SQL data analytics, via its Data Source API.
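The sketch below illustrates how the Data Source API can expose Hive, Parquet and Elasticsearch behind one Spark SQL session, which is the mechanism behind the "only one data connector" goal; serving such a session through the Spark Thrift JDBC/ODBC server is what a BI tool like Spotfire would then connect to. It assumes the elasticsearch-hadoop connector is on the classpath, uses the Spark 2.x API for readability (Spark 1.6 would use SQLContext and registerTempTable), and all table, index and host names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("single-connector-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hive tables are already visible through the shared metastore.
hive_df = spark.table("quality.events")

# Parquet files exposed through the same SQL layer.
parquet_df = spark.read.parquet("/data/quality/measures")
parquet_df.createOrReplaceTempView("measures")

# Elasticsearch exposed via the elasticsearch-hadoop data source.
es_df = (spark.read.format("org.elasticsearch.spark.sql")
         .option("es.nodes", "es-node-1:9200")
         .option("es.resource", "alerts/event")   # hypothetical index/type
         .load())
es_df.createOrReplaceTempView("alerts")

# One SQL entry point can now join across all stores; exposing this session
# through the Spark Thrift server gives the BI tool a single connector.
joined = spark.sql("""
    SELECT e.vin, m.value, a.severity
    FROM quality.events e
    JOIN measures m ON m.vin = e.vin
    JOIN alerts a   ON a.vin = e.vin
""")
joined.show(10)
```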
8. Only One Data Connector
[Architecture diagram: data processing applications load/insert data into Hive, HBase, Elasticsearch, Parquet files and RDBMS sources; Spark SQL sits on top of all of them through its Data Source API, adds in-memory capability, and exposes a single connector for interactive SQL data analytics.]
9. Big Data In Action
• Multitenancy
• High availability
• Security
• Data governance policy
• Continuous delivery
• Data protection
• Hadoop organization
11. Hadoop Global Data Life Cycle
[Data flow diagram: batch data is loaded or archived and real-time data is streamed from the data sources, with ingestion scheduled and monitored by AITS in production and feeds defined in Falcon; sensitive data is masked by an automated process, following policies pre-defined by the Security Officer and defined in Ranger and Protegrity; the Renault Big Data Platform refines, curates, processes and queries the data and exposes it through big data web services; data access goes through ADA/ARCA subscription requests to datasets, with DIRx data owner validation, AITS group management and user synchronization.]
12. Active / Active Hadoop Platform
[High availability architecture diagram: data nodes for block storage are spread across racks in two server rooms (Rack Room 1 C2 and Rack Room 2 C2), hosting both the HDFS PROD and HDFS POC environments.]
13. Hadoop Security Levels
• OS security
• Authorization
• Perimeter-level security
• Protected zone
• Data protection. Selected solution: tokenization (Protegrity)
14. Tokenization Definition
Selected solution: tokenization
• Tokenization is a form of data protection that converts sensitive data into fake data.
• The real data can be retrieved by authorized users.
• Protegrity: the only available solution for Hadoop (it also supports traditional data systems).
Data protection key requirements:
• Ability to de-identify personally identifiable data
• Restrict data access (financial data, data residency obligations, …)
• Provide central management and control of all data security operations
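For intuition only, here is a toy Python sketch of vault-based, format-preserving tokenization. It is emphatically not the Protegrity product or API; it merely shows the idea the slide describes: sensitive values are replaced by same-shaped fake values, and only an authorized lookup can recover the original.

```python
import hashlib
import hmac
import string

# Toy, illustrative sketch only: NOT the Protegrity product or API.
SECRET_KEY = b"demo-key"   # hypothetical; real deployments keep keys in an HSM/KMS
ALPHABET = string.ascii_uppercase + string.digits
_vault = {}                # token -> clear value, held in a protected store

def tokenize(value: str) -> str:
    """Replace `value` with a deterministic, same-shaped fake value."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha512).digest()
    out = []
    for ch, rnd in zip(value, digest):
        if ch.isdigit():
            out.append(str(rnd % 10))                 # digits stay digits
        elif ch.isalnum():
            out.append(ALPHABET[rnd % len(ALPHABET)]) # letters stay alphanumeric
        else:
            out.append(ch)                            # keep separators: the format is preserved
    token = "".join(out)
    _vault[token] = value
    return token

def detokenize(token: str, authorized: bool) -> str:
    """Authorized roles get the clear value back; everyone else keeps the token."""
    return _vault.get(token, token) if authorized else token

vin_token = tokenize("VF1112C0000724284")
print(vin_token)                                 # fake VIN with the same length and shape
print(detokenize(vin_token, authorized=True))    # clear value for an authorized role
print(detokenize(vin_token, authorized=False))   # still the token for everyone else
```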
15. DATA PROTECTION Example 1: Fine-Grained Protection
Identifier | Clear | Protected | Authorized Role 1 (can see most data in the clear) | Authorized Role 2 (can see limited data in the clear)
Name | Joe Smith | csu wusoj | Joe Smith | Joe Smith
Address | 100 Main Street, Pleasantville, CA | 476 srta coetse, cysieondusbak, HA | 100 Main Street, Pleasantville, CA | "No Access"
Date of Birth | 12/25/1966 | 01/02/1966 | 12/25/1966 | 01/02/1966
VIN | VF1112C0000724284 | AB9875R8467364752 | VF1112C0000724284 | "No Access"
Credit Card Number | 3678 2289 3907 3378 | 3846 2290 3371 3890 | xxxx xxxx xxxx 3378 | 3846 2290 3371 3890
E-mail Address | joe.smith@surferdude.org | eoe.nwuer@beusorpdqo.aku | joe.smith@surferdude.org | joe.smith@surferdude.org
Telephone Number | 760-278-3389 | 998-389-2289 | 760-278-3389 | 998-389-2289
17. The Next Step: HDFS FEDERATION
[Architecture diagram: two federated name services, Name Service 1 serving /store1/ (Name Nodes 1 and 2 in HA) and Name Service 2 serving /store2/ (Name Nodes 3 and 4 in HA), share a common pool of scale-out data nodes (block storage, unreadable data); resource usage is orchestrated by YARN.]
• Users of /store2 see and access only /store2; they cannot access /store1 or Name Nodes 1 and 2.
• Users of /store1 see and access only /store1; they cannot access /store2 or Name Nodes 3 and 4.
• This gives physical isolation, while privileged users can access both /store1 and /store2 for crossing data.
18. BIG DATA PROJECT PROCESS
[Process diagram: a business use case from a DIRx moves from data exploration and POC to a production project, covering data source ingestion, security policies, data processing development, user access authorization, front-end deployment (DIA - Innovation), architecture and development (Admin @BICC), and implementation, deployment and management; internal milestones AT, CPT, MEP and DAT mark the stages.]
19. DATA CHANGE MANAGEMENT
• Publish a real-time dashboard of all data flows.
• For each data source, the producer and the consumers are displayed:
• Producer: the DIRx generating the data
• Consumers: all the big data applications using this data source
• If the producer decides to change the data source schema, it has to send a notification from the dashboard to all consumers.
• By monitoring the logs of data ingestion jobs in real time, failure detection becomes more reactive and efficient.
20. COLLECT LOGS FOR (NRT) CARTO
[Architecture diagram: the Sqoop metastore (storing jobs), Hive metastore (storing SQL schemas), Oozie metastore (storing workflows), Oozie logs, YARN job history logs and the Hue database (storing projects) are collected through the Logstash JDBC plugin into Elasticsearch and visualized in Kibana.]
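For illustration, the snippet below shows roughly what one document of this near-real-time cartography could look like when indexed into Elasticsearch with the Python client (in the deck the rows actually arrive through the Logstash JDBC plugin). The host, index, field names and the older body/doc_type calling style are assumptions matching the Elasticsearch 2.x era of the platform; newer clients use slightly different parameters.

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch  # pip install elasticsearch

# Hypothetical endpoint; one of the platform's Elasticsearch nodes.
es = Elasticsearch(["http://es-node-1:9200"])

# One document per ingestion-job run, built from metastore/log rows.
job_event = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "source": "oozie",                   # sqoop | hive | oozie | yarn | hue
    "workflow": "quality_ingest_daily",  # hypothetical workflow name
    "status": "SUCCEEDED",
    "duration_sec": 412,
}

# Kibana dashboards can then chart failures and durations per data flow
# in near real time, which is what the data change management slide relies on.
es.index(index="carto-jobs-2016.06", doc_type="job_run", body=job_event)
```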