Apache Hadoop 3.0 is coming! As the next major release, it attracts everyone's attention as show case several bleeding-edge technologies and significant features across all components of Apache Hadoop, include: Erasure Coding in HDFS, Multiple Standby NameNodes, YARN Timeline Service v2, JNI-based shuffle in MapReduce, Apache Slider integration and Service Support as First Class Citizen, Hadoop library updates and client-side class path isolation, etc.
In this talk, we will update the status of Hadoop 3 especially the releasing work in community and then go deep diving on new features included in Hadoop 3.0. As a new major release, Hadoop 3 would also include some incompatible changes - we will go through most of these changes and explore its impact to existing Hadoop users and operators. In the last part of this session, we will continue to discuss ongoing efforts in Hadoop 3 age and show the big picture that how big data landscape could be largely influenced by Hadoop 3.
This document summarizes a benchmark study of file formats for Hadoop, including Avro, JSON, ORC, and Parquet. It found that ORC with zlib compression generally performed best for full table scans. However, Avro with Snappy compression worked better for datasets with many shared strings. The document recommends experimenting with the benchmarks, as performance can vary based on data characteristics and use cases like column projections.
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop created by eBay to address limitations of existing tools in handling large volumes of metrics and logs from Hadoop clusters. It provides data activity monitoring, job performance monitoring, and unified monitoring. Eagle detects anomalies using machine learning algorithms and notifies users through alerts. It has been deployed across multiple eBay clusters with over 10,000 nodes and processes hundreds of thousands of events per day.
Today, most any application can be “Dockerized.” However, there are special challenges when deploying a distributed application such as Spark on containers. This session will describe how to overcome these challenges in deploying Spark on Docker containers, with many practical tips and techniques for running Spark in a container environment.
Containers are typically used to run stateless applications on a single host. There are significant real-world enterprise requirements that need to be addressed when running a stateful, distributed application in a secure multi-host container environment.
There are decisions that need to be made concerning which tools and infrastructure to use. There are many choices with respect to container managers, orchestration frameworks, and resource schedulers that are readily available today and some that may be available tomorrow including:]
• Mesos
• Kubernetes
• Docker Swarm
Each has its own strengths and weaknesses; each has unique characteristics that may make it suitable, or unsuitable, for Spark. Understanding these differences is critical to the successful deployment of Spark on Docker containers.
This session will describe the work done by the BlueData engineering team to run Spark inside containers, on a distributed platform, including the evaluation of various orchestration frameworks and lessons learned. You will learn how to apply practical networking and storage techniques to achieve high performance and agility in a distributed, container environment.
Speaker
Thomas Phelan, Chief Architect, Blue Data, Inc
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotiy's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HSFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, discussing some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open-source and supports a pluggable database backend for distributed metadata, although it currently only support MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
This document discusses Apache NiFi and how it was used to create a new composable data flow system for Schlumberger in just 10 man hours. The previous system was very complex, took over 100 man years to create, and was difficult to change. NiFi allows for easy visualization of the data flow, debugging of issues, and rapid creation of new processors. It also enables quick testing of data flows using curated test data sets and live data in Docker containers. Next steps discussed include further exploring use cases for rig data ingestion with NiFi to provide data provenance and understand the chain of custody of data as it moves through the system.
Security is one of fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups via different ways, it is difficult to guarantee data governance in a consistent way. In this talk, we focus on SQL users and talk about how to provide row/column-level access controls with common access control rules throughout the whole cluster with various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6 and Apache Hive 2.1. If some of rules are changed, all engines are controlled consistently in near real-time. Technically, we enables Spark Thrift Server to work with an identify given by JDBC connection and take advantage of Hive LLAP daemon as a shared and secured processing engine. We demonstrate row-level filtering, column-level filtering and various column maskings in Apache Spark with Apache Ranger. We use Apache Ranger as a single point of security control center.
Sanjay Radia presents on evolving HDFS to support a generalized storage subsystem. HDFS currently scales well to large clusters and storage sizes but faces challenges with small files and blocks. The solution is to (1) only keep part of the namespace in memory to scale beyond memory limits and (2) use block containers of 2-16GB to reduce block metadata and improve scaling. This will generalize the storage layer to support containers for multiple use cases beyond HDFS blocks.
Many Organizations are currently processing various types of data and in different formats. Most often this data will be in free form, As the consumers of this data growing it’s imperative that this free-flowing data needs to adhere to a schema. It will help data consumers to have an expectation of about the type of data they are getting and also they will be able to avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the Data Pipeline a really easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing, evolving schemas. It provides an API & tooling to help developers and users to register a schema and consume that schema without having any impact if the schema changed. Users can tag different schemas and versions, register for notifications of schema changes with versions etc.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache Nifi, Apache Kafka, Apache Storm.
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...DataWorks Summit
A data highway is a network that collects, aggregates, and delivers large volumes of data between data centers. The presented data highway collects 250 billion events totaling 800 terabytes of input data and delivers 1.2 petabytes of output data to over 30,000 publisher hosts and 60 tenants. It provides high throughput collection and delivery of data across data centers while allowing customization for individual tenant needs. The architecture includes routers to collect and forward data, gateways to aggregate and buffer data, and prisms to deliver data to various endpoints. Challenges discussed include improving throughput, controlling variable input throughput, and ensuring delivery during system recoveries. Solutions demonstrated rate limiting, load balancing, and additive-increase/multiplicative-decrease
Ted Dunning presents on streaming architectures and MapR Technologies' streaming capabilities. He discusses MapR Streams, which implements the Kafka API for high performance and scale. MapR provides a converged data platform with files, tables, and streams managed under common security and permissions. Dunning reviews several use cases and lessons learned around real-time data processing, microservices, and global data management requirements.
This document summarizes a presentation about new features in Apache Hadoop 3.0 related to YARN and MapReduce. It discusses major evolutions like the re-architecture of the YARN Timeline Service (ATS) to address scalability, usability, and reliability limitations. Other evolutions mentioned include improved support for long-running native services in YARN, simplified REST APIs, service discovery via DNS, scheduling enhancements, and making YARN more cloud-friendly with features like dynamic resource configuration and container resizing. The presentation estimates the timeline for Apache Hadoop 3.0 releases with alpha, beta, and general availability targeted throughout 2017.
This document discusses securing Hadoop and Spark clusters. It begins with an overview of Hadoop security in four steps: authentication, authorization, data protection, and audit. It then discusses specific Hadoop security components like Kerberos, Apache Ranger, HDFS encryption, Knox gateway, and data encryption in motion and at rest. For Spark security, it covers authentication using Kerberos, authorization with Ranger, and encrypting data channels. The document provides demos of HDFS encryption and discusses common gotchas with Spark security.
Apache Hadoop 3 is coming! As the next major milestone for hadoop and big data, it attracts everyone's attention as showcase several bleeding-edge technologies and significant features across all components of Apache Hadoop: Erasure Coding in HDFS, Docker container support, Apache Slider integration and Native service support, Application Timeline Service version 2, Hadoop library updates and client-side class path isolation, etc. In this talk, first we will update the status of Hadoop 3.0 releasing work in apache community and the feasible path through alpha, beta towards GA. Then we will go deep diving on each new feature, include: development progress and maturity status in Hadoop 3. Last but not the least, as a new major release, Hadoop 3.0 will contain some incompatible API or CLI changes which could be challengeable for downstream projects and existing Hadoop users for upgrade - we will go through these major changes and explore its impact to other projects and users.
Speaker: Sanjay Radia, Founder and Chief Architect, Hortonworks
This document discusses YARN federation, which allows multiple YARN clusters to be connected together. It summarizes:
- YARN is used at Microsoft for resource management but faces challenges of large scale and diverse workloads. Federation aims to address this.
- The federation architecture connects multiple independent YARN clusters through centralized services for routing, policies, and state. Applications are unaware and can seamlessly run across clusters.
- Federation policies determine how work is routed and scheduled across clusters, balancing objectives like load balancing, scaling, fairness, and isolation. A spectrum of policy options is discussed from full partitioning to full replication to dynamic partial replication.
- A demo is presented showing a job running across
http://hortonworks.com/hadoop/spark/
Recording:
https://hortonworks.webex.com/hortonworks/lsr.php?RCID=03debab5ba04b34a033dc5c2f03c7967
As the ratio of memory to processing power rapidly evolves, many within the Hadoop community are gravitating towards Apache Spark for fast, in-memory data processing. And with YARN, they use Spark for machine learning and data science use cases along side other workloads simultaneously. This is a continuation of our YARN Ready Series, aimed at helping developers learn the different ways to integrate to YARN and Hadoop. Tools and applications that are YARN Ready have been verified to work within YARN.
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with a demo highlighting the performant and powerful integration of these projects.
Hadoop makes it relatively easy to store petabytes of data. However, storing data is not enough; columnar layouts for storage and in-memory execution allow the analysis of large amounts of data very quickly and efficiently. It provides the ability for multiple applications to share a common data representation and perform operations at full CPU throughput using SIMD and Vectorization. For interoperability, row based encodings (CSV, Thrift, Avro) combined with general purpose compression algorithms (GZip, LZO, Snappy) are common but inefficient. As discussed extensively in the database literature, a columnar layout with statistics and sorting provides vertical and horizontal partitioning, thus keeping IO to a minimum. Additionally a number of key big data technologies have or will soon have in-memory columnar capabilities. This includes Kudu, Ibis and Drill. Sharing a common in-memory columnar representation allows interoperability without the usual cost of serialization.
Understanding modern CPU architecture is critical to maximizing processing throughput. We’ll discuss the advantages of columnar layouts in Parquet and Arrow for in-memory processing and data encodings used for storage (dictionary, bit-packing, prefix coding). We’ll dissect and explain the design choices that enable us to achieve all three goals of interoperability, space and query efficiency. In addition, we’ll provide an overview of what’s coming in Parquet and Arrow in the next year.
While you could be tempted assuming data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like: "What happens if the entire datacenter fails?, or "How do I recover into a consistent state of data, so that applications can continue to run?" are not a all trivial to answer for Hadoop. Did you know that HDFS snapshots are handling open files not as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned are not (yet) important. This talk first is introducing you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, to finally show you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee a continuous operation of Hadoop cluster based solutions.
The document discusses how Apache Ambari can be used to streamline Hadoop DevOps. It describes how Ambari can be used to provision, manage, and monitor Hadoop clusters. It highlights new features in Ambari 2.4 like support for additional services, role-based access control, management packs, and Grafana integration. It also covers how Ambari supports automated deployment and cluster management using blueprints.
Supporting Financial Services with a More Flexible Approach to Big DataWANdisco Plc
In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Spark Summit
This document discusses how the Spark Cassandra Connector optimizes for data locality when performing analytics on Cassandra data using Spark. It does this by using the partition keys and token ranges to create Spark partitions that correspond to the data distribution across the Cassandra nodes, allowing work to be done locally to each data node without moving data across the network. This improves performance and avoids the costs of data shuffling.
Cassandra and Spark: Optimizing for Data LocalityRussell Spitzer
This document discusses how the Spark Cassandra Connector optimizes for data locality when performing analytics on Cassandra data using Spark. It does this by using the partition keys and token ranges to create Spark partitions that correspond to the data distribution across the Cassandra nodes, allowing work to be done locally to each data node without moving data across the network. This improves performance and avoids the costs of data shuffling.
The newly enacted GDPR regulations which become effective in 2018 require comprehensive protection of personal information of EU subjects. In this paper, we outline a solution that discovers and classifies personal data that is subject to GDPR in Hadoop ecosystem and uses such precise classification to automatically create a robust set of policies for authorization. The solution consists of using Dataguise’s DgSecure sensitive data detection to automatically classify sensitive data assets in Apache Atlas and author comprehensive and robust authorization policies via Apache Ranger. DgSecure is used to detect sensitive data in Hive databases and continuously update the classification in Apache Atlas via tags. Apache Atlas tags are used to create Apache Ranger policies that protect access to sensitive HDFS files, Hive tables, and Hive columns. We demonstrate a workflow where the components of the solution are automated requiring little or no manual intervention to provide protection of such sensitive data in Hadoop clusters.
Big data is driving transformative changes in traditional data warehousing. Traditional ETL processes and highly structured data schemas are being replaced with schema flexibility to handle all types of data from diverse sources. This allows for real-time experimentation and analysis beyond just operational reporting. Microsoft is applying lessons from its own big data journey to help customers by providing a comprehensive set of Apache big data tools in Azure along with intelligence and analytics services to gain insights from diverse data sources.
The document discusses the past, present, and future of Apache Hadoop YARN. It describes how YARN started as a sub-project of Hadoop to improve its resource management capabilities. Today, YARN is central to modern data architectures, providing centralized resource management and scheduling. Going forward, YARN aims to better support containers, simplified APIs, treating services as first-class citizens, and enhance its user experience.
MaaS (Model as a Service): Modern Streaming Data Science with Apache MetronDataWorks Summit
Apache Metron is a streaming cybersecurity application
built on Apache Storm and Hadoop. One of its core missions is to enable advanced analytics through machine learning and data science to the users. Because of the relative immaturity of data science platform infrastructure integrated into Hadoop that is oriented to streaming analytics applications, we have been forced to create the requisite platform components out of necessity, utilizing many of the pieces of the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and how it utilizes a custom data science model deployment and autodiscovery service that is tightly integrated with Hadoop via Yarn and Zookeeper. We will discuss how we interact with the models deployed there via a custom domain specific language that can query models as data streams past. We will generally discuss the full-stack data science tooling that has been created to enable data science at scale on an advanced analytics
streaming application.
Speaker:
Casey Stella, Principal Software Engineer/Data Scientist, Hortonworks
Information security is a big problem today. With more attacks happening all the time, and increasingly sophisticated attacks beyond the script-kiddies of yesterday, patrolling the borders of our networks, and controlling threats both from outside and within is becoming harder. We cannot rely on endpoint protection for a few thousand PCs and servers anymore, but as connected cars, internet of things, and mobile devices become more common, so the attack surface broadens. To face these problems, we need technologies that go beyond the traditional SEIM, which human operators writing rules. We need to use the power of the Hadoop ecosystem to find new patterns, machine learning to uncover subtle signals and big data tools to help humans analysts work better and faster to meet these new threats. Apache Metron is a platform on top of Hadoop that meets these needs. Here we will look at the platform in action, and how to use it to trace a real world complex threat, and how it compares to traditional approaches. Come and see how to make your SOC more effective with automated evidence gathering, Hadoop-powered integration, and real-time detection.
As more and more enterprises are adapting to utilize hadoop, integrating the enterprise users into the hadoop environment and enabling these users to access hadoop services and clusters becomes critical. Typically, enterprises manage the users and groups through directory servers and Hadoop services need to integrate with these enterprise directory servers (like Active Directory, LDAP etc...). This process presents its own unique set of challenges.
In this session, the speakers review the background of user and group management for Hadoop as well as covers various scenarios supported in Apache Ranger while interacting with AD or LDAP servers. This session also examines various use cases and some of the common challenges faced in the enterprise environments while interacting with Active Directory or LDAP server including filtering out users and/or groups to be brought into Hadoop from AD or LDAP servers, restricting access to a set of users and/or groups from AD, etc…
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, different kinds of data have distinct properties. We've used three real schemas:
* the NYC taxi data http://tinyurl.com/nyc-taxi-analysis
* the Github access logs http://githubarchive.org
* a typical sales fact table with generated data
Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from Apache.
As Hadoop applications move into cloud deployments, object stores become more and more the source and destination of data. But object stores are not filesystems: sometimes they are slower; security is different,
What are the secret settings to get maximum performance from queries against data living in cloud object stores? That's at the filesystem client, the file format and the query engine layers? It's even how you lay out the files —the directory structure and the names you give them.
We know these things, from our work in all these layers, from the benchmarking we've done —and the support calls we get when people have problems. And now: we'll show you.
This talk will start from the ground up "why isn't an object store a filesystem?" issue, showing how that breaks fundamental assumptions in code, and so causes performance issues which you don't get when working with HDFS. We'll look at the ways to get Apache Hive and Spark to work better, looking at optimizations which have been done to enable this —and what work is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
This document provides an overview of Apache Hadoop, a distributed processing framework for large datasets. It describes how Hadoop uses the Hadoop Distributed File System (HDFS) to provide a unified view of large amounts of data across clusters of computers. It also explains how the MapReduce programming model allows distributed computations to be run efficiently across large datasets in parallel. Key aspects of Hadoop's architecture like scalability, fault tolerance and the MapReduce programming model are discussed at a high level.
Apache Hadoop YARN is a modern resource-management platform that can host multiple data processing engines for various workloads like batch processing (MapReduce), interactive SQL (Hive, Tez), real-time processing (Storm), existing services and a wide variety of custom applications. These applications can all co-exist on YARN and share a single data center in a cost-effective manner with the platform worrying about resource management, isolation and multi-tenancy.
YARN is now adding support for services in a first class manner. This talk will first cover the challenges of running services on YARN, and then move on to the changes that were made to the ResourceManager to support scheduling services on YARN(such as affinity and anti-affinity). The talk will then move on to cover the changes made in the NodeManager and features such as container restart and container upgrades. The talk will also cover new additions to YARN like the new application manager (that will allow users to bring services workloads onto YARN by providing features such as container orchestration and management) and the DNS server that uses the YARN registry to enable service discovery.
Apache Metron is a community-driven cybersecurity platform for ingesting, enriching, and analyzing security data at massive scale. It introduces Ned Shawa and Laurence Da Luz who discuss Metron's capabilities and architecture, including real-time parsing, enrichment using threat intelligence, and the ability to build a use case ingesting and analyzing Squid proxy logs. They demonstrate how to configure parsers, transformations, alerts and the user interface to add a new data source to the Metron platform.
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...DataWorks Summit
There have been many voices discussing how to architect streaming
applications on Hadoop. Before now, there have been very few worked
examples existing within the open source. Apache Metron (Incubating) is a
streaming advanced analytics cybersecurity application which utilizes
the components within the Hadoop stack as its platform.
We will attempt to go beyond theoretical discussions of Kappa vs Lambda
architectures and describe the nuts and bolts of a streaming
architecture that enables advanced analytics in Hadoop. We will discuss
the componentry that we had to build and what we could utilize. We will
discuss why we made the architectural decisions that we made and how
they fit together to knit together a coherent application on top of many
different Hadoop ecosystem projects.
We will also discuss the domain specific language that we created out of
necessity to enable a pluggable layer to enable user defined enrichments.
We will discuss how this helped make Metron less rigid and easier to
use. We will also candidly discuss mistakes that we made early on.
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared
Apache Kafka becoming the message bus to transfer huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices in deploying Apache Kafka
in production. How to Secure a Kafka Cluster, How to pick topic-partitions and upgrading to newer versions. Migrating to new Kafka Producer and Consumer API.
Also talk about the best practices involved in running a producer/consumer.
In Kafka 0.9 release, we’ve added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Now Kafka allows authentication of users, access control on who can read and write to a Kafka topic. Apache Ranger also uses pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase open sourced Kafka REST API and an Admin UI that will help users in creating topics, re-assign partitions, Issuing
Kafka ACLs and monitoring Consumer offsets.
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyDataWorks Summit
The document discusses HDFS erasure coding storage, which provides the same reliability as HDFS replication but with better storage efficiency. It describes HDFS's current replication strategy and introduces erasure coding as an alternative. Erasure coding uses k data blocks and m parity blocks to tolerate m failures while using less storage than replication. The document outlines HDFS's technical approach to implementing erasure coding, including striped block encoding, reading and writing pipelines, and reconstruction of missing blocks. It provides details on architecture, development progress, and future work to further enhance erasure coding in HDFS.
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersEdelweiss Kammermann
This session is based on a full-day big data workshop delivered to 40 database professionals at the German User Group (DOAG) conference in 2016, garnering fantastic feedback (www.munzandmore.com/2016/ora/big-data-cloudera-oracle-training-feedback-doag). There are zillions of open source big data projects these days. In the session, you will learn about the core principles of four key technologies that are most often used in projects: Hadoop, Spark, Hive, and Kafka. The presentation first explains the fundamentals of those four big data technologies. Then you will see how to take the first easy steps into the big data world yourself, with Oracle Big Data Cloud Service and Oracle Event Hub Cloud Service live demos.
The current major release, Hadoop 2.0 offers several significant HDFS improvements including new append-pipeline, federation, wire compatibility, NameNode HA, Snapshots, and performance improvements. We describe how to take advantages of these new features and their benefits. We cover some architectural improvements in detail such as HA, Federation and Snapshots. The second half of the talk describes the current features that are under development for the next HDFS release. This includes much needed data management features such as backup and Disaster Recovery. We add support for different classes of storage devices such as SSDs and open interfaces such as NFS; together these extend HDFS as a more general storage system. Hadoop has recently been extended to run first-class on Windows which expands its enterprise reach and allows integration with the rich tool-set available on Windows. As with every release we will continue improvements to performance, diagnosability and manageability of HDFS. To conclude, we discuss the reliability, the state of HDFS adoption, and some of the misconceptions and myths about HDFS.
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San JoseMingliang Liu
This talks about use cases and scenarios of running Hadoop applications in the cloud. It covers the problems encountered and lessons learned at Hortonworks. In the talk, we will see a couple deep dives, including Hadoop cluster/service auto-scaling, fault tolerance, and object storage consistency problems. This appears at DataWorks Summit 2017 San Jose, a co-joint talk by Ram Venkatesh and Mingliang Liu.
Cloudy with a chance of Hadoop - real world considerationsDataWorks Summit
Over the last eighteen months, we have seen significant adoption of Hadoop eco-system centric big data processing in Microsoft Azure and Amazon AWS. In this talk we present some of the lessons learned and architectural considerations for cloud-based deployments including security, fault tolerance and auto-scaling.
We look at how Hortonworks Data Cloud and Cloudbreak can automate that scaling of Hadoop clusters, showing how it can react dynamically to workloads, and what that can deliver in cost-effective Hadoop-in-cloud deployments.
You’ve successfully deployed Hadoop, but are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? In the first part of the talk, we will cover issues that have been seen over the last two years on hundreds of production clusters with detailed breakdown covering the number of occurrences, severity, and root cause. We will cover best practices and many new tools and features in Hadoop added over the last year to help system administrators monitor, diagnose and address such incidents.
The second part of our talk discusses new features for making daily operations easier. This includes features such as ACLs for simplified permission control, snapshots for data protection and more. We will also cover tuning configuration and features that improve cluster utilization, such as short-circuit reads and datanode caching.
The document discusses new features in Apache Hadoop Common and HDFS for version 3.0. Key updates include upgrading the minimum Java version to Java 8, improving dependency management, adding a new Azure Data Lake Storage connector, and introducing erasure coding in HDFS to improve storage efficiency. Erasure coding in HDFS phase 1 allows for striping of small blocks and parallel writes/reads while trading off higher network usage compared to replication.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
This document discusses challenges and solutions for using object storage with Apache Spark and Hive. It covers:
- Eventual consistency issues in object storage and lack of atomic operations
- Improving performance of object storage connectors through caching, optimized metadata operations, and consistency guarantees
- Techniques like S3Guard and committers that address consistency and correctness problems with output commits in object storage
Dancing elephants - efficiently working with object stores from Apache Spark ...DataWorks Summit
As Hadoop applications move into cloud deployments, object stores become more and more the source and destination of data. But object stores are not filesystems: sometimes they are slower; security is different,
What are the secret settings to get maximum performance from queries against data living in cloud object stores? That's at the filesystem client, the file format and the query engine layers? It's even how you lay out the files —the directory structure and the names you give them.
We know these things, from our work in all these layers, from the benchmarking we've done —and the support calls we get when people have problems. And now: we'll show you.
This talk will start from the ground up "why isn't an object store a filesystem?" issue, showing how that breaks fundamental assumptions in code, and so causes performance issues which you don't get when working with HDFS. We'll look at the ways to get Apache Hive and Spark to work better, looking at optimizations which have been done to enable this —and what work is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
The document provides an overview of new features in HDFS in Hadoop 2, including:
- A new appendable write pipeline that allows files to be reopened for append and provides primitives like hflush and hsync.
- Support for multiple namenode federation to improve scalability and isolate namespaces.
- Namenode high availability using techniques like ZooKeeper and a quorum journal manager to avoid single points of failure.
- A new file system snapshots feature that allows point-in-time recovery through copy-on-write snapshots without data copying.
The document discusses scaling challenges with HDFS and proposed solutions from Hortonworks called HDDS and Ozone. HDFS scales well for data and IO but has limitations scaling the namespace beyond 500 million files. HDDS aims to scale the block layer using block containers which can reduce block reports. Ozone uses a flat key-value namespace that is easier to shard and scale beyond billions of objects compared to HDFS hierarchical namespace. It also provides an HDFS compatible filesystem called OzoneFS. Together HDDS and Ozone aim to retain HDFS features while scaling to exabytes of data and trillions of files.
We discuss the current state of LLAP (Live Long and Process) – the concurrent sub-second execution of analytical queries engine for Hive 2.0. LLAP is a hybrid execution model that enables performance improvement in and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching and multi-threaded processing. LLAP features robust machine and service failure tolerance achieved by building on top of the time-tested fault tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries, and enabling the system to preempt tasks of lower priority without failing any query in-flight. The talk also aims to cover the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, as well as future work, including how LLAP fits into a unified secure DataFrame access layer.
Hadoop & cloud storage object store integration in production (final)Chris Nauroth
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, recent emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Store, GCS, which are designed for cost-efficiency, scalability and geographic distribution. Hadoop supports pluggable file system implementations to enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL, but applications may encounter unique challenges related to eventual consistency, performance and differences in semantics compared to HDFS. This session explores those challenges and presents recent work to address them in a comprehensive effort spanning multiple Hadoop ecosystem components, including the Object Store FileSystem connector, Hive, Tez and ORC. Our goal is to improve correctness, performance, security and operations for users that choose to integrate Hadoop with Cloud Storage. We use S3 and S3A connector as case study.
HDFS has several strengths: horizontally scale its IO bandwidth and scale its storage to petabytes of storage. Further, it provides very low latency metadata operations and scales to over 60K concurrent clients. Hadoop 3.0 recently added Erasure Coding. One of HDFS’s limitations is scaling a number of files and blocks in the system. We describe a radical change to Hadoop’s storage infrastructure with the upcoming Ozone technology. It allows Hadoop to scale to tens of billions of files and blocks and, in the future, to every larger number of smaller objects. Ozone fundamentally separates the namespace layer and the block layer allowing new namespace layers to be added in the future. Further, the use of RAFT protocol has allowed the storage layer to be self-consistent. We show how this technology helps a Hadoop user and also what it means for evolving HDFS in the future. We will also cover the technical details of Ozone.
This document discusses Hadoop integration with cloud storage. It describes the Hadoop-compatible file system architecture, which allows Hadoop applications to work with both HDFS and cloud storage transparently. Recent enhancements to the S3A file system connector for Amazon S3 are discussed, including performance improvements and support for encryption. Benchmark results show significant performance gains for Hive queries with S3A compared to earlier versions. Upcoming work on output committers, object store abstraction, and consistency are outlined.
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...HostedbyConfluent
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam K Dey | Current 2022
Robinhood’s mission is to democratize finance for all. Data driven decision making is key to achieving this goal. Data needed are hosted in various OLTP databases. Replicating this data near real time in a reliable fashion to data lakehouse powers many critical use cases for the company. In Robinhood, CDC is not only used for ingestion to data-lake but is also being adopted for inter-system message exchanges between different online micro services. .
In this talk, we will describe the evolution of change data capture based ingestion in Robinhood not only in terms of the scale of data stored and queries made, but also the use cases that it supports. We will go in-depth into the CDC architecture built around our Kafka ecosystem using open source system Debezium and Apache Hudi. We will cover online inter-system message exchange use-cases along with our experience running this service at scale in Robinhood along with lessons learned.
HDFS has several strengths: horizontally scale its IO bandwidth and scale its storage to petabytes of storage. Further, it provides very low latency metadata operations and scales to over 60K concurrent clients. Hadoop 3.0 recently added Erasure Coding. One of HDFS’s limitations is scaling a number of files and blocks in the system. We describe a radical change to Hadoop’s storage infrastructure with the upcoming Ozone technology. It allows Hadoop to scale to tens of billions of files and blocks and, in the future, to every larger number of smaller objects. Ozone fundamentally separates the namespace layer and the block layer allowing new namespace layers to be added in the future. Further, the use of RAFT protocol has allowed the storage layer to be self-consistent. We show how this technology helps a Hadoop user and also what it means for evolving HDFS in the future. We will also cover the technical details of Ozone.
Speaker: Sanjay Radia, Chief Architect, Founder, Hortonworks
The document discusses how EMC Isilon scale-out NAS storage improves Hadoop resiliency and operational efficiency. It analyzes the impact of DataNode and TaskTracker failures on Hadoop jobs. EMC Isilon provides high availability, independent scalability of storage and compute, data protection features, and support for multiple Hadoop distributions and protocols like HDFS, NFS, SMB. This allows using existing data for analysis without replication and reduces time-to-results for Hadoop jobs.
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many Organizations are currently processing various types of data and in different formats. Most often this data will be in free form, As the consumers of this data growing it’s imperative that this free-flowing data needs to adhere to a schema. It will help data consumers to have an expectation of about the type of data they are getting and also they will be able to avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the Data Pipeline a really easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing, evolving schemas. It provides an API & tooling to help developers and users to register a schema and consume that schema without having any impact if the schema changed. Users can tag different schemas and versions, register for notifications of schema changes with versions etc.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
DeepLearning is not just a hype - it outperforms state-of-the-art ML algorithms. One by one. In this talk we will show how DeepLearning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different BigData engines like ApacheSpark and ApacheFlink. Key in this talk is the absence of any large training corpus since we are using unsupervised machine learning - a domain current DL research threats step-motherly. As we can see in this demo LSTM networks can learn very complex system behavior - in this case data coming from a physical model simulating bearing vibration data. Once draw back of DeepLearning is that normally a very large labaled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with DeepLearning - no labeled data set is necessary. We are able to detect anomalies and predict braking bearings with 10 fold confidence. All examples and all code will be made publicly available and open sources. Only open source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means, that QE Automations scenarios need to be detailed around actual use cases, cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, verifying which is a tedious job. Given the multiple levels of indirection, the false positives of actual defects are higher, and are generally wasteful.
At Hortonworks, we’ve designed and implemented Automated Log Analysis System - Mool, using Statistical Data Science and ML. Currently the work in progress has a batch data pipeline with a following ensemble ML pipeline which feeds into the recommendation engine. The system identifies the root cause of test failures, by correlating the failing test cases, with current and historical error records, to identify root cause of errors across multiple components. The system works in unsupervised mode with no perfect model/stable builds/source-code version to refer to. In addition the system provides limited recommendations to file/open past tickets and compares run-profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Us the platform appropriately (Not every problem is eligible for Big Data and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
This document discusses optimizing Apache Spark machine learning workloads on OpenPOWER platforms. It provides an overview of Spark, machine learning, and deep learning. It then discusses how OpenPOWER systems are well-suited for these workloads due to features like high memory bandwidth, large caches, and GPU support. The document outlines various techniques for tuning Spark performance on OpenPOWER, such as configuration of executors, cores, memory, and storage levels. It also presents examples analyzing the performance of a matrix factorization machine learning application under different Spark configurations.
The challenge of computing big data for evolving digital business processes demands variety of computation techniques and engines (SQL, OLAP, time-series, graph, document store), but working in unified framework. A simple architecture of data transformations while ensuring the security, governance, and operational administration are the necessary critical components for enterprise production environments supporting day-to-day business processes. In this session, you will learn about best practices & critical components to ensure business value from latest production deployments. Hear how existing customers are using SAP Vora and the value they have achieved so far with this in-memory engine for distributed data processing. The session provides you with a clear understanding how SAP Vora and open source components like Apache Hadoop and Apache Spark offer an architecture that supports a wide variety of use cases and industries. You will also receive very useful insight where to find development resources, test drive demos, and general documentation.
Hadoop clusters can store nearly everything in a cheap and blazingly fast way to your data lake. Answering questions and gaining insights out of this ever growing stream becomes the decisive part for many businesses. Increasingly data has a natural structure as a graph, with vertices linked by edges, and many questions arising about the data involve graph traversals or other complex queries, for which one does not have an a priori given bound on the length of paths.
Spark with GraphX is great for answering relatively simple graph questions which are worth starting a Spark job for, because they essentially involve the whole graph. But does it make sense to start one for every ad-hoc query or is it suitable for complex real-time queries?
In this talk I will introduce an alternative solution that adds those features to an existing Hadoop/Spark setup and enables real-time insights. I will address the following topics:
* Challenges in gaining deeper insights from large amounts of graph data
* Benefits and limitations of graph analysis with Spark
* Introduction to ArangoDB SmartGraphs
* Deployment of Hadoop, Spark and ArangoDB using DC/OS
* Performing complex queries on billions of nodes and vertices leveraging ArangoDB SmartGraphs (Live Demo)
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptxFwdays
I will share my personal experience of full-time development on wasm Blazor
What difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, which technology stack and architectural patterns we chose
What conclusions we made and what mistakes we committed
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Zilliz
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Top 12 AI Technology Trends For 2024.pdfMarrie Morris
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceQuentin Reul
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and securing a formidable competitive advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
Self-Healing Test Automation Framework - HealeniumKnoldus Inc.
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
it enables online EC which bypasses the conversion phase and immediately saves storage space; this is especially desirable in clusters with high end networking. Second, it naturally distributes a small file to multiple DataNodesand eliminates the need to bundle multiple files into a single coding group.