This document discusses building a data analytics platform and summarizes various technologies that can be used. It begins by outlining reasons for analyzing data, such as reporting, monitoring, and exploratory analysis. It then discusses using relational databases, parallel databases, Hadoop, and columnar storage to store and process large volumes of data. Streaming technologies like Storm and Kafka, as well as services like Redshift, BigQuery, and Treasure Data, are also summarized as options for a complete analytics platform.
Bigdam is a planet-scale pipeline designed for large-scale data ingestion. It addresses issues with the traditional pipeline such as PerfectQueue throughput limitations, latency in queries from event collectors, difficulty maintaining event collector code, and many small temporary and imported files. The redesigned pipeline includes Bigdam-Gateway for HTTP endpoints, Bigdam-Pool for distributed buffer storage, Bigdam-Scheduler to schedule import tasks, Bigdam-Queue as a high-throughput queue, and Bigdam-Import for data conversion and import. Consistency is ensured through an at-least-once design, and deduplication is performed at the end of the pipeline for simplicity and reliability. Components are designed to scale out horizontally.
Empowering developers to deploy their own data stores (Tomas Doran)
Empowering developers to deploy their own data stores using Terraform, Puppet and rage. A talk about automating server building and configuration for Elasticsearch clusters, using HashiCorp and Puppet Labs tools. Presented at Config Management Camp 2016 in Ghent.
This document summarizes recent updates to Presto, including new data types, connectors, syntax, features, functions, and configuration options. Some key additions are support for new data types such as DECIMAL and VARCHAR; connectors for Redis, MongoDB, and other data sources; transaction support; and a variety of new SQL functions for strings, dates, aggregation, and more. Upcoming work includes prepared statements, a new optimizer, and other performance and usability improvements.
Presto is a distributed SQL query engine that allows for interactive analysis of large datasets across various data sources. It was created at Facebook to enable interactive querying of data in HDFS and Hive, which were too slow for interactive use. Presto addresses problems with existing solutions like Hive being too slow, the need to copy data for analysis, and high costs of commercial databases. It uses a distributed architecture with coordinators planning queries and workers executing tasks quickly in parallel.
How to create Treasure Data #dotsbigdata (N Masahiro)
This document provides an overview of Treasure Data's big data analytics platform. It discusses how Treasure Data ingests and processes large amounts of schema-less data from various sources in real-time and at scale. It also describes how Treasure Data stores and indexes the data for fast querying using SQL interfaces while maintaining schema flexibility.
This document summarizes recent updates to Norikra, an open source stream processing server. Key updates include:
1) The addition of suspended queries, which allow queries to be temporarily stopped and resumed later, and NULLABLE fields, which handle missing fields as null values.
2) New listener plugins that allow processing query outputs in customizable ways, such as pushing to users, enqueueing to Kafka, or filtering records.
3) Dynamic plugin reloading that loads newly installed plugins without requiring a restart, improving uptime.
Timothy Spann provides an overview of Apache NiFi, an open source dataflow software. Some key points about NiFi include:
- It provides guaranteed data delivery, buffering, prioritized queuing, and data provenance.
- It supports over 60 source connectors and has hundreds of processors for handling different data formats.
- The architecture includes repositories for storing metadata and provenance data, and supports clustering.
- Spann discusses best practices for using NiFi such as avoiding spaghetti flows, leveraging parameters and templates, and upgrading to the latest version. He also demonstrates how to consume data from sources like MQTT and FTP.
Embulk and Machine Learning infrastructure (Hiroshi Toyama)
This document summarizes a presentation about using Embulk for machine learning and natural language processing. The presenter introduced themselves and their company, which uses Embulk for tasks like creating machine learning data, executing machine learning on the cloud, and natural language processing. They have developed several Embulk plugins for tasks like morphological analysis, Unicode normalization, and integrating with APIs from Amazon Machine Learning, Google Vision, and other cloud services. They discussed use cases for various machine learning and NLP tasks as well as limitations and capabilities of different cloud APIs.
Urban Airship is a mobile platform that provides services to over 160 million active application installs across 80 million devices. They initially used PostgreSQL but needed a system that could scale writes more easily. They tried several NoSQL databases including MongoDB, but ran into issues with MongoDB's locking, long queries blocking writes, and updates causing heavy disk I/O. They are now converging on Cassandra and PostgreSQL for transactions and HBase for analytics workloads.
This document discusses using pulsarctl and pulsar-manager to manage a Pulsar cluster. It introduces pulsarctl as a CLI tool developed in Go for managing Pulsar clusters that addresses some issues with the existing Pulsar admin tool. It then covers how to use the Admin API and CLI features of pulsarctl. Finally, it outlines some future plans, including adding more features to pulsarctl and pulsar-manager.
ApacheCon 2021: Apache NiFi 101- introduction and best practices (Timothy Spann)
ApacheCon 2021: Apache NiFi 101- introduction and best practices
Thursday 14:10 UTC
Apache NiFi 101: Introduction and Best Practices
Timothy Spann
In this talk, we will walk step by step through Apache NiFi from the first load to first application. I will include slides, articles and examples to take away as a Quick Start to utilizing Apache NiFi in your real-time dataflows. I will help you get up and running locally on your laptop or in Docker.
DZone Zone Leader and Big Data MVB
@PaasDev
https://github.com/tspannhw https://www.datainmotion.dev/
https://github.com/tspannhw/SpeakerProfile
https://dev.to/tspannhw
https://sessionize.com/tspann/
https://www.slideshare.net/bunkertor
Cracking the nut, solving edge ai with apache tools and frameworks (Timothy Spann)
Cracking the nut, solving edge ai with apache tools and frameworks
Using the FLaNK stack for Edge AI and Streaming AI.
Apache Flink, Apache Kafka, Apache Nifi, Apache Kudu, DJL, Apache MXNet, Apache OpenNLP, Apache Tika, Apache Hue, Apache Hadoop, Apache HDFS
Presented at AI DevWorld 2020 virtual
Treasure Data and AWS - Developers.io 2015 (N Masahiro)
This document discusses Treasure Data's data architecture. It describes how Treasure Data collects and imports log data using Fluentd. The data is stored in columnar format in S3 and metadata is stored in PostgreSQL. Treasure Data uses Presto to enable fast analytics on the large datasets. The document provides details on the import process, storage, partitioning, and optimizations to improve query performance.
Spark Streaming allows processing of live data streams using Spark. It works by dividing the data stream into batches called micro-batches, which are then processed using Spark's batch engine to generate RDDs. This allows for fault tolerance, exactly-once processing, and integration with other Spark APIs like MLlib and GraphX.
Kafka is a high-throughput distributed messaging system with publish and subscribe capabilities. It provides persistence with replication to disk for fault tolerance. Kafka is simple to implement and runs efficiently on large clusters with low latency and high throughput. It was created at LinkedIn to process streaming data from the LinkedIn website and has since been open sourced.
ApacheCon 2021 - Apache NiFi Deep Dive 300 (Timothy Spann)
21-September-2021 - ApacheCon - Tuesday 17:10 UTC Apache NIFi Deep Dive 300
* https://github.com/tspannhw/EverythingApacheNiFi
* https://github.com/tspannhw/FLiP-ApacheCon2021
* https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html
* https://github.com/tspannhw/FLiP-IoT
* https://github.com/tspannhw/FLiP-Energy
* https://github.com/tspannhw/FLiP-SOLR
* https://github.com/tspannhw/FLiP-EdgeAI
* https://github.com/tspannhw/FLiP-CloudQueries
* https://github.com/tspannhw/FLiP-Jetson
* https://www.linkedin.com/pulse/2021-schedule-tim-spann/
Tuesday 17:10 UTC
Apache NIFi Deep Dive 300
Timothy Spann
For Data Engineers who have flows already in production, I will dive deep into best practices, advanced use cases, performance optimizations, tips, tricks, edge cases, and interesting examples. This is a master class for those looking to learn quickly things I have picked up after years in the field with Apache NiFi in production.
This will be interactive and I encourage questions and discussions.
You will take away examples and tips in slides, github, and articles.
This talk will cover:
Load Balancing
Parameters and Parameter Contexts
Stateless vs Stateful NiFi
Reporting Tasks
NiFi CLI
NiFi REST Interface
DevOps
Advanced Record Processing
Schemas
RetryFlowFile
Lookup Services
RecordPath
Expression Language
Advanced Error Handling Techniques
Tim Spann is a Developer Advocate @ StreamNative where he works with Apache NiFi, Apache Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
JRuby allows developers to write plugins for data processing systems like Norikra and Embulk in Ruby while taking advantage of Java libraries and the JVM. Norikra is a stream processing system that allows SQL queries over data streams. It is written in JRuby and uses the Java Esper library. Embulk is an open-source ETL tool that loads data between databases and file formats using plugins. Both systems use a plugin architecture where plugins can be written in JRuby or Java and are distributed as Ruby gems. This allows for a pluggable ecosystem that benefits from Ruby's productivity while utilizing Java libraries and the JVM's performance.
Learn how Spotify uses Puppet to manage the large and growing amount of servers used to stream music to millions of users. The presenter will also give an introduction to other technologies used to power Spotify.
Erik Dalén
System Engineer, Spotify
Erik is a system engineer within the site reliability engineering team at Spotify with a focus on Puppet and automation. He is also a community contributor to Puppet and author of the puppetdbquery tool. He can be found on IRC and GitHub as dalen.
This document summarizes a presentation about the technologies that support Gathery, a C2C service within Recruit Lifestyle. It discusses how Gathery leverages AWS technologies like ElastiCache, CloudSearch, S3, and auto-scaling to reduce costs, improve scalability and ease of operations compared to an on-premises solution. It also explains challenges of developing new services within a large company like restrictions on modifying DNS or security groups, and how they are working to address issues like internal API access control. The presentation focuses on how engineering culture and challenges are fun and interesting parts of growing a business and technology together.
The document summarizes the key features and changes in Fluentd v0.14, including new plugin APIs, plugin storage and helpers, time with nanosecond resolution, a ServerEngine-based supervisor for Windows support, and plans for symmetric multi-core processing, a counter API, and TLS/authentication in future versions. It also benchmarks some performance improvements and outlines the roadmap for Treasure Agent 3.0 based on Fluentd v0.14.
The document summarizes the new plugin API in Fluentd v0.14. Key points include:
- The v0.12 plugin API was fragmented and difficult to write tests for. The v0.14 API provides a unified architecture.
- The main plugin classes are Input, Filter, Output, Buffer, and plugins must subclass Fluent::Plugin::Base.
- The Output plugin supports both buffered and non-buffered processing. Buffering can be configured by tags, time, or custom fields.
- "Owned" plugins like Buffer are instantiated by primary plugins and can access owner resources. Storage is a new owned plugin for persistent storage.
- New test drivers emulate the plugin lifecycle, making plugin tests easier to write.
This document discusses middleware in Ruby and provides examples of considerations when writing middleware:
- Middleware should be a long-running daemon process that is compatible across platforms and environments and handles various data formats and traffic volumes.
- Tests must be run on all supported platforms to ensure compatibility as thread and process scheduling differs between operating systems.
- Memory usage and object leaks must be carefully managed in long-running processes to avoid consuming resources over time.
- Performance of JSON parsing/generation should be benchmarked and the most optimized library used to avoid unnecessary CPU usage.
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL (Amazon Web Services)
This document provides an overview and introduction to Amazon Athena, including:
- Athena is an interactive query service that allows users to analyze data directly from Amazon S3 using standard SQL.
- It is serverless, requiring no infrastructure management and with zero spin up time.
- Athena supports a variety of data formats and allows querying data directly from S3 without needing to load it elsewhere.
- Customers can use Athena to analyze large amounts of data in S3 in a cost effective and easy to use manner.
This document provides an introduction to Apache Spark and Zeppelin. It describes Spark as an open source cluster computing framework, and its APIs for Scala, Java, Python and R. Key Spark components are outlined like Spark Core, Spark SQL, MLlib and GraphX. RDDs are defined as Spark's primary abstraction, and DataFrames/Datasets are presented as higher-level APIs built on RDDs. The benefits of Spark SQL for structured data are highlighted. Examples demonstrate basic Spark and SQL usage. Finally, Apache Zeppelin and the Hortonworks sandbox are introduced as tools for interactive data analytics on Spark and Hadoop clusters.
Introduction to Hadoop Ecosystem was presented to Lansing Java User Group on 2/17/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of HDP 2.2 and AWS cloud.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
Big Data Developers Moscow Meetup 1 - SQL on Hadoop (bddmoscow)
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
Innovation in the Data Warehouse - StampedeCon 2016 (StampedeCon)
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we'll dive into the decision points necessary for your own cluster, for example: cloud vs on premises, physical vs virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we'll share some lessons learned about which pieces of our architecture worked well and rant about those which didn't. No deep Hadoop knowledge is necessary; this is aimed at the architect or executive level.
Microsoft's Big Play for Big Data - Visual Studio Live! NY 2012 (Andrew Brust)
This document discusses Microsoft's efforts to make big data technologies like Hadoop more accessible through its products. It describes Hadoop, MapReduce, HDFS, and other big data concepts. It then outlines Microsoft's project to create a Hadoop distribution that runs on Windows Server and Windows Azure, including building an ODBC driver to allow tools like Excel to query Hadoop. This will help bring big data to more business users and integrate it with Microsoft's existing BI technologies.
Big data comes from many sources like social media, e-commerce sites, and stock markets. Hadoop is an open-source framework that allows processing and storing large amounts of data across clusters of computers. It uses HDFS for storage and MapReduce for processing. HDFS stores data across cluster nodes and is fault tolerant. MapReduce analyzes data through parallel map and reduce functions. Sqoop imports and exports data between Hadoop and relational databases.
This document discusses cloud and big data technologies. It provides an overview of Hadoop and its ecosystem, which includes components like HDFS, MapReduce, HBase, Zookeeper, Pig and Hive. It also describes how data is stored in HDFS and HBase, and how MapReduce can be used for parallel processing across large datasets. Finally, it gives examples of using MapReduce to implement algorithms for word counting, building inverted indexes and performing joins.
Building Scalable Big Data Infrastructure Using Open Source Software Presenta... (ssuserd3a367)
1) StumbleUpon uses open source tools like Kafka, HBase, Hive and Pig to build a scalable big data infrastructure to process large amounts of data from its services in real-time and batch.
2) Data is collected from various services using Kafka and stored in HBase for real-time analytics. Batch processing is done using Pig and data is loaded into Hive for ad-hoc querying.
3) The infrastructure powers various applications like recommendations, ads and business intelligence dashboards.
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins at Google and Yahoo
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Facebook and Amazon use Hadoop to process massive amounts of data
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins from Google and Apache projects
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Yahoo, Facebook, and Amazon use Hadoop for applications like log processing, searches, and advertisement targeting
This document provides an introduction to Hadoop and big data. It discusses the new kinds of large, diverse data being generated and the need for platforms like Hadoop to process and analyze this data. It describes the core components of Hadoop, including HDFS for distributed storage and MapReduce for distributed processing. It also discusses some of the common applications of Hadoop and other projects in the Hadoop ecosystem like Hive, Pig, and HBase that build on the core Hadoop framework.
- Data is a precious resource that can last longer than the systems themselves (Tim Berners-Lee)
- Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides reliability, scalability and flexibility.
- Hadoop consists of HDFS for storage and MapReduce for processing. The main nodes include NameNode, DataNodes, JobTracker and TaskTrackers. Tools like Hive, Pig, HBase extend its capabilities for SQL-like queries, data flows and NoSQL access.
SQL on Hadoop
Looking for the correct tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from; how to select the correct tool?
The tool selection is always based on use case requirements.
Read more on alternatives and our recommendations.
From: DataWorks Summit Munich 2017 - 20170406
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issue and the difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
This document discusses big data and Hadoop. It provides an overview of Hadoop, including what it is, how it works, and its core components like HDFS and MapReduce. It also discusses what Hadoop is good for, such as processing large datasets, and what it is not as good for, like low-latency queries or transactional systems. Finally, it covers some best practices for implementing Hadoop, such as infrastructure design and performance considerations.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
This presentation discusses the follow topics
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
Similar to Technologies for Data Analytics Platform (20)
Fluentd Project Intro at Kubecon 2019 EU (N Masahiro)
Fluentd is a streaming data collector that can unify logging and metrics collection. It collects data from sources using input plugins, processes and filters the data, and outputs it to destinations using output plugins. It is commonly used for container logging, collecting logs from files or Docker and adding metadata before outputting to Elasticsearch or other targets. Fluentbit is a lightweight version of Fluentd that is better suited for edge collection and forwarding logs to a Fluentd instance for aggregation.
Fluentd v1 provides major improvements over v0.12 including nanosecond event time resolution, multi-core support, Windows support, and new plugin APIs. The new plugin APIs provide well-controlled lifecycles and integrate all output plugins. v1 also introduces a server engine based supervisor, dynamic buffering capabilities, and various plugin helpers. While maintaining compatibility with v0.12 plugins, v1 focuses on ease of use, stability, performance and flexibility.
Fluentd and Distributed Logging at Kubecon (N Masahiro)
This document discusses distributed logging with containers using Fluentd. It notes the challenges of logging in container environments where logs need to be collected from ephemeral containers and transferred to storage. It introduces Fluentd as a flexible data collection tool that can collect logs from containers using various plugins and methods like log drivers, shared volumes, and application libraries. The document discusses deployment patterns for Fluentd including using it for source-side aggregation to buffer and transfer logs more efficiently and for destination-side aggregation to scale log storage.
This document summarizes Fluentd v1.0 and provides details about its new features and release plan. It notes that Fluentd v1.0 will provide stable APIs and compatibility with previous versions while improving plugin APIs, adding Windows and multicore support, and increasing event time resolution to nanoseconds. The release is planned for Q3 2017 to allow feedback on v0.14 before finalizing v1.0 features.
This document summarizes the key features and changes between versions of Fluentd, an open source data collector.
The main points are:
1) Fluentd v1.0 will provide stable APIs and features while remaining compatible with v0.12 and v0.14. It will have no breaking API changes.
2) New features in v0.14 and v1.0 include nanosecond time resolution, multi-core processing, Windows support, improved buffering and plugins, and more.
3) The goals for v1.0 include migrating more plugins to the new APIs, addressing issues, and improving documentation. A release is planned for Q2 2017.
Fluentd is a data collector for unified logging that provides a robust core and plugins. It allows for reliable data transfer through error handling and retries. The core handles common concerns like parsing, buffering, and writing data, while plugins handle input, output, and other use cases. Fluentd has a pluggable architecture and processes data through a pipeline of input, parser, filter, buffer, formatter, and output plugins.
This document provides an overview of Fluentd, an open source data collector. It discusses the key features of Fluentd including structured logging, reliable forwarding, and a pluggable architecture. The document then summarizes the architectures and new features of different Fluentd versions, including v0.10, v0.12, and the upcoming v0.14 and v1 releases. It also discusses Fluentd's ecosystem and plugins as well as Treasure Data's use of Fluentd in its log data collection and analytics platform.
This document summarizes Masahiro Nakagawa's presentation on Fluentd and Embulk. Fluentd is a data collector for unified logging that allows for streaming data transfer based on JSON. It is written in Ruby and uses plugins to collect, process, and output data. Embulk is a bulk loading tool that allows high performance parallel processing of data to load it into various databases and storage systems. Both tools use a pluggable architecture to provide flexibility in handling different data sources and targets.
Fluentd Unified Logging Layer At Fossasia (N Masahiro)
Masahiro Nakagawa is a senior software engineer at Treasure Data and the main maintainer of Fluentd. Fluentd is a data collector for unified logging that provides a streaming data transfer based on JSON. It has a simple core with plugins written in Ruby to provide functionality like input/output, buffering, parsing, filtering and formatting of data.
- Treasure Data is a cloud data service that provides data acquisition, storage, and analysis capabilities.
- It collects data from various sources using Fluentd and Embulk and stores it in its own columnar database called Plazma DB.
- It offers various computing frameworks like Hive, Pig, and Presto for analytics and visualization with tools like Tableau.
- Presto is an interactive SQL query engine that can query data in HDFS, Hive, Cassandra and other data stores.
Masahiro Nakagawa introduced Fluentd, an open source data collector. Fluentd provides a unified logging layer and collects data through a streaming data transfer based on JSON. It is written in Ruby and uses a plugin architecture to allow for various input and output functions. Fluentd is used in production environments for log aggregation, metrics collection, and data processing tasks.
This document summarizes Masahiro Nakagawa's presentation on Fluentd at the Data Transfer Middleware Meetup #1. It discusses Fluentd's history and architecture, including the core plugins in v0.10 and new features in v0.12 like filtering and labeling. The roadmap is outlined, with v0.14 adding new plugin APIs and v1 focusing on stability. Other projects like Treasure Agent and fluentd-forwarder that comprise the Fluentd ecosystem are also briefly mentioned.
Fluentd: Unified Logging Layer at CWT2014 (N Masahiro)
The document summarizes Masahiro Nakagawa's presentation on Fluentd at the Cloudera World Tokyo conference. Fluentd is an open source log collector written in Ruby that uses a pluggable architecture and JSON format for log messages. It provides unified logging and data processing capabilities. The presentation covered Fluentd's core functionality, related products from Treasure Data, use cases, and the company's roadmap.
The document discusses Presto, an open source distributed SQL query engine for interactive analysis of large datasets. It provides summaries of Presto's capabilities, architecture, and how it addresses issues with other SQL engines on Hadoop like Hive being too slow. Key points include that Presto allows direct querying of data in HDFS without needing to copy it elsewhere, uses a distributed query execution model rather than MapReduce, and supports many connectors and a PostgreSQL gateway.
Fluentd meetup dive into fluent plugin (outdated) (N Masahiro)
Fluentd meetup in Japan. I talked about "Dive into Fluent plugin".
Some contents are outdated. See this slide: http://www.slideshare.net/repeatedly/dive-into-fluentd-plugin-v012
This document provides a summary of a discussion comparing D and C++ programming languages. It includes several points about features of D such as its short name, lack of semicolons, struct syntax, delegate syntax, templates, compile-time function execution, classes, exceptions, contracts, and package management. It also references future potential features for D like the compiler as a library, thread-local garbage collection, and lightweight threads.
2. Who are you?
• Masahiro Nakagawa
• github: @repeatedly
• Treasure Data Inc.
• Fluentd / td-agent developer
• https://jobs.lever.co/treasure-data
• I love OSS :)
• D language, MessagePack, the organizer of several meetups, etc…
13. Columnar Storage

time                 code  method
2015-12-01 10:02:36  200   GET
2015-12-01 10:22:09  404   GET
2015-12-01 10:36:45  200   GET
2015-12-01 10:49:21  200   POST
…                    …     …

• Good data format for analytics workloads
• Read only selected columns, efficient compression
• Not good for insert / update
(Diagram: the same table laid out row-wise vs. column-wise; the storage unit is a whole row in row-oriented storage and a single column in columnar storage.)
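For example, a typical analytics query over the table above touches only a couple of columns. A minimal SQL sketch (the table name access_log is an assumption, not from the slides) of the access pattern columnar storage is optimized for:

    SELECT code, COUNT(*) AS requests
    FROM access_log          -- hypothetical table holding the rows above
    WHERE method = 'GET'
    GROUP BY code;
    -- a columnar store scans only the code and method columns;
    -- a row store has to read every full row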
15. No silver bullet
• Performance depends on data modeling and queries
• distkey and sortkey are important
• they should reduce data transfer and IO cost
• queries should take advantage of these keys (see the sketch below)
• There are still some problems
• cluster scaling, metadata management, etc…
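A minimal Redshift-style sketch (table and column names are assumptions) of how distkey and sortkey are declared:

    CREATE TABLE access_log (
      time    TIMESTAMP,
      user_id BIGINT,
      code    INT,
      method  VARCHAR(8)
    )
    DISTKEY (user_id)   -- rows with the same user_id land on the same node, cutting data transfer for joins/aggregations on user_id
    SORTKEY (time);     -- rows stay sorted by time, so range filters on time skip blocks and reduce IO
    -- queries only benefit if they actually filter on time or join/group on user_id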
16. Performance is good :)
But we often want to change the schema for new workloads,
and then it becomes hard to maintain the schema and its data…
18. Schema on Write (RDBMS)
• Data is written using a schema to improve query performance
• Pros:
• minimum query overhead
• Cons:
• need to design the schema and workload beforehand
• data load is an expensive operation (see the sketch below)
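A minimal RDBMS-style sketch (hypothetical table and values): the schema is applied while writing, so the load validates and lays out the data up front:

    CREATE TABLE access_log (
      time   TIMESTAMP NOT NULL,
      code   INT       NOT NULL,
      method VARCHAR(8)
    );
    -- the (expensive) load step enforces the schema and stores data in its final layout
    INSERT INTO access_log VALUES (TIMESTAMP '2015-12-01 10:02:36', 200, 'GET');
    -- a row that doesn't match the schema is rejected at load time:
    -- INSERT INTO access_log VALUES (TIMESTAMP '2015-12-01 10:22:09', 'oops', 'GET');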
19. Schema on Read (Hadoop)
• Data is written without a schema, and a schema is mapped onto it at query time
• Pros:
• robust against schema and workload changes
• data load is a cheap operation (see the sketch below)
• Cons:
• high overhead at query time
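A minimal Hive-style sketch (the HDFS path and table name are assumptions): raw files are dropped into storage as-is, and the schema is only mapped onto them at query time:

    CREATE EXTERNAL TABLE access_log (
      `time` STRING,
      code   INT,
      method STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 'hdfs:///logs/access/';   -- files already written there, so loading was cheap
    -- the schema is applied while reading:
    SELECT code, COUNT(*) FROM access_log GROUP BY code;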
20. Data Lake
• Schema management is hard
• volume keeps increasing and formats change often
• there are lots of log types
• A feasible approach is storing raw data and converting it before analysis
• A Data Lake is a single storage for any logs
• note that there is no clear definition of the term for now
21. Data Lake Patterns
• Use a DFS, e.g. HDFS, for log storage
• ETL or data processing with the Hadoop ecosystem
• logs can also be converted by ingestion tools beforehand
• Use a Data Lake storage and related tools
• these storages support the Hadoop ecosystem
22. Apache Hadoop
• Distributed computing framework
• First implementation based on Google MapReduce
http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
26. Apache Tez
• Low-level framework for YARN applications
• Hive, Pig, new query engines and more
• Task and DAG based processing flow
(Diagram: a Task wraps Input → Processor → Output; tasks are composed into a DAG.)
27. MapReduce vs Tez
(Diagram: in MapReduce, each map/reduce stage writes its intermediate output to HDFS and the next stage reads it back; in Tez, the same query runs as a single DAG of tasks with no intermediate HDFS writes.)
Example query:
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
  ON (g1.x = g2.x) ORDER BY avg;
(DAG nodes: GROUP a BY a.x, GROUP b BY b.x, JOIN (a, b), ORDER BY.)
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
28. Superstition
• "HDFS and YARN have a SPOF"
• recent versions don't have a SPOF, in either MapReduce 1 or MapReduce 2
• "Can't build Hadoop from scratch"
• really? Treasure Data builds Hadoop on CircleCI; Cloudera, Hortonworks and MapR do too
• they also check the dependent toolchain
29. Which Hadoop package
should we use?
• A distribution from a Hadoop distributor is better
• CDH by Cloudera
• HDP by Hortonworks
• MapR distribution by MapR
• If you are familiar with Hadoop and its ecosystem,
Apache community edition becomes an option.
• For example, Treasure Data has its own patches
and wants to use the patched version.
31. Ingestion tools
• There are two execution models:
• Bulk load:
• for high throughput
• most tools transfer data in batches and in parallel
• Streaming load:
• for low latency
• most tools transfer data in micro-batches
32. Bulk load tools
• Embulk
• Pluggable bulk data loader for
various inputs and outputs
• Write plugins using Java and JRuby
• Sqoop
• Data transfer between Hadoop and RDBMS
• Included in some distributions
• Or a dedicated bulk loader for each data store
33. Streaming load tools
• Fluentd
• Pluggable, JSON-based streaming collector
• Lots of plugins on RubyGems
• Flume
• Mainly for Hadoop ecosystem, HDFS, HBase, …
• Included in some distributions
• Or Logstash, Heka, Splunk, etc…
37. MPP query engine
• It doesn't have its own storage, unlike a parallel RDBMS
• Follow “Schema on Read” approach
• data distribution depends on backend
• data schema also depends on backend
• Some products are called “SQL on Hadoop”
• Presto, Impala, Apache Drill, etc…
• It has its own execution engine and does not use MapReduce.
38. • Distributed Query Engine for interactive queries
against various data sources and large data.
• Pluggable connector for joining multiple backends
• You can join MySQL and HDFS data in one query (see the sketch below)
• Lots of useful functions for data analytics
• window functions, approximate query,
machine learning, etc…
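A sketch of such a federated query (catalog, schema, and table names are assumptions), joining a Hive/HDFS table with a MySQL table and using one of Presto's approximate aggregation functions:

    SELECT u.country,
           approx_distinct(l.user_id) AS uniq_users   -- approximate aggregation
    FROM hive.web.access_log AS l                     -- data in HDFS, via the Hive connector
    JOIN mysql.crm.users     AS u                     -- data in MySQL, via the MySQL connector
      ON l.user_id = u.id
    GROUP BY u.country;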
40. (Diagram: the traditional split into two platforms. Batch analysis platform: daily/hourly batch jobs with Hive on HDFS; interactive query on it is less scalable and incurs extra cost. Visualization platform: commercial BI tools and dashboards backed by PostgreSQL, etc. Drawbacks: more work to manage 2 platforms, and you can't query against "live" data directly.)
44. Execution Model
(Diagram: MapReduce vs Presto. In MapReduce, map and reduce tasks write data to disk and wait between stages. In Presto, all stages are pipelined with memory-to-memory data transfer: no wait time and no disk IO, but also no fault tolerance, and each data chunk must fit in memory.)
45. Okay, now we have a combination of low-latency and batch processing over the same raw data.
53. Push vs Pull
• Push:
• easy to transfer data to multiple destinations
• hard to control the stream rate across multiple streams
• Pull:
• easy to control the stream rate
• consumers must be managed correctly
56. Amazon Redshift
• Parallel RDBMS on AWS
• re-uses traditional parallel RDBMS know-how
• scaling is easier than with traditional systems
• Using it together with Amazon EMR is popular (see the sketch below):
1. Store data into S3
2. EMR processes the S3 data
3. Load the processed data into Redshift
• EMR provides the Hadoop ecosystem
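A sketch of step 3 (bucket, table, and IAM role names are assumptions): loading the files EMR produced in S3 into Redshift with COPY:

    COPY access_log
    FROM 's3://my-bucket/processed/access_log/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    DELIMITER '\t' GZIP;   -- adjust the format options to whatever EMR wrote out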
58. Google BigQuery
• Distributed query engine and scalable storage
• Tree model, Columnar storage, etc…
• Separate storage from workers
• High performance query by Google infrastructure
• Lots of workers
• Storage / IO layer on Colossus
• Can't manage parallel RDBMS properties like distkey,
but it works well in most cases (see the sketch below).
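A minimal sketch (dataset and table names are assumptions): there are no distribution properties to declare, you just write the query and BigQuery's workers and storage layer take care of distribution:

    SELECT code, COUNT(*) AS requests
    FROM `mydataset.access_log`
    GROUP BY code
    ORDER BY requests DESC;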
61. Treasure Data
• Cloud based end-to-end data analytics service
• Hive, Presto, Pig and Hivemall for one big repository
• Lots of ingestion and output options, scheduling, etc…
• No stream processing for now
• Service concept is Data Lake
• JSON based schema-less storage
• Execution model is similar to BigQuery
• Separate storage from workers
• Can’t specify Parallel RDBMS properties
63. Resource Model Trade-off

Model                           Pros                                          Cons
Fully guaranteed                Stable execution; easy to control resources   No boost mechanism
Guaranteed with multi-tenancy   Stable execution; good scalability            Less controllable resources
Fully multi-tenanted            Boosted performance; great scalability        Unstable execution
64. MS Azure also has useful services:
DataHub, SQL DWH, DataLake,
Stream Analytics, HDInsight…
65. Use service or build a platform?
• You should consider using a service first
• AWS, GCP, MS Azure, Treasure Data, etc…
• Important factor is data analytics, not platform
• Do you have enough resources to maintain it?
• If a specific analytics platform is a differentiator,
building a platform is better
• Use state-of-the-art technologies
• Hard to implement on existing platforms
66. Conclusion
• There is a lot of software and many services for data analytics
• lots of trade-offs: performance, complexity, connectivity, execution model, etc.
• SQL is a primary language for data analytics
• Focus on your goal!
• Is a data analytics platform your business core?
If not, consider using services first.
69. Apache Spark
• Another Distributed computing framework
• Mainly for in-memory computing with DAG
• RDD and DataFrame based clean API
• Combination with Hadoop is popular
http://slidedeck.io/jmarin/scala-talk
70. Apache Flink
• Streaming-based execution engine
• supports both batch and pipelined processing
• Hadoop and Spark are batch based
• https://ci.apache.org/projects/flink/flink-docs-master/
71. Batch vs Pipelined
(Diagram: in the batch (staged) model, tasks are grouped into stages (stage1 → stage2 → stage3) that write to disk and wait between stages. In the pipelined model, all stages are pipelined with memory-to-memory data transfer, using disk only if needed: no wait time, and fault tolerance is provided with checkpointing.)
72. Visualization
• Tableau
• Popular BI tool in many areas
• Awesome GUI, easy to use, lots of charts, etc
• Metric Insights
• Dashboard for many metrics
• Scheduled query, custom handler, etc
• Chartio
• Cloud based BI tool
73. How to manage job dependency?
We want to issue Job X
after Job A and Job B are finished.
74. Data pipeline tool
• There are some important features
• Manage job dependency
• Handle job failure and retry
• Easy to define topology
• Separate tasks into sub-tasks
• Apache Oozie, Apache Falcon, Luigi, Airflow, JP1,
etc…
75. Luigi
• Python module for building job pipelines
• Write Python code and run it
• a task is defined as a Python class
• easy to manage with a VCS
• Needs some extra tools
• scheduled jobs, job history, etc…

import luigi

class T1(luigi.Task):
    def requires(self):
        return []                            # upstream task dependencies
    def output(self):
        return luigi.LocalTarget("t1.out")   # where the result is stored
    def run(self):
        with self.output().open("w") as f:   # task body
            f.write("done")
76. Airflow
• Python and DAG based workflow
• you write Python code, but it is for defining a DAG
• a task is defined by an Operator
• There are good features
• management web UI
• task information is stored in a database
• Celery based distributed execution

from airflow import DAG

dag = DAG('example')
t1 = Operator(..., dag=dag)   # "Operator" stands for a concrete class such as BashOperator
t2 = Operator(..., dag=dag)
t2.set_upstream(t1)           # t1 must finish before t2 starts