Facebook generates large amounts of user data daily from activities like status updates, photo uploads, and shared content. This data is stored in Hadoop using Hive for analytics. Some key facts:
- Facebook adds 4TB of new compressed data daily to its Hadoop cluster.
- The cluster has 4800 cores and 5.5PB of storage across 12TB nodes.
- Hive is used for over 7500 jobs daily and by around 200 engineers/analysts monthly.
- Performance improvements to Hive include lazy deserialization, map-side aggregation, and map-side joins.
This document discusses IBM's Elastic Storage product. It provides an overview of Elastic Storage's key features such as extreme scalability, high performance, support for various operating systems and hardware, data lifecycle management capabilities, integration with Hadoop, and editions/pricing. It also compares Elastic Storage to alternative storage solutions and discusses how Elastic Storage can be used to build private and hybrid clouds with OpenStack.
Red Hat Storage Day Boston - Supermicro Super Storage (Red_Hat_Storage)
The document discusses Supermicro's evolution from server and storage innovation to total solution innovation. It provides examples of their all-flash storage servers and Red Hat Ceph reference architectures using Supermicro hardware. The document also discusses optimizing hardware configurations for different workloads and summarizes Supermicro's portfolio of Ceph-ready nodes and turnkey storage solutions.
Red Hat Ceph Storage: Past, Present and Future (Red_Hat_Storage)
Ceph is a massively scalable, open source, software-defined storage system that runs on commodity hardware. Get an update about the latest version of Red Hat Ceph Storage, including information about the newest features and use cases, with a particular focus on cloud storage and OpenStack. We’ll also explore the themes and directions for the roadmap for the next 12 months.
Achieving Separation of Compute and Storage in a Cloud World (Alluxio, Inc.)
Alluxio Tech Talk
Feb 12, 2019
Speaker:
Dipti Borkar, Alluxio
The rise of compute intensive workloads and the adoption of the cloud has driven organizations to adopt a decoupled architecture for modern workloads – one in which compute scales independently from storage. While this enables scaling elasticity, it introduces new problems – how do you co-locate data with compute, how do you unify data across multiple remote clouds, how do you keep storage and I/O service costs down and many more.
Enter Alluxio, a virtual unified file system, which sits between compute and storage that allows you to realize the benefits of a hybrid cloud architecture with the same performance and lower costs.
In this webinar, we will discuss:
- Why leading enterprises are adopting hybrid cloud architectures with compute and storage disaggregated
- The new challenges that this new paradigm introduces
- An introduction to Alluxio and the unified data solution it provides for hybrid environments
This document discusses HDFS Erasure Coding and its usage at Yahoo Japan. It begins with an overview of erasure coding, how it is implemented in HDFS, and how it compares to replication. Test results show that write performance is lower for erasure coding while read performance is similar. Yahoo Japan uses erasure coding for cold weblog data, reducing storage costs by 65% compared to replication. Future plans include supporting additional codecs and features to improve usability.
Scalable and High available Distributed File System Metadata Service Using gR... (Alluxio, Inc.)
Alluxio Community Office Hour
Apr 7, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker: Bin Fan
Alluxio (alluxio.io) is an open-source data orchestration system that provides a single namespace federating multiple external distributed storage systems. It is critical for Alluxio to be able to store and serve the metadata of all files and directories from all mounted external storage both at scale and at speed.
This talk shares our design, implementation, and optimization of the Alluxio metadata service (master node) to address these scalability challenges. In particular, we focus on how to apply and combine techniques including tiered metadata storage (based on RocksDB, an off-heap KV store), a fine-grained file system inode tree locking scheme, an embedded replicated state machine (based on Raft), and exploration and performance tuning of RPC frameworks (Thrift vs. gRPC). As a result of these combined techniques, Alluxio 2.0 can store at least 1 billion files with a significantly reduced memory requirement, serving 3000 workers and 30000 clients concurrently.
In this Office Hour, we will go over:
- Metadata storage challenges
- How to combine different open source technologies as building blocks
- The design, implementation, and optimization of Alluxio metadata service
Red Hat Storage Day New York - New Reference Architectures (Red_Hat_Storage)
The document provides an overview and summary of Red Hat's reference architecture work including MySQL and Hadoop, software-defined NAS, and digital media repositories. It discusses trends toward disaggregating Hadoop compute and storage and various data flow options. It also summarizes performance testing Red Hat conducted comparing AWS EBS and Ceph for MySQL workloads, and analyzing factors like IOPS/GB ratios, core-to-flash ratios, and pricing. Server categories and vendor examples are defined. Comparisons of throughput and costs at scale between software-defined scale-out storage and traditional enterprise NAS solutions are also presented.
In 2018's user conference keynote MariaDB CEO, Michael Howard, announced an initiative to build a MariaDB DBaaS platform. In this session, the DBaaS team shares how MariaDB is approaching DBaaS, then discusses the role of containers and Kubernetes, the need for infrastructure-agnostic provisioning, support for day-two operations and enterprise requirements for large-scale DBaaS deployments.
Cisco UCS Integrated Infrastructure for Big Data with Cassandra (DataStax Academy)
With the growing popularity of big data, it becomes imperative for enterprises to adopt the right platform for their workload, with efficient and user-friendly management of large-scale clusters. In this session we will explore Cisco's revolutionary innovations that deliver leading-edge infrastructure, well suited for database platforms like Cassandra and purpose-built for performance and scalability. This enables our customers to unlock the intelligence in their data. Not only does this provide a sustainable competitive advantage to their business, it also scales with their growing business needs.
Red Hat Storage Day Seattle: Why Software-Defined Storage Matters (Red_Hat_Storage)
The document discusses the benefits of software-defined storage over traditional storage approaches. It argues that software-defined storage uses standard hardware and open source software, providing flexibility, scalability, and lower costs compared to proprietary appliances or public cloud storage. It also describes Red Hat's portfolio of software-defined storage solutions, including Ceph and Gluster, which leverage open source technologies to power a variety of enterprise workloads.
The document discusses IBM's Power Systems as an expert platform for artificial intelligence. Some key points:
- Power Systems are designed for modern AI workloads, with accelerated computing capabilities like GPUs and FPGAs.
- The IBM Power AC922 server provides an "acceleration superhighway" between CPUs, GPUs, and other accelerators for optimal AI performance.
- Tests show the AC922 can reduce AI model training times by 3.8x compared to x86 systems, thanks to features like high bandwidth NVLink connections between components.
- IBM's PowerAI software tools help make AI development easier on the Power platform.
Software-defined storage (SDS) refers to a software controller that manages and virtualizes physical storage in order to control how data is stored.
Red Hat Storage Day New York - Red Hat Gluster Storage: Historical Tick Data ... (Red_Hat_Storage)
Red Hat Gluster Storage is a software-defined, distributed, scale-out file storage solution that is cost-efficient, high performing at scale, and easy to deploy, manage and scale in public, private and hybrid cloud environments. It offers mature NFS, SMB and HDFS interfaces for enterprise applications such as analytics, media streaming, active archives and enterprise virtualization. The document discusses using Red Hat Gluster Storage for historical tick data repositories, including its architecture, benefits over traditional storage solutions, and analytics workflows.
The document discusses IBM Spectrum Scale, a software-defined storage solution from IBM. It provides:
1) A family of software-defined storage products including IBM Spectrum Control, IBM Spectrum Protect, IBM Spectrum Archive, IBM Spectrum Virtualize, IBM Spectrum Accelerate, and IBM Spectrum Scale.
2) IBM Spectrum Scale allows storing data everywhere and running applications anywhere. It provides highly scalable, high-performance storage for files, objects, and analytics workloads.
3) The document provides an overview of the IBM Spectrum Scale product and its capabilities for optimizing storage costs, improving data protection, enabling global collaboration, and ensuring data availability, integrity and security.
Red Hat Storage Day Dallas - Why Software-defined Storage Matters (Red_Hat_Storage)
This document discusses the evolution of storage from traditional appliances to software-defined storage. It notes that many IT decision makers find current storage capabilities inadequate and unable to handle emerging workloads. Traditional appliances face issues like vendor lock-in, lack of flexibility, and high costs. Public cloud storage is more scalable but still has complexity and limitations. The document then introduces software-defined storage as an open solution with standardized platforms that addresses these issues through increased cost efficiency, provisioning speed, and deployment options with less vendor lock-in and skill requirements. It describes Red Hat's portfolio of Ceph and Gluster open source software-defined storage solutions and their target use cases.
Red Hat Storage Day Dallas - Gluster Storage in Containerized Application (Red_Hat_Storage)
The document discusses using Gluster Storage to provide storage for containerized applications in a Kubernetes cluster. It outlines the challenges of replatforming an ecommerce site to use open source technologies, applying RAS(S) principles, and having a scalable and fault-tolerant solution. The plan is to use Docker containers, Kubernetes for orchestration, and GlusterFS storage. GlusterFS provides highly available, replicated storage across all Kubernetes nodes to support the storage needs of containerized applications.
Ibm spectrum scale_backup_n_archive_v03_ash (Ashutosh Mate)
IBM Spectrum Scale can be used as both the source and destination for backup and archiving. As a source, Spectrum Scale data can be backed up to products like Spectrum Protect, Spectrum Archive, and third-party backup software. As a destination, Spectrum Protect can use Spectrum Scale and ESS storage for storing backed up or archived data, providing scalability, performance, and cost benefits over other solutions. Case studies demonstrate how large enterprises and regional hospital networks have consolidated backup infrastructure and improved availability, capacity, and backup/restore speeds by combining Spectrum Scale and Spectrum Protect.
Heterogeneous Computing: The Future of Systems (Anand Haridass)
Charts from NITK-IBM Computer Systems Research Group (NCSRG)
- Dennard Scaling, Moore's Law, OpenPOWER, Storage Class Memory, FPGA, GPU, CAPI, OpenCAPI, NVIDIA NVLink, and heterogeneous system usage at Google and Microsoft
Severalnines Training: MySQL® Cluster - Part IX (Severalnines)
This document discusses best practices for designing a MySQL Cluster database infrastructure. It recommends dedicating instances for data and API nodes and not co-locating them. The number of nodes depends on storage, throughput and redundancy requirements. Hardware recommendations include fast CPUs, RAM sized for the dataset, and SSDs or RAID for storage. Performance planning requires benchmarking typical workloads to determine if resources need scaling. The document provides formulas and tools to help calculate storage and memory needs.
Hw09 Hadoop Development At Facebook Hive And Hdfs (Cloudera, Inc.)
This document discusses Hadoop and Hive development at Facebook, including how they generate large amounts of user data daily, how they store the data in Hadoop clusters, and how they use Hive as a data warehouse to efficiently run SQL queries on the Hadoop data using a SQL-like language. It also outlines some of Hive's architecture and features like partitioning, buckets, and UDF/UDAF support, as well as its performance improvements over time and future planned work.
- Hadoop was created to allow processing of large datasets in a distributed, fault-tolerant manner. It was originally developed by Doug Cutting and Mike Cafarella as part of the Nutch project, in response to growing data volumes and computational needs.
- The core of Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for distributed processing. It also includes utilities like Hadoop Common for file system access and other basic functionality.
- Hadoop's goals were to process multi-petabyte datasets across commodity hardware in a reliable, flexible and open source way. It assumes failures are expected and handles them to provide fault tolerance.
The document provides an overview of Apache Hadoop and related big data technologies. It discusses Hadoop components like HDFS for storage, MapReduce for processing, and HBase for columnar storage. It also covers related projects like Hive for SQL queries, ZooKeeper for coordination, and Hortonworks and Cloudera distributions.
Hive provides an SQL-like interface to query data stored in Hadoop's HDFS distributed file system and processed using MapReduce. It allows users without MapReduce programming experience to write queries that Hive then compiles into a series of MapReduce jobs. The document discusses Hive's components, data model, query planning and optimization techniques, and performance compared to other frameworks like Pig.
The document discusses the Hadoop ecosystem. It provides an overview of Hadoop and its core components HDFS and MapReduce. HDFS is the storage component that stores large files across nodes in a cluster. MapReduce is the processing framework that allows distributed processing of large datasets in parallel. The document also discusses other tools in the Hadoop ecosystem like Hive, Pig, and Hadoop distributions from companies. It provides examples of running MapReduce jobs and accessing HDFS from the command line.
This document summarizes Hoodie, an open source incremental processing framework. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives like upsert and incremental pull to apply mutations and consume only changed data.
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers. It addresses problems like massive data storage needs and scalable processing of large datasets. Hadoop uses the Hadoop Distributed File System (HDFS) for storage and MapReduce as its processing engine. HDFS stores data reliably across commodity hardware and MapReduce provides a programming model for distributed computing of large datasets.
This document provides an introduction to Hadoop, including its ecosystem, architecture, key components like HDFS and MapReduce, characteristics, and popular flavors. Hadoop is an open source framework that efficiently processes large volumes of data across clusters of commodity hardware. It consists of HDFS for storage and MapReduce as a programming model for distributed processing. A Hadoop cluster typically has a single namenode and multiple datanodes. Many large companies use Hadoop to analyze massive datasets.
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It allows for the storage and analysis of datasets that are too large for single servers. The document discusses several key Hadoop components including HDFS for storage, MapReduce for processing, HBase for column-oriented storage, Hive for SQL-like queries, Pig for data flows, and Sqoop for data transfer between Hadoop and relational databases. It provides examples of how each component can be used and notes that Hadoop is well-suited for large-scale batch processing of data.
This document provides an overview of the Hadoop/MapReduce/HBase framework and its applications in bioinformatics. It discusses Hadoop and its components, how MapReduce programs work, HBase which enables random access to Hadoop data, related projects like Pig and Hive, and examples of applications in bioinformatics and benchmarking of these systems.
This document provides an overview of an advanced Big Data hands-on course covering Hadoop, Sqoop, Pig, Hive and enterprise applications. It introduces key concepts like Hadoop and large data processing, demonstrates tools like Sqoop, Pig and Hive for data integration, querying and analysis on Hadoop. It also discusses challenges for enterprises adopting Hadoop technologies and bridging the skills gap.
Apache Spark is written in the Scala programming language, which compiles program code into bytecode for the JVM; this is what powers Spark big data processing.
For Spark big data processing in Python, the open source community has developed a utility known as PySpark.
Hive is used at Facebook for data warehousing and analytics tasks on a large Hadoop cluster. It allows SQL-like queries on structured data stored in HDFS files. Key features include schema definitions, data summarization and filtering, extensibility through custom scripts and functions. Hive provides scalability for Facebook's rapidly growing data needs through its ability to distribute queries across thousands of nodes.
This document provides an overview of the Hadoop MapReduce Fundamentals course. It discusses what Hadoop is, why it is used, common business problems it can address, and companies that use Hadoop. It also outlines the core parts of Hadoop distributions and the Hadoop ecosystem. Additionally, it covers common MapReduce concepts like HDFS, the MapReduce programming model, and Hadoop distributions. The document includes several code examples and screenshots related to Hadoop and MapReduce.
Best Hadoop Institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses taught by real-time faculty.
Optimizing Big Data to run in the Public Cloud (Qubole)
Qubole is a cloud-based platform that allows customers to easily run Hadoop and Spark clusters on AWS for big data analytics. It optimizes performance and reduces costs through techniques like caching data in S3 for faster access, using spot instances, and directly writing query outputs to S3. The document discusses Qubole's features, capabilities, and how it provides an easier way for more users like data scientists and analysts to access and query big data compared to building and managing Hadoop clusters themselves.
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi... (DataWorks Summit)
Back in 2014, our team set out to change the way the world exchanges and collaborates with data. Our vision was to build a single tenant environment for multiple organisations to securely share and consume data. And we did just that, leveraging multiple Hadoop technologies to help our infrastructure scale quickly and securely.
Today Data Republic’s technology delivers a trusted platform for hundreds of enterprise level companies to securely exchange, commercialise and collaborate with large datasets.
Join Head of Engineering, Juan Delard de Rigoulières and Senior Solutions Architect, Amin Abbaspour as they share key lessons from their team’s journey with Hadoop:
* How a startup leveraged a clever combination of Hadoop technologies to build a secure data exchange platform
* How Hadoop technologies helped us deliver key solutions around governance, security and controls of data and metadata
* An evaluation of the maturity and usefulness of some Hadoop technologies in our environment: Hive, HDFS, Spark, Ranger, Atlas, Knox, Kylin; we've used them all extensively.
* Our bold approach to expose APIs directly to end users; as well as the challenges, learning and code we created in the process
* Learnings from the front-line: How our team coped with code changes, performance tuning, issues and solutions while building our data exchange
Whether you’re an enterprise level business or a start-up looking to scale - this case study discussion offers behind-the-scenes lessons and key tips when using Hadoop technologies to manage data governance and collaboration in the cloud.
Speakers:
Juan Delard De Rigoulieres, Head of Engineering, Data Republic Pty Ltd
Amin Abbaspour, Senior Solutions Architect, Data Republic
This document provides an overview of big data concepts, including NoSQL databases, batch and real-time data processing frameworks, and analytical querying tools. It discusses scalability challenges with traditional SQL databases and introduces horizontal scaling with NoSQL systems like key-value, document, column, and graph stores. MapReduce and Hadoop are described for batch processing, while Storm is presented for real-time processing. Hive and Pig are summarized as tools for running analytical queries over large datasets.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable, and distributed processing of petabytes of data. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Many large companies use Hadoop for applications such as log analysis, web indexing, and data mining of large datasets.
Similar to Hadoop and Hive Development at Facebook
ClojureScript allows developers to use the Clojure programming language to build applications that compile to JavaScript. This enables Clojure code to run in environments where JavaScript is supported, like web browsers and mobile apps. ClojureScript leverages the Google Closure compiler and library to provide whole program optimization of Clojure code compiling to JavaScript.
Why you should be excited about ClojureScript (elliando dias)
ClojureScript allows Clojure code to compile to JavaScript. Created by Rich Hickey and friends, it provides optimizations for performance while maintaining readability and abstraction. As a Lisp for JavaScript, ClojureScript controls complexity on the web and benefits from JavaScript's status as a compilation target for many languages.
Functional Programming with Immutable Data Structures (elliando dias)
1. The document discusses the advantages of functional programming with immutable data structures for multi-threaded environments. It argues that shared mutable data and variables are fundamentally flawed concepts that can lead to bugs, while immutable data avoids these issues.
2. It presents Clojure as a functional programming language that uses immutable persistent data structures and software transactional memory to allow for safe, lock-free concurrency. This approach allows readers and writers to operate concurrently without blocking each other.
3. The document makes the case that Lisp parentheses in function calls uniquely define the tree structure of computations and enable powerful macro systems, homoiconicity, and structural editing of code.
The document lists and describes the main parts of a dry cargo container, including the front panel, sides, rear, roof, floor, and understructure. Many components, such as side panels and roof and floor crossmembers, are numbered according to their location. The rear doors comprise frames, panels, hinges, and locking bars.
The document discusses the history of projective geometry, from Euclid to its use in computer graphics. It covers key figures such as Pascal, a pioneer in the field, and how perspective has been applied in the arts over the centuries.
Polyglot and Poly-paradigm Programming for Better Agility (elliando dias)
This document discusses the benefits of polyglot and poly-paradigm programming approaches for building more agile applications. It describes how using multiple languages and programming paradigms can optimize both performance and developer productivity. Specifically, it suggests that statically-typed compiled languages be used for core application components while dynamically-typed scripting languages connect and customize these components. This approach allows optimizing areas that require speed/efficiency separately from those requiring flexibility. The document also advocates aspects and functional programming to address cross-cutting concerns and concurrency challenges that arise in modern applications.
This document discusses JavaScript libraries and frameworks. It provides an overview of some popular options like jQuery, Prototype, Dojo, MooTools, and YUI. It explains why developers use libraries, such as for faster development, cross-browser compatibility, and animation capabilities. The document also discusses how libraries resemble CSS and use selector syntax. Basic examples are provided to demonstrate common tasks like hover effects and row striping. Factors for choosing a library are outlined like maturity, documentation, community, and licensing. The document concludes by explaining how to obtain library code from project websites or Google's AJAX Libraries API.
How to Make an Eight Bit Computer and Save the World! (elliando dias)
This document summarizes a talk given to introduce an open source 8-bit computer project called the Humane Reader. The talk outlines the goals of providing a cheap e-book reader and computing platform using open source tools. It describes the hardware design which uses an AVR microcontroller and interfaces like video output, SD card, and USB. The talk also covers using open source tools for development and sourcing low-cost fabrication and assembly. The overall goals are to create an inexpensive device that can provide educational resources in developing areas.
Ragel is a parser generator that compiles to various host languages including Ruby. It is useful for parsing protocols and data formats and provides faster parsing than regular expressions or full LALR parsers. Several Ruby projects like Mongrel and Hpricot use Ragel for tasks like HTTP request parsing and HTML parsing. When using Ragel with Ruby, it can be compiled to Ruby code directly, which is slow, or a C extension can be written for better performance. The C extension extracts the parsed data from Ragel and makes it available to Ruby.
A Practical Guide to Connecting Hardware to the Web (elliando dias)
This document provides an overview of connecting hardware devices to the web using the Arduino platform. It discusses trends in electronics and computing that make this easier, describes the Arduino hardware and software, and covers various connection methods including directly to a computer, via wireless modems, Ethernet shields, and services like Pachube that allow sharing sensor data over the internet. The document aims to demonstrate how Arduinos can communicate with other devices and be used to build interactive systems.
The document introduces Arduino, an open-source development platform. It discusses Arduino's features and components, including microcontrollers, software, and code examples. It also provides basic instructions on how to program the Arduino using the C language.
The document presents an introductory mini-course on Arduino, covering what the Arduino platform is, how its hardware is structured, how to program it, basic code examples, and possible applications such as home automation and robotics.
The document discusses various functions for working with datasets in the Incanter library for Clojure. It describes how to create, read, save, select rows and columns from, and sort datasets. Functions are presented for building datasets from sequences, reading datasets from files and URLs, saving datasets to files and databases, selecting single or multiple columns, and filtering rows based on conditions. The document also provides an overview of the Incanter library and its various namespaces for statistics, charts, and other functionality.
Rango is a lightweight Ruby web framework built on Rack that aims to be more robust than Sinatra but smaller than Rails or Merb. It is inspired by Django and Merb, uses Ruby 1.9, and supports features like code reloading, Bundler, routing, rendering, and HTTP error handling. The documentation provides examples and details on using Rango.
Fab.in.a.box - Fab Academy: Machine Design (elliando dias)
This document describes the design of a multifab machine called MTM. It includes descriptions of the XY stage and Z axis drive mechanisms, as well as the tool heads and network used to control the machine. Key aspects of the design addressed include the stepper motor selection, drive electronics, motion control firmware, and use of a virtual machine environment and circular buffer to enable distributed control of the machine. Strengths of the design include low inertia enabling high acceleration, while weaknesses include low basic resolution and stiffness unsuitable for heavy milling.
The document discusses using Clojure for Hadoop programming. Clojure is a dynamic functional programming language that runs on the Java Virtual Machine. The document provides an overview of Clojure and how its features like immutability and concurrency make it well-suited for Hadoop. It then shows examples of implementing Hadoop MapReduce jobs using Clojure by defining mapper and reducer functions.
This document provides an overview of Hadoop, including:
1) Hadoop solves the problems of analyzing massively large datasets by distributing data storage and analysis across multiple machines to tolerate node failure.
2) Hadoop uses HDFS for distributed data storage, which shards massive files across data nodes with replication for fault tolerance, and MapReduce for distributed data analysis by sending code to the data.
3) The document demonstrates MapReduce concepts like map, reduce, and their composition with an example job.
Multi-core Parallelization in Clojure - a Case Study (elliando dias)
The document describes a case study on using Clojure for multi-core parallelization of the K-means clustering algorithm. It provides background on parallel programming concepts, an introduction to Clojure, and details on how the authors implemented a parallel K-means algorithm in Clojure using agents and software transactional memory. They present results showing speedups from parallelization and accuracy comparable to R's implementation on both synthetic and real-world datasets.
From Lisp to Clojure/Incanter and R: An Introduction (elliando dias)
This document provides a comparison between the statistical computing languages R and Clojure/Incanter. It discusses the histories and philosophies behind Lisp, Fortran, R and Clojure. Key differences noted are that Clojure runs on the Java Virtual Machine, allowing it to leverage Java libraries, while R is primarily written in C and Fortran. Incanter is presented as a Clojure-based platform for statistical computing and graphics that is more immature than R but allows easier access to Java capabilities. Basic syntax comparisons are provided.
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification (TrustArc)
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution, TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrates industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
How UiPath Discovery Suite supports identification of Agentic Process Automat... (DianaGray10)
📚 Understand the basics of the new persona-based, LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
Keynote: AI & Future of Offensive Security (Priyanka Aash)
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
It's your unstructured data: How to get your GenAI app to production (and spe... (Zilliz)
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence (Quentin Reul)
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and securing a formidable competitive advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
Self-Healing Test Automation Framework - Healenium (Knoldus Inc.)
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
Retrieval Augmented Generation Evaluation with Ragas (Zilliz)
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
UiPath Community Day Amsterdam: Code, Collaborate, Connect (UiPathCommunity)
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
1. Hadoop and Hive Development at Facebook
Dhruba Borthakur Zheng Shao
{dhruba, zshao}@facebook.com
Presented at Hadoop World, New York
October 2, 2009
3. Who generates this data?
Lots of data is generated on Facebook
– 300+ million active users
– 30 million users update their statuses at least once each day
– More than 1 billion photos uploaded each month
– More than 10 million videos uploaded each month
– More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
4. Data Usage
Statistics per day:
– 4 TB of compressed new data added per day
– 135 TB of compressed data scanned per day
– 7500+ Hive jobs on production cluster per day
– 80K compute hours per day
Barrier to entry is significantly reduced:
– New engineers go through a Hive training session
– ~200 people/month run jobs on Hadoop/Hive
– Analysts (non-engineers) use Hadoop through Hive
5. Where is this data stored?
Hadoop/Hive Warehouse
– 4800 cores, 5.5 PetaBytes
– 12 TB per node
– Two-level network topology
1 Gbit/sec from node to rack switch
4 Gbit/sec to top level rack switch
6. Data Flow into Hadoop Cloud
(Diagram: network, storage, and servers; Web Servers feed a Scribe MidTier, and data flows among Oracle RAC, the Hadoop Hive Warehouse, and MySQL)
8. HDFS Raid
Start the same: triplicate every data block
Background encoding
– Combine third replicas of blocks from a single file to create a parity block
– Remove third replica
– Apache JIRA HDFS-503
DiskReduce from CMU
– Garth Gibson research
(Diagram: a file with three blocks A, B and C; the third replicas of A, B, C are replaced by a single parity block A+B+C)
http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
9. Archival: Move old data to cheap storage
(Diagram: Hive queries run against the Hadoop Warehouse; a Hadoop Archive Node moves old data over NFS to cheap NAS in a Hadoop Archival Cluster)
http://issues.apache.org/jira/browse/HDFS-220
10. Dynamic-size MapReduce Clusters
Why multiple compute clouds in Facebook?
– Users unaware of resources needed by job
– Absence of flexible Job Isolation techniques
– Provide adequate SLAs for jobs
Dynamically move nodes between clusters
– Based on load and configured policies
– Apache Jira MAPREDUCE-1044
11. Resource Aware Scheduling (Fair Share Scheduler)
We use the Hadoop Fair Share Scheduler
– Scheduler unaware of memory needed by job
Memory and CPU aware scheduling
– Real-time gathering of CPU and memory usage
– Scheduler analyzes memory consumption in realtime
– Scheduler fair-shares memory usage among jobs
– Slot-less scheduling of tasks (in future)
– Apache Jira MAPREDUCE-961
12. Hive – Data Warehouse
Efficient SQL to Map-Reduce Compiler
Mar 2008: Started at Facebook
May 2009: Release 0.3.0 available
Now: Preparing for release 0.4.0
Accounts for 95%+ of Hadoop jobs @ Facebook
Used by ~200 engineers and business analysts at Facebook every month
14. Hive DDL
DDL
– Complex columns
– Partitions
– Buckets
Example
– CREATE TABLE sales (
id INT,
items ARRAY<STRUCT<id:INT, name:STRING>>,
extra MAP<STRING, STRING>
) PARTITIONED BY (ds STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;
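Partitioning pays off at query time: a predicate on the partition column lets Hive read only the matching partition directories instead of the whole table. A minimal sketch against the table above (the ds value and the map key 'source' are invented for illustration):
  SELECT id, extra['source']
  FROM sales
  WHERE ds = '2009-10-01';   -- only the ds='2009-10-01' partition is scanned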
15. Hive Query Language
SQL
– Where
– Group By
– Equi-Join
– Sub query in from clause
Example
– SELECT r.*, s.*
FROM r JOIN (
SELECT key, count(1) as count
FROM s
GROUP BY key) s
ON r.key = s.key
WHERE s.count > 100;
16. Group By
4 different plans based on:
– Does data have skew?
– partial aggregation
Map-side hash aggregation
– In-memory hash table in mapper to do partial
aggregations
2-map-reduce aggregation
– For distinct queries with skew and large cardinality
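Which of these plans Hive picks is driven by configuration rather than query syntax. A minimal sketch (the parameter names are Hive's; the query reuses table s from the earlier example, with uid as an assumed column):
  set hive.map.aggr = true;            -- partial aggregation in an in-memory hash table in the mapper
  set hive.groupby.skewindata = true;  -- 2-map-reduce plan for skewed keys
  SELECT key, count(DISTINCT uid)
  FROM s
  GROUP BY key;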
17. Join
Normal map-reduce Join
– Mapper sends all rows with the same key to a single reducer
– Reducer does the join
Map-side Join
– Mapper loads the whole small table and a portion of the big table
– Mapper does the join
– Much faster than map-reduce join
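In the Hive of this era, a map-side join is requested explicitly with a MAPJOIN hint naming the small table. A minimal sketch reusing r and s from the earlier example, assuming s fits in mapper memory:
  SELECT /*+ MAPJOIN(s) */ r.key, s.key
  FROM r JOIN s ON r.key = s.key;   -- s is loaded whole into every mapper; no reduce phase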
18. Sampling
Efficient sampling
– Table can be bucketed
– Each bucket is a file
– Sampling can choose some buckets
Example
– SELECT product_id, sum(price)
FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
GROUP BY product_id
19. Multi-table Group-By/Insert
FROM users
INSERT INTO TABLE pv_gender_sum
SELECT gender, count(DISTINCT userid)
GROUP BY gender
INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
SELECT age, count(DISTINCT userid)
GROUP BY age
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
SELECT country, gender, count(DISTINCT userid)
GROUP BY country, gender;
20. File Formats
TextFile:
– Easy for other applications to write/read
– Gzip text files are not splittable
SequenceFile:
– Only Hadoop can read it
– Supports splittable compression
RCFile: Block-based columnar storage
– Use SequenceFile block format
– Columnar storage inside a block
– 25% smaller compressed size
– On-par or better query performance depending on the query
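The format is a per-table choice made with the STORED AS clause at table-creation time. A minimal sketch with invented table names and a trivial schema:
  CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;      -- readable by any tool; gzipped files are not splittable
  CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE;   -- Hadoop-only, but supports splittable compression
  CREATE TABLE logs_rc (line STRING) STORED AS RCFILE;          -- columnar storage inside SequenceFile-style blocks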
21. SerDe
Serialization/Deserialization
Row Format
– CSV (LazySimpleSerDe)
– Thrift (ThriftSerDe)
– Regex (RegexSerDe)
– Hive Binary Format (LazyBinarySerDe)
LazySimpleSerDe and LazyBinarySerDe
– Deserialize the field when needed
– Reuse objects across different rows
– Text and Binary format
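A SerDe is bound to a table through the ROW FORMAT SERDE clause. A minimal sketch using the contrib RegexSerDe; the jar path, table schema, and regex are invented for illustration:
  ADD JAR hive_contrib.jar;
  CREATE TABLE request_log (host STRING, request STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
  STORED AS TEXTFILE;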
22. UDF/UDAF
Features:
– Use either Java or Hadoop Objects (int, Integer, IntWritable)
– Overloading
– Variable-length arguments
– Partial aggregation for UDAF
Example UDF:
– public class UDFExampleAdd extends UDF {
public int evaluate(int a, int b) {
return a + b;
}
}
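Using the UDF from Hive is a matter of registering the compiled class under a function name. A minimal sketch (the jar name is invented; the class name is the one defined above, assumed to be in the default package):
  ADD JAR udf_example.jar;
  CREATE TEMPORARY FUNCTION example_add AS 'UDFExampleAdd';
  SELECT example_add(1, 2) FROM t;   -- returns 3 for every row of t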
23. Hive – Performance
Date SVN Revision Major Changes Query A Query B Query C
2/22/2009 746906 Before Lazy Deserialization 83 sec 98 sec 183 sec
2/23/2009 747293 Lazy Deserialization 40 sec 66 sec 185 sec
3/6/2009 751166 Map-side Aggregation 22 sec 67 sec 182 sec
4/29/2009 770074 Object Reuse 21 sec 49 sec 130 sec
6/3/2009 781633 Map-side Join * 21 sec 48 sec 132 sec
8/5/2009 801497 Lazy Binary Format * 21 sec 48 sec 132 sec
QueryA: SELECT count(1) FROM t;
QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
QueryC: SELECT * FROM t;
Map-side time only (including GzipCodec compression/decompression)
* These two features need to be tested with other queries.
24. Hive – Future Works
Indexes
Create table as select
Views / variables
Explode operator
In/Exists sub queries
Leverage sort/bucket information in Join