The document discusses Facebook's use of HBase to store messaging data. It provides an overview of HBase, including its data model, performance characteristics, and how it was a good fit for Facebook's needs due to its ability to handle large volumes of data, high write throughput, and efficient random access. It also describes some enhancements Facebook made to HBase to improve availability, stability, and performance. Finally, it briefly mentions Facebook's migration of messaging data from MySQL to their HBase implementation.
From: DataWorks Summit 2017 - Munich - 20170406
HBase hast established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, overing advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the an upcoming second release of the speakers HBase book, correlated with the practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
Sqoop on Spark provides a way to run Sqoop jobs using Apache Spark for parallel data ingestion. It allows Sqoop jobs to leverage Spark's speed and growing community. The key aspects covered are:
- Sqoop jobs can be created and executed on Spark by initializing a Spark context and wrapping Sqoop and Spark initialization.
- Data is partitioned and extracted in parallel using Spark RDDs and map transformations calling Sqoop connector APIs.
- Loading also uses Spark RDDs and map transformations to parallelly load data calling connector load APIs.
- Microbenchmarks show Spark-based ingestion can be significantly faster than traditional MapReduce-based Sqoop for large datasets
Data Build Tool (DBT) is an open source technology to set up your data lake using best practices from software engineering. This SQL first technology is a great marriage between Databricks and Delta. This allows you to maintain high quality data and documentation during the entire datalake life-cycle. In this talk I’ll do an introduction into DBT, and show how we can leverage Databricks to do the actual heavy lifting. Next, I’ll present how DBT supports Delta to enable upserting using SQL. Finally, we show how we integrate DBT+Databricks into the Azure cloud. Finally we show how we emit the pipeline metrics to Azure monitor to make sure that you have observability over your pipeline.
This document discusses tuning HBase and HDFS for performance and correctness. Some key recommendations include:
- Enable HDFS sync on close and sync behind writes for correctness on power failures.
- Tune HBase compaction settings like blockingStoreFiles and compactionThreshold based on whether the workload is read-heavy or write-heavy.
- Size RegionServer machines based on disk size, heap size, and number of cores to optimize for the workload.
- Set client and server RPC chunk sizes like hbase.client.write.buffer to 2MB to maximize network throughput.
- Configure various garbage collection settings in HBase like -Xmn512m and -XX:+UseCMSInit
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
The slides explain how shuffle works in Spark and help people understand more details about Spark internal. It shows how the major classes are implemented, including: ShuffleManager (SortShuffleManager), ShuffleWriter (SortShuffleWriter, BypassMergeSortShuffleWriter, UnsafeShuffleWriter), ShuffleReader (BlockStoreShuffleReader).
This document introduces Apache Cassandra, a distributed column-oriented NoSQL database. It discusses Cassandra's architecture, data model, query language (CQL), and how to install and run Cassandra. Key points covered include Cassandra's linear scalability, high availability and fault tolerance. The document also demonstrates how to use the nodetool utility and provides guidance on backing up and restoring Cassandra data.
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
Nowadays, people are creating, sharing and storing data at a faster pace than ever before, effective data compression / decompression could significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics, and it has large amount of data storing and shuffling across cluster in runtime, the data compression/decompression codecs can impact the end to end application performance in many ways.
However, there’s a trade-off between the storage size and compression/decompression throughput (CPU computation). Balancing the data compress speed and ratio is a very interesting topic, particularly while both software algorithms and the CPU instruction set keep evolving. Apache Spark provides a very flexible compression codecs interface with default implementations like GZip, Snappy, LZ4, ZSTD etc. and Intel Big Data Technologies team also implemented more codecs based on latest Intel platform like ISA-L(igzip), LZ4-IPP, Zlib-IPP and ZSTD for Apache Spark; in this session, we’d like to compare the characteristics of those algorithms and implementations, by running different micro workloads as well as end to end workloads, based on different generations of Intel x86 platform and disk.
It’s supposedly to be the best practice for big data software engineers to choose the proper compression/decompression codecs for their applications, and we also will present the methodologies of measuring and tuning the performance bottlenecks for typical Apache Spark workloads.
The document summarizes Apache Phoenix and its past, present, and future as a SQL interface for HBase. It describes Phoenix's architecture and key features like secondary indexes, joins, aggregations, and transactions. Recent releases added functional indexes, the Phoenix Query Server, and initial transaction support. Future plans include improvements to local indexes, integration with Calcite and Hive, and adding JSON and other SQL features. The document aims to provide an overview of Phoenix's capabilities and roadmap for building a full-featured SQL layer over HBase.
Storing time series data with Apache CassandraPatrick McFadin
If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself as a solid choice now you can learn how to do it. We'll look at possible data models and the the choices you have to be successful. Then, let's open the hood and learn about how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
From DataEngConf 2017 - Everybody wants to get to data faster. As we move from more general solution to specific optimization techniques, the level of performance impact grows. This talk will discuss how layering in-memory caching, columnar storage and relational caching can combine to provide a substantial improvement in overall data science and analytical workloads. It will include a detailed overview of how you can use Apache Arrow, Calcite and Parquet to achieve multiple magnitudes improvement in performance over what is currently possible.
Accelerating query processing with materialized views in Apache HiveDataWorks Summit
Over the last few years, the Apache Hive community has been working on advancements to enable a full new range of use cases for the project, moving from its batch processing roots towards a SQL interactive query answering platform. Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the precomputation of relevant summaries or materialized views.
This talk presents our work on introducing materialized views and automatic query rewriting based on those materializations in Apache Hive. In particular, materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers, and they can seamlessly exploit new exciting Hive features such as LLAP acceleration. Then the optimizer relies in Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projections, filters, join, and aggregation operations. We shall describe the current coverage of the rewriting algorithm, how Hive controls important aspects of the life cycle of the materialized views such as the freshness of their data, and outline interesting directions for future improvements. We include an experimental evaluation highlighting the benefits that the usage of materialized views can bring to the execution of Hive workloads.
Speaker
Jesus Camacho Rodriguez, Member of Technical Staff, Hortonworks
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
The document provides instructions for setting up the environment and coding tutorial for the BOSS'21 Copenhagen tutorial on Apache Calcite.
It includes the following steps:
1. Clone the GitHub repository containing sample code and dependencies.
2. Compile the project.
3. It outlines the draft schedule for the tutorial, which will cover topics like Calcite introduction, demonstration of SQL queries on CSV files, setting up the coding environment, using Lucene for indexing, and coding exercises to build parts of the logical and physical query plans in Calcite.
4. The tutorial will be led by Stamatis Zampetakis from Cloudera and Julian Hyde from Google, who are both committers to
This document discusses Apache Arrow, an open source cross-language development platform for in-memory analytics. It provides an overview of Arrow's goals of being cross-language compatible, optimized for modern CPUs, and enabling interoperability between systems. Key components include core C++/Java libraries, integrations with projects like Pandas and Spark, and common message patterns for sharing data. The document also describes how Arrow is implemented in practice in systems like Dremio's Sabot query engine.
Parquet is a column-oriented storage format for Hadoop that supports efficient compression and encoding techniques. It uses a row group structure to store data in columns in a compressed and encoded column chunk format. The schema and metadata are stored in the file footer to allow for efficient reads and scans of selected columns. The format is designed to be extensible through pluggable components for schema conversion, record materialization, and encodings.
Parquet is a columnar storage format for Hadoop data. It was developed by Twitter and Cloudera to optimize storage and querying of large datasets. Parquet provides more efficient compression and I/O compared to traditional row-based formats by storing data by column. Early results show a 28% reduction in storage size and up to a 114% improvement in query performance versus the original Thrift format. Parquet supports complex nested schemas and can be used with Hadoop tools like Hive, Pig, and Impala.
The tech talk was gieven by Ranjeeth Kathiresan, Salesforce Senior Software Engineer & Gurpreet Multani, Salesforce Principal Software Engineer in June 2017.
Webinar: Deep Dive on Apache Flink State - Seth WiesmanVerverica
Apache Flink is a world class stateful stream processor presents a huge variety of optional features and configuration choices to the user. Determining out the optimal choice for any production environment and use-case be challenging. In this talk, we will explore and discuss the universe of Flink configuration with respect to state and state backends.
We will start with a closer look under the hood, at core data structures and algorithms, to build the foundation for understanding the impact of tuning parameters and the costs-benefit-tradeoffs that come with certain features and options. In particular, we will focus on state backend choices (Heap vs RocksDB), tuning checkpointing (incremental checkpoints, ...) and recovery (local recovery), serializers and Apache Flink's new state migration capabilities.
Facebook uses HBase running on HDFS to store messaging data and metadata. Key reasons for choosing HBase include high write throughput, horizontal scalability, and integration with HDFS. Typical clusters have multiple regions and racks for redundancy. Facebook stores small messages, metadata, and attachments in HBase, while larger messages and attachments are stored separately. The system processes billions of read and write operations daily and continues to optimize performance and reliability.
Building Mission Critical Messaging System On Top Of HBase
Facebook chose HBase as the storage system for its messaging platform due to HBase's high write throughput, good random read performance, horizontal scalability, and automatic failover. Facebook stores messages, metadata, and search indices in HBase. To improve performance and reliability, Facebook developed the system on a production-stabilized branch of HBase, used shadow testing, added extensive monitoring, and contributed improvements back to the HBase community.
This document provides an overview of HBase, including:
- HBase is a distributed, scalable, big data store modeled after Google's BigTable. It provides a fault-tolerant way to store large amounts of sparse data.
- HBase is used by large companies to handle scaling and sparse data better than relational databases. It features automatic partitioning, linear scalability, commodity hardware, and fault tolerance.
- The document discusses HBase operations, schema design best practices, hardware recommendations, alerting, backups and more. It provides guidance on designing keys, column families and cluster configuration to optimize performance for read and write workloads.
Introduction to HBase. HBase is a NoSQL databases which experienced a tremendous increase in popularity during the last years. Large companies like Facebook, LinkedIn, Foursquare are using HBase. In this presentation we will address questions like: what is HBase?, and compared to relational databases?, what is the architecture?, how does HBase work?, what about the schema design?, what about the IT ressources?. Questions that should help you consider whether this solution might be suitable in your case.
This document contains information about HBase concepts and configurations. It discusses different modes of HBase operation including standalone, pseudo-distributed, and distributed modes. It also covers basic prerequisites for running HBase like Java, SSH, DNS, NTP, ulimit settings, and Hadoop for distributed mode. The document explains important HBase configuration files like hbase-site.xml, hbase-default.xml, hbase-env.sh, log4j.properties, and regionservers. It provides details on column-oriented versus row-oriented databases and discusses optimizations that can be made through configuration settings.
Hw09 Practical HBase Getting The Most From Your H Base InstallCloudera, Inc.
The document summarizes two presentations about using HBase as a database. It discusses the speakers' experiences using HBase at Stumbleupon and Streamy to replace MySQL and other relational databases. Some key points covered include how HBase provides scalability, flexibility, and cost benefits over SQL databases for large datasets.
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
An overview of the history of Big Data, followed by a deep dive into the Hadoop ecosystem. Detailed explanation of how HDFS, MapReduce, and HBase work, followed by a discussion of how to tune HBase performance. Finally, a look at industry trends, including challenges faced and being solved by Bloomberg for using Hadoop for financial data.
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
An overview of the history of Big Data, followed by a deep dive into the Hadoop ecosystem. Detailed explanation of how HDFS, MapReduce, and HBase work, followed by a discussion of how to tune HBase performance. Finally, a look at industry trends, including challenges faced and being solved by Bloomberg for using Hadoop for financial data.
The document provides an introduction to NoSQL and HBase. It discusses what NoSQL is, the different types of NoSQL databases, and compares NoSQL to SQL databases. It then focuses on HBase, describing its architecture and components like HMaster, regionservers, Zookeeper. It explains how HBase stores and retrieves data, the write process involving memstores and compaction. It also covers HBase shell commands for creating, inserting, querying and deleting data.
The document discusses how HDFS architecture has evolved to meet new requirements for higher scalability, availability, and improved random read performance. It summarizes the key aspects of HDFS architecture in 2010, including limitations, and improvements made since then, such as read pipeline optimizations, federated namespaces, and high availability name nodes. It also outlines future directions for HDFS architecture.
At StampedeCon 2012 in St. Louis, Pritam Damania presents: Reliable backup and recovery is one of the main requirements for any enterprise grade application. HBase has been very well embraced by enterprises needing random, real-time read/write access with huge volumes of data and ease of scalability. As such, they are looking for backup solutions that are reliable, easy to use, and can co-exist with existing infrastructure. HBase comes with several backup options but there is a clear need to improve the native export mechanisms. This talk will cover various options that are available out of the box, their drawbacks and what various companies are doing to make backup and recovery efficient. In particular it will cover what Facebook has done to improve performance of backup and recovery process with minimal impact to production cluster.
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...Yahoo Developer Network
The document discusses different approaches for searching large datasets in Hadoop, including MapReduce, Lucene/Solr, and building a new search engine called HSearch. Some key challenges with existing approaches included slow response times and the need for manual sharding. HSearch indexes data stored in HDFS and HBase. The document outlines several techniques used in HSearch to improve performance, such as using SSDs selectively, reducing HBase table size, distributing queries across region servers, moving processing near data, byte block caching, and configuration tuning. Benchmarks showed HSearch could return results for common words from a 100 million page index within seconds.
Data Storage and Management project ReportTushar Dalvi
This paper aims at evaluating the performance of random reads and random writes the information of HBase and Cassandra and compare the results that we got through various ubuntu operation
Speaker: Varun Sharma (Pinterest)
Over the past year, HBase has become an integral component of Pinterest's storage stack. HBase has enabled us to quickly launch and iterate on new products and create amazing pinner experiences. This talk briefly describes some of these applications, the underlying schema, and how our HBase setup stays highly available and performant despite billions of requests every week. It will also include some performance tips for running on SSDs. Finally, we will talk about a homegrown serving technology we built from a mashup of HBase components that has gained wide adoption across Pinterest.
CCS334 BIG DATA ANALYTICS UNIT 5 PPT ELECTIVE PAPERKrishnaVeni451953
HBase is an open source, column-oriented database built on top of Hadoop that allows for the storage and retrieval of large amounts of sparse data. It provides random real-time read/write access to this data stored in Hadoop and scales horizontally. HBase features include automatic failover, integration with MapReduce, and storing data as multidimensional sorted maps indexed by row, column, and timestamp. The architecture consists of a master server (HMaster), region servers (HRegionServer), regions (HRegions), and Zookeeper for coordination.
Facebook - Jonthan Gray - Hadoop World 2010Cloudera, Inc.
The document summarizes HBase use at Facebook, including its development and future work. HBase is used for incremental updates to data warehouses, high frequency analytics, and write-intensive workloads. Development includes Hive integration, master high availability, and random read optimizations. Future work focuses on coprocessors, intelligent load balancing, and cluster performance.
The document discusses backup and disaster recovery strategies for Hadoop. It focuses on protecting data sets stored in HDFS. HDFS uses data replication and checksums to protect against disk and node failures. Snapshots can protect against data corruption and accidental deletes. The document recommends copying data from the primary to secondary site for disaster recovery rather than teeing, and discusses considerations for large data movement like bandwidth needs and security. It also notes the importance of backing up metadata like Hive configurations along with core data.
Hbase status quo apache-con europe - nov 2012Chris Huang
The document summarizes the status of HBase and its relationship with HDFS. In the past, HDFS did not prioritize HBase's needs, but reliability, availability, and performance have improved with Hadoop 1.0 and 2.0. Hadoop 2.0 features like HDFS high availability and wire compatibility directly benefit HBase. Further improvements planned for Hadoop 2.x like direct reads and zero-copy support could significantly boost HBase performance. The HBase project is also advancing with new versions focused on features like coprocessors and performance optimizations.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
Self-Healing Test Automation Framework - HealeniumKnoldus Inc.
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerationswebbyacad software
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
Keynote : AI & Future Of Offensive SecurityPriyanka Aash
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Snarky Security
How wonderful it is that in our modern age, every bit of our biological data can be digitized, stored, and potentially pilfered by cyber thieves! Isn't it just splendid to think that while scientists are busy pushing the boundaries of biotechnology, hackers could be plotting the next big bio-data heist? This delightful scenario is brought to you by the ever-expanding digital landscape of biology and biotechnology, where the integration of computer science, engineering, and data science transforms our understanding and manipulation of biological systems.
While the fusion of technology and biology offers immense benefits, it also necessitates a careful consideration of the ethical, security, and associated social implications. But let's be honest, in the grand scheme of things, what's a little risk compared to potential scientific achievements? After all, progress in biotechnology waits for no one, and we're just along for the ride in this thrilling, slightly terrifying, adventure.
So, as we continue to navigate this complex landscape, let's not forget the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. After all, what could possibly go wrong?
-------------------------
This document provides a comprehensive analysis of the security implications biological data use. The analysis explores various aspects of biological data security, including the vulnerabilities associated with data access, the potential for misuse by state and non-state actors, and the implications for national and transnational security. Key aspects considered include the impact of technological advancements on data security, the role of international policies in data governance, and the strategies for mitigating risks associated with unauthorized data access.
This view offers valuable insights for security professionals, policymakers, and industry leaders across various sectors, highlighting the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. The analysis serves as a crucial resource for understanding the complex dynamics at the intersection of biotechnology and security, providing actionable recommendations to enhance biosecurity in an digital and interconnected world.
The evolving landscape of biology and biotechnology, significantly influenced by advancements in computer science, engineering, and data science, is reshaping our understanding and manipulation of biological systems. The integration of these disciplines has led to the development of fields such as computational biology and synthetic biology, which utilize computational power and engineering principles to solve complex biological problems and innovate new biotechnological applications. This interdisciplinary approach has not only accelerated research and development but also introduced new capabilities such as gene editing and biomanufact
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
6. Monthly data volume prior to launch
15B x 1,024 bytes = 14TB
120B x 100 bytes = 11TB
7. Messaging Data
▪ Small/medium sized data HBase
▪ Message metadata & indices
▪ Search index
▪ Small message bodies
▪ Attachments and large messages Haystack
▪ Used for our existing photo/video store
8. Open Source Stack
▪ Memcached --> App Server Cache
▪ ZooKeeper --> Small Data Coordination Service
▪ HBase --> Database Storage Engine
▪ HDFS --> Distributed FileSystem
▪ Hadoop --> Asynchronous Map-Reduce Jobs
9. Our architecture
User Directory Service
Clients
(Front End, MTA, etc.)
What’s the cell for
this user?
Cell 2
Cell 1 Cell 3
Cell 1 Application Server
Application Server Application Server
Attachments
HBase/HDFS/Z
HBase/HDFS/Z KMessage, Metadata,
HBase/HDFS/Z
K Search Index
K
Haystack
11. HBase in a nutshell
• distributed, large-scale data store
• efficient at random reads/writes
• initially modeled after Google’s BigTable
• open source project (Apache)
12. When to use HBase?
▪ storing large amounts of data
▪ need high write throughput
▪ need efficient random access within large data sets
▪ need to scale gracefully with data
▪ for structured and semi-structured data
▪ don’t need full RDMS capabilities (cross table transactions, joins, etc.)
13. HBase Data Model
• An HBase table is:
• a sparse , three-dimensional array of cells, indexed by:
RowKey, ColumnKey, Timestamp/Version
• sharded into regions along an ordered RowKey space
• Within each region:
• Data is grouped into column families
▪ Sort order within each column family:
Row Key (asc), Column Key (asc), Timestamp (desc)
14. Example: Inbox Search
• Schema
• Key: RowKey: userid, Column: word, Version: MessageID
• Value: Auxillary info (like offset of word in message)
• Data is stored sorted by <userid, word, messageID>:
User1:hi:17->offset1
Can efficiently handle queries like:
User1:hi:16->offset2
User1:hello:16->offset3 - Get top N messageIDs for a
User1:hello:2->offset4 specific user & word
...
User2:.... - Typeahead query: for a given user,
User2:... get words that match a prefix
...
15. HBase System Overview
Database Layer
HBASE
Master Backup
Master
Region Region Region ...
Server Server Server
Coordination Service
Storage Layer
HDFS Zookeeper Quorum
Namenode Secondary Namenode ZK ZK ...
Peer Peer
Datanode Datanode Datanode ...
16. HBase Overview
HBASE Region Server
....
Region #2
Region #1
....
ColumnFamily #2
ColumnFamily #1 Memstore
(in memory data structure)
HFiles (in HDFS) flush
Write Ahead Log ( in HDFS)
17. HBase Overview
• Very good at random reads/writes
• Write path
• Sequential write/sync to commit log
• update memstore
• Read path
• Lookup memstore & persistent HFiles
• HFile data is sorted and has a block index for efficient retrieval
• Background chores
• Flushes (memstore -> HFile)
• Compactions (group of HFiles merged into one)
19. Horizontal scalability
▪ HBase & HDFS are elastic by design
▪ Multiple table shards (regions) per physical server
▪ On node additions
▪ Load balancer automatically reassigns shards from overloaded
nodes to new nodes
▪ Because filesystem underneath is itself distributed, data for
reassigned regions is instantly servable from the new nodes.
▪ Regions can be dynamically split into smaller regions.
▪ Pre-sharding is not necessary
▪ Splits are near instantaneous!
20. Automatic Failover
▪ Node failures automatically detected by HBase Master
▪ Regions on failed node are distributed evenly among surviving nodes.
▪ Multiple regions/server model avoids need for substantial
overprovisioning
▪ HBase Master failover
▪ 1 active, rest standby
▪ When active master fails, a standby automatically takes over
21. HBase uses HDFS
We get the benefits of HDFS as a storage system for free
▪ Fault tolerance (block level replication for redundancy)
▪ Scalability
▪ End-to-end checksums to detect and recover from corruptions
▪ Map Reduce for large scale data processing
▪ HDFS already battle tested inside Facebook
▪ running petabyte scale clusters
▪ lot of in-house development and operational experience
22. Simpler Consistency Model
▪ HBase’s strong consistency model
▪ simpler for a wide variety of applications to deal with
▪ client gets same answer no matter which replica data is read from
▪ Eventual consistency: tricky for applications fronted by a cache
▪ replicas may heal eventually during failures
▪ but stale data could remain stuck in cache
23. Typical Cluster Layout
▪ Multiple clusters/cells for messaging
▪ 20 servers/rack; 5 or more racks per cluster
▪ Controllers (master/Zookeeper) spread across racks
ZooKeeper Peer ZooKeeper Peer ZooKeeper Peer ZooKeeper Peer ZooKeeper Peer
HDFS Namenode Backup Namenode Job Tracker Hbase Master Backup Master
Region Server Region Server Region Server Region Server Region Server
Data Node Data Node Data Node Data Node Data Node
Task Tracker Task Tracker Task Tracker Task Tracker Task Tracker
19x... 19x... 19x... 19x... 19x...
Region Server Region Server Region Server Region Server Region Server
Data Node Data Node Data Node Data Node Data Node
Task Tracker Task Tracker Task Tracker Task Tracker Task Tracker
Rack #1 Rack #2 Rack #3 Rack #4 Rack #5
26. Goal of Zero Data Loss/Correctness
▪ sync support added to hadoop-20 branch
▪ for keeping transaction log (WAL) in HDFS
▪ to guarantee durability of transactions
▪ Row-level ACID compliance
▪ Enhanced HDFS’s Block Placement Policy:
▪ Original: rack aware, but minimally constrained
▪ Now: Placement of replicas constrained to configurable node groups
▪ Result: Data loss probability reduced by orders of magnitude
27. Availability/Stability improvements
▪ HBase master rewrite- region assignments using ZK
▪ Rolling Restarts – doing software upgrades without a downtime
▪ Interrupt Compactions – prioritize availability over minor perf gains
▪ Timeouts on client-server RPCs
▪ Staggered major compaction to avoid compaction storms
28. Performance Improvements
▪ Compactions
▪ critical for read performance
▪ Improved compaction algo
▪ delete/TTL/overwrite processing in minor compactions
▪ Read optimizations:
▪ Seek optimizations for rows with large number of cells
▪ Bloom filters to minimize HFile lookups
▪ Timerange hints on HFiles (great for temporal data)
▪ Improved handling of compressed HFiles
29. Operational Experiences
▪ Darklaunch:
▪ shadow traffic on test clusters for continuous, at scale testing
▪ experiment/tweak knobs
▪ simulate failures, test rolling upgrades
▪ Constant (pre-sharding) region count & controlled rolling splits
▪ Administrative tools and monitoring
▪ Alerts (HBCK, memory alerts, perf alerts, health alerts)
▪ auto detecting/decommissioning misbehaving machines
▪ Dashboards
▪ Application level backup/recovery pipeline
30. Working within the Apache community
▪ Growing with the community
▪ Started with a stable, healthy project
▪ In house expertise in both HDFS and HBase
▪ Increasing community involvement
▪ Undertook massive feature improvements with community help
▪ HDFS 0.20-append branch
▪ HBase Master rewrite
▪ Continually interacting with the community to identify and fix issues
▪ e.g., large responses (2GB RPC)
33. Move messaging data from MySQL to HBase
▪ In MySQL, inbox data was kept normalized
▪ user’s messages are stored across many different machines
▪ Migrating a user is basically one big join across tables spread over
many different machines
▪ Multiple terabytes of data (for over 500M users)
▪ Cannot pound 1000s of production UDBs to migrate users
34. How we migrated
▪ Periodically, get a full export of all the users’ inbox data in MySQL
▪ And, use bulk loader to import the above into a migration HBase
cluster
▪ To migrate users:
▪ Since users may continue to receive messages during migration:
▪ double-write (to old and new system) during the migration period
▪ Get a list of all recent messages (since last MySQL export) for the
user
▪ Load new messages into the migration HBase cluster
▪ Perform the join operations to generate the new data
▪ Export it and upload into the final cluster
36. Facebook Insights Goes Real-Time
▪ Recently launched real-time analytics for social plugins on top of
HBase
▪ Publishers get real-time distribution/engagement metrics:
▪ # of impressions, likes
▪ analytics by
▪ Domain, URL, demographics
▪ Over various time periods (the last hour, day, all-time)
▪ Makes use of HBase capabilities like:
▪ Efficient counters (read-modify-write increment operations)
▪ TTL for purging old data
37. Future Work
It is still early days…!
▪ Namenode HA (AvatarNode)
▪ Fast hot-backups (Export/Import)
▪ Online schema & config changes
▪ Running HBase as a service (multi-tenancy)
▪ Features (like secondary indices, batching hybrid mutations)
▪ Cross-DC replication
▪ Lot more performance/availability improvements