This presentation shortly describes key features of Apache Cassandra. It was held at the Apache Cassandra Meetup in Vienna in January 2014. You can access the meetup here: http://www.meetup.com/Vienna-Cassandra-Users/
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended help understand HBase's interactions with HDFS for tuning IO performance.
The document provides an overview of the activity feeds architecture. It discusses the fundamental entities of connections and activities. Connections express relationships between entities and are implemented as a directed graph. Activities form a log of actions by entities. To populate feeds, activities are copied and distributed to relevant entities and then aggregated. The aggregation process involves selecting connections, classifying activities, scoring them, pruning duplicates, and sorting the results into a merged newsfeed.
This document provides an overview of patterns for scalability, availability, and stability in distributed systems. It discusses general recommendations like immutability and referential transparency. It covers scalability trade-offs around performance vs scalability, latency vs throughput, and availability vs consistency. It then describes various patterns for scalability including managing state through partitioning, caching, sharding databases, and using distributed caching. It also covers patterns for managing behavior through event-driven architecture, compute grids, load balancing, and parallel computing. Availability patterns like fail-over, replication, and fault tolerance are discussed. The document provides examples of popular technologies that implement many of these patterns.
ClickHouse Deep Dive, by Aleksei MilovidovAltinity Ltd
This document provides an overview of ClickHouse, an open source column-oriented database management system. It discusses ClickHouse's ability to handle high volumes of event data in real-time, its use of the MergeTree storage engine to sort and merge data efficiently, and how it scales through sharding and distributed tables. The document also covers replication using the ReplicatedMergeTree engine to provide high availability and fault tolerance.
1. Log structured merge trees store data in multiple levels with different storage speeds and costs, requiring data to periodically merge across levels.
2. This structure allows fast writes by storing new data in faster levels before merging to slower levels, and efficient reads by querying multiple levels and merging results.
3. The merging process involves loading, sorting, and rewriting levels to consolidate and propagate deletions and updates between levels.
Meta/Facebook's database serving social workloads is running on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depends a lot on RocksDB. Not just MyRocks, but also we have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
Communication between Microservices is inherently unreliable. These integration points may produce cascading failures, slow responses, service outages. We will walk through stability patterns like timeouts, circuit breaker, bulkheads and discuss how they improve stability of Microservices.
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookThe Hive
This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly-multi-core processors and RAM-speed storage devices.
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
Apache Tez is a framework for accelerating Hadoop query processing. It is based on expressing a computation as a dataflow graph and executing it in a highly customizable way. Tez is built on top of YARN and provides benefits like better performance, predictability, and utilization of cluster resources compared to traditional MapReduce. It allows applications to focus on business logic rather than Hadoop internals.
The document discusses intra-cluster replication in Apache Kafka, including its architecture where partitions are replicated across brokers for high availability. Kafka uses a leader and in-sync replicas approach to strongly consistent replication while tolerating failures. Performance considerations in Kafka replication include latency and durability tradeoffs for producers and optimizing throughput for consumers.
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases like online and offline processing more efficiently than alternatives like RabbitMQ. Kafka works by partitioning topics into segments spread across clusters of machines, and replicates across these partitions for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
1) Apache Kafka is a distributed streaming platform that can be used for publish-subscribe messaging and storing and processing streams of data. However, there are many potential anti-patterns to be aware of when using Kafka.
2) Some common anti-patterns include not properly configuring data durability, ignoring error handling and exceptions, failing to use Kafka's built-in retries and idempotence features, and not embracing Kafka's at least once processing semantics.
3) It is also important to properly configure Kafka for production use by tuning OS settings, reading documentation on best practices, implementing monitoring, and addressing topics and partitioning design.
Parquet performance tuning: the missing guideRyan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
Storing time series data with Apache CassandraPatrick McFadin
If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself as a solid choice now you can learn how to do it. We'll look at possible data models and the the choices you have to be successful. Then, let's open the hood and learn about how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
This document summarizes a presentation about Apache Kafka. It introduces Apache Kafka as a modern, distributed platform for data streams made up of distributed, immutable, append-only commit logs. It describes Kafka's scalability similar to a filesystem and guarantees similar to a database, with the ability to rewind and replay data. The document discusses Kafka topics and partitions, partition leadership and replication, and provides resources for further information.
This course is designed to be a “fast start” on the basics of data modeling with Cassandra. We will cover some basic Administration information upfront that is important to understand as you choose your data model. It is still important to take a proper Admin class if you are responsible for production instance. This course focuses on CQL3, but thrift shall not be ignored.
This document provides an overview of Apache Cassandra and how it can be used to build a Twitter-like application called Twissandra. It describes Cassandra's data model using keyspaces and column families, and how they can be mapped to represent users, tweets, followers, and more. It also shows examples of common operations like inserting and querying data. The goal is to illustrate how Cassandra addresses issues like scalability and availability in a way relational databases cannot, and how it can be used to build distributed, highly available applications.
Cassandra By Example: Data Modelling with CQL3Eric Evans
CQL is the query language for Apache Cassandra that provides an SQL-like interface. The document discusses the evolution from the older Thrift RPC interface to CQL and provides examples of modeling tweet data in Cassandra using tables like users, tweets, following, followers, userline, and timeline. It also covers techniques like denormalization, materialized views, and batch loading of related data to optimize for common queries.
I don't think it's hyperbole when I say that Facebook, Instagram, Twitter & Netflix now define the dimensions of our social & entertainment universe. But what kind of technology engines purr under the hoods of these social media machines?
Here is a tech student's perspective on making the paradigm shift to "Big Data" using innovative models: alphabet blocks, nesting dolls, & LEGOs!
Get info on:
- What is Cassandra (C*)?
- Installing C* Community Version on Amazon Web Services EC2
- Data Modelling & Database Design in C* using CQL3
- Industry Use Cases
The document discusses hash partitioning and consistent hashing strategies for partitioning and distributing data across multiple nodes in a distributed database system like Cassandra. It explains how consistent hashing with random tokens assigns data to nodes in a way that balances the load and makes adding or removing nodes easier compared to simple hash partitioning. The document also demonstrates how to generate random tokens for nodes.
Apache Cassandra is a free, distributed, open source, and highly scalable NoSQL database that is designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability, and tunable consistency. Cassandra's architecture allows it to spread data across a cluster of servers and replicate across multiple data centers for fault tolerance. It is used by many large companies for applications that require high performance, scalability, and availability.
Cassandra is a distributed database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single points of failure and linear scalability as nodes are added. Cassandra uses a peer-to-peer distributed architecture and tunable consistency levels to achieve high performance and availability without requiring strong consistency. It is based on Amazon's Dynamo and Google's Bigtable papers and provides a combination of their features.
This document provides an overview of a NoSQL Night event presented by Clarence J M Tauro from Couchbase. The presentation introduces NoSQL databases and discusses some of their advantages over relational databases, including scalability, availability, and partition tolerance. It covers key concepts like the CAP theorem and BASE properties. The document also provides details about Couchbase, a popular document-oriented NoSQL database, including its architecture, data model using JSON documents, and basic operations. Finally, it advertises Couchbase training courses for getting started and administration.
Christian Johannsen presents on evaluating Apache Cassandra as a cloud database. Cassandra is optimized for cloud infrastructure with features like transparent elasticity, scalability, high availability, easy data distribution and redundancy. It supports multiple data types, is easy to manage, low cost, supports multiple infrastructures and has security features. A demo of DataStax OpsCenter and Apache Spark on Cassandra is shown.
5 Factors When Selecting a High Performance, Low Latency DatabaseScyllaDB
There are hundreds of possible databases you can choose from today. Yet if you draw up a short list of critical criteria related to performance and scalability for your use case, the field of choices narrows and your evaluation decision becomes much easier.
In this session, we’ll explore 5 essential factors to consider when selecting a high performance low latency database, including options, opportunities, and tradeoffs related to software architecture, hardware utilization, interoperability, RASP, and Deployment.
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
This document provides an overview and outline of a 1-hour introduction to building a big data pipeline using Docker, Cassandra, Spark, Spark-Notebook and Akka. The introduction is presented as a half-day workshop at Devoxx November 2015. It uses a data pipeline environment from Data Fellas and demonstrates how to use scalable distributed technologies like Docker, Spark, Spark-Notebook and Cassandra to build a reactive, repeatable big data pipeline. The key takeaway is understanding how to construct such a pipeline.
Database as a Service on the Oracle Database Appliance PlatformMaris Elsins
Speaker: Marc Fielding, Co-speaker: Maris Elsins.
Oracle Database Appliance provides a robust, highly-available, cost-effective, and surprisingly scalable platform for database as a service environment. By leveraging Oracle Enterprise Manager's self-service features, databases can be provisioned on a self-service basis to a cluster of Oracle Database Appliance machines. Discover how multiple ODA devices can be managed together to provide both high availability and incremental, cost-effective scalability. Hear real-world lessons learned from successful database consolidation implementations.
Cassandra is an open source, distributed database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability and performance, as well as flexibility in schemas. Cassandra finds use in large companies like Facebook, Netflix and eBay due to its abilities to scale and perform well under heavy loads. However, it may not be suited for applications requiring many joins, transactions or strong consistency guarantees.
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
We will start from understanding how Real-Time Analytics can be implemented on Enterprise Level Infrastructure and will go to details and discover how different cases of business intelligence be used in real-time on streaming data. We will cover different Stream Data Processing Architectures and discus their benefits and disadvantages. I'll show with live demos how to build Fast Data Platform in Azure Cloud using open source projects: Apache Kafka, Apache Cassandra, Mesos. Also I'll show examples and code from real projects.
1) Apache Cassandra in term of CAP Theorem
2) What makes Apache Cassandra "Available"?
3) How Apache Cassandra ensures data consistency?
4) Cassandra advantages and disadvantages
5) Frameworks/libraries to access Apache Cassandra + performance comparison
Le but est de partager avec le public les connaissances et expériences éprouvées dans la conception, la mise en œuvre et l'exécution de plateformes DBaaS. La présentation comprend des exemples et des explications sur les environnements de base de données consolidées délivrant des performances sans compromis, l'évolutivité et la flexibilité en liaison avec le "time-to-market" et la rentabilité.
DBaaS - The Next generation of database infrastructureEmiliano Fusaglia
Database as a Service (DBaaS) delivers database functionality as an on-demand cloud service, masking complexity. It offers flexible, scalable, secure databases with self-service provisioning and consolidated resources. DBaaS provides advantages over traditional databases like lower costs, faster provisioning, and increased efficiency through standardization and automation. DBaaS can be implemented through virtualization or using Oracle's Grid Infrastructure and multitenant database features which provide high availability, scalability, and performance isolation through resource management. DBaaS offers a standardized platform that can be engineered once and used for multiple applications in a pay-as-you-grow model.
This document provides an introduction to NoSQL and Cassandra. It begins with an introduction of the presenter and an overview of what will be covered. It then discusses the history of databases and why alternatives to relational databases were needed to address challenges of scaling to internet-level data volumes, varieties, and velocities. It introduces key NoSQL concepts like CAP theorem, BASE, and the different types of NoSQL databases before focusing on Cassandra. The document summarizes Cassandra's origins, capabilities, data model involving column families and super column families, and architecture.
This document provides an overview of Apache Cassandra, including its history, key features, architecture, and use cases. Cassandra is an open-source, decentralized, distributed database management system that provides high availability with no single point of failure. It scales linearly as nodes are added and easily handles large amounts of data across clusters. Popular companies that use Cassandra include Netflix, Spotify, and Hulu for its capabilities such as replication, high performance, and scalability.
Cassandra is a highly scalable, open-source distributed database designed to handle large amounts of structured data across many servers. It provides high availability with no single point of failure and was created by Facebook to power search on their messaging platform. Cassandra uses a decentralized peer-to-peer architecture and replicates data across multiple data centers for fault tolerance. It emphasizes performance and scalability over more complex query options and does not support features like joins typically found in relational databases. Companies like Netflix and Hulu use Cassandra for its availability, scalability, and ability to span large clusters with minimal maintenance.
TupleJump: Breakthrough OLAP performance on Cassandra and SparkDataStax Academy
Apache Cassandra is rock-solid and widely deployed for OLTP and real-time applications, but it is typically not thought of as an OLAP database for analytical queries. This talk will show architectures and techniques for combining Apache Cassandra and Spark to yield a 10-1000x improvement in OLAP analytical performance. We will then introduce a new open-source project that combines the above performance improvements with the ease of use of Apache Cassandra, and compare it to implementations based on Hadoop and Parquet.
First, the existing Cassandra Spark connector allows one to easily load data from Cassandra to Spark. We'll cover how to accelerate queries through different caching options in Spark, and the tradeoffs and limitations around performance, memory, and updating data in real time. We then dive into the use of columnar storage layout and efficient coding techniques that dramatically speed up I/O for OLAP use cases. Cassandra features like triggers and custom secondary indexes allow for easy data ingestion into columnar format. Next, we explore how to integrate this new storage with Spark SQL and its pluggable data storage API. Future developments will enable extreme analytical database performance, including smart caching of column projections, a columnar version of Spark's Catalyst execution planner, and how vectorization makes for fast cache- and GPU-friendly calculations - see Spark's Project Tungsten.
FiloDB is a new open-source database using the above techniques to combine very fast Spark SQL analytical queries with the ease of use of Cassandra. We will briefly cover interesting use cases, such as:
* Easy exactly-once ingestion from Kafka for streaming and IoT applications
* Incremental computed columns and geospatial annotations. We'll discuss how FiloDB improves aggregations needed for choropleth maps over standard PostGIS solutions.
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
You want to ingest event, time-series, streaming data easily, yet have flexible, fast ad-hoc queries. Is this even possible? Yes! Find out how in this talk of combining Apache Cassandra and Apache Spark, using a new open-source database, FiloDB.
Cassandra is a distributed database designed to handle large amounts of structured data across commodity servers. It provides linear scalability, fault tolerance, and high availability. Cassandra's architecture is masterless with all nodes equal, allowing it to scale out easily. Data is replicated across multiple nodes according to the replication strategy and factor for redundancy. Cassandra supports flexible and dynamic data modeling and tunable consistency levels. It is commonly used for applications requiring high throughput and availability, such as social media, IoT, and retail.
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
This document discusses using Apache Cassandra for business intelligence, reporting and analytics. It covers:
- Data modeling and querying Cassandra data using CQL
- Accessing Cassandra data through drivers, ODBC/JDBC, and analytics frameworks like Spark and Hadoop
- Doing reporting, dashboards, and analytics on Cassandra data using CQL, Solr, Spark, and BI tools
- Capabilities of DataStax Enterprise for integrated search, batch analytics, and real-time analytics on Cassandra
- Example architectures that isolate workloads and handle hot vs cold data
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
Companies today are innovating with real-time data to deliver truly amazing customer experiences in the moment. Real-time data management for real-time customer experience is core to staying ahead of competition and driving revenue growth. Join Trays to learn how Comcast is differentiating itself from it's own historical reputation with Customer Experience strategies.
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
DataStax Enterprise (DSE) Graph is a built to manage, analyze, and search highly connected data. DSE Graph, built on NoSQL Apache Cassandra delivers continuous uptime along with predictable performance and scales for modern systems dealing with complex and constantly changing data.
Download DataStax Enterprise: Academy.DataStax.com/Download
Start free training for DataStax Enterprise Graph: Academy.DataStax.com/courses/ds332-datastax-enterprise-graph
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
DataStax Enterprise Advanced Replication supports one-way distributed data replication from remote database clusters that might experience periods of network or internet downtime. Benefiting use cases that require a 'hub and spoke' architecture.
Learn more at http://www.datastax.com/2016/07/stay-100-connected-with-dse-advanced-replication
Advanced Replication docs – https://docs.datastax.com/en/latest-dse/datastax_enterprise/advRep/advRepTOC.html
This document discusses using Docker containers to run Cassandra clusters at Walmart. It proposes transforming existing Cassandra hardware into containers to better utilize unused compute. It also suggests building new Cassandra clusters in containers and migrating old clusters to double capacity on existing hardware and save costs. Benchmark results show Docker containers outperforming virtual machines on OpenStack and Azure in terms of reads, writes, throughput and latency for an in-house application.
The document discusses the evolution of Cassandra's data modeling capabilities over different versions of CQL. It covers features introduced in each version such as user defined types, functions, aggregates, materialized views, and storage attached secondary indexes (SASI). It provides examples of how to create user defined types, functions, materialized views, and SASI indexes in CQL. It also discusses when each feature should and should not be used.
Cisco has a large global IT infrastructure supporting many applications, databases, and employees. The document discusses Cisco's existing customer service and commerce systems (CSCC/SMS3) and some of the performance, scalability, and user experience issues. It then presents a proposed new architecture using modern technologies like Elasticsearch, Cassandra, and microservices to address these issues and improve agility, performance, scalability, uptime, and the user interface.
Data Modeling is the one of the first things to sink your teeth into when trying out a new database. That's why we are going to cover this foundational topic in enough detail for you to get dangerous. Data Modeling for relational databases is more than a touch different than the way it's approached with Cassandra. We will address the quintessential query-driven methodology through a couple of different use cases, including working with time series data for IoT. We will also demo a new tool to get you bootstrapped quickly with MovieLens sample data. This talk should give you the basics you need to get serious with Apache Cassandra.
Hear about how Coursera uses Cassandra as the core of its scalable online education platform. I'll discuss the strengths of Cassandra that we leverage, as well as some limitations that you might run into as well in practice.
In the second part of this talk, we'll dive into how best to effectively use the Datastax Java drivers. We'll dig into how the driver is architected, and use this understanding to develop best practices to follow. I'll also share a couple of interesting bug we've run into at Coursera.
This document promotes Datastax Academy and Certification resources for learning Cassandra including a three step process of learning Cassandra, getting certified, and profiting. It lists community evangelists like Luke Tillman, Patrick McFadin, Jon Haddad, and Duy Hai Doan who can provide help and resources.
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
This document summarizes three presentations from a Cassandra Meetup:
1. Jason Cacciatore discussed monitoring Cassandra health at scale across hundreds of clusters and thousands of nodes using the reactive stream processing system Mantis.
2. Minh Do explained how Cassandra uses the gossip protocol for tasks like discovering cluster topology and sharing load information. Gossip also has limitations and race conditions that can cause problems.
3. Chris Kalantzis presented Cassandra Tickler, an open source tool he created to help repair operations that get stuck by running lightweight consistency checks on an old Cassandra version or a node with space issues.
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
This talk covers scaling Cassandra to a fast growing user base. Alex and Isaias will cover new best practices and how to work with the strengths and weaknesses of Cassandra at large scale. They will discuss how to adapt to bottlenecks while providing a rich feature set to the playstation community.
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
The document discusses Cassandra's use by Sony Network Entertainment to handle the large amount of user and transaction data from the growing PlayStation Network. It describes how the relational database they previously used did not scale sufficiently, so they transitioned to using Cassandra in a denormalized and customized way. Some of the techniques discussed include caching user data locally on application servers, secondary indexing, and using a real-time indexer to enable personalized search by friends.
This document provides guidance on setting up server monitoring, application metrics, log aggregation, time synchronization, replication strategies, and garbage collection for a Cassandra cluster. Key recommendations include:
1. Use monitoring tools like Monit, Munin, Nagios, or OpsCenter to monitor processes, disk usage, and system performance. Aggregate all logs centrally with tools like Splunk, Logstash, or Greylog.
2. Install NTP to synchronize server times which are critical for consistency.
3. Use the NetworkTopologyStrategy replication strategy and avoid SimpleStrategy for production.
4. Avoid shared storage and focus on low latency and high throughput using multiple local disks.
5. Understand
This document discusses real time analytics using Spark and Spark Streaming. It provides an introduction to Spark and highlights limitations of Hadoop for real-time analytics. It then describes Spark's advantages like in-memory processing and rich APIs. The document discusses Spark Streaming and the Spark Cassandra Connector. It also introduces DataStax Enterprise which integrates Spark, Cassandra and Solr to allow real-time analytics without separate clusters. Examples of streaming use cases and demos are provided.
Introduction to Data Modeling with Apache CassandraDataStax Academy
This document provides an introduction to data modeling with Apache Cassandra. It discusses how Cassandra data models are designed based on the queries an application will perform, unlike relational databases which are designed based on normalization rules. Key aspects covered include avoiding joins by denormalizing data, using a partition key to group related data on nodes, and controlling the clustering order of columns. The document provides examples of modeling time series and tag data in Cassandra.
The document discusses different data storage options for small, medium, and large datasets. It argues that relational databases do not scale well for large datasets due to limitations with replication, normalization, sharding, and high availability. The document then introduces Apache Cassandra as a fast, distributed, highly available, and linearly scalable database that addresses these limitations through its use of a hash ring architecture and tunable consistency levels. It describes Cassandra's key features including replication, compaction, and multi-datacenter support.
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
This document provides an overview of using Datastax Enterprise (DSE) Search to enable full-text search capabilities in Cassandra applications. It discusses how DSE Search integrates Solr/Lucene indexing with the Cassandra database to allow searching of application data without requiring a separate search cluster, external ETL processes, or custom application code for data management. The document also includes examples of different types of searches that can be performed, such as filtering, faceting, geospatial searches, and joins. It concludes with basic steps for getting started with DSE Search such as creating a Solr core and executing search queries using CQL.
The document discusses common bad habits that can occur when working with Apache Cassandra and provides recommendations to avoid them. Specifically, it addresses issues like sliding back into a relational mindset when the data model is different, improperly benchmarking Cassandra systems, having slow client performance, and neglecting important operations tasks. The presentation provides guidance on how to approach data modeling, querying, benchmarking, driver usage, and operations management in a Cassandra-oriented way.
This document provides an overview and examples of modeling data in Apache Cassandra. It begins with an introduction to thinking about data models and queries before modeling, and emphasizes that Cassandra requires modeling around queries due to its limitations on joins and indexes. The document then provides examples of modeling user, video, and other entity data for a video sharing application to support common queries. It also discusses techniques for handling queries that could become hotspots, such as bucketing or adding random values. The examples illustrate best practices for data duplication, materialized views, and time series data storage in Cassandra.
The document discusses best practices for using Apache Cassandra, including:
- Topology considerations like replication strategies and snitches
- Booting new datacenters and replacing nodes
- Security techniques like authentication, authorization, and SSL encryption
- Using prepared statements for efficiency
- Asynchronous execution for request pipelining
- Batch statements and their appropriate uses
- Improving performance through techniques like the new row cache
Finetuning GenAI For Hacking and DefendingPriyanka Aash
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Snarky Security
How wonderful it is that in our modern age, every bit of our biological data can be digitized, stored, and potentially pilfered by cyber thieves! Isn't it just splendid to think that while scientists are busy pushing the boundaries of biotechnology, hackers could be plotting the next big bio-data heist? This delightful scenario is brought to you by the ever-expanding digital landscape of biology and biotechnology, where the integration of computer science, engineering, and data science transforms our understanding and manipulation of biological systems.
While the fusion of technology and biology offers immense benefits, it also necessitates a careful consideration of the ethical, security, and associated social implications. But let's be honest, in the grand scheme of things, what's a little risk compared to potential scientific achievements? After all, progress in biotechnology waits for no one, and we're just along for the ride in this thrilling, slightly terrifying, adventure.
So, as we continue to navigate this complex landscape, let's not forget the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. After all, what could possibly go wrong?
-------------------------
This document provides a comprehensive analysis of the security implications biological data use. The analysis explores various aspects of biological data security, including the vulnerabilities associated with data access, the potential for misuse by state and non-state actors, and the implications for national and transnational security. Key aspects considered include the impact of technological advancements on data security, the role of international policies in data governance, and the strategies for mitigating risks associated with unauthorized data access.
This view offers valuable insights for security professionals, policymakers, and industry leaders across various sectors, highlighting the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. The analysis serves as a crucial resource for understanding the complex dynamics at the intersection of biotechnology and security, providing actionable recommendations to enhance biosecurity in an digital and interconnected world.
The evolving landscape of biology and biotechnology, significantly influenced by advancements in computer science, engineering, and data science, is reshaping our understanding and manipulation of biological systems. The integration of these disciplines has led to the development of fields such as computational biology and synthetic biology, which utilize computational power and engineering principles to solve complex biological problems and innovate new biotechnological applications. This interdisciplinary approach has not only accelerated research and development but also introduced new capabilities such as gene editing and biomanufact
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottle necks, making a fix and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their application fast and understanding how things work under the hood.
Self-Healing Test Automation Framework - HealeniumKnoldus Inc.
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
1. Cassandra
Introduction & Key Features
Meetup Vienna Cassandra Users
13th of January 2014
philipp.potisk@geroba.com
2. Definition
Apache Cassandra is an open source, distributed,
decentralized, elastically scalable, highly available,
fault-tolerant, tuneably consistent, column-oriented
database that bases its distribution design on Amazon’s
Dynamo and its data model on Google’s Bigtable.
Created at Facebook, it is now used at some of the most
popular sites on the Web [The Definitive Guide, Eben
Hewitt, 2010]
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
2
4. Key Features
Distributed
and
Decentralized
High Performance
CQL – A SQL
like query
interface
Elastic
Scalability
Cassandra
Columnoriented
Key-Value
store
13/01/2014
High
Availability
and Fault
Tolerance
Tuneable
Consistency
Cassandra Introduction & Key Features by Philipp Potisk
4
5. Distributed and Decentralized
Datacenter 1
• Distributed: Capable of running
on multiple machines
• Decentralized: No single point of
failure
No master-slave issues due to
peer-to-peer architecture
(protocol "gossip")
Single Cassandra cluster may run
across geographically dispersed
data centers
13/01/2014
Datacenter 2
1
7
6
2
5
3
4
12
8
11
9
10
Read- and writerequests to any node
Cassandra Introduction & Key Features by Philipp Potisk
5
6. Elastic Scalability
1
8
1
• Cassandra scales horizontally,
adding more machines that have
all or some of the data on
• Adding of nodes increase
performance throughput linearly
• De-/ and increasing the
nodecount happen seamlessly
4 Performance
2
throughput = N
3
2
Performance
throughput = N x 2
7
4
6
5
Linearly scales to
terabytes and
petabytes of data
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
3
6
7. Scaling Benchmark By Netflix*
48, 96, 144 and 288
instances, with 10, 20,
30 and 60 clients
respectively. Each client
generated ~20.000w/s
having 400byte in size
Cassandra scales linearly far
beyond our current capacity
requirements, and very
rapid deployment
automation makes it easy to
manage. In particular,
benchmarking in the cloud
is fast, cheap and scalable,
*http://techblog.netflix.com/201
1/11/benchmarking-cassandrascalability-on.html
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
7
8. High Availability and Fault Tolerance
• High Availability?
Multiple networked computers
operating in a cluster
Facility for recognizing node
failures
Forward failing over requests to
another part of the system
1
6
2
5
3
4
• Cassandra has High Availability
No single point of failure
due to the peer-to-peer
architecture
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
8
9. Tunable Consistency
• Choose between strong and eventual
consistency
• Adjustable for read- and writeoperations separately
• Conflicts are solved during reads, as
focus lies on write-performance
TUNABLE
Available
Consistency
Use case dependent
level of consistency
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
9
10. When do we have strong consistency?
• Simple Formula:
jsmith
(nodes_written + nodes_read) >
replication_factor
jsmith
t1
t2
NW: 2
NR: 2
RF: 3
t1
t2
jsmith
t1
• Ensures that a read always
reflects the most recent write
• If not: Weak consistency
Eventually consistent
jsmith
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
t2
10
11. Column-oriented Key-Value Store
Row Key1
Column
Key1
Column
Value1
Column
Key2
Column
Value2
Column
Key3
Column
Value3
…
…
…
• Data is stored in sparse
multidimensional hash tables
• A row can have multiple columns –
not necessarily the same amount of
columns for each row
• Each row has a unique key, which
also determines partitioning
• No relations!
Stored sorted by row key *
Stored sorted by column key/value
Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
* Row keys (partition keys) should be hashed, in order to distribute data across the cluster evenly
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
11
12. CQL – An SQL-like query interface
• “CQL 3 is the default and primary interface into the Cassandra DBMS” *
• Familiar SQL-like syntax that maps to Cassandras storage engine and
simplifies data modelling
CRETE TABLE songs (
id uuid PRIMARY KEY,
title text,
album text,
artist text,
data blob,
tags set<text>
);
INSERT INTO songs
(id, title, artist,
album, tags)
VALUES(
'a3e64f8f...',
'La Grange',
'ZZ Top',
'Tres Hombres'‚
{'cool', 'hot'});
SELECT *
FROM songs
WHERE id = 'a3e64f8f...';
“SQL-like” but NOT
relational SQL
* http://www.datastax.com/documentation/cql/3.0/pdf/cql30.pdf
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
12
13. High Performance
• Optimized from the ground up
for high throughput
• All disk writes are sequential,
append only operations
• No reading before writing
• Cassandra`s threading-concept is
optimized for running on
multiprocessor/ multicore
machines
13/01/2014
Optimized for writing,
but fast reads are
possible as well
Cassandra Introduction & Key Features by Philipp Potisk
13
14. Benchmark from 2011 (Cassandra 0.7.4)*
ops
Cassandra showed
outstanding throughput in
“INSERT-only” with 20,000
ops
Insert: Enter 50 million 1K-sized records
Read: Search key for a one hour period + optional update
Hardware: Nehalem 6 Core x 2 CPU, 16GB Memory
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
*NoSql Benchmarking by Curbit
http://www.cubrid.org/blog/de
v-platform/nosqlbenchmarking/
14
15. Benchmark from 2013 (Cassandra 1.1.6)*
* Benchmarking Top NoSQL Databases by End Point Corporation,
http://www.datastax.com/wp-content/uploads/2013/02/WP-Benchmarking-Top-NoSQL-Databases.pdf
Yahoo! Cloud Serving Benchmark: https://github.com/brianfrankcooper/YCSB
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
15
16. When do we need these features?
Lots of
Writes,
Statistics, and
Analysis
Geographical
Distribution
Large
Deployments
13/01/2014
Evolving
Applications
Cassandra Introduction & Key Features by Philipp Potisk
16
17. Who is using Cassandra?
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
17
18. ebay Data Infrastructure*
•
•
•
•
•
•
Thousands of nodes
> 2K sharded logical host
> 16K tables
> 27K indexes
> 140 billion SQLs/day
> 5 PB provisioned
• 10+ clusters
• 100+ nodes
• > 250 TB provisioned
(local HDD + shared SSD)
• > 9 billion writes/day
• > 5 billion reads/day
• Hundreds of nodes
• Persistent & in-memory
• > 40 billion SQLs/day
Not replacing RDMBS but
complementing!
Hundreds of nodes
> 50 TB
> 2 billion ops/day
• Thousands of nodes
• The world largest cluster
with 2K+ nodes
*by Jay Patel, Cassandra Summit June 2013 San Francisco
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
18
19. Cassandra Use Case at Ebay
Application/Use Case
• Time-series data and real-time insights
• Fraud detection & prevention
• Quality Click Pricing for affiliates
• Order & Shipment Tracking
•…
• Server metrics collection
• Taste graph-based next-gen recommendation
system
• Social Signals on eBay Product & Item pages
13/01/2014
Why Cassandra?
• Multi-Datacenter (active-active)
• No SPOF
• Easy to scale
• Write performance
• Distributed Counters
Cassandra Introduction & Key Features by Philipp Potisk
19
21. Summary
• History
• Key features of Cassandra
•
•
•
•
•
•
•
Distributed and Decentralized
Elastic Scalability
High Availability and Fault Tolerance
Tunable Consistency
Column-oriented key-value store
CQL interface
High Performance
• Ebay Use Case
13/01/2014
Apache project: http://cassandra.apache.org
Community portal: http://planetcassandra.org
Documentation: http://www.datastax.com/docs
Cassandra Introduction & Key Features by Philipp Potisk
21