This document discusses Druid in production at Fyber, a company that indexes 5 terabytes of data daily from various sources into Druid. It describes the hardware used, including 30 historical nodes and 2 broker nodes. Issues addressed include slow query times with many dimensions (some stored as lists) and data cleanup steps, such as value replacement, to reduce cardinality. Segment sizing and partitioning are also discussed, covering the hardware, data ingestion, querying, and optimizations used to scale Druid for Fyber's analytics needs.
In Spark SQL, the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey an understanding of and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain relevant information that can be useful for understanding some details about the execution. If you understand the query plan, you can look for the weak spots and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.
The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT to native machine code at runtime. This talk will take a deep dive into the Spark SQL execution engine. The talk covers pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
Apache Iceberg - A Table Format for Huge Analytic Datasets (Alluxio, Inc.)
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
Tame the small files problem and optimize data layout for streaming ingestion... (Flink Forward)
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion into Iceberg tables can suffer from two problems: (1) a small-files problem that can hurt read performance, and (2) poor data clustering that can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi and Iceberg has sprung up. Along with the Hive Metastore, these table formats are trying to solve long-standing problems of the traditional data lake with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption, etc.
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache (Dremio Corporation)
From DataEngConf 2017 - Everybody wants to get to data faster. As we move from more general solution to specific optimization techniques, the level of performance impact grows. This talk will discuss how layering in-memory caching, columnar storage and relational caching can combine to provide a substantial improvement in overall data science and analytical workloads. It will include a detailed overview of how you can use Apache Arrow, Calcite and Parquet to achieve multiple magnitudes improvement in performance over what is currently possible.
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ... (Databricks)
Dr. Elephant helps improve Spark and Hadoop developer productivity and increase cluster efficiency by making clear recommendations on how to tune workloads and configurations. Originally developed by LinkedIn, Dr. Elephant is now in use at multiple sites.
This session will explore how Dr. Elephant works, the data it collects from Spark environments and the customizable heuristics that generate tuning recommendations. Learn how Dr. Elephant can be used to improve production cluster operations, help developers avoid common issues, and green light applications for use on production clusters.
With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects.
This talk was given by Julian Hyde at the Apache Big Data conference, Vancouver, on 2016/05/09.
Evening out the uneven: dealing with skew in Flink (Flink Forward)
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Apache Tez - A New Chapter in Hadoop Data Processing (DataWorks Summit)
Apache Tez is a framework for accelerating Hadoop query processing. It is based on expressing a computation as a dataflow graph and executing it in a highly customizable way. Tez is built on top of YARN and provides benefits like better performance, predictability, and utilization of cluster resources compared to traditional MapReduce. It allows applications to focus on business logic rather than Hadoop internals.
Deep dive into stateful stream processing in structured streaming by Tathaga... (Databricks)
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which makes all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming. In particular, I am going to discuss the following. – Different stateful operations in Structured Streaming – How state data is stored in a distributed, fault-tolerant manner using State Stores – How you can write custom State Stores for saving state to external storage systems.
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Building a Streaming Microservice Architecture: with Apache Spark Structured ... (Databricks)
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads.
Building a data lake is a daunting task. The promise of a virtual data lake is to provide the advantages of a data lake without consolidating all data into a single repository. With Apache Arrow and Dremio, companies can, for the first time, build virtual data lakes that provide full access to data no matter where it is stored and no matter what size it is.
Apache Spark presentation at HasGeek FifthElephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
Scaling Search at Lendingkart discusses how Lendingkart scaled their search capabilities to handle large increases in data volume. They initially tried scaling databases vertically and horizontally, but searches were still slow, at 8 seconds. They implemented Elasticsearch for its near-real-time search, high scalability, and out-of-the-box functionality. Logstash was used to seed data from MySQL and MongoDB into Elasticsearch. Custom analyzers and mappings were developed. Searches then dropped to 230ms and aggregations to 200ms, allowing the business to scale as transactional data grew 3000% and leads 250%.
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas (MongoDB)
Moving to a new home is daunting. Packing up all your things, getting a vehicle to move it all, unpacking it, updating your mailing address, and making sure you did not leave anything behind. Well, the move to MongoDB Atlas is similar, but all the logistics are already figured out for you by MongoDB.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ... (NETWAYS)
At Uber we use high cardinality monitoring to observe and detect issues with our 4,000 microservices running on Mesos and across our infrastructure systems and servers. We’ll cover how we put the resulting 6 billion plus time series to work in a variety of different ways, auto-discovering services and their usage of other systems at Uber, setting up and tearing down alerts automatically for services, sending smart alert notifications that rollup different failures into individual high level contextual alerts, and more. We’ll also talk about how we accomplish all this with a global view of our systems with M3, our open source metrics platform. We’ll take a deep dive look at how we use M3DB, now available as an open source Prometheus long term storage backend, to horizontally scale our metrics platform in a cost efficient manner with a system that’s still sane to operate with petabytes of metrics data.
Our journey with druid - from initial research to full production scale (Itai Yaffe)
Here at the Nielsen Marketing Cloud we use druid.io (http://druid.io/) as one of our main data stores, both for simple counts and for approximate count-distinct (DataSketches).
It’s been more than a year since we started using it, ingesting billions of events each day into multiple Druid clusters for different use-cases.
In this meet-up, we will share our journey, the challenges we had, the way we overcame them (at least most of them) and the steps we made to optimize the process around Druid to keep the solution cost effective.
Before diving into Druid, we will briefly present our data pipeline architecture, starting from the front-end serving system, deployed in a number of geo-locations, to a centralized Kafka cluster in the cloud, and give some examples of the different processes that consume from Kafka and feed our different data sources.
Lessons learned from designing a QA Automation for analytics databases (big d... (Omid Vahdaty)
Have a big data product / database / DBMS? Need to test it? Don't know where to start? Some things to consider while you design your QA automation.
Link to Video
https://www.youtube.com/watch?v=MlT4pP7BGFQ
This document summarizes several systems for big data processing that extend or improve upon the MapReduce programming model. It discusses systems for iterative processing like HaLoop, stream processing like Muppet, improving performance through caching and indexing like Incoop and HAIL, and automatic optimization of MapReduce programs like MANIMAL and SkewTune. The document also briefly introduces broader distributed data processing frameworks beyond MapReduce like Dryad, SCOPE, Spark, Nephele/PACTs, and the ASTERIX scalable data platform.
Big Data in 200 km/h | AWS Big Data Demystified #1.3 (Omid Vahdaty)
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great to master by one person. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.
How to transform data (TXT, CSV, TSV, JSON) into Parquet or ORC? Which technology should we use to model the data - EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? How to handle streaming? How to manage costs? Performance tips? Security tips? Cloud best practices tips?
Some of our online materials:
Website:
https://big-data-demystified.ninja/
Youtube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
This presentation is an attempt to demystify the practice of building reliable data processing pipelines. We go through the necessary pieces needed to build a stable processing platform: data ingestion, processing engines, workflow management, schemas, and pipeline development processes. The presentation also includes component choice considerations and recommendations, as well as best practices and pitfalls to avoid, most learnt through expensive mistakes.
Dirty data? Clean it up! - Datapalooza Denver 2016 (Dan Lynn)
Dan Lynn (AgilData) & Patrick Russell (Craftsy) present on how to do data science in the real world. We discuss data cleansing, ETL, pipelines, hosting, and share several tools used in the industry.
This document discusses BlaBlaCar's transition from a relational database to the search engine Elasticsearch. It describes how BlaBlaCar changed its data model and indexing approach to better suit Elasticsearch. Challenges included modifying mappings, handling grouping, and dealing with IO limits on their initial two node cluster. The document encourages following BlaBlaCar and applying for open jobs.
Big data challenges are common: we are all doing aggregations, machine learning, anomaly detection, OLAP...
This presentation describes how InnerActive answers those requirements.
This document discusses using Druid, an open-source data store, to analyze large amounts of ad tech data. It summarizes Druid's requirements of handling over 80 dimensions and 20 metrics from 5 terabytes of raw data per day. It also describes the implementation of ingesting data from JSON to Parquet files in S3 using Spark, and then ingesting into Druid. Materialized views are created to optimize query performance by pre-aggregating data into different time intervals like hours, days, and weeks.
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned (Omid Vahdaty)
A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great to master by one person. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS & GCP and Data Center infrastructure to answer the basic questions of anyone starting their way in the big data world.
How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC or AVRO? Which technology should we use to model the data - EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? GCS? BigQuery? Dataflow? Datalab? TensorFlow? How to handle streaming? How to manage costs? Performance tips? Security tips? Cloud best practices tips?
In this meetup we present lecturers working with several cloud vendors, various big data platforms such as Hadoop and data warehouses, and startups working on big data products. Basically - if it is related to big data, this is THE meetup.
Some of our online materials (mixed content from several cloud vendor):
Website:
https://big-data-demystified.ninja (under construction)
Meetups:
https://www.meetup.com/Big-Data-Demystified
https://www.meetup.com/AWS-Big-Data-Demystified/
You tube channels:
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
The document provides an overview of NoSQL solutions and their advantages over traditional SQL databases. It discusses key-value stores like Redis and Cassandra, document databases like MongoDB and CouchDB, and graph databases. It also summarizes MySQL and HBase, evaluating them for the needs of Toluna's user vote data. HBase is suitable for analytics but not traditional queries, while MongoDB is good for new applications but MySQL may remain important due to performance and proven stability. Overall the document compares NoSQL options to traditional and distributed SQL solutions for scaling big data workloads.
Big data real-time architectures -
How do we do big data processing in real time?
What architectures are out there to support this paradigm?
Which one should we choose?
What advantages / pitfalls do they contain?
Counting Unique Users in Real-Time: Here's a Challenge for You! (DataWorks Summit)
Finding the number of unique users out of 10 billion events per day is challenging. At this session, we're going to describe how re-architecting our data infrastructure, relying on Druid and ThetaSketch, enables our customers to obtain these insights in real-time.
To put things into context, at NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. Specifically, we provide them with the ability to see the number of unique users who meet a given criterion.
Historically, we have used Elasticsearch to answer these types of questions, however, we have encountered major scaling and stability issues.
In this presentation we will detail the journey of rebuilding our data infrastructure, including researching, benchmarking and productionizing a new technology, Druid, with ThetaSketch, to overcome the limitations we were facing.
We will also provide guidelines and best practices with regards to Druid.
Topics include :
* The need and possible solutions
* Intro to Druid and ThetaSketch
* How we use Druid
* Guidelines and pitfalls
The document discusses using a vector database to enable question answering with custom data. Key points:
- Data is converted to vector embeddings and stored in a vector database like Pinecone to allow for similarity searches.
- When a user asks a question, it is converted to a vector and queried against the database to retrieve similar content to provide as input to a language model for generating an answer.
- The OpenAI API can also be used to build an assistant using a language model, where custom data is loaded to enable answering questions about that data as a "support manager."
Iceberg provides capabilities beyond traditional partitioning of data in Spark/Hive. It allows updating or deleting individual rows without rewriting partitions through merge-on-read (MOR) operations. It also supports ACID transactions through versions, faster queries through statistics and sorting, and flexible schema changes. Iceberg manages metadata that traditional formats like Parquet do not, enabling these new capabilities. It is useful for workloads that require updating or filtering data at a granular record level, managing data history through versions, or frequent schema changes.
The document discusses saving streaming data from Kafka to S3 using Spark Streaming while ensuring exactly-once delivery. It describes two options for handling failures: (1) writing offsets to a database, requiring additional cleanup; and (2) combining offsets with file paths in S3 to allow overwriting on failure without duplication. The implemented solution uses the second approach by partitioning data by date and sum of starting offsets and deleting folders before writing to ensure exactly-once delivery in a simple way without additional systems.
ElastiCache is a caching service that supports Memcached. Memcached is an in-memory key-value store that provides no persistence or replication. It is fast and preferable for caching relatively small static data. At a certain point, implementation knowledge is needed to ensure Memcached is behaving as expected. Production issues can occur if objects do not fit properly into Memcached slabs, which allocate fixed-size chunks of memory. Monitoring tools like "stats slabs" help analyze slab allocation and object eviction patterns.
The document discusses new features in Java 8 including lambda expressions, method references, functional interfaces, default methods, streams, Optional class, and the new date/time API. It provides examples and explanations of how these features work, common use cases, and how they improve functionality and code readability in Java. Key topics include lambda syntax, functional interfaces, default interface methods, Stream API operations like filter and collect, CompletableFuture for asynchronous programming, and the replacement of java.util.Date with the new date/time classes.
Spark Streaming with Kafka allows processing streaming data from Kafka in real-time. There are two main approaches - receiver-based and direct. The receiver-based approach uses Spark receivers to read data from Kafka and write to write-ahead logs for fault tolerance. The direct approach reads Kafka offsets directly without a receiver for better performance but less fault tolerance. The document discusses using Spark Streaming to aggregate streaming data from Kafka in real-time, persisting aggregates to Cassandra and raw data to S3 for analysis. It also covers using stateful transformations to update Cassandra in real-time.
Spark Streaming can be used to process streaming data from Kafka in real-time. There are two main approaches - the receiver-based approach where Spark receives data from Kafka receivers, and the direct approach where Spark directly reads data from Kafka. The document discusses using Spark Streaming to process tens of millions of transactions per minute from Kafka for an ad exchange system. It describes architectures where Spark Streaming is used to perform real-time aggregations and update databases, as well as save raw data to object storage for analytics and recovery. Stateful processing with mapWithState transformations is also demonstrated to update Cassandra in real-time.
This document provides an overview of key concepts in Android development including:
- Android components like Activities, Fragments, Services, etc.
- The application and activity lifecycles and how activities can move between states like active, paused, and stopped.
- Common Android development tools and how to structure an Android project with activities, fragments, and other components.
- Basic Android app architecture including the Linux kernel, Java framework layers, and how apps are packaged as APK files.
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B... (rightmanforbloodline)
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B. Fraleigh, Verified Chapters 1 - 56,.pdf
1. Overview of statistical software such as ODK, SurveyCTO, and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ... (JeevanKp7)
Long-term oxygen therapy (LTOT) and novel techniques of evaluating treatment efficacy have enhanced the quality of life and decreased healthcare expenses for COPD patients.
Since the cost of a pulmonary blood gas test is comparable to the cost of two days of oxygen therapy, and the cost of a hospital stay is equivalent to the cost of one month of oxygen therapy, long-term oxygen therapy (LTOT) is a cost-effective technique for treating this disease.
A small number of clinical investigations of LTOT have shown that it improves the quality of life of COPD patients by reducing the loss of their respiratory capacity. A study of 8,487 Danish patients found that LTOT for 15-24 hours per day extended life expectancy from 1.07 to 1.40 years.
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data (Samuel Jackson)
We present our work to improve data accessibility and performance for data-intensive tasks within the fusion research community. Our primary goal is to develop services that facilitate efficient access for data-intensive applications while ensuring compliance with FAIR principles [1], as well as adoption of interoperable tools, methods and standards.
The major outcome of our work is the successful creation and deployment of a data service for the MAST (Mega Ampere Spherical Tokamak) experiment [2], leading to substantial enhancements in data discoverability, accessibility, and overall data retrieval performance, particularly in scenarios involving large-scale data access. Our work follows the principles of Analysis-Ready, Cloud Optimised (ARCO) data [3] by using cloud optimised data formats for fusion data.
Our system consists of a query-able metadata catalogue, complemented with an object storage system for publicly serving data from the MAST experiment. We will show how our solution integrates with the Pandata stack [4] to enable data analysis and processing at scales that would previously have been intractable, paving the way for data-intensive workflows running routinely with minimal pre-processing on the part of the researcher. By using a cloud-optimised file format such as zarr [5] we can enable interactive data analysis and visualisation while avoiding large data transfers. Our solution integrates with common python data analysis libraries for large, complex scientific data, such as xarray [6] for complex data structures and dask [7] for parallel computation and lazily working with larger-than-memory datasets.
The incorporation of these technologies is vital for advancing simulation, design, and enabling emerging technologies like machine learning and foundation models, all of which rely on efficient access to extensive repositories of high-quality data. Relying on the FAIR guiding principles for data stewardship not only enhances data findability, accessibility, and reusability, but also fosters international cooperation on the interoperability of data and tools, driving fusion research into new realms and ensuring its relevance in an era characterised by advanced technologies in data science.
[1] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016) https://doi.org/10.1038/sdata.2016.18
[2] M Cox, The Mega Amp Spherical Tokamak, Fusion Engineering and Design, Volume 46, Issues 2–4, 1999, Pages 397-404, ISSN 0920-3796, https://doi.org/10.1016/S0920-3796(99)00031-9
[3] Stern, Charles, et al. "Pangeo forge: crowdsourcing analysis-ready, cloud optimized data production." Frontiers in Climate 3 (2022): 782909.
[4] Bednar, James A., and Martin Durant. "The Pandata Scalable Open-Source Analysis Stack." (2023).
[5] Alistair Miles (2024) ‘zarr-developers/zarr-python: v2.17.1’. Zenodo. doi: 10.5281/zenodo.10790679
[6] Hoyer, S. & Hamman, J., (20
How AI is Revolutionizing Data Collection.pdf (PromptCloud)
Artificial Intelligence (AI) is transforming the landscape of data collection, making it more efficient, accurate, and insightful than ever before. With AI, businesses can automate the extraction of vast amounts of data from diverse sources, analyze patterns in real-time, and gain deeper insights with minimal human intervention. This revolution in data collection enables companies to make faster, data-driven decisions, enhance their competitive edge, and unlock new opportunities for growth.
AI-powered tools can handle complex and dynamic web content, adapt to changes in website structures, and even understand the context of data through natural language processing. This means that data collection is not only faster but also more precise, reducing the time and effort required for manual data extraction. Furthermore, AI can process unstructured data, such as social media posts and customer reviews, providing valuable insights into customer sentiment and market trends.
Embrace the future of data collection with AI and stay ahead of the curve. Learn more about how PromptCloud’s AI-driven web scraping solutions can transform your data strategy. https://www.promptcloud.com/contact/
Data analytics is a powerful tool that can transform business decision-making across industries. Contact District 11 Solutions, which specializes in data analytics, to make informed decisions and achieve your business goals.
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ... (weiwchu)
We recently discovered that models trained with large-scale speech datasets sourced from the web could achieve superior accuracy and potentially lower cost than traditionally human-labeled or simulated speech datasets. We developed a customizable AI-driven data labeling system. It infers word-level transcriptions with confidence scores, enabling supervised ASR training. It also robustly generates phone-level timestamps even in the presence of transcription or recognition errors, facilitating the training of TTS models. Moreover, It automatically assigns labels such as scenario, accent, language, and topic tags to the data, enabling the selection of task-specific data for training a model tailored to that particular task. We assessed the effectiveness of the datasets by fine-tuning open-source large speech models such as Whisper and SeamlessM4T and analyzing the resulting metrics. In addition to openly-available data, our data handling system can also be tailored to provide reliable labels for proprietary data from certain vertical domains. This customization enables supervised training of domain-specific models without the need for human labelers, eliminating data breach risks and significantly reducing data labeling cost.
The Rise of Python in Finance, Automating Trading Strategies: _.pdf (Riya Sen)
In the dynamic realm of finance, where every second counts, the integration of technology has become indispensable. Aspiring traders and seasoned investors alike are turning to coding as a powerful tool to unlock new avenues of financial success. In this blog, we delve into the world of Python live trading strategies, exploring how coding can be the key to navigating the complexities of the market and securing your path to prosperity.
Getting Started with Interactive Brokers API and Python.pdf (Riya Sen)
In the fast-paced world of finance, automation is key to staying ahead of the curve. Traders and investors are increasingly turning to programming languages like Python to streamline their strategies and enhance their decision-making processes. In this blog post, we will delve into the integration of Python with Interactive Brokers, one of the leading brokerage platforms, and explore how this dynamic duo can revolutionize your trading experience.
Druid
1. Druid in Production
Dori Waldman - Big Data Lead
Guy Shemer - Big Data Expert
Alon Edelman - Big Data consultant
2. ● Druid
○ Demo
○ What is Druid and why you need it
○ Other solutions ...
● Production pains and how it works :
○ cardinality
○ cache
○ dimension types (list/regular/map)
○ segment size
○ partition
○ monitoring and analyzing hotspots
○ Query examples
○ lookups
● Interface
○ Pivot, Facet, Superset ...
○ Druid-Sql (JDBC)
○ Rest
Agenda
4. Why?
● Fast (real-time) analytics on large time-series data
○ MapReduce / Spark are not designed for real-time queries.
○ MPP - expensive / slow
● Just send raw data to Druid → specify which attributes are the dimensions, which are the metrics, and how to aggregate the metrics → Druid will create a cube (datasource)
○ Relational databases do not scale; we need fast queries on large data.
○ Key-value tables require a table per predefined query, and we need dynamic queries (cube)
http://static.druid.io/docs/druid.pdf
● We want to answer questions like:
○ #edits on the page Justin Bieber from males in San Francisco?
○ average #characters added by people from Calgary over the last month?
○ arbitrary combinations of dimensions, returned with sub-second latencies.
5. A row value can be a dimension (~WHERE in SQL) or a metric (measure)
● Dimensions are fields that can be filtered on or grouped by.
● Metrics are fields that can be aggregated. They are often stored as numbers but can also be stored as HyperLogLog sketches (approximate).
For example:
If Click is a dimension, we can select this dimension and see how the data is split according to the selected value (it might be better to convert values into categories, e.g. 0-20)
If Click is a metric, it will be a counter result, e.g. how many clicks we have in Israel
Dimension / Metric
Country ApplicationId Clicks
Israel 2 18
Israel 3 22
USA 80 19
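To make the distinction concrete, here is a minimal sketch of how the table above could be declared in a Druid ingestion spec: Country and ApplicationId as dimensions, Clicks as a summed metric. These are fragments of a dataSchema, with the surrounding fields omitted; the exact field names here are illustrative:
"dimensionsSpec": {
  "dimensions": ["country", "applicationId"]
},
"metricsSpec": [
  { "type": "longSum", "name": "clicks", "fieldName": "clicks" }
]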
6. Other options
● Open source solution:
○ Pinot (https://github.com/linkedin/pinot)
○ clickHouse (https://clickhouse.yandex/)
○ Presto (https://prestodb.io/)
https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
8. Components
● RealTime nodes - ingest and query event streams; events are immediately available to query, saved in cache and persisted to global storage (S3/HDFS), the "deep storage"
● Historical nodes - load and serve the immutable blocks of data (segments) from deep storage; these are the main workers
● Broker nodes - query routers to the historical and real-time nodes; they communicate with ZK to understand where the relevant segments are located
● Coordinator nodes - tell historical nodes to load new data, drop outdated data, replicate data, and balance the cluster by moving data
9. Components
● Overlord node - manages task distribution to middle managers; responsible for accepting tasks, coordinating task distribution, creating locks around tasks, and returning statuses to callers
● Middle manager node - executes submitted tasks by forwarding slices of tasks to peons. When Druid runs in local mode this part is redundant, since the overlord also takes on this responsibility
● Peon - runs a single task in a single JVM. Several peons may run on the same node
10. Components
(Stream)
● Tranquility
○ Ingests from Kafka/HTTP/Samza/Flink …
○ Will be end-of-life
○ Connects to the ZooKeeper of the Kafka cluster
○ Can connect to several clusters and read from several topics for the same Druid data source
○ Can't handle events that arrive after the window closes
● Kafka-Indexing-Service
○ Ingests from Kafka only (Kafka 0.10+)
○ Connects directly to Kafka's brokers
○ Connects to one cluster and one topic per Druid data source
○ The indexer manages its tasks better and uses checkpoints (~exactly once)
○ Can update events in old segments (no window)
○ Can run on spot instances (for other node types this is less recommended)
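As an illustration, a minimal sketch of the ioConfig section of a Kafka-Indexing-Service supervisor spec (the full supervisor spec, posted to the overlord, also contains a dataSchema and tuningConfig); the topic name and broker addresses are hypothetical:
"ioConfig": {
  "topic": "events",
  "consumerProperties": { "bootstrap.servers": "kafka1:9092,kafka2:9092" },
  "taskCount": 2,
  "replicas": 1,
  "taskDuration": "PT1H",
  "useEarliestOffset": false
}
Note that the brokers are given directly in consumerProperties, in contrast to Tranquility, which discovers the cluster through ZooKeeper.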
15. Batch
Ingestion
Indexing Task Types
● index_hadoop (with EMR)
○ Hadoop-based batch ingestion, using a Hadoop/EMR cluster to perform data processing and ingestion
● index (no EMR)
○ For small amounts of data; tasks execute within the indexing service without external Hadoop resources
16. Batch
Ingestion
Input source for batch indexing
● local
○ For POC
● Static (S3 / HDFS etc.)
○ Ingesting from your raw data
○ Also supports Parquet
○ Can be mapped dynamically to a specific date
● Druid’s Deep Storage
○ Use segments from one datasource in deep storage and transform them into another datasource, clean dimensions, change granularity, etc.
"inputSpec" : {
"type" : "static",
"paths" : "/MyDirectory/example/wikipedia_data.json"
}
"inputSpec": {
"type": "static",
"paths": "s3n://prod/raw/2018-01-01/00/,
s3n://staging/raw/2018-01-01/00/",
"filePattern": ".gz"
}
"inputSpec": {
"type": "dataSource",
"ingestionSpec": {
"dataSource" : "Hourly",
"intervals" :
["2017-11-06T00:00:00.000Z/2017-11-07T00:00:00.000Z"]
}
18. Lookups
● Purpose: replace dimension values, for example replace "1" with "New York City"
● In case the mapping is 1:1, an optimization ("injective": true) should be used; it will replace the value on the query result and not on the query input
● Lookups have no history (if the value of 1 was "new york" and it was changed to "new york city", the old value will not appear in the query result)
● Very small lookups (a count of keys on the order of a few dozen to a few hundred) can be passed at query time as a "map" lookup
● Usually you will use global cached lookups from a DB / file / Kafka
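For example, a minimal sketch of a query-time "map" lookup used inside a dimension spec; the dimension name cityId and the mapping values are hypothetical, and "isOneToOne" is the query-time counterpart of the injective optimization mentioned above:
{
  "type": "lookup",
  "dimension": "cityId",
  "outputName": "city",
  "lookup": {
    "type": "map",
    "map": { "1": "New York City", "2": "Tel Aviv" },
    "isOneToOne": true
  }
}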
20. Query:
TopN
● TopN
○ Grouped by a single dimension, sorted (ordered) according to the metric (~ "group by" one dimension + order)
○ TopNs are approximate in that each node will rank its top K results and only return those top K results to the broker
○ To get exact results, use a groupBy query and sort the results (better to avoid)
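A minimal sketch of a TopN query, assuming a hypothetical datasource named "events" with a country dimension and a clicks metric:
{
  "queryType": "topN",
  "dataSource": "events",
  "intervals": ["2018-01-01/2018-04-01"],
  "granularity": "all",
  "dimension": "country",
  "metric": "clicks",
  "threshold": 10,
  "aggregations": [
    { "type": "longSum", "name": "clicks", "fieldName": "clicks" }
  ]
}
Each historical node ranks its own top 10 and returns only those to the broker, which is why the merged result is approximate.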
21. Query:
TopN
● TopN hell in Pivot
Pivot uses nested TopNs (a filter and a TopN per row)
Try to reduce the number of unnecessary TopN queries
22. Query:
GroupBy
GroupBy
○ Grouped by multiple dimensions.
○ Unlike TopN, can use ‘having’ conditions over aggregated data.
Druid's vision is to replace timeseries and topN with an advanced groupBy query.
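A minimal groupBy sketch over the same hypothetical "events" datasource, showing the 'having' clause that TopN lacks:
{
  "queryType": "groupBy",
  "dataSource": "events",
  "intervals": ["2018-01-01/2018-02-01"],
  "granularity": "day",
  "dimensions": ["country", "applicationId"],
  "aggregations": [
    { "type": "longSum", "name": "clicks", "fieldName": "clicks" }
  ],
  "having": { "type": "greaterThan", "aggregation": "clicks", "value": 100 }
}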
23. Query:
TimeSeries
● Timeseries
○ Grouped by the time dimension only (no other dimensions)
○ A timeseries query will generally be faster than groupBy, as it takes advantage of the fact that segments are already sorted on time and does not need to use a hash table for merging.
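A minimal timeseries sketch (same hypothetical datasource); note that there is no dimensions field at all, which is what lets Druid skip the hash-table merge:
{
  "queryType": "timeseries",
  "dataSource": "events",
  "intervals": ["2018-01-01/2018-01-08"],
  "granularity": "hour",
  "aggregations": [
    { "type": "longSum", "name": "clicks", "fieldName": "clicks" }
  ]
}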
24. Query:
SQL
● Druid SQL
○ Translates SQL into native Druid queries on the query broker
■ using JSON over HTTP, by posting to the endpoint /druid/v2/sql/
■ or SQL queries using the Avatica JDBC driver
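For example, the JSON-over-HTTP form is a simple payload POSTed to /druid/v2/sql/; the table and column names here are hypothetical:
{
  "query": "SELECT country, SUM(clicks) AS clicks FROM events GROUP BY country ORDER BY clicks DESC LIMIT 10"
}
The broker translates this into a native query, roughly the TopN shown earlier.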
25. Query:
TimeBoundary
/ MetaData
● Time boundary
○ Returns the earliest and latest data points of a data set
● Segment metadata
○ Per-segment information:
■ dimension cardinality
■ min/max value per dimension
■ number of rows
● DataSource metadata
○ ...
26. Other
Queries...
● Select / Scan / Search
○ select - supports pagination; all data is loaded into memory
○ scan - returns results in streaming mode
○ search - returns dimension values that match a search criterion
The biggest difference between select and scan is that the scan query doesn't retain all rows in memory before rows can be returned to the client.
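A minimal scan sketch (hypothetical datasource), which streams rows back instead of materializing them all in memory:
{
  "queryType": "scan",
  "dataSource": "events",
  "intervals": ["2018-01-01/2018-01-02"],
  "resultFormat": "compactedList",
  "limit": 100
}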
27. Query
Performance
● Query with metrics
Metric calculation is done at query time, per metric, meaning that summing impressions, and later summing impressions plus summing clicks, will double the metric calculation time (think about list dimensions...)
30. ● Index 5TB of raw data daily from 3 different sources (S3 / Kafka)
● 40 dimensions, 10 metrics
● The datasource (table) should be updated every 3 hours
● Query latency: ~10 seconds for a query on one dimension over a 3-month range
○ Some dimensions are lists …
○ Some dimensions use lookups
Requirements
31. Working at scale
● We started with 14 dimensions (no lists) → for 8 months Druid answered all requirements
● We added 20 more dimensions (with lists) → Druid query time became slow ...
32. ● Hardware:
○ 30 nodes (i3.8xlarge), each running the historical and middleManager services
○ 2 nodes (m4.2xlarge), each running the coordinator and overlord services
○ 11 nodes (c4.2xlarge), each running the tranquility service
○ 2 nodes (i3.8xlarge), each running the broker service
■ (1 broker : 10 historicals)
○ Memcached: 3 nodes (cache.r3.8xlarge), version 1.4.34
Hardware
33. Data cleanup
● Cleanup reduces cardinality (replacing values with a dummy value)
● It's all about reducing the number of rows in the datasource
○ Druid saves the data in columnar storage, but in order to get better performance the cleanup process reduces the number of rows (although a query touches only 3 columns, it needs to read all items in those columns)
34. Data cleanup
● The correlation between dimensions is important.
○ Let's say we have one dimension, city, with 2000 unique cities
■ Adding a gender dimension will double the number of rows (assuming our raw data has both male/female per city)
■ Adding country (although we have 200 unique countries) will not have the same impact (cartesian product), as there is a 1:M relation between country and city.
● Better to reduce unrelated dimensions like country and age
35. Data cleanup
○ Use a timeseries query with a "count" aggregation (~ count(*) in Druid SQL) to measure your cleanup benefit (see the sketch below)
○ You can also use an estimator with the cardinality aggregation
○ If you want to estimate without doing the cleanup, you can use virtualColumns (filtering out specific values) with the byRow cardinality estimator
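A sketch of the measurement queries described above, combining a plain row count with a byRow cardinality estimate; the datasource and dimension names are hypothetical, and exact aggregator field names may vary by Druid version:
{
  "queryType": "timeseries",
  "dataSource": "events",
  "intervals": ["2018-01-01/2018-01-08"],
  "granularity": "day",
  "aggregations": [
    { "type": "count", "name": "rows" },
    { "type": "cardinality", "name": "dim_combinations", "fields": ["country", "city"], "byRow": true }
  ]
}
Comparing "rows" before and after a cleanup run shows the benefit directly.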
36. segments
● Shard size should be balanced between disk optimization (500MB-1.5GB) and CPU optimization (a core per segment during query); take list dimensions into account in this calculation …
Shard minimum size should be 100MB
● POC - convert lists to bitwise vectors
37. Partition
● Partition type
○ By default, Druid partitions the data according to timestamp; in addition you need to specify hashed / single-dimension partitioning
■ Partitioning may result in unbalanced segments
■ The default hashed partitioning uses all dimensions
■ Hashed partitioning is recommended in most cases, as it will improve indexing performance and create more uniformly sized data segments relative to single-dimension partitioning.
■ Single-dimension partitioning may be preferred in multi-tenancy use cases.
■ You might want to avoid the default hashed partitioning in case of a long tail
"partitionsSpec": {
"type": "hashed",
"targetPartitionSize": "4500000"
}
"partitionsSpec": {
"type": "dimension",
"targetPartitionSize": "10000000",
"partitionDimension": publisherId"
}
"partitionsSpec": {
"type": "hashed",
"numShards": "12",
"partitionDimensions": ["publisherId"]
}
38. Cache
● Cache:
○ hybrid
■ L1 - Caffeine (local)
■ L2 - Memcached (global)
When segments move between machines, the Caffeine cache is invalidated for those segments
○ Warm cache for popular queries (~300ms)
○ Cache is saved per segment and date; the cache key contains the dimensions, metrics and filters
○ TopN thresholds are part of the key: 0-1000, 1001, 1002 …
○ Cache in the historical nodes, not the broker, in order to merge less data on the broker side
39. Cache
● Cache:
○ Lookups have a pollPeriod, meaning that if it is set to 1 day, the cache will be invalidated (no eviction) every day even if the lookup was not updated (tsColumn). Since Imply 2.4.6 this issue is fixed by setting injective=true in the lookup configuration, meaning the lookup is no longer part of the cache key; it becomes a post-aggregation action in the brokers (see the sketch below).
■ Increasing the lookup polling period + hard-setting injective=true in the query is a workaround until 2.4.6
○ Rebuilding a segment (~new) causes the cache to be invalidated
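For reference, a sketch of a global cached lookup backed by JDBC, showing where pollPeriod, tsColumn and injective sit; the connection details and table/column names are hypothetical:
{
  "type": "cachedNamespace",
  "extractionNamespace": {
    "type": "jdbc",
    "connectorConfig": {
      "connectURI": "jdbc:mysql://db-host:3306/lookups",
      "user": "druid",
      "password": "..."
    },
    "table": "city_names",
    "keyColumn": "id",
    "valueColumn": "name",
    "tsColumn": "updated_at",
    "pollPeriod": "PT1H"
  },
  "injective": true
}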
41. ● Production issue:
○ Cluster was slow
■ Rebalancing all the time
■ Nodes disappeared, with no crash on the nodes
■ We found that during this time GC took a long time, and in the log we saw ZK disconnect-connect
○ We increased the ZK connection timeout
○ The solution was to decrease historical memory (reducing GC time)
42. Monitoring
/ Debug
Fix hotspots by increasing the number of segments to move until data is balanced
The statsd emitter does not send all metrics; use another (Clarity / Kafka)
43. Druid
Pattern
● Two data sources
○ Small (fewer rows and dimensions)
○ Large (all data), queried with filters only
44. Extra
● Load rules are used to manage which data is available to Druid; for example, we can set them to keep only the last month of data and drop older data every day (see the sketch after this list)
● Priority - Druid supports querying by priority
● Avoid the JavaScript extension (post-aggregation functions)
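The sketch referenced above: a retention rule chain (set per datasource via the coordinator) that keeps the last month of data with 2 replicas and drops everything older; the tier name and replica count are hypothetical:
[
  { "type": "loadByPeriod", "period": "P1M", "tieredReplicants": { "_default_tier": 2 } },
  { "type": "dropForever" }
]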