The document discusses Rocana Search, a system built by Rocana to enable large scale real-time collection, processing, and analysis of event data. It aims to provide higher indexing throughput and better horizontal scaling than general purpose search systems like Solr. Key features include fully parallelized ingest and query, dynamic partitioning of data, and assigning partitions to nodes to maximize parallelism and locality. Initial benchmarks show Rocana Search can index over 3 times as many events per second as Solr.
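The dynamic, time-oriented partitioning described above can be illustrated with a small sketch: events map to partitions by time bucket, and partitions are spread across nodes for parallelism and locality. This is only a conceptual sketch in Java; the class name, bucket width, and node names are assumptions, not Rocana Search's actual code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical sketch of time-bucketed partitioning with round-robin node
// assignment; names and granularity are illustrative, not Rocana's API.
public class PartitionRouter {
    private final Duration bucketWidth;   // e.g. one partition per hour of event time
    private final List<String> nodes;     // indexer nodes available for assignment

    public PartitionRouter(Duration bucketWidth, List<String> nodes) {
        this.bucketWidth = bucketWidth;
        this.nodes = nodes;
    }

    /** Derive a partition id from the event timestamp, so partitions grow with time. */
    public long partitionFor(Instant eventTime) {
        return eventTime.toEpochMilli() / bucketWidth.toMillis();
    }

    /** Spread partitions across nodes so ingest and query parallelize evenly. */
    public String nodeFor(long partitionId) {
        return nodes.get((int) Math.floorMod(partitionId, (long) nodes.size()));
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(Duration.ofHours(1),
                List.of("indexer-1", "indexer-2", "indexer-3"));
        long p = router.partitionFor(Instant.parse("2016-09-21T10:15:30Z"));
        System.out.println("partition " + p + " -> " + router.nodeFor(p));
    }
}
```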
HPE provides optimized server architectures for Hadoop, including the Apollo 4200 server, which offers high storage density. HPE also offers a reference architecture for Hadoop that separates compute and storage resources for better performance, using optimized servers like Moonshot for processing and Apollo for storage. Additionally, HPE contributes to Apache Spark through HP Labs to improve memory efficiency and performance at scale.
This document summarizes improvements made to HDFS to optimize performance, stabilize operations, and improve supportability. Key areas discussed include logging enhancements, metrics and tools for troubleshooting, load management through RPC improvements, and changes to reduce garbage collection overhead and improve liveness detection. Specific optimizations covered range from code changes to reduce logging verbosity to adding batch processing of block reports.
Hadoop Infrastructure @Uber: Past, Present and Future (DataWorks Summit)
Uber's mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop is central to the data infrastructure. We want to talk about the journey of Hadoop at Uber and our future plans for scaling to support billions of trips. We will cover some of the most unique use cases Uber has and how Hadoop and the ecosystem we built helped us on this journey. We will talk about how we scaled from 10 to 2,000 nodes and, in the future, to tens of thousands of nodes. We will talk about our mistakes, learnings, and wins, and how we process billions of events per day. We will cover the unique challenges and real-world use cases, and how we will co-locate Uber's service architecture with batch workloads (e.g. data pipelines, machine learning, and analytical workloads). Uber has made many improvements to the Hadoop ecosystem and has uniquely solved some problems in ways that have not been solved before. This presentation will serve as an example for the audience and encourage them to enhance the ecosystem, which will help grow the communities around these projects and benefit the whole big data space. The audience is anybody working on big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes. This talk will help them understand the Hadoop ecosystem and how to use it efficiently, and will introduce some of the technologies the Uber team is building in the big data space.
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez (DataWorks Summit)
Last year at Yahoo, we put great effort into scaling, stabilizing, and making Pig on Tez production ready, and by the end of the year we retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez.
After successful migration and the improved performance we shifted our focus to addressing some of the bottlenecks we identified and new optimization ideas that we came up with to make it go even faster. We will go over the new features and work done in Tez to make that happen like custom YARN ShuffleHandler, reworking DAG scheduling order, serialization changes, etc.
We will also cover exciting new features that were added to Pig for performance, such as bloom join and bytecode generation. A distributed bloom join that can create multiple bloom filters in parallel was straightforward to implement with the flexibility of Tez DAGs. It vastly improved performance and reduced disk and network utilization for our large joins. Bytecode generation for projection and filtering of records is another big feature that we are targeting for Pig 0.17, which will speed up processing by reducing virtual function calls.
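The core idea of the bloom join is small enough to sketch: build a Bloom filter over the small side's join keys and use it to drop most non-matching rows from the large side before the join. The sketch below uses Guava's BloomFilter for illustration; Pig's actual implementation builds the filters in parallel inside a Tez DAG and is not shown here.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative sketch of the idea behind a bloom join: build a Bloom filter
// over the small side's join keys, then use it to discard most non-matching
// records from the large side before the expensive join/shuffle.
public class BloomJoinSketch {
    public static void main(String[] args) {
        List<String> smallSideKeys = List.of("user-1", "user-7", "user-42");
        List<String> largeSideKeys = List.of("user-1", "user-2", "user-3", "user-42", "user-99");

        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                smallSideKeys.size(), 0.01);          // expected insertions, false-positive rate
        smallSideKeys.forEach(filter::put);

        // Records failing mightContain() can never join, so they are dropped early;
        // false positives only cost a little extra work, never lost matches.
        largeSideKeys.stream()
                .filter(filter::mightContain)
                .forEach(k -> System.out.println("candidate join key: " + k));
    }
}
```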
The document discusses managing Hadoop, HBase and Storm clusters at Yahoo scale. It describes Yahoo's grid infrastructure which includes 3 data centers with over 45k nodes across 18 Hadoop clusters, 9 HBase clusters and 13 Storm clusters. It then provides details on the rolling upgrade processes for HDFS, YARN, HBase and Storm which involve minimizing downtime, upgrading components independently and verifying upgrades. CI/CD processes are used to automate software deployment and upgrades.
Hadoop clusters are operated on an ephemeral basis in the cloud by Qubole, processing over 300 petabytes of data per month across over 100 customers. Qubole addresses challenges of ephemeral clusters through auto-scaling of resources using YARN, optimizing performance for cloud storage, and storing job history remotely. Volatile low-cost nodes are leveraged through policies that ensure data replication despite potential node failures.
The document discusses accelerating Apache Hadoop through high-performance networking and I/O technologies. It describes how technologies like InfiniBand, RoCE, SSDs, and NVMe can benefit big data applications by alleviating bottlenecks. It outlines projects from the High-Performance Big Data project that implement RDMA for Hadoop, Spark, HBase and Memcached to improve performance. Evaluation results demonstrate significant acceleration of HDFS, MapReduce, and other workloads through the high-performance designs.
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse (DataWorks Summit)
As Apache Hadoop clusters become central to an organization's operations, organizations often run clusters in more than one data center. Historically, this has been largely driven by requirements of business continuity planning or geo-localization. It has also recently been gaining a lot of interest from a hybrid cloud perspective, i.e. where people are trying to augment their traditional on-prem setup with cloud-based additions. A robust replication solution is a fundamental requirement in such cases.
The Apache Hive community has been working on new capabilities for efficient and fault-tolerant replication of data in the Hive warehouse. In this talk, we will discuss these new capabilities, how they work, what replication at Hive scale looks like, what challenges it poses, and what we have done to solve those issues. We will also cover what to be aware of in your use case to make replication optimal.
Speaker
Sankar Hariappan, Senior Software Engineer, Hortonworks
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data... (DataWorks Summit)
In the last few years, the DevOps movement has introduced groundbreaking approaches to the way we manage the lifecycle of software development and deployment. Today organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark ages of software development.
Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating) - an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery.
In this session, we will take a close look at Apache Amaterasu (Incubating), a simple and powerful framework to build and deploy pipelines. Amaterasu aims to help data engineers and data scientists compose, configure, test, package, deploy and execute data pipelines written using multiple tools, languages and frameworks.
We will see what Amaterasu provides today, how it can help existing Big Data applications, and demo some of the new bits that are coming in the near future.
Speaker:
Yaniv Rodenski, Senior Solutions Architect, Couchbase
The document discusses the Stinger Initiative from Hortonworks to improve the performance and capabilities of interactive queries in Hive. The initiative takes a two-pronged approach, focusing on improvements to the query engine and the introduction of a new optimized column store file format called ORCFile. A new Tez execution engine is also introduced to avoid bottlenecks in MapReduce and enable lower latency queries. The goal is to extend Hive's ability to handle interactive queries with response times measured in seconds rather than minutes.
Practice of large Hadoop cluster in China Mobile (DataWorks Summit)
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster with scale more than 1600 nodes, on which we collect data from dozens of distributed clusters and make analysis for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and experience of constructing and tuning this large scale Hadoop cluster. Key points are as follows:
1. About Ambari: we improve Ambari with features such as support for HDFS Federation and Ambari HA, improving its performance and enabling it to support up to 1600 nodes.
2. About HDFS: we build a large HDFS cluster with up to 60 PB of data, using federation, ViewFS, and FairCallQueue (see the sketch after this list). Our best practices for cluster operation and management will also be included.
3. About Flume: we use a modified Flume to collect as much as 200 TB of data per day.
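As a rough illustration of point 2, ViewFS lets a federated cluster expose several NameNode namespaces behind one client-side mount table. The sketch below shows the idea with the stock Hadoop client API; the namespace names and paths are made up and do not reflect China Mobile's configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch of how ViewFS presents several federated namespaces as one
// client-side view; namespace and path names here are made up for illustration.
public class ViewFsMountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "viewfs://cluster/");
        // Each link maps a client-visible path onto one federated NameNode namespace.
        conf.set("fs.viewfs.mounttable.cluster.link./user", "hdfs://ns1/user");
        conf.set("fs.viewfs.mounttable.cluster.link./data", "hdfs://ns2/data");

        FileSystem fs = FileSystem.get(conf);
        // Resolves through the mount table to hdfs://ns2/data/events behind the scenes.
        System.out.println(fs.exists(new Path("/data/events")));
    }
}
```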
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop created by eBay to address limitations of existing tools in handling large volumes of metrics and logs from Hadoop clusters. It provides data activity monitoring, job performance monitoring, and unified monitoring. Eagle detects anomalies using machine learning algorithms and notifies users through alerts. It has been deployed across multiple eBay clusters with over 10,000 nodes and processes hundreds of thousands of events per day.
Solr + Hadoop: Interactive Search for Hadoop (gregchanan)
This document discusses Cloudera Search, which integrates Apache Solr with Cloudera's distribution of Apache Hadoop (CDH) to provide interactive search capabilities. It describes the architecture of Cloudera Search, including components like Solr, SolrCloud, and Morphlines for extraction and transformation. Methods for indexing data in real-time using Flume or batch using MapReduce are presented. The document also covers querying, security features like Kerberos authentication and collection-level authorization using Sentry, and concludes by describing how to obtain Cloudera Search.
This document discusses the new features of Apache Hive 2.0, including:
1) The addition of procedural SQL capabilities through HPLSQL to add features like cursors and loops.
2) Performance improvements for interactive queries through LLAP which uses in-memory caching and persistent daemons.
3) Using HBase as the metastore to speed up query planning by reducing metadata access times.
4) Enhancements to Hive on Spark such as dynamic partition pruning and vectorized joins.
5) Improvements to the cost-based optimizer including better statistics collection.
Exponea - Kafka and Hadoop as components of architecture (MartinStrycek)
Kafka and Hadoop were introduced at Exponea to address several issues:
- The in-memory database was very fast but limited by memory constraints. Customers wanted the freedom to analyze all their data.
- Processing large volumes of streaming data was problematic.
- HDFS is not well suited to continuously appending small writes, so Kafka was introduced to stream data for storage in Hadoop (see the sketch after this list).
- The new technologies introduced monitoring challenges for the expanded data stack.
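The Kafka-in-front-of-Hadoop pattern mentioned above can be sketched with the plain Kafka producer API: events are appended to a topic, and a downstream job later lands closed batches in HDFS. The broker address, topic name, and payload are illustrative assumptions, not Exponea's setup.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch of the "buffer events in Kafka, land them in Hadoop later"
// pattern; topic and broker names are assumptions, not Exponea's setup.
public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each event is appended to the topic; a separate consumer job
            // periodically writes closed batches to HDFS.
            producer.send(new ProducerRecord<>("tracking-events", "customer-123",
                    "{\"event\":\"page_view\",\"ts\":1474444800000}"));
            producer.flush();
        }
    }
}
```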
Deep learning has become widespread as frameworks such as TensorFlow and PyTorch have made it easy to onboard machine learning applications. However, while it is easy to start developing with these frameworks on your local developer machine, scaling up a model to run on a cluster and train on huge datasets is still challenging. Code and dependencies have to be copied to every machine and defining the cluster configurations is tedious and error-prone. In addition, troubleshooting errors and aggregating logs is difficult. Ad-hoc solutions also lack resource guarantees, isolation from other jobs, and fault tolerance.
To solve these problems and make scaling deep learning easy, we have made several enhancements to Hadoop and built an open-source deep learning platform called TonY. In this talk, Anthony and Keqiu will discuss new Hadoop features useful for deep learning, such as GPU resource support, and deep dive into TonY, which lets you run deep learning programs natively on Hadoop. We will discuss TonY's architecture and how it allows users to manage their deep learning jobs, acting as a portal from which to launch notebooks, monitor jobs, and visualize training results.
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes (DataWorks Summit)
The document discusses scaling HDFS to manage billions of files through distributed storage schemes. It outlines the current HDFS architecture and challenges with namespace and block scaling. It proposes a storage container architecture with distributed block maps and a storage container manager to address these challenges. This would allow HDFS to easily scale to manage trillions of blocks and billions of files across large clusters.
This document discusses deep learning using Spark and DL4J. It introduces the speakers, Adam Gibson and Dhruv Kumar, and outlines the topics to be covered: an overview of deep learning, architectures, implementation and libraries for real-life applications, and a demonstration. Deep learning is described as one technique in data science that excels at tasks like image recognition, speech translation, and voice recognition by being loosely inspired by human brain models. The document then discusses using these techniques for enterprise use cases and realizing modern data applications in a Hadoop-centric world.
This document contains testimonials from several people with disabilities who have worked or currently work at different companies, either through the Lantegi Batuak program or directly. In general, they express satisfaction with their jobs and coworkers, and a desire to continue developing in their roles or to improve their working conditions in the future.
We recently updated the Developmental Continua at GWA in Math and Literacy. Here is a parent presentation I did to help parents understand the new continua.
The document argues that people are not resources and that the Human Resources department should be called something else. It is a mistake to treat people as resources when they are in fact humans with resources of their own, such as knowledge and talent. Some companies now call it the Professional Staff Department or another name that recognizes people as the key to a company's competitive advantage.
Körting Hannover AG is a leading manufacturer of ejectors for the shipbuilding industry with over 140 years of experience. Ejectors are self-priming fluidic devices that use liquids, gases, or vapors to pump, evacuate, mix, or discharge other fluids without moving parts. Körting ejectors are customized for individual ship applications and used widely for bilge pumping, ballast handling, and other tasks. They provide reliable operation with low maintenance needs and costs.
Keynote at Codebits in Portugal, April 2014, explaining the how and why of Firefox OS and how to use it.
Video: https://videos.sapo.pt/ZYQyY57ZlB6lhgIdBzrs
The story is set in Academy City, a technologically advanced city located west of Tokyo that specializes in the development of psychic powers, but also in a world where magic is real. Touma Kamijou is a high school student whose right hand holds a mysterious power called Imagine Breaker, which can negate any supernatural phenomenon, whether psychic or magical, but which also cancels out his own good luck. One day he finds a girl named Index hanging from his balcony. She is a nun who carries in her mind the Index Librorum Prohibitorum, a collection of 103,000 magical books. When the paths of science and magic cross, this story begins.
The document provides instructions for building an artisanal seedbed shelter for producing vegetable seedlings. It explains the steps for selecting a site, excavating the foundations, placing posts and walls, and building the roof. It also covers installing insect netting, transparent sheeting, and plant trays, as well as the benefits of protecting the seedlings and recommendations for construction.
The document describes a cloud ERP system called Tesla ERP. It provides the following key benefits:
- It allows companies to access all their business information from any device with an internet connection. This gives employees mobility and real-time visibility.
- Data is securely stored on servers rather than locally, eliminating risks of data loss, theft, or corruption. Automatic backups are done daily.
- Various modules can be selected to manage tasks like accounting, sales, customer support, and more. This streamlines workflows and allows for fast decision making based on analytics and reporting.
The document presents a summary of an interview with the president of the sports club Coutadas Trail Team. In it, Bernardo Rodríguez explains that the club was formed a few years ago by a group of friends who practiced running and now has around 30 members. Its main objective is to maintain the friendship among its members and enjoy practicing sport together. The club takes part in mountain races but does not offer other services beyond federation licenses.
This document describes the importance of community participation and different levels of participation. It also explains traditional versus participatory methodologies, giving examples of participatory techniques such as talking maps, flip charts, and group dynamics. The objective is to promote the active participation of communities in their own development.
This document presents technical supervision report No. 10 for the construction of the Ciudadela Escolar in Santa Cruz de Lorica, Córdoba. It describes the progress of the works between October 30 and November 30, 2012, noting that work has been suspended without justification by the contractor. The supervisor recommends that the contractor resume the works in order to meet the established deadlines and that the municipal administration take action on the matter.
The document describes the fundamentals of weld inspection using ultrasound instead of radiography, in compliance with international codes and standards. It explains that ultrasonic equipment must be appropriate, with validated methodologies and procedures and competent personnel to ensure reliability. It describes different ultrasonic inspection methods such as Phased Array, TOFD, and AUT, their applications, their advantages over radiography, and the requirements for inspection.
Resource Sharing and Communication: serving the needs of the malaria clientele (Joseph Yap)
The document summarizes the history and activities of the ACTMalaria Information Resource Center (AIRC), which aims to facilitate information sharing on malaria among countries in Asia. AIRC was established in 2005 to revitalize the previous Mekong Malaria Documentation Center. It collects malaria-related resources from 10 partner countries and libraries and makes them available online. AIRC also provides training to partners and works to establish satellite libraries to expand access to malaria information. Based on user feedback, AIRC is still developing but aims to be an effective online databank on malaria in Asia through cooperation with partners.
Amazon EC2 provides a broad selection of instance types to accommodate a diverse mix of workloads. In this session, we provide an overview of the Amazon EC2 instance platform, key platform features, and the concept of instance generations. We dive into the current generation design choices of the different instance families, including General Purpose, Compute Optimized, Storage Optimized, Memory Optimized, and GPU instance. We also detail best practices and share performance tips for getting the most out of your Amazon EC2 instances.
This document describes grammatical categories and types of words from a morphological and syntactic point of view. It explains the properties of nouns, adjectives, determiners, pronouns, and verbs, including their classes and agreement. It also describes verbal morphemes such as person, number, tense, and aspect.
DataEngConf SF16 - High cardinality time series search (Hakka Labs)
The document discusses high cardinality time series search and scaling to large datasets. It describes the author's company which collects and analyzes machine data at terabytes per day. General purpose search systems are good for moderate scale but have limitations for high cardinality data with large retention. The author's company built Rocana Search to optimize for their time-oriented event data, with features like parallel ingestion and querying, dynamic partitioning, and keeping all data online without wasted resources. It can handle billions of events per day with low latency and utilizes modern hardware through full distribution.
High cardinality time series search: A new level of scale - Data Day Texas 2016 (Eric Sammer)
Modern search systems provide incredible feature sets, developer-friendly APIs, and low latency indexing and query response. By some measures, these systems operate "at scale," but rarely is that quantified. Customers of Rocana typically look to push ingest rates in excess of 1 million events per second, retaining years of data online for query, with the expectation of sub-second response times for any reasonably sized subset of data.
We quickly found that the tradeoffs made by general purpose search systems, while right for common use cases, were less appropriate for these high cardinality, large scale use cases.
This session details the architecture, tradeoffs, and interesting implementation decisions made in building a new time series optimized distributed search system using Apache Lucene, Kafka, and HDFS. Data ingestion and durability, index and metadata organization, storage, query scheduling and optimization, and failure modes will be covered. Finally, a summary of the results achieved will be shown.
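One low-level building block of such a time-series-optimized design is an index per time slice, with the event timestamp indexed as a point value so range queries can prune whole partitions. The sketch below shows that idea with plain Lucene; the field names, paths, and hourly granularity are assumptions for illustration, not the actual Rocana Search layout.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Rough sketch of the building block underneath a time-partitioned search
// system: one Lucene index per time slice, with the event timestamp indexed
// as a point value for range queries. Field names and paths are illustrative.
public class EventIndexer {
    public static void main(String[] args) throws Exception {
        long eventTs = System.currentTimeMillis();
        long hourBucket = eventTs / 3_600_000L;           // pick the index for this hour

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/data/index/hour-" + hourBucket)),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new LongPoint("ts", eventTs));                    // range-searchable
            doc.add(new StoredField("ts_stored", eventTs));           // retrievable value
            doc.add(new TextField("message", "disk latency spike on host-17", Field.Store.YES));
            writer.addDocument(doc);
            // Commits can be infrequent when a reliable log (e.g. Kafka) backs ingest.
            writer.commit();
        }
    }
}
```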
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement (VMware Tanzu)
This document provides an agenda for a hands-on introduction and hackathon kickoff for Apache Geode. The agenda includes details about the hackathon, an introduction to Apache Geode including its history and key features, a hands-on lab to build, run, and use Geode, and a Q&A session. It also outlines how to contribute to the Geode project through code, documentation, issue tracking, and mailing lists.
Building production Spark Streaming applications (Joey Echeverria)
Designing, implementing, and testing an Apache Spark Streaming application is necessary to deploy to production but is not sufficient for long term management and monitoring. Simply learning the Spark Streaming APIs only gets you part of the way there. In this talk, I’ll be focusing on everything that happens after you’ve implemented your application in the context of a real-time alerting system for IT operational data.
Slides presented at Great Indian Developer Summit 2016 at the session MySQL: What's new on April 29 2016.
Contains information about the new MySQL Document Store released in April 2016.
Memory-optimized indexes allow Couchbase indexes to scale independently from document data by keeping indexes entirely in memory. This enables superior index performance by leveraging multiple CPU cores and large amounts of RAM. The Nitro storage engine powers memory-optimized indexes through lock-free operations, fast snapshotting using MVCC, and concurrent non-intrusive backup and recovery. Benchmark results showed memory-optimized indexes can scale write throughput linearly with CPU cores and eliminate the need for partitioning to scale performance.
The document is a presentation on Oracle NoSQL Database that discusses its use cases, Oracle's NoSQL and big data strategy, technical features of Oracle NoSQL Database, and customer references. The presentation covers how Oracle NoSQL Database can be used for real-time event processing, sensor data acquisition, fraud detection, recommendations, and globally distributed databases. It also discusses Oracle's approach to integrating NoSQL, Hadoop, and relational databases. Customer references are provided for Airbus's use of Oracle NoSQL Database for flight test sensor data storage and analysis.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex (Apache Apex)
Apache Apex is a next-gen big data analytics platform. Originally developed at DataTorrent, it comes with a powerful stream processing engine, a rich set of functional building blocks, and an easy-to-use API for developers to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance and processing guarantees, its programming model, and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
Taking Splunk to the Next Level - Architecture Breakout Session (Splunk)
This document discusses strategies for scaling a Splunk deployment. It begins by describing how customers typically start with a single use case but then need to scale to handle more data and use cases. It then covers strategies for scaling the forwarding, indexing, search, and management components of Splunk. Key topics include load balancing forwarders, using indexer clustering for high availability, scaling search heads by clustering, and using the deployment server and distributed management console for centralized management. The document emphasizes planning storage capacity and I/O when scaling indexers and considering Splunk's application support when scaling search heads.
Apache Geode Meetup, Cork, Ireland at CIT (Apache Geode)
This document provides an introduction to Apache Geode (incubating), including:
- A brief history of Geode and why it was developed
- An overview of key Geode concepts such as regions, caching, and functions
- Examples of interesting large-scale use cases from companies like Indian Railways
- A demonstration of using Geode with Apache Spark and Spring XD for a stock prediction application
- Information on how to get involved with the Geode open source project community
YARN: a resource manager for analytic platform (Tsuyoshi OZAWA)
The document discusses YARN, a resource manager for Apache Hadoop. It provides an overview of YARN and its key features: (1) managing resources in a cluster, (2) managing application history logs, and (3) a service registry mechanism. It then discusses how distributed processing frameworks like Tez and Spark work on YARN, focusing on their directed acyclic graph (DAG) models and techniques for improving performance on YARN like container reuse.
This document discusses architectural considerations for running big data workloads on OpenStack at Comcast. It provides an overview of Comcast's use of OpenStack, describes several big data use cases and application profiles at Comcast, and makes recommendations for using disaggregated or hyper-converged storage approaches for different applications like Kafka and HDFS. It also covers testing strategies, operational considerations, and choices around implementing HDFS and S3 object storage.
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming (Apache Apex)
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data.
Apache Apex is a native Hadoop data-in-motion platform. We will discuss the architectural differences between Apache Apex and Spark Streaming, and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLAs, high throughput, and large-scale ingestion.
We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.
Managing Security At 1M Events a Second using Elasticsearch (Joe Alex)
The document discusses managing security events at scale using Elasticsearch. Some key points:
- The author manages security logs for customers, collecting, correlating, storing, indexing, analyzing, and monitoring over 1 million events per second.
- Before Elasticsearch, traditional databases couldn't scale to billions of logs, searches took days, and advanced analytics weren't possible. Elasticsearch allows customers to access and search logs in real-time and perform analytics.
- Their largest Elasticsearch cluster has 128 nodes indexing over 20 billion documents per day totaling 800 billion documents. They use Hadoop for long term storage and Spark and Kafka for real-time analytics.
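At rates like these, documents are indexed in large batches rather than one HTTP call per event. The sketch below shows the shape of a request against Elasticsearch's _bulk endpoint using only the JDK HTTP client; the host, index name, and document fields are assumptions, not the deployment described in the talk.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hedged sketch of high-throughput indexing against Elasticsearch's _bulk
// endpoint: many documents per request instead of one HTTP call per event.
// Host, index name, and document shape are assumptions for illustration.
public class BulkIndexSketch {
    public static void main(String[] args) throws Exception {
        // The bulk body is newline-delimited JSON: an action line, then a source line.
        String ndjson =
                "{\"index\":{\"_index\":\"security-logs\"}}\n" +
                "{\"ts\":1474444800000,\"src_ip\":\"10.0.0.5\",\"action\":\"deny\"}\n" +
                "{\"index\":{\"_index\":\"security-logs\"}}\n" +
                "{\"ts\":1474444801000,\"src_ip\":\"10.0.0.9\",\"action\":\"allow\"}\n";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_bulk"))
                .header("Content-Type", "application/x-ndjson")
                .POST(HttpRequest.BodyPublishers.ofString(ndjson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // per-item results, including any failures
    }
}
```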
Frontera: open source, large scale web crawling framework (Scrapinghub)
This document describes Frontera, an open source framework for large scale web crawling. It discusses the architecture and components of Frontera, which includes Scrapy for network operations, Apache Kafka as a data bus, and Apache HBase for storage. It also outlines some challenges faced during the development of Frontera and solutions implemented, such as handling large websites that flood the queue, optimizing traffic to HBase, and prioritizing URLs. The document provides details on using Frontera to crawl the Spanish (.es) web domain and presents results and future plans.
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016 (cdmaxime)
What we do:
- We build a system for the operation of modern data centers
- Triage and diagnostics, exploration, trends, advanced analytics of complex systems
- Our data: logs, metrics, human activity, anything that occurs in the data center
- Enterprise software (i.e. we build for others)
Today's presentation: how we built what we built on top of Apache Hadoop
Similar to Time-oriented event search. A new level of scale (20)
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using an HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations currently process various types of data in different formats. Most often this data is in free form. As the consumers of this data grow, it is imperative that this free-flowing data adhere to a schema. A schema helps data consumers have an expectation about the type of data they are getting, and it lets them avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline a really easy way to integrate with and support various systems that use different data formats.
SchemaRegistry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume that schema without any impact if the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, and more.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted, the model should take into account users who are watching a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
Deep learning is not just hype - it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different big data engines like Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning - a domain that current DL research treats step-motherly. As we can see in this demo, LSTM networks can learn very complex system behavior - in this case, data coming from a physical model simulating bearing vibration data. One drawback of deep learning is that normally a very large labeled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with deep learning - no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open source. Only open source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases, cutting across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, the rate of false positives for actual defects is higher, and they are generally wasteful.
At Hortonworks, we've designed and implemented an automated log analysis system, Mool, using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into the recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records, to identify the root cause of errors across multiple components. The system works in unsupervised mode with no perfect model, stable builds, or source-code version to refer to. In addition, the system provides limited recommendations to file or reopen past tickets and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
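The "intelligent key design" point can be made concrete with a small sketch: salting a time-series row key so that monotonically increasing timestamps spread across regions instead of hot-spotting one. The bucket count, column family, and field names below are illustrative assumptions, not a prescription from the document.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative sketch of intelligent key design for time series in HBase:
// prefix the key with a small salt so sequential timestamps spread across
// regions instead of hot-spotting one region. Names are assumptions.
public class SaltedKeySketch {
    private static final int SALT_BUCKETS = 16;

    static byte[] rowKey(String metric, long timestampMillis) {
        byte salt = (byte) Math.floorMod(metric.hashCode(), SALT_BUCKETS);
        // salt | metric | timestamp keeps one metric's points contiguous per bucket,
        // so short scans stay cheap while writes fan out over 16 key ranges.
        return Bytes.add(new byte[]{salt},
                Bytes.add(Bytes.toBytes(metric), Bytes.toBytes(timestampMillis)));
    }

    public static void main(String[] args) {
        Put put = new Put(rowKey("cpu.user.host-17", System.currentTimeMillis()));
        put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("value"), Bytes.toBytes(0.73d));
        System.out.println(put);   // would be sent with Table.put(put) in a real client
    }
}
```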
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity Scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, covering some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) are concerned is not (yet) important. This talk first introduces you to the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx (Fwdays)
I will share my personal experience of full-time development on wasm Blazor
What difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, which technology stack and architectural patterns we chose
What conclusions we drew and what mistakes we made
Finetuning GenAI For Hacking and Defending (Priyanka Aash)
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations (webbyacad software)
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
How UiPath Discovery Suite supports identification of Agentic Process Automat... (DianaGray10)
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
Redefining Cybersecurity with AI Capabilities (Priyanka Aash)
In this comprehensive overview of Cisco's latest innovations in cybersecurity, the focus is squarely on resilience and adaptation in the face of evolving threats. The discussion covers the imperative of tackling Mal information, the increasing sophistication of insider attacks, and the expanding attack surfaces in a hybrid work environment. Emphasizing a shift towards integrated platforms over fragmented tools, Cisco introduces its Security Cloud, designed to provide end-to-end visibility and robust protection across user interactions, cloud environments, and breaches. AI emerges as a pivotal tool, from enhancing user experiences to predicting and defending against cyber threats. The blog underscores Cisco's commitment to simplifying security stacks while ensuring efficacy and economic feasibility, making a compelling case for their platform approach in safeguarding digital landscapes.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottle necks, making a fix and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their application fast and understanding how things work under the hood.
Keynote: Presentation on SASE Technology (Priyanka Aash)
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
Keynote: AI & Future Of Offensive Security (Priyanka Aash)
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
UiPath Community Day Amsterdam: Code, Collaborate, Connect (UiPathCommunity)
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
It's your unstructured data: How to get your GenAI app to production (and spe... (Zilliz)
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Self-Healing Test Automation Framework - Healenium (Knoldus Inc.)
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
Discovery Series - Zero to Hero - Task Mining Session 1 (DianaGray10)
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
YMMV Not necessarily true for you
Enterprise software – shipping stuff to people
Fine grained events – logs, user behavior, etc.
For everything – solving the problem of “enterprise wide” ops, so it’s everything from everywhere from everyone for all time (until they run out of money for nodes).
This isn’t condemnation of general purpose search engines as much as what we had to do for our domain
It does most of what you want for most cases most of the time.
They’ve solved some really hard problems.
Content search (e.g. news sites, document repos), finite size datasets (e.g. product catalogs), low cardinality datasets that fit in memory. Not us.
Flexible systems with a bevy of full text search features
Moderate and fixed document count: big by historical standards, small by ours.
Design reflects these assumptions.
Fixed sharding at index creation. Partition events into N buckets. For long-retention time-based systems, this isn't how we think. Let's keep it until it's painful. Then we add boxes. When that's painful, we prune. Not sure what that looks like. Repartitioning is not feasible at scale. Partition count should be dynamic.
Multi-level partitioning is painful without building your own query layer; by range(time), then hash(region) or identity(region).
All shards are open all the time. Implicit assumption that either you 1. have queries that touch the data evenly or 2. have infinite resources. Recent time events are hotter than distant, but distant still needs to be available for query.
Poor cache control. Recent data should be in cache. Historical scans shouldn’t push recent data out of cache.
APIs are extremely “single record” focused. REST with record-at-a-time is absolutely abysmal for high throughput systems. Batch indexing is not useful. No in between.
Read replicas are expensive and homogenous. Ideally we have 3 read replicas for the last N days and 1 for others. Replicas (for performance) should take up space in memory, but not on disk.
Ingest concurrency tends to be wonky; whole lotta locking going on. Anecdotally, it’s difficult to get Solr Cloud to light up all cores on a box without running multiple JVMs; something is weird.
We can get the benefits of NRT indexing speed with fewer writer checkpoints because our ingest pipeline acts as a reliable log. We recover from Kafka based on the last time the writer checkpointed so we can checkpoint very infrequently if we want.
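A rough illustration of that recovery path: on restart, the indexer seeks the Kafka partition back to the offset recorded at its last checkpoint and re-indexes from there. The topic, partition, and checkpoint value below are assumptions for the sketch, not the actual implementation.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Sketch of "Kafka as the reliable log" recovery: on restart, seek back to the
// offset recorded at the last writer checkpoint and re-index from there.
// Topic, partition, and the checkpoint value are illustrative assumptions.
public class CheckpointRecovery {
    public static void main(String[] args) {
        long lastCheckpointedOffset = 1_234_567L;   // read from the index's checkpoint metadata

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "search-indexer");
        props.put("enable.auto.commit", "false");   // offsets are owned by the indexer, not Kafka
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("events", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, lastCheckpointedOffset);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                // Re-index anything written after the last checkpoint; duplicates are
                // acceptable when indexing the same event twice is idempotent.
                System.out.println(record.offset() + " -> " + record.value());
            }
        }
    }
}
```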
We know our data doesn’t change, or changes very little, after a certain point, so we can optimize and freeze indexes reducing write amplification from compactions.
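A minimal illustration of that freeze step with plain Lucene: once a time slice stops changing, merge it down to a single segment and stop writing to it, so there is no further compaction write amplification. The path is made up; this is a sketch of the idea, not production code.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Sketch of "optimize and freeze" for a time slice that no longer changes:
// merge the partition down to a single segment once, then stop writing to it.
public class FreezeOldPartition {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/data/index/hour-417000")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.forceMerge(1);   // one final, expensive merge for a now read-only partition
            writer.commit();
        }
    }
}
```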
There are plenty of ways we could have pushed the general purpose systems, and we did.
We layered our own partitioning and shard selection on top of Solr Cloud with time-based collection round-robining. That got us pretty far, but not far enough. We were starting to do a lot of query rewriting and scheduling.
Run multiple JVMs per box. Gross. Unsupportable.
Push historical queries out of search to a system such as Spark.
Build weird caches of frequent data sets.
At some point, the cost of hacking outweighed the cost of building.