Cascalog is a Clojure-based query language for Hadoop that provides a powerful and easy-to-use tool for data analysis. It allows users to write queries as regular Clojure code, offering features like joins, aggregators, functions, and sorting. Cascalog is unique in that it offers the full power of Clojure at all times by integrating queries directly into the programming language. BackType uses Cascalog for tasks like identifying influencers on social media, determining exposure to URLs, and studying engagement over time.
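For readers unfamiliar with the syntax, here is a minimal sketch of what such a query looks like, based on the sample "playground" dataset that ships with Cascalog (dataset names like age come from that sample data):

```clojure
;; Minimal Cascalog example using the bundled playground data.
;; Finds everyone younger than 30 and prints the result tuples.
(use 'cascalog.playground)   ; sample datasets: age, follows, sentence, ...
(bootstrap)                  ; sets up a local, in-memory environment

(?<- (stdout)                ; sink: write result tuples to stdout
     [?person ?age]          ; output variables
     (age ?person ?age)      ; generator: the sample age dataset
     (< ?age 30))            ; filter predicate: an ordinary Clojure function
```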
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0 (Databricks)
The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro... (Databricks)
The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data.
Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w... (Databricks)
The majority of a data scientist’s time is spent cleaning and organizing data before insights can be derived. Frequently, with large datasets, a lack of integration with visualization tools makes it hard to know what’s most interesting in the data and also creates challenges for validating numerical insights from models. Given the vast number of tools available in the ecosystem, it is hard to experiment with different tools to pick the most suitable one, especially given the complexity involved in integrating them with one’s solution.
The speakers will present an easy to use workflow that solves this integration challenge by combining various open source libraries, databases (e.g. Hive, Postgres, MySQL, HBase etc.) and visualization with distributed analytics. Intel developed a highly scalable library built over Apache Spark with novel graph, statistical and machine learning algorithms that also enhances the user experience of Apache Spark via easier to use APIs.
This session will showcase how to address the above-mentioned issues for a drug similarity use case. We’ll go from ETL operations on raw drug data, to deriving relevant features from each drug’s chemical structure using statistical and graph algorithms, to identifying the best model and parameters for deriving insights, and finally to demonstrating the ease of connectivity to different databases and visualization tools.
This document discusses data engineering. It defines data engineering as software engineering focused on dealing with large amounts of data. It explains why data engineering has become important now due to advances in technology and economics. The document then discusses data engineering concepts like distributed systems, parallel processing, and databases. It provides an example of a data pipeline that collects tweets and processes them. Finally, it discusses qualities of an ideal data engineer.
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia (Spark Summit)
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
Open Source Big Data Ingestion - Without the Heartburn! (Pat Patterson)
Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail, inducing heartburn in even the most resilient data scientist. This session will survey the big data ingestion landscape, focusing on how open source tools such as Sqoop, Flume, Nifi and StreamSets can keep the data pipeline flowing.
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger (Sri Ambati)
H2O World 2015 - Brendan Herger of Capital One
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Michal Malohlava's presentation on Building Your Own Recommendation Engine 03.17.16
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar (Databricks)
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation’s monitoring pipeline processes about 40 billion events every day and generates metrics in near real time (within 30 seconds). All the components used alongside Apache Spark are horizontally scalable via auto-scaling, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned; for example, they still use Spark Streaming’s receiver-based method in certain use cases instead of the direct stream approach, and will share how both methods are applied, giving the knowledge back to the community.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... (Spark Summit)
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running on a separate machine/instance. By leveraging Spark Cluster with Elasticsearch Inside, it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East... (Spark Summit)
The document discusses Sparkle, a solution built by Comcast to address challenges in processing massive amounts of data and enabling data science workflows at scale. Sparkle is a centralized processing system with SQL and machine learning capabilities that is highly scalable and accessible via a REST API. It is used by Comcast to power various use cases including churn modeling, price elasticity analysis, and direct mail campaign optimization.
This document summarizes how Solr and Lucidworks Fusion can be used for big data search and analytics. It discusses indexing strategies like using MapReduce, Spark, and Fusion connectors to index structured and unstructured data from HDFS. It also covers topics like Solr on HDFS, auto add replicas, security, cluster sizing, and using the lambda architecture with Spark streaming to enable real-time search over batch-processed historical data. The document promotes Lucidworks Fusion as a search platform that can handle massive scales of data, provide real-time search capabilities, and work with any data source securely.
This presentation covers Apache Spark’s MLlib library for distributed machine learning, focusing on how we simplified elements of production-grade ML by building MLlib on top of Spark’s distributed DataFrame API.
This document discusses the history and development of Python data analysis tools, including pandas. It covers Wes McKinney's work on pandas from 2008 to the present, including the motivations for making data analysis easier and more productive. It also summarizes the development of related projects like Apache Arrow for standardizing columnar data representations to improve code reuse across languages.
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S... (Spark Summit)
Streaming applications have often been complex to design and maintain because of the significant upfront infrastructure investment required. However, with the advent of Spark an easy transition to stream processing is now available, enabling personalization applications and experiments to consume near real-time data without massive development cycles.
Our decision to evaluate Spark as our stream processing engine was primarily led by the following considerations: 1) ease of development for the team (already familiar with Spark for batch), 2) the scope/requirements of our problem, 3) re-usability of code from Spark batch jobs, and 4) Spark support from infrastructure teams within the company.
In this session, we will present our experience using Spark for stream processing unbounded datasets in the personalization space. The datasets consisted of, but were not limited to, the stream of playback events that are used as feedback for all personalization algorithms. These plays are used to extract specific behaviors which are highly predictive of a customer’s enjoyment of our service. This dataset is massive and has to be further enriched by other online and offline Netflix data sources. These datasets, when consumed by our machine learning models, directly affect the customer’s personalized experience, which means that the impact is high and the tolerance for failure is low. We’ll talk about the experiments we did to compare Spark with other streaming solutions like Apache Flink, the impact that we had on our customers, and most importantly, the challenges we faced.
Take-aways for the audience:
1) A great example of stream processing large, personalization datasets at scale.
2) An increased awareness of the costs/requirements for making the transition from batch to streaming successfully.
3) Exposure to some of the technical challenges that should be expected along the way.
Presto is an open source distributed SQL query engine for running queries against large datasets stored in Hadoop/HDFS clusters. It uses in-memory parallel processing, pipelining, data locality, caching, and dynamic compilation to byte code for low query latency. Key techniques include caching frequently used metadata and compiled plans, processing data locally on nodes where it resides, and controlling garbage collection to optimize native code generation. Presto has been tested on TPC-H benchmarks and is used at Meituan to query their 300+PB dataset across Hadoop clusters.
Spark Summit EU talk by Ruben Pulido and Behar Veliqi (Spark Summit)
The document discusses IBM's transition from a single-tenant Hadoop architecture to a multi-tenant Apache Spark architecture for their Watson Analytics for Social Media product. The new architecture aggregates social media data from thousands of tenants into a single stream and uses Spark, Kafka and Zookeeper to provide robust real-time analytics with low latency switching between tenants. Key aspects of the new architecture include separating analytics into tenant-specific and language-specific components, and removing state from processing components.
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit... (Spark Summit)
Analyzing and comparing your energy consumption with that of other consumers provides healthy peer pressure and useful insight, leading to energy conservation and impacting the bottom line. We helped GridPocket (http://www.gridpocket.com/), a smart grid company developing energy management applications for electricity, water and gas utilities, implement high-scale anonymized energy comparison queries with an order of magnitude lower cost and higher performance than was previously possible. IoT use cases like GridPocket’s are swamping our planet with data and driving demand for analytics on extremely scalable and low-cost storage. Enter Spark SQL over Object Storage: highly scalable and low-cost storage which provides RESTful APIs to store and retrieve objects and their metadata. Key performance indicators (KPIs) of query performance and cost are the number of bytes shipped from Object Storage to Spark and the number of incurred REST requests.

We propose Pluggable Spark SQL Filters, which extend the existing Spark SQL partitioning mechanism with the ability to dynamically filter irrelevant objects during query execution. Our approach handles any data format supported by Spark SQL (Parquet, JSON, CSV, etc.), and unlike pushdown-compatible formats such as Parquet, which require touching each object to determine its relevance, it avoids accessing irrelevant objects altogether. We developed a pluggable interface for developing and deploying Filters, and implemented GridPocket’s filter, which screens objects according to their metadata, for example geo-spatial bounding boxes describing the area covered by an object’s data points. This leads to drastically lower KPIs, since there is no need to ship the entire dataset from Object Storage to Spark if you are only comparing yourself with your neighborhood.

We demonstrate GridPocket analytics notebooks, report on our implementation and the resulting 10-20x speedups, explain how to implement a Pluggable File Filter, and show how we applied this to other use cases.
This document provides an overview of large scale graph analytics and JanusGraph. It discusses graph databases and their use cases. JanusGraph is presented as an open source graph database that can scale to billions of vertices and edges across multiple storage backends like HBase, Cassandra and Bigtable. It uses the TinkerPop framework and Gremlin query language. JanusGraph supports ACID transactions, external indices, and evolving schemas. Example graph queries are demonstrated using the Gremlin console.
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data... (Hadoop User Group)
Cascalog is a query language for Hadoop written in Clojure. It provides an alternative to Pig and Hive that leverages Clojure's features like being a functional language, Lisp-based syntax, and first-class integration with Java. Cascalog allows for full use of Clojure during queries with features like custom operations, dynamic queries, and side-by-side use with other Clojure code. BackType uses Cascalog for tasks like identifying influencers and engagement on Twitter.
Building and deploying LLM applications with Apache Airflow (Kaxil Naik)
Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integrations and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
"In this session, Twitter engineer Alex Payne will explore how the popular social messaging service builds scalable, distributed systems in the Scala programming language. Since 2008, Twitter has moved the development of its most critical systems to Scala, which blends object-oriented and functional programming with the power, robust tooling, and vast library support of the Java Virtual Machine. Find out how to use the Scala components that Twitter has open sourced, and learn the patterns they employ for developing core infrastructure components in this exciting and increasingly popular language."
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... (Helena Edelson)
O'Reilly Webcast with myself and Evan Chan on the new SNACK Stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
This document summarizes Konrad Malawski's talk on reactive programming and related topics. Konrad discusses Reactive Streams, the Akka toolkit for building concurrent applications, the actor model for concurrency, and how circuit breakers can be used as a substitute for flow control. He also talks about the origins and development of the Reactive Streams specification, which provides a common set of semantics for backpressure.
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...) (confluent)
Apache Kafka has become the modern central point for a fast and scalable streaming platform. Thanks to the open source explosion over the last decade, there are now numerous data stores available as sinks for Kafka-brokered data, from search to document stores, columnar DBs, time series DBs and more. While many claim to be the Swiss Army knife, in reality each is designed for specific types of data and analytics approaches. In this talk, we will cover the taxonomy of various data sinks and delve into each category's pros, cons and ideal use cases, so you can select the right ones and tie them together with Kafka into a well-considered architecture.
Orchestrating the Intelligent Web with Apache Mahout (aneeshabakharia)
Apache Mahout is an open source machine learning library for developing scalable algorithms. It includes algorithms for classification, clustering, recommendation engines, and frequent pattern mining. Mahout algorithms can be run locally or on Hadoop for distributed processing. Topic modeling using latent Dirichlet allocation is demonstrated for analyzing tweets and suggesting Twitter lists. While algorithms can provide benefits, some such as digital face manipulation can also be disturbing.
In this talk, we shared some of our highlights of the GraphQL Europe conference.
You can see the full coverage of the conference here: https://www.graph.cool/talks/
This document discusses using SQOOP to connect Hadoop and relational databases. It describes four common interoperability scenarios and provides an overview of SQOOP's features. It then focuses on optimizing SQOOP for Oracle databases by discussing how the Quest/Cloudera OraOop extension improves performance by bypassing Oracle parallelism and buffering. The document concludes by recommending best practices for using SQOOP and its extensions.
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend... (Lucidworks)
This document discusses Pearson's use of Apache Blur for distributed search and indexing of data from Kafka streams into Blur. It provides an overview of Pearson's learning platform and data architecture, describes the benefits of using Blur including its scalability, fault tolerance and query support. It also outlines the challenges of integrating Kafka streams with Blur using Spark and the solution developed to provide a reliable, low-level Kafka consumer within Spark that indexes messages from Kafka into Blur in near real-time.
Getting started with JavaScript can be somewhat challenging, especially given how fast the landscape changes. In this presentation I provide a general view of the state of the art. Besides this, I go through various JavaScript-related tricks that I've found useful in practice.
survivejs.com is a companion site to the presentation and goes into further detail on various topics.
The original presentation was given at AgileJkl, a local agile conference held in Central Finland.
Sparklife - Life In The Trenches With Spark (Ian Pointer)
This document provides tips and tricks for using Apache Spark. It discusses both the benefits of Spark, such as its developer-friendly API and performance advantages over MapReduce, as well as challenges, such as unstable APIs and the difficulty of distributed systems. It provides recommendations for optimizing Spark applications, including choosing the right data structures, partitioning strategies, and debugging and monitoring techniques. It also briefly compares Spark to other streaming frameworks like Storm, Heron, Flink, and Kafka.
An efficient data mining solution by integrating Spark and Cassandra (Stratio)
Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain a better result than using Spark over HDFS, because Cassandra's philosophy is much closer to the RDD philosophy than HDFS's is. The goal with Cassandra is to have a system that mines all the information stored in C* in a much more efficient way than having the information stored in HDFS. Cassandra data storage and Spark data mining power: an unrivalled mix.
Projects Valhalla, Loom and GraalVM at JUG Mainz (Vadym Kazulkin)
This document discusses three Java projects: Project Valhalla, Project Loom, and GraalVM.
Project Valhalla aims to introduce inline types (value types) to Java to improve performance by reducing memory usage and indirection. Project Loom introduces virtual threads and continuations to allow writing scalable concurrent code more easily. GraalVM is a runtime that uses partial evaluation to compile Java and other languages to machine code for high performance.
Similar to Cascalog at May Bay Area Hadoop User Group (20)
The inherent complexity of stream processing (nathanmarz)
The document discusses approaches for computing unique visitors to web pages over time ranges while dealing with changing user ID mappings.
Initially, three approaches are presented using a key-value store: storing user IDs in sets indexed by URL and hour bucket (Approach 1), using HyperLogLogs for more efficient storage (Approach 2), and storing at multiple granularities to reduce lookups (Approach 3).
The problem is made harder by the presence of "equiv" records that map one user ID to another. Later approaches try to incrementally normalize user IDs, sample user IDs, or maintain separate indexes.
Ultimately, a hybrid approach is proposed using batch computation over the entire dataset to build robust indexes,
Using Simplicity to Make Hard Big Data Problems Easy (nathanmarz)
The document proposes a simple approach to solving a complex problem of computing unique visitors over time ranges that involves maintaining normalized and denormalized views of the data. The approach involves:
1) Storing all data in a master dataset and continuously recomputing indexes and views as a function of all the data to maintain normalized and denormalized views.
2) Querying both recent real-time views and historical batch views to retrieve the necessary data for a time range query, combining for high performance and accuracy.
3) Approximating unique counts for recent data by ignoring real-time equivalences to keep the real-time layer simple while still providing good query performance and eventual accuracy.
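A minimal sketch of that idea in Clojure (illustrative only; the hour-bucket granularity, the view shapes, and the helper names here are assumptions, not the talk's actual code): unique visitors are kept per [url, hour] bucket, a batch view is recomputed from all the data, a realtime view covers recent hours, and a query unions the relevant buckets from both.

```clojure
;; Illustrative sketch (not the talk's code): unique visitors per URL over an
;; hour range, combining a recomputed batch view with a realtime view.
;; Each view maps [url hour] -> set of user ids.

(defn batch-view
  "Recomputed from scratch over the entire master dataset of
   {:url .. :user .. :hour ..} pageview records."
  [pageviews]
  (reduce (fn [view {:keys [url user hour]}]
            (update view [url hour] (fnil conj #{}) user))
          {}
          pageviews))

(defn realtime-view
  "Same shape, maintained incrementally for hours the latest batch run has
   not covered yet; user-id equivalences are ignored here, as in point 3."
  [view {:keys [url user hour]}]
  (update view [url hour] (fnil conj #{}) user))

(defn uniques-over-range
  "Query: union the per-hour sets from both views, then count."
  [batch realtime url from-hour to-hour]
  (->> (range from-hour (inc to-hour))
       (mapcat (fn [h] (concat (get batch [url h]) (get realtime [url h]))))
       set
       count))
```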
The Epistemology of Software Engineering (nathanmarz)
1. The document discusses the epistemology of software engineering and how it is impossible to know with certainty that software code is correct due to the limits of human knowledge.
2. It argues that the best approach is to embrace the imperfection of software and focus on minimizing errors through practices like testing, using small isolated components, and designing systems to be fault-tolerant of failures.
3. Key principles discussed include that reasoning is fundamentally difficult, so software should be designed in ways that require less reasoning, such as favoring pure functions and minimizing state mutation.
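As a tiny illustration of that last principle (my example, not the talk's): the pure version below depends only on its inputs, so it requires far less reasoning than the version built on mutable state.

```clojure
;; Stateful style: correctness depends on hidden, mutable state.
(def total (atom 0))
(defn record-sale! [amount]
  (swap! total + amount))

;; Pure style: the result is a function of the inputs alone, so it is
;; trivial to test and reason about.
(defn total-sales [amounts]
  (reduce + 0 amounts))

(total-sales [10 25 5]) ;=> 40
```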
The document discusses how code is often wrong due to unanticipated inputs, changing requirements, and bugs. It advocates embracing the idea that "your code is wrong" to design more robust software through principles like measuring inputs, monitoring systems, embracing immutability, minimizing dependencies, respecting functional ranges, and embracing recomputation to handle changing needs. The document uses examples from Storm and other systems to illustrate these principles for building software that can withstand failures and remain operational.
Runaway complexity in Big Data... and a plan to stop it (nathanmarz)
This document discusses sources of complexity in big data systems and proposes a new design approach. Common sources of complexity include a lack of human fault tolerance, conflating data and queries, and schemas done wrong. The author proposes building data systems based on immutability, separating data storage from querying through precomputed views, and using batch processing to compute views periodically with real-time processing to handle new data. This "Lambda Architecture" isolates complexity in the real-time layer and allows mistakes to be corrected through recomputation, providing an approach that is scalable, fault tolerant, and avoids many issues around consistency.
Storm is an open source distributed real-time computation system that guarantees data is processed reliably. It allows building real-time streaming pipelines that process unbounded streams of data. Key features include distributed operation, fault tolerance, guaranteed message processing, and a high-level abstraction over message passing.
Storm: distributed and fault-tolerant realtime computation (nathanmarz)
Storm is a distributed real-time computation system that provides guaranteed message processing, horizontal scalability, and fault tolerance. It allows users to define data processing topologies and submit them to a Storm cluster for distributed execution. Spouts emit streams of tuples that are processed by bolts. Storm tracks processing to ensure reliability and replays failed tasks. It provides tools for deployment, monitoring, and optimization of real-time data processing.
ElephantDB is a specialized key-value database designed for exporting data from Hadoop. It allows for random reads, batch writes, and random writes of data in a scalable way. ElephantDB indexes data using MapReduce to group keys by shard and write to local databases. It supports incremental indexing to avoid reindexing from scratch. Queries can be performed directly on ElephantDB or by reading indexed files from Hadoop. It provides a simpler and more reliable alternative to HBase for certain use cases.
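A toy sketch of the "group keys by shard" step (my illustration; ElephantDB's actual hashing and storage formats differ):

```clojure
;; Toy sketch: assign each key to a shard by hashing, the way a batch
;; indexing job might group records before writing one local DB per shard.
(defn shard-of [k num-shards]
  (mod (hash k) num-shards))

(defn group-into-shards [kvs num-shards]
  (group-by (fn [[k _v]] (shard-of k num-shards)) kvs))

(group-into-shards {"alice" 1 "bob" 2 "carol" 3} 4)
;;=> map of shard id -> key/value pairs destined for that shard
```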
Become Efficient or Die: The Story of BackType (nathanmarz)
BackType helps businesses understand social media data through analytics tools and APIs. With a small team of 3 employees and funding of $1.4 million, they are able to process 100 million social media messages per day from over 30 terabytes of data using a 100-200 machine computing cluster. The company focuses on developing efficiently using agile and lean startup methodologies like testing hypotheses with fake features before building them out fully, avoiding overengineering, and paying down technical debt regularly through "BackSweeps".
The Secrets of Building Realtime Big Data Systems (nathanmarz)
The architectural principles behind building systems that scale to vast amounts of data and operate on that data in realtime.
Presented at POSSCON '11.
Presentation to a combined meetup of Bay Area Lisp and Bay Area Clojure groups. Presented three Clojure projects at BackType:
Cascalog - Batch processing in Clojure
ElephantDB - Database written in Clojure
Storm - Distributed, fault-tolerant, reliable stream processing and RPC
The document describes the process of query planning and execution for the Cascalog query engine. It involves three main steps: 1) pre-aggregation to resolve variables and join data sources, 2) aggregation to group data and apply aggregators, and 3) post-aggregation to resolve remaining variables and apply filters. The document provides examples of how a sample query is planned and optimized at each step.
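To make the three steps concrete, here is a small query of my own (not from the document), annotated with the step that handles each predicate; follows is one of Cascalog's bundled sample datasets.

```clojure
(use 'cascalog.playground)
(bootstrap)
(require '[cascalog.ops :as c])

;; Who follows more than two people?
(?<- (stdout)
     [?person ?count]
     (follows ?person ?other)   ; pre-aggregation: resolve variables, join sources
     (c/count ?count)           ; aggregation: count rows per ?person group
     (> ?count 2))              ; post-aggregation: filter on the aggregated value
```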
Cascading is a Java library that makes it easy to develop complex workflows on Hadoop. It allows processing of large amounts of data in a scalable and fault-tolerant way. Cascading represents all data as tuples and defines flows as sequences of tuple-stream manipulations that compile to MapReduce jobs. For example, a flow could take tuples of words and values, split the words, group by word, and sum the values for each word. This provides a more flexible way to develop MapReduce applications compared to Pig.
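For comparison, roughly the same kind of flow expressed as a Cascalog query (a sketch; sentence is the bundled sample dataset of sentences, and split-words is a custom operation defined inline):

```clojure
(use 'cascalog.playground)
(bootstrap)
(require '[cascalog.ops :as c])

;; Custom operation: split a sentence, emitting one tuple per word.
(defmapcatop split-words [sentence]
  (seq (.split sentence "\\s+")))

;; Word count: split each sentence, group by word, count occurrences.
(?<- (stdout)
     [?word ?count]
     (sentence ?s)
     (split-words ?s :> ?word)
     (c/count ?count))
```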
11. What sets Cascalog apart?
Custom operations
No UDF interface
Just Clojure functions
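A sketch of what such operations look like (my examples; uppercase and long-name? are illustrative names, and the age dataset comes from the bundled playground data):

```clojure
(use 'cascalog.api)
(use 'cascalog.playground)
(bootstrap)
(require '[clojure.string :as str])

;; A custom map operation: just a Clojure function wrapped for Cascalog.
(defmapop uppercase [word]
  (str/upper-case word))

;; A custom filter: keep the tuple when the function returns true.
(deffilterop long-name? [word]
  (> (count word) 5))

;; Used inside a query like any other predicate.
(?<- (stdout)
     [?upper ?age]
     (age ?name ?age)
     (uppercase ?name :> ?upper)
     (long-name? ?name))
```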
12. What sets Cascalog apart?
Dynamic queries
Write functions that return queries
Manipulate queries as first-class entities in the language
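A sketch of what this looks like (my example; younger-than is an illustrative helper built on the bundled age dataset): the <- form builds a query as a value without running it, and ?- executes queries against sinks.

```clojure
(use 'cascalog.api)
(use 'cascalog.playground)
(bootstrap)

;; An ordinary Clojure function that returns a query as a value.
(defn younger-than [limit]
  (<- [?person]
      (age ?person ?age)
      (< ?age limit)))

;; Queries are first-class: build them at runtime, pass them around,
;; and execute them later with ?-.
(def under-30 (younger-than 30))
(?- (stdout) under-30)
```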
13. What sets Cascalog apart?
Use Cascalog side by side with other code
Appends and Distributed Copies
Consolidation
Application logic
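One way this plays out in practice (a sketch; ??<- is available in later Cascalog releases and returns result tuples as an in-memory sequence, so query output can feed straight into ordinary application logic):

```clojure
(use 'cascalog.api)
(use 'cascalog.playground)
(bootstrap)

;; Run a query and get the result tuples back as a plain Clojure sequence.
(let [adults (??<- [?person ?age]
                   (age ?person ?age)
                   (>= ?age 18))]
  ;; From here on it is ordinary Clojure: sort, format, feed into app logic.
  (doseq [[person years] (sort-by second adults)]
    (println person "is" years)))
```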
14. Easy Experimentation
Ships with a test dataset that can be queried locally (the “playground”)
5 minutes to set up Hadoop, Clojure, and Cascalog locally - see README
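A sketch of the kind of REPL experimentation the playground enables (age and follows are among the bundled sample datasets; the join on ?person happens implicitly because both predicates share that variable):

```clojure
(use 'cascalog.playground)
(bootstrap)

;; Who do people under 30 follow? Sharing ?person across the two
;; generators produces an implicit join.
(?<- (stdout)
     [?person ?followed]
     (age ?person ?age)
     (< ?age 30)
     (follows ?person ?followed))
```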
20. Cascalog at BackType
Cascalog is used to:
Identify influencers
Determine number of people exposed to URLs on Twitter
21. Cascalog at BackType
Cascalog is used to:
Identify influencers
Determine number of people exposed to URLs on Twitter
Identify “interesting tweets”
22. Cascalog at BackType
Cascalog is used to:
Identify influencers
Determine number of people exposed to URLs on Twitter
Identify “interesting tweets”
Study social engagement of domains over time
23. Cascalog at BackType
Cascalog is used to:
Identify influencers
Determine number of people exposed to URLs on Twitter
Identify “interesting tweets”
Study social engagement of domains over time
Etc, etc.
24. Cascalog at BackType
Input and output
Cascalog reads from MySQL databases and HDFS
Cascalog writes to Cassandra and HDFS
25. Cascalog at BackType
Rapid development
Local playground dataset for development
Develop queries in the REPL
29. Cascading and Cascalog
Provided by Cascading:
Tuple abstraction and tuple manipulation
Workflow to MapReduce translation
Read and write from anywhere with Taps
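As a rough illustration of the tap abstraction (a sketch; hfs-textline and hfs-seqfile are Cascalog's helpers over Cascading taps, and the paths are placeholders): the same query can read from and write to any location a tap knows how to reach.

```clojure
(use 'cascalog.api)

;; A distributed copy: read lines of text from one location, write them
;; out as Hadoop sequence files somewhere else. Swap the taps to change
;; where the data comes from or goes to, without touching the query.
(defn copy-lines [input-path output-path]
  (let [lines (hfs-textline input-path)]
    (?<- (hfs-seqfile output-path)
         [?line]
         (lines ?line))))

;; e.g. (copy-lines "/data/tweets" "/data/tweets-copy")
```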