Cascalog is a Clojure-based query language for Hadoop that provides a powerful and easy-to-use tool for data analysis. It allows users to write queries as regular Clojure code, offering features like joins, aggregators, functions, and sorting. Cascalog is unique in that it offers the full power of Clojure at all times by integrating queries directly into the programming language. BackType uses Cascalog for tasks like identifying influencers on social media, determining exposure to URLs, and studying engagement over time.
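For readers unfamiliar with the syntax, here is a minimal sketch of what such a query looks like, based on the sample "playground" dataset that ships with Cascalog (dataset names like age come from that sample data):

```clojure
;; Minimal Cascalog example using the bundled playground data.
;; Finds everyone younger than 30 and prints the result tuples.
(use 'cascalog.playground)   ; sample datasets: age, follows, sentence, ...
(bootstrap)                  ; sets up a local, in-memory environment

(?<- (stdout)                ; sink: write result tuples to stdout
     [?person ?age]          ; output variables
     (age ?person ?age)      ; generator: the sample age dataset
     (< ?age 30))            ; filter predicate: an ordinary Clojure function
```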
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0 (Databricks)
The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro... (Databricks)
The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data.
Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w... (Databricks)
The majority of a data scientist’s time is spent cleaning and organizing data before insights can be derived. Frequently, with large datasets, a lack of integration with visualization tools makes it hard to know what’s most interesting in the data and also creates challenges for validating numerical insights from models. Given the vast number of tools available in the ecosystem, it is hard to experiment with different tools to pick the most suitable one, especially given the complexity involved in integrating them with one’s solution.
The speakers will present an easy to use workflow that solves this integration challenge by combining various open source libraries, databases (e.g. Hive, Postgres, MySQL, HBase etc.) and visualization with distributed analytics. Intel developed a highly scalable library built over Apache Spark with novel graph, statistical and machine learning algorithms that also enhances the user experience of Apache Spark via easier to use APIs.
This session will showcase how to address the above-mentioned issues for a drug similarity use case. We’ll go from ETL operations on raw drug data, to deriving relevant features from each drug’s chemical structure using statistical and graph algorithms, to identifying the best model and parameters for deriving insights, and finally to demonstrating the ease of connectivity to different databases and visualization tools.
This document discusses data engineering. It defines data engineering as software engineering focused on dealing with large amounts of data. It explains why data engineering has become important now due to advances in technology and economics. The document then discusses data engineering concepts like distributed systems, parallel processing, and databases. It provides an example of a data pipeline that collects tweets and processes them. Finally, it discusses qualities of an ideal data engineer.
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia (Spark Summit)
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
Open Source Big Data Ingestion - Without the Heartburn! (Pat Patterson)
Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail, inducing heartburn in even the most resilient data scientist. This session will survey the big data ingestion landscape, focusing on how open source tools such as Sqoop, Flume, Nifi and StreamSets can keep the data pipeline flowing.
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger (Sri Ambati)
H2O World 2015 - Brendan Herger of Capital One
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Michal Malohlava's presentation on Building Your Own Recommendation Engine 03.17.16
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar (Databricks)
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation’s monitoring pipeline processes about 40 billion events every day and generates metrics in near real time (within 30 seconds). All the components used alongside Apache Spark are horizontally scalable via auto-scaling, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned; for example, they still use Spark Streaming’s receiver-based method in certain use cases instead of the direct stream approach, and will share how both methods are applied, giving the knowledge back to the community.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... (Spark Summit)
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running on a separate machine/instance. By leveraging Spark Cluster with Elasticsearch Inside, it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East... (Spark Summit)
The document discusses Sparkle, a solution built by Comcast to address challenges in processing massive amounts of data and enabling data science workflows at scale. Sparkle is a centralized processing system with SQL and machine learning capabilities that is highly scalable and accessible via a REST API. It is used by Comcast to power various use cases including churn modeling, price elasticity analysis, and direct mail campaign optimization.
This document summarizes how Solr and Lucidworks Fusion can be used for big data search and analytics. It discusses indexing strategies like using MapReduce, Spark, and Fusion connectors to index structured and unstructured data from HDFS. It also covers topics like Solr on HDFS, auto add replicas, security, cluster sizing, and using the lambda architecture with Spark streaming to enable real-time search over batch-processed historical data. The document promotes Lucidworks Fusion as a search platform that can handle massive scales of data, provide real-time search capabilities, and work with any data source securely.
This presentation covers Apache Spark’s MLlib library for distributed machine learning, focusing on how we simplified elements of production-grade ML by building MLlib on top of Spark’s distributed DataFrame API.
This document discusses the history and development of Python data analysis tools, including pandas. It covers Wes McKinney's work on pandas from 2008 to the present, including the motivations for making data analysis easier and more productive. It also summarizes the development of related projects like Apache Arrow for standardizing columnar data representations to improve code reuse across languages.
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S... (Spark Summit)
Streaming applications have often been complex to design and maintain because of the significant upfront infrastructure investment required. However, with the advent of Spark an easy transition to stream processing is now available, enabling personalization applications and experiments to consume near real-time data without massive development cycles.
Our decision to evaluate Spark as our stream processing engine was primarily led by the following considerations: 1) ease of development for the team (already familiar with Spark for batch), 2) the scope/requirements of our problem, 3) re-usability of code from Spark batch jobs, and 4) Spark support from infrastructure teams within the company.
In this session, we will present our experience using Spark for stream processing unbounded datasets in the personalization space. The datasets consisted of, but were not limited to, the stream of playback events that are used as feedback for all personalization algorithms. These plays are used to extract specific behaviors which are highly predictive of a customer’s enjoyment of our service. This dataset is massive and has to be further enriched by other online and offline Netflix data sources. These datasets, when consumed by our machine learning models, directly affect the customer’s personalized experience, which means that the impact is high and the tolerance for failure is low. We’ll talk about the experiments we did to compare Spark with other streaming solutions like Apache Flink, the impact that we had on our customers, and most importantly, the challenges we faced.
Take-aways for the audience:
1) A great example of stream processing large, personalization datasets at scale.
2) An increased awareness of the costs/requirements for making the transition from batch to streaming successfully.
3) Exposure to some of the technical challenges that should be expected along the way.
Presto is an open source distributed SQL query engine for running queries against large datasets stored in Hadoop/HDFS clusters. It uses in-memory parallel processing, pipelining, data locality, caching, and dynamic compilation to byte code for low query latency. Key techniques include caching frequently used metadata and compiled plans, processing data locally on nodes where it resides, and controlling garbage collection to optimize native code generation. Presto has been tested on TPC-H benchmarks and is used at Meituan to query their 300+PB dataset across Hadoop clusters.
Spark Summit EU talk by Ruben Pulido and Behar Veliqi (Spark Summit)
The document discusses IBM's transition from a single-tenant Hadoop architecture to a multi-tenant Apache Spark architecture for their Watson Analytics for Social Media product. The new architecture aggregates social media data from thousands of tenants into a single stream and uses Spark, Kafka and Zookeeper to provide robust real-time analytics with low latency switching between tenants. Key aspects of the new architecture include separating analytics into tenant-specific and language-specific components, and removing state from processing components.
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit... (Spark Summit)
Analyzing and comparing your energy consumption with that of other consumers provides healthy peer pressure and useful insight, leading to energy conservation and impacting the bottom line. We helped GridPocket (http://www.gridpocket.com/), a smart grid company developing energy management applications for electricity, water and gas utilities, implement high-scale anonymized energy comparison queries with an order of magnitude lower cost and higher performance than was previously possible. IoT use cases like GridPocket’s are swamping our planet with data and driving demand for analytics on extremely scalable and low-cost storage. Enter Spark SQL over Object Storage: highly scalable and low-cost storage which provides RESTful APIs to store and retrieve objects and their metadata. Key performance indicators (KPIs) of query performance and cost are the number of bytes shipped from Object Storage to Spark and the number of incurred REST requests.

We propose Pluggable Spark SQL Filters, which extend the existing Spark SQL partitioning mechanism with the ability to dynamically filter irrelevant objects during query execution. Our approach handles any data format supported by Spark SQL (Parquet, JSON, CSV, etc.), and unlike pushdown-compatible formats such as Parquet, which require touching each object to determine its relevance, it avoids accessing irrelevant objects altogether. We developed a pluggable interface for developing and deploying Filters, and implemented GridPocket’s filter, which screens objects according to their metadata, for example geo-spatial bounding boxes describing the area covered by an object’s data points. This leads to drastically lower KPIs, since there is no need to ship the entire dataset from Object Storage to Spark if you are only comparing yourself with your neighborhood.

We demonstrate GridPocket analytics notebooks, report on our implementation and the resulting 10-20x speedups, explain how to implement a Pluggable File Filter, and show how we applied this to other use cases.
This document provides an overview of large scale graph analytics and JanusGraph. It discusses graph databases and their use cases. JanusGraph is presented as an open source graph database that can scale to billions of vertices and edges across multiple storage backends like HBase, Cassandra and Bigtable. It uses the TinkerPop framework and Gremlin query language. JanusGraph supports ACID transactions, external indices, and evolving schemas. Example graph queries are demonstrated using the Gremlin console.
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data... (Hadoop User Group)
Cascalog is a query language for Hadoop written in Clojure. It provides an alternative to Pig and Hive that leverages Clojure's features like being a functional language, Lisp-based syntax, and first-class integration with Java. Cascalog allows for full use of Clojure during queries with features like custom operations, dynamic queries, and side-by-side use with other Clojure code. BackType uses Cascalog for tasks like identifying influencers and engagement on Twitter.
Building and deploying LLM applications with Apache Airflow (Kaxil Naik)
Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integrations and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
"In this session, Twitter engineer Alex Payne will explore how the popular social messaging service builds scalable, distributed systems in the Scala programming language. Since 2008, Twitter has moved the development of its most critical systems to Scala, which blends object-oriented and functional programming with the power, robust tooling, and vast library support of the Java Virtual Machine. Find out how to use the Scala components that Twitter has open sourced, and learn the patterns they employ for developing core infrastructure components in this exciting and increasingly popular language."
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... (Helena Edelson)
O'Reilly Webcast with myself and Evan Chan on the new SNACK Stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
This document summarizes Konrad Malawski's talk on reactive programming and related topics. Konrad discusses Reactive Streams, the Akka toolkit for building concurrent applications, the actor model for concurrency, and how circuit breakers can be used as a substitute for flow control. He also talks about the origins and development of the Reactive Streams specification, which provides a common set of semantics for backpressure.
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...) (confluent)
Apache Kafka has become the modern central point for a fast and scalable streaming platform. Thanks to the open source explosion over the last decade, there are now numerous data stores available as sinks for Kafka-brokered data, from search to document stores, columnar DBs, time series DBs and more. While many claim to be the Swiss Army knife, in reality each is designed for specific types of data and analytics approaches. In this talk, we will cover the taxonomy of various data sinks and delve into each category's pros, cons and ideal use cases, so you can select the right ones and tie them together with Kafka into a well-considered architecture.
Orchestrating the Intelligent Web with Apache Mahout (aneeshabakharia)
Apache Mahout is an open source machine learning library for developing scalable algorithms. It includes algorithms for classification, clustering, recommendation engines, and frequent pattern mining. Mahout algorithms can be run locally or on Hadoop for distributed processing. Topic modeling using latent Dirichlet allocation is demonstrated for analyzing tweets and suggesting Twitter lists. While algorithms can provide benefits, some such as digital face manipulation can also be disturbing.
In this talk, we shared some of our highlights of the GraphQL Europe conference.
You can see the full coverage of the conference here: https://www.graph.cool/talks/
This document discusses using SQOOP to connect Hadoop and relational databases. It describes four common interoperability scenarios and provides an overview of SQOOP's features. It then focuses on optimizing SQOOP for Oracle databases by discussing how the Quest/Cloudera OraOop extension improves performance by bypassing Oracle parallelism and buffering. The document concludes by recommending best practices for using SQOOP and its extensions.
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend... (Lucidworks)
This document discusses Pearson's use of Apache Blur for distributed search and indexing of data from Kafka streams into Blur. It provides an overview of Pearson's learning platform and data architecture, describes the benefits of using Blur including its scalability, fault tolerance and query support. It also outlines the challenges of integrating Kafka streams with Blur using Spark and the solution developed to provide a reliable, low-level Kafka consumer within Spark that indexes messages from Kafka into Blur in near real-time.
Getting started with JavaScript can be somewhat challenging, especially given how fast the landscape changes. In this presentation I provide a general view of the state of the art. Besides this, I go through various JavaScript-related tricks that I've found useful in practice.
survivejs.com is a companion site to the presentation and goes into further detail on various topics.
The original presentation was given at AgileJkl, a local agile conference held in Central Finland.
Sparklife - Life In The Trenches With Spark (Ian Pointer)
This document provides tips and tricks for using Apache Spark. It discusses both the benefits of Spark, such as its developer-friendly API and performance advantages over MapReduce, as well as challenges, such as unstable APIs and the difficulty of distributed systems. It provides recommendations for optimizing Spark applications, including choosing the right data structures, partitioning strategies, and debugging and monitoring techniques. It also briefly compares Spark to other streaming frameworks like Storm, Heron, Flink, and Kafka.
An efficient data mining solution by integrating Spark and Cassandra (Stratio)
Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain a better result than using Spark over HDFS, because Cassandra's philosophy is much closer to the RDD philosophy than HDFS's is. The goal with Cassandra is to have a system that mines all the information stored in C* in a much more efficient way than having the information stored in HDFS. Cassandra data storage and Spark data mining power: an unrivalled mix.
Projects Valhalla, Loom and GraalVM at JUG Mainz (Vadym Kazulkin)
This document discusses three Java projects: Project Valhalla, Project Loom, and GraalVM.
Project Valhalla aims to introduce inline types (value types) to Java to improve performance by reducing memory usage and indirection. Project Loom introduces virtual threads and continuations to allow writing scalable concurrent code more easily. GraalVM is a runtime that uses partial evaluation to compile Java and other languages to machine code for high performance.
Similar to Cascalog at May Bay Area Hadoop User Group (20)
The inherent complexity of stream processing (nathanmarz)
The document discusses approaches for computing unique visitors to web pages over time ranges while dealing with changing user ID mappings.
Initially, three approaches are presented using a key-value store: storing user IDs in sets indexed by URL and hour bucket (Approach 1), using HyperLogLogs for more efficient storage (Approach 2), and storing at multiple granularities to reduce lookups (Approach 3).
The problem is made harder by the presence of "equiv" records that map one user ID to another. Later approaches try to incrementally normalize user IDs, sample user IDs, or maintain separate indexes.
Ultimately, a hybrid approach is proposed using batch computation over the entire dataset to build robust indexes,
Using Simplicity to Make Hard Big Data Problems Easy (nathanmarz)
The document proposes a simple approach to solving a complex problem of computing unique visitors over time ranges that involves maintaining normalized and denormalized views of the data. The approach involves:
1) Storing all data in a master dataset and continuously recomputing indexes and views as a function of all the data to maintain normalized and denormalized views.
2) Querying both recent real-time views and historical batch views to retrieve the necessary data for a time range query, combining for high performance and accuracy.
3) Approximating unique counts for recent data by ignoring real-time equivalences to keep the real-time layer simple while still providing good query performance and eventual accuracy.
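A minimal sketch of that idea in Clojure (illustrative only; the hour-bucket granularity, the view shapes, and the helper names here are assumptions, not the talk's actual code): unique visitors are kept per [url, hour] bucket, a batch view is recomputed from all the data, a realtime view covers recent hours, and a query unions the relevant buckets from both.

```clojure
;; Illustrative sketch (not the talk's code): unique visitors per URL over an
;; hour range, combining a recomputed batch view with a realtime view.
;; Each view maps [url hour] -> set of user ids.

(defn batch-view
  "Recomputed from scratch over the entire master dataset of
   {:url .. :user .. :hour ..} pageview records."
  [pageviews]
  (reduce (fn [view {:keys [url user hour]}]
            (update view [url hour] (fnil conj #{}) user))
          {}
          pageviews))

(defn realtime-view
  "Same shape, maintained incrementally for hours the latest batch run has
   not covered yet; user-id equivalences are ignored here, as in point 3."
  [view {:keys [url user hour]}]
  (update view [url hour] (fnil conj #{}) user))

(defn uniques-over-range
  "Query: union the per-hour sets from both views, then count."
  [batch realtime url from-hour to-hour]
  (->> (range from-hour (inc to-hour))
       (mapcat (fn [h] (concat (get batch [url h]) (get realtime [url h]))))
       set
       count))
```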
The Epistemology of Software Engineering (nathanmarz)
1. The document discusses the epistemology of software engineering and how it is impossible to know with certainty that software code is correct due to the limits of human knowledge.
2. It argues that the best approach is to embrace the imperfection of software and focus on minimizing errors through practices like testing, using small isolated components, and designing systems to be fault-tolerant of failures.
3. Key principles discussed include that reasoning is fundamentally difficult, so software should be designed in ways that require less reasoning, such as favoring pure functions and minimizing state mutation.
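As a tiny illustration of that last principle (my example, not the talk's): the pure version below depends only on its inputs, so it requires far less reasoning than the version built on mutable state.

```clojure
;; Stateful style: correctness depends on hidden, mutable state.
(def total (atom 0))
(defn record-sale! [amount]
  (swap! total + amount))

;; Pure style: the result is a function of the inputs alone, so it is
;; trivial to test and reason about.
(defn total-sales [amounts]
  (reduce + 0 amounts))

(total-sales [10 25 5]) ;=> 40
```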
The document discusses how code is often wrong due to unanticipated inputs, changing requirements, and bugs. It advocates embracing the idea that "your code is wrong" to design more robust software through principles like measuring inputs, monitoring systems, embracing immutability, minimizing dependencies, respecting functional ranges, and embracing recomputation to handle changing needs. The document uses examples from Storm and other systems to illustrate these principles for building software that can withstand failures and remain operational.
Runaway complexity in Big Data... and a plan to stop it (nathanmarz)
This document discusses sources of complexity in big data systems and proposes a new design approach. Common sources of complexity include a lack of human fault tolerance, conflating data and queries, and schemas done wrong. The author proposes building data systems based on immutability, separating data storage from querying through precomputed views, and using batch processing to compute views periodically with real-time processing to handle new data. This "Lambda Architecture" isolates complexity in the real-time layer and allows mistakes to be corrected through recomputation, providing an approach that is scalable, fault tolerant, and avoids many issues around consistency.
Storm is an open source distributed real-time computation system that guarantees data is processed reliably. It allows building real-time streaming pipelines that process unbounded streams of data. Key features include distributed operation, fault tolerance, guaranteed message processing, and a high-level abstraction over message passing.
Storm: distributed and fault-tolerant realtime computation (nathanmarz)
Storm is a distributed real-time computation system that provides guaranteed message processing, horizontal scalability, and fault tolerance. It allows users to define data processing topologies and submit them to a Storm cluster for distributed execution. Spouts emit streams of tuples that are processed by bolts. Storm tracks processing to ensure reliability and replays failed tasks. It provides tools for deployment, monitoring, and optimization of real-time data processing.
ElephantDB is a specialized key-value database designed for exporting data from Hadoop. It allows for random reads, batch writes, and random writes of data in a scalable way. ElephantDB indexes data using MapReduce to group keys by shard and write to local databases. It supports incremental indexing to avoid reindexing from scratch. Queries can be performed directly on ElephantDB or by reading indexed files from Hadoop. It provides a simpler and more reliable alternative to HBase for certain use cases.
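A toy sketch of the "group keys by shard" step (my illustration; ElephantDB's actual hashing and storage formats differ):

```clojure
;; Toy sketch: assign each key to a shard by hashing, the way a batch
;; indexing job might group records before writing one local DB per shard.
(defn shard-of [k num-shards]
  (mod (hash k) num-shards))

(defn group-into-shards [kvs num-shards]
  (group-by (fn [[k _v]] (shard-of k num-shards)) kvs))

(group-into-shards {"alice" 1 "bob" 2 "carol" 3} 4)
;;=> map of shard id -> key/value pairs destined for that shard
```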
Become Efficient or Die: The Story of BackType (nathanmarz)
BackType helps businesses understand social media data through analytics tools and APIs. With a small team of 3 employees and funding of $1.4 million, they are able to process 100 million social media messages per day from over 30 terabytes of data using a 100-200 machine computing cluster. The company focuses on developing efficiently using agile and lean startup methodologies like testing hypotheses with fake features before building them out fully, avoiding overengineering, and paying down technical debt regularly through "BackSweeps".
The Secrets of Building Realtime Big Data Systems (nathanmarz)
The architectural principles behind building systems that scale to vast amounts of data and operate on that data in realtime.
Presented at POSSCON '11.
Presentation to a combined meetup of Bay Area Lisp and Bay Area Clojure groups. Presented three Clojure projects at BackType:
Cascalog - Batch processing in Clojure
ElephantDB - Database written in Clojure
Storm - Distributed, fault-tolerant, reliable stream processing and RPC
The document describes the process of query planning and execution for the Cascalog query engine. It involves three main steps: 1) pre-aggregation to resolve variables and join data sources, 2) aggregation to group data and apply aggregators, and 3) post-aggregation to resolve remaining variables and apply filters. The document provides examples of how a sample query is planned and optimized at each step.
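To make the three steps concrete, here is a small query of my own (not from the document), annotated with the step that handles each predicate; follows is one of Cascalog's bundled sample datasets.

```clojure
(use 'cascalog.playground)
(bootstrap)
(require '[cascalog.ops :as c])

;; Who follows more than two people?
(?<- (stdout)
     [?person ?count]
     (follows ?person ?other)   ; pre-aggregation: resolve variables, join sources
     (c/count ?count)           ; aggregation: count rows per ?person group
     (> ?count 2))              ; post-aggregation: filter on the aggregated value
```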
Cascading is a Java library that makes it easy to develop complex workflows on Hadoop. It allows processing of large amounts of data in a scalable and fault-tolerant way. Cascading represents all data as tuples and defines flows as sequences of tuple-stream manipulations that compile to MapReduce jobs. For example, a flow could take tuples of words and values, split the words, group by word, and sum the values for each word. This provides a more flexible way to develop MapReduce applications compared to Pig.
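For comparison, roughly the same kind of flow expressed as a Cascalog query (a sketch; sentence is the bundled sample dataset of sentences, and split-words is a custom operation defined inline):

```clojure
(use 'cascalog.playground)
(bootstrap)
(require '[cascalog.ops :as c])

;; Custom operation: split a sentence, emitting one tuple per word.
(defmapcatop split-words [sentence]
  (seq (.split sentence "\\s+")))

;; Word count: split each sentence, group by word, count occurrences.
(?<- (stdout)
     [?word ?count]
     (sentence ?s)
     (split-words ?s :> ?word)
     (c/count ?count))
```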
11. What sets Cascalog apart?
Custom operations
No UDF interface
Just Clojure functions
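A sketch of what such operations look like (my examples; uppercase and long-name? are illustrative names, and the age dataset comes from the bundled playground data):

```clojure
(use 'cascalog.api)
(use 'cascalog.playground)
(bootstrap)
(require '[clojure.string :as str])

;; A custom map operation: just a Clojure function wrapped for Cascalog.
(defmapop uppercase [word]
  (str/upper-case word))

;; A custom filter: keep the tuple when the function returns true.
(deffilterop long-name? [word]
  (> (count word) 5))

;; Used inside a query like any other predicate.
(?<- (stdout)
     [?upper ?age]
     (age ?name ?age)
     (uppercase ?name :> ?upper)
     (long-name? ?name))
```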
12. What sets Cascalog apart?
Dynamic queries
Write functions that return queries
Manipulate queries as first-class entities in the language
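A sketch of what this looks like (my example; younger-than is an illustrative helper built on the bundled age dataset): the <- form builds a query as a value without running it, and ?- executes queries against sinks.

```clojure
(use 'cascalog.api)
(use 'cascalog.playground)
(bootstrap)

;; An ordinary Clojure function that returns a query as a value.
(defn younger-than [limit]
  (<- [?person]
      (age ?person ?age)
      (< ?age limit)))

;; Queries are first-class: build them at runtime, pass them around,
;; and execute them later with ?-.
(def under-30 (younger-than 30))
(?- (stdout) under-30)
```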
13. What sets Cascalog apart?
Use Cascalog side by side with other code
Appends and Distributed Copies
Consolidation
Application logic
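One way this plays out in practice (a sketch; ??<- is available in later Cascalog releases and returns result tuples as an in-memory sequence, so query output can feed straight into ordinary application logic):

```clojure
(use 'cascalog.api)
(use 'cascalog.playground)
(bootstrap)

;; Run a query and get the result tuples back as a plain Clojure sequence.
(let [adults (??<- [?person ?age]
                   (age ?person ?age)
                   (>= ?age 18))]
  ;; From here on it is ordinary Clojure: sort, format, feed into app logic.
  (doseq [[person years] (sort-by second adults)]
    (println person "is" years)))
```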
14. Easy Experimentation
Ships with a test dataset that can be queried locally (the “playground”)
5 minutes to set up Hadoop, Clojure, and Cascalog locally - see README
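A sketch of the kind of REPL experimentation the playground enables (age and follows are among the bundled sample datasets; the join on ?person happens implicitly because both predicates share that variable):

```clojure
(use 'cascalog.playground)
(bootstrap)

;; Who do people under 30 follow? Sharing ?person across the two
;; generators produces an implicit join.
(?<- (stdout)
     [?person ?followed]
     (age ?person ?age)
     (< ?age 30)
     (follows ?person ?followed))
```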
20. Cascalog at BackType
Cascalog is used to:
Identify influencers
Determine number of people exposed to URLs on Twitter
21. Cascalog at BackType
Cascalog is used to:
Identify influencers
Determine number of people exposed to URLs on Twitter
Identify “interesting tweets”
22. Cascalog at BackType
Cascalog is used to:
Identify influencers
Determine number of people exposed to URLs on Twitter
Identify “interesting tweets”
Study social engagement of domains over time
23. Cascalog at BackType
Cascalog is used to:
Identify influencers
Determine number of people exposed to URLs on Twitter
Identify “interesting tweets”
Study social engagement of domains over time
Etc, etc.
24. Cascalog at BackType
Input and output
Cascalog reads from MySQL databases and HDFS
Cascalog writes to Cassandra and HDFS
25. Cascalog at BackType
Rapid development
Local playground dataset for development
Develop queries in the REPL
29. Cascading and Cascalog
Provided by Cascading:
Tuple abstraction and tuple manipulation
Workflow to MapReduce translation
Read and write from anywhere with Taps
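As a rough illustration of the tap abstraction (a sketch; hfs-textline and hfs-seqfile are Cascalog's helpers over Cascading taps, and the paths are placeholders): the same query can read from and write to any location a tap knows how to reach.

```clojure
(use 'cascalog.api)

;; A distributed copy: read lines of text from one location, write them
;; out as Hadoop sequence files somewhere else. Swap the taps to change
;; where the data comes from or goes to, without touching the query.
(defn copy-lines [input-path output-path]
  (let [lines (hfs-textline input-path)]
    (?<- (hfs-seqfile output-path)
         [?line]
         (lines ?line))))

;; e.g. (copy-lines "/data/tweets" "/data/tweets-copy")
```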