Tired of seeing the loading spinner of doom while trying to analyze your big data in Tableau? Learn how Jethro accelerates your database so you can interactively analyze big data in Tableau and gain the crucial insights you need without losing your train of thought. Jethro gives you complete flexibility, with no need to partition data in order to speed up queries. This presentation explains why indexing is a superior architecture to MPP for the BI use case when dealing with big data.
This document discusses leveraging major market opportunities with Microsoft Azure. It notes that worldwide cloud software revenue is expected to grow significantly between 2010-2017. By 2017, nearly $1 of every $5 spent on applications will be consumed via the cloud. It also notes that hybrid cloud deployments will be common for large enterprises by the end of 2017. The document then outlines several major enterprise workloads that can be moved to Azure, including test/development, SharePoint, SQL/business intelligence, application migration, SAP, and identity/Office 365. It provides examples of how partners can help customers with these types of migrations.
The document summarizes several popular options for SQL on Hadoop including Hive, SparkSQL, Drill, HAWQ, Phoenix, Trafodion, and Splice Machine. Each option is reviewed in terms of key features, architecture, usage patterns, and strengths/limitations. While all aim to enable SQL querying of Hadoop data, they differ in support for transactions, latency, data types, and whether they are native to Hadoop or require separate processes. Hive and SparkSQL are best for batch jobs while Drill, HAWQ and Splice Machine provide lower latency but with different integration models and capabilities.
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks (DataWorks Summit)
This document discusses using Apache Drill and business intelligence (BI) tools to analyze network data stored in Hadoop. It provides examples of querying network packet captures and APIs directly using SQL without needing to transform or structure the data first. This allows gaining insights into issues like dropped sensor readings by analyzing packets alongside other data sources. The document concludes that SQL-on-Hadoop technologies allow network analysis to be done in a BI context more quickly than traditional specialized tools.
Introducing Kudu, Big Data Warehousing Meetup (Caserta)
Not just an SQL interface or file system, Kudu - the new, updating column store for Hadoop, is changing the storage landscape. It's easy to operate and makes new data immediately available for analytics or operations.
At the Caserta Concepts Big Data Warehousing Meetup, our guests from Cloudera outlined the functionality of Kudu and talked about why it will become an integral component in big data warehousing on Hadoop.
To learn more about what Caserta Concepts has to offer, visit http://casertaconcepts.com/
High concurrency, Low latency analytics using Spark/Kudu (Chris George)
With the right combination of open source projects, you can run high-concurrency, low-latency Spark jobs for data analysis. We'll show both REST and JDBC access to data from a persistent Spark context, and then show how the combination of Spark Job Server, Spark Thrift Server, and Apache Kudu can create a scalable backend for low-latency analytics.
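Spark Thrift Server speaks the HiveServer2 protocol, so the JDBC-style access described above can be sketched from Python with PyHive. A minimal sketch, assuming a Thrift Server is already running with a persistent Spark context; the host and the Kudu-backed table name are hypothetical:

```python
# Minimal sketch: query a long-running Spark Thrift Server from Python.
# The server holds a persistent SparkContext, so the query is dispatched
# to already-warm executors instead of launching a new Spark job.
from pyhive import hive  # HiveServer2-compatible client

conn = hive.connect(host="thrift-server.example.com", port=10000)
cur = conn.cursor()
cur.execute("""
    SELECT sensor_id, avg(reading) AS avg_reading
    FROM kudu_readings               -- hypothetical Kudu-backed table
    WHERE event_date = '2017-01-01'
    GROUP BY sensor_id
""")
for row in cur.fetchall():
    print(row)
```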
This document provides an overview of a SQL-on-Hadoop tutorial. It introduces the presenters and discusses why SQL is important for Hadoop, as MapReduce is not optimal for all use cases. It also notes that while the database community knows how to efficiently process data, SQL-on-Hadoop systems face challenges due to the limitations of running on top of HDFS and Hadoop ecosystems. The tutorial outline covers SQL-on-Hadoop technologies like storage formats, runtime engines, and query optimization.
Learn how Cloudera Impala empowers you to:
- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage
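As a rough illustration of that "speed of thought" access path, here is a minimal sketch using impyla, Impala's Python DB-API client; the daemon hostname and the web_logs table are hypothetical:

```python
# Minimal sketch: an interactive query issued directly against data in
# HDFS/HBase through an Impala daemon, with no export to another system.
from impala.dbapi import connect

conn = connect(host="impalad.example.com", port=21050)  # HiveServer2 port
cur = conn.cursor()
cur.execute(
    "SELECT page, count(*) AS hits "
    "FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10"
)
for page, hits in cur.fetchall():
    print(page, hits)
```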
Impala 2.0 - The Best Analytic Database for Hadoop (Cloudera, Inc.)
A look at why SQL access in Hadoop is critical and the benefits of a native Hadoop analytic database, what’s new with Impala 2.0 and some of the recent performance benchmarks, some common Impala use cases and production customer stories, and insight into what’s next for Impala.
This document discusses SQL engines for Hadoop, including Hive, Presto, and Impala. Hive is best for batch jobs due to its stability. Presto provides interactive queries across data sources and is easier to manage than Hive with Tez. Presto's distributed architecture allows queries to run in parallel across nodes. It supports pluggable connectors to access different data stores and has language bindings for multiple clients.
HBase and Drill: How loosely typed SQL is ideal for NoSQL (DataWorks Summit)
The document discusses how complex data structures can be modeled in a database using an extended relational model. It begins with an agenda that includes discussing loose typing, examples of what can be done, and looking at a real database with 10-20x fewer tables. It then contrasts the traditional relational model with HBase and discusses how structuring allows complex objects in fields and references between objects. Examples are given of modeling time-series data and music metadata in fewer tables using these techniques. Apache Drill is presented as a way to perform SQL queries over these complex data structures.
This document discusses application architectures using Hadoop. It provides an example case study of clickstream analysis. It covers challenges of Hadoop implementation and various architectural considerations for data storage and modeling, data ingestion, and data processing. For data processing, it discusses different processing engines like MapReduce, Pig, Hive, Spark and Impala. It also discusses what specific processing needs to be done for the clickstream data like sessionization and filtering.
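To make the sessionization step concrete, here is a toy sketch in plain Python with hypothetical field names: clicks are grouped per user, and a new session starts whenever the gap between consecutive events exceeds 30 minutes.

```python
# Toy sessionization: group clicks per user, split sessions on 30-minute gaps.
from itertools import groupby
from operator import itemgetter

SESSION_GAP = 30 * 60  # seconds

def sessionize(clicks):
    """clicks: iterable of dicts with 'user_id' and 'ts' (epoch seconds)."""
    sessions = []
    clicks = sorted(clicks, key=itemgetter("user_id", "ts"))
    for user, events in groupby(clicks, key=itemgetter("user_id")):
        current, last_ts = [], None
        for e in events:
            if last_ts is not None and e["ts"] - last_ts > SESSION_GAP:
                sessions.append((user, current))  # close the session
                current = []
            current.append(e)
            last_ts = e["ts"]
        if current:
            sessions.append((user, current))
    return sessions

print(sessionize([
    {"user_id": "u1", "ts": 0}, {"user_id": "u1", "ts": 600},
    {"user_id": "u1", "ts": 4000},  # > 30 min after previous -> new session
]))
```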
eBay maintains hundreds of millions of accounts across its properties that are unstructured and in different formats. Identifying which accounts belong to the same person enables eBay to personalize customer experiences, provide customer service, and fight fraud. MapReduce provides a robust design pattern to simplify high-scale entity resolution through parallelized modular operations, including linking accounts pairwise, identifying connected components through iterative MapReduce jobs, and validating the results.
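The connected-components step can be sketched in miniature: starting from pairwise links, labels are propagated until every account carries the smallest id in its component, which is essentially what the iterative MapReduce jobs compute at scale. The account ids below are made up.

```python
# Toy connected components by iterative label propagation.
def connected_components(links):
    # Start with every account labeled by itself.
    label = {a: a for pair in links for a in pair}
    changed = True
    while changed:                      # one pass ~ one MapReduce iteration
        changed = False
        for a, b in links:
            low = min(label[a], label[b])
            for node in (a, b):
                if label[node] != low:
                    label[node] = low
                    changed = True
    return label

links = [("acct1", "acct2"), ("acct2", "acct7"), ("acct5", "acct6")]
print(connected_components(links))
# {'acct1': 'acct1', 'acct2': 'acct1', 'acct7': 'acct1',
#  'acct5': 'acct5', 'acct6': 'acct5'}
```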
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w... (Databricks)
Real-time/online machine learning is an integral piece in the machine learning landscape, particularly in regard to unsupervised learning. Areas such as focused advertising, stock price prediction, recommendation engines, network evolution and IoT streams in smart cities and smart homes are increasing in demand and scale. Continuously-updating models with efficient update methodologies, accurate labeling, feature extraction, and modularity for mixed models are integral to maintaining scalability, precision, and accuracy in high demand scenarios.
This session explores a real-time/online learning algorithm and implementation using Spark Streaming in a hybrid batch/ semi-supervised setting. It presents an easy-to-use, highly scalable architecture with advanced customization and performance optimization. Within this framework, we will examine some of the key methodologies for implementing the algorithm, including partitioning and aggregation schemes, feature extraction, model evaluation and correction over time, and our approaches to minimizing loss and improving convergence. The result is a simple, accurate pipeline that can be easily adapted and scaled to a variety of use cases.
The performance of the algorithm will be evaluated comparatively against existing implementations in both linear and logistic prediction. The session will also cover real-time use cases of the streaming pipeline using real time-series data and present strategies for optimization and implementation to improve both accuracy and efficiency in a semi-supervised setting.
Kudu is an open source storage layer developed by Cloudera that provides low latency queries on large datasets. It uses a columnar storage format for fast scans and an embedded B-tree index for fast random access. Kudu tables are partitioned into tablets that are distributed and replicated across a cluster. The Raft consensus algorithm ensures consistency during replication. Kudu is suitable for applications requiring real-time analytics on streaming data and time-series queries across large datasets.
This document discusses modern data architecture and Apache Hadoop's role within it. It presents WANdisco and its Non-Stop Hadoop solution, which extends HDFS across multiple data centers to provide 100% uptime for Hadoop deployments. Non-Stop Hadoop uses WANdisco's patented distributed coordination engine to synchronize HDFS metadata across sites separated by wide area networks, enabling continuous availability of HDFS data and global HDFS deployments.
The document provides an agenda and slides for a presentation on architectural considerations for data warehousing with Hadoop. The presentation discusses typical data warehouse architectures and challenges, how Hadoop can complement existing architectures, and provides an example use case of implementing a data warehouse with Hadoop using the Movielens dataset. Key aspects covered include ingestion of data from various sources using tools like Flume and Sqoop, data modeling and storage formats in Hadoop, processing the data using tools like Hive and Spark, and exporting results to a data warehouse.
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa... (Remy Rosenbaum)
Jethro CTO Boaz Raufman and Jethro CEO Eli Singer discuss the performance benefits of adding auto microcubes to the processing framework in Jethro 2.0. They discuss how the auto microcubes working in tandem with full indexing and a smart caching engine deliver a consistently interactive-speed business intelligence experience across most scenarios and use cases. The main use case they discuss is querying data on Hadoop directly from a BI tool such as Tableau or Qlik.
Hadoop in the Cloud: Common Architectural Patterns (DataWorks Summit)
The document discusses how companies are using Microsoft Azure services like HDInsight, Data Factory, Machine Learning, and others to gain insights from large volumes of data. Specifically, it provides examples of:
1) A large computer manufacturer/retailer analyzing clickstream data with HDInsight to understand customer behavior and provide real-time recommendations to increase online conversions.
2) An industrial automation company partnering with an oil company to use IoT sensors and analytics to monitor LNG fueling stations for proactive maintenance based on sensor data analyzed with HDInsight, Data Factory, and Machine Learning.
3) How data from various industries like retail, oil and gas, manufacturing, and others can be analyzed with these Azure services.
Hadoop is an open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop automatically manages data replication and platform failure to ensure very large data sets can be processed efficiently in a reliable, fault-tolerant manner. Common uses of Hadoop include log analysis, data warehousing, web indexing, machine learning, financial analysis, and scientific applications.
100424 teradata cloud computing 3rd party influencers (guest8ebe0a8)
The document discusses Teradata's offerings for cloud computing environments. It outlines Teradata Express editions for Amazon EC2 and VMware that allow development, testing, and proof-of-concept workloads in public and private clouds. It also describes the Teradata Agile Analytics Cloud for self-service provisioning of sandboxes and data marts within a private cloud using Teradata databases. Customers benefit from the agility, capacity, and cost advantages of clouds for various analytics workloads.
(ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift (Amazon Web Services)
Learn how Boingo Wireless and online media provider Edmunds gained substantial business insights and saved money and time by migrating to Amazon Redshift. Get an inside look into how they accomplished their migration from on-premises solutions. Learn how they tuned their schema and queries to take full advantage of the columnar MPP architecture in Amazon Redshift, how they leveraged third party solutions, and how they met their business intelligence needs in record time.
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges of traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations, and lessons around ETL for Hadoop: pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
This document discusses how the cloud is well suited to address the challenges of big data. It notes that big data sets are getting larger and more complex, requiring new tools and approaches. The cloud optimizes precious IT resources by enabling elastic scaling, global accessibility, easy experimentation, and reducing costs. The cloud empowers users to balance costs and time. Several real-world examples are provided, such as banks using the cloud to perform Monte Carlo simulations and retailers using it for targeted recommendations and click stream analysis.
Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect (Remy Rosenbaum)
Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect. Jethro CEO Eli Singer discusses the limitations of Hadoop with Tableau. The presentation explores how Jethro's index-based architecture enables Tableau users to live-connect to data on Hadoop while maintaining the fast interactive speeds they expect.
This webinar discusses tools for making big data easy to work with. It covers MetaScale Expertise, which provides Hadoop expertise and case studies. Kognitio Analytics is discussed as a way to accelerate Hadoop for organizations. The webinar agenda includes an introduction, presentations on MetaScale and Kognitio, and a question and answer session. Rethinking data strategies with Hadoop and using in-memory analytics are presented as ways to gain insights from large, diverse datasets.
Technologies for Data Analytics Platform (N Masahiro)
This document discusses building a data analytics platform and summarizes various technologies that can be used. It begins by outlining reasons for analyzing data like reporting, monitoring, and exploratory analysis. It then discusses using relational databases, parallel databases, Hadoop, and columnar storage to store and process large volumes of data. Streaming technologies like Storm, Kafka, and services like Redshift, BigQuery, and Treasure Data are also summarized as options for a complete analytics platform.
The webinar discusses how organizations can make big data easy to use with the right tools and talent. It presents on MetaScale's expertise in helping Sears Holdings implement Hadoop and how Kognitio's in-memory analytics platform can accelerate Hadoop for organizations. The webinar agenda includes an introduction, a case study on Sears Holdings' Hadoop implementation, an explanation of how Kognitio's platform accelerates Hadoop, and a Q&A session.
Building Scalable Big Data Infrastructure Using Open Source Software Presenta... (ssuserd3a367)
1) StumbleUpon uses open source tools like Kafka, HBase, Hive and Pig to build a scalable big data infrastructure to process large amounts of data from its services in real-time and batch.
2) Data is collected from various services using Kafka and stored in HBase for real-time analytics. Batch processing is done using Pig and data is loaded into Hive for ad-hoc querying.
3) The infrastructure powers various applications like recommendations, ads and business intelligence dashboards.
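A minimal sketch of the collection path in item 2, assuming the kafka-python and happybase clients; the topic, table, and column family names are hypothetical:

```python
# Minimal sketch: read events from a Kafka topic and write them to an HBase
# table for real-time lookups. Names are hypothetical placeholders.
from kafka import KafkaConsumer   # kafka-python
import happybase                  # HBase Thrift client

consumer = KafkaConsumer("page_views", bootstrap_servers=["kafka1:9092"])
hbase = happybase.Connection("hbase-thrift-host")
table = hbase.table("page_views")

for msg in consumer:
    # Fall back to a value prefix when the producer didn't set a key.
    row_key = msg.key or msg.value[:32]
    table.put(row_key, {b"d:raw": msg.value})
```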
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
Apache Drill is an open source SQL query engine for big data that provides highly flexible and high performance querying of data stored in Hadoop and NoSQL systems. It allows for ad-hoc queries on schema-less data without requiring upfront modeling or ETL. Drill uses a distributed, columnar data store and late binding to optimize query execution across systems. The project is actively developed with the goal of releasing version 1.0 in late 2014.
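A small sketch of that "no upfront modeling" workflow, using Drill's REST API (default port 8047) from Python; the JSON file path is hypothetical, and note that no schema is declared anywhere:

```python
# Minimal sketch: an ad-hoc Drill query over raw, un-modeled JSON files
# via the REST API. No ETL or schema definition precedes the query.
import requests

resp = requests.post(
    "http://localhost:8047/query.json",
    json={
        "queryType": "SQL",
        "query": "SELECT t.user.id AS user_id, t.action "
                 "FROM dfs.`/data/raw/events.json` t LIMIT 10",
    },
)
for row in resp.json()["rows"]:
    print(row)
```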
Hive was initially developed by Facebook to manage large amounts of data stored in HDFS. It uses a SQL-like query language called HiveQL to analyze structured and semi-structured data. Hive compiles HiveQL queries into MapReduce jobs that are executed on a Hadoop cluster. It provides mechanisms for partitioning, bucketing, and sorting data to optimize query performance.
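A brief sketch of the partitioning and bucketing mechanisms mentioned above, issued through the PyHive client; the table and column names are hypothetical:

```python
# Minimal sketch: HiveQL DDL that declares partitioning and bucketing,
# followed by a query that benefits from partition pruning.
from pyhive import hive

cur = hive.connect(host="hive-server.example.com", port=10000).cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id BIGINT,
        url     STRING
    )
    PARTITIONED BY (view_date STRING)       -- prunes whole directories
    CLUSTERED BY (user_id) INTO 32 BUCKETS  -- hash-distributes within a partition
    STORED AS ORC
""")
# A filter on the partition column lets Hive skip every other partition:
cur.execute("SELECT count(*) FROM page_views WHERE view_date = '2016-11-15'")
print(cur.fetchone())
```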
The document provides an overview of a data ingestion engine designed for big data. It discusses the motivation for the engine, including challenges with existing ETL and data integration approaches. The key aspects of the engine include a metadata repository that drives the ingestion process, access modules that connect to different data sources, and transform modules that process and mask the data. The metadata-driven approach provides benefits like automatically handling schema changes, tracking data lineage, and enabling retention policies based on metadata rather than scanning data. Future enhancements may include using KSQL to enrich streaming data and provisioning data to external locations by launching workflows.
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Big data architectures and the data lake (James Serra)
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
From: DataWorks Summit Munich 2017 - 20170406
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces the overarching issues and difficulties of backup and data safety, looks at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components, and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop-cluster-based solutions.
The document discusses StreamHorizon's "adaptiveETL" platform for big data analytics. It highlights limitations of legacy ETL platforms and StreamHorizon's advantages, including massively parallel processing, in-memory processing, quick time to market, low total cost of ownership, and support for big data architectures like Hadoop. StreamHorizon is presented as an effective and cost-efficient solution for data integration and processing projects.
Vitaliy Bondarenko, "Fast Data Platform for Real-Time Analytics. Architecture ..." (Fwdays)
We will start from understanding how real-time analytics can be implemented on enterprise-level infrastructure, then go into detail and discover how different business intelligence cases can be served in real time on streaming data. We will cover different stream data processing architectures and discuss their benefits and disadvantages. I'll show with live demos how to build a fast data platform in the Azure cloud using open source projects: Apache Kafka, Apache Cassandra, Mesos. I'll also show examples and code from real projects.
Is the traditional data warehouse dead? (James Serra)
With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.
Testing Big Data: Automated Testing of Hadoop with QuerySurge (RTTS)
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
Big data comes from many sources like social media, e-commerce sites, and stock markets. Hadoop is an open-source framework that allows processing and storing large amounts of data across clusters of computers. It uses HDFS for storage and MapReduce for processing. HDFS stores data across cluster nodes and is fault tolerant. MapReduce analyzes data through parallel map and reduce functions. Sqoop imports and exports data between Hadoop and relational databases.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Data Lakehouse, Data Mesh, and Data Fabric (r1) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Jethro for tableau webinar (11 15)
2. Webinar Topics
• Who is Jethro?
• Tableau & Big Data: Extract vs. Live Connect
• Big Data Platforms: Hadoop vs. EDW Appliances
• Two DB architectures: Full-scan vs. Index Access
• Live Demo: Tableau over Impala / Redshift / Jethro
• What is Jethro for Tableau and how it accelerates Tableau’s performance
• Q&A
3. About Us
• What does Jethro do?
– SQL engine optimized for accelerating BI on big data
• How does it work?
– Combines columnar SQL DB design with full-indexing technology
• Where is it?
– In dev since 2012; GA: mid 2015
– Download & free eval
• When to use it?
– BI Spinner Syndrome (BSS)
• Partnerships
– BI and Hadoop vendors
• Speaker
– Eli Singer, CEO JethroData
– esinger@jethrodata.com
– 917.509.6111
• Experience
– Long-time DBA
– Over 20 years leading tech startups
• Where to find us
– Jethrodata.com
– @JethroData
4. Tableau and Big Data: Extract (In-Mem)
[Diagram: Tableau pulls an Extract from EDW / Hadoop]
• Typical Tableau usage is based on extracting selective data from remote sources
• Extracted data is then dynamically loaded into Tableau memory for interactive analysis
• Limitations: performance degradation and scale (typically ~200M rows)
5. Tableau and Big Data: Live Connect (In-DB)
[Diagram: Tableau issues queries over a Live Connect to EDW / Hadoop]
• Tableau issues SQL queries to the target DB for every user interaction
• The DB retrieves the requested data and returns it to Tableau
• Limitation: DB performance is significantly slower than in-memory speed
6. Big Data Platforms: Hadoop vs. EDW Appliances
• 10x-100x data
• 1/10 hardware cost
• Open platform
• Analytics: ETL, Predictive, Reporting, BI
SQL enables the change of data platform while keeping the analytic apps intact
7. The Hadoop Trade-Off: Scale & Cost vs. Performance
[Diagram: SQL-on-Hadoop serves ETL, Predictive, and Reporting workloads, but BI is too SLOW in Hadoop]
It's unrealistic to expect the same performance when the data is much larger and highly optimized hardware is replaced with commodity boxes.
8. SQL-on-Hadoop – MPP / Full-Scan Architecture
A Library Analogy: billions of books, thousands of racks
• Architecture: MPP / full-scan (all SQL-on-Hadoop engines)
• Query: list books by author "Stephen King"
• Process: each librarian is assigned a rack; they then pull each book, check if the author is "Stephen King", and if so, get the book title
• Result: too slow, costly, unscalable. Unsuitable for BI
9. SQL-on-Hadoop – Index-Access Architecture
• Architecture: index access (only Jethro)
• Query: list books by author "Stephen King"
• Process: access the Author index at the entry for "Stephen King", get the list of books, and fetch only those books
• Result: fast, minimal resources, scalable. Optimal for BI
10. SQL on Hadoop – Competitive Landscape
Full-Scan Based Solutions (read all rows, every time):
• Hive
• Impala
• Presto
• SparkSQL
• Drill
• Pivotal/HAWQ
• IBM/Big SQL
• Actian
• Teradata/SQL-H
• …
Index Based Solution (reads ONLY the needed rows):
• Jethro
Use-Case Comparison:
• Full-scan: optimal for predictive, reporting
• Index: optimal for interactive BI
11. LIVE Benchmark: BI on Hadoop (and Redshift)
Hardware – AWS
• Hadoop: CDH 5.4
• 6 nodes: m1.xlarge, r3.xlarge
• Jethro: r3.8xlarge
Demo
• Point browser at: tableau.jethrodata.com (UID/PWD: demo / demo)
• Choose workbook: “Jethro”, “Impala”, “Redshift”
• BI Dashboard: choose year, category or any other filter to drill down
Data
• Based on TPC-DS benchmark
• 1TB raw data (400GB fact)
• Fact table: ~2.9B rows
• Dimensions: 7

Hardware | Data Format            | Hadoop Cluster | Compute Cluster      | Total RAM, CPU  | AWS $ per hr.
---------|------------------------|----------------|----------------------|-----------------|--------------
Jethro   | Jethro indexes (250GB) | 3x m1.xlarge   | 2x r3.4xlarge (spot) | 289GB, 44 cores | $0.80
Impala   | Parquet (160GB)        | 8x r3.2xlarge  | 1x r3.xlarge         | 510GB, 68 cores | $5.95
Redshift | Redshift (229GB)       | n/a            | 8x dc1.large         | 120GB, 16 cores | $2.00
12. What Is Jethro for Tableau?
[Diagram: (1) extract from EDW / Hadoop / Cloud / Local FS / NAS into Jethro; (2) store column and index files back in Hadoop; (3) Tableau Live Connect to Jethro]
• An indexing & caching server
• Relevant data is extracted from EDW / Hadoop into Jethro. No size limitation
• Jethro then fully indexes the data (every column!)
• Jethro's column and index files are stored back in Hadoop (or another storage system)
• Tableau uses Live Connect to send Jethro SQL queries (ODBC)
• Jethro uses indexes to speed up queries and returns results to Tableau
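The live-connect path in step 3 can be sketched from any ODBC-capable client; Tableau does the equivalent through its connector. A minimal sketch, assuming Jethro's ODBC driver is installed and a DSN has been configured (the DSN name is hypothetical; the tables follow the TPC-DS schema used in the benchmark above):

```python
# Minimal sketch: send Jethro the kind of SQL a dashboard interaction
# generates, over ODBC. "JethroDemo" is a hypothetical DSN.
import pyodbc

conn = pyodbc.connect("DSN=JethroDemo", autocommit=True)
cur = conn.cursor()
cur.execute("""
    SELECT d_year, sum(ss_net_paid)
    FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk
    WHERE d_year = 2002
    GROUP BY d_year
""")
print(cur.fetchall())
```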
13. Selecting Data for Jethro Acceleration
• Select only Tableau-"worthy" datasets
– Not ALL data in Hadoop needs Jethro
• Use any ETL tool to extract from the source
– Jethro receives data in a CSV/delimited format
– Extracted data can be temporarily stored in a file or "piped" live to Jethro
• After initial creation, incremental loads are supported
– As frequently as every few minutes
• Jethro stores its version of the dataset back in HDFS
– Can also use a local filesystem, network storage, or cloud storage
• Load is fast
– ~1B rows/hour
– Data is highly compressed: 1TB -> 400GB data + indexes
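A minimal sketch of preparing such an incremental load: new rows are pulled from the source over ODBC and streamed out as delimited text, the format Jethro consumes. The DSN, table, and watermark column are hypothetical, and the loader invocation that would read this stream is omitted:

```python
# Minimal sketch: export only the rows added since the last load as CSV,
# to be piped into Jethro's loader. DSN, table, and the load_ts watermark
# column are hypothetical placeholders.
import csv
import sys
import pyodbc

src = pyodbc.connect("DSN=SourceEDW").cursor()
src.execute("SELECT * FROM sales WHERE load_ts > ?", "2015-11-01 00:00:00")

writer = csv.writer(sys.stdout)   # stdout can be piped straight to the loader
for row in src:                   # pyodbc cursors are iterable
    writer.writerow(row)
```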
14. Index-Access – How It Works
SELECT day, sum(sales) FROM t1 WHERE prod='abc' GROUP BY day
[Diagram: Tableau sends the query to Jethro query nodes, which run against the data nodes over shared storage: HDFS, cloud (S3, EFS), NAS/SAN, or local FS]
1. Index access
2. Read data only for the required rows
Performance and resources are based on the size of the working set
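A toy model of those two steps for the query on this slide: columns are stored as separate arrays, and an inverted-list index on prod maps each value to its row ids, so only the matching rows of day and sales are ever read.

```python
# Toy index-access execution of:
#   SELECT day, sum(sales) FROM t1 WHERE prod='abc' GROUP BY day
from collections import defaultdict

# Columnar storage: one array per column.
prod  = ["abc", "xyz", "abc", "xyz", "abc"]
day   = ["mon", "mon", "mon", "tue", "tue"]
sales = [10,    99,    5,     99,    7]

# Build the inverted-list index on prod (value -> list of row ids).
index = defaultdict(list)
for row_id, value in enumerate(prod):
    index[value].append(row_id)

# 1. Index access: direct lookup, no scan of the prod column.
rows = index["abc"]                      # [0, 2, 4]

# 2. Read data only for the required rows, then aggregate.
result = defaultdict(int)
for row_id in rows:
    result[day[row_id]] += sales[row_id]
print(dict(result))                      # {'mon': 15, 'tue': 7}
```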
15. Jethro Indexes – Superior Technology
Patent pending: http://www.google.com/patents/WO2013001535A3?cl=en
• Complete
– Every column is indexed
• Simple
– Inverted-list indexes map each column value to a list of rows
• Fast to read
– Direct access to a value's entry
– No need to scan the entire index or load the index into memory
• Scalable
– Distributed, highly hierarchical compressed bitmaps
– Appendable index structure for fast incremental loads
16. Adaptive Optimization: Active Cache of Query Results
• Reuse of intermediate/final query results
– Repeat queries return immediately
• Addresses wide top-of-the-funnel queries
– Exploration starts with queries with no/few filters
– Those queries are likely to be repeated in dashboard scenarios
• Transparently adapts to incremental loads
– Execution on delta data + merge saved results
[Charts: query speed vs. query selectivity for index performance alone, cache performance alone, and index + cache combined]
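A toy model of that cache behavior: results are keyed by the query, a repeat query returns the saved aggregate immediately, and after an incremental load only the delta rows are aggregated and merged into it. All names and data below are made up.

```python
# Toy result cache with "execution on delta data + merge saved results".
from collections import defaultdict

cache = {}          # query text -> saved aggregate {group: sum}
loaded_rows = 0     # high-water mark of rows already reflected in the cache

def run_group_sum(query_key, rows, group_col, val_col):
    """rows: full table as a list of dicts. Only rows past the cached
    high-water mark are re-aggregated, then merged into the saved result."""
    global loaded_rows
    agg = cache.setdefault(query_key, defaultdict(int))
    for row in rows[loaded_rows:]:          # delta only
        agg[row[group_col]] += row[val_col]
    loaded_rows = len(rows)
    return dict(agg)

table = [{"year": 2014, "sales": 5}, {"year": 2015, "sales": 7}]
print(run_group_sum("q1", table, "year", "sales"))  # computed: {2014: 5, 2015: 7}
table.append({"year": 2015, "sales": 3})            # incremental load
print(run_group_sum("q1", table, "year", "sales"))  # merged:   {2014: 5, 2015: 10}
```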
17. Summary: Why Is Index Access Optimal for BI?
1. Use of indexes eliminates the need to read unnecessary data
2. The deeper you go, the faster it gets: as users drill down and add more filters, queries perform faster
3. Unlimited flexibility: users can aggregate and filter by any columns they choose with no performance penalty
4. Concurrent users accessing dashboards generate repeatable queries that result in high cache efficiency
5. Shields the BI workload from other analytics overwhelming the cluster
18. Ready to Try Jethro?
1. Register: jethrodata.com/download-jethro-for-tableau
2. Schedule a 45min POC review with a Jethro SA (free!)
3. One-time setup
- Download and install Jethro on a server / VM
- Start services, configure the instance
4. Extract & load data
5. Use Tableau
- Install the ODBC driver
- Point the Tableau data source at Jethro
That's It!