The document discusses cloud computing systems and MapReduce. It provides background on MapReduce, describing how it works and how it was inspired by functional programming concepts like map and reduce. It also discusses some limitations of MapReduce, noting that it is not designed for general-purpose parallel processing and can be inefficient for certain types of workloads. Alternative approaches like MRlite and DCell are proposed to provide more flexible and efficient distributed processing frameworks.
This document provides an overview of the MapReduce paradigm and Hadoop framework. It describes how MapReduce uses a map and reduce phase to process large amounts of distributed data in parallel. Hadoop is an open-source implementation of MapReduce that stores data in HDFS. It allows applications to work with thousands of computers and petabytes of data. Key advantages of MapReduce include fault tolerance, scalability, and flexibility. While it is well-suited for batch processing, it may not replace traditional databases for data warehousing. Overall efficiency remains an area for improvement.
1) Spark 1.0 was released in 2014 as the first production-ready version containing Spark batch, streaming, Shark, and machine learning libraries.
2) By 2014, most big data processing used higher-level tools like Hive and Pig on structured data rather than the original MapReduce assumption of only unstructured data.
3) Spark evolved to support structured data through the DataFrame API in versions 1.2-1.3, providing a unified way to read from structured sources.
Big data refers to large datasets that are difficult to process using traditional database management tools. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides reliable data storage with the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using MapReduce. The Hadoop ecosystem includes components like HDFS, MapReduce, Hive, Pig, and HBase that provide distributed data storage, processing, querying and analysis capabilities at scale.
This document summarizes Hadoop MapReduce, including its goals of distribution and reliability. It describes the roles of mappers, reducers, and other system components like the JobTracker and TaskTracker. Mappers process input splits in parallel and generate intermediate key-value pairs. Reducers sort and group the outputs by key before processing each unique key. The JobTracker coordinates jobs while the TaskTracker manages tasks on each node.
Spark Streaming allows for scalable, fault-tolerant stream processing of data ingested from sources like Kafka. It works by dividing the data streams into micro-batches, which are then processed using transformations like map, reduce, join using the Spark engine. This allows streaming aggregations, windows, and stream-batch joins to be expressed similarly to batch queries. The example shows a streaming word count application that receives text from a TCP socket, splits it into words, counts the words, and updates the result continuously.
This document discusses MapReduce and its suitability for processing large datasets across distributed systems. It describes challenges like node failures, network bottlenecks and the motivation for a simple programming model that can handle massive computations and datasets across thousands of machines. MapReduce provides a programming model using map and reduce functions that hides complexities of parallelization, fault tolerance and load balancing. It has been widely adopted for applications involving log analysis, indexing large datasets, iterative graph processing and more.
Presenter - Siyuan Hua, Apache Apex PMC Member & DataTorrent Engineer
Apache Apex provides a DAG construction API that gives developers full control over the logical plan. Some use cases don't require all of that flexibility, at least so it may appear initially. Also, a large part of the audience may be more familiar with an API that exhibits a more functional programming flavor, such as the new Java 8 Stream interfaces and the Apache Flink and Spark Streaming APIs. Thus, to help Apex beginners get a simple first app running with a familiar API, we are now providing the Stream API on top of the existing DAG API. The Stream API is designed to be easy to use yet flexible to extend and compatible with the native Apex API. This means developers can construct their application in a way similar to Flink or Spark, but also have the power to fine-tune the DAG at will. Per our roadmap, the Stream API will closely follow the Apache Beam (aka Google Dataflow) model. In the future, you should be able to either easily run Beam applications with the Apex engine or express an existing application in a more declarative style.
Interactive Data Analysis in Spark Streaming – datamantra
This document discusses strategies for building interactive streaming applications in Spark Streaming. It describes using Zookeeper as a dynamic configuration source to allow modifying a Spark Streaming application's behavior at runtime. The key points are:
- Zookeeper can be used to track configuration changes and trigger Spark Streaming context restarts through its watch mechanism and Curator library.
- This allows building interactive applications that can adapt to configuration updates without needing to restart the whole streaming job.
- Examples are provided of using Curator caches like node and path caches to monitor Zookeeper for changes and restart Spark Streaming contexts in response.
Map Reduce is a parallel and distributed approach developed by Google for processing large data sets. It has two key components - the Map function which processes input data into key-value pairs, and the Reduce function which aggregates the intermediate output of the Map into a final result. Input data is split across multiple machines which apply the Map function in parallel, and the Reduce function is applied to aggregate the outputs.
This document provides an overview of Apache Spark, including its architecture, usage model, and capabilities. The key points covered include Spark's use of resilient distributed datasets (RDDs) to perform parallel transformations efficiently across a cluster, its support for SQL, streaming, and machine learning workloads, and how it achieves faster performance than other frameworks like MapReduce through optimizations like caching data in memory. Examples of WordCount in Spark and MapReduce are also provided and compared.
The document presents an introduction to MapReduce. It discusses how MapReduce provides an easy framework for distributed computing by allowing programmers to write simple map and reduce functions without worrying about complex distributed systems issues. It outlines Google's implementation of MapReduce and how it uses the Google File System for fault tolerance. Alternative open-source implementations like Apache Hadoop are also covered. The document discusses how MapReduce has been widely adopted by companies to process massive amounts of data and analyzes some criticism of MapReduce from database experts. It concludes by noting trends in using MapReduce as a parallel database and for multi-core processing.
This document provides an introduction to Apache Flink and its streaming capabilities. It discusses stream processing as an abstraction and how Flink uses streams as a core abstraction. It compares Flink streaming to Spark streaming and outlines some of Flink's streaming features like windows, triggers, and how it handles event time.
Some slides about the Map/Reduce programming model (for academic purposes), adapting some examples from the book MapReduce Design Patterns.
Special thanks to the following authors:
-http://shop.oreilly.com/product/0636920025122.do
-http://mapreducepatterns.com/index.php?title=Main_Page
-http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
This document provides an introduction to Structured Streaming in Apache Spark. It discusses the evolution of stream processing, drawbacks of the DStream API, and advantages of Structured Streaming. Key points include: Structured Streaming models streams as infinite tables/datasets, allowing stream transformations to be expressed using SQL and Dataset APIs; it supports features like event time processing, state management, and checkpointing for fault tolerance; and it allows stream processing to be combined more easily with batch processing using the common Dataset abstraction. The document also provides examples of reading from and writing to different streaming sources and sinks using Structured Streaming.
Incorta allows users to create materialized views (MVs) using Spark. It provides functions to read data from Incorta tables and save Spark DataFrames as MVs. The document discusses Spark integration with Incorta, including installing and configuring Spark, and creating the first MV using Spark Python APIs. It demonstrates reading data from Incorta and saving a DataFrame as a new MV.
This document discusses Hadoop design and k-means clustering. It outlines Hadoop's fault tolerance through task tracking and task replication. It describes Hadoop's data flow including input splitting, mapping and reducing. It also discusses optimizations like combiners. Finally it explains the k-means clustering algorithm and different approaches to implementing it in Hadoop including iterative MapReduce and partitioning large numbers of clusters.
Hadoop and Mapreduce for .NET User Group – Csaba Toth
This document provides an introduction to Hadoop and MapReduce. It discusses big data characteristics and challenges. It provides a brief history of Hadoop and compares it to RDBMS. Key aspects of Hadoop covered include the Hadoop Distributed File System (HDFS) for scalable storage and MapReduce for scalable processing. MapReduce uses a map function to process key-value pairs and generate intermediate pairs, and a reduce function to merge values by key and produce final results. The document demonstrates MapReduce through an example word count program and includes demos of implementing it on Hortonworks and Azure HDInsight.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
This thesis evaluates different methods for performing large-scale semantic expansion of the NCBO Resource Index on distributed computing systems like Hadoop, HBase, Pig and MapReduce. It implements various join algorithms in these frameworks and compares their performance. The goal is to scale the Resource Index to include nearly 100 data resources containing over 50 million data elements and 100 billion annotations, as the current MySQL implementation is reaching its limits.
IBM's Big Data platform provides tools for managing and analyzing large volumes of structured, unstructured, and streaming data. It includes Hadoop for storage and processing, InfoSphere Streams for real-time streaming analytics, InfoSphere BigInsights for analytics on data at rest, and PureData System for Analytics (formerly Netezza) for high performance data warehousing. The platform enables businesses to gain insights from all available data to capitalize on information resources and make data-driven decisions.
Build a Big Data solution using DB2 for z/OS – Jane Man
The document discusses building a Big Data solution using IBM DB2 for z/OS and IBM BigInsights. It provides an overview of new functions in DB2 11 that allow DB2 applications to access and analyze data stored in Hadoop. Specifically, it describes the JAQL_SUBMIT and HDFS_READ functions that enable submitting analytic jobs to BigInsights from DB2 and reading the results back into DB2. Examples are provided that show an integrated workflow of submitting a JAQL query to BigInsights from DB2, reading the results into a DB2 table, and querying the results. Potential use cases for integrating DB2 and BigInsights are also outlined.
The document discusses Hadoop and Spark frameworks for big data analytics. It describes that Hadoop consists of HDFS for distributed storage and MapReduce for distributed processing. Spark is faster than MapReduce for iterative algorithms and interactive queries since it keeps data in-memory. While MapReduce is best for one-pass batch jobs, Spark performs better for iterative jobs that require multiple passes over datasets.
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) – npinto
This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
MapReduce is a programming model and software framework for processing large datasets in a distributed computing environment. It was originally developed at Google for processing web search data. The MapReduce framework breaks jobs into many small sub-tasks that are processed in parallel across large clusters of commodity servers. It handles parallelization, scheduling, load balancing, and fault tolerance. MapReduce jobs consist of a map step that processes key-value pairs to generate intermediate key-value pairs and a reduce step that merges all intermediate values associated with the same intermediate key.
Social Data Analytics using IBM Big Data Technologies – Nicolas Morales
Distilling Insights from Social Media Using Big Data Technologies
Have you ever wondered what your customers are saying about you in Social media, and the impact it might be having on your business? This session will focus on how BigInsights and Big Data technologies can be used to glean useful and actionable insights from social media data.
You'll see how data can be ingested and prepped and do text analytics on social data in real time. Using Hadoop, we'll show you how you can store and analyze your large volume of historical social media data and reference data. This talk and demo will provide an introduction to text analytics and how it is used within the IBM Big Data platform for a social media solution.
Big data refers to the massive amounts of structured, semi-structured and unstructured data being created from sources like sensors, social media, digital pictures and videos, and transactional systems. This document discusses how the volume of data is growing exponentially from sources like RFID tags and smart meters. It also explores how insights can be extracted from big data through analyzing trends, correlations and other patterns in volumes, varieties and velocities of data beyond what was previously possible. However, as more data is created, the percentage of available data an organization can analyze is decreasing, making enterprises relatively "more naive" about their business over time.
The document discusses big data, including what it is, sources of big data like social media and stock exchange data, and the three Vs of big data - volume, velocity, and variety. It then discusses Hadoop, the open-source framework for distributed storage and processing of large datasets across clusters of computers. Key components of Hadoop include HDFS for distributed storage, MapReduce for distributed computation, and YARN which manages computing resources. The document also provides overviews of Pig and Jaql, programming languages used for analyzing data in Hadoop.
This presentation simplifies the concepts of big data, NoSQL databases and Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
The document discusses scheduling Hadoop pipelines using various Apache projects. It provides an example of a marketing profit and loss (PnL) pipeline that processes booking, marketing spend, and web log data. It describes scheduling the example jobs using cron-style scheduling and the problems with time-based scheduling. It then introduces Apache Oozie and Apache Falcon for more robust workflow scheduling based on dataset availability. It provides examples of using Oozie coordinators and workflows and Falcon feeds and processes to schedule the example PnL pipeline based on when input data is available rather than fixed time schedules.
This document provides an overview of big data including:
- Types of data like structured and unstructured data
- Characteristics of big data and how it has evolved with more unstructured data sources
- Sectors that benefit from big data including government, banking, telecommunications, marketing, and health and life sciences
- Advantages such as understanding customers, optimizing business processes, and improving research, healthcare, and security
- Challenges including privacy, data access, analytical challenges, and human resource needs
- The conclusion states big data generates productivity and opportunities but challenges must be addressed through talent and analytics
The document discusses different types of programming languages and software. It describes low-level languages like machine language and assembly language, and high-level languages used for scientific and business applications. It also defines algorithms, flowcharts, compilers, interpreters, and system and application software.
Software is a set of programs, which is designed to perform a well-defined function. A program is a sequence of instructions written to solve a particular problem.
This document discusses key concepts for structuring programs using modules, including: using modules to eliminate duplicate code; designing cohesive modules with single functions; defining local and global variables; using parameters to facilitate communication between modules; and employing logic structures like sequential, decision, loop, and case. It also covers variable naming conventions and types of modules.
System software includes operating systems and compilers that help utilize hardware resources, while application software performs specific tasks like word processing. Utility programs perform basic functions like formatting disks. High-level languages are easier for humans to read and write than low-level languages like assembly, which are closer to machine code.
The document summarizes different types of programming languages:
- Machine languages and assembly languages were early languages that mapped directly to computer hardware. They were inefficient for programmers.
- High-level languages like procedural languages made programming easier by using English-like syntax but were less efficient. Problem-oriented languages focused on solving specific problems.
- Compilers convert an entire program to machine code while interpreters convert each statement, making compilers generally more efficient once converted.
High Level Languages (Imperative, Object Orientated, Declarative) – Project Student
Computer Science - High Level Languages
Different types of high level languages are explained within this presentation. For example, imperative, object orientated and declarative languages are explained. The two types of languages within declarative (logic and functional) are also mentioned and described as well as the characteristics of high level languages. There is also a hierarchy of high level languages and generations.
This document discusses open source relational databases. It begins by introducing the presenter and topic, which is the current state of components in open source SQL databases. It then covers key components such as the storage engine, query planner, protocols, transaction model, and others. For each component, it discusses the approaches taken by different databases like PostgreSQL, MySQL, CockroachDB, and ClickHouse. It also addresses topics like horizontal scalability and replication strategies. Overall, the document provides a detailed overview and comparison of the architectural components and capabilities across major open source relational database management systems.
Pig Latin is a language game, argot, or cant in which words in English are altered, usually by adding a fabricated suffix or by moving the onset or initial consonant or consonant cluster of a word to the end of the word and adding a vocalic syllable to create such a suffix.[1] For example, Wikipedia would become Ikipediaway (taking the 'W' and 'ay' to create a suffix). The objective is often to conceal the words from others not familiar with the rules. The reference to Latin is a deliberate misnomer; Pig Latin is simply a form of argot or jargon unrelated to Latin, and the name is used for its English connotations as a strange and foreign-sounding language. It is most often used by young children as a fun way to confuse people unfamiliar with Pig Latin.
Cheetah is a custom data warehouse system built on top of Hadoop that provides high performance for storing and querying large datasets. It uses a virtual view abstraction over star and snowflake schemas to provide a simple yet powerful SQL-like query language. The system architecture utilizes MapReduce to parallelize query execution across many nodes. Cheetah employs columnar data storage and compression, multi-query optimization, and materialized views to improve query performance. Based on evaluations, Cheetah can efficiently handle both small and large queries and outperforms single-query execution when processing batches of queries together.
This document describes the Pig system, which is a high-level data flow system built on top of MapReduce. Pig provides a language called Pig Latin for analyzing large datasets. Pig Latin programs are compiled into MapReduce jobs. The compilation process involves several steps: (1) parsing and type checking the Pig Latin code, (2) logical optimization, (3) converting the logical plan into physical operators like GROUP and JOIN, (4) mapping the physical operators to MapReduce stages, and (5) optimizing the MapReduce plan. This allows users to write data analysis programs more declaratively without coding MapReduce jobs directly.
Impala Architecture Presentation at Toronto Hadoop User Group, in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
This document discusses various concepts related to Hadoop MapReduce including combiners, speculative execution, custom counters, input formats, multiple inputs/outputs, distributed cache, and joins. It explains that a combiner acts as a mini-reducer between the map and reduce stages to reduce data shuffling. Speculative execution allows redundant tasks to improve performance. Custom counters can track specific metrics. Input formats handle input splitting and reading. Multiple inputs allow different mappers for different files. Distributed cache shares read-only files across nodes. Joins can correlate large datasets on a common key.
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn? – Flink Forward
Flink provides a convenient abstraction layer for YARN that simplifies distributing computational tasks across a cluster. It allows writing custom input formats and operators more easily than traditional approaches like MapReduce. This document discusses two examples - a MongoDB to Avro data conversion pipeline and a file copying job - that were simplified and made more efficient by implementing them in Flink rather than traditional MapReduce or custom YARN applications. Flink handles task parallelization and orchestration automatically.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and supports programming in Scala, Java and Python for batch, streaming and machine learning workloads.
Map reduce advantages over parallel databases – Ahmad El Tawil
MapReduce has several advantages over parallel databases for processing large datasets:
1) MapReduce can handle heterogeneous systems with different storage systems more easily than parallel databases which require data copying and analysis.
2) Complex functions are more straightforward to express in MapReduce's simple map and reduce model compared to SQL in parallel databases which can require complicated user defined functions.
3) MapReduce provides better fault tolerance than parallel databases by using techniques like batching, sorting, grouping and smart task scheduling during data transfers between mapping and reducing tasks.
MapReduce is a programming model and implementation for processing large datasets in a distributed environment. It allows users to write map and reduce functions to process key-value pairs. The MapReduce library handles parallelization across clusters, automatic parallelization, fault-tolerance through task replication, and load balancing. It was designed at Google to simplify distributed computations on massive amounts of data and aggregates the results across clusters.
This document summarizes a meeting about Pig and Hive for Hadoop. The agenda included an overview of Hadoop and MapReduce, demonstrations of Pig and Hive, and demos of exercises and projects using an on-premise Hadoop emulator and Azure HDInsight. Pig and Hive were presented as domain-specific languages that simplify writing MapReduce jobs by translating queries into the jobs. Recommendation algorithms were demonstrated in C# and Pig.
Impala is an open source SQL query engine for Apache Hadoop that allows real-time queries on large datasets stored in HDFS and other data stores. It uses a distributed architecture where an Impala daemon runs on each node and coordinates query planning and execution across nodes. Impala allows SQL queries to be run directly against files stored in HDFS and other formats like Avro and Parquet. It aims to provide high performance for both analytical and transactional workloads through its C++ implementation and avoidance of MapReduce.
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
Impala is a SQL query engine for Apache Hadoop that allows real-time queries on large datasets. It is designed to provide high performance for both analytical and transactional workloads by running directly on Hadoop clusters and utilizing C++ code generation and in-memory processing. Impala uses the existing Hadoop ecosystem including metadata storage in Hive and data formats like Avro, but provides faster performance through its new query execution engine compared to traditional MapReduce-based systems like Hive. Future development of Impala will focus on improved support for features like HBase, additional SQL functionality, and query optimization.
The document discusses using MapReduce for machine learning algorithms on multi-core systems. It describes segmenting images into regions of interest for feature extraction using SIFT and SURF descriptors. Machine learning algorithms like AdaBoost, locally weighted linear regression, naive Bayes, and support vector machines are proposed to fit the MapReduce model by dividing data and computation across multiple cores. Experimental results show nearly 2x speedup on 2 cores and 54x speedup on 64 cores by parallelizing algorithm computations.
Introduction to the Map-Reduce framework.pdf – BikalAdhikari4
The document provides an introduction to the MapReduce programming model and framework. It describes how MapReduce is designed for processing large volumes of data in parallel by dividing work into independent tasks. Programs are written using functional programming idioms like map and reduce operations on lists. The key aspects are:
- Mappers process input records in parallel, emitting (key, value) pairs.
- A shuffle/sort phase groups values by key to same reducer.
- Reducers process grouped values to produce final output, aggregating as needed.
- This allows massive datasets to be processed across a cluster in a fault-tolerant way.
Lessons learnt from applying PyData to GetYourGuide marketing – Jose Luis Lopez Pino
This document summarizes how a marketing organization applied Python and data analytics (PyData) to improve their marketing efforts. Some key results included tripling marketing efforts while reducing ad creation time by 90% and launching in 7 new markets without growing their team. The document outlines their approach, including setting up infrastructure to retrieve and store marketing data, then analyzing and automating processes. Examples provided include customer segmentation, estimating new market sizes, and using machine learning for tasks like sentiment analysis and forecasting sales. The overall message is that applying a data-driven approach using Python tools can significantly impact marketing results.
Slides from my talk at Big Data Spain 2014 in Madrid.
In this talk, we will discuss our approach to bring large-scale deep analytics to the masses. R is an extremely popular numerical computing environment, but scientific data processing frequently hits its memory limits. On the other hand, systems to execute data-intensive tasks like Hadoop or Stratosphere are not popular among R users because writing programs using these paradigms is cumbersome. We present an innovative approach to overcome these limitations using the Stratosphere/Apache Flink big data platform by means of an R package and ready-to-use distributed algorithms.
This solution allows the user, with small modifications to the R code, to easily execute distributed scenarios using popular machine learning techniques. We will cover the implementation details of the proposed solution, including the architecture of the system, the functionality implemented and working examples.
In addition, we will cover the differences between our approach and other solutions that integrate R with Hadoop or other large-scale analytics systems. Finally, the results of the performance tests show that this solution is competitive with the already existing R implementations for small amounts of data and able to scale up to the gigabyte level.
This presentation is part of my work for the course 'Heterogeneous and Distributed Information Systems' at TU Berlin within the IT4BI (Information Technology for Business Intelligence) master programme.
This presentation is part of my work for the course 'Big Data Seminar' at TU Berlin within the IT4BI (Information Technology for Business Intelligence) master programme.
This presentation is part of my work for the course 'Big Data Analytics Projects' at TU Berlin within the IT4BI (Information Technology for Business Intelligence) master programme.
RDFa: introduction, comparison with microdata and microformats and how to use it – Jose Luis Lopez Pino
Report for the course 'XML and Web Technologies' of the IT4BI Erasmus Mundus Master's Programme. Introduction, motivation, target domain, schema, attributes, comparing RDFa with RDF, comparing RDFa with Microformats, comparing RDFa with Microdata, how to use RDFa to improve websites, how to extract metadata defined with RDFa, GRDDL and a simple exercise.
RDFa: introduction, comparison with microdata and microformats and how to use it – Jose Luis Lopez Pino
Presentation for the course 'XML and Web Technologies' of the IT4BI Erasmus Mundus Master's Programme. Introduction, motivation, target domain, schema, attributes, comparing RDFa with RDF, comparing RDFa with Microformats, comparing RDFa with Microdata, how to use RDFa to improve websites, how to extract metadata defined with RDFa, GRDDL and a simple exercise.
What is steganography?
What is steganography NOT?
Steganography and cryptography
Why use it?
Physical steganography
Digital steganography techniques
Curious uses of digital steganography
Attacks
Attack techniques
Steganalysis
Watermarks
Visuse is a metasearch engine that classifies and displays search results visually, focusing on multimedia content. It uses Python, Django and JavaScript to communicate with search engines such as YouTube and Flickr, organize the results and display them in an optimized way in the user's browser.
This document presents the Visuse project, a visual metasearch engine that classifies and displays the search results from several search engines and websites visually, focusing on multimedia content such as images, videos and audio. The project has two main parts: the development of a server that communicates with the search engines to process results, and a client in charge of displaying those results in an optimized way.
Presentation prepared for the national CUSL.
The latest version of Visuse can be tried at www.visuse.com
More information about the project at http://visuse.wordpress.com
Visuse is a visual metasearch engine that classifies and displays the results obtained from other search engines, such as images and videos. The goals are to communicate with other search engines, organize the information, score results and display them visually, making the most of the browser space. Its features include modules for search engines such as YouTube and Flickr, algorithms to score and order results, and pagination.
Visuse is a metasearch engine that classifies and displays search results visually, focusing on multimedia content. It uses Python, Django and JavaScript to receive queries for search engines, determine the importance of the results and display them in an optimized way. The project still needs to be expanded with more modules, caching and configuration features, and a public release.
This document provides instructions for developing a module for the Visuse search engine. It explains the steps needed to create the classes that define the search results and the search process, as well as how to test the module.
This document summarizes the improvements made to the Visuse project, a visual metasearch engine. The modules were improved to include Wikimedia Commons, Picasa and Flickr. The interface was also improved to fix errors, as were the installation instructions. Other improvements included translations, adding copyright notices to the code files, and suggestions for new modules and better organization. The document concludes by explaining how to use Visuse.
UiPath Community Day Kraków: Devs4Devs Conference – UiPathCommunity
We are honored to launch and host this event for our UiPath Polish Community, with the help of our partners - Proservartner!
We certainly hope we have managed to pique your interest in the subjects to be presented and the incredible networking opportunities at hand, too!
Check out our proposed agenda below 👇👇
08:30 ☕ Welcome coffee (30')
09:00 Opening note/ Intro to UiPath Community (10')
Cristina Vidu, Global Manager, Marketing Community @UiPath
Dawid Kot, Digital Transformation Lead @Proservartner
09:10 Cloud migration - Proservartner & DOVISTA case study (30')
Marcin Drozdowski, Automation CoE Manager @DOVISTA
Pawel Kamiński, RPA developer @DOVISTA
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
09:40 From bottlenecks to breakthroughs: Citizen Development in action (25')
Pawel Poplawski, Director, Improvement and Automation @McCormick & Company
Michał Cieślak, Senior Manager, Automation Programs @McCormick & Company
10:05 Next-level bots: API integration in UiPath Studio (30')
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
10:35 ☕ Coffee Break (15')
10:50 Document Understanding with my RPA Companion (45')
Ewa Gruszka, Enterprise Sales Specialist, AI & ML @UiPath
11:35 Power up your Robots: GenAI and GPT in REFramework (45')
Krzysztof Karaszewski, Global RPA Product Manager
12:20 🍕 Lunch Break (1hr)
13:20 From Concept to Quality: UiPath Test Suite for AI-powered Knowledge Bots (30')
Kamil Miśko, UiPath MVP, Senior RPA Developer @Zurich Insurance
13:50 Communications Mining - focus on AI capabilities (30')
Thomasz Wierzbicki, Business Analyst @Office Samurai
14:20 Polish MVP panel: Insights on MVP award achievements and career profiling
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo... – Chris Swan
Have you noticed the OpenSSF Scorecard badges on the official Dart and Flutter repos? It's Google's way of showing that they care about security. Practices such as pinning dependencies, branch protection, required reviews, continuous integration tests etc. are measured to provide a score and accompanying badge.
You can do the same for your projects, and this presentation will show you how, with an emphasis on the unique challenges that come up when working with Dart and Flutter.
The session will provide a walkthrough of the steps involved in securing a first repository, and then what it takes to repeat that process across an organization with multiple repos. It will also look at the ongoing maintenance involved once scorecards have been implemented, and how aspects of that maintenance can be better automated to minimize toil.
Details of description part II: Describing images in practice - Tech Forum 2024 – BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and transcript: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
Blockchain technology is transforming industries and reshaping the way we conduct business, manage data, and secure transactions. Whether you're new to blockchain or looking to deepen your knowledge, our guidebook, "Blockchain for Dummies", is your ultimate resource.
Quantum Communications Q&A with Gemini LLM. These are based on Shannon's noisy-channel theorem and show how the classical theory applies to the quantum world.
Implementations of Fused Deposition Modeling in real world – Emerging Tech
The presentation showcases the diverse real-world applications of Fused Deposition Modeling (FDM) across multiple industries:
1. **Manufacturing**: FDM is utilized in manufacturing for rapid prototyping, creating custom tools and fixtures, and producing functional end-use parts. Companies leverage its cost-effectiveness and flexibility to streamline production processes.
2. **Medical**: In the medical field, FDM is used to create patient-specific anatomical models, surgical guides, and prosthetics. Its ability to produce precise and biocompatible parts supports advancements in personalized healthcare solutions.
3. **Education**: FDM plays a crucial role in education by enabling students to learn about design and engineering through hands-on 3D printing projects. It promotes innovation and practical skill development in STEM disciplines.
4. **Science**: Researchers use FDM to prototype equipment for scientific experiments, build custom laboratory tools, and create models for visualization and testing purposes. It facilitates rapid iteration and customization in scientific endeavors.
5. **Automotive**: Automotive manufacturers employ FDM for prototyping vehicle components, tooling for assembly lines, and customized parts. It speeds up the design validation process and enhances efficiency in automotive engineering.
6. **Consumer Electronics**: FDM is utilized in consumer electronics for designing and prototyping product enclosures, casings, and internal components. It enables rapid iteration and customization to meet evolving consumer demands.
7. **Robotics**: Robotics engineers leverage FDM to prototype robot parts, create lightweight and durable components, and customize robot designs for specific applications. It supports innovation and optimization in robotic systems.
8. **Aerospace**: In aerospace, FDM is used to manufacture lightweight parts, complex geometries, and prototypes of aircraft components. It contributes to cost reduction, faster production cycles, and weight savings in aerospace engineering.
9. **Architecture**: Architects utilize FDM for creating detailed architectural models, prototypes of building components, and intricate designs. It aids in visualizing concepts, testing structural integrity, and communicating design ideas effectively.
Each industry example demonstrates how FDM enhances innovation, accelerates product development, and addresses specific challenges through advanced manufacturing capabilities.
Sustainability requires ingenuity and stewardship. Did you know Pigging Solutions pigging systems help you achieve your sustainable manufacturing goals AND provide rapid return on investment.
How? Our systems recover over 99% of product in transfer piping. Recovering trapped product from transfer lines that would otherwise become flush-waste, means you can increase batch yields and eliminate flush waste. From raw materials to finished product, if you can pump it, we can pig it.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/07/intels-approach-to-operationalizing-ai-in-the-manufacturing-sector-a-presentation-from-intel/
Tara Thimmanaik, AI Systems and Solutions Architect at Intel, presents the “Intel’s Approach to Operationalizing AI in the Manufacturing Sector,” tutorial at the May 2024 Embedded Vision Summit.
AI at the edge is powering a revolution in industrial IoT, from real-time processing and analytics that drive greater efficiency and learning to predictive maintenance. Intel is focused on developing tools and assets to help domain experts operationalize AI-based solutions in their fields of expertise.
In this talk, Thimmanaik explains how Intel’s software platforms simplify labor-intensive data upload, labeling, training, model optimization and retraining tasks. She shows how domain experts can quickly build vision models for a wide range of processes—detecting defective parts on a production line, reducing downtime on the factory floor, automating inventory management and other digitization and automation projects. And she introduces Intel-provided edge computing assets that empower faster localized insights and decisions, improving labor productivity through easy-to-use AI tools that democratize AI.
The DealBook is our annual overview of the Ukrainian tech investment industry. This edition comprehensively covers the full year 2023 and the first deals of 2024.
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
4. The MapReduce model
• Introduced in 2004 by Google
• The model allows programmers with no experience in parallel coding to write highly scalable programs and hence process voluminous data sets.
• This high level of scalability is reached thanks to the decomposition of the problem into a large number of tasks.
• The Map function takes a single key/value pair as input and produces a set of key/value pairs.
• The Reduce function takes a key and the set of values related to that key as input; it may also produce a set of values, but commonly it emits only one value or none as output (both functions are sketched in Python below).
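A minimal, single-process Python sketch of this contract, using the canonical word-count example (the helper names map_fn, reduce_fn and run_mapreduce are illustrative, not part of any real framework); a real system runs these same two user functions in parallel across thousands of machines:

from collections import defaultdict

# --- user-supplied functions, as in the MapReduce contract ---

def map_fn(key, value):
    # Map: takes a single (key, value) pair, emits a set of (key, value) pairs.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: takes a key and all values for that key, emits one value.
    yield sum(values)

# --- a toy, single-process stand-in for the framework itself ---

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:            # map phase
        for k, v in map_fn(key, value):
            groups[k].append(v)           # shuffle: group values by key
    return {k: list(reduce_fn(k, vs)) for k, vs in groups.items()}  # reduce phase

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'the': [2], 'quick': [1], 'brown': [1], 'fox': [1], 'lazy': [1], 'dog': [1]}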
5. The MapReduce model
• Advantages:
• Scalability
• Handles failures and balances the system
• Pitfalls:
• Some tasks are complicated to code.
• Some tasks are very expensive.
• Debugging the code is difficult.
• Absence of schemas and indexes.
• A lot of bandwidth may be consumed.
6. Hadoop
• An Apache Software Foundation open-source project
• Hadoop = HDFS + MapReduce
• DFS – partitions data and stores it on separate machines
• HDFS – stores large files on clusters of commodity hardware, typically in 64 MB blocks
• Both the file system and MapReduce are co-designed
7. Hadoop
• No separate storage network and processing network
• Moving compute to the data node
9. High level languages
• Two different types
• Created specifically for this model.
• Already existing languages
• Languages present in the comparison
• Pig Latin
• HiveQL
• Jaql
• Interesting languages
• Meteor
• DryadLINQ
10. Pig Latin
• Executed over Hadoop.
• Procedural language.
• High-level operations similar to those we can find in SQL.
• Some interesting operators:
• FOREACH applies a transformation to every tuple of the set. To make it possible to parallelise this operation, the transformation of one row must not depend on any other.
• COGROUP groups related tuples from multiple datasets. It is similar to the first step of a join (sketched below).
• LOAD loads the input data and its structure, and STORE saves data to a file.
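To make COGROUP's nested output concrete, here is a small Python sketch of its semantics over two invented datasets (in Pig Latin itself this would be something like COGROUP owners BY species, pets BY species); note how each key keeps both groups intact instead of producing flat joined records:

from collections import defaultdict

def cogroup(left, right):
    # Group tuples from two datasets by their first field (the key).
    # The output is nested: one record per key, holding both groups whole.
    out = defaultdict(lambda: ([], []))
    for t in left:
        out[t[0]][0].append(t)
    for t in right:
        out[t[0]][1].append(t)
    return dict(out)

owners = [("cat", "alice"), ("dog", "bob")]
pets   = [("cat", "tom"), ("cat", "felix"), ("dog", "rex")]
print(cogroup(owners, pets))
# {'cat': ([('cat', 'alice')], [('cat', 'tom'), ('cat', 'felix')]),
#  'dog': ([('dog', 'bob')], [('dog', 'rex')])}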
11. Pig Latin
• Goal: to reduce development time.
• Nested data model.
• User-defined functions.
• Analytic queries over text files (no need to load the data first).
• Procedural language -> control over the execution plan.
• The user can speed the performance up.
• It makes the work of the query optimiser easier.
• Unlike SQL.
12. HiveQL
• Open-source data warehouse solution built on top of Hadoop.
• The queries look similar to SQL and also add extensions to it.
• Complex column types: map, array and struct as data types.
• It stores the metadata in an RDBMS.
13. HiveQL
• The Metastore acts as the system catalog for Hive.
• It stores all the information about the tables: their partitions, their schemas, etc.
• Without the system catalog it is not possible to impose a structure on Hadoop files (illustrated below).
• Facebook uses MySQL to store this metadata. Reason: this information has to be served fast to the compiler.
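A toy Python sketch of what such a catalog buys you (the table name, schema and location below are invented for illustration): without a catalog entry, the bytes in HDFS are just delimited text; with one, a compiler can resolve columns and types before planning any job:

# Hypothetical in-memory stand-in for a metastore catalog.
METASTORE = {
    "page_views": {
        "location": "/warehouse/page_views",   # where the raw files live
        "columns": [("user_id", "int"), ("url", "string"), ("ts", "string")],
        "partitions": ["ds"],                   # e.g. one directory per day
    }
}

def resolve(table, column):
    # What a compiler does with the catalog: check that the column
    # exists and learn its type before planning a MapReduce job.
    schema = dict(METASTORE[table]["columns"])
    if column not in schema:
        raise ValueError(f"unknown column {column!r} in table {table!r}")
    return schema[column]

print(resolve("page_views", "url"))   # -> 'string'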
14. JAQL
• What is Jaql?
• Declarative scripting programming language.
• Used over Hadoop's MapReduce framework.
• Included in IBM's InfoSphere BigInsights and Cognos Consumer Insight products.
• Developed after Pig and Hive:
• More scalable.
• More flexible.
• More reusable.
• Data model (see the sketch below)
• Simple: similar to JSON.
• Values as trees.
• No references.
• Textual representation very similar to JSON.
• Flexible
• Handles semistructured documents.
• But also structured records validated against a schema.
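Because Jaql's data model is essentially JSON, plain Python with the standard json module is enough to sketch it (the record below is invented): every value is a tree of objects, arrays and atoms with no references between nodes, and the same text works whether or not a schema is being enforced:

import json

# A semistructured record: nested objects and arrays, no references -- a tree.
record = json.loads("""
{
  "author": "smith",
  "products": [
    {"name": "widget", "price": 9.99},
    {"name": "gadget"}
  ]
}
""")

# Navigating the tree is just walking nested dicts and lists.
for p in record["products"]:
    # 'price' is optional: semistructured data tolerates missing fields.
    print(p["name"], p.get("price", "n/a"))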
15. JAQL
• Control over the evaluation plan.
• The programmer can work at different levels of abstraction using Jaql's syntax:
• Full definition of the execution plan.
• Use of hints to indicate some evaluation features to the optimizer.
• This feature is present in most of the database engines that use SQL as their query language.
• Declarative programming, without any control over the flow.
16. Other languages: Meteor
• Stratosphere stack
• Pact:
• Programming model.
• It extends MapReduce with new second-order functions (Cross and Match are sketched below):
• Cross: Cartesian product.
• CoGroup: groups all the records with the same key and processes them together.
• Match: similar to CoGroup, but pairs with the same key can be processed separately.
• Sopremo:
• Semantically rich operator model.
• Extensible.
• Meteor: query language.
• Optimization:
• Meteor code
• Logical plan using Sopremo operators -> optimized
• Final Pact program -> physically optimized
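A compact Python sketch of the two Pact functions that go beyond MapReduce's group-by-key (the datasets are invented key/value pairs): Cross applies a function to every pair in the Cartesian product, while Match issues one call per pair of records sharing a key, rather than one call per whole key group as CoGroup does:

from itertools import product
from collections import defaultdict

def cross(left, right, fn):
    # Cross: apply fn to every pair in the Cartesian product.
    return [fn(l, r) for l, r in product(left, right)]

def match(left, right, fn):
    # Match: apply fn separately to each same-key pair (CoGroup, by
    # contrast, would hand fn the two whole groups for a key at once).
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    return [fn(k, lv, rv) for k, lv in left for rv in by_key[k]]

L = [("a", 1), ("b", 2)]
R = [("a", 10), ("a", 20)]
print(cross(L, R, lambda l, r: (l, r)))             # all 4 pairs
print(match(L, R, lambda k, lv, rv: (k, lv + rv)))  # [('a', 11), ('a', 21)]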
17. Other languages: DryadLINQ
• Code is written embedded in .NET programming languages.
• Operators:
• Almost all the operators available in LINQ.
• Some specific operators for parallel programming.
• Developers can include their own implementations.
• DryadLINQ code is translated to a Dryad plan.
• Optimization:
• Pipeline operations.
• Remove redundancy.
• Push aggregations.
• Reduce network traffic.
21. Expressive power
• Three categories by Robert Stewart:
• Relational complete
• SQL equivalent (aggregate functions)
• Turing complete
• Conditional branching
• Indefinite iterations by means of recursion
• Emulation of infinite memory model
25. Expressive power
• But this does not mean that SQL, Pig Latin and HiveQL are the same.
• HiveQL
• Is inspired by SQL but does not support the full repertoire included in the SQL-92 specification.
• Includes features notably inspired by MySQL and MapReduce that are not part of SQL.
• Pig Latin
• It is not inspired by SQL.
• For instance, it does not have an OVER clause.
26. SQL Vs. HiveQL (2009)

Feature                             SQL                  HiveQL
Transactions                        Yes                  No
Indexes                             Yes                  No
Create table as select              Not SQL-92           Yes
Subqueries                          In any clause,       Only in FROM clause,
                                    correlated or not    only noncorrelated
Views                               Yes                  Not materialized
Extension with map/reduce scripts   No                   Yes
28. Query Processing
• In order to make a good comparison we should have a basic knowledge of how these HLQLs work.
• How is the abstract user representation of the query or script converted to MapReduce jobs?
29. Query Processing – Pig Latin
• The goal of writing a Pig Latin script is to produce equivalent MapReduce jobs that can be executed in the Hadoop environment.
• The parser first checks for syntactic errors.
32. Query Processing - Hive
• It gets the Hive SQL string from the client.
• The parser phase converts it into a parse tree representation.
• The logical query plan generator converts it into a logical query representation. It prunes columns early and pushes predicates closer to the table (see the sketch below).
• The logical plan is converted to a physical plan and then to MapReduce jobs.
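A small Python sketch of why pushing predicates closer to the table matters (toy relations, with list comprehensions standing in for plan operators; Hive does this on operator trees): filtering before the join produces the same answer while shrinking the intermediate result:

# Toy relations: users(user_id, country) and visits(user_id, url).
users  = [(1, "de"), (2, "fr"), (3, "de")]
visits = [(1, "/a"), (2, "/b"), (3, "/c"), (3, "/d")]

def join(left, right):
    return [(lu, c, url) for lu, c in left for ru, url in right if lu == ru]

# Naive plan: join everything, then filter.
naive = [row for row in join(users, visits) if row[1] == "de"]

# Pushed-down plan: filter one side first, then join less data.
pushed = join([u for u in users if u[1] == "de"], visits)

assert naive == pushed   # same answer, smaller intermediate result
print(pushed)            # [(1, 'de', '/a'), (3, 'de', '/c'), (3, 'de', '/d')]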
33. Query Processing - JAQL
• JAQL includes two higher-order functions, mapReduceFn and mapAggregate.
• The rewriter engine generates calls to mapReduceFn or mapAggregate.
34. QP - Summary
• Each of these languages has its own methods.
• All support syntax checking, usually done by the compiler.
• Pig currently misses out on optimized storage structures like indexes and column groups.
• HiveQL provides more optimizations:
• It prunes the buckets that are not needed.
• Predicate push-down.
• Query rewriting (projection push-down) is future work for JAQL.
36. JOIN in Pig Latin
• Pig Latin supports inner join, equijoin and outer join. The JOIN operator always performs an inner join.
• A join can also be achieved by a COGROUP operation followed by FLATTEN.
• JOIN creates a flat set of output records while COGROUP creates a nested set of output records.
• GROUP – when only one relation is involved.
• COGROUP – when multiple relations are involved.
• FLATTEN – (a, {(b,c), (d,e)}) becomes (a, b, c) and (a, d, e) (sketched below).
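The FLATTEN step fits in a few lines of Python (mirroring the slide's example tuple): it unnests a bag into one flat record per inner tuple, which is exactly how COGROUP followed by FLATTEN reproduces a join's flat output:

def flatten(record):
    # (a, {(b,c), (d,e)}) -> [(a, b, c), (a, d, e)]
    head, bag = record
    return [(head, *inner) for inner in bag]

print(flatten(("a", [("b", "c"), ("d", "e")])))
# [('a', 'b', 'c'), ('a', 'd', 'e')]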
37. JOIN in Pig Latin
• Fragment-replicate joins (sketched below)
• Trivial case, only possible if one of the two relations is small enough to fit into memory.
• The JOIN happens in the map phase.
• Skewed joins
• For data that is not equally distributed.
• Basically computes a histogram of the key space and uses this data to allocate reducers for a given key.
• The JOIN happens in the reduce phase.
• Merge joins
• Only possible if the relations are already sorted.
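A Python sketch of the fragment-replicate idea (the relations are invented): the small relation is loaded into an in-memory hash table that every mapper holds a full copy of, so each fragment of the big relation is joined locally, with no shuffle and no reduce phase:

def map_side_join(big_fragment, small_relation):
    # Each mapper receives one fragment of the big relation plus a
    # complete replica of the small one, and joins in memory.
    replica = {k: v for k, v in small_relation}   # fits in memory by assumption
    return [(k, v, replica[k]) for k, v in big_fragment if k in replica]

small = [("de", "Germany"), ("fr", "France")]
fragment = [("de", 10), ("fr", 20), ("es", 30)]   # one mapper's input split
print(map_side_join(fragment, small))
# [('de', 10, 'Germany'), ('fr', 20, 'France')]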
38. JOIN in Pig Latin
• The choice of join strategy can be specified by the user
39. JOIN in Hive
• Normal map-reduce join (sketched below)
• Mappers send all rows with the same key to a single reducer.
• The reducer does the join.
• SELECT t1.a1 as c1, t2.b1 as c2
  FROM t1 JOIN t2 ON (t1.a2 = t2.b2);
• Map-side joins
• Small tables are replicated in all the mappers and joined with the other tables.
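For contrast with the map-side variant, here is a Python sketch of the normal reduce-side join (the source-tagging scheme is a common textbook device, invented here for illustration): mappers tag each row with its table of origin, the shuffle brings all rows with the same join key together, and the reducer pairs the two sides:

from collections import defaultdict

def reduce_side_join(t1, t2):
    # Map phase: tag each row with its source and emit (join_key, row).
    tagged = [(a2, ("t1", a1)) for a1, a2 in t1] + \
             [(b2, ("t2", b1)) for b1, b2 in t2]

    # Shuffle: all rows with the same key meet at one reducer.
    groups = defaultdict(list)
    for key, row in tagged:
        groups[key].append(row)

    # Reduce phase: within each key group, pair t1 rows with t2 rows.
    out = []
    for key, rows in groups.items():
        left  = [v for tag, v in rows if tag == "t1"]
        right = [v for tag, v in rows if tag == "t2"]
        out += [(c1, c2) for c1 in left for c2 in right]
    return out

t1 = [("x", 1), ("y", 2)]            # rows of (a1, a2)
t2 = [("X", 1), ("Y", 2), ("Z", 3)]  # rows of (b1, b2)
print(reduce_side_join(t1, t2))      # [('x', 'X'), ('y', 'Y')]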
40. JOIN in JAQL
• Currently JAQL supports equijoin.
• The join expression supports equijoins of 2 or more inputs. All of the options for inner and outer joins are also supported.

joinedRefs = join w in wroteAbout, p in products
  where w.product == p.name
  into { w.author, p.* };
41. JOIN - Summary
• Both Pig and Hive have the ability to perform the join in the map phase instead of the reduce phase.
• For skewed distributions of data, the performance of JAQL joins is not comparable to the other two languages.
43. Benchmarks
• Pig Mix is a set of queries to test performance. This set checks scalability and latency.
• Hive's benchmark is mainly based on the queries specified by Pavlo et al. (a selection task, an aggregation task and a join task).
• There are Pig Latin and HiveQL implementations of the TPC-H queries.
44. Performance - Summary
• The paper describes scale-up, scale-out and runtime.
• For skewed data, Pig and Hive seem to be more effective at handling it than the JAQL runtime.
• Pig and Hive are better at exploiting an increase in cluster size than JAQL.
• Pig and Hive allow the user to explicitly specify the number of reducer tasks.
• This feature has a significant influence on performance.
46. Machine Learning
• What page will the visitor visit next?
• Twitter has extended Pig's support for ML by placing learning algorithms in Pig storage functions.
• Hive – machine learning is treated as UDAFs (user-defined aggregate functions).
• A new data analytics platform, Ricardo, has been proposed that combines the functionality of R and Jaql.
47. Interactive queries
• One of the main problems of MapReduce and all the languages built on top of this framework (Pig, Hive, etc.) is latency.
• As a complement to those technologies, some new frameworks that allow programmers to query large datasets in an interactive manner have been developed:
• Dremel by Google.
• The open source project Apache Drill.
• How to reduce the latency?
• Store the information as nested columns (see the sketch after this slide).
• Query execution based on a multi-level tree architecture.
• Balance the load by means of a query dispatcher.
• Not too many details of the query language are public:
• It is based on SQL.
• It includes the usual operations (selection, projection, etc.).
• SQL-like language features: user-defined functions and nested subqueries.
• The characteristic that distinguishes these languages is that they operate with nested tables as inputs and outputs.
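A minimal Python illustration of the columnar idea (flat columns only; Dremel's actual format additionally records repetition and definition levels to handle nesting): storing each field contiguously lets an aggregation read just the one column it needs instead of whole records:

# Row layout: each record stored whole.
rows = [
    {"user": "a", "country": "de", "latency_ms": 120},
    {"user": "b", "country": "fr", "latency_ms": 80},
    {"user": "c", "country": "de", "latency_ms": 95},
]

# Column layout: one contiguous array per field.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# An aggregation over one field touches one array, not every record.
avg = sum(columns["latency_ms"]) / len(columns["latency_ms"])
print(avg)   # 98.33...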
48. Conclusions
• The MapReduce programming model has big pitfalls.
• Each programming language tries to solve some of these disadvantages in a different way.
• No single language beats all the other options.
• Comparison:
• Jaql is expressively more powerful.
• JAQL is at a lower level in terms of performance when compared to Hive and Pig.
• HiveQL and Pig Latin support map-phase JOIN.
• HiveQL uses more advanced optimization techniques for query processing.
• New technologies to solve those problems:
• Languages: Dremel and Apache Drill.
• Libraries: Mahout.
80% of execution time is spent executing at most 20% of the code. The aim is to provide an abstract data-querying interface to remove the burden of the MR implementation away from the programmer, and to ask whether or not programs pay a performance penalty for opting for these more abstract languages. (Loop Recognition in C++/Java/Go/Scala. Robert Hundt, Google, 2011.) Are there any optimization techniques, and if so, when and where?
In Pig the operator GROUP is translated into LOCAL REARRANGE, GLOBAL REARRANGE and PACKAGE in the physical plan. Rearranging means it does either hashing or sorting by key. The combination of local and global rearranges produces the result in such a way that the tuples having the same group key will be moved to the same machine.
TPC-H benchmark for relational OLTP systems. The ideal theoretical outcome would be that there is no increase in the computation time (T) for a given job. Scalability and fault tolerance. A document is a good match to a query if the document model is likely to generate the query.
JAQL includes two higher-order functions, mapReduceFn and mapAggregate, to execute map-reduce and aggregate operations respectively. The rewriter engine generates calls to mapReduceFn or mapAggregate by identifying the relevant parts of the script and moving them to the map, reduce and aggregate function parameters. Based on some rules, the rewriter converts them to an Expr tree. Finally it checks for the presence of algebraic aggregates; if there are any, it invokes the mrAggregate function. In other words, it can complete the task with a single map-reduce job.
JAQL's physical transparency is an added-value feature because it allows the user to add new runtime operators without affecting JAQL's internals.
In this case, the big relation is distributed across Hadoop nodes and the smaller relation is replicated on each node. Here the entire join operation is performed in the map phase. In general the data in a data warehouse is not equally distributed and is susceptible to skew. Pig handles this condition by employing the skewed join. The basic idea is to compute a histogram of the key space and use this data to allocate reducers for a given key. Currently Pig allows a skewed join of only two tables. The join is performed in the reduce phase.
By combining multi-level execution trees and a columnar data layout, Dremel is capable of running aggregation queries over trillion-row tables in seconds. Dremel uses a novel query execution engine based on aggregator trees to run almost-real-time, interactive and ad-hoc queries, both of which MapReduce cannot. Pig and Hive aren't real time either. Dremel is what the future of Hive (and not MapReduce, as I mentioned before) should be. Hive right now provides a SQL-like interface to run MapReduce jobs. Hive has very high latency, and so is not practical for ad-hoc data analysis. Dremel provides a very fast SQL-like interface to the data by using a different technique than MapReduce.