This document discusses hybrid transactional/analytical processing (HTAP) with Apache Spark and in-memory data grids. After introducing the speaker and GigaSpaces, it explains why modern applications need both online transaction processing and real-time operational intelligence, with examples from retail and IoT, and frames the goals of minimizing application latency while maximizing data-analytics locality. It surveys in-memory computing options, describes how GigaSpaces combines an in-memory data grid with Spark to achieve HTAP, and covers deployment topologies, data grid RDDs, and pushing predicates down to the grid. It closes with how this was productized as InsightEdge, along with further innovations and reference architectures.
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise..." – Dataconomy Media
This document discusses data virtualization and how it can help organizations leverage data lakes to access all their data from disparate sources through a single interface. It addresses how data virtualization can help avoid data swamps, prevent physical data lakes from becoming silos, and support use cases like IoT, operational data stores, and offloading. The document outlines the benefits of a logical data lake created through data virtualization and provides examples of common use cases.
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ... – Databricks
This document summarizes Walmart's transition to building an enterprise data platform on Azure Databricks to enable machine learning and data science at scale. Previously, Walmart had a complex and slow legacy technology stack. The new platform goals were to centralize data in the cloud, increase productivity with data science tools, and reduce costs. Key aspects of the new platform included using Azure and Databricks for data processing and machine learning, Airflow for orchestration, and building several machine learning models for applications like fraud detection and product recommendations. Challenges in the transition included optimizing performance and managing resources across the platforms.
Unlocking Geospatial Analytics Use Cases with CARTO and Databricks – Databricks
Many companies need to analyze large datasets that include location information. To be able to derive business insights from these datasets you need a solution that provides geospatial analysis functionalities and can scale to manage large volumes of information. The combination of CARTO and Databricks allows you to solve this kind of large scale geospatial analytics problems. CARTO provides a location intelligence platform to discover and predict key insights through location data. In this session we will see how we can integrate CARTO and Databricks and how we can take advantage of this combination to solve specific problems for industries such as logistics, telecommunications or financial services.
LendingClub RealTime BigData Platform with Oracle GoldenGate – Rajit Saha
LendingClub RealTime BigData Platform with Oracle GoldenGate BigData Adapter. This was presented at Oracle Open World 2017 in San Francisco.
Speakers:
Rajit Saha
Vengata Guruswami
Counting Unique Users in Real-Time: Here's a Challenge for You! – DataWorks Summit
Finding the number of unique users out of 10 billion events per day is challenging. At this session, we're going to describe how re-architecting our data infrastructure, relying on Druid and ThetaSketch, enables our customers to obtain these insights in real-time.
To put things into context, at NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. Specifically, we provide them with the ability to see the number of unique users who meet a given criterion.
Historically, we have used Elasticsearch to answer these types of questions; however, we encountered major scaling and stability issues.
In this presentation we will detail the journey of rebuilding our data infrastructure, including researching, benchmarking and productionizing a new technology, Druid, with ThetaSketch, to overcome the limitations we were facing.
We will also provide guidelines and best practices with regards to Druid.
Topics include:
* The need and possible solutions
* Intro to Druid and ThetaSketch (see the sketch after this list)
* How we use Druid
* Guidelines and pitfalls
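Not from the talk itself, but to make the ThetaSketch idea concrete: below is a minimal Scala sketch using the Apache DataSketches Java library (assuming the org.apache.datasketches.theta package; check the artifact version you depend on). A theta sketch estimates distinct counts in bounded memory, which is what makes unique-user counting over billions of events tractable in real time.

import org.apache.datasketches.theta.UpdateSketch

object UniqueUsers {
  def main(args: Array[String]): Unit = {
    // A theta sketch retains a small set of hashed IDs, not the IDs themselves.
    val sketch = UpdateSketch.builder().build()

    // Feed user IDs; duplicates are absorbed rather than recounted.
    Seq("user-1", "user-2", "user-1", "user-3").foreach(id => sketch.update(id))

    // getEstimate returns the approximate number of distinct IDs seen.
    println(f"estimated unique users: ${sketch.getEstimate}%.0f")
  }
}

Druid stores such sketches per segment and merges them at query time, which is what keeps set operations (union, intersection) across dimensions cheap.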
Bloor Research & DataStax: How graph databases solve previously unsolvable bu... – DataStax
This webinar covered graph databases and how they can solve problems that were previously difficult for traditional databases. It included presentations on why graph databases are useful, common use cases like recommendations and network analysis, different types of graph databases, and a demonstration of the DataStax Enterprise graph database. There was also a question and answer session where attendees could ask about graph databases and DataStax Enterprise graph.
Big Data in the Cloud with Azure Marketplace Images – Mark Kromer
The document discusses strategies for modern data warehousing and analytics on Azure including using Hadoop for ETL/ELT, integrating streaming data engines, and using lambda and hybrid architectures. It also describes using data lakes on Azure to collect and analyze large amounts of data from various sources. Additionally, it covers performing real-time stream analytics, machine learning, and statistical analysis on the data and discusses how Azure provides scalability, speed of deployment, and support for polyglot environments that incorporate many data processing and storage options.
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j – Neo4j
The document discusses Neo4j's capabilities for scalability and graph analytics. It describes how Neo4j provides unbounded scalability through features like causal clustering and multi-tenancy. This allows large datasets to be queried efficiently across distributed databases. Neo4j also includes tools for graph analytics and data science through its graph data science library, which supports algorithms and analytics on graphs with billions of nodes. These capabilities enable use cases for telecommunications companies to perform scalable analytics on large, connected datasets.
Pouring the Foundation: Data Management in the Energy Industry – DataWorks Summit
At CenterPoint Energy, both structured and unstructured data are continuing to grow at a rapid pace. This growth presents many opportunities to deliver business value and many challenges to control costs. To maximize the value of this data while controlling costs, CenterPoint Energy created a data lake using SAP HANA and Hadoop. During this presentation, CenterPoint will discuss their journey of moving smart meter data to Hadoop, how Hadoop is allowing CenterPoint to derive value from big data and their future use case road map.
BICube is a machine learning platform for big data. It provides tools for ingesting, processing, analyzing and visualizing large datasets using techniques like Apache Spark, Hadoop, and machine learning algorithms. The platform includes modules for tasks like document clustering, topic modeling, image analysis, recommendation systems and more. It aims to allow users to build customized machine learning workflows and solutions.
The document discusses Intuit's vision to transform customers' lives by unleashing the power of data. It describes Intuit's Analytics Cloud (IAC), which provides a data platform and foundational services to derive value from data. The IAC allows for real-time and batch data ingestion from various sources and provides services like business lookups, unified customer profiles, and personalization. An example use case of using tax data to personalize the tax preparation experience is also mentioned. The document outlines Intuit's journey to building the IAC, including initially lifting existing systems to the cloud and now focusing on real-time streaming capabilities. Key practices for planning, deploying and managing the IAC are also listed.
This document outlines the big data landscape in 2016, including key components like data lakes, data warehouses, ingestion, processing, data science, analytics, and data sources. It also discusses related microservices, algorithms, data storage technologies, data workflows, stream processing systems, SQL and NoSQL databases, and specialized databases for time series, graphs, and other data types. The goal is to provide an overview of the different technologies and approaches for working with large and diverse datasets.
This document summarizes a presentation about big data analytics solutions from Think Big Analytics and Infochimps. It discusses using their platforms together to power applications with next-generation big data stacks. It highlights case studies, architecture diagrams, and polls to demonstrate how their services can accelerate time to value through a combination of data science, engineering, strategy, and hands-on training and education.
My other computer is a datacentre - 2012 edition – Steve Loughran
An updated version of the "my other computer is a datacentre" talk, presented at the Bristol University HPC talk.
Because it is targeted at universities, it emphasises some of the interesting problems: the classic CS ones of scheduling, new ones of availability and failure handling within what is now a single computer, and emergent problems of power and heterogeneity. It also includes references, all of which are worth reading and, being mostly Google and Microsoft papers, are free to download without needing ACM or IEEE library access.
Comments welcome.
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg... – Databricks
"The modernization of the tobacco industry is resulting in a shift towards a more data-driven approach to trade, operations and the consumer. The need to scale while maintaining margins is paramount, and today’s consumer requires more personalized engagement and value at every interaction to drive sales and revenue.
At Altria, we're at the forefront of this evolution, leveraging hundreds of terabytes of big data (such as point-of-sale, clickstream, mobile data, and more) and machine learning to improve our ability to make smarter decisions and outpace the competition. This talk recaps our big data journey from a legacy data infrastructure (Teradata), isolated data systems, and a lack of resources that prevented our ability to move quickly and scale, to our current state, where we have successfully implemented, architected, and on-boarded tools and processes across the stages of data acquisition, storage, preparation, and business intelligence with Azure Data Lake, Azure Databricks, Azure Data Factory, API Management, and streaming and hosting technologies to provide a data analytics platform.
We’ll discuss the roadblocks we came across, how we overcame them, and how we employed a unified approach to big data and analytics through the fully managed Azure Databricks platform and the Azure suite of tools which allowed us to streamline workflows, improve operational performance, and ultimately introduce new customer experiences that drive engagement and revenue."
Application-level Disaster Recovery on OpenStack – Ali Hodroj
This document discusses architecting high availability and disaster recovery solutions on OpenStack using Cloudify. It begins with an overview of key concepts like regions, availability zones, and single points of failure. It then covers challenges like deployment complexity and cost of redundancy. The document presents Cloudify's principles of automation, decoupling applications from infrastructure, and plugging in different clouds. Finally, it shares case studies of using Cloudify for operationally critical cold DR, business critical cross-region DR, and mission critical in-memory WAN replication across regions.
"Applications, programming languages, and libraries that leverage sophisticated network hardware capabilities have a natural advantage when used in today’s and tomorrow’s high-performance and data center computer environments. Modern RDMA based network interconnects provides incredibly rich functionality (RDMA, Atomics, OS-bypass, etc.) that enable low-latency and high-bandwidth communication services. The functionality is supported by a variety of interconnect technologies such as InfiniBand, RoCE, iWARP, Intel OPA, Cray’s Aries/Gemini, and others. OFA organization and LinuxRDMA community have been playing a predominant role in the enablement efficient and vendor agnostic software stack for those interconnects. Over the last decade, the community has developed variety user/kernel level protocols and libraries that enable a variety of applications over RDMA including MPI, SHMEM, NFS over RDMA, IPoIB, and many others."
"With the emerging availability server platforms based on ARM CPU architecture, it is important to understand ARM integrates with RDMA hardware and software eco-system. In this talk, we will overview ARM architecture and system software stack. We will discuss how ARM CPU interacts with network devices and accelerators. In addition, we will share our experience in enabling RDMA software stack (OFED/MOFED Verbs) and one-sided communication libraries (Open UCX, OpenSHMEM/SHMEM) on ARM and share preliminary evaluation results."
Watch the video presentation: http://wp.me/p3RLHQ-gyO
Learn more: https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Exascale Computing Project - Driving a HUGE Change in a Changing World – inside-BigData.com
In this video from the OpenFabrics Workshop in Austin, Al Geist from ORNL presents: Exascale Computing Project - Driving a HUGE Change in a Changing World.
"In this keynote, Mr. Geist will discuss the need for future Department of Energy supercomputers to solve emerging data science and machine learning problems in addition to running traditional modeling and simulation applications. In August 2016, the Exascale Computing Project (ECP) was approved to support a huge lift in the trajectory of U.S. High Performance Computing (HPC). The ECP goals are intended to enable the delivery of capable exascale computers in 2022 and one early exascale system in 2021, which will foster a rich exascale ecosystem and work toward ensuring continued U.S. leadership in HPC. He will also share how the ECP plans to achieve these goals and the potential positive impacts for OFA."
Learn more: https://exascaleproject.org/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerating Hadoop, Spark, and Memcached with HPC Technologies – inside-BigData.com
DK Panda from Ohio State University presented this deck at the OpenFabrics Workshop.
"Modern HPC clusters are having many advanced features, such as multi-/many-core architectures, highperformance RDMA-enabled interconnects, SSD-based storage devices, burst-buffers and parallel file systems. However, current generation Big Data processing middleware (such as Hadoop, Spark, and Memcached) have not fully exploited the benefits of the advanced features on modern HPC clusters. This talk will present RDMA-based designs using OpenFabrics Verbs and heterogeneous storage architectures to accelerate multiple components of Hadoop (HDFS, MapReduce, RPC, and HBase), Spark and Memcached. An overview of the associated RDMA-enabled software libraries (being designed and publicly distributed as a part of the HiBD project for Apache Hadoop (integrated and plug-ins for Apache, HDP, and Cloudera distributions), Apache Spark and Memcached will be presented. The talk will also address the need for designing benchmarks using a multi-layered and systematic approach, which can be used to evaluate the performance of these Big Data processing middleware."
Watch the video presentation: http://wp.me/p3RLHQ-gzg
Learn more: http://hibd.cse.ohio-state.edu/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Susan Coulter from LANL presented this deck at the OpenFabrics Workshop.
"The OpenFabrics Alliance (OFA) is an open source-based organization that develops, tests, licenses, supports and distributes OpenFabrics Software (OFS). The Alliance’s mission is to develop and promote software that enables maximum application efficiency by delivering wire-speed messaging, ultra-low latencies and maximum bandwidth directly to applications with minimal CPU overhead.
Founded in June 2004 as the OpenIB Alliance, the Alliance was originally focused on developing a vendor-independent, Linux-based InfiniBand software stack. In 2005, the Alliance committed itself to supporting Windows, a move that would make the Alliance’s software stack truly cross-platform. In 2006, the organization again expanded its charter to include support for iWARP and in 2010 it added support for RoCE (RDMA over Converged Ethernet), both for delivering high-performance RDMA and kernel bypass solutions over Ethernet. In 2014 the Alliance expanded again with the creation of the OpenFabrics Interfaces working group to investigate and incorporate support for other high performance networks."
Watch the video presentation: http://wp.me/p3RLHQ-gzo
Learn more: https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
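As a quick illustration of the unified load/save functions mentioned above (written against the modern SparkSession entry point; the file paths are made up):

import org.apache.spark.sql.SparkSession

object LoadSave {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-save").master("local[*]").getOrCreate()

    // Unified load: the format name selects the data source library.
    val events = spark.read.format("json").load("events.json")

    // Unified save: same API shape, different format and options.
    events.write.format("parquet").mode("overwrite").save("events.parquet")

    spark.stop()
  }
}

Any library that implements the Data Source API plugs into these same read/write calls through its format name.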
In this era of ever-growing data, the need to analyze it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, Storm, etc. Spark, however, is unique in providing both batch and streaming capabilities, making it a preferred choice for lightning-fast Big Data analysis platforms.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
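A small sketch of the lineage idea: each transformation records how its output derives from its parent, so a lost partition is recomputed from that recipe rather than restored from a replica.

import org.apache.spark.sql.SparkSession

object Lineage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Each step below adds a node to the lineage graph.
    val words  = sc.parallelize(Seq("spark", "grid", "spark"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    counts.collect().foreach(println)
    // toDebugString prints the lineage Spark would replay after a failure.
    println(counts.toDebugString)

    spark.stop()
  }
}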
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Microsoft Project Olympus AI Accelerator Chassis (HGX-1) – inside-BigData.com
In this video from the Open Compute Summit, Siamak Tavallaei from Microsoft presents an overview of the Microsoft Project Olympus AI Accelerator Chassis, also known as the HGX-1.
Watch the presentation video: http://wp.me/p3RLHQ-guX
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos. This talk covers a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
Sharding is a technique for partitioning and distributing data across multiple servers to enable scaling to large data volumes and workloads. It involves defining a shard key to partition data into chunks that are distributed across shards. The document discusses different types of sharding strategies like range, hash, and tag-aware sharding and how they apply to different use cases around scale, geo-distribution, and hardware optimization. It also covers best practices for building a sharded cluster like pre-splitting data, capacity planning, and using tools like MongoDB Management Service for production operations.
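As a generic illustration of the routing idea (plain Scala, not MongoDB's actual implementation): hash sharding derives the owning shard from a hash of the shard key, while range sharding compares the key against chunk boundaries.

object ShardRouting {
  val shards = Vector("shard0", "shard1", "shard2")

  // Hash sharding: spreads writes evenly, but loses key ordering.
  def hashShard(shardKey: String): String =
    shards(math.floorMod(shardKey.hashCode, shards.size))

  // Range sharding: chunks are [min, "g"), ["g", "p"), ["p", max);
  // ordered keys keep range queries on few shards.
  val boundaries = Vector("g", "p")
  def rangeShard(shardKey: String): String =
    boundaries.indexWhere(shardKey < _) match {
      case -1 => shards.last
      case i  => shards(i)
    }

  def main(args: Array[String]): Unit = {
    println(hashShard("user42"))
    println(rangeShard("mary")) // falls in ["g", "p") -> shard1
  }
}

Hash keys spread load evenly; range keys keep related documents together, which is the trade-off the strategies above navigate.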
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming – Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
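A minimal D-Streams sketch (the socket source, host, and port are placeholders) with the batch interval set to the 500 ms lower end mentioned above:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object MicroBatches {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstreams").setMaster("local[2]")
    // Every 500 ms, arriving records are grouped into one small RDD.
    val ssc = new StreamingContext(conf, Milliseconds(500))

    // Hypothetical source: log lines arriving on a local socket.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val errors = lines.filter(_.contains("ERROR")).count()
    errors.print()

    ssc.start()
    ssc.awaitTermination()
  }
}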
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Making Hadoop Realtime by Dr. William Bain of ScaleOut Software – Data Con LA
Hadoop has been widely embraced for its ability to economically store and analyze large data sets. Using parallel computing techniques like MapReduce, Hadoop can reduce long computation times to hours or minutes. This works well for mining large volumes of historical data stored on disk, but it is not suitable for gaining real-time insights from live operational data. Still, the idea of using Hadoop for real-time data analytics on live data is appealing because it leverages existing programming skills and infrastructure – and the parallel architecture of Hadoop itself. This presentation will describe how real-time analytics using Hadoop can be performed by combining an in-memory data grid (IMDG) with an integrated, stand-alone Hadoop MapReduce execution engine. This new technology delivers fast results for live data and also accelerates the analysis of large, static data sets.
L'architettura di Classe Enterprise di Nuova Generazione ("The Next-Generation Enterprise-Class Architecture") – MongoDB
This document discusses using MongoDB as part of an enterprise data management architecture. It begins by describing the rise of data lakes to manage growing and diverse data volumes. Traditional EDWs struggle with this new data variety and volume. The document then provides an overview of MongoDB's features like flexible schemas, secondary indexes, and aggregation capabilities that make it suitable for building different layers of an EDM pipeline for tasks like raw data storage, transformation, analysis, and serving data to downstream systems. Example use cases are presented for building a single customer view and for replacing Oracle with MongoDB.
Managing data analytics in a hybrid cloud – Karan Singh
Managing Data Analytics in a Hybrid Cloud discusses challenges with traditional analytics approaches and proposes using shared data lakes with dynamic compute clusters. Common challenges include explosive analytics team growth leading to resource contention, and duplicating large datasets for each cluster. The proposed approach uses shared object storage to hold unified datasets accessed by multiple ephemeral analytics clusters provisioned on-demand. This allows teams independent resources while avoiding duplicate storage costs and improving agility. The document outlines example architectures and benefits of this shared data lake approach when implemented on a private or public cloud.
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x – Shradha Ambekar
This document summarizes the profile of Shradha Ambekar, a software engineer at Intuit. She will be speaking at Strata New York 2019 about solving data pipeline mysteries. She is also the technical lead for Intuit's Real-Time Analytics and Lineage Framework and contributes to the spark-cassandra-connector project. Her LinkedIn and Twitter profiles are provided.
GOAI: GPU-Accelerated Data Science DataSciCon 2017 – Joshua Patterson
The GPU Open Analytics Initiative, GOAI, is accelerating data science like never before. CPUs are not improving at the same rate as networking and storage, and by leveraging GPUs, data scientists can analyze more data than ever with less hardware. Learn more about how GPUs are accelerating data science (not just deep learning), and how to get started.
This document discusses high performance spatial-temporal trajectory analysis using Spark. It covers the background of analyzing mobile signaling data to enable smarter urban planning. The solution architecture includes data sources, distributed file system, computation engine, and visualization. Technical designs address the big data platform, data governance, algorithm models, and Spark spatial computing. Example scenarios are presented for population heatmaps, commute routes, and office-residence imbalance analysis.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture – DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Neo4j: What's Under the Hood & How Knowing This Can Help You – Neo4j
Neo4j provides a concise summary of how graph databases have evolved and their advantages over traditional databases. Specifically, graph databases can handle billions of connections between data points and enable queries that can traverse thousands of relationships between nodes, providing answers in milliseconds rather than minutes. This level of connected data insight allows for real-time fraud detection, recommendations, knowledge graphs, and other applications that require understanding relationships in large, dynamic datasets.
This document discusses trends in high performance computing (HPC) and big data analytics. It notes that while HPC and big data have different resource needs and programming models traditionally, they are converging as big data workloads require more real-time processing and HPC workloads incorporate more data-driven analytics. The document outlines challenges in both HPC and big data such as system bottlenecks, energy efficiency, and barriers to wider usage. It advocates for more integrated solutions that combine storage, networking, processing and memory to address these challenges.
This document discusses applying Apache Spark to data science challenges in media and entertainment. It introduces Spark as a unifying framework for content personalization using recommendation systems and streaming data, as well as social media analytics using GraphFrames. Specific use cases discussed include content personalization with recommendations, churn analysis, analyzing social networks with GraphFrames, sentiment analysis, and viewership prediction using topic modeling. The document also discusses continuous applications with Spark Streaming, and how Spark ML can be used for machine learning workflows and optimization.
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing (see the sketch after this list)
• DataSource v2 APIs
• Structured Streaming v2 APIs
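For the continuous processing item above, a small sketch of what the new mode looks like in user code (using the built-in rate source for testing; whether a query can run continuously depends on its operators and sink):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ContinuousDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("continuous").master("local[*]").getOrCreate()

    // The "rate" source emits a timestamp/value stream for experiments.
    val stream = spark.readStream.format("rate").load()

    // Trigger.Continuous switches from micro-batches to continuous execution;
    // the interval is the checkpoint frequency, not a batch size.
    val query = stream.writeStream
      .format("console")
      .trigger(Trigger.Continuous("1 second"))
      .start()

    query.awaitTermination()
  }
}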
Spark is a general purpose computational framework that provides more flexibility than MapReduce. It leverages distributed memory and uses directed acyclic graphs for data parallel computations while retaining MapReduce properties like scalability, fault tolerance, and data locality. Cloudera has embraced Spark and is working to integrate it into their Hadoop ecosystem through projects like Hive on Spark and optimizations in Spark Core, MLlib, and Spark Streaming. Cloudera positions Spark as the future general purpose framework for Hadoop, while other specialized frameworks may still be needed for tasks like SQL, search, and graphs.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... – Databricks
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
The document discusses graph data science and Neo4j's Graph Data Science (GDS) framework. GDS allows running graph algorithms and machine learning models at scale on large graph datasets. It discusses key aspects of GDS including architecture, data import, algorithm selection, and case studies of customers using GDS on graphs with billions of nodes and relationships. GDS runs on dedicated instances and supports features like enterprise graph compression, unlimited parallelization, and named graphs to optimize performance on large datasets.
Real-time analysis using an in-memory data grid - Cloud Expo 2013 – ScaleOut Software
ScaleOut technical session at Cloud Expo 2013 in NY. Covers the use of in-memory data grids for real-time analysis of fast-changing data. Includes a financial services example.
If you're like most of the world, you're on an aggressive race to implement machine learning applications and on a path to get to deep learning. If you can give better service at a lower cost, you will be the winners in 2030. But infrastructure is a key challenge to getting there. What does the technology infrastructure look like over the next decade as you move from Petabytes to Exabytes? How are you budgeting for more colossal data growth over the next decade? How do your data scientists share data today and will it scale for 5-10 years? Do you have the appropriate security, governance, back-up and archiving processes in place? This session will address these issues and discuss strategies for customers as they ramp up their AI journey with a long term view.
2. 2
About me
• Vice President, Products and Strategy @ GigaSpaces
• (ex) Director of Solutions Architecture
• Blogging at http://blog.gigaspaces.com
• @ahodroj
• Email: ali@gigaspaces.com
• Slides at http://slideshare.com/ahodroj
4. 4
Do we need to bridge online transaction processing with real-time operational intelligence?
5. 5
Modern applications: the line is blurred between…
• Transactional: essential to operate the business
• Analytical: turning data into value: insights, diagnosis, decision making
14. In-Memory Computing 101
Three options: Distributed Cache (partitioned cache nodes), In-Memory Data Grid (scale-out system of record), In-Memory Database (scale-up system of record).
Distributed Cache:
• Increased capacity
• No support for write-heavy scenarios
• Limited to ID-based reads
• Reads are the only low-latency path
15. In-Memory Computing 101
In-Memory Data Grid (scale-out system of record):
• Heavy read/write – sharded/partitioned architecture
• Horizontally scalable on commodity hardware (or cloud)
• Serves as system of record with querying & transaction semantics
• Requires modifying your application's data access layer
16. In-Memory Computing 101
In-Memory Database (scale-up system of record):
• Read/write scalability
• Drop-in SQL database replacement
• Often lacks horizontal scalability (joins)
• Requires replacing your database
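To make the "limited to ID-based reads" point concrete, here is a plain-Scala illustration (not any vendor's API): a partitioned cache behaves like a distributed key-value map, so anything beyond get-by-key degenerates into a scan, which is exactly what a data grid's query semantics and indexes are there to avoid.

import scala.collection.concurrent.TrieMap

case class Order(id: String, city: String, amount: Double)

object CacheVsGrid {
  // Stand-in for a partitioned cache: strictly key -> value.
  val cache = TrieMap[String, Order]()

  def main(args: Array[String]): Unit = {
    cache.put("o1", Order("o1", "NY", 120.0))
    cache.put("o2", Order("o2", "SF", 80.0))

    // Cache-style access: fast, but only by primary key.
    val byId = cache.get("o1")

    // "Query-style" access over a plain cache is a full scan; a grid
    // would index city and evaluate the filter inside each partition.
    val nyTotal = cache.values.filter(_.city == "NY").map(_.amount).sum

    println(s"$byId, NY total = $nyTotal")
  }
}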
30. 30
Approach: just an IMDB thing… shove it all in one "Big Iron"?
Challenge:
● Nope: your data sources and applications are often distributed.
● In-memory or not, these databases aren't built for horizontal scale-out.
31. 31
Approach: one large In-Memory Data Grid to rule them all?
Challenge:
● Not when your apps require polyglot analytics
● Unless you want to write ML algorithms, MDX engines, etc. from scratch
32. 32
What we needed:
• A low-latency, scale-out in-memory data grid
• A large-scale distributed analytics framework
• Maximize data-analytics locality
• Minimize application latency
33. 33
Our approach to HTAP: a low-latency, scale-out in-memory data grid + a large-scale distributed analytics framework
41. 41
Data Grid RDD: resilient distributed dataset
• List of parent RDDs – empty
• An array of partitions the dataset is divided into – an IMDG distributed query returns the partitions and their hosts
• A compute function that runs on each partition – an iterator over that partition's portion of the data
• Optional preferred locations, i.e. hosts for a partition where the data will be loaded – the hosts from the distributed query
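The four ingredients above map directly onto Spark's RDD extension points. Below is a simplified, hypothetical skeleton (not the actual GigaSpaces implementation; gridLookup and readGridPartition stand in for calls into the grid's client API):

import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One Spark partition per grid partition, remembering the grid host.
case class GridPartition(index: Int, host: String) extends Partition

class DataGridRDD[T: ClassTag](
    sc: SparkContext,
    gridLookup: () => Seq[(Int, String)],         // partition id -> host
    readGridPartition: Int => Iterator[T])        // data for one partition
  extends RDD[T](sc, Nil) {                       // Nil: no parent RDDs

  // Partitions come from a distributed query against the grid.
  override protected def getPartitions: Array[Partition] =
    gridLookup().map { case (id, host) => GridPartition(id, host) }.toArray

  // Compute: iterate over the slice of grid data backing this partition.
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    readGridPartition(split.index)

  // Preferred location: the node hosting the grid partition (data locality).
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[GridPartition].host)
}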
42. 42
Data Grid RDD: one-to-one partition mapping
[Diagram: on each of three nodes, a Spark executor holds Spark Partition #N with a direct connection to Grid Partition #N on the same node.]
Simple, but not enough parallelism for Spark.
44. 44
Grid DataFrames: predicate pushdown & column pruning
SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012
The aggregation runs in Spark; the filtering and column pruning run in the data grid.
Spark SQL architecture (implementing the DataSource API):
• Pushing down predicates to the data grid
• Leveraging indexes
• Transparent to the user
• Enabling support for other languages - Python/R
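On the DataFrame side, the Data Source API (the Spark 1.x-era sources API) exposes exactly these hooks. A sketch of the pushdown contract, with gridScan standing in for a hypothetical call into the grid:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Spark calls buildScan with only the columns the query needs (pruning)
// and the WHERE predicates it is able to push down (filters).
class GridRelation(
    override val sqlContext: SQLContext,
    override val schema: StructType,
    gridScan: (Array[String], Array[Filter]) => RDD[Row]) // hypothetical grid call
  extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] =
    // The grid evaluates the filters (ideally against its indexes) and
    // returns only the requested columns; Spark then runs the aggregation.
    gridScan(requiredColumns, filters)
}

For the query above, Spark would invoke buildScan with requiredColumns = Array("amount") and filters containing EqualTo("city", "NY") and GreaterThan("year", 2012); the SUM itself still runs in Spark.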
54. 54
In-Process HTAP
• Read any POJO, JSON document, or transaction as a DataFrame or RDD
• Web services/apps can read any DataFrame as a POJO
• True closed-loop analytics data pipeline

@SpaceClass
public class Product {
    private String name;
    private String brand;
    private Integer quantity;
    // …
}
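A usage sketch of that closed loop (the "grid" format name and the collection option are illustrative assumptions, not the actual InsightEdge API):

import org.apache.spark.sql.SparkSession

object GridDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("grid-df").master("local[*]").getOrCreate()

    // Hypothetical: a data source registered under the short name "grid",
    // pointed at the space holding the Product objects above.
    val products = spark.read
      .format("grid")                   // assumed format name
      .option("collection", "Product")  // assumed option
      .load()

    // Writing results back to the grid closes the loop: web services
    // reading the same space see them as plain POJOs.
    products.groupBy("brand").sum("quantity")
      .write.format("grid").option("collection", "BrandTotals").save()

    spark.stop()
  }
}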
55. 55
Point-of-Decision HTAP: XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication.
[Diagram: a Transactions grid and an Analytics grid linked by real-time replication; the analytics side runs scoring models, triggers actions, and emits events.]
56. 56
Case Study: Fleet Geo-analytics
Challenge:
• Stream data from thousands of taxis
• Actively monitor and generate real-time notifications
• Real-time route optimization and geo-fencing
Solution:
• Leverage a unified in-memory data fabric as middleware for geo-spatial analytics
• Elastically scale stream processing and transactional apps together
• Location-based tracking, geo-fencing
[Diagram: edge components feeding data sources into the platform.]
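To illustrate the geo-fencing step, here is a minimal, self-contained sketch (illustrative only, not the production pipeline) that flags taxis outside a circular fence using the haversine distance:

object GeoFence {
  case class Ping(taxiId: String, lat: Double, lon: Double)

  // Great-circle distance in kilometers (haversine formula).
  def distanceKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val earthRadiusKm = 6371.0
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * earthRadiusKm * math.asin(math.sqrt(a))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical fence: 2 km around Times Square.
    val (fenceLat, fenceLon, radiusKm) = (40.7580, -73.9855, 2.0)
    val pings = Seq(Ping("taxi-7", 40.7614, -73.9776), Ping("taxi-9", 40.6413, -73.7781))

    pings.foreach { p =>
      if (distanceKm(p.lat, p.lon, fenceLat, fenceLon) > radiusKm)
        println(s"ALERT: ${p.taxiId} left the fence") // would raise a grid event
    }
  }
}

In an architecture like the one described, such a check could run as an event handler co-located with the grid partition holding the taxi's state, so notifications fire without an extra network hop.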