This document discusses Microsoft's support for open source tools and NoSQL databases on the Azure platform. It provides examples of various open source technologies supported, such as Linux, Hadoop, Java, PHP, and Node.js. It then discusses how Azure provides solutions for large-scale social networking and e-commerce applications through patterns like data sharding, caching layers, and messaging. Azure aims to provide high availability, elastic scale, and support for flexible data models and processing paradigms to meet the needs of NoSQL and big data applications.
This document presents an introduction to NoSQL databases. It begins with an overview comparing SQL and NoSQL databases, describing the architecture of NoSQL databases. Examples of different types of NoSQL databases are provided, including key-value stores, column family stores, document databases and graph databases. MapReduce programming is also introduced. Popular NoSQL databases like Cassandra, MongoDB, HBase, and CouchDB are described. The document concludes that NoSQL is well-suited for large, highly distributed data problems.
1. Introduction to Apache Accumulo provides an overview of the key-value store Accumulo.
2. Accumulo is a sorted, distributed key-value store that enables interactive access to trillions of records across hundreds to thousands of servers. It provides cell-based access control and customizable server-side processing.
3. The document discusses Accumulo's history and architecture, including how it uses Hadoop for storage and Zookeeper for coordination. It also covers Accumulo's features like iterators for server-side programming and cell-level access control labels.
Kudu is a storage engine for Hadoop designed to address gaps in Hadoop's ability to handle workloads that require both high-throughput data ingestion and low-latency random access. It is a columnar storage engine that uses a log-structured merge tree to store data and provides APIs for NoSQL and SQL access. Kudu aims to provide high performance for both scans and random access through its columnar design and tablet architecture that partitions data across servers.
You can watch the replay for this Geek Sync webcast, Successfully Migrating Existing Databases to Azure SQL Database, on the IDERA Resource Center, http://ow.ly/k4p050A4rBA.
First impressions have long-lasting effects. When dealing with an architecture change like migrating to Azure SQL Database, the last thing you want to do is leave a bad first impression by having an unsuccessful migration. In this session, you will learn the difference between Azure SQL Database, SQL Managed Instances, and Elastic Pools, and how to use tools to test migrations for compatibility issues before you start the migration process. You will learn how to successfully migrate your database schema and data to the cloud. Finally, you will learn how to determine which performance tier is a good starting point for your existing workload(s) and how to monitor your workload over time to make sure your users have a great experience while you save as much money as possible.
Speaker: John Sterrett is an MCSE: Data Platform, Principal Consultant and the Founder of Procure SQL LLC. John has presented at many community events, including Microsoft Ignite, PASS Member Summit, SQLRally, 24 Hours of PASS, SQLSaturdays, PASS Chapters, and Virtual Chapter meetings. John is a leader of the Austin SQL Server User Group and the founder of the HADR Virtual Chapter.
Jeremy Beard, a senior solutions architect at Cloudera, introduces Kudu, a new column-oriented storage system for Apache Hadoop designed for fast analytics on fast changing data. Kudu is meant to fill gaps in HDFS and HBase by providing efficient scanning, finding and writing capabilities simultaneously. It uses a relational data model with ACID transactions and integrates with common Hadoop tools like Impala, Spark and MapReduce. Kudu aims to simplify real-time analytics use cases by allowing data to be directly updated without complex ETL processes.
NoSQL Databases: An Introduction and Comparison between Dynamo, MongoDB and C... | Vivek Adithya Mohankumar
The research paper covers the consolidated interpretation of NoSQL systems, on the basis of performance, scalability and data aggregation, and compares the types of NoSQL databases based on their implementation and maintenance.
Haytham ElFadeel presented on next-generation storage systems and key-value stores. He began with an overview of scalable systems and the need for both vertical and horizontal scalability. He discussed the limitations of traditional databases in scaling, including complexity, wasted features, and multi-step query processing. Key-value stores were presented as an alternative, offering simple interfaces and designs optimized for scaling across hundreds of machines. Performance comparisons showed key-value stores significantly outperforming databases. Systems discussed included Amazon Dynamo, Facebook Cassandra, and Redis.
Soft-Shake 2013: Enabling Realtime Queries to End Users | Benoit Perroud
Since it became an Apache Top Level Project in early 2008, Hadoop has established itself as the de-facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing. Running data analysis and crunching petabytes of data is no longer fiction. But the MapReduce framework does have two major drawbacks: query latency and data freshness.
At the same time, businesses have started to exchange more and more data through REST APIs, leveraging HTTP verbs (GET, POST, PUT, DELETE) and URIs (for instance http://company/api/v2/domain/identifier), pushing the need to read data in a random access style – from simple key/value to complex queries.
Enhancing the BigData stack with real time search capabilities is the next natural step for the Hadoop ecosystem, because the MapReduce framework was not designed with synchronous processing in mind.
There is a lot of traction today in this area and this talk will try to answer the question of how to fill in this gap with specific open-source components, ultimately building a dedicated platform that will enable real-time queries on Internet-scale data sets. After discussing the evolution of the deployments of common Hadoop platform, a hybrid approach called lambda architecture will be proposed. It will be demonstrated with concrete examples, discussing which technology could be a good match, and how they would interact together.
This document provides an agenda and background information for a workshop on Big Data and NoSQL in Microsoft-Land presented by Andrew Brust and Lynn Langit at SQL Server Live! Orlando 2012. The agenda includes an overview of Big Data, NoSQL, and their intersection, then drilldowns on Big Data technologies like Hadoop, Hive, and Microsoft HDInsight, and NoSQL databases. Biographies of the presenters are also provided.
Altoros using no sql databases for interactive_applications | Jeff Harris
This document compares the performance of Cassandra, MongoDB, and Couchbase for interactive applications. Benchmarking showed Couchbase had the lowest latencies and highest throughput. Cassandra demonstrated better performance than MongoDB. While MongoDB had the lowest throughput, Cassandra and Couchbase provided better scalability and flexibility in resizing clusters. The analysis concludes Couchbase is well-suited for interactive applications due to its in-memory caching and fine-grained locking, which enable high performance for reads and writes.
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat... | Qian Lin
This document summarizes a survey of advanced non-relational database systems, their approaches, applications, and comparison to relational database management systems (RDBMS). It outlines the problem of scaling to meet new web-scale demands, describes how non-relational databases provide a solution by sacrificing consistency for availability and partition tolerance. Examples of non-relational databases are provided, including their data models, APIs, optimizations, and benefits compared to RDBMS such as improved scalability and fault tolerance.
HBaseCon 2013: Compaction Improvements in Apache HBase | Cloudera, Inc.
This document discusses improvements to compaction in Apache HBase. It begins with an overview of what compactions are and how they improve read performance in HBase. It then describes the default compaction algorithm and improvements made, including exploring selection and off-peak compactions. The document also covers making compactions more pluggable and enabling tuning on a per-table/column family basis. Finally, it proposes algorithms for different scenarios, such as level and stripe compactions, to improve compaction performance.
This document discusses the limitations of relational databases for modern applications and real-time architectures. It describes how NoSQL databases like Aerospike can provide better performance and scalability. Specific examples are given of how Aerospike has been used to power applications in domains like advertising technology, social media, travel portals, and financial services that require high throughput, low latency access to large datasets.
This document provides an introduction and overview of Couchbase Server, a NoSQL document database. It describes Couchbase Server as the leading open source project focused on distributed database technology. It outlines key features such as easy scalability, always-on availability, flexible data modeling using JSON documents, and core features including clustering, replication, indexing and querying. The document also provides examples of basic write, read and update operations on a single node and cluster, adding nodes, handling node failures, indexing and querying capabilities, and cross data center replication.
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige... | Dataconomy Media
Kudu is a new column-oriented storage system for Apache Hadoop that is designed to enable fast analytics on fast, changing data. It aims to provide high throughput for large scans and low latency for random access simultaneously. Kudu is optimized for workloads that involve high volumes of incoming data and require real-time analytics through inserts, updates, scans and lookups. It simplifies architectures for real-time analytics compared to hybrid approaches using HBase and HDFS. Currently, Kudu is in beta and provides APIs for Java, C++, Impala, MapReduce and Spark.
This document discusses SQL and NoSQL approaches to scaling databases. It describes how social networks and other large-scale websites use techniques like sharding and messaging to partition data across many databases. It also discusses how SQL Server is adopting NoSQL paradigms like flexible schemas and federated sharding to provide scalability. The document aims to educate about scaling databases and how SQL Server is evolving to support both SQL and NoSQL approaches.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld
VMworld 2013
Michael Corey, Ntirety, Inc
Jeff Szastak, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
This document provides an overview and comparison of relational and NoSQL databases. Relational databases use SQL and have strict schemas while NoSQL databases are schema-less and include document, key-value, wide-column, and graph models. NoSQL databases provide unlimited horizontal scaling, very fast performance that does not deteriorate with growth, and flexible queries using map-reduce. Popular NoSQL databases include MongoDB, Cassandra, HBase, and Redis.
NoSQL is not a buzzword anymore. The array of non-relational technologies has found wide-scale adoption even in non-Internet-scale focus areas. With the advent of the Cloud, the churn has increased even more, yet there is no crystal clear guidance on adoption techniques and architectural choices surrounding the plethora of options available. This session initiates you into the whys & wherefores, architectural patterns, caveats and techniques that will augment your decision making process & boost your perception of architecting scalable, fault-tolerant & distributed solutions.
Offline processing with Hadoop allows for scalable, simplified batch processing of large datasets across distributed systems. It enables increased innovation by supporting complex analytics over large data sets without strict schemas. Hadoop adoption is moving beyond legacy roles to focus on data processing and value creation through scalable and customizable systems like Cascading.
Prague data management meetup 2018-03-27 | Martin Bém
This document discusses different data types and data models. It begins by describing unstructured, semi-structured, and structured data. It then discusses relational and non-relational data models. The document notes that big data can include any of these data types and models. It provides an overview of Microsoft's data management and analytics platform and tools for working with structured, semi-structured, and unstructured data at varying scales. These include offerings like SQL Server, Azure SQL Database, Azure Data Lake Store, Azure Data Lake Analytics, HDInsight and Azure Data Warehouse.
Spark Summit EU talk by Shay Nativ and Dvir Volk | Spark Summit
This document discusses accelerating Spark ML models with Redis modules. It provides an overview of Redis and Spark, and describes how Redis modules can add new capabilities like secondary indexes, time series, and machine learning. The document demonstrates a Redis ML module that implements random forests and decision trees. It shows how Spark ML models can be trained, saved to Redis for low-latency serving, and evaluated directly in Redis for improved performance over Spark alone.
This document summarizes a presentation about using MySQL and the NDB storage engine to build a globally distributed in-memory database system on AWS. It proposes using MySQL/NDB clusters tiled across AWS availability zones to provide high availability and performance at a large scale. Key challenges discussed include managing data consistency across wide geographical distances and dealing with limitations of AWS like network performance and lack of global load balancing. Lessons learned are that NDB can successfully compete with NoSQL for most use cases by providing ACID compliance without sacrificing availability or performance.
The document discusses SQL versus NoSQL databases. It provides background on SQL databases and their advantages, then explains why some large tech companies have adopted NoSQL databases instead. Specifically, it describes how companies like Amazon, Facebook, and Google have such massive amounts of data that traditional SQL databases cannot adequately handle the scale, performance, and flexibility needs. It then summarizes some popular NoSQL databases like Cassandra, Hadoop, MongoDB that were developed to solve the challenges of scaling to big data workloads.
NoSQL and SQL databases can work together to handle real-time big data needs. Apache Drill is an open source tool that allows interactive analysis of big data using standard SQL queries across NoSQL, Hadoop, and relational data sources. It provides low-latency queries, full ANSI SQL support, and flexibility to handle rapidly evolving schemas and data in different systems. By enabling analysis of all data together using a common interface, it helps tackle challenges of combining operational and decision support systems on big, diverse datasets.
A brave new world in mutable big data relational storage (Strata NYC 2017) | Todd Lipcon
The ever-increasing interest in running fast analytic scans on constantly updating data is stretching the capabilities of HDFS and NoSQL storage. Users want the fast online updates and serving of real-time data that NoSQL offers, as well as the fast scans, analytics, and processing of HDFS. Additionally, users are demanding that big data storage systems integrate natively with their existing BI and analytic technology investments, which typically use SQL as the standard query language of choice. This demand has led big data back to a familiar friend: relationally structured data storage systems.
Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu, which provide a scalable relational solution for users who have too much data for a legacy high-performance analytic system. Todd explains how to address use cases that fall between HDFS and NoSQL with technologies like Apache Kudu or Google Cloud Spanner and how the combination of relational data models, SQL query support, and native API-based access enables the next generation of big data applications. Along the way, he also covers suggested architectures, the performance characteristics of Kudu and Spanner, and the deployment flexibility each option provides.
Apache Drill is an open source engine for interactive analysis of large-scale datasets. It provides low-latency queries using standard SQL and supports nested and hierarchical data. Drill is inspired by Google's Dremel system and provides an alternative to traditional batch processing systems like MapReduce for interactive analysis of big data.
This document provides an overview of NoSQL databases, including why they are used, common types, and how they work. The key points are:
1) SQL databases do not scale well for large amounts of distributed data, while NoSQL databases are designed for horizontal scaling across servers and partitions.
2) Common types of NoSQL databases include document, key-value, graph, and wide-column stores, each with different data models and query approaches.
3) NoSQL databases sacrifice consistency guarantees and complex queries for horizontal scalability and high availability. Eventual consistency is common, with different consistency models for different use cases.
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014 | cdmaxime
Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows for interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers benefits like SQL support, improved performance of 3-4x up to 90x faster than MapReduce, and flexibility to query existing Hadoop data without needing to migrate or duplicate it. The latest release of Impala 2.0 includes new features like window functions, subqueries, and spilling joins and aggregations to disk when memory is exhausted.
The document provides an introduction and agenda for an HBase presentation. It begins with an overview of HBase and discusses why relational databases are not scalable for big data through examples of a growing website. It then introduces concepts of HBase including its column-oriented design and architecture. The document concludes with hands-on examples of installing HBase and performing basic operations through the HBase shell.
SQLCAT: Tier-1 BI in the World of Big Data | Denny Lee
This document summarizes a presentation on tier-1 business intelligence (BI) in the world of big data. The presentation will cover Microsoft's BI capabilities at large scales, big data workloads from Yahoo and investment banks, Hadoop and the MapReduce framework, and extracting data out of big data systems into BI tools. It also shares a case study on Yahoo's advertising analytics platform that processes billions of rows daily from terabytes of data.
11. + SAMBA
“A few years back, a patch submission from coders at Microsoft would have been amazing to the point of unthinkable, but the battles are mostly over and times have changed.”
17. Online Business Application
Attract Individual Consumers:
- Provide interesting service
- Provide mobility
- Provide social
Monetize the Social:
- Improve individual experience
- Re-sell Aggregate Data (e.g., Advertisers)
Monetize Individual:
- Upsell service
- VIP
- Speed
- Extra Capabilities
18. Social NetworkING: the Business Problem
• 100s of millions of users
• Terabytes to petabytes of data
• Required (eventual) data consistency across users
19. Solution
• Shard/Partition user data across hundreds to thousands of SQL Databases
• Propagate data changes from one DB to other DBs using reliable, async Message Service
• Provide a caching layer for performance
• And also used for
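The three parts of this solution can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory dicts stand in for SQL databases, the deque stands in for the reliable message service, and every name here (`NUM_SHARDS`, `shard_for`, `post_update`, `deliver_pending`, `read_feed`) is hypothetical rather than taken from the deck.

```python
import hashlib
from collections import deque

NUM_SHARDS = 4  # hypothetical; the slide speaks of hundreds to thousands

# Each dict stands in for one SQL database holding a slice of the user data.
shards = [dict() for _ in range(NUM_SHARDS)]
outbox = deque()  # stand-in for a reliable, asynchronous message service
cache = {}        # stand-in for the caching layer


def shard_for(user_id: str) -> int:
    """Map a user id to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def post_update(author: str, followers: list, text: str) -> None:
    """Write to the author's own shard, then queue a copy for each follower."""
    shards[shard_for(author)].setdefault(author, []).append((author, text))
    cache.pop(author, None)  # invalidate the author's cached feed
    for follower in followers:
        outbox.append((follower, (author, text)))


def deliver_pending() -> int:
    """Drain the queue, applying each change on the follower's shard."""
    delivered = 0
    while outbox:
        user, item = outbox.popleft()
        shards[shard_for(user)].setdefault(user, []).append(item)
        cache.pop(user, None)  # invalidate so the next read sees the change
        delivered += 1
    return delivered


def read_feed(user: str) -> list:
    """Cache-aside read: serve from the cache, fall back to the user's shard."""
    if user not in cache:
        cache[user] = list(shards[shard_for(user)].get(user, []))
    return cache[user]
```

Until `deliver_pending()` runs, a follower's feed does not yet contain the new post, which is the "(eventual)" consistency named on slide 18.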
20. Many LARGE SCALE customers using similar patterns
• Patterns
• Sharding and reliable messaging
• Sharding and fan-out query layer
• Caching layer
• Customer Examples
• Social Networking: Facebook, MySpace, etc
• Online electronic stores (cannot give names)
• Travel reservation systems (e.g. Choice International)
• MSN Casual Gaming
• etc.
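A fan-out query layer sends the same query to every shard and merges the partial answers. The sketch below assumes a hypothetical leaderboard whose rows are spread over three in-memory lists; `SHARDS`, `query_shard`, and `fan_out_top` are illustrative names, not part of any product mentioned in the deck.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for one database's (score, player) rows.
SHARDS = [
    [(120, "ann"), (75, "bo")],
    [(300, "cy")],
    [(90, "di"), (210, "ed")],
]


def query_shard(rows, limit):
    """Per-shard work: return that shard's own top `limit` rows."""
    return heapq.nlargest(limit, rows)


def fan_out_top(limit):
    """Run the query on every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda rows: query_shard(rows, limit), SHARDS))
    merged = [row for partial in partials for row in partial]
    return heapq.nlargest(limit, merged)
```

Each shard returns at most `limit` rows, so the merge step handles at most `limit` times the number of shards, regardless of how much data each shard holds.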
21. • Require high availability
• Be able to scale out:
• Be able to quickly grow and change:
Move better support for these patterns into the Data Platform!
22. • NoSQL = operational and developer agility at low CapEx and OpEx!
• Low Cost
• Processing Paradigms
• Data Model Paradigms
• Range from devices, over OLTP Web 2.0 applications to BigData Analytics
23. Data Model | Example Stores (apologies to the ones I did not list)
Simple Key-Value Pairs | Memcache, Redis, Dynamo, Voldemort, LevelDB, Azure Caching
Wide Sparse Column Sets | HyperTable, Big Table, Cassandra, HBase, Hyperbase, Amazon DynamoDB, Windows Azure Tables, SQL Server/Azure Sparse columns
BLOBs | Amazon S3, Oracle Berkeley NoSQL, Windows Azure Blob Store, SQL Server RBS/FileTable
JSON Documents | MongoDB, CouchBase, Riak, RavenDB
Graph | Neo4j, GraphDB, HypergraphDB, Stig, Intellidimension
Objects and XML Documents | Versant, Oracle Berkeley NoSQL, MarkLogic, existDB, EMC HiveDB, SQL Server/Azure, Oracle, IBM DB2
Extended Relational | Oracle, EMC SQLFire, IBM DB2, MySQL, Postgres, SQL Server/Azure
24. • You want:
• You can only get 2 of 3 (CAP Theorem)
• In Brave New World:
25. • Performance and Elastic Scale on Demand
• Automate management lifecycle (or fail)
• Simple deployment lifecycle
• No DB or OS Admin telling me what to do
26. • Code First and revise quickly
• Application-model first (before database)
• Flexible open data models
• You don’t know exactly what you are looking for
• Lower Pain of adoption and maintenance
• No DB or OS Admin telling me what to do
27. • Low CapEx, Low OpEx
• Built-in tunable High-Availability
• Data scale-out (Sharding)
• Processing scale-out (Map-Reduce, Fan-Out, tunable consistency)
• Flexible Data Models
• Integrate with BigData Analytics (e.g., Hadoop)
Many Relational Database Systems are incorporating these learnings!
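"Tunable consistency" can take several forms. One common form, used here purely as an assumed illustration and not something the slide specifies, is quorum tuning: with n replicas, a write waits for w acknowledgements and a read consults r replicas. The brute-force check below confirms that every read set overlaps every write set exactly when r + w > n; `quorums_overlap` is a hypothetical helper name.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of r replicas shares a member with every set of w replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def matches_rule(max_n: int) -> bool:
    """Check the r + w > n rule against brute force for all small configurations."""
    return all(
        quorums_overlap(n, w, r) == (r + w > n)
        for n in range(1, max_n + 1)
        for w in range(1, n + 1)
        for r in range(1, n + 1)
    )
```

Lowering r or w below the overlap threshold trades read freshness for lower latency and higher availability, which is the knob that "tunable" refers to in this model.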
28. • Provides Data Partitioning/Sharding at the Data Platform
• Enables applications to build elastic scale-out applications
• Provides non-blocking SPLIT/DROP for shards (MERGE to come later)
• Auto-connect to right shard based on sharding key value
• Provides SPLIT resilient query mode
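Routing by sharding key value and splitting a shard can be modelled with a sorted list of range boundaries. The sketch below is a toy model under stated assumptions, not the platform's API: `ShardMap`, `lookup`, and `split` are hypothetical names, keys are non-negative integers, and no data movement or non-blocking behaviour is simulated.

```python
import bisect


class ShardMap:
    """Range-based shard map: shard i owns keys from lows[i] up to lows[i + 1]."""

    def __init__(self):
        self.lows = [0]           # lower bound of each range, kept sorted
        self.names = ["shard-0"]  # hypothetical shard names
        self._next_id = 1

    def lookup(self, key: int) -> str:
        """Find the shard whose range contains the sharding key value."""
        if key < self.lows[0]:
            raise KeyError(key)
        return self.names[bisect.bisect_right(self.lows, key) - 1]

    def split(self, at: int) -> str:
        """Split the range containing `at`; keys >= at move to a new shard."""
        if at <= self.lows[0] or at in self.lows:
            raise ValueError("split point must fall strictly inside a range")
        pos = bisect.bisect_right(self.lows, at)
        name = f"shard-{self._next_id}"
        self._next_id += 1
        self.lows.insert(pos, at)
        self.names.insert(pos, name)
        return name
```

Because callers only ever ask `lookup` for a key's shard, a split changes the map without changing application code, which is the point of putting sharding in the data platform.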
29. • Flexible data is good, but:
• Procedural Scale-Out processing is good, but:
• Eventual Consistency is good, but:
• Simple Queries are good, but:
Many NoSQL Database Systems are starting to incorporate these learnings!
30. Online Business Application
Attract Individual Consumers:
- Provide interesting service
- Provide mobility
- Provide social
Monetize the Social:
- Improve individual experience
- Re-sell Aggregate Data (e.g., Advertisers)
Monetize Individual:
- Upsell service
- VIP
- Speed
- Extra Capabilities
31. [Diagram: a SQL or NoSQL store built from primary shards, each with readable replicas, serving three kinds of workload]
OLTP Workloads: Highly Available, High Scale, High Flexibility; mostly touching 1 to low number of shards
Traditional OLAP Workloads: known schema; Data warehouse, “Star joins”
Dynamic OLAP Workloads: 3Vs (Volume, Velocity, Variety); Exploratory; Scale-out queries, often using eventually consistent scale-out frameworks like Hadoop
33. http://www.windowsazure.com
Presentation | Speaker | Date and Time
Do We Have the Tools We Need to Navigate the New World of Data? | Dave Campbell | 2/29 9:00am PST
Onsite Interview * | Tim O’Reilly, Dave Campbell | 2/29 10:15am PST
Unleash Insights on All Data With Microsoft Big Data | Alexander Stojanovic | 2/29 11:30am PST
Office Hours (Q&A session) | Dave Campbell | 2/29 1:30pm PST
Hadoop + Javascript: What We Learned | Asad Khan | 2/29 2:20pm PST
Democratizing BI at Microsoft: 40,000 Users and Counting | Kirkland Barrett | 3/1 10:40am PST
Data Marketplaces For Your Extended Enterprise | Piyush Lumba | 3/1 2:20pm PST
34. • NoSQL and the Windows Azure Platform
http://download.microsoft.com/download/9/E/9/9E9F240D-0EB6-472E-B4DE-6D9FCBB505DD/Windows%20Azure%20No%20SQL%20White%20Paper.pdf
http://blogs.msdn.com/b/cbiyikoglu/archive/2011/03/03/nosql-genes-in-sql-azure-federations.aspx
<choose from slides 3 – 10 as alternative intro pictures>
Timing: 1 minute
Key Points: Microsoft has changed as a company and become more open.
Script: Microsoft has changed as a company and become more open. The old debate – black or white; open source or commercial software; us versus them – is simply no longer relevant. Today, many customers manage mixed IT environments. And they have told us that what matters today is maximizing their existing IT investments while having the freedom to choose new solutions that best support their business goals. To meet these customer needs, Microsoft is committed to openness.
Timing: 2 minutes
Key Points: We do not compete against open source as a category; we increasingly work collaboratively with this community. You may be surprised to learn what Microsoft is doing with open source. More and more, customers, partners and the industry understand that the work we are doing with open source is about helping customers and enabling a rich and robust ecosystem of developers and partners. The following slides will provide some great examples.
Script: You may be surprised to learn what Microsoft is doing with open source. More and more, customers, partners and the industry understand that the work we are doing with open source is about helping customers and enabling a rich and robust ecosystem of developers and partners. We enable open source on our platforms. We recognize that if we’re going to use open source, then we also have to give back, especially if we want open source developers to continue to think of Windows and Windows Phone as platforms for them to develop on. For example, Windows Azure supports a wide range of development languages, including Java, PHP and Node.js, so that developers can build applications using any language, tool, or framework of their choice – including open source. Let’s review the following slides for some more detailed examples.
Timing: 2 minutes
Key Points: Device driver code contributions for Linux enable better performance of Linux when virtualized with Hyper-V. CoApp: are you developing apps for Linux? Why not make them work on Windows and open up more opportunities for your app to get adopted? Windows Azure Virtual Machines enables customers to run both their existing Windows and Linux-based applications in the cloud. Compatible operating systems/images include CentOS, openSUSE, SUSE Linux Enterprise Server, Ubuntu, as well as Windows Server.
Script: You may be surprised to learn what we are doing with Linux. We have learned a lot over the past decade. Embracing Linux on our platforms is a real business for us. For example, we work on a variety of interoperability initiatives with Linux vendors (SUSE, Citrix, RedHat, CentOS) to provide support for Linux as a “first-class guest” on Hyper-V. Another great example is CoApp, which is an open-source package management system for Windows. The goal of the CoApp project is to create a community of developers dedicated to creating a set of tools and processes that enable other open source developers to create and maintain their open source products with Windows as a build target. Further, with Windows Azure Virtual Machines, customers can run both their existing Windows and Linux-based applications in the cloud. Compatible operating systems/images include CentOS, openSUSE, SUSE Linux Enterprise Server, Ubuntu, and Windows Server, further illustrating Microsoft’s commitment to openness for customers and partners.
FYI – Data sources and more information:
Robert McMillan, Wired Enterprise (March 2012): http://www.wired.com/wiredenterprise/2012/03/mr-linux/. Note: the quote is accurate, but the broader article is all about Linus and Linux.
Timing: 2 minutes
Key Points: We’re committed to helping customers manage “big data”, working with the Apache Hadoop community to support Hadoop on Windows Server and Windows Azure. Our Big Data solution is also integrated into the Microsoft BI tools such as SQL Server Analysis Services, Reporting Services and even PowerPivot and Excel. This enables you to do BI on all your data, including data in Hadoop.
Script: Just ten years ago, most business data was locked up behind big applications. We are now entering an era when unlocking this data and its potential to drive new knowledge and insights is becoming a key success factor for many ventures. To embrace this “Big Data revolution”, we’ve launched customer previews of Apache Hadoop-based solutions for Windows Server and Windows Azure, which enable Hadoop apps to be deployed in hours instead of days. The most recent customer previews are called Windows Azure HDInsight and Microsoft HDInsight for Windows Server. Both solutions embrace enterprise-ready Apache Hadoop to enable most any user to begin viewing and truly analyzing Big Data, using such tools as Microsoft Excel, PowerPivot, and SQL Server Analysis Services. Regardless of the size or type of data, or where it’s stored, both HDInsight versions offer simple management via Microsoft System Center 2012, a shared codebase for platform consistency whether on Windows Server or Azure, and 100% compatibility with Hadoop. Customers such as Klout, Webtrends and the University of Dundee have been using the service to glean simple, actionable insights from complex data sets hosted in the cloud.
FYI – Data Sources and more information:
“Opening Doors To Real Big Data Value: Hadoop On Windows Azure And Windows Server” (Oct 2012): http://blogs.technet.com/b/openness/archive/2012/10/24/opening-doors-to-real-big-data-value-hadoop-on-windows-azure-and-windows-server.aspx
“Openness Customer Spotlight: Klout Uses Microsoft BI and Hadoop to Bolster Big Data Insights” (Sept 2012): http://blogs.technet.com/b/openness/archive/2012/09/07/klout-uses-microsoft-bi-and-hadoop-to-bolster-big-data-insights.aspx
“Navigating the New World of Data” (Mar 2012): http://blogs.technet.com/b/openness/archive/2012/03/01/navigating-the-new-world-of-data.aspx
Kurt Mackie, Redmondmag.com quote (Oct 2011): http://redmondmag.com/articles/2011/10/12/hadoop-efforts-announced-at-pass.aspx?admgarea=BDNA
Timing: 2 minutes
Key Points: Great Java experience on Windows Server and Windows Azure. Partners like Gigaspaces are taking advantage of Java support to provide services to customers with existing Java-based enterprise applications. The Windows Azure plug-in for Eclipse helps Eclipse users create and configure deployment packages of their Java applications for the Windows Azure cloud.
Script: Customers and partners are taking advantage of the “first-class” Java experience on Windows Server and Windows Azure. For example, partners like Gigaspaces are now able to take advantage of Java support to provide services to customers with existing Java-based enterprise applications. Microsoft also continues to work on projects that foster interoperability with Java and Windows. For example, the Windows Azure SDK for Java includes a Windows Azure plug-in for Eclipse that provides templates and functionality that allow you to easily create, develop, test, and deploy Windows Azure applications using the Eclipse development environment. It is an Open Source project, whose source code is available under the Apache License 2.0 from the project’s site at http://sourceforge.net/projects/waplugin4ej/.
FYI – Data Sources and more information:
Gigaspaces case study (Feb 2012): http://www.microsoft.com/casestudies/Windows-Azure/Gigaspaces/Solution-Provider-Streamlines-Java-Application-Deployment-in-the-Cloud/400000000081
Timing: 2 minutes
Key Points:
Great example of how far the PHP experience has evolved over the past several years: from no PHP experience on Windows to PHP running extremely well and with high performance on both Windows and Linux.
PHP releases now include support for both Windows and Linux.
Script:
Over the past several years, Microsoft and its partners have worked diligently with the PHP community to improve the experience PHP developers and users have on Windows Server and Windows Azure. Now the PHP community supports Windows right alongside Linux, including in the recent release of PHP 5.4.0. René de Haas, CEO of a Dutch webhosting company called SoHosted, is a partner who has been instrumental in improving the PHP on Windows experience. According to René, “Between 2003 and 2012 we've seen the general opinion about Microsoft, Windows and PHP turn 180 degrees” due to the improvements made.
FYI – Data Sources and more information:
“PHP 5.4 Available in Windows Azure Web Sites” (Nov 2012): http://blogs.technet.com/b/openness/archive/2012/11/27/php-5-4-available-in-windows-azure-web-sites.aspx
“Evolution of PHP on Windows” (Mar 2012), including SoHosted interview: http://blogs.technet.com/b/openness/archive/2012/03/01/evolution-of-php-on-windows.aspx
Timing: 1 minute
Key Points:
The Firefox browser is well supported across cloud services (Office 365, SkyDrive, Bing, Skype).
Microsoft created a Firefox plug-in for Windows Media Player.
Mozilla has acknowledged how Microsoft’s commitment to HTML5 enables this support for Firefox and other modern browsers.
Script:
The Firefox browser is well supported across Microsoft’s cloud services like Office 365, SkyDrive, Bing, and Skype, and Microsoft has also created a Firefox plug-in for Windows Media Player. Those within the Mozilla community have acknowledged how Microsoft’s commitment to HTML5 enables this support for Firefox and other modern browsers.
FYI – Data Sources and more information:
Blizzard quote reference: http://www.theregister.co.uk/2010/06/09/mozilla_man_on_apple_google_and_html5/
Timing: 1 minute
Key Points:
Microsoft has worked with Drupal to improve interoperability, resulting in more choices for users.
Script:
Drupal is a popular open source content management system that powers many of the world's web sites. Microsoft has worked with Drupal to improve interoperability, resulting in more choices for users. The Screen Actors Guild recently migrated their Drupal site to Windows Azure. The SAG Awards, their biggest traffic day of the year, “went off with flying colors.”
FYI – Data Sources and more information:
“Drupal + Windows Azure: A Winning Combination for SAG” (Feb 2012): http://blogs.technet.com/b/openness/archive/2012/02/29/drupal-windows-azure-a-winning-combination-for-sag.aspx
Timing: 2 minutes
Key Points:
Node.js provides an end-to-end JavaScript experience for the development of a whole new class of real-time applications.
With the work that we did to enable Node.js on Windows, not only did we support Windows, but the benchmarks for Linux also improved.
Developers can also implement a Node.js application and deploy it to Windows Azure using Cloud9 IDE.
Script:
Node.js is a platform built on Chrome’s JavaScript runtime for easily building fast, scalable network applications. Microsoft’s support for Node.js on Windows Azure enables a new class of real-time applications. We also released the Windows Azure SDK for Node.js as open source, available on GitHub, and the Windows Azure Development Centers have great Node.js documentation, tutorials, samples and how-to guides to get you started with Node.js on Windows Azure. Also announced recently is support for Cloud9 IDE as a way to create Node.js applications and deploy them to Windows Azure.
FYI – Data Sources and more information:
Scott Fulton, ReadWriteWeb quote (Dec 2011): http://www.readwriteweb.com/cloud/2011/12/windows-azure-adds-nodejs-supp.php
Timing: 1 minute
Key Points:
Patches have been submitted to Samba.
Great example of how the relationship between an open source solution and Microsoft can evolve.
Script:
In late 2011, a patch to the Samba code was submitted that enables Linux clients to better interoperate with Microsoft Windows in mixed source environments. Contributed under GPL2+, the patch was an individual contribution made by Microsoft’s Stephen Zarkos (Open Source Technical Center team) in line with Samba policies in place at the time. Efforts also continue to move forward, with Microsoft and the Samba team working together to support the SMB protocol. The comments by Chris Hertel of the Samba team reflect how the relationship between key open source solutions and Microsoft has been evolving in the past several years.
FYI – Data Sources and more information:
“Driving Interoperability with the SMB Open Specifications” (Jun 2012): http://blogs.technet.com/b/openness/archive/2012/06/29/driving-interoperability-with-the-smb-open-specifications.aspx
Timing: 3 minutes
Key Points:
The substantial growth of CodePlex, Microsoft's open source project community, which has tripled in size in the past two years, illustrates the momentum of Microsoft + Open Source.
9 of the top 10 most downloaded OSS projects run on Windows.
In 2011 Microsoft launched WebMatrix, a free, light-weight web development tool designed for quick website building and deployment. This tool puts open source tools at developers’ fingertips, and these developers have downloaded more than one million open source web applications.
Customers are benefitting from our work with open source solutions, including the more than 900 customers of the Microsoft-SUSE Alliance.
Script:
Our increased commitment to working with open source has sparked tremendous momentum and contributed to rapid growth of open source software on Windows. According to SourceForge, 9 of the top 10 most downloaded OSS projects run on Windows today. (Side note: the complete project list is below; the only project that “isn’t supported on Windows” is the “Smart package of Microsoft's core fonts”, which doesn’t need to be supported because it obviously already runs on Windows.) Further, CodePlex, Microsoft’s open source project community, hosts more than 32,000 open source projects and has tripled its membership in just two years, from 300,000 members to more than 900,000 in 2012. Another great example is WebMatrix, a free, light-weight web development tool designed for quick website building and deployment. This tool puts open source tools at developers’ fingertips, and these developers have downloaded more than one million open source web applications. Since its launch in 2011, there have been more than 1 million downloads.
And customers as well as developers are benefitting directly from these efforts, including the more than 900 customers of the Microsoft-SUSE Alliance, which delivers interoperability solutions that help customers get more out of their mixed Windows and Linux environments.
FYI – Data Sources and more information:
SUSE, CodePlex, and WebMatrix stats current as of Nov 2012.
SourceForge top projects site (http://sourceforge.net/top/), “Most downloads over all time” as of Nov 25, 2012:
VLC media player
eMule
Azureus / Vuze
Ares Galaxy
7-Zip
Smart package of Microsoft's core fonts (“not supported on Windows” by SourceForge definition)
FileZilla
PortableApps.com: Portable Software/USB
MinGW - Minimalist GNU for Windows
Notepad++ Plugin Manager
<OPTIONAL SLIDE: Customize with local announcements as appropriate>
Timing: 1 minute
Key Points:
MongoDB has been supported on Windows Azure for some time, but recently the setup, deployment, and development experience has been streamlined by the release of the MongoDB Installer for Windows Azure.
In October, MongoLab released the preview of a MongoDB-as-a-Service offering through the Windows Azure Store. MongoLab is a full-featured MongoDB cloud database solution that completely automates the operational aspects of running MongoDB.
Script:
MongoDB is a very popular NoSQL database that is easy to learn if you have JavaScript (or Node.js) experience and is used in many high-volume web sites including Craigslist, FourSquare, Shutterfly, The New York Times, MTV, and others. People have been using MongoDB on Windows Azure for some time, but recently the setup, deployment, and development experience has been streamlined by the release of the MongoDB Installer for Windows Azure. It’s now easier than ever to get started with MongoDB on Windows Azure! Also, in October, MongoLab released the preview of a MongoDB-as-a-Service offering through the Windows Azure Store. MongoLab is a full-featured MongoDB cloud database solution that completely automates the operational aspects of running MongoDB. With the MongoLab cloud platform, developers can deploy and manage highly available databases for their applications and leverage automated backups, web-based tools, 24/7 monitoring, and expert support.
FYI – Data Sources and more information:
For more detail on the MongoDB Installer for Windows Azure: http://blogs.msdn.com/b/interoperability/archive/2012/07/09/mongodb-installer-for-windows-azure.aspx
For more detail on the MongoLab service: https://www.windowsazure.com/en-us/store/service/?name=mongolab
Timing: 3 minutes
Key Points:
Windows Azure is an open and flexible cloud platform.
Developers can build applications using any language, tool or framework, including open source languages such as PHP, Java, and Node.js, and other open source tools.
Our June 2012 technical preview release brought support for Linux on Windows Azure Virtual Machines and further support for multiple frameworks and popular open source applications through Windows Azure Web Sites.
Script:
As part of our cloud platform, interoperability is a design-time requirement. Windows Azure is an open and flexible cloud platform that enables customers to quickly build, deploy and manage applications across a global network of Microsoft-managed datacenters. To do it right, we know we’ve got to be open. Developers can build applications using any language, tool or framework, including open source languages such as PHP, Java, and Node.js, and other open source tools, which means they can utilize familiar open source skills on Microsoft's cloud platform. Currently, features and services in Windows Azure are exposed using open REST protocols. Windows Azure client libraries are available for multiple programming languages, are released under an open source license, and are hosted on GitHub. As Microsoft continues to provide incremental improvements to Windows Azure, we remain committed to working with developer communities. Other recent interoperability enhancements include the Eclipse Plugin for Java, MongoDB support, code configuration for hosting Solr/Lucene, and a Hadoop services preview. Also, our June 2012 technical preview brought support for Linux images on Windows Azure Virtual Machines and further support for multiple frameworks and popular open source applications through Windows Azure Web Sites (note: see appendix slides for more detail on Virtual Machines and Web Sites).
Timing: 1 minute
Key Points:
Windows Azure Web Sites enables developers to quickly and easily deploy sites, with support for multiple frameworks and popular open source applications, to a highly scalable cloud environment.
Script:
Windows Azure Web Sites allows you to build highly scalable websites on Windows Azure. You can quickly and easily deploy sites to a cloud environment that allows you to start small and scale as traffic grows. Windows Azure Web Sites uses the languages and open source apps of your choice and supports deployment with Git, FTP, and TFS. You can easily integrate other services like MySQL, SQL Database, Caching, CDN, and Storage.
In January 2011 Microsoft launched WebMatrix, a free, light-weight web development tool designed for quick web site building and deployment. This tool puts open source tools at developers’ fingertips:
Choose from a gallery of popular open source web applications to get a site up and running in a few clicks.
Installs PHP & MySQL for the apps that need them.
Edit your code or database within WebMatrix.
Utilizes NuGet to gain access to a community-driven gallery of ASP.NET “helpers” that give you small snippets of code to perform common tasks (bit.ly, Facebook integration, Twitter, etc.).
Example: MSN Casual Gaming
~2 million users at launch
~86 million service requests/day
135 Windows Azure Data Services hosting VMs
ca. 18K connections in connection pools; this could grow with traffic
ca. 1,200 SQL Azure requests/second spread across all partitions during peak load
~90% reads vs. ~10% writes (this varies per storage type)
~200 bytes of storage per user
~20% of database storage is currently used, but expect this to grow
Sharded over 400 SQL Azure databases
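FYI – Illustration: the figures above can be turned into rough per-shard numbers. The following is a minimal sketch, not the MSN Casual Gaming implementation. The notes do not say how users were assigned to databases, so the hash-based routing and the helper names (`shard_for_user`, `average_load_per_shard`) are assumptions made for illustration only.

```python
import hashlib

SHARD_COUNT = 400             # from the notes: sharded over 400 SQL Azure databases
PEAK_REQUESTS_PER_SEC = 1200  # from the notes: ca. 1,200 requests/second at peak
USERS_AT_LAUNCH = 2_000_000   # from the notes: ~2 million users at launch
BYTES_PER_USER = 200          # from the notes: ~200 bytes of storage per user


def shard_for_user(user_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a user ID to a shard index with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def average_load_per_shard(total_rps: float, shard_count: int = SHARD_COUNT) -> float:
    """Average requests/second per shard if the load is spread evenly."""
    return total_rps / shard_count


# The same user always routes to the same shard.
print(shard_for_user("player-42") == shard_for_user("player-42"))
# Average peak load per database under an even spread.
print(average_load_per_shard(PEAK_REQUESTS_PER_SEC))
# Total per-user storage across all shards, in bytes.
print(USERS_AT_LAUNCH * BYTES_PER_USER)
```

Under an even spread, each database would see about 3 requests per second at peak, and the per-user data would total about 400 MB across all shards. Real traffic is rarely perfectly even, so these are averages only.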
Note: Large companies invest resources in building these platforms instead of using existing relational platforms!
No DB or OS Admin telling me what to do!
Performance and Scale:
Map/Reduce patterns
Eventual consistency (trade-off due to CAP)
Sharding
Caching
Automate Management Lifecycle:
Elastic scale on demand (no need to pay for resources until needed)
Automatic fail-over
Scalable schema version rollout
Perf troubleshooting
Auto alerting
Auto load balancing
Auto resourcing (e.g., auto splits based on policies)
Declarative policy-based management
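FYI – Illustration: the Map/Reduce pattern listed above can be sketched over sharded data as follows. This is a minimal, in-memory sketch; the shard contents, record layout, and function names are hypothetical, and a real deployment would run the map step on each shard in parallel rather than in a single process.

```python
from collections import Counter
from functools import reduce

# Hypothetical in-memory "shards"; each holds (user_id, game) play records.
SHARDS = [
    [("u1", "chess"), ("u2", "solitaire")],
    [("u3", "chess"), ("u4", "chess")],
    [("u5", "solitaire")],
]


def map_shard(records):
    """Map step: count plays per game within a single shard."""
    return Counter(game for _user, game in records)


def reduce_counts(left, right):
    """Reduce step: merge two partial per-game counts."""
    return left + right


def plays_per_game(shards):
    """Fan out the map step to every shard, then fold the partial results."""
    partials = [map_shard(shard) for shard in shards]
    return reduce(reduce_counts, partials, Counter())


print(plays_per_game(SHARDS))
```

Because each shard produces its own partial count, the map step needs no cross-shard coordination; only the small partial results have to be brought together in the reduce step.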
Code First and revise quickly
Working software over comprehensive documentation
Responding to change over following a plan
Application-model first (before database)
Dictates the data model and queries
Flexible data models
No a priori modeling: Data first, schema later/Open Schema
Key/Value stores
Reduced impedance mismatch: JSON, XML, YAML
You don’t know exactly what you are looking for
Map/Reduce for ad hoc analysis
Provide Search across all your data instead of just query
Lower pain of adoption and maintenance
From code to deployment & “monetization” of data, services, apps and tenants
Rich Services out of the Box
Data and services mashup
Easy troubleshooting of deployed apps
No DB or OS Admin telling me what to do
Low CapEx, Low OpEx: SQL Azure and other Platform as a Service offerings
Built-in High-Availability (tunable): SQL Azure has quorum-based built-in replicas
Data scale-out (Sharding): SQL Azure Federations
Processing scale-out (Map-Reduce, Fan-Out, tunable consistency)
Flexible Data Models
JSON (& XML) support
Sparse columns/Column sets
Integrate with BigData Analytics (e.g., Hadoop)
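FYI – Illustration: the ideas of quorum-based replicas and tunable consistency above can be expressed with simple arithmetic. This is a generic sketch of majority quorums, not a description of SQL Azure internals. The replica count of three used in the example calls and the function names are assumptions for illustration.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority quorum."""
    return replica_count // 2 + 1


def write_committed(acks: int, replica_count: int) -> bool:
    """A write commits once a majority of replicas has acknowledged it."""
    return acks >= majority(replica_count)


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """Tunable consistency: a read quorum of r and a write quorum of w
    always overlap in at least one replica when r + w > n."""
    return r + w > n


# Assumed example: three replicas.
print(majority(3))                      # quorum size for three replicas
print(write_committed(2, 3))            # two acknowledgements out of three
print(reads_see_latest_write(3, 1, 1))  # small quorums trade consistency for speed
```

Lowering `r` or `w` makes individual operations cheaper but allows a read to miss the latest write, which is the eventual-consistency trade-off.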
SharePoint – BI, Enterprise Search, Enterprise Content Management, Collaboration
Transform – ETL
Clean – Data Quality, Augmentation
Discover – Search, Meta-data, Classification, Information Catalog
Infer – Recommendation Engines, Machine Learning
Share – Publish, Collaborate
Govern – Lineage & Impact Analysis, Master Data Management
Marketplace – Private, Public, Bing Data, 3rd Party Data Sources, Models, Algorithms, APIs