Roger Barga

Seattle, Washington, United States
7K followers 500+ connections

About

Product leader with broad experience in enterprise computing, from strategy, product…

Publications

  • Predictive Analytics with Microsoft Azure Machine Learning: Build and Deploy Actionable Solutions in Minutes. Second Edition

    Apress

    Data Science and Machine Learning are in high demand, as customers are increasingly looking for ways to glean insights from all their data. More customers now realize that Business Intelligence is not enough as the volume, speed and complexity of data now defy traditional analytics tools. While Business Intelligence addresses descriptive and diagnostic analysis, Data Science unlocks new opportunities through predictive and prescriptive analysis.

    The purpose of this book is to provide a gentle and instructionally organized introduction to the field of data science and machine learning, with a focus on building and deploying predictive models.

  • Project Daytona: Data Analytics as a Cloud Service

    Proceedings of the International Conference on Data Engineering (ICDE), 7 March 2012

    Spreadsheets are established data collection and analysis tools in business, technical computing and academic research. Excel, for example, offers an attractive user interface, provides an easy-to-use data entry model, and offers substantial interactivity for what-if analysis. However, spreadsheets and other common client applications do not offer scalable computation for large-scale data analytics and exploration. Increasingly, researchers in domains ranging from the social sciences to environmental sciences are faced with a deluge of data, often sitting in spreadsheets such as Excel or other client applications, and they lack a convenient way to explore the data, to find related data sets, or to invoke scalable analytical models over the data. To address these limitations, we have developed a cloud data analytics service based on Daytona, which is an iterative MapReduce runtime optimized for data analytics. In our model, Excel and other existing client applications provide the data entry and user interaction surfaces, Daytona provides a scalable runtime on the cloud for data analytics, and our service seamlessly bridges the gap between the client and cloud. Any analyst can use our data analytics service to discover and import data from the cloud, invoke cloud-scale data analytics algorithms to extract information from large datasets, invoke data visualization, and then store the data back to the cloud, all through a spreadsheet or other client application they are already familiar with.

  • CloudClustering: Toward an iterative data processing pattern on the cloud

    Proceedings IEEE Cloud 2011, The 4th International Conference on Cloud Computing, IEEE Computer Society

    As the emergence of cloud computing brings the potential for large-scale data analysis to a broader community, architectural patterns for data analysis on the cloud, especially iterative algorithms, are increasingly useful. MapReduce suffers performance limitations for this purpose as it is not inherently designed for iterative algorithms.

    In this paper we describe our implementation of CloudClustering, a distributed k-means clustering algorithm on Microsoft’s Windows Azure cloud. The k-means algorithm makes a good case study because its characteristics are representative of many iterative data analysis algorithms. CloudClustering adopts a novel architecture to improve performance without sacrificing fault tolerance. To achieve this goal, we introduce a distributed fault tolerance mechanism called the buddy system, and we make use of data affinity and checkpointing. Our goal is to generalize this…

    Other authors
    • Ankur Dave
    • Wei Lu
    • Jared Jackson
  • A Scalable Communication Runtime for Clouds

    Proceedings IEEE Cloud 2011, The 4th International Conference on Cloud Computing, IEEE Computer Society

    Leveraging cloud computing to acquire the necessary computation resources to scale out parallel applications is becoming common practice. However, many such applications also require communication and synchronization between processes. Although commercial cloud platforms provide ready access to scalable compute and storage services, implementing communication and synchronization between cooperating processes and efficiently exchanging arbitrary-size messages remains a challenge for application developers. In clouds, durable queues provide basic abstractions for communication. However, they are not sufficient for applications that require transferring arbitrary-size messages or for applications that require higher-level abstractions such as broadcast. Furthermore, direct socket-based communication is susceptible to various fluctuations common in data center environments. We envision a solution to this problem that leverages scalable storage services, queues, and direct socket-based communication. Publish/subscribe (pub/sub) is a well-known communication pattern that can achieve the above capabilities in a loosely coupled fashion, which is highly desirable in cloud environments where most services are asynchronous. In this paper, we describe the architecture of a pub/sub library implemented on a commercial cloud computing platform, which can be used to develop various parallel applications. We also present an evaluation of our implementation using both microbenchmarks and a real-world application. Together, these demonstrate that our approach is both effective and scalable in performing communication and synchronization in cloud-scale applications.

    Other authors
    • Jaliya Ekanayake
    • Jared Jackson
    • Wei Lu
  • Accurate Latency Estimation in a Distributed Event Processing System

    27th International Conference on Data Engineering (ICDE '11)

    Other authors
    • Badrish Chandramouli
    • Jonathan Goldstein
    • Mirek Riedewald
  • Versioning for Workflow Evolution

    Third International Workshop on Data Intensive Distributed Computing held in conjunction with the 19th International Symposium on High Performance Distributed Computing (HPDC'10)

    Scientists working in eScience environments often use workflows to carry out their computations. Since the workflows evolve as the research itself evolves, these workflows can be a tool for tracking the evolution of the research. Scientists can trace their research and associated results through time or even go back in time to a previous stage and fork to a new branch of research. In this paper we introduce the workflow evolution framework (EVF), which is demonstrated through implementation in the Trident workflow workbench. The primary contribution of the EVF is efficient management of knowledge associated with workflow evolution. Since we believe evolution can be used for workflow attribution, our framework will motivate researchers to share their workflows and get the credit for their contributions.

  • Building the Trident Scientific Workflow Workbench for Data Management in the Cloud

    International Conference on Advanced Engineering Computing and Applications in Sciences (ADVCOMP), IEEE

  • GrayWulf: Scalable Software Architecture for Data Intensive Computing

    Hawaii International Conference on System Sciences (HICSS), IEEE Computer Society

    Other authors
    • Maria Nieto-Santisteban, Tamas Budavari, Nolan Li, Alex S. Szalay, Catharine van Ingen, Jim Heasley, et al.
  • Observing the Oceans - A 2020 Vision for Ocean Science

    The Fourth Paradigm: Data Intensive Scientific Discovery, Microsoft Research

    Other authors
    • John Delaney

Patents

  • De-focusing over big data for extraction of unknown value

    US 8,452,792

    Techniques for defocusing queries over big datasets and dynamic datasets are provided to broaden search results, incorporate all potentially relevant data, and avoid overly narrow queries. An analytic component can receive queries directed at one region of a dataset and analyze the queries to generate inferences about them. The queries can then be defocused by a defocusing component to incorporate a larger dataset than originally searched. The larger dataset can incorporate all or part of the original dataset and can also be disparate from the original dataset. Clusters of queries can also be merged and unified to deal with ‘local minima’ issues and broaden the understanding of the dataset. In other embodiments, dynamic data can be monitored and changes tracked to ensure that all portions of the dataset are being searched by the queries.

  • Automatic significance tagging of incoming communications

    US 20090006366

  • Automatically managing incoming communications between sender and recipient, analyzing factors, selectively applying observed behavior, performing designated action

    US 7,885,948

  • Break-through mechanism for personas associated with a single device

    US 20110045806

  • Consistency sensitive streaming operators

    US 20090125635

  • Distributed workflow framework

    US 20110035506

  • Event stream conditioning

    US 8,099,452

  • Federated distributed workflow scheduler

    US 20110161391

  • Implementation of stream algebra over class instances

    US 7,676,461

  • Optimized recovery logging

    US 7,418,462

  • Publishing work activity to social networks

    US 20090006415

  • Persistent client-server database sessions

    US 7,386,557

  • Persistent stateful component-based applications via automatic recovery

    US 7,461,292

  • Publishing work activity information key tags associated with shared databases in social networks

    US …

  • Recovery guarantees for general multi-tier applications

    US 7,478,277

  • Recovery guarantees for software components

    US 6,959,401

  • Single device with multiple personas

    US 20110061008

  • Streaming operator placement for distributed stream processing

    US 8,060,614

  • Temporal event stream model

    US 20090125550

Honors & Awards

  • National Academy of Engineering, Frontiers of Engineering

    National Academy of Engineering

  • Intel Graduate Fellowship Recipient

    Intel Corporation

    The Intel PhD Fellowship Program awards fellowships to PhD candidates doing work in fields related to Intel's business and research interests. These fellowships, available only at select U.S. universities, include tuition and a stipend. Approximately 35 fellowships are awarded annually.
