Treasure Data provides a data analytics service with the following key components:
- Data is collected from various sources using Fluentd and loaded into PlazmaDB.
- PlazmaDB is the distributed time-series database that stores metadata and data.
- Jobs like queries, imports, and optimizations are executed on Hadoop and Presto clusters using queues, workers, and a scheduler.
- The console and APIs allow users to access the service and submit jobs for processing and analyzing their data.
6. What I'll talk about today
• Architecture Overview
• Semantics for Distributed Systems
• Queue/Worker and Scheduler
• Data Processing: Hive and Presto
• Data Storage: PlazmaDB
• All about Our Infrastructure
• Data Collection: Fluentd and Embulk
• Data and Open Source Software Products
• Disposable Services and Persistent Data
16. Semantics
• Send messages / execute jobs once, then:
• Exactly once
• If any error occurs, it will be retried if needed
• At least once
• If any error occurs, it will be retried either way
• At most once
• If any error occurs, do nothing anymore
17. Semantics
• Send messages / execute jobs once, then:
• Exactly once
• If any error occurs, it will be retried if needed
• Hard to implement, poor performance (e.g. TCP for users)
• At least once
• If any error occurs, it will be retried either way
• Possibly duplicated events (e.g. TCP packets)
• At most once
• If any error occurs, do nothing anymore
• Possibly missing events (e.g. UDP packets)
18. Idempotence (冪等性)
• "describing an action which, when performed multiple times on the same subject, has no further effect on its subject after the first time it is performed." (Wikipedia)
• Exactly-once operations
• achieved by idempotent operations w/ at-least-once semantics
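To make that concrete: under at-least-once delivery, exactly-once effects come from making the operation idempotent. Below is a minimal Ruby sketch of the usual dedup-key pattern — not code from the deck; the processed_messages table, the pg connection, and the process! callback are all hypothetical, and ON CONFLICT assumes PostgreSQL 9.5+.

    require "pg" # hypothetical setup: gem install pg, plus a local PostgreSQL

    conn = PG.connect(dbname: "example")
    conn.exec(<<~SQL)
      CREATE TABLE IF NOT EXISTS processed_messages (
        message_id text PRIMARY KEY  -- the unique key that makes redelivery a no-op
      )
    SQL

    # Hypothetical side effect; must take effect exactly once per message.
    def process!(message_id)
      puts "processing #{message_id}"
    end

    # Idempotent handler: safe to call many times for the same message_id.
    def handle_once(conn, message_id)
      conn.transaction do |tx|
        claimed = tx.exec_params(
          "INSERT INTO processed_messages (message_id) VALUES ($1)
           ON CONFLICT (message_id) DO NOTHING",
          [message_id]
        )
        next if claimed.cmd_tuples.zero? # duplicate delivery: already handled
        process!(message_id)             # a crash here rolls back the INSERT too
      end
    end

    handle_once(conn, "msg-001")
    handle_once(conn, "msg-001") # retried delivery: no further effect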
20. Queue/Worker and Scheduler
• Treasure Data: a multi-tenant data analytics service
• executes many jobs in shared clusters (queries, imports, ...)
• CORE: queues/workers & schedulers
• Clusters have their own queues/schedulers... but that's not enough:
• resource limits for each price plan
• priority queues per job type
• and many others
21. PerfectSched
• Provides periodic/scheduled queries for customers
• like a reliable "cron"
• Highly available distributed scheduler built on an RDBMS
• Written in CRuby
• At-least-once semantics
• PerfectSched enqueues jobs into PerfectQueue
https://github.com/treasure-data/perfectsched
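The slide's RDBMS-backed, at-least-once design can be sketched roughly like this (my illustration, not PerfectSched's actual code; the schedules/queued_jobs schema is hypothetical, and FOR UPDATE SKIP LOCKED assumes PostgreSQL 9.5+, while PerfectSched itself runs against MySQL):

    require "pg" # hypothetical setup

    # Claim one due schedule, enqueue its job, and advance next_run_at,
    # all in one transaction. A crash before COMMIT rolls everything back,
    # so the schedule fires again later: at-least-once.
    def run_due_schedule(conn)
      conn.transaction do |tx|
        due = tx.exec(<<~SQL)
          SELECT id FROM schedules
          WHERE next_run_at <= now()
          ORDER BY next_run_at
          LIMIT 1
          FOR UPDATE SKIP LOCKED
        SQL
        next if due.ntuples.zero?

        id = due[0]["id"]
        tx.exec_params("INSERT INTO queued_jobs (schedule_id) VALUES ($1)", [id])
        # Real schedulers derive the next run from a cron spec;
        # a fixed interval stands in for that here.
        tx.exec_params(
          "UPDATE schedules SET next_run_at = next_run_at + interval '1 hour' WHERE id = $1",
          [id]
        )
      end
    end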
22. Jobs in TD: for queues
                        Lost   Retried after errors   Throughput   Execution time
  QUERY                 NG     OK or NG               LOW          SHORT (secs) or LONG (mins-hours)
  DATA import/export    NG     OK or NG               HIGH         SHORT (secs-mins)
23. PerfectQueue
• Highly available distributed queue built on an RDBMS
• Enqueue by INSERT INTO
• Dequeue/commit by UPDATE
• Using transactions
• Designed for flexible scheduling rather than scalability
• Workers do many things:
• PlazmaDB operations (including importing data)
• Building job parameters
• Handling results of jobs + kicking other jobs
• Uses Amazon RDS (MySQL) internally (+ workers on EC2)
https://github.com/treasure-data/perfectqueue
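The INSERT/UPDATE pattern above, sketched in Ruby against PostgreSQL (a simplified illustration with a hypothetical tasks schema; PerfectQueue's real implementation targets MySQL on RDS and differs in detail):

    require "pg" # hypothetical setup

    # Enqueue by INSERT INTO.
    def enqueue(conn, payload)
      conn.exec_params("INSERT INTO tasks (payload) VALUES ($1)", [payload])
    end

    # Dequeue by UPDATE: take a lease on one visible task.
    # If the worker dies, the lease expires and the task becomes visible again.
    def dequeue(conn, lease_seconds: 300)
      res = conn.exec_params(<<~SQL, [lease_seconds])
        UPDATE tasks
        SET owned_until = now() + make_interval(secs => $1)
        WHERE id = (
          SELECT id FROM tasks
          WHERE finished_at IS NULL
            AND (owned_until IS NULL OR owned_until < now())
          ORDER BY id
          LIMIT 1
          FOR UPDATE SKIP LOCKED
        )
        RETURNING id, payload
      SQL
      res.ntuples.zero? ? nil : res[0]
    end

    # Commit by UPDATE: mark the task finished so it is never handed out again.
    def commit(conn, task_id)
      conn.exec_params("UPDATE tasks SET finished_at = now() WHERE id = $1", [task_id])
    end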
25. Features
• Priorities per query type
• Resource limits per account
• Graceful restarts
• queries may run for a long time (up to 1 day)
• new worker code must be loadable while jobs keep running on the older code
27. Hive and Presto
• Hive: SQL executor on Hadoop
• Parses SQL, then compiles & submits MapReduce jobs
• Good for large/complex queries (e.g. JOINs)
• Stable, high throughput, but high latency for small queries
• Presto: MPP engine for SQL
• MPP: Massively Parallel Processing (using threads)
• Good for small-to-medium queries
• High performance and low latency for small queries
28. Query engines in TD
           Latency for   Throughput for extra-    Complex       Standard   Semantics
           short query   large source data        JOINs         SQL
  Hive     Bad           Good                     Good          Bad        At least once (speculative execution, task retries)
  Presto   Good          Not so good              Not so good   Good       At most once (no retries on failure)
29. PlazmaDB
One data source, two query engines: our customers can switch query engines per query.
[Diagram: workers submit jobs to the Hadoop cluster and the Presto cluster; both read PlazmaDB, which combines metadata (databases, tables, schemas) with data chunks in object storage]
31. PlazmaDB
• The only persistent component in TD
• stores all customers' data
• Distributed database
• time-indexed, columnar, schema-on-read
• transparent 2-layered storage (realtime, archive)
• realtime: appends recent data
• archive: stores long-term data
• metadata (PostgreSQL) + object storage (S3/RiakCS)
• all operations are idempotent
32. Realtime Storage
[Diagram: import workers take uploads from the import queue and write files to realtime storage on Amazon S3 / Basho Riak CS; metadata of the records in each file is stored on PostgreSQL, alongside the archive storage metadata]
Per-file metadata rows:
  uploaded time       file index range                               records
  2015-03-08 10:47    [2015-12-01 10:47:11, 2015-12-01 10:48:13]     3
  2015-03-08 11:09    [2015-12-01 11:09:32, 2015-12-01 11:10:35]     25
  2015-03-08 11:38    [2015-12-01 11:38:43, 2015-12-01 11:40:49]     14
  ...                 ...                                            ...
33. Archive Storage
[Diagram: a merge worker (MapReduce job) merges realtime-storage files into archive storage on Amazon S3 / Basho Riak CS every 1 hour, with retrying + unique application (at-least-once + at-most-once); metadata on PostgreSQL]
Realtime metadata: per uploaded file, as above. Archive metadata: per merged hourly file:
  file index range                               records
  [2015-12-01 10:00:00, 2015-12-01 11:00:00]     3,312
  [2015-12-01 11:00:00, 2015-12-01 12:00:00]     2,143
  ...                                            ...
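One way to read "retrying + unique" is that the merge commit is an atomic, idempotent metadata swap: insert the archive row and drop the merged realtime rows in one transaction, keyed by the hour being merged. A minimal sketch under that assumption (all names are hypothetical; ON CONFLICT assumes PostgreSQL 9.5+):

    require "pg" # hypothetical setup

    # Retried on failure (at-least-once); the unique key on partition_hour
    # makes a second successful attempt a no-op (at-most-once).
    def commit_merge(conn, hour_start, merged_file_path, record_count)
      conn.transaction do |tx|
        inserted = tx.exec_params(
          "INSERT INTO archive_files (partition_hour, path, records)
           VALUES ($1, $2, $3)
           ON CONFLICT (partition_hour) DO NOTHING",
          [hour_start, merged_file_path, record_count]
        )
        next if inserted.cmd_tuples.zero? # another attempt already committed this hour

        tx.exec_params(
          "DELETE FROM realtime_files
           WHERE index_range_start >= $1
             AND index_range_end < $1::timestamptz + interval '1 hour'",
          [hour_start]
        )
      end
    end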
34. [Diagram: the same realtime/archive metadata tables on PostgreSQL and files on Amazon S3 / Basho Riak CS as above]
• GiST (R-tree) index on the "time" column of the files' metadata
• Read from archive storage if merged; otherwise, from realtime storage
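For illustration, the time index could look like this in PostgreSQL, assuming the per-file ranges are stored as a tsrange (hypothetical table and column names): a GiST index lets partition lookup run as a range-overlap query instead of a scan over every file row.

    require "pg" # hypothetical setup

    conn = PG.connect(dbname: "plazma_metadata_example")

    # GiST over a range type behaves like an R-tree, matching the slide.
    conn.exec(<<~SQL)
      CREATE TABLE IF NOT EXISTS archive_files (
        path        text PRIMARY KEY,
        index_range tsrange NOT NULL,
        records     bigint NOT NULL
      );
      CREATE INDEX IF NOT EXISTS archive_files_time_idx
        ON archive_files USING gist (index_range);
    SQL

    # Partition lookup for a query over [11:00, 12:00): only overlapping files.
    files = conn.exec_params(
      "SELECT path FROM archive_files
       WHERE index_range && tsrange($1::timestamp, $2::timestamp)",
      ["2015-12-01 11:00:00", "2015-12-01 12:00:00"]
    )
    files.each { |row| puts row["path"] }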
35. [Diagram: two example record files with (time, code, method) columns, one per hourly partition; files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL]
• time-based partitioning: each hourly index range maps to a file (e.g. [10:00, 11:00) → 3,312 records; [11:00, 12:00) → 2,143 records)
• column-based partitioning: within each file, records are stored column by column
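Time partitioning is what makes time-range predicates cheap: a query that bounds the time column only has to read the files whose hourly ranges overlap the predicate (in TD queries this is typically written with the TD_TIME_RANGE UDF). A small Ruby sketch of the pruning arithmetic, with a hypothetical helper:

    require "time"

    HOUR = 3600

    # Return the hourly partition starts a time predicate touches;
    # only the files for these hours need to be read.
    def touched_partitions(from, to)
      from_ts = Time.parse(from).to_i
      to_ts   = Time.parse(to).to_i
      first   = from_ts - (from_ts % HOUR) # floor to the hour
      (first...to_ts).step(HOUR).map { |t| Time.at(t).utc }
    end

    # A query over 10:30-12:10 touches three partitions: 10:00, 11:00, 12:00.
    p touched_partitions("2015-12-01 10:30:00 UTC", "2015-12-01 12:10:00 UTC")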
37. What I'll talk about today
• Architecture Overview
• Semantics for Distributed Systems
• Queue/Worker and Scheduler
• Data Processing: Hive and Presto
• Data Storage: PlazmaDB
• All about Our Infrastructure
• Data Collection: Fluentd and Embulk
• Data and Open Source Software Products
• Disposable Services and Persistent Data
39. Infrastructure: fully on cloud
• AWS, IDCF Cloud, Heroku, Fastly for the service core
• S3, EC2, RDS, ElastiCache, ... (in AWS)
• Not locked in to any cloud service
• And many services for DevOps
• GitHub, CircleCI, DataDog, PagerDuty, StatusPage.io, Chef.io, JIRA/Confluence, Airbrake, NewRelic, Artifactory, Zendesk, OneLogin, Box, Orbitz, Slack, ...
• No self-hosted servers for non-core services
• w/o VPN servers :P
40. Policies about Infrastructure
• Concentrate on development of our own service
• decrease operation costs as much as possible
• Keep the initiative on deployment/performance
• don't get locked in by any environment or service
• solve our problems with our own technologies
• Keep it simple and not too much
• spend our time on the service as much as possible, not on process overhead
44. Fluentd
• Stream-based log collector
• Written in CRuby
• Plugin-based architecture using rubygems.org
• Open Source Software: Apache License v2
• Committers from TD and others (e.g. DeNA)
• Packaging for many environments: td-agent
• a package with Fluentd + some selected plugins
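As a taste of the streaming model, this is how an application can emit events into a local Fluentd/td-agent with the fluent-logger gem (a minimal sketch; the host/port match td-agent's default forward input, and the tag and record are made up):

    require "fluent-logger" # gem install fluent-logger

    # Connect once per process; events stream to the local Fluentd.
    Fluent::Logger::FluentLogger.open(nil, host: "localhost", port: 24224)

    # Each post is one event record, routed inside Fluentd by its tag.
    Fluent::Logger.post("myapp.access", { "method" => "GET", "code" => 200 })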
46. Embulk
• Bulk data loader: high throughput & reliability
• Written in Java/JRuby
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
http://www.embulk.org/
48. Embulk
• Batch-based log collector
• Written in Java + JRuby
• Plugin-based architecture using rubygems.org
• Open Source Software: Apache License v2
• Committers from TD and many external contributors
• Hosted Embulk: DataConnector in TD
• with no customizations
50. Treasure Data: "an open source company at its core"
• To enlarge the world of data
• To make the world better through software
• What's important is data, not tools nor services
52. Disposable components
• Blue-green deployment ready
• Console & API servers: Heroku/EC2
• Event collectors
• Workers and schedulers
• Hadoop & Presto clusters
• It makes operations much easier! :D
• no need to care about long-lived servers/processes
• no complex operation steps for server crashes
58. Persistent Data
• 2 RDBMSs
• API DB for customer data
• PlazmaDB metadata
• Object storage for PlazmaDB
• That's it!
• We can concentrate on these very few components :D