Building an analytics and reporting stack in 2017

We spoke with over 50 Reflect customers across 10 industries to find out which analytics stacks are working and which are not. Here are the biggest analytics trends we’re seeing in 2017.

The move to a modular data architecture

Developers want the flexibility to customize and extend their stack, and are investing in a modular data architecture that makes it easy to switch out individual components. This approach is commonly accepted wisdom in software development, but has only recently been applied in the BI/data space.

The old model was focused on building an end-to-end stack for every use case. For example, marketing, sales, operations, and HR each had to run their own vertical stack. If you were integrating reporting into your SaaS offering, or sharing data outside of your organization, you would have to build another stack. This model is expensive and breaks down quickly.

The new model lets you treat each horizontal part of the stack individually, without stressing about how they play together. The industry is moving from a monolithic model into a component-based system that's broken out into three main parts (sketched in code after the list):

  • BI/visualization layer (where you get value from your data)
  • Data warehouse (somewhere to keep your data)
  • Data pipeline (collect, transform, and load data into your warehouse)
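
To make the idea concrete, here is a purely illustrative Python sketch (the tool names, connection strings, and helper function are hypothetical, not anyone's actual config format). The point is that the layers only meet at a standard interface, so you can swap one component without touching the others:

```python
# Illustrative sketch of a modular stack: each layer is a swappable component.
# All tool names and connection strings here are hypothetical placeholders.

STACK = {
    # Data pipeline: collects events and loads them into the warehouse.
    "pipeline": {"tool": "Segment", "destination": "warehouse"},

    # Data warehouse: the only thing the other layers need to agree on.
    "warehouse": {"tool": "Redshift",
                  "dsn": "postgresql://user:pass@example-cluster:5439/analytics"},

    # BI/visualization: reads from the warehouse over SQL, nothing else.
    "bi": {"tool": "Looker", "reads_from": "warehouse"},
}

def swap_warehouse(stack, tool, dsn):
    """Swapping one component leaves the other layers untouched,
    as long as the interface (here, a SQL connection) stays the same."""
    stack["warehouse"] = {"tool": tool, "dsn": dsn}
    return stack

# Moving from Redshift to Snowflake only changes the warehouse entry.
swap_warehouse(STACK, "Snowflake", "snowflake://user:pass@example-account/analytics")
```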

It’s all about the endgame

There’s no point in collecting and storing data unless you have a solid endgame. Here are some of the BI and visualization tools that stand out, each one solving a unique need or set of use cases:

Looker (best for large data-driven orgs)

Looker is a BI platform that helps data teams create dashboards and reports for data-driven organizations. Looker requires a data team to configure a data model before business users can access the tool.

Mode (default tool for hardcore analysts)

Mode is a collaborative BI platform that focuses on analysts' workflows, offering deep integration with Jupyter notebooks and support for writing complex SQL statements. Mode works well for hardcore analysts and data scientists who need it as a daily workflow.
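
As a rough illustration of the workflow Mode is built around, here is a generic pandas/SQLAlchemy sketch against a hypothetical warehouse (this is not Mode's own API; the connection string and table names are placeholders): write the SQL first, then keep working on the result in a notebook.

```python
# A generic notebook workflow of the kind Mode supports: SQL first, Python after.
# The connection string, table, and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@warehouse.example.com:5439/analytics")

sql = """
    SELECT signup_week, plan, COUNT(*) AS signups
    FROM users
    GROUP BY signup_week, plan
"""

# Pull the SQL result into a DataFrame and continue the analysis in Python.
df = pd.read_sql(sql, engine)
pivot = df.pivot(index="signup_week", columns="plan", values="signups")
print(pivot.head())
```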

Periscope (all-around SQL reporting)

Periscope is a SQL chart builder geared toward analysts who are looking for a daily workbench. Periscope has a hybrid architecture that includes caching (backed by Redshift) to help improve performance on unwieldy queries.

Metabase (great for simple dashboards)

Metabase is a lightweight, open source, free BI tool for asking data-driven questions and building out basic dashboards. Given that Metabase is open source, you can download it and run it locally, which isn’t an option for most SaaS-based providers.

Reflect (for delivering analytics to customers)

Reflect is a data-visualization-as-a-service platform that lets product owners build analytics and reporting into their applications, without a ton of engineering effort. Reflect has a lightweight data integration model and is completely customizable (so you can match your brand) on the front-end.

A cloud-based datasource is table stakes

Databases have become faster, cheaper, and easier to maintain. You can now BYOD (bring your own database) and use it to run all your aggregations. Cloud-based data warehouses have become increasingly popular, and most savvy teams are migrating off their on-premises solutions. Some of the success cases we see include:

Redshift (best bang for the buck)

Redshift is a columnar database that is purpose-built for analytical workloads at scale. Easy access to data within the AWS ecosystem, cost-effectiveness, and stellar performance have contributed to Redshift's rapid growth. AWS also recently announced Redshift Spectrum, making it even easier to query and ingest data from S3.
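
Because Redshift speaks the Postgres wire protocol, a standard Postgres driver is enough to run aggregations against it. A minimal sketch using psycopg2 (the cluster endpoint, credentials, and table are placeholders):

```python
# Minimal Redshift query via psycopg2 (Redshift uses the Postgres protocol, port 5439).
# Host, credentials, and table/column names below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="...",
)

with conn.cursor() as cur:
    # A typical columnar-friendly aggregation: scan a few columns, group, count.
    cur.execute("""
        SELECT event_type, COUNT(*) AS events
        FROM page_views
        WHERE event_date >= '2017-01-01'
        GROUP BY event_type
        ORDER BY events DESC
    """)
    for event_type, events in cur.fetchall():
        print(event_type, events)

conn.close()
```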

BigQuery (optimized for huge datasets)

BigQuery is a scalable, interactive ad-hoc query system for analysis of read-only nested data. You can load data from Google Cloud Storage or stream it in directly. BigQuery is fully managed and easy to maintain, but it only makes sense if you're working with big datasets: the performance-to-value ratio only pays off on huge queries or when you're using its ML (machine learning) features.
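
Running an ad-hoc query takes only a few lines with the google-cloud-bigquery Python client. A minimal sketch (the project, dataset, and table names are placeholders, and credentials are assumed to come from the environment):

```python
# Minimal BigQuery ad-hoc query using the official Python client.
# Project, dataset, and table names are placeholders; credentials come from
# the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS).
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

query = """
    SELECT country, COUNT(*) AS sessions
    FROM `example-project.analytics.events`
    WHERE event_date >= '2017-01-01'
    GROUP BY country
    ORDER BY sessions DESC
    LIMIT 10
"""

# query() starts the job; result() blocks until it finishes and returns rows.
for row in client.query(query).result():
    print(row["country"], row["sessions"])
```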

ElasticSearch (great for event-level data)

ElasticSearch is a NoSQL store with a JSON-based query language that can perform ad-hoc queries on large datasets quickly and without extensive configuration. While not traditionally thought of as a database, ElasticSearch is rapidly gaining momentum as a datasource for analytics, due to a combination of fast full-text search and aggregation in data cubes.
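
A small sketch of the kind of aggregation query described above, using the official Python client against a hypothetical events index (the endpoint, index name, and field names are assumptions):

```python
# Ad-hoc aggregation over event-level data in ElasticSearch.
# The local endpoint, index name, and field names are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

resp = es.search(
    index="events",
    body={
        "size": 0,  # we only want the aggregation, not the raw hits
        "query": {"match": {"page": "pricing"}},  # full-text filter
        "aggs": {
            "by_day": {
                "date_histogram": {"field": "timestamp", "interval": "day"},
                "aggs": {"unique_users": {"cardinality": {"field": "user_id"}}},
            }
        },
    },
)

for bucket in resp["aggregations"]["by_day"]["buckets"]:
    print(bucket["key_as_string"], bucket["unique_users"]["value"])
```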

Snowflake (best for concurrent reads)

Snowflake is a cloud-native relational SQL warehouse that separates storage from compute, using a distributed query execution engine with columnar storage, pipelining, and data pruning to speed up aggregations. Snowflake is very good at concurrent queries, which makes it a good fit for high-volume, interactive workloads.
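
A minimal sketch with the snowflake-connector-python package (the account, credentials, database, and warehouse names are placeholders); because storage and compute are separate, concurrent readers can each be pointed at their own virtual warehouse:

```python
# Minimal Snowflake query using the official Python connector.
# Account, credentials, database, and warehouse names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="analyst",
    password="...",
    warehouse="REPORTING_WH",   # compute is provisioned separately from storage
    database="ANALYTICS",
    schema="PUBLIC",
)

cur = conn.cursor()
try:
    cur.execute("""
        SELECT plan, COUNT(DISTINCT user_id) AS active_users
        FROM daily_activity
        WHERE activity_date >= '2017-01-01'
        GROUP BY plan
    """)
    for plan, active_users in cur.fetchall():
        print(plan, active_users)
finally:
    cur.close()
    conn.close()
```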

Panoply (easiest way to get started)

Panoply is a “smart data warehouse.” They use Redshift as their datastore, but have a layer of intelligence built in that uses ML to optimize Redshift over time by doing things like rewriting queries to use automatically generated summary tables. Their loading and transformation processes are more intuitive and require less customization than conventional tools to achieve similar results.

Getting data into your warehouse is easier than it’s ever been

Writing ETL scripts is the bane of any data engineer's existence. It's time-consuming and inefficient. Today, you can use a variety of plug-and-play tools that let you create a data pipeline with little to no code. We've seen great results with:

Segment (best for marketing data)

Segment is a customer data hub that lets you integrate with hundreds of tools and stream events into a Redshift, Postgres, or MySQL warehouse. This works well for collecting marketing and other event data across a variety of sources.
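
Sending events to Segment from Python takes a few lines with their analytics library; once events are flowing, the warehouse sync happens on Segment's side. A small sketch (the write key, user id, and event properties are placeholders):

```python
# Sending events to Segment with the analytics-python library.
# The write key, user id, and event properties are placeholders.
import analytics

analytics.write_key = "YOUR_WRITE_KEY"

# Attach traits to a user once...
analytics.identify("user_1234", {"email": "jane@example.com", "plan": "pro"})

# ...then track events as they happen; Segment fans them out to your
# destinations and syncs them into your warehouse on a schedule.
analytics.track("user_1234", "Report Generated", {
    "report_type": "weekly_summary",
    "duration_ms": 412,
})

analytics.flush()  # force-send queued events (useful in short-lived scripts)
```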

AWS Glue (perfect for AWS data)

AWS Glue is a new ETL service from AWS that simplifies and automates data discovery, conversion, mapping, and job scheduling. AWS Glue also supports direct access to S3, RDS, and Redshift. This service is useful for taking at-rest data from a variety of data sources and getting it all into one place.
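
Glue crawlers and jobs can be driven from boto3. A minimal sketch that kicks off a crawler over an S3 location and later lists the tables it discovered (the crawler and database names are placeholders and are assumed to already exist):

```python
# Driving AWS Glue from boto3: run a crawler, then inspect the catalog it builds.
# The crawler name and Glue database name are placeholders and must already exist.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start a crawler that scans an S3 prefix and infers table schemas.
glue.start_crawler(Name="raw-events-crawler")

# Later, once the crawler has finished, list what it discovered.
tables = glue.get_tables(DatabaseName="raw_events")
for table in tables["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```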

Apache Spark (pre-aggregate large datasets)

Apache Spark is a data processing engine that works well for large datasets. You can use Spark to aggregate large datasets into summary tables that make reporting significantly more performant. Spark only makes sense if your data volume is large; there are easier alternatives for smaller datasets.
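
A small PySpark sketch of the pre-aggregation pattern described above: roll raw events up into a daily summary table that the reporting layer can query cheaply (the S3 paths and column names are placeholders):

```python
# Pre-aggregating raw events into a summary table with PySpark.
# The input/output paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-event-rollup").getOrCreate()

events = spark.read.parquet("s3://example-bucket/raw/events/")

# Roll raw event rows up into one row per day, event type, and country.
daily_summary = (
    events
    .withColumn("event_date", F.to_date("timestamp"))
    .groupBy("event_date", "event_type", "country")
    .agg(
        F.count("*").alias("events"),
        F.countDistinct("user_id").alias("unique_users"),
    )
)

# Write the much smaller summary table back out for the reporting layer.
daily_summary.write.mode("overwrite").parquet("s3://example-bucket/summary/daily_events/")

spark.stop()
```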

Stitch (replace scripts with a GUI)

Stitch is an ETL-as-a-service company offering a multitude of connectors that make it easy to pump data from all your sources into a data warehouse of your choice. This is a great option if you're doing traditional ETL work and want to replace manual scripts with a more centralized and automated solution.

Future-proof your stack

With how quickly things change, set up your stack in a way that will save you from pulling your hair out later. If you build using a set of components, you can add or remove individual ones without completely scrapping your work. Thinking ahead also lets you take advantage of the innovations the next generation of data tools will bring.
