Design Decisions & Tradeoffs
Scope of Telemetry Data
One of the first major design decisions was the scope of telemetry signals to include. The current version of Vispyr captures profiles, traces, and metrics, but not logs. Vispyr’s initial focus was on continuous profiling and combining it with other telemetry signals. While metrics and traces were relatively straightforward to instrument, incorporating logs introduced additional complexity.
First, while the OTel logs signal itself is stable, the SDK tooling is in varying states of maturity; in the JavaScript SDK, logs support is still in development. Working around this would mean deploying separate instrumentation for logs and ensuring it remained compatible with our pipeline and the OTLP protocol. Second, logs are unique among telemetry signals in that developers typically already produce them as part of existing processes, so any additional instrumentation would need to account for the tooling already used to produce and store logs. This is how OTel itself approaches logs, as stated on their website: “OpenTelemetry does not define a bespoke API or SDK to create logs. Instead, OpenTelemetry logs are the existing logs you already have from a logging framework or infrastructure component.”
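To illustrate the extra work involved, the sketch below shows what a Node.js log bridge might look like, forwarding an existing Winston logger’s output through OTLP. The package names, endpoint, and wiring are assumptions based on the OTel JavaScript ecosystem, not Vispyr’s configuration, and the logs SDK API has shifted between versions while support matures.

```js
// Hypothetical sketch of a log bridge; not part of Vispyr.
// Package names and the OTLP endpoint are assumptions.
const { logs } = require('@opentelemetry/api-logs');
const { LoggerProvider, BatchLogRecordProcessor } = require('@opentelemetry/sdk-logs');
const { OTLPLogExporter } = require('@opentelemetry/exporter-logs-otlp-grpc');
const { OpenTelemetryTransportV3 } = require('@opentelemetry/winston-transport');
const winston = require('winston');

// A dedicated logs pipeline, separate from the traces/metrics SDK.
// Note: this API has changed across SDK versions as logs support evolves.
const loggerProvider = new LoggerProvider();
loggerProvider.addLogRecordProcessor(
  new BatchLogRecordProcessor(new OTLPLogExporter({ url: 'http://localhost:4317' }))
);
logs.setGlobalLoggerProvider(loggerProvider);

// The application keeps its existing logger and transports; the bridge
// transport copies each record into the OTel logs pipeline.
const logger = winston.createLogger({
  transports: [new winston.transports.Console(), new OpenTelemetryTransportV3()],
});
logger.info('logs now flow to both the console and the OTLP endpoint');
```

Every application would need a bridge like this tailored to its particular logging framework, which is precisely the complexity the decision above avoids.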
While incorporating logs would be a natural complement to the other three telemetry signals included in Vispyr, the benefit did not outweigh the added complexity of accounting for the various forms of user log management and the lack of production-ready tooling in the OTel SDK. As such, we chose to omit logs for the current version of Vispyr.
The Observability Stack
Agent and Gateway Collectors
Deciding which collector to use for the telemetry pipeline came down to the upstream OpenTelemetry Collector or Grafana Alloy, an open-source vendor distribution of the OTel Collector.
Our initial preference was the vendor-agnostic option, but OTel’s lack of profiling support was a heavy tradeoff. As mentioned previously, OTel lacks a production-ready profiler, and the profiles signal is still in development. Without an OTLP-compatible profiles signal, sending profiles would require bypassing the agent-gateway telemetry pipeline, isolating profile data, increasing deployment complexity, and constraining our ability to debug issues between Vispyr components.
[Figure: Telemetry pipeline with profiles bypassing the OTel Agent and Gateway Collectors]
The alternative was to replace both the agent and gateway collectors with Grafana Alloy. Alloy offers support for Pyroscope, an open-source continuous profiler, which we chose for instrumentation and data storage (see below for the decision related to Pyroscope).
[Figure: Telemetry pipeline with the Alloy Agent and Gateway Collectors processing profiles with all telemetry data]
This allowed for a unified telemetry pipeline and simplified the connection between the agent collector and the Vispyr backend. The downside is that Alloy uses its own configuration format and custom collector components. Because Alloy is open source and built on the upstream OTel Collector, however, we considered this an acceptable compromise for adopting vendor-specific tooling. For these reasons, we chose Alloy as both the agent and gateway collectors.
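As a rough sketch of what that unified pipeline looks like in Alloy’s configuration format, the agent below receives OTLP traces and metrics alongside Pyroscope profiles and forwards all three signals to the gateway. Component labels and endpoints are placeholders, not Vispyr’s actual configuration.

```alloy
// Illustrative agent pipeline; labels and endpoints are placeholders.

// Receive traces and metrics over OTLP from the instrumented application.
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  output {
    traces  = [otelcol.exporter.otlp.gateway.input]
    metrics = [otelcol.exporter.otlp.gateway.input]
  }
}

// Receive profiles pushed by the Pyroscope SDK.
pyroscope.receive_http "default" {
  http {
    listen_address = "0.0.0.0"
    listen_port    = 9999
  }
  forward_to = [pyroscope.write.gateway.receiver]
}

// Forward everything to the gateway collector in the Vispyr backend.
otelcol.exporter.otlp "gateway" {
  client {
    endpoint = "gateway.internal:4317"
  }
}

pyroscope.write "gateway" {
  endpoint {
    url = "http://gateway.internal:4040"
  }
}
```

A single configuration file carrying all three signals is what keeps the agent-gateway connection simple.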
Continuous Profiler
Pyroscope vs Parca
For profiles, we considered two open-source options: Parca by Polar Signals and Pyroscope. Parca is an eBPF-based system profiler and was the first tool we examined. It matched our goal of a simple-to-deploy instrumentation tool requiring minimal customization of the user’s application. However, as an eBPF-based profiler it comes with the inherent limitations of eBPF tooling described previously, specifically its dependence on newer Linux kernels and the lack of production-ready OTLP support. Additionally, Parca is designed as a self-contained tool with its own data storage and visualization components. While it supports connecting its data to Grafana, it does not support sending instrumentation data through our agent and gateway collectors, which would reintroduce the data isolation issue described in the Agent and Gateway Collectors decision above.
Our second option was Pyroscope, an open-source profiler that is part of Grafana’s ecosystem and capable of sending telemetry through Grafana’s OTel-based collector, Alloy. It also has SDKs for instrumentation, meaning we could deploy profiling alongside all other telemetry instrumentation in a single step. The tradeoff was that maintaining a single pipeline required using Grafana Alloy and removing the OTel Collector.
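As an example of that single-step instrumentation, a minimal Node.js setup with Pyroscope’s SDK might look like the sketch below; the server address and app name are placeholders rather than Vispyr’s actual values.

```js
// Minimal Pyroscope instrumentation sketch; values are placeholders.
const Pyroscope = require('@pyroscope/nodejs');

Pyroscope.init({
  serverAddress: 'http://localhost:9999', // the agent collector's profile endpoint
  appName: 'example-service',
});

Pyroscope.start(); // begin continuous profiling alongside the OTel SDK
```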
Ultimately, the benefits of Pyroscope’s maturity, ease of instrumentation, and compatibility with our agent-gateway architecture outweighed any tradeoffs.
Data Storage
Mimir vs Prometheus
The decision for the metrics data store came down to Prometheus and Grafana Mimir. Similar to Alloy’s relationship with the OpenTelemetry Collector, Mimir is an extension of Prometheus, offering horizontal scalability, high availability, and multi-tenancy for long-term metric storage. While beneficial for large microservice infrastructures, Mimir’s additional complexity and overhead are unnecessary for the smaller, monolithic architectures that align with the environment of the intended Vispyr user. Without a compelling reason to deviate from the industry standard, we chose Prometheus as the metrics data store.
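For context on how metrics reach Prometheus in a pipeline like this, the sketch below shows a gateway-side Alloy configuration converting OTLP metrics and remote-writing them into Prometheus. The endpoint is a placeholder, and this assumes Prometheus is started with its remote-write receiver enabled.

```alloy
// Illustrative: convert OTLP metrics to Prometheus format and remote-write
// them into the data store. Assumes Prometheus runs with
// --web.enable-remote-write-receiver.
otelcol.exporter.prometheus "to_prometheus" {
  forward_to = [prometheus.remote_write.store.receiver]
}

prometheus.remote_write "store" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}
```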
Tempo vs Jaeger
The decision for traces focused on Jaeger and Grafana Tempo. While similar in purpose, the two data stores take distinct approaches to storage. Jaeger relies on external databases, such as Cassandra or Elasticsearch, to store traces. In contrast, Tempo’s only dependency is basic object storage, and it ships with configuration options for the major cloud providers’ object storage services. Choosing Tempo would simplify deployment, as an AWS S3 bucket could be provisioned quickly alongside the rest of the AWS resources in the Vispyr backend. The simplicity of setting up the underlying storage mechanism made Tempo the clear choice for trace storage.
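Much of that appeal is visible in how little configuration Tempo’s S3 backend needs; the sketch below is illustrative, with a placeholder bucket name and region rather than Vispyr’s actual settings.

```yaml
# Illustrative Tempo storage block; bucket and region are placeholders.
storage:
  trace:
    backend: s3
    s3:
      bucket: vispyr-traces
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
```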
Custom UI vs Provisioned Grafana Instance
The decision for the visualization layer came down to building a custom interface or leveraging Grafana’s visualization tools.
The rationale for a custom UI was that it would create a streamlined experience focused exclusively on the telemetry views most relevant to our stack and our target users, potentially providing a more approachable entry point for teams new to observability. However, this would come at the cost of flexibility, as users would be limited to the same predefined views. Deploying the Grafana UI as the visualization layer would provide extensive capabilities to dive deep into granular data, but at the cost of less customizable visualizations and a steeper learning curve for Vispyr’s target users.
Ultimately, the benefits of giving developers the flexibility to investigate beyond predefined views and dashboards outweighed the alternative. With the inclusion of a preprovisioned dashboard, Vispyr still provides simplified views into the telemetry data while allowing access to Grafana’s extensive querying and drilldown features for more specific investigation.
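Grafana supports this pattern through file-based provisioning; a sketch of a dashboard provider definition is shown below, with placeholder names and paths rather than Vispyr’s actual provisioning files.

```yaml
# Illustrative Grafana dashboard provisioning file; names and paths are placeholders.
apiVersion: 1
providers:
  - name: vispyr-dashboards
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
```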
Infrastructure Choices
[Figure: Infrastructure decision between EC2 and ECS + Fargate deployment]
Vispyr’s backend is fully containerized, so our first idea was to use AWS Elastic Container Service (ECS), incorporating AWS Cloud Map for service discovery and network routing. ECS would simplify container deployment and management, and adding Fargate would provide horizontal scalability for each data store as needed. However, the tradeoff for ECS is cost: for five containerized applications, ECS can quickly become an expensive service, particularly in conjunction with Fargate.
For Vispyr’s requirements, the benefits of ECS did not outweigh the potential cost. The alternative was to deploy directly to an EC2 instance, which still met all the requirements for Vispyr’s backend while being the least expensive option.
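On a single EC2 instance, the five backend containers can be run with a plain Docker Compose file; the sketch below is an assumption-laden outline of that layout, not Vispyr’s actual manifest, with illustrative images, ports, and flags.

```yaml
# Illustrative compose file for the backend containers; images, ports,
# and flags are assumptions.
services:
  alloy:          # gateway collector
    image: grafana/alloy:latest
    ports: ["4317:4317"]
  pyroscope:      # profile storage
    image: grafana/pyroscope:latest
  prometheus:     # metrics storage
    image: prom/prometheus:latest
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--web.enable-remote-write-receiver"
  tempo:          # trace storage
    image: grafana/tempo:latest
  grafana:        # visualization layer
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```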