Observability Platform

Designing cohesive observability experience to monitor resources

DigitalOcean provides AI-native cloud platform accessed through control panel (called console). It provided basic monitoring for very limited parameters. However, DigitalOcean needed to replace a fragmented, siloed monitoring experience with a single unified observability platform across multiple distinct product areas. The challenge wasn't just design complexity — it was people complexity. As the sole designer, I had to align multiple PMs across platform, each with different priorities and definitions of success and shape a cohesive experience from their competing requirements.

MY ROLE

Lead Designer — sole designer across all product areas

TEAM

Multiple PMs (one per product area), Engineering, AI/ML

PLATFORM

DigitalOcean Control Panel — cloud infrastructure mngt

SCOPE

AI Inference · Kubernetes · GPU Droplets · Monitoring

CONTEXT

A PLATFORM THAT COULDN’T SEE ITSELF

DigitalOcean's existing monitoring experience was built product by product, team by team. Each product area had its own interpretation of what observability meant: some offered basic metrics at the resource level, others had nothing at all. Users — developers, DevOps leads, CTOs managing large infrastructure — had to navigate multiple disconnected surfaces to understand what was happening across their systems.

The business goal was clear: replace the legacy "Insights" tab with a high-fidelity, integrated observability platform — a single pane of glass for metrics, logs and alerts across all DigitalOcean products. The goal was to help users reduce time to recovery, optimize cost of ownership and drive adoption toward paid high-fidelity GPU telemetry.

DigitalOcean's Disconnected monitoring meant slow troubleshooting, blind spots and runaway costs. Developers were piecing together their system's health from four different places — none of which spoke to each other.

- Synthesis from user research and stakeholder discovery

USERS

The platform needed to serve three distinct user types with very different needs:

  • Lead devs & DevOps leads managing infrastructure at scale;

  • CTOs tracking cost and resource efficiency

  • Kubernetes/GPU power users running large DOKS clusters, GPU Droplets with advanced telemetry requirements.

UNDERSTANDING USER NEEDS

Due to fast paced nature of the work and tight timeline to launch by DigitalOcean conference, there was not time for external user research. So I conducted research with internal users:

  • No. of users : 12

  • Format: 1:1 interviews

  • Selection criteria: Currently using metrics, logs etc.

THE REAL CHALLENGE - FRAGMENTATION

MULTIPLE PRODUCT AREA, MULTIPLE PMs, ZERO SHARED DEFINITION

The technical problem — designing an observability platform where same tech will render observability insights for each product — was complex but tractable. The organizational problem was harder. Each product area had its own PM, its own roadmap and its own understanding of what observability should look like for their users.

PRODUCT AREA 01

AI Agents & Inference

PM needed cost transparency, token usage, latency distribution, and model performance data. Business metrics alongside technical health signals.

PRODUCT AREA 02

GPU Droplets

PM needed GPU-specific advanced metrics — SM efficiency, clock frequency, fault counts, thermal alerts — alongside standard compute health.

PRODUCT AREA 03

Kubernetes

PM needed cluster-level health, node metrics, control plane visibility, and GPU metrics for AI workloads. Complex, technical, multi-level hierarchy.

PRODUCT AREA 01

Monitoring Dashboard

PM needed a fleet-wide GPU management view across all resource types — a control room, not a detail page. Different scope entirely.

THE REAL CHALLENGE - COLLABORATION

NO REQUIREMENTS DOC. JUST PEOPLE WHO KNEW THEIR DOMAIN

There were no formal requirements handed to me. Each PM understood their product area deeply — they knew what metrics mattered, what engineers were building, what users were asking for in support tickets — but that knowledge lived in their heads, not in documents. My first job was to extract it, structure it, and turn it into something I could design from.

I ran a dedicated discovery session with each PM and their engineering lead. I'd go in with a set of questions aboutcurrent knowledge, user needs, technical constraints etc.


HOW I STRUCTURED COLLABORATION WITH EACH TEAM

  • Discovery with PM: Understand the metrics to be surfaced, the user problems we were solving & the business goals behind the feature

  • Technical deep-dive with engineers: Understand what data was available, what was technically feasible to surface in real time & backend work vs frontend-only changes

  • Design exploration: Present multiple visual directions for how to show the same data — so PMs and engineers could react to something concrete, not hypothetical

  • Iterative review cycles: Share work-in-progress designs with each team separately, then bring cross-team decisions back to the group to check for consistency

  • Finalization: Sign-off from both PM and engineering lead before moving to the next area — preventing late-stage surprises or scope creep


Building direct working relationships with engineers — not routing everything through PMs — made the designs more technically grounded and saved significant rework later.

CORE DESIGN CHALLENGE

BUILD ONE COHERENT OBSERVABILITY EXPERIENCE FOR DIVERSE PRODUCTS

How do you build one coherent observability experience when each product area legitimately requires different information, different hierarchy, and different interaction patterns? A single template wouldn't work. But four completely divergent experiences would destroy the product's value proposition.


INSIGHT - The alignment problem I had to solve before any design could start

Without a shared framework, each PM would pull the design toward their area's needs. I needed to establish a common design language and structural pattern that every product area could adopt while still meeting their specific requirements.

I proposed a layered observability model: a consistent shell (health overview → metrics → logs) that each product area could populate differently, with a shared visual system, shared component library and shared interaction patterns. The content within would be tailored per area. This was the negotiation I ran with every PM before I opened a design tool.


DESIGN STRATEGY

ONE SYSTEM, MULTIPLE EXPRESSIONS

#1 - Hierarchical Observability Experience

The architectural principle I established was a three-level observability hierarchy consistent across all product areas: platform-level overview, fleet/resource-level detail and individual resource deep-dive. Users moving between different products would always know where they were in that hierarchy.

#2 - Basic and Advanced Metrics

I introduced the concept of Advanced Metrics — a tiered model that kept the free-tier experience clean and focused, while unlocking richer telemetry (GPU-specific metrics, control plane data, inference latency percentiles) for paid tiers. This wasn't just a business decision — it was a UX decision that let me keep default views uncluttered without hiding critical data from power users who needed it.


DESIGN PRINCIPLES I SET FOR THE EXPERIENCE

  • Simplicity first: Abstract complexity — surface the insight, not the raw signal. Users should understand system state at a glance, not after reading six charts

  • Contextual: Every metric should be meaningful where it appears. GPU clock frequency belongs on the GPU Droplet page — not as a generic "advanced metric" buried in a settings panel

  • Intuitive: Users shouldn't need documentation to understand what's healthy, what's degraded, and what to do next

  • Scalable: The system needed to accommodate product areas that didn't exist yet. Every pattern had to work for future DigitalOcean products, not just today's four


DESIGN EXPLORATIONS

FROM LIST OF METRICS TO VISUAL LANGUAGE

Once each PM had walked me through their required metrics, the design challenge shifted: how do you show this data in a way that's immediately useful, not just technically accurate?

What makes a metric meaningful is how it's presented — the chart type, the time range, the hierarchy it sits within, the context it's given.

Exploration #1

Exploration #2

Exploration #3

DESIGN - PRODUCT AREA 01

AI Inference — Cost, Performance & Model Intelligence

For AI Inference product area, users weren't just monitoring infrastructure — they were monitoring model behavior and cost. Token usage, latency distribution, cost per model, savings versus competitors — these were business metrics as much as technical ones.


DECISION - Analyze & Manage as a unified cost + performance surface

The PM initially wanted separate pages for cost and performance. I pushed back — users making cost decisions need performance context and vice versa. I designed the Analyze & Manage view to answer "what am I spending, on what and is it performing?" on a single surface — with a savings comparison against direct provider pricing as the hook that made the value case for DigitalOcean immediately visible.


Analyze & Manage unified cost and performance view with competitor savings comparison

Optimize Insights — evaluation health, revenue breakdown and token consumption trends

Dedicated Inference Insights — SLA compliance, GPU utilization, endpoint request metrics & latency percentiles

Serverless Inference Insights — cost, token usage and errosr

DESIGN - PRODUCT AREA 02

GPU DROPLETS : TRANSLATING SIGNALS INTO ACTIONABLE INSIGHTS

GPU Droplets required the deepest domain knowledge of any product area. GPU-specific metrics — SM (Streaming Multiprocessor) efficiency, clock frequency, ECC errors, fault counts — are meaningful to ML engineers running heavy compute workloads, but meaningless noise to a developer who accidentally upgraded to a GPU droplet. The design had to serve both without overwhelming either.


DECISION - Contextual health alerts with recommendations — not just numbers

Raw GPU metrics are hard to interpret. SM clock frequency at 847 MHz means nothing unless users know the threshold is 1500 MHz and thermal throttling is occurring. I designed contextual health cards that translate signals into verdicts: "Critical — Degrading" with an explanation and a concrete recommendation. This shifted the experience from data display to decision support.


GPU Droplets Insights — health verdicts with context and recommendations, plus tiered raw and advanced metrics

GPU Droplets Logs — searchable log stream with severity distribution and OpenTelemetry-compliant export

DESIGN - PRODUCT AREA 04

MONITORING DASHBOARD — FLEET-WIDE CONTROL ROOM

The Observability Dashboard was the highest-level surface in the platform — a fleet-wide view of resource health across all resource types. Unlike the product-area pages, this wasn't about understanding one cluster or one droplet. It was about answering "how is my entire infrastructure performing right now?" for teams running dozens of nodes.


DECISION - OTel-compatible export and PCI health as first-class signals

Enterprise GPU users don't just want to see their metrics just in DigitalOcean — they want to pipe them into Datadog, Grafana or their own monitoring stack. I designed the dashboard with OTel (OpenTelemetry) export as a primary action, not a buried settings option. I also elevated PCI health status — healthy vs. degraded GPU links — as a top-level concern alongside utilization and temperature. For AI teams running distributed training, PCI link degradation is a critical failure mode that no existing dashboard surfaced clearly.


Observability Dashboard — fleet-wide health, triggered alerts, PCI health status & OTel-compatible

Create Dashboard — Users can create customized dashboards to track specific metrics

Create Dashboard — Add & customize metrics to dashboard

OUTCOME

A COHESIVE PLATFORM THAT HOLDS TOGETHER

The core deliverable was a cohesive observability experience across multiple product areas that had never shared a design language before. Users could now move from their Kubernetes cluster health to their inference cost dashboard to their GPU Droplet metrics without re-learning a new interface at each step.

The Advanced Metrics tiering created a clear upgrade path from free to paid GPU telemetry — making the business case visible inside the product itself.

PRODUCT OUTCOME

COHESIVE OBSERVABILITY EXPERIENCE

One coherent system from multiple independent roadmaps

DESIGN OUTCOME

ONE REUSABLE DESIGN SYSTEM

Shared components, patterns & hierarchy across all surfaces for consistency and faster development

0 → 1 launch

3 MONTHS TO SHIP

From first design session to Public Preview in the production codebase at DigitalOcean’s Annual Conference