Platform Engineering Podcast

Cory O'Daniel, CEO of Massdriver

Country USA

Genres News, Business, Careers, Tech News

Language EN

Episodes 49

Latest 24.06.2026

The Platform Engineering Podcast, hosted by Cory O'Daniel, CEO of Massdriver, explores the real-world challenges of building and running internal platforms. Each episode features candid conversations with engineers and leaders about org structure, infrastructure design, developer experience, and the tradeoffs behind platform engineering decisions. The show focuses on practical insights rather than trends, drawing on Cory's two decades of experience in infrastructure and platform building.

Episodes

What Do Service Meshes Actually Solve? (William Morgan, Buoyant/Linkerd) 24.06.2026 55min

Network calls fail in ways function calls never do - and once a monolith becomes microservices, reliability problems show up fast: retries amplify load, latency spikes cascade, and “what talks to what?” becomes hard to answer.

William Morgan, co-creator of Linkerd and the person who coined “service mesh,” breaks down what service meshes actually solve for platform teams running Kubernetes at scale. The conversation focuses on practical outcomes: improving reliability between services, getting uniform observability without rewriting every app, and handling gaps Kubernetes doesn’t cover well - like gRPC/HTTP2 load balancing and cross-environment communication.

Key topics:
- Why reliability is the first “microservices tax” (timeouts, retries, backoff, cascading failure)
- What Kubernetes does not solve at the networking layer - and where a service mesh fits
- gRPC/HTTP2 load balancing problems and why L4 balancing can fall short
- Service-to-service visibility: understanding traffic flows and performance without per-app instrumentation
- Cost and resilience tradeoffs with multi-AZ Kubernetes on AWS (and how zonal-aware balancing can help)
- Whether developers should ever need to interact with service mesh configuration
- Where zero trust and policy controls belong: platform guardrails vs application ownership

Guest: William Morgan, CEO at Buoyant, Co-Creator of Linkerd

William Morgan brings a unique take on platform engineering, security, and traffic management in cloud native environments. William is the mind behind Linkerd, the CNCF-graduated service mesh built to make security, observability, and reliability "just work" for modern apps without heavy overhead. With roots as an infrastructure engineer at Twitter, where he was hands-on in the shift to microservices, and experience at Microsoft, Powerset, Adap.tv, and MITRE, William understands operational complexity better than most. His perspective on reducing unpredictable cloud spend with features like Linkerd’s High Availability Zonal Load Balancing is timely for any team wrestling with multi-AZ cloud bills.

William has hands-on knowledge of MCP, the protocol now critical for securing enterprise AI traffic. He also has strong views on sustainable open source business models, having contributed to open source for over 20 years.

- William Morgan, Bluesky
- Buoyant, Website
- Buoyant, LinkedIn
- Buoyant, YouTube
- Linkerd, GitHub

Links to interesting things from this episode:
- The Service Mesh Landscape
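The "microservices tax" in the topic list above (timeouts, retries, backoff) can be made concrete with a small sketch. This is not Linkerd's implementation or configuration, only an illustration of client-side retries with capped exponential backoff and full jitter. The names `TransientError` and `call_with_retries` and all default values are assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, resets)."""


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying transient failures with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Exponential ceiling: base, 2*base, 4*base, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the ceiling so clients
            # that failed together do not all retry at the same instant.
            sleep(rng() * ceiling)
    raise AssertionError("unreachable")
```

The hard cap on attempts matters as much as the backoff: without it, every layer that retries multiplies the load on a service that is already struggling, which is the "retries amplify load" problem the episode description mentions.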
Continuous Integration at Agentic Velocity with CircleCI’s Rob Zuber 10.06.2026 50min

When code gets cheaper to produce, feedback becomes the limiting factor - CI, reviews, and the handoffs between tools can quietly slow everything down.

Rob Zuber breaks down what platform engineers are seeing as teams adopt AI-assisted development: more branch builds, new failure modes, and growing pressure to shorten the loop between “change made” and “change validated.” He focuses on how CI can evolve from a human-first dashboard into a system that agents can interact with directly through APIs, CLIs, and MCP-style interfaces - so fixes can happen faster and with less waiting on manual triage.

Along the way, Rob and Cory dig into practical questions engineering leaders are wrestling with: how PR review becomes the next major bottleneck, what “agent experience” means in a delivery pipeline, why speed isn’t only about faster compute (it’s also about doing less unnecessary work), and how teams can share learnings so “agentic velocity” doesn’t only benefit a few power users.

If you’re building or running the systems that ship software, this is a clear look at where CI fits in an AI-accelerated workflow, and what needs to change to keep delivery safe, fast, and sustainable.

Guest: Rob Zuber, Chief Technology Officer at CircleCI

Rob Zuber is a 20-year veteran of software startups, a four-time founder, and three-time CTO. Since joining CircleCI, Rob has seen the company through its Series F funding and delivered on product innovation at scale while leading a team of 300+ engineers who are distributed around the globe.

- CircleCI, Website
- CircleCI, LinkedIn
- CircleCI, GitHub

Links to interesting things from this episode:
- “The Confident Commit” podcast
- Wardley Mapping
- “How one programmer broke the internet by deleting a tiny piece of code.”
Durable Execution for Real‑World Failures with Temporal’s Cornelia Davis 27.05.2026 46min

A lot of infrastructure and automation fails for ordinary reasons: rate limits, flaky networks, partial permissions, long-running jobs, and retries that vanish when the process restarts. Durable execution is a way to design systems that keep going anyway - without rebuilding a maze of queues, cron jobs, and manual cleanup.

Cornelia Davis breaks down how durable execution works in practice: writing “normal” code while the runtime provides durable retries, state management, and the ability to pause work, wait for a human or external change (like a quota increase), and resume right where things left off. The conversation connects these ideas to platform engineering realities - Terraform workflows, long provisioning times, and “orphan” resources - and explains how Temporal workflows and activities help teams model failure handling as a first-class part of the system.

You’ll also hear why this approach is showing up in AI engineering: long-running agent workflows, frequent rate limiting, and the need to avoid re-running expensive LLM calls when something breaks near the end.

Guest: Cornelia Davis, Developer Advocate at Temporal Technologies and author of “Cloud Native Patterns”

Cornelia Davis is a Developer Advocate at Temporal, where she brings more than three decades of experience as a software technologist to help engineers build resilient, scalable systems. Known for her pragmatic blend of hands-on coding, technical strategy, and customer collaboration, Cornelia is passionate about helping developers unlock the full potential of modern cloud-native architectures. Previously, she served as VP of Technology at Pivotal, where she played a key role in shaping Cloud Foundry and enabling enterprise cloud transformations. Whether she’s writing code, presenting at conferences, or whiteboarding with teams, Cornelia is driven by a singular goal: empowering developers to build better software. Outside of tech, she recharges on the yoga mat or in the kitchen, where she brings the same creativity and focus to her practice.

- Temporal, Website
- Temporal, GitHub
- Temporal Community, GitHub
- Temporal’s AI-assisted development tools

Links to interesting things from this episode:
- Temporal Developer Skill
- “Cloud Native Patterns” by Cornelia Davis
You Need AI Sysadmins Can Trust, With Cribl's Nikhil Mungel 13.05.2026 55min

What happens when a non-deterministic AI system is asked to touch production telemetry or generate changes for an SRE pipeline? The cost of being “close enough” can be lost data, downtime, or a security incident.

Cribl’s Nikhil Mungel joins Cory to break down what it takes to build AI that sysadmins can actually trust. The conversation digs into harness engineering and the practical guardrails that turn probabilistic models into repeatable, verifiable outcomes. They cover why breaking work into small chunks matters, how validation and testing become the real leverage point for AI-native development, and what “code factories” mean for review, CI, and platform reliability when teams can generate a thousand PRs an hour.

Platform engineers will also hear a pragmatic take on the future of the job. The focus shifts away from typing code and toward building systems for verification, simulation, and safe deployment at scale, plus clearer ways to decide what needs human scrutiny and what can ship automatically.

Guest: Nikhil Mungel, Head of AI R&D at Cribl

Nikhil Mungel is the Head of AI R&D at Cribl, where he's building LLM-powered systems for IT and Security data transformation and analysis. Before Cribl, he spent over a decade developing distributed systems across the observability and consumer social tech landscape. He lives in San Francisco with his wife and two kids. His current focus is applying AI to make complex infrastructure more intuitive and explainable.

- Nikhil Mungel, Website
- Nikhil Mungel, X
- Cribl, Website
- Cribl, LinkedIn

Links to interesting things from this episode:
- Cribl Guard
- “Open source died in March. It just doesn't know it yet.” by Dan Lorenc, CEO of Chainguard
Green CI and Merge Queue Mastery with Trunk’s Eli Schleifer 15.04.2026 49min

When a flaky test can stall a merge queue, “just rerun CI” stops scaling fast.

Cory talks with Trunk co-founder and CEO Eli Schleifer about the outer loop problems that show up as teams ship more code - especially with AI-assisted development increasing PR volume. They break down what a merge queue is, why logical merge conflicts happen even when individual PRs are green, and how predictive testing helps protect main without forcing constant retesting.

Eli also explains how Trunk approaches flaky tests: collecting JUnit results, using quarantines so known flakes don’t block delivery, and fingerprinting failures to tell the difference between “this always times out” and “this was just broken by a recent change.” The conversation closes on how review and quality practices may shift as code generation accelerates - and what still needs strong guardrails like tests, security checks, and reliable CI signals.

Guest: Eli Schleifer, co-founder and CEO of Trunk

Eli Schleifer leads Trunk’s technical vision and product strategy, focused on closing the gap between AI-speed code generation and human-speed delivery by removing the bottlenecks that slow modern engineering teams. Trunk’s platform eliminates flaky tests, resolves merge queue constraints, and redesigns CI systems to enable high-throughput, continuous delivery.

Prior to founding Trunk, Eli was CTO at Directr, which was acquired by Google, and has served in engineering leadership roles at YouTube, Uber, and Microsoft.

- Eli Schleifer, X
- Trunk, website
- Trunk, Slack
- Trunk, GitHub
- Trunk, X

Links to interesting things from this episode:
- Balsamiq
- “Code First Engineering” by Eli Schleifer
AI-Native Ops: Making AI Safe for Production with William Collins 01.04.2026 1h 3min

What happens when your “coworker” can generate code and changes faster than your team can review them, and production still has to stay up?

William Collins breaks down what AI-Native Ops looks like when you take reliability seriously: where reasoning should stop, where deterministic automation should begin, and how guardrails like compliance checks, version pinning, and controlled workflows keep AI from turning into outage fuel. Cory and William also dig into why context windows and tool sprawl matter in real systems, how protocols like MCP and agent-to-agent communication are shaping day-to-day automation, and why regulated environments can’t adopt new tech with hype-driven shortcuts.

If you’re a platform engineer trying to balance speed with safety, this conversation offers a practical way to use AI for the work that drags teams down, without giving up operational discipline.

Guest: William Collins, Director of Technical Evangelism at Itential, AWS Community Builder, and co-host of the Cloud Gambit podcast

William Collins is a strategic thinker and catalyst for innovation. Over his career, he has helped enterprises build large-scale networks, driven modernization through cloud adoption, and optimized complex environments through good design practices and automation. Today, William works as Director of Technical Evangelism for Itential, where he focuses on evangelizing the Itential Platform, fostering strong relationships with customers to fully realize their goals, engaging with the community, and advocating for the successful future of network, security, and automation infrastructure.

As a content creator, William hosts The Cloud Gambit Podcast with Eyvonne Sharp, a show that unravels the state of cloud computing, markets, strategy, and emerging trends with industry experts. He is also a LinkedIn Learning Instructor (Automation, Cloud, and Network Engineering Content), an AWS Community Builder (Network & Content Delivery), and a group organizer for the USNUA - Kentucky User Group (KYNUG).

Prior to Itential, William worked as a Principal Cloud Architect and Director of Technical Evangelism for Alkira, where he helped grow the company from lean beginnings to being ranked 25th Fastest-Growing Company in North America and 6th in the Bay Area on the 2024 Deloitte Technology Fast 500. He also held various senior technical roles across the enterprise space in Financial Services and Healthcare, most recently at Humana as Director of Cloud Architecture. Outside of tech, his time is spent with family, woodworking, ice hockey, and guitar.

Opinions expressed are solely his own and do not express the views or opinions of his employer.

- William Collins, Blog
- William Collins, YouTube
- William Collins, X
- William Collins, Instagram
- William Collins, TikTok
- William Collins, GitHub
- Itential
- “The Cloud Gambit” podcast

Links to interesting things from this episode:
- Ghostty
- “Harness design for long-running application development” by Anthropic
Infrastructure as Code's Hidden Problem with Pavlo Baron 18.03.2026 57min

Terraform drift, state wrangling, and a growing “tools for tools” stack are still daily work for many platform teams - despite a decade of DevOps talk and cloud maturity. Why does ops automation so often feel like it needs babysitting?

Pavlo Baron breaks down where Infrastructure as Code tends to break down in real organizations: manual drift management, low-level state complexity, and a lack of practical abstractions that let developers self-serve without inheriting the entire ops burden.

The conversation digs into what a more use-case-driven approach could look like - where teams can choose when to enforce desired state, when to accept emergency changes, and how to build “guardrails” that reduce mistakes without slowing delivery.

Pavlo also explains why type safety and constrained interfaces matter (especially as AI starts generating more code and infrastructure changes), and why the future of platform engineering depends less on slogans and more on systems that reduce toil.

Guest: Pavlo Baron, Co-Founder and CEO of Platform Engineering Labs

Pavlo Baron is Co-Founder and CEO of Platform Engineering Labs, a company crafting tools to remove the toil from operations work, with a current focus on infrastructure. He is a veteran in the space, having served in all kinds of roles throughout a career that spans more than 35 years. Previously, he was co-founder, CTO, and major inventor at Instana, an observability startup that was acquired by IBM in 2020. Pavlo is a frequent conference speaker and author of several books.

- Pavlo Baron, X
- https://pavlobaron.medium.com/
- https://github.com/platform-engineering-labs
- https://www.linkedin.com/company/platform-engineering-labs
- https://x.com/plateng_labs
- https://bsky.app/profile/platform.engineering
- https://mastodon.social/@plateng_labs
- https://www.youtube.com/@plateng-labs

Links to interesting things from this episode:
- The Pkl Primer
- formae
- formae quick start
- "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr"
- “Where everyone is responsible, no one is really responsible.” Albert Bandura
- JPL “Visions of the Future”
- “Fallout: New Vegas”
Why Extend Went All-In on Serverless Platform Engineering 04.03.2026 1h 2min

Billions of requests a month on AWS Lambda can cost less than a single engineer’s laptop budget, but only if the architecture and developer workflow are designed for it.

Justin Masse, Senior Platform DevOps Engineer at Extend, shares how Extend committed early to a serverless-first approach and built a platform that prioritizes developer speed and low operational toil. The conversation breaks down what it takes to run active-active, multi-region systems in a serverless world, how the team keeps services small and fast, and why asynchronous, event-driven design changes both reliability and cost.

You’ll also hear how Extend treats developer experience as a core platform responsibility: templated microservices, fast deployment pipelines, ephemeral environments for pull requests, and infrastructure that developers can own without becoming cloud specialists. A big theme is using AWS CDK and internal abstractions to keep infrastructure close to the application code, so teams can move quickly while keeping platform standards consistent.

Finally, the discussion gets practical about tradeoffs that show up after the “serverless is easy” pitch: local development challenges, the real cost center (observability), and where AI is helping today, including an internal agent that diagnoses failed deployments and suggests fixes.

What you’ll learn:
- Why Extend avoids servers and VPC complexity, and what they use instead
- Patterns for active-active, multi-region thinking in a serverless architecture
- How DevEx practices like templates and ephemeral environments reduce friction
- A pragmatic approach to IaC with CDK and reusable internal constructs
- Where serverless costs stay low, and why observability often dominates the bill
- How AI is being applied to platform workflows without skipping engineering judgment

Guest: Justin Masse, Senior Platform DevOps Engineer at Extend

Justin Masse is a self-proclaimed lead chaos engineer, recognized within niche engineering communities for his expertise in Chaos Engineering and Infrastructure & DevOps.

The father of three young kids, a husband, a recent MBA graduate, recent cancer survivor, and competitive powerlifter, he still finds time to actively contribute to the platform engineering community.

- Justin Masse, website
- Justin Masse, GitHub
- Extend, website

Links to interesting things from this episode:
- Episode with Adrian Cockcroft
- “From $erverless to Elixir” by Cory O’Daniel
Observability in the AI Era with New Relic's Nic Benders 18.02.2026 50min

What happens when nobody wrote the code running in your production environment? As AI-generated software becomes standard practice, platform engineers face a new challenge: operating systems without experts to consult.

Nic Benders, Chief Technical Strategist at New Relic, has spent 15 years watching observability evolve from basic server monitoring to understanding complex distributed systems. Now he's tackling the next frontier: how to maintain and operate software when there's no human author to ask why something was built a certain way.

The conversation covers the shift from instrumentation being the hard problem to understanding being the bottleneck. Nic explains why inventory matters more than you think, how to approach AI-generated code as a black box that needs testing and telemetry, and why "garbage in, safety out" should be your new mantra.

You'll learn practical strategies for instrumenting modern systems with OpenTelemetry, why your observability hierarchy needs to start with knowing what's actually running, and how to build platforms that make safe deployment easier than risky shortcuts. Nic also shares his perspective on technical drift versus technical debt and what changes when your best troubleshooting tool - institutional knowledge - no longer exists.

Whether you're drowning in observability data or just starting to instrument your systems, this conversation offers concrete approaches for building understanding into your platform engineering practice.

Guest: Nic Benders, Chief Technical Strategist at New Relic

Nic Benders is New Relic's Chief Technical Strategist. Part of the Engineering team since the early days of the company, Nic has been involved with everything from Agents to ZooKeeper and all the pieces and products in between. He now looks after the long-term technical strategy behind the product and the experience of all the engineering teams who build it. Before New Relic, he worked in the mobile space, managing back-end messaging and commerce systems powering some of the largest carriers in the world.

- New Relic, website
- New Relic, Blog

Links to interesting things from this episode:
- OpenClaw (aka Moltbot, aka Clawdbot)
- Moltbook
Simplicity at Scale: Cleaning House for Platform Teams with Brian Childress 17.12.2025 40min

Why do so many “modern” platforms feel slow, fragile, and painful to work on?

Platform engineer and fractional CTO Brian Childress joins Cory to discuss how over-engineering, resume‑driven development, and scattered tooling quietly block teams from shipping value. They explore why simplicity is a competitive advantage for platform teams, especially as AI becomes part of everyday development.

You’ll learn:
- How to design a simple platform MVP that developers actually like using
- What a good local‑to‑prod story looks like (and why it’s the real scaling superpower)
- Practical ways to onboard humans and AI tools so both can contribute faster
- Where teams introduce unnecessary complexity with Kubernetes, microservices, and NoSQL
- How to think about scaling in three dimensions: users, developers, and features
- Why good architecture, docs, and decision records make AI more useful, not less
- How to spot and avoid resume‑driven development before it explodes your platform

Whether you’re cleaning up a messy stack or trying to keep a young platform from drifting into chaos, this conversation gives you concrete patterns for keeping things simple while still scaling teams, systems, and features.

Guest: Brian Childress, Platform engineer and fractional CTO

Brian Childress is an accomplished Software Engineer, Architect and Fractional CTO. For over a decade Brian has developed applications in healthcare, finance, and consumer products. Brian has spoken internationally on topics such as application security and developer tooling. Brian spends his free time researching and teaching the latest in application and API security design and best practices.

- Brian Childress, website
- Brian Childress, X

Links to interesting things from this episode:
- Replit
- Lovable
Using Feature Flags to Tame Complexity with Mike Zorn 03.12.2025 43min

What if changing a single flag could save you from a failed migration, a broken API, or a late-night rollback?

Join us as we dive into how feature flags become a practical tool for changing application behavior at runtime, not just toggling UI elements. Cory talks with Mike Zorn about real stories from LaunchDarkly and Rippling, covering how teams use flags to ship safely, debug faster, and simplify complex systems.

You’ll hear about:
- Using feature flags to avoid staging overload and ship directly to production
- Migrating critical systems and databases with minimal downtime and risk
- Controlling log levels and rate limits for specific customers on the fly
- Managing flag sprawl so teams do not drown in half-rolled-out features
- Experimenting with AI features, prompts, and models without fully committing

If you’re working on a platform, running critical infrastructure, or just trying to ship faster without breaking everything, this conversation offers concrete patterns you can start using right away.

Guest: Mike Zorn, Senior Software Engineer at Rippling

Mike’s software engineering journey began with an early interest in problem-solving and programming, starting with creating programs on a TI-83 calculator in middle school. After studying mathematics in college, he transitioned into software through an applied math project that required coding, which sparked his interest in engineering as a career. Professionally, he has worked at several product and SaaS companies, including one that was an early LaunchDarkly customer, where the team experienced firsthand the challenges of managing feature flags internally. That experience led him to appreciate the value of tools like LaunchDarkly, and he eventually joined the company himself. Since then, he has contributed across various areas, including focusing on how LaunchDarkly can best adopt its own platform internally to streamline releases and help engineers work more efficiently. His latest adventure has been joining Rippling as a Senior Staff Software Engineer.

- Mike Zorn, GitHub
- Mike Zorn, Email
- Rippling
- LaunchDarkly

Links to interesting things from this episode:
- SigNoz
- Signadot
- Open Container Initiative
- “Using Feature Flags to Avoid Downtime During Migrations”
- Apache Iceberg
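One item from the list above, controlling log levels for specific customers at runtime, can be sketched in a few lines. This is a toy in-memory evaluator, not the LaunchDarkly SDK or anything described in the episode. The `Flag` fields and the `bucket` and `evaluate` helpers are hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Flag:
    """Hypothetical flag definition held in memory."""
    default: Any
    overrides: Dict[str, Any] = field(default_factory=dict)  # customer_id -> value
    rollout_percent: int = 0   # 0-100: share of customers that get rollout_value
    rollout_value: Any = None


def bucket(flag_key: str, customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, 100) for this flag."""
    digest = hashlib.sha256(f"{flag_key}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def evaluate(flags: Dict[str, Flag], flag_key: str, customer_id: str) -> Any:
    """Resolve a flag: explicit override, then percentage rollout, then default."""
    flag = flags[flag_key]
    if customer_id in flag.overrides:
        return flag.overrides[customer_id]
    if bucket(flag_key, customer_id) < flag.rollout_percent:
        return flag.rollout_value
    return flag.default


# Turn on debug logging for one customer without redeploying anything.
flags = {"log-level": Flag(default="info", overrides={"acme": "debug"})}
acme_level = evaluate(flags, "log-level", "acme")      # "debug"
other_level = evaluate(flags, "log-level", "globex")   # "info"
```

Hashing the flag key together with the customer ID keeps each customer in the same bucket across evaluations, so raising `rollout_percent` only ever adds customers to the rollout and never flips existing ones back.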
Policy as Code: Kyverno and Securing Kubernetes at Scale with Jim Bugwadia 19.11.2025 42min

Most Kubernetes security breaches don't come from zero-day exploits - they come from misconfigurations. While your team runs scanners and reviews reports, containers are already running as root, network policies are missing, and compliance violations are piling up across dozens of repositories.

Jim Bugwadia, co-founder and CEO of Nirmata and creator of Kyverno, joins Cory to talk about a different approach: policy as code. Instead of asking developers to remember security best practices across every repo, what if your cluster automatically enforced secure defaults and blocked non-compliant deployments before they ever reached production?

You'll learn how to start using Kyverno today without breaking your production environment - from running your first audit scan (no installation required) to implementing enforcement mode with exceptions. Jim explains why micro-segmentation matters more than ever, how to automate network policies for every namespace, and why platform teams are using Kyverno for everything from security to cost optimization.

Whether you're running one cluster or managing Kubernetes at scale, this conversation offers practical strategies for making security a byproduct of your platform - not an afterthought.

Topics covered:
- Why shift-left security fails and what "shift-down" means for platform teams
- How to implement Kubernetes policy enforcement without grinding deployments to a halt
- Automating secure defaults: network policies, resource quotas, and role bindings
- The crawl-walk-run approach to rolling out policies in existing clusters
- Real-world use cases beyond security: cost optimization and resource management

Guest: Jim Bugwadia, Co-Founder & CEO of Nirmata and creator of Kyverno

Jim Bugwadia is the Co-founder and CEO of Nirmata, a Kubernetes management platform built for enterprises to simplify and scale cloud-native operations across clouds, data centers, edge, and connected devices. With a mission to democratize cloud-native best practices, Jim brings deep expertise in building large-scale software products and leading high-performing teams. Before founding Nirmata, he led a global consulting team at Cisco, guiding enterprises and service providers on their cloud computing journeys. Earlier in his career, he contributed to innovative products at startups and major companies including Trapeze Networks, Pano Logic, Jetstream, Lucent, and Motorola. A hands-on technologist, Jim continues to code in Go, Java, and JavaScript, reflecting his passion for building in the rapidly evolving world of software.

- Jim Bugwadia, X
- Nirmata
- Kyverno

Links to interesting things from this episode:
- Kyverno Community Repository
- “Shift-Down Security” Paper
- OpenReports
- Policy Reporter
- “The Shai-Hulud npm malware attack: A wake-up call for supply chain security”
- Kyverno Slack Channel
Guest Host: Kelsey Hightower - Beyond Pipelines: Infrastructure As Data 05.11.2025 48min

Is your Git repo really the source of truth for infrastructure - or just a suggestion?

Guest host Kelsey Hightower sits down with Cory O’Daniel to unpack why many teams hit dead ends with CI/CD for provisioning, where GitOps struggles with drift, and when TicketOps helps or hurts. They explore a different model: infrastructure as data with typed contracts, shared artifacts, and workflows that embed policy, validation, and upgrades from the start. You’ll hear practical ways to reduce cognitive load for developers while giving operations reliable control and better day‑2 levers.

You’ll learn:
- Why pipelines are a poor fit for infra provisioning and what to do instead
- How to reason about drift as a three‑way merge with reality
- When reconciliation helps, and when it breaks production firefights
- How typed contracts and artifacts connect modules and teams without glue scripts
- Ways to present safer self‑service without requiring everyone to learn Terraform
- A simple mental model for treating TicketOps as a surface, not the workflow

Guest Host: Kelsey Hightower

Kelsey has worn every hat possible throughout his career in tech and enjoys leadership roles focused on making things happen and shipping software. Prior to his retirement, he was a Distinguished Engineer at Google, where he worked on Google Cloud Platform. He is a strong open source advocate with a focus on building great software as well as great communities around it. He is also an accomplished author and keynote speaker with a knack for demystifying complex topics, doing live demos and enabling others to succeed. When he is not writing code, you can catch him giving technical workshops covering everything from programming to system administration.

Guest: Cory O'Daniel, CEO and Co-Founder of Massdriver and Co-Founder of OpenTofu

Cory has been a software architect and engineer for 20 years, leading up to the founding of Massdriver. He's also a husband and the father of two kids.

- Cory O'Daniel, X
- Cory O'Daniel, Medium
- Massdriver, website
- Massdriver, GitHub
- Massdriver, YouTube
- OpenTofu

Links to interesting things from this episode:
- "Gitopscracy" video
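The idea of treating drift as a three-way merge with reality can be illustrated with a small sketch. This is one possible reading of that phrase, not the model described in the episode or the behavior of Terraform, OpenTofu, or Massdriver. The three inputs (`base` as the last applied state, `desired` as what the code now says, `actual` as what the cloud reports) and the label names are assumptions for the example.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel for "attribute absent on this side"


def classify(
    base: Dict[str, Any],
    desired: Dict[str, Any],
    actual: Dict[str, Any],
) -> Dict[str, str]:
    """Label each attribute by comparing code and reality against a common base."""
    result: Dict[str, str] = {}
    for key in sorted(set(base) | set(desired) | set(actual)):
        b = base.get(key, _MISSING)
        d = desired.get(key, _MISSING)
        a = actual.get(key, _MISSING)
        if d == a:
            result[key] = "in_sync"        # code and reality already agree
        elif a == b:
            result[key] = "apply_desired"  # only the code changed
        elif d == b:
            result[key] = "drift"          # only reality changed
        else:
            result[key] = "conflict"       # both changed, in different ways
    return result


report = classify(
    base={"size": "small", "port": 5432, "tags": "a", "name": "db"},
    desired={"size": "medium", "port": 5432, "tags": "b", "name": "db"},
    actual={"size": "small", "port": 6543, "tags": "c", "name": "db"},
)
```

In this framing only the `conflict` cases need a human decision. `drift` entries can be either reverted or adopted into the code, which is the choice a plain two-way diff between code and cloud cannot express.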
Guest Host: Kelsey Hightower - Are CI/CD and GitOps Just Making Things Harder? 22.10.2025 30min

What if your production environment had a live, trustworthy blueprint you could zoom in and out of on demand?Kelsey Hightower guest-hosts a candid conversation with Cory about why CI/CD pipelines and GitOps often break down for cloud infrastructure. They explore a simpler operational model: treat infrastructure as data, lean on clear checkpoints instead of rigid “golden paths,” and make production legible for both developers and ops.You’ll learn:Where CI/CD adds friction for infra and what to do insteadWhy GitOps works for apps but hits limits for databases, networks, and multi-region realitiesHow “living diagrams” help new teammates understand prod on day onePractical guardrails that evolve with your org without locking teams inWays to reduce drift, surprise cloud costs, and Day Two chaosA mindset shift: databases for ops data, not shell-script archaeologyWalk away with concrete patterns to make production understandable, auditable, and easier to change—without more YAML or bigger pipelines.Guest Host: Kelsey HightowerKelsey has worn every hat possible throughout his career in tech and enjoys leadership roles focused on making things happen and shipping software. Prior to his retirement, he was a Distinguished Engineer at Google, where he worked on Google Cloud Platform. He is a strong open source advocate with a focus on building great software as well as great communities around them. He is also an accomplished author and keynote speaker with a knack for demystifying complex topics, doing live demos and enabling others to succeed. When he is not writing code, you can catch him giving technical workshops covering everything from programming to system administration.Guest: Cory O'Daniel, CEO and Co-Founder of Massdriver and Co-Founder of OpenTofuCory has been a software architect and engineer for 20 years, leading up to the founding of MassDriver. 
He's also a husband and the father of two kids.
- Cory O'Daniel, X
- Cory O'Daniel, Medium
- Massdriver, website
- Massdriver, GitHub
- Massdriver, YouTube
- OpenTofu
Links to interesting things from this episode:
- SigNoz
- “The $6,459 Terraform Lesson: Why Infrastructure Lifecycle Monitoring Matters” by Liz Fong-Jones
- "Gitopscracy" video
Guest Host: Kelsey Hightower — Why IaC Alone Isn’t Enough 08.10.2025 39min

Ever wonder why strong Terraform modules still lead to long review queues and fragile pipelines? From hand-built scripts and early data center migrations to cloud sprawl and Kubernetes, configuration management has changed a lot - but the core struggle remains: too many decisions, not enough guardrails. Guest host Kelsey Hightower sits down with Cory O’Daniel to unpack where Infrastructure as Code succeeds and where teams get stuck.
What you’ll learn:
- How to avoid “choice overload” in cloud configs by moving decisions upstream
- Practical ways to pair IaC with UX, policies, and SLAs to reduce toil
- When click-ops is a symptom, not the problem - and how to replace it safely
- Patterns for scaling platform practices beyond a handful of experts
- A simple mental model for mapping workflows across serverless, containers, and VMs
Guest Host: Kelsey Hightower
Kelsey has worn every hat possible throughout his career in tech and enjoys leadership roles focused on making things happen and shipping software. Prior to his retirement, he was a Distinguished Engineer at Google, where he worked on Google Cloud Platform. He is a strong open source advocate with a focus on building great software as well as great communities around them. He is also an accomplished author and keynote speaker with a knack for demystifying complex topics, doing live demos and enabling others to succeed. When he is not writing code, you can catch him giving technical workshops covering everything from programming to system administration.
Guest: Cory O'Daniel, CEO and Co-Founder of Massdriver and Co-Founder of OpenTofu
Cory has been a software architect and engineer for 20 years, leading up to the founding of Massdriver.
He's also a husband and the father of two kids.
- Cory O'Daniel, X
- Cory O'Daniel, Medium
- Massdriver, website
- Massdriver, GitHub
- Massdriver, YouTube
- OpenTofu
Links to interesting things from this episode:
- "The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win" by Gene Kim
- "15 Years of Duct Tape - Why IaC Adoption Stalled at 30"
How to Ship Faster with Feature Flags: Insights from Unleash 24.09.2025 43min

Still freezing code before Black Friday and hoping nothing breaks? Feature flags can help you ship smaller, safer changes continuously - without the “big bang” risk or painful rollbacks.
Cory O’Daniel talks with Unleash VP of Marketing Michael Ferranti about how modern teams use flags as a core delivery primitive alongside CI/CD and trunk-based development. They dig into kill switches for instant mitigation, progressive rollouts tied to real metrics, and why homegrown “if-statement” systems turn into hidden platforms you didn’t mean to build. They also cover the rising volume of AI-assisted code and how flags provide the control layer to move faster while protecting reliability.
What you’ll learn:
- How feature flags reduce risk for high-stakes periods like Black Friday by avoiding code freezes
- When to replace staging queues with progressive delivery and experiment-driven rollouts
- Practical uses: kill switches, trunk-based development, targeting, and cleanup strategies to manage flag debt
- Build vs. buy: why DIY flag systems become costly and how Unleash’s open source and on-prem options fit regulated or air-gapped needs
- Using business, engineering, and customer signals to automate safe ramp-ups and ramp-backs
- Why AI increases code throughput, how it affects reliability, and how flags create the safety rails for agentic workflows
Guest: Michael Ferranti, VP of Marketing at Unleash
Michael Ferranti has held leadership roles at Teleport, Portworx, ClusterHQ, and Rackspace Technology, with a focus on go-to-market strategy in open-source and enterprise software. At Teleport he focused on shifting from legacy security models to developer-first, identity-driven access. At Portworx, he was building new GTM strategies for Kubernetes-native storage when everyone was still figuring out containers, and he helped scale the company from under $500K in revenue to a $370M acquisition by Pure Storage.
His work has centered on supporting engineering leaders in delivering features, scaling infrastructure, and improving security without adding unnecessary blockers. Michael has spoken at industry events like KubeCon and theCUBE, sharing insights on platform org design, category creation, and growing open-source adoption.
- Unleash, website
- Unleash, GitHub
- Unleash, LinkedIn
- Unleash, X
- Unleash, Slack
- Unleash, YouTube
- UnleashCon 2025
Links to interesting things from this episode:
- React
- Bitbucket
- LaunchDarkly
- ServiceNow
- CockroachDB
- Red Hat OpenShift
- State of DevOps Report (DORA)
- "How to Win Friends & Influence People"
- Grafana
**REMINDER** - Apollo GraphQL has kindly offered us a few free passes to join them at the GraphQL Summit in San Francisco, October 6-8, 2025. If you are interested in going, the code is: PodcastSummit25
GraphQL, MCP, and the Future of APIs with Apollo CEO Matt DeBergalis 10.09.2025 43min

**UPDATE** - Apollo GraphQL has kindly offered us a few free passes to join them at the GraphQL Summit in San Francisco, October 6-8, 2025. If you are interested in going, the code is: PodcastSummit25
What if your API layer could help you ship faster today and make tomorrow’s AI workflows safer and easier to build?
Apollo CEO Matt DeBergalis explains how GraphQL became a practical standard for unifying messy backends, why declarative schemas and strong types are the “bedrock” for agentic systems, and where MCP fits when you want agents to call business data safely. You’ll hear real examples of speeding up frontends, tightening observability, and running focused personalization without “fat” APIs.
What you’ll learn:
- A plain-language model for GraphQL and why it decouples frontend needs from backend services
- How typing, schema docs, and field-level telemetry reduce risk and enable LLM-driven tooling
- Practical ways to expose queries as MCP tools and start with internal “agentic DevOps”
- Tactics for experiments and personalization that stay fast and measurable at scale
- Why an end-to-end approach (client and server) matters for reliability and speed
Guest: Matt DeBergalis, CEO and Co-Founder of Apollo GraphQL
Matt DeBergalis is the Chief Executive Officer and Co-Founder of Apollo GraphQL, focused on bringing the popular GraphQL technology to the enterprise. He previously served as Apollo's CTO, leading product and engineering. Matt's longtime focus has been in open source and platforms: he co-founded Meteor.js, which grew to become one of the most popular open-source projects in the world for developing full-stack web apps with JavaScript, as well as ActBlue, the American political fundraising platform that revolutionized grassroots political giving. He attended the Massachusetts Institute of Technology and resides in the San Francisco Bay Area with his family.
In his spare time, Matt enjoys taking to the air and flying his 1966 Beechcraft Baron.
- Apollo GraphQL, website
- Apollo GraphQL, GitHub
- Apollo GraphQL, LinkedIn
- Apollo GraphQL, X
- Apollo GraphQL, YouTube
Links to interesting things from this episode:
- Free Software Foundation
- Cursor
- Motley Fool podcast
- GraphQL Summit
Beyond Cracking the Coding Interview with Mike Mroczka 20.08.2025 1h 8min

Ever wondered how many “perfect” candidates simply learned the test - or how many great engineers get filtered out by bad interview design? Mike Mroczka, interview coach and ex-Googler, shares what really goes on behind technical hiring and how to navigate it to your advantage.
What you’ll learn:
- How leaked question banks and standardized puzzles can distort hiring signals - and where they still help
- Practical ways companies can make interviews fairer and harder to game, both on-site and remote
- A balanced take on data structures and algorithms: when they’re useful and when they’re noise
- Tactics to spot and reduce cheating without turning interviews into surveillance
- How to structure interviews for different seniority levels so you measure the right skills
- Salary negotiation playbook: timing, leverage, and common pitfalls that cost candidates real money
- Getting past the application black hole: skipping recruiters, networking that works, and coordinating offers
Who this helps:
- Engineers tired of grinding puzzles who want a smarter prep plan
- Hiring managers looking to improve signal and reduce false negatives
- Anyone preparing to negotiate an offer with confidence
Guest: Mike Mroczka, Primary author of Beyond Cracking the Coding Interview, Ex-Google
Mike Mroczka, a former senior SWE (Google, Salesforce, GE), is now a tech consultant with a decade of experience helping engineers land their dream jobs. He’s a top-rated mentor (interviewing.io, Karat, Pathrise, Skilledinc) and the author of viral technical content on system design and technical interview strategies featured on Hacker News, Business Insider, and Wired.
- Mike Mroczka, website
- Beyond Cracking the Coding Interview
Links to interesting things from this episode:
- Cracking the Coding Interview by Gayle Laakmann McDowell
- HackerOne
- Interviewing.io
- Cluely
- Google Glass
- Ray-Ban
- HackerRank
- CodeSignal
From React to Dagster: Pete Hunt on Data, Infra, and AI-Ready Platforms 30.07.2025 49min

Is Postgres actually a better message queue than Kafka? This provocative question is just one of many insights Pete Hunt shares in this conversation about data orchestration, platform engineering, and the evolution of infrastructure.
Pete Hunt, CEO of Dagster Labs and former React co-founder at Facebook, brings his unique perspective from working at tech giants like Instagram and Twitter to discuss how different platform team approaches impact product development. Having witnessed both Facebook's clear delineation between product and infrastructure teams and Twitter's DevOps-style ownership model, Pete offers valuable comparisons of these contrasting philosophies.
The conversation explores:
- How Dagster provides a higher-level abstraction for data teams, making it easier to track and debug data assets rather than just managing workflows
- The challenges of modern data platforms and why many organizations struggle with complex, distributed systems that could be simplified
- A practical approach to migrating from Airflow to Dagster with their "Airlift" toolkit that allows for incremental, low-risk transitions
- How AI development is fueling demand for better data orchestration as companies build applications that rely on properly managed data pipelines
Pete also shares his thoughtful approach to balancing technical debt and product development with a "quarter on, quarter off" cadence that allows teams to both ship features and clean up the inevitable corners that get cut under deadline pressure.
For platform engineers, data teams, and technical leaders navigating the intersection of infrastructure and AI, this episode provides practical insights on creating abstractions that deliver real operational value without unnecessary complexity.
Guest: Pete Hunt, CEO of Dagster
Pete is the CEO of Dagster Labs, where he first joined as Head of Engineering in early 2022 and transitioned into the CEO role later that same year.
Before Dagster, Pete co-founded Smyte, an anti-abuse startup acquired by Twitter, where he continued as a senior staff engineer.
Earlier in his career, Pete was one of the first engineers to work on Instagram after its acquisition by Facebook in 2012. There, he led development on Instagram’s web and analytics teams and became a co-founder of the React.js project, helping transform an internal experiment into one of the most widely used front-end frameworks in the world. He was also part of the early community around GraphQL and has remained deeply engaged in open source and developer tooling.
Pete brings a pragmatic, hands-on perspective to modern data infrastructure. Having been both a founder and an engineer, he focuses on reducing complexity and fatigue in data teams by building tools that actually work together. At Dagster, he remains close to the code and actively involved in technical decisions, combining leadership with deep technical fluency.
- Pete Hunt, X
- Dagster
- Dagster Pipes
- Dagster Airlift
Links to interesting things from this episode:
- React
- “Postgres: a Better Message Queue than Kafka?”
- Airflow
- Kubeflow
- CAPES
- Fargate
Building Better Platforms with Dapr: Abstractions, Portability, and Durable Systems with Mark Fussell 16.07.2025 48min

Cloud lock-in isn't just about where your data lives - it's about how deeply cloud-specific code permeates your applications. Mark Fussell, co-creator of Dapr and CEO of Diagrid, joins Cory O'Daniel to explore how Dapr provides clean abstractions for common distributed system patterns, enabling teams to build portable applications without sacrificing cloud-native capabilities.
The conversation covers:
- How Dapr creates a clean separation between application code and underlying infrastructure services like messaging, state management, and secrets
- Why platform teams struggle with tight coupling between applications and infrastructure, and how Dapr solves this problem
- The benefits of Dapr's sidecar architecture for local development, testing, and production environments
- How Dapr automatically handles cross-cutting concerns like security, observability, and resiliency without boilerplate code
- Introduction to Dapr's workflow engine for durable execution and the emerging world of stateful AI agents
Whether you're a platform engineer struggling with cloud lock-in or a developer tired of rewriting code for different infrastructures, this conversation demonstrates how Dapr can simplify your distributed systems while maintaining access to the unique capabilities of each cloud provider.
Guest: Mark Fussell, Co-founder of Dapr and CEO of Diagrid
Mark Fussell is the CEO of Diagrid, a cutting-edge company that simplifies building and scaling cloud-native applications. As the co-founder of Dapr (Distributed Application Runtime), Mark has played a pivotal role in shaping the future of modern application development by empowering developers to build resilient, distributed systems with ease.
With decades of experience in the software industry, Mark has been a driving force behind innovative solutions that bridge the gap between developers and complex infrastructure.
- Diagrid
- Dapr
Links to interesting things from this episode:
- "XML Bible" by Elliotte Rusty Harold
- OpenTelemetry
- SPIFFE
- DataGalaxy case study
- Cloud Native Computing Foundation