Data Engineering Podcast

Tobias Macey

Shteti Shtetet e Bashkuara

Zhanret Arsim, Teknologji

Gjuha EN

Episode 512

I/E fundit 01.08.2026

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Episodet

Why Multi-Agent Systems Need Shared State, Graph Semantics, and Governance 01.08.2026 1h 2min

Summary In this episode Ragnor Comerford talks about OmniGraph, a lakehouse-native graph storage layer designed around the needs of agentic systems. He explores how graphs are primarily a semantic model for representing the world, rather than just a specialized engine for traversal workloads, and how that perspective shaped OmniGraph’s design on top of object storage, Lance, Arrow, and DataFusion. Ragnor explained the motivation for combining graph semantics with Git-style branching and merging so that teams can manage probabilistic writers such as AI agents with stronger governance, shared context, and safer collaboration patterns. He also dug into the practical tradeoffs of building a graph engine for multi-agent coordination instead of traditional graph analytics use cases. He closed with a look at emerging use cases such as company “brain” systems, software development lifecycle graphs, research workflows, and event-driven agent orchestration, along with a broader conversation about composability, and sovereign AI infrastructure. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementYour host is Tobias Macey and today I'm interviewing Ragnor Comerford about OmniGraph, a lakehouse-native graph storage layer with git semanticsInterview IntroductionHow did you get involved in the area of data management?Can you describe what OmniGraph is and the story behind it?What was the original problem that you were trying to solve by creating it?There are numerous graph engines available, what are the properties of OmniGraph that differentiate it from the competition?Cypher (GQL) and Gremlin are all established languages with years of examples to work from. What are the benefits of developing a new and more constrained query interface for an agentic audience?How does that change the potential applications of OmniGraph? (e.g. general knowledge graph, fraud detection, SIEM, etc.)Can you describe the architecture of OmniGraph?You have built the system on top of several well-established open source components. What was your process for deciding what to use and how to compose it?What are some examples of systems that can be built with OmniGraph?What are other components/integration points that compose well with OmniGraph?Given the technologies that you are building on top of, what are the automatic benefits/integrations that you benefit from?Given that the underlying storage is Lance, and Lance's interoperability with Parquet/Iceberg, what are the opportunities for modeling graphs on top of existing lakehouse data?What are the most interesting, innovative, or unexpected ways that you have seen OmniGraph used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on OmniGraph?When is OmniGraph the wrong choice?What do you have planned for the future of OmniGraph?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links OmniGraphModernRelayInformation TheoryLanceGitNeo4JGraphRAGTigerGraphLakeHouseIcebergDoltPuppyGraphPodcast EpisodeTerraformGremlinCypherGQLSPARQLIn-context LearningBM25 IndexingData FusionAdjacency MatrixPredicate PushdownAgentic Mesh book (affiliate link)witan-councilwitan-codeWeb AssemblyClickhouseDSPyMCP-UIThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Building the Context Flywheel for AI Data Agents 06.07.2026 1h

Summary In this episode Prukalpa Sankar, co-founder of Atlan, talks about what it takes to build a “context flywheel” for AI agents in data-intensive organizations. She explained why model intelligence alone isn’t enough to make AI useful in production, and how real performance depends on contextual intelligence: institutional knowledge, semantic meaning, procedural know-how, and access to the right tools. She also dug into how metadata catalogs are evolving into broader context layers that serve both humans and agents, and why agentic systems are changing the economics of metadata and governance work. Prakulpa shared Atlan’s perspective on bootstrapping context from existing systems such as warehouses, BI tools, query logs, and SaaS applications, then using simulation, traces, and human governance loops to improve agent accuracy over time. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementYour host is Tobias Macey and today I'm interviewing Prukalpa Sankar about strategies for building a context flywheel for your data agentsInterviewIntroductionHow did you get involved in the area of data management?You have spent several years working in the metadata catalog space with Atlan. What are the notable changes in scope, adoption, and application that you have seen since we last spoke (June 2022)?The recurring theme since the start of 2026 has been agentic augmentation of all engineering workflows, including data. How do you differentiate between data catalogs, semantic layers, agent memory, context layers, etc. when architecting an AI-powered data-oriented system?One of the perennial problems with data catalogs, business glossaries, master data management, etc. is the up-front investment required to get a real-world impact. How can agents help reduce the activation energy needed to get to that return on effort?One of the perennial problems in data engineering is fragmentation and siloing of data. This is exacerbated by AI systems due to the introduction of vector data as a new specialization. What are the forces that you are seeing play into the current set of tensions and the architectural primitives that we need to bring to bear to keep things maintainable?Since the introduction of transformer-based generative models we have been combating hallucinations. While we have made progress, it is still critical to ensure accuracy and trustworthiness when working with business data. What are the policy elements of governance and technical controls to ensure a high degree of confidence in agent-generated context and business semantics?What are the most interesting, innovative, or unexpected ways that you have seen teams build context layers for their agentic data workloads?What are the most interesting, unexpected, or challenging lessons that you have learned while working on business context engineering?When is agent-managed context the wrong choice?What are your predictions for the next set of architectural shifts that will be driven by the pressures of AI-powered systems?Contact InfoLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?Closing AnnouncementsThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links AtlanAtlan Context LakehouseIcebergBusiness GlossaryMaster Data ManagementSemantic LayerCube.devMCP == Model Context ProtocolA2A == Agent to Agent ProtocolDecision TracesApache DorisStarRocksThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Holding Kafka Right: Product-Friendly Streaming with TypeStream 18.06.2026 49min

Summary In this episode Jevin Maltais talks about the practical realities of building reliable, product-focused streaming systems with Kafka. Jevin shares lessons from roles at Zapier, Humi, and Clio, where real-time synchronization, customer data unification, and document sync at scale highlighted both the strengths and common misuses of Kafka. He digs into using events as the source of truth, materialized views with KTables, and how schema registries and type safety prevent downstream breakage. Jevin explains why teams often reach for heavyweight Kafka clusters without leveraging Streams, Connect, or interactive queries—and how his project, TypeStream, aims to make those capabilities accessible via config-as-code while keeping a thin abstraction and clear escape hatches. He also explore trade-offs across Kafka-compatible alternatives, CDC with Debezium in the real world, and where abstractions should stop so teams can scale responsibility as complexity grows. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementThis episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing.Your host is Tobias Macey and today I'm interviewing Jevin Maltais about the challenges of building a reliable streaming Interview IntroductionHow did you get involved in the area of data management?Can you describe what Typestream is and the story behind it?What are the common challenges that teams encounter when trying to build on top of Kafka?How do those challenges/misconfigurations impact the team's ability to deliver on product goals?What are the fundamental design aspects of Kafka that contribute to the difficulties that teams encounter when using it as an element of their architecture?There have been numerous projects taking aim at Kafka, with varying approaches and degrees of effectiveness (e.g. RedPanda, AutoMQ, Pulsar, etc.). What are the tradeoffs that each of those approaches requires?What makes the original Kafka project so resilient in the face of all of that competition?Can you describe the architecture of Typestream and how each of the core elements contribute to a better user experience?For teams who want to take advantage of streaming capabilities, but don't want to invest in becoming Kafka experts, what does the Typestream workflow look like?If they don't want to manage the operational overhead of a Kafka cluster, how tightly coupled is Typestream to the original Kafka? (can someone use RedPanda or AutoMQ instead?)What are the most interesting, innovative, or unexpected ways that you have seen Typestream used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Typestream?When is Typestream the wrong choice?What do you have planned for the future of Typestream?Contact Info WebsiteParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links TypestreamZapierAirflowKafkaKTablesKSQLRedPandaPulsarAutoMQKafka Schema RegistryDebeziumChange Data CaptureKafka ConnectTerraformKafka Compacted TopicThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards 08.06.2026 52min

Summary In this episode Shravan Gunda, founder and CEO of Kaarvi AI, talks about building an AI-native, agent-driven data platform designed to eliminate the janitorial work that consumes most data teams. He explores Kaarvi’s multi-agent architecture that runs queries across seven LLMs in parallel for reliability, its synthetic data generator that mirrors source schemas for quick testing, and “Hey Kaarvi” chat for text-to-SQL, text-to-transformations, and text-to-dashboard workflows. He also digs into on-prem versus SaaS deployments, domain-specialized agents for privacy and accuracy, code blocks for custom Python/SQL, and the roadmap for a marketplace and desktop assistant. Shravan highlights how Kaarvi compresses weeks of work into hours and bridges the gap between business users and data engineers by turning AI into a dependable force multiplier. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementThis episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing.Your host is Tobias Macey and today I'm interviewing Shravan Gunda about building an agent-driven data platform at KaarviInterviewIntroductionHow did you get involved in the area of data management?Can you describe what Kaarvi is and the story behind it?"AI" is a very broad term that encompasses numerous possible implementations. Can you give some more detail about the different types and applications of AI in Kaarvi's architecture?What are some of the core assumptions of data workflows that need to be reconsidered when AI is embedded in the execution path?What are the most interesting, innovative, or unexpected ways that you have seen Kaarvi used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kaarvi?When is Kaarvi the wrong choice?What do you have planned for the future of Kaarvi?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links KaarviSynthetic Datan8nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture 01.06.2026 54min

SummaryIn this episode Weimo Liu, co‑founder of PuppyGraph, talks about the engineering behind their “zero-copy” graph querying engine for lakehouse and database sources. He explores how PuppyGraph lets you run Cypher and Gremlin traversals and graph algorithms directly on data in Iceberg, Delta, Hudi, Hive, and even MongoDB—without loading into a separate graph store. Weimo explains their edge-sharded, vectorized, MPP architecture that tackles hub nodes, multi-hop traversals, and shuffle at scale, targeting sub-second to single-digit-second workloads. He digs into practical graph data modeling on top of normalized and denormalized tables, logical views, and flexible mappings; strategies for caching, adaptive reads, and leveraging Iceberg metadata; and how PuppyGraph’s operator-based engine unifies query and algorithms. He also covers real-world applications—from cybersecurity log analysis to entity resolution and agentic workflows—when to choose embedded or transactional graph databases instead, and what’s next for enterprise features and broader warehouse integrations.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementThis episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing.Your host is Tobias Macey and today I'm interviewing Weimo Liu about the engineering behind PuppyGraph's zero-copy ETL for querying your lakehouse as a graphInterviewIntroductionHow did you get involved in the area of data management?Can you start by describing what PuppyGraph is and the story behind it?What are some of the key use cases that people are turning to PuppyGraph and graph data models for?Graph engines have struggled to take off for several years, not least of which is due to the difficulty of scaling them to large data volumes as a result of the topological nature of the data. Can you describe the architecture of PuppyGraph and some of the ways that you are addressing that challenge of data volume for graphs?latency/data explorationtypes of traversals and limitationslakehouse architecture pros/cons for graphsdata modeling/translationshortcomings of zero-ETL and how transforming the underlying representation could provide benefitsFor someone who is looking for a graph engine to support a connected data use case, what are the guiding questions that you would ask to lead them toward PuppyGraph vs. a dedicated graph database like Memgraph/Neo4J/etc.?What are the most interesting, innovative, or unexpected ways that you have seen PuppyGraph used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on PuppyGraph?When is PuppyGraph the wrong choice?What do you have planned for the future of PuppyGraph and graph data exploration on large data volumes?Contact InfoLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?Closing AnnouncementsThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.LinksPuppyGraphTigerGraphGoogle F1Graph DatabaseGoogle PregelIcebergGraph SupernodeMPP == Massively Parallel ProcessingSpark GraphXTrinoLadybug DBlance-graphKuzuDBMemGraphLabelled Property GraphRDF TriplesCypher Query LanguageGremlinCDC == Change Data CaptureNeo4JJanusGraphNetworkXPyTorchDuckDBIceberg ArrayLanceDBPalo Alto NetworksColumnar ADBCThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA%
Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes 06.05.2026 58min

SummaryIn this episode Robert Nishihara, co-founder of Anyscale and co-creator of Ray, talks about maximizing hardware utilization for AI and data-intensive workloads. He explores Ray’s evolution alongside Kubernetes and PyTorch, and why consolidation at these layers has enabled a new generation of complex, heterogeneous workloads. Robert explains how data preparation has shifted to GPU- and inference-heavy, multimodal pipelines; where Ray fits compared to Spark and workflow orchestrators; and why Ray excels at composing heterogeneous pools of compute, handling failures, and scaling complex systems like multi-node LLM inference and reinforcement learning. He digs into practical strategies for boosting GPU utilization across training and inference, elasticity and prioritization of workloads, topology-aware scheduling, and the importance of fast failure recovery as hardware scales from nodes to racks. If you’re wrestling with expensive GPUs, multimodal data curation, or cross-node LLM inference, this conversation offers concrete mental models and architectural guidance.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementYour host is Tobias Macey and today I'm interviewing Robert Nishihara about the challenges of maximizing the utility of your available hardware for AI applicationsInterviewIntroductionHow did you get involved in the area of data management?Can you start by giving an overview of the major contributors to wasted or idle compute?Why does it matter if the available compute isn't being maximized?What are some of the typical ad-hoc methods that teams might use to try to get the most out of their available hardware (especially GPUs)? What are the most interesting, innovative, or unexpected ways that you have seen Ray used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ray and distributed compute for data and AI?When is Ray the wrong choice?What do you have planned for the future of Ray?Contact InfoLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?Closing AnnouncementsThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.LinksAnyScaleRayDeep LearningComputer VisionKubernetesCursorClaude CodeKube-RayPyTorchTensorflowTheanoCaffevLLMSGLangRay TuneNeural NetworkLearning RatesReinforcement LearningAlphaGoCursor Composer 2ImageNetTransformer ArchitectureStochastic Gradient DescentAirflowDagsterFlyteMixture of ExpertsPrefillTemporalActor FrameworkRDMA == Remote Direct Memory AccessNeocloudsAI Engineering Podcast EpisodeThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
The AI-First Data Engineer: 10–50x Productivity and What Changes Next 07.04.2026 59min

Summary In this episode, I sit down with Gleb Mezhanskiy, CEO and co-founder of Datafold, to explore how agentic AI is reshaping data engineering. We unpack the leap from chat-assisted coding to truly agentic workflows where AI not only writes SQL and dbt models but also executes queries, debugs, runs tests, and ships production-ready outcomes. Gleb explains why teams that master this AI-first loop can see 10–50x gains, how security/compliance concerns can be addressed with platform-native LLM endpoints, and why the role of data engineers is shifting from code authors to operators of autonomous agents. We dig into the consolidation of the modern data stack, the economics driving more data products (Jevons paradox), and why product thinking, domain knowledge, and cross-functional skills will define the next wave of standout data professionals. We also cover practical steps for leaders and ICs: modernizing off legacy platforms, establishing safe AI adoption paths, codifying reusable “skills” and context for agents, and building validation utilities that keep the inner loop fast and trustworthy. Finally, Gleb shares how Datafold moved to fully AI-driven software delivery and why “outcomes over tools” is the emerging model for complex initiatives like data platform migrations—and how this reframes data quality for the AI era, emphasizing broad data access plus rich context over brittle human-centric tests. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm bringing back Gleb Mezhanskiy to talk about our predictions for the impact of AI on data engineering for 2026InterviewIntroductionHow did you get involved in the area of data management?What are the concrete steps that teams need to be taking today to take advantage of agentic AI capabilities?What are the new guardrails/constraints/workflows that need to be in place before you let AI loose on your data systems?How do you balance the potential cost savings and productivity increases with the up-front investment and variability in inference spend?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links Blog PostDatafoldClaude Opus 4.5Harry Potter - MugglesJevon's ParadoxModern Data StackDagster CompassGravity OrionMCP == Model Context ProtocolQwenThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Treat Metering Like Finance: Building Data Platforms for Consumption Economics 29.03.2026 50min

Summary In this episode Himant Goyal, Senior Product Manager at Salesforce, talks about how data platform investments enable reliable, accurate metering for consumption-based business models. Himant explains why consumption turns operations into a real-time optimization problem spanning metering, cost attribution, billing, governance, and cross-functional ownership. He explores the richness required in usage data to support sophisticated pricing, the importance of treating metering like a financial system, and the architectural foundations - event schemas, durable ingestion, normalization/validation, a usage ledger, and clear serving layers - needed to power near-real-time visibility with fine-grained drilldowns. He also digs into anti-patterns and reliability concerns such as late or duplicate data, time zone pitfalls, SLAs, and automated policy decisions for pipeline failures. Himant shares practical guidance for capturing usage events from products and logs, balancing push vs. pull and real-time vs. batch processing to manage costs. He highlights configurable metering and rate-card versioning for rapid onboarding of new products, and the cultural shift required for finance, product, and engineering to co-own metering. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Himant Goyal about how data platform investments support consumption based business modelsAnnouncementsIntroductionHow did you get involved in managing the data products or data management?Can you start by outlining the types of businesses and products that are "consumption based" and the impact that it has on the economics of the company?What are the unique operational challenges that are presented by having consumption as the unit of cost?How does the availability and accessibility of metering data impact the level of detail/nuance that the business can employ in their pricing strategies?When we talk about the infrastructure for usage tracking, it often feels like a high-stakes stream processing problem. What are the core architectural components required to build a reliable metering pipeline?How do you think about the trade-offs between "push" models (application emits events) vs. "pull" models (the platform scrapes resource usage)?Accuracy is non-negotiable when data is tied directly to revenue. What are the strategies for ensuring idempotency and handling deduplication in the ingestion layer?How do you address the "late-arriving data" problem in a usage-based world, especially when dealing with monthly billing cycles or credit exhaustion?From an uptime and reliability perspective, should the metering system be in the critical path of the service itself?If the metering service is down, do you "fail open" and provide free service, or "fail closed" and impact availability? How do you build for that kind of resilience?One of the common pitfalls is treating metering like logging or observability. How do you ensure that usage metering is treated as a first-class product priority rather than an afterthought for the platform team?What does the interface look like for product engineers to "register" a new billable event without breaking the downstream data contract?Once you have this data, there is often a requirement for real-time visibility for the end user. What are the data modeling requirements to support both "high-volume ingestion" and "low-latency querying" for customer-facing billing dashboards?How do you bridge the gap between the raw event stream and the aggregated "billable unit" in the data warehouse or lakehouse?What are the most interesting, innovative, or unexpected ways that you have seen usage-based metering used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on building consumption-based data platforms?When is usage-based metering the wrong choice? (e.g., When does the complexity of the data platform outweigh the economic benefits?)What are your predictions for the future of consumption-based data architectures?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Links Hackernoon PostCOGS == Cost of Good SoldMedallion ArchitectureThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Beyond the PDF: Rowan Cockett on Reproducible, Composable Science 22.03.2026 42min

Summary In this episode Rowan Cockett, co-founder and CEO of CurveNote and co-founder of the Continuous Science Foundation, talks about building data systems that make scientific research reproducible, reusable, and easier to communicate. He digs into the sociotechnical roots of the reproducibility crisis - from data integrity and access to entrenched publishing incentives and PDF-bound workflows. He explores open standards and tools like Jupyter, Jupyter Book, and the push toward cloud-optimized formats (e.g., Zarr), along with graceful degradation strategies that keep interactive research usable over time. Rowan details how CurveNote enables interactive, reproducible articles that spin up compute on demand while delegating large dataset storage to specialized partners, and how community efforts like the Continuous Science Foundation and initiatives with Creative Commons aim to fix credit, licensing, and attribution. He also discusses the Open Exchange Architecture (OXA) initiative to establish a modular, computational standard for sharing science, the momentum in computational biosciences and neuroscience, and why true progress hinges on interoperability and composability across data, code, and narrative. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Rowan Cockett about building data systems that make scientific research easier to reproduceInterview IntroductionHow did you get involved in the area of data management?Can you describe what your interest is in reproducibility of scientific research?What role does data play in the set of challenges that plague reproducibility of published research?What are some of the notable changes in the areas of scientific process, and data systems that have contributed to the current crisis of reproducibility?Beyond technological shortcomings, what are the processes that lead to problematic experiment/research design, and how does that complicate the work of other teams trying to build on the experimental findings?How does a monolithic approach change the types of research that would be possible with more modular/composable experimentation and research?Focusing now on the data-oriented aspects of research, what are the habits of research teams that lead to friction and waste in storing, processing, publishing, and ultimately consuming the information that supports the research findings?What are the elements of the work that you are doing at the Continous Science Foundation and Curvenote to break the status quo?Are there any areas of study that you are more susceptible to friction and siloing of their data?What does a typical engagement with a research group look like as you try to improve the accessibility of their work?What are the most interesting, innovative, or unexpected ways that you have seen research data (re-)used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on reproducibility of scientific research?What are the next set of challenges that you are focused on addressing in the research/reproducibility space?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links Continuous Science FoundationCurvenoteZenodoDryadHDF5IcebergZarrMyst MarkdownJupyter NotebookArXivJournal of Open Source Software (JOSS)Data CarpentrySoftware CarpentryOpen RxivBio RxivMed RxivForce 11JupyterBookOpen Exchange Architecture (OXA)The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Beyond Prompts: Practical Paths to Self‑Improving AI 16.03.2026 1h 1min

Summary In this episode Raj Shukla, CTO of SymphonyAI, explores what it really takes to build self‑improving AI systems that work in production. Raj unpacks how agentic systems interact with real-world environments, the feedback loops that enable continuous learning, and why intelligent memory layers often provide the most practical middle ground between prompt tweaks and full Reinforcement Learning. He discusses the architecture needed around models - data ingestion, sensors, action layers, sandboxes, RBAC, and agent lifecycle management - to reach enterprise-grade reliability, as well as the policy alignment steps required for regulated domains like financial crime. Raj shares hard-won lessons on tool use evolution (from bespoke tools to filesystem and Unix primitives), dynamic code-writing subagents, model version brittleness, and how organizations can standardize process and entity graphs to accelerate time-to-value. He also dives into pitfalls such as policy gaps and tribal knowledge, strategies for staged rollouts and monitoring, and where small models and cost optimization make sense. Raj closes with a vision for bringing RL-style improvement to enterprises without requiring a research team - letting businesses own the reasoning and memory layers that truly differentiate their AI systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey, and today I’m interviewing Raj Shukla about building self-improving AI systems — and how they enable AI scalability in real production environments.Interview IntroductionHow did you get involved in AI/ML?Can you start by outlining what actually improves over time in a self-improving AI system? How is that different from simply improving a model or an agent? How would you differentiate between an agent/agentic system vs. a self-improving system? One of the components that are becoming common in agentic architectures is a "memory" layer. What are some of the ways that contributes to a self-improvement feedback loop? In what ways are memory layers insufficient for a generalized self-improvement capability? For engineering and technology leaders, what are the key architectural and operational steps you recommend to build AI that can move from pilots into scalable, production systems? One of the perennial challenges for technology leaders is how to build AI systems that scale over time. How has AI changed the way you think about long-term advantage? How do self-improvement feedback loops contribute to AI scalability in real systems? What are some of the other key elements necessary to build a truly evolutionary AI system? What are the hidden costs of building these AI systems that teams should know before starting? I’m talking about enterprise who are deploying AI into their internal mission-critical workflows. What are the most interesting, innovative, or unexpected ways that you have seen self-improving AI systems implemented? What are the most interesting, unexpected, or challenging lessons that you have learned while working on evolutionary AI systems? What are some of the ways that you anticipate agentic architectures and frameworks evolving to be more capable of self-improvement? Contact Info LinkedInClosing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Parting Question From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?Links Symphony AIReinforcement LearningAgentic MemoryIn-Context LearningContext EngineeringFew-Shot LearningOpenClawDeep Research AgentRAG == Retrieval Augmented GenerationAgentic SearchGoogle Gemma ModelsOllamaThe intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Orion at Gravity: Trustworthy AI Analysts for the Enterprise 08.03.2026 1h 5min

Summary In this episode of the Data Engineering Podcast, Lucas Thelosen and Drew Gilson, co-founders of Gravity, discuss their vision for agentic analytics in the enterprise, enabled by semantic layers and broader context engineering. They share their journey from Looker and Google to building Orion, an AI analyst that combines data semantics with rich business context to deliver trustworthy and actionable insights. Lucas and Drew explain how Orion uses governed, role-specific "custom agents" to drive analysis, recommendations, and proactive preparation for meetings, while maintaining accuracy, lineage transparency, and human-in-the-loop feedback. The conversation covers evolving views on semantic layers, agent memory, retrieval, and operating across messy data, multiple warehouses, and external context like documents and weather. They emphasize the importance of trust, governance, and the path to AI coworkers that act as reliable colleagues. Lucas and Drew also share field stories from public companies where Orion has surfaced board-level issues, accelerated executive prep with last-minute research, and revealed how BI investments are actually used, highlighting a shift from static dashboards to dynamic, dialog-driven decisions. They stress the need for accessible (non-proprietary) models, managing context and technical debt over time, and focusing on business actions - not just metrics - to unlock real ROI. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Lucas Thelosen and Drew Gilson about the application of semantic layers to context engineering for agentic analyticsInterview IntroductionHow did you get involved in the area of data management?Can you start by digging into the practical elements of what is involved in the creation and maintenance of a "semantic layer"?How does the semantic layer relate to and differ from the physical schema of a data warehouse?In generative AI and agentic systems the latest term of art is "context engineering". How does a semantic layer factor into the context management for an agentic analyst?What are some of the ways that LLMs/agents can help to populate the semantic layer?What are the cases where you want to guard against hallucinations by keeping a human in the loop?Beyond a physical semantic layer, what are the other elements of context that you rely on for guiding the activities of your agents?What are some utilities that you have found helpful for bootstrapping the structural guidelines for an existing warehouse environment?What are the most interesting, innovative, or unexpected ways that you have seen Orion used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Orion?When is Orion the wrong choice?What do you have planned for the future of Orion?Contact Info LucasLinkedInDrewLinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links GravityOrionLookerSemantic LayerdbtLookMLTableauOpenClawPareto DistributionThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
From Models to Momentum: Uniting Architects and Engineers with ER/Studio 02.03.2026 45min

Summary In this episode of the Data Engineering Podcast, Jamie Knowles (Product Director) and Ryan Hirsch (Product Marketing Manager) discuss the importance of enterprise data modeling with ER/Studio. They highlight how clear, shared semantic models are a foundational discipline for modern data engineering, preventing semantic drift, speeding up delivery, and reducing rework. Jamie explains that ER/Studio helps teams define logical models that translate into physical designs and code across warehouses and analytics platforms, while maintaining traceability and governance. The conversation also touches on how AI increases the tolerance for ambiguity, but doesn't fix unclear definitions - it amplifies them. Jamie and Ryan describe ER/Studio's integrations with governance tools, collaboration features like TeamServer, reverse engineering, and metadata bridges, as well as new AI-assisted modeling capabilities. They emphasize that most data problems are meaning problems, and investing in architecture and a semantic backbone can make engineering faster, governance simpler, and analytics more reliable. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Jamie Knowles and Ryan Hirsch about ER/Studio and the foundational role of enterprise data modeling in modern data engineering.Interview IntroductionHow did you get involved in the area of data management?Can you describe what ER/Studio is and the story behind it? How has it evolved to handle the shift from traditional on-prem databases to modern, complex, and highly regulated enterprise environments?How do you define "Enterprise Data Architecture" today, and how does it differ from just managing a collection of pipelines in a modern data stack?In your view, what are the distinct responsibilities of a Data Architect versus a Data Engineer, and where is the critical overlap where they typically succeed or fail together?From what you see in the field, how often are the technical struggles of data engineering teams—like tool sprawl or "broken" pipelines—actually just "data meaning" problems in disguise?What is a logical data model, and why do you advocate for framing these as "knowledge models" rather than just technical diagrams?What are the long-term consequences, such as "semantic drift" or the erosion of trust, when organizations skip logical modeling to go straight to physical implementation and pipelines?What is the intersection of data modeling and data governance?What are the elements of integration between ER/Studio and governance platforms that reduce friction and time to delivery?For the engineers who worry that architecture and modeling slow down development, how does having a central design authority actually help teams scale and reduce downstream rework?What does a typical workflow look like across data architecture and data engineering for individuals and teams who are using ER/Studio as a core part of their modeling?What are the most interesting, innovative, or unexpected ways that you have seen ER/Studio used? * Context: Specifically regarding grounding AI initiatives or defining enterprise ontologies.What are the most interesting, unexpected, or challenging lessons that you have learned while working on ER/Studio?When is ER/Studio the wrong choice for a data team or a specific project?What do you have planned for the future of ER/Studio, particularly regarding AI and the "design-time" foundation of the data stack?Contact Info JamieLinkedInRyanLinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links IderaWherescapeER/StudioEntity-Relation Diagram (ERD)Business KeysMedallion ArchitectureRDF == Resource Description FrameworkCollibraMartin FowlerDB2The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
From Data Models to Mind Models: Designing AI Memory at Scale 22.02.2026 57min

Summary In this episode of the Data Engineering Podcast, Vasilije "Vas" Markovich, founder of Cognee, discusses building agentic memory, a crucial aspect of artificial intelligence that enables systems to learn, adapt, and retain knowledge over time. He explains the concept of agentic memory, highlighting the importance of distinguishing between permanent and session memory, graph+vector layers, latency trade-offs, and multi-tenant isolation to ensure safe knowledge sharing or protection. The conversation covers practical considerations such as storage choices (Redis, Qdrant, LanceDB, Neo4j), metadata design, temporal relevance and decay, and emerging research areas like trace-based scoring and reinforcement learning for improving retrieval. Vas shares real-world examples of agentic memory in action, including applications in pharma hypothesis discovery, logistics control towers, and cybersecurity feeds, as well as scenarios where simpler approaches may suffice. He also offers guidance on when to add memory, pitfalls to avoid (naive summarization, uncontrolled fine-tuning), human-in-the-loop realities, and Cognee's future plans: revamped session/long-term stores, decision-trace research, and richer time and transformation mechanisms. Additionally, Vas touches on policy guardrails for agent actions and the potential for more efficient "pseudo-languages" for multi-agent collaboration. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Vasilije Markovic about agentic memory architectures and applicationsInterview IntroductionHow did you get involved in the area of data management?Can you start by giving an overview of the different elements of "memory" in an agentic context?storage and retrieval mechanismshow to model memorieshow does that change as you go from short-term to long-term?managing scope and retrieval triggersWhat are some of the useful triggers in an agent architecture to identify whether/when/what to create a new memory?How do things change as you try to build a shared corpus of memory across agents?What are the most interesting, innovative, or unexpected ways that you have seen agentic memory used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cognee?When is a dedicated memory layer the wrong choice?What do you have planned for the future of Cognee?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links CogneeAI Engineering Podcast Episode[Kimball Memory](Cognitive ScienceContext WindowRAG == Retrieval Augmented GenerationMemory TypesRedis Vector StoreQdrantVector on EdgeMilvusLanceDBKuzuDBNeo4JMem0Zepp GraphitiA2A (Agent-to-Agent) ProtocolSnowplowReinforcement LearningModel FinetuningOpenClawThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Prompt Management, Tracing, and Evals: The New Table Stakes for GenAI Ops 15.02.2026 50min

Summary In this episode of the Data Engineering Podcast, Aman Agarwal, creator of OpenLit, discusses the operational groundwork required to run LLM-powered applications reliably and cost-effectively. He highlights common blind spots that teams face, including opaque model behavior, runaway token costs, and brittle prompt management, and explains how OpenTelemetry-native observability can turn these black-box interactions into stepwise, debuggable traces across models, tools, and data stores. Aman showcases OpenLit's approach to open standards, vendor-neutral integrations, and practical features such as fleet-managed OTEL collectors, zero-code Kubernetes instrumentation, prompt and secret management, and evaluation workflows. They also explore experimentation patterns, routing across models, and closing the loop from evals to prompt/dataset improvements, demonstrating how better visibility reshapes design choices from prototype to production. Aman shares lessons learned building in the open, where OpenLit fits and doesn't, and what's next in context management, security, and ecosystem integrations, providing resources and examples of multi-database observability deployments for listeners. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Aman Agarwal about the operational investments that are necessary to ensure you get the most out of your AI modelsInterview IntroductionHow did you get involved in the area of AI/data management?Can you start by giving your assessment of the main blind spots that are common in the existing AI application patterns?As teams adopt agentic architectures, how common is it to fall prey to those same blind spots?There are numerous tools/services available now focused on various elements of "LLMOps". What are the major components necessary for a minimum viable operational platform for LLMs?There are several areas of overlap, as well as disjoint features, in the ecosystem of tools (both open source and commercial). How do you advise teams to navigate the selection process? (point solutions vs. integrated tools, and handling frameworks with only partial overlap)Can you describe what OpenLit is and the story behind it?How would you characterize the feature set and focus of OpenLit compared to what you view as the "major players"?Once you have invested in a platform like OpenLit, how does that change the overall development workflow for the lifecycle of AI/agentic applications?What are the most complex/challenging elements of change management for LLM-powered systems? (e.g. prompt tuning, model changes, data changes, etc.)How can the information collected in OpenLit be used to develop a self-improvement flywheel for agentic systems?Can you describe the architecture and implementation of OpenLit?How have the scope and goals of the project changed since you started working on it?Given the foundational aspects of the project that you have built, what are some of the adjacent capabilities that OpenLit is situated to expand into?What are the sharp edges and blind spots that are still challenging even when you have OpenLit or similar integrated?What are the most interesting, innovative, or unexpected ways that you have seen OpenLit used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenLit?When is OpenLit the wrong choice?What do you have planned for the future of OpenLit?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data/AI management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links OpenLitFleet HubOpenTelemetryLangFuseLangSmithTensorZeroAI Engineering Podcast EpisodeTraceloopHeliconeClickhouseThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
From Legacy to AI-Ready: How MongoDB AMP Accelerates Modernization 08.02.2026 46min

SummaryIn this episode, Shilpa Kolhar, SVP of Product and Engineering at MongoDB, discusses using MongoDB as a unified foundation for AI-driven and agentic applications. She explains how the Application Modernization Platform (AMP) accelerates the transition from legacy relational systems to a document-first architecture, driven by the need for AI-readiness and speed of change. Shilpa highlights MongoDB's features, such as its native JSON document model, Atlas Vector Search, auto-embeddings, and integrated search, which help eliminate drift and latency across operational data, indexing, and vectors, emphasizing the importance of keeping context, transactions, and embeddings together for real-time AI use cases. She shares best practices for re-architecting legacy systems, including schema validation and versioning patterns to tame schema drift, aggregation pipelines for consistent reads, and pragmatic standardization across services, while also detailing AMP's approach to scoping large estates and the balance of LLM-powered automation with human-in-the-loop governance.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Shilpa Kolhar about using MongoDB as the foundation for AI-driven applicationsInterviewIntroductionHow did you get involved in the area of data management?Can you describe what MongoDB is and the core primitives that it offers?The MongoDB engine has gone through substantial evolution since it was first introduced over 20 years ago. What are some of the most notable features that have been added in recent years?You recently launched the MongoDB Application Modernization Platform (AMP). What are the key elements of modernization that it is focused on?How do the core primitives of the MongoDB engine align with modernization objectives?There is a lot of attention being paid now to AI applications where data is the most critical element for success. What are the features of MongoDB that lend itself to being the context store for generative AI services?Besides the data used for context and grounding, AI applications also want to track user interactions and form short and long term memory to improve the system over time. How can MongoDB assist in that work as well?While the lack of schema enforcement on write can be beneficial to rapid evolution of software, it can also be a detriment if not managed well. How can MongoDB help in avoiding schema drift over time that leads to old data being incompatible with current code?What are the most interesting, innovative, or unexpected ways that you have seen MongoDB used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on MongoDB and application modernization?When is MongoDB/AMP the wrong choice?What do you have planned for the future of AMP?Contact InfoLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?Closing AnnouncementsThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.LinksMongoDBMongoDB AMPGoogle GeminiVoyage AIQdrantChromaDBWeaviatePineconeMongoDB AutoembeddingRetoolODM == Object Document MapperRAG == Retrieval Augmented GenerationAgentic MemoryThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Branches, Diffs, and SQL: How Dolt Powers Agentic Workflows 01.02.2026 56min

Summary In this episode Tim Sehn, founder and CEO of DoltHub, talks about Dolt - the world’s first version‑controlled SQL database - and why Git‑style semantics belong at the heart of data systems and AI workflows. Tim explains how Dolt combines a MySQL/Postgres‑compatible interface with a novel storage engine built on a “Prollytree” to enable fast, row‑level branching, merging, and diffs of both schema and data. He digs into real production use cases: powering applications that expose version control to end users, reproducible ML feature stores, managing massive configuration for games, and enabling safe agentic writes via branch‑based review flows. He compares Dolt’s approach to LakeFS, Neon, and PlanetScale, and explores developer workflows unlocked by decentralized clones, full audit logs, and PR‑style data reviews. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Tim Sehn about Dolt, a version controlled database engine and its applications for agentic workflowsInterview IntroductionHow did you get involved in the area of data management?Can you describe what Dolt is and the story behind it?What are the key use cases that you are focused on solving by adding version control to the database layer?There are numerous projects related to different aspects of versioning in different data contexts (e.g. LakeFS, Datomic, etc.). What are the versioning semantics that you are focused on?You position Dolt as "the database for AI". How does data versioning relate to AI use cases?What types of AI systems are able to make best use of Dolt's versioning capabilities?Can you describe how Dolt and Doltgres are implemented?How have the design and scope of the project changed since you first started working on it?What are some of the architecture and integration patterns around relational databases that change when you introduce version control semantics as a core primitive?What are some anti-patterns that you have seen teams develop around Dolt's versioning functionality?What are the most interesting, innovative, or unexpected ways that you have seen Dolt used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dolt?When is Dolt the wrong choice?What do you have planned for the future of Dolt?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links DoltDoltHubStockmarket DataLakeFSDatomicGitMySQLProlly TreeNeonDjangoFeature StoreMCP ServerNessieIcebergPlanetScaleO(NlogN) Big O ComplexityB-TreeGit MergeGit RebaseAST == Abstract Syntax TreeSupabaseCockroachDBDocument DatabaseMongoDBGastownBeadsThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Logical First, Physical Second: A Pragmatic Path to Trusted Data 25.01.2026 40min

Summary In this episode of the Data Engineering Podcast Jamie Knowles, Product Director for ER/Studio, talks about data architecture and its importance in driving business meaning. He discusses how data architecture should start with business meaning, not just physical schemas, and explores the pitfalls of jumping straight to physical designs. Jamie shares his practical definition of data architecture centered on shared semantic models that anchor transactional, analytical, and event-driven systems. The conversation covers strategies for evolving an architecture in tandem with delivery, including defining core concepts, aligning teams through governance, and treating the model as a living product. He also examines how generative AI can both help and harm data architecture, accelerating first drafts but amplifying risk without a human-approved ontology. Jamie emphasizes the importance of doing the hard work upfront to make meaning explicit, keeping models simple and business-aligned, and using tools and patterns to reuse that meaning everywhere. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/BuildComposable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.Your host is Tobias Macey and today I'm interviewing Jamie Knowles about the impact that a well-developed data architecture (or lack thereof) has on data engineering workInterviewIntroductionHow did you get involved in the area of data management?Can you start by giving your definition of "data architecture" and what it encompasses?How does the nuance change depending on the type of system you are designing? (e.g. data warehouse vs. transactional application database vs. event-driven streaming service)In application teams that are large enough there is typically a software architect, but that work often ends up happening organically through trial and error. Who is the responsible party for designing and enforcing a proper data architecture?There have been several generational shifts in approach to data warehouse projects in particular. What are some of the anti-patterns that crop up when there is no-one forming a strong opinion on the design/architecture of the warehouse?The current stage is largely defined by the ELT pattern. What are some of the ways that workflow can encourage shortcuts?Often the need for a proper architecture isn't felt until an organic architecture has developed. What are some of the ways that teams can short-circuit that pain and iterate toward a more sustainable design?The common theme in all of the data architecture conversations that I've had is the need for business involvement. There is also a strong push for the business to just want the engineers to deliver data. What are some of the ways that AI utilities can help to accelerate delivery while also capturing business context?For teams that are already neck deep in a messy architecture, what are the strategies and tactics that they need to start working toward today to get to a better data architecture?What are the most interesting, innovative, or unexpected ways that you have seen teams approach the creation and implementation of their data architecture?What are the most interesting, unexpected, or challenging lessons that you have learned while working in data architecture?How do you see the introduction of AI at each stage of the data lifecycle changing the ways that teams think about their architectural needs?Contact InfoLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?Closing AnnouncementsThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.LinksIderaER StudioELTRDF == Resource Description FrameworkORM == Object-Relational MappingThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Your Data, Your Lake: How Observe Uses Iceberg and Streaming ETL for Observability 18.01.2026 1h 12min

Summary In this episode Jacob Leverich, cofounder and CTO of Observe, talks about applying lakehouse architectures to observability workloads. Jacob discusses Observe’s decision to leverage cloud-native warehousing and open table formats for scale and cost efficiency. He digs into the core pain points teams face with fragmented tools, soaring costs, and data silos, and how a lakehouse approach - paired with streaming ingest via OpenTelemetry, Kafka-backed durability, curated/columnarized tables, and query orchestration - can deliver low-latency, interactive troubleshooting across logs, metrics, and traces at petabyte scale. He also explore the practicalities of loading and organizing telemetry by use case to reduce read amplification, the role of Iceberg (including v3’s JSON shredding) and Snowflake’s implementation, and why open table formats enable “your data in your lake” strategies. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementIf you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/BuildComposable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.Your host is Tobias Macey and today I'm interviewing Jacob Leverich about how data lakehouse technologies can be applied to observability for unlimited scale and orders of magnitude improvement on economicsInterview IntroductionHow did you get involved in the area of data management?Can you start by giving an overview of what the major pain points have been in the observability space? (e.g. limited scale/retention, costs, integration fragmentation)What are the elements of the ecosystem and tech stacks that led to that state of the world?What are you building at Observe that circumvents those pain points?What are the major ecosystem evolutions that make this a feasible architecture? (e.g. columnar storage, distributed compute, protocol consolidation)Can you describe the architecture of the Observe platform?How have the design of the platform evolved/changed direction since you first started working on it?What was your process for determining which core technologies to build on top of?What were the missing pieces that you had to engineer around to get a cohesive and performant platform?The perennial problem with observability systems and data lakes is their tendency to succumb to entropy. What are the guardrails that you are relying on to help customers maintain a well-structured and usable repository of information?Data lakehouses are excellent for flexibility and scaling to massive data volumes, but they're not known for being fast. What are the areas of investment in the ecosystem that is changing that narrative?As organizations overcome the constraints of limited retention periods and anxiety over cost, what new use cases does that unlock for their observability data?How do AI applications/agents change the requirements around observability data? (collection, scale, complexity, applications, etc.)What are the most interesting, innovative, or unexpected ways that you have seen Observe/lakehouse technologies used for observability?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Observe?When is Observe/lakehouse technologies the wrong choice?What do you have planned for the future of Observe?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links Observe Inc.Lakehouse ArchitectureSplunkObservabilityRSyslogGlusterFSDremelDrillBigQuerySnowflake SIGMOD PaperPrometheusDatadogNewRelicAppDynamicsDynaTraceLokiCortexMimirTempoCardinalityFluentBitFluentDOpenTelemetryOTLP == OpenTelemetry Line ProtocolKafkaVPC Flow LogsRead AmplificationLanceIcebergHudiPromQLThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Semantic Operators Meet Dataframes: Building Context for Agents with FENIC 12.01.2026 56min

Summary In this episode Kostas Pardalis talks about Fenic - an open-source, PySpark-inspired dataframe engine designed to bring LLM-powered semantics into reliable data engineering workflows. Kostas shares why today’s data infrastructure assumptions (BI-first, expert-operated, CPU-bound) fall short for AI-era tasks that are increasingly inference- and IO-bound. He explores how Fenic introduces semantic operators (e.g., semantic filter, extract, join) as first-class citizens in the logical plan so the optimizer can reason about inference, costs, and constraints. This enables developers to turn unstructured data into explicit schemas, compose transformations lazily, and offload LLM work safely and efficiently. He digs into Fenic’s architecture (lazy dataframe API, logical/physical plans, Polars execution, DuckDB/Arrow SQL path), how it exposes tools via MCP for agent integration, and where it fits in context engineering as a companion for memory/state management in agentic systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementYou’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/BuildComposable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.Your host is Tobias Macey and today I'm interviewing Kostas Pardalis about Fenic, an opinionated, PySpark-inspired DataFrame framework for building AI and agentic applicationsInterview IntroductionHow did you get involved in the area of data management?Can you describe what Fenic is and the story behind it?What are the core problems that you are trying to address with Fenic?Dataframes have become a popular interface for doing chained transformations on structured data. What are the benefits of using that paradigm for LLM use-cases?Can you describe the architecture and implementation of Fenic?How have the design and scope of the project changed since you first started working on it?You position Fenic as a means of bringing reliability to LLM-powered transformations. What are some of the anti-patterns that teams should be aware of when getting started with Fenic?What are some of the most common first steps that teams take when integrating Fenic into their pipelines or applications?What are some of the ways that teams should be thinking about using Fenic and semantic operations for data pipelines and transformations?How does Fenic help with context engineering for agentic use cases?What are some examples of toolchains/workflows that could be replaced with Fenic?How does Fenic integrate with the broader ecosystem of data and AI frameworks? (e.g. Polars, Arrow, Qdrant, LangChan/Pydantic AI)What are the most interesting, innovative, or unexpected ways that you have seen Fenic used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Fenic?When is Fenic the wrong choice?What do you have planned for the future of Fenic?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.Links FenicRudderStackPodcast EpisodeTrinoStarburstTrino Project TardigradeTypedef AIdbtPySparkUDF == User-Defined FunctionLOTUSPandasPolarsRelational AlgebraArrowDuckDBMarkdownPydantic AIAI Engineering Podcast EpisodeLangChainRayDaskThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Beyond Dashboards: How Data Teams Earn a Seat at the Table 05.01.2026 49min

Summary In this episode Goutham Budati about his Data–Perspective–Action framework and how it empowers data teams to become true business partners. Gautham traces his path from automating Excel reports to leading high‑impact data organizations, then breaks down why technical excellence alone isn’t enough: teams must pair reliable data systems with deliberate storytelling, clear problem framing, and concrete action plans. He digs into tactics for moving from reactive ticket-taking to proactive influence — weekly one‑page narratives, design-first discovery, sampling stakeholders for real pain points, and treating dashboards as living roadmaps. He also explores how to right-size technical scope, preserve trust in core metrics, organize teams as “build” and “storytelling” duos, and translate business macros and micros into resilient system designs. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementComposable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/BuildYour host is Tobias Macey and today I'm interviewing Goutham Budati about his data-perspective-action framework for empowering data teams to be more influential in the businessInterview IntroductionHow did you get involved in the area of data management?Can you describe what the Data-Perspective-Action framework is and the story behind it?What does it look like when someone operates at each of those three levels?How does that change the day-to-day work of an individual contributor?Why does technically excellent data work sometimes fail to drive decisions?How do you identify whether a data system or pipeline is actually creating value versus just existing?What's the moment when you realized that building reliable systems wasn't the same as enabling better decisions?Better decisions still need to be powered by reliable systems. How do you manage the tension of focusing on up-time against focusing on impact?What does it mean to add "Perspective" to data? How is that different from analysis or insights?How do you know when you're overwhelming stakeholders versus giving them what they need?What changes when you start designing systems to surface signal rather than just providing comprehensive data?How do you learn what business context matters for turning data into something actionable?What does it mean to design for Action from day one? How does that change what you build?How do you get stakeholders to actually act on data instead of just consuming it?Walk us through how you structure collaboration with business partners when you're trying to drive decisions, not just inform them.What's the relationship between iteration and trust when you're building data products?What does the transition from order-taker to strategic partner actually look like? What has to change?How do you position data work as driving the business rather than supporting it?Why does storytelling matter for data professionals? What role does it play that technical communication doesn't cover?What organizational structures or team setups help data people gain influence?Tell us about a time when you built something technically sound that failed to create impact. What did you learn?What are the common patterns in dysfunctional data organizations? What causes the breakdown?How do you rebuild credibility when you inherit a data function that's lost trust with the business?What's the relationship between technical excellence and stakeholder trust? Can you have one without the other?When is this framework the wrong lens? What situations call for a different approach?How do you balance the demand for technical depth with the need to develop business and communication skills?How should data professionals position themselves as AI and ML tools become more accessible?What shifts do you see coming in how businesses think about data work?How is your thinking about data impact evolving?For someone who recognizes they're focused purely on the technical work and wants to expand their impact—where should they start?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

I/E popullarizuar në

Ky podkast shfaqet edhe në listat e podkasteve të këtyre shteteve.

Data Engineering Podcast

Episodet

Podkaste të ngjashme

I/E popullarizuar në