<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>DataOps on Adur</title><link>https://adurrr.github.io/en/categories/dataops/</link><description>Recent content in DataOps on Adur</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sun, 15 Jun 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://adurrr.github.io/en/categories/dataops/index.xml" rel="self" type="application/rss+xml"/><item><title>LLMOps: integrating LLMs into DevOps workflows</title><link>https://adurrr.github.io/en/p/llmops-integrating-llms-into-devops-workflows/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://adurrr.github.io/en/p/llmops-integrating-llms-into-devops-workflows/</guid><description>&lt;p&gt;LLMs have moved beyond chatbots. They&amp;rsquo;re now embedded in engineering workflows where they automate tedious tasks, speed incident response, and boost developer productivity. But deploying an LLM into a production DevOps pipeline is fundamentally different from using ChatGPT in a browser.&lt;/p&gt;
&lt;p&gt;This guide covers what LLMOps means in practice, where LLMs fit into DevOps, architecture patterns that work, and pitfalls to avoid.&lt;/p&gt;
&lt;h2 id="what-is-llmops"&gt;What is LLMOps?
&lt;/h2&gt;&lt;p&gt;LLMOps is the set of practices, tools, and infrastructure needed to operationalize LLMs. It extends MLOps but addresses challenges unique to language models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Model selection vs. model training&lt;/strong&gt;: Most teams consume pre-trained models (via APIs or self-hosted inference) rather than training from scratch. The operational focus shifts to prompt engineering, fine-tuning, and retrieval-augmented generation (RAG).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost management&lt;/strong&gt;: LLM inference is expensive. Token-based pricing means costs scale with usage in ways that are harder to predict than traditional compute.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Non-determinism&lt;/strong&gt;: LLMs produce variable outputs for the same input, which complicates testing, validation, and reproducibility.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt;: Response times of seconds (not milliseconds) require different architectural patterns than traditional microservices.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;LLMOps is not a separate discipline. It is an extension of your existing DevOps and MLOps practices, adapted for the specific operational characteristics of language models.&lt;/p&gt;
&lt;h2 id="practical-use-cases-in-devops"&gt;Practical use cases in DevOps
&lt;/h2&gt;&lt;p&gt;Here is where LLMs are delivering real value in DevOps workflows today:&lt;/p&gt;
&lt;h3 id="automated-code-review"&gt;Automated code review
&lt;/h3&gt;&lt;p&gt;LLMs can provide a first-pass review of pull requests, catching common issues like missing error handling, security anti-patterns, inconsistent naming, or missing tests. They do not replace human reviewers but reduce the burden of repetitive feedback.&lt;/p&gt;
&lt;h3 id="incident-summarization"&gt;Incident summarization
&lt;/h3&gt;&lt;p&gt;When an incident fires at 3 AM, the on-call engineer needs context fast. An LLM can ingest alert data, recent deployment logs, related runbooks, and previous incident reports to produce a concise summary of what is likely going wrong and what was done last time.&lt;/p&gt;
&lt;h3 id="log-analysis"&gt;Log analysis
&lt;/h3&gt;&lt;p&gt;LLMs are surprisingly effective at pattern recognition in unstructured log data. Feed them a block of error logs and they can often surface the likely root cause faster than a manual grep session, especially for unfamiliar systems.&lt;/p&gt;
&lt;h3 id="documentation-generation"&gt;Documentation generation
&lt;/h3&gt;&lt;p&gt;LLMs can generate draft documentation from code, API schemas, or Terraform modules. The output needs human review, but it eliminates the blank-page problem and keeps docs closer to the current state.&lt;/p&gt;
&lt;h3 id="infrastructure-as-code-generation"&gt;Infrastructure as Code generation
&lt;/h3&gt;&lt;p&gt;Given a natural language description of desired infrastructure, LLMs can generate Terraform, Ansible, or Kubernetes manifests as a starting point. Useful for scaffolding, not for production-ready code without review.&lt;/p&gt;
&lt;h2 id="architecture-patterns-for-llm-integration"&gt;Architecture patterns for LLM integration
&lt;/h2&gt;&lt;h3 id="pattern-1-api-gateway-to-external-llm"&gt;Pattern 1: API gateway to external LLM
&lt;/h3&gt;&lt;p&gt;The simplest approach. Your application calls an external LLM API (OpenAI, Anthropic, etc.) through a centralized gateway that handles authentication, rate limiting, logging, and cost tracking.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[CI/CD Pipeline] --&amp;gt; [API Gateway] --&amp;gt; [External LLM API]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; [Logging &amp;amp; Metrics]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; [Cost Tracking]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: No infrastructure to manage, access to the most capable models, fast to implement.
&lt;strong&gt;Cons&lt;/strong&gt;: Data leaves your network, vendor lock-in, variable latency, ongoing API costs.&lt;/p&gt;
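&lt;p&gt;As an illustration of what the gateway tracks, here is a minimal cost ledger with a per-minute request cap. The class, prices, and limits are hypothetical; real token prices vary by provider and model.&lt;/p&gt;

```python
import time
from dataclasses import dataclass, field

# Illustrative prices; check your provider's current rate card.
PRICE_PER_1K_INPUT = 0.005
PRICE_PER_1K_OUTPUT = 0.015

@dataclass
class GatewayLedger:
    """Tracks spend and enforces a simple per-minute request cap."""
    max_requests_per_minute: int = 60
    total_cost_usd: float = 0.0
    _timestamps: list = field(default_factory=list)

    def admit(self, now=None):
        """Raise if over the rate limit; otherwise record the call."""
        now = time.monotonic() if now is None else now
        # Keep only timestamps from the last 60 seconds.
        self._timestamps = [t for t in self._timestamps if t > now - 60]
        if len(self._timestamps) >= self.max_requests_per_minute:
            raise RuntimeError("rate limit exceeded")
        self._timestamps.append(now)

    def record(self, input_tokens, output_tokens):
        """Accumulate the dollar cost of one completed request."""
        cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT
        cost += (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
        self.total_cost_usd += cost
        return cost
```

&lt;p&gt;In a real gateway this bookkeeping sits next to authentication and structured logging of every prompt and response.&lt;/p&gt;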
&lt;h3 id="pattern-2-self-hosted-inference"&gt;Pattern 2: Self-hosted inference
&lt;/h3&gt;&lt;p&gt;Run open-weight models (Llama, Mistral, etc.) on your own infrastructure using inference servers like vLLM or Ollama.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[CI/CD Pipeline] --&amp;gt; [Load Balancer] --&amp;gt; [vLLM / Ollama Instance(s)]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; [GPU Node Pool]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Data stays internal, predictable costs at scale, no vendor dependency, full control over model versions.
&lt;strong&gt;Cons&lt;/strong&gt;: Requires GPU infrastructure, operational overhead, smaller models may be less capable.&lt;/p&gt;
&lt;h3 id="pattern-3-rag-enhanced-pipeline"&gt;Pattern 3: RAG-enhanced pipeline
&lt;/h3&gt;&lt;p&gt;Combine an LLM with a retrieval system that provides relevant context from your own knowledge base (runbooks, documentation, past incidents). This dramatically improves response quality for domain-specific tasks.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[Query] --&amp;gt; [Embedding Model] --&amp;gt; [Vector DB Search] --&amp;gt; [Context + Query] --&amp;gt; [LLM] --&amp;gt; [Response]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; [Your Knowledge Base]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; (runbooks, docs, etc.)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This pattern is particularly powerful for incident response and documentation tasks where the LLM needs your organization&amp;rsquo;s specific context.&lt;/p&gt;
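&lt;p&gt;The retrieval step can be sketched in a few lines. This toy version uses word-count vectors and an in-memory list instead of a real embedding model and vector database; the function names and runbook texts are made up for illustration.&lt;/p&gt;

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a word-count vector. Real pipelines call a model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in for a vector database over your runbooks and docs.
KNOWLEDGE_BASE = [
    "Runbook: restart the payment service with kubectl rollout restart",
    "Runbook: rotate database credentials via vault",
]

def retrieve(query, k=1):
    scored = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: cosine(embed(query), embed(doc)),
                    reverse=True)
    return scored[:k]

def build_prompt(query):
    # Ground the model in retrieved context before asking the question.
    context = "\n".join(retrieve(query))
    return f"Use this context:\n{context}\n\nQuestion: {query}"
```

&lt;p&gt;Swapping &lt;code&gt;embed&lt;/code&gt; for a call to an embedding model and the list for a vector store gives you the production shape of the same pipeline.&lt;/p&gt;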
&lt;h2 id="key-considerations"&gt;Key considerations
&lt;/h2&gt;&lt;h3 id="cost"&gt;Cost
&lt;/h3&gt;&lt;p&gt;LLM API costs can be surprising. A code review pipeline that processes 50 PRs per day with large diffs can easily run hundreds of dollars per month. Strategies to control costs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set token limits per request&lt;/li&gt;
&lt;li&gt;Cache common queries and responses&lt;/li&gt;
&lt;li&gt;Use smaller models for simpler tasks (triage with a small model, escalate to a larger one)&lt;/li&gt;
&lt;li&gt;Monitor token usage per pipeline and set alerts&lt;/li&gt;
&lt;/ul&gt;
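&lt;p&gt;Two of the strategies above, caching and request size limits, fit in a few lines. This sketch stubs out the model call; &lt;code&gt;call_model&lt;/code&gt; and the limits are placeholders, and a production cache would typically live in Redis rather than in-process.&lt;/p&gt;

```python
import functools

MAX_PROMPT_CHARS = 4000  # crude character-count proxy for a token limit

def call_model(prompt):
    """Stub for a real provider call; swap in your client here."""
    return f"summary of: {prompt[:40]}"

@functools.lru_cache(maxsize=1024)
def cached_completion(prompt):
    # Identical prompts (common in CI) hit the cache instead of the API,
    # and oversized prompts are truncated before they are ever sent.
    return call_model(prompt[:MAX_PROMPT_CHARS])
```

&lt;p&gt;The cache hit rate is itself a useful metric: a low rate on a repetitive workload usually means prompts contain unnecessary variability, such as timestamps.&lt;/p&gt;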
&lt;h3 id="latency"&gt;Latency
&lt;/h3&gt;&lt;p&gt;LLM responses take seconds, not milliseconds. Design your integrations as asynchronous processes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Post code review comments after the fact, do not block the PR&lt;/li&gt;
&lt;li&gt;Process incident data in the background, push results to a Slack channel&lt;/li&gt;
&lt;li&gt;Use streaming responses where possible to improve perceived performance&lt;/li&gt;
&lt;/ul&gt;
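&lt;p&gt;The asynchronous pattern looks roughly like this sketch, with hypothetical stub functions standing in for the LLM call and the Slack webhook:&lt;/p&gt;

```python
import asyncio

SENT = []  # stand-in for a Slack channel

async def summarize_incident(alert):
    """Stub LLM call; a real one awaits an HTTP request for seconds."""
    await asyncio.sleep(0.01)
    return f"summary: {alert}"

async def post_to_slack(message):
    SENT.append(message)  # a real version would POST to a webhook

async def process_alerts(alerts):
    # All summaries run concurrently in the background; nothing upstream
    # blocks on any single multi-second LLM response.
    summaries = await asyncio.gather(*(summarize_incident(a) for a in alerts))
    for summary in summaries:
        await post_to_slack(summary)
```

&lt;p&gt;The same shape works with a job queue instead of &lt;code&gt;asyncio&lt;/code&gt; when the work needs to survive process restarts.&lt;/p&gt;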
&lt;h3 id="hallucinations"&gt;Hallucinations
&lt;/h3&gt;&lt;p&gt;LLMs will confidently generate plausible-sounding but incorrect information. This is a critical concern for DevOps tasks where bad advice can cause outages.&lt;/p&gt;
&lt;p&gt;Mitigations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Always present LLM output as suggestions, never as authoritative actions&lt;/li&gt;
&lt;li&gt;Require human approval before any LLM-generated change is applied&lt;/li&gt;
&lt;li&gt;Use RAG to ground responses in verified documentation&lt;/li&gt;
&lt;li&gt;Implement output validation (e.g., lint generated IaC before presenting it)&lt;/li&gt;
&lt;/ul&gt;
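&lt;p&gt;Output validation can be as simple as refusing to surface anything that fails a machine check. This hypothetical sketch gates LLM output on being valid JSON with an expected shape; for generated IaC you would run the equivalent linter or &lt;code&gt;terraform validate&lt;/code&gt; instead.&lt;/p&gt;

```python
import json

REQUIRED_KEYS = {"resource_type", "name"}

def validate_llm_output(raw):
    """Return parsed output only if it passes machine checks, else None.

    The principle: validate before a human ever sees the suggestion,
    so obviously malformed hallucinations never reach the review queue.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    missing = REQUIRED_KEYS - set(data)
    if missing:
        return None
    return data
```

&lt;p&gt;Rejected outputs are worth logging: a rising rejection rate is an early signal that a prompt or model change has degraded quality.&lt;/p&gt;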
&lt;h3 id="security"&gt;Security
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data exposure&lt;/strong&gt;: Anything you send to an external LLM API may be used for training or stored. Never send secrets, credentials, or sensitive customer data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt injection&lt;/strong&gt;: Malicious content in code, logs, or user input can manipulate LLM behavior. Sanitize inputs and validate outputs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supply chain&lt;/strong&gt;: LLM-generated code may introduce vulnerabilities. Run all generated code through your existing security scanning pipeline.&lt;/li&gt;
&lt;/ul&gt;
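&lt;p&gt;A thin redaction layer in front of the API call catches the most common leaks. The patterns below are illustrative, not exhaustive; extend them with your organization&amp;rsquo;s own secret formats.&lt;/p&gt;

```python
import re

# Illustrative patterns only; real deployments need a fuller ruleset.
PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                # AWS access key ID
    re.compile(r"(?i)(api[_-]?key\s*[:=]\s*)\S+"),  # key=... assignments
]

def redact(text):
    """Mask likely secrets before a prompt leaves your network."""
    for pattern in PATTERNS:
        text = pattern.sub(
            lambda m: (m.group(1) if m.lastindex else "") + "[REDACTED]",
            text,
        )
    return text
```

&lt;p&gt;Run redaction on everything headed to an external API, including log excerpts and diffs, since secrets frequently leak through those rather than through the prompt template itself.&lt;/p&gt;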
&lt;h2 id="tools-and-platforms"&gt;Tools and platforms
&lt;/h2&gt;&lt;h3 id="langchain"&gt;LangChain
&lt;/h3&gt;&lt;p&gt;A framework for building LLM-powered applications. Useful for orchestrating multi-step chains (e.g., retrieve context, format prompt, call LLM, parse output). Supports many LLM providers and has good tooling for RAG pipelines.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;langchain.chat_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;langchain.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;Review this code diff for security issues and suggest fixes:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{diff}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;gpt-4o&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;diff&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;code_diff&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id="vllm"&gt;vLLM
&lt;/h3&gt;&lt;p&gt;A high-throughput inference engine for self-hosted models. Supports PagedAttention for efficient memory management and continuous batching for high throughput.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Start a vLLM server&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;python -m vllm.entrypoints.openai.api_server &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --model mistralai/Mistral-7B-Instruct-v0.2 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --port &lt;span class="m"&gt;8000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Exposes an OpenAI-compatible API, so you can swap between self-hosted and external APIs with minimal code changes.&lt;/p&gt;
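&lt;p&gt;Because the API is OpenAI-compatible, pointing a client at the server is just a matter of building the standard chat-completion request against the local base URL. A sketch, with the model name and port matching the launch command above:&lt;/p&gt;

```python
def chat_request(prompt,
                 model="mistralai/Mistral-7B-Instruct-v0.2",
                 base_url="http://localhost:8000/v1"):
    """Build an OpenAI-compatible chat completion request for a vLLM server.

    POST the "json" payload to the "url" with any HTTP client, or point
    an OpenAI-style SDK at base_url and pass the same fields.
    """
    return {
        "url": f"{base_url}/chat/completions",
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
            "max_tokens": 512,
        },
    }
```

&lt;p&gt;Keeping the request shape in one place like this makes switching between self-hosted and external providers a one-line base-URL change.&lt;/p&gt;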
&lt;h3 id="ollama"&gt;Ollama
&lt;/h3&gt;&lt;p&gt;The easiest way to run LLMs locally for development and testing. Great for prototyping pipelines before committing to infrastructure.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Pull and run a model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ollama pull llama3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ollama run llama3 &lt;span class="s2"&gt;&amp;#34;Summarize this error log: [paste log]&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Serve as an API&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Then call http://localhost:11434/api/generate&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id="example-automated-pr-review-pipeline"&gt;Example: Automated PR review pipeline
&lt;/h2&gt;&lt;p&gt;Here is a conceptual pipeline for automated PR review using an LLM:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;span class="lnt"&gt;32
&lt;/span&gt;&lt;span class="lnt"&gt;33
&lt;/span&gt;&lt;span class="lnt"&gt;34
&lt;/span&gt;&lt;span class="lnt"&gt;35
&lt;/span&gt;&lt;span class="lnt"&gt;36
&lt;/span&gt;&lt;span class="lnt"&gt;37
&lt;/span&gt;&lt;span class="lnt"&gt;38
&lt;/span&gt;&lt;span class="lnt"&gt;39
&lt;/span&gt;&lt;span class="lnt"&gt;40
&lt;/span&gt;&lt;span class="lnt"&gt;41
&lt;/span&gt;&lt;span class="lnt"&gt;42
&lt;/span&gt;&lt;span class="lnt"&gt;43
&lt;/span&gt;&lt;span class="lnt"&gt;44
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c"&gt;# .github/workflows/llm-review.yml&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;LLM Code Review&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;pull_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="l"&gt;opened, synchronize]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;llm-review&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;runs-on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ubuntu-latest&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Checkout&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;actions/checkout@v4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;fetch-depth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Get diff&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;diff&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="sd"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; git diff origin/${{ github.base_ref }}...HEAD &amp;gt; diff.txt&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Run LLM review&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;LLM_API_KEY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;${{ secrets.LLM_API_KEY }}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="sd"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; python scripts/llm_review.py \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; --diff diff.txt \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; --model gpt-4o \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; --max-tokens 2000 \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; --output review.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Post review comments&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;actions/github-script@v7&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;script&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="sd"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; const review = require(&amp;#39;./review.json&amp;#39;);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; await github.rest.pulls.createReview({
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; owner: context.repo.owner,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; repo: context.repo.repo,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; pull_number: context.issue.number,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; body: review.summary,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; event: &amp;#39;COMMENT&amp;#39;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; comments: review.line_comments
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; });&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The review script would:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read the diff&lt;/li&gt;
&lt;li&gt;Split large diffs into chunks that fit within the model&amp;rsquo;s context window&lt;/li&gt;
&lt;li&gt;For each chunk, construct a prompt asking for security issues, bugs, and style problems&lt;/li&gt;
&lt;li&gt;Aggregate results and format as GitHub review comments&lt;/li&gt;
&lt;li&gt;Include confidence scores and always mark output as AI-generated&lt;/li&gt;
&lt;/ol&gt;
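&lt;p&gt;Step 2, chunking, is the part that trips people up most often. Here is a minimal sketch that splits on file boundaries, using character count as a crude proxy for tokens (the function name and limit are hypothetical):&lt;/p&gt;

```python
def chunk_diff(diff_text, max_chars=8000):
    """Split a unified diff on file boundaries so each chunk fits the
    model's context window. max_chars is a rough stand-in for tokens."""
    parts = diff_text.split("diff --git ")
    chunks, current = [], ""
    for part in parts:
        if not part:
            continue
        piece = "diff --git " + part
        # Flush the current chunk before it would overflow the budget.
        if len(current) + len(piece) > max_chars and current:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

&lt;p&gt;A single file larger than the limit still produces an oversized chunk; a fuller implementation would split within files on hunk boundaries.&lt;/p&gt;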
&lt;h2 id="guardrails-and-responsible-use"&gt;Guardrails and responsible use
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Label all LLM output clearly&lt;/strong&gt; as AI-generated. Engineers should know when they are reading machine output.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Never auto-merge or auto-apply&lt;/strong&gt; LLM suggestions. Keep a human in the loop for all changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Log all prompts and responses&lt;/strong&gt; for debugging and audit purposes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Set spending limits&lt;/strong&gt; and alerts on LLM API usage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review prompt templates regularly&lt;/strong&gt; to ensure they do not leak sensitive information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test for bias and errors&lt;/strong&gt; with representative samples before deploying to production workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="getting-started-recommendations"&gt;Getting started recommendations
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Pick one use case&lt;/strong&gt; - Don&amp;rsquo;t try to LLM-enable everything at once. Start low-risk: documentation drafts, commit message suggestions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Start with an external API&lt;/strong&gt; - Don&amp;rsquo;t invest in GPU infrastructure until you&amp;rsquo;ve validated the use case. Use OpenAI or Anthropic to prototype.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure everything&lt;/strong&gt; - Track cost per invocation, latency, user satisfaction, error rates from day one.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build an evaluation framework&lt;/strong&gt; - Create a test suite of known-good inputs and expected outputs. Run it against every prompt change or model update.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Plan your data strategy&lt;/strong&gt; - Decide early what data you will and will not send to external APIs, and document it clearly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iterate on prompts&lt;/strong&gt; - Prompt engineering is iterative: keep prompts in version control and treat them as code.&lt;/li&gt;
&lt;/ol&gt;
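&lt;p&gt;An evaluation framework can start as simply as golden examples plus substring checks. In this sketch, &lt;code&gt;generate_commit_message&lt;/code&gt; and the cases are illustrative stand-ins for your prompt under test:&lt;/p&gt;

```python
# Golden examples: known inputs and terms the output must contain.
EVAL_CASES = [
    {"diff": "fix null check in parser", "must_contain": ["fix", "parser"]},
    {"diff": "add retry logic to uploader", "must_contain": ["retry"]},
]

def generate_commit_message(diff: str) -> str:
    # placeholder: call your model with your current prompt template here
    return "chore: " + diff

def run_eval() -> float:
    """Return the pass rate; gate prompt or model changes on a threshold."""
    passed = 0
    for case in EVAL_CASES:
        output = generate_commit_message(case["diff"]).lower()
        if all(term in output for term in case["must_contain"]):
            passed += 1
    return passed / len(EVAL_CASES)
```

&lt;p&gt;Run this in CI against every prompt change or model update, and fail the build when the pass rate drops.&lt;/p&gt;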
&lt;p&gt;LLMs are a powerful tool for DevOps automation, but they&amp;rsquo;re exactly that: a tool. They work best when thoughtfully integrated into existing workflows, with clear boundaries on what they can and cannot do autonomously.&lt;/p&gt;</description></item><item><title>DataOps: building reliable data pipelines</title><link>https://adurrr.github.io/en/p/dataops-building-reliable-data-pipelines/</link><pubDate>Sun, 18 Feb 2024 00:00:00 +0000</pubDate><guid>https://adurrr.github.io/en/p/dataops-building-reliable-data-pipelines/</guid><description>&lt;h2 id="what-is-dataops"&gt;What Is DataOps?
&lt;/h2&gt;&lt;p&gt;DataOps applies DevOps principles (automation, continuous integration, monitoring, collaboration) to data pipelines and analytics. While DevOps ships software reliably, DataOps ships &lt;em&gt;data&lt;/em&gt; reliably.&lt;/p&gt;
&lt;p&gt;The difference: what flows through the pipeline. DevOps builds, tests, deploys code. DataOps builds, tests, deploys data transformations. The goal is ensuring data arriving at dashboards, ML models, and downstream systems is correct, fresh, and trustworthy.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;ve ever had a broken dashboard Monday morning because a schema changed over the weekend, you know why DataOps matters.&lt;/p&gt;
&lt;h2 id="core-principles"&gt;Core principles
&lt;/h2&gt;&lt;h3 id="1-automation-first"&gt;1. Automation first
&lt;/h3&gt;&lt;p&gt;Every step in your data pipeline &amp;ndash; extraction, transformation, loading, testing, and deployment &amp;ndash; should be automated. Manual SQL scripts run from someone&amp;rsquo;s laptop are a liability. Codify everything, version it in Git, and let orchestrators handle execution.&lt;/p&gt;
&lt;h3 id="2-continuous-testing"&gt;2. Continuous testing
&lt;/h3&gt;&lt;p&gt;Data testing is not optional. You should validate data at every stage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema tests&lt;/strong&gt;: column types, nullability constraints&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Volume tests&lt;/strong&gt;: row counts within expected ranges&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness tests&lt;/strong&gt;: data arrived on schedule&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business rule tests&lt;/strong&gt;: revenue is never negative, dates are not in the future&lt;/li&gt;
&lt;/ul&gt;
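&lt;p&gt;Schema, volume, and business-rule checks can be sketched without any framework. Column names and bounds below are illustrative; freshness checks need pipeline metadata and are omitted here:&lt;/p&gt;

```python
from datetime import date

def validate_batch(rows: list[dict]) -> list[str]:
    """Apply schema, business-rule, and volume tests to a batch of sales rows."""
    failures = []
    required = {"order_id", "amount", "order_date"}
    for i, row in enumerate(rows):
        # Schema test: required columns present
        missing = required - set(row)
        if missing:
            failures.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        # Business rule tests: revenue never negative, dates never in the future
        if not (row["amount"] >= 0):
            failures.append(f"row {i}: negative amount {row['amount']}")
        if row["order_date"] > date.today():
            failures.append(f"row {i}: order_date in the future")
    # Volume test: row count inside an expected range (bounds are assumptions)
    if len(rows) not in range(1, 1_000_001):
        failures.append(f"unexpected row count {len(rows)}")
    return failures
```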
&lt;h3 id="3-monitoring-and-observability"&gt;3. Monitoring and observability
&lt;/h3&gt;&lt;p&gt;You need to know when something breaks before your stakeholders do. Instrument your pipelines with metrics on latency, row counts, error rates, and data quality scores. Set up alerts that fire when anomalies are detected.&lt;/p&gt;
&lt;h3 id="4-collaboration-and-version-control"&gt;4. Collaboration and version control
&lt;/h3&gt;&lt;p&gt;Data pipelines are code. Treat them that way. Use pull requests, code reviews, and CI/CD for your transformation logic. Every change to a pipeline should be reviewable, testable, and reversible.&lt;/p&gt;
&lt;h2 id="pipeline-architecture-etl-vs-elt"&gt;Pipeline architecture: ETL vs ELT
&lt;/h2&gt;&lt;p&gt;The two dominant patterns for data pipelines are ETL and ELT. The choice depends on your infrastructure and use case.&lt;/p&gt;
&lt;h3 id="etl-extract-transform-load"&gt;ETL (Extract, Transform, Load)
&lt;/h3&gt;&lt;p&gt;Data is extracted from sources, transformed in a processing engine (Spark, Python scripts), and then loaded into the target system. This pattern makes sense when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need to reduce data volume before loading (cost control)&lt;/li&gt;
&lt;li&gt;Transformations require heavy computation not suited for your warehouse&lt;/li&gt;
&lt;li&gt;You have strict data governance requiring transformation before storage&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="elt-extract-load-transform"&gt;ELT (Extract, Load, Transform)
&lt;/h3&gt;&lt;p&gt;Data is extracted and loaded raw into a data warehouse (BigQuery, Snowflake, Redshift), then transformed in place using SQL. This is the modern default because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cloud warehouses have massive compute capacity&lt;/li&gt;
&lt;li&gt;SQL-based transformations are easier to review and test&lt;/li&gt;
&lt;li&gt;Raw data is preserved, enabling reprocessing when logic changes&lt;/li&gt;
&lt;li&gt;Tools like dbt make SQL-based transformations first-class citizens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For most teams starting today, &lt;strong&gt;ELT is the recommended approach&lt;/strong&gt; unless you have a specific reason to transform before loading.&lt;/p&gt;
&lt;h2 id="key-tools"&gt;Key tools
&lt;/h2&gt;&lt;h3 id="apache-airflow--orchestration"&gt;Apache Airflow &amp;ndash; orchestration
&lt;/h3&gt;&lt;p&gt;Airflow is the most widely adopted open-source orchestrator for data pipelines. It lets you define workflows as Directed Acyclic Graphs (DAGs) in Python, with built-in scheduling, retries, dependency management, and a web UI for monitoring.&lt;/p&gt;
&lt;p&gt;Here is a practical example of a DAG that orchestrates an ELT pipeline:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;span class="lnt"&gt;32
&lt;/span&gt;&lt;span class="lnt"&gt;33
&lt;/span&gt;&lt;span class="lnt"&gt;34
&lt;/span&gt;&lt;span class="lnt"&gt;35
&lt;/span&gt;&lt;span class="lnt"&gt;36
&lt;/span&gt;&lt;span class="lnt"&gt;37
&lt;/span&gt;&lt;span class="lnt"&gt;38
&lt;/span&gt;&lt;span class="lnt"&gt;39
&lt;/span&gt;&lt;span class="lnt"&gt;40
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;airflow.providers.common.sql.operators.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLExecuteQueryOperator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;airflow.utils.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;days_ago&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;owner&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;data-team&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;retries&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;retry_delay&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;email_on_failure&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;email&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;data-alerts@company.com&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;elt_sales_pipeline&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;@daily&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;days_ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;elt&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;sales&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;extract_load&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;extract_and_load_raw&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract_and_load_sales_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# your extraction function&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLExecuteQueryOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;transform_sales&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;warehouse_conn&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;sql/transform_sales.sql&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;run_quality_checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;data_quality_checks&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;run_great_expectations_suite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;extract_load&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;run_quality_checks&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Key patterns to follow in Airflow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Idempotent tasks&lt;/strong&gt;: running the same task twice should produce the same result&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Atomic writes&lt;/strong&gt;: use staging tables and swap on success&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parameterized dates&lt;/strong&gt;: use &lt;code&gt;{{ ds }}&lt;/code&gt; template variables for date partitioning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Small tasks&lt;/strong&gt;: each task should do one thing, making failures easy to diagnose&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="dbt--transformation"&gt;dbt &amp;ndash; transformation
&lt;/h3&gt;&lt;p&gt;dbt (data build tool) is the standard for managing SQL-based transformations in an ELT pipeline. It provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Modular SQL&lt;/strong&gt;: break complex transformations into referenceable models&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Built-in testing&lt;/strong&gt;: schema tests, custom tests, and data freshness checks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: auto-generated docs from your model descriptions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lineage&lt;/strong&gt;: visual DAG showing how models depend on each other&lt;/li&gt;
&lt;/ul&gt;
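&lt;p&gt;The built-in tests are declared in YAML alongside the models. A minimal sketch using only dbt&amp;rsquo;s generic tests; the model and column names are illustrative:&lt;/p&gt;

```yaml
# models/marts/schema.yml
version: 2
models:
  - name: fct_daily_revenue
    description: "Daily revenue aggregated from cleaned sales data"
    columns:
      - name: order_date
        tests:
          - unique
          - not_null
      - name: revenue
        tests:
          - not_null
```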
&lt;p&gt;A typical dbt project structure looks like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;models/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; staging/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stg_sales.sql -- clean raw data
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stg_customers.sql
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; marts/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; fct_daily_revenue.sql -- business-level aggregations
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; dim_customers.sql
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; schema.yml -- tests and documentation
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id="great-expectations--data-quality"&gt;Great Expectations &amp;ndash; data quality
&lt;/h3&gt;&lt;p&gt;Great Expectations is a Python framework for defining, running, and documenting data quality checks. It goes beyond simple assertions by generating human-readable data documentation.&lt;/p&gt;
&lt;p&gt;Here is an example of setting up expectations for a sales table:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;great_expectations&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;gx&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Connect to your data source&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;datasource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_or_update_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;sales_source&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;data_asset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_csv_asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;daily_sales&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filepath_or_buffer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;data/daily_sales.csv&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;batch_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_asset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;build_batch_request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Create an expectation suite&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;suite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_or_update_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;sales_quality_suite&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;validator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;batch_request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;expectation_suite_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;sales_quality_suite&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Define expectations&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_column_to_exist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;order_id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_column_values_to_be_unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;order_id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_column_values_to_not_be_null&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;customer_id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_column_values_to_be_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;amount&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_column_values_to_be_in_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;pending&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;completed&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;refunded&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Run validation&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;discard_failed_expectations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Data quality checks failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Integrate this into your Airflow DAG so that quality gates run after every transformation step. If checks fail, the pipeline stops and alerts fire.&lt;/p&gt;
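&lt;p&gt;One way to build that gate: wrap validation in a callable that raises on failure, so Airflow marks the task failed and downstream tasks never run. A sketch, assuming the validation callable returns a result dict with a &lt;code&gt;success&lt;/code&gt; flag:&lt;/p&gt;

```python
def run_quality_gate(validate) -> None:
    """Run a validation callable and raise on failure.

    Raising makes the Airflow task fail, which halts downstream tasks
    and triggers whatever alerting the DAG has configured.
    """
    results = validate()
    if not results["success"]:
        raise ValueError(f"Data quality checks failed: {results.get('statistics')}")
```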
&lt;h2 id="monitoring-and-observability"&gt;Monitoring and observability
&lt;/h2&gt;&lt;p&gt;A production data pipeline needs observability across several dimensions:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Dimension&lt;/th&gt;
 &lt;th&gt;What to Track&lt;/th&gt;
 &lt;th&gt;Tools&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Pipeline health&lt;/td&gt;
 &lt;td&gt;Task success/failure rates, duration trends&lt;/td&gt;
 &lt;td&gt;Airflow metrics, Prometheus&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Data freshness&lt;/td&gt;
 &lt;td&gt;Time since last successful load&lt;/td&gt;
 &lt;td&gt;dbt source freshness, custom checks&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Data volume&lt;/td&gt;
 &lt;td&gt;Row counts per table per run&lt;/td&gt;
 &lt;td&gt;Great Expectations, custom SQL&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Data quality&lt;/td&gt;
 &lt;td&gt;Test pass/fail rates, anomaly scores&lt;/td&gt;
 &lt;td&gt;Great Expectations, Monte Carlo&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Cost&lt;/td&gt;
 &lt;td&gt;Warehouse compute usage, storage growth&lt;/td&gt;
 &lt;td&gt;Cloud provider dashboards&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Set up alerts for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Any pipeline task failure&lt;/li&gt;
&lt;li&gt;Data freshness exceeding SLA thresholds&lt;/li&gt;
&lt;li&gt;Row count deviations beyond 2 standard deviations from the rolling average&lt;/li&gt;
&lt;li&gt;Data quality test failures&lt;/li&gt;
&lt;/ul&gt;
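&lt;p&gt;The row-count rule can be implemented with nothing but the standard library. A sketch, assuming you keep a short history of recent run counts:&lt;/p&gt;

```python
import statistics

def row_count_anomalous(history: list[int], today: int, k: float = 2.0) -> bool:
    """Flag today's row count if it deviates more than k standard deviations
    from the rolling average of recent runs."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return today != round(mean)
    return abs(today - mean) > k * stdev
```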
&lt;p&gt;Push Airflow metrics to Prometheus and build Grafana dashboards that give your team a single pane of glass for pipeline health.&lt;/p&gt;
&lt;h2 id="best-practices"&gt;Best practices
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Treat pipelines as code&lt;/strong&gt;: all SQL, DAG definitions, and configuration live in Git&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use environments&lt;/strong&gt;: dev, staging, production &amp;ndash; just like application code&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Implement CI/CD&lt;/strong&gt;: run dbt tests and linting on every pull request&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Design for failure&lt;/strong&gt;: every task should be retryable and idempotent&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document data contracts&lt;/strong&gt;: define and publish schemas that upstream and downstream teams agree on&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Start with testing&lt;/strong&gt;: add data quality checks before adding new features&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert on SLAs, not just failures&lt;/strong&gt;: a pipeline that succeeds but runs 3x slower than usual is still a problem&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keep raw data immutable&lt;/strong&gt;: never modify source data; transform into separate tables&lt;/li&gt;
&lt;/ol&gt;
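&lt;p&gt;Point 4 (retryable, idempotent tasks) deserves a concrete sketch. The usual trick is to have each run replace its target partition wholesale rather than append, so a retry converges to the same final state. A toy illustration with an in-memory stand-in for a warehouse table:&lt;/p&gt;

```python
def load_partition(table, partition_key, rows):
    """Idempotent load: replace the partition wholesale instead of
    appending, so retries and re-runs converge to the same state."""
    table[partition_key] = list(rows)  # overwrite, never append
    return table

warehouse = {}
load_partition(warehouse, "2025-06-15", [{"id": 1}, {"id": 2}])
load_partition(warehouse, "2025-06-15", [{"id": 1}, {"id": 2}])  # a retry
print(len(warehouse["2025-06-15"]))  # 2, not 4
```

&lt;p&gt;In SQL terms this is a &lt;code&gt;DELETE&lt;/code&gt;-then-&lt;code&gt;INSERT&lt;/code&gt; (or &lt;code&gt;MERGE&lt;/code&gt;) scoped to the partition, rather than a bare &lt;code&gt;INSERT&lt;/code&gt;.&lt;/p&gt;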
&lt;p&gt;DataOps isn&amp;rsquo;t a tool. It&amp;rsquo;s a set of practices that make your data infrastructure reliable, testable, and maintainable. Start with orchestration and testing, then add monitoring and quality checks as you mature.&lt;/p&gt;</description></item><item><title>MLOps pipeline design patterns</title><link>https://adurrr.github.io/en/p/mlops-pipeline-design-patterns/</link><pubDate>Wed, 22 Mar 2023 10:00:00 +0100</pubDate><guid>https://adurrr.github.io/en/p/mlops-pipeline-design-patterns/</guid><description>&lt;h2 id="what-is-mlops-and-why-it-matters"&gt;What Is MLOps and Why It Matters
&lt;/h2&gt;&lt;p&gt;MLOps is about deploying and maintaining ML models reliably and efficiently in production. It bridges data science experiments and production engineering. Without it, you hit the same problems repeatedly: models that work in notebooks but fail in production, no way to reproduce results, painful handoffs between teams, and zero visibility into how models perform once deployed.&lt;/p&gt;
&lt;p&gt;The idea is simple: treat ML systems with the same rigor as software. Use version control, automated testing, continuous delivery, and monitoring. Just acknowledge that data and models introduce unique challenges.&lt;/p&gt;
&lt;h2 id="the-ml-lifecycle"&gt;The ML Lifecycle
&lt;/h2&gt;&lt;p&gt;Before diving into patterns, understand the stages every ML system goes through:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Data ingestion and validation&lt;/strong&gt; - Collect, clean, and validate input data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature engineering&lt;/strong&gt; - Transform raw data into features the model can use&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model training&lt;/strong&gt; - Run experiments, tune hyperparameters, pick algorithms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model evaluation&lt;/strong&gt; - Test model quality against held-out data and business metrics&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model deployment&lt;/strong&gt; - Serve predictions in production (batch or real-time)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring and feedback&lt;/strong&gt; - Track performance, detect drift, retrain when needed&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each stage has failure modes, and the patterns below help prevent them.&lt;/p&gt;
&lt;h2 id="key-design-patterns"&gt;Key design patterns
&lt;/h2&gt;&lt;h3 id="feature-store"&gt;Feature store
&lt;/h3&gt;&lt;p&gt;A feature store is a centralized repository for storing, sharing, and serving ML features. Instead of each team recomputing features from scratch, a feature store provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt; between training and serving (avoiding training-serving skew).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reusability&lt;/strong&gt; across teams and models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Point-in-time correctness&lt;/strong&gt; for historical feature values.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tools like &lt;strong&gt;Feast&lt;/strong&gt;, &lt;strong&gt;Tecton&lt;/strong&gt;, and &lt;strong&gt;Hopsworks&lt;/strong&gt; implement this pattern. If you find multiple teams duplicating feature pipelines, a feature store is likely worth the investment.&lt;/p&gt;
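&lt;p&gt;Point-in-time correctness is the subtle part: when building a training set, each row may only use feature values that existed at that row&amp;rsquo;s timestamp. A minimal illustration of the lookup in plain Python (not the API of any particular feature store):&lt;/p&gt;

```python
from bisect import bisect_right

def point_in_time_value(feature_log, as_of):
    """Return the latest feature value recorded at or before `as_of`,
    never a later one (using future values would leak labels)."""
    times = [t for t, _ in feature_log]
    i = bisect_right(times, as_of)
    return feature_log[i - 1][1] if i else None

# (timestamp, value) history for a hypothetical spend_7d feature
spend_7d = [(1, 10.0), (5, 42.0), (9, 77.0)]
print(point_in_time_value(spend_7d, 6))  # 42.0: the value known at t=6
print(point_in_time_value(spend_7d, 0))  # None: no value existed yet
```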
&lt;h3 id="model-registry"&gt;Model registry
&lt;/h3&gt;&lt;p&gt;A model registry acts as a versioned catalog for trained models. It stores model artifacts, metadata (hyperparameters, metrics, training data version), and lifecycle stage (staging, production, archived).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MLflow Model Registry&lt;/strong&gt; is one of the most widely adopted solutions. It lets you promote models through stages with approval workflows and track lineage from experiment to production.&lt;/p&gt;
&lt;h3 id="ctcicd-for-ml"&gt;CT/CI/CD for ML
&lt;/h3&gt;&lt;p&gt;Traditional CI/CD pipelines build and deploy code. ML pipelines need three loops:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Continuous Training (CT)&lt;/strong&gt; &amp;mdash; Automatically retrain models when data changes or performance degrades.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous Integration (CI)&lt;/strong&gt; &amp;mdash; Validate not just code but also data schemas, feature expectations, and model quality thresholds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous Delivery (CD)&lt;/strong&gt; &amp;mdash; Deploy validated models to serving infrastructure automatically.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A typical pipeline trigger might be: new data lands in the data lake, CT kicks off retraining, CI runs validation tests, and CD pushes the model to production if all checks pass.&lt;/p&gt;
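&lt;p&gt;The CI validation step can be as simple as a metric gate that blocks promotion. A hypothetical sketch (metric names and thresholds are ours):&lt;/p&gt;

```python
def passes_quality_gate(metrics, thresholds):
    """CI gate: a candidate model is promotable only when every tracked
    metric meets its minimum threshold."""
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in thresholds.items())

GATES = {"accuracy": 0.90, "f1": 0.85}  # thresholds are illustrative
print(passes_quality_gate({"accuracy": 0.93, "f1": 0.88}, GATES))  # True
print(passes_quality_gate({"accuracy": 0.93, "f1": 0.80}, GATES))  # False
```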
&lt;h3 id="ab-testing"&gt;A/B testing
&lt;/h3&gt;&lt;p&gt;A/B testing for models means routing a percentage of traffic to a new model while the rest continues hitting the current production model. You measure business metrics (conversion rate, click-through, revenue) rather than just ML metrics (accuracy, F1). This pattern is essential because a model that scores well offline can still perform poorly in production due to feedback loops, latency, or distribution differences.&lt;/p&gt;
&lt;h3 id="shadow-deployment"&gt;Shadow deployment
&lt;/h3&gt;&lt;p&gt;In shadow mode, the new model receives production traffic and generates predictions, but those predictions are &lt;strong&gt;not&lt;/strong&gt; served to users. Instead, they are logged alongside the current model&amp;rsquo;s predictions for offline comparison. This is a low-risk way to validate a model on real traffic before exposing it to users.&lt;/p&gt;
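&lt;p&gt;A shadow deployment reduces to a small wrapper: serve the current model, run the candidate on the same input, and log both. A toy sketch with stand-in models:&lt;/p&gt;

```python
shadow_log = []

def predict(request, current_model, shadow_model):
    """Serve the current model's answer; run the shadow model on the same
    input and log both predictions for offline comparison."""
    served = current_model(request)
    shadow = shadow_model(request)
    shadow_log.append({"request": request, "served": served, "shadow": shadow})
    return served  # users only ever see the current model's output

current = lambda x: x * 2       # stand-ins for real model endpoints
candidate = lambda x: x * 2 + 1
print(predict(10, current, candidate))  # 20 (the candidate's 21 is only logged)
print(shadow_log[0]["shadow"])          # 21
```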
&lt;h3 id="canary-releases-for-models"&gt;Canary releases for models
&lt;/h3&gt;&lt;p&gt;Similar to canary deployments in software, you roll out a new model to a small fraction of traffic (say 5%), monitor key metrics, and gradually increase traffic if everything looks healthy. If metrics degrade, you roll back automatically. This combines well with A/B testing but focuses more on risk mitigation than experimentation.&lt;/p&gt;
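&lt;p&gt;Canary routing is usually implemented with deterministic hashing so that a given user sticks to one variant as the rollout percentage grows. A minimal sketch:&lt;/p&gt;

```python
import hashlib

def routes_to_canary(user_id, canary_pct):
    """Deterministically bucket users into 0-99 by hash: the same user
    always lands in the same bucket, so their experience stays stable."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_pct

print(routes_to_canary("user-42", 0))    # False: rollout not started
print(routes_to_canary("user-42", 100))  # True: fully rolled out
# At canary_pct=5, roughly one user in twenty sees the new model;
# raising the percentage only adds users, it never reshuffles them.
```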
&lt;h2 id="tooling-overview"&gt;Tooling overview
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Tool&lt;/th&gt;
 &lt;th&gt;Primary Use&lt;/th&gt;
 &lt;th&gt;Key Strength&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;MLflow&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Experiment tracking, model registry&lt;/td&gt;
 &lt;td&gt;Flexible, vendor-neutral&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Kubeflow&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;End-to-end ML pipelines on Kubernetes&lt;/td&gt;
 &lt;td&gt;Scalable, cloud-native&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;DVC&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Data and model versioning&lt;/td&gt;
 &lt;td&gt;Git-like workflow for data&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Weights &amp;amp; Biases&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Experiment tracking, visualization&lt;/td&gt;
 &lt;td&gt;Excellent UI and collaboration&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Feast&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Feature store&lt;/td&gt;
 &lt;td&gt;Open-source, production-ready&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Seldon Core&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Model serving on Kubernetes&lt;/td&gt;
 &lt;td&gt;Advanced deployment strategies&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There is no single tool that covers everything. Most production setups combine several of these, choosing based on team expertise and infrastructure constraints.&lt;/p&gt;
&lt;h2 id="example-mlflow-experiment-tracking"&gt;Example: MLflow experiment tracking
&lt;/h2&gt;&lt;p&gt;Here is a minimal example of tracking an experiment with MLflow:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;mlflow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;mlflow.sklearn&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_X_y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# example dataset&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;rf-baseline&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Log parameters&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;n_estimators&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;n_estimators&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;max_depth&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Train model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Log metrics&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;accuracy&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;f1_score&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;weighted&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Log model artifact&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sklearn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;random-forest-model&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Every run is tracked with its parameters, metrics, and artifacts, making it straightforward to compare experiments and reproduce results.&lt;/p&gt;
&lt;h2 id="anti-patterns-to-avoid"&gt;Anti-patterns to avoid
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;No versioning of data or models.&lt;/strong&gt; If you cannot reproduce a training run from six months ago, you have a problem. Version everything: code, data, configuration, and model artifacts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Training-serving skew.&lt;/strong&gt; When the feature computation logic differs between training and serving, predictions silently degrade. A feature store or shared feature computation library helps eliminate this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Manual deployment.&lt;/strong&gt; Copy-pasting model files to a server is a recipe for incidents. Automate deployment through pipelines with proper validation gates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ignoring model monitoring.&lt;/strong&gt; Models degrade over time as input distributions shift. Without monitoring, you only discover this when a user complains or a business metric drops. Set up alerts for prediction distribution changes, latency, and data quality.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monolithic pipelines.&lt;/strong&gt; A single pipeline that does everything from data ingestion to model serving is fragile and hard to debug. Break pipelines into modular, independently testable stages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-engineering too early.&lt;/strong&gt; Not every ML project needs Kubeflow and a feature store on day one. Start simple, identify bottlenecks, and adopt patterns as the complexity of your system grows.&lt;/p&gt;
&lt;h2 id="mlops-maturity-levels"&gt;MLOps maturity levels
&lt;/h2&gt;&lt;p&gt;Organizations typically progress through several maturity levels:&lt;/p&gt;
&lt;h3 id="level-0-manual"&gt;Level 0: manual
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Models trained in notebooks.&lt;/li&gt;
&lt;li&gt;Manual deployment (file copy, manual API restart).&lt;/li&gt;
&lt;li&gt;No experiment tracking.&lt;/li&gt;
&lt;li&gt;No monitoring.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="level-1-ml-pipeline-automation"&gt;Level 1: ML pipeline automation
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Automated training pipelines.&lt;/li&gt;
&lt;li&gt;Experiment tracking with tools like MLflow.&lt;/li&gt;
&lt;li&gt;Basic model validation before deployment.&lt;/li&gt;
&lt;li&gt;Some monitoring of model predictions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="level-2-cicd-for-ml"&gt;Level 2: CI/CD for ML
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Automated testing of data, features, and model quality.&lt;/li&gt;
&lt;li&gt;Continuous training triggered by data changes or schedule.&lt;/li&gt;
&lt;li&gt;Automated deployment with canary or shadow releases.&lt;/li&gt;
&lt;li&gt;Comprehensive monitoring with alerting and automated rollback.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="level-3-full-mlops"&gt;Level 3: Full MLOps
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Feature store for consistent feature management.&lt;/li&gt;
&lt;li&gt;Model registry with governance and approval workflows.&lt;/li&gt;
&lt;li&gt;A/B testing integrated into the deployment process.&lt;/li&gt;
&lt;li&gt;Data and model lineage tracked end-to-end.&lt;/li&gt;
&lt;li&gt;Self-healing pipelines that detect and respond to drift automatically.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most teams are somewhere between Level 0 and Level 1. The goal is not to jump to Level 3 immediately but to progress incrementally, addressing the most painful bottlenecks first.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;MLOps is about applying engineering patterns to ML&amp;rsquo;s unique challenges. Start with experiment tracking and basic automation, then add feature stores, model registries, and advanced deployment strategies as you scale. The key: treat models like first-class production artifacts. Version them, test them, monitor them, improve them.&lt;/p&gt;</description></item><item><title>Introduction to AIOps: intelligent IT operations</title><link>https://adurrr.github.io/en/p/introduction-to-aiops-intelligent-it-operations/</link><pubDate>Mon, 05 Dec 2022 00:00:00 +0000</pubDate><guid>https://adurrr.github.io/en/p/introduction-to-aiops-intelligent-it-operations/</guid><description>&lt;h2 id="what-is-aiops"&gt;What is AIOps?
&lt;/h2&gt;&lt;p&gt;AIOps (Artificial Intelligence for IT Operations) applies machine learning and data analytics to operational data (logs, metrics, events, traces) to automate and improve workflows. Gartner coined the term in 2017, but the idea is simple: use algorithms to handle the volume and complexity that humans can&amp;rsquo;t manage manually.&lt;/p&gt;
&lt;p&gt;In practical terms, AIOps platforms ingest data from monitoring tools, APM systems, log aggregators, and event sources. They apply ML models to detect anomalies, correlate events, identify root causes, and in some cases trigger automated remediation. The goal is to reduce mean time to detection (MTTD) and mean time to resolution (MTTR) while freeing operations teams from alert fatigue.&lt;/p&gt;
&lt;h2 id="why-traditional-monitoring-falls-short"&gt;Why traditional monitoring falls short
&lt;/h2&gt;&lt;p&gt;Monitoring used to work fine. You had a few servers, a handful of apps, and a limited set of metrics to watch. A static CPU threshold or log regex was enough.&lt;/p&gt;
&lt;p&gt;Modern infrastructure broke that model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scale&lt;/strong&gt;: A medium-sized Kubernetes cluster generates millions of metric samples and log lines per minute. No human can watch dashboards at that scale.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complexity&lt;/strong&gt;: Microservices create tangled dependency graphs. One user request might touch dozens of services. Finding what caused a latency spike means correlating data across all of them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic environments&lt;/strong&gt;: Auto-scaling, ephemeral containers, and serverless functions mean baselines constantly shift. Static thresholds explode with false positives.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert fatigue&lt;/strong&gt;: Teams get buried in alerts. When 90% of them are noise, the critical 10% gets lost, and engineers start ignoring everything.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AIOps doesn&amp;rsquo;t replace monitoring. It layers on top of what you already have and makes it smarter.&lt;/p&gt;
&lt;h2 id="key-capabilities"&gt;Key capabilities
&lt;/h2&gt;&lt;h3 id="1-anomaly-detection"&gt;1. Anomaly detection
&lt;/h3&gt;&lt;p&gt;Instead of static thresholds, AIOps uses ML models (often time-series analysis, clustering, or autoencoders) to learn what &amp;ldquo;normal&amp;rdquo; looks like for each metric and service. When behavior deviates significantly from the learned baseline, an anomaly is flagged.&lt;/p&gt;
&lt;p&gt;This handles the dynamic baseline problem. If your application normally sees a traffic spike every Monday at 9 AM, the model learns that pattern and does not alert on it. But an unexpected spike at 3 AM on a Wednesday gets flagged.&lt;/p&gt;
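&lt;p&gt;One deliberately naive way to learn such seasonal baselines is to bucket history by (weekday, hour) and flag values far outside each bucket&amp;rsquo;s mean. Real platforms use richer time-series models, but the sketch below captures the idea:&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean, stdev

def fit_baseline(samples):
    """Learn a per-(weekday, hour) baseline from (weekday, hour, value)
    history, so recurring patterns like a Monday 9 AM spike become normal."""
    buckets = defaultdict(list)
    for weekday, hour, value in samples:
        buckets[(weekday, hour)].append(value)
    return {k: (mean(v), stdev(v)) for k, v in buckets.items() if len(v) > 1}

def is_anomaly(baseline, weekday, hour, value, n_sigmas=3.0):
    if (weekday, hour) not in baseline:
        return False  # no baseline yet: stay quiet rather than guess
    avg, sd = baseline[(weekday, hour)]
    return abs(value - avg) > n_sigmas * max(sd, 1e-9)

# Mondays 09:00 always spike to ~900 req/s; Wednesdays 03:00 idle at ~50
history = ([(0, 9, v) for v in (880, 910, 905, 895)]
           + [(2, 3, v) for v in (48, 52, 50, 51)])
model = fit_baseline(history)
print(is_anomaly(model, 0, 9, 900))  # False: the Monday spike is learned
print(is_anomaly(model, 2, 3, 400))  # True: unexpected 3 AM surge
```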
&lt;h3 id="2-event-correlation"&gt;2. Event correlation
&lt;/h3&gt;&lt;p&gt;A single infrastructure issue can generate hundreds or thousands of related alerts across different monitoring tools. AIOps correlates these events — grouping them by time, topology, and causal relationships — to present a single incident instead of a wall of alerts.&lt;/p&gt;
&lt;p&gt;For example, a network switch failure might trigger alerts on: the switch itself, all connected servers (connectivity lost), all applications on those servers (health check failures), and downstream services (timeout errors). An AIOps platform correlates all of these into one incident: &amp;ldquo;Network switch X failed.&amp;rdquo;&lt;/p&gt;
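&lt;p&gt;The core of topology-based correlation can be illustrated by attributing each alert to the top of its dependency chain. A toy sketch with a hypothetical topology:&lt;/p&gt;

```python
def root_of(node, parents):
    """Walk up the dependency graph to the top-level failing component."""
    while node in parents:
        node = parents[node]
    return node

def correlate(alerts, parents):
    """Collapse a wall of alerts into one incident per root component by
    attributing every alert to the top of its dependency chain."""
    incidents = {}
    for alert in alerts:
        root = root_of(alert["source"], parents)
        incidents.setdefault(root, []).append(alert)
    return incidents

# app-1 and app-2 run on server-1, which is connected through switch-x
parents = {"app-1": "server-1", "app-2": "server-1", "server-1": "switch-x"}
alerts = [{"source": s, "msg": "down"}
          for s in ("switch-x", "server-1", "app-1", "app-2")]
grouped = correlate(alerts, parents)
print(list(grouped))             # ['switch-x']: one incident, not four alerts
print(len(grouped["switch-x"]))  # 4
```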
&lt;h3 id="3-root-cause-analysis"&gt;3. Root cause analysis
&lt;/h3&gt;&lt;p&gt;Beyond correlation, AIOps attempts to identify the root cause of an incident. By understanding the topology of your infrastructure and the causal chain of events, it can suggest that the network switch failure is the root cause, rather than presenting the application timeout as an independent issue.&lt;/p&gt;
&lt;p&gt;This is where the value becomes tangible. Instead of an on-call engineer spending 30 minutes tracing through dashboards and logs, the platform surfaces the probable root cause immediately.&lt;/p&gt;
&lt;h3 id="4-auto-remediation"&gt;4. Auto-remediation
&lt;/h3&gt;&lt;p&gt;The most mature AIOps implementations close the loop by triggering automated remediation actions. If a known pattern is detected (disk filling up, a pod in CrashLoopBackOff, a runaway process consuming memory), the platform can execute predefined runbooks automatically.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Restart a crashed pod or service.&lt;/li&gt;
&lt;li&gt;Scale up a deployment when anomalous load is detected.&lt;/li&gt;
&lt;li&gt;Clear a log directory when disk usage exceeds a dynamic threshold.&lt;/li&gt;
&lt;li&gt;Trigger a failover when a primary database becomes unresponsive.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Auto-remediation requires careful design. Start with low-risk actions and expand as confidence grows.&lt;/p&gt;
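&lt;p&gt;A guarded runbook dispatcher captures the &amp;ldquo;start with low-risk actions&amp;rdquo; advice: only whitelisted patterns are remediated automatically, everything else is escalated. A sketch (pattern and runbook names are invented):&lt;/p&gt;

```python
RUNBOOKS = {  # pattern names and actions are invented for illustration
    "disk_full": lambda ctx: f"cleared logs on {ctx['host']}",
    "crashloop": lambda ctx: f"restarted pod {ctx['pod']}",
}
LOW_RISK = {"disk_full"}  # start small; expand as confidence grows

def remediate(event):
    """Execute a runbook automatically only for known low-risk patterns;
    everything else goes to a human."""
    pattern = event["pattern"]
    if pattern in RUNBOOKS and pattern in LOW_RISK:
        return RUNBOOKS[pattern](event)
    return f"escalated {pattern} to on-call"

print(remediate({"pattern": "disk_full", "host": "db-3"}))
print(remediate({"pattern": "crashloop", "pod": "api-7f"}))
```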
&lt;h2 id="common-platforms-and-tools"&gt;Common platforms and tools
&lt;/h2&gt;&lt;p&gt;The AIOps landscape includes both commercial platforms and open-source building blocks:&lt;/p&gt;
&lt;h3 id="commercial-platforms"&gt;Commercial platforms
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Platform&lt;/th&gt;
 &lt;th&gt;Strengths&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Dynatrace&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Strong auto-discovery, AI engine (Davis), full-stack observability&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Datadog&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Unified monitoring + ML-powered alerting, Watchdog anomaly detection&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Splunk ITSI&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Powerful log analytics + ML toolkit, good for event correlation&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Moogsoft&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Pioneered AIOps space, strong event correlation and noise reduction&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;BigPanda&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Event correlation and automation focused, integrates with existing tools&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;PagerDuty&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Incident management with ML-driven noise reduction and smart grouping&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="open-source-building-blocks"&gt;Open-source building blocks
&lt;/h3&gt;&lt;p&gt;You can assemble an AIOps-like stack from open-source components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data collection&lt;/strong&gt;: Prometheus, Grafana Agent, OpenTelemetry Collector, Fluentd/Fluent Bit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data storage&lt;/strong&gt;: Prometheus (metrics), Elasticsearch/OpenSearch (logs), Jaeger/Tempo (traces).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Anomaly detection&lt;/strong&gt;: Facebook Prophet, Isolation Forest (scikit-learn), luminol, Grafana ML.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Event correlation&lt;/strong&gt;: Custom logic on top of event streams, or StackStorm for event-driven automation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alerting and automation&lt;/strong&gt;: Alertmanager, Grafana OnCall, StackStorm, Rundeck.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Building a custom AIOps stack is significantly more work than using a commercial platform, but it gives you full control and avoids vendor lock-in. A reasonable middle ground is using a commercial platform for core AIOps capabilities while keeping your data pipeline open-source.&lt;/p&gt;
&lt;h2 id="practical-use-cases"&gt;Practical use cases
&lt;/h2&gt;&lt;h3 id="noise-reduction-in-alert-management"&gt;Noise reduction in alert management
&lt;/h3&gt;&lt;p&gt;A team receiving 500+ alerts per day implements AIOps event correlation. Related alerts are grouped into incidents, duplicates are suppressed, and flapping alerts are silenced. Alert volume drops by 80%, and the on-call engineer can focus on actual incidents.&lt;/p&gt;
&lt;h3 id="proactive-capacity-planning"&gt;Proactive capacity planning
&lt;/h3&gt;&lt;p&gt;AIOps models analyze historical resource usage trends and predict when capacity limits will be reached. Instead of reacting to a disk-full alert at 2 AM, the platform predicts the issue two weeks in advance and creates a ticket for the team to address during business hours.&lt;/p&gt;
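&lt;p&gt;The prediction itself can be as simple as a least-squares trend extrapolated to the capacity limit. A sketch, assuming roughly linear growth:&lt;/p&gt;

```python
def days_until_full(usage, capacity=100.0):
    """Fit a least-squares line to daily usage (percent) and extrapolate
    to the day the capacity limit is reached."""
    n = len(usage)
    xs = range(n)
    x_mean, y_mean = (n - 1) / 2, sum(usage) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, usage))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den  # percentage points per day
    if slope <= 0:
        return None    # flat or shrinking usage: no exhaustion in sight
    return (capacity - usage[-1]) / slope

# Disk filling about 1% a day, currently at 72%
print(round(days_until_full([65, 66, 67, 68, 69, 70, 71, 72])))  # 28
print(days_until_full([70, 70, 70, 70]))  # None
```

&lt;p&gt;Real platforms account for seasonality and growth acceleration, but even this linear version turns a 2 AM page into a daytime ticket.&lt;/p&gt;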
&lt;h3 id="faster-incident-response"&gt;Faster incident response
&lt;/h3&gt;&lt;p&gt;During a production outage, the AIOps platform correlates alerts across the monitoring stack, identifies the root cause (a recent deployment that introduced a memory leak), and surfaces the relevant deployment commit. MTTR drops from 45 minutes to 10 minutes.&lt;/p&gt;
&lt;h3 id="automated-scaling"&gt;Automated scaling
&lt;/h3&gt;&lt;p&gt;The platform detects anomalous traffic patterns that deviate from the learned baseline. Instead of waiting for CPU to hit 80% (the static threshold), it triggers a scale-up action based on the rate of change, ensuring capacity is ready before users experience degradation.&lt;/p&gt;
&lt;h2 id="how-aiops-fits-into-devops-workflows"&gt;How AIOps fits into DevOps workflows
&lt;/h2&gt;&lt;p&gt;AIOps is not a replacement for DevOps practices. It is an enhancement layer:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Code ──&amp;gt; CI/CD Pipeline ──&amp;gt; Deploy ──&amp;gt; Observe ──&amp;gt; AIOps Layer ──&amp;gt; Act
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; Monitoring Stack ML Models
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; (metrics, logs, (anomaly detection,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; traces, events) correlation, RCA)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Developers&lt;/strong&gt; benefit from faster root cause identification when their code causes issues in production.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operations&lt;/strong&gt; teams benefit from noise reduction, automated remediation, and proactive alerting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SRE teams&lt;/strong&gt; benefit from data-driven SLO tracking and error budget burn rate analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AIOps works best when your observability foundation is solid. If you are not collecting good data (structured logs, meaningful metrics, distributed traces), ML models will not produce meaningful insights. Fix your observability first, then layer AIOps on top.&lt;/p&gt;
&lt;h2 id="getting-started-a-pragmatic-path"&gt;Getting started: A pragmatic path
&lt;/h2&gt;&lt;p&gt;If AIOps sounds useful, here&amp;rsquo;s a practical approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Audit your current observability stack.&lt;/strong&gt; What data are you collecting? Do you have structured logs? Consistently labeled metrics? Traces across services? AIOps can only work with good data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Start with noise reduction.&lt;/strong&gt; This is the lowest-hanging fruit. Implement alert grouping and deduplication. Even basic rules-based correlation (before any ML) will reduce alert fatigue significantly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add anomaly detection to key metrics.&lt;/strong&gt; Pick 3-5 critical business and infrastructure metrics. Apply a time-series anomaly detection model. Facebook Prophet or Prometheus recording rules with seasonal adjustments are good starting points.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Implement automated remediation for known issues.&lt;/strong&gt; Identify the top 5 recurring incidents. Write runbooks for them. Automate the runbooks using StackStorm, Rundeck, or your platform&amp;rsquo;s automation engine.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Evaluate a commercial platform when complexity demands it.&lt;/strong&gt; If you have hundreds of services, multiple monitoring tools, and a growing operations team, the investment in a commercial AIOps platform may be justified by the reduction in MTTR alone.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Measure the impact.&lt;/strong&gt; Track mean time to detect (MTTD), MTTR, alert-to-incident ratio, and false positive rate. Without metrics, you can&amp;rsquo;t prove AIOps is worth the investment.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
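&lt;p&gt;For step 3, a dependency-free stand-in for a seasonal model is a rolling z-score: flag any point that sits more than a few standard deviations from its recent baseline. A sketch:&lt;/p&gt;

```python
import statistics

def detect_anomalies(series, window=24, z_threshold=3.0):
    """Return indices of points that deviate more than z_threshold standard
    deviations from the mean of the preceding `window` samples. A toy
    stand-in for heavier seasonal models such as Prophet."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev == 0:
            continue  # perfectly flat baseline: no meaningful z-score
        if abs(series[i] - mean) / stdev > z_threshold:
            anomalies.append(i)
    return anomalies
```

&lt;p&gt;With an hourly metric and &lt;code&gt;window=24&lt;/code&gt;, the baseline is the previous day, which already absorbs simple daily seasonality.&lt;/p&gt;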
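&lt;p&gt;For step 4, the core of automated remediation is a dispatch table from known incident signatures to runbook actions, with anything unknown escalated to a human. A sketch; the action functions are placeholders, since in practice they would invoke StackStorm, Rundeck, or a cloud provider API:&lt;/p&gt;

```python
# Hypothetical runbook actions; real ones would call an automation engine
# rather than return strings.
def restart_service(incident):
    return "restarted " + incident["service"]

def clear_disk(incident):
    return "rotated logs on " + incident["host"]

RUNBOOKS = {
    "ServiceUnresponsive": restart_service,
    "DiskFull": clear_disk,
}

def remediate(incident, dry_run=True):
    """Dispatch a known incident type to its automated runbook. Unknown
    incidents are escalated instead of being acted on blindly; dry_run
    lets the team build trust before enabling real actions."""
    action = RUNBOOKS.get(incident["type"])
    if action is None:
        return "escalate: no runbook for " + incident["type"]
    if dry_run:
        return "would run: " + action.__name__
    return action(incident)
```

&lt;p&gt;Starting in dry-run mode and auditing what &lt;em&gt;would&lt;/em&gt; have happened is a common way to de-risk the jump to closed-loop remediation.&lt;/p&gt;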
&lt;p&gt;AIOps isn&amp;rsquo;t magic. It&amp;rsquo;s a set of techniques that, applied to solid operational data, can reduce the burden on ops teams and improve reliability. Start small, measure everything, and scale what actually works.&lt;/p&gt;</description></item></channel></rss>