Automation on Adur

Lessons learned in DevSecOps

Mon, 20 Oct 2025 00:00:00 +0000

DevSecOps gets thrown around a lot in job descriptions and conference talks. But behind the buzzword are real lessons that only come from doing the work: building pipelines that break when you add security gates, watching teams ignore the tools you spent months deploying, and finally finding what actually works.

These are lessons we learned the hard way. They’re opinionated, practical, and shaped by experience.

Security is everyone’s responsibility

Sounds like a break room poster, but it’s the most important lesson here. If security is only the security team’s job, you’ve lost.

Developers make security decisions every time they write code, whether they know it or not. How they validate input. How they handle secrets. How they configure network access. Every PR is a security event.

What works: make security part of the normal development workflow, not a gate at the end. Developers learn when they get fast feedback on security issues in their PR. They resent finding out three weeks later from an auditor.

We’ve seen this repeatedly: teams that treat security as shared responsibility find fewer critical vulnerabilities in production. Teams that silo it find them in the news.

Automate everything you can

Manual security processes do not scale. Period. If your security review is a human reading a checklist, it will be skipped under deadline pressure, inconsistently applied, and resented by everyone involved.

Automate the things that can be automated:

Dependency scanning in every CI build (Dependabot, Snyk, Trivy)
Static analysis on every pull request (Semgrep, SonarQube)
Secret detection as a pre-commit hook and CI check (gitleaks, detect-secrets)
Container image scanning before deployment (Trivy, Grype)
Infrastructure as Code scanning (tfsec, Checkov, KICS)
Compliance as Code for runtime policy enforcement (OPA, Kyverno)

The goal is not to catch everything automatically. The goal is to catch the easy stuff automatically so that human reviewers can focus on the hard stuff: business logic flaws, design-level security issues, threat modeling.

Start small

One of the biggest mistakes we have made is trying to secure everything at once. You roll out SAST, DAST, SCA, container scanning, IaC scanning, and runtime protection in one quarter. The result? Alert fatigue, developer rebellion, and a wall of unresolved findings that nobody looks at.

Start with one tool, one pipeline, one team. Get it working well. Get developers comfortable with it. Resolve the false positives. Tune the rules. Then expand.

A practical progression:

Month 1: Secret detection in pre-commit hooks and CI. This is uncontroversial and catches real issues.
Month 2: Dependency scanning with automated PR creation for updates. Developers see the value immediately.
Month 3: Container image scanning blocking deployments of critical/high vulnerabilities.
Month 4+: Static analysis, gradually expanding rule sets.

Each step should be stable before moving to the next. Rushing creates noise, and noise teaches people to ignore alerts.
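
As one way to picture the Month 3 gate, here is a minimal sketch that reads a scanner's JSON report and decides whether to block. It assumes a Trivy-style report layout (`Results`, then `Vulnerabilities`, then `Severity`); the function name `blocking_findings` and the sample report are hypothetical. In practice the scanner's own severity and exit-code options may be all you need.

```python
import json

BLOCKING = {"CRITICAL", "HIGH"}  # severities that stop a deployment

def blocking_findings(report: dict) -> list[tuple[str, str]]:
    """Return (vulnerability id, severity) pairs that should block."""
    found = []
    # "Results" may be missing and "Vulnerabilities" may be null for clean targets.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                found.append((vuln.get("VulnerabilityID", "unknown"), vuln["Severity"]))
    return found

# Hypothetical report, trimmed to the fields used above.
sample = json.loads("""
{"Results": [
  {"Target": "app:1.0", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
    {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}]},
  {"Target": "clean-layer", "Vulnerabilities": null}
]}
""")

blockers = blocking_findings(sample)
print("block deployment" if blockers else "ok to deploy", blockers)
```

Keeping the blocking set small at first matches the progression above: widen it only once the team trusts the findings.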

Blameless culture matters

When a security incident happens because someone pushed a secret to a public repo, or because a vulnerability was not patched in time, the response matters more than the incident itself.

If people get blamed, they hide things. They do not report near-misses. They cover up mistakes. And the next incident will be worse because nobody shared the lessons from the last one.

Blameless postmortems are not about letting people off the hook. They are about understanding systemic failures. Why was it possible to push a secret? Why was there no scanning? Why was the patching process slow? Fix the system, not the person.

We have found that teams with genuinely blameless cultures have significantly better security postures. People report suspicious things. They ask for help early. They flag risks before they become incidents.

Tooling is not enough without culture change

We once deployed a comprehensive security scanning pipeline with beautiful dashboards, Slack notifications, Jira ticket creation, the works. Six months later, there were 3,000 unresolved findings and the Slack channel was muted by every developer.

The tools were fine. The culture was not ready.

Before you deploy tooling, invest in:

Training: Developers need to understand why the tool exists and how to act on its findings.
Ownership: Someone needs to own the backlog of findings and triage them. If nobody owns it, nobody does it.
SLAs: Define clear timelines for remediating findings by severity. Critical gets 48 hours. High gets a week. Medium gets a sprint. Low gets a quarter.
Feedback loops: When a tool produces a false positive, there must be an easy way to report it and get the rule tuned. Otherwise, developers learn to ignore everything.
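
The SLA item above can be made concrete with a small sketch that turns a severity into a due date. The 48-hour and one-week windows come from the list; treating a sprint as 14 days and a quarter as 90 days is an assumption, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Remediation windows by severity. "Sprint" and "quarter" are assumed
# to mean 14 and 90 days; adjust to your own cadence.
SLA = {
    "critical": timedelta(hours=48),
    "high": timedelta(days=7),
    "medium": timedelta(days=14),
    "low": timedelta(days=90),
}

def due_date(severity: str, found_at: datetime) -> datetime:
    """When a finding of this severity must be remediated."""
    return found_at + SLA[severity.lower()]

def is_overdue(severity: str, found_at: datetime, now: datetime) -> bool:
    return now > due_date(severity, found_at)

found = datetime(2025, 1, 1, tzinfo=timezone.utc)
later = datetime(2025, 1, 9, tzinfo=timezone.utc)
print(due_date("Critical", found))
print(is_overdue("high", found, later), is_overdue("low", found, later))
```

A check like this is most useful when it feeds the backlog owner's dashboard rather than paging developers directly.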

Invest in developer experience for security tools

If your security tool makes developers’ lives harder, they will find a way around it. This is not a character flaw. It is human nature and good engineering instinct: remove obstacles to shipping.

The security tools that get adopted are the ones that:

Run fast: A SAST scan that takes 20 minutes will be bypassed. One that takes 30 seconds will be tolerated.
Integrate natively: Show results in the PR, not in a separate portal. Nobody wants to log into another dashboard.
Have low false positive rates: Every false positive erodes trust. Invest time in tuning.
Provide actionable guidance: “SQL injection vulnerability on line 42” is useless without “here is how to fix it.”
Fail gracefully: If the scanner is down, the pipeline should warn, not block. Availability of the development pipeline is non-negotiable.
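
To illustrate the "fail gracefully" point, here is a minimal sketch of a gate that separates "the scanner found something" from "the scanner itself failed". The `run_gate` wrapper, the `Outcome` names, and the scanner callables are all hypothetical.

```python
from enum import Enum
from typing import Callable

class Outcome(Enum):
    PASS = "pass"
    BLOCK = "block"  # scanner ran and reported findings
    WARN = "warn"    # scanner itself failed; do not block the pipeline

def run_gate(scan: Callable[[], list[str]]) -> tuple[Outcome, str]:
    """Run a scanner callable that returns a list of findings."""
    try:
        findings = scan()
    except Exception as exc:  # outage, timeout, bad config
        return Outcome.WARN, f"scanner unavailable: {exc}"
    if findings:
        return Outcome.BLOCK, f"{len(findings)} finding(s)"
    return Outcome.PASS, "no findings"

def broken_scanner() -> list[str]:
    raise TimeoutError("scanner API timed out")

print(run_gate(lambda: []))
print(run_gate(lambda: ["hardcoded secret in config.py"]))
print(run_gate(broken_scanner))
```

A warning only helps if someone sees it, so a real pipeline would probably also alert the tool's owner when the gate degrades to `WARN`.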

We think of it this way: if a developer has to change their workflow to accommodate a security tool, the tool has failed. The best security tooling is invisible.

Monitoring and observability are non-negotiable

You cannot secure what you cannot see. Security monitoring is not optional, and it is not something you bolt on after the fact.

What this means in practice:

Centralized logging: All application, infrastructure, and security tool logs in one place. If you have to SSH into a box to read logs, you are already behind.
Audit trails: Who did what, when, and from where. Every deployment, every config change, every access request.
Alerting on anomalies: Not just “is the service up?” but “is this access pattern normal?” Unusual API call volumes, access from new locations, privilege escalations.
Runtime security: Tools like Falco for container runtime monitoring. Know when something unexpected happens in production.

Monitoring is also how you prove to auditors and customers that your security controls are working. “Trust us” is not a compliance strategy.

Open source is your ally

Some of the best security tools available are open source. Trivy, Falco, OPA, Semgrep, gitleaks, cosign, KICS, Checkov. The ecosystem is rich and maturing fast.

Benefits of open source security tooling:

Transparency: You can read the rules and understand exactly what is being checked.
Community: Thousands of contributors finding edge cases and adding detection rules.
No vendor lock-in: You can switch tools without renegotiating a contract.
Cost: Start for free, scale as needed.

This does not mean commercial tools have no place. Some provide valuable aggregation, management, and support. But you can build a very solid security pipeline with open source tools alone, and we think every team should start there.

Continuous learning is essential

The threat landscape changes constantly. The tools change. The best practices evolve. What was considered secure two years ago might have a CVE today.

What we do to stay current:

Dedicate time for learning: At least a few hours per sprint for the team to read about new vulnerabilities, tools, and techniques. This is not a nice-to-have. It is a professional requirement.
Run internal CTFs and tabletop exercises: Nothing teaches security like trying to break things. Regular exercises keep skills sharp and reveal gaps in your defenses.
Participate in the community: Attend meetups, contribute to open source, read advisories. The security community is generous with knowledge. Take advantage of it.
Review and update: Quarterly reviews of your security tooling, policies, and incident response procedures. What worked last quarter may not work next quarter.

Final thoughts

DevSecOps isn’t a destination. There’s no point where you say “we’re done, we’re secure.” It’s a continuous practice of reducing risk, improving visibility, and building a culture where security is as natural as writing tests.

The most important lesson: perfect is the enemy of good. A basic security pipeline that developers actually use beats a comprehensive one they bypass. Start where you are, improve iteratively, never stop.

LLMOps: integrating LLMs into DevOps workflows

Sun, 15 Jun 2025 00:00:00 +0000

LLMs have moved beyond chatbots. They’re now embedded in engineering workflows where they automate tedious tasks, speed incident response, and boost developer productivity. But deploying an LLM into a production DevOps pipeline is fundamentally different from using ChatGPT in a browser.

This guide covers what LLMOps means in practice, where LLMs fit into DevOps, architecture patterns that work, and pitfalls to avoid.

What is LLMOps?

LLMOps is the set of practices, tools, and infrastructure needed to operationalize LLMs. It extends MLOps but addresses challenges unique to language models:

Model selection vs. model training: Most teams consume pre-trained models (via APIs or self-hosted inference) rather than training from scratch. The operational focus shifts to prompt engineering, fine-tuning, and retrieval-augmented generation (RAG).
Cost management: LLM inference is expensive. Token-based pricing means costs scale with usage in ways that are harder to predict than traditional compute.
Non-determinism: LLMs produce variable outputs for the same input, which complicates testing, validation, and reproducibility.
Latency: Response times of seconds (not milliseconds) require different architectural patterns than traditional microservices.

LLMOps is not a separate discipline. It is an extension of your existing DevOps and MLOps practices, adapted for the specific operational characteristics of language models.

Practical use cases in DevOps

Here is where LLMs are delivering real value in DevOps workflows today:

Automated code review

LLMs can provide a first-pass review of pull requests, catching common issues like missing error handling, security anti-patterns, inconsistent naming, or missing tests. They do not replace human reviewers but reduce the burden of repetitive feedback.

Incident summarization

When an incident fires at 3 AM, the on-call engineer needs context fast. An LLM can ingest alert data, recent deployment logs, related runbooks, and previous incident reports to produce a concise summary of what is likely going wrong and what was done last time.

Log analysis

LLMs are surprisingly effective at pattern recognition in unstructured log data. Feed them a block of error logs and they can often point to a likely root cause faster than manual grep sessions, especially for unfamiliar systems. Treat the result as a hypothesis to verify, not a diagnosis.

Documentation generation

Generating draft documentation from code, API schemas, or Terraform modules. The output needs human review, but it eliminates the blank-page problem and keeps docs closer to current state.

Infrastructure as Code generation

Given a natural language description of desired infrastructure, LLMs can generate Terraform, Ansible, or Kubernetes manifests as a starting point. Useful for scaffolding, not for production-ready code without review.

Architecture patterns for LLM integration

Pattern 1: API gateway to external LLM

The simplest approach. Your application calls an external LLM API (OpenAI, Anthropic, etc.) through a centralized gateway that handles authentication, rate limiting, logging, and cost tracking.

[CI/CD Pipeline] --> [API Gateway] --> [External LLM API]
                          |
                 [Logging & Metrics]
                          |
                   [Cost Tracking]

Pros: No infrastructure to manage, access to the most capable models, fast to implement.
Cons: Data leaves your network, vendor lock-in, variable latency, ongoing API costs.

Pattern 2: Self-hosted inference

Run open-weight models (Llama, Mistral, etc.) on your own infrastructure using inference servers like vLLM or Ollama.

[CI/CD Pipeline] --> [Load Balancer] --> [vLLM / Ollama Instance(s)]
                                                    |
                                             [GPU Node Pool]

Pros: Data stays internal, predictable costs at scale, no vendor dependency, full control over model versions.
Cons: Requires GPU infrastructure, operational overhead, smaller models may be less capable.

Pattern 3: RAG-enhanced pipeline

Combine an LLM with a retrieval system that provides relevant context from your own knowledge base (runbooks, documentation, past incidents). This dramatically improves response quality for domain-specific tasks.

[Query] --> [Embedding Model] --> [Vector DB Search] --> [Context + Query] --> [LLM] --> [Response]
                                          |
                                 [Your Knowledge Base]
                                 (runbooks, docs, etc.)

This pattern is particularly powerful for incident response and documentation tasks where the LLM needs your organization’s specific context.
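
The retrieval step can be sketched without any external services. The example below is a toy: it swaps the embedding model and vector database for bag-of-words counts and cosine similarity, purely to show the flow from query to retrieved context to prompt. The knowledge base entries and function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: dict[str, str], k: int = 1) -> str:
    """Combine retrieved context and the query into one prompt for the LLM."""
    context = "\n".join(f"[{name}] {docs[name]}" for name in retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base entries.
kb = {
    "runbook-disk": "When disk usage is high, rotate logs and expand the volume.",
    "runbook-db": "When the database failover triggers, check replication lag first.",
}
print(build_prompt("database failover alert fired", kb))
```

Including the source name in the context (here in square brackets) makes it easier for the on-call engineer to check the answer against the original runbook.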

Key considerations

Cost

LLM API costs can be surprising. A code review pipeline that processes 50 PRs per day with large diffs can easily run hundreds of dollars per month. Strategies to control costs:

Set token limits per request
Cache common queries and responses
Use smaller models for simpler tasks (triage with a small model, escalate to a larger one)
Monitor token usage per pipeline and set alerts
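
A rough back-of-the-envelope estimate helps before committing to a pipeline. The sketch below uses the 50 PRs per day figure from above; the token counts, the 22 working days, and the per-million-token prices are placeholder assumptions, not any provider's actual pricing.

```python
def monthly_cost(prs_per_day: int, input_tokens_per_pr: int, output_tokens_per_pr: int,
                 usd_per_million_input: float, usd_per_million_output: float,
                 working_days: int = 22) -> float:
    """Estimate monthly spend for a per-PR LLM call."""
    per_pr = (input_tokens_per_pr * usd_per_million_input
              + output_tokens_per_pr * usd_per_million_output) / 1_000_000
    return prs_per_day * working_days * per_pr

# Placeholder numbers: large diffs (~40k input tokens), short reviews (~2k output tokens).
estimate = monthly_cost(
    prs_per_day=50,
    input_tokens_per_pr=40_000,
    output_tokens_per_pr=2_000,
    usd_per_million_input=5.0,    # assumed price, check your provider
    usd_per_million_output=15.0,  # assumed price, check your provider
)
print(f"estimated monthly cost: ${estimate:.2f}")
```

Under these assumptions the input tokens dominate, which is why capping diff size per request tends to matter more than capping response length.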

Latency

LLM responses take seconds, not milliseconds. Design your integrations as asynchronous processes:

Post code review comments after the fact, do not block the PR
Process incident data in the background, push results to a Slack channel
Use streaming responses where possible to improve perceived performance

Hallucinations

LLMs will confidently generate plausible-sounding but incorrect information. This is a critical concern for DevOps tasks where bad advice can cause outages.

Mitigations:

Always present LLM output as suggestions, never as authoritative actions
Require human approval before any LLM-generated change is applied
Use RAG to ground responses in verified documentation
Implement output validation (e.g., lint generated IaC before presenting it)
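
Output validation can start very simply: refuse to pass along anything that does not have the shape you asked for. The sketch below assumes the model was asked to return JSON with a `summary` and a list of `line_comments` (each with `path`, `line`, `body`); that schema and the function name are assumptions for illustration.

```python
import json

def validate_review(raw: str) -> dict:
    """Parse LLM output and reject anything that is not the expected shape."""
    review = json.loads(raw)  # raises ValueError (JSONDecodeError) on non-JSON output
    if not isinstance(review, dict):
        raise ValueError("review must be a JSON object")
    summary = review.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("missing summary")
    comments = review.get("line_comments", [])
    if not isinstance(comments, list):
        raise ValueError("line_comments must be a list")
    for c in comments:
        ok = (isinstance(c, dict)
              and isinstance(c.get("path"), str)
              and isinstance(c.get("line"), int) and c["line"] > 0
              and isinstance(c.get("body"), str))
        if not ok:
            raise ValueError(f"malformed comment: {c!r}")
    return review

good = '{"summary": "Looks fine", "line_comments": [{"path": "a.py", "line": 3, "body": "check input"}]}'
print(validate_review(good)["summary"])
```

Structural checks like this catch malformed output, not wrong advice, so they complement rather than replace human approval.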

Security

Data exposure: Anything you send to an external LLM API may be stored and, depending on the provider’s terms, used for training. Never send secrets, credentials, or sensitive customer data.
Prompt injection: Malicious content in code, logs, or user input can manipulate LLM behavior. Sanitize inputs and validate outputs.
Supply chain: LLM-generated code may introduce vulnerabilities. Run all generated code through your existing security scanning pipeline.
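
One practical guard for the data exposure risk is to redact obvious secrets before a diff or log ever reaches the prompt. The sketch below is illustrative only: the two patterns (the AWS access key ID format and simple `key=value` credentials) are a small, assumed subset, and a real pipeline would more likely reuse the rules of a maintained secret scanner.

```python
import re

# (pattern, replacement) pairs. Deliberately short and not exhaustive.
RULES = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED]"),
    # name=value or name: value pairs whose name looks like a credential.
    (re.compile(r"(?i)(password|passwd|secret|token|api[_-]?key)(\s*[:=]\s*)\S+"),
     r"\1\2[REDACTED]"),
]

def redact(text: str) -> str:
    """Replace likely secrets so they never reach an external API."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

# AKIAIOSFODNN7EXAMPLE is the example key ID used in AWS documentation.
log_line = "db_password=hunter2 key=AKIAIOSFODNN7EXAMPLE status=ok"
print(redact(log_line))
```

Redaction reduces accidental leakage but does nothing against prompt injection, which still needs input sanitization and output validation.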

Tools and platforms

LangChain

A framework for building LLM-powered applications. Useful for orchestrating multi-step chains (e.g., retrieve context, format prompt, call LLM, parse output). Supports many LLM providers and has good tooling for RAG pipelines.

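# Requires the langchain-openai package and an OPENAI_API_KEY environment variable.
from pathlib import Path

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Single-step chain: fill the prompt template, then call the model.
prompt = ChatPromptTemplate.from_template(
    "Review this code diff for security issues and suggest fixes:\n\n{diff}"
)
chain = prompt | ChatOpenAI(model="gpt-4o", temperature=0)

code_diff = Path("diff.txt").read_text()  # the diff to review, e.g. from `git diff`
result = chain.invoke({"diff": code_diff})
print(result.content)  # the model's reply text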

vLLM

A high-throughput inference engine for self-hosted models. Supports PagedAttention for efficient memory management and continuous batching for high throughput.

# Start a vLLM server
python -m vllm.entrypoints.openai.api_server \
 --model mistralai/Mistral-7B-Instruct-v0.2 \
 --port 8000

Exposes an OpenAI-compatible API, so you can swap between self-hosted and external APIs with minimal code changes.

Ollama

The easiest way to run LLMs locally for development and testing. Great for prototyping pipelines before committing to infrastructure.

# Pull and run a model
ollama pull llama3
ollama run llama3 "Summarize this error log: [paste log]"

# Serve as an API
ollama serve
# Then call http://localhost:11434/api/generate

Example: Automated PR review pipeline

Here is a conceptual pipeline for automated PR review using an LLM:

# .github/workflows/llm-review.yml
name: LLM Code Review

on:
  pull_request:
    types: [opened, synchronize]

# Posting a review needs write access to pull requests.
permissions:
  contents: read
  pull-requests: write

jobs:
  llm-review:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > diff.txt

      - name: Run LLM review
        env:
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
        run: |
          python scripts/llm_review.py \
            --diff diff.txt \
            --model gpt-4o \
            --max-tokens 2000 \
            --output review.json

      - name: Post review comments
        uses: actions/github-script@v7
        with:
          script: |
            const review = require('./review.json');
            await github.rest.pulls.createReview({
              owner: context.repo.owner,
              repo: context.repo.repo,
              pull_number: context.issue.number,
              body: review.summary,
              event: 'COMMENT',
              comments: review.line_comments
            });

The review script would:

Read the diff
Split large diffs into chunks that fit within the model’s context window
For each chunk, construct a prompt asking for security issues, bugs, and style problems
Aggregate results and format as GitHub review comments
Include confidence scores and always mark output as AI-generated
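
The chunking step could look something like the sketch below. It splits a unified diff at `diff --git` boundaries so each file stays intact, then packs files into chunks under a size budget. Using a character count as a stand-in for tokens is an assumption, and the function names are hypothetical rather than part of any actual review script.

```python
def split_by_file(diff: str) -> list[str]:
    """Split a unified diff into one section per file."""
    sections, current = [], []
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git ") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections

def chunk_diff(diff: str, max_chars: int) -> list[str]:
    """Pack per-file sections into chunks of at most max_chars where possible.

    A single file larger than max_chars becomes its own oversized chunk;
    a fuller version would split it further by hunk.
    """
    chunks, buf = [], ""
    for section in split_by_file(diff):
        if buf and len(buf) + len(section) > max_chars:
            chunks.append(buf)
            buf = ""
        buf += section
    if buf:
        chunks.append(buf)
    return chunks

# Three small hypothetical file diffs of 35 characters each.
sample = "".join(f"diff --git a/f{i}.py b/f{i}.py\n+line {i}\n" for i in range(3))
for n, chunk in enumerate(chunk_diff(sample, max_chars=80), start=1):
    print(f"chunk {n}: {len(chunk)} chars")
```

Splitting on file boundaries keeps each prompt self-contained, at the cost of the model not seeing cross-file context.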

Guardrails and responsible use

Label all LLM output clearly as AI-generated. Engineers should know when they are reading machine output.
Never auto-merge or auto-apply LLM suggestions. Keep a human in the loop for all changes.
Log all prompts and responses for debugging and audit purposes.
Set spending limits and alerts on LLM API usage.
Review prompt templates regularly to ensure they do not leak sensitive information.
Test for bias and errors with representative samples before deploying to production workflows.

Getting started recommendations

Pick one use case - Don’t try to LLM-enable everything at once. Start low-risk: documentation drafts, commit message suggestions.
Start with an external API - Don’t invest in GPU infrastructure until you’ve validated the use case. Use OpenAI or Anthropic to prototype.
Measure everything - Track cost per invocation, latency, user satisfaction, error rates from day one.
Build an evaluation framework - Create a test suite of known-good inputs and expected outputs. Run it against every prompt change or model update.
Plan your data strategy - Decide early what data you will and won’t send to external APIs, and document the decision clearly.
Iterate on prompts - Prompt engineering is iterative. Version control prompts, treat as code.
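
The evaluation framework recommendation can start as a handful of regression cases run against whatever callable wraps your model. Because outputs vary between runs, the sketch checks for required terms rather than exact matches. The cases, the `evaluate` helper, and the stand-in models are all hypothetical.

```python
from typing import Callable

# Hypothetical regression cases: input plus terms the answer must mention.
CASES = [
    {"input": "query = 'SELECT * FROM users WHERE id=' + user_id",
     "must_mention": ["sql injection"]},
    {"input": "x = 1 + 1", "must_mention": []},
]

def evaluate(model: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of cases whose answer mentions every required term."""
    passed = 0
    for case in cases:
        answer = model(case["input"]).lower()
        if all(term in answer for term in case["must_mention"]):
            passed += 1
    return passed / len(cases)

def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call, so the sketch runs offline."""
    if "SELECT" in prompt:
        return "Possible SQL injection: use parameterized queries."
    return "No issues found."

def lazy_model(prompt: str) -> str:
    return "LGTM"

print(evaluate(fake_model, CASES), evaluate(lazy_model, CASES))
```

Running the same cases after every prompt or model change gives a pass rate you can track alongside the cost and latency metrics.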

LLMs are a powerful tool for DevOps automation, but they’re exactly that: a tool. They work best when thoughtfully integrated into existing workflows, with clear boundaries on what they can and cannot do autonomously.

DevSecOps maturity model

Sun, 08 Oct 2023 10:00:00 +0100

Why a Maturity Model Helps

Most teams know they should “shift security left,” but knowing where to start is the hard part. A maturity model gives you a structured way to assess your current state, identify gaps, and plan a realistic roadmap for improvement.

Without a model, security improvements tend to be reactive (triggered by incidents or audit findings rather than deliberate planning). A maturity model turns security from a fire drill into an engineering discipline with measurable progress.

The model described here has five levels. The goal is not to rush to the highest level but to make steady, sustainable progress. Each level builds on the previous one.

The Five Maturity Levels

Level 1: Ad-Hoc

At this level, security is an afterthought. There are no formal processes, and security activities happen sporadically if at all.

What it looks like:

No security testing in CI/CD pipelines.
Vulnerabilities discovered in production or by external parties.
No dedicated security tooling.
Developers have little to no security training.
Incident response is improvised.
Compliance is addressed manually before audits.

Typical tools: None specifically for security. Maybe a firewall and antivirus.

Level 2: Reactive

Security is recognized as important, but the approach is reactive. The team responds to vulnerabilities and incidents but doesn’t proactively prevent them.

What it looks like:

Basic static analysis (SAST) runs occasionally, but findings are not always addressed.
Dependency scanning is done manually or on an ad-hoc basis.
There’s some security documentation, but it’s outdated.
Incident response exists as a documented process, though it’s rarely practiced.
Security reviews happen late in the development cycle (right before release).

Typical tools: SonarQube (basic rules), OWASP Dependency-Check, manual penetration testing.

Level 3: Proactive

Security is integrated into the development workflow. The team actively seeks to prevent vulnerabilities rather than just reacting to them.

What it looks like:

SAST and DAST run automatically in CI/CD pipelines.
Dependency scanning with automated alerts for known vulnerabilities.
Container image scanning before deployment (Trivy, Grype).
Infrastructure as Code is scanned for misconfigurations (Checkov, tfsec).
Threat modeling is performed for new features and architecture changes.
Security champions exist within development teams.
Blameless postmortems are conducted after security incidents.
Regular security training for developers.

Typical tools: Semgrep, Trivy, Checkov, OWASP ZAP, HashiCorp Vault, Falco.

Level 4: Optimized

Security is deeply embedded in every stage of the software lifecycle. Metrics drive decisions, and the team continuously improves based on data.

What it looks like:

Security gates in pipelines that block deployment if critical issues are found.
Mean time to remediate (MTTR) is tracked and continuously reduced.
Software Bill of Materials (SBOM) generated for every release.
Signed artifacts and verified supply chain.
Automated compliance checks mapped to frameworks (SOC2, ISO 27001, PCI-DSS).
Runtime security monitoring with automated response (Falco + custom rules).
Regular red team exercises and chaos engineering for security.
Security metrics are part of engineering dashboards.

Typical tools: Sigstore/cosign, OPA/Gatekeeper, Kyverno, SIEM integration, automated compliance platforms.

Level 5: Innovative

Security is a competitive advantage. The team contributes to the broader security community and pushes the state of the art.

What it looks like:

Bug bounty programs actively managed.
Custom security tooling developed for organization-specific risks.
Machine learning applied to anomaly detection and threat hunting.
Security is a feature sold to customers (certifications, transparency reports).
Active participation in open-source security projects.
Zero-trust architecture fully implemented.
Policy as code governs all infrastructure and application security.

Typical tools: Custom-built platforms, eBPF-based security tools, advanced SIEM with ML, zero-trust service mesh.

Key Dimensions

A maturity model isn’t one-dimensional. Assess your organization across these dimensions, as progress is rarely uniform:

Code Security

Level	Practices
Ad-Hoc	No code scanning
Reactive	Occasional SAST, manual code reviews for security
Proactive	Automated SAST/DAST in CI, security-focused code review guidelines
Optimized	Custom rules for organization-specific patterns, MTTR tracked
Innovative	AI-assisted code review, automatic fix suggestions

Infrastructure Security

Level	Practices
Ad-Hoc	Manual server configuration, no hardening standards
Reactive	Basic hardening checklists, occasional audits
Proactive	IaC scanning, automated hardening, CIS benchmarks
Optimized	Policy as code (OPA), drift detection, automated remediation
Innovative	Self-healing infrastructure, zero-trust networking

Monitoring and Detection

Level	Practices
Ad-Hoc	No security monitoring
Reactive	Basic log collection, manual review after incidents
Proactive	Centralized logging, alerting on known patterns, runtime monitoring
Optimized	SIEM with correlation rules, automated response playbooks
Innovative	ML-based anomaly detection, threat hunting programs

Incident Response

Level	Practices
Ad-Hoc	No process, ad-hoc response
Reactive	Documented runbooks, rarely tested
Proactive	Regular tabletop exercises, blameless postmortems, on-call rotation
Optimized	Automated incident classification, SLA-driven response times
Innovative	Chaos engineering for security, automated containment

Compliance

Level	Practices
Ad-Hoc	Manual evidence collection before audits
Reactive	Spreadsheet-based tracking, periodic reviews
Proactive	Automated evidence collection, continuous monitoring
Optimized	Compliance as code, real-time dashboards, automated reporting
Innovative	Continuous certification, public transparency reports

Self-Assessment Checklist

Rate your organization on each item (Yes / Partial / No):

Build Phase:

SAST runs automatically on every pull request.
Dependency scanning alerts on known CVEs.
Container images are scanned before being pushed to a registry.
IaC templates are scanned for misconfigurations.
Secrets detection prevents credentials from being committed.

Deploy Phase:

Security gates can block deployment for critical findings.
Artifacts are signed and signatures are verified.
SBOM is generated for every release.
Infrastructure changes go through policy-as-code validation.

Run Phase:

Runtime security monitoring is active (Falco, Sysdig, etc.).
Centralized logging with security-relevant alerts.
Network segmentation limits blast radius.
Secrets are managed through a dedicated vault.

Culture and Process:

Developers receive regular security training.
Security champions are embedded in development teams.
Blameless postmortems are conducted after incidents.
Threat modeling is part of the design process for new features.
Security metrics are tracked and reviewed regularly.
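
To turn the checklist into something comparable over time, the answers can be scored per phase. In the sketch below the weighting (Yes = 1, Partial = 0.5, No = 0) and the sample answers are assumptions; the phases mirror the checklist above.

```python
SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}  # assumed weighting

def phase_scores(answers: dict[str, list[str]]) -> dict[str, float]:
    """Return a percentage score per phase."""
    return {
        phase: round(100 * sum(SCORES[a.lower()] for a in items) / len(items), 1)
        for phase, items in answers.items()
    }

# Hypothetical answers, one per checklist item in each phase.
answers = {
    "Build": ["Yes", "Yes", "Partial", "No", "Yes"],
    "Deploy": ["Partial", "No", "No", "No"],
    "Run": ["Yes", "Partial", "Partial", "Yes"],
    "Culture and Process": ["Partial", "No", "Yes", "No", "Partial"],
}

scores = phase_scores(answers)
weakest = min(scores, key=scores.get)
for phase, pct in scores.items():
    print(f"{phase}: {pct}%")
print("weakest phase:", weakest)
```

The weakest phase is a reasonable first candidate for the roadmap, though impact and effort should weigh in as well.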

Roadmap for Progression

Moving up the maturity levels doesn’t happen overnight. Here’s a practical roadmap:

From Ad-Hoc to Reactive (3-6 months)

Add a SAST tool to your CI pipeline (start with Semgrep - it has good defaults and is fast).
Enable dependency scanning (GitHub Dependabot, or trivy fs in CI).
Document your incident response process, even if it’s simple.
Run a single security training session for the team.

From Reactive to Proactive (6-12 months)

Add container image scanning and IaC scanning to pipelines.
Implement secrets detection in pre-commit hooks (gitleaks, detect-secrets).
Appoint security champions in each team.
Start threat modeling for major features.
Conduct your first blameless postmortem after an incident.
Deploy runtime monitoring (Falco).

From Proactive to Optimized (12-18 months)

Implement security gates that can block deployments.
Track MTTR and set reduction targets.
Generate SBOMs and sign artifacts.
Implement policy-as-code for infrastructure (OPA/Gatekeeper).
Map automated checks to compliance frameworks.
Integrate security metrics into engineering dashboards.

From Optimized to Innovative (18+ months)

Launch a bug bounty program.
Build custom security tooling for organization-specific risks.
Implement zero-trust architecture.
Run regular red team exercises.
Contribute to open-source security projects.

Cultural Aspects

Tools and processes are necessary but insufficient. Culture determines whether security practices actually stick.

Blameless Postmortems

When a security incident occurs, the instinct is often to find someone to blame. This drives people to hide mistakes and cover up near-misses. Blameless postmortems flip this around: they focus on systemic failures and process improvements rather than individual fault. The question changes from “who made this mistake?” to “what allowed this mistake to happen, and how do we prevent it?”

Security Champions

A security champion is a developer who takes on extra responsibility for security within their team. They are not full-time security engineers — they are developers who act as a bridge between the security team and the development team. Their role includes:

Reviewing security-relevant pull requests.
Staying current on security topics and sharing knowledge.
Participating in threat modeling sessions.
Being the first point of contact for security questions.

This model scales far better than having a central security team review everything.

Making Security Easy

If security practices are painful, people will find workarounds. The goal is to make security the easiest path:

Provide secure templates and starter projects.
Automate as much as possible so developers don’t have to remember manual steps.
Give fast feedback. A SAST scan that takes 30 minutes will be ignored; one that takes 30 seconds will be used.
Celebrate security improvements just as you celebrate feature delivery.

Conclusion

A DevSecOps maturity model is a compass, not a destination. The value comes from honest self-assessment, setting realistic goals, and making steady progress. Start where you are, pick the dimension where improvement will have the most impact, and build from there. Security is a team sport. The best security cultures are built incrementally, one practice at a time.

Introduction to AIOps: intelligent IT operations

Mon, 05 Dec 2022 00:00:00 +0000

What is AIOps?

AIOps (Artificial Intelligence for IT Operations) applies machine learning and data analytics to operational data (logs, metrics, events, traces) to automate and improve IT operations workflows. Gartner introduced the term in 2016, originally as “Algorithmic IT Operations,” but the idea is simple: use algorithms to handle the volume and complexity that humans can’t manage manually.

In practical terms, AIOps platforms ingest data from monitoring tools, APM systems, log aggregators, and event sources. They apply ML models to detect anomalies, correlate events, identify root causes, and in some cases trigger automated remediation. The goal is to reduce mean time to detection (MTTD) and mean time to resolution (MTTR) while freeing operations teams from alert fatigue.

Why traditional monitoring falls short

Monitoring used to work fine. You had a few servers, a handful of apps, and a limited set of metrics to watch. A static CPU threshold or log regex was enough.

Modern infrastructure broke that model:

Scale: A medium Kubernetes cluster generates millions of metrics and logs per minute. You can’t humanly watch dashboards at that scale.
Complexity: Microservices create tangled dependency graphs. One user request might touch dozens of services. Finding what caused a latency spike means correlating data across all of them.
Dynamic environments: Auto-scaling, ephemeral containers, and serverless functions mean baselines constantly shift. Static thresholds explode with false positives.
Alert fatigue: Teams get buried in alerts. When 90% is noise, that critical 10% disappears. Engineers start ignoring everything.

AIOps doesn’t replace monitoring. It layers on top of what you already have and makes it smarter.

Key capabilities

1. Anomaly detection

Instead of static thresholds, AIOps uses ML models (often time-series analysis, clustering, or autoencoders) to learn what “normal” looks like for each metric and service. When behavior deviates significantly from the learned baseline, an anomaly is flagged.

This handles the dynamic baseline problem. If your application normally sees a traffic spike every Monday at 9 AM, the model learns that pattern and does not alert on it. But an unexpected spike at 3 AM on a Wednesday gets flagged.
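
The Monday-versus-Wednesday example can be sketched with a much simpler technique than the ML models mentioned above: a per-slot mean and standard deviation, keyed by weekday and hour, with a z-score threshold. This is a stand-in for illustration, not how any particular platform works; the class name, threshold, and traffic numbers are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev

class SeasonalBaseline:
    """Learns a mean and spread per (weekday, hour) slot and flags outliers."""

    def __init__(self, threshold: float = 3.0, min_samples: int = 3):
        self.threshold = threshold
        self.min_samples = min_samples
        self.history = defaultdict(list)

    def observe(self, ts: datetime, value: float) -> None:
        self.history[(ts.weekday(), ts.hour)].append(value)

    def is_anomaly(self, ts: datetime, value: float) -> bool:
        samples = self.history.get((ts.weekday(), ts.hour), [])
        if len(samples) < self.min_samples:
            return False  # not enough data to judge this slot yet
        mu, sigma = mean(samples), stdev(samples)
        return abs(value - mu) > self.threshold * max(sigma, 1e-9)

baseline = SeasonalBaseline()
monday_9am = datetime(2024, 1, 1, 9)     # 2024-01-01 is a Monday
wednesday_3am = datetime(2024, 1, 3, 3)
# Four weeks of hypothetical request rates: busy Mondays, quiet Wednesday nights.
for week, (busy, quiet) in enumerate([(980, 48), (1010, 52), (1020, 50), (990, 51)]):
    baseline.observe(monday_9am + timedelta(weeks=week), busy)
    baseline.observe(wednesday_3am + timedelta(weeks=week), quiet)

print(baseline.is_anomaly(monday_9am + timedelta(weeks=4), 1005))     # usual Monday spike
print(baseline.is_anomaly(wednesday_3am + timedelta(weeks=4), 1005))  # same load at 3 AM Wednesday
```

The same value is normal in one slot and anomalous in another, which is exactly what a static threshold cannot express.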

2. Event correlation

A single infrastructure issue can generate hundreds or thousands of related alerts across different monitoring tools. AIOps correlates these events — grouping them by time, topology, and causal relationships — to present a single incident instead of a wall of alerts.

For example, a network switch failure might trigger alerts on: the switch itself, all connected servers (connectivity lost), all applications on those servers (health check failures), and downstream services (timeout errors). An AIOps platform correlates all of these into one incident: “Network switch X failed.”
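
A rough sketch of that grouping, assuming you already have a dependency map and that alerts sharing a topology root within a short time window belong together. The `DEPENDS_ON` map, resource names, and the 120-second window are all hypothetical; real platforms use richer topology and learned correlations.

```python
from dataclasses import dataclass, field

# Hypothetical topology: each resource maps to the resource it depends on.
DEPENDS_ON = {
    "server-1": "switch-x",
    "server-2": "switch-x",
    "app-a": "server-1",
    "app-b": "server-2",
    "checkout": "app-a",
}


def topology_root(resource):
    """Follow dependencies upward until reaching a resource with no parent."""
    seen = set()
    while resource in DEPENDS_ON and resource not in seen:
        seen.add(resource)
        resource = DEPENDS_ON[resource]
    return resource


@dataclass
class Incident:
    root: str
    last_seen: float
    alerts: list = field(default_factory=list)


def correlate(alerts, window_seconds=120):
    """Group (timestamp, resource, message) alerts that share a root and arrive close together."""
    incidents = []
    open_by_root = {}
    for ts, resource, message in sorted(alerts):
        root = topology_root(resource)
        incident = open_by_root.get(root)
        # Start a new incident if none is open for this root or the last alert is too old.
        if incident is None or ts - incident.last_seen > window_seconds:
            incident = Incident(root=root, last_seen=ts)
            incidents.append(incident)
            open_by_root[root] = incident
        incident.alerts.append((ts, resource, message))
        incident.last_seen = ts
    return incidents


alerts = [
    (0, "switch-x", "link down"),
    (5, "server-1", "connectivity lost"),
    (6, "server-2", "connectivity lost"),
    (20, "app-a", "health check failed"),
    (35, "checkout", "upstream timeout"),
    (40, "db-9", "replication lag"),  # unrelated resource, so a separate incident
]
incidents = correlate(alerts)
for incident in incidents:
    print(incident.root, len(incident.alerts))
```

Six alerts collapse into two incidents: one rooted at the switch, one for the unrelated database.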

3. Root cause analysis

Beyond correlation, AIOps attempts to identify the root cause of an incident. By understanding the topology of your infrastructure and the causal chain of events, it can suggest that the network switch failure is the root cause, rather than presenting the application timeout as an independent issue.

This is where the value becomes tangible. Instead of an on-call engineer spending 30 minutes tracing through dashboards and logs, the platform surfaces the probable root cause immediately.
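
One simple heuristic for "probable root cause", sketched under the assumption that you have a dependency graph and a set of currently alerting resources: pick the alerting resources that have no alerting dependency upstream of them. The graph and names below are made up, and production systems weigh far more signals (change events, timing, history).

```python
def probable_root_causes(depends_on, alerting):
    """Return alerting resources whose transitive dependencies are all healthy."""

    def has_alerting_upstream(resource):
        stack = list(depends_on.get(resource, []))
        seen = set()
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting:
                return True
            stack.extend(depends_on.get(dep, []))
        return False

    return sorted(r for r in alerting if not has_alerting_upstream(r))


# Hypothetical dependency graph: resource -> list of things it depends on.
depends_on = {
    "checkout": ["app-a", "payments"],
    "app-a": ["server-1"],
    "server-1": ["switch-x"],
    "payments": [],
}

switch_outage = probable_root_causes(
    depends_on, {"checkout", "app-a", "server-1", "switch-x"}
)
app_bug = probable_root_causes(depends_on, {"checkout", "app-a"})
print(switch_outage, app_bug)
```

In the first case everything downstream of the switch is treated as a symptom; in the second, the application itself is the most upstream failing component.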

4. Auto-remediation

The most mature AIOps implementations close the loop by triggering automated remediation actions. If a known pattern is detected (disk filling up, a pod in CrashLoopBackOff, a runaway process consuming memory), the platform can execute predefined runbooks automatically.

Examples:

Restart a crashed pod or service.
Scale up a deployment when anomalous load is detected.
Clear a log directory when disk usage exceeds a dynamic threshold.
Trigger a failover when a primary database becomes unresponsive.

Auto-remediation requires careful design. Start with low-risk actions and expand as confidence grows.
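
One way to encode "start with low-risk actions" is a risk ceiling plus a dry-run default. The sketch below is illustrative only: the runbook names, risk levels, and lambda actions are placeholders, not a real automation API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Runbook:
    name: str
    risk: int  # 1 = low (restart a pod), 3 = high (database failover)
    action: Callable[[str], str]


# Hypothetical registry mapping detected patterns to predefined runbooks.
RUNBOOKS = {
    "pod_crashloop": Runbook("restart-pod", 1, lambda target: f"restarted {target}"),
    "disk_filling": Runbook("clear-logs", 2, lambda target: f"cleared logs on {target}"),
    "db_unresponsive": Runbook("failover-db", 3, lambda target: f"failed over {target}"),
}


def remediate(pattern, target, max_risk=1, dry_run=True):
    """Run a runbook only if it exists and its risk is within the allowed ceiling."""
    runbook = RUNBOOKS.get(pattern)
    if runbook is None:
        return "escalate: no runbook for this pattern"
    if runbook.risk > max_risk:
        return f"escalate: {runbook.name} exceeds allowed risk"
    if dry_run:
        return f"dry-run: would run {runbook.name} on {target}"
    return runbook.action(target)


print(remediate("pod_crashloop", "web-1"))
print(remediate("pod_crashloop", "web-1", dry_run=False))
print(remediate("db_unresponsive", "pg-main", dry_run=False))
```

Raising `max_risk` over time is the "expand as confidence grows" step; anything above the ceiling still goes to a human.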

Common platforms and tools

The AIOps landscape includes both commercial platforms and open-source building blocks:

Commercial platforms

Platform	Strengths
Dynatrace	Strong auto-discovery, AI engine (Davis), full-stack observability
Datadog	Unified monitoring + ML-powered alerting, Watchdog anomaly detection
Splunk ITSI	Powerful log analytics + ML toolkit, good for event correlation
Moogsoft	Pioneered AIOps space, strong event correlation and noise reduction
BigPanda	Event correlation and automation focused, integrates with existing tools
PagerDuty	Incident management with ML-driven noise reduction and smart grouping

Open-source building blocks

You can assemble an AIOps-like stack from open-source components:

Data collection: Prometheus, Grafana Agent, OpenTelemetry Collector, Fluentd/Fluent Bit.
Data storage: Prometheus (metrics), Elasticsearch/OpenSearch (logs), Jaeger/Tempo (traces).
Anomaly detection: Facebook Prophet, Isolation Forest (scikit-learn), luminol, Grafana ML (a Grafana Cloud feature rather than an open-source component).
Event correlation: Custom logic on top of event streams, or StackStorm for event-driven automation.
Alerting and automation: Alertmanager, Grafana OnCall, StackStorm, Rundeck.

Building a custom AIOps stack is significantly more work than using a commercial platform, but it gives you full control and avoids vendor lock-in. A reasonable middle ground is using a commercial platform for core AIOps capabilities while keeping your data pipeline open-source.

Practical use cases

Noise reduction in alert management

A team receiving 500+ alerts per day implements AIOps event correlation. Related alerts are grouped into incidents, duplicates are suppressed, and flapping alerts are silenced. Alert volume drops by 80%, and the on-call engineer can focus on actual incidents.
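
Deduplication and flap suppression can start as plain rules before any ML is involved. This sketch assumes alerts arrive as (key, state) pairs and treats a key as flapping once it changes state more than `flap_threshold` times; the keys and threshold are hypothetical.

```python
from collections import Counter


def reduce_noise(events, flap_threshold=3):
    """events: (alert_key, state) pairs in arrival order; state is 'firing' or 'resolved'."""
    transitions = Counter()
    last_state = {}
    for key, state in events:
        # Repeats of the same state are duplicates and are ignored.
        if last_state.get(key) != state:
            transitions[key] += 1
            last_state[key] = state
    flapping = {k for k, n in transitions.items() if n > flap_threshold}
    notify = [k for k, s in last_state.items() if s == "firing" and k not in flapping]
    return notify, flapping


events = [
    ("disk-full/web-1", "firing"),
    ("disk-full/web-1", "firing"),  # duplicate
    ("latency/api", "firing"),
    ("latency/api", "resolved"),
    ("latency/api", "firing"),
    ("latency/api", "resolved"),
    ("latency/api", "firing"),      # fifth state change, treated as flapping
    ("disk-full/web-1", "firing"),  # duplicate
]
notify, flapping = reduce_noise(events)
print(notify, flapping)
```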

Proactive capacity planning

AIOps models analyze historical resource usage trends and predict when capacity limits will be reached. Instead of reacting to a disk-full alert at 2 AM, the platform predicts the issue two weeks in advance and creates a ticket for the team to address during business hours.
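
The simplest version of this prediction is a straight-line fit over recent usage. The sketch below assumes one disk-usage sample per day and linear growth, which real workloads often violate; the numbers are invented.

```python
def days_until_full(usage_percent_by_day, capacity_percent=100.0):
    """Fit a least-squares line to daily usage and extrapolate to the capacity limit."""
    n = len(usage_percent_by_day)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage_percent_by_day) / n
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage_percent_by_day))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    slope = slope_num / slope_den
    if slope <= 0:
        return None  # usage is flat or shrinking, so no projected exhaustion
    intercept = mean_y - slope * mean_x
    day_full = (capacity_percent - intercept) / slope
    return max(0.0, day_full - (n - 1))  # days remaining after the last sample


history = [60, 62, 64, 66, 68, 70, 72]  # hypothetical: disk grows 2% per day
remaining = days_until_full(history)
print(remaining)
```

With these samples the projection lands two weeks out, which is enough lead time to open a ticket instead of paging someone.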

Faster incident response

During a production outage, the AIOps platform correlates alerts across the monitoring stack, identifies the root cause (a recent deployment that introduced a memory leak), and surfaces the relevant deployment commit. MTTR drops from 45 minutes to 10 minutes.

Automated scaling

The platform detects anomalous traffic patterns that deviate from the learned baseline. Instead of waiting for CPU to hit 80% (the static threshold), it triggers a scale-up action based on the rate of change, ensuring capacity is ready before users experience degradation.
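
A rate-of-change trigger can be sketched as a short linear projection: if the current trend would cross the limit within the time it takes new capacity to come online, scale now. The sampling interval, lead time, and 80% limit below are assumptions for illustration, and this covers only the rate-of-change part, not baseline learning.

```python
def should_scale_up(cpu_samples, interval_seconds=60, lead_seconds=300, limit=80.0):
    """Scale if the recent trend would carry CPU past the limit within lead_seconds."""
    if len(cpu_samples) < 2:
        return False
    elapsed = (len(cpu_samples) - 1) * interval_seconds
    rate = (cpu_samples[-1] - cpu_samples[0]) / elapsed  # percent per second
    projected = cpu_samples[-1] + rate * lead_seconds
    return projected >= limit


ramping = should_scale_up([40, 48, 56])  # still below 80%, but climbing fast
steady = should_scale_up([55, 56, 56])   # similar level, almost no trend
print(ramping, steady)
```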

How AIOps fits into DevOps workflows

AIOps is not a replacement for DevOps practices. It is an enhancement layer:

Code ──> CI/CD Pipeline ──> Deploy ──> Observe ──> AIOps Layer ──> Act
                                          │             │
                                  Monitoring Stack  ML Models
                                  (metrics, logs,   (anomaly detection,
                                   traces, events)   correlation, RCA)

Developers benefit from faster root cause identification when their code causes issues in production.
Operations teams benefit from noise reduction, automated remediation, and proactive alerting.
SRE teams benefit from data-driven SLO tracking and error budget burn rate analysis.

AIOps works best when your observability foundation is solid. If you are not collecting good data (structured logs, meaningful metrics, distributed traces), ML models will not produce meaningful insights. Fix your observability first, then layer AIOps on top.

Getting started: A pragmatic path

If AIOps sounds useful, here’s a practical approach:

Audit your current observability stack. What data are you collecting? Do you have structured logs? Consistently labeled metrics? Traces across services? AIOps can only work with good data.
Start with noise reduction. This is the lowest-hanging fruit. Implement alert grouping and deduplication. Even basic rules-based correlation (before any ML) will reduce alert fatigue significantly.
Add anomaly detection to key metrics. Pick 3-5 critical business and infrastructure metrics. Apply a time-series anomaly detection model. Facebook Prophet or Prometheus recording rules with seasonal adjustments are good starting points.
Implement automated remediation for known issues. Identify the top 5 recurring incidents. Write runbooks for them. Automate the runbooks using StackStorm, Rundeck, or your platform’s automation engine.
Evaluate a commercial platform when complexity demands it. If you have hundreds of services, multiple monitoring tools, and a growing operations team, the investment in a commercial AIOps platform may be justified by the reduction in MTTR alone.
Measure the impact. Track MTTD, MTTR, alert-to-incident ratio, and false positive rate. Without metrics, you can’t prove AIOps is worth the investment.
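
The metrics in the last step are cheap to compute once incidents are recorded with timestamps. A minimal sketch, assuming both MTTD and MTTR are measured from the moment the incident started (some teams measure MTTR from detection instead) and using invented numbers:

```python
from statistics import mean

# Hypothetical incident log; times are minutes since an arbitrary origin.
incidents = [
    {"started": 0, "detected": 4, "resolved": 34, "alerts": 120},
    {"started": 100, "detected": 102, "resolved": 112, "alerts": 60},
]
total_alerts = 200          # every alert fired in the period
false_positive_alerts = 20  # alerts that mapped to no real incident

mttd = mean(i["detected"] - i["started"] for i in incidents)
mttr = mean(i["resolved"] - i["started"] for i in incidents)
alert_to_incident_ratio = total_alerts / len(incidents)
false_positive_rate = false_positive_alerts / total_alerts

print(f"MTTD={mttd} min, MTTR={mttr} min")
print(f"alerts per incident={alert_to_incident_ratio}, false positives={false_positive_rate:.0%}")
```

Capture these before you change anything, otherwise there is no baseline to compare against.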

AIOps isn’t magic. It’s a set of techniques that, applied to solid operational data, can reduce the burden on ops teams and improve reliability. Start small, measure everything, and scale what actually works.

Infrastructure as code with Terraform: a practical guide

Sat, 10 Sep 2022 00:00:00 +0000

Why infrastructure as code matters

Managing infrastructure manually through web consoles or ad-hoc scripts creates problems that pile up over time: inconsistent environments, undocumented changes, impossible rollbacks, and the classic “it works on my machine” extended to entire servers.

Infrastructure as Code (IaC) fixes this by treating infrastructure like application code: it’s written, versioned, reviewed, tested, and applied through automated workflows. The benefits show up right away:

Reproducibility: Spin up identical environments in minutes, not days.
Version control: Every infrastructure change goes through a PR with code review.
Documentation by default: The code is the documentation of what your infrastructure looks like.
Disaster recovery: Rebuild everything from code if a region goes down.
Cost visibility: Review infrastructure changes before they are applied (and before they start costing money).

Terraform vs other tools

Several IaC tools exist. Here’s how Terraform compares to the main alternatives:

Feature	Terraform	Pulumi	CloudFormation	Ansible
Language	HCL (declarative)	Python, TypeScript, Go, etc.	JSON/YAML	YAML (procedural)
Cloud support	Multi-cloud	Multi-cloud	AWS only	Multi-cloud (via modules)
State management	Explicit state file	Pulumi Cloud by default (self-managed backends supported)	Managed by AWS	Stateless
Learning curve	Moderate	Varies by language	Moderate	Low
Ecosystem	Huge provider ecosystem	Growing	AWS-only but deep	Huge role ecosystem
Best for	Multi-cloud infra	Teams that prefer general-purpose languages	AWS-only shops	Configuration management

Terraform’s sweet spot is multi-cloud infrastructure provisioning with a declarative approach. If you’re on AWS only and want tight integration, CloudFormation is reasonable. If your team prefers writing Python over HCL, Pulumi deserves a look. But for most teams managing infrastructure across providers, Terraform is the pragmatic choice.

Core concepts

Providers

Providers are plugins that let Terraform interact with APIs — AWS, Azure, GCP, Kubernetes, GitHub, Cloudflare, and hundreds more.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-west-1"
}

Resources

Resources are the fundamental building blocks. Each resource block describes one infrastructure object.

resource "aws_instance" "web" {
  # Example AMI ID. AMI IDs are region-specific, so substitute one valid in your region.
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name = "web-server"
  }
}

State

Terraform maintains a state file that maps your configuration to real-world resources. This is how Terraform knows what exists, what needs to change, and what to destroy. The state file is critical. Losing it means Terraform loses track of your infrastructure.

Modules

Modules are reusable packages of Terraform configuration. Think of them as functions: they take inputs (variables), create resources, and produce outputs.

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.0"

  name = "my-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway = true
}

Practical example: VPC + EC2

Here’s a complete example that provisions a VPC with a public subnet and an EC2 instance:

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-west-1"
}

# --- Networking ---

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name = "main-vpc"
  }
}

resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "eu-west-1a"
  map_public_ip_on_launch = true

  tags = {
    Name = "public-subnet"
  }
}

resource "aws_internet_gateway" "gw" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "main-igw"
  }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  # Send all non-local traffic through the internet gateway
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.gw.id
  }

  tags = {
    Name = "public-rt"
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

# --- Security Group ---

resource "aws_security_group" "web" {
  name        = "web-sg"
  description = "Allow HTTP and SSH"
  vpc_id      = aws_vpc.main.id

  # HTTP from anywhere
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # SSH from a single address only
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["YOUR_IP/32"] # Placeholder: replace with your IP before applying
  }

  # Allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# --- EC2 Instance ---

resource "aws_instance" "web" {
  # Example AMI ID. AMI IDs are region-specific, so substitute one valid in your region.
  ami                    = "ami-0c55b159cbfafe1f0"
  instance_type          = "t3.micro"
  subnet_id              = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.web.id]

  tags = {
    Name = "web-server"
  }
}

# --- Outputs ---

output "instance_public_ip" {
  value = aws_instance.web.public_ip
}

output "vpc_id" {
  value = aws_vpc.main.id
}

The plan/apply workflow

Terraform follows a predictable workflow:

# 1. Initialize - download providers and modules
terraform init

# 2. Format - ensure consistent code style
terraform fmt

# 3. Validate - check syntax and configuration
terraform validate

# 4. Plan - preview what will change (critical step!)
terraform plan -out=tfplan

# 5. Apply - execute the plan
terraform apply tfplan

# 6. Destroy - tear down all resources (when needed)
terraform destroy

The terraform plan step is the most important. Never skip it. Always review the plan output before applying, especially in production. The plan shows you exactly what will be created, modified, or destroyed.

# Example plan output
Plan: 6 to add, 0 to change, 0 to destroy.

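In CI/CD pipelines, save the plan to a file (-out=tfplan) and apply that exact plan. This guarantees that what gets applied is what was reviewed: if the state changes between the plan and apply steps, Terraform rejects the saved plan as stale instead of applying something different.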

State management best practices

State management is where most Terraform problems originate. Follow these practices:

Use a remote backend

Never store state locally or in Git. Use a remote backend with encryption and locking:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

The DynamoDB table provides state locking. This prevents two people or pipelines from modifying the same infrastructure at the same time.

Organize state by component

Don’t put all your infrastructure in one state file. Split by component or team:

environments/
├── prod/
│   ├── networking/   # VPC, subnets, routes
│   ├── compute/      # EC2, ASGs, load balancers
│   ├── database/     # RDS instances
│   └── monitoring/   # CloudWatch, alerts
└── staging/
    ├── networking/
    ├── compute/
    └── database/

Smaller state files mean faster plans, smaller blast radius, and fewer teams competing for locks.

Use `terraform_remote_state` sparingly

The terraform_remote_state data source lets you reference outputs from other state files, but use it carefully. Over-reliance on remote state creates tight coupling between components. Prefer passing values through variables or a parameter store.

Tips for production use

Pin provider versions. Use ~> constraints to allow compatible updates while blocking breaking changes: version = "~> 5.0" accepts any 5.x release but not 6.0. To allow patch updates only, pin the minor version as well, for example "~> 5.0.0".
Use workspaces carefully. Workspaces are useful for simple environment separation but get confusing at scale. Separate directories per environment are usually clearer.
Implement a CI/CD pipeline for Terraform. Run terraform plan on PRs and post the output as a PR comment. Run terraform apply only after merge and approval.
Use prevent_destroy for critical resources. This lifecycle rule stops accidental destruction of databases or persistent storage:
resource "aws_db_instance" "main" {
  # ...
  lifecycle {
    prevent_destroy = true
  }
}
Tag everything. Use a default_tags block in the provider to ensure every resource gets standard tags (environment, team, project).
Use tflint and checkov. Lint your Terraform code and scan for security misconfigurations before applying.
tflint --init
tflint
checkov -d .
Import existing resources. If you have manually created infrastructure, use terraform import to bring it under management instead of recreating it.
Review the plan diff carefully. A resource showing “destroy and recreate” might cause downtime. Understand which changes are in-place versus destructive.
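
Plan review can be partly automated in CI. Terraform can render a saved plan as JSON with terraform show -json tfplan, and each entry under resource_changes carries a change.actions list; a replacement shows up as a delete paired with a create. The sketch below flags those entries. The sample document is a trimmed, hypothetical stand-in for real plan output, and what to do with the flagged list (fail the build, require extra approval) is left as a team decision.

```python
import json


def destructive_changes(plan):
    """Return (address, actions) for every resource the plan would delete or replace."""
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if "delete" in actions:
            flagged.append((change["address"], actions))
    return flagged


# Trimmed, hypothetical sample of `terraform show -json tfplan` output.
sample = json.loads("""
{
  "resource_changes": [
    {"address": "aws_instance.web", "change": {"actions": ["update"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_subnet.public", "change": {"actions": ["no-op"]}}
  ]
}
""")

flagged = destructive_changes(sample)
for address, actions in flagged:
    print(f"DESTRUCTIVE: {address} {actions}")
```

In a pipeline you would load the real JSON from a file instead of the inline sample and exit non-zero when the list is not empty.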

Terraform is one of those tools that rewards discipline. The more consistently you follow these practices, the more confidently your team manages infrastructure at scale.