Module 10 Lesson 4: Designing for Resilience

The 'Anti-Fragile' pipeline. Learn how to write CI/CD scripts that can handle network failures, registry timeouts, and flaky tests without completely stopping your company's delivery engine.

A "Brittle" pipeline is one that fails because the network was slow for one second. A "Resilient" pipeline recovers automatically and alerts you only for REAL bugs.

1. The retry Strategy

Sometimes a job fails for a "Stupid" reason (e.g., a temporary Docker Hub timeout). Don't make the developer manually hit the "Retry" button; let the pipeline retry the job itself.

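build-job:
  script:
    - ./build.sh        # Placeholder: replace with your real build command
  retry:
    max: 2              # Up to 3 attempts total (initial run + 2 retries); 2 is the most GitLab allows
    when:               # Retry only on infrastructure failures, not on code bugs!
      - runner_system_failure
      - stuck_or_timeout_failure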

2. Setting Timeouts

A job that is "Stuck" in a loop can eat up your CI minutes and block your runners for hours.

  • Set a timeout for every job.
  • As a rule of thumb, allow about twice the normal run time: if the job usually takes 5 minutes, set the timeout to 10 minutes.
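
A minimal sketch of the rule above, assuming a hypothetical job named test-job whose script (the placeholder ./run-tests.sh) normally finishes in about 5 minutes. Only the timeout keyword itself comes from GitLab; the job and script names are illustrative:

```yaml
test-job:
  timeout: 10 minutes   # Roughly 2x the usual 5-minute run time
  script:
    - ./run-tests.sh    # Hypothetical test entry point
```

A job-level timeout cannot exceed the maximum configured on the runner, so check the runner's limit if a larger value seems to have no effect.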

Visualizing the Process

graph TD
    Start[Input] --> Process[Processing]
    Process --> Decision{Check}
    Decision -->|Success| End[Complete]
    Decision -->|Retry| Process

3. Idempotent Scripts

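An Idempotent script is one that leaves the system in the same state whether it runs 1 time or 10 times. This matters because a retried job reruns its entire script, including the steps that already succeeded on the first attempt.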

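  • Bad: mkdir myfolder (fails if the folder already exists).
  • Good: mkdir -p myfolder (succeeds whether or not the folder exists).
  • Bad: db-migrate (might break if run twice).
  • Good: db-migrate --only-if-needed (illustrative; use whatever guard your migration tool provides to skip migrations that were already applied).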
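
The same idea applied to a whole job, as a sketch rather than a recipe. The job name and all paths here are hypothetical; the point is that each command produces the same end state on the first run and on a retry:

```yaml
prepare-workspace:
  script:
    # Creates the directory, and succeeds if it already exists.
    - mkdir -p build/reports
    # Removes leftovers from an earlier attempt, and succeeds if there are none.
    - rm -rf build/tmp
    # Creates the symlink, or replaces it if a previous run already made one.
    - ln -sfn reports build/latest
```

Running this job twice in a row leaves the workspace exactly as one run would, which is what makes it safe to combine with retry.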

4. Graceful Failures (allow_failure)

In Module 5, we saw this for experimental tests. Resilient pipelines use allow_failure for Non-Critical services, like:

  • Uploading coverage reports to a 3rd party site.
  • Sending a Slack notification.
  • Running a documentation lint check.

If the Slack API is down, your code should still go to Production!
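
A minimal sketch of a non-critical notification job. The job name notify-slack and the SLACK_WEBHOOK_URL variable are assumptions for illustration (the variable would have to be defined in your project's CI/CD settings); allow_failure is the GitLab keyword discussed above:

```yaml
notify-slack:
  allow_failure: true   # A failed notification must not fail the pipeline
  script:
    # --fail makes curl exit non-zero on an HTTP error, so the job is
    # marked as failed (with a warning) instead of silently passing.
    - >
      curl --fail --silent --show-error
      --header "Content-Type: application/json"
      --data '{"text": "Pipeline finished"}'
      "$SLACK_WEBHOOK_URL"
```

With allow_failure: true, a failure in this job shows up as a warning in the pipeline view while later stages continue to run.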

Exercise: The Resilience Test

  1. Add a retry block to one of your jobs.
  2. Set a 1-minute timeout for a job and then add sleep 90 to the script. Observe the failure.
  3. Rewrite a script that "Writes to a file" to be Idempotent.
  4. Why is it dangerous to use retry: 2 for a job that "Sends an Email to 1,000 customers"?
  5. Research: What is "Exponential Backoff," and does GitLab support it for retries?

Summary

Resilience is built into the Scripts, not just the YAML. By using retries, timeouts, and idempotent commands, you ensure that your automation is a helpful "Assistant" rather than a "Whiny Critic" that fails at the slightest inconvenience.

Next Lesson: The Final Word: Documentation and Handoff.
