I recently had to add some EventBridge rules for detecting and responding to ECS task failures. I spent a bit of time trying to find examples online, and ended up coming up with a couple of my own. I’m posting them here in the hope of saving others the trouble.
TL;DR: See the EventBridge rule below.
Stopped Task Error Codes
If there is an issue running a container, ECS sets the stoppedReason field to one of a few possible values. The full list is here; however, most contain the terms “Error” or “Failed”. So, we can use a wildcard pattern to match these failures. Wildcard matching is case-sensitive, which is why the rule lists both “Error” and “error”.
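As a rough local check of which stoppedReason strings those wildcards would catch, here is a minimal Python sketch. It only approximates EventBridge's wildcard behaviour (a `*` matches any run of characters, case-sensitively). The helper names are hypothetical, and the sample reasons are illustrative: the text after each colon is made up.

```python
import re

# Wildcards used for stoppedReason in the rule from this post.
WILDCARDS = ["*Error*", "*error*", "*Failed*"]


def wildcard_matches(pattern: str, value: str) -> bool:
    """Approximate an EventBridge wildcard: '*' matches any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def reason_matches(stopped_reason: str) -> bool:
    """True if any of the wildcards matches the given stoppedReason."""
    return any(wildcard_matches(p, stopped_reason) for p in WILDCARDS)


# Illustrative values only, not an exhaustive list.
samples = [
    "CannotPullContainerError: pull access denied",
    "ResourceInitializationError: unable to pull secrets",
    "TaskFailedToStart: could not start",
    "Essential container in task exited",
]
for reason in samples:
    print(reason_matches(reason), reason)
```

The last sample does not match any wildcard, which is why the exit code check in the next section is needed as well.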
Detecting Exit Codes
Like most programs, the containers in an ECS task return exit codes indicating their status. There are some common exit codes that might be returned; see the “Common exit codes” section here.
However, all we need to know is whether the exit code is non-zero, which indicates a failure. We can use the EventBridge anything-but pattern on the exit code to match only failed tasks.
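To make the anything-but check concrete, here is a small Python sketch of the same logic applied to the `detail` part of an event. It is an approximation under two assumptions: that a pattern on a field inside an array matches when any element satisfies it, and that a container with no exit code at all does not match. The function names and the sample event are hypothetical.

```python
def anything_but(excluded: list, value) -> bool:
    """Approximate anything-but: the field must be present and not in the excluded list."""
    return value is not None and value not in excluded


def any_container_failed(detail: dict) -> bool:
    """True if at least one container reported a non-zero exit code."""
    return any(
        anything_but([0], container.get("exitCode"))
        for container in detail.get("containers", [])
    )


# Hypothetical, heavily trimmed "detail" section of a task state change event.
detail = {
    "lastStatus": "STOPPED",
    "containers": [
        {"name": "app", "exitCode": 1},
        {"name": "sidecar", "exitCode": 0},
    ],
}
print(any_container_failed(detail))
```

This covers only the exit code part; the rule also requires a specific stoppedReason in the same branch.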
EventBridge Rule
Combining both scenarios with the $or operator gives a rule that should detect most of the possible failure scenarios for an ECS task.
{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "lastStatus": ["STOPPED"],
    "$or": [
      {
        "stoppedReason": [
          { "wildcard": "*Error*" },
          { "wildcard": "*error*" },
          { "wildcard": "*Failed*" }
        ]
      },
      {
        "containers": {
          "exitCode": [{ "anything-but": [0] }]
        },
        "stoppedReason": ["Essential container in task exited"]
      }
    ]
  }
}
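If you would rather create the rule from code than paste it into the console, a minimal boto3 sketch could look like the following. The rule name `ecs-task-failures` is a placeholder, and the sketch assumes the default event bus and that AWS credentials are already configured. It only creates the rule; a target (for example an SNS topic or Lambda function) still has to be attached separately.

```python
import json

# The event pattern from this post, expressed as a Python dict.
EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "lastStatus": ["STOPPED"],
        "$or": [
            {
                "stoppedReason": [
                    {"wildcard": "*Error*"},
                    {"wildcard": "*error*"},
                    {"wildcard": "*Failed*"},
                ]
            },
            {
                "containers": {"exitCode": [{"anything-but": [0]}]},
                "stoppedReason": ["Essential container in task exited"],
            },
        ],
    },
}


def create_rule(rule_name: str = "ecs-task-failures") -> str:
    """Create or update the rule on the default event bus and return its ARN."""
    import boto3  # imported here so the pattern can be inspected without boto3

    events = boto3.client("events")
    response = events.put_rule(
        Name=rule_name,  # placeholder name, pick your own
        EventPattern=json.dumps(EVENT_PATTERN),
        State="ENABLED",
    )
    return response["RuleArn"]


# Usage (requires AWS credentials): print(create_rule())
```

Before deploying, boto3's `test_event_pattern` call can be used to check a sample event against the pattern without creating anything.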