
How do you implement CRD health checks for ArgoCD?

Story

07.11.2022 – Monday – 9:13
Incident… Our Event Bus for Argo workflows is not working…

9:32

We found out that the problem is related to the version of JetStream.
Apparently, ArgoCD does not detect an unhealthy JetStream EventBus. Should we solve this? Yes!

Intro

We use ArgoEvents to trigger our workflows and ArgoCD to deploy our configuration code. ArgoCD also lets you implement health checks that surface possible errors in the deployed YAML definitions. Some resources already ship with a health check in the current version of ArgoCD, but the EventBus CRD doesn’t. The official ArgoCD documentation provides a concise guide for implementing such checks and describes two options:

  • Defining the check yourself in the argocd-cm ConfigMap
  • Contributing it to the open-source repository

Our goal was to test our changes locally and then try to contribute them.

How do you implement a health check?

Health checks can be defined in the argocd-cm ConfigMap in the namespace that contains the ArgoCD installation. If you, like us, use the OpenShift GitOps operator, you’ll have to define them in your ArgoCD custom resource instead, because the operator manages argocd-cm. We’ve come up with this:

```lua
-- Default state until the EventBus reports a "Deployed" condition
hs = { status = "Progressing", message = "Waiting for initialization" }

if obj.status ~= nil then
  if obj.status.conditions ~= nil then
    for _, condition in ipairs(obj.status.conditions) do
      -- Deployment failed: report the resource as degraded
      if condition.type == "Deployed" and condition.status == "False" then
        hs.status = "Degraded"
        hs.message = condition.message or condition.reason
        return hs
      end
      -- Deployment succeeded: report the resource as healthy
      if condition.type == "Deployed" and condition.status == "True" then
        hs.status = "Healthy"
        hs.message = condition.message or condition.reason
        return hs
      end
    end
  end
end

return hs
```

“hs” is the health state that ArgoCD reads, and in our case “obj” is the deployed EventBus. All the script does is map the “Deployed” condition of the EventBus onto that health state, and most health checks in ArgoCD look similar. Finally, we had to put it into the resourceCustomizations key in our custom resource under “argoproj.io/EventBus” at “health.lua”. Our changes were easy to test by deploying an EventBus with a faulty JetStream version.
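If you want to try the logic before touching a cluster, one option is to run it against hand-written tables in a plain Lua interpreter. The following is only a minimal sketch and not part of our setup: the function name eventbus_health and the sample objects, including their reason and message strings, are made up for illustration. In ArgoCD the script receives “obj” as a global instead of a function argument.

```lua
-- Hypothetical local harness: the same check wrapped in a function so it can
-- be called with sample tables instead of a live EventBus.
function eventbus_health(obj)
  local hs = { status = "Progressing", message = "Waiting for initialization" }
  if obj.status ~= nil and obj.status.conditions ~= nil then
    for _, condition in ipairs(obj.status.conditions) do
      if condition.type == "Deployed" then
        if condition.status == "False" then
          hs.status = "Degraded"
          hs.message = condition.message or condition.reason
          return hs
        elseif condition.status == "True" then
          hs.status = "Healthy"
          hs.message = condition.message or condition.reason
          return hs
        end
      end
    end
  end
  return hs
end

-- Illustrative sample objects; reasons and messages are invented.
local samples = {
  { name = "no status yet", expected = "Progressing", obj = {} },
  { name = "deployment failed", expected = "Degraded", obj = {
      status = { conditions = {
        { type = "Deployed", status = "False",
          reason = "ExampleReason", message = "example: unsupported version" },
      } },
  } },
  { name = "deployment succeeded", expected = "Healthy", obj = {
      status = { conditions = {
        { type = "Deployed", status = "True", reason = "ExampleReason" },
      } },
  } },
}

for _, sample in ipairs(samples) do
  local hs = eventbus_health(sample.obj)
  -- Fail loudly if the check returns an unexpected state
  assert(hs.status == sample.expected, sample.name)
  print(sample.name, hs.status, hs.message)
end
```

A harness like this only exercises the branching logic. Whether a real EventBus actually reports the “Deployed” condition in this shape still has to be confirmed against a deployed resource, as we did with the faulty JetStream version.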

How do you contribute a health check?

TL;DR: don’t

But for the realistic explanation: ArgoCD is a mature project that is used by several big players in the industry, so contributing to it involves a fair amount of bureaucracy. We found that out through experience after implementing our health check. ArgoCD defines a set of guidelines to manage contributions to this rather large project.

First, you have to open a discussion for everything that is not a chore task. A chore task is, for example, a dependency bump from Dependabot. In our case, we opened an enhancement proposal that included a summary, our motivation and a possible solution to the underlying problem. We wrote the proposal so that we could follow up with our implementation of the fix. After the enhancement proposal, we continued with our pull request. We had to integrate our code into the existing structure, so our work went into “resource_customizations/argoproj.io/EventBus/”. The repository requires tests too, so we added a degraded and a healthy EventBus definition for the integration tests.

We naively pushed our commits to a fork and created a pull request. This is where we met the DCO. DCO stands for Developer Certificate of Origin, and it is actually quite a simple check: every commit in the PR must include a “Signed-off-by” line in its message (which `git commit -s` adds). Furthermore, ArgoCD only trusts “verified” commits. We struggled a bit, and after three rounds of creating a fork and opening a PR from it, we finally got the green light from the DCO check. After a brief period of excitement, we noticed that the end-to-end tests were failing... after 45 minutes. ArgoCD runs its end-to-end tests against four different Kubernetes versions, and the tests were failing because of a bug that we had not introduced but that was already present in the main branch.

Some time later we came back to the pull request, merged the latest master into it and, bam, all tests were green. We pinged a reviewer, and as we write this, the pull request is in the ArgoCD review process, waiting for a final review before it can be merged.

Conclusion

Implementing a health check for yourself is rather quick and easy. However, many teams use the same CRDs, so contributing the check can save time for many people around the globe. It is a rocky path, though. To be fair, this was our first time contributing to any open-source repository.

There are many things to look out for. But without these guidelines, such projects would have a much harder time selecting appropriate increments and managing their software development cycle. Even though it can cost some nerves, we strongly encourage you to help the tools we all use grow. And at the end of the day, everyone involved has made sure that the increment is as safe and as good as it can be.

written by:
Constantin