Sep 10

Site Reliability Engineer - (100% remote)

We are looking for a Site Reliability Engineer.

You will be a key member of a tight-knit group of talented engineers who are responsible for keeping our own and our customers’ Kubernetes clusters operational and healthy. You’ll also have a key role in the development of the product itself, working together with our Platform Engineers to deliver the best Kubernetes service possible.

Giant Swarm is a fast-growing open-source infrastructure management platform used by modern enterprises. Our vision is to empower developers around the world to ship great products. We are a diverse, experienced and growing team that has been fully remote since 2014 and is spread across Europe, with headquarters in Cologne.


YOUR JOB



  • You maintain, operate and upgrade our own and our customers’ Kubernetes clusters.

  • You will design, configure, build, and maintain our core infrastructure, from kernel parameters to the cloud provider templates.

  • You understand how servers and systems work and you tweak their behavior to your needs.

  • You will be responsible for our monitoring, logging and alerting.

  • You will help resolve incidents on our own and our customers’ clusters.

  • You participate in the on-call support schedule.

  • You are a go-to person in case our developers need advice regarding infrastructure.

  • You will automate all the things, and the thought of Terraform doesn’t make you cry.

  • We (and the majority of our customers) are currently mostly distributed around Europe (around UTC), so your main time zone should be somewhere between UTC-2 and UTC+2 to make communication easier.


REQUIREMENTS



  • You must have deep, hands-on knowledge of Kubernetes from both the end-user and the operational side.

  • You’re comfortable debugging systems at all levels, from kernel fundamentals right up to workloads running on Kubernetes.

  • You’re happy troubleshooting a wide variety of issues and you’re not afraid to parse thousands of lines of logs in pursuit of an answer.

  • You have good coding skills (preferably Go, but Python or similar is fine as well).

  • You have experience maintaining infrastructure as code and you know the pros and cons of various automation tools (we use Terraform & Ansible, but experience with Chef, Puppet and the like is also a good start).

  • You are fluent with cloud-native tools running on top of Kubernetes (Prometheus, Grafana, ingress controllers, …): you know how to use them and how to configure them.

  • You automate all the things by writing code. Using Bash scripts makes you sad :)


