Site Reliability Engineer

Location:

Remote, Ukraine | Anywhere, Worldwide

About the company:

Overcode is a full-cycle custom software development company that provides software design and development services to startups and established businesses. Our clients have successfully partnered with well-known brands such as LinkedIn, Twitch, and SAP, and some of them have been acquired by big tech companies such as New Relic.

We are dedicated to providing an elevated standard of service in the following areas:

• Software Development
• Product Engineering
• Web App Development
• Product Design (UX/UI)

About the project:

This is a project for a partner of Overcode: an innovative AI cloud database platform that helps large companies and corporations reduce project infrastructure costs. Project stack: Kubernetes, AWS/GCP, Grafana, Prometheus, Python.

Overview:

We are looking for a dedicated and proactive Site Reliability Engineer (SRE) specializing in Kubernetes to join our technology team. This pivotal role focuses on improving the reliability and performance of our Kubernetes-deployed software solutions. The ideal candidate will have a strong background in deploying code to Kubernetes environments, orchestrating and maintaining Kubernetes clusters with advanced infrastructure tools, and managing operational incidents to ensure system resilience and efficiency.

Responsibilities:

- Deploy containerized applications to Kubernetes using a mostly automated system, ensuring reliable and scalable application performance.
- Continuously monitor system performance and reliability to proactively identify and resolve issues, aiming to surpass our Service Level Agreements (SLAs).
- Respond to system alerts, accurately diagnose issues, escalate as necessary, and execute predefined scripts to mitigate operational challenges.
- Thoroughly document system anomalies and initiate JIRA tickets to enable the engineering team to address and resolve complex or recurring issues.
- Use monitoring tools such as Grafana to set up alerts and to track and troubleshoot system performance.

Qualifications:

- Demonstrated experience with Kubernetes in a production environment, covering both application deployment and cluster management.
- Familiarity with at least one major cloud service (AWS, GCP, Azure) and its Kubernetes management solutions.
- Proficiency in using Grafana and Prometheus for monitoring system performance, setting up alerts, and conducting troubleshooting.
- Ability to execute predefined scripts effectively, with an understanding of their functionality and potential impacts.
- Exceptional problem-solving abilities with a strong aptitude for documenting and clearly communicating issues for efficient resolution.
- Proficiency in using PagerDuty, JIRA, or similar issue-tracking tools for managing operational incidents.

Ideal Skills:

- Advanced expertise in Kubernetes administration.
- Practical experience with database management and SQL, enhancing data handling and optimization within Kubernetes environments.

Benefits:

• Full employment in the team of overcoders
• Reimbursement of taxes, holidays, and sick leave
• Remote work

Are you interested?
Please feel free to contact us if you need any further information.