Our client, a leading financial institution, is seeking a highly skilled and experienced individual to join the team as a Cloud Platform Site Reliability Engineering Lead. This is an exciting opportunity to be at the forefront of driving cloud infrastructure strategy and ensuring the highest levels of reliability and availability.
As the Cloud Platform Site Reliability Engineering (SRE) Lead, you will be responsible for leading a team of highly talented engineers in building, optimizing, and maintaining the cloud platform infrastructure. You will work closely with cross-functional teams, including software development, operations, and security, to ensure that the cloud services are highly reliable, scalable, and secure. Your expertise will be vital in designing and implementing robust monitoring and alerting systems to facilitate proactive detection and resolution of potential issues.
Key Responsibilities:
Leadership: Lead a talented team of cloud platform Site Reliability Engineers in designing, developing, and managing the cloud infrastructure platform.
Strategy Development: Develop and execute the strategic roadmap for our cloud platform, ensuring scalability, reliability, and security.
Cloud Infrastructure Management: Oversee the deployment, maintenance, and troubleshooting of our cloud infrastructure, ensuring maximum availability and optimal performance.
Collaboration: Work with cross-functional teams to ensure seamless integration of our cloud platform with other systems, and provide guidance on best practices for cloud implementations.
Incident Management: Lead the incident response and resolution process, ensuring that critical incidents are addressed promptly and that post-incident reviews are conducted to prevent recurrence.
Automation and Tooling: Drive automation efforts to enhance the operational efficiency of our cloud platform, including creating and maintaining automated processes and tooling.
Performance Optimization: Continuously monitor and analyze system performance metrics, identify areas for improvement, and implement enhancements to optimize resource utilization.
Documentation: Create and maintain comprehensive documentation of system configurations, procedures, and troubleshooting guides to ensure knowledge sharing and facilitate efficient onboarding of new team members.
Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field.
Minimum of 8 years of experience in site reliability engineering or related roles.
Extensive experience in managing cloud infrastructure on major cloud platforms such as AWS, Azure, or GCP.
Strong knowledge of networking concepts and protocols.
Proficiency in scripting and programming languages (Python, Bash, etc.).
Solid understanding of containerization technologies (Docker, Kubernetes, etc.).
Excellent problem-solving and troubleshooting skills, with keen attention to detail.
Experience in the financial industry is a plus.
Strong leadership abilities and experience in managing engineering teams.