The Site Reliability Engineer is responsible for the health of the production environment, the implementation of new and existing components, and the maintenance and modernization of the processes and methods used within our platform. The engineer is expected to work closely with the operations, development, and business teams, lead assigned projects, participate in peer mentoring, and operate an always-on production environment.
ESSENTIAL DUTIES AND RESPONSIBILITIES:
Onboard and optimize microservices using Docker
Streamline the CI/CD process and blue/green deployments
Optimize resource usage to meet KPI targets
Maintain and evolve monitoring and notification systems
Create and maintain documentation on new services, procedures, and requirements
Participate in an on-call schedule established by your manager, and be ready and available while on call to diagnose and resolve incidents immediately
Participate in the diagnosis and resolution of escalated critical incidents
QUALIFICATIONS:
Bachelor’s degree or equivalent work experience
Linux/Unix system administration skills and 5-10 years of operations experience
Strong time and project management skills and attention to detail
Solid experience in the administration and performance tuning of application stacks
Experience with multiple cloud hosting providers, and extensive experience with AWS
Experience with virtualization and containerization (e.g., Docker)
Experience with RabbitMQ, Elasticsearch, and Redis
Experience with monitoring and metrics systems (e.g., Nagios, Grafana)
Experience with configuration management systems (e.g., Ansible, Chef)
Solid scripting skills (e.g., shell scripts, Ruby, Python, Go)
Authorized to work in the United States and able to pass the standard background checks required for compliance