
Site Reliability Engineer

Databricks
2 days ago
On-site
Costa Rica

GAQ127R40 

Team: IT Infrastructure and Operations

About the Role

At Databricks Information Technology, we are a product-led organization transforming how we work—from the ease of using our IT services to the applications we develop to scale seamlessly during rapid growth.

As a Site Reliability Engineer (SRE), you will bridge the gap between software engineering and systems architecture. You will be a core contributor to the IT Infrastructure team, owning the evolution of core infrastructure and observability platforms. This role requires a strong software engineering mindset and deep technical breadth to deliver high-quality, scalable solutions for "immature" system problems. Your focus will be on building resilient, automated infrastructure that empowers development teams and ensures our cloud environment is cost-optimized, secure, and highly available.

The Impact You Will Have

  • Architect and Automate: Design and deploy production-grade infrastructure on cloud platforms (AWS/Azure) using Infrastructure as Code (IaC) tools like Terraform or Pulumi.
  • Reliability and Performance Engineering: Optimize system performance, architecture, and scaling to ensure maximum uptime and minimal latency for critical IT services.
  • CI/CD Excellence: Architect robust deployment pipelines (e.g., GitHub Actions), managing both hosted and self-hosted runners for specialized build requirements.
  • Observable by Default: Create underlying infrastructure to ensure new internal applications are secure and have logging, metrics and alerts enabled by default.
  • Agentic Tooling: Build internal AI plugins and automation scripts to streamline developer workflows and enhance operational efficiency.
  • Incident Response: Focus on subsequent data usage, incident management workflows, and the dashboards needed to maintain service health. Participate in a shared on-call rotation, leading rapid incident response and technical troubleshooting for production outages. Facilitate blameless post-mortems to identify root causes and implement permanent preventive engineering solutions.
  • Partner Cross-Functionally: Collaborate with Security, Engineering, and Support teams to deliver real business outcomes.

What We Look For

  • Software Engineering Expertise: 5+ years of production-level experience with strong proficiency in Python (non-negotiable).
  • Infrastructure as Code (IaC): Expert-level proficiency in Terraform (modules, state management) or Pulumi.
  • Cloud & Containers: Hands-on experience with AWS, Azure, or GCP, along with Kubernetes, Docker, and containerization concepts.
  • Observability Mindset: Deep understanding of observability pillars (logging, metrics, tracing) and experience with tools such as Datadog, Prometheus, or ELK.
  • Distributed Systems: Proficiency in running distributed systems built on technologies like Kafka or other message queues.
  • CI/CD Proficiency: Advanced knowledge of GitHub Actions and GitHub Runners.
  • Independent Execution: Ability to take ownership of ambiguous projects, follow a vision set by tech leads, and execute independently with minimal guidance.