Jobs at Q Systems

Site Reliability Engineer, Platform Engineering (NY)

New York, NY
The Platform Engineering team is one of the most mission-critical engineering teams at the firm and is in charge of driving technology innovation across the organization. We design many of the computational and data-oriented platforms in use across different groups and tackle the tough scalability issues. In this role, you’ll be at the center of the team that empowers our business to examine the world and its markets in a way only possible thanks to your work.

Purpose of the role:

The Platform SRE team builds and maintains the core services that our distributed systems are built upon. While you will have exposure to many services, we are looking for engineers to focus on reliability engineering of our core messaging platform based on Apache Kafka. Additionally, you will have a hand in the design and development of several core services offered through a PaaS-like experience – large-scale compute and container runtimes, observability platforms, caching, data-stores, service discovery, secrets, and an integrated development and deployment pipeline.

Key responsibilities:

Our Site Reliability Engineers (SREs) fill the mission-critical role of ensuring that our complex, large-scale systems are healthy, monitored, automated, and designed to scale. You will use your background in software engineering combined with experience as an operations generalist to work closely with our development teams from the early stages of design all the way through identifying and resolving production issues. The ideal candidate will be passionate about applying a software engineering approach to the operations problem-space, involving deep knowledge of our platforms as well as our various use-cases. You are both a generalist, capable of picking up and working with multiple, disparate systems, and an expert, able to dive deep into specific topics and quickly master them.
 
 
  • Serve as a primary point of contact responsible for the overall health, performance, and capacity of our business-facing platforms, e.g. globally distributed Kubernetes as well as common data and streaming platforms.
  • Gain deep knowledge of our complex platforms, business applications, and use-cases
  • Assist in the roll-out and deployment of new platforms or features to facilitate our rapid iteration and continuous improvements
  • Develop tools to improve our ability to rapidly deploy and effectively monitor and maintain custom applications or services in a large-scale Linux environment
  • Work closely with development teams to ensure that platforms are designed with “operability” and “usability” in mind
  • Function well in a fast-paced, rapidly changing environment
  • Participate in a 24x7 rotation for second-tier escalations.

Qualifications:

  • B.S. (M.S. preferred, and Ph.D. a plus) in Computer Science, Engineering, Physics, or Mathematics
  • Developer background with experience in two or more of C++, Go, Python, or Node.js
  • 5+ years in a Linux-based large-scale systems role
  • Experience managing container orchestration platforms such as Kubernetes
  • Experience building self-service APIs and tuning, sharding, and partitioning systems to auto-manage platforms at scale
  • Knowledge of most of these: data structures, relational and non-relational data-stores, networking, Linux internals, file systems, distributed systems, and related topics
  • Experience in containerizing applications and services a plus
  • Experience using AWS or GCP at scale a plus
  • Experience with random fault injection (Chaos Engineering) and building self-healing capabilities into platforms a plus
  • Commits to well-known open-source projects a huge plus
  • Strong interpersonal communication skills and ability to work well in a diverse, team-focused environment with other SREs, SWEs, product managers, etc.