SiteOps Global Product Hardware Lead Engineer - GPU
Company: Meta Inc
Location: Des Moines
Posted on: March 20, 2023
Job Description:
Summary:
Meta is seeking a forward-thinking, experienced AI/ML (Artificial
Intelligence/Machine Learning) Product Hardware Platform Lead
Engineer to join the Data Center Site Operations team. The Product
Hardware Platform Engineering (PHE) team is responsible for the
overall performance of Meta's production compute, storage, and
AI/ML platforms through their life-cycles in our data centers. This
role will lead the subset of the PHE team that focuses on AI/ML
platform hardware. AI/ML is an important priority for Meta that
involves complex GPU-based systems operating in shared computing
clusters. The role scope is focused on maintaining and improving
the health of the AI/ML platforms from verification testing into
mass production through end-of-life. Key responsibilities include
identifying systemic hardware, firmware, and tooling issues;
engaging in hands-on problem solving; and collaborating effectively
with cross-functional engineering and tooling teams to improve
performance of the fleet. Our data centers, and the tens of
thousands of servers installed in them, are the foundation upon
which our rapidly scaling infrastructure efficiently operates and
upon which our innovative services are delivered. Meta is at the
leading edge of the global data center industry in terms of both
how data centers are designed and how they are operated. This person should
enjoy working in a fast-paced environment where adaptability and
flexibility will be key to their success. We seek an individual who
can quickly absorb and understand the technical challenges of
subject matter experts and local site operations teams, create
alignment between these globally distributed teams as well as
partner organizations, and set informed priorities and
direction while getting buy-in and commitment from relevant
stakeholders.
Required Skills:
Responsibilities:
Lead other AI/ML PHE team members through efforts that provide
end-to-end lifecycle ownership (verification test through end of
life decommissioning) of AI/ML hardware platforms and associated
new technologies in the data centers
Serve as the central point of contact representing the AI/ML
hardware platforms and associated new technologies across SiteOps,
and be the subject matter expert on hardware platform issues for
data center operations teams
Drive complex AI/ML technical investigations globally and spanning
multiple disciplines such as Hardware, Software/Firmware,
Networking and Power & Cooling
Work closely with other PHE team members to share best practices
and ensure appropriate feedback is given to cross-functional
teams
Issue timely alerts and support fixes to operations teams, and
assure a robust feedback pipeline to engineering teams
Provide serviceability feedback on AI/ML production hardware to
engineering design teams
Provide technical mentorship on large scale data center projects
and initiatives to global, cross-functional teams
Build strong relationships and collaboration with engineering and
cross-functional teams across the company. Actively solicit
feedback from teams, and use that feedback to improve operational
effectiveness as infrastructure scales
Own the cross-functional communication with other technical
operations groups to help resolve incidents
Collaborate with stakeholders, functional owners and subject matter
experts to interpret and articulate business and operations
needs
Ability to travel up to 30% required
Minimum Qualifications:
Experience managing multiple concurrent projects and managing
competitive timelines
10+ years of experience in hardware development and/or validation,
working with cross-functional teams to deliver products to
production
BS or BA in technical field or commensurate experience
Effective technical drafting skills, experience creating
documentation for users of all levels
Experience in processing and analyzing large sets of data
Experience triaging and debugging hardware platforms
Knowledge of server and storage platforms, principles,
technologies, protocols, and standards
Experience working with Linux or Unix operating systems
Experience working independently within a multi-disciplinary team
of hardware and operations engineers
Experience working across a diverse global organization and
building partnerships with cross-functional teams inside and
outside of the organization
Preferred Qualifications:
Experience with GPU-based platform hardware that operates in AI/ML
computing clusters
Large-scale data center environment experience, including hardware
deployments, deep system knowledge of Linux, Server Hardware,
networking, network protocols, supply chain and Data Center
automation
Leadership presence and presentation skills
Experience in data center system and process automation
Bash, PHP, Python, or Perl scripting experience
Public Compensation:
$163,000/year to $223,000/year + bonus + equity + benefits
Industry: Internet
Equal Opportunity:
Meta is proud to be an Equal Employment Opportunity and Affirmative
Action employer. We do not discriminate based upon race, religion,
color, national origin, sex (including pregnancy, childbirth,
reproductive health decisions, or related medical conditions),
sexual orientation, gender identity, gender expression, age, status
as a protected veteran, status as an individual with a disability,
genetic information, political views or activity, or other
applicable legally protected characteristics. You may view our
Equal Employment Opportunity notice here. We also consider
qualified applicants with criminal histories, consistent with
applicable federal, state and local law. We may use your
information to maintain the safety and security of Meta, its
employees, and others as required or permitted by law. You may view
Meta's Pay Transparency Policy, Equal Employment Opportunity is the
Law notice, and Notice to Applicants for Employment and Employees
by clicking on their corresponding links. Additionally, Meta
participates in the E-Verify program in certain locations, as
required by law.