Architect, AI Compute, OCI
Oracle
Job Overview
Job Description
About the Role: Architect, AI Compute, OCI at Oracle
At Oracle Cloud Infrastructure (OCI), we are at the forefront of building the world’s largest AI clusters and are unmatched in bringing them to market quickly. As an Architect, AI Compute, you will play a pivotal role in designing a cutting-edge, ultra-high-performance GPU platform specifically engineered to support demanding AI/ML/HPC workloads. This is your chance to profoundly impact the AI revolution, creating scalable systems that allow customers to seamlessly transition from tens to thousands of GPUs without compromising performance.
Our dedicated team is responsible for architecting and developing fundamental changes in GPU delivery, robust health monitoring, efficient triage automation, and comprehensive diagnostic services. These capabilities are crucial for running distributed AI/ML/HPC workloads across vast arrays of GPUs, utilizing advanced technologies such as RoCE and InfiniBand. You will engage with state-of-the-art technologies and contribute significantly to our organization's success.
Responsibilities
- Play a pivotal role in ensuring AI infrastructure continues to meet the rapidly evolving demands of both Enterprise and AI/ML customers.
- Engage with Enterprise and AI/ML customers to understand their specific requirements for uninterrupted workloads and tailor OCI Kubernetes and Slurm solutions accordingly.
- Drive the organization’s goals and technical direction to pursue opportunities that make AI infrastructure more efficient.
- Partner and collaborate with organization leaders to help improve the performance of the team and organization.
- Participate in or lead design reviews with peers and stakeholders to choose among available technologies.
- Drive technical innovation to push the boundaries of performance and reliability, ensuring our customers have a seamless and exceptional experience with their most demanding workloads.
- Evaluate and refine OCI's architectural and operational practices to establish OCI as a leader in supporting demanding AI workloads.
- Cultivate a culture of proactive resilience within engineering teams, ensuring all software systems prioritize scalability, performance, availability, and fast GPU delivery.
- Mentor teams to adopt architectural practices that enable systems to withstand demanding operating environments.
- Understand industry and company-wide trends to help assess and develop new technologies.
Requirements
- BS or MS in Computer Science, Engineering, or related field.
- 12+ years of total experience in software development.
- Proven industry expert in Control Plane, Data Plane, or both.
- Excellent organizational, verbal, and written communication skills.
- Demonstrated aptitude for public speaking and executive presentations.
- Working familiarity with networking protocols (TCP/IP, UDP, HTTP) and standard network architectures.
- Strong technical knowledge in distributed systems, high performance computing, and GPU systems.
- Ability to design, develop, troubleshoot, and debug software programs for databases, applications, tools, networks, etc.
- Demonstrated ability to write great code using Java; experience working with REST APIs.
- Proven experience designing architectures that demonstrate deep technical depth in one area or span many products to enable high availability, scalability, market-leading features, and flexibility to meet future business demands.
- Proven ability to deliver products and experience with the full software development lifecycle.
- Experience working on large-scale, highly distributed service infrastructure.
- Experience working in cloud platform(s) (AWS, OCI, GCP, Azure, etc.).
- Experience in mentoring others to achieve career excellence.
Preferred Qualifications
- Experience in NVIDIA training technologies (CUDA, NCCL).
- Experience in AI model training infrastructure.
Key skills/competencies
- AI Infrastructure
- GPU Platforms
- Distributed Systems
- High Performance Computing (HPC)
- Cloud Architecture
- Network Protocols (RoCE, InfiniBand)
- Kubernetes
- Slurm
- System Scalability
- Java Programming
How to Get Hired at Oracle
- Research Oracle's culture: Study their mission, values, AI vision, and employee experiences on LinkedIn and Glassdoor.
- Tailor your resume: Highlight deep expertise in AI compute, OCI, GPU systems, and distributed architecture.
- Showcase technical depth: Emphasize experience with RoCE, InfiniBand, Kubernetes, Slurm, and cloud platforms.
- Prepare for architectural discussions: Be ready to discuss large-scale system design, performance, and reliability challenges.
- Demonstrate leadership & communication: Practice presenting complex technical concepts and mentoring team members.