Head of Technology - Cosmos
Infinity
Location: Remote
Employment Type: Full time
Location Type: Remote
Department: Cosmos
Head of Technology, AI-First IT Automation Platform
About the Company
We’re building the next generation of AI-powered IT operations, combining automation, intelligent agents, and human expertise to create an entirely new category of Managed Service Provider.
Our mission is to make every IT function, from endpoint management to incident response, faster, more autonomous, and more intelligent.
We are part of Infinity Constellation, a portfolio of AI-enabled service companies. This is an early-stage, hands-on builder role: you’ll define the core systems, agent frameworks, and infrastructure that power the platform from day one.
The Role
We’re looking for a Head of Technology who thrives in early-stage, fast-moving environments: someone who codes, experiments, and ships intelligent systems that solve real-world operational problems.
This isn’t a “corner-office” leadership role. It’s a hands-on founder-type position where you’ll:
Architect and build the AI and automation backbone of the company
Deploy, evaluate, and monitor ML/LLM systems in production
Partner with product and operations to translate real customer pain points into intelligent, self-healing systems
Build and lead a small technical team (5–10 engineers) as the platform scales over the next 24 months
Key Responsibilities
Architecture & Systems Design
Own the end-to-end technical architecture: service orchestration, observability, and AI/ML pipelines
Select and integrate frameworks for agentic orchestration (e.g. LangChain, LlamaIndex, OpenDevin-style frameworks, or custom alternatives)
Establish standards for model evaluation, context management, and secure data handling
AI/ML & Agent Development
Build, train, and deploy AI agents that automate IT workflows (troubleshooting, monitoring, patching, access management, etc.)
Select and ship modern MLOps tooling and agentic frameworks for deployment and monitoring
Implement human-in-the-loop feedback loops to continuously improve model behavior
Infrastructure & Reliability
Lead development of scalable backend systems (Python, Django/FastAPI, Pydantic, etc.) on AWS with EKS/Helm
Design observability and cost-tracking systems from day one
Ensure reliability, uptime, and security across distributed environments
Team Leadership & Scale
Recruit, mentor, and lead early technical hires across backend, MLOps, and automation domains
Define engineering standards, code review processes, and developer productivity systems
Build the technical culture: curiosity, ownership, and continuous learning
Who You Are
Builder-first: you're happiest when coding, prototyping, or debugging, not just delegating
Systems-minded: you can design for both reliability and intelligence
AI-fluent: you understand how to deploy, monitor, and improve LLMs and agent systems
Pragmatic: you pick tools that work and move fast while managing long-term scalability
Operationally literate: you understand IT operations, DevOps, or infrastructure automation domains
Qualifications
Required
5+ years of software engineering experience, including 2–3 years building or operating ML systems in production
Expertise in Python, plus experience with backend frameworks (Django/FastAPI), MLOps tools (Ray, MLflow, W&B, etc.), and agentic frameworks (LangChain, LangGraph, PydanticAI, CrewAI, AutoGen, etc.)
Experience deploying models or agent systems on cloud infrastructure (AWS preferred; EKS/Helm/Terraform a plus)
Familiarity with LLM orchestration or multi-agent frameworks
Strong foundation in observability, logging, and production reliability
Demonstrated ability to lead and scale small teams (5–10 engineers)
Nice to Have
Experience in IT automation, RMM tools, or endpoint management
Experience fine-tuning or evaluating LLMs (OpenAI, Anthropic, HuggingFace, etc.)
Familiarity with retrieval-augmented generation or evaluation frameworks
Background in Infrastructure-as-Code, SRE, or cloud cost optimization
Previous startup or founder experience