In [29]: # computer_vision.ipynb
Abstract
"Computer vision engineer" is an umbrella term spanning multiple sub-fields with materially different compensation bands. At the senior L5 level, vision-language model engineers earn total compensation of $420,000 to $720,000; autonomous-vehicle perception engineers earn $380,000 to $620,000; and classical CV engineers earn $230,000 to $380,000, materially below the modern foundation-model-adjacent sub-fields. The dispersion within CV is wider than within most ML specialisations because the field spans both modern multimodal foundation-model work and mature classical methods [1].
[1] Bands from Levels.fyi ML Engineer track, Tesla, Waymo, May 2026.
table cv-1 : L5 senior bands
| Sub-field | L5 base | L5 total comp | Notes |
|---|---|---|---|
| Vision-language models (CLIP, SigLIP, VLMs) | $240k - $320k | $420k - $720k | Modern multimodal foundation models; frontier-lab demand high |
| Autonomous vehicle perception | $220k - $290k | $380k - $620k | Tesla, Waymo, Wayve, Zoox; large CV organisations, RSU upside variance |
| Diffusion / image generation | $220k - $310k | $400k - $680k | Stable Diffusion lineage; concentrated at a few labs |
| Robotics CV | $200k - $270k | $320k - $520k | Manipulation, grasping, scene understanding; growing demand from humanoid robotics |
| Medical imaging ML | $180k - $240k | $250k - $410k | Hospital systems, biotech, regulated; lower comp but mission-driven |
| Classical CV (object detection, segmentation, OCR) | $170k - $230k | $230k - $380k | Mature field, declining premium as VLMs commoditise applications |
| 3D reconstruction / NeRF / Gaussian Splatting | $200k - $280k | $320k - $560k | Growing field, AR/VR adjacent, Apple and Meta hiring |
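The dispersion across these bands can be made concrete with a little arithmetic. The sketch below copies the figures from table cv-1 and ranks sub-fields by the midpoint of their total-compensation band. Using the midpoint as a representative value is an assumption made here for illustration (the source gives only ranges), and the helper names `midpoint` and `summarise` are hypothetical.

```python
# Senior L5 bands copied from table cv-1, in thousands of USD: (low, high).
BANDS = {
    "Vision-language models": {"base": (240, 320), "total": (420, 720)},
    "Autonomous vehicle perception": {"base": (220, 290), "total": (380, 620)},
    "Diffusion / image generation": {"base": (220, 310), "total": (400, 680)},
    "Robotics CV": {"base": (200, 270), "total": (320, 520)},
    "Medical imaging ML": {"base": (180, 240), "total": (250, 410)},
    "Classical CV": {"base": (170, 230), "total": (230, 380)},
    "3D reconstruction / NeRF / Gaussian Splatting": {"base": (200, 280), "total": (320, 560)},
}


def midpoint(band):
    """Midpoint of a (low, high) band; an illustrative stand-in, not a median."""
    low, high = band
    return (low + high) / 2


def summarise(bands):
    """Return one row per sub-field, sorted by total-comp midpoint, highest first."""
    rows = []
    for name, band in bands.items():
        base_mid = midpoint(band["base"])
        total_mid = midpoint(band["total"])
        rows.append({
            "sub_field": name,
            "base_mid": base_mid,
            "total_mid": total_mid,
            # Share of the midpoint total that is not base salary (equity, bonus).
            "non_base_share": (total_mid - base_mid) / total_mid,
        })
    return sorted(rows, key=lambda row: row["total_mid"], reverse=True)


rows = summarise(BANDS)
# Ratio of the highest to the lowest total-comp midpoint across sub-fields.
spread = rows[0]["total_mid"] / rows[-1]["total_mid"]

for row in rows:
    print(f'{row["sub_field"]:<46} ${row["total_mid"]:.0f}k  '
          f'non-base share {row["non_base_share"]:.0%}')
print(f"Top-to-bottom midpoint ratio: {spread:.2f}x")
```

On these midpoints the VLM band sits a little under twice the classical CV band, which is one way to read the claim that dispersion within CV is wide.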
section cv-2 : compensation gradient
The computer vision field has undergone a structural shift since the late 2010s. Handcrafted-feature methods were displaced by deep-learning architectures starting with AlexNet (2012) and accelerating through ResNet (2015), Mask R-CNN (2017), and Vision Transformers (2020). The single-modality CNN-based object detection and segmentation models and OCR pipelines from that period are now themselves grouped under classical CV. The post-2022 shift to vision-language models (CLIP, SigLIP, GPT-4V, LLaVA, Gemini multimodal) represents another step-change: many tasks that previously required specialised single-modality models are now better served by general-purpose VLMs with prompt-based adaptation.
The compensation gradient reflects this shift. Engineers whose experience is concentrated in pre-2020 classical CV methods face slowly declining marginal demand and a corresponding compensation plateau. Engineers with hands-on experience in modern VLM work, diffusion-model image generation, or multimodal foundation models earn compensation comparable to LLM engineer ranges. Autonomous-vehicle CV sits in between: the field demands real-time performance, multi-sensor fusion (LiDAR, radar, camera), and safety-critical engineering that pure VLM work does not address. Compensation is strong, but the employer pool is small and concentrated.
The 3D reconstruction sub-field (Neural Radiance Fields, Gaussian Splatting, photogrammetry-ML hybrids) has grown rapidly since 2022 with applications in AR / VR (Apple Vision Pro, Meta Quest), robotics scene understanding, and entertainment-tech. Compensation is competitive but the employer pool is narrower than VLM or AV. Apple's investment in on-device 3D reconstruction for Vision Pro and Meta's investment in Reality Labs have created strong demand at these specific employers; broader market demand is still developing.
The medical imaging sub-field continues to operate at lower compensation bands because the dominant employers (hospital systems, medical device manufacturers, biotech companies) structurally pay less than Big Tech or frontier labs. The work is technically interesting (3D imaging modalities, regulatory-grade validation, clinical deployment workflows) and impactful, but the compensation ceiling is bounded by the broader healthcare-IT economics rather than by the AI-investment cycle.
section cv-3 : Tesla, Waymo, others
Autonomous vehicle CV is a distinct compensation cluster within the broader CV field. The major employers are Tesla (Full Self-Driving and Optimus humanoid), Waymo (Alphabet subsidiary, US robotaxi deployment), Wayve (UK-based, end-to-end learning approach with recent US expansion), and smaller players including Zoox (Amazon subsidiary, restructured 2024), Cruise (GM subsidiary, scaled back 2023-2024), and various trucking-AV startups (Aurora, Kodiak, Plus). The cluster's employer count is small but compensation is structurally high because the work is mission-critical, deeply technical, and competes with frontier-lab and FAANG offers for the same senior talent.
Tesla AV compensation has been volatile: senior L5-equivalent base salary is $200,000 to $250,000, with RSU grants whose realised value has varied with Tesla's stock price. The work culture at Tesla is intense and demanding; compensation reflects this with higher base bands than is typical for non-FAANG public companies, but with implicit work-hours expectations above industry norms.
Waymo compensation is stable and Google-equivalent: senior L5 base of $215,000 to $260,000 with total compensation of $320,000 to $470,000. The work culture is closer to Google's mature engineering culture than to Tesla's. For senior AV CV engineers comparing Tesla and Waymo offers, the choice is typically between Tesla's higher potential upside (with higher variance and culture intensity) and Waymo's stability (with lower potential upside).
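The upside-versus-stability trade-off comes down to how much of an offer's value moves with the stock price. The sketch below is a deliberately simple model, not either company's actual grant mechanics: it revalues one year of vesting RSUs at a ratio of vest-date price to grant-date price. The two base figures are midpoints of the base ranges quoted above; the grant value of $150k per year and every price ratio are hypothetical inputs chosen only to show the shape of the comparison.

```python
def realised_total(base, annual_grant_value, price_ratio):
    """Base plus one year of vesting RSUs, revalued at vest price / grant price."""
    return base + annual_grant_value * price_ratio


def outcome_range(base, annual_grant_value, price_ratios):
    """Lowest and highest realised total across a set of price scenarios."""
    totals = [realised_total(base, annual_grant_value, r) for r in price_ratios]
    return min(totals), max(totals)


# All figures in thousands of USD.
# Bases: midpoints of the quoted ranges ($200k-$250k and $215k-$260k).
# Grant value and price ratios: HYPOTHETICAL, for illustration only.
volatile_offer = outcome_range(base=225.0, annual_grant_value=150.0,
                               price_ratios=[0.5, 1.0, 2.0])
stable_offer = outcome_range(base=237.5, annual_grant_value=150.0,
                             price_ratios=[0.75, 1.0, 1.25])

print("Volatile-stock offer range:", volatile_offer)
print("Stable-stock offer range:  ", stable_offer)
```

With identical grant sizes, the wider set of price scenarios produces both a lower floor and a higher ceiling. In practice grant sizes, vesting schedules, and refreshers differ between employers, so real offers need their own inputs.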
section cv-4 : common questions
What is the average computer vision engineer salary in 2026?
Computer vision engineer salary varies substantially by sub-field. Senior L5 vision-language model engineers earn base salary $240,000 to $320,000 with total compensation $420,000 to $720,000. Autonomous vehicle perception engineers earn $220,000 to $290,000 base with total compensation $380,000 to $620,000. Classical CV (object detection, segmentation) engineers earn $170,000 to $230,000 base with total compensation $230,000 to $380,000, materially below the modern foundation-model adjacent sub-fields. The dispersion within computer vision is wider than within most ML specialisations because the field spans both modern foundation-model work and mature classical methods.
Does autonomous vehicle work pay better than other CV roles?
At the senior L5 level it is comparable to the top of the field but not the highest: AV bands sit slightly below VLM and diffusion bands and above robotics, 3D reconstruction, medical imaging, and classical CV. AV perception engineers at major employers (Tesla, Waymo, Wayve, and Zoox, an Amazon subsidiary) earn $220,000 to $290,000 base with total compensation of $380,000 to $620,000. The variance is large: Tesla offers carry significant RSU risk and upside; Waymo (Alphabet subsidiary) offers stable Google-equivalent compensation; smaller AV startups offer base-heavy compensation with pre-IPO equity grants of uncertain realisation value. AV CV work is technically demanding (real-time, multi-sensor, safety-critical) and the labour pool is narrow, but the employer count is small and concentrated.
Why do vision-language model engineers earn more than classical CV engineers?
Vision-language models (VLMs) are at the centre of the post-2022 multimodal foundation-model wave, with major investment from frontier AI labs and hyperscalers. The labour pool of engineers with hands-on VLM training experience is narrow, while demand is broad (every major frontier lab now has VLM training capability). Classical CV (single-modality image classification, object detection, segmentation) is a mature field with well-understood methods and broad labour supply, so the equilibrium specialisation premium is smaller. The split favours engineers who can demonstrate hands-on experience with modern multimodal architectures (CLIP, SigLIP, LLaVA, GPT-4V class models) over engineers whose experience is concentrated in pre-2020 classical CV methods.
Is medical imaging ML a good career path?
Yes for engineers prioritising domain depth and mission-aligned work over compensation maximisation. Medical imaging ML compensation is materially below frontier-lab or AV ranges (senior L5 typically $180,000 to $240,000 base, $250,000 to $410,000 total) because the dominant employers are hospital systems, biotech companies, and medical device manufacturers with structurally lower compensation budgets than Big Tech or frontier labs. The work is technically interesting (3D imaging modalities, regulatory-grade model validation, clinical deployment) and the impact is direct. For ML engineers seeking maximum total compensation, medical imaging is not the best path; for engineers seeking sustained domain depth with clear social impact, it is competitive.
How is robotics CV compensation different from AV CV?
Robotics CV (manipulation, grasping, scene understanding for physical robots) has grown rapidly since 2023 with the humanoid robotics investment wave (Figure, 1X, Tesla Optimus, Apptronik, Sanctuary AI). Compensation at the senior L5 level ($200,000 to $270,000 base, $320,000 to $520,000 total) sits somewhat below the AV CV bands, with substantial variance: the humanoid robotics startups offer pre-IPO equity with uncertain realisation value. Robotics CV at established research labs (DeepMind robotics, Google Robotics, NVIDIA robotics) pays stable Google-equivalent compensation. The technical work emphasises real-world physics, sample efficiency, and sim-to-real transfer, which is meaningfully different from AV CV's perception-focused emphasis.
Where do computer vision engineer jobs concentrate?
The Bay Area dominates VLM and frontier-lab CV. Seattle hosts Amazon Robotics, Microsoft Mixed Reality, and large AWS computer vision teams. Pittsburgh anchors AV work (Carnegie Mellon spinouts: Argo AI before its shutdown, Aurora, Locomation) plus robotics. Los Angeles supports a smaller AV cluster (Waymo LA, plus aerospace and defence CV). Phoenix is Waymo's primary deployment market. Internationally, Toronto and Tel Aviv host strong CV ecosystems. The US distribution is more geographically dispersed than LLM engineering, which concentrates much more heavily in the Bay Area.
Should a classical CV engineer transition to VLM work?
Yes for compensation, provided the effort is estimated realistically. The transition requires building hands-on experience with vision-language architectures (start with CLIP and SigLIP foundations, then LLaVA-class architectures, then the current VLM frontier), running multimodal training experiments (even at small scale on open-source models), and contributing to open-source VLM tooling. The path is similar to the transition from classical NLP to LLM work: 12 to 18 months of dedicated work for an experienced classical-CV engineer. A realistic first role after the transition is applied VLM engineering at a unicorn or hyperscaler; frontier-lab VLM research roles typically require a longer demonstrated track record.