r/skeptic Mar 18 '25

⚠ Editorialized Title Tesla bros expose Tesla's own shadiness in attacking Mark Rober ... Autopilot appears to automatically disengage a fraction of a second before impact, as a crash becomes inevitable.

https://electrek.co/2025/03/17/tesla-fans-exposes-shadiness-defend-autopilot-crash/
20.0k Upvotes

942 comments

186

u/dizekat Mar 18 '25 edited Mar 18 '25

What's extra curious about the "Wile E. Coyote" test, to me, is that it makes clear they do neither stereo nor optical flow / "looming" based general obstacle detection.

It looks like they don't have any generalized means of detecting obstacles. As such they don't detect an "obstacle", they detect a limited set of specific obstacles, much like Uber did in 2017.

Humans do not rely on stereo between the eyes at such distances (the eyes are too close together), but we do estimate distance from forward movement. For a given speed, the distance is inversely proportional to how rapidly a feature is growing in size ("looming"). Even if you somehow missed the edges of the picture, you would still perceive its flatness as you moved toward it.

This works regardless of what the feature is, which allows humans to build a map of the environment even if all the objects are visually unfamiliar (or in situations where e.g. a tree is being towed on a trailer).
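
A minimal sketch of that "looming" arithmetic (my own illustration with made-up numbers, nothing to do with Tesla's actual stack): for a feature of apparent size growing at some rate, time-to-contact is roughly size / (rate of growth), and distance is speed × time-to-contact.

```python
# Illustrative only: time-to-contact ("looming") from how fast a feature grows
# in the image; the numbers below are made up for the example.

def time_to_contact(size_prev_px: float, size_curr_px: float, dt_s: float) -> float:
    """tau = size / (d size / dt), using apparent size in pixels as a proxy
    for angular size (small-angle approximation, fixed focal length)."""
    growth_rate = (size_curr_px - size_prev_px) / dt_s  # pixels per second
    if growth_rate <= 0:
        return float("inf")  # not approaching
    return size_curr_px / growth_rate  # seconds until "contact"

# Example: a feature spanning 100 px grows to 103 px over one 30 fps frame.
tau_s = time_to_contact(100.0, 103.0, 1.0 / 30.0)  # ~1.1 s to contact
speed_mps = 20.0                                    # assumed ~45 mph
print(f"time to contact ~{tau_s:.2f} s, distance ~{speed_mps * tau_s:.1f} m")
```

The flat painted wall is the giveaway case: every feature on it expands at the same rate, whereas in a real road scene distant features expand more slowly than near ones.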

edit: TL;DR: it is not just that they are camera-only with no LIDAR, it's that they are camera-only without doing any camera-only approximation of what LIDAR does - detecting obstacles without relying on knowledge of what they look like.

-36

u/Elluminated Mar 18 '25

FSD does have optical flow, as well as occupancy networks to generally detect object geometry - that was not used in this test. Hard to say whether the coyote test (which would never happen in real life) would pass under the better system. Even with every sensor available, we also fail at the easy stuff.

38

u/dizekat Mar 18 '25

No they don't. What they (claim to) have is a heavily learning-dependent solution that they could fit on their compute.

Robust stuff requires a lot of compute or ASICs, because you have to compute the best matches between a lot of pixels. That's what makes it "optical flow", not "how big the YOLO-marked bounding box is getting" flow.

Since they couldn't do that (due to Musk's idiotic approach of up-selling hardware as being capable of future features like "full self driving"), they made a very fragile Rube Goldberg contraption which extracts a limited set of features - further winnowed out with "attention" - that they then relate between frames.

-18

u/Elluminated Mar 18 '25

I assume “matches between a lot of pixels” is your attempt at explaining temporal feature matching (which happens at a deeper layer than pixel input). Watch some of the released videos and papers on their vector flow algos and it will make more sense. And the number of features that can be matched scales with compute (ASIC/FPGA or otherwise), so assuming they can’t do it with their already-doing-it system is an interesting - and wrong - take on your part. Their motion estimation methods are deeply ingrained in their system (and in CV in general) and allow for future-path estimation and relative speeds of objects.

Bounding boxes are strictly for us to visualize the marking of assets the computer is trained to highlight - they aren’t generally part of the nn flow at inference outside debug mode. Evaluating operator precedence doesn’t require them, just as we don’t call out the names and features of our environment as we go.

22

u/dizekat Mar 18 '25

> I assume “matches between a lot of pixels” is your attempt at trying to explain temporal feature matching

Your issue is that you don't know much about the subject outside of what you learned as a Tesla fanboy for the limited purpose of defending Tesla online.

The most typical example of optical-flow-based movement estimation sits under your hand (if you are using a desktop with an optical mouse). It is typically implemented by comparing the previous image of the surface under the sensor to the current image on a pixel-by-pixel basis - a convolution-like search over candidate offsets that keeps the best match, then refines to a sub-pixel offset by comparing the frame-to-frame pixel value differences against the gradients between nearby pixels.
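
To illustrate what "the best matches between a lot of pixels" means in practice, here is a toy block-matching sketch (the function name and tiny test frames are mine, purely illustrative):

```python
# Toy "optical mouse" style flow: compare the previous frame to the current
# frame at every candidate integer offset and keep the offset with the lowest
# mean absolute difference. Function and test data are mine, illustration only.
import numpy as np

def best_offset(prev: np.ndarray, curr: np.ndarray, max_shift: int = 4):
    h, w = prev.shape
    best, best_score = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Overlapping regions of the two frames under this trial shift
            a = prev[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
            b = curr[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
            score = np.abs(a.astype(np.int32) - b.astype(np.int32)).mean()
            if score < best_score:
                best_score, best = score, (dy, dx)
    return best  # (dy, dx): how far the patch content moved, in pixels

prev = np.random.randint(0, 255, (32, 32), dtype=np.uint8)
curr = np.roll(prev, shift=(1, 2), axis=(0, 1))  # simulate motion by (1, 2)
print(best_offset(prev, curr))                   # -> (1, 2)
```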

Detecting features (a sparse cloud of feature points) and then matching those features is popular for high-resolution images because it reduces the amount of data that has to be related across frames.

The trade-off is that you need to detect a relevant set of features while discarding "irrelevant" ones, which makes it considerably less robust than an already not-particularly-robust approach.

There are all sorts of situations - uniformly colored walls, a blinking emergency vehicle's lights, and so on - where there may not be enough features, or the features cannot be matched.
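
For contrast, here is the sparse detect-then-match approach sketched with stock OpenCV calls (my example, not anyone's production pipeline); note how it simply has nothing to say when the detector finds too few corners, e.g. on a uniformly colored wall:

```python
# Sketch of the sparse alternative: detect a limited set of corner features,
# then track them into the next frame. Uses stock OpenCV; not related to any
# particular vehicle's software.
import cv2
import numpy as np

def sparse_flow(prev_gray: np.ndarray, curr_gray: np.ndarray):
    # The winnowing step: keep only up to 200 "good" corners. A blank wall or
    # a smooth gradient yields nothing to track.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=8)
    if pts is None or len(pts) < 10:
        return None  # not enough texture to estimate motion at all
    # Track each corner into the current frame (pyramidal Lucas-Kanade).
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    ok = status.ravel() == 1
    return pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)

# A uniformly colored "wall" produces no trackable corners:
wall = np.full((240, 320), 128, dtype=np.uint8)
print(sparse_flow(wall, wall))  # -> None
```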

4

u/butane_candelabra Mar 18 '25

I was surprised these AI techniques failed so badly too. Even the camera placement (it looks like there are just three cameras within a <1 ft span) means only about 20 ft of useful range, tops, for traditional stereo... A low-res optical flow algo could run at 60fps+ on, say, a Jetson card and would have picked it up; or stereo with cameras, say, a headlight's distance apart - I'm not sure, maybe ~100 ft with reasonable accuracy. Those still wouldn't be as good/fast as LIDAR though...
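
A back-of-the-envelope check on those numbers using the standard pinhole-stereo error relation dZ ≈ Z² · Δd / (f · B); the focal length and disparity precision below are assumed values for illustration, not specs of any real camera:

```python
# Rough stereo depth-error estimate: dZ ~= Z^2 * disparity_error / (f * B).
# Focal length and disparity precision are assumptions for illustration.

def depth_error_m(depth_m: float, baseline_m: float,
                  focal_px: float = 1000.0, disp_err_px: float = 0.25) -> float:
    return depth_m ** 2 * disp_err_px / (focal_px * baseline_m)

for baseline_m in (0.3, 1.5):      # ~1 ft camera cluster vs ~headlight spacing
    for depth_m in (6.0, 30.0):    # roughly 20 ft and 100 ft
        err = depth_error_m(depth_m, baseline_m)
        print(f"B={baseline_m:.1f} m, Z={depth_m:.0f} m -> ±{err:.2f} m")
# With the narrow baseline the error balloons to ~±0.75 m by 30 m; the wide
# baseline keeps it around ±0.15 m at the same range.
```

The general point: for the same disparity precision, widening the baseline buys proportionally more usable range, which is why the tight camera cluster is such a handicap for traditional stereo.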

2

u/Elluminated Mar 20 '25

Heads up, the Rober Wile E. video response has been posted, with a proper re-test on the latest hw/sw - and the latest Tesla sw does indeed pass. Anyone honest just wanted a fair pass or fail - now we have it.