Determine root cause of automated latency outliers
Issue description

On rare occasions, the automated motion-to-photon latency test can produce clearly erroneous results: either single-digit latency or ludicrously high latency (e.g. 90,086 ± 270,000 ms). https://chromium-review.googlesource.com/c/614364 adds a workaround to throw out results like this and retry the iteration, but the root cause should be found and fixed. There are three possible places where this could be happening:

1. Everything is running correctly, but we're incorrectly parsing the Motopho script output during the test. This would simply require fixing the bug in our parsing code.

2. The raw Motopho data is good, but the script that takes it and computes the latency/correlation has a bug.

3. The raw Motopho data is bad. In this case, all we can do is replace the hardware and hope it gets fixed.
Aug 15 2017
Aug 24 2017
A week or so ago, I modified the latency-computing script to dump the raw data it receives from the Motopho. Since then, I haven't seen any of the huge outliers (those seem to be much rarer), but I have seen a number of single-digit latencies. I also just modified the script again to save a visual representation of the screen brightness and angular velocity over time, to help get a better idea of what's going on. A few initial findings and thoughts:

1. There doesn't seem to be any correlation between when the problem occurs and the URL being tested, so this is unlikely to be related to anything on Chrome's end such as low FPS or buffer-stuffing issues.

2. The raw data from the Motopho appears to be good, at least in the cases where we get single-digit latencies. We'll have a better idea whether this is the case once we have graphs of the data.

3. It occurred to me that the way we're doing movement might actually be the cause, by being too repeatable. The latency-computing script essentially finds the delay that results in the highest correlation between screen brightness and movement. When a human is doing the movement, there are naturally variations in the speed and duration of each rotation, so you only ever get a high correlation at the correct offset. However, when we do the movements with a servo, there's essentially no difference in velocity between samples, which could make it easy for the script to associate some velocity measurement with an older brightness measurement, resulting in very low calculated latencies. If that's the case, we'll have to make the movements vary in speed to prevent it.
Aug 28 2017
I got some comparison graphs between good runs and the runs with really low latencies, and I think I may have found the issue; see the attached files for a visual example.

There has been a long-standing bug in the way we do the Motopho patch that causes the patch to sometimes, at random, flash brightly. I believe the root cause is that we need angular velocity, but VrCore does not provide that information, so instead we get two predicted poses that are very close to each other and calculate angular velocity from them. This works well most of the time, but in some cases it can cause sensor noise to show up as actual movement, which results in the screen flash.

If you look at the three bad graphs, you'll see that one of these flashes appeared very early in the test (shown by the random spike in yellow/red), whereas in the good graphs this occurs much later or not at all. This matters because the Motopho needs to spend some amount of time (~2-3 seconds) at the beginning collecting data while completely still and without changes in brightness in order to calculate the bias of its various sensors. My guess is that if the flash happens early, it corrupts the Motopho's calibration, which produces these bad results.

+klausw, who added the Motopho patch code - has VrCore added a way to get angular velocity directly, or can you think of a way to get rid of these flashes?
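The noise-amplification mechanism described above can be shown with a minimal sketch. This is a hypothetical 1-D simplification (assumed names, not actual Chrome/VrCore code) of the two-pose finite difference: dividing a tiny orientation delta by a tiny prediction gap turns milliradians of pose jitter into apparent rotation of many degrees per second.

```python
# Minimal 1-D sketch (hypothetical, not actual Chrome/VrCore code) of why
# finite-differencing two closely spaced predicted poses amplifies sensor
# noise: the orientation delta is divided by a tiny dt, so a few milliradians
# of jitter looks like a fast rotation, which the Motopho patch renders as a
# brightness flash.
import math

def angular_speed_deg_per_s(angle_a_rad, angle_b_rad, dt_s):
    """Finite-difference angular speed (1-D simplification of the
    two-quaternion case)."""
    return math.degrees(abs(angle_b_rad - angle_a_rad) / dt_s)
```

The same jitter spread over a longer baseline would be negligible, which is why getting angular velocity directly from the tracker (rather than from two near-coincident predicted poses) would eliminate the flashes.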
Aug 30 2017
After working with Klaus, it looks like we've fixed the unreasonably small latency results. The cause was that the latency-computation script was restricting the window of valid offsets (and thus latencies) to [0, min(250, first peak)]. We aren't 100% sure why this was the case - it looks like some sort of optimization to avoid unnecessary computation, but in the normal case the valid offsets would always be [0, 250] anyway. What was happening was that the random flashes were being detected as brightness peaks, and if one occurred very early in the test (e.g. 8 ms in), the valid offsets would be restricted to very few values. Since we keep the minimum acceptable correlation fairly low (0.6?), one of these small offsets was usually enough to reach an acceptable correlation and be accepted as a valid result by the script.

I'll leave this open for a bit longer to make sure the issue doesn't pop up again, and that the huge outliers don't show up either.
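The windowing bug can be reconstructed in a few lines. This is a hypothetical sketch (names assumed, not the actual script); the cap that matches the described shrinking behavior is min(250, first peak), and the fix is to always search the full window.

```python
# Hypothetical reconstruction (assumed names) of the windowing bug: candidate
# offsets were capped at the first detected brightness peak, so a spurious
# flash 8 ms into the run collapsed the search to a handful of tiny offsets,
# and with the acceptance threshold as low as ~0.6, one of them could pass
# as a "valid" single-digit latency.

def candidate_offsets(first_peak_ms, default_max_ms=250):
    """Buggy behavior: the window shrinks when the first peak is early.
    The fix is to always search the full [0, default_max_ms] window."""
    return list(range(0, min(default_max_ms, first_peak_ms) + 1))
```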
Sep 1 2017
Everything seems good now. I'll open a separate bug if something similar pops up again.
Comment 1 by leilei@chromium.org, Aug 15 2017