I have recently been running a series of webrtc benchmark runs (story tag=stress) and have noticed that the benchmark runs for a varying amount of time once it actually logs in and reaches the test website: I have observed it running anywhere from 3 seconds to 145 seconds (it may have run longer; I wasn't always monitoring it closely). It also frequently ends by hitting an "Aw, Snap!" screen after a varying amount of time. Sometimes when it hits the "Aw, Snap!" screen (usually on the very short runs), Telemetry reports that the test failed, but on the slightly longer runs that hit an "Aw, Snap!" screen, Telemetry reports SUCCESS. Shouldn't hitting "Aw, Snap!" always be a failure?
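For reproduction, each run is launched roughly along these lines (the device IP and the exact flag set below are placeholders/approximations, not a verbatim copy of my command line):

    # Rough sketch of how each run is launched; flags and the DUT IP are placeholders.
    import subprocess, time

    CMD = [
        "tools/perf/run_benchmark", "webrtc",
        "--story-tag-filter=stress",
        "--browser=cros-chrome",
        "--remote=<DUT_IP>",   # Kevin chromebook, driven from the workstation
    ]

    start = time.time()
    result = subprocess.run(CMD)
    print("exit code %d after %.1f s" % (result.returncode, time.time() - start))

The wall-clock time printed at the end is how I noticed the 3-145 second spread in run duration.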
I also collected perf data on CPU cycles during these runs. Using the same test on identical machines, and looking only at runs that Telemetry reported as successful, I am seeing anywhere from 10-50% variance in the results. These results are so flaky, noisy, and unreliable that the benchmark is useless for me. This should be investigated and fixed.
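To be concrete about what I mean by "variance", here is roughly how I summarize the per-run CPU cycle counts (the numbers below are placeholders for illustration, not my actual measurements):

    # Sketch of the run-to-run spread calculation; the cycle counts are placeholders.
    cycles = [1.00e9, 1.18e9, 1.42e9, 1.05e9, 1.37e9]   # CPU cycles per successful run

    mean = sum(cycles) / len(cycles)
    spread = (max(cycles) - min(cycles)) / mean * 100
    print("mean = %.3g cycles, min-to-max spread = %.0f%% of mean" % (mean, spread))

With my real data, that min-to-max spread lands anywhere in the 10-50% range depending on the batch of runs.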
These tests were run on a Kevin chromebook, driven remotely from an Ubuntu workstation with almost nothing else running on it at the time. I ran tests both with and without pinning the CPU frequency (pinning the CPU frequency actually made the variance worse). I can share the actual numbers with you if you want.
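For anyone trying to reproduce the pinned-frequency runs: one way to pin is via the standard Linux cpufreq sysfs interface on the DUT, something like the sketch below (run as root on the chromebook; governor support on Kevin may differ, so treat this as an assumption about the mechanism rather than my exact steps):

    # Sketch: set all cores to the "performance" governor via sysfs (run as root on the DUT).
    import glob

    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
        with open(path, "w") as f:
            f.write("performance")
        print("set %s" % path)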
Comment 1 by benhenry@google.com, Jan 11