telemetry_perf_unittests is failing on android_n5x_swarming_rel, blocking CQ
Issue description: The whole suite is failing, not just individual tests: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.android%2Fandroid_n5x_swarming_rel%2F134654%2F%2B%2Frecipes%2Fsteps%2Ftelemetry_perf_unittests%2F0%2Fstdout Not sure why there is no log at all: ... USER: chrome-bot USERNAME: chrome-bot step returned non-zero exit code: 1
,
Mar 10 2017
I think you have to click on the swarming link? i.e. https://chromium-swarm.appspot.com/task?id=34d1fcf329d51e10&refresh=10&show_raw=1
,
Mar 10 2017
Oops, my mistake. The log shows that the suite timed out because of a file lock problem:
Traceback (most recent call last):
RunBenchmark at /b/swarm_slave/w/irLX7rXO/third_party/catapult/telemetry/telemetry/internal/story_runner.py:364
benchmark.ShouldTearDownStateAfterEachStorySetRun())
Run at /b/swarm_slave/w/irLX7rXO/third_party/catapult/telemetry/telemetry/internal/story_runner.py:205
story_set.archive_data_file, story_set.wpr_archive_info, stories):
_UpdateAndCheckArchives at /b/swarm_slave/w/irLX7rXO/third_party/catapult/telemetry/telemetry/internal/story_runner.py:405
wpr_archive_info.DownloadArchivesIfNeeded()
DownloadArchivesIfNeeded at /b/swarm_slave/w/irLX7rXO/third_party/catapult/telemetry/telemetry/wpr/archive_info.py:104
download_if_needed(archive_path)
download_if_needed at /b/swarm_slave/w/irLX7rXO/third_party/catapult/telemetry/telemetry/wpr/archive_info.py:83
cloud_storage.GetIfChanged(path, self._bucket)
GetIfChanged at /b/swarm_slave/w/irLX7rXO/third_party/catapult/common/py_utils/py_utils/cloud_storage.py:406
with _FileLock(file_path):
__enter__ at /usr/lib/python2.7/contextlib.py:17
return self.gen.next()
_FileLock at /b/swarm_slave/w/irLX7rXO/third_party/catapult/common/py_utils/py_utils/cloud_storage.py:259
LOCK_ACQUISITION_TIMEOUT)
WaitFor at /b/swarm_slave/w/irLX7rXO/third_party/catapult/common/py_utils/py_utils/__init__.py:132
(timeout, GetConditionString()))
TimeoutException: Timed out while waiting 10s for py_utils.WaitFor(lambda: _AttemptPseudoLockAcquisition(pseudo_lock_path,
pseudo_lock_fd_return),
LOCK_ACQUISITION_TIMEOUT).
Locals:
GetConditionString : <function GetConditionString at 0x7f88a5f2d0c8>
condition : <function <lambda> at 0x7f88a5f2d140>
elapsed_time : 10.890504837036133
last_output_elapsed_time : 10.890504837036133
last_output_time : 1489165918.971811
now : 1489165929.862316
poll_interval : 1.0890504837036132
res : False
start_time : 1489165918.971811
timeout : 10
Traceback (most recent call last):
File "/b/swarm_slave/w/irLX7rXO/tools/perf/benchmarks/system_health_smoke_test.py", line 98, in RunTest
msg='Failed: %s' % benchmark_class)
AssertionError: Failed: <class 'benchmarks.system_health.MobileMemorySystemHealth'>
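For readers unfamiliar with this failure mode, here is a minimal sketch of a poll-until-timeout lock acquisition of the kind the traceback points at. It is not the catapult code: `LockTimeout`, `try_acquire`, `acquire_with_timeout`, and `release` are hypothetical names, and the lock is assumed to be a plain file created with `O_EXCL`. It only illustrates why a second process that wants the same archive gives up once the holder keeps the lock longer than the timeout.

```python
import errno
import os
import time


class LockTimeout(Exception):
    """Raised when the lock file could not be created in time (hypothetical)."""


def try_acquire(lock_path):
    """Try once to create the lock file; return its fd, or None if it exists."""
    try:
        # O_EXCL makes creation fail if another process already holds the lock.
        return os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as e:
        if e.errno == errno.EEXIST:
            return None
        raise


def acquire_with_timeout(lock_path, timeout=10.0, poll_interval=0.1):
    """Poll try_acquire until it succeeds or `timeout` seconds have elapsed."""
    deadline = time.time() + timeout
    while True:
        fd = try_acquire(lock_path)
        if fd is not None:
            return fd
        if time.time() >= deadline:
            raise LockTimeout(
                'Timed out after %ss waiting for %s' % (timeout, lock_path))
        time.sleep(poll_interval)


def release(lock_path, fd):
    """Close the descriptor and delete the lock file so others can proceed."""
    os.close(fd)
    os.remove(lock_path)
```

Under this model, a download that holds the lock for longer than the timeout makes every other waiter on the same path fail, even though nothing is actually broken.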
,
Mar 10 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d4265776478516e71caaabad38d04750f22b0eba commit d4265776478516e71caaabad38d04750f22b0eba Author: jam <jam@chromium.org> Date: Fri Mar 10 17:57:58 2017 Disable system_health_smoke_test on android_n5x_swarming_rel as it's blocking the CQ. This isn't running on the main waterfall as well; we should make it run there before adding it to the CQ so that sheriffs can see failures on the main waterfall instead of just silently blocking the CQ. BUG=693672, 700426 NOTRY=true Review-Url: https://codereview.chromium.org/2736403006 Cr-Commit-Position: refs/heads/master@{#456101} [modify] https://crrev.com/d4265776478516e71caaabad38d04750f22b0eba/testing/buildbot/chromium.android.json
,
Mar 15 2017
Looking at the log in https://chromium-swarm.appspot.com/task?id=34d1fcf329d51e10&refresh=10&show_raw=1, we have 51 instances of failure due to the lock timeout (TimeoutException: Timed out while waiting 10s for py_utils.WaitFor...). Each instance takes roughly 16s, which means we waste a total of about 13 minutes out of 140 minutes dealing with this failure.
,
Mar 15 2017
To be fair, we're not sure that's completely true. We know that we spent 13 minutes waiting, but if another parallel test was in fact downloading the same resource and the timeout just wasn't high enough, then saying that the time was wasted on the failure isn't a completely accurate assessment.
,
Mar 15 2017
Charlie: that 13 minutes number is the phone time, not the wall time. We have a 20-minute hard timeout across 7 phones. So that is 13 minutes out of 140 minutes.
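The arithmetic behind these figures can be checked with a few lines. The inputs are the numbers quoted in this thread (51 timeouts, roughly 16s each, 7 phones, 20-minute hard timeout); the variable names are only for illustration.

```python
FAILURES = 51                # lock-timeout instances counted in the log
SECONDS_PER_FAILURE = 16     # rough cost of each instance
PHONES = 7
HARD_TIMEOUT_MINUTES = 20    # per-phone hard timeout

# Phone time lost to the timeouts, in minutes.
wasted_minutes = FAILURES * SECONDS_PER_FAILURE / 60.0
# Total phone time available before the hard timeout hits.
budget_minutes = PHONES * HARD_TIMEOUT_MINUTES
fraction = wasted_minutes / budget_minutes

print('%.1f of %d phone-minutes (%.1f%%)'
      % (wasted_minutes, budget_minutes, 100 * fraction))
# prints "13.6 of 140 phone-minutes (9.7%)"
```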
,
Mar 15 2017
Right: what I'm saying though is that, if that time is spent actually waiting for required resources, there may not be a problem in the cloud storage logic: we may just be waiting for those resources. (I'm not saying this is the case, just that it's a possibility.)
,
Mar 15 2017
In https://codereview.chromium.org/2752033002/, I fix the problem of the WPR archives used by the system health smoke tests being downloaded in parallel, by prefetching all the archives before triggering the parallel run. With this change, telemetry_perf_unittests takes 11 minutes in total (https://chromium-swarm.appspot.com/task?id=34ebdfd8160fa210&refresh=10&show_raw=1). John: is there any way for me to get a passing swarming log of telemetry_perf_unittests from before jam's CL in #4, so I can double check that the suite's time was blowing up due to flakiness? Jam: the hard timeout limit of telemetry_perf_unittests is 20 minutes; is it ok for me to land https://codereview.chromium.org/2752033002/ to re-enable system_health_smoke_test? If we still run other tests on android_n5x_swarming_rel on CQ even though that bot is not enabled on main waterfall, I think we need not treat this case differently?
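As a rough illustration of the prefetching idea (not the actual change in system_health_smoke_test.py), the sketch below downloads each distinct archive once, serially, and only then starts the parallel run, so that no two workers contend for the same file lock. `prefetch_then_run`, `archives_for`, `download`, and `run_test` are hypothetical names supplied by the caller, and a thread pool stands in for whatever parallel runner is really used.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch_then_run(tests, archives_for, download, run_test, workers=4):
    """Download each distinct archive once, then run the tests in parallel.

    All four callables and the worker count are illustrative assumptions.
    """
    seen = set()
    for test in tests:
        for path in archives_for(test):
            if path not in seen:
                seen.add(path)
                download(path)  # serial, so there is no lock contention
    # Every archive is already local, so parallel workers only read them.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_test, tests))
```

For example, if three tests share two archives, `download` is called twice rather than up to three times concurrently, and the results come back in the same order as `tests`.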
,
Mar 15 2017
To Charlie's comment in #8: failing tests blow up the total suite run time significantly because typ only retries failed tests in serial, not in parallel (for good reason).
,
Mar 15 2017
#9: passing runs from t_p_u on android_n5x_swarming_rel on 2017-03-09: https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=user&c=bot&et=1489132800000&f=buildername%3Aandroid_n5x_swarming_rel&f=name%3Atelemetry_perf_unittests&f=state%3ACOMPLETED_SUCCESS&l=50&q=state%3A&s=created_ts%3Adesc&st=1489046400000
,
Mar 15 2017
@ned: the other test suites on n5x are running on main waterfall though right? This isn't? Is it running on any bots, even on chromium.android? I'm not sure what you meant by the hard limit of 20 minutes: I understand it's there, and the test suite was hitting it.
,
Mar 15 2017
jam: 1) the non-Android version of the suite is run on main waterfall. For example: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.win%2FWin7_Tests__1_%2F64783%2F%2B%2Frecipes%2Fsteps%2Ftelemetry_perf_unittests%2F0%2Fstdout (search for "system_health.memory_desktop.."). It doesn't run on chromium.android AFAIK. @John: is it ok for us to add telemetry_perf_unittests (with the system health smoke test) to chromium.android? 2) What I meant by the limit of 20 minutes is that with my fix, the total test runtime of telemetry_perf_unittests should be around 11 minutes, which is way under that limit. So the risk of the test timing out should be low.
,
Mar 15 2017
@nednguyen: I think the test suite running on different OSs is orthogonal. The room for failures to happen on one OS but not another is why we had failures in the CQ in this bug for example. So if we want it on the CQ, it should be on the main waterfall, or at least a chromium.android bot that is actively sheriffed. This is a basic rule that we try to stick to for anything on the CQ.
,
Mar 15 2017
Got it. I will wait for John on whether we can enable system health smoke test on the main waterfall (it's disabled in https://cs.chromium.org/chromium/src/testing/buildbot/chromium.android.json?rcl=2a016d864e0798bef9150b0c1dbc284ba43a77fa&l=1187)
,
Mar 15 2017
13: they're on chromium.android. android_n5x_swarming_rel's matching waterfall bot is https://build.chromium.org/p/chromium.android/builders/Android%20N5X%20Swarm%20Builder
,
Mar 15 2017
#16: but the system health smoke tests are skipped there. Can we enable system health smoke tests?
,
Mar 15 2017
if the prefetching change in https://codereview.chromium.org/2752033002/ addresses the speed and flakiness issues we saw on 3/10, sure.
,
Mar 17 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/cae53638598092a3a97ad4fd6fbe5addd9404d9a commit cae53638598092a3a97ad4fd6fbe5addd9404d9a Author: nednguyen <nednguyen@google.com> Date: Fri Mar 17 18:03:51 2017 Prefetch all WPR archives used by system_health_smoke_test This also enables system_health_smoke_test on android.chromium bot (Android N5X Swarm Builder). BUG= 700426 Review-Url: https://codereview.chromium.org/2752033002 Cr-Commit-Position: refs/heads/master@{#457813} [modify] https://crrev.com/cae53638598092a3a97ad4fd6fbe5addd9404d9a/testing/buildbot/chromium.android.json [modify] https://crrev.com/cae53638598092a3a97ad4fd6fbe5addd9404d9a/tools/perf/benchmarks/system_health_smoke_test.py
,
Mar 18 2017
Comment 1 by nedngu...@google.com
, Mar 10 2017