
Issue 700426

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




telemetry_perf_unittests is failing on android_n5x_swarming_rel, blocking CQ

Project Member Reported by nedngu...@google.com, Mar 10 2017

Issue description

The whole suite is failing, not just individual tests:
https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.android%2Fandroid_n5x_swarming_rel%2F134654%2F%2B%2Frecipes%2Fsteps%2Ftelemetry_perf_unittests%2F0%2Fstdout

Not sure why there is no log at all:
...
 USER: chrome-bot
 USERNAME: chrome-bot

step returned non-zero exit code: 1


 
Labels: -Pri-3 Pri-0
jam@ suspects this is caused by the system_health_smoke tests, so we will try disabling them on android_n5x_swarming_rel first: https://codereview.chromium.org/2736403006/

If this doesn't work, we will need to disable the whole telemetry_perf_unittests suite on android_n5x_swarming_rel.

Comment 2 by jam@chromium.org, Mar 10 2017

I think you have to click on the swarming link? i.e. https://chromium-swarm.appspot.com/task?id=34d1fcf329d51e10&refresh=10&show_raw=1
Oops, my mistake. The log shows that the suite timed out because of a file lock problem:
Traceback (most recent call last):
    RunBenchmark at /b/swarm_slave/w/irLX7rXO/third_party/catapult/telemetry/telemetry/internal/story_runner.py:364
      benchmark.ShouldTearDownStateAfterEachStorySetRun())
    Run at /b/swarm_slave/w/irLX7rXO/third_party/catapult/telemetry/telemetry/internal/story_runner.py:205
      story_set.archive_data_file, story_set.wpr_archive_info, stories):
    _UpdateAndCheckArchives at /b/swarm_slave/w/irLX7rXO/third_party/catapult/telemetry/telemetry/internal/story_runner.py:405
      wpr_archive_info.DownloadArchivesIfNeeded()
    DownloadArchivesIfNeeded at /b/swarm_slave/w/irLX7rXO/third_party/catapult/telemetry/telemetry/wpr/archive_info.py:104
      download_if_needed(archive_path)
    download_if_needed at /b/swarm_slave/w/irLX7rXO/third_party/catapult/telemetry/telemetry/wpr/archive_info.py:83
      cloud_storage.GetIfChanged(path, self._bucket)
    GetIfChanged at /b/swarm_slave/w/irLX7rXO/third_party/catapult/common/py_utils/py_utils/cloud_storage.py:406
      with _FileLock(file_path):
    __enter__ at /usr/lib/python2.7/contextlib.py:17
      return self.gen.next()
    _FileLock at /b/swarm_slave/w/irLX7rXO/third_party/catapult/common/py_utils/py_utils/cloud_storage.py:259
      LOCK_ACQUISITION_TIMEOUT)
    WaitFor at /b/swarm_slave/w/irLX7rXO/third_party/catapult/common/py_utils/py_utils/__init__.py:132
      (timeout, GetConditionString()))
  TimeoutException: Timed out while waiting 10s for py_utils.WaitFor(lambda: _AttemptPseudoLockAcquisition(pseudo_lock_path,
                                                           pseudo_lock_fd_return),
                     LOCK_ACQUISITION_TIMEOUT).
  
  Locals:
    GetConditionString       : <function GetConditionString at 0x7f88a5f2d0c8>
    condition                : <function <lambda> at 0x7f88a5f2d140>
    elapsed_time             : 10.890504837036133
    last_output_elapsed_time : 10.890504837036133
    last_output_time         : 1489165918.971811
    now                      : 1489165929.862316
    poll_interval            : 1.0890504837036132
    res                      : False
    start_time               : 1489165918.971811
    timeout                  : 10
  
  Traceback (most recent call last):
    File "/b/swarm_slave/w/irLX7rXO/tools/perf/benchmarks/system_health_smoke_test.py", line 98, in RunTest
      msg='Failed: %s' % benchmark_class)
  AssertionError: Failed: <class 'benchmarks.system_health.MobileMemorySystemHealth'>
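
For illustration, the failure mode in the traceback above (a lock acquisition that polls until a 10s timeout expires) can be reproduced with a minimal sketch. This is not the catapult implementation: `try_acquire`, `acquire_with_timeout`, `release`, `LockTimeout`, and the lock file name are hypothetical, and the sketch assumes the pseudo-lock is a file created exclusively next to the archive.

```python
import os
import tempfile
import time


class LockTimeout(Exception):
    """Raised when the pseudo-lock cannot be acquired in time (hypothetical)."""


def try_acquire(lock_path):
    """Return a file descriptor if the lock file was created, else None."""
    try:
        # O_EXCL makes creation fail if another holder already made the file.
        return os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None


def acquire_with_timeout(lock_path, timeout_s, poll_s=0.01):
    """Poll try_acquire until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        fd = try_acquire(lock_path)
        if fd is not None:
            return fd
        if time.monotonic() >= deadline:
            raise LockTimeout(
                'Timed out after %ss waiting for %s' % (timeout_s, lock_path))
        time.sleep(poll_s)


def release(lock_path, fd):
    """Close the descriptor and delete the lock file so others can proceed."""
    os.close(fd)
    os.remove(lock_path)


def demo():
    """Hold the lock, then show a second acquisition timing out."""
    lock_path = os.path.join(tempfile.mkdtemp(), 'archive.wpr.pseudo_lock')
    holder_fd = acquire_with_timeout(lock_path, timeout_s=0.1)
    try:
        acquire_with_timeout(lock_path, timeout_s=0.1)
        timed_out = False
    except LockTimeout:
        timed_out = True
    release(lock_path, holder_fd)
    # Once released, the lock can be taken again without waiting.
    fd = acquire_with_timeout(lock_path, timeout_s=0.1)
    release(lock_path, fd)
    return timed_out
```

Under this assumption, any test that needs an archive while another process is still downloading it waits on the lock, and fails outright if the download takes longer than the timeout.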

Project Member

Comment 4 by bugdroid1@chromium.org, Mar 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d4265776478516e71caaabad38d04750f22b0eba

commit d4265776478516e71caaabad38d04750f22b0eba
Author: jam <jam@chromium.org>
Date: Fri Mar 10 17:57:58 2017

Disable system_health_smoke_test on android_n5x_swarming_rel as it's blocking the CQ.

This isn't running on the main waterfall as well; we should make it run there before adding it to the CQ so that sheriffs can see failures on the main waterfall instead of just silently blocking the CQ.

BUG=693672, 700426 
NOTRY=true

Review-Url: https://codereview.chromium.org/2736403006
Cr-Commit-Position: refs/heads/master@{#456101}

[modify] https://crrev.com/d4265776478516e71caaabad38d04750f22b0eba/testing/buildbot/chromium.android.json

Looking at the log in https://chromium-swarm.appspot.com/task?id=34d1fcf329d51e10&refresh=10&show_raw=1, we have 51 instances of failure due to the lock timeout (TimeoutException: Timed out while waiting 10s for py_utils.WaitFor...)

Each instance takes roughly 16s, which means we waste a total of about 13 minutes out of 140 minutes dealing with this failure.
To be fair, we're not sure that's completely true. We know we spent 13 minutes waiting, but if another parallel test really was downloading the same resource and the timeout just wasn't long enough, then it isn't quite accurate to say that the time was wasted on the failure.
Charlie: that 13-minute number is phone time, not wall time. We have a 20-minute hard timeout on each of 7 phones, so that is 13 minutes out of 140 minutes.
Right: what I'm saying though is that, if that time is spent actually waiting for required resources, there may not be a problem in the cloud storage logic: we may just be waiting for those resources.

(I'm not saying this is the case, just that it's a possibility.)
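
The arithmetic behind the "13 minutes out of 140 minutes" figure, spelled out as a small sketch. The inputs are the numbers quoted in the comments above; the function names are made up for illustration, and the thread rounds the result down to 13 minutes.

```python
FAILURES = 51             # lock-timeout failures counted in the swarming log
SECONDS_PER_FAILURE = 16  # rough cost of each failure, per the comment above
PHONES = 7                # devices the suite is sharded across
HARD_TIMEOUT_MIN = 20     # hard timeout per phone, in minutes


def lock_wait_minutes():
    """Total phone time spent on lock-timeout failures, in minutes."""
    return FAILURES * SECONDS_PER_FAILURE / 60.0


def phone_time_budget_minutes():
    """Total phone time available before the hard timeout, in minutes."""
    return PHONES * HARD_TIMEOUT_MIN


def wasted_fraction():
    """Share of the phone-time budget spent on lock-timeout failures."""
    return lock_wait_minutes() / phone_time_budget_minutes()
```

So the lock timeouts account for a bit under 10% of the available phone time, whether or not that waiting was strictly wasted.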
Cc: jbudorick@chromium.org
Labels: -Pri-0 Pri-1
Status: Started (was: Untriaged)
In https://codereview.chromium.org/2752033002/, I fix the problem of the WPR archives used by the system health smoke tests being downloaded in parallel, by prefetching all the archives before triggering the parallel run.

With this change, telemetry_perf_unittests takes 11 minutes in total (https://chromium-swarm.appspot.com/task?id=34ebdfd8160fa210&refresh=10&show_raw=1)
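
A rough sketch of the prefetching idea: download each distinct archive once, serially, and only then start the parallel run. This is not the code in the CL. `prefetch_then_run`, `fetch_archive`, `run_story`, and the story dictionaries are hypothetical stand-ins, and a thread pool stands in for whatever parallelism the test runner actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch_then_run(stories, fetch_archive, run_story, workers=4):
    """Fetch each distinct archive once, serially, then run stories in parallel.

    fetch_archive and run_story are hypothetical callables supplied by the
    caller; each story is a dict with an 'archive' key.
    """
    # Serial phase: de-duplicate so each archive is downloaded exactly once
    # and no two workers ever contend for the same file lock.
    for archive in sorted({story['archive'] for story in stories}):
        fetch_archive(archive)
    # Parallel phase: every archive is already local, so no downloads occur.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_story, stories))


def demo():
    """Run three stories sharing two archives and record what was fetched."""
    fetched = []

    def fetch_archive(name):
        fetched.append(name)

    def run_story(story):
        # A story passes only if its archive was fetched before it ran.
        return story['archive'] in fetched

    stories = [
        {'name': 'a', 'archive': 'x.wpr'},
        {'name': 'b', 'archive': 'x.wpr'},
        {'name': 'c', 'archive': 'y.wpr'},
    ]
    results = prefetch_then_run(stories, fetch_archive, run_story)
    return fetched, results
```

The trade-off is a short serial setup step in exchange for removing lock contention from the parallel phase.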

John: is there any way for me to get a passing swarming log of telemetry_perf_unittests from before jam's CL in #4, so I can double-check that the suite's run time was blowing up due to flakiness?

Jam: the hard timeout limit of telemetry_perf_unittests is 20 minutes, is it ok for me to land https://codereview.chromium.org/2752033002/ to re-enable system_health_smoke test? If we still run other tests on android_n5x_swarming_rel on CQ even though that bot is not enabled on main waterfall, I think we need not treat this case differently?
To Charlie's comment in #8: failing tests blow up the total suite run time significantly because typ only retries failed tests serially, not in parallel (for good reason).
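
A back-of-the-envelope model of why serial retries inflate the run time. This is only a sketch, not typ's actual scheduling: it assumes perfect load balancing in the first pass and that every failed test fails again on each retry. The function name and parameters are hypothetical.

```python
def estimated_wall_time(test_seconds, failed_seconds, jobs, retries):
    """Rough wall time: a parallel first pass plus serial retries of failures.

    test_seconds: durations of all tests in the first pass.
    failed_seconds: durations of the tests that failed (each retried serially).
    jobs: number of parallel workers in the first pass.
    retries: number of serial retry rounds per failed test.
    """
    parallel_pass = sum(test_seconds) / float(jobs)
    serial_retries = retries * sum(failed_seconds)
    return parallel_pass + serial_retries
```

With made-up numbers, 70 tests of 10s each on 7 workers take 100s when everything passes; if 7 of them fail and are retried 3 times serially, the estimate grows to 310s.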

Comment 12 by jam@chromium.org, Mar 15 2017

@ned: the other test suites on n5x are running on main waterfall though right? This isn't? Is it running on any bots, even on chromium.android?

I'm not sure what you meant by the hard limit of 20 minutes: I understand it's there, and the test suite was hitting it.
jam: the non-Android version of the suite is run on the main waterfall. For example: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.win%2FWin7_Tests__1_%2F64783%2F%2B%2Frecipes%2Fsteps%2Ftelemetry_perf_unittests%2F0%2Fstdout (search for "system_health.memory_desktop..")

It doesn't run on chromium.android AFAIK.

@John: is it ok for us to add telemetry_perf_unittests (with the system health smoke test) to chromium.android?

2) What I meant by the limit of 20 minutes is that, with my fix, the total runtime of telemetry_perf_unittests should be around 11 minutes, which is well under that limit. So the risk of the suite timing out should be low.

Comment 14 by jam@chromium.org, Mar 15 2017

@nednguyen: I think the test suite running on different OSs is orthogonal. The room for failures to happen on one OS but not another is why we had failures in the CQ in this bug for example. So if we want it on the CQ, it should be on the main waterfall, or at least a chromium.android bot that is actively sheriffed. This is a basic rule that we try to stick to for anything on the CQ.
Got it. I will wait for John on whether we can enable system health smoke test on the main waterfall (it's disabled in https://cs.chromium.org/chromium/src/testing/buildbot/chromium.android.json?rcl=2a016d864e0798bef9150b0c1dbc284ba43a77fa&l=1187)
#13: they're on chromium.android. android_n5x_swarming_rel's matching waterfall bot is https://build.chromium.org/p/chromium.android/builders/Android%20N5X%20Swarm%20Builder
#16: but the system health smoke tests are skipped there. Can we enable system health smoke tests?
if the prefetching change in https://codereview.chromium.org/2752033002/ addresses the speed and flakiness issues we saw on 3/10, sure.
Project Member

Comment 19 by bugdroid1@chromium.org, Mar 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cae53638598092a3a97ad4fd6fbe5addd9404d9a

commit cae53638598092a3a97ad4fd6fbe5addd9404d9a
Author: nednguyen <nednguyen@google.com>
Date: Fri Mar 17 18:03:51 2017

Prefetch all WPR archives used by system_health_smoke_test

This also enables system_health_smoke_test on android.chromium bot (Android N5X Swarm Builder).

BUG= 700426 

Review-Url: https://codereview.chromium.org/2752033002
Cr-Commit-Position: refs/heads/master@{#457813}

[modify] https://crrev.com/cae53638598092a3a97ad4fd6fbe5addd9404d9a/testing/buildbot/chromium.android.json
[modify] https://crrev.com/cae53638598092a3a97ad4fd6fbe5addd9404d9a/tools/perf/benchmarks/system_health_smoke_test.py

Status: Fixed (was: Started)
