Test harness is using devices it should not be using
Issue description
OS: Android

Android devices are supposed to reboot after every task (or test? I don't know), and devices that don't do this are not supposed to be used by the test harness. It appears that the test harness is actually using these devices, and the result is flaky infra failures. One way this manifests itself is in segmentation faults, as in https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/154235/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio.

A list of build logs that fit this profile:
- https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/154235/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text
- https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/154222/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text
- https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/154198/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text
- https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/153458/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text
- https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/153432/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text
- https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/153371/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text

A list of devices that seem to be problematic but in use:
- 01ab72b6032fd46f
- 060f312b00622f4b
- 0cbc70cb0c36f32a
- 03d1ca3e006adc9d
- 03d13012006af0b4
- 060887bf00623344
- 07406225003b76cb
- 0cba9109032fd6e3
- 02e78216f0c9d460
- 06b6a940006b0ab6
- 060f2aaf13c85a70
- 06c2e068006af1d0
- 06c15728006af052
- 06094361006234e9
- 01ab4b680c375d0a
- 03f9fb0000622bd7
- 060ef89400622d8a
- 01ab77d7032fd57e
- 060f030913c86e68
- 06b777aa00622856
- 01ab65770c375d2f
- 03ab1b5d003bf320
- 0cbc7002032f93a5

+stip +jbudorick to help fix the components and inform the right people (thanks!)
Oct 6 2016
We reboot our Android devices after they fail a task or if their uptime exceeds 3 hours. You can use https://viceroy.corp.google.com/chrome_infra/Machines/per_android_device and look at the uptime graph of a phone to confirm that it's reporting data and that its uptime never exceeds 3 hours. Spot-checking a few of the phones at random, none of the devices you listed appears to exceed that limit. I wonder how you came to the conclusion that all the task failures listed are due to devices not being rebooted/quarantined when they should be. Given that all the failing tests were triggered by the same two CLs (https://codereview.chromium.org/2388253002 and https://codereview.chromium.org/2388693002), it seems to me that the cause of failure here lies in the changes they introduced.
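For readers who want the reboot rule spelled out, here is a minimal sketch of the policy as described in this comment. It is not the actual infra code: the `DeviceState` type, its field names, and `needs_reboot` are hypothetical names chosen for illustration; only the two conditions (failed task, uptime over 3 hours) come from the comment.

```python
from dataclasses import dataclass

# The 3-hour uptime limit stated in the comment, in seconds.
MAX_UPTIME_SECONDS = 3 * 60 * 60


@dataclass
class DeviceState:
    """Hypothetical snapshot of one device; not a real infra type."""
    serial: str
    uptime_seconds: float
    last_task_failed: bool


def needs_reboot(device: DeviceState) -> bool:
    """Return True if the device failed its last task or has been up too long."""
    return device.last_task_failed or device.uptime_seconds > MAX_UPTIME_SECONDS
```

Under this rule a device that passes every task is not rebooted between tasks; it is only rebooted once its uptime crosses the limit.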
Oct 7 2016
Oh ok, I didn't realize they're only rebooted after a _failing_ task. I thought they were rebooted after every task. That influenced how I was thinking about it, and I overlooked that these were all triggered by the same CLs. Thanks for pointing that out! If it is the case that these changes caused the failures, then I'm curious why the step/build was marked as an INFRA_FAILURE. I'm going to look into this more to try to determine more precisely the cause of the failures.
Oct 10 2016
It does seem likely that these issues were due to the patches rather than infrastructure. I wonder if there is a way we can detect that and mark it as such? As is, marking these as infra failures is noisy for us (chrome infra) and unhelpful for developers.

I contacted the two owners of the referenced CLs, and they felt pretty confident the failures were due to their changes. A couple of pieces of feedback:
- It's hard to tell what caused the failure because we don't see information in the logs for where in the test the failure originated.
- The errors were inconsistent and not very useful, and therefore were ignored in development. <-- This in particular seems especially bad. If we're blaming a change-related and inconsistent failure on infra, it will be less likely to be noticed by the developer as an issue, increasing the likelihood the change will be pushed without a fix, leading to difficult-to-reproduce bugs in production.
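One possible way to detect this, sketched under stated assumptions: if failures cluster on a small number of CLs while being spread over several devices, the patch is a more likely culprit than any single device. The function name, the `(cl_id, device_serial)` input shape, and both thresholds are hypothetical; this is not how the test harness currently classifies failures.

```python
from collections import defaultdict


def suspect_patch(failures, max_cls=2, min_devices=3):
    """Heuristic: few distinct CLs failing across many devices suggests the patch.

    failures: iterable of (cl_id, device_serial) pairs (hypothetical shape).
    max_cls / min_devices: illustrative thresholds, not tuned values.
    """
    devices_by_cl = defaultdict(set)
    for cl_id, serial in failures:
        devices_by_cl[cl_id].add(serial)

    # Union of every device that saw at least one failure.
    all_devices = set()
    for serials in devices_by_cl.values():
        all_devices |= serials

    return 0 < len(devices_by_cl) <= max_cls and len(all_devices) >= min_devices
```

A check like this could only downgrade the confidence of an infra-failure label; it cannot prove the patch is at fault, since a bad batch of devices could still coincide with a small set of CLs.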
Oct 11 2016
Oct 12 2016
Oct 21 2016
https://bugs.chromium.org/p/chromium/issues/detail?id=658358 aims to fix this kind of thing. Marking this as Duplicate for now.
Comment 1 by jbudorick@chromium.org, Oct 5 2016
Status: Available (was: Untriaged)