
Issue 653304

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 658358
Owner: ----
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug

Blocking:
issue 649391




Test harness is using devices it should not be using

Project Member Reported by katthomas@chromium.org, Oct 5 2016

Issue description

OS: Android

Android devices are supposed to reboot after every task (or test? I don't know), and devices that don't do this are not supposed to be used by the test harness. 

It appears that the test harness is actually using these devices, and the result is flaky infra failures.

One way this manifests itself is in segmentation faults, as in https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/154235/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio.

A list of build logs that fit this profile:
https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/154235/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/154222/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/154198/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/153458/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/153432/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/153371/steps/chrome_public_test_apk%20%28with%20patch%29%20on%20Android/logs/stdio/text

A list of devices that seem to be problematic but are still in use:
01ab72b6032fd46f
060f312b00622f4b
0cbc70cb0c36f32a
03d1ca3e006adc9d
03d13012006af0b4
060887bf00623344
07406225003b76cb
0cba9109032fd6e3
02e78216f0c9d460
06b6a940006b0ab6
060f2aaf13c85a70
06c2e068006af1d0
06c15728006af052
06094361006234e9
01ab4b680c375d0a
03f9fb0000622bd7
060ef89400622d8a
01ab77d7032fd57e
060f030913c86e68
06b777aa00622856
01ab65770c375d2f
03ab1b5d003bf320
0cbc7002032f93a5

+stip +jbudorick to help fix the components and inform the right people 
(thanks!)
 
Components: Infra>Client>Android
Status: Available (was: Untriaged)
We reboot our Android devices after they fail a task or if their uptime exceeds 3 hours. You can use https://viceroy.corp.google.com/chrome_infra/Machines/per_android_device and look at the uptime graph of a phone to confirm that it's reporting data and never exceeds 3 hours. From randomly spot-checking a few of the phones, none of the devices you listed appear to exceed that limit.
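For reference, the reboot rule described above can be written down as a small predicate. This is only a sketch of the stated policy, not the actual infra code: the `DeviceState` type, its field names, and `should_reboot` are hypothetical names made up for illustration. Only the two conditions (a failed task, or uptime over 3 hours) come from this comment.

```python
from dataclasses import dataclass

# Threshold stated in the comment above; everything else here is illustrative.
MAX_UPTIME_HOURS = 3.0


@dataclass
class DeviceState:
    """Hypothetical snapshot of a device; not a real infra data structure."""
    serial: str
    uptime_hours: float
    last_task_failed: bool


def should_reboot(device: DeviceState) -> bool:
    """Return True if the device failed its last task or has been up too long."""
    return device.last_task_failed or device.uptime_hours > MAX_UPTIME_HOURS


if __name__ == "__main__":
    # Placeholder device, not one of the serials listed in this issue.
    healthy = DeviceState(serial="example-serial", uptime_hours=1.5, last_task_failed=False)
    print(should_reboot(healthy))
```

Note that under this rule a device that passes every task is not rebooted between tasks; it is only rebooted once its uptime crosses the threshold.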

I wonder how you came to the conclusion that all the task failures listed are due to devices not being rebooted/quarantined when they should be. Given that all the failing tests were triggered by the same two CLs (https://codereview.chromium.org/2388253002 and https://codereview.chromium.org/2388693002) it seems to me that the cause of failure here lies in the changes they introduced.
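One way to check this kind of clustering is to count failing builds per triggering CL and see how much of the total the top few CLs account for. The sketch below assumes you already have (build, CL) pairs from somewhere; the function names and the input shape are hypothetical, and the identifiers in the usage example are placeholders rather than data from the logs above.

```python
from collections import Counter
from typing import Iterable, Tuple


def failures_per_cl(failures: Iterable[Tuple[str, str]]) -> Counter:
    """Count failing builds per CL. Each item is a (build_id, cl_id) pair."""
    return Counter(cl for _build, cl in failures)


def share_of_top_cls(failures: Iterable[Tuple[str, str]], top_n: int = 2) -> float:
    """Fraction of all failing builds attributable to the top_n most frequent CLs."""
    counts = failures_per_cl(failures)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _cl, count in counts.most_common(top_n))
    return top / total


if __name__ == "__main__":
    # Placeholder identifiers only; not taken from the build logs in this issue.
    sample = [("build-1", "cl-A"), ("build-2", "cl-A"), ("build-3", "cl-B")]
    print(share_of_top_cls(sample, top_n=2))
```

A share close to 1.0 for a small `top_n` suggests the failures follow specific changes rather than specific devices.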
Oh ok, I didn't realize they're only rebooted after a _failing_ task. I thought they were rebooted after every task. That influenced how I was thinking about it, and I overlooked that these were all triggered by the same CLs. Thanks for pointing that out! If it is the case that these changes caused the failures, then I'm curious why the step/build was marked as an INFRA_FAILURE. I'm going to look into this more to try to determine more precisely the cause of the failures.  
It does seem likely that these issues were due to the patches rather than infrastructure. I wonder if there is a way we can detect that and mark it as such? As is, marking these as infra failures is noisy for us (chrome infra) and unhelpful for developers. 

I contacted the two owners of the referenced CLs, and they felt pretty confident the failures were due to their changes. A couple of pieces of feedback: 
- It's hard to tell what caused the failure because we don't see information in the logs for where in the test the failure originated.
- The errors were inconsistent and not very useful, and therefore were ignored in development. <-- This in particular seems especially bad. If we're blaming a change-related and inconsistent failure on infra, it will be less likely to be noticed by the developer as an issue, increasing the likelihood the change will be pushed without a fix, leading to difficult-to-reproduce bugs in production.
Blocking: 649391
Labels: Hotlist-Infra-Flakiness
Mergedinto: 658358
Status: Duplicate (was: Available)
https://bugs.chromium.org/p/chromium/issues/detail?id=658358 aims to fix this kind of thing. Marking this as Duplicate for now.
