
Issue 618576

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug

Blocked on:
issue 616392

Blocking:
issue 620486




Nexus 5X bot failing to capture screenshot after switch to android-chromium and Release mode

Project Member Reported by kbr@chromium.org, Jun 9 2016

Issue description

This swarmed Nexus 5X bot:
https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20(Nexus%205X)

is failing most tests after a couple of configuration changes:
 - Running android-chromium instead of android-content-shell
 - Running a Release build instead of Debug build

I'm not sure which one, or if both, broke things.

Basically all of the tests that capture screenshots are broken. The call from Telemetry to DevTools to capture the screenshot is returning None.

The same tests are passing on other bots -- for example, the Nexus 5: https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20(Nexus%205) . Therefore screenshots can't be completely broken with ChromePublic.apk.

This is the only Android GPU bot running these tests via Swarming, so it's possible that the deps or data_deps for e.g. src/tools/perf/chrome_telemetry_build/BUILD.gn are wrong. In particular, it's surprising to me they don't depend on bitmaptools. I would expect better logging if a key component like that were missing and that was why the screenshot was failing.

 

Comment 1 by kbr@chromium.org, Jun 9 2016

One worrisome part of the log is:

(WARNING) 2016-06-08 20:47:45,260 browser_finder.FindBrowser:118  Multiple browsers of the same type found: [PossibleAndroidBrowser(browser_type=android-chromium), PossibleAndroidBrowser(browser_type=android-chromium), PossibleAndroidBrowser(browser_type=android-chromium), PossibleAndroidBrowser(browser_type=android-chromium), PossibleAndroidBrowser(browser_type=android-chromium), PossibleAndroidBrowser(browser_type=android-chromium), PossibleAndroidBrowser(browser_type=android-chromium)]

Is it possible that the Swarming support for these bots is failing to uninstall previously installed packages?

This probably isn't the cause of the failure, since the tests which are green also report this warning.

Another worrisome error is:

(CRITICAL) 2016-06-08 20:47:06,817 timeout_retry._LogLastException:118  ********************************************************************************
(CRITICAL) 2016-06-08 20:47:06,817 timeout_retry._LogLastException:120  Exception on thread TimeoutThread-1-for-delete_temporary_file(00cfafa449995210) (attempt 1 of 3)
(CRITICAL) 2016-06-08 20:47:06,817 timeout_retry._LogLastException:121  ********************************************************************************
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124  Traceback (most recent call last):
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124    File "/tmp/runSfzIFl/third_party/catapult/devil/devil/utils/timeout_retry.py", line 167, in Run
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124      error_log_func=error_log_func)
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124    File "/tmp/runSfzIFl/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 186, in JoinAll
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124      self._JoinAll(watcher, timeout)
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124    File "/tmp/runSfzIFl/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 158, in _JoinAll
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124      thread.ReraiseIfException()
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124    File "/tmp/runSfzIFl/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 81, in run
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124      self._ret = self._func(*self._args, **self._kwargs)
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124    File "/tmp/runSfzIFl/third_party/catapult/devil/devil/utils/timeout_retry.py", line 160, in <lambda>
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124      child_thread = reraiser_thread.ReraiserThread(lambda: func(*args, **kwargs),
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124    File "/tmp/runSfzIFl/third_party/catapult/devil/devil/android/decorators.py", line 47, in impl
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124      return f(*args, **kwargs)
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124    File "/tmp/runSfzIFl/third_party/catapult/devil/devil/android/sdk/adb_wrapper.py", line 238, in _RunAdbCmd
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124      args, output, status, device_serial)
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124  AdbCommandFailedError: (device: 00cfafa449995210) adb shell 'rm -f /data/local/tmp/temp_file-b3993c76a8a72': failed with exit status 255 and output:
(CRITICAL) 2016-06-08 20:47:06,818 timeout_retry._LogLastException:124  - error: protocol fault (couldn't read status): Success


The failure to clean up temporary files doesn't show up during the Telemetry-based tests which are currently passing on this bot.

Ken: I think that's because each of these host machines is hooked up to 7 devices.
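
The seven identical PossibleAndroidBrowser entries in the warning above are consistent with seven attached devices. A minimal sketch of how one might count the ready devices from `adb devices` output follows. The function name is hypothetical, and the third serial in the sample is made up to show a non-ready state.

```python
def parse_adb_devices(output):
    """Return the serials that `adb devices` reports in the 'device' state."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Ready devices appear as "<serial>\tdevice". The header line and
        # entries in other states (offline, unauthorized) are skipped.
        if len(parts) == 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


# Sample output; the first two serials appear in this thread, the third
# is a made-up example of a device that is not ready.
sample = (
    "List of devices attached\n"
    "00cfafa449995210\tdevice\n"
    "00b9d4ce76671554\tdevice\n"
    "0123456789abcdef\toffline\n"
)
print(parse_adb_devices(sample))
```

If the host really has seven ready devices, one browser candidate per device would account for the repeated entries without any stale packages being involved.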

Comment 3 by kbr@chromium.org, Jun 9 2016

I built the telemetry_gpu_test_run target with the gn args:

dcheck_always_on = true
ffmpeg_branding = "Chrome"
goma_dir = "/b/build/slave/cache/cipd/goma"
is_component_build = false
is_debug = false
proprietary_codecs = true
symbol_level = 1
target_cpu = "arm64"
target_os = "android"
use_goma = true

from:

https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20%28Nexus%205X%29/builds/5/steps/generate_build_files/logs/stdio

and ran one of the affected tests:

./content/test/gpu/run_gpu_test.py gpu_rasterization --browser=android-chromium

Visibly, the test failed to navigate to the target tab; it was stuck on about:blank forever. I didn't wait 5 minutes for it to time out.

Something seems broken with Telemetry's page navigation on this device, though this doesn't look exactly like the failure mode on the bots.

Comment 4 by kbr@chromium.org, Jun 9 2016

Building content_shell_apk with the same GN args and running:

./content/test/gpu/run_gpu_test.py gpu_rasterization --browser=android-content-shell

fails too on my device, with Android reporting a couple of times that ContentShell has crashed.

Maybe the failure was caused by the switch from Debug to Release+Asserts.

On my Nexus 5X "./content/test/gpu/run_gpu_test.py gpu_rasterization --browser=android-chrome" built with the args in #3 passes.
Looks like I didn't configure the proper environment for run_gpu_test to find my compiled browser, so it ran with a stock one and didn't offer me the "--browser=android-chromium" option.
Now I've configured it correctly, and the test still passes with --browser=android-chromium.
FWIW, "./content/test/gpu/run_gpu_test.py gpu_rasterization --browser=android-content-shell" also passed for me.
I think the problem may be that old versions of Chrome on the swarmed devices are not being cleaned up.
./content/test/gpu/run_gpu_test.py maps --browser=android-chromium
also passed on my device.
Seems like something is different in the swarmed devices' configuration, which causes screenshot capture to fail.
All right! I got "Failure: Could not capture screenshot" on my device as well!
It happens when the screen is off.
So, the solution should be to keep swarmed devices awake.
In that case, this is likely an issue with how swarming sets up its devices.
Owner: bpastene@chromium.org
Status: Assigned (was: Untriaged)
I'll configure swarming to turn device screens on right before a task.

I'll also look into killing any chrome-related processes before a task.

Comment 12 by kbr@chromium.org, Jun 9 2016

Thanks Ben.

https://chromium.googlesource.com/chromium/src/+/master/docs/android_test_instructions.md suggests that:
You MUST ensure that the screen stays on while testing:
adb shell svc power stayon usb
Or do this manually on the device: Settings -> Developer options -> Stay Awake.

The same doc also suggests for instrumentation tests:
In order to run instrumentation tests, you must leave your device screen ON and UNLOCKED. Otherwise, the test will timeout trying to launch an intent. Optionally you can disable screen lock under Settings -> Security -> Screen Lock -> None.

Makes sense to check this as well, while at it.
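
As a rough illustration of the stay-awake step quoted above, the sketch below builds the adb command for one device. The helper name is hypothetical and is not part of devil or swarming. It only constructs the argument list, so nothing is sent to a device here.

```python
def stay_awake_command(serial, mode="usb"):
    """Build the adb command that keeps a device's screen on while powered."""
    # `svc power stayon` takes one of these modes.
    allowed = ("true", "false", "usb", "ac", "wireless")
    if mode not in allowed:
        raise ValueError("unsupported stayon mode: %s" % mode)
    return ["adb", "-s", serial, "shell", "svc", "power", "stayon", mode]


# Serial taken from the log earlier in this thread.
print(" ".join(stay_awake_command("00cfafa449995210")))
```

The returned list could be handed to subprocess.check_call on a host with adb installed. Whether this setting survives a reboot on these devices is not established in this thread.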

Something else I've found in that document is
adb shell setprop debug.assert 1
Makes sense to do this as well, since we now rely on asserts in Release builds.
IIRC that theoretically enables java asserts but doesn't actually do anything on ART.
Aha! With content-shell screenshot capture succeeds even when screen is off. That explains why switching from content-shell to chromium has triggered the failures. Perhaps it would be worthwhile to be able to do screenshot capture when screen is off in chromium as well. kbr@, I trust you can find an owner for that?
Re #15, then I guess it's best not to enable debug.assert. It reported some weird errors for me when I tried it locally.

Comment 18 by kbr@chromium.org, Jun 10 2016

FYI: the reason the Telemetry tests are hanging on my device with --browser=android-chromium seems to be that my device isn't rooted. They're working fine on a rooted Nexus 5X.

It would be really helpful if we could force the screens on these devices to be kept awake per #13. This would get our tests green again. Is there a possibility of this being done in the short term? Thanks.
John: I remember there may be some devil API to check whether the device's screen is off and turn it on? If so, we can add something to Telemetry to make sure the device's screen is always on during the test.
Non-swarming bots currently handle all of this down in provision_devices.py (or stuff it calls) on the chromium side. The solution here is to port logic from there over to swarming (again). The relevant part in this case is the DETERMINISTIC_DEVICE_SETTINGS logic here: https://chromium.googlesource.com/chromium/src/+/master/build/android/provision_devices.py#271
Specifically, these are probably the settings we want in this case:

  stay_on_while_plugged_in: https://chromium.googlesource.com/chromium/src/+/master/build/android/pylib/device_settings.py#159
  lockscreen_disabled: https://chromium.googlesource.com/chromium/src/+/master/build/android/pylib/device_settings.py#173, https://chromium.googlesource.com/chromium/src/+/master/build/android/pylib/device_settings.py#184 (note that those are in different tables)
  screensaver_enabled: https://chromium.googlesource.com/chromium/src/+/master/build/android/pylib/device_settings.py#175

... though I imagine some of the others may be useful as well.
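
A minimal sketch of applying two of the settings named above through the `settings` shell command. The table placement and values below are assumptions for illustration, not copied from device_settings.py, and the helper name is hypothetical. The lockscreen setting is left out because, as noted above, it lives in different tables.

```python
# (namespace, key, value) triples. Placement and values are assumptions.
DESIRED_SETTINGS = [
    # Bitmask of power sources: 1 (AC) | 2 (USB).
    ("global", "stay_on_while_plugged_in", 3),
    ("secure", "screensaver_enabled", 0),
]


def settings_put_commands(serial, settings=DESIRED_SETTINGS):
    """Build one `adb shell settings put` command per desired setting."""
    return [
        ["adb", "-s", serial, "shell", "settings", "put", ns, key, str(value)]
        for ns, key, value in settings
    ]


for cmd in settings_put_commands("00cfafa449995210"):
    print(" ".join(cmd))
```

Keeping the desired values in one table makes it easy to add further entries once it is clear which of the other settings matter.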
Status: Started (was: Assigned)
I'll start porting some of those over to swarming's before task hook.

I'll need to experiment locally to see if these settings can take effect without a device reboot. If not, we may need to just permanently turn on/unlock screens since we can't afford to reboot devices before every task.
Yeah, not sure about that. I know some of the things we set need a reboot to take effect, but not all.

If they do need a reboot, perhaps we could add the settings wherever we do the periodic device reboots?
Looks like we can stop the phone from automatically locking the screen after a period of idleness with lockscreen.disabled, but it still boots into the lockscreen. We can get out of it with 'input keyevent 82' but I'd rather disable it altogether.

Time to start diving into these sqlite tables on the phone...
Random thought: I wonder if we're only seeing this issue on N5Xs because they've never run provision_devices before, whose effects seem to persist through reboots.

Our N5s, on the other hand, were buildbot devices in another life, and so have all had provision run on them many times. Hence why their screens are always on/unlocked.

If we wanted a really immediate fix, we could isolate provision_devices.py and run it once on all bots with device_type == bullhead. But still, I'd rather add the necessary logic to swarming's setup code.
Hm, that could be, and if that's the case a run of provision_devices.py would solve the issue in the short term. Given that we want to occasionally factory-reset or flash devices in the future, we should definitely ensure that swarming handles this, though.
Looks like the lockscreen.disabled row in locksettings.db did the trick. (Nuking that entire database also seems to work. I'm looking for a reason not to just 'rm /data/system/locksettings.db' and am having a hard time finding one.)

Additionally, these settings need a reboot to take effect, so this'll need to be done at bot_startup which later reboots all devices.
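
To make the locksettings.db change concrete, here is a sketch that sets a lockscreen.disabled row in an in-memory SQLite database. The table schema (name, user, value columns) is an assumption about how Android stores lock settings, and the function name is hypothetical. On a real device the file is /data/system/locksettings.db and editing it needs root.

```python
import sqlite3


def disable_lockscreen(conn, user=0):
    """Ensure exactly one lockscreen.disabled=1 row exists for the user."""
    # Delete first so that repeated runs do not accumulate duplicate rows.
    conn.execute(
        "DELETE FROM locksettings WHERE name = ? AND user = ?",
        ("lockscreen.disabled", user))
    conn.execute(
        "INSERT INTO locksettings (name, user, value) VALUES (?, ?, ?)",
        ("lockscreen.disabled", user, "1"))
    conn.commit()


# In-memory stand-in for the device database; the schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE locksettings ("
    "_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "name TEXT, user INTEGER, value TEXT)")
disable_lockscreen(conn)
disable_lockscreen(conn)  # Running it twice still leaves a single row.
rows = conn.execute("SELECT name, user, value FROM locksettings").fetchall()
print(rows)
```

Because the setting only takes effect after a reboot, a step like this would belong wherever the periodic device reboots are scheduled.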

Comment 28 by kbr@chromium.org, Jun 13 2016

How long will it take to deploy this update to the swarming pool? If it will take more than a couple of days then perhaps we should switch these tests back to using content_shell to get them green again.

https://chromereviews.googleplex.com/448847013/

No reason not to get that committed today. I'll do some pinging.

Comment 30 by kbr@chromium.org, Jun 14 2016

Could I please ask for a status update on this? Can these configuration changes land today? If not I want to switch the bots back to content_shell to get them green again.

Sorry for the delay. This is fixed with the CL I already mentioned. I'll try to get it landed today in between troopering once I get an owner's lgtm, but if you absolutely can't wait then go ahead and make your test changes.

Comment 32 by kbr@chromium.org, Jun 14 2016

I'd strongly prefer to push forward with your CL https://chromereviews.googleplex.com/448847013/ . Please tell me if you need help getting reviews. Thanks.

Project Member

Comment 33 by bugdroid1@chromium.org, Jun 15 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/6cd163d625fe21578e8b1c96bbf870530510880f

commit 6cd163d625fe21578e8b1c96bbf870530510880f
Author: bpastene <bpastene@google.com>
Date: Wed Jun 15 00:56:25 2016

The bot is still red. Is there something else needed (bot restart?) for the change in #33 to be applied?
It needs to be deployed to prod. I'll do that now.
Screens should be on for good now. Picking a random bot:

~/adb -s 00b9d4ce76671554 shell dumpsys input_method | grep mInteractive
mSystemReady=true mInteractive=true
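
The manual check above can be scripted. The sketch below parses the mInteractive field from `dumpsys input_method` output; the function name is hypothetical, and it returns None when the field is missing so that callers can tell "screen off" apart from "could not tell".

```python
import re


def is_screen_interactive(dumpsys_output):
    """Return True/False from the mInteractive field, or None if absent."""
    match = re.search(r"mInteractive=(true|false)", dumpsys_output)
    if match is None:
        return None
    return match.group(1) == "true"


print(is_screen_interactive("mSystemReady=true mInteractive=true"))
```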

A consequence of this is that the devices now run warmer and are getting quarantined for overheating:
http://shortn/_8V6AD5e18g

I may need to increase the maximum allowed temperature.
Thanks a lot!
I see that build 172 doesn't have screenshot capturing problem.

Tests still fail, though.
When I ran it locally, I got a similar problem when screen was rotated.
Could you please configure the devices to stay in portrait view when the device is rotated?

And for temperature problem - maybe setting brightness to minimum will help?
Yeah, it's already at the dimmest I could bring it. I think we just have to bump up the threshold a bit. The battery temperatures are all unaffected, and that's where we really care about temps, so bumping only for non-battery sensors seems fine to me.
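
A sketch of the split-threshold idea: keep a strict limit for the battery sensor and a looser one for everything else. The sensor names and both limits are hypothetical values chosen for illustration, not the thresholds swarming actually uses.

```python
def should_quarantine(temps_c, battery_limit_c=40.0, other_limit_c=50.0):
    """Return True if any sensor reading exceeds its limit (degrees C)."""
    for sensor, temp in temps_c.items():
        # Only the battery keeps the strict limit; all other sensors
        # share the raised one.
        limit = battery_limit_c if sensor == "battery" else other_limit_c
        if temp > limit:
            return True
    return False


print(should_quarantine({"battery": 35.0, "cpu": 48.0}))
```

With these made-up numbers, a warm CPU sensor at 48 degrees would no longer quarantine the device, while a battery at 42 degrees still would.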

As for the screen orientation, I'll add that to the setup as well after the temperature issue has been sorted out. Probably just have to play with the accelerometer:
https://codesearch.chromium.org/chromium/src/build/android/pylib/device_settings.py?rcl=0&l=180

Note that all the phones in our labs are lying on their sides horizontally, so the N5Xs are all probably in landscape mode at the moment. I'll get to that.
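
One possible way to pin the orientation, sketched as adb commands: turn off accelerometer-driven rotation and set the user rotation to 0 (the device's natural orientation, portrait on a phone). The helper name is hypothetical, and this is not necessarily what the eventual swarming change does.

```python
def lock_portrait_commands(serial):
    """Build adb commands that disable auto-rotate and select rotation 0."""
    base = ["adb", "-s", serial, "shell", "settings", "put", "system"]
    return [
        base + ["accelerometer_rotation", "0"],  # stop following the sensor
        base + ["user_rotation", "0"],           # 0 = natural orientation
    ]


for cmd in lock_portrait_commands("00b9d4ce76671554"):
    print(" ".join(cmd))
```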

Comment 39 by kbr@chromium.org, Jun 15 2016

Blocking: 620486

Comment 40 by kbr@chromium.org, Jun 15 2016

Thanks Ben for solving the primary problem with the screens being disabled.

The landscape mode issue is still a significant problem. Two of our tests are failing because of it. Could this please be prioritized?

Now that I'm no longer troopering, I can spend more time on this.
Project Member

Comment 42 by bugdroid1@chromium.org, Jun 17 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/bbc7b17675046c02c9bb1d39ce4a2a8dd5119edc

commit bbc7b17675046c02c9bb1d39ce4a2a8dd5119edc
Author: bpastene <bpastene@google.com>
Date: Fri Jun 17 19:47:31 2016

Screen orientation has been pushed out, and thermal threshold raised. Let me know what the next challenge is :)
Status: Fixed (was: Started)
Thanks, https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20%28Nexus%205X%29/builds/214 is mostly green!
Only WebglConformance.conformance_textures_misc_tex_image_and_uniform_binding_bugs failed, but I think it's a test problem (also failed on Nexus 6).

I see that some bots are still quarantined (27), but I guess it's not possible to do something about them, as battery temperature is too high?

I think this bug is fixed, and we should open new bugs if more issues pop up.
