New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.
Starred by 5 users

Issue metadata

Status: Verified
Owner:
Closed: Feb 2
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

cheets_StartAndroid.stress timeout

Project Member Reported by ddavenp...@chromium.org, Nov 10 2017

Issue description

This failed on kevin-paladin:
http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/155404104-chromeos-test/
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/2998

It appears to have timed out when starting the browser backend, and may not necessarily be an issue with arc:

11/10 06:03:00.959 ERROR|           browser:0068| Failed with TimeoutException while starting the browser backend.
 
Specifically this times out waiting for oobe_exists in cros_browser_backend.py:172

py_utils.WaitFor(lambda: self.oobe_exists, 30)
Saw this again on cheets_StartAndroid.stress on auron_yuna-paladin:

https://luci-milo.appspot.com/buildbot/chromeos/auron_yuna-paladin/1653
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/162250346-chromeos-test/chromeos4-row10-rack8-host9/

12/13 04:37:26.154 ERROR|           browser:0050| Failed with TimeoutException while starting the browser backend.
TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30).

Cc: deanliao@chromium.org dtor@chromium.org ecgh@chromium.org michae...@chromium.org risan@chromium.org
Owner: levarum@chromium.org
Status: Assigned (was: Untriaged)
At first blush, his looks like a product bug, presumably in
ARC++, not caused by infrastructure.

levarum@ - could you do some initial triage?  The top-level
question is "is this problem caused by a CL in the CQ run,
or is it a bug in the tree?"

Cc: lhchavez@chromium.org yusukes@chromium.org
Owner: ----
Status: Available (was: Assigned)
In

cheets_StartAndroid.stress/sysinfo/var/log/chrome/chrome_20171213-043621

this looks suspicious:

[28319:28319:1213/043626.291292:VERBOSE1:arc_session_manager.cc(563)] ARC opt-in. Starting ARC session.
[28319:28319:1213/043626.291363:VERBOSE1:arc_session_runner.cc(185)] Session start requested: FULL_INSTANCE
[28319:28319:1213/043626.291377:VERBOSE1:arc_session_runner.cc(258)] Starting ARC instance
[28319:28319:1213/043626.730362:ERROR:customization_document.cc(623)] Customization manifest is missing on server: https://ssl.gstatic.com/chrome/chromeos-customization/yuna.json
[28319:28319:1213/043626.759986:WARNING:memory_kills_monitor.cc(175)] Monitoring has been started
[28319:28319:1213/043627.303373:INFO:arc_session_impl.cc(459)] ARC ready.
[28319:28319:1213/043627.372469:ERROR:server.cc(1915)] Not implemented reached in void exo::wayland::(anonymous namespace)::remote_surface_set_rectangular_shadow_background_opacity_DEPRECATED(wl_client *, wl_resource *, wl_fixed_t)
[28319:28319:1213/043630.288315:VERBOSE1:arc_net_host_impl.cc(721)] GetDefaultNetwork: default network is /service/2
[28319:28319:1213/043630.299925:ERROR:object_proxy.cc(626)] Failed to call method: org.chromium.Midis.BootstrapMojoConnection: object_path= /org/chromium/Midis: org.freedesktop.DBus.Error.ServiceUnknown: The name org.chromium.Midis was not provided by any .service files
[28319:28319:1213/043630.299986:ERROR:arc_midis_bridge.cc(65)] ArcMidisBridge had a failure in D-Bus with the daemon.
[28319:28319:1213/043631.861828:VERBOSE1:connection_holder.h(384)] Instance arc::mojom::AppInstance not available.
[28319:28319:1213/043634.981454:VERBOSE1:connection_holder.h(384)] Instance arc::mojom::IntentHelperInstance not available.
[28319:28319:1213/043634.981983:VERBOSE1:arc_settings_service.cc(241)] OnPrefChanged: arc.backup_restore.enabled
[28319:28319:1213/043634.982061:VERBOSE1:arc_settings_service.cc(241)] OnPrefChanged: arc.location_service.enabled
[28319:28319:1213/043648.043337:VERBOSE1:arc_net_host_impl.cc(758)] New default network: /service/2 (ethernet)
[28319:28319:1213/043648.043774:WARNING:statistics_provider.cc(370)] Requested statistic not found: Product_S/N
[28319:28319:1213/043648.044319:VERBOSE1:arc_net_host_impl.cc(758)] New default network: /service/2 (ethernet)
[28319:28319:1213/043648.198988:VERBOSE1:arc_session_runner.cc(236)] OnShutdown
[28319:28319:1213/043648.199078:VERBOSE1:arc_bridge_host_impl.cc(281)] Mojo connection lost
[28319:28319:1213/043648.200962:VERBOSE1:arc_net_host_impl.cc(912)] AndroidVpnStateChanged: state=ConnectionStateType::NOT_CONNECTED service=
[28319:28319:1213/043648.201321:INFO:arc_session_runner.cc(287)] ARC stopped: SHUTDOWN
[28319:28319:1213/043648.256631:VERBOSE1:arc_session_runner.cc(236)] OnShutdown
[28319:28319:1213/043648.306582:WARNING:pref_notifier_impl.cc(28)] Init observer found at shutdown.

@yusukes, @lhchavez are any of these messages fatal?
arc_session_runner / arc_bridge_host_impl are consequences (not causes) of another fatal issue in the Android side.
 Issue 794719  has been merged into this issue.
Still happening:

https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-release/builds/1808

2/27 05:23:52.579 DEBUG|          autotest:1310| AUTOTEST_STATUS::		FAIL	cheets_StartAndroid.stress	cheets_StartAndroid.stress	timestamp=1514381032	localtime=Dec 27 05:23:52	Unhandled TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30).
12/27 05:23:52.579 INFO |        server_job:0218| 		FAIL	cheets_StartAndroid.stress	cheets_StartAndroid.stress	timestamp=1514381032	localtime=Dec 27 05:23:52	Unhandled TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30).

Any ARC++ engineer can take this?

If it's helpful, here's the relevant part of cheets_StartAndroid.stress.DEBUG from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/165673689-chromeos-test/chromeos6-row2-rack24-host18/cheets_StartAndroid.stress/debug/:

12/27 05:23:11.914 INFO |cheets_StartAndroi:0048| cheets_StartAndroid iteration 8
12/27 05:23:11.915 DEBUG|          arc_util:0040| ARC is enabled in mode enabled
12/27 05:23:11.916 DEBUG|          arc_util:0040| ARC is enabled in mode enabled
12/27 05:23:11.928 INFO |    browser_finder:0123| Chose browser: PossibleCrOSBrowser(browser_type=system)
12/27 05:23:11.929 DEBUG|    cros_interface:0058| sh -c mktemp -d /tmp/extension_XXXXX 
12/27 05:23:11.930 DEBUG|      global_hooks:0056| ['sh', '-c', 'mktemp -d /tmp/extension_XXXXX']
12/27 05:23:11.946 DEBUG|    cros_interface:0067|  > stdout=[/tmp/extension_gCtUt
], stderr=[]
12/27 05:23:11.947 DEBUG|      global_hooks:0056| ['cp', '-r', '/usr/local/autotest/common_lib/cros/autotest_private_ext', '/tmp/extension_gCtUt']
12/27 05:23:11.965 DEBUG|    cros_interface:0058| sh -c chown -R chronos:chronos /tmp/extension_gCtUt 
12/27 05:23:11.965 DEBUG|      global_hooks:0056| ['sh', '-c', 'chown -R chronos:chronos /tmp/extension_gCtUt']
12/27 05:23:11.978 DEBUG|    cros_interface:0067|  > stdout=[], stderr=[]
12/27 05:23:11.978 INFO |    cros_interface:0551| (Re)starting the ui (logs the user out)
12/27 05:23:11.979 DEBUG|      global_hooks:0056| ['sh', '-c', 'systemctl']
12/27 05:23:11.988 DEBUG|    cros_interface:0058| sh -c stop ui 
12/27 05:23:11.988 DEBUG|      global_hooks:0056| ['sh', '-c', 'stop ui']
12/27 05:23:12.473 DEBUG|    cros_interface:0067|  > stdout=[ui stop/waiting
], stderr=[]
12/27 05:23:12.475 DEBUG|    cros_interface:0414| rm -rf /var/lib/whitelist/*
12/27 05:23:12.477 DEBUG|      global_hooks:0056| ['sh', '-c', 'rm -rf /var/lib/whitelist/*']
12/27 05:23:12.501 DEBUG|    cros_interface:0414| rm -rf /home/chronos/Local\ State
12/27 05:23:12.503 DEBUG|      global_hooks:0056| ['sh', '-c', 'rm -rf /home/chronos/Local\\ State']
12/27 05:23:12.525 DEBUG|      global_hooks:0056| ['sh', '-c', 'systemctl']
12/27 05:23:12.548 DEBUG|      global_hooks:0056| ['sh', '-c', 'status ui']
12/27 05:23:12.567 DEBUG|    cros_interface:0443| IsServiceRunning(ui)->False
12/27 05:23:12.569 DEBUG|    cros_interface:0058| sh -c start ui 
12/27 05:23:12.570 DEBUG|      global_hooks:0056| ['sh', '-c', 'start ui']
12/27 05:23:12.826 DEBUG|    cros_interface:0067|  > stdout=[ui start/running, process 32413
], stderr=[]
12/27 05:23:12.829 DEBUG|      global_hooks:0056| ['sh', '-c', '/bin/ps --no-headers -A -o pid,ppid,args:4096,state']
12/27 05:23:12.876 DEBUG|    cros_interface:0367| ListProcesses(<predicate>)->[292 processes]
12/27 05:23:12.878 DEBUG|    cros_interface:0058| sh -c cryptohome --action=remove --force --user=test@test.test 
12/27 05:23:12.879 DEBUG|      global_hooks:0056| ['sh', '-c', 'cryptohome --action=remove --force --user=test@test.test']
12/27 05:23:13.131 DEBUG|    cros_interface:0067|  > stdout=[Remove succeeded.
], stderr=[]
12/27 05:23:13.132 DEBUG|    cros_interface:0414| rm -rf /home/chronos/DevToolsActivePort
12/27 05:23:13.132 DEBUG|      global_hooks:0056| ['sh', '-c', 'rm -rf /home/chronos/DevToolsActivePort']
12/27 05:23:13.149 DEBUG|      global_hooks:0056| ['sh', '-c', '/bin/ps --no-headers -A -o pid,ppid,args:4096,state']
12/27 05:23:13.202 DEBUG|    cros_interface:0367| ListProcesses(<predicate>)->[296 processes]
12/27 05:23:13.204 INFO |cros_browser_backe:0159| Restarting Chrome (pid=32428) with remote port
12/27 05:23:13.204 INFO |cros_browser_backe:0166| dbus-send --system --type=method_call --dest=org.chromium.SessionManager /org/chromium/SessionManager org.chromium.SessionManagerInterface.EnableChromeTesting boolean:true array:string:"--disable-arc-opt-in-verification,--enable-net-benchmarking,--metrics-recording-only,--no-default-browser-check,--no-first-run,--enable-gpu-benchmarking,--autoplay-policy=no-user-gesture-required,--load-extension=/tmp/extension_gCtUt/autotest_private_ext,--disable-component-extensions-with-background-pages,--disable-default-apps,--disable-search-geolocation-disclosure,--enable-smooth-scrolling,--enable-threaded-compositing,--remote-debugging-port=0,--start-maximized,--ash-disable-system-sounds,--allow-failed-policy-fetch-for-test,--oobe-skip-postlogin,--disable-logging-redirect,--vmodule=*/chromeos/net/*=2\,*/chromeos/login/*=2\,chrome_browser_main_posix=2,--disable-gaia-services"
12/27 05:23:13.205 DEBUG|    cros_interface:0058| sh -c dbus-send --system --type=method_call --dest=org.chromium.SessionManager /org/chromium/SessionManager org.chromium.SessionManagerInterface.EnableChromeTesting boolean:true array:string:"--disable-arc-opt-in-verification,--enable-net-benchmarking,--metrics-recording-only,--no-default-browser-check,--no-first-run,--enable-gpu-benchmarking,--autoplay-policy=no-user-gesture-required,--load-extension=/tmp/extension_gCtUt/autotest_private_ext,--disable-component-extensions-with-background-pages,--disable-default-apps,--disable-search-geolocation-disclosure,--enable-smooth-scrolling,--enable-threaded-compositing,--remote-debugging-port=0,--start-maximized,--ash-disable-system-sounds,--allow-failed-policy-fetch-for-test,--oobe-skip-postlogin,--disable-logging-redirect,--vmodule=*/chromeos/net/*=2\,*/chromeos/login/*=2\,chrome_browser_main_posix=2,--disable-gaia-services" 
12/27 05:23:13.205 DEBUG|      global_hooks:0056| ['sh', '-c', 'dbus-send --system --type=method_call --dest=org.chromium.SessionManager /org/chromium/SessionManager org.chromium.SessionManagerInterface.EnableChromeTesting boolean:true array:string:"--disable-arc-opt-in-verification,--enable-net-benchmarking,--metrics-recording-only,--no-default-browser-check,--no-first-run,--enable-gpu-benchmarking,--autoplay-policy=no-user-gesture-required,--load-extension=/tmp/extension_gCtUt/autotest_private_ext,--disable-component-extensions-with-background-pages,--disable-default-apps,--disable-search-geolocation-disclosure,--enable-smooth-scrolling,--enable-threaded-compositing,--remote-debugging-port=0,--start-maximized,--ash-disable-system-sounds,--allow-failed-policy-fetch-for-test,--oobe-skip-postlogin,--disable-logging-redirect,--vmodule=*/chromeos/net/*=2\\,*/chromeos/login/*=2\\,chrome_browser_main_posix=2,--disable-gaia-services"']
12/27 05:23:13.218 DEBUG|    cros_interface:0067|  > stdout=[], stderr=[]
12/27 05:23:13.219 DEBUG|      global_hooks:0056| ['sh', '-c', '/bin/ps --no-headers -A -o pid,ppid,args:4096,state']
12/27 05:23:13.253 DEBUG|    cros_interface:0367| ListProcesses(<predicate>)->[291 processes]
12/27 05:23:13.255 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpk10gwj)
12/27 05:23:13.256 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpWAVKAr)
12/27 05:23:13.357 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpbKWMzp)
12/27 05:23:13.458 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpZK5VNp)
12/27 05:23:13.560 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpMm6EZ_)
12/27 05:23:13.663 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpBEV5py)
12/27 05:23:13.765 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpmApfZY)
12/27 05:23:13.866 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpolw4IL)
12/27 05:23:13.967 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpe370dk)
12/27 05:23:13.968 DEBUG|    cros_interface:0339| GetFileContents(/home/chronos/DevToolsActivePort)->36495
/devtools/browser/d9a6aefb-c96b-4463-b4af-ea6ec643e6df
12/27 05:23:13.968 INFO |cros_browser_backe:0076| Discovered ephemeral port 36495
12/27 05:23:13.969 INFO |cros_browser_backe:0077| Browser target: /devtools/browser/d9a6aefb-c96b-4463-b4af-ea6ec643e6df
12/27 05:23:48.902 ERROR|           browser:0050| Failed with TimeoutException while starting the browser backend.

Cc: cmtm@chromium.org
Adding this week's constable.
Cc: cmtm@google.com
I'll start looking into this.
Labels: -Pri-3 Pri-1
Cheets_StartAndroid.stress also just killed a CQ run.  I am not sure if it's the same problem, but maybe you can look into this too?  Thanks.

https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-paladin/builds/4886


Owner: cmtm@chromium.org
Status: Started (was: Available)
Regarding the minnie failure from #13:

there seems to be a disconnect between 14:48 and 14:51 as shown in autoserv.DEBUG:


12/27 14:48:51.551 DEBUG|          autotest:1310| AUTOTEST_STATUS::START	----	----	timestamp=1514414930	localtime=Dec 27 14:48:50	
12/27 14:48:51.552 INFO |        server_job:0218| START	----	----	timestamp=1514414930	localtime=Dec 27 14:48:50	
12/27 14:48:51.552 INFO |        server_job:1401| Parsing lines in fast mode
12/27 14:48:51.553 ERROR|   logging_manager:0626| tko parser: parsing partial test None CLIENT_JOB.0
12/27 14:48:51.554 ERROR|   logging_manager:0626| tko parser: RUNNING: RUNNING
12/27 14:48:51.554 ERROR|   logging_manager:0626| CLIENT_JOB.0
12/27 14:48:51.554 ERROR|   logging_manager:0626| 
12/27 14:48:51.719 DEBUG|          autotest:1310| AUTOTEST_STATUS::	START	cheets_StartAndroid.stress	cheets_StartAndroid.stress	timestamp=1514414930	localtime=Dec 27 14:48:50	
12/27 14:48:51.720 INFO |        server_job:0218| 	START	cheets_StartAndroid.stress	cheets_StartAndroid.stress	timestamp=1514414930	localtime=Dec 27 14:48:50	
12/27 14:48:51.720 INFO |        server_job:1401| Parsing lines in fast mode
12/27 14:48:51.720 ERROR|   logging_manager:0626| tko parser: parsing partial test cheets_StartAndroid.stress cheets_StartAndroid.stress
12/27 14:48:51.721 ERROR|   logging_manager:0626| tko parser: RUNNING: RUNNING
12/27 14:48:51.721 ERROR|   logging_manager:0626| Subdir: cheets_StartAndroid.stress
12/27 14:48:51.721 ERROR|   logging_manager:0626| Testname: cheets_StartAndroid.stress
12/27 14:48:51.722 ERROR|   logging_manager:0626| 
12/27 14:51:02.146 DEBUG|          autotest:0985| Result exit status is 255.
12/27 14:51:02.147 DEBUG|             utils:0212| Running 'ping chromeos4-row9-rack9-host6 -w1 -c1'
12/27 14:51:02.364 DEBUG|             utils:0280| [stdout] PING chromeos4-row9-rack9-host6.cros.corp.google.com (100.115.200.102) 56(84) bytes of data.
12/27 14:51:02.364 DEBUG|             utils:0280| [stdout] 64 bytes from 100.115.200.102: icmp_seq=1 ttl=59 time=0.721 ms
12/27 14:51:02.365 DEBUG|             utils:0280| [stdout] 
12/27 14:51:02.365 DEBUG|             utils:0280| [stdout] --- chromeos4-row9-rack9-host6.cros.corp.google.com ping statistics ---
12/27 14:51:02.365 DEBUG|             utils:0280| [stdout] 1 packets transmitted, 1 received, 0% packet loss, time 0ms
12/27 14:51:02.365 DEBUG|             utils:0280| [stdout] rtt min/avg/max/mdev = 0.721/0.721/0.721/0.000 ms
12/27 14:51:02.385 DEBUG|          ssh_host:0301| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi' from '_do_run|execute_control|_diagnose_dut|get_boot_id|run|run_very_slowly'
12/27 14:51:02.396 INFO |     ssh_multiplex:0076| Master ssh connection to chromeos4-row9-rack9-host6 is down.
12/27 14:51:02.397 DEBUG|     ssh_multiplex:0118| Nuking ssh master_job
12/27 14:51:02.397 DEBUG|     ssh_multiplex:0123| Cleaning ssh master_tempdir
12/27 14:51:02.398 INFO |     ssh_multiplex:0092| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_BOwe3Kssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row9-rack9-host6'
12/27 14:51:02.398 DEBUG|             utils:0212| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_BOwe3Kssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row9-rack9-host6'
12/27 14:51:06.211 DEBUG|             utils:0280| [stdout] 4db3912d-e8da-4d7a-b032-4530c6740ef7
12/27 14:51:06.253 INFO |        server_job:0218| 		ABORT	----	----	timestamp=1514415066	localtime=Dec 27 14:51:06	Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
12/27 14:51:06.253 INFO |        server_job:1401| Parsing lines in fast mode


at that time the syslog shows this (NOTE: time is GMT)


2017-12-27T22:48:50.131903+00:00 NOTICE ag[5015]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local\
/autotest/bin/autotestd_monitor /tmp/autoserv-SgAQev 0 0)
2017-12-27T22:48:51.005036+00:00 NOTICE root[5049]: autotest runtest cheets_StartAndroid
2017-12-27T22:48:52.833212+00:00 NOTICE autotest[5070]: 14:48:52.826 WARNI|  external_modules:0057| Unable to import cv2 due to: In\
correct cv2 version found, expected 2.4.8 <= version < 3.0.0, found version $Rev: 4557 $
2017-12-27T22:48:54.480356+00:00 INFO session_manager[1793]: [INFO:browser_job.cc(164)] Terminating process: exiting cleanly
2017-12-27T22:48:54.480735+00:00 INFO session_manager[1793]: [INFO:system_utils_impl.cc(93)] Sending 15 to 1845 as 1000
2017-12-27T22:48:54.481083+00:00 INFO session_manager[1793]: [INFO:container_manager_impl.cc(193)] Killing off container android
2017-12-27T22:48:54.481335+00:00 INFO session_manager[1793]: [INFO:system_utils_impl.cc(93)] Sending 9 to 2268 as 0
2017-12-27T22:48:55.222584+00:00 INFO session_manager[1793]: [INFO:browser_job.cc(180)] Cleaned up child 1845
2017-12-27T22:48:55.222680+00:00 INFO session_manager[1793]: [INFO:session_manager_impl.cc(419)] emitting D-Bus signal SessionState\
Changed:stopped

I have no idea what cv2 is, but could this be the problem?

Mmm I don't think cv2 is the problem because nothing happens for almost two seconds.  Then the session manager shuts something down---maybe android---but it appears to do that cleanly.  So, no obvious failure from my limited viewpoint.
Labels: -Pri-1 Pri-0
Happened on auron_yuma paladin.  This is breaking the CQ.

On auron_yuma (https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/165946603-chromeos-test/chromeos4-row10-rack8-host3/cheets_StartAndroid.stress/sysinfo/var/log/chrome/), there is a timeout waiting for the browser backend. The last thing it does is starting the Android container (seemingly successfully), then a bunch of SELinux errors, unclear if they are critical (probably not).

2017-12-28T15:51:25.889844+00:00 INFO session_manager[2028]: [INFO:session_manager_impl.cc(1831)] Started Android container with PID 2373
2017-12-28T15:51:25.896632+00:00 WARNING kernel: [ 1241.572229] audit_printk_skb: 9 callbacks suppressed
2017-12-28T15:51:25.896662+00:00 NOTICE kernel: [ 1241.572253] audit: type=1400 audit(1514476285.894:966): avc:  denied  { module_request } for  pid=2373 comm="init" kmod="personality-8" scontext=u:r:init:s0 tcontext=u:r:kernel:s0 tclass=system permissive=0
2017-12-28T15:51:25.896666+00:00 NOTICE kernel: [ 1241.572417] audit: type=1400 audit(1514476285.894:967): avc:  denied  { use } for  pid=2373 comm="init" path="/init" dev="loop1" ino=15 scontext=u:r:init:s0 tcontext=u:r:chromeos:s0 tclass=fd permissive=0
2017-12-28T15:51:25.917951+00:00 DEBUG kernel: [ 1241.593335] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
2017-12-28T15:51:25.917967+00:00 NOTICE kernel: [ 1241.593518] audit: type=1400 audit(1514476285.915:968): avc:  denied  { module_request } for  pid=2373 comm="init" kmod="fs-configfs" scontext=u:r:init:s0 tcontext=u:r:kernel:s0 tclass=system permissive=0
2017-12-28T15:51:25.917601+00:00 INFO session_manager[2028]: [INFO:server_backed_state_key_generator.cc(126)] Stable device secret missing!
2017-12-28T15:51:25.922608+00:00 DEBUG kernel: [ 1241.598295] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
2017-12-28T15:51:25.923592+00:00 NOTICE kernel: [ 1241.599557] audit: type=1400 audit(1514476285.921:969): avc:  denied  { mounton } for  pid=2373 comm="init" path="/sys/kernel/debug" dev="tmpfs" ino=12721 scontext=u:r:init:s0 tcontext=u:object_r:debugfs:s0 tclass=dir permissive=0
2017-12-28T15:51:26.268617+00:00 DEBUG kernel: [ 1241.943677] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
2017-12-28T15:51:26.272600+00:00 INFO kernel: [ 1241.947785] binder: 2408:2408 transaction failed 29189, size 0-0
2017-12-28T15:51:26.273600+00:00 NOTICE kernel: [ 1241.949111] audit: type=1400 audit(1514476286.271:970): avc:  denied  { module_request } for  pid=2373 comm="init" kmod="fs-configfs" scontext=u:r:init:s0 tcontext=u:r:kernel:s0 tclass=system permissive=0
2017-12-28T15:51:26.853598+00:00 NOTICE kernel: [ 1242.528481] audit: type=1400 audit(1514476286.851:971): avc:  denied  { dac_read_search } for  pid=2412 comm="main" capability=2  scontext=u:r:zygote:s0 tcontext=u:r:zygote:s0 tclass=capability permissive=0
2017-12-28T15:51:26.853612+00:00 NOTICE kernel: [ 1242.528586] audit: type=1400 audit(1514476286.851:972): avc:  denied  { dac_read_search } for  pid=2412 comm="main" capability=2  scontext=u:r:zygote:s0 tcontext=u:r:zygote:s0 tclass=capability permissive=0
2017-12-28T15:51:26.853614+00:00 NOTICE kernel: [ 1242.528670] audit: type=1400 audit(1514476286.851:973): avc:  denied  { dac_read_search } for  pid=2412 comm="main" capability=2  scontext=u:r:zygote:s0 tcontext=u:r:zygote:s0 tclass=capability permissive=0
2017-12-28T15:51:26.853615+00:00 NOTICE kernel: [ 1242.529300] audit: type=1400 audit(1514476286.851:974): avc:  denied  { dac_read_search } for  pid=2412 comm="main" capability=2  scontext=u:r:zygote:s0 tcontext=u:r:zygote:s0 tclass=capability permissive=0
2017-12-28T15:51:26.856586+00:00 NOTICE kernel: [ 1242.531469] audit: type=1400 audit(1514476286.854:975): avc:  denied  { dac_read_search } for  pid=2412 comm="main" capability=2  scontext=u:r:zygote:s0 tcontext=u:r:zygote:s0 tclass=capability permissive=0
2017-12-28T15:51:56.908797+00:00 NOTICE autotest[2474]: 07:51:56.902 ERROR|           browser:0050| Failed with TimeoutException while starting the browser backend.


In the chrome logs, there is a "Not implemented reached" error sandwiched between ARC instance start and shutting down after the autotest timeouts waiting for the browser:

[2207:2207:1228/075124.828986:VERBOSE1:arc_session_runner.cc(258)] Starting ARC instance
[2207:2207:1228/075126.327301:ERROR:server.cc(1915)] Not implemented reached in void exo::wayland::(anonymous namespace)::remote_surface_set_rectangular_shadow_background_opacity_DEPRECATED(wl_client *, wl_resource *, wl_fixed_t)
[2207:2207:1228/075157.097507:VERBOSE1:arc_session_runner.cc(236)] OnShutdown

Now that we have this bug tracking the current failure, let's use it. But I'll note that cheets_StartAndroid.stress fails anytime we get a bad android, so we should not conflate these issues.

Let's mark this fixed once we're over the current problem.

cmtm: Have you tried reproing this problem with the current arc++?
I _think_  issue 798000  is another symptom of this bad android container image. Keeping it separate until proven the same.
It seems that the timeout occurs here:
https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py#L177

The timeout is 30s. It seems to be that this operation usually takes less than 8s and the time taken is very stable, so it's unlikely to be an issue where it sometimes "just takes longer".

Also of note is that this occurs before the call to enable android here:
https://cs.corp.google.com/chromeos_internal/src/third_party/autotest/files/client/common_lib/cros/chrome.py?l=175

Chrome is still started with ARC enabled before this. I'll have to dig into what this actually means.
Owner: cmtm@google.com
Re #20: if the timeout indeed occurs around BindDevToolsClient, could this be the same/similar issue as  issue 797620 ?
Owner: linben@chromium.org
Cc: linben@chromium.org ahass...@chromium.org diand...@chromium.org uekawa@chromium.org hiroh@chromium.org
 Issue 799948  has been merged into this issue.
Another failure.

https://luci-milo.appspot.com/buildbot/chromeos/auron_yuna-paladin/1917

cheets_StartAndroid.stress                  FAIL: Unhandled TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30).

Project Member

Comment 28 by bugdroid1@chromium.org, Jan 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/9a225c02d1f78511f44bc9abdeea07fb9bfe011b

commit 9a225c02d1f78511f44bc9abdeea07fb9bfe011b
Author: Ben Lin <linben@google.com>
Date: Sat Jan 13 06:37:50 2018

Add check for Chrome PID if exception occurs for startAndroid.stress.

startAndroid.stress test has been hitting exceptions regarding timeout
when trying to enable play store. We don't know whether Chrome is still
alive or not at this point, so logging this.

BUG= chromium:783832 
BUG= chromium:799958 
TEST=ran cheets_StartAndroid test locally.

Change-Id: Ifbe1bc73fa75a29df01410a9ef4cd77d5b4c7f80
Reviewed-on: https://chromium-review.googlesource.com/865100
Commit-Ready: Ben Lin <linben@chromium.org>
Tested-by: Ben Lin <linben@chromium.org>
Reviewed-by: Yusuke Sato <yusukes@chromium.org>

[modify] https://crrev.com/9a225c02d1f78511f44bc9abdeea07fb9bfe011b/client/common_lib/cros/chrome.py
[modify] https://crrev.com/9a225c02d1f78511f44bc9abdeea07fb9bfe011b/client/site_tests/cheets_StartAndroid/cheets_StartAndroid.py

Any update on this? Still seeing many failures.
Cc: phshah@chromium.org
With the CL landed from Jan 13 we should be seeing more debug information now.

Can you link the failures?

Also adding this week's constable.
Components: -Infra>Client>ChromeOS Platform>Apps>ARC
This isn't a CrOS Infra bug.  Recategorizing as ARC++.

Probably, this also shouldn't be P0, but the ARC++ constable
should make that call.

As written in #20,

py_utils.WaitFor(lambda: self.oobe_exists, 30)

is a pretty generic failure that happens even before trying to start the main ARC instance
(((even though the failure comes after starting a mini ARC instance for login-screen)))
which is happening across all the kinds of Chrome OS tests including boards that have never enable ARC:

https://stainless.corp.google.com/search?view=list&first_date=2017-12-24&last_date=2018-01-22&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&reason=oobe_exists&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true

I think cheets_StartAndroid.stress is more visible just because, as being the "stress" test, it does 100x more login compared to other tests and thus has 100x more chance to hit the bug.

I'm not sure if categorizing it as ARC++ bug is appropriate. Rather I think we want help for autotest/devtool-connection experts (though I don't know who they are...)
Cc: kinaba@chromium.org
Looked at 4 or 5 logs of this failure.

As far as I could see, in all of them, the Chrome process was restarting during the login wizard.
It seems that autotest is waiting for the connection between the first (dead) chrome and is timing out.
(I guess this is what linben@ observed as well and that's why the debug code of #28 is added.)


All (and only) for the failing instance of Chrome, the corresponding /var/log/ui/ log recorded the following line

Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!

immediately before the logged PID switches.




https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/169160870-chromeos-test/chromeos6-row4-rack5-host1/

/var/log/messages
2018-01-10T21:03:07.477986+00:00 INFO session_manager[30716]: [INFO:browser_job.cc(147)] Running child /opt/google/chrome/chrome ...
2018-01-10T21:03:08.387393+00:00 INFO session_manager[30716]: [INFO:child_exit_handler.cc(77)] Handling 30827 exit.
2018-01-10T21:03:08.387513+00:00 ERR session_manager[30716]: [ERROR:child_exit_handler.cc(79)]   Exited with exit code 127
2018-01-10T21:03:08.387828+00:00 INFO session_manager[30716]: [INFO:session_manager_service.cc(307)] Exiting process is chrome.
2018-01-10T21:03:08.387894+00:00 INFO session_manager[30716]: [INFO:browser_job.cc(156)] Terminating process group: Ensuring browser processes are gone.
2018-01-10T21:03:08.387941+00:00 INFO session_manager[30716]: [INFO:system_utils_impl.cc(93)] Sending 9 to -30827 as 1000
2018-01-10T21:03:08.423756+00:00 INFO session_manager[30716]: [INFO:browser_job.cc(147)] Running child /opt/google/chrome/chrome ...

autotest DEBUG log
01/10 13:03:07.450 INFO |cros_browser_backe:0141| Restarting Chrome (pid=30735) with remote port
...
01/10 13:03:07.468 DEBUG|      global_hooks:0056| ['sh', '-c', '/bin/ps --no-headers -A -o pid,ppid,args:4096,state']
01/10 13:03:07.519 DEBUG|    cros_interface:0367| ListProcesses(<predicate>)->[302 processes]
01/10 13:03:07.521 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpbILu43)
01/10 13:03:07.522 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpS4IMRr)
01/10 13:03:07.623 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpWWODD4)
01/10 13:03:07.724 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmptlRQfF)
01/10 13:03:07.826 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpqkKm_u)
01/10 13:03:07.927 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpg_CsWu)
01/10 13:03:08.029 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpNspBk8)
01/10 13:03:08.130 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpkLb2Rf)
01/10 13:03:08.231 DEBUG|    cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpuhVMKG)
01/10 13:03:08.231 DEBUG|    cros_interface:0339| GetFileContents(/home/chronos/DevToolsActivePort)->39626
/devtools/browser/f79536d4-319d-44e8-b69c-694ea7659f83
01/10 13:03:08.232 INFO |          __init__:0047| DoNothingForwarder started between 127.0.0.1:39626 and 39626
01/10 13:03:08.233 DEBUG|do_nothing_forward:0063| Connection test succeeded for 127.0.0.1:39626
01/10 13:03:08.233 INFO |chrome_browser_bac:0231| Got devtools config: ws://127.0.0.1:39626/devtools/browser/f79536d4-319d-44e8-b69c-694ea7659f83
01/10 13:03:44.569 ERROR|           browser:0052| Failed with TimeoutException while starting the browser backend.

/var/log/ui.*

[30827:30827:0110/130308.087582:ERROR:input_method_manager_impl.cc(1032)] IMEEngine for "jkghodnilhceideoidjikpgommlajknk" is not registered

DevTools listening on ws://127.0.0.1:39626/devtools/browser/f79536d4-319d-44e8-b69c-694ea7659f83
device-enumerator: scan all dirs
  device-enumerator: scanning /sys/bus
  device-enumerator: scanning /sys/class
device-enumerator: scan all dirs
  device-enumerator: scanning /sys/bus
  device-enumerator: scanning /sys/class
Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
[30949:30949:0110/130309.190555:ERROR:input_method_manager_impl.cc(1032)] IMEEngine for "jkghodnilhceideoidjikpgommlajknk" is not registered

DevTools listening on ws://127.0.0.1:34615/devtools/browser/ccb23e8f-bef1-442a-9fa7-ac7e33fcd8c0
device-enumerator: scan all dirs
  device-enumerator: scanning /sys/bus
  device-enumerator: scanning /sys/class
device-enumerator: scan all dirs
  device-enumerator: scanning /sys/bus
  device-enumerator: scanning /sys/class
device-enumerator: scan all dirs
  device-enumerator: scanning /sys/bus
  device-enumerator: scanning /sys/class
device-enumerator: scan all dirs
  device-enumerator: scanning /sys/bus
  device-enumerator: scanning /sys/class
[30949:30949:0110/130314.845695:ERROR:server.cc(1915)] Not implemented reached in void exo::wayland::(anonymous namespace)::remote_surface_set_rectangular_shadow_background_opacity_DEPRECATED(wl_client *, wl_resource *, wl_fixed_t)
An instance on peach_pi, policy_ProxySettings.directproxy_usenoproxy
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/170911181-chromeos-test/chromeos6-row2-rack9-host4/policy_ProxySettings/
(ui.20180119-054605 is the restarted chrome run with "Inconsistency detected by ld.so".)
According to my Google search, the error message can be due to a race between dlopen and pthread_create of glibc
https://sourceware.org/ml/libc-alpha/2016-01/msg00089.html
but so far I have zero familiarity with the field to access it...
Upstream (glibc) bug https://sourceware.org/bugzilla/show_bug.cgi?id=19329 (still open)
Cc: hidehiko@chromium.org
Cc: vapier@chromium.org yunlian@chromium.org tor@chromium.org g...@chromium.org
CC:ing people who seem to be familiar with glibc (chosen from sys-libs/glibc git history).

+vapier +gbiv +yunlian, do you folks have any suggestions? (or do you know anyone who's familiar in this area?)
The bug https://sourceware.org/bugzilla/show_bug.cgi?id=19329 looks biting Chrome OS: 0.01 ~ 0.1 % of time Chrome restarts in OOBE screen due to this assertion in libc, and makes autotests failing.
(1) Ideally we want to fix the root cause (race in glibc) and I believe it's worth trying, but even the upstream people looks struggling 

(2) Add a workaround to Chrome browser so that the critical situation can be avoided. (Not sure if it is practically feasible, and it doesn't look great anyway.)

(3) Add a workaround to either of Telemetry/Autotest so that they become robust against Chrome crash at this timing. But I feel really awkward to do that, because it's just hiding the real bug. The sole purpose of the "stress" test should be to detect and fix this kind of situation (= occasional crash/failure in relatively low repro ratio.)
Project Member

Comment 41 by bugdroid1@chromium.org, Jan 23

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/fed7f8aba8cfcc2b75b1ab86cad46b5476bb551d

commit fed7f8aba8cfcc2b75b1ab86cad46b5476bb551d
Author: Ben Lin <linben@chromium.org>
Date: Tue Jan 23 09:48:47 2018

Revert "Add check for Chrome PID if exception occurs for startAndroid.stress."

This reverts commit 9a225c02d1f78511f44bc9abdeea07fb9bfe011b.

Reason for revert: See original CL's comments - test is no longer failing properly.

Original change's description:
> Add check for Chrome PID if exception occurs for startAndroid.stress.
>
> startAndroid.stress test has been hitting exceptions regarding timeout
> when trying to enable play store. We don't know whether Chrome is still
> alive or not at this point, so logging this.
>
> BUG= chromium:783832 
> BUG= chromium:799958 
> TEST=ran cheets_StartAndroid test locally.
>
> Change-Id: Ifbe1bc73fa75a29df01410a9ef4cd77d5b4c7f80
> Reviewed-on: https://chromium-review.googlesource.com/865100
> Commit-Ready: Ben Lin <linben@chromium.org>
> Tested-by: Ben Lin <linben@chromium.org>
> Reviewed-by: Yusuke Sato <yusukes@chromium.org>

Bug:  chromium:783832 ,  chromium:799958 
Change-Id: I788e204682bb62093ad7284e78817c60f3ddb39f
Reviewed-on: https://chromium-review.googlesource.com/879241
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Ben Lin <linben@chromium.org>
Reviewed-by: Yusuke Sato <yusukes@chromium.org>
Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org>

[modify] https://crrev.com/fed7f8aba8cfcc2b75b1ab86cad46b5476bb551d/client/common_lib/cros/chrome.py
[modify] https://crrev.com/fed7f8aba8cfcc2b75b1ab86cad46b5476bb551d/client/site_tests/cheets_StartAndroid/cheets_StartAndroid.py

Labels: Build-Toolchain
I will test the glibc patch proposed in the upstream bug mentioned in #37. But it may take a few days to finish testing.

Should we keep this as a P0 or P1?
> Should we keep this as a P0 or P1?

Per  crbug.com/783832#c31 , this probably shouldn't be P0; the
failure rate doesn't seem to be high enough.

Cc: -ahass...@chromium.org
Labels: -Pri-0 Pri-1
lowering to P1. 

is linben@ still doing something for this? 
who should be the owner?
I had a change (which I've reverted) that was trying to check whether it was Chrome failing or not, but looks like syslog has enough information for that. I don't have enough experience with glibc, so not working on this anymore.
Owner: yunlian@chromium.org
Project Member

Comment 49 by bugdroid1@chromium.org, Jan 26

Labels: merge-merged-release-R65-10323.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ef56d33d40afb1da263697944305a1288122c8cd

commit ef56d33d40afb1da263697944305a1288122c8cd
Author: Ben Lin <linben@chromium.org>
Date: Fri Jan 26 04:18:51 2018

Revert "Add check for Chrome PID if exception occurs for startAndroid.stress."

This reverts commit 9a225c02d1f78511f44bc9abdeea07fb9bfe011b.

Reason for revert: See original CL's comments - test is no longer failing properly.

Original change's description:
> Add check for Chrome PID if exception occurs for startAndroid.stress.
>
> startAndroid.stress test has been hitting exceptions regarding timeout
> when trying to enable play store. We don't know whether Chrome is still
> alive or not at this point, so logging this.
>
> BUG= chromium:783832 
> BUG= chromium:799958 
> TEST=ran cheets_StartAndroid test locally.
>
> Change-Id: Ifbe1bc73fa75a29df01410a9ef4cd77d5b4c7f80
> Reviewed-on: https://chromium-review.googlesource.com/865100
> Commit-Ready: Ben Lin <linben@chromium.org>
> Tested-by: Ben Lin <linben@chromium.org>
> Reviewed-by: Yusuke Sato <yusukes@chromium.org>

Bug:  chromium:783832 ,  chromium:799958 
Change-Id: I788e204682bb62093ad7284e78817c60f3ddb39f
Reviewed-on: https://chromium-review.googlesource.com/879241
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Ben Lin <linben@chromium.org>
Reviewed-by: Yusuke Sato <yusukes@chromium.org>
Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org>
(cherry picked from commit fed7f8aba8cfcc2b75b1ab86cad46b5476bb551d)
Reviewed-on: https://chromium-review.googlesource.com/888298
Tested-by: Kazuhiro Inaba <kinaba@chromium.org>

[modify] https://crrev.com/ef56d33d40afb1da263697944305a1288122c8cd/client/common_lib/cros/chrome.py
[modify] https://crrev.com/ef56d33d40afb1da263697944305a1288122c8cd/client/site_tests/cheets_StartAndroid/cheets_StartAndroid.py

Cc: ahass...@chromium.org
 Issue 799958  has been merged into this issue.
Project Member

Comment 51 by bugdroid1@chromium.org, Jan 26

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/2832d11e4e7071222a1464b26b365ef4edf3b720

commit 2832d11e4e7071222a1464b26b365ef4edf3b720
Author: Yunlian Jiang <yunlian@chromium.org>
Date: Fri Jan 26 18:58:55 2018

glibc: backport an patch to fix [BZ 19329]

This backports a pending CL from
https://sourceware.org/bugzilla/show_bug.cgi?id=19329
to fix the issue.

The commit message is
This fixes a subset of the issues described in
https://sourceware.org/ml/libc-alpha/2016-11/msg01026.html
without adding locks to pthread_create.

Only races between dlopen and pthread_create were considered,
and the asserts got removed that tried to check for concurrency
issues.

The patch is incomplete because dlclose, tls access and
dl_iterate_phdr related code paths are not modified.

dlclose should be updated in a similar fashion to dlopen
to make the patch complete alternatively pthread_create
may take the GL(dl_load_write_lock) to sync with dlclose
or the GL(dl_load_lock) to sync with dlopen and dlclose
(that would simplify the concurrency design, but increase
lock contention on the locks).

2016-11-30  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #19329]
	* elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
	atomically.
	* elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
	GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
	Remove assertions that cannot be guaranteed.
	(_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
	atomically.

BUG= chromium:783832 
TEST=cbuildbot chromiuos-sdk falco-llvm-toolchain
     kevin-llvm-toolchain caroline-llvm-toolchain
     daisy-llvm-toolchain

Change-Id: I5f929f3e9bc2a4e1733907745e24e47aa92d1156
Reviewed-on: https://chromium-review.googlesource.com/883568
Commit-Ready: Yunlian Jiang <yunlian@chromium.org>
Tested-by: Yunlian Jiang <yunlian@chromium.org>
Reviewed-by: George Burgess <gbiv@chromium.org>

[rename] https://crrev.com/2832d11e4e7071222a1464b26b365ef4edf3b720/sys-libs/glibc/glibc-2.23-r15.ebuild
[add] https://crrev.com/2832d11e4e7071222a1464b26b365ef4edf3b720/sys-libs/glibc/files/local/glibc-2.23-dlopen.patch

I have backported a glibc patch and it needs a chromiumos-sdk uprev to take effect. How could we test whether this change really fixes this issue?
Thank you very much for the patching!

> How could we test whether this change really fixes this issue?

As far as I investigated, all the failures from 
  cheets_StartAndroid.stress.*
with an error message
  "TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30)."
were due to this glibc assertion.
So if it is deployed to the release images, we can watch the tree for a week and see if this type of error is gone or not.

Otherwise, unfortunately there's no reliable repro for the bug. Running the test
`test_that cheets_StartAndroid.stress.0`
locally 100 times or something like that may be the only way to test.
Do we need this patch to go to R65?
i think this patch is too risky for R65 at this point
we should monitor ToT and see how less flaky this makes; but
this should appear as user perceived latency increased for <1% of logins if it's restarting Chrome but not really visibly broken.


Right, the session manager is anyhow restarting Chrome. So in the real use case, user will only see a blackout/flash at the login screen in the worst case, I believe. Contrarily, tests are sensitive to the browser crash and thus are blocking the commit queue etc sometimes. Fix in ToT is thus very important but release branches will probably be able to bare with it.



Dashboard to monitor (query for error messages that have been caused by this failure):
https://stainless.corp.google.com/search?view=list&first_date=2018-01-15&last_date=2018-02-15&build=R66&status=FAIL&status=ERROR&reason=oobe_exists%7CDevToolsClientConnectionError%3A+%27%27&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false
(note that Stainless is not updating since Jan 23 at the moment b/72484907)
Status: Fixed (was: Started)
If I'm reading the thing correctly

The glibc patching CL has landed in 10345.0.0
  https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/883568
which is picked up by this SDK builder run that updated to SDK_LATEST_VERSION="2018.01.26.210622"
  https://build.chromium.org/p/chromiumos/builders/chromiumos-sdk/builds/8475
which is picked up by the R66-10347.0.0 release:
  https://crosland.corp.google.com/log/10346.0.0..10347.0.0
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/0d09cfb399d8d1b0c916759ae077099c4aa37adc%5E%21

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The cheets_StartAndroid test run 6233 times since then, and we see ZERO "py_utils.WaitFor(lambda: self.oobe_exists, 30)" timeout failures:

https://stainless.corp.google.com/search?view=list&first_date=2018-01-25&last_date=2018-02-03&test=cheets_StartAndroid&build=%5ER66-10%28%5B4-9%5D%7C3%28%5B5-9%5D%7C4%5B7-9%5D%29%29&status=GOOD&status=WARN&status=FAIL&status=ERROR&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false

(Without the fix, say, R65 bots in the same date range had run the test 3962 times and encountered 13 failures of that type.)
(The success ratio is 6215/6233 vs 3932/3962, so we are neither seeing any increase of other failure modes.)

Even if we widen the search range to all test failures Stainless has gathered, we are no more seeing the timeout:

https://stainless.corp.google.com/search?view=list&first_date=2018-01-25&last_date=2018-02-03&build=%5ER66-10%28%5B4-9%5D%7C3%28%5B5-9%5D%7C4%5B7-9%5D%29%29&status=FAIL&status=ERROR&reason=oobe_exists%7CDevToolsClientConnectionError%3A+%27%27&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false

"Unhandled DevToolsClientConnectionError: ''" was another possible failure mode that had been caused by the glibc assertion,
This failure reason is still occasionally observed, but if we closely look into the logs, we know that these are completely
different failures. In our case the failure was during the browser start time, with a log message
"Failed with DevToolsClientConnectionError while starting the browser backend."
but none of the failures in above link were of this type.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So, in short, I believe the issue is now resolved.
Thank you very much for your help, Yunlian!
Status: Verified (was: Fixed)

Sign in to add a comment