cheets_StartAndroid.stress timeout |
|||||||||||||||||||||||
Issue descriptionThis failed on kevin-paladin: http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/155404104-chromeos-test/ https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/2998 It appears to have timed out when starting the browser backend, and may not necessarily be an issue with arc: 11/10 06:03:00.959 ERROR| browser:0068| Failed with TimeoutException while starting the browser backend.
,
Nov 10 2017
The same failure has also been seen on login_LoginSuccess.stress: https://wmatrix.googleplex.com/failures/unfiltered?platforms=soraka&tests=login_LoginSuccess.stress&builds=R64-10054.0.0&releases=tot&hide_missing=True
,
Dec 13 2017
Saw this again on cheets_StartAndroid.stress on auron_yuna-paladin: https://luci-milo.appspot.com/buildbot/chromeos/auron_yuna-paladin/1653 https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/162250346-chromeos-test/chromeos4-row10-rack8-host9/ 12/13 04:37:26.154 ERROR| browser:0050| Failed with TimeoutException while starting the browser backend. TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30).
,
Dec 14 2017
At first blush, his looks like a product bug, presumably in ARC++, not caused by infrastructure. levarum@ - could you do some initial triage? The top-level question is "is this problem caused by a CL in the CQ run, or is it a bug in the tree?"
,
Dec 14 2017
In cheets_StartAndroid.stress/sysinfo/var/log/chrome/chrome_20171213-043621 this looks suspicious: [28319:28319:1213/043626.291292:VERBOSE1:arc_session_manager.cc(563)] ARC opt-in. Starting ARC session. [28319:28319:1213/043626.291363:VERBOSE1:arc_session_runner.cc(185)] Session start requested: FULL_INSTANCE [28319:28319:1213/043626.291377:VERBOSE1:arc_session_runner.cc(258)] Starting ARC instance [28319:28319:1213/043626.730362:ERROR:customization_document.cc(623)] Customization manifest is missing on server: https://ssl.gstatic.com/chrome/chromeos-customization/yuna.json [28319:28319:1213/043626.759986:WARNING:memory_kills_monitor.cc(175)] Monitoring has been started [28319:28319:1213/043627.303373:INFO:arc_session_impl.cc(459)] ARC ready. [28319:28319:1213/043627.372469:ERROR:server.cc(1915)] Not implemented reached in void exo::wayland::(anonymous namespace)::remote_surface_set_rectangular_shadow_background_opacity_DEPRECATED(wl_client *, wl_resource *, wl_fixed_t) [28319:28319:1213/043630.288315:VERBOSE1:arc_net_host_impl.cc(721)] GetDefaultNetwork: default network is /service/2 [28319:28319:1213/043630.299925:ERROR:object_proxy.cc(626)] Failed to call method: org.chromium.Midis.BootstrapMojoConnection: object_path= /org/chromium/Midis: org.freedesktop.DBus.Error.ServiceUnknown: The name org.chromium.Midis was not provided by any .service files [28319:28319:1213/043630.299986:ERROR:arc_midis_bridge.cc(65)] ArcMidisBridge had a failure in D-Bus with the daemon. [28319:28319:1213/043631.861828:VERBOSE1:connection_holder.h(384)] Instance arc::mojom::AppInstance not available. [28319:28319:1213/043634.981454:VERBOSE1:connection_holder.h(384)] Instance arc::mojom::IntentHelperInstance not available. [28319:28319:1213/043634.981983:VERBOSE1:arc_settings_service.cc(241)] OnPrefChanged: arc.backup_restore.enabled [28319:28319:1213/043634.982061:VERBOSE1:arc_settings_service.cc(241)] OnPrefChanged: arc.location_service.enabled [28319:28319:1213/043648.043337:VERBOSE1:arc_net_host_impl.cc(758)] New default network: /service/2 (ethernet) [28319:28319:1213/043648.043774:WARNING:statistics_provider.cc(370)] Requested statistic not found: Product_S/N [28319:28319:1213/043648.044319:VERBOSE1:arc_net_host_impl.cc(758)] New default network: /service/2 (ethernet) [28319:28319:1213/043648.198988:VERBOSE1:arc_session_runner.cc(236)] OnShutdown [28319:28319:1213/043648.199078:VERBOSE1:arc_bridge_host_impl.cc(281)] Mojo connection lost [28319:28319:1213/043648.200962:VERBOSE1:arc_net_host_impl.cc(912)] AndroidVpnStateChanged: state=ConnectionStateType::NOT_CONNECTED service= [28319:28319:1213/043648.201321:INFO:arc_session_runner.cc(287)] ARC stopped: SHUTDOWN [28319:28319:1213/043648.256631:VERBOSE1:arc_session_runner.cc(236)] OnShutdown [28319:28319:1213/043648.306582:WARNING:pref_notifier_impl.cc(28)] Init observer found at shutdown. @yusukes, @lhchavez are any of these messages fatal?
,
Dec 14 2017
arc_session_runner / arc_bridge_host_impl are consequences (not causes) of another fatal issue in the Android side.
,
Dec 27 2017
Issue 794719 has been merged into this issue.
,
Dec 27 2017
Still happening: https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-release/builds/1808 2/27 05:23:52.579 DEBUG| autotest:1310| AUTOTEST_STATUS:: FAIL cheets_StartAndroid.stress cheets_StartAndroid.stress timestamp=1514381032 localtime=Dec 27 05:23:52 Unhandled TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30). 12/27 05:23:52.579 INFO | server_job:0218| FAIL cheets_StartAndroid.stress cheets_StartAndroid.stress timestamp=1514381032 localtime=Dec 27 05:23:52 Unhandled TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30). Any ARC++ engineer can take this?
,
Dec 27 2017
If it's helpful, here's the relevant part of cheets_StartAndroid.stress.DEBUG from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/165673689-chromeos-test/chromeos6-row2-rack24-host18/cheets_StartAndroid.stress/debug/: 12/27 05:23:11.914 INFO |cheets_StartAndroi:0048| cheets_StartAndroid iteration 8 12/27 05:23:11.915 DEBUG| arc_util:0040| ARC is enabled in mode enabled 12/27 05:23:11.916 DEBUG| arc_util:0040| ARC is enabled in mode enabled 12/27 05:23:11.928 INFO | browser_finder:0123| Chose browser: PossibleCrOSBrowser(browser_type=system) 12/27 05:23:11.929 DEBUG| cros_interface:0058| sh -c mktemp -d /tmp/extension_XXXXX 12/27 05:23:11.930 DEBUG| global_hooks:0056| ['sh', '-c', 'mktemp -d /tmp/extension_XXXXX'] 12/27 05:23:11.946 DEBUG| cros_interface:0067| > stdout=[/tmp/extension_gCtUt ], stderr=[] 12/27 05:23:11.947 DEBUG| global_hooks:0056| ['cp', '-r', '/usr/local/autotest/common_lib/cros/autotest_private_ext', '/tmp/extension_gCtUt'] 12/27 05:23:11.965 DEBUG| cros_interface:0058| sh -c chown -R chronos:chronos /tmp/extension_gCtUt 12/27 05:23:11.965 DEBUG| global_hooks:0056| ['sh', '-c', 'chown -R chronos:chronos /tmp/extension_gCtUt'] 12/27 05:23:11.978 DEBUG| cros_interface:0067| > stdout=[], stderr=[] 12/27 05:23:11.978 INFO | cros_interface:0551| (Re)starting the ui (logs the user out) 12/27 05:23:11.979 DEBUG| global_hooks:0056| ['sh', '-c', 'systemctl'] 12/27 05:23:11.988 DEBUG| cros_interface:0058| sh -c stop ui 12/27 05:23:11.988 DEBUG| global_hooks:0056| ['sh', '-c', 'stop ui'] 12/27 05:23:12.473 DEBUG| cros_interface:0067| > stdout=[ui stop/waiting ], stderr=[] 12/27 05:23:12.475 DEBUG| cros_interface:0414| rm -rf /var/lib/whitelist/* 12/27 05:23:12.477 DEBUG| global_hooks:0056| ['sh', '-c', 'rm -rf /var/lib/whitelist/*'] 12/27 05:23:12.501 DEBUG| cros_interface:0414| rm -rf /home/chronos/Local\ State 12/27 05:23:12.503 DEBUG| global_hooks:0056| ['sh', '-c', 'rm -rf /home/chronos/Local\\ State'] 12/27 05:23:12.525 DEBUG| global_hooks:0056| ['sh', '-c', 'systemctl'] 12/27 05:23:12.548 DEBUG| global_hooks:0056| ['sh', '-c', 'status ui'] 12/27 05:23:12.567 DEBUG| cros_interface:0443| IsServiceRunning(ui)->False 12/27 05:23:12.569 DEBUG| cros_interface:0058| sh -c start ui 12/27 05:23:12.570 DEBUG| global_hooks:0056| ['sh', '-c', 'start ui'] 12/27 05:23:12.826 DEBUG| cros_interface:0067| > stdout=[ui start/running, process 32413 ], stderr=[] 12/27 05:23:12.829 DEBUG| global_hooks:0056| ['sh', '-c', '/bin/ps --no-headers -A -o pid,ppid,args:4096,state'] 12/27 05:23:12.876 DEBUG| cros_interface:0367| ListProcesses(<predicate>)->[292 processes] 12/27 05:23:12.878 DEBUG| cros_interface:0058| sh -c cryptohome --action=remove --force --user=test@test.test 12/27 05:23:12.879 DEBUG| global_hooks:0056| ['sh', '-c', 'cryptohome --action=remove --force --user=test@test.test'] 12/27 05:23:13.131 DEBUG| cros_interface:0067| > stdout=[Remove succeeded. ], stderr=[] 12/27 05:23:13.132 DEBUG| cros_interface:0414| rm -rf /home/chronos/DevToolsActivePort 12/27 05:23:13.132 DEBUG| global_hooks:0056| ['sh', '-c', 'rm -rf /home/chronos/DevToolsActivePort'] 12/27 05:23:13.149 DEBUG| global_hooks:0056| ['sh', '-c', '/bin/ps --no-headers -A -o pid,ppid,args:4096,state'] 12/27 05:23:13.202 DEBUG| cros_interface:0367| ListProcesses(<predicate>)->[296 processes] 12/27 05:23:13.204 INFO |cros_browser_backe:0159| Restarting Chrome (pid=32428) with remote port 12/27 05:23:13.204 INFO |cros_browser_backe:0166| dbus-send --system --type=method_call --dest=org.chromium.SessionManager /org/chromium/SessionManager org.chromium.SessionManagerInterface.EnableChromeTesting boolean:true array:string:"--disable-arc-opt-in-verification,--enable-net-benchmarking,--metrics-recording-only,--no-default-browser-check,--no-first-run,--enable-gpu-benchmarking,--autoplay-policy=no-user-gesture-required,--load-extension=/tmp/extension_gCtUt/autotest_private_ext,--disable-component-extensions-with-background-pages,--disable-default-apps,--disable-search-geolocation-disclosure,--enable-smooth-scrolling,--enable-threaded-compositing,--remote-debugging-port=0,--start-maximized,--ash-disable-system-sounds,--allow-failed-policy-fetch-for-test,--oobe-skip-postlogin,--disable-logging-redirect,--vmodule=*/chromeos/net/*=2\,*/chromeos/login/*=2\,chrome_browser_main_posix=2,--disable-gaia-services" 12/27 05:23:13.205 DEBUG| cros_interface:0058| sh -c dbus-send --system --type=method_call --dest=org.chromium.SessionManager /org/chromium/SessionManager org.chromium.SessionManagerInterface.EnableChromeTesting boolean:true array:string:"--disable-arc-opt-in-verification,--enable-net-benchmarking,--metrics-recording-only,--no-default-browser-check,--no-first-run,--enable-gpu-benchmarking,--autoplay-policy=no-user-gesture-required,--load-extension=/tmp/extension_gCtUt/autotest_private_ext,--disable-component-extensions-with-background-pages,--disable-default-apps,--disable-search-geolocation-disclosure,--enable-smooth-scrolling,--enable-threaded-compositing,--remote-debugging-port=0,--start-maximized,--ash-disable-system-sounds,--allow-failed-policy-fetch-for-test,--oobe-skip-postlogin,--disable-logging-redirect,--vmodule=*/chromeos/net/*=2\,*/chromeos/login/*=2\,chrome_browser_main_posix=2,--disable-gaia-services" 12/27 05:23:13.205 DEBUG| global_hooks:0056| ['sh', '-c', 'dbus-send --system --type=method_call --dest=org.chromium.SessionManager /org/chromium/SessionManager org.chromium.SessionManagerInterface.EnableChromeTesting boolean:true array:string:"--disable-arc-opt-in-verification,--enable-net-benchmarking,--metrics-recording-only,--no-default-browser-check,--no-first-run,--enable-gpu-benchmarking,--autoplay-policy=no-user-gesture-required,--load-extension=/tmp/extension_gCtUt/autotest_private_ext,--disable-component-extensions-with-background-pages,--disable-default-apps,--disable-search-geolocation-disclosure,--enable-smooth-scrolling,--enable-threaded-compositing,--remote-debugging-port=0,--start-maximized,--ash-disable-system-sounds,--allow-failed-policy-fetch-for-test,--oobe-skip-postlogin,--disable-logging-redirect,--vmodule=*/chromeos/net/*=2\\,*/chromeos/login/*=2\\,chrome_browser_main_posix=2,--disable-gaia-services"'] 12/27 05:23:13.218 DEBUG| cros_interface:0067| > stdout=[], stderr=[] 12/27 05:23:13.219 DEBUG| global_hooks:0056| ['sh', '-c', '/bin/ps --no-headers -A -o pid,ppid,args:4096,state'] 12/27 05:23:13.253 DEBUG| cros_interface:0367| ListProcesses(<predicate>)->[291 processes] 12/27 05:23:13.255 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpk10gwj) 12/27 05:23:13.256 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpWAVKAr) 12/27 05:23:13.357 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpbKWMzp) 12/27 05:23:13.458 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpZK5VNp) 12/27 05:23:13.560 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpMm6EZ_) 12/27 05:23:13.663 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpBEV5py) 12/27 05:23:13.765 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpmApfZY) 12/27 05:23:13.866 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpolw4IL) 12/27 05:23:13.967 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpe370dk) 12/27 05:23:13.968 DEBUG| cros_interface:0339| GetFileContents(/home/chronos/DevToolsActivePort)->36495 /devtools/browser/d9a6aefb-c96b-4463-b4af-ea6ec643e6df 12/27 05:23:13.968 INFO |cros_browser_backe:0076| Discovered ephemeral port 36495 12/27 05:23:13.969 INFO |cros_browser_backe:0077| Browser target: /devtools/browser/d9a6aefb-c96b-4463-b4af-ea6ec643e6df 12/27 05:23:48.902 ERROR| browser:0050| Failed with TimeoutException while starting the browser backend.
,
Dec 27 2017
Adding this week's constable.
,
Dec 27 2017
,
Dec 27 2017
I'll start looking into this.
,
Dec 27 2017
Cheets_StartAndroid.stress also just killed a CQ run. I am not sure if it's the same problem, but maybe you can look into this too? Thanks. https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-paladin/builds/4886
,
Dec 28 2017
Regarding the minnie failure from #13: there seems to be a disconnect between 14:48 and 14:51 as shown in autoserv.DEBUG: 12/27 14:48:51.551 DEBUG| autotest:1310| AUTOTEST_STATUS::START ---- ---- timestamp=1514414930 localtime=Dec 27 14:48:50 12/27 14:48:51.552 INFO | server_job:0218| START ---- ---- timestamp=1514414930 localtime=Dec 27 14:48:50 12/27 14:48:51.552 INFO | server_job:1401| Parsing lines in fast mode 12/27 14:48:51.553 ERROR| logging_manager:0626| tko parser: parsing partial test None CLIENT_JOB.0 12/27 14:48:51.554 ERROR| logging_manager:0626| tko parser: RUNNING: RUNNING 12/27 14:48:51.554 ERROR| logging_manager:0626| CLIENT_JOB.0 12/27 14:48:51.554 ERROR| logging_manager:0626| 12/27 14:48:51.719 DEBUG| autotest:1310| AUTOTEST_STATUS:: START cheets_StartAndroid.stress cheets_StartAndroid.stress timestamp=1514414930 localtime=Dec 27 14:48:50 12/27 14:48:51.720 INFO | server_job:0218| START cheets_StartAndroid.stress cheets_StartAndroid.stress timestamp=1514414930 localtime=Dec 27 14:48:50 12/27 14:48:51.720 INFO | server_job:1401| Parsing lines in fast mode 12/27 14:48:51.720 ERROR| logging_manager:0626| tko parser: parsing partial test cheets_StartAndroid.stress cheets_StartAndroid.stress 12/27 14:48:51.721 ERROR| logging_manager:0626| tko parser: RUNNING: RUNNING 12/27 14:48:51.721 ERROR| logging_manager:0626| Subdir: cheets_StartAndroid.stress 12/27 14:48:51.721 ERROR| logging_manager:0626| Testname: cheets_StartAndroid.stress 12/27 14:48:51.722 ERROR| logging_manager:0626| 12/27 14:51:02.146 DEBUG| autotest:0985| Result exit status is 255. 12/27 14:51:02.147 DEBUG| utils:0212| Running 'ping chromeos4-row9-rack9-host6 -w1 -c1' 12/27 14:51:02.364 DEBUG| utils:0280| [stdout] PING chromeos4-row9-rack9-host6.cros.corp.google.com (100.115.200.102) 56(84) bytes of data. 12/27 14:51:02.364 DEBUG| utils:0280| [stdout] 64 bytes from 100.115.200.102: icmp_seq=1 ttl=59 time=0.721 ms 12/27 14:51:02.365 DEBUG| utils:0280| [stdout] 12/27 14:51:02.365 DEBUG| utils:0280| [stdout] --- chromeos4-row9-rack9-host6.cros.corp.google.com ping statistics --- 12/27 14:51:02.365 DEBUG| utils:0280| [stdout] 1 packets transmitted, 1 received, 0% packet loss, time 0ms 12/27 14:51:02.365 DEBUG| utils:0280| [stdout] rtt min/avg/max/mdev = 0.721/0.721/0.721/0.000 ms 12/27 14:51:02.385 DEBUG| ssh_host:0301| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi' from '_do_run|execute_control|_diagnose_dut|get_boot_id|run|run_very_slowly' 12/27 14:51:02.396 INFO | ssh_multiplex:0076| Master ssh connection to chromeos4-row9-rack9-host6 is down. 12/27 14:51:02.397 DEBUG| ssh_multiplex:0118| Nuking ssh master_job 12/27 14:51:02.397 DEBUG| ssh_multiplex:0123| Cleaning ssh master_tempdir 12/27 14:51:02.398 INFO | ssh_multiplex:0092| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_BOwe3Kssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row9-rack9-host6' 12/27 14:51:02.398 DEBUG| utils:0212| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_BOwe3Kssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row9-rack9-host6' 12/27 14:51:06.211 DEBUG| utils:0280| [stdout] 4db3912d-e8da-4d7a-b032-4530c6740ef7 12/27 14:51:06.253 INFO | server_job:0218| ABORT ---- ---- timestamp=1514415066 localtime=Dec 27 14:51:06 Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test. 12/27 14:51:06.253 INFO | server_job:1401| Parsing lines in fast mode at that time the syslog shows this (NOTE: time is GMT) 2017-12-27T22:48:50.131903+00:00 NOTICE ag[5015]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local\ /autotest/bin/autotestd_monitor /tmp/autoserv-SgAQev 0 0) 2017-12-27T22:48:51.005036+00:00 NOTICE root[5049]: autotest runtest cheets_StartAndroid 2017-12-27T22:48:52.833212+00:00 NOTICE autotest[5070]: 14:48:52.826 WARNI| external_modules:0057| Unable to import cv2 due to: In\ correct cv2 version found, expected 2.4.8 <= version < 3.0.0, found version $Rev: 4557 $ 2017-12-27T22:48:54.480356+00:00 INFO session_manager[1793]: [INFO:browser_job.cc(164)] Terminating process: exiting cleanly 2017-12-27T22:48:54.480735+00:00 INFO session_manager[1793]: [INFO:system_utils_impl.cc(93)] Sending 15 to 1845 as 1000 2017-12-27T22:48:54.481083+00:00 INFO session_manager[1793]: [INFO:container_manager_impl.cc(193)] Killing off container android 2017-12-27T22:48:54.481335+00:00 INFO session_manager[1793]: [INFO:system_utils_impl.cc(93)] Sending 9 to 2268 as 0 2017-12-27T22:48:55.222584+00:00 INFO session_manager[1793]: [INFO:browser_job.cc(180)] Cleaned up child 1845 2017-12-27T22:48:55.222680+00:00 INFO session_manager[1793]: [INFO:session_manager_impl.cc(419)] emitting D-Bus signal SessionState\ Changed:stopped I have no idea what cv2 is, but could this be the problem?
,
Dec 28 2017
Mmm I don't think cv2 is the problem because nothing happens for almost two seconds. Then the session manager shuts something down---maybe android---but it appears to do that cleanly. So, no obvious failure from my limited viewpoint.
,
Dec 28 2017
Happened on auron_yuma paladin. This is breaking the CQ.
,
Dec 28 2017
On auron_yuma (https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/165946603-chromeos-test/chromeos4-row10-rack8-host3/cheets_StartAndroid.stress/sysinfo/var/log/chrome/), there is a timeout waiting for the browser backend. The last thing it does is starting the Android container (seemingly successfully), then a bunch of SELinux errors, unclear if they are critical (probably not). 2017-12-28T15:51:25.889844+00:00 INFO session_manager[2028]: [INFO:session_manager_impl.cc(1831)] Started Android container with PID 2373 2017-12-28T15:51:25.896632+00:00 WARNING kernel: [ 1241.572229] audit_printk_skb: 9 callbacks suppressed 2017-12-28T15:51:25.896662+00:00 NOTICE kernel: [ 1241.572253] audit: type=1400 audit(1514476285.894:966): avc: denied { module_request } for pid=2373 comm="init" kmod="personality-8" scontext=u:r:init:s0 tcontext=u:r:kernel:s0 tclass=system permissive=0 2017-12-28T15:51:25.896666+00:00 NOTICE kernel: [ 1241.572417] audit: type=1400 audit(1514476285.894:967): avc: denied { use } for pid=2373 comm="init" path="/init" dev="loop1" ino=15 scontext=u:r:init:s0 tcontext=u:r:chromeos:s0 tclass=fd permissive=0 2017-12-28T15:51:25.917951+00:00 DEBUG kernel: [ 1241.593335] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs 2017-12-28T15:51:25.917967+00:00 NOTICE kernel: [ 1241.593518] audit: type=1400 audit(1514476285.915:968): avc: denied { module_request } for pid=2373 comm="init" kmod="fs-configfs" scontext=u:r:init:s0 tcontext=u:r:kernel:s0 tclass=system permissive=0 2017-12-28T15:51:25.917601+00:00 INFO session_manager[2028]: [INFO:server_backed_state_key_generator.cc(126)] Stable device secret missing! 2017-12-28T15:51:25.922608+00:00 DEBUG kernel: [ 1241.598295] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs 2017-12-28T15:51:25.923592+00:00 NOTICE kernel: [ 1241.599557] audit: type=1400 audit(1514476285.921:969): avc: denied { mounton } for pid=2373 comm="init" path="/sys/kernel/debug" dev="tmpfs" ino=12721 scontext=u:r:init:s0 tcontext=u:object_r:debugfs:s0 tclass=dir permissive=0 2017-12-28T15:51:26.268617+00:00 DEBUG kernel: [ 1241.943677] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs 2017-12-28T15:51:26.272600+00:00 INFO kernel: [ 1241.947785] binder: 2408:2408 transaction failed 29189, size 0-0 2017-12-28T15:51:26.273600+00:00 NOTICE kernel: [ 1241.949111] audit: type=1400 audit(1514476286.271:970): avc: denied { module_request } for pid=2373 comm="init" kmod="fs-configfs" scontext=u:r:init:s0 tcontext=u:r:kernel:s0 tclass=system permissive=0 2017-12-28T15:51:26.853598+00:00 NOTICE kernel: [ 1242.528481] audit: type=1400 audit(1514476286.851:971): avc: denied { dac_read_search } for pid=2412 comm="main" capability=2 scontext=u:r:zygote:s0 tcontext=u:r:zygote:s0 tclass=capability permissive=0 2017-12-28T15:51:26.853612+00:00 NOTICE kernel: [ 1242.528586] audit: type=1400 audit(1514476286.851:972): avc: denied { dac_read_search } for pid=2412 comm="main" capability=2 scontext=u:r:zygote:s0 tcontext=u:r:zygote:s0 tclass=capability permissive=0 2017-12-28T15:51:26.853614+00:00 NOTICE kernel: [ 1242.528670] audit: type=1400 audit(1514476286.851:973): avc: denied { dac_read_search } for pid=2412 comm="main" capability=2 scontext=u:r:zygote:s0 tcontext=u:r:zygote:s0 tclass=capability permissive=0 2017-12-28T15:51:26.853615+00:00 NOTICE kernel: [ 1242.529300] audit: type=1400 audit(1514476286.851:974): avc: denied { dac_read_search } for pid=2412 comm="main" capability=2 scontext=u:r:zygote:s0 tcontext=u:r:zygote:s0 tclass=capability permissive=0 2017-12-28T15:51:26.856586+00:00 NOTICE kernel: [ 1242.531469] audit: type=1400 audit(1514476286.854:975): avc: denied { dac_read_search } for pid=2412 comm="main" capability=2 scontext=u:r:zygote:s0 tcontext=u:r:zygote:s0 tclass=capability permissive=0 2017-12-28T15:51:56.908797+00:00 NOTICE autotest[2474]: 07:51:56.902 ERROR| browser:0050| Failed with TimeoutException while starting the browser backend. In the chrome logs, there is a "Not implemented reached" error sandwiched between ARC instance start and shutting down after the autotest timeouts waiting for the browser: [2207:2207:1228/075124.828986:VERBOSE1:arc_session_runner.cc(258)] Starting ARC instance [2207:2207:1228/075126.327301:ERROR:server.cc(1915)] Not implemented reached in void exo::wayland::(anonymous namespace)::remote_surface_set_rectangular_shadow_background_opacity_DEPRECATED(wl_client *, wl_resource *, wl_fixed_t) [2207:2207:1228/075157.097507:VERBOSE1:arc_session_runner.cc(236)] OnShutdown
,
Dec 28 2017
Now that we have this bug tracking the current failure, let's use it. But I'll note that cheets_StartAndroid.stress fails anytime we get a bad android, so we should not conflate these issues. Let's mark this fixed once we're over the current problem. cmtm: Have you tried reproing this problem with the current arc++?
,
Dec 28 2017
I _think_ issue 798000 is another symptom of this bad android container image. Keeping it separate until proven the same.
,
Dec 29 2017
It seems that the timeout occurs here: https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py#L177 The timeout is 30s. It seems to be that this operation usually takes less than 8s and the time taken is very stable, so it's unlikely to be an issue where it sometimes "just takes longer". Also of note is that this occurs before the call to enable android here: https://cs.corp.google.com/chromeos_internal/src/third_party/autotest/files/client/common_lib/cros/chrome.py?l=175 Chrome is still started with ARC enabled before this. I'll have to dig into what this actually means.
,
Jan 2 2018
Another instance https://luci-milo.appspot.com/buildbot/chromeos/cave-paladin/2607
,
Jan 2 2018
,
Jan 2 2018
HWTests failing here: https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/4927
,
Jan 3 2018
Re #20: if the timeout indeed occurs around BindDevToolsClient, could this be the same/similar issue as issue 797620 ?
,
Jan 11 2018
,
Jan 11 2018
Issue 799948 has been merged into this issue.
,
Jan 12 2018
Another failure. https://luci-milo.appspot.com/buildbot/chromeos/auron_yuna-paladin/1917 cheets_StartAndroid.stress FAIL: Unhandled TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30).
,
Jan 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/9a225c02d1f78511f44bc9abdeea07fb9bfe011b commit 9a225c02d1f78511f44bc9abdeea07fb9bfe011b Author: Ben Lin <linben@google.com> Date: Sat Jan 13 06:37:50 2018 Add check for Chrome PID if exception occurs for startAndroid.stress. startAndroid.stress test has been hitting exceptions regarding timeout when trying to enable play store. We don't know whether Chrome is still alive or not at this point, so logging this. BUG= chromium:783832 BUG= chromium:799958 TEST=ran cheets_StartAndroid test locally. Change-Id: Ifbe1bc73fa75a29df01410a9ef4cd77d5b4c7f80 Reviewed-on: https://chromium-review.googlesource.com/865100 Commit-Ready: Ben Lin <linben@chromium.org> Tested-by: Ben Lin <linben@chromium.org> Reviewed-by: Yusuke Sato <yusukes@chromium.org> [modify] https://crrev.com/9a225c02d1f78511f44bc9abdeea07fb9bfe011b/client/common_lib/cros/chrome.py [modify] https://crrev.com/9a225c02d1f78511f44bc9abdeea07fb9bfe011b/client/site_tests/cheets_StartAndroid/cheets_StartAndroid.py
,
Jan 16 2018
Any update on this? Still seeing many failures.
,
Jan 16 2018
With the CL landed from Jan 13 we should be seeing more debug information now. Can you link the failures? Also adding this week's constable.
,
Jan 16 2018
This isn't a CrOS Infra bug. Recategorizing as ARC++. Probably, this also shouldn't be P0, but the ARC++ constable should make that call.
,
Jan 22 2018
As written in #20, py_utils.WaitFor(lambda: self.oobe_exists, 30) is a pretty generic failure that happens even before trying to start the main ARC instance (((even though the failure comes after starting a mini ARC instance for login-screen))) which is happening across all the kinds of Chrome OS tests including boards that have never enable ARC: https://stainless.corp.google.com/search?view=list&first_date=2017-12-24&last_date=2018-01-22&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&reason=oobe_exists&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true I think cheets_StartAndroid.stress is more visible just because, as being the "stress" test, it does 100x more login compared to other tests and thus has 100x more chance to hit the bug. I'm not sure if categorizing it as ARC++ bug is appropriate. Rather I think we want help for autotest/devtool-connection experts (though I don't know who they are...)
,
Jan 22 2018
,
Jan 22 2018
Looked at 4 or 5 logs of this failure. As far as I could see, in all of them, the Chrome process was restarting during the login wizard. It seems that autotest is waiting for the connection between the first (dead) chrome and is timing out. (I guess this is what linben@ observed as well and that's why the debug code of #28 is added.) All (and only) for the failing instance of Chrome, the corresponding /var/log/ui/ log recorded the following line Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed! immediately before the logged PID switches. https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/169160870-chromeos-test/chromeos6-row4-rack5-host1/ /var/log/messages 2018-01-10T21:03:07.477986+00:00 INFO session_manager[30716]: [INFO:browser_job.cc(147)] Running child /opt/google/chrome/chrome ... 2018-01-10T21:03:08.387393+00:00 INFO session_manager[30716]: [INFO:child_exit_handler.cc(77)] Handling 30827 exit. 2018-01-10T21:03:08.387513+00:00 ERR session_manager[30716]: [ERROR:child_exit_handler.cc(79)] Exited with exit code 127 2018-01-10T21:03:08.387828+00:00 INFO session_manager[30716]: [INFO:session_manager_service.cc(307)] Exiting process is chrome. 2018-01-10T21:03:08.387894+00:00 INFO session_manager[30716]: [INFO:browser_job.cc(156)] Terminating process group: Ensuring browser processes are gone. 2018-01-10T21:03:08.387941+00:00 INFO session_manager[30716]: [INFO:system_utils_impl.cc(93)] Sending 9 to -30827 as 1000 2018-01-10T21:03:08.423756+00:00 INFO session_manager[30716]: [INFO:browser_job.cc(147)] Running child /opt/google/chrome/chrome ... autotest DEBUG log 01/10 13:03:07.450 INFO |cros_browser_backe:0141| Restarting Chrome (pid=30735) with remote port ... 01/10 13:03:07.468 DEBUG| global_hooks:0056| ['sh', '-c', '/bin/ps --no-headers -A -o pid,ppid,args:4096,state'] 01/10 13:03:07.519 DEBUG| cros_interface:0367| ListProcesses(<predicate>)->[302 processes] 01/10 13:03:07.521 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpbILu43) 01/10 13:03:07.522 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpS4IMRr) 01/10 13:03:07.623 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpWWODD4) 01/10 13:03:07.724 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmptlRQfF) 01/10 13:03:07.826 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpqkKm_u) 01/10 13:03:07.927 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpg_CsWu) 01/10 13:03:08.029 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpNspBk8) 01/10 13:03:08.130 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpkLb2Rf) 01/10 13:03:08.231 DEBUG| cros_interface:0309| GetFile(/home/chronos/DevToolsActivePort, /tmp/tmpuhVMKG) 01/10 13:03:08.231 DEBUG| cros_interface:0339| GetFileContents(/home/chronos/DevToolsActivePort)->39626 /devtools/browser/f79536d4-319d-44e8-b69c-694ea7659f83 01/10 13:03:08.232 INFO | __init__:0047| DoNothingForwarder started between 127.0.0.1:39626 and 39626 01/10 13:03:08.233 DEBUG|do_nothing_forward:0063| Connection test succeeded for 127.0.0.1:39626 01/10 13:03:08.233 INFO |chrome_browser_bac:0231| Got devtools config: ws://127.0.0.1:39626/devtools/browser/f79536d4-319d-44e8-b69c-694ea7659f83 01/10 13:03:44.569 ERROR| browser:0052| Failed with TimeoutException while starting the browser backend. /var/log/ui.* [30827:30827:0110/130308.087582:ERROR:input_method_manager_impl.cc(1032)] IMEEngine for "jkghodnilhceideoidjikpgommlajknk" is not registered DevTools listening on ws://127.0.0.1:39626/devtools/browser/f79536d4-319d-44e8-b69c-694ea7659f83 device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed! [30949:30949:0110/130309.190555:ERROR:input_method_manager_impl.cc(1032)] IMEEngine for "jkghodnilhceideoidjikpgommlajknk" is not registered DevTools listening on ws://127.0.0.1:34615/devtools/browser/ccb23e8f-bef1-442a-9fa7-ac7e33fcd8c0 device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class [30949:30949:0110/130314.845695:ERROR:server.cc(1915)] Not implemented reached in void exo::wayland::(anonymous namespace)::remote_surface_set_rectangular_shadow_background_opacity_DEPRECATED(wl_client *, wl_resource *, wl_fixed_t)
,
Jan 22 2018
An instance on peach_pi, policy_ProxySettings.directproxy_usenoproxy https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/170911181-chromeos-test/chromeos6-row2-rack9-host4/policy_ProxySettings/ (ui.20180119-054605 is the restarted chrome run with "Inconsistency detected by ld.so".)
,
Jan 22 2018
According to my Google search, the error message can be due to a race between dlopen and pthread_create of glibc https://sourceware.org/ml/libc-alpha/2016-01/msg00089.html but so far I have zero familiarity with the field to access it...
,
Jan 23 2018
Upstream (glibc) bug https://sourceware.org/bugzilla/show_bug.cgi?id=19329 (still open)
,
Jan 23 2018
,
Jan 23 2018
CC:ing people who seem to be familiar with glibc (chosen from sys-libs/glibc git history). +vapier +gbiv +yunlian, do you folks have any suggestions? (or do you know anyone who's familiar in this area?) The bug https://sourceware.org/bugzilla/show_bug.cgi?id=19329 looks biting Chrome OS: 0.01 ~ 0.1 % of time Chrome restarts in OOBE screen due to this assertion in libc, and makes autotests failing.
,
Jan 23 2018
(1) Ideally we want to fix the root cause (race in glibc) and I believe it's worth trying, but even the upstream people looks struggling (2) Add a workaround to Chrome browser so that the critical situation can be avoided. (Not sure if it is practically feasible, and it doesn't look great anyway.) (3) Add a workaround to either of Telemetry/Autotest so that they become robust against Chrome crash at this timing. But I feel really awkward to do that, because it's just hiding the real bug. The sole purpose of the "stress" test should be to detect and fix this kind of situation (= occasional crash/failure in relatively low repro ratio.)
,
Jan 23 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/fed7f8aba8cfcc2b75b1ab86cad46b5476bb551d commit fed7f8aba8cfcc2b75b1ab86cad46b5476bb551d Author: Ben Lin <linben@chromium.org> Date: Tue Jan 23 09:48:47 2018 Revert "Add check for Chrome PID if exception occurs for startAndroid.stress." This reverts commit 9a225c02d1f78511f44bc9abdeea07fb9bfe011b. Reason for revert: See original CL's comments - test is no longer failing properly. Original change's description: > Add check for Chrome PID if exception occurs for startAndroid.stress. > > startAndroid.stress test has been hitting exceptions regarding timeout > when trying to enable play store. We don't know whether Chrome is still > alive or not at this point, so logging this. > > BUG= chromium:783832 > BUG= chromium:799958 > TEST=ran cheets_StartAndroid test locally. > > Change-Id: Ifbe1bc73fa75a29df01410a9ef4cd77d5b4c7f80 > Reviewed-on: https://chromium-review.googlesource.com/865100 > Commit-Ready: Ben Lin <linben@chromium.org> > Tested-by: Ben Lin <linben@chromium.org> > Reviewed-by: Yusuke Sato <yusukes@chromium.org> Bug: chromium:783832 , chromium:799958 Change-Id: I788e204682bb62093ad7284e78817c60f3ddb39f Reviewed-on: https://chromium-review.googlesource.com/879241 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Ben Lin <linben@chromium.org> Reviewed-by: Yusuke Sato <yusukes@chromium.org> Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org> [modify] https://crrev.com/fed7f8aba8cfcc2b75b1ab86cad46b5476bb551d/client/common_lib/cros/chrome.py [modify] https://crrev.com/fed7f8aba8cfcc2b75b1ab86cad46b5476bb551d/client/site_tests/cheets_StartAndroid/cheets_StartAndroid.py
,
Jan 24 2018
,
Jan 24 2018
I will test the glibc patch proposed in the upstream bug mentioned in #37. But it may take a few days to finish testing. Should we keep this as a P0 or P1?
,
Jan 24 2018
> Should we keep this as a P0 or P1? Per crbug.com/783832#c31 , this probably shouldn't be P0; the failure rate doesn't seem to be high enough.
,
Jan 24 2018
,
Jan 24 2018
lowering to P1. is linben@ still doing something for this? who should be the owner?
,
Jan 24 2018
I had a change (which I've reverted) that was trying to check whether it was Chrome failing or not, but looks like syslog has enough information for that. I don't have enough experience with glibc, so not working on this anymore.
,
Jan 24 2018
,
Jan 26 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ef56d33d40afb1da263697944305a1288122c8cd commit ef56d33d40afb1da263697944305a1288122c8cd Author: Ben Lin <linben@chromium.org> Date: Fri Jan 26 04:18:51 2018 Revert "Add check for Chrome PID if exception occurs for startAndroid.stress." This reverts commit 9a225c02d1f78511f44bc9abdeea07fb9bfe011b. Reason for revert: See original CL's comments - test is no longer failing properly. Original change's description: > Add check for Chrome PID if exception occurs for startAndroid.stress. > > startAndroid.stress test has been hitting exceptions regarding timeout > when trying to enable play store. We don't know whether Chrome is still > alive or not at this point, so logging this. > > BUG= chromium:783832 > BUG= chromium:799958 > TEST=ran cheets_StartAndroid test locally. > > Change-Id: Ifbe1bc73fa75a29df01410a9ef4cd77d5b4c7f80 > Reviewed-on: https://chromium-review.googlesource.com/865100 > Commit-Ready: Ben Lin <linben@chromium.org> > Tested-by: Ben Lin <linben@chromium.org> > Reviewed-by: Yusuke Sato <yusukes@chromium.org> Bug: chromium:783832 , chromium:799958 Change-Id: I788e204682bb62093ad7284e78817c60f3ddb39f Reviewed-on: https://chromium-review.googlesource.com/879241 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Ben Lin <linben@chromium.org> Reviewed-by: Yusuke Sato <yusukes@chromium.org> Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org> (cherry picked from commit fed7f8aba8cfcc2b75b1ab86cad46b5476bb551d) Reviewed-on: https://chromium-review.googlesource.com/888298 Tested-by: Kazuhiro Inaba <kinaba@chromium.org> [modify] https://crrev.com/ef56d33d40afb1da263697944305a1288122c8cd/client/common_lib/cros/chrome.py [modify] https://crrev.com/ef56d33d40afb1da263697944305a1288122c8cd/client/site_tests/cheets_StartAndroid/cheets_StartAndroid.py
,
Jan 26 2018
,
Jan 26 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/2832d11e4e7071222a1464b26b365ef4edf3b720 commit 2832d11e4e7071222a1464b26b365ef4edf3b720 Author: Yunlian Jiang <yunlian@chromium.org> Date: Fri Jan 26 18:58:55 2018 glibc: backport an patch to fix [BZ 19329] This backports a pending CL from https://sourceware.org/bugzilla/show_bug.cgi?id=19329 to fix the issue. The commit message is This fixes a subset of the issues described in https://sourceware.org/ml/libc-alpha/2016-11/msg01026.html without adding locks to pthread_create. Only races between dlopen and pthread_create were considered, and the asserts got removed that tried to check for concurrency issues. The patch is incomplete because dlclose, tls access and dl_iterate_phdr related code paths are not modified. dlclose should be updated in a similar fashion to dlopen to make the patch complete alternatively pthread_create may take the GL(dl_load_write_lock) to sync with dlclose or the GL(dl_load_lock) to sync with dlopen and dlclose (that would simplify the concurrency design, but increase lock contention on the locks). 2016-11-30 Szabolcs Nagy <szabolcs.nagy@arm.com> [BZ #19329] * elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation) atomically. * elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation), GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically. Remove assertions that cannot be guaranteed. (_dl_add_to_slotinfo): Write the slotinfo entries and listp->next atomically. BUG= chromium:783832 TEST=cbuildbot chromiuos-sdk falco-llvm-toolchain kevin-llvm-toolchain caroline-llvm-toolchain daisy-llvm-toolchain Change-Id: I5f929f3e9bc2a4e1733907745e24e47aa92d1156 Reviewed-on: https://chromium-review.googlesource.com/883568 Commit-Ready: Yunlian Jiang <yunlian@chromium.org> Tested-by: Yunlian Jiang <yunlian@chromium.org> Reviewed-by: George Burgess <gbiv@chromium.org> [rename] https://crrev.com/2832d11e4e7071222a1464b26b365ef4edf3b720/sys-libs/glibc/glibc-2.23-r15.ebuild [add] https://crrev.com/2832d11e4e7071222a1464b26b365ef4edf3b720/sys-libs/glibc/files/local/glibc-2.23-dlopen.patch
,
Jan 26 2018
I have backported a glibc patch and it needs a chromiumos-sdk uprev to take effect. How could we test whether this change really fixes this issue?
,
Jan 29 2018
Thank you very much for the patching! > How could we test whether this change really fixes this issue? As far as I investigated, all the failures from cheets_StartAndroid.stress.* with an error message "TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30)." were due to this glibc assertion. So if it is deployed to the release images, we can watch the tree for a week and see if this type of error is gone or not. Otherwise, unfortunately there's no reliable repro for the bug. Running the test `test_that cheets_StartAndroid.stress.0` locally 100 times or something like that may be the only way to test.
,
Jan 29 2018
Do we need this patch to go to R65?
,
Jan 29 2018
i think this patch is too risky for R65 at this point
,
Jan 30 2018
we should monitor ToT and see how less flaky this makes; but this should appear as user perceived latency increased for <1% of logins if it's restarting Chrome but not really visibly broken.
,
Jan 30 2018
Right, the session manager is anyhow restarting Chrome. So in the real use case, user will only see a blackout/flash at the login screen in the worst case, I believe. Contrarily, tests are sensitive to the browser crash and thus are blocking the commit queue etc sometimes. Fix in ToT is thus very important but release branches will probably be able to bare with it. Dashboard to monitor (query for error messages that have been caused by this failure): https://stainless.corp.google.com/search?view=list&first_date=2018-01-15&last_date=2018-02-15&build=R66&status=FAIL&status=ERROR&reason=oobe_exists%7CDevToolsClientConnectionError%3A+%27%27&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false (note that Stainless is not updating since Jan 23 at the moment b/72484907)
,
Feb 2 2018
If I'm reading the thing correctly The glibc patching CL has landed in 10345.0.0 https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/883568 which is picked up by this SDK builder run that updated to SDK_LATEST_VERSION="2018.01.26.210622" https://build.chromium.org/p/chromiumos/builders/chromiumos-sdk/builds/8475 which is picked up by the R66-10347.0.0 release: https://crosland.corp.google.com/log/10346.0.0..10347.0.0 https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/0d09cfb399d8d1b0c916759ae077099c4aa37adc%5E%21 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The cheets_StartAndroid test run 6233 times since then, and we see ZERO "py_utils.WaitFor(lambda: self.oobe_exists, 30)" timeout failures: https://stainless.corp.google.com/search?view=list&first_date=2018-01-25&last_date=2018-02-03&test=cheets_StartAndroid&build=%5ER66-10%28%5B4-9%5D%7C3%28%5B5-9%5D%7C4%5B7-9%5D%29%29&status=GOOD&status=WARN&status=FAIL&status=ERROR&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false (Without the fix, say, R65 bots in the same date range had run the test 3962 times and encountered 13 failures of that type.) (The success ratio is 6215/6233 vs 3932/3962, so we are neither seeing any increase of other failure modes.) Even if we widen the search range to all test failures Stainless has gathered, we are no more seeing the timeout: https://stainless.corp.google.com/search?view=list&first_date=2018-01-25&last_date=2018-02-03&build=%5ER66-10%28%5B4-9%5D%7C3%28%5B5-9%5D%7C4%5B7-9%5D%29%29&status=FAIL&status=ERROR&reason=oobe_exists%7CDevToolsClientConnectionError%3A+%27%27&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false "Unhandled DevToolsClientConnectionError: ''" was another possible failure mode that had been caused by the glibc assertion, This failure reason is still occasionally observed, but if we closely look into the logs, we know that these are completely different failures. In our case the failure was during the browser start time, with a log message "Failed with DevToolsClientConnectionError while starting the browser backend." but none of the failures in above link were of this type. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ So, in short, I believe the issue is now resolved. Thank you very much for your help, Yunlian!
,
Mar 19 2018
|
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by ddavenp...@chromium.org
, Nov 10 2017