devserver hangs in post-check rootfs update |
|||
Issue descriptionhttps://uberchromegw.corp.google.com/i/chromeos/builders/veyron_jerry-release/builds/662 https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/90425369-chromeos-test/chromeos4-row4-rack4-host17/debug/ from autoserv.DEBUG: 12/13 05:12:44.669 DEBUG| dev_server:1714| Current CrOS auto-update status: rootfs update 12/13 05:12:54.709 DEBUG| base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"'' 12/13 05:12:55.731 DEBUG| dev_server:1714| Current CrOS auto-update status: rootfs update 12/13 05:13:05.781 DEBUG| base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"'' 12/13 05:13:06.750 DEBUG| dev_server:1714| Current CrOS auto-update status: rootfs update 12/13 05:13:16.801 DEBUG| base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"'' 12/13 05:13:17.684 DEBUG| dev_server:1714| Current CrOS auto-update status: post-check rootfs update 12/13 05:13:27.721 DEBUG| base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"'' 12/13 05:13:28.608 DEBUG| dev_server:1714| Current CrOS auto-update status: post-check rootfs update 12/13 05:13:38.657 DEBUG| base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"'' 12/13 05:13:39.546 DEBUG| dev_server:1714| Current CrOS auto-update status: post-check rootfs update ... 2/13 05:23:29.402 DEBUG| base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"'' 12/13 05:23:30.282 DEBUG| dev_server:1776| Failed to trigger auto-update process on devserver 12/13 05:23:30.283 DEBUG| base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/handler_cleanup?pid=3356&host_name=chromeos4-row4-rack4-host17"'' The code for the post-check is in chromite/auto_update.py: def PostCheckRootfsUpdate(self): """Post-check for rootfs update for CrOS host.""" logging.debug('Start post check for rootfs update...') active_kernel, inactive_kernel = self._GetKernelState() if (self._GetKernelPriority(inactive_kernel) < self._GetKernelPriority(active_kernel)): raise RootfsUpdateError('Update failed. The priority of the inactive ' 'kernel partition is less than that of the ' 'active kernel partition.') so it's not doing much, just a couple of ssh calls. I will check the devserver logs next (ouch yes, those gigabyte-sized logs). I haven't seen this failure before, so Pri-3 for now.
,
Dec 13 2016
Amazing. I found yet another copy of the syslog in sysinfo.tgz in /mnt/stateful_partition/unencripted/preserve/log. It includes a different time frame from the other two, but still not the useful one. This one is from a few hours earlier and ends here: 016-12-13T01:17:24.678972-08:00 INFO avahi-daemon[1569]: Interface eth1.IPv4 no longer relevant for mDNS. 2016-12-13T01:17:24.701787-08:00 INFO kernel: [ 451.777446] smsc95xx 3-1.1:1.0 eth0: hardware isn't capable of remote wakeup 2016-12-13T01:17:24.922167-08:00 ERR wpa_supplicant[577]: rmdir[ctrl_interface=/var/run/wpa_supplicant]: Permission denied 2016-12-13T01:17:25.087422-08:00 NOTICE shill_logout_user[3627]: Failed to pop shill user profiles 2016-12-13T01:17:25.098949-08:00 INFO session_manager[851]: [INFO:browser_job.cc(124)] Terminating process: 2016-12-13T01:17:25.099003-08:00 INFO session_manager[851]: [INFO:system_utils_impl.cc(110)] Sending 15 to 888 as 1000 2016-12-13T01:17:25.481688-08:00 INFO kernel: [ 452.564158] rockchip-vop ff940000.vop: Detached from iommu domain 2016-12-13T01:17:25.662933-08:00 INFO session_manager[851]: [INFO:session_manager_service.cc(613)] SessionManagerService quitting run loop 2016-12-13T01:17:25.665797-08:00 INFO session_manager[851]: [INFO:session_manager_service.cc(238)] SessionManagerService exiting 2016-12-13T01:17:25.716107-08:00 INFO session_manager[851]: [INFO:policy_service.cc(202)] Persisted policy to disk.
,
Dec 13 2016
No more clues. I wonder if it's another case of "hanging ssh". Devserver uses the chromite libraries for remote commands, and does not retry ssh calls. If we had the right syslogs, we could tell.
,
Jan 4 2017
Xixuan - can you take a look?
,
Jan 4 2017
Looks like something happens during rebooting the host. Logs for auto-update: 2016/12/13 05:11:45.339 DEBUG| auto_updater:0922| Start post check for rootfs update... ... other commands ... 2016/12/13 05:11:46.377 INFO | remote_access:0376| Rebooting chromeos4-row4-rack4-host17... 2016/12/13 05:11:46.377 DEBUG| cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpguuAoQ/testing_rsa root@chromeos4-row4-rack4-host17 -- 'touch /tmp/awaiting_reboot && reboot' 2016/12/13 05:11:51.635 DEBUG| cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=2' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=2' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpguuAoQ/testing_rsa root@chromeos4-row4-rack4-host17 -- "[ ! -e '/tmp/awaiting_reboot' ]" 2016/12/13 05:11:56.707 INFO | remote_access:0371| Cannot connect to device; reboot in progress. 2016/12/13 05:12:01.709 DEBUG| cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=2' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=2' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpguuAoQ/testing_rsa root@chromeos4-row4-rack4-host17 -- "[ ! -e '/tmp/awaiting_reboot' ]" 2016/12/13 05:12:06.779 INFO | remote_access:0371| Cannot connect to device; reboot in progress. ... ... 2016/12/13 05:19:45.215 DEBUG| cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=2' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=2' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpguuAoQ/testing_rsa root@chromeos4-row4-rack4-host17 -- "[ ! -e '/tmp/awaiting_reboot' ]" 2016/12/13 05:19:50.284 INFO | remote_access:0371| Cannot connect to device; reboot in progress. 2016/12/13 05:19:50.285 ERROR| cros_build_lib:0659| Reboot has not completed after 480 seconds; giving up. But Luigi already has a CL which changes the reboot logic. Release builder's code is not very up-to-date, maybe this is the reason.
,
Mar 10 2018
This bug has not been touched in over a year. It is probably no longer relevant. |
|||
►
Sign in to add a comment |
|||
Comment 1 by semenzato@chromium.org
, Dec 13 2016