New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 673804 link

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

devserver hangs in post-check rootfs update

Project Member Reported by semenzato@chromium.org, Dec 13 2016

Issue description


https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_jerry-release/builds/662

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/90425369-chromeos-test/chromeos4-row4-rack4-host17/debug/

from autoserv.DEBUG:

12/13 05:12:44.669 DEBUG|        dev_server:1714| Current CrOS auto-update status: rootfs update
12/13 05:12:54.709 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"''
12/13 05:12:55.731 DEBUG|        dev_server:1714| Current CrOS auto-update status: rootfs update
12/13 05:13:05.781 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"''
12/13 05:13:06.750 DEBUG|        dev_server:1714| Current CrOS auto-update status: rootfs update
12/13 05:13:16.801 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"''
12/13 05:13:17.684 DEBUG|        dev_server:1714| Current CrOS auto-update status: post-check rootfs update
12/13 05:13:27.721 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"''
12/13 05:13:28.608 DEBUG|        dev_server:1714| Current CrOS auto-update status: post-check rootfs update
12/13 05:13:38.657 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"''
12/13 05:13:39.546 DEBUG|        dev_server:1714| Current CrOS auto-update status: post-check rootfs update

...

2/13 05:23:29.402 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/get_au_status?build_name=veyron_jerry-release/R57-9081.0.0&force_update=True&pid=3356&host_name=chromeos4-row4-rack4-host17&full_update=False"''
12/13 05:23:30.282 DEBUG|        dev_server:1776| Failed to trigger auto-update process on devserver
12/13 05:23:30.283 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.133 'curl "http://100.115.219.133:8082/handler_cleanup?pid=3356&host_name=chromeos4-row4-rack4-host17"''


The code for the post-check is in chromite/auto_update.py:

  def PostCheckRootfsUpdate(self):
    """Post-check for rootfs update for CrOS host."""
    logging.debug('Start post check for rootfs update...')
    active_kernel, inactive_kernel = self._GetKernelState()
    if (self._GetKernelPriority(inactive_kernel) <
        self._GetKernelPriority(active_kernel)):
      raise RootfsUpdateError('Update failed. The priority of the inactive '
                              'kernel partition is less than that of the '
                              'active kernel partition.')

so it's not doing much, just a couple of ssh calls.  I will check the devserver logs next (ouch yes, those gigabyte-sized logs).

I haven't seen this failure before, so Pri-3 for now.




 
So very sad.

/var/log/messages from crashinfo.chromeos4-row4-rack4-host17.tgz is completely missing the boot session that includes the failure time (around 5:13am).  Snippet:

2016-12-13T13:34:24.134776+00:00 INFO session_manager[960]: [INFO:system_utils_impl.cc(110)] Sending 15 to 1013 as 1000
2016-12-13T13:34:24.618571+00:00 INFO session_manager[960]: [INFO:session_manager_service.cc(616)] SessionManagerService quitting run loop
2016-12-13T13:34:24.619267+00:00 INFO session_manager[960]: [INFO:session_manager_service.cc(241)] SessionManagerService exiting
2016-12-13T13:34:24.650077+00:00 INFO session_manager[960]: [INFO:policy_service.cc(202)] Persisted policy to disk.
2016-12-13T13:34:25.095362+00:00 WARNING chapsd[1024]: SRK does not exist - this is normal when the TPM is not yet owned.
2016-12-13T05:34:31.157342-08:00 INFO kernel: [    0.000000] Booting Linux on physical CPU 0x500
2016-12-13T05:34:31.157476-08:00 INFO kernel: [    0.000000] Initializing cgroup subsys cpuset
2016-12-13T05:34:31.157481-08:00 INFO kernel: [    0.000000] Initializing cgroup subsys cpu
2016-12-13T05:34:31.157491-08:00 INFO kernel: [    0.000000] Initializing cgroup subsys cpuacct
2016-12-13T05:34:31.157494-08:00 NOTICE kernel: [    0.000000] Linux version 3.14.0 (chrome-bot@cros-beefy72-c2) (gcc version 4.9.x 20150123 (prerelease) (4.9.2_cos_gg_4.9.2-r145-41f3e25635616c067b9ee272304e6f86ac8ee9db_4.9.2-r145) ) #1 SMP PREEMPT Tue Dec 13 02:59:52 PST 2016

/var/log/messages from sysinfo.tgz is stopping just before the failure.  This is the end of the file.

2016-12-13T05:02:39.250923-08:00 NOTICE ag[27446]: autotest server[stack::_do_run|execute_control|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2016-12-13T05:02:39.423332-08:00 NOTICE ag[27449]: autotest server[stack::execute_section|_execute_daemon|get_tmp_dir] -> ssh_run(mkdir -p /tmp)
2016-12-13T05:02:39.631232-08:00 NOTICE ag[27452]: autotest server[stack::execute_section|_execute_daemon|get_tmp_dir] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX)
2016-12-13T05:02:40.349578-08:00 NOTICE ag[27457]: autotest server[stack::execute_control|execute_section|_execute_daemon] -> ssh_run(nohup /tmp/sysinfo/autoserv-Zqcyxy/bin/autotestd /tmp/autoserv-9EsyW6 -H autoserv --verbose --hostname=chromeos4-row4-rack4-host17 --user=chromeos-test /tmp/sysinfo/autoserv-Zqcyxy/control.autoserv >/dev/null 2>/dev/null &)
2016-12-13T05:02:40.575595-08:00 NOTICE ag[27462]: autotest server[stack::execute_control|execute_section|_execute_daemon] -> ssh_run(/tmp/sysinfo/autoserv-Zqcyxy/bin/autotestd_monitor /tmp/autoserv-9EsyW6 0 0)

So tragic.  All these logs, and none of them is helping me.

Amazing.  I found yet another copy of the syslog in sysinfo.tgz in /mnt/stateful_partition/unencripted/preserve/log.  It includes a different time frame from the other two, but still not the useful one.  This one is from a few hours earlier and ends here:

016-12-13T01:17:24.678972-08:00 INFO avahi-daemon[1569]: Interface eth1.IPv4 no longer relevant for mDNS.
2016-12-13T01:17:24.701787-08:00 INFO kernel: [  451.777446] smsc95xx 3-1.1:1.0 eth0: hardware isn't capable of remote wakeup
2016-12-13T01:17:24.922167-08:00 ERR wpa_supplicant[577]: rmdir[ctrl_interface=/var/run/wpa_supplicant]: Permission denied
2016-12-13T01:17:25.087422-08:00 NOTICE shill_logout_user[3627]: Failed to pop shill user profiles
2016-12-13T01:17:25.098949-08:00 INFO session_manager[851]: [INFO:browser_job.cc(124)] Terminating process: 
2016-12-13T01:17:25.099003-08:00 INFO session_manager[851]: [INFO:system_utils_impl.cc(110)] Sending 15 to 888 as 1000
2016-12-13T01:17:25.481688-08:00 INFO kernel: [  452.564158] rockchip-vop ff940000.vop: Detached from iommu domain
2016-12-13T01:17:25.662933-08:00 INFO session_manager[851]: [INFO:session_manager_service.cc(613)] SessionManagerService quitting run loop
2016-12-13T01:17:25.665797-08:00 INFO session_manager[851]: [INFO:session_manager_service.cc(238)] SessionManagerService exiting
2016-12-13T01:17:25.716107-08:00 INFO session_manager[851]: [INFO:policy_service.cc(202)] Persisted policy to disk.
No more clues.  I wonder if it's another case of "hanging ssh".  Devserver uses the chromite libraries for remote commands, and does not retry ssh calls.  If we had the right syslogs, we could tell.

Owner: xixuan@chromium.org
Xixuan - can you take a look? 
Looks like something happens during rebooting the host.

Logs for auto-update: 

2016/12/13 05:11:45.339 DEBUG|      auto_updater:0922| Start post check for rootfs update...
...
other commands
...
2016/12/13 05:11:46.377 INFO |     remote_access:0376| Rebooting chromeos4-row4-rack4-host17...
2016/12/13 05:11:46.377 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpguuAoQ/testing_rsa root@chromeos4-row4-rack4-host17 -- 'touch /tmp/awaiting_reboot && reboot'
2016/12/13 05:11:51.635 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=2' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=2' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpguuAoQ/testing_rsa root@chromeos4-row4-rack4-host17 -- "[ ! -e '/tmp/awaiting_reboot' ]"
2016/12/13 05:11:56.707 INFO |     remote_access:0371| Cannot connect to device; reboot in progress.
2016/12/13 05:12:01.709 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=2' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=2' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpguuAoQ/testing_rsa root@chromeos4-row4-rack4-host17 -- "[ ! -e '/tmp/awaiting_reboot' ]"
2016/12/13 05:12:06.779 INFO |     remote_access:0371| Cannot connect to device; reboot in progress.
...
...
2016/12/13 05:19:45.215 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=2' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=2' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpguuAoQ/testing_rsa root@chromeos4-row4-rack4-host17 -- "[ ! -e '/tmp/awaiting_reboot' ]"
2016/12/13 05:19:50.284 INFO |     remote_access:0371| Cannot connect to device; reboot in progress.
2016/12/13 05:19:50.285 ERROR|    cros_build_lib:0659| Reboot has not completed after 480 seconds; giving up.

But Luigi already has a CL which changes the reboot logic. Release builder's code is not very up-to-date, maybe this is the reason.
Status: Archived (was: Untriaged)
This bug has not been touched in over a year.  It is probably no longer relevant.

Sign in to add a comment