New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.
Starred by 8 users
Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment
-release: RootfsUpdateError: Update failed with unexpected update status: UPDATE_STATUS_IDLE'
Project Member Reported by drinkcat@chromium.org, Nov 28 Back to list
Seen on a few recent -release builds (all updating from R64-10164.0.0 to R64-10166.0.0):
https://luci-milo.appspot.com/buildbot/chromeos/eve-release/1136
https://luci-milo.appspot.com/buildbot/chromeos/gandof-release/1702
https://luci-milo.appspot.com/buildbot/chromeos/glimmer-release/1993
https://luci-milo.appspot.com/buildbot/chromeos/quawks-release/1997
https://luci-milo.appspot.com/buildbot/chromeos/snappy-release/1233
https://luci-milo.appspot.com/buildbot/chromeos/squawks-release/1995
https://luci-milo.appspot.com/buildbot/chromeos/tidus-release/1708
https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-release/1713
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-release/1719

e.g. for wizpig:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/158947191-chromeos-test/chromeos6-row2-rack20-host9

2017/11/27 12:56:00.384 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpAqPLZb/testing_rsa root@chromeos6-row2-rack20-host9 -- update_engine_client --status
2017/11/27 12:56:00.708 DEBUG|    cros_build_lib:0642| (stdout):
LAST_CHECKED_TIME=1511816027
PROGRESS=0.345030
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=999999.0.0
NEW_SIZE=1026024186

2017/11/27 12:56:00.709 DEBUG|    cros_build_lib:0644| (stderr):
Warning: Permanently added 'chromeos6-row2-rack20-host9,100.115.131.39' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos6-row2-rack20-host9,100.115.131.39' (ED25519) to the list of known hosts.
[1127/125559:INFO:update_engine_client.cc(508)] Querying Update Engine status...

2017/11/27 12:56:00.709 INFO |      auto_updater:0711| Waiting for update...status: UPDATE_STATUS_DOWNLOADING at progress 0.345030
2017/11/27 12:56:10.720 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpAqPLZb/testing_rsa root@chromeos6-row2-rack20-host9 -- update_engine_client --status
2017/11/27 12:56:39.428 ERROR|     remote_access:0961| Error connecting to device chromeos6-row2-rack20-host9
2017/11/27 12:57:59.372 DEBUG|    cros_build_lib:0642| (stdout):
LAST_CHECKED_TIME=0
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_IDLE
NEW_VERSION=0.0.0.0
NEW_SIZE=0

The update engine suddenly goes from UPDATE_STATUS_DOWNLOADING to UPDATE_STATUS_IDLE, no idea why. AFAICT the DUT does not reboot (but I'm not fully sure as I can't find system logs...)
 
It is weird, I cannot see any system logs specially update engine logs in /var/log/update_engine. We probably have to find out why the logs aren't being gathered! update_engine logs can shed light on what could be the cause of the problem.
Cc: rspangler@chromium.org
Owner: caveh@chromium.org
Cc: pho...@chromium.org dgarr...@chromium.org
the logs are under crashinfo.*/var/log once you drill down to the collected logs.

eve's update_engine.log has gems like:

Pending updates found. Abort.
ERROR: Failed to complete.

but there are additional errors from dbus, but they may be unrelated.

all the logs seem to get terminated (truncated?) after the 10% or 20% progress marker from the delta_performer.

there's nothing obvious in the logs that might explain this.


Cc: jrbarnette@chromium.org
Yeah, I don't know why I didn't see those logs the first time. Probably looking at the wrong place :/.

But, anyways, if you look at the eve logs, you can see this pattern:


From CrOS_update_100.115.230.132_16142.log update starts at:

2017/11/27 13:01:06.089 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpWszFjN/testing_rsa root@100.115.230.132 -- update_engine_client -check_for_update '-omaha_url=http://127.0.0.1:39029/update/pregenerated'


And here it gets the wrong update engine status:
2017/11/27 13:01:16.680 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpWszFjN/testing_rsa root@100.115.230.132 -- update_engine_client --status
2017/11/27 13:01:49.352 ERROR|     remote_access:0961| Error connecting to device 100.115.230.132
2017/11/27 13:02:18.972 DEBUG|    cros_build_lib:0642| (stdout):
LAST_CHECKED_TIME=0
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_IDLE
NEW_VERSION=0.0.0.0
NEW_SIZE=0


But the last update_engine log that exists and has the logs for the actual update starts with this:

[1127/130216:INFO:main.cc(113)] Chrome OS Update Engine starting
[1127/130216:INFO:boot_control_chromeos.cc(127)] Booted from slot 0 (slot A) of 2 slots present on disk /dev/mmcblk0
[1127/130216:INFO:real_system_state.cc(74)] Booted non-official build.

which shows that at some point the update_engine was rebooted, so it started a new log with a new state UPDATE_STATUS_IDLE.
It seems like that happened two times. The update_engine.20171127-130058 is the actual log that starts the update check at 13:01:06. Then it gets the following error:

[1127/130108:ERROR:object_proxy.cc(582)] Failed to call method: org.chromium.NetworkProxyServiceInterface.ResolveProxy: object_path= /org/chromium/NetworkProxyService: org.freedesktop.DBus.Error.ServiceUnknown: The name org.chromium.NetworkProxyService was not provided by any .service files
[1127/130108:ERROR:http_proxy.cc(27)] org.chromium.NetworkProxyService D-Bus call to ResolveProxy failed

But that error doesn't necessary means any problem. It just means that the chrome was shut down (which provision script does it).

Then it exits here before finishing the update:
[1127/130114:INFO:main.cc(128)] Chrome OS Update Engine terminating with exit code 0
[1127/130114:INFO:action_processor.cc(58)] ActionProcessor: aborted OmahaRequestAction
[1127/130114:INFO:base_message_loop.cc(75)] Leaking 4 canceled tasks.

and the new update engine log update_engine.20171127-130204 exits at: 
[1127/130211:INFO:main.cc(128)] Chrome OS Update Engine terminating with exit code 0
[1127/130211:INFO:base_message_loop.cc(75)] Leaking 3 canceled tasks.

and then you can see the last update_engine log which shows the delta_performer stopped in the middle.

I have no idea at this point why that two update_engine reboots happened. It seems they were forced externally and it was not an update_engine crash. But I cannot find the source of reboot either.
I'm wondering if this is related to crbug.com/777969

Components: Internals>Installer
 Issue 778034  has been merged into this issue.
Cc: akes...@chromium.org nxia@chromium.org
 Issue 696820  has been merged into this issue.
 Issue 796432  has been merged into this issue.
I'm looking at the new logs again for the edgar-paladin:

The source of bad things happen at a few seconds before 16:58:23.

chromeos2-row5-rack2-host5/crashinfo.chromeos2-row5-rack2-host5/messages file shows the following line:

2017-12-18T16:58:23.905830-08:00 NOTICE pre-shutdown[4203]: Shutting down for reboot: unknown-reason

I cannot figure out what is the cause of the shutdown. I cannot figure out even who is initiating it. It seems like the reboot is initiated, and consequently the update engine is shut down (or restarted), but the reboot actually never happens. When the UE is restarted, its initial state is UPDATE_STATUS_IDLE instead of UPDATE_STATUS_NEEDS_REBOOT which is the expected if the UE was not restarted.

Someone more familiar with kernel should look at why the sudden reboot is happening.
Looks like possibly the same error on android-pfq:

https://luci-milo.appspot.com/buildbot/chromeos/cyan-nyc-android-pfq/1293
Labels: Build-PFQ-Failures
Status: Assigned
This looks like it just happened on cyan-chrome-pfq: https://luci-milo.appspot.com/buildbot/chromeos/cyan-chrome-pfq/2393#

	START	provision_AutoUpdate	provision_AutoUpdate	timestamp=1516342394	localtime=Jan 18 22:13:14	
		START	----	----	timestamp=1516342423	localtime=Jan 18 22:13:43	
			GOOD	----	sysinfo.before	timestamp=1516342423	localtime=Jan 18 22:13:43	
		END GOOD	----	----	timestamp=1516342423	localtime=Jan 18 22:13:43	
		FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1516344195	localtime=Jan 18 22:43:15	Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row6-rack9-host9: 0) RootfsUpdateError: Failed to perform rootfs update: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',), 1) RootfsUpdateError: Failed to perform rootfs update: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',), 
  Traceback (most recent call last):
    File "/usr/local/autotest/client/common_lib/test.py", line 831, in _call_test_function
      return func(*args, **dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 495, in execute
      dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 362, in _call_run_once_with_retry
      postprocess_profiled_run, args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 400, in _call_run_once
      self.run_once(*args, **dargs)
    File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 127, in run_once
      with_cheets=with_cheets)
    File "/usr/local/autotest/server/afe_utils.py", line 124, in machine_install_and_update_labels
      *args, **dargs)
    File "/usr/local/autotest/server/hosts/cros_host.py", line 828, in machine_install_by_devserver
      quick_provision=quick_provision)
    File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2391, in auto_update
      error_msg % (host_name, real_error))
  DevServerException: CrOS auto-update failed for host chromeos4-row6-rack9-host9: 0) RootfsUpdateError: Failed to perform rootfs update: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',), 1) RootfsUpdateError: Failed to perform rootfs update: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',), 
	END FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1516344195	localtime=Jan 18 22:43:15	
END FAIL	----	provision	timestamp=1516344195	localtime=Jan 18 22:43:15	

---

Caveh, are you the right person to own this?
Components: Infra>Client>ChromeOS
Owner: xixuan@chromium.org
Here are some logs from #14.

In both cases, it looks to me like update_engine goes through the normal UPDATE_STATUS_DOWNLOADING -> UPDATE_STATUS_FINALIZING -> UPDATE_STATUS_IDLE flow. I don't know anything about the lab's provisioning process, so I don't know why it seems to be unhappy with the UPDATE_STATUS_IDLE at the end. Both logs also describe an "update" from 10032.88.0 to 10032.88.0, i.e. the same version, which seems weird.

Regarding #11, I also see a few reboots in the messages file from 170803419-chromeos-test%2Fchromeos4-row6-rack9-host9%2Fsysinfo.tgz:

2018-01-19T06:32:22.214918+00:00 NOTICE pre-shutdown[4538]: Shutting down for reboot: unknown-reason
2018-01-18T22:35:56.268471-08:00 NOTICE pre-shutdown[4091]: Shutting down for reboot: unknown-reason
2018-01-18T22:41:37.141151-08:00 NOTICE pre-shutdown[4901]: Shutting down for reboot: unknown-reason

The powerd logs from the archive don't show any shutdown requests, so this is presumably something calling "reboot" directly instead of going through powerd. I tried to kill all of the direct calls a while back, since they're hard/impossible to debug.

----

Xixuan, your name came up when I searched for "Update failed with unexpected update status" (//chromite/lib/auto_updater.py from https://crrev.com/c/342131). Are you the right person to figure out what's going on here?
status.txt
5.5 KB View Download
CrOS_update_100.115.197.83_30881.log.txt
86.4 KB View Download
CrOS_update_chromeos4-row6-rack9-host9_23987.log.txt
83.3 KB View Download
The status after UPDATE_STATUS_FINALIZING should got to UPDATE_STATUS_NEEDS_REBOOT until it reboots. So that is why it is not expecting UPDATE_STATUS_IDLE.

I have commented on #5 and #11. The reason for this happening is an unexpected reboot. Probably someone familiar with the kernel should look at it. It is not update engine or infra related I guess. But I'm not 100 sure either :(
Cc: snanda@chromium.org
Ah, sorry for missing your earlier comments. :-(

I looked at the logs preceding the latter two reboots in the most recent failure. In both cases, there were a few root SSH connects and disconnects from localhost in the seconds just before pre-shutdown ran. Maybe that's expected (e.g. the provisioning script running update_engine_client to check the status). I thought that the DHCP releases and other shill stuff just before seemed suspicious, but it turns out that shill is "stop on starting pre-shutdown" (as opposed to pre-shutdown's "stop on started pre-shutdown"), so I guess that's why it gets logged before "Shutting down for reboot".

In any case, it looks like the reboot is going through init. It looks like the kernel is capable of calling /sbin/reboot, although I don't know in which cases it'd do that.

Sameer, can you comment on whether it's plausible that the kernel could be performing an init-directed reboot in the middle of provisioning in some cases?

messages:2481:2018-01-18T22:35:55.119767-08:00 INFO sshd[3681]: Received disconnect from 127.0.0.1 port 37226:11: disconnected by user
messages:2482:2018-01-18T22:35:55.120150-08:00 INFO sshd[3681]: Disconnected from user root 127.0.0.1 port 37226
messages:2483:2018-01-18T22:35:55.236281-08:00 INFO sshd[4071]: Accepted publickey for root from 127.0.0.1 port 37245 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
messages:2484:2018-01-18T22:35:55.375666-08:00 INFO sshd[4071]: Received disconnect from 127.0.0.1 port 37245:11: disconnected by user
messages:2485:2018-01-18T22:35:55.375765-08:00 INFO sshd[4071]: Disconnected from user root 127.0.0.1 port 37245
messages:2486:2018-01-18T22:35:55.486000-08:00 INFO sshd[4079]: Accepted publickey for root from 127.0.0.1 port 37246 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
messages:2487:2018-01-18T22:35:55.647269-08:00 INFO sshd[4079]: Received disconnect from 127.0.0.1 port 37246:11: disconnected by user
messages:2488:2018-01-18T22:35:55.647388-08:00 INFO sshd[4079]: Disconnected from user root 127.0.0.1 port 37246
messages:2489:2018-01-18T22:35:56.119926-08:00 ERR shill[1762]: [ERROR:object_proxy.cc(582)] Failed to call method: org.chromium.dhcpcd.Release: object_path= /org/chromium/dhcpcd: org.freedesktop.DBus.Error.ServiceUnknown: The name :1.34 was not provided by any .service files
messages:2490:2018-01-18T22:35:56.119982-08:00 ERR shill[1762]: [ERROR:dbus_method_invoker.h(111)] CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name :1.34 was not provided by any .service files
messages:2491:2018-01-18T22:35:56.120751-08:00 INFO avahi-daemon[2161]: Withdrawing address record for 100.115.197.83 on eth1.
messages:2492:2018-01-18T22:35:56.120836-08:00 INFO avahi-daemon[2161]: Leaving mDNS multicast group on interface eth1.IPv4 with address 100.115.197.83.
messages:2493:2018-01-18T22:35:56.122870-08:00 INFO avahi-daemon[2161]: Interface eth1.IPv4 no longer relevant for mDNS.
messages:2494:2018-01-18T22:35:56.127406-08:00 INFO kernel: [  207.551801] smsc95xx 1-1.1:1.0 eth0: hardware isn't capable of remote wakeup
messages:2495:2018-01-18T22:35:56.141836-08:00 ERR wpa_supplicant[1079]: nl80211: Failed to open /proc/sys/net/ipv4/conf/wlan0/drop_unicast_in_l2_multicast: No such file or directory
messages:2496:2018-01-18T22:35:56.141860-08:00 ERR wpa_supplicant[1079]: nl80211: Failed to set IPv4 unicast in multicast filter
messages:2497:2018-01-18T22:35:56.155913-08:00 ERR wpa_supplicant[1079]: nl80211: Failed to open /proc/sys/net/ipv4/conf/wlan0/drop_unicast_in_l2_multicast: No such file or directory
messages:2498:2018-01-18T22:35:56.155936-08:00 ERR wpa_supplicant[1079]: nl80211: Failed to set IPv4 unicast in multicast filter
messages:2499:2018-01-18T22:35:56.172872-08:00 ERR wpa_supplicant[1079]: nl80211: Failed to open /proc/sys/net/ipv4/conf/wlan0/drop_unicast_in_l2_multicast: No such file or directory
messages:2500:2018-01-18T22:35:56.172894-08:00 ERR wpa_supplicant[1079]: nl80211: Failed to set IPv4 unicast in multicast filter
messages:2501:2018-01-18T22:35:56.178775-08:00 ERR wpa_supplicant[1079]: rmdir[ctrl_interface=/run/wpa_supplicant]: Permission denied
messages:2502:2018-01-18T22:35:56.268471-08:00 NOTICE pre-shutdown[4091]: Shutting down for reboot: unknown-reason

----

messages:3877:2018-01-18T22:41:35.225731-08:00 INFO sshd[4871]: Accepted publickey for root from 127.0.0.1 port 33977 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
messages:3878:2018-01-18T22:41:35.368800-08:00 INFO sshd[4871]: Received disconnect from 127.0.0.1 port 33977:11: disconnected by user
messages:3879:2018-01-18T22:41:35.368905-08:00 INFO sshd[4871]: Disconnected from user root 127.0.0.1 port 33977
messages:3880:2018-01-18T22:41:36.059549-08:00 INFO sshd[4563]: Received disconnect from 127.0.0.1 port 33961:11: disconnected by user
messages:3881:2018-01-18T22:41:36.059649-08:00 INFO sshd[4563]: Disconnected from user root 127.0.0.1 port 33961
messages:3882:2018-01-18T22:41:36.181714-08:00 INFO sshd[4880]: Accepted publickey for root from 127.0.0.1 port 33978 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
messages:3883:2018-01-18T22:41:36.327124-08:00 INFO sshd[4880]: Received disconnect from 127.0.0.1 port 33978:11: disconnected by user
messages:3884:2018-01-18T22:41:36.327322-08:00 INFO sshd[4880]: Disconnected from user root 127.0.0.1 port 33978
messages:3885:2018-01-18T22:41:36.446495-08:00 INFO sshd[4888]: Accepted publickey for root from 127.0.0.1 port 33979 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
messages:3886:2018-01-18T22:41:36.611042-08:00 INFO sshd[4888]: Received disconnect from 127.0.0.1 port 33979:11: disconnected by user
messages:3887:2018-01-18T22:41:36.611141-08:00 INFO sshd[4888]: Disconnected from user root 127.0.0.1 port 33979
messages:3888:2018-01-18T22:41:37.028033-08:00 ERR shill[1771]: [ERROR:object_proxy.cc(582)] Failed to call method: org.chromium.dhcpcd.Release: object_path= /org/chromium/dhcpcd: org.freedesktop.DBus.Error.ServiceUnknown: The name :1.34 was not provided by any .service fi
les
messages:3889:2018-01-18T22:41:37.028090-08:00 ERR shill[1771]: [ERROR:dbus_method_invoker.h(111)] CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name :1.34 was not provided by any .service files
messages:3890:2018-01-18T22:41:37.028812-08:00 INFO avahi-daemon[2256]: Withdrawing address record for 100.115.197.83 on eth1.
messages:3891:2018-01-18T22:41:37.028880-08:00 INFO avahi-daemon[2256]: Leaving mDNS multicast group on interface eth1.IPv4 with address 100.115.197.83.
messages:3892:2018-01-18T22:41:37.030297-08:00 INFO avahi-daemon[2256]: Interface eth1.IPv4 no longer relevant for mDNS.
messages:3893:2018-01-18T22:41:37.033301-08:00 INFO kernel: [  335.010214] smsc95xx 1-1.1:1.0 eth0: hardware isn't capable of remote wakeup
messages:3894:2018-01-18T22:41:37.046772-08:00 ERR wpa_supplicant[1025]: nl80211: Failed to open /proc/sys/net/ipv4/conf/wlan0/drop_unicast_in_l2_multicast: No such file or directory
messages:3895:2018-01-18T22:41:37.046793-08:00 ERR wpa_supplicant[1025]: nl80211: Failed to set IPv4 unicast in multicast filter
messages:3896:2018-01-18T22:41:37.067905-08:00 ERR wpa_supplicant[1025]: nl80211: Failed to open /proc/sys/net/ipv4/conf/wlan0/drop_unicast_in_l2_multicast: No such file or directory
messages:3897:2018-01-18T22:41:37.067927-08:00 ERR wpa_supplicant[1025]: nl80211: Failed to set IPv4 unicast in multicast filter
messages:3898:2018-01-18T22:41:37.081619-08:00 ERR wpa_supplicant[1025]: nl80211: Failed to open /proc/sys/net/ipv4/conf/wlan0/drop_unicast_in_l2_multicast: No such file or directory
messages:3899:2018-01-18T22:41:37.081639-08:00 ERR wpa_supplicant[1025]: nl80211: Failed to set IPv4 unicast in multicast filter
messages:3900:2018-01-18T22:41:37.090683-08:00 ERR wpa_supplicant[1025]: rmdir[ctrl_interface=/run/wpa_supplicant]: Permission denied
messages:3901:2018-01-18T22:41:37.141151-08:00 NOTICE pre-shutdown[4901]: Shutting down for reboot: unknown-reason
Cc: tbroch@chromium.org
Shutdowns from the kernel can happen due to thermal reasons for one. But I don't think I am seeing any telltale signs of that.

Perhaps, +tbroch can tell us if these may somehow be thermal triggered?

> It looks like the kernel is capable of calling /sbin/reboot [ ... ]
> messages:3901:2018-01-18T22:41:37.141151-08:00 NOTICE pre-shutdown[4901]: Shutting down for reboot: unknown-reason

To be clear:  This message doesn't mean that the _kernel_ initiated the
reboot.  It means something (almost certainly, something in userland)
invoked the 'reboot' command.

We need to double-check the provisioning logs to convince ourselves that
the reboot command really didn't come from the provisioning side of the
code.  I don't hold out much hope for that; it's just that you have to
check for the easy, obvious stuff before you go haring off on a complicated
search.

I don't have any other concrete suggestions here, other than "scan the sources
for any program that calls 'reboot'" and also "put a call out to package
owners to see if anyone knows of code that can call 'reboot'".  If we find
any stray callers, we should probably update them to pass SHUTDOWN_REASON.

(Aside) Two changes might help with debug now or in the future:
  * Consider changing the text "unknown-reason" to something slightly
    more concrete like "command-line" or "manual".
  * Change the provisioning code to pass its own, unique, SHUTDOWN_REASON,
    so that we can definitively rule that out as a source.

Just to clear things up about the pre-shutdown "Shutting down for reboot" message, powerd_setuid_helper runs "initctl emit --no-wait runlevel RUNLEVEL=0 SHUTDOWN_REASON=<whatever>" when powerd wants to shut down the system. This was initially done so that powerd could communicate low-battery shutdowns to the chromeos_shutdown script so we could display an animation, but I recently also made the pre-shutdown Upstart job log it ( issue 762328 ) so we'd have some way of telling when the system is shutting down cleanly vs. crashing.

pre-shutdown logs unknown-reason if SHUTDOWN_REASON wasn't set. This happens if some process besides powerd runs initctl/halt/poweroff/etc., but it also happens whenever powerd reboots the system, since powerd just runs "reboot" (if there's some way powerd can set SHUTDOWN_REASON while rebooting, I'd love to hear it). pre-shutdown shouldn't log "command-line" or "manual" in this case, since it has no idea what triggered the reboot.

powerd also logs more information to its current file in /var/log/power_manager when it's about to reboot or shut down, and as I mentioned in #15, I didn't see that in the most recent cases of this. That leads me to believe that something other than powerd is rebooting the system. I've previously tried to make everything go through powerd, and I didn't see any obvious new code paths when I looked now.
Owner: ----
Comment 22 by norvez@chromium.org, Feb 21 (3 days ago)
Cc: waihong@chromium.org
Status: Available
Something very similar is happening on most recent builds on guado:
https://luci-milo.appspot.com/buildbot/chromeos/guado-release/1967
https://luci-milo.appspot.com/buildbot/chromeos/guado-release/1966


Some of the builds have a slightly different error, but also during autoupdate:
https://luci-milo.appspot.com/buildbot/chromeos/guado-release/1968


Sign in to add a comment