New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

kip provision failed | device didn't respond to initial ssh | fixed by servoreset

Project Member Reported by akes...@chromium.org, Feb 27 2017

Issue description

Build
https://luci-milo.appspot.com/buildbot/chromeos/kip-paladin/832

Provision job
http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host1/519946-provision/

In particular, the autoupdate log is very short:
2017/02/24 21:18:43.890 DEBUG|       cros_update:0193| Remote device chromeos4-row1-rack12-host1 is connected
2017/02/24 21:18:43.890 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp6Ee7ph/testing_rsa root@chromeos4-row1-rack12-host1 -- mkdir -p /mnt/stateful_partition/unencrypted/preserve/cros-update '&&' mktemp -d '--tmpdir=/mnt/stateful_partition/unencrypted/preserve/cros-update'
2017/02/24 21:20:47.047 ERROR|     remote_access:0856| Error connecting to device chromeos4-row1-rack12-host1
2017/02/24 21:20:47.048 DEBUG|       cros_update:0253| Error happens in CrOS auto-update: SSHConnectionError('ssh: connect to host chromeos4-row1-rack12-host1 port 22: Connection timed out\r\n',)


Repair job:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row1-rack12-host1/519955-repair/20172402212328/

status log indicates that required power cycle

START	----	repair	timestamp=1488000213	localtime=Feb 24 21:23:33	
	GOOD	----	verify.ssh	timestamp=1488000214	localtime=Feb 24 21:23:34	
	GOOD	----	verify.brd_config	timestamp=1488000214	localtime=Feb 24 21:23:34	
	GOOD	----	verify.ser_config	timestamp=1488000214	localtime=Feb 24 21:23:34	
	GOOD	----	verify.job	timestamp=1488000215	localtime=Feb 24 21:23:35	
	GOOD	----	verify.servod	timestamp=1488000219	localtime=Feb 24 21:23:39	
	GOOD	----	verify.pwr_button	timestamp=1488000219	localtime=Feb 24 21:23:39	
	GOOD	----	verify.lid_open	timestamp=1488000219	localtime=Feb 24 21:23:39	
	GOOD	----	verify.update	timestamp=1488000222	localtime=Feb 24 21:23:42	
	GOOD	----	verify.PASS	timestamp=1488000222	localtime=Feb 24 21:23:42	
	FAIL	----	verify.ssh	timestamp=1488000924	localtime=Feb 24 21:35:24	No answer to ping from chromeos4-row1-rack12-host1
	START	----	repair.rpm	timestamp=1488000924	localtime=Feb 24 21:35:24	
		FAIL	----	repair.rpm	timestamp=1488001169	localtime=Feb 24 21:39:29	chromeos4-row1-rack12-host1 is still offline after powercycling
	END FAIL	----	repair.rpm	timestamp=1488001169	localtime=Feb 24 21:39:29	
	START	----	repair.sysrq	timestamp=1488001169	localtime=Feb 24 21:39:29	
		FAIL	----	repair.sysrq	timestamp=1488001456	localtime=Feb 24 21:44:16	chromeos4-row1-rack12-host1 is still offline after reset.
	END FAIL	----	repair.sysrq	timestamp=1488001456	localtime=Feb 24 21:44:16	
	START	----	repair.servoreset	timestamp=1488001456	localtime=Feb 24 21:44:16	
		GOOD	----	verify.ssh	timestamp=1488001485	localtime=Feb 24 21:44:45	
	END GOOD	----	repair.servoreset	timestamp=1488001485	localtime=Feb 24 21:44:45	
	GOOD	----	verify.fwstatus	timestamp=1488001485	localtime=Feb 24 21:44:45	
	GOOD	----	verify.good_au	timestamp=1488001485	localtime=Feb 24 21:44:45	
	GOOD	----	verify.writable	timestamp=1488001486	localtime=Feb 24 21:44:46	
	GOOD	----	verify.tpm	timestamp=1488001486	localtime=Feb 24 21:44:46	
	GOOD	----	verify.ext4	timestamp=1488001486	localtime=Feb 24 21:44:46	
	GOOD	----	verify.power	timestamp=1488001486	localtime=Feb 24 21:44:46	
	GOOD	----	verify.rwfw	timestamp=1488001487	localtime=Feb 24 21:44:47	
	GOOD	----	verify.python	timestamp=1488001487	localtime=Feb 24 21:44:47	
	GOOD	----	verify.cros	timestamp=1488001489	localtime=Feb 24 21:44:49	
	GOOD	----	verify.PASS	timestamp=1488001489	localtime=Feb 24 21:44:49	
	INFO	----	repair	timestamp=1488001489	localtime=Feb 24 21:44:49	Can't repair label 'cros-version:kip-paladin/R58-9315.0.0-rc1'.
	INFO	----	repair	timestamp=1488001489	localtime=Feb 24 21:44:49	Can't repair label 'pool:cq'.
	INFO	----	repair	timestamp=1488001489	localtime=Feb 24 21:44:49	Can't repair label 'board:kip'.
END GOOD	----	repair	timestamp=1488001489	localtime=Feb 24 21:44:49	chromeos4-row1-rack12-host1 repaired successfully



Didn't dig further yet as to the cause of the device being hung.


 
Showing comments 25 - 124 of 124 Older
#1: All the systems that this problem is showing, does have smsc95xx controller (some also have an additional smsc75xx). But I could not ascertain for sure if it is the same ssh issue.

#2: On a separate note, I left the kip I had overnight for a continuous loop of reinstalling between 9310.0.0 -> 9315.0.0 -> 9310.0.0 and somewhere after 90th iteration, the EC went unresponsive (the system does not respond, Alt = x + x + x does not work, Atl + r does not work). (Kip being an old product, so board does not have a servo header to see if EC console works). But when I reboot, as expected, there is no console-ramoops :-( and the last part of var/log/messages (attached) before reboot is not very helpful:

...
2017-03-01T03:34:07.258346-08:00 INFO sshd[4784]: Accepted publickey for root from 172.22.64.13 port 59743 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-03-01T03:34:07.309018-08:00 INFO sshd[4784]: Received disconnect from 172.22.64.13 port 59743:11: disconnected by user
2017-03-01T03:34:07.309088-08:00 INFO sshd[4784]: Disconnected from 172.22.64.13 port 59743
2017-03-01T03:34:07.622225-08:00 INFO sshd[4788]: Accepted publickey for root from 172.22.64.13 port 59744 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-03-01T03:34:07.683773-08:00 INFO sshd[4788]: Received disconnect from 172.22.64.13 port 59744:11: disconnected by user
2017-03-01T03:34:07.683904-08:00 INFO sshd[4788]: Disconnected from 172.22.64.13 port 59744
2017-03-01T03:34:07.954961-08:00 INFO sshd[4791]: Accepted publickey for root from 172.22.64.13 port 59745 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-03-01T03:34:08.044863-08:00 INFO sshd[4791]: Received disconnect from 172.22.64.13 port 59745:11: disconnected by user
2017-03-01T03:34:08.044942-08:00 INFO sshd[4791]: Disconnected from 172.22.64.13 port 59745
2017-03-01T03:34:08.066729-08:00 WARNING kernel: [  391.666654] init: debugd main process (940) killed by TERM signal
2017-03-01T03:34:08.268632-08:00 ERR shill[1172]: [ERROR:object_proxy.cc(582)] Failed to call method: org.chromium.dhcpcd.Release: object_path= /org/chromium/dhcpcd: org.freedesktop.DBus.Error.ServiceUnknown: The name :1.46 was not provided by any .service files
2017-03-01T03:34:08.268680-08:00 ERR shill[1172]: [ERROR:dbus_method_invoker.h(111)] CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name :1.46 was not provided by any .service files
2017-03-01T03:34:08.269626-08:00 INFO avahi-daemon[1511]: Withdrawing address record for 100.96.49.86 on eth0.
2017-03-01T03:34:08.269735-08:00 INFO avahi-daemon[1511]: Leaving mDNS multicast group on interface eth0.IPv4 with address 100.96.49.86.
2017-03-01T03:34:08.269827-08:00 INFO avahi-daemon[1511]: Interface eth0.IPv4 no longer relevant for mDNS.
2017-03-01T03:34:08.270712-08:00 INFO kernel: [  391.871421] smsc95xx 1-1.1:1.0 eth0: hardware isn't capable of remote wakeup
2017-03-01T03:34:08.328211-08:00 ERR wpa_supplicant[570]: rmdir[ctrl_interface=/var/run/wpa_supplicant]: Permission denied
2017-03-01T03:34:08.510850-08:00 NOTICE shill_logout_user[4810]: Failed to pop shill user profiles
2017-03-01T03:34:08.517193-08:00 INFO session_manager[895]: [INFO:browser_job.cc(157)] Terminating process:
2017-03-01T03:34:08.517225-08:00 INFO session_manager[895]: [INFO:system_utils_impl.cc(110)] Sending 15 to 942 as 1000
2017-03-01T03:34:09.162755-08:00 INFO session_manager[895]: [INFO:session_manager_service.cc(488)] SessionManagerService quitting run loop
2017-03-01T03:34:09.163005-08:00 INFO session_manager[895]: [INFO:session_manager_service.cc(214)] SessionManagerService exiting
2017-03-01T03:34:09.186591-08:00 INFO session_manager[895]: [INFO:policy_service.cc(202)] Persisted policy to disk.

(I rebooted here)

2017-03-01T09:53:30.366267-08:00 INFO kernel: [    0.000000] Initializing cgroup subsys cpuset
2017-03-01T09:53:30.367751-08:00 INFO kernel: [    0.000000] Initializing cgroup subsys cpu
2017-03-01T09:53:30.367757-08:00 INFO kernel: [    0.000000] Initializing cgroup subsys cpuacct
2017-03-01T09:53:30.367760-08:00 NOTICE kernel: [    0.000000] Linux version 4.4.44-07219-g1abd782 (chrome-bot@cros-beefy4-c2) (gcc version 4.9.x 20150123 (prerelease) (4.9.2_cos_gg_4.9.2-r150-41f3e25635616c067b9ee272304e6f86ac8ee9db_4.9.2-r150) ) #1 SMP PREEMPT Fri Feb 24 20:36:00 PST 2017
2017-03-01T09:53:30.367763-08:00 INFO kernel: [    0.000000] Command line: cros_secure console= loglevel=7 init=/sbin/init cros_secure oops=panic panic=-1 root=/dev/dm-0 rootwait ro dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 dm="1 vroot none ro 1,0 3584000 verity payload=PARTUUID=050379b1-58a1-ed46-9981-a77ee66b71aa/PARTNROFF=1 hashtree=PARTUUID=050379b1-58a1-ed46-9981-a77ee66b71aa/PARTNROFF=1 hashstart=3584000 alg=sha1 root_hexdigest=d59a1e1f4a88ebca130fc55e70bfc53e3b91d87e salt=bdb67d8ea702870cc2fc6a10ef1e772ab16389d2f78c991d77393c25fbe11a73" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=050379b1-58a1-ed46-9981-a77ee66b71aa add_efi_memmap boot=local noresume noswap i915.modeset=1 tpm_tis.force=1 tpm_tis.interrupts=0 nmi_watchdog=panic,lapic
...

This behaviour might match this bug, so I'm going to try to reproduce this again. However, is there a way to confirm that this issue was present (or not present) say, a month ago? Buildbot seems to maintain info for only last 200 builds, and looking at that seems to indicate that this has been always flaky.
messages
1.5 MB View Download
Owner: rajatja@google.com
OK, I've been looking at this (looking at the CQ failures) and I think that we're dealing with 2 issues here:

1) For some reason the Kip reaches a wedged state. The kernel is defnitely not responding (autotest tries to reset using sysrq aka Alt+VolUp+x+x+x, but it does not come back to life). This is the issue that this bug intends to track. 

2) Autotest does a servoreset, which I looked at the code and it seems to be rebooting both the EC and the AP, which should definitely bring it back to life. But in some cases, it doesn't, and then it goes on to do a recovery using USB which works. I believe the reason for this is that we may be hitting the  issue 677572 , because after a servo reset, we give up in < 5 min if the DUT does not respond, but in case of USB reset, we wait until ~9-10 mins? As below. 

	FAIL	----	verify.ssh	timestamp=1488180035	localtime=Feb 26 23:20:35	No answer to ping from chromeos4-row1-rack12-host9
	START	----	repair.rpm	timestamp=1488180035	localtime=Feb 26 23:20:35	
		FAIL	----	repair.rpm	timestamp=1488180282	localtime=Feb 26 23:24:42	chromeos4-row1-rack12-host9 is still offline after powercycling
	END FAIL	----	repair.rpm	timestamp=1488180282	localtime=Feb 26 23:24:42	
	START	----	repair.sysrq	timestamp=1488180282	localtime=Feb 26 23:24:42	
		FAIL	----	repair.sysrq	timestamp=1488180571	localtime=Feb 26 23:29:31	chromeos4-row1-rack12-host9 is still offline after reset.
	END FAIL	----	repair.sysrq	timestamp=1488180571	localtime=Feb 26 23:29:31	
	START	----	repair.servoreset	timestamp=1488180571	localtime=Feb 26 23:29:31	
		FAIL	----	repair.servoreset	timestamp=1488180797	localtime=Feb 26 23:33:17	chromeos4-row1-rack12-host9 is still offline after reset.
	END FAIL	----	repair.servoreset	timestamp=1488180797	localtime=Feb 26 23:33:17	
	START	----	repair.firmware	timestamp=1488180797	localtime=Feb 26 23:33:17	
		FAIL	----	repair.firmware	timestamp=1488180797	localtime=Feb 26 23:33:17	Firmware repair is not applicable to host chromeos4-row1-rack12-host9.
	END FAIL	----	repair.firmware	timestamp=1488180797	localtime=Feb 26 23:33:17	
	START	----	repair.usb	timestamp=1488180797	localtime=Feb 26 23:33:17	
		GOOD	----	verify.ssh	timestamp=1488181345	localtime=Feb 26 23:42:25	
		GOOD	----	verify.writable	timestamp=1488181346	localtime=Feb 26 23:42:26	
		GOOD	----	verify.tpm	timestamp=1488181346	localtime=Feb 26 23:42:26	
		GOOD	----	verify.good_au	timestamp=1488181346	localtime=Feb 26 23:42:26	
		GOOD	----	verify.ext4	timestamp=1488181346	localtime=Feb 26 23:42:26	
		GOOD	----	verify.power	timestamp=1488181347	localtime=Feb 26 23:42:27	
		GOOD	----	verify.rwfw	timestamp=1488181347	localtime=Feb 26 23:42:27	
		GOOD	----	verify.python	timestamp=1488181347	localtime=Feb 26 23:42:27	
		GOOD	----	verify.cros	timestamp=1488181349	localtime=Feb 26 23:42:29	
	END GOOD	----	repair.usb	timestamp=1488181349	localtime=Feb 26 23:42:29	


Coming back to the problem this bug is meant to track, unfortunately I've been looking and I have not either not found any logs (looking at /var/log/messages, dmesgs, console-ramoops) from the time the DUT was healthy to the time DUT stopped responding, or none of the logs seem to indicate anything wrong, or give any clues about what could have gone wrong.

The closest I've found is this, where the DUT seemed to be working at the begining of the test, but went unresponsive sometime while the test was running:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103440384-chromeos-test/

Unfortunately there are no logs :-(. 

In summary:

A) I have not found any useful logs thus far. I'll still keep on looking for CQ failures to scrape out any logs that may indicate a problem.

B) On my end, I did run into an issue that looked like an EC unresponsive when I put a Kip DUT under a continuous new image installation loop. On reboot, no interesting logs were there. This behaviour would match the problem symptoms, but given this product has been shipping for ages, I'm hesitant to blame the EC. Also, this issue should be showing up on canaries also in that case.

C) I will attempt to take a look at the failed CQ builds, and see if I can identify a bad CL that is common across the failed CQ cases. (Question: is there a way to reverse engineer the uberchrome paladin build link, from dut-status provisioning/repair/test job?)

I'll be happy to pick up any pointers that anyone may have.
Components: OS>Firmware>EC
Update: I was again able to reproduce the EC freeze. It was exactly like #25 above. Logs attached again. Again, no signs of any issues in the logs, and the system/EC suddenly goes unresponsive. Trying Alt+VolUp+r (warm reboot via EC) does not work. And need to do a EC reset (refresh + power) in order to revive. 

I'm going to try and reproduce it one more time, before requesting some one from EC team to take a look.
debug-logs_20170302-100154.tgz
10.0 MB Download
OK, I reproduced the EC freeze again, this time in only 12 iterations of reinstalling the image. I think some one from EC team needs to take a look. Sameer, do you know who can help?
Unfortunately the board I have does not have a servo header, so I couldn't look at the EC console.My observations are:

* the Alt+VolUp+r combo does not work when it is wedged (Ir works when the system is responding).

* The system logs after I recover (using EC reboot), show no indication of anything that might have gone wrong in AP software. Attaching the latest logs again
debug-logs_20170302-112208.tgz
10.6 MB Download
Reproduced the issue yet again - after 34 iterations this time, log attached. Same behavior - Logs have nothing - and Alt+VolUp+r does not work.

One another thing that I tried on Sameer's suggestion this time is Alt+VolUp+h (hibernate), when I did this the system did go to hibernate apparently because when I followed it up with power button the system rebooted. Sameer mentioned that this could mean that the EC is responding, but there were some known observations/issues around Intel CPU being in bad state such that PCH does not responding appropriately to the reset request line from the EC.

The next steps that I'm planning are:
* Running an older image (that has 3.10 kernel) ~ I'll try an image from January.
* Getting a servo header installed on the board, to take a look at what is happening on the AP console and the EC console. 
debug-logs_20170303-100212.tgz
12.4 MB Download
Some more updates:

1) The issue doe not seem to be reproduced with 3.10 kernel. I loaded a January image and it passed 500 iterations without hitting the issue. It seems something that the kernel 4.4 induces (see next):

2) I got servo header installed on a system, and reproduced the issue with a 4.4 kernel (re-install in a loop). My observations: 

A. the console is absolutely normal with last messages showing a clean shutdown:

[  365.495608] init: debugd main process (1036) killed by TERM signal
[  373.635849] smsc95xx 1-2.2.1:1.0 eth0: hardware isn't capable of remote wakeup
[  375.332799] init: chapsd main process (1084) killed by TERM signal
[  375.340611] init: powerd main process (943) killed by TERM signal
[  375.349729] init: crash-sender main process (1508) killed by TERM signal
[  375.357743] init: cros-machine-id-regen-periodic main process (1519) killed by TERM signal
[  375.367524] init: mtpd main process (1524) killed by TERM signal
[  375.374674] init: log-rotate main process (1536) killed by TERM signal
[  375.383368] init: cras main process (1579) killed by TERM signal
[  375.390465] init: bluetoothd main process (1583) killed by TERM signal
[  375.404700] init: warn-collector main process (2261) killed by TERM signal
[  375.412939] init: tlsdated main process (2292) killed by TERM signal
[  375.441473] init: cryptohomed main process (1365) killed by TERM signal
[  375.450180] EXT4-fs (mmcblk0p5): re-mounted. Opts: block_validity,barrier,user_xattr,acl
[  375.534309] init: recover_duts main process (2259) killed by TERM signal
[  375.588273] Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required.
[  383.107333] gsmi: Log Shutdown Reason 0x00
[  383.252172] reboot: Restarting system
[  383.256301] reboot: machine restart

B. The EC console is alive but again - no signs of any thing wrong there. Here are the last messages:
+[2694.580370 HC 0x01]
[2694.581005 HC 0x05]
+[2694.584210 HC 0x02]
[2694.588470 HC 0x01]
[2989.906791 DPTF sensor 1, threshold -73 C, index 0, disabled]
[2989.907729 DPTF sensor 1, threshold -73 C, index 1, disabled]
[2989.914462 DPTF sensor 2, threshold -73 C, index 0, disabled]
[2989.915354 DPTF sensor 2, threshold -73 C, index 1, disabled]
[2989.917598 DPTF sensor 3, threshold -73 C, index 0, disabled]
[2989.918529 DPTF sensor 3, threshold -73 C, index 1, disabled]
[2997.221829 KB disable]
[2997.222101 KS disable]
[2997.222257 KB IRQ disable]
[2997.222488 KB IRQ enable]
[2997.250347 chipset_reset(0)]
[2997.251127 Port 80: 0x1002][2997.251303 LPC RESET# asserted]
[2997.255291 LPC RESET# deasserted]
[2997.255464 GPIO_PP3300_LTE_EN 1->0]
[2997.266121 Port 80: 0x10][2997.282792 chipset_reset(0)]
[2997.285769 GPIO_PP3300_LTE_EN 0->1]
[2997.307395 Port 80: 0x2a][2997.308202 HC 0x87]
[2997.350240 Port 80: 0x1002][2997.350502 LPC RESET# asserted]
[2997.354488 LPC RESET# deasserted]
[2997.354689 GPIO_PP3300_LTE_EN 1->0]
[2997.365316 Port 80: 0x10][2997.384935 GPIO_PP3300_LTE_EN 0->1]
[2997.406576 Port 80: 0x2a][2997.407382 HC 0x87]
[2997.409450 Port 80: 0x1002][2997.409626 LPC RESET# asserted]


C. "powerinfo" believes system is in S0. Power pins looks OK to me:

>
> powerindebug
power in:   0x00ff
debug mask: 0x0000
bit meanings:
  0x0001 1 PGOOD_PP1050
  0x0002 1 PGOOD_PP3300_PCH
  0x0004 1 PGOOD_PP5000
  0x0008 1 PGOOD_S5
  0x0010 1 PGOOD_VCORE
  0x0020 1 PGOOD_PP1000_S0IX
  0x0040 1 SLP_S3#_DEASSERTED
  0x0080 1 SLP_S4#_DEASSERTED
> 
> gpioget
  1* POWER_BUTTON_L
  1* LID_OPEN
  1* AC_PRESENT
  1* PCH_SLP_S3_L
  1* PCH_SLP_S4_L
  1* PP1050_PGOOD
  1* PP3300_PCH_PGOOD
  1* PP5000_PGOOD
  1* S5_PGOOD
  1* VCORE_PGOOD
  0  WP_L
  1* JTAG_TCK
  1* UART0_RX
  0  BOARD_VERSION1
  1* BOARD_VERSION2
  1* BOARD_VERSION3
  1* PP1000_S0IX_PGOOD
  1* USB1_OC_L
  1* USB2_OC_L
  0  BAT_PRESENT_L
  0  CPU_PROCHOT
  1* ENABLE_BACKLIGHT
  1* ENABLE_TOUCHPAD
  0  ENTERING_RW
  1* LPC_CLKRUN_L
  1* PCH_CORE_PWROK
  1* PCH_PWRBTN_L
  1* PCH_RCIN_L
  1* PCH_RSMRST_L
  1* PCH_SMI_L
  0  PCH_SOC_OVERRIDE
  1* PCH_SYS_PWROK
  1* PCH_WAKE_L
  1* PP1350_EN
  1* PP3300_DX_EN
  1* PP3300_LTE_EN
  1* PP3300_WLAN_EN
  1* PP5000_EN
  0  PPSX_EN
  1* SUSP_VR_EN
  1* TOUCHSCREEN_RESET_L
  1* USB_CTL1
  0  USB_ILIM_SEL
  1* USB1_ENABLE
  1* USB2_ENABLE
  1* VCORE_EN
  1* WLAN_OFF_L
  1* PCH_SCI_L
  1* KBD_IRQ_L
  1* BAT_LED0
  0  BAT_LED1
  0  PWR_LED_L
> 

D. As earlier, the Alt+VolUp+r does not reboot the system, although I can see that the EC did receive the event and trying to warm reset the system:

[3274.775776 KB wait]
[3285.393695 KB poll]
[3286.005607 KB wait]
[3291.708919 KB poll]
[3295.067457 KB warm reboot]
[3295.067731 chipset_reset(0)]
[3295.852992 KB wait]

(But the system actually does not reboot).
Cc: adurbin@chromium.org dlaurie@chromium.org
Components: OS>Firmware>BIOS
I wonder if increasing the length of the warm reboot pulse would help with the alt-volup-r.

+Aaron & Duncan since the hang may be in ACPI transition.
Does doing the PCH_RCIN_L assertion manually make it recover? If not, then extending things won't help anything. What dose 'port80' spit out of EC console when you see this?
I'll try that out when I reproduce it next. There are 2 things here I think:

1) How the system got into this state. I think this is the more important issue that we need to address. It looks something that the kernel 4.4 induces, because the issue is not seen with 3.10 kernel.

2) Why Alt+VOlUp+r does not recover. I think this is more secondary.

I'm trying to reproduce it again, and will try out your suggestions.
Labels: -current-issue
Cc: gwendal@chromium.org
OK, nothing concrete coming out yet. Its getting more and more difficult to produce the problem that I'm trying to chase: I was not able to reproduce the original issue for 430 iterations even on 4.4 in my last 2 runs (which may indicate that my earlier conclusion that the problem does not exist on 3.10 may not be true as that was based on 500+ iterations). 

But in the last 2 runs, I did run into another (but issue with similar symptoms - may be some eMMC issue since I may be exhausting eMMC's life / no of write cycles?). After about 430 iterations, cty says:

[   12.364470] mmc1: Controller never released inhibit bit(s).
[   12.364476] sdhci: =========== REGISTER DUMP (mmc1)===========
[   12.364485] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   12.364491] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   12.364497] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   12.364503] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   12.364509] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   12.364514] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   12.364520] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   12.364526] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   12.364532] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   12.364538] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   12.364544] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   12.364548] sdhci: Host ctl2: 0x0000ffff
[   12.364554] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   12.364557] sdhci: ===========================================
[   12.464702] mmc1: Reset 0x2 never completed.
[   12.464707] sdhci: =========== REGISTER DUMP (mmc1)===========
[   12.464713] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   12.464718] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   12.464725] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   12.464730] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   12.464736] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   12.464741] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   12.464747] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   12.464753] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   12.464759] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   12.464765] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   12.464771] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   12.464774] sdhci: Host ctl2: 0x0000ffff
[   12.464781] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   12.464783] sdhci: ===========================================
[   12.564885] mmc1: Reset 0x4 never completed.
[   12.564889] sdhci: =========== REGISTER DUMP (mmc1)===========
[   12.564895] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   12.564900] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   12.564906] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   12.564912] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   12.564917] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   12.564923] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   12.564929] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   12.564934] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   12.564940] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   12.564946] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   12.564952] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   12.564956] sdhci: Host ctl2: 0x0000ffff
[   12.564962] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   12.564965] sdhci: ===========================================
[   12.575184] mmc1: Controller never released inhibit bit(s).
[   12.575190] sdhci: =========== REGISTER DUMP (mmc1)===========
[   12.575199] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   12.575205] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   12.575211] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   12.575217] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   12.575223] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   12.575228] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   12.575234] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   12.575240] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   12.575246] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   12.575252] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   12.575258] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   12.575262] sdhci: Host ctl2: 0x0000ffff
[   12.575268] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   12.575271] sdhci: ===========================================
[   12.675435] mmc1: Reset 0x2 never completed.
[   12.675440] sdhci: =========== REGISTER DUMP (mmc1)===========
[   12.675446] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   12.675452] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   12.675458] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   12.675464] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   12.675469] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   12.675475] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   12.675481] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   12.675487] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   12.675493] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   12.675499] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   12.675505] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   12.675508] sdhci: Host ctl2: 0x0000ffff
[   12.675514] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   12.675517] sdhci: ===========================================
[   12.775618] mmc1: Reset 0x4 never completed.
[   12.775622] sdhci: =========== REGISTER DUMP (mmc1)===========
[   12.775628] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   12.775633] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   12.775639] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   12.775645] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   12.775650] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   12.775656] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   12.775662] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   12.775668] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   12.775673] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   12.775679] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   12.775685] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   12.775689] sdhci: Host ctl2: 0x0000ffff
[   12.775695] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   12.775698] sdhci: ===========================================
[   12.785887] mmc1: Controller never released inhibit bit(s).
[   12.785893] sdhci: =========== REGISTER DUMP (mmc1)===========
[   12.785902] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   12.785907] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   12.785913] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   12.785919] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   12.785925] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   12.785931] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   12.785936] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   12.785942] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   12.785948] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   12.785954] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   12.785960] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   12.785964] sdhci: Host ctl2: 0x0000ffff
[   12.785971] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   12.785973] sdhci: ===========================================
[   12.886132] mmc1: Reset 0x2 never completed.
[   12.886136] sdhci: =========== REGISTER DUMP (mmc1)===========
[   12.886143] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   12.886149] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   12.886154] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   12.886160] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   12.886166] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   12.886172] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   12.886177] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   12.886183] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   12.886189] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   12.886195] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   12.886201] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   12.886204] sdhci: Host ctl2: 0x0000ffff
[   12.886211] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   12.886213] sdhci: ===========================================
[   12.986315] mmc1: Reset 0x4 never completed.
[   12.986318] sdhci: =========== REGISTER DUMP (mmc1)===========
[   12.986324] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   12.986330] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   12.986336] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   12.986341] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   12.986347] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   12.986352] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   12.986358] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   12.986364] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   12.986370] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   12.986376] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   12.986382] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   12.986385] sdhci: Host ctl2: 0x0000ffff
[   12.986391] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   12.986394] sdhci: ===========================================
[   12.996593] mmc1: Controller never released inhibit bit(s).
[   12.996599] sdhci: =========== REGISTER DUMP (mmc1)===========
[   12.996608] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   12.996614] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   12.996620] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   12.996626] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   12.996631] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   12.996637] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   12.996643] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   12.996649] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   12.996655] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   12.996661] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   12.996667] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   12.996671] sdhci: Host ctl2: 0x0000ffff
[   12.996677] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   12.996680] sdhci: ===========================================
[   13.096842] mmc1: Reset 0x2 never completed.
[   13.096847] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.096853] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.096859] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.096864] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.096870] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.096876] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.096882] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.096887] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.096893] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.096899] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.096905] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.096911] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.096915] sdhci: Host ctl2: 0x0000ffff
[   13.096921] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.096924] sdhci: ===========================================
[   13.197025] mmc1: Reset 0x4 never completed.
[   13.197028] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.197034] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.197040] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.197046] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.197052] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.197057] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.197063] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.197068] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.197074] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.197080] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.197086] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.197092] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.197096] sdhci: Host ctl2: 0x0000ffff
[   13.197102] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.197104] sdhci: ===========================================
[   13.207301] mmc1: Controller never released inhibit bit(s).
[   13.207307] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.207316] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.207321] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.207327] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.207333] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.207339] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.207345] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.207350] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.207356] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.207362] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.207368] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.207374] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.207378] sdhci: Host ctl2: 0x0000ffff
[   13.207384] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.207387] sdhci: ===========================================
[   13.307523] mmc1: Reset 0x2 never completed.
[   13.307526] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.307530] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.307533] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.307537] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.307540] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.307544] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.307547] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.307550] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.307554] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.307557] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.307561] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.307564] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.307567] sdhci: Host ctl2: 0x0000ffff
[   13.307571] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.307572] sdhci: ===========================================
[   13.407653] mmc1: Reset 0x4 never completed.
[   13.407655] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.407659] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.407663] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.407666] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.407669] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.407673] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.407676] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.407680] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.407683] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.407686] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.407690] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.407693] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.407696] sdhci: Host ctl2: 0x0000ffff
[   13.407699] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.407701] sdhci: ===========================================
[   13.417850] mmc1: Controller never released inhibit bit(s).
[   13.417856] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.417864] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.417870] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.417876] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.417882] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.417888] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.417894] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.417899] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.417906] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.417911] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.417917] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.417924] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.417927] sdhci: Host ctl2: 0x0000ffff
[   13.417934] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.417936] sdhci: ===========================================
[   13.518099] mmc1: Reset 0x2 never completed.
[   13.518104] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.518110] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.518115] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.518121] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.518127] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.518133] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.518138] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.518144] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.518150] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.518156] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.518162] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.518168] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.518171] sdhci: Host ctl2: 0x0000ffff
[   13.518178] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.518180] sdhci: ===========================================
[   13.618282] mmc1: Reset 0x4 never completed.
[   13.618286] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.618291] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.618297] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.618303] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.618309] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.618314] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.618320] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.618325] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.618332] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.618337] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.618343] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.618349] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.618353] sdhci: Host ctl2: 0x0000ffff
[   13.618359] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.618362] sdhci: ===========================================
[   13.628560] mmc1: Controller never released inhibit bit(s).
[   13.628566] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.628575] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.628581] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.628587] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.628593] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.628599] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.628604] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.628610] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.628616] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.628622] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.628628] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.628634] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.628638] sdhci: Host ctl2: 0x0000ffff
[   13.628644] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.628647] sdhci: ===========================================
[   13.728806] mmc1: Reset 0x2 never completed.
[   13.728811] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.728817] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.728823] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.728829] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.728835] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.728840] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.728846] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.728851] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.728858] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.728863] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.728869] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.728875] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.728879] sdhci: Host ctl2: 0x0000ffff
[   13.728885] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.728888] sdhci: ===========================================
[   13.828988] mmc1: Reset 0x4 never completed.
[   13.828992] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.828998] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.829003] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.829009] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.829015] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.829020] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.829026] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.829032] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.829037] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.829043] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.829049] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.829055] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.829059] sdhci: Host ctl2: 0x0000ffff
[   13.829065] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.829068] sdhci: ===========================================
[   13.839265] mmc1: Controller never released inhibit bit(s).
[   13.839271] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.839280] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.839286] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.839292] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.839298] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.839304] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.839310] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.839315] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.839321] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.839327] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.839333] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.839339] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.839343] sdhci: Host ctl2: 0x0000ffff
[   13.839349] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.839352] sdhci: ===========================================
[   13.939530] mmc1: Reset 0x2 never completed.
[   13.939535] sdhci: =========== REGISTER DUMP (mmc1)===========
[   13.939541] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   13.939547] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   13.939553] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   13.939559] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   13.939564] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   13.939570] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   13.939576] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   13.939582] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   13.939588] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   13.939594] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   13.939600] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   13.939604] sdhci: Host ctl2: 0x0000ffff
[   13.939611] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   13.939613] sdhci: ===========================================
[   14.039714] mmc1: Reset 0x4 never completed.
[   14.039718] sdhci: =========== REGISTER DUMP (mmc1)===========
[   14.039724] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   14.039729] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   14.039735] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   14.039741] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   14.039747] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   14.039752] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   14.039758] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   14.039764] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   14.039770] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   14.039776] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   14.039782] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   14.039786] sdhci: Host ctl2: 0x0000ffff
[   14.039792] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   14.039795] sdhci: ===========================================
[   14.050001] mmc1: Controller never released inhibit bit(s).
[   14.050007] sdhci: =========== REGISTER DUMP (mmc1)===========
[   14.050016] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   14.050021] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   14.050028] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   14.050034] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   14.050040] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   14.050045] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   14.050051] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   14.050057] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   14.050063] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   14.050069] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   14.050075] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   14.050079] sdhci: Host ctl2: 0x0000ffff
[   14.050085] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   14.050088] sdhci: ===========================================
[   14.150245] mmc1: Reset 0x2 never completed.
[   14.150250] sdhci: =========== REGISTER DUMP (mmc1)===========
[   14.150256] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   14.150262] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   14.150268] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   14.150274] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   14.150280] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   14.150285] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   14.150291] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   14.150297] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   14.150303] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   14.150309] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   14.150315] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   14.150318] sdhci: Host ctl2: 0x0000ffff
[   14.150325] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   14.150327] sdhci: ===========================================
[   14.250428] mmc1: Reset 0x4 never completed.
[   14.250432] sdhci: =========== REGISTER DUMP (mmc1)===========
[   14.250438] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   14.250443] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   14.250449] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   14.250455] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   14.250461] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   14.250466] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   14.250472] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   14.250478] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   14.250484] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   14.250489] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   14.250495] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   14.250499] sdhci: Host ctl2: 0x0000ffff
[   14.250505] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   14.250508] sdhci: ===========================================
[   14.260753] mmc1: Controller never released inhibit bit(s).
[   14.260759] sdhci: =========== REGISTER DUMP (mmc1)===========
[   14.260767] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   14.260774] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   14.260780] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   14.260786] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   14.260791] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   14.260797] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   14.260803] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   14.260809] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   14.260815] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   14.260821] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   14.260827] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   14.260831] sdhci: Host ctl2: 0x0000ffff
[   14.260837] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   14.260840] sdhci: ===========================================
[   14.361003] mmc1: Reset 0x2 never completed.
[   14.361007] sdhci: =========== REGISTER DUMP (mmc1)===========
[   14.361013] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   14.361018] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   14.361024] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   14.361030] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   14.361036] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   14.361042] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   14.361047] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   14.361053] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   14.361059] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   14.361065] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   14.361071] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   14.361075] sdhci: Host ctl2: 0x0000ffff
[   14.361081] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   14.361084] sdhci: ===========================================
[   14.461184] mmc1: Reset 0x4 never completed.
[   14.461188] sdhci: =========== REGISTER DUMP (mmc1)===========
[   14.461194] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   14.461200] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   14.461205] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   14.461211] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   14.461217] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   14.461223] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   14.461228] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   14.461234] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   14.461240] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   14.461246] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   14.461252] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   14.461256] sdhci: Host ctl2: 0x0000ffff
[   14.461262] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   14.461264] sdhci: ===========================================
[   14.561555] mmc1: Reset 0x1 never completed.
[   14.561561] sdhci: =========== REGISTER DUMP (mmc1)===========
[   14.561568] sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
[   14.561574] sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
[   14.561580] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[   14.561586] sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
[   14.561592] sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
[   14.561598] sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
[   14.561604] sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
[   14.561609] sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
[   14.561615] sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
[   14.561621] sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
[   14.561627] sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
[   14.561631] sdhci: Host ctl2: 0x0000ffff
[   14.561637] sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
[   14.561640] sdhci: ===========================================
[   18.950171] device-mapper: init: attempting early device configuration.
[   18.961145] device-mapper: init: adding target '0 3584000 verity payload=ROOT_DEV hashtree=HASH_DEV hashstart=3584000 alg=sha1 root_hexdigest=a17a3bbbcad3d0463c75285389b23904e72a7aea salt=86a95059d39955c8e95'
[   18.986391] device-mapper: verity: Argument 0: 'payload=ROOT_DEV'
[   18.993233] device-mapper: verity: Argument 1: 'hashtree=HASH_DEV'
[   19.000171] device-mapper: verity: Argument 2: 'hashstart=3584000'
[   19.007103] device-mapper: verity: Argument 3: 'alg=sha1'
[   19.013155] device-mapper: verity: Argument 4: 'root_hexdigest=a17a3bbbcad3d0463c75285389b23904e72a7aea'
[   19.023776] device-mapper: verity: Argument 5: 'salt=86a95059d39955c8e953617171d1fc66385bfd3af9d0be174307db9c63889f26'
[   19.035767] device-mapper: table: 253:0: verity: Data device lookup failed
[   19.037834] usb 1-2.2.1: New USB device found, idVendor=0424, idProduct=ec00
[   19.037840] usb 1-2.2.1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[   19.059754] device-mapper: init: starting dm-0 (vroot) failed
[   19.103173] ACPI: Preparing to enter system sleep state S5
[   19.109618] reboot: Power down

[NOte that after I recovered, the syslog still showed the same symptoms - nothing in the syslogs, so if we do not have the cty or the EC logs, these issues (no logs v/s emmc issue) are probably not distinguishable from each other)

The EC console:

[50.200104 HC 0x8e]
[50.200336 HC 0x8a]
[50.200629 HC 0x8b]
[50.200830 ACPI query = 0]
[50.232333 Port 80: 0x1002][50.232650 LPC RESET# asserted]
[50.232844 power state 3 = S0, in 0x003f]
[50.233062 power state 7 = S0->S3, in 0x003f]
[50.245264 power state 2 = S3, in 0x0008]
[50.245590 power state 8 = S3->S5, in 0x0008]
[50.245830 USB charge p0 m0]
[50.245990 USB charge p1 m0]
[50.246178 power state 1 = S5, in 0x0008]
[50.246403 power state 1 = S5, in 0x0008]
[50.264238 Disabling console in deep sleep]
[60.246727 power state 9 = S5->G3, in 0x0008]
[60.247065 power state 0 = G3, in 0x0008]
[60.247527 power state 0 = G3, in 0x00c8]
[60.248145 power state 0 = G3, in 0x00c0]


When I press Alt+VolUp+r, it still does not work:
> 
> 
> [497.958188 Disabling console in deep sleep]
[516.139344 KB poll]
[516.149489 event set 0x00001000]
[517.026309 KB wait]
[518.067673 KB poll]
[520.959907 KB warm reboot]
[520.960174 chipset_reset(0)]
[521.916372 KB wait]
951.098224 KB poll]
[951.401603 KB wait]
[952.413060 KB poll]
[952.670656 KB wait]
[952.890723 KB poll]
[953.120808 KB wait]
[953.188556 KB poll]
[953.403427 KB wait]


(and goes off for a long time before I can type any EC commands again). And no, the system does not reboot. While it does not give me back prompt immediately, it is still responsive as I can see it still print "KB wait/poll" messages as I try to type on the keyboard.

Now:

> port80
Port 80 writes:
(RESET) <--new
> 

I toggled PCH_RCIN_L (first to 0 and then to 1) and the system did not come up. I think this is definitely a different issue because it is in G3 as opposite to the previous state when everything looked good.

> powerindebug
power in:   0x00c0
debug mask: 0x0000
bit meanings:
  0x0001 0 PGOOD_PP1050
  0x0002 0 PGOOD_PP3300_PCH
  0x0004 0 PGOOD_PP5000
  0x0008 0 PGOOD_S5
  0x0010 0 PGOOD_VCORE
  0x0020 0 PGOOD_PP1000_S0IX
  0x0040 1 SLP_S3#_DEASSERTED
  0x0080 1 SLP_S4#_DEASSERTED
> 

I'll keep trying to produce the original issue. current logs attached.


messages.tar.gz
5.2 MB Download
Do you want to setup a small test farm with 4-6 Kips to improve repro chances? 

Comment 39 by rajatja@google.com, Mar 13 2017

The Kip CQ has been running pretty green since last 2 weeks. I have not seen this issue on kip since last ~135 builds (last seen on 27 feb). I think we should remove the experimental status from kip, while continuing to work on this.
Ok
The kip CQ has been green because when Autotest performs as
designed, it papers over the problem.  I just checked the
last 24 hours of history on the kip CQ, and I found 8 DUT
failures matching the basic symptom of "DUT went offline":
    2017-03-16 14:09:43  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host19/577792-repair/
    2017-03-17 06:34:11  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host17/579779-repair/
    2017-03-16 19:48:07  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host15/578437-repair/
    2017-03-17 02:02:59  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host11/579320-repair/
    2017-03-16 20:13:36  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host11/578495-repair/
    2017-03-16 14:09:47  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host11/577793-repair/
    2017-03-17 02:00:38  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host1/579318-repair/
    2017-03-17 08:40:59  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host5/580206-repair/

So, we can/should make kip not experimental, but we need to keep moving
the ball forward on finding the cause.

Comment 42 by rajatja@google.com, Mar 22 2017

Cc: furquan@chromium.org
OK, finally there is some progress made on this today. The observations make me believe that it is not a kernel issue, but may be something in Intel boot code (or coreboot).

1) I simplified the reproduction steps today to simply continue to do a reboot in a loop, and the issue could be reproduced in a few hundred to < 1000 iterations of reboots (from the AP side - which I believe is a warm reboot). 

2) The issue could be reproduced, and once in this messy state, "apreset warm" and "Alt+VolUp+r" do not help at all, neither does toggling the PCH_RCIN_L. Note that in a good state when system is not stuck, the "apreset warm" and "Alt_VolUp+r" work as intended (produce me a console-ramoops).

From here on based on the symptoms; I think there are 2 issues that we may hit - but 1 is a major issue (happens >95% of the time), so I'll differentiate between the two issues, and use this bug for the first one, and file a separate bug for the other one (b/36493878 - kip powers down instead of restart).

3) In all the times I reproduced this issue - the AP cty shows that the kernel did shutdown cleanly.  

[   10.611827] init: console-ttyS0 main process (505) killed by TERM signal
[   10.620422] init: powerd main process (956) killed by TERM signal
[   10.627822] init: chapsd main process (1117) killed by TERM signal
[   10.635628] init: crash-sender main process (1498) killed by TERM signal
[   10.643577] init: cros-machine-id-regen-periodic main process (1509) killed by TERM signal
[   10.653278] init: mtpd main process (1514) killed by TERM signal
[   10.660398] init: log-rotate main process (1525) killed by TERM signal
[   10.668040] init: cras main process (1556) killed by TERM signal
[   10.674014] EXT4-fs (mmcblk0p5): re-mounted. Opts: block_validity,barrier,user_xattr,acl
[   10.685610] init: bluetoothd main process (1589) killed by TERM signal
[   10.698429] init: warn-collector main process (2258) killed by TERM signal
[   10.706670] init: tlsdated main process (2288) killed by TERM signal
[   10.735417] init: cryptohomed main process (1331) killed by TERM signal
[   10.768458] init: mosys-info main process (2273) terminated with status 143
[   10.778797] init: recover_duts main process (2256) killed by TERM signal
[   10.827273] Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required.
[   11.737298] gsmi: Log Shutdown Reason 0x00
[   11.812694] reboot: Restarting system
[   11.816817] reboot: machine restart

So although there is a limitation that I'm not able to generate the console-ramoops, I sincerely doubt if there'd be anything really useful for me there (more evidence below). Also, the fact that the kernel shut down cleanly before the system fell off the radar is also visible in the cautotest links in above comments.

3) When the issue happens, here is what is seen last on the EC console:
[816.229080 HC 0x0b]
[816.230691 HC 0x05]
[817.098221 DPTF sensor 1, threshold -73 C, index 0, disabled]
[817.099141 DPTF sensor 1, threshold -73 C, index 1, disabled]
[817.101107 DPTF sensor 2, threshold -73 C, index 0, disabled]
[817.101987 DPTF sensor 2, threshold -73 C, index 1, disabled]
[817.103930 DPTF sensor 3, threshold -73 C, index 0, disabled]
[817.104714 DPTF sensor 3, threshold -73 C, index 1, disabled]
[818.257115 KB disable]
[818.257381 KS disable]
[818.257529 KB IRQ disable]
[818.257753 KB IRQ enable]
[818.267211 chipset_reset(0)]
[818.268014 Port 80: 0x1002][818.268182 LPC RESET# asserted]
[818.272697 LPC RESET# deasserted]
[818.272862 GPIO_PP3300_LTE_EN 1->0]
[818.283503 Port 80: 0x10][818.299648 chipset_reset(0)]
[818.303103 GPIO_PP3300_LTE_EN 0->1]
[818.324695 Port 80: 0x2a][818.325502 HC 0x87]
[818.367517 Port 80: 0x1002][818.367771 LPC RESET# asserted]
[818.371799 LPC RESET# deasserted]
[818.372000 GPIO_PP3300_LTE_EN 1->0]
[818.382608 Port 80: 0x10][818.402239 GPIO_PP3300_LTE_EN 0->1]
[818.423799 Port 80: 0x2a][818.424606 HC 0x87]
[818.426667 Port 80: 0x1002][818.426836 LPC RESET# asserted]
[821.751205 Battery  96% / 0h:24]
[1015.245002 Charge state charge -> charge_near_full after 1012.222794 sec]
[1015.246769 Battery  97% / 0h:22]
>
> powerinfo
[1117.577165 power state 3 = S0, in 0x00ff]
> port80
Port 80 writes:
  27 2a 2a 2a 2b 2c 2d 2d 2d 2e 02 2b 2f 11 79 71 72 24 25 55
  73 74 75 f0 f1 75 93 75 75 75 75 75 75 75 75 75 75 75 75 75
  75 75 76 77 79 9c 9e 9d 7a 7b f8 aa
(RESET)
  01 10 20 23 27 2a 01 21 23 24 25 27 2a 2a 2a 2b 2c 2d 2d 2d
  2e 02 2b 2f 11 79 71 72 24 25 55 73 74 75 f0 f1 75 93 75 75
  75 75 75 75 75 75 75 75 75 75 75 75 75 76 77 79 9c 9e 9d 7a
  f8 aa
(RESET)
  01 10 20 23 27 2a 01 21 23 24 25 27 2a 2a 2a 2b 2c 2d 2d 2d
  2e 02 2b 2f 11 79 71 72 24 25 55 73 74 75 f0 f1 75 93 75 75
  75 75 75 75 75 75 75 75 75 75 75 75 75 76 77 79 9c 9e 9d 7a
  f8 aa
(RESET)
  01 10 20 23 27 2a 01 21 23 24 25 27 2a 2a 2a 2b 2c 2d 2d 2d
  2e 02 2b 2f 11 79 71 72 24 25 55 73 74 75 f0 f1 75 93 75 75
  75 75 75 75 75 75 75 75 75 75 75 75 75 76 77 79 9c 9e 9d 7a
  f8 aa
(RESET)
  01 10 20 23 27 2a
(RESET)
  01 10 20 23 27 2a
(RESET) <--new
> 

4) I took help from Furquan today, and he mentioned that this means that the BIOS is reaching the memory init / configuration stage but hanging somewhere there. He also mentioned that this issue (of large number of warm resets making the CPU getting stuck in BIOS init sounds similar to one of the existing issues on reef: b/35587236)

5) Also, the "mosys eventlog list" says the following after each such reproduction (This is also seen in the logs from the cautotest so I think this is the same issue that we're reproducing):

74 | 2017-03-21 17:32:55 | Chrome OS Developer Mode
75 | 2017-03-21 17:36:19 | Kernel Event | Clean Shutdown
76 | 2017-03-21 17:36:20 | System boot | 1246
77 | 2017-03-21 17:36:20 | System Reset
78 | 2017-03-21 17:36:21 | Chrome OS Developer Mode
79 | 2017-03-21 17:36:37 | Kernel Event | Clean Shutdown
80 | 2017-03-21 17:36:38 | System boot | 1247
81 | 2017-03-21 17:36:38 | System Reset
82 | 2017-03-21 17:36:39 | Chrome OS Developer Mode
83 | 2017-03-21 17:36:56 | Kernel Event | Clean Shutdown
84 | 2017-03-21 17:36:57 | System boot | 1248
85 | 2017-03-21 17:36:57 | System Reset
86 | 2017-03-21 17:36:58 | Chrome OS Developer Mode
87 | 2017-03-21 17:37:14 | Kernel Event | Clean Shutdown
88 | 2017-03-21 17:38:26 | System boot | 1249
89 | 2017-03-21 17:38:26 | SUS Power Fail <================= Problem?
90 | 2017-03-21 17:38:26 | Wake Source | Power Button | 0
91 | 2017-03-21 17:38:27 | Chrome OS Developer Mode

I'll be taking Furquan's help in debugging this tomorrow, as I'd need some help in debugging the coreboot side of things.

Note: I updated the EC & BIOS firmware version to the latest (just mentioning here for completeness, not that I think it has an impact): 

localhost ~ # chromeos-firmwareupdate
Starting Google_Kip firmware updater v4 ()...
 - Updater package: [Google_Kip.5216.227.78 / EC:kip_v1.6.193-5f42355]
 - Current system:  [RO:Google_Kip.5216.227.5 , ACT:Google_Kip.5216.227.25 / EC:kip_v1.6.193-5f42355]


Kip is a baytrail machine. 2a is a super early post code. it seems like we're not even getting that far into booting when things go badly. Also, the final hanging '(RESET) <--new' line is even worse. We couldn't even get one post code out. As for the 'SUS Power Fail' line that is likely what you did to recover the system. Did you ec reboot? or apshutdown?

Comment 44 by rajatja@google.com, Mar 22 2017

Yes, EC reboot is the only option to recover the system - and that is what I did in order to boot and get mosys evenlog list.

Comment 45 by rajatja@google.com, Mar 24 2017

OK, the issue reproduced again, this time the serial logs from coreboot enabled, (hopefully the logs below give good pointers to where coreboot became unresponsive). It looks like this needs either Intel's or firmware team's attention.

[  926.013926] init: powerd main process (977) killed by TERM signal
[  926.021387] init: chapsd main process (1160) killed by TERM signal
[  926.029129] init: crash-sender main process (1491) killed by TERM signal
[  926.037134] init: cros-machine-id-regen-periodic main process (1506) killed by TERM signal
[  926.046831] init: mtpd main process (1512) killed by TERM signal
[  926.053986] init: log-rotate main process (1520) killed by TERM signal
[  926.062136] init: cras main process (1561) killed by TERM signal
[  926.069407] init: bluetoothd main process (1568) killed by TERM signal
[  926.082654] init: warn-collector main process (2240) killed by TERM signal
[  926.091037] init: tlsdated main process (2272) killed by TERM signal
[  926.124834] init: cryptohomed main process (1322) killed by TERM signal
[  926.125962] EXT4-fs (mmcblk0p5): re-mounted. Opts: block_validity,barrier,user_xattr,acl
[  926.215396] init: recover_duts main process (2237) killed by TERM signal
[  926.272875] Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required.
[  927.518115] gsmi: Log Shutdown Reason 0x00
[  927.588170] reboot: Restarting system
[  927.592282] reboot: machine restart


coreboot-db3c5d9 Thu Mar 23 13:46:41 PDT 2017 starting...
Enabling VR PS2 mode: VNN VCC
ram_id=0, total_spds: 8
pm1_sts: 2000 pm1_en: 0000 pm1_cnt: 00001c00
gpe0_sts: 00000000 gpe0_en: 00000000 tco_sts: 00000000
prsts: 04410910 gen_pmcon1: 00201238 gen_pmcon2: 00000000
prev_sleep_state = S0
Boot Count incremented to 1801
FMAP: Found "FMAP" version 1.0 at ffe10000.
FMAP: base = 0 size = 800000 #areas = 31
FMAP: area RW_MRC_CACHE found
FMAP:   offset: 3e0000
FMAP:   size:   65536 bytes
FMAP: No valid base address, using 0xff800000
FMAP: RW_MRC_CACHE at ffbe0000 (offset 3e0000)
BayTrail-MD MRC wrapper v5
Training for Memory Down designs.
Applying weaker ODT settings. DRAM ODT is 120.
MRC v0.97
2 channels of DDR3 @ 1333MHz
CBMEM: root @ 7afff000 254 entries.
MRC Wrapper returned 0
MRC data at fe00965f 5719
<hangs here>

Please let me know what should be the next steps here.

Comment 46 by rajatja@google.com, Mar 24 2017

debug-logs_20170324-123104.tgz
24.1 MB Download

Comment 47 by rajatja@google.com, Mar 24 2017

Yet again, and this time core boot hung even earlier..

[   14.110019] init: bluetoothd main process (1602) killed by TERM signal
[   14.122883] init: warn-collector main process (2248) killed by TERM signal
[   14.130975] init: tlsdated main process (2276) killed by TERM signal
[   14.141814] EXT4-fs (mmcblk0p5): re-mounted. Opts: block_validity,barrier,user_xattr,acl
[   14.158117] init: cryptohomed main process (1367) killed by TERM signal
[   14.237213] init: recover_duts main process (2246) killed by TERM signal
[   14.297256] Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required.
[   15.226131] gsmi: Log Shutdown Reason 0x00
[   15.295294] reboot: Restarting system
[   15.299405] reboot: machine restart


coreboot-db3c5d9 Thu Mar 23 13:46:41 PDT 2017 starting...
Enabling VR PS2 mode: VNN VCC
<restarts on its own>

coreboot-db3c5d9 Thu Mar 23 13:46:41 PDT 2017 starting...
Enabling VR PS2 mode: VNN VCC
<hangs here>

EC console:

[10163.310001 HC 0x08]
[10163.310325 HC 0x10]
[10163.310899 HC 0x18]
[10163.311107 HC err 1]
+[10163.311613 HC err 1]
+[10163.312136 HC err 1]
[10163.312629 HC 0x16]
+[10163.313409 HC 0x15]
[10163.314105 HC 0x16]
[10164.146913 KB disable]
[10164.147249 KS disable]
[10164.147466 KB IRQ disable]
[10164.147759 KB IRQ enable]
[10164.156681 chipset_reset(0)]
[10164.157463 Port 80: 0x1002][10164.157700 LPC RESET# asserted]
[10164.162076 LPC RESET# deasserted]
[10164.162311 GPIO_PP3300_LTE_EN 1->0]
[10164.172955 Port 80: 0x10][10164.189242 chipset_reset(0)]
[10164.192621 GPIO_PP3300_LTE_EN 0->1]
[10164.214072 Port 80: 0x2a][10164.223408 HC 0x87]
[10164.225480 Port 80: 0x1002][10164.225803 LPC RESET# asserted]
[10164.229714 LPC RESET# deasserted]
[10164.229949 GPIO_PP3300_LTE_EN 1->0]
[10164.240563 Port 80: 0x10][10164.260258 GPIO_PP3300_LTE_EN 0->1]
[10164.281688 Port 80: 0x2a][10164.291024 HC 0x87]
[10164.292139 Port 80: 0x1002][10164.292431 LPC RESET# asserted]

>
> port80
Port 80 writes:
  75 75 75 75 75 76 77 79 9c 9e 9d 7a 7b f8 aa
(RESET)
  01 10 20 27 2a 01 21 23 25 27 2a 2a 2b 2c 2d 2d 2e 02 2b 11
  79 39 40 70 71 72 24 25 24 25 55 55 73 74 75 f0 f1 75 93 75
  75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75
  75 75 75 75 75 76 77 79 9c 9e 9d 7a 7b f8 aa
(RESET)
  01 10 20 27 2a 01 21 23 25 27 2a 2a 2b 2c 2d 2d 2e 02 2c 11
  79 39 40 70 71 72 24 25 24 25 55 55 73 74 75 f0 f1 75 93 75
  75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75
  75 75 75 75 75 76 77 79 9c 9e 9d 7a 7b f8 aa
(RESET)
  01 10 20 27 2a 01 21 23 25 27 2a 2a 2b 2c 2d 2d 2e 02 2b 11
  79 39 40 70 71 72 24 25 24 25 55 55 73 74 75 f0 f1 75 93 75
  75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75
  75 75 75 75 75 76 77 79 9c 9e 9d 7a 7b f8 aa
(RESET)
  01 10 20 27 2a
(RESET)
  01 10 20 27 2a
(RESET) <--new
>



Owner: adurbin@chromium.org
I'm not sure I can do anything about this device. Intel will need to debug if we're getting random hangs and not able to even get out of the reset vector to post a single post code.

I'm not sure why LPC RESET# is not deasserted in those last logs. That's entirely a function of the chipset.

Comment 50 by rajatja@google.com, Apr 11 2017

Cc: rajatja@google.com
Intel seems to have reproduced this problem (in the context of the other thermal shutdown bug):
https://buganizer.corp.google.com/issues/36679033#comment16

Can we open up this bug to Intel, and assign it to them?

Comment 51 by rajatja@google.com, Apr 12 2017

Cc: shyam.su...@intel.corp-partner.google.com yoshiat@google.com wayne.bo...@intel.corp-partner.google.com yu.kang...@intel.corp-partner.google.com brian.b...@intel.corp-partner.google.com kane.c...@intel.corp-partner.google.com cl...@intel.corp-partner.google.com
Shyam: can you please help route to the right owner within Intel for this? Note that although we saw this issue on a Kip (baytrail device), this issue was reproduced by Brian (Intel) on another baytrail device (Rambi) recently.

https://buganizer.corp.google.com/issues/36679033#comment16

Comment 52 by rajatja@google.com, Apr 12 2017

FYI&A for Intel: the simple steps to reproduce the problem are:

Run the attached reboot_test.sh script (from within chroot > /usr/src/scripts). 
reboot_test.sh
2.0 KB View Download
Acking your note Rajat.

Kane, CL,

Can you take a stab at it?
Hi Rajatja,
Greetings!
Sorry I just got involved into this bug.
May I know if this issue is reproduced on every kip system you have? Is it PVT system?
I noticed that kernel 3.10->4.4 change, but this is unlikely causing BIOS hang up in this early stage.

Since Kip is shipped for a long time, it shouldn't be unstable like this, anything is updated or done specifically to make issue happen?
thank you


If the LPC RESET# is not deasserted, the BIOS code is not even run.
Is it possible to cut off all power AC/battery(this is to do RTC reset) and run again?

I just noticed that Brian@intel saw similar issue, I will also sync up with him.
Thanks.

Comment 56 by rajatja@google.com, Apr 12 2017

Hello Kane,

Thanks for taking a look, yes, the issue has been produced with multiple units in lab. I believe our labs are mostly MP systems. I can't of any substantial change that could have caused this. I think to debug this, it would be better if you had a servo enabled system. If you don't have it and are local in bay area, I can provide you a Kip with servo header installed (Although since the issue is produced in Brian@Intel's system - may be that works out for you).

Thanks,

Rajat
Hi Rajatja,
Greetings!
Sorry, I'm in Taiwan.
Since BYT is quiet old project, so I am only able to find Candy project to repro issue in my site.

So far it's 500 cycles and still alive, I will keep running the test.
Meanwhile, we will try to get more information on the system in the U.S.
Thank you

INFO    : kernel: 3.10.18 #1 SMP Wed Apr 5 13:05:32 PDT 2017
INFO    : loop 500
INFO    : Thu Apr 13 18:36:27 CST 2017
INFO    : Rebooting 10.5.253.20...

Comment 58 by rajatja@google.com, Apr 13 2017

Hello Kane : No, you want to upgrade the candy to the latest image (atleast something built in february) - so that you get the 4.4 kernel. We suspect that the issue may be discovered when we use the new kernel. I'd say you let the reboot script run overnight, by morning you should have passed 3000-5000 cycles. If you pass > 7000 cycles and not run into either b/36493878 or https://bugs.chromium.org/p/chromium/issues/detail?id=696679, then I think it may be safe to conclude that neither issue is seen on candy...
Rajat, Sameer,

Is there a Kip available in Taiwan you can loan to Kane?

Thanks,
Shyam
Hi Rajatja,
I put the system to run overnight, it stop at 1000th cycles
It looks like the script will only run 1000 cycles. Let modify it and run again.
I will also update the image to kernel v4.4 and run again.

If it's something to do with the kernel updates, then it's very strange. 
I couldn't think of SW updates that may cause BIOS hang up like this.
Thank you:)

INFO    : kernel: 3.10.18 #1 SMP Wed Apr 5 13:05:32 PDT 2017
INFO    : loop 1000
INFO    : Thu Apr 13 20:54:08 CST 2017
INFO    : Rebooting 10.5.253.20...
     15s: can not connect to device
     16s: reboot complete
INFO    : kernel: 3.10.18 #1 SMP Wed Apr 5 13:05:32 PDT 2017

Hi,
Latest update in TW.
With kernel 4.4, it's still running.(over 1060 cycles)
I will put the dut to run over weekend.
Thanks.

     20s: reboot complete
INFO    : kernel: 4.4.44-07220-gb7d53f9 #1 SMP PREEMPT Sat Feb 25 04:34:42 PST 2017
INFO    : loop 1060
INFO    : Fri Apr 14 16:51:48 CST 2017
INFO    : Rebooting 10.5.253.20...

Kane, do you need additional Kip units or are you good here?
Yung, Intel does need Kip units.  They want one here in the US and one in TPE.  Can you please help allocate?
Update the test in TW over the weekend.
it stopped at 1626 cycles, the dut is still alive
This is due to the IP address on DUT is somehow changed to 10.5.253.3 from 10.5.253.20.
I will use another warm reset script(without remote_sh) to repro issue.
thanks.

     15s: can not connect to device
     20s: reboot complete
INFO    : kernel: 4.4.44-07220-gb7d53f9 #1 SMP PREEMPT Sat Feb 25 04:34:42 PST 2017
INFO    : loop 1626
INFO    : Fri Apr 14 20:02:14 CST 2017
INFO    : Rebooting 10.5.253.20...

Hi all,
I reviewed the log in #47.
It seems the issue follows "chipset_reset(0)".
in normal reboot, I didn't see chipset_reset(0).
So, I did some experiments by using iotools
iotools io_write32 0xcf9 0x6 ==> no chipset_reset(0)
iotools io_write8 0x64 0xfe ==> chipset_reset(0) exists.

so, it looks like the warm reset behavior is kind of different when issue happens.

And I looked into the kernel, *reboot: Restarting system* this message is printed as emergency in kernel_restart function.
Is this expected in BYT with kernel v4.4? I felt something is odd here.
if you have any thought about this, plz kindly let me know.
thank you very much.

void kernel_restart(char *cmd)
{
        kernel_restart_prepare(cmd);
        migrate_to_reboot_cpu();
        syscore_shutdown();
        if (!cmd)
                pr_emerg("Restarting system\n");
        else
                pr_emerg("Restarting system with command '%s'\n", cmd);
        kmsg_dump(KMSG_DUMP_RESTART);
        machine_restart(cmd);
}

Comment 66 by rajatja@google.com, Apr 18 2017


> And I looked into the kernel, *reboot: Restarting system* this message is
> printed as emergency in kernel_restart function.
> Is this expected in BYT with kernel v4.4? I felt something is odd here.

That is expected. pr_emerg is used because it is a message important enough to not be lost because of insufficient loglevel, and also should be flashed on the console as well as all ttys in use. Also, this behavior is for all platforms, not just BYT. 
I'm not sure why we're seeing different behavior with the resets. In both 3.18 and 4.4 acpi path should be taken first. If that fails to reboot then the kernel will try the keyboard controller method. We've seen similar behavior on apollolake boards where the cf9 isn't responsive and the keyboard controller is tried. In this case, can we confirm reboot type is ACPI and the ACPI tables are indicating the correct values?
Hi Aaron, 
I just checked FACP table. Looks like it's correct.
Once we get the failing system, I will add more log in kernel to confirm if the issue is related to cf9 unresponsive.
Thanks

 95 [078h 0120   8]                      Address : 0000000000000CF9
 96
 97 [080h 0128   1]         Value to cause reset : 06

Comment 69 by rajatja@google.com, Apr 19 2017

> I'm not sure why we're seeing different behavior with the resets.
> In both 3.18 and 4.4 acpi path should be taken first. 

It probably doesn't make a difference, but to be factually correct, we moved from 3.10 (as opposite to 3.18) to 4.4.
Hi Aaron
I just double checked the log in #47 again.
When issue happens,
  OS used kb controller method to reset the system(0xcf9 unresponsive?)
  Then BIOS issued another reset by kb controller method right away.

From the bios log in #47, the system was stuck btwn "Enabling VR PS2 mode: VNN VCC" and "ram_id=2, total_spds: 12"
I read through the code btwn the above 2 places, but I couldn't find any code would do warm reset by kb controller.
Not sure if I miss anything.

Alos the interval of 2 chipset(0) is just around 30ms. 
It's kind of similar to APL, but the actual cpu reset in APL is just once.
However, in this BYT case, the CPU is reset twice.

Or is there any other case that EC issue warm reset itself?

Thank you:)

[10164.156681 chipset_reset(0)]                   ==> reset by kb controller method in OS.
[10164.157463 Port 80: 0x1002][10164.157700 LPC RESET# asserted]
[10164.162076 LPC RESET# deasserted]
[10164.162311 GPIO_PP3300_LTE_EN 1->0]
[10164.172955 Port 80: 0x10][10164.189242 chipset_reset(0)]      ==> reset by kb controller method in BIOS?
[10164.192621 GPIO_PP3300_LTE_EN 0->1]
[10164.214072 Port 80: 0x2a][10164.223408 HC 0x87]
[10164.225480 Port 80: 0x1002][10164.225803 LPC RESET# asserted]
[10164.229714 LPC RESET# deasserted]
[10164.229949 GPIO_PP3300_LTE_EN 1->0]
[10164.240563 Port 80: 0x10][10164.260258 GPIO_PP3300_LTE_EN 0->1]
I don't think we'd ever do a keyboard reset like that. On baytrail we don't have an RCIN signal -- it's a true reset button signal. Even if the host wanted to reset we wouldn't issue one using the keyboard controller.

But now that you have identified this issue I recall very peculiar behavior in baytrail. Essentially you can't reset the system early like it is because it hangs (just like this bug). I don't know why, but something is really not right in baytrail in this area. I haven't dug into the EC code, but if the code in the EC queues up a reset or if some transaction is flushed on the LPC bus then the EC could trigger the reset at the wrong time.
Hi Aaron,
I just went back to check my log.
Looks like I ran into strange situation but I didn't realize.
It didn't hang up. it just shutdown and up then boot to OS, but the behavior is wrong.

sorry, I was not involved in BYT too much before.
Do u recall anything like this with kernel v3.10(during rambi validation)?
I'm not sure if this is connected to this bug.
thank you:)

In coreboot=============================================================

Clearing the recovery request.
Shutting down all USB controllers.
Finalizing Coreboot

Starting kernel ...


coreboot-519ff11 Wed Apr 19 18:49:10 CST 2017 starting...
Enabling VR PS2 mode: VNN VCC (it seems the dut shutdown here)


coreboot-519ff11 Wed Apr 19 18:49:10 CST 2017 starting... (and it's up automatically)
Enabling VR PS2 mode: VNN VCC 
ram_id=2, total_spds: 12
pm1_sts: 2000 pm1_en: 0000 pm1_cnt: 00001c00
gpe0_sts: 00060000 gpe0_en: 00000000 tco_sts: 00000000
prsts: 04410900 gen_pmcon1: 00001038 gen_pmcon2: 00000000
prev_sleep_state = S0
Boot Count incremented to 5908
FMAP: Found "FMAP" version 1.0 at ffe10000.
FMAP: base = 0 size = 800000 #areas = 31
FMAP: area RW_MRC_CACHE found
FMAP:   offset: 3e0000
FMAP:   size:   65536 bytes
FMAP: No valid base address, using 0xff800000
FMAP: RW_MRC_CACHE at ffbe0000 (offset 3e0000)
BayTrail-MD MRC wrapper v5
Training for Memory Down designs.
Applying weaker ODT settings. DRAM ODT is 120.
MRC v0.97
.... boot to kernel



in ec console=============================================================

[3456.536809 KB IRQ disable]
[3456.537040 KB IRQ enable]
[3456.553860 chipset_reset(0)]
[3456.554541 Port 80: 0x1002][3456.554717 LPC RESET# asserted]
[3456.559432 LPC RESET# deasserted]
[3456.570131 Port 80: 0x01][3456.570377 Port 80: 0x10][3456.607765 Port 80: 0x20][3456.608014 Port 80: 0x23][3456.608177 Port 80: 0x27][3456.608340 Port 80: 0x2a][3456.617795 HC 0x87]
[3460.889282 Port 80: 0x1002][3460.889544 LPC RESET# asserted]
[3460.889723 power state 3 = S0, in 0x003f]
[3460.889957 power state 7 = S0->S3, in 0x003f]
[3460.898122 power state 2 = S3, in 0x000c]
[3460.898421 power state 8 = S3->S5, in 0x000c]
[3460.898677 USB charge p0 m0]
[3460.898852 USB charge p1 m0]
[3460.899102 power state 1 = S5, in 0x0008]
[3460.899406 power state 1 = S5, in 0x0008]
[3465.531449 power state 1 = S5, in 0x00c8]
[3465.531790 power state 5 = S5->S3, in 0x00c8]
[3465.532041 power state 2 = S3, in 0x00c8]
[3465.532278 power state 6 = S3->S0, in 0x00c8]
[3465.537805 Port 80: 0x1001][3465.537997 USB charge p0 m2]
[3465.538179 USB charge p1 m2]
[3465.654988 LPC RESET# deasserted]
[3465.655967 power PLTRST# deasserted]
[3465.656207 power state 3 = S0, in 0x00ff]
[3465.656444 power state 3 = S0, in 0x00ff]

I don't recall anything specifically, but that looks like a cold reset to me. The 4-5s delay gives it away.
Hi all,
I found that the reset behavior is very different btwn kernel v3.10 and v4.4. 
I'm not very sure if this is cause of this issue.
Plz see the attached EC log from kernel v4.4 and v3.10.
With kernel 4.4, I can see chipset_reset(0) and unexpected cold reset happen frequently.
With kernel 3.10, I *can't* find any chipset_reset(0) or unexpected cold reset during the warm reset test.

If the cf9 is unresponsive, it should happen in v3.10 as well?
However, this weird behavior doesn't seem to happen in kernel v3.10.

Dose Google see this hang issue before updating kernel v4.4?
Thank you:)
candy_warm_v4.4-ec.log
5.1 MB View Download
candy_warm_v310-ec.log
3.2 MB View Download

Comment 75 by rajatja@google.com, Apr 20 2017

No, this issue came into light only after the 4.4 upgrade. So yes, your analysis may be correct that this induced by 4.4

Comment 76 Deleted

Update.
With kernel v3.10 on the Kip in Folsom, the issue is not seen over 15 hours.
thanks.
Aaron,

Can you please make Kane the owner for this bug?

Thanks,
Shyam
Update the result.
With enabling kernel serial log, I finally was able to repro issue on DELL Candy yesterday. I saw the issue around 2 hours.

Then, I made the below patch, the issue and strange reset behavior is not seen over 18 hours.

Could u plz try this in your site?
Thank you


diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index f660d63..165c955 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -540,6 +540,7 @@ static void native_machine_emergency_restart(void)
                switch (reboot_type) {
                case BOOT_ACPI:
                        acpi_reboot();
+                       udelay(1000);
                        reboot_type = BOOT_KBD;
                        break;

I can't seem to make Kane the owner as it's flagging an error.

Comment 81 by rajatja@google.com, Apr 27 2017


---------- Forwarded message ----------
From: Chen, Kane <kane.chen@intel.com>
Date: Wed, Apr 26, 2017 at 5:32 PM
Subject: RE: Intel US and TW needs a Kip (total 2)
To: Sameer Nanda <snanda@google.com>, "Sundararaman, Shyam" <shyam.sundararaman@intel.com>
Cc: Yung Leem <yungleem@google.com>, Rajat Jain <rajatja@google.com>, Spencer West <spencerwest@google.com>, "Bhat, Kiran U" <kiran.u.bhat@intel.com>, Aaron Durbin <adurbin@google.com>


Hi Sameer,
The reason I added this is delay is due to I find linux kernel will do the following sequence.
1.       Set 0xcf9 = 0x6 to trigger warm reset

2.       If the system is not reset immediately by 0xcf9 = 0x6, then kernel issues multiple kbc reset (0x64=0xfe)

See native_machine_emergency_restart function
 
While kernel is using kbc reset, kernel would issue multiple commands to ec.
I observed multiple reset commands are queue up in the buffer in EC.
In this case, we will see consecutive warm reset by EC. Sometimes the interval of 2 warm reset is just 30 ms
In EC console, you can find chipset_reset(0) happens in a very short time
 
See
https://bugs.chromium.org/p/chromium/issues/detail?id=696679#c47
https://bugs.chromium.org/p/chromium/issues/detail?id=696679#c70
 
Plz note that the above behavior only happened on v4.4.
I never observed this behavior in kernel v3.10
 
Thank you
 
 
From: Sameer Nanda [mailto:snanda@google.com] 
Sent: Thursday, April 27, 2017 7:53 AM
To: Chen, Kane <kane.chen@intel.com>; Sundararaman, Shyam <shyam.sundararaman@intel.com>
Cc: Yung Leem <yungleem@google.com>; Rajat Jain <rajatja@google.com>; Spencer West <spencerwest@google.com>; Bhat, Kiran U <kiran.u.bhat@intel.com>; Aaron Durbin <adurbin@google.com>

Subject: Re: Intel US and TW needs a Kip (total 2)
 
Hi Kane,
 
We can try it but such udelay changes tend to be problematic since timings can change over time/releases etc.  Do you have a theory for why adding this delay helps?
 
On Tue, Apr 25, 2017 at 10:32 PM Chen, Kane <kane.chen@intel.com> wrote:
Update the result.
With enabling kernel log, I finally was able to repro issue on DELL Candy yesterday. I saw the issue around 2 hours.
Then, I made the below patch, then issue is not seen over 18 hours.
 
Could u plz try this in your site?
Thank you
 
 
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index f660d63..165c955 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -540,6 +540,7 @@ static void native_machine_emergency_restart(void)
                switch (reboot_type) {
                case BOOT_ACPI:
                        acpi_reboot();
+                       udelay(1000);
                        reboot_type = BOOT_KBD;
                        break;

Comment 82 by rajatja@google.com, Apr 27 2017

---------- Forwarded message ----------
From: Aaron Durbin <adurbin@google.com>
Date: Wed, Apr 26, 2017 at 8:00 PM
Subject: Re: Intel US and TW needs a Kip (total 2)
To: "Chen, Kane" <kane.chen@intel.com>
Cc: Sameer Nanda <snanda@google.com>, "Sundararaman, Shyam" <shyam.sundararaman@intel.com>, Yung Leem <yungleem@google.com>, Rajat Jain <rajatja@google.com>, Spencer West <spencerwest@google.com>, "Bhat, Kiran U" <kiran.u.bhat@intel.com>


On Wed, Apr 26, 2017 at 7:32 PM, Chen, Kane <kane.chen@intel.com> wrote:
> Hi Sameer,
>
> The reason I added this is delay is due to I find linux kernel will do the
> following sequence.
>
> 1.       Set 0xcf9 = 0x6 to trigger warm reset
>
> 2.       If the system is not reset immediately by 0xcf9 = 0x6, then kernel
> issues multiple kbc reset (0x64=0xfe)
>
> See native_machine_emergency_restart function
>
>
>
> While kernel is using kbc reset, kernel would issue multiple commands to ec.
>
> I observed multiple reset commands are queue up in the buffer in EC.
>
> In this case, we will see consecutive warm reset by EC. Sometimes the
> interval of 2 warm reset is just 30 ms
>
> In EC console, you can find chipset_reset(0) happens in a very short time
>
>
>
> See
>
> https://bugs.chromium.org/p/chromium/issues/detail?id=696679#c47
>
> https://bugs.chromium.org/p/chromium/issues/detail?id=696679#c70
>
>
>
> Plz note that the above behavior only happened on v4.4.
>
> I never observed this behavior in kernel v3.10

We've seen similar behavior on apollolake as well in the course of
debugging other bugs. It doesn't seem to fail like baytrail though.
However the odd thing is that the reboot path is similar from
3.10->4.4. Adding a printk in the kernel path was showing enough delay
on my end that the keyboard reset path wasn't taken when conducting
tests.

For whatever reason these small core designs appear to terminate the
cf9 i/o transaction and doing an async reset after that which takes a
variable amount of time.

Comment 83 by rajatja@google.com, Apr 27 2017


---------- Forwarded message ----------
From: Chen, Kane <kane.chen@intel.com>
Date: Wed, Apr 26, 2017 at 8:52 PM
Subject: RE: Intel US and TW needs a Kip (total 2)
To: Aaron Durbin <adurbin@google.com>
Cc: Sameer Nanda <snanda@google.com>, "Sundararaman, Shyam" <shyam.sundararaman@intel.com>, Yung Leem <yungleem@google.com>, Rajat Jain <rajatja@google.com>, Spencer West <spencerwest@google.com>, "Bhat, Kiran U" <kiran.u.bhat@intel.com>



The printk finding same with my initial experiment.
I initially just added some printk, then I didn't observe the odd reset behavior within 2 hrs(usually it happens within 30 mins)
That's why I choose to add udelay(1000) in the kernel, so far I don't observe the odd reset behavior and issue for over 37 hours.
Thank you.

Comment 84 by rajatja@google.com, Apr 27 2017


Kane: Ok, let me try out the same thing at my end.

> For whatever reason these small core designs appear to terminate the
> cf9 i/o transaction and doing an async reset after that which takes a
> variable amount of time.

How do we know what is the upper bound to this time (i.e. is udelay(1000) enough for all cases)?

Comment 85 by rajatja@google.com, Apr 27 2017

Also, I'm not sure if I understood what part of this scenario is the problem yet. Again, I'm not trying to incline towards any solution, but just trying to understand:

> The reason I added this is delay is due to I find linux kernel will do the
> following sequence.
>
> 1.       Set 0xcf9 = 0x6 to trigger warm reset
>
> 2.       If the system is not reset immediately by 0xcf9 = 0x6, then kernel
> issues multiple kbc reset (0x64=0xfe)

And it is OK / expected to do it, right? I mean, we do expect that the acpi_reboot() will fail sometimes and hence we want to try kb reset?

>
> See native_machine_emergency_restart function
>
>
>
> While kernel is using kbc reset, kernel would issue multiple commands to ec.

From the kernel code:

                case BOOT_KBD:
                        mach_reboot_fixups(); /* for board specific fixups */

                        for (i = 0; i < 10; i++) {
                                kb_wait();
                                udelay(50);
                                outb(0xfe, 0x64); /* pulse reset low */
                                udelay(50);
                        }

So the kernel will send 10 of these kbc reset commands to the EC, each spaced 100 us apart. Are we saying we should give EC more time for processing each kb reset?

>
> I observed multiple reset commands are queue up in the buffer in EC.
>
> In this case, we will see consecutive warm reset by EC. Sometimes the
> interval of 2 warm reset is just 30 ms

So are we saying that they get buffered in the EC, and the EC actually issues warm resets consecutively, spaced closer than it should have? TSo the actual issue is cause because the Intel expects that consequetive warm resets should be spaced > (some threshold) time) apart?

The issue that the BOOT_ACPI is what we actually want to work. It will write 6 to 0xcf9. But then the io transaction is retired and the cpu moves on and starts whacking the keyboard controller since the system hasn't reset yet. Apparently the reset is async and we haven't given enough time for the reset to occur. The system does eventually reset, but the EC has queued up reset requests and starts acting on them *after* the system has reset. That's the point of the patch in #81.

Comment 87 by rajatja@google.com, Apr 27 2017

> The system does eventually reset, but the EC has queued up reset requests
> and starts acting on them *after* the system has reset.

Yes, understood: so after the acpi reset, it will (unexpectedly) reset it 10 more times. But I'd think after the last such reset has been issued, the system should still boot up eventually, no? Why does coreboot not boot up after the last (buffered) reset has been issued?

> That's the point of
> the patch in #81.

Kane: I'm trying to validate this on my board. Meanwhile, is this something that can be upstreamed?
I don't think it's 10. It's probably a few. As I noted in #71 baytrail has very odd behavior in that it doesn't reset properly when a reset comes into too early after the previous reset.

Comment 89 by rajatja@google.com, Apr 27 2017

> I don't think it's 10. It's probably a few. As I noted in #71 baytrail has
> very odd behavior in that it doesn't reset properly when a reset comes into
> too early after the previous reset.

Ah got it, thanks. 

But then would it mean that the right fix should actually be in EC (not to send resets spaced closer than x microseconds)? I understand that it does not apply to this issue / baytrails because we don't want to release another EC image. But may be worth a considerations for newer designs (like apollolake you mentioned above).
Well, which reset request was the one that stuck? The cf9 or the one from the EC? It's a trick question as you don't now, but what we do know is that the cf9 request happened first. Reaction from SoC takes time. That's why the patch in #81 makes a lot of sense and would be the thing I would suggest.
Thanks Aaron for the clarification.
Similar thing *multiple warm reset cmds queued in EC* also happen on APL.
I would suggest putting some delay after the 0xcf9 = 0x6  in kernel to give soc time to respond.

Thank you

Comment 92 by rajatja@google.com, Apr 28 2017

I have run test of my 2 systems overnight and have so far not run into problems. However, I do want to say that we should leave headroom with delay because we're still not sure about how much is good enough (and a few ms delay on reboot is OK). I'd suggest may be 4-5 ms. Kane: is this something that you can upstream?
As for the udelay value I'd say 10-20ms is even sufficient. 
Thanks for the test, Rajat.
I will need to check with our linux kernel team about this upstream.
5/1 is public holiday in TW, so please give us some time to work on this.
Thank you.
Update on my test. I did hit the issue with this fix, but it was after 50+ hours of testing. So I think it might be that we want to increase the timeout a little. But overall, I think I'm satisfied with this solution. So if you can please send it upstream and raise a CL (with FROMLIST or UPSTREAM notification as appropriate), that should serve the purpose.

Comment 96 Deleted

Hi Rajat,
Today, I tried to use kernel 4.11 to repro the abnormal reset behavior.
I didn't see the abnormal reset behavior on kernel  4.11.0-rc8-00107-g0060e79 over 3 hours(originally I saw the abnormal behavior within 30 mins)
I think if the issue can't be repro on latest kernel, the upstream might be rejected.
I'm trying to run the warm reset test on my system over night with latest kernel.

Could u also try to repro this with latest kernel 4.11.0-rc8?
If this issue only happens on kernel v4.4, is it possible to upload a CL with CHROMIUM?
Thank you
Yes, if the issue is not reproduced with latest kernel, please send in CL with CHROMIUM.

Comment 99 Deleted

Comment 100 Deleted

Hi Rajat,
I've uploaded a change as CHROMIUM tot, please take a look.
https://chromium-review.googlesource.com/c/494446/
I only changed the udelay(1000) to mdelay(15), it should still fix the issue.
Thank you:)
Project Member

Comment 102 by bugdroid1@chromium.org, May 3 2017

Labels: merge-merged-chromeos-4.4
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/0185e2615ffffecf77020b8acc1864b3739a4b24

commit 0185e2615ffffecf77020b8acc1864b3739a4b24
Author: Kane Chen <kane.chen@intel.com>
Date: Wed May 03 20:15:47 2017

CHROMIUM: x86: add delay after acpi_reboot

On baytrail, we saw an issue that linux kernel v4.4
sends multiple reset commands to EC,
when cpu is not reset by io 0xcf9=0x6 immediately.
Those reset commands are queued in EC and cause EC continuously reset cpu
in a very short time.
In this case, cpu will be reset for multiple times.
This delay is to give cpu time to respond the warm reset by acpi_reboot.

This issue can't be reproduced kernel v3.10 and 4.11.0-rc8-00107-g0060e79.
I can only see issue on kernel kernel v4.4

BUG= chromium:696679 
TEST=run warm reset test on BYT chrome books,
     and issue is not seen over 48 hours.

Change-Id: I29fa0254747467aba80150d802569c1349392fd4
Signed-off-by: Kane Chen <kane.chen@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/494446
Commit-Ready: Rajat Jain <rajatja@chromium.org>
Tested-by: Rajat Jain <rajatja@chromium.org>
Reviewed-by: Aaron Durbin <adurbin@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>

[modify] https://crrev.com/0185e2615ffffecf77020b8acc1864b3739a4b24/arch/x86/kernel/reboot.c

What release branches should we be targeting the fix at?  I presume that although the issue surfaced on Kip, it impacts all BYT devices.
We should backport it to R59 and R58. Potentially R57, but I think the 3 baytrails on R57 (clapper, quawks, squawks) will anyway be upgraded by R58, so we may be OK not putting in R57.
Labels: Merge-Request-58 Merge-Request-59 M-59 M-58
Labels: Merge-Approved-59
Kane, can you send a CL for R59?
Labels: -Merge-Request-58 Merge-Approved-58
Project Member

Comment 109 by sheriffbot@chromium.org, May 4 2017

Labels: -Merge-Request-59 Hotlist-Merge-Approved
Your change meets the bar and is auto-approved for M59. Please go ahead and merge the CL to branch 3071 manually. Please contact milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), gkihumba@(ChromeOS), Abdul Syed@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Hi Rajat, 
Sorry I am kind of confused
I saw your comments in #107 and https://chromium-review.googlesource.com/c/494426

Do u you need me to run the warm reset test on R58, then cherry-pick to R59?
The test in #79 is done by release-R58-9334.B
Thank you.
Project Member

Comment 112 by bugdroid1@chromium.org, May 5 2017

Labels: merge-merged-release-R59-9460.B-chromeos-4.4
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/c6521f7e1e75e2cbfcd8641c716d978585e8e9ff

commit c6521f7e1e75e2cbfcd8641c716d978585e8e9ff
Author: Kane Chen <kane.chen@intel.com>
Date: Fri May 05 16:08:48 2017

CHROMIUM: x86: add delay after acpi_reboot

On baytrail, we saw an issue that linux kernel v4.4
sends multiple reset commands to EC,
when cpu is not reset by io 0xcf9=0x6 immediately.
Those reset commands are queued in EC and cause EC continuously reset cpu
in a very short time.
In this case, cpu will be reset for multiple times.
This delay is to give cpu time to respond the warm reset by acpi_reboot.

This issue can't be reproduced kernel v3.10 and 4.11.0-rc8-00107-g0060e79.
I can only see issue on kernel kernel v4.4

BUG= chromium:696679 
TEST=run warm reset test on BYT chrome books,
     and issue is not seen over 48 hours.

Change-Id: I29fa0254747467aba80150d802569c1349392fd4
Signed-off-by: Kane Chen <kane.chen@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/494446
Commit-Ready: Rajat Jain <rajatja@chromium.org>
Tested-by: Rajat Jain <rajatja@chromium.org>
Reviewed-by: Aaron Durbin <adurbin@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>
(cherry picked from commit 0185e2615ffffecf77020b8acc1864b3739a4b24)
Reviewed-on: https://chromium-review.googlesource.com/497226
Commit-Queue: Rajat Jain <rajatja@chromium.org>
Trybot-Ready: Rajat Jain <rajatja@chromium.org>

[modify] https://crrev.com/c6521f7e1e75e2cbfcd8641c716d978585e8e9ff/arch/x86/kernel/reboot.c

Project Member

Comment 113 by bugdroid1@chromium.org, May 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/c6521f7e1e75e2cbfcd8641c716d978585e8e9ff

commit c6521f7e1e75e2cbfcd8641c716d978585e8e9ff
Author: Kane Chen <kane.chen@intel.com>
Date: Fri May 05 16:08:48 2017

CHROMIUM: x86: add delay after acpi_reboot

On baytrail, we saw an issue that linux kernel v4.4
sends multiple reset commands to EC,
when cpu is not reset by io 0xcf9=0x6 immediately.
Those reset commands are queued in EC and cause EC continuously reset cpu
in a very short time.
In this case, cpu will be reset for multiple times.
This delay is to give cpu time to respond the warm reset by acpi_reboot.

This issue can't be reproduced kernel v3.10 and 4.11.0-rc8-00107-g0060e79.
I can only see issue on kernel kernel v4.4

BUG= chromium:696679 
TEST=run warm reset test on BYT chrome books,
     and issue is not seen over 48 hours.

Change-Id: I29fa0254747467aba80150d802569c1349392fd4
Signed-off-by: Kane Chen <kane.chen@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/494446
Commit-Ready: Rajat Jain <rajatja@chromium.org>
Tested-by: Rajat Jain <rajatja@chromium.org>
Reviewed-by: Aaron Durbin <adurbin@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>
(cherry picked from commit 0185e2615ffffecf77020b8acc1864b3739a4b24)
Reviewed-on: https://chromium-review.googlesource.com/497226
Commit-Queue: Rajat Jain <rajatja@chromium.org>
Trybot-Ready: Rajat Jain <rajatja@chromium.org>

[modify] https://crrev.com/c6521f7e1e75e2cbfcd8641c716d978585e8e9ff/arch/x86/kernel/reboot.c

Project Member

Comment 114 by bugdroid1@chromium.org, May 5 2017

Labels: merge-merged-release-R58-9334.B-chromeos-4.4
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/31d7be5289e63aabaf9a43cc3c1e24b57e40cdc3

commit 31d7be5289e63aabaf9a43cc3c1e24b57e40cdc3
Author: Kane Chen <kane.chen@intel.com>
Date: Fri May 05 16:11:09 2017

CHROMIUM: x86: add delay after acpi_reboot

On baytrail, we saw an issue that linux kernel v4.4
sends multiple reset commands to EC,
when cpu is not reset by io 0xcf9=0x6 immediately.
Those reset commands are queued in EC and cause EC continuously reset cpu
in a very short time.
In this case, cpu will be reset for multiple times.
This delay is to give cpu time to respond the warm reset by acpi_reboot.

This issue can't be reproduced kernel v3.10 and 4.11.0-rc8-00107-g0060e79.
I can only see issue on kernel kernel v4.4

BUG= chromium:696679 
TEST=run warm reset test on BYT chrome books,
     and issue is not seen over 48 hours.

Change-Id: I29fa0254747467aba80150d802569c1349392fd4
Signed-off-by: Kane Chen <kane.chen@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/494426
Commit-Queue: Rajat Jain <rajatja@chromium.org>
Trybot-Ready: Rajat Jain <rajatja@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>

[modify] https://crrev.com/31d7be5289e63aabaf9a43cc3c1e24b57e40cdc3/arch/x86/kernel/reboot.c

OK, fix has landed, need to mark kip-paladin as non-experimental again:

https://chromium-review.googlesource.com/c/455356/

Aviv, can you please review once again as the revert was not clean (were conflicts).
Project Member

Comment 116 by sheriffbot@chromium.org, May 8 2017

Cc: bhthompson@google.com gkihumba@google.com
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Approved-58
Project Member

Comment 118 by sheriffbot@chromium.org, May 12 2017

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 119 by sheriffbot@chromium.org, Jun 23 2017

Labels: -Merge-Approved-59
This issue hasn't been updated in the last 6 weeks, so removing its merge approval label. Please re-request a merge if needed.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: akeshet-pending-downgrade
ChromeOS Infra P1 Bugscrub.

P1 Bugs in this component should be important enough to get weekly status updates.

Is this already fixed?  -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority.
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign.
Does this bug have the wrong owner? -> reassign.

Bugs that remain in this state next week will be downgraded to P2.
Labels: -akeshet-pending-downgrade Pri-2
ChromeOS Infra P1 Bugscrub.

Issue untouched in a week after previous message. Downgrading to P2.
Status: Fixed (was: Assigned)
Fixed?
Status: Archived (was: Fixed)
Status: Fixed (was: Archived)
Showing comments 25 - 124 of 124 Older

Sign in to add a comment