New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 614488 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

[LARS] : sometime DUT enters suspend state after a delay and this fails test and device goes into repair-fail loop

Project Member Reported by shrawan@google.com, May 24 2016

Issue description

Test observation: https://wmatrix.googleplex.com/unfiltered?tests=audio_AudioAfterSuspend.internal_speaker&platforms=lars&days_back=10

It looks like sometime Lars takes some time to go in suspended state after issuing suspend command. This makes the test fails and also stops the further usability of DUT for other automated test. Since DUT gets suspended with a delay (after expected timeout), it has no IP connectivity so cannot be pinged anymore. All the following verify/repair tasks fail.

Test: audio_AudioAfterSuspend.internal_speaker
Job: lars-release/R53-8360.0.0/chameleon_audio/audio_AudioAfterSuspend.internal_speaker (64298547-chromeos-test)
job link: http://cautotest/afe/#tab_id=view_job&object_id=64298547
test Logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/64298547-chromeos-test/chromeos1-row5-rack3-host2/debug


In debug logs at step  "TEST CASE 2: UNPLUG > suspend > PLUG > PLUG > resume > UNPLUG", DUT did not suspend after suspend command.

log snippet:
--------------------------------------------------------------------------------------------------------------------------------------
05/23 18:56:20.970 DEBUG|          ssh_host:0180| Running (ssh) '( echo 0 > /sys/class/rtc/rtc0/wakealarm && echo +30 > /sys/class/rtc/rtc0/wakealarm && powerd_dbus_suspend --delay=0 ) </dev/null >/dev/null 2>&1 & echo -n $!'
   ...
   ...
   ... 
05/23 18:56:31.107 DEBUG|        base_utils:0269| [stdout] --- chromeos1-row5-rack3-host2.cros.corp.google.com ping statistics ---
05/23 18:56:31.107 DEBUG|        base_utils:0269| [stdout] 1 packets transmitted, 1 received, 0% packet loss, time 0ms
05/23 18:56:31.108 DEBUG|        base_utils:0269| [stdout] rtt min/avg/max/mdev = 0.441/0.441/0.441/0.000 ms
05/23 18:56:31.109 DEBUG|audio_AudioAfterSu:0048| Plugging
05/23 18:56:33.141 INFO |       audio_board:0247| Plugged 3.5mm audio cable to Cros device
05/23 18:56:35.152 DEBUG|audio_AudioAfterSu:0048| Plugging

--------------------------------------------------------------------------------------------------------------------------------------

Then DUT suspends and so all subsequent rsync calls are failing and the test failed with scp command "Unhandled AutoservRunError: scp -rq -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/tmp94WNLz -P 22 root@chromeos1-row5-rack3-host2:"/tmp/audio_dump_Xzq1OA.txt" '/usr/local/autotest/results/63670285-chromeos-test/audio_AudioAfterSuspend.internal_speaker/results/audio_diagnostics.txt.before_playback'".


When I went to the DUT, screen was blacked out similar to one in suspended state. I touched the touch-pad and DUT resumed and I saw chrome browser open. I checked for the IP-address in the status bar and it was present. So I restarted the DUT and it got stuck at chrome splash screen. I tried restarting the DUT few more times and every time it got stuck at splash screen. Seeing this I restated the chameleon board and DUT got onto login screen. 


I'm going to rerun the test and will update the status.


 

Comment 1 by derat@chromium.org, May 24 2016

Cc: derat@chromium.org
I'm never able to find the correct log entries in the autotest results (apparently because we only gather entries that were written while the test was running), and chromeos1-row5-rack3-host2.cros looks like it's since been wiped.

The right way to debug this would be to check the /var/log/power_manager and /var/log/messages entries spanning the 18:56:20 suspend attempt to see what userspace and the kernel were doing. If you can find those entries, I'm happy to help interpret them.

Comment 2 by shrawan@google.com, May 24 2016

Thanks for input, let me see if I can get these entries during my test run.

Comment 3 by shrawan@google.com, May 24 2016

I ran the test manually for 10 iterations from chroot but not able to reproduce. Will wait overnight to see if chameleon goes into bad state and causes this. 

I will update this in tomorrow morning. 

Comment 4 by ka...@chromium.org, May 24 2016

Cc: waihong@chromium.org cychiang@chromium.org
Components: Test
I have some suspicion that such behavior is provoked by test tool chameleon and had similar observations with other device(s) - issue 604783, issue 606573, 

Particularly in this case while waiting to be suspended from 05/23 18:56:20 to 05/23 18:56:50, 
- there is no indication the DUT lost connectivity, 
- out of no where test is reporting that suspend did not last enough long(29 instead of 30 seconds), but there is no evidence it did suspend.
At 05/23 18:57:10 rsync command gets stuck and 45 min later(05/23 19:42:10) the error outputs starts... and test fails with connection timeouts and scp unable to copy files...


No data for this time-frame is in power_manager logs(https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/64298547-chromeos-test/chromeos1-row5-rack3-host2/sysinfo/var/log/power_manager/). Last log is at 0523/185438


Shrawan reported he ran the test 10 times after the DUT and chameleon reboot and it passed.
My version of the problem is that after some time(after chameleon or DUT and chameleon) chameleon USB connection somehow interferes with DUT and prevents normal suspend/resume and boot to login screen. This is still to be looked at and proven, but observations like these are becoming more frequent.

Comment 5 by ka...@chromium.org, May 25 2016

Device is again in this state this morning after executing this same test - audio_AudioAfterSuspend.internal_speaker - http://cautotest/afe/#tab_id=view_job&object_id=64410121.

Shrawan please reset the test cell(audio-box) again, unplug chameleon USB cable and observe the test runss for the next one or two days.

Comment 6 by derat@chromium.org, May 25 2016

Can you get the logs described in #1?

Comment 7 by shrawan@google.com, May 25 2016



The new time stamp to look at is from 17:00:23. with the new failure today the log snippet from the test is as follow:

----------------------------------------------------
05/24 17:00:18.386 INFO |audio_AudioAfterSu:0270| TEST CASE 2: UNPLUG > suspend > PLUG > PLUG > resume > UNPLUG
05/24 17:00:18.386 DEBUG|audio_AudioAfterSu:0048| Unplugging
05/24 17:00:20.537 INFO |       audio_board:0253| Unplugged 3.5mm audio cable from Cros device
05/24 17:00:22.547 DEBUG|          ssh_host:0180| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
05/24 17:00:23.461 DEBUG|        base_utils:0269| [stdout] a0ff84ed-8548-4cc8-be16-478f648d4fb4
05/24 17:00:23.465 INFO |        server_job:0128| 	START	----	suspend	timestamp=1464134423	localtime=May 24 17:00:23	
05/24 17:00:23.465 DEBUG|        base_utils:0176| Running 'ping chromeos1-row5-rack3-host2 -w1 -c1'
05/24 17:00:23.517 INFO |        server_job:0128| 		GOOD	----	suspend.start for 30 seconds	timestamp=1464134423	localtime=May 24 17:00:23	
05/24 17:00:23.518 DEBUG|          ssh_host:0180| Running (ssh) '( echo 0 > /sys/class/rtc/rtc0/wakealarm && echo +30 > /sys/class/rtc/rtc0/wakealarm && powerd_dbus_suspend --delay=0 ) </dev/null >/dev/null 2>&1 & echo -n $!'
05/24 17:00:23.602 DEBUG|        base_utils:0269| [stdout] PING chromeos1-row5-rack3-host2.cros.corp.google.com (172.27.213.242) 56(84) bytes of data.
05/24 17:00:23.603 DEBUG|        base_utils:0269| [stdout] 64 bytes from 172.27.213.242: icmp_seq=1 ttl=57 time=59.4 ms
    ::
    ::
    ::
05/24 17:00:39.465 DEBUG|        base_utils:0176| Running 'ping chromeos1-row5-rack3-host2 -w1 -c1'
05/24 17:00:39.634 DEBUG|        base_utils:0269| [stdout] PING chromeos1-row5-rack3-host2.cros.corp.google.com (172.27.213.242) 56(84) bytes of data.
05/24 17:00:39.635 DEBUG|        base_utils:0269| [stdout] 64 bytes from 172.27.213.242: icmp_seq=1 ttl=57 time=59.2 ms
05/24 17:00:39.635 DEBUG|        base_utils:0269| [stdout] 
05/24 17:00:39.636 DEBUG|        base_utils:0269| [stdout] --- chromeos1-row5-rack3-host2.cros.corp.google.com ping statistics ---
05/24 17:00:39.636 DEBUG|        base_utils:0269| [stdout] 1 packets transmitted, 1 received, 0% packet loss, time 0ms
05/24 17:00:39.637 DEBUG|        base_utils:0269| [stdout] rtt min/avg/max/mdev = 59.233/59.233/59.233/0.000 ms
05/24 17:00:39.638 DEBUG|          ssh_host:0180| Running (ssh) 'ls /bin/uname >/dev/null 2>&1'
05/24 17:00:39.981 INFO |       audio_board:0247| Plugged 3.5mm audio cable to Cros device
05/24 17:00:41.993 DEBUG|          ssh_host:0180| Running (ssh) 'true'
05/24 17:01:42.386 WARNI|        base_utils:0921| run process timeout (60) fired on: /usr/bin/ssh -a -x     -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos1-row5-rack3-host2 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
05/24 17:01:49.425 DEBUG|      abstract_ssh:0550| Host chromeos1-row5-rack3-host2 is still down after waiting 68 seconds
05/24 17:01:49.426 DEBUG|audio_AudioAfterSu:0048| Unplugging
05/24 17:01:51.577 INFO |       audio_board:0253| Unplugged 3.5mm audio cable from Cros device
05/24 17:18:11.068 INFO |remote_facade_fact:0143| Setup the connection to RPC server, with retries...
05/24 17:18:11.069 DEBUG|          ssh_host:0180| Running (ssh) 'pkill -f 'multimedia_xmlrpc_server''
05/24 17:20:14.429 DEBUG|        base_utils:0269| [stderr] ssh: connect to host chromeos1-row5-rack3-host2 port 22: Connection timed out

-----------------------------------------------------------------

Log is at : https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/64410121-chromeos-test/chromeos1-row5-rack3-host2/debug/


Generate log from the DUT is kept at: https://cloud.google.com/console/storage/chromiumos-test-logs/bugfiles/cr/614488/

Comment 8 by derat@chromium.org, May 25 2016

Cc: dbasehore@chromium.org jwer...@chromium.org
Here are the corresponding powerd log entries. You can see that the initial suspend attempt was canceled due to the wakeup count in the kernel having increased:

[0524/170023:INFO:daemon.cc(1253)] Got RequestSuspend message from :1.216
[0524/170023:INFO:suspender.cc(397)] Starting request 83230723
[0524/170023:INFO:daemon.cc(613)] Reading wakeup count from /sys/power/wakeup_count
[0524/170023:INFO:daemon.cc(617)] Read wakeup count 0x7f07018ada58
[0524/170023:INFO:internal_backlight_controller.cc(676)] Setting brightness to 0 (0%) over 0 ms
[0524/170023:INFO:internal_backlight_controller.cc(692)] Setting resume brightness to 932 (80%)
[0524/170023:INFO:suspend_delay_controller.cc(139)] Announcing suspend request 83230723 with 2 pending delay(s) and 0 outstanding delay(s) from previous request
[0524/170023:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83230721 (shill) is ready for suspend request 83230723 from :1.10
[0524/170024:INFO:daemon.cc(1412)] Chrome is using normal display mode
[0524/170024:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83230734 (chrome) is ready for suspend request 83230723 from :1.205
[0524/170024:INFO:suspend_delay_controller.cc(224)] Notifying observers that suspend is ready
[0524/170024:INFO:suspender.cc(466)] Starting suspend
[0524/170024:INFO:util.cc(48)] Running "/usr/bin/powerd_setuid_helper --action=suspend  --suspend_wakeup_count_valid --suspend_wakeup_count=7"
[0524/170044:ERROR:util.cc(54)] Command failed with 3
[0524/170044:INFO:daemon.cc(714)] powerd_suspend returned 3
[0524/170044:WARNING:suspender.cc(554)] Suspend attempt #1 canceled due to wake event
[0524/170044:INFO:suspender.cc(562)] Notifying registered dark suspend delays about 83263489
[0524/170044:INFO:suspend_delay_controller.cc(139)] Announcing dark suspend request 83263489 with 2 pending delay(s) and 0 outstanding delay(s) from previous request
[0524/170044:INFO:audio_client.cc(112)] Updated audio devices: headphones plugged, HDMI inactive
[0524/170044:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83263502 (chrome) is ready for dark suspend request 83263489 from :1.205
[0524/170044:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83263489 (shill) is ready for dark suspend request 83263489 from :1.10
[0524/170044:INFO:suspend_delay_controller.cc(224)] Notifying observers that dark suspend is ready
[0524/170044:INFO:daemon.cc(613)] Reading wakeup count from /sys/power/wakeup_count
[0524/170044:INFO:daemon.cc(617)] Read wakeup count 0x7f07018ada58
[0524/170044:INFO:suspender.cc(466)] Starting suspend
[0524/170044:INFO:util.cc(48)] Running "/usr/bin/powerd_setuid_helper --action=suspend  --suspend_wakeup_count_valid --suspend_wakeup_count=7"
[0524/170053:ERROR:util.cc(54)] Command failed with 3
[0524/170053:INFO:daemon.cc(714)] powerd_suspend returned 3
[0524/170053:WARNING:suspender.cc(554)] Suspend attempt #2 canceled due to wake event
[0524/170053:INFO:suspender.cc(562)] Notifying registered dark suspend delays about 83263490
[0524/170053:INFO:suspend_delay_controller.cc(139)] Announcing dark suspend request 83263490 with 2 pending delay(s) and 0 outstanding delay(s) from previous request
[0524/170053:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83263489 (shill) is ready for dark suspend request 83263490 from :1.10
[0524/170053:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83263502 (chrome) is ready for dark suspend request 83263490 from :1.205
[0524/170053:INFO:suspend_delay_controller.cc(224)] Notifying observers that dark suspend is ready
[0524/170053:INFO:daemon.cc(613)] Reading wakeup count from /sys/power/wakeup_count
[0524/170053:INFO:daemon.cc(617)] Read wakeup count 0x7f07018ada58
[0524/170053:INFO:suspender.cc(466)] Starting suspend
[0524/170053:INFO:util.cc(48)] Running "/usr/bin/powerd_setuid_helper --action=suspend  --suspend_wakeup_count_valid --suspend_wakeup_count=8"
[0525/083246:INFO:daemon.cc(714)] powerd_suspend returned 0
[0525/083246:INFO:suspender.cc(424)] Finishing request 83230723 successfully

I'm not sure why the wakeup count didn't match. powerd tried to resuspend after this, and was eventually successful at 17:00:53, but I think that the 10-second wakealarm had already fired at this point, so the system stayed suspended.

Comment 9 by derat@chromium.org, May 25 2016

2016-05-24T17:00:24.331603-07:00 NOTICE powerd_suspend[13630]: Going to suspend-to-RAM state: args=--suspend_duration=-1 --nosuspend_to_idle --wakeup_count=7
2016-05-24T17:00:24.362754-07:00 NOTICE powerd_suspend[13682]: Explicit sync
2016-05-24T17:00:24.542511-07:00 NOTICE powerd_suspend[13684]: Finalizing suspend
2016-05-24T17:00:24.558185-07:00 INFO kernel: [ 2586.833365] PM: Syncing filesystems ... done.
2016-05-24T17:00:24.558210-07:00 DEBUG kernel: [ 2586.846099] PM: Preparing system for sleep (mem)
2016-05-24T17:00:44.566391-07:00 WARNING kernel: [ 2586.846933] Freezing user space processes ... (elapsed 0.001 seconds) done.done.
2016-05-24T17:00:44.566415-07:00 WARNING kernel: [ 2586.848689] Freezing remaining freezable tasks ... 
2016-05-24T17:00:44.566423-07:00 ERR kernel: [ 2593.112797] usb 1-5.3: device descriptor read/64, error -110
2016-05-24T17:00:44.566427-07:00 WARNING kernel: [ 2606.831575] 
2016-05-24T17:00:44.566435-07:00 ERR kernel: [ 2606.831593] Freezing of tasks failed after 20.002 seconds (0 tasks refusing to freeze, wq_busy=1):
2016-05-24T17:00:44.566441-07:00 WARNING kernel: [ 2606.831722] 
2016-05-24T17:00:44.566448-07:00 WARNING kernel: [ 2606.831731] Restarting kernel threads ... done.
2016-05-24T17:00:44.571356-07:00 WARNING kernel: [ 2606.833544] Restarting tasks ... done.
2016-05-24T17:00:44.571993-07:00 INFO sshd[13688]: SSH: Server;Ltype: Version;Remote: 100.108.96.5-56873;Protocol: 2.0;Client: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2
2016-05-24T17:00:44.572790-07:00 INFO sshd[13689]: SSH: Server;Ltype: Version;Remote: 100.108.96.5-56918;Protocol: 2.0;Client: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2
2016-05-24T17:00:44.597187-07:00 NOTICE powerd_suspend[13694]: Error writing to /sys/power/state: cat: write error: Device or resource busy

Comment 10 by ka...@chromium.org, May 25 2016

Owner: cychiang@chromium.org
Same "ERR kernel: ... usb ... device descriptor read/64, error -110" was pbserved with lumpy board while ago having same problem -issue 605137 - with chameleon setup, getting stuck, and finally we removed lumpy from tested matrix.

Similar observation of chameleon USB interference was described in issue 604783, and fixed by replacing chameleon FPGA board.
I believe related are issues 606573 and  issue 610728 , and I will look for similar 'Err kernel' traces in devices logs.

This test uses server/hosts/cros_host.py#suspend rather than client/cros/sys_power.py#do_suspend, which is more sophisticated. This problem could be prevented (making the test fail clearly instead, and giving it the option to retry if it implemented that) by changing CrosHost#suspend to also read and pass the current wakeup count.

Comment 12 by ka...@chromium.org, May 26 2016

Labels: -Pri-2 Pri-1
The test passed for build R53-8368.0.0 at http://cautotest/afe/#tab_id=view_job&object_id=64534751 - this is where chameleon-to DUT USB connection is unplugged.


Then I connected the USB between chameleon and DUT and rebooted DUT - it rebooted well as expected(and not hanging on splash screen).

Then I re-ran the test again, but one more before it - just as usually being run last several days
1) audio_AudioQualityAfterSuspend.internal_speaker.wav - http://cautotest/afe/#tab_id=view_job&object_id=64704712
2) audio_AudioAfterSuspend.internal_speaker - http://cautotest/afe/#tab_id=view_job&object_id=64705041
Both tests failed with 'Wrong input node(s) selected ['USB'] instead ['INTERNAL_MIC']!', which indicated chameleon keeps the USB alive when it should not, so it must be in bad state.

Then I rebooted DUT and it hung on splash screen. Unplugging USB let the DUT proceed to login screen. 

Plugged the USB back, and shrawan@ ran the test(2) from chroot, and the DUT hanged on the second iteration, as it is described in this bug. 

Now we have both DUT and chameleon in the state of failure. The DUT is locked on cautotest, so no other tests will proceed.

Jimmy and Tom, can you please look at chameleon host and see if anything looks abnormal?
(This chameleon board was new and assebled specifically for this audio-box setup, and not beeing previously used.)





Project Member

Comment 13 by sheriffbot@chromium.org, Jun 2 2016

Labels: Hotlist-Google
Labels: USBconnect

Comment 15 by kalin@google.com, Oct 20 2016

Status: WontFix (was: Untriaged)
Old issues.

New issues will be filed if tests fail.

Sign in to add a comment