[LARS] : sometime DUT enters suspend state after a delay and this fails test and device goes into repair-fail loop |
||||||||
Issue descriptionTest observation: https://wmatrix.googleplex.com/unfiltered?tests=audio_AudioAfterSuspend.internal_speaker&platforms=lars&days_back=10 It looks like sometime Lars takes some time to go in suspended state after issuing suspend command. This makes the test fails and also stops the further usability of DUT for other automated test. Since DUT gets suspended with a delay (after expected timeout), it has no IP connectivity so cannot be pinged anymore. All the following verify/repair tasks fail. Test: audio_AudioAfterSuspend.internal_speaker Job: lars-release/R53-8360.0.0/chameleon_audio/audio_AudioAfterSuspend.internal_speaker (64298547-chromeos-test) job link: http://cautotest/afe/#tab_id=view_job&object_id=64298547 test Logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/64298547-chromeos-test/chromeos1-row5-rack3-host2/debug In debug logs at step "TEST CASE 2: UNPLUG > suspend > PLUG > PLUG > resume > UNPLUG", DUT did not suspend after suspend command. log snippet: -------------------------------------------------------------------------------------------------------------------------------------- 05/23 18:56:20.970 DEBUG| ssh_host:0180| Running (ssh) '( echo 0 > /sys/class/rtc/rtc0/wakealarm && echo +30 > /sys/class/rtc/rtc0/wakealarm && powerd_dbus_suspend --delay=0 ) </dev/null >/dev/null 2>&1 & echo -n $!' ... ... ... 05/23 18:56:31.107 DEBUG| base_utils:0269| [stdout] --- chromeos1-row5-rack3-host2.cros.corp.google.com ping statistics --- 05/23 18:56:31.107 DEBUG| base_utils:0269| [stdout] 1 packets transmitted, 1 received, 0% packet loss, time 0ms 05/23 18:56:31.108 DEBUG| base_utils:0269| [stdout] rtt min/avg/max/mdev = 0.441/0.441/0.441/0.000 ms 05/23 18:56:31.109 DEBUG|audio_AudioAfterSu:0048| Plugging 05/23 18:56:33.141 INFO | audio_board:0247| Plugged 3.5mm audio cable to Cros device 05/23 18:56:35.152 DEBUG|audio_AudioAfterSu:0048| Plugging -------------------------------------------------------------------------------------------------------------------------------------- Then DUT suspends and so all subsequent rsync calls are failing and the test failed with scp command "Unhandled AutoservRunError: scp -rq -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/tmp94WNLz -P 22 root@chromeos1-row5-rack3-host2:"/tmp/audio_dump_Xzq1OA.txt" '/usr/local/autotest/results/63670285-chromeos-test/audio_AudioAfterSuspend.internal_speaker/results/audio_diagnostics.txt.before_playback'". When I went to the DUT, screen was blacked out similar to one in suspended state. I touched the touch-pad and DUT resumed and I saw chrome browser open. I checked for the IP-address in the status bar and it was present. So I restarted the DUT and it got stuck at chrome splash screen. I tried restarting the DUT few more times and every time it got stuck at splash screen. Seeing this I restated the chameleon board and DUT got onto login screen. I'm going to rerun the test and will update the status.
,
May 24 2016
Thanks for input, let me see if I can get these entries during my test run.
,
May 24 2016
I ran the test manually for 10 iterations from chroot but not able to reproduce. Will wait overnight to see if chameleon goes into bad state and causes this. I will update this in tomorrow morning.
,
May 24 2016
I have some suspicion that such behavior is provoked by test tool chameleon and had similar observations with other device(s) - issue 604783, issue 606573, Particularly in this case while waiting to be suspended from 05/23 18:56:20 to 05/23 18:56:50, - there is no indication the DUT lost connectivity, - out of no where test is reporting that suspend did not last enough long(29 instead of 30 seconds), but there is no evidence it did suspend. At 05/23 18:57:10 rsync command gets stuck and 45 min later(05/23 19:42:10) the error outputs starts... and test fails with connection timeouts and scp unable to copy files... No data for this time-frame is in power_manager logs(https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/64298547-chromeos-test/chromeos1-row5-rack3-host2/sysinfo/var/log/power_manager/). Last log is at 0523/185438 Shrawan reported he ran the test 10 times after the DUT and chameleon reboot and it passed. My version of the problem is that after some time(after chameleon or DUT and chameleon) chameleon USB connection somehow interferes with DUT and prevents normal suspend/resume and boot to login screen. This is still to be looked at and proven, but observations like these are becoming more frequent.
,
May 25 2016
Device is again in this state this morning after executing this same test - audio_AudioAfterSuspend.internal_speaker - http://cautotest/afe/#tab_id=view_job&object_id=64410121. Shrawan please reset the test cell(audio-box) again, unplug chameleon USB cable and observe the test runss for the next one or two days.
,
May 25 2016
Can you get the logs described in #1?
,
May 25 2016
The new time stamp to look at is from 17:00:23. with the new failure today the log snippet from the test is as follow: ---------------------------------------------------- 05/24 17:00:18.386 INFO |audio_AudioAfterSu:0270| TEST CASE 2: UNPLUG > suspend > PLUG > PLUG > resume > UNPLUG 05/24 17:00:18.386 DEBUG|audio_AudioAfterSu:0048| Unplugging 05/24 17:00:20.537 INFO | audio_board:0253| Unplugged 3.5mm audio cable from Cros device 05/24 17:00:22.547 DEBUG| ssh_host:0180| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi' 05/24 17:00:23.461 DEBUG| base_utils:0269| [stdout] a0ff84ed-8548-4cc8-be16-478f648d4fb4 05/24 17:00:23.465 INFO | server_job:0128| START ---- suspend timestamp=1464134423 localtime=May 24 17:00:23 05/24 17:00:23.465 DEBUG| base_utils:0176| Running 'ping chromeos1-row5-rack3-host2 -w1 -c1' 05/24 17:00:23.517 INFO | server_job:0128| GOOD ---- suspend.start for 30 seconds timestamp=1464134423 localtime=May 24 17:00:23 05/24 17:00:23.518 DEBUG| ssh_host:0180| Running (ssh) '( echo 0 > /sys/class/rtc/rtc0/wakealarm && echo +30 > /sys/class/rtc/rtc0/wakealarm && powerd_dbus_suspend --delay=0 ) </dev/null >/dev/null 2>&1 & echo -n $!' 05/24 17:00:23.602 DEBUG| base_utils:0269| [stdout] PING chromeos1-row5-rack3-host2.cros.corp.google.com (172.27.213.242) 56(84) bytes of data. 05/24 17:00:23.603 DEBUG| base_utils:0269| [stdout] 64 bytes from 172.27.213.242: icmp_seq=1 ttl=57 time=59.4 ms :: :: :: 05/24 17:00:39.465 DEBUG| base_utils:0176| Running 'ping chromeos1-row5-rack3-host2 -w1 -c1' 05/24 17:00:39.634 DEBUG| base_utils:0269| [stdout] PING chromeos1-row5-rack3-host2.cros.corp.google.com (172.27.213.242) 56(84) bytes of data. 05/24 17:00:39.635 DEBUG| base_utils:0269| [stdout] 64 bytes from 172.27.213.242: icmp_seq=1 ttl=57 time=59.2 ms 05/24 17:00:39.635 DEBUG| base_utils:0269| [stdout] 05/24 17:00:39.636 DEBUG| base_utils:0269| [stdout] --- chromeos1-row5-rack3-host2.cros.corp.google.com ping statistics --- 05/24 17:00:39.636 DEBUG| base_utils:0269| [stdout] 1 packets transmitted, 1 received, 0% packet loss, time 0ms 05/24 17:00:39.637 DEBUG| base_utils:0269| [stdout] rtt min/avg/max/mdev = 59.233/59.233/59.233/0.000 ms 05/24 17:00:39.638 DEBUG| ssh_host:0180| Running (ssh) 'ls /bin/uname >/dev/null 2>&1' 05/24 17:00:39.981 INFO | audio_board:0247| Plugged 3.5mm audio cable to Cros device 05/24 17:00:41.993 DEBUG| ssh_host:0180| Running (ssh) 'true' 05/24 17:01:42.386 WARNI| base_utils:0921| run process timeout (60) fired on: /usr/bin/ssh -a -x -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos1-row5-rack3-host2 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true" 05/24 17:01:49.425 DEBUG| abstract_ssh:0550| Host chromeos1-row5-rack3-host2 is still down after waiting 68 seconds 05/24 17:01:49.426 DEBUG|audio_AudioAfterSu:0048| Unplugging 05/24 17:01:51.577 INFO | audio_board:0253| Unplugged 3.5mm audio cable from Cros device 05/24 17:18:11.068 INFO |remote_facade_fact:0143| Setup the connection to RPC server, with retries... 05/24 17:18:11.069 DEBUG| ssh_host:0180| Running (ssh) 'pkill -f 'multimedia_xmlrpc_server'' 05/24 17:20:14.429 DEBUG| base_utils:0269| [stderr] ssh: connect to host chromeos1-row5-rack3-host2 port 22: Connection timed out ----------------------------------------------------------------- Log is at : https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/64410121-chromeos-test/chromeos1-row5-rack3-host2/debug/ Generate log from the DUT is kept at: https://cloud.google.com/console/storage/chromiumos-test-logs/bugfiles/cr/614488/
,
May 25 2016
Here are the corresponding powerd log entries. You can see that the initial suspend attempt was canceled due to the wakeup count in the kernel having increased: [0524/170023:INFO:daemon.cc(1253)] Got RequestSuspend message from :1.216 [0524/170023:INFO:suspender.cc(397)] Starting request 83230723 [0524/170023:INFO:daemon.cc(613)] Reading wakeup count from /sys/power/wakeup_count [0524/170023:INFO:daemon.cc(617)] Read wakeup count 0x7f07018ada58 [0524/170023:INFO:internal_backlight_controller.cc(676)] Setting brightness to 0 (0%) over 0 ms [0524/170023:INFO:internal_backlight_controller.cc(692)] Setting resume brightness to 932 (80%) [0524/170023:INFO:suspend_delay_controller.cc(139)] Announcing suspend request 83230723 with 2 pending delay(s) and 0 outstanding delay(s) from previous request [0524/170023:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83230721 (shill) is ready for suspend request 83230723 from :1.10 [0524/170024:INFO:daemon.cc(1412)] Chrome is using normal display mode [0524/170024:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83230734 (chrome) is ready for suspend request 83230723 from :1.205 [0524/170024:INFO:suspend_delay_controller.cc(224)] Notifying observers that suspend is ready [0524/170024:INFO:suspender.cc(466)] Starting suspend [0524/170024:INFO:util.cc(48)] Running "/usr/bin/powerd_setuid_helper --action=suspend --suspend_wakeup_count_valid --suspend_wakeup_count=7" [0524/170044:ERROR:util.cc(54)] Command failed with 3 [0524/170044:INFO:daemon.cc(714)] powerd_suspend returned 3 [0524/170044:WARNING:suspender.cc(554)] Suspend attempt #1 canceled due to wake event [0524/170044:INFO:suspender.cc(562)] Notifying registered dark suspend delays about 83263489 [0524/170044:INFO:suspend_delay_controller.cc(139)] Announcing dark suspend request 83263489 with 2 pending delay(s) and 0 outstanding delay(s) from previous request [0524/170044:INFO:audio_client.cc(112)] Updated audio devices: headphones plugged, HDMI inactive [0524/170044:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83263502 (chrome) is ready for dark suspend request 83263489 from :1.205 [0524/170044:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83263489 (shill) is ready for dark suspend request 83263489 from :1.10 [0524/170044:INFO:suspend_delay_controller.cc(224)] Notifying observers that dark suspend is ready [0524/170044:INFO:daemon.cc(613)] Reading wakeup count from /sys/power/wakeup_count [0524/170044:INFO:daemon.cc(617)] Read wakeup count 0x7f07018ada58 [0524/170044:INFO:suspender.cc(466)] Starting suspend [0524/170044:INFO:util.cc(48)] Running "/usr/bin/powerd_setuid_helper --action=suspend --suspend_wakeup_count_valid --suspend_wakeup_count=7" [0524/170053:ERROR:util.cc(54)] Command failed with 3 [0524/170053:INFO:daemon.cc(714)] powerd_suspend returned 3 [0524/170053:WARNING:suspender.cc(554)] Suspend attempt #2 canceled due to wake event [0524/170053:INFO:suspender.cc(562)] Notifying registered dark suspend delays about 83263490 [0524/170053:INFO:suspend_delay_controller.cc(139)] Announcing dark suspend request 83263490 with 2 pending delay(s) and 0 outstanding delay(s) from previous request [0524/170053:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83263489 (shill) is ready for dark suspend request 83263490 from :1.10 [0524/170053:INFO:suspend_delay_controller.cc(88)] Got notification that delay 83263502 (chrome) is ready for dark suspend request 83263490 from :1.205 [0524/170053:INFO:suspend_delay_controller.cc(224)] Notifying observers that dark suspend is ready [0524/170053:INFO:daemon.cc(613)] Reading wakeup count from /sys/power/wakeup_count [0524/170053:INFO:daemon.cc(617)] Read wakeup count 0x7f07018ada58 [0524/170053:INFO:suspender.cc(466)] Starting suspend [0524/170053:INFO:util.cc(48)] Running "/usr/bin/powerd_setuid_helper --action=suspend --suspend_wakeup_count_valid --suspend_wakeup_count=8" [0525/083246:INFO:daemon.cc(714)] powerd_suspend returned 0 [0525/083246:INFO:suspender.cc(424)] Finishing request 83230723 successfully I'm not sure why the wakeup count didn't match. powerd tried to resuspend after this, and was eventually successful at 17:00:53, but I think that the 10-second wakealarm had already fired at this point, so the system stayed suspended.
,
May 25 2016
2016-05-24T17:00:24.331603-07:00 NOTICE powerd_suspend[13630]: Going to suspend-to-RAM state: args=--suspend_duration=-1 --nosuspend_to_idle --wakeup_count=7 2016-05-24T17:00:24.362754-07:00 NOTICE powerd_suspend[13682]: Explicit sync 2016-05-24T17:00:24.542511-07:00 NOTICE powerd_suspend[13684]: Finalizing suspend 2016-05-24T17:00:24.558185-07:00 INFO kernel: [ 2586.833365] PM: Syncing filesystems ... done. 2016-05-24T17:00:24.558210-07:00 DEBUG kernel: [ 2586.846099] PM: Preparing system for sleep (mem) 2016-05-24T17:00:44.566391-07:00 WARNING kernel: [ 2586.846933] Freezing user space processes ... (elapsed 0.001 seconds) done.done. 2016-05-24T17:00:44.566415-07:00 WARNING kernel: [ 2586.848689] Freezing remaining freezable tasks ... 2016-05-24T17:00:44.566423-07:00 ERR kernel: [ 2593.112797] usb 1-5.3: device descriptor read/64, error -110 2016-05-24T17:00:44.566427-07:00 WARNING kernel: [ 2606.831575] 2016-05-24T17:00:44.566435-07:00 ERR kernel: [ 2606.831593] Freezing of tasks failed after 20.002 seconds (0 tasks refusing to freeze, wq_busy=1): 2016-05-24T17:00:44.566441-07:00 WARNING kernel: [ 2606.831722] 2016-05-24T17:00:44.566448-07:00 WARNING kernel: [ 2606.831731] Restarting kernel threads ... done. 2016-05-24T17:00:44.571356-07:00 WARNING kernel: [ 2606.833544] Restarting tasks ... done. 2016-05-24T17:00:44.571993-07:00 INFO sshd[13688]: SSH: Server;Ltype: Version;Remote: 100.108.96.5-56873;Protocol: 2.0;Client: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 2016-05-24T17:00:44.572790-07:00 INFO sshd[13689]: SSH: Server;Ltype: Version;Remote: 100.108.96.5-56918;Protocol: 2.0;Client: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 2016-05-24T17:00:44.597187-07:00 NOTICE powerd_suspend[13694]: Error writing to /sys/power/state: cat: write error: Device or resource busy
,
May 25 2016
Same "ERR kernel: ... usb ... device descriptor read/64, error -110" was pbserved with lumpy board while ago having same problem -issue 605137 - with chameleon setup, getting stuck, and finally we removed lumpy from tested matrix. Similar observation of chameleon USB interference was described in issue 604783, and fixed by replacing chameleon FPGA board. I believe related are issues 606573 and issue 610728 , and I will look for similar 'Err kernel' traces in devices logs.
,
May 25 2016
This test uses server/hosts/cros_host.py#suspend rather than client/cros/sys_power.py#do_suspend, which is more sophisticated. This problem could be prevented (making the test fail clearly instead, and giving it the option to retry if it implemented that) by changing CrosHost#suspend to also read and pass the current wakeup count.
,
May 26 2016
The test passed for build R53-8368.0.0 at http://cautotest/afe/#tab_id=view_job&object_id=64534751 - this is where chameleon-to DUT USB connection is unplugged. Then I connected the USB between chameleon and DUT and rebooted DUT - it rebooted well as expected(and not hanging on splash screen). Then I re-ran the test again, but one more before it - just as usually being run last several days 1) audio_AudioQualityAfterSuspend.internal_speaker.wav - http://cautotest/afe/#tab_id=view_job&object_id=64704712 2) audio_AudioAfterSuspend.internal_speaker - http://cautotest/afe/#tab_id=view_job&object_id=64705041 Both tests failed with 'Wrong input node(s) selected ['USB'] instead ['INTERNAL_MIC']!', which indicated chameleon keeps the USB alive when it should not, so it must be in bad state. Then I rebooted DUT and it hung on splash screen. Unplugging USB let the DUT proceed to login screen. Plugged the USB back, and shrawan@ ran the test(2) from chroot, and the DUT hanged on the second iteration, as it is described in this bug. Now we have both DUT and chameleon in the state of failure. The DUT is locked on cautotest, so no other tests will proceed. Jimmy and Tom, can you please look at chameleon host and see if anything looks abnormal? (This chameleon board was new and assebled specifically for this audio-box setup, and not beeing previously used.)
,
Jun 2 2016
,
Jun 7 2016
,
Oct 20 2016
Old issues. New issues will be filed if tests fail. |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by derat@chromium.org
, May 24 2016