New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 782329 link

Starred by 4 users

Issue metadata

Status: WontFix
Owner: ----
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

[Stumpy] Device does not resume after suspended

Project Member Reported by pgangishetty@chromium.org, Nov 7 2017

Issue description

Chrome Version: 63.0.3239.39
OS: 10032.30.0

What steps will reproduce the problem?
(1) Device recovered with above version (ASUS 1080 HD monitor connected with DP-DP & DVI-VGA cable)
(2) Sign in to Device
(3) Now suspend the device using 'powerd_dbus_suspend' or 'suspend_stress_test'  
(4) Now try to resume the device (by any keyboard press, mouse, power button)

What is the expected result?
Device should resume and external monitor(s) should display

What happens instead?
Device does not resume.


Please use labels and text to provide additional information.
Logs attached. 

For graphics-related bugs, please copy/paste the contents of the about:gpu
page at the end of this report.

 

Comment 2 by ka...@chromium.org, Nov 7 2017

Cc: derat@chromium.org dbehr@chromium.org dbasehore@chromium.org gkihumba@chromium.org
Are other Sandybridge devices affected?

Comment 4 by derat@chromium.org, Nov 7 2017

Cc: snanda@chromium.org tbroch@chromium.org
I see these three suspend attempts in powerd's logs. The logs all end immediately after we tried to suspend:

powerd.20171107-100936:
...
[1107/101009:INFO:daemon.cc(1212)] Got RequestSuspend message from :1.42
[1107/101009:INFO:suspender.cc(399)] Starting request 38469633
[1107/101009:INFO:daemon.cc(596)] Reading wakeup count from /sys/power/wakeup_count
[1107/101009:INFO:daemon.cc(600)] Read wakeup count 0
[1107/101009:INFO:display_power_setter.cc(81)] Asking DisplayService to turn all displays off
[1107/101010:INFO:suspend_delay_controller.cc(137)] Announcing suspend request 38469633 with 2 pending delay(s) and 0 outstanding delay(s) from previous request
[1107/101010:INFO:audio_client.cc(125)] Updated audio devices: headphones unplugged, HDMI inactive
[1107/101010:INFO:daemon.cc(1412)] Chrome is using normal display mode
[1107/101010:INFO:suspend_delay_controller.cc(86)] Got notification that delay 38469633 (shill) is ready for suspend request 38469633 from :1.10
[1107/101010:INFO:suspend_delay_controller.cc(86)] Got notification that delay 38469634 (chrome) is ready for suspend request 38469633 from :1.11
[1107/101010:INFO:suspend_delay_controller.cc(223)] Notifying observers that suspend is ready
[1107/101010:INFO:suspender.cc(470)] Starting suspend
[1107/101010:INFO:main.cc(237)] Running "/usr/bin/powerd_setuid_helper --action=suspend  --suspend_wakeup_count_valid --suspend_wakeup_count=0"
EOF

powerd.20171107-103037:
...
[1107/103037:INFO:daemon.cc(1212)] Got RequestSuspend message from :1.45
[1107/103037:INFO:suspender.cc(399)] Starting request 223870977
[1107/103037:INFO:daemon.cc(596)] Reading wakeup count from /sys/power/wakeup_count
[1107/103037:INFO:daemon.cc(600)] Read wakeup count 0
[1107/103037:INFO:display_power_setter.cc(81)] Asking DisplayService to turn all displays off
[1107/103037:INFO:suspend_delay_controller.cc(137)] Announcing suspend request 223870977 with 2 pending delay(s) and 0 outstanding delay(s) from previous request
[1107/103037:INFO:suspend_delay_controller.cc(86)] Got notification that delay 223870978 (shill) is ready for suspend request 223870977 from :1.10
[1107/103037:INFO:suspend_delay_controller.cc(86)] Got notification that delay 223870977 (chrome) is ready for suspend request 223870977 from :1.11
[1107/103037:INFO:suspend_delay_controller.cc(223)] Notifying observers that suspend is ready
[1107/103037:INFO:suspender.cc(470)] Starting suspend
[1107/103037:INFO:main.cc(237)] Running "/usr/bin/powerd_setuid_helper --action=suspend  --suspend_wakeup_count_valid --suspend_wakeup_count=0"
EOF

powerd.20171107-180355:
...
[1107/180632:INFO:daemon.cc(1212)] Got RequestSuspend message from :1.45
[1107/180632:INFO:suspender.cc(399)] Starting request 42205185
[1107/180632:INFO:daemon.cc(596)] Reading wakeup count from /sys/power/wakeup_count
[1107/180632:INFO:daemon.cc(600)] Read wakeup count 0
[1107/180632:INFO:display_power_setter.cc(81)] Asking DisplayService to turn all displays off
[1107/180633:INFO:suspend_delay_controller.cc(137)] Announcing suspend request 42205185 with 2 pending delay(s) and 0 outstanding delay(s) from previous request
[1107/180633:INFO:audio_client.cc(125)] Updated audio devices: headphones unplugged, HDMI inactive
[1107/180633:INFO:daemon.cc(1412)] Chrome is using normal display mode
[1107/180633:INFO:suspend_delay_controller.cc(86)] Got notification that delay 42205185 (shill) is ready for suspend request 42205185 from :1.10
[1107/180633:INFO:suspend_delay_controller.cc(86)] Got notification that delay 42205186 (chrome) is ready for suspend request 42205185 from :1.12
[1107/180633:INFO:suspend_delay_controller.cc(223)] Notifying observers that suspend is ready
[1107/180633:INFO:suspender.cc(470)] Starting suspend
[1107/180633:INFO:main.cc(237)] Running "/usr/bin/powerd_setuid_helper --action=suspend  --suspend_wakeup_count_valid --suspend_wakeup_count=0"
EOF

I don't know what's going on with the system clock. When I look at the messages file, I can see that the 180355 attempt actually came before the other "earlier" ones.

It looks like the system dies right after powerd_suspend logs "Finalizing suspend", and before the kernel has logged anything:

...
2017-11-07T18:05:33.299738+00:00 INFO session_manager[683]: [INFO:policy_store.cc(69)] Persisted policy to disk.
2017-11-07T18:05:33.340363+00:00 INFO session_manager[683]: [INFO:child_exit_handler.cc(77)] Handling 3467 exit.
2017-11-07T18:05:33.372653+00:00 INFO session_manager[683]: [INFO:policy_store.cc(69)] Persisted policy to disk.
2017-11-07T18:05:33.422312+00:00 INFO session_manager[683]: [INFO:child_exit_handler.cc(77)] Handling 3551 exit.
2017-11-07T18:05:43.627531+00:00 WARNING crash_reporter[3706]: Received crash notification for chrome[732] user 1000 (called directly)
2017-11-07T18:05:43.683209+00:00 ERR session_manager[3732]: [ERROR:process.cc(329)] Exec of /usr/bin/vm_launcher failed:: No such file or directory
2017-11-07T18:06:24.206734+00:00 ERR cras_server[1138]: Unable to find the best channel map
2017-11-07T18:06:33.038918+00:00 ERR cras_server[1138]: pcm_avail returned frames larger than buf_size: HDA Intel PCH: HDMI 2:0,8: 16416 > 16384 for 1 times
2017-11-07T18:06:33.101777+00:00 ERR cras_server[1138]: Unable to find the best channel map
2017-11-07T18:06:33.175546+00:00 NOTICE powerd_suspend[4015]: Going to suspend-to-RAM state: args=--suspend_duration=-1 --nosuspend_to_idle --wakeup_count=0
2017-11-07T18:06:33.246612+00:00 NOTICE powerd_suspend[4069]: Finalizing suspend
2017-11-07T10:09:36.254439-08:00 INFO kernel: [    0.000000] Initializing cgroup subsys cpu
2017-11-07T10:09:36.254481-08:00 NOTICE kernel: [    0.000000] Linux version 3.8.11 (chrome-bot@cros-beefy173-c2) (gcc version 4.9.x 20150123 (prerelease) (4.9.2_cos_gg_4.9.2-r166-0c5a656a1322e
137fa4a251f2ccc6c4022918c0a_4.9.2-r166) ) #1 SMP Mon Nov 6 22:08:11 PST 2017
2017-11-07T10:09:36.254491-08:00 INFO kernel: [    0.000000] Command line: cros_secure  console= loglevel=7 init=/sbin/init cros_secure oops=panic panic=-1 root=/dev/dm-0 rootwait ro dm_verity.
error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 dm="1 vroot none ro 1,0 2539520 verity payload=PARTUUID=51aeba08-f134-0144-bbfa-69972a95d601/PARTNROFF=1 hashtree=PARTUUID=51aeba08-f
134-0144-bbfa-69972a95d601/PARTNROFF=1 hashstart=2539520 alg=sha1 root_hexdigest=62753a6c7c05e6fb9dcc8f73e1350d95aa36f855 salt=bfa45950d45b8d4d92e94e9896f9333cbd69fa6362a4c27254f3c0ffa01d4cd8" 
noinitrd vt.global_cursor_default=0 kern_guid=51aeba08-f134-0144-bbfa-69972a95d601 add_efi_memmap boot=local noresume noswap i915.modeset=1 tpm_tis.force=1 tpm_tis.interrupts=0 nmi_watchdog=pan
ic,lapic iTCO_vendor_support.vendorsupport=3
...

Here's the relevant part of powerd_suspend:

  log_msg "Finalizing suspend"
  # Make sure that hwclock is cached for when we wake up, this way we don't
  # wait for io and get a more accurate time for wakeup
  # -V prints a little version information and exits, loading .so but not
  # making actual RTC accesses. Do *NOT* change this to a normal execution...
  # it will trigger an RTC interrupt that may count as a wakeup reason, abort
  # your suspend and cause you hours of pain and confusion!
  if [ -f "${TIMESTAMP_FILE}" ]; then
    /sbin/hwclock -V > /dev/null
  else
    # We don't want to overwrite the alarm for whatever test might be
    # running.
    setup_dark_resume
  fi

  # Echo freeze to /sys/power/state on platforms that support suspend to idle.
  action='mem'
  if [ "${FLAGS_suspend_to_idle}" -eq 0 ]; then
    action='freeze'
  fi

  # Suspend to RAM. This is piped through cat since we want to determine
  # the causes of failures -- dash's "echo" command appears to not print
  # anything in response to write errors.
  error=$( (printf '%s' "${action}" | cat > /sys/power/state) 2>&1)
  ...
Owner: derat@chromium.org
Status: Assigned (was: Untriaged)

Comment 6 by derat@chromium.org, Nov 7 2017

Owner: ----
Status: Available (was: Assigned)
I don't know anything about this beyond what I put in #4. Someone on the kernel side should take a look.
Owner: diand...@chromium.org
Status: Assigned (was: Available)
Re: to comment #3 - Not able to reproduce on butterfly with same version.
Cc: diand...@chromium.org
Owner: tbroch@chromium.org
Easily reproducible.  Occurs 100% of the time on my stumpy. 

Device appears stuck in suspend or at least early resume as power button led is still flashing

Only recovery I've found is hard reset (no USB keyboard, mouse or RTC wake).  Nothing interesting in logs except that unclean shutdown occurred on previous suspend.

/sys/power/pm_test doesn't expose a failure.

Despite having 29 DUTS in the lab it appears the power_Resume test has only run once in 60 days

https://wmatrix.googleplex.com/unfiltered?suites=power_build&tests=power_Resume&days_back=60&platforms=stumpy

and passed on R64 10041.0.0

So no easy way to identify when regression was introduced unfortunately.


At least for my stumpy I can recreate this failure on 

R64-10041.0.0
R64-10035.0.0
R63-10032.30.0
R62-9901.66.0

So its not isolated to beta,

The singular lab pass,

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/149599519-chromeos-test/chromeos4-row2-rack8-host19

Looks legit,

seconds_system_resume{perf}=1.73400402069


But I can't repro it pass locally.

Host is,

chromeos4-row2-rack8-host19                Verifying      chromeos-server12.cbf.corp.google.com   False                                                                                                     None           stumpy                 board:stumpy, pool:suites, bluetooth, hw_video_acc_h264, hw_video_acc_enc_h264, storage:ssd, power:AC_only, audio_loopback_dongle, os:cros, variant:stumpy, sku:stumpy_intel_i5_2450m_2500mhz_4Gb, cts_abi_x86, cts_abi_arm, model:stumpy, cros-version:stumpy-release/R64-10044.0.0


Labels: M-64
Note this may also be reproducible on other devices w/ similar SoC running 3.8 kernel.  At least one report on link (3.8 ivybridge)
Hi, this is tagged as a beta blocker for M64.  We're not that far out and this bug hasn't been updated in a while.  Ping?  Thanks...

Still able to reproduce this issue on M64 build 10176.5.0
Cc: semenzato@chromium.org
M64 beta is targeted for next Tuesday (12-Dec); anything we can do to close this out and/or retag as stable block is appreciated.   I'll ping the owner as well.  Thanks!

Labels: -Pri-1 -ReleaseBlock-Beta -M-63 ReleaseBlock-Stable Pri-2
Broken in all of these releases as well,

R59-9460.73.0
R58-9334.72.0
R57-9202.64.0
R56-9000.56.0

Also note while repro on test image appears to be 100% normal users are succeeding at resume (Power.KernelResumeTimeOnAC UMA shows values) sometimes.

Given minimal amount of user feedback regarding this I suspect there may be something specific to test image that has soured this.  Going to punt ReleaseBlock down the road to stable & have a look at repro on normal mode.
This may be a hardware failure and same issue affecting  crbug.com/792562#c45  as I now have a 2nd stumpy device that suspend/resumes succesfully w/ exact same SW that fails on previous unit.


Labels: -ReleaseBlock-Stable
Removing as a M64 blocker since clearly not a regression....
Labels: ReleaseBlock-Beta
This was a beta blocker in 63 but never got fixed. We should get the fix in 64 as it's a pretty serious issue
Labels: -Pri-2 Pri-1
Components: OS>Hardware
Owner: ----
Status: Available (was: Assigned)
Given,
1) there's evidence that this is a hardware failure on specific units only w/in google (c#19)
2) no field reports of similar failure
3) current repro occurs  running test images as far back as R56 

I'm not sure the I agree w/ priority bump or re-adding the releaseBlock-Beta.

In any case I can't dedicate any more time at this point.  Marking available should others have more cycles to continue the effort and happy to offer up the two units (1 pass, 1 fail) if needed.


Labels: -ReleaseBlock-Beta ReleaseBlock-Stable
Changing to release block stable.   Blocking != priority; it should only be used if we can't release beta.  We'll take another look this week.
Labels: -ReleaseBlock-Stable
Punting RBS based on same assessment of  crbug.com/792562#c48  (HW related) and in fact bugs may be result of same HW failure as unit(s) that fail have both failures.
Status: WontFix (was: Available)
stumpy is EOL now, and it seems like the issues here are hardware-specific rather than common OS software

Sign in to add a comment