
Issue 879225


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




peppy-chrome-pfq HostUpdateError "No answer to ssh"

Project Member Reported by xiy...@chromium.org, Aug 30

Issue description

peppy-chrome-pfq failed:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936778983674647552

08/30 04:30:14.504 WARNI|              test:0606| The test failed with the following exception
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 800, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 464, in execute
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 371, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 191, in run_once
    host, url, use_quick_provision, with_cheets)
  File "/usr/local/autotest/server/afe_utils.py", line 112, in machine_install_and_update_labels
    image_name, host_attributes = updater.run_update()
  File "/usr/local/autotest/server/cros/autoupdater.py", line 994, in run_update
    self._prepare_host()
  File "/usr/local/autotest/server/cros/autoupdater.py", line 817, in _prepare_host
    HostUpdateError.DUT_DOWN)
HostUpdateError: Error on chromeos4-row6-rack12-host13 prior to update: No answer to ssh

jrbarnette@, could you help triage this? Is this related to issue 878403?
 
Owner: dzigterman@chromium.org
Status: Assigned (was: Untriaged)
This is unrelated to bug 868403:  It's not on any of the still-affected
boards, and it's not a suite or test timeout.

This bug means that the DUT selected to run tests was down when
provisioning of the new build was about to start.  It will require
someone with Chrome OS expertise to debug why the DUT was down.
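For reference, a minimal sketch of the kind of pre-update reachability
check that produces this error (the real check lives in autoupdater.py's
_prepare_host(), per the traceback above; the host.wait_up() call here is
an assumption about the host API, not the actual code):

# Hedged sketch only; illustrative of the pre-update check, not the
# actual autoupdater.py implementation.

class HostUpdateError(Exception):
    """Raised when the DUT cannot be prepared for provisioning."""
    DUT_DOWN = 'No answer to ssh'


def prepare_host(host, boot_timeout=60):
    """Abort provisioning early if the DUT is not reachable over ssh."""
    # host.wait_up() is assumed to poll ssh and return True once the
    # DUT answers.
    if not host.wait_up(timeout=boot_timeout):
        raise HostUpdateError(
            'Error on %s prior to update: %s'
            % (host.hostname, HostUpdateError.DUT_DOWN))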

I can help rustle up logs relating to the DUT as necessary.

Passing to a sheriff for next steps.


Looking through the repair logs, it appears that the DUT's kernel was non-responsive.  The repair attempt to reset the DUT via the keyboard controls came through (as can be seen in the EC log), but wasn't sufficient to reboot the board.  The EC GPIO reset worked though, and allowed the system to reboot fully, bringing ssh back up.

Diving into the logs now to see if there are any indicators as to how the kernel got into a state where it wouldn't respond to the keyboard reset. 
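To make that repair flow concrete, here is a minimal sketch of the
escalation described above (try the keyboard/EC reset, verify over ssh,
then fall back to the EC GPIO cold reset).  The servo control names and
the servo.set()/host.wait_up() calls are assumptions for illustration,
not the actual autotest repair code:

# Hedged sketch of the escalating repair flow; control names and client
# APIs are illustrative assumptions.
import time


def reset_dut(servo, host, ssh_wait_secs=120):
    """Try a keyboard/EC-driven reset first, then fall back to cold_reset."""
    # Step 1: have the EC simulate a long power-button press.
    servo.set('power_key', 'long_press')
    if host.wait_up(timeout=ssh_wait_secs):
        return 'recovered via keyboard/EC reset'

    # Step 2: assert the EC GPIO reset line (cold_reset), which
    # power-cycles the AP; this is what finally brought the DUT back.
    servo.set('cold_reset', 'on')
    time.sleep(1)
    servo.set('cold_reset', 'off')
    if host.wait_up(timeout=ssh_wait_secs):
        return 'recovered via cold_reset'
    raise RuntimeError('DUT still unreachable after cold_reset')
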
Checking the event logs, it appears there was a SUS power failure causing problems on the DUT: 

182 | 2018-08-30 03:58:38 | System Reset
183 | 2018-08-30 03:59:47 | Kernel Event | Clean Shutdown
184 | 2018-08-30 03:59:48 | System boot | 2616
185 | 2018-08-30 03:59:48 | System Reset
186 | 2018-08-30 04:00:03 | System boot | 2618
187 | 2018-08-30 04:00:03 | SUS Power Fail
188 | 2018-08-30 04:00:03 | System Reset
189 | 2018-08-30 04:00:03 | ACPI Wake | S5
190 | 2018-08-30 04:00:03 | Wake Source | Power Button | 0
191 | 2018-08-30 04:00:04 | Chrome OS Recovery Mode | No bootable kernel found on disk | 0x5b
192 | 2018-08-30 04:49:01 | System boot | 2619
193 | 2018-08-30 04:49:01 | SUS Power Fail
194 | 2018-08-30 04:49:01 | System Reset
195 | 2018-08-30 04:49:01 | ACPI Wake | S5
196 | 2018-08-30 04:49:01 | Wake Source | Power Button | 0
Owner: jrbarnette@chromium.org
After asking around, it sounds like the SUS power failure means there was asynchronous power loss.  This could indicate an issue with the DUT, or an issue with firmware.  

jrbarnette@, are we seeing issues with other peppy DUTs running similar firmware and software, or has this only been seen in one peppy? 
> After asking around, it sounds like the SUS power failure
> means there was asynchronous power loss.

Wait, look at the time stamps:
192 | 2018-08-30 04:49:01 | System boot | 2619
193 | 2018-08-30 04:49:01 | SUS Power Fail
194 | 2018-08-30 04:49:01 | System Reset

and then this:
    2018-08-30 04:35:06  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839644-repair/

One of the things the repair task does is use servo to assert
`cold_reset`.  The full list of actions performed in this case can be
found in "status.log" for the repair task.  That shows this:
	START	----	repair.servoreset	timestamp=1535629741	localtime=Aug 30 04:49:01	

IOW, the "SUS Power Fail" happened exactly at the moment that we
hit the `cold_reset` signal, which strikes me as not surprising.

The time period we care about is the time _before_ the DUT was
found offline at 4:22.  That's these events:
182 | 2018-08-30 03:58:38 | System Reset
183 | 2018-08-30 03:59:47 | Kernel Event | Clean Shutdown
184 | 2018-08-30 03:59:48 | System boot | 2616
185 | 2018-08-30 03:59:48 | System Reset
186 | 2018-08-30 04:00:03 | System boot | 2618
187 | 2018-08-30 04:00:03 | SUS Power Fail
188 | 2018-08-30 04:00:03 | System Reset
189 | 2018-08-30 04:00:03 | ACPI Wake | S5
190 | 2018-08-30 04:00:03 | Wake Source | Power Button | 0
191 | 2018-08-30 04:00:04 | Chrome OS Recovery Mode | No bootable kernel found on disk | 0x5b

In particular, the entry into recovery mode is... troublesome.

Would the SUS power failure at 4:00 be due to another repair process?  Or is it unrelated? 
Owner: dzigterman@chromium.org
> Would the SUS power failure at 4:00 be due to another repair process?  Or is it unrelated? 

Oh.  Missed that.  The short answer is that I believe it happened
during testing for this job:
    2018-08-30 03:52:35  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/232240647-chromeos-test/

We need to look at those logs to find out more about what happened there.

Got it.  I'll take a look.
After reading through the autotest scripts for this, I'm reasonably confident that the SUS power failure at 4:00 was not induced by the tests performing any kind of cold reset.  

The autoupdate_Rollback test had successfully updated the DUT and set it up to roll back.  The test ran a "reboot" OS command to initiate the reboot, and was executing wait_up() from abstract_ssh.py at the time of the SUS power failure (from 03:59:58 to 04:05:12).  The wait_up() function does not appear to make any servo calls to reset the DUT; it simply attempts to SSH into the system to verify it is up.
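For context, wait_up() is essentially an ssh polling loop.  A minimal
sketch of that pattern (the helper names and ssh options here are
illustrative, not the actual abstract_ssh.py code):

# Hedged sketch of an ssh polling loop in the spirit of wait_up().
import subprocess
import time


def is_up_via_ssh(hostname, connect_timeout=10):
    """Return True if a trivial ssh command to the DUT succeeds."""
    cmd = ['ssh', '-o', 'ConnectTimeout=%d' % connect_timeout,
           '-o', 'BatchMode=yes', 'root@%s' % hostname, 'true']
    return subprocess.call(cmd) == 0


def wait_up(hostname, timeout=300, poll_interval=5):
    """Poll ssh until the DUT responds or the timeout expires.

    Note: this only observes the DUT over the network; it never touches
    servo or any reset line, which is the point made above.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if is_up_via_ssh(hostname):
            return True
        time.sleep(poll_interval)
    return False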

Do we collect UART logs from the CPU or EC during these tests?  They would be helpful in debugging whether this is a firmware or hardware issue on the peppy board.  
> Do we collect UART logs from the CPU or EC during these tests?
> They would be helpful in debugging whether this is a firmware or
> hardware issue on the peppy board.

We gather EC logs from servo, but only in cases where servo is
expected to be used/needed.  Ordinarily, tests don't use servo,
so I think we won't have EC logs from the time of the rollback
test.

I believe no kernel in the lab is configured to generate UART output,
so the CPU logs are expected to be empty in all cases.

That said, there's one more item:
191 | 2018-08-30 04:00:04 | Chrome OS Recovery Mode | No bootable kernel found on disk | 0x5b

I think that message says that when autoupdate_Rollback rebooted,
the firmware found a corrupted kernel.  So, we might go scanning for
complaints from the storage media.
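A quick way to do that scan is sketched below; the log path and the
error patterns are illustrative assumptions, not an exhaustive list:

# Hedged sketch: look through the DUT's kernel logs for storage-media
# complaints around the failed boot.  Path and patterns are assumptions.
import re

STORAGE_ERROR_PATTERNS = [
    r'mmc\d+: .* error',                 # eMMC command/transfer errors
    r'blk_update_request: I/O error',    # block-layer I/O errors
    r'EXT4-fs error',                    # filesystem-level complaints
]


def find_storage_complaints(log_path='/var/log/messages'):
    """Return log lines that look like storage-media errors."""
    hits = []
    with open(log_path) as log:
        for line in log:
            if any(re.search(p, line) for p in STORAGE_ERROR_PATTERNS):
                hits.append(line.rstrip())
    return hits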

From the other repair logs, I'm not seeing any other instances of "No bootable kernel found on disk" besides this most recent one, and the SUS power timestamps seem to correspond to when the repair process needed to use the EC GPIO reset. 

Is this the only DUT experiencing problems in autoupdate_Rollback?  
Owner: pprabhu@chromium.org
Reassigning to this week's infra deputy
Labels: Hotlist-Deputy
Labels: -Hotlist-Deputy
Owner: jrbarnette@chromium.org
This is a background debugging bug that hardly ever rises to the deputy's attention (see, I got to it at 4:58 PM PST on the Friday of my deputy week).

I am bumping it off the deputy list, with questions:

[1] Do we want to remove this DUT from supply, blaming storage problems? If so, add it back to the deputy list.
[2] If we suspect this DUT of making the PFQ fail again, we should first remove it from supply (deputy's job) and then debug it (likely not the deputy's job).

Otherwise, debugging individual DUT problems is unlikely to get deputy attention. We need champions like jrbarnette@ for that :)
Ultimately, what's needed is some way to perform an objective test
for "is the hardware working?"  We don't have that.  Absent that
objective, automated test, and absent an engineer with time and
know-how to debug it, I don't know what to do with the DUT other
than put it back into service, and wait for a more egregious failure.

Cc: stagenut@chromium.org
Saw the same error twice this week (on different DUTs), but perhaps not much can be done at this point? The links are just for future reference.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927915501108004592

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8928038683980118688
