peppy-chrome-pfq HostUpdateError "No answer to ssh"
Issue description

peppy-chrome-pfq failed: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936778983674647552

08/30 04:30:14.504 WARNI| test:0606| The test failed with the following exception

Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 800, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 464, in execute
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 371, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 191, in run_once
    host, url, use_quick_provision, with_cheets)
  File "/usr/local/autotest/server/afe_utils.py", line 112, in machine_install_and_update_labels
    image_name, host_attributes = updater.run_update()
  File "/usr/local/autotest/server/cros/autoupdater.py", line 994, in run_update
    self._prepare_host()
  File "/usr/local/autotest/server/cros/autoupdater.py", line 817, in _prepare_host
    HostUpdateError.DUT_DOWN)
HostUpdateError: Error on chromeos4-row6-rack12-host13 prior to update: No answer to ssh

jrbarnette@, could you help to triage? Is this related to issue 878403?
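For context, the error comes from the pre-update reachability check in autoupdater.py's _prepare_host(): if the DUT doesn't answer ssh, the update is aborted before it starts. A minimal sketch of that shape, for readers unfamiliar with the code (the function name, the wait_up() call, the timeout, and the message format below are simplifications, not the literal autoupdater.py code):

class HostUpdateError(Exception):
    """Raised when the DUT cannot be prepared for an update."""
    DUT_DOWN = 'No answer to ssh'

def prepare_host(host, timeout=300):
    """Abort early if the DUT does not answer ssh before an update."""
    if not host.wait_up(timeout=timeout):  # ssh reachability probe
        raise HostUpdateError(
            'Error on %s prior to update: %s'
            % (host.hostname, HostUpdateError.DUT_DOWN))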
Aug 30
Relevant history on the DUT:
2018-08-30 05:09:43 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839817-provision/
2018-08-30 04:35:06 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839644-repair/
2018-08-30 04:22:56 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839601-provision/
The "repair" task should have logs.
Aug 30
Let's broaden the logs to include a time when the DUT was still
working:
2018-08-30 05:09:43 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839817-provision/
2018-08-30 04:35:06 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839644-repair/
2018-08-30 04:22:56 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839601-provision/
2018-08-30 03:52:35 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/232240647-chromeos-test/
2018-08-30 03:47:26 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839428-provision/
The DUT was working at the 3:47 provisioning task. It was broken (offline)
at the 4:22 provisioning task. It was brought back by the 4:35 repair task.
Aug 30
Looking through the repair logs, it appears that the DUT's kernel was unresponsive. The repair attempt to reset the DUT via the keyboard controls went through (as can be seen in the EC log), but wasn't sufficient to reboot the board. The EC GPIO reset worked, though, and allowed the system to reboot fully, bringing ssh back up. Diving into the logs now to see if there are any indicators as to how the kernel got into a state where it wouldn't respond to the keyboard reset.
Aug 30
Checking the event logs, it appears there was a SUS power failure causing problems on the DUT:

182 | 2018-08-30 03:58:38 | System Reset
183 | 2018-08-30 03:59:47 | Kernel Event | Clean Shutdown
184 | 2018-08-30 03:59:48 | System boot | 2616
185 | 2018-08-30 03:59:48 | System Reset
186 | 2018-08-30 04:00:03 | System boot | 2618
187 | 2018-08-30 04:00:03 | SUS Power Fail
188 | 2018-08-30 04:00:03 | System Reset
189 | 2018-08-30 04:00:03 | ACPI Wake | S5
190 | 2018-08-30 04:00:03 | Wake Source | Power Button | 0
191 | 2018-08-30 04:00:04 | Chrome OS Recovery Mode | No bootable kernel found on disk | 0x5b
192 | 2018-08-30 04:49:01 | System boot | 2619
193 | 2018-08-30 04:49:01 | SUS Power Fail
194 | 2018-08-30 04:49:01 | System Reset
195 | 2018-08-30 04:49:01 | ACPI Wake | S5
196 | 2018-08-30 04:49:01 | Wake Source | Power Button | 0
Aug 30
After asking around, it sounds like the SUS power failure means there was asynchronous power loss. This could indicate an issue with the DUT, or an issue with firmware. jrbarnette@, are we seeing issues with other peppy DUTs running similar firmware and software, or has this only been seen in one peppy?
Aug 30
> After asking around, it sounds like the SUS power failure
> means there was asynchronous power loss.
Wait, look at the time stamps:
192 | 2018-08-30 04:49:01 | System boot | 2619
193 | 2018-08-30 04:49:01 | SUS Power Fail
194 | 2018-08-30 04:49:01 | System Reset
and then this:
2018-08-30 04:35:06 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839644-repair/
One of the things the repair task does is use servo to assert
`cold_reset`. The full list of actions performed in this
case can be found in "status.log" for the repair task. That shows
this:
START ---- repair.servoreset timestamp=1535629741 localtime=Aug 30 04:49:01
IOW, the "SUS Power Fail" happened exactly at the moment that we
hit the `cold_reset` signal. Which strikes me as "not surprising".
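(For reference, asserting `cold_reset` through servo amounts to toggling a single control. A minimal sketch, assuming a servo client object with a set() method as in autotest's Servo class; the hold time and the exact repair action are assumptions:

import time

def cold_reset_dut(servo, hold_secs=0.5):
    """Power-cycle the DUT by asserting and then releasing cold_reset."""
    servo.set('cold_reset', 'on')   # hold the DUT in reset
    time.sleep(hold_secs)           # assumed hold time
    servo.set('cold_reset', 'off')  # release; the DUT boots from cold

From the AP's point of view, that looks like an abrupt power loss, which is consistent with the SUS Power Fail entry.)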
The time period we care about is the time _before_ the DUT was
found offline at 4:22. That's these events:
182 | 2018-08-30 03:58:38 | System Reset
183 | 2018-08-30 03:59:47 | Kernel Event | Clean Shutdown
184 | 2018-08-30 03:59:48 | System boot | 2616
185 | 2018-08-30 03:59:48 | System Reset
186 | 2018-08-30 04:00:03 | System boot | 2618
187 | 2018-08-30 04:00:03 | SUS Power Fail
188 | 2018-08-30 04:00:03 | System Reset
189 | 2018-08-30 04:00:03 | ACPI Wake | S5
190 | 2018-08-30 04:00:03 | Wake Source | Power Button | 0
191 | 2018-08-30 04:00:04 | Chrome OS Recovery Mode | No bootable kernel found on disk | 0x5b
In particular, the entry into recovery mode is... troublesome.
Aug 30
Would the SUS power failure at 4:00 be due to another repair process? Or is it unrelated?
Aug 30
> Would the SUS power failure at 4:00 be due to another repair process? Or is it unrelated?
Oh. Missed that. The short answer is that I believe it happened
during testing for this job:
2018-08-30 03:52:35 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/232240647-chromeos-test/
We need to look at those logs to find out more about what happened there.
Aug 30
Got it. I'll take a look.
Aug 31
After reading through the autotest scripts for this, I'm reasonably confident that the SUS power failure at 4:00 was not induced by the tests performing any kind of cold reset. The autoupdate_Rollback test had successfully updated the DUT and set it up to roll back. The test ran a "reboot" OS command to initiate the reboot, and was executing wait_up() from abstract_ssh.py at the time of the SUS power failure (from 03:59:58 to 04:05:12). The wait_up() function does not appear to make any servo calls to reset the DUT; it just attempts to SSH into the system to verify it is up.

Do we collect UART logs from the CPU or EC during these tests? They would be helpful in debugging whether this is a firmware or hardware issue on the peppy board.
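For anyone else reading along, wait_up() is just an ssh polling loop, roughly of this shape (a simplified sketch; the timeout, poll interval, and is_up() helper are assumptions, and the real abstract_ssh.py version has more logging and timeout handling):

import time

def wait_up(host, timeout=300, poll_interval=5):
    """Poll until the host answers ssh, or give up after the timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if host.is_up():           # a single ssh connection attempt
            return True
        time.sleep(poll_interval)
    return False                   # never answered ssh; no servo involved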
Aug 31
> Do we collect UART logs from the CPU or EC during these tests?
> They would be helpful in debugging whether this is a firmware or
> hardware issue on the peppy board.

We gather EC logs from servo, but only in cases where servo is expected to be used/needed. Ordinarily, tests don't use servo, so I think we won't have EC logs from the time of the rollback test. I believe no kernel in the lab is configured to generate UART output, so the CPU logs are expected to be empty in all cases.

That said, there's one more item:
191 | 2018-08-30 04:00:04 | Chrome OS Recovery Mode | No bootable kernel found on disk | 0x5b

I think that message says that when autoupdate_Rollback rebooted, the firmware found a corrupted kernel. So, we might go scanning for complaints from the storage media.
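If someone wants to do that scan, something along these lines would do it (an illustrative sketch only; the log path and the error patterns are assumptions and may need adjusting for the DUT's kernel and storage type):

import re

STORAGE_ERRORS = re.compile(
    r'mmc\d+.*error|blk_update_request: I/O error|end_request: I/O error')

def scan_for_storage_complaints(log_path='/var/log/messages'):
    """Print kernel log lines that look like storage-media failures."""
    with open(log_path, errors='replace') as f:
        for line in f:
            if STORAGE_ERRORS.search(line):
                print(line.rstrip())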
Aug 31
Checking the DUT's history:
$ dut-status -f -d 72 chromeos4-row6-rack12-host13 | grep repair
2018-08-30 06:38:16 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1840447-repair/
2018-08-30 05:28:44 NO https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839914-repair/
2018-08-30 04:35:06 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1839644-repair/
2018-08-29 16:29:00 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1836230-repair/
2018-08-29 15:40:26 NO https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1835884-repair/
2018-08-29 15:05:32 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1835737-repair/
2018-08-29 13:49:27 NO https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1835166-repair/
2018-08-29 08:37:11 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row6-rack12-host13/1833525-repair/
It seems that the DUT has a slightly checkered past, so we can't rule
out simple storage hardware failure. We also can't rule out some more
exotic problem being triggered specifically by autoupdate_Rollback.
Aug 31
From the other repair logs, I'm not seeing any other instances of "No bootable kernel found on disk" besides this most recent one, and the SUS Power Fail timestamps seem to correspond to when the repair process needed to use the EC GPIO reset. Is this the only DUT experiencing problems in autoupdate_Rollback?
Sep 4
Reassigning to this week's infra deputy
Sep 8
This is a background debugging bug that hardly ever rises to the deputy's attention (see, I got to it at 4:58 PM PST on the Friday of my deputy week). I am bumping it off the deputy list, with questions:
[1] Do we want to remove this DUT from supply, blaming storage problems? If so, add this bug back to the deputy list.
[2] If we suspect this DUT is making the PFQ fail again, we should first remove it from supply (deputy's job), then debug (likely not the deputy's job).
Otherwise, debugging individual DUT problems is unlikely to get deputy attention. We need champions like jrbarnette@ for that :)
Sep 8
Ultimately, what's needed is some way to perform an objective test for "is the hardware working?" We don't have that. Absent that objective, automated test, and absent an engineer with time and know-how to debug it, I don't know what to do with the DUT other than put it back into service, and wait for a more egregious failure.
Dec 6
Saw the same error twice this week (on different DUTs), but perhaps not much can be done at this point? The links are just for future reference:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927915501108004592
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8928038683980118688
Comment 1 by jrbarnette@chromium.org, Aug 30
Status: Assigned (was: Untriaged)