chromeos4-row11-rack10-host13 repair looping
Issue description
On builds #2444, #2449, the failure info is:
01/25 21:13:00.039 ERROR| logging_manager:0626| tko parser: ---------------------------------
01/25 21:13:00.039 ERROR| logging_manager:0626| tko parser: parsing test provision_AutoUpdate provision
01/25 21:13:00.039 ERROR| logging_manager:0626| tko parser: ADD: FAIL
01/25 21:13:00.040 ERROR| logging_manager:0626| Subdir: provision_AutoUpdate
01/25 21:13:00.040 ERROR| logging_manager:0626| Testname: provision
01/25 21:13:00.040 ERROR| logging_manager:0626| Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row12-rack11-host21: 0) Could not copy /home/chromeos-test/images/cyan-chrome-pfq/R66-10342.0.0-rc1/stateful.tgz to device., 1) Could not copy /home/chromeos-test/images/cyan-chrome-pfq/R66-10342.0.0-rc1/update.gz to device.,
01/25 21:13:00.040 ERROR| logging_manager:0626| tko parser: parsing test ---- SERVER_JOB
01/25 21:13:26.883 ERROR| traceback:0013| Traceback (most recent call last):
01/25 21:13:26.883 ERROR| traceback:0013| File "/usr/local/autotest/server/autoserv", line 495, in run_autoserv
01/25 21:13:26.884 ERROR| traceback:0013| job.provision(job_labels)
01/25 21:13:26.884 ERROR| traceback:0013| File "/usr/local/autotest/server/server_job.py", line 591, in provision
01/25 21:13:26.884 ERROR| traceback:0013| self.run(control=control, job_labels=labels)
01/25 21:13:26.884 ERROR| traceback:0013| File "/usr/local/autotest/server/server_job.py", line 884, in run
01/25 21:13:26.884 ERROR| traceback:0013| self._execute_code(server_control_file, namespace)
01/25 21:13:26.884 ERROR| traceback:0013| File "/usr/local/autotest/server/server_job.py", line 1396, in _execute_code
01/25 21:13:26.884 ERROR| traceback:0013| execfile(code_file, namespace, namespace)
01/25 21:13:26.884 ERROR| traceback:0013| File "/usr/local/autotest/results/hosts/chromeos4-row12-rack11-host21/6416-provision/20182501204919/control.srv", line 111, in <module>
01/25 21:13:26.885 ERROR| traceback:0013| job.parallel_simple(provision_machine, machines)
01/25 21:13:26.885 ERROR| traceback:0013| File "/usr/local/autotest/server/server_job.py", line 688, in parallel_simple
01/25 21:13:26.885 ERROR| traceback:0013| log=log, timeout=timeout, return_results=return_results)
01/25 21:13:26.885 ERROR| traceback:0013| File "/usr/local/autotest/server/subcommand.py", line 98, in parallel_simple
01/25 21:13:26.885 ERROR| traceback:0013| function(arg)
01/25 21:13:26.885 ERROR| traceback:0013| File "/usr/local/autotest/results/hosts/chromeos4-row12-rack11-host21/6416-provision/20182501204919/control.srv", line 102, in provision_machine
01/25 21:13:26.885 ERROR| traceback:0013| raise Exception('')
01/25 21:13:26.885 ERROR| traceback:0013| Exception
01/25 21:13:27.447 ERROR| autoserv:0759| Uncaught SystemExit with code 1
On builds #2447, #2450, #2451, the failure info is:
01/26 04:10:04.887 ERROR| logging_manager:0626| tko parser: parsing test provision_AutoUpdate provision
01/26 04:10:04.887 ERROR| logging_manager:0626| tko parser: ADD: FAIL
01/26 04:10:04.887 ERROR| logging_manager:0626| Subdir: provision_AutoUpdate
01/26 04:10:04.887 ERROR| logging_manager:0626| Testname: provision
01/26 04:10:04.887 ERROR| logging_manager:0626| Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row11-rack10-host13: 0) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Timeout (30) waiting for remote devserver port_file',), 1) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Timeout (30) waiting for remote devserver port_file',),
01/26 04:10:04.887 ERROR| logging_manager:0626| tko parser: parsing test ---- SERVER_JOB
01/26 04:10:08.730 ERROR| utils:0282| [stderr] ls: cannot access '/var/tmp/messages.autotest_start': No such file or directory
01/26 04:10:28.326 ERROR| traceback:0013| Traceback (most recent call last):
01/26 04:10:28.326 ERROR| traceback:0013| File "/usr/local/autotest/server/autoserv", line 495, in run_autoserv
01/26 04:10:28.326 ERROR| traceback:0013| job.provision(job_labels)
01/26 04:10:28.326 ERROR| traceback:0013| File "/usr/local/autotest/server/server_job.py", line 591, in provision
01/26 04:10:28.327 ERROR| traceback:0013| self.run(control=control, job_labels=labels)
01/26 04:10:28.327 ERROR| traceback:0013| File "/usr/local/autotest/server/server_job.py", line 884, in run
01/26 04:10:28.327 ERROR| traceback:0013| self._execute_code(server_control_file, namespace)
01/26 04:10:28.327 ERROR| traceback:0013| File "/usr/local/autotest/server/server_job.py", line 1396, in _execute_code
01/26 04:10:28.327 ERROR| traceback:0013| execfile(code_file, namespace, namespace)
01/26 04:10:28.327 ERROR| traceback:0013| File "/usr/local/autotest/results/hosts/chromeos4-row11-rack10-host13/9485-provision/20182601040306/control.srv", line 111, in <module>
01/26 04:10:28.327 ERROR| traceback:0013| job.parallel_simple(provision_machine, machines)
01/26 04:10:28.327 ERROR| traceback:0013| File "/usr/local/autotest/server/server_job.py", line 688, in parallel_simple
01/26 04:10:28.328 ERROR| traceback:0013| log=log, timeout=timeout, return_results=return_results)
01/26 04:10:28.328 ERROR| traceback:0013| File "/usr/local/autotest/server/subcommand.py", line 98, in parallel_simple
01/26 04:10:28.328 ERROR| traceback:0013| function(arg)
01/26 04:10:28.328 ERROR| traceback:0013| File "/usr/local/autotest/results/hosts/chromeos4-row11-rack10-host13/9485-provision/20182601040306/control.srv", line 102, in provision_machine
01/26 04:10:28.328 ERROR| traceback:0013| raise Exception('')
01/26 04:10:28.328 ERROR| traceback:0013| Exception
01/26 04:10:28.880 ERROR| autoserv:0759| Uncaught SystemExit with code 1
Although #2445 succeeded, something is clearly wrong here, and it is blocking Chrome PFQ.
Assigned to infra deputy.
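For context on the second failure mode: "Timeout (30) waiting for remote devserver port_file" means the update flow polled for a file the remote devserver writes once it has picked a port, and gave up after 30 seconds. A minimal sketch of that kind of check (the function name and behavior are illustrative assumptions, not the actual chromite/autotest code):

```shell
# Illustrative sketch only: poll for a port file that a freshly started
# devserver is expected to write, failing after a timeout. This mirrors
# the "Timeout (N) waiting for remote devserver port_file" error above;
# it is not the actual chromite/autotest implementation.
wait_for_port_file() {
  local port_file="$1" timeout="$2" waited=0
  while [ ! -s "${port_file}" ]; do
    if [ "${waited}" -ge "${timeout}" ]; then
      echo "Timeout (${timeout}) waiting for remote devserver port_file" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  cat "${port_file}"
}
```

If the file never appears (e.g. the devserver process died during startup), the caller sees exactly this timeout, so the devserver's own log on the remote side is the next place to look.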
Jan 26 2018
cyan-paladin is passing hwtest. This suggests to me it's not an infra issue. Perhaps a chrome issue in the pfq preventing the rootfs update?
Jan 26 2018
If this is a Chrome issue, is it possible that it affects the cyan board only? The error message says: BackgroundFailure: <class 'chromite.lib.failures_lib.TestLabFailure'>: ** HWTest did not complete due to infrastructure issues (code 3) **
Jan 26 2018
My local run (test failed) shows these:
01/26 15:46:37.756 INFO | test_runner_utils:0199| autoserv| get_network_stats: at-start RXbytes 101874 TXbytes 34817
01/26 15:46:37.757 INFO | test_runner_utils:0199| autoserv| Not checking if job_repo_url contains autotest packages on ['100.96.58.50']
01/26 15:46:37.757 INFO | test_runner_utils:0199| autoserv| Processing control file
01/26 15:46:37.758 INFO | test_runner_utils:0199| autoserv| Exception escaped control file, job aborting:
01/26 15:46:37.758 INFO | test_runner_utils:0199| autoserv| Traceback (most recent call last):
01/26 15:46:37.759 INFO | test_runner_utils:0199| autoserv| File "/build/cyan/usr/local/build/autotest/server/server_job.py", line 884, in run
01/26 15:46:37.759 INFO | test_runner_utils:0199| autoserv| self._execute_code(server_control_file, namespace)
01/26 15:46:37.759 INFO | test_runner_utils:0199| autoserv| File "/build/cyan/usr/local/build/autotest/server/server_job.py", line 1396, in _execute_code
01/26 15:46:37.759 INFO | test_runner_utils:0199| autoserv| execfile(code_file, namespace, namespace)
01/26 15:46:37.760 INFO | test_runner_utils:0199| autoserv| File "/tmp/test_that_results_1HKHZt/results-1-provision_AutoUpdate/control.srv", line 52, in <module>
01/26 15:46:37.760 INFO | test_runner_utils:0199| autoserv| raise error.TestError("No provision value!")
01/26 15:46:37.760 INFO | test_runner_utils:0199| autoserv| TestError: No provision value!
01/26 15:46:37.760 INFO | test_runner_utils:0199| autoserv| INFO ---- ---- timestamp=1517003197 job_abort_reason=No provision value! localtime=Jan 26 15:46:37 No provision value!
01/26 15:46:37.761 INFO | test_runner_utils:0199| autoserv| tko parser: {'drone': 'warx0.mtv.corp.google.com', 'user': 'autotest_system', 'job_started': 1517003195, 'hostname': '100.96.58.50', 'status_version': 1, 'experimental': 'False',
'label': 'ad_hoc_build/ad_hoc_suite/provision_AutoUpdate'}
01/26 15:46:37.761 INFO | test_runner_utils:0199| autoserv| tko parser: MACHINE NAME: 100.96.58.50
01/26 15:46:37.761 INFO | test_runner_utils:0199| autoserv| tko parser: Unable to parse host keyval for 100.96.58.50
01/26 15:46:37.761 INFO | test_runner_utils:0199| autoserv| tko parser: MACHINE GROUP:
01/26 15:46:37.761 INFO | test_runner_utils:0199| autoserv| tko parser: parsing partial test ---- SERVER_JOB
01/26 15:46:37.762 INFO | test_runner_utils:0199| autoserv| tko parser: parsing test ---- SERVER_JOB
01/26 15:46:37.771 INFO | test_runner_utils:0199| autoserv| Master ssh connection to 100.96.58.50 is down.
Which sounds unrelated, though...
Jan 26 2018
I found two issues with similar symptoms: issue 795456 (chromeos4-row11-rack11-host7 provision/repair looping), where I saw "forcing repair" and the problem was finally fixed by replacing the unit, and issue 795396.
Jan 26 2018
#5 looks unrelated. Re #4: provision failures are often (incorrectly) labelled as "infrastructure" issues. Yes, it is possible that this is a Chrome issue that only fails on cyan. I don't see evidence of an infrastructure issue yet.
Jan 26 2018
Is it one DUT consistently failing?
Jan 26 2018
For #5, I am using "test_that --board=cyan DUT_IP provision_AutoUpdate"; not sure why it reports the "No provision value" error.
Jan 26 2018
OK, this could be a repair loop; see Issue 806312.
Jan 26 2018
Device locked.
Jan 26 2018
Yes, recent builds are consistently failing, see the reporter's context.
Jan 26 2018
Device locked. +jrbarnette you've assigned me quite a few of the "repair loop detected" bugs. What is the action you intend for the deputy to take on these, besides locking the DUT?
Jan 26 2018
We shouldn't necessarily lock the DUT at all. We have to debug
to see why the DUT fails provision, and fix the problem.
Roughly, the process should be this:
* Make sure there's a bug for the outage. Different DUTs
often will need to have different bugs, even if they failed
at the same time.
* Lock each DUT, with the lock reason being its bug.
* If the DUT isn't in the 'suites' pool, run balance_pools to
get the DUT into the suites pool.
* Debug to figure out why the DUT is failing.
* If we fix the problem, unlock the DUT.
* If debug determines that the DUT requires replacement or
manual repair, we file a ticket against englab-sys-cros@,
and then unlock/relock the DUT using the ticket as the
lock reason.
* Close the bug.
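The checklist above can be sketched as a dry-run script. Note that the flags for `atest` and `balance_pools` below are assumptions for illustration (the thread names the tools and steps, not exact invocations), and the bug number is a placeholder:

```shell
# Dry-run sketch of the triage checklist above. The atest/balance_pools
# flags are illustrative assumptions, not verified lab invocations; with
# DRY_RUN=1 this only prints what would be executed.
DRY_RUN=1
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

triage_dut() {
  local dut="$1" bug="$2"
  # Lock the DUT, with its bug as the lock reason.
  run atest host mod --lock --lock_reason "${bug}" "${dut}"
  # If the DUT isn't in the 'suites' pool, rebalance it in.
  run balance_pools --pool suites
  # ...debug here; once fixed, unlock:
  run atest host mod --unlock "${dut}"
}

triage_dut chromeos4-row11-rack10-host13 "crbug.com/<bug-number>"
```

Flipping DRY_RUN to 0 would execute the commands instead of echoing them; keeping the lock reason pointed at the bug is what makes the lock auditable later.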
Jan 26 2018
> We shouldn't necessarily lock the DUT at all. We have to debug
> to see why the DUT fails provision, and fix the problem.
Well, I should have fixed this text to read "We should lock the DUT
solely for the specific purpose of debugging the problem".
Also, in lieu of extended debug, it may be reasonable to do one of
two other things:
* Force the DUT to re-install from USB using the `repair_test` command.
* Force the DUT to re-install the repair image via AU by following
a procedure like this:
https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin/powerwash
Before doing either of those things, it's useful to look at the
repair log, since the log will tell you whether either of those
steps has already been tried.
Jan 26 2018
Curious: could we manually uprev Chrome in this case? This is the only failure in the last PFQ run.
Jan 26 2018
On YAQS https://yaqs.googleplex.com/eng/q/5592011371970560, it is suggested not to do a manual uprev on a Friday. CC'ing next week's gardeners to track this issue.
Jan 26 2018
> curious: could we manually uprev chrome in this case? This
> is the only failure in the last pfq run.
It depends entirely on the specifics of the failure. Some key
questions:
* What tests didn't run because of the failure? If we only
lost one test, there's less risk than if we lost 5 or 10
tests.
* Was the provision failure caused by Chrome? Chrome bugs can
definitely cause provision failures. Some of them even
have a specific message that clearly blames Chrome. So,
we should do at least some debug before any manual uprev.
> On YAQS https://yaqs.googleplex.com/eng/q/5592011371970560,
> it is suggested not doing manual uprev on Friday.
Yeah, I wrote the post-mortem that recommended that. I stand
by the recommendation. :-)
Feb 6 2018
Looking at chromeos4-row11-rack10-host13. Currently running from the USB stick.
Trying to update with chromeos_install --yes, it fails when installing the stateful partition:
+ copy_partition 1 /dev/sda /dev/mmcblk0
+ local part_num=1
+ local src=/dev/sda
+ local dst=/dev/mmcblk0
+ partsize /dev/sda 1
+ sudo /usr/bin/cgpt show -s -i 1 /dev/sda
+ local part_size=4194304
+ partoffset /dev/sda 1
+ sudo /usr/bin/cgpt show -b -i 1 /dev/sda
+ local src_offset=4415488
+ partoffset /dev/mmcblk0 1
+ sudo /usr/bin/cgpt show -b -i 1 /dev/mmcblk0
+ local dst_offset=8704000
+ echo Installing partition 1 to /dev/mmcblk0
Installing partition 1 to /dev/mmcblk0
+ install_stateful
+ echo Installing the stateful partition...
Installing the stateful partition...
+ loop_offset_setup /dev/mmcblk0 8704000
+ local filename=/dev/mmcblk0
+ local offset=8704000
+ losetup -f --show -o 4456448000 /dev/mmcblk0
+ LOOP_DEV=/dev/loop11
+ [ -z /dev/loop11 ]
+ LOOPS=/dev/loop11
+ mount_on_loop_dev readwrite
+ local rw_flag=readwrite
+ local mount_flags=
+ [ readwrite != readwrite ]
+ tracked_mount /dev/loop11 /tmp/install-mount-point
+ local last_arg
+ eval last_arg=$2
+ last_arg=/tmp/install-mount-point
+ MOUNTS=/tmp/install-mount-point
+ mount /dev/loop11 /tmp/install-mount-point
mount: mount /dev/loop11 on /tmp/install-mount-point failed: Structure needs cleaning
+ cleanup_on_failure
It looks like the stateful partition on the USB is corrupted. Running
fsck -yf ${LOOP_DEV} || /bin/true
gets it a little further, but it fails later:
+ IS_INSTALL=1 /tmp/install-mount-point/postinst /dev/mmcblk0p3
Segmentation fault (core dumped)
+ cleanup_on_failure
Still digging, but I suspect something is wrong with the recovery image itself, or with the media that hosts it.
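One sanity check on the install trace above: cgpt reports partition sizes and offsets in 512-byte sectors, and the byte offset passed to losetup is just the destination partition's sector offset converted to bytes, so those two numbers should agree:

```shell
# Reproduce the arithmetic from the chromeos_install trace above: cgpt
# reported the destination stateful partition starting at sector 8704000,
# and losetup was invoked with -o 4456448000 (a byte offset). Assuming
# 512-byte sectors, the second value follows from the first.
SECTOR_SIZE=512
dst_offset_sectors=8704000   # from: cgpt show -b -i 1 /dev/mmcblk0
echo $((dst_offset_sectors * SECTOR_SIZE))   # prints 4456448000, the -o value
```

Since the offsets line up, the loop device was set up over the right region of the eMMC; the "Structure needs cleaning" mount failure points at filesystem corruption, not a bad offset.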
Feb 23 2018
Comment 1 by warx@chromium.org, Jan 26 2018