Issue 806264

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




chromeos4-row11-rack10-host13 repair looping

Project Member Reported by warx@chromium.org, Jan 26 2018

Issue description

On builds #2444, #2449, the failure info is:
01/25 21:13:00.039 ERROR|   logging_manager:0626| tko parser: ---------------------------------
01/25 21:13:00.039 ERROR|   logging_manager:0626| tko parser: parsing test provision_AutoUpdate provision
01/25 21:13:00.039 ERROR|   logging_manager:0626| tko parser: ADD: FAIL
01/25 21:13:00.040 ERROR|   logging_manager:0626| Subdir: provision_AutoUpdate
01/25 21:13:00.040 ERROR|   logging_manager:0626| Testname: provision
01/25 21:13:00.040 ERROR|   logging_manager:0626| Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row12-rack11-host21: 0) Could not copy /home/chromeos-test/images/cyan-chrome-pfq/R66-10342.0.0-rc1/stateful.tgz to device., 1) Could not copy /home/chromeos-test/images/cyan-chrome-pfq/R66-10342.0.0-rc1/update.gz to device., 
01/25 21:13:00.040 ERROR|   logging_manager:0626| tko parser: parsing test ---- SERVER_JOB
01/25 21:13:26.883 ERROR|         traceback:0013| Traceback (most recent call last):
01/25 21:13:26.883 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 495, in run_autoserv
01/25 21:13:26.884 ERROR|         traceback:0013|     job.provision(job_labels)
01/25 21:13:26.884 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 591, in provision
01/25 21:13:26.884 ERROR|         traceback:0013|     self.run(control=control, job_labels=labels)
01/25 21:13:26.884 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 884, in run
01/25 21:13:26.884 ERROR|         traceback:0013|     self._execute_code(server_control_file, namespace)
01/25 21:13:26.884 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 1396, in _execute_code
01/25 21:13:26.884 ERROR|         traceback:0013|     execfile(code_file, namespace, namespace)
01/25 21:13:26.884 ERROR|         traceback:0013|   File "/usr/local/autotest/results/hosts/chromeos4-row12-rack11-host21/6416-provision/20182501204919/control.srv", line 111, in <module>
01/25 21:13:26.885 ERROR|         traceback:0013|     job.parallel_simple(provision_machine, machines)
01/25 21:13:26.885 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 688, in parallel_simple
01/25 21:13:26.885 ERROR|         traceback:0013|     log=log, timeout=timeout, return_results=return_results)
01/25 21:13:26.885 ERROR|         traceback:0013|   File "/usr/local/autotest/server/subcommand.py", line 98, in parallel_simple
01/25 21:13:26.885 ERROR|         traceback:0013|     function(arg)
01/25 21:13:26.885 ERROR|         traceback:0013|   File "/usr/local/autotest/results/hosts/chromeos4-row12-rack11-host21/6416-provision/20182501204919/control.srv", line 102, in provision_machine
01/25 21:13:26.885 ERROR|         traceback:0013|     raise Exception('')
01/25 21:13:26.885 ERROR|         traceback:0013| Exception
01/25 21:13:27.447 ERROR|          autoserv:0759| Uncaught SystemExit with code 1

On builds #2447, #2450, #2451, the failure info is:
01/26 04:10:04.887 ERROR|   logging_manager:0626| tko parser: parsing test provision_AutoUpdate provision
01/26 04:10:04.887 ERROR|   logging_manager:0626| tko parser: ADD: FAIL
01/26 04:10:04.887 ERROR|   logging_manager:0626| Subdir: provision_AutoUpdate
01/26 04:10:04.887 ERROR|   logging_manager:0626| Testname: provision
01/26 04:10:04.887 ERROR|   logging_manager:0626| Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row11-rack10-host13: 0) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Timeout (30) waiting for remote devserver port_file',), 1) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Timeout (30) waiting for remote devserver port_file',), 
01/26 04:10:04.887 ERROR|   logging_manager:0626| tko parser: parsing test ---- SERVER_JOB
01/26 04:10:08.730 ERROR|             utils:0282| [stderr] ls: cannot access '/var/tmp/messages.autotest_start': No such file or directory
01/26 04:10:28.326 ERROR|         traceback:0013| Traceback (most recent call last):
01/26 04:10:28.326 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 495, in run_autoserv
01/26 04:10:28.326 ERROR|         traceback:0013|     job.provision(job_labels)
01/26 04:10:28.326 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 591, in provision
01/26 04:10:28.327 ERROR|         traceback:0013|     self.run(control=control, job_labels=labels)
01/26 04:10:28.327 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 884, in run
01/26 04:10:28.327 ERROR|         traceback:0013|     self._execute_code(server_control_file, namespace)
01/26 04:10:28.327 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 1396, in _execute_code
01/26 04:10:28.327 ERROR|         traceback:0013|     execfile(code_file, namespace, namespace)
01/26 04:10:28.327 ERROR|         traceback:0013|   File "/usr/local/autotest/results/hosts/chromeos4-row11-rack10-host13/9485-provision/20182601040306/control.srv", line 111, in <module>
01/26 04:10:28.327 ERROR|         traceback:0013|     job.parallel_simple(provision_machine, machines)
01/26 04:10:28.327 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 688, in parallel_simple
01/26 04:10:28.328 ERROR|         traceback:0013|     log=log, timeout=timeout, return_results=return_results)
01/26 04:10:28.328 ERROR|         traceback:0013|   File "/usr/local/autotest/server/subcommand.py", line 98, in parallel_simple
01/26 04:10:28.328 ERROR|         traceback:0013|     function(arg)
01/26 04:10:28.328 ERROR|         traceback:0013|   File "/usr/local/autotest/results/hosts/chromeos4-row11-rack10-host13/9485-provision/20182601040306/control.srv", line 102, in provision_machine
01/26 04:10:28.328 ERROR|         traceback:0013|     raise Exception('')
01/26 04:10:28.328 ERROR|         traceback:0013| Exception
01/26 04:10:28.880 ERROR|          autoserv:0759| Uncaught SystemExit with code 1

Although #2445 was successful, there appears to be an underlying problem here, and it is blocking the Chrome PFQ.

Assigned to infra deputy.
 

Comment 1 by warx@chromium.org, Jan 26 2018

Cc: malaykeshav@chromium.org warx@chromium.org

Comment 2 by khmel@chromium.org, Jan 26 2018

Cc: khmel@chromium.org
Cc: akes...@chromium.org
Owner: warx@chromium.org
cyan-paladin is passing hwtest. This suggests to me it's not an infra issue. Perhaps a Chrome issue in the PFQ is preventing the rootfs update?


Comment 4 by warx@chromium.org, Jan 26 2018

If it's a Chrome issue, is it possible that it's cyan-board only?

The error message said: BackgroundFailure: <class 'chromite.lib.failures_lib.TestLabFailure'>: ** HWTest did not complete due to infrastructure issues (code 3) **

Comment 5 by warx@chromium.org, Jan 26 2018

Cc: dshi@chromium.org davidri...@chromium.org
My local run (the test failed) shows this:
01/26 15:46:37.756 INFO | test_runner_utils:0199| autoserv| get_network_stats: at-start RXbytes 101874 TXbytes 34817
01/26 15:46:37.757 INFO | test_runner_utils:0199| autoserv| Not checking if job_repo_url contains autotest packages on ['100.96.58.50']
01/26 15:46:37.757 INFO | test_runner_utils:0199| autoserv| Processing control file
01/26 15:46:37.758 INFO | test_runner_utils:0199| autoserv| Exception escaped control file, job aborting:
01/26 15:46:37.758 INFO | test_runner_utils:0199| autoserv| Traceback (most recent call last):
01/26 15:46:37.759 INFO | test_runner_utils:0199| autoserv| File "/build/cyan/usr/local/build/autotest/server/server_job.py", line 884, in run
01/26 15:46:37.759 INFO | test_runner_utils:0199| autoserv| self._execute_code(server_control_file, namespace)
01/26 15:46:37.759 INFO | test_runner_utils:0199| autoserv| File "/build/cyan/usr/local/build/autotest/server/server_job.py", line 1396, in _execute_code
01/26 15:46:37.759 INFO | test_runner_utils:0199| autoserv| execfile(code_file, namespace, namespace)
01/26 15:46:37.760 INFO | test_runner_utils:0199| autoserv| File "/tmp/test_that_results_1HKHZt/results-1-provision_AutoUpdate/control.srv", line 52, in <module>
01/26 15:46:37.760 INFO | test_runner_utils:0199| autoserv| raise error.TestError("No provision value!")
01/26 15:46:37.760 INFO | test_runner_utils:0199| autoserv| TestError: No provision value!
01/26 15:46:37.760 INFO | test_runner_utils:0199| autoserv| INFO        ----    ----    timestamp=1517003197    job_abort_reason=No provision value!    localtime=Jan 26 15:46:37       No provision value!
01/26 15:46:37.761 INFO | test_runner_utils:0199| autoserv| tko parser: {'drone': 'warx0.mtv.corp.google.com', 'user': 'autotest_system', 'job_started': 1517003195, 'hostname': '100.96.58.50', 'status_version': 1, 'experimental': 'False',
'label': 'ad_hoc_build/ad_hoc_suite/provision_AutoUpdate'}
01/26 15:46:37.761 INFO | test_runner_utils:0199| autoserv| tko parser: MACHINE NAME: 100.96.58.50
01/26 15:46:37.761 INFO | test_runner_utils:0199| autoserv| tko parser: Unable to parse host keyval for 100.96.58.50
01/26 15:46:37.761 INFO | test_runner_utils:0199| autoserv| tko parser: MACHINE GROUP:
01/26 15:46:37.761 INFO | test_runner_utils:0199| autoserv| tko parser: parsing partial test ---- SERVER_JOB
01/26 15:46:37.762 INFO | test_runner_utils:0199| autoserv| tko parser: parsing test ---- SERVER_JOB
01/26 15:46:37.771 INFO | test_runner_utils:0199| autoserv| Master ssh connection to 100.96.58.50 is down.

Which sounds unrelated though...

Comment 6 by warx@chromium.org, Jan 26 2018

Cc: shapiroc@chromium.org bmgordon@chromium.org
I found two issues with similar symptoms:
 issue 795456 : chromeos4-row11-rack11-host7 provision/repair looping. In that issue, I saw "forcing repair", and it was finally fixed by replacing the unit.
and  issue 795396 .



#5 looks unrelated.

#4: provision failures are often (incorrectly) labelled as "infrastructure" issues. Yes, it is possible that this is a Chrome issue that only fails on cyan.

I don't see evidence of an infrastructure issue yet.
Is it one DUT consistently failing?

Comment 9 by warx@chromium.org, Jan 26 2018

For #5, I am using "test_that --board=cyan DUT_IP provision_AutoUpdate"; not sure why it reports the "No provision value" error.
OK, this could be a repair loop; see  Issue 806312 .
Summary: chromeos4-row11-rack10-host13 repair looping (was: cyan-chrome-pfq provision: FAIL: Unhandled DevServerException)
Device locked.
Labels: -Pri-1 Pri-2
Cc: jrbarnette@chromium.org jkop@chromium.org
 Issue 806287  has been merged into this issue.

Comment 15 by warx@chromium.org, Jan 26 2018

Yes, recent builds are consistently failing; see the reporter's description above.
Owner: jrbarnette@chromium.org
Device locked. +jrbarnette you've assigned me quite a few of the "repair loop detected" bugs. What is the action you intend for the deputy to take on these, besides locking the DUT?
Owner: akes...@chromium.org
We shouldn't necessarily lock the DUT at all.  We have to debug
to see why the DUT fails provision, and fix the problem.

Roughly, the process should be this (a command sketch for the locking step follows the list):
  * Make sure there's a bug for the outage.  Different DUTs
    often will need to have different bugs, even if they failed
    at the same time.
  * Lock each DUT, with the lock reason being its bug.
  * If the DUT isn't in the 'suites' pool, run balance_pools to
    get the DUT into the suites pool.
  * Debug to figure out why the DUT is failing.
  * If we fix the problem, unlock the DUT.
  * If debug determines that the DUT requires replacement or
    manual repair, we file a ticket against englab-sys-cros@,
    and then unlock/relock the DUT using the ticket as the
    lock reason.
  * Close the bug.
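
For the locking step, here is a minimal command sketch assuming the standard
autotest atest CLI is available on the drone/server. It is only a sketch: the
--lock_reason flag is an assumption that may not exist in older autotest
checkouts, and the hostname/bug shown are just this issue's examples.

# Lock the looping DUT with its tracking bug as the lock reason.
atest host mod --lock \
    --lock_reason 'crbug.com/806264 - repair looping, debugging' \
    chromeos4-row11-rack10-host13
# After the fix (or after filing the englab-sys-cros ticket), unlock:
atest host mod --unlock chromeos4-row11-rack10-host13
# Moving a DUT into the suites pool uses site_utils/balance_pools.py;
# its arguments depend on board and pool, so they are omitted here.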

> We shouldn't necessarily lock the DUT at all.  We have to debug
> to see why the DUT fails provision, and fix the problem.

Well, I should have fixed this text to read "We should lock the DUT
solely for the specific purpose of debugging the problem".

Also, in lieu of extended debugging, it may be reasonable to do one of
two other things:
  * Force the DUT to re-install from USB using the `repair_test` command.
  * Force the DUT to re-install the repair image via AU by following
    a procedure like this:
        https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin/powerwash

Before doing either of those things, it's useful to look at the
repair log, since the log will tell you whether either of those
steps has already been tried. (A minimal powerwash trigger is
sketched below.)
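
For reference, the powerwash mentioned in the second bullet is usually
triggered from the DUT's shell as sketched below. This is only a sketch of
the common mechanism; the keyword list written to the trigger file is an
assumption, and the linked internal procedure is the authoritative reference.

# Run as root on the DUT (e.g. over ssh). The keywords are read by
# clobber-state at the next boot and request a fast, safe wipe.
echo "fast safe" > /mnt/stateful_partition/factory_install_reset
reboot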

Comment 19 by warx@chromium.org, Jan 26 2018

curious: could we manually uprev chrome in this case? This is the only failure in the last pfq run.

Comment 20 by warx@chromium.org, Jan 26 2018

Cc: steve...@chromium.org

Comment 21 by warx@chromium.org, Jan 26 2018

Cc: xiy...@chromium.org newcomer@chromium.org
On YAQS https://yaqs.googleplex.com/eng/q/5592011371970560, it is suggested not to do a manual uprev on a Friday.

CCing next week's gardeners to track this issue.


> curious: could we manually uprev chrome in this case? This
> is the only failure in the last pfq run.

It depends entirely on the specifics of the failure.  Some key
questions:
  * What tests didn't run because of the failure?  If we only
    lost one test, there's less risk than if we lost 5 or 10
    tests.
  * Was the provision failure caused by Chrome?  Chrome bugs can
    definitely cause provision failures.  Some of them even
    have a specific message that clearly blames Chrome.  So,
    we should do at least some debug before any manual uprev.


> On YAQS https://yaqs.googleplex.com/eng/q/5592011371970560,
> it is suggested not to do a manual uprev on a Friday.

Yeah, I wrote the post-mortem that recommended that.  I stand
by the recommendation. :-)

Looking at chromeos4-row11-rack10-host13. It is currently running from the USB stick.

When I try to update with chromeos_install --yes, it fails while installing the stateful partition:

+ copy_partition 1 /dev/sda /dev/mmcblk0
+ local part_num=1
+ local src=/dev/sda
+ local dst=/dev/mmcblk0
+ partsize /dev/sda 1
+ sudo /usr/bin/cgpt show -s -i 1 /dev/sda
+ local part_size=4194304
+ partoffset /dev/sda 1
+ sudo /usr/bin/cgpt show -b -i 1 /dev/sda
+ local src_offset=4415488
+ partoffset /dev/mmcblk0 1
+ sudo /usr/bin/cgpt show -b -i 1 /dev/mmcblk0
+ local dst_offset=8704000
+ echo Installing partition 1 to /dev/mmcblk0
Installing partition 1 to /dev/mmcblk0
+ install_stateful
+ echo Installing the stateful partition...
Installing the stateful partition...
+ loop_offset_setup /dev/mmcblk0 8704000
+ local filename=/dev/mmcblk0
+ local offset=8704000
+ losetup -f --show -o 4456448000 /dev/mmcblk0
+ LOOP_DEV=/dev/loop11
+ [ -z /dev/loop11 ]
+ LOOPS=/dev/loop11
+ mount_on_loop_dev readwrite
+ local rw_flag=readwrite
+ local mount_flags=
+ [ readwrite != readwrite ]
+ tracked_mount /dev/loop11 /tmp/install-mount-point
+ local last_arg
+ eval last_arg=$2
+ last_arg=/tmp/install-mount-point 
+ MOUNTS=/tmp/install-mount-point   
+ mount /dev/loop11 /tmp/install-mount-point
mount: mount /dev/loop11 on /tmp/install-mount-point failed: Structure needs cleaning
+ cleanup_on_failure

It looks like the stateful partition on the USB is corrupted. 

fsck -yf ${LOOP_DEV} || /bin/true

This gets a little further, but it fails later:

+ IS_INSTALL=1 /tmp/install-mount-point/postinst /dev/mmcblk0p3
Segmentation fault (core dumped) 
+ cleanup_on_failure

I'll keep digging, but I suspect something is wrong with the recovery image itself, or with the media that hosts it.
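
For reference, the install script reaches the stateful partition by setting
up a loop device at a byte offset of start_sector * 512
(8704000 * 512 = 4456448000 in the trace above). A minimal sketch of
repeating that check by hand, using the partition number and device names
from the trace (pointing it at /dev/sda instead would check the USB source
directly):

# Sketch: fsck the target's stateful partition (partition 1) via a loop
# device, the same way chromeos_install mounts it.
DST=/dev/mmcblk0
OFFSET_SECTORS=$(sudo /usr/bin/cgpt show -b -i 1 "${DST}")   # start sector
LOOP_DEV=$(sudo losetup -f --show -o $((OFFSET_SECTORS * 512)) "${DST}")
sudo fsck -fy "${LOOP_DEV}"    # same repair attempted above
sudo losetup -d "${LOOP_DEV}"  # detach when done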

Owner: gwendal@chromium.org
