Status: Duplicate
Owner:
Closed: Jun 8
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocked on:
issue 689105
issue 730134



Stateful encryption related formats breaking FSI testing on selected boards post R53.
Project Member Reported by mcchou@chromium.org, Jun 5
Where the issue happened:
Canary veyron_minnie-release

What the issue was:
Canary veyron_minnie-release failed at autoupdate_EndToEndTest.paygen_au_canary_full, where update_stateful() (https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/common_lib/cros/autoupdater.py?rcl=745b8167a5a346742905c7b4d8b74ec722d56314&l=516) failed. The DUT was no longer pingable after update_stateful() failed.

When the issue started:
This failure has been there since build #1184 (see https://chromegw.corp.google.com/i/chromeos/builders/veyron_minnie-release).

Error messages from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/121595781-chromeos-test/chromeos4-row9-rack9-host3/autoupdate_EndToEndTest.paygen_au_canary_full/debug/:
06/05 07:47:32.358 DEBUG|      abstract_ssh:0390| Trying scp.
06/05 07:47:32.359 DEBUG|          ssh_host:0284| Running (ssh) 'ls "/tmp/sysinfo/autoserv-idJ0Y7/results/default/"*'
06/05 07:47:32.836 DEBUG|             utils:0298| [stderr] ls: cannot access /tmp/sysinfo/autoserv-idJ0Y7/results/default/*: No such file or directory
06/05 07:47:32.839 DEBUG|          ssh_host:0284| Running (ssh) 'ls "/tmp/sysinfo/autoserv-idJ0Y7/results/default/".[!.]*'
06/05 07:47:33.296 DEBUG|             utils:0298| [stderr] ls: cannot access /tmp/sysinfo/autoserv-idJ0Y7/results/default/.[!.]*: No such file or directory
06/05 07:47:33.300 DEBUG|        server_job:1372| Client state file /usr/local/autotest/results/121595781-chromeos-test/chromeos4-row9-rack9-host3/control.autoserv.state not found
06/05 07:47:33.304 DEBUG|          base_job:0399| Persistent state client.* deleted
06/05 07:47:33.305 DEBUG|          autotest:0966| Autotest job finishes.
06/05 07:47:33.306 ERROR|               log:0027| post-test iteration server sysinfo error:
06/05 07:47:33.307 ERROR|         traceback:0013| Traceback (most recent call last):
06/05 07:47:33.307 ERROR|         traceback:0013|   File "/usr/local/autotest/client/common_lib/log.py", line 25, in decorated_func
06/05 07:47:33.308 ERROR|         traceback:0013|     fn(*args, **dargs)
06/05 07:47:33.309 ERROR|         traceback:0013|   File "/usr/local/autotest/server/test.py", line 71, in wrapper
06/05 07:47:33.309 ERROR|         traceback:0013|     func(self, mytest, host, at, outputdir)
06/05 07:47:33.310 ERROR|         traceback:0013|   File "/usr/local/autotest/server/test.py", line 216, in after_iteration_hook
06/05 07:47:33.311 ERROR|         traceback:0013|     results_dir=self.job.resultdir)
06/05 07:47:33.312 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autotest.py", line 381, in run
06/05 07:47:33.312 ERROR|         traceback:0013|     client_disconnect_timeout, use_packaging=use_packaging)
06/05 07:47:33.313 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autotest.py", line 464, in _do_run
06/05 07:47:33.314 ERROR|         traceback:0013|     client_disconnect_timeout=client_disconnect_timeout)
06/05 07:47:33.315 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autotest.py", line 950, in execute_control
06/05 07:47:33.315 ERROR|         traceback:0013|     raise error.AutotestRunError(msg)
06/05 07:47:33.316 ERROR|         traceback:0013| AutotestRunError: Aborting - unexpected final status message from client on chromeos4-row9-rack9-host3
06/05 07:47:33.317 ERROR|         traceback:0013| 
06/05 07:47:33.317 DEBUG|              test:0396| after_iteration_hooks completed
06/05 07:47:33.318 WARNI|              test:0616| The test failed with the following exception
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 610, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 818, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 471, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 348, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 381, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 1818, in run_once
    test_platform.prep_device_for_update(test_conf['source_release'])
  File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 1151, in prep_device_for_update
    self._staged_urls.source_stateful_url)
  File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 985, in _install_source_version
    stateful_url, True)
  File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 942, in _update_via_test_payloads
    perform_update(stateful_url, True)
  File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 923, in perform_update
    updater.update_stateful(clobber=clobber)
  File "/usr/local/autotest/client/common_lib/cros/autoupdater.py", line 538, in update_stateful
    raise update_error
StatefulUpdateError: Failed to perform stateful update on chromeos4-row9-rack9-host3
 
Cc: gwendal@chromium.org dgarr...@chromium.org
Owner: gwendal@chromium.org
+don +gwendal


Many of the failures we're seeing on the Jerry board seem to come from a canary AU from full_8530.96.0. Don thinks this might be because we set the cutover point for encryption incorrectly. What do you think?
When, exactly, did the encrypted stateful support land?

Is there any chance it wasn't present in R53 8530.96.0?
Looking into the log:
06/05 07:46:56.192 WARNI|autoupdate_EndToEn:0983| Device has been powerwashed, need to reinstall stateful from http://100.115.219.136:8082/static/stable-channel/veyron-minnie/8530.96.0/stateful.tgz

And it fails, leading to:

06/05 07:47:30.296 DEBUG|     site_autotest:0194| bash: /tmp/sysinfo/autoserv-idJ0Y7/bin/autotestd_monitor: /usr/bin/python: bad interpreter: No such file or directory

So indeed, the device has been powerwashed, and we cannot recover from this.
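For anyone triaging similar logs, here is a minimal sketch that flags this signature automatically. It only matches the "bad interpreter" text exactly as it appears in the log line above; the function name and the regex are my own illustration, not part of autotest.

```python
import re

# Matches the shell error printed when a script's interpreter is missing, e.g.
# "bash: <script>: /usr/bin/python: bad interpreter: No such file or directory".
_BAD_INTERPRETER = re.compile(
    r"bash: (?P<script>\S+): (?P<interpreter>\S+): bad interpreter: "
    r"No such file or directory"
)


def find_missing_interpreters(log_lines):
    """Return (script, interpreter) pairs for each 'bad interpreter' error.

    Hypothetical helper: it scans plain text lines and knows nothing about
    the autotest log format beyond the error message itself.
    """
    hits = []
    for line in log_lines:
        match = _BAD_INTERPRETER.search(line)
        if match:
            hits.append((match.group("script"), match.group("interpreter")))
    return hits
```

Feeding it the debug log of a failed run should return the autotestd_monitor script paired with /usr/bin/python whenever the DUT is in this state.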

The user-space changes for ext4 crypto are indeed in R53, but the 3.14 kernel changes are not:
- The basic kernel changes for ext4 in 3.14 are not in R53, only in R54.
- The 3.18 changes are in R51, and the 4.4 changes are in R52.
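The cutover points above can be written down as a small lookup, which makes it easier to check whether a given source build can cope with an encrypted stateful partition. This is a rough sketch: the table and function names are hypothetical, reading "only R54" as "R54 and later" is my assumption, and so is treating kernels not listed here as unsupported.

```python
# First milestone that carries the ext4 crypto kernel changes, per kernel
# version, as stated in this thread. Reading "only R54" as "R54 and later"
# is an assumption.
FIRST_MILESTONE_WITH_EXT4_CRYPTO = {
    "3.14": 54,
    "3.18": 51,
    "4.4": 52,
}


def kernel_has_ext4_crypto(kernel_version, milestone):
    """Return True if a build at `milestone` on `kernel_version` has the changes."""
    first = FIRST_MILESTONE_WITH_EXT4_CRYPTO.get(kernel_version)
    if first is None:
        # Assumption: kernels not listed in the thread are treated as unsupported.
        return False
    return milestone >= first
```

With this table, an R53 build on a 3.14 kernel comes out as unsupported, which matches the failure seen with 8530.96.0.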




This is worse on quawks and squawks, where we don't reboot:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/121810418-chromeos-test/chromeos4-row7-rack3-host19/debug/

and

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/121810563-chromeos-test/chromeos4-row10-rack10-host1/debug/

I suspect these machines revert to 3.10 (testing), where we crash very early at boot. We will have to set the stepping stone to a very recent image for these machines.

Confirmed that squawks moved from 3.10 to 4.4 on 1/17, then N (directory encryption) on 3/17.
If we revert to 3.10, bad things will happen.
- I can fix 3.10 to fail the mount gracefully, allowing a powerwash.
It will not fix the issue at hand, but we would fail more cleanly.
- We need to find a stepping stone between 1/17 and 3/17 for these machines.
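Picking a stepping stone amounts to filtering builds by date against that window. A minimal sketch, assuming we have a list of (version, build date) pairs from somewhere; the input shape, the function name, and the choice of an inclusive start and exclusive end are all assumptions on my part, and the two window dates are passed in as parameters.

```python
from datetime import date


def stepping_stone_candidates(builds, kernel_move: date, encryption_enabled: date):
    """Return versions built in [kernel_move, encryption_enabled), oldest first.

    builds: iterable of (version, build_date) pairs (hypothetical input shape).
    The window is inclusive at the start and exclusive at the end, which is
    an assumption rather than something this thread specifies.
    """
    in_window = [
        (build_date, version)
        for version, build_date in builds
        if kernel_move <= build_date < encryption_enabled
    ]
    # Sort by build date so the oldest candidate comes first.
    return [version for _, version in sorted(in_window)]
```

Whether the oldest or the newest build in the window makes the better stepping stone is still an open question, so the sketch returns all candidates.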
Gwendal: Given that all Bay Trail boards have moved to v4.4, fixing 3.10 won't actually be deployed anywhere where it matters, will it?

3.10 on those boards exists only as a historical point, and we can't exactly push an auto update to M56 and before.
Summary: Stateful encryption related formats breaking FSI testing on selected boards post R53. (was: [Canary] veyron_minnie-release: Failed at autoupdate_EndToEndTest.paygen_au_canary_full)
Blockedon: 730134
Umbrella bugs have been created for the devices moved from:
3.10 to 4.4: https://bugs.chromium.org/p/chromium/issues/detail?id=730141
3.14 to 4.4: https://bugs.chromium.org/p/chromium/issues/detail?id=730134
Blockedon: 689105
Blocking on the (months-old) original bug where we tried to fix and then work around this problem.

I'm still trying to understand exactly what we could do differently in autotest to recover properly from this class of problems. If anyone has a good idea and 15 minutes to describe it to me, please ping me on chat.
Mergedinto: 689105
Status: Duplicate