
Issue 599893 link

Starred by 2 users

Issue metadata

Status: WontFix
Owner: ----
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocking:
issue 559282
issue 572122
issue 572278
issue 585034
issue 589742
issue 592384
issue 592395
issue 592519
issue 592535
issue 599346




Provision update failures: "The priority of the inactive kernel partition is less than that of the active kernel partition"

Project Member Reported by steve...@chromium.org, Apr 1 2016

Issue description

Tracking issue for a series of recent failures with the following symptom:

ANCHOR  TestFailure{bvt-cq,provision,Update failed. The priority of the inactive kernel partition is less than that of the active kernel partition.}

This looks like it may be another symptom of ssh failures?

From a recent autoserv.DEBUG:

03/31 23:19:07.178 DEBUG|          ssh_host:0153| Running (ssh) 'rm -rf /tmp/tmp.D9pLPHmAPA'
03/31 23:19:07.358 DEBUG|      abstract_ssh:0696| Nuking master_ssh_job.
03/31 23:19:08.365 DEBUG|      abstract_ssh:0702| Cleaning master_ssh_tempdir.
03/31 23:19:08.366 DEBUG|      abstract_ssh:0696| Nuking master_ssh_job.
03/31 23:19:09.372 DEBUG|      abstract_ssh:0702| Cleaning master_ssh_tempdir.
03/31 23:19:09.373 DEBUG|      abstract_ssh:0858| Terminated tunnel, pid 21101
03/31 23:19:09.374 DEBUG|      abstract_ssh:0696| Nuking master_ssh_job.
03/31 23:19:10.380 DEBUG|      abstract_ssh:0702| Cleaning master_ssh_tempdir.
03/31 23:19:10.381 ERROR|         traceback:0013| Traceback (most recent call last):
03/31 23:19:10.381 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 515, in run_autoserv
03/31 23:19:10.383 ERROR|         traceback:0013|     job.provision(job_labels)
03/31 23:19:10.383 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 478, in provision
03/31 23:19:10.384 ERROR|         traceback:0013|     self.run(control=control, job_labels=labels)
03/31 23:19:10.384 ERROR|         traceback:0013|   File "/usr/local/autotest/server/site_server_job.py", line 65, in run
03/31 23:19:10.385 ERROR|         traceback:0013|     super(site_server_job, self).run(*args, **dargs)
03/31 23:19:10.385 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 684, in run
03/31 23:19:10.386 ERROR|         traceback:0013|     self._execute_code(server_control_file, namespace)
03/31 23:19:10.386 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 1182, in _execute_code
03/31 23:19:10.386 ERROR|         traceback:0013|     execfile(code_file, namespace, namespace)
03/31 23:19:10.387 ERROR|         traceback:0013|   File "/usr/local/autotest/results/hosts/chromeos4-row2-rack3-host9/489091-provision/20163103224347/control.srv", line 64, in <module>
03/31 23:19:10.387 ERROR|         traceback:0013|     job.parallel_simple(provision_machine, machines)
03/31 23:19:10.388 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 574, in parallel_simple
03/31 23:19:10.388 ERROR|         traceback:0013|     return_results=return_results)
03/31 23:19:10.389 ERROR|         traceback:0013|   File "/usr/local/autotest/server/subcommand.py", line 93, in parallel_simple
03/31 23:19:10.389 ERROR|         traceback:0013|     function(arg)
03/31 23:19:10.390 ERROR|         traceback:0013|   File "/usr/local/autotest/results/hosts/chromeos4-row2-rack3-host9/489091-provision/20163103224347/control.srv", line 56, in provision_machine
03/31 23:19:10.390 ERROR|         traceback:0013|     raise Exception('')
03/31 23:19:10.391 ERROR|         traceback:0013| Exception
03/31 23:19:10.391 ERROR|          autoserv:0773| 1
 
I suspect this may be related to issue 598517 ?

Cc: aaboagye@chromium.org abhishekbh@chromium.org
The peach_pit-pfq also just recently ran into this issue.

buildslave: build294-m2
chromeos4-row8-rack12-host13
Link to logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/59017472-chromeos-test/chromeos4-row8-rack12-host13/debug/

DEBUG log snippet:
root@chromeos4-row8-rack12-host13:"/tmp/stateful_update""'
04/05 02:13:34.242 DEBUG|          ssh_host:0153| Running (ssh) '/tmp/stateful_update http://172.17.40.27:8082/static/peach_pit-chrome-pfq/R51-8152.0.0-rc7 --stateful_change=clean 2>&1'
04/05 02:13:34.587 DEBUG|        base_utils:0268| [stdout] Downloading stateful payload from http://172.17.40.27:8082/static/peach_pit-chrome-pfq/R51-8152.0.0-rc7/stateful.tgz
04/05 02:13:34.611 DEBUG|        base_utils:0268| [stdout]   HTTP/1.1 200 OK
04/05 02:13:34.612 DEBUG|        base_utils:0268| [stdout]   Date: Tue, 05 Apr 2016 09:12:26 GMT
04/05 02:13:34.612 DEBUG|        base_utils:0268| [stdout]   Server: Apache/2.4.7 (Ubuntu)
04/05 02:13:34.613 DEBUG|        base_utils:0268| [stdout]   Last-Modified: Tue, 05 Apr 2016 09:11:29 GMT
04/05 02:13:34.613 DEBUG|        base_utils:0268| [stdout]   ETag: "e6f8ca2-52fb93b0770c6"
04/05 02:13:34.614 DEBUG|        base_utils:0268| [stdout]   Accept-Ranges: bytes
04/05 02:13:34.614 DEBUG|        base_utils:0268| [stdout]   Content-Length: 242191522
04/05 02:13:34.615 DEBUG|        base_utils:0268| [stdout]   Keep-Alive: timeout=60, max=1000
04/05 02:13:34.615 DEBUG|        base_utils:0268| [stdout]   Connection: Keep-Alive
04/05 02:13:34.616 DEBUG|        base_utils:0268| [stdout]   Content-Type: application/x-gzip
04/05 02:14:08.507 DEBUG|        base_utils:0268| [stdout] Downloading command returns code 0.
04/05 02:14:08.507 DEBUG|        base_utils:0268| [stdout] Successfully downloaded update.
04/05 02:14:08.508 DEBUG|        base_utils:0268| [stdout] Restoring state to factory_install with dev_image.
04/05 02:17:27.031 DEBUG|        base_utils:0268| [stdout] [0405/021726:INFO:update_engine_client.cc(215)] Update succeeded -- reboot needed.
04/05 02:17:27.036 DEBUG|          ssh_host:0153| Running (ssh) '/usr/bin/update_engine_client -status 2>&1 | grep CURRENT_OP'
04/05 02:17:27.285 DEBUG|        base_utils:0268| [stdout] CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
04/05 02:17:27.296 INFO |       autoupdater:0521| Update complete.
04/05 02:17:27.297 INFO |       autoupdater:0532| Update engine log has downloaded in sysinfo/update_engine dir. Check the lastest.
04/05 02:17:37.348 DEBUG|          ssh_host:0153| Running (ssh) 'rootdev -s'
04/05 02:17:37.348 INFO |      abstract_ssh:0735| Master ssh connection to chromeos4-row8-rack12-host13 is down.
04/05 02:17:37.349 DEBUG|      abstract_ssh:0696| Nuking master_ssh_job.
04/05 02:17:38.356 DEBUG|      abstract_ssh:0702| Cleaning master_ssh_tempdir.
04/05 02:17:38.357 INFO |      abstract_ssh:0749| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_0bdSpjssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row8-rack12-host13'
04/05 02:17:38.358 DEBUG|        base_utils:0177| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_0bdSpjssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row8-rack12-host13'
04/05 02:17:43.434 INFO |      abstract_ssh:0764| Timed out waiting for master-ssh connection to be established.
04/05 02:18:16.060 DEBUG|        base_utils:0268| [stdout] /dev/mmcblk0p3
04/05 02:18:16.067 DEBUG|          ssh_host:0153| Running (ssh) 'cgpt show -n -i 4 -P $(rootdev -s -d)'
04/05 02:18:16.330 DEBUG|        base_utils:0268| [stdout] 1
04/05 02:18:16.334 DEBUG|          ssh_host:0153| Running (ssh) 'cgpt show -n -i 2 -P $(rootdev -s -d)'
04/05 02:18:16.586 DEBUG|        base_utils:0268| [stdout] 2
04/05 02:18:16.589 ERROR|provision_AutoUpda:0138| Update failed. The priority of the inactive kernel partition is less than that of the active kernel partition.
04/05 02:18:16.591 WARNI|              test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 139, in run_once
    raise error.TestFail(str(e))
TestFail: Update failed. The priority of the inactive kernel partition is less than that of the active kernel partition.
04/05 02:18:16.595 DEBUG|   logging_manager:0627| Logging subprocess finished
04/05 02:18:16.611 DEBUG|   logging_manager:0627| Logging subprocess finished
04/05 02:18:16.628 DEBUG|          ssh_host:0153| Running (ssh) 'rm -rf /tmp/sysinfo/autoserv-UF0Bfq'
04/05 02:18:16.795 DEBUG|        base_utils:0268| [stderr] mux_client_request_session: read from master failed: Broken pipe
04/05 02:18:18.114 DEBUG|          ssh_host:0153| Running (ssh) 'rm -rf "/tmp/sysinfo/autoserv-UF0Bfq"'
04/05 02:18:18.115 INFO |      abstract_ssh:0735| Master ssh connection to chromeos4-row8-rack12-host13 is down.
04/05 02:18:18.115 DEBUG|      abstract_ssh:0696| Nuking master_ssh_job.
04/05 02:18:18.115 DEBUG|      abstract_ssh:0702| Cleaning master_ssh_tempdir.
04/05 02:18:18.117 INFO |      abstract_ssh:0749| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_OprTe_ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row8-rack12-host13'
04/05 02:18:18.117 DEBUG|        base_utils:0177| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_OprTe_ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row8-rack12-host13'
04/05 02:18:19.594 DEBUG|          ssh_host:0153| Running (ssh) 'rm -rf "/tmp/autoserv-OgbBQY"'
04/05 02:18:19.841 DEBUG|          ssh_host:0153| Running (ssh) 'rm -rf "/tmp/autoserv-E1kHHE"'
04/05 02:18:20.053 DEBUG|      abstract_ssh:0696| Nuking master_ssh_job.
04/05 02:18:21.059 DEBUG|      abstract_ssh:0702| Cleaning master_ssh_tempdir.
04/05 02:18:21.061 INFO |        server_job:0128| 		FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1459847901	localtime=Apr 05 02:18:21	Update failed. The priority of the inactive kernel partition is less than that of the active kernel partition.
04/05 02:18:21.062 INFO |        server_job:0128| 	END FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1459847901	localtime=Apr 05 02:18:21	
Owner: aaboagye@chromium.org
"The priority of the inactive kernel partition is less than that of the active kernel partition."

This doesn't look to me like an ssh flake problem. This looks like a legitimate image problem.

 -> to sheriff for digging
Cc: jrbarnette@chromium.org shuqianz@chromium.org de...@chromium.org
+ccing a few people knowledgeable about update_engine.
Does this failure mean it just didn't boot into the other image?
The message text comes from autotest server/host/cros_host.py:786

            # Ensure inactive kernel has higher priority than active.
            if (updater.get_kernel_priority(inactive_kernel)
                    < updater.get_kernel_priority(active_kernel)):
                raise autoupdater.ChromiumOSError(
                    'Update failed. The priority of the inactive kernel'
                    ' partition is less than that of the active kernel'
                    ' partition.')
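For reference, a minimal sketch of how this check plays out, based on the cgpt commands visible in the c#4 log (`cgpt show -n -i <part> -P $(rootdev -s -d)`). The function names here are illustrative, not the actual autotest API:

```python
# Illustrative sketch only -- not the real autoupdater code.
# The log above shows partition 4 (inactive) reporting priority 1 and
# partition 2 (active) reporting priority 2, which trips this check.
import subprocess

def get_kernel_priority(partition, root_dev):
    """Read the GPT priority attribute of a kernel partition via cgpt,
    mirroring the command seen in the debug log."""
    out = subprocess.check_output(
        ["cgpt", "show", "-n", "-i", str(partition), "-P", root_dev])
    return int(out.strip())

def inactive_has_higher_priority(inactive_pri, active_pri):
    """The invariant that fails in this bug: after a successful update,
    the inactive (newly written) kernel must outrank the active one."""
    return inactive_pri > active_pri
```

With the values from the log (inactive=1, active=2) the invariant is False, which is exactly when `ChromiumOSError` is raised.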
Cc: sosa@chromium.org
+sosa who originally wrote that part of machine_install.

Is there any way this message can come from ssh or infra flake (long shot, if ssh failures are silenced and mapped to some incorrect priority values)?

Or, more likely, any other insight on why this might be?
I don't fully understand what the logs are saying, but one
possible scenario is this:
  * Update is downloaded and installed on disk.
  * DUT is ready for reboot to apply the update.
  * The reboot fails to happen.
  * We make the check above, and find the failure.

The other possibility I can think of runs like this:
  * Update is downloaded and installed on disk.
  * DUT says it's ready for reboot to apply the update.
  * We reboot.
  * Some of the stuff we wrote to the GPT never gets
    flushed to storage, with the result that we see old
    data that causes the failure.

A more careful look at the logs should help distinguish these
cases...

My assessment of the problem is off.  The error in question
is detected before the reboot.  The error means the following
sequence:
  * DUT is told to download and install the update.
  * DUT reports "success, need reboot"
  * We check the GPT, and data that's supposed to be in the
    GPT isn't there.

The logs show the sequence.  It's unclear why the DUT doesn't
have the right data in the GPT.

Looking at the logs, I'm beginning to suspect that the DUT could
have a failing SSD.  Certainly, the most recent result is that
the DUT has gone offline.

For now, I've balanced the peach_pit BVT pool to replace the DUT
with a working spare.

Comment 13 by sosa@google.com, Apr 5 2016

re:#9

No, we differentiate between exceptions there. For ssh errors, we don't catch them, so they should be propagated up. If we're getting an actual autoupdater.ChromiumOSError, that means the kernel check returned false rather than throwing an exception. It would only do that if it got valid data.
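To make the distinction concrete, here is a hedged sketch (not the real autoupdater code, names are stand-ins) of why an ssh flake cannot produce this particular error:

```python
# Illustrative sketch of the control flow described above: ssh/command
# failures inside the priority read raise their own exception and are
# deliberately NOT caught, so only valid cgpt data can reach the
# comparison that raises ChromiumOSError.
class ChromiumOSError(Exception):
    """Stand-in for autoupdater.ChromiumOSError."""

def verify_kernel_state(get_priority, inactive_kernel, active_kernel):
    # If get_priority fails (ssh error, bad output), its exception
    # propagates up here -- we never reach the comparison below.
    inactive = get_priority(inactive_kernel)
    active = get_priority(active_kernel)
    if inactive < active:
        # Reaching this line means both reads returned valid integers.
        raise ChromiumOSError(
            'Update failed. The priority of the inactive kernel'
            ' partition is less than that of the active kernel'
            ' partition.')
```

So seeing this exact message implies the DUT answered both cgpt queries, ruling out a silenced ssh failure.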

I think Richard's on the right track.
Labels: OS-Chrome
Owner: ----
Status: Available (was: Assigned)
I'm not sheriff anymore, so no longer the owner. Also, I don't recall seeing this issue in a while...
Looking again at this, I note that both DUTs in question seem
to be back up and working normally.

Looking again at the logs in c#4, I note this:
04/05 02:17:38.358 DEBUG|        base_utils:0177| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_0bdSpjssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row8-rack12-host13'
04/05 02:17:43.434 INFO |      abstract_ssh:0764| Timed out waiting for master-ssh connection to be established.
04/05 02:18:16.060 DEBUG|        base_utils:0268| [stdout] /dev/mmcblk0p3

The "Timed out ..." message could indicate a crash and reboot (or at
least, reboot).  The way the error check is written, any reboot will
cause the symptom in question.  I note that in particular, veyron_rialto
has shown this very symptom in the past, and the cause was that rialto
was automatically rebooting when an update was ready.

Labels: -Pri-1 Pri-2
Seems like a symptom of devserver failures or networking flakes. Lowering priority -- please reassess or close as needed.
Status: WontFix (was: Available)
Whatever it is, it's not happening now.
