
Issue 792767

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Quick provision failure with bad rootfs data

Project Member Reported by davidri...@chromium.org, Dec 7 2017

Issue description

Quick provision loadtest (90 simultaneous DUTs, 2000 provisions in total) resulted in two back-to-back failures provisioning leon chromeos4-row1-rack10-host15.

{"avg_active": 90, "board": "leon", "build_name": "leon-release/R64-10164.0.0", "clobber_stateful": true, "devserver": "chromeos2-devserver9.cros.corp.google.com", "elapsed": 219, "end_active": 90, "finish_time": 1512603920.039462, "force_update": false, "full_update": false, "id": ["Job", 443], "name": "chromeos4-row1-rack10-host15", "parent": ["Runner", 1512602801], "quick_provision": true, "start_active": 88, "start_time": 1512603700.193045, "status": "fail", "trigger_response": "[true, 254974]"}
{"avg_active": 90, "board": "leon", "build_name": "leon-release/R64-10159.0.0", "clobber_stateful": true, "devserver": "chromeos2-devserver9.cros.corp.google.com", "elapsed": 177, "end_active": 90, "finish_time": 1512604136.907756, "force_update": false, "full_update": false, "id": ["Job", 562], "name": "chromeos4-row1-rack10-host15", "parent": ["Runner", 1512602801], "quick_provision": true, "start_active": 84, "start_time": 1512603959.079989, "status": "fail", "trigger_response": "[true, 257328]"}

The first provision appeared to fail due to an SSH connection failure mid-provision:
 65536K ........ ........ ........ ........ 17% 23.7M 28s
 98304K ........ ........ .......
2017/12/06 15:42:31.165 DEBUG|    cros_build_lib:0644| (stderr):
Warning: Permanently added 'chromeos4-row1-rack10-host15,100.115.192.96' (ED25519) to the list of known hosts.^M
Warning: Permanently added 'chromeos4-row1-rack10-host15,100.115.192.96' (ED25519) to the list of known hosts.^M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
^M  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0^M100  7518  100  7518    0     0   7518      0  0:00:01 --:--:--  0:00:01 1223k
Timeout, server chromeos4-row1-rack10-host15 not responding.^M

2017/12/06 15:42:31.165 ERROR|     remote_access:0961| Error connecting to device chromeos4-row1-rack10-host15
2017/12/06 15:42:31.165 WARNI|       cros_update:0286| Error during quick provision, falling back: Warning: Permanently added 'chromeos4-row1-rack10-host15,100.115.192.96' (ED25519) to the list of known hosts.^M

(A separate issue needs to be opened for a number of SSH connection errors, but I believe them to be due to network load in the lab, potentially from the load test).

The subsequent SSH retries fail, and since the non-quick provision image is not staged, the fallback does not occur.
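The fallback behavior described above amounts to a guard like the following (purely illustrative; the real logic lives in cros_update and talks to the devserver — the file name and variables here are stand-ins):

```shell
# Hypothetical fallback guard: only attempt the full (non-quick)
# provision if its payload has already been staged on the devserver.
staged_payload=update.gz          # stand-in for the staged full payload
quick_provision_ok=false          # the quick provision failed

if [ "$quick_provision_ok" = true ]; then
  echo "quick provision succeeded"
elif [ -e "$staged_payload" ]; then
  echo "falling back to full provision"
else
  echo "fallback impossible: full payload not staged"
fi
```

In this failure the payload was never staged, so the third branch is what actually happened.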

The next provision shows the quick provision script running successfully, but the post-provision check fails because the version did not change.  It starts 40 seconds after my loadtest script records the failure of the first provision.
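For reference, the post-provision version check amounts to something like the sketch below, assuming the version is read from /etc/lsb-release after reboot (the local file and its keys here are illustrative):

```shell
# Hypothetical post-provision check: compare the running release version
# against the provision target.  A local lsb-release stand-in is used;
# on a DUT this would be /etc/lsb-release on the freshly booted rootfs.
cat > lsb-release <<'EOF'
CHROMEOS_RELEASE_VERSION=10162.0.0
CHROMEOS_RELEASE_BOARD=leon
EOF

target=10159.0.0
running=$(sed -n 's/^CHROMEOS_RELEASE_VERSION=//p' lsb-release)
if [ "$running" = "$target" ]; then
  echo "provision verified: $running"
else
  echo "provision check failed: running $running, wanted $target"
fi
```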

Investigation shows that the rootfs has a 2MiB chunk (blocks 300-301, bs=1MiB) that does not match the expected contents for R64-10159, the version being provisioned.

In particular:
block 300 == version R64-10164
block 301 == 128KiB R64-10164, 64KiB neither, 832KiB R64-10159 (version being provisioned)

R64-10164 was the version the prior provision attempted to install.
R64-10162 was the version currently running on /dev/sda5.
R64-10159 was the version previously provisioned to /dev/sda3 (and also the version being provisioned).
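The block-level comparison can be sketched as follows; synthetic files stand in for /dev/sda3 and the expected R64-10159 image, with one block deliberately corrupted to mimic the stale data (file names, sizes, and block numbers here are illustrative, not the real ones):

```shell
# Build a 4MiB "expected" image and a copy with stale data at block 2
# (the real mismatch was at blocks 300-301 of a much larger rootfs).
bs=$((1024 * 1024))
head -c $((4 * bs)) /dev/zero > expected.img
cp expected.img rootfs.img
printf 'stale-10164-data' | dd of=rootfs.img bs=$bs seek=2 conv=notrunc 2>/dev/null

# Compare block by block at bs=1MiB, as in the investigation.
for blk in 0 1 2 3; do
  dd if=rootfs.img   bs=$bs skip=$blk count=1 2>/dev/null > got.blk
  dd if=expected.img bs=$bs skip=$blk count=1 2>/dev/null > want.blk
  if cmp -s got.blk want.blk; then
    echo "block $blk matches"
  else
    echo "block $blk DIFFERS"
  fi
done
```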

A few possibilities:
- the initial provision does not actually fail when the devserver thinks it does, and continues provisioning in parallel with the new provision -- this might have happened for some time, but the other provision almost certainly did not run to completion, or it would have interrupted the second with a reboot
- for some reason the writes of those blocks to disk do not happen or are not flushed, and the data from the previous provision still exists at the time of reboot
- the writes of those blocks are out of order and the correct ones get overwritten by the previous provision.

#1 is a possibility, but seems unlikely given the four-minute head start.  That said, if there was an SSH connection failure, there may have been transient network issues that stalled the HTTP connection.  The quick provision code should ensure this cannot happen regardless, because it will likely come back to bite us in the future.  (To be fixed by davidriley@.)

#2 and #3 would represent more substantial kernel-related issues if they are occurring.

snanda@/gwendal@ for opinions on the latter.
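If #2 or #3 is in play, one mitigation on the provisioning side would be to force the payload to stable storage before the post-write verification and reboot.  A minimal sketch, using ordinary files in place of the partition device (dd's conv=fsync and the blockdev call are the pieces that would matter on a real DUT):

```shell
# Write the payload with an explicit fsync so the data reaches the
# device before dd exits; then sync to flush remaining dirty buffers.
head -c $((1024 * 1024)) /dev/zero > payload.img
dd if=payload.img of=target.img bs=1M conv=fsync 2>/dev/null
sync
# On a real device, also invalidate the block cache before re-reading,
# so the verification sees on-disk data rather than cached pages:
#   blockdev --flushbufs /dev/sda3
cmp -s payload.img target.img && echo "payload verified after flush"
```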
 
Within that 1MiB chunk, the transition between the expected contents and the stale contents from 10164 actually appears to fall within a single 512-byte chunk, and that 512-byte chunk matches neither file.  (I haven't done an exhaustive analysis of the data here, so there might be some mistakes in this exact transition point.)
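The transition point can be located mechanically with cmp -l, which lists differing byte offsets; a sketch with synthetic data (the corruption offset below is illustrative, not the real one from the DUT):

```shell
# Create 4KiB of expected data and a copy corrupted at byte offset 1300,
# then report the first differing byte and its 512-byte sector.
head -c 4096 /dev/zero > want.bin
cp want.bin got.bin
printf 'X' | dd of=got.bin bs=1 seek=1300 conv=notrunc 2>/dev/null

off=$(cmp -l got.bin want.bin | head -n1 | awk '{print $1}')  # 1-based offset
echo "first difference at byte $off (512-byte sector $(( (off - 1) / 512 )))"
```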
Cc: puneetster@chromium.org
Labels: provisionflake
Status: Assigned (was: Untriaged)
