Quick provision loadtest (90 simultaneous DUTS, 2000 provisions in total), resulted in two back to back failures provisioning leon chromeos4-row1-rack10-host15.
{"avg_active": 90, "board": "leon", "build_name": "leon-release/R64-10164.0.0", "clobber_stateful": true, "devserver": "chromeos2-devserver9.cros.corp.google.com", "elapsed": 219, "end_active": 90, "finish_time": 1512603920.039462, "force_update": false, "full_update": false, "id": ["Job", 443], "name": "chromeos4-row1-rack10-host15", "parent": ["Runner", 1512602801], "quick_provision": true, "start_active": 88, "start_time": 1512603700.193045, "status": "fail", "trigger_response": "[true, 254974]"}
{"avg_active": 90, "board": "leon", "build_name": "leon-release/R64-10159.0.0", "clobber_stateful": true, "devserver": "chromeos2-devserver9.cros.corp.google.com", "elapsed": 177, "end_active": 90, "finish_time": 1512604136.907756, "force_update": false, "full_update": false, "id": ["Job", 562], "name": "chromeos4-row1-rack10-host15", "parent": ["Runner", 1512602801], "quick_provision": true, "start_active": 84, "start_time": 1512603959.079989, "status": "fail", "trigger_response": "[true, 257328]"}
First provision failure appeared to fail due to SSH connection failure mid-provision:
65536K ........ ........ ........ ........ 17% 23.7M 28s
98304K ........ ........ .......
2017/12/06 15:42:31.165 DEBUG| cros_build_lib:0644| (stderr):
Warning: Permanently added 'chromeos4-row1-rack10-host15,100.115.192.96' (ED25519) to the list of known hosts.^M
Warning: Permanently added 'chromeos4-row1-rack10-host15,100.115.192.96' (ED25519) to the list of known hosts.^M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
^M 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0^M100 7518 100 7518 0 0 7518 0 0:00:01 --:--:-- 0:00:01 1223k
Timeout, server chromeos4-row1-rack10-host15 not responding.^M
2017/12/06 15:42:31.165 ERROR| remote_access:0961| Error connecting to device chromeos4-row1-rack10-host15
2017/12/06 15:42:31.165 WARNI| cros_update:0286| Error during quick provision, falling back: Warning: Permanently added 'chromeos4-row1-rack10-host15,100.115.192.96' (ED25519) to the list of known hosts.^M
(A separate issue needs to be opened for a number of SSH connection errors, but I believe them to be due to network load in the lab, potentially from the load test).
The subsequent SSH retries fail, and since the non-quick provision image is not staged, the fallback does not occur.
The next provision shows the quick provision script successfully running, but the check after provision failing due to the version not changing. It starts 40 seconds after the failure of the first provision is recorded by my loadtest script.
Investigation shows that the rootfs has a 2MiB chunk that was not as expected for version R64-10159 being provisioned to (blocks 300-301, bs=1MiB)
In particular:
block 300 == version R64-10164
block 301 == 128KiB R64-10164, 64KiB neither, 832KiB R64-10159 (version being provisioned)
R64-10164 was the version attempted to be provisioned in the prior version.
R64-10162 was the version currently running on /dev/sda5.
R64-10159 was the version previously provisioned to /dev/sda3 (and also attempting to be provisioned)
A few possibilities:
- the initial provision does not fail when the devserver thinks it does and continues provisioning in parallel with the new provision -- this might have happened for some time, but the other provision almost certainly did not run to completion or it would have interrupted the second with a reboot
- for some reason the write of those blocks to disk do not happen/do not get flushed and the data from the previous provision exists at the time of reboot
- the writes of those blocks are out of order and the correct ones get overwritten by the previous provision.
#1 is a possibility, but seems unlikely due to the four minute head start. That being said, if there is an SSH connection failure, maybe there's some transient network issues which could have stalled the HTTP connection. The quick provision code should ensure this does not happen however because it will likely come to bite us in the future. (To be fixed by davidriley@).
#2 and #3 represent more substantial kernel related issues if they are occurring.
snanda@/gwendal@ for opinions on the latter.
Comment 1 by davidri...@chromium.org
, Dec 7 2017