VMTest timeout on asan builds |
||||||||||||||||||||
Issue descriptionamd64-generic-asan: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8938751840014079680 betty-asan: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8938840932943986528 Failures to connect to the VM: Could not initiate first contact with remote host Connection timed out during banner exchange Connection timed out during banner exchange
,
Aug 8
,
Aug 8
Regression range as per manojgupta@: https://crosland.corp.google.com/log/10908.0.0..10911.0.0
,
Aug 8
Not that I recall, sorry.
,
Aug 8
Weak Hypothesis: culprit is crrev.com/c/1145825, "cbuildbot/stages: Serialize DebugSymbols to run after UnitTest". Looking at the code changed, it doesn't appear to have any means of enabling DebugSymbols if UnitTestStage isn't run, which IIRC it isn't in VMTest-only builds.
,
Aug 8
Hypothesis disconfirmed, both UnitTestStage and DebugSymbols ran.
,
Aug 10
The same failure is still seen on both builders. Any ideas how we can fix it?
,
Aug 13
Passing on Hotlist-Deputy
,
Aug 16
Pass to @alex, as he's recently working on VMTest.
,
Aug 29
Has any progress been made on this?
,
Sep 5
Sorry, this bug fell through the cracks. Looks like a similar failure happened to asan builds a few years ago: crbug/311478 +vapier who solved this failure in crbug/311478 @vapier, do you have a guess as to why starting up/connecting to the VM is failing this time around?
,
Sep 5
if we can't ssh, then the next step is to check the system logs. see if the system was able to offload any of them. not 100% sure we have the ability to do so if ssh is completely dead. in which case we usually fallback to downloading images and trying to reproduce/debug locally. if we lack that functionality, we prob should implement it whereby we manually mount the image outside of the VM and try to extract all the logs.
,
Sep 5
How do I view system logs of a remote tryjob? Is that even possible? Meanwhile, I'll attempt a local run.
,
Sep 5
if you look at the summary page, there are usually links to GS buckets which autotest/etc... upload. the test team should know more.
,
Sep 6
Clarification based on some offline questions Alex asked: vapier@ is referring to the system logs from the launched CrOS VM (not the builder) In this case, initial SSH to VM fails. So there are no logs to be had. If this failure were at a later stage, Autotest would have collected some system logs from the VM in the results directory. For the OP, the logs are at https://stainless.corp.google.com/browse/chromeos-image-archive/amd64-generic-asan/R70-10950.0.0-b2824990/vm_test_results_1/ for example. But they are empty since the builder never succeeded in connecting to the cros vm. The only reasonable thing to do here is to download the VM image archived along with other artifacts and run it locally. Most likely, it will fail to boot. Look at the "Artifacts" link on the build page (https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8938751840014079680) for the image. The image is named chromiumos_qemu_image.tar.xz
,
Sep 6
i've filed issue 881403 to improve the VM test framework in general so that in scenarios like this, we'd have better access to logs rather than devs having to download the multi-GB disk images and manually mounting+extracting themselves.
,
Sep 11
I downloaded the qemu image and successfully started a qemu with the image from a failing test. The process showed no error and kept running in the background. Then I tried to ssh into the running qemu. The ssh timed out with no explicit error. One potential guess is that the problem is with private keys. Namely, the file /mnt/stateful_partition/etc/ssh/ssh_host_rsa_key exists inside the amd64-generic-asan image from a failing test and is empty. The same file in a betty image from a succeeding pre-cq run is non-empty. In the latest successful pre-cq run for amd64-generic-asan the /etc/ssh/ssh_host_rsa_key doesn't exist on any of the mount points. (see https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940068130054724016) Do you think this could be the issue?
,
Sep 11
you should shut down the VM then mount the stateful image and look at the logs directly
,
Sep 11
-Chase-Pending Scope too open, and asan builders not a Tier-1 client (don't quote me on that).
,
Sep 11
what is Tier-1 client? ASAN builders are very important and we always try to keep them green.
,
Sep 14
Ping, VMTests are now broken for a month! zamorzaev@: Have you been able to make any progress?
,
Sep 18
Dealing with this issue requires knowledge about how to mount and work with images, and once mounted where to look for ssh logs / what to look for to see why SSH failed on the (VM) DUT. I think Alex will need some help from CrOS team proper to get up to speed on all that. Alex, please borrow a sheriff's attention to help you mount and poke around in the image.
,
Sep 18
We poked around a bit. It looks like the kernel is stuck booting (screenshot from vnc https://screenshot.googleplex.com/v3K85Z8eLRR). I couldn't get the serial console to give me anything useful. I don't remember how to cajole the kernel to write to serial port (cros_vm already directs the serial console from the VM to a file).
,
Sep 18
+ this week's sheriff to help us understand what asan kernel is unhappy and doesn't want to boot for VMTests.
,
Sep 21
As near as I can tell, VMTests on amd64-generic-asan have not succeeded since at least 7/25: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2782266 The did however succeed on 7/24: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2778365 Unfortunately there were a few BuildPackages failures in between, and that was long enough ago that a change history is unlikely to be super helpful. Hopefully we can prioritize this effort, it would be super helpful to have VMTests running on an asan chrome + chromeos again. (This is a bit of a tangent, but there is also an argument to be made for creating chrome-asan builders, e.g., {amd64-generic|betty}-chrome-asan and only building chrome with asan, not all of chromeos; these builders could then just test chromeos with asan and be moved to a location monitored by sheriffs instead of gardeners.
,
Oct 4
Asan bots are still not successing as of today. Is there any ongoing work on this?
,
Oct 8
Ping. Does anyone have this?
,
Oct 8
Sheriff or product owner needs to own this, this is not related to test infrastructure.
,
Oct 16
Ryan, can you look into this bug during your shadow gardener shift?
,
Oct 22
Andrew - can you take a look? This is a longstanding issue.
,
Oct 23
Will have a look.
,
Oct 25
I'm not sure it's asan-specific as a similar error (banner exchange failed) happens on VMTest (attempt 1) (but not attempt 2) on e.g. amd64-generic-telemetry: cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8931723950152789536 I think I have reproduced with a locally built image. I am seeing the issue (banner exchange failed) even with VMs that boot up OK to UI (i.e. not stuck like at the screenshot in #27). Next I will try to mount the drive and look for some logs on the VM (any tips on where to poke for logs appreciated!)
,
Oct 29
Please add your notes on how you reproduced this bug if you're planning to pass on to the next gardener/shadow.
,
Oct 29
,
Oct 29
To reproduce locally: ./setup_board --board=amd64-generic --profile=asan --force && \ ./build_packages --board=amd64-generic && \ ./build_image '--board=amd64-generic' '--disk_layout=4gb-rootfs' test && \ ./image_to_vm.sh '--board=amd64-generic' --test_image '--disk_layout=4gb-rootfs' && \ cros_vm --start --board=amd64-generic Wait a good while, then: ssh -p 9222 -oConnectTimeout=2 root@localhost Also VNC to the machine to see its partially-booted state. Notes: 1. I wasn't able to do the above with betty-asan because build_image failed with out-of-rootfs space (did not investigate that). 2. Removing "--profile=asan" seems to indeed fix the issue. 3. I think it might be necessary to tweak the VM launch options to not cache writes as I don't know a way to cleanly shut down the half-booted VM. 4. Planned next step: Do ./mount_gpt_image.sh --board=amd64-generic --safe --most_recent -i chromiumos_qemu_image.bin and then look for SSH logs. But I think (3.) may need dealing with first. Reassigning to next shadow!
,
Oct 29
> 3. I think it might be necessary to tweak the VM launch options to not cache writes as I don't know a way to cleanly shut down the half-booted VM. cros_vm has a "--copy-on-write" option that won't persist any filesystem changes to the base image when it's launched. You can try using that. > 4. Planned next step: Do ./mount_gpt_image.sh --board=amd64-generic --safe --most_recent -i chromiumos_qemu_image.bin and then look for SSH logs. But I think (3.) may need dealing with first. The logs will most likely be in /var/log, but everything in /var/ is in the encrypted-stateful partition by default, so would be difficult to pull out and read. You might want to rebuild the image w/ "USE=-encrypted_stateful" which I think should disable that.
,
Oct 31
martis@ have you had a chance to continue investigation here? It seems like this brought down a PFQ bot today[1] based on the log snippet
From the "betty-arcnext-chrome-pfq/R72-11213.0.0-rc2/vm_test_results_2/pfq_suite/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-40-login_VMSanity/debug/autoserv.INFO":
10/31 05:25:05.233 INFO | server_job:0217| START ---- ---- timestamp=1540988704 localtime=Oct 31 05:25:04
10/31 05:25:05.344 INFO | server_job:0217| START login_VMSanity login_VMSanity timestamp=1540988704 localtime=Oct 31 05:25:04
10/31 05:27:11.580 INFO | ssh_multiplex:0079| Master ssh connection to 127.0.0.1 is down.
10/31 05:27:11.581 INFO | ssh_multiplex:0096| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_rvl0wWssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227 127.0.0.1'
10/31 05:27:16.401 ERROR| utils:2769| Timed out waiting for condition: Wait for a socket file to exist
10/31 05:27:16.401 INFO | ssh_multiplex:0113| Timed out waiting for master-ssh connection to be established.
10/31 05:27:19.419 ERROR| utils:0287| [stderr] ssh: connect to host 127.0.0.1 port 9227: Connection refused
10/31 05:27:19.423 INFO | server_job:0217| FAIL ---- ---- timestamp=1540988839 localtime=Oct 31 05:27:19 Autotest client terminated unexpectedly: DUT is pingable but not SSHable, it most likely sporadically rebooted during testing. command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_rvl0wWssh-master/socket
-o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 9227 127.0.0.1
"export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then
logger -tag \"autotest\" \"server[stack::_diagnose_dut|get_boot_id|run] ->
ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat
'/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available';
fi)\";fi; if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat
'/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi"
Exit status: 255
Duration: 3.00589108467
stderr:
ssh: connect to host 127.0.0.1 port 9227: Connection refused
10/31 05:27:19.423 INFO | server_job:0217| END FAIL ---- ---- timestamp=1540988839 localtime=Oct 31 05:27:19
1: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931168185137309856.
,
Oct 31
I am working on investigating this, but I haven't yet made it past the work done by other engineers in this bug.
,
Nov 5
,
Nov 12
,
Nov 16
Any further progress on this?
,
Nov 19
Is there any update on that?
,
Dec 11
SSH doesn't work for the qemu script either: ./setup_board --board=amd64-generic --profile=asan --force ./build_packages --board=amd64-generic ./build_image --board=amd64-generic '--disk_layout=4gb-rootfs' --noenable_rootfs_verification test cros_vm --start --image-path ../build/images/amd64-generic/R73-11383.0.2018_12_11_1240-a1/chromiumos_test_image.bin When I run: ssh -p9222 root@127.0.0.1 I get the following error: ssh_exchange_identification: read: Connection reset by peer The build string at the top right shows it as a test build and developer build. I have been trying to get the shell but sending Ctrl-alt-f2 is not bringing up a terminal, I connected to 127.0.0.1:5900 with a VNC client and the UI doesn't show an ethernet device (this might be expected for virtio). There is a UI element that says "Enable debugging features". Clicking that shows a screen with a button to "Proceed". Clicking "Proceed" results in an error "Something went wrong" "Debugging features were not completely enabled on this Chromium device."
,
Dec 11
if you have UI, you should be able to start a guest session and then start a crosh instance where you can run shell to get root and debug that way
,
Dec 11
I think you need: ./image_to_vm.sh --test --board=amd64-generic You should use the QEMU image, not the built image.
,
Dec 11
I tried ./image_to_vm.sh but it failed with:
failed: error writing '/dev/loop3p3': No space left on device
so I ran the script with the full image.
I don't see the "Browse as Guest" option that is normally there.
,
Dec 11
Allen, did you pass '--disk_layout=4gb-rootfs' to image_to_vm.sh as well? Without it, the partitions may not have enough space.
,
Dec 12
I just repeated the same test with the image produced by: ./image_to_vm.sh --from=../build/images/amd64-generic/R73-11383.0.2018_12_11_1555-a1 --disk_layout=4gb-rootfs --board=amd64-generic --test Nothing changed. I am seeing the same behavior.
,
Dec 12
I tried the same procedure without --profile=asan and both ssh and ctrl-alt-f2 work. Also, the ethernet shows up as connected without --profile=asan. It would be helpful for troubleshooting if I could get a shell. I cannot get into guest mode with --profile=asan set because the "Browse as Guest" button doesn't show up without getting past the network setup screen which is blocked by the Ethernet not showing up, so I cannot get past the OOBE network setup screen to where the guest button shows up. It also is odd to me that Ctrl-Alt-F2 doesn't work with --profile=asan set.
,
Dec 12
> It would be helpful for troubleshooting if I could get a shell there's this doc: http://dev.chromium.org/chromium-os/how-tos-and-troubleshooting/serial-debugging-howto but not sure it has entire details for getting a serial console in a VM. it is possible though. cros_start_vm has an --serial option. > It also is odd to me that Ctrl-Alt-F2 doesn't work with --profile=asan set. it relies on daemons in the system (like frecon) working, so if they aren't, then the system can easily fall down.
,
Dec 17
Is is possible to maybe disable the failing test so we can get this stage working again? Four months of not being able to run this stage successfully is a long time...
,
Dec 17
Sounds reasonable. updowndota@ ping me if you need help doing that.
,
Jan 3
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/crosutils/+/181be74f53fe892091a102bd16e84b5fcc574db3 commit 181be74f53fe892091a102bd16e84b5fcc574db3 Author: Manoj Gupta <manojgupta@google.com> Date: Thu Jan 03 23:04:21 2019 mod_for_test_scripts: Fix up for sanitizer builds. Remove the limits from the init files if it is a sanitizer build. BUG= chromium:872391 TEST=amd64-generic-asan VMTests pass Change-Id: I02e3cb650333f69d04eff631bf73e31c976e8d67 Reviewed-on: https://chromium-review.googlesource.com/1390988 Commit-Ready: Manoj Gupta <manojgupta@chromium.org> Tested-by: Manoj Gupta <manojgupta@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [add] https://crrev.com/181be74f53fe892091a102bd16e84b5fcc574db3/mod_for_test_scripts/390removeLimitForSanitizers
,
Jan 4
Amd64-generic-asan VMTests fails should be fixed by https://chromium-review.googlesource.com/1390988
,
Jan 19
(3 days ago)
|
||||||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||||||
Comment 1 by afakhry@chromium.org
, Aug 8Components: Infra>Client>ChromeOS
Labels: -Pri-3 Pri-1
Owner: jkop@chromium.org
Status: Assigned (was: Untriaged)