New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 872391 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 4
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

VMTest timeout on asan builds

Project Member Reported by rcui@chromium.org, Aug 8

Issue description

amd64-generic-asan: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8938751840014079680

betty-asan: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8938840932943986528

Failures to connect to the VM:

Could not initiate first contact with remote host
Connection timed out during banner exchange
Connection timed out during banner exchange


 
Cc: dgarr...@chromium.org michae...@chromium.org
Components: Infra>Client>ChromeOS
Labels: -Pri-3 Pri-1
Owner: jkop@chromium.org
Status: Assigned (was: Untriaged)
Not much info provided in the logs:

bin/ctest --board=betty --type=vm --no_graphics --verbose --target_image=/b/swarming/w/ir/cache/cbuild/repository/src/build/images/betty/latest-cbuildbot/chromiumos_test_image.bin --test_results_root=/b/swarming/w/ir/cache/cbuild/repository/chroot/tmp/cbuildbotXUwgvo/smoke/test_harness --ssh_port=9228 --only_verify --suite=smoke --whitelist_chrome_crashes --ssh_private_key=/b/swarming/w/ir/cache/cbuild/repository/src/build/images/betty/latest-cbuildbot/id_rsa exited with code 1

michaelpg@ I have a vague memory that you fixed a similar issue before? Please disregard if not right.
Cc: llozano@chromium.org
 Issue 871780  has been merged into this issue.
Cc: manojgupta@chromium.org kirtika@chromium.org
Regression range as per manojgupta@: https://crosland.corp.google.com/log/10908.0.0..10911.0.0
Not that I recall, sorry.
Cc: dhanyaganesh@chromium.org
Weak Hypothesis: culprit is crrev.com/c/1145825, "cbuildbot/stages: Serialize DebugSymbols to run after UnitTest". Looking at the code changed, it doesn't appear to have any means of enabling DebugSymbols if UnitTestStage isn't run, which IIRC it isn't in VMTest-only builds.


Cc: -dhanyaganesh@chromium.org
Hypothesis disconfirmed, both UnitTestStage and DebugSymbols ran.
The same failure is still seen on both builders. Any ideas how we can fix it?

Comment 8 Deleted

Comment 9 Deleted

Comment 10 Deleted

Owner: xixuan@chromium.org
Passing on Hotlist-Deputy
Owner: zamorzaev@chromium.org
Pass to @alex, as he's recently working on VMTest.
Has any progress been made on this?
Cc: vapier@chromium.org
Labels: -Hotlist-Deputy Chase-Pending
Sorry, this bug fell through the cracks.

Looks like a similar failure happened to asan builds a few years ago: crbug/311478 

+vapier who solved this failure in crbug/311478

@vapier, do you have a guess as to why starting up/connecting to the VM is failing this time around?
if we can't ssh, then the next step is to check the system logs.  see if the system was able to offload any of them.  not 100% sure we have the ability to do so if ssh is completely dead.  in which case we usually fallback to downloading images and trying to reproduce/debug locally.

if we lack that functionality, we prob should implement it whereby we manually mount the image outside of the VM and try to extract all the logs.
How do I view system logs of a remote tryjob? Is that even possible?

Meanwhile, I'll attempt a local run.
if you look at the summary page, there are usually links to GS buckets which autotest/etc... upload.  the test team should know more.
Clarification based on some offline questions Alex asked: vapier@ is referring to the system logs from the launched CrOS VM (not the builder)

In this case, initial SSH to VM fails. So there are no logs to be had. If this failure were at a later stage, Autotest would have collected some system logs from the VM in the results directory. For the OP, the logs are at https://stainless.corp.google.com/browse/chromeos-image-archive/amd64-generic-asan/R70-10950.0.0-b2824990/vm_test_results_1/ for example. But they are empty since the builder never succeeded in connecting to the cros vm.

The only reasonable thing to do here is to download the VM image archived along with other artifacts and run it locally. Most likely, it will fail to boot.

Look at the "Artifacts" link on the build page (https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8938751840014079680) for the image. The image is named chromiumos_qemu_image.tar.xz
i've filed issue 881403 to improve the VM test framework in general so that in scenarios like this, we'd have better access to logs rather than devs having to download the multi-GB disk images and manually mounting+extracting themselves.
I downloaded the qemu image and successfully started a qemu with the image from a failing test. The process showed no error and kept running in the background. Then I tried to ssh into the running qemu. The ssh timed out with no explicit error.

One potential guess is that the problem is with private keys. 

Namely, the file /mnt/stateful_partition/etc/ssh/ssh_host_rsa_key exists inside the amd64-generic-asan image from a failing test and is empty.
The same file in a betty image from a succeeding pre-cq run is non-empty.

In the latest successful pre-cq run for amd64-generic-asan the /etc/ssh/ssh_host_rsa_key doesn't exist on any of the mount points.
(see https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940068130054724016)

Do you think this could be the issue?
you should shut down the VM then mount the stateful image and look at the logs directly
Labels: -Chase-Pending
-Chase-Pending
Scope too open, and asan builders not a Tier-1 client (don't quote me on that).
what is Tier-1 client?
ASAN builders are very important and we always try to keep them green.
Ping, VMTests are now broken for a month!
zamorzaev@: Have you been able to make any progress?

Comment 25 Deleted

Dealing with this issue requires knowledge about how to mount and work with images, and once mounted where to look for ssh logs / what to look for to see why SSH failed on the (VM) DUT.

I think Alex will need some help from CrOS team proper to get up to speed on all that. Alex, please borrow a sheriff's attention to help you mount and poke around in the image.
We poked around a bit. It looks like the kernel is stuck booting (screenshot from vnc https://screenshot.googleplex.com/v3K85Z8eLRR).

I couldn't get the serial console to give me anything useful. I don't remember how to cajole the kernel to write to serial port (cros_vm already directs the serial console from the VM to a file).
Cc: dbasehore@chromium.org
+ this week's sheriff to help us understand what asan kernel is unhappy and doesn't want to boot for VMTests.
As near as I can tell, VMTests on amd64-generic-asan have not succeeded since at least 7/25:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2782266 

The did however succeed on 7/24:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2778365

Unfortunately there were a few BuildPackages failures in between, and that was long enough ago that a change history is unlikely to be super helpful.

Hopefully we can prioritize this effort, it would be super helpful to have VMTests running on an asan chrome + chromeos again.

(This is a bit of a tangent, but there is also an argument to be made for creating chrome-asan builders, e.g., {amd64-generic|betty}-chrome-asan and only building chrome with asan, not all of chromeos; these builders could then just test chromeos with asan and be moved to a location monitored by sheriffs instead of gardeners.

Cc: sammiequon@chromium.org agawronska@chromium.org
Asan bots are still not successing as of today. Is there any ongoing work on this?
Ping. Does anyone have this?
Owner: ----
Status: Available (was: Assigned)
Sheriff or product owner needs to own this, this is not related to test infrastructure.
Owner: rcui@chromium.org
Ryan, can you look into this bug during your shadow gardener shift?

Comment 34 Deleted

Owner: amoylan@chromium.org
Andrew - can you take a look? This is a longstanding issue.
Will have a look.
I'm not sure it's asan-specific as a similar error (banner exchange failed) happens on VMTest (attempt 1) (but not attempt 2) on e.g. amd64-generic-telemetry:

cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8931723950152789536

I think I have reproduced with a locally built image. I am seeing the issue (banner exchange failed) even with VMs that boot up OK to UI (i.e. not stuck like at the screenshot in #27). Next I will try to mount the drive and look for some logs on the VM (any tips on where to poke for logs appreciated!)
Please add your notes on how you reproduced this bug if you're planning to pass on to the next gardener/shadow.
Cc: martis@chromium.org jdufault@chromium.org
Owner: martis@chromium.org
To reproduce locally:

./setup_board --board=amd64-generic --profile=asan --force && \
./build_packages --board=amd64-generic && \
./build_image '--board=amd64-generic' '--disk_layout=4gb-rootfs' test && \
./image_to_vm.sh '--board=amd64-generic' --test_image '--disk_layout=4gb-rootfs' && \
cros_vm --start --board=amd64-generic

Wait a good while, then:

ssh -p 9222 -oConnectTimeout=2 root@localhost

Also VNC to the machine to see its partially-booted state.

Notes:
1. I wasn't able to do the above with betty-asan because build_image failed with out-of-rootfs space (did not investigate that).
2. Removing "--profile=asan" seems to indeed fix the issue.
3. I think it might be necessary to tweak the VM launch options to not cache writes as I don't know a way to cleanly shut down the half-booted VM.
4. Planned next step: Do
  ./mount_gpt_image.sh --board=amd64-generic --safe --most_recent -i chromiumos_qemu_image.bin
and then look for SSH logs. But I think (3.) may need dealing with first.

Reassigning to next shadow!
> 3. I think it might be necessary to tweak the VM launch options to not cache writes as I don't know a way to cleanly shut down the half-booted VM.

cros_vm has a "--copy-on-write" option that won't persist any filesystem changes to the base image when it's launched. You can try using that.

> 4. Planned next step: Do
  ./mount_gpt_image.sh --board=amd64-generic --safe --most_recent -i chromiumos_qemu_image.bin
and then look for SSH logs. But I think (3.) may need dealing with first.

The logs will most likely be in /var/log, but everything in /var/ is in the encrypted-stateful partition by default, so would be difficult to pull out and read. You might want to rebuild the image w/ "USE=-encrypted_stateful" which I think should disable that.
martis@ have you had a chance to continue investigation here? It seems like this brought down a PFQ bot today[1] based on the log snippet

From the "betty-arcnext-chrome-pfq/R72-11213.0.0-rc2/vm_test_results_2/pfq_suite/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-40-login_VMSanity/debug/autoserv.INFO":

10/31 05:25:05.233 INFO |        server_job:0217| START	----	----	timestamp=1540988704	localtime=Oct 31 05:25:04	
10/31 05:25:05.344 INFO |        server_job:0217| 	START	login_VMSanity	login_VMSanity	timestamp=1540988704	localtime=Oct 31 05:25:04	
10/31 05:27:11.580 INFO |     ssh_multiplex:0079| Master ssh connection to 127.0.0.1 is down.
10/31 05:27:11.581 INFO |     ssh_multiplex:0096| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_rvl0wWssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227 127.0.0.1'
10/31 05:27:16.401 ERROR|             utils:2769| Timed out waiting for condition: Wait for a socket file to exist
10/31 05:27:16.401 INFO |     ssh_multiplex:0113| Timed out waiting for master-ssh connection to be established.
10/31 05:27:19.419 ERROR|             utils:0287| [stderr] ssh: connect to host 127.0.0.1 port 9227: Connection refused
10/31 05:27:19.423 INFO |        server_job:0217| 		FAIL	----	----	timestamp=1540988839	localtime=Oct 31 05:27:19	Autotest client terminated unexpectedly: DUT is pingable but not SSHable, it most likely sporadically rebooted during testing. command execution error
  * Command: 
      /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_rvl0wWssh-master/socket
      -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 9227 127.0.0.1
      "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then
      logger -tag \"autotest\" \"server[stack::_diagnose_dut|get_boot_id|run] ->
      ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat
      '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available';
      fi)\";fi; if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat
      '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi"
  Exit status: 255
  Duration: 3.00589108467
  
  stderr:
  ssh: connect to host 127.0.0.1 port 9227: Connection refused
10/31 05:27:19.423 INFO |        server_job:0217| 	END FAIL	----	----	timestamp=1540988839	localtime=Oct 31 05:27:19	

1: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931168185137309856.
I am working on investigating this, but I haven't yet made it past the work done by other engineers in this bug.
Cc: wutao@chromium.org
Owner: sa...@chromium.org
Cc: weidongg@chromium.org
Owner: updowndota@chromium.org
Any further progress on this?

Comment 47 Deleted

Cc: nverne@chromium.org
Is there any update on that?  

Comment 49 Deleted

Comment 50 Deleted

SSH doesn't work for the qemu script either:

./setup_board --board=amd64-generic --profile=asan --force
./build_packages --board=amd64-generic
./build_image --board=amd64-generic '--disk_layout=4gb-rootfs' --noenable_rootfs_verification test
cros_vm --start --image-path ../build/images/amd64-generic/R73-11383.0.2018_12_11_1240-a1/chromiumos_test_image.bin

When I run:
ssh -p9222 root@127.0.0.1
I get the following error:
ssh_exchange_identification: read: Connection reset by peer

The build string at the top right shows it as a test build and developer build. I have been trying to get the shell but sending Ctrl-alt-f2 is not bringing up a terminal,

I connected to 127.0.0.1:5900 with a VNC client and the UI doesn't show an ethernet device (this might be expected for virtio).

There is a UI element that says "Enable debugging features". Clicking that shows a screen with a button to "Proceed". Clicking "Proceed" results in an error "Something went wrong" "Debugging features were not completely enabled on this Chromium device."

if you have UI, you should be able to start a guest session and then start a crosh instance where you can run shell to get root and debug that way
I think you need:
./image_to_vm.sh --test --board=amd64-generic 
You should use the QEMU image, not the built image.
I tried ./image_to_vm.sh but it failed with:
    failed: error writing '/dev/loop3p3': No space left on device
so I ran the script with the full image.

I don't see the "Browse as Guest" option that is normally there.
Allen, did you pass '--disk_layout=4gb-rootfs' to image_to_vm.sh as well? Without it, the partitions may not have enough space.
I just repeated the same test with the image produced by:
./image_to_vm.sh --from=../build/images/amd64-generic/R73-11383.0.2018_12_11_1555-a1 --disk_layout=4gb-rootfs --board=amd64-generic --test

Nothing changed. I am seeing the same behavior.
I tried the same procedure without --profile=asan and both ssh and ctrl-alt-f2 work. Also, the ethernet shows up as connected without --profile=asan.

It would be helpful for troubleshooting if I could get a shell. I cannot get into guest mode with --profile=asan set because the "Browse as Guest" button doesn't show up without getting past the network setup screen which is blocked by the Ethernet not showing up, so I cannot get past the OOBE network setup screen to where the guest button shows up.

It also is odd to me that Ctrl-Alt-F2 doesn't work with --profile=asan set.


> It would be helpful for troubleshooting if I could get a shell

there's this doc:
  http://dev.chromium.org/chromium-os/how-tos-and-troubleshooting/serial-debugging-howto

but not sure it has entire details for getting a serial console in a VM.  it is possible though.  cros_start_vm has an --serial option.

> It also is odd to me that Ctrl-Alt-F2 doesn't work with --profile=asan set.

it relies on daemons in the system (like frecon) working, so if they aren't, then the system can easily fall down.
Is is possible to maybe disable the failing test so we can get this stage working again?  Four months of not being able to run this stage successfully is a long time...
Sounds reasonable. updowndota@ ping me if you need help doing that.
Project Member

Comment 61 by bugdroid1@chromium.org, Jan 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/crosutils/+/181be74f53fe892091a102bd16e84b5fcc574db3

commit 181be74f53fe892091a102bd16e84b5fcc574db3
Author: Manoj Gupta <manojgupta@google.com>
Date: Thu Jan 03 23:04:21 2019

mod_for_test_scripts: Fix up for sanitizer builds.

Remove the limits from the init files if it is a sanitizer
build.

BUG= chromium:872391 
TEST=amd64-generic-asan VMTests pass

Change-Id: I02e3cb650333f69d04eff631bf73e31c976e8d67
Reviewed-on: https://chromium-review.googlesource.com/1390988
Commit-Ready: Manoj Gupta <manojgupta@chromium.org>
Tested-by: Manoj Gupta <manojgupta@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[add] https://crrev.com/181be74f53fe892091a102bd16e84b5fcc574db3/mod_for_test_scripts/390removeLimitForSanitizers

Owner: manojgupta@chromium.org
Status: Fixed (was: Available)
Amd64-generic-asan VMTests fails should be fixed by https://chromium-review.googlesource.com/1390988

Comment 63 by vapier@chromium.org, Jan 19 (3 days ago)

Labels: -Restrict-View-Google

Sign in to add a comment