New issue
Advanced search Search tips

Issue 799724 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 725620
issue 810061



Sign in to add a comment

Make moblab-generic-vm-paladin faster

Project Member Reported by pprabhu@chromium.org, Jan 6 2018

Issue description

It's actually not the slowest slave consistently in the last 4-5 builds, but it can be far slower than the rest: http://shortn/_4XRJImjUH7

iirc, most of the time is spent in simply staging the image on the devserver inside moblab VM. Explore ways to make that much faster (maybe mount the working disk differently? / increase RAM/CPU dedicated to the moblab VM?)
 
Blocking: 725620
Note that moblab-generic-vm-paladin is not the slowest board right now (cave-paladin) is. But it's hovering near the slowest:

https://viceroy.corp.google.com/chromeos/build_health#_VG_Ejn1kKf8
MoblabVMTestStage has been taking roughly 1 hour 5 seconds in the last few runs.
Focussing on a particular run:
https://uberchromegw.corp.google.com/i/chromiumos/builders/moblab-generic-vm-paladin/builds/510

Roughly 6 minutes are spent waiting for moblab DUT to boot:
https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2/moblab_vm_test_results/results-1-moblab_DummyServerNoSspSuite/debug/

01/08 00:53:57.264 INFO |              test:0911| before_hook completed
01/08 00:53:57.312 WARNI|             retry:0228| <class 'autotest_lib.server.hosts.moblab_host.UpstartServiceNotRunning'>(Upstart service moblab-scheduler-init not in running state.)
0
...
01/08 00:58:53.103 WARNI|             retry:0228| <class 'autotest_lib.server.hosts.moblab_host.UpstartServiceNotRunning'>(Upstart service moblab-scheduler-init not in running state.)
01/08 00:58:53.104 WARNI|             retry:0183| Retrying in 14.016423 seconds...
01/08 00:59:23.268 ERROR|             utils:0280| [stderr] DEBUG:root:Running 'ping '192.168.231.100' -w1 -c1'

Almost all the time is spent waiting for the run_suite within moblab to complete:
01/08 00:59:54.043 INFO | test_runner_utils:0199| autoserv| [stderr] @@@STEP_LINK@Link to suite@localhost/afe/#tab_id=view_job&object_id=1@@@
01/08 01:29:54.619 INFO | test_runner_utils:0199| autoserv| [stderr] The suite job has another 23:29:59.425843 till timeout.
01/08 01:59:57.304 INFO | test_runner_utils:0199| autoserv| [stderr] 01-07-2018 [23:59:56] Suite job is finished.

Within that, of the 60 minutes: 

Provision:
https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2/moblab_vm_test_results/results-1-moblab_DummyServerNoSspSuite/sysinfo/mnt/moblab/results/hosts/192.168.231.100/2-provision/

33 minutes are spent waiting for various artifacts to be staged:

01/07 23:00:24.888 INFO |        dev_server:0707| Getting devservers for host: 192.168.231.100
01/07 23:00:24.959 INFO |        dev_server:1153| Staging artifacts on devserver http://192.168.231.1:8080: build=moblab-generic-vm-paladin/R65-10289.0.0-rc2, artifacts=['full_payload', 'stateful', 'autotest_packages'], files=, archive_url=gs://chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2
01/07 23:22:19.819 INFO |        dev_server:1171| Finished staging artifacts: build=moblab-generic-vm-paladin/R65-10289.0.0-rc2, artifacts=['full_payload', 'stateful', 'autotest_packages'], files=, archive_url=gs://chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2
01/07 23:22:19.824 INFO |        dev_server:1153| Staging artifacts on devserver http://192.168.231.1:8080: build=moblab-generic-vm-paladin/R65-10289.0.0-rc2, artifacts=['quick_provision'], files=, archive_url=gs://chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2
01/07 23:33:18.258 INFO |        dev_server:1171| Finished staging artifacts: build=moblab-generic-vm-paladin/R65-10289.0.0-rc2, artifacts=['quick_provision'], files=, archive_url=gs://chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2
01/07 23:33:18.262 INFO |        dev_server:1489| Requesting contents from devserver http://192.168.231.1:8080 for image moblab-generic-vm-paladin/R65-10289.0.0-rc2

10 minutes are spent for the actual rootfs update:

01/07 23:39:22.209 ERROR|             utils:0280| [stderr] [0107/233922:INFO:update_engine_client.cc(517)] Waiting for update to complete.
01/07 23:49:17.975 ERROR|             utils:0280| [stderr] [0107/234918:INFO:update_engine_client.cc(239)] Update succeeded -- reboot needed.
01/07 23:49:18.186 ERROR|             utils:0280| [stderr] [0107/234918:INFO:update_engine_client.cc(508)] Querying Update Engine status...

5 minutes are then wasted waiting for AFE to come up:

01/07 23:34:19.851 INFO |     ssh_multiplex:0076| Master ssh connection to 192.168.231.100 is down.
01/07 23:34:19.853 INFO |     ssh_multiplex:0092| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_Mg93uessh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.100'
01/07 23:34:20.189 INFO |        server_job:0218| 		END GOOD	----	reboot	kernel=4.4.109-12391-g6e7366951a2e-dirty	localtime=Jan 07 23:34:20	timestamp=1515396860	
01/07 23:39:20.213 ERROR|       moblab_host:0163| DUT has rebooted but AFE has failed to load: Moblab AFE is not responding: Timeout occurred- waited 300.0 seconds.
01/07 23:39:20.216 ERROR|       moblab_host:0164| Ignoring this error condition
01/07 23:39:20.326 INFO |       autoupdater:0543| Updating from version 10289.0.0-rc2 to R65-10289.0.0-rc2.

(we know that moblab VM takes ~10 minutes for the AFE to come up, so this is not surprising.
This error is ignored, so we should just stop waiting for this condition here)

2 minutes are spent for the actual stateful update:

01/07 23:49:18.193 INFO |       autoupdater:0516| Updating stateful partition...
01/07 23:51:40.087 INFO |       autoupdater:0577| Update complete.

5 minutes are again wasted for AFE to come up after a reboot. Again, we know this is bound to fail, and is ignored:

01/07 23:52:47.690 INFO |        server_job:0218| 		END GOOD	----	reboot	kernel=4.4.109-12391-g6e7366951a2e-dirty	localtime=Jan 07 23:52:47	timestamp=1515397967	
01/07 23:57:47.718 ERROR|       moblab_host:0163| DUT has rebooted but AFE has failed to load: Moblab AFE is not responding: Timeout occurred- waited 300.0 seconds.
01/07 23:57:47.720 ERROR|       moblab_host:0164| Ignoring this error condition

Almost no time is spent in running the dummy server test.
We can easily shave off 10 minutes by removing the always-failing and useless AFE check on reboot during provision.

Next up is figuring out a way to speed up the staging of the image on devserver. We're staging the image that was built on the same builder and then uploaded to GS just so that moblab can stage it back from GS.
This is actually very wasteful. The reasoning behind doing it this way was:
1. to get us off the ground
2. to really test the devserver staging flow.

If we're willing to give up on the coverage of the devserver staging flow, we can just populate the moblab disk with the required image before running the test. (and update MoblabTest to not delete that directory, 'cause it does right now to ensure hermetic testing)

This will save us time in multiple ways:
- we won't have to upload test artifacts to GS, and wait for the upload to finish before running test.
- current staging is slow, I suspect, because the moblab-internet bridge is slirp, which is really bad.
- it may also be slow because the virtual disk writes are not optimized.

Status: Started (was: Assigned)
#5 landed, and shaved 10 minutes off the build. Builder is still hovering ~2 hours build time.
Cc: akes...@chromium.org
Aviv I would like your opinion here. I can shave off nearly 30 minutes from this stage if I pre-populate the devserver cache on moblab before running the test.

This will open up some critical headroom wrt timeout, allowing us to retry some common flakes and make this integration test more robust. (See  issue 801049 ).

We will lose coverage of the fetching part of devserver itself. Provision flow will still be tested, but the requested image / artifacts will always be found to be staged.
Re #7, note that this will also eliminate a GS roundtrip the builder is currently doing -- it uploads the test artifacts for the current build to GS, waits for that archive step to complete and then runs the test. This allows the devserver running inside moblab to stage the test artifacts (that are, ahem, already available on disk in the builder) from GS to test the full flow.

I'd drop this whole dance, also reducing the upload/download from the builder.
> We will lose coverage of the fetching part of devserver itself. Provision flow will still be tested, but the requested image / artifacts will always be found to be staged.


That sounds fine, the moblab vmtest is mainly for testing autotest infra code not devserver code. The loss of coverage sounds like it is worth the performance gain to me.
I'm exploring an approach here using 'cros stage' to pre-stage the image on moblab.

There are some fixes needed in cros stage, but more importantly, one run shows that ~30 minutes is spent just copying the staging artifacts *locally from one path to another inside the moblab VM*. This is clearly io-bound (and in this case thoroughly unnecessary, the files can be moved instead).

11:21:55: DEBUG: RunCommand: ssh -p 11072 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpeAHH1i/testing_rsa root@localhost -- curl --fail "'http://localhost:8080/stage?local_path=/mnt/moblab/static/tmp/moblab-generic-vm-custom/R65-10218.0.2017_12_15_1354&artifacts=full_payload,stateful,test_suites,control_files,autotest_packages,autotest_server_package'"
11:53:27: INFO: 

Staging Completed!

oh boy, even rsync'ing the image over can take ~33 mins depending on the VM CPU consumption. We should switch to scp'ing images over, as we've always known we should do.
I have partial success in: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8956664733248973424

The two remaining issues are:
- Due to https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/823231 we are always staging the quick_update script. This script is not staged by 'cros stage', so it gets stage from GS, forcing us back to waiting for the long slow staging.

The fix here should be to not stage quick_provision unless it's needed (turns out the artifacts needed are quite heavy. Not just the quick_provision script, also all the unpacked partitions are needed)

Also the build there failed because of autoupdate failure which I can't quite explain yet.
This CL seems to be failing the pre-cq due to timeouts that I can only relate to slowness in staging: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/896026

:(

Need to make time for this bug.
Also, the paladin has again been the slowest often in the last week: http://shortn/_IxWDsI5iOf

What happened? GS got slower on us?
Blocking: 810061
Project Member

Comment 16 by bugdroid1@chromium.org, Feb 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/4f62d7acac949b751d0b73c26a7ebb96e4feeb45

commit 4f62d7acac949b751d0b73c26a7ebb96e4feeb45
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Feb 14 01:40:59 2018

lib: Add a contextmanager to mount moblab external disk in MoblabVm.

BUG= chromium:799724 
TEST=(1) unittests.
     (2) create, start, destroy a moblabvm.

Change-Id: Icf8a22eee31c449ba2d09e5208bb0574ab503317
Reviewed-on: https://chromium-review.googlesource.com/882169
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/4f62d7acac949b751d0b73c26a7ebb96e4feeb45/lib/moblab_vm.py
[modify] https://crrev.com/4f62d7acac949b751d0b73c26a7ebb96e4feeb45/lib/moblab_vm_unittest.py

I'm stuck at a very interesting point.
I have things _almost_ working here: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8954681119935437552
(Look at the large number of CLs there).

But, the actual update keeps failing.
I followed it through to devserver trying to compute some hashes on the file to server in response to the updatecheck call.

In this function call: https://cs.corp.google.com/chromeos_public/src/platform/dev/autoupdate.py?q=file:platform/dev/autoupdate.py&sq=package:chromeos&dr&l=861
At this line: https://cs.corp.google.com/chromeos_public/src/platform/dev/autoupdate.py?q=file:platform/dev/autoupdate.py&sq=package:chromeos&dr&l=942
Basically that call never finishes because the sha1 computation here: https://cs.corp.google.com/chromeos_public/src/platform/dev/common_util.py?l=275 never finishes.

I tried just computing that hash on a local moblabvm from the failing build: The first time I tried this: 

localhost tmp # cat doit.py                                                                           │
import common_util                                                                                    │
print common_util.GetFileSha1('/mnt/moblab/static/trybot-moblab-generic-vm-paladin-tryjob/R66-10401.0.│
0-b5417/update.gz')

It took 10 minutes! The update_engine on the sub-DUT being provisioned tries twice, and times out each time at ~3 minutes.

When I retried the same thing again (in the same VM):
localhost tmp # time PYTHONPATH=/usr/lib/devserver python doit.py                                     │
REDACTED                                                                                              │
                                                                                                      │
real    0m1.971s                                                                                      │
user    0m1.706s                                                                                      │
sys     0m0.264s

I took ~ 1 second.

-------
Note that even when staging from gs, the first attempt at updatecheck hangs (for ~3 mintues), the retry from the DUT succeeds:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8954664378470805200

===========

Basically, what is happening is 
- we copied the files on the moblab's external disk before running the vm
- we then staged the files by just mv'ing the files over. This only modified the inodes on the disk
- Finally, to compute the hashes, we had to load the whole file in memory. This took ~10 minutes. Whoops.

In fact, if you wait around after this attempt for a bit, and re-run provision, it succeeds, because the file is cached in memory.
If we change the way we stage to copying instead of mv'ing, it will force the whole file to be loaded in memory during staging, and "fix this problem".
But then we'll pay the penalty of 10 minutes in staging, which we're trying to avoid here.
Yeah, read disk io is terrible:

host:
pprabhu@pprabhu:chromiumos$ sudo dd if=/dev/sdc1 of=/dev/null bs=1M count=1024 iflag=direct skip=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.99456 s, 269 MB/s

guest:
localhost tmp # sudo dd if=/dev/sdb of=/dev/null bs=1M count=32 iflag=direct skip=1024
32+0 records in
32+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 26.8975 s, 1.2 MB/s

-----

This is not exactly the same io benchmark as the actual application, but is representative, since we're reading sequentially from an fd for the hash computation (granted, the file could be not stored in sequential blocks, but given this is in a brand new disk, I'd expect somewhat contiguous allocation) 
Using virtio for the external disk: https://fedoraproject.org/wiki/Features/virtio-scsi

https://chromium-review.googlesource.com/c/chromiumos/chromite/+/917962/

disk io goes up through the window:
localhost ~ # sudo dd if=/dev/sdb of=/dev/null bs=1M count=32 iflag=direct skip=1024
32+0 records in
32+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 0.111606 s, 301 MB/s

Project Member

Comment 21 by bugdroid1@chromium.org, Feb 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/dev-util/+/58d08930b96bd00770799e1c149d852948ac2ea4

commit 58d08930b96bd00770799e1c149d852948ac2ea4
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Feb 14 05:15:43 2018

stage: Add argument to move files locally instead of copying.

In some use cases of staging files from a local path, we begin by
preparing the files to stage in a tempdir, we then stage these files via
the RPC and immediately delete the source file.
This CL adds an argument to the stage RPC to be used when staging local
files that directs us to simply move the files instead of copying.

This is especially useful when the devserver is running in a VM (moblab
VM) where the staging time goes down from ~30 minutes to 3 minutes after
this change.

BUG= chromium:799724 
TEST=Run 'cros stage' on a moblab vm.

Change-Id: Ie1206ae9ccb25802cc241b25e3a611e46893b16c
Reviewed-on: https://chromium-review.googlesource.com/877387
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Congbin Guo <guocb@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/58d08930b96bd00770799e1c149d852948ac2ea4/downloader.py
[modify] https://crrev.com/58d08930b96bd00770799e1c149d852948ac2ea4/devserver.py

Project Member

Comment 22 by bugdroid1@chromium.org, Feb 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/dev-util/+/bee63be83f2eca375257d477af50bf3ab7268038

commit bee63be83f2eca375257d477af50bf3ab7268038
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Feb 14 05:15:44 2018

Make LocalDownloader play well with MultiArtifact.

CL:751790 added support for artifacts that involve downloading multiple
files, but LocalDownloader was not updated to support it. FixIt.

BUG=chromium:779870
BUG= chromium:799724 
TEST=quick-provision artifacts are staged correctly in moblab-vm.

Change-Id: I5f93db6d4371839f4def36a1bdc670df3fe6ee44
Reviewed-on: https://chromium-review.googlesource.com/912777
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: David Riley <davidriley@chromium.org>
Reviewed-by: Congbin Guo <guocb@chromium.org>

[modify] https://crrev.com/bee63be83f2eca375257d477af50bf3ab7268038/downloader.py

Project Member

Comment 23 by bugdroid1@chromium.org, Feb 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/1d26c79c517b229e66ee627de22435a81f03d6a8

commit 1d26c79c517b229e66ee627de22435a81f03d6a8
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Feb 14 05:16:22 2018

cbuildbot: Collect autotest tarball commands into one.

These commands are all required to build the autotest tarballs required
for running hwtests. These are already called from two places, and soon
there will be a third. So, collect them into a helper.

BUG= chromium:799724 
TEST=unittests.

Change-Id: I77f9878cbf5e76ba2a8da9bad2cfb1dba03d674b
Reviewed-on: https://chromium-review.googlesource.com/884821
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/1d26c79c517b229e66ee627de22435a81f03d6a8/cbuildbot/commands.py
[modify] https://crrev.com/1d26c79c517b229e66ee627de22435a81f03d6a8/cli/cros/cros_stage.py
[modify] https://crrev.com/1d26c79c517b229e66ee627de22435a81f03d6a8/cbuildbot/stages/artifact_stages.py

Project Member

Comment 24 by bugdroid1@chromium.org, Feb 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/5607e1eec5ce859d26a387da0c8e7454a0addb7c

commit 5607e1eec5ce859d26a387da0c8e7454a0addb7c
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Feb 14 05:16:46 2018

moblab_RunSuite: Move and document devserver cleaning option.

Nothing uses this flag yet, but MoblabVMTestStage is slated to start
using it. This is a moblab_RunSuite feature really -- the test wants to
ensure that devserver staging is tested (or not), so move it there.
Also document the option better.

BUG= chromium:799724 
TEST=moblab-generic-vm-pre-cq w/wo the new option passed in.

Change-Id: Ia5d42b9be15cd58b1614a6dd0174d6bacd1d6dd7
Reviewed-on: https://chromium-review.googlesource.com/865276
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Keith Haddow <haddowk@chromium.org>

[modify] https://crrev.com/5607e1eec5ce859d26a387da0c8e7454a0addb7c/server/site_tests/moblab_RunSuite/control.dummyServer
[modify] https://crrev.com/5607e1eec5ce859d26a387da0c8e7454a0addb7c/server/site_tests/moblab_RunSuite/moblab_RunSuite.py
[modify] https://crrev.com/5607e1eec5ce859d26a387da0c8e7454a0addb7c/server/site_tests/moblab_RunSuite/control.dummyServerNoSsp
[modify] https://crrev.com/5607e1eec5ce859d26a387da0c8e7454a0addb7c/server/hosts/moblab_host.py
[modify] https://crrev.com/5607e1eec5ce859d26a387da0c8e7454a0addb7c/server/site_tests/moblab_RunSuite/control.smoke

Project Member

Comment 25 by bugdroid1@chromium.org, Feb 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1affd71e99ea67d56e4d3bc794ca8164957c2400

commit 1affd71e99ea67d56e4d3bc794ca8164957c2400
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Feb 14 05:16:48 2018

moblab: Permit local image_storage_server paths.

BUG= chromium:799724 
TEST=moblab-generic-vm-paladin-tryjob

Change-Id: I8e61e5cb23d155bf6890890012064b887d783667
Reviewed-on: https://chromium-review.googlesource.com/887422
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Keith Haddow <haddowk@chromium.org>

[modify] https://crrev.com/1affd71e99ea67d56e4d3bc794ca8164957c2400/server/cros/moblab_test.py

Project Member

Comment 26 by bugdroid1@chromium.org, Feb 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/c2fb94ef330e91588cd548d71b0ed6f4229e7eb6

commit c2fb94ef330e91588cd548d71b0ed6f4229e7eb6
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Feb 14 05:17:13 2018

autotest: Allow staging artifacts from local path on devserver.

When staging from local path, the client library sets
delete_source=True. The only reason a test would request staging from a
local path on the devserver is if the payloads were earlier copied in
expressly for that purpose. delete_source greatly speeds up staging,
especially for devsever running in a moblab VM.

BUG= chromium:799724 
TEST=moblab-generic-vm-paladin-tryjob

Change-Id: I3374b15bbc9f9600baafc9b49824d5f2765c9571
Reviewed-on: https://chromium-review.googlesource.com/889993
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/c2fb94ef330e91588cd548d71b0ed6f4229e7eb6/client/common_lib/cros/dev_server.py

Ah! Sweet success!
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8954639586631423952

Finally. :)

MoblabVMTestStage is down to 23 minutes from > 90 minutes.
(Two more CLs to land before declaring victory here.
+ revert the timeout bump for issue 810061)
Cc: xixuan@chromium.org pprabhu@chromium.org
 Issue 801049  has been merged into this issue.
Project Member

Comment 30 by bugdroid1@chromium.org, Feb 15 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/155e0273894d7b6babcd3bfbcbd6648574a0d451

commit 155e0273894d7b6babcd3bfbcbd6648574a0d451
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Feb 15 08:11:06 2018

project-moblab: delay devserver cleanup && run in verbose mode.

When debugging moblab-vm failures, we often download an archived
moblabvm workspace. Since the build may have run 1+ day ago, this
results in the cleanup script deleting the staged images immediately on
boot, making debugging hard.

Instead, delay the initial cleanup a bit. On any long running moblab
device, the initial delay should be irrelevant, but it provides a window
to debug vm failures. Also, make the cleanup verbose to leave
breadcrumbs when we do clean up the devserver cache.

BUG= chromium:799724 
TEST=local moblabvm run.

Change-Id: I9bcb391304cfc462372cc417e866a63f78feaae4
Reviewed-on: https://chromium-review.googlesource.com/915213
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[rename] https://crrev.com/155e0273894d7b6babcd3bfbcbd6648574a0d451/project-moblab/chromeos-base/chromeos-bsp-moblab/chromeos-bsp-moblab-0.0.5-r84.ebuild
[modify] https://crrev.com/155e0273894d7b6babcd3bfbcbd6648574a0d451/project-moblab/chromeos-base/chromeos-bsp-moblab/files/init/moblab-devserver-cleanup-init.conf

Project Member

Comment 31 by bugdroid1@chromium.org, Feb 16 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/fb7f078446dee8c8788d65b767d004ba02e5056c

commit fb7f078446dee8c8788d65b767d004ba02e5056c
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Feb 16 01:42:20 2018

lib: Use virtio for moblabvm's external disk.

Before this CL, we were attaching moblab's external disk as a usb-disk.
This CL changes it to be a scsi disk using a virtio driver.

For a sequential read benchmark, this gives us > 200x speedup.

Before:
localhost tmp # sudo dd if=/dev/sdb of=/dev/null bs=1M count=32 \
    iflag=direct skip=1024
32+0 records in
32+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 26.8975 s, 1.2 MB/s

After:
localhost ~ # sudo dd if=/dev/sdb of=/dev/null bs=1M count=32 \
    iflag=direct skip=1024
32+0 records in
32+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 0.111606 s, 301 MB/s

Practically, this makes the devserver running on the moblab VM usable:
During an update running off the devserver, the chromiumos image cached
on the external disk needs to be loaded in memory for some metadata
computations. Without this CL, this takes too long (~ 10 minutes) and
the sub-DUT's upate_engine times out (on the updatecheck call). With
this CL, this operation takes ~2 seconds.

BUG= chromium:799724 
TEST=(1) Manual sequence read test as above.
     (2) Manual sha1 computation on a chromiumos_test_image.bin from a
         cold cache (post reboot).
     (3) moblab-vm-test passes with a cold cache (as after CL:882170).

Change-Id: I553d57fd2cda8938fddf5bf7f370682ad3d9a88b
Reviewed-on: https://chromium-review.googlesource.com/917962
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/fb7f078446dee8c8788d65b767d004ba02e5056c/lib/moblab_vm.py

Project Member

Comment 32 by bugdroid1@chromium.org, Feb 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/7090d1c44f3fdb902a6eb5039b1fbaad56505d0f

commit 7090d1c44f3fdb902a6eb5039b1fbaad56505d0f
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Sat Feb 17 02:47:45 2018

cbuildbot: Locally stage artifacts on devserver inside moblab.

Before this CL, MoblabVMTestStage would upload test artifacts to GS.
Then, the test itself would cause the devserver inside moblab VM to
stage these same artifacts from GS. This was very slow (time spent
waiting for artifact upload + 30 minutes (anecdotal average) for the
staging operation).

We now pre-generate the test artifacts inside the external disk used by
the devserver in moblab as local cache. Also, we instruct the
moblab_RunSuite test to force the local cache to be used as the source
of test artifacts. This cuts GS out of the loop.

With CL:917962, this reduces the MoblabVmTestStage time from ~100
minutes to < 30 minutes.

BUG= chromium:799724 
TEST=moblab-generic-vm-pre-cq

Change-Id: Ic68c3acc16702a722673e29ee5164f50d8b1b442
Reviewed-on: https://chromium-review.googlesource.com/882170
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/7090d1c44f3fdb902a6eb5039b1fbaad56505d0f/cbuildbot/stages/vm_test_stages.py
[modify] https://crrev.com/7090d1c44f3fdb902a6eb5039b1fbaad56505d0f/cbuildbot/stages/vm_test_stages_unittest.py

Status: Fixed (was: Started)
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8954376999108179040

The build time is now hovering under 100 minutes. That's good enough for this bug.
http://shortn/_4Gw8Wt17Ia
1 day window graph of time taken by this bot: http://shortn/_vujiMi3com
Project Member

Comment 35 by bugdroid1@chromium.org, Feb 20 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/9ea1bef9d36ef0cc537384f62b6c91aa6d3af0a3

commit 9ea1bef9d36ef0cc537384f62b6c91aa6d3af0a3
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue Feb 20 21:14:27 2018

Revert "cbuildbot: (workaround) Bump up MoblabVMTestStage timeout."

This reverts commit 8cd474693924be0c335e7e4d1b585f64266fd9c2.

Reason for revert: Root cause fixed.

Original change's description:
> cbuildbot: (workaround) Bump up MoblabVMTestStage timeout.
>
> This is just a hot-fix while we try to determine why this stage is
> taking so long on the pre-cq.
>
> After this CL, the pre-cq should pass again, albeit taking longer.
> Degraded performance is better than none.
>
> BUG=chromium:810061
> TEST=Run moblab-generic-vm-pre-cq with this CL, watch it pass.
>
> Change-Id: I483a4d3b53c7c04c50057831ce9b87c2e97162bc
> Reviewed-on: https://chromium-review.googlesource.com/907705
> Reviewed-by: Xixuan Wu <xixuan@chromium.org>
> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>

BUG=chromium:810061
BUG= chromium:799724 

Change-Id: I6703922188d62983158f73f2128fc7d1db95d447
Reviewed-on: https://chromium-review.googlesource.com/924664
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/9ea1bef9d36ef0cc537384f62b6c91aa6d3af0a3/cbuildbot/stages/vm_test_stages.py

Sign in to add a comment