Make moblab-generic-vm-paladin faster |
|||||
Issue descriptionIt's actually not the slowest slave consistently in the last 4-5 builds, but it can be far slower than the rest: http://shortn/_4XRJImjUH7 iirc, most of the time is spent in simply staging the image on the devserver inside moblab VM. Explore ways to make that much faster (maybe mount the working disk differently? / increase RAM/CPU dedicated to the moblab VM?)
,
Jan 8 2018
Note that moblab-generic-vm-paladin is not the slowest board right now (cave-paladin) is. But it's hovering near the slowest: https://viceroy.corp.google.com/chromeos/build_health#_VG_Ejn1kKf8
,
Jan 9 2018
MoblabVMTestStage has been taking roughly 1 hour 5 seconds in the last few runs. Focussing on a particular run: https://uberchromegw.corp.google.com/i/chromiumos/builders/moblab-generic-vm-paladin/builds/510 Roughly 6 minutes are spent waiting for moblab DUT to boot: https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2/moblab_vm_test_results/results-1-moblab_DummyServerNoSspSuite/debug/ 01/08 00:53:57.264 INFO | test:0911| before_hook completed 01/08 00:53:57.312 WARNI| retry:0228| <class 'autotest_lib.server.hosts.moblab_host.UpstartServiceNotRunning'>(Upstart service moblab-scheduler-init not in running state.) 0 ... 01/08 00:58:53.103 WARNI| retry:0228| <class 'autotest_lib.server.hosts.moblab_host.UpstartServiceNotRunning'>(Upstart service moblab-scheduler-init not in running state.) 01/08 00:58:53.104 WARNI| retry:0183| Retrying in 14.016423 seconds... 01/08 00:59:23.268 ERROR| utils:0280| [stderr] DEBUG:root:Running 'ping '192.168.231.100' -w1 -c1' Almost all the time is spent waiting for the run_suite within moblab to complete: 01/08 00:59:54.043 INFO | test_runner_utils:0199| autoserv| [stderr] @@@STEP_LINK@Link to suite@localhost/afe/#tab_id=view_job&object_id=1@@@ 01/08 01:29:54.619 INFO | test_runner_utils:0199| autoserv| [stderr] The suite job has another 23:29:59.425843 till timeout. 01/08 01:59:57.304 INFO | test_runner_utils:0199| autoserv| [stderr] 01-07-2018 [23:59:56] Suite job is finished. Within that, of the 60 minutes: Provision: https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2/moblab_vm_test_results/results-1-moblab_DummyServerNoSspSuite/sysinfo/mnt/moblab/results/hosts/192.168.231.100/2-provision/ 33 minutes are spent waiting for various artifacts to be staged: 01/07 23:00:24.888 INFO | dev_server:0707| Getting devservers for host: 192.168.231.100 01/07 23:00:24.959 INFO | dev_server:1153| Staging artifacts on devserver http://192.168.231.1:8080: build=moblab-generic-vm-paladin/R65-10289.0.0-rc2, artifacts=['full_payload', 'stateful', 'autotest_packages'], files=, archive_url=gs://chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2 01/07 23:22:19.819 INFO | dev_server:1171| Finished staging artifacts: build=moblab-generic-vm-paladin/R65-10289.0.0-rc2, artifacts=['full_payload', 'stateful', 'autotest_packages'], files=, archive_url=gs://chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2 01/07 23:22:19.824 INFO | dev_server:1153| Staging artifacts on devserver http://192.168.231.1:8080: build=moblab-generic-vm-paladin/R65-10289.0.0-rc2, artifacts=['quick_provision'], files=, archive_url=gs://chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2 01/07 23:33:18.258 INFO | dev_server:1171| Finished staging artifacts: build=moblab-generic-vm-paladin/R65-10289.0.0-rc2, artifacts=['quick_provision'], files=, archive_url=gs://chromeos-image-archive/moblab-generic-vm-paladin/R65-10289.0.0-rc2 01/07 23:33:18.262 INFO | dev_server:1489| Requesting contents from devserver http://192.168.231.1:8080 for image moblab-generic-vm-paladin/R65-10289.0.0-rc2 10 minutes are spent for the actual rootfs update: 01/07 23:39:22.209 ERROR| utils:0280| [stderr] [0107/233922:INFO:update_engine_client.cc(517)] Waiting for update to complete. 01/07 23:49:17.975 ERROR| utils:0280| [stderr] [0107/234918:INFO:update_engine_client.cc(239)] Update succeeded -- reboot needed. 01/07 23:49:18.186 ERROR| utils:0280| [stderr] [0107/234918:INFO:update_engine_client.cc(508)] Querying Update Engine status... 5 minutes are then wasted waiting for AFE to come up: 01/07 23:34:19.851 INFO | ssh_multiplex:0076| Master ssh connection to 192.168.231.100 is down. 01/07 23:34:19.853 INFO | ssh_multiplex:0092| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_Mg93uessh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.100' 01/07 23:34:20.189 INFO | server_job:0218| END GOOD ---- reboot kernel=4.4.109-12391-g6e7366951a2e-dirty localtime=Jan 07 23:34:20 timestamp=1515396860 01/07 23:39:20.213 ERROR| moblab_host:0163| DUT has rebooted but AFE has failed to load: Moblab AFE is not responding: Timeout occurred- waited 300.0 seconds. 01/07 23:39:20.216 ERROR| moblab_host:0164| Ignoring this error condition 01/07 23:39:20.326 INFO | autoupdater:0543| Updating from version 10289.0.0-rc2 to R65-10289.0.0-rc2. (we know that moblab VM takes ~10 minutes for the AFE to come up, so this is not surprising. This error is ignored, so we should just stop waiting for this condition here) 2 minutes are spent for the actual stateful update: 01/07 23:49:18.193 INFO | autoupdater:0516| Updating stateful partition... 01/07 23:51:40.087 INFO | autoupdater:0577| Update complete. 5 minutes are again wasted for AFE to come up after a reboot. Again, we know this is bound to fail, and is ignored: 01/07 23:52:47.690 INFO | server_job:0218| END GOOD ---- reboot kernel=4.4.109-12391-g6e7366951a2e-dirty localtime=Jan 07 23:52:47 timestamp=1515397967 01/07 23:57:47.718 ERROR| moblab_host:0163| DUT has rebooted but AFE has failed to load: Moblab AFE is not responding: Timeout occurred- waited 300.0 seconds. 01/07 23:57:47.720 ERROR| moblab_host:0164| Ignoring this error condition Almost no time is spent in running the dummy server test.
,
Jan 9 2018
We can easily shave off 10 minutes by removing the always-failing and useless AFE check on reboot during provision. Next up is figuring out a way to speed up the staging of the image on devserver. We're staging the image that was built on the same builder and then uploaded to GS just so that moblab can stage it back from GS. This is actually very wasteful. The reasoning behind doing it this way was: 1. to get us off the ground 2. to really test the devserver staging flow. If we're willing to give up on the coverage of the devserver staging flow, we can just populate the moblab disk with the required image before running the test. (and update MoblabTest to not delete that directory, 'cause it does right now to ensure hermetic testing) This will save us time in multiple ways: - we won't have to upload test artifacts to GS, and wait for the upload to finish before running test. - current staging is slow, I suspect, because the moblab-internet bridge is slirp, which is really bad. - it may also be slow because the virtual disk writes are not optimized.
,
Jan 9 2018
CL for the lowest hanging fruit: https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/855199
,
Jan 11 2018
#5 landed, and shaved 10 minutes off the build. Builder is still hovering ~2 hours build time.
,
Jan 11 2018
Aviv I would like your opinion here. I can shave off nearly 30 minutes from this stage if I pre-populate the devserver cache on moblab before running the test. This will open up some critical headroom wrt timeout, allowing us to retry some common flakes and make this integration test more robust. (See issue 801049 ). We will lose coverage of the fetching part of devserver itself. Provision flow will still be tested, but the requested image / artifacts will always be found to be staged.
,
Jan 11 2018
Re #7, note that this will also eliminate a GS roundtrip the builder is currently doing -- it uploads the test artifacts for the current build to GS, waits for that archive step to complete and then runs the test. This allows the devserver running inside moblab to stage the test artifacts (that are, ahem, already available on disk in the builder) from GS to test the full flow. I'd drop this whole dance, also reducing the upload/download from the builder.
,
Jan 11 2018
> We will lose coverage of the fetching part of devserver itself. Provision flow will still be tested, but the requested image / artifacts will always be found to be staged. That sounds fine, the moblab vmtest is mainly for testing autotest infra code not devserver code. The loss of coverage sounds like it is worth the performance gain to me.
,
Jan 19 2018
I'm exploring an approach here using 'cros stage' to pre-stage the image on moblab. There are some fixes needed in cros stage, but more importantly, one run shows that ~30 minutes is spent just copying the staging artifacts *locally from one path to another inside the moblab VM*. This is clearly io-bound (and in this case thoroughly unnecessary, the files can be moved instead). 11:21:55: DEBUG: RunCommand: ssh -p 11072 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpeAHH1i/testing_rsa root@localhost -- curl --fail "'http://localhost:8080/stage?local_path=/mnt/moblab/static/tmp/moblab-generic-vm-custom/R65-10218.0.2017_12_15_1354&artifacts=full_payload,stateful,test_suites,control_files,autotest_packages,autotest_server_package'" 11:53:27: INFO: Staging Completed!
,
Jan 19 2018
oh boy, even rsync'ing the image over can take ~33 mins depending on the VM CPU consumption. We should switch to scp'ing images over, as we've always known we should do.
,
Jan 23 2018
I have partial success in: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8956664733248973424 The two remaining issues are: - Due to https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/823231 we are always staging the quick_update script. This script is not staged by 'cros stage', so it gets stage from GS, forcing us back to waiting for the long slow staging. The fix here should be to not stage quick_provision unless it's needed (turns out the artifacts needed are quite heavy. Not just the quick_provision script, also all the unpacked partitions are needed) Also the build there failed because of autoupdate failure which I can't quite explain yet.
,
Feb 7 2018
This CL seems to be failing the pre-cq due to timeouts that I can only relate to slowness in staging: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/896026 :( Need to make time for this bug.
,
Feb 7 2018
Also, the paladin has again been the slowest often in the last week: http://shortn/_IxWDsI5iOf What happened? GS got slower on us?
,
Feb 7 2018
,
Feb 14 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/4f62d7acac949b751d0b73c26a7ebb96e4feeb45 commit 4f62d7acac949b751d0b73c26a7ebb96e4feeb45 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Feb 14 01:40:59 2018 lib: Add a contextmanager to mount moblab external disk in MoblabVm. BUG= chromium:799724 TEST=(1) unittests. (2) create, start, destroy a moblabvm. Change-Id: Icf8a22eee31c449ba2d09e5208bb0574ab503317 Reviewed-on: https://chromium-review.googlesource.com/882169 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/4f62d7acac949b751d0b73c26a7ebb96e4feeb45/lib/moblab_vm.py [modify] https://crrev.com/4f62d7acac949b751d0b73c26a7ebb96e4feeb45/lib/moblab_vm_unittest.py
,
Feb 14 2018
I'm stuck at a very interesting point. I have things _almost_ working here: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8954681119935437552 (Look at the large number of CLs there). But, the actual update keeps failing. I followed it through to devserver trying to compute some hashes on the file to server in response to the updatecheck call. In this function call: https://cs.corp.google.com/chromeos_public/src/platform/dev/autoupdate.py?q=file:platform/dev/autoupdate.py&sq=package:chromeos&dr&l=861 At this line: https://cs.corp.google.com/chromeos_public/src/platform/dev/autoupdate.py?q=file:platform/dev/autoupdate.py&sq=package:chromeos&dr&l=942 Basically that call never finishes because the sha1 computation here: https://cs.corp.google.com/chromeos_public/src/platform/dev/common_util.py?l=275 never finishes. I tried just computing that hash on a local moblabvm from the failing build: The first time I tried this: localhost tmp # cat doit.py │ import common_util │ print common_util.GetFileSha1('/mnt/moblab/static/trybot-moblab-generic-vm-paladin-tryjob/R66-10401.0.│ 0-b5417/update.gz') It took 10 minutes! The update_engine on the sub-DUT being provisioned tries twice, and times out each time at ~3 minutes. When I retried the same thing again (in the same VM): localhost tmp # time PYTHONPATH=/usr/lib/devserver python doit.py │ REDACTED │ │ real 0m1.971s │ user 0m1.706s │ sys 0m0.264s I took ~ 1 second. ------- Note that even when staging from gs, the first attempt at updatecheck hangs (for ~3 mintues), the retry from the DUT succeeds: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8954664378470805200 =========== Basically, what is happening is - we copied the files on the moblab's external disk before running the vm - we then staged the files by just mv'ing the files over. This only modified the inodes on the disk - Finally, to compute the hashes, we had to load the whole file in memory. This took ~10 minutes. Whoops. In fact, if you wait around after this attempt for a bit, and re-run provision, it succeeds, because the file is cached in memory.
,
Feb 14 2018
If we change the way we stage to copying instead of mv'ing, it will force the whole file to be loaded in memory during staging, and "fix this problem". But then we'll pay the penalty of 10 minutes in staging, which we're trying to avoid here.
,
Feb 14 2018
Yeah, read disk io is terrible: host: pprabhu@pprabhu:chromiumos$ sudo dd if=/dev/sdc1 of=/dev/null bs=1M count=1024 iflag=direct skip=1024 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.99456 s, 269 MB/s guest: localhost tmp # sudo dd if=/dev/sdb of=/dev/null bs=1M count=32 iflag=direct skip=1024 32+0 records in 32+0 records out 33554432 bytes (34 MB, 32 MiB) copied, 26.8975 s, 1.2 MB/s ----- This is not exactly the same io benchmark as the actual application, but is representative, since we're reading sequentially from an fd for the hash computation (granted, the file could be not stored in sequential blocks, but given this is in a brand new disk, I'd expect somewhat contiguous allocation)
,
Feb 14 2018
Using virtio for the external disk: https://fedoraproject.org/wiki/Features/virtio-scsi https://chromium-review.googlesource.com/c/chromiumos/chromite/+/917962/ disk io goes up through the window: localhost ~ # sudo dd if=/dev/sdb of=/dev/null bs=1M count=32 iflag=direct skip=1024 32+0 records in 32+0 records out 33554432 bytes (34 MB, 32 MiB) copied, 0.111606 s, 301 MB/s
,
Feb 14 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/dev-util/+/58d08930b96bd00770799e1c149d852948ac2ea4 commit 58d08930b96bd00770799e1c149d852948ac2ea4 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Feb 14 05:15:43 2018 stage: Add argument to move files locally instead of copying. In some use cases of staging files from a local path, we begin by preparing the files to stage in a tempdir, we then stage these files via the RPC and immediately delete the source file. This CL adds an argument to the stage RPC to be used when staging local files that directs us to simply move the files instead of copying. This is especially useful when the devserver is running in a VM (moblab VM) where the staging time goes down from ~30 minutes to 3 minutes after this change. BUG= chromium:799724 TEST=Run 'cros stage' on a moblab vm. Change-Id: Ie1206ae9ccb25802cc241b25e3a611e46893b16c Reviewed-on: https://chromium-review.googlesource.com/877387 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Congbin Guo <guocb@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/58d08930b96bd00770799e1c149d852948ac2ea4/downloader.py [modify] https://crrev.com/58d08930b96bd00770799e1c149d852948ac2ea4/devserver.py
,
Feb 14 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/dev-util/+/bee63be83f2eca375257d477af50bf3ab7268038 commit bee63be83f2eca375257d477af50bf3ab7268038 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Feb 14 05:15:44 2018 Make LocalDownloader play well with MultiArtifact. CL:751790 added support for artifacts that involve downloading multiple files, but LocalDownloader was not updated to support it. FixIt. BUG=chromium:779870 BUG= chromium:799724 TEST=quick-provision artifacts are staged correctly in moblab-vm. Change-Id: I5f93db6d4371839f4def36a1bdc670df3fe6ee44 Reviewed-on: https://chromium-review.googlesource.com/912777 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: David Riley <davidriley@chromium.org> Reviewed-by: Congbin Guo <guocb@chromium.org> [modify] https://crrev.com/bee63be83f2eca375257d477af50bf3ab7268038/downloader.py
,
Feb 14 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/1d26c79c517b229e66ee627de22435a81f03d6a8 commit 1d26c79c517b229e66ee627de22435a81f03d6a8 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Feb 14 05:16:22 2018 cbuildbot: Collect autotest tarball commands into one. These commands are all required to build the autotest tarballs required for running hwtests. These are already called from two places, and soon there will be a third. So, collect them into a helper. BUG= chromium:799724 TEST=unittests. Change-Id: I77f9878cbf5e76ba2a8da9bad2cfb1dba03d674b Reviewed-on: https://chromium-review.googlesource.com/884821 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/1d26c79c517b229e66ee627de22435a81f03d6a8/cbuildbot/commands.py [modify] https://crrev.com/1d26c79c517b229e66ee627de22435a81f03d6a8/cli/cros/cros_stage.py [modify] https://crrev.com/1d26c79c517b229e66ee627de22435a81f03d6a8/cbuildbot/stages/artifact_stages.py
,
Feb 14 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/5607e1eec5ce859d26a387da0c8e7454a0addb7c commit 5607e1eec5ce859d26a387da0c8e7454a0addb7c Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Feb 14 05:16:46 2018 moblab_RunSuite: Move and document devserver cleaning option. Nothing uses this flag yet, but MoblabVMTestStage is slated to start using it. This is a moblab_RunSuite feature really -- the test wants to ensure that devserver staging is tested (or not), so move it there. Also document the option better. BUG= chromium:799724 TEST=moblab-generic-vm-pre-cq w/wo the new option passed in. Change-Id: Ia5d42b9be15cd58b1614a6dd0174d6bacd1d6dd7 Reviewed-on: https://chromium-review.googlesource.com/865276 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Keith Haddow <haddowk@chromium.org> [modify] https://crrev.com/5607e1eec5ce859d26a387da0c8e7454a0addb7c/server/site_tests/moblab_RunSuite/control.dummyServer [modify] https://crrev.com/5607e1eec5ce859d26a387da0c8e7454a0addb7c/server/site_tests/moblab_RunSuite/moblab_RunSuite.py [modify] https://crrev.com/5607e1eec5ce859d26a387da0c8e7454a0addb7c/server/site_tests/moblab_RunSuite/control.dummyServerNoSsp [modify] https://crrev.com/5607e1eec5ce859d26a387da0c8e7454a0addb7c/server/hosts/moblab_host.py [modify] https://crrev.com/5607e1eec5ce859d26a387da0c8e7454a0addb7c/server/site_tests/moblab_RunSuite/control.smoke
,
Feb 14 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1affd71e99ea67d56e4d3bc794ca8164957c2400 commit 1affd71e99ea67d56e4d3bc794ca8164957c2400 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Feb 14 05:16:48 2018 moblab: Permit local image_storage_server paths. BUG= chromium:799724 TEST=moblab-generic-vm-paladin-tryjob Change-Id: I8e61e5cb23d155bf6890890012064b887d783667 Reviewed-on: https://chromium-review.googlesource.com/887422 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Keith Haddow <haddowk@chromium.org> [modify] https://crrev.com/1affd71e99ea67d56e4d3bc794ca8164957c2400/server/cros/moblab_test.py
,
Feb 14 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/c2fb94ef330e91588cd548d71b0ed6f4229e7eb6 commit c2fb94ef330e91588cd548d71b0ed6f4229e7eb6 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Wed Feb 14 05:17:13 2018 autotest: Allow staging artifacts from local path on devserver. When staging from local path, the client library sets delete_source=True. The only reason a test would request staging from a local path on the devserver is if the payloads were earlier copied in expressly for that purpose. delete_source greatly speeds up staging, especially for devsever running in a moblab VM. BUG= chromium:799724 TEST=moblab-generic-vm-paladin-tryjob Change-Id: I3374b15bbc9f9600baafc9b49824d5f2765c9571 Reviewed-on: https://chromium-review.googlesource.com/889993 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/c2fb94ef330e91588cd548d71b0ed6f4229e7eb6/client/common_lib/cros/dev_server.py
,
Feb 14 2018
Ah! Sweet success! https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8954639586631423952 Finally. :) MoblabVMTestStage is down to 23 minutes from > 90 minutes.
,
Feb 14 2018
(Two more CLs to land before declaring victory here. + revert the timeout bump for issue 810061)
,
Feb 14 2018
,
Feb 15 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/155e0273894d7b6babcd3bfbcbd6648574a0d451 commit 155e0273894d7b6babcd3bfbcbd6648574a0d451 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Feb 15 08:11:06 2018 project-moblab: delay devserver cleanup && run in verbose mode. When debugging moblab-vm failures, we often download an archived moblabvm workspace. Since the build may have run 1+ day ago, this results in the cleanup script deleting the staged images immediately on boot, making debugging hard. Instead, delay the initial cleanup a bit. On any long running moblab device, the initial delay should be irrelevant, but it provides a window to debug vm failures. Also, make the cleanup verbose to leave breadcrumbs when we do clean up the devserver cache. BUG= chromium:799724 TEST=local moblabvm run. Change-Id: I9bcb391304cfc462372cc417e866a63f78feaae4 Reviewed-on: https://chromium-review.googlesource.com/915213 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> [rename] https://crrev.com/155e0273894d7b6babcd3bfbcbd6648574a0d451/project-moblab/chromeos-base/chromeos-bsp-moblab/chromeos-bsp-moblab-0.0.5-r84.ebuild [modify] https://crrev.com/155e0273894d7b6babcd3bfbcbd6648574a0d451/project-moblab/chromeos-base/chromeos-bsp-moblab/files/init/moblab-devserver-cleanup-init.conf
,
Feb 16 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/fb7f078446dee8c8788d65b767d004ba02e5056c commit fb7f078446dee8c8788d65b767d004ba02e5056c Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Fri Feb 16 01:42:20 2018 lib: Use virtio for moblabvm's external disk. Before this CL, we were attaching moblab's external disk as a usb-disk. This CL changes it to be a scsi disk using a virtio driver. For a sequential read benchmark, this gives us > 200x speedup. Before: localhost tmp # sudo dd if=/dev/sdb of=/dev/null bs=1M count=32 \ iflag=direct skip=1024 32+0 records in 32+0 records out 33554432 bytes (34 MB, 32 MiB) copied, 26.8975 s, 1.2 MB/s After: localhost ~ # sudo dd if=/dev/sdb of=/dev/null bs=1M count=32 \ iflag=direct skip=1024 32+0 records in 32+0 records out 33554432 bytes (34 MB, 32 MiB) copied, 0.111606 s, 301 MB/s Practically, this makes the devserver running on the moblab VM usable: During an update running off the devserver, the chromiumos image cached on the external disk needs to be loaded in memory for some metadata computations. Without this CL, this takes too long (~ 10 minutes) and the sub-DUT's upate_engine times out (on the updatecheck call). With this CL, this operation takes ~2 seconds. BUG= chromium:799724 TEST=(1) Manual sequence read test as above. (2) Manual sha1 computation on a chromiumos_test_image.bin from a cold cache (post reboot). (3) moblab-vm-test passes with a cold cache (as after CL:882170). Change-Id: I553d57fd2cda8938fddf5bf7f370682ad3d9a88b Reviewed-on: https://chromium-review.googlesource.com/917962 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Ilja H. Friedel <ihf@chromium.org> [modify] https://crrev.com/fb7f078446dee8c8788d65b767d004ba02e5056c/lib/moblab_vm.py
,
Feb 17 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/7090d1c44f3fdb902a6eb5039b1fbaad56505d0f commit 7090d1c44f3fdb902a6eb5039b1fbaad56505d0f Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Sat Feb 17 02:47:45 2018 cbuildbot: Locally stage artifacts on devserver inside moblab. Before this CL, MoblabVMTestStage would upload test artifacts to GS. Then, the test itself would cause the devserver inside moblab VM to stage these same artifacts from GS. This was very slow (time spent waiting for artifact upload + 30 minutes (anecdotal average) for the staging operation). We now pre-generate the test artifacts inside the external disk used by the devserver in moblab as local cache. Also, we instruct the moblab_RunSuite test to force the local cache to be used as the source of test artifacts. This cuts GS out of the loop. With CL:917962, this reduces the MoblabVmTestStage time from ~100 minutes to < 30 minutes. BUG= chromium:799724 TEST=moblab-generic-vm-pre-cq Change-Id: Ic68c3acc16702a722673e29ee5164f50d8b1b442 Reviewed-on: https://chromium-review.googlesource.com/882170 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/7090d1c44f3fdb902a6eb5039b1fbaad56505d0f/cbuildbot/stages/vm_test_stages.py [modify] https://crrev.com/7090d1c44f3fdb902a6eb5039b1fbaad56505d0f/cbuildbot/stages/vm_test_stages_unittest.py
,
Feb 17 2018
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8954376999108179040 The build time is now hovering under 100 minutes. That's good enough for this bug. http://shortn/_4Gw8Wt17Ia
,
Feb 20 2018
1 day window graph of time taken by this bot: http://shortn/_vujiMi3com
,
Feb 20 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/9ea1bef9d36ef0cc537384f62b6c91aa6d3af0a3 commit 9ea1bef9d36ef0cc537384f62b6c91aa6d3af0a3 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Tue Feb 20 21:14:27 2018 Revert "cbuildbot: (workaround) Bump up MoblabVMTestStage timeout." This reverts commit 8cd474693924be0c335e7e4d1b585f64266fd9c2. Reason for revert: Root cause fixed. Original change's description: > cbuildbot: (workaround) Bump up MoblabVMTestStage timeout. > > This is just a hot-fix while we try to determine why this stage is > taking so long on the pre-cq. > > After this CL, the pre-cq should pass again, albeit taking longer. > Degraded performance is better than none. > > BUG=chromium:810061 > TEST=Run moblab-generic-vm-pre-cq with this CL, watch it pass. > > Change-Id: I483a4d3b53c7c04c50057831ce9b87c2e97162bc > Reviewed-on: https://chromium-review.googlesource.com/907705 > Reviewed-by: Xixuan Wu <xixuan@chromium.org> > Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> BUG=chromium:810061 BUG= chromium:799724 Change-Id: I6703922188d62983158f73f2128fc7d1db95d447 Reviewed-on: https://chromium-review.googlesource.com/924664 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Paul Hobbs <phobbs@google.com> [modify] https://crrev.com/9ea1bef9d36ef0cc537384f62b6c91aa6d3af0a3/cbuildbot/stages/vm_test_stages.py |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by pprabhu@chromium.org
, Jan 7 2018