Test the lxc_pool service
Issue description

Per ihf@: I would start with this; both boards are in no CQ/PFQ, and no CTS either. Nobody will notice if anything fails:

chromeos-server35.cbf.corp.google.com board:rikku, board:zako

This shard will very occasionally run a server test. Not much stress on the container pool (provisioning-only stress, a few server tests here and there).

Once that works I would use this small shard, which only has one board but does run CTS. setzer is in no CQ/PFQ. We will miss some CTS results if it blows up, but no problems with the queues:

chromeos-server26.mtv.corp.google.com board:setzer

Setzer should have 18 DUTs running CTS, which is going to be reasonably stressful. Then again, viceroy thinks that setzer just died? Do we still have an outage?

Alternatively, for a "large" shard:

chromeos-server101.mtv.corp.google.com board:snappy, board:panther

But viceroy also thinks snappy just died (it also only has 10 DUTs, so less stress than chromeos-server26), so maybe we have other problems at hand.

One issue with all of the above shards that don't run CQ/PFQ is that the container pool usage pattern may be more moderate than for CQ/PFQ. But we could flip it on and watch for a few days. We should see more than a thousand CTS jobs running lxc fly by per day on chromeos-server26.mtv. That should build some confidence.

Otherwise, if we want to see the worst case that can happen, I suggest:

chromeos-server98.mtv.corp.google.com board:kevin, board:veyron_minnie, board:caroline

We will see breaks quickly in CQ/PFQ/CTS.
Dec 13 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/14ce95c9cc9871cd362e7e7d37084c7b1f213fd0 commit 14ce95c9cc9871cd362e7e7d37084c7b1f213fd0 Author: Ben Kwa <kenobi@google.com> Date: Wed Dec 13 22:58:35 2017
Dec 14 2017
On the positive side, I don't see jobs failing on cs-35. I also found jobs showing that the container pool is active. I lost the logs unfortunately, but they showed that the time to start was still significant, O(1 minute). Looking at this, the times also seem to have gone up since about 4pm, and not just on cs-35 but on a few others. So maybe this is a local cbf issue? https://viceroy.corp.google.com/chromeos/ssp_health?duration=1d&refresh=-1&topstreams=50 My suggestion is to log into cs-35 and poke a bit at the logs under /usr/local/autotest/results/*/*/ssp_logs/debug/autoserv.DEBUG. They may not always be there, depending on the job queue.
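If it helps, a minimal sketch for poking at those logs from the drone: it lists the most recently modified autoserv.DEBUG files and pulls out lines mentioning containers (the 'container' match is only a guess at what the SSP setup step logs; adjust as needed).

import glob
import os

# Per-job SSP debug logs mentioned above.
LOG_GLOB = '/usr/local/autotest/results/*/*/ssp_logs/debug/autoserv.DEBUG'

# Most recently modified logs first; only look at a handful.
logs = sorted(glob.glob(LOG_GLOB), key=os.path.getmtime, reverse=True)
for path in logs[:20]:
    with open(path) as f:
        # 'container' is a guess at the relevant log lines.
        hits = [line.rstrip() for line in f if 'container' in line.lower()]
    if hits:
        print(path)
        for line in hits:
            print('  ' + line)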
Dec 14 2017
What we know so far:
- setup test duration appears to be slightly elevated on server35 (viceroy link: [1])
- the increase starts right around 4pm on 12/13 [2], which unfortunately corresponds exactly with when the container pool came online and began serving containers (local logs in /usr/local/autotest/logs/lxc_pool.1513209079.INFO indicate the first container was served from the pool at 15:54)
- currently, 581 containers have been served from the pool (581 get requests). Of those, I see 1 instance where the pool failed to yield a container:
$ grep "No container" lxc_pool.1513209079.DEBUG
12/13 19:46:56.225 DEBUG| service:0310| client_thread| No container (id=test_162425705_1513223210_65381)
- logs for that test [3] indicate that it failed completely, which is not the intended outcome. Bug filed: crbug/795029
- the pool appears to be functioning as expected.
- container start times are down in actual tests
- a small sampling shows container startup times in the test ssp logs to be in the ~1 sec range.
- conversely, the container startup times in the lxc_pool log are in the 5-45 second range
- the increased times appear to be due to the use of container IDs. These are persistent identifiers that have to be serialized and stored in each container in order to identify it (previously, container names were used for identification, but containers cannot be renamed, and names are not available until a test job comes in). I will investigate this more.
1: https://viceroy.corp.google.com/chromeos/ssp_health?duration=8d&refresh=-1&topstreams=50#_VG_8XSx1Bwu
2: https://screenshot.googleplex.com/Oo6tCJeP2EJ
3: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/162425705-chromeos-test/chromeos4-row3-rack6-host10/ssp_logs/debug/
Dec 14 2017
The container ID is stored in a pickle file; did we do any comparison between pickle and JSON? I remember we ran into that bottleneck when there are a large number of containers (>30) running on a single host.
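I have not compared them here; for whoever does, a quick micro-benchmark sketch (the record below is just a stand-in for whatever the real container ID object actually holds):

import json
import pickle
import timeit

# Stand-in for the container ID contents (job id, creation time, pid),
# guessed from ids like test_162425705_1513223210_65381.
record = {'job_id': 162425705, 'creation_time': 1513223210, 'pid': 65381}

for name, dump in (('pickle', lambda: pickle.dumps(record)),
                   ('json', lambda: json.dumps(record))):
    secs = timeit.timeit(dump, number=10000)
    print('%-6s %.1f us per dump' % (name, secs * 1e6 / 10000))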
Dec 14 2017
Regarding ID serialization, there is more discussion in crbug/792564. TL;DR: serialize directly rather than going through sudo and temp files.
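Roughly what "serialize directly" could look like (a sketch only; the destination path and the ID fields here are illustrative, not the actual autotest code paths):

import json
import os

def store_container_id(container_dir, container_id):
    # Write the ID file in place instead of dumping it to a tmpfile and
    # copying it in via sudo. Assumes the pool process can write here.
    if not os.path.isdir(container_dir):
        os.makedirs(container_dir)
    with open(os.path.join(container_dir, 'container_id.json'), 'w') as f:
        json.dump(container_id, f)

# Example call; path and fields are illustrative.
store_container_id('/usr/local/autotest/containers/test_162425705',
                   {'job_id': 162425705, 'creation_time': 1513223210,
                    'pid': 65381})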
Dec 18 2017
There may have been a decrease in time last night, maybe due to the serialization change?

A further observation: comparing logs from cs-35 and cs-104, each "Running 'sudo" line costs 300+ms. I am surprised it is that much, as we are staying on the server. The log from cs-104 shows 72 such calls (including teardown); the new log from cs-35 shows 127 such calls. That may explain 15-20s.

1. Why does sudo take so long?
2. If it can't be changed, maybe we should reduce the number of calls. Many of them seem redundant or could be tied together into a little script (copying files into the container: the .boto file alone takes 6 calls and 2s); see the sketch below.
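For (2), one way to tie the copies together and pay the ~300 ms sudo cost once instead of per call (a sketch only; apart from .boto the paths are illustrative):

import os
import pipes
import subprocess

def copy_into_container(rootfs, files):
    # files: list of (host_path, path_inside_container) pairs.
    cmds = []
    for src, dst in files:
        full_dst = rootfs.rstrip('/') + dst
        cmds.append('mkdir -p %s' % pipes.quote(os.path.dirname(full_dst)))
        cmds.append('cp %s %s' % (pipes.quote(src), pipes.quote(full_dst)))
    # One sudo invocation runs all the copies.
    subprocess.check_call(['sudo', 'sh', '-c', ' && '.join(cmds)])

# Illustrative rootfs path and file list.
copy_into_container('/usr/local/autotest/containers/test_162425705/rootfs',
                    [(os.path.expanduser('~/.boto'), '/root/.boto')])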
Dec 18 2017
Logged in as root:
cs-27: time ls - 30ms; time sudo ls - 390ms
cs-35: time ls - 7ms; time sudo ls - 750ms (!) :-(
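For anyone re-checking other drones, the same measurement in a loop (assumes passwordless sudo for the user running it):

import subprocess
import time

def avg_ms(cmd, n=10):
    # Average wall-clock time of running cmd n times.
    start = time.time()
    for _ in range(n):
        subprocess.check_call(cmd)
    return (time.time() - start) * 1000.0 / n

print('true:      %.0f ms' % avg_ms(['true']))
print('sudo true: %.0f ms' % avg_ms(['sudo', 'true']))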
Dec 20 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/640644a1ee45995aec937ee760f00a45ba7ebc20 commit 640644a1ee45995aec937ee760f00a45ba7ebc20 Author: Ben Kwa <kenobi@google.com> Date: Wed Dec 20 18:15:54 2017
Dec 21 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/205f346057bac01908ea910bd3ac6ef5266a6b67 commit 205f346057bac01908ea910bd3ac6ef5266a6b67 Author: Ben Kwa <kenobi@google.com> Date: Thu Dec 21 03:34:20 2017
Dec 21 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/ba0f1caca8c6c479475d08bf43b754b4a42d956b commit ba0f1caca8c6c479475d08bf43b754b4a42d956b Author: Ben Kwa <kenobi@google.com> Date: Thu Dec 21 09:53:02 2017
Dec 21 2017
cs-101 looks good to me. Now enabling the last of the four, cs-98.
Dec 21 2017
To track the times:

                before CP enabled   sudo reduced
cs26    50%     0.6m                0.8m
        90%     n/a                 n/a
        95%     n/a                 n/a
cs35    50%     n/a                 n/a
        90%     0.91m               1.04m
        95%     0.91m               1.05m
cs98    50%     1.38m               1.44m
        90%     1.60m               1.63m
        95%     1.64m               1.65m
cs101   50%     0.97m               1.31m
        90%     1.05m               1.41m
        95%     1.06m               1.42m
Dec 21 2017
There seem to be widespread infra issues this morning, affecting servers other than the ones we have used here. I spot-checked that CP is not enabled where it shouldn't be, and that CP seems to be working fine where it is (i.e., the four servers in this issue).
Dec 22 2017
Over to ihf@ for follow-up.
Jan 18 2018
CPU and RAM are a bit elevated on cs98. The rest seems fine. Now there are lots of stale chameleon ssh connections on cs98 for which I filed issue 803563.
Feb 10 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/8faafd4cf9059d1cbc460fed2b9b177897d4e6da commit 8faafd4cf9059d1cbc460fed2b9b177897d4e6da Author: Jacob Kopczynski <jkop@google.com> Date: Sat Feb 10 01:01:23 2018
Feb 16 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/d057f7bb581705bd11a093d60f56bb02f201a52c commit d057f7bb581705bd11a093d60f56bb02f201a52c Author: Ilja H. Friedel <ihf@google.com> Date: Fri Feb 16 06:37:12 2018
Feb 21 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/b6091f7d4eee396b6d03d6ade6813f6cdc445812 commit b6091f7d4eee396b6d03d6ade6813f6cdc445812 Author: Ilja H. Friedel <ihf@google.com> Date: Wed Feb 21 21:34:59 2018
Mar 15 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/986902a5e4bf507c20b0003d72cc24ec0633c99c commit 986902a5e4bf507c20b0003d72cc24ec0633c99c Author: Jacob Kopczynski <jkop@google.com> Date: Thu Mar 15 20:16:09 2018
Mar 22 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/7f317790fc2fc781c4add1422d6c97169344775a commit 7f317790fc2fc781c4add1422d6c97169344775a Author: Jacob Kopczynski <jkop@google.com> Date: Thu Mar 22 22:02:22 2018
Mar 27 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/ffc6d536ba547a7b8a4a81f881206e9ec953c3f6 commit ffc6d536ba547a7b8a4a81f881206e9ec953c3f6 Author: Jacob Kopczynski <jkop@google.com> Date: Tue Mar 27 20:41:29 2018