Issue 794232

Starred by 3 users

Issue metadata

Status: WontFix
Owner:
Last visit > 30 days ago
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Task

Blocked on:
issue 835469
issue 795029
issue 803563

Blocking:
issue 720219



Test the lxc_pool service

Project Member Reported by kenobi@chromium.org, Dec 12 2017

Issue description

Per ihf@:

I would start with this one. Neither board is in any CQ/PFQ, and neither runs CTS, so nobody will notice if anything fails:
chromeos-server35.cbf.corp.google.com   board:rikku, board:zako
This shard will very occasionally run a server test, so there is not much stress on the container pool (provisioning-only stress, plus a few server tests here and there).

Once that works, I would use this small shard, which has only one board but does run CTS. setzer is in no CQ/PFQ. We will miss some CTS results if it blows up, but there will be no problems with the queues:
chromeos-server26.mtv.corp.google.com   board:setzer
Setzer should have 18 DUTs running CTS, which is going to be reasonably stressful. Then again viceroy thinks that setzer just died? Do we still have an outage?

Alternatively for a "large" shard
chromeos-server101.mtv.corp.google.com  board:snappy, board:panther
But viceroy also thinks snappy just died (it also only has 10 DUTs so less stress than chromeos-server26), so maybe we have other problems at hand.

One issue with all of the above shards that don't run CQ/PFQ is that their container pool usage pattern may be more moderate than it would be for CQ/PFQ. But we could flip it and watch for a few days. We should see more than a thousand CTS jobs running lxc fly by per day on chromeos-server26.mtv. That should build some confidence.

Otherwise, if we want to see the worst case that can happen I suggest
chromeos-server98.mtv.corp.google.com   board:kevin, board:veyron_minnie, board:caroline
We will see breaks quickly in cq/pfq/cts.

 

Comment 1 by kenobi@chromium.org, Dec 12 2017

Components: Infra>Client>ChromeOS

Comment 2 by bugdroid1@chromium.org, Dec 13 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/14ce95c9cc9871cd362e7e7d37084c7b1f213fd0

commit 14ce95c9cc9871cd362e7e7d37084c7b1f213fd0
Author: Ben Kwa <kenobi@google.com>
Date: Wed Dec 13 22:58:35 2017

Comment 3 by ihf@chromium.org, Dec 14 2017

Labels: M-65 OS-Chrome
On the positive side, I don't see jobs failing on cs-35. I also found jobs that show that the container pool is active. Unfortunately I lost the logs, but they showed that the time to start was still significant, on the order of 1 minute.

Looking at this, the times also seem to have gone up since about 4pm, not just on cs-35 but on a few other servers as well. So maybe this is a local cbf issue?
https://viceroy.corp.google.com/chromeos/ssp_health?duration=1d&refresh=-1&topstreams=50

My suggestion is to log into cs-35 and poke a bit at the logs under
/usr/local/autotest/results/*/*/ssp_logs/debug/autoserv.DEBUG

They may not always be there depending on the job queue.

Comment 4 by kenobi@chromium.org, Dec 14 2017

Blockedon: 795029
What we know so far:

- setup test duration appears to be slightly raised on server35 (viceroy link @[1])
  - the increase starts right around 4pm on 12/13 [2] which, unfortunately, corresponds with exactly when the container pool came online and began serving containers (local logs in /usr/local/autotest/logs/lxc_pool.1513209079.INFO indicate the first container was served from the pool at 15:54)

- currently, 581 containers have been served from the pool (581 get requests).  Of those, I see 1 instance where the pool failed to yield a container:
     $ grep "No container" lxc_pool.1513209079.DEBUG
     12/13 19:46:56.225 DEBUG|           service:0310|    client_thread| No container (id=test_162425705_1513223210_65381)
  - logs for that test [3] indicate that it failed completely, which is not the intended outcome.  Bug filed: crbug/795029

- the pool appears to be functioning as expected.
  - container start times are down in actual tests
  - a small sampling shows container startup times in the test ssp logs to be in the  ~1 sec range.
  - conversely we can see the container startup times in the lxc_pool log in the 5-45 second range

- increased times would appear to be due to the use of container IDs.  These are persistent identifiers that have to be serialized and stored in each container in order to identify it (previously, container names were used for identification, but containers cannot be renamed, and container names are not available until a test job comes in).  I will investigate this more.


1: https://viceroy.corp.google.com/chromeos/ssp_health?duration=8d&refresh=-1&topstreams=50#_VG_8XSx1Bwu
2: https://screenshot.googleplex.com/Oo6tCJeP2EJ
3: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/162425705-chromeos-test/chromeos4-row3-rack6-host10/ssp_logs/debug/
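
The grep above can be turned into a small script that pulls out the affected container IDs and relates them to the number of get requests. This is only a sketch: the regular expression is modeled on the single log line quoted in this comment, and `find_missed_requests` is a hypothetical helper, not part of the lxc_pool service.

```python
import re

# Pattern modeled on the one sample line quoted above; other lines in the
# real lxc_pool log may be formatted differently.
NO_CONTAINER_RE = re.compile(r"No container \(id=(?P<id>[^)]+)\)")


def find_missed_requests(lines):
    """Return the container IDs for which the pool logged 'No container'."""
    missed = []
    for line in lines:
        match = NO_CONTAINER_RE.search(line)
        if match:
            missed.append(match.group("id"))
    return missed


sample = [
    "12/13 19:46:56.225 DEBUG|           service:0310|    client_thread| "
    "No container (id=test_162425705_1513223210_65381)",
]
missed = find_missed_requests(sample)
total_get_requests = 581  # figure quoted in this comment
miss_rate = 100.0 * len(missed) / total_get_requests
print("missed IDs: %s (%.2f%% of get requests)" % (missed, miss_rate))
```

In practice the input would be the lines of the lxc_pool DEBUG log rather than the inline sample.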

Comment 5 by dshi@chromium.org, Dec 14 2017

The container ID is stored in a pickle file. Did we do any comparison between pickle and json? I remember we ran into that bottleneck when there was a large number of containers (>30) running on a single host.
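
One way to make the pickle-versus-json question concrete is a quick in-memory round-trip benchmark. The sketch below is illustrative only: the `container_id` dictionary is a hypothetical stand-in (the real ID type is not shown in this thread), and it measures only the serialization format, not file I/O, sudo, or tmpfile handling.

```python
import json
import pickle
import timeit

# Hypothetical stand-in for a container ID; the field names are assumptions.
container_id = {"job_id": 162425705, "creation_time": 1513223210, "pid": 65381}


def roundtrip_pickle(obj):
    """Serialize and deserialize obj with pickle."""
    return pickle.loads(pickle.dumps(obj))


def roundtrip_json(obj):
    """Serialize and deserialize obj with json."""
    return json.loads(json.dumps(obj))


def compare(obj, number=10000):
    """Return total seconds spent on `number` round trips per format."""
    return {
        "pickle": timeit.timeit(lambda: roundtrip_pickle(obj), number=number),
        "json": timeit.timeit(lambda: roundtrip_json(obj), number=number),
    }


for name, seconds in sorted(compare(container_id).items()):
    print("%-6s %.4f s" % (name, seconds))
```

If both formats turn out to be fast in memory, the cost is more likely in how the file gets written into the container than in the format itself.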

Comment 6 by kenobi@chromium.org, Dec 14 2017

Regarding ID serialization, more discussion in crbug/792564.  TLDR is to serialize directly and not use sudo and tmpfiles.

Comment 7 by ihf@chromium.org, Dec 18 2017

There may have been a decrease in time last night. Maybe due to the serialization change?

A further observation: comparing logs from cs-35 and cs-104, each "Running 'sudo" line costs 300+ms. I am surprised it is that much, as we are staying on the server. Anyhow, the log from cs-104 shows 72 such calls (including teardown), while the new log from cs-35 shows 127 such calls. This may explain 15-20s.

1. Why does sudo take so long?
2. If it can't be changed, maybe we should reduce the calls. Many of them seem redundant or could be tied together into a little script (copying files into the container, like the .boto file alone takes 6 calls and 2s).
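
The 15-20s estimate follows from the call counts and the per-call cost quoted above. A minimal sketch of that arithmetic, with a hypothetical helper for counting the marker lines in an autoserv.DEBUG log (the function names are illustrative, not autotest APIs):

```python
def count_sudo_calls(lines, marker="Running 'sudo"):
    """Count log lines that contain the sudo marker quoted in this comment."""
    return sum(1 for line in lines if marker in line)


def extra_sudo_seconds(calls_before, calls_after, seconds_per_call):
    """Estimate the added wall time from the additional sudo calls."""
    return (calls_after - calls_before) * seconds_per_call


# 72 calls on cs-104 (count includes teardown) versus 127 on cs-35,
# at the 300 ms lower bound per call.
estimate = extra_sudo_seconds(72, 127, 0.3)
print("extra sudo overhead: about %.1f s" % estimate)
```

Since 300 ms is a lower bound and the cs-104 count includes teardown, this is a rough figure; at 0.36 s per call the same 55 extra calls would come to roughly 20 s.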

Comment 8 by ihf@chromium.org, Dec 18 2017

Logged in as root:
cs-27: time ls - 30ms, time sudo ls - 390ms
cs-35: time ls - 7ms, time sudo ls - 750ms (!) :-(
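
A single `time` run can be noisy, so the comparison is easier to trust with a few repeats and a median. The sketch below assumes Python 3 and uses a hypothetical `time_command` helper; it times a harmless baseline command here, and on a server the same function could be pointed at `["ls"]` and `["sudo", "-n", "ls"]`.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=5):
    """Return the median wall-clock seconds over `repeats` runs of argv."""
    samples = []
    for _ in range(repeats):
        start = time.monotonic()
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        samples.append(time.monotonic() - start)
    samples.sort()
    return samples[len(samples) // 2]


# Baseline: start a Python interpreter that does nothing.
baseline = time_command([sys.executable, "-c", "pass"])
print("baseline: %.0f ms" % (baseline * 1000))
```

The difference between the plain and sudo timings on the same host gives the per-call sudo overhead.
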

Comment 9 by bugdroid1@chromium.org, Dec 20 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/640644a1ee45995aec937ee760f00a45ba7ebc20

commit 640644a1ee45995aec937ee760f00a45ba7ebc20
Author: Ben Kwa <kenobi@google.com>
Date: Wed Dec 20 18:15:54 2017

Comment 10 by bugdroid1@chromium.org, Dec 21 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/205f346057bac01908ea910bd3ac6ef5266a6b67

commit 205f346057bac01908ea910bd3ac6ef5266a6b67
Author: Ben Kwa <kenobi@google.com>
Date: Thu Dec 21 03:34:20 2017

Comment 11 by bugdroid1@chromium.org, Dec 21 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/ba0f1caca8c6c479475d08bf43b754b4a42d956b

commit ba0f1caca8c6c479475d08bf43b754b4a42d956b
Author: Ben Kwa <kenobi@google.com>
Date: Thu Dec 21 09:53:02 2017

Comment 12 by ihf@chromium.org, Dec 21 2017

cs-101 looks good to me. Now enabling the last of the four, cs-98.

Comment 13 by ihf@chromium.org, Dec 21 2017

To track the times:

cs26     before    CP enabled    sudo reduced
50%      0.6m      0.8m
90%      n/a       n/a
95%      n/a       n/a

cs35     before    CP enabled    sudo reduced
50%      n/a       n/a
90%      0.91m     1.04m
95%      0.91m     1.05m

cs98     before    CP enabled    sudo reduced
50%      1.38m     1.44m
90%      1.60m     1.63m
95%      1.64m     1.65m

cs101    before    CP enabled    sudo reduced
50%      0.97m     1.31m
90%      1.05m     1.41m
95%      1.06m     1.42m
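
To compare servers, it can help to express the figures above as a relative change from "before" to "CP enabled". A small sketch using only the numbers in this table, with the n/a rows omitted; `percent_change` is an illustrative helper:

```python
# (server, percentile) -> (before, CP enabled), in minutes, from the table above.
timings = {
    ("cs26", 50): (0.6, 0.8),
    ("cs35", 90): (0.91, 1.04),
    ("cs35", 95): (0.91, 1.05),
    ("cs98", 50): (1.38, 1.44),
    ("cs98", 90): (1.60, 1.63),
    ("cs98", 95): (1.64, 1.65),
    ("cs101", 50): (0.97, 1.31),
    ("cs101", 90): (1.05, 1.41),
    ("cs101", 95): (1.06, 1.42),
}


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before


for (server, percentile), (before, after) in sorted(timings.items()):
    print("%-6s p%d: %+.1f%%" % (server, percentile, percent_change(before, after)))
```

The "sudo reduced" column can be added to the dictionary once those measurements are available.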

Comment 14 by ihf@chromium.org, Dec 21 2017

There seem to be widespread infra issues this morning. They affect servers other than the ones we have used here. I spot checked that CP is not enabled where it shouldn't be and that CP seems to be working fine where it is (aka the 4 servers in this issue).
Cc: -ihf@chromium.org
Owner: ihf@chromium.org
Over to ihf@ for follow-up.

Comment 16 by ihf@chromium.org, Jan 18 2018

Blockedon: 803563
CPU and RAM are a bit elevated on cs98. The rest seems fine. Now there are lots of stale chameleon ssh connections on cs98 for which I filed issue 803563.

Comment 17 by bugdroid1@chromium.org, Feb 10 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/8faafd4cf9059d1cbc460fed2b9b177897d4e6da

commit 8faafd4cf9059d1cbc460fed2b9b177897d4e6da
Author: Jacob Kopczynski <jkop@google.com>
Date: Sat Feb 10 01:01:23 2018

Comment 18 by bugdroid1@chromium.org, Feb 16 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/d057f7bb581705bd11a093d60f56bb02f201a52c

commit d057f7bb581705bd11a093d60f56bb02f201a52c
Author: Ilja H. Friedel <ihf@google.com>
Date: Fri Feb 16 06:37:12 2018

Comment 19 by bugdroid1@chromium.org, Feb 21 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/b6091f7d4eee396b6d03d6ade6813f6cdc445812

commit b6091f7d4eee396b6d03d6ade6813f6cdc445812
Author: Ilja H. Friedel <ihf@google.com>
Date: Wed Feb 21 21:34:59 2018

Comment 20 by jkop@chromium.org, Mar 15 2018

Owner: jkop@chromium.org
Status: Started (was: Assigned)

Comment 21 by bugdroid1@chromium.org, Mar 15 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/986902a5e4bf507c20b0003d72cc24ec0633c99c

commit 986902a5e4bf507c20b0003d72cc24ec0633c99c
Author: Jacob Kopczynski <jkop@google.com>
Date: Thu Mar 15 20:16:09 2018

Comment 22 by bugdroid1@chromium.org, Mar 22 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/7f317790fc2fc781c4add1422d6c97169344775a

commit 7f317790fc2fc781c4add1422d6c97169344775a
Author: Jacob Kopczynski <jkop@google.com>
Date: Thu Mar 22 22:02:22 2018

Comment 23 by bugdroid1@chromium.org, Mar 27 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/ffc6d536ba547a7b8a4a81f881206e9ec953c3f6

commit ffc6d536ba547a7b8a4a81f881206e9ec953c3f6
Author: Jacob Kopczynski <jkop@google.com>
Date: Tue Mar 27 20:41:29 2018

Comment 24 by jkop@chromium.org, Apr 20 2018

Blockedon: 835469

Comment 25 by jkop@chromium.org, Jun 6 2018

Status: WontFix (was: Started)
