
Issue 682558

Starred by 2 users

Issue metadata

Status: Fixed
Owner: ----
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 2
Type: Bug




Ensure buildbot slaves have enough disk space to run the linux_deterministic builder

Reported by yyanagisawa@chromium.org (Project Member), Jan 19 2017

Issue description

To run a builder that confirms build determinism, it needs to keep the results of two clobber Chromium builds to compare.

To deploy the linux_deterministic builder, we need to make sure the slaves have enough disk space for it.
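
For reference, a minimal sketch of what the comparison boils down to, assuming two clobber build directories are kept side by side (the paths, artifact list, and helper names below are made up for illustration; the real logic lives in the swarming/deterministic_build.py recipe):

import hashlib
import os

def file_digest(path):
    """Hash a build artifact in chunks so large binaries don't exhaust RAM."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def compare_builds(dir_a, dir_b, artifacts):
    """Return the artifacts whose bytes differ between the two builds."""
    return [name for name in artifacts
            if file_digest(os.path.join(dir_a, name)) !=
               file_digest(os.path.join(dir_b, name))]

# Hypothetical output directories for the two clobber builds.
diffs = compare_builds('out/Release.1', 'out/Release.2', ['chrome'])
print('non-deterministic artifacts: %s' % diffs)

Note that both build trees have to stay on disk at the same time for the comparison, which is exactly where the disk-space requirement comes from.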
 
Labels: OS-Linux
Cc: sergeybe...@chromium.org dpranke@chromium.org estaab@chromium.org
Components: -Infra>Platform>Swarming Infra>Platform>Buildbot
Status: Available (was: Untriaged)
@estaab, sergeyberezin - how easy is it to figure out whether we can afford an additional Linux release build in the main linux_cq pool?

Comment 3 Deleted

As easy as this:

https://goto.google.com/ocnrq

Looks like *on average* machines in this pool use 75% of their disk space. They are 500GB disks, so we only have 125GB left, which I'd be nervous to use up for anything else - we need a safety buffer.
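
For the record, the arithmetic behind that headroom figure, plus a rough feel for why it's tight (the per-build size below is an assumption, not a measurement):

# Back-of-envelope for the disk headroom (per-build size is a guess).
disk_total_gb = 500
avg_used_fraction = 0.75                 # from the dashboard above
headroom_gb = disk_total_gb * (1 - avg_used_fraction)    # 125 GB

# A deterministic builder keeps TWO clobber release builds on disk at once;
# assuming ~50 GB per Linux release build directory:
per_build_gb = 50
needed_gb = 2 * per_build_gb             # 100 GB of the 125 GB buffer
print('headroom: %d GB, needed: %d GB' % (headroom_gb, needed_gb))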

I'd go for a separate pool for linux_deterministic, just to be safe.

But keep in mind that tryserver.chromium.linux is already running ~600 slaves, and historically it's dangerously close to the master's breaking point. I'd look into enabling logdog-only mode for this master before we load it up with more busy slaves.
I think it's fine to wait until logdog is ready.

After that, how can I get a separate pool of builders?
To carve out a separate pool, just update the slaves-to-builders assignments in the master's config: https://chromium.googlesource.com/chromium/tools/build/+/master/masters/master.tryserver.chromium.linux/slaves.cfg#11

and request to restart the master at http://go/bugatrooper .
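
For context, entries in that slaves.cfg are plain Python dicts, roughly along these lines (the hostname, builder name, and exact keys below are made up; check the linked file for the real schema):

# Hypothetical slaves.cfg entry dedicating a slave to the deterministic builder.
slaves = [
  {
    'master': 'TryServerChromiumLinux',
    'builder': ['linux_deterministic_rel'],
    'hostname': 'slaveNNN-c4',            # placeholder hostname
    'os': 'linux',
    'version': 'trusty',
    'bits': '64',
  },
]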
Actually, I think we're well under the master's capacity at the moment, as we're only running ~300 concurrent builds at peak over the past few days. But I agree that removing the logging will hopefully give us a fair amount of headroom on the master, too.

I'm guessing that we'll probably need 60-100 machines in the pool to handle the load, given that we'll be doing two full compiles per build. We can take those from the existing 600.

@sergeyberezin - if the new pool is only serving one builder, we should trivially have enough disk space per builder, right? Do you know if we have any rough heuristics of disk space needed per builder (possibly per different config variants like debug/release, regular/official, linux/android, etc.), or how we might get such numbers?
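
To make that estimate concrete, here is the kind of back-of-envelope it comes from, essentially Little's law (all inputs below are assumptions, not measured numbers):

# Pool sizing sketch: concurrency = arrival rate * service time / utilization.
peak_builds_per_hour = 40     # assumed CQ attempt rate hitting this builder
hours_per_build = 1.5         # two full clobber compiles plus the comparison
utilization_target = 0.8      # leave slack for retries and maintenance

machines = peak_builds_per_hour * hours_per_build / utilization_target
print('machines needed: ~%d' % machines)   # ~75 with these inputs

Nudging those inputs within plausible ranges is what gets you the 60-100 spread.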

Comment 8 by estaab@chromium.org, Jan 24 2017

Just so I'm caught up, we're going to compare two builds for all chromium CLs and have failures block commits? How often do we expect this new builder to catch bad CLs?

I notice the builders at https://build.chromium.org/p/chromium.fyi/console?category=deterministic have each been red for the past 200 builds. Do those need to be green first?

I just want to make sure we're getting good value out of doing this. :)
Re: #6
Are you suggesting that I get a certain number of bots from Infra Labs?

Re: #8
Let me focus on the Linux deterministic builder, because Mac and Windows have known issues.

Once it becomes stable, I expect it to catch bad CLs less than once a month.
https://goto.google.com/zfzjz
The build had been deterministic for two months, between 2016-10-11 and 2017-01-12.

Yes, making it green should be done before it runs on the CQ. I have already filed a bug for the non-deterministic build on Linux:
https://bugs.chromium.org/p/chromium/issues/detail?id=678903
Yes, we expect the builder to catch CLs that break determinism and it has in the past.

And yes, it looks like the Linux builder is currently broken and we'd need to fix that.

As to how often we catch failures and whether that's enough to put it in the CQ, that's a good question and something we need to come up with real guidelines for. We may end up wanting to move the waterfall builder from the fyi waterfall to the main waterfall but *not* put it on the CQ.

Re: comment #9 - I was suggesting that we just take some of the machines from the existing linux_cq pool out, since there are more machines in that pool than we currently need. However, Erik is raising a good point that maybe we shouldn't do anything in the CQ at all at the moment. I'll think more about this and update this bug again tomorrow.
Deterministic build regressions are relatively rare, but on the other hand it's annoying for devs when they are caught only by post-commit checks, forcing an (otherwise) unnecessary revert.
Yes, it is. 

On the other hand, every builder we have on the CQ imposes both hardware and operational costs, so we need to figure out the right balance here (as I was saying in paragraph #3 of comment #10).
Yes, if this costs us frequent annoyance through false rejections and longer CQ cycle times, in exchange for avoiding the relatively rare annoyance of an occasional rollback, I don't think it's worth it. Let's figure out the balance.

And we should definitely start with a sheriffed waterfall builder so we don't have a red CQ and green tree.
FYI, the non-deterministic build in the Linux release builder has been fixed.
https://uberchromegw.corp.google.com/i/chromium.fyi/builders/Linux%20deterministic

If I understand dpranke's and estaab's suggestions correctly, you mean:
1. integrate the deterministic builder into continuous integration, and do not make it a presubmit check. (maruel might have a different opinion?)
2. I can borrow some buildbot slaves from linux_cq?
3. use a different pool.

Is my understanding correct?
1. +1 to defining a waterfall builder first (in fact, our tryserver builders are mirrors of the waterfall builders, so there is really no other way)

2. linux_cq (tryserver) slaves live on a different network from the waterfall. So you'd need to request a new slave for the waterfall.

3. If/when we get to adding a tryserver builder, we'll likely need a separate pool due to disk space constraints. The waterfall already has separate slaves (pools of size 1) for each builder.
Re: #15
Please correct me if I misunderstand:
1. might mean adding a new builder to the tryserver, right?
2. and 3. Can you advise me on how to calculate how many builders are enough for this?

I am going to use the same recipe that runs
https://uberchromegw.corp.google.com/i/chromium.fyi/builders/Linux%20deterministic
The builder name will be the one defined here:
https://chromium.googlesource.com/chromium/tools/build.git/+/master/scripts/slave/recipes/swarming/deterministic_build.py#90

I think you can just move your existing "Linux Deterministic" builder from chromium.fyi to chromium.linux. You already have the "linux_deterministic_rel" optional tryserver, so I don't think anything needs to change for that (apart from updating the entry in trybots.py when you move the other builder).
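
For reference, the mirror entries in trybots.py are roughly of this shape (a sketch from memory, with made-up keys where unsure; verify against the actual file before copying anything):

# Hypothetical shape of the trybots.py entry mapping the optional tryserver
# builder to its waterfall mirror (exact keys/helpers in the real file may differ).
TRYBOTS = {
  'tryserver.chromium.linux': {
    'builders': {
      'linux_deterministic_rel': {
        'mastername': 'chromium.linux',   # chromium.fyi before the move
        'buildername': 'Linux deterministic',
      },
    },
  },
}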

Does that make sense?

I think so.

However, I am also the owner of https://bugs.chromium.org/p/chromium/issues/detail?id=644641, and I do not want to add a new Precise builder to chromium.linux.

Let me ask for yet another buildbot slave running Trusty:
https://bugs.chromium.org/p/chromium/issues/detail?id=689380
The buildbot slave has now been converted to Trusty.

I have updated the builder name in https://chromium-review.googlesource.com/c/416511. Could you review it?
Comment 20 by sheriffbot@chromium.org (Project Member), Feb 12 2018

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Is this issue actually resolved? (It appears so from the comments.) Can we close it?
Status: Fixed (was: Untriaged)
Please reopen if needed.
