build_RootFilesystemSize too small for M71 bvt-inline testing
Issue description

bvt-inline testing is failing across boards for the M71 builders. This is an M71 DEV blocker since it is blocking builds. Note that M71 branched on Friday, so these are the initial attempts for those builders.

Logs: https://luci-logdog.appspot.com/v/?s=chromeos/buildbucket/cr-buildbucket.appspot.com/8932800024225867760/+/steps/HWTest__bvt-inline_/0/stdout
Build Health: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=3033458
Oct 15
This is probably a duplicate of https://bugs.chromium.org/p/chromium/issues/detail?id=888744. As long as it still builds, we can still use the images. The test is an early-warning indicator that we are within the last few bytes before the image no longer fits, not a critical failure on its own. I would advise dropping RBD on this; if this is a problem on the branch, the test can be hacked off the relevant suites.
Oct 15
Thanks; I kept the RBD because I wasn't sure whether this was also tied to crbug/895438 and perhaps other artifacts of the branch and the M71 builder setup. I'll keep the RBD for now since I'm tracking it closely.
Oct 15
This isn't a real issue *now*, other than as a warning about file space, so I'm removing the RBD. But the space shortage needs to be resolved before we run out...
Oct 15
Fixing Issue 894277 should give back 9MB of space; reassigning to sammc@.
Oct 15
Should we tag this as a P0, since the status of all the M71 builders is failing and we're getting close to an emergency?
Oct 15
Is it really an emergency? The builds themselves are succeeding; that test is more of a canary in the coal mine, right? If you want to green-ify the dashboard urgently, changing https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/site_tests/build_RootFilesystemSize/build_RootFilesystemSize.py?rcl=bf1fe06fb8f8d719da07cf2385de54e36a6ef706&l=55 to require only 10MB of margin on the branch would be acceptable for now. BTW, is that test failing on ToT too, or just on the branch? If the latter, that's a bit strange. Maybe there is slight randomness in the size of binaries.
Oct 15
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1281065 will disable this, since the test failing is not really a blocking failure, and disabling it should reduce confusion for the test team when making releases. We can still merge in a real fix if we come up with something that looks safe enough for 71.
Oct 15
Per #7: It's going to be a P0 when we run out of space ;-), assuming this is a true indicator that we're about to run out of resources. This is happening on the M71 branch and on M72 ToT. I'd rather keep the visibility if we're about to hit a real resource limit. Admittedly, I'm confused about whether we are at this point....
Oct 15
We're not hitting a real resource limit yet. The test tells us that we have slightly less than 11MB of free space on the rootfs; it's an early warning. There are 3 possibilities:
1. Reduce the disk usage on the rootfs. ETA unclear.
2. Remove the test altogether, presumably only on the branch?
3. Make the test a bit more permissive so it only fails when there is, say, only 10MB left, while we work on 1.
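As a rough illustration of option 3, here is a minimal sketch of a free-space margin check whose threshold is a parameter, so the branch could relax it from 11MB to 10MB without touching the logic. This is not the actual build_RootFilesystemSize.py code: the helper names, the use of `os.statvfs` on a mount point, and the 10MB default are assumptions for illustration only.

```python
import os

MB = 1024 * 1024


def rootfs_free_bytes(mount_point: str) -> int:
    """Hypothetical helper: free bytes on the filesystem mounted at mount_point."""
    st = os.statvfs(mount_point)
    # f_bavail = blocks available to unprivileged users, f_frsize = block size.
    return st.f_bavail * st.f_frsize


def check_margin(free_bytes: int, required_margin_bytes: int = 10 * MB) -> None:
    """Hypothetical check: raise if free space is below the (assumed) margin."""
    if free_bytes < required_margin_bytes:
        raise RuntimeError(
            f"Only {free_bytes / MB:.1f} MB free on rootfs; "
            f"{required_margin_bytes / MB:.0f} MB margin required"
        )
```

With slightly less than 11MB free (say 10.8MB), `check_margin` raises under an 11MB margin but passes under a 10MB one, which is exactly the slack option 3 buys while option 1 is worked on.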
Oct 15
So it sounds like the way this 'warning' is being reported is incorrect. This should not be a hard FAIL, but a WARNING. If the information is useful to bubble up to dashboards, then raise it as a WARNING; otherwise, don't raise it, and notify whoever needs to take action on it. Right now it's just scary noise on a dashboard with a big red FAIL, and that's not what it is at all. Do we already return more statuses from builds than PASS/FAIL? If not, then the consumers of these statuses (dashboards) need to interpret the nuanced meanings of "FAIL".
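To make the PASS/WARNING/FAIL idea concrete, here is a minimal sketch of how a size check could map free space onto three statuses instead of two. `Status`, `classify_free_space`, and the 15MB/10MB thresholds are hypothetical (the numbers echo margins mentioned in this thread); this is not how autotest or the dashboards currently report results.

```python
from enum import Enum

MB = 1024 * 1024


class Status(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


def classify_free_space(free_bytes: int,
                        warn_below: int = 15 * MB,
                        fail_below: int = 10 * MB) -> Status:
    """Hypothetical classifier; both thresholds are assumptions."""
    if fail_below > warn_below:
        raise ValueError("fail_below must not exceed warn_below")
    if free_bytes < fail_below:
        return Status.FAIL  # the image is genuinely close to not fitting
    if free_bytes < warn_below:
        return Status.WARN  # margin is eroding, but the build is still usable
    return Status.PASS
```

With these defaults, roughly 10.8MB free would report WARN rather than FAIL. Whether anyone actually acts on a WARN is the open question raised later in the thread.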
Oct 15
We have many failed builds that are entirely releasable, and a more granular status would be helpful. In particular, I think separating the 'build' from the 'test' in our dashboard views would help. We could then go further by splitting the test portion into more and less severe tests. I think one of the big values of this test is that it is used in the commit queue, so if a CL that pushes us very close to the edge tries to land, it gets rejected, which makes it easier to keep the margin we have left. Since the release branch does not have a commit queue, the value here is less certain to me.
Oct 15
Requesting merge to 71 of https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1281065, which disables this test on the release branch.
Oct 15
#9: This doesn't seem that widespread. auron_paine is failing on 71, but not on ToT afaict. I'm not sure any board is actually failing on 72. Maybe other boards on 71? Not sure which. Since the failure is pretty limited, it sounds like reducing the margin from 11MB to 10MB on 71, and maybe ToT(?), would be fine. We still catch unexpected growth in FS size while we work on the underlying cause, until we can move the margin back up to 15MB.

#11: It's kind of a separate discussion, but I think a fail is preferable in that case; warnings won't trigger any action in practice. It's better to simply check whether the canary in the coal mine is dead, not try to interpret whether every cough is serious.

#12: Is it actually in the CQ? I only see it in the canary builders.
Oct 15
This is part of bvt-inline so it should run in the CQ on any boards that run bvt-inline as part of the CQ, though this is a small subset of the boards. https://cs.corp.google.com/chromeos_public/chromite/config/chromeos_config.py?l=2733
Oct 15
> #11: It's kind of a separate discussion but I think a fail is preferable in that case, warnings won't trigger any action in practice. It's better to simply check if the canary in the coalmine is dead, not try to interpret every cough is serious or not.

Do we have that lever? Warnings? Maybe they should trigger an action. A big red FAIL for a mild warning is only visible to those who look at the dashboard regularly (TPMs, test). Is that failure actionable for them? It's noise, and distracting noise, because they have to track down and interpret what it means. If the goal is to warn someone about a potential increase in the filesystem size, then warn the person or group who can take action on it. Go ahead and put a warning on the dashboard, but don't push that noise onto people who won't act on it.
Oct 15
We know the coal mine already has bad air. The test doesn't matter any more, as long as people are putting enough urgency behind really dealing with this (are they?).
Oct 16
Your change meets the bar and is auto-approved for M71. Please go ahead and merge the CL to branch 3578 manually. Please contact milestone owner if you have questions. Owners: benmason@(Android), kariahda@(iOS), kbleicher@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Oct 22
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Oct 22
This is probably fixed enough for 71, the efforts to reduce filesystem size will continue on 72.
Oct 22
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/473470a7680232926a4fd754ca77203c70624007

commit 473470a7680232926a4fd754ca77203c70624007
Author: Bernie Thompson <bhthompson@google.com>
Date: Mon Oct 22 16:32:33 2018

Disable build_RootFilesystemSize on R71

BUG= chromium:895174
TEST=None

Change-Id: I9005fce94fc840b97fe866426f5587663b19d0d5
Reviewed-on: https://chromium-review.googlesource.com/c/1281065
Commit-Queue: Bernie Thompson <bhthompson@chromium.org>
Tested-by: Bernie Thompson <bhthompson@chromium.org>
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>

[modify] https://crrev.com/473470a7680232926a4fd754ca77203c70624007/client/site_tests/build_RootFilesystemSize/control
Oct 26
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Comment 1 by kbleicher@google.com, Oct 14