Make sizes step work on LUCI |
|||||||
Issue descriptionThe sizes step in the Chromium recipe is currently broken on LUCI due to always using ChartJSON for per dashboard uploading. This forces it to use IP whitelisting for authentication, which no longer works on LUCI due to the very large machine pools. The solution is to make the sizes.py script output data in the HistogramSet format, which allows the data to be uploading using oauth for authentication. The necessary groundwork with getting runtest.py to support HistogramSet uploading was already been done in Issue 892344 and Issue 892302, which were running into the same issue. So, the only things remaining should be to: 1. Make sizes.py take an '--output-dir' argument and an option to specify outputting in the HistogramSet format. 2. Make the sizes step in the Chromium recipe pass the necessary arguments to runtest.py.
,
Nov 27
It's great to see progress on this! As more of our bots get converted to LUCI, we're losing more and more size data, so I'm looking forward to seeing this working.
,
Nov 27
I started a build yesterday with the necessary recipe-side change (plus re-enabled sizes), but not seeing the sizes step https://chromium-swarm.appspot.com/task?id=416bc1a61075b010&refresh=10 I'll dig deeper today.
,
Nov 28
EoD update: Got it to the point where it's attempting to run the sizes step (although it takes ~3 hours to compile, which is really annoying). However, it's failing during the chartjson-histogram conversion due to what looks like the chartjson being malformed. See the error at https://logs.chromium.org/logs/chromium/led/bsheedy_google.com/ce74ce8affe09a7fc87bbba510df45e60a56839a0406963560b15aa7809f6af9/+/steps/sizes/0/stdout Do you happen to have a sample of what sizes actually produces and uploads? I suspect the fix should be simple, either on sizes' end or the conversion script (I ran into several issues with the conversion that needed fixing earlier), but I'm not sure what exactly it's trying to convert. I *think* I can run LED jobs with a src-side patch applied, but I'm not 100% sure, so if there's existing data handy that I could use, that would be preferred.
,
Nov 29
Got it to the point where the sizes step is passing, and as far as the builder is concerned, the data is properly uploaded. However, I'm not seeing the data show up on the perf dashboard - there is no ToTAndroidOfficial bot listed for the "sizes" test, and the ToTAndroid bot doesn't have the uploaded data. Since the validity of the data isn't actually checked during the upload process, the histogram data is probably malformed and getting rejected later on. I'll spin up a local perf dashboard and check with that since it properly errors out locally.
,
Nov 29
Just checked and I see ToTWin(dbg) or ToTLinux(dbg) still failing on sizes. Might be useful to verify on these builders. +John to provide additional guidance.
,
Nov 29
Hm, the local dashboard server doesn't find any issues with the submitted data. I'll see about verifying on ToTLinux(dbg).
,
Nov 29
Build finished on ToTLinux (dbg), and supposedly successfully uploaded. Still not seeing the uploaded data on the dashboard, though. See https://chromeperf.appspot.com/report?sid=7b981e3470b74da766f96a49d371724f843b052b9fce2e72bc68086a1746625b - if it did upload, we would have a data point ending on commit position 612276. Builder output is at https://chromium-swarm.appspot.com/task?id=41797877fc6ba710&refresh=10 Roping in Ben for some guidance from someone more familiar with the dashboard. Ben, any ideas why uploaded data wouldn't be showing on the dashboard despite the local dev dashboard server accepting it just fine? Some debug data like the histograms that are actually getting uploaded are in https://logs.chromium.org/logs/chromium/led/bsheedy_google.com/83da951986b99359dbde2606e6e51baaf2f54999f69dbe59dc04cc8ec7b4c380/+/steps/sizes/0/stdout (look for the "Got histograms:" line)
,
Nov 29
It looks like a queue timeout. "Request was aborted after waiting too long to attempt to service your request." Stackdriver shows these errors are consistently between 0/s and 0.2/s as far back as the logs go. Let me know if these errors happen consistently for you. If so, I can try to look into it. If not, let's wait until simonhatch gets back to see if he can find a way to eliminate these aborts.
,
Nov 29
These seem to be pretty consistent when uploading from these Chromium Clang bots on LUCI, although I only have a sample size of... 4? I could commit the two small src-side and build-side I have that got the upload working. That'll give us more data to see if this is consistent, and if it's not happening 100% of the time, at least some of the data from LUCI should start getting uploaded.
,
Nov 29
Let's try that until Simon is back. Thanks Brian! +simonhatch on bug to comment.
,
Nov 30
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/a7f27047b8146a699248316ca0d4495f7a9f9a99 commit a7f27047b8146a699248316ca0d4495f7a9f9a99 Author: bsheedy <bsheedy@chromium.org> Date: Fri Nov 30 01:08:30 2018 Support histograms in sizes Adds support for uploading in the HistogramSet format during the sizes step on LUCI. This, combined with https://chromium-review.googlesource.com/c/chromium/src/+/1352395, seems to fix sizes uploading when on LUCI. Also re-enables the sizes step on ToTAndroidOfficial, which was disabled due to the upload issues. Bug: 906685 Change-Id: I392b72e7321da8e4098ec93c9b9cbe8a1e1baa2d Reviewed-on: https://chromium-review.googlesource.com/c/1355348 Commit-Queue: Brian Sheedy <bsheedy@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> [modify] https://crrev.com/a7f27047b8146a699248316ca0d4495f7a9f9a99/scripts/slave/README.recipes.md [modify] https://crrev.com/a7f27047b8146a699248316ca0d4495f7a9f9a99/scripts/slave/recipe_modules/chromium_tests/chromium_clang.py [modify] https://crrev.com/a7f27047b8146a699248316ca0d4495f7a9f9a99/scripts/slave/recipe_modules/chromium/tests/sizes.py [modify] https://crrev.com/a7f27047b8146a699248316ca0d4495f7a9f9a99/scripts/slave/recipe_modules/chromium/api.py [modify] https://crrev.com/a7f27047b8146a699248316ca0d4495f7a9f9a99/scripts/slave/runtest.py
,
Nov 30
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/11d533a29fc9a9afd8b345e825b5484471e1f10d commit 11d533a29fc9a9afd8b345e825b5484471e1f10d Author: bsheedy <bsheedy@chromium.org> Date: Fri Nov 30 01:57:34 2018 Fix sizes script conversion error Fixes an issue in the sizes script when converting from ChartJSON to HistogramSet. The data produced by the results collector does not contain the benchmark_name or format_version keys, which causes the conversion script to fail. Bug: 906685 Change-Id: I15258026651499ac2a7640958c7bd2d93a794a35 Reviewed-on: https://chromium-review.googlesource.com/c/1352395 Commit-Queue: Brian Sheedy <bsheedy@chromium.org> Reviewed-by: Thomas Anderson <thomasanderson@chromium.org> Cr-Commit-Position: refs/heads/master@{#612511} [modify] https://crrev.com/11d533a29fc9a9afd8b345e825b5484471e1f10d/infra/scripts/legacy/scripts/slave/chromium/sizes.py
,
Nov 30
> Also re-enables the sizes step on ToTAndroidOfficial, which was disabled due to the upload issues. Cool! It looks like the first build succeeded to upload sizes: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/ToTAndroidOfficial/788 But the next one failed: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/ToTAndroidOfficial/789
,
Nov 30
Looks like that failure was caused by https://chromium.googlesource.com/chromium/src/+/7268f2d1789e824d05fee1d2fe5d52a4c19f12be. It removed the --failures flag in sizes.py without removing its usage build-side, so the script failed before it could actually do any work. It's been reverted, and the bot looks (mostly) green since.
,
Nov 30
A quick glance through the builders shows that most LUCI versions are "successfully" uploading (step is green, but still not seeing the results on the dashboard, so it looks like the aborts are consistent). A few, like ToTAndroid64, seem to be running into another issue I discovered with running both the Buildbot and LUCI versions in parallel: ChartJSON files that fail to upload on Buildbot will be cached and cause the LUCI versions to fail. Looking at a failed sizes step, it's trying to upload several ChartJSON files despite being on LUCI https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8928439304591587216/+/steps/sizes/0/stdout. This *should* be cleared up once the Buildbot version gets around to trying to upload again (why is it taking ToTAndroid64 on Buildbot taking 14+ hours to complete??), but there's a potential for the same issue to pop up again in the future if the Buildbot version has an upload issue. The solution I went with for the XR builders that had this issue was to simply remove the scheduler for the Buildbot version when switching permanently to LUCI instead of removing the bot entirely. If the issue pops up, the buildbot version can have a build manually triggered to clear the cache. Once all the LUCI versions are confirmed to no longer have any ChartJSON data in the cache, the Buildbot versions can be completely removed.
,
Dec 3
> ChartJSON files that fail to upload on Buildbot will be cached and cause the LUCI versions to fail. Is this "caching" intentional? Can we get rid of it?
,
Dec 3
The caching itself is intentional, but caches not being restricted to either buildbot or LUCI probably isn't. It does have some value (less likely to lose data), so I'd be hesitant to completely remove it. Trying to fix the caches being mixed might be beneficial, but I'm not sure off-hand where that code is or how easy it would be to fix.
,
Dec 3
where is the data being cached on the bots?
,
Dec 3
(there's a supported caching mechanism for both luci and buildbot, which won't conflict, but the above comment makes me wonder if this data is being cached in some third special location)
,
Dec 3
This is the cache that data eventually gets set from https://cs.chromium.org/chromium/build/scripts/slave/results_dashboard.py?sq=package:chromium&dr=C&g=0&l=113 At a glance, I don't see any reason why it should be available anywhere other than on that particular bot, but it clearly is.
,
Dec 3
Is 'build_dir' a standard build directory path? like ....../out/Release or something? if so, that should be fine...
,
Dec 4
,
Dec 4
Looks like the queue timeout is happening for the Windows builders, too. Ben, do you think you could take a look since they're happening regularly and Simon won't be back for a while?
,
Dec 4
Looking.
,
Dec 4
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/3f9ec8d997af4800c502ee12dff2b948f1623d86 commit 3f9ec8d997af4800c502ee12dff2b948f1623d86 Author: bsheedy <bsheedy@chromium.org> Date: Tue Dec 04 20:08:47 2018 Fix sizes step on Win Fixes the sizes step on Windows builders by relocating the temporary directory cleanup call. This was being called too late, which was causing a directory that the sizes script needed to be deleted before the script ran. Bug: 910717 , 906685 Change-Id: I992a39c5fa431af4dd67dce5cacb764cd36ce028 Reviewed-on: https://chromium-review.googlesource.com/c/1361881 Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Brian Sheedy <bsheedy@chromium.org> [modify] https://crrev.com/3f9ec8d997af4800c502ee12dff2b948f1623d86/scripts/slave/runtest.py
,
Dec 4
I started adding a new datastore index to fix a NeedIndexError. It should start serving within a day. Let's see if that helps.
,
Dec 5
Not seeing any data showing up from the LUCI builders, so it looks like the NeedIndexError fix didn't help :(
,
Dec 10
ToTLinux(dbg) started working on 12/6: https://chromeperf.appspot.com/report?sid=7b981e3470b74da766f96a49d371724f843b052b9fce2e72bc68086a1746625b
,
Dec 10
Looks like all those results are from the Buildbot version, not the LUCI one.
,
Dec 10
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/74c92bb2207675ea7702153dd1cf60fddefe8495 commit 74c92bb2207675ea7702153dd1cf60fddefe8495 Author: benshayden <benjhayden@chromium.org> Date: Mon Dec 10 22:27:37 2018 Add SparseDiagnostic index for _FindOrInsertDiagnosticsOutOfOrder Seeing lots of NeedIndexError in the logs for /add_histograms/process with Suite: ChromiumPerf/chromium_perf/sizes as well as others. SparseDiagnostic._FindOrInsertDiagnosticsOutOfOrder() calls query(end_revision >= rev - 1, test == test).order(-end_revision). This query requires an index like (test, -end_revision). This CL adds the required index and a debug log print. Bug: chromium:906685 Change-Id: I90ed1b6ce5e1d8fc2a620f298cb2d24d0def1ca6 Reviewed-on: https://chromium-review.googlesource.com/c/1362081 Reviewed-by: Sean McCullough <seanmccullough@chromium.org> Commit-Queue: Ben Hayden <benjhayden@chromium.org> [modify] https://crrev.com/74c92bb2207675ea7702153dd1cf60fddefe8495/dashboard/index.yaml [modify] https://crrev.com/74c92bb2207675ea7702153dd1cf60fddefe8495/dashboard/dashboard/add_histograms.py
,
Dec 11
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7e802b3ba3cec408e97bdec7527be095d4e0af8c commit 7e802b3ba3cec408e97bdec7527be095d4e0af8c Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Date: Tue Dec 11 02:01:49 2018 Roll src/third_party/catapult 60a61e85fed1..74c92bb22076 (1 commits) https://chromium.googlesource.com/catapult.git/+log/60a61e85fed1..74c92bb22076 git log 60a61e85fed1..74c92bb22076 --date=short --no-merges --format='%ad %ae %s' 2018-12-10 benjhayden@chromium.org Add SparseDiagnostic index for _FindOrInsertDiagnosticsOutOfOrder Created with: gclient setdep -r src/third_party/catapult@74c92bb22076 The AutoRoll server is located here: https://autoroll.skia.org/r/catapult-autoroll Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel BUG= chromium:906685 TBR=sullivan@chromium.org Change-Id: I4195a550775a54f26d803d6c4c7350be20f692f0 Reviewed-on: https://chromium-review.googlesource.com/c/1370984 Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#615381} [modify] https://crrev.com/7e802b3ba3cec408e97bdec7527be095d4e0af8c/DEPS
,
Dec 11
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/8435aca79e4dc14a6307154953bc2a23bbb49da0 commit 8435aca79e4dc14a6307154953bc2a23bbb49da0 Author: bsheedy <bsheedy@chromium.org> Date: Tue Dec 11 18:42:14 2018 [tracing] Check for both type and value presence Changes the behavior in chart_json_converter.html to check for either value.type or the actual value fields when decided how to add samples to the histogram. This fixes samples not being added if the type isn't explicitly set, but the data is otherwise valid and unambiguous. Bug: chromium:906685 Change-Id: I4510ca4dfef82ef8f29f997e1e9df2e0d3f6ae54 Reviewed-on: https://chromium-review.googlesource.com/c/1370690 Reviewed-by: Ben Hayden <benjhayden@chromium.org> Commit-Queue: Ben Hayden <benjhayden@chromium.org> [modify] https://crrev.com/8435aca79e4dc14a6307154953bc2a23bbb49da0/tracing/tracing/value/chart_json_converter_test.html [modify] https://crrev.com/8435aca79e4dc14a6307154953bc2a23bbb49da0/tracing/tracing/value/chart_json_converter.html
,
Dec 11
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4b8b35e2647985ea9b4b551245c837166253775c commit 4b8b35e2647985ea9b4b551245c837166253775c Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Date: Tue Dec 11 20:13:07 2018 Roll src/third_party/catapult 74c92bb22076..8435aca79e4d (1 commits) https://chromium.googlesource.com/catapult.git/+log/74c92bb22076..8435aca79e4d git log 74c92bb22076..8435aca79e4d --date=short --no-merges --format='%ad %ae %s' 2018-12-11 bsheedy@chromium.org [tracing] Check for both type and value presence Created with: gclient setdep -r src/third_party/catapult@8435aca79e4d The AutoRoll server is located here: https://autoroll.skia.org/r/catapult-autoroll Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel BUG= chromium:906685 TBR=sullivan@chromium.org Change-Id: I6ba0a6ae9eae9106da018dd8e836d9016177952e Reviewed-on: https://chromium-review.googlesource.com/c/1372405 Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#615638} [modify] https://crrev.com/4b8b35e2647985ea9b4b551245c837166253775c/DEPS
,
Dec 12
So the CL from c#33 was supposed to fix an issue in the conversion script that was causing all the numerical data to be dropped during the conversion. Unfortunately, it looks like either the fix didn't work as expected, or there's another problem that needs fixing. I'll get a dump of the data that's being uploaded to verify that the fix is working as intended and report back. benjhayden@, when you get a chance, could you take another look server-side?
,
Dec 13
Looks like the conversion script fix didn't work as expected :/ The numerical data is still getting stripped. I'll poke around locally.
,
Dec 13
I think I found the issue, which is that the ChartJSON data from the sizes step is malformed.
ChartJSON is supposed to be in the following format:
{
...
"charts": {
"metric1": {
"story1": { <data> },
"story2": { <data> },
},
"metric2": { ... },
}
}
But the data output by the sizes step is:
{
"charts": {
"story1": { <data> },
"story2": { <data> },
...
},
}
I.e. the metric is always implied to be "size".
The dashboard currently handles this by automatically using anything after a "-" as the actual story name, e.g. for "libchrome.so-bss" on ToTAndroid, the path on the dashboard becomes "ChromeiumClang/ToTAndroid/sizes/libchrome.so-bss/bss".
This should be easy enough to add in the sizes script itself whenever we dump data meant to be converted to histograms. However, I'm not quite sure how to handle the stories that don't have a "-" in the name, e.g. "libchrome.so", which would only show up as "ChromiumClang/ToTAndroid/sizes/libchrome.so". Would having a "libchrome.so" metric with only a "summary" story be equivalent?
,
Dec 13
Scratch that, looks like the sizes data has an "identifier" field that we can use for the story name, which is what the dashboard seems to use as well. That should make things a little easier.
,
Dec 14
Success! I ran a build (https://chromium-swarm.appspot.com/task?id=41c27f0f0267d410&refresh=10) with https://chromium-review.googlesource.com/c/chromium/src/+/1377195, and it looks like the results are showing up on the dashboard. Look for the point for CP 616444-616458 in https://chromeperf.appspot.com/report?sid=2e9482f2b10dae2581372d0fce74da39bb7e9df80ed66bdf8fb29b4dccb9dc09. It looks like there's still a minor issue of it not displaying links to the build page or stdio, although I'm not actually sure that's an issue with sizes or some general LUCI issue and/or because the build was run via LED instead of on a normal bot. I'll get the patch in so that we can at least get data from the LUCI builders.
,
Dec 14
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c8230585073a89605f5dd505f8a04d195fbd7756 commit c8230585073a89605f5dd505f8a04d195fbd7756 Author: bsheedy <bsheedy@chromium.org> Date: Fri Dec 14 20:29:32 2018 Fix malformed sizes data Fixes the sizes.py script producing malformed data that couldn't be converted to histograms. The conversion script expects metric dicts that contain story dicts, but sizes was essentially only producing metric dicts. This pulls out the "identifier" value to be the story name since that seems to be what the perf dashboard is doing automatically when uploading in the old format. Bug: 906685 Change-Id: Ie06f2ff477e3346059473863138478c261712772 Reviewed-on: https://chromium-review.googlesource.com/c/1377195 Reviewed-by: Thomas Anderson <thomasanderson@chromium.org> Commit-Queue: Brian Sheedy <bsheedy@chromium.org> Cr-Commit-Position: refs/heads/master@{#616807} [modify] https://crrev.com/c8230585073a89605f5dd505f8a04d195fbd7756/infra/scripts/legacy/scripts/slave/chromium/sizes.py
,
Dec 17
Looks like sizes data is being uploaded/displayed regularly now, e.g. https://chromeperf.appspot.com/report?sid=45acb07a7534d6505707be5046f2762699e5ac46d0cb04ee1a52a733e318b7db It doesn't quite keep keep continuity with historical data due to how the data is being displayed. For example, libchrome.so has both the old and new data on the same chart since both end up being ChromiumClang/ToTAndroid/sizes/libchrome.so. However, libchrome.so-bss is slightly different. The old data shows up in ChromiumClang/ToTAndroid/sizes/libchrome.so-bss/bss, while the new data is in either .../libchrome.so-bss or .../libchrome.so-bss_avg. I'm not sure if there's anything we can really do about that - any input on this Ben? If there isn't anything we can do about that, I think this can be closed out.
,
Dec 18
I can migrate timeseries from old test paths to new test paths if you want to file a migration bug like this one: https://bugs.chromium.org/p/chromium/issues/detail?id=903362&desc=2 If you want to keep using the old test paths, you can abuse the stories diagnostic. /add_histograms creates test paths like master/bot/suite/histogram.name/story, unless storyTags are set, in which case they go between histogram.name and story: https://github.com/catapult-project/catapult/blob/master/dashboard/dashboard/common/histogram_helpers.py#L94 If you want to move ../libchrome.so-bss to ../libchrome.so-bss/bss, then you can add a stories diagnostic to the histogram containing 'bss': https://github.com/catapult-project/catapult/blob/master/tracing/tracing/value/diagnostics/reserved_infos.py#L64
,
Dec 18
I'll leave that decision to someone who's more involved with these bots/data sets - hans@, any preference on what to do?
,
Dec 19
> hans@, any preference on what to do? From #41 it sounds like the question is do we care about keeping continuity with historical data, because the new data is now filed under some slightly different name? I don't think that's a problem. The important thing is that we get the data going forward.
,
Dec 19
Yep, all the data's there. In that case, I'll mark this as fixed since all the LUCI builders seem happy now. Re-open if there's some special one that's misbehaving.
,
Dec 19
Great, thanks for all your help! |
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by bugdroid1@chromium.org
, Nov 26