Builds hang during parent completion step between CIDB polls |
||||
Issue descriptionThis CQ run: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934219634575875552 Appears to have hung during CommitQueueCompletion: https://luci-logdog.appspot.com/v/?s=chromeos/bb/chromeos/master-paladin/19706/+/recipes/steps/CommitQueueCompletion/0/stdout 10:15:34: INFO: Still waiting for the following builds to complete: [u'amd64-generic-paladin', u'arm-generic-paladin', u'atlas-paladin', u'auron_paine-paladin', u'auron_yuna-paladin', u'betty-arc64-paladin', u'betty-arcnext-paladin', u'betty-paladin', u'bob-paladin', u'caroline-arcnext-paladin', u'caroline-paladin', u'cave-paladin', u'chell-nowithdebug-paladin', u'chell-paladin', u'cheza-paladin', u'coral-paladin', u'cyan-paladin', u'daisy-paladin', u'daisy_skate-paladin', u'edgar-paladin', u'elm-paladin', u'eve-paladin', u'falco-paladin', u'fizz-paladin', u'glados-paladin', u'guado_moblab-paladin', u'hana-paladin', u'kevin-arcnext-paladin', u'kevin-paladin', u'kip-paladin', u'kukui-paladin', u'leon-paladin', u'link-paladin', u'moblab-generic-vm-paladin', u'monroe-paladin', u'nami-paladin', u'nocturne-paladin', u'nyan_big-paladin', u'nyan_kitty-paladin', u'oak-paladin', u'octopus-paladin', u'panther-paladin', u'peach_pit-paladin', u'peppy-paladin', u'poppy-paladin', u'quawks-paladin', u'reef-paladin', u'samus-paladin', u'scarlet-paladin', u'sentry-paladin', u'stout-paladin', u'terra-paladin', u'tidus-paladin', u'veyron_jaq-paladin', u'veyron_jerry-paladin', u'veyron_mighty-paladin', u'veyron_minnie-paladin', u'veyron_speedy-paladin', u'veyron_tiger-paladin', u'winky-paladin', u'wizpig-paladin', u'wolf-paladin'] 10:15:34: INFO: 2:40:24.007942 until timeout... 10:16:34: INFO: Updating slave status... 10:17:13: INFO: RetriableHttp: attempt 1 receiving status 502, will retry 10:17:14: INFO: RetriableHttp: attempt 1 receiving status 503, will retry 10:17:17: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2) 10:17:17: INFO: Refreshing access_token [1;33m10:17:20: WARNING: HttpsMonitor.send received status 429: { "error": { "code": 429, "message": "Insufficient tokens for quota 'WriteGroup' and limit 'CLIENT_PROJECT-100s' of service 'prodxmon-pa.googleapis.com' for consumer 'project_number:102025095358'.", "status": "RESOURCE_EXHAUSTED", "details": [ { "@type": "type.googleapis.com/google.rpc.Help", "links": [ { "description": "Google developer console API key", "url": "https://console.developers.google.com/project/102025095358/apiui/credential" } ] } ] } } [0m <wash, rinse, repeat>
,
Sep 27
It looks like a quota exhaustion when sending metrics? Any idea if that could have been caused by killing the previous master-paladin run?
,
Sep 27
Killing the previous build shouldn't matter, we do things like that fairly often. Also.... metrics aren't supposed to be able to hang the process.
,
Sep 28
More likely, this is a CIDB hang (thanks ekbot@ and jclinton@)
,
Oct 12
This appears to be the same issue as crbug/883421 yet that was marked as fixed. This is normally a case in which the DB hangs; the symptoms seem to indicate an issue with the metrics but, from past experience and debugging, metrics are a background process and should be retried on the server side without causing impacts to the build itself. This appears to be an issue with CIDB hanging but without finding it in a hung state we have very little to go on. Will keep this open, unassigned, for the next oncall who experiences an issue with CIDB to associate. -- Mike
,
Oct 12
Damn. We did fix two real problems in that other bug.
,
Oct 12
No worries. This is may not be related other than it appearing to be related to the DB. I just do not have anything more to go by based on the error provided. I thought the other issue was still open so was going to associate. For now, this is just a tracking bug for issues related to the DB. We hope this becomes a stale bug and gets closed due to there be no further issues. =) -- Mike
,
Dec 15
Stayed up late watching builds for an unrelated reason and saw this happen by pure chance. So, we finally have some debugging information. This build was affected: https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8927109628284744048 You can see the hang in the log timestamps at 35m 43s: https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927109628284744048/+/steps/MasterSlaveSyncCompletion/0/stdout#L5988_0 This occurred right after reading the experimental status before calling CIDB. Here was the process tree at the time of the hang when I logged in to the bot: 1038 ? Ssl 0:00 /usr/bin/python -u /opt/chromebuild/chromebuild-startup.py --log /var/log/messages/chromebuild/startup.log 1093 ? Sl 0:57 \_ /usr/bin/python /b/swarming/swarming_bot.1.zip start_bot 5568 ? Sl 0:13 \_ /usr/bin/python /b/swarming/swarming_bot.1.zip task_runner --swarming-server https://3945-402d3be-dot-chrome-swarming.appspot.com --in-file /b/swarming/w/task_runner_in.json --out-file /b/swarming/w/task 5581 ? S 0:12 \_ /usr/bin/python /b/swarming/swarming_bot.1.zip run_isolated -a /b/swarming/w/run_isolated_args.json 5655 ? Sl 0:10 \_ /b/swarming/w/ir/kitchen cook -buildbucket-hostname cr-buildbucket.appspot.com -buildbucket-build-id 8927109628284744048 -call-update-build -build-url https://ci.chromium.org/b/892710962828474404 5935 ? S 0:00 \_ python kitchen-checkout/scripts/slave/recipes.py --operational-args-path /b/swarming/w/ir/tmp/rr/op_args.json run --properties-file /b/swarming/w/ir/tmp/rr/properties.json --workdir /b/swarmi 6000 ? S 0:03 \_ /b/swarming/w/ir/cache/vpython/02d5f8/bin/python -u -s /b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/main.py --package /b/swarming/w/ir/kitchen- 6757 ? S 0:05 \_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch --buildroot /b/swarming/w/ir/cache/cbuild --branch master --buildbucket-id 8927109628284744048 --git-cache-d 6760 ? Sl 0:09 \_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch --buildroot /b/swarming/w/ir/cache/cbuild --branch master --buildbucket-id 8927109628284744048 --git-cac 6764 ? S 0:01 \_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch --buildroot /b/swarming/w/ir/cache/cbuild --branch master --buildbucket-id 8927109628284744048 --git-cac 15361 ? S 0:04 \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 892 15364 ? S 0:00 \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 15371 ? S 0:00 \_ bash -c while :; do read -t 240; [ $? -le 128 ] && exit; sudo -n true 2>/dev/null; sudo -n true < /dev/null > /dev/null 2>&1; done 15392 ? Ss 0:00 \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 15395 ? Sl 0:09 \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 15399 ? S 0:01 \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 15401 ? Sl 0:00 \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 15406 ? Sl 0:00 \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 26626 ? S 0:09 \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir /b/s 26629 ? Ss 0:00 \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir 26631 ? Sl 0:09 \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir 26635 ? S 0:03 \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir 26637 ? Sl 0:00 \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir 26642 ? Sl 0:00 \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir I had a hunch that it was a network hang so I took at look at netstat: $ sudo netstat -tnp | grep python2 tcp 0 0 10.0.2.119:60969 173.194.216.95:443 ESTABLISHED 15399/python2 tcp 0 0 10.0.2.119:32772 173.194.216.95:443 ESTABLISHED 26635/python2 tcp 1 0 10.0.2.119:40617 74.125.141.84:443 CLOSE_WAIT 26635/python2 tcp 0 0 10.0.2.119:38264 130.211.191.11:3306 ESTABLISHED 26626/python2 tcp 1 0 10.0.2.119:40473 74.125.141.84:443 CLOSE_WAIT 6764/python2 tcp 0 0 10.0.2.119:46825 173.194.209.95:443 ESTABLISHED 6764/python2 tcp 0 0 10.0.2.119:37704 130.211.191.11:3306 ESTABLISHED 15361/python2 tcp 1 0 10.0.2.119:40581 74.125.141.84:443 CLOSE_WAIT 15399/python2 tcp 0 0 10.0.2.119:47023 173.194.209.153:443 ESTABLISHED 26626/python2 On the hunch that the problem was MySQL, I straced both PID's with 3306 ports open: jclinton@swarm-cros-576:~$ sudo strace -f -p 15361 Process 15361 attached wait4(26626, ^CProcess 15361 detached <detached ...> jclinton@swarm-cros-576:~$ sudo strace -f -p 26626 Process 26626 attached read(9, ^CProcess 26626 detached <detached ...> So only 26626 was blocked on I/O. I watch that for awhile and it never moved. However, file descriptor 9 was *not* the MySQL connection: jclinton@swarm-cros-576:~$ sudo lsof -p 26626 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME python2 26626 chrome-bot cwd DIR 8,1 4096 10096449 /b/swarming/w/ir/cache/cbuild/repository python2 26626 chrome-bot rtd DIR 8,1 4096 2 / python2 26626 chrome-bot txt REG 8,1 3341256 5540 /usr/bin/python2.7 python2 26626 chrome-bot mem REG 8,1 90160 2972 /lib/x86_64-linux-gnu/libgcc_s.so.1 python2 26626 chrome-bot mem REG 8,1 3359328 6340 /usr/lib/x86_64-linux-gnu/libmysqlclient.so.18.0.0 python2 26626 chrome-bot mem REG 8,1 58760 2206 /usr/lib/python2.7/dist-packages/_mysql.so python2 26626 chrome-bot mem REG 8,1 101240 4297 /lib/x86_64-linux-gnu/libresolv-2.19.so python2 26626 chrome-bot mem REG 8,1 22952 4361 /lib/x86_64-linux-gnu/libnss_dns-2.19.so python2 26626 chrome-bot mem REG 8,1 43616 4295 /lib/x86_64-linux-gnu/libnss_files-2.19.so python2 26626 chrome-bot mem REG 8,1 47760 4329 /lib/x86_64-linux-gnu/libnss_nis-2.19.so python2 26626 chrome-bot mem REG 8,1 97296 4355 /lib/x86_64-linux-gnu/libnsl-2.19.so python2 26626 chrome-bot mem REG 8,1 39824 4333 /lib/x86_64-linux-gnu/libnss_compat-2.19.so python2 26626 chrome-bot mem REG 8,1 170056 2558 /lib/x86_64-linux-gnu/libexpat.so.1.6.0 python2 26626 chrome-bot mem REG 8,1 69200 10292 /usr/lib/python2.7/lib-dynload/pyexpat.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 58360 10286 /usr/lib/python2.7/lib-dynload/_elementtree.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 19560 793966 /usr/lib/python2.7/dist-packages/sqlalchemy/cresultproxy.so python2 26626 chrome-bot mem REG 8,1 10328 793965 /usr/lib/python2.7/dist-packages/sqlalchemy/cutils.so python2 26626 chrome-bot mem REG 8,1 15680 793964 /usr/lib/python2.7/dist-packages/sqlalchemy/cprocessors.so python2 26626 chrome-bot mem REG 8,1 33448 10273 /usr/lib/python2.7/lib-dynload/_multiprocessing.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 31720 1055192 /usr/lib/python2.7/dist-packages/Crypto/Cipher/_AES.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 15584 1055165 /usr/lib/python2.7/dist-packages/Crypto/Util/_counter.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 475168 2893 /usr/lib/x86_64-linux-gnu/libgmp.so.10.1.3 python2 26626 chrome-bot mem REG 8,1 73728 1055176 /usr/lib/python2.7/dist-packages/Crypto/PublicKey/_fastmath.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 66488 8578 /usr/lib/python2.7/dist-packages/OpenSSL/SSL.so python2 26626 chrome-bot mem REG 8,1 12280 8595 /usr/lib/python2.7/dist-packages/OpenSSL/rand.so python2 26626 chrome-bot mem REG 8,1 96776 8577 /usr/lib/python2.7/dist-packages/OpenSSL/crypto.so python2 26626 chrome-bot mem REG 8,1 18936 2944 /lib/x86_64-linux-gnu/libuuid.so.1.3.0 python2 26626 chrome-bot mem REG 8,1 30952 20779 /usr/lib/x86_64-linux-gnu/libffi.so.6.0.1 python2 26626 chrome-bot mem REG 8,1 136232 10360 /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 54064 10282 /usr/lib/python2.7/lib-dynload/_json.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 24648 10291 /usr/lib/python2.7/lib-dynload/termios.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 387272 9727 /lib/x86_64-linux-gnu/libssl.so.1.0.0 python2 26626 chrome-bot mem REG 8,1 38480 10300 /usr/lib/python2.7/lib-dynload/_ssl.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 1938752 9728 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 python2 26626 chrome-bot mem REG 8,1 20664 10272 /usr/lib/python2.7/lib-dynload/_hashlib.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 109232 10376 /usr/lib/python2.7/lib-dynload/datetime.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 77752 10285 /usr/lib/python2.7/lib-dynload/parser.x86_64-linux-gnu.so python2 26626 chrome-bot mem REG 8,1 3381168 4260 /usr/lib/locale/locale-archive python2 26626 chrome-bot mem REG 8,1 1071552 4337 /lib/x86_64-linux-gnu/libm-2.19.so python2 26626 chrome-bot mem REG 8,1 100728 3115 /lib/x86_64-linux-gnu/libz.so.1.2.8 python2 26626 chrome-bot mem REG 8,1 10680 4365 /lib/x86_64-linux-gnu/libutil-2.19.so python2 26626 chrome-bot mem REG 8,1 14664 4357 /lib/x86_64-linux-gnu/libdl-2.19.so python2 26626 chrome-bot mem REG 8,1 1857312 4366 /lib/x86_64-linux-gnu/libc-2.19.so python2 26626 chrome-bot mem REG 8,1 141574 4349 /lib/x86_64-linux-gnu/libpthread-2.19.so python2 26626 chrome-bot mem REG 8,1 149120 4343 /lib/x86_64-linux-gnu/ld-2.19.so python2 26626 chrome-bot DEL REG 0,20 219290 /run/shm/sem.w6zFhE python2 26626 chrome-bot 0r CHR 1,3 0t0 1029 /dev/null python2 26626 chrome-bot 1w FIFO 0,8 0t0 134488 pipe python2 26626 chrome-bot 2w FIFO 0,8 0t0 134488 pipe python2 26626 chrome-bot 3r FIFO 0,8 0t0 206423 pipe python2 26626 chrome-bot 4w FIFO 0,8 0t0 206423 pipe python2 26626 chrome-bot 5u IPv4 189789 0t0 TCP swarm-cros-576.c.chromeos-bot.internal:38264->cosmic-strategy-646:cidb-gen2:mysql (ESTABLISHED) python2 26626 chrome-bot 6uW REG 8,1 0 19796748 /b/swarming/w/ir/tmp/t/cbuildbot-tmpGtmXZ1/tmpbvzsQ4 (deleted) python2 26626 chrome-bot 7r CHR 1,9 0t0 1034 /dev/urandom python2 26626 chrome-bot 8r CHR 1,9 0t0 1034 /dev/urandom python2 26626 chrome-bot 9u IPv4 369725 0t0 TCP swarm-cros-576.c.chromeos-bot.internal:47023->vg-in-f153.1e100.net:https (ESTABLISHED) python2 26626 chrome-bot 10u unix 0xffff883317d23fc0 0t0 219289 socket python2 26626 chrome-bot 11u unix 0xffff883317d24740 0t0 219293 socket python2 26626 chrome-bot 12u unix 0xffff883317d25280 0t0 219294 socket FD 9 is, instead, an HTTPS connection to "vg-in-f153.1e100.net". Wanting to terminate that socket to see if the build would gracefully recover, I used gdb to do that: jclinton@swarm-cros-576:~$ sudo gdb -p 26626 GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1 ... Attaching to process 26626 Reading symbols from /usr/bin/python2.7...Reading symbols from /usr/lib/debug//usr/bin/python2.7...done. done. ... Loaded symbols for /lib/x86_64-linux-gnu/libgcc_s.so.1 0x00007f56f75653ad in read () at ../sysdeps/unix/syscall-template.S:81 81 ../sysdeps/unix/syscall-template.S: No such file or directory. (gdb) call close(9) $1 = 0 (gdb) quit A debugging session is active. Inferior 1 [process 26626] will be detached. Quit anyway? (y or n) y Detaching from program: /usr/bin/python2.7, process 26626 At the point that I did this, the build had been hung at this point for 2 hours. As soon as I did this, the build recovered and finished gracefully as can be seen in the logs: https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927109628284744048/+/steps/MasterSlaveSyncCompletion/0/stdout#L6021_0 So, we know it's not CIDB, now. vg-in-f153.1e100.net/173.194.209.153 appears to be a Google-owned GFE fronting either AppEngine or a Cloud service. The GFE's statusz page is here: http://vgas132/statusz . As an L1 GFE, it has a lot of services mapped to it but, interestingly, the GFE stats show that all of the traffic is Cloud and AppEngine. My best guess is that TSMon uploader hung forever trying to push metrics to the HTTPS endpoint. However, it could also be a LUCI service. More investigation of the exact completion code path/loop needed. Mike, do we have a TSMon client contact?
,
Dec 17
I'm not aware who actually owns ts-mon but can reach out to see if we can find a contact. From what I've seen, ts-mon is forked as a background thread. It would be odd for it to manage to block, given all the different clients I've seen in the past that utilize MonAPI. That said, there are ways this could end up blocking so I'm not going to make any assumptions. I'll see if I can find an owner for ts-mon and at least see whether we can pinpoint it down to a service. I'm also going to look at the code further as it seems odd that we'd somehow block here for ts-mon. Could be some other service. Try to rule others out. -- Mike
,
Dec 18
Digging into the code a bit more it would appear that this might be a call to Buildbucket that is hanging as we provide no timeout. Going to put in a timeout to all Buildbucket calls to ensure we are at least giving up and not end up in a hung status. -- Mike |
||||
►
Sign in to add a comment |
||||
Comment 1 by dgarr...@chromium.org
, Sep 27