New issue
Advanced search Search tips

Issue 890053 link

Starred by 2 users

Issue metadata

Status: Available
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Builds hang during parent completion step between CIDB polls

Project Member Reported by dgarr...@chromium.org, Sep 27

Issue description

This CQ run:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934219634575875552

Appears to have hung during CommitQueueCompletion:

https://luci-logdog.appspot.com/v/?s=chromeos/bb/chromeos/master-paladin/19706/+/recipes/steps/CommitQueueCompletion/0/stdout


10:15:34: INFO: Still waiting for the following builds to complete: [u'amd64-generic-paladin', u'arm-generic-paladin', u'atlas-paladin', u'auron_paine-paladin', u'auron_yuna-paladin', u'betty-arc64-paladin', u'betty-arcnext-paladin', u'betty-paladin', u'bob-paladin', u'caroline-arcnext-paladin', u'caroline-paladin', u'cave-paladin', u'chell-nowithdebug-paladin', u'chell-paladin', u'cheza-paladin', u'coral-paladin', u'cyan-paladin', u'daisy-paladin', u'daisy_skate-paladin', u'edgar-paladin', u'elm-paladin', u'eve-paladin', u'falco-paladin', u'fizz-paladin', u'glados-paladin', u'guado_moblab-paladin', u'hana-paladin', u'kevin-arcnext-paladin', u'kevin-paladin', u'kip-paladin', u'kukui-paladin', u'leon-paladin', u'link-paladin', u'moblab-generic-vm-paladin', u'monroe-paladin', u'nami-paladin', u'nocturne-paladin', u'nyan_big-paladin', u'nyan_kitty-paladin', u'oak-paladin', u'octopus-paladin', u'panther-paladin', u'peach_pit-paladin', u'peppy-paladin', u'poppy-paladin', u'quawks-paladin', u'reef-paladin', u'samus-paladin', u'scarlet-paladin', u'sentry-paladin', u'stout-paladin', u'terra-paladin', u'tidus-paladin', u'veyron_jaq-paladin', u'veyron_jerry-paladin', u'veyron_mighty-paladin', u'veyron_minnie-paladin', u'veyron_speedy-paladin', u'veyron_tiger-paladin', u'winky-paladin', u'wizpig-paladin', u'wolf-paladin']
10:15:34: INFO: 2:40:24.007942 until timeout...
10:16:34: INFO: Updating slave status...
10:17:13: INFO: RetriableHttp: attempt 1 receiving status 502, will retry
10:17:14: INFO: RetriableHttp: attempt 1 receiving status 503, will retry
10:17:17: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
10:17:17: INFO: Refreshing access_token
10:17:20: WARNING: HttpsMonitor.send received status 429: {
  "error": {
    "code": 429,
    "message": "Insufficient tokens for quota 'WriteGroup' and limit 'CLIENT_PROJECT-100s' of service 'prodxmon-pa.googleapis.com' for consumer 'project_number:102025095358'.",
    "status": "RESOURCE_EXHAUSTED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.Help",
        "links": [
          {
            "description": "Google developer console API key",
            "url": "https://console.developers.google.com/project/102025095358/apiui/credential"
          }
        ]
      }
    ]
  }
}


<wash, rinse, repeat>
 
I'm reading that as a metris thing? Is that correct?
It looks like a quota exhaustion when sending metrics?  Any idea if that could have been caused by killing the previous master-paladin run?
Owner: mikenichols@chromium.org
Killing the previous build shouldn't matter, we do things like that fairly often.

Also.... metrics aren't supposed to be able to hang the process.
More likely, this is a CIDB hang (thanks ekbot@ and jclinton@)
Owner: ----
Status: Available (was: Untriaged)
Summary: CIDB hanging during builds (was: CQ run hung sending metrics?)
This appears to be the same issue as crbug/883421 yet that was marked as fixed.  

This is normally a case in which the DB hangs; the symptoms seem to indicate an issue with the metrics but, from past experience and debugging, metrics are a background process and should be retried on the server side without causing impacts to the build itself.  

This appears to be an issue with CIDB hanging but without finding it in a hung state we have very little to go on.  

Will keep this open, unassigned, for the next oncall who experiences an issue with CIDB to associate.  

-- Mike
Damn. We did fix two real problems in that other bug.
No worries.  This is may not be related other than it appearing to be related to the DB.  I just do not have anything more to go by based on the error provided.  I thought the other issue was still open so was going to associate. 

For now, this is just a tracking bug for issues related to the DB.  We hope this becomes a stale bug and gets closed due to there be no further issues.  =)

-- Mike
Labels: -Pri-3 OS-Chrome Pri-2
Owner: mikenichols@chromium.org
Summary: Builds hang during parent completion step between CIDB polls (was: CIDB hanging during builds)
Stayed up late watching builds for an unrelated reason and saw this happen by pure chance. So, we finally have some debugging information.

This build was affected:
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8927109628284744048

You can see the hang in the log timestamps at 35m 43s:
https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927109628284744048/+/steps/MasterSlaveSyncCompletion/0/stdout#L5988_0

This occurred right after reading the experimental status before calling CIDB.

Here was the process tree at the time of the hang when I logged in to the bot:

 1038 ?        Ssl    0:00 /usr/bin/python -u /opt/chromebuild/chromebuild-startup.py --log /var/log/messages/chromebuild/startup.log
 1093 ?        Sl     0:57  \_ /usr/bin/python /b/swarming/swarming_bot.1.zip start_bot
 5568 ?        Sl     0:13      \_ /usr/bin/python /b/swarming/swarming_bot.1.zip task_runner --swarming-server https://3945-402d3be-dot-chrome-swarming.appspot.com --in-file /b/swarming/w/task_runner_in.json --out-file /b/swarming/w/task
 5581 ?        S      0:12          \_ /usr/bin/python /b/swarming/swarming_bot.1.zip run_isolated -a /b/swarming/w/run_isolated_args.json
 5655 ?        Sl     0:10              \_ /b/swarming/w/ir/kitchen cook -buildbucket-hostname cr-buildbucket.appspot.com -buildbucket-build-id 8927109628284744048 -call-update-build -build-url https://ci.chromium.org/b/892710962828474404
 5935 ?        S      0:00                  \_ python kitchen-checkout/scripts/slave/recipes.py --operational-args-path /b/swarming/w/ir/tmp/rr/op_args.json run --properties-file /b/swarming/w/ir/tmp/rr/properties.json --workdir /b/swarmi
 6000 ?        S      0:03                      \_ /b/swarming/w/ir/cache/vpython/02d5f8/bin/python -u -s /b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/main.py --package /b/swarming/w/ir/kitchen-
 6757 ?        S      0:05                          \_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch --buildroot /b/swarming/w/ir/cache/cbuild --branch master --buildbucket-id 8927109628284744048 --git-cache-d
 6760 ?        Sl     0:09                              \_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch --buildroot /b/swarming/w/ir/cache/cbuild --branch master --buildbucket-id 8927109628284744048 --git-cac
 6764 ?        S      0:01                              \_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch --buildroot /b/swarming/w/ir/cache/cbuild --branch master --buildbucket-id 8927109628284744048 --git-cac
15361 ?        S      0:04                              \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 892
15364 ?        S      0:00                                  \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id
15371 ?        S      0:00                                  \_ bash -c while :; do read -t 240; [ $? -le 128 ] && exit; sudo -n true 2>/dev/null; sudo -n true < /dev/null > /dev/null 2>&1; done
15392 ?        Ss     0:00                                  \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id
15395 ?        Sl     0:09                                  \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id
15399 ?        S      0:01                                  \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id
15401 ?        Sl     0:00                                  \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id
15406 ?        Sl     0:00                                  \_ python2 /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id
26626 ?        S      0:09                                  \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir /b/s
26629 ?        Ss     0:00                                      \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir 
26631 ?        Sl     0:09                                      \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir 
26635 ?        S      0:03                                      \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir 
26637 ?        Sl     0:00                                      \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir 
26642 ?        Sl     0:00                                      \_ python2 chromite/bin/cbuildbot master-postsubmit --buildroot /b/swarming/w/ir/cache/cbuild/repository --branch master --buildbucket-id 8927109628284744048 --git-cache-dir 

I had a hunch that it was a network hang so I took at look at netstat:

$ sudo netstat -tnp | grep python2
tcp        0      0 10.0.2.119:60969        173.194.216.95:443      ESTABLISHED 15399/python2   
tcp        0      0 10.0.2.119:32772        173.194.216.95:443      ESTABLISHED 26635/python2   
tcp        1      0 10.0.2.119:40617        74.125.141.84:443       CLOSE_WAIT  26635/python2   
tcp        0      0 10.0.2.119:38264        130.211.191.11:3306     ESTABLISHED 26626/python2   
tcp        1      0 10.0.2.119:40473        74.125.141.84:443       CLOSE_WAIT  6764/python2    
tcp        0      0 10.0.2.119:46825        173.194.209.95:443      ESTABLISHED 6764/python2    
tcp        0      0 10.0.2.119:37704        130.211.191.11:3306     ESTABLISHED 15361/python2   
tcp        1      0 10.0.2.119:40581        74.125.141.84:443       CLOSE_WAIT  15399/python2   
tcp        0      0 10.0.2.119:47023        173.194.209.153:443     ESTABLISHED 26626/python2   

On the hunch that the problem was MySQL, I straced both PID's with 3306 ports open:

jclinton@swarm-cros-576:~$ sudo strace -f -p 15361
Process 15361 attached
wait4(26626, ^CProcess 15361 detached
 <detached ...>
jclinton@swarm-cros-576:~$ sudo strace -f -p 26626
Process 26626 attached
read(9, ^CProcess 26626 detached
 <detached ...>

So only 26626 was blocked on I/O. I watch that for awhile and it never moved. However, file descriptor 9 was *not* the MySQL connection:

jclinton@swarm-cros-576:~$ sudo lsof -p 26626
COMMAND   PID       USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
python2 26626 chrome-bot  cwd    DIR                8,1     4096 10096449 /b/swarming/w/ir/cache/cbuild/repository
python2 26626 chrome-bot  rtd    DIR                8,1     4096        2 /
python2 26626 chrome-bot  txt    REG                8,1  3341256     5540 /usr/bin/python2.7
python2 26626 chrome-bot  mem    REG                8,1    90160     2972 /lib/x86_64-linux-gnu/libgcc_s.so.1
python2 26626 chrome-bot  mem    REG                8,1  3359328     6340 /usr/lib/x86_64-linux-gnu/libmysqlclient.so.18.0.0
python2 26626 chrome-bot  mem    REG                8,1    58760     2206 /usr/lib/python2.7/dist-packages/_mysql.so
python2 26626 chrome-bot  mem    REG                8,1   101240     4297 /lib/x86_64-linux-gnu/libresolv-2.19.so
python2 26626 chrome-bot  mem    REG                8,1    22952     4361 /lib/x86_64-linux-gnu/libnss_dns-2.19.so
python2 26626 chrome-bot  mem    REG                8,1    43616     4295 /lib/x86_64-linux-gnu/libnss_files-2.19.so
python2 26626 chrome-bot  mem    REG                8,1    47760     4329 /lib/x86_64-linux-gnu/libnss_nis-2.19.so
python2 26626 chrome-bot  mem    REG                8,1    97296     4355 /lib/x86_64-linux-gnu/libnsl-2.19.so
python2 26626 chrome-bot  mem    REG                8,1    39824     4333 /lib/x86_64-linux-gnu/libnss_compat-2.19.so
python2 26626 chrome-bot  mem    REG                8,1   170056     2558 /lib/x86_64-linux-gnu/libexpat.so.1.6.0
python2 26626 chrome-bot  mem    REG                8,1    69200    10292 /usr/lib/python2.7/lib-dynload/pyexpat.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1    58360    10286 /usr/lib/python2.7/lib-dynload/_elementtree.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1    19560   793966 /usr/lib/python2.7/dist-packages/sqlalchemy/cresultproxy.so
python2 26626 chrome-bot  mem    REG                8,1    10328   793965 /usr/lib/python2.7/dist-packages/sqlalchemy/cutils.so
python2 26626 chrome-bot  mem    REG                8,1    15680   793964 /usr/lib/python2.7/dist-packages/sqlalchemy/cprocessors.so
python2 26626 chrome-bot  mem    REG                8,1    33448    10273 /usr/lib/python2.7/lib-dynload/_multiprocessing.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1    31720  1055192 /usr/lib/python2.7/dist-packages/Crypto/Cipher/_AES.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1    15584  1055165 /usr/lib/python2.7/dist-packages/Crypto/Util/_counter.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1   475168     2893 /usr/lib/x86_64-linux-gnu/libgmp.so.10.1.3
python2 26626 chrome-bot  mem    REG                8,1    73728  1055176 /usr/lib/python2.7/dist-packages/Crypto/PublicKey/_fastmath.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1    66488     8578 /usr/lib/python2.7/dist-packages/OpenSSL/SSL.so
python2 26626 chrome-bot  mem    REG                8,1    12280     8595 /usr/lib/python2.7/dist-packages/OpenSSL/rand.so
python2 26626 chrome-bot  mem    REG                8,1    96776     8577 /usr/lib/python2.7/dist-packages/OpenSSL/crypto.so
python2 26626 chrome-bot  mem    REG                8,1    18936     2944 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
python2 26626 chrome-bot  mem    REG                8,1    30952    20779 /usr/lib/x86_64-linux-gnu/libffi.so.6.0.1
python2 26626 chrome-bot  mem    REG                8,1   136232    10360 /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1    54064    10282 /usr/lib/python2.7/lib-dynload/_json.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1    24648    10291 /usr/lib/python2.7/lib-dynload/termios.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1   387272     9727 /lib/x86_64-linux-gnu/libssl.so.1.0.0
python2 26626 chrome-bot  mem    REG                8,1    38480    10300 /usr/lib/python2.7/lib-dynload/_ssl.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1  1938752     9728 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
python2 26626 chrome-bot  mem    REG                8,1    20664    10272 /usr/lib/python2.7/lib-dynload/_hashlib.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1   109232    10376 /usr/lib/python2.7/lib-dynload/datetime.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1    77752    10285 /usr/lib/python2.7/lib-dynload/parser.x86_64-linux-gnu.so
python2 26626 chrome-bot  mem    REG                8,1  3381168     4260 /usr/lib/locale/locale-archive
python2 26626 chrome-bot  mem    REG                8,1  1071552     4337 /lib/x86_64-linux-gnu/libm-2.19.so
python2 26626 chrome-bot  mem    REG                8,1   100728     3115 /lib/x86_64-linux-gnu/libz.so.1.2.8
python2 26626 chrome-bot  mem    REG                8,1    10680     4365 /lib/x86_64-linux-gnu/libutil-2.19.so
python2 26626 chrome-bot  mem    REG                8,1    14664     4357 /lib/x86_64-linux-gnu/libdl-2.19.so
python2 26626 chrome-bot  mem    REG                8,1  1857312     4366 /lib/x86_64-linux-gnu/libc-2.19.so
python2 26626 chrome-bot  mem    REG                8,1   141574     4349 /lib/x86_64-linux-gnu/libpthread-2.19.so
python2 26626 chrome-bot  mem    REG                8,1   149120     4343 /lib/x86_64-linux-gnu/ld-2.19.so
python2 26626 chrome-bot  DEL    REG               0,20            219290 /run/shm/sem.w6zFhE
python2 26626 chrome-bot    0r   CHR                1,3      0t0     1029 /dev/null
python2 26626 chrome-bot    1w  FIFO                0,8      0t0   134488 pipe
python2 26626 chrome-bot    2w  FIFO                0,8      0t0   134488 pipe
python2 26626 chrome-bot    3r  FIFO                0,8      0t0   206423 pipe
python2 26626 chrome-bot    4w  FIFO                0,8      0t0   206423 pipe
python2 26626 chrome-bot    5u  IPv4             189789      0t0      TCP swarm-cros-576.c.chromeos-bot.internal:38264->cosmic-strategy-646:cidb-gen2:mysql (ESTABLISHED)
python2 26626 chrome-bot    6uW  REG                8,1        0 19796748 /b/swarming/w/ir/tmp/t/cbuildbot-tmpGtmXZ1/tmpbvzsQ4 (deleted)
python2 26626 chrome-bot    7r   CHR                1,9      0t0     1034 /dev/urandom
python2 26626 chrome-bot    8r   CHR                1,9      0t0     1034 /dev/urandom
python2 26626 chrome-bot    9u  IPv4             369725      0t0      TCP swarm-cros-576.c.chromeos-bot.internal:47023->vg-in-f153.1e100.net:https (ESTABLISHED)
python2 26626 chrome-bot   10u  unix 0xffff883317d23fc0      0t0   219289 socket
python2 26626 chrome-bot   11u  unix 0xffff883317d24740      0t0   219293 socket
python2 26626 chrome-bot   12u  unix 0xffff883317d25280      0t0   219294 socket

FD 9 is, instead, an HTTPS connection to "vg-in-f153.1e100.net".

Wanting to terminate that socket to see if the build would gracefully recover, I used gdb to do that:

jclinton@swarm-cros-576:~$ sudo gdb -p 26626
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
...
Attaching to process 26626
Reading symbols from /usr/bin/python2.7...Reading symbols from /usr/lib/debug//usr/bin/python2.7...done.
done.
...
Loaded symbols for /lib/x86_64-linux-gnu/libgcc_s.so.1
0x00007f56f75653ad in read () at ../sysdeps/unix/syscall-template.S:81
81	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) call close(9)
$1 = 0
(gdb) quit
A debugging session is active.

	Inferior 1 [process 26626] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/bin/python2.7, process 26626

At the point that I did this, the build had been hung at this point for 2 hours. As soon as I did this, the build recovered and finished gracefully as can be seen in the logs: https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927109628284744048/+/steps/MasterSlaveSyncCompletion/0/stdout#L6021_0

So, we know it's not CIDB, now.

vg-in-f153.1e100.net/173.194.209.153 appears to be a Google-owned GFE fronting either AppEngine or a Cloud service. The GFE's statusz page is here: http://vgas132/statusz . As an L1 GFE, it has a lot of services mapped to it but, interestingly, the GFE stats show that all of the traffic is Cloud and AppEngine.

My best guess is that TSMon uploader hung forever trying to push metrics to the HTTPS endpoint. However, it could also be a LUCI service. More investigation of the exact completion code path/loop needed.

Mike, do we have a TSMon client contact?


I'm not aware who actually owns ts-mon but can reach out to see if we can find a contact.  

From what I've seen, ts-mon is forked as a background thread.  It would be odd for it to manage to block, given all the different clients I've seen in the past that utilize MonAPI.  That said, there are ways this could end up blocking so I'm not going to make any assumptions.  

I'll see if I can find an owner for ts-mon and at least see whether we can pinpoint it down to a service.

I'm also going to look at the code further as it seems odd that we'd somehow block here for ts-mon.  Could be some other service.  Try to rule others out.

-- Mike  


Digging into the code a bit more it would appear that this might be a call to Buildbucket that is hanging as we provide no timeout.  

Going to put in a timeout to all Buildbucket calls to ensure we are at least giving up and not end up in a hung status.  

-- Mike

Sign in to add a comment