Hostless job waits for 1-2 hours to obtain results for children jobs (that have long since finished) |
|||||||||
Issue descriptionFailed build: https://luci-milo.appspot.com/buildbot/chromeos/gale-paladin/2957 Failed suite: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/124311787-chromeos-test/hostless/debug/ The problem is that the hostless job kept getting stuck for an hour at times complaining about mysql db disappearing: 6/21 02:39:23.686 INFO | server_job:0199| END GOOD 124312122-chromeos-test/chromeos4-row9-jetstream-host2/jetstream_LocalApi jetstream_LocalApi timestamp=1495744078 localtime=May 25 13:27:58 06/21 03:32:40.921 ERROR| db:0024| 03:32:40 06/21/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet 06/21 03:33:08.059 INFO | server_job:0199| START 124312128-chromeos-test/chromeos4-row9-jetstream-host6/jetstream_ApiServerDeveloperConfiguration jetstream_ApiServerDeveloperConfiguration timestamp=1495745865 ....... ....... 06/21 03:33:08.526 INFO | server_job:0199| END GOOD 124312146-chromeos-test/chromeos4-row9-jetstream-host2/jetstream_WanCustomDns jetstream_WanCustomDns timestamp=1495745940 localtime=May 25 13:59:00 06/21 04:09:13.499 ERROR| db:0024| 04:09:13 06/21/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet .......
,
Jun 21 2017
Note that the two CQ runs around that time took 3 hours instead of 2. So things were slow around that time, as expected: https://chromiumos-build-annotator.googleplex.com/build_annotations/builds_list/master-paladin/?num_builds=10&latest_build_id=1609850
,
Jun 22 2017
Something similar happened to cave-paladin here (making it really.... slow) the suite_job had only three children left, but then it sat around for ~1 hour polling for them because it kept getting RPC errors. This coincides with a partial RPC outage we saw this afternoon: https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/659 https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/124423106-chromeos-test/hostless/debug/ 06/21 15:24:40.685 DEBUG| suite:1451| Adding job keyval for login_RetrieveActiveSessions=124423230-chromeos-test 06/21 15:37:30.734 DEBUG| retry_util:0203| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>) 06/21 15:37:30.735 DEBUG| retry_util:0065| Retrying in 10.000000 (10.000000 + jitter 0.000000) seconds ... 06/21 15:39:48.932 DEBUG| retry_util:0203| <class 'socket.error'>([Errno 104] Connection reset by peer) 06/21 15:39:48.932 DEBUG| retry_util:0065| Retrying in 20.000000 (20.000000 + jitter 0.000000) seconds ... 06/21 15:40:24.062 DEBUG| retry_util:0203| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>) 06/21 15:40:24.063 DEBUG| retry_util:0065| Retrying in 40.000000 (40.000000 + jitter 0.000000) seconds ... 06/21 15:43:05.284 DEBUG| retry_util:0203| <class 'socket.error'>([Errno 104] Connection reset by peer) 06/21 15:43:05.284 DEBUG| retry_util:0065| Retrying in 80.000000 (80.000000 + jitter 0.000000) seconds ... 06/21 16:10:31.230 DEBUG| retry_util:0203| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>) 06/21 16:10:31.231 DEBUG| retry_util:0065| Retrying in 10.000000 (10.000000 + jitter 0.000000) seconds ... 06/21 16:10:56.286 DEBUG| retry_util:0203| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>) 06/21 16:10:56.287 DEBUG| retry_util:0065| Retrying in 20.000000 (20.000000 + jitter 0.000000) seconds ... 06/21 16:13:18.532 DEBUG| retry_util:0203| <class 'socket.error'>([Errno 104] Connection reset by peer) 06/21 16:13:18.532 DEBUG| retry_util:0065| Retrying in 40.000000 (40.000000 + jitter 0.000000) seconds ... 06/21 16:14:13.630 DEBUG| retry_util:0203| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>) 06/21 16:14:13.631 DEBUG| retry_util:0065| Retrying in 80.000000 (80.000000 + jitter 0.000000) seconds ... 06/21 16:17:37.860 DEBUG| retry_util:0203| <class 'socket.error'>([Errno 104] Connection reset by peer) 06/21 16:17:37.860 DEBUG| retry_util:0065| Retrying in 160.000000 (160.000000 + jitter 0.000000) seconds ... 06/21 16:26:09.526 ERROR| db:0024| 16:26:09 06/21/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
,
Jun 22 2017
,
Aug 17 2017
Issue 756634 has been merged into this issue.
,
Aug 17 2017
I did not dream this. It is still happening.
,
Aug 17 2017
,
Oct 30 2017
I ran into this again today with a timed out suite: http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=152744569 10/30 07:10:14.265 INFO | server_job:0214| END GOOD 152745019-chromeos-test/chromeos6-row2-rack14-host3/video_ChromeHWDecodeUsed.h264 video_ChromeHWDecodeUsed.h264 timestamp=1509372423 localtime=Oct 30 07:07:03 10/30 07:10:14.266 DEBUG| suite:1372| Adding job keyval for video_ChromeHWDecodeUsed.h264=152745019-chromeos-test 10/30 08:15:12.789 ERROR| db:0023| 08:15:12 10/30/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet 10/30 08:15:56.976 INFO | server_job:0214| START ---- sentry-paladin/R64-10080.0.0-rc2/suite_attr_wrapper/cheets_MediaPlayerVideoHWDecodeUsed.h264 timestamp=1509376556 localtime=Oct 30 08:15:56 10/30 08:15:56.977 INFO | server_job:0214| ABORT ---- sentry-paladin/R64-10080.0.0-rc2/suite_attr_wrapper/cheets_MediaPlayerVideoHWDecodeUsed.h264 timestamp=1509376556 localtime=Oct 30 08:15:56 10/30 08:15:56.978 INFO | server_job:0214| END ABORT ---- sentry-paladin/R64-10080.0.0-rc2/suite_attr_wrapper/cheets_MediaPlayerVideoHWDecodeUsed.h264 timestamp=1509376556 localtime=Oct 30 08:15:56 10/30 08:16:19.042 DEBUG| dynamic_suite:0622| Finished waiting on suite. Returning from _perform_reimage_and_run.
,
Oct 30 2017
+this week's primary
,
May 17 2018
I'd like to believe that the new TKO gen2 solves this problem
,
Jun 8 2018
|
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by pprabhu@chromium.org
, Jun 21 2017