Issue 735514

Starred by 3 users

Issue metadata

Status: Unconfirmed
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Hostless job waits for 1-2 hours to obtain results for child jobs (that have long since finished)

Project Member Reported by pprabhu@chromium.org, Jun 21 2017

Issue description

Failed build: https://luci-milo.appspot.com/buildbot/chromeos/gale-paladin/2957
Failed suite: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/124311787-chromeos-test/hostless/debug/

The problem is that the hostless job repeatedly got stuck, sometimes for an hour at a time, complaining that the MySQL DB had gone away:

6/21 02:39:23.686 INFO |        server_job:0199| END GOOD	124312122-chromeos-test/chromeos4-row9-jetstream-host2/jetstream_LocalApi	jetstream_LocalApi	timestamp=1495744078	localtime=May 25 13:27:58	

06/21 03:32:40.921 ERROR|                db:0024| 03:32:40 06/21/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet

06/21 03:33:08.059 INFO |        server_job:0199| START	124312128-chromeos-test/chromeos4-row9-jetstream-host6/jetstream_ApiServerDeveloperConfiguration	jetstream_ApiServerDeveloperConfiguration	timestamp=1495745865	
.......
.......

06/21 03:33:08.526 INFO |        server_job:0199| END GOOD	124312146-chromeos-test/chromeos4-row9-jetstream-host2/jetstream_WanCustomDns	jetstream_WanCustomDns	timestamp=1495745940	localtime=May 25 13:59:00	
06/21 04:09:13.499 ERROR|                db:0024| 04:09:13 06/21/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
.......
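The `db:0024` lines above suggest the DB layer catches the driver's "gone away" error and retries rather than failing the job. A minimal sketch of that retry-on-error pattern, assuming a MySQLdb-style `OperationalError`; the function and query names here are hypothetical, not the actual autotest code:

```python
import time

# Stand-in for MySQLdb.OperationalError; the real code catches the driver's
# exception class. run_with_retry and flaky_query are illustrative names.
class OperationalError(Exception):
    pass

def run_with_retry(query_fn, max_attempts=5, delay=0.01):
    """Retry a DB operation when the server 'goes away' (error 2006)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return query_fn()
        except OperationalError:
            if attempt == max_attempts:
                raise
            # Log would read: "retrying, don't panic yet"
            time.sleep(delay)

# Simulate a query that fails twice before the connection recovers.
calls = {'n': 0}
def flaky_query():
    calls['n'] += 1
    if calls['n'] < 3:
        raise OperationalError(2006, 'MySQL server has gone away')
    return 'ok'

result = run_with_retry(flaky_query)
```

The failure mode in this bug is the flip side of that pattern: each retry blocks the single-threaded hostless job, so repeated 2006 errors stretch into hour-long stalls.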


 
Status: Unconfirmed (was: Untriaged)
On the DB, I do see a spike in memory reads at 2:30 AM this morning: https://viceroy.corp.google.com/chromeos/afe_db?duration=1d&hostname=cros-autotest-shard4&refresh=90#_VG_kwLghQ3j

ditto with disk reads: https://viceroy.corp.google.com/chromeos/afe_db?duration=6h&hostname=cros-autotest-shard4&refresh=90&utc_end=1498047240#_VG_3jjRaXLz

And there was a small number of slow queries coming in: https://viceroy.corp.google.com/chromeos/afe_db?duration=6h&hostname=cros-autotest-shard4&refresh=90&utc_end=1498047240#_VG_3jjRaXLz

But none of that seems catastrophic.

Leaving this bug open for a day for observation, then closing it.

Note that the two CQ runs around that time took 3 hours instead of 2.
So things were slow around that time, as expected:

https://chromiumos-build-annotator.googleplex.com/build_annotations/builds_list/master-paladin/?num_builds=10&latest_build_id=1609850
Something similar happened to cave-paladin here (making it really slow): the suite_job had only three children left, but it then sat around for ~1 hour polling for them because it kept getting RPC errors.

This coincides with a partial RPC outage we saw this afternoon:

https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/659
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/124423106-chromeos-test/hostless/debug/

06/21 15:24:40.685 DEBUG|             suite:1451| Adding job keyval for login_RetrieveActiveSessions=124423230-chromeos-test
06/21 15:37:30.734 DEBUG|        retry_util:0203| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>)
06/21 15:37:30.735 DEBUG|        retry_util:0065| Retrying in 10.000000 (10.000000 + jitter 0.000000) seconds ...
06/21 15:39:48.932 DEBUG|        retry_util:0203| <class 'socket.error'>([Errno 104] Connection reset by peer)
06/21 15:39:48.932 DEBUG|        retry_util:0065| Retrying in 20.000000 (20.000000 + jitter 0.000000) seconds ...
06/21 15:40:24.062 DEBUG|        retry_util:0203| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>)
06/21 15:40:24.063 DEBUG|        retry_util:0065| Retrying in 40.000000 (40.000000 + jitter 0.000000) seconds ...
06/21 15:43:05.284 DEBUG|        retry_util:0203| <class 'socket.error'>([Errno 104] Connection reset by peer)
06/21 15:43:05.284 DEBUG|        retry_util:0065| Retrying in 80.000000 (80.000000 + jitter 0.000000) seconds ...
06/21 16:10:31.230 DEBUG|        retry_util:0203| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>)
06/21 16:10:31.231 DEBUG|        retry_util:0065| Retrying in 10.000000 (10.000000 + jitter 0.000000) seconds ...
06/21 16:10:56.286 DEBUG|        retry_util:0203| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>)
06/21 16:10:56.287 DEBUG|        retry_util:0065| Retrying in 20.000000 (20.000000 + jitter 0.000000) seconds ...
06/21 16:13:18.532 DEBUG|        retry_util:0203| <class 'socket.error'>([Errno 104] Connection reset by peer)
06/21 16:13:18.532 DEBUG|        retry_util:0065| Retrying in 40.000000 (40.000000 + jitter 0.000000) seconds ...
06/21 16:14:13.630 DEBUG|        retry_util:0203| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>)
06/21 16:14:13.631 DEBUG|        retry_util:0065| Retrying in 80.000000 (80.000000 + jitter 0.000000) seconds ...
06/21 16:17:37.860 DEBUG|        retry_util:0203| <class 'socket.error'>([Errno 104] Connection reset by peer)
06/21 16:17:37.860 DEBUG|        retry_util:0065| Retrying in 160.000000 (160.000000 + jitter 0.000000) seconds ...
06/21 16:26:09.526 ERROR|                db:0024| 16:26:09 06/21/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
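The `retry_util` lines show a classic exponential-backoff-with-jitter schedule: 10, 20, 40, 80, 160 seconds, after which the schedule restarts. A minimal sketch of such a schedule, not the actual `retry_util` implementation:

```python
import random

def backoff_delays(base=10.0, factor=2, max_retries=5,
                   jitter_frac=0.0, seed=None):
    """Yield retry sleep times: base, base*factor, base*factor**2, ...
    each optionally padded with random jitter."""
    rng = random.Random(seed)
    for i in range(max_retries):
        delay = base * (factor ** i)
        yield delay + rng.uniform(0, jitter_frac * delay)

# With jitter_frac=0 this reproduces the 10/20/40/80/160 s steps in the log.
delays = list(backoff_delays())
```

Note that with this schedule a single failing RPC burns 10+20+40+80+160 = 310 seconds before giving up, which compounds quickly when many RPCs fail in a row, matching the hour-long gaps in the log.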
Summary: Hostless job waits for 1-2 hours to obtain results for child jobs (that have long since finished) (was: jetstream_cq HWTest took over 2 hours)
Cc: nxia@chromium.org akes...@chromium.org jrbarnette@chromium.org xixuan@chromium.org pprabhu@chromium.org
Issue 756634 has been merged into this issue.
Labels: -Pri-3 Pri-1
Status: Untriaged (was: Unconfirmed)
I did not dream this. It is still happening.
Owner: ----
Cc: teravest@chromium.org
I ran into this again today with a timed-out suite:
http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=152744569

10/30 07:10:14.265 INFO |        server_job:0214| END GOOD	152745019-chromeos-test/chromeos6-row2-rack14-host3/video_ChromeHWDecodeUsed.h264	video_ChromeHWDecodeUsed.h264	timestamp=1509372423	localtime=Oct 30 07:07:03	
10/30 07:10:14.266 DEBUG|             suite:1372| Adding job keyval for video_ChromeHWDecodeUsed.h264=152745019-chromeos-test
10/30 08:15:12.789 ERROR|                db:0023| 08:15:12 10/30/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
10/30 08:15:56.976 INFO |        server_job:0214| START	----	sentry-paladin/R64-10080.0.0-rc2/suite_attr_wrapper/cheets_MediaPlayerVideoHWDecodeUsed.h264	timestamp=1509376556	localtime=Oct 30 08:15:56	
10/30 08:15:56.977 INFO |        server_job:0214| 	ABORT	----	sentry-paladin/R64-10080.0.0-rc2/suite_attr_wrapper/cheets_MediaPlayerVideoHWDecodeUsed.h264	timestamp=1509376556	localtime=Oct 30 08:15:56	
10/30 08:15:56.978 INFO |        server_job:0214| END ABORT	----	sentry-paladin/R64-10080.0.0-rc2/suite_attr_wrapper/cheets_MediaPlayerVideoHWDecodeUsed.h264	timestamp=1509376556	localtime=Oct 30 08:15:56	
10/30 08:16:19.042 DEBUG|     dynamic_suite:0622| Finished waiting on suite. Returning from _perform_reimage_and_run.
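The `dynamic_suite` line above marks the end of the suite-side wait loop: the hostless job polls the AFE for unfinished child jobs until none remain or a deadline passes, and failed polls (the RPC errors seen earlier) stretch the wait long past the children's actual completion. A hypothetical sketch of that loop's shape; `fetch_unfinished` stands in for the real RPC and is not the actual autotest API:

```python
import time

def wait_for_child_jobs(fetch_unfinished, deadline_s, poll_interval_s=0.01,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll until no unfinished children remain, or the deadline expires."""
    start = clock()
    while clock() - start < deadline_s:
        remaining = fetch_unfinished()
        if not remaining:
            return True       # all children finished
        sleep(poll_interval_s)
    return False              # timed out; the suite would then ABORT children

# Simulate three children, one finishing per poll.
pending = ['job1', 'job2', 'job3']
def fetch_unfinished():
    if pending:
        pending.pop()
    return list(pending)

finished = wait_for_child_jobs(fetch_unfinished, deadline_s=5)
```

In the failure above, `fetch_unfinished` kept erroring and retrying, so the loop hit its deadline and the remaining `suite_attr_wrapper` child was ABORTed even though the actual tests had long since finished.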

Cc: ayatane@chromium.org
+this week's primary
Status: Unconfirmed (was: Untriaged)
I'd like to believe that the new TKO gen2 solves this problem.

Comment 11 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
