
Issue 878277


Issue metadata

Status: Duplicate
Owner:
Closed: Sep 4
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----




PFQ time out due to "Could not resolve host: storage.googleapis.com"

Project Member Reported by sheriff-...@appspot.gserviceaccount.com, Aug 28

Issue description

Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of yuhsuan@google.com

kevin-arcnext-paladin:1260, kevin-paladin:4400 timed out

Builders failed on: 
- kevin-arcnext-paladin: 
  https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936996884616956256


19:48:26: INFO: Re-run swarming_cmd to avoid buildbot salency check.
19:48:26: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpAScgIR/tmp1VEna9/temp_summary.json --print-status-updates --timeout 9000 --raw-cmd --task-name kevin-arcnext-paladin/R70-11010.0.0-rc2-bvt-arc --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:bvt-arc' '--tags=build:kevin-arcnext-paladin/R70-11010.0.0-rc2' '--tags=task_name:kevin-arcnext-paladin/R70-11010.0.0-rc2-bvt-arc' '--tags=board:kevin-arcnext' -- /usr/local/autotest/site_utils/run_suite.py --build kevin-arcnext-paladin/R70-11010.0.0-rc2 --board kevin --suite_name bvt-arc --pool cq --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 90436842L, 'cidb_build_id': 2884616, 'datastore_parent_key': ('Build', 2884616, 'BuildStage', 90436842L)}" --test_args "{'fast': 'True'}" -m 231392899
20:32:39: INFO: Refreshing due to a 401 (attempt 1/2)
20:32:39: INFO: Refreshing access_token
20:38:29: INFO: Refreshing due to a 401 (attempt 1/2)
20:38:29: INFO: Refreshing access_token
20:42:15: INFO: Refreshing due to a 401 (attempt 1/2)
20:42:15: INFO: Refreshing access_token
20:52:10: ERROR: Timeout occurred- waited 15236 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.

 
Owner: jrbarnette@chromium.org
Components: Infra>Client>ChromeOS
Labels: Build-PFQ-Failures OS-Chrome
Looks like a dup of issue 874308
Mergedinto: 874308
Status: Duplicate (was: Available)
Cc: xiy...@chromium.org jrbarnette@chromium.org
Owner: jkop@chromium.org
Status: Untriaged (was: Duplicate)
The "Refreshing due to a 401" message is a red herring. There is an actual failure in this build:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936978623559055952

The ssp_logs show:

08/27 20:31:12.002 DEBUG|         container:0344| Command <sudo lxc-attach -P /usr/local/autotest/containers -n test_231392919_1535426243_163245 -- bash -c "curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz"> failed, rc=6, Command returned non-zero exit status
* Command: 
    sudo lxc-attach -P /usr/local/autotest/containers -n
    test_231392919_1535426243_163245 -- bash -c "curl --head
    https://storage.googleapis.com/abci-ssp/autotest-
    containers/base_09.tar.xz"
Exit status: 6
Duration: 19.8597819805

stderr:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0

curl: (6) Could not resolve host: storage.googleapis.com
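
For reference, curl documents exit status 6 as "could not resolve host", so the rc=6 above points at name resolution rather than at the 401 token refreshes. Below is a minimal sketch of how such failed-command log lines could be classified. The function name, the regular expression, and the small exit-code table are illustrative assumptions and are not part of autotest.

```python
import re

# A few curl exit codes, as documented in the curl man page.
# Only 6 appears in the log above; the others are listed for contrast.
CURL_EXIT_CODES = {
    5: "could not resolve proxy",
    6: "could not resolve host",
    7: "failed to connect to host",
    28: "operation timed out",
}

# Assumed shape of the log line quoted above: "Command <...> failed, rc=N".
FAILED_CMD_RE = re.compile(r"Command <(?P<cmd>.*)> failed, rc=(?P<rc>\d+)")


def classify_failed_command(line):
    """Return (command, rc, meaning) for a failed-command line, else None."""
    match = FAILED_CMD_RE.search(line)
    if match is None:
        return None
    cmd = match.group("cmd")
    rc = int(match.group("rc"))
    if "curl" in cmd:
        meaning = CURL_EXIT_CODES.get(rc, "unknown curl exit code")
    else:
        meaning = "not a curl command"
    return cmd, rc, meaning
```

Feeding the DEBUG line above into `classify_failed_command` would report rc 6 with the meaning "could not resolve host".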

jkop@, could you check it out and help to triage this? Thanks.
Summary: kevin-arcnext-paladin:1260, kevin-paladin:5400, kevin-arcnext-chrome-pfq timed out (was: kevin-arcnext-paladin:1260, kevin-paladin:4400 timed out)
Mergedinto: -874308 878403
Status: Duplicate (was: Untriaged)
kevin is on cros-full-0007
Cc: -jrbarnette@chromium.org jkop@chromium.org
Owner: jrbarnette@chromium.org
Status: Assigned (was: Duplicate)
Summary: PFQ time out due to "Could not resolve host: storage.googleapis.com" (was: kevin-arcnext-paladin:1260, kevin-paladin:5400, kevin-arcnext-chrome-pfq timed out)
Unmerge from 878403

This now also happens for caroline-arcnext-chrome-pfq, e.g.:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936803581058848320
There are multiple different failures here:
  * The failure in the description looks like bug 874308.
  * Concurrent with the symptom of bug 874308, there are also multiple
    test failures in the underlying suite.
  * The failure in #c4 is a new failure and TTBOMK not reported elsewhere.
  * The caroline failure is also bug 874308.

ATM, I'm inclined to say that this bug is the failure in #c4, but I'd like
to know whether that problem is still seen.  My expectation is that that
sort of failure is transient, and won't be recurring.

Owner: xiy...@chromium.org
Passing to the Chrome gardener to answer whether the problem
in #c4 has been seen recently on any builder.

Owner: jrbarnette@chromium.org
The caroline failure has the same symptom as #c4. The error is hidden in the ssp_logs of the test failures. If you click into "GE Suite Details", then click on the failed tests, say, "cheets_CTS_N.7.1", you get to here:
https://stainless.corp.google.com/browse/chromeos-autotest-results/232140354-chromeos-test/

Then if you look at autoserv.DEBUG here:
     ├── ssp_logs
     │   └── debug
     │        ├── autoserv.DEBUG [58.4 kB]

You will see that the failure is actually caused by a name resolution failure.

The problem is not that transient. When it happens, it persists for several builds. E.g., caroline has failed for the last 4 builds, and I saw the same error in all of the builds listed here.
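
Since the error only shows up inside ssp_logs/debug/autoserv.DEBUG, checking many results by hand is tedious. Here is a rough sketch of how a locally downloaded copy of the results could be scanned for the symptom. The function name and the idea of a local results directory are assumptions for illustration; only the ssp_logs/debug/autoserv.DEBUG layout and the error string come from this thread.

```python
from pathlib import Path

# The error string reported by curl in the ssp_logs above.
MARKER = "Could not resolve host"


def find_resolution_failures(results_root):
    """Return (path, line_number, text) for each matching autoserv.DEBUG line.

    Only files located at .../ssp_logs/debug/autoserv.DEBUG are inspected.
    """
    hits = []
    for log_path in sorted(Path(results_root).rglob("autoserv.DEBUG")):
        # Skip autoserv.DEBUG files that are not under ssp_logs/debug.
        if log_path.parts[-3:-1] != ("ssp_logs", "debug"):
            continue
        with open(log_path, errors="replace") as log_file:
            for line_number, line in enumerate(log_file, start=1):
                if MARKER in line:
                    hits.append((str(log_path), line_number, line.strip()))
    return hits
```

An empty result for a build's test jobs would suggest that build failed for a different reason.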
> The caroline failure has the same symptom as #c4. [ ... ]

The only caroline build cited is in #c7:
    https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936803581058848320

That build ran this test suite:
    http://cautotest-prod/afe/#tab_id=view_job&object_id=232212677

There are no test failures in that suite.  The only failure in the
associated build is bug 874308.

> The error is hidden in the ssp_logs of the test failures. If you click
> into "GE Suite Details", then click on the failed tests, say,
> "cheets_CTS_N.7.1", you get to here:
> https://stainless.corp.google.com/browse/chromeos-autotest-results/232140354-chromeos-test/

That test result is from testing caroline-release/R70-11017.0.0, not
the caroline-arcnext-chrome-pfq at all.  That's this build:
    https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936804363040971296

The fact that this is happening on cros-full-0005.mtv makes me
suspect some side-effect of bug 874308, but I don't yet have the
evidence to show it.

> The fact that this is happening on cros-full-0005.mtv makes me
> suspect some side-effect of bug 874308, but I don't yet have the
> evidence to show it.

<sigh> the proper bug to suspect is bug 878403.  The general sickness
on the affected shard may be having unexpected side-effects, such as
networking glitches.  Whatever is causing it, the problem is intermittent.
I just tried the failing 'curl' and it passed.

chromeos-test@cros-full-0005:/tmp$ curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  432M    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
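
Because a single passing curl says little about an intermittent problem, a repeated probe may be more informative. The sketch below uses Python's standard `socket.getaddrinfo` to resolve the host several times and records which attempts fail. The function name, the attempt count, and the delay are hypothetical choices, not something used on the shard.

```python
import socket
import time


def probe_resolution(host, attempts=5, delay_secs=1.0,
                     resolver=socket.getaddrinfo):
    """Resolve `host` repeatedly; return a list of (attempt, error) failures.

    `resolver` is injectable so the loop can be exercised without a network.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            resolver(host, 443)
        except socket.gaierror as error:
            # gaierror is what getaddrinfo raises when a name does not resolve.
            failures.append((attempt, str(error)))
        if attempt < attempts:
            time.sleep(delay_secs)
    return failures
```

For example, `probe_resolution("storage.googleapis.com", attempts=20, delay_secs=5)` run on the affected shard would return an empty list if every lookup succeeded.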

Thanks for the investigation. So merge back to issue 878403?
I've confirmed that three of the past four caroline-arcnext-chrome-pfq
runs failed with the symptom at issue here.  The fourth (the one with
a different symptom) is the build cited in #c7.

> Thanks for the investigation. So merge back to issue 878403 ?

I'm inclined to hold this open as a separate bug until cros-full-0005.mtv
is more healthy.  If the problem doesn't persist past that point, we can
close this.  If the problem _does_ persist, it'll be easier not to have the
extra churn.

Labels: Hotlist-Deputy
Owner: pprabhu@chromium.org
Status: Duplicate (was: Assigned)
The affected builder has been green over the past couple of days. Let's blame it on the shard slowness.
