Issue 602546

Issue metadata

Status: WontFix
Owner: ----
Closed: Apr 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug



Swarming tasks unable to reach isolateserver.appspot.com after all tests pass (net_unittests)

Project Member Reported by kjellander@chromium.org, Apr 12 2016

Issue description

There are two bots that have problems with their swarming execution:
https://build.chromium.org/p/chromium.memory/builders/Linux%20ASan%20LSan%20Tests%20(1)
https://build.chromium.org/p/chromium.linux/builders/Linux%20Tests%20(dbg)(1)(32)

Only net_unittests is affected, which makes me believe something in the test is crashing somewhere and that the crash surfaces as hard-to-read "some shards did not complete" errors.

Comparing the blamelists on the two bots gives this common blamelist:
https://chromium.googlesource.com/chromium/src/+log/cb7d55e9e2424f2676c5a4656548cdda69443fb2%5E..0646cce6b4397895614b152f1a1547e2075e40ea?pretty=fuller
Nothing in it stands out, though, since I'm unable to find where the test actually fails or crashes.
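
The blamelist comparison described above amounts to an ordered intersection of the two bots' commit lists. A minimal sketch, assuming each blamelist is available as a list of commit identifiers; the function name `common_blamelist` and the placeholder IDs are hypothetical and not taken from any buildbot tooling:

```python
def common_blamelist(blamelist_a, blamelist_b):
    """Return commits present in both blamelists, in blamelist_a's order."""
    in_b = set(blamelist_b)
    return [commit for commit in blamelist_a if commit in in_b]


# Placeholder commit IDs, for illustration only.
bot_a = ["c1", "c2", "c3", "c4"]
bot_b = ["c3", "c4", "c5"]
print(common_blamelist(bot_a, bot_b))
```

Only commits that landed in the failing range on both bots remain as suspects, which is why a short common range is useful even when no single change stands out.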

Taking a closer look at one failure:
https://build.chromium.org/p/chromium.memory/builders/Linux%20ASan%20LSan%20Tests%20%281%29/builds/25123
It has 4 missing shards, but all tests pass in each of those shards.
At the end of the run, there's an error like this:

SUCCESS: all tests passed.
Tests took 1065 seconds.
Additional test environment:
    ASAN_OPTIONS=symbolize=1 external_symbolizer_path=/tmp/runNZyNPJ/third_party/llvm-build/Release+Asserts/bin/llvm-symbolizer detect_leaks=1
    CHROME_DEVEL_SANDBOX=/opt/chromium/chrome_sandbox
    G_SLICE=always-malloc
    LANG=en_US.UTF-8
    LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/debug:
    LSAN_OPTIONS=
    NSS_DISABLE_ARENA_FREE_LIST=1
    NSS_DISABLE_UNLOAD=1
Command: ../out/Release/net_unittests --brave-new-test-launcher --test-launcher-bot-mode --test-launcher-print-test-stdio=always --test-launcher-batch-limit=1 --test-launcher-summary-output=/tmp/outhmFPuu/output.json --no-sandbox

24110 2016-04-12 02:19:53.051 E: Unable to open given url, https://isolateserver.appspot.com/_ah/api/isolateservice/v1/preupload, after 30 attempts.
HTTPSConnectionPool(host='isolateserver.appspot.com', port=443): Max retries exceeded with url: /_ah/api/isolateservice/v1/preupload (Caused by NewConnectionError('<third_party.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3872750>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
24110 2016-04-12 02:19:53.064 E: Leaking out_dir /tmp/outhmFPuu: Failed to execute preupload query
Traceback (most recent call last):
  File "/b/swarm_slave/swarming_bot.1.zip/client/run_isolated.py", line 402, in map_and_run
    storage, out_dir, leak_temp_dir)
  File "/b/swarm_slave/swarming_bot.1.zip/client/run_isolated.py", line 248, in delete_and_upload
    storage, [out_dir], None)
  File "/b/swarm_slave/swarming_bot.1.zip/client/isolateserver.py", line 2050, in archive_files_to_storage
    uploaded = storage.upload_items(items_to_upload)
  File "/b/swarm_slave/swarming_bot.1.zip/client/isolateserver.py", line 449, in upload_items
    for missing_item, push_state in self.get_missing_items(items):
  File "/b/swarm_slave/swarming_bot.1.zip/client/isolateserver.py", line 628, in get_missing_items
    for missing_item, push_state in channel.pull().iteritems():
  File "/b/swarm_slave/swarming_bot.1.zip/utils/threading_utils.py", line 377, in _task_executer
    result = func(*args, **kwargs)
  File "/b/swarm_slave/swarming_bot.1.zip/client/isolateserver.py", line 618, in contains
    return self._storage_api.contains(batch)
  File "/b/swarm_slave/swarming_bot.1.zip/client/isolateserver.py", line 1112, in contains
    'Failed to execute preupload query')
MappingError: Failed to execute preupload query
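
The log above shows that the tests themselves passed and that the failure came afterwards, while uploading results to the isolate server. A minimal sketch of how such a shard log could be triaged, assuming the marker strings copied from the log above appear verbatim in other affected shards; the function name `classify_shard_log` and its return labels are hypothetical and not part of the swarming client:

```python
import re

# Marker strings copied from the log in this report.
TESTS_PASSED = "SUCCESS: all tests passed."
UPLOAD_ERRORS = (
    "Unable to open given url",
    "Failed to execute preupload query",
)
# Matches e.g. "[Errno 101] Network is unreachable" and captures the message.
ERRNO_RE = re.compile(r"\[Errno \d+\] ([^'\n]+)")


def classify_shard_log(log_text):
    """Return a rough label for why a shard may have been reported missing."""
    tests_ok = TESTS_PASSED in log_text
    upload_failed = any(marker in log_text for marker in UPLOAD_ERRORS)
    if tests_ok and upload_failed:
        match = ERRNO_RE.search(log_text)
        detail = match.group(1).strip() if match else "unknown cause"
        return "tests passed, result upload failed ({})".format(detail)
    if tests_ok:
        return "tests passed"
    return "tests did not report success"
```

Run against the log above, this would put the shard in the "tests passed, result upload failed" category, pointing at connectivity rather than at net_unittests itself.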
 
Status: WontFix (was: Untriaged)
This has recovered somehow, but it would still be interesting to know what caused it.

Comment 2 by mar...@chromium.org, Apr 12 2016

https://status.cloud.google.com/incident/compute/16007 is the cause. Issue 602573 then caused a cascading failure.

Comment 3 by aga...@chromium.org, Apr 26 2016

Components: Infra>Platform>Swarming
Labels: -Infra-Swarming
