New issue
Advanced search Search tips

Issue 813065 link

Starred by 2 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 820416



Sign in to add a comment

Handle sigterm in V8 test framework and log hanging tests

Project Member Reported by gab@chromium.org, Feb 16 2018

Issue description

Example failure : https://ci.chromium.org/p/v8/builds/b8954429453077102880  in CL @ https://chromium-review.googlesource.com/c/v8/v8/+/922103. 

Traceback (most recent call last):
  File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/run.py", line 275, in _new_run
    recipe_result = recipe_script.run(api, self.properties, self.environ)
  File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/loader.py", line 98, in run
    self.run_steps, properties, environ, self.PROPERTIES, api=api)
  File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/loader.py", line 626, in invoke_with_properties
    arg_names, **additional_args)
  File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/loader.py", line 589, in _invoke_with_properties
    return callable_obj(*props, **additional_args)
  File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/recipes/v8.py", line 91, in RunSteps
    test_results = v8.runtests(tests)
  File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/recipe_api.py", line 750, in _inner
    return func(*a, **kw)
  File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/recipe_modules/v8/api.py", line 870, in runtests
    test_results += t.run(coverage_context=coverage_context)
  File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/recipe_modules/v8/testing.py", line 552, in run
    return result + self.post_run(self.test, coverage_context)
  File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/recipe_modules/v8/testing.py", line 402, in post_run
    assert len(json_output) == 1
TypeError: object of type 'NoneType' has no len()  
 
Cc: serg...@chromium.org machenb...@chromium.org
+v8 folks
to clarify: are you seeing this on the chromium CQ, or only on the v8 one?

Comment 3 by gab@chromium.org, Feb 16 2018

Haven't seen it on Chromium CQ, perhaps v8 specific.
Summary: Occasional timeout in v8's Check step (was: CQ is dead? Check step : object of type 'NoneType' has no len)
This appears to be due to Check timing out. 8 of more than 250 runs over the last 24 hours on the same bot have timed out in similar fashion: https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1518792420000&f=name%3ACheck&f=os%3AMac-10.9&l=50&n=true&s=created_ts%3Adesc&st=1518706020000

I would suspect that this isn't actually a trooper issue.
(though the v8 recipe module could likely handle the lack of json output a bit more clearly)

Comment 6 by gab@chromium.org, Feb 16 2018

Hmmm interesting, 8/250 sounds like the number of tries I've made... could my CL be causing this? If so this is a very obscure way to catch an error in my CL..? It builds and passes all tests locally...
(Note that the 8/250+ is from Mac-10.9 alone; I didn't look at other bots too much.)

7 of the 8 are from your two CLs (https://chromium-review.googlesource.com/c/v8/v8/+/918663 and https://chromium-review.googlesource.com/c/v8/v8/+/922103), with the 8th being on https://chromium-review.googlesource.com/c/v8/v8/+/918641/6.

Your CL could be causing or perhaps exacerbating this; I'm not familiar enough w/ v8's tests to say.

Comment 8 by gab@chromium.org, Feb 16 2018

What does the error even mean though..? I have no idea where to start...
I'd be happy to raise the timeout if the increased runtime is expected and not a regression that needs to be fixed. The default hard timeout we use is 45 minutes.
To be clear, the stack trace you posted in #0 isn't the problem; it's a symptom of it. (As mentioned in #5, this definitely isn't clear from the bot page.)

If your CL is indeed causing or exacerbating this, then the issue is that your CL caused something in Check (possibly, but not necessarily, *everything* in Check) to take long enough that the suite hit the hard timeout sergiyb mentioned. You're looking for what part of that suite is taking longer & why.
I've re-run the task https://chromium-swarm.appspot.com/task?id=3bb7e71658c57d10 locally and it generated the attached log. What is not clear to me and Gabriel is why aren't these errors also visible in the server log?
Components: Infra>Client>V8
Labels: -Infra-Troopers
Potential theories:
 - logs are so large that bot has stalled while trying to print them to stdout causing a timeout
 - affected tests run slower on a bot (due to small number of cores compared to workstations) and get interrupted prior to printing errors

It was also confusing that log did not contain the tests that were triggered, making it hard to know which tests have been running when timeout occurred. In any case, I don't think this is trooper issue.
Summary: Log which tests have been triggered (with timestamp when they were triggered) to allow seeing which tests run longer than the timeout (was: Occasional timeout in v8's Check step)
Cc: -serg...@chromium.org gab@chromium.org
Owner: serg...@chromium.org
Status: Started (was: Untriaged)

Comment 15 by gab@chromium.org, Feb 16 2018

FWIW, I found the cause of the error : bots run a much smaller v8 thread pool than my machine (ref. DefaultPlatform::SetThreadPoolSize). My CL has problems with a small pool looks like.

That generated a 20 line stack trace repeated 16620 times in log @#11. I can now generate those errors locally by hardcoding a small thread pool.

Having had this stack trace on the bots would have saved my a day, thanks for digging into this and helping out!
Re 11: The difference when running locally is strange. @gab: What does this failing dcheck mean? Can it be connected to our workstations having much more cores? It seems task number related.

Re general:
1) The V8 test framework does not handle the sigterm, when swarming wants stop it due to hard timeout. We could teach the framework to properly handle it, terminate gracefully and print the final state.
2) The test framework doesn't log which tests are starting. They're just pulled from a queue. 

We could tackle any of those problems. They are usually rare and nobody had cycles.

Re root cause: There is very likely one test running very slow - we sadly can't see which one. But I suggest to start running the test locally and see if you can repro what Sergiy could repro above. Maybe the timeouts will be gone too once this is fixed?

Alright, seeing some answers in comment 15 now. The test framework still needs improvement for handling such hanging tests better.

Comment 18 by gab@chromium.org, Feb 16 2018

@#16, see #15. It's not slow, it's just hitting a DCHECK in *every* test and doing a LOT of printing (maybe quit after a certain number of failures?).

Yes this is related to thread pool size (hard coding it small locally results in the error being hit -- so I just discovered).

Comment 19 by gab@chromium.org, Feb 16 2018

Idea : we could fuzz the thread pool size locally (based on recently --random-seed based fuzzing)?
WIP CL: https://crrev.com/c/924188. I probably won't finish it before my vacation until 2018-03-08, hence feel free to steal this bug.
Re 19: Good idea. Can you give me pointer on how to manipulate the thread pool size and which are the number ranges that make sense?
Though, unrelated bug here. Maybe we should make a new feature request.
Sorry, I've stolen this bug for improving output. Can you please file a new bug for improving fuzzing of the number of threads?

Comment 25 Deleted

Labels: -Restrict-View-Google
Also related to my CL, looks like I can't use mock library on our swarming bots. I've tried using .vpython file, but it didn't work. Asked about this here: http://shortn/_VXUgff0F3C (sorry, internal only).

(the comment above was rewritten to avoid containing internal links, which allows me to remove RVG and keep this bug open)
Labels: -Pri-1 Pri-2
P.S. It was Michael who questioned whether this bug needs to be RVG'd. Thank your for the pushback and for keeping V8 a truly opensource project.
Cc: serg...@chromium.org
Owner: machenb...@chromium.org
Summary: Handle sigterm in V8 test framework and log hanging tests (was: Log which tests have been triggered (with timestamp when they were triggered) to allow seeing which tests run longer than the timeout)
Project Member

Comment 30 by bugdroid1@chromium.org, Feb 20 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84

commit 4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84
Author: Michael Achenbach <machenbach@chromium.org>
Date: Tue Feb 20 18:02:25 2018

[test] Abort testing immediately on interrupt and sigterm

This is a partial revert of:
https://crrev.com/c/890938 and https://crrev.com/c/893982

Before this CL, the test runner blocked on ongoing tests in order to
process their results after an internal timeout. However, the logic
required for this feature was overly complicated and prevented an
acceptable implementation for fast aborts. Furthermore, also the fuzzers
suffered from timeouts on swarming due to hanging tests.

Instead, we now abort immediately on internal timeout (used on
fuzzers), SIGINT (Ctrl-C) and SIGTERM. Ongoing tests are immediately
terminated and their results are disregarded. On SIGTERM and SIGINT,
we return with non-zero exit codes, and zero on internal timeout.

This will also properly return json output, when the external hard
timeout is reached on swarming (causes SIGTERM).

TBR=sergiyb@chromium.org

Bug:  v8:7423 , chromium:813065
Change-Id: Ib20f835f58a0970693bdd3b21dc5d766d8e115d8
Reviewed-on: https://chromium-review.googlesource.com/924852
Reviewed-by: Michael Achenbach <machenbach@chromium.org>
Commit-Queue: Michael Achenbach <machenbach@chromium.org>
Cr-Commit-Position: refs/heads/master@{#51399}
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/base_runner.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/local/command.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/local/pool.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/local/pool_unittest.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/local/utils.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/num_fuzzer.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/standard_runner.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/testproc/execution.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/testproc/progress.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/testproc/sigproc.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/testproc/timeout.py
[modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/unittests/run_tests_test.py

Blocking: 820416
Project Member

Comment 32 by bugdroid1@chromium.org, Mar 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/87956c74cdb5d4dd8ab87adf9dd9104ff1cd5cca

commit 87956c74cdb5d4dd8ab87adf9dd9104ff1cd5cca
Author: Sergiy Byelozyorov <sergiyb@chromium.org>
Date: Fri Mar 09 20:33:36 2018

[tools] Add pypiwin32 pacakge to VPython config

TBR=machenbach@chromium.org

No-Try: true
Bug: chromium:813065
Change-Id: Ic502079f66432aefe33a765d212389b4ab191c1d
Reviewed-on: https://chromium-review.googlesource.com/955652
Commit-Queue: Sergiy Byelozyorov <sergiyb@chromium.org>
Reviewed-by: Sergiy Byelozyorov <sergiyb@chromium.org>
Cr-Commit-Position: refs/heads/master@{#51854}
[modify] https://crrev.com/87956c74cdb5d4dd8ab87adf9dd9104ff1cd5cca/.vpython

More evidence that our sigterm handling also doesn't work on windows:
https://chromium-swarm.appspot.com/task?id=3c39ad5760734010&refresh=10&show_raw=1

I also doesn't work on mac... but reliably works on linux.

Sign in to add a comment