Handle sigterm in V8 test framework and log hanging tests |
|||||||||
Issue descriptionExample failure : https://ci.chromium.org/p/v8/builds/b8954429453077102880 in CL @ https://chromium-review.googlesource.com/c/v8/v8/+/922103. Traceback (most recent call last): File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/run.py", line 275, in _new_run recipe_result = recipe_script.run(api, self.properties, self.environ) File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/loader.py", line 98, in run self.run_steps, properties, environ, self.PROPERTIES, api=api) File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/loader.py", line 626, in invoke_with_properties arg_names, **additional_args) File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/loader.py", line 589, in _invoke_with_properties return callable_obj(*props, **additional_args) File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/recipes/v8.py", line 91, in RunSteps test_results = v8.runtests(tests) File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/recipe_api.py", line 750, in _inner return func(*a, **kw) File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/recipe_modules/v8/api.py", line 870, in runtests test_results += t.run(coverage_context=coverage_context) File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/recipe_modules/v8/testing.py", line 552, in run return result + self.post_run(self.test, coverage_context) File "/b/swarming/w/ir/kitchen-checkout/scripts/slave/recipe_modules/v8/testing.py", line 402, in post_run assert len(json_output) == 1 TypeError: object of type 'NoneType' has no len()
,
Feb 16 2018
to clarify: are you seeing this on the chromium CQ, or only on the v8 one?
,
Feb 16 2018
Haven't seen it on Chromium CQ, perhaps v8 specific.
,
Feb 16 2018
This appears to be due to Check timing out. 8 of more than 250 runs over the last 24 hours on the same bot have timed out in similar fashion: https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1518792420000&f=name%3ACheck&f=os%3AMac-10.9&l=50&n=true&s=created_ts%3Adesc&st=1518706020000 I would suspect that this isn't actually a trooper issue.
,
Feb 16 2018
(though the v8 recipe module could likely handle the lack of json output a bit more clearly)
,
Feb 16 2018
Hmmm interesting, 8/250 sounds like the number of tries I've made... could my CL be causing this? If so this is a very obscure way to catch an error in my CL..? It builds and passes all tests locally...
,
Feb 16 2018
(Note that the 8/250+ is from Mac-10.9 alone; I didn't look at other bots too much.) 7 of the 8 are from your two CLs (https://chromium-review.googlesource.com/c/v8/v8/+/918663 and https://chromium-review.googlesource.com/c/v8/v8/+/922103), with the 8th being on https://chromium-review.googlesource.com/c/v8/v8/+/918641/6. Your CL could be causing or perhaps exacerbating this; I'm not familiar enough w/ v8's tests to say.
,
Feb 16 2018
What does the error even mean though..? I have no idea where to start...
,
Feb 16 2018
I'd be happy to raise the timeout if the increased runtime is expected and not a regression that needs to be fixed. The default hard timeout we use is 45 minutes.
,
Feb 16 2018
To be clear, the stack trace you posted in #0 isn't the problem; it's a symptom of it. (As mentioned in #5, this definitely isn't clear from the bot page.) If your CL is indeed causing or exacerbating this, then the issue is that your CL caused something in Check (possibly, but not necessarily, *everything* in Check) to take long enough that the suite hit the hard timeout sergiyb mentioned. You're looking for what part of that suite is taking longer & why.
,
Feb 16 2018
I've re-run the task https://chromium-swarm.appspot.com/task?id=3bb7e71658c57d10 locally and it generated the attached log. What is not clear to me and Gabriel is why aren't these errors also visible in the server log?
,
Feb 16 2018
Potential theories: - logs are so large that bot has stalled while trying to print them to stdout causing a timeout - affected tests run slower on a bot (due to small number of cores compared to workstations) and get interrupted prior to printing errors It was also confusing that log did not contain the tests that were triggered, making it hard to know which tests have been running when timeout occurred. In any case, I don't think this is trooper issue.
,
Feb 16 2018
,
Feb 16 2018
,
Feb 16 2018
FWIW, I found the cause of the error : bots run a much smaller v8 thread pool than my machine (ref. DefaultPlatform::SetThreadPoolSize). My CL has problems with a small pool looks like. That generated a 20 line stack trace repeated 16620 times in log @#11. I can now generate those errors locally by hardcoding a small thread pool. Having had this stack trace on the bots would have saved my a day, thanks for digging into this and helping out!
,
Feb 16 2018
Re 11: The difference when running locally is strange. @gab: What does this failing dcheck mean? Can it be connected to our workstations having much more cores? It seems task number related. Re general: 1) The V8 test framework does not handle the sigterm, when swarming wants stop it due to hard timeout. We could teach the framework to properly handle it, terminate gracefully and print the final state. 2) The test framework doesn't log which tests are starting. They're just pulled from a queue. We could tackle any of those problems. They are usually rare and nobody had cycles. Re root cause: There is very likely one test running very slow - we sadly can't see which one. But I suggest to start running the test locally and see if you can repro what Sergiy could repro above. Maybe the timeouts will be gone too once this is fixed?
,
Feb 16 2018
Alright, seeing some answers in comment 15 now. The test framework still needs improvement for handling such hanging tests better.
,
Feb 16 2018
@#16, see #15. It's not slow, it's just hitting a DCHECK in *every* test and doing a LOT of printing (maybe quit after a certain number of failures?). Yes this is related to thread pool size (hard coding it small locally results in the error being hit -- so I just discovered).
,
Feb 16 2018
Idea : we could fuzz the thread pool size locally (based on recently --random-seed based fuzzing)?
,
Feb 16 2018
WIP CL: https://crrev.com/c/924188. I probably won't finish it before my vacation until 2018-03-08, hence feel free to steal this bug.
,
Feb 16 2018
Re 19: Good idea. Can you give me pointer on how to manipulate the thread pool size and which are the number ranges that make sense?
,
Feb 16 2018
Though, unrelated bug here. Maybe we should make a new feature request.
,
Feb 16 2018
Sorry, I've stolen this bug for improving output. Can you please file a new bug for improving fuzzing of the number of threads?
,
Feb 16 2018
,
Feb 16 2018
Also related to my CL, looks like I can't use mock library on our swarming bots. I've tried using .vpython file, but it didn't work. Asked about this here: http://shortn/_VXUgff0F3C (sorry, internal only). (the comment above was rewritten to avoid containing internal links, which allows me to remove RVG and keep this bug open)
,
Feb 16 2018
,
Feb 16 2018
P.S. It was Michael who questioned whether this bug needs to be RVG'd. Thank your for the pushback and for keeping V8 a truly opensource project.
,
Feb 17 2018
,
Feb 20 2018
The following revision refers to this bug: https://chromium.googlesource.com/v8/v8.git/+/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84 commit 4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84 Author: Michael Achenbach <machenbach@chromium.org> Date: Tue Feb 20 18:02:25 2018 [test] Abort testing immediately on interrupt and sigterm This is a partial revert of: https://crrev.com/c/890938 and https://crrev.com/c/893982 Before this CL, the test runner blocked on ongoing tests in order to process their results after an internal timeout. However, the logic required for this feature was overly complicated and prevented an acceptable implementation for fast aborts. Furthermore, also the fuzzers suffered from timeouts on swarming due to hanging tests. Instead, we now abort immediately on internal timeout (used on fuzzers), SIGINT (Ctrl-C) and SIGTERM. Ongoing tests are immediately terminated and their results are disregarded. On SIGTERM and SIGINT, we return with non-zero exit codes, and zero on internal timeout. This will also properly return json output, when the external hard timeout is reached on swarming (causes SIGTERM). TBR=sergiyb@chromium.org Bug: v8:7423 , chromium:813065 Change-Id: Ib20f835f58a0970693bdd3b21dc5d766d8e115d8 Reviewed-on: https://chromium-review.googlesource.com/924852 Reviewed-by: Michael Achenbach <machenbach@chromium.org> Commit-Queue: Michael Achenbach <machenbach@chromium.org> Cr-Commit-Position: refs/heads/master@{#51399} [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/base_runner.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/local/command.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/local/pool.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/local/pool_unittest.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/local/utils.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/num_fuzzer.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/standard_runner.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/testproc/execution.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/testproc/progress.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/testproc/sigproc.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/testrunner/testproc/timeout.py [modify] https://crrev.com/4a2d9b1b208c7cde00b98a0ab7ad2de67f966b84/tools/unittests/run_tests_test.py
,
Mar 9 2018
,
Mar 9 2018
The following revision refers to this bug: https://chromium.googlesource.com/v8/v8.git/+/87956c74cdb5d4dd8ab87adf9dd9104ff1cd5cca commit 87956c74cdb5d4dd8ab87adf9dd9104ff1cd5cca Author: Sergiy Byelozyorov <sergiyb@chromium.org> Date: Fri Mar 09 20:33:36 2018 [tools] Add pypiwin32 pacakge to VPython config TBR=machenbach@chromium.org No-Try: true Bug: chromium:813065 Change-Id: Ic502079f66432aefe33a765d212389b4ab191c1d Reviewed-on: https://chromium-review.googlesource.com/955652 Commit-Queue: Sergiy Byelozyorov <sergiyb@chromium.org> Reviewed-by: Sergiy Byelozyorov <sergiyb@chromium.org> Cr-Commit-Position: refs/heads/master@{#51854} [modify] https://crrev.com/87956c74cdb5d4dd8ab87adf9dd9104ff1cd5cca/.vpython
,
Mar 14 2018
More evidence that our sigterm handling also doesn't work on windows: https://chromium-swarm.appspot.com/task?id=3c39ad5760734010&refresh=10&show_raw=1 I also doesn't work on mac... but reliably works on linux. |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by jbudorick@chromium.org
, Feb 16 2018