vm_sanity crashes |
|||
Issue descriptionThis is the cros_amd64-generic builder: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel?limit=200 Looking at the last ~300 builds, from 5/27 to 5/31, we have 5 different types of vm_sanity failures. The last 140 builds have had no vm_sanity failures. deploy_chrome time out (probably as a result of symbolized chrome): https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13161 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13149 Crash #1: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13119 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13118 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13114 Stack: INFO:root:Browser is closed. INFO:root:Try printing formatted exception: <class 'telemetry.internal.backends.chrome_inspector.devtools_http.DevToolsClientConnectionError'> '' <traceback object at 0x7c1ca6fe8518> Traceback (most recent call last): <module> at /usr/local/autotest/bin/vm_sanity.py:72 sys.exit(main(sys.argv[1:])) main at /usr/local/autotest/bin/vm_sanity.py:39 with chrome.Chrome(arc_mode=arc_mode) as cr: __init__ at /usr/local/autotest/common_lib/cros/chrome.py:172 self._browser = self._browser_to_create.Create() Create at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_finder.py:101 browser_backend, self._platform_backend, startup_args) __init__ at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_with_oobe.py:14 backend, platform_backend, startup_args) __init__ at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/browser/browser.py:48 self._browser_backend.Start(startup_args, startup_url=startup_url) traced_function at /usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52 return func(*args, **kwargs) Start at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py:102 self.oobe.NavigateFakeLogin( oobe at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py:169 return self.misc_web_contents_backend.GetOobe() GetOobe at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/misc_web_contents_backend.py:28 if not len(self): __len__ at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py:95 self._Update() _Update at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py:99 backends_map = self._devtools_client.GetUpdatedInspectableContexts() GetUpdatedInspectableContexts at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/devtools_client_backend.py:376 contexts = self._ListInspectableContexts() _ListInspectableContexts at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/devtools_client_backend.py:301 return self._devtools_http.RequestJson('') RequestJson at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/devtools_http.py:107 return json.loads(self.Request(path, timeout)) Request at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/devtools_http.py:88 response = self._conn.getresponse() getresponse at /usr/local/lib64/python2.7/httplib.py:1132 response.begin() begin at /usr/local/lib64/python2.7/httplib.py:453 version, status, reason = self._read_status() _read_status at /usr/local/lib64/python2.7/httplib.py:417 raise BadStatusLine(line) DevToolsClientConnectionError: '' Locals: line : '' Crash #2: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13116 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13115 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13030 Stack: INFO:root:Try printing formatted exception: <class 'py_utils.TimeoutException'> Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30). <traceback object at 0x7fae0567f6c8> Traceback (most recent call last): <module> at /usr/local/autotest/bin/vm_sanity.py:72 sys.exit(main(sys.argv[1:])) main at /usr/local/autotest/bin/vm_sanity.py:39 with chrome.Chrome(arc_mode=arc_mode) as cr: __init__ at /usr/local/autotest/common_lib/cros/chrome.py:172 self._browser = self._browser_to_create.Create() Create at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_finder.py:101 browser_backend, self._platform_backend, startup_args) __init__ at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_with_oobe.py:14 backend, platform_backend, startup_args) __init__ at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/browser/browser.py:48 self._browser_backend.Start(startup_args, startup_url=startup_url) traced_function at /usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52 return func(*args, **kwargs) Start at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py:82 py_utils.WaitFor(lambda: self.oobe_exists, 30) WaitFor at /usr/local/telemetry/src/third_party/catapult/common/py_utils/py_utils/__init__.py:148 (timeout, GetConditionString())) TimeoutException: Timed out while waiting 30s for py_utils.WaitFor(lambda: self.oobe_exists, 30). Locals: GetConditionString : <function GetConditionString at 0x7fae056c4410> condition : <function <lambda> at 0x7fae06127f50> elapsed_time : 32.533758878707886 last_output_elapsed_time : 32.533758878707886 last_output_time : 1527613943.612802 now : 1527613976.1465609 poll_interval : 3.2533758878707886 res : False start_time : 1527613943.612802 timeout : 30 Crash #3: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13106 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13082 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13032 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13021 Stack: Traceback (most recent call last): <module> at /usr/local/autotest/bin/vm_sanity.py:72 sys.exit(main(sys.argv[1:])) main at /usr/local/autotest/bin/vm_sanity.py:39 with chrome.Chrome(arc_mode=arc_mode) as cr: __init__ at /usr/local/autotest/common_lib/cros/chrome.py:172 self._browser = self._browser_to_create.Create() Create at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_finder.py:101 browser_backend, self._platform_backend, startup_args) __init__ at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_with_oobe.py:14 backend, platform_backend, startup_args) __init__ at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/browser/browser.py:48 self._browser_backend.Start(startup_args, startup_url=startup_url) traced_function at /usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52 return func(*args, **kwargs) Start at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py:104 not self.browser_options.disable_gaia_services) traced_function at /usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52 return func(*args, **kwargs) NavigateFakeLogin at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/oobe.py:71 enterprise_enroll) traced_function at /usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52 return func(*args, **kwargs) _ExecuteOobeApi at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/oobe.py:47 self.ExecuteJavaScript('{{ @f }}({{ *args }})', f=api, args=args) traced_function at /usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52 return func(*args, **kwargs) ExecuteJavaScript at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/browser/web_contents.py:192 return self._inspector_backend.ExecuteJavaScript(*args, **kwargs) traced_function at /usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52 return func(*args, **kwargs) Inner at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend.py:41 inspector_backend._ConvertExceptionFromInspectorWebsocket(e) traced_function at /usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52 return func(*args, **kwargs) Inner at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend.py:38 return func(inspector_backend, *args, **kwargs) ExecuteJavaScript at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend.py:225 self._runtime.Execute(statement, context_id, timeout) Execute at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_runtime.py:20 self.Evaluate(expr + '; 0;', context_id, timeout) Evaluate at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_runtime.py:44 res = self._inspector_websocket.SyncRequest(request, timeout) SyncRequest at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_websocket.py:116 res = self._Receive(timeout) _Receive at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_websocket.py:155 data = self._socket.recv() recv at /usr/local/telemetry/src/third_party/catapult/telemetry/third_party/websocket-client/websocket/_core.py:293 opcode, data = self.recv_data() recv_data at /usr/local/telemetry/src/third_party/catapult/telemetry/third_party/websocket-client/websocket/_core.py:310 opcode, frame = self.recv_data_frame(control_frame) recv_data_frame at /usr/local/telemetry/src/third_party/catapult/telemetry/third_party/websocket-client/websocket/_core.py:323 frame = self.recv_frame() recv_frame at /usr/local/telemetry/src/third_party/catapult/telemetry/third_party/websocket-client/websocket/_core.py:357 return self.frame_buffer.recv_frame() recv_frame at /usr/local/telemetry/src/third_party/catapult/telemetry/third_party/websocket-client/websocket/_abnf.py:336 self.recv_header() recv_header at /usr/local/telemetry/src/third_party/catapult/telemetry/third_party/websocket-client/websocket/_abnf.py:286 header = self.recv_strict(2) recv_strict at /usr/local/telemetry/src/third_party/catapult/telemetry/third_party/websocket-client/websocket/_abnf.py:371 bytes_ = self.recv(min(16384, shortage)) _recv at /usr/local/telemetry/src/third_party/catapult/telemetry/third_party/websocket-client/websocket/_core.py:427 return recv(self.sock, bufsize) recv at /usr/local/telemetry/src/third_party/catapult/telemetry/third_party/websocket-client/websocket/_socket.py:93 "Connection is already closed.") DevtoolsTargetCrashException: Devtools target crashed Crash #4: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13083 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13065 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13061 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13060 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13057 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13054 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13052 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13051 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13048 Stack: Traceback (most recent call last): <module> at /usr/local/autotest/bin/vm_sanity.py:72 sys.exit(main(sys.argv[1:])) main at /usr/local/autotest/bin/vm_sanity.py:39 with chrome.Chrome(arc_mode=arc_mode) as cr: __init__ at /usr/local/autotest/common_lib/cros/chrome.py:172 self._browser = self._browser_to_create.Create() Create at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_finder.py:101 browser_backend, self._platform_backend, startup_args) __init__ at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_with_oobe.py:14 backend, platform_backend, startup_args) __init__ at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/browser/browser.py:48 self._browser_backend.Start(startup_args, startup_url=startup_url) traced_function at /usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52 return func(*args, **kwargs) Start at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py:81 self.BindDevToolsClient() traced_function at /usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:52 return func(*args, **kwargs) BindDevToolsClient at /usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/chrome_browser_backend.py:139 raise exceptions.BrowserGoneException(self.browser, e) BrowserGoneException: Timed out while waiting 240s for _GetDevToolsClient. Found Minidump: False
,
May 31 2018
And if we have trouble figuring out what's holding it up, we may just wanna promote the test out of experimental mode anyway. Like you mentioned earlier somewhere, the CQ does do auto-retries. Given how infrequently it hangs, we'd probably be able to use those retries to mask the flakiness.
,
Jun 4 2018
The flake rate has gone back down, and it does look like the crashing has stopped. From 8a 5/30 to 4p 6/4 about 5.5 days, we've had 400 builds with 3 failures. The failures are the time outs: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13449 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13397 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13370 This is in addition to the 2 other cases we've seen before: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13161 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel/13149 We have more of the stdout, but we'll probably need additional log files. Ben - can we extract and make available additional files in the run results page? The postprocess_for_goma.upload_log step seems to make log files available. My fix in comment #3 is unlikely to help
,
Jun 4 2018
Pulling logs from the VM is a good idea, and would probably be useful outside of this. We can tag tests with an arbitrary number of output/result files, so we can easily throw logs in there. What logs would you think we'd need? Everything under /var/log/chrome/ ? I can look into pulling them from the VM after each test.
,
Jun 4 2018
Everything under /var/log/chrome would be nice, but at least chrome, and ideally also chrome.PREVIOUS. Next, we should consider everything in /var/log, prioritizing /var/log/messages, /var/log/ui/ui.LATEST This will end up being a lot of files, so we would ideally provide a browse-able directory structure instead of a flat list of files.
,
Jun 5 2018
All of /var/log/ doesn't seem that large. I think we can squeeze it all in. The viewer ui would look like https://isolateserver.appspot.com/browse?namespace=default-gzip&hash=32a4bfbb5da4503ad730b0e157f826a18cb65ab0 (lmk if you don't have permissions to view that). It's flat, but at least it's sorted and shows the dirs. My one concern is that the tasks in question here are getting timed out, so we have to be sure to pull the logs from the VM before swarming sends us the kill signal and we exit. I believe there's a grace period between when it sends an initial sigterm and the fallback sigkill (I think it's ~30-60s? +MA) I think we should be able to trap the first signal and do our log business in time before we get sigkill'ed. I'll work on that.
,
Jun 5 2018
grace_periods defaults to 30 seconds, it's configurable per TaskProperties if you want to increase it.
,
Jun 5 2018
Is the timeout really 60 min? We should fail before then instead of hanging - I'll take a look. There's no timing info in stdout so it's a bit hard to tell if we're running along and getting stuck somewhere.
,
Jun 6 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/417c70dc0081dcae1559905de877ed6f404c23ac commit 417c70dc0081dcae1559905de877ed6f404c23ac Author: Ben Pastene <bpastene@chromium.org> Date: Wed Jun 06 00:11:51 2018 Add option to CrOS VM test runner to collect VM logs after the test. And enable it for the sanity test. This won't work when the test times out and swarming kills it (ie: we won't pull the logs from the VM before we die.) The fix for that will come in a follow-up CL. Bug: 848402 Change-Id: Ie2396adb7e284d5069162a0c8aea145c12d85be1 Reviewed-on: https://chromium-review.googlesource.com/1086416 Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Ben Pastene <bpastene@chromium.org> Cr-Commit-Position: refs/heads/master@{#564723} [modify] https://crrev.com/417c70dc0081dcae1559905de877ed6f404c23ac/build/chromeos/run_vm_test.py [modify] https://crrev.com/417c70dc0081dcae1559905de877ed6f404c23ac/testing/buildbot/gn_isolate_map.pyl
,
Jun 7 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b602383ee8c4cf2007dc300fc9860e36fb8b0f2c commit b602383ee8c4cf2007dc300fc9860e36fb8b0f2c Author: Ben Pastene <bpastene@chromium.org> Date: Thu Jun 07 18:25:02 2018 Gracefully kill child procs of CrOS VM tests on SIGTERM. We collect logs from the VM after tests. The problem is, if the test hangs indefinitely, we won't get a chance to grab the logs before swarming kills us when we reach the timeout. Swarming sends a SIGTERM, waits for grace_period, then SIGKILLs the task if it's still running. This CL will catch the SIGTERM and kill what should be the frozen test process. This will let cros_run_vm_test pull the logs before we get SIGKILLed. Bug: 848402 Change-Id: I48be1de865e3b287584978b5461f15e2bae49dfd Reviewed-on: https://chromium-review.googlesource.com/1087800 Commit-Queue: Ben Pastene <bpastene@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#565357} [modify] https://crrev.com/b602383ee8c4cf2007dc300fc9860e36fb8b0f2c/.vpython [modify] https://crrev.com/b602383ee8c4cf2007dc300fc9860e36fb8b0f2c/build/chromeos/run_vm_test.py
,
Jun 7 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ebf5c08e17179fa6593405acce1aab14efd6444d commit ebf5c08e17179fa6593405acce1aab14efd6444d Author: Achuith Bhandarkar <achuith@chromium.org> Date: Thu Jun 07 20:11:41 2018 vm_sanity: Eliminate retries Prefer not to suppress failures. BUG= chromium:848402 Change-Id: I72047a40208de4162a3f71ba98b99c519bfa6d6e Reviewed-on: https://chromium-review.googlesource.com/1089793 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Achuith Bhandarkar <achuith@chromium.org> Reviewed-by: Achuith Bhandarkar <achuith@chromium.org> Reviewed-by: Ben Pastene <bpastene@chromium.org> [modify] https://crrev.com/ebf5c08e17179fa6593405acce1aab14efd6444d/client/bin/vm_sanity.py
,
Jun 8 2018
The change in #11 was supposed to capture vm logs on task timeout. But that doesn't seem to working quite right. There's no isolated outputs for https://chromium-swarm.appspot.com/task?id=3df8d76f95e95d10 It works like it's supposed to when I manually sigterm the pid. I wonder if swarming's doing something weird...
,
Jun 20 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e5ece81fab9147defb70bc3209ae8e8cd8167eeb commit e5ece81fab9147defb70bc3209ae8e8cd8167eeb Author: Ben Pastene <bpastene@chromium.org> Date: Wed Jun 20 22:39:34 2018 Make test_env.py forward swarming's timeout signals. This will let any test invoked via test_env.py on swarming handle timeouts. This should also cover xvfb.py: https://chromium.googlesource.com/chromium/src/+/master/testing/xvfb.py#97 Bug: 733612, 848402 Change-Id: I33242bd6fe86072f0b32150eab6b878436252388 Reviewed-on: https://chromium-review.googlesource.com/1105022 Reviewed-by: Dirk Pranke <dpranke@chromium.org> Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org> Commit-Queue: Ben Pastene <bpastene@chromium.org> Cr-Commit-Position: refs/heads/master@{#569055} [modify] https://crrev.com/e5ece81fab9147defb70bc3209ae8e8cd8167eeb/testing/test_env.py
,
Jul 16
|
|||
►
Sign in to add a comment |
|||
Comment 1 by bpastene@chromium.org
, May 31 2018