New issue
Advanced search Search tips

Issue 803210 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

"telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest.testCollectAgentTraceDataFormat" is flaky

Project Member Reported by chromium...@appspot.gserviceaccount.com, Jan 17 2018

Issue description

"telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest.testCollectAgentTraceDataFormat" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNygwELEgVGbGFrZSJ4dGVsZW1ldHJ5LmludGVybmFsLnBsYXRmb3JtLnRyYWNpbmdfYWdlbnQuY3B1X3RyYWNpbmdfYWdlbnRfdW5pdHRlc3QuQ3B1VHJhY2luZ0FnZW50VGVzdC50ZXN0Q29sbGVjdEFnZW50VHJhY2VEYXRhRm9ybWF0DA.

Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs
 
Components: Speed>Telemetry
Owner: perezju@chromium.org
Status: Assigned (was: Untriaged)
 Issue 803249  has been merged into this issue.
 Issue 803229  has been merged into this issue.
 Issue 803211  has been merged into this issue.
Cc: chiniforooshan@chromium.org charliea@chromium.org
It's possible that this is due to Ehsan's tracing change. But for now I am suspecting Juan's recent work
Errors appear to be on:
telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest

  Traceback (most recent call last):
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent_unittest.py", line 61, in testStartAgentTracing
      self.assertTrue(self._agent.StartAgentTracing(self._config, 0))
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 289, in StartAgentTracing
      self._KeepTakingSnapshots()
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 298, in _KeepTakingSnapshots
      (self._collector.GetProcesses(), trace_time.Now()))
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 119, in GetProcesses
      processes = super(WindowsProcessCollector, self).GetProcesses()
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 69, in GetProcesses
      self._ParseProcessString(proc_string) for proc_string in proc_strings
  TypeError: 'NoneType' object is not iterable

From: https://chromium-swarm.appspot.com/task?id=3b1c2155bc1bb210&refresh=10&show_raw=1&wide_logs=true

Looking into it ...
Actually, it looks like the error is coming from here:
https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/cpu_tracing_agent.py?rcl=fbd65219a8245c27c19f00ea37d3136e00d9a993&l=144

Which from the logs shows up as:

  Error when running wmic command, which gave output: 
 
(note: no output is given)

I don't think this is related to my changes. Would there be a way to get more diagnostic info?

As the message is logged as a warning, is it safe to just return an empty list here to make the error non-fatal?
Cc: perezju@chromium.org
Owner: charliea@chromium.org
Looks like that warning message was added by Charlie on  issue 732584  to investigate some similar flakiness. And the problem just went away that time.

Assigning to Charlie to decide what to do next.
Labels: -Sheriff-Chromium
I see no flakes for this since 2018-01-17 20:29:31 UTC. Dropping from sheriff queue for now, but please do your best to fix this test soon or a future sheriff will disable it.
Have a patch in the CQ now which adds logging of the exit code of the command, which should help us debug this. Going to snooze this for 4 days, during which time hopefully another flake happens.
Project Member

Comment 11 by bugdroid1@chromium.org, Jan 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/031058b18eab12c71dd9a0122dd0d8a525d01f72

commit 031058b18eab12c71dd9a0122dd0d8a525d01f72
Author: Charlie Andrews <charliea@chromium.org>
Date: Thu Jan 18 20:37:47 2018

Log wmic exit code when it fails

Previously, when wmic failed, I tried to log the output that it gave,
hoping that it would give some clue as to why it failed. Unfortunately,
it looks like when it fails, _no_ output is given.

wmic does give very specific exit codes when it fails, though, so I'm
hoping that this latest logging will help us diagnose the failure.

TBR=nednguyen@google.com
Bug: chromium:803210
Change-Id: I9f31a08712545940c4c9f2a82aac77fb99fb013e
Reviewed-on: https://chromium-review.googlesource.com/874599
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Commit-Queue: Charlie Andrews <charliea@chromium.org>

[modify] https://crrev.com/031058b18eab12c71dd9a0122dd0d8a525d01f72/telemetry/telemetry/internal/platform/tracing_agent/cpu_tracing_agent.py

Project Member

Comment 12 by bugdroid1@chromium.org, Jan 19 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/9886544c22cf9a4763e6d8a400ae3a1f1c02777e

commit 9886544c22cf9a4763e6d8a400ae3a1f1c02777e
Author: Charlie Andrews <charliea@chromium.org>
Date: Fri Jan 19 15:29:51 2018

Reraise exception when wmic fails

This was suggested by perezju@, and ensures that this root exception is
raised instead of a more obscure one down the line.

R=perezju@chromium.org

Bug: chromium:803210
Change-Id: I432b4699e17209ae67b25b3b5f841099d3ed3dd3
Reviewed-on: https://chromium-review.googlesource.com/875912
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Commit-Queue: Charlie Andrews <charliea@chromium.org>

[modify] https://crrev.com/9886544c22cf9a4763e6d8a400ae3a1f1c02777e/telemetry/telemetry/internal/platform/tracing_agent/cpu_tracing_agent.py

Project Member

Comment 13 by chromium...@appspot.gserviceaccount.com, Jan 30 2018

Labels: Sheriff-Chromium
Detected 3 new flakes for test/step "telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest.testTraceSpecifiesTelemetryClockDomain". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyigELEgVGbGFrZSJ_dGVsZW1ldHJ5LmludGVybmFsLnBsYXRmb3JtLnRyYWNpbmdfYWdlbnQuY3B1X3RyYWNpbmdfYWdlbnRfdW5pdHRlc3QuQ3B1VHJhY2luZ0FnZW50VGVzdC50ZXN0VHJhY2VTcGVjaWZpZXNUZWxlbWV0cnlDbG9ja0RvbWFpbgw. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Sheriff Bug Queue (unless already there).
Logs:

=== testTraceSpecifiesTelemetryClockDomain ===

[115/292] telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest.testTraceSpecifiesTelemetryClockDomain queuedERROR:

Description = 
[116/292] telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest.testTraceSpecifiesTelemetryClockDomain failed unexpectedly 0.7080s:
  wmic failed with error code -1073738817 when running command, which gave output: 
  Traceback (most recent call last):
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent_unittest.py", line 143, in testTraceSpecifiesTelemetryClockDomain
      self._agent.StartAgentTracing(self._config, 0)
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 289, in StartAgentTracing
      self._collector.Init()
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 116, in Init
      self._GetProcessesAsStrings()
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 143, in _GetProcessesAsStrings
      self._GET_PERF_DATA_SHELL_COMMAND).strip().split('\n')[2:]
    File "c:\infra-system\bin\lib\subprocess.py", line 219, in check_output
      raise CalledProcessError(retcode, cmd, output=output)
  CalledProcessError: Command '['wmic', 'path', 'Win32_PerfFormattedData_PerfProc_Process', 'get', 'CreatingProcessID,IDProcess,Name,PercentProcessorTime,WorkingSet']' returned non-zero exit status -1073738817
[116/292] telemetry.internal.results.artifact_results_unittest.ArtifactResultsUnittest.testAddAndMove queued


== testCollectAgentTraceDataFormat ===

ERROR:

Description = [1/3] telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest.testCollectAgentTraceDataFormat failed unexpectedly 0.7670s:
  wmic failed with error code -1073738817 when running command, which gave output: 
  Traceback (most recent call last):
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent_unittest.py", line 107, in testCollectAgentTraceDataFormat
      self._agent.StartAgentTracing(self._config, 0)
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 289, in StartAgentTracing
      self._collector.Init()
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 116, in Init
      self._GetProcessesAsStrings()
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 143, in _GetProcessesAsStrings
      self._GET_PERF_DATA_SHELL_COMMAND).strip().split('\n')[2:]
    File "c:\infra-system\bin\lib\subprocess.py", line 219, in check_output
      raise CalledProcessError(retcode, cmd, output=output)
  CalledProcessError: Command '['wmic', 'path', 'Win32_PerfFormattedData_PerfProc_Process', 'get', 'CreatingProcessID,IDProcess,Name,PercentProcessorTime,WorkingSet']' returned non-zero exit status -1073738817
ERROR:

=== testStartAgentTracing ===

ERROR:

Description = [2/3] telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest.testStartAgentTracing failed unexpectedly 0.7750s:
  wmic failed with error code -1073738817 when running command, which gave output: 
  Traceback (most recent call last):
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent_unittest.py", line 61, in testStartAgentTracing
      self.assertTrue(self._agent.StartAgentTracing(self._config, 0))
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 289, in StartAgentTracing
      self._collector.Init()
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 116, in Init
      self._GetProcessesAsStrings()
    File "e:\b\swarm_slave\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 143, in _GetProcessesAsStrings
      self._GET_PERF_DATA_SHELL_COMMAND).strip().split('\n')[2:]
    File "c:\infra-system\bin\lib\subprocess.py", line 219, in check_output
      raise CalledProcessError(retcode, cmd, output=output)
  CalledProcessError: Command '['wmic', 'path', 'Win32_PerfFormattedData_PerfProc_Process', 'get', 'CreatingProcessID,IDProcess,Name,PercentProcessorTime,WorkingSet']' returned non-zero exit status -1073738817

Project Member

Comment 16 by bugdroid1@chromium.org, Jan 30 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/857b65094956a0c65aa76fa2247d6217129cc3e7

commit 857b65094956a0c65aa76fa2247d6217129cc3e7
Author: vitaliii <vitaliii@chromium.org>
Date: Tue Jan 30 12:34:10 2018

Disable some telemetry.i.p.t.c.CpuTracingAgentTest tests on Win (flaky).

Disable:

telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest.testTraceSpecifiesTelemetryClockDomain

telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest.testCollectAgentTraceDataFormat

telemetry.internal.platform.tracing_agent.cpu_tracing_agent_unittest.CpuTracingAgentTest.testStartAgentTracing

on Win, because flaky.

TBR=perezju@chromium.org

Bug: chromium:803210
Change-Id: I27b61d197a6831d0a5133ac2cfeb30f8f8d8bf1d
Reviewed-on: https://chromium-review.googlesource.com/892862
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>

[modify] https://crrev.com/857b65094956a0c65aa76fa2247d6217129cc3e7/telemetry/telemetry/internal/platform/tracing_agent/cpu_tracing_agent_unittest.py

Labels: -Sheriff-Chromium
Please have a look and reenable the tests. Thank you!
Cc: brucedaw...@chromium.org
Great! I was waiting for this to happen. Thanks for reporting it vitaliii@. 

It looks like wmic is failing with error code -1073738817 as interpreted by Python.

Python interprets this error code as a 32 bit signed integer, whereas in reality, it's actually a 32 bit unsigned integer. Using http://www.binaryconvert.com/result_signed_int.html?decimal=045049048055051055051056056049055, I can find out that the number is equivalent to 0xC0000BBF in hex.

I then go to the WMI error constants page (https://msdn.microsoft.com/en-us/library/aa394559(v=vs.85).aspx) and look up that hex number and... nothing. 

brucedawson@, is there any chance you could help me figure out what this means? I'm stumped.
I search the VS include files for C0000BBF and found two hits. Unfortunately error codes are not namespaced and it's not clear which (if either) of these is applicable. Here they are:

C:\Program Files (x86)\Windows Kits\10\Include\10.0.15063.0\um\PdhMsg.h
#define PDH_CSTATUS_NO_COUNTERNAME 0xC0000BBFL


...\Windows Kits\10\Include\10.0.15063.0\um\WindowsSearchErrors.h
//
// MessageId: EVENT_GATHERER_PERFMON
//
// MessageText:
//
// Performance monitoring cannot be initialized for the gatherer object, because the counters are not loaded or the shared memory object cannot be opened. This only affects availability of the perfmon counters. Restart the computer.%1
//
#define EVENT_GATHERER_PERFMON           0xC0000BBFL



The name of the second one sounds encouraging but WindowsSearchErrors.h seems like an odd location for a WMI error. So maybe WMI uses Pdh?


BTW, I filed  bug 803617  for the confusion caused by printing the inherently hexadecimal error codes as signed decimal.

Thanks for the tip Bruce! 

Searching around, I also found https://msdn.microsoft.com/en-us/library/windows/desktop/aa373214(v=vs.85).aspx

In there is:

WMI, Event 10 
Event filter with query "select * from __instancemodificationevent within 30 where targetinstance isa ‘Win32_PerfFormattedData_PerfDisk_LogicalDisk’ and targetinstance.PercentFreeSpace < 1 and targetinstance.Name != ‘_Total’" could not be reactivated in namespace "//./root/CIMV2" because of error 0xc0000bbf. Events cannot be delivered through this filter until the problem is corrected.

WMI, Event 5612: 
Windows Management Instrumentation has stopped WMIPRVSE.EXE because a quota reached a warning value. Quota: HandleCount  Value: 12853 Maximum value: 12288 WMIPRVSE PID: 7232

http://bit.ly/2nC6KGf also seems to suggest that the counters can get into a corrupted state and need to be cleared. It's unfortunate that this seems to be happening every ~10 days, though. That's not very reliable :-/

Both also seem to suggest that, if I'm serious about debugging this, then finding some way to dump the event logs from when the WMI command was run is the next step.
I'm probably going to put this on my backlog, but not work on it right now. It doesn't seem like there's a really obvious fix to this, and it seems like the sort of thing that I could spend an entire week trying to debug. There are also no severe ramifications of it, given that things are working well enough when this code is used in Telemetry that we haven't had any errors reported. 
 Issue 753257  has been merged into this issue.
Project Member

Comment 23 by bugdroid1@chromium.org, Feb 6 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/874bc7c4eabb1a1b08968210723ba4137b4a76aa

commit 874bc7c4eabb1a1b08968210723ba4137b4a76aa
Author: Charlie Andrews <charliea@chromium.org>
Date: Tue Feb 06 21:43:59 2018

Disable all CPU tracing agent integration tests on Windows

All of these tests are susceptible to a bug where wmic fails for
seemingly no reason.

TBR=nednguyen@google.com
Bug: chromium:803210
Change-Id: Ice0d0bec30a4564b6437c5dd8b5d1dd489893c44
Reviewed-on: https://chromium-review.googlesource.com/905284
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Commit-Queue: Charlie Andrews <charliea@chromium.org>

[modify] https://crrev.com/874bc7c4eabb1a1b08968210723ba4137b4a76aa/telemetry/telemetry/internal/platform/tracing_agent/cpu_tracing_agent_unittest.py

Comment 24 by benhenry@google.com, Jan 16 (6 days ago)

Components: Test>Telemetry

Comment 25 by benhenry@google.com, Jan 16 (6 days ago)

Components: -Speed>Telemetry

Sign in to add a comment