Issue 649333 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug



New win builds not getting passed to chromium.perf/Win 10 High-DPI Perf on chromium.perf

Project Member Reported by charliea@chromium.org, Sep 22 2016

Issue description

Win builder status page: http://bit.ly/2cTDo2d
Win builder (x64) status page: http://bit.ly/2cTC1k6

Win 10 High-DPI Perf (1): https://luci-milo.appspot.com/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf%20%282%29/

(2): https://luci-milo.appspot.com/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf%20%282%29/

(3): https://luci-milo.appspot.com/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf%20%284%29/

(4): https://luci-milo.appspot.com/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf%20%284%29/

(5) has a separate issue (https://bugs.chromium.org/p/chromium/issues/detail?id=647398).

Basically, the Win builders appear to be spitting out new builds (as recently as today), but many of the Windows perf bots appear not to have had a new build supplied since 9/18. This is blocking perf tests from running on many of the machines.

Infra-trooper, any idea what might be happening?

 

Comment 1 by emso@chromium.org, Sep 22 2016

Sorry, I don't know what the problem is here, but the MTV trooper shift starts very soon and they should have more knowledge about the Windows builders.

Comment 2 by aga...@chromium.org, Sep 22 2016

Cc: seanmccullough@chromium.org hinoka@chromium.org
Note: the initial bug report doesn't contain all of the links, and it includes some luci-milo links which aren't (yet) as helpful as buildbot links, plus some extraneous links. Here's everything needed to dig in:

Builder: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%20x64%20Builder
Tester 1: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%281%29
Tester 2: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%282%29
Tester 3: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%283%29
Tester 4: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%284%29
Tester 5: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%285%29

Equivalent luci-milo links:
Builder: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%20x64%20Builder
Tester 1: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%281%29
Tester 2: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%282%29
Tester 3: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%283%29
Tester 4: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%284%29
Tester 5: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%285%29

Two of the luci-milo links show recent activity (builder, and tester (2)). The others have no updates for a long time (some since the 14th, some since the 18th). Note, however, that the ones which have no recent updates on luci-milo are also the ones for which the uberchromegw link times out and then gives an uberproxy error.

So something is very wrong on chromium.perf itself. It's not serving data regarding those testers, so luci-milo can't get updates about them either.

My most likely guess: something went wrong, and those testers started getting huge pending queues. Buildbot is really bad at rendering pending queues. Every time anyone requests those tester pages (including luci-milo requesting the JSON version of that page), it tries to read thousands of pending jobs from disk, times out, and doesn't cache any of the partial result.

Trooper: try to look into pending queue length for the testers that won't load. Try to look into the buildbot logs to see if it is even receiving the request to load those pages, and if it is timing out trying to service that request.

Comment 3 by aga...@chromium.org, Sep 22 2016

Also, here's graphs for the pending queues: http://shortn/_jIZnbiCzCj

As predicted, the (2) tester (and the (3) tester, surprisingly) is keeping a reasonable queue, while the others are stalled out and growing their queue indefinitely.
Yikes. BIUO?

Anyway - what is the right thing to do here? Do we clear the queue or have less builders kick off testers?

Comment 5 by aga...@chromium.org, Sep 22 2016

Owner: aga...@chromium.org
Status: Started (was: Untriaged)
./run.py infra_internal.tools.cancelall -m chromium.perf -b "Win 10 High-DPI Perf (1)" "Win 10 High-DPI Perf (4)" "Win 10 High-DPI Perf (5)"
spawning `curl -F id=all http://chromegw/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%281%29/cancelbuild`
spawning `curl -F id=all http://chromegw/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%284%29/cancelbuild`
spawning `curl -F id=all http://chromegw/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%285%29/cancelbuild`

Should be better soon now...
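
For anyone reproducing the curl lines above by hand: the builder names are percent-encoded before they go into the URL path. A minimal sketch of that mapping follows, assuming the host and path layout printed in the log output above. The helper names `builder_path` and `cancel_url` are made up for illustration and are not part of the cancelall tool.

```python
from urllib.parse import quote


def builder_path(builder_name: str) -> str:
    """Percent-encode a builder name for use as a single URL path segment."""
    # safe="" escapes everything outside letters, digits and "_.-~", so
    # spaces become %20 and parentheses become %28 / %29.
    return quote(builder_name, safe="")


def cancel_url(master: str, builder_name: str) -> str:
    """Build a cancelbuild URL (hypothetical helper; layout copied from the log)."""
    return "http://chromegw/i/%s/builders/%s/cancelbuild" % (
        master,
        builder_path(builder_name),
    )


print(cancel_url("chromium.perf", "Win 10 High-DPI Perf (1)"))
```

This only builds the URL; it does not send the `id=all` form field that the curl commands post.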
Labels: -Pri-0 Pri-1 Type-Bug
And to prevent this from happening again...

'-1 priority level

Comment 7 by aga...@chromium.org, Sep 22 2016

Owner: seanmccullough@chromium.org
They haven't actually healed. The cancelall script didn't return any sort of error code, but the (1), (4), and (5) pages still aren't loading and the graph of pending queues hasn't gone down. I'm not sure what the next resort is, so assigning to the actual trooper :(
What is the difference between "Win 10 High-DPI Perf (1)" "Win 10 High-DPI Perf (4)" and "Win 10 High-DPI Perf (5)"? Are they just sharded builders or do they actually run different things?
Tester 5 seems to be stuck in a loop in step "battor.tough_video_cases ( 8 days 2 hrs ) battor.tough_video_cases", reporting this error for the past 8 days: 

ERROR:root:Failed to get process data
Traceback (most recent call last):
  File "C:\b\c\b\Win_10_High_DPI_Perf__5_\src\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 108, in GetProcesses
    'name': p.name,
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_common.py", line 48, in __get__
    ret = self.func(instance)
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\__init__.py", line 341, in name
    name = self._platform_impl.get_process_name()
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 190, in wrapper
    return fun(self, *args, **kwargs)
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 222, in get_process_name
    return os.path.basename(self.get_process_exe())
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 194, in wrapper
    raise AccessDenied(self.pid, self._process_name)
AccessDenied: (pid=5060)
Tester 1 also failing continuously with this: https://build.chromium.org/p/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%281%29/builds/221/steps/battor.power_cases/logs/stdio

CRITICAL:root:Finding return code for BattOr shell.
CRITICAL:root:Found return code: None
ERROR:root:Failed to get process data
Traceback (most recent call last):
  File "C:\b\c\b\Win_10_High_DPI_Perf__1_\src\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 108, in GetProcesses
    'name': p.name,
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_common.py", line 48, in __get__
    ret = self.func(instance)
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\__init__.py", line 341, in name
    name = self._platform_impl.get_process_name()
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 190, in wrapper
    return fun(self, *args, **kwargs)
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 222, in get_process_name
    return os.path.basename(self.get_process_exe())
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 194, in wrapper
    raise AccessDenied(self.pid, self._process_name)
AccessDenied: (pid=1480)

What is BattOr shell?
Cc: rnep...@chromium.org charliea@chromium.org
Adding some battor folks.   

Charlie, Randy: looks like these BattOrs may be wedged and need to be reset?
re #12
A return code of None on a subprocess means the subprocess hasn't exited yet, so it's certainly a possibility that it's wedged.

re #11
The battor shell is how we communicate with the battor power monitoring device.

That error is related to the cpu_tracing_agent though, and not the battor_tracing_agent. They just happen to be next to each other.
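
To illustrate the point about a return code of None: in Python's `subprocess` module, `Popen.poll()` returns None for as long as the child is still running, and the exit status once it has finished. A small sketch, using a short sleeping child as a stand-in for the BattOr shell (the real shell is not involved here):

```python
import subprocess
import sys

# Start a child that sleeps briefly; a stand-in for a long-running helper.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(1)"])

# While the child is still running, poll() returns None.
running_code = child.poll()

# wait() blocks until the child exits and returns its exit status.
final_code = child.wait()

print("while running:", running_code)
print("after exit:", final_code)
```

So "Found return code: None" in the log says only that the process had not exited at the time of the check. It does not by itself distinguish a healthy long-running process from a wedged one.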
I think that there are a couple levels of problems here (none of which involve BattOrs directly, I don't think, but some of which definitely involve BattOr benchmarks):

- The initial hang was caused by the "CPU tracing agent", which basically collects the results of the `top` command every 1s or so during a benchmark run. The Windows version of this apparently has a bug that makes it occasionally hang indefinitely (possibly due to permission problems?). This is being tracked here: https://bugs.chromium.org/p/chromium/issues/detail?id=647398. I disabled this agent today, so future builds shouldn't fail in the same way while we get this fixed. 

- The second level of this: how is it even possible in Telemetry for an agent to hang indefinitely? I would definitely expect Telemetry to forcibly kill any benchmark that takes longer than some high deadline (1h, 2h, 3h, 6h, something like that). nednguyen@, do you know if we have any logic for this?

Fixing both of these problems is definitely necessary, but I'd say the second is the more potentially problematic.
Cc: nednguyen@chromium.org
+nednguyen@ (I asked him a question in the last comment, but forgot to add him)
Cc: vhang@chromium.org
vhang@ is having a tech check out Win 10 High-DPI Perf (1)'s slave build117-b1 to make sure the battOr is connected correctly.
That tech would be me.  It looks connected correctly.
We don't have any built-in timeout logic for telemetry. Usually it's the recipe which sends SIGTERM/SIGKILL to telemetry process.
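
For reference, the kind of deadline logic being asked about could look roughly like the sketch below: wait up to a deadline, ask the process to stop, then force-kill it if it ignores the request. This is a generic illustration using only the standard library, not what the recipe or Telemetry actually does. The function name `run_with_deadline` and its parameters are made up.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_s, grace_s=5.0):
    """Run cmd; stop it after deadline_s, force-kill after grace_s more.

    Returns (returncode, timed_out).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=deadline_s), False
    except subprocess.TimeoutExpired:
        pass
    proc.terminate()  # SIGTERM on POSIX; TerminateProcess on Windows
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; same as terminate() on Windows
        proc.wait()
    return proc.returncode, True


# A child that would run for a minute, cut off after half a second.
hung_cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
code, timed_out = run_with_deadline(hung_cmd, deadline_s=0.5)
print("timed out:", timed_out, "return code:", code)
```

A real benchmark deadline would be on the order of hours, as suggested above. The half-second value is only there to keep the example quick.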

Comment 19 by dtu@chromium.org, Sep 22 2016

Someone had a question about this benchmark over Speed Infra chat, last week, I think. I looked into it and found that the cpu tracing agent is spending an absurd amount of time spewing thousands of stack traces to the log. It's not hung, exactly. Because the code was written by an intern, nobody followed up on a fix.

It should be somewhere around here:
https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/internal/platform/tracing_agent/cpu_tracing_agent.py#L97
fwiw buildbot does indeed deal poorly with "tons of logs"
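
One way to avoid a stack trace per inaccessible process is to catch the error per process, skip that process, and log a single summary line. The sketch below shows the idea only; it is not the fix that landed in catapult. `AccessDenied` and `FakeProcess` are stand-ins defined here so the example runs without psutil, and `collect_processes` is a made-up name.

```python
import logging
from collections import Counter


class AccessDenied(Exception):
    """Stand-in for psutil's AccessDenied, so the sketch has no dependencies."""


class FakeProcess:
    """Hypothetical process object whose .name raises for protected processes."""

    def __init__(self, pid, name=None):
        self.pid = pid
        self._name = name

    @property
    def name(self):
        if self._name is None:
            raise AccessDenied("pid=%d" % self.pid)
        return self._name


def collect_processes(procs):
    """Return (rows, skipped): per-process data plus skipped-error counts."""
    rows, skipped = [], Counter()
    for p in procs:
        try:
            rows.append({"pid": p.pid, "name": p.name})
        except AccessDenied as e:
            # Count the failure instead of logging a full traceback each time.
            skipped[type(e).__name__] += 1
    if skipped:
        logging.warning("Skipped processes: %s", dict(skipped))
    return rows, skipped


rows, skipped = collect_processes(
    [FakeProcess(1, "chrome.exe"), FakeProcess(5060), FakeProcess(1480)]
)
print(rows, dict(skipped))
```

With a sampling interval of about a second, this keeps the log to one line per sample rather than one traceback per protected process.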

Comment 21 by vhang@chromium.org, Sep 22 2016

Ken, one of the techs, checked out build117-b1.  Said that the light should stay green with occasional yellow flashes but build117-b1 would flash yellow and then die.

dtu or sullivan, do either of you know what's going on? These things are new to us. Any idea what the problem is? Has the BattOr gone bad?
charliea: please respond to #21. Is there a playbook for debugging BattOrs?
Vince, could you please try to unplug and replug both cables going into the BattOr, and see if that resolves the problem?
I'll do that the next time I'm in the lab.
Done.  Steady flashing orange led every 2 seconds.   No happy green led.   I think we have spares in the cabinet.  Want me to swap it out?
Owner: ----
Owner: charliea@chromium.org
Assigning back to Charlie to triage next steps.
Went ahead and added a section to the BattOr manual troubleshooting section here: https://docs.google.com/document/d/1_IdRUB8GKux40GsF9herpBAkktO4UjPYDzXkm5T-PI8/edit#heading=h.gqcbq6sq5odv
Components: -Infra>Client Infra>Client>Perf
Ping. Does this need any more work?

Comment 30 by zh...@chromium.org, Nov 15 2016

Charlie, is this fixed now?
Status: Fixed (was: Started)
Yep, this is fixed.
