New Win builds not getting passed to chromium.perf/Win 10 High-DPI Perf on chromium.perf
Issue description

Win builder status page: http://bit.ly/2cTDo2d
Win builder (x64) status page: http://bit.ly/2cTC1k6

Win 10 High-DPI Perf
(1): https://luci-milo.appspot.com/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf%20%282%29/
(2): https://luci-milo.appspot.com/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf%20%282%29/
(3): https://luci-milo.appspot.com/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf%20%284%29/
(4): https://luci-milo.appspot.com/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf%20%284%29/
(5) has a separate issue (https://bugs.chromium.org/p/chromium/issues/detail?id=647398).

Basically, the Win builders appear to be spitting out new builds (as recently as today), but many of the Windows perf bots appear not to have been supplied a new build since 9/18. This is blocking perf tests from running on many of the machines.

Infra-trooper, any idea what might be happening?
,
Sep 22 2016
Note: the initial bug report does not contain all of the links, uses some luci-milo links which aren't (yet) as helpful as buildbot links, and includes some extraneous links. Here's everything needed to dig in:

Builder: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%20x64%20Builder
Tester 1: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%281%29
Tester 2: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%282%29
Tester 3: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%283%29
Tester 4: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%284%29
Tester 5: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%285%29

Equivalent luci-milo links:

Builder: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%20x64%20Builder
Tester 1: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%281%29
Tester 2: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%282%29
Tester 3: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%283%29
Tester 4: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%284%29
Tester 5: https://luci-milo.appspot.com/buildbot/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%285%29

Two of the luci-milo links show recent activity (the builder and tester (2)). The others have had no updates for a long time (some since the 14th, some since the 18th). Note, however, that the ones which have no recent updates on luci-milo are also the ones for which the uberchromegw link times out and then gives an uberproxy error. So something is very wrong on chromium.perf itself: it's not serving data regarding those testers, so luci-milo can't get updates about them either.
My most likely guess: something went wrong, and those testers started getting huge pending queues. Buildbot is really bad at rendering pending queues. Every time anyone requests those tester pages (including luci-milo requesting the JSON version of that page), it tries to read thousands of pending jobs from disk, times out, and doesn't cache any of the partial result. Trooper: try to look into pending queue length for the testers that won't load. Try to look into the buildbot logs to see if it is even receiving the request to load those pages, and if it is timing out trying to service that request.
,
Sep 22 2016
Also, here are graphs for the pending queues: http://shortn/_jIZnbiCzCj As predicted, the (2) tester (and, surprisingly, the (3) tester) is keeping a reasonable queue, while the others are stalled out and growing their queues indefinitely.
,
Sep 22 2016
Yikes. BIUO? Anyway, what is the right thing to do here? Do we clear the queue, or have fewer builders kick off testers?
,
Sep 22 2016
./run.py infra_internal.tools.cancelall -m chromium.perf -b "Win 10 High-DPI Perf (1)" "Win 10 High-DPI Perf (4)" "Win 10 High-DPI Perf (5)"
spawning `curl -F id=all http://chromegw/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%281%29/cancelbuild`
spawning `curl -F id=all http://chromegw/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%284%29/cancelbuild`
spawning `curl -F id=all http://chromegw/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%285%29/cancelbuild`

Should be better soon now...
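For reference, the percent-encoded builder paths in the `curl` commands above can be reproduced as follows. This is a minimal sketch and not the cancelall tool itself: `cancel_url` is a hypothetical helper, and the base URL is simply copied from the output above.

```python
from urllib.parse import quote

# Base URL copied from the cancelall output above.
BASE = "http://chromegw/i/chromium.perf/builders"


def cancel_url(builder_name):
    """Hypothetical helper: build the cancelbuild URL for one builder."""
    # safe="" forces every reserved character, including spaces and
    # parentheses, to be percent-encoded.
    return "{}/{}/cancelbuild".format(BASE, quote(builder_name, safe=""))


for shard in (1, 4, 5):
    print(cancel_url("Win 10 High-DPI Perf ({})".format(shard)))
```

The sketch only builds the URLs; it does not send any request to the master.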
,
Sep 22 2016
And to prevent this from happening again... '-1 priority level
,
Sep 22 2016
They haven't actually healed. The cancelall script didn't return any sort of error code, but the (1), (4), and (5) pages still aren't loading and the graph of pending queues hasn't gone down. I'm not sure what the next resort is, so assigning to the actual trooper :(
,
Sep 22 2016
What is the difference between "Win 10 High-DPI Perf (1)" "Win 10 High-DPI Perf (4)" and "Win 10 High-DPI Perf (5)"? Are they just sharded builders or do they actually run different things?
,
Sep 22 2016
Tester 5 seems to be stuck in a loop in step "battor.tough_video_cases ( 8 days 2 hrs ) battor.tough_video_cases", reporting this error for the past 8 days:
ERROR:root:Failed to get process data
Traceback (most recent call last):
  File "C:\b\c\b\Win_10_High_DPI_Perf__5_\src\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 108, in GetProcesses
    'name': p.name,
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_common.py", line 48, in __get__
    ret = self.func(instance)
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\__init__.py", line 341, in name
    name = self._platform_impl.get_process_name()
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 190, in wrapper
    return fun(self, *args, **kwargs)
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 222, in get_process_name
    return os.path.basename(self.get_process_exe())
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 194, in wrapper
    raise AccessDenied(self.pid, self._process_name)
AccessDenied: (pid=5060)
,
Sep 22 2016
Tester 1 also failing continuously with this: https://build.chromium.org/p/chromium.perf/builders/Win%2010%20High-DPI%20Perf%20%281%29/builds/221/steps/battor.power_cases/logs/stdio

CRITICAL:root:Finding return code for BattOr shell.
CRITICAL:root:Found return code: None
ERROR:root:Failed to get process data
Traceback (most recent call last):
  File "C:\b\c\b\Win_10_High_DPI_Perf__1_\src\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\cpu_tracing_agent.py", line 108, in GetProcesses
    'name': p.name,
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_common.py", line 48, in __get__
    ret = self.func(instance)
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\__init__.py", line 341, in name
    name = self._platform_impl.get_process_name()
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 190, in wrapper
    return fun(self, *args, **kwargs)
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 222, in get_process_name
    return os.path.basename(self.get_process_exe())
  File "C:\b\depot_tools\python276_bin\lib\site-packages\psutil\_psmswindows.py", line 194, in wrapper
    raise AccessDenied(self.pid, self._process_name)
AccessDenied: (pid=1480)

What is BattOr shell?
,
Sep 22 2016
Adding some BattOr folks: Charlie, Randy. Looks like these BattOrs may be wedged and need to be reset?
,
Sep 22 2016
re #12: An RC of None on a subprocess means the subprocess hasn't closed yet, so it's certainly a possibility that it's wedged.

re #11: The BattOr shell is how we communicate with the BattOr power monitoring device. That error is related to the cpu_tracing_agent, though, and not the battor_tracing_agent. They just happen to be next to each other.
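The point about a return code of None can be seen with Python's standard `subprocess` module. This is a minimal sketch, not the BattOr shell wrapper: the child here is just a sleeping Python process chosen for illustration.

```python
import subprocess
import sys

# Start a child that sleeps long enough to still be alive when polled.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

still_running = child.poll()  # None while the child has not exited
child.kill()                  # clean up so the sketch leaves no sleeper behind
exit_code = child.wait()      # blocks until the child is reaped; now an integer

print("while running:", still_running)
print("after kill:", exit_code)
```

A None result on its own only says the process is still alive; it does not distinguish a wedged process from one that is busy.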
,
Sep 22 2016
I think that there are a couple of levels of problems here (none of which involve BattOrs directly, I don't think, but some of which definitely involve BattOr benchmarks):

- The initial hang was caused by the "CPU tracing agent", which basically collects the results of the `top` command every 1s or so during a benchmark run. The Windows version of this apparently has a bug that makes it occasionally hang indefinitely (possibly due to permission problems?). This is being tracked here: https://bugs.chromium.org/p/chromium/issues/detail?id=647398. I disabled this agent today, so future builds shouldn't fail in the same way while we get this fixed.
- The second level of this: how is it even possible in Telemetry for an agent to hang indefinitely? I would definitely expect Telemetry to forcibly kill any benchmark that takes longer than some high deadline (1h, 2h, 3h, 6h, something like that). nednguyen@, do you know if we have any logic for this?

Fixing both of these problems is definitely necessary, but I'd say the second is potentially the more problematic.
,
Sep 22 2016
+nednguyen@ (I asked him a question in the last comment, but forgot to add him)
,
Sep 22 2016
vhang@ is having a tech check out Win 10 High-DPI Perf (1)'s slave build117-b1 to make sure the BattOr is connected correctly.
,
Sep 22 2016
That tech would be me. It looks connected correctly.
,
Sep 22 2016
We don't have any built-in timeout logic for Telemetry. Usually it's the recipe which sends SIGTERM/SIGKILL to the Telemetry process.
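The kind of deadline discussed above (terminate first, then kill if the process ignores it) could look roughly like this. It is a sketch under stated assumptions, not Telemetry or recipe code: `run_with_deadline`, its parameters, and the one-second deadline are hypothetical and chosen only for illustration.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_s, grace_s=5.0):
    """Hypothetical helper: run cmd and return (returncode, timed_out)."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=deadline_s), False
    except subprocess.TimeoutExpired:
        proc.terminate()  # SIGTERM on POSIX, TerminateProcess on Windows
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()   # SIGKILL on POSIX, for a process that ignored SIGTERM
            proc.wait()
        return proc.returncode, True


# Stand-in for a hung benchmark: a child that sleeps far past the deadline.
hung = [sys.executable, "-c", "import time; time.sleep(60)"]
code, timed_out = run_with_deadline(hung, deadline_s=1.0)
print("timed out:", timed_out)
```

A real deadline would be hours rather than seconds, as suggested earlier in the thread.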
,
Sep 22 2016
Someone had a question about this benchmark over Speed Infra chat last week, I think. I looked into it and found that the CPU tracing agent is spending an absurd amount of time spewing thousands of stack traces to the log. It's not hung, exactly. Because the code was written by an intern, nobody followed up on a fix. It should be somewhere around here: https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/internal/platform/tracing_agent/cpu_tracing_agent.py#L97
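One way to avoid that log volume would be to skip processes that cannot be inspected and emit a single summary line instead of one stack trace per failure. This is a sketch only, not the actual catapult fix: `AccessDenied`, `FakeProcess`, and `collect_process_names` are hypothetical stand-ins so that the example runs without psutil.

```python
import logging


class AccessDenied(Exception):
    """Hypothetical stand-in for the psutil error in the tracebacks above."""


class FakeProcess:
    """Hypothetical process object; name() raises for protected processes."""

    def __init__(self, pid, name=None):
        self.pid = pid
        self._name = name

    def name(self):
        if self._name is None:
            raise AccessDenied(self.pid)
        return self._name


def collect_process_names(processes):
    """Return (readable process records, pids that were skipped)."""
    records, skipped = [], []
    for p in processes:
        try:
            records.append({"pid": p.pid, "name": p.name()})
        except AccessDenied:
            skipped.append(p.pid)  # remember it, but do not log a traceback
    if skipped:
        # One summary line per sample instead of one stack trace per process.
        logging.warning("Skipped %d inaccessible processes", len(skipped))
    return records, skipped


records, skipped = collect_process_names(
    [FakeProcess(4, "chrome.exe"), FakeProcess(5060), FakeProcess(1480)])
print(records, skipped)
```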
,
Sep 22 2016
fwiw buildbot does indeed deal poorly with "tons of logs"
,
Sep 22 2016
Ken, one of the techs, checked out build117-b1. He said that the light should stay green with occasional yellow flashes, but build117-b1 would flash yellow and then die. dtu or sullivan, do either of you know what's going on? These things are new to us. Any idea what the problem is? Has the BattOr gone bad?
,
Sep 23 2016
charliea: please respond to #21. Is there a playbook for debugging BattOrs?
,
Sep 23 2016
Vince, could you please try to unplug and replug both cables going into the BattOr, and see if that resolves the problem?
,
Sep 23 2016
I'll do that the next time I'm in the lab.
,
Sep 23 2016
Done. Steady flashing orange LED every 2 seconds. No happy green LED. I think we have spares in the cabinet. Want me to swap it out?
,
Oct 3 2016
Assigning back to Charlie to triage next steps.
,
Oct 3 2016
Went ahead and added an entry to the troubleshooting section of the BattOr manual here: https://docs.google.com/document/d/1_IdRUB8GKux40GsF9herpBAkktO4UjPYDzXkm5T-PI8/edit#heading=h.gqcbq6sq5odv
,
Oct 24 2016
Ping. Does this need any more work?
,
Nov 15 2016
Charlie, is this fixed now?
,
Nov 15 2016
Yep, this is fixed.
Comment 1 by emso@chromium.org
, Sep 22 2016