Issue metadata

v8.browsing_desktop* failing on Win 10 High-DPI Perf

Issue description:
https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf

Disabling on Windows platforms.
Jan 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/8de943aa2a0ca22848c439c5c2f4ee54b472a76b

commit 8de943aa2a0ca22848c439c5c2f4ee54b472a76b
Author: Emily Hanley <eyaich@google.com>
Date: Thu Jan 25 16:30:44 2018

    Disabling v8.browsing_desktop and v8.browsing_desktop-future on windows.

    TBR=charliea@chromium.org
    NOTRY=true

    Bug: 805922
    Change-Id: I8da65a3eb71a3523f507805da980bca2077ab389
    Reviewed-on: https://chromium-review.googlesource.com/886653
    Reviewed-by: Emily Hanley <eyaich@chromium.org>
    Commit-Queue: Emily Hanley <eyaich@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#531913}

[modify] https://crrev.com/8de943aa2a0ca22848c439c5c2f4ee54b472a76b/tools/perf/expectations.config
Jan 25 2018
I don't have any idea what is going wrong. From the logs I don't see anything suspicious: there is no crash or any exceptions. I am not sure where to look for additional information, and I am not sure I am the right person to look into this. Could someone who knows more about this take over the bug, or provide me more information so that I can debug it?
Jan 25 2018
I think there were also similar timeouts on Mac: crbug.com/803457. If they are related, I can merge that bug into this one.
Jan 25 2018
Let me investigate a few things. I didn't look closely at the logs before disabling. The thing that seems fishy to me is that this timed out after pending for 2h 2m 39s and then ran for 28 minutes. 28 minutes of run time is definitely well within our perf time limits. I am curious why it was pending for so long. Our execution timeout is 3 hours and that is only ~2.5 hours. I will take a look at how frequently this is happening. Given it is happening on multiple platforms, it leads me to believe it is a scheduling/swarming issue rather than the test itself.
Jan 25 2018
So this has been timing out on and off for quite some time. My guess is that, given it takes an hour to run, we just have a bad sharding algorithm on the perf side that doesn't always allow this test to finish. Basically, as the recipe exists right now, we trigger all the benchmarks for a given bot and then wait for them all to finish in the queue. It looks like this one is waiting around for too long before it gets its execution slot. If you look at a successful run on January 19, it takes about an hour to run: https://chromium-swarm.appspot.com/task?id=3b29e676dd45ae10&refresh=10&request_detail=true&show_raw=1

On the failed runs it is not getting enough time within the 3 hour execution timeout. Ned, do you have any thoughts on a workaround before we get to per-story sharding later in 2018? This doesn't seem to be an issue on other Windows platforms, and given this is an issue on our side, should we consider re-enabling and taking the hit on Win 10 High-DPI Perf?
Jan 25 2018
I looked at the swarming log in https://chromium-swarm.appspot.com/task?id=3b452a16448dab10&refresh=10&show_raw=1

The last test it was running is:

[ RUN ] browse:news:nytimes@{'case': 'browse', 'group': 'news'}

I am guessing that the test "browse:news:nytimes" is stuck, which caused the I/O timeout (10 minutes). To test this hypothesis, I suggest running this test locally with the release Windows build. If this is true, disabling only "v8.browsing_desktop/browse:news:nytimes" may fix the timeout error.
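For reference, story disabling of this kind is done via an entry in tools/perf/expectations.config (the file touched by the disabling commit earlier in this thread). A sketch of what such an entry might look like — the exact tag name and syntax here are assumptions; check the file's header comment for the real format:

```
# Hypothetical entry; verify tag names against the existing file.
crbug.com/805922 [ Win ] v8.browsing_desktop/browse:news:nytimes [ Skip ]
```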
Jan 25 2018
It is not clear to me that it is hitting the I/O timeout, but I agree with Ned that it is a likely hypothesis, since the task is being killed before the 3 hour execution timeout. It isn't a consistent failure, but re-assigning to mythria@ to try to run locally and debug.
Feb 1 2018
The NextAction date has arrived: 2018-02-01
Feb 1 2018
Is it possible to bisect on the bot based on the timeout?
Feb 1 2018
gab@, fyi
Feb 1 2018
+simonhatch@ to comment on bisect options. Timeout is a clear failure that I think you can bisect on. I think the bigger issue is that there is no clear regression range this failed on, since it is a flaky timeout and is not failing consistently.
Feb 1 2018
+dtu. Yeah, Pinpoint should be able to bisect on failures like timeouts: go to the dashboard and choose "functional" when starting the bisect.
Feb 1 2018
I don't think Pinpoint has the same timeouts set as the perf waterfall. Should they be a 3 hour execution timeout and a 10 minute I/O timeout?
Feb 1 2018
In #9 I guessed that the test "browse:news:nytimes" is stuck, so we only need to bisect on that single test.
Feb 2 2018
Yes, on the perf side our timeouts are 3 hour execution and 10 minute I/O
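For context, an I/O timeout of this kind fires when the process is still alive but has produced no output for the given window, unlike the execution timeout, which bounds total run time. An illustrative sketch of such a watchdog (not the actual swarming implementation; function and names are hypothetical):

```python
# Illustrative watchdog, not the actual swarming implementation: kill a
# child process if it produces no output for `io_timeout` seconds.
import subprocess
import sys
import threading
import time

def run_with_io_timeout(cmd, io_timeout):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = [time.monotonic()]  # boxed so the reader thread can update it

    def pump():
        for line in proc.stdout:      # each line of output resets the clock
            last_output[0] = time.monotonic()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > io_timeout:
            proc.kill()               # alive but silent too long: I/O timeout
            proc.wait()
            return 'TIMED_OUT'
        time.sleep(0.05)
    return 'OK'

if __name__ == '__main__':
    # A child that stays silent for 2s against a 0.5s I/O timeout.
    silent = [sys.executable, '-c', 'import time; time.sleep(2)']
    print(run_with_io_timeout(silent, io_timeout=0.5))  # prints TIMED_OUT
```

This matches the behavior seen on the bots: a story that hangs, or a step that works silently for more than the window, gets the task killed even though it is nowhere near the execution limit.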
Feb 2 2018
See issue 808025 for minimal logs from dev win10 machine (hang in browse:news:cnn)
Feb 6 2018
Any updates here? Thanks!
Feb 6 2018
Sorry for the delay. I tried running this on Windows 10. I ran nytimes 5 times and it did not fail in any of those runs. I also tried running the entire benchmark; apart from flipboard, which crashed, the remaining stories ran. Flipboard was disabled even before this bug, so I don't think it would have caused this timeout.
Feb 6 2018
I get the same error as seen in crbug/808025, that a file cannot be deleted, but the test still completes:
[ RUN ] browse:news:nytimes@{'case': 'browse', 'group': 'news'}
(ERROR) 2018-02-06 15:37:13,644 tracing_controller_backend.CollectAgentTraceData:314 Error when deleting c:\users\chromium\appdata\local\temp\tmpc3_ycs, will try again at exit.
Traceback (most recent call last):
File "D:\mythria\chromium\src\third_party\catapult\telemetry\telemetry\internal\platform\tracing_controller_backend.py", line 310, in CollectAgentTraceData
os.remove(self._trace_log)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\chromium\\appdata\\local\\temp\\tmpc3_ycs'
(WARNING) 2018-02-06 15:40:19,368 timeline_based_measurement._ComputeTimelineBasedMetrics:311 Processing resulting traces took 156.968 seconds
[ OK ] browse:news:nytimes@{'case': 'browse', 'group': 'news'} (268075 ms)
Feb 6 2018
I looked at the recent failures on the bots for v8.browsing_desktop-future; it seems to be failing either on imgur (builds 1582, 1577) or on nytimes (1578, 1575). As mentioned in comment #7, the timeouts happen around 30 minutes of run time (the task was pending for almost 2.5 hours before it started executing), and it happens that either imgur or nytimes is executing at that point. The trace at the end looks as follows:

+-----------------------------------------------------------------------------+
| End of shard 0 |
| Pending: 8667.0s Duration: 1590.7s Bot: build119-b1 Exit: -1 TIMED_OUT |
+-----------------------------------------------------------------------------+
Total duration: 1590.7s
WARNING:root:collect_cmd had non-zero return code: -1
WARNING:root:No shard json files found in task_output_dir: 'c:\\users\\chrome~2\\appdata\\local\\temp\\tmpdrpphw'
Found ['c:\\users\\chrome~2\\appdata\\local\\temp\\tmpdrpphw\\summary.json']
step returned non-zero exit code: -1
Feb 6 2018
I am assigning it to eyaich@ again, since locally this seems to be working fine. I am also running the entire benchmark with a repeat count of 5 and will update with the results tomorrow (it takes 5-6 hours to finish). I am happy to provide any further information from the local runs.
Feb 6 2018
eyaich@ was just doing sheriffing work, so it's not right to reassign this bug to her. If you have trouble reproducing this bug locally, I would suggest either: i) adding more logging to figure out why it's failing on the bot, or ii) testing these stories with a Pinpoint tryjob to see if they are still failing; if not, make a CL to re-enable it.
Feb 6 2018
I am sorry about that. From comments #7 and #8, I thought eyaich@ might take a look into it if the local runs did not show a problem.
Feb 6 2018
Ned: I'm very unclear on what is likely to be happening in this bug. Comments #7, #8 and #25 seem to suggest that the bot is stuck in a pending state for almost 2.5 hours, after which it happens to time out after actually running for 30 minutes. If this is the case, isn't this a swarming/sharding issue? I.e., it doesn't matter what page it times out on once it starts running, since it is never going to finish in the 30 minutes it has remaining after 2.5 hours of pending.
Feb 6 2018
Ok, sorry about the confusion. To clarify the timeouts: we actually have a 3 hour expiration timeout, which is the pending time, i.e. a job can sit around for up to three hours before it gets scheduled. Once it gets scheduled, we have a hard timeout of 10 hours, which is the length of time it has to execute. Therefore, I think this test must be hitting the I/O timeout, since it is nowhere near the 10 hour time limit.
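Putting the three limits described in this comment together with the numbers from the failing shard in #25 (Pending: 8667.0s, Duration: 1590.7s), a small sketch of which limit a task trips — purely a model of the description above, not swarming code:

```python
# Purely illustrative model of the three limits described in this thread;
# not swarming code.
EXPIRATION_S = 3 * 3600     # max pending time before the task expires
HARD_TIMEOUT_S = 10 * 3600  # max execution time once scheduled
IO_TIMEOUT_S = 10 * 60      # max silence on output during execution

def which_timeout(pending_s, running_s, longest_silence_s):
    if pending_s > EXPIRATION_S:
        return 'EXPIRED'        # never picked up by a bot in time
    if running_s > HARD_TIMEOUT_S:
        return 'HARD_TIMEOUT'
    if longest_silence_s > IO_TIMEOUT_S:
        return 'IO_TIMEOUT'     # alive, but silent for too long
    return 'OK'

# The failing shard from #25: pending 8667.0s, ran 1590.7s, and the last
# story was silent for over 10 minutes.
print(which_timeout(8667.0, 1590.7, 11 * 60))  # → IO_TIMEOUT
```

Under this model the shard is comfortably inside both the expiration and hard limits, which is consistent with the conclusion that only the I/O timeout can be firing.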
Feb 6 2018
I talked with eyaich@ offline about this bug. Our recommended next step is: re-enable this benchmark on "Win 10 High-DPI Perf", but disable the last story that was run in the benchmark. If that makes the benchmark stop failing, we have confirmed that the test hit the I/O timeout due to that single story. Meanwhile, eyaich@ also discovered some weird timing in the swarming log. She will file another bug with the swarming infra folks about this.
Feb 7 2018
Thanks Ned and Emily. From bug crbug.com/809644 it seems as if the actual problem is with imgur. It times out while processing the page (which takes about 629 seconds), per the following warning:

(WARNING) 2018-01-25 02:35:09,404 timeline_based_measurement._ComputeTimelineBasedMetrics:310 Processing resulting traces took 629.867 seconds

I will disable this page and re-enable this benchmark on Windows. This page is already disabled on Linux (crbug.com/788796) because the trace is too large to process. It appears as a different error, but the root cause is the same.
Feb 7 2018
Oh wow, great catch! +Charlie: I think we would need to add the heartbeat to trace processing. See #35.
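A heartbeat of the kind mentioned here could look roughly like the sketch below — a hypothetical Python context manager (not the actual catapult change) that logs on a fixed interval while a long, otherwise-silent step runs, so the I/O timeout is never tripped by silence:

```python
# Hypothetical heartbeat, not the actual catapult change: log on a fixed
# interval while a long, otherwise-silent step runs.
import contextlib
import logging
import threading

@contextlib.contextmanager
def heartbeat(message, interval_s=60):
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval_s):
            logging.warning(message)

    beater = threading.Thread(target=beat, daemon=True)
    beater.start()
    try:
        yield
    finally:
        stop.set()
        beater.join()

# Usage sketch: wrap the silent step so swarming keeps seeing output.
# with heartbeat('Still processing traces...'):
#     results = _ComputeTimelineBasedMetrics(...)   # hypothetical call site
```

The commented-out call site is an assumption based on the function name in the logs above; the real fix would live wherever trace processing actually blocks.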
Feb 7 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/18f39d4fedef58be7787f95a0412b00acafadc6b

commit 18f39d4fedef58be7787f95a0412b00acafadc6b
Author: Mythri Alle <mythria@chromium.org>
Date: Wed Feb 07 13:24:34 2018

    Enable v8.browsing_desktop on Windows and disable imgur story.

    Imgur page is timing out because of an I/O timeout while processing a
    large trace. This also happens on Mac sometimes. So disable this page
    and enable the benchmark on windows.

    Bug: chromium:805922
    Change-Id: Ifea590099720c6ee19632d09c664e19af8445089
    Reviewed-on: https://chromium-review.googlesource.com/905604
    Reviewed-by: Ned Nguyen <nednguyen@google.com>
    Commit-Queue: Ned Nguyen <nednguyen@google.com>
    Cr-Commit-Position: refs/heads/master@{#534993}

[modify] https://crrev.com/18f39d4fedef58be7787f95a0412b00acafadc6b/tools/perf/expectations.config
Feb 7 2018
I'm not convinced that this has to do with the trace processing that I was talking about adding a heartbeat for.
Generally, the log for every story looks something like:
(INFO) 2018-02-06 06:45:02,315 ts_proxy_server._IssueCommand:97 Issuing command to ts_proxy_server: set rtt 0
(INFO) 2018-02-06 06:45:02,316 ts_proxy_server._IssueCommand:97 Issuing command to ts_proxy_server: set inkbps 0
(INFO) 2018-02-06 06:45:02,318 ts_proxy_server._IssueCommand:97 Issuing command to ts_proxy_server: set outkbps 0
(INFO) 2018-02-06 06:45:02,319 cache_temperature.EnsurePageCacheTemperature:181 PageCacheTemperature: any
(INFO) 2018-02-06 06:46:03,244 chrome_tracing_agent.RecordClockSyncMarker:183 Chrome version: 3342
(ERROR) 2018-02-06 06:46:12,482 tracing_controller_backend.CollectAgentTraceData:314 Error when deleting c:\b\s\w\itj_83w8\tmpzgkdu5, will try again at exit.
Traceback (most recent call last):
File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_controller_backend.py", line 310, in CollectAgentTraceData
os.remove(self._trace_log)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\b\\s\\w\\itj_83w8\\tmpzgkdu5'
(INFO) 2018-02-06 06:46:12,502 trace_data.Serialize:191 Trace sizes in bytes: {'traceEvents': 160622417L, 'telemetry': 122499L, 'tabIds': 36L}
(WARNING) 2018-02-06 06:47:57,813 timeline_based_measurement._ComputeTimelineBasedMetrics:311 Processing resulting traces took 89.737 seconds
However, the last line of the story that failed most recently, browse:news:nytimes, is:
(INFO) 2018-02-06 07:10:54,279 cache_temperature.EnsurePageCacheTemperature:181 PageCacheTemperature: any
To me, this would suggest that the next log line that we should see is something like:
(INFO) 2018-02-06 06:46:03,244 chrome_tracing_agent.RecordClockSyncMarker:183 Chrome version: 3342
However, we never see that line when the test fails.
EnsurePageCacheTemperature is only called in one place: shared_page_state.WillRunStory (http://bit.ly/2EmsKiR), which is called before a story is run. The next log line I'd expect to see is the RecordClockSyncMarker one.
RecordClockSyncMarker is called through tracing_controller_backend.StopTracing(), and the fact that the log line is missing when the I/O timeout happens suggests to me that, for some reason, the story is never reaching the point where StopTracing() is called. At the point where we're hitting the I/O timeout, the trace hasn't even been streamed back from Chrome to Telemetry, so I don't think that adding a heartbeat during trace processing will have any effect.
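The pattern described here — a story whose EnsurePageCacheTemperature line is never followed by a RecordClockSyncMarker line — can be checked mechanically. A hypothetical helper (not part of Telemetry) sketching that scan over raw log lines:

```python
# Hypothetical log scanner, not part of Telemetry: count story runs that
# set the page cache temperature but never reached RecordClockSyncMarker,
# i.e. never got to StopTracing().
def stuck_before_stop_tracing(log_lines):
    pending = False   # saw EnsurePageCacheTemperature, awaiting the marker
    stuck = 0
    for line in log_lines:
        if 'EnsurePageCacheTemperature' in line:
            if pending:
                stuck += 1    # previous story never reached StopTracing
            pending = True
        elif 'RecordClockSyncMarker' in line:
            pending = False
    return stuck + (1 if pending else 0)

log = [
    "cache_temperature.EnsurePageCacheTemperature:181 PageCacheTemperature: any",
    "chrome_tracing_agent.RecordClockSyncMarker:183 Chrome version: 3342",
    "cache_temperature.EnsurePageCacheTemperature:181 PageCacheTemperature: any",
    # no RecordClockSyncMarker: this story hung before StopTracing
]
print(stuck_before_stop_tracing(log))  # → 1
```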
Feb 7 2018
From comment #2 on crbug.com/809644, we may output more data after we kill the process because of a timeout. If I understand the comment correctly, nytimes didn't get a chance to start properly, and the I/O timeout actually happened on the page before it.
Feb 7 2018
Ah okay, I missed that context. Thanks for the help, mythria@. I'll make an effort to add this heartbeat today.
Feb 7 2018
Thanks Charlie. If that's too hard, we can consider bumping the I/O timeout to 20 minutes as a band-aid fix. Meanwhile, I will file a bug about trace processing for the imgur story taking 10+ minutes, to see if we can do something about it.
Feb 7 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/1390163857ff2a4cff7b64888ded712a39b8dcd9

commit 1390163857ff2a4cff7b64888ded712a39b8dcd9
Author: Ned Nguyen <nednguyen@google.com>
Date: Wed Feb 07 19:55:47 2018

    Increase benchmark I/O timeout to 20 minutes

    This is a bandaid fix to address the problem that some story's trace
    processing takes more than 10 minutes.

    crbug.com/810060 is opened for tracking reducing the time of trace
    processing.

    Bug: 805922
    Change-Id: Id7b8a860f26460d49901ef4ebe0fee536b0029fd
    NOTRY=true  # All testings are covered by PRESUBMIT
    Reviewed-on: https://chromium-review.googlesource.com/907313
    Commit-Queue: Ned Nguyen <nednguyen@google.com>
    Reviewed-by: Emily Hanley <eyaich@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#535109}

[modify] https://crrev.com/1390163857ff2a4cff7b64888ded712a39b8dcd9/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/1390163857ff2a4cff7b64888ded712a39b8dcd9/tools/perf/core/perf_data_generator.py
Feb 8 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/776028322bca8bcdb0c2900a18330f3b48e683ea

commit 776028322bca8bcdb0c2900a18330f3b48e683ea
Author: calamity <calamity@chromium.org>
Date: Thu Feb 08 00:28:13 2018

    Revert "Increase benchmark I/O timeout to 20 minutes"

    This reverts commit 1390163857ff2a4cff7b64888ded712a39b8dcd9.

    Reason for revert: Broke telemetry_perf_unittests
    https://uberchromegw.corp.google.com/i/chromium.chromiumos/builders/linux-chromeos-rel/builds/5034

    Original change's description:
    > Increase benchmark I/O timeout to 20 minutes
    >
    > This is a bandaid fix to address the problem that some story whose trace processing
    > takes more than 10 minutes to process.
    >
    > crbug.com/810060 is opened for tracking reducing the time of trace processing.
    >
    > Bug:805922
    > Change-Id: Id7b8a860f26460d49901ef4ebe0fee536b0029fd
    >
    > NOTRY=true # All testings are covered by PRESUBMIT
    >
    > Change-Id: Id7b8a860f26460d49901ef4ebe0fee536b0029fd
    > Reviewed-on: https://chromium-review.googlesource.com/907313
    > Commit-Queue: Ned Nguyen <nednguyen@google.com>
    > Reviewed-by: Emily Hanley <eyaich@chromium.org>
    > Cr-Commit-Position: refs/heads/master@{#535109}

    TBR=nednguyen@google.com,eyaich@chromium.org

    Change-Id: I4c9d1101e38f5d4089335cb123ebc970b806e0d8
    No-Presubmit: true
    No-Tree-Checks: true
    No-Try: true
    Bug: 805922
    Reviewed-on: https://chromium-review.googlesource.com/907908
    Reviewed-by: calamity <calamity@chromium.org>
    Commit-Queue: calamity <calamity@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#535212}

[modify] https://crrev.com/776028322bca8bcdb0c2900a18330f3b48e683ea/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/776028322bca8bcdb0c2900a18330f3b48e683ea/tools/perf/core/perf_data_generator.py
Feb 8 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/079b7094d88e036de97f8947d49b17130da94931

commit 079b7094d88e036de97f8947d49b17130da94931
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Feb 08 11:35:21 2018

    Increase benchmark I/O timeout to 20 minutes

    This is a reland of https://chromium-review.googlesource.com/907313

    Bug:805922
    TBR=eyaich@chromium.org
    Change-Id: I68da5215880b7bce27462b24aacae8cfb1e3f84a
    Reviewed-on: https://chromium-review.googlesource.com/907859
    Commit-Queue: Ned Nguyen <nednguyen@google.com>
    Reviewed-by: Ned Nguyen <nednguyen@google.com>
    Cr-Commit-Position: refs/heads/master@{#535355}

[modify] https://crrev.com/079b7094d88e036de97f8947d49b17130da94931/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/079b7094d88e036de97f8947d49b17130da94931/tools/perf/core/perf_data_generator.py
[modify] https://crrev.com/079b7094d88e036de97f8947d49b17130da94931/tools/perf/core/perf_data_generator_unittest.py
Feb 8 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/4284049eaca9f395a64171d9a1d655690453328f

commit 4284049eaca9f395a64171d9a1d655690453328f
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Feb 08 11:40:27 2018

    Reenable v8.browsing_desktop.*/browse:media:imgur test

    We have bumped up the I/O timeout to 20 minutes, so these tests may be
    safe to reenable.

    Bug:805922
    Change-Id: Idc15471ed3f32bddd0f1349d899e9e8de6bfbf48
    NOTRY=true  # All the testings are covered by PRESUBMIT
    Reviewed-on: https://chromium-review.googlesource.com/907332
    Reviewed-by: Mythri Alle <mythria@chromium.org>
    Commit-Queue: Ned Nguyen <nednguyen@google.com>
    Cr-Commit-Position: refs/heads/master@{#535358}

[modify] https://crrev.com/4284049eaca9f395a64171d9a1d655690453328f/tools/perf/expectations.config
Feb 13 2018
I think we are no longer seeing timeouts on Windows. Can we close this bug now?
Mar 8 2018
Comment 1 by eyaich@chromium.org, Jan 25 2018