New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 805922 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: 2018-02-01
OS: Windows
Pri: 1
Type: Bug

Blocked on:
issue 809644



Sign in to add a comment

v8.browsing_desktop* failing on Win 10 High-DPI Perf

Project Member Reported by eyaich@chromium.org, Jan 25 2018

Issue description

Comment 1 by eyaich@chromium.org, Jan 25 2018

Most recent failing build: 

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf/builds/1539


Links to logs of most recent failing builds: 

v8.browsing_desktop-future is completed but failed: 
https://chromium-swarm.appspot.com/task?id=3b452a1a54ec4210&refresh=10&show_raw=1

Disabling browse:media:flickr_infinite_scroll* story here

v8.browsing_desktop is timing out: 
https://chromium-swarm.appspot.com/task?id=3b452a16448dab10&refresh=10&show_raw=1

Since there is no clear culprit, disabling the entire test.


Labels: v8-benchmarks
Project Member

Comment 3 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8de943aa2a0ca22848c439c5c2f4ee54b472a76b

commit 8de943aa2a0ca22848c439c5c2f4ee54b472a76b
Author: Emily Hanley <eyaich@google.com>
Date: Thu Jan 25 16:30:44 2018

Disabling v8.browsing_desktop and v8.browsing_desktop-future on windows.

TBR=charliea@chromium.org
NOTRY=true

Bug:  805922 
Change-Id: I8da65a3eb71a3523f507805da980bca2077ab389
Reviewed-on: https://chromium-review.googlesource.com/886653
Reviewed-by: Emily Hanley <eyaich@chromium.org>
Commit-Queue: Emily Hanley <eyaich@chromium.org>
Cr-Commit-Position: refs/heads/master@{#531913}
[modify] https://crrev.com/8de943aa2a0ca22848c439c5c2f4ee54b472a76b/tools/perf/expectations.config

Cc: hablich@chromium.org
NextAction: 2018-02-01
I don't have any idea on what is going wrong. From the logs I don't see anything suspicious. There is no crash or any exceptions. I am not sure where to look for additional information. I am not sure, I am the right person to look into this. Is there anyone else who knows more into this take over this bug or provide me more information so that I could debug. 
Cc: nedngu...@google.com
I think there were also similar timeouts on Mac: crbug.com/803457. If they are related I can merge the bug into this.

Comment 7 by eyaich@chromium.org, Jan 25 2018

Cc: mar...@chromium.org mythria@chromium.org
Owner: eyaich@chromium.org
Let me investigate a few things.  I didn't look closely at the logs before disabling. 

The thing that seems fishy to me is that this timed out after pending for 2h2m 39s and then ran for 28 minutes.  28 minutes of run time is definitely well within our perf time limits.  I am curious why it was pending for so long.  

Our execution timeout is 3 hours and that is only ~2.5 hours.   I will take a look at how frequently this is happening.  Given it is happening on multiple platforms it leads me to believe it is a scheduling/swarming issue rather than the test itself.  

Comment 8 by eyaich@chromium.org, Jan 25 2018

So this has been timing out on and off for quite some time.  

My guess is that given it takes an hour to run, we just have a bad sharding algorithm on the perf side that doesn't always allow this test to finish. Basically as the recipe exists right now, we trigger all the benchmarks for a given bot and then wait for them all to finish in the queue.  It looks like this one is waiting around for too long before it gets its execution slot.   

If you look at a succesful run on January 19 it takes about an hour to run:
https://chromium-swarm.appspot.com/task?id=3b29e676dd45ae10&refresh=10&request_detail=true&show_raw=1

On the failed tests it is not getting enough time with the 3 hour execution timeout.  

Ned do you have any thoughts on a workaround before we get to per story sharding later in 2018? This doesn't seem to be an issue on other windows platforms and given this is an issue on our side, should we consider re-enabling and taking the hit on Win 10 High-DPI Perf?



I look at the swarming log in
https://chromium-swarm.appspot.com/task?id=3b452a16448dab10&refresh=10&show_raw=1

The last test it was running is:
[ RUN      ] browse:news:nytimes@{'case': 'browse', 'group': 'news'}


I am guessing that the test "browse:news:nytimes" is stuck, which cause the I/O timed out (10 minutes). 

To test this hypothesis, I suggest running this test locally with the release windows build. 

If this is true, disabling only "v8.browsing_desktop/browse:news:nytimes" may fix the timeout error 
Cc: -mythria@chromium.org eyaich@chromium.org
Owner: mythria@chromium.org
It is not clear to me that it is hitting the I/O timeout but I agree with Ned that it is a likely hypothesis since it is being killed before the 3 hour execution timeout.  

It isn't a consistent failure but re-assigning to mythria@ to try and run locally and debug.  
The NextAction date has arrived: 2018-02-01

Comment 12 by u...@chromium.org, Feb 1 2018

Is is possible to bisect on the bot based on timeout?

Comment 13 by u...@chromium.org, Feb 1 2018

Cc: gab@chromium.org
gab@, fyi

Comment 14 by u...@chromium.org, Feb 1 2018

Cc: hpayer@chromium.org
 Issue 808025  has been merged into this issue.
Cc: simonhatch@chromium.org
+ simonhatch@ to comment on bisect options more 

Timeout is a clear failure that I think you can bisect on.  I think the bigger issue is there is no clear regression range that this failed on since it is a flaky timeout and is not failing consistently
Cc: dtu@chromium.org
+dtu

Yeah Pinpoint should be able to bisect on failures like timeout, go to the dashboard and choose "functional" when starting the bisect.

Comment 17 by dtu@chromium.org, Feb 1 2018

I don't think Pinpoint has the same timeouts set as the perf waterfall. Should they be 3 hours execution timeout and 10 minutes io timeout?
In #9, I am guessing test "browse:news:nytimes" is stuck, so we only need to bisect on that single test
Yes, on the perf side our timeouts are 3 hour execution and 10 minute I/O

Comment 20 by gab@chromium.org, Feb 2 2018

See  issue 808025  for minimal logs from dev win10 machine (hang in browse:news:cnn)
Status: Assigned (was: Untriaged)

Comment 22 by gab@chromium.org, Feb 6 2018

Any updates here? Thanks!
Sorry for the delay. I tried running this on windows 10. I ran newyork times for  5 times and it did not fail in any of those runs. I also tried running the entire benchmark, apart from flipboard which crashed the remaining stories ran. Flipboard was disabled even before this bug, so I don't think that would have caused this timeout. 
I get the same error that a file cannot be delete as seen in crbug/808025, but the test still completes. 

[ RUN      ] browse:news:nytimes@{'case': 'browse', 'group': 'news'}
(ERROR) 2018-02-06 15:37:13,644 tracing_controller_backend.CollectAgentTraceData:314  Error when deleting c:\users\chromium\appdata\local\temp\tmpc3_ycs, will try again at exit.
Traceback (most recent call last):
  File "D:\mythria\chromium\src\third_party\catapult\telemetry\telemetry\internal\platform\tracing_controller_backend.py", line 310, in CollectAgentTraceData
    os.remove(self._trace_log)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\chromium\\appdata\\local\\temp\\tmpc3_ycs'
(WARNING) 2018-02-06 15:40:19,368 timeline_based_measurement._ComputeTimelineBasedMetrics:311  Processing resulting traces took 156.968 seconds
[       OK ] browse:news:nytimes@{'case': 'browse', 'group': 'news'} (268075 ms)
I looked at the recent failures on the bots for v8.browsing.desktop-future, it seems to be failing either on imgur (builds 1582, 1577) or on nytimes (1578, 1575). As mentioned in comment #7, the timeouts happen around 30min (it was pending for almost 2.5hrs before it started executing) and I think it happens that either imgur or nytimes are executing at that time. The trace at the end looks as follows: 

+-----------------------------------------------------------------------------+
| End of shard 0                                                              |
|  Pending: 8667.0s  Duration: 1590.7s  Bot: build119-b1  Exit: -1  TIMED_OUT |
+-----------------------------------------------------------------------------+
Total duration: 1590.7s
WARNING:root:collect_cmd had non-zero return code: -1
WARNING:root:No shard json files found in task_output_dir: 'c:\\users\\chrome~2\\appdata\\local\\temp\\tmpdrpphw'
Found ['c:\\users\\chrome~2\\appdata\\local\\temp\\tmpdrpphw\\summary.json']
step returned non-zero exit code: -1

Owner: eyaich@chromium.org
I am assigning it to eyaich@ again, since locally this seems to be working fine. I am also running the entire benchmark with a repeat-count of 5 and I will update the results tomorrow. (It takes 5-6 hours for it to finish). I am happy to provide any further information from the local runs. 
Owner: mythria@chromium.org
eyaich@ was just doing sheriffing work, so it's not right to reassign this bug to her.

If you have trouble reducing this bug locally, I would suggest doing either:
i) Add more logging to figure out why it's failing on bot.
ii) Test this stories with pinpoint tryjob & see if it's still failing. If not, make CL to reeenable it.
I am sorry about that. From comment #7 and #8, I thought eyaich@ may take a look into it if the local runs did not show a problem.
Cc: rmcilroy@chromium.org
Ned: I'm very unclear on what is likely to be happening on this bug. From the comments on #7, #8 and #25 seem to suggest that the bot is stuck in a pending state for almost 2.5 hours, after which it happens to time out after actually running for 30 minutes. 

If this is the case isn't this a swarming / sharding issue (i.e., it doesn't matter what page it times out on once it starts running since it is never going to finish in the 30minutes it has remaining after 2.5 hours pending. 

Ok sorry about the confusion. 

To clarify timeouts.  We actually have a 3 hour expiration timeout which is the pending time.  ie a job can sit around before it gets scheduled for up to three hours.

Once it gets schedules we have a hard timeout of 10 hours, which is the length of time it has to execute once it gets scheduled.  

Therefore, I think this test must be hitting the I/O timeout since it is no where near the 10 hour time limit.  
I talked with eyaich@ offline about this bug. Our recommended next step are:

Reenable this benchmark on "Win 10 High-DPI Perf", but disable the last story that was run in the benchmark. If that makes the benchmark stop failing, we have confirmed that the test was I/O timed out due to that single story.

Meanwhile, eyaich@ also discovered some weird timing in the swarming log. She will file another bug with swarming infra folks about this.
Cc: -nedngu...@google.com nednguyen@chromium.org
Blockedon: 809644
Thanks ned and emily. From the bug crubug/809644 it seems as if the actual problem is with imgur. It timesout while processing the page (it takes about ~629 seconds) from the following warning:

(WARNING) 2018-01-25 02:35:09,404 timeline_based_measurement._ComputeTimelineBasedMetrics:310  Processing resulting traces took 629.867 seconds

I will disable this page and reenable this benchmark on windows. This page is already disabled on linux ( crbug.com/788796 ) because the trace is too large to process. It appears as a different error but the root cause is the same.
Cc: charliea@chromium.org
Oh wow, great catch!

+Charlie: I think we would need to add the heartbeat to trace processing. See #35
Project Member

Comment 37 by bugdroid1@chromium.org, Feb 7 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/18f39d4fedef58be7787f95a0412b00acafadc6b

commit 18f39d4fedef58be7787f95a0412b00acafadc6b
Author: Mythri Alle <mythria@chromium.org>
Date: Wed Feb 07 13:24:34 2018

Enable v8.browsing_desktop on Windows and disable imgur story.

Imgur page is timing out because of an I/O timeout while processing
a large trace. This also happens on Mac sometimes. So disable this
page and enable the benchmark on windows.

Bug:  chromium:805922 
Change-Id: Ifea590099720c6ee19632d09c664e19af8445089
Reviewed-on: https://chromium-review.googlesource.com/905604
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#534993}
[modify] https://crrev.com/18f39d4fedef58be7787f95a0412b00acafadc6b/tools/perf/expectations.config

I'm not convinced that this has to do with the trace processing that I was talking about adding a heartbeat for.

Generally, the log for every story looks something like:

(INFO) 2018-02-06 06:45:02,315 ts_proxy_server._IssueCommand:97  Issuing command to ts_proxy_server: set rtt 0
(INFO) 2018-02-06 06:45:02,316 ts_proxy_server._IssueCommand:97  Issuing command to ts_proxy_server: set inkbps 0
(INFO) 2018-02-06 06:45:02,318 ts_proxy_server._IssueCommand:97  Issuing command to ts_proxy_server: set outkbps 0
(INFO) 2018-02-06 06:45:02,319 cache_temperature.EnsurePageCacheTemperature:181  PageCacheTemperature: any
(INFO) 2018-02-06 06:46:03,244 chrome_tracing_agent.RecordClockSyncMarker:183  Chrome version: 3342
(ERROR) 2018-02-06 06:46:12,482 tracing_controller_backend.CollectAgentTraceData:314  Error when deleting c:\b\s\w\itj_83w8\tmpzgkdu5, will try again at exit.
Traceback (most recent call last):
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_controller_backend.py", line 310, in CollectAgentTraceData
    os.remove(self._trace_log)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\b\\s\\w\\itj_83w8\\tmpzgkdu5'
(INFO) 2018-02-06 06:46:12,502 trace_data.Serialize:191  Trace sizes in bytes: {'traceEvents': 160622417L, 'telemetry': 122499L, 'tabIds': 36L}
(WARNING) 2018-02-06 06:47:57,813 timeline_based_measurement._ComputeTimelineBasedMetrics:311  Processing resulting traces took 89.737 seconds

However, the last line of the story that failed most recently, browse:news:nytimes, is:

(INFO) 2018-02-06 07:10:54,279 cache_temperature.EnsurePageCacheTemperature:181  PageCacheTemperature: any

To me, this would suggest that the next log line that we should see is something like:

(INFO) 2018-02-06 06:46:03,244 chrome_tracing_agent.RecordClockSyncMarker:183  Chrome version: 3342

However, we never see that line when the test fails.

EnsurePageCacheTemperature is only called in one place: shared_page_state.WillRunStory (http://bit.ly/2EmsKiR), which is called before a story is run. The next log line I'd expect to see, 

The RecordClockSyncMarker is called through tracing_controller_backend.StopTracing(), and the fact that the log is missing when the I/O timeout happens suggests to me that, for some reason, the story is never reaching the point where StopTracing() is called. At the point where we're hitting the I/O timeout, the trace hasn't even been streamed back from Chrome to Telemetry, so I don't think that adding a heartbeat during trace processing will have any effect.
From comment #2 on  crbug.com/809644 , we may output more data after we kill the process because of a timeout. If I understand the comment correctly, nytimes didn't get a chance to start properly and the I/O timeout actually happened on the page before.
Ah okay, I missed that context. Thanks for the help, mythria@.

I'll make an effort to add this heartbeat today.
Thanks Charlie, if that's too hard, we can consider bumping the I/O timeout to 20 minutes as band-aid fix.

Mean while, I will file a bug about trace processing imgur story taking 10+ minutes to see if we can do something about it.
Project Member

Comment 42 by bugdroid1@chromium.org, Feb 7 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1390163857ff2a4cff7b64888ded712a39b8dcd9

commit 1390163857ff2a4cff7b64888ded712a39b8dcd9
Author: Ned Nguyen <nednguyen@google.com>
Date: Wed Feb 07 19:55:47 2018

Increase benchmark I/O timeout to 20 minutes

This is a bandaid fix to address the problem that some story whose trace processing
takes more than 10 minutes to process.

crbug.com/810060 is opened for tracking reducing the time of trace processing.

Bug:805922
Change-Id: Id7b8a860f26460d49901ef4ebe0fee536b0029fd

NOTRY=true  # All testings are covered by PRESUBMIT

Change-Id: Id7b8a860f26460d49901ef4ebe0fee536b0029fd
Reviewed-on: https://chromium-review.googlesource.com/907313
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Emily Hanley <eyaich@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535109}
[modify] https://crrev.com/1390163857ff2a4cff7b64888ded712a39b8dcd9/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/1390163857ff2a4cff7b64888ded712a39b8dcd9/tools/perf/core/perf_data_generator.py

Project Member

Comment 43 by bugdroid1@chromium.org, Feb 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/776028322bca8bcdb0c2900a18330f3b48e683ea

commit 776028322bca8bcdb0c2900a18330f3b48e683ea
Author: calamity <calamity@chromium.org>
Date: Thu Feb 08 00:28:13 2018

Revert "Increase benchmark I/O timeout to 20 minutes"

This reverts commit 1390163857ff2a4cff7b64888ded712a39b8dcd9.

Reason for revert: Broke telemetry_perf_unittests
https://uberchromegw.corp.google.com/i/chromium.chromiumos/builders/linux-chromeos-rel/builds/5034

Original change's description:
> Increase benchmark I/O timeout to 20 minutes
> 
> This is a bandaid fix to address the problem that some story whose trace processing
> takes more than 10 minutes to process.
> 
> crbug.com/810060 is opened for tracking reducing the time of trace processing.
> 
> Bug:805922
> Change-Id: Id7b8a860f26460d49901ef4ebe0fee536b0029fd
> 
> NOTRY=true  # All testings are covered by PRESUBMIT
> 
> Change-Id: Id7b8a860f26460d49901ef4ebe0fee536b0029fd
> Reviewed-on: https://chromium-review.googlesource.com/907313
> Commit-Queue: Ned Nguyen <nednguyen@google.com>
> Reviewed-by: Emily Hanley <eyaich@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#535109}

TBR=nednguyen@google.com,eyaich@chromium.org

Change-Id: I4c9d1101e38f5d4089335cb123ebc970b806e0d8
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  805922 
Reviewed-on: https://chromium-review.googlesource.com/907908
Reviewed-by: calamity <calamity@chromium.org>
Commit-Queue: calamity <calamity@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535212}
[modify] https://crrev.com/776028322bca8bcdb0c2900a18330f3b48e683ea/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/776028322bca8bcdb0c2900a18330f3b48e683ea/tools/perf/core/perf_data_generator.py

Project Member

Comment 44 by bugdroid1@chromium.org, Feb 8 2018

Project Member

Comment 45 by bugdroid1@chromium.org, Feb 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4284049eaca9f395a64171d9a1d655690453328f

commit 4284049eaca9f395a64171d9a1d655690453328f
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Feb 08 11:40:27 2018

Reenable v8.browsing_desktop.*/browse:media:imgur test

We have bumped up the I/O timeout to 20 minutes, so these tests may be safe to
reenable.

Bug:805922
Change-Id: Idc15471ed3f32bddd0f1349d899e9e8de6bfbf48

NOTRY=true # All the testings are covered by PRESUBMIT

Change-Id: Idc15471ed3f32bddd0f1349d899e9e8de6bfbf48
Reviewed-on: https://chromium-review.googlesource.com/907332
Reviewed-by: Mythri Alle <mythria@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#535358}
[modify] https://crrev.com/4284049eaca9f395a64171d9a1d655690453328f/tools/perf/expectations.config

I think we are no longer seeing timeouts on windows. Can we close this bug now?
Status: Fixed (was: Assigned)

Sign in to add a comment