New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 809644 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 805922



Sign in to add a comment

Swarming I/O Timeout for perf test issue

Project Member Reported by eyaich@chromium.org, Feb 6 2018

Issue description

We have a task that seems to be hitting the I/O timeout and we are unable to see how that could be possible from the logs.  It is timing out after about a half hour which means it is not hitting the hard timeout, but it is expiring.  

If you look at the end of this log: 

https://chromium-swarm.appspot.com/task?id=3b452a16448dab10&refresh=10&show_raw=1

You will see the last timestamp is at 2:35 which is exactly when the task completed. 

I am assuming that I/O timeout means we haven't heard from the task in > 10 mins (that is what our current I/O timeout is set to) but this seems to be killed mid-execution. 

Marc-Antoine, can you clarify further what I/O timeout means?  This seems like a bug in swarming execution rather than this test since it is not clear how it is hitting the timeout.  
 
Blocking: 805922
Clarifying a few things;
- The task is not expiring, it times out.
- It is effectively an I/O timeout.


The type of time out that was enforced is not surfaced at the task result level. To be clear, it's effectively an I/O timeout. Since the time out is of 10 minutes, and the task was abandoned at 03:35, one can look at output at around 02:25:

---
(INFO) 2018-01-25 02:24:17,750 trace_data.Serialize:191  Trace sizes in bytes: {'traceEvents': 310689286L, 'telemetry': 205918L, 'tabIds': 36L}
(WARNING) 2018-01-25 02:35:09,404 timeline_based_measurement._ComputeTimelineBasedMetrics:310  Processing resulting traces took 629.867 seconds
---

The SIGTERM, sent around 02:35, causes the tool to wake up and spew more data, which is confusing for the reader but kinda expected.


Description of time out handling:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/doc/User-Guide.md#properties
and
https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/doc/Bot.md#timeout-handling


Thanks eyiach@ and maruel@. So if I understand it correctly, the timeout is happening when processing the trace from the following warning: 

(WARNING) 2018-01-25 02:35:09,404 timeline_based_measurement._ComputeTimelineBasedMetrics:310  Processing resulting traces took 629.867 seconds

This takes more than 10 minutes to process that trace presumably because it is quite large. So, the actual culprit would be imgur page. We have seen problems with imgur on linux as well and it is disabled on linux. I will disable imgur on windows as well. 
Status: Fixed (was: Untriaged)
Thanks for pointing us to those docs MA, super helpful. 

It would be nice to have a clearer signal of why a job is timing out surfaced in swarming, but that is not the discussion of this bug so I am closing it.  We were just trying to understand the logs for this specific case. mythria@ is fixing the associated bug.

Sign in to add a comment