Swarming I/O Timeout for perf test issue |
||
Issue descriptionWe have a task that seems to be hitting the I/O timeout and we are unable to see how that could be possible from the logs. It is timing out after about a half hour which means it is not hitting the hard timeout, but it is expiring. If you look at the end of this log: https://chromium-swarm.appspot.com/task?id=3b452a16448dab10&refresh=10&show_raw=1 You will see the last timestamp is at 2:35 which is exactly when the task completed. I am assuming that I/O timeout means we haven't heard from the task in > 10 mins (that is what our current I/O timeout is set to) but this seems to be killed mid-execution. Marc-Antoine, can you clarify further what I/O timeout means? This seems like a bug in swarming execution rather than this test since it is not clear how it is hitting the timeout.
,
Feb 6 2018
Clarifying a few things;
- The task is not expiring, it times out.
- It is effectively an I/O timeout.
The type of time out that was enforced is not surfaced at the task result level. To be clear, it's effectively an I/O timeout. Since the time out is of 10 minutes, and the task was abandoned at 03:35, one can look at output at around 02:25:
---
(INFO) 2018-01-25 02:24:17,750 trace_data.Serialize:191 Trace sizes in bytes: {'traceEvents': 310689286L, 'telemetry': 205918L, 'tabIds': 36L}
(WARNING) 2018-01-25 02:35:09,404 timeline_based_measurement._ComputeTimelineBasedMetrics:310 Processing resulting traces took 629.867 seconds
---
The SIGTERM, sent around 02:35, causes the tool to wake up and spew more data, which is confusing for the reader but kinda expected.
Description of time out handling:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/doc/User-Guide.md#properties
and
https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/doc/Bot.md#timeout-handling
,
Feb 7 2018
Thanks eyiach@ and maruel@. So if I understand it correctly, the timeout is happening when processing the trace from the following warning: (WARNING) 2018-01-25 02:35:09,404 timeline_based_measurement._ComputeTimelineBasedMetrics:310 Processing resulting traces took 629.867 seconds This takes more than 10 minutes to process that trace presumably because it is quite large. So, the actual culprit would be imgur page. We have seen problems with imgur on linux as well and it is disabled on linux. I will disable imgur on windows as well.
,
Feb 7 2018
Thanks for pointing us to those docs MA, super helpful. It would be nice to have a clearer signal of why a job is timing out surfaced in swarming, but that is not the discussion of this bug so I am closing it. We were just trying to understand the logs for this specific case. mythria@ is fixing the associated bug. |
||
►
Sign in to add a comment |
||
Comment 1 by eyaich@chromium.org
, Feb 6 2018