Flaky crashes due to conflict in TraceEventHandle of 2 different events |
||||||
Issue descriptionOS: Likely on Windows release builds, also happens on Linux I added 2 trace events in a function and it made the win_chromium_release_ng bot flaky. The Linux ones also crashed very rarely. The change is just adding extra trace event. https://codereview.chromium.org/1851233002/diff/1/base/threading/sequenced_worker_pool.cc More discussion in the bug crbug.com/599794 and crrev.com/1851233002. The trace event's duration is being updated twice. Here's a failure on Linux which does have a symbolized stack trace. https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/205571 [ RUN ] trace_test.2DCanvasWebGL (INFO) 2016-04-01 10:55:18,259 chrome_tracing_agent._CreateTraceConfigFile:175 Trace config file string: {"trace_config":{"included_categories": ["disabled-by-default-gpu.service", "disabled-by-default-gpu.device"], "record_mode": "record-as-much-as-possible"}} (INFO) 2016-04-01 10:55:18,265 tracing_backend.StartTracing:120 Start Tracing Request: {'params': {'transferMode': 'ReturnAsStream', 'options': 'record-as-much-as-possible', 'categories': 'disabled-by-default-gpu.service,disabled-by-default-gpu.device'}, 'method': 'Tracing.start'} [14392:14424:0401/105518:FATAL:trace_event_impl.cc(188)] Check failed: duration_.ToInternalValue() == -1 (2 vs. -1) #0 0x7fb4d38dc02e base::debug::StackTrace::StackTrace() #1 0x7fb4d38f2d7b logging::LogMessage::~LogMessage() #2 0x7fb4d396fefe base::trace_event::TraceEvent::UpdateDuration() #3 0x7fb4d3958fb9 base::trace_event::TraceLog::UpdateTraceEventDuration() #4 0x7fb4d3937fd5 base::SequencedWorkerPool::Inner::ThreadLoop() #5 0x7fb4d3937732 base::SequencedWorkerPool::Worker::Run() #6 0x7fb4d393b01d base::SimpleThread::ThreadMain() #7 0x7fb4d3936ebd base::(anonymous namespace)::ThreadFunc() #8 0x7fb4caa43182 start_thread #9 0x7fb4c7c3147d clone
,
Apr 4 2016
Another thing that I find strange is that there are several places that we add trace events to, but TraceEventHandle doesn't indicate where to find added event, so we try them all.
TraceLog::AddTraceEventWithThreadIdAndTimestamp() adds to two places:
...
if (thread_local_event_buffer) {
#1 trace_event = thread_local_event_buffer->AddTraceEvent(&handle);
} else {
lock.EnsureAcquired();
#2 trace_event = AddEventToThreadSharedChunkWhileLocked(&handle, true);
}
TraceLog::GetEventByHandleInternal() (used by TraceLog::UpdateTraceEventDuration()) looks in three places:
if (thread_local_event_buffer_.Get()) {
TraceEvent* trace_event =
#1 thread_local_event_buffer_.Get()->GetEventByHandle(handle);
if (trace_event)
return trace_event;
}
// The event has been out-of-control of the thread local buffer.
// Try to get the event from the main buffer with a lock.
if (lock)
lock->EnsureAcquired();
if (thread_shared_chunk_ &&
handle.chunk_index == thread_shared_chunk_index_) {
#2 return handle.chunk_seq == thread_shared_chunk_->seq()
? thread_shared_chunk_->GetEventAt(handle.event_index)
: NULL;
}
#3 return logged_events_->GetEventByHandle(handle);
So maybe when the test is flaky, we're looking in the wrong place?
,
Apr 4 2016
Looks like what you pointed in comment #2. I added some logs and ran the bots. The code runs like this:
TraceLog::UpdateTraceEventDuration is called with
name:"SequencedWorkerPool::Inner::ThreadLoop"
category:"toplevel"
handle:{33 32 36}
It calls GetEventByHandleInternal({33 32 36}..)
gets back a TraceEvent object that has
trace_event.name:"Scheduler::OnBeginImplFrameDeadline"
trace_event.category:"cc,benchmark".
This event had already ended.
TraceBuffer implementations seem pretty complex. Can someone who knows about the infra find a solution to this issue?
CL: https://codereview.chromium.org/1860653002/
failed bot: https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/199079/steps/trace_test%20on%20ATI%20GPU%20on%20Windows%20%28with%20patch%29%20on%20Windows-2008ServerR2-SP1/logs/stdio
,
Apr 23 2016
It will be great if someone could provide hints here? Thanks!
,
Apr 29 2016
(I think primiano@ and oystein@ are your best bets for understanding how the nuances of trace buffering might have introduced the crash)
,
Apr 30 2016
Agree with comment in #5. My other comment is the fact that this kind of bug only caught by trace_test.2DCanvasWebGL is problematic. This test exists thanks to the habit of adding tests to whatever tech stack gpu team uses of kbr@, but the tracing team should own more integration tracing tests.
,
May 16 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/90b03660f60165cb29a92eb0bba9b6d61812295b commit 90b03660f60165cb29a92eb0bba9b6d61812295b Author: ssid <ssid@chromium.org> Date: Mon May 16 23:18:09 2016 [tracing] Add task execution event for sequenced worker pool The sequenced worker pool tasks use a flow event while other task runners add without flow. So, instead of using TRACE_TASK_EXECUTION macro, add just the SCOPED_TASK_EXECUTION_EVENT part for these tasks. BUG=600440 Review-Url: https://codereview.chromium.org/1929913002 Cr-Commit-Position: refs/heads/master@{#393976} [modify] https://crrev.com/90b03660f60165cb29a92eb0bba9b6d61812295b/base/threading/sequenced_worker_pool.cc
,
Aug 17 2016
,
Aug 17 2016
Something just changed to cause that DCHECK to start firing on the Windows GPU FYI bots. Please see Issue 638744 . Help is urgently needed to understand what changed and to revert it.
,
Aug 17 2016
,
Aug 29 2016
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by dskiba@chromium.org
, Apr 4 2016Not sure if related, but I've found that MakeHandle function truncates chunk_index: // Use this function instead of TraceEventHandle constructor to keep the // overhead of ScopedTracer (trace_event.h) constructor minimum. void MakeHandle(uint32_t chunk_seq, size_t chunk_index, size_t event_index, TraceEventHandle* handle) { DCHECK(chunk_seq); DCHECK(chunk_index <= TraceBufferChunk::kMaxChunkIndex); DCHECK(event_index < TraceBufferChunk::kTraceBufferChunkSize); handle->chunk_seq = chunk_seq; handle->chunk_index = static_cast<uint16_t>(chunk_index); handle->event_index = static_cast<uint16_t>(event_index); } Here, kMaxChunkIndex is "(1u << 26) - 1", but static_cast truncates values to uint16_t.