New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 600440 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
EstimatedDays: ----
NextAction: ----
OS: Linux , Windows
Pri: 3
Type: Bug

Blocked on:
issue 638744

Blocking:
issue 594803



Sign in to add a comment

Flaky crashes due to conflict in TraceEventHandle of 2 different events

Project Member Reported by ssid@chromium.org, Apr 4 2016

Issue description

OS: Likely on Windows release builds, also happens on Linux

I added 2 trace events in a function and it made the win_chromium_release_ng bot flaky. The Linux ones also crashed very rarely.

The change is just adding extra trace event.
https://codereview.chromium.org/1851233002/diff/1/base/threading/sequenced_worker_pool.cc

More discussion in the bug  crbug.com/599794  and crrev.com/1851233002.

The trace event's duration is being updated twice.
Here's a failure on Linux which does have a symbolized stack trace.

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/205571

[ RUN      ] trace_test.2DCanvasWebGL
(INFO) 2016-04-01 10:55:18,259 chrome_tracing_agent._CreateTraceConfigFile:175  Trace config file string: {"trace_config":{"included_categories": ["disabled-by-default-gpu.service", "disabled-by-default-gpu.device"], "record_mode": "record-as-much-as-possible"}}
(INFO) 2016-04-01 10:55:18,265 tracing_backend.StartTracing:120  Start Tracing Request: {'params': {'transferMode': 'ReturnAsStream', 'options': 'record-as-much-as-possible', 'categories': 'disabled-by-default-gpu.service,disabled-by-default-gpu.device'}, 'method': 'Tracing.start'}
[14392:14424:0401/105518:FATAL:trace_event_impl.cc(188)] Check failed: duration_.ToInternalValue() == -1 (2 vs. -1)
#0 0x7fb4d38dc02e base::debug::StackTrace::StackTrace()
#1 0x7fb4d38f2d7b logging::LogMessage::~LogMessage()
#2 0x7fb4d396fefe base::trace_event::TraceEvent::UpdateDuration()
#3 0x7fb4d3958fb9 base::trace_event::TraceLog::UpdateTraceEventDuration()
#4 0x7fb4d3937fd5 base::SequencedWorkerPool::Inner::ThreadLoop()
#5 0x7fb4d3937732 base::SequencedWorkerPool::Worker::Run()
#6 0x7fb4d393b01d base::SimpleThread::ThreadMain()
#7 0x7fb4d3936ebd base::(anonymous namespace)::ThreadFunc()
#8 0x7fb4caa43182 start_thread
#9 0x7fb4c7c3147d clone

 
Not sure if related, but I've found that MakeHandle function truncates chunk_index:

// Use this function instead of TraceEventHandle constructor to keep the
// overhead of ScopedTracer (trace_event.h) constructor minimum.
void MakeHandle(uint32_t chunk_seq,
                size_t chunk_index,
                size_t event_index,
                TraceEventHandle* handle) {
  DCHECK(chunk_seq);
  DCHECK(chunk_index <= TraceBufferChunk::kMaxChunkIndex);
  DCHECK(event_index < TraceBufferChunk::kTraceBufferChunkSize);
  handle->chunk_seq = chunk_seq;
  handle->chunk_index = static_cast<uint16_t>(chunk_index);
  handle->event_index = static_cast<uint16_t>(event_index);
}

Here, kMaxChunkIndex is "(1u << 26) - 1", but static_cast truncates values to uint16_t.

Another thing that I find strange is that there are several places that we add trace events to, but TraceEventHandle doesn't indicate where to find added event, so we try them all.

TraceLog::AddTraceEventWithThreadIdAndTimestamp() adds to two places:
...
      if (thread_local_event_buffer) {
#1      trace_event = thread_local_event_buffer->AddTraceEvent(&handle);
      } else {
        lock.EnsureAcquired();
#2      trace_event = AddEventToThreadSharedChunkWhileLocked(&handle, true);
      }


TraceLog::GetEventByHandleInternal() (used by TraceLog::UpdateTraceEventDuration()) looks in three places:

    if (thread_local_event_buffer_.Get()) {
      TraceEvent* trace_event =
#1        thread_local_event_buffer_.Get()->GetEventByHandle(handle);
      if (trace_event)
        return trace_event;
    }

    // The event has been out-of-control of the thread local buffer.
    // Try to get the event from the main buffer with a lock.
    if (lock)
      lock->EnsureAcquired();

    if (thread_shared_chunk_ &&
        handle.chunk_index == thread_shared_chunk_index_) {
#2    return handle.chunk_seq == thread_shared_chunk_->seq()
                 ? thread_shared_chunk_->GetEventAt(handle.event_index)
                 : NULL;
    }

#3  return logged_events_->GetEventByHandle(handle);



So maybe when the test is flaky, we're looking in the wrong place?

Comment 3 by ssid@chromium.org, Apr 4 2016

Summary: Flaky crashes due to conflict in TraceEventHandle of 2 different events (was: Flaky crashes due to 2 trace events in same function)
Looks like what you pointed in comment #2. I added some logs and ran the bots. The code runs like this:

TraceLog::UpdateTraceEventDuration is called with 
name:"SequencedWorkerPool::Inner::ThreadLoop"
category:"toplevel"
handle:{33 32 36}

It calls GetEventByHandleInternal({33 32 36}..)
gets back a TraceEvent object that has 
trace_event.name:"Scheduler::OnBeginImplFrameDeadline"
trace_event.category:"cc,benchmark".

This event had already ended.

TraceBuffer implementations seem pretty complex. Can someone who knows about the infra find a solution to this issue?

CL: https://codereview.chromium.org/1860653002/
failed bot: https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/199079/steps/trace_test%20on%20ATI%20GPU%20on%20Windows%20%28with%20patch%29%20on%20Windows-2008ServerR2-SP1/logs/stdio

Comment 4 by ssid@chromium.org, Apr 23 2016

It will be great if someone could provide hints here?
Thanks!

Comment 5 by charliea@google.com, Apr 29 2016

(I think primiano@ and oystein@ are your best bets for understanding how the nuances of trace buffering might have introduced the crash)
Agree with comment in #5. My other comment is the fact that this kind of bug only caught by trace_test.2DCanvasWebGL is problematic. This test exists thanks to the habit of adding tests to whatever tech stack gpu team uses of kbr@, but the tracing team should own more integration tracing tests.
Project Member

Comment 7 by bugdroid1@chromium.org, May 16 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/90b03660f60165cb29a92eb0bba9b6d61812295b

commit 90b03660f60165cb29a92eb0bba9b6d61812295b
Author: ssid <ssid@chromium.org>
Date: Mon May 16 23:18:09 2016

[tracing] Add task execution event for sequenced worker pool

The sequenced worker pool tasks use a flow event while other task
runners add without flow. So, instead of using TRACE_TASK_EXECUTION
macro, add just the SCOPED_TASK_EXECUTION_EVENT part for these tasks.

BUG=600440

Review-Url: https://codereview.chromium.org/1929913002
Cr-Commit-Position: refs/heads/master@{#393976}

[modify] https://crrev.com/90b03660f60165cb29a92eb0bba9b6d61812295b/base/threading/sequenced_worker_pool.cc

Comment 8 by kbr@chromium.org, Aug 17 2016

Blockedon: 638744

Comment 9 by kbr@chromium.org, Aug 17 2016

Owner: ssid@chromium.org
Status: Assigned (was: Untriaged)
Something just changed to cause that DCHECK to start firing on the Windows GPU FYI bots. Please see  Issue 638744 . Help is urgently needed to understand what changed and to revert it.

Cc: mariakho...@chromium.org
Cc: -charliea@chromium.org

Sign in to add a comment