New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 823519 link

Starred by 6 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug


Participants' hotlists:
Cros-Hwsec-Ready


Sign in to add a comment

trunks: BackgroundTransceiverTest.Asynchronous unit test failure flake

Reported by jrbarnette@chromium.org, Mar 19 2018

Issue description

A recent CQ run failed.  This is the master:
    https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18032
This is the failed slave:
    https://luci-milo.appspot.com/buildbot/chromeos/eve-arcnext-paladin/260

The specific failure was a unit test failure in the "trunks" ebuild.

The blamelist for the CQ doesn't include any CL that might plausibly
cause a failure in "trunks".  Moreover, 3 of the 4 CLs in the list
went through the CQ at least once without causing the failure.

So...  It seems likely (at least possible) that there a problem with a rare,
intermittent failure in trunks unit tests.

Below is the full output from the trunks ebuild failure:
=== Start output for job trunks-0.0.1-r2512 (0m12.9s) ===
trunks-0.0.1-r2512: >>> Emerging (1 of 1) chromeos-base/trunks-0.0.1-r2512::chromiumos for /build/eve-arcnext/
trunks-0.0.1-r2512:  * Running stacked hooks for pre_pkg_setup
trunks-0.0.1-r2512:  *    sysroot_build_bin_dir ...
trunks-0.0.1-r2512:  [ ok ]
trunks-0.0.1-r2512:  * Running stacked hooks for post_pkg_setup
trunks-0.0.1-r2512:  *    python_eclass_hack ...
trunks-0.0.1-r2512:  [ ok ]
trunks-0.0.1-r2512:  * Running stacked hooks for pre_src_unpack
trunks-0.0.1-r2512:  *    python_multilib_setup ...
trunks-0.0.1-r2512:  [ ok ]
trunks-0.0.1-r2512: >>> Unpacking source...
trunks-0.0.1-r2512:  * /mnt/host/source/src/platform2 is not at rev 88481efde4161ad9e80bfc44769804208c64da06
trunks-0.0.1-r2512: Cloning into '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1'...
trunks-0.0.1-r2512: done.
trunks-0.0.1-r2512: >>> Source unpacked in /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work
trunks-0.0.1-r2512:  * Running stacked hooks for post_src_unpack
trunks-0.0.1-r2512:  *    asan_init ...
trunks-0.0.1-r2512:  [ ok ]
trunks-0.0.1-r2512: >>> Preparing source in /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks ...
trunks-0.0.1-r2512: >>> Source prepared.
trunks-0.0.1-r2512: >>> Configuring source in /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks ...
trunks-0.0.1-r2512: /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/common-mk/platform2.py --libdir=/usr/lib64 --use_flags=abi_x86_64 amd64 cr50_onboard cros-debug elibc_glibc kernel_linux test userland_GNU --jobs=32 --action=configure --cache_dir=/build/eve-arcnext/var/cache/portage/chromeos-base/trunks /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks/trunks.gyp
trunks-0.0.1-r2512: >>> Source configured.
trunks-0.0.1-r2512: >>> Compiling source in /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks ...
trunks-0.0.1-r2512: /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/common-mk/platform2.py --libdir=/usr/lib64 --use_flags=abi_x86_64 amd64 cr50_onboard cros-debug elibc_glibc kernel_linux test userland_GNU --jobs=32 --action=compile --cache_dir=/build/eve-arcnext/var/cache/portage/chromeos-base/trunks all
trunks-0.0.1-r2512: >>> Source compiled.
trunks-0.0.1-r2512:  * Running stacked hooks for pre_src_test
trunks-0.0.1-r2512:  *    gconv_path ...
trunks-0.0.1-r2512:  [ ok ]
trunks-0.0.1-r2512:  *    ldpaths ...
trunks-0.0.1-r2512:  [ ok ]
trunks-0.0.1-r2512: >>> Test phase: chromeos-base/trunks-0.0.1-r2512
trunks-0.0.1-r2512: /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/common-mk/platform2_test.py --action=pre_test --sysroot=/build/eve-arcnext -- 
trunks-0.0.1-r2512: ERROR: ld.so: object 'libsandbox.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
trunks-0.0.1-r2512: ........Invalid format for first line: bad_data
trunks-0.0.1-r2512: 
trunks-0.0.1-r2512: ..Invalid file format: bad_data
trunks-0.0.1-r2512: .
trunks-0.0.1-r2512: ----------------------------------------------------------------------
trunks-0.0.1-r2512: Ran 11 tests in 0.003s
trunks-0.0.1-r2512: 
trunks-0.0.1-r2512: OK
trunks-0.0.1-r2512: /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/common-mk/platform2_test.py --action=run --sysroot=/build/eve-arcnext -- /build/eve-arcnext/var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner
trunks-0.0.1-r2512: ERROR: ld.so: object 'libsandbox.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
trunks-0.0.1-r2512: chroot: /build/eve-arcnext
trunks-0.0.1-r2512: cwd: /tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks
trunks-0.0.1-r2512: cmd: {/var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner} '/var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner'
trunks-0.0.1-r2512: [==========] Running 313 tests from 18 test cases.
trunks-0.0.1-r2512: [----------] Global test environment set-up.
trunks-0.0.1-r2512: [----------] 1 test from TpmState_DeathTest
trunks-0.0.1-r2512: [ RUN      ] TpmState_DeathTest.NotInitialized
trunks-0.0.1-r2512: [       OK ] TpmState_DeathTest.NotInitialized (3113 ms)
trunks-0.0.1-r2512: [----------] 1 test from TpmState_DeathTest (3113 ms total)
trunks-0.0.1-r2512: 
trunks-0.0.1-r2512: [----------] 2 tests from BackgroundTransceiverTest
trunks-0.0.1-r2512: [ RUN      ] BackgroundTransceiverTest.Asynchronous
trunks-0.0.1-r2512: [FATAL:weak_ptr.cc(20)] Check failed: sequence_checker_.CalledOnValidSequencedThread() || HasOneRef(). WeakPtrs must be invalidated on the same sequenced thread.
trunks-0.0.1-r2512: /usr/lib64/libbase-core-395517.so(_ZN4base5debug10StackTraceC1Ev+0x13) [0x7f6f6cd58553]
trunks-0.0.1-r2512: 
trunks-0.0.1-r2512: Error: /var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner: failed with signal SIGIOT|SIGABRT(6)
trunks-0.0.1-r2512:  * ERROR: chromeos-base/trunks-0.0.1-r2512::chromiumos failed (test phase):
trunks-0.0.1-r2512:  *   (no error message)
trunks-0.0.1-r2512:  * 
trunks-0.0.1-r2512:  * Call stack:
trunks-0.0.1-r2512:  *     ebuild.sh, line  133:  Called src_test
trunks-0.0.1-r2512:  *   environment, line 4123:  Called platform_src_test
trunks-0.0.1-r2512:  *   environment, line 3696:  Called platform_pkg_test
trunks-0.0.1-r2512:  *   environment, line 3677:  Called platform_test 'run' '/build/eve-arcnext/var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner'
trunks-0.0.1-r2512:  *   environment, line 3729:  Called die
trunks-0.0.1-r2512:  * The specific snippet of code:
trunks-0.0.1-r2512:  *       "${cmd[@]}" || die
trunks-0.0.1-r2512:  * 
trunks-0.0.1-r2512:  * If you need support, post the output of `emerge --info '=chromeos-base/trunks-0.0.1-r2512::chromiumos'`,
trunks-0.0.1-r2512:  * the complete build log and the output of `emerge -pqv '=chromeos-base/trunks-0.0.1-r2512::chromiumos'`.
trunks-0.0.1-r2512:  * The complete build log is located at '/build/eve-arcnext/tmp/portage/logs/chromeos-base:trunks-0.0.1-r2512:20180317-190738.log'.
trunks-0.0.1-r2512:  * For convenience, a symlink to the build log is located at '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/temp/build.log'.
trunks-0.0.1-r2512:  * The ebuild environment file is located at '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/temp/environment'.
trunks-0.0.1-r2512:  * Working directory: '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks'
trunks-0.0.1-r2512:  * S: '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks'
trunks-0.0.1-r2512: >>> Failed to emerge chromeos-base/trunks-0.0.1-r2512 for /build/eve-arcnext/, Log file:
trunks-0.0.1-r2512: >>>  '/build/eve-arcnext/tmp/portage/logs/chromeos-base:trunks-0.0.1-r2512:20180317-190738.log'
trunks-0.0.1-r2512: 
trunks-0.0.1-r2512:  * Messages for package chromeos-base/trunks-0.0.1-r2512 merged to /build/eve-arcnext/:
trunks-0.0.1-r2512: 
trunks-0.0.1-r2512:  * /mnt/host/source/src/platform2 is not at rev 88481efde4161ad9e80bfc44769804208c64da06
trunks-0.0.1-r2512:  * ERROR: chromeos-base/trunks-0.0.1-r2512::chromiumos failed (test phase):
trunks-0.0.1-r2512:  *   (no error message)
trunks-0.0.1-r2512:  * 
trunks-0.0.1-r2512:  * Call stack:
trunks-0.0.1-r2512:  *     ebuild.sh, line  133:  Called src_test
trunks-0.0.1-r2512:  *   environment, line 4123:  Called platform_src_test
trunks-0.0.1-r2512:  *   environment, line 3696:  Called platform_pkg_test
trunks-0.0.1-r2512:  *   environment, line 3677:  Called platform_test 'run' '/build/eve-arcnext/var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner'
trunks-0.0.1-r2512:  *   environment, line 3729:  Called die
trunks-0.0.1-r2512:  * The specific snippet of code:
trunks-0.0.1-r2512:  *       "${cmd[@]}" || die
trunks-0.0.1-r2512:  * 
trunks-0.0.1-r2512:  * If you need support, post the output of `emerge --info '=chromeos-base/trunks-0.0.1-r2512::chromiumos'`,
trunks-0.0.1-r2512:  * the complete build log and the output of `emerge -pqv '=chromeos-base/trunks-0.0.1-r2512::chromiumos'`.
trunks-0.0.1-r2512:  * The complete build log is located at '/build/eve-arcnext/tmp/portage/logs/chromeos-base:trunks-0.0.1-r2512:20180317-190738.log'.
trunks-0.0.1-r2512:  * For convenience, a symlink to the build log is located at '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/temp/build.log'.
trunks-0.0.1-r2512:  * The ebuild environment file is located at '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/temp/environment'.
trunks-0.0.1-r2512:  * Working directory: '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks'
trunks-0.0.1-r2512:  * S: '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks'
=== Complete: job trunks-0.0.1-r2512 (0m12.9s) ===

 
Cc: mikenichols@chromium.org akes...@chromium.org amstan@google.com josephsih@chromium.org
Labels: -Pri-3 Pri-2
Two recent builds failed this way:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932396769816711648

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932305093748581008

I didn't see anything in the shared CLs that looks relevant here.
Cc: furquan@chromium.org benzh@chromium.org vbendeb@chromium.org
Another build failure with similar signature:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931976278155816144


Owner: apronin@chromium.org
Andrey, do you know whats going on here?
Components: -OS>Systems OS>Systems>Security
Status: Available (was: Untriaged)
> Andrey, do you know whats going on here?
Not yet. This is the actual failure - calling the destructor of some WeakPointer/WeakFabric from a wrong thread:

trunks-0.0.1-r2512: [ RUN      ] BackgroundTransceiverTest.Asynchronous
trunks-0.0.1-r2512: [FATAL:weak_ptr.cc(20)] Check failed: sequence_checker_.CalledOnValidSequencedThread() || HasOneRef(). WeakPtrs must be invalidated on the same sequenced thread.

I don't remember any recent changes to trunksd, let alone its cleanup code, so can be a change in timing that made a dormant issue more frequent.
Seen on bob, cheza, kukui - so, aarch64 specific? Any other boards where this issue was seen? 


Cc: apronin@chromium.org amstan@chromium.org emaxx@chromium.org
 Issue 843544  has been merged into this issue.
Summary: trunks: BackgroundTransceiverTest.Asynchronous unit test failure flake (was: trunks unit test failure flake)
Similar failures on 2 builders in CQ: bob-paladin, kukui-paladin
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8931906207547669312


No root cause yet. Capturing here just to save time, in case somebody decides to perform a similar analysis here. 

kukui shows this behavior for quite some time. It became more frequent, though. 
The first one was the build at 2018-10-17 19:08 (http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932367094127927920), then another after 12 successful builds, then two more with 5-7 successful in between. And since then we have this failure every 2-3 builds for kukui.

Similar for bob. The first one (at least in a long time) was 2018-10-17 23:21 (https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932351229842295280). Then one 3 builds later. Then 20+ builds in between. Then another 8 builds later, and then 2 of the last 4 builds.

Interestingly, there are no issues with kukui-release and bob-release.

At the same time, there are no common CLs that are present in all failed CQ builds. So, still looks like the flakiness is already in ToT, and some relatively recent changes, including not yet submitted CLs, are triggering it more frequently. The last successful CQ was 2018-10-21 10:59 - 2018-10-21 13:51 (https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932035402251595184). And the frequent failures on bob/kukui have started already after it.
Again, there are no common CLs in all runs that failed due to this unit test on bob and kukui since the last green CQ.  E.g. one failed kukui run (https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932023111917341296) has a single CL, which is unrelated: https://crrev.com/c/1292203.

The last successful CQ completion also seemingly doesn't have CLs that could have triggered such trunks unit tests behavior after landing on ToT:
https://crrev.com/i/701349
https://crrev.com/i/701350
https://crrev.com/c/1292278
https://crrev.com/c/1292202

Looking at a few previous builds, also nothing related:
https://crrev.com/c/1289189 [the most suspicious for kukui, which is 4.14, but bob is 4.4]
https://crrev.com/c/1290110
https://crrev.com/c/1291874
https://crrev.com/c/1291951
https://crrev.com/c/1292129

Looking at the CLs that landed right before the 1st failed builds (10/17) - also nothing of notice.

So far, looks like a race condition that is randomly seen, especially on bob & kukui, possibly under a higher load during unit testing.
cheza-paladin: The UnitTest stage failed: Packages failed in /mnt/host/source/chromite/bin/cros_run_unit_tests: chromeos-base/trunks in https://ci.chromium.org/b/8931633318288840880


Cc: -emaxx@chromium.org
Labels: OS-Chrome
Owner: emaxx@chromium.org
Status: Started (was: Available)
I was able to repro this locally with some additional logging (although the repro rate is very low on the workstation - <0.1%).

The problem, according to my logs, is that the destructor of BackgroundCommandTransceiver is triggered while its method SendCommandTask() is still running on the background thread. I think this may happen because the result is returned to the main thread by posting a task to it (see PostCallbackToTaskRunner()), and this task may get executed before the chain of calls that posted it finishes.

I think this problem was present in the unit test code since ever. IIUC, this bug could be fixed simply by waiting until the background thread stops in the test body. I'll public a CL with that fix soon.

P.S. Curiously, the same fix was already done for the neighbor test BackgroundTransceiverTest.Synchronous back in 2017: https://crrev.com/c/431503.
> P.S. Curiously, the same fix was already done for the neighbor test BackgroundTransceiverTest.Synchronous back in 2017: https://crrev.com/c/431503.

Yes, and adding thread.Stop() was even suggested back then in 2017, but I somehow reason that it is not needed :)

Vut, following the 2017 repro approach, adding sleep to the end of PostCallbackToTaskRunner() after task_runner->PostTask() gives a good stable repro for this Asynchronous test. 
Left a comment on CL https://crrev.com/c/1303516, and sent it to CQ.

Project Member

Comment 13 by bugdroid1@chromium.org, Oct 31

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/e2625b333add1fd88b00b3ebcc4e1bb62112aa8e

commit e2625b333add1fd88b00b3ebcc4e1bb62112aa8e
Author: Maksim Ivanov <emaxx@chromium.org>
Date: Wed Oct 31 19:42:30 2018

trunks: Fix flaky BackgroundTransceiverTest.Asynchronous

The test was sometimes failing due to hitting this DCHECK:

[FATAL:weak_ptr.cc(20)] Check failed:
sequence_checker_.CalledOnValidSequencedThread() || HasOneRef().
WeakPtrs must be invalidated on the same sequenced thread.

Fix this by making sure that the background thread is stopped
before destroying BackgroundCommandTransceiver, since otherwise
there may be a running background thread task that uses this
object.

BUG=chromium:823519
TEST=existing unit tests

Change-Id: I06dc0f549271e166e2c5f213f827ef7e3828587d
Reviewed-on: https://chromium-review.googlesource.com/1303516
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Maksim Ivanov <emaxx@chromium.org>
Reviewed-by: Andrey Pronin <apronin@chromium.org>

[modify] https://crrev.com/e2625b333add1fd88b00b3ebcc4e1bb62112aa8e/trunks/background_command_transceiver_test.cc

Labels: Cros-Hwsec-Ready

Sign in to add a comment