trunks: BackgroundTransceiverTest.Asynchronous unit test failure flake
Reported by
jrbarnette@chromium.org,
Mar 19 2018
|
||||||||
Issue description
A recent CQ run failed. This is the master:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18032
This is the failed slave:
https://luci-milo.appspot.com/buildbot/chromeos/eve-arcnext-paladin/260
The specific failure was a unit test failure in the "trunks" ebuild.
The blamelist for the CQ doesn't include any CL that might plausibly
cause a failure in "trunks". Moreover, 3 of the 4 CLs in the list
went through the CQ at least once without causing the failure.
So... It seems likely (at least possible) that there a problem with a rare,
intermittent failure in trunks unit tests.
Below is the full output from the trunks ebuild failure:
=== Start output for job trunks-0.0.1-r2512 (0m12.9s) ===
trunks-0.0.1-r2512: >>> Emerging (1 of 1) chromeos-base/trunks-0.0.1-r2512::chromiumos for /build/eve-arcnext/
trunks-0.0.1-r2512: * Running stacked hooks for pre_pkg_setup
trunks-0.0.1-r2512: * sysroot_build_bin_dir ...
trunks-0.0.1-r2512: [ ok ]
trunks-0.0.1-r2512: * Running stacked hooks for post_pkg_setup
trunks-0.0.1-r2512: * python_eclass_hack ...
trunks-0.0.1-r2512: [ ok ]
trunks-0.0.1-r2512: * Running stacked hooks for pre_src_unpack
trunks-0.0.1-r2512: * python_multilib_setup ...
trunks-0.0.1-r2512: [ ok ]
trunks-0.0.1-r2512: >>> Unpacking source...
trunks-0.0.1-r2512: * /mnt/host/source/src/platform2 is not at rev 88481efde4161ad9e80bfc44769804208c64da06
trunks-0.0.1-r2512: Cloning into '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1'...
trunks-0.0.1-r2512: done.
trunks-0.0.1-r2512: >>> Source unpacked in /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work
trunks-0.0.1-r2512: * Running stacked hooks for post_src_unpack
trunks-0.0.1-r2512: * asan_init ...
trunks-0.0.1-r2512: [ ok ]
trunks-0.0.1-r2512: >>> Preparing source in /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks ...
trunks-0.0.1-r2512: >>> Source prepared.
trunks-0.0.1-r2512: >>> Configuring source in /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks ...
trunks-0.0.1-r2512: /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/common-mk/platform2.py --libdir=/usr/lib64 --use_flags=abi_x86_64 amd64 cr50_onboard cros-debug elibc_glibc kernel_linux test userland_GNU --jobs=32 --action=configure --cache_dir=/build/eve-arcnext/var/cache/portage/chromeos-base/trunks /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks/trunks.gyp
trunks-0.0.1-r2512: >>> Source configured.
trunks-0.0.1-r2512: >>> Compiling source in /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks ...
trunks-0.0.1-r2512: /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/common-mk/platform2.py --libdir=/usr/lib64 --use_flags=abi_x86_64 amd64 cr50_onboard cros-debug elibc_glibc kernel_linux test userland_GNU --jobs=32 --action=compile --cache_dir=/build/eve-arcnext/var/cache/portage/chromeos-base/trunks all
trunks-0.0.1-r2512: >>> Source compiled.
trunks-0.0.1-r2512: * Running stacked hooks for pre_src_test
trunks-0.0.1-r2512: * gconv_path ...
trunks-0.0.1-r2512: [ ok ]
trunks-0.0.1-r2512: * ldpaths ...
trunks-0.0.1-r2512: [ ok ]
trunks-0.0.1-r2512: >>> Test phase: chromeos-base/trunks-0.0.1-r2512
trunks-0.0.1-r2512: /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/common-mk/platform2_test.py --action=pre_test --sysroot=/build/eve-arcnext --
trunks-0.0.1-r2512: ERROR: ld.so: object 'libsandbox.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
trunks-0.0.1-r2512: ........Invalid format for first line: bad_data
trunks-0.0.1-r2512:
trunks-0.0.1-r2512: ..Invalid file format: bad_data
trunks-0.0.1-r2512: .
trunks-0.0.1-r2512: ----------------------------------------------------------------------
trunks-0.0.1-r2512: Ran 11 tests in 0.003s
trunks-0.0.1-r2512:
trunks-0.0.1-r2512: OK
trunks-0.0.1-r2512: /build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/common-mk/platform2_test.py --action=run --sysroot=/build/eve-arcnext -- /build/eve-arcnext/var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner
trunks-0.0.1-r2512: ERROR: ld.so: object 'libsandbox.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
trunks-0.0.1-r2512: chroot: /build/eve-arcnext
trunks-0.0.1-r2512: cwd: /tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks
trunks-0.0.1-r2512: cmd: {/var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner} '/var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner'
trunks-0.0.1-r2512: [0;32m[==========] [mRunning 313 tests from 18 test cases.
trunks-0.0.1-r2512: [0;32m[----------] [mGlobal test environment set-up.
trunks-0.0.1-r2512: [0;32m[----------] [m1 test from TpmState_DeathTest
trunks-0.0.1-r2512: [0;32m[ RUN ] [mTpmState_DeathTest.NotInitialized
trunks-0.0.1-r2512: [0;32m[ OK ] [mTpmState_DeathTest.NotInitialized (3113 ms)
trunks-0.0.1-r2512: [0;32m[----------] [m1 test from TpmState_DeathTest (3113 ms total)
trunks-0.0.1-r2512:
trunks-0.0.1-r2512: [0;32m[----------] [m2 tests from BackgroundTransceiverTest
trunks-0.0.1-r2512: [0;32m[ RUN ] [mBackgroundTransceiverTest.Asynchronous
trunks-0.0.1-r2512: [FATAL:weak_ptr.cc(20)] Check failed: sequence_checker_.CalledOnValidSequencedThread() || HasOneRef(). WeakPtrs must be invalidated on the same sequenced thread.
trunks-0.0.1-r2512: /usr/lib64/libbase-core-395517.so(_ZN4base5debug10StackTraceC1Ev+0x13) [0x7f6f6cd58553]
trunks-0.0.1-r2512:
trunks-0.0.1-r2512: Error: /var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner: failed with signal SIGIOT|SIGABRT(6)
trunks-0.0.1-r2512: * ERROR: chromeos-base/trunks-0.0.1-r2512::chromiumos failed (test phase):
trunks-0.0.1-r2512: * (no error message)
trunks-0.0.1-r2512: *
trunks-0.0.1-r2512: * Call stack:
trunks-0.0.1-r2512: * ebuild.sh, line 133: Called src_test
trunks-0.0.1-r2512: * environment, line 4123: Called platform_src_test
trunks-0.0.1-r2512: * environment, line 3696: Called platform_pkg_test
trunks-0.0.1-r2512: * environment, line 3677: Called platform_test 'run' '/build/eve-arcnext/var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner'
trunks-0.0.1-r2512: * environment, line 3729: Called die
trunks-0.0.1-r2512: * The specific snippet of code:
trunks-0.0.1-r2512: * "${cmd[@]}" || die
trunks-0.0.1-r2512: *
trunks-0.0.1-r2512: * If you need support, post the output of `emerge --info '=chromeos-base/trunks-0.0.1-r2512::chromiumos'`,
trunks-0.0.1-r2512: * the complete build log and the output of `emerge -pqv '=chromeos-base/trunks-0.0.1-r2512::chromiumos'`.
trunks-0.0.1-r2512: * The complete build log is located at '/build/eve-arcnext/tmp/portage/logs/chromeos-base:trunks-0.0.1-r2512:20180317-190738.log'.
trunks-0.0.1-r2512: * For convenience, a symlink to the build log is located at '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/temp/build.log'.
trunks-0.0.1-r2512: * The ebuild environment file is located at '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/temp/environment'.
trunks-0.0.1-r2512: * Working directory: '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks'
trunks-0.0.1-r2512: * S: '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks'
trunks-0.0.1-r2512: >>> Failed to emerge chromeos-base/trunks-0.0.1-r2512 for /build/eve-arcnext/, Log file:
trunks-0.0.1-r2512: >>> '/build/eve-arcnext/tmp/portage/logs/chromeos-base:trunks-0.0.1-r2512:20180317-190738.log'
trunks-0.0.1-r2512:
trunks-0.0.1-r2512: * Messages for package chromeos-base/trunks-0.0.1-r2512 merged to /build/eve-arcnext/:
trunks-0.0.1-r2512:
trunks-0.0.1-r2512: * /mnt/host/source/src/platform2 is not at rev 88481efde4161ad9e80bfc44769804208c64da06
trunks-0.0.1-r2512: * ERROR: chromeos-base/trunks-0.0.1-r2512::chromiumos failed (test phase):
trunks-0.0.1-r2512: * (no error message)
trunks-0.0.1-r2512: *
trunks-0.0.1-r2512: * Call stack:
trunks-0.0.1-r2512: * ebuild.sh, line 133: Called src_test
trunks-0.0.1-r2512: * environment, line 4123: Called platform_src_test
trunks-0.0.1-r2512: * environment, line 3696: Called platform_pkg_test
trunks-0.0.1-r2512: * environment, line 3677: Called platform_test 'run' '/build/eve-arcnext/var/cache/portage/chromeos-base/trunks/out/Default/trunks_testrunner'
trunks-0.0.1-r2512: * environment, line 3729: Called die
trunks-0.0.1-r2512: * The specific snippet of code:
trunks-0.0.1-r2512: * "${cmd[@]}" || die
trunks-0.0.1-r2512: *
trunks-0.0.1-r2512: * If you need support, post the output of `emerge --info '=chromeos-base/trunks-0.0.1-r2512::chromiumos'`,
trunks-0.0.1-r2512: * the complete build log and the output of `emerge -pqv '=chromeos-base/trunks-0.0.1-r2512::chromiumos'`.
trunks-0.0.1-r2512: * The complete build log is located at '/build/eve-arcnext/tmp/portage/logs/chromeos-base:trunks-0.0.1-r2512:20180317-190738.log'.
trunks-0.0.1-r2512: * For convenience, a symlink to the build log is located at '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/temp/build.log'.
trunks-0.0.1-r2512: * The ebuild environment file is located at '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/temp/environment'.
trunks-0.0.1-r2512: * Working directory: '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks'
trunks-0.0.1-r2512: * S: '/build/eve-arcnext/tmp/portage/chromeos-base/trunks-0.0.1-r2512/work/trunks-0.0.1/trunks'
=== Complete: job trunks-0.0.1-r2512 (0m12.9s) ===
,
Oct 22
Another build failure with similar signature: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931976278155816144
,
Oct 22
Andrey, do you know whats going on here?
,
Oct 22
> Andrey, do you know whats going on here? Not yet. This is the actual failure - calling the destructor of some WeakPointer/WeakFabric from a wrong thread: trunks-0.0.1-r2512: [0;32m[ RUN ] [mBackgroundTransceiverTest.Asynchronous trunks-0.0.1-r2512: [FATAL:weak_ptr.cc(20)] Check failed: sequence_checker_.CalledOnValidSequencedThread() || HasOneRef(). WeakPtrs must be invalidated on the same sequenced thread. I don't remember any recent changes to trunksd, let alone its cleanup code, so can be a change in timing that made a dormant issue more frequent. Seen on bob, cheza, kukui - so, aarch64 specific? Any other boards where this issue was seen?
,
Oct 22
Issue 843544 has been merged into this issue.
,
Oct 22
,
Oct 23
Similar failures on 2 builders in CQ: bob-paladin, kukui-paladin https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8931906207547669312
,
Oct 23
No root cause yet. Capturing here just to save time, in case somebody decides to perform a similar analysis here. kukui shows this behavior for quite some time. It became more frequent, though. The first one was the build at 2018-10-17 19:08 (http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932367094127927920), then another after 12 successful builds, then two more with 5-7 successful in between. And since then we have this failure every 2-3 builds for kukui. Similar for bob. The first one (at least in a long time) was 2018-10-17 23:21 (https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932351229842295280). Then one 3 builds later. Then 20+ builds in between. Then another 8 builds later, and then 2 of the last 4 builds. Interestingly, there are no issues with kukui-release and bob-release. At the same time, there are no common CLs that are present in all failed CQ builds. So, still looks like the flakiness is already in ToT, and some relatively recent changes, including not yet submitted CLs, are triggering it more frequently. The last successful CQ was 2018-10-21 10:59 - 2018-10-21 13:51 (https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932035402251595184). And the frequent failures on bob/kukui have started already after it. Again, there are no common CLs in all runs that failed due to this unit test on bob and kukui since the last green CQ. E.g. one failed kukui run (https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932023111917341296) has a single CL, which is unrelated: https://crrev.com/c/1292203. The last successful CQ completion also seemingly doesn't have CLs that could have triggered such trunks unit tests behavior after landing on ToT: https://crrev.com/i/701349 https://crrev.com/i/701350 https://crrev.com/c/1292278 https://crrev.com/c/1292202 Looking at a few previous builds, also nothing related: https://crrev.com/c/1289189 [the most suspicious for kukui, which is 4.14, but bob is 4.4] https://crrev.com/c/1290110 https://crrev.com/c/1291874 https://crrev.com/c/1291951 https://crrev.com/c/1292129 Looking at the CLs that landed right before the 1st failed builds (10/17) - also nothing of notice. So far, looks like a race condition that is randomly seen, especially on bob & kukui, possibly under a higher load during unit testing.
,
Oct 23
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931843889260268576 Adding more data: cheza failure in the current CQ run
,
Oct 26
cheza-paladin: The UnitTest stage failed: Packages failed in /mnt/host/source/chromite/bin/cros_run_unit_tests: chromeos-base/trunks in https://ci.chromium.org/b/8931633318288840880
,
Oct 28
I was able to repro this locally with some additional logging (although the repro rate is very low on the workstation - <0.1%). The problem, according to my logs, is that the destructor of BackgroundCommandTransceiver is triggered while its method SendCommandTask() is still running on the background thread. I think this may happen because the result is returned to the main thread by posting a task to it (see PostCallbackToTaskRunner()), and this task may get executed before the chain of calls that posted it finishes. I think this problem was present in the unit test code since ever. IIUC, this bug could be fixed simply by waiting until the background thread stops in the test body. I'll public a CL with that fix soon. P.S. Curiously, the same fix was already done for the neighbor test BackgroundTransceiverTest.Synchronous back in 2017: https://crrev.com/c/431503.
,
Oct 29
> P.S. Curiously, the same fix was already done for the neighbor test BackgroundTransceiverTest.Synchronous back in 2017: https://crrev.com/c/431503. Yes, and adding thread.Stop() was even suggested back then in 2017, but I somehow reason that it is not needed :) Vut, following the 2017 repro approach, adding sleep to the end of PostCallbackToTaskRunner() after task_runner->PostTask() gives a good stable repro for this Asynchronous test. Left a comment on CL https://crrev.com/c/1303516, and sent it to CQ.
,
Oct 31
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/e2625b333add1fd88b00b3ebcc4e1bb62112aa8e commit e2625b333add1fd88b00b3ebcc4e1bb62112aa8e Author: Maksim Ivanov <emaxx@chromium.org> Date: Wed Oct 31 19:42:30 2018 trunks: Fix flaky BackgroundTransceiverTest.Asynchronous The test was sometimes failing due to hitting this DCHECK: [FATAL:weak_ptr.cc(20)] Check failed: sequence_checker_.CalledOnValidSequencedThread() || HasOneRef(). WeakPtrs must be invalidated on the same sequenced thread. Fix this by making sure that the background thread is stopped before destroying BackgroundCommandTransceiver, since otherwise there may be a running background thread task that uses this object. BUG=chromium:823519 TEST=existing unit tests Change-Id: I06dc0f549271e166e2c5f213f827ef7e3828587d Reviewed-on: https://chromium-review.googlesource.com/1303516 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Maksim Ivanov <emaxx@chromium.org> Reviewed-by: Andrey Pronin <apronin@chromium.org> [modify] https://crrev.com/e2625b333add1fd88b00b3ebcc4e1bb62112aa8e/trunks/background_command_transceiver_test.cc
,
Nov 2
|
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by seobrien@chromium.org
, Oct 18Labels: -Pri-3 Pri-2