New issue
Advanced search Search tips

Issue 771785 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

cidb_integration_test test flake

Project Member Reported by dgarr...@chromium.org, Oct 4 2017

Issue description

We now run cidb_integration_test as part of the PreCQ for chromite changes.

Mostly... it works, but this looks like a flaky test failure that doesn't make any sense to me.


https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/pre_cq/59577

Test Insert and Get operations on hwTestResultTable. ... ok

======================================================================
ERROR: [chromite.lib.cidb_integration_test] CIDBMigrationsTest.testActions
Test that InsertCLActions accepts 0-, 1-, and multi-item lists.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 262, in _stacked_setUp
    StackedSetup._stacked_tearDown(obj)
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 278, in _stacked_tearDown
    target(obj)
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 1072, in tearDown
    None)
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/mnt/host/source/chromite/lib/parallel.py", line 750, in BackgroundTaskRunner
    queue.put(_AllTasksComplete())
  File "/mnt/host/source/chromite/lib/parallel.py", line 747, in BackgroundTaskRunner
    yield queue
TypeError: __init__() takes at least 3 arguments (2 given)
 
Labels: -Pri-1 Pri-2
Did this happen more than once? If not, ->P2.
Cc: akes...@chromium.org
Owner: ----
Status: Available (was: Untriaged)
reassign and ->P1 if recurs
I only saw it once, but was worried because it happened on one of the first production PreCQ builds I saw that ran the test.

Comment 4 by nxia@chromium.org, Oct 14 2017

.
======================================================================
ERROR: [chromite.lib.cidb_integration_test] DataSeries1Test.runTest
Simulate a single set of canary builds with database schema v56.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 263, in _stacked_setUp
    StackedSetup._stacked_tearDown(obj)
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 279, in _stacked_tearDown
    target(obj)
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 1073, in tearDown
    None)
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/mnt/host/source/chromite/lib/parallel.py", line 751, in BackgroundTaskRunner
    queue.put(_AllTasksComplete())
  File "/mnt/host/source/chromite/lib/parallel.py", line 748, in BackgroundTaskRunner
    yield queue
TypeError: __init__() takes at least 3 arguments (2 given)

----------------------------------------------------------------------
Ran 25 tests in 167.227s

FAILED (errors=1, skipped=1)


sometimes I can reproduce it locally 
Owner: pprabhu@chromium.org
Status: Started (was: Available)
Again:

https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromiumos.tryserver%2Fpre_cq%2F62180%2F%2B%2Frecipes%2Fsteps%2FCidbIntegrationTest%2F0%2Fstdout

I know what is happening here. I wrote this bit a while ago.


Project Member

Comment 6 by bugdroid1@chromium.org, Oct 19 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/fb7c09382221cd6d103ef331f138dacd528ba545

commit fb7c09382221cd6d103ef331f138dacd528ba545
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Oct 19 02:08:26 2017

cros_test_lib: Use addCleanup in TempDirTestCase.

unittest.TestCase.addCleanup registered functions run after
unittest.TestCase.tearDown. TempDirTestCase needs to use addCleanup so
that users of this test case can safely use addCleanup assuming that the
tempdir provided will not be cleaned up until after these cleanup
functions have run.

BUG=chromium:771785
TEST=cbuildbot/run_tests

Change-Id: Ibf84cf16ed2b2d1e29bc8e787d17329480b7ee59
Reviewed-on: https://chromium-review.googlesource.com/726310
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/fb7c09382221cd6d103ef331f138dacd528ba545/lib/cros_test_lib.py

Project Member

Comment 7 by bugdroid1@chromium.org, Oct 25 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/386964f9720cff98d9fc9e105b41e53b2d18d271

commit 386964f9720cff98d9fc9e105b41e53b2d18d271
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Oct 25 00:36:31 2017

cidb: Fix integration test case failure mode.

One of the failure modes of cidb_integration_test was being reported
badly due to a bug in the way we created the resulting exception.

This CL does not fix the actual problem (probably timing related), but
fixes the incorrect reporting.

BUG=chromium:771785
TEST=cidb_integration_test; force bad cases.

Change-Id: I44b177f3e3f43f5501124fad889cbe116ee7cc20
Reviewed-on: https://chromium-review.googlesource.com/726522
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/386964f9720cff98d9fc9e105b41e53b2d18d271/lib/cros_test_lib.py

Status: Available (was: Started)
We now wait for this to strike again with the real failure reason :)

Comment 9 by nxia@chromium.org, Nov 3 2017

Got a failure

======================================================================
ERROR: [chromite.lib.cidb_integration_test] BuildTableTest.testBuildbucketId
Test InsertBuild with buildbucket_id.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 1053, in <lambda>
    'mysqladmin failed to ping mysqld: %s' % e))
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 1084, in _CleanupMysqld
    None,
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/mnt/host/source/chromite/lib/parallel.py", line 751, in BackgroundTaskRunner
    queue.put(_AllTasksComplete())
  File "/mnt/host/source/chromite/lib/parallel.py", line 748, in BackgroundTaskRunner
    yield queue
Exception: mysqladmin failed to ping mysqld: return code: 1; command: mysqladmin -S /tmp/chromite.testd2SH5F/mysqld_dir/mysqld.socket ping
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/tmp/chromite.testd2SH5F/mysqld_dir/mysqld.socket' (2)'
Check that mysqld is running and that the socket: '/tmp/chromite.testd2SH5F/mysqld_dir/mysqld.socket' exists!

cmd=['mysqladmin', '-S', '/tmp/chromite.testd2SH5F/mysqld_dir/mysqld.socket', 'ping']. We force killed the mysqld process.


Comment 10 by nxia@chromium.org, Nov 3 2017

The ping command failed after 5 times retries. Checked the log, the first couple 'ping' often failed and got retired for other passed test. probably we can add the sleep time between the 'ping' retries, and add sleep time after a 'shutdown' operation.


171103 15:40:49  InnoDB: Shutdown completed; log sequence number 2046640
171103 15:40:49 [Note] /usr/sbin/mysqld: Shutdown complete

.15:40:49: DEBUG: RunCommand: /usr/share/mysql/scripts/mysql_install_db --no-defaults '--basedir=/usr' '--ldata=/tmp/chromite.testXSY5hV/mysqld_dir'
15:40:49: DEBUG: RunCommand: mysqladmin -S /tmp/chromite.testXSY5hV/mysqld_dir/mysqld.socket ping
15:40:49: WARNING: Command failed with retriable error.
return code: 1; command: mysqladmin -S /tmp/chromite.testXSY5hV/mysqld_dir/mysqld.socket ping
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/tmp/chromite.testXSY5hV/mysqld_dir/mysqld.socket' (2)'
Check that mysqld is running and that the socket: '/tmp/chromite.testXSY5hV/mysqld_dir/mysqld.socket' exists!

cmd=['mysqladmin', '-S', '/tmp/chromite.testXSY5hV/mysqld_dir/mysqld.socket', 'ping']
15:40:49: DEBUG: <class 'chromite.lib.cros_build_lib.RunCommandError'>(return code: 1; command: mysqladmin -S /tmp/chromite.testXSY5hV/mysqld_dir/mysqld.socket ping
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/tmp/chromite.testXSY5hV/mysqld_dir/mysqld.socket' (2)'
Check that mysqld is running and that the socket: '/tmp/chromite.testXSY5hV/mysqld_dir/mysqld.socket' exists!



Status: Started (was: Available)
We already have retries for ping with a backoff factor of 1.5: https://cs.corp.google.com/chromeos_public/chromite/lib/cros_test_lib.py?l=1049

Also, we wait for the mysqld process to end before continuing after we run 'mysqladmin shutdown', so we're actually sure mysqld is gone before we continue.

On a loaded server, even 1 + 1*1.5 + 1* 1.5^2 + 1* 1.5^3 + 1* 1.5^4 = 13.1875 is not enough.
Two points here:
- My backoff factor is stupid. It should at least be 2 . Without any backoff factor, the timeout would have been : 1 + 2 + 3 + 4 + 5 = 15
- Such timeouts tend to make for flaky tests on the builders. We should increase the timeout to practially mean "forever". Let's make it 
backoff_factor = 2, # attempts = 8, i.e
1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 = 255s total timeout.
If the mysql server doesn't come up for 4 minutes, we can safely fail the test (oh, come on!) and it's not prohibitively long for the builders to hang for.

https://chromium-review.googlesource.com/c/chromiumos/chromite/+/754380
My guess is that if the server doesn't want to come up within 15 seconds, it probably won't come up in 4 minutes either. If it's hung, it's hung.

Instead of a backoff-and-wait strategy, what about a kill-it-and-try-again one?
Re #12: 12 seconds isn't nearly good enough on a loaded builder. This reminds of the flaky EC test we've marked expectedFailure recently.

Basically reasonable timeout assumptions don't work on the builders.
- unittests should mock out tests
- integration tests (like this one), should keep a super long timeout that practically means "wait forever".

We could retry at the test level rather than in the fixture.
*unittests should mock out time.
Project Member

Comment 15 by bugdroid1@chromium.org, Nov 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/7ba82062bd6b5f1146c8bc82d2f9cd13d482ceaf

commit 7ba82062bd6b5f1146c8bc82d2f9cd13d482ceaf
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Nov 08 23:10:15 2017

cidb_integration_test: Wait longer for mysqld to initialize.

BUG=chromium:771785
TEST=cidb_integration_test

Change-Id: I033516d8400e16858c492c080cb546fa8c490160
Reviewed-on: https://chromium-review.googlesource.com/754380
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/7ba82062bd6b5f1146c8bc82d2f9cd13d482ceaf/lib/cros_test_lib.py

Status: Fixed (was: Started)
Speculative 'Fixed'.
Reopen if re-seen.
Status: Assigned (was: Fixed)
Re-seen.

https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/pre_cq/68788

Traceback (most recent call last):
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 259, in _stacked_setUp
    target(obj)
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 1054, in setUp
    sleep=1, backoff_factor=2)
  File "/mnt/host/source/chromite/lib/retry_util.py", line 353, in RunCommandWithRetries
    return RetryCommand(cros_build_lib.RunCommand, max_retry, *args, **kwargs)
  File "/mnt/host/source/chromite/lib/retry_util.py", line 325, in RetryCommand
    return GenericRetry(ShouldRetry, max_retry, functor, *args, **kwargs)
  File "/mnt/host/source/chromite/lib/retry_util.py", line 244, in GenericRetry
    return _run()
  File "/mnt/host/source/chromite/lib/retry_util.py", line 177, in _Wrapper
    ret = func(*args, **kwargs)
  File "/mnt/host/source/chromite/lib/retry_util.py", line 243, in _run
    return functor(*args, **kwargs)
  File "/mnt/host/source/chromite/lib/cros_build_lib.py", line 654, in RunCommand
    raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 1; command: mysqladmin -S /tmp/chromite.testfgib1P/mysqld_dir/mysqld.socket ping
mysqladmin: connect to server at 'localhost' failed
This is confusing.
It looks like it was running a test twice (in parallel?)

======================================================================
ERROR: [chromite.lib.cidb_integration_test] CIDBAPITest.testGetTime
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 259, in _stacked_setUp
    target(obj)
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 1074, in setUp
    sleep=1, backoff_factor=2)
  File "/mnt/host/source/chromite/lib/retry_util.py", line 353, in RunCommandWithRetries
    return RetryCommand(cros_build_lib.RunCommand, max_retry, *args, **kwargs)
  File "/mnt/host/source/chromite/lib/retry_util.py", line 325, in RetryCommand
    return GenericRetry(ShouldRetry, max_retry, functor, *args, **kwargs)
  File "/mnt/host/source/chromite/lib/retry_util.py", line 244, in GenericRetry
    return _run()
  File "/mnt/host/source/chromite/lib/retry_util.py", line 177, in _Wrapper
    ret = func(*args, **kwargs)
  File "/mnt/host/source/chromite/lib/retry_util.py", line 243, in _run
    return functor(*args, **kwargs)
  File "/mnt/host/source/chromite/lib/cros_build_lib.py", line 654, in RunCommand
    raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 1; command: mysqladmin -S /tmp/chromite.testaW3cYD/mysqld_dir/mysqld.socket ping
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/tmp/chromite.testaW3cYD/mysqld_dir/mysqld.socket' (2)'
Check that mysqld is running and that the socket: '/tmp/chromite.testaW3cYD/mysqld_dir/mysqld.socket' exists!
cmd=['mysqladmin', '-S', '/tmp/chromite.testaW3cYD/mysqld_dir/mysqld.socket', 'ping']
======================================================================
ERROR: [chromite.lib.cidb_integration_test] CIDBAPITest.testGetTime
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 1077, in <lambda>
    'mysqladmin failed to ping mysqld: %s' % e))
  File "/mnt/host/source/chromite/lib/cros_test_lib.py", line 1108, in _CleanupMysqld
    None,
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/mnt/host/source/chromite/lib/parallel.py", line 751, in BackgroundTaskRunner
    queue.put(_AllTasksComplete())
  File "/mnt/host/source/chromite/lib/parallel.py", line 748, in BackgroundTaskRunner
    yield queue
Exception: mysqladmin failed to ping mysqld: return code: 1; command: mysqladmin -S /tmp/chromite.testaW3cYD/mysqld_dir/mysqld.socket ping
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/tmp/chromite.testaW3cYD/mysqld_dir/mysqld.socket' (2)'
Check that mysqld is running and that the socket: '/tmp/chromite.testaW3cYD/mysqld_dir/mysqld.socket' exists!
cmd=['mysqladmin', '-S', '/tmp/chromite.testaW3cYD/mysqld_dir/mysqld.socket', 'ping']. We force killed the mysqld process.
----------------------------------------------------------------------
Ran 27 tests in 440.649s
FAILED (errors=2, skipped=1)

The two errors are the same test failing twice...
Components: Infra>Client>ChromeOS>CI
Components: -Infra>Client>ChromeOS
Owner: ----
Status: Untriaged (was: Assigned)
No longer the right team ownership. Onto triage queue for CI.
Owner: jclinton@chromium.org
Status: Available (was: Untriaged)
Looks like a good starter bug. Will find owner.

Comment 24 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
Status: Assigned (was: Available)

Sign in to add a comment