
Issue 801467

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Verify/Reset running while an autotest job is active.

Project Member Reported by kinaba@chromium.org, Jan 12 2018

Issue description

Splitting out from b/70796610#comment30.
@Ilja, who should we loop in to the bug?


I occasionally see tests failing because the autotest server ran reset or verify while the test job was still running.
How can we avoid that?



(1)
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/164569602-chromeos-test/chromeos6-row4-rack7-host8/

According to the autotest DEBUG log, the test failed in the middle because the DUT's Android container was disconnected (around 18:43:20 UTC):

12/22 10:43:12.776 DEBUG|             utils:0280| [stdout] 12-22 10:43:12 W/CompatibilityTest: Inaccurate runtime hint for armeabi-v7a CtsThemeHostTestCases, expected 5h 2m 34s was 2s
12/22 10:43:12.777 DEBUG|             utils:0280| [stdout] 12-22 10:43:12 W/ModuleDef: Skipping Preparer: com.android.compatibility.common.tradefed.targetprep.DynamicConfigPusher since it is not in the whitelist [com.android.compatibility.common.tradefed.targetprep.ApkInstaller, com.android.compatibility.common.tradefed.targetprep.FilePusher]
12/22 10:43:25.148 DEBUG|             utils:0280| [stdout] 12-22 10:43:25 E/Device: Error during Sync: Connection reset by peer
12/22 10:43:25.149 DEBUG|             utils:0280| [stdout] 12-22 10:43:25 W/AndroidNativeDevice: InstallException (Connection reset by peer cause: IOException (Connection reset by peer)) when attempting install /tmp/autotest-tradefed-install_rhBmyF/1a367466d2d00eb06cf726981035dde1/android-cts-7.1_r12-linux_x86-arm/android-cts/tools/../../android-cts/testcases/CtsMediaTestCases.apk on device chromeos6-row4-rack7-host8:22

According to /var/log/messages, what happened at that point in time was the autotest "reset" (18:43:21 UTC):

2017-12-22T18:43:21.290974+00:00 INFO sshd[8309]: Accepted publickey for root from 127.0.0.1 port 44672 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-12-22T18:43:21.786237+00:00 NOTICE ag[8314]: autotest server[stack::create_host|job_start|run] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)
2017-12-22T18:43:22.134803+00:00 NOTICE ag[8318]: autotest server[stack::reset|cleanup|run] -> ssh_run(rm -f /var/lib/cleanup_logs_paused)
2017-12-22T18:43:22.516600+00:00 NOTICE ag[8322]: autotest server[stack::_restart_ui|_is_factory_image|run] -> ssh_run([ -f /root/.factory_test ])
2017-12-22T18:43:22.874709+00:00 NOTICE ag[8324]: autotest server[stack::_restart_ui|get_chrome_session_ident|run] -> ssh_run(bootstat_get_last login-prompt-visible)
2017-12-22T18:43:23.214173+00:00 NOTICE ag[8329]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)



(2)
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167542725-chromeos-test/chromeos6-row4-rack2-host7/

Similarly, this test failed around 16:13:00 UTC because of disconnection,

01/04 08:12:38.954 DEBUG|             utils:0280| [stdout] 01-04 08:12:38 I/ConsoleReporter: [40955/40955 armeabi-v7a CtsDeqpTestCases chromeos6-row4-rack2-host7:22] dEQP-GLES3.functional.default_vertex_array_object#vertex_attrib_divisor pass
01/04 08:12:38.955 DEBUG|             utils:0280| [stdout] 01-04 08:12:38 I/ConsoleReporter: [chromeos6-row4-rack2-host7:22] armeabi-v7a CtsDeqpTestCases completed in 14s. 40955 passed, 0 failed, 0 not executed
01/04 08:12:38.955 DEBUG|             utils:0280| [stdout] 01-04 08:12:38 W/CompatibilityTest: Inaccurate runtime hint for armeabi-v7a CtsDeqpTestCases, expected 1h 15m 0s was 15s
01/04 08:13:06.086 DEBUG|             utils:0280| [stdout] 01-04 08:13:06 E/Device: Error during Sync: Connection reset by peer
01/04 08:13:06.087 DEBUG|             utils:0280| [stdout] 01-04 08:13:06 W/AndroidNativeDevice: InstallException (Connection reset by peer cause: IOException (Connection reset by peer)) when attempting install /tmp/autotest-tradefed-install_s77zHx/1a367466d2d00eb06cf726981035dde1/android-cts-7.1_r12-linux_x86-arm/android-cts/tools/../../android-cts/testcases/CtsHardwareTestCases.apk on device chromeos6-row4-rack2-host7:22

and the corresponding /var/log/messages entries show a verify/reboot by the autotest server.

2018-01-04T16:12:32.327630+00:00 NOTICE ag[9067]: autotest server[stack::wait_for_system_services|upstart_status|run] -> ssh_run(status system-services)
2018-01-04T16:12:32.903876+00:00 NOTICE ag[9070]: autotest server[stack::verify_software|_is_factory_image|run] -> ssh_run([ -f /root/.factory_test ])
2018-01-04T16:12:33.351721+00:00 NOTICE ag[9073]: autotest server[stack::verify|verify_software|run] -> ssh_run(update_engine_client --status)
2018-01-04T16:12:34.179247+00:00 NOTICE ag[9077]: autotest server[stack::get_release_version|_get_lsb_release_content|run] -> ssh_run(cat "/etc/lsb-release")
2018-01-04T16:12:34.748581+00:00 NOTICE ag[9080]: autotest server[stack::_verify_host|verify|run] -> ssh_run(crossystem hwid)
2018-01-04T16:12:35.306865+00:00 NOTICE ag[9084]: autotest server[stack::_verify_host|verify|run] -> ssh_run(vpd -g serial_number)
2018-01-04T16:12:35.321075+00:00 DEBUG kernel: [  471.371887] SELinux: initialized (dev mtd_inodefs, type mtd_inodefs), not configured for labeling
2018-01-04T16:12:35.996411+00:00 NOTICE ag[9089]: autotest server[stack::get_board|parse_cmd_output|run] -> ssh_run(cat /etc/lsb-release)
2018-01-04T16:12:36.566558+00:00 NOTICE ag[9093]: autotest server[stack::reboot|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2018-01-04T16:12:37.129806+00:00 NOTICE ag[9096]: autotest server[stack::reboot|run_background|run] -> ssh_run(( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!)
...
2018-01-04T16:12:40.741053+00:00 NOTICE pre-shutdown[9220]: Shutting down for reboot: unknown-reason
2018-01-04T08:12:45.820720-08:00 INFO kernel: [    0.000000] Booting Linux on physical CPU 0x0



(Other instances)
https://bugs.chromium.org/p/chromium/issues/detail?id=776997#c22
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/166963328-chromeos-test/chromeos2-row4-rack11-host12/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167450149-chromeos-test/chromeos4-row6-rack9-host1/
 

Comment 4 by ihf@chromium.org, Jan 25 2018

Components: -Test Infra>Client>ChromeOS
I looked for the job that ran after 170461379 above:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/170461413-chromeos-test/chromeos2-row6-rack9-host11/sysinfo/

I think it is pretty clear that StartAndroid was running

2018-01-17T08:56:57.758667-08:00 NOTICE root[6918]: autotest starting iteration /usr/local/autotest/results/default/cheets_StartAndroid.stress/sysinfo/iteration.1 on reks_2.6GHz_4GB

and about two minutes into the third Android start iteration, a verify took over the DUT, restarted the UI, and killed the test.

2018-01-17T08:58:51.661011-08:00 NOTICE ag[13283]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)

---

Notice the build is reks-release/R64-10176.53.0. I found the waterfall entry, nothing suspicious, looks green (!?)
https://luci-milo.appspot.com/buildbot/chromeos_release/reks-release%20release-R64-10176.B/52
(Notice build #52 but R64-10176.53.0.)

I looked up the parent job for 170461379
http://cautotest/afe/#tab_id=view_job&object_id=170461157
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/170461157-chromeos-test/hostless/debug/

Here it gets interesting:
01/17 01:16:19.584 DEBUG|             suite:1120| Scheduling cheets_StartAndroid.stress.1
01/17 01:16:20.044 DEBUG|             suite:1390| Adding job keyval for cheets_StartAndroid.stress.1=170461379-chromeos-test

[...]

01/17 02:25:18.586 INFO |        server_job:0218| END GOOD	170461298-chromeos-test/chromeos2-row6-rack9-host22/cheets_StartAndroid.stress	cheets_StartAndroid.stress	timestamp=1516184591	localtime=Jan 17 02:23:11	
01/17 02:25:18.586 DEBUG|             suite:1390| Adding job keyval for cheets_StartAndroid.stress=170461298-chromeos-test
01/17 02:31:32.361 ERROR|                db:0023| 02:31:32 01/17/18: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
01/17 08:53:45.244 ERROR|                db:0023| 08:53:45 01/17/18: An operational error occurred during a database operation: (1053, 'Server shutdown in progress'); retrying, don't panic yet
01/17 08:53:45.245 ERROR|             suite:1283| Exception waiting for results
Traceback (most recent call last):
  File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1278, in wait
    for result in waiter.wait_for_results():
  File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 157, in wait_for_results
    for result in _yield_job_results(self._afe, self._tko, job):
  File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 194, in _yield_job_results
    statuses = tko.get_job_test_statuses_from_db(job.id)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/frontend.py", line 157, in get_job_test_statuses_from_db
    for entry in self._db.select(','.join(fields), table, (where, None)):
  File "/usr/local/autotest/tko/db.py", line 305, in select
    return self.run_with_retry(exec_sql)
  File "/usr/local/autotest/tko/db.py", line 164, in run_with_retry
    result = function(*args, **dargs)
  File "/usr/local/autotest/tko/db.py", line 297, in exec_sql
    numRec = self.cur.execute(sql, values)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
OperationalError: (1053, 'Server shutdown in progress')
01/17 08:53:45.310 INFO |        server_job:0218| FAIL	----	bvt-perbuild	timestamp=1516208025	localtime=Jan 17 08:53:45	Exception waiting for results
01/17 08:53:45.396 DEBUG|     dynamic_suite:0624| Finished waiting on suite. Returning from _perform_reimage_and_run.

cautotest indeed shows a long pause in jobs running on the DUT from 2am to 9am.
viceroy also shows idleness on chromeos-server100 during that timeframe on Jan 17th.

---

So I guess we can summarize that shards have a race when running jobs, especially when other problems are already happening. Note that, according to the logs, the server was alive during that time and did not reboot.

Comment 5 by ihf@chromium.org, Jan 25 2018

Notice the reason for
https://luci-milo.appspot.com/buildbot/chromeos_release/reks-release%20release-R64-10176.B/52
being green is that the failure is in bvt-perbuild, which is async. So that part is expected.

Comment 6 by kinaba@chromium.org, Jan 25 2018

Cc: akes...@chromium.org dshi@chromium.org jrbarnette@chromium.org
adding more eyes

Comment 7 by jkop@chromium.org, Jan 25 2018

Cc: jkop@chromium.org
Cc: ayatane@chromium.org
I just took a deeper look at the logs for the first cited example.
That's this test:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=164569602

It ran on chromeos6-row4-rack7-host8; here's the DUT's relevant history:
    2017-12-22 10:46:20  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack7-host8/2099679-reset/
    2017-12-22 10:39:08  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/164569602-chromeos-test/
    2017-12-22 10:19:27  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack7-host8/2099545-provision/

I snagged the autotest.DEBUG and /var/log/messages content from the
test job, filtered both files to contain only the ssh commands executed,
and sifted through the content to remove every command executed on the
client that was also present in the server logs.

The attached "collation.txt" file contains all of the
client commands that the server did _not_ log, in time order.  That
content seems to correspond exactly to one complete Reset task.  Here's
the timeline:
    10:40:28 - Test logs begin
    10:43:21 - First Reset task command logged on the client
    10:43:46 - Last Reset task command logged on the client
    10:45:39 - Test logs end

This isn't caused by a race condition.  There's a rogue shard or
drone executing jobs on the DUT at the same time as the DUT's
designated shard is doing the legitimate work.
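The collation step described above can be sketched roughly as follows. This is a hypothetical minimal version, not the actual script used for collation.txt: it extracts the `ssh_run(...)` payload from each log line and keeps the client-side commands that never appear in the server-side log. The sample log lines are simplified stand-ins for real autoserv.DEBUG and /var/log/messages entries.

```python
import re

# Matches the ssh_run(<command>) payload at the end of a log line.
SSH_RUN = re.compile(r'ssh_run\((.*)\)\s*$')

def extract_commands(lines):
    """Return the ssh_run() payloads found in an iterable of log lines."""
    cmds = []
    for line in lines:
        m = SSH_RUN.search(line)
        if m:
            cmds.append(m.group(1))
    return cmds

def rogue_commands(client_lines, server_lines):
    """Client-logged commands (in time order) with no match in the server log."""
    server = set(extract_commands(server_lines))
    return [c for c in extract_commands(client_lines) if c not in server]

# Simplified example lines (hypothetical formats).
client = [
    "NOTICE ag[8329]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)",
    "NOTICE ag[8330]: autotest server[stack::is_up|ssh_ping|run] -> ssh_run(true)",
]
server = [
    "DEBUG| ssh_host:0100| ssh_run(true)",
]
print(rogue_commands(client, server))  # ['stop ui; start ui']
```

This naive set-difference would mis-handle commands that legitimately repeat; matching on (timestamp window, command) pairs would be closer to what the collation needs.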

collation.txt (3.5 KB)
The leading theory here would be that this was caused by a
shard migration:  The old shard allowed work to continue or
start after the new shard had already taken control.  I'm
prepared to believe that if we have data that can correlate
the failures with known events where we migrated a shard.

However, it's not clear such an event could exist.  The `dut-status`
command shows that chromeos6-row4-rack7-host8 was running
jobs on its current shard as far back as 2017-12-20, a full 48 hours
before the failure incident.  That's a pretty long time for an
errant shard to keep trying to schedule work on DUTs it no longer
owns.

How often are shard migrations happening?
I'm seeing test failures with these exotic reset tasks recorded in /var/log/messages on a more-than-daily basis. I'm wondering if the theory alone can explain the frequency.

Comment 12 by jkop@google.com, Jan 30 2018

This week and last, on a daily basis. Possibly next week as well; we have to migrate many of our shards.
Issue 812938 has been merged into this issue.
Issue 812991 has been merged into this issue.
Re #12: thanks, I see.

What will be the next action we can take on this issue?
This is creeping into every test and is confusing the failure triaging process a lot.

Comment 16 by jkop@chromium.org, Feb 20 2018

Cc: nxia@chromium.org dgarr...@chromium.org
All shards are migrated as of last Tuesday. If this is ongoing, then it's not because of that.
This is still ongoing

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178150767-chromeos-test/chromeos6-row2-rack17-host15/

video_WebRtcPerf.DEBUG (PST)
02/20 17:41:57.658 DEBUG|    cros_interface:0058| sh -c cryptohome-path user 'test@test.test' 
02/20 17:41:57.664 DEBUG|      global_hooks:0056| ['sh', '-c', "cryptohome-path user 'test@test.test'"]
02/20 17:41:57.689 DEBUG|    cros_interface:0067|  > stdout=[/home/user/5351d5a6240a3e64a2641839ea363c9efeda77ab
], stderr=[]
02/20 17:41:57.690 DEBUG|    cros_interface:0058| sh -c /bin/df --output=source,target /home/user/5351d5a6240a3e64a2641839ea363c9efeda77ab 
02/20 17:41:57.691 DEBUG|      global_hooks:0056| ['sh', '-c', '/bin/df --output=source,target /home/user/5351d5a6240a3e64a2641839ea363c9efeda77ab']
02/20 17:41:57.738 DEBUG|    cros_interface:0067|  > stdout=[Filesystem     Mounted on
/dev/mmcblk0p1 /home
], stderr=[]
02/20 17:41:57.739 INFO |           browser:0224| Browser is closed.
02/20 17:41:57.742 INFO |     helper_logger:0095| videobuf2 log disable : echo 0 > /sys/module/videobuf2_core/parameters/debug
02/20 17:41:57.743 DEBUG|      global_hooks:0056| 'echo 0 > /sys/module/videobuf2_core/parameters/debug'
02/20 17:41:57.757 DEBUG|              test:0410| Test failed due to Devtools target crashed

/var/log/messages (UTC)
2018-02-21T01:41:54.197054+00:00 NOTICE ag[14024]: autotest server[stack::reset|cleanup|run] -> ssh_run(rm -f /var/lib/cleanup_logs_paused)
2018-02-21T01:41:54.518976+00:00 NOTICE ag[14033]: autotest server[stack::_restart_ui|_is_factory_image|run] -> ssh_run([ -f /root/.factory_test ])
2018-02-21T01:41:55.228164+00:00 NOTICE ag[14146]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)
2018-02-21T01:41:55.914334+00:00 NOTICE ag[14181]: autotest server[stack::get_board|parse_cmd_output|run] -> ssh_run(cat /etc/lsb-release)
2018-02-21T01:41:56.237623+00:00 NOTICE ag[14202]: autotest server[stack::reboot|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2018-02-21T01:41:56.555568+00:00 NOTICE ag[14205]: autotest server[stack::reboot|run_background|run] -> ssh_run(( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!)
2018-02-21T01:41:56.905049+00:00 NOTICE ag[14209]: autotest server[stack::wait_down|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/177145207-chromeos-test/chromeos2-row8-rack2-host11/graphics_WebGLAquarium

.DEBUG
02/16 15:03:31.217 ERROR|           browser:0055| Failed with DevtoolsTargetCrashException while starting the browser backend.

messages
2018-02-16T23:03:29.627229+00:00 NOTICE ag[4231]: autotest server[stack::create_host|job_start|run] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)
2018-02-16T23:03:29.985168+00:00 NOTICE ag[4234]: autotest server[stack::reset|cleanup|run] -> ssh_run(rm -f /var/lib/cleanup_logs_paused)
2018-02-16T23:03:30.351954+00:00 NOTICE ag[4239]: autotest server[stack::_restart_ui|_is_factory_image|run] -> ssh_run([ -f /root/.factory_test ])
2018-02-16T23:03:30.743635+00:00 NOTICE ag[4242]: autotest server[stack::_restart_ui|get_chrome_session_ident|run] -> ssh_run(bootstat_get_last login-prompt-visible)
2018-02-16T23:03:31.110149+00:00 NOTICE ag[4247]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178176543-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178143713-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178143497-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178141958-chromeos-test
Is there any way to see what IP address the "rogue" commands are coming from?
> Is there any way to see what IP address the "rogue" commands are coming from?

Good question.  I think so.  IIRC, sshd logs the source of ssh
requests; if so, we should be able to match the unexpected commands
up against the logged client IP addresses.
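That matching could be sketched like this: pull the source IP out of each sshd "Accepted publickey" line, then correlate the unexpected `ag[...]` commands against those login timestamps. A hedged, hypothetical sketch (the function and sample list names are made up for illustration):

```python
import re

# Matches a syslog line of the form:
#   <timestamp> INFO sshd[<pid>]: Accepted publickey for <user> from <ip> port ...
ACCEPT = re.compile(
    r'^(\S+) INFO sshd\[\d+\]: Accepted publickey for \S+ from (\S+) port')

def login_sources(lines):
    """Return (timestamp, source_ip) for each accepted ssh login."""
    out = []
    for line in lines:
        m = ACCEPT.match(line)
        if m:
            out.append((m.group(1), m.group(2)))
    return out

# Sample line taken from the logs quoted in this bug.
sample = [
    "2017-12-22T18:43:21.290974+00:00 INFO sshd[8309]: Accepted publickey "
    "for root from 127.0.0.1 port 44672 ssh2: RSA SHA256:...",
]
print(login_sources(sample))  # [('2017-12-22T18:43:21.290974+00:00', '127.0.0.1')]
```

As the next reply in the thread notes, on ARC++ boards this only ever yields 127.0.0.1 because of ssh muxing, so the approach works only on boards without it.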
> [ ... ] IIRC, sshd logs the source of ssh requests [ ... ]

Alas, the logging doesn't help.  The failing tests are CTS
tests.  More importantly, the tests are failing on ARC++-enabled
software.  On ARC++, ssh must be muxed to accommodate ssh connections
to the Android container.  That means that when sshd logs a connection,
it always logs the IP address as localhost.  Here's an example:

2017-12-22T18:43:21.290974+00:00 INFO sshd[8309]: Accepted publickey for root from 127.0.0.1 port 44672 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-12-22T18:43:21.786237+00:00 NOTICE ag[8314]: autotest server[stack::create_host|job_start|run] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)

That's one of the rogue commands, but note how the reported client
is 127.0.0.1.  We'd need to adjust how logging is done in order to
learn more.
Cc: pprabhu@chromium.org shuqianz@chromium.org
Then the best guess is that it's one of the shards we have recently turned down.

Can we go ahead and shut them down all the way?

Also, cc'ing Charlene, since she runs some private servers that might also be doing this unexpectedly.
I returned all the ganeti instances for shards as I turned them down. (modulo the possibility that I missed one or that one of the "return this instance" requests failed)
Owner: shuqianz@chromium.org
Hum. Passing to Charlene to see if any of her machines could be the issue.
Owner: ----
I do have one private shard, but it doesn't have any hosts or run any jobs. It is used to test some puppet changes. I've shut down the server. Fyi, the hostname of the server is shuqianz-test-shard.hot.corp.google.com and ip is 100.109.170.37. However, I don't think this server caused the issue discussed in the bug :/
Owner: dgarr...@chromium.org
Owner: jrbarnette@chromium.org
I don't have any other ideas where this is coming from.

Perhaps Richard has ideas about how to improve the relevant logging?
> Perhaps Richard has ideas about how to improve the relevant logging?

Improving the logging means changes to Chrome OS; that should be a
separate bug.  Given time, I can rustle up what to ask for, and
who to ask, but off the top of my head, all I know is that
"sshd says 127.0.0.1", and that it's because of the ARC container.

IIRC the issue is being observed on non-ARC boards as well, which I hope are running without sslh muxing.
Let me dig through the logs and pick out some instances...

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178326213-chromeos-test/chromeos4-row7-rack3-host9/ on Winky.

100.109.25.142 looks to be operating the test normally, and 100.108.189.41 is breaking in.


/sysinfo/messages (logs till 2018-02-21T14:40:09) + /video_VideoCorruption/sysinfo/messages (logs after 2018-02-21T14:40:11):

2018-02-21T14:39:46.942509-08:00 INFO sshd[16608]: Accepted publickey for root from 100.109.25.142 port 32862 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2018-02-21T14:39:47.528973-08:00 NOTICE ag[16611]: autotest server[stack::create_host|job_start|run] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)
2018-02-21T14:39:48.119618-08:00 NOTICE ag[16614]: autotest server[stack::parallel_simple|get_network_stats|run] -> ssh_run(route; echo SEPARATOR; cat /proc/net/dev)
2018-02-21T14:39:50.590189-08:00 NOTICE ag[16620]: autotest server[stack::get_kernel_ver|must_be_installed|run] -> ssh_run(ls /bin/uname >/dev/null 2>&1)
2018-02-21T14:39:50.995209-08:00 NOTICE ag[16623]: autotest server[stack::log_kernel|get_kernel_ver|run] -> ssh_run(/bin/uname -r)
2018-02-21T14:39:51.403341-08:00 NOTICE ag[16626]: autotest server[stack::is_up|ssh_ping|run] -> ssh_run(true)
2018-02-21T14:39:51.841670-08:00 NOTICE ag[16628]: autotest server[stack::get_install_dir|get_installed_autodir|run] -> ssh_run(test -x /usr/local/autotest/bin/autotest)
2018-02-21T14:39:52.253301-08:00 NOTICE ag[16630]: autotest server[stack::get_install_dir|get_installed_autodir|run] -> ssh_run(test -w /usr/local/autotest)
2018-02-21T14:39:52.661707-08:00 NOTICE ag[16632]: autotest server[stack::install|_install|run] -> ssh_run(mkdir -p /usr/local/autotest)
2018-02-21T14:39:53.078007-08:00 NOTICE ag[16635]: autotest server[stack::install|_install|run] -> ssh_run(rm -rf /usr/local/autotest/results/*)
2018-02-21T14:39:53.494520-08:00 NOTICE ag[16638]: autotest server[stack::_install|_install_using_packaging|run] -> ssh_run(cd /usr/local/autotest && ls | grep -v "^packages$" | grep -v "^result_tools$" | xargs rm -rf && rm -rf .[!.]*)
2018-02-21T14:39:53.962123-08:00 NOTICE ag[16646]: autotest server[stack::install_pkg|_run_command|run] -> ssh_run(mkdir -p /usr/local/autotest/packages)
2018-02-21T14:39:54.384490-08:00 NOTICE ag[16649]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -d /usr/local/autotest/packages)
2018-02-21T14:39:54.798034-08:00 NOTICE ag[16651]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -e /usr/local/autotest/packages/client-autotest.tar.bz2)
2018-02-21T14:39:55.212770-08:00 NOTICE ag[16653]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -e /usr/local/autotest/packages.checksum)
2018-02-21T14:39:55.626232-08:00 NOTICE ag[16655]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -d /usr/local/autotest)
2018-02-21T14:39:56.049504-08:00 NOTICE ag[16657]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -e /usr/local/autotest/packages.checksum)
2018-02-21T14:39:56.468787-08:00 NOTICE ag[16659]: autotest server[stack::fetch_pkg_file|_run_command|run] -> ssh_run(curl --connect-timeout 15 --retry 5 --retry-delay 5 -s http://100.115.219.133:8082/static/winky-release/R66-10425.0.0/autotest/packages/packages.checksum -o /usr/local/autotest/packages.checksum)
2018-02-21T14:39:56.888743-08:00 NOTICE ag[16662]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -e /usr/local/autotest/packages.checksum)
2018-02-21T14:39:57.294949-08:00 NOTICE ag[16664]: autotest server[stack::_get_checksum_dict|_run_command|run] -> ssh_run(cat /usr/local/autotest/packages.checksum)
2018-02-21T14:39:57.708796-08:00 NOTICE ag[16667]: autotest server[stack::compute_checksum|_run_command|run] -> ssh_run(md5sum /usr/local/autotest/packages/client-autotest.tar.bz2 )
2018-02-21T14:39:58.299679-08:00 NOTICE ag[16670]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -d /usr/local/autotest)
2018-02-21T14:39:58.718624-08:00 NOTICE ag[16672]: autotest server[stack::untar_required|_run_command|run] -> ssh_run(cat /usr/local/autotest/.checksum)
2018-02-21T14:39:59.121685-08:00 NOTICE ag[16675]: autotest server[stack::install_pkg|_run_command|run] -> ssh_run(mkdir -p /usr/local/autotest)
2018-02-21T14:39:59.532106-08:00 NOTICE ag[16678]: autotest server[stack::untar_pkg|_run_command|run] -> ssh_run(tar --no-same-owner -xjf /usr/local/autotest/packages/client-autotest.tar.bz2 -C /usr/local/autotest)
2018-02-21T14:40:03.428198-08:00 NOTICE ag[16683]: autotest server[stack::compute_checksum|_run_command|run] -> ssh_run(md5sum /usr/local/autotest/packages/client-autotest.tar.bz2 )
2018-02-21T14:40:03.874548-08:00 NOTICE ag[16686]: autotest server[stack::untar_pkg|_run_command|run] -> ssh_run(echo "9459a8e1b904e7bf370a7b68e4caf88e" > /usr/local/autotest/.checksum )
2018-02-21T14:40:04.306944-08:00 NOTICE ag[16688]: autotest server[stack::is_up|ssh_ping|run] -> ssh_run(true)
2018-02-21T14:40:04.716408-08:00 NOTICE ag[16691]: autotest server[stack::_do_run|verify_machine|run] -> ssh_run(ls /usr/local/autotest/bin/autotest > /dev/null 2>&1)
2018-02-21T14:40:05.132213-08:00 NOTICE ag[16694]: autotest server[stack::_do_run|verify_machine|run] -> ssh_run(umount /usr/local/autotest/tmp)
2018-02-21T14:40:05.548054-08:00 NOTICE ag[16697]: autotest server[stack::_do_run|verify_machine|run] -> ssh_run(umount /usr/local/autotest/tests/download)
2018-02-21T14:40:05.971849-08:00 NOTICE ag[16700]: autotest server[stack::run|_do_run|run] -> ssh_run(rm -f /usr/local/autotest/control.autoserv;rm -f /usr/local/autotest/control.autoserv.state;rm -f /usr/local/autotest/control;rm -f /usr/local/autotest/control.state)
2018-02-21T14:40:06.389025-08:00 NOTICE ag[16706]: autotest server[stack::use_rsync|check_rsync|run] -> ssh_run(rsync --version)
2018-02-21T14:40:07.658400-08:00 NOTICE ag[16713]: autotest server[stack::execute_control|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2018-02-21T14:40:08.063871-08:00 NOTICE ag[16716]: autotest server[stack::_execute_daemon|get_tmp_dir|run] -> ssh_run(mkdir -p /tmp)
2018-02-21T14:40:08.477038-08:00 NOTICE ag[16719]: autotest server[stack::_execute_daemon|get_tmp_dir|run] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX)
2018-02-21T14:40:09.309876-08:00 NOTICE ag[16724]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(nohup /usr/local/autotest/bin/autotestd /tmp/autoserv-hFinTc -H autoserv --verbose --hostname=chromeos4-row3-rack13-host11 --user=chromeos-test /usr/local/autotest/control.autoserv >/dev/null 2>/dev/null &)
2018-02-21T14:40:09.724029-08:00 NOTICE ag[16729]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-hFinTc 0 0)
...
...
2018-02-21T14:40:11.403128-08:00 NOTICE root[16785]: autotest starting iteration /usr/local/autotest/results/default/video_VideoCorruption/sysinfo/iteration.1 on winky_2.6GHz_2GB
...
...
2018-02-21T14:40:23.469856-08:00 INFO sshd[17361]: Accepted publickey for root from 100.108.189.41 port 36722 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2018-02-21T14:40:23.972470-08:00 INFO session_manager[16943]: [INFO:child_exit_handler.cc(77)] Handling 17380 exit.
2018-02-21T14:40:24.013084-08:00 INFO session_manager[16943]: [INFO:policy_store.cc(89)] Persisted policy to disk, path: /var/lib/whitelist/policy.1
2018-02-21T14:40:24.034583-08:00 NOTICE ag[17534]: autotest server[stack::create_host|job_start|run] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)
2018-02-21T14:40:24.203680-08:00 INFO session_manager[16943]: [INFO:child_exit_handler.cc(77)] Handling 17532 exit.
2018-02-21T14:40:24.456946-08:00 NOTICE ag[17611]: autotest server[stack::reset|cleanup|run] -> ssh_run(rm -f /var/lib/cleanup_logs_paused)
2018-02-21T14:40:24.863954-08:00 NOTICE ag[17614]: autotest server[stack::_restart_ui|_is_factory_image|run] -> ssh_run([ -f /root/.factory_test ])
2018-02-21T14:40:25.268173-08:00 NOTICE ag[17616]: autotest server[stack::_restart_ui|get_chrome_session_ident|run] -> ssh_run(bootstat_get_last login-prompt-visible)
2018-02-21T14:40:25.684096-08:00 NOTICE ag[17621]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)
2018-02-21T14:40:25.745559-08:00 INFO session_manager[16943]: [INFO:browser_job.cc(164)] Terminating process: exiting cleanly
2018-02-21T14:40:25.745589-08:00 INFO session_manager[16943]: [INFO:system_utils_impl.cc(94)] Sending 15 to 17025 as 1000
2018-02-21T14:40:25.745646-08:00 INFO session_manager[16943]: [INFO:system_utils_impl.cc(94)] Sending 15 to 17282 as 1000
2018-02-21T14:40:26.342761-08:00 INFO session_manager[16943]: [INFO:session_manager_service.cc(494)] SessionManagerService quitting run loop
2018-02-21T14:40:26.343330-08:00 INFO session_manager[16943]: [INFO:session_manager_service.cc(203)] SessionManagerService exiting
2018-02-21T14:40:26.389172-08:00 INFO chapsd[1007]: Token at /home/root/0a5e9f9ff1cac2c2752b8b8c82515a9ceeeada7a/chaps has been removed from slot 1
2018-02-21T14:40:26.425683-08:00 INFO chapsd[1007]: Unloaded keys for slot 1
2018-02-21T14:40:26.469782-08:00 NOTICE ag[17664]: autotest server[stack::get_board|parse_cmd_output|run] -> ssh_run(cat /etc/lsb-release)
2018-02-21T14:40:26.874551-08:00 NOTICE ag[17673]: autotest server[stack::reboot|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2018-02-21T14:40:27.280968-08:00 NOTICE ag[17676]: autotest server[stack::reboot|run_background|run] -> ssh_run(( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!)
2018-02-21T14:40:27.592057-08:00 WARNING chapsd[1007]: Unload Token event received for unknown path: /home/root/0a5e9f9ff1cac2c2752b8b8c82515a9ceeeada7a/chaps
2018-02-21T14:40:27.639757-08:00 INFO imageloader[17684]: The latest-version file does not exist. Component PepperFlashPlayer is probably not installed.
2018-02-21T14:40:27.682934-08:00 NOTICE ag[17743]: autotest server[stack::wait_down|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2018-02-21T14:40:27.789266-08:00 WARNING session_manager[17785]: [WARNING:chrome_setup.cc(86)] Could not find both paths: /usr/share/chromeos-assets/wallpaper/oem_large.jpg and /usr/share/chromeos-assets/wallpaper/oem_small.jpg

Comment 30 by jkop@chromium.org, Feb 22 2018

Cc: -jkop@chromium.org

Comment 32 by ihf@chromium.org, Feb 22 2018

Nice find!

Is cros-full-0027 still unused or is it part of the master? It looks like it became very active at about 14:20-14:40 today, which corresponds to the log in #29.

https://viceroy.corp.google.com/chromeos/machines?hostname=cros-full-0027&duration=1382399&mdb_role=chrome-infra

I poked on cros-full-0027 and found a repair log for chromeos6-row1-rack7-host21/62495156-repair/debug (banjo - which is supposed to be assigned to cros-full-0018?)

http://cros-full-0018.mtv.corp.google.com/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/debug/

Why would two drones try to repair the same DUT at about the same time?

Comment 33 by ihf@chromium.org, Feb 22 2018

root@cros-full-0027:/usr/local/autotest/chromeos6-row1-rack7-host21/62495156-repair/debug# more autoserv.DEBUG 
02/21 19:13:13.186 INFO |          autoserv:0685| Results placed in /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair
02/21 19:13:13.186 DEBUG|          autoserv:0693| autoserv is running in drone cros-full-0027.mtv.corp.google.com.
02/21 19:13:13.186 DEBUG|          autoserv:0694| autoserv command was: /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair -m chromeos6-row1-rack7-host21 --verbose --lab True -R --host-protection NO_PROTECTION
02/21 19:13:13.186 INFO |           pidfile:0016| Logged pid 13938 to /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute
02/21 19:13:13.187 DEBUG|          autoserv:0383| faulthandler registered on SIGTERM.
02/21 19:13:13.188 DEBUG|          base_job:0357| Persistent state global_properties.test_retry now set to 0
02/21 19:13:13.188 DEBUG|          base_job:0357| Persistent state global_properties.tag now set to ''
02/21 19:13:13.507 DEBUG|         host_info:0274| Refreshing HostInfo using store AfeStore[chromeos6-row1-rack7-host21]
02/21 19:13:13.507 DEBUG|         host_info:0275| Old host_info: None
02/21 19:13:13.816 DEBUG|         host_info:0283| New host_info: HostInfo[Labels: ['internal_display', 'bluetooth', 'touchpad', 'ec:cros', 'hw_video_acc_enc_h264', 'os:cros', 'sku:banjo_intel_celeron_n2830_2Gb', 'power:battery', 'variant:banjo', 'storage:mmc', 'cts_abi_arm', 'phase:PVT', 'hw_jpeg_acc_dec', 'cts_abi_x86', 'hw_video_acc_h264', 'board:banjo', 'webcam', 'banjo', 'pool:suites', 'model:banjo', '4k_video_h264', 'audio_loopback_dongle'], Attributes: {'serial_number': 'N9G15WW003517005CF7600', 'HWID': 'BANJO C7A-C6I-A4O'}]
02/21 19:13:13.816 INFO |        server_job:1599| Shadowing AFE store with a FileStore at /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753
02/21 19:13:13.817 DEBUG|         host_info:0239| Committing HostInfo to store FileStore[/usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753]
02/21 19:13:13.817 DEBUG|         host_info:0243| HostInfo updated to: HostInfo[Labels: ['internal_display', 'bluetooth', 'touchpad', 'ec:cros', 'hw_video_acc_enc_h264', 'os:cros', 'sku:banjo_intel_celeron_n2830_2Gb', 'power:battery', 'variant:banjo', 'storage:mmc', 'cts_abi_arm', 'phase:PVT', 'hw_jpeg_acc_dec', 'cts_abi_x86', 'hw_video_acc_h264', 'board:banjo', 'webcam', 'banjo', 'pool:suites', 'model:banjo', '4k_video_h264', 'audio_loopback_dongle'], Attributes: {'serial_number': 'N9G15WW003517005CF7600', 'HWID': 'BANJO C7A-C6I-A4O'}]
02/21 19:13:13.817 DEBUG|          base_job:0357| Persistent state global_properties.fast now set to False
02/21 19:13:13.818 DEBUG|          base_job:0357| Persistent state global_properties.max_result_size_KB now set to 20000
02/21 19:13:13.841 INFO |    connectionpool:0207| Starting new HTTP connection (1): metadata.google.internal
02/21 19:13:14.258 NOTIC|      cros_logging:0038| ts_mon was set up.
02/21 19:13:14.260 INFO |        server_job:0218| START	----	repair	timestamp=1519269194	localtime=Feb 21 19:13:14	
02/21 19:13:14.281 DEBUG|         host_info:0274| Refreshing HostInfo using store ShadowingStore[AfeStore[chromeos6-row1-rack7-host21], FileStore[/usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753]]
[...]
02/21 19:18:27.279 INFO |      crashcollect:0269| Waiting 0.001 hours for chromeos6-row1-rack7-host21 to come up (Feb 21 19:14:48)
02/21 19:18:27.288 DEBUG|          ssh_host:0301| Running (ssh) 'true' from 'wait_for_machine_to_recover|wait_up|is_up|ssh_ping|run|run_very_slowly'
02/21 19:18:27.291 INFO |     ssh_multiplex:0080| Master ssh connection to chromeos6-row1-rack7-host21 is down.
02/21 19:18:27.291 DEBUG|     ssh_multiplex:0122| Nuking ssh master_job
02/21 19:18:27.291 DEBUG|     ssh_multiplex:0127| Cleaning ssh master_tempdir
02/21 19:18:27.292 INFO |     ssh_multiplex:0096| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_ZvDS7yssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row1-rack7-host21'
02/21 19:18:27.292 DEBUG|             utils:0214| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_ZvDS7yssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row1-rack7-host21'

Comment 34 by ihf@chromium.org, Feb 22 2018

02/21 19:18:13.832 INFO |          autoserv:0685| Results placed in /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812
02/21 19:18:13.832 DEBUG|          autoserv:0693| autoserv is running in drone cros-full-0018.mtv.corp.google.com.
02/21 19:18:13.832 DEBUG|          autoserv:0694| autoserv command was: /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812 -m chromeos6-row1-rack7-host21 --verbose --lab True -R --host-protection NO_PROTECTION
02/21 19:18:13.832 INFO |           pidfile:0016| Logged pid 181204 to /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/.autoserv_execute
02/21 19:18:13.833 DEBUG|          autoserv:0383| faulthandler registered on SIGTERM.
02/21 19:18:13.834 DEBUG|          base_job:0357| Persistent state global_properties.test_retry now set to 0
02/21 19:18:13.834 DEBUG|          base_job:0357| Persistent state global_properties.tag now set to ''
02/21 19:18:13.922 DEBUG|         host_info:0274| Refreshing HostInfo using store AfeStore[chromeos6-row1-rack7-host21]
02/21 19:18:13.922 DEBUG|         host_info:0275| Old host_info: None
02/21 19:18:13.956 DEBUG|         host_info:0283| New host_info: HostInfo[Labels: ['internal_display', 'bluetooth', 'touchpad', 'ec:cros', 'hw_video_acc_enc_h264', 'os:cros', 'sku:banjo_intel_celeron_n2830_2Gb', 'power:battery', 'variant:banjo', 'storage:mmc', 'cts_abi_arm', 'phase:PVT', 'hw_jpeg_acc_dec', 'cts_abi_x86', 'hw_video_acc_h264', 'board:banjo', 'webcam', 'banjo', 'pool:suites', 'model:banjo', '4k_video_h264', 'audio_loopback_dongle'], Attributes: {'serial_number': 'N9G15WW003517005CF7600', 'HWID': 'BANJO C7A-C6I-A4O'}]
02/21 19:18:13.957 INFO |        server_job:1599| Shadowing AFE store with a FileStore at /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/host_info_store/store_d777afd6-08e0-4349-bf61-c52ad541670c
02/21 19:18:13.957 DEBUG|         host_info:0239| Committing HostInfo to store FileStore[/usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/host_info_store/store_d777afd6-08e0-4349-bf61-c52ad541670c]
02/21 19:18:13.958 DEBUG|         host_info:0243| HostInfo updated to: HostInfo[Labels: ['internal_display', 'bluetooth', 'touchpad', 'ec:cros', 'hw_video_acc_enc_h264', 'os:cros', 'sku:banjo_intel_celeron_n2830_2Gb', 'power:battery', 'variant:banjo', 'storage:mmc', 'cts_abi_arm', 'phase:PVT', 'hw_jpeg_acc_dec', 'cts_abi_x86', 'hw_video_acc_h264', 'board:banjo', 'webcam', 'banjo', 'pool:suites', 'model:banjo', '4k_video_h264', 'audio_loopback_dongle'], Attributes: {'serial_number': 'N9G15WW003517005CF7600', 'HWID': 'BANJO C7A-C6I-A4O'}]
02/21 19:18:13.958 DEBUG|          base_job:0357| Persistent state global_properties.fast now set to False
02/21 19:18:13.958 DEBUG|          base_job:0357| Persistent state global_properties.max_result_size_KB now set to 20000
02/21 19:18:13.982 INFO |    connectionpool:0207| Starting new HTTP connection (1): metadata.google.internal
02/21 19:18:14.403 NOTIC|      cros_logging:0038| ts_mon was set up.
02/21 19:18:14.405 INFO |        server_job:0218| START	----	repair	timestamp=1519269494	localtime=Feb 21 19:18:14	
02/21 19:18:14.433 DEBUG|         host_info:0274| Refreshing HostInfo using store ShadowingStore[AfeStore[chromeos6-row1-rack7-host21], FileStore[/usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/host_info_store/store_d777afd6-08e0-4349-bf61-c52ad541670c]]
[...]
02/21 19:23:27.028 INFO |      crashcollect:0269| Waiting 0.001 hours for chromeos6-row1-rack7-host21 to come up (Feb 21 19:19:48)
[...]
02/21 19:27:59.584 DEBUG|             utils:0214| Running 'ping chromeos6-row1-rack7-host21 -w1 -c1'
02/21 19:28:00.667 DEBUG|             utils:0282| [stdout] PING chromeos6-row1-rack7-host21.cros.corp.google.com (100.115.128.77) 56(84) bytes of data.
02/21 19:28:00.667 DEBUG|             utils:0282| [stdout] 
02/21 19:28:00.667 DEBUG|             utils:0282| [stdout] --- chromeos6-row1-rack7-host21.cros.corp.google.com ping statistics ---
02/21 19:28:00.667 DEBUG|             utils:0282| [stdout] 2 packets transmitted, 0 received, 100% packet loss, time 999ms
02/21 19:28:00.667 DEBUG|             utils:0282| [stdout] 
02/21 19:28:00.668 ERROR|            repair:0354| Failed: host is available via ssh
[...]
02/21 19:28:01.294 INFO |            repair:0110| Skipping this operation: All host verification checks pass
02/21 19:28:01.294 DEBUG|            repair:0111| The following dependencies failed:
02/21 19:28:01.294 DEBUG|            repair:0113|     host is available via ssh
02/21 19:28:01.295 ERROR|            repair:0043| Repair failed due to Exception.
Traceback (most recent call last):
  File "/usr/local/autotest/server/control_segments/repair", line 38, in repair
    target.repair()
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1243, in repair
    self._repair_strategy.repair(self)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 774, in repair
    self._verify_root._verify_host(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 348, in _verify_host
    self._verify_dependencies(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 226, in _verify_dependencies
    self._verify_list(host, self._dependency_list, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 216, in _verify_list
    raise AutoservVerifyDependencyError(self, failures)
AutoservVerifyDependencyError: No answer to ping from chromeos6-row1-rack7-host21
02/21 19:28:01.295 INFO |        server_job:0218| END FAIL	----	repair	timestamp=1519270081	localtime=Feb 21 19:28:01	
02/21 19:28:01.296 DEBUG|     ssh_multiplex:0122| Nuking ssh master_job
02/21 19:28:01.296 DEBUG|     ssh_multiplex:0127| Cleaning ssh master_tempdir
02/21 19:28:01.296 DEBUG|             utils:0358| Running result_tools/utils on path: /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812
02/21 19:28:01.296 DEBUG|             utils:0361| Throttle result size to : 19 MB
02/21 19:28:01.298 DEBUG|             utils:0382| Directory summary of /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812 is saved to file /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/dir_summary_1519270081.json.
02/21 19:28:01.298 DEBUG|             utils:0390| Skip throttling /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812: size=86670, throttle_probability=236.2986039
02/21 19:28:01.301 ERROR|         traceback:0013| Traceback (most recent call last):
02/21 19:28:01.302 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 491, in run_autoserv
02/21 19:28:01.302 ERROR|         traceback:0013|     job.repair(job_labels)
02/21 19:28:01.302 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 579, in repair
02/21 19:28:01.302 ERROR|         traceback:0013|     self._execute_code(REPAIR_CONTROL_FILE, namespace, protect=False)
02/21 19:28:01.302 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 1396, in _execute_code
02/21 19:28:01.302 ERROR|         traceback:0013|     execfile(code_file, namespace, namespace)
02/21 19:28:01.302 ERROR|         traceback:0013|   File "/usr/local/autotest/server/control_segments/repair", line 53, in <module>
02/21 19:28:01.303 ERROR|         traceback:0013|     job.parallel_simple(repair, machines)
02/21 19:28:01.303 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 688, in parallel_simple
02/21 19:28:01.303 ERROR|         traceback:0013|     log=log, timeout=timeout, return_results=return_results)
02/21 19:28:01.303 ERROR|         traceback:0013|   File "/usr/local/autotest/server/subcommand.py", line 98, in parallel_simple
02/21 19:28:01.303 ERROR|         traceback:0013|     function(arg)
02/21 19:28:01.303 ERROR|         traceback:0013|   File "/usr/local/autotest/server/control_segments/repair", line 47, in repair
02/21 19:28:01.303 ERROR|         traceback:0013|     raise Exception('')
02/21 19:28:01.303 ERROR|         traceback:0013| Exception
02/21 19:28:01.313 INFO |            client:0570| Attempting refresh to obtain initial access_token
02/21 19:28:01.349 INFO |            client:0872| Refreshing access_token
02/21 19:28:02.192 ERROR|          autoserv:0759| Uncaught SystemExit with code 1
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 755, in main
    use_ssp)
  File "/usr/local/autotest/server/autoserv", line 562, in run_autoserv
    sys.exit(exit_code)
SystemExit: 1
02/21 19:28:02.260 DEBUG|   logging_manager:0627| Logging subprocess finished
02/21 19:28:02.262 DEBUG|   logging_manager:0627| Logging subprocess finished

Comment 36 by ihf@chromium.org, Feb 22 2018

Cc: davidri...@chromium.org
David, fyi in case you run into this.
I just did a spot check against the servers cited in #c31:

100.109.25.142 = cros-full-0002.mtv.corp.google.com
100.108.189.41 = cros-full-0027.mtv.corp.google.com

The database is down, so I can't confirm the actual configuration, but
from logs on cros-full-0002, it's a shard, and it's run lots of jobs
on chromeos4-row3-rack13-host11.

Checking cros-full-0027, it appears to be a drone, and there are
no log files that seem to mention chromeos4-row3-rack13-host11.  However,
there are signs that special tasks ran against that host:
    $ cd /usr/local/autotest/results
    $ ls -ld hosts/chromeos4-row3-rack13-host11
    drwxr-xr-x 2 chromeos-test eng 4096 Feb 21 14:41 hosts/chromeos4-row3-rack13-host11

Drones don't do anything on their own, so the behavior here is
that the master server scheduled a special task against a DUT
owned by a shard, for reasons as yet unknown.

OK.  Adding more evidence for the theory (but still no explanation).

On cautotest-prod (the master scheduler), you can see that the
master ran 6 special tasks against the cited DUT:

$ grep chromeos4-row3-rack13-host11 scheduler.log.2018-02-* | grep 'Starting: '
scheduler.log.2018-02-15-00.57.26:02/15 00:59:21.973 INFO |            models:2110| Starting: Special Task 62405503 (host chromeos4-row3-rack13-host11, task Reset, time 2018-02-15 00:58:42)
scheduler.log.2018-02-15-16.54.56:02/15 16:56:34.167 INFO |            models:2110| Starting: Special Task 62415955 (host chromeos4-row3-rack13-host11, task Reset, time 2018-02-15 16:56:11)
scheduler.log.2018-02-15-16.54.56:02/15 16:59:12.796 INFO |            models:2110| Starting: Special Task 62417141 (host chromeos4-row3-rack13-host11, task Repair, time 2018-02-15 16:59:05)
scheduler.log.2018-02-20-15.01.21:02/20 15:04:10.842 INFO |            models:2110| Starting: Special Task 62479158 (host chromeos4-row3-rack13-host11, task Reset, time 2018-02-20 15:02:41)
scheduler.log.2018-02-21-13.53.52:02/21 13:55:26.717 INFO |            models:2110| Starting: Special Task 62492180 (host chromeos4-row3-rack13-host11, task Reset, time 2018-02-21 13:55:05)
scheduler.log.2018-02-21-14.38.36:02/21 14:40:11.984 INFO |            models:2110| Starting: Special Task 62493576 (host chromeos4-row3-rack13-host11, task Reset, time 2018-02-21 14:39:44)

Of particular interest to me is what caused the 5-day hiatus between
2018-02-15 16:59:05 and 2018-02-20 15:02:41; at first blush it looks
like the master knew (for a while) that it no longer owned the DUT,
then somehow forgot that fact 5 days later.

Cc: hidehiko@chromium.org
Owner: ----
Status: Available (was: Untriaged)
OK.  Summary of the problem:
  * The DUT is owned by a shard.
  * Despite the shard ownership, the master continues to
    schedule reset tasks, over the space of a week or more.

Best theory is that the problem is caused by the process of
moving the DUT from one shard to the next.  However, even if
that's true, a week of failures is too much fallout to be
tolerable.  Moreover, there are symptoms not fully explained:
  * The master is scheduling Reset tasks.  This is normally
    only done prior to a test.  What are the tests being
    scheduled?
  * Why are the events so intermittent?

I'm not sure who should work on this debug, so I'm dropping this
from my plate for now.
Owner: akes...@chromium.org
Status: Assigned (was: Available)
Could I ask someone in the infra team to take priority on this?

Nowadays, 80% of the flaky failures I encounter while investigating ARC tests are due to this, which makes it very hard to diagnose other real failures.
friendly ping...

or, is there any pointer to documents that I can follow to debug the issue further by myself? (Not being an infra deputy I think I don't have access to scheduler logs etc.)
Owner: kinaba@chromium.org
Can you point out a DUT that recently suffered this, and the test that was affected by it? I will take a look afterwards, in line with jrbarnette's theory.
(reassign to me once answered)
Owner: akes...@chromium.org
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/180927784-chromeos-test/chromeos6-row2-rack11-host10/
  dut: chromeos6-row2-rack11-host10 (peach_pit)
  test: touch_TapSettings
  test-run-by: cros-full-0011 (100.109.25.135)
  reset-by: cros-full-0031 (100.108.189.50) (2018-03-05T06:27:00.001295+00:00)

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/180488013-chromeos-test/chromeos4-row4-rack9-host6/
  dut: chromeos4-row4-rack9-host6 (expresso)
  test: video_PlaybackPerf
  test-run-by: chromeos-server134.hot (100.109.175.140)
  reset-by: cros-full-0027 (100.108.189.41) 2018-03-02T19:05:11.591284+00:00

Some instances from ARC-enabled devices (though it's hard to tell who ssh'ed into them, due to #c20):
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/181314732-chromeos-test/chromeos2-row3-rack11-host1/ (2018-03-06T18:48:24.058288+00:00)
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/181310203-chromeos-test/chromeos6-row1-rack19-host15/ (2018-03-06T23:25:04.643351+00:00)
Investigating. cros-full-0027 and cros-full-0031 are both general-purpose drones (which are supposed to do work only for non-sharded boards), whereas both of those boards are sharded, so this is consistent with the theory that the master is going rogue and acting on DUTs that it shouldn't be acting on.
Investigating chromeos4-row4-rack9-host6.

Both cautotest/ and http://chromeos-server134.hot/ agree that the DUT lives on 134. Master should not be touching this. Digging into master logs now.
Master scheduler indeed has sporadic special tasks against that DUT, despite not owning the DUT.

e.g.
chromeos-test@cros-full-0036:/usr/local/autotest/logs$ grep 62621609 *
grep: dut-data: Is a directory
scheduler.log.2018-03-02-16.21.24:03/02 16:21:56.118 INFO |     drone_manager:0813| monitoring pidfile /usr/local/autotest/results/hosts/chromeos4-row4-rack9-host6/62621609-reset/.autoserv_execute
scheduler.log.2018-03-02-16.21.24:03/02 16:21:56.118 INFO |            models:2181| Starting: Special Task 62621609 (host chromeos4-row4-rack9-host6, task Reset, time 2018-03-02 16:21:49)


Investigating that special task closer:

MySQL [chromeos_autotest_db]> select * from afe_special_tasks where id=62621609 \G;
*************************** 1. row ***************************
             id: 62621609
        host_id: 1819
           task: Reset
 time_requested: 2018-03-02 16:21:49
      is_active: 0
    is_complete: 1
   time_started: 2018-03-02 16:21:56
 queue_entry_id: NULL
        success: 1
requested_by_id: 19
     is_aborted: 0
  time_finished: 2018-03-02 16:22:30
1 row in set (0.01 sec)


So, it's not associated with any hqe. Haven't tracked down where it was requested and how it ended up in master afe...
Some unverified leads:
There are two ways in which special tasks are created unrelated to HQE:

- As a result of a reset/cleanup being requested at the end of a job
- Scheduler itself goes around scheduling special tasks as part of its main loop (this for instance creates verify tasks for DUTs that are currently in REPAIR_FAILED state)

I think one of these isn't respecting the shard_id on the host.
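A minimal sketch of the check that seems to be missing, assuming afe_hosts rows carry a shard_id column that is NULL for master-owned DUTs (consistent with the shard discussion above); hosts_master_may_touch is a hypothetical helper, not actual scheduler code:

```python
def hosts_master_may_touch(hosts):
    """Filter out shard-owned hosts before the master creates special tasks.

    Each element of `hosts` is a dict-like row from afe_hosts; shard_id
    is None for hosts the master itself owns. Hypothetical helper
    illustrating the shard-ownership check described above.
    """
    return [h for h in hosts if h.get('shard_id') is None]
```

Either of the two task-creation paths above could run its candidate hosts through a filter like this before creating a SpecialTask row.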
chromeos2-row3-rack11-host1 appears to be locked by duenasa@chromium.org for deployment? Let's ignore that one for now, as the locking and deployment can easily be confounding the issue.

For chromeos6-row1-rack19-host15 both cautotest/ and cros-full-0023.mtv/ agree that it lives on cros-full-0023.mtv.

Master ran a special task against it recently anyway...

scheduler.log.2018-03-06-16.03.09:03/06 16:04:00.411 INFO |            models:2181| Starting: Special Task 62666251 (host chromeos6-row1-rack19-host15, task Reset, time 2018-03-06 16:03:37)
scheduler.log.2018-03-06-16.03.09:03/06 16:04:00.414 INFO |       prejob_task:0272| starting reset task for host: chromeos6-row1-rack19-host15
scheduler.log.2018-03-06-16.03.09:03/06 16:04:00.414 INFO |         rdb_hosts:0222| Host chromeos6-row1-rack19-host15 in Provisioning updating {'status': 'Resetting'} through rdb on behalf of: Task: Special Task 62666251 (host chromeos6-row1-rack19-host15, task Reset, time 2018-03-06 16:03:37) 
scheduler.log.2018-03-06-16.03.09:03/06 16:04:00.427 INFO |     drone_manager:0787| command = ['nice', '-n', '10', '/usr/local/autotest/server/autoserv', '-p', '-r', u'/usr/local/autotest/results/hosts/chromeos6-row1-rack19-host15/62666251-reset', '-m', u'chromeos6-row1-rack19-host15', '--verbose', '--lab', 'True', '--reset']
scheduler.log.2018-03-06-16.03.09:03/06 16:04:36.138 INFO |            models:2192| Finished: Special Task 62666251 (host chromeos6-row1-rack19-host15, task Reset, time 2018-03-06 16:03:37) (active)
scheduler.log.2018-03-06-16.03.09:03/06 16:04:36.142 INFO |     drone_manager:0825| forgetting pidfile /usr/local/autotest/results/hosts/chromeos6-row1-rack19-host15/62666251-reset/.autoserv_execute
scheduler.log.2018-03-06-16.03.09:03/06 16:04:36.145 INFO |         rdb_hosts:0222| Host chromeos6-row1-rack19-host15 in Resetting updating {'dirty': 0} through rdb on behalf of: Task: Special Task 62666251 (host chromeos6-row1-rack19-host15, task Reset, time 2018-03-06 16:03:37) 
scheduler.log.2018-03-06-16.03.09:03/06 16:04:36.147 INFO |         rdb_hosts:0222| Host chromeos6-row1-rack19-host15 in Resetting updating {'status': 'Ready'} through rdb on behalf of: Task: Special Task 62666251 (host chromeos6-row1-rack19-host15, task Reset, time 2018-03-06 16:03:37) 



#49 those sound like good leads.
*************************** 1. row ***************************
             id: 62666251
        host_id: 6631
           task: Reset
 time_requested: 2018-03-06 16:03:37
      is_active: 0
    is_complete: 1
   time_started: 2018-03-06 16:04:00
 queue_entry_id: NULL
        success: 1
requested_by_id: 19
     is_aborted: 0
  time_finished: 2018-03-06 16:04:36
1 row in set (0.01 sec)


Once again, no hqe. However, what is "requested_by_id: 19" about?
I believe that is a foreign key into afe_users, where entry 19 is autotest_system

MySQL [chromeos_autotest_db]> select * from afe_users where id=19 \G;
*************************** 1. row ***************************
               id: 19
            login: autotest_system
     access_level: 100
    reboot_before: 1
     reboot_after: 0
show_experimental: 0
     drone_set_id: NULL
1 row in set (0.01 sec)
This code in monitor_db looks suspicious... gotta find out where it is called. Doesn't seem to inherently filter by shard ownership.

    def _reverify_hosts_where(self, where,
                              print_message='Reverifying host %s'):
        full_where='locked = 0 AND invalid = 0 AND ' + where
        for host in scheduler_models.Host.fetch(where=full_where):
            if self.host_has_agent(host):
                # host has already been recovered in some way
                continue
            if self._host_has_scheduled_special_task(host):
                # host will have a special task scheduled on the next cycle
                continue
            if print_message:
                logging.error(print_message, host.hostname)
            try:
                user = models.User.objects.get(login='autotest_system')
            except models.User.DoesNotExist:
                user = models.User.objects.get(
                        id=self.DEFAULT_REQUESTED_BY_USER_ID)
            models.SpecialTask.objects.create(
                    task=models.SpecialTask.Task.RESET,
                    host=models.Host.objects.get(id=host.id),
                    requested_by=user)
scheduler.log.2018-03-06-15.24.12:03/06 15:24:40.743 ERROR|        monitor_db:0850| Recovering active host chromeos6-row1-rack19-host15 - this probably indicates a scheduler bug      

suggests we are calling into 

    def _reverify_remaining_hosts(self):
        # recover active hosts that have not yet been recovered, although this
        # should never happen
        message = ('Recovering active host %s - this probably indicates a '
                   'scheduler bug')
        self._reverify_hosts_where(
                "status IN ('Repairing', 'Verifying', 'Cleaning', 'Provisioning')",
                print_message=message)


Even on hosts that aren't owned by the master.
Pretty sure that _reverify_hosts_where should be modified such that

full_where='locked = 0 AND invalid = 0 AND shard_id = %my_id AND ' + where

But I'm not yet convinced that monitor_db actually knows its own shard id.
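As a sketch of that proposed change (a hypothetical helper, not the actual monitor_db code; it assumes the scheduler is told its own shard id, with None meaning a master that should restrict itself to unsharded hosts):

```python
def build_reverify_where(where, shard_id=None):
    """Build the full WHERE clause for _reverify_hosts_where.

    shard_id=None models a master scheduler, which should only touch
    hosts with no shard assignment; a shard would pass its own id.
    Illustrative sketch of the proposed filter, not production code.
    """
    clauses = ['locked = 0', 'invalid = 0']
    if shard_id is None:
        # A master owns only hosts that are not assigned to any shard.
        clauses.append('shard_id IS NULL')
    else:
        clauses.append('shard_id = %d' % shard_id)
    clauses.append(where)
    return ' AND '.join(clauses)
```

With a guard like this, the _reverify_remaining_hosts query above would no longer match shard-owned DUTs when run on the master.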
Labels: Chase-Pending
Project Member

Comment 59 by bugdroid1@chromium.org, Mar 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1f7d518223bfcd28138dad35326ae80560f19d00

commit 1f7d518223bfcd28138dad35326ae80560f19d00
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Mar 08 21:26:49 2018

autotest: scheduler: if master, do not run special tasks on shard hosts

BUG= chromium:801467 
TEST=None

Change-Id: I1a8bf1719f898edfe99674fdd5fb09c9322cd7ba
Reviewed-on: https://chromium-review.googlesource.com/953370
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/1f7d518223bfcd28138dad35326ae80560f19d00/scheduler/monitor_db.py

Waiting on prod push.
Labels: -Chase-Pending
Believed fixed, awaiting push.
Owner: xixuan@chromium.org
Status: Assigned (was: Started)
-> xixuan who will orchestrate the push, then this can be closed.
Status: Fixed (was: Assigned)
Fix in #59 is pushed. Close this.
Amazing! Thank you all! I'll monitor the tree for a week or so to see the outcome of the fix.
