Verify/Reset running while an autotest job is active.
Issue description

Splitting out from b/70796610#comment30. @Ilja, who should we loop in to the bug?

I occasionally see tests failing because the autotest server ran Reset or Verify while the test job was still running. How can we avoid that?

(1) https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/164569602-chromeos-test/chromeos6-row4-rack7-host8/

According to the autotest DEBUG log, the test failed in the middle because the DUT Android container was disconnected (around 18:43:20 UTC):

12/22 10:43:12.776 DEBUG| utils:0280| [stdout] 12-22 10:43:12 W/CompatibilityTest: Inaccurate runtime hint for armeabi-v7a CtsThemeHostTestCases, expected 5h 2m 34s was 2s
12/22 10:43:12.777 DEBUG| utils:0280| [stdout] 12-22 10:43:12 W/ModuleDef: Skipping Preparer: com.android.compatibility.common.tradefed.targetprep.DynamicConfigPusher since it is not in the whitelist [com.android.compatibility.common.tradefed.targetprep.ApkInstaller, com.android.compatibility.common.tradefed.targetprep.FilePusher]
12/22 10:43:25.148 DEBUG| utils:0280| [stdout] 12-22 10:43:25 E/Device: Error during Sync: Connection reset by peer
12/22 10:43:25.149 DEBUG| utils:0280| [stdout] 12-22 10:43:25 W/AndroidNativeDevice: InstallException (Connection reset by peer cause: IOException (Connection reset by peer)) when attempting install /tmp/autotest-tradefed-install_rhBmyF/1a367466d2d00eb06cf726981035dde1/android-cts-7.1_r12-linux_x86-arm/android-cts/tools/../../android-cts/testcases/CtsMediaTestCases.apk on device chromeos6-row4-rack7-host8:22

According to /var/log/messages, what happened at that time was the autotest "reset" (18:43:21 UTC):

2017-12-22T18:43:21.290974+00:00 INFO sshd[8309]: Accepted publickey for root from 127.0.0.1 port 44672 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-12-22T18:43:21.786237+00:00 NOTICE ag[8314]: autotest server[stack::create_host|job_start|run] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)
2017-12-22T18:43:22.134803+00:00 NOTICE ag[8318]: autotest server[stack::reset|cleanup|run] -> ssh_run(rm -f /var/lib/cleanup_logs_paused)
2017-12-22T18:43:22.516600+00:00 NOTICE ag[8322]: autotest server[stack::_restart_ui|_is_factory_image|run] -> ssh_run([ -f /root/.factory_test ])
2017-12-22T18:43:22.874709+00:00 NOTICE ag[8324]: autotest server[stack::_restart_ui|get_chrome_session_ident|run] -> ssh_run(bootstat_get_last login-prompt-visible)
2017-12-22T18:43:23.214173+00:00 NOTICE ag[8329]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)

(2) https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167542725-chromeos-test/chromeos6-row4-rack2-host7/

Similarly, this test failed around 16:13:00 UTC because of a disconnection:

01/04 08:12:38.954 DEBUG| utils:0280| [stdout] 01-04 08:12:38 I/ConsoleReporter: [40955/40955 armeabi-v7a CtsDeqpTestCases chromeos6-row4-rack2-host7:22] dEQP-GLES3.functional.default_vertex_array_object#vertex_attrib_divisor pass
01/04 08:12:38.955 DEBUG| utils:0280| [stdout] 01-04 08:12:38 I/ConsoleReporter: [chromeos6-row4-rack2-host7:22] armeabi-v7a CtsDeqpTestCases completed in 14s. 40955 passed, 0 failed, 0 not executed
01/04 08:12:38.955 DEBUG| utils:0280| [stdout] 01-04 08:12:38 W/CompatibilityTest: Inaccurate runtime hint for armeabi-v7a CtsDeqpTestCases, expected 1h 15m 0s was 15s
01/04 08:13:06.086 DEBUG| utils:0280| [stdout] 01-04 08:13:06 E/Device: Error during Sync: Connection reset by peer
01/04 08:13:06.087 DEBUG| utils:0280| [stdout] 01-04 08:13:06 W/AndroidNativeDevice: InstallException (Connection reset by peer cause: IOException (Connection reset by peer)) when attempting install /tmp/autotest-tradefed-install_s77zHx/1a367466d2d00eb06cf726981035dde1/android-cts-7.1_r12-linux_x86-arm/android-cts/tools/../../android-cts/testcases/CtsHardwareTestCases.apk on device chromeos6-row4-rack2-host7:22

and the corresponding /var/log/messages log shows verify/reboot by the autotest server:

2018-01-04T16:12:32.327630+00:00 NOTICE ag[9067]: autotest server[stack::wait_for_system_services|upstart_status|run] -> ssh_run(status system-services)
2018-01-04T16:12:32.903876+00:00 NOTICE ag[9070]: autotest server[stack::verify_software|_is_factory_image|run] -> ssh_run([ -f /root/.factory_test ])
2018-01-04T16:12:33.351721+00:00 NOTICE ag[9073]: autotest server[stack::verify|verify_software|run] -> ssh_run(update_engine_client --status)
2018-01-04T16:12:34.179247+00:00 NOTICE ag[9077]: autotest server[stack::get_release_version|_get_lsb_release_content|run] -> ssh_run(cat "/etc/lsb-release")
2018-01-04T16:12:34.748581+00:00 NOTICE ag[9080]: autotest server[stack::_verify_host|verify|run] -> ssh_run(crossystem hwid)
2018-01-04T16:12:35.306865+00:00 NOTICE ag[9084]: autotest server[stack::_verify_host|verify|run] -> ssh_run(vpd -g serial_number)
2018-01-04T16:12:35.321075+00:00 DEBUG kernel: [ 471.371887] SELinux: initialized (dev mtd_inodefs, type mtd_inodefs), not configured for labeling
2018-01-04T16:12:35.996411+00:00 NOTICE ag[9089]: autotest server[stack::get_board|parse_cmd_output|run] -> ssh_run(cat /etc/lsb-release)
2018-01-04T16:12:36.566558+00:00 NOTICE ag[9093]: autotest server[stack::reboot|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2018-01-04T16:12:37.129806+00:00 NOTICE ag[9096]: autotest server[stack::reboot|run_background|run] -> ssh_run(( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!)
...
2018-01-04T16:12:40.741053+00:00 NOTICE pre-shutdown[9220]: Shutting down for reboot: unknown-reason
2018-01-04T08:12:45.820720-08:00 INFO kernel: [ 0.000000] Booting Linux on physical CPU 0x0

(Other instances)
https://bugs.chromium.org/p/chromium/issues/detail?id=776997#c22
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/166963328-chromeos-test/chromeos2-row4-rack11-host12/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167450149-chromeos-test/chromeos4-row6-rack9-host1/
,
Jan 24 2018
,
Jan 24 2018
The majority of the cheets_StartAndroid.stress "Android did not boot!" flakiness looks to be coming from this.
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/170461379-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167536814-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167450160-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167381785-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167377073-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167358319-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167120924-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/167116686-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/166714869-chromeos-test
,
Jan 25 2018
I looked for the job that ran after 170461379 above:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/170461413-chromeos-test/chromeos2-row6-rack9-host11/sysinfo/

I think it is pretty clear that StartAndroid was running:

2018-01-17T08:56:57.758667-08:00 NOTICE root[6918]: autotest starting iteration /usr/local/autotest/results/default/cheets_StartAndroid.stress/sysinfo/iteration.1 on reks_2.6GHz_4GB

and at about two minutes into the third Android starting iteration, a verify took over the DUT, restarted the UI, and killed the test:

2018-01-17T08:58:51.661011-08:00 NOTICE ag[13283]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)

---

Notice the build is reks-release/R64-10176.53.0. I found the waterfall entry; nothing suspicious, looks green (!?)
https://luci-milo.appspot.com/buildbot/chromeos_release/reks-release%20release-R64-10176.B/52
(Notice build #52 but R64-10176.53.0.)

I looked up the parent job for 170461379:
http://cautotest/afe/#tab_id=view_job&object_id=170461157
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/170461157-chromeos-test/hostless/debug/

Here it gets interesting:

01/17 01:16:19.584 DEBUG| suite:1120| Scheduling cheets_StartAndroid.stress.1
01/17 01:16:20.044 DEBUG| suite:1390| Adding job keyval for cheets_StartAndroid.stress.1=170461379-chromeos-test
[...]
01/17 02:25:18.586 INFO | server_job:0218| END GOOD 170461298-chromeos-test/chromeos2-row6-rack9-host22/cheets_StartAndroid.stress cheets_StartAndroid.stress timestamp=1516184591 localtime=Jan 17 02:23:11
01/17 02:25:18.586 DEBUG| suite:1390| Adding job keyval for cheets_StartAndroid.stress=170461298-chromeos-test
01/17 02:31:32.361 ERROR| db:0023| 02:31:32 01/17/18: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
01/17 08:53:45.244 ERROR| db:0023| 08:53:45 01/17/18: An operational error occurred during a database operation: (1053, 'Server shutdown in progress'); retrying, don't panic yet
01/17 08:53:45.245 ERROR| suite:1283| Exception waiting for results
Traceback (most recent call last):
  File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1278, in wait
    for result in waiter.wait_for_results():
  File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 157, in wait_for_results
    for result in _yield_job_results(self._afe, self._tko, job):
  File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 194, in _yield_job_results
    statuses = tko.get_job_test_statuses_from_db(job.id)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/frontend.py", line 157, in get_job_test_statuses_from_db
    for entry in self._db.select(','.join(fields), table, (where, None)):
  File "/usr/local/autotest/tko/db.py", line 305, in select
    return self.run_with_retry(exec_sql)
  File "/usr/local/autotest/tko/db.py", line 164, in run_with_retry
    result = function(*args, **dargs)
  File "/usr/local/autotest/tko/db.py", line 297, in exec_sql
    numRec = self.cur.execute(sql, values)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
OperationalError: (1053, 'Server shutdown in progress')
01/17 08:53:45.310 INFO | server_job:0218| FAIL ---- bvt-perbuild timestamp=1516208025 localtime=Jan 17 08:53:45 Exception waiting for results
01/17 08:53:45.396 DEBUG| dynamic_suite:0624| Finished waiting on suite. Returning from _perform_reimage_and_run.

cautotest indeed shows a long pause in jobs running on the DUT from 2am to 9am. viceroy also shows idleness on chromeos-server100 during that timeframe on Jan 17th.

---

So I guess we can summarize that shards have a race running jobs, especially when there are already other problems happening. Notice that the server was alive during that time according to the logs and did not reboot.
,
Jan 25 2018
Note that the reason https://luci-milo.appspot.com/buildbot/chromeos_release/reks-release%20release-R64-10176.B/52 is green is that the failure is in bvt-perbuild, which is async. So that part is expected.
,
Jan 25 2018
adding more eyes
,
Jan 25 2018
,
Jan 26 2018
,
Jan 29 2018
I just took a deeper look at the logs for the first cited example.
That's this test:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=164569602
It ran on chromeos6-row4-rack7-host8; here's the DUT's relevant history:
2017-12-22 10:46:20 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack7-host8/2099679-reset/
2017-12-22 10:39:08 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/164569602-chromeos-test/
2017-12-22 10:19:27 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack7-host8/2099545-provision/
I snagged the autotest.DEBUG and /var/log/messages content from the
test job, filtered both files down to only the ssh commands executed,
and sifted through the content to remove every command executed on the
client that was also present in the server logs.
The attached "collation.txt" file contains all of the client
commands that the server did _not_ log, in time order. That
content seems to correspond exactly to one complete Reset task. Here's
the timeline:
10:40:28 - Test logs begin
10:43:21 - First Reset task command logged on the client
10:43:46 - Last Reset task command logged on the client
10:45:39 - Test logs end
This isn't caused by a race condition. There's a rogue shard or
drone executing jobs on the DUT at the same time as the DUT's
designated shard is doing the legitimate work.
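For reference, the collation described above can be mechanized. Here is a minimal Python sketch, assuming the two logs are saved locally as autotest.DEBUG and messages; the two marker patterns ("ssh_run(...)" on the client, "Running (ssh) '...'" on the server) are taken from the log excerpts quoted elsewhere in this bug:

import re

# Marker patterns taken from the log excerpts quoted in this bug:
# client-side /var/log/messages quotes each command as "ssh_run(...)";
# server-side autotest.DEBUG quotes it as "Running (ssh) '...'".
CLIENT_CMD = re.compile(r"ssh_run\((.*)\)\s*$")
SERVER_CMD = re.compile(r"Running \(ssh\) '(.*?)'")

def commands(path, pattern):
    """Return the ssh commands matched by pattern, in file order."""
    found = []
    with open(path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                found.append(m.group(1))
    return found

server = set(commands('autotest.DEBUG', SERVER_CMD))

# Keep only the client-side commands the server never logged; in this
# incident they lined up exactly with one complete Reset task.
for cmd in commands('messages', CLIENT_CMD):
    if cmd not in server:
        print(cmd)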
,
Jan 29 2018
The leading theory here would be that this was caused by a shard migration: the old shard allowed work to continue or start after the new shard had already taken control. I'm prepared to believe that if we have data that can correlate the failures with known events where we migrated a shard. However, it's not clear such an event exists. The `dut-status` command shows that chromeos6-row4-rack7-host8 was running jobs on its current shard as far back as 2017-12-20, a full 48 hours before the failure incident. That's a pretty long time for an errant shard to keep trying to schedule work on DUTs it no longer owns.
,
Jan 30 2018
How often are shard migrations happening? I'm seeing test failures with these exotic reset tasks recorded in /var/log/messages on more than a daily basis. I'm wondering whether the theory alone can explain the frequency.
,
Jan 30 2018
This week and last, on a daily basis. Possibly next week as well; we have to migrate many of our shards.
,
Feb 19 2018
Issue 812938 has been merged into this issue.
,
Feb 19 2018
Issue 812991 has been merged into this issue.
,
Feb 19 2018
Re #12: thanks, I see. What will be the next action we can take on this issue? This is creeping into every test and is confusing the failure triaging process a lot.
,
Feb 20 2018
All shards are migrated as of last Tuesday. If this is ongoing, then it's not because of that.
,
Feb 21 2018
This is still ongoing.

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178150767-chromeos-test/chromeos6-row2-rack17-host15/

video_WebRtcPerf.DEBUG (PST):

02/20 17:41:57.658 DEBUG| cros_interface:0058| sh -c cryptohome-path user 'test@test.test'
02/20 17:41:57.664 DEBUG| global_hooks:0056| ['sh', '-c', "cryptohome-path user 'test@test.test'"]
02/20 17:41:57.689 DEBUG| cros_interface:0067| > stdout=[/home/user/5351d5a6240a3e64a2641839ea363c9efeda77ab ], stderr=[]
02/20 17:41:57.690 DEBUG| cros_interface:0058| sh -c /bin/df --output=source,target /home/user/5351d5a6240a3e64a2641839ea363c9efeda77ab
02/20 17:41:57.691 DEBUG| global_hooks:0056| ['sh', '-c', '/bin/df --output=source,target /home/user/5351d5a6240a3e64a2641839ea363c9efeda77ab']
02/20 17:41:57.738 DEBUG| cros_interface:0067| > stdout=[Filesystem Mounted on /dev/mmcblk0p1 /home ], stderr=[]
02/20 17:41:57.739 INFO | browser:0224| Browser is closed.
02/20 17:41:57.742 INFO | helper_logger:0095| videobuf2 log disable : echo 0 > /sys/module/videobuf2_core/parameters/debug
02/20 17:41:57.743 DEBUG| global_hooks:0056| 'echo 0 > /sys/module/videobuf2_core/parameters/debug'
02/20 17:41:57.757 DEBUG| test:0410| Test failed due to Devtools target crashed

/var/log/messages (UTC):

2018-02-21T01:41:54.197054+00:00 NOTICE ag[14024]: autotest server[stack::reset|cleanup|run] -> ssh_run(rm -f /var/lib/cleanup_logs_paused)
2018-02-21T01:41:54.518976+00:00 NOTICE ag[14033]: autotest server[stack::_restart_ui|_is_factory_image|run] -> ssh_run([ -f /root/.factory_test ])
2018-02-21T01:41:55.228164+00:00 NOTICE ag[14146]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)
2018-02-21T01:41:55.914334+00:00 NOTICE ag[14181]: autotest server[stack::get_board|parse_cmd_output|run] -> ssh_run(cat /etc/lsb-release)
2018-02-21T01:41:56.237623+00:00 NOTICE ag[14202]: autotest server[stack::reboot|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2018-02-21T01:41:56.555568+00:00 NOTICE ag[14205]: autotest server[stack::reboot|run_background|run] -> ssh_run(( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!)
2018-02-21T01:41:56.905049+00:00 NOTICE ag[14209]: autotest server[stack::wait_down|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/177145207-chromeos-test/chromeos2-row8-rack2-host11/

graphics_WebGLAquarium.DEBUG:

02/16 15:03:31.217 ERROR| browser:0055| Failed with DevtoolsTargetCrashException while starting the browser backend.

messages:

2018-02-16T23:03:29.627229+00:00 NOTICE ag[4231]: autotest server[stack::create_host|job_start|run] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)
2018-02-16T23:03:29.985168+00:00 NOTICE ag[4234]: autotest server[stack::reset|cleanup|run] -> ssh_run(rm -f /var/lib/cleanup_logs_paused)
2018-02-16T23:03:30.351954+00:00 NOTICE ag[4239]: autotest server[stack::_restart_ui|_is_factory_image|run] -> ssh_run([ -f /root/.factory_test ])
2018-02-16T23:03:30.743635+00:00 NOTICE ag[4242]: autotest server[stack::_restart_ui|get_chrome_session_ident|run] -> ssh_run(bootstat_get_last login-prompt-visible)
2018-02-16T23:03:31.110149+00:00 NOTICE ag[4247]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178176543-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178143713-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178143497-chromeos-test
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178141958-chromeos-test
,
Feb 21 2018
Is there any way to see what IP address the "rogue" commands are coming from?
,
Feb 21 2018
> Is there any way to see what IP address the "rogue" commands are coming from?

Good question. I think so. IIRC, sshd logs the source of ssh requests; if so, we should be able to match the unexpected commands up against the logged client IP addresses.
,
Feb 21 2018
> [ ... ] IIRC, sshd logs the source of ssh requests [ ... ]

Alas, the logging doesn't help. The failing tests are CTS tests. More importantly, the tests are failing on ARC++ enabled software. On ARC++, ssh must be muxed to accommodate ssh connections to the Android container. That means that when sshd logs a connection, it always logs the IP address as localhost. Here's an example:

2017-12-22T18:43:21.290974+00:00 INFO sshd[8309]: Accepted publickey for root from 127.0.0.1 port 44672 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-12-22T18:43:21.786237+00:00 NOTICE ag[8314]: autotest server[stack::create_host|job_start|run] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)

That's one of the rogue commands, but note how the reported client is 127.0.0.1. We'd need to adjust how logging is done in order to learn more.
,
Feb 21 2018
Then the best guess is that it's one of the shards we have recently turned down. Can we go ahead and shut them down all the way? Also, cc'ing Charlene, since she runs some private servers that might also be doing this unexpectedly.
,
Feb 21 2018
I returned all the ganeti instances for shards as I turned them down. (modulo the possibility that I missed one or that one of the "return this instance" requests failed)
,
Feb 21 2018
Hum. Passing to Charlene to see if any of her machines could be the issue.
,
Feb 21 2018
I do have one private shard, but it doesn't have any hosts or run any jobs. It is used to test some puppet changes. I've shut down the server. Fyi, the hostname of the server is shuqianz-test-shard.hot.corp.google.com and ip is 100.109.170.37. However, I don't think this server caused the issue discussed in the bug :/
,
Feb 21 2018
,
Feb 21 2018
I don't have any other ideas where this is coming from. Perhaps Richard has ideas about how to improve the relevant logging?
,
Feb 21 2018
> Perhaps Richard has ideas about how to improve the relevant logging?

Improving the logging means changes to Chrome OS; that should be a separate bug. Given time, I can rustle up what to ask for and who to ask, but off the top of my head, all I know is that "sshd says 127.0.0.1", and that it's because of the ARC container.
,
Feb 22 2018
IIRC the issue is being observed on non-ARC boards as well, which I hope are running without sslh muxing. Let me dig through the logs and pick out some instances...
,
Feb 22 2018
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178326213-chromeos-test/chromeos4-row7-rack3-host9/ on Winky.

100.109.25.142 looks to be operating the test normally, and 100.108.189.41 is breaking in.

/sysinfo/messages (logs till 2018-02-21T14:40:09) + /video_VideoCorruption/sysinfo/messages (logs after 2018-02-21T14:40:11):

2018-02-21T14:39:46.942509-08:00 INFO sshd[16608]: Accepted publickey for root from 100.109.25.142 port 32862 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2018-02-21T14:39:47.528973-08:00 NOTICE ag[16611]: autotest server[stack::create_host|job_start|run] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)
2018-02-21T14:39:48.119618-08:00 NOTICE ag[16614]: autotest server[stack::parallel_simple|get_network_stats|run] -> ssh_run(route; echo SEPARATOR; cat /proc/net/dev)
2018-02-21T14:39:50.590189-08:00 NOTICE ag[16620]: autotest server[stack::get_kernel_ver|must_be_installed|run] -> ssh_run(ls /bin/uname >/dev/null 2>&1)
2018-02-21T14:39:50.995209-08:00 NOTICE ag[16623]: autotest server[stack::log_kernel|get_kernel_ver|run] -> ssh_run(/bin/uname -r)
2018-02-21T14:39:51.403341-08:00 NOTICE ag[16626]: autotest server[stack::is_up|ssh_ping|run] -> ssh_run(true)
2018-02-21T14:39:51.841670-08:00 NOTICE ag[16628]: autotest server[stack::get_install_dir|get_installed_autodir|run] -> ssh_run(test -x /usr/local/autotest/bin/autotest)
2018-02-21T14:39:52.253301-08:00 NOTICE ag[16630]: autotest server[stack::get_install_dir|get_installed_autodir|run] -> ssh_run(test -w /usr/local/autotest)
2018-02-21T14:39:52.661707-08:00 NOTICE ag[16632]: autotest server[stack::install|_install|run] -> ssh_run(mkdir -p /usr/local/autotest)
2018-02-21T14:39:53.078007-08:00 NOTICE ag[16635]: autotest server[stack::install|_install|run] -> ssh_run(rm -rf /usr/local/autotest/results/*)
2018-02-21T14:39:53.494520-08:00 NOTICE ag[16638]: autotest server[stack::_install|_install_using_packaging|run] -> ssh_run(cd /usr/local/autotest && ls | grep -v "^packages$" | grep -v "^result_tools$" | xargs rm -rf && rm -rf .[!.]*)
2018-02-21T14:39:53.962123-08:00 NOTICE ag[16646]: autotest server[stack::install_pkg|_run_command|run] -> ssh_run(mkdir -p /usr/local/autotest/packages)
2018-02-21T14:39:54.384490-08:00 NOTICE ag[16649]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -d /usr/local/autotest/packages)
2018-02-21T14:39:54.798034-08:00 NOTICE ag[16651]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -e /usr/local/autotest/packages/client-autotest.tar.bz2)
2018-02-21T14:39:55.212770-08:00 NOTICE ag[16653]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -e /usr/local/autotest/packages.checksum)
2018-02-21T14:39:55.626232-08:00 NOTICE ag[16655]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -d /usr/local/autotest)
2018-02-21T14:39:56.049504-08:00 NOTICE ag[16657]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -e /usr/local/autotest/packages.checksum)
2018-02-21T14:39:56.468787-08:00 NOTICE ag[16659]: autotest server[stack::fetch_pkg_file|_run_command|run] -> ssh_run(curl --connect-timeout 15 --retry 5 --retry-delay 5 -s http://100.115.219.133:8082/static/winky-release/R66-10425.0.0/autotest/packages/packages.checksum -o /usr/local/autotest/packages.checksum)
2018-02-21T14:39:56.888743-08:00 NOTICE ag[16662]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -e /usr/local/autotest/packages.checksum)
2018-02-21T14:39:57.294949-08:00 NOTICE ag[16664]: autotest server[stack::_get_checksum_dict|_run_command|run] -> ssh_run(cat /usr/local/autotest/packages.checksum)
2018-02-21T14:39:57.708796-08:00 NOTICE ag[16667]: autotest server[stack::compute_checksum|_run_command|run] -> ssh_run(md5sum /usr/local/autotest/packages/client-autotest.tar.bz2 )
2018-02-21T14:39:58.299679-08:00 NOTICE ag[16670]: autotest server[stack::exists|_run_command|run] -> ssh_run(test -d /usr/local/autotest)
2018-02-21T14:39:58.718624-08:00 NOTICE ag[16672]: autotest server[stack::untar_required|_run_command|run] -> ssh_run(cat /usr/local/autotest/.checksum)
2018-02-21T14:39:59.121685-08:00 NOTICE ag[16675]: autotest server[stack::install_pkg|_run_command|run] -> ssh_run(mkdir -p /usr/local/autotest)
2018-02-21T14:39:59.532106-08:00 NOTICE ag[16678]: autotest server[stack::untar_pkg|_run_command|run] -> ssh_run(tar --no-same-owner -xjf /usr/local/autotest/packages/client-autotest.tar.bz2 -C /usr/local/autotest)
2018-02-21T14:40:03.428198-08:00 NOTICE ag[16683]: autotest server[stack::compute_checksum|_run_command|run] -> ssh_run(md5sum /usr/local/autotest/packages/client-autotest.tar.bz2 )
2018-02-21T14:40:03.874548-08:00 NOTICE ag[16686]: autotest server[stack::untar_pkg|_run_command|run] -> ssh_run(echo "9459a8e1b904e7bf370a7b68e4caf88e" > /usr/local/autotest/.checksum )
2018-02-21T14:40:04.306944-08:00 NOTICE ag[16688]: autotest server[stack::is_up|ssh_ping|run] -> ssh_run(true)
2018-02-21T14:40:04.716408-08:00 NOTICE ag[16691]: autotest server[stack::_do_run|verify_machine|run] -> ssh_run(ls /usr/local/autotest/bin/autotest > /dev/null 2>&1)
2018-02-21T14:40:05.132213-08:00 NOTICE ag[16694]: autotest server[stack::_do_run|verify_machine|run] -> ssh_run(umount /usr/local/autotest/tmp)
2018-02-21T14:40:05.548054-08:00 NOTICE ag[16697]: autotest server[stack::_do_run|verify_machine|run] -> ssh_run(umount /usr/local/autotest/tests/download)
2018-02-21T14:40:05.971849-08:00 NOTICE ag[16700]: autotest server[stack::run|_do_run|run] -> ssh_run(rm -f /usr/local/autotest/control.autoserv;rm -f /usr/local/autotest/control.autoserv.state;rm -f /usr/local/autotest/control;rm -f /usr/local/autotest/control.state)
2018-02-21T14:40:06.389025-08:00 NOTICE ag[16706]: autotest server[stack::use_rsync|check_rsync|run] -> ssh_run(rsync --version)
2018-02-21T14:40:07.658400-08:00 NOTICE ag[16713]: autotest server[stack::execute_control|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2018-02-21T14:40:08.063871-08:00 NOTICE ag[16716]: autotest server[stack::_execute_daemon|get_tmp_dir|run] -> ssh_run(mkdir -p /tmp)
2018-02-21T14:40:08.477038-08:00 NOTICE ag[16719]: autotest server[stack::_execute_daemon|get_tmp_dir|run] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX)
2018-02-21T14:40:09.309876-08:00 NOTICE ag[16724]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(nohup /usr/local/autotest/bin/autotestd /tmp/autoserv-hFinTc -H autoserv --verbose --hostname=chromeos4-row3-rack13-host11 --user=chromeos-test /usr/local/autotest/control.autoserv >/dev/null 2>/dev/null &)
2018-02-21T14:40:09.724029-08:00 NOTICE ag[16729]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-hFinTc 0 0)
...
2018-02-21T14:40:11.403128-08:00 NOTICE root[16785]: autotest starting iteration /usr/local/autotest/results/default/video_VideoCorruption/sysinfo/iteration.1 on winky_2.6GHz_2GB
...
2018-02-21T14:40:23.469856-08:00 INFO sshd[17361]: Accepted publickey for root from 100.108.189.41 port 36722 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2018-02-21T14:40:23.972470-08:00 INFO session_manager[16943]: [INFO:child_exit_handler.cc(77)] Handling 17380 exit.
2018-02-21T14:40:24.013084-08:00 INFO session_manager[16943]: [INFO:policy_store.cc(89)] Persisted policy to disk, path: /var/lib/whitelist/policy.1
2018-02-21T14:40:24.034583-08:00 NOTICE ag[17534]: autotest server[stack::create_host|job_start|run] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)
2018-02-21T14:40:24.203680-08:00 INFO session_manager[16943]: [INFO:child_exit_handler.cc(77)] Handling 17532 exit.
2018-02-21T14:40:24.456946-08:00 NOTICE ag[17611]: autotest server[stack::reset|cleanup|run] -> ssh_run(rm -f /var/lib/cleanup_logs_paused)
2018-02-21T14:40:24.863954-08:00 NOTICE ag[17614]: autotest server[stack::_restart_ui|_is_factory_image|run] -> ssh_run([ -f /root/.factory_test ])
2018-02-21T14:40:25.268173-08:00 NOTICE ag[17616]: autotest server[stack::_restart_ui|get_chrome_session_ident|run] -> ssh_run(bootstat_get_last login-prompt-visible)
2018-02-21T14:40:25.684096-08:00 NOTICE ag[17621]: autotest server[stack::cleanup_services|_restart_ui|run] -> ssh_run(stop ui; start ui)
2018-02-21T14:40:25.745559-08:00 INFO session_manager[16943]: [INFO:browser_job.cc(164)] Terminating process: exiting cleanly
2018-02-21T14:40:25.745589-08:00 INFO session_manager[16943]: [INFO:system_utils_impl.cc(94)] Sending 15 to 17025 as 1000
2018-02-21T14:40:25.745646-08:00 INFO session_manager[16943]: [INFO:system_utils_impl.cc(94)] Sending 15 to 17282 as 1000
2018-02-21T14:40:26.342761-08:00 INFO session_manager[16943]: [INFO:session_manager_service.cc(494)] SessionManagerService quitting run loop
2018-02-21T14:40:26.343330-08:00 INFO session_manager[16943]: [INFO:session_manager_service.cc(203)] SessionManagerService exiting
2018-02-21T14:40:26.389172-08:00 INFO chapsd[1007]: Token at /home/root/0a5e9f9ff1cac2c2752b8b8c82515a9ceeeada7a/chaps has been removed from slot 1
2018-02-21T14:40:26.425683-08:00 INFO chapsd[1007]: Unloaded keys for slot 1
2018-02-21T14:40:26.469782-08:00 NOTICE ag[17664]: autotest server[stack::get_board|parse_cmd_output|run] -> ssh_run(cat /etc/lsb-release)
2018-02-21T14:40:26.874551-08:00 NOTICE ag[17673]: autotest server[stack::reboot|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2018-02-21T14:40:27.280968-08:00 NOTICE ag[17676]: autotest server[stack::reboot|run_background|run] -> ssh_run(( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!)
2018-02-21T14:40:27.592057-08:00 WARNING chapsd[1007]: Unload Token event received for unknown path: /home/root/0a5e9f9ff1cac2c2752b8b8c82515a9ceeeada7a/chaps
2018-02-21T14:40:27.639757-08:00 INFO imageloader[17684]: The latest-version file does not exist. Component PepperFlashPlayer is probably not installed.
2018-02-21T14:40:27.682934-08:00 NOTICE ag[17743]: autotest server[stack::wait_down|get_boot_id|run] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2018-02-21T14:40:27.789266-08:00 WARNING session_manager[17785]: [WARNING:chrome_setup.cc(86)] Could not find both paths: /usr/share/chromeos-assets/wallpaper/oem_large.jpg and /usr/share/chromeos-assets/wallpaper/oem_small.jpg
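The attribution above can be mechanized. A rough Python sketch, assuming a saved copy of /var/log/messages; it attributes each ag[...] command to the source IP of the most recent "Accepted publickey" line, which is only approximate since master ssh connections are multiplexed and reused:

import re

ACCEPT = re.compile(r'sshd\[\d+\]: Accepted publickey for \S+ from (\S+)')
AG_CMD = re.compile(r'ag\[\d+\]: autotest server\[\S+\] -> (ssh_run\(.*\))')

# Walk /var/log/messages in order, attributing each autotest-server
# command to the source IP of the most recent accepted ssh connection.
# Approximate only: master ssh connections are multiplexed and reused,
# so one "Accepted publickey" line can cover many later commands.
source = 'unknown'
with open('messages') as f:
    for line in f:
        m = ACCEPT.search(line)
        if m:
            source = m.group(1)
            continue
        m = AG_CMD.search(line)
        if m:
            print('%-16s %s' % (source, m.group(1)))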
,
Feb 22 2018
,
Feb 22 2018
Oops, I pasted the wrong link. I meant:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178323980-chromeos-test/chromeos4-row3-rack13-host11/

100.109.25.142 = cros-full-0002.mtv.corp.google.com
100.108.189.41 = cros-full-0027.mtv.corp.google.com
,
Feb 22 2018
Nice find! Is cros-full-0027 still unused or is it part of the master? It looks like it became very active at about 14:20-14:40 today, which corresponds to the log in #29.
https://viceroy.corp.google.com/chromeos/machines?hostname=cros-full-0027&duration=1382399&mdb_role=chrome-infra

I poked on cros-full-0027 and found a repair log for chromeos6-row1-rack7-host21/62495156-repair/debug (banjo - which is supposed to be assigned to cros-full-0018?)
http://cros-full-0018.mtv.corp.google.com/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/debug/

Why would two drones try to repair the same DUT at about the same time?
,
Feb 22 2018
root@cros-full-0027:/usr/local/autotest/chromeos6-row1-rack7-host21/62495156-repair/debug# more autoserv.DEBUG

02/21 19:13:13.186 INFO | autoserv:0685| Results placed in /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair
02/21 19:13:13.186 DEBUG| autoserv:0693| autoserv is running in drone cros-full-0027.mtv.corp.google.com.
02/21 19:13:13.186 DEBUG| autoserv:0694| autoserv command was: /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair -m chromeos6-row1-rack7-host21 --verbose --lab True -R --host-protection NO_PROTECTION
02/21 19:13:13.186 INFO | pidfile:0016| Logged pid 13938 to /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute
02/21 19:13:13.187 DEBUG| autoserv:0383| faulthandler registered on SIGTERM.
02/21 19:13:13.188 DEBUG| base_job:0357| Persistent state global_properties.test_retry now set to 0
02/21 19:13:13.188 DEBUG| base_job:0357| Persistent state global_properties.tag now set to ''
02/21 19:13:13.507 DEBUG| host_info:0274| Refreshing HostInfo using store AfeStore[chromeos6-row1-rack7-host21]
02/21 19:13:13.507 DEBUG| host_info:0275| Old host_info: None
02/21 19:13:13.816 DEBUG| host_info:0283| New host_info: HostInfo[Labels: ['internal_display', 'bluetooth', 'touchpad', 'ec:cros', 'hw_video_acc_enc_h264', 'os:cros', 'sku:banjo_intel_celeron_n2830_2Gb', 'power:battery', 'variant:banjo', 'storage:mmc', 'cts_abi_arm', 'phase:PVT', 'hw_jpeg_acc_dec', 'cts_abi_x86', 'hw_video_acc_h264', 'board:banjo', 'webcam', 'banjo', 'pool:suites', 'model:banjo', '4k_video_h264', 'audio_loopback_dongle'], Attributes: {'serial_number': 'N9G15WW003517005CF7600', 'HWID': 'BANJO C7A-C6I-A4O'}]
02/21 19:13:13.816 INFO | server_job:1599| Shadowing AFE store with a FileStore at /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753
02/21 19:13:13.817 DEBUG| host_info:0239| Committing HostInfo to store FileStore[/usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753]
02/21 19:13:13.817 DEBUG| host_info:0243| HostInfo updated to: HostInfo[Labels: ['internal_display', 'bluetooth', 'touchpad', 'ec:cros', 'hw_video_acc_enc_h264', 'os:cros', 'sku:banjo_intel_celeron_n2830_2Gb', 'power:battery', 'variant:banjo', 'storage:mmc', 'cts_abi_arm', 'phase:PVT', 'hw_jpeg_acc_dec', 'cts_abi_x86', 'hw_video_acc_h264', 'board:banjo', 'webcam', 'banjo', 'pool:suites', 'model:banjo', '4k_video_h264', 'audio_loopback_dongle'], Attributes: {'serial_number': 'N9G15WW003517005CF7600', 'HWID': 'BANJO C7A-C6I-A4O'}]
02/21 19:13:13.817 DEBUG| base_job:0357| Persistent state global_properties.fast now set to False
02/21 19:13:13.818 DEBUG| base_job:0357| Persistent state global_properties.max_result_size_KB now set to 20000
02/21 19:13:13.841 INFO | connectionpool:0207| Starting new HTTP connection (1): metadata.google.internal
02/21 19:13:14.258 NOTIC| cros_logging:0038| ts_mon was set up.
02/21 19:13:14.260 INFO | server_job:0218| START ---- repair timestamp=1519269194 localtime=Feb 21 19:13:14
02/21 19:13:14.281 DEBUG| host_info:0274| Refreshing HostInfo using store ShadowingStore[AfeStore[chromeos6-row1-rack7-host21], FileStore[/usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753]]
[...]
02/21 19:18:27.279 INFO | crashcollect:0269| Waiting 0.001 hours for chromeos6-row1-rack7-host21 to come up (Feb 21 19:14:48)
02/21 19:18:27.288 DEBUG| ssh_host:0301| Running (ssh) 'true' from 'wait_for_machine_to_recover|wait_up|is_up|ssh_ping|run|run_very_slowly'
02/21 19:18:27.291 INFO | ssh_multiplex:0080| Master ssh connection to chromeos6-row1-rack7-host21 is down.
02/21 19:18:27.291 DEBUG| ssh_multiplex:0122| Nuking ssh master_job
02/21 19:18:27.291 DEBUG| ssh_multiplex:0127| Cleaning ssh master_tempdir
02/21 19:18:27.292 INFO | ssh_multiplex:0096| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_ZvDS7yssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row1-rack7-host21'
02/21 19:18:27.292 DEBUG| utils:0214| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_ZvDS7yssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row1-rack7-host21'
,
Feb 22 2018
02/21 19:18:13.832 INFO | autoserv:0685| Results placed in /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812
02/21 19:18:13.832 DEBUG| autoserv:0693| autoserv is running in drone cros-full-0018.mtv.corp.google.com.
02/21 19:18:13.832 DEBUG| autoserv:0694| autoserv command was: /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812 -m chromeos6-row1-rack7-host21 --verbose --lab True -R --host-protection NO_PROTECTION
02/21 19:18:13.832 INFO | pidfile:0016| Logged pid 181204 to /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/.autoserv_execute
02/21 19:18:13.833 DEBUG| autoserv:0383| faulthandler registered on SIGTERM.
02/21 19:18:13.834 DEBUG| base_job:0357| Persistent state global_properties.test_retry now set to 0
02/21 19:18:13.834 DEBUG| base_job:0357| Persistent state global_properties.tag now set to ''
02/21 19:18:13.922 DEBUG| host_info:0274| Refreshing HostInfo using store AfeStore[chromeos6-row1-rack7-host21]
02/21 19:18:13.922 DEBUG| host_info:0275| Old host_info: None
02/21 19:18:13.956 DEBUG| host_info:0283| New host_info: HostInfo[Labels: ['internal_display', 'bluetooth', 'touchpad', 'ec:cros', 'hw_video_acc_enc_h264', 'os:cros', 'sku:banjo_intel_celeron_n2830_2Gb', 'power:battery', 'variant:banjo', 'storage:mmc', 'cts_abi_arm', 'phase:PVT', 'hw_jpeg_acc_dec', 'cts_abi_x86', 'hw_video_acc_h264', 'board:banjo', 'webcam', 'banjo', 'pool:suites', 'model:banjo', '4k_video_h264', 'audio_loopback_dongle'], Attributes: {'serial_number': 'N9G15WW003517005CF7600', 'HWID': 'BANJO C7A-C6I-A4O'}]
02/21 19:18:13.957 INFO | server_job:1599| Shadowing AFE store with a FileStore at /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/host_info_store/store_d777afd6-08e0-4349-bf61-c52ad541670c
02/21 19:18:13.957 DEBUG| host_info:0239| Committing HostInfo to store FileStore[/usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/host_info_store/store_d777afd6-08e0-4349-bf61-c52ad541670c]
02/21 19:18:13.958 DEBUG| host_info:0243| HostInfo updated to: HostInfo[Labels: ['internal_display', 'bluetooth', 'touchpad', 'ec:cros', 'hw_video_acc_enc_h264', 'os:cros', 'sku:banjo_intel_celeron_n2830_2Gb', 'power:battery', 'variant:banjo', 'storage:mmc', 'cts_abi_arm', 'phase:PVT', 'hw_jpeg_acc_dec', 'cts_abi_x86', 'hw_video_acc_h264', 'board:banjo', 'webcam', 'banjo', 'pool:suites', 'model:banjo', '4k_video_h264', 'audio_loopback_dongle'], Attributes: {'serial_number': 'N9G15WW003517005CF7600', 'HWID': 'BANJO C7A-C6I-A4O'}]
02/21 19:18:13.958 DEBUG| base_job:0357| Persistent state global_properties.fast now set to False
02/21 19:18:13.958 DEBUG| base_job:0357| Persistent state global_properties.max_result_size_KB now set to 20000
02/21 19:18:13.982 INFO | connectionpool:0207| Starting new HTTP connection (1): metadata.google.internal
02/21 19:18:14.403 NOTIC| cros_logging:0038| ts_mon was set up.
02/21 19:18:14.405 INFO | server_job:0218| START ---- repair timestamp=1519269494 localtime=Feb 21 19:18:14
02/21 19:18:14.433 DEBUG| host_info:0274| Refreshing HostInfo using store ShadowingStore[AfeStore[chromeos6-row1-rack7-host21], FileStore[/usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/host_info_store/store_d777afd6-08e0-4349-bf61-c52ad541670c]]
[...]
02/21 19:23:27.028 INFO | crashcollect:0269| Waiting 0.001 hours for chromeos6-row1-rack7-host21 to come up (Feb 21 19:19:48)
[...]
02/21 19:27:59.584 DEBUG| utils:0214| Running 'ping chromeos6-row1-rack7-host21 -w1 -c1'
02/21 19:28:00.667 DEBUG| utils:0282| [stdout] PING chromeos6-row1-rack7-host21.cros.corp.google.com (100.115.128.77) 56(84) bytes of data.
02/21 19:28:00.667 DEBUG| utils:0282| [stdout]
02/21 19:28:00.667 DEBUG| utils:0282| [stdout] --- chromeos6-row1-rack7-host21.cros.corp.google.com ping statistics ---
02/21 19:28:00.667 DEBUG| utils:0282| [stdout] 2 packets transmitted, 0 received, 100% packet loss, time 999ms
02/21 19:28:00.667 DEBUG| utils:0282| [stdout]
02/21 19:28:00.668 ERROR| repair:0354| Failed: host is available via ssh
[...]
02/21 19:28:01.294 INFO | repair:0110| Skipping this operation: All host verification checks pass
02/21 19:28:01.294 DEBUG| repair:0111| The following dependencies failed:
02/21 19:28:01.294 DEBUG| repair:0113| host is available via ssh
02/21 19:28:01.295 ERROR| repair:0043| Repair failed due to Exception.
Traceback (most recent call last):
  File "/usr/local/autotest/server/control_segments/repair", line 38, in repair
    target.repair()
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1243, in repair
    self._repair_strategy.repair(self)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 774, in repair
    self._verify_root._verify_host(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 348, in _verify_host
    self._verify_dependencies(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 226, in _verify_dependencies
    self._verify_list(host, self._dependency_list, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 216, in _verify_list
    raise AutoservVerifyDependencyError(self, failures)
AutoservVerifyDependencyError: No answer to ping from chromeos6-row1-rack7-host21
02/21 19:28:01.295 INFO | server_job:0218| END FAIL ---- repair timestamp=1519270081 localtime=Feb 21 19:28:01
02/21 19:28:01.296 DEBUG| ssh_multiplex:0122| Nuking ssh master_job
02/21 19:28:01.296 DEBUG| ssh_multiplex:0127| Cleaning ssh master_tempdir
02/21 19:28:01.296 DEBUG| utils:0358| Running result_tools/utils on path: /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812
02/21 19:28:01.296 DEBUG| utils:0361| Throttle result size to : 19 MB
02/21 19:28:01.298 DEBUG| utils:0382| Directory summary of /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812 is saved to file /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812/dir_summary_1519270081.json.
02/21 19:28:01.298 DEBUG| utils:0390| Skip throttling /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/149249-repair/20182102191812: size=86670, throttle_probability=236.2986039
02/21 19:28:01.301 ERROR| traceback:0013| Traceback (most recent call last):
02/21 19:28:01.302 ERROR| traceback:0013|   File "/usr/local/autotest/server/autoserv", line 491, in run_autoserv
02/21 19:28:01.302 ERROR| traceback:0013|     job.repair(job_labels)
02/21 19:28:01.302 ERROR| traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 579, in repair
02/21 19:28:01.302 ERROR| traceback:0013|     self._execute_code(REPAIR_CONTROL_FILE, namespace, protect=False)
02/21 19:28:01.302 ERROR| traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 1396, in _execute_code
02/21 19:28:01.302 ERROR| traceback:0013|     execfile(code_file, namespace, namespace)
02/21 19:28:01.302 ERROR| traceback:0013|   File "/usr/local/autotest/server/control_segments/repair", line 53, in <module>
02/21 19:28:01.303 ERROR| traceback:0013|     job.parallel_simple(repair, machines)
02/21 19:28:01.303 ERROR| traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 688, in parallel_simple
02/21 19:28:01.303 ERROR| traceback:0013|     log=log, timeout=timeout, return_results=return_results)
02/21 19:28:01.303 ERROR| traceback:0013|   File "/usr/local/autotest/server/subcommand.py", line 98, in parallel_simple
02/21 19:28:01.303 ERROR| traceback:0013|     function(arg)
02/21 19:28:01.303 ERROR| traceback:0013|   File "/usr/local/autotest/server/control_segments/repair", line 47, in repair
02/21 19:28:01.303 ERROR| traceback:0013|     raise Exception('')
02/21 19:28:01.303 ERROR| traceback:0013| Exception
02/21 19:28:01.313 INFO | client:0570| Attempting refresh to obtain initial access_token
02/21 19:28:01.349 INFO | client:0872| Refreshing access_token
02/21 19:28:02.192 ERROR| autoserv:0759| Uncaught SystemExit with code 1
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 755, in main
    use_ssp)
  File "/usr/local/autotest/server/autoserv", line 562, in run_autoserv
    sys.exit(exit_code)
SystemExit: 1
02/21 19:28:02.260 DEBUG| logging_manager:0627| Logging subprocess finished
02/21 19:28:02.262 DEBUG| logging_manager:0627| Logging subprocess finished
,
Feb 22 2018
A few more instances.

(orco : 2018-02-20T06:38:24.126179-08:00)
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178058286-chromeos-test/chromeos4-row6-rack4-host3/
* chromeos-skunk-2.mtv.corp.google.com (100.116.60.160) was running the test.
* cros-autotest-shard4.hot.corp.google.com (100.108.96.2) broke in.

(heli : 2018-02-21T01:41:54.197054+00:00)
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178150767-chromeos-test/chromeos6-row2-rack17-host15/
* cros-full-0003.mtv.corp.google.com (100.109.25.145) was running the test.
* cros-autotest-shard5.hot.corp.google.com (100.108.96.5) broke in.

(glimmer : 2018-02-21T14:17:57.273055-08:00)
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/178326213-chromeos-test/chromeos4-row7-rack3-host9/
* cros-full-0001.mtv.corp.google.com (100.109.25.130) was running the test.
* cros-full-0026.mtv.corp.google.com (100.108.189.40) broke in.
,
Feb 22 2018
David, fyi in case you run into this.
,
Feb 22 2018
I just did a spot check against the servers cited in #c31:

100.109.25.142 = cros-full-0002.mtv.corp.google.com
100.108.189.41 = cros-full-0027.mtv.corp.google.com

The database is down, so I can't confirm the actual configuration, but from logs on cros-full-0002, it's a shard, and it's run lots of jobs on chromeos4-row3-rack13-host11.

Checking cros-full-0027, it appears to be a drone, and there are no log files that seem to mention chromeos4-row3-rack13-host11. However, there are signs that special tasks ran against that host:

$ cd /usr/local/autotest/results
$ ls -ld hosts/chromeos4-row3-rack13-host11
drwxr-xr-x 2 chromeos-test eng 4096 Feb 21 14:41 hosts/chromeos4-row3-rack13-host11

Drones don't do anything on their own, so the behavior here is that the master server scheduled a special task against a DUT owned by a shard, for reasons as yet unknown.
,
Feb 22 2018
OK. Adding more proof of the theory (but still no explanation). On cautotest-prod (the master scheduler), you can see that the master ran 6 special tasks against the cited DUT:

$ grep chromeos4-row3-rack13-host11 scheduler.log.2018-02-* | grep 'Starting: '
scheduler.log.2018-02-15-00.57.26:02/15 00:59:21.973 INFO | models:2110| Starting: Special Task 62405503 (host chromeos4-row3-rack13-host11, task Reset, time 2018-02-15 00:58:42)
scheduler.log.2018-02-15-16.54.56:02/15 16:56:34.167 INFO | models:2110| Starting: Special Task 62415955 (host chromeos4-row3-rack13-host11, task Reset, time 2018-02-15 16:56:11)
scheduler.log.2018-02-15-16.54.56:02/15 16:59:12.796 INFO | models:2110| Starting: Special Task 62417141 (host chromeos4-row3-rack13-host11, task Repair, time 2018-02-15 16:59:05)
scheduler.log.2018-02-20-15.01.21:02/20 15:04:10.842 INFO | models:2110| Starting: Special Task 62479158 (host chromeos4-row3-rack13-host11, task Reset, time 2018-02-20 15:02:41)
scheduler.log.2018-02-21-13.53.52:02/21 13:55:26.717 INFO | models:2110| Starting: Special Task 62492180 (host chromeos4-row3-rack13-host11, task Reset, time 2018-02-21 13:55:05)
scheduler.log.2018-02-21-14.38.36:02/21 14:40:11.984 INFO | models:2110| Starting: Special Task 62493576 (host chromeos4-row3-rack13-host11, task Reset, time 2018-02-21 14:39:44)

Of particular interest to me is what caused the 5-day hiatus between 2018-02-15 16:59:05 and 2018-02-20 15:02:41; at first blush it looks like the master knew (for a while) that it no longer owned the DUT, then somehow forgot that fact 5 days later.
,
Feb 26 2018
,
Feb 26 2018
OK. Summary of the problem:
* The DUT is owned by a shard.
* Despite the shard ownership, the master continues to
schedule reset tasks, over the space of a week or more.
Best theory is that the problem is caused by the process of
moving the DUT from one shard to the next. However, even if
that's true, a week of failures is too much fallout to be
tolerable. Moreover, there are symptoms not fully explained:
* The master is scheduling Reset tasks. This is normally
only done prior to a test. What are the tests being
scheduled?
* Why are the events so intermittent?
I'm not sure who should work on this debug, so I'm dropping this
from my plate for now.
,
Feb 27 2018
Could I ask someone on the infra team to take priority on this? Nowadays, 80% of the flaky failures I encounter while investigating ARC tests are due to this, which makes it very hard to diagnose other, real failures.
,
Mar 6 2018
Friendly ping... or, are there any pointers to documents that I can follow to debug the issue further by myself? (Not being an infra deputy, I don't think I have access to scheduler logs etc.)
,
Mar 7 2018
Can you point out a DUT recently that suffered this, and the test that suffered from it? I will take a look thereafter in line with jrbarnette's theory.
,
Mar 7 2018
(reassign to me once answered)
,
Mar 7 2018
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/180927784-chromeos-test/chromeos6-row2-rack11-host10/
dut: chromeos6-row2-rack11-host10 (peach_pit)
test: touch_TapSettings
test-run-by: cros-full-0011 (100.109.25.135)
reset-by: cros-full-0031 (100.108.189.50) (2018-03-05T06:27:00.001295+00:00)

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/180488013-chromeos-test/chromeos4-row4-rack9-host6/
dut: chromeos4-row4-rack9-host6 (expresso)
test: video_PlaybackPerf
test-run-by: chromeos-server134.hot (100.109.175.140)
reset-by: cros-full-0027 (100.108.189.41) (2018-03-02T19:05:11.591284+00:00)

Some instances from ARC-enabled devices (though it's hard to tell who ssh'ed into them, due to #c20):
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/181314732-chromeos-test/chromeos2-row3-rack11-host1/ (2018-03-06T18:48:24.058288+00:00)
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/181310203-chromeos-test/chromeos6-row1-rack19-host15/ (2018-03-06T23:25:04.643351+00:00)
,
Mar 7 2018
Investigating. cros-full-0027 and cros-full-0031 are both general-purpose drones (which are supposed to do work only for non-sharded boards), whereas both of those boards are sharded, so this is consistent with the theory that the master is going rogue and acting on DUTs that it shouldn't be acting on.
,
Mar 7 2018
Investigating chromeos4-row4-rack9-host6. Both cautotest/ and http://chromeos-server134.hot/ agree that the DUT lives on 134. Master should not be touching this. Digging into master logs now.
,
Mar 7 2018
Master scheduler indeed has sporadic special tasks against that DUT, despite not owning the DUT.
e.g.
chromeos-test@cros-full-0036:/usr/local/autotest/logs$ grep 62621609 *
grep: dut-data: Is a directory
scheduler.log.2018-03-02-16.21.24:03/02 16:21:56.118 INFO | drone_manager:0813| monitoring pidfile /usr/local/autotest/results/hosts/chromeos4-row4-rack9-host6/62621609-reset/.autoserv_execute
scheduler.log.2018-03-02-16.21.24:03/02 16:21:56.118 INFO | models:2181| Starting: Special Task 62621609 (host chromeos4-row4-rack9-host6, task Reset, time 2018-03-02 16:21:49)
Investigating that special task closer:
MySQL [chromeos_autotest_db]> select * from afe_special_tasks where id=62621609 \G;
*************************** 1. row ***************************
id: 62621609
host_id: 1819
task: Reset
time_requested: 2018-03-02 16:21:49
is_active: 0
is_complete: 1
time_started: 2018-03-02 16:21:56
queue_entry_id: NULL
success: 1
requested_by_id: 19
is_aborted: 0
time_finished: 2018-03-02 16:22:30
1 row in set (0.01 sec)
So, it's not associated with any HQE. I haven't tracked down where it was requested or how it ended up in the master AFE...
,
Mar 7 2018
Some unverified leads: there are two ways in which special tasks are created unrelated to an HQE:
- as a result of a reset/cleanup being requested at the end of a job;
- the scheduler itself goes around scheduling special tasks as part of its main loop (this, for instance, creates verify tasks for DUTs that are currently in REPAIR_FAILED state).
I think one of these isn't respecting the shard_id on the host.
,
Mar 7 2018
chromeos2-row3-rack11-host1 appears to be locked by duenasa@chromium.org for deployment? Let's ignore that one for now, as the locking and deployment can easily be confounding the issue.

For chromeos6-row1-rack19-host15, both cautotest/ and cros-full-0023.mtv/ agree that it lives on cros-full-0023.mtv. The master ran a special task against it recently anyway...

scheduler.log.2018-03-06-16.03.09:03/06 16:04:00.411 INFO | models:2181| Starting: Special Task 62666251 (host chromeos6-row1-rack19-host15, task Reset, time 2018-03-06 16:03:37)
scheduler.log.2018-03-06-16.03.09:03/06 16:04:00.414 INFO | prejob_task:0272| starting reset task for host: chromeos6-row1-rack19-host15
scheduler.log.2018-03-06-16.03.09:03/06 16:04:00.414 INFO | rdb_hosts:0222| Host chromeos6-row1-rack19-host15 in Provisioning updating {'status': 'Resetting'} through rdb on behalf of: Task: Special Task 62666251 (host chromeos6-row1-rack19-host15, task Reset, time 2018-03-06 16:03:37)
scheduler.log.2018-03-06-16.03.09:03/06 16:04:00.427 INFO | drone_manager:0787| command = ['nice', '-n', '10', '/usr/local/autotest/server/autoserv', '-p', '-r', u'/usr/local/autotest/results/hosts/chromeos6-row1-rack19-host15/62666251-reset', '-m', u'chromeos6-row1-rack19-host15', '--verbose', '--lab', 'True', '--reset']
scheduler.log.2018-03-06-16.03.09:03/06 16:04:36.138 INFO | models:2192| Finished: Special Task 62666251 (host chromeos6-row1-rack19-host15, task Reset, time 2018-03-06 16:03:37) (active)
scheduler.log.2018-03-06-16.03.09:03/06 16:04:36.142 INFO | drone_manager:0825| forgetting pidfile /usr/local/autotest/results/hosts/chromeos6-row1-rack19-host15/62666251-reset/.autoserv_execute
scheduler.log.2018-03-06-16.03.09:03/06 16:04:36.145 INFO | rdb_hosts:0222| Host chromeos6-row1-rack19-host15 in Resetting updating {'dirty': 0} through rdb on behalf of: Task: Special Task 62666251 (host chromeos6-row1-rack19-host15, task Reset, time 2018-03-06 16:03:37)
scheduler.log.2018-03-06-16.03.09:03/06 16:04:36.147 INFO | rdb_hosts:0222| Host chromeos6-row1-rack19-host15 in Resetting updating {'status': 'Ready'} through rdb on behalf of: Task: Special Task 62666251 (host chromeos6-row1-rack19-host15, task Reset, time 2018-03-06 16:03:37)
,
Mar 7 2018
#49 those sound like good leads.
,
Mar 7 2018
*************************** 1. row ***************************
id: 62666251
host_id: 6631
task: Reset
time_requested: 2018-03-06 16:03:37
is_active: 0
is_complete: 1
time_started: 2018-03-06 16:04:00
queue_entry_id: NULL
success: 1
requested_by_id: 19
is_aborted: 0
time_finished: 2018-03-06 16:04:36
1 row in set (0.01 sec)
Once again, no HQE. However, what is "requested_by_id: 19" about?
,
Mar 7 2018
I believe that is a foreign key into afe_users, where entry 19 is autotest_system
MySQL [chromeos_autotest_db]> select * from afe_users where id=19 \G;
*************************** 1. row ***************************
id: 19
login: autotest_system
access_level: 100
reboot_before: 1
reboot_after: 0
show_experimental: 0
drone_set_id: NULL
1 row in set (0.01 sec)
,
Mar 7 2018
This code in monitor_db looks suspicious... gotta find out where it is called. Doesn't seem to inherently filter by shard ownership.
def _reverify_hosts_where(self, where,
print_message='Reverifying host %s'):
full_where='locked = 0 AND invalid = 0 AND ' + where
for host in scheduler_models.Host.fetch(where=full_where):
if self.host_has_agent(host):
# host has already been recovered in some way
continue
if self._host_has_scheduled_special_task(host):
# host will have a special task scheduled on the next cycle
continue
if print_message:
logging.error(print_message, host.hostname)
try:
user = models.User.objects.get(login='autotest_system')
except models.User.DoesNotExist:
user = models.User.objects.get(
id=self.DEFAULT_REQUESTED_BY_USER_ID)
models.SpecialTask.objects.create(
task=models.SpecialTask.Task.RESET,
host=models.Host.objects.get(id=host.id),
requested_by=user)
,
Mar 7 2018
scheduler.log.2018-03-06-15.24.12:03/06 15:24:40.743 ERROR| monitor_db:0850| Recovering active host chromeos6-row1-rack19-host15 - this probably indicates a scheduler bug
suggests we are calling into
def _reverify_remaining_hosts(self):
# recover active hosts that have not yet been recovered, although this
# should never happen
message = ('Recovering active host %s - this probably indicates a '
'scheduler bug')
self._reverify_hosts_where(
"status IN ('Repairing', 'Verifying', 'Cleaning', 'Provisioning')",
print_message=message)
Even on hosts that aren't owned by the master.
,
Mar 7 2018
Pretty sure that _reverify_hosts_where should be modified such that full_where='locked = 0 AND invalid = 0 AND shard_id = %my_id AND ' + where. But I'm not yet convinced that monitor_db actually knows its own shard id.
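For illustration, a minimal sketch of the guard being proposed; scheduler_db, fetch_hosts, and create_reset_task are hypothetical stand-ins for the monitor_db internals, and the fix that eventually landed (#59 below) instead gates on being the master and skips any host with a shard assigned:

# Hedged sketch of the proposed guard, not the real monitor_db code.
# scheduler_db, fetch_hosts, and create_reset_task are hypothetical
# stand-ins for the scheduler's models/queries.
def reverify_hosts_where(scheduler_db, where, is_master):
    """Schedule Reset tasks for matching hosts, without letting the
    master scheduler touch DUTs that are owned by a shard."""
    full_where = 'locked = 0 AND invalid = 0 AND ' + where
    if is_master:
        # Shard-owned hosts have a non-NULL shard_id; the master
        # resetting such a host races the shard's own job, which is
        # exactly the failure mode seen in this bug.
        full_where = 'shard_id IS NULL AND ' + full_where
    for host in scheduler_db.fetch_hosts(where=full_where):
        scheduler_db.create_reset_task(host)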
,
Mar 7 2018
https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/953370
,
Mar 7 2018
,
Mar 8 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1f7d518223bfcd28138dad35326ae80560f19d00

commit 1f7d518223bfcd28138dad35326ae80560f19d00
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Mar 08 21:26:49 2018

autotest: scheduler: if master, do not run special tasks on shard hosts

BUG=chromium:801467
TEST=None

Change-Id: I1a8bf1719f898edfe99674fdd5fb09c9322cd7ba
Reviewed-on: https://chromium-review.googlesource.com/953370
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/1f7d518223bfcd28138dad35326ae80560f19d00/scheduler/monitor_db.py
,
Mar 10 2018
Waiting on prod push.
,
Mar 12 2018
Believed fixed, awaiting push.
,
Mar 12 2018
-> xixuan who will orchestrate the push, then this can be closed.
,
Mar 13 2018
Fix in #59 is pushed. Closing this.
,
Mar 14 2018
Amazing! Thank you all! I'll monitor the tree for a week or so to see the outcome of the fix.