DeviceNotPingable exception fails provisions |
||||||||||||||
Issue descriptionchromeos2-row3-rack4-host3 not pingable? Builder: https://luci-milo.appspot.com/buildbot/chromeos/falco-chrome-pfq/ Failures: https://luci-milo.appspot.com/buildbot/chromeos/falco-chrome-pfq/5226 https://luci-milo.appspot.com/buildbot/chromeos/falco-chrome-pfq/5225 https://luci-milo.appspot.com/buildbot/chromeos/falco-chrome-pfq/5224 Associated bug: https://bugs.chromium.org/p/chromium/issues/detail?id=655350 Last couple of PFQ failures have failed because of this bot.
,
Jan 20 2017
yeah, I filed b/34504240 for syslab to repair it. They were able to run repair and no issue found. I ran a test in that dut, also succeeded. http://cautotest/afe/#tab_id=view_job&object_id=97066848
,
Jan 23 2017
falco-chrome-pfq new infra failures: https://bugs.chromium.org/p/chromium/issues/detail?id=668863
,
Jan 23 2017
dshi@, can you take another look?
,
Jan 23 2017
changing owner to this week's infra sheriff
,
Jan 23 2017
loop in ChromeOS sheriffs as pprabhu@ has other issue needs attention
,
Jan 26 2017
Two more falco failures last night, blocking chrome PFQ: https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5249 https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5250 "provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host3: DeviceNotPingableError: Device chromeos2-row3-rack4-host3 is not pingable." falco seems especially prone to infra/lab failures in the last couple weeks. Can someone look at the DUTs?
,
Jan 26 2017
Looking now.
,
Jan 26 2017
Looking at this specific failure: https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5249 https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/98016931-chromeos-test/chromeos2-row3-rack4-host3/ autotest.DEBUG shows that the device was functioning at 1:05:27, then down roughly between 1:06 and 1:07 and up again at 1:07. The sysinfo didn't collect a console-ramoops and the syslog ends right before the failure. (Including tail of syslog. NOTE: times are GMT = PST + 8) 2017-01-26T09:03:35.478688+00:00 NOTICE ag[13641]: autotest server[stack::send_file|use_rsync|_check_rsync] -> ssh_run(rsync --version) 2017-01-26T09:04:29.732809+00:00 DEBUG kernel: [ 8039.735043] ieee80211 phy0: device no longer idle - scanning 2017-01-26T09:04:32.912995+00:00 DEBUG kernel: [ 8042.917029] ieee80211 phy0: device now idle 2017-01-26T09:05:19.340939+00:00 NOTICE ag[13664]: autotest server[stack::wrapper|_install|get_tmp_dir] -> ssh_run(mkdir -p /tmp) 2017-01-26T09:05:19.654615+00:00 NOTICE ag[13667]: autotest server[stack::wrapper|_install|get_tmp_dir] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX) 2017-01-26T09:05:19.937098+00:00 NOTICE ag[13670]: autotest server[stack::decorated_func|wrapper|erase_dir_contents] -> ssh_run(find "/tmp/autoserv-HWw8gF" -mindepth 1 -maxdepth 1 -print0 | xargs -0 rm -rf) 2017-01-26T09:05:20.237094+00:00 NOTICE ag[13675]: autotest server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true) 2017-01-26T09:05:20.519767+00:00 NOTICE ag[13677]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(ls /tmp/sysinfo/autoserv-vnjeh2/bin/autotest > /dev/null 2>&1) 2017-01-26T09:05:20.804448+00:00 NOTICE ag[13680]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(umount /tmp/sysinfo/autoserv-vnjeh2/tmp) 2017-01-26T09:05:21.094031+00:00 NOTICE ag[13683]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(umount /tmp/sysinfo/autoserv-vnjeh2/tests/download) 2017-01-26T09:05:21.372591+00:00 NOTICE ag[13686]: autotest server[stack::before_hook|run|_do_run] -> ssh_run(rm -f /tmp/sysinfo/autoserv-vnjeh2/control.autoserv;rm -f /tmp/sysinfo/autoserv-vnjeh2/control.autoserv.state;rm -f /tmp/sysinfo/autoserv-vnjeh2/control;rm -f /tmp/sysinfo/autoserv-vnjeh2/control.state) 2017-01-26T09:05:22.949790+00:00 NOTICE ag[13696]: autotest server[stack::_do_run|execute_control|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi) 2017-01-26T09:05:23.175030+00:00 NOTICE ag[13699]: autotest server[stack::execute_section|_execute_daemon|get_tmp_dir] -> ssh_run(mkdir -p /tmp) 2017-01-26T09:05:23.400290+00:00 NOTICE ag[13702]: autotest server[stack::execute_section|_execute_daemon|get_tmp_dir] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX) 2017-01-26T09:05:24.209810+00:00 NOTICE ag[13707]: autotest server[stack::execute_control|execute_section|_execute_daemon] -> ssh_run(nohup /tmp/sysinfo/autoserv-vnjeh2/bin/autotestd /tmp/autoserv-JpOHSS -H autoserv --verbose --hostname=chromeos2-row3-rack4-host3 --user=chromeos-test /tmp/sysinfo/autoserv-vnjeh2/control.autoserv >/dev/null 2>/dev/null &) 2017-01-26T09:05:24.481611+00:00 NOTICE ag[13727]: autotest server[stack::execute_control|execute_section|_execute_daemon] -> ssh_run(/tmp/sysinfo/autoserv-vnjeh2/bin/autotestd_monitor /tmp/autoserv-JpOHSS 0 0)
,
Jan 26 2017
OK the good news first. The following test also failed but collected usable logs here: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/98016953-chromeos-test/chromeos2-row3-rack4-host3/sysinfo/var/log/ The bad news: not much happened in the syslog between 9:05:27 and 9:07. Just some phy0 scanning. Maybe a network switch was briefly down? 2017-01-26T09:05:19.654615+00:00 NOTICE ag[13667]: autotest server[stack::wrapper|_install|get_tmp_dir] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX) 2017-01-26T09:05:19.937098+00:00 NOTICE ag[13670]: autotest server[stack::decorated_func|wrapper|erase_dir_contents] -> ssh_run(find "/tmp/autoserv-HWw8gF" -mindepth 1 -maxdepth 1 -print0 | xargs -0 rm -rf) 2017-01-26T09:05:20.237094+00:00 NOTICE ag[13675]: autotest server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true) 2017-01-26T09:05:20.519767+00:00 NOTICE ag[13677]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(ls /tmp/sysinfo/autoserv-vnjeh2/bin/autotest > /dev/null 2>&1) 2017-01-26T09:05:20.804448+00:00 NOTICE ag[13680]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(umount /tmp/sysinfo/autoserv-vnjeh2/tmp) 2017-01-26T09:05:21.094031+00:00 NOTICE ag[13683]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(umount /tmp/sysinfo/autoserv-vnjeh2/tests/download) 2017-01-26T09:05:21.372591+00:00 NOTICE ag[13686]: autotest server[stack::before_hook|run|_do_run] -> ssh_run(rm -f /tmp/sysinfo/autoserv-vnjeh2/control.autoserv;rm -f /tmp/sysinfo/autoserv-vnjeh2/control.autoserv.state;rm -f /tmp/sysinfo/autoserv-vnjeh2/control;rm -f /tmp/sysinfo/autoserv-vnjeh2/control.state) 2017-01-26T09:05:22.949790+00:00 NOTICE ag[13696]: autotest server[stack::_do_run|execute_control|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi) 2017-01-26T09:05:23.175030+00:00 NOTICE ag[13699]: autotest server[stack::execute_section|_execute_daemon|get_tmp_dir] -> ssh_run(mkdir -p /tmp) 2017-01-26T09:05:23.400290+00:00 NOTICE ag[13702]: autotest server[stack::execute_section|_execute_daemon|get_tmp_dir] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX) 2017-01-26T09:05:24.209810+00:00 NOTICE ag[13707]: autotest server[stack::execute_control|execute_section|_execute_daemon] -> ssh_run(nohup /tmp/sysinfo/autoserv-vnjeh2/bin/autotestd /tmp/autoserv-JpOHSS -H autoserv --verbose --hostname=chromeos2-row3-rack4-host3 --user=chromeos-test /tmp/sysinfo/autoserv-vnjeh2/control.autoserv >/dev/null 2>/dev/null &) 2017-01-26T09:05:24.481611+00:00 NOTICE ag[13727]: autotest server[stack::execute_control|execute_section|_execute_daemon] -> ssh_run(/tmp/sysinfo/autoserv-vnjeh2/bin/autotestd_monitor /tmp/autoserv-JpOHSS 0 0) 2017-01-26T09:05:24.846159+00:00 NOTICE ag[13750]: autotest server[stack::write|_process_line|_process_line] -> ssh_run(echo B > /tmp/sysinfo/autoserv-vnjeh2/tmp/_autotmp_RUYeMeharness-fifo/autoserv.fifo) 2017-01-26T09:05:27.189428+00:00 NOTICE ag[13756]: autotest server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true) 2017-01-26T09:05:32.937460+00:00 DEBUG kernel: [ 8102.929141] ieee80211 phy0: device no longer idle - scanning 2017-01-26T09:05:36.114738+00:00 DEBUG kernel: [ 8106.107901] ieee80211 phy0: device now idle 2017-01-26T09:06:36.143696+00:00 DEBUG kernel: [ 8166.124850] ieee80211 phy0: device no longer idle - scanning 2017-01-26T09:06:39.322475+00:00 DEBUG kernel: [ 8169.304651] ieee80211 phy0: device now idle 2017-01-26T09:07:00.368488+00:00 NOTICE ag[13761]: autotest server[stack::runtest|cleanup|uninstall] -> ssh_run(rm -rf /tmp/sysinfo/autoserv-vnjeh2) 2017-01-26T09:07:00.702616+00:00 NOTICE ag[13764]: autotest server[stack::close|close|close] -> ssh_run(rm -rf "/tmp/sysinfo/autoserv-vnjeh2") 2017-01-26T09:07:00.976279+00:00 NOTICE ag[13767]: autotest server[stack::close|close|close] -> ssh_run(rm -rf "/tmp/autoserv-HWw8gF") 2017-01-26T09:07:01.250384+00:00 NOTICE ag[13770]: autotest server[stack::close|close|close] -> ssh_run(rm -rf "/tmp/autoserv-JpOHSS") 2017-01-26T09:07:01.313903+00:00 INFO sshd[13624]: Received disconnect from 172.25.90.2 port 57946:11: disconnected by user 2017-01-26T09:07:01.313981+00:00 INFO sshd[13624]: Disconnected from 172.25.90.2 port 57946 2017-01-26T09:07:03.194688+00:00 INFO sshd[13772]: Accepted publickey for root from 172.25.90.2 port 59559 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU 2017-01-26T09:07:03.593112+00:00 NOTICE ag[13775]: autotest server[stack::fetch_orphaned_crashdumps|_find_orphaned_crashdumps|list_files_glob] -> ssh_run(python -c 'import cPickle, glob, sys;cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)')
,
Jan 26 2017
net.log also shows nothing of interest, just wifi activity.
,
Jan 26 2017
Chrome system log: [12174:12174:0126/000837.426108:VERBOSE1:gaia_screen_handler.cc(660)] Gaia is loaded [12174:12174:0126/000837.426213:VERBOSE1:signin_screen_handler.cc(1304)] Login WebUI >> loginVisible, src: gaia-signin, webui_visible_: 1 [12174:12174:0126/000902.900827:VERBOSE1:gaia_screen_handler.cc(426)] OnPortalDetectionCompleted Online [12174:12174:0126/000931.218489:VERBOSE1:gaia_screen_handler.cc(426)] OnPortalDetectionCompleted Online ... and that's it. There is no other useful information. Everything seems to be working normally from the DUT viewpoint. Is it possible to access logs for the network switch that this device is connected to? I'll take a look at the other recent failure as well.
,
Jan 26 2017
,
Jan 26 2017
Given that this is a SMSC 7500 NIC and given the email thread about it perhaps swapping it out for a ASIX based USB NIC might be helpful? Or temporary change the port on the switch used by the DUT???
,
Jan 26 2017
Thank you for the suggestion. But this looks different from the other failures. Let me investigate some more. Also Build 5250 is very similar. Basically the devserver is telling us it cannot communicate with the DUT. But autoserv seems to have no problems with it. AND I should add that autoserv has no problems communicating with the devserver either. (I think.) Is it possible that the connections between pairs of devices (autoserv, devserver, DUT) are going through different switches? Anyway, time to look at the dreaded gigabyte-sized devserver logs... https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/98063633-chromeos-test/chromeos2-row3-rack4-host1/debug/ The failures here are between 5:30:06 and 5:31:18. Devserver is 100.115.245.198 for both failures.
,
Jan 26 2017
For the first failure: extract.py /var/log/devserver/server.log 92.40 1000 | grep falco > /tmp/server.log attached below.
,
Jan 26 2017
Here's also some info from console.log. I see no obvious failures in either.
,
Jan 26 2017
Here's a theory: RemoteDevice.__init__ takes a "ping" parameter, which makes the devserver issue a "ping -c 1" to the DUT before attempting to connect. If the ping fails, the DeviceNotPingable exception is thrown. If the network is loaded, it's not unthinkable that a single ping could fail. There are spikes of activity around 0050 to 0100 and around 0530. The network interface is far from saturation, but other devservers/DUTs may be contributing. I suspect that sending a few extra pings wouldn't hurt.
,
Jan 27 2017
CLs in progress. https://chromium-review.googlesource.com/#/c/433389/ https://chromium-review.googlesource.com/#/c/433370/
,
Jan 27 2017
Issue 652207 has been merged into this issue.
,
Jan 27 2017
,
Jan 27 2017
To answer the question @c15 yes the devservers and DUTs eventually are on different switches. Corp --> Devserver Corp(autoserv) -> Cat6K -> Lab dist -> lab server switch -> devserver Corp --> DUT Corp(autoserv) -> Cat6K -> Lab dist -> end of row switch -> top of rack switch -> DUT Devserver <--> DUT Devserver <--> lab server switch <--> dist <--> Cat6K <--> end of row <--> top of rack <--> DUT https://docs.google.com/drawings/d/1dtPtUTNvpt9T_l1GVMU9dCxX_U0x2rhSrgkKDSKi5IA/edit Something else I would like to be done is some downtime to update the Cisco IOS software on the switches to Cisco blessed stable version. https://b.corp.google.com/issues/34196870
,
Feb 2 2017
This issue is still seen on falco-chrome-pfq intermittently, e.g., https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5276. Any updates?
,
Feb 2 2017
I had forgotten to mark one of the CLs ready. I just did it. https://chromium-review.googlesource.com/#/c/433389/
,
Feb 2 2017
,
Feb 2 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/bddfc7946aa8ca2017f669e89692126f67a32cb1 commit bddfc7946aa8ca2017f669e89692126f67a32cb1 Author: Luigi Semenzato <semenzato@chromium.org> Date: Thu Feb 02 20:29:32 2017 Remote connection should not fail on failed ping by default The purpose of the ping check is to speed up failure in the interactive case ("cros flash" and "cros deploy" for instance) but it can be damaging in a test lab environment, where packet losses can be more common and need to be tolerated better. So do not use the ping check by default. BUG=chromium:683304 TEST=ran cros flash with disconnected ethernet, verified error is still "device not pingable" Change-Id: Ie4dcd53c01428cff51ef3dd43acb64e9a0a3b1aa Reviewed-on: https://chromium-review.googlesource.com/433389 Commit-Ready: Luigi Semenzato <semenzato@chromium.org> Tested-by: Luigi Semenzato <semenzato@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/bddfc7946aa8ca2017f669e89692126f67a32cb1/lib/remote_access.py
,
Jul 17 2017
ChromeOS Infra P1 Bugscrub. P1 Bugs in this component should be important enough to get weekly status updates. Is this already fixed? -> Fixed Is this no longer relevant? -> Archived or WontFix Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority. Is this a Feature Request rather than a bug? Type -> Feature Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign. Does this bug have the wrong owner? -> reassign. Bugs that remain in this state next week will be downgraded to P2.
,
Jul 24 2017
ChromeOS Infra P1 Bugscrub. Issue untouched in a week after previous message. Downgrading to P2.
,
Jun 8 2018
Hi, this bug has not been updated recently. Please acknowledge the bug and provide status within two weeks (6/22/2018), or the bug will be archived. Thank you. |
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by achuith@chromium.org
, Jan 20 2017