Project: chromium Issues People Development process History Sign in
New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.
Starred by 3 users
Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment
DeviceNotPingable exception fails provisions
Project Member Reported by achuith@chromium.org, Jan 20 2017 Back to list
Cc: keta...@chromium.org
Comment 2 by dshi@chromium.org, Jan 20 2017
Status: Fixed
yeah, I filed b/34504240 for syslab to repair it. They were able to run repair and no issue found. I ran a test in that dut, also succeeded.
http://cautotest/afe/#tab_id=view_job&object_id=97066848


Comment 3 by warx@chromium.org, Jan 23 2017
Status: Assigned
falco-chrome-pfq new infra failures: https://bugs.chromium.org/p/chromium/issues/detail?id=668863
Comment 4 by warx@chromium.org, Jan 23 2017
dshi@, can you take another look?
Comment 5 by warx@chromium.org, Jan 23 2017
Cc: dshi@chromium.org
Owner: pprabhu@chromium.org
changing owner to this week's infra sheriff
Comment 6 by warx@chromium.org, Jan 23 2017
Cc: reve...@chromium.org ravisadineni@chromium.org
loop in ChromeOS sheriffs as pprabhu@ has other issue needs attention
Two more falco failures last night, blocking chrome PFQ:
https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5249
https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5250

"provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host3: DeviceNotPingableError: Device chromeos2-row3-rack4-host3 is not pingable."

falco seems especially prone to infra/lab failures in the last couple weeks.

Can someone look at the DUTs?

Owner: semenzato@chromium.org
Looking now.
Looking at this specific failure:

https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5249

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/98016931-chromeos-test/chromeos2-row3-rack4-host3/

autotest.DEBUG shows that the device was functioning at 1:05:27, then down roughly between 1:06 and 1:07 and up again at 1:07.

The sysinfo didn't collect a console-ramoops and the syslog ends right before the failure.  (Including tail of syslog.  NOTE: times are GMT = PST + 8)

2017-01-26T09:03:35.478688+00:00 NOTICE ag[13641]: autotest server[stack::send_file|use_rsync|_check_rsync] -> ssh_run(rsync --version)
2017-01-26T09:04:29.732809+00:00 DEBUG kernel: [ 8039.735043] ieee80211 phy0: device no longer idle - scanning
2017-01-26T09:04:32.912995+00:00 DEBUG kernel: [ 8042.917029] ieee80211 phy0: device now idle
2017-01-26T09:05:19.340939+00:00 NOTICE ag[13664]: autotest server[stack::wrapper|_install|get_tmp_dir] -> ssh_run(mkdir -p /tmp)
2017-01-26T09:05:19.654615+00:00 NOTICE ag[13667]: autotest server[stack::wrapper|_install|get_tmp_dir] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX)
2017-01-26T09:05:19.937098+00:00 NOTICE ag[13670]: autotest server[stack::decorated_func|wrapper|erase_dir_contents] -> ssh_run(find "/tmp/autoserv-HWw8gF" -mindepth 1 -maxdepth 1 -print0 | xargs -0 rm -rf)
2017-01-26T09:05:20.237094+00:00 NOTICE ag[13675]: autotest server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)
2017-01-26T09:05:20.519767+00:00 NOTICE ag[13677]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(ls /tmp/sysinfo/autoserv-vnjeh2/bin/autotest > /dev/null 2>&1)
2017-01-26T09:05:20.804448+00:00 NOTICE ag[13680]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(umount /tmp/sysinfo/autoserv-vnjeh2/tmp)
2017-01-26T09:05:21.094031+00:00 NOTICE ag[13683]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(umount /tmp/sysinfo/autoserv-vnjeh2/tests/download)
2017-01-26T09:05:21.372591+00:00 NOTICE ag[13686]: autotest server[stack::before_hook|run|_do_run] -> ssh_run(rm -f /tmp/sysinfo/autoserv-vnjeh2/control.autoserv;rm -f /tmp/sysinfo/autoserv-vnjeh2/control.autoserv.state;rm -f /tmp/sysinfo/autoserv-vnjeh2/control;rm -f /tmp/sysinfo/autoserv-vnjeh2/control.state)
2017-01-26T09:05:22.949790+00:00 NOTICE ag[13696]: autotest server[stack::_do_run|execute_control|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2017-01-26T09:05:23.175030+00:00 NOTICE ag[13699]: autotest server[stack::execute_section|_execute_daemon|get_tmp_dir] -> ssh_run(mkdir -p /tmp)
2017-01-26T09:05:23.400290+00:00 NOTICE ag[13702]: autotest server[stack::execute_section|_execute_daemon|get_tmp_dir] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX)
2017-01-26T09:05:24.209810+00:00 NOTICE ag[13707]: autotest server[stack::execute_control|execute_section|_execute_daemon] -> ssh_run(nohup /tmp/sysinfo/autoserv-vnjeh2/bin/autotestd /tmp/autoserv-JpOHSS -H autoserv --verbose --hostname=chromeos2-row3-rack4-host3 --user=chromeos-test /tmp/sysinfo/autoserv-vnjeh2/control.autoserv >/dev/null 2>/dev/null &)
2017-01-26T09:05:24.481611+00:00 NOTICE ag[13727]: autotest server[stack::execute_control|execute_section|_execute_daemon] -> ssh_run(/tmp/sysinfo/autoserv-vnjeh2/bin/autotestd_monitor /tmp/autoserv-JpOHSS 0 0)

OK the good news first.

The following test also failed but collected usable logs here:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/98016953-chromeos-test/chromeos2-row3-rack4-host3/sysinfo/var/log/

The bad news: not much happened in the syslog between 9:05:27 and 9:07.  Just some phy0 scanning.  Maybe a network switch was briefly down?

2017-01-26T09:05:19.654615+00:00 NOTICE ag[13667]: autotest server[stack::wrapper|_install|get_tmp_dir] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX)
2017-01-26T09:05:19.937098+00:00 NOTICE ag[13670]: autotest server[stack::decorated_func|wrapper|erase_dir_contents] -> ssh_run(find "/tmp/autoserv-HWw8gF" -mindepth 1 -maxdepth 1 -print0 | xargs -0 rm -rf)
2017-01-26T09:05:20.237094+00:00 NOTICE ag[13675]: autotest server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)
2017-01-26T09:05:20.519767+00:00 NOTICE ag[13677]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(ls /tmp/sysinfo/autoserv-vnjeh2/bin/autotest > /dev/null 2>&1)
2017-01-26T09:05:20.804448+00:00 NOTICE ag[13680]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(umount /tmp/sysinfo/autoserv-vnjeh2/tmp)
2017-01-26T09:05:21.094031+00:00 NOTICE ag[13683]: autotest server[stack::run|_do_run|verify_machine] -> ssh_run(umount /tmp/sysinfo/autoserv-vnjeh2/tests/download)
2017-01-26T09:05:21.372591+00:00 NOTICE ag[13686]: autotest server[stack::before_hook|run|_do_run] -> ssh_run(rm -f /tmp/sysinfo/autoserv-vnjeh2/control.autoserv;rm -f /tmp/sysinfo/autoserv-vnjeh2/control.autoserv.state;rm -f /tmp/sysinfo/autoserv-vnjeh2/control;rm -f /tmp/sysinfo/autoserv-vnjeh2/control.state)
2017-01-26T09:05:22.949790+00:00 NOTICE ag[13696]: autotest server[stack::_do_run|execute_control|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)
2017-01-26T09:05:23.175030+00:00 NOTICE ag[13699]: autotest server[stack::execute_section|_execute_daemon|get_tmp_dir] -> ssh_run(mkdir -p /tmp)
2017-01-26T09:05:23.400290+00:00 NOTICE ag[13702]: autotest server[stack::execute_section|_execute_daemon|get_tmp_dir] -> ssh_run(mktemp -d /tmp/autoserv-XXXXXX)
2017-01-26T09:05:24.209810+00:00 NOTICE ag[13707]: autotest server[stack::execute_control|execute_section|_execute_daemon] -> ssh_run(nohup /tmp/sysinfo/autoserv-vnjeh2/bin/autotestd /tmp/autoserv-JpOHSS -H autoserv --verbose --hostname=chromeos2-row3-rack4-host3 --user=chromeos-test /tmp/sysinfo/autoserv-vnjeh2/control.autoserv >/dev/null 2>/dev/null &)
2017-01-26T09:05:24.481611+00:00 NOTICE ag[13727]: autotest server[stack::execute_control|execute_section|_execute_daemon] -> ssh_run(/tmp/sysinfo/autoserv-vnjeh2/bin/autotestd_monitor /tmp/autoserv-JpOHSS 0 0)
2017-01-26T09:05:24.846159+00:00 NOTICE ag[13750]: autotest server[stack::write|_process_line|_process_line] -> ssh_run(echo B > /tmp/sysinfo/autoserv-vnjeh2/tmp/_autotmp_RUYeMeharness-fifo/autoserv.fifo)
2017-01-26T09:05:27.189428+00:00 NOTICE ag[13756]: autotest server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)
2017-01-26T09:05:32.937460+00:00 DEBUG kernel: [ 8102.929141] ieee80211 phy0: device no longer idle - scanning
2017-01-26T09:05:36.114738+00:00 DEBUG kernel: [ 8106.107901] ieee80211 phy0: device now idle
2017-01-26T09:06:36.143696+00:00 DEBUG kernel: [ 8166.124850] ieee80211 phy0: device no longer idle - scanning
2017-01-26T09:06:39.322475+00:00 DEBUG kernel: [ 8169.304651] ieee80211 phy0: device now idle
2017-01-26T09:07:00.368488+00:00 NOTICE ag[13761]: autotest server[stack::runtest|cleanup|uninstall] -> ssh_run(rm -rf /tmp/sysinfo/autoserv-vnjeh2)
2017-01-26T09:07:00.702616+00:00 NOTICE ag[13764]: autotest server[stack::close|close|close] -> ssh_run(rm -rf "/tmp/sysinfo/autoserv-vnjeh2")
2017-01-26T09:07:00.976279+00:00 NOTICE ag[13767]: autotest server[stack::close|close|close] -> ssh_run(rm -rf "/tmp/autoserv-HWw8gF")
2017-01-26T09:07:01.250384+00:00 NOTICE ag[13770]: autotest server[stack::close|close|close] -> ssh_run(rm -rf "/tmp/autoserv-JpOHSS")
2017-01-26T09:07:01.313903+00:00 INFO sshd[13624]: Received disconnect from 172.25.90.2 port 57946:11: disconnected by user
2017-01-26T09:07:01.313981+00:00 INFO sshd[13624]: Disconnected from 172.25.90.2 port 57946
2017-01-26T09:07:03.194688+00:00 INFO sshd[13772]: Accepted publickey for root from 172.25.90.2 port 59559 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-01-26T09:07:03.593112+00:00 NOTICE ag[13775]: autotest server[stack::fetch_orphaned_crashdumps|_find_orphaned_crashdumps|list_files_glob] -> ssh_run(python -c 'import cPickle, glob, sys;cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)')



net.log also shows nothing of interest, just wifi activity.

Cc: johndhong@chromium.org
Chrome system log:

[12174:12174:0126/000837.426108:VERBOSE1:gaia_screen_handler.cc(660)] Gaia is loaded
[12174:12174:0126/000837.426213:VERBOSE1:signin_screen_handler.cc(1304)] Login WebUI >> loginVisible, src: gaia-signin, webui_visible_: 1
[12174:12174:0126/000902.900827:VERBOSE1:gaia_screen_handler.cc(426)] OnPortalDetectionCompleted Online
[12174:12174:0126/000931.218489:VERBOSE1:gaia_screen_handler.cc(426)] OnPortalDetectionCompleted Online

... and that's it.  There is no other useful information.

Everything seems to be working normally from the DUT viewpoint.  Is it possible to access logs for the network switch that this device is connected to?

I'll take a look at the other recent failure as well.
Cc: -johndhong@chromium.org haoweiw@chromium.org englab-sys-cros@google.com
Given that this is a SMSC 7500 NIC and given the email thread about it perhaps swapping it out for a ASIX based USB NIC might be helpful?

Or temporary change the port on the switch used by the DUT???
Thank you for the suggestion.  But this looks different from the other failures.  Let me investigate some more.

Also

Build 5250 is very similar.

Basically the devserver is telling us it cannot communicate with the DUT.  But autoserv seems to have no problems with it.  AND I should add that autoserv has no problems communicating with the devserver either.  (I think.)

Is it possible that the connections between pairs of devices (autoserv, devserver, DUT) are going through different switches?

Anyway, time to look at the dreaded gigabyte-sized devserver logs...

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/98063633-chromeos-test/chromeos2-row3-rack4-host1/debug/

The failures here are between 5:30:06 and 5:31:18.
Devserver is 100.115.245.198 for both failures.

For the first failure:

extract.py /var/log/devserver/server.log 92.40 1000 | grep falco > /tmp/server.log

attached below.


server.log
253 KB View Download
Here's also some info from console.log.

I see no obvious failures in either.




console.log
150 KB View Download
Here's a theory:

RemoteDevice.__init__ takes a "ping" parameter, which makes the devserver issue a "ping -c 1" to the DUT before attempting to connect.  If the ping fails, the DeviceNotPingable exception is thrown.  If the network is loaded, it's not unthinkable that a single ping could fail.

There are spikes of activity around 0050 to 0100 and around 0530.  The network interface is far from saturation, but other devservers/DUTs may be contributing.

I suspect that sending a few extra pings wouldn't hurt.

Components: Infra>Client>ChromeOS
Labels: OS-Chrome
Summary: DeviceNotPingable exception fails provisions (was: falco-chrome-pfq failures)
CLs in progress.

https://chromium-review.googlesource.com/#/c/433389/
https://chromium-review.googlesource.com/#/c/433370/
Cc: johndhong@chromium.org cywang@chromium.org warx@chromium.org kevcheng@chromium.org bccheng@chromium.org semenzato@chromium.org jrbarnette@chromium.org vpalatin@chromium.org
 Issue 652207  has been merged into this issue.
Cc: -johndhong@chromium.org
To answer the question @c15 yes the devservers and DUTs eventually are on different switches.

Corp --> Devserver
Corp(autoserv) -> Cat6K -> Lab dist -> lab server switch -> devserver

Corp --> DUT
Corp(autoserv) -> Cat6K -> Lab dist -> end of row switch -> top of rack switch -> DUT

Devserver <--> DUT
Devserver <--> lab server switch <--> dist <--> Cat6K <--> end of row <--> top of rack <--> DUT

https://docs.google.com/drawings/d/1dtPtUTNvpt9T_l1GVMU9dCxX_U0x2rhSrgkKDSKi5IA/edit


Something else I would like to be done is some downtime to update the Cisco IOS  software on the switches to Cisco blessed stable version.
https://b.corp.google.com/issues/34196870
Comment 23 by x...@chromium.org, Feb 2 2017
This issue is still seen on falco-chrome-pfq intermittently, e.g., https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5276. Any updates?
I had forgotten to mark one of the CLs ready.  I just did it.
https://chromium-review.googlesource.com/#/c/433389/
Status: Started
Project Member Comment 26 by bugdroid1@chromium.org, Feb 2 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/bddfc7946aa8ca2017f669e89692126f67a32cb1

commit bddfc7946aa8ca2017f669e89692126f67a32cb1
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Thu Feb 02 20:29:32 2017

Remote connection should not fail on failed ping by default

The purpose of the ping check is to speed up failure in the
interactive case ("cros flash" and "cros deploy" for instance)
but it can be damaging in a test lab environment, where
packet losses can be more common and need to be tolerated
better.  So do not use the ping check by default.

BUG=chromium:683304
TEST=ran cros flash with disconnected ethernet, verified error is still "device not pingable"

Change-Id: Ie4dcd53c01428cff51ef3dd43acb64e9a0a3b1aa
Reviewed-on: https://chromium-review.googlesource.com/433389
Commit-Ready: Luigi Semenzato <semenzato@chromium.org>
Tested-by: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/bddfc7946aa8ca2017f669e89692126f67a32cb1/lib/remote_access.py

Labels: akeshet-pending-downgrade
ChromeOS Infra P1 Bugscrub.

P1 Bugs in this component should be important enough to get weekly status updates.

Is this already fixed?  -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority.
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign.
Does this bug have the wrong owner? -> reassign.

Bugs that remain in this state next week will be downgraded to P2.
Labels: -akeshet-pending-downgrade Pri-2
ChromeOS Infra P1 Bugscrub.

Issue untouched in a week after previous message. Downgrading to P2.
Sign in to add a comment