Starred by 4 users
Status: Verified
Owner:
Closed: Jun 20
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 698309

Blocking:
issue 667067
issue 668863
issue 695529



Provision fails: one devserver cannot ssh to a DUT, but another devserver can.
Project Member Reported by jen...@chromium.org, Feb 14 2017
We've seen this happen on several chrome PFQ bots (falco, daisy-skate, etc.). When running the provision AutoUpdate test on a DUT, it failed to ssh to the DUT after the "Transfer Devserver/Stateful Update Package" stage. It then retries the auto-update and continues to fail to connect to the DUT over ssh. See a falco log at:
https://00e9e64bac834380acfa54453eb6aaa69f6ab2db0ac2e97c7a-apidata.googleusercontent.com/download/storage/v1/b/chromeos-autotest-results/o/87607394-chromeos-test%2Fchromeos2-row3-rack1-host21%2Fdebug%2Fautoserv.DEBUG?qk=AD5uMEsW8WI8_IUmZ9Vl8e6z5_-Mer5SlXNxpjCGkv7PsEXkniyqo4IuCFC9-ZBkht1PoWL5UiXXSb5nUZI84bk7Pie2eU26TX5eqo9iFxqai4oEjzEqjqLqleRj_oXNiQnwVRXfGDpWmXP7Xsh_knTFUxG6BG63yJ8MfCjKGBOYctPrDtoC1NK0rTEAhpsKHDpMx41VjumJo_OE9Z3kfiVlC6dPhwOGC6dGS_gdy0zCt8JbVPskNseouBUXKu6iZUrs8LC8S3CXBOhJ526gg9M5suUtYOmroIQSs37ljccUdAOkLdkCcaDfQXt4q0USRcbg2oJdn3zoyK8rGffptNNe8OtqK2pJVXa1EVnp1B9UQTyhvKOyxRcukPPk2IHm7FVOn-Gz0ZgXJIGVUwEvK-Rvsd2s9AzxgawvhUEHm4FFzEp7UERJu2TxEHNu4V8-0D02uxnegIE3RSHKsqtxbSOZmjCv2i4Av59NKNxBqTPw_2Dea3XcI06Xn4xzsBXZ8npclhRnoGF-S22XS9EEj8n3hXIrk-LQCzb66XXHsVHRN1OS8JwSpOn0VsnfI9N2x3DzbT3H3lhYMaimeKzyduV1d8a-ApvDZbjLmwMGufmZtDydLVN9THErKCMeuS-XnUUdyFZIPllvqaP9VrUP3Ql36-bwjyHoGdUgscleRUlO5UjuvTXpvJTtO5JeaGKen6BRTE01BMYfD-J-Ze0dMQfLyZuMgW38dQZ4lsbMmUjeyX9MnTgEF73m9rYsaQh8Ld8eM-TeKEOfISlerOAPfiEPl3rAmswHDFwBKmLNasL6a64Sz_F6z9ij9iUTuwIiQqLfyfj3lEAvjzuic6wRS-IVRNakWgTWEd1Zn0s1YVKvZXpNmtmM8K5hT-omqQ7OwSCeu4nyxFwX

11/27 02:55:16.663 DEBUG|        dev_server:1706| Current CrOS auto-update status: Transfer Devserver/Stateful Update Package
....
11/27 02:55:26.715 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/get_au_status?build_name=falco-chrome-pfq/R57-9026.0.0-rc3&force_update=True&pid=14020&host_name=chromeos2-row3-rack1-host21&full_update=False"''
11/27 02:55:28.153 DEBUG|        dev_server:1706| Current CrOS auto-update status: transfer rootfs update package
....
11/27 03:07:04.069 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/get_au_status?build_name=falco-chrome-pfq/R57-9026.0.0-rc3&force_update=True&pid=14020&host_name=chromeos2-row3-rack1-host21&full_update=False"''
11/27 03:07:05.356 DEBUG|        dev_server:1768| Failed to trigger auto-update process on devserver
11/27 03:07:05.358 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/handler_cleanup?pid=14020&host_name=chromeos2-row3-rack1-host21"''
11/27 03:07:06.701 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/collect_cros_au_log?pid=14020&host_name=chromeos2-row3-rack1-host21"''
11/27 03:07:08.115 DEBUG|        dev_server:1600| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos2-row3-rack1-host21/1217524-provision/20162711025401/autoupdate_logs/CrOS_update_chromeos2-row3-rack1-host21_14020.log
11/27 03:07:08.116 DEBUG|        dev_server:1867| Exception raised on auto_update attempt #1:
 Traceback (most recent call last):
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 227, in TriggerAU
    self._WriteAUStatus(cros_update_progress.FINISHED)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 542, in __exit__
    self.device.Cleanup()
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 665, in Cleanup
    self.BaseRunCommand(cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 830, in BaseRunCommand
    return self.GetAgent().RemoteSh(cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 340, in RemoteSh
    raise SSHConnectionError(e.result.error)
SSHConnectionError: ssh: connect to host chromeos2-row3-rack1-host21 port 22: Connection timed out
11/27 03:07:08.117 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/kill_au_proc?host_name=chromeos2-row3-rack1-host21"''
11/27 03:07:29.521 DEBUG|        dev_server:1885| AU failed, trying IP instead of hostname: 100.115.226.11

My suspicion is that either the device hung while updating the stateful package, or the stage somehow takes too long to finish and the timeout is not long enough to wait for it. But I am not clear on what is being done and checked during that stage.

Maybe we should ssh into the DUT and look into the detailed test log to see if there is any clue about what failed or where it got stuck.
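
For reference, the check that is timing out can be reproduced by hand along these lines (a rough sketch only, not autotest code; the devserver endpoint, build name and pid are copied from the log above, and the helper names are made up):

import subprocess

DEVSERVER = '100.115.245.200'
DUT = 'chromeos2-row3-rack1-host21'

def dut_is_sshable(host, timeout=30):
    # True if an ssh connection to the DUT can be opened from this machine.
    result = subprocess.run(
        ['ssh', '-oConnectTimeout=%d' % timeout,
         '-oStrictHostKeyChecking=no', 'root@%s' % host, 'true'],
        capture_output=True)
    return result.returncode == 0

def au_status(devserver, build, pid, host):
    # Query the devserver's get_au_status endpoint, exactly as the log above does.
    url = ('http://%s:8082/get_au_status?build_name=%s&force_update=True'
           '&pid=%s&host_name=%s&full_update=False'
           % (devserver, build, pid, host))
    out = subprocess.run(['ssh', devserver, 'curl "%s"' % url],
                         capture_output=True, text=True)
    return out.stdout

print('DUT sshable from here:', dut_is_sshable(DUT))
print(au_status(DEVSERVER, 'falco-chrome-pfq/R57-9026.0.0-rc3', 14020, DUT))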

 
Comment 1 by jen...@chromium.org, Feb 14 2017
Blocking: 668863
Comment 2 by jen...@chromium.org, Feb 14 2017
This issue could be one of the issues mentioned in  issue 687248 .
Comment 3 by xixuan@chromium.org, Feb 14 2017
debug log directory: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/87607394-chromeos-test/chromeos2-row3-rack1-host21

I don't know which build this job belongs to, but based on my check of host chromeos2-row3-rack1-host21, my answer to comment #2 is no.

Looks like this host is either completely offline (https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row3-rack1-host21/1427510-provision), or is not sshable and can only be repaired by servo (provision 96000490 failed, and repair 96000490 worked by using servo to repair it).

If the DUT is not online during test, retry won't help.

Comment 4 by aut...@google.com, Feb 15 2017
Owner: xixuan@chromium.org
Cc: afakhry@chromium.org levarum@chromium.org akes...@chromium.org
Labels: -Pri-3 OS-Chrome Pri-2
Status: Assigned
We had 9 reports of this failure on falco-chrome-pfq yesterday! 

Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host3: SSHConnectionError: ssh: connect to host chromeos2-row3-rack4-host3 port 22: Connection timed out.

Most recent report is at https://bugs.chromium.org/p/chromium/issues/detail?id=668863#c76.

Currently this is the only failing important build on the PFQ. Can we do anything about this DUT?
Cc: steve...@chromium.org
Labels: -Pri-2 Pri-1
Summary: Provision fails: one devserver cannot connect to a DUT, but another devserver can. (was: Failed to connect to DUT during or after Autoupdate)
The failure happened in two hosts:

chromeos2-row3-rack4-host1 & chromeos2-row3-rack4-host3

The reason is that their assigned devserver, 100.115.245.197, is not able to ssh to these two DUTs for whatever reason.

The reason they're working now is that, for whatever reason, they were assigned 100.115.245.198 for one provisioning run, and it succeeded.

We now have 're-resolve a devserver to do provision if the DUT is not pingable', but we don't handle this case. Let me think about adding another re-resolve fix to handle it.
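
Roughly, the extra re-resolve would look something like this (a sketch only; resolve_devserver, devserver_can_ssh_to and trigger_auto_update are hypothetical hooks, not the real dev_server.py API):

def provision_with_reresolve(dut, build, resolve_devserver, devserver_can_ssh_to,
                             trigger_auto_update, max_attempts=2):
    # Sketch of the proposed fix: if the resolved devserver cannot ssh to the
    # DUT, re-resolve to another devserver in the same subnet before giving up.
    tried = set()
    for _ in range(max_attempts):
        ds = resolve_devserver(dut, exclude=tried)
        if ds is None:
            break
        tried.add(ds)
        # Skip devservers that cannot reach the DUT over ssh; this is the
        # case seen on chromeos2-row3-rack4-host1/host3.
        if not devserver_can_ssh_to(ds, dut):
            continue
        return trigger_auto_update(ds, dut, build)
    raise Exception('No devserver in the subnet could ssh to %s' % dut)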
Summary: Provision fails: one devserver cannot ssh to a DUT, but another devserver can. (was: Provision fails: one devserver cannot connect to a DUT, but another devserver can.)
Cc: jen...@chromium.org xixuan@chromium.org uekawa@chromium.org slavamn@chromium.org x...@chromium.org sbasi@chromium.org semenzato@chromium.org dbehr@chromium.org jrobbins@chromium.org achuith@chromium.org jrbarnette@chromium.org ejcaruso@chromium.org
 Issue 687248  has been merged into this issue.
For falco the recent errors all appear to be associated with:

chromeos2-row3-rack4-host1
chromeos2-row3-rack4-host3

In the past we have removed / replaced failing hosts when this occurs by filing an issue using one of the go/bugatrooper templates.

Blockedon: 698309
> In the past we have removed / replaced failing hosts
> when this occurs by filing an issue using one of the
> go/bugatrooper templates.

Please don't assume that the problem is hardware, or that
it can be fixed simply by removing a device; please let
the CrOS Infra team make that judgment.  These devices are
literally irreplaceable; we can't afford to remove them
without conclusive evidence.

In particular, this problem could easily be the external
network environment, not the device; that's currently what
I expect.

Note also that go/bug-a-trooper isn't the right forum.
If you need action relating to the Chrome OS test lab, a
bug filed in Monorail against component
Infra>Client>ChromeOS (just like this bug) is the way to
go.

So, I put up a CL to modify the go/bug-a-trooper info. Maybe we should add an extra template there instead? It would be super helpful to gardeners to have a single place to go for that sort of how-to information.

WRT the failing devices, there is a pretty clear pattern of those 2 devices failing and others not. While I recognize there are a myriad of reasons why, as long as they are in the pool and constantly timing out, it is very unlikely for a PFQ run to pass.

I already filed https://b.corp.google.com/u/0/issues/35951324, please feel free to update that and address the issue differently.

> WRT the failing devices, there is a pretty clear pattern
> of those 2 devices failing and others not. While I recognize
> there are a myriad of reasons why, as long as they are
> in the pool and constantly timing out, it is very unlikely
> for a PFQ run to pass.

It's possible to shift the devices out of the pool while we
figure out what's wrong.

> I already filed https://b.corp.google.com/u/0/issues/35951324,
> please feel free to update that and address the issue differently.

I've closed the ticket.  The lab techs can't help with the problem,
since the devices in question are actually up.

Our best bet is to move these devices out of the BVT pool, and see
if we can figure out the reason for the failures offline.

Looking at the DUT histories, I note that chromeos2-row3-rack4-host1
has failed about 61% of all its provisioning tasks in the past week.
chromeos2-row3-rack4-host3 has failed only(!) 38%.  Despite that, both
DUTs are continuing to run 100+ tests/day.

Cc: teravest@chromium.org
FYI +teravest@ who has foolishly expressed interest in this problem :)

jrbarnette@, thanks for diving into this a bit more. 61% (and even 38%) failure rates sound super alarming; I naively assumed that if the failure rate was that high it was basically 100%.

Do we have any notion of why it might be that high just on those devices?

Can you move those out of the test pool for now, or remind me how to do that? (If that is something we want gardeners to be able to do, we should document it and point them to the doc in go/bug-a-trooper; if not we should document the correct escalation process and point them there).

I have removed 'pool:bvt' from these two hosts.
I added chromeos2-row3-rack2-host17 & chromeos2-row3-rack2-host15 to pool:bvt. Let's monitor how these two DUTs work.
> Do we have any notion of why it might be that high just on those devices?

https://en.wikipedia.org/wiki/The_$64,000_Question

That said, I haven't checked the failure rate across the board.  I'm
starting that work now.

We really need to dig deeper into the failure symptoms; it's
possible, even likely, that there's more than one reason for failure
at work.

On that note, looking at the original bug, it's not clear to me that
we've been talking about the same problem all along.  What do we think
this bug is really about?  Is it just a handful of suspect falco DUTs,
or is it some broader failure in provisioning?

There may well be more than one issue here. The gardener noticed that falco was especially flaky this week so I looked into that and noticed that it was primarily 2 DUTs (mostly because I have run into this sort of thing before).

This particular issue should probably be used to track the specific failure described initially and/or broader problems of how we deal with provision failures and/or why they seem to have been occurring more frequently in the past couple of months (which also might be more than one root cause).

OK.  Probably we should open a bug.  Until then, I'll note that I've
gathered the logs of every failed provision task, and every repair
task (pass or fail) on every falco for the past week.  I've done some
mining of the data, but one item is of particular interest relative
to chromeos2-row3-rack4-host1:
  * The host has failed provisioning 71 times in the same time period,
    substantially more than any other host.
  * The host has been the subject of a repair task 73 times in
    the past week.  (N.B. Every failed provision triggers repair)
  * In all but three of those repair tasks, no problem was found.

Put another way, according to the *repair* logs, chromeos2-row3-rack4-host1
is basically never broken.

We've got a known problem with load on devservers contributing to
provision failures.  That usually manifests as "provision fails,
complaining that the DUT is offline; 45 seconds later, repair reports
that the DUT is online".  chromeos2-row3-rack4-host1 was never offline
at the start of any repair task (even the tasks that found problems).

Questions:
1. Is there a simplified timeline of events extracted from the logs, so we can reason about them without all trying to parse the logs separately?  Specifically something like:

0s: Server-A: Provision: ssh DUT1 commandX - start
60s: Server-A: Provision: ssh DUT1 commandX - time out
105s: Server-B: Reset: ssh DUT1 commandY - start/succeed

Is this timeline the same in all these cases?

2. Is the server the same in all of these cases?
3. Is command X the same in all cases?
4. Can we try adding diagnostics when we get the timeout?  netstat, mtr --report.
5. Can we try adding a 45 second delay and retry on that timeout?
6. Can we see the ssh/provision data from Monarch for these operations in questions?


For question 1, do you mean we print the whole thing in one log?
Or do we print one line "0s: Server-A: Provision: ssh DUT1 commandX - start
60s: Server-A: Provision: ssh DUT1 commandX - time out" in the provision log, and another line "105s: Server-B: Reset: ssh DUT1 commandY - start/succeed" in the reset log, then parse them from each log?

2. It may not be the same. Server-A for provision can vary; any devserver in the same subnet is available. But usually it won't change unless Server-A is not in a good health state. Server-B should stay the same.

3. CommandX is the same.

For 4, what do we want to get? For the host or for the server?

5. It's doable.

6. From Monarch, currently the answer is no. These data only exist in logs and need to be found manually. It would be great if we could check from Monarch whether a DUT is online/offline at a given time.
For #1: I mean, has somebody, either manually or through an automatic script, taken a few cases that are exhibiting this problem and distilled them down into a simple relevant log like I described, so we can reason about why these failures are happening?  We're starting to build tools that can analyze these logs, so it would be awesome if we could script it up.
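
A distilled timeline could be produced with something like the sketch below (it assumes the "MM/DD HH:MM:SS.mmm LEVEL|module:line|" format from the autoserv.DEBUG excerpt above; the list of "interesting" events is only illustrative):

import re
import sys
from datetime import datetime

LINE_RE = re.compile(r'^(\d\d/\d\d \d\d:\d\d:\d\d\.\d+) \w+\|')
INTERESTING = ('get_au_status', 'SSHConnectionError', 'auto-update status',
               'Failed to trigger')

def timeline(path, label):
    # Collect timestamped "interesting" lines and print them relative to the
    # first event, e.g. "  0s: provision: Running 'ssh ... get_au_status ...'".
    events = []
    with open(path) as f:
        for line in f:
            m = LINE_RE.match(line)
            if not m or not any(k in line for k in INTERESTING):
                continue
            ts = datetime.strptime(m.group(1), '%m/%d %H:%M:%S.%f')
            events.append((ts, line.strip()))
    if not events:
        return
    t0 = events[0][0]
    for ts, line in events:
        print('%5ds: %s: %s' % ((ts - t0).total_seconds(), label, line))

if __name__ == '__main__':
    # Usage (hypothetical): timeline.py provision=autoserv.DEBUG repair=autoserv.DEBUG
    for arg in sys.argv[1:]:
        label, path = arg.split('=', 1)
        timeline(path, label)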

2. If the problems always happen with the same devservers, that tells us something.  If the command Y that succeeds runs from the same devserver as command X, or from a different one, that also tells us something.  Having a number of timelines in the simplified format above lets us answer these questions and narrow down where we should be looking further.

3. What is the command X that always fails?  Is the failure always a connection timeout?  How long does it take from the command being issued to when it times out?  Have we tried running with ssh -v for provisions to get more data?  There should be Monarch data corresponding to the SSH calls, so we should be able to see the run and the exact retried calls under the covers.

4. Well I'm assuming we can't really talk to the dut since ssh is failing, so from the server.  But if we can get a connection to the dut to ask it to dump that information it would also be useful.  If the dut has a ton of connections open, it could explain why connections are being timed out.
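
A devserver-side hook for that could be as simple as the sketch below (netstat and mtr are standard tools, but where exactly this would be called from in cros_update is left open; this is not existing code):

import logging
import subprocess

def dump_network_diagnostics(dut_hostname):
    # Log netstat and mtr output from the devserver after an ssh timeout.
    for cmd in (['netstat', '-tn'],
                ['mtr', '--report', '--report-cycles', '5', dut_hostname]):
        try:
            out = subprocess.run(cmd, capture_output=True, text=True,
                                 timeout=120)
            logging.debug('%s:\n%s', ' '.join(cmd), out.stdout)
        except (OSError, subprocess.TimeoutExpired) as e:
            logging.debug('diagnostic %s failed: %s', cmd[0], e)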

5. If we have data to back up that many provisions that fail are followed by a successful operation a mere 45 seconds later, it seems prudent to try this retry.  We should continue to root-cause the underlying failure instead of just relying on this extra retry, but it lets us get past this immediate issue.  In addition, there should be data emitted on these retries so we can understand whether they're working and why.

6. We definitely have ssh data and provision data.  The ssh data might not be correlated to DUT (but I thought it was) and should let us look for patterns on these operations.  There's also data in Monarch about the status of a dut being up or down.
For 2, at least in this bug, the problem happens because one devserver cannot ssh to the device, and when provision resolves another devserver, it's fine.

For 3, the ssh command looks like this: “ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpmhizac/testing_rsa root@chromeos4-row4-rack9-host4”.

For 6, there's something on Monarch, but I thought there was no data tracking a DUT's condition (offline/online, and checked by which server). If there is, what's the URL for it? And what is the "ssh data" you mentioned?
For #1, I've found that approach to be very helpful for debugging test infrastructure in the past. Do we have any documentation for getting access to autotest job results and log files? I can't figure out how to get job results and log files for a given hostname right now; presumably autotest has job records with some pointers to the logs.
> For #1: I mean has somebody either manually or through an
> automatic script, taken a few cases that are exhibiting
> this problem and distilled down into a simple relevant log
> like I described so we can reason about why these failures
> are happening.  We're starting to build tools that can analyze
> these logs, so it would be awesome if we can script it up.

I cobbled together a throwaway script to parse the output of
`dut-status`, and did a lot of ad hoc processing on that data
to learn stuff.

The total data is nearly 2GB; I won't upload it here.  However,
downloading it yourself is pretty easy.  The 'falco-histories.txt'
file is the raw `dut-status` output, including URLs to logs.
The 'get_failures' script can process 'falco-histories.txt'
and produce a script of 'gsutil' commands that will copy the
logs to the current directory.
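
For anyone reproducing this locally, the processing boils down to something like the sketch below (the assumed line format -- a failure marker plus a gs:// log URL on the relevant lines -- is a guess at what falco-histories.txt contains, not a description of the attached script):

import re
import sys

GS_URL_RE = re.compile(r'gs://\S+')

def emit_gsutil_cmds(history_file):
    # Scan dut-status output for failed tasks and print gsutil commands that
    # copy their logs into the current directory.
    with open(history_file) as f:
        for line in f:
            if 'FAIL' not in line:        # assumed failure marker
                continue
            m = GS_URL_RE.search(line)
            if m:
                print('gsutil -m cp -r %s .' % m.group(0))

if __name__ == '__main__':
    emit_gsutil_cmds(sys.argv[1] if len(sys.argv) > 1 else 'falco-histories.txt')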

> 3. What is command X that always fails?  Is the failure
> always connection timed out?  How long from the command
> getting issued, to when it times out?  Have we tried running
> with ssh -v  for provisions to get more data?  There should
> be Monarch data corresponding to the SSH calls so we should
> be able to see the run, and the exact retried calls under the cover.

There's no single command that always fails.  I don't believe
I've seen any ssh failures that weren't a timeout.

I believe that the ssh timeout failures can be bucketed into
two cases.

Bucket 1 - Repair found the DUT online.
Repair runs from the shard, so this means "devserver failed ssh,
shard succeeded." For the falco logs, I found that the time
between "devserver saw timeout" and "shard saw the DUT online"
was consistently around 45-50 seconds.  I can't rule out actual
DUT downtime in that interval, but network congestion caused by
local load is by far the most plausible answer.

Bucket 2 - Repair found the DUT offline.
In this case, the DUTs are almost universally recovered by
servo actions such as resetting the hardware.  In other words,
the DUT really was down, and power cycling fixed it.

The causes for this symptom will almost certainly be board
specific.
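
In script form, the bucketing amounts to roughly this (a sketch over made-up fields, not the real repair-log schema; the 45-50 second gap is the figure from the falco logs above):

from datetime import datetime

def classify_failure(provision_fail_time, repair_start_time, repair_found_dut_online):
    # Classify one provision ssh-timeout given the repair task that followed it.
    gap = (repair_start_time - provision_fail_time).total_seconds()
    if repair_found_dut_online:
        # Bucket 1: the devserver failed ssh, but the shard saw the DUT online
        # ~45-50s later -- most plausibly devserver-side network load.
        return 'bucket 1 (devserver load?), gap %.0fs' % gap
    # Bucket 2: the DUT really was down and typically needed servo to recover.
    return 'bucket 2 (DUT down), gap %.0fs' % gap

# Example with made-up times:
print(classify_failure(datetime(2017, 2, 14, 3, 7, 8),
                       datetime(2017, 2, 14, 3, 7, 55), True))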

> 5. If we have data to back up that many provisions that
> fail are followed by a successful operation a mere 45 seconds
> later, it seems prudent to try this retry. [ ... ]

Alas, no, see the comments above.  The ssh failures are on
the devserver, the successes come from the shard.  Given
that the likely cause of the ssh failure is devserver load,
retrying will just exacerbate the load problem.  Note also
that in some cases we _already_ are retrying ssh: in those
cases, we give up only after 8 minutes of trying.

> 6. We definitely have ssh data and provision data.  The
> ssh data might not be correlated to DUT (but I thought it
> was) and should let us look for patterns on these operations.
> There's also data in Monarch about the status of a dut being
> up or down.

For the falco data, the DUT that failed the most always failed because
of ssh failures on the devserver that never occurred from the
shard.  The imbalance is quite striking.  I don't know why one
DUT would be so heavily hit by that problem over all the others.
That gives me pause, but not enough to discard the basic theory
that a large part of the underlying symptom is devserver network
load.

Attachments:
falco-histories.txt (2.8 MB)
get_failures (211 bytes)
Re c#26: See the DUT Health and DUT ssh requests dashboards on vi/chromeos.  DUT Health should show the working DUTs, and if you filter down by board or bring it up in Panopticon you should be able to get to the point of looking at individual DUTs.  The DUT ssh requests dashboard shows the overall number of DUT ssh calls but doesn't break down by board, devserver, or DUT at this point.

Re c#28:
Don's and my scripts are in the process of landing and can do the same sort of thing.  One generates a list of builds and suites based on some search parameters (e.g. the last two master-paladin builds); one takes the list of suite jobs and searches for special tasks (e.g. provisions that failed) and dumps the list of jobs, the next task, and the link to the logs.  The last takes a link to logs and performs some automatic classification.

You said: 
> Bucket 1 - Repair found the DUT online.
> Repair runs from the shard, so this means "devserver failed ssh,
> shard succeeded." For the falco logs, I found that the time
> between "devserver saw timeout" and "shard saw the DUT online"
> was consistently around 45-50 seconds.  I can't rule out actual
> DUT downtime in that interval, but network congestion caused by
> local load is by far the most plausible answer.

This is the most contentious thing in your response, and I think we need to get data backing it up and analyze it -- either by instrumenting the cases where it happens through checked-in code, or by manually trying to replicate this while gathering data.
> Re c#26: See the DUT Health and DUT ssh requests dashboards on vi/chromeos. [ ... ]

Those dashboards don't include ssh accesses from the devservers.
Devserver accesses are the accesses that fail.  I know it's wrong;
don't get me started.

Blocking: 695529
Status: Fixed
Already added a new feature to re-resolve a devserver if one is not working: https://chromium-review.googlesource.com/c/447878/. Marking this as fixed.
Status: Verified
Closing. Please reopen it if it's not fixed. Thanks!