
Issue 673585

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Last visit > 30 days ago
Closed: Aug 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




automated_deploy hangs forever (or very long time) if one server is failing to update.

Project Member Reported by akes...@chromium.org, Dec 13 2016

Issue description

I started automated_deploy a few hours ago. It has been hung after the following output for a long time:

omeos-server4.cbf.corp.google.com
Updating server chromeos-server70.hot.corp.google.com...
[0/5] Try to update server chromeos-server70.hot.corp.google.com
Time used to update server chromeos-server11.hot.corp.google.com: 209.198833942
Successfully updated server chromeos-server11.hot.corp.google.com.
Time used to update server chromeos-server3.hot.corp.google.com: 236.741982937
Successfully updated server chromeos-server3.hot.corp.google.com.
Time used to update server chromeos-server36.cbf.corp.google.com: 673.467088938
Successfully updated server chromeos-server36.cbf.corp.google.com.
Time used to update server chromeos-server54.cbf.corp.google.com: 792.669389963
Successfully updated server chromeos-server54.cbf.corp.google.com.
Time used to update server chromeos-server42.cbf.corp.google.com: 822.947048187
Successfully updated server chromeos-server42.cbf.corp.google.com.
Time used to update server chromeos-server55.cbf.corp.google.com: 850.063956022
Successfully updated server chromeos-server55.cbf.corp.google.com.
Time used to update server chromeos-server20.cbf.corp.google.com: 862.847936869
Successfully updated server chromeos-server20.cbf.corp.google.com.
Time used to update server chromeos-server56.hot.corp.google.com: 938.998557091
Successfully updated server chromeos-server56.hot.corp.google.com.
Time used to update server chromeos-server46.hot.corp.google.com: 959.901597023
Successfully updated server chromeos-server46.hot.corp.google.com.
Time used to update server chromeos-server53.cbf.corp.google.com: 1030.92389393
Successfully updated server chromeos-server53.cbf.corp.google.com.
Time used to update server chromeos-server82.cbf.corp.google.com: 1077.50296092
Successfully updated server chromeos-server82.cbf.corp.google.com.
Time used to update server chromeos-server33.cbf.corp.google.com: 1116.17488313
Successfully updated server chromeos-server33.cbf.corp.google.com.
Time used to update server chromeos-server44.cbf.corp.google.com: 1189.06689286
Successfully updated server chromeos-server44.cbf.corp.google.com.
Time used to update server chromeos-server80.hot.corp.google.com: 1610.88680291
Successfully updated server chromeos-server80.hot.corp.google.com.
Time used to update server cros-autotest-shard2.cbf.corp.google.com: 2412.81266403
Successfully updated server cros-autotest-shard2.cbf.corp.google.com.
Time used to update server chromeos-server5.hot.corp.google.com: 2412.81450486
Successfully updated server chromeos-server5.hot.corp.google.com.
Time used to update server chromeos-server78.hot.corp.google.com: 2412.80637717
Successfully updated server chromeos-server78.hot.corp.google.com.
Time used to update server chromeos-server7.mtv.corp.google.com: 3024.05615592
Time used to update server chromeos-server83.cbf.corp.google.com: 3024.04649687
Time used to update server chromeos-server40.cbf.corp.google.com: 3024.04985499
Time used to update server chromeos-server27.mtv.corp.google.com: 3024.05372095
Time used to update server chromeos-server57.hot.corp.google.com: 3024.05012798
Time used to update server chromeos-server26.mtv.corp.google.com: 3024.05443907
Time used to update server chromeos-server75.cbf.corp.google.com: 3024.04918098
Successfully updated server chromeos-server7.mtv.corp.google.com.
Successfully updated server chromeos-server83.cbf.corp.google.com.
Successfully updated server chromeos-server40.cbf.corp.google.com.
Successfully updated server chromeos-server57.hot.corp.google.com.
Successfully updated server chromeos-server26.mtv.corp.google.com.
Successfully updated server chromeos-server27.mtv.corp.google.com.
Successfully updated server chromeos-server75.cbf.corp.google.com.
Time used to update server chromeos-server6.mtv.corp.google.com: 3024.05911493
Time used to update server chromeos-server48.hot.corp.google.com: 3024.05373502
Time used to update server chromeos-server22.cbf.corp.google.com: 3024.05654001
Successfully updated server chromeos-server6.mtv.corp.google.com.
Successfully updated server chromeos-server48.hot.corp.google.com.


I am tempted to try cancelling and re-trying. Ideally, I would only retry on the servers that failed. But based on the output of the script, I have no idea which servers have failed.

I can hackishly figure this out with `ps -Af | grep deploy` but really it would be much better if the script either timed out faster, or at least had an occasional message about how many servers are still outstanding.
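As an illustration of the "occasional message about how many servers are still outstanding" idea, here is a minimal sketch. It is not the actual deploy_server.py code: `update_all`, `update_fn`, `report_interval`, and `max_workers` are hypothetical names, and it assumes each server update can be wrapped in a Python callable that raises on failure.

```python
import concurrent.futures


def update_all(servers, update_fn, report_interval=60.0, max_workers=8):
    """Run update_fn(server) for every server in parallel.

    Hypothetical helper, not part of autotest. Returns (succeeded, failed),
    where failed maps each failed server to the exception it raised.
    Whenever a server finishes or report_interval seconds pass, it prints
    the servers that are still outstanding.
    """
    succeeded, failed = [], {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the server it is updating.
        pending = {pool.submit(update_fn, s): s for s in servers}
        while pending:
            done, _ = concurrent.futures.wait(
                pending, timeout=report_interval,
                return_when=concurrent.futures.FIRST_COMPLETED)
            for future in done:
                server = pending.pop(future)
                error = future.exception()
                if error is None:
                    succeeded.append(server)
                else:
                    failed[server] = error
            if pending:
                print('Still waiting on %d server(s): %s' % (
                    len(pending), ', '.join(sorted(pending.values()))))
    return succeeded, failed
```

Because the failed servers come back as a dictionary, a caller could retry only those servers instead of restarting the whole push.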


 
Follow-up: based on the process list, I determined which process was hung.

$ ps -Af | grep deploy
akeshet   81232  46073  0 15:42 pts/25   00:00:00 /usr/bin/python ./automated_deploy.py --skip_autotest --skip_chromite
akeshet   81237  81232  0 15:42 pts/25   00:00:33 /usr/bin/python /usr/local/google/home/akeshet/chromiumos/src/third_party/autotest/files/site_utils/deploy_server.py --afe=cautotest
akeshet   84014  81237  0 15:47 pts/25   00:00:00 /bin/bash /usr/bin/googlesh -s -uchromeos-test -mchromeos-server71.cbf.corp.google.com /usr/local/autotest/site_utils/deploy_server_local.py 
akeshet  110388  80514  0 17:34 pts/18   00:00:00 grep --color=auto deploy


I went ahead and killed it
$ kill 84014

However, I don't see that a retry has been launched for it, and the automated_deploy script is still hung.
$ ps -Af | grep deploy
akeshet   81232  46073  0 15:42 pts/25   00:00:00 /usr/bin/python ./automated_deploy.py --skip_autotest --skip_chromite
akeshet   81237  81232  0 15:42 pts/25   00:00:34 /usr/bin/python /usr/local/google/home/akeshet/chromiumos/src/third_party/autotest/files/site_utils/deploy_server.py --afe=cautotest
akeshet  110689  80514  0 17:37 pts/18   00:00:00 grep --color=auto deploy





Also, strange that we have a huge quantity of un-reaped deploy_server processes.

deploy_server.p─┬─googlesh
                └─254*[{deploy_server.p}]



$ pstree 81232 -p
automated_deplo(81232)───deploy_server.p(81237)─┬─googlesh(84014)
                                                ├─{deploy_server.p}(81661)
                                                ├─{deploy_server.p}(81662)
                                                ├─{deploy_server.p}(81663)
                                                ├─{deploy_server.p}(81664)
                                                ├─{deploy_server.p}(81665)

Shameless plug: See a laundry list of deploy_server usability improvements at issue 666101 from my deputy shift.
shuqianz@ writes:
I can't open the bug tracker because of a credential issue, so I am replying by email.

When I retried the push, it got stuck at updating chromeos-server71.cbf again. So, I also canceled the push, and manually ran deploy on the skipped servers, which I found from the nagios alert. 

Then I ssh'ed into chromeos-server71.cbf and manually ran deploy_server_local.py on that server. It hung while running test_importer.py. I interrupted the deploy and tried running only test_importer.py on that server.
$  /usr/local/autotest/utils/test_importer.py
It has been running for more than 1h30m, and it is still running (not frozen) as I send this email.

So I think this issue has nothing to do with the deploy process itself. It is caused by the test_importer.py script taking too long. I've noticed that other servers take more than 30 minutes to update; I think that may be caused by the same issue.

Charlene

Comment 5 by autumn@chromium.org, Dec 13 2016

Labels: -current-issue
Status: Fixed (was: Untriaged)
From the email sent by me about the fix on chromeos-server71.cbf:

For the chromeos-server71 server, I tested deploy_server_local.py directly on the server and found that the test_importer.py script was running extremely slowly. That script took 2 hours to finish. I checked the code, and it seemed that the cause was the slow shard database on that server. Therefore, I removed the shard from prod, cleaned up the shard database, and also cleaned up the logs/ and results/ folders on that server. After all of this was done, I kicked off another test_importer run, and this time it only took 5 minutes.

Status: Assigned (was: Fixed)
Is it possible to add a timeout to the deployment script? Seems like <1hr should be a reasonable expectation.
Will add a timeout.
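For reference, a per-server timeout along these lines could look like the sketch below. This is not the change that was actually made to the deployment scripts; `run_with_timeout` and `timeout_secs` are hypothetical names, and it assumes the update is launched as a child process from Python 3.

```python
import subprocess


def run_with_timeout(cmd, timeout_secs):
    """Run cmd and return its exit code, or None if it ran past timeout_secs.

    Hypothetical helper, not part of autotest. On timeout, subprocess.run
    kills the direct child process before raising TimeoutExpired.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_secs).returncode
    except subprocess.TimeoutExpired:
        return None
```

One caveat: only the direct child is killed. If the command is a shell wrapper that starts further processes, those may outlive the timeout unless the caller also handles the process group.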
Status: Fixed (was: Assigned)
After adding logging and the other improvements, I think this issue no longer applies to the current deployment flow.

Comment 10 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
