automated_deploy hangs forever (or for a very long time) if one server is failing to update.
Issue description

I started automated_deploy a few hours ago. It has been hung after the following output for a long time:

omeos-server4.cbf.corp.google.com
Updating server chromeos-server70.hot.corp.google.com...
[0/5] Try to update server chromeos-server70.hot.corp.google.com
Time used to update server chromeos-server11.hot.corp.google.com: 209.198833942
Successfully updated server chromeos-server11.hot.corp.google.com.
Time used to update server chromeos-server3.hot.corp.google.com: 236.741982937
Successfully updated server chromeos-server3.hot.corp.google.com.
Time used to update server chromeos-server36.cbf.corp.google.com: 673.467088938
Successfully updated server chromeos-server36.cbf.corp.google.com.
Time used to update server chromeos-server54.cbf.corp.google.com: 792.669389963
Successfully updated server chromeos-server54.cbf.corp.google.com.
Time used to update server chromeos-server42.cbf.corp.google.com: 822.947048187
Successfully updated server chromeos-server42.cbf.corp.google.com.
Time used to update server chromeos-server55.cbf.corp.google.com: 850.063956022
Successfully updated server chromeos-server55.cbf.corp.google.com.
Time used to update server chromeos-server20.cbf.corp.google.com: 862.847936869
Successfully updated server chromeos-server20.cbf.corp.google.com.
Time used to update server chromeos-server56.hot.corp.google.com: 938.998557091
Successfully updated server chromeos-server56.hot.corp.google.com.
Time used to update server chromeos-server46.hot.corp.google.com: 959.901597023
Successfully updated server chromeos-server46.hot.corp.google.com.
Time used to update server chromeos-server53.cbf.corp.google.com: 1030.92389393
Successfully updated server chromeos-server53.cbf.corp.google.com.
Time used to update server chromeos-server82.cbf.corp.google.com: 1077.50296092
Successfully updated server chromeos-server82.cbf.corp.google.com.
Time used to update server chromeos-server33.cbf.corp.google.com: 1116.17488313
Successfully updated server chromeos-server33.cbf.corp.google.com.
Time used to update server chromeos-server44.cbf.corp.google.com: 1189.06689286
Successfully updated server chromeos-server44.cbf.corp.google.com.
Time used to update server chromeos-server80.hot.corp.google.com: 1610.88680291
Successfully updated server chromeos-server80.hot.corp.google.com.
Time used to update server cros-autotest-shard2.cbf.corp.google.com: 2412.81266403
Successfully updated server cros-autotest-shard2.cbf.corp.google.com.
Time used to update server chromeos-server5.hot.corp.google.com: 2412.81450486
Successfully updated server chromeos-server5.hot.corp.google.com.
Time used to update server chromeos-server78.hot.corp.google.com: 2412.80637717
Successfully updated server chromeos-server78.hot.corp.google.com.
Time used to update server chromeos-server7.mtv.corp.google.com: 3024.05615592
Time used to update server chromeos-server83.cbf.corp.google.com: 3024.04649687
Time used to update server chromeos-server40.cbf.corp.google.com: 3024.04985499
Time used to update server chromeos-server27.mtv.corp.google.com: 3024.05372095
Time used to update server chromeos-server57.hot.corp.google.com: 3024.05012798
Time used to update server chromeos-server26.mtv.corp.google.com: 3024.05443907
Time used to update server chromeos-server75.cbf.corp.google.com: 3024.04918098
Successfully updated server chromeos-server7.mtv.corp.google.com.
Successfully updated server chromeos-server83.cbf.corp.google.com.
Successfully updated server chromeos-server40.cbf.corp.google.com.
Successfully updated server chromeos-server57.hot.corp.google.com.
Successfully updated server chromeos-server26.mtv.corp.google.com.
Successfully updated server chromeos-server27.mtv.corp.google.com.
Successfully updated server chromeos-server75.cbf.corp.google.com.
Time used to update server chromeos-server6.mtv.corp.google.com: 3024.05911493
Time used to update server chromeos-server48.hot.corp.google.com: 3024.05373502
Time used to update server chromeos-server22.cbf.corp.google.com: 3024.05654001
Successfully updated server chromeos-server6.mtv.corp.google.com.
Successfully updated server chromeos-server48.hot.corp.google.com.

I am tempted to try cancelling and re-trying. Ideally, I would only retry on the servers that failed. But based on the output of the script, I have no idea which servers have failed. I can hackishly figure this out with `ps -Af | grep deploy`, but really it would be much better if the script either timed out faster or at least printed an occasional message about how many servers are still outstanding.
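To illustrate the "occasional message about how many servers are still outstanding" idea, here is a minimal sketch in Python 3. It is not the actual automated_deploy code: `deploy_all`, `update_fn`, and `report_interval` are hypothetical names, and `update_fn` stands in for whatever the script does to update a single server.

```python
import concurrent.futures


def deploy_all(servers, update_fn, report_interval=60.0, max_workers=32):
    """Run update_fn(server) in parallel and keep reporting stragglers.

    update_fn is a hypothetical per-server update callable (an assumption,
    not part of automated_deploy). Returns (succeeded, failed), where
    failed maps each server name to the exception it raised.
    """
    succeeded, failed = [], {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each in-flight future back to the server it is updating.
        pending = {pool.submit(update_fn, s): s for s in servers}
        while pending:
            # Wake up when a server finishes or when the interval elapses.
            done, _ = concurrent.futures.wait(
                pending, timeout=report_interval,
                return_when=concurrent.futures.FIRST_COMPLETED)
            for future in done:
                server = pending.pop(future)
                error = future.exception()
                if error is None:
                    succeeded.append(server)
                else:
                    failed[server] = error
            if pending:
                # Printed after every wake-up, so a hung server stays visible.
                print('Still waiting on %d server(s): %s' % (
                    len(pending), ', '.join(sorted(pending.values()))))
    return succeeded, failed
```

With something like this, the names in `failed` (plus whatever was last reported as outstanding) would be the list to retry, instead of digging through `ps -Af | grep deploy`.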
,
Dec 13 2016
Also, strange that we have a huge quantity of un-reaped deploy_server processes.
deploy_server.p─┬─googlesh
└─254*[{deploy_server.p}]
$ pstree 81232 -p
automated_deplo(81232)───deploy_server.p(81237)─┬─googlesh(84014)
├─{deploy_server.p}(81661)
├─{deploy_server.p}(81662)
├─{deploy_server.p}(81663)
├─{deploy_server.p}(81664)
├─{deploy_server.p}(81665)
,
Dec 13 2016
Shameless plug: See a laundry list of deploy_server usability improvements at issue 666101 from my deputy shift.
,
Dec 13 2016
shuqianz@ writes:

I can't open my bug tracker because of a credential issue, so I am replying in this email.

When I retried the push, it got stuck updating chromeos-server71.cbf again. So I also canceled the push and manually ran deploy on the skipped servers, which I found from the nagios alert. Then I ssh'ed into chromeos-server71.cbf and manually ran deploy_server_local.py on this server. It hung while running test_importer.py. I interrupted the deploy and tried running only test_importer.py on that server.

$ /usr/local/autotest/utils/test_importer.py

It has been running for more than 1h30m, and it is still running (not frozen) as I send this email. So I think this issue has nothing to do with the deploy process. It is caused by the test_importer.py script taking too long. I've noticed that other servers take more than 30 minutes to update; I think that may be caused by the same issue.

Charlene
,
Dec 13 2016
,
Jan 5 2017
From the email I sent about the fix on chromeos-server71.cbf: For the chromeos-server71 server, I tested deploy_server_local.py directly on the server and found that the test_importer.py script was running extremely slowly; it took 2 hours to finish. I checked the code, and the cause seemed to be that the shard database was slow on that server. Therefore, I removed the shard from prod, cleaned up the shard database, and also cleaned up the logs/ and results/ folders on that server. After all of this was done, I kicked off another test_importer run, and this time it took only 5 minutes.
,
Jan 5 2017
Is it possible to add a timeout to the deployment script? Seems like <1hr should be a reasonable expectation.
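As a rough sketch of what such a timeout could look like (an assumption, not how the deployment script is actually written), Python 3's `subprocess.run` accepts a `timeout` argument and kills the child process when it expires. `run_with_timeout` and `timeout_secs` are hypothetical names; the one-hour default just mirrors the "<1hr" suggestion above.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_secs=3600):
    """Run cmd and return (finished, returncode).

    finished is False, and returncode is None, if cmd exceeded
    timeout_secs; subprocess.run kills the child in that case.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        return False, None
    return True, result.returncode


if __name__ == '__main__':
    # Stand-in for a slow per-server step; the real command would differ.
    slow = [sys.executable, '-c', 'import time; time.sleep(5)']
    print(run_with_timeout(slow, timeout_secs=0.5))
```

A caller could treat a `False` result as a failed server and move on, rather than blocking the whole push on one slow step such as test_importer.py.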
,
Jun 20 2017
Will add a timeout.
,
Aug 1 2017
After adding logging and all the other improvements, I think this issue no longer applies to the current deployment flow.
,
Jan 22 2018
Comment 1 by akes...@chromium.org
, Dec 13 2016