autotest failed to stage artifact to DUT | HTTP OK not accompanied by 'Success' |
|||||||
Issue descriptionhttps://uberchromegw.corp.google.com/i/chromeos/builders/edgar-paladin/builds/814 My guess is this is another symptom of devserver network load problem.
,
Sep 18 2017
Could this be from jrbarnette's load limiting changes? When I looked at the logs, the devserver had already staged the image long before the failure was being reported. I don't see the failure in the devserver logs. 09/18 14:07:36.148 DEBUG| utils:0212| Running 'ssh 100.115.219.132 'curl "http://100.115.219.132:8082/stage?artifacts=autotest_packages&files=&async=True&archive_url=gs://chromeos-image-archive/edgar-paladin/R63-9954.0.0-rc1"'' 09/18 14:07:36.821 DEBUG| dev_server:1067| response for RPC: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>503 Service Unavailable</title>\n</head><body>\n<h1>Service Unavailable</h1>\n<p>The server is temporarily unable to service your\nrequest due to maintenance downtime or capacity\nproblems. Please try again later.</p>\n<hr>\n<address>Apache/2.4.7 (Ubuntu) Server at 100.115.219.132 Port 8082</address>\n</body></html>\n' Images previously staged: [18/Sep/2017:13:52:13] AUTOTEST_TARBALL ArtifactStaged() -> yes, autotest_packages.tar->/home/chromeos-test/images/edgar-paladin/R63-9954.0.0-rc1 is staged. [18/Sep/2017:13:52:13] AUTOTEST_TARBALL autotest_packages.tar->/home/chromeos-test/images/edgar-paladin/R63-9954.0.0-rc1 is already staged.
,
Sep 19 2017
> Could this be from jrbarnette's load limiting changes? No. The workqueue throttling doesn't impact staging. More importantly, the feature is currently off; see bug 760652 .
,
Sep 19 2017
xixuan@ had some comments about this on crosoncall, maybe you can distill them to this bug?
,
Sep 19 2017
The error happens at:
09/18 14:07:36.148 DEBUG| utils:0212| Running 'ssh 100.115.219.132 'curl "http://100.115.219.132:8082/stage?artifacts=autotest_packages&files=&async=True&archive_url=gs://chromeos-image-archive/edgar-paladin/R63-9954.0.0-rc1"''
09/18 14:07:36.821 DEBUG| dev_server:1067| response for RPC: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>503 Service Unavailable</title>\n</head><body>\n<h1>Service Unavailable</h1>\n<p>The server is temporarily unable to service your\nrequest due to maintenance downtime or capacity\nproblems. Please try again later.</p>\n<hr>\n<address>Apache/2.4.7 (Ubuntu) Server at 100.115.219.132 Port 8082</address>\n</body></html>\n'
The culprit devserver is: chromeos4-devserver4, here's its logs:
[18/Sep/2017:14:07:26] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 8080)) shut down
[18/Sep/2017:14:07:26] ENGINE Stopped thread '_TimeoutMonitor'.
[18/Sep/2017:14:07:26] ENGINE Bus STOPPED
[18/Sep/2017:14:07:26] ENGINE Bus EXITING
[18/Sep/2017:14:07:26] ENGINE Bus EXITED
[18/Sep/2017:14:07:26] ENGINE Waiting for child threads to terminate...
[18/Sep/2017:14:07:26] ENGINE Caught signal SIGTERM.
[18/Sep/2017:14:07:26] ENGINE Bus STOPPING
[18/Sep/2017:14:07:26] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 8080)) already shut down
[18/Sep/2017:14:07:26] ENGINE No thread running for None.
[18/Sep/2017:14:07:26] ENGINE Bus STOPPED
[18/Sep/2017:14:07:26] ENGINE Bus EXITING
[18/Sep/2017:14:07:26] ENGINE Bus EXITED
[18/Sep/2017:14:07:35] DEVSERVER Using cache directory /home/chromeos-test/images/cache
[18/Sep/2017:14:07:35] DEVSERVER Serving from /home/chromeos-test/images
[18/Sep/2017:14:07:36] XBUDDY Using shadow config file stored at /home/chromeos-test/chromiumos/src/platform/dev/shadow_xbuddy_config.ini
[18/Sep/2017:14:07:39] ENGINE Listening for SIGHUP.
[18/Sep/2017:14:07:39] ENGINE Listening for SIGTERM.
[18/Sep/2017:14:07:39] ENGINE Listening for SIGUSR1.
[18/Sep/2017:14:07:39] ENGINE Bus STARTING
[18/Sep/2017:14:07:39] ENGINE Started monitor thread '_TimeoutMonitor'.
In the same period, that devserver is restarting, so returns 503 error for any calls, which is not retryable.
For devserver, it's restarting == it's not working, we can't differentiate the 2 cases. A quick fix may be we wait and retry such errors for a very short time (1-2 minutes) to cover this case. I will prepare a CL for this. Any other suggestions?
,
Sep 23 2017
xixuan@ has a suggested fix above.
,
Sep 25 2017
,
Sep 27 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e3dfe00d35c5d1b4ac48a23b05539fc3347e82b3 commit e3dfe00d35c5d1b4ac48a23b05539fc3347e82b3 Author: Xixuan Wu <xixuan@chromium.org> Date: Wed Sep 27 03:44:02 2017 autotest: Add retry if devserver call fails due to down service. It's possible that when devserver service is restarting, it temporarily doesn't work for a very short time window, but may cause HWTest errors. This CL adds retry for this case. If after a given time period (60s) the service is still down, we will declare that this devserver is not serving and won't retry. BUG= chromium:766342 TEST=Ran unittest. Change-Id: I66ed4f138ee89909797b889458ddb490dd5aa304 Reviewed-on: https://chromium-review.googlesource.com/685260 Commit-Ready: Xixuan Wu <xixuan@chromium.org> Tested-by: Xixuan Wu <xixuan@chromium.org> Reviewed-by: Dan Shi <dshi@google.com> [modify] https://crrev.com/e3dfe00d35c5d1b4ac48a23b05539fc3347e82b3/client/common_lib/utils.py [modify] https://crrev.com/e3dfe00d35c5d1b4ac48a23b05539fc3347e82b3/client/common_lib/cros/dev_server.py [modify] https://crrev.com/e3dfe00d35c5d1b4ac48a23b05539fc3347e82b3/client/common_lib/cros/dev_server_unittest.py
,
Oct 2 2017
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by akes...@chromium.org
, Sep 18 2017