Reduce Apache RPC timeout
Issue description
Failed CQ: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14396
Failure reason: cannot connect to shard AFE:
04/25 19:53:01.732 DEBUG| base_job:0350| Persistent state global_properties.test_retry now set to 0
04/25 19:53:01.732 DEBUG| base_job:0350| Persistent state global_properties.tag now set to ''
04/25 19:53:01.820 DEBUG| retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
04/25 19:53:11.872 DEBUG| retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
04/25 19:53:31.947 DEBUG| retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
04/25 19:54:12.049 DEBUG| retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
04/25 19:55:32.150 DEBUG| retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
04/25 19:58:01.815 ERROR| autoserv:0769| Timeout occurred- waited 300 seconds.
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 761, in main
    use_ssp)
  File "/usr/local/autotest/server/autoserv", line 494, in run_autoserv
    test_retry, **kwargs)
  File "/usr/local/autotest/server/site_server_job.py", line 48, in __init__
    super(site_server_job, self).__init__(*args, **dargs)
  File "/usr/local/autotest/server/server_job.py", line 331, in __init__
    self.machines, self.in_lab, host_attributes)
  File "/usr/local/autotest/server/server_job.py", line 102, in get_machine_dicts
    afe_host = _create_afe_host(machine, in_lab)
  File "/usr/local/autotest/server/server_job.py", line 1440, in _create_afe_host
    hosts = afe.get_hosts(hostname=hostname)
  File "/usr/local/autotest/server/frontend.py", line 538, in get_hosts
    hosts = self.run('get_hosts', **query_args)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 127, in run
    self, call, **dargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 114, in GenericRetry
    time.sleep(sleep_time)
  File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 62, in kill_us
    raise TimeoutError(error_message % {'time': max_run_time})
TimeoutError: Timeout occurred- waited 300 seconds.
Possible reason: It happened on 2 shards, server36.cbf and server14.mtv, both at about 19:50~20:00. At that time, Apache on both shards was restarting:
[Tue Apr 25 19:45:02.128431 2017] [mpm_event:notice] [pid 11587:tid 140097477130112] AH00491: caught SIGTERM, shutting down
[Tue Apr 25 20:01:32.527596 2017] [mpm_event:notice] [pid 25522:tid 140243570194304] AH00489: Apache/2.4.10 (Ubuntu) mod_wsgi/3.4 Python/2.7.6 configured -- resuming normal operations
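The timestamps of the URLError lines above are consistent with a retry loop that doubles its sleep after each failure and is cut off by the 300-second deadline. A minimal sketch of that arithmetic, assuming a 10-second initial sleep and a backoff factor of 2 (both inferred from the log timestamps, not taken from chromite's retry_util; `retry_offsets` is a hypothetical helper):

```python
def retry_offsets(initial_sleep=10, backoff=2, deadline=300):
    """Return the offsets, in seconds from the first attempt, at which
    attempts are made before the overall deadline expires.

    The default values are assumptions inferred from the log timestamps.
    """
    offsets = []
    now = 0
    sleep = initial_sleep
    while now < deadline:
        offsets.append(now)   # an attempt happens at this offset
        now += sleep          # then the loop sleeps...
        sleep *= backoff      # ...and doubles the next sleep
    return offsets


print(retry_offsets())
```

Under these assumptions the attempts fall at 0, 10, 30, 70 and 150 seconds, which lines up with the five "Connection refused" entries; the sixth attempt would be due at 310 seconds, so the 300-second timeout fires during the last sleep, as the traceback shows (`time.sleep` inside `GenericRetry`).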
,
Apr 27 2017
,
Apr 27 2017
I'm ok with reducing the frequency. Though is there a workaround we can try? Why does it take apache 15 minutes to restart?
,
Apr 27 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/3f82404769bd4b721e538e946d87652b303d20bf
commit 3f82404769bd4b721e538e946d87652b303d20bf
Author: xixuan <xixuan@chromium.org>
Date: Thu Apr 27 22:45:01 2017
,
May 2 2017
An idea brought up in the team meeting is that Apache might be trying to serve all outstanding RPCs before shutting down, and there might be one long-lived RPC that prevents it from shutting down. We should have an RPC timeout to prevent this. We probably want such a timeout anyway, to keep slow RPCs from accumulating.
,
May 5 2017
Work remaining: reduce apache rpc timeout (we think it's an apache config). @ paul - can you take a look?
,
May 8 2017
,
May 9 2017
,
May 12 2017
chromeos-server4.cbf.corp.google.com was found not working from 8:45 to 12:45 today:
[Fri May 12 08:45:01.925853 2017] [mpm_event:notice] [pid 26841:tid 140451891160960] AH00491: caught SIGTERM, shutting down
[Fri May 12 12:45:02.398859 2017] [mpm_event:notice] [pid 2415:tid 140168440772480] AH00489: Apache/2.4.10 (Ubuntu) mod_wsgi/3.4 Python/2.7.6 configured -- resuming normal operations
metrics: http://shortn/_pbLW3YNgf5
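The length of each outage can be read off the Apache error log by pairing the AH00491 (shutting down) notice with the next AH00489 (resuming normal operations) notice. A rough sketch of that calculation, assuming the timestamp format seen in the lines quoted in this bug (`parse_stamp` and `downtimes` are hypothetical helpers, not part of autotest):

```python
import re
from datetime import datetime

# Matches the leading "[Fri May 12 08:45:01.925853 2017]" timestamp.
STAMP = re.compile(r"\[(\w{3} \w{3} +\d+ [\d:.]+ \d{4})\]")


def parse_stamp(line):
    """Parse the leading Apache error-log timestamp of a line."""
    match = STAMP.match(line)
    return datetime.strptime(match.group(1), "%a %b %d %H:%M:%S.%f %Y")


def downtimes(lines):
    """Return a timedelta for each shutdown/resume pair found in lines."""
    gaps = []
    stopped = None
    for line in lines:
        if "AH00491" in line:                       # caught SIGTERM
            stopped = parse_stamp(line)
        elif "AH00489" in line and stopped is not None:
            gaps.append(parse_stamp(line) - stopped)
            stopped = None
    return gaps


log = [
    "[Fri May 12 08:45:01.925853 2017] [mpm_event:notice] "
    "[pid 26841:tid 140451891160960] AH00491: caught SIGTERM, shutting down",
    "[Fri May 12 12:45:02.398859 2017] [mpm_event:notice] "
    "[pid 2415:tid 140168440772480] AH00489: Apache/2.4.10 (Ubuntu) "
    "mod_wsgi/3.4 Python/2.7.6 configured -- resuming normal operations",
]
for gap in downtimes(log):
    print(gap)
```

For the two lines quoted above this gives a gap of roughly four hours, compared with about sixteen and a half minutes for the Apr 25 pair in the issue description.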
,
May 12 2017
,
May 12 2017
Issue 721887 has been merged into this issue.
,
May 12 2017
Issue 721846 has been merged into this issue.
,
May 12 2017
Related: feature bug for adding metrics for monitoring services: https://bugs.chromium.org/p/chromium/issues/detail?id=720175 Not sure if the Apache error log metrics phobbs was working on would also help.
,
May 12 2017
There are also second-order effects visible on Viceroy: https://viceroy.corp.google.com/chromeos/dut_health?board=winky&duration=1d&mdb_role=chrome-infra&refresh=-1&utc_end=1494622929.14
,
May 22 2017
chromeos-server36's Apache was down during a restart just now, which led to a CQ failure: https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/5987 Use this bug for annotation.
,
May 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e73b3280d55983a0945333bee46d3d1455a1613d
commit e73b3280d55983a0945333bee46d3d1455a1613d
Author: Paul Hobbs <phobbs@google.com>
Date: Fri May 26 00:16:09 2017

[autotest] Added 60s timeout to RPCs

BUG= chromium:715415
TEST=None

Change-Id: I913db08b6e70aa82f12104534953d962cb29100a
Reviewed-on: https://chromium-review.googlesource.com/513585
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/e73b3280d55983a0945333bee46d3d1455a1613d/apache/conf/django-directives
,
May 26 2017
Oops, we need a fix. On the test_push servers:
Command 'sudo service apache2 reload' returned non-zero exit status 1
 * The apache2 configtest failed. Not doing anything.
Output of config test was:
AH00526: Syntax error on line 66 of /usr/local/autotest/apache/conf/django-directives: Invalid command 'maximum-requests=200', perhaps misspelled or defined by a module not included in the server configuration
Action 'configtest' failed.
The Apache error log may have more information.
,
May 26 2017
I tested manually. Changing the config to put the directive on one line:
WSGIDaemonProcess autotestapache processes=65 threads=1 maximum-requests=200 request-timeout=60
causes:
chromeos-test@chromeos-shard2-staging:~$ sudo service apache2 reload
 * Reloading web server apache2
 *
 * The apache2 configtest failed. Not doing anything.
Output of config test was:
AH00526: Syntax error on line 65 of /usr/local/autotest/apache/conf/django-directives: Invalid option to WSGI daemon process definition.
Action 'configtest' failed.
The Apache error log may have more information.
,
May 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ca4d02ea65adcb8969de74f1fd0a402847194a9b
commit ca4d02ea65adcb8969de74f1fd0a402847194a9b
Author: Xixuan Wu <xixuan@chromium.org>
Date: Fri May 26 16:11:27 2017

Revert "[autotest] Added 60s timeout to RPCs"

Temporarily revert for test_push.

This reverts commit e73b3280d55983a0945333bee46d3d1455a1613d.

Reason for revert: break test_push.

Original change's description:
> [autotest] Added 60s timeout to RPCs
>
> BUG= chromium:715415
> TEST=None
>
> Change-Id: I913db08b6e70aa82f12104534953d962cb29100a
> Reviewed-on: https://chromium-review.googlesource.com/513585
> Commit-Ready: Paul Hobbs <phobbs@google.com>
> Tested-by: Paul Hobbs <phobbs@google.com>
> Reviewed-by: Paul Hobbs <phobbs@google.com>
> Reviewed-by: Dan Shi <dshi@google.com>
>

BUG= chromium:715415
Change-Id: I6229fb8ab116295cc1f1270c1c007c655fc05609
Reviewed-on: https://chromium-review.googlesource.com/516494
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/ca4d02ea65adcb8969de74f1fd0a402847194a9b/apache/conf/django-directives
,
Jun 8 2017
The flag is introduced in mod_wsgi version 4.10, while the latest mod_wsgi package for trusty is 3.4-4 (https://packages.ubuntu.com/trusty/libapache2-mod-wsgi). We might be able to do a timeout in the python layer using signals.
,
Jun 8 2017
Lowering priority and adding to fixit list. Also, there's no reason to use the Restrict-View-Google label.
,
Jun 8 2017
> We might be able to do a timeout in the python layer using signals.
Er, um... If I understand this suggestion properly, it won't work. See bug 711806. Or is this suggestion different?
,
Jun 8 2017
#22: Ah right, I forgot about that. I suppose another option would be to rejigger the RPC handler to use threads, which would make it interruptible after a timeout.
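One way to read the thread-based suggestion above: run the handler in a worker thread and have the caller stop waiting after a deadline. The following is only a sketch of that idea, not autotest's RPC handler; `call_with_timeout`, `RpcTimeout` and `slow_rpc` are hypothetical names. Note the caveat that Python cannot forcibly kill a thread, so the abandoned worker keeps running until the handler returns on its own; only the caller is released.

```python
import threading
import time


class RpcTimeout(Exception):
    """Raised when a handler does not finish within the deadline."""


def call_with_timeout(handler, timeout_s, *args, **kwargs):
    """Run handler(*args, **kwargs) in a worker thread and wait for it
    for at most timeout_s seconds."""
    outcome = {}

    def target():
        try:
            outcome["value"] = handler(*args, **kwargs)
        except Exception as exc:  # hand the failure back to the caller
            outcome["error"] = exc

    # daemon=True so an abandoned worker does not block interpreter exit.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        raise RpcTimeout("handler exceeded %s seconds" % timeout_s)
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def slow_rpc(delay_s):
    """Stand-in for a long-running RPC (hypothetical)."""
    time.sleep(delay_s)
    return "done"


print(call_with_timeout(slow_rpc, 1.0, 0.01))
try:
    call_with_timeout(slow_rpc, 0.05, 0.5)
except RpcTimeout as exc:
    print("timed out:", exc)
```

Whether this would let Apache shut down promptly is not established in this bug; it only shows how the waiting side can give up after a fixed time, such as the 60 seconds used in the reverted change.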
,
Mar 31 2018
Comment 1 by xixuan@chromium.org
, Apr 27 2017