
Issue 715415

Starred by 2 users

Issue metadata

Status: Archived
Owner: ----
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: ----

Blocking:
issue 719347
issue 721887




Reduce Apache RPC timeout

Project Member Reported by xixuan@chromium.org, Apr 26 2017

Issue description

Failed CQ: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14396

Failure reason:
cannot connect to shard AFE: 
04/25 19:53:01.732 DEBUG|          base_job:0350| Persistent state global_properties.test_retry now set to 0
04/25 19:53:01.732 DEBUG|          base_job:0350| Persistent state global_properties.tag now set to ''
04/25 19:53:01.820 DEBUG|        retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
04/25 19:53:11.872 DEBUG|        retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
04/25 19:53:31.947 DEBUG|        retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
04/25 19:54:12.049 DEBUG|        retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
04/25 19:55:32.150 DEBUG|        retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 111] Connection refused>)
04/25 19:58:01.815 ERROR|          autoserv:0769| Timeout occurred- waited 300 seconds.
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 761, in main
    use_ssp)
  File "/usr/local/autotest/server/autoserv", line 494, in run_autoserv
    test_retry, **kwargs)
  File "/usr/local/autotest/server/site_server_job.py", line 48, in __init__
    super(site_server_job, self).__init__(*args, **dargs)
  File "/usr/local/autotest/server/server_job.py", line 331, in __init__
    self.machines, self.in_lab, host_attributes)
  File "/usr/local/autotest/server/server_job.py", line 102, in get_machine_dicts
    afe_host = _create_afe_host(machine, in_lab)
  File "/usr/local/autotest/server/server_job.py", line 1440, in _create_afe_host
    hosts = afe.get_hosts(hostname=hostname)
  File "/usr/local/autotest/server/frontend.py", line 538, in get_hosts
    hosts = self.run('get_hosts', **query_args)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 127, in run
    self, call, **dargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 114, in GenericRetry
    time.sleep(sleep_time)
  File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 62, in kill_us
    raise TimeoutError(error_message % {'time': max_run_time})
TimeoutError: Timeout occurred- waited 300 seconds.
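
The retry timestamps in the log above are spaced roughly 10, 20, 40, and 80 seconds apart, which is consistent with a doubling backoff running under the 300-second autoserv timeout. The sketch below only reproduces that arithmetic. The initial sleep of 10 s and the factor of 2 are inferred from the timestamps, and `attempt_offsets` is a hypothetical helper, not chromite's `retry_util` code.

```python
def attempt_offsets(initial_sleep, backoff_factor, deadline):
    """Return the start offset (in seconds) of every attempt that begins before the deadline."""
    offsets = []
    elapsed = 0.0
    sleep = initial_sleep
    while elapsed < deadline:
        offsets.append(elapsed)
        elapsed += sleep          # wait before the next attempt
        sleep *= backoff_factor   # each wait is longer than the last
    return offsets


# Assumed values: 10 s initial sleep, doubling, 300 s overall timeout.
print(attempt_offsets(initial_sleep=10, backoff_factor=2, deadline=300))
```

Under these assumptions only five attempts fit inside the deadline, and the sixth would start at 310 s. That matches the five `Connection refused` lines and a timeout raised while the job was still inside `time.sleep`.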

Possible reason:
It happened on 2 shards, server36.cbf and server14.mtv, both at about 19:50~20:00.
At that time, Apache on both shards was restarting:
[Tue Apr 25 19:45:02.128431 2017] [mpm_event:notice] [pid 11587:tid 140097477130112] AH00491: caught SIGTERM, shutting down
[Tue Apr 25 20:01:32.527596 2017] [mpm_event:notice] [pid 25522:tid 140243570194304] AH00489: Apache/2.4.10 (Ubuntu) mod_wsgi/3.4 Python/2.7.6 configured -- resuming normal operations
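
To put a number on the outage, the two Apache timestamps above can be subtracted directly. This is a minimal sketch: `parse_stamp` and `downtime_seconds` are hypothetical helpers, the sample lines are shortened copies of the log lines above, and the format string assumes the default English day and month names.

```python
import re
from datetime import datetime

# Matches the leading "[Tue Apr 25 19:45:02.128431 2017]" stamp of an Apache error log line.
STAMP_RE = re.compile(r"^\[([^\]]+)\]")
STAMP_FORMAT = "%a %b %d %H:%M:%S.%f %Y"


def parse_stamp(line):
    """Extract the leading timestamp of an Apache error log line as a datetime."""
    match = STAMP_RE.match(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return datetime.strptime(match.group(1), STAMP_FORMAT)


def downtime_seconds(shutdown_line, resume_line):
    """Seconds between the 'shutting down' line and the 'resuming' line."""
    return (parse_stamp(resume_line) - parse_stamp(shutdown_line)).total_seconds()


# Shortened copies of the two log lines (pid/tid fields dropped).
shutdown = "[Tue Apr 25 19:45:02.128431 2017] [mpm_event:notice] AH00491: caught SIGTERM, shutting down"
resume = "[Tue Apr 25 20:01:32.527596 2017] [mpm_event:notice] AH00489: resuming normal operations"
print("Apache was down for %.0f seconds" % downtime_seconds(shutdown, resume))
```

For these two lines the gap is about 16.5 minutes, far longer than the 300-second autoserv timeout.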


 

Comment 1 by xixuan@chromium.org, Apr 27 2017

It seems to have affected a CQ round again: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14410

Should we consider reducing the frequency and duration of shard Apache restarts?
Cc: chadversary@chromium.org
I'm ok with reducing the frequency. Though is there a workaround we can try?

Why does it take apache 15 minutes to restart?
Project Member

Comment 4 by bugdroid1@chromium.org, Apr 27 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/3f82404769bd4b721e538e946d87652b303d20bf

commit 3f82404769bd4b721e538e946d87652b303d20bf
Author: xixuan <xixuan@chromium.org>
Date: Thu Apr 27 22:45:01 2017

Cc: dgarr...@chromium.org jrbarnette@chromium.org
An idea brought up in the team meeting is that Apache might be trying to serve all outstanding RPCs before shutting down, and a single long-lived RPC might be preventing it from shutting down.

We should have an RPC timeout to prevent this. We probably want such a timeout anyway to prevent us from accumulating slow RPCs.

Comment 6 by aut...@google.com, May 5 2017

Owner: pho...@chromium.org
Summary: Reduce Apache RPC timeout (was: Restart shard apache takes too much time, which causes CQ failure)
Work remaining: reduce apache rpc timeout (we think it's an apache config). 

@ paul - can you take a look?

Comment 7 by dbehr@chromium.org, May 8 2017

Blocking: 719347
Status: Started (was: Untriaged)

Comment 9 by xixuan@chromium.org, May 12 2017

Blocking: 721887
chromeos-server4.cbf.corp.google.com was found not working from 8:45 to 12:45 today:

[Fri May 12 08:45:01.925853 2017] [mpm_event:notice] [pid 26841:tid 140451891160960] AH00491: caught SIGTERM, shutting down
[Fri May 12 12:45:02.398859 2017] [mpm_event:notice] [pid 2415:tid 140168440772480] AH00489: Apache/2.4.10 (Ubuntu) mod_wsgi/3.4 Python/2.7.6 configured -- resuming normal operations

metrics: http://shortn/_pbLW3YNgf5
Cc: ayatane@chromium.org xixuan@chromium.org
 Issue 720594  has been merged into this issue.
 Issue 721887  has been merged into this issue.
 Issue 721846  has been merged into this issue.
Related: feature bug for adding metrics for monitoring services: https://bugs.chromium.org/p/chromium/issues/detail?id=720175

Not sure if the Apache error log metrics phobbs was working on would also help.
chromeos-server36's Apache was just down during a restart, which led to a CQ failure:

https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/5987

Use this bug for annotation.
Project Member

Comment 16 by bugdroid1@chromium.org, May 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e73b3280d55983a0945333bee46d3d1455a1613d

commit e73b3280d55983a0945333bee46d3d1455a1613d
Author: Paul Hobbs <phobbs@google.com>
Date: Fri May 26 00:16:09 2017

[autotest] Added 60s timeout to RPCs

BUG= chromium:715415 
TEST=None

Change-Id: I913db08b6e70aa82f12104534953d962cb29100a
Reviewed-on: https://chromium-review.googlesource.com/513585
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/e73b3280d55983a0945333bee46d3d1455a1613d/apache/conf/django-directives

oops, we need a fix.

In test_push servers:

Command 'sudo service apache2 reload' returned non-zero exit status 1

* The apache2 configtest failed. Not doing anything.
Output of config test was:
AH00526: Syntax error on line 66 of /usr/local/autotest/apache/conf/django-directives:
Invalid command 'maximum-requests=200', perhaps misspelled or defined by a module not included in the server configuration
Action 'configtest' failed.
The Apache error log may have more information.
I tested manually. Changing the directive to a single line:

WSGIDaemonProcess autotestapache processes=65 threads=1 maximum-requests=200 request-timeout=60

will cause:
chromeos-test@chromeos-shard2-staging:~$ sudo service apache2 reload
 * Reloading web server apache2
 * The apache2 configtest failed. Not doing anything.
Output of config test was:
AH00526: Syntax error on line 65 of /usr/local/autotest/apache/conf/django-directives:
Invalid option to WSGI daemon process definition.
Action 'configtest' failed.
The Apache error log may have more information.
Project Member

Comment 19 by bugdroid1@chromium.org, May 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ca4d02ea65adcb8969de74f1fd0a402847194a9b

commit ca4d02ea65adcb8969de74f1fd0a402847194a9b
Author: Xixuan Wu <xixuan@chromium.org>
Date: Fri May 26 16:11:27 2017

Revert "[autotest] Added 60s timeout to RPCs"

Temporarily revert for test_push.

This reverts commit e73b3280d55983a0945333bee46d3d1455a1613d.

Reason for revert: break test_push.

Original change's description:
> [autotest] Added 60s timeout to RPCs
> 
> BUG= chromium:715415 
> TEST=None
> 
> Change-Id: I913db08b6e70aa82f12104534953d962cb29100a
> Reviewed-on: https://chromium-review.googlesource.com/513585
> Commit-Ready: Paul Hobbs <phobbs@google.com>
> Tested-by: Paul Hobbs <phobbs@google.com>
> Reviewed-by: Paul Hobbs <phobbs@google.com>
> Reviewed-by: Dan Shi <dshi@google.com>
> 

BUG= chromium:715415 

Change-Id: I6229fb8ab116295cc1f1270c1c007c655fc05609
Reviewed-on: https://chromium-review.googlesource.com/516494
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/ca4d02ea65adcb8969de74f1fd0a402847194a9b/apache/conf/django-directives

The flag is introduced in mod_wsgi version 4.10, while the latest mod_wsgi package for trusty is 3.4-4 (https://packages.ubuntu.com/trusty/libapache2-mod-wsgi).

We might be able to do a timeout in the python layer using signals. 
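
For reference, a signal-based timeout in Python usually looks like the sketch below. It is only an illustration of the idea, not autotest code: `RpcTimeout` and `call_with_alarm` are hypothetical names. It relies on `SIGALRM`, so it is Unix-only, and Python only allows installing signal handlers from the main thread, so it may not be usable inside an Apache/mod_wsgi worker.

```python
import signal
import time


class RpcTimeout(Exception):
    """Raised when the wrapped call exceeds its time budget (hypothetical name)."""


def call_with_alarm(func, timeout_s, *args, **kwargs):
    """Run func(*args, **kwargs), raising RpcTimeout after timeout_s seconds.

    Must be called from the main thread; signal.signal raises ValueError otherwise.
    """
    def _on_alarm(signum, frame):
        raise RpcTimeout("call exceeded %s seconds" % timeout_s)

    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout_s)  # accepts fractional seconds
    try:
        return func(*args, **kwargs)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)      # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)   # restore the previous handler


if __name__ == "__main__":
    try:
        call_with_alarm(time.sleep, 0.1, 2)  # a stand-in for a slow RPC handler
    except RpcTimeout as e:
        print("timed out:", e)
```

The exception is raised inside whatever Python code is running when the alarm fires, so a handler stuck in a C extension call that does not return to the interpreter would not be interrupted promptly.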
Cc: pho...@chromium.org
Labels: -Restrict-View-Google -Pri-1 Hotlist-Fixit Pri-2
Owner: ----
Status: Available (was: Started)
Lowering priority and adding to fixit list. Also, there's no reason to use the Restrict-View-Google label.
> We might be able to do a timeout in the python layer using signals. 

Er, um...  If I understand this suggestion properly, it won't work.
See bug 711806.  Or is this suggestion different?

#22: Ah right, I forgot about that. I suppose another option would be to rejigger the RPC handler to use threads, which would make it interruptible after a timeout.
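
A rough sketch of the thread-based idea, assuming the handler can be run on a worker thread: the caller waits on a future with a deadline instead of blocking on the handler itself. `call_with_deadline` and the pool size are hypothetical, and this is not how the autotest RPC handler is currently structured.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical shared pool; the size is an arbitrary choice for illustration.
_POOL = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(func, timeout_s, *args, **kwargs):
    """Run func on a worker thread and wait at most timeout_s seconds for it.

    Raises concurrent.futures.TimeoutError if the deadline passes first.
    """
    future = _POOL.submit(func, *args, **kwargs)
    return future.result(timeout=timeout_s)


if __name__ == "__main__":
    try:
        call_with_deadline(time.sleep, 0.05, 0.5)  # a stand-in for a slow RPC
    except FutureTimeout:
        print("RPC timed out; the request can be failed early")
```

One caveat with this design: the timeout only unblocks the caller. The worker thread keeps running until the handler returns, so slow handlers can still tie up pool threads.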
Status: Archived (was: Available)
