New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 684667 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 2
Type: ----



Sign in to add a comment

ts_mon is flushing incorrectly in bot_main process

Project Member Reported by mar...@chromium.org, Jan 24 2017

Issue description

Bots have their swarming_bot.log (that it, bot_main process) filled with entries like this (independent of the OS):

302 2017-01-24 18:48:25.378 E: Automatic monitoring flush failed.
Traceback (most recent call last):
  File "/b/s/swarming_bot.2.zip/infra_libs/ts_mon/common/interface.py", line 248, in _flush_and_log_exceptions
    flush()
  File "/b/s/swarming_bot.2.zip/infra_libs/ts_mon/common/interface.py", line 111, in flush
    state.global_monitor.send(proto)
  File "/b/s/swarming_bot.2.zip/infra_libs/ts_mon/common/monitors.py", line 116, in send
    body=body)
  File "/b/s/swarming_bot.2.zip/third_party/oauth2client/client.py", line 572, in new_request
    self._refresh(request_orig)
  File "/b/s/swarming_bot.2.zip/third_party/oauth2client/client.py", line 842, in _refresh
    self._do_refresh_request(http_request)
  File "/b/s/swarming_bot.2.zip/third_party/oauth2client/client.py", line 874, in _do_refresh_request
    self.token_uri, method='POST', body=body, headers=headers)
  File "/b/s/swarming_bot.2.zip/infra_libs/httplib2_utils.py", line 188, in request
    **kwargs)
  File "/b/s/swarming_bot.2.zip/infra_libs/httplib2_utils.py", line 252, in request
    uri, method, body, *args, **kwargs)
  File "/b/s/swarming_bot.2.zip/third_party/httplib2/__init__.py", line 1552, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/b/s/swarming_bot.2.zip/third_party/httplib2/__init__.py", line 1294, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/b/s/swarming_bot.2.zip/third_party/httplib2/__init__.py", line 1214, in _conn_request
    conn.connect()
  File "/b/s/swarming_bot.2.zip/third_party/httplib2/__init__.py", line 1037, in connect
    self.disable_ssl_certificate_validation, self.ca_certs)
  File "/b/s/swarming_bot.2.zip/third_party/httplib2/__init__.py", line 80, in _ssl_wrap_socket
    cert_reqs=cert_reqs, ca_certs=ca_certs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 911, in wrap_socket
    ciphers=ciphers)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 520, in __init__
    self._context.load_verify_locations(ca_certs)
IOError: [Errno 20] Not a directory


It is harmless but it is very noisy and makes issue diagnosis like issue 684648 unnecessarily harder.

A thread interface._FlushThread is created via process_argparse_options() in ts_mon/config.py when --ts-mon-flush == 'auto' (the default) and doesn't synchronize the cacert.pem file utilization with utils/tools.py get_cacerts_bundle().
 

Comment 1 by estaab@chromium.org, Jan 27 2017

Labels: Pri-2
It's harmless in a sense that it doesn't break the bot process. But I think we don't actually send any metrics. (Not sure what metrics we are supposed to send, but tsmon tries to flush, so there is something to send...)
Cc: bpastene@chromium.org
+bpastene@ - is this related to your work on ts_mon for swarming bots?
Cc: sergeybe...@chromium.org
Owner: bpastene@chromium.org
Yeah, it has to do with my ts_mon stuff in swarming.

It's explicitly using bot's bundle here:
https://chrome-internal.googlesource.com/infra/infra_internal/+/master/services/swarming/bot_config_public.py#1257

What bots are you seeing this on maruel@? I'm currently debugging something to do with that bundle not being correct on chromium-swarm-dev bots, but fine on chromium-swarm bots.

Comment 5 by mar...@chromium.org, Jan 31 2017

I ssh'ed into the first bot listed at https://chromium-swarm.appspot.com/botlist which is build1-b4 and saw certification failure in /b/swarm_slave/logs/swarming_bot.log
:O

Ok, looking.
Looks like this started happening recently: http://shortn/_f40rTT2oxm

I wonder what that drop in data correlates with. A new swarming version maybe?
Cc: dsansome@chromium.org
Lines up pretty well with https://chrome-internal.googlesource.com/infra/puppet/+/ed9a8ab4ef9fff52136acb5d3887151c01e9772d

+ Dave in case he has any ideas.
Note that the ts_mon lib in use by the swarming bot lacks the following changes:
https://chromium.googlesource.com/infra/infra/+log/066f135e603f90945be764884a3d2863f72d21c1..master/packages/infra_libs/infra_libs/ts_mon

Maybe rolling forward here would help?
This seems to correlate with https://chrome-internal-review.googlesource.com/c/317177/
ts_mon now by default talks to ProdX Mon endpoint directly, bypassing PubSub/proxy. It's a Platform One endpoint, and it may require some additional stuff to work correctly. That would be my guess.
Re #10: That's the same CL I mentioned in #8 :)

I manually ran the bot's ts_mon uploading code from a bot (with some junk values) and it worked: http://shortn/_AAQjqh3w49

Not sure why that managed to go through. Nothing really changed. I wonder if there's some weird race condition in the bot.
The PubSubMonitor (old one) passes the ca_certs through to the httplib2.Http constructor, but the HttpsMonitor (new one) does not!

HttpsMonitor just uses httplib2.Http's default.  Which is:

try:
    # Users can optionally provide a module that tells us where the CA_CERTS
    # are located.
    import ca_certs_locater
    CA_CERTS = ca_certs_locater.get()
except ImportError:
    # Default CA certificates file bundled with httplib2.
    CA_CERTS = os.path.join(
        os.path.dirname(os.path.abspath(__file__ )), "cacerts.txt")

So why does this fail here but not on other machines?  My guess is that self._context.load_verify_locations is just a call to the C OpenSSL library's load_verify_locations funcion, and of course that doesn't support loading files from zip archives.
Dave got to the fix in https://chromium.googlesource.com/infra/infra/+/bd74bd626329195ccfc6db751152607401c9973b

I'll roll it into luci-py and redeploy, then these errors should disappear.
Project Member

Comment 14 by bugdroid1@chromium.org, Feb 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/d9eec3590341f6cd4834ac49f1c662c0a0062d53

commit d9eec3590341f6cd4834ac49f1c662c0a0062d53
Author: Benjamin Pastene <bpastene@google.com>
Date: Thu Feb 02 02:12:01 2017

Bump the version of infra_libs.

BUG= 684667 

Change-Id: Iaaffd76c55377a91f5da876f2fbd031e3e8625e7
Reviewed-on: https://chromium-review.googlesource.com/435848
Reviewed-by: Sergey Berezin <sergeyberezin@chromium.org>
Reviewed-by: Dave Sansome <dsansome@chromium.org>
Commit-Queue: Benjamin Pastene <bpastene@chromium.org>

[modify] https://crrev.com/d9eec3590341f6cd4834ac49f1c662c0a0062d53/packages/infra_libs/setup.py

Project Member

Comment 15 by bugdroid1@chromium.org, Feb 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/a8d18a31ba11b7a62018326b3002e8278746ccda

commit a8d18a31ba11b7a62018326b3002e8278746ccda
Author: David Sansome <dsansome@chromium.org>
Date: Thu Feb 02 08:16:22 2017

Roll infra_libs in infra and fix outer_loop's usage of DistributionMetric

BUG= 687521 , 684667 
TBR=sergeyberezin

Change-Id: I1b57d624ed54cfc42204ef1242ba9ff8627a0583
Reviewed-on: https://chromium-review.googlesource.com/435121
Reviewed-by: Dave Sansome <dsansome@chromium.org>
Commit-Queue: Dave Sansome <dsansome@chromium.org>

[modify] https://crrev.com/a8d18a31ba11b7a62018326b3002e8278746ccda/appengine_module/gae_ts_mon/__init__.py
[modify] https://crrev.com/a8d18a31ba11b7a62018326b3002e8278746ccda/bootstrap/deps.pyl
[modify] https://crrev.com/a8d18a31ba11b7a62018326b3002e8278746ccda/infra/libs/service_utils/outer_loop.py

Project Member

Comment 16 by bugdroid1@chromium.org, Feb 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4

commit e0238c25bc0905ede29bcb697a2bb3ca7a8964a4
Author: bpastene <bpastene@chromium.org>
Date: Fri Feb 03 17:57:59 2017

luci-py: Roll infra_libs.

To d9eec3590341f6cd4834ac49f1c662c0a0062d53

BUG= 684667 

Review-Url: https://codereview.chromium.org/2673613003

[modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/appengine/third_party/gae_event_mon/README.swarming
[modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/appengine/third_party/gae_event_mon/__init__.py
[modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/appengine/third_party/gae_ts_mon/README.swarming
[modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/appengine/third_party/gae_ts_mon/__init__.py
[modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/README.swarming
[modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/event_mon/__init__.py
[modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/event_mon/checkouts.py
[modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/ts_mon/common/interface.py
[modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/ts_mon/common/monitors.py
[modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/ts_mon/config.py

Status: Fixed (was: Assigned)
Looks like it's working again: http://shortn/_tyJOZNhlgl

Logs should be free of the errors now.

Sign in to add a comment