ts_mon is flushing incorrectly in bot_main process |
|||||
Issue description
Bots have their swarming_bot.log (that it, bot_main process) filled with entries like this (independent of the OS):
302 2017-01-24 18:48:25.378 E: Automatic monitoring flush failed.
Traceback (most recent call last):
File "/b/s/swarming_bot.2.zip/infra_libs/ts_mon/common/interface.py", line 248, in _flush_and_log_exceptions
flush()
File "/b/s/swarming_bot.2.zip/infra_libs/ts_mon/common/interface.py", line 111, in flush
state.global_monitor.send(proto)
File "/b/s/swarming_bot.2.zip/infra_libs/ts_mon/common/monitors.py", line 116, in send
body=body)
File "/b/s/swarming_bot.2.zip/third_party/oauth2client/client.py", line 572, in new_request
self._refresh(request_orig)
File "/b/s/swarming_bot.2.zip/third_party/oauth2client/client.py", line 842, in _refresh
self._do_refresh_request(http_request)
File "/b/s/swarming_bot.2.zip/third_party/oauth2client/client.py", line 874, in _do_refresh_request
self.token_uri, method='POST', body=body, headers=headers)
File "/b/s/swarming_bot.2.zip/infra_libs/httplib2_utils.py", line 188, in request
**kwargs)
File "/b/s/swarming_bot.2.zip/infra_libs/httplib2_utils.py", line 252, in request
uri, method, body, *args, **kwargs)
File "/b/s/swarming_bot.2.zip/third_party/httplib2/__init__.py", line 1552, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "/b/s/swarming_bot.2.zip/third_party/httplib2/__init__.py", line 1294, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/b/s/swarming_bot.2.zip/third_party/httplib2/__init__.py", line 1214, in _conn_request
conn.connect()
File "/b/s/swarming_bot.2.zip/third_party/httplib2/__init__.py", line 1037, in connect
self.disable_ssl_certificate_validation, self.ca_certs)
File "/b/s/swarming_bot.2.zip/third_party/httplib2/__init__.py", line 80, in _ssl_wrap_socket
cert_reqs=cert_reqs, ca_certs=ca_certs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 911, in wrap_socket
ciphers=ciphers)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 520, in __init__
self._context.load_verify_locations(ca_certs)
IOError: [Errno 20] Not a directory
It is harmless but it is very noisy and makes issue diagnosis like issue 684648 unnecessarily harder.
A thread interface._FlushThread is created via process_argparse_options() in ts_mon/config.py when --ts-mon-flush == 'auto' (the default) and doesn't synchronize the cacert.pem file utilization with utils/tools.py get_cacerts_bundle().
,
Jan 27 2017
It's harmless in a sense that it doesn't break the bot process. But I think we don't actually send any metrics. (Not sure what metrics we are supposed to send, but tsmon tries to flush, so there is something to send...)
,
Jan 31 2017
+bpastene@ - is this related to your work on ts_mon for swarming bots?
,
Jan 31 2017
Yeah, it has to do with my ts_mon stuff in swarming. It's explicitly using bot's bundle here: https://chrome-internal.googlesource.com/infra/infra_internal/+/master/services/swarming/bot_config_public.py#1257 What bots are you seeing this on maruel@? I'm currently debugging something to do with that bundle not being correct on chromium-swarm-dev bots, but fine on chromium-swarm bots.
,
Jan 31 2017
I ssh'ed into the first bot listed at https://chromium-swarm.appspot.com/botlist which is build1-b4 and saw certification failure in /b/swarm_slave/logs/swarming_bot.log
,
Jan 31 2017
:O Ok, looking.
,
Jan 31 2017
Looks like this started happening recently: http://shortn/_f40rTT2oxm I wonder what that drop in data correlates with. A new swarming version maybe?
,
Jan 31 2017
Lines up pretty well with https://chrome-internal.googlesource.com/infra/puppet/+/ed9a8ab4ef9fff52136acb5d3887151c01e9772d + Dave in case he has any ideas.
,
Jan 31 2017
Note that the ts_mon lib in use by the swarming bot lacks the following changes: https://chromium.googlesource.com/infra/infra/+log/066f135e603f90945be764884a3d2863f72d21c1..master/packages/infra_libs/infra_libs/ts_mon Maybe rolling forward here would help?
,
Jan 31 2017
This seems to correlate with https://chrome-internal-review.googlesource.com/c/317177/ ts_mon now by default talks to ProdX Mon endpoint directly, bypassing PubSub/proxy. It's a Platform One endpoint, and it may require some additional stuff to work correctly. That would be my guess.
,
Jan 31 2017
Re #10: That's the same CL I mentioned in #8 :) I manually ran the bot's ts_mon uploading code from a bot (with some junk values) and it worked: http://shortn/_AAQjqh3w49 Not sure why that managed to go through. Nothing really changed. I wonder if there's some weird race condition in the bot.
,
Feb 1 2017
The PubSubMonitor (old one) passes the ca_certs through to the httplib2.Http constructor, but the HttpsMonitor (new one) does not!
HttpsMonitor just uses httplib2.Http's default. Which is:
try:
# Users can optionally provide a module that tells us where the CA_CERTS
# are located.
import ca_certs_locater
CA_CERTS = ca_certs_locater.get()
except ImportError:
# Default CA certificates file bundled with httplib2.
CA_CERTS = os.path.join(
os.path.dirname(os.path.abspath(__file__ )), "cacerts.txt")
So why does this fail here but not on other machines? My guess is that self._context.load_verify_locations is just a call to the C OpenSSL library's load_verify_locations funcion, and of course that doesn't support loading files from zip archives.
,
Feb 2 2017
Dave got to the fix in https://chromium.googlesource.com/infra/infra/+/bd74bd626329195ccfc6db751152607401c9973b I'll roll it into luci-py and redeploy, then these errors should disappear.
,
Feb 2 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra.git/+/d9eec3590341f6cd4834ac49f1c662c0a0062d53 commit d9eec3590341f6cd4834ac49f1c662c0a0062d53 Author: Benjamin Pastene <bpastene@google.com> Date: Thu Feb 02 02:12:01 2017 Bump the version of infra_libs. BUG= 684667 Change-Id: Iaaffd76c55377a91f5da876f2fbd031e3e8625e7 Reviewed-on: https://chromium-review.googlesource.com/435848 Reviewed-by: Sergey Berezin <sergeyberezin@chromium.org> Reviewed-by: Dave Sansome <dsansome@chromium.org> Commit-Queue: Benjamin Pastene <bpastene@chromium.org> [modify] https://crrev.com/d9eec3590341f6cd4834ac49f1c662c0a0062d53/packages/infra_libs/setup.py
,
Feb 2 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra.git/+/a8d18a31ba11b7a62018326b3002e8278746ccda commit a8d18a31ba11b7a62018326b3002e8278746ccda Author: David Sansome <dsansome@chromium.org> Date: Thu Feb 02 08:16:22 2017 Roll infra_libs in infra and fix outer_loop's usage of DistributionMetric BUG= 687521 , 684667 TBR=sergeyberezin Change-Id: I1b57d624ed54cfc42204ef1242ba9ff8627a0583 Reviewed-on: https://chromium-review.googlesource.com/435121 Reviewed-by: Dave Sansome <dsansome@chromium.org> Commit-Queue: Dave Sansome <dsansome@chromium.org> [modify] https://crrev.com/a8d18a31ba11b7a62018326b3002e8278746ccda/appengine_module/gae_ts_mon/__init__.py [modify] https://crrev.com/a8d18a31ba11b7a62018326b3002e8278746ccda/bootstrap/deps.pyl [modify] https://crrev.com/a8d18a31ba11b7a62018326b3002e8278746ccda/infra/libs/service_utils/outer_loop.py
,
Feb 3 2017
The following revision refers to this bug: https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4 commit e0238c25bc0905ede29bcb697a2bb3ca7a8964a4 Author: bpastene <bpastene@chromium.org> Date: Fri Feb 03 17:57:59 2017 luci-py: Roll infra_libs. To d9eec3590341f6cd4834ac49f1c662c0a0062d53 BUG= 684667 Review-Url: https://codereview.chromium.org/2673613003 [modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/appengine/third_party/gae_event_mon/README.swarming [modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/appengine/third_party/gae_event_mon/__init__.py [modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/appengine/third_party/gae_ts_mon/README.swarming [modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/appengine/third_party/gae_ts_mon/__init__.py [modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/README.swarming [modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/event_mon/__init__.py [modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/event_mon/checkouts.py [modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/ts_mon/common/interface.py [modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/ts_mon/common/monitors.py [modify] https://crrev.com/e0238c25bc0905ede29bcb697a2bb3ca7a8964a4/client/third_party/infra_libs/ts_mon/config.py
,
Feb 3 2017
Looks like it's working again: http://shortn/_tyJOZNhlgl Logs should be free of the errors now. |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by estaab@chromium.org
, Jan 27 2017