Swarming: bot should *only* talk to Swarming, nothing else: drop ts_mon bot side
Issue description

When ts_mon fails, it hoses the bot. It also adds ACL complexity, since we don't control where each bot is located or its identity, and there are multiple identity mechanisms. Since the Swarming server already implements auth_service access control, has ts_mon monitoring capability, and can pipe the monitoring data from the bot to the monitoring sink, the bot should stop talking to the ts_mon server and send the relevant packets to the Swarming server instead. Outages like issue 835268 would have been prevented.

AIs:
- Review http://go/swarming-monitoring-v2 and create Swarming endpoints for the things happening on the bot we want monitored.
- Convert the bot to use this new API, and stop using ts_mon directly.
- Remove ts_mon from the bot.
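A rough sketch of what the proposal could look like on the bot side, not the actual Swarming API: the bot packs its metrics into one payload and hands it to the Swarming server over the channel it already uses, leaving the server to forward the data to the monitoring sink. The endpoint path, the payload field names, and the `post_json` callable are all hypothetical and exist only for illustration.

```python
import time


def build_metrics_payload(bot_id, metrics, now=None):
    """Packs bot-side metrics into one JSON-serializable dict.

    The field names are hypothetical, not a real Swarming schema.
    """
    return {
        'bot_id': bot_id,
        'timestamp': time.time() if now is None else now,
        'metrics': [
            {'name': name, 'value': value}
            for name, value in sorted(metrics.items())
        ],
    }


def report_metrics(post_json, bot_id, metrics, now=None):
    """Sends metrics to the Swarming server and never raises.

    post_json(path, body) is a hypothetical stand-in for the bot's existing
    authenticated channel to the Swarming server. Returns True on success.
    """
    payload = build_metrics_payload(bot_id, metrics, now)
    try:
        # Hypothetical endpoint; the real one would come out of the design doc.
        post_json('/swarming/api/v1/bot/metrics', payload)
        return True
    except Exception:
        # A monitoring failure must not take the bot down.
        return False
```

With this shape the bot needs only one identity and one ACL path (the one it already has with Swarming), at the cost of losing metrics whenever the Swarming server is unreachable.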
Apr 20 2018
Apr 23 2018
TBH, I'm not convinced this is a good idea, and we have already discussed this over the years. The whole point of timeseries monitoring is to be as close to the source of data as possible, so that even if a bot gets hosed and can't talk to the server, it can still report this fact to ts_mon, helping with alerts and diagnostics.

I'd consider monitoring to be a level below Swarming (as a protocol), and hence not subject to its API restrictions. We already send ts_mon data from the machine proper through sysmon in the same way: it's not part of the "bot", it's a foundation layer under it.

The fact that ts_mon hoses a bot upon failure is a bug that needs to be fixed on the bot (or in ts_mon). Multiple identities should properly register with the prodx-mon-chrome-infra Cloud project (see go/inframon-doc > ts_mon), the same way as they already do for other services and ACLs. Perhaps we should add it as a necessary step in the onboarding doc (if there is one).
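One way to read "a bug that needs to be fixed on the bot" is to isolate the monitoring flush so that neither an exception nor a hang can stall the bot. The sketch below assumes a generic `flush` callable; it is not the real ts_mon API, and the timeout value is arbitrary.

```python
import threading


def flush_safely(flush, timeout_secs=5.0):
    """Runs flush() in a daemon thread, bounded by timeout_secs.

    Returns True only if flush() finished without raising before the
    timeout. A hung flush is abandoned; the daemon flag keeps it from
    blocking process exit.
    """
    result = {'ok': False}

    def run():
        try:
            flush()
            result['ok'] = True
        except Exception:
            # Swallow monitoring errors so the bot keeps running.
            pass

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_secs)
    return result['ok'] and not worker.is_alive()
```

This keeps monitoring as a layer under the bot, as argued above, while making its failures non-fatal.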
Comment 1 by mar...@chromium.org, Apr 20 2018