Issue 835274

Starred by 3 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 408424




Swarming: bot should *only* talk to Swarming, nothing else: drop ts_mon bot side

Project Member Reported by mar...@chromium.org, Apr 20 2018

Issue description

When ts_mon fails, it hoses the bot. It also adds ACL complexity, since we don't control where each bot is located or what its identity is, and there are multiple identity mechanisms.

Since the Swarming server already implements auth_service access control, has ts_mon monitoring capability, and can pipe the monitoring data from the bot to the monitoring sink, the bot should stop talking to the ts_mon server and send the relevant packets to the Swarming server instead.

Outages like issue 835268 would have been prevented.

AIs:
- Review http://go/swarming-monitoring-v2 and create Swarming endpoints for the things happening on the bot we want monitored.
- Convert the bot to use this new API, and stop using ts_mon directly.
- Remove ts_mon from the bot.
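
The proposed flow (bot sends monitoring data only to Swarming, and the server relays it to the monitoring sink after its usual access check) can be sketched roughly as below. This is only an illustration under stated assumptions: `BotMetricsBuffer`, `make_server_relay`, the JSON payload shape, and the in-memory `sink` list are all hypothetical names invented for this sketch, not the actual Swarming or ts_mon APIs.

```python
import json
from typing import Callable, Dict, List, Optional


class BotMetricsBuffer:
    """Hypothetical bot-side buffer: events go to Swarming, never to ts_mon."""

    def __init__(self, post_to_swarming: Callable[[str], bool],
                 max_events: int = 100):
        self._post = post_to_swarming
        self._max = max_events
        self._events: List[Dict] = []

    def record(self, name: str, value, fields: Optional[Dict] = None) -> None:
        # Bound memory use: drop the oldest event when the buffer is full.
        if len(self._events) >= self._max:
            self._events.pop(0)
        self._events.append(
            {"name": name, "value": value, "fields": fields or {}})

    def flush(self) -> bool:
        """Sends buffered events; a failure never raises into the bot."""
        if not self._events:
            return True
        payload = json.dumps({"events": self._events})
        try:
            ok = self._post(payload)
        except Exception:
            ok = False
        if ok:
            self._events = []
        return ok


def make_server_relay(is_authorized_bot: Callable[[str], bool], sink: list):
    """Hypothetical server-side handler: access check, then forward to sink."""

    def handle(bot_id: str, payload: str) -> bool:
        if not is_authorized_bot(bot_id):
            return False
        for event in json.loads(payload)["events"]:
            sink.append((bot_id, event["name"], event["value"]))
        return True

    return handle
```

In this sketch the bot needs only its existing Swarming identity; the access decision and the connection to the monitoring sink both live on the server. Events stay buffered on a failed flush, so a later flush can retry them.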
 

Comment 1 by mar...@chromium.org, Apr 20 2018

Issue 610006 has been merged into this issue.

Comment 2 by mar...@chromium.org, Apr 20 2018

Blocking: 408424
Cc: sergeybe...@chromium.org
TBH, I'm not convinced this is a good idea, and we have discussed this repeatedly over the years. The whole point of timeseries monitoring is to be as close to the source of data as possible, so even if a bot gets hosed and can't talk to the server, it may still report this fact to ts_mon, helping with alerts and diagnostics.

I'd consider monitoring to be a level below Swarming (as a protocol), hence not subject to its API restrictions. We already send ts_mon data from the machine proper through sysmon in the same way: it's not part of the "bot", it's a foundation layer under it.

The fact that ts_mon hoses a bot upon failure is a bug that needs to be fixed on the bot (or in ts_mon).
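
One way such a bot-side fix could look is to treat every monitoring call as best-effort, so an error in the monitoring path is logged and swallowed instead of propagating into the bot's main loop. The sketch below is an assumption about the shape of that fix, not the actual bot or ts_mon code; `best_effort` and `report_metric` are hypothetical names. It covers exceptions only; a call that hangs would additionally need a timeout.

```python
import functools
import logging


def best_effort(func):
    """Wraps a monitoring call so its exceptions never reach the caller."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Log and continue: monitoring must not take the bot down.
            logging.exception(
                "monitoring call %s failed; continuing", func.__name__)
            return None

    return wrapper


@best_effort
def report_metric(send, name, value):
    # `send` stands in for whatever actually ships the metric (hypothetical).
    send(name, value)
```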

Multiple identities should be properly registered with the prodx-mon-chrome-infra Cloud project (see go/inframon-doc > ts_mon), the same way they already are for other services and ACLs. Perhaps we should add this as a required step in the onboarding doc (if there is one).
