
Issue 590811


Issue metadata

Status: Verified
Owner:
Closed: Mar 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Legion lab swarming slaves are coming offline

Project Member Reported by sh...@chromium.org, Feb 29 2016

Issue description

Every so often the legion lab swarming slaves (https://omnibot-legion-swarming-server.appspot.com/) become unavailable for various reasons.

This is noted on KitchenSync runs on Linux, Mac and Windows:
https://uberchromegw.corp.google.com/i/internal.client.kitchensync/waterfall

mmeade@ has been the sheriff of this swarming lab. This bug is just meant to track the work done so far and the future work needed to fix the swarming slaves' stability issues.
 

Comment 1 by mmeade@chromium.org, Feb 29 2016

Labels: -Pri-2 Pri-1
Status: Assigned (was: Available)
I've reset all but 2 VMs, vm10-d and vm17-d, which I've kept for investigation (one example of each failure case).

vm17-d went into quarantine with the error that there are > 1024 files in /tmp (vm17-d is a Mac, so it's the equivalent directory there, but I am also getting this on Linux bots). However, when I ssh into the VM and run ls, I only see a few files (vmware, com.apple.*), never more than 5 or so. A simple reboot, without deleting the files, takes the bot out of quarantine. All of my Linux and Mac bots are hitting this about once a week.
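One way to cross-check the count that `ls` reports is to count entries from Python, since a plain `ls` omits dot-files and does not descend into subdirectories. The sketch below is only a diagnostic aid: the thread does not say how the quarantine check counts files, so whether hidden or nested entries matter is an assumption, and `count_tmp_entries` is a hypothetical helper name.

```python
import os
import tempfile


def count_tmp_entries(root):
    """Return (top_level, recursive) entry counts under root.

    Both counts include hidden names, which a plain `ls` leaves out.
    Assumption: the real quarantine check may use neither of these rules.
    """
    # Direct children only (files, directories, and dot-files).
    top_level = len(os.listdir(root))
    # Every file and directory below root; symlinked dirs are not followed.
    recursive = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        recursive += len(dirnames) + len(filenames)
    return top_level, recursive
```

Calling `count_tmp_entries(tempfile.gettempdir())` on a quarantined bot would show whether either number comes close to 1024 while `ls` shows only a handful of names.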

vm10-d is a Windows machine, and it went into quarantine because there wasn't enough free space on C:. All Windows machines are showing the same behavior. When I take a look, C:\Users\chrome-bot\cache is filling up (> 32GB) with files whose names look like hashes. If I manually delete these files and reboot, the bots come back online.
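The manual cleanup described above could be approximated by trimming the cache directory to a size budget, oldest files first. This is a minimal sketch under stated assumptions, not the swarming bot's own cache policy: `trim_cache` and `max_bytes` are hypothetical names, and modification time is used as a stand-in for "least recently used".

```python
import os


def trim_cache(cache_dir, max_bytes):
    """Delete the oldest files until cache_dir fits within max_bytes.

    Only regular files directly inside cache_dir are considered.
    Returns the list of removed paths, oldest first.
    """
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path):
            st = os.stat(path)
            entries.append((st.st_mtime, st.st_size, path))
    entries.sort()  # oldest modification time first
    total = sum(size for _mtime, size, _path in entries)
    removed = []
    for _mtime, size, path in entries:
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

A budget well below the free space of the drive (for example a few GB on a disk of this size) would keep the bot from hitting the free-space quarantine, at the cost of more cache misses.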

I submitted a CL last week that added logic to clean up tmp after a test completes. It looks like this is working, as tmp is clean on all OSes.
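The CL itself is not shown in this thread. As a generic illustration of the "clean up after the test completes" pattern, the sketch below gives each task its own scratch directory and removes it in a `finally` block, so cleanup also happens when the task raises. `run_with_scratch_dir` is a hypothetical helper, not code from the CL.

```python
import shutil
import tempfile


def run_with_scratch_dir(task):
    """Run task(scratch_dir) and always delete scratch_dir afterwards."""
    scratch = tempfile.mkdtemp(prefix="task_")
    try:
        return task(scratch)
    finally:
        # ignore_errors avoids masking the task's own exception if
        # some file cannot be removed (e.g. still open on Windows).
        shutil.rmtree(scratch, ignore_errors=True)
```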

Comment 2 by mmeade@chromium.org, Feb 29 2016

Cc: vadimsh@chromium.org
+vadimsh

He may also have insight into this.
There's an issue with cache sizing for smaller disks.

Mike, why isn't the bot running on E:\? There's a large disk there just for that. All our bots run from e:\b\swarm_slave\ explicitly for this. Here's the stats for vm10-d:

disks: 
  c:\: 
    free_mb: 3222.7
    size_mb: 81818.0
  e:\: 
    free_mb: 253905.6
    size_mb: 255998.0
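Stats in the same shape as the ones above can be gathered with the standard library, which makes it easy to spot a drive that is about to trip a free-space check. This is a sketch only: `disk_stats`, `low_on_space`, and the `min_free_mb` threshold are hypothetical, and it assumes the `*_mb` values are in units of 1024 * 1024 bytes.

```python
import shutil


def disk_stats(paths):
    """Return {path: {"free_mb": ..., "size_mb": ...}}, one decimal place."""
    stats = {}
    for path in paths:
        usage = shutil.disk_usage(path)
        stats[path] = {
            "free_mb": round(usage.free / (1024.0 * 1024.0), 1),
            "size_mb": round(usage.total / (1024.0 * 1024.0), 1),
        }
    return stats


def low_on_space(stats, min_free_mb):
    """Return the sorted paths whose free space is below min_free_mb."""
    return sorted(p for p, s in stats.items() if s["free_mb"] < min_free_mb)
```

With the numbers reported for vm10-d, any threshold between roughly 3.2 GB and 250 GB flags only c:\, which matches the observation that the bot should run from e:\.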
I'll go ahead and switch that over today and see if it still repros.
Status: Verified (was: Assigned)
Fixed

Comment 6 by aga...@chromium.org, Apr 26 2016

Components: Infra>Platform>Swarming
Labels: -Infra-Swarming
