Issue 799182

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 798273



abort outstanding bad jobs in the lab

Project Member Reported by pprabhu@chromium.org, Jan 4 2018

Issue description

Abort existing jobs matching:

arc-gts
arc-gts-qual
arc-cts
arc-cts-qual

ihf@ fixed a bug yesterday, so we no longer care about anything scheduled before today. We can abort all ~40,000 outstanding jobs.

These old jobs are known to leave bad state on the server intermittently, including hung jobs.
So it's better to just kill these before they do much damage.

An example job is 167508339
Richard, can you help figure out the magic atest incantation to blow these away?
 
If we do nothing, the current jobs will take ~7 days to work through the system and may, with a lowish probability, cause capacity issues in that time.
Blocking: 798273
I don't see that many jobs, if I didn't flub the query

MySQL [chromeos_autotest_db]> select count(*) from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and complete = 0 and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts-%' or name like '%arc-gts-%');
+----------+
| count(*) |
+----------+
|      939 |
+----------+
1 row in set (0.96 sec)

MySQL [chromeos_autotest_db]> select count(*) from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and parent_job_id is NULL and complete = 'no' and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts-%' or name like '%arc-gts-%');
+----------+
| count(*) |
+----------+
|      331 |
+----------+
1 row in set, 1 warning (0.49 sec)

Comment 4 by ihf@chromium.org, Jan 4 2018

Try %arc-cts% instead of %arc-cts-%.
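
To see why the narrower pattern undercounts, here is a rough sketch that uses shell globs as a stand-in for SQL LIKE, with `*` playing the role of `%`. The job names are hypothetical, and glob matching is case-sensitive where MySQL's default collation usually is not, so treat this as an approximation only.

```shell
# like_match PATTERN NAME: succeed if NAME matches the glob PATTERN.
# Illustrative helper, not part of autotest.
like_match() {
    case "$2" in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

# Hypothetical job names; real names in afe_jobs may look different.
for name in "some-build/arc-cts" "some-build/arc-cts-qual"; do
    for pattern in "*arc-cts-*" "*arc-cts*"; do
        if like_match "$pattern" "$name"; then
            echo "$pattern matches $name"
        else
            echo "$pattern misses $name"
        fi
    done
done
```

A name that ends in arc-cts is caught only by the broader pattern, because the narrower one requires a trailing hyphen after the suite name.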
Using this query to find jobs:

select jobs.id, name, created_on, complete, active from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and parent_job_id is NULL and complete = 0 and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts-%' or name like '%arc-gts-%');
#4 Ah, magic

MySQL [chromeos_autotest_db]> select count(*) from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and parent_job_id is NULL and complete = 'no' and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts%' or name like '%arc-gts%');
+----------+
| count(*) |
+----------+
|      347 |
+----------+
1 row in set, 1 warning (0.44 sec)

MySQL [chromeos_autotest_db]> select count(*) from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and complete = 'no' and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts%' or name like '%arc-gts%');
+----------+
| count(*) |
+----------+
|    35911 |
+----------+
1 row in set, 1 warning (1.08 sec)

Using this query to find jobs:

select jobs.id, name, created_on, complete, active from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and parent_job_id is NULL and complete = 0 and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts%' or name like '%arc-gts%');

Comment 8 by ihf@chromium.org, Jan 4 2018

Looks good, just abort them all if possible.
Running:

 <~/tmp/cts cut -f1 | sed 1d | parallel -X -j1 echo atest job abort
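
The dry run above can be read as three steps: take the first tab-separated column of the saved query output, drop the header row, and hand the remaining ids to the command as arguments. Below is a minimal sketch of the same idea, assuming xargs in place of GNU parallel and a hypothetical sample file. The `echo` keeps it a dry run, so nothing is aborted.

```shell
# dry_run_abort FILE: print the abort command that would be run.
# FILE is assumed to be tab-separated with a header row and job ids in column 1.
dry_run_abort() {
    cut -f1 <"$1" | sed 1d | xargs echo atest job abort
}

# Hypothetical sample data standing in for the saved query output.
sample=$(mktemp)
printf 'id\tname\n101\tfoo-arc-cts\n102\tfoo-arc-gts\n103\tfoo-arc-cts-qual\n' >"$sample"

dry_run_abort "$sample"   # prints: atest job abort 101 102 103
rm -f "$sample"
```

Checking the printed command first makes it easy to spot a stray header or an empty id list before anything is sent to the server.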

Removing echo for real run
↪ 0 2018-01-04 13:34:10 ayatane@sharanohiar:~/src/chromiumos/chromeos-admin/lab-tools $ <~/tmp/cts cut -f1 | sed 1d | parallel -X -j1 atest job abort
Operation abort_host_queue_entries failed:
    AssertionError: 

Lovely
Did I tell you I tried aborting them :)
chromeos-test@chromeos-server151:~$ <~/cts cut -f1 | sed 1d | xargs /usr/local/autotest/cli/atest job abort
Aborting jobs: 167449533, 167449535, 167496800, 167361053, 167498234, 167413535, 167413536, 167210595, 167210596, 167372162, 167372161, 167275711, 167275710, 167371583, 167498991, 167498999, 167371208, 167499340, 167151012, 167151011, 167371191, 167371197, 167421669, 167114126, 167371803, 167419240, 167416959, 167417375, 167417474, 167429023, 167503790, 167503794, 167499335, 167409566, 167374587, 167276698, 167350149, 167277306, 167277305, 167114274, 167362476, 167498232, 167358289, 167373244, 167373245, 167498161, 167498160, 167358492, 167368986, 167358497, 167113471, 167113470, 167113476, 167214768, 167370949, 167115192, 167416980, 167418358, 167347982, 167373443, 167276710, 167373447, 167373445, 167418400, 167373332, 167423876, 167356089, 167371567, 167371562, 167154170, 167416910, 167498196, 167498197, 167498190, 167114186, 167149387, 167373542, 167276706, 167114188, 167416106, 167349100, 167349106, 167416962, 167378983, 167415453, 167279913, 167425153, 167503106, 167148954, 167379021, 167280076, 167280075, 167370718, 167210365, 167210364, 167413636, 167413634, 167383911, 167210577, 167423831, 167275672, 167275673, 167419968, 167497540, 167497541, 167113343, 167429019, 167113348, 167357459, 167211539, 167277676, 167277675, 167114140, 167500103, 167502968, 167502966, 167151166, 167151165, 167207824, 167207825, 167424421, 167424420, 167113510, 167371231, 167114125, 167208910, 167416884, 167211542, 167409570, 167148963, 167503103, 167146587, 167146586, 167410736, 167410734, 167410735, 167410732, 167419990, 167504491, 167351402, 167351401, 167351405, 167415469, 167276652, 167351409, 167362466, 167353352, 167357410, 167278835, 167278837, 167370978, 167115734, 167425737, 167406904, 167406903, 167406901, 167373474, 167146805, 167146806, 167373470, 167371499, 167497801, 167497804, 167497807, 167497806, 167421649, 167371571, 167278307, 167278305, 167208345, 167208346, 167501488, 167413418, 167425546, 167501487, 167410746, 167151131, 167502200, 167502201, 167500099, 
167151034, 167396404, 167213004, 167151118, 167416979, 167502949, 167502948, 167373502, 167503135, 167276694, 167497097, 167341905, 167276042, 167210974, 167210971, 167498982, 167361959, 167370703, 167354646, 167371725, 167371724, 167360861, 167278244, 167413212, 167278243, 167383904, 167150011, 167278909, 167278908, 167497535, 167497534, 167371210, 167419393, 167418370, 167372147, 167373503, 167360854, 167372143, 167154338, 167373317, 167503036, 167114175, 167114174, 167499002, 167415957, 167351412, 167208908, 167146789, 167146788, 167154289, 167425040, 167210833, 167207969, 167503777, 167212982, 167503779, 167413888, 167417143, 167371915, 167425228, 167504489, 167503457, 167500476, 167502578, 167502574, 167361988, 167429698, 167281666, 167499628, 167413220, 167373541, 167413654, 167413655, 167210995, 167210996, 167421676, 167421674, 167498202, 167415508, 167415509, 167278311, 167278312, 167209397, 167209391, 167210193, 167500558, 167358255, 167358253, 167278758, 167278759, 167210600, 167210601, 167420315, 167113108, 167371211, 167113106, 167420319, 167114136, 167153613, 167114139, 167114138, 167503475, 167115247, 167351372, 167503123, 167503124, 167115241, 167341932, 167115485, 167421518, 167410750, 167209823, 167209821, 167374591, 167374590, 167501116, 167501115, 167373855, 167373854, 167149068, 167114262, 167276678, 167371609, 167371607, 167371604, 167371605, 167419989, 167276705, 167413916, 167369349, 167371374, 167502165, 167417796, 167355863, 167149927, 167502169, 167503134, 167419503, 167371855, 167379263, 167358184, 167358185, 167498189, 167425035, 167417388, 167358263, 167500533, 167418672, 167497674, 167497675, 167207978, 167396421, 167153729, 167361621, 167360892, 167503987, 167503988, 167508338, 167508339, 167501366, 167429701, 167350061, 167501418, 167418220, 167418221, 167146865, 167210223

Status: Fixed (was: Untriaged)
It is completely obvious that this means permission error

Operation abort_host_queue_entries failed:
    AssertionError: 

Running atest as chromeos-test works
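
Since the bare AssertionError gives no hint about the cause, one option is to check the account before running the abort. Here is a small sketch, assuming a hypothetical helper named `require_user`. It only compares the current login name and knows nothing about how atest itself checks permissions.

```shell
# require_user NAME: succeed only if the current user is NAME.
# Hypothetical helper, not part of atest.
require_user() {
    current=$(id -un)
    if [ "$current" != "$1" ]; then
        echo "run this as $1 (currently $current)" >&2
        return 1
    fi
}

# Dry run: print the command only when the account matches.
if require_user chromeos-test; then
    echo atest job abort "<job ids>"
fi
```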

Comment 14 by ihf@chromium.org, Jan 4 2018

cautotest/afe/ still shows 40k queued jobs. Does it take a while to propagate the new status?
Hm, in theory the scheduler should be propagating the suite job abort to the child jobs.
There are 28761 left, so some of them are getting aborted. I will just abort all of them manually.

Comment 17 by ihf@chromium.org, Jan 5 2018

I see only 11.6k now, so I think we are good (soon). Thank you!
