abort outstanding bad jobs in the lab
Issue description

Abort existing jobs matching arc-gts, arc-gts-qual, arc-cts, arc-cts-qual.

ihf@ fixed a bug yesterday so that we don't care about anything scheduled before today. We can abort all ~40,000 outstanding jobs. These old jobs are known to leave bad state on the server intermittently, including hung jobs. So it's better to just kill these before they do much damage.

An example job is 167508339

Richard, can you help figure out the magic atest incantation to blow these away?
,
Jan 4 2018
,
Jan 4 2018
I don't see that many jobs, if I didn't flub the query

MySQL [chromeos_autotest_db]> select count(*) from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and complete = 0 and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts-%' or name like '%arc-gts-%');
+----------+
| count(*) |
+----------+
|      939 |
+----------+
1 row in set (0.96 sec)

MySQL [chromeos_autotest_db]> select count(*) from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and parent_job_id is NULL and complete = 'no' and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts-%' or name like '%arc-gts-%');
+----------+
| count(*) |
+----------+
|      331 |
+----------+
1 row in set, 1 warning (0.49 sec)
,
Jan 4 2018
Try %arc-cts% instead of %arc-cts-%.
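The difference between the two patterns can be tried out without a database. A minimal sketch, using shell globs as a loose stand-in for SQL LIKE (`*` plays the role of `%`); the job names are made up for illustration and are not taken from the lab database:

```shell
#!/bin/sh
# Hypothetical job names, for illustration only.
# Shell globs stand in for SQL LIKE here: '*' plays the role of '%'.

# Analogous to: name like '%arc-cts-%' (requires a '-' right after arc-cts)
matches_strict() {
    case "$1" in *arc-cts-*) return 0 ;; *) return 1 ;; esac
}

# Analogous to: name like '%arc-cts%' (anything may follow arc-cts)
matches_loose() {
    case "$1" in *arc-cts*) return 0 ;; *) return 1 ;; esac
}

for name in "example/arc-cts-qual/foo" "example/arc-cts/foo"; do
    if matches_strict "$name"; then strict=yes; else strict=no; fi
    if matches_loose "$name"; then loose=yes; else loose=no; fi
    printf '%s strict=%s loose=%s\n' "$name" "$strict" "$loose"
done
```

A name where `arc-cts` is followed by something other than `-` is only caught by the looser pattern, which is one way the stricter query could undercount.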
,
Jan 4 2018
Using this query to find jobs:

select jobs.id, name, created_on, complete, active from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and parent_job_id is NULL and complete = 0 and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts-%' or name like '%arc-gts-%');
,
Jan 4 2018
#4 Ah, magic

MySQL [chromeos_autotest_db]> select count(*) from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and parent_job_id is NULL and complete = 'no' and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts%' or name like '%arc-gts%');
+----------+
| count(*) |
+----------+
|      347 |
+----------+
1 row in set, 1 warning (0.44 sec)

MySQL [chromeos_autotest_db]> select count(*) from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and complete = 'no' and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts%' or name like '%arc-gts%');
+----------+
| count(*) |
+----------+
|    35911 |
+----------+
1 row in set, 1 warning (1.08 sec)
,
Jan 4 2018
Using this query to find jobs:

select jobs.id, name, created_on, complete, active from afe_jobs as jobs left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id where owner = 'chromeos-test' and parent_job_id is NULL and complete = 0 and created_on < '2018-01-04 08:00:00' and (name like '%arc-cts%' or name like '%arc-gts%');
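The thread does not say how the query result ended up in a file. One plausible way, sketched here as an assumption, is the mysql client's batch mode, which prints tab-separated rows with a header line. The helper name `export_jobs` is hypothetical, and connection options (host, user, password) are left out on purpose:

```shell
#!/bin/sh
# Assumption: the job list is exported with the mysql client in batch mode.
# The thread does not state how the file of job IDs was actually produced.
QUERY="select jobs.id, name, created_on, complete, active
from afe_jobs as jobs
left join afe_host_queue_entries as hqes on jobs.id = hqes.job_id
where owner = 'chromeos-test'
  and parent_job_id is NULL
  and complete = 0
  and created_on < '2018-01-04 08:00:00'
  and (name like '%arc-cts%' or name like '%arc-gts%');"

# Hypothetical helper: writes the tab-separated result to the file in $1.
# --batch gives one row per line with tab-separated columns and a header.
export_jobs() {
    mysql --batch --execute="$QUERY" chromeos_autotest_db > "$1"
}
```

With output in that shape, the job ID is the first tab-separated column and the first line is the header, which would be consistent with a `cut -f1 | sed 1d` step downstream.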
,
Jan 4 2018
Looks good, just abort them all if possible.
,
Jan 4 2018
Running:

<~/tmp/cts cut -f1 | sed 1d | parallel -X -j1 echo atest job abort

Removing echo for real run
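The dry-run pattern above can be tried on a throwaway file first. A minimal sketch with made-up job IDs and names; it uses xargs in place of GNU parallel (xargs is also what the later run on the server uses), and the leading `echo` means nothing is actually aborted:

```shell
#!/bin/sh
# Sketch of the dry-run pattern, using a made-up TSV file.
# The job IDs and names below are illustrative, not real lab data.
tmp=$(mktemp)
printf 'id\tname\n101\texample/arc-cts/foo\n102\texample/arc-gts/bar\n' > "$tmp"

# cut -f1  keeps the first tab-separated column (the job ID)
# sed 1d   drops the header row
# xargs    appends the remaining IDs to the command line
# The leading "echo" prints the command instead of running it.
dry_run=$(cut -f1 < "$tmp" | sed 1d | xargs echo atest job abort)
printf '%s\n' "$dry_run"

rm -f "$tmp"
```

Once the printed command looks right, dropping `echo` turns the same pipeline into the real run.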
,
Jan 4 2018
↪ 0 2018-01-04 13:34:10 ayatane@sharanohiar:~/src/chromiumos/chromeos-admin/lab-tools $ <~/tmp/cts cut -f1 | sed 1d | parallel -X -j1 atest job abort
Operation abort_host_queue_entries failed:
AssertionError:
Lovely
,
Jan 4 2018
Did I tell you I tried aborting them :)
,
Jan 4 2018
chromeos-test@chromeos-server151:~$ <~/cts cut -f1 | sed 1d | xargs /usr/local/autotest/cli/atest job abort Aborting jobs: 167449533, 167449535, 167496800, 167361053, 167498234, 167413535, 167413536, 167210595, 167210596, 167372162, 167372161, 167275711, 167275710, 167371583, 167498991, 167498999, 167371208, 167499340, 167151012, 167151011, 167371191, 167371197, 167421669, 167114126, 167371803, 167419240, 167416959, 167417375, 167417474, 167429023, 167503790, 167503794, 167499335, 167409566, 167374587, 167276698, 167350149, 167277306, 167277305, 167114274, 167362476, 167498232, 167358289, 167373244, 167373245, 167498161, 167498160, 167358492, 167368986, 167358497, 167113471, 167113470, 167113476, 167214768, 167370949, 167115192, 167416980, 167418358, 167347982, 167373443, 167276710, 167373447, 167373445, 167418400, 167373332, 167423876, 167356089, 167371567, 167371562, 167154170, 167416910, 167498196, 167498197, 167498190, 167114186, 167149387, 167373542, 167276706, 167114188, 167416106, 167349100, 167349106, 167416962, 167378983, 167415453, 167279913, 167425153, 167503106, 167148954, 167379021, 167280076, 167280075, 167370718, 167210365, 167210364, 167413636, 167413634, 167383911, 167210577, 167423831, 167275672, 167275673, 167419968, 167497540, 167497541, 167113343, 167429019, 167113348, 167357459, 167211539, 167277676, 167277675, 167114140, 167500103, 167502968, 167502966, 167151166, 167151165, 167207824, 167207825, 167424421, 167424420, 167113510, 167371231, 167114125, 167208910, 167416884, 167211542, 167409570, 167148963, 167503103, 167146587, 167146586, 167410736, 167410734, 167410735, 167410732, 167419990, 167504491, 167351402, 167351401, 167351405, 167415469, 167276652, 167351409, 167362466, 167353352, 167357410, 167278835, 167278837, 167370978, 167115734, 167425737, 167406904, 167406903, 167406901, 167373474, 167146805, 167146806, 167373470, 167371499, 167497801, 167497804, 167497807, 167497806, 167421649, 167371571, 167278307, 167278305, 167208345, 
167208346, 167501488, 167413418, 167425546, 167501487, 167410746, 167151131, 167502200, 167502201, 167500099, 167151034, 167396404, 167213004, 167151118, 167416979, 167502949, 167502948, 167373502, 167503135, 167276694, 167497097, 167341905, 167276042, 167210974, 167210971, 167498982, 167361959, 167370703, 167354646, 167371725, 167371724, 167360861, 167278244, 167413212, 167278243, 167383904, 167150011, 167278909, 167278908, 167497535, 167497534, 167371210, 167419393, 167418370, 167372147, 167373503, 167360854, 167372143, 167154338, 167373317, 167503036, 167114175, 167114174, 167499002, 167415957, 167351412, 167208908, 167146789, 167146788, 167154289, 167425040, 167210833, 167207969, 167503777, 167212982, 167503779, 167413888, 167417143, 167371915, 167425228, 167504489, 167503457, 167500476, 167502578, 167502574, 167361988, 167429698, 167281666, 167499628, 167413220, 167373541, 167413654, 167413655, 167210995, 167210996, 167421676, 167421674, 167498202, 167415508, 167415509, 167278311, 167278312, 167209397, 167209391, 167210193, 167500558, 167358255, 167358253, 167278758, 167278759, 167210600, 167210601, 167420315, 167113108, 167371211, 167113106, 167420319, 167114136, 167153613, 167114139, 167114138, 167503475, 167115247, 167351372, 167503123, 167503124, 167115241, 167341932, 167115485, 167421518, 167410750, 167209823, 167209821, 167374591, 167374590, 167501116, 167501115, 167373855, 167373854, 167149068, 167114262, 167276678, 167371609, 167371607, 167371604, 167371605, 167419989, 167276705, 167413916, 167369349, 167371374, 167502165, 167417796, 167355863, 167149927, 167502169, 167503134, 167419503, 167371855, 167379263, 167358184, 167358185, 167498189, 167425035, 167417388, 167358263, 167500533, 167418672, 167497674, 167497675, 167207978, 167396421, 167153729, 167361621, 167360892, 167503987, 167503988, 167508338, 167508339, 167501366, 167429701, 167350061, 167501418, 167418220, 167418221, 167146865, 167210223
,
Jan 4 2018
It is completely obvious that this means a permission error:

Operation abort_host_queue_entries failed:
AssertionError:

Running atest as chromeos-test works.
,
Jan 4 2018
cautotest/afe/ still shows 40k queued jobs. Does it take a while to propagate the new status?
,
Jan 5 2018
Hm, in theory the scheduler should be propagating the suite job abort to the child jobs.
,
Jan 5 2018
There are 28761 left, so some of them are getting aborted. I will just abort all of them manually.
,
Jan 5 2018
I see only 11.6k now, so I think we are good (soon). Thank you!
Comment 1 by pprabhu@chromium.org
, Jan 4 2018