coral-paladin: suite and test job alive after timeout abort |
|||||||
Issue descriptionA recent coral-paladin run failed during bvt-inline HWTest stage. http://shortn/_sytuLI3GcE Provisioning succeeded, however there was a test that was aborted. According to jrbarnette@, the aborted test was never scheduled. Assigning to current infra deputy to investigate.
,
Jun 13 2018
One key early finding: The suite job never completes. At the time of this writing, the process is still live on the drone: chromeos-test@cros-full-0027:~$ ps -ef | grep 208133752 | cut -c 1-120 chromeo+ 81196 260142 0 12:15 pts/17 00:00:00 grep --color 208133752 chromeo+ 87208 1 0 03:29 ? 00:00:00 /usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a chromeo+ 87384 87208 0 03:29 ? 00:00:00 /opt/infra-tools/usr/bin/lucifer_run_job -autotestdir /usr/local/autot chromeo+ 87492 87384 0 03:29 ? 00:00:05 /usr/bin/python -u /usr/local/autotest/server/autoserv -s -P 208133752 chromeo+ 87649 87492 0 03:29 ? 00:00:00 /usr/bin/python -u /usr/local/autotest/server/autoserv -s -P 208133752 chromeo+ 87651 87492 0 03:29 ? 00:00:00 /usr/bin/python -u /usr/local/autotest/server/autoserv -s -P 208133752 The job started at 3:22AM, with a 90 minute timeout. Apparently, the process was never sent SIGTERM at around 4:52AM when it should have been. Possibly, the job was never actually aborted.
,
Jun 13 2018
Looks like job_aborter jammed
job_aborter: 2018-06-13 08:11:07,187:INFO:transport:new_request:179:Refreshing due to a 401 (attempt 1/2)
job_aborter: 2018-06-13 08:11:07,213:DEBUG:crypt:make_signed_jwt:100:['eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjA1ZGI0ZGU2ZDM5MjBmYzkxNjIwNDA1MGQ4YzRkNDc
3MWI0NTQ3ZDQifQ', 'eyJpc3MiOiJjaHJvbWVvcy1pbmZyYUBwcm9keC1tb24tY2hyb21lLWluZnJhLmdvb2dsZS5jb20uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmd
vb2dsZWFwaXMuY29tL2F1dGgvcHJvZHhtb24iLCJhdWQiOiJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCJleHAiOjE1Mjg5MDYyNjcsImlhdCI6MTUyODkwMjY2N30', 'nI
YvoOYf_Q2Dtj-M7UFys9vqRkqK2FUjnk6PJlSpb7pZr2MVRs0ZUqGegnTh4MyEORbHfUQck5DGTo9_4YD0ZVxwgS2Y5foe0punE5nX1VKqVmW04nkQWC3vlmWSasR3-yNLkwiaoeRAS6eluSM3yo340zSS5jbT
d8oeJYSK--QoRJYXkQwJxlZEeDBAJqaZxRpHOq1HWeITECExpB4ZUBleOG26lKf5DppIqmsFGP-2Utgb7hJxwX9h0mvy_x2pGHqzkdeRBATv1r3A_ddmq85b8I16dgkt1rJ2KAsa3Tnx_y1DM7wSxZH5_pdBT4
TQHiIpCJFgnjH1f13YVEcEGg']
job_aborter: 2018-06-13 08:11:07,214:INFO:client:_do_refresh_request:795:Refreshing access_token
job_aborter: 2018-06-13 09:11:07,205:INFO:transport:new_request:179:Refreshing due to a 401 (attempt 1/2)
job_aborter: 2018-06-13 09:11:07,221:DEBUG:crypt:make_signed_jwt:100:['eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjA1ZGI0ZGU2ZDM5MjBmYzkxNjIwNDA1MGQ4YzRkNDc
3MWI0NTQ3ZDQifQ', 'eyJpc3MiOiJjaHJvbWVvcy1pbmZyYUBwcm9keC1tb24tY2hyb21lLWluZnJhLmdvb2dsZS5jb20uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmd
vb2dsZWFwaXMuY29tL2F1dGgvcHJvZHhtb24iLCJhdWQiOiJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCJleHAiOjE1Mjg5MDk4NjcsImlhdCI6MTUyODkwNjI2N30', 'eI
j7MTYBTLwEedgHmpGS6cCnwl88CAi8pwvsCjhPlo40XU0hvkfL36H4TZjQmJ1jQ3_Q3SYW91I0kOTjJl8EtDEqJ6X_tnt6nSE3mdZK6ucVxqeSuMKYw6-bXdb8RQ9o85iMYEytMm-MjZ5rmidhTsW5Pe2Ds2Ej
zX34BRa01yaJ5_uma654ezqxM05aKzBSvoBrPASSp9nY2k8E-v_pxlRaJZKsb3rl3GAFdVIWHf0fw_XuB7v_DH00sHwQ7bCGaahFlSwhcR-vm8NLftubVVLFfRiSwShozMmcK6NehahnygDhmZZpLqawS7EMxy
KKpSbTqfoD_iMFrfROKXxp_Q']
job_aborter: 2018-06-13 09:11:07,221:INFO:client:_do_refresh_request:795:Refreshing access_token
job_aborter: 2018-06-13 10:11:07,219:INFO:transport:new_request:179:Refreshing due to a 401 (attempt 1/2)
job_aborter: 2018-06-13 10:11:07,243:DEBUG:crypt:make_signed_jwt:100:['eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjA1ZGI0ZGU2ZDM5MjBmYzkxNjIwNDA1MGQ4YzRkNDc
3MWI0NTQ3ZDQifQ', 'eyJpc3MiOiJjaHJvbWVvcy1pbmZyYUBwcm9keC1tb24tY2hyb21lLWluZnJhLmdvb2dsZS5jb20uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmd
vb2dsZWFwaXMuY29tL2F1dGgvcHJvZHhtb24iLCJhdWQiOiJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCJleHAiOjE1Mjg5MTM0NjcsImlhdCI6MTUyODkwOTg2N30', 'G8
gsMkk75jmc_J7NQRiru8p1r4sOuyepxo4-VIKdW71vBPEmgE5TNrAT-eyWwItNJGh8vdoEQR5ZXJZQdYQsBMX6EsVMrjNJa99GfYczwE-sqm7C5oGmLq7KhNuswLBQt7cDBiFM7lXb8Dsl82y4oOy0cEVCqz55
079CcQPh3iW7-Geqfnmfzo3csA2s9CNOCgqAPjCczmcEVg8p2ZQkeCJx2xKuGrgl0Y1ztWdMmTG7crQrDsXjryicrEF00W1LXL__Lgtb093ceEZEtyF-aY7_BHudg1pvbdNE5_tCfWo8hB34xp3WsypaOWPdTa
Dga62iXuSlycU2I4AzT9K_HA']
job_aborter: 2018-06-13 10:11:07,243:INFO:client:_do_refresh_request:795:Refreshing access_token
job_aborter: 2018-06-13 11:11:07,243:INFO:transport:new_request:179:Refreshing due to a 401 (attempt 1/2)
job_aborter: 2018-06-13 11:11:07,268:DEBUG:crypt:make_signed_jwt:100:['eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjA1ZGI0ZGU2ZDM5MjBmYzkxNjIwNDA1MGQ4YzRkNDc
3MWI0NTQ3ZDQifQ', 'eyJpc3MiOiJjaHJvbWVvcy1pbmZyYUBwcm9keC1tb24tY2hyb21lLWluZnJhLmdvb2dsZS5jb20uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmd
vb2dsZWFwaXMuY29tL2F1dGgvcHJvZHhtb24iLCJhdWQiOiJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCJleHAiOjE1Mjg5MTcwNjcsImlhdCI6MTUyODkxMzQ2N30', 'TM
A41IYLh27exQTqiHRIKJJ3UeJtKywtUV7vblzRlDU-8dnr53AqngtYm8m-Np18s7WXJ8fCF2O7T_wDMz6r8BGwT5yl6ER-0YC0E5s4Jl-7IAr-jOop0eJednwRo2ysWCyi-u-LLunOKiQ6lsZZ35LWzM6i-3NJ
9_hPS0nPh6XXM2CL4WpD3IxOFJ1Wjt4wsYa7YU_kF_SMX-UwvOUmtByxRymmRf2OUwCWUMtkDG__Sq3y8U1DVIQNviccdEaSgGY5a3HkDbdoXb9SfQ7y3inpCZdVbb-ZmKfyKqelAErzDIz6RbyZmgrKNNljay
ianVt1TK9lR2G7ueSWMp0Tog']
job_aborter: 2018-06-13 11:11:07,268:INFO:client:_do_refresh_request:795:Refreshing access_token
job_aborter: 2018-06-13 12:11:07,263:INFO:transport:new_request:179:Refreshing due to a 401 (attempt 1/2)
job_aborter: 2018-06-13 12:11:07,290:DEBUG:crypt:make_signed_jwt:100:['eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjA1ZGI0ZGU2ZDM5MjBmYzkxNjIwNDA1MGQ4YzRkNDc
3MWI0NTQ3ZDQifQ', 'eyJpc3MiOiJjaHJvbWVvcy1pbmZyYUBwcm9keC1tb24tY2hyb21lLWluZnJhLmdvb2dsZS5jb20uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmd
vb2dsZWFwaXMuY29tL2F1dGgvcHJvZHhtb24iLCJhdWQiOiJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCJleHAiOjE1Mjg5MjA2NjcsImlhdCI6MTUyODkxNzA2N30', 'oL
0qxu6y7WUWCysmhd1jMwrKUVV2PrTAaiTs5S3pDwsbHVVw5pq9AYwopAQNDrP8KcoQhjhGB7oOlwEsVAHxUGD_g2du2XyakgAXbWHdgUObk52iWdGD5s9k3v_iYKs5hrDugD6QWGR_gcfj2cw5dZcdfrShQLlv
dT6vJAK-jD19OIEfBmL0yBNiUjwGsT_WSxxHooj80cj8eBbBC7YzvVthbqO6NpeZE4C5MC1AlpmnJx7OjBngDXAIPZlBAFt4LYN4R9624ffq1BmKYc8B9Pd8rSAXUfojFJCNYAQHgeKoFjs05C1AChItJwHcor
KNopBYckHaC7DEZqcBxkWaqQ']
job_aborter: 2018-06-13 12:11:07,291:INFO:client:_do_refresh_request:795:Refreshing access_token
ESC[31m[E2018-06-13T12:18:44.655010-07:00 256008 0 http.go:89]ESC[0m Monitoring push failed.
Response body: {
"error": {
"code": 503,
"message": "The service is currently unavailable.",
"status": "UNAVAILABLE",
"details": [
{
"@type": "type.googleapis.com/google.rpc.DebugInfo",
"detail": "[ORIGINAL ERROR] RPC::DEADLINE_EXCEEDED: "
}
]
}
}
,
Jun 13 2018
The problematic login_LoginSuccess job ran and passed on the shard.
Logs are available:
http://chromeos-skunk-4.mtv.corp.google.com/results/208134213-chromeos-test/chromeos2-row4-rack4-host2/
,
Jun 13 2018
Last tick was at 3 job_aborter: 2018-06-13 03:08:20,221:DEBUG:job_aborter:_main_loop:66:Tick
,
Jun 13 2018
Logs from the shard relating to the problem test job: chromeos-test@chromeos-skunk-4:/usr/local/autotest/logs$ grep 208134213 $(ls scheduler.log* | tail -2) | grep -v 'Aborting.*job timeout' scheduler.log.2018-06-12-17.00.35:06/13 03:58:52.834 INFO | luciferlib:0385| Spawning '/usr/bin/env', ['/usr/bin/env', 'GOOGLE_APPLICATION_CREDENTIALS=/creds/service_accounts/lucifer-drone.json', '/usr/local/autotest/bin/job_reporter', '--jobdir', '/usr/local/autotest/leases', '--run-job-path', '/opt/infra-tools/usr/bin/lucifer_run_job', '--lucifer-level', 'STARTING', '--job-id', '208134213', '--results-dir', u'/usr/local/autotest/results/208134213-chromeos-test/chromeos2-row4-rack4-host2', '--execution-tag', u'208134213-chromeos-test/chromeos2-row4-rack4-host2'], u'/usr/local/autotest/results/208134213-chromeos-test/chromeos2-row4-rack4-host2/lucifer/job_reporter_output.log' scheduler.log.2018-06-13-04.00.46:06/13 04:00:56.926 INFO | drone_manager:0813| monitoring pidfile /usr/local/autotest/results/208134213-chromeos-test/chromeos2-row4-rack4-host2/.autoserv_execute scheduler.log.2018-06-13-04.00.46:06/13 04:00:57.564 INFO | agent_task:0503| Recovering process localhost/153398 for QueueTask at 208134213-chromeos-test/chromeos2-row4-rack4-host2 scheduler.log.2018-06-13-04.00.46:06/13 04:31:39.371 WARNI|monitor_db_cleanup:0114| Aborting entry chromeos2-row4-rack4-host2/208134213 (208708210) due to max runtime scheduler.log.2018-06-13-04.00.46:06/13 05:37:16.913 WARNI| drone_manager:0369| dropping leaked pidfile /usr/local/autotest/results/208134213-chromeos-test/chromeos2-row4-rack4-host2/.autoserv_execute scheduler.log.2018-06-13-04.00.46:06/13 05:37:16.913 INFO | drone_manager:0825| forgetting pidfile /usr/local/autotest/results/208134213-chromeos-test/chromeos2-row4-rack4-host2/.autoserv_execute chromeos-test@chromeos-skunk-4:/usr/local/autotest/logs$ grep 208134213 $(ls scheduler.log* | tail -2) | grep 'Aborting.*job timeout' | wc -l 86
,
Jun 13 2018
,
Jun 13 2018
,
Jun 13 2018
job_aborter jammed because I forgot to set a timeout on socket connect, the socket for job 208102899 was blocking for some reason.
,
Jun 13 2018
OK. A brief timeline on the shard:
03:59:02 - autoserv for login_LoginSuccess begins
03:59:56 - the autoserv process logs that the test has passed
04:31:39 - scheduler reports aborting the login_LoginSuccess
(this is the timeout for the test, not the suite).
Somehow, after the job terminated, test results didn't get recorded.
,
Jun 13 2018
> job_aborter jammed because I forgot to set a timeout on socket connect, the socket for job 208102899 was blocking for some reason. It feels like maybe we have at least two bugs here: 1) job_aborter jammed, preventing expected cleanup 2) The test on the shard never reported its results in the first place
,
Jun 13 2018
Re #11 job was still running on shard, so not getting synced to master is WAI Another issue: 208102899 is jammed on tko/parse. Parsing does not respect abort (WAI as of ~2 years ago). So the job stays running.
,
Jun 13 2018
Filed Issue 852527 for tko/parse
,
Jun 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d4ebf04c6e41d70e220f12d276707d3bf9c788b2 commit d4ebf04c6e41d70e220f12d276707d3bf9c788b2 Author: Allen Li <ayatane@chromium.org> Date: Wed Jun 13 23:41:24 2018 autotest: Don't block on abort socket Rarely, the socket can get jammed. Unix sockets block until both sides connect to the socket. The Lucifer side can get disconnected for various reasons, so it's dangerous to block on this (see bug). BUG= chromium:852476 TEST=monkeypatch on drone Change-Id: I2091cbca27e6fbb0104e78db422c02b569972999 Reviewed-on: https://chromium-review.googlesource.com/1099723 Commit-Ready: Craig Bergstrom <craigb@chromium.org> Tested-by: Craig Bergstrom <craigb@chromium.org> Reviewed-by: Craig Bergstrom <craigb@chromium.org> [modify] https://crrev.com/d4ebf04c6e41d70e220f12d276707d3bf9c788b2/venv/lucifer/leasing.py
,
Jun 18 2018
Fixed pending prod push.
,
Jun 18 2018
,
Jun 26 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ca32525c79bc9c45e7c6b97fa82985eab50a474f commit ca32525c79bc9c45e7c6b97fa82985eab50a474f Author: Allen Li <ayatane@chromium.org> Date: Tue Jun 26 23:02:32 2018 autotest: Add test for block on abort socket BUG= chromium:852476 TEST=Run test with fix commented out Change-Id: I45609fd7ca61a7ec100ee4936f08334c649ae84e Reviewed-on: https://chromium-review.googlesource.com/1100169 Commit-Ready: Allen Li <ayatane@chromium.org> Tested-by: Allen Li <ayatane@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> [modify] https://crrev.com/ca32525c79bc9c45e7c6b97fa82985eab50a474f/venv/lucifer/leasing_unittest.py [add] https://crrev.com/ca32525c79bc9c45e7c6b97fa82985eab50a474f/venv/lucifer/cmd/test/abort_socket_norecv.py
,
Jul 2
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by aaboagye@chromium.org
, Jun 13 2018