New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 852476 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jul 2
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

coral-paladin: suite and test job alive after timeout abort

Project Member Reported by aaboagye@chromium.org, Jun 13 2018

Issue description

A recent coral-paladin run failed during bvt-inline HWTest stage. 

http://shortn/_sytuLI3GcE

Provisioning succeeded, however there was a test that was aborted. According to jrbarnette@, the aborted test was never scheduled.

Assigning to current infra deputy to investigate.
 
One key early finding:  The suite job never completes.  At the time
of this writing, the process is still live on the drone:
chromeos-test@cros-full-0027:~$ ps -ef | grep 208133752 | cut -c 1-120
chromeo+  81196 260142  0 12:15 pts/17   00:00:00 grep --color 208133752
chromeo+  87208      1  0 03:29 ?        00:00:00 /usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a
chromeo+  87384  87208  0 03:29 ?        00:00:00 /opt/infra-tools/usr/bin/lucifer_run_job -autotestdir /usr/local/autot
chromeo+  87492  87384  0 03:29 ?        00:00:05 /usr/bin/python -u /usr/local/autotest/server/autoserv -s -P 208133752
chromeo+  87649  87492  0 03:29 ?        00:00:00 /usr/bin/python -u /usr/local/autotest/server/autoserv -s -P 208133752
chromeo+  87651  87492  0 03:29 ?        00:00:00 /usr/bin/python -u /usr/local/autotest/server/autoserv -s -P 208133752

The job started at 3:22AM, with a 90 minute timeout.  Apparently, the
process was never sent SIGTERM at around 4:52AM when it should have
been.  Possibly, the job was never actually aborted.

Looks like job_aborter jammed

job_aborter: 2018-06-13 08:11:07,187:INFO:transport:new_request:179:Refreshing due to a 401 (attempt 1/2)
job_aborter: 2018-06-13 08:11:07,213:DEBUG:crypt:make_signed_jwt:100:['eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjA1ZGI0ZGU2ZDM5MjBmYzkxNjIwNDA1MGQ4YzRkNDc
3MWI0NTQ3ZDQifQ', 'eyJpc3MiOiJjaHJvbWVvcy1pbmZyYUBwcm9keC1tb24tY2hyb21lLWluZnJhLmdvb2dsZS5jb20uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmd
vb2dsZWFwaXMuY29tL2F1dGgvcHJvZHhtb24iLCJhdWQiOiJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCJleHAiOjE1Mjg5MDYyNjcsImlhdCI6MTUyODkwMjY2N30', 'nI
YvoOYf_Q2Dtj-M7UFys9vqRkqK2FUjnk6PJlSpb7pZr2MVRs0ZUqGegnTh4MyEORbHfUQck5DGTo9_4YD0ZVxwgS2Y5foe0punE5nX1VKqVmW04nkQWC3vlmWSasR3-yNLkwiaoeRAS6eluSM3yo340zSS5jbT
d8oeJYSK--QoRJYXkQwJxlZEeDBAJqaZxRpHOq1HWeITECExpB4ZUBleOG26lKf5DppIqmsFGP-2Utgb7hJxwX9h0mvy_x2pGHqzkdeRBATv1r3A_ddmq85b8I16dgkt1rJ2KAsa3Tnx_y1DM7wSxZH5_pdBT4
TQHiIpCJFgnjH1f13YVEcEGg']
job_aborter: 2018-06-13 08:11:07,214:INFO:client:_do_refresh_request:795:Refreshing access_token
job_aborter: 2018-06-13 09:11:07,205:INFO:transport:new_request:179:Refreshing due to a 401 (attempt 1/2)
job_aborter: 2018-06-13 09:11:07,221:DEBUG:crypt:make_signed_jwt:100:['eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjA1ZGI0ZGU2ZDM5MjBmYzkxNjIwNDA1MGQ4YzRkNDc
3MWI0NTQ3ZDQifQ', 'eyJpc3MiOiJjaHJvbWVvcy1pbmZyYUBwcm9keC1tb24tY2hyb21lLWluZnJhLmdvb2dsZS5jb20uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmd
vb2dsZWFwaXMuY29tL2F1dGgvcHJvZHhtb24iLCJhdWQiOiJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCJleHAiOjE1Mjg5MDk4NjcsImlhdCI6MTUyODkwNjI2N30', 'eI
j7MTYBTLwEedgHmpGS6cCnwl88CAi8pwvsCjhPlo40XU0hvkfL36H4TZjQmJ1jQ3_Q3SYW91I0kOTjJl8EtDEqJ6X_tnt6nSE3mdZK6ucVxqeSuMKYw6-bXdb8RQ9o85iMYEytMm-MjZ5rmidhTsW5Pe2Ds2Ej
zX34BRa01yaJ5_uma654ezqxM05aKzBSvoBrPASSp9nY2k8E-v_pxlRaJZKsb3rl3GAFdVIWHf0fw_XuB7v_DH00sHwQ7bCGaahFlSwhcR-vm8NLftubVVLFfRiSwShozMmcK6NehahnygDhmZZpLqawS7EMxy
KKpSbTqfoD_iMFrfROKXxp_Q']
job_aborter: 2018-06-13 09:11:07,221:INFO:client:_do_refresh_request:795:Refreshing access_token
job_aborter: 2018-06-13 10:11:07,219:INFO:transport:new_request:179:Refreshing due to a 401 (attempt 1/2)
job_aborter: 2018-06-13 10:11:07,243:DEBUG:crypt:make_signed_jwt:100:['eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjA1ZGI0ZGU2ZDM5MjBmYzkxNjIwNDA1MGQ4YzRkNDc
3MWI0NTQ3ZDQifQ', 'eyJpc3MiOiJjaHJvbWVvcy1pbmZyYUBwcm9keC1tb24tY2hyb21lLWluZnJhLmdvb2dsZS5jb20uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmd
vb2dsZWFwaXMuY29tL2F1dGgvcHJvZHhtb24iLCJhdWQiOiJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCJleHAiOjE1Mjg5MTM0NjcsImlhdCI6MTUyODkwOTg2N30', 'G8
gsMkk75jmc_J7NQRiru8p1r4sOuyepxo4-VIKdW71vBPEmgE5TNrAT-eyWwItNJGh8vdoEQR5ZXJZQdYQsBMX6EsVMrjNJa99GfYczwE-sqm7C5oGmLq7KhNuswLBQt7cDBiFM7lXb8Dsl82y4oOy0cEVCqz55
079CcQPh3iW7-Geqfnmfzo3csA2s9CNOCgqAPjCczmcEVg8p2ZQkeCJx2xKuGrgl0Y1ztWdMmTG7crQrDsXjryicrEF00W1LXL__Lgtb093ceEZEtyF-aY7_BHudg1pvbdNE5_tCfWo8hB34xp3WsypaOWPdTa
Dga62iXuSlycU2I4AzT9K_HA']
job_aborter: 2018-06-13 10:11:07,243:INFO:client:_do_refresh_request:795:Refreshing access_token
job_aborter: 2018-06-13 11:11:07,243:INFO:transport:new_request:179:Refreshing due to a 401 (attempt 1/2)
job_aborter: 2018-06-13 11:11:07,268:DEBUG:crypt:make_signed_jwt:100:['eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjA1ZGI0ZGU2ZDM5MjBmYzkxNjIwNDA1MGQ4YzRkNDc
3MWI0NTQ3ZDQifQ', 'eyJpc3MiOiJjaHJvbWVvcy1pbmZyYUBwcm9keC1tb24tY2hyb21lLWluZnJhLmdvb2dsZS5jb20uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmd
vb2dsZWFwaXMuY29tL2F1dGgvcHJvZHhtb24iLCJhdWQiOiJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCJleHAiOjE1Mjg5MTcwNjcsImlhdCI6MTUyODkxMzQ2N30', 'TM
A41IYLh27exQTqiHRIKJJ3UeJtKywtUV7vblzRlDU-8dnr53AqngtYm8m-Np18s7WXJ8fCF2O7T_wDMz6r8BGwT5yl6ER-0YC0E5s4Jl-7IAr-jOop0eJednwRo2ysWCyi-u-LLunOKiQ6lsZZ35LWzM6i-3NJ
9_hPS0nPh6XXM2CL4WpD3IxOFJ1Wjt4wsYa7YU_kF_SMX-UwvOUmtByxRymmRf2OUwCWUMtkDG__Sq3y8U1DVIQNviccdEaSgGY5a3HkDbdoXb9SfQ7y3inpCZdVbb-ZmKfyKqelAErzDIz6RbyZmgrKNNljay
ianVt1TK9lR2G7ueSWMp0Tog']
job_aborter: 2018-06-13 11:11:07,268:INFO:client:_do_refresh_request:795:Refreshing access_token
job_aborter: 2018-06-13 12:11:07,263:INFO:transport:new_request:179:Refreshing due to a 401 (attempt 1/2)
job_aborter: 2018-06-13 12:11:07,290:DEBUG:crypt:make_signed_jwt:100:['eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjA1ZGI0ZGU2ZDM5MjBmYzkxNjIwNDA1MGQ4YzRkNDc
3MWI0NTQ3ZDQifQ', 'eyJpc3MiOiJjaHJvbWVvcy1pbmZyYUBwcm9keC1tb24tY2hyb21lLWluZnJhLmdvb2dsZS5jb20uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmd
vb2dsZWFwaXMuY29tL2F1dGgvcHJvZHhtb24iLCJhdWQiOiJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCJleHAiOjE1Mjg5MjA2NjcsImlhdCI6MTUyODkxNzA2N30', 'oL
0qxu6y7WUWCysmhd1jMwrKUVV2PrTAaiTs5S3pDwsbHVVw5pq9AYwopAQNDrP8KcoQhjhGB7oOlwEsVAHxUGD_g2du2XyakgAXbWHdgUObk52iWdGD5s9k3v_iYKs5hrDugD6QWGR_gcfj2cw5dZcdfrShQLlv
dT6vJAK-jD19OIEfBmL0yBNiUjwGsT_WSxxHooj80cj8eBbBC7YzvVthbqO6NpeZE4C5MC1AlpmnJx7OjBngDXAIPZlBAFt4LYN4R9624ffq1BmKYc8B9Pd8rSAXUfojFJCNYAQHgeKoFjs05C1AChItJwHcor
KNopBYckHaC7DEZqcBxkWaqQ']
job_aborter: 2018-06-13 12:11:07,291:INFO:client:_do_refresh_request:795:Refreshing access_token
ESC[31m[E2018-06-13T12:18:44.655010-07:00 256008 0 http.go:89]ESC[0m Monitoring push failed.
Response body: {
  "error": {
    "code": 503,
    "message": "The service is currently unavailable.",
    "status": "UNAVAILABLE",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.DebugInfo",
        "detail": "[ORIGINAL ERROR] RPC::DEADLINE_EXCEEDED: "
      }
    ]
  }
}

The problematic login_LoginSuccess job ran and passed on the shard.
Logs are available:
    http://chromeos-skunk-4.mtv.corp.google.com/results/208134213-chromeos-test/chromeos2-row4-rack4-host2/
Last tick was at 3

job_aborter: 2018-06-13 03:08:20,221:DEBUG:job_aborter:_main_loop:66:Tick

Logs from the shard relating to the problem test job:
chromeos-test@chromeos-skunk-4:/usr/local/autotest/logs$ grep 208134213 $(ls scheduler.log* | tail -2) | grep -v 'Aborting.*job timeout'
scheduler.log.2018-06-12-17.00.35:06/13 03:58:52.834 INFO |        luciferlib:0385| Spawning '/usr/bin/env', ['/usr/bin/env', 'GOOGLE_APPLICATION_CREDENTIALS=/creds/service_accounts/lucifer-drone.json', '/usr/local/autotest/bin/job_reporter', '--jobdir', '/usr/local/autotest/leases', '--run-job-path', '/opt/infra-tools/usr/bin/lucifer_run_job', '--lucifer-level', 'STARTING', '--job-id', '208134213', '--results-dir', u'/usr/local/autotest/results/208134213-chromeos-test/chromeos2-row4-rack4-host2', '--execution-tag', u'208134213-chromeos-test/chromeos2-row4-rack4-host2'], u'/usr/local/autotest/results/208134213-chromeos-test/chromeos2-row4-rack4-host2/lucifer/job_reporter_output.log'
scheduler.log.2018-06-13-04.00.46:06/13 04:00:56.926 INFO |     drone_manager:0813| monitoring pidfile /usr/local/autotest/results/208134213-chromeos-test/chromeos2-row4-rack4-host2/.autoserv_execute
scheduler.log.2018-06-13-04.00.46:06/13 04:00:57.564 INFO |        agent_task:0503| Recovering process localhost/153398 for QueueTask at 208134213-chromeos-test/chromeos2-row4-rack4-host2
scheduler.log.2018-06-13-04.00.46:06/13 04:31:39.371 WARNI|monitor_db_cleanup:0114| Aborting entry chromeos2-row4-rack4-host2/208134213 (208708210) due to max runtime
scheduler.log.2018-06-13-04.00.46:06/13 05:37:16.913 WARNI|     drone_manager:0369| dropping leaked pidfile /usr/local/autotest/results/208134213-chromeos-test/chromeos2-row4-rack4-host2/.autoserv_execute
scheduler.log.2018-06-13-04.00.46:06/13 05:37:16.913 INFO |     drone_manager:0825| forgetting pidfile /usr/local/autotest/results/208134213-chromeos-test/chromeos2-row4-rack4-host2/.autoserv_execute
chromeos-test@chromeos-skunk-4:/usr/local/autotest/logs$ grep 208134213 $(ls scheduler.log* | tail -2) | grep 'Aborting.*job timeout' | wc -l
86

Comment 7 by cra...@chromium.org, Jun 13 2018

Labels: Chase

Comment 8 by cra...@chromium.org, Jun 13 2018

Labels: -Chase Chase-Pending
job_aborter jammed because I forgot to set a timeout on socket connect, the socket for job 208102899 was blocking for some reason.
OK.  A brief timeline on the shard:

03:59:02 - autoserv for login_LoginSuccess begins
03:59:56 - the autoserv process logs that the test has passed
04:31:39 - scheduler reports aborting the login_LoginSuccess
           (this is the timeout for the test, not the suite).

Somehow, after the job terminated, test results didn't get recorded.

> job_aborter jammed because I forgot to set a timeout on socket connect, the socket for job 208102899 was blocking for some reason.

It feels like maybe we have at least two bugs here:
 1) job_aborter jammed, preventing expected cleanup
 2) The test on the shard never reported its results in the first place

Re #11 job was still running on shard, so not getting synced to master is WAI

Another issue: 208102899 is jammed on tko/parse.  Parsing does not respect abort (WAI as of ~2 years ago).  So the job stays running.
Cc: cra...@chromium.org
Owner: ayatane@chromium.org
Status: Started (was: Unconfirmed)
Summary: coral-paladin: suite and test job alive after timeout abort (was: coral-paladin: an aborted test was never scheduled?)
Filed  Issue 852527  for tko/parse
Project Member

Comment 14 by bugdroid1@chromium.org, Jun 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d4ebf04c6e41d70e220f12d276707d3bf9c788b2

commit d4ebf04c6e41d70e220f12d276707d3bf9c788b2
Author: Allen Li <ayatane@chromium.org>
Date: Wed Jun 13 23:41:24 2018

autotest: Don't block on abort socket

Rarely, the socket can get jammed.  Unix sockets block until both
sides connect to the socket.  The Lucifer side can get disconnected
for various reasons, so it's dangerous to block on this (see bug).

BUG= chromium:852476 
TEST=monkeypatch on drone

Change-Id: I2091cbca27e6fbb0104e78db422c02b569972999
Reviewed-on: https://chromium-review.googlesource.com/1099723
Commit-Ready: Craig Bergstrom <craigb@chromium.org>
Tested-by: Craig Bergstrom <craigb@chromium.org>
Reviewed-by: Craig Bergstrom <craigb@chromium.org>

[modify] https://crrev.com/d4ebf04c6e41d70e220f12d276707d3bf9c788b2/venv/lucifer/leasing.py

Labels: -Pri-3 Pri-1
Fixed pending prod push.
Labels: -Chase-Pending Chase
Project Member

Comment 17 by bugdroid1@chromium.org, Jun 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ca32525c79bc9c45e7c6b97fa82985eab50a474f

commit ca32525c79bc9c45e7c6b97fa82985eab50a474f
Author: Allen Li <ayatane@chromium.org>
Date: Tue Jun 26 23:02:32 2018

autotest: Add test for block on abort socket

BUG= chromium:852476 
TEST=Run test with fix commented out

Change-Id: I45609fd7ca61a7ec100ee4936f08334c649ae84e
Reviewed-on: https://chromium-review.googlesource.com/1100169
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/ca32525c79bc9c45e7c6b97fa82985eab50a474f/venv/lucifer/leasing_unittest.py
[add] https://crrev.com/ca32525c79bc9c45e7c6b97fa82985eab50a474f/venv/lucifer/cmd/test/abort_socket_norecv.py

Status: Fixed (was: Started)

Sign in to add a comment