
Issue 659720

Starred by 2 users

Issue metadata

Status: Archived
Owner: ----
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




scheduling job from shard can result in a race condition causing an exception: list index out of range

Project Member Reported by kevcheng@chromium.org, Oct 26 2016

Issue description

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/82555879-kevcheng/chromeos6-row1-rack2-host5/debug

10/25 10:35:42.042 ERROR|            repair:0313| Failed: servo host software is up-to-date
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 310, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/servo_repair.py", line 24, in verify
    host.update_image(wait_for_update=False)
  File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator
    return function(*args, **kwargs)
  File "/usr/local/autotest/server/hosts/servo_host.py", line 531, in update_image
    status, current_build_number = self._check_for_reboot(updater)
  File "/usr/local/autotest/server/hosts/servo_host.py", line 446, in _check_for_reboot
    self.schedule_synchronized_reboot(dut_list, afe)
  File "/usr/local/autotest/server/hosts/servo_host.py", line 415, in schedule_synchronized_reboot
    control_type=control_type, hosts=[dut])
  File "/usr/local/autotest/server/frontend.py", line 637, in create_job
    return self.get_jobs(id=id)[0]
IndexError: list index out of range

Will investigate why afe.create_job is failing when getting the job after it creates it.
 
Cc: dshi@chromium.org kevcheng@chromium.org
Owner: shuqianz@chromium.org
Summary: scheduling job from shard can result in a race condition causing an exception (was: when scheduling servo host reboot job, afe.create_job will fail)
Re-summarizing and assigning to Charlene. Dan figured it out and explained this to me.

We have a job running on a shard (server46) and it calls afe.create_job() which does the following:
1. call create_job rpc
2. call get_jobs rpc

The problem is that afe.create_job is running on a shard, which means that both RPCs hit the AFE on server46 first and are then forwarded to cautotest.  The race is that the job can exist on cautotest but not yet exist on server46. If get_jobs is called (which hits server46 first) in that window, it returns an empty list, and indexing that list raises the 'list index out of range' exception.

It seems like the fix is to have get_jobs always get forwarded to the master but it's probably more complicated than that.
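
The race described above can be reproduced with a small in-memory model. This is only an illustrative sketch, not autotest code: the Master and Shard classes, the sync() method, and create_job_via() are hypothetical stand-ins. It assumes the shard forwards create_job to the master but answers get_jobs from its own local copy, which is only refreshed later.

```python
class Master:
    """Hypothetical stand-in for the master AFE (cautotest in this report)."""

    def __init__(self):
        self.jobs = {}
        self._next_id = 1

    def create_job(self, name):
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = {'id': job_id, 'name': name}
        return job_id

    def get_jobs(self, id):
        return [self.jobs[id]] if id in self.jobs else []


class Shard:
    """Hypothetical stand-in for a shard AFE (server46 in this report).

    Assumption: create_job is forwarded to the master, while get_jobs is
    answered from a local copy that lags behind until sync() runs.
    """

    def __init__(self, master):
        self.master = master
        self.jobs = {}

    def create_job(self, name):
        return self.master.create_job(name)

    def get_jobs(self, id):
        return [self.jobs[id]] if id in self.jobs else []

    def sync(self):
        # Models the delayed propagation of jobs from master to shard.
        self.jobs.update(self.master.jobs)


def create_job_via(afe, name):
    """Mirrors the two-step pattern from the traceback: create, then look up."""
    job_id = afe.create_job(name)
    return afe.get_jobs(id=job_id)[0]
```

Calling create_job_via(shard, ...) before shard.sync() raises IndexError, because the job exists on the master but not yet on the shard. Calling create_job_via(master, ...) succeeds, which corresponds to the idea of sending the lookup to the master instead.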

Comment 2 by dshi@chromium.org, Oct 26 2016

Cc: xixuan@chromium.org
Labels: -Pri-3 Pri-2
+xixuan who did some shard RPC cleanup.

Xixuan, maybe you are the better owner of this bug?
Cc: -xixuan@chromium.org shuqianz@chromium.org
Owner: xixuan@chromium.org
Re-assign to xixuan@, who is working on the RPC forward project.

Comment 4 by bugdroid1@chromium.org, Oct 28 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/79589980308c023b6dbfcece379d72efd505fbfc

commit 79589980308c023b6dbfcece379d72efd505fbfc
Author: Kevin Cheng <kevcheng@chromium.org>
Date: Tue Oct 25 20:26:04 2016

[autotest] Update servo host reboot to talk to cautotest directly.

I've seen the create_job afe call fail which causes the test job to fail.
We don't want that to happen so let's catch that exception and log it
and just have another dut schedule the reboot for us.

The afe create_job call fails because it looks like the 'create_job' rpc
returns with an ID that is not yet available when get_jobs is called and
so it returns an empty list and we try to index that and raise an
IndexError.  This looks to be caused by calling the shard instead of
cautotest directly so also change the afe to call cautotest as well.

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/82555879-kevcheng/chromeos6-row1-rack2-host5/debug

BUG= chromium:599533 
BUG= chromium:659720 
TEST=None.

Change-Id: I77c7b2b96ffa3b7a9ea2ee7a1c0960c4f9065ba4
Reviewed-on: https://chromium-review.googlesource.com/403290
Commit-Ready: Kevin Cheng <kevcheng@chromium.org>
Tested-by: Kevin Cheng <kevcheng@chromium.org>
Reviewed-by: Kevin Cheng <kevcheng@chromium.org>

[modify] https://crrev.com/79589980308c023b6dbfcece379d72efd505fbfc/server/hosts/servo_host.py
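
The catch-and-log part of the commit message above could look roughly like the following. This is a minimal sketch, not the actual change to servo_host.py: schedule_reboot_safely and its create_job argument are hypothetical names, and catching a broad Exception is an assumption, since the commit message does not say which exception type is caught.

```python
import logging


def schedule_reboot_safely(create_job, dut):
    """Try to schedule a reboot job for `dut`.

    `create_job` is a hypothetical callable standing in for the AFE call.
    On failure the error is logged and None is returned, so the calling
    test job does not fail and a later DUT can schedule the reboot instead.
    """
    try:
        return create_job(dut)
    except Exception as e:  # assumption: the real code may catch a narrower type
        logging.error('Failed to schedule reboot job for %s: %s', dut, e)
        return None
```

The trade-off is that a failed scheduling attempt is silently skipped from the caller's point of view, which is why the underlying race is still considered a bug.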

Cc: xixuan@chromium.org
Owner: ----
Is this issue fixed by Kevin's CL?

It looks like this CL changes the afe to call cautotest for the 'create_job' rpc.
My CL just ignores this issue. It is still a bug in the sense that calling create_job from a shard could fail in the way described in the commit message in #4.
Summary: scheduling job from shard can result in a race condition causing an exception: list index out of range (was: scheduling job from shard can result in a race condition causing an exception)
Understood. Let's wait for more examples before fixing it :)

Comment 8 by autumn@chromium.org, Dec 13 2016

Status: Unconfirmed (was: Untriaged)

Comment 9 by ajha@chromium.org, Jan 19 2017

Labels: OS-Chrome

Comment 10 by sheriffbot@chromium.org, Feb 13 2018

Status: Archived (was: Unconfirmed)
Issue has not been modified or commented on in the last 365 days. Please re-open or file a new bug if this is still an issue.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
