
Issue 605150

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




DUTs in Running state and stuck at old test jobs

Project Member Reported by ka...@chromium.org, Apr 20 2016

Issue description

Comment 1 by ka...@chromium.org, Apr 20 2016

Cc: jrbarnette@chromium.org
PING!

Comment 2 by pho...@chromium.org, Apr 20 2016

Status: Started (was: Untriaged)

Comment 3 by ka...@chromium.org, Apr 20 2016

veyron_jerry at  seems to have finished and is in Ready state.

We still have DUTs stuck at:
chromeos1-row5-rack1-host1 - job 59568461 since April 10th
chromeos1-row5-rack1-host2 - job 60332675 since April 17th

Comment 4 by pho...@chromium.org, Apr 20 2016

Status: Fixed (was: Started)
Should be fixed now.

Comment 5 by ka...@chromium.org, Apr 21 2016

It looks like the DUTs have been in Cleanup state for 50 minutes so far. Is that normal?

Comment 6 by ka...@chromium.org, Apr 21 2016

PING!
DUTs are still stuck in Cleaning state - https://screenshot.googleplex.com/i5qvwCaLNwi

Comment 7 by dshi@chromium.org, Apr 21 2016

I've aborted these two cleanup jobs and filed a bug for chameleon: 
https://bugs.chromium.org/p/chromium/issues/detail?id=605611

Comment 8 by ka...@chromium.org, Apr 21 2016

The boards have now been stuck in Repair state for an hour or so.

- DUTs are pingable and I can ssh to them
- chameleons are pingable and I can ssh to them
- one servo is pingable and I can ssh to it (chromeos1-row5-rack1-host2-servo)
- one servo is down: chromeos1-row5-rack1-host1-servo

I locked the boards.
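The manual reachability checks above could be scripted. This is a minimal sketch, not part of autotest: the helper names `peripheral_hosts`, `ssh_port_open`, and `report` are hypothetical, and the `-servo` and `-chameleon` suffixes are taken only from the hostnames quoted in this bug. It checks whether TCP port 22 accepts a connection, which is weaker than an actual ssh login.

```python
import socket


def peripheral_hosts(dut):
    """Return the DUT plus its servo and chameleon hostnames.

    Assumption: the '-servo' and '-chameleon' suffixes follow the
    hostnames quoted in this bug; other labs may name hosts differently.
    """
    return [dut, dut + '-servo', dut + '-chameleon']


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def report(duts):
    """Map each DUT, servo, and chameleon hostname to its reachability."""
    return {host: ssh_port_open(host)
            for dut in duts
            for host in peripheral_hosts(dut)}
```

For example, `report(['chromeos1-row5-rack1-host1', 'chromeos1-row5-rack1-host2'])` would return a dict with six entries, one per DUT, servo, and chameleon.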

Comment 9 by ka...@chromium.org, Apr 21 2016

Status: Untriaged (was: Fixed)
I aborted the repair on chromeos1-row5-rack1-host1 and started a Verify job, but it looks stuck.

I am also unable to abort the repair job on chromeos1-row5-rack1-host1 - no shard (<null>)

Comment 10 by ka...@chromium.org, Apr 21 2016

Correction:
...and I am unable to abort the repair job on chromeos1-row5-rack1-host2 - no shard (<null>)

It seems the repair on this DUT was aborted (Repair failed), and I started Verify on it too.

Comment 11 by ka...@chromium.org, Apr 21 2016

Verify jobs are stuck too

Comment 12 by ka...@chromium.org, Apr 21 2016

I rebooted the disconnected servo and it is OK now.

I aborted both stuck Verify jobs, and Repair jobs have now started for both boards:
chromeos1-row5-rack1-host1
chromeos1-row5-rack1-host2

Comment 13 by ka...@chromium.org, Apr 21 2016

Special jobs on both boards are failing like this:

04/21 11:13:33.145 DEBUG|        base_utils:0178| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_0Lipwcssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/tmpl6oU5K -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=300 -l root -p 22 chromeos1-row5-rack1-host1'
04/21 11:13:34.073 DEBUG|      abstract_ssh:0696| Nuking master_ssh_job.
04/21 11:13:35.079 DEBUG|      abstract_ssh:0702| Cleaning master_ssh_tempdir.
04/21 11:13:35.336 INFO |        servo_host:0734| Pinging servo at chromeos1-row5-rack1-host1-servo
04/21 11:13:35.337 DEBUG|        base_utils:0178| Running 'ping -c 3 chromeos1-row5-rack1-host1-servo'
04/21 11:13:47.500 DEBUG|      abstract_ssh:0835| Full tunnel command: /usr/bin/ssh -a -x -n -N -q -L 41061:localhost:9992 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=300 -l root -p 22 chromeos1-row5-rack1-host1-chameleon
04/21 11:13:47.545 DEBUG|      abstract_ssh:0843| Started ssh tunnel, local = 41061 remote = 9992, pid = 8084
04/21 11:13:47.546 DEBUG|    chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:47.647 DEBUG|    chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:47.749 DEBUG|    chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:47.850 DEBUG|    chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:47.951 DEBUG|    chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:48.052 DEBUG|    chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:48.153 DEBUG|    chameleon_host:0118| Connection is not ready yet ...
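The log above shows the chameleon connection being polled roughly every 100 ms after the tunnel starts. The sketch below shows one way to bound such a poll with an overall deadline, so that a tunnel that never becomes ready fails with a clear error instead of stalling the job. It is an illustration only, not the autotest implementation: `wait_until_ready`, its parameters, and the 30-second default are all assumptions.

```python
import time


def wait_until_ready(is_ready, timeout=30.0, interval=0.1,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or the timeout expires.

    Returns the number of attempts made. Raises TimeoutError if the
    deadline passes first. `sleep` and `clock` are injectable so the
    loop can be exercised without real delays.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if is_ready():
            return attempts
        if clock() >= deadline:
            raise TimeoutError(
                'connection not ready after %d attempts' % attempts)
        sleep(interval)
```

A caller would pass a function that probes the forwarded local port (41061 in the log) as `is_ready`.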


On my side I am able to ssh to the chameleon hosts:

kalin@kalin:~$ ssh root@chromeos1-row5-rack1-host1-chameleon
root@socfpga:~# ls
disable_wp      edid            enable_audio    enable_h2f      io              memdump2file    plug            reset_receiver  test_server.py  unplug
root@socfpga:~# exit
logout
Connection to chromeos1-row5-rack1-host1-chameleon closed.
kalin@kalin:~$ ssh root@chromeos1-row5-rack1-host2-chameleon
root@socfpga:~# ls
disable_wp      edid            enable_audio    enable_h2f      io              memdump2file    plug            rec.raw         reset_receiver  test_server.py  unplug
root@socfpga:~# exit
logout
Connection to chromeos1-row5-rack1-host2-chameleon closed.

Comment 14 by dshi@chromium.org, Apr 21 2016

Cc: xixuan@chromium.org
Could it be related to the ssh tunnel issue?

/usr/bin/ssh -a -x -n -N -q -L 41061:localhost:9992 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=300 -l root -p 22 chromeos1-row5-rack1-host1-chameleon
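For reference, the command above forwards local port 41061 to port 9992 on the chameleon host. The sketch below assembles the same argument list so the tunnel can be reproduced by hand when debugging. The function name `tunnel_command` and its parameters are hypothetical; the options are copied from the logged command, not from the autotest source.

```python
def tunnel_command(host, local_port, remote_port, user='root', ssh_port=22):
    """Build the ssh argv for a local port forward to the given host.

    The option set mirrors the 'Full tunnel command' line logged in
    this bug; it is not taken from the autotest code itself.
    """
    return [
        '/usr/bin/ssh', '-a', '-x', '-n', '-N', '-q',
        # Forward local_port on this machine to remote_port on the host.
        '-L', '%d:localhost:%d' % (local_port, remote_port),
        '-o', 'StrictHostKeyChecking=no',
        '-o', 'UserKnownHostsFile=/dev/null',
        '-o', 'BatchMode=yes',
        '-o', 'ConnectTimeout=30',
        '-o', 'ServerAliveInterval=300',
        '-l', user, '-p', str(ssh_port), host,
    ]
```

The returned list can be passed to `subprocess.Popen` to start the tunnel in the background.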

+xixuan
Did we do something special about ssh tunnel to chameleon host?
Nothing special; basically the ssh tunnel logic is the same as servo's. They just go through different code paths.

For the chameleon host, the 'ssh tunnel' code already existed before I added any code to it, and it was invoked when the host is not 'is_in_lab', so I just added a tag and kept the tunnel code unchanged.

if self._is_in_lab and not ENABLE_SSH_TUNNEL_FOR_CHAMELEON:
    self._chameleon_connection = chameleon.ChameleonConnection(
        self.hostname, chameleon_port)
else:
    self._create_connection_through_tunnel()
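The branch quoted above can be read as a small truth table: the direct connection is used only when the host is in the lab and the tunnel tag is off, and every other combination goes through the tunnel. The sketch below restates that as a standalone function; `connection_mode` and `tunnel_enabled` are hypothetical names used for illustration, standing in for the branch and for `ENABLE_SSH_TUNNEL_FOR_CHAMELEON`.

```python
def connection_mode(is_in_lab, tunnel_enabled):
    """Return 'direct' or 'tunnel', mirroring the quoted if/else branch."""
    if is_in_lab and not tunnel_enabled:
        return 'direct'
    return 'tunnel'


# Print all four combinations of the two inputs.
for is_in_lab in (True, False):
    for tunnel_enabled in (True, False):
        print(is_in_lab, tunnel_enabled,
              connection_mode(is_in_lab, tunnel_enabled))
```

So with the tag enabled, lab hosts such as the two DUTs in this bug would take the tunnel path, which is consistent with the tunnel command appearing in their logs.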

Comment 16 by ka...@chromium.org, Apr 21 2016

The DUTs are now in a good state and in Running state.

I had to reboot the chameleon and the DUT. Then verification passed.

Comment 17 by ka...@chromium.org, Apr 21 2016

Cc: rohi...@chromium.org
I made a very odd observation when shutting down chameleon:


1) The DUT shows a black screen.

2) I touch the touchpad, and the login screen appears with the test profile on it, as expected.

3) I shut down chameleon. At this moment the DUT presents a screen from the last test it ran before it got stuck: video_GlitchDetection_chameleon_vp8_720p. This is the job that was aborted yesterday.

This happened on both boards: when chameleon was turned off, each showed an image from a previously run test (one from 10 days ago, the other from 3 days ago).

Rebooting the DUT brought back the login screen, and everything proceeded as expected from that moment on.


I asked Rohit to not run this test (video_GlitchDetection) for these two boards and he removed the label from the DUT hosts.

Feel free to close if there is nothing more to find here.

Comment 18 by ka...@chromium.org, Apr 22 2016

Status: Verified (was: Untriaged)
If this happened for the first time, I would like to see whether this issue is reproducible and whether it can be investigated.

Kalin, do you think we can turn on the video test on any one audiobox device?
Components: Infra>Client>ChromeOS
Labels: -Infra-ChromeOS

Comment 21 by ka...@chromium.org, May 11 2016

Cc: waihong@chromium.org
Owner: ----
Status: Untriaged (was: Verified)
I am reopening this issue because I have a few more instances where this has happened: on lumpy (DUT removed and replaced with ) and peach_pi (label pool:chameleon_video_capture_stable removed), like issue 610379.


The label 'pool:chameleon_video_capture_stable' will stay on some of the DUTs in cassandra that are not in the audio box, and Rohit can monitor his video glitch detection test on them.

I'll be monitoring these boards too, and if there are new instances of boards getting stuck, I will update this bug.
Owner: ka...@chromium.org
If this happens again, please link to the specific examples so that we can troubleshoot! Reassigning to kalin@chromium.org to find additional examples.

Comment 23 by ka...@chromium.org, Oct 24 2016

Status: WontFix (was: Untriaged)
