DUTs in Running state and stuck at old test jobs
Issue description
Nian_Big at http://cautotest/afe/#tab_id=view_host&object_id=3559 is stuck at this job - http://chromeos-server26.mtv.corp.google.com/afe/#tab_id=view_job&object_id=59568461
Squawks at http://cautotest/afe/#tab_id=view_host&object_id=3362 is stuck at this job - http://cautotest/afe/#tab_id=view_job&object_id=60332675
Veyron_Jerry at http://cautotest/afe/#tab_id=view_host&object_id=3808 is stuck at this job - http://chromeos-server41.cbf.corp.google.com/afe/#tab_id=view_job&object_id=60617009
,
Apr 20 2016
,
Apr 20 2016
veyron_jerry seems to have finished and is in Ready state. We still have DUTs stuck:
chromeos1-row5-rack1-host1 - job 59568461 since April 10th
chromeos1-row5-rack1-host2 - job 60332675 since April 17th
,
Apr 20 2016
Should be fixed now.
,
Apr 21 2016
Looks like the DUTs have been in the Cleanup state for 50 minutes so far. Is that normal?
,
Apr 21 2016
PING! DUTs are still stuck in the Cleaning state - https://screenshot.googleplex.com/i5qvwCaLNwi
,
Apr 21 2016
I've aborted these two cleanup jobs and filed a bug for chameleon: https://bugs.chromium.org/p/chromium/issues/detail?id=605611
,
Apr 21 2016
Now the boards have been stuck in the Repair state for an hour or so:
- DUTs are pingable and I can ssh to them
- chameleons are pingable and I can ssh to them
- one servo is pingable and I can ssh to it (chromeos1-row5-rack1-host2-servo)
- one servo is down - chromeos1-row5-rack1-host1-servo
I locked the boards.
,
Apr 21 2016
I aborted the repair on chromeos1-row5-rack1-host1 and started a Verify job, but it looks stuck. Also, I am unable to abort the repair job on chromeos1-row5-rack1-host1 - no shard(<null>)
,
Apr 21 2016
Correction: ...and I am unable to abort the repair job on chromeos1-row5-rack1-host2 - no shard(<null>). It seems the repair on this DUT has now aborted (Repair failed), and I started Verify on it too.
,
Apr 21 2016
Verify jobs are stuck too
,
Apr 21 2016
Rebooted the disconnected servo and it is OK now. I aborted both stuck Verify jobs, and Repair jobs have now started for both boards:
chromeos1-row5-rack1-host1
chromeos1-row5-rack1-host2
,
Apr 21 2016
Special jobs on both boards are failing like this:
04/21 11:13:33.145 DEBUG| base_utils:0178| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_0Lipwcssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/tmpl6oU5K -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=300 -l root -p 22 chromeos1-row5-rack1-host1'
04/21 11:13:34.073 DEBUG| abstract_ssh:0696| Nuking master_ssh_job.
04/21 11:13:35.079 DEBUG| abstract_ssh:0702| Cleaning master_ssh_tempdir.
04/21 11:13:35.336 INFO | servo_host:0734| Pinging servo at chromeos1-row5-rack1-host1-servo
04/21 11:13:35.337 DEBUG| base_utils:0178| Running 'ping -c 3 chromeos1-row5-rack1-host1-servo'
04/21 11:13:47.500 DEBUG| abstract_ssh:0835| Full tunnel command: /usr/bin/ssh -a -x -n -N -q -L 41061:localhost:9992 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=300 -l root -p 22 chromeos1-row5-rack1-host1-chameleon
04/21 11:13:47.545 DEBUG| abstract_ssh:0843| Started ssh tunnel, local = 41061 remote = 9992, pid = 8084
04/21 11:13:47.546 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:47.647 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:47.749 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:47.850 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:47.951 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:48.052 DEBUG| chameleon_host:0118| Connection is not ready yet ...
04/21 11:13:48.153 DEBUG| chameleon_host:0118| Connection is not ready yet ...
On my side I am able to ssh to the chameleon hosts:
kalin@kalin:~$ ssh root@chromeos1-row5-rack1-host1-chameleon
root@socfpga:~# ls
disable_wp edid enable_audio enable_h2f io memdump2file plug reset_receiver test_server.py unplug
root@socfpga:~# exit
logout
Connection to chromeos1-row5-rack1-host1-chameleon closed.
kalin@kalin:~$ ssh root@chromeos1-row5-rack1-host2-chameleon
root@socfpga:~# ls
disable_wp edid enable_audio enable_h2f io memdump2file plug rec.raw reset_receiver test_server.py unplug
root@socfpga:~# exit
logout
Connection to chromeos1-row5-rack1-host2-chameleon closed.
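For anyone who wants to poke at the forwarded port by hand, here is a minimal sketch of a bounded readiness poll. It is not the autotest chameleon_host code; the function name wait_for_port and its parameters are hypothetical, and it only uses the Python standard library. It assumes the tunnel's local port (41061 in the log above) is what you want to probe.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=10.0, interval_s=0.1):
    """Poll until a TCP connect to (host, port) succeeds.

    Hypothetical helper: returns True on the first successful connect,
    or False once timeout_s has elapsed without one.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only shows that something is listening
            # on this port; it says nothing about the service behind it.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass  # refused or timed out; retry until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Usage would be something like wait_for_port("localhost", 41061, timeout_s=30). Note that with an ssh -L tunnel the local port accepts connections as soon as ssh is listening, so a True result here does not prove that the chameleon's server on the remote port is answering; it only rules out the local end of the tunnel.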
,
Apr 21 2016
Could it be related to the ssh tunnel issue?
/usr/bin/ssh -a -x -n -N -q -L 41061:localhost:9992 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=300 -l root -p 22 chromeos1-row5-rack1-host1-chameleon
+xixuan Did we do anything special with the ssh tunnel to the chameleon host?
,
Apr 21 2016
Nothing special; basically the ssh tunnel logic is the same as servo's. They just go through different code paths.
The chameleon host already had the ssh tunnel code before I added anything to it, and it was invoked when the host is not 'is_in_lab', so I just added a flag and kept the tunnel code unchanged.
if self._is_in_lab and not ENABLE_SSH_TUNNEL_FOR_CHAMELEON:
    self._chameleon_connection = chameleon.ChameleonConnection(
            self.hostname, chameleon_port)
else:
    self._create_connection_through_tunnel()
,
Apr 21 2016
The DUTs are now in a good state and in the Running state. I had to reboot the chameleon and the DUT; then verification passed.
,
Apr 21 2016
I made a very odd observation when shutting down the chameleon:
1) The DUT shows a black screen.
2) I touch the touchpad, and the login screen appears with the test profile on it, as expected.
3) I shut down the chameleon - at this moment the DUT presents a screen from the last test it ran before it got stuck - video_GlitchDetection_chameleon_vp8_720p. This is the job that was aborted yesterday.
This happened for both boards: when the chameleon was turned off, each showed an image from a previously run test (one from 10 days ago, the other from 3 days ago). Rebooting the DUT brought back the login screen, and everything proceeded as expected from that moment on. I asked Rohit not to run this test (video_GlitchDetection) on these two boards, and he removed the label from the DUT hosts. Feel free to close if there is nothing more to find here.
,
Apr 22 2016
,
Apr 24 2016
If this is the first time this has happened, I would like to see whether the issue is reproducible and can be investigated. Kalin, do you think we can turn on the video test on any one audiobox device?
,
Apr 27 2016
,
May 11 2016
I am reopening this issue because I have a few more instances where this has happened: on lumpy (DUT removed and replaced with ) and peach_pi (label pool:chameleon_video_capture_stable removed), as in issue 610379. The label 'pool:chameleon_video_capture_stable' will stay on some of the DUTs in cassandra, not in the audio box, and Rohit can monitor his video glitch detection test on them. I'll be monitoring these boards too, and if there are new instances of boards getting stuck, I will update this bug.
,
May 16 2016
If this happens again, please link to the specific examples so that we can troubleshoot! Reassigning to kalin@chromium.org to find additional examples.
,
Oct 24 2016
Comment 1 by ka...@chromium.org
, Apr 20 2016