
Issue 632895

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 638784




Get a watchdog daemon onto infra devices

Project Member Reported by bpastene@chromium.org, Jul 29 2016

Issue description

This daemon will serve to restart and/or heal the device if it gets into a bad state.
 
Project Member

Comment 6 by bugdroid1@chromium.org, Aug 17 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/b69085fbe07ce6911897423190636d33b26bd139

commit b69085fbe07ce6911897423190636d33b26bd139
Author: Benjamin Pastene <bpastene@google.com>
Date: Wed Aug 17 23:03:06 2016

Project Member

Comment 7 by bugdroid1@chromium.org, Aug 17 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/106172640aba87f983b07ea9e346a07f6860a614

commit 106172640aba87f983b07ea9e346a07f6860a614
Author: Benjamin Pastene <bpastene@google.com>
Date: Wed Aug 17 23:14:24 2016

Blockedon: 638784
Project Member

Comment 9 by bugdroid1@chromium.org, Aug 25 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/8bc9b6e0aad15b725689b5168f3652bb2f8526ec

commit 8bc9b6e0aad15b725689b5168f3652bb2f8526ec
Author: Benjamin Pastene <bpastene@google.com>
Date: Thu Aug 25 22:13:48 2016

Project Member

Comment 10 by bugdroid1@chromium.org, Aug 29 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/3dccac1684da104d90ddf7254af6cf92a59a7682

commit 3dccac1684da104d90ddf7254af6cf92a59a7682
Author: bpastene <bpastene@google.com>
Date: Mon Aug 29 20:55:54 2016

Project Member

Comment 11 by bugdroid1@chromium.org, Aug 29 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/035e5e7b4a67209239f96446cea0a4d2a22e220b

commit 035e5e7b4a67209239f96446cea0a4d2a22e220b
Author: bpastene <bpastene@google.com>
Date: Mon Aug 29 21:11:13 2016

Update: watchdog deployed everywhere. It's been correctly rebooting phones when needed, but some phones, especially those that really need it, seem to slip through the cracks:
http://shortn/_AVVVJrv3ty

Specifically 00e7a97549912611 should be getting rebooted but it's not. From the device:

root@bullhead:/ # ps | grep watchdog                                           
root      6216  1     827180 3076           0 00aadb6794 R /data/local/tmp/cit_watchdog
root@bullhead:/ # 
root@bullhead:/ # 
root@bullhead:/ # cat /proc/6216/stack                                         
[<0000000000000000>] __switch_to+0x7c/0x88
[<0000000000000000>] cpu_worker_pools+0x77c/0x780
[<0000000000000000>] 0xffffffffffffffff
root@bullhead:/ # 
root@bullhead:/ # cat /proc/6216/status                                        
Name:   cit_watchdog
State:  R (running)
Tgid:   6216
Pid:    6216
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups: 1004 1007 1011 1015 1028 3001 3002 3003 3006 
VmPeak:   827180 kB
VmSize:   827180 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      3076 kB
VmRSS:      3076 kB
VmData:   794156 kB
VmStk:       136 kB
VmExe:      1168 kB
VmLib:     29232 kB
VmPTE:        80 kB
VmSwap:        0 kB
Threads:        1
SigQ:   6/5842
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: fffffffe7fc1feff
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
Seccomp:        0
Cpus_allowed:   3f
Cpus_allowed_list:      0-5
Mems_allowed:   1
Mems_allowed_list:      0
voluntary_ctxt_switches:        0
nonvoluntary_ctxt_switches:     458720
root@bullhead:/ # cat /proc/6216/schedstat                                     
22744458145465 12210252825 475018

I'm not sure what the fields in schedstat correspond to, but the values listed there can't be good...

For reference, this is what's listed for the watchdog on a healthier device:
root@bullhead:/ # cat /proc/6979/schedstat                                     
4469792 0 1

With stip@'s (and strace's) help, I managed to track down why the process was hanging. Turns out it has trouble reading from /proc/uptime at times. Will need to add some timeouts to the file I/O. It'd also be a good idea to trigger a reboot if it fails too many times in a row.
Project Member

Comment 15 by bugdroid1@chromium.org, Sep 9 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/d026d656c86ae12a6e180630fd3f6185c788ca28

commit d026d656c86ae12a6e180630fd3f6185c788ca28
Author: bpastene <bpastene@chromium.org>
Date: Fri Sep 09 22:12:47 2016

Change daemonize logic in watchdog and add timeout to file system read.

Daemonize via fork made goroutines behave strangely, so this uses
exec instead.

Reading from /proc/uptime can hang indefinitely on some phones. This
adds a timeout.

BUG= 632895 

Review-Url: https://codereview.chromium.org/2302193002

[modify] https://crrev.com/d026d656c86ae12a6e180630fd3f6185c788ca28/go/deps.lock
[modify] https://crrev.com/d026d656c86ae12a6e180630fd3f6185c788ca28/go/deps.yaml
[modify] https://crrev.com/d026d656c86ae12a6e180630fd3f6185c788ca28/go/src/infra/tools/device_watchdog/main.go
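
For readers unfamiliar with the fork-versus-exec point in the commit message: the Go runtime is multi-threaded, and a raw fork only carries the calling thread into the child, which is a plausible reason goroutines misbehaved. A common alternative is to re-execute the same binary as a detached child. The sketch below illustrates that general pattern on a Unix-like system; it is not taken from main.go, and the WATCHDOG_DAEMONIZED variable and daemonCommand helper are invented for the example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// daemonEnvVar marks the re-executed child (hypothetical name).
const daemonEnvVar = "WATCHDOG_DAEMONIZED"

// daemonCommand builds a command that re-runs exe in a new session.
// Stdin, Stdout, and Stderr are left nil, so os/exec connects them to
// the null device.
func daemonCommand(exe string, args []string) *exec.Cmd {
	cmd := exec.Command(exe, args...)
	cmd.Env = append(os.Environ(), daemonEnvVar+"=1")
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true} // detach from the terminal
	return cmd
}

func main() {
	if os.Getenv(daemonEnvVar) == "1" {
		// Child: a fresh process with a fresh Go runtime.
		// The watchdog loop would run here.
		return
	}
	exe, err := os.Executable()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot locate executable:", err)
		os.Exit(1)
	}
	cmd := daemonCommand(exe, os.Args[1:])
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, "cannot start daemon:", err)
		os.Exit(1)
	}
	// The parent exits without waiting; the child keeps running.
	fmt.Printf("started daemon with pid %d\n", cmd.Process.Pid)
}
```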

Project Member

Comment 16 by bugdroid1@chromium.org, Sep 12 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/8f6ed4d91384c236328acc6a3f0c4ea8b73edc19

commit 8f6ed4d91384c236328acc6a3f0c4ea8b73edc19
Author: Benjamin Pastene <bpastene@google.com>
Date: Mon Sep 12 18:40:05 2016

Status: Fixed (was: Assigned)
