
Issue 841426


Issue metadata

Status: Fixed
Owner:
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




guado_moblab-paladin HWTest failures: lxc-attach fails mysteriously

Project Member Reported by pprabhu@chromium.org, May 9 2018

Issue description

I've seen a few instances of this since yesterday.
Latest: https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/9379

Failure:

	FAIL	moblab_RunSuite	moblab_RunSuite	timestamp=1525824676	localtime=May 08 17:11:16	Unhandled AutoservRunError: command execution error
  * Command: 
      /usr/bin/ssh -a -x   -o Protocol=2 -o StrictHostKeyChecking=no -o
      UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o
      ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4
      -l root -p 22 chromeos2-row1-rack8-host1 "export LIBC_FATAL_STDERR_=1; if
      type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
      \"server[stack::run_once|run_as_moblab|run] -> ssh_run(su - moblab -c
      '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=cyan
      --build=cyan-release/R66-10452.74.0 --suite_name=dummy_server --retry=True
      --max_retries=1')\";fi; su - moblab -c
      '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=cyan
      --build=cyan-release/R66-10452.74.0 --suite_name=dummy_server --retry=True
      --max_retries=1'"
  Exit status: 1
  Duration: 730.166491032
  
  stdout:
  [?25h[?0c
  stderr:
  Autotest instance created: localhost
  05-08-2018 [16:58:23] Submitted create_suite_job rpc
  05-08-2018 [16:58:28] Created suite job: http://localhost/afe/#tab_id=view_job&object_id=1
  @@@STEP_LINK@Link to suite@http://localhost/afe/#tab_id=view_job&object_id=1@@@
  05-08-2018 [17:10:32] Suite job is finished.
  05-08-2018 [17:10:32] Start collecting test results and dump them to json.
  Suite job                         [ PASSED ]
  dummy_PassServer.ssp_SERVER_JOB   [ FAILED ]
  dummy_PassServer.ssp_SERVER_JOB     FAIL: 
  dummy_PassServer                  [ PASSED ]
  dummy_PassServer_SERVER_JOB       [ FAILED ]
  dummy_PassServer_SERVER_JOB         FAIL: 
 
Basically, for the SSP test, the SSP container is set up successfully, but then lxc-attach dies when trying to run 'autoserv' inside the container.

Digging around in /var/log/messages on moblab:

2018-05-09T00:08:55.168280+00:00 INFO kernel: [  984.739670] autoserv[30983]: segfault at 73697879d59e ip 0000736973f6449f sp 00007ffef477d820 error 4 in ld-2.19.so[736973f58000+23000]


So the autoserv python process is actually segfaulting inside the container!
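
(For reference, a quick way to pull these segfault reports out of the moblab's /var/log/messages. This is just a sketch; the log path and message format are taken from the line quoted above and are not anything autotest ships.)

import re

# Kernel segfault reports look like the one quoted above:
#   autoserv[30983]: segfault at ... ip ... sp ... error 4 in ld-2.19.so[...]
SEGFAULT_RE = re.compile(
    r'(autoserv)\[(\d+)\]: segfault at \S+ ip \S+ sp \S+ error \d+ in (\S+)\[')

with open('/var/log/messages') as log:  # path as seen on the moblab above
    for line in log:
        match = SEGFAULT_RE.search(line)
        if match:
            proc, pid, module = match.groups()
            print('%s (pid %s) segfaulted in %s' % (proc, pid, module))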
haddowk@ is in the middle of a container upgrade for moblab, which may help here, but we don't know yet either way.

Comment 3 by nxia@chromium.org, May 10 2018

Cc: adurbin@chromium.org philipchen@chromium.org la...@chromium.org
guado_moblab-paladin has been flaky in the CQ.
Go ahead and mark as experimental for now.
How do I mark a paladin builder as experimental?
It doesn't look like something I can do via the GE UI.

Comment 6 by la...@chromium.org, May 10 2018

It is - oddly enough - a magic string in the tree status. I have already done it.
I can repro this on a local moblab: autoserv is segfaulting. The best trace so far is:

 --- modulename: scanner, funcname: _import_c_make_scanner
scanner.py(5):     try:
scanner.py(6):         from simplejson._speedups import make_scanner
Segmentation fault (core dumped)
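
(A minimal repro sketch to run inside the SSP container with the container's /usr/bin/python; the import below is the one the trace shows dying in _import_c_make_scanner. Because the failure is a SIGSEGV inside dlopen() rather than an ImportError, the try/except that simplejson wraps around this import cannot catch it, and the whole interpreter dies.)

# Run inside the container; mirrors what simplejson.scanner attempts on import.
try:
    from simplejson._speedups import make_scanner  # expected: process segfaults here
    print('C speedups loaded: %r' % make_scanner)
except ImportError:
    print('C speedups unavailable; simplejson would fall back to pure Python')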

Might have to get out gdb.
The gdb backtrace does not help me much. Anyone else have any other ideas? Perhaps we'll have to ask the toolchain guys.

Starting program: /usr/bin/python /usr/local/autotest/server/autoserv -s -P 2-moblab/192.168.231.100 -m 192.168.231.100 -l nami-release/R68-10658.0.0/gts/cheets_GTS.GtsTvBugReportTestCases -u moblab --lab True -n --parent_job_id=1 -r /usr/local/autotest/results/2-moblab --verify_job_repo_url -p /usr/local/autotest/drone_tmp/control_attach --use-existing-results --pidfile-label container_autoserv
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
BFD: /usr/local/lib/python2.7/dist-packages/simplejson/_speedups.so: don't know how to handle section `.relr.dyn' [0x      13]
warning: `/usr/local/lib/python2.7/dist-packages/simplejson/_speedups.so': Shared library architecture unknown is not compatible with target architecture i386:x86-64.

Program received signal SIGSEGV, Segmentation fault.
elf_dynamic_do_Rela (skip_ifunc=0, lazy=<optimized out>, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, map=0xe9a5f0) at do-rel.h:136
136	do-rel.h: No such file or directory.
(gdb) backtrace
#0  elf_dynamic_do_Rela (skip_ifunc=0, lazy=<optimized out>, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, map=0xe9a5f0) at do-rel.h:136
#1  _dl_relocate_object (scope=<optimized out>, reloc_mode=reloc_mode@entry=0, consider_profiling=<optimized out>, consider_profiling@entry=0) at dl-reloc.c:264
#2  0x00007ffff7deed71 in dl_open_worker (a=a@entry=0x7fffffff8958) at dl-open.c:427
#3  0x00007ffff7dea094 in _dl_catch_error (objname=objname@entry=0x7fffffff8948, errstring=errstring@entry=0x7fffffff8950, mallocedp=mallocedp@entry=0x7fffffff8940, 
    operate=operate@entry=0x7ffff7deea30 <dl_open_worker>, args=args@entry=0x7fffffff8958) at dl-error.c:187
#4  0x00007ffff7dee44b in _dl_open (file=0xe82af0 "/usr/local/lib/python2.7/dist-packages/simplejson/_speedups.so", mode=-2147483646, caller_dlopen=<optimized out>, nsid=-2, argc=23, 
    argv=0x7fffffffe488, env=0x7fffffffe548) at dl-open.c:661
#5  0x00007ffff75f002b in dlopen_doit (a=a@entry=0x7fffffff8b70) at dlopen.c:66
#6  0x00007ffff7dea094 in _dl_catch_error (objname=0x99e2b0, errstring=0x99e2b8, mallocedp=0x99e2a8, operate=0x7ffff75effd0 <dlopen_doit>, args=0x7fffffff8b70) at dl-error.c:187
#7  0x00007ffff75f062d in _dlerror_run (operate=operate@entry=0x7ffff75effd0 <dlopen_doit>, args=args@entry=0x7fffffff8b70) at dlerror.c:163
#8  0x00007ffff75f00c1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#9  0x000000000059dfc3 in _PyImport_GetDynLoadFunc ()
#10 0x000000000042f9a6 in _PyImport_LoadDynamicModule ()
#11 0x000000000053fe2c in ?? ()
#12 0x00000000005406c5 in PyImport_ImportModuleLevel ()
#13 0x0000000000546e37 in ?? ()
#14 0x00000000004d40fb in PyEval_CallObjectWithKeywords ()
#15 0x00000000004ca061 in PyEval_EvalFrameEx ()
#16 0x00000000004c8762 in PyEval_EvalFrameEx ()
#17 0x00000000004cfedc in PyEval_EvalCodeEx ()
#18 0x0000000000596e82 in PyEval_EvalCode ()
#19 0x0000000000596f9a in PyImport_ExecCodeModuleEx ()
#20 0x00000000005b200f in ?? ()
#21 0x000000000053fe2c in ?? ()
#22 0x000000000054056b in PyImport_ImportModuleLevel ()
#23 0x0000000000546e37 in ?? ()
#24 0x00000000004d40fb in PyEval_CallObjectWithKeywords ()
#25 0x00000000004ca061 in PyEval_EvalFrameEx ()
#26 0x00000000004cfedc in PyEval_EvalCodeEx ()
#27 0x0000000000596e82 in PyEval_EvalCode ()
#28 0x0000000000596f9a in PyImport_ExecCodeModuleEx ()
#29 0x00000000005b200f in ?? ()
#30 0x000000000042abf0 in ?? ()
#31 0x0000000000581e65 in ?? ()
#32 0x000000000053fd2f in ?? ()
#33 0x0000000000540342 in PyImport_ImportModuleLevel ()
#34 0x0000000000546e37 in ?? ()
#35 0x00000000004d40fb in PyEval_CallObjectWithKeywords ()
#36 0x00000000004ca061 in PyEval_EvalFrameEx ()
#37 0x00000000004cfedc in PyEval_EvalCodeEx ()
#38 0x0000000000596e82 in PyEval_EvalCode ()
#39 0x0000000000596f9a in PyImport_ExecCodeModuleEx ()
#40 0x00000000005b200f in ?? ()
#41 0x000000000053fe2c in ?? ()
#42 0x000000000054056b in PyImport_ImportModuleLevel ()
#43 0x0000000000546e37 in ?? ()
#44 0x00000000004d40fb in PyEval_CallObjectWithKeywords ()
#45 0x00000000004ca061 in PyEval_EvalFrameEx ()
#46 0x00000000004cfedc in PyEval_EvalCodeEx ()
#47 0x0000000000596e82 in PyEval_EvalCode ()
#48 0x0000000000596f9a in PyImport_ExecCodeModuleEx ()
#49 0x00000000005b200f in ?? ()
#50 0x0000000000581e65 in ?? ()
#51 0x000000000048c5b4 in ?? ()
#52 0x00000000005403fe in PyImport_ImportModuleLevel ()
#53 0x0000000000546e37 in ?? ()
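
(One possible reading of the BFD warnings in the gdb session above, not confirmed anywhere in this bug: the _speedups.so being dlopen()'d appears to carry a .relr.dyn relocation section that the container's old binutils/glibc 2.19 don't understand, which would be consistent with ld-2.19.so crashing in its relocation code during the import. A quick diagnostic sketch, with the path copied from the gdb output and readelf assumed to be available inside the container:)

import subprocess

# Does the container's copy of _speedups.so contain a .relr.dyn section?
SPEEDUPS = '/usr/local/lib/python2.7/dist-packages/simplejson/_speedups.so'
sections = subprocess.check_output(['readelf', '--section-headers', SPEEDUPS])
if b'.relr.dyn' in sections:
    print('found .relr.dyn in %s; an old ld.so/BFD may mishandle it' % SPEEDUPS)
else:
    print('no .relr.dyn section in %s' % SPEEDUPS)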

The new lxc container is in, and we got one green run of the guado moblab builder. I will monitor over the weekend and hopefully can put guado moblab back in the CQ.
Status: Fixed (was: Assigned)
After the update to lxc 2.1.1 there have been 5 CQ green runs in a row, so I am marking this as fixed. I will remove the experimental flag for the builder at EOD, assuming no failed builds.
