ChromeOS issue: The digital signage web page is stuck on final image of a video file. |
|||||||||||||
Issue descriptionChromeOS version: 59.0.3071.91 (9460.60.0 (Official Build) stable-channel ninja) ChromeOS device model: AOpen Chromebox Commercial. Ninja.5216.383.7 Case#: 13431459 Description:The digital signage web page is stuck on final image of a video file. Steps to reproduce: 1. Enroll the device. 2. Configure KIOSK and load video file. (Please refer the details of vide file, link below in a share file) https://docs.google.com/a/google.com/document/d/11eHBiYe4aUmZmlgbaJPBzcHs5U8ss-DDD6u-z2tO5HQ/edit?usp=sharing 3. The digital signage web page is stuck on final image of a video file Current Behavior / Reproduction: The digital signage web page is stuck on final image of a video file. Expected Behavior: The digital signage web page should not stuck/freeze. Drive link to logs: debug-logs_20170818-142123: https://drive.google.com/drive/folders/0B01YYztUbOCudE1HSzFsYlVTTU0?usp=sharing Suspecting time of the device stuck. 20:46 2017-08-18. Extra info: - Sometime the devices show as online in CDM and can be rebooted remotely, after which they play fine. But, This morning was showing as offline in CDM and manually rebooted was required. - The logs are top to bottom full of error lines, but after rebooting and checking back again they’re showing the same thing still (albeit a different Chrome error code, but same format and consistency etc) and the screens are cycling through content. -There was not a valid crash ID in the provided Logfiles . Found similar log messages in below crbug cases. https://bugs.chromium.org/p/chromium/issues/detail?id=756191 ERR imageloader[456]: Failed to read latest-version file. ERR imageloader[456]: Failed to verify and mount component. https://bugs.chromium.org/p/chromium/issues/detail?id=719040 Attribute compr_data_size (and others) will be removed. See zram documentation.
,
Aug 23 2017
Adding more context from the customer. Logs are here: https://drive.google.com/drive/folders/0Bx12NVyIouI0QWdkbi0yWDgyOVk?usp=sharing =========================================== The architecture of the application used is that it runs a chrome app in kiosk mode containing a webview. We hit a backend service which controls what urls are displayed in the webview. The app calls "chrome.power.requestKeepAwake('display')" to stop the device going to sleep. Symptoms: these are the scenarios discovered so far: 1. Chrome stumbles with a “trap invalid opcode” error which crashes the browser and the media player eventually puts itself to sleep (flashing power light on media player) - feedback is this indicates that the Chrome process has tried to run an assembler level command that doesn’t exist. This can happen when there is an error in the process being run (Chrome in this case) where it starts trying to run parts of the code that aren’t actually the program, e.g. it’s trying to execute bits of data. 2. The digital signage web page shows a white screen (the chrome kiosk app appears to be ‘live’) - Chrome logs show consistency in error codes pointing to “cras_client: socketpair: Too many open files” 3. The digital signage web page is stuck on final image of a video file “Keno Let’s Play” (the chrome kiosk app appears to be ‘live’) - The devices show as online in CDM and can be rebooted remotely, after which they play fine - The logs are top to bottom full of error lines, but after rebooting and checking back again they’re showing the same thing still (albeit a different Chrome error code, but same format and consistency etc) and the screens are cycling through content 4. Sudden unexplained disconnection from Control Centre and Chrome Admin (I believe the media player turns off completely and the screen goes black) - There are devices where the Chrome logs appear to be running completely normally at the point the device has disconnected from Control Centre. These eventually suddenly shutdown and nothing is captured in the logs about any shutdown processes. I need these devices to be rebooted to pull down the updated log post-crash to see if there’s any further information in them Logs are attached for a sample from this week, and a spreadsheet highlighting where I see the issues occurring. Below is a summary of findings so far: - Zagame’s Boronia M2 (10/08) – AOPEN device gone to sleep. Log shows “trap invalid opcode” - Zagame’s Boronia Blade – AOPEN device gone to sleep. Log shows “trap invalid opcode” - Capalaba Bistro Blade – AOPEN device gone to sleep. Log shows “trap invalid opcode” - Zagame’s Boronia M2 (15/08) – AOPEN device appears online, but TV is showing a white screen (CDM screen grab) – log shows “too many open files”? - Cabramatta Inn M2 – Went offline in Control Centre 8:15am 16/08, shows as online in CDM. Screen grabs (attached) show screen frozen on “Keno Let’s Play” image. Rebooted via CDM and all fine. Can’t make out anything from log - Caboolture M2 – offline in Control Centre 10:39am 16/08, showing Offline/unavailable in CDM so can’t take screen grab, but rebooted remotely and all ok. Can’t make out anything from log Chrome OS version (found by navigating to "chrome://version"): OS is 59.0.3071.91 across the board. The issues have been evident for a couple months now (but weren’t being reported properly until more recently when we realised it was a much bigger issue), so whilst we were on OSv57 and now on OSv59.
,
Aug 23 2017
,
Aug 23 2017
[RBS for visibility. Feel free to remove after this is triaged]
,
Aug 23 2017
This looks like Issue 696734 , where playing high-def video was causing Ninja and Sumo devices to freeze. A fix was made and verified around July 7, 2017. The latest published stable image (build 9460.73.0, chrome 59.0.3071.134) should include the fix.
,
Aug 23 2017
Another example: replicated the customer config with their web content running in the browser. Log analyzer reports multiple errors: out of memory, crash, multiple instances of NOTREACHED() File available from https://drive.google.com/a/google.com/file/d/0B_5yutJvxlGoaHFVNXlMWDNxY0k/view?usp=sharing
,
Aug 23 2017
Regarding OOM: Can you get us more info on how much RAM this device has ? Crash id: a6b47792d499eec6 looks like a kernel warning which I've been told we can ignore. NOTREACHED: may not be relavent here. I think the most important finding in this log is the OOM. That can happen for two reasons (i) Because we have a memory leak in Chrome (ii) Because the app itself has a memory leak. To rule out (ii) usually people try to run the app in a regular chrome browser and try to profile it. Is this something you can do for the customer ? I haven't heard any other customer reporting a memory leak recently so I doubt its (i), but there can always be edge cases we don't know about.
,
Aug 23 2017
4GB memory in the test device - Ninja. Running same content in Chrome Browser (Windows, build 60.0.3112.101 official release) to check.
,
Aug 23 2017
mlight@, are you able to repro the issue by running the URL (http://d3huy4irkvet32.cloudfront.net/live/screen/SIGNAGE-L5WBG/index.html) in a chrome browser on a Ninja device using a build with the fix for Issue 696734 ?
,
Aug 24 2017
Update on #8 - Have been profiling content in Windows Chrome browser for 20 hours now, no memory leak. Updated same Ninja device to 60.0.3112.101 - content has also been running 20 hours, no errors seen so far.
,
Aug 24 2017
[Raj: Please help us find an explicit owner for this and assign it.]
,
Aug 24 2017
,
Aug 24 2017
Krishna, please try to repro this issue on a Ninja and add the relevant logs. Thank you.
,
Aug 24 2017
First I tried to reproduce the problem using the customer's setup (M-59 build 9460.60.0, chrome 59.0.3071.91). I created a Chrome Sign Builder schedule for the URL http://d3huy4irkvet32.cloudfront.net/live/screen/SIGNAGE-L5WBG/index.html so it could be run in kiosk mode. CSB began running at approximately 4:40 pm on Wed. Aug. 23, 2017. It ran overnight fine, and most of this morning, until exactly 10:57 am Pacific time when this was reported in /var/log/powerd.20170823-163809 (summarizing): "Power button was pressed when no display was attached. Shutting down due to user request." (Full logs attached). Couple of things: I didn't touch the power button this morning; and a display was attached, albeit turned off. Full system logs attached (/media/mlight/OEM/debug-logs_20170824-155111.tgz)
,
Aug 24 2017
About 7 minutes before this event, someone attached a keyboard to it. Would you know why the keyboard was touched, and if there is anything one can do from the keyboard which may look like a shutdown command ? 2017-08-24T10:47:58.779891-07:00 ERR chrome[2549]: [2549:2549:0824/104758.779692:INFO:status_uploader.cc(226)] Status upload successful 2017-08-24T10:50:45.272630-07:00 INFO kernel: [65663.452337] usb 1-3: new full-speed USB device number 6 using xhci_hcd 2017-08-24T10:50:45.439021-07:00 INFO kernel: [65663.619370] usb 1-3: New USB device found, idVendor=046d, idProduct=c52b 2017-08-24T10:50:45.439091-07:00 INFO kernel: [65663.619407] usb 1-3: New USB device strings: Mfr=1, Product=2, SerialNumber=0 2017-08-24T10:50:45.439110-07:00 INFO kernel: [65663.619430] usb 1-3: Product: USB Receiver 2017-08-24T10:50:45.439120-07:00 INFO kernel: [65663.619449] usb 1-3: Manufacturer: Logitech 2017-08-24T10:50:45.451596-07:00 INFO kernel: [65663.631862] logitech-djreceiver 0003:046D:C52B.0007: hiddev0,hidraw0: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:00:14.0-3/input2 2017-08-24T10:50:45.562270-07:00 INFO kernel: [65663.742261] input: Logitech TK820 as /devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3:1.2/0003:046D:C52B.0007/0003:046D:4102.0008/input/input8 2017-08-24T10:50:45.563601-07:00 INFO kernel: [65663.744221] logitech-hidpp-device 0003:046D:4102.0008: input,hidraw1: USB HID v1.11 Keyboard [Logitech TK820] on usb-0000:00:14.0-3:1 2017-08-24T10:50:45.613030-07:00 NOTICE root[540]: /opt/google/input/device_added 2017-08-24T10:50:45.656238-07:00 NOTICE root[548]: /opt/google/input/device_added /dev/input/event7 2017-08-24T10:58:06.635570-07:00 WARNING kernel: [66105.522747] init: debugd main process (2570) killed by TERM signal 2017-08-24T10:58:06.648123-07:00 ERR shill[882]: [ERROR:object_proxy.cc(582)] Failed to call method: org.chromium.dhcpcd.Release: object_path= /org/chromium/dhcpcd: org.freedesktop.DBus.Error.ServiceUnknown: The name :1.13 was not provided by any .service files 2017-08-24T10:58:06.648157-07:00 ERR shill[882]: [ERROR:dbus_method_invoker.h(111)] CallMethodAndBlockWithTimeout(...): Domain=dbus, Code
,
Aug 24 2017
For a quick experiment I started CSB again, disconnected the monitor and pressed the power button. Sure enough, the ninja did shut down, completely.
However, for last night's run the device's power button was flashing blue ("sleep mode"). A brief press of the power button should have woken it up, I believe.
,
Aug 24 2017
I plugged in a usb dongle for a keyboard/mouse combo to do a "vmstat" command.
,
Aug 24 2017
I'll do another run overnight, and leave the monitor on the whole time.
,
Aug 24 2017
For comparison, the Ninja I have running 60.0.3112.101 is still running without error.
,
Aug 29 2017
The customer provided other log file and I save it to the share drive Link below. File Name:0003573sat26AUG51451debug-logs_20170824-224333 Link:https://drive.google.com/open?id=0B01YYztUbOCudE1HSzFsYlVTTU0 From comment 2, > symptoms 1. “1. Chrome stumbles with a “trap invalid opcode” error which crashes the browser and the media player eventually puts itself to sleep (flashing power light on media player)” I noticed that “trap invalid opcode” error message in the provided message file. 2017-08-16T20:58:17.378712+10:00 INFO periodic_scheduler[27272]: cros-machine-id-regen: job completed 2017-08-16T21:14:43.493827+10:00 INFO kernel: [998701.876421] traps: Chrome_ProcessL[30198] trap invalid opcode ip:5d4b8cbbcf2b sp:7d6d0c9f0ca0 error:0 in chrome[5d4b8b153000+796c000] 2017-08-16T21:14:43.518477+10:00 WARNING crash_reporter[29562]: [user] Received crash notification for chrome[30016] sig 4, user 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly) 2017-08-16T21:14:43.625166+10:00 INFO session_manager[29987]: [INFO:child_exit_handler.cc(77)] Handling 30016 exit. 2017-08-16T21:14:43.625412+10:00 ERR session_manager[29987]: [ERROR:child_exit_handler.cc(85)] Exited with signal 4
,
Aug 31 2017
krishnargv@ who should be the owner for this stable blocker?
,
Aug 31 2017
New log files shared by customer - available here: https://drive.google.com/open?id=0B_5yutJvxlGoYlBuMUI2aThzd00 https://drive.google.com/open?id=0B_5yutJvxlGoMnI5a3d1TVRGX00
,
Aug 31 2017
@xiyuan, Would you be able to assist us with this.
,
Aug 31 2017
The “trap invalid opcode” (aka SIGILL) in #20 happens in ontent::ZygoteCommunication::ForkRequest based on ip:5d4b8cbbcf2b and chrome[5d4b8b153000+796c000] in the log line. Around that timestamp in ui log: [30016:20586:0816/211427.227184:ERROR:cpu_data_collector.cc(280)] CPU freq stats not available in sysfs. ExceptionHandler::GenerateDump sys_pipe failed:Too many open files ExceptionHandler::SendContinueSignalToChild sys_write failed:Bad file descriptor ExceptionHandler::WaitForContinueSignal sys_read failed:Bad file descriptor Failed to generate minidump.[0816/211443.827476:WARNING:diagnostics_writer.cc(209)] [FAIL] 016 JSONLocalState (File too big) My guess is the problem looks related to "Too many open files". Chrome is running out of fd somehow.
,
Aug 31 2017
Just noticed "[FAIL] 016 JSONLocalState (File too big)" in the log I pasted. This is usually an indication of FS corruption. It shows up when chrome fails to load "Local State" file.
,
Sep 1 2017
Further logs supplied by the customer today: https://drive.google.com/open?id=0B_5yutJvxlGoT1k2UHplSndvczQ https://drive.google.com/open?id=0B_5yutJvxlGoTFIxZ0tyWl80MVU https://drive.google.com/open?id=0B_5yutJvxlGoa1BCeUs3N2JsdlU
,
Sep 1 2017
A number of Ninja hangs when the monitor is turned off for a lengthy period is also being tracked in Issue 719040. I did an overnight run last night with the casino ads video loop, which lasted for 11 hours with the monitor off, and then hung badly. Even "alt + VolUp + x x x" wouldn't wake up the kernel.
,
Sep 3 2017
This issue was reported on 9460.60.0. mlight this also looks like your comment #73 on https://bugs.chromium.org/p/chromium/issues/detail?id=734708, which is 9592.82.0 and https://bugs.chromium.org/p/chromium/issues/detail?id=719040 is testing on the same version.
,
Sep 4 2017
Further logs from the customer: https://drive.google.com/open?id=0B_5yutJvxlGoTDZMZ1gzRXZ5bWM https://drive.google.com/open?id=0B_5yutJvxlGoZV81Q1NLZ09vUms https://drive.google.com/open?id=0B_5yutJvxlGoamJoMFZPM2Rqc3c https://drive.google.com/open?id=0B_5yutJvxlGoMnI5a3d1TVRGX00 https://drive.google.com/open?id=0B_5yutJvxlGoSlFNWVh2OUxXX3M https://drive.google.com/open?id=0B_5yutJvxlGodkRBUWlwWUU4cEU https://drive.google.com/open?id=0B_5yutJvxlGoa1FsYlVtdjdGd1U https://drive.google.com/open?id=0B_5yutJvxlGoM2hBTjJLX2FqSVk
,
Sep 5 2017
Final round of logs from the customer. They are spending significant $$$ outlay on field techs to collect the logs and do not wish to have to collect more. https://drive.google.com/open?id=0B_5yutJvxlGoRTJ5TDRNaUZpQkk https://drive.google.com/open?id=0B_5yutJvxlGoOVFXX2c5U0FlZVE https://drive.google.com/open?id=0B_5yutJvxlGoYnVSOU53NkNUUG8 https://drive.google.com/open?id=0B_5yutJvxlGoNGlRYU1vdVZnM1U https://drive.google.com/open?id=0B_5yutJvxlGoRUdlTjN1YU80Sm8 https://drive.google.com/open?id=0B_5yutJvxlGobTBfcGlBVFg2bEk
,
Sep 5 2017
The “trap invalid opcode” caused by running out of fd problem is issue 733718 and we raised soft fd limit to 8k for chromeos (https://codereview.chromium.org/2969453002). The fix is merged to M59 (59.0.3071.134) and M60 (60.0.3112.52). However, the freeze problem is most likely issue 719040. It is a combination of hardware and kernel/graphics. For this issue, it is tracking problem happening on AOpen ninja with version 59.0.3071.91 (9460.60.0). marcheu@ can you help to triage?
,
Sep 5 2017
This is a result of the switch to kernel 4.4, so assigning to kernel folks
,
Sep 5 2017
,
Sep 5 2017
rajatja@ This is currently a stable blocker. Can you please take a look and share your input here?
,
Sep 11 2017
The Ninja hang issue while playing videos (where the system hangs and does not wake up even using Alt + VolUp + x + x + x) is being tracked in crbug.com/719040. It seems that the "trap invalid opcode" / out of fd(s) issue has been triaged and fixed (per #31).
,
Sep 11 2017
|
|||||||||||||
►
Sign in to add a comment |
|||||||||||||
Comment 1 by roy...@google.com
, Aug 22 2017