factory: First time boot errors & boot time optimization |
||||||||||
Issue description
What steps will reproduce the problem?
(1) Get a fresh new Chromebook.
(2) Install factory toolkit.
(3) Reboot
(4) Check /var/log/factory.log
What is the expected result?
Messages without special error / warning
What happens instead?
Lots of warnings we should not see.
These are captured on a MP Chell with HW/SW WP enabled, with test image 9197, locally built factory toolkit without BOARD.
1. [INFO] goofy goofy:317 2017-01-20 00:46:17.881 Starting factory event server
tail: cannot open '/var/factory/log/console.log' for reading: No such file or directory
Seems like the tail should be created after console.log was generated.
2. status: Unknown job: thermal
cpu_manager.py tries 'thermal' and 'dptf'. We should do this in a better way, for example getting all services and decide which to use, instead of try-catch.
3. EC result 4 (ACCESS_DENIED)
Is AC connected?
[ERROR] goofy charge_manager.py:95 2017-01-20 00:46:18.442 Unable to set charge state: Unable to set charge state: Command '['ectool', 'chargecontrol', 'discharge']' returned non-zero exit status 1
Not sure why calling chargecontrol would fail.
4. [INFO] goofy file_utils.py:220 2017-01-20 00:46:18.463 Writing 'interactive' to /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
[ERROR] goofy cpufreq_manager.py:89 2017-01-20 00:46:18.469 Unable to set CPU scaling parameters
Traceback (most recent call last):
File "/usr/local/factory/py_pkg/cros/factory/test/utils/cpufreq_manager.py", line 83, in SetEnabled
governor, log=True)
File "/usr/local/factory/py_pkg/cros/factory/utils/file_utils.py", line 222, in WriteFile
f.write(data)
IOError: [Errno 22] Invalid argument
'interactive' is an option that needs kernel CONFIG_CPU_FREQ_GOV_INTERACTIVE, and this is only set on config/arm64/common.config.
We probably should not use this anymore.
5. network manager is up/down several times.
[INFO] goofy connection_manager.py:126 2017-01-20 00:46:18.007 Created connection manager: wlans=[crosfactory21, crosfactory20]
[INFO] goofy connection_manager.py:192 2017-01-20 00:46:18.012 Enabling networking
[INFO] goofy connection_manager.py:196 2017-01-20 00:46:18.014 ifconfig br0 up
[INFO] goofy connection_manager.py:196 2017-01-20 00:46:18.022 ifconfig eth1 up
[INFO] goofy connection_manager.py:196 2017-01-20 00:46:18.028 ifconfig eth0 up
[INFO] goofy connection_manager.py:196 2017-01-20 00:46:18.034 ifconfig wlan0 up
...
[INFO] goofy time_sanitizer.py:39 2017-01-20 00:46:19.004 Current hwclock time: Fri Jan 20 00:46:18 2017 .513663 seconds
[INFO] goofy connection_manager.py:243 2017-01-20 00:46:19.360 Disabling networking
[INFO] goofy connection_manager.py:192 2017-01-20 00:46:19.608 Enabling networking
[INFO] goofy connection_manager.py:196 2017-01-20 00:46:19.757 ifconfig br0 up
[INFO] goofy connection_manager.py:196 2017-01-20 00:46:19.775 ifconfig eth1 up
[INFO] goofy connection_manager.py:196 2017-01-20 00:46:19.807 ifconfig eth0 up
[INFO] goofy connection_manager.py:196 2017-01-20 00:46:19.820 ifconfig wlan0 up
Which the Disabling looks not needed.
6. [INFO] goofy web_socket_manager.py:115 2017-01-20 00:46:23.120 Chrome UI has come up
[ERROR] goofy invocation.py:237 2017-01-20 00:46:23.151 Unable to create symlink /var/factory/tests/
Traceback (most recent call last):
File "/usr/local/factory/py_pkg/cros/factory/goofy/invocation.py", line 235, in __init__
os.symlink(os.path.basename(self.output_dir), latest_symlink)
OSError: [Errno 17] File exists
1st time run should not see this... Did we do something wrong?
7. Testlog & invocation error
[INFO] goofy invocation.py:680 2017-01-20 00:46:23.165 Running test
[ERROR] goofy invocation.py:768 2017-01-20 00:46:23.187 Unable to log start_test event
Traceback (most recent call last):
File "/usr/local/factory/py_pkg/cros/factory/goofy/invocation.py", line 757, in _run
log_args, TestState.ACTIVE),
File "/usr/local/factory/py_pkg/cros/factory/goofy/invocation.py", line 647, in _convert_log_args
testlog_event.Populate(kwargs)
File "/usr/local/factory/py_pkg/cros/factory/test/testlog.py", line 692, in Populate
self[key] = data[key]
File "/usr/local/factory/py_pkg/cros/factory/test/testlog.py", line 646, in __setitem__
cls.FIELDS[key][1](self, key, value)
File "/usr/local/factory/py_pkg/cros/factory/test/testlog_validator.py", line 48, in String
'Please convert before assign' % (key, type(value)))
ValueError: key[testType] accepts type of basestring. Not <type 'NoneType'> Please convert before assign
[ERROR] goofy invocation.py:829 2017-01-20 00:46:23.206 Unable to log end_test event
Traceback (most recent call last):
File "/usr/local/factory/py_pkg/cros/factory/goofy/invocation.py", line 824, in _run
log_args, status))
File "/usr/local/factory/py_pkg/cros/factory/goofy/invocation.py", line 647, in _convert_log_args
testlog_event.Populate(kwargs)
File "/usr/local/factory/py_pkg/cros/factory/test/testlog.py", line 692, in Populate
self[key] = data[key]
File "/usr/local/factory/py_pkg/cros/factory/test/testlog.py", line 646, in __setitem__
cls.FIELDS[key][1](self, key, value)
File "/usr/local/factory/py_pkg/cros/factory/test/testlog_validator.py", line 48, in String
'Please convert before assign' % (key, type(value)))
ValueError: key[testType] accepts type of basestring. Not <type 'NoneType'> Please convert before assign
[INFO] goofy invocation.py:834 2017-01-20 00:46:23.210 Test FAILED: No autotest_name, pytest_name, or invocation_target
[ERROR] goofy invocation.py:844 2017-01-20 00:46:23.213 Test FAILED: No autotest_name, pytest_name, or invocation_target
Note this is executed without working test list.
,
Jan 23 2017
Re#1 executed without working test list.
,
Jan 23 2017
Also, it takes >15s now to get UI when system boots. Need to optimize what's taking so long. Even when Chrome starts getting something, it takes a really long delay before the main UI is ready.
,
Jan 25 2017
Temporarily assign to Nooger. yllin, you can try to see if there's something can be improved here.
,
May 18 2017
Reassign to petershih.
,
Jul 17 2017
,
Jul 19 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/factory/+/9d165750654a72bdd1ce205cab4f850d17594ee6 commit 9d165750654a72bdd1ce205cab4f850d17594ee6 Author: Shen-En Shih <petershih@chromium.org> Date: Wed Jul 19 08:23:37 2017 goofy: Fix error 'thermal service is not found' Two services are tried to gain more control on cpu frequency. One of them is 'thermal' service, the other one is 'dptf'. It cannot be easily determined which one should be used, so in current design, we tried to control both of them, and typically one of them will failed, generating an error message like 'thermal service is not found' in the factory.log. At least three methods can be used to check if a given service name exists or not: 1. Check files under /etc/init folder. 2. Check return code of 'status' command It returns 0 if the service exists; otherwise returns 1. 3. Create a unix socket to /com/ubuntu/upstart, and doing some communication with it. In this commit, the second method is chosen. A new method is added to the service utils, which checks if a given service name exists or not. BUG= chromium:683060 TEST=run factory toolkit, there should have no error message like 'thermal service is not found' in factory.log Change-Id: Ic295ac275b617fe3e03090b65c9afcccc231be44 Reviewed-on: https://chromium-review.googlesource.com/575117 Commit-Ready: Shen-En Shih <petershih@chromium.org> Tested-by: Shen-En Shih <petershih@chromium.org> Reviewed-by: Hung-Te Lin <hungte@chromium.org> [modify] https://crrev.com/9d165750654a72bdd1ce205cab4f850d17594ee6/py/utils/service_utils.py [modify] https://crrev.com/9d165750654a72bdd1ce205cab4f850d17594ee6/py/test/utils/cpufreq_manager.py
,
Jul 25 2017
I found another log saying: tpm_nvread is not found This is due to the CL: https://chromium-review.googlesource.com/c/336610/ I'm working on to suppress this error message when 'tpm_nvread' is not found. But, I'm not sure is this expected? The binary 'tpm_nvread' might not be found under certain circumstances?
,
Jul 25 2017
We use tpm_nvread to get virtual developer switch state, but unfortunately tpm_nvread, as part of TPM trousers package, is gone for systems running Cr50. We should create an issue for it.
,
Jul 25 2017
,
Jul 25 2017
Also, another log saying 'Command not supported on this platform', is generated by command 'mosys pd info -s fw_version'. This can be traced back to the function 'GetPDVersion' (line 73) in this CL: https://chromium-review.googlesource.com/#/c/322080/9/py/test/dut/usb_c.py Same question here, is this an expected behavior? If yes, I can try to suppress this error message though...
,
Jul 25 2017
Who calls GetPDVersion? Was it by Goofy getting system info? I think we want to keep device APIs reporting error, but it should be suppressed if it's for system.info. Not sure if it's easy to do that...
,
Jul 25 2017
Yes, it's from SystemInfo under py/device/info.py The caller is goofy's update_system_info() under py/goofy/goofy.py I can try to find a way to suppress errors for SystemInfo
,
Jul 25 2017
To the performance checking, I tested it on a recent chromebook. It takes 11.5 seconds after I pressed Ctrl+D, and until it takes me to the goofy UI.
Details:
1. Press Ctrl+D
<After 4.5 seconds>
2. goofy-control invokes goofy
<After 3.8 seconds>
3. Get board info ('mosys platform version')
<After 0.8 seconds>
4. Get ec info ('mosys ec info -s fw_version')
<After 1.7 seconds>
5. Rest of goofy execution
<After 0.7 seconds>
6. goofy done (no more factory.log), assumed UI ready at this point
In conclusion, goofy takes 7 seconds to boot. Among that, 2.5 seconds (35.7%) was spent on two 'mosys' commands.
Those two 'mosys' commands are invoked when collecting system information.
One naive idea is to collect this information at another thread. However, this introduces extra complexity about data dependency (e.g., if a command depends on the system information, it needs to wait for that thread). Do we want this extra code complexity for a 3-second boost?
Another idea is to 'delay' the system information collection until it's used. However, this is not a good idea in my opinion. The system information is typically updated in our event loop, and it takes around 3-4 seconds to collect these data at first time. If we move this to the event loop, it may make our goofy not-so-responsive at sometime.
,
Jul 25 2017
Just to make sure, did you build toolkit with BOARD set? Because some board hooks may include more commands that take extra time, while the test list in public repo may not have them.
,
Jul 25 2017
No. I will re-run the test again, and updated it here. Thanks for the hint!
,
Jul 25 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/factory/+/2f624767ba97e94011bb9be8417f67fd890aff4d commit 2f624767ba97e94011bb9be8417f67fd890aff4d Author: Shen-En Shih <petershih@chromium.org> Date: Tue Jul 25 14:00:25 2017 goofy: Fix "console.log does not exist" error. In the current implementation, the /var/factory/log/console.log is delay-created until the first log message is generated. On the other side, a daemon thread is spawned in goofy initialization phase, which 'tail -F' the console.log, and send it out to our UI interface. Combine these two facts, the tail process generates an error message in the first run, saying console.log does not exist. The reason why the file 'console.log' is delay-created, is because the file handler is created at module-loading time, and it is expected that the console log file is not created at that time. In this commit, the console.log is created (if not created yet) before tailing it. BUG= chromium:683060 TEST=rm /var/factory/log/console.log, restart goofy, check /var/log/factory.log should have no tail error, and check a text "Opened console" should be printed on the goofy web interface. Change-Id: Iad55f3914f3ee1ef11f1557993137eabca653a0e Reviewed-on: https://chromium-review.googlesource.com/573886 Commit-Ready: Shen-En Shih <petershih@chromium.org> Tested-by: Shen-En Shih <petershih@chromium.org> Reviewed-by: Hung-Te Lin <hungte@chromium.org> [modify] https://crrev.com/2f624767ba97e94011bb9be8417f67fd890aff4d/py/test/factory.py [modify] https://crrev.com/2f624767ba97e94011bb9be8417f67fd890aff4d/py/goofy/web_socket_manager.py
,
Jul 26 2017
,
Jul 28 2017
Re #15 Update timings with 'board' set correctly. 1. Press Ctrl+D <After 5 seconds> 2. goofy-control invokes goofy <After 1.3 seconds> 3. Load test lists <Takes 1.4 seconds> 4. read from vpd.rw (under board hook) <Takes 5~6 seconds> 5. Update system info <Takes 0.2 seconds> 6. Rest of goofy execution <After 0.7 seconds> 7. goofy done (no more factory.log), assumed UI ready at this point Note: the 'mosys' commands are no longer the bottleneck, so they are merged in the 'Update system info' step. In summary, 1. It takes 15 seconds to take from Ctrl+D to goofy UI 2. It takes 10 seconds in goofy.py 3. read vpd.rw (under board hook) takes 5~6 seconds (~50% among goofy.py, ~33% among whole booting time) If we really need to read from vpd.rw under board hook, maybe we can do that at another thread? Maybe async/await interface to board hook interfaces.
,
Jul 28 2017
For modern systems, you can use the script "vpd_get_value" that will use krnel vpd driver, which reads cache prepared by firmware, to read VPD in less than 1s. The cache won't be changed until next boot, so for most cases factory should not use it. But for board hook which runs at boot time, I think it's good enough. Maybe change vpd.rw to "vpd_get_value".
,
Jul 28 2017
p.s: please try board=soraka or poppy so it'll reflect how modern projects work.
,
Jul 28 2017
#21 Tried on soraka, the board hook there still use vpd.rw.GetAll(), taking 2 seconds to accomplish on my trials. Maybe we should file another issue to try the 'vpd_get_value' method?
,
Jul 28 2017
Yes I think you can make a CL/subissue to change board hooks replacing vpd.rw.GetAll by vpd_get_value. (And you can check if that really speeds things up first, of course)
,
Aug 1 2017
Made some (experimental) changes in CL: https://chromium-review.googlesource.com/c/593712/1 1. Read firmware-cached VPD from /sys/firmware/vpd in board hooks 2. Goofy update system info asynchronously 3. Modify status monitor plugin behavior (since system info is now updated asynchronously)
,
Aug 1 2017
,
Aug 7 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/factory/+/d7c798eb6f057b0acc7909f9ba3c6bbe97e74932 commit d7c798eb6f057b0acc7909f9ba3c6bbe97e74932 Author: Shen-En Shih <petershih@chromium.org> Date: Mon Aug 07 07:19:20 2017 device: Support reading vpd from sysfs driver For modern Chromebooks, the firmware will prepare a cache of VPD data and exported it by kernel driver via sysfs. The purpose of this CL, is to use the information under /sys/firmware/vpd to speed up reading VPD values if the caller does not need "live" values. For example, for goofy's startup hooks. However, it must be noted that, the data there might not be updated when new values are written to the VPD. This renders most of the tests in factory software inapplicable. In this sense, the inherit graph of VPD components are re-designed, and a new 'boot' property is added to vpd. From this 'boot' property, one can get the VPD values cached at boot time very quickly. An example usage is for a board hook. It can use this to read VPD data in the init hook, reducing goofy's startup time. BUG= chromium:683060 TEST=make test TEST=manually test Change-Id: I87c27e915d787914d2caaea0bb1d5bfdb6d2c565 Reviewed-on: https://chromium-review.googlesource.com/593711 Commit-Ready: Shen-En Shih <petershih@chromium.org> Tested-by: Shen-En Shih <petershih@chromium.org> Reviewed-by: Hung-Te Lin <hungte@chromium.org> [modify] https://crrev.com/d7c798eb6f057b0acc7909f9ba3c6bbe97e74932/py/device/vpd_unittest.py [modify] https://crrev.com/d7c798eb6f057b0acc7909f9ba3c6bbe97e74932/py/device/vpd.py [modify] https://crrev.com/d7c798eb6f057b0acc7909f9ba3c6bbe97e74932/py/device/boards/android.py
,
Aug 7 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/factory/+/f4ad32ff0927e3e12ee00fabb68163a11d26d36d commit f4ad32ff0927e3e12ee00fabb68163a11d26d36d Author: Shen-En Shih <petershih@chromium.org> Date: Mon Aug 07 07:19:20 2017 goofy: Update system info in an async way during init The system info is now initialized asynchronously by posting an UPDATE_SYSTEM_INFO event to the event server. This speeds up the start up time greatly, but now the 'system_info' in the shared data might be empty for a short time window. All callers are checked to ensure the correctness. The only one needs to be modified is the status monitor plugin, and it's now also update from 'system_info' periodically in a previous commit. BUG= chromium:683060 TEST=make test TEST=restart goofy, check status monitor can show correct system info on UI after a few seconds. Change-Id: I2b0fd9d81e8595571367e472bbd0aefe3daedbd5 Reviewed-on: https://chromium-review.googlesource.com/593712 Commit-Ready: Shen-En Shih <petershih@chromium.org> Tested-by: Shen-En Shih <petershih@chromium.org> Reviewed-by: Hung-Te Lin <hungte@chromium.org> [modify] https://crrev.com/f4ad32ff0927e3e12ee00fabb68163a11d26d36d/py/goofy/goofy.py
,
Aug 7 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/overlays/overlay-soraka-private/+/8c59fb40770faeee2ca25dcc435be136b9d5b68f commit 8c59fb40770faeee2ca25dcc435be136b9d5b68f Author: Shen-En Shih <petershih@google.com> Date: Mon Aug 07 18:45:04 2017
,
Aug 8 2017
Remove the block on issue 748435 , since device info is now fetched in an async way. After these improves, goofy now only takes about 2.2 seconds to startup. It even waits about 2 seconds until Chrome UI get ready. All error logs in the /var/log/factory.log are also resolved.
,
Aug 8 2017
,
Aug 8 2017
Cool. Thanks so much for the improvements!
,
Aug 8 2017
,
Jan 22 2018
|
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by kitching@chromium.org
, Jan 23 2017As for the testlog error, kwargs[testType] is being passed in as None. Tracing this backwards, testType is pulled from log_args[pytest_name], which is set from self.test.pytest_name: if self.test.pytest_name: log_args['pytest_name'] = self.test.pytest_name The other invocation.py error at the bottom (Test FAILED: No autotest_name, pytest_name, or invocation_target) is from this area: if self.test.autotest_name: status, error_msg = self._invoke_autotest(resolved_dargs) elif self.test.pytest_name: status, error_msg = self._invoke_pytest(resolved_dargs) elif self.test.invocation_target: status, error_msg = self._invoke_target() else: status = TestState.FAILED error_msg = ( 'No autotest_name, pytest_name, or invocation_target') So the test item is lacking the autotest_name or pytest_name argument?