Telemetry should not fail because of a ill-behaved module. |
|||||||
Issue descriptionSometimes Telemetry fails to run because DiscoverClasses raises an uncaught ImportError when it scans a ill-behaved module. Reasons of an ill-behaved a module could be either syntax error, importing modules not in the platform environment, etc. It is troublesome that a Telemetry task (e.g. benchmark_runner) is affected by unrelated module(s) (say another benchmark totally irrelevant to the target benchmark). See https://crbug.com/756053 for example. I propose that DiscoverClasses should just log ERROR for ill-behaved modules unless caller wants to fail early for ill-behaved modules.
,
Oct 24 2017
I disagree with the proposal here. The history context is I was the one who modified DiscoverClasses to fail hard upon ill-behaved module. We used to lost benchmarks for a few months without anyone notice because of some ill-behaved module, so I changed Telemetry to fail hard instead. Most people won't look at the log until something really bad happens, so I would rather make these type of errors fatal.
,
Oct 24 2017
*nits: please use my @google acount instead of @chromium account on crbug :P
,
Oct 25 2017
From: https://bugs.chromium.org/p/chromium/issues/detail?id=756053 Is amd64-generic-telemetry Telemetry's bot or PFQ bot for Chrome on ChromeOS? Since we have spent so much time fixing regressions specific to ChromeOS, for the bugs that can be caught by VM environment, should we add amd64-generic-telemetry as Telemetry's trybots as well?
,
Oct 25 2017
We have a bug open for adding chromeos trybots to catapult. There are 3 main issues. 1. We need kvm support - we have experimental GCE instances with this. 2. Someone needs to set up the bots, and chrome infra (esp Dirk) are backlogged. 3. The cycle time is an hour, mostly because login takes a couple of seconds, so we'd have to figure out a way of reducing this. One way is to log in once and try to clean up without a browser restart between runs. Alternatively we could run a small subset of critical tests on the bot.
,
Oct 25 2017
I disagree with fail hard approach. Here are my reasons: 1. (Untreated) benchmarks should not interfere each other. Like Chrome with extensions. If an extension is ill-behaved, the right thing to do is disable it, not to crash Chrome. No mentioned that it is not invoked yet. 2. I agree we should avoid missing benchmarks because it cannot be imported for various reasons. But it can be done in presubmit check by running a unittest which performs DiscoverClasses to all benchmark repositories we care about with fail-hard feature enabled. 2.1 For those missing benchmarks, it should be caught by dashboard monitoring system and alert to benchmark owners. If no one cares about its health status, it could be an indicator of moving the benchmark to archiver. 3. Once DiscoverClasses raises an exception, all Telemetry benchmarks will fail to run on certain (even all) platforms. It is a bad idea losing a bunch of benchmark data not because themself having trouble but a unrelated ill-behaved benchmark. Talk to this case ( crbug.com/756053 ), I don't know why it passed presubmit check. The module (pylib.utils) it wanted to use should be part of Android platform, right? Were Windows and Mac platforms unaffected for this?
,
Oct 25 2017
> 3. Once DiscoverClasses raises an exception, all Telemetry benchmarks will fail to run on certain (even all) platforms. It is a bad idea losing a bunch of benchmark data not because themself having trouble but a unrelated ill-behaved benchmark. How does this code even checked in? This has never been a problem on our infrastructure. Though I think this is a legit issue if your infra is not setup to deal with this. Here are what should fail & what should not fail: The command to list all Telemetry benchmarks should list all the found benchmarks but its return code is 1 in case there is import error. The command to run a particular benchmark should pass if that benchmark is discoverable & runnable. You can modify DiscoverClasses to have s.t like "tolerate_import_error" param. Upon this being True, the method returns the importable classes & list of import errors. I prefer consistent "types" for methods, so you also need to modify DiscoverClasses to return (<list of found classes>, []) & update the callsites everywhere.
,
Nov 1 2017
I already had two commits, the first one is for Catapult: https://chromium-review.googlesource.com/c/catapult/+/734983 the second one is for Chromium/tools/perf: https://chromium-review.googlesource.com/c/chromium/src/+/734984 I agree to use "tolerate_import_error" instead of "ignore_import_error" optional parameters in DiscoverClasses method. However, for return type change, can we make the two commits in different repo atomic? If not, I would suggest two step modification: 1. For Catapult's discover.py: Add DiscoverModulesEx() and DiscoverClassesEx() functions with "tolerate_import_error" flag. DiscoverModulesEx() returns (discovered_modules, failed_to_import_modules) tuple and DiscoverClassesEx() returns ({discovered_class_name, discovered_class_object}, failed_to_import_modules) tuples. DiscoverModules() and DiscoverClasses() remain unchanged. 2. After the Catapult change is landed, use DiscoverClassesEx() with "tolerate_import_error" set to True. Then if we want to just maintain DiscoverClasses() without Ex version, we can first change all Chromium caller, then renamed Catapult lib. But I think it is a nice to have steps. By the way, I observed that the callers of DiscoverClasses() are mostly just use its return value's values() part (i.e. list of classes), shall we just make DiscoverClassesEx jsut returns list of discovered class rather than {discovered_class_name, discovered_class_object}?
,
Nov 1 2017
We should keep the naming DiscoverClasses & DiscoverModules at the end state. The easiest way I can think of is to audit all the call sites to do s.t like: results = DiscoverClasses() if len(results) == ...: # do things as before else: # do things as DiscoverClasses return value now contain the errors list. Then update DiscoverClasses signatures & remove the guarding code in the call sites. This way, you only needs 3 patches.
,
May 3 2018
,
Jan 16
,
Jan 16
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by deanliao@chromium.org
, Oct 24 2017