platform_Powerwash canary failures
Issue description
There are a few failures during the AUTest step for the canaries.
platform_Powerwash: FAIL: Powerwash count didn't increase after powerwash cycle.
https://uberchromegw.corp.google.com/i/chromeos/builders/beltino-a-release-group/builds/2041
https://uberchromegw.corp.google.com/i/chromeos/builders/jecht-release-group/builds/1358
ninja had a different AUTest failure:
platform_Powerwash: ABORT: Host did not return from reboot.
https://uberchromegw.corp.google.com/i/chromeos/builders/rambi-d-release-group/builds/1679
I would include logs, but due to the cautotest/ redirector being down, I can't.
Assigned to deymo@, but feel free to assign it to someone else who might know about powerwash.
,
Apr 18 2016
I checked two of the test logs. In one case, the powerwash count went from 1 to 4; in the other case it went from 2 to File not found. The 1 to 4 is very interesting. This could be a problem with the device rebooting while doing the powerwash, or something weirder in the test infrastructure. The sheriff should dig more into the logs and see what's going on, and whether it is tied to certain boards... no special knowledge needed, but I'm not working on the cros platform bugs at the moment.
,
Apr 18 2016
Note that this is not related to AU. "platform_Powerwash" flags a powerwash and reboots the device.
,
Apr 18 2016
I'm not so confident about "no special knowledge needed". I'm the sheriff and, as someone who doesn't work on this part of ChromeOS, this is the first time I've ever heard of a powerwash count. Can you link me to some documentation explaining how powerwash works or what this count is? Do you know someone who is working on these bugs at the moment? I know you're not working on it right now, but without a pointer of some kind, I've really got no idea where to look.
,
Apr 18 2016
There isn't really any special knowledge needed here. You can learn what you don't know by reading the test and the code under test, which doesn't depend on anything huge or weird.
The test code, as usual with any autotest test, is in the autotest tree, in this case here: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/server/site_tests/platform_Powerwash/platform_Powerwash.py (codesearch for the name of the test is the fastest way to find it). The test is very short. It has the path to the powerwash marker file and the powerwash count file at the top. It basically does the following:
1. It reads the count before the test.
2. It writes the powerwash marker file with 'safe fast keepimg'.
3. It reboots the device.
4. It reads the counter after reboot, expecting the counter to be incremented by 1 due to the powerwash.
Powerwash is a sort of "factory reset" of a Chromebook, except that some stuff is preserved (mostly the stuff in /mnt/stateful_partition/unencrypted/preserve/). If you don't know where the code for powerwash is, then again codesearch for some of those file paths and you will find it. The reference you want here is src/platform2/init/chromeos_startup, which has comments explaining what it does with the factory_install_reset file, and then calls clobber-state (src/platform2/init/clobber-state), passing the contents of this file as arguments. clobber-state is the bash script implementing the actual wiping of the stateful partition and the increment of the powerwash count, if you really want to take a look at that logic (which I think is not relevant for the bug).
Now, in this case, the symptoms are "interesting" because it looks like the device rebooted three times in one case to get the count from 1 to 4. So if I were to debug this issue, given that it is spread among several boards, I would try to reproduce the powerwash and see whether the device reboots once or more than once.
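The four steps of the test described above can be sketched roughly as follows. This is only an illustration and not the actual platform_Powerwash code: the two paths are hypothetical placeholders (the real test defines its own at the top of the file), and `FakeHost` is a made-up stand-in for a device so the sketch can run without hardware.

```python
# Hypothetical placeholder paths, NOT the ones used by the real test.
MARKER_FILE = "/hypothetical/factory_install_reset"
COUNT_FILE = "/hypothetical/powerwash_count"
MARKER_CONTENTS = "safe fast keepimg"  # marker contents quoted in the comment


class PowerwashCheckError(Exception):
    """Raised when the count does not go up by exactly one."""


class FakeHost:
    """Made-up stand-in for a device; this is not the autotest host API."""

    def __init__(self, count, increments=1):
        self.files = {COUNT_FILE: str(count)}
        # How many times the simulated device bumps the count per powerwash.
        self.increments = increments

    def read_file(self, path):
        return self.files[path]

    def write_file(self, path, contents):
        self.files[path] = contents

    def reboot(self):
        # A pending marker triggers a simulated powerwash, which consumes
        # the marker and bumps the preserved count.
        if self.files.pop(MARKER_FILE, None) is not None:
            count = int(self.files[COUNT_FILE])
            self.files[COUNT_FILE] = str(count + self.increments)


def run_powerwash_check(host):
    before = int(host.read_file(COUNT_FILE))        # 1. read the count
    host.write_file(MARKER_FILE, MARKER_CONTENTS)   # 2. write the marker
    host.reboot()                                   # 3. reboot
    after = int(host.read_file(COUNT_FILE))         # 4. read the count again
    if after != before + 1:
        raise PowerwashCheckError(
            "count went from %d to %d, expected %d" % (before, after, before + 1))
    return before, after
```

With `FakeHost(count=1)` the check passes, while `FakeHost(count=1, increments=3)` imitates the 1 to 4 jump seen in one of the logs and raises `PowerwashCheckError`.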
The annoying part is that the logs of a previous powerwash will be removed if you re-powerwash the device (look for the CLOBBER_STATE_LOG redirection trick to persist the powerwash logs). My guess here would be that something else is making the device reboot during powerwash (either another startup script or a kernel panic, hard to tell). So either the "preserve_files" function does the increment and then the device reboots (which results in a higher powerwash count), or the whole powerwash works (wiping the marker file as well) and then the device reboots before the "preserved files" are restored (which results in a file not found). Alternatively, if the device is rebooted *while* you are formatting stateful, the next attempt will restore it (format it). Therefore, there could be some race condition here :-( (the test passes if you reboot after the powerwash completes successfully).
First, as a sheriff you are expected to dig into test failures and reach code you don't know about. I'm not the owner of the powerwash flow or the init scripts; I just wrote the test for it well after it was implemented because it didn't have any integration test. Second, as a sheriff you will have better knowledge of what CLs are landing and causing problems. If you see a widespread "device reboots during boot" or "device reboots during heavy I/O" problem, you are in a better position to diagnose this race-condition issue than someone looking at the narrow "powerwash_count not being what it should be" issue.
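As a triage aid, the guesses above can be written down as a small lookup from the observed counts to the suspected cause. This is a sketch of the hypotheses in this comment only, not verified behavior of clobber-state; the function name and the returned labels are made up for illustration.

```python
from typing import Optional


def classify_powerwash_result(before: int, after: Optional[int]) -> str:
    """Map observed counts to the hypotheses from this thread.

    `after` is None when the count file was not found after reboot.
    The labels are illustrative guesses, not confirmed root causes.
    """
    if after is None:
        # Guess: powerwash ran, then the device rebooted before the
        # preserved files were restored.
        return "count file missing"
    if after == before + 1:
        # Exactly one powerwash happened: what the test expects.
        return "ok"
    if after > before + 1:
        # Guess: the device rebooted after the increment, more than once.
        return "extra increments"
    # The count stayed the same or went down.
    return "count did not increase"
```

The two logs mentioned earlier in this thread would map to `classify_powerwash_result(1, 4)`, giving "extra increments", and `classify_powerwash_result(2, None)`, giving "count file missing".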
,
Apr 19 2016
First, I'm sorry to have asked you for a pointer/help about fixing an error detected by your test, but I think you'll agree that since I've never even written an autotest I didn't have much of another option. Second, I'm well aware of how sheriffing is supposed to work. I just asked for a pointer to someone or to the code involved since you seemed to know more than me, not a description of how I should be doing my job. I don't think it's unreasonable to ask the person who wrote a test for a bit of information when it detects a problem. In fact, your explanation is very helpful indeed because I wasn't aware of several of the things you mentioned; however, I will not bother you again.
Comment 1 by charliemooney@chromium.org
, Apr 18 2016