moblab: Storage fills up with chrome core dumps
Issue description: My current platform (oak) is quite unstable, and moblab storage fills up with Chrome core dumps, e.g. mnt/stateful_partition/dev_image/moblab-storage/results/38839-moblab/192.168.231.116/sysinfo/var/spool/crash/chrome.20160227.182205.14494.core. I believe those should be taken care of by gs offloading, but they don't seem to ever get deleted on local storage.
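For reference, a quick way to see how much space the core dumps account for is to sum their sizes under the results tree. A minimal sketch (the results path is the one from this report, and the chrome.*.core naming is assumed from the example above):

import os

# Assumption: results path taken from this report; adjust for your setup.
RESULTS = '/mnt/stateful_partition/dev_image/moblab-storage/results'

total = 0
count = 0
for root, dirs, files in os.walk(RESULTS):
    for name in files:
        # Assumption: crash dumps follow the chrome.<timestamp>.<pid>.core pattern above.
        if name.startswith('chrome.') and name.endswith('.core'):
            try:
                total += os.path.getsize(os.path.join(root, name))
                count += 1
            except OSError:
                pass  # a file may disappear while we walk

print('%d core dumps, %.1f MiB' % (count, total / (1024.0 * 1024.0)))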
,
Mar 1 2016
Simran, is this something easy to fix, or something that Michael can help with?
,
Mar 1 2016
Hmm, gs_offloader seems to be having some trouble with these larger files. I think someone will have to deep-dive. The script is a self-contained utility in /usr/local/autotest/site_utils/gs_offloader.py. If Michael can take a look, that would be great.
,
Mar 3 2016
I looked at the gs_offloader code and monitored the log on the moblab device several times today. I saw gs_offloader time out after 3600 secs several times on job "35161". Before the timeout, the offloader seems to have already copied 9940 files. I wonder if the job produced too many output files and gsutil could not finish within one hour. Since all the test results are currently deleted on the device, could you monitor the gs_offloader log and see if this happens again? If it does, you would probably see gs_offloader get stuck offloading the same job result every hour.
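One way to monitor for that pattern is to count, per job, how many timeout errors appear in the gs_offloader log. A minimal sketch (the log location and exact message format are assumptions, based on the excerpts pasted later in this thread):

import collections
import re
import sys

# Usage: python count_offload_timeouts.py <gs_offloader_log>
# Assumption: the message format matches the ERROR lines quoted in this thread.
TIMEOUT_RE = re.compile(r'Offloading (\S+) timed out after waiting (\d+) seconds')

counts = collections.Counter()
with open(sys.argv[1]) as log:
    for line in log:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1

# A job that shows up here once per hour is the one pinning local storage.
for job, n in counts.most_common(10):
    print('%s: timed out %d times' % (job, n))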
,
Mar 3 2016
My moblab rebooted 3 hours ago, and /mnt/moblab did not get mounted as the partition was quite full (what I describe in #1)... Then I imitated what moblab-external-storage-init.conf would do (the results are not deleted, they are just not linked properly anymore):

mount -o bind /usr/local/moblab-storage/ /mnt/moblab/
rm -rf /usr/local/autotest/results/
ln -s /mnt/moblab/results /usr/local/autotest/results

du -sm /mnt/moblab/results/
75403 /mnt/moblab/results/
du -sm /mnt/moblab/results/hosts/
1263 /mnt/moblab/results/hosts/

75G is a lot, so the 1-hour timeout idea in #4 makes sense... And trying to run the offloader manually:

time sudo -u moblab /usr/local/autotest/site_utils/gs_offloader.py --days_old=1 --log_size=200 | tee /var/log/gs_offload.log
DEBUG:root:Set process to nice value: 10
DEBUG:root:Offloading Autotest results in /usr/local/autotest/results
DEBUG:root:Running 'ip addr show eth0 | grep link/ether'
DEBUG:root:Offloading to: gs://chromeos-moblab-goog/results/2c:60:0c:7c:60:d1/d4a35e56-8c32-11e5-8f66-2c600c7c60d1/
Copying file:///tmp/tmpv0wpVy [Content-Type=application/octet-stream]...
Removing gs://chromeos-moblab-goog/results/2c:60:0c:7c:60:d1/d4a35e56-8c32-11e5-8f66-2c600c7c60d1/tmpv0wpVy...
DEBUG:root:Start of offload cycle - found 2973 new jobs

Will let it run and let you know...
,
Mar 3 2016
The command above failed with this:
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/gs_offloader.py", line 652, in <module>
    main()
  File "/usr/local/autotest/site_utils/gs_offloader.py", line 647, in main
    offloader.offload_once()
  File "/usr/local/autotest/site_utils/gs_offloader.py", line 557, in offload_once
    job.enqueue_offload(queue, self._age_limit)
  File "/usr/local/autotest/site_utils/job_directories.py", line 153, in enqueue_offload
    timestamp = self.get_timestamp_if_finished()
  File "/usr/local/autotest/site_utils/job_directories.py", line 223, in get_timestamp_if_finished
    entry = _AFE.get_jobs(id=self._id, finished=True)
  File "/usr/local/autotest/server/frontend.py", line 293, in get_jobs
    jobs_data = self.run('get_jobs', **dargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 83, in run
    self, call, **dargs)
  File "/usr/lib64/python2.7/site-packages/chromite/lib/retry_util.py", line 86, in GenericRetry
    time.sleep(sleep_time)
  File "/usr/lib64/python2.7/site-packages/chromite/lib/timeout_util.py", line 53, in kill_us
    raise TimeoutError(error_message % {'time': max_run_time})
chromite.lib.timeout_util.TimeoutError: Timeout occurred- waited 1800 seconds.
real 30m5.745s
user 0m3.546s
sys 0m0.839s
I guess I need autotest to be running for gs_offloader to work...
,
Mar 3 2016
Indeed, gs_offloader is timing out: gs_offloader_jobs_log_.txt

2016-03-03 15:24:50,281 - DEBUG - Set process to nice value: 10
2016-03-03 15:24:50,282 - DEBUG - Offloading Autotest results in /usr/local/autotest/results
2016-03-03 15:24:50,282 - DEBUG - Running 'ip addr show eth0 | grep link/ether'
2016-03-03 15:24:50,287 - DEBUG - Offloading to: gs://chromeos-moblab-goog/results/2c:60:0c:7c:60:d1/d4a35e56-8c32-11e5-8f66-2c600c7c60d1/
2016-03-03 15:24:54,280 - DEBUG - Start of offload cycle - found 2973 new jobs
2016-03-03 16:47:41,810 - ERROR - Offloading 38162-moblab timed out after waiting 3600 seconds.
2016-03-03 16:47:41,818 - ERROR - Error occurred when offloading 38162-moblab:

But then it restarts after that... gs_offloader -s also has issues: gs_offloader_hosts_log_.txt

2016-03-03 16:27:09,967 - ERROR - Error occurred when offloading hosts/192.168.231.100/15089-repair:
2016-03-03 16:34:19,073 - DEBUG - End of offload cycle - cleared 80 new jobs, carrying 479 open jobs
2016-03-03 16:34:24,192 - DEBUG - Start of offload cycle - found 32 new jobs
2016-03-03 17:37:46,530 - ERROR - Offloading hosts/192.168.231.100/15089-repair timed out after waiting 3600 seconds.
2016-03-03 17:37:46,534 - ERROR - Error occurred when offloading hosts/192.168.231.100/15089-repair:
2016-03-03 18:14:19,459 - DEBUG - End of offload cycle - cleared 98 new jobs, carrying 413 open jobs
2016-03-03 18:14:19,461 - DEBUG - Currently there are 1 jobs with offload failures
2016-03-03 18:14:19,461 - DEBUG - Reporting failures by e-mail
2016-03-03 18:14:19,462 - ERROR - Failed to send email to chromeos-test-cron+cautotest@google.com: Credential file does not exist: /usr/local/autotest/USE SHADOW GMAIL_API_CREDENTIALS. If this is a prod server, puppet should install it. If you need to be able to send email, find the credential file from chromeos-admin repo and copy it to /usr/local/autotest/USE SHADOW GMAIL_API_CREDENTIALS
2016-03-03 18:14:24,493 - DEBUG - Start of offload cycle - found 24 new jobs

But at least it seems to be slowly uploading results nevertheless. Actually, what I wonder is: when does gs_offloader start deleting old results? When it manages to upload all jobs? Should it be uploading a few, deleting those, and repeating? I'm going to let it run overnight and see if anything gets deleted (I did a manual cleanup between #6 and #7, which is why it's smaller now):

# du -sm /mnt/moblab/results/; du -sm /mnt/moblab/results/hosts/
51138 /mnt/moblab/results/
543 /mnt/moblab/results/hosts/
,
Mar 3 2016
The gs_offloader code shows that it offloads one directory at a time; if the offload succeeds, it deletes the directory. My speculation is that the directory contains too many files and the offload cannot finish within 1 hour, so the operation fails and gs_offloader keeps retrying without ever moving forward. There is a parameter called "gs_offloader_limit_file_count" in the moblab setup. Could you try setting it to True? With it enabled, gs_offloader tries to reduce the number of files under a directory to fewer than 500 (by tarring them), which may speed up the offloading.
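For reference, the per-directory flow described above boils down to roughly the following. This is a simplified sketch only, not the actual gs_offloader code: the gsutil flags are reduced to a plain recursive copy, the coreutils "timeout" wrapper stands in for the internal one-hour limit, and the bucket path is the one from the logs in this thread.

import shutil
import subprocess

# Assumption: bucket path copied from the log excerpts earlier in this thread.
GS_URI = 'gs://chromeos-moblab-goog/results/2c:60:0c:7c:60:d1/d4a35e56-8c32-11e5-8f66-2c600c7c60d1/'
TIMEOUT_SECS = 3600  # the one-hour limit discussed above


def offload_dir(job_dir):
    # Copy one result directory to Google Storage under a hard time limit.
    cmd = ['timeout', str(TIMEOUT_SECS), 'gsutil', 'cp', '-R', job_dir, GS_URI]
    if subprocess.call(cmd) == 0:
        # Only delete the local copy once the upload finished successfully.
        shutil.rmtree(job_dir)
    # Otherwise the directory stays in place and the next cycle retries it,
    # which is how a directory that always times out ends up pinning disk space.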
,
Mar 4 2016
Huh, I just tried to add "-p 4 -m" to the command line:

sudo -u moblab /usr/local/autotest/site_utils/gs_offloader.py -p 4 -m --days_old=1 --log_size=200

And it seems to be working much better: results are being uploaded, and space is being freed up. I guess that "-p 4" makes it more likely that at least some upload jobs go through, and "-m" makes each of them less likely to time out?
,
Mar 4 2016
That would definitely help. -p lets gs_offloader start multiple processes that each offload a directory in parallel. -m lets gsutil use multiple threads to upload the files under each directory in parallel. The only concern is that gs_offloader will take more resources from the moblab device. We could also increase the timeout beyond 1 hour, but it is a constant, so you would need to change the code. Let's see how things go and adjust the default configuration in the future.
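Roughly, the combination of the two flags amounts to something like the sketch below. This is an illustration only, not the real implementation: the directory names and bucket URI are taken from this thread, and gs_offloader uses its own work queue rather than a plain multiprocessing.Pool.

import multiprocessing
import subprocess

# Assumption: bucket path copied from the log excerpts earlier in this thread.
GS_URI = 'gs://chromeos-moblab-goog/results/2c:60:0c:7c:60:d1/d4a35e56-8c32-11e5-8f66-2c600c7c60d1/'


def offload_one(job_dir):
    # "-m" makes gsutil itself upload the files inside the directory with
    # multiple threads, which helps directories holding thousands of small files.
    return subprocess.call(['gsutil', '-m', 'cp', '-R', job_dir, GS_URI])


if __name__ == '__main__':
    # Example directory names taken from this thread.
    job_dirs = ['38162-moblab', '35161-moblab']
    # The "-p 4" part: offload up to 4 directories at the same time.
    pool = multiprocessing.Pool(processes=4)
    pool.map(offload_one, job_dirs)
    pool.close()
    pool.join()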
,
Mar 4 2016
Indeed, `-p 4 -m` uses 100% CPU ;-) It's "nice" CPU though, so it should not have too much performance impact. `-p 2 -m` seems to use 50%+ CPU, so that might be a good starting point. Increasing the timeout also makes sense (especially if we do not want `-m`), and gs_offloader_limit_file_count should also help (although I think it's usually more convenient if test results are not archived before upload...).
,
Mar 24 2016
Was this fixed?
,
Mar 25 2016
Michael, is there a fix we can put in to make offloading more reliable? Bumping up the priority in case it is not a hard fix; we will revisit if it is complicated.
,
Mar 25 2016
Nicolas, is your case a common one or a rare one? I would hesitate to change the default settings if the current defaults work most of the time. In the long term, we can add some logic to use more processes/threads and extend the one-hour timeout when retries are needed or offloads lag too far behind.
,
Mar 25 2016
Well, if anyone uses a moblab and encounters frequent Chrome crashes, I'm almost sure they'll hit this problem.
,
Mar 25 2016
I thought the problem was caused by the number of files inside a single results directory. Would frequent Chrome crashes create a large number of files, or just some big core dumps? Do we still have the result for job (id=35161) on GCS?
,
Mar 26 2016
The results are here: https://pantheon.corp.google.com/storage/browser/chromeos-moblab-goog/results/2c:60:0c:7c:60:d1/d4a35e56-8c32-11e5-8f66-2c600c7c60d1/35161-moblab/192.168.231.100/

drinkcat@drinkcat:~/tmp/192.168.231.100$ du -sm .
199 .
drinkcat@drinkcat:~/tmp/192.168.231.100$ find | wc -l
8848

You are right, the results contain many files, most of them being Chrome and UI logs (in **/var/log/[chrome|ui]). But I'm not sure what causes them to accumulate (I had a quick look at them and did not see anything abnormal).
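To pin down where the files accumulate, counting files per sub-tree can be more telling than total size. A small sketch, meant to be run against a locally downloaded copy of a result directory like the one above:

import collections
import os
import sys

# Count files per top two path components of a results tree, so the noisiest
# sub-trees (e.g. sysinfo/var, which holds the chrome and ui logs) stand out.
root = sys.argv[1] if len(sys.argv) > 1 else '.'
counts = collections.Counter()
for dirpath, dirnames, filenames in os.walk(root):
    rel = os.path.relpath(dirpath, root)
    key = os.sep.join(rel.split(os.sep)[:2])  # bucket by the first two components
    counts[key] += len(filenames)

for subdir, n in counts.most_common(15):
    print('%6d  %s' % (n, subdir))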
,
Mar 28 2016
I checked the directory. It contains 8721 files. Uploading it with "gsutil" at the command line took about 40 minutes from my desktop (with corporate network bandwidth), so it is very possible gs_offloader would time out (1 hour). When that happens, gs_offloader gets stuck on that directory and cannot offload the other directories; that is why moblab takes up more and more disk space until it crashes. I will add a new moblab configuration parameter to turn on the "-m" option for gsutil. Users can turn it on if their tests produce result directories with many files.
,
Mar 29 2016
Sounds reasonable, thanks.
,
May 18 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0df2eb4e1e8638a1a165e8f2dc3b9230f85acf06

commit 0df2eb4e1e8638a1a165e8f2dc3b9230f85acf06
Author: Michael Tang <ntang@google.com>
Date: Sat May 14 02:06:54 2016

Add multiprocessing setting for gs_offloader.

If the command line flag "multiprocessing" is set, it turns on the -m option for gsutil when gs_offloarder runs. If the flag is not set, the global config setting "gs_offloader_multiprocessing" under the CROS section is used, which is default to False.

BUG= chromium:590929
TEST=unit test and test on device.

Change-Id: I6580866b7691e1be29977f271dfae8d67e2e06cc
Reviewed-on: https://chromium-review.googlesource.com/344832
Commit-Ready: Michael Tang <ntang@chromium.org>
Tested-by: Michael Tang <ntang@chromium.org>
Reviewed-by: Michael Tang <ntang@chromium.org>
Reviewed-by: Nicolas Boichat <drinkcat@chromium.org>

[modify] https://crrev.com/0df2eb4e1e8638a1a165e8f2dc3b9230f85acf06/site_utils/gs_offloader_unittest.py
[modify] https://crrev.com/0df2eb4e1e8638a1a165e8f2dc3b9230f85acf06/site_utils/gs_offloader.py
[modify] https://crrev.com/0df2eb4e1e8638a1a165e8f2dc3b9230f85acf06/global_config.ini
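Based on that commit message, one way to turn the option on without the command-line flag would be a global-config entry like the following. The ini syntax and file location are assumptions on my part; the key name and CROS section come from the commit message, so double-check against the change itself:

[CROS]
# Assumption: key and section names taken from the commit message above.
gs_offloader_multiprocessing: True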
,
Jun 27 2016
Closing... please feel free to reopen if it's not fixed.