
Issue 590929 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




moblab: Storage fills up with chrome core dumps

Project Member Reported by drinkcat@chromium.org, Mar 1 2016

Issue description

My current platform (oak) is quite unstable, and moblab storage fills up with Chrome core dumps, e.g. /mnt/stateful_partition/dev_image/moblab-storage/results/38839-moblab/192.168.231.116/sysinfo/var/spool/crash/chrome.20160227.182205.14494.core.

I believe those should be taken care of by gs offloading, but they don't seem to ever get deleted on local storage.
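
To quantify how much of that space is actually core dumps, a generic one-liner like this should do (an untested sketch; GNU find/du assumed, using the results path from above):

# find /mnt/stateful_partition/dev_image/moblab-storage/results -name '*.core' -print0 | du -cm --files0-from=- | tail -n1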
 
As a consequence, when the moblab reboots every 1-2 days with an OOM oops (probably Issue 549811), it does not resume autotest operation, due to this check in /etc/init/moblab-external-storage-init.conf:

    if [ ${diskspace} -lt ${MIN_STATEFUL_DISKSPACE} ]; then
      logger -t "${UPSTART_JOB}" "Moblab will not launch. Please insert a"
      logger -t "${UPSTART_JOB}" "properly configured drive and reboot."
      exit 1
    fi

(MIN_STATEFUL_DISKSPACE=32000000)
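
For reference, a rough way to see the number that check compares against (a sketch; I'm assuming diskspace comes from df on the stateful partition, in whatever units df reports, which may not match the script exactly):

# df -P /mnt/stateful_partition | awk 'NR==2 {print $4}'   # compare to MIN_STATEFUL_DISKSPACE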

(Manual) workaround is:
# cd /mnt/stateful_partition/dev_image/moblab-storage/results
# rm `find -size +100M`
# reboot
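
A narrower variant that only removes the core dumps (untested sketch, same results directory as above):
# cd /mnt/stateful_partition/dev_image/moblab-storage/results
# find . -name '*.core' -delete
# reboot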

(Googlers: feel free to have a look at 100.102.7.217)

Comment 2 by jean@chromium.org, Mar 1 2016

Cc: ntang@chromium.org stephenlin@chromium.org
Owner: sbasi@chromium.org
Simran, is this something easy to fix, or something that Michael can help with?

Comment 3 by sbasi@chromium.org, Mar 1 2016

Hmm gs_offloader seems to be having some trouble with these larger files.

I think someone will have to deep dive.

The script is a self-contained utility in /usr/local/autotest/site_utils/gs_offloader.py

If Michael can take a look that would be great.

Comment 4 by ntang@google.com, Mar 3 2016

I looked at the gs_offloader code and monitored the log on the moblab device several times today. I saw gs_offloader time out after 3600 seconds several times on the job "35161". Before the timeout, the offloader already seems to have copied 9940 files. I wonder if the job produced too many output files and gsutil could not finish within one hour.

Since all the test results have been deleted on the device for now, could you monitor the gs_offloader log and see if this happens again? If it does, you will probably see gs_offloader stuck offloading the same job result every hour.

My moblab rebooted 3 hours ago, and /mnt/moblab did not get mounted as the partition was quite full (what I describe in #1)...

Then I imitated what moblab-external-storage-init.conf would do (the results are not deleted, they are just not linked properly anymore):
mount -o bind /usr/local/moblab-storage/ /mnt/moblab/
rm -rf /usr/local/autotest/results/
ln -s /mnt/moblab/results /usr/local/autotest/results

du -sm /mnt/moblab/results/
75403	/mnt/moblab/results/

du -sm /mnt/moblab/results/hosts/
1263	/mnt/moblab/results/hosts/

75G is a lot, so the 1 hour timeout idea in #4 makes sense...
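
To see which job directories dominate (generic inspection only):

# du -sm /mnt/moblab/results/* | sort -n | tail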

And trying to run the offloader manually:

time sudo -u moblab /usr/local/autotest/site_utils/gs_offloader.py --days_old=1 --log_size=200 | tee /var/log/gs_offload.log
DEBUG:root:Set process to nice value: 10
DEBUG:root:Offloading Autotest results in /usr/local/autotest/results
DEBUG:root:Running 'ip addr show eth0 | grep link/ether'
DEBUG:root:Offloading to: gs://chromeos-moblab-goog/results/2c:60:0c:7c:60:d1/d4a35e56-8c32-11e5-8f66-2c600c7c60d1/
Copying file:///tmp/tmpv0wpVy [Content-Type=application/octet-stream]...
Removing gs://chromeos-moblab-goog/results/2c:60:0c:7c:60:d1/d4a35e56-8c32-11e5-8f66-2c600c7c60d1/tmpv0wpVy...
DEBUG:root:Start of offload cycle - found 2973 new jobs

Will let it run and let you know...
Command above failed with this:

FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
FAILED RPC CALL: get_jobs {'finished': True, 'id': 41785}
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/gs_offloader.py", line 652, in <module>
    main()
  File "/usr/local/autotest/site_utils/gs_offloader.py", line 647, in main
    offloader.offload_once()
  File "/usr/local/autotest/site_utils/gs_offloader.py", line 557, in offload_once
    job.enqueue_offload(queue, self._age_limit)
  File "/usr/local/autotest/site_utils/job_directories.py", line 153, in enqueue_offload
    timestamp = self.get_timestamp_if_finished()
  File "/usr/local/autotest/site_utils/job_directories.py", line 223, in get_timestamp_if_finished
    entry = _AFE.get_jobs(id=self._id, finished=True)
  File "/usr/local/autotest/server/frontend.py", line 293, in get_jobs
    jobs_data = self.run('get_jobs', **dargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 83, in run
    self, call, **dargs)
  File "/usr/lib64/python2.7/site-packages/chromite/lib/retry_util.py", line 86, in GenericRetry
    time.sleep(sleep_time)
  File "/usr/lib64/python2.7/site-packages/chromite/lib/timeout_util.py", line 53, in kill_us
    raise TimeoutError(error_message % {'time': max_run_time})
chromite.lib.timeout_util.TimeoutError: Timeout occurred- waited 1800 seconds.

real	30m5.745s
user	0m3.546s
sys	0m0.839s

I guess I need autotest to be running for gs_offloader to work...

Indeed, gs_offloader is timing out: gs_offloader_jobs_log_.txt

2016-03-03 15:24:50,281 - DEBUG - Set process to nice value: 10
2016-03-03 15:24:50,282 - DEBUG - Offloading Autotest results in /usr/local/autotest/results
2016-03-03 15:24:50,282 - DEBUG - Running 'ip addr show eth0 | grep link/ether'
2016-03-03 15:24:50,287 - DEBUG - Offloading to: gs://chromeos-moblab-goog/results/2c:60:0c:7c:60:d1/d4a35e56-8c32-11e5-8f66-2c600c7c60d1/
2016-03-03 15:24:54,280 - DEBUG - Start of offload cycle - found 2973 new jobs
2016-03-03 16:47:41,810 - ERROR - Offloading 38162-moblab timed out after waiting 3600 seconds.
2016-03-03 16:47:41,818 - ERROR - Error occurred when offloading 38162-moblab:

But then it restarts after that...

gs_offloader -s also has issues: gs_offloader_hosts_log_.txt
2016-03-03 16:27:09,967 - ERROR - Error occurred when offloading hosts/192.168.231.100/15089-repair:
2016-03-03 16:34:19,073 - DEBUG - End of offload cycle - cleared 80 new jobs, carrying 479 open jobs
2016-03-03 16:34:24,192 - DEBUG - Start of offload cycle - found 32 new jobs
2016-03-03 17:37:46,530 - ERROR - Offloading hosts/192.168.231.100/15089-repair timed out after waiting 3600 seconds.
2016-03-03 17:37:46,534 - ERROR - Error occurred when offloading hosts/192.168.231.100/15089-repair:
2016-03-03 18:14:19,459 - DEBUG - End of offload cycle - cleared 98 new jobs, carrying 413 open jobs
2016-03-03 18:14:19,461 - DEBUG - Currently there are 1 jobs with offload failures
2016-03-03 18:14:19,461 - DEBUG - Reporting failures by e-mail
2016-03-03 18:14:19,462 - ERROR - Failed to send email to chromeos-test-cron+cautotest@google.com: Credential file does not exist: /usr/local/autotest/USE SHADOW GMAIL_API_CREDENTIALS. If this is a prod server, puppet should install it. If you need to be able to send email, find the credential file from chromeos-admin repo and copy it to /usr/local/autotest/USE SHADOW GMAIL_API_CREDENTIALS
2016-03-03 18:14:24,493 - DEBUG - Start of offload cycle - found 24 new jobs

But at least it seems to be slowly uploading results.

Actually, what I wonder is: when does gs_offloader start deleting old results? Only once it manages to upload all jobs? Should it be uploading a few, deleting those, and repeating?

I'm going to let it run overnight, see if anything gets deleted (I did a manual cleanup between #6 and #7, which is why it's smaller now):
# du -sm /mnt/moblab/results/; du -sm /mnt/moblab/results/hosts/
51138	/mnt/moblab/results/
543	/mnt/moblab/results/hosts/

Comment 8 by ntang@google.com, Mar 3 2016

The gs_offloader code shows that it tries to offload one directory at a time. If the offloading operation succeeds, it deletes the directory.

My speculation is that the directory contains too many files, so the offloading operation cannot finish within 1 hour. The operation then fails, and gs_offloader retries but is never able to move forward.

There is a parameter called "gs_offloader_limit_file_count" in the moblab setup. Could you try setting it to True? With it on, gs_offloader reduces the number of files under a directory to fewer than 500 (by tarring them up), which may speed up the offloading.
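
For readers following along, a rough sketch of the cycle described above (generic timeout/gsutil commands and a placeholder bucket; the real logic is the Python in gs_offloader.py, not this):

    for dir in /usr/local/autotest/results/*-moblab; do
      # (with gs_offloader_limit_file_count on, a directory with too many files
      # would first be tarred so gsutil has far fewer objects to upload)
      if timeout 3600 gsutil cp -r "$dir" gs://some-bucket/results/; then
        rm -rf "$dir"   # the local copy is removed only after a successful upload
      fi                # on a timeout it stays and is retried next cycle
    done
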
Huh, I just tried to add "-p 4 -m" to the command line:
sudo -u moblab /usr/local/autotest/site_utils/gs_offloader.py -p 4 -m --days_old=1 --log_size=200

And it seems to be working much better: results are being uploaded, and space is being freed up.

I guess that "-p 4" makes it more likely that at least some upload jobs go through, and "-m" makes each of them less likely to time out?

Comment 10 by ntang@google.com, Mar 4 2016

That would definitely help. -p allows you to start multiple processes, each offloading a different directory in parallel. -m allows gsutil to use multiple threads to upload the files under each directory in parallel. The only concern is that gs_offloader will take more resources from the moblab device.

We could also increase the timeout beyond 1 hour, but it is a constant, so that requires a code change.

Let's see how well things go and adjust the default configuration in the future.
Indeed, `-p 4 -m` uses 100% CPU ,-) It is "nice"d CPU though, so it should not have too much performance impact.

`-p 2 -m` seems to be using 50%+ CPU, so that might be a good starting point.
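
i.e. something along these lines, reusing the flags from the command in the earlier comment, just with `-p 2`:
# sudo -u moblab /usr/local/autotest/site_utils/gs_offloader.py -p 2 -m --days_old=1 --log_size=200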

Increasing the timeout also makes sense (especially if we do not want `-m`), and gs_offloader_limit_file_count should also help (although I think it's usually more convenient if test results are not archived before upload...)
Was this fixed? 

Comment 13 by jean@chromium.org, Mar 25 2016

Labels: -Pri-2 Pri-1
Owner: ntang@chromium.org
Michael,
Is there a fix we can put in for offloading to be more reliable?
Bumping up the priority in case it is not a hard fix; will revisit if it is complicated.

Comment 14 by ntang@google.com, Mar 25 2016

Nicolas, is your case a common one or a rare one? I would hesitate to change the default settings if the current defaults work most of the time. In the long term, we can add some logic to use more processes/threads and to extend the one-hour timeout when retrying, or when offloads lag too far behind.
Well, if anyone uses a moblab and encounters frequent Chrome crashes, I'm almost sure they'll hit this problem.

Comment 16 by ntang@google.com, Mar 25 2016

I thought the problem was caused by the number of files inside a single results directory. Would frequent Chrome crashes create a large number of files, or just some big core dumps? Do we still have the result for job (id=35161) on GCS? 
The results are here: https://pantheon.corp.google.com/storage/browser/chromeos-moblab-goog/results/2c:60:0c:7c:60:d1/d4a35e56-8c32-11e5-8f66-2c600c7c60d1/35161-moblab/192.168.231.100/

drinkcat@drinkcat:~/tmp/192.168.231.100$ du -sm .
199     .
drinkcat@drinkcat:~/tmp/192.168.231.100$ find | wc -l
8848

You are right, the results contain many files, most of them being Chrome and UI logs (in **/var/log/[chrome|ui]). But I'm not sure what causes them to accumulate (I had a quick look and did not see anything abnormal).
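
For what it's worth, a generic way to see where the files pile up inside such a results directory, grouping by the first few path components (adjust the cut depth as needed):

    find . -type f | cut -d/ -f2-5 | sort | uniq -c | sort -rn | head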

Comment 18 by ntang@google.com, Mar 28 2016

I checked the directory. It contains 8721 files. Uploading it with "gsutil" from the command line took about 40 minutes from my desktop (on corporate network bandwidth), so it is very possible that gs_offloader would hit the 1-hour timeout. When that happens, gs_offloader gets stuck on that directory and cannot offload other directories, which is why the moblab uses more and more disk space until it crashes.

I will add a new moblab configuration parameter to turn on the "-m" for gsutil. Users could turn it on if their tests produce result directories with many files.
Status: Assigned (was: Available)
Sounds reasonable, thanks.
Components: Infra>Client>ChromeOS
Labels: -Infra-ChromeOS

Project Member Comment 21 by bugdroid1@chromium.org, May 18 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0df2eb4e1e8638a1a165e8f2dc3b9230f85acf06

commit 0df2eb4e1e8638a1a165e8f2dc3b9230f85acf06
Author: Michael Tang <ntang@google.com>
Date: Sat May 14 02:06:54 2016

Add multiprocessing setting for gs_offloader.

If the command line flag "multiprocessing" is set, it turns on the -m option
for gsutil when gs_offloader runs. If the flag is not set, the global config
setting "gs_offloader_multiprocessing" under the CROS section is used, which
defaults to False.

BUG= chromium:590929 
TEST=unit test and test on device.

Change-Id: I6580866b7691e1be29977f271dfae8d67e2e06cc
Reviewed-on: https://chromium-review.googlesource.com/344832
Commit-Ready: Michael Tang <ntang@chromium.org>
Tested-by: Michael Tang <ntang@chromium.org>
Reviewed-by: Michael Tang <ntang@chromium.org>
Reviewed-by: Nicolas Boichat <drinkcat@chromium.org>

[modify] https://crrev.com/0df2eb4e1e8638a1a165e8f2dc3b9230f85acf06/site_utils/gs_offloader_unittest.py
[modify] https://crrev.com/0df2eb4e1e8638a1a165e8f2dc3b9230f85acf06/site_utils/gs_offloader.py
[modify] https://crrev.com/0df2eb4e1e8638a1a165e8f2dc3b9230f85acf06/global_config.ini
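
For anyone else hitting this, the new setting can then be turned on via the global config. The section and key come from the commit message above; exactly where to set it on a moblab (e.g. in a shadow config override) is an assumption on my part:

    [CROS]
    gs_offloader_multiprocessing: True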

Comment 22 by ntang@google.com, May 25 2016

Status: Fixed (was: Assigned)
Closing... please feel free to reopen if it's not fixed.
Status: Verified (was: Fixed)
