Investigate SQL performance in the scheduler
Issue description

Right now a lot of the scheduler's overhead is in database requests:

- One part is issue 713841, where the SQL database has to hit the disk for almost every request because the SQL cache is too small. That issue is understood and we have a solution.
- Another part is the fact that Django is making lots of very small SQL requests. Each of these requests has pretty significant overhead, and in the end this costs us a lot of time on the scheduler side. This is what this issue is about.

I suspect we can do something about it, for example by optimizing the Django requests, or by adding prefetching to Django. Also, if we upgrade to a newer version, Django has support for persistent connections.
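For reference, a minimal sketch of the two Django-side ideas mentioned above (query prefetching and persistent connections). The model and relation names are illustrative stand-ins, not the actual AFE schema, and CONN_MAX_AGE assumes Django 1.6 or newer:

# Illustrative sketch only; assumes it lives inside a Django app, and the model
# names are placeholders rather than the real AFE schema.
from django.db import models


class Job(models.Model):
    name = models.CharField(max_length=255)


class Host(models.Model):
    hostname = models.CharField(max_length=255)


class HostQueueEntry(models.Model):
    job = models.ForeignKey(Job, on_delete=models.CASCADE)
    host = models.ForeignKey(Host, null=True, on_delete=models.CASCADE)


def entries_for_jobs(job_ids):
    # One query (with JOINs) for the entries and their job/host rows, instead of
    # a separate SELECT every time .job or .host is touched on each entry.
    return (HostQueueEntry.objects
            .filter(job_id__in=job_ids)
            .select_related('job', 'host'))

# Persistent connections (Django 1.6+): in settings.py set
#   DATABASES['default']['CONN_MAX_AGE'] = 300
# so each scheduler process reuses its MySQL connection instead of reconnecting.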
Apr 22 2017
strace from chromeos-server100. It is a large file.
Apr 22 2017
The log from #2 has 1.9 million lines and covers a time span of 16 min. Over these 16 minutes, we are doing 300000 database requests. The vast majority of these look like this:

18:29:11 write(3, "*\0\0\0\3SELECT * FROM afe_jobs WHERE ID=113765244", 46) = 46 <0.000146>
18:29:11 read(3, "\1\0\0\1\30>\0\0\2\3def\24chromeos_autotest_db\10afe_jobs\10afe_jobs\2id\2id\f?\0\v\0\0\0\3\3B\0\0\0D\0\0\3\3def\24chromeos_autotest_db\10afe_jobs\10afe_jobs\5owner\5owner\f!\0\375\2\0\0\375\10@\0\0\0B\0\0\4\3def\24chrome

These requests seem to originate from _fetch_row_from_db() in autotest/files/scheduler/scheduler_models.py:228.

So the bottleneck has nothing to do with Django, and I am updating the title to reflect this. I'll keep investigating.
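A rough sketch of the direction this points at: batching those per-row lookups into a single query. The get_connection() helper below is a placeholder, not an actual autotest API:

# Sketch only: collapse many "SELECT * FROM afe_jobs WHERE ID=x" round trips
# into one query. get_connection() stands in for the scheduler's real database
# handle; it is not an existing autotest function.
import MySQLdb


def get_connection():
    # Placeholder; the real scheduler already holds an open connection.
    return MySQLdb.connect(db='chromeos_autotest_db')


def fetch_job_rows(job_ids):
    """Fetch all requested afe_jobs rows in one round trip instead of one each."""
    if not job_ids:
        return {}
    placeholders = ', '.join(['%s'] * len(job_ids))
    query = 'SELECT * FROM afe_jobs WHERE id IN (%s)' % placeholders
    cursor = get_connection().cursor()
    cursor.execute(query, tuple(job_ids))
    # Key the rows by id (the first column) so callers can still look them up
    # individually without going back to the database.
    return dict((row[0], row) for row in cursor.fetchall())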
Apr 22 2017
To find out how much time is spent in the afe_jobs SQL requests, I ran:

cat strace.txt | grep -A 4 "SELECT \* FROM afe_jobs WHERE ID=" | cut -d "<" -f 2 | cut -f 1 -d ">" | grep ^0\. | awk '{s+=$1} END {print s}'

(Each request is followed by 4 more syscalls for reply/commit, so we have to count those too.)

The sum of these requests over the 16 minutes accounts for 190.937 seconds.
Apr 22 2017
Btw, since the database only accounts for ~20% of the time (190.937 s out of 960 s), there are probably other optimizations in the code that we should look at after this...
Jun 12 2017
I think pwang@ is working on this, e.g. https://chromium-review.googlesource.com/#/c/521687/
Jun 19 2017
I'm curious about the timestamps in the strace file.
I ran:

cat strace.txt | cut -d "<" -f 2 | cut -f 1 -d ">" | grep ^0\. | awk '{s+=$1} END {print s}'

And it seems only 243.271 seconds have passed in total, instead of 960 seconds (16 min)?
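A minimal sketch, assuming the log was produced with strace -t -T (a wall-clock timestamp on each line plus a <duration> per syscall): the <...> values only measure time spent inside syscalls, so their sum is expected to come out below the wall-clock span of the trace. Comparing the two directly:

# Minimal sketch, assuming "strace -t -T"-style lines such as:
#   18:29:11 write(3, "...", 46) = 46 <0.000146>
# Sums the per-syscall durations and compares them to the wall-clock span.
import re
import sys

TIMESTAMP_RE = re.compile(r'^(\d{2}):(\d{2}):(\d{2})')
DURATION_RE = re.compile(r'<(\d+\.\d+)>\s*$')


def main(path):
    first = last = None
    in_syscalls = 0.0
    with open(path) as f:
        for line in f:
            m = TIMESTAMP_RE.match(line)
            if m:
                h, mi, s = [int(g) for g in m.groups()]
                t = h * 3600 + mi * 60 + s
                if first is None:
                    first = t
                last = t
            d = DURATION_RE.search(line)
            if d:
                in_syscalls += float(d.group(1))
    print('wall-clock span: %d s' % (last - first))
    print('time inside syscalls: %.3f s' % in_syscalls)


if __name__ == '__main__':
    main(sys.argv[1])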
Jul 17 2017
ChromeOS Infra P1 Bugscrub. P1 bugs in this component should be important enough to get weekly status updates.
Is this already fixed? -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on the go/chromeos-infra-bug-slo rubric? -> lower priority
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign
Does this bug have the wrong owner? -> reassign
Bugs that remain in this state next week will be downgraded to P2.
Jul 24 2017
ChromeOS Infra P1 Bugscrub. Issue untouched in a week after previous message. Downgrading to P2.
May 15 2018