send_to_lucifer took ~1 hour in a single master tick
Issue description

Observed overnight on the master scheduler. A single tick took ~1 hour to complete.

06/05 01:12:20.906 DEBUG| monitor_db:1246| New tick
06/05 01:14:30.016 DEBUG| monitor_db:1246| New tick
06/05 01:15:30.243 DEBUG| monitor_db:1246| New tick
06/05 01:16:30.003 DEBUG| monitor_db:1246| New tick
06/05 01:39:59.627 DEBUG| monitor_db:1246| New tick
06/05 02:09:46.715 DEBUG| monitor_db:1246| New tick
06/05 03:06:26.974 DEBUG| monitor_db:1246| New tick
06/05 03:39:29.604 DEBUG| monitor_db:1246| New tick
06/05 03:43:09.680 DEBUG| monitor_db:1246| New tick
06/05 03:55:47.459 DEBUG| monitor_db:1246| New tick
06/05 03:57:05.465 DEBUG| monitor_db:1246| New tick
06/05 03:58:45.211 DEBUG| monitor_db:1246| New tick
06/05 03:59:15.114 DEBUG| monitor_db:1246| New tick
06/05 03:59:59.000 DEBUG| monitor_db:1246| New tick
06/05 04:02:17.467 DEBUG| monitor_db:1246| New tick
06/05 04:02:59.217 DEBUG| monitor_db:1246| New tick

From a snippet of one of the ticks, _send_to_lucifer took most of the time:

06/05 02:09:47.235 DEBUG| monitor_db:1246| Starting _garbage_collection
06/05 02:09:47.235 DEBUG| monitor_db:1246| Starting _drone_manager.trigger_refresh
06/05 02:09:47.285 DEBUG| monitor_db:1246| Starting _schedule_running_host_queue_entries
06/05 02:10:11.328 DEBUG| monitor_db:1246| Starting _schedule_special_tasks
06/05 02:10:13.398 DEBUG| monitor_db:1246| Starting _schedule_new_jobs
06/05 02:10:54.229 DEBUG| monitor_db:1246| Starting _drone_manager.sync_refresh
06/05 02:10:54.280 DEBUG| monitor_db:1246| Starting _send_to_lucifer
06/05 03:04:23.357 DEBUG| monitor_db:1246| Starting _run_cleanup
06/05 03:06:21.461 DEBUG| monitor_db:1246| Starting _find_aborting
06/05 03:06:24.966 DEBUG| monitor_db:1246| Starting _find_aborted_special_tasks
06/05 03:06:25.088 DEBUG| monitor_db:1246| Starting _handle_agents
06/05 03:06:25.479 DEBUG| monitor_db:1246| Starting _host_scheduler.tick
06/05 03:06:25.479 DEBUG| monitor_db:1246| Starting _drone_manager.execute_actions
06/05 03:06:26.964 DEBUG| monitor_db:1246| Starting email_manager.manager.send_queued_emails
06/05 03:06:26.964 DEBUG| monitor_db:1246| Starting django.db.reset_queries
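For context, the tick runs these phases sequentially, so the gap between consecutive "Starting ..." timestamps is the time spent in the phase that just started. The 02:10:54 to 03:04:23 gap is what points at _send_to_lucifer. A minimal sketch of logging per-phase durations directly, so a slow phase stands out without subtracting timestamps by hand (illustrative only, not the real monitor_db code; the phase list and dispatcher lookup are assumptions):

import logging
import time

def run_tick_phases(dispatcher):
    # Phase names mirror the log above; attribute lookups are illustrative
    # and may not match the real Dispatcher class.
    phase_names = [
        '_garbage_collection',
        '_schedule_running_host_queue_entries',
        '_schedule_special_tasks',
        '_schedule_new_jobs',
        '_send_to_lucifer',
        '_run_cleanup',
        '_find_aborting',
        '_handle_agents',
    ]
    for name in phase_names:
        start = time.time()
        logging.debug('Starting %s', name)
        getattr(dispatcher, name)()
        logging.debug('Finished %s in %.1f s', name, time.time() - start)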
Jun 5 2018
Within send_to_lucifer, a lot of the time is spent _between_ processing the various drones. e.g.:

06/05 02:50:24.668 DEBUG| ssh_host:0301| Running (ssh) '/usr/bin/env /usr/local/autotest/bin/job_reporter --jobdir /usr/local/autotest/leases --run-job-path /opt/infra-tools/usr/bin/lucifer_run_job --lucifer-level STARTING --job-id 205676515 --results-dir /usr/local/autotest/results/205676515-chromeos-test/hostless --execution-tag 205676515-chromeos-test/hostless </dev/null >>/usr/local/autotest/results/205676515-chromeos-test/hostless/lucifer/job_reporter_output.log 2>&1 &' from '_send_to_lucifer|_send_starting_to_lucifer|spawn_starting_job_handler|spawn|run|run_very_slowly'
06/05 03:02:38.949 INFO | drone_manager:0688| Checking drone cros-full-0026.mtv.corp.google.com
...
06/05 02:15:54.753 DEBUG| ssh_host:0301| Running (ssh) '/usr/bin/env /usr/local/autotest/bin/job_reporter --jobdir /usr/local/autotest/leases --run-job-path /opt/infra-tools/usr/bin/lucifer_run_job --lucifer-level STARTING --job-id 205673639 --results-dir /usr/local/autotest/results/205673639-chromeos-test/hostless --execution-tag 205673639-chromeos-test/hostless </dev/null >>/usr/local/autotest/results/205673639-chromeos-test/hostless/lucifer/job_reporter_output.log 2>&1 &' from '_send_to_lucifer|_send_starting_to_lucifer|spawn_starting_job_handler|spawn|run|run_very_slowly'
06/05 02:20:30.440 INFO | drone_manager:0688| Checking drone cros-full-0026.mtv.corp.google.com
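The gap between each "Running (ssh) ... job_reporter ..." line and the next "Checking drone" line suggests each per-job ssh spawn is where the time goes. A minimal sketch, assuming hypothetical names (spawn_job_reporters and ssh_host.run are illustrative, not the real drone_manager/ssh_host API), of logging the wall-clock cost of each spawn so slow calls are visible directly:

import logging
import time

def spawn_job_reporters(ssh_host, job_reporter_commands):
    # job_reporter_commands: one fully formed job_reporter shell command per
    # STARTING job, as in the log lines above.
    for cmd in job_reporter_commands:
        start = time.time()
        logging.debug('Running (ssh) %r', cmd)
        ssh_host.run(cmd)  # placeholder for the real spawn call
        elapsed = time.time() - start
        if elapsed > 5:
            logging.warning('job_reporter spawn took %.1f s', elapsed)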
Jun 5 2018
The bump in time spent in _send_to_lucifer is clearly borne out in the tick drill-down graph: https://viceroy.corp.google.com/chromeos/tick_breakdown#_VG_l57H3D2E
Jun 5 2018
This is very likely another instance of issue 848337. Unlike that issue, this was caused by hostless jobs, so the suggested mitigation of moving boards to shards won't work here. Note that this was a loopback SSH drone_manager call onto the master server itself. I only see wacky network IO in this timeframe; no other indications of server overloading: http://shortn/_L5eCnjhWgk
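To illustrate what "loopback SSH" means here: the master also acts as a drone, so each job_reporter spawn goes through ssh back onto the same machine. A hypothetical sketch (not a proposed fix and not the real drone_manager code; run_on_drone is an invented helper) of short-circuiting that round trip when the target drone is the local host:

import socket
import subprocess

def run_on_drone(hostname, command):
    # Hypothetical helper: if the "drone" is this machine, run the command
    # locally instead of ssh-ing back into ourselves.
    if hostname in ('localhost', socket.gethostname(), socket.getfqdn()):
        return subprocess.run(command, shell=True, check=True)
    # Remote drone: fall back to ssh (BatchMode avoids interactive prompts).
    return subprocess.run(['ssh', '-o', 'BatchMode=yes', hostname, command],
                          check=True)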
Comment 1 by pprabhu@chromium.org, Jun 5 2018