
Issue 849715

Starred by 1 user

Issue metadata

Status: Duplicate
Owner: ----
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




send_to_lucifer took ~1 hour in a single master tick

Project Member Reported by pprabhu@chromium.org, Jun 5 2018

Issue description

Observed overnight on master scheduler.

A single tick took ~1 hour to complete (e.g. the tick logged at 02:09:46 below was not followed by another "New tick" line until 03:06:26, roughly 57 minutes later).

06/05 01:12:20.906 DEBUG|        monitor_db:1246| New tick
06/05 01:14:30.016 DEBUG|        monitor_db:1246| New tick
06/05 01:15:30.243 DEBUG|        monitor_db:1246| New tick
06/05 01:16:30.003 DEBUG|        monitor_db:1246| New tick
06/05 01:39:59.627 DEBUG|        monitor_db:1246| New tick
06/05 02:09:46.715 DEBUG|        monitor_db:1246| New tick
06/05 03:06:26.974 DEBUG|        monitor_db:1246| New tick
06/05 03:39:29.604 DEBUG|        monitor_db:1246| New tick
06/05 03:43:09.680 DEBUG|        monitor_db:1246| New tick
06/05 03:55:47.459 DEBUG|        monitor_db:1246| New tick
06/05 03:57:05.465 DEBUG|        monitor_db:1246| New tick
06/05 03:58:45.211 DEBUG|        monitor_db:1246| New tick
06/05 03:59:15.114 DEBUG|        monitor_db:1246| New tick
06/05 03:59:59.000 DEBUG|        monitor_db:1246| New tick
06/05 04:02:17.467 DEBUG|        monitor_db:1246| New tick
06/05 04:02:59.217 DEBUG|        monitor_db:1246| New tick


From a snippet of one of the ticks, send_to_lucifer took most of the time:

06/05 02:09:47.235 DEBUG|        monitor_db:1246| Starting _garbage_collection
06/05 02:09:47.235 DEBUG|        monitor_db:1246| Starting _drone_manager.trigger_refresh
06/05 02:09:47.285 DEBUG|        monitor_db:1246| Starting _schedule_running_host_queue_entries
06/05 02:10:11.328 DEBUG|        monitor_db:1246| Starting _schedule_special_tasks
06/05 02:10:13.398 DEBUG|        monitor_db:1246| Starting _schedule_new_jobs
06/05 02:10:54.229 DEBUG|        monitor_db:1246| Starting _drone_manager.sync_refresh
06/05 02:10:54.280 DEBUG|        monitor_db:1246| Starting _send_to_lucifer
06/05 03:04:23.357 DEBUG|        monitor_db:1246| Starting _run_cleanup
06/05 03:06:21.461 DEBUG|        monitor_db:1246| Starting _find_aborting
06/05 03:06:24.966 DEBUG|        monitor_db:1246| Starting _find_aborted_special_tasks
06/05 03:06:25.088 DEBUG|        monitor_db:1246| Starting _handle_agents
06/05 03:06:25.479 DEBUG|        monitor_db:1246| Starting _host_scheduler.tick
06/05 03:06:25.479 DEBUG|        monitor_db:1246| Starting _drone_manager.execute_actions
06/05 03:06:26.964 DEBUG|        monitor_db:1246| Starting email_manager.manager.send_queued_emails
06/05 03:06:26.964 DEBUG|        monitor_db:1246| Starting django.db.reset_queries
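
For reference, the per-stage durations can be read off by diffing the timestamps of consecutive "Starting ..." lines: in this snippet _send_to_lucifer ran from 02:10:54 until _run_cleanup started at 03:04:23, roughly 53 minutes. A minimal parsing sketch along those lines (the log file name and exact line format are assumptions based on the snippet, not part of monitor_db):

import re
from datetime import datetime

# Rough sketch: recover per-stage durations by diffing the timestamps of
# consecutive "Starting <stage>" lines in a monitor_db log. The log path and
# the exact line format are assumptions based on the snippet above.
LINE_RE = re.compile(
    r'^(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*\| Starting (?P<stage>\S+)')

def stage_durations(path):
    entries = []
    with open(path) as f:
        for line in f:
            m = LINE_RE.match(line)
            if m:
                ts = datetime.strptime(m.group('ts'), '%m/%d %H:%M:%S.%f')
                entries.append((ts, m.group('stage')))
    # Each stage runs until the next "Starting" line appears.
    for (start, stage), (end, _) in zip(entries, entries[1:]):
        print('%-55s %8.1fs' % (stage, (end - start).total_seconds()))

stage_durations('scheduler.log')  # hypothetical log file name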

 
Labels: -Pri-3 Pri-2
Master scheduler recovered on its own, so this is not currently P0/P1.
Within send_to_lucifer, a lot of the time is spent _between_ processing the various drones.
e.g.:

06/05 02:50:24.668 DEBUG|          ssh_host:0301| Running (ssh) '/usr/bin/env /usr/local/autotest/bin/job_reporter --jobdir /usr/local/autotest/leases --run-job-path /opt/infra-tools/usr/bin/lucifer_run_job --lucifer-level STARTING --job-id 205676515 --results-dir /usr/local/autotest/results/205676515-chromeos-test/hostless --execution-tag 205676515-chromeos-test/hostless </dev/null >>/usr/local/autotest/results/205676515-chromeos-test/hostless/lucifer/job_reporter_output.log 2>&1 &' from '_send_to_lucifer|_send_starting_to_lucifer|spawn_starting_job_handler|spawn|run|run_very_slowly'
06/05 03:02:38.949 INFO |     drone_manager:0688| Checking drone cros-full-0026.mtv.corp.google.com

...
06/05 02:15:54.753 DEBUG|          ssh_host:0301| Running (ssh) '/usr/bin/env /usr/local/autotest/bin/job_reporter --jobdir /usr/local/autotest/leases --run-job-path /opt/infra-tools/usr/bin/lucifer_run_job --lucifer-level STARTING --job-id 205673639 --results-dir /usr/local/autotest/results/205673639-chromeos-test/hostless --execution-tag 205673639-chromeos-test/hostless </dev/null >>/usr/local/autotest/results/205673639-chromeos-test/hostless/lucifer/job_reporter_output.log 2>&1 &' from '_send_to_lucifer|_send_starting_to_lucifer|spawn_starting_job_handler|spawn|run|run_very_slowly'
06/05 02:20:30.440 INFO |     drone_manager:0688| Checking drone cros-full-0026.mtv.corp.google.com
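
For what it's worth, the two pairs above correspond to gaps of roughly 12 minutes (02:50:24 -> 03:02:38) and 4.5 minutes (02:15:54 -> 02:20:30). A quick way to surface these stalls is to sort the gaps between consecutive log lines inside the _send_to_lucifer window; a rough sketch (again, log path and line format are assumptions):

import re
from datetime import datetime

# Rough sketch: list the largest gaps between consecutive log lines, which is
# where the multi-minute stalls between drones show up. The log path is a
# placeholder; the timestamp format matches the snippets above.
TS_RE = re.compile(r'^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ')

def largest_gaps(path, top=10):
    stamps = []
    with open(path) as f:
        for line in f:
            m = TS_RE.match(line)
            if m:
                ts = datetime.strptime(m.group(1), '%m/%d %H:%M:%S.%f')
                stamps.append((ts, line.rstrip()))
    gaps = [((cur - prev).total_seconds(), prev_line)
            for (prev, prev_line), (cur, _) in zip(stamps, stamps[1:])]
    # Print the lines after which the scheduler sat idle the longest.
    for seconds, line in sorted(gaps, reverse=True)[:top]:
        print('%8.1fs after: %s' % (seconds, line))

largest_gaps('scheduler.log')  # hypothetical log file name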

Cc: cra...@chromium.org
The bump in time spent in send_to_lucifer is clearly borne out in the tick drill-down graph: https://viceroy.corp.google.com/chromeos/tick_breakdown#_VG_l57H3D2E
Mergedinto: 848337
Status: Duplicate (was: Untriaged)
This is very likely another instance of issue 848337.

Unlike that issue, this was caused by hostless jobs, so the suggested mitigation of moving boards to shards won't work here.

Note that this was a loopback SSH drone_manager call onto the master server itself. The only anomaly I see in this timeframe is some wacky network I/O; there are no other indications of server overloading: http://shortn/_L5eCnjhWgk
