Issue 920850

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Feature




Add UMA metrics to measure request processing delays in hwsec daemons

Project Member Reported by louiscollard@chromium.org, Jan 11

Issue description

As part of Chongzu we will rework the hwsec daemons to be able to process multiple requests in parallel; design doc here:

https://docs.google.com/document/d/1p1ALEWQ33z2Ou27Pw15CauDczZ3SKTG3Rih5qtSpxbU/edit

There are several ways this can be done, and it would be helpful to have more information about the current performance bottlenecks in real-world usage, so that we can ensure these are resolved by our final design.

We should add some UMA metrics to measure this. For each incoming dbus request, we should record something like:

- depth of queue at time request is received
- time spent in the queue
- time spent being processed

(we need to record the metrics such that they can be broken down by e.g. method name)
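To make the per-method breakdown concrete, here is a minimal sketch of what the reporting side could look like, assuming the platform2 MetricsLibrary interface (SendToUMA); the histogram names, bucket ranges, and the RequestStats struct are hypothetical placeholders, not a proposed naming scheme:

  // Hypothetical sketch: report the three proposed metrics for one dbus request,
  // folding the method name into the histogram name so UMA can break the data
  // down per method. All names and ranges below are placeholders.
  #include <string>

  #include <base/time/time.h>
  #include <metrics/metrics_library.h>

  struct RequestStats {
    std::string method_name;        // e.g. "MountEx"
    int queue_depth_at_arrival;     // requests still outstanding when this one arrived
    base::TimeDelta time_in_queue;
    base::TimeDelta time_processing;
  };

  void ReportRequestStats(MetricsLibrary* metrics, const RequestStats& stats) {
    const std::string prefix = "Platform.Cryptohome.Request.";  // placeholder prefix
    metrics->SendToUMA(prefix + "QueueDepth." + stats.method_name,
                       stats.queue_depth_at_arrival,
                       /*min=*/1, /*max=*/50, /*nbuckets=*/25);
    metrics->SendToUMA(prefix + "QueueTimeMs." + stats.method_name,
                       static_cast<int>(stats.time_in_queue.InMilliseconds()),
                       /*min=*/1, /*max=*/60000, /*nbuckets=*/50);
    metrics->SendToUMA(prefix + "ProcessingTimeMs." + stats.method_name,
                       static_cast<int>(stats.time_processing.InMilliseconds()),
                       /*min=*/1, /*max=*/60000, /*nbuckets=*/50);
  }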

Andrey / Menghuan - thoughts on this list of metrics?
 
Cc: louiscollard@chromium.org
Labels: OS-Chrome
We could probably also add:
- how long a request waits on outgoing dbus calls. (Still unclear how to count this on our infra.)

Some questions:

1. If we want to break down by method name, we need to create a subgroup (suffix) for each dbus method. I am not sure whether it is a good idea to create so many subgroups.

2. Do we need to measure all dbus requests in all daemons, or do we only care about some of them?
The metrics make sense, but let's better define which queue and which measurements we are talking about. My understanding:

- depth of queue at time request is received

How many dbus requests have been received but not yet processed at this moment. That mostly means tasks sitting in the MountThread queue (the dbus queue is typically cleared fast for async calls), but if we start breaking up the tasks there, we want the metric to stay the same, so counting dbus requests regardless of which queue they are currently sitting in makes the most sense.
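One way to keep that count, sketched below under the assumption that every incoming dbus request passes through a single point on receipt and a single point when its reply is sent (the class and method names are made up for illustration):

  // Hypothetical sketch: track how many dbus requests have been received but not
  // yet answered, independent of which internal queue they currently sit in.
  #include <atomic>

  class OutstandingRequestTracker {
   public:
    // Called when a dbus request arrives; returns the queue depth seen by this
    // request, i.e. how many earlier requests are still outstanding.
    int OnRequestReceived() { return outstanding_.fetch_add(1); }

    // Called when the reply for a request is sent.
    void OnRequestReplied() { outstanding_.fetch_sub(1); }

   private:
    std::atomic<int> outstanding_{0};
  };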

- time spent in the queue
- time spent being processed

Again, this is the time from getting the dbus request to sending the reply. Note that by the time Service gets the request, it may already have sat in a queue behind others, depending on how the dbus library serializes incoming requests, so we don't get the full stats. If incoming requests are serialized at the dbus library level, a request can wait behind a synchronous handler, but that time won't be counted by us.
This becomes 'Time being processed + time spent in queue' (to the best of our detection abilities).
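In other words, the simplest thing we can measure per request is a stopwatch started when our handler first sees it; a tiny sketch assuming libchrome's base::ElapsedTimer (the surrounding plumbing is hypothetical), keeping in mind that any wait inside the dbus library before the handler runs is invisible to it:

  // Hypothetical sketch: wall-clock time from the moment the daemon's handler
  // receives a dbus request until the reply is sent. Time spent inside the dbus
  // library before the handler runs is not visible here.
  #include <base/time/time.h>
  #include <base/timer/elapsed_timer.h>

  struct InFlightRequest {
    base::ElapsedTimer since_received;  // starts counting at construction
  };

  // Call when sending the reply; the result is 'time in queue + time processed'.
  base::TimeDelta TotalLatency(const InFlightRequest& request) {
    return request.since_received.Elapsed();
  }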

To separate 'time being processed' we can just time every MountThread task and add this time to the appropriate 'time_processed' field of the associated dbus request, then report the resulting sum when sending the response (since we can have multiple tasks dealing with one dbus request, especially in the future; and iirc the same task may originate from different requests). This requires that we start associating dbus requests with tasks.
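A rough sketch of what that association could look like, assuming each MountThread task can be handed a pointer (or id) back to the originating request's timing record; RequestTiming and the wrapper below are hypothetical names, not existing cryptohome code:

  // Hypothetical sketch: charge the run time of every task executed on behalf of
  // a dbus request to that request, so at reply time we can report 'time being
  // processed' (the sum) and derive 'time in queue' as total elapsed minus it.
  #include <utility>

  #include <base/bind.h>
  #include <base/callback.h>
  #include <base/time/time.h>

  struct RequestTiming {
    base::TimeTicks received = base::TimeTicks::Now();
    base::TimeDelta time_processed;  // sum over all tasks run for this request
  };

  // Wraps a MountThread task so that its run time is added to |timing|.
  // |timing| must outlive the task (e.g. owned by the in-flight request).
  base::OnceClosure TimedTask(RequestTiming* timing, base::OnceClosure task) {
    return base::BindOnce(
        [](RequestTiming* timing, base::OnceClosure task) {
          const base::TimeTicks start = base::TimeTicks::Now();
          std::move(task).Run();
          timing->time_processed += base::TimeTicks::Now() - start;
        },
        timing, std::move(task));
  }

  // At reply time:
  //   total_elapsed = base::TimeTicks::Now() - timing.received;
  //   time_in_queue = total_elapsed - timing.time_processed;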

Comment 3 by menghuan@chromium.org, Jan 17 (5 days ago)

Some updates:

1. I checked and confirmed that dbus information is not easy to get in cryptohomed.
I looked through the dbus implementation in cryptohome. The dbus handling in cryptohomed relies heavily on the glib message loop, so we may not be able to get the queue status easily. Not sure whether we could get more information by hacking around with the glib message loop.

2. I already have a proof of concept showing we can get the timing/count information from the MountThread queue.
I added some hooks in the mount thread's work queue, so we can get the time a task is posted to the queue (T1), the time it is popped from the queue to execute (T2), and the time it finishes (T3); see the sketch after point 3 below. I have not finished tracing multi-stage tasks (tasks that append the next stage to the queue) yet: the dbus information is not available when our handler is called, so we may need something like an async_id to track each dbus call.

3. Evaluating some early data, I found that metrics on mount_thread alone are not enough. We also need metrics for the other daemons' queues.

Some tasks get blocked because they make a sync call to another daemon, e.g. attestationd. From a high-level view we only see that the task runs for a very long time; we can't get more information from metrics at this level since the bottleneck is in the lower-level queue.

One example addressed in the design doc can't be figured out if we only have metrics in cryptohomed. The take-ownership action is an async request, and cryptohomed just forwards the request and returns immediately. It doesn't touch the mount_thread and it is also a very short dbus job in cryptohomed.
When the take-ownership action is finished, it triggers PrepareForEnrollment in attestationd. When that blocks the MountEx action in attestationd, we can only see that MountEx ran unusually long on the cryptohomed side, but we don't know the reason. (The number of parallel tasks is 0 in this case; those dbus calls are made under the table, out of sight of cryptohomed.)
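Going back to point 2, the T1/T2/T3 bookkeeping itself can stay very small; a hypothetical sketch (the struct and helper names are illustrative, not taken from the actual hooks):

  // Hypothetical sketch of the three timestamps recorded per mount thread task:
  //   T1 = when the task is posted to the queue
  //   T2 = when the task is popped from the queue and starts running
  //   T3 = when the task finishes
  // T2 - T1 is time spent in the queue, T3 - T2 is time spent being processed.
  #include <base/time/time.h>

  struct TaskTimestamps {
    base::TimeTicks posted;    // T1, set at post time
    base::TimeTicks started;   // T2, set just before the task body runs
    base::TimeTicks finished;  // T3, set right after the task body returns
  };

  base::TimeDelta QueueTime(const TaskTimestamps& t) { return t.started - t.posted; }
  base::TimeDelta ProcessTime(const TaskTimestamps& t) { return t.finished - t.started; }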



As the next step I will start looking at how to monitor attestationd; since it uses a different design for interacting with dbus, I may need a different way to hook it.

Comment 4 by menghuan@chromium.org, Jan 19 (3 days ago)

Draft CLs for metrics in cryptohome:

It is a little bit hacky to add the hooks in cryptohomed.

https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1423978
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1423979

Comment 5 by menghuan@chromium.org, Jan 19 (3 days ago)

Status: Assigned (was: Untriaged)
