
Issue 680207

Starred by 0 users

Issue metadata

Status: Archived
Owner:
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature

Blocked on:
issue 674262

Blocking:
issue 669639




[Predator] make the dict of feature values lazy

Project Member Reported by wrengr@chromium.org, Jan 11 2017

Issue description

In the UnnormalizedLogLinearModel class there's an open question about how best to store the FeatureValue vector and the weight covector. As we add more and more features over time, performance will become a concern.

The standard approach here is to ensure that the weights are *sparse* (i.e., mostly zeros), which is ensured by the training in TrainableLogLinearModel. However, right now (i.e., githash ec44c2881) we store the weights as a list (or np.ndarray), which is a *dense* representation; thus we're forced to iterate over all the entries rather than automatically skipping the zero ones. So step one is to replace this with some sparse representation, e.g., a dict mapping feature names to their weights, where we (a) ensure we don't store any zero weights, and (b) treat a missing key as meaning that feature's weight is zero. This part will probably be handled by the solution to https://crbug.com/674262
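
For concreteness, here is a minimal sketch of what that sparse representation could look like (the ``sparsify`` and ``weight_of`` helpers are hypothetical illustrations, not actual Predator code):

    def sparsify(feature_names, weights):
      """Build a sparse weight covector as a dict, dropping zero weights.

      feature_names : list(FeatureName)
      weights       : list(float) or np.ndarray, parallel to feature_names
      """
      return {name: w for name, w in zip(feature_names, weights) if w != 0.0}

    def weight_of(ws, name):
      """A missing key means the corresponding feature has weight zero."""
      return ws.get(name, 0.0)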

However, even if we have a sparse representation for the weight covector, there's still a performance issue. If the FV vector is also stored as a dict (so we can zip the two dicts along their keys), then we'll first compute *all* the FVs (to store them in the dict) and then throw most of them away, since they end up paired with zero weights. It'd be better to use a *lazy* dict, so that we only compute an FV object once we know it'll be needed. (This is one example of why we want to distinguish "vectors" from "covectors": we want to represent them differently.) Of course, a "lazy dict" is exactly the same thing as a MemoizedFunction! Even better, since we know we will only access each value once, we don't even need to memoize: a plain function will do.
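
To make the "lazy dict is just a memoized function" observation concrete, here is a rough sketch (``LazyDict`` is illustrative only; it is not the MemoizedFunction class in the codebase):

    class LazyDict(object):
      """A 'lazy dict': values are computed on first lookup, then cached.

      This is just a memoized function wearing dict clothing.
      """

      def __init__(self, compute_value):
        self._compute_value = compute_value
        self._cache = {}

      def __getitem__(self, key):
        if key not in self._cache:
          self._cache[key] = self._compute_value(key)
        return self._cache[key]

Since each feature value is only looked up once per (x, y) pair, the cache buys us nothing, which is why a plain function works just as well.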

So ultimately we'll want to compute the scores as in the following pseudocode:

    import math

    def score(ws, f, x, ys):
      """Compute the unnormalized log-linear score of each candidate y.

      ws : dict(FeatureName, float) -- sparse weight covector, no zero entries
      f  : X -> Y -> FeatureName -> FeatureValue -- curried feature function
      x  : X
      ys : iterable(Y)
      """
      # Let the feature function do preprocessing work given x.
      fx = f(x)
      scores = {}
      for y in ys:
        # Let the feature function do more preprocessing given y, if desired.
        fxy = fx(y)
        # This fsum(fvi * wi) is the dot product.
        scores[y] = math.fsum(
          # When we apply fxy to i, *that's* when we compute the FV!
          fxy(i) * wi
          # Because ws is sparse, we never even look at the zero weights.
          for i, wi in ws.items())
      return scores

This code avoids all unnecessary work in the dot product, and almost all unnecessary work in computing the FV vector. The one place unnecessary work could still slip in is if ``fx(y)`` does a bunch of work up front, since some of that work may go towards computing FVs we'll never ask for. We should either (a) accept that as an unavoidable performance loss, or (b) have ``fx(y)`` do no interesting work (i.e., immediately return the fxy function) and instead defer all the work until ``fxy(i)`` is actually computed.
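
Option (b) might look roughly like the following sketch (``preprocess_x`` and ``compute_feature_value`` are hypothetical placeholders for wherever the real per-feature work lives):

    def f(x):
      # Only the x-dependent preprocessing happens here.
      x_data = preprocess_x(x)

      def fx(y):
        # Return immediately: no y-dependent work is done yet.
        def fxy(feature_name):
          # All of the real work happens here, once per feature actually
          # requested, so features whose weight is zero are never computed.
          return compute_feature_value(x_data, y, feature_name)
        return fxy

      return fx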
 

Comment 1 by wrengr@chromium.org, Jan 11 2017

Description: updated

Comment 2 by wrengr@chromium.org, Jan 11 2017

Blocking: 669639
Status: Archived (was: Assigned)
