Issue 695619

Issue metadata

Status: Fixed
Owner:
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 669732



[Predator] Use linear score instead of log linear score.

Project Member Reported by kateso...@chromium.org, Feb 23 2017

Issue description

Previously, a feature value was in (log(0), log(1)]. For example, for a feature like ``MinDistanceFeature``, if the distance between the changed lines and the crashed lines is more than 50, we score it log(0); in that case, no matter what values the other features return, we won't consider this suspect as the culprit.

However, this way of evaluating features makes the log-linear model hard to scale. For example, even though a suspect didn't touch the crashed files in a stacktrace, we still want to check whether it touched files under the same directory as the crashed files. So we need to switch the value range from (log(0), log(1)] to [0, 1].
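
To make the difference concrete, here is a minimal sketch of the two scoring styles. It is not the actual ``MinDistanceFeature`` code: the function names, the linear ramp `1 - distance / max_distance`, and treating a distance of exactly 50 as the cutoff are all assumptions made for illustration.

```python
import math

MAX_DISTANCE = 50  # cutoff taken from the description above


def log_domain_min_distance(distance, max_distance=MAX_DISTANCE):
    """Hypothetical old style: value in the log domain, log(0) at the cutoff."""
    if distance >= max_distance:
        return float('-inf')  # log(0): vetoes the suspect outright
    return math.log(1.0 - distance / float(max_distance))


def linear_min_distance(distance, max_distance=MAX_DISTANCE):
    """Hypothetical new style: value in [0, 1], 0 at or beyond the cutoff."""
    if distance >= max_distance:
        return 0.0
    return 1.0 - distance / float(max_distance)


# Suppose another feature (e.g. "touched the same directory") contributes 1.0
# and every weight is 1. With the old style the far-away suspect is dropped;
# with the new style the other feature can still speak for it.
other_feature = 1.0
old_total = log_domain_min_distance(60) + other_feature  # -inf
new_total = linear_min_distance(60) + other_feature      # 1.0
```

With the log-domain value, a single feature at log(0) decides the outcome; with the linear value, it only contributes nothing.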
 

Comment 1 by wrengr@chromium.org, Feb 28 2017

I'm not sure why you think using the linear-domain score will help...

A log-domain score of log(0) means "this suspect is absolutely impossible"; returning log(0) should typically be avoided, because it's an extremely hard constraint. However, any log-domain score greater than log(0) results in a softer constraint, and so it can be overcome by a bunch of other features all saying that something should be blamed. Thus: just have the MinDistanceFeature return something other than log(0). It doesn't have to be [0,1], just has to be greater than log(0) is all. Supposing you pick some finite p and return the score log(p), it doesn't really matter what p is since the weight of the feature will rescale it. (The combination of p and the feature's weight, together, say how much a MinDistance >= 50 should make us think the current suspect is to blame or not.)

Remember, we're adding log-domain values, so if some feature gives a value of log(0) aka -inf, then the overall sum will always be log(0) no matter what else we add to it. This is exactly like doing things in the normal domain with multiplication: once you multiply by 0, the overall product will always be 0. So, just as we want to avoid multiplying by zero unless we really mean to forbid something, the same is true for adding log(0). Conversely, a log-domain score of log(1) effectively means "this feature doesn't matter". If we add log(1) aka 0 to some overall sum, it doesn't change things one way or the other. This is exactly like multiplying by 1 in the normal-domain.
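
A small sketch of that correspondence, using plain Python floats. The helper `log_or_neg_inf` is a made-up name; it exists only because `math.log(0)` raises `ValueError` instead of returning -inf.

```python
import math


def log_or_neg_inf(p):
    """Return log(p), mapping p == 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float('-inf')


probabilities = [0.5, 0.25, 1.0]

# Normal domain: multiply.
product = 1.0
for p in probabilities:
    product *= p

# Log domain: add. exp(log_sum) matches the product; log(1) == 0 changed nothing.
log_sum = sum(log_or_neg_inf(p) for p in probabilities)
assert abs(math.exp(log_sum) - product) < 1e-12

# One log(0) term drags the whole sum to -inf, like multiplying by 0.
log_sum_with_zero = log_sum + log_or_neg_inf(0.0)
assert log_sum_with_zero == float('-inf')
assert math.exp(log_sum_with_zero) == 0.0
```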

Consequently, you can think of log-domain scores in the following way:

[log(0)] == absolutely don't blame this suspect, no matter what other features say.
(log(0),log(1)) == this feature says to not blame the suspect, to some degree.
[log(1)] == this feature says nothing one way or the other about this suspect.
(log(1), log(inf)) == this feature says to blame the suspect, to some degree.
[log(inf)] == absolutely blame this suspect, no matter what other features say.

You really want to avoid log(inf) for the same reasons you want to avoid log(0); it's too hard of a constraint. But anything in (log(0),log(inf)) is fair game.
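
As a rough illustration of the five cases above, the reading of a single feature's log-domain value could be spelled out like this. `describe_log_score` and its return strings are hypothetical, not part of Predator.

```python
import math


def describe_log_score(value):
    """Interpret one feature's log-domain value (log(1) == 0 is neutral)."""
    if math.isnan(value):
        raise ValueError('NaN is not a meaningful score')
    if value == float('-inf'):
        return 'never blame'    # [log(0)]: hard veto
    if value == float('inf'):
        return 'always blame'   # [log(inf)]: hard requirement
    if value < 0:
        return 'vote against'   # (log(0), log(1))
    if value > 0:
        return 'vote for'       # (log(1), log(inf))
    return 'no opinion'         # [log(1)]
```

Only the two middle "vote" cases and the neutral case leave room for the other features to change the outcome.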

N.B., the final score over all the features will be normalized to be in the range [log(0),log(1)], but you shouldn't confuse that normalized score with the score each individual feature returns. Individual features are allowed to return values greater than log(1); that's why we normalize things.
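
One common way to do that normalization is to subtract the log of the summed exponentials (log-sum-exp) over all suspects. The sketch below assumes that scheme; the thread does not spell out how Predator normalizes, so `log_sum_exp`, `normalize`, and the suspect names are illustrative only. It also assumes no suspect has a total of +inf.

```python
import math


def log_sum_exp(values):
    """Compute log(sum(exp(v))) stably, ignoring -inf terms (they add 0)."""
    finite = [v for v in values if v != float('-inf')]
    if not finite:
        return float('-inf')
    m = max(finite)
    return m + math.log(sum(math.exp(v - m) for v in finite))


def normalize(log_scores):
    """Shift unnormalized log scores so every result is <= log(1) == 0."""
    log_z = log_sum_exp(list(log_scores.values()))
    if log_z == float('-inf'):
        raise ValueError('every suspect has score log(0)')
    return {name: score - log_z for name, score in log_scores.items()}


# Unnormalized totals may exceed log(1); the normalized ones cannot.
totals = {'suspect_a': 2.0, 'suspect_b': 0.5, 'suspect_c': float('-inf')}
normalized = normalize(totals)
```

After normalization the exponentiated scores sum to 1, and a suspect at log(0) stays at log(0).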
Using [0, 1] is for convenience. I think the feature value can just be the percentage to which this crash has the feature, and we let the weight determine how important the feature is.

For example:
Weight: -inf/log(0): if the suspect has this feature, absolutely don't blame it.
Weight: 0: this feature doesn't matter at all.
Weight: inf: if the suspect has this feature, absolutely blame it.

In this case, the feature is just a percentage, and we can adjust the weight as we want without needing to change the feature code.
The weights do indeed say how much we care about a given feature. But each feature also needs to say how strongly it feels, and *in which direction*. Restricting feature values to [0,1] aka [log(1),log(e)] means features can only ever vote *for* blaming a given suspect: because (log(1),log(e)] is a subset of (log(1),log(inf)). But we also want to allow features to vote *against* blaming a given suspect: i.e., to return values in (log(0),log(1)).

Feel free to ignore the fact that logs are involved at all. When features return negative values, that means they're voting against blaming the suspect; when they return positive values, that means they're voting for blaming the suspect. Positive weights mean to trust what the feature says; negative weights mean to trust that the feature is always wrong: i.e., flip the direction of what the feature says re voting for/against. And in all cases the absolute value of the feature value, weight, or score means how strongly to feel that way.

Part of the whole point of loglinear models is that we don't need to worry about computing "percentages". We can just have the feature return whatever value it likes. If we want to return values in [-5,5] we can. If we want values in [0,1] we can. If we want to return values in [-42,0] we can. Each feature can choose its own range of values; and the weights will rescale things so that all the various features contribute an appropriate amount to the overall decision about whether to blame the suspect or not.
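
A minimal sketch of why the range does not matter: the score is a weighted sum, so stretching a feature's range by some factor is undone by shrinking its weight by the same factor. The `score` function and the feature names and numbers below are made up for illustration.

```python
def score(weights, features):
    """Weighted sum of feature values: the unnormalized log-domain score."""
    return sum(weights[name] * value for name, value in features.items())


# A feature voting against (negative) and one voting for (positive).
features = {'min_distance': -3.0, 'touch_crashed_file': 1.0}
weights = {'min_distance': 0.5, 'touch_crashed_file': 2.0}
original = score(weights, features)  # 0.5 * -3.0 + 2.0 * 1.0 == 0.5

# Same model with min_distance reporting on a 4x larger range:
# the weight shrinks by 4x and the overall score is unchanged.
features_rescaled = {'min_distance': -12.0, 'touch_crashed_file': 1.0}
weights_rescaled = {'min_distance': 0.125, 'touch_crashed_file': 2.0}
rescaled = score(weights_rescaled, features_rescaled)

assert original == rescaled == 0.5
```

In practice the weights would come from training rather than being adjusted by hand, which is what lets each feature keep whatever range is natural for it.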
Also, re weights. Weight 0 does indeed mean to ignore the feature entirely. However, infinite weights are a bit funny. Whether it means to absolutely blame or absolutely not blame, depends on the sign of the feature value. Infinity means to absolutely do whatever the feature says; negative infinity means to absolutely do the opposite of whatever the feature says.
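
The sign behaviour can be checked directly with IEEE floats. This is only an observation about floating-point arithmetic, not a statement about how Predator handles infinite weights; `contribution` is a hypothetical helper.

```python
import math

INF = float('inf')


def contribution(weight, feature_value):
    """One feature's term in the weighted sum."""
    return weight * feature_value


# +inf weight: do exactly what the feature says.
assert contribution(INF, 2.0) == INF      # feature votes for: always blame
assert contribution(INF, -2.0) == -INF    # feature votes against: never blame

# -inf weight: do the opposite of what the feature says.
assert contribution(-INF, 2.0) == -INF
assert contribution(-INF, -2.0) == INF

# Caveat: a feature value of exactly 0 times an infinite weight is NaN.
assert math.isnan(contribution(INF, 0.0))
```

The last case is worth keeping in mind for features restricted to [0, 1], where 0 is a legitimate value.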
This change just moves the current negative feature values from (log(0), log(1)] to [0, 1], so that we can get rid of -inf/log(0).

It's possible that some features have a range that is hard to normalize to [0, 1], or have values in both directions. In that case, I am OK with expanding the range from [0, 1] to (-inf, inf).


Project Member

Comment 6 by bugdroid1@chromium.org, Mar 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d

commit 0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d
Author: Sharu Jiang <katesonia@google.com>
Date: Fri Mar 10 01:34:59 2017

[Predator] Use linear feature value instead of log linear value.

Previously, a feature value was in (log(0), log(1)]. For example, for a feature like ``MinDistanceFeature``, if the distance between the changed lines and the crashed lines is more than 50, we score it log(0); in that case, no matter what values the other features return, we won't consider this suspect as the culprit.

However, this way of evaluating features makes the log-linear model hard to scale. For example, even though a suspect didn't touch the crashed files in a stacktrace, we still want to check whether it touched files under the same directory as the crashed files. So we need to switch the value range from (log(0), log(1)] to [0, 1].
 

BUG= 695619 

Change-Id: I3c4f0400b08867540d3d96098cd458435aa2df45
Reviewed-on: https://chromium-review.googlesource.com/446528
Reviewed-by: Chan Li <chanli@chromium.org>
Commit-Queue: Sharu Jiang <katesonia@google.com>

[modify] https://crrev.com/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d/appengine/findit/crash/loglinear/changelist_features/test/touch_crashed_file_test.py
[modify] https://crrev.com/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d/appengine/findit/crash/loglinear/changelist_features/top_frame_index.py
[modify] https://crrev.com/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d/appengine/findit/crash/loglinear/changelist_features/test/touch_crashed_directory_test.py
[modify] https://crrev.com/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d/appengine/findit/crash/loglinear/test/changelist_classifier_test.py
[modify] https://crrev.com/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d/appengine/findit/crash/loglinear/changelist_features/test/touch_crashed_file_meta_test.py
[modify] https://crrev.com/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d/appengine/findit/crash/loglinear/changelist_features/test/min_distance_test.py
[modify] https://crrev.com/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d/appengine/findit/crash/loglinear/changelist_features/touch_crashed_file.py
[modify] https://crrev.com/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d/appengine/findit/crash/loglinear/changelist_features/test/top_frame_index_test.py
[modify] https://crrev.com/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d/appengine/findit/crash/loglinear/changelist_features/touch_crashed_directory.py
[modify] https://crrev.com/0ed0c5d627e1d1a69564ad36c6a5cabef96f1d2d/appengine/findit/crash/loglinear/changelist_features/min_distance.py

Status: Fixed (was: Assigned)
