tracing: idea to constexpr-eliminate GetCategoryGroupEnabled calls for popular categories
Issue description

Forking this off Issue 702718 (tracing macros cost ~1 MB of binary on Android). This bug is just about the code around populating the static variable of the macros. I will file a separate bug with ideas for reducing the cost of the rest of the TRACE_EVENT macros.

TL;DR of this bug:
------
I think that with some constexpr cleverness, it should be possible to avoid the cost (both binary size and runtime, the first time a macro is hit) of calling GetCategoryGroupEnabled() in TRACE_EVENT macros for popular categories, in a way that doesn't require more macros, nor changing the call sites. I managed to get a draft of something that works great on clang. Not enough to fool gcc though. Also, here I am playing at being very clever with the compiler. All the times I did that in the past I got bitten. Hence feedback on the overall approach is more than welcome.
------

Consider this simplified (yet realistic) model of tracing:
-----
// In real life this is //base/trace_event/category_registry.h
struct Category {
  const char* name;
  uint8_t is_enabled;
};

constexpr uint32_t N = 100;
Category g_categories[N] = {};

// Finds the pointer to the |is_enabled| word for the given category.
// If not present, dynamically adds a new category.
// In real life this is base::trace_event::TraceLog::GetCategoryGroupEnabled().
uint8_t* __attribute__((noinline)) GetEnabledPtrAtRunTime(const char* cat) {
  // Assume this has the proper locks for thread-safety.
  for (uint32_t i = 0; i < N; i++) {
    // Reached an empty slot, just append the new cat.
    if (g_categories[i].name == nullptr)
      g_categories[i].name = cat;
    if (strcmp(g_categories[i].name, cat) == 0)
      return &g_categories[i].is_enabled;
  }
  // More than 100 categories. The real implementation has a fallback for this.
  return nullptr;
}

// In real life this is the result of expanding lots of layers of macros.
// See https://goo.gl/EBOMv3 for a full walk-through.
#define TRACE_EVENT0(cat, event_name)          \
  static uint8_t* UNIQUE_VAR = nullptr;        \
  if (UNIQUE_VAR == nullptr) {                 \
    UNIQUE_VAR = GetEnabledPtrAtRunTime(cat);  \
  }                                            \
  do {                                         \
    if (*((volatile uint8_t*)UNIQUE_VAR))      \
      AddEvent(cat, event_name);               \
  } while (0)

#define TOKENPASTE(x, y) x##y
#define TOKENPASTE2(x, y) TOKENPASTE(x, y)
#define UNIQUE_VAR TOKENPASTE2(var, __LINE__)
-----

This allows making up categories dynamically by just saying TRACE_EVENT0("whatever", ...), which is very low friction. This, however, comes with a cost. Clearly we don't want to look the category up in the registry *every time* just to check whether it is enabled. Doing that would kill the performance of production code even when tracing is disabled. So each macro introduces a cached static variable. The first time we hit the macro we cache the value of the pointer (g_categories is append-only -> those pointers will be valid forever). So, after the first invocation, checking whether a category is enabled has the cost of just an unbarriered "if" on the cached pointer. (Note: we could bikeshed a lot here on thread safety, volatile vs std::atomic. This is just a simplified model. Threading stuff is out of scope here.)

This approach still has two drawbacks:
- The first time we hit the macro we actually have to get the pointer (regardless of whether tracing is enabled), which takes locks.
- There is some code associated with doing the GetEnabledPtrAtRunTime() call.

A possible solution (ruled out)
-------------------------------
We could definitely pre-populate the categories at compile (i.e.
loading) time, like this:
-----
Category g_categories[N] = {
    {"cc", false},
    {"toplevel", false},
    // The rest is zero-initialized and will be used for dynamic categories.
};
-----
And then we could have a variant of the macro which directly takes the pointer to the category, so something like:
-----
static const uint8_t* kTracing_CC_Category = &g_categories[0].is_enabled;
static const uint8_t* kTracing_TopLevel_Category = &g_categories[1].is_enabled;
TRACE_EVENT_WITH_STATIC_CATEGORY0(kTracing_CC_Category, "whatever");
-----
instead of TRACE_EVENT0("cc", "whatever").

This would solve both the aforementioned problems, but has issues:
- It adds another variant of the trace event macros, which requires rewriting and maintaining the resulting combinatorial explosion.
- It introduces dev friction, making tracing macros harder to understand.
- It requires rewriting call sites in ~3000 files in the codebase.

A variant of this solution (being proposed)
-------------------------------------------
What if we could achieve the above in a more transparent way, which doesn't require new macros? I think we can do something clever with constexpr. The TL;DR is: the compiler can have enough information to tell whether we require the runtime fetching of the pointer or not. I gave a shot at this:
-----
constexpr uint32_t N = 100;
constexpr Category kKnownCategories[N] = {{"foo", false}, {"bar", false}};

Category g_categories[N] = {
    // Having to do these assignments is really awkward. But I don't know
    // enough constexpr magic to be able to just assign the array to the
    // constexpr one in one go.
    // Ideally I want to say just g_categories = kKnownCategories, in a way
    // that guarantees that g_categories is still populated at loader time
    // (i.e. in the .rwdata section).
    kKnownCategories[0],
    kKnownCategories[1],
    // The rest is zero-initialized.
};

constexpr uint8_t* GetEnabledPtrAtCompileTime(const char* cat,
                                              uint32_t i = 0) {
  // Apologies for the nested ternary operators, but any other more readable
  // option (if-statements, variables) seems to be a C++14 extension.
  return (kKnownCategories[i].name == nullptr)
             ? nullptr  // This category is unknown at compile time.
             : ((cat == kKnownCategories[i].name)
                    ? &g_categories[i].is_enabled
                    : GetEnabledPtrAtCompileTime(cat, i + 1));
}
// Major doubt on that "cat == kKnownCategories[i].name" comparison.
// What are the semantics of string-literal comparison in constexpr?
// Maybe we need a sort of constexpr_strcmp instead? Compiler internals,
// who knows.

uint8_t* __attribute__((noinline)) GetEnabledPtrAtRunTime(const char* cat) {
  // In the real world, this has the right locks to deal with thread safety.
  for (uint32_t i = 0; i < N; i++) {
    // Reached an empty slot, just append the new cat.
    if (g_categories[i].name == nullptr)
      g_categories[i].name = cat;
    if (strcmp(g_categories[i].name, cat) == 0)
      return &g_categories[i].is_enabled;
  }
  // More than 100 categories. The real implementation has a fallback for this.
  return nullptr;
}

void __attribute__((noinline)) AddEvent(const char* cat, const char* evt_name) {
  // LOL, assume this does TraceLog::GetInstance()->AddTraceEvent(...).
printf("EVT: %s %s (ptr: %p)\n", cat, evt_name, GetEnabledPtrAtRunTime(cat)); } #define TRACE_EVENT0(cat, event_name) \ static uint8_t *UNIQUE_VAR = nullptr; \ if (GetEnabledPtrAtCompileTime(cat)) { \ UNIQUE_VAR = GetEnabledPtrAtCompileTime(cat); \ } else if (UNIQUE_VAR == nullptr) { \ UNIQUE_VAR = GetEnabledPtrAtRunTime(cat); \ printf("Dynamically getting pointer '%s' : %p\n", cat, UNIQUE_VAR); \ } \ do { \ if (*((volatile uint8_t *)UNIQUE_VAR)) \ AddEvent(cat, event_name); \ } while (0) // foo is a category known at compile time. void foo() { TRACE_EVENT0("foo", "foo::something"); } // baz is a dynamic category. void baz() { TRACE_EVENT0("baz", "baz::something"); } ----- The core of the trick, is that I rely on the compiler to be smart and realize that: - the result of if GetEnabledPtrAtCompileTime() is known at compile time - there is no point emitting code for the "else" part of the branch, if we know at compile time that it is taken. Let's look at the assembly now: $ clang++ -o /tmp/cat constexpr_cat.cc --std=c++11 -Wall -Wextra -g2 -Os Dump of assembler code for function foo(): <+0>: push %rbp <+1>: mov %rsp,%rbp <+4>: lea 0x2ff(%rip),%rax # 0x100001028 <g_categories+8> <+11>: mov %rax,0x930(%rip) # 0x100001660 <foo()::var65> <+18>: cmpb $0x0,0x2f1(%rip) # 0x100001028 <g_categories+8> <+25>: je 0x100000d4d <foo()+47> <+27>: lea 0x1e0(%rip),%rdi # 0x100000f20 <+34>: lea 0x21f(%rip),%rsi # 0x100000f66 <+41>: pop %rbp <+42>: jmpq 0x100000cf1 <AddEvent(char const*, char const*)> <+47>: pop %rbp <+48>: retq End of assembler dump. Dump of assembler code for function baz(): <+0>: push %rbp <+1>: mov %rsp,%rbp <+4>: push %rbx <+5>: push %rax <+6>: lea 0x219(%rip),%rbx # 0x100000f75 <+13>: mov %rbx,%rdi <+16>: callq 0x100000c9e <GetEnabledPtrAtRunTime(char const*)> <+21>: mov %rax,%rcx <+24>: mov %rcx,0x8fa(%rip) # 0x100001668 <baz()::var67> <+31>: lea 0x1ca(%rip),%rdi # 0x100000f3f <+38>: xor %eax,%eax <+40>: mov %rbx,%rsi <+43>: mov %rcx,%rdx <+46>: callq 0x100000ef0 <+51>: mov 0x8df(%rip),%rax # 0x100001668 <baz()::var67> <+58>: cmpb $0x0,(%rax) <+61>: je 0x100000da7 <baz()+88> <+63>: lea 0x1e0(%rip),%rdi # 0x100000f75 <+70>: lea 0x1dd(%rip),%rsi # 0x100000f79 <+77>: add $0x8,%rsp <+81>: pop %rbx <+82>: pop %rbp <+83>: jmpq 0x100000cf1 <AddEvent(char const*, char const*)> <+88>: add $0x8,%rsp <+92>: pop %rbx <+93>: pop %rbp <+94>: retq End of assembler dump. \o/ \o/ \o/ \o/ \o/ \o/ The saving is is ~50 bytes per macro on x86_64 Unfortunately, this trick fools LLVM, but doesn't seem to be enough to fool GCC for Android (whatever we use in our tree). 
Here is what I got:

Dump of assembler code for function foo:
0x00003908 <+0>: movs r2, #0
0x0000390a <+2>: push {r3, r4, r5, lr}
0x0000390c <+4>: ldr r0, [pc, #96] ; (0x3970 <foo+104>)
0x0000390e <+6>: ldr r4, [pc, #100] ; (0x3974 <foo+108>)
0x00003910 <+8>: add r0, pc
0x00003912 <+10>: add r4, pc
0x00003914 <+12>: ldr.w r1, [r0, r2, lsl #3]
0x00003918 <+16>: lsls r3, r2, #3
0x0000391a <+18>: cbz r1, 0x3934 <foo+44>
0x0000391c <+20>: cmp r1, r4
0x0000391e <+22>: bne.n 0x3930 <foo+40>
0x00003920 <+24>: ldr r2, [pc, #84] ; (0x3978 <foo+112>)
0x00003922 <+26>: add r2, pc
0x00003924 <+28>: add r3, r2
0x00003926 <+30>: ldr r2, [pc, #84] ; (0x397c <foo+116>)
0x00003928 <+32>: adds r3, #4
0x0000392a <+34>: add r2, pc
0x0000392c <+36>: str r3, [r2, #0]
0x0000392e <+38>: b.n 0x3954 <foo+76>
0x00003930 <+40>: adds r2, #1
0x00003932 <+42>: b.n 0x3914 <foo+12>
0x00003934 <+44>: ldr r4, [pc, #72] ; (0x3980 <foo+120>)
0x00003936 <+46>: add r4, pc
0x00003938 <+48>: ldr r3, [r4, #0]
0x0000393a <+50>: cbnz r3, 0x3954 <foo+76>
0x0000393c <+52>: ldr r5, [pc, #68] ; (0x3984 <foo+124>)
0x0000393e <+54>: add r5, pc
0x00003940 <+56>: mov r0, r5
0x00003942 <+58>: bl 0x38a8 <GetEnabledPtrAtRunTime(char const*)>
0x00003946 <+62>: mov r2, r0
0x00003948 <+64>: str r0, [r4, #0]
0x0000394a <+66>: ldr r0, [pc, #60] ; (0x3988 <foo+128>)
0x0000394c <+68>: mov r1, r5
0x0000394e <+70>: add r0, pc
0x00003950 <+72>: bl 0xa3bc <printf>
0x00003954 <+76>: ldr r3, [pc, #52] ; (0x398c <foo+132>)
0x00003956 <+78>: add r3, pc
0x00003958 <+80>: ldr r3, [r3, #0]
0x0000395a <+82>: ldrb r3, [r3, #0]
0x0000395c <+84>: cbz r3, 0x396e <foo+102>
0x0000395e <+86>: ldmia.w sp!, {r3, r4, r5, lr}
0x00003962 <+90>: ldr r0, [pc, #44] ; (0x3990 <foo+136>)
0x00003964 <+92>: ldr r1, [pc, #44] ; (0x3994 <foo+140>)
0x00003966 <+94>: add r0, pc
0x00003968 <+96>: add r1, pc
0x0000396a <+98>: b.w 0x38e8 <AddEvent(char const*, char const*)>
0x0000396e <+102>: pop {r3, r4, r5, pc}
0x00003970 <+104>: andeq pc, r0, r8, ror #21
0x00003974 <+108>: andeq r11, r0, r1, lsr r1
0x00003978 <+112>: ldrdeq r0, [r1], -r10
0x0000397c <+116>: andeq r0, r1, lr, lsl #20
0x00003980 <+120>: andeq r0, r1, r2, lsl #20
0x00003984 <+124>: andeq r11, r0, r5, lsl #2
0x00003988 <+128>: strdeq r11, [r0], -r9
0x0000398c <+132>: andeq r0, r1, r2, ror #19
0x00003990 <+136>: ldrdeq r11, [r0], -sp
0x00003994 <+140>: andeq r11, r0, r6, lsl #2
End of assembler dump.

Which is clearly not what I was hoping for. See that call to GetEnabledPtrAtRunTime() :-(

I am still confident there might be another way around this... but how?

+some people who know compilers, for clues and general feedback (read: even if this works, how horrifying would this look to you?)
Apr 6 2017
Primiano, I think you can do what you want with C++11 constexpr and some macro tricks. Here's an example:
static constexpr bool str_eq(const char* a, const char* b) {
return (*a == *b) && (!*a || str_eq(a + 1, b + 1));
}
static_assert(str_eq("foo", "foo"), "strings should be equal");
static_assert(!str_eq("foo", "Foo"), "strings should not be equal");
static_assert(!str_eq("foo", "foo1"), "strings should not be equal");
static_assert(!str_eq("foo2", "foo"), "strings should not be equal");
#define LIST_KNOWN_CATEGORIES(X) \
X(foo) \
X(bar) \
X(zoo)
// Declare the list of known categories as an enum class.
// Note that Category::Unknown is -1 on purpose.
enum class Category {
Unknown = -1,
#define DECLARE_CATEGORY(name) name,
LIST_KNOWN_CATEGORIES(DECLARE_CATEGORY)
#undef DECLARE_CATEGORY
TotalCountMustBeLast // Don't remove.
};
static constexpr int kCategoryCount =
static_cast<int>(Category::TotalCountMustBeLast);
// An array of category names.
static constexpr const char* const kCategoryNames[kCategoryCount] = {
#define STRINGIFY_(x) #x
#define STRINGIFY(x) STRINGIFY_(x),
LIST_KNOWN_CATEGORIES(STRINGIFY)
#undef STRINGIFY_
#undef STRINGIFY
};
static constexpr Category FindCategoryByIndex(const char* name, int index) {
return (index < kCategoryCount)
? (str_eq(kCategoryNames[index], name)
? static_cast<Category>(index)
: FindCategoryByIndex(name, index + 1))
: Category::Unknown;
}
static constexpr Category FindCategoryByName(const char* name) {
return FindCategoryByIndex(name, 0);
}
static_assert(FindCategoryByName("foo") == Category::foo, "foo not found");
static_assert(FindCategoryByName("bar") == Category::bar, "bar not found");
static_assert(FindCategoryByName("zoo") == Category::zoo, "zoo not found");
static_assert(FindCategoryByName("ah ah") == Category::Unknown, "ah ah found");
Hope this helps :)
Apr 6 2017
primiano@ and digit@ are awesome!

If I understand correctly, it is getting the pointer to the category at compile time. If making a typo in a category explodes compilation, that'd be extra nice, right? Do we need the "dynamic" categories at all?

What is the projected saving if we do this? (Sorry, but the "proof" is obviously x86-64 code.)
Apr 6 2017
> If I understand correctly, it is getting the pointer to the category at compile time.
Correct
> If making a typo in a category explodes compilation, that'd be extra nice, right?
No, the plan is that making a typo in a category falls back to the current behavior.
We still want (I think) people to be able to dynamically create categories, as that reduces friction. However we want the most popular categories to be zero cost.
To be honest, I am not too strong here. We could enforce that dynamic categories must start with "experimental-". However, doing that might be painful right now, because tracing is in base and is used for sure by other projects in our thousands of hidden bots. So if I did that, the CL would be reverted like 15 times. (We should still make a plan though.)
> What is the projected saving if we do this? (sorry, but the "proof" is obviously x86-64 code)
Aaah, I made a copy/paste mistake. I wrote "Now also gcc saves 40 bytes on ARM (-Os, is_official_build=true), proof:"
but then I copied the wrong stuff. Fooling pasko@ is harder than I thought :P
Lemme try again:
Dump of assembler code for function foo:
0x00003740 <+0>: ldr r3, [pc, #28] ; (0x3760 <foo+32>)
0x00003742 <+2>: ldr r2, [pc, #32] ; (0x3764 <foo+36>)
0x00003744 <+4>: add r3, pc
0x00003746 <+6>: adds r1, r3, #4
0x00003748 <+8>: ldrb r3, [r3, #4]
0x0000374a <+10>: add r2, pc
0x0000374c <+12>: str r1, [r2, #0]
0x0000374e <+14>: cbz r3, 0x375c <foo+28>
0x00003750 <+16>: ldr r0, [pc, #20] ; (0x3768 <foo+40>)
0x00003752 <+18>: ldr r1, [pc, #24] ; (0x376c <foo+44>)
0x00003754 <+20>: add r0, pc
0x00003756 <+22>: add r1, pc
0x00003758 <+24>: b.w 0x3720 <AddEvent(char const*, char const*)>
0x0000375c <+28>: bx lr
0x0000375e <+30>: nop
0x00003760 <+32>: ; <UNDEFINED> instruction: 0x000108b8
0x00003764 <+36>: strdeq r0, [r1], -r2
0x00003768 <+40>: andeq r11, r0, pc, ror #9
0x0000376c <+44>: strdeq r11, [r0], -r1
End of assembler dump.
Dump of assembler code for function baz:
0x00003770 <+0>: push {r3, r4, r5, lr}
0x00003772 <+2>: ldr r5, [pc, #60] ; (0x37b0 <baz+64>)
0x00003774 <+4>: add r5, pc
0x00003776 <+6>: ldr r3, [r5, #0]
0x00003778 <+8>: cbnz r3, 0x3792 <baz+34>
0x0000377a <+10>: ldr r4, [pc, #56] ; (0x37b4 <baz+68>)
0x0000377c <+12>: add r4, pc
0x0000377e <+14>: mov r0, r4
0x00003780 <+16>: bl 0x36e0 <GetEnabledPtrAtRunTime(char const*)>
0x00003784 <+20>: mov r2, r0
0x00003786 <+22>: str r0, [r5, #0]
0x00003788 <+24>: ldr r0, [pc, #44] ; (0x37b8 <baz+72>)
0x0000378a <+26>: mov r1, r4
0x0000378c <+28>: add r0, pc
0x0000378e <+30>: bl 0xa15c <printf>
0x00003792 <+34>: ldr r3, [pc, #40] ; (0x37bc <baz+76>)
0x00003794 <+36>: add r3, pc
0x00003796 <+38>: ldr r3, [r3, #0]
0x00003798 <+40>: ldrb r3, [r3, #0]
0x0000379a <+42>: cbz r3, 0x37ac <baz+60>
0x0000379c <+44>: ldmia.w sp!, {r3, r4, r5, lr}
0x000037a0 <+48>: ldr r0, [pc, #28] ; (0x37c0 <baz+80>)
0x000037a2 <+50>: ldr r1, [pc, #32] ; (0x37c4 <baz+84>)
0x000037a4 <+52>: add r0, pc
0x000037a6 <+54>: add r1, pc
0x000037a8 <+56>: b.w 0x3720 <AddEvent(char const*, char const*)>
0x000037ac <+60>: pop {r3, r4, r5, pc}
0x000037ae <+62>: nop
0x000037b0 <+64>: andeq r0, r1, r4, asr #23
0x000037b4 <+68>: ldrdeq r11, [r0], -r10
0x000037b8 <+72>: andeq r11, r0, lr, asr #9
0x000037bc <+76>: andeq r0, r1, r4, lsr #23
0x000037c0 <+80>: ; <UNDEFINED> instruction: 0x0000b4b2
0x000037c4 <+84>: ldrdeq r11, [r0], -r11 ; <UNPREDICTABLE>
End of assembler dump.
~ 50 bytes per macro
Apr 6 2017
> ~ 50 bytes per macro

Cool! Understood the desire to have the dynamism now. I don't know how much friction it may cause.

Going one step further, can we also un-inline the "if (*UNIQUE_VAR)" check for categories that are not performance-sensitive? Save 5-10 bytes more on these, maybe?
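For illustration, one shape this un-inlining could take - just a sketch; AddEventIfEnabled is an invented helper:
-----
// Out-of-line helper: the trace point shrinks to argument setup + one call,
// instead of the inlined load + test + branch sequence.
void __attribute__((noinline)) AddEventIfEnabled(const uint8_t* enabled_ptr,
                                                 const char* cat,
                                                 const char* name) {
  if (*enabled_ptr)
    AddEvent(cat, name);
}
-----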
Apr 6 2017
In the latest ARM code for foo, I'm not sure what the point of the store at +12 is? It seems to be storing the address of the boolean flag being checked into a non-local variable?
Can you try initializing the static variable directly with the constexpr value, as in:
constexpr uint8_t* UNIQUE##_const_flag_ptr =
GetEnabledPtrAtCompileTime(category);
static uint8_t* UNIQUE##_flag_ptr = UNIQUE##_const_flag_ptr;
// In theory, the following should be completely removed if
// _const_flag_ptr is not nullptr. If it is, this should be
// reduced to a simple !UNIQUE##_flag_ptr check.
if (!UNIQUE##_const_flag_ptr && !UNIQUE##_flag_ptr) {
UNIQUE##_flag_ptr = GetEnabledPtrAtRunTime(category);
printf(...);
}
if (*UNIQUE##_flag_ptr) {
AddTraceEvent(...);
}
Also, you probably want to remove the volatile keyword, I don't think it does what you think it does :)
Regarding #5, how do you determine performance-sensitive cases? Any category can be used anywhere, and doing so would actually slow down regular non-traced execution.
Apr 6 2017
Also keep in mind that the category name can contain a list of categories separated by commas, so we will always need the dynamic version (though I'm pretty sure this is rare).
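For example (illustrative category names):
-----
void Example() {
  // One event tagged with two categories at once: it is enabled whenever
  // either "cat1" or "cat2" is enabled.
  TRACE_EVENT0("cat1,cat2", "Example::Event");
}
-----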
Apr 6 2017
> static uint8_t* UNIQUE##_flag_ptr = UNIQUE##_const_flag_ptr;

The problem with this is that, in C++11, statics have thread-safe initialization. Doing that causes all sorts of extra sync code to be generated.

> Also, you probably want to remove the volatile keyword, I don't think it does what you think it does :)

Yeah, that was a quick thing for the demo; in reality that would be a std::atomic (relaxed) or something similar (sketch below).

> Regarding #5, how do you determine performance-sensitive cases? Any category can be used anywhere, and doing so would actually slow down regular non-traced execution.

Yeah, I agree. This would be quite risky because it would introduce a full call+return every time.
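A sketch of that atomic variant, using std::atomic for illustration (the real macro uses base::subtle atomics, as shown later in this thread):
-----
#include <atomic>

// Cached per-call-site pointer. Relaxed ordering is enough: the registry is
// append-only, so a racy double-lookup is benign.
static std::atomic<uint8_t*> g_cached_enabled(nullptr);

uint8_t* GetCachedEnabledPtr(const char* cat) {
  uint8_t* ptr = g_cached_enabled.load(std::memory_order_relaxed);
  if (!ptr) {
    ptr = GetEnabledPtrAtRunTime(cat);
    g_cached_enabled.store(ptr, std::memory_order_relaxed);
  }
  return ptr;
}
-----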
Apr 6 2017
Discussing with pasko, I'm pretty sure the store at +12 comes from the assignment to UNIQUE_VAR in your original code. In other words, I'm pretty sure that my suggestion should fix it and remove 3 instructions (including one load and one store) from the generated sequence for foo.
Hope you can verify that quickly.
As a reminder, initializing a non-POD static variable in C++ generates a ton of code that isn't seen here, e.g.:
const Value& getMagicalConstant() {
static const Value kMagicalConstant(complexComputation());
return kMagicalConstant;
}
is translated into something like:
const Value& getMagicalConstant__naive_debug() {
static Value local_cxx_var; // mutable var
static uint64_t local_cxx_guard; // init flag (64-bits even on x86!)
// check initialization flag
if (__atomic_load_acquire(&local_cxx_guard) == 0) {
if (__cxa_guard_acquire(&local_cxx_guard)) {
try { // value initialization
Value temp;
temp.Value::Value(complexComputation());
local_cxx_var.Value::Value(temp);
} catch (...) {
__cxa_guard_abort(&local_cxx_guard);
throw;
}
__cxa_guard_release(&local_cxx_guard);
}
}
return static_cast<const Value&>(local_cxx_var);
}
which is quite a bit different, and incurs a load-acquire on every function call (which means a DMB instruction on ARM, or about 50 ns).
Apr 6 2017
Note: the drawback of the constexpr usage is that it is likely to make the compilation of each and every TRACE_EVENTxxx macro much slower. I wonder how this will impact full builds. Maybe we'll have to only enable this for official ones?
Apr 6 2017
The following seems to compile correctly on ARM for me:
static int increment = 10;
static constexpr int* GetCompileTimeIncrementPtr() {
return &increment;
}
int foo(int x) {
constexpr int* const_var_ptr = GetCompileTimeIncrementPtr();
static int* var_ptr = const_var_ptr;
return x + *var_ptr;
}
Which gives:
00000000 <_Z3fooi>:
0: 4b02 ldr r3, [pc, #8] ; (c <_Z3fooi+0xc>)
2: 447b add r3, pc
4: 681b ldr r3, [r3, #0]
6: 4418 add r0, r3
8: 4770 bx lr
a: bf00 nop
c: 00000006 .word 0x00000006
c: R_ARM_REL32 .data
which seems correct.
Apr 6 2017
> As a reminder, initializing a non-POD static variable in C++

We are talking about POD here (specifically uint8_t*).

> Discussing with pasko, I'm pretty sure the store at +12 comes from the assignment to UNIQUE_VAR in your original code.

Yes, I was thinking the same while discussing in a meeting. Once we have a constexpr we can directly dereference that; there is no need to use a static variable at all (see the sketch below). I think I need to make this prototype more realistic and just make it work.

> Note: the drawback of the constexpr usage is that it is likely to make the compilation of each and every TRACE_EVENTxxx macro much slower.

Yeah, I thought about that and I am concerned as well. I need a full prototype to see how much this will impact the build. If the impact is not too much, I'd prefer this to be uniform on non-official vs official builds, as that increases coverage and reduces surprises. Will keep you updated with what I find.

> The following seems to compile correctly on ARM for me:

Yes, but I think you are helping the compiler be smarter and avoiding the store you saw at +12, because this is clearly const-propagatable.
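A sketch of that static-free shape, continuing the prototype's names (only the dynamic path keeps a cached static):
-----
#define TRACE_EVENT0(cat, event_name)                                \
  do {                                                               \
    constexpr uint8_t* k_enabled = GetEnabledPtrAtCompileTime(cat);  \
    if (k_enabled) {                                                 \
      /* Known category: dereference the compile-time pointer. */    \
      if (*(volatile uint8_t*)k_enabled)                             \
        AddEvent(cat, event_name);                                   \
    } else {                                                         \
      /* Dynamic category: keep the cached runtime lookup. */        \
      static uint8_t* cached = nullptr;                              \
      if (cached == nullptr)                                         \
        cached = GetEnabledPtrAtRunTime(cat);                        \
      if (*(volatile uint8_t*)cached)                                \
        AddEvent(cat, event_name);                                   \
    }                                                                \
  } while (0)
-----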
Apr 6 2017
> > Regarding #5, how do you determine performance-sensitive cases? Any category
> > can be used anywhere, and doing so would actually slow down regular non-traced
> > execution.
>
> Yeah, I agree. This would be quite risky because it would introduce a full
> call+return every time.

Being cautious is understandable, I am cautious too. My reasoning for trying it out is:
1. Miscategorizing a few of these is not going to make a huge difference, because the memory ref + branch has costs in the same order of magnitude as a call (I have not benchmarked on ARM, please correct me if I am wrong). Some of these are slowing down critical codepaths, and we seem to be OK with this amount of slowness. Making these 2x slower in a few cases is probably not noticeable in real-world scenarios.
2. By doing this we can discover which trace events are too much on the critical path and decide to remove them entirely or replace them with cheaper ones.
3. I am optimistic that trace events in categories like "loader", "content", "navigation", "startup" (and a bunch of others) will not be sensitive to the price of a call.
Cons:
1. More work :)
Apr 7 2017
Dump of assembler code for function foo:
0x00003740 <+0>: ldr r3, [pc, #28] ; (0x3760 <foo+32>)
0x00003742 <+2>: ldr r2, [pc, #32] ; (0x3764 <foo+36>)
0x00003744 <+4>: add r3, pc
0x00003746 <+6>: adds r1, r3, #4
0x00003748 <+8>: ldrb r3, [r3, #4]
0x0000374a <+10>: add r2, pc
0x0000374c <+12>: str r1, [r2, #0]
0x0000374e <+14>: cbz r3, 0x375c <foo+28>
0x00003750 <+16>: ldr r0, [pc, #20] ; (0x3768 <foo+40>)
0x00003752 <+18>: ldr r1, [pc, #24] ; (0x376c <foo+44>)
0x00003754 <+20>: add r0, pc
0x00003756 <+22>: add r1, pc
0x00003758 <+24>: b.w 0x3720 <AddEvent(char const*, char const*)>
0x0000375c <+28>: bx lr
0x0000375e <+30>: nop
0x00003760 <+32>: ; <UNDEFINED> instruction: 0x000108b8
0x00003764 <+36>: strdeq r0, [r1], -r2
0x00003768 <+40>: andeq r11, r0, pc, ror #9
0x0000376c <+44>: strdeq r11, [r0], -r1
End of assembler dump.

Does anyone know why GDB disassembled more than needed here? I.e. the function stops at +28; "bx lr" is the last thing that matters. I can understand the following nop, but what about the instructions at +32 through +44? Is it GDB being silly, or do we indeed have garbage between functions?
Apr 7 2017
Those entries at the end store the offsets of the relocations. GDB cannot distinguish them from actual code and shows random instructions. If you look at the assembly at the top, it loads +32, adds the pc, and dereferences it. Essentially the bytes at +32 are the offset for position-independent addressing of the global (static) variable.
Apr 7 2017
Nit to comment 18: those after +28 are _constants_ (called 'immediates') that did not fit into the instructions themselves (because the instruction size is limited to 2 bytes with this instruction set). AFAIR the max size of the immediate on Thumb2 is 12 bits. Instructions can address short distances from the PC (code pointer) and other registers. So, for example, to be able to use a 4-byte constant, the instruction needs to load it from memory into a register. The compiler puts these constants at the end of the function itself to make sure the offsets are small.

These constants are _not_ relocations for PIC, just offsets to be able to reach the relocation in a writable segment. The code segment is kept read-only for a few reasons:
1. security
2. sharing code in different processes where it is mapped to a different address (not applicable for Chrome - AFAIR we map the code to the same address in all processes)
Apr 7 2017
Thanks guys, didn't know about that. So in theory Clang / the linker could try moving code around so that several (small) functions reuse the same pool of immediate constants? Or does that happen already?
Apr 7 2017
Re #20: unfortunately those constants are distinct, because there is one static var per TRACE_EVENT macro.
So, imagine this:
foo() {
  TRACE_EVENT0(...)
  ...
  TRACE_EVENT0(...)
}
at the end this expands to
foo() {
static const uint8_t* cached_pointer_for_first_event;
if (cached_pointer_for_first_event == nullptr)
... rest of TRACE_EVENT code ...
static const uint8_t* cached_pointer_for_second_event;
... rest of TRACE_EVENT code ...
}
Now, since those cached_pointer_* are static, they really behave like variables in the anonymous namespace. As such, they will be put in the .rwdata section.
Now, IIRC the distance between .text and .rwdata is fixed and part of the ELF layout, so the code that accesses them has to use PC-relative addressing to read/write them.
From a high level viewpoint, the pseudocode for
if (cached_pointer_for_first_event == nullptr) is:
1. Load PC-relative offset (this datum is known at compile time) of address of |cached_pointer_for_first_event| in the .rwdata section
2. Add PC to the register
3. dereference the register
4. Branch if zero/nonzero
The problem is in step 1: the offset between the current instruction and the .rwdata section is a biggish number (some MB) which doesn't fit in the immediate of the "ldr" instruction.
So this will be split into:
- The compiler writes the offset in the epilogue of the function, which is reachable via the immediate.
- That takes an extra layer to get to the actual pointer.
Apr 7 2017
Yeah, but I was speaking more generally. I.e. there must be functions in Chrome that are using similar constants - why not put them together to reuse those constants? And with some extra smartness, the linker could even transform two distinct "ldr [C1]", "ldr [C2]" into "ldr [C1]", "ldr [C1]; add D" where D is [C2] - [C1]. This would save 2 bytes on Thumb.
Apr 7 2017
Re #22: what you suggest is a good idea, but due to decades of legacy in computing... compilers don't know what's outside the one very file being compiled right now; the only way they communicate with the linker is by writing 'relocations', which basically represent a function with a starting address as input. The linker puts the code in the order it encounters the objects, applies this function (one of dozens predefined in the ABI standard) and writes the resulting immediate into the code at an offset that the compiler also provided for this relocation.

There are gross hacks on top of that, where linkers recognize some binary patterns of short instruction sequences and rewrite those into other sequences (happens when the linker is informed of the ThreadLocalStorage 'model', for example). Linkers also know some instructions by heart and are able to change them, including their size, which is another gross hack that leads to a waterfall of other resizes for binaries like libchrome.so.

What you suggest is possible within the framework of "link time optimization (LTO)", which we tried for GCC 4.9/ARM: it saved a couple of megs for libchrome.so, but linking takes forever (not parallelizable) and is effectively non-shippable. In GCC 5.0+ the LTO times and resource consumption are vastly improved, but we are not even trying it because the Android NDK folks decided to abandon supporting GCC (for good reasons). Clang generates larger code (checked a year ago); its LTO is not ready yet. Some day, some day :)
Sep 24
Related to #23 above, we now have enabled ThinLTO when building Chrome for Android.
Oct 1
I gave this idea a try and wrote a Chromium CL [1] to see how much binary size would be saved.

TL;DR: this trick saves ~16 bytes per tracing macro invocation, which gives us -52 KiB of monochrome binary size in total [2]. More savings are possible if we convert the macro in third-party projects (V8, skia, webrtc, angle).

Unfortunately, the compiler saves only ~16 bytes per macro, instead of the ~50 bytes mentioned in c#4. As we can see from the generated assembly [3], the compiler indeed managed to resolve the address of the categories table entry at compile time (annotated yellow), so probably this is the maximum that this optimization can offer. Even without the patch, the category resolution takes only ~30 bytes out of 100 for the TRACE_EVENT0 macro (annotated red in [3]).

Looking at the assembly code, I see that many bytes of code are spent on making the trace_event_internal::AddTraceEvent() call (annotated green, ~26 bytes). It's even worse if we look at the TRACE_EVENT2 macro, which has more parameters, and/or if the AddTraceEvent() call got inlined - 104 bytes! I think this can also be optimized, as was mentioned in [4].

One problem I have with [1] is that now all category names passed to tracing macros have to be constexpr, because they are later passed to a constexpr function. I had to modify two groups of callers to make the patch work:
1) Code using constants holding a category name. Adding constexpr to the declaration fixes the issue (see the example below).
2) Utility code that stores a category name as a class member. I fixed that by creating a const char* template parameter for the class, which is probably not the best solution.

Is there a way to allow non-constexpr values to be accepted by the macro without having an extra version of the macro like TRACE_EVENT_NON_CONSTEXPR()?

Would it be possible to link external embedders such as v8 to the categories table at compile time? At first glance it seems like v8 resolves the dependency on base/trace_event only at run time [5].

Are ~50 KiB of binary size savings worth this extra complexity? The CL [1] is still a prototype and I'd like to know if it's worth spending more time to land it.

[1] CL "Constexpr-eliminate GetCategoryGroupEnabled calls for popular categories": https://crrev.com/c/1251361
[2] Supersize HTML report: https://storage.googleapis.com/chrome-supersize/viewer.html?load_url=https%3A%2F%2Fstorage.googleapis.com%2Fchromium-binary-size-trybot-results%2Fandroid-binary-size%2F2018%2F09%2F28%2F64926.ndjson&diff_mode=on
[3] Generated assembly with annotations: https://docs.google.com/document/d/1piIq_yi2fsqOEKObx0tFHeunMyX7hp2ZdAGKK6t0k80/edit?usp=sharing
[4] Idea to reduce the number of parameters in AddTraceEvent(): https://crbug.com/702718#c13
[5] v8 trace_event: https://cs.chromium.org/chromium/src/v8/src/tracing/trace-event.cc?rcl=d46dfc574b2090b0ffa7cd0d3cea901deddb3bd7&l=19
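For illustration, the fix for group (1) above is a one-word change (kMyCategory is a made-up name):
-----
// Before - rejected once the macro requires a constant expression:
//   const char kMyCategory[] = "my-category";
// After - usable by the macro's constexpr category lookup:
constexpr char kMyCategory[] = "my-category";
-----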
Oct 1
Re #28:

> Are ~50KiB binary size savings worth this extra complexity?

IMO, while it was non-trivial to do and investigate, the end-state complexity is not noticeably increased. There is an array of popular categories, a couple of small classes moved to .h, and that's it! Does it make changes to tracing macros more difficult? Seems not.

On the bonus side: it saves binary size for every new trace event used on Android, so the savings will grow with time :)
Oct 1
Woot, thanks a lot for the annotated assembly [3], awesome work, love it. Just to clarify, can you confirm how I read the left vs right side? Is left=today, right=the proposal?

> Is there a way to allow non-constexpr values to be accepted by the macro without having an extra version of the macro like TRACE_EVENT_NON_CONSTEXPR()?

IIRC: for the category, today we do tolerate non-constexpr strings, but the caller has to guarantee that they are indefinitely lived (strdup()'d or in some singleton map). Not sure anybody is using this (hopefully not), and if they aren't I am happy if we take that option back and force the category to be constexpr only. For the non-category strings (event name, arg names): yes, we even allow them to be short-lived, via TRACE_STR_COPY (example below), and I think that is going to be harder to change.

> Are ~50KiB binary size savings worth this extra complexity?

Yeah, if we are talking about 50 KiB, not sure this is worth a huge refactoring or increasing the complexity of the tracing macros.

> Would it be possible to link external embedders such as v8 to the categories table at compile time?

Hmm, this might be tricky, but possible given that they share the same trace_event/common.h header. Haven't thought about this too much though.

Q: If we forced categories to be always constexpr, removing the fallback, what would be the estimated saving?
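For reference, the TRACE_STR_COPY escape hatch mentioned above looks roughly like this (illustrative call; BuildDynamicName is made up):
-----
#include <string>

void Example() {
  // TRACE_STR_COPY makes the macro copy the string rather than assuming
  // the pointer outlives the trace event.
  std::string name = BuildDynamicName();
  TRACE_EVENT0("browser", TRACE_STR_COPY(name.c_str()));
}
-----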
Oct 1
> Just to clarify, can you confirm how I read the left vs right side? Is left=today, right=the proposal?

Yes, that's right: left=today, right=after applying CL [1].

> Q: If we forced categories to be always constexpr, removing the fallback, what would be the estimated saving?

I just did this and got -125 KiB of binary size for the monochrome apk (compared to without the patch). Without the fallback, the compiler can completely remove the code generated by the tracing macro. So even if we were able to resolve all categories at compile time, we wouldn't get all 125 KiB. Currently we don't resolve a category name at compile time if it lists several categories separated by commas.

I also noticed that I was missing some important categories in the static list. Adding them increased the savings from patch [1] to 60 KiB (instead of 52 KiB).
Oct 1
> So even if we were able to resolve all categories at compile time, we wouldn't get all 125 KiB.

I failed to parse this statement (very likely just my brain fried at 8PM). You first said that you saved 125 KiB, but then say that we won't be able to get all 125 KiB? Can you expand? (Again, this is 99% me having 2 neurons left at this time of the day.)

> Currently we don't resolve a category name at compile time if it lists several categories separated by commas.

AAAAAAA I forgot about that horrible hack. That makes everything more complicated :( Is this the reason why we cannot have a static list of categories?

Re #29: if you can figure out something that works and that doesn't increase the complexity, that sounds great even for 60K, agreed.
Oct 1
> I failed to parse this statement (very likely just my brain fried at 8PM). You first said that you saved 125 KiB, but then say that we won't be able to get all 125 KiB? Can you expand?
To get these 125 KiB, I modified the CL [1] to always return a nullptr if the category is unknown at compile time*. This allowed the compiler to completely remove the code inside the "if (*category_group_enabled && ENABLED_FLAGS)" statement, which greatly reduces the binary size. We cannot ship this version, obviously.
> AAAAAAA I forgot about that horrible hack. That makes everything more complicated :(
> Is this the reason why we cannot have a static list of categories?

This is the reason we cannot have a *complete* list of categories. There are too many combinations of different categories to put them in the static list.
*So an expanded macro looks like this:
constexpr const unsigned char* k_category_group_enabled =
base::trace_event::GetCategoryGroupEnabledPtrConstexpr("foo");
const unsigned char* category_group_enabled;
if (k_category_group_enabled) {
category_group_enabled = k_category_group_enabled;
} else {
category_group_enabled = nullptr; // Here's the change.
}
trace_event_internal::ScopedTracer tracer;
if (__builtin_expect(
!!(*category_group_enabled &
(base::trace_event::TraceCategory::ENABLED_FOR_RECORDING |
base::trace_event::TraceCategory::ENABLED_FOR_ETW_EXPORT |
base::trace_event::TraceCategory::ENABLED_FOR_FILTERING)),
0)) {
base::trace_event::TraceEventHandle h = trace_event_internal::AddTraceEvent(
('X'), category_group_enabled, "my_event1",
trace_event_internal::kGlobalScope, trace_event_internal::kNoId,
(static_cast<unsigned int>(0)), trace_event_internal::kNoId);
tracer.Initialize(category_group_enabled, "my_event1", h);
}
Oct 1
Can we detect the comma at constexpr time and route to a "fat" implementation? Events with commas are way less frequent; it would be ideal to pay the binary cost of the fallback only when we need it (see the sketch below).
Alternatively, could we have a dedicated explicit macro for these cases? Better if it is a composable macro argument (e.g. TRACE_EVENT2(TRACE_MANY("Foo,bar"), "event")) rather than a whole new set of macro variants, because that would bloom into another combinatorial explosion of macros.
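A minimal sketch of such a compile-time comma check (ContainsComma is an invented name; recursion keeps it C++11-friendly):
-----
static constexpr bool ContainsComma(const char* s) {
  // True if any character of |s| is a ','.
  return *s != '\0' && (*s == ',' || ContainsComma(s + 1));
}
static_assert(!ContainsComma("toplevel"), "single category");
static_assert(ContainsComma("cat1,cat2"), "category group");
-----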
Oct 1
> Can we detect the comma at constexpr time and route to a "fat" implementation? Events with commas are way less frequent; it would be ideal to pay the binary cost of the fallback only when we need it.

That's what currently happens. We don't add categories with commas to the static list, so they go through the "fat" implementation. Another problem is that some categories without any commas may be missing from the static list; they are also routed to the slow path. Detecting and special-casing commas makes sense if we had a static assert for categories without commas, advising to add the category to the static list.
Oct 1
> We don't add categories with commas to the static list, so they go through the "fat" implementation.

No, what I mean is *emitting* the code for the fat implementation only when we see a comma at compile time. That should get us very close to those 125 KiB, right?

> Detecting and special-casing commas makes sense if we had a static assert for categories without commas, advising to add the category to the static list.

That would be a great end state.
Oct 2
> No, what I mean is *emitting* the code for the fat implementation only when we see a comma at compile time. That should get us very close to those 125 KiB, right?
The compiler already doesn't generate any code to get a category at run-time if the category was found at compile-time.
Let's look through the TRACE_EVENT0 code:
constexpr const unsigned char* k_category_group_enabled =
base::trace_event::GetCategoryGroupEnabledPtrConstexpr("foo");
const unsigned char* category_group_enabled = nullptr;
if (k_category_group_enabled) {
constexpr_category:
category_group_enabled = k_category_group_enabled;
} else {
dynamic_category: // "fat" implementation
static base::subtle::AtomicWord atomic = 0;
category_group_enabled = reinterpret_cast<const unsigned char*>(
base::subtle::NoBarrier_Load(&(atomic)));
if (!category_group_enabled) {
category_group_enabled =
base::trace_event::TraceLog::GetCategoryGroupEnabled("foo");
base::subtle::NoBarrier_Store(
&(atomic),
(reinterpret_cast<base::subtle::AtomicWord>(category_group_enabled)));
}
}
trace_event_internal::ScopedTracer tracer;
record_trace: // useful stuff that TRACE_EVENT0 does
if (*category_group_enabled &
(base::trace_event::TraceCategory::ENABLED_FOR_RECORDING |
base::trace_event::TraceCategory::ENABLED_FOR_ETW_EXPORT |
base::trace_event::TraceCategory::ENABLED_FOR_FILTERING)) {
base::trace_event::TraceEventHandle h =
trace_event_internal::AddTraceEvent(...);
tracer.Initialize(category_group_enabled, "my_event1", h);
}
The compiler emits the code either for |constexpr_category| or for |dynamic_category| because it never needs both.
I got the 125 KiB number by removing the |dynamic_category| part completely. In this case the compiler just ignores everything after |record_trace|, because the if statement is always false. This saves XX KiB of code, but TRACE_EVENT0 becomes essentially a no-op if a category isn't resolved at compile time.
Oct 5
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c5fb30be3e04239045ebe2c64dd80aa6cf046a4e

commit c5fb30be3e04239045ebe2c64dd80aa6cf046a4e
Author: David 'Digit' Turner <digit@google.com>
Date: Fri Oct 05 14:42:49 2018

base: make ScopedTracer smaller.

Use C++11 default initializers to make ScopedTracer smaller and simpler. Surprisingly, this saves 12288 bytes in MonochromePublic.apk for the 32-bit ARM build!

BUG=702718,708990
R=agrieve@chromium.org,lizeb@chromium.org,pasko@chromium.org,alexilin@chromium.org,primiano@chromium.org,picksi@chromium.org
Change-Id: Ia1d3e936ec8b9cfe3c52b9cfc0b97a2a38ac7043
Reviewed-on: https://chromium-review.googlesource.com/c/1254723
Reviewed-by: Benoit L <lizeb@chromium.org>
Reviewed-by: Primiano Tucci <primiano@chromium.org>
Reviewed-by: Egor Pasko <pasko@chromium.org>
Reviewed-by: agrieve <agrieve@chromium.org>
Reviewed-by: Alexandr Ilin <alexilin@chromium.org>
Commit-Queue: David Turner <digit@chromium.org>
Cr-Commit-Position: refs/heads/master@{#597111}

[modify] https://crrev.com/c5fb30be3e04239045ebe2c64dd80aa6cf046a4e/base/trace_event/trace_event.h
Nov 14
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/331266377d03b51dc3e2cf2c26b896e21f8f0391

commit 331266377d03b51dc3e2cf2c26b896e21f8f0391
Author: Alexandr Ilin <alexilin@chromium.org>
Date: Wed Nov 14 14:48:17 2018

Constexpr-eliminate GetCategoryGroupEnabled calls for popular categories

This CL allows to avoid the cost of calling GetCategoryGroupEnabled() in TRACE_EVENT macros for popular categories, in a way that doesn't require more macros, nor changing all the call sites. The idea is to put all popular categories into a static global array so an address of a category state may be determined at compile-time through a call to a constexpr function.

This CL makes it impossible to use non constant expressions as a category name in the tracing macros. All the call sites that don't satisfy this requirement have to be converted. Hopefully, there aren't a lot of such callers.

This change reduces the code size generated from each trace macro by ~16 bytes, check the full report here: https://docs.google.com/document/d/1piIq_yi2fsqOEKObx0tFHeunMyX7hp2ZdAGKK6t0k80/edit?usp=sharing. The total Android apk size savings are ~60 KiB.

Bug: 702718, 708990
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel;master.tryserver.blink:linux_trusty_blink_rel
Change-Id: I9f7b48fd4ee614996feaabda3e5495fd9b786c3a
Reviewed-on: https://chromium-review.googlesource.com/c/1251361
Commit-Queue: Alexandr Ilin <alexilin@chromium.org>
Reviewed-by: Scott Violet <sky@chromium.org>
Reviewed-by: Primiano Tucci <primiano@chromium.org>
Reviewed-by: Tibor Goldschwendt <tiborg@chromium.org>
Reviewed-by: Frank Liberato <liberato@chromium.org>
Reviewed-by: vmpstr <vmpstr@chromium.org>
Reviewed-by: Sami Kyöstilä <skyostil@chromium.org>
Reviewed-by: David Turner <digit@chromium.org>
Cr-Commit-Position: refs/heads/master@{#607975}

[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/android_webview/common/devtools_instrumentation.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/BUILD.gn
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/android/early_trace_event_binding.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/android/trace_event_binding.cc
[delete] https://crrev.com/65644ef9859380e742f3a5d91c627e3b3e942183/base/trace_event/auto_open_close_event.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/auto_open_close_event.h
[add] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/builtin_categories.cc
[add] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/builtin_categories.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/category_registry.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/category_registry.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/cpufreq_monitor_android.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/memory_dump_manager.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/memory_dump_manager.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/trace_category.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/trace_category_unittest.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/trace_event.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/trace_event_unittest.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/trace_log.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/base/trace_event/trace_log.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/cc/base/devtools_instrumentation.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/cc/base/devtools_instrumentation.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/cc/benchmarks/benchmark_instrumentation.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/cc/tiles/frame_viewer_instrumentation.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/cc/tiles/frame_viewer_instrumentation.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/cc/trees/layer_tree_host_impl.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/chrome/browser/android/vr/gvr_scheduler_delegate.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/chrome/browser/android/vr/scoped_gpu_trace.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/chrome/browser/android/vr/scoped_gpu_trace.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/components/heap_profiling/test_driver.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/content/browser/renderer_host/render_widget_targeter.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/content/browser/tracing/file_tracing_provider_impl.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/media/base/scoped_async_trace.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/media/base/scoped_async_trace.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/media/blink/video_frame_compositor.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/media/blink/video_frame_compositor.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/media/gpu/android/media_codec_video_decoder.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/BUILD.gn
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/base/network_delegate.cc
[delete] https://crrev.com/65644ef9859380e742f3a5d91c627e3b3e942183/net/base/trace_constants.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/base/trace_constants.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/cert/crl_set.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/cert/multi_threaded_cert_verifier.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/dns/host_cache.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/dns/host_resolver_impl.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/http/http_cache_transaction.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/http/http_stream_factory_job.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/log/trace_net_log_observer.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/nqe/network_quality_estimator.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/proxy_resolution/proxy_resolver_v8_tracing.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/quic/quic_stream_factory.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/socket/client_socket_handle.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/socket/client_socket_pool_base.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/socket/socket_posix.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/socket/ssl_client_socket_impl.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/socket/ssl_client_socket_pool.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/socket/transport_client_socket_pool.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/socket/udp_socket_posix.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/socket/websocket_transport_client_socket_pool.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/spdy/spdy_session_pool.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/net/url_request/url_request_http_job.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/services/tracing/public/cpp/perfetto/trace_event_data_source_unittest.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/services/tracing/public/cpp/trace_event_agent_unittest.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/third_party/blink/renderer/platform/bindings/runtime_call_stats.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/third_party/blink/renderer/platform/network/network_instrumentation.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/third_party/blink/renderer/platform/scheduler/common/throttling/cpu_time_budget_pool.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/third_party/blink/renderer/platform/scheduler/common/tracing_helper.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/third_party/blink/renderer/platform/scheduler/common/tracing_helper.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/third_party/blink/renderer/platform/scheduler/common/tracing_helper_unittest.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/third_party/blink/renderer/platform/scheduler/main_thread/frame_scheduler_impl.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/third_party/blink/renderer/platform/scheduler/main_thread/main_thread_scheduler_impl.h
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/ui/gl/gl_surface_egl.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/ui/latency/frame_metrics.cc
[modify] https://crrev.com/331266377d03b51dc3e2cf2c26b896e21f8f0391/ui/latency/latency_info.cc
Nov 19
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/cb31836299a37ba2b8775a191ca74211189f2687

commit cb31836299a37ba2b8775a191ca74211189f2687
Author: Alexandr Ilin <alexilin@chromium.org>
Date: Mon Nov 19 17:29:37 2018

tracing: Display all categories in about:tracing UI.

about:tracing UI calls TraceLog::GetKnownCategoryGroups() to get categories to display in "Record" dialog. This function assumes that built-in categories shouldn't be displayed. However, the CL https://crrev.com/c/1251361 made almost all categories built-in, thus a majority of categories isn't displayed in the UI. This CL fixes this by narrowing the list of non-displayed categories to "meta" categories.

Bug: 708990
Change-Id: Ic8ccfa396e890983afda96a20ebfe00557f319a9
Reviewed-on: https://chromium-review.googlesource.com/c/1341909
Reviewed-by: Sami Kyöstilä <skyostil@chromium.org>
Reviewed-by: Primiano Tucci <primiano@chromium.org>
Commit-Queue: Alexandr Ilin <alexilin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#609344}

[modify] https://crrev.com/cb31836299a37ba2b8775a191ca74211189f2687/base/trace_event/category_registry.cc
[modify] https://crrev.com/cb31836299a37ba2b8775a191ca74211189f2687/base/trace_event/category_registry.h
[modify] https://crrev.com/cb31836299a37ba2b8775a191ca74211189f2687/base/trace_event/trace_category_unittest.cc
[modify] https://crrev.com/cb31836299a37ba2b8775a191ca74211189f2687/base/trace_event/trace_log.cc
Comment 1 by primiano@chromium.org
Apr 6 2017

Update: with some more thought I managed to convince GCC too to do what I want. The problem was the following in the TRACE_EVENT0 macro:
---
if (GetEnabledPtrAtCompileTime(cat)) {
---
Just because GetEnabledPtrAtCompileTime is constexpr doesn't mean that the compiler is not allowed to actually instantiate code for it. However, I managed to fix that by forcing the result into a constexpr variable itself:
---
#define TRACE_EVENT0(cat, event_name)                                    \
  static uint8_t* UNIQUE_VAR = nullptr;                                  \
  constexpr uint8_t* k_##UNIQUE_VAR = GetEnabledPtrAtCompileTime(cat);   \
  if (k_##UNIQUE_VAR) {                                                  \
    UNIQUE_VAR = k_##UNIQUE_VAR;                                         \
  } else if (UNIQUE_VAR == nullptr) {                                    \
    UNIQUE_VAR = GetEnabledPtrAtRunTime(cat);                            \
    printf("Dynamically getting pointer '%s' : %p\n", cat, UNIQUE_VAR);  \
  }                                                                      \
  do {                                                                   \
    if (*((volatile uint8_t*)UNIQUE_VAR))                                \
      AddEvent(cat, event_name);                                         \
  } while (0)
---
Funny thing, at this point clang started complaining:

  constexpr variable 'k_UNIQUE_VAR' must be initialized by a constant expression

as it didn't like the char* pointer comparison. But by introducing an actual constexpr strcmp:
-----
constexpr bool ConstexprStrcmp(const char* a, const char* b, int n = 0) {
  return (!a[n] && !b[n])
             ? true
             : ((a[n] == b[n]) ? ConstexprStrcmp(a, b, n + 1) : false);
}
-----
both compilers are happy now. Full version of the prototype here:
https://gist.github.com/primiano/d5c2f13dc89050c8fbfb7c70dab92288

Now also gcc saves 40 bytes on ARM (-Os, is_official_build=true), proof:

(gdb) disass foo
Dump of assembler code for function foo():
0x0000000100000e42 <+0>: push %rbp
0x0000000100000e43 <+1>: mov %rsp,%rbp
0x0000000100000e46 <+4>: cmpb $0x0,0x1db(%rip) # 0x100001028 <g_categories+8>
0x0000000100000e4d <+11>: je 0x100000e63 <foo()+33>
0x0000000100000e4f <+13>: lea 0xde(%rip),%rdi # 0x100000f34
0x0000000100000e56 <+20>: lea 0xf6(%rip),%rsi # 0x100000f53
0x0000000100000e5d <+27>: pop %rbp
0x0000000100000e5e <+28>: jmpq 0x100000e15 <AddEvent(char const*, char const*)>
0x0000000100000e63 <+33>: pop %rbp
0x0000000100000e64 <+34>: retq
End of assembler dump.
(gdb) disass baz
Dump of assembler code for function baz():
0x0000000100000e65 <+0>: mov 0x7f4(%rip),%rax # 0x100001660 <_ZZ3bazE5var77>
0x0000000100000e6c <+7>: test %rax,%rax
0x0000000100000e6f <+10>: jne 0x100000eb1 <baz()+76>
0x0000000100000e71 <+12>: push %rbp
0x0000000100000e72 <+13>: mov %rsp,%rbp
0x0000000100000e75 <+16>: push %rbx
0x0000000100000e76 <+17>: push %rax
0x0000000100000e77 <+18>: lea 0xe4(%rip),%rbx # 0x100000f62
0x0000000100000e7e <+25>: mov %rbx,%rdi
0x0000000100000e81 <+28>: callq 0x100000dc2 <GetEnabledPtrAtRunTime(char const*)>
0x0000000100000e86 <+33>: mov %rax,%rcx
0x0000000100000e89 <+36>: mov %rcx,0x7d0(%rip) # 0x100001660 <_ZZ3bazE5var77>
0x0000000100000e90 <+43>: lea 0xcf(%rip),%rdi # 0x100000f66
0x0000000100000e97 <+50>: xor %eax,%eax
0x0000000100000e99 <+52>: mov %rbx,%rsi
0x0000000100000e9c <+55>: mov %rcx,%rdx
0x0000000100000e9f <+58>: callq 0x100000f04
0x0000000100000ea4 <+63>: mov 0x7b5(%rip),%rax # 0x100001660 <_ZZ3bazE5var77>
0x0000000100000eab <+70>: add $0x8,%rsp
0x0000000100000eaf <+74>: pop %rbx
0x0000000100000eb0 <+75>: pop %rbp
0x0000000100000eb1 <+76>: cmpb $0x0,(%rax)
0x0000000100000eb4 <+79>: je 0x100000ec9 <baz()+100>
0x0000000100000eb6 <+81>: lea 0xa5(%rip),%rdi # 0x100000f62
0x0000000100000ebd <+88>: lea 0xc9(%rip),%rsi # 0x100000f8d
0x0000000100000ec4 <+95>: jmpq 0x100000e15 <AddEvent(char const*, char const*)>
0x0000000100000ec9 <+100>: retq
End of assembler dump.
(gdb) quit

Now the main concern is: how horrifying is this? I realize that I am playing at being clever with the compiler, but from a realistic viewpoint this seems to save ~50 bytes per macro. thakis@/dcheng@, opinions here? Maybe there is a saner way to do all this instead of reinventing all my constexpr craziness?