* [PATCH V2 1/2] module: Add comments describing how the "strmap" logic works @ 2011-11-13 3:08 Kevin Cernekee 2011-11-13 3:08 ` [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables Kevin Cernekee 0 siblings, 1 reply; 10+ messages in thread From: Kevin Cernekee @ 2011-11-13 3:08 UTC (permalink / raw) To: Rusty Russell; +Cc: linux-kernel, linux-kbuild, Jan Beulich Signed-off-by: Kevin Cernekee <cernekee@gmail.com> --- kernel/module.c | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/kernel/module.c b/kernel/module.c index 178333c..cf9f1b6 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2193,6 +2193,13 @@ static void layout_symtab(struct module *mod, struct load_info *info) src = (void *)info->hdr + symsect->sh_offset; nsrc = symsect->sh_size / sizeof(*src); + + /* + * info->strmap has a '1' bit for each byte of .strtab we want to + * keep resident in mod->core_strtab. Everything else in .strtab + * is unreferenced by the symbols in mod->core_symtab, and will be + * discarded when add_kallsyms() compacts the string table. + */ for (ndst = i = 1; i < nsrc; ++i, ++src) if (is_core_symbol(src, info->sechdrs, info->hdr->e_shnum)) { unsigned int j = src->st_name; @@ -2215,6 +2222,8 @@ static void layout_symtab(struct module *mod, struct load_info *info) /* Append room for core symbols' strings at end of core part. */ info->stroffs = mod->core_size; + + /* First strtab byte (and first symtab entry) are zeroes. */ __set_bit(0, info->strmap); mod->core_size += bitmap_weight(info->strmap, strsect->sh_size); } -- 1.7.6.3 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables 2011-11-13 3:08 [PATCH V2 1/2] module: Add comments describing how the "strmap" logic works Kevin Cernekee @ 2011-11-13 3:08 ` Kevin Cernekee 2011-11-14 0:35 ` Rusty Russell 2011-11-14 8:48 ` Jan Beulich 0 siblings, 2 replies; 10+ messages in thread From: Kevin Cernekee @ 2011-11-13 3:08 UTC (permalink / raw) To: Rusty Russell; +Cc: linux-kernel, linux-kbuild, Jan Beulich Commit 554bdfe5acf3715e87c8d5e25a4f9a896ac9f014 (module: reduce string table for loaded modules) introduced an optimization to shrink the size of the resident string table. Part of this involves calling bitmap_weight() on the strmap bitmap once for each core symbol. strmap contains one bit for each byte of the module's strtab. For kernel modules with a large number of symbols, the addition of the bitmap_weight() operation to each iteration of the add_kallsyms() loop resulted in a significant "insmod" performance regression from 2.6.31 to 2.6.32. bitmap_weight() is expensive when the bitmap is large. The proposed alternative optimizes the common case in this loop (is_core_symbol() == true, and the symbol name is not a duplicate), while penalizing the exceptional case of a duplicate symbol. My test was run on a 600 MHz MIPS processor, using a kernel module with 15,000 "core" symbols and 10,000 symbols in .init.text. .strtab takes up 250,227 bytes. Original code: insmod takes 3.39 seconds Patched code: insmod takes 0.07 seconds Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Kevin Cernekee <cernekee@gmail.com> --- V2: Properly handle cases where different symtab entries point to strtab substrings (suffix merging). Mostly Rusty's work. Also, add comments. My test case for suffix merging was: $ nm --no-sort alphabet.ko 00000000 t hello 00000014 t init 00000000 d retcode 00000000 r __mod_retcodetype8 00000000 r __param_retcode 00000000 r __param_str_retcode 00000000 r $LC0 00000018 r __module_depends 00000000 r ____versions 00000021 r __mod_vermagic5 00000050 T uvwxyz 00000000 D __this_module 00000048 T yz 00000000 T qrstuvwxyz 00000060 T rstuvwxyz 00000014 T init_module 00000040 T wxyz 00000068 T tuvwxyz U printk 00000058 T vwxyz U param_ops_int "qrstuvwxyz" is in .init.text (non-core symbol); all others were core symbols. The symtab entries for all substrings were hexedited to point to the respective portion of the "qrstuvwxyz" strtab entry. I verified that "rstuvwxyz" (without the 'q') was copied to core_strtab when the loop hit "uvwxyz", and that the substrings all used that entry. kernel/module.c | 40 +++++++++++++++++++++++++++++++++------- 1 files changed, 33 insertions(+), 7 deletions(-) diff --git a/kernel/module.c b/kernel/module.c index cf9f1b6..6c87305 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2246,22 +2246,48 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) mod->symtab[i].st_info = elf_type(&mod->symtab[i], info); mod->core_symtab = dst = mod->module_core + info->symoffs; + mod->core_strtab = s = mod->module_core + info->stroffs; src = mod->symtab; *dst = *src; + *s++ = 0; for (ndst = i = 1; i < mod->num_symtab; ++i, ++src) { + const char *name; if (!is_core_symbol(src, info->sechdrs, info->hdr->e_shnum)) continue; dst[ndst] = *src; - dst[ndst].st_name = bitmap_weight(info->strmap, - dst[ndst].st_name); + + name = &mod->strtab[src->st_name]; + if (unlikely(!test_bit(src->st_name, info->strmap))) { + /* Symbol name has already been copied; find it. */ + char *dup; + + for (dup = mod->core_strtab; strcmp(dup, name); dup++) + BUG_ON(dup > s); + + dst[ndst].st_name = dup - mod->core_strtab; + } else { + /* + * Copy the symbol name to mod->core_strtab. + * "name" might point to the middle of a longer strtab + * entry, so backtrack to the first "required" byte + * of the string. + */ + unsigned start = src->st_name, len = 0; + + for (; test_bit(start - 1, info->strmap) && + mod->strtab[start - 1]; start--) + len++; + + dst[ndst].st_name = len + s - mod->core_strtab; + len += strlen(name) + 1; + + memcpy(s, &mod->strtab[start], len); + s += len; + bitmap_clear(info->strmap, start, len); + } ++ndst; } mod->core_num_syms = ndst; - - mod->core_strtab = s = mod->module_core + info->stroffs; - for (*s = 0, i = 1; i < info->sechdrs[info->index.str].sh_size; ++i) - if (test_bit(i, info->strmap)) - *++s = mod->strtab[i]; } #else static inline void layout_symtab(struct module *mod, struct load_info *info) -- 1.7.6.3 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables 2011-11-13 3:08 ` [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables Kevin Cernekee @ 2011-11-14 0:35 ` Rusty Russell 2011-11-14 8:48 ` Jan Beulich 1 sibling, 0 replies; 10+ messages in thread From: Rusty Russell @ 2011-11-14 0:35 UTC (permalink / raw) To: Kevin Cernekee; +Cc: linux-kernel, linux-kbuild, Jan Beulich On Sat, 12 Nov 2011 19:08:56 -0800, Kevin Cernekee <cernekee@gmail.com> wrote: > + /* > + * Copy the symbol name to mod->core_strtab. > + * "name" might point to the middle of a longer strtab > + * entry, so backtrack to the first "required" byte > + * of the string. > + */ Icky, but it works. Both applied, thanks! Rusty. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables 2011-11-13 3:08 ` [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables Kevin Cernekee 2011-11-14 0:35 ` Rusty Russell @ 2011-11-14 8:48 ` Jan Beulich 2011-11-16 0:24 ` Rusty Russell 2011-11-18 3:15 ` Kevin Cernekee 1 sibling, 2 replies; 10+ messages in thread From: Jan Beulich @ 2011-11-14 8:48 UTC (permalink / raw) To: Kevin Cernekee, Rusty Russell; +Cc: linux-kbuild, linux-kernel >>> On 13.11.11 at 04:08, Kevin Cernekee <cernekee@gmail.com> wrote: > Commit 554bdfe5acf3715e87c8d5e25a4f9a896ac9f014 (module: reduce string > table for loaded modules) introduced an optimization to shrink the size of > the resident string table. Part of this involves calling bitmap_weight() > on the strmap bitmap once for each core symbol. strmap contains one bit > for each byte of the module's strtab. > > For kernel modules with a large number of symbols, the addition of the > bitmap_weight() operation to each iteration of the add_kallsyms() loop > resulted in a significant "insmod" performance regression from 2.6.31 > to 2.6.32. bitmap_weight() is expensive when the bitmap is large. > > The proposed alternative optimizes the common case in this loop > (is_core_symbol() == true, and the symbol name is not a duplicate), while > penalizing the exceptional case of a duplicate symbol. > > My test was run on a 600 MHz MIPS processor, using a kernel module with > 15,000 "core" symbols and 10,000 symbols in .init.text. .strtab takes up > 250,227 bytes. > > Original code: insmod takes 3.39 seconds > Patched code: insmod takes 0.07 seconds > > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> > Signed-off-by: Kevin Cernekee <cernekee@gmail.com> > --- > > V2: > > Properly handle cases where different symtab entries point to strtab > substrings (suffix merging). Mostly Rusty's work. Also, add comments. > > My test case for suffix merging was: > > $ nm --no-sort alphabet.ko > 00000000 t hello > 00000014 t init > 00000000 d retcode > 00000000 r __mod_retcodetype8 > 00000000 r __param_retcode > 00000000 r __param_str_retcode > 00000000 r $LC0 > 00000018 r __module_depends > 00000000 r ____versions > 00000021 r __mod_vermagic5 > 00000050 T uvwxyz > 00000000 D __this_module > 00000048 T yz > 00000000 T qrstuvwxyz > 00000060 T rstuvwxyz > 00000014 T init_module > 00000040 T wxyz > 00000068 T tuvwxyz > U printk > 00000058 T vwxyz > U param_ops_int > > "qrstuvwxyz" is in .init.text (non-core symbol); all others were core > symbols. The symtab entries for all substrings were hexedited to > point to the respective portion of the "qrstuvwxyz" strtab entry. > > I verified that "rstuvwxyz" (without the 'q') was copied to core_strtab > when the loop hit "uvwxyz", and that the substrings all used that entry. > > > kernel/module.c | 40 +++++++++++++++++++++++++++++++++------- > 1 files changed, 33 insertions(+), 7 deletions(-) > > diff --git a/kernel/module.c b/kernel/module.c > index cf9f1b6..6c87305 100644 > --- a/kernel/module.c > +++ b/kernel/module.c > @@ -2246,22 +2246,48 @@ static void add_kallsyms(struct module *mod, const > struct load_info *info) > mod->symtab[i].st_info = elf_type(&mod->symtab[i], info); > > mod->core_symtab = dst = mod->module_core + info->symoffs; > + mod->core_strtab = s = mod->module_core + info->stroffs; > src = mod->symtab; > *dst = *src; > + *s++ = 0; > for (ndst = i = 1; i < mod->num_symtab; ++i, ++src) { > + const char *name; > if (!is_core_symbol(src, info->sechdrs, info->hdr->e_shnum)) > continue; > dst[ndst] = *src; > - dst[ndst].st_name = bitmap_weight(info->strmap, > - dst[ndst].st_name); > + > + name = &mod->strtab[src->st_name]; > + if (unlikely(!test_bit(src->st_name, info->strmap))) { > + /* Symbol name has already been copied; find it. */ > + char *dup; > + > + for (dup = mod->core_strtab; strcmp(dup, name); dup++) > + BUG_ON(dup > s); Aren't you concerned that this again will be rather slow? It would be pretty easy to accelerate by comparing only with the tail of each string (as nothing else can possibly match), moving from string to string instead of from character to character. Jan > + > + dst[ndst].st_name = dup - mod->core_strtab; > + } else { > + /* > + * Copy the symbol name to mod->core_strtab. > + * "name" might point to the middle of a longer strtab > + * entry, so backtrack to the first "required" byte > + * of the string. > + */ > + unsigned start = src->st_name, len = 0; > + > + for (; test_bit(start - 1, info->strmap) && > + mod->strtab[start - 1]; start--) > + len++; > + > + dst[ndst].st_name = len + s - mod->core_strtab; > + len += strlen(name) + 1; > + > + memcpy(s, &mod->strtab[start], len); > + s += len; > + bitmap_clear(info->strmap, start, len); > + } > ++ndst; > } > mod->core_num_syms = ndst; > - > - mod->core_strtab = s = mod->module_core + info->stroffs; > - for (*s = 0, i = 1; i < info->sechdrs[info->index.str].sh_size; ++i) > - if (test_bit(i, info->strmap)) > - *++s = mod->strtab[i]; > } > #else > static inline void layout_symtab(struct module *mod, struct load_info > *info) ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables 2011-11-14 8:48 ` Jan Beulich @ 2011-11-16 0:24 ` Rusty Russell 2011-11-18 3:15 ` Kevin Cernekee 1 sibling, 0 replies; 10+ messages in thread From: Rusty Russell @ 2011-11-16 0:24 UTC (permalink / raw) To: Jan Beulich, Kevin Cernekee; +Cc: linux-kbuild, linux-kernel On Mon, 14 Nov 2011 08:48:52 +0000, "Jan Beulich" <JBeulich@suse.com> wrote: > > + name = &mod->strtab[src->st_name]; > > + if (unlikely(!test_bit(src->st_name, info->strmap))) { > > + /* Symbol name has already been copied; find it. */ > > + char *dup; > > + > > + for (dup = mod->core_strtab; strcmp(dup, name); dup++) > > + BUG_ON(dup > s); > > Aren't you concerned that this again will be rather slow? It would be > pretty easy to accelerate by comparing only with the tail of each string > (as nothing else can possibly match), moving from string to string instead > of from character to character. Kevin's central thesis is that this is actually really unusual. I'm not sure how much faster a tail search would be in practice. We're still scanning the string. Perhaps a "dup[0] == name[0] &&" would optimize it almost as well, if gcc doesn't already? Kevin, you're probably in an optimal position to get numbers on this? Thanks, Rusty. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables 2011-11-14 8:48 ` Jan Beulich 2011-11-16 0:24 ` Rusty Russell @ 2011-11-18 3:15 ` Kevin Cernekee 2011-11-21 5:06 ` Rusty Russell 1 sibling, 1 reply; 10+ messages in thread From: Kevin Cernekee @ 2011-11-18 3:15 UTC (permalink / raw) To: 'Jan Beulich'; +Cc: 'Rusty Russell', linux-kbuild, linux-kernel On Mon, 14 Nov 2011 08:48:52 +0000, "Jan Beulich" <JBeulich@suse.com> wrote: > > + for (dup = mod->core_strtab; strcmp(dup, name); dup++) > > + BUG_ON(dup > s); > > Aren't you concerned that this again will be rather slow? It would be > pretty easy to accelerate by comparing only with the tail of each string > (as nothing else can possibly match), moving from string to string instead > of from character to character. I reinstalled an old laptop today and then ran a few numbers for fun. System: Linux 3.0.4, Ubuntu 11.10, x86 32-bit, Intel Core 2 P8700, nvidia 280.13 driver Looking at /proc/kallsyms, one starts to ponder whether all of the extra strtab-related complexity in module.c is worth the memory savings: nvidia.ko: 33,785 entries (3,037 are duplicates) Other *.ko files: 5,972 entries (3 are duplicates) Kernel image: 62,737 entries Sum of lengths of all *.ko core symbol names + terminating bytes: 542,824 bytes; 408,585 of that is from nvidia.ko Potential memory savings from suffix merging (shorter strings refer to a longer strtab entry, exact matches excluded): 13,156 bytes total; 487 from nvidia.ko. These mostly came from entries starting with __kcrctab* or __param* . Potential memory savings from reusing EXACT strtab matches: 36,475 bytes total; 35,432 from nvidia.ko. For comparison, discarding non-core strtab entries saved about 91KB on my system. Instead of making the add_kallsyms() loop even more complex, I tried the other route of deleting the strmap logic and naively copying each string into core_strtab with no consideration for consolidating duplicates. According to the above figures, this would unnecessarily cost me about 49KB (10%) if binutils did suffix merging. Performance on an "already exists" insmod of nvidia.ko (runs add_kallsyms() but does not actually initialize the module): Original scheme: 1.230s With patch V2: 0.280s With naive copying: 0.058s I am posting the "naive copying" patch in case anybody else wants to follow up on this. I am satisfied enough with the way the "V2" code works and do not know if any additional tweaking is warranted. But if you can make good use of my patch, feel free: Signed-off-by: Kevin Cernekee <cernekee@gmail.com> --- diff --git a/kernel/module.c b/kernel/module.c index 6c87305..b07f71c 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -138,7 +138,6 @@ struct load_info { unsigned long len; Elf_Shdr *sechdrs; char *secstrings, *strtab; - unsigned long *strmap; unsigned long symoffs, stroffs; struct _ddebug *debug; unsigned int num_debug; @@ -2183,7 +2182,7 @@ static void layout_symtab(struct module *mod, struct load_info *info) Elf_Shdr *symsect = info->sechdrs + info->index.sym; Elf_Shdr *strsect = info->sechdrs + info->index.str; const Elf_Sym *src; - unsigned int i, nsrc, ndst; + unsigned int i, nsrc, ndst, strtab_size; /* Put symbol section at end of init part of module. */ symsect->sh_flags |= SHF_ALLOC; @@ -2194,38 +2193,23 @@ static void layout_symtab(struct module *mod, struct load_info *info) src = (void *)info->hdr + symsect->sh_offset; nsrc = symsect->sh_size / sizeof(*src); - /* - * info->strmap has a '1' bit for each byte of .strtab we want to - * keep resident in mod->core_strtab. Everything else in .strtab - * is unreferenced by the symbols in mod->core_symtab, and will be - * discarded when add_kallsyms() compacts the string table. - */ - for (ndst = i = 1; i < nsrc; ++i, ++src) + /* Compute total space required for the core symbols' strtab. */ + for (ndst = i = strtab_size = 1; i < nsrc; ++i, ++src) if (is_core_symbol(src, info->sechdrs, info->hdr->e_shnum)) { - unsigned int j = src->st_name; - - while (!__test_and_set_bit(j, info->strmap) - && info->strtab[j]) - ++j; - ++ndst; + strtab_size += strlen(&info->strtab[src->st_name]) + 1; + ndst++; } /* Append room for core symbols at end of core part. */ info->symoffs = ALIGN(mod->core_size, symsect->sh_addralign ?: 1); - mod->core_size = info->symoffs + ndst * sizeof(Elf_Sym); + info->stroffs = mod->core_size = info->symoffs + ndst * sizeof(Elf_Sym); + mod->core_size += strtab_size; /* Put string table section at end of init part of module. */ strsect->sh_flags |= SHF_ALLOC; strsect->sh_entsize = get_offset(mod, &mod->init_size, strsect, info->index.str) | INIT_OFFSET_MASK; DEBUGP("\t%s\n", info->secstrings + strsect->sh_name); - - /* Append room for core symbols' strings at end of core part. */ - info->stroffs = mod->core_size; - - /* First strtab byte (and first symtab entry) are zeroes. */ - __set_bit(0, info->strmap); - mod->core_size += bitmap_weight(info->strmap, strsect->sh_size); } static void add_kallsyms(struct module *mod, const struct load_info *info) @@ -2251,41 +2235,12 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) *dst = *src; *s++ = 0; for (ndst = i = 1; i < mod->num_symtab; ++i, ++src) { - const char *name; if (!is_core_symbol(src, info->sechdrs, info->hdr->e_shnum)) continue; - dst[ndst] = *src; - name = &mod->strtab[src->st_name]; - if (unlikely(!test_bit(src->st_name, info->strmap))) { - /* Symbol name has already been copied; find it. */ - char *dup; - - for (dup = mod->core_strtab; strcmp(dup, name); dup++) - BUG_ON(dup > s); - - dst[ndst].st_name = dup - mod->core_strtab; - } else { - /* - * Copy the symbol name to mod->core_strtab. - * "name" might point to the middle of a longer strtab - * entry, so backtrack to the first "required" byte - * of the string. - */ - unsigned start = src->st_name, len = 0; - - for (; test_bit(start - 1, info->strmap) && - mod->strtab[start - 1]; start--) - len++; - - dst[ndst].st_name = len + s - mod->core_strtab; - len += strlen(name) + 1; - - memcpy(s, &mod->strtab[start], len); - s += len; - bitmap_clear(info->strmap, start, len); - } - ++ndst; + dst[ndst] = *src; + dst[ndst++].st_name = s - mod->core_strtab; + s += strlcpy(s, &mod->strtab[src->st_name], KSYM_NAME_LEN) + 1; } mod->core_num_syms = ndst; } @@ -2777,27 +2732,18 @@ static struct module *layout_and_allocate(struct load_info *info) this is done generically; there doesn't appear to be any special cases for the architectures. */ layout_sections(mod, info); - - info->strmap = kzalloc(BITS_TO_LONGS(info->sechdrs[info->index.str].sh_size) - * sizeof(long), GFP_KERNEL); - if (!info->strmap) { - err = -ENOMEM; - goto free_percpu; - } layout_symtab(mod, info); /* Allocate and move to the final place */ err = move_module(mod, info); if (err) - goto free_strmap; + goto free_percpu; /* Module has been copied to its final place now: return it. */ mod = (void *)info->sechdrs[info->index.mod].sh_addr; kmemleak_load_module(mod, info); return mod; -free_strmap: - kfree(info->strmap); free_percpu: percpu_modfree(mod); out: @@ -2807,7 +2753,6 @@ out: /* mod is no longer valid after this! */ static void module_deallocate(struct module *mod, struct load_info *info) { - kfree(info->strmap); percpu_modfree(mod); module_free(mod, mod->module_init); module_free(mod, mod->module_core); @@ -2937,8 +2882,7 @@ static struct module *load_module(void __user *umod, if (err < 0) goto unlink; - /* Get rid of temporary copy and strmap. */ - kfree(info.strmap); + /* Get rid of temporary copy. */ free_copy(&info); /* Done! */ -- ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables 2011-11-18 3:15 ` Kevin Cernekee @ 2011-11-21 5:06 ` Rusty Russell 2011-11-21 6:29 ` Kevin Cernekee 2011-11-21 8:08 ` Jan Beulich 0 siblings, 2 replies; 10+ messages in thread From: Rusty Russell @ 2011-11-21 5:06 UTC (permalink / raw) To: Kevin Cernekee, 'Jan Beulich'; +Cc: linux-kbuild, linux-kernel On Thu, 17 Nov 2011 19:15:02 -0800, Kevin Cernekee <cernekee@gmail.com> wrote: > Potential memory savings from reusing EXACT strtab matches: 36,475 bytes > total; 35,432 from nvidia.ko. For comparison, discarding non-core > strtab entries saved about 91KB on my system. ... > Original scheme: 1.230s > With patch V2: 0.280s > With naive copying: 0.058s I'm deeply tempted. It's very simple, 46 lines shorter, preserves exact matches, and doesn't have any strange slowdowns on corner cases. But like Kevin, I could be convinced either way. Jan? Thanks, Rusty. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables 2011-11-21 5:06 ` Rusty Russell @ 2011-11-21 6:29 ` Kevin Cernekee 2011-11-21 23:57 ` Rusty Russell 2011-11-21 8:08 ` Jan Beulich 1 sibling, 1 reply; 10+ messages in thread From: Kevin Cernekee @ 2011-11-21 6:29 UTC (permalink / raw) To: Rusty Russell; +Cc: Jan Beulich, linux-kbuild, linux-kernel On Sun, Nov 20, 2011 at 9:06 PM, Rusty Russell <rusty@rustcorp.com.au> wrote: > On Thu, 17 Nov 2011 19:15:02 -0800, Kevin Cernekee <cernekee@gmail.com> wrote: > I'm deeply tempted. It's very simple, 46 lines shorter, preserves exact > matches, and doesn't have any strange slowdowns on corner cases. Unfortunately, the last patch I posted still makes duplicate copies of the exact matches. And the copy of nvidia.ko I'm looking at right now does reuse strtab entries for the duplicated symbols. FWIW, preserving exact matches would save about 35KB on a kernel module whose .text + .data + .rodata sections add up to ~9.5MB. I'm sure there are more clever ways to do it, but a fast brute force approach could involve storing the core_strtab offsets in a temporary array indexed by the original strtab st_name values. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables 2011-11-21 6:29 ` Kevin Cernekee @ 2011-11-21 23:57 ` Rusty Russell 0 siblings, 0 replies; 10+ messages in thread From: Rusty Russell @ 2011-11-21 23:57 UTC (permalink / raw) To: Kevin Cernekee; +Cc: Jan Beulich, linux-kbuild, linux-kernel On Sun, 20 Nov 2011 22:29:55 -0800, Kevin Cernekee <cernekee@gmail.com> wrote: > On Sun, Nov 20, 2011 at 9:06 PM, Rusty Russell <rusty@rustcorp.com.au> wrote: > > On Thu, 17 Nov 2011 19:15:02 -0800, Kevin Cernekee <cernekee@gmail.com> wrote: > > I'm deeply tempted. It's very simple, 46 lines shorter, preserves exact > > matches, and doesn't have any strange slowdowns on corner cases. > > Unfortunately, the last patch I posted still makes duplicate copies of > the exact matches. And the copy of nvidia.ko I'm looking at right now > does reuse strtab entries for the duplicated symbols. Ah, I read it too fast. Still, simplicity wins. I like Jan's point, I'll put in a comment and reference to this thread for future adventurers. Result below. Thanks! Rusty. From: Kevin Cernekee <cernekee@gmail.com> Subject: module: Fix performance regression on modules with large symbol tables Looking at /proc/kallsyms, one starts to ponder whether all of the extra strtab-related complexity in module.c is worth the memory savings. Instead of making the add_kallsyms() loop even more complex, I tried the other route of deleting the strmap logic and naively copying each string into core_strtab with no consideration for consolidating duplicates. Performance on an "already exists" insmod of nvidia.ko (runs add_kallsyms() but does not actually initialize the module): Original scheme: 1.230s With naive copying: 0.058s Extra space used: 35k (of a 408k module). Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <73defb5e4bca04a6431392cc341112b1@localhost> --- kernel/module.c | 65 ++++++++++++++++++-------------------------------------- 1 file changed, 21 insertions(+), 44 deletions(-) diff --git a/kernel/module.c b/kernel/module.c --- a/kernel/module.c +++ b/kernel/module.c @@ -138,7 +138,6 @@ struct load_info { unsigned long len; Elf_Shdr *sechdrs; char *secstrings, *strtab; - unsigned long *strmap; unsigned long symoffs, stroffs; struct _ddebug *debug; unsigned int num_debug; @@ -2178,12 +2177,19 @@ static bool is_core_symbol(const Elf_Sym return true; } +/* + * We only allocate and copy the strings needed by the parts of symtab + * we keep. This is simple, but has the effect of making multiple + * copies of duplicates. We could be more sophisticated, see + * linux-kernel thread starting with + * <73defb5e4bca04a6431392cc341112b1@localhost>. + */ static void layout_symtab(struct module *mod, struct load_info *info) { Elf_Shdr *symsect = info->sechdrs + info->index.sym; Elf_Shdr *strsect = info->sechdrs + info->index.str; const Elf_Sym *src; - unsigned int i, nsrc, ndst; + unsigned int i, nsrc, ndst, strtab_size; /* Put symbol section at end of init part of module. */ symsect->sh_flags |= SHF_ALLOC; @@ -2194,38 +2200,23 @@ static void layout_symtab(struct module src = (void *)info->hdr + symsect->sh_offset; nsrc = symsect->sh_size / sizeof(*src); - /* - * info->strmap has a '1' bit for each byte of .strtab we want to - * keep resident in mod->core_strtab. Everything else in .strtab - * is unreferenced by the symbols in mod->core_symtab, and will be - * discarded when add_kallsyms() compacts the string table. - */ - for (ndst = i = 1; i < nsrc; ++i, ++src) + /* Compute total space required for the core symbols' strtab. */ + for (ndst = i = strtab_size = 1; i < nsrc; ++i, ++src) if (is_core_symbol(src, info->sechdrs, info->hdr->e_shnum)) { - unsigned int j = src->st_name; - - while (!__test_and_set_bit(j, info->strmap) - && info->strtab[j]) - ++j; - ++ndst; + strtab_size += strlen(&info->strtab[src->st_name]) + 1; + ndst++; } /* Append room for core symbols at end of core part. */ info->symoffs = ALIGN(mod->core_size, symsect->sh_addralign ?: 1); - mod->core_size = info->symoffs + ndst * sizeof(Elf_Sym); + info->stroffs = mod->core_size = info->symoffs + ndst * sizeof(Elf_Sym); + mod->core_size += strtab_size; /* Put string table section at end of init part of module. */ strsect->sh_flags |= SHF_ALLOC; strsect->sh_entsize = get_offset(mod, &mod->init_size, strsect, info->index.str) | INIT_OFFSET_MASK; DEBUGP("\t%s\n", info->secstrings + strsect->sh_name); - - /* Append room for core symbols' strings at end of core part. */ - info->stroffs = mod->core_size; - - /* First strtab byte (and first symtab entry) are zeroes. */ - __set_bit(0, info->strmap); - mod->core_size += bitmap_weight(info->strmap, strsect->sh_size); } static void add_kallsyms(struct module *mod, const struct load_info *info) @@ -2246,22 +2237,19 @@ static void add_kallsyms(struct module * mod->symtab[i].st_info = elf_type(&mod->symtab[i], info); mod->core_symtab = dst = mod->module_core + info->symoffs; + mod->core_strtab = s = mod->module_core + info->stroffs; src = mod->symtab; *dst = *src; + *s++ = 0; for (ndst = i = 1; i < mod->num_symtab; ++i, ++src) { if (!is_core_symbol(src, info->sechdrs, info->hdr->e_shnum)) continue; + dst[ndst] = *src; - dst[ndst].st_name = bitmap_weight(info->strmap, - dst[ndst].st_name); - ++ndst; + dst[ndst++].st_name = s - mod->core_strtab; + s += strlcpy(s, &mod->strtab[src->st_name], KSYM_NAME_LEN) + 1; } mod->core_num_syms = ndst; - - mod->core_strtab = s = mod->module_core + info->stroffs; - for (*s = 0, i = 1; i < info->sechdrs[info->index.str].sh_size; ++i) - if (test_bit(i, info->strmap)) - *++s = mod->strtab[i]; } #else static inline void layout_symtab(struct module *mod, struct load_info *info) @@ -2751,27 +2739,18 @@ static struct module *layout_and_allocat this is done generically; there doesn't appear to be any special cases for the architectures. */ layout_sections(mod, info); - - info->strmap = kzalloc(BITS_TO_LONGS(info->sechdrs[info->index.str].sh_size) - * sizeof(long), GFP_KERNEL); - if (!info->strmap) { - err = -ENOMEM; - goto free_percpu; - } layout_symtab(mod, info); /* Allocate and move to the final place */ err = move_module(mod, info); if (err) - goto free_strmap; + goto free_percpu; /* Module has been copied to its final place now: return it. */ mod = (void *)info->sechdrs[info->index.mod].sh_addr; kmemleak_load_module(mod, info); return mod; -free_strmap: - kfree(info->strmap); free_percpu: percpu_modfree(mod); out: @@ -2781,7 +2760,6 @@ out: /* mod is no longer valid after this! */ static void module_deallocate(struct module *mod, struct load_info *info) { - kfree(info->strmap); percpu_modfree(mod); module_free(mod, mod->module_init); module_free(mod, mod->module_core); @@ -2911,8 +2889,7 @@ static struct module *load_module(void _ if (err < 0) goto unlink; - /* Get rid of temporary copy and strmap. */ - kfree(info.strmap); + /* Get rid of temporary copy. */ free_copy(&info); /* Done! */ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables 2011-11-21 5:06 ` Rusty Russell 2011-11-21 6:29 ` Kevin Cernekee @ 2011-11-21 8:08 ` Jan Beulich 1 sibling, 0 replies; 10+ messages in thread From: Jan Beulich @ 2011-11-21 8:08 UTC (permalink / raw) To: Kevin Cernekee, Rusty Russell; +Cc: linux-kbuild, linux-kernel >>> On 21.11.11 at 06:06, Rusty Russell <rusty@rustcorp.com.au> wrote: > On Thu, 17 Nov 2011 19:15:02 -0800, Kevin Cernekee <cernekee@gmail.com> wrote: >> Potential memory savings from reusing EXACT strtab matches: 36,475 bytes >> total; 35,432 from nvidia.ko. For comparison, discarding non-core >> strtab entries saved about 91KB on my system. > ... >> Original scheme: 1.230s >> With patch V2: 0.280s >> With naive copying: 0.058s > > I'm deeply tempted. It's very simple, 46 lines shorter, preserves exact > matches, and doesn't have any strange slowdowns on corner cases. > > But like Kevin, I could be convinced either way. Jan? I'd be fine with the simpler solution, as long it is clearly understood (and marked as intended to be that was for the time being) that it has room for improvement, should it ever turn out to yield more significant savings in other contexts. Jan ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2011-11-22 0:59 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-11-13 3:08 [PATCH V2 1/2] module: Add comments describing how the "strmap" logic works Kevin Cernekee 2011-11-13 3:08 ` [PATCH V2 2/2] module: Fix performance regression on modules with large symbol tables Kevin Cernekee 2011-11-14 0:35 ` Rusty Russell 2011-11-14 8:48 ` Jan Beulich 2011-11-16 0:24 ` Rusty Russell 2011-11-18 3:15 ` Kevin Cernekee 2011-11-21 5:06 ` Rusty Russell 2011-11-21 6:29 ` Kevin Cernekee 2011-11-21 23:57 ` Rusty Russell 2011-11-21 8:08 ` Jan Beulich
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox