[PATCH] module, fix percpu reserved memory exhaustion

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] module, fix percpu reserved memory exhaustion
@ 2013-01-10  2:41 Prarit Bhargava
  2013-01-11  3:48 ` Rusty Russell
  0 siblings, 1 reply; 6+ messages in thread
From: Prarit Bhargava @ 2013-01-10  2:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: Prarit Bhargava, Rusty Russell, Mike Galbraith

Rusty,

There is likely some subtlety of moving the module mutex that I'm unaware of.
What I can say is that this patch seems to resolve the problem for me, or at
least through 100+ reboots I have not seen the problem (I'm still testing
as I write this).

I'm more than willing to hear an alternative approach, or test an alternative
patch.

Thanks,

P.

----8<----

In recent Fedora releases (F17 & F18) some users have reported seeing
messages similar to

[   15.478121] Pid: 727, comm: systemd-udevd Tainted: GF 3.8.0-rc2+ #1
[   15.478121] Call Trace:
[   15.478131]  [<ffffffff81153001>] pcpu_alloc+0xa01/0xa60
[   15.478137]  [<ffffffff8163d8b0>] ? printk+0x61/0x63
[   15.478140]  [<ffffffff811532f3>] __alloc_reserved_percpu+0x13/0x20
[   15.478145]  [<ffffffff810c42e2>] load_module+0x1dc2/0x20b0
[   15.478150]  [<ffffffff8164b21e>] ? do_page_fault+0xe/0x10
[   15.478152]  [<ffffffff81647858>] ? page_fault+0x28/0x30
[   15.478155]  [<ffffffff810c46a7>] sys_init_module+0xd7/0x120
[   15.478159]  [<ffffffff8164f859>] system_call_fastpath+0x16/0x1b
[   15.478160] kvm: Could not allocate 304 bytes percpu data
[   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from reserved chunk failed

during system boot.  In some cases, users have also reported seeing this
message along with a failed load of other modules.

As the message indicates, the reserved chunk of percpu memory (where
modules allocate their memory) is exhausted.  A debug printk inserted in
the code shows

[   15.478533] PRARIT size = 304 > chunk->contig_hint = 208

ie) the reserved chunk of percpu has only 208 bytes of available space.

What is happening is systemd is loading an instance of the kvm module for
each cpu found (see commit e9bda3b).  When the module load occurs the kernel
currently allocates the modules percpu data area prior to checking to see
if the module is already loaded or is in the process of being loaded.  If
the module is already loaded, or finishes load, the module loading code
releases the current instance's module's percpu data.

The problem is that these module loads race and it is possible that all of
the percpu reserved area is consumed by repeated loads of the same module
which results in the failure of other drivers to load.

This patch moves the module percpu allocation after the check for an
existing instance of the module.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/module.c |  124 ++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 85 insertions(+), 39 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 250092c..e7e9b57 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1929,6 +1929,27 @@ static int verify_export_symbols(struct module *mod)
 	return 0;
 }
 
+static void simplify_percpu_symbols(struct module *mod,
+				    const struct load_info *info)
+{
+	Elf_Shdr *symsec = &info->sechdrs[info->index.sym];
+	Elf_Sym *sym = (void *)symsec->sh_addr;
+	unsigned long secbase;
+	unsigned int i;
+
+	/*
+	 * No need for error checking in this function because
+	 * simplify_symbols has already been called.
+	 */
+	for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
+		/* Divert to percpu allocation if a percpu var. */
+		if (sym[i].st_shndx == info->index.pcpu) {
+			secbase = (unsigned long)mod_percpu(mod);
+			sym[i].st_value += secbase;
+		}
+	}
+}
+
 /* Change all symbols so that st_value encodes the pointer directly. */
 static int simplify_symbols(struct module *mod, const struct load_info *info)
 {
@@ -1976,12 +1997,11 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
 			break;
 
 		default:
-			/* Divert to percpu allocation if a percpu var. */
-			if (sym[i].st_shndx == info->index.pcpu)
-				secbase = (unsigned long)mod_percpu(mod);
-			else
+			/* percpu diverts handled in simplify_percpu_symbols */
+			if (sym[i].st_shndx != info->index.pcpu) {
 				secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
-			sym[i].st_value += secbase;
+				sym[i].st_value += secbase;
+			}
 			break;
 		}
 	}
@@ -2899,11 +2919,29 @@ int __weak module_frob_arch_sections(Elf_Ehdr *hdr,
 	return 0;
 }
 
+static int allocate_percpu(struct module *mod, struct load_info *info)
+{
+	Elf_Shdr *pcpusec;
+	int err;
+
+	pcpusec = &info->sechdrs[info->index.pcpu];
+	if (pcpusec->sh_size) {
+		/* We have a special allocation for this section. */
+		pr_debug("module %s attempting to percpu with size %d\n",
+			 mod->name, pcpusec->sh_size);
+		err = percpu_modalloc(mod,
+				      pcpusec->sh_size, pcpusec->sh_addralign);
+		if (err)
+			return -ENOMEM;
+		pcpusec->sh_flags &= ~(unsigned long)SHF_ALLOC;
+	}
+	return 0;
+}
+
 static struct module *layout_and_allocate(struct load_info *info, int flags)
 {
 	/* Module within temporary copy. */
 	struct module *mod;
-	Elf_Shdr *pcpusec;
 	int err;
 
 	mod = setup_load_info(info, flags);
@@ -2920,16 +2958,6 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
 	if (err < 0)
 		goto out;
 
-	pcpusec = &info->sechdrs[info->index.pcpu];
-	if (pcpusec->sh_size) {
-		/* We have a special allocation for this section. */
-		err = percpu_modalloc(mod,
-				      pcpusec->sh_size, pcpusec->sh_addralign);
-		if (err)
-			goto out;
-		pcpusec->sh_flags &= ~(unsigned long)SHF_ALLOC;
-	}
-
 	/* Determine total sizes, and put offsets in sh_entsize.  For now
 	   this is done generically; there doesn't appear to be any
 	   special cases for the architectures. */
@@ -2955,7 +2983,6 @@ out:
 /* mod is no longer valid after this! */
 static void module_deallocate(struct module *mod, struct load_info *info)
 {
-	percpu_modfree(mod);
 	module_free(mod, mod->module_init);
 	module_free(mod, mod->module_core);
 }
@@ -3135,18 +3162,54 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	/* Set up MODINFO_ATTR fields */
 	setup_modinfo(mod, info);
 
-	/* Fix up syms, so that st_value is a pointer to location. */
+	/* Fix up non-percpu syms, so that st_value is a pointer to location. */
 	err = simplify_symbols(mod, info);
 	if (err < 0)
 		goto free_modinfo;
 
+	/*
+	 * This mutex protects against concurrent writers of the module
+	 * RCU list.  The code takes it pretty early because there are
+	 * several things that must be completed after checking for an
+	 * existing module, and before adding it to the RCU list.
+	 */
+	mutex_lock(&module_mutex);
+
+	/* Mark state as coming so strong_try_module_get() ignores us. */
+	mod->state = MODULE_STATE_COMING;
+
+	/* is another instance of this module already loading or loaded? */
+again:
+	if ((old = find_module(mod->name)) != NULL) {
+		if (old->state == MODULE_STATE_COMING) {
+			/* Wait in case it fails to load. */
+			mutex_unlock(&module_mutex);
+			err = wait_event_interruptible(module_wq,
+					       finished_loading(mod->name));
+			if (err)
+				goto free_modinfo;
+			mutex_lock(&module_mutex);
+			goto again;
+		}
+		err = -EEXIST;
+		goto unlock;
+	}
+
+	/* allocate this modules percpu reserved area */
+	err = allocate_percpu(mod, info);
+	if (err)
+		goto free_modinfo;
+
+	/* Fix up percpu syms, so that st_value is a pointer to location. */
+	simplify_percpu_symbols(mod, info);
+
 	err = apply_relocations(mod, info);
 	if (err < 0)
-		goto free_modinfo;
+		goto free_percpu;
 
 	err = post_relocation(mod, info);
 	if (err < 0)
-		goto free_modinfo;
+		goto free_percpu;
 
 	flush_module_icache(mod);
 
@@ -3157,31 +3220,12 @@ static int load_module(struct load_info *info, const char __user *uargs,
 		goto free_arch_cleanup;
 	}
 
-	/* Mark state as coming so strong_try_module_get() ignores us. */
-	mod->state = MODULE_STATE_COMING;
-
 	/* Now sew it into the lists so we can get lockdep and oops
 	 * info during argument parsing.  No one should access us, since
 	 * strong_try_module_get() will fail.
 	 * lockdep/oops can run asynchronous, so use the RCU list insertion
 	 * function to insert in a way safe to concurrent readers.
-	 * The mutex protects against concurrent writers.
 	 */
-again:
-	mutex_lock(&module_mutex);
-	if ((old = find_module(mod->name)) != NULL) {
-		if (old->state == MODULE_STATE_COMING) {
-			/* Wait in case it fails to load. */
-			mutex_unlock(&module_mutex);
-			err = wait_event_interruptible(module_wq,
-					       finished_loading(mod->name));
-			if (err)
-				goto free_arch_cleanup;
-			goto again;
-		}
-		err = -EEXIST;
-		goto unlock;
-	}
 
 	/* This has to be done once we're sure module name is unique. */
 	dynamic_debug_setup(info->debug, info->num_debug);
@@ -3228,6 +3272,8 @@ again:
 	kfree(mod->args);
  free_arch_cleanup:
 	module_arch_cleanup(mod);
+ free_percpu:
+	percpu_modfree(mod);
  free_modinfo:
 	free_modinfo(mod);
  free_unload:
-- 
1.7.9.3


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] module, fix percpu reserved memory exhaustion
  2013-01-10  2:41 [PATCH] module, fix percpu reserved memory exhaustion Prarit Bhargava
@ 2013-01-11  3:48 ` Rusty Russell
  2013-01-11 14:16   ` Prarit Bhargava
  0 siblings, 1 reply; 6+ messages in thread
From: Rusty Russell @ 2013-01-11  3:48 UTC (permalink / raw)
  To: Prarit Bhargava, linux-kernel
  Cc: Prarit Bhargava, Mike Galbraith, Josh Triplett, Tim Abbott

Prarit Bhargava <prarit@redhat.com> writes:
> [   15.478160] kvm: Could not allocate 304 bytes percpu data
> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc
> from reserved chunk failed
...
> What is happening is systemd is loading an instance of the kvm module for
> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
> currently allocates the modules percpu data area prior to checking to see
> if the module is already loaded or is in the process of being loaded.  If
> the module is already loaded, or finishes load, the module loading code
> releases the current instance's module's percpu data.

Wow, what a cool bug!  Classic unforseen side-effect.

I'd prefer not to do relocations with the module_lock held: it can be
relatively slow.  Yet we can't do relocations before the per-cpu
allocation, obviously.  Did you do boot timings before and after?

An alternative would be to put the module into the list even earlier
(say, just after layout_and_allocate) so we could block on concurrent
loads at that point.  But then we have to make sure noone looks in the
module too early before it's completely set up, and that's complicated
and error-prone too.  A separate list is kind of icky.

We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
allmodconfig build, there are only three modules with per-cpu data,
totalling 328 bytes.  So it's not reasonable to increase that number to
paper over this.

This is what a new boot state looks like (pains not to break ksplice).
It's two patches, but I'll just post them back to back:

module: add new state MODULE_STATE_UNFORMED.

You should never look at such a module, so it's excised from all paths
which traverse the modules list.

We add the state at the end, to avoid gratuitous ABI break (ksplice).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff --git a/include/linux/module.h b/include/linux/module.h
index 7760c6d..4432373 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -199,11 +199,11 @@ struct module_use {
 	struct module *source, *target;
 };
 
-enum module_state
-{
-	MODULE_STATE_LIVE,
-	MODULE_STATE_COMING,
-	MODULE_STATE_GOING,
+enum module_state {
+	MODULE_STATE_LIVE,	/* Normal state. */
+	MODULE_STATE_COMING,	/* Full formed, running module_init. */
+	MODULE_STATE_GOING,	/* Going away. */
+	MODULE_STATE_UNFORMED,	/* Still setting it up. */
 };
 
 /**
diff --git a/kernel/module.c b/kernel/module.c
index 41bc118..c3a2ee8 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -188,6 +188,7 @@ struct load_info {
    ongoing or failed initialization etc. */
 static inline int strong_try_module_get(struct module *mod)
 {
+	BUG_ON(mod && mod->state == MODULE_STATE_UNFORMED);
 	if (mod && mod->state == MODULE_STATE_COMING)
 		return -EBUSY;
 	if (try_module_get(mod))
@@ -343,6 +344,9 @@ bool each_symbol_section(bool (*fn)(const struct symsearch *arr,
 #endif
 		};
 
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
+
 		if (each_symbol_in_section(arr, ARRAY_SIZE(arr), mod, fn, data))
 			return true;
 	}
@@ -450,16 +454,24 @@ const struct kernel_symbol *find_symbol(const char *name,
 EXPORT_SYMBOL_GPL(find_symbol);
 
 /* Search for module by name: must hold module_mutex. */
-struct module *find_module(const char *name)
+static struct module *find_module_all(const char *name,
+				      bool even_unformed)
 {
 	struct module *mod;
 
 	list_for_each_entry(mod, &modules, list) {
+		if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if (strcmp(mod->name, name) == 0)
 			return mod;
 	}
 	return NULL;
 }
+
+struct module *find_module(const char *name)
+{
+	return find_module_all(name, false);
+}
 EXPORT_SYMBOL_GPL(find_module);
 
 #ifdef CONFIG_SMP
@@ -525,6 +537,8 @@ bool is_module_percpu_address(unsigned long addr)
 	preempt_disable();
 
 	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if (!mod->percpu_size)
 			continue;
 		for_each_possible_cpu(cpu) {
@@ -1048,6 +1062,8 @@ static ssize_t show_initstate(struct module_attribute *mattr,
 	case MODULE_STATE_GOING:
 		state = "going";
 		break;
+	default:
+		BUG();
 	}
 	return sprintf(buffer, "%s\n", state);
 }
@@ -1786,6 +1802,8 @@ void set_all_modules_text_rw(void)
 
 	mutex_lock(&module_mutex);
 	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if ((mod->module_core) && (mod->core_text_size)) {
 			set_page_attributes(mod->module_core,
 						mod->module_core + mod->core_text_size,
@@ -1807,6 +1825,8 @@ void set_all_modules_text_ro(void)
 
 	mutex_lock(&module_mutex);
 	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if ((mod->module_core) && (mod->core_text_size)) {
 			set_page_attributes(mod->module_core,
 						mod->module_core + mod->core_text_size,
@@ -2998,7 +3018,8 @@ static bool finished_loading(const char *name)
 
 	mutex_lock(&module_mutex);
 	mod = find_module(name);
-	ret = !mod || mod->state != MODULE_STATE_COMING;
+	ret = !mod || mod->state == MODULE_STATE_LIVE
+		|| mod->state == MODULE_STATE_GOING;
 	mutex_unlock(&module_mutex);
 
 	return ret;
@@ -3361,6 +3382,8 @@ const char *module_address_lookup(unsigned long addr,
 
 	preempt_disable();
 	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if (within_module_init(addr, mod) ||
 		    within_module_core(addr, mod)) {
 			if (modname)
@@ -3384,6 +3407,8 @@ int lookup_module_symbol_name(unsigned long addr, char *symname)
 
 	preempt_disable();
 	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if (within_module_init(addr, mod) ||
 		    within_module_core(addr, mod)) {
 			const char *sym;
@@ -3408,6 +3433,8 @@ int lookup_module_symbol_attrs(unsigned long addr, unsigned long *size,
 
 	preempt_disable();
 	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if (within_module_init(addr, mod) ||
 		    within_module_core(addr, mod)) {
 			const char *sym;
@@ -3435,6 +3462,8 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
 
 	preempt_disable();
 	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if (symnum < mod->num_symtab) {
 			*value = mod->symtab[symnum].st_value;
 			*type = mod->symtab[symnum].st_info;
@@ -3477,9 +3506,12 @@ unsigned long module_kallsyms_lookup_name(const char *name)
 			ret = mod_find_symname(mod, colon+1);
 		*colon = ':';
 	} else {
-		list_for_each_entry_rcu(mod, &modules, list)
+		list_for_each_entry_rcu(mod, &modules, list) {
+			if (mod->state == MODULE_STATE_UNFORMED)
+				continue;
 			if ((ret = mod_find_symname(mod, name)) != 0)
 				break;
+		}
 	}
 	preempt_enable();
 	return ret;
@@ -3494,6 +3526,8 @@ int module_kallsyms_on_each_symbol(int (*fn)(void *, const char *,
 	int ret;
 
 	list_for_each_entry(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		for (i = 0; i < mod->num_symtab; i++) {
 			ret = fn(data, mod->strtab + mod->symtab[i].st_name,
 				 mod, mod->symtab[i].st_value);
@@ -3509,6 +3543,7 @@ static char *module_flags(struct module *mod, char *buf)
 {
 	int bx = 0;
 
+	BUG_ON(mod->state == MODULE_STATE_UNFORMED);
 	if (mod->taints ||
 	    mod->state == MODULE_STATE_GOING ||
 	    mod->state == MODULE_STATE_COMING) {
@@ -3550,6 +3585,10 @@ static int m_show(struct seq_file *m, void *p)
 	struct module *mod = list_entry(p, struct module, list);
 	char buf[8];
 
+	/* We always ignore unformed modules. */
+	if (mod->state == MODULE_STATE_UNFORMED)
+		return 0;
+
 	seq_printf(m, "%s %u",
 		   mod->name, mod->init_size + mod->core_size);
 	print_unload_info(m, mod);
@@ -3610,6 +3649,8 @@ const struct exception_table_entry *search_module_extables(unsigned long addr)
 
 	preempt_disable();
 	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if (mod->num_exentries == 0)
 			continue;
 
@@ -3658,10 +3699,13 @@ struct module *__module_address(unsigned long addr)
 	if (addr < module_addr_min || addr > module_addr_max)
 		return NULL;
 
-	list_for_each_entry_rcu(mod, &modules, list)
+	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if (within_module_core(addr, mod)
 		    || within_module_init(addr, mod))
 			return mod;
+	}
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(__module_address);
@@ -3714,8 +3758,11 @@ void print_modules(void)
 	printk(KERN_DEFAULT "Modules linked in:");
 	/* Most callers should already have preempt disabled, but make sure */
 	preempt_disable();
-	list_for_each_entry_rcu(mod, &modules, list)
+	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		printk(" %s%s", mod->name, module_flags(mod, buf));
+	}
 	preempt_enable();
 	if (last_unloaded_module[0])
 		printk(" [last unloaded: %s]", last_unloaded_module);

module: put modules in list much earlier.

Prarit's excellent bug report:
> In recent Fedora releases (F17 & F18) some users have reported seeing
> messages similar to
> 
> [   15.478160] kvm: Could not allocate 304 bytes percpu data
> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
> reserved chunk failed
> 
> during system boot.  In some cases, users have also reported seeing this
> message along with a failed load of other modules.
> 
> What is happening is systemd is loading an instance of the kvm module for
> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
> currently allocates the modules percpu data area prior to checking to see
> if the module is already loaded or is in the process of being loaded.  If
> the module is already loaded, or finishes load, the module loading code
> releases the current instance's module's percpu data.

Now we have a new state MODULE_STATE_UNFORMED, we can insert the
module into the list (and thus guarantee its uniqueness) before we
allocate the per-cpu region.

Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff --git a/kernel/module.c b/kernel/module.c
index c3a2ee8..ec535aa 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3017,7 +3017,7 @@ static bool finished_loading(const char *name)
 	bool ret;
 
 	mutex_lock(&module_mutex);
-	mod = find_module(name);
+	mod = find_module_all(name, true);
 	ret = !mod || mod->state == MODULE_STATE_LIVE
 		|| mod->state == MODULE_STATE_GOING;
 	mutex_unlock(&module_mutex);
@@ -3141,6 +3141,32 @@ static int load_module(struct load_info *info, const char __user *uargs,
 		goto free_copy;
 	}
 
+	/*
+	 * We try to place it in the list now to make sure it's unique
+	 * before we dedicate too many resources.  In particular,
+	 * temporary percpu memory exhaustion.
+	 */
+	mod->state = MODULE_STATE_UNFORMED;
+again:
+	mutex_lock(&module_mutex);
+	if ((old = find_module_all(mod->name, true)) != NULL) {
+		if (old->state == MODULE_STATE_COMING
+		    || old->state == MODULE_STATE_UNFORMED) {
+			/* Wait in case it fails to load. */
+			mutex_unlock(&module_mutex);
+			err = wait_event_interruptible(module_wq,
+					       finished_loading(mod->name));
+			if (err)
+				goto free_module;
+			goto again;
+		}
+		err = -EEXIST;
+		mutex_unlock(&module_mutex);
+		goto free_module;
+	}
+	list_add_rcu(&mod->list, &modules);
+	mutex_unlock(&module_mutex);
+
 #ifdef CONFIG_MODULE_SIG
 	mod->sig_ok = info->sig_ok;
 	if (!mod->sig_ok)
@@ -3150,7 +3176,7 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	/* Now module is in final location, initialize linked lists, etc. */
 	err = module_unload_init(mod);
 	if (err)
-		goto free_module;
+		goto unlink_mod;
 
 	/* Now we've got everything in the final locations, we can
 	 * find optional sections. */
@@ -3185,54 +3211,33 @@ static int load_module(struct load_info *info, const char __user *uargs,
 		goto free_arch_cleanup;
 	}
 
-	/* Mark state as coming so strong_try_module_get() ignores us. */
-	mod->state = MODULE_STATE_COMING;
-
-	/* Now sew it into the lists so we can get lockdep and oops
-	 * info during argument parsing.  No one should access us, since
-	 * strong_try_module_get() will fail.
-	 * lockdep/oops can run asynchronous, so use the RCU list insertion
-	 * function to insert in a way safe to concurrent readers.
-	 * The mutex protects against concurrent writers.
-	 */
-again:
-	mutex_lock(&module_mutex);
-	if ((old = find_module(mod->name)) != NULL) {
-		if (old->state == MODULE_STATE_COMING) {
-			/* Wait in case it fails to load. */
-			mutex_unlock(&module_mutex);
-			err = wait_event_interruptible(module_wq,
-					       finished_loading(mod->name));
-			if (err)
-				goto free_arch_cleanup;
-			goto again;
-		}
-		err = -EEXIST;
-		goto unlock;
-	}
-
-	/* This has to be done once we're sure module name is unique. */
 	dynamic_debug_setup(info->debug, info->num_debug);
 
-	/* Find duplicate symbols */
+	mutex_lock(&module_mutex);
+	/* Find duplicate symbols (must be called under lock). */
 	err = verify_export_symbols(mod);
 	if (err < 0)
-		goto ddebug;
+		goto ddebug_cleanup;
 
+	/* This relies on module_mutex for list integrity. */
 	module_bug_finalize(info->hdr, info->sechdrs, mod);
-	list_add_rcu(&mod->list, &modules);
+
+	/* Mark state as coming so strong_try_module_get() ignores us,
+	 * but kallsyms etc. can see us. */
+	mod->state = MODULE_STATE_COMING;
+
 	mutex_unlock(&module_mutex);
 
 	/* Module is ready to execute: parsing args may do that. */
 	err = parse_args(mod->name, mod->args, mod->kp, mod->num_kp,
 			 -32768, 32767, &ddebug_dyndbg_module_param_cb);
 	if (err < 0)
-		goto unlink;
+		goto bug_cleanup;
 
 	/* Link in to syfs. */
 	err = mod_sysfs_setup(mod, info, mod->kp, mod->num_kp);
 	if (err < 0)
-		goto unlink;
+		goto bug_cleanup;
 
 	/* Get rid of temporary copy. */
 	free_copy(info);
@@ -3242,16 +3247,13 @@ again:
 
 	return do_init_module(mod);
 
- unlink:
+ bug_cleanup:
+	/* module_bug_cleanup needs module_mutex protection */
 	mutex_lock(&module_mutex);
-	/* Unlink carefully: kallsyms could be walking list. */
-	list_del_rcu(&mod->list);
 	module_bug_cleanup(mod);
-	wake_up_all(&module_wq);
- ddebug:
-	dynamic_debug_remove(info->debug);
- unlock:
 	mutex_unlock(&module_mutex);
+ ddebug_cleanup:
+	dynamic_debug_remove(info->debug);
 	synchronize_sched();
 	kfree(mod->args);
  free_arch_cleanup:
@@ -3260,6 +3262,12 @@ again:
 	free_modinfo(mod);
  free_unload:
 	module_unload_free(mod);
+ unlink_mod:
+	mutex_lock(&module_mutex);
+	/* Unlink carefully: kallsyms could be walking list. */
+	list_del_rcu(&mod->list);
+	wake_up_all(&module_wq);
+	mutex_unlock(&module_mutex);
  free_module:
 	module_deallocate(mod, info);
  free_copy:
diff --git a/lib/bug.c b/lib/bug.c
index a28c141..d0cdf14 100644
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -55,6 +55,7 @@ static inline unsigned long bug_addr(const struct bug_entry *bug)
 }
 
 #ifdef CONFIG_MODULES
+/* Updates are protected by module mutex */
 static LIST_HEAD(module_bug_list);
 
 static const struct bug_entry *module_find_bug(unsigned long bugaddr)

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] module, fix percpu reserved memory exhaustion
  2013-01-11  3:48 ` Rusty Russell
@ 2013-01-11 14:16   ` Prarit Bhargava
  2013-01-12  1:06     ` Rusty Russell
  0 siblings, 1 reply; 6+ messages in thread
From: Prarit Bhargava @ 2013-01-11 14:16 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel, Mike Galbraith, Josh Triplett, Tim Abbott

On 01/10/2013 10:48 PM, Rusty Russell wrote:
> Prarit Bhargava <prarit@redhat.com> writes:
>> [   15.478160] kvm: Could not allocate 304 bytes percpu data
>> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc
>> from reserved chunk failed
> ...
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
>> currently allocates the modules percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded.  If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
> 
> Wow, what a cool bug!  Classic unforseen side-effect.
> 
> I'd prefer not to do relocations with the module_lock held: it can be
> relatively slow.  Yet we can't do relocations before the per-cpu
> allocation, obviously.  Did you do boot timings before and after?

Heh ... I did! :)  I had a lot of concerns about moving the mutex around so I
put in print at the end of boot to see how long the boot time actually was.

>From stock kernel:

[  22.893015] PRARIT: FINAL BOOT MESSAGE

>From stock kernel + my patch:

[  22.673214] PRARIT: FINAL BOOT MESSAGE

Both kernel boots showed the problem with kvm loading.  A quick grep through my
bootlogs of stock kernel + my patch don't show anything greater than 23.539392
and less than 20.980321.  Those numbers are similar to the numbers from the
stock kernel (23.569450 - 20.898321).

ie) I don't think there's an increase due to calling the relocation under the
module mutex, and if there is it is definitely lost within the noise of boot.

The timing were similar.  I didn't see any huge delays, etc.  Can the
relocations really cause a long delay?  I thought we were pretty much writing
values to memory...

[I should point out that I'm booting a 32 physical/64 logical, with 64GB of memory]

> 
> An alternative would be to put the module into the list even earlier
> (say, just after layout_and_allocate) so we could block on concurrent
> loads at that point.  But then we have to make sure noone looks in the
> module too early before it's completely set up, and that's complicated
> and error-prone too.  A separate list is kind of icky.

Yeah -- that was my first attempt actually, and it got very complex very
quickly.  I abandoned that approach in favor of moving the percpu allocations
under the lock.  I thought that was likely the easiest approach.

> 
> We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
> allmodconfig build, there are only three modules with per-cpu data,
> totalling 328 bytes.  So it's not reasonable to increase that number to
> paper over this.

I've been thinking about that.  The problem is that at the same time the kvm
problem occurs I'm attempting to load a debug module that I've written to debug
some cpu timer issues that allocates a large amount of percpu data (~.5K/cpu).
While extending PERCPU_MODULE_RESERVE to 10k might work now, it might not work
tomorrow if I have the need to increase the size of my log buffer.

... that is ;), I prefer your and my approach of fixing this problem.

> 
> This is what a new boot state looks like (pains not to break ksplice).
> It's two patches, but I'll just post them back to back:
> 
> module: add new state MODULE_STATE_UNFORMED
> 
> You should never look at such a module, so it's excised from all paths
> which traverse the modules list.
> 
> We add the state at the end, to avoid gratuitous ABI break (ksplice).
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> 
<snip patch>

Sure, but I'm always nervous about expanding any state machine ;).  That's just
me though :).

> 
> module: put modules in list much earlier.
> 
> Prarit's excellent bug report:
>> In recent Fedora releases (F17 & F18) some users have reported seeing
>> messages similar to
>>
>> [   15.478160] kvm: Could not allocate 304 bytes percpu data
>> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
>> reserved chunk failed
>>
>> during system boot.  In some cases, users have also reported seeing this
>> message along with a failed load of other modules.
>>
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
>> currently allocates the modules percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded.  If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
> 
> Now we have a new state MODULE_STATE_UNFORMED, we can insert the
> module into the list (and thus guarantee its uniqueness) before we
> allocate the per-cpu region.
> 
> Reported-by: Prarit Bhargava <prarit@redhat.com>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> 

<snip patch>

Tested-by: Prarit Bhargava <prarit@redhat.com>

Rusty, you can change that to an Acked-by if you prefer that.  I know some
engineers prefer one over the other.  I'll also continue doing some reboot
testing and will email back in a few days to let you know what the timing looks
like.

Thanks!,

P.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] module, fix percpu reserved memory exhaustion
  2013-01-11 14:16   ` Prarit Bhargava
@ 2013-01-12  1:06     ` Rusty Russell
  2013-01-14 14:29       ` Prarit Bhargava
  2013-01-14 17:18       ` Tejun Heo
  0 siblings, 2 replies; 6+ messages in thread
From: Rusty Russell @ 2013-01-12  1:06 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: linux-kernel, Mike Galbraith, Josh Triplett, Tim Abbott,
	Tejun Heo

Prarit Bhargava <prarit@redhat.com> writes:
> On 01/10/2013 10:48 PM, Rusty Russell wrote:
> The timing were similar.  I didn't see any huge delays, etc.  Can the
> relocations really cause a long delay?  I thought we were pretty much writing
> values to memory...

For x86 that's true, but look at what ppc64 has to do for example.  I'm
guessing you don't have a giant Nvidia proprietary driver module
loading, either.

It just makes me nervous; this kind of boot slowdown typically won't get
diagnosed for several releases, if ever.  Now I've done the work, I'm
going to apply my patch (with an additional fix: I forgot to change
kgdb, which traverses the module list too).

> [I should point out that I'm booting a 32 physical/64 logical, with 64GB of memory]

I figured it had to be something big ;)

>> We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
>> allmodconfig build, there are only three modules with per-cpu data,
>> totalling 328 bytes.  So it's not reasonable to increase that number to
>> paper over this.
>
> I've been thinking about that.  The problem is that at the same time the kvm
> problem occurs I'm attempting to load a debug module that I've written to debug
> some cpu timer issues that allocates a large amount of percpu data (~.5K/cpu).
> While extending PERCPU_MODULE_RESERVE to 10k might work now, it might not work
> tomorrow if I have the need to increase the size of my log buffer.

Well, it looks like PERCPU_MODULE_RESERVE is actually very generous; it
could easily be halved.  I guess this is because dynamic per-cpu data is
now such a nice alternative (thanks to Tejun).

> <snip patch>
>
> Tested-by: Prarit Bhargava <prarit@redhat.com>
>
> Rusty, you can change that to an Acked-by if you prefer that.  I know some
> engineers prefer one over the other.  I'll also continue doing some reboot
> testing and will email back in a few days to let you know what the timing looks
> like.

There seem to be two kinds of Acked-by:

1) Acked-by: <maintainer>.  ie. "this should go through my tree, but
   it's going via someone else".  I like this: shows the normal
   maintainer is aware of the change.

2) Acked-by: <random>. ie. "I like the concept of the patch though I
   haven't actually read it or tested it".  Completely useless.

OTOH, Tested-by: means it actually fixed someone's problem.

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] module, fix percpu reserved memory exhaustion
  2013-01-12  1:06     ` Rusty Russell
@ 2013-01-14 14:29       ` Prarit Bhargava
  2013-01-14 17:18       ` Tejun Heo
  1 sibling, 0 replies; 6+ messages in thread
From: Prarit Bhargava @ 2013-01-14 14:29 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, Mike Galbraith, Josh Triplett, Tim Abbott,
	Tejun Heo



On 01/11/2013 08:06 PM, Rusty Russell wrote:
> Prarit Bhargava <prarit@redhat.com> writes:
>> On 01/10/2013 10:48 PM, Rusty Russell wrote:
>> The timing were similar.  I didn't see any huge delays, etc.  Can the
>> relocations really cause a long delay?  I thought we were pretty much writing
>> values to memory...
> 
> For x86 that's true, but look at what ppc64 has to do for example.  I'm
> guessing you don't have a giant Nvidia proprietary driver module
> loading, either.

Ah -- I see.  I hadn't thought much about the other arches and I see what ppc64
does ...

> 
>> [I should point out that I'm booting a 32 physical/64 logical, with 64GB of memory]
> 
> I figured it had to be something big ;)

:)  Imagine what happens at 4096 cpus (SGI territory).  I'm wondering about that
kvm commit.  Maybe the systemd/udev rule needs to be rewritten to avoid a 'kvm
loading flood' during boot ... I'll talk with Kay Sievers about it to see if
there's a way around that.

> 
> OTOH, Tested-by: means it actually fixed someone's problem.

Got it.  For the record over-the-weekend testing didn't show any bizarre
results.  The boot times were all around 20-23 seconds.

Tested-by: Prarit Bhargava <prarit@redhat.com>

P.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] module, fix percpu reserved memory exhaustion
  2013-01-12  1:06     ` Rusty Russell
  2013-01-14 14:29       ` Prarit Bhargava
@ 2013-01-14 17:18       ` Tejun Heo
  1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2013-01-14 17:18 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Prarit Bhargava, linux-kernel, Mike Galbraith, Josh Triplett,
	Tim Abbott

Hello, Rusty.

On Sat, Jan 12, 2013 at 11:36:09AM +1030, Rusty Russell wrote:
> > I've been thinking about that.  The problem is that at the same time the kvm
> > problem occurs I'm attempting to load a debug module that I've written to debug
> > some cpu timer issues that allocates a large amount of percpu data (~.5K/cpu).
> > While extending PERCPU_MODULE_RESERVE to 10k might work now, it might not work
> > tomorrow if I have the need to increase the size of my log buffer.
> 
> Well, it looks like PERCPU_MODULE_RESERVE is actually very generous; it
> could easily be halved.  I guess this is because dynamic per-cpu data is
> now such a nice alternative (thanks to Tejun).

Heh, thanks. :)

There are two percpu constants that I basically pulled out of my ass.

* PERCPU_MODULE_RESERVE: It's used on architectures with data model
  which restricts the distance between code and data.  e.g. x86_64
  genereates code which can't address full 64bit for static data and
  as percpu addressing depends on the regular address generation to
  derive percpu addresses, the percpu offsets need to be restricted
  somehow.  Percpu allocator currently achieves this by reserving
  certain amount in the first percpu chunk which is guaranteed to have
  very low (starting from 0) percpu address space offsets.

  On those archs, this becomes the hard limit for all module static
  percpu areas.  Maybe 8k is way too high and we can go with 4k.
  Worth a try I guess.

* PERCPU_DYNAMIC_RESERVE: The amount of space at the end of the first
  chunk used for dynamic allocation.  This isn't as critical as the
  above and the only reason it exists is because the first chunk is
  often embedded in the kernel linear address space instead of vmalloc
  area and the former is often mapped with larger page table size than
  the latter.  The goal of PERCPU_DYNAMIC_RESERVE is to have just
  enough space to cover the usual percpu allocations without wasting
  too much space to minimize impact on TLB pressure.

  The current number isn't entirely made up.  I did some testing with
  some random config.  Don't know how it holds up now.  It probably
  can be increased given that we developed quite a few percpu dynamic
  allocations since then.

  It would be awesome to have a way to somehow dynamically adjust this
  one; unfortunately, that would require persistent data across
  boots. :(

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-01-14 17:18 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-01-10  2:41 [PATCH] module, fix percpu reserved memory exhaustion Prarit Bhargava
2013-01-11  3:48 ` Rusty Russell
2013-01-11 14:16   ` Prarit Bhargava
2013-01-12  1:06     ` Rusty Russell
2013-01-14 14:29       ` Prarit Bhargava
2013-01-14 17:18       ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).