Re: [PATCH] module, fix percpu reserved memory exhaustion

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Prarit Bhargava <prarit@redhat.com>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: linux-kernel@vger.kernel.org, Mike Galbraith <efault@gmx.de>,
	Josh Triplett <josh@joshtriplett.org>,
	Tim Abbott <tabbott@ksplice.com>
Subject: Re: [PATCH] module, fix percpu reserved memory exhaustion
Date: Fri, 11 Jan 2013 09:16:34 -0500	[thread overview]
Message-ID: <50F01EC2.1000908@redhat.com> (raw)
In-Reply-To: <87d2xc1itd.fsf@rustcorp.com.au>

On 01/10/2013 10:48 PM, Rusty Russell wrote:
> Prarit Bhargava <prarit@redhat.com> writes:
>> [   15.478160] kvm: Could not allocate 304 bytes percpu data
>> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc
>> from reserved chunk failed
> ...
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
>> currently allocates the modules percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded.  If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
> 
> Wow, what a cool bug!  Classic unforseen side-effect.
> 
> I'd prefer not to do relocations with the module_lock held: it can be
> relatively slow.  Yet we can't do relocations before the per-cpu
> allocation, obviously.  Did you do boot timings before and after?

Heh ... I did! :)  I had a lot of concerns about moving the mutex around so I
put in print at the end of boot to see how long the boot time actually was.

>From stock kernel:

[  22.893015] PRARIT: FINAL BOOT MESSAGE

>From stock kernel + my patch:

[  22.673214] PRARIT: FINAL BOOT MESSAGE

Both kernel boots showed the problem with kvm loading.  A quick grep through my
bootlogs of stock kernel + my patch don't show anything greater than 23.539392
and less than 20.980321.  Those numbers are similar to the numbers from the
stock kernel (23.569450 - 20.898321).

ie) I don't think there's an increase due to calling the relocation under the
module mutex, and if there is it is definitely lost within the noise of boot.

The timing were similar.  I didn't see any huge delays, etc.  Can the
relocations really cause a long delay?  I thought we were pretty much writing
values to memory...

[I should point out that I'm booting a 32 physical/64 logical, with 64GB of memory]

> 
> An alternative would be to put the module into the list even earlier
> (say, just after layout_and_allocate) so we could block on concurrent
> loads at that point.  But then we have to make sure noone looks in the
> module too early before it's completely set up, and that's complicated
> and error-prone too.  A separate list is kind of icky.

Yeah -- that was my first attempt actually, and it got very complex very
quickly.  I abandoned that approach in favor of moving the percpu allocations
under the lock.  I thought that was likely the easiest approach.

> 
> We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
> allmodconfig build, there are only three modules with per-cpu data,
> totalling 328 bytes.  So it's not reasonable to increase that number to
> paper over this.

I've been thinking about that.  The problem is that at the same time the kvm
problem occurs I'm attempting to load a debug module that I've written to debug
some cpu timer issues that allocates a large amount of percpu data (~.5K/cpu).
While extending PERCPU_MODULE_RESERVE to 10k might work now, it might not work
tomorrow if I have the need to increase the size of my log buffer.

... that is ;), I prefer your and my approach of fixing this problem.

> 
> This is what a new boot state looks like (pains not to break ksplice).
> It's two patches, but I'll just post them back to back:
> 
> module: add new state MODULE_STATE_UNFORMED
> 
> You should never look at such a module, so it's excised from all paths
> which traverse the modules list.
> 
> We add the state at the end, to avoid gratuitous ABI break (ksplice).
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> 
<snip patch>

Sure, but I'm always nervous about expanding any state machine ;).  That's just
me though :).

> 
> module: put modules in list much earlier.
> 
> Prarit's excellent bug report:
>> In recent Fedora releases (F17 & F18) some users have reported seeing
>> messages similar to
>>
>> [   15.478160] kvm: Could not allocate 304 bytes percpu data
>> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
>> reserved chunk failed
>>
>> during system boot.  In some cases, users have also reported seeing this
>> message along with a failed load of other modules.
>>
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
>> currently allocates the modules percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded.  If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
> 
> Now we have a new state MODULE_STATE_UNFORMED, we can insert the
> module into the list (and thus guarantee its uniqueness) before we
> allocate the per-cpu region.
> 
> Reported-by: Prarit Bhargava <prarit@redhat.com>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> 

<snip patch>

Tested-by: Prarit Bhargava <prarit@redhat.com>

Rusty, you can change that to an Acked-by if you prefer that.  I know some
engineers prefer one over the other.  I'll also continue doing some reboot
testing and will email back in a few days to let you know what the timing looks
like.

Thanks!,

P.

next prev parent reply	other threads:[~2013-01-11 14:16 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-01-10  2:41 [PATCH] module, fix percpu reserved memory exhaustion Prarit Bhargava
2013-01-11  3:48 ` Rusty Russell
2013-01-11 14:16   ` Prarit Bhargava [this message]
2013-01-12  1:06     ` Rusty Russell
2013-01-14 14:29       ` Prarit Bhargava
2013-01-14 17:18       ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50F01EC2.1000908@redhat.com \
    --to=prarit@redhat.com \
    --cc=efault@gmx.de \
    --cc=josh@joshtriplett.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=rusty@rustcorp.com.au \
    --cc=tabbott@ksplice.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.