[kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
@ 2008-10-05  0:44 Arjan van de Ven
  2008-10-05  4:52 ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Arjan van de Ven @ 2008-10-05  0:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: torvalds, Nick Piggin


Details: http://www.kerneloops.org/searchweek.php?search=lock_page

There's quite a few of this BUG, which seems to be an interaction between the "hwclock" program and something in 2.6.27.
It's new in .27 and is currently the 8th ranked issue.....

BUG: sleeping function called from invalid context at
include/linux/pagemap.h:294 in_atomic():0, irqs_disabled():1
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last  enabled at (0): [<00000000>] 0x0
hardirqs last disabled at (0): [<c042c3a4>] copy_process+0x2e7/0x115e
softirqs last  enabled at (0): [<c042c3a4>] copy_process+0x2e7/0x115e
softirqs last disabled at (0): [<00000000>] 0x0
Pid: 9591, comm: hwclock Tainted: G        W 2.6.27-0.372.rc8.fc10.i686 #1 
 [<c0427a53>] __might_sleep+0xd1/0xd6 
 [<c0479a8b>] lock_page+0x1a/0x34
 [<c0479cfa>] find_lock_page+0x23/0x48
 [<c047a215>] filemap_fault+0x9b/0x330
 [<c0486493>] __do_fault+0x40/0x2e6
 [<c0487d63>] handle_mm_fault+0x2ec/0x6d2
 [<c06e8260>] do_page_fault+0x2e5/0x693


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-05  0:44 [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program Arjan van de Ven
@ 2008-10-05  4:52 ` Andrew Morton
  2008-10-05 15:11   ` Arjan van de Ven
  2008-10-05 18:18   ` Linus Torvalds
  0 siblings, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2008-10-05  4:52 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, torvalds, Nick Piggin

On Sat, 4 Oct 2008 17:44:33 -0700 Arjan van de Ven <arjan@infradead.org> wrote:

> 
> Details: http://www.kerneloops.org/searchweek.php?search=lock_page
> 
> There's quite a few of this BUG, which seems to be an interaction between the "hwclock" program and something in 2.6.27.
> It's new in .27 and is currently the 8th ranked issue.....
> 
> BUG: sleeping function called from invalid context at
> include/linux/pagemap.h:294 in_atomic():0, irqs_disabled():1
> INFO: lockdep is turned off.
> irq event stamp: 0
> hardirqs last  enabled at (0): [<00000000>] 0x0
> hardirqs last disabled at (0): [<c042c3a4>] copy_process+0x2e7/0x115e
> softirqs last  enabled at (0): [<c042c3a4>] copy_process+0x2e7/0x115e
> softirqs last disabled at (0): [<00000000>] 0x0
> Pid: 9591, comm: hwclock Tainted: G        W 2.6.27-0.372.rc8.fc10.i686 #1 
>  [<c0427a53>] __might_sleep+0xd1/0xd6 
>  [<c0479a8b>] lock_page+0x1a/0x34
>  [<c0479cfa>] find_lock_page+0x23/0x48
>  [<c047a215>] filemap_fault+0x9b/0x330
>  [<c0486493>] __do_fault+0x40/0x2e6
>  [<c0487d63>] handle_mm_fault+0x2ec/0x6d2
>  [<c06e8260>] do_page_fault+0x2e5/0x693
> 

Looks like `hwclock' disabled interrupts in userspace with sys_iopl()?

And then it took a pagefault, which is presumably a bug in hwclock.

That's all a bit antisocial of it.  I guess a suitable quickfix is to
remove the might_sleep() from lock_page() (which would be a good thing
from a text size POV anyway).

But there will of course be other sites which do possibly-sleeping
operations on the pagefault path.

Really, it's a bit stupid doing _any_ system calls (and a pagefault is
a syscall in disguise) with interrupts disabled.  The kernel makes no
guarantees that we'll honour it.  We could just enable interrupts on
pagefault entry - that'll teach 'em.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-05  4:52 ` Andrew Morton
@ 2008-10-05 15:11   ` Arjan van de Ven
  2008-10-05 17:27     ` Andrew Morton
  2008-10-05 18:18   ` Linus Torvalds
  1 sibling, 1 reply; 17+ messages in thread
From: Arjan van de Ven @ 2008-10-05 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, Nick Piggin

On Sat, 4 Oct 2008 21:52:25 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sat, 4 Oct 2008 17:44:33 -0700 Arjan van de Ven
> <arjan@infradead.org> wrote:
> 
> > 
> > Details: http://www.kerneloops.org/searchweek.php?search=lock_page
> > 
> > There's quite a few of this BUG, which seems to be an interaction
> > between the "hwclock" program and something in 2.6.27. It's new
> > in .27 and is currently the 8th ranked issue.....
> > 
> > BUG: sleeping function called from invalid context at
> > include/linux/pagemap.h:294 in_atomic():0, irqs_disabled():1
> > INFO: lockdep is turned off.
> > irq event stamp: 0
> > hardirqs last  enabled at (0): [<00000000>] 0x0
> > hardirqs last disabled at (0): [<c042c3a4>]
> > copy_process+0x2e7/0x115e softirqs last  enabled at (0):
> > [<c042c3a4>] copy_process+0x2e7/0x115e softirqs last disabled at
> > (0): [<00000000>] 0x0 Pid: 9591, comm: hwclock Tainted: G        W
> > 2.6.27-0.372.rc8.fc10.i686 #1 [<c0427a53>] __might_sleep+0xd1/0xd6 
> >  [<c0479a8b>] lock_page+0x1a/0x34
> >  [<c0479cfa>] find_lock_page+0x23/0x48
> >  [<c047a215>] filemap_fault+0x9b/0x330
> >  [<c0486493>] __do_fault+0x40/0x2e6
> >  [<c0487d63>] handle_mm_fault+0x2ec/0x6d2
> >  [<c06e8260>] do_page_fault+0x2e5/0x693
> > 
> 
> Looks like `hwclock' disabled interrupts in userspace with sys_iopl()?

static unsigned long   
atomic(const char *name, unsigned long (*op)(unsigned long),
       unsigned long arg)
{
  unsigned long v;
  __asm__ volatile ("cli");
  v = (*op)(arg);
  __asm__ volatile ("sti");
  return v;
} 

looks like it (but only on 32 bit x86, not on 64 bit x86)

> 
> And then it took a pagefault, which is presumably a bug in hwclock.
> 
> That's all a bit antisocial of it.  I guess a suitable quickfix is to
> remove the might_sleep() from lock_page() (which would be a good thing
> from a text size POV anyway).
> 
> But there will of course be other sites which do possibly-sleeping
> operations on the pagefault path.
> 
> Really, it's a bit stupid doing _any_ system calls (and a pagefault is
> a syscall in disguise) with interrupts disabled.  The kernel makes no
> guarantees that we'll honour it.  We could just enable interrupts on
> pagefault entry - that'll teach 'em.

or save - enable - <run handlers> - restore sequence

it's horrible that we allowed this before, and the semantics are very
fuzzy at best, but to go WARN_ON() for it might be a bit too much.

(and yes someone really ought to fix hwclock; it's rather broken)

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-05 15:11   ` Arjan van de Ven
@ 2008-10-05 17:27     ` Andrew Morton
  2008-10-05 17:38       ` Arjan van de Ven
  2008-10-12 19:46       ` Karel Zak
  0 siblings, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2008-10-05 17:27 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, torvalds, Nick Piggin

On Sun, 5 Oct 2008 08:11:45 -0700 Arjan van de Ven <arjan@infradead.org> wrote:

> On Sat, 4 Oct 2008 21:52:25 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Sat, 4 Oct 2008 17:44:33 -0700 Arjan van de Ven
> > <arjan@infradead.org> wrote:
> > 
> > > 
> > > Details: http://www.kerneloops.org/searchweek.php?search=lock_page
> > > 
> > > There's quite a few of this BUG, which seems to be an interaction
> > > between the "hwclock" program and something in 2.6.27. It's new
> > > in .27 and is currently the 8th ranked issue.....
> > > 
> > > BUG: sleeping function called from invalid context at
> > > include/linux/pagemap.h:294 in_atomic():0, irqs_disabled():1
> > > INFO: lockdep is turned off.
> > > irq event stamp: 0
> > > hardirqs last  enabled at (0): [<00000000>] 0x0
> > > hardirqs last disabled at (0): [<c042c3a4>]
> > > copy_process+0x2e7/0x115e softirqs last  enabled at (0):
> > > [<c042c3a4>] copy_process+0x2e7/0x115e softirqs last disabled at
> > > (0): [<00000000>] 0x0 Pid: 9591, comm: hwclock Tainted: G        W
> > > 2.6.27-0.372.rc8.fc10.i686 #1 [<c0427a53>] __might_sleep+0xd1/0xd6 
> > >  [<c0479a8b>] lock_page+0x1a/0x34
> > >  [<c0479cfa>] find_lock_page+0x23/0x48
> > >  [<c047a215>] filemap_fault+0x9b/0x330
> > >  [<c0486493>] __do_fault+0x40/0x2e6
> > >  [<c0487d63>] handle_mm_fault+0x2ec/0x6d2
> > >  [<c06e8260>] do_page_fault+0x2e5/0x693
> > > 
> > 
> > Looks like `hwclock' disabled interrupts in userspace with sys_iopl()?
> 
> static unsigned long   
> atomic(const char *name, unsigned long (*op)(unsigned long),
>        unsigned long arg)
> {
>   unsigned long v;
>   __asm__ volatile ("cli");
>   v = (*op)(arg);
>   __asm__ volatile ("sti");
>   return v;
> } 
> 
> looks like it (but only on 32 bit x86, not on 64 bit x86)

I suspect this is new in hwclock?  We do a might_sleep() in lock_page()
in 2.6.25 and in 2.6.26.

In which case there isn't a lot of point in changing 2.6.27.

> > 
> > And then it took a pagefault, which is presumably a bug in hwclock.
> > 
> > That's all a bit antisocial of it.  I guess a suitable quickfix is to
> > remove the might_sleep() from lock_page() (which would be a good thing
> > from a text size POV anyway).
> > 
> > But there will of course be other sites which do possibly-sleeping
> > operations on the pagefault path.
> > 
> > Really, it's a bit stupid doing _any_ system calls (and a pagefault is
> > a syscall in disguise) with interrupts disabled.  The kernel makes no
> > guarantees that we'll honour it.  We could just enable interrupts on
> > pagefault entry - that'll teach 'em.
> 
> or save - enable - <run handlers> - restore sequence

hwclock is buggy either way - it's trying to disable interrupts but
it's calling into the kernel, which will reenable interrupts, thus
losing any protection which hwclock was trying to attain.

Plus there's this little thing called "smp".  I bet it doesn't disable
interrupts on all CPUs.

> it's horrible that we allowed this before, and the semantics are very
> fuzzy at best, but to go WARN_ON() for it might be a bit too much.
> 
> (and yes someone really ought to fix hwclock; it's rather broken)

well yeah. Recently broken?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-05 17:27     ` Andrew Morton
@ 2008-10-05 17:38       ` Arjan van de Ven
  2008-10-12 20:00         ` Karel Zak
  2008-10-12 19:46       ` Karel Zak
  1 sibling, 1 reply; 17+ messages in thread
From: Arjan van de Ven @ 2008-10-05 17:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, Nick Piggin

On Sun, 5 Oct 2008 10:27:42 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > static unsigned long   
> > atomic(const char *name, unsigned long (*op)(unsigned long),
> >        unsigned long arg)
> > {
> >   unsigned long v;
> >   __asm__ volatile ("cli");
> >   v = (*op)(arg);
> >   __asm__ volatile ("sti");
> >   return v;
> > } 
> > 
> > looks like it (but only on 32 bit x86, not on 64 bit x86)
> 
> I suspect this is new in hwclock?  We do a might_sleep() in
> lock_page() in 2.6.25 and in 2.6.26.

this quote was from the F9 hwclock.. which shipped with 2.6.25.
Hum.

> > > Really, it's a bit stupid doing _any_ system calls (and a
> > > pagefault is a syscall in disguise) with interrupts disabled.
> > > The kernel makes no guarantees that we'll honour it.  We could
> > > just enable interrupts on pagefault entry - that'll teach 'em.
> > 
> > or save - enable - <run handlers> - restore sequence
> 
> hwclock is buggy either way - 

not arguing with that ;-)
All code doing cli/sti in userland is buggy period. No excuses possible.

> it's trying to disable interrupts but
> it's calling into the kernel, which will reenable interrupts, thus
> losing any protection which hwclock was trying to attain.
> 
> Plus there's this little thing called "smp".  I bet it doesn't disable
> interrupts on all CPUs.

I get the impression from the code that it really wants a "don't
schedule me out" rather than "this is a lock".
it can do better.
On Alpha it implements a seq-lock kind of thing instead.
(and on x86-64 .. it implements NOTHING)



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-05  4:52 ` Andrew Morton
  2008-10-05 15:11   ` Arjan van de Ven
@ 2008-10-05 18:18   ` Linus Torvalds
  1 sibling, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2008-10-05 18:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Arjan van de Ven, linux-kernel, Nick Piggin



On Sat, 4 Oct 2008, Andrew Morton wrote:
> > Pid: 9591, comm: hwclock Tainted: G        W 2.6.27-0.372.rc8.fc10.i686 #1 
> >  [<c0427a53>] __might_sleep+0xd1/0xd6 
> >  [<c0479a8b>] lock_page+0x1a/0x34
> >  [<c0479cfa>] find_lock_page+0x23/0x48
> >  [<c047a215>] filemap_fault+0x9b/0x330
> >  [<c0486493>] __do_fault+0x40/0x2e6
> >  [<c0487d63>] handle_mm_fault+0x2ec/0x6d2
> >  [<c06e8260>] do_page_fault+0x2e5/0x693
> 
> Looks like `hwclock' disabled interrupts in userspace with sys_iopl()?

We probably should enable interrupts in the page fault code. We already do

	#ifdef CONFIG_X86_32
	        /* It's safe to allow irq's after cr2 has been saved and the vmalloc
	           fault has been handled. */
	        if (regs->flags & (X86_EFLAGS_IF | X86_VM_MASK))
	                local_irq_enable();

	        /*
	         * If we're in an interrupt, have no user context or are running in an
	         * atomic region then we must not take the fault.
	         */
	        if (in_atomic() || !mm)
	                goto bad_area_nosemaphore;
	#endif
	...

so we have code to do so, it's just that we don't do it if the page fault 
happened in an interrupt. But that's for the _kernel_ having interrupts 
disabled and us needing to fix up the vmalloc area lazily (do we ever even 
do that any more.. I dunno).

So we could easily add a check for 'user_space_vm(regs)' instead of 
checking the VM_MASK, and fix it that way. Hmm?

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-05 17:27     ` Andrew Morton
  2008-10-05 17:38       ` Arjan van de Ven
@ 2008-10-12 19:46       ` Karel Zak
  1 sibling, 0 replies; 17+ messages in thread
From: Karel Zak @ 2008-10-12 19:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Arjan van de Ven, linux-kernel, torvalds, Nick Piggin

On Sun, Oct 05, 2008 at 10:27:42AM -0700, Andrew Morton wrote:
> On Sun, 5 Oct 2008 08:11:45 -0700 Arjan van de Ven <arjan@infradead.org> wrote:
> > 
> > static unsigned long   
> > atomic(const char *name, unsigned long (*op)(unsigned long),
> >        unsigned long arg)
> > {
> >   unsigned long v;
> >   __asm__ volatile ("cli");
> >   v = (*op)(arg);
> >   __asm__ volatile ("sti");
> >   return v;
> > } 
> > 
> > looks like it (but only on 32 bit x86, not on 64 bit x86)
> 
> I suspect this is new in hwclock?

 since util-linux-2.9v, year 1999

> > (and yes someone really ought to fix hwclock; it's rather broken)
> 
> well yeah. Recently broken?

 9 years ago

    Karel

-- 
 Karel Zak  <kzak@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-05 17:38       ` Arjan van de Ven
@ 2008-10-12 20:00         ` Karel Zak
  2008-10-12 20:16           ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Karel Zak @ 2008-10-12 20:00 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, linux-kernel, torvalds, Nick Piggin

On Sun, Oct 05, 2008 at 10:38:26AM -0700, Arjan van de Ven wrote:
> I get the impression from the code that it really wants a "don't
> schedule me out" rather than "this is a lock".

 Yes, I think so.

> it can do better.

 Any suggestion how to nicely implement "don't schedule me out"?

    Karel

-- 
 Karel Zak  <kzak@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-12 20:00         ` Karel Zak
@ 2008-10-12 20:16           ` Linus Torvalds
  2008-10-13 14:55             ` Arjan van de Ven
  2008-10-13 15:26             ` Ingo Molnar
  0 siblings, 2 replies; 17+ messages in thread
From: Linus Torvalds @ 2008-10-12 20:16 UTC (permalink / raw)
  To: Karel Zak
  Cc: Arjan van de Ven, Andrew Morton, Linux Kernel Mailing List,
	Nick Piggin, Ingo Molnar



On Sun, 12 Oct 2008, Karel Zak wrote:
>
>  Any suggestion how to nicely implement "don't schedule me out"?

There's nothing you can do. If you take a page fault, you're done. Forget 
about any "can't schedule" or "don't enable interrupts". The kernel _has_ 
to handle the page fault, and that may involve IO and thus random pauses. 
No ifs, buts or maybe's about it.

This patch may or may not get rid of the warning, at least. It won't fix 
hwclock, but that's apparently unfixable from the kernel - the thing is 
just plain buggy.

[ Ingo added to Cc just because this is obviously a x86 tree thing, and 
  tries to unify some trivial parts of the VM paths at the same time. ]

For hwclock, you may try to:

 - do

	mlockall(MCL_CURRENT)

   before you do the critical region

 - set yourself to some realtime scheduling thing

	struct sched_param param = {
		.sched_priority = 50,
	};

	sched_setscheduler(0, SCHED_FIFO, &param);

   or similar.

and that should mean that you stay on your CPU (by virtue of not being 
scheduled away because you're more important than others) and don't take 
page faults.

But making yourself real-time also means that any bugs can essentially 
kill the system (endless loop).

		Linus

---
 arch/x86/mm/fault.c |   30 +++++++++++-------------------
 1 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a742d75..ac2ad78 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -645,24 +645,23 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	}
 
 
-#ifdef CONFIG_X86_32
-	/* It's safe to allow irq's after cr2 has been saved and the vmalloc
-	   fault has been handled. */
-	if (regs->flags & (X86_EFLAGS_IF | X86_VM_MASK))
-		local_irq_enable();
-
 	/*
-	 * If we're in an interrupt, have no user context or are running in an
-	 * atomic region then we must not take the fault.
+	 * It's safe to allow irq's after cr2 has been saved and the
+	 * vmalloc fault has been handled.
+	 *
+	 * User-mode registers count as a user access even for any
+	 * potential system fault or CPU buglet.
 	 */
-	if (in_atomic() || !mm)
-		goto bad_area_nosemaphore;
-#else /* CONFIG_X86_64 */
-	if (likely(regs->flags & X86_EFLAGS_IF))
+	if (user_mode_vm(regs)) {
+		local_irq_enable();
+		error_code |= PF_USER;
+	} else if (regs->flags & X86_EFLAGS_IF)
 		local_irq_enable();
 
+#ifdef CONFIG_X86_64
 	if (unlikely(error_code & PF_RSVD))
 		pgtable_bad(address, regs, error_code);
+#endif
 
 	/*
 	 * If we're in an interrupt, have no user context or are running in an
@@ -671,14 +670,7 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	if (unlikely(in_atomic() || !mm))
 		goto bad_area_nosemaphore;
 
-	/*
-	 * User-mode registers count as a user access even for any
-	 * potential system fault or CPU buglet.
-	 */
-	if (user_mode_vm(regs))
-		error_code |= PF_USER;
 again:
-#endif
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunately, in the case of an


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-12 20:16           ` Linus Torvalds
@ 2008-10-13 14:55             ` Arjan van de Ven
  2008-10-14 21:04               ` Karel Zak
  2008-10-13 15:26             ` Ingo Molnar
  1 sibling, 1 reply; 17+ messages in thread
From: Arjan van de Ven @ 2008-10-13 14:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Karel Zak, Andrew Morton, Linux Kernel Mailing List, Nick Piggin,
	Ingo Molnar

On Sun, 12 Oct 2008 13:16:12 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Sun, 12 Oct 2008, Karel Zak wrote:
> >
> >  Any suggestion how to nicely implement "don't schedule me out"?
> 
> There's nothing you can do. If you take a page fault, you're done.
> Forget about any "can't schedule" or "don't enable interrupts". The
> kernel _has_ to handle the page fault, and that may involve IO and
> thus random pauses. No ifs, buts or maybe's about it.
> 
> This patch may or may not get rid of the warning, at least. It won't
> fix hwclock, but that's apparently unfixable from the kernel - the
> thing is just plain buggy.


it almost sounds like it's trying to do something that ... really the
kernel should be doing.

Karel: Can you describe what it WANTS to do so we can see if we can
just extend the linux kernel to do that the right way?

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-12 20:16           ` Linus Torvalds
  2008-10-13 14:55             ` Arjan van de Ven
@ 2008-10-13 15:26             ` Ingo Molnar
  2008-10-13 15:40               ` Alan Cox
  2008-10-13 15:44               ` Linus Torvalds
  1 sibling, 2 replies; 17+ messages in thread
From: Ingo Molnar @ 2008-10-13 15:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Karel Zak, Arjan van de Ven, Andrew Morton,
	Linux Kernel Mailing List, Nick Piggin, Thomas Gleixner,
	H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [ Ingo added to Cc just because this is obviously a x86 tree thing, and 
>   tries to unify some trivial parts of the VM paths at the same time. ]

applied to tip/x86/urgent, thanks Linus!

do you agree with the changelog and can i add your Signed-off-by ?

	Ingo

--------------------->
>From bdbe15671f9b3ad1264ed174f62563774f0abef9 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, 12 Oct 2008 13:16:12 -0700
Subject: [PATCH] x86/mm: do not trigger a kernel warning if user-space disables interrupts and generates a page fault

Arjan reported a spike in the following bug pattern in v2.6.27:

   http://www.kerneloops.org/searchweek.php?search=lock_page

which happens because hwclock started triggering warnings due to
a (correct) might_sleep() check in the MM code.

The warning occurs because hwclock uses this dubious sequence of
code to run "atomic" code:

  static unsigned long
  atomic(const char *name, unsigned long (*op)(unsigned long),
         unsigned long arg)
  {
    unsigned long v;
    __asm__ volatile ("cli");
    v = (*op)(arg);
    __asm__ volatile ("sti");
    return v;
  }

Then it pagefaults in that "atomic" section, triggering the warning.

There is no way the kernel could provide "atomicity" in this path,
a page fault is a cannot-continue machine event so the kernel has to
wait for the page to be filled in.

Even if it was just a minor fault we'd have to take locks and might have
to spend quite a bit of time with interrupts disabled - not nice to irq
latencies in general.

So instead just enable interrupts in the pagefault path unconditionally
if we come from user-space, and handle the fault.

Also, while touching this code, unify some trivial parts of the x86
VM paths at the same time.

Reported-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/mm/fault.c |   30 +++++++++++-------------------
 1 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a742d75..ac2ad78 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -645,24 +645,23 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	}
 
 
-#ifdef CONFIG_X86_32
-	/* It's safe to allow irq's after cr2 has been saved and the vmalloc
-	   fault has been handled. */
-	if (regs->flags & (X86_EFLAGS_IF | X86_VM_MASK))
-		local_irq_enable();
-
 	/*
-	 * If we're in an interrupt, have no user context or are running in an
-	 * atomic region then we must not take the fault.
+	 * It's safe to allow irq's after cr2 has been saved and the
+	 * vmalloc fault has been handled.
+	 *
+	 * User-mode registers count as a user access even for any
+	 * potential system fault or CPU buglet.
 	 */
-	if (in_atomic() || !mm)
-		goto bad_area_nosemaphore;
-#else /* CONFIG_X86_64 */
-	if (likely(regs->flags & X86_EFLAGS_IF))
+	if (user_mode_vm(regs)) {
+		local_irq_enable();
+		error_code |= PF_USER;
+	} else if (regs->flags & X86_EFLAGS_IF)
 		local_irq_enable();
 
+#ifdef CONFIG_X86_64
 	if (unlikely(error_code & PF_RSVD))
 		pgtable_bad(address, regs, error_code);
+#endif
 
 	/*
 	 * If we're in an interrupt, have no user context or are running in an
@@ -671,14 +670,7 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	if (unlikely(in_atomic() || !mm))
 		goto bad_area_nosemaphore;
 
-	/*
-	 * User-mode registers count as a user access even for any
-	 * potential system fault or CPU buglet.
-	 */
-	if (user_mode_vm(regs))
-		error_code |= PF_USER;
 again:
-#endif
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunately, in the case of an

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-13 15:26             ` Ingo Molnar
@ 2008-10-13 15:40               ` Alan Cox
  2008-10-13 15:44               ` Linus Torvalds
  1 sibling, 0 replies; 17+ messages in thread
From: Alan Cox @ 2008-10-13 15:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Karel Zak, Arjan van de Ven, Andrew Morton,
	Linux Kernel Mailing List, Nick Piggin, Thomas Gleixner,
	H. Peter Anvin

> The warning occurs because hwclock uses this dubious sequence of
> code to run "atomic" code:

Nobody should be using old versions of hwclock that do this as they are
not SMP safe.

> So instead just enable interrupts in the pagefault path unconditionally
> if we come from user-space, and handle the fault.

This should definitely also log an error to get users to remove that
rubbish from their PC

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-13 15:26             ` Ingo Molnar
  2008-10-13 15:40               ` Alan Cox
@ 2008-10-13 15:44               ` Linus Torvalds
  2008-10-13 16:02                 ` Ingo Molnar
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2008-10-13 15:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Karel Zak, Arjan van de Ven, Andrew Morton,
	Linux Kernel Mailing List, Nick Piggin, Thomas Gleixner,
	H. Peter Anvin

On Mon, 13 Oct 2008, Ingo Molnar wrote:
> 
> do you agree with the changelog and can i add your Signed-off-by ?

Sure. One thing I'd still like to see is that crazy "again" vs "survive" 
mess for x86-64 vs x86-32. I think the patch as posted will cause a new 
warning on x86-32 due to "unused label 'again'" or similar.

It's totally insane that we have two different versions of the oom 
handling for x86. I don't know why we do that, it's probably historical, 
and I _suspect_ that the 32-bit one has gotten a lot more testing.

And not just because there have been more of the 32-bit kernels around, 
but also because low-memory situations are probably more common on 32-bit 
setups. But I dunno. 

So I would suggest you just pick the x86-32 version of that oom handling 
thing too. Unless you know some deep reason why the 64-bit one would be 
superior.

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-13 15:44               ` Linus Torvalds
@ 2008-10-13 16:02                 ` Ingo Molnar
  2008-10-13 16:08                   ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-10-13 16:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Karel Zak, Arjan van de Ven, Andrew Morton,
	Linux Kernel Mailing List, Nick Piggin, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Mon, 13 Oct 2008, Ingo Molnar wrote:
> > 
> > do you agree with the changelog and can i add your Signed-off-by ?
> 
> Sure. One thing I'd still like to see is that crazy "again" vs 
> "survive" mess for x86-64 vs x86-32. I think the patch as posted will 
> cause a new warning on x86-32 due to "unused label 'again'" or 
> similar.
> 
> It's totally insane that we have two different versions of the oom 
> handling for x86. I don't know why we do that, it's probably 
> historical, and I _suspect_ that the 32-bit one has gotten a lot more 
> testing.
> 
> And not just because there have been more of the 32-bit kernels 
> around, but also because low-memory situations are probably more 
> common on 32-bit setups. But I dunno.
> 
> So I would suggest you just pick the x86-32 version of that oom 
> handling thing too. Unless you know some deep reason why the 64-bit 
> one would be superior.

hm, i think the 64-bit case is the correct code, because in this 'init 
task OOMs' case we do:

out_of_memory:
        up_read(&mm->mmap_sem);
        if (is_global_init(tsk)) {
                yield();
                down_read(&mm->mmap_sem);

note that we drop the mmap_sem, so in theory another thread of this same 
MM could change the vma tree, and our 'vma' might not be valid anymore.

It's probably not a real issue in practice because this is about PID 1, 
so i doubt it really matters, but still.

So how about the patch below?

	Ingo

---------------->
>From 7b87da331b6ada44ccd5ffeedba76880c825d4fc Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 13 Oct 2008 17:49:02 +0200
Subject: [PATCH] x86/mm: unify init task OOM handling

Linus noticed that the "again:" versus "survive:" OOM logic for
the init task was arbitrarily different.

The 64-bit codepath is the better one, because it correctly re-lookups
the vma after having dropped the ->mmap_sem.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/mm/fault.c |   15 ++++++---------
 1 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ac2ad78..8bc5956 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -671,7 +671,8 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
 		goto bad_area_nosemaphore;
 
 again:
-	/* When running in the kernel we expect faults to occur only to
+	/*
+	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunately, in the case of an
 	 * erroneous fault occurring in a code path which already holds mmap_sem
@@ -734,9 +735,6 @@ good_area:
 			goto bad_area;
 	}
 
-#ifdef CONFIG_X86_32
-survive:
-#endif
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
@@ -871,12 +869,11 @@ out_of_memory:
 	up_read(&mm->mmap_sem);
 	if (is_global_init(tsk)) {
 		yield();
-#ifdef CONFIG_X86_32
-		down_read(&mm->mmap_sem);
-		goto survive;
-#else
+		/*
+		 * Re-lookup the vma - in theory the vma tree might
+		 * have changed:
+		 */
 		goto again;
-#endif
 	}
 
 	printk("VM: killing process %s\n", tsk->comm);

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-13 16:02                 ` Ingo Molnar
@ 2008-10-13 16:08                   ` Linus Torvalds
  2008-10-13 16:21                     ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2008-10-13 16:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Karel Zak, Arjan van de Ven, Andrew Morton,
	Linux Kernel Mailing List, Nick Piggin, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra



On Mon, 13 Oct 2008, Ingo Molnar wrote:
> 
> hm, i think the 64-bit case is the correct code, because in this 'init 
> task OOMs' case we do:
> 
> out_of_memory:
>         up_read(&mm->mmap_sem);
>         if (is_global_init(tsk)) {
>                 yield();
>                 down_read(&mm->mmap_sem);
> 
> note that we drop the mmap_sem, so in theory another thread of this same 
> MM could change the vma tree, and our 'vma' might not be valid anymore.

Hmm. Looks about right.

> It's probably not a real issue in practice because this is about PID 1, 
> so i doubt it really matters, but still.
> 
> So how about the patch below?

Ack. As long as we don't have two versions and the code is impossible to 
look at.

			Linus

> 
> 	Ingo
> 
> ---------------->
> >From 7b87da331b6ada44ccd5ffeedba76880c825d4fc Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 13 Oct 2008 17:49:02 +0200
> Subject: [PATCH] x86/mm: unify init task OOM handling
> 
> Linus noticed that the "again:" versus "survive:" OOM logic for
> the init task was arbitrarily different.
> 
> The 64-bit codepath is the better one, because it correctly re-lookups
> the vma after having dropped the ->mmap_sem.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  arch/x86/mm/fault.c |   15 ++++++---------
>  1 files changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index ac2ad78..8bc5956 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -671,7 +671,8 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
>  		goto bad_area_nosemaphore;
>  
>  again:
> -	/* When running in the kernel we expect faults to occur only to
> +	/*
> +	 * When running in the kernel we expect faults to occur only to
>  	 * addresses in user space.  All other faults represent errors in the
>  	 * kernel and should generate an OOPS.  Unfortunately, in the case of an
>  	 * erroneous fault occurring in a code path which already holds mmap_sem
> @@ -734,9 +735,6 @@ good_area:
>  			goto bad_area;
>  	}
>  
> -#ifdef CONFIG_X86_32
> -survive:
> -#endif
>  	/*
>  	 * If for any reason at all we couldn't handle the fault,
>  	 * make sure we exit gracefully rather than endlessly redo
> @@ -871,12 +869,11 @@ out_of_memory:
>  	up_read(&mm->mmap_sem);
>  	if (is_global_init(tsk)) {
>  		yield();
> -#ifdef CONFIG_X86_32
> -		down_read(&mm->mmap_sem);
> -		goto survive;
> -#else
> +		/*
> +		 * Re-lookup the vma - in theory the vma tree might
> +		 * have changed:
> +		 */
>  		goto again;
> -#endif
>  	}
>  
>  	printk("VM: killing process %s\n", tsk->comm);
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-13 16:08                   ` Linus Torvalds
@ 2008-10-13 16:21                     ` Ingo Molnar
  0 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2008-10-13 16:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Karel Zak, Arjan van de Ven, Andrew Morton,
	Linux Kernel Mailing List, Nick Piggin, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > It's probably not a real issue in practice because this is about PID 
> > 1, so i doubt it really matters, but still.
> > 
> > So how about the patch below?
> 
> Ack. As long as we don't have two versions and the code is impossible 
> to look at.

thx, added your Acked-by and queued it up in x86/urgent.

The many dumb #ifdefs were the result of the mechanic unification of 
fault.c - we waited for the bugs to get shaken out - it went pretty 
well, there was only one in the end IIRC.

Now we can unify it semantically as well and create sane, maintainable 
code with gradual patches. We used to have over 50 #ifdefs in fault.c 
iirc, that's now down to 24. (still high but shrinking)

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program
  2008-10-13 14:55             ` Arjan van de Ven
@ 2008-10-14 21:04               ` Karel Zak
  0 siblings, 0 replies; 17+ messages in thread
From: Karel Zak @ 2008-10-14 21:04 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Nick Piggin, Ingo Molnar

On Mon, Oct 13, 2008 at 10:55:59AM -0400, Arjan van de Ven wrote:
> Karel: Can you describe what it WANTS to do so we can see if we can

 It wants to "atomically" (without context switch) read/write time
 from CMOS. This is unrealistic of course.

> just extend the linux kernel to do that the right way?

 The kernel is already extended and hwclock uses the extension :-)
 The solution is /dev/rtc.

 The code that directly works with CMOS is fallback solution for
 people who don't want or can not use the standard RTC device. I guess
 people use this functionally for experiments only. Today the RTC
 framework should work everywhere.

 I'll simply remove the "cli" / "sti" code -- eventually I can try
 to optimize it by mlockall() and SCHED_FIFO (as suggested by Linus).

    Karel

-- 
 Karel Zak  <kzak@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2008-10-14 21:05 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-05  0:44 [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program Arjan van de Ven
2008-10-05  4:52 ` Andrew Morton
2008-10-05 15:11   ` Arjan van de Ven
2008-10-05 17:27     ` Andrew Morton
2008-10-05 17:38       ` Arjan van de Ven
2008-10-12 20:00         ` Karel Zak
2008-10-12 20:16           ` Linus Torvalds
2008-10-13 14:55             ` Arjan van de Ven
2008-10-14 21:04               ` Karel Zak
2008-10-13 15:26             ` Ingo Molnar
2008-10-13 15:40               ` Alan Cox
2008-10-13 15:44               ` Linus Torvalds
2008-10-13 16:02                 ` Ingo Molnar
2008-10-13 16:08                   ` Linus Torvalds
2008-10-13 16:21                     ` Ingo Molnar
2008-10-12 19:46       ` Karel Zak
2008-10-05 18:18   ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox