struct page field arrangement

All of lore.kernel.org
 help / color / mirror / Atom feed

* struct page field arrangement
@ 2007-02-28 14:31 Jan Beulich
  2007-02-28 21:08 ` Hugh Dickins
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2007-02-28 14:31 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins; +Cc: linux-kernel

A change early last year reordered struct page so that ptl overlaps not only
private, but also mapping. Since spinlock_t can be much larger, I'm wondering
whether there's a reason to not also overlay the space index and lru take -
are these used for anything on page table pages?

Thanks, Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: struct page field arrangement
  2007-02-28 14:31 struct page field arrangement Jan Beulich
@ 2007-02-28 21:08 ` Hugh Dickins
  2007-03-01  8:42   ` Fwd: " Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Hugh Dickins @ 2007-02-28 21:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Morton, linux-kernel

On Wed, 28 Feb 2007, Jan Beulich wrote:

> A change early last year reordered struct page so that ptl overlaps not only
> private, but also mapping. Since spinlock_t can be much larger, I'm wondering
> whether there's a reason to not also overlay the space index and lru take -
> are these used for anything on page table pages?

Overlaying lru is a problem for for those architectures which use
kmem_cache_alloc for their pagetables: arm26, powerpc, sparc64 and
perhaps others (I just grepped quickly through include/asm*, didn't
follow up those who have extern functions): since slab reuses the
lru fields for its own purposes.  Could perhaps be stacked somehow.

Overlaying index is fairly straightforward: the index field is fair
game.  In my original patches I did overlay index, but Andrew was
strongly averse to the way I was doing it, and scaled things back,
to private alone if I remember rightly, then relaxed a little to
include mapping too.  Way back then I made up a patch to overlay
index too (when I saw Fedora going out with CONFIG_DEBUG_SPINLOCK),
but I could never get it into a form where I felt it would satisfy
Andrew; and grew increasingly dissatisfied with that approach myself.

I don't think further overlaying is the right answer really.
But I do think it's a scandal that the size of struct page (in a
DEBUG_SPINLOCK system) is governed by such a minority use of the
struct page.  Lacking a satisfying answer, I've just let it drift
on until someone notices and complains.

kmalloc a separate spinlock structure when it's too big to fit in?
Not such a good idea, since then there will tend to be false sharing
of cachelines between them: simpler just to disable SPLIT_PTLOCK in
that case.

I'm not happy with the status quo, but I don't know the right answer:
perhaps allow pagetable pages to use an undebugged spinlock variant?

Hugh

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Fwd: Re: struct page field arrangement
  2007-02-28 21:08 ` Hugh Dickins
@ 2007-03-01  8:42   ` Jan Beulich
  2007-03-01 10:22     ` Keir Fraser
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2007-03-01  8:42 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2649 bytes --]

Having looked a little into the disabled SPLIT_PT_LOCK issue on Xen, I
realized that is shouldn't be too difficult to re-enable it (at least in some
cases).

One option would be to not use the mapping field of struct page for
PageForeign but rather the index one (and to that effect was a mail I
sent to Andrew Morton and Hugh Dickins that I got below answer on,
basically indicating that it should be okay to use index here). This would
allow removing the Xen conditional on SPLIT_PTLOCK_CPUS altogether.

The other option would be along the lines of attached patch (not
checked whether it would apply as-is to -unstable, but it appears to
work okay), allowing the splitting at least in non-debug configs.

Opinions?

Thanks, Jan

>>> Hugh Dickins <hugh@veritas.com> 28.02.07 22:08 >>>
On Wed, 28 Feb 2007, Jan Beulich wrote:

> A change early last year reordered struct page so that ptl overlaps not only
> private, but also mapping. Since spinlock_t can be much larger, I'm wondering
> whether there's a reason to not also overlay the space index and lru take -
> are these used for anything on page table pages?

Overlaying lru is a problem for for those architectures which use
kmem_cache_alloc for their pagetables: arm26, powerpc, sparc64 and
perhaps others (I just grepped quickly through include/asm*, didn't
follow up those who have extern functions): since slab reuses the
lru fields for its own purposes.  Could perhaps be stacked somehow.

Overlaying index is fairly straightforward: the index field is fair
game.  In my original patches I did overlay index, but Andrew was
strongly averse to the way I was doing it, and scaled things back,
to private alone if I remember rightly, then relaxed a little to
include mapping too.  Way back then I made up a patch to overlay
index too (when I saw Fedora going out with CONFIG_DEBUG_SPINLOCK),
but I could never get it into a form where I felt it would satisfy
Andrew; and grew increasingly dissatisfied with that approach myself.

I don't think further overlaying is the right answer really.
But I do think it's a scandal that the size of struct page (in a
DEBUG_SPINLOCK system) is governed by such a minority use of the
struct page.  Lacking a satisfying answer, I've just let it drift
on until someone notices and complains.

kmalloc a separate spinlock structure when it's too big to fit in?
Not such a good idea, since then there will tend to be false sharing
of cachelines between them: simpler just to disable SPLIT_PTLOCK in
that case.

I'm not happy with the status quo, but I don't know the right answer:
perhaps allow pagetable pages to use an undebugged spinlock variant?

Hugh

[-- Attachment #2: xen-split-pt-lock.patch --]
[-- Type: text/plain, Size: 2456 bytes --]

From: jbeulich@novell.com
Subject: allow use of split page table lock in non-debug case
Patch-mainline: obsolete

Index: head-2007-02-27/include/linux/mm.h
===================================================================
--- head-2007-02-27.orig/include/linux/mm.h	2007-02-27 16:25:00.000000000 +0100
+++ head-2007-02-27/include/linux/mm.h	2007-02-28 17:11:56.000000000 +0100
@@ -891,7 +891,11 @@ static inline pmd_t *pmd_alloc(struct mm
 #define pte_lock_init(_page)	do {					\
 	spin_lock_init(__pte_lockptr(_page));				\
 } while (0)
-#define pte_lock_deinit(page)	((page)->mapping = NULL)
+#define pte_lock_deinit(_page)	do {					\
+	if ((void *)&((struct page *)0)->mapping			\
+	    < (void *)(__pte_lockptr((struct page *)0) + 1))		\
+		(_page)->mapping = NULL;				\
+} while (0)
 #define pte_lockptr(mm, pmd)	({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
 #else
 /*
Index: head-2007-02-27/include/linux/page-flags.h
===================================================================
--- head-2007-02-27.orig/include/linux/page-flags.h	2007-02-27 16:26:22.000000000 +0100
+++ head-2007-02-27/include/linux/page-flags.h	2007-02-28 16:38:41.000000000 +0100
@@ -254,8 +254,16 @@ static inline void SetPageUptodate(struc
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)

 #define PageForeign(page)	test_bit(PG_foreign, &(page)->flags)
+#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+#define PageForeignCheck()						\
+	BUILD_BUG_ON((void *)&((struct page *)0)->mapping		\
+	             < (void *)(__pte_lockptr((struct page *)0) + 1))
+#else
+#define PageForeignCheck() ((void)0)
+#endif
 #define SetPageForeign(page, dtor) do {		\
 	set_bit(PG_foreign, &(page)->flags);	\
+	PageForeignCheck();			\
 	(page)->mapping = (void *)dtor;		\
 } while (0)
 #define ClearPageForeign(page) do {		\
Index: head-2007-02-27/mm/Kconfig
===================================================================
--- head-2007-02-27.orig/mm/Kconfig	2007-02-27 16:23:14.000000000 +0100
+++ head-2007-02-27/mm/Kconfig	2007-02-28 16:26:48.000000000 +0100
@@ -139,7 +139,8 @@ config SPLIT_PTLOCK_CPUS
 	int
 	default "4096" if ARM && !CPU_CACHE_VIPT
 	default "4096" if PARISC && !PA20
-	default "4096" if X86_XEN || X86_64_XEN
+	default "4096" if X86_XEN && (PREEMPT || DEBUG_SPINLOCK || DEBUG_LOCK_ALLOC)
+	default "4096" if X86_64_XEN && (DEBUG_SPINLOCK || DEBUG_LOCK_ALLOC)
 	default "4"

 #

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-01  8:42   ` Fwd: " Jan Beulich
@ 2007-03-01 10:22     ` Keir Fraser
  2007-03-01 11:45       ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-03-01 10:22 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 1/3/07 08:42, "Jan Beulich" <jbeulich@novell.com> wrote:

> Having looked a little into the disabled SPLIT_PT_LOCK issue on Xen, I
> realized that is shouldn't be too difficult to re-enable it (at least in some
> cases).

How does pagetable locking work in modern Linux kernels? It seems that
updates of ptes are protected by a per-page lock or the mm lock, and
population of page directory entries is protected by the mm lock, but that
there is no synchronisation with read-only pagetable walks. Does this mean
that sections of pagetable hierarchy are never reaped from a process until
it dies?

Can we confident that the mm_pin/mm_unpin code (which walks pagetables and
has to find every page to make every one read-only or writable) is safe?
Presumably for this to be true we need to be sure that noone can meanwhile
concurrently be populating the pagetable we are walking with extra
pgds/puds/pmds/ptes...

 -- Keir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-01 10:22     ` Keir Fraser
@ 2007-03-01 11:45       ` Jan Beulich
  2007-03-01 12:12         ` Keir Fraser
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2007-03-01 11:45 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>>> Keir Fraser <keir@xensource.com> 01.03.07 11:22 >>>
>On 1/3/07 08:42, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> Having looked a little into the disabled SPLIT_PT_LOCK issue on Xen, I
>> realized that is shouldn't be too difficult to re-enable it (at least in some
>> cases).
>
>How does pagetable locking work in modern Linux kernels? It seems that
>updates of ptes are protected by a per-page lock or the mm lock, and
>population of page directory entries is protected by the mm lock, but that
>there is no synchronisation with read-only pagetable walks. Does this mean
>that sections of pagetable hierarchy are never reaped from a process until
>it dies?

Yes, that's my understanding.

>Can we confident that the mm_pin/mm_unpin code (which walks pagetables and
>has to find every page to make every one read-only or writable) is safe?
>Presumably for this to be true we need to be sure that noone can meanwhile
>concurrently be populating the pagetable we are walking with extra
>pgds/puds/pmds/ptes...

Since the pin/unpin walking only cares about pgd/pud/pmd entries, synchronization
is guaranteed through mm->page_table_lock. The pte lock is used only for leaf
entries, which are of no concern to (un)pinning.

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-01 11:45       ` Jan Beulich
@ 2007-03-01 12:12         ` Keir Fraser
  2007-03-01 12:50           ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-03-01 12:12 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser; +Cc: xen-devel

On 1/3/07 11:45, "Jan Beulich" <jbeulich@novell.com> wrote:

>> Can we confident that the mm_pin/mm_unpin code (which walks pagetables and
>> has to find every page to make every one read-only or writable) is safe?
>> Presumably for this to be true we need to be sure that noone can meanwhile
>> concurrently be populating the pagetable we are walking with extra
>> pgds/puds/pmds/ptes...
> 
> Since the pin/unpin walking only cares about pgd/pud/pmd entries,
> synchronization
> is guaranteed through mm->page_table_lock. The pte lock is used only for leaf
> entries, which are of no concern to (un)pinning.

Oh yes, of course.

By the way, I think your proposed patch looks okay. However we already use
page->index for foreign pages, in netback. Perhaps page->lru is also fair
game? :-)

 K.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-01 12:12         ` Keir Fraser
@ 2007-03-01 12:50           ` Jan Beulich
  2007-03-01 13:42             ` Keir Fraser
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2007-03-01 12:50 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>By the way, I think your proposed patch looks okay. However we already use
>page->index for foreign pages, in netback. Perhaps page->lru is also fair
>game? :-)

Not generally, there are arch-es that use it for their own purposes (look
back at the message from Hugh I originally forwarded). So even if none of
Xen's currently supported would use it (I didn't check), I wouldn't want to
introduce a lurking issue if anyone comes up with a new port.

I didn't check what private is being used for - while it can't be used for
PageForeign's destructor pointer, perhaps netback could use that one
rather than index? Or, even more likely, netback could use mapping if
PageForeign was switched over to index (while mapping overlays ptl, that
is benign to netback as it's not dealing with pte pages). But of course,
this requires thorough auditing of other uses of index.

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-01 12:50           ` Jan Beulich
@ 2007-03-01 13:42             ` Keir Fraser
  0 siblings, 0 replies; 19+ messages in thread
From: Keir Fraser @ 2007-03-01 13:42 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser; +Cc: xen-devel




On 1/3/07 12:50, "Jan Beulich" <jbeulich@novell.com> wrote:

> I didn't check what private is being used for - while it can't be used for
> PageForeign's destructor pointer, perhaps netback could use that one
> rather than index? Or, even more likely, netback could use mapping if
> PageForeign was switched over to index (while mapping overlays ptl, that
> is benign to netback as it's not dealing with pte pages). But of course,
> this requires thorough auditing of other uses of index.

Since PageForeign is only used for pagetables (for which you checked ->index
should be okay) and for skbuff data (where we already use ->index in
netback) then changing PageForeign to put the callback in ->index should be
fine.

 -- Keir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
@ 2007-03-16 11:58 Jan Beulich
  2007-03-16 12:11 ` Keir Fraser
  2007-03-16 12:37 ` Keir Fraser
  0 siblings, 2 replies; 19+ messages in thread
From: Jan Beulich @ 2007-03-16 11:58 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>>>> Keir Fraser <keir@xensource.com> 01.03.07 11:22 >>>
>>Can we confident that the mm_pin/mm_unpin code (which walks pagetables and
>>has to find every page to make every one read-only or writable) is safe?
>>Presumably for this to be true we need to be sure that noone can meanwhile
>>concurrently be populating the pagetable we are walking with extra
>>pgds/puds/pmds/ptes...
>
>Since the pin/unpin walking only cares about pgd/pud/pmd entries, synchronization
>is guaranteed through mm->page_table_lock. The pte lock is used only for leaf
>entries, which are of no concern to (un)pinning.

I'm afraid I have to correct myself. Stress testing has shown severe
problems, and after a few hours of staring at this I'm almost certain there
is a race condition here: While no new pte-s can ever appear, the logic in
mm/vmscan.c may try to modify pte-s in an mm currently being unpinned
(at least through ptep_clear_flush_young() called from
page_referenced_one() in mm/rmap.c). If this happens when
xen_pgd_unpin() has already passed the respective pte page, but
mm_walk() hasn't reached the page, yet, the update will fail (if done
directly, ptwr will no pick this up, and if done through a hypercall, the
call would fail, likely producing a BUG()).

Of course we could back out that changeset, but one of the reasons for
submitting it was to get closer to native. Therefore I'm considering
alternatives:
- lock all pte pages right after taking the page table lock in the pin/unpin
  functions, and drop them right before dropping the page table lock
  (this nesting should be no problem, as none of them can ever nest
  elsewhere, since otherwise the non-split-pt-lock case would not work)
- find a way to fix up the possibly resulting page fault (e.g. allow the
  ptwr code to update the page if it is PGT_l1_page_table but has a
  zero type reference count; the PGT_writable case would be more
  difficult and would probably need to be caught in the guest by checking
  the pte and finding it to be writable); the hypercalls don't seem to be
  affected (do_mmu_update seems okay as it doesn't look at the type
  reference count, and do_update_va_mapping can be called only on
  the currently active address space, which cannot be the one being in
  transition)
Dealing with an mm being pinned seems more difficult, as its L1 page
table pages will not be in PGT_l1_page_table state yet. Thus another
alternative could be to make page_check_address() in the kernel
detect the being-(un)pinned status, or even adjust pte_lockptr() to
return the page table lock for mm-s being (un)pinned (this would
perhaps be the cheapest fix).

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-16 11:58 Jan Beulich
@ 2007-03-16 12:11 ` Keir Fraser
  2007-03-16 12:25   ` Keir Fraser
  2007-03-16 12:37 ` Keir Fraser
  1 sibling, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-03-16 12:11 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser; +Cc: xen-devel

On 16/3/07 11:58, "Jan Beulich" <jbeulich@novell.com> wrote:

>> Since the pin/unpin walking only cares about pgd/pud/pmd entries,
>> synchronization
>> is guaranteed through mm->page_table_lock. The pte lock is used only for leaf
>> entries, which are of no concern to (un)pinning.
> 
> I'm afraid I have to correct myself. Stress testing has shown severe
> problems, and after a few hours of staring at this I'm almost certain there
> is a race condition here: While no new pte-s can ever appear, the logic in
> mm/vmscan.c may try to modify pte-s in an mm currently being unpinned
> (at least through ptep_clear_flush_young() called from
> page_referenced_one() in mm/rmap.c). If this happens when
> xen_pgd_unpin() has already passed the respective pte page, but
> mm_walk() hasn't reached the page, yet, the update will fail (if done
> directly, ptwr will no pick this up, and if done through a hypercall, the
> call would fail, likely producing a BUG()).

What kind of stress test did you run? I was expecting that unpin would be
okay because we only call mm_unpin() from _arch_exit_mmap() if the mm_count
is 1 (which I believe means the mm is not active in any task).

 -- Keir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-16 12:11 ` Keir Fraser
@ 2007-03-16 12:25   ` Keir Fraser
  2007-03-16 12:54     ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-03-16 12:25 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: xen-devel

On 16/3/07 12:11, "Keir Fraser" <keir@xensource.com> wrote:

>> page_referenced_one() in mm/rmap.c). If this happens when
>> xen_pgd_unpin() has already passed the respective pte page, but
>> mm_walk() hasn't reached the page, yet, the update will fail (if done
>> directly, ptwr will no pick this up, and if done through a hypercall, the
>> call would fail, likely producing a BUG()).
> 
> What kind of stress test did you run? I was expecting that unpin would be
> okay because we only call mm_unpin() from _arch_exit_mmap() if the mm_count
> is 1 (which I believe means the mm is not active in any task).

And actually the pinning happens on activate_mm() in most cases, which I
would expect to be 'early enough' since noone can run on the mm before that?

If you've managed to provoke bugs then that's very interesting (and scary)!

I suppose if I understand the rmap case correctly, we're still susceptible
to the paging kernel thread trying to page things out at any time? Is that
what you think you've been seeing go wrong?

 -- Keir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-16 11:58 Jan Beulich
  2007-03-16 12:11 ` Keir Fraser
@ 2007-03-16 12:37 ` Keir Fraser
  2007-03-16 12:46   ` Jan Beulich
  1 sibling, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-03-16 12:37 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser; +Cc: xen-devel

On 16/3/07 11:58, "Jan Beulich" <jbeulich@novell.com> wrote:

> Of course we could back out that changeset, but one of the reasons for
> submitting it was to get closer to native.

None of the fixes sound straightforward. :-)

I'd suggest we back out the mm/Kconfig change for 3.0.5 (since that is
close, and I'm not super confident that we'd get any of your proposed fixes
correct first time) and then have a go at supporting split-ptlock in 3.0.6
timeframe.

The situation after all is that our scalability beyond 4 VCPUs is currently
almost certainly bottlenecked by conservative locking in Xen rather than by
per-mm locking in Linux. So we're looking at complicating the Linux code
close to a Xen release for no actual user benefit.

 -- Keir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-16 12:37 ` Keir Fraser
@ 2007-03-16 12:46   ` Jan Beulich
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2007-03-16 12:46 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

I agree.

>>> Keir Fraser <keir@xensource.com> 16.03.07 13:37 >>>
On 16/3/07 11:58, "Jan Beulich" <jbeulich@novell.com> wrote:

> Of course we could back out that changeset, but one of the reasons for
> submitting it was to get closer to native.

None of the fixes sound straightforward. :-)

I'd suggest we back out the mm/Kconfig change for 3.0.5 (since that is
close, and I'm not super confident that we'd get any of your proposed fixes
correct first time) and then have a go at supporting split-ptlock in 3.0.6
timeframe.

The situation after all is that our scalability beyond 4 VCPUs is currently
almost certainly bottlenecked by conservative locking in Xen rather than by
per-mm locking in Linux. So we're looking at complicating the Linux code
close to a Xen release for no actual user benefit.

 -- Keir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-16 12:25   ` Keir Fraser
@ 2007-03-16 12:54     ` Jan Beulich
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2007-03-16 12:54 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>>> Keir Fraser <keir@xensource.com> 16.03.07 13:25 >>>
>On 16/3/07 12:11, "Keir Fraser" <keir@xensource.com> wrote:
>
>>> page_referenced_one() in mm/rmap.c). If this happens when
>>> xen_pgd_unpin() has already passed the respective pte page, but
>>> mm_walk() hasn't reached the page, yet, the update will fail (if done
>>> directly, ptwr will no pick this up, and if done through a hypercall, the
>>> call would fail, likely producing a BUG()).
>> 
>> What kind of stress test did you run? I was expecting that unpin would be
>> okay because we only call mm_unpin() from _arch_exit_mmap() if the mm_count
>> is 1 (which I believe means the mm is not active in any task).

newburn on machines with not too much (<= 2G) memory.

>And actually the pinning happens on activate_mm() in most cases, which I
>would expect to be 'early enough' since noone can run on the mm before that?
>
>If you've managed to provoke bugs then that's very interesting (and scary)!
>
>I suppose if I understand the rmap case correctly, we're still susceptible
>to the paging kernel thread trying to page things out at any time? Is that
>what you think you've been seeing go wrong?

Yes, somewhere in that area. From the data I have (page fault on the
page table write in ptep_clear_flush_young(), with the page table dump
showing the page to be writeable and present) I can only conclude that
the race is with the unpin path (otherwise I should see the page being
write protected), while the vm scan tries to recover memory at the same
time, and since this scan is scanning zones, not mm-s, the references to
the mm-s are being obtained from struct page -> vma -> mm (i.e. the
mm-s' use counts don't matter here).

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
@ 2007-03-16 13:47 Jan Beulich
  2007-03-16 14:13 ` Keir Fraser
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2007-03-16 13:47 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Btw., another question that made me wonder already when doing the
original patch: why is it that x86-64 properly uses locking for mm_pin_all(),
yet i386 doesn't need to?

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-16 13:47 Jan Beulich
@ 2007-03-16 14:13 ` Keir Fraser
  2007-03-16 14:30   ` Keir Fraser
  0 siblings, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-03-16 14:13 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser; +Cc: xen-devel

On 16/3/07 13:47, "Jan Beulich" <jbeulich@novell.com> wrote:

> Btw., another question that made me wonder already when doing the
> original patch: why is it that x86-64 properly uses locking for mm_pin_all(),
> yet i386 doesn't need to?

Neither needs to. Well, assuming you're not using CONFIG_PREEMPT. :-)

Either we are in stop_machine context, or we have offlined all other CPUs
via cpu hotplug. In the absence of involuntary preemption it's therefore
safe to proceed without locking. But probably inadvisable (we'd like to
support full CONFIG_PREEMPT sometime in the future)... I think the 386 code
should be changed to match x86/64.

 -- Keir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-16 14:13 ` Keir Fraser
@ 2007-03-16 14:30   ` Keir Fraser
  2007-03-16 15:15     ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Keir Fraser @ 2007-03-16 14:30 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: xen-devel

On 16/3/07 14:13, "Keir Fraser" <keir@xensource.com> wrote:

> Either we are in stop_machine context, or we have offlined all other CPUs
> via cpu hotplug. In the absence of involuntary preemption it's therefore
> safe to proceed without locking. But probably inadvisable (we'd like to
> support full CONFIG_PREEMPT sometime in the future)... I think the 386 code
> should be changed to match x86/64.

Actually it's not so easy. We walk the pgd_list and so do not have the mm
associated with the pgd. And in some cases there may not be an mm, if we
walk the list after a pgd is allocated but before it is attached to an mm.

So I think we should simply disable preemption in the i386 case. Which is
done easily by simply taking the pgd_lock (which it would be polite to do
anyway, since we're walking the pgd_list). With preemption disabled we're
definitely safe because pinning is non-blocking and all other CPUs are
safely spinning in stop_machine or have been offlined.

Actually preemption is already disabled by the caller (for a different
reason) but it's not nice to rely on that.

 -- Keir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-16 14:30   ` Keir Fraser
@ 2007-03-16 15:15     ` Jan Beulich
  2007-03-16 15:30       ` Keir Fraser
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2007-03-16 15:15 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>>> Keir Fraser <keir@xensource.com> 16.03.07 15:30 >>>
>On 16/3/07 14:13, "Keir Fraser" <keir@xensource.com> wrote:
>
>> Either we are in stop_machine context, or we have offlined all other CPUs
>> via cpu hotplug. In the absence of involuntary preemption it's therefore
>> safe to proceed without locking. But probably inadvisable (we'd like to
>> support full CONFIG_PREEMPT sometime in the future)... I think the 386 code
>> should be changed to match x86/64.
>
>Actually it's not so easy. We walk the pgd_list and so do not have the mm
>associated with the pgd. And in some cases there may not be an mm, if we
>walk the list after a pgd is allocated but before it is attached to an mm.

But without being attached to an mm, the user portion of the pgd should
be blank? I really can't follow why i386 requires so much more special handling
here than x86-64.

>So I think we should simply disable preemption in the i386 case. Which is
>done easily by simply taking the pgd_lock (which it would be polite to do
>anyway, since we're walking the pgd_list). With preemption disabled we're
>definitely safe because pinning is non-blocking and all other CPUs are
>safely spinning in stop_machine or have been offlined.

And what about the pgd_test_and_unpin() case?

>Actually preemption is already disabled by the caller (for a different
>reason) but it's not nice to rely on that.

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: Re: struct page field arrangement
  2007-03-16 15:15     ` Jan Beulich
@ 2007-03-16 15:30       ` Keir Fraser
  0 siblings, 0 replies; 19+ messages in thread
From: Keir Fraser @ 2007-03-16 15:30 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser; +Cc: xen-devel

On 16/3/07 15:15, "Jan Beulich" <jbeulich@novell.com> wrote:

> But without being attached to an mm, the user portion of the pgd should
> be blank? I really can't follow why i386 requires so much more special
> handling here than x86-64.

PAE allocates the pmd's early (in pgd_alloc() in fact). So any pgd_alloc()ed
pgd must be captured by mm_pin_all().

> And what about the pgd_test_and_unpin() case?

It's only called from pgd_alloc, pgd_free, pgd_dtor. I'm not sure it's
needed in all those places, but it's needed in some of them to deal with the
case that _arch_exit_mmap() didn't unpin the pgd, because a PAE pgd does not
lose its pmd's until pgd_free(). In none of those cases is there a
concurrency problem: there's no mm associated with the pgd and hence no
concurrency issue.

I just checked in a patch to clarify some of this, particularly around
mm_pin_all, as c/s 14408. Feel free to submit more clarification patches --
I suspect a few more comments around these functions would help avoid
confusion!

 -- Keir

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2007-03-16 15:30 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-02-28 14:31 struct page field arrangement Jan Beulich
2007-02-28 21:08 ` Hugh Dickins
2007-03-01  8:42   ` Fwd: " Jan Beulich
2007-03-01 10:22     ` Keir Fraser
2007-03-01 11:45       ` Jan Beulich
2007-03-01 12:12         ` Keir Fraser
2007-03-01 12:50           ` Jan Beulich
2007-03-01 13:42             ` Keir Fraser
  -- strict thread matches above, loose matches on Subject: below --
2007-03-16 11:58 Jan Beulich
2007-03-16 12:11 ` Keir Fraser
2007-03-16 12:25   ` Keir Fraser
2007-03-16 12:54     ` Jan Beulich
2007-03-16 12:37 ` Keir Fraser
2007-03-16 12:46   ` Jan Beulich
2007-03-16 13:47 Jan Beulich
2007-03-16 14:13 ` Keir Fraser
2007-03-16 14:30   ` Keir Fraser
2007-03-16 15:15     ` Jan Beulich
2007-03-16 15:30       ` Keir Fraser

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.