* operation ordering during pgd_alloc/pgd_free
@ 2008-06-05 8:51 Jan Beulich
2008-06-05 9:27 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 3+ messages in thread
From: Jan Beulich @ 2008-06-05 8:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: Ingo Molnar, linux-kernel
At present, pgd_ctor() adds a new pgd to pgd_list solely based on
!SHARED_KERNEL_PMD. For PAE && !SHARED_KERNEL_PMD (i.e. Xen)
this doesn't seem correct, as the pgd is still empty, which will confuse
vmalloc_sync_all(). So in this case, list insertion should only happen at
the end of pgd_prepopulate_pmd().
Likewise, pgd_free() calls pgd_mop_up_pmds() *before* pgd_dtor(),
with the former zeroing pgd entries as it goes and only the latter
removing the pgd from the list. Just as above this can confuse
vmalloc_sync_all(), so here I would think that the two calls should just
be swapped. However, if they get swapped, careful inspection of the
interaction with save/restore will be needed - XenSource's Linux tree
has a comment specifically to that effect:
/*
* After this the pgd should not be pinned for the duration of this
* function's execution. We should never sleep and thus never race:
* 1. User pmds will not become write-protected under our feet due
* to a concurrent mm_pin_all().
* 2. The machine addresses in PGD entries will not become invalid
* due to a concurrent save/restore.
*/
Since that tree doesn't support preemption, this is perhaps fine, but
likely going to cause problems in the (preemptable) pv-ops code.
The issue with vmalloc_sync_all() would even go unnoticed, since the
patch to unify the pgd_list mechanism with x86-64 removed the
BUG_ON() that was meant to trigger on issues like this.
But maybe I'm missing something?
Jan
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: operation ordering during pgd_alloc/pgd_free
2008-06-05 8:51 operation ordering during pgd_alloc/pgd_free Jan Beulich
@ 2008-06-05 9:27 ` Jeremy Fitzhardinge
2008-06-05 9:58 ` Jan Beulich
0 siblings, 1 reply; 3+ messages in thread
From: Jeremy Fitzhardinge @ 2008-06-05 9:27 UTC (permalink / raw)
To: Jan Beulich; +Cc: Ingo Molnar, linux-kernel
Jan Beulich wrote:
> At present, pgd_ctor() adds a new pgd to pgd_list solely based on
> !SHARED_KERNEL_PMD. For PAE && !SHARED_KERNEL_PMD (i.e. Xen)
> this doesn't seem correct, as the pgd is still empty, which will confuse
> vmalloc_sync_all(). So in this case, list insertion should only happen at
> the end of pgd_prepopulate_pmd().
>
How does vmalloc_sync_all() get confused?
> Likewise, pgd_free() calls pgd_mop_up_pmds() *before* pgd_dtor(),
> with the former zeroing pgd entries as it goes and only the latter
> removing the pgd from the list. Just as above this can confuse
> vmalloc_sync_all(), so here I would think that the two calls should just
> be swapped. However, if they get swapped, careful inspection of the
> interaction with save/restore will be needed -
Yes, I specifically wanted to make sure that the pgd was on the list
from before it had any entries until after it has any, to make sure that
no pmds escape visibility from xen_mm_pin_all(). (Note to self: put a
memory barrier to make sure the list update is complete before/after
inserting/removing any pmd entries.)
> XenSource's Linux tree
> has a comment specifically to that effect:
>
> /*
> * After this the pgd should not be pinned for the duration of this
> * function's execution. We should never sleep and thus never race:
> * 1. User pmds will not become write-protected under our feet due
> * to a concurrent mm_pin_all().
> * 2. The machine addresses in PGD entries will not become invalid
> * due to a concurrent save/restore.
> */
>
> Since that tree doesn't support preemption, this is perhaps fine, but
> likely going to cause problems in the (preemptable) pv-ops code.
>
I don't think so. When saving with preemption enabled, it first puts
all processes in the freezer before entering stop_machine_run(); a
process constructing a pagetable should be finished by the time it can
be frozen.
But I think there's a problem *without* preemption.
pmd_prepopulate_pgd() allocates new pmds with GFP_KERNEL, and so it can
block, which undermines the precondition of the comment you quote. This
allows an unlisted and unpinned pgd to be missed at save time. I could
just use the freezer unconditionally, but there was some concern about
how much time it would take on a busy system.
Alternatively, a different ordering would fix it:
1. preallocate - but don't install - the pmds
2. take pgd_lock
3. install pmds into pgd
4. insert pgd onto list
5. release pgd_lock
Holding pgd_lock will prevent both vmalloc_sync_all() and
xen_mm_pin_all() from being able to visit the pgd while it is in its
transitional state.
> The issue with vmalloc_sync_all() would even go unnoticed, since the
> patch to unify the pgd_list mechanism with x86-64 removed the
> BUG_ON() that was meant to trigger on issues like this.
>
Is there an inherent reason vmalloc_sync_all can't deal with a partially
constructed pgd? Couldn't it just skip them, as if it wasn't (or
rather, not yet) on the list? In fact, that looks like what it does now.
Thanks for looking at this.
J
^ permalink raw reply [flat|nested] 3+ messages in thread* Re: operation ordering during pgd_alloc/pgd_free
2008-06-05 9:27 ` Jeremy Fitzhardinge
@ 2008-06-05 9:58 ` Jan Beulich
0 siblings, 0 replies; 3+ messages in thread
From: Jan Beulich @ 2008-06-05 9:58 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: Ingo Molnar, linux-kernel
>But I think there's a problem *without* preemption.
>pmd_prepopulate_pgd() allocates new pmds with GFP_KERNEL, and so it can
>block, which undermines the precondition of the comment you quote. This
>allows an unlisted and unpinned pgd to be missed at save time. I could
>just use the freezer unconditionally, but there was some concern about
>how much time it would take on a busy system.
>
>Alternatively, a different ordering would fix it:
>
> 1. preallocate - but don't install - the pmds
> 2. take pgd_lock
> 3. install pmds into pgd
> 4. insert pgd onto list
> 5. release pgd_lock
... which is precisely what the XenSource tree does.
>> The issue with vmalloc_sync_all() would even go unnoticed, since the
>> patch to unify the pgd_list mechanism with x86-64 removed the
>> BUG_ON() that was meant to trigger on issues like this.
>>
>
>Is there an inherent reason vmalloc_sync_all can't deal with a partially
>constructed pgd? Couldn't it just skip them, as if it wasn't (or
>rather, not yet) on the list? In fact, that looks like what it does now.
It could be made so, but it doesn't at present - it exits the inner loop
when vmalloc_sync_one() returns NULL. The intention here is that if
the reference page table has a clear entry at some level, then there
shouldn't be any attempt to look at other page tables' entries mapping
the same va (and because it would be an error if page table other than
the reference one had a clear entry where the reference one didn't,
there was that BUG_ON() that your patch removed). Since the purpose
of vmalloc_sync_all() is to catch partially populated page tables, allowing
it to skip such that are under construction would need a way to
distinguish them from such being fully constructed but out of sync.
Jan
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2008-06-05 10:02 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-05 8:51 operation ordering during pgd_alloc/pgd_free Jan Beulich
2008-06-05 9:27 ` Jeremy Fitzhardinge
2008-06-05 9:58 ` Jan Beulich
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.