* [PATCH 0/6] freepgt: free_pgtables shakeup
@ 2005-03-23 17:10 Hugh Dickins
2005-03-23 19:57 ` David S. Miller
2005-03-25 21:22 ` Russell King
0 siblings, 2 replies; 24+ messages in thread
From: Hugh Dickins @ 2005-03-23 17:10 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, davem, tony.luck, benh, ak, linux-kernel
Here's the recut of those patches, including David Miller's vital fixes.
I'm addressing these to Nick rather than Andrew, because they're perhaps
not fit for -mm until more testing done and the x86_64 32-bit vdso issue
handled. I'm unlikely to be responsive until next week, sorry: over to
you, Nick - thanks.
Hugh
arch/i386/mm/hugetlbpage.c | 11 --
arch/i386/mm/pgtable.c | 2
arch/ia64/mm/hugetlbpage.c | 60 ++++---------
arch/ppc64/mm/hugetlbpage.c | 47 ----------
include/asm-generic/pgtable.h | 8 -
include/asm-ia64/page.h | 2
include/asm-ia64/pgtable.h | 30 ------
include/asm-ia64/processor.h | 8 -
include/asm-ppc64/pgtable.h | 12 +-
include/asm-ppc64/processor.h | 4
include/asm-s390/processor.h | 2
include/asm-sparc64/pgtable.h | 15 ---
include/linux/hugetlb.h | 6 -
include/linux/mm.h | 14 +--
mm/memory.c | 190 ++++++++++++++++++++++++++++++------------
mm/mmap.c | 139 ++++++++----------------------
16 files changed, 226 insertions(+), 324 deletions(-)
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-23 17:10 Hugh Dickins
@ 2005-03-23 19:57 ` David S. Miller
2005-03-24 0:26 ` Nick Piggin
2005-03-25 21:22 ` Russell King
1 sibling, 1 reply; 24+ messages in thread
From: David S. Miller @ 2005-03-23 19:57 UTC (permalink / raw)
To: Hugh Dickins; +Cc: nickpiggin, akpm, tony.luck, benh, ak, linux-kernel
On Wed, 23 Mar 2005 17:10:15 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> Here's the recut of those patches, including David Miller's vital fixes.
> I'm addressing these to Nick rather than Andrew, because they're perhaps
> not fit for -mm until more testing done and the x86_64 32-bit vdso issue
> handled. I'm unlikely to be responsive until next week, sorry: over to
> you, Nick - thanks.
Works perfectly fine on sparc64.
BTW, I note that we may still want something like that page table
bitmask stuff I worked on some time ago. Ie. for things like
what lat_mmap does in lmbench, I think that situation is more
realistic than people might thing.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-23 19:57 ` David S. Miller
@ 2005-03-24 0:26 ` Nick Piggin
2005-03-24 5:44 ` David S. Miller
2005-03-30 19:30 ` Hugh Dickins
0 siblings, 2 replies; 24+ messages in thread
From: Nick Piggin @ 2005-03-24 0:26 UTC (permalink / raw)
To: David S. Miller; +Cc: Hugh Dickins, akpm, tony.luck, benh, ak, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 945 bytes --]
David S. Miller wrote:
> On Wed, 23 Mar 2005 17:10:15 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
>
>
>>Here's the recut of those patches, including David Miller's vital fixes.
>>I'm addressing these to Nick rather than Andrew, because they're perhaps
>>not fit for -mm until more testing done and the x86_64 32-bit vdso issue
>>handled. I'm unlikely to be responsive until next week, sorry: over to
>>you, Nick - thanks.
>
>
> Works perfectly fine on sparc64.
>
OK, attached is my first cut at slimming down the boundary tests.
I have only had a chance to try it on i386, so I hate to drop it
on you like this - but I *have* put a bit of thought into it....
Treat it as an RFC, and I'll try to test it on a wider range of
things in the next couple of days.
Not that there is anything really nasty with your system David,
so I don't think it will be a big disaster if I can't get this to
work.
Goes on top of Hugh's 6 patches.
[-- Attachment #2: fix-ptclear --]
[-- Type: text/plain, Size: 3732 bytes --]
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2005-03-24 10:43:31.000000000 +1100
+++ linux-2.6/mm/memory.c 2005-03-24 11:22:21.000000000 +1100
@@ -139,14 +139,9 @@ static inline void free_pmd_range(struct
start &= PUD_MASK;
if (start < floor)
return;
- if (ceiling) {
- ceiling &= PUD_MASK;
- if (!ceiling)
- return;
- }
- if (end - 1 > ceiling - 1)
+ end = (end + PUD_SIZE - 1) & PUD_MASK;
+ if (end - ceiling - 1 < PUD_SIZE - 1)
return;
-
pmd = pmd_offset(pud, start);
pud_clear(pud);
pmd_free_tlb(tlb, pmd);
@@ -172,14 +167,9 @@ static inline void free_pud_range(struct
start &= PGDIR_MASK;
if (start < floor)
return;
- if (ceiling) {
- ceiling &= PGDIR_MASK;
- if (!ceiling)
- return;
- }
- if (end - 1 > ceiling - 1)
+ end = (end + PGDIR_SIZE - 1) & PGDIR_MASK;
+ if (end - ceiling - 1 < PGDIR_SIZE - 1)
return;
-
pud = pud_offset(pgd, start);
pgd_clear(pgd);
pud_free_tlb(tlb, pud);
@@ -198,6 +188,10 @@ void free_pgd_range(struct mmu_gather **
unsigned long next;
unsigned long start;
+ BUG_ON(addr >= end);
+ /* Don't want end to be 0 and ceiling to be greater than 0-PGDIR_SIZE */
+ BUG_ON(end - 1 > ceiling - 1);
+
/*
* The next few lines have given us lots of grief...
*
@@ -205,23 +199,22 @@ void free_pgd_range(struct mmu_gather **
* there will be no work to do at all, and we'd prefer not to
* go all the way down to the bottom just to discover that.
*
- * Why all these "- 1"s? Because 0 represents both the bottom
- * of the address space and the top of it (using -1 for the
- * top wouldn't help much: the masks would do the wrong thing).
- * The rule is that addr 0 and floor 0 refer to the bottom of
- * the address space, but end 0 and ceiling 0 refer to the top
- * Comparisons need to use "end - 1" and "ceiling - 1" (though
- * that end 0 case should be mythical).
- *
- * Wherever addr is brought up or ceiling brought down, we must
- * be careful to reject "the opposite 0" before it confuses the
- * subsequent tests. But what about where end is brought down
- * by PMD_SIZE below? no, end can't go down to 0 there.
+ * The tricky part of this logic (and similar in free_p?d_range above)
+ * is the 'end' handling. end and ceiling are *exclusive* boundaries,
+ * so their maximum is 0. This suggests the use of two's complement
+ * difference when comparing them, so the wrapping is handled for us.
*
- * Whereas we round start (addr) and ceiling down, by different
- * masks at different levels, in order to test whether a table
- * now has no other vmas using it, so can be freed, we don't
- * bother to round floor or end up - the tests don't need that.
+ * The method is:
+ * - Round end up to the nearest PMD aligned boundary.
+ * - If end has exceeded ceiling, then end - ceiling will be less than
+ * PMD_SIZE.
+ * end can't have approached ceiling from above if ceiling is 0,
+ * because it is rounded up to the next PMD aligned boundary, so
+ * either it will be 0, or 0+PMD_SIZE.
+ * - In the above case that end is 0, or any other time end might be
+ * equal to ceiling, end - ceiling = 0 < PMD_SIZE. So the actual test
+ * we use is (unsigned) end - ceiling - 1 < PMD_SIZE - 1,
+ * to catch this case.
*/
addr &= PMD_MASK;
@@ -230,14 +223,10 @@ void free_pgd_range(struct mmu_gather **
if (!addr)
return;
}
- if (ceiling) {
- ceiling &= PMD_MASK;
- if (!ceiling)
- return;
- }
- if (end - 1 > ceiling - 1)
+ end = (end + PMD_SIZE - 1) & PMD_MASK;
+ if (end - ceiling - 1 < PMD_SIZE - 1)
end -= PMD_SIZE;
- if (addr > end - 1)
+ if (addr >= end)
return;
start = addr;
^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [PATCH 0/6] freepgt: free_pgtables shakeup
@ 2005-03-24 1:00 Luck, Tony
2005-03-24 1:07 ` Nick Piggin
0 siblings, 1 reply; 24+ messages in thread
From: Luck, Tony @ 2005-03-24 1:00 UTC (permalink / raw)
To: Nick Piggin, David S. Miller; +Cc: Hugh Dickins, akpm, benh, ak, linux-kernel
>OK, attached is my first cut at slimming down the boundary tests.
>I have only had a chance to try it on i386, so I hate to drop it
>on you like this - but I *have* put a bit of thought into it....
>Treat it as an RFC, and I'll try to test it on a wider range of
>things in the next couple of days.
>
>Not that there is anything really nasty with your system David,
>so I don't think it will be a big disaster if I can't get this to
>work.
>
>Goes on top of Hugh's 6 patches.
Runs on ia64. Looks much cleaner too.
-Tony
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-24 1:00 [PATCH 0/6] freepgt: free_pgtables shakeup Luck, Tony
@ 2005-03-24 1:07 ` Nick Piggin
0 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2005-03-24 1:07 UTC (permalink / raw)
To: Luck, Tony; +Cc: David S. Miller, Hugh Dickins, akpm, benh, ak, linux-kernel
Luck, Tony wrote:
>>OK, attached is my first cut at slimming down the boundary tests.
>>I have only had a chance to try it on i386, so I hate to drop it
>>on you like this - but I *have* put a bit of thought into it....
>>Treat it as an RFC, and I'll try to test it on a wider range of
>>things in the next couple of days.
>>
>>Not that there is anything really nasty with your system David,
>>so I don't think it will be a big disaster if I can't get this to
>>work.
>>
>>Goes on top of Hugh's 6 patches.
>
>
> Runs on ia64. Looks much cleaner too.
>
Oh good. Thanks for testing, Tony.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-24 0:26 ` Nick Piggin
@ 2005-03-24 5:44 ` David S. Miller
2005-03-30 19:30 ` Hugh Dickins
1 sibling, 0 replies; 24+ messages in thread
From: David S. Miller @ 2005-03-24 5:44 UTC (permalink / raw)
To: Nick Piggin; +Cc: hugh, akpm, tony.luck, benh, ak, linux-kernel
On Thu, 24 Mar 2005 11:26:16 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> OK, attached is my first cut at slimming down the boundary tests.
Works perfectly fine on sparc64.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-23 17:10 Hugh Dickins
2005-03-23 19:57 ` David S. Miller
@ 2005-03-25 21:22 ` Russell King
2005-03-26 2:06 ` Nick Piggin
1 sibling, 1 reply; 24+ messages in thread
From: Russell King @ 2005-03-25 21:22 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Nick Piggin, akpm, davem, tony.luck, benh, ak, linux-kernel
On Wed, Mar 23, 2005 at 05:10:15PM +0000, Hugh Dickins wrote:
> Here's the recut of those patches, including David Miller's vital fixes.
> I'm addressing these to Nick rather than Andrew, because they're perhaps
> not fit for -mm until more testing done and the x86_64 32-bit vdso issue
> handled. I'm unlikely to be responsive until next week, sorry: over to
> you, Nick - thanks.
I thought I'd try these out on ARM, but alas they don't apply to
2.6.12-rc1. ;( This means I won't be testing them, sorry.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-25 21:22 ` Russell King
@ 2005-03-26 2:06 ` Nick Piggin
2005-03-26 11:35 ` Russell King
0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2005-03-26 2:06 UTC (permalink / raw)
To: Russell King; +Cc: Hugh Dickins, akpm, davem, tony.luck, benh, ak, linux-kernel
Russell King wrote:
> On Wed, Mar 23, 2005 at 05:10:15PM +0000, Hugh Dickins wrote:
>
>>Here's the recut of those patches, including David Miller's vital fixes.
>>I'm addressing these to Nick rather than Andrew, because they're perhaps
>>not fit for -mm until more testing done and the x86_64 32-bit vdso issue
>>handled. I'm unlikely to be responsive until next week, sorry: over to
>>you, Nick - thanks.
>
>
> I thought I'd try these out on ARM, but alas they don't apply to
> 2.6.12-rc1. ;( This means I won't be testing them, sorry.
>
The reject should be confined to include/asm-ia64, so it will still
work for you.
But I've put a clean rollup of all Hugh's patches here in case you'd
like to try it.
http://www.kerneltrap.org/~npiggin/freepgt-2.6.12-rc1.patch
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-26 2:06 ` Nick Piggin
@ 2005-03-26 11:35 ` Russell King
2005-03-26 13:37 ` Russell King
2005-03-26 13:42 ` Nick Piggin
0 siblings, 2 replies; 24+ messages in thread
From: Russell King @ 2005-03-26 11:35 UTC (permalink / raw)
To: Nick Piggin; +Cc: Hugh Dickins, akpm, davem, tony.luck, benh, ak, linux-kernel
On Sat, Mar 26, 2005 at 01:06:47PM +1100, Nick Piggin wrote:
> The reject should be confined to include/asm-ia64, so it will still
> work for you.
I guess I should've tried a little harder last night then. Sorry.
> But I've put a clean rollup of all Hugh's patches here in case you'd
> like to try it.
>
> http://www.kerneltrap.org/~npiggin/freepgt-2.6.12-rc1.patch
This works fine on ARM with high vectors. With low vectors (located in
the 1st page of virtual memory space) I get:
kernel BUG at mm/mmap.c:1934!
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = c1e88000
[00000000] *pgd=c1e86031, *pte=c04440cf, *ppte=c044400e
Internal error: Oops: c1e8981f [#1]
Modules linked in:
CPU: 0
PC is at __bug+0x40/0x54
LR is at 0x1
pc : [<c0223870>] lr : [<00000001>] Not tainted
sp : c1e7bd18 ip : 60000093 fp : c1e7bd28
r10: c1f4b040 r9 : 00000006 r8 : c1f02ca0
r7 : 00000000 r6 : 00000000 r5 : c015b8c0 r4 : 00000000
r3 : 00000000 r2 : 00000000 r1 : 00000d4e r0 : 00000001
Flags: nZCv IRQs on FIQs on Mode SVC_32 Segment user
Control: C1E8917F Table: C1E8917F DAC: 00000015
Process init (pid: 235, stack limit = 0xc1e7a194)
Stack: (0xc1e7bd18 to 0xc1e7c000)
...
Backtrace:
[<c0223830>] (__bug+0x0/0x54) from [<c02691d8>] (exit_mmap+0x154/0x168)
r4 = C1E7BD3C
[<c0269084>] (exit_mmap+0x0/0x168) from [<c02358c8>] (mmput+0x40/0xc0)
r7 = C015B8C0 r6 = C015B8C0 r5 = 00000000 r4 = C015B8C0
[<c0235888>] (mmput+0x0/0xc0) from [<c027ecec>] (exec_mmap+0xec/0x134)
r4 = C015B6A0
[<c027ec00>] (exec_mmap+0x0/0x134) from [<c027f234>] (flush_old_exec+0x4c8/0x6e4)
r7 = C012B940 r6 = C1E7A000 r5 = C0498A00 r4 = 00000000
[<c027ed6c>] (flush_old_exec+0x0/0x6e4) from [<c029d53c>] (load_elf_binary+0x5c0/0xdc0)
[<c029cf7c>] (load_elf_binary+0x0/0xdc0) from [<c027f6e0>] (search_binary_handler+0xa0/0x244)
[<c027f640>] (search_binary_handler+0x0/0x244) from [<c029c4e8>] (load_script+0x224/0x22c)
[<c029c2c4>] (load_script+0x0/0x22c) from [<c027f6e0>] (search_binary_handler+0xa0/0x244)
r6 = C1E7A000 r5 = C015E400 r4 = C03EC2B4
[<c027f640>] (search_binary_handler+0x0/0x244) from [<c027f9b8>] (do_execve+0x134/0x1f8)
[<c027f884>] (do_execve+0x0/0x1f8) from [<c02223f8>] (sys_execve+0x3c/0x5c)
[<c02223bc>] (sys_execve+0x0/0x5c) from [<c021dca0>] (ret_fast_syscall+0x0/0x2c)
r7 = 0000000B r6 = BED6AA74 r5 = BED6AB00 r4 = 0200F008
Code: 1b0051b8 e59f0014 eb0051b6 e3a03000 (e5833000)
In this case, we have a page which must be kept mapped at virtual address
0, which means the first entry in the L1 page table must always exist.
However, user threads start from 0x8000, which is also mapped via the
first entry in the L1 page table.
At a guess, I'd imagine that we're freeing the first L1 page table entry,
thereby causing mm->nr_ptes to become -1.
I'll do some debugging and try to work out if that (or exactly what's)
going on.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-26 11:35 ` Russell King
@ 2005-03-26 13:37 ` Russell King
2005-03-26 13:51 ` Nick Piggin
2005-03-26 13:42 ` Nick Piggin
1 sibling, 1 reply; 24+ messages in thread
From: Russell King @ 2005-03-26 13:37 UTC (permalink / raw)
To: Nick Piggin, Hugh Dickins, akpm, davem, tony.luck, benh, ak,
linux-kernel
On Sat, Mar 26, 2005 at 11:35:30AM +0000, Russell King wrote:
> In this case, we have a page which must be kept mapped at virtual address
> 0, which means the first entry in the L1 page table must always exist.
> However, user threads start from 0x8000, which is also mapped via the
> first entry in the L1 page table.
>
> At a guess, I'd imagine that we're freeing the first L1 page table entry,
> thereby causing mm->nr_ptes to become -1.
>
> I'll do some debugging and try to work out if that (or exactly what's)
> going on.
Ok. What's happening is that the ARM pgd_alloc uses pte_alloc_map() to
allocate the first L1 page table. This sets mm->nr_ptes to be one.
The ARM free_pgd knows about this, and will free this PTE at the
appropriate time. However, exit_mmap() doesn't know about this itself,
so in the ARM case, it should BUG_ON(mm->nr_ptes != 1) if we're using
low vectors.
I guess we could hack it such that the ARM pgd_alloc decrements mm->nr_ptes
itself to keep things balanced, since free_pte_range() won't be called
for this pte. However, I don't like that because its likely to break at
some point in the future.
Any ideas how to cleanly support this with the new infrastructure?
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-26 11:35 ` Russell King
2005-03-26 13:37 ` Russell King
@ 2005-03-26 13:42 ` Nick Piggin
2005-03-26 15:52 ` Russell King
1 sibling, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2005-03-26 13:42 UTC (permalink / raw)
To: Russell King; +Cc: Hugh Dickins, akpm, davem, tony.luck, benh, ak, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 3350 bytes --]
Russell King wrote:
> On Sat, Mar 26, 2005 at 01:06:47PM +1100, Nick Piggin wrote:
>
>>The reject should be confined to include/asm-ia64, so it will still
>>work for you.
>
>
> I guess I should've tried a little harder last night then. Sorry.
>
Well I only said that just in case you had time to try testing
again. I wouldn't expect you to go hunting through rejects and
various kernel trees to try to get a working patch ;)
>
>>But I've put a clean rollup of all Hugh's patches here in case you'd
>>like to try it.
>>
>>http://www.kerneltrap.org/~npiggin/freepgt-2.6.12-rc1.patch
>
>
> This works fine on ARM with high vectors. With low vectors (located in
> the 1st page of virtual memory space) I get:
>
> kernel BUG at mm/mmap.c:1934!
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> pgd = c1e88000
> [00000000] *pgd=c1e86031, *pte=c04440cf, *ppte=c044400e
> Internal error: Oops: c1e8981f [#1]
> Modules linked in:
> CPU: 0
> PC is at __bug+0x40/0x54
> LR is at 0x1
> pc : [<c0223870>] lr : [<00000001>] Not tainted
> sp : c1e7bd18 ip : 60000093 fp : c1e7bd28
> r10: c1f4b040 r9 : 00000006 r8 : c1f02ca0
> r7 : 00000000 r6 : 00000000 r5 : c015b8c0 r4 : 00000000
> r3 : 00000000 r2 : 00000000 r1 : 00000d4e r0 : 00000001
> Flags: nZCv IRQs on FIQs on Mode SVC_32 Segment user
> Control: C1E8917F Table: C1E8917F DAC: 00000015
> Process init (pid: 235, stack limit = 0xc1e7a194)
> Stack: (0xc1e7bd18 to 0xc1e7c000)
> ...
> Backtrace:
> [<c0223830>] (__bug+0x0/0x54) from [<c02691d8>] (exit_mmap+0x154/0x168)
> r4 = C1E7BD3C
> [<c0269084>] (exit_mmap+0x0/0x168) from [<c02358c8>] (mmput+0x40/0xc0)
> r7 = C015B8C0 r6 = C015B8C0 r5 = 00000000 r4 = C015B8C0
> [<c0235888>] (mmput+0x0/0xc0) from [<c027ecec>] (exec_mmap+0xec/0x134)
> r4 = C015B6A0
> [<c027ec00>] (exec_mmap+0x0/0x134) from [<c027f234>] (flush_old_exec+0x4c8/0x6e4)
> r7 = C012B940 r6 = C1E7A000 r5 = C0498A00 r4 = 00000000
> [<c027ed6c>] (flush_old_exec+0x0/0x6e4) from [<c029d53c>] (load_elf_binary+0x5c0/0xdc0)
> [<c029cf7c>] (load_elf_binary+0x0/0xdc0) from [<c027f6e0>] (search_binary_handler+0xa0/0x244)
> [<c027f640>] (search_binary_handler+0x0/0x244) from [<c029c4e8>] (load_script+0x224/0x22c)
> [<c029c2c4>] (load_script+0x0/0x22c) from [<c027f6e0>] (search_binary_handler+0xa0/0x244)
> r6 = C1E7A000 r5 = C015E400 r4 = C03EC2B4
> [<c027f640>] (search_binary_handler+0x0/0x244) from [<c027f9b8>] (do_execve+0x134/0x1f8)
> [<c027f884>] (do_execve+0x0/0x1f8) from [<c02223f8>] (sys_execve+0x3c/0x5c)
> [<c02223bc>] (sys_execve+0x0/0x5c) from [<c021dca0>] (ret_fast_syscall+0x0/0x2c)
> r7 = 0000000B r6 = BED6AA74 r5 = BED6AB00 r4 = 0200F008
> Code: 1b0051b8 e59f0014 eb0051b6 e3a03000 (e5833000)
>
> In this case, we have a page which must be kept mapped at virtual address
> 0, which means the first entry in the L1 page table must always exist.
> However, user threads start from 0x8000, which is also mapped via the
> first entry in the L1 page table.
>
> At a guess, I'd imagine that we're freeing the first L1 page table entry,
> thereby causing mm->nr_ptes to become -1.
>
> I'll do some debugging and try to work out if that (or exactly what's)
> going on.
>
OK, thanks that would be good. You could well be right in your analysis.
May I suggest a possible avenue of investigation:
[-- Attachment #2: freepgt-ARM-fix.patch --]
[-- Type: text/plain, Size: 1224 bytes --]
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2005-03-26 23:47:51.000000000 +1100
+++ linux-2.6/mm/mmap.c 2005-03-27 00:41:00.000000000 +1100
@@ -1612,6 +1612,8 @@ static void unmap_vma_list(struct mm_str
validate_mm(mm);
}
+#define FIRST_USER_ADDRESS (FIRST_USER_PGD_NR * PGDIR_SIZE)
+
/*
* Get rid of page table information in the indicated region.
*
@@ -1630,7 +1632,7 @@ static void unmap_region(struct mm_struc
tlb = tlb_gather_mmu(mm, 0);
unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- free_pgtables(&tlb, vma, prev? prev->vm_end: 0,
+ free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
spin_unlock(&mm->page_table_lock);
@@ -1910,7 +1912,7 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, mm, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- free_pgtables(&tlb, vma, 0, 0);
+ free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
mm->mmap = mm->mmap_cache = NULL;
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-26 13:37 ` Russell King
@ 2005-03-26 13:51 ` Nick Piggin
2005-03-26 15:03 ` Russell King
0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2005-03-26 13:51 UTC (permalink / raw)
To: Russell King; +Cc: Hugh Dickins, akpm, davem, tony.luck, benh, ak, linux-kernel
Russell King wrote:
> Ok. What's happening is that the ARM pgd_alloc uses pte_alloc_map() to
> allocate the first L1 page table. This sets mm->nr_ptes to be one.
>
> The ARM free_pgd knows about this, and will free this PTE at the
> appropriate time. However, exit_mmap() doesn't know about this itself,
> so in the ARM case, it should BUG_ON(mm->nr_ptes != 1) if we're using
> low vectors.
>
> I guess we could hack it such that the ARM pgd_alloc decrements mm->nr_ptes
> itself to keep things balanced, since free_pte_range() won't be called
> for this pte. However, I don't like that because its likely to break at
> some point in the future.
>
> Any ideas how to cleanly support this with the new infrastructure?
>
Hmm, in that case it could be just a problem with that BUG_ON() -
it wasn't there before... but it seems like a very useful test,
especially with all this new work going on, so it would be a shame
not to run it in releases.
But I don't quite understand (should really look at the code more),
how come you aren't leaking memory? Do _all_ processes share this
same first L1 page table? If so, can it be allocated with a more
specialised function? If not, can nr_ptes be decremented in the
place where it is freed?
/me goes to bed now - I'll have a bit of a look tomorrow.
Nick
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-26 13:51 ` Nick Piggin
@ 2005-03-26 15:03 ` Russell King
2005-03-30 17:00 ` Hugh Dickins
0 siblings, 1 reply; 24+ messages in thread
From: Russell King @ 2005-03-26 15:03 UTC (permalink / raw)
To: Nick Piggin; +Cc: Hugh Dickins, akpm, davem, tony.luck, benh, ak, linux-kernel
On Sun, Mar 27, 2005 at 12:51:14AM +1100, Nick Piggin wrote:
> Hmm, in that case it could be just a problem with that BUG_ON() -
> it wasn't there before... but it seems like a very useful test,
> especially with all this new work going on, so it would be a shame
> not to run it in releases.
Indeed.
> But I don't quite understand (should really look at the code more),
> how come you aren't leaking memory?
The ARM free_pgd_slow() knows about this special first L1 page table, and
knows to free it if it exists when its called.
> Do _all_ processes share this same first L1 page table?
No. It has to be specific to each process. Each L1 page table entry
covers 2MB, but executables start at virtual 0x8000.
I guess we could open-code pte_alloc_map() in get_pgd_slow() which
could solve this problem by omitting the mm->nr_ptes accounting; it
may also reduce the complexity as well.
I'm just slightly concerned that this may be rather fragile in terms
of future development.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-26 13:42 ` Nick Piggin
@ 2005-03-26 15:52 ` Russell King
2005-03-27 3:41 ` Nick Piggin
2005-03-30 16:30 ` Hugh Dickins
0 siblings, 2 replies; 24+ messages in thread
From: Russell King @ 2005-03-26 15:52 UTC (permalink / raw)
To: Nick Piggin; +Cc: Hugh Dickins, akpm, davem, tony.luck, benh, ak, linux-kernel
On Sun, Mar 27, 2005 at 12:42:56AM +1100, Nick Piggin wrote:
> OK, thanks that would be good. You could well be right in your analysis.
> May I suggest a possible avenue of investigation:
Yes, this patch seems to also be required, otherwise I see:
free_pgtables : floor 40015000 ceiling 4001b000
free_pgd_range: floor 40015000 ceiling 4001b000 addr 40015000 end 40016000
free_pgtables : floor 40015000 ceiling 4001b000
free_pgd_range: floor 40015000 ceiling 4001b000 addr 40015000 end 40016000
free_pgtables : floor 40015000 ceiling 4001b000
free_pgd_range: floor 40015000 ceiling 4001b000 addr 40015000 end 40016000
free_pgtables : floor 00000000 ceiling 00000000
free_pgd_range: floor 00000000 ceiling 40000000 addr 00008000 end 0001d000
free_pud_range: floor 00000000 ceiling 40000000 addr 00000000 end 00200000
free_pmd_range: floor 00000000 ceiling 40000000 addr 00000000 end 00200000
free_pgd_range: floor 00000000 ceiling bef4e000 addr 40000000 end 4012d000
free_pud_range: floor 00000000 ceiling bef4e000 addr 40000000 end 40200000
free_pmd_range: floor 00000000 ceiling bef4e000 addr 40000000 end 40200000
free_pgd_range: floor 00000000 ceiling 00000000 addr bef4e000 end bef63000
free_pud_range: floor 00000000 ceiling 00000000 addr bee00000 end bf000000
free_pmd_range: floor 00000000 ceiling 00000000 addr bee00000 end bf000000
exit_mmap: nr_ptes -1
The above is with my fix to ARMs get_pgd_slow, which shows that we
accidentally freed the first entry in the L1 page table. With my
fix and your patch, low-vectored ARMs work again.
I don't think it'll be invasive to push my get_pgd_slow() fix before
these freepgt patches appear. For the record, this is the patch I'm
using at present. With a bit more effort, I could probably eliminate
pmd_alloc (and therefore the unnecessary spinlocking) here.
diff -up -x BitKeeper -x ChangeSet -x SCCS -x _xlk -x *.orig -x *.rej orig/arch/arm/mm/mm-armv.c linux/arch/arm/mm/mm-armv.c
--- orig/arch/arm/mm/mm-armv.c Sat Mar 19 11:20:01 2005
+++ linux/arch/arm/mm/mm-armv.c Sat Mar 26 15:51:57 2005
@@ -160,6 +160,8 @@ pgd_t *get_pgd_slow(struct mm_struct *mm
init_pgd = pgd_offset_k(0);
if (!vectors_high()) {
+ struct page *new;
+
/*
* This lock is here just to satisfy pmd_alloc and pte_lock
*/
@@ -170,12 +172,16 @@ pgd_t *get_pgd_slow(struct mm_struct *mm
* contains the machine vectors.
*/
new_pmd = pmd_alloc(mm, new_pgd, 0);
+ spin_unlock(&mm->page_table_lock);
if (!new_pmd)
goto no_pmd;
- new_pte = pte_alloc_map(mm, new_pmd, 0);
- if (!new_pte)
+ new = pte_alloc_one(mm, 0);
+ if (!new)
goto no_pte;
+ inc_page_state(nr_page_table_pages);
+ pmd_populate(mm, new_pmd, new);
+ new_pte = pte_offset_map(new_pmd, 0);
init_pmd = pmd_offset(init_pgd, 0);
init_pte = pte_offset_map_nested(init_pmd, 0);
@@ -197,16 +203,9 @@ pgd_t *get_pgd_slow(struct mm_struct *mm
return new_pgd;
no_pte:
- spin_unlock(&mm->page_table_lock);
pmd_free(new_pmd);
- free_pages((unsigned long)new_pgd, 2);
- return NULL;
-
no_pmd:
- spin_unlock(&mm->page_table_lock);
free_pages((unsigned long)new_pgd, 2);
- return NULL;
-
no_pgd:
return NULL;
}
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-26 15:52 ` Russell King
@ 2005-03-27 3:41 ` Nick Piggin
2005-03-27 7:57 ` Russell King
2005-03-30 16:30 ` Hugh Dickins
1 sibling, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2005-03-27 3:41 UTC (permalink / raw)
To: Russell King; +Cc: Hugh Dickins, akpm, davem, tony.luck, benh, ak, linux-kernel
Russell King wrote:
> On Sun, Mar 27, 2005 at 12:42:56AM +1100, Nick Piggin wrote:
>
>>OK, thanks that would be good. You could well be right in your analysis.
>>May I suggest a possible avenue of investigation:
>
>
> Yes, this patch seems to also be required, otherwise I see:
>
[...]
OK.
>
> The above is with my fix to ARMs get_pgd_slow, which shows that we
> accidentally freed the first entry in the L1 page table. With my
> fix and your patch, low-vectored ARMs work again.
>
> I don't think it'll be invasive to push my get_pgd_slow() fix before
> these freepgt patches appear. For the record, this is the patch I'm
> using at present. With a bit more effort, I could probably eliminate
> pmd_alloc (and therefore the unnecessary spinlocking) here.
>
Seems OK if you're happy with it. Is this going to leak
"nr_page_table_pages" too, though?
Hmm... no, because free_pgd_slow decrements it? In that case, can
you have free_pgd_slow also decrement nr_ptes, instead of your
below patch?
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-27 3:41 ` Nick Piggin
@ 2005-03-27 7:57 ` Russell King
2005-03-27 18:17 ` David S. Miller
0 siblings, 1 reply; 24+ messages in thread
From: Russell King @ 2005-03-27 7:57 UTC (permalink / raw)
To: Nick Piggin; +Cc: Hugh Dickins, akpm, davem, tony.luck, benh, ak, linux-kernel
On Sun, Mar 27, 2005 at 01:41:46PM +1000, Nick Piggin wrote:
> Hmm... no, because free_pgd_slow decrements it? In that case, can
> you have free_pgd_slow also decrement nr_ptes, instead of your
> below patch?
Unfortunately not - free_pgd_slow doesn't have any knowledge about the
mm_struct that the pgd was associated with.
Also remember that nr_ptes wasn't incremented in get_pgd_slow() for this
page table either, and this is the "bug" fix.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-27 7:57 ` Russell King
@ 2005-03-27 18:17 ` David S. Miller
2005-03-28 7:51 ` Russell King
0 siblings, 1 reply; 24+ messages in thread
From: David S. Miller @ 2005-03-27 18:17 UTC (permalink / raw)
To: Russell King; +Cc: nickpiggin, hugh, akpm, tony.luck, benh, ak, linux-kernel
On Sun, 27 Mar 2005 08:57:25 +0100
Russell King <rmk+lkml@arm.linux.org.uk> wrote:
> Unfortunately not - free_pgd_slow doesn't have any knowledge about the
> mm_struct that the pgd was associated with.
You could store the mm pointer in the page struct of the
pgd, we used to that before set_pte_at() existed on
sparc64 and ppc64 for pte level tables.
page->mapping and page->index are basically free game for
tracking information assosciated with page table chunks.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-27 18:17 ` David S. Miller
@ 2005-03-28 7:51 ` Russell King
2005-03-28 18:47 ` David S. Miller
0 siblings, 1 reply; 24+ messages in thread
From: Russell King @ 2005-03-28 7:51 UTC (permalink / raw)
To: David S. Miller; +Cc: nickpiggin, hugh, akpm, tony.luck, benh, ak, linux-kernel
On Sun, Mar 27, 2005 at 10:17:39AM -0800, David S. Miller wrote:
> On Sun, 27 Mar 2005 08:57:25 +0100
> Russell King <rmk+lkml@arm.linux.org.uk> wrote:
>
> > Unfortunately not - free_pgd_slow doesn't have any knowledge about the
> > mm_struct that the pgd was associated with.
>
> You could store the mm pointer in the page struct of the
> pgd, we used to that before set_pte_at() existed on
> sparc64 and ppc64 for pte level tables.
>
> page->mapping and page->index are basically free game for
> tracking information assosciated with page table chunks.
Why would I want to do this, given that decrementing mm->nr_ptes in
free_pgd_slow() would make it negative ? Am I missing something
obvious?
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-28 7:51 ` Russell King
@ 2005-03-28 18:47 ` David S. Miller
0 siblings, 0 replies; 24+ messages in thread
From: David S. Miller @ 2005-03-28 18:47 UTC (permalink / raw)
To: Russell King; +Cc: nickpiggin, hugh, akpm, tony.luck, benh, ak, linux-kernel
On Mon, 28 Mar 2005 08:51:36 +0100
Russell King <rmk+lkml@arm.linux.org.uk> wrote:
> Why would I want to do this, given that decrementing mm->nr_ptes in
> free_pgd_slow() would make it negative ? Am I missing something
> obvious?
You were saying there was no way to figure out which mm is
assosociate a particular pgd, and I'm merely showing you
how you can in fact do it. :-)
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-26 15:52 ` Russell King
2005-03-27 3:41 ` Nick Piggin
@ 2005-03-30 16:30 ` Hugh Dickins
1 sibling, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2005-03-30 16:30 UTC (permalink / raw)
To: Russell King; +Cc: Nick Piggin, akpm, davem, tony.luck, benh, ak, linux-kernel
On Sat, 26 Mar 2005, Russell King wrote:
>
> I don't think it'll be invasive to push my get_pgd_slow() fix before
> these freepgt patches appear. For the record, this is the patch I'm
> using at present. With a bit more effort, I could probably eliminate
> pmd_alloc (and therefore the unnecessary spinlocking) here.
Your get_pgd_slow patch looks good to me. Yes, it slightly increases
the assumptions here about what is done in common, to the extent of a
pmd_populate, but even the nr_page_table_pages adjustment just nicely
balances what you were already having to do in free_pgd_slow.
Sorry for dumping you suddenly into this with my BUG_ON(mm->nr_ptes),
but I think we all agree it's a worthwhile check now. And sorry for
being so slow to respond, but I needed to think through what's right
for your case.
I'll write separately about Nick's FIRST_USER_ADDRESS patch,
I'm still puzzled, and not quite happy with that one.
Hugh
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-26 15:03 ` Russell King
@ 2005-03-30 17:00 ` Hugh Dickins
2005-03-30 17:39 ` Russell King
0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2005-03-30 17:00 UTC (permalink / raw)
To: Russell King; +Cc: Nick Piggin, akpm, davem, tony.luck, benh, ak, linux-kernel
On Sat, 26 Mar 2005, Russell King wrote:
> On Sun, Mar 27, 2005 at 12:51:14AM +1100, Nick Piggin wrote:
>
> > But I don't quite understand (should really look at the code more),
> > how come you aren't leaking memory?
>
> The ARM free_pgd_slow() knows about this special first L1 page table, and
> knows to free it if it exists when its called.
>
> > Do _all_ processes share this same first L1 page table?
>
> No. It has to be specific to each process. Each L1 page table entry
> covers 2MB, but executables start at virtual 0x8000.
I'm not satisfied with Nick's patch because it defines FIRST_USER_ADDRESS
as FIRST_USER_PGD_NR * PGDIR_SIZE i.e. 2MB. And if that is the first
user address, then there is no need for his patch, because the new
free_pgtables will never touch that pgd because there's no vma in it.
That's why I thought arm needed no special code for this
(overlooking the nr_ptes bug issue, sorry).
But above you say FIRST_USER_ADDRESS should actually be 0x8000?
In that case, I think we should define FIRST_USER_ADDRESS to 0,
either in every asm-arch or in asm-generic, with arm overriding
it to 0x8000 - or better, to whatever #define you already have
for that 0x8000, but I didn't find it.
If vmas can occur in between 0x8000 and 2MB, then with Nick's patch
we'd be liable to pass free_pgtables a vma with vm_start lower than
floor, which is not a case I've thought through, nor wish to.
And, whether FIRST_USER_ADDRESS is 0x8000 or 2MB,
shouldn't your arch_get_unmapped_area be enforcing it?
Hugh
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-30 17:00 ` Hugh Dickins
@ 2005-03-30 17:39 ` Russell King
0 siblings, 0 replies; 24+ messages in thread
From: Russell King @ 2005-03-30 17:39 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Nick Piggin, akpm, davem, tony.luck, benh, ak, linux-kernel
On Wed, Mar 30, 2005 at 06:00:54PM +0100, Hugh Dickins wrote:
> And, whether FIRST_USER_ADDRESS is 0x8000 or 2MB,
> shouldn't your arch_get_unmapped_area be enforcing it?
Why should it? arch_get_unmapped_area allocates address space dynamically
when NULL is passed, and always starts from TASK_UNMAPPED_BASE. This
will be greater than 32K.
The protection against mapping things below 32K comes from the syscall
wrappers on mmap() and mremap().
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-24 0:26 ` Nick Piggin
2005-03-24 5:44 ` David S. Miller
@ 2005-03-30 19:30 ` Hugh Dickins
2005-03-30 23:40 ` Nick Piggin
1 sibling, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2005-03-30 19:30 UTC (permalink / raw)
To: Nick Piggin; +Cc: David S. Miller, akpm, tony.luck, benh, ak, linux-kernel
On Thu, 24 Mar 2005, Nick Piggin wrote:
>
> OK, attached is my first cut at slimming down the boundary tests.
> I have only had a chance to try it on i386, so I hate to drop it
> on you like this - but I *have* put a bit of thought into it....
> Treat it as an RFC, and I'll try to test it on a wider range of
> things in the next couple of days.
I've stared and stared at it. I think I mostly like it.
It's nicer to be rounding end up than ceiling down.
It's clearly superior to what David and I had, in branching
less (other than in your BUG_ONs), and I do believe your
"if (end - ceiling - 1 < P*_SIZE - 1)" is correct and efficient.
But I still find it harder to understand than ours; and don't
understand at all your comment "end can't have approached ceiling
from above...." - but I think you're bravely trying to explain the case
I sidestepped with a lordly unexplained "end can't go down to 0 there".
Let others decide.
One thing I believe is outright wrong, at least with out-of-tree
patches: your change from "if (addr > end - 1)" to "if (addr >= end)",
after you've just rounded up end (perhaps to 0).
(And let me astonish you by asking for the blank lines back before
pmd_offset and pud_offset!)
Hugh
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/6] freepgt: free_pgtables shakeup
2005-03-30 19:30 ` Hugh Dickins
@ 2005-03-30 23:40 ` Nick Piggin
0 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2005-03-30 23:40 UTC (permalink / raw)
To: Hugh Dickins; +Cc: David S. Miller, akpm, tony.luck, benh, ak, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1729 bytes --]
Hugh Dickins wrote:
> It's clearly superior to what David and I had, in branching
> less (other than in your BUG_ONs), and I do believe your
> "if (end - ceiling - 1 < P*_SIZE - 1)" is correct and efficient.
>
Well the BUG_ONs were more to just satisfy me that my assumptions
were correct. Not to mention only contained in the top level, so
they shouldn't hurt performance. But they could go.
> But I still find it harder to understand than ours; and don't
> understand at all your comment "end can't have approached ceiling
> from above...." - but I think you're bravely trying to explain the case
> I sidestepped with a lordly unexplained "end can't go down to 0 there".
>
Yes, say ceiling is 0 - something less than P*_SIZE, you might
get the feeling that end may be able to come within our limit
of it if it were a very small number.
This can't happen because 0 is actually the top of address space,
and end can't be *greater* than ceiling before any rounding. If it
is not 0, then it must be at least 1, in which case it will always
be rounded up to the next P*_SIZE boundary. So no problem.
This may have been obvious to you from the start, in which case my
extra rambling may have confused you... actually on re-reading it,
it would have confused you no matter what. See if the next version
is better.
> Let others decide.
>
> One thing I believe is outright wrong, at least with out-of-tree
> patches: your change from "if (addr > end - 1)" to "if (addr >= end)",
> after you've just rounded up end (perhaps to 0).
>
Oh yes, good catch. I don't know why I did that :(
> (And let me astonish you by asking for the blank lines back before
> pmd_offset and pud_offset!)
>
Hugh? What have you done with Hugh?
[-- Attachment #2: freepgt-boundary-tests.patch --]
[-- Type: text/plain, Size: 4337 bytes --]
Simplify (from the machine's point of view) the infamous boundary tests.
The method, and an outline of the proof (which I haven't actually done)
is recorded in the comments.
It is not conceptually much more difficult than the current method when
it is understood, although it doesn't present the corner cases so explicitly
in code (hence the need for comments).
Eliminates 2 branches per freeable page table level.
Tested and works on i386, ia64, sparc64.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2005-03-29 17:09:16.000000000 +1000
+++ linux-2.6/mm/memory.c 2005-03-31 09:39:31.000000000 +1000
@@ -139,12 +139,8 @@ static inline void free_pmd_range(struct
start &= PUD_MASK;
if (start < floor)
return;
- if (ceiling) {
- ceiling &= PUD_MASK;
- if (!ceiling)
- return;
- }
- if (end - 1 > ceiling - 1)
+ end = (end + PUD_SIZE - 1) & PUD_MASK;
+ if (end - ceiling - 1 < PUD_SIZE - 1)
return;
pmd = pmd_offset(pud, start);
@@ -172,12 +168,8 @@ static inline void free_pud_range(struct
start &= PGDIR_MASK;
if (start < floor)
return;
- if (ceiling) {
- ceiling &= PGDIR_MASK;
- if (!ceiling)
- return;
- }
- if (end - 1 > ceiling - 1)
+ end = (end + PGDIR_SIZE - 1) & PGDIR_MASK;
+ if (end - ceiling - 1 < PGDIR_SIZE - 1)
return;
pud = pud_offset(pgd, start);
@@ -198,6 +190,10 @@ void free_pgd_range(struct mmu_gather **
unsigned long next;
unsigned long start;
+ BUG_ON(addr >= end);
+ /* Don't want end to be 0 and ceiling to be greater than 0-PGDIR_SIZE */
+ BUG_ON(end - 1 > ceiling - 1);
+
/*
* The next few lines have given us lots of grief...
*
@@ -205,23 +201,25 @@ void free_pgd_range(struct mmu_gather **
* there will be no work to do at all, and we'd prefer not to
* go all the way down to the bottom just to discover that.
*
- * Why all these "- 1"s? Because 0 represents both the bottom
- * of the address space and the top of it (using -1 for the
- * top wouldn't help much: the masks would do the wrong thing).
- * The rule is that addr 0 and floor 0 refer to the bottom of
- * the address space, but end 0 and ceiling 0 refer to the top
- * Comparisons need to use "end - 1" and "ceiling - 1" (though
- * that end 0 case should be mythical).
- *
- * Wherever addr is brought up or ceiling brought down, we must
- * be careful to reject "the opposite 0" before it confuses the
- * subsequent tests. But what about where end is brought down
- * by PMD_SIZE below? no, end can't go down to 0 there.
+ * The tricky part of this logic (and similar in free_p?d_range above)
+ * is the 'end' handling. end and ceiling are *exclusive* boundaries,
+ * so their maximum is 0. This suggests the use of two's complement
+ * difference when comparing them, so the wrapping is handled for us.
*
- * Whereas we round start (addr) and ceiling down, by different
- * masks at different levels, in order to test whether a table
- * now has no other vmas using it, so can be freed, we don't
- * bother to round floor or end up - the tests don't need that.
+ * The method is:
+ * - Round end up to the nearest PMD aligned boundary.
+ * - If end has exceeded ceiling, then end - ceiling will be less than
+ * PMD_SIZE.
+ * - If end is very small (close to 0) and ceiling is very large
+ * (close to wrapping to 0, or 0), then the end - ceiling condition
+ * needs to be false. This holds because end must be at least 1, and
+ * so rounding it up will always take it to the first PMD boundary,
+ * and hence out of reach of ceiling.
+ * - If end is 0 (top of address space), then ceiling must also be 0.
+ * - In the above case that end is 0, or any other time end might be
+ * equal to ceiling, end - ceiling = 0 < PMD_SIZE. So the actual test
+ * we use is (unsigned) end - ceiling - 1 < PMD_SIZE - 1,
+ * to catch this case.
*/
addr &= PMD_MASK;
@@ -230,12 +228,8 @@ void free_pgd_range(struct mmu_gather **
if (!addr)
return;
}
- if (ceiling) {
- ceiling &= PMD_MASK;
- if (!ceiling)
- return;
- }
- if (end - 1 > ceiling - 1)
+ end = (end + PMD_SIZE - 1) & PMD_MASK;
+ if (end - ceiling - 1 < PMD_SIZE - 1)
end -= PMD_SIZE;
if (addr > end - 1)
return;
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2005-03-30 23:43 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-03-24 1:00 [PATCH 0/6] freepgt: free_pgtables shakeup Luck, Tony
2005-03-24 1:07 ` Nick Piggin
-- strict thread matches above, loose matches on Subject: below --
2005-03-23 17:10 Hugh Dickins
2005-03-23 19:57 ` David S. Miller
2005-03-24 0:26 ` Nick Piggin
2005-03-24 5:44 ` David S. Miller
2005-03-30 19:30 ` Hugh Dickins
2005-03-30 23:40 ` Nick Piggin
2005-03-25 21:22 ` Russell King
2005-03-26 2:06 ` Nick Piggin
2005-03-26 11:35 ` Russell King
2005-03-26 13:37 ` Russell King
2005-03-26 13:51 ` Nick Piggin
2005-03-26 15:03 ` Russell King
2005-03-30 17:00 ` Hugh Dickins
2005-03-30 17:39 ` Russell King
2005-03-26 13:42 ` Nick Piggin
2005-03-26 15:52 ` Russell King
2005-03-27 3:41 ` Nick Piggin
2005-03-27 7:57 ` Russell King
2005-03-27 18:17 ` David S. Miller
2005-03-28 7:51 ` Russell King
2005-03-28 18:47 ` David S. Miller
2005-03-30 16:30 ` Hugh Dickins
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox