Linux-mm Archive on lore.kernel.org
* Re: [RFC] Hugetlb demotion for x86
From: Dave Hansen @ 2006-05-15 14:20 UTC (permalink / raw)
  To: Adam Litke; +Cc: Christoph Lameter, linux-mm, linux-kernel
In-Reply-To: <1147363859.24029.134.camel@localhost.localdomain>

On Thu, 2006-05-11 at 11:10 -0500, Adam Litke wrote:
> Yes, the SIGBUS issues are "fixed".  Now the application is killed
> directly via VM_FAULT_OOM so it is not possible to handle the fault from
> userspace.  For my libhugetlbfs-based fallback approach, I needed to
> patch the kernel so that SIGBUS was delivered to the process like in the
> days of old.

Maybe this could be off-by-default behavior that can be enabled with a
special mmap flag or madvise, or something similar.  It seems that apps
don't want to get SIGBUS for low memory.  But, if they have _asked_ for
it, perhaps they'd be a bit more willing.
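
For illustration only, the userspace half of such an opt-in might look like the sketch below: a SIGBUS handler that unwinds out of the faulting access so the caller can fall back to normal pages. Everything here is an assumption. try_touch(), touch_ok() and touch_sigbus() are made-up names, raise(SIGBUS) stands in for a real hugetlb fault, and no actual mmap flag or madvise option is implied.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf fault_env;

/* SIGBUS handler: unwind to the sigsetjmp() point in try_touch(). */
static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);
}

/* Hypothetical access that succeeds. */
static void touch_ok(void *addr)
{
	(void)addr;
}

/* Hypothetical access that fails; raise() stands in for a real fault. */
static void touch_sigbus(void *addr)
{
	(void)addr;
	raise(SIGBUS);
}

/* Run touch(addr). Returns 0 on success, -1 if SIGBUS was delivered. */
static int try_touch(void (*touch)(void *), void *addr)
{
	struct sigaction sa, old;
	int rc;

	assert(touch != NULL);
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) != 0)
		return -1;

	/* savemask=1 so SIGBUS is unblocked again after siglongjmp(). */
	if (sigsetjmp(fault_env, 1) == 0) {
		touch(addr);
		rc = 0;
	} else {
		rc = -1;	/* a caller would fall back to normal pages here */
	}

	sigaction(SIGBUS, &old, NULL);
	return rc;
}
```

A caller would wrap the first touch of each huge page in try_touch() and remap with normal pages when it returns -1.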

(BTW, I fixed the bogus linux-mm cc, finally ;)

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
From: Mel Gorman @ 2006-05-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Whitcroft, davej, tony.luck, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev
In-Reply-To: <20060514203158.216a966e.akpm@osdl.org>

On (14/05/06 20:31), Andrew Morton didst pronounce:
> Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > Size zones and holes in an architecture independent manner for ia64.
> > 
> 
> This one makes my ia64 die very early in boot.   The trace is pretty useless.
> 
> config at http://www.zip.com.au/~akpm/linux/patches/stuff/config-ia64
> 
> <log snipped>

Curses. When I tried to reproduce this, the machine booted with my default
config but died before initialising the console with your config. The machine
is far away, so I can neither see the screen nor restart it remotely; I can
only assume it is dying for the same reason yours did.

> Note the misaligned pfns.
> 
> Andy's (misspelled) CONFIG_UNALIGNED_ZONE_BOUNDRIES patch didn't actually
> include an update to any Kconfig files.  But hacking that in by hand didn't
> help.

It would not have helped in this case because the zone boundaries would still
be in the wrong place for ia64. Below is a patch that aligns the zones on
all architectures that use CONFIG_ARCH_POPULATES_NODE_MAP. That is currently
i386, x86_64, powerpc, ppc and ia64. It does *not* align pgdat->node_start_pfn,
but I don't believe that is necessary.
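
As a rough sketch of what aligning a pfn to a zone boundary means: the definition of zone_boundary_align_pfn() is not shown in this thread, so the helpers below are hypothetical stand-ins, and whether the real function rounds down or up is an assumption left open here. The only fixed point is that the largest buddy block spans 1 << (MAX_ORDER - 1) pages.

```c
#include <assert.h>

/* Pages in the largest buddy block for a given MAX_ORDER. */
static unsigned long max_order_nr_pages(unsigned int max_order)
{
	assert(max_order >= 1 && max_order < 8 * sizeof(unsigned long));
	return 1UL << (max_order - 1);
}

/* Hypothetical helper: round pfn down to a multiple of nr_pages. */
static unsigned long align_pfn_down(unsigned long pfn, unsigned long nr_pages)
{
	/* nr_pages must be a power of two for the mask trick to work */
	assert(nr_pages != 0 && (nr_pages & (nr_pages - 1)) == 0);
	return pfn & ~(nr_pages - 1);
}

/* Hypothetical helper: round pfn up to a multiple of nr_pages. */
static unsigned long align_pfn_up(unsigned long pfn, unsigned long nr_pages)
{
	assert(nr_pages != 0 && (nr_pages & (nr_pages - 1)) == 0);
	return (pfn + nr_pages - 1) & ~(nr_pages - 1);
}
```

With MAX_ORDER at 17, the value reported for ia64 elsewhere in this thread, the block size is 65536 pages, so a boundary at pfn 0x12345 would move to 0x10000 or 0x20000 depending on the rounding direction.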

I can't test it on ia64 until I get someone to restart the machine. The patch
compiles and is currently boot-testing on a range of other machines. I hope
to know within 5-6 hours if everything is ok.

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc4-mm4-clean/mm/page_alloc.c linux-2.6.17-rc4-mm4-ia64_force_alignment/mm/page_alloc.c
--- linux-2.6.17-rc4-mm4-clean/mm/page_alloc.c	2006-05-15 10:37:55.000000000 +0100
+++ linux-2.6.17-rc4-mm4-ia64_force_alignment/mm/page_alloc.c	2006-05-15 13:10:42.000000000 +0100
@@ -2640,14 +2640,20 @@ void __init free_area_init_nodes(unsigne
 {
 	unsigned long nid;
 	int zone_index;
+	unsigned long lowest_pfn = find_min_pfn_with_active_regions();
+
+	lowest_pfn = zone_boundary_align_pfn(lowest_pfn);
+	arch_max_dma_pfn = zone_boundary_align_pfn(arch_max_dma_pfn);
+	arch_max_dma32_pfn = zone_boundary_align_pfn(arch_max_dma32_pfn);
+	arch_max_low_pfn = zone_boundary_align_pfn(arch_max_low_pfn);
+	arch_max_high_pfn = zone_boundary_align_pfn(arch_max_high_pfn);
 
 	/* Record where the zone boundaries are */
 	memset(arch_zone_lowest_possible_pfn, 0,
 				sizeof(arch_zone_lowest_possible_pfn));
 	memset(arch_zone_highest_possible_pfn, 0,
 				sizeof(arch_zone_highest_possible_pfn));
-	arch_zone_lowest_possible_pfn[ZONE_DMA] =
-					find_min_pfn_with_active_regions();
+	arch_zone_lowest_possible_pfn[ZONE_DMA] = lowest_pfn;
 	arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
 	arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
 	arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;

^ permalink raw reply

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
From: Andy Whitcroft @ 2006-05-15 11:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nickpiggin, akpm, mel, davej, tony.luck, linux-kernel, bob.picco,
	ak, linux-mm, linuxppc-dev
In-Reply-To: <20060515192918.c3e2e895.kamezawa.hiroyu@jp.fujitsu.com>

KAMEZAWA Hiroyuki wrote:
> On Mon, 15 May 2006 11:19:27 +0100
> Andy Whitcroft <apw@shadowen.org> wrote:
> 
> 
>>Nick Piggin wrote:
>>
>>>Andy Whitcroft wrote:
>>>
>>>
>>>>Interesting.  You are correct there was no config component, at the time
>>>>I didn't have direct evidence that any architecture needed it, only that
>>>>we had an unchecked requirement on zones, a requirement that had only
>>>>recently arrived with the changes to free buddy detection.  I note that
>>>
>>>
>>>Recently arrived? Over a year ago with the no-buddy-bitmap patches,
>>>right? Just checking because that's what I'm assuming broke it...
>>
>>Yep, sorry I forget I was out of the game for 6 months!  And yes that
>>was when the requirements were altered.
>>
> 
> When the no-bitmap-buddy patches were included:
>
> 1. bad_range() was not covered by CONFIG_DEBUG_VM, so it always ran.
> ==
> static int bad_range(struct zone *zone, struct page *page)
> {
>         if (page_to_pfn(page) >= zone->zone_start_pfn + zone->spanned_pages)
>                 return 1;
>         if (page_to_pfn(page) < zone->zone_start_pfn)
>                 return 1;
> ==
> And this code
> ==
>                 buddy = __page_find_buddy(page, page_idx, order);
>
>                 if (bad_range(zone, buddy))
>                         break;
> ==
>
> checked whether the buddy was in the zone, which guaranteed that it had a
> page struct.
>
>
> But later clean-up/speed-up changes removed these checks (I don't know when
> this happened). Sorry for missing these things.

Heh, sorry to make it sound like it was you who was responsible.

-apw

^ permalink raw reply

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
From: KAMEZAWA Hiroyuki @ 2006-05-15 10:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: apw, nickpiggin, akpm, mel, davej, tony.luck, linux-kernel,
	bob.picco, ak, linux-mm, linuxppc-dev
In-Reply-To: <20060515192918.c3e2e895.kamezawa.hiroyu@jp.fujitsu.com>

On Mon, 15 May 2006 19:29:18 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 15 May 2006 11:19:27 +0100
> Andy Whitcroft <apw@shadowen.org> wrote:
> 
> > Nick Piggin wrote:
> > > Andy Whitcroft wrote:
> > > 
> > >> Interesting.  You are correct there was no config component, at the time
> > >> I didn't have direct evidence that any architecture needed it, only that
> > >> we had an unchecked requirement on zones, a requirement that had only
> > >> recently arrived with the changes to free buddy detection.  I note that
> > > 
> > > 
> > > Recently arrived? Over a year ago with the no-buddy-bitmap patches,
> > > right? Just checking because that's what I'm assuming broke it...
> > 
> > Yep, sorry I forget I was out of the game for 6 months!  And yes that
> > was when the requirements were altered.
> > 
> When the no-bitmap-buddy patches were included:
>
> 1. bad_range() was not covered by CONFIG_DEBUG_VM, so it always ran.
> ==
> static int bad_range(struct zone *zone, struct page *page)
> {
>         if (page_to_pfn(page) >= zone->zone_start_pfn + zone->spanned_pages)
>                 return 1;
>         if (page_to_pfn(page) < zone->zone_start_pfn)
>                 return 1;
> ==
> And this code
> ==
>                 buddy = __page_find_buddy(page, page_idx, order);
>
>                 if (bad_range(zone, buddy))
>                         break;
> ==
>
> checked whether the buddy was in the zone, which guaranteed that it had a
> page struct.
>
>
> But later clean-up/speed-up changes removed these checks (I don't know when
> this happened). Sorry for missing these things.
> 

One more point.
When the above no-bitmap patches were included, the only user of non-aligned
zones was ia64, I think. Because ia64 used a virtual mem_map, page_to_pfn(page)
with CONFIG_DISCONTIGMEM doesn't access the page struct itself.

#define page_to_pfn(page)	(page - vmemmap)

So it didn't panic; ia64/vmemmap was safe.

If other archs had used non-aligned zones with CONFIG_DISCONTIGMEM, the
non-aligned-zones problem would have surfaced earlier.
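
To make the point about the vmemmap-style macro concrete, here is a small userspace sketch, not kernel code. struct page_sketch and page_to_pfn_by_address() are invented for illustration. It shows that the pfn falls out of the pointer value alone, so nothing is read from the (possibly non-existent) page struct.

```c
#include <assert.h>
#include <stdint.h>

/* Invented stand-in for struct page; the fields are never read below. */
struct page_sketch {
	unsigned long flags;
	int count;
};

/*
 * vmemmap-style conversion: the pfn is the distance of the struct from the
 * base of the virtual array. Only addresses are used; no memory is touched.
 */
static unsigned long page_to_pfn_by_address(uintptr_t vmemmap_base,
					    uintptr_t page_addr)
{
	assert(page_addr >= vmemmap_base);
	return (unsigned long)((page_addr - vmemmap_base) /
			       sizeof(struct page_sketch));
}
```

A conversion that first had to read a field of the struct would instead fault whenever the buddy's page struct was not backed by memory, which is the failure described above.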

-Kame

^ permalink raw reply

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
From: KAMEZAWA Hiroyuki @ 2006-05-15 10:29 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: nickpiggin, akpm, mel, davej, tony.luck, linux-kernel, bob.picco,
	ak, linux-mm, linuxppc-dev
In-Reply-To: <446855AF.1090100@shadowen.org>

On Mon, 15 May 2006 11:19:27 +0100
Andy Whitcroft <apw@shadowen.org> wrote:

> Nick Piggin wrote:
> > Andy Whitcroft wrote:
> > 
> >> Interesting.  You are correct there was no config component, at the time
> >> I didn't have direct evidence that any architecture needed it, only that
> >> we had an unchecked requirement on zones, a requirement that had only
> >> recently arrived with the changes to free buddy detection.  I note that
> > 
> > 
> > Recently arrived? Over a year ago with the no-buddy-bitmap patches,
> > right? Just checking because that's what I'm assuming broke it...
> 
> Yep, sorry I forget I was out of the game for 6 months!  And yes that
> was when the requirements were altered.
> 
When the no-bitmap-buddy patches were included:

1. bad_range() was not covered by CONFIG_DEBUG_VM, so it always ran.
==
static int bad_range(struct zone *zone, struct page *page)
{
        if (page_to_pfn(page) >= zone->zone_start_pfn + zone->spanned_pages)
                return 1;
        if (page_to_pfn(page) < zone->zone_start_pfn)
                return 1;
==
And this code
==
                buddy = __page_find_buddy(page, page_idx, order);

                if (bad_range(zone, buddy))
                        break;
==

checked whether the buddy was in the zone, which guaranteed that it had a
page struct.


But later clean-up/speed-up changes removed these checks (I don't know when
this happened). Sorry for missing these things.
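
The effect of that range check on an unaligned zone can be sketched on raw pfns as below. This is a simplified userspace model, not the kernel code: struct zone_span, buddy_pfn() and pfn_out_of_zone() are illustrative names, and the buddy is computed by flipping bit 'order' of the pfn.

```c
#include <assert.h>

/* Illustrative subset of struct zone: just the span. */
struct zone_span {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

/* Buddy of the order-sized block starting at pfn: flip bit 'order'. */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));
	return pfn ^ (1UL << order);
}

/* Same two range tests as the bad_range() excerpt, applied to a pfn. */
static int pfn_out_of_zone(const struct zone_span *z, unsigned long pfn)
{
	if (pfn >= z->zone_start_pfn + z->spanned_pages)
		return 1;
	if (pfn < z->zone_start_pfn)
		return 1;
	return 0;
}
```

For a zone starting at pfn 4, the order-3 block at pfn 8 has its buddy at pfn 0, which lies below the zone start. With the check in place the merge stops there; without it, the allocator goes on to inspect a page struct that may not exist.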

-Kame

^ permalink raw reply

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
From: Andy Whitcroft @ 2006-05-15 10:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, davej, tony.luck, linux-kernel,
	bob.picco, ak, linux-mm, linuxppc-dev
In-Reply-To: <44685123.7040501@yahoo.com.au>

Nick Piggin wrote:
> Andy Whitcroft wrote:
> 
>> Interesting.  You are correct there was no config component, at the time
>> I didn't have direct evidence that any architecture needed it, only that
>> we had an unchecked requirement on zones, a requirement that had only
>> recently arrived with the changes to free buddy detection.  I note that
> 
> 
> Recently arrived? Over a year ago with the no-buddy-bitmap patches,
> right? Just checking because that's what I'm assuming broke it...

Yep, sorry I forget I was out of the game for 6 months!  And yes that
was when the requirements were altered.

-apw

^ permalink raw reply

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
From: Nick Piggin @ 2006-05-15 10:00 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Andrew Morton, Mel Gorman, davej, tony.luck, linux-kernel,
	bob.picco, ak, linux-mm, linuxppc-dev
In-Reply-To: <44683A09.2060404@shadowen.org>

Andy Whitcroft wrote:

> Interesting.  You are correct there was no config component, at the time
> I didn't have direct evidence that any architecture needed it, only that
> we had an unchecked requirement on zones, a requirement that had only
> recently arrived with the changes to free buddy detection.  I note that

Recently arrived? Over a year ago with the no-buddy-bitmap patches,
right? Just checking because that's what I'm assuming broke it...

> MAX_ORDER is 17 for ia64, so that probably accounts for the
> misalignment.  It is clear that the reporting is slightly over-zealous,
> as I am reporting zero-sized zones.  I'll get that fixed and send a patch
> to you.  I'll also have a look at the patch as added to -mm and try to get
> the rest of the spelling sorted :-/.
> 
> I'll go see if we currently have a machine to test this config on.


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
From: Andy Whitcroft @ 2006-05-15  8:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, davej, tony.luck, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev
In-Reply-To: <20060514203158.216a966e.akpm@osdl.org>

Andrew Morton wrote:
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
>>Size zones and holes in an architecture independent manner for ia64.
>>
> 
> 
> This one makes my ia64 die very early in boot.   The trace is pretty useless.
> 
> config at http://www.zip.com.au/~akpm/linux/patches/stuff/config-ia64
> 
> <log snipped>
> 
> Note the misaligned pfns.
> 
> Andy's (misspelled) CONFIG_UNALIGNED_ZONE_BOUNDRIES patch didn't actually
> include an update to any Kconfig files.  But hacking that in by hand didn't
> help.

Interesting.  You are correct there was no config component, at the time
I didn't have direct evidence that any architecture needed it, only that
we had an unchecked requirement on zones, a requirement that had only
recently arrived with the changes to free buddy detection.  I note that
MAX_ORDER is 17 for ia64, so that probably accounts for the
misalignment.  It is clear that the reporting is slightly over-zealous,
as I am reporting zero-sized zones.  I'll get that fixed and send a patch
to you.  I'll also have a look at the patch as added to -mm and try to get
the rest of the spelling sorted :-/.
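
A guess at what the less zealous report could look like, purely as a sketch: zone_start_misaligned() is a hypothetical helper, not the code in -mm, and it assumes the requirement is that a non-empty zone starts on a multiple of 1 << (MAX_ORDER - 1) pages.

```c
#include <assert.h>

/* Return 1 if a zone should be reported as misaligned, 0 otherwise. */
static int zone_start_misaligned(unsigned long zone_start_pfn,
				 unsigned long spanned_pages,
				 unsigned int max_order)
{
	unsigned long block;

	assert(max_order >= 1 && max_order < 8 * sizeof(unsigned long));
	block = 1UL << (max_order - 1);

	if (spanned_pages == 0)
		return 0;	/* empty zone: nothing worth reporting */
	return (zone_start_pfn & (block - 1)) != 0;
}
```

With max_order 17, a populated zone starting at pfn 0x10400 is flagged, while the same start pfn on a zero-sized zone is silently skipped.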

I'll go see if we currently have a machine to test this config on.

-apw

^ permalink raw reply

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
From: Andrew Morton @ 2006-05-15  3:31 UTC (permalink / raw)
  To: Mel Gorman, Andy Whitcroft
  Cc: davej, tony.luck, linux-kernel, bob.picco, ak, linux-mm,
	linuxppc-dev
In-Reply-To: <20060508141211.26912.48278.sendpatchset@skynet>

Mel Gorman <mel@csn.ul.ie> wrote:
>
> Size zones and holes in an architecture independent manner for ia64.
> 

This one makes my ia64 die very early in boot.   The trace is pretty useless.

config at http://www.zip.com.au/~akpm/linux/patches/stuff/config-ia64

EFI v1.10 by INTEL: SALsystab=0x3fe4c8c0 ACPI=0x3ff84000 ACPI 2.0=0x3ff83000 MP0
Early serial console at I/O port 0x2f8 (options '9600n8')
SAL 3.1: Intel Corp                       SR870BN4                         vers0
SAL Platform features: BusLock IRQ_Redirection
SAL: AP wakeup using external interrupt vector 0xf0
No logical to physical processor mapping available
iosapic_system_init: Disabling PC-AT compatible 8259 interrupts
ACPI: Local APIC address c0000000fee00000
PLATFORM int CPEI (0x3): GSI 22 (level, low) -> CPU 0 (0xc618) vector 30
register_intr: changing vector 39 from IO-SAPIC-edge to IO-SAPIC-level
4 CPUs available, 4 CPUs total
MCA related initialization done
node 0 zone DMA missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
node 0 zone DMA32 missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
node 0 zone Normal missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
node 0 zone HighMem missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
SMP: Allowing 4 CPUs, 0 hotplug CPUs
Built 1 zonelists
Kernel command line: BOOT_IMAGE=scsi0:\EFI\redhat\vmlinuz-2.6.17-rc4-mm1 root=/o
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 6, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 5, 524288 bytes)
Placing software IO TLB between 0x4a30000 - 0x8a30000
Unable to handle kernel NULL pointer dereference (address 0000000000000008)
swapper[0]: Oops 8813272891392 [1]
Modules linked in:

Pid: 0, CPU 0, comm:              swapper
psr : 00001010084a6010 ifs : 800000000000060f ip  : [<a0000001000e6750>]    Notd
ip is at __free_pages_ok+0x190/0x3c0
unat: 0000000000000000 pfs : 000000000000060f rsc : 0000000000000003
rnat: 0000000000ffffff bsps: 00000000000002f9 pr  : 80000000afb5956b
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0930ffff00090000 ssd : 0930ffff00090000
b0  : a0000001000e6660 b6  : e00000003fe52940 b7  : a000000100790120
f6  : 1003e6db6db6db6db6db7 f7  : 1003e000000000006dec0
f8  : 1003e000000000000fb80 f9  : 1003e000000000006e080
f10 : 1003e000000000000fb40 f11 : 1003e000000000006dec0
r1  : a000000100af2db0 r2  : 0000000000000001 r3  : 0000000000000000
r8  : a0000001008f3d38 r9  : 0000000000004000 r10 : 0000000000370400
r11 : 0000000000004000 r12 : a0000001007b7e10 r13 : a0000001007b0000
r14 : 0000000000000001 r15 : 0000000100000001 r16 : 0000000100000001
r17 : 0000000100000001 r18 : 0000000000001041 r19 : 0000000000000000
r20 : e00000000149df00 r21 : 0000000100000000 r22 : 0000000055555155
r23 : 00000000ffffffff r24 : e00000000149df08 r25 : 1555555555555155
r26 : 0000000000000032 r27 : 0000000000000000 r28 : 0000000000000008
r29 : 0000000000001041 r30 : 0000000000001041 r31 : 0000000000000001
Unable to handle kernel NULL pointer dereference (address 0000000000000000)
swapper[0]: Oops 8813272891392 [2]
Modules linked in:

Pid: 0, CPU 0, comm:              swapper
psr : 0000101008022018 ifs : 8000000000000287 ip  : [<a0000001001236c0>]    Notd
ip is at kmem_cache_alloc+0x40/0x100
unat: 0000000000000000 pfs : 0000000000000712 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 80000000afb59967
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0930ffff00090000 ssd : 0930ffff00090000
b0  : a00000010003e450 b6  : a000000100001b50 b7  : a00000010003f320
f6  : 1003e9e3779b97f4a7c16 f7  : 0ffdb8000000000000000
f8  : 1003e000000000000007f f9  : 1003e0000000000000379
f10 : 1003e6db6db6db6db6db7 f11 : 1003e000000000000007f
r1  : a000000100af2db0 r2  : 0000000000000000 r3  : 0000000000000000
r8  : 0000000000000000 r9  : 0000000000000000 r10 : a0000001007b0f24
r11 : 0000000000000000 r12 : a0000001007b7280 r13 : a0000001007b0000
r14 : 0000000000000000 r15 : 0000000000000000 r16 : a0000001007b7310
r17 : 0000000000000000 r18 : a0000001007b7478 r19 : 0000000000000000
r20 : 0000000000000000 r21 : 0000000000000018 r22 : 0000000000000000
r23 : 0000000000000000 r24 : 0000000000000000 r25 : a0000001007b7308
r26 : 000000007fffffff r27 : a000000100825520 r28 : a0000001008f3c40
r29 : a000000100816ca8 r30 : 0000000000000018 r31 : 0000000000000018



(gdb) l *0xa0000001000e6750
0xa0000001000e6750 is in __free_pages_ok (mm.h:324).
319     extern void FASTCALL(__page_cache_release(struct page *));
320     
321     static inline int page_count(struct page *page)
322     {
323             if (unlikely(PageCompound(page)))
324                     page = (struct page *)page_private(page);
325             return atomic_read(&page->_count);
326     }
327     
328     static inline void get_page(struct page *page)


Note the misaligned pfns.

Andy's (misspelled) CONFIG_UNALIGNED_ZONE_BOUNDRIES patch didn't actually
include an update to any Kconfig files.  But hacking that in by hand didn't
help.

^ permalink raw reply

* Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4
From: Andy Whitcroft @ 2006-05-14 15:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, clameter, a.p.zijlstra, piggin, ak, rohitseth,
	mbligh, hugh, riel, andrea, arjan, mel, marcelo, anton, paulmck,
	linux-mm
In-Reply-To: <20060511164448.4686a2bd.akpm@osdl.org>

Andrew Morton wrote:
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
>>What happened to the VM stress-test programs that we used to test the 
>>page-out with? I forget who kept a collection of them around, but they did 
>>things like trying to cause MM problems on purpose.
> 
> 
> I think that was me, back in my programming days.
> 
> 
>>And I'm pretty sure 
>>some of the nastiest ones used shared mappings, exactly because we've had 
>>problems with the virtual scanning.
> 
> 
> http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz
> 
> run-bash-shared-mapping.sh is a good stress-tester and deadlock-finder.

Well my amd64 box ran run-bash-shared-mapping.sh for 6 hours without a
peep or oops.  Machine seemed ok at the end.

-apw

^ permalink raw reply

* Re: [patch 00/14] remap_file_pages protection support
From: Valerie Henson @ 2006-05-13 22:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management, Arjan van de Ven
In-Reply-To: <20060513181945.GC9612@goober>

On Sat, May 13, 2006 at 11:19:46AM -0700, Valerie Henson wrote:
> The original program mmapped everything with the same permissions and
> no alignment restrictions, so all the mmaps were coalesced into one.
> This version alternates PROT_WRITE permissions on the mmap'd areas
> after they are written, so you get lots of vma's:

... Which is of course exactly the case that Blaisorblade's patches
should coalesce into one vma.  So I wrote another option which uses
holes instead - takes more memory initially, unfortunately.  Grab it
from:

http://www.nmt.edu/~val/patches/ebizzy.tar.gz

-p for preventing coalescing via protections, -P for preventing it via
holes.
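
One way the hole approach could work is sketched below. This is not taken from ebizzy; map_with_holes() and page_is_mapped() are hypothetical names, and the msync() probe relies on Linux returning ENOMEM for an unmapped range. It reserves twice the needed pages in one mmap() and then unmaps every other page, which also shows why more memory is needed up front.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return 1 if the page at addr is mapped, 0 if it is a hole. */
static int page_is_mapped(void *addr, size_t page_size)
{
	/* msync() fails with ENOMEM when the range is not mapped. */
	if (msync(addr, page_size, MS_ASYNC) == 0)
		return 1;
	return errno == ENOMEM ? 0 : 1;
}

/*
 * Reserve 2 * chunks pages in one mmap(), then unmap every odd page so that
 * each remaining page is its own vma, separated from the next by a hole.
 * Returns the base address, or NULL on failure.
 */
static char *map_with_holes(size_t chunks, size_t page_size)
{
	size_t i;
	char *base;

	assert(chunks > 0 && page_size > 0);
	base = mmap(NULL, 2 * chunks * page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	for (i = 0; i < chunks; i++)
		munmap(base + (2 * i + 1) * page_size, page_size);
	return base;
}
```

Because the surviving pages all keep identical permissions, only the gaps stop the kernel from merging them, which is the property the -P option is after.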

-VAL

^ permalink raw reply

* Re: [patch 00/14] remap_file_pages protection support
From: Valerie Henson @ 2006-05-13 18:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management, Val Henson
In-Reply-To: <4465E981.60302@yahoo.com.au>

On Sun, May 14, 2006 at 12:13:21AM +1000, Nick Piggin wrote:
> 
> OK, I got interested again, but can't get Val's ebizzy to give me
> a find_vma constrained workload yet (though the numbers back up
> my assertion that the vma cache is crap for threaded apps).

Hey Nick,

Glad to see you're using it!  There are (at least) two ways to do what
you want:

1. Increase the number of threads - this gives you two vma's per
   thread, one for stack, one for guard page:

   $ ./ebizzy -t 100

2. Apply the patch at the end of this email and use -p "prevent
   coalescing", -m "always mmap" and appropriate number of chunks,
   size, and records to search - this works for me:

   $ ./ebizzy -p -m -n 10000 -s 4096 -r 100000

The original program mmapped everything with the same permissions and
no alignment restrictions, so all the mmaps were coalesced into one.
This version alternates PROT_WRITE permissions on the mmap'd areas
after they are written, so you get lots of vma's:

val@goober:~/ebizzy$ ./ebizzy -p -m -n 10000 -s 4096 -r 100000

[2]+  Stopped                 ./ebizzy -p -m -n 10000 -s 4096 -r 100000
val@goober:~/ebizzy$ wc -l /proc/`pgrep ebizzy`/maps
10019 /proc/10917/maps

I haven't profiled to see if this brings find_vma to the top, though.
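
The vma count above can be reproduced in isolation with a sketch like the following. It is a standalone illustration rather than ebizzy code: count_vmas() and split_by_protection() are made-up helpers, and it maps one region and flips every other page to PROT_READ instead of allocating separate chunks.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count lines in /proc/self/maps, i.e. the number of vmas; -1 on error. */
static long count_vmas(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	long lines = 0;
	int c;

	if (!f)
		return -1;
	while ((c = fgetc(f)) != EOF)
		if (c == '\n')
			lines++;
	fclose(f);
	return lines;
}

/*
 * Map 'pages' anonymous pages in one call, then make every other page
 * read-only. Returns the growth in vma count, or -1 on error.
 */
static long split_by_protection(size_t pages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long before, after;
	size_t i, len;
	char *mem;

	assert(pages > 0 && page_size > 0);
	len = pages * (size_t)page_size;
	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	before = count_vmas();
	/* Neighbouring pages now differ in permissions, so they cannot merge. */
	for (i = 0; i < pages; i += 2)
		mprotect(mem + i * (size_t)page_size, (size_t)page_size,
			 PROT_READ);
	after = count_vmas();

	munmap(mem, len);
	if (before < 0 || after < 0)
		return -1;
	return after - before;
}
```

Calling split_by_protection(64) on a Linux box should report growth of roughly one vma per page, since each mprotect() call splits the mapping again.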

(The patch also moves around some other stuff so that options are in
alphabetical order; apparently I thought 's' came after 'r' and before
'R'...)

-VAL

--- ebizzy.c.old	2006-05-13 10:18:58.000000000 -0700
+++ ebizzy.c	2006-05-13 11:01:42.000000000 -0700
@@ -52,9 +52,10 @@
 static unsigned int always_mmap;
 static unsigned int never_mmap;
 static unsigned int chunks;
+static unsigned int prevent_coalescing;
 static unsigned int records;
-static unsigned int chunk_size;
 static unsigned int random_size;
+static unsigned int chunk_size;
 static unsigned int threads;
 static unsigned int verbose;
 static unsigned int linear;
@@ -76,9 +77,10 @@
 		"-m\t\t Always use mmap instead of malloc\n"
 		"-M\t\t Never use mmap\n"
 		"-n <num>\t Number of memory chunks to allocate\n"
+		"-p \t\t Prevent mmap coalescing\n"
 		"-r <num>\t Total number of records to search for\n"
-		"-s <size>\t Size of memory chunks, in bytes\n"
 		"-R\t\t Randomize size of memory to copy and search\n"
+		"-s <size>\t Size of memory chunks, in bytes\n"
 		"-t <num>\t Number of threads\n"
 		"-v[v[v]]\t Be verbose (more v's for more verbose)\n"
 		"-z\t\t Linear search instead of binary search\n",
@@ -98,7 +100,7 @@
 	cmd = argv[0];
 	opterr = 1;
 
-	while ((c = getopt(argc, argv, "mMn:r:s:Rt:vz")) != -1) {
+	while ((c = getopt(argc, argv, "mMn:pr:Rs:t:vz")) != -1) {
 		switch (c) {
 		case 'm':
 			always_mmap = 1;
@@ -111,19 +113,22 @@
 			if (chunks == 0)
 				usage();
 			break;
+		case 'p':
+			prevent_coalescing = 1;
+			break;
 		case 'r':
 			records = atoi(optarg);
 			if (records == 0)
 				usage();
 			break;
+		case 'R':
+			random_size = 1;
+			break;
 		case 's':
 			chunk_size = atoi(optarg);
 			if (chunk_size == 0)
 				usage();
 			break;
-		case 'R':
-			random_size = 1;
-			break;
 		case 't':
 			threads = atoi(optarg);
 			if (threads == 0)
@@ -141,7 +146,7 @@
 	}
 
 	if (verbose)
-		printf("ebizzy 0.1, Copyright 2006 Intel Corporation\n"
+		printf("ebizzy 0.2, Copyright 2006 Intel Corporation\n"
 		       "Written by Val Henson <val_henson@linux.intel.com\n");
 
 	/*
@@ -173,10 +178,11 @@
 		printf("always_mmap %u\n", always_mmap);
 		printf("never_mmap %u\n", never_mmap);
 		printf("chunks %u\n", chunks);
+		printf("prevent coalescing %u\n", prevent_coalescing);
 		printf("records %u\n", records);
 		printf("records per thread %u\n", records_per_thread);
-		printf("chunk_size %u\n", chunk_size);
 		printf("random_size %u\n", random_size);
+		printf("chunk_size %u\n", chunk_size);
 		printf("threads %u\n", threads);
 		printf("verbose %u\n", verbose);
 		printf("linear %u\n", linear);
@@ -251,9 +257,13 @@
 {
 	int i, j;
 
-	for (i = 0; i < chunks; i++)
+	for (i = 0; i < chunks; i++) {
 		for(j = 0; j < chunk_size / record_size; j++)
 			mem[i][j] = (record_t) j;
+		/* Prevent coalescing by alternating permissions */
+		if (prevent_coalescing && (i % 2) == 0)
+			mprotect(mem[i], chunk_size, PROT_READ);
+	}
 	if (verbose)
 		printf("Wrote memory\n");
 }


^ permalink raw reply

* Re: [patch 00/14] remap_file_pages protection support
From: Nick Piggin @ 2006-05-13 14:13 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management, Val Henson
In-Reply-To: <445D75EB.5030909@yahoo.com.au>

[-- Attachment #1: Type: text/plain, Size: 1433 bytes --]

Nick Piggin wrote:

> I think possibly each thread should have a private vma cache, with
> room for at least its stack vma(s), (and several others, eg. code,
> data). Perhaps the per-mm cache could be dispensed with completely,
> although it might be useful eg. for the heap. And it might be helped
> with increased entries as well.
> 
> I've got patches lying around to implement this stuff -- I'd be
> interested to have more detail about this problem, or distilled test
> cases.

OK, I got interested again, but can't get Val's ebizzy to give me
a find_vma constrained workload yet (though the numbers back up
my assertion that the vma cache is crap for threaded apps).

Without the patch, after bootup, the vma cache gets 208 364 hits out
of 438 535 lookups (47.5%)

./ebizzy -t16: 384.29user 754.61system 5:31.87elapsed 343%CPU

And ebizzy gets 7 373 078 hits out of 82 255 863 lookups (8.9%)


With mm + 4 slot LRU per-thread cache (this patch):
After boot, 303 767 / 439 918 = 69.0%

./ebizzy -t16: 388.73user 750.29system 5:30.24elapsed 344%CPU

ebizzy hits: 53 024 083 / 82 055 195 = 64.6%


So on a non-threaded workload, the hit rate increases by about 45%
(47.5% to 69.0%); on a threaded workload it increases roughly sevenfold
(8.9% to 64.6%). In rbtree-walk-constrained workloads, the total
find_vma speedup should be linear in the hit ratio improvement.
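
The ratios can be rechecked from the raw counters quoted above. A small
sketch follows; hit_pct(), improvement() and report() are illustrative
helpers, not part of the patch.

```c
#include <stdio.h>

/* Hit ratio as a percentage; 0 if there were no lookups. */
static double hit_pct(unsigned long hits, unsigned long lookups)
{
	return lookups ? 100.0 * (double)hits / (double)lookups : 0.0;
}

/* Relative change between two ratios; 1.5 means 50% more hits. */
static double improvement(double before_pct, double after_pct)
{
	return before_pct > 0.0 ? after_pct / before_pct : 0.0;
}

/* Print one before/after comparison from hit and lookup counters. */
static void report(const char *name,
		   unsigned long hits0, unsigned long lookups0,
		   unsigned long hits1, unsigned long lookups1)
{
	double before = hit_pct(hits0, lookups0);
	double after = hit_pct(hits1, lookups1);

	printf("%-8s %5.1f%% -> %5.1f%%  (x%.2f)\n",
	       name, before, after, improvement(before, after));
}
```

Usage with the numbers from this mail would be
report("boot", 208364, 438535, 303767, 439918) and
report("ebizzy", 7373078, 82255863, 53024083, 82055195).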

I don't think my ebizzy numbers can justify the patch though...
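
For readers who want the cache policy without the kernel plumbing, here is
a userspace sketch of the 4-slot move-to-front scheme the attached patch
uses per thread. It is a simplified model only: struct range stands in for
a VMA, the names are hypothetical, and the sequence-number invalidation and
the per-mm fallback slot from the patch are left out.

```c
#include <stddef.h>

#define SLOTS 4

/* Half-open address range [start, end), standing in for a VMA. */
struct range {
	unsigned long start, end;
};

/* Most recently used entry lives in slot[0]. */
struct range_cache {
	const struct range *slot[SLOTS];
};

/* Return the cached range containing addr, or NULL on a miss. */
static const struct range *cache_find(const struct range_cache *c,
				      unsigned long addr)
{
	int i;

	for (i = 0; i < SLOTS; i++) {
		const struct range *r = c->slot[i];

		if (r && r->start <= addr && addr < r->end)
			return r;
	}
	return NULL;
}

/* Move r to the front; on a miss the last slot is evicted. */
static void cache_touch(struct range_cache *c, const struct range *r)
{
	int i;

	for (i = 0; i < SLOTS; i++)
		if (c->slot[i] == r)
			break;
	if (i == SLOTS)
		i = SLOTS - 1;		/* not cached: overwrite the oldest */
	for (; i > 0; i--)
		c->slot[i] = c->slot[i - 1];
	c->slot[0] = r;
}
```

The linear scan is deliberate: with four entries it stays within a cache
line or two, which is the whole point of avoiding the rbtree walk.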

Nick

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: vma.patch --]
[-- Type: text/plain, Size: 6766 bytes --]

Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2006-05-13 23:31:13.000000000 +1000
+++ linux-2.6/mm/mmap.c	2006-05-13 23:48:53.000000000 +1000
@@ -30,6 +30,99 @@
 #include <asm/cacheflush.h>
 #include <asm/tlb.h>
 
+static void vma_cache_touch(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+	struct task_struct *curr = current;
+	if (mm == curr->mm) {
+		int i;
+		if (curr->vma_cache_sequence != mm->vma_sequence) {
+			curr->vma_cache_sequence = mm->vma_sequence;
+			curr->vma_cache[0] = vma;
+			for (i = 1; i < 4; i++)
+				curr->vma_cache[i] = NULL;
+		} else {
+			int update_mm;
+
+			if (curr->vma_cache[0] == vma)
+				return;
+
+			for (i = 1; i < 4; i++) {
+				if (curr->vma_cache[i] == vma)
+					break;
+			}
+			update_mm = 0;
+			if (i == 4) {
+				update_mm = 1;
+				i = 3;
+			}
+			while (i != 0) {
+				curr->vma_cache[i] = curr->vma_cache[i-1];
+				i--;
+			}
+			curr->vma_cache[0] = vma;
+
+			if (!update_mm)
+				return;
+		}
+	}
+
+	if (mm->vma_cache != vma) /* prevent cacheline bouncing */
+		mm->vma_cache = vma;
+}
+
+static void vma_cache_replace(struct mm_struct *mm, struct vm_area_struct *vma,
+						struct vm_area_struct *repl)
+{
+	mm->vma_sequence++;
+	if (unlikely(mm->vma_sequence == 0)) {
+		struct task_struct *curr = current, *t;
+		t = curr;
+		rcu_read_lock();
+		do {
+			t->vma_cache_sequence = -1;
+			t = next_thread(t);
+		} while (t != curr);
+		rcu_read_unlock();
+	}
+
+	if (mm->vma_cache == vma)
+		mm->vma_cache = repl;
+}
+
+static inline void vma_cache_invalidate(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+	vma_cache_replace(mm, vma, NULL);
+}
+
+static struct vm_area_struct *vma_cache_find(struct mm_struct *mm,
+						unsigned long addr)
+{
+	struct task_struct *curr;
+	struct vm_area_struct *vma;
+
+	inc_page_state(vma_cache_query);
+
+	curr = current;
+	if (mm == curr->mm && mm->vma_sequence == curr->vma_cache_sequence) {
+		int i;
+		for (i = 0; i < 4; i++) {
+			vma = curr->vma_cache[i];
+			if (vma && vma->vm_end > addr && vma->vm_start <= addr){
+				inc_page_state(vma_cache_hit);
+				return vma;
+			}
+		}
+	}
+
+	vma = mm->vma_cache;
+	if (vma && vma->vm_end > addr && vma->vm_start <= addr) {
+		inc_page_state(vma_cache_hit);
+		return vma;
+	}
+
+	return NULL;
+}
+
 static void unmap_region(struct mm_struct *mm,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
 		unsigned long start, unsigned long end);
@@ -460,8 +553,6 @@
 {
 	prev->vm_next = vma->vm_next;
 	rb_erase(&vma->vm_rb, &mm->mm_rb);
-	if (mm->mmap_cache == vma)
-		mm->mmap_cache = prev;
 }
 
 /*
@@ -586,6 +677,7 @@
 		 * us to remove next before dropping the locks.
 		 */
 		__vma_unlink(mm, next, vma);
+		vma_cache_replace(mm, next, vma);
 		if (file)
 			__remove_shared_vm_struct(next, file, mapping);
 		if (next->anon_vma)
@@ -1384,8 +1476,8 @@
 	if (mm) {
 		/* Check the cache first. */
 		/* (Cache hit rate is typically around 35%.) */
-		vma = mm->mmap_cache;
-		if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
+		vma = vma_cache_find(mm, addr);
+		if (!vma) {
 			struct rb_node * rb_node;
 
 			rb_node = mm->mm_rb.rb_node;
@@ -1405,9 +1497,9 @@
 				} else
 					rb_node = rb_node->rb_right;
 			}
-			if (vma)
-				mm->mmap_cache = vma;
 		}
+		if (vma)
+			vma_cache_touch(mm, vma);
 	}
 	return vma;
 }
@@ -1424,6 +1516,14 @@
 	if (!mm)
 		goto out;
 
+	vma = vma_cache_find(mm, addr);
+	if (vma) {
+		rb_node = rb_prev(&vma->vm_rb);
+		if (rb_node)
+			prev = rb_entry(rb_node, struct vm_area_struct, vm_rb);
+		goto out;
+	}
+
 	/* Guard against addr being lower than the first VMA */
 	vma = mm->mmap;
 
@@ -1445,6 +1545,9 @@
 	}
 
 out:
+	if (vma)
+		vma_cache_touch(mm, vma);
+
 	*pprev = prev;
 	return prev ? prev->vm_next : vma;
 }
@@ -1686,6 +1789,7 @@
 
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	do {
+		vma_cache_invalidate(mm, vma);
 		rb_erase(&vma->vm_rb, &mm->mm_rb);
 		mm->map_count--;
 		tail_vma = vma;
@@ -1698,7 +1802,6 @@
 	else
 		addr = vma ?  vma->vm_start : mm->mmap_base;
 	mm->unmap_area(mm, addr);
-	mm->mmap_cache = NULL;		/* Kill the cache. */
 }
 
 /*
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2006-05-13 23:31:05.000000000 +1000
+++ linux-2.6/include/linux/page-flags.h	2006-05-13 23:31:44.000000000 +1000
@@ -164,6 +164,9 @@
 
 	unsigned long pgrotated;	/* pages rotated to tail of the LRU */
 	unsigned long nr_bounce;	/* pages for bounce buffers */
+
+	unsigned long vma_cache_hit;
+	unsigned long vma_cache_query;
 };
 
 extern void get_page_state(struct page_state *ret);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2006-05-13 23:31:05.000000000 +1000
+++ linux-2.6/mm/page_alloc.c	2006-05-13 23:31:44.000000000 +1000
@@ -2389,6 +2389,9 @@
 
 	"pgrotated",
 	"nr_bounce",
+
+	"vma_cache_hit",
+	"vma_cache_query",
 };
 
 static void *vmstat_start(struct seq_file *m, loff_t *pos)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2006-05-13 23:31:03.000000000 +1000
+++ linux-2.6/include/linux/sched.h	2006-05-13 23:33:01.000000000 +1000
@@ -293,9 +293,11 @@
 } while (0)
 
 struct mm_struct {
-	struct vm_area_struct * mmap;		/* list of VMAs */
+	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
-	struct vm_area_struct * mmap_cache;	/* last find_vma result */
+	struct vm_area_struct *vma_cache;
+	unsigned long vma_sequence;
+
 	unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
@@ -734,6 +736,8 @@
 	struct list_head ptrace_list;
 
 	struct mm_struct *mm, *active_mm;
+	struct vm_area_struct *vma_cache[4];
+	unsigned long vma_cache_sequence;
 
 /* task state */
 	struct linux_binfmt *binfmt;
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2006-05-13 23:31:03.000000000 +1000
+++ linux-2.6/kernel/fork.c	2006-05-13 23:32:59.000000000 +1000
@@ -197,7 +197,7 @@
 
 	mm->locked_vm = 0;
 	mm->mmap = NULL;
-	mm->mmap_cache = NULL;
+	mm->vma_sequence = oldmm->vma_sequence+1;
 	mm->free_area_cache = oldmm->mmap_base;
 	mm->cached_hole_size = ~0UL;
 	mm->map_count = 0;
@@ -238,6 +238,10 @@
 		tmp->vm_next = NULL;
 		anon_vma_link(tmp);
 		file = tmp->vm_file;
+
+		if (oldmm->vma_cache == mpnt)
+			mm->vma_cache = tmp;
+
 		if (file) {
 			struct inode *inode = file->f_dentry->d_inode;
 			get_file(file);

^ permalink raw reply

* Re: [PATCH 0/3] Zone boundry alignment fixes
From: Nick Piggin @ 2006-05-13  1:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andy Whitcroft, haveblue, bob.picco, mbligh, ak,
	linux-kernel, linux-mm
In-Reply-To: <20060512141921.GA564@elte.hu>

Ingo Molnar wrote:
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> 
>>There's some possibility here of interaction with Mel's "patchset to 
>>size zones and memory holes in an architecture-independent manner." I 
>>jammed them together - let's see how it goes.
> 
> 
> update: Andy's 3 patches, applied to 2.6.17-rc3-mm1, fixed all the 
> crashes and asserts i saw. NUMA-on-x86 is now rock-solid on my testbox. 
> Great work Andy!

Excellent. I think these should get into 2.6.17, and possibly even the
-stable series.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply

* Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4
From: Martin J. Bligh @ 2006-05-12 14:25 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Andrew Morton, Linus Torvalds, clameter, a.p.zijlstra, piggin, ak,
	rohitseth, hugh, riel, andrea, arjan, mel, marcelo, anton,
	paulmck, linux-mm
In-Reply-To: <4464423D.50803@shadowen.org>

> Well, for what it's worth (and from this thread it may not be that much),
> the testing I did overnight shows green across all the test boxes I
> have.  The tests do include fsx-linux across a limited range of filesystems.

There's no perf regressions anywhere in there either (across dbench, 
reaim, kernbench, tbench, at least) on a multitude of machines ...

M.


^ permalink raw reply

* Re: [PATCH 0/3] Zone boundry alignment fixes
From: Ingo Molnar @ 2006-05-12 14:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Whitcroft, nickpiggin, haveblue, bob.picco, mbligh, ak,
	linux-kernel, linux-mm
In-Reply-To: <20060511005952.3d23897c.akpm@osdl.org>

* Andrew Morton <akpm@osdl.org> wrote:

> There's some possibility here of interaction with Mel's "patchset to 
> size zones and memory holes in an architecture-independent manner." I 
> jammed them together - let's see how it goes.

update: Andy's 3 patches, applied to 2.6.17-rc3-mm1, fixed all the 
crashes and asserts i saw. NUMA-on-x86 is now rock-solid on my testbox. 
Great work Andy!

	Ingo


^ permalink raw reply

* Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4
From: Peter Zijlstra @ 2006-05-12  8:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, clameter, torvalds, ak, rohitseth, mbligh, hugh,
	riel, andrea, arjan, apw, mel, marcelo, anton, paulmck, linux-mm
In-Reply-To: <44644196.9070402@cyberone.com.au>

On Fri, 2006-05-12 at 18:04 +1000, Nick Piggin wrote:
> Peter Zijlstra wrote:
> 
> >On Thu, 2006-05-11 at 21:30 -0700, Andrew Morton wrote:
> >
> >>Nick Piggin <piggin@cyberone.com.au> wrote:
> >>
> >>> >So let's see.  We take a write fault, we mark the page dirty then we return
> >>> >to userspace which will proceed with the write and will mark the pte dirty.
> >>> >
> >>> >Later, the VM will write the page out.
> >>> >
> >>> >Later still, the pte will get cleaned by reclaim or by munmap or whatever
> >>> >and the page will be marked dirty and the page will again be written out. 
> >>> >Potentially needlessly.
> >>> >
> >>>
> >>> page_wrprotect also marks the page clean,
> >>>
> >>Oh.  I missed that when reading the comment which describes
> >>page_wrprotect() (I do go on).
> >>
> >
> >Yes, this name is not the best of names :-(
> >
> >I was aware of this, but since in my mind the counting through
> >protection 
> >faults was the prime idea, I stuck to page_wrprotect().
> >
> >But I'm hard pressed to come up with a better one. Nick proposes:
> > page_mkclean()
> >But that also doesn't cover the whole of it from my perspective.
> >
> 
> What's your perspective?
> 
> With mmap shared accounting, the _whole VM's_ perspective is that clean
> MAP_SHARED ptes are marked readonly.
> 
> The logical operation is marking the page's ptes clean. The VM mechanism
> also marks the ptes readonly as a side effect of that. Think about it:
> writeback does not want to make the page write protected, it wants to make
> it clean.

As said, I was looking at the added functionality on its own; that is, 
counting dirty pages by trapping write faults.

However, your view, the big picture, does make more sense. I shall
rename.





^ permalink raw reply

* Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4
From: Nick Piggin @ 2006-05-12  8:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, clameter, torvalds, ak, rohitseth, mbligh, hugh,
	riel, andrea, arjan, apw, mel, marcelo, anton, paulmck, linux-mm
In-Reply-To: <1147417561.8951.17.camel@twins>


Peter Zijlstra wrote:

>On Thu, 2006-05-11 at 21:30 -0700, Andrew Morton wrote:
>
>>
>>We just lost that pte dirty bit, and hence the user's data.
>>
>
>I thought that at the time we clean PAGECACHE_TAG_DIRTY the page is in
>flight to disk.
>

No.

>Now that I look at it again, perhaps the page_wrprotect() call in
>clear_page_dirty_for_io()
>should be in test_set_page_writeback().
>

No. The logical operation is clearing the dirty bits from the ptes. Such
an operation would be valid even if we didn't set the ptes readonly.

And clearing dirty belongs in clear_page_dirty.


^ permalink raw reply

* Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4
From: Andy Whitcroft @ 2006-05-12  8:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, clameter, a.p.zijlstra, piggin, ak, rohitseth,
	mbligh, hugh, riel, andrea, arjan, mel, marcelo, anton, paulmck,
	linux-mm
In-Reply-To: <20060511164448.4686a2bd.akpm@osdl.org>

Andrew Morton wrote:
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
>>What happened to the VM stress-test programs that we used to test the 
>>page-out with? I forget who kept a collection of them around, but they did 
>>things like trying to cause MM problems on purpose.
> 
> 
> I think that was me, back in my programming days.
> 
> 
>>And I'm pretty sure 
>>some of the nastiest ones used shared mappings, exactly because we've had 
>>problems with the virtual scanning.
> 
> 
> http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz
> 
> run-bash-shared-mapping.sh is a good stress-tester and deadlock-finder.
> 
> Running fsx-linux (in mmap-read and mmap-write and read and write mode) in
> combination with memory pressure is a good correctness-tester.  Needs to be
> run on various filesystems too.

Well, for what it's worth (and from this thread it may not be that much),
the testing I did overnight shows green across all the test boxes I
have.  The tests do include fsx-linux across a limited range of filesystems.

I'll see if I can get the other one done.

-apw


^ permalink raw reply

* Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4
From: Nick Piggin @ 2006-05-12  8:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, clameter, torvalds, ak, rohitseth, mbligh, hugh,
	riel, andrea, arjan, apw, mel, marcelo, anton, paulmck, linux-mm
In-Reply-To: <1147417561.8951.17.camel@twins>

Peter Zijlstra wrote:

>On Thu, 2006-05-11 at 21:30 -0700, Andrew Morton wrote:
>
>>Nick Piggin <piggin@cyberone.com.au> wrote:
>>
>>> >So let's see.  We take a write fault, we mark the page dirty then we return
>>> >to userspace which will proceed with the write and will mark the pte dirty.
>>> >
>>> >Later, the VM will write the page out.
>>> >
>>> >Later still, the pte will get cleaned by reclaim or by munmap or whatever
>>> >and the page will be marked dirty and the page will again be written out. 
>>> >Potentially needlessly.
>>> >
>>>
>>> page_wrprotect also marks the page clean,
>>>
>>Oh.  I missed that when reading the comment which describes
>>page_wrprotect() (I do go on).
>>
>
>Yes, this name is not the best of names :-(
>
>I was aware of this, but since in my mind the counting through
>protection 
>faults was the prime idea, I stuck to page_wrprotect().
>
>But I'm hard pressed to come up with a better one. Nick proposes:
> page_mkclean()
>But that also doesn't cover the whole of it from my perspective.
>

What's your perspective?

With mmap shared accounting, the _whole VM's_ perspective is that clean
MAP_SHARED ptes are marked readonly.

The logical operation is marking the page's ptes clean. The VM mechanism
also marks the ptes readonly as a side effect of that. Think about it:
writeback does not want to make the page write protected, it wants to make
it clean.
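
To make the "clean shared ptes are readonly" rule concrete, here is a toy
userspace model of the accounting. None of this is kernel code: toy_pte,
toy_page, toy_store() and toy_mkclean_for_io() are invented names, and one
page has exactly one pte.

```c
#include <stdbool.h>

struct toy_pte {
	bool write;	/* pte allows stores without faulting */
	bool dirty;	/* set by "hardware" on a store */
};

struct toy_page {
	bool dirty;	/* stands in for PG_dirty */
};

static long nr_dirty;	/* global count of dirty pages */

/*
 * Store through a shared mapping.  A readonly pte takes a write fault,
 * which is where the page gets accounted as dirty.  Returns true if the
 * store faulted.
 */
static bool toy_store(struct toy_page *pg, struct toy_pte *pte)
{
	bool faulted = false;

	if (!pte->write) {
		faulted = true;
		if (!pg->dirty) {
			pg->dirty = true;
			nr_dirty++;
		}
		pte->write = true;
	}
	pte->dirty = true;
	return faulted;
}

/*
 * Start of writeback: the logical operation is "make clean".  Write
 * protection follows only because clean shared ptes must be readonly.
 */
static void toy_mkclean_for_io(struct toy_page *pg, struct toy_pte *pte)
{
	if (pg->dirty) {
		pg->dirty = false;
		nr_dirty--;
	}
	pte->dirty = false;
	pte->write = false;
}
```

In this model only the first store after a clean faults, so nr_dirty stays
exact without scanning ptes.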


^ permalink raw reply

* Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4
From: Peter Zijlstra @ 2006-05-12  7:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, clameter, torvalds, ak, rohitseth, mbligh, hugh,
	riel, andrea, arjan, apw, mel, marcelo, anton, paulmck, linux-mm
In-Reply-To: <20060511213045.32b41aa6.akpm@osdl.org>

On Thu, 2006-05-11 at 21:30 -0700, Andrew Morton wrote:
> Nick Piggin <piggin@cyberone.com.au> wrote:
> >
> >  >So let's see.  We take a write fault, we mark the page dirty then we return
> >  >to userspace which will proceed with the write and will mark the pte dirty.
> >  >
> >  >Later, the VM will write the page out.
> >  >
> >  >Later still, the pte will get cleaned by reclaim or by munmap or whatever
> >  >and the page will be marked dirty and the page will again be written out. 
> >  >Potentially needlessly.
> >  >
> > 
> >  page_wrprotect also marks the page clean,
> 
> Oh.  I missed that when reading the comment which describes
> page_wrprotect() (I do go on).

Yes, this name is not the best of names :-(

I was aware of this, but since in my mind the counting through protection
faults was the prime idea, I stuck to page_wrprotect().

But I'm hard pressed to come up with a better one. Nick proposes:
 page_mkclean()
But that also doesn't cover the whole of it from my perspective.

> > so this window is very small.
> >  The window is that the fault path might set_page_dirty, then throttle
> >  on writeout, and the page gets written out before it really gets dirtied
> >  by the application (which then has to fault again).
> 
> : int test_clear_page_dirty(struct page *page)
> : {
> : 	struct address_space *mapping = page_mapping(page);
> : 	unsigned long flags;
> : 
> : 	if (mapping) {
> : 		write_lock_irqsave(&mapping->tree_lock, flags);
> : 		if (TestClearPageDirty(page)) {
> : 			radix_tree_tag_clear(&mapping->page_tree,
> : 						page_index(page),
> : 						PAGECACHE_TAG_DIRTY);
> : 			write_unlock_irqrestore(&mapping->tree_lock, flags);
> : 			/*
> : 			 * We can continue to use `mapping' here because the
> : 			 * page is locked, which pins the address_space
> : 			 */
> 
> So if userspace modifies the page right here, and marks the pte dirty.
> 
> : 			if (mapping_cap_account_dirty(mapping)) {
> : 				page_wrprotect(page);
> 
> We just lost that pte dirty bit, and hence the user's data.

I thought that at the time we clean PAGECACHE_TAG_DIRTY the page is in
flight to disk.
Now that I look at it again, perhaps the page_wrprotect() call in
clear_page_dirty_for_io() should be in test_set_page_writeback().

> : 				dec_page_state(nr_dirty);
> : 			}
> : 			return 1;
> : 		}
> : 		write_unlock_irqrestore(&mapping->tree_lock, flags);
> : 		return 0;
> : 	}
> : 	return TestClearPageDirty(page);
> : }
> : 
> 
> Which is just the sort of subtle and nasty problem I was referring to...
> 
> If that's correct then I guess we need the
> 
>                 if (ptep_clear_flush_dirty(vma, addr, pte) ||
>                                 page_test_and_clear_dirty(page))
>                         ret += set_page_dirty(page);
> 
> treatment in page_wrprotect().

I would make that a BUG_ON(); the only way for a pte of a shared mapping
to become dirty is through the fault handler, and that should already call
set_page_dirty() on it.

> Now I suppose it's not really a dataloss race, because in practice the
> kernel is about to write this page to backing store anyway.  I guess.  I
> cannot immediately think of any clear_page_dirty() callers for whom that
> won't be true.
> 
> Someone please convince me that this has all been thought about and is solid
> as a rock.


^ permalink raw reply

* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Nick Piggin @ 2006-05-12  5:17 UTC (permalink / raw)
  To: Brian Twichell
  Cc: Hugh Dickins, Dave McCracken, Linux Memory Management,
	Linux Kernel
In-Reply-To: <446242CB.4090106@us.ibm.com>

Brian Twichell wrote:
> Nick Piggin wrote:

>> Of course if it was free performance then we'd want it. The downsides are
>> that it is a significant complexity for a pretty small (3%) performance
>> gain for your apparent target workload, which is pretty uncommon among
>> all Linux users.
> 
> 
> Our performance data demonstrated that the potential gain for the 
> non-hugepage case is much higher than 3%.

The point is, there are hugepages. They were a significant additional
complexity but the concession was made because they did provide a
large speedup for databases.

> 
>>
>> Ignoring the complexity, it is still not free. Sharing data across
>> processes adds to synchronisation overhead and hurts scalability. Some
>> of these page fault scalability scenarios have shown to be important
>> enough that we have introduced complexity _there_.
> 
> 
> True, but this needs to be balanced against the fact that pagetable 
> sharing will reduce the number of page faults when it is achieved.  
> Let's say you have N processes which touch all the pages in an M page 
> shared memory region.  Without shared pagetables this requires N*M page 
> faults; if pagetable sharing is achieved, only M pagefaults are required.
> 
>>
>> And it seems customers running "out-of-the-box" settings really want to
>> start using hugepages if they're interested in getting the most
>> performance possible, no?
> 
> 
> My perspective is that, once the customer is required to invoke "echo 
> XXX > /proc/sys/vm/nr_hugepages" they've left the "out-of-the-box" 
> domain, and entered the domain of hoping that the number of hugepages is 
> sufficient, because if it's not, they'll probably need to reboot, which 
> can be pretty inconvenient for a production transaction-processing 
> application.

I think it is pretty easy to reserve hugepages at bootup. This is what
a production transaction processing system will be doing, won't it?
Especially if they're performance constrained and hugepages gives them
a 30% performance boost.
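
A service that depends on a boot-time reservation can at least check at
startup that the pool is really there. The sketch below parses the
HugePages_Free field out of /proc/meminfo; the helper names are
hypothetical and the 8 KiB buffer size is an assumption about how large
that file gets.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find a "Key:   N" line in meminfo-style text and return N,
 * or -1 if the key is missing or has no number.
 */
static long meminfo_field(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
			long val;

			if (sscanf(line + klen + 1, "%ld", &val) == 1)
				return val;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return -1;
}

/*
 * Returns 1 if at least `needed` huge pages are free, 0 if not,
 * and -1 if /proc/meminfo could not be read or lacks the field.
 */
static int hugepages_available(long needed)
{
	char buf[8192];
	long free_pages;
	size_t n;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';

	free_pages = meminfo_field(buf, "HugePages_Free");
	if (free_pages < 0)
		return -1;
	return free_pages >= needed;
}
```

This only reports a shortfall early; it does not remove the need to size
the pool correctly in the first place.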

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply

* Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4
From: Nick Piggin @ 2006-05-12  5:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: a.p.zijlstra, clameter, torvalds, ak, rohitseth, mbligh, hugh,
	riel, andrea, arjan, apw, mel, marcelo, anton, paulmck, linux-mm
In-Reply-To: <20060511213045.32b41aa6.akpm@osdl.org>

Andrew Morton wrote:

>Nick Piggin <piggin@cyberone.com.au> wrote:
>
>> >So let's see.  We take a write fault, we mark the page dirty then we return
>> >to userspace which will proceed with the write and will mark the pte dirty.
>> >
>> >Later, the VM will write the page out.
>> >
>> >Later still, the pte will get cleaned by reclaim or by munmap or whatever
>> >and the page will be marked dirty and the page will again be written out. 
>> >Potentially needlessly.
>> >
>>
>> page_wrprotect also marks the page clean,
>>
>
>Oh.  I missed that when reading the comment which describes
>page_wrprotect() (I do go on).
>

I guess page_wrprotect isn't a good name, because it would suggest it
can be used in situations where it would cause data loss.

page_mappings_mkclean or page_mkclean might be better (the wrprotect
is just a side effect of the fact that clean,shared mappings are
readonly).

>
>>so this window is very small.
>> The window is that the fault path might set_page_dirty, then throttle
>> on writeout, and the page gets written out before it really gets dirtied
>> by the application (which then has to fault again).
>>
>
>: int test_clear_page_dirty(struct page *page)
>: {
>: 	struct address_space *mapping = page_mapping(page);
>: 	unsigned long flags;
>: 
>: 	if (mapping) {
>: 		write_lock_irqsave(&mapping->tree_lock, flags);
>: 		if (TestClearPageDirty(page)) {
>: 			radix_tree_tag_clear(&mapping->page_tree,
>: 						page_index(page),
>: 						PAGECACHE_TAG_DIRTY);
>: 			write_unlock_irqrestore(&mapping->tree_lock, flags);
>: 			/*
>: 			 * We can continue to use `mapping' here because the
>: 			 * page is locked, which pins the address_space
>: 			 */
>
>So if userspace modifies the page right here, and marks the pte dirty.
>
>: 			if (mapping_cap_account_dirty(mapping)) {
>: 				page_wrprotect(page);
>
>We just lost that pte dirty bit, and hence the user's data.
>
>: 				dec_page_state(nr_dirty);
>: 			}
>: 			return 1;
>: 		}
>: 		write_unlock_irqrestore(&mapping->tree_lock, flags);
>: 		return 0;
>: 	}
>: 	return TestClearPageDirty(page);
>: }
>: 
>
>Which is just the sort of subtle and nasty problem I was referring to...
>
>If that's correct then I guess we need the
>
>                if (ptep_clear_flush_dirty(vma, addr, pte) ||
>                                page_test_and_clear_dirty(page))
>                        ret += set_page_dirty(page);
>
>treatment in page_wrprotect().
>
>Now I suppose it's not really a dataloss race, because in practice the
>kernel is about to write this page to backing store anyway.  I guess.  I
>cannot immediately think of any clear_page_dirty() callers for whom that
>won't be true.
>

If they do a clear_page_dirty, then fail to clean the page, then fail to
subsequently set_page_dirty again, it is a data-loss situation anyway.

If they do set_page_dirty (which they have to, for correctness), then the
page has PG_dirty set again; true dirty bits are moved out of the ptes,
but that's no problem.
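
A toy model of that last case, where the dirty bit is moved out of the
ptes and into the page instead of being dropped. This is a userspace
sketch with invented names (toy_pte, toy_page, toy_page_mkclean(),
toy_mkclean_keep_dirty()), not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_pte {
	bool write;
	bool dirty;
};

struct toy_page {
	bool dirty;		/* stands in for PG_dirty */
	struct toy_pte *ptes;	/* every pte mapping this page */
	size_t nr_ptes;
};

/* Clean and write-protect every pte; return how many were dirty. */
static size_t toy_page_mkclean(struct toy_page *pg)
{
	size_t i, was_dirty = 0;

	for (i = 0; i < pg->nr_ptes; i++) {
		if (pg->ptes[i].dirty)
			was_dirty++;
		pg->ptes[i].dirty = false;
		pg->ptes[i].write = false;
	}
	return was_dirty;
}

/*
 * Variant discussed in this thread: if any pte was dirty, carry that
 * over to the page so a racing store is not forgotten.  Returns the
 * resulting page dirty state.
 */
static bool toy_mkclean_keep_dirty(struct toy_page *pg)
{
	if (toy_page_mkclean(pg) > 0)
		pg->dirty = true;
	return pg->dirty;
}
```

If the caller ignored the return value of toy_page_mkclean(), a store that
raced in after the page flag was cleared would leave no trace anywhere,
which is the window Andrew is worried about.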


^ permalink raw reply

* Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4
From: Christoph Lameter @ 2006-05-12  4:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, piggin, torvalds, ak, rohitseth, mbligh, hugh,
	riel, andrea, arjan, apw, mel, marcelo, anton, paulmck, linux-mm
In-Reply-To: <20060511080220.48688b40.akpm@osdl.org>

On Thu, 11 May 2006, Andrew Morton wrote:

> So let's see.  We take a write fault, we mark the page dirty then we return
> to userspace which will proceed with the write and will mark the pte dirty.

The pte is marked dirty when the page is marked dirty.

> Later still, the pte will get cleaned by reclaim or by munmap or whatever
> and the page will be marked dirty and the page will again be written out. 
> Potentially needlessly.

But consistent with the way write() works on page buffers. It is rather 
surprising that one can dirty lots of mmapped pages and they are only 
written out when the process terminates.
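
An application that cannot wait for the kernel to notice dirty mmapped
pages can already ask for writeback explicitly with msync(2). A minimal
sketch follows; write_and_sync() is an illustrative helper name and the
file handling is deliberately simplistic.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create `path`, store `msg` into it through a MAP_SHARED mapping and
 * force the dirty page out with msync(MS_SYNC).
 * Returns 0 on success, -1 on any failure.
 */
static int write_and_sync(const char *path, const char *msg)
{
	size_t len = strlen(msg);
	int fd, ret = -1;
	char *map;

	if (len == 0)
		return -1;
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	/* The file must be long enough to back the mapping */
	if (ftruncate(fd, (off_t)len) != 0)
		goto out_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	memcpy(map, msg, len);		/* dirties the page via the mapping */
	if (msync(map, len, MS_SYNC) == 0)
		ret = 0;
	munmap(map, len);
out_close:
	close(fd);
	return ret;
}
```

That covers applications which know they care; the patch set under
discussion is about the kernel tracking those pages without being asked.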


^ permalink raw reply

* Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4
From: Andrew Morton @ 2006-05-12  4:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: a.p.zijlstra, clameter, torvalds, ak, rohitseth, mbligh, hugh,
	riel, andrea, arjan, apw, mel, marcelo, anton, paulmck, linux-mm
In-Reply-To: <4463EA16.5090208@cyberone.com.au>

Nick Piggin <piggin@cyberone.com.au> wrote:
>
>  >So let's see.  We take a write fault, we mark the page dirty then we return
>  >to userspace which will proceed with the write and will mark the pte dirty.
>  >
>  >Later, the VM will write the page out.
>  >
>  >Later still, the pte will get cleaned by reclaim or by munmap or whatever
>  >and the page will be marked dirty and the page will again be written out. 
>  >Potentially needlessly.
>  >
> 
>  page_wrprotect also marks the page clean,

Oh.  I missed that when reading the comment which describes
page_wrprotect() (I do go on).

> so this window is very small.
>  The window is that the fault path might set_page_dirty, then throttle
>  on writeout, and the page gets written out before it really gets dirtied
>  by the application (which then has to fault again).

: int test_clear_page_dirty(struct page *page)
: {
: 	struct address_space *mapping = page_mapping(page);
: 	unsigned long flags;
: 
: 	if (mapping) {
: 		write_lock_irqsave(&mapping->tree_lock, flags);
: 		if (TestClearPageDirty(page)) {
: 			radix_tree_tag_clear(&mapping->page_tree,
: 						page_index(page),
: 						PAGECACHE_TAG_DIRTY);
: 			write_unlock_irqrestore(&mapping->tree_lock, flags);
: 			/*
: 			 * We can continue to use `mapping' here because the
: 			 * page is locked, which pins the address_space
: 			 */

So if userspace modifies the page right here, and marks the pte dirty.

: 			if (mapping_cap_account_dirty(mapping)) {
: 				page_wrprotect(page);

We just lost that pte dirty bit, and hence the user's data.

: 				dec_page_state(nr_dirty);
: 			}
: 			return 1;
: 		}
: 		write_unlock_irqrestore(&mapping->tree_lock, flags);
: 		return 0;
: 	}
: 	return TestClearPageDirty(page);
: }
: 

Which is just the sort of subtle and nasty problem I was referring to...

If that's correct then I guess we need the

                if (ptep_clear_flush_dirty(vma, addr, pte) ||
                                page_test_and_clear_dirty(page))
                        ret += set_page_dirty(page);

treatment in page_wrprotect().
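
For illustration only (not part of the original message): the interleaving described above can be replayed in a small user-space model. This is a sketch under stated assumptions. The toy_* types and functions are hypothetical stand-ins, not kernel structures, and toy_wrprotect_lossy() models page_wrprotect() only as this mail describes it, not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for a pte and a struct page; NOT the kernel's types. */
struct toy_pte {
	bool dirty;
	bool writable;
};

struct toy_page {
	bool pg_dirty;		/* models PG_dirty */
	struct toy_pte pte;	/* a single mapping of the page */
};

/* A userspace store through a writable pte sets the pte dirty bit. */
void toy_user_store(struct toy_page *page)
{
	assert(page->pte.writable);
	page->pte.dirty = true;
}

/* Write-protect and clean the pte, discarding its dirty bit. */
void toy_wrprotect_lossy(struct toy_page *page)
{
	page->pte.writable = false;
	page->pte.dirty = false;
}

/* Same, but a dirty pte bit is transferred to the page first. */
void toy_wrprotect_safe(struct toy_page *page)
{
	bool was_dirty = page->pte.dirty;

	page->pte.writable = false;
	page->pte.dirty = false;
	if (was_dirty)
		page->pg_dirty = true;	/* models set_page_dirty() */
}

/*
 * Replay the race: PG_dirty is cleared, the store lands, and only then
 * is the pte write-protected.  Returns true if the modification is
 * still recorded somewhere afterwards.
 */
bool toy_race_keeps_dirty(void (*wrprotect)(struct toy_page *))
{
	struct toy_page page = {
		.pg_dirty = true,
		.pte = { .dirty = false, .writable = true },
	};

	page.pg_dirty = false;	/* models TestClearPageDirty() */
	toy_user_store(&page);	/* userspace modifies the page "right here" */
	wrprotect(&page);	/* models page_wrprotect() */

	return page.pg_dirty || page.pte.dirty;
}
```

In this model toy_race_keeps_dirty(toy_wrprotect_lossy) returns false and toy_race_keeps_dirty(toy_wrprotect_safe) returns true. That shows only the bookkeeping difference; it says nothing about whether the real window is reachable.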

Now I suppose it's not really a dataloss race, because in practice the
kernel is about to write this page to backing store anyway.  I guess.  I
cannot immediately think of any clear_page_dirty() callers for whom that
won't be true.

Someone please convince me that this has all been thought about and is solid
as a rock.


