RE: la la la la ... swappiness

* RE: la la la la ... swappiness
@ 2006-12-03  6:18 Aucoin
  0 siblings, 0 replies; 54+ messages in thread
From: Aucoin @ 2006-12-03  6:18 UTC (permalink / raw)
  To: akpm, torvalds, linux-kernel, clameter

Reformatted as plain text.

________________________________________
From: Aucoin [mailto:Aucoin@Houston.RR.com] 
Sent: Sunday, December 03, 2006 12:17 AM
To: 'akpm@osdl.org'; 'torvalds@osdl.org'; 'linux-kernel@vger.kernel.org';
'clameter@sgi.com'
Subject: la la la la ... swappiness

I set swappiness to zero and it doesn't do what I want!

I have a system that runs as a Linux based data server 24x7 and occasionally
I need to apply an update or patch. It's a BIIIG patch to the tune of
several hundred megabytes, let's say 600MB for a good round number. The
server software itself runs on very tight memory boundaries, I've
preallocated a large chunk of memory that is shared amongst several
processes as a form of application cache, there is barely 15% spare memory
floating around.

The update is delivered to the server as a tar file. In order to minimize
down time I untar this update and verify the contents landed correctly
before switching over to the updated software.

The problem is when I attempt to untar the payload disk I/O starts caching,
the inactive page count reels wildly out of control, the system starts
swapping, OOM fires and there goes my 4 9's uptime. My system just suffered
a catastrophic failure because I can't control pagecache due to disk I/O.
I need a pagecache throttle, what do you suggest?

* Re: la la la la ... swappiness
       [not found] <200612030616.kB36GYBs019873@ms-smtp-03.texas.rr.com>
@ 2006-12-03  8:08 ` Andrew Morton
  2006-12-03 15:40   ` Aucoin
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-03  8:08 UTC (permalink / raw)
  To: Aucoin; +Cc: torvalds, linux-kernel, clameter

> On Sun, 3 Dec 2006 00:16:38 -0600 "Aucoin" <Aucoin@Houston.RR.com> wrote:
> I set swappiness to zero and it doesn't do what I want!
> 
> I have a system that runs as a Linux based data server 24x7 and occasionally
> I need to apply an update or patch. It's a BIIIG patch to the tune of
> several hundred megabytes, let's say 600MB for a good round number. The
> server software itself runs on very tight memory boundaries, I've
> preallocated a large chunk of memory that is shared amongst several
> processes as a form of application cache, there is barely 15% spare memory
> floating around.
> 
> The update is delivered to the server as a tar file. In order to minimize
> down time I untar this update and verify the contents landed correctly
> before switching over to the updated software.
> 
> The problem is when I attempt to untar the payload disk I/O starts caching,
> the inactive page count reels wildly out of control, the system starts
> swapping, OOM fires and there goes my 4 9's uptime. My system just suffered
> a catastrophic failure because I can't control pagecache due to disk I/O.

kernel version?

> I need a pagecache throttle, what do you suggest?

Don't set swappiness to zero...   Leaving it at the default should avoid
the oom-killer.

* RE: la la la la ... swappiness
  2006-12-03  8:08 ` Andrew Morton
@ 2006-12-03 15:40   ` Aucoin
  2006-12-03 20:46     ` Tim Schmielau
  0 siblings, 1 reply; 54+ messages in thread
From: Aucoin @ 2006-12-03 15:40 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: torvalds, linux-kernel, clameter

Thanks for the reply! I'll buy one of your books!

2.6.16.28 SMP

The application is an "embedded", headless system and we've pretty much laid
memory out the way we want it, the only rogue player is the tar update
process. A little bit of swapping is fine but enough swapping to irritate
OOM is not desirable. Yes, the swap is only 500MB but this is a purpose
built system, there are no random user apps started and stopped so
absolutely nothing swaps until the update process runs.

Here's meminfo from an idle system, on a loaded system the machine locks up
now since I've disabled OOM trying to prevent the imminent crash. I got
desperate and not only set swappiness to zero but I've also tried setting
the dirty ratios down as low as 1, the centisecs as low as 1 and cache
pressure as high as 9999. I'm thrashing and running out of dials to turn.

With the ridiculous settings above dirty pages porpoise between 0-20K, with
more reasonable settings they porpoise between 10-40K but it seems to be the
inactive page count that is killing me.

before tar extraction
MemTotal:      2075152 kB
MemFree:        502916 kB
Buffers:          2272 kB
Cached:           7180 kB
SwapCached:          0 kB
Active:         118792 kB
Inactive:         1648 kB
HighTotal:     1179392 kB
HighFree:         3040 kB
LowTotal:       895760 kB
LowFree:        499876 kB
SwapTotal:      524276 kB
SwapFree:       524276 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:         116720 kB
Slab:            27956 kB
CommitLimit:    557376 kB
Committed_AS:   903912 kB
PageTables:       1340 kB
VmallocTotal:   114680 kB
VmallocUsed:      1000 kB
VmallocChunk:   113584 kB
HugePages_Total:   345
HugePages_Free:      0
Hugepagesize:     4096 kB

during tar extraction ... inactive pages reach levels as high as ~375000
MemTotal:      2075152 kB
MemFree:        256316 kB
Buffers:          2944 kB
Cached:         247228 kB
SwapCached:          0 kB
Active:         159652 kB
Inactive:       201608 kB
HighTotal:     1179392 kB
HighFree:         1652 kB
LowTotal:       895760 kB
LowFree:        254664 kB
SwapTotal:      524276 kB
SwapFree:       523932 kB
Dirty:           16068 kB
Writeback:           0 kB
Mapped:         116952 kB
Slab:            34864 kB
CommitLimit:    557376 kB
Committed_AS:   904196 kB
PageTables:       1352 kB
VmallocTotal:   114680 kB
VmallocUsed:      1000 kB
VmallocChunk:   113584 kB
HugePages_Total:   345
HugePages_Free:      0
Hugepagesize:     4096 kB

even after the tar has been complete for a couple minutes.
MemTotal:      2075152 kB
MemFree:        169848 kB
Buffers:          4360 kB
Cached:         334824 kB
SwapCached:          0 kB
Active:         178692 kB
Inactive:       271452 kB
HighTotal:     1179392 kB
HighFree:         1652 kB
LowTotal:       895760 kB
LowFree:        168196 kB
SwapTotal:      524276 kB
SwapFree:       523932 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:         116716 kB
Slab:            31868 kB
CommitLimit:    557376 kB
Committed_AS:   903908 kB
PageTables:       1340 kB
VmallocTotal:   114680 kB
VmallocUsed:      1000 kB
VmallocChunk:   113584 kB
HugePages_Total:   345
HugePages_Free:      0
Hugepagesize:     4096 kB
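
The snapshots above can be compared mechanically. What follows is a minimal
sketch, assuming a POSIX shell with awk; the helper name meminfo_field and the
temporary files are illustrative only, and the numbers are copied from the
"before" and "after" snapshots above. It suggests that the drop in MemFree is
almost entirely accounted for by growth in Cached.

```shell
#!/bin/sh
# Print the value (in kB) of one field from a meminfo-style file.
# meminfo_field is a hypothetical helper, not a standard tool.
meminfo_field() {
    # $1 = field name without the colon, $2 = file (default /proc/meminfo)
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

before=$(mktemp)
after=$(mktemp)

# Values copied from the "before tar extraction" snapshot.
cat > "$before" <<'EOF'
MemFree:        502916 kB
Cached:           7180 kB
Inactive:         1648 kB
EOF

# Values copied from the snapshot taken after the tar completed.
cat > "$after" <<'EOF'
MemFree:        169848 kB
Cached:         334824 kB
Inactive:       271452 kB
EOF

free_drop=$(( $(meminfo_field MemFree "$before") - $(meminfo_field MemFree "$after") ))
cache_growth=$(( $(meminfo_field Cached "$after") - $(meminfo_field Cached "$before") ))

echo "MemFree dropped by ${free_drop} kB"
echo "Cached grew by ${cache_growth} kB"

rm -f "$before" "$after"
```

On a live system the same helper can be pointed at /proc/meminfo by omitting
the second argument.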

* RE: la la la la ... swappiness
  2006-12-03 15:40   ` Aucoin
@ 2006-12-03 20:46     ` Tim Schmielau
  2006-12-03 23:56       ` Aucoin
  0 siblings, 1 reply; 54+ messages in thread
From: Tim Schmielau @ 2006-12-03 20:46 UTC (permalink / raw)
  To: Aucoin; +Cc: 'Andrew Morton', torvalds, linux-kernel, clameter

On Sun, 3 Dec 2006, Aucoin wrote:

> during tar extraction ... inactive pages reach levels as high as ~375000

So why do you want the system to swap _less_? You need to find some free 
memory for the additional processes to run in, and you have lots of 
inactive pages, so I think you want to swap out _more_ pages.

I'd suggest temporarily adding a swapfile before you update your system. 
This can even help bring your memory use back to its previous state if you 
do it like this:
  - swapon additional swapfile
  - update your database software
  - swapoff swap partition
  - swapon swap partition
  - swapoff additional swapfile
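
The sequence above could be scripted roughly as follows. This is only a
dry-run sketch under stated assumptions: the swapfile path, its 512MB size,
the swap partition name /dev/sda2, and the tar command line are all
placeholders, and the helper names run and swap_shuffle are made up for
illustration. With DRY_RUN=1 (the default here) it only prints the commands;
actually running them requires root.

```shell
#!/bin/sh
# Dry-run sketch of the swap shuffle described above.
# SWAPFILE and SWAPDEV are assumed names; adjust them for the real system.
SWAPFILE=${SWAPFILE:-/var/tmp/update.swap}
SWAPDEV=${SWAPDEV:-/dev/sda2}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print commands, anything else = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Arguments: the update command to run while the extra swap is active.
swap_shuffle() {
    run dd if=/dev/zero of="$SWAPFILE" bs=1024k count=512  # assumed 512MB file
    run chmod 600 "$SWAPFILE"
    run mkswap "$SWAPFILE"
    run swapon "$SWAPFILE"      # swapon additional swapfile
    run "$@"                    # update the software
    run swapoff "$SWAPDEV"      # swapoff swap partition
    run swapon "$SWAPDEV"       # swapon swap partition
    run swapoff "$SWAPFILE"     # swapoff additional swapfile
    run rm -f "$SWAPFILE"
}

swap_shuffle tar -xf update.tar -C /staging
```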

Tim

* RE: la la la la ... swappiness
  2006-12-03 20:46     ` Tim Schmielau
@ 2006-12-03 23:56       ` Aucoin
  2006-12-04  0:57         ` Horst H. von Brand
                           ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Aucoin @ 2006-12-03 23:56 UTC (permalink / raw)
  To: 'Tim Schmielau'
  Cc: 'Andrew Morton', torvalds, linux-kernel, clameter

We want it to swap less for this particular operation because it is low
priority compared to the rest of what's going on inside the box.

We've considered both artificially manipulating swap on the fly similar to
your suggestion as well as a parallel thread that pumps a 3 into drop_caches
every few seconds while the update is running, but these seem too much like
hacks for our liking. Mind you, if we don't have a choice we'll do what we
need to get the job done but there's a nagging voice in our conscience that
says keep looking for a more elegant solution and work *with* the kernel
rather than working against it or trying to trick it into doing what we
want. 

We've already disabled OOM so we can at least keep our testing alive while
searching for a more elegant solution. Although we want to avoid swap in
this particular instance for this particular reason, in our hearts we agree
with Andrew that swap can be your friend and get you out of a jam once in a
while. Even more, we'd like to leave OOM active if we can because we want to
be told when somebody's not being a good memory citizen.

Some background, what we've done is carve up a huge chunk of memory that is
shared between three resident processes as write cache for a proprietary
block system layout that is part of a scalable storage architecture
currently capable of RAID 0, 1, 5 (soon 6) virtualized across multiple
chassis, essentially treating each machine as a "disk" and providing
multipath I/O to multiple iSCSI targets as part of a grid/array storage
solution. Whew! We also have a version that leverages a battery backed write
cache for higher performance at an additional cost. This software is
installable on any commodity platform with 4-N disks supported by Linux,
I've even put it on an Optiplex with 4 simulated disks. Yawn ... yet another
iSCSI storage solution, but this one scales linearly in capacity as well as
performance. As such, we have no user level apps on the boxes and precious
little disk to spare for additional swap so our version of the swap
manipulation solution is to turn swap completely off for the duration of the
update.

I hope I haven't muddied things up even more but basically what we want to
do is find a way to limit the number of cached pages for disk I/O on the OS
filesystem, even if it drastically slows down the untar and verify process
because the disk I/O we really care about is not on any of the OS
partitions.

Louis Aucoin

* Re: la la la la ... swappiness
  2006-12-03 23:56       ` Aucoin
@ 2006-12-04  0:57         ` Horst H. von Brand
  2006-12-04  4:56         ` Andrew Morton
  2006-12-04 10:43         ` Nick Piggin
  2 siblings, 0 replies; 54+ messages in thread
From: Horst H. von Brand @ 2006-12-04  0:57 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Tim Schmielau', 'Andrew Morton', torvalds,
	linux-kernel, clameter

Aucoin <Aucoin@Houston.RR.com> wrote:
> We want it to swap less for this particular operation because it is low
> priority compared to the rest of what's going on inside the box.

Swapping is not a per-"operation" thing; it is global for /everything/ that is
going on in the box. And having it swap less means assigning it more RAM,
i.e., giving it higher (not lower) priority than other stuff happening at
the same time.

I guess I don't understand what your needs are (not what you want to do to
get there).
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
Casilla 110-V, Valparaiso, Chile               Fax:  +56 32 2797513

* RE: la la la la ... swappiness
@ 2006-12-04  1:54 Aucoin
  2006-12-04  4:59 ` Andrew Morton
  2006-12-04  7:22 ` Kyle Moffett
  0 siblings, 2 replies; 54+ messages in thread
From: Aucoin @ 2006-12-04  1:54 UTC (permalink / raw)
  To: Aucoin, 'Tim Schmielau'
  Cc: 'Andrew Morton', torvalds, linux-kernel, clameter

I should also have made it clear that under a full load OOM kills critical
data moving processes because of (what appears to be) the out of control
memory consumption by disk I/O cache related to the tar.

As a side note, even now, *hours* after the tar has completed and even
though I have swappiness set to 0, cache pressure set to 9999, all dirty
timeouts set to 1 and all dirty ratios set to 1, I still have a 360+K
inactive page count and my "free" memory is less than 10% of normal. I'm not
pretending to understand what's happening here but shouldn't some kind of
expiration have kicked in by now and freed up all those inactive pages? The
*instant* I manually push a "3" into drop_caches I have 100% of my normal
free memory and the inactive page count drops below 2K. Maybe I completely
misunderstood the purpose of all those dials but I really did get the
feeling that twisting them all tight would make the housekeeping algorithms
more aggressive.

What, if anything, besides manually echoing a "3" to drop_caches will cause
all those inactive pages to be put back on the free list ?

* Re: la la la la ... swappiness
  2006-12-03 23:56       ` Aucoin
  2006-12-04  0:57         ` Horst H. von Brand
@ 2006-12-04  4:56         ` Andrew Morton
  2006-12-04  5:13           ` Linus Torvalds
  2006-12-04 10:43         ` Nick Piggin
  2 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-04  4:56 UTC (permalink / raw)
  To: Aucoin; +Cc: 'Tim Schmielau', torvalds, linux-kernel, clameter

On Sun, 3 Dec 2006 17:56:30 -0600
"Aucoin" <Aucoin@Houston.RR.com> wrote:

> I hope I haven't muddied things up even more but basically what we want to
> do is find a way to limit the number of cached pages for disk I/O on the OS
> filesystem, even if it drastically slows down the untar and verify process
> because the disk I/O we really care about is not on any of the OS
> partitions.

Try mounting that fs with `-o sync'.

* Re: la la la la ... swappiness
  2006-12-04  1:54 la la la la ... swappiness Aucoin
@ 2006-12-04  4:59 ` Andrew Morton
  2006-12-04  7:22 ` Kyle Moffett
  1 sibling, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-04  4:59 UTC (permalink / raw)
  To: Aucoin; +Cc: 'Tim Schmielau', torvalds, linux-kernel, clameter

On Sun, 3 Dec 2006 19:54:41 -0600
"Aucoin" <Aucoin@Houston.RR.com> wrote:

> What, if anything, besides manually echoing a "3" to drop_caches will cause
> all those inactive pages to be put back on the free list ?

There is no reason for the kernel to do that - a clean, inactive page is
immediately reclaimable on demand.

* Re: la la la la ... swappiness
  2006-12-04  4:56         ` Andrew Morton
@ 2006-12-04  5:13           ` Linus Torvalds
  2006-12-04 17:03             ` Christoph Lameter
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2006-12-04  5:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Aucoin, 'Tim Schmielau', linux-kernel, clameter



On Sun, 3 Dec 2006, Andrew Morton wrote:

> On Sun, 3 Dec 2006 17:56:30 -0600
> "Aucoin" <Aucoin@Houston.RR.com> wrote:
> 
> > I hope I haven't muddied things up even more but basically what we want to
> > do is find a way to limit the number of cached pages for disk I/O on the OS
> > filesystem, even if it drastically slows down the untar and verify process
> > because the disk I/O we really care about is not on any of the OS
> > partitions.
> 
> Try mounting that fs with `-o sync'.

Wouldn't it be much nicer to just lower the dirty-page limit?

	echo 1 > /proc/sys/vm/dirty_background_ratio
	echo 2 > /proc/sys/vm/dirty_ratio

or something. Which we already discussed in another thread and almost 
already decided we should lower the values for big-mem machines..
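
If the lowered limits are only wanted while the update runs, the two writes
above could be wrapped so the previous values are restored afterwards. This is
a sketch, not a tested recipe: the function name with_low_dirty_limits and the
overridable VM_DIR variable are assumptions made for illustration. Note that
these knobs bound dirty (not yet written) pages only; clean cached pages are
not limited by them.

```shell
#!/bin/sh
# Sketch: lower the dirty-page limits only for the duration of one command,
# then restore the previous values. VM_DIR is overridable so the logic can be
# tried against a scratch directory instead of the live /proc files.
VM_DIR=${VM_DIR:-/proc/sys/vm}

with_low_dirty_limits() {
    # Remember the current settings so they can be put back afterwards.
    old_bg=$(cat "$VM_DIR/dirty_background_ratio") || return 1
    old_fg=$(cat "$VM_DIR/dirty_ratio") || return 1

    echo 1 > "$VM_DIR/dirty_background_ratio"
    echo 2 > "$VM_DIR/dirty_ratio"

    "$@"            # the command to run under the lowered limits
    status=$?

    echo "$old_bg" > "$VM_DIR/dirty_background_ratio"
    echo "$old_fg" > "$VM_DIR/dirty_ratio"
    return $status
}
```

Run as root, a call might look like
"with_low_dirty_limits tar -xf update.tar -C /staging", where the tar
arguments are placeholders.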

Hmm?

		Linus

* Re: la la la la ... swappiness
  2006-12-04  1:54 la la la la ... swappiness Aucoin
  2006-12-04  4:59 ` Andrew Morton
@ 2006-12-04  7:22 ` Kyle Moffett
  2006-12-04 14:39   ` Aucoin
  2006-12-04 15:55   ` David Lang
  1 sibling, 2 replies; 54+ messages in thread
From: Kyle Moffett @ 2006-12-04  7:22 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Tim Schmielau', 'Andrew Morton', torvalds,
	linux-kernel, clameter

On Dec 03, 2006, at 20:54:41, Aucoin wrote:
> As a side note, even now, *hours* after the tar has completed and  
> even though I have swappiness set to 0, cache pressure set to 9999,  
> all dirty timeouts set to 1 and all dirty ratios set to 1, I still  
> have a 360+K inactive page count and my "free" memory is less than  
> 10% of normal.

The point you're missing is that an "inactive" page is a free page  
that happens to have known clean data on it corresponding to  
something on disk.  If you need to use the inactive page for  
something all you have to do is either zero it or fill it with data  
from elsewhere.  There is _no_ practical reason for the kernel to  
turn an "inactive" page into a "free" page.  On my Linux systems  
after heavy local-disk and network intensive read-only load I have no  
more than 2% "free" memory, most of the rest is "inactive" (in one  
case some 2GB of it).  There's nothing _wrong_ with that much  
"inactive" memory, it just means that you were using it for data at  
one point, then didn't need it anymore and haven't reused it since.
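
To put numbers on this, MemFree and Inactive can be added together. The
sketch below assumes a POSIX shell with awk and uses a made-up helper name,
free_plus_inactive; it also assumes every inactive page is clean and
reclaimable, which is optimistic, so the result is an upper bound rather than
a guarantee. The inputs are the MemFree and Inactive figures from the meminfo
snapshots posted earlier in the thread.

```shell
#!/bin/sh
# Sum MemFree and Inactive (both in kB) from meminfo-style input on stdin.
# Treating every inactive page as reclaimable is an optimistic assumption.
free_plus_inactive() {
    awk '$1 == "MemFree:" || $1 == "Inactive:" { sum += $2 }
         END { print sum + 0 }'
}

# Figures from the snapshot taken before the tar extraction.
printf 'MemFree: 502916 kB\nInactive: 1648 kB\n' | free_plus_inactive

# Figures from the snapshot taken after the tar completed.
printf 'MemFree: 169848 kB\nInactive: 271452 kB\n' | free_plus_inactive
```

With those figures the two sums are 504564 kB and 441300 kB, so most of the
"missing" free memory is still there in the form of inactive pages. On a live
system the function can be fed directly with
"free_plus_inactive < /proc/meminfo".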

> I'm not pretending to understand what's happening here but  
> shouldn't some kind of expiration have kicked in by now and freed  
> up all those inactive pages?

Nope; the pages will continue to contain valid data until you  
overwrite them with new data somehow.  Now, if they were "dirty"  
pages, containing unwritten data, then you would be correct.

> The *instant* I manually push a "3" into drop_caches I have 100% of  
> my normal free memory and the inactive page count drops below 2K.  
> Maybe I completely misunderstood the purpose of all those dials but  
> I really did get the feeling that twisting them all tight would  
> make the housekeeping algorithms more aggressive.

In this case you're telling the kernel to go beyond its normal  
housekeeping and delete perfectly good data from memory.  The only  
reason to do that is usually to make benchmarks mildly more  
repeatable and doing it on a regular basis tends to kill performance.

Cheers,
Kyle Moffett

> [copy of long previous email snipped]

PS: No need to put a copy of the entire message you are replying to  
at the end of your post, it just chews up space.  If anything please  
quote inline immediately before the appropriate portion of your reply  
so we can get the gist, much as I have done above.

* Re: la la la la ... swappiness
  2006-12-03 23:56       ` Aucoin
  2006-12-04  0:57         ` Horst H. von Brand
  2006-12-04  4:56         ` Andrew Morton
@ 2006-12-04 10:43         ` Nick Piggin
  2006-12-04 14:45           ` Aucoin
  2 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-12-04 10:43 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Tim Schmielau', 'Andrew Morton', torvalds,
	linux-kernel, clameter

Aucoin wrote:
> We want it to swap less for this particular operation because it is low
> priority compared to the rest of what's going on inside the box.
> 
> We've considered both artificially manipulating swap on the fly similar to
> your suggestion as well as a parallel thread that pumps a 3 into drop_caches
> every few seconds while the update is running, but these seem too much like
> hacks for our liking. Mind you, if we don't have a choice we'll do what we
> need to get the job done but there's a nagging voice in our conscience that
> says keep looking for a more elegant solution and work *with* the kernel
> rather than working against it or trying to trick it into doing what we
> want. 
> 
> We've already disabled OOM so we can at least keep our testing alive while
> searching for a more elegant solution. Although we want to avoid swap in
> this particular instance for this particular reason, in our hearts we agree
> with Andrew that swap can be your friend and get you out of a jam once in a
> while. Even more, we'd like to leave OOM active if we can because we want to
> be told when somebody's not being a good memory citizen.
> 
> Some background, what we've done is carve up a huge chunk of memory that is
> shared between three resident processes as write cache for a proprietary
> block system layout that is part of a scalable storage architecture
> currently capable of RAID 0, 1, 5 (soon 6) virtualized across multiple
> chassis, essentially treating each machine as a "disk" and providing
> multipath I/O to multiple iSCSI targets as part of a grid/array storage
> solution. Whew! We also have a version that leverages a battery backed write
> cache for higher performance at an additional cost. This software is
> installable on any commodity platform with 4-N disks supported by Linux,
> I've even put it on an Optiplex with 4 simulated disks. Yawn ... yet another
> iSCSI storage solution, but this one scales linearly in capacity as well as
> performance. As such, we have no user level apps on the boxes and precious
> little disk to spare for additional swap so our version of the swap
> manipulation solution is to turn swap completely off for the duration of the
> update.
> 
> I hope I haven't muddied things up even more but basically what we want to
> do is find a way to limit the number of cached pages for disk I/O on the OS
> filesystem, even if it drastically slows down the untar and verify process
> because the disk I/O we really care about is not on any of the OS
> partitions.

Hi Louis,

We had customers see similar incorrect OOM problems, so I sent in some
patches merged after 2.6.16. Can you upgrade to latest kernel? (otherwise
I guess backporting could be an option for you).

Basically the fixes are more conservative about going OOM if the kernel
thinks it can still reclaim some pages, and also allow the kernel to swap
as a last resort, even if swappiness is set to 0.

Once your OOM problems are solved, I think that page reclaim should do a
reasonable job at evicting the right pages with your simple untar
workload.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

* RE: la la la la ... swappiness
  2006-12-04  7:22 ` Kyle Moffett
@ 2006-12-04 14:39   ` Aucoin
  2006-12-04 16:10     ` Chris Friesen
  2006-12-04 17:07     ` Horst H. von Brand
  2006-12-04 15:55   ` David Lang
  1 sibling, 2 replies; 54+ messages in thread
From: Aucoin @ 2006-12-04 14:39 UTC (permalink / raw)
  To: 'Kyle Moffett'
  Cc: 'Tim Schmielau', 'Andrew Morton', torvalds,
	linux-kernel, clameter

 > PS: No need to put a copy of the entire message

Apologies for the lapse in protocol.

> The point you're missing is that an "inactive" page is a free 
> page that happens to have known clean data on it 

I understand now where the inactive page count is coming from.
I don't understand why there is no way for me to make the kernel
prefer to reclaim inactive pages before choosing swap.

> In this case you're telling the kernel to go beyond its 
> normal housekeeping and delete perfectly good data from 
> memory.  The only reason to do that is usually to make 

The definition of perfectly good here may be up for debate or
someone can explain it to me. This perfectly good data was
cached under the tar yet hours after the tar has completed the
pages are still cached.

* RE: la la la la ... swappiness
  2006-12-04 10:43         ` Nick Piggin
@ 2006-12-04 14:45           ` Aucoin
  2006-12-04 15:04             ` Nick Piggin
  0 siblings, 1 reply; 54+ messages in thread
From: Aucoin @ 2006-12-04 14:45 UTC (permalink / raw)
  To: 'Nick Piggin'
  Cc: 'Tim Schmielau', 'Andrew Morton', torvalds,
	linux-kernel, clameter

> From: Nick Piggin [mailto:nickpiggin@yahoo.com.au]
> We had customers see similar incorrect OOM problems, so I sent in some
> patches merged after 2.6.16. Can you upgrade to latest kernel? (otherwise
> I guess backporting could be an option for you).

I will raise the question of moving the kernel forward one more time before
release. Can you point me to the patches you mentioned?

* Re: la la la la ... swappiness
  2006-12-04 14:45           ` Aucoin
@ 2006-12-04 15:04             ` Nick Piggin
  2006-12-05  4:02               ` Aucoin
  0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2006-12-04 15:04 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Tim Schmielau', 'Andrew Morton', torvalds,
	linux-kernel, clameter

Aucoin wrote:
>>From: Nick Piggin [mailto:nickpiggin@yahoo.com.au]
>>We had customers see similar incorrect OOM problems, so I sent in some
>>patches merged after 2.6.16. Can you upgrade to latest kernel? (otherwise
>>I guess backporting could be an option for you).
> 
> 
> I will raise the question of moving the kernel forward one more time before
> release. Can you point me to the patches you mentioned?

These two are the main ones:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=408d85441cd5a9bd6bc851d677a10c605ed8db5f
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4ff1ffb4870b007b86f21e5f27eeb11498c4c077

They shouldn't be too hard to backport.

I'd be interested to know how OOM and page reclaim behaves after these patches
(or with a newer kernel).

Thanks,
Nick


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-04  7:22 ` Kyle Moffett
  2006-12-04 14:39   ` Aucoin
@ 2006-12-04 15:55   ` David Lang
  2006-12-04 17:42     ` Aucoin
  1 sibling, 1 reply; 54+ messages in thread
From: David Lang @ 2006-12-04 15:55 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Aucoin, 'Tim Schmielau', 'Andrew Morton',
	torvalds, linux-kernel, clameter

I think that I am seeing two separate issues here that are getting mixed up.

1. while doing the tar + patch the system is choosing to use memory for 
caching the tar (pushing program data out to swap to make room for the cache).

2. after the tar has completed the data remains in the cache.

the answer for #2 is the one that is being stated in the response below, namely 
that this shouldn't matter, the memory used for the inactive cache is just as 
good as free memory (i.e. it can be used immediately for other purposes with no 
swap needed), so the fact that it's inactive instead of free doesn't matter.

however the real problem that Aucoin is running into is #1, namely that when the 
patching process (tar, etc) kicks off the system is choosing to use its ram as 
a cache instead of leaving it in use for the processes that are running. if he 
manually forces the system to drop its cache (echoing 3 into drop_caches 
repeatedly during the run of the patch process) he is able to keep this under 
control.

from the documentation on swappiness it seems like setting it to 0 would do what 
he wants (tell the system not to swap out process memory to make room for more 
cache), but he's reporting that this is not working as expected.
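
[Editorial sketch] One possible reason swappiness 0 is not a hard guarantee: as best I recall, 2.6-era mm/vmscan.c adds swappiness to two other terms and starts reclaiming mapped pages once the sum reaches 100. The model below is reconstructed from memory, so the function names and the exact formula are assumptions, not a copy of the kernel source. The sample numbers are the MemTotal and nr_mapped figures Aucoin posts later in this thread.

```python
# Simplified model of the 2.6-era "reclaim mapped pages?" decision.
# Recalled from mm/vmscan.c; treat the details as assumptions.

def swap_tendency(nr_mapped, total_pages, swappiness, prev_priority):
    """Return the score that is compared against 100."""
    mapped_ratio = (nr_mapped * 100) // total_pages  # percent of RAM that is mapped
    distress = 100 >> prev_priority                  # 0 when calm, 100 at priority 0
    return mapped_ratio // 2 + distress + swappiness


def reclaims_mapped(nr_mapped, total_pages, swappiness, prev_priority):
    """True when the model would start unmapping (and so swapping) process pages."""
    return swap_tendency(nr_mapped, total_pages, swappiness, prev_priority) >= 100


if __name__ == "__main__":
    total_pages = 2075152 // 4   # MemTotal in kB converted to 4 kB pages
    nr_mapped = 33077            # nr_mapped from /proc/vmstat
    for prio in (12, 6, 1, 0):
        print(prio,
              swap_tendency(nr_mapped, total_pages, 0, prio),
              reclaims_mapped(nr_mapped, total_pages, 0, prio))
```

If this model is right, swappiness 0 only removes one term: once reclaim gets desperate enough for the priority to reach 0, the distress term alone pushes the score to 100.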

this is the same type of problem that people run into with the nightly updatedb 
run pushing inactive programs out of ram, making the system sluggish the next 
morning.

IIRC there is a flag that can be passed to open() that tells the system that 
the data is 'use once' and not to cache it. is it possible to do LD_PRELOAD 
tricks to force this parameter for all the programs that his patch script is 
using?

David Lang

On Mon, 4 Dec 2006, Kyle Moffett wrote:

> On Dec 03, 2006, at 20:54:41, Aucoin wrote:
>> As a side note, even now, *hours* after the tar has completed and even 
>> though I have swappiness set to 0, cache pressure set to 9999, all dirty 
>> timeouts set to 1 and all dirty ratios set to 1, I still have a 360+K 
>> inactive page count and my "free" memory is less than 10% of normal.
>
> The point you're missing is that an "inactive" page is a free page that 
> happens to have known clean data on it corresponding to something on disk. 
> If you need to use the inactive page for something all you have to do is 
> either zero it or fill it with data from elsewhere.  There is _no_ practical 
> reason for the kernel to turn an "inactive" page into a "free" page.  On my 
> Linux systems after heavy local-disk and network intensive read-only load I 
> have no more than 2% "free" memory, most of the rest is "inactive" (in one 
> case some 2GB of it).  There's nothing _wrong_ with that much "inactive" 
> memory, it just means that you were using it for data at one point, then 
> didn't need it anymore and haven't reused it since.
>
>> I'm not pretending to understand what's happening here but shouldn't some 
>> kind of expiration have kicked in by now and freed up all those inactive 
>> pages?
>
> Nope; the pages will continue to contain valid data until you overwrite them 
> with new data somehow.  Now, if they were "dirty" pages, containing unwritten 
> data, then you would be correct.
>
>> The *instant* I manually push a "3" into drop_caches I have 100% of my 
>> normal free memory and the inactive page count drops below 2K. Maybe I 
>> completely misunderstood the purpose of all those dials but I really did 
>> get the feeling that twisting them all tight would make the housekeeping 
>> algorithms more aggressive.
>
> In this case you're telling the kernel to go beyond its normal housekeeping 
> and delete perfectly good data from memory.  The only reason to do that is 
> usually to make benchmarks mildly more repeatable and doing it on a regular 
> basis tends to kill performance.
>
> Cheers,
> Kyle Moffett
>
>> [copy of long previous email snipped]
>
> PS: No need to put a copy of the entire message you are replying to at the 
> end of your post, it just chews up space.  If anything please quote inline 
> immediately before the appropriate portion of your reply so we can get the 
> gist, much as I have done above.
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-04 14:39   ` Aucoin
@ 2006-12-04 16:10     ` Chris Friesen
  2006-12-04 17:07     ` Horst H. von Brand
  1 sibling, 0 replies; 54+ messages in thread
From: Chris Friesen @ 2006-12-04 16:10 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Kyle Moffett', 'Tim Schmielau',
	'Andrew Morton', torvalds, linux-kernel, clameter

Aucoin wrote:

> The definition of perfectly good here may be up for debate or
> someone can explain it to me. This perfectly good data was
> cached under the tar yet hours after the tar has completed the
> pages are still cached.

If nothing else has asked for that memory since the tar, there is no 
reason to evict the pages from the cache.  The inactive memory is 
basically "free, but still contains the previous data".

If anything asks for memory, those pages will be filled with zeros or 
the new information.  In the meantime, the kernel keeps them in the 
cache in case anyone wants the old information.

It doesn't hurt anything to keep the pages around with the old data in 
them--and it might help.

Chris

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-04  5:13           ` Linus Torvalds
@ 2006-12-04 17:03             ` Christoph Lameter
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Lameter @ 2006-12-04 17:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Aucoin, 'Tim Schmielau', linux-kernel, pj

On Sun, 3 Dec 2006, Linus Torvalds wrote:

> Wouldn't it be much nicer to just lower the dirty-page limit?
> 
> 	echo 1 > /proc/sys/vm/dirty_background_ratio
> 	echo 2 > /proc/sys/vm/dirty_ratio

Dirty ratio cannot be set to less than 5%. See 
mm/page-writeback.c:get_dirty_limits().

> or something. Which we already discussed in another thread and almost 
> already decided we should lower the values for big-mem machines..

We also have an issue with cpusets. Dirty page throttling does not work in 
a cpuset if it is relatively small to the total memory on the system since 
we calculate the percentage of the total memory and not a percentage of 
the memory the process is allowed to use.
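
[Editorial sketch] The two points above can be put into numbers with a deliberately simplified model. It assumes the limit is just a percentage of total memory with the 5% floor Christoph cites; the real get_dirty_limits() takes more into account. The 64 GB machine and 2 GB cpuset are hypothetical values chosen for illustration.

```python
MIN_DIRTY_RATIO = 5  # floor cited above from get_dirty_limits()


def dirty_limit_kb(total_kb, dirty_ratio):
    """Dirty-page limit in kB when the ratio is applied to total memory."""
    ratio = max(dirty_ratio, MIN_DIRTY_RATIO)  # requests below the floor are raised
    return total_kb * ratio // 100


def throttling_can_trigger(total_kb, cpuset_kb, dirty_ratio):
    """A task confined to cpuset_kb cannot dirty more than cpuset_kb,
    so throttling only bites if the system-wide limit is smaller than that."""
    return dirty_limit_kb(total_kb, dirty_ratio) < cpuset_kb


if __name__ == "__main__":
    # On the 2 GB box from this thread, asking for 2% still yields 5%.
    print(dirty_limit_kb(2075152, 2))
    # Hypothetical 64 GB machine with a 2 GB cpuset: the limit exceeds the cpuset.
    print(throttling_can_trigger(64 * 1024 * 1024, 2 * 1024 * 1024, 2))
```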

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-04 14:39   ` Aucoin
  2006-12-04 16:10     ` Chris Friesen
@ 2006-12-04 17:07     ` Horst H. von Brand
  2006-12-04 17:49       ` Aucoin
  2006-12-04 18:06       ` Andrew Morton
  1 sibling, 2 replies; 54+ messages in thread
From: Horst H. von Brand @ 2006-12-04 17:07 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Kyle Moffett', 'Tim Schmielau',
	'Andrew Morton', torvalds, linux-kernel, clameter

Aucoin <Aucoin@Houston.RR.com> wrote:

[...]

> The definition of perfectly good here may be up for debate or
> someone can explain it to me. This perfectly good data was
> cached under the tar yet hours after the tar has completed the
> pages are still cached.

That means that there isn't a need for that memory at all (and so they stay
around; why actively delete data (using up resources!) needlessly when it
would be a win to have them around in the (admittedly remote) case they'll
be needed again?), or the whole memory handling in Linux is very broken.
I'd vote for the former, i.e., your problems have nothing to do with memory
pressure and swapping. That would explain why your maneuvers didn't make a
difference...

In any case, how do you know it is the tar data that stays around, and not
just that the number of pages "in use" stays roughly constant?

Please explain again:

- What you are doing, step by step
- What are your exact requirements
- In what exact way is it misbehaving. Please tell /in detail/ how you
  determine the real behaviour, not your deductions.

[Yes, I'm in my "dense" day today.]
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
Casilla 110-V, Valparaiso, Chile               Fax:  +56 32 2797513

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-04 15:55   ` David Lang
@ 2006-12-04 17:42     ` Aucoin
  0 siblings, 0 replies; 54+ messages in thread
From: Aucoin @ 2006-12-04 17:42 UTC (permalink / raw)
  To: 'David Lang', 'Kyle Moffett'
  Cc: 'Tim Schmielau', 'Andrew Morton', torvalds,
	linux-kernel, clameter

> From: David Lang [mailto:dlang@digitalinsight.com]
> I think that I am seeing two separate issues here that are getting mixed
> up.

Fair enough.

> however the real problem that Aucoin is running into is patching process
> (tar, etc) kicks off the system is choosing to use its

First name Louis, yes but we haven't resorted to echoing 3 in a loop at
drop_caches yet.

> from the documentation on swappiness it seems like setting it to 0 would
> do what he wants

That's what I thought, but some responses would seem to indicate that these
two "types" of memory are completely independent of each other and
swappiness has no impact on the type that is currently annoying me. It just
doesn't seem like a fair way to run a kernel when you have a dial to
control swappiness but then there's this rogue memory consumption that lives
outside the control of the swappiness dial and you end up swapping anyway.

> this is the same type of problem that people run into with the nightly
> updatedb

I would imagine so, yes. But take that example and instead of programs going
inactive overnight substitute programs that go inactive for only a few
seconds ... swap thrash, oom-killer, game over.

> IIRC there is a flag that can be passed to the open that tells the system
> that

I'll check into it.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-04 17:07     ` Horst H. von Brand
@ 2006-12-04 17:49       ` Aucoin
  2006-12-04 18:44         ` Tim Schmielau
  2006-12-04 18:46         ` Horst H. von Brand
  2006-12-04 18:06       ` Andrew Morton
  1 sibling, 2 replies; 54+ messages in thread
From: Aucoin @ 2006-12-04 17:49 UTC (permalink / raw)
  To: 'Horst H. von Brand'
  Cc: 'Kyle Moffett', 'Tim Schmielau',
	'Andrew Morton', torvalds, linux-kernel, clameter

> From: Horst H. von Brand [mailto:vonbrand@inf.utfsm.cl]
> That means that there isn't a need for that memory at all (and so they

In the current isolated non-production, not actually bearing a load test
case yes. But if I can't get it to not swap on an idle system I have no hope
of avoiding OOM on a loaded system.

> In any case, how do you know it is the tar data that stays around, and not
> just that the number of pages "in use" stays roughly constant?

I'm not dumping the contents of memory so I don't.

> - What you are doing, step by step

Trying to deliver a high availability, linearly scalable, clustered iSCSI
storage solution that can be upgraded with minimum downtime.

> - What are your exact requirements

OOM not to kill anything.

> - In what exact way is it misbehaving. Please tell /in detail/ how you

OOM kills important stuff.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-04 17:07     ` Horst H. von Brand
  2006-12-04 17:49       ` Aucoin
@ 2006-12-04 18:06       ` Andrew Morton
  2006-12-04 18:15         ` Christoph Lameter
  1 sibling, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-04 18:06 UTC (permalink / raw)
  To: Horst H. von Brand
  Cc: Aucoin, 'Kyle Moffett', 'Tim Schmielau', torvalds,
	linux-kernel, clameter

On Mon, 04 Dec 2006 14:07:22 -0300
"Horst H. von Brand" <vonbrand@inf.utfsm.cl> wrote:

> Please explain again:
> 
> - What you are doing, step by step

That 2GB machine apparently has a 1.6GB shm segment which is mlocked.  That will
cause the VM to do one heck of a lot of pointless scanning and could, I guess,
cause false oom decisions.  It's also an ia32 highmem machine, which adds to the
fun.

We could scan more:

--- a/mm/vmscan.c~a
+++ a/mm/vmscan.c
@@ -918,6 +918,7 @@ static unsigned long shrink_zone(int pri
 	 * slowly sift through the active list.
 	 */
 	zone->nr_scan_active += (zone->nr_active >> priority) + 1;
+	zone->nr_scan_active *= 2;
 	nr_active = zone->nr_scan_active;
 	if (nr_active >= sc->swap_cluster_max)
 		zone->nr_scan_active = 0;
@@ -925,6 +926,7 @@ static unsigned long shrink_zone(int pri
 		nr_active = 0;
 
 	zone->nr_scan_inactive += (zone->nr_inactive >> priority) + 1;
+	zone->nr_scan_inactive *= 2;
 	nr_inactive = zone->nr_scan_inactive;
 	if (nr_inactive >= sc->swap_cluster_max)
 		zone->nr_scan_inactive = 0;
_

but that's rather dumb.  Better would be to remove mlocked pages from the
LRU.
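
[Editorial sketch] To make the effect of the two added lines concrete, here is a small model of the counter update visible in the diff context above. The batch threshold of 32 stands in for sc->swap_cluster_max and is an assumption, as is the starting priority of 12. The list size is the Inactive figure from the meminfo posted later in this thread, converted to 4 kB pages.

```python
def shrink_pass(nr_scan, nr_pages, priority, swap_cluster_max=32, double=False):
    """One update of an accumulated scan counter, modelled on the diff above.

    Returns (pages_to_scan_now, carried_over_counter).
    """
    nr_scan += (nr_pages >> priority) + 1
    if double:
        nr_scan *= 2                    # the line the patch adds
    if nr_scan >= swap_cluster_max:
        return nr_scan, 0               # scan the whole batch, reset the counter
    return 0, nr_scan                   # too small: keep accumulating


if __name__ == "__main__":
    inactive_pages = 313332 // 4        # Inactive: 313332 kB
    print(shrink_pass(0, inactive_pages, 12))                # first pass, unpatched
    print(shrink_pass(20, inactive_pages, 12))               # second pass, unpatched
    print(shrink_pass(0, inactive_pages, 12, double=True))   # first pass, patched
```

In this model the patch simply reaches the batch threshold a pass earlier, which fits the description of it as a blunt instrument.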

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-04 18:06       ` Andrew Morton
@ 2006-12-04 18:15         ` Christoph Lameter
  2006-12-04 18:38           ` Jeffrey Hundstad
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-04 18:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Horst H. von Brand, Aucoin, 'Kyle Moffett',
	'Tim Schmielau', torvalds, linux-kernel, dcn

On Mon, 4 Dec 2006, Andrew Morton wrote:

> but that's rather dumb.  Better would be to remove mlocked pages from the
> LRU.

Could we generalize the removal of sections of a zone from the LRU? I 
believe this would help various buffer allocation schemes. We have some 
issues with heavy LRU scans if large buffers are allocated on some 
nodes.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-04 18:15         ` Christoph Lameter
@ 2006-12-04 18:38           ` Jeffrey Hundstad
  2006-12-04 21:25             ` Aucoin
  0 siblings, 1 reply; 54+ messages in thread
From: Jeffrey Hundstad @ 2006-12-04 18:38 UTC (permalink / raw)
  To: Aucoin
  Cc: Christoph Lameter, Andrew Morton, Horst H. von Brand,
	'Kyle Moffett', 'Tim Schmielau', torvalds,
	linux-kernel, dcn

Hello,

Please forgive me if this is naive.  It seems that you could recompile 
your tar and patch commands to use posix_fadvise(2) with the 
POSIX_FADV_NOREUSE flag.  It seems this would cause the tar and patch 
commands to not clutter the page cache at all.

It'd be nice to be able to make a wrapper out of this, kind of like the 
fakeroot(1) command, such as:

nocachesuck tar xvfz kernel.tar.gz

ya know what I mean?
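
[Editorial sketch] A minimal illustration of the read side of this idea, assuming a Linux system where Python's os.posix_fadvise is available. It uses POSIX_FADV_DONTNEED after each chunk instead of the POSIX_FADV_NOREUSE flag named above because, as far as I know, Linux kernels of this period accept NOREUSE but do not act on it. The helper name and chunk size are made up for the example.

```python
import os

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice


def read_without_caching(path, consume):
    """Stream `path` through consume(bytes), then advise the kernel that each
    chunk's page-cache pages are no longer needed. Returns the bytes read."""
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        while True:
            data = os.read(fd, CHUNK)
            if not data:
                break
            consume(data)
            # Advice only: the kernel is free to ignore it.
            os.posix_fadvise(fd, total, len(data), os.POSIX_FADV_DONTNEED)
            total += len(data)
    finally:
        os.close(fd)
    return total
```

A wrapper in the spirit of "nocachesuck" would have to apply the same call to every file the wrapped program touches, which is why an LD_PRELOAD shim or a patched tar is the usual route.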

-- 
Jeffrey Hundstad

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-04 17:49       ` Aucoin
@ 2006-12-04 18:44         ` Tim Schmielau
  2006-12-04 21:28           ` Aucoin
  2006-12-04 18:46         ` Horst H. von Brand
  1 sibling, 1 reply; 54+ messages in thread
From: Tim Schmielau @ 2006-12-04 18:44 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Horst H. von Brand', 'Kyle Moffett',
	'Andrew Morton', torvalds, linux-kernel, clameter

On Mon, 4 Dec 2006, Aucoin wrote:

> > From: Horst H. von Brand [mailto:vonbrand@inf.utfsm.cl]
> > That means that there isn't a need for that memory at all (and so they
> 
> In the current isolated non-production, not actually bearing a load test
> case yes. But if I can't get it to not swap on an idle system I have no hope
> of avoiding OOM on a loaded system.

I don't think that assumption is correct. If you have no load on your 
system and the pages in the shared application cache are not actually 
touched, it is perfectly reasonable for the kernel to push out these 
unused pages to swap space to have even more RAM available (e.g. for 
caching the pages more recently accessed by the tar and patch commands). 

I believe your OOM problem is not connected to these observations. There 
might be a problem in the handling of OOM situations in Linux. But before 
coming to that conclusion, I would suggest trying your simulated software 
upgrade scenario with plenty of swap space available and without playing
any tricks with MM settings.

Tim

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-04 17:49       ` Aucoin
  2006-12-04 18:44         ` Tim Schmielau
@ 2006-12-04 18:46         ` Horst H. von Brand
  2006-12-04 21:43           ` Aucoin
  1 sibling, 1 reply; 54+ messages in thread
From: Horst H. von Brand @ 2006-12-04 18:46 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Horst H. von Brand', 'Kyle Moffett',
	'Tim Schmielau', 'Andrew Morton', torvalds,
	linux-kernel, clameter

Aucoin <Aucoin@Houston.RR.com> wrote:
> From: Horst H. von Brand [mailto:vonbrand@inf.utfsm.cl]
> > That means that there isn't a need for that memory at all (and so they

> In the current isolated non-production, not actually bearing a load test
> case yes. But if I can't get it to not swap on an idle system I have no hope
> of avoiding OOM on a loaded system.

How do you /know/ it won't just be recycled in the production case?

> > In any case, how do you know it is the tar data that stays around, and not
> > just that the number of pages "in use" stays roughly constant?
> 
> I'm not dumping the contents of memory so I don't.

OK.

> > - What you are doing, step by step
> 
> Trying to deliver a high availability, linearly scalable, clustered iSCSI
> storage solution that can be upgraded with minimum downtime.

That is your ultimate goal, not what you are doing, step by step.

> > - What are your exact requirements

> OOM not to kill anything.

Can't ever guarantee that (unless you have the exact memory requirements
beforehand, and enough RAM for the worst case).

> > - In what exact way is it misbehaving. Please tell /in detail/ how you

> OOM kills important stuff.

What "important stuff"? How come OOM kills it, when there is plenty of
free(able) memory around? Is this in the production setting, or are you
just afraid it could happen by what you see in the "current isolated
non-production, not actually bearing a load test" case?
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
Casilla 110-V, Valparaiso, Chile               Fax:  +56 32 2797513




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
@ 2006-12-04 19:02 Al Boldi
  0 siblings, 0 replies; 54+ messages in thread
From: Al Boldi @ 2006-12-04 19:02 UTC (permalink / raw)
  To: linux-kernel

As a workaround try this:

echo 2 > /proc/sys/vm/overcommit_memory
echo 0 > /proc/sys/vm/overcommit_ratio

Hopefully someone can fix this intrinsic swap before drop behaviour.
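
[Editorial sketch] What these two settings would do to the commit limit can be estimated with the formula I recall from the kernel's overcommit accounting documentation: (RAM minus huge pages) times overcommit_ratio percent, plus swap. The inputs below are the MemTotal, SwapTotal and HugePages figures Aucoin posts later in this thread. With the default ratio of 50 the formula reproduces the CommitLimit in that same dump, which suggests it applies here.

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, hugepages_kb, overcommit_ratio):
    """Strict-overcommit limit in kB: ratio% of non-hugepage RAM plus all swap."""
    return (mem_total_kb - hugepages_kb) * overcommit_ratio // 100 + swap_total_kb


if __name__ == "__main__":
    mem_total = 2075152      # MemTotal, kB
    swap_total = 524276      # SwapTotal, kB
    hugepages = 345 * 4096   # HugePages_Total * Hugepagesize, kB
    print(commit_limit_kb(mem_total, swap_total, hugepages, 50))  # default ratio
    print(commit_limit_kb(mem_total, swap_total, hugepages, 0))   # ratio from above
```

With a ratio of 0 the limit collapses to the size of swap, which is below the Committed_AS of 980948 kB in that dump, so under strict accounting new allocations would likely start failing right away. Worth testing on a non-production box first.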


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-04 18:38           ` Jeffrey Hundstad
@ 2006-12-04 21:25             ` Aucoin
  2006-12-04 21:43               ` Andrew Morton
  0 siblings, 1 reply; 54+ messages in thread
From: Aucoin @ 2006-12-04 21:25 UTC (permalink / raw)
  To: 'Jeffrey Hundstad'
  Cc: 'Christoph Lameter', 'Andrew Morton',
	'Horst H. von Brand', 'Kyle Moffett',
	'Tim Schmielau', torvalds, linux-kernel, dcn

> From: Jeffrey Hundstad [mailto:jeffrey.hundstad@mnsu.edu]
> POSIX_FADV_NOREUSE flags.  It seems these would cause the tar and patch

I may be naïve as well, but that sounds interesting. Unless someone knows
of an obvious reason this won't work we can make a one-off tar command and
give it a whirl.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-04 18:44         ` Tim Schmielau
@ 2006-12-04 21:28           ` Aucoin
  0 siblings, 0 replies; 54+ messages in thread
From: Aucoin @ 2006-12-04 21:28 UTC (permalink / raw)
  To: 'Tim Schmielau'
  Cc: 'Horst H. von Brand', 'Kyle Moffett',
	'Andrew Morton', torvalds, linux-kernel, clameter

> From: Tim Schmielau [mailto:tim@physik3.uni-rostock.de]
> I believe your OOM problem is not connected to these observations. There

I don't know what to tell you except oom fires only when the update runs. I
know it's a pitiful datapoint so I'll work on getting more data.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-04 21:25             ` Aucoin
@ 2006-12-04 21:43               ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-04 21:43 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Jeffrey Hundstad', 'Christoph Lameter',
	'Horst H. von Brand', 'Kyle Moffett',
	'Tim Schmielau', torvalds, linux-kernel, dcn

On Mon, 4 Dec 2006 15:25:47 -0600
"Aucoin" <Aucoin@Houston.RR.com> wrote:

> > From: Jeffrey Hundstad [mailto:jeffrey.hundstad@mnsu.edu]
> > POSIX_FADV_NOREUSE flags.  It seems these would cause the tar and patch
> 
> I may be naïve as well, but that sounds interesting. Unless someone knows
> of an obvious reason this won't work we can make a one-off tar command and
> give it a whirl.
> 

Well if altering tar is an option then sure, a
sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) followed
by fadvise(POSIX_FADV_DONTNEED) will free the memory up again.
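
[Editorial sketch] The write-side sequence described above, illustrated in Python. Python's os module has no wrapper for sync_file_range(), so this sketch substitutes os.fdatasync(), which flushes the whole file rather than a range; that substitution is mine and is heavier than the call named above. The function name is hypothetical.

```python
import os


def write_and_drop(path, chunks):
    """Write each chunk, force it to disk, then advise the kernel to drop the
    now-clean pages from the page cache. Returns the total bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while view:                       # handle short writes
                n = os.write(fd, view)
                view = view[n:]
            start = written
            written += len(chunk)
            # Stand-in for sync_file_range(WRITE | WAIT_AFTER): dirty pages
            # cannot be dropped, so they must reach the disk first.
            os.fdatasync(fd)
            os.posix_fadvise(fd, start, len(chunk), os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return written
```

Syncing after every chunk trades throughput for a bounded page-cache footprint; larger chunks soften the cost.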

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-04 18:46         ` Horst H. von Brand
@ 2006-12-04 21:43           ` Aucoin
  0 siblings, 0 replies; 54+ messages in thread
From: Aucoin @ 2006-12-04 21:43 UTC (permalink / raw)
  To: 'Horst H. von Brand'
  Cc: 'Kyle Moffett', 'Tim Schmielau',
	'Andrew Morton', torvalds, linux-kernel, clameter

> From: Horst H. von Brand [mailto:vonbrand@inf.utfsm.cl]
> How do you /know/ it won't just be recycled in the production case?

In the production case is when oom fires and kills things. I can only assume
memory is not being freed fast enough otherwise oom wouldn't get so upset.

> That is your ultimate goal, not what you are doing, step by step.

It's 1/2+ million lines of code, there are a lot of steps. Other than saying
we create a 1.6GB shared memory segment up front, then load the high
availability iSCSI application, start I/O with some number of clients and
then launch an update. I'm not sure what detail you're looking for. Linus
seems to have the best summary of the problem so far saying that we have a
2GB system with 1.6GB dedicated and we want the OS to pretend there's only
400MB of memory.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-04 15:04             ` Nick Piggin
@ 2006-12-05  4:02               ` Aucoin
  2006-12-05  4:46                 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Aucoin @ 2006-12-05  4:02 UTC (permalink / raw)
  To: 'Nick Piggin'
  Cc: 'Tim Schmielau', 'Andrew Morton', torvalds,
	linux-kernel, clameter

> From: Nick Piggin [mailto:nickpiggin@yahoo.com.au]
> I'd be interested to know how OOM and page reclaim behaves after these
> patches
> (or with a newer kernel).

We didn't get far today. The various suggestions everyone has for solving
this problem spurred several new discussions inside the office and raised
more questions. At the heart of the problem Andrew is right, heavy handed
tactics to force limits on page cache don't solve anything and may just
squeeze the problem to new areas. Modifying tar is really a band aid and not
a solution, there is still a fundamental problem with memory management in
this setup.

Nick suggested the possibility of patching the kernel or upgrading to a
new kernel. Linus made the suggestion of dialing the value of
min_free_kbytes down to match something more in line with what might be
expected in a system with 400MB memory as a way to possibly make VM or at
least a portion of VM simulate a restricted amount of memory. And, I have
seen a couple suggestions about creating a new proc vm file to do things
like tweak max_buffer_heads dynamically.

So here's a silly (crazy) question (or two).

If I'm going to go through all the trouble to change the kernel and maybe
create a new proc file how much code would I have to touch to create a proc
file to set something like, let's say, effective memory and have all the vm
calculations use effective memory as the basis for swap and cache
calculations? And can I stop at the vm engine or does it sprawl farther
out? To the untrained mind it seems like this might be the best of both
worlds. It sounds like it would allow an embedded system like ours to set
aside a chunk of ram for a special purpose and designate a sandbox for the
OS. I am, of course, making the *bold* assumption here that a majority of
the vm algorithms are based off something remotely similar to a value which
represents physical memory.

Thoughts? Stones?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-05  4:02               ` Aucoin
@ 2006-12-05  4:46                 ` Linus Torvalds
  2006-12-05  6:41                   ` Aucoin
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2006-12-05  4:46 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Nick Piggin', 'Tim Schmielau',
	'Andrew Morton', linux-kernel, clameter

On Mon, 4 Dec 2006, Aucoin wrote:
>
> If I'm going to go through all the trouble to change the kernel and maybe
> create a new proc file how much code would I have to touch to create a proc
> file to set something like, let's say, effective memory and have all the vm
> calculations use effective memory as the basis for swap and cache
> calculations?

Considering your /proc/meminfo under load:

	MemTotal:      2075152 kB
	MemFree:        169848 kB
	Buffers:          4360 kB
	Cached:         334824 kB
	SwapCached:          0 kB
	Active:         178692 kB
	Inactive:       271452 kB
	HighTotal:     1179392 kB
	HighFree:         3040 kB
	LowTotal:       895760 kB
	LowFree:        499876 kB
	SwapTotal:      524276 kB
	SwapFree:       524276 kB
	Dirty:               0 kB
	Writeback:           0 kB
	Mapped:         116720 kB
	Slab:            27956 kB
	..

I actually suspect you should be _fairly_ close to such a situation 
already. In particular, the Active and Inactive lists really are fairly 
small, and don't contain the big SHM area, they seem to be just the cache 
and some (a fairly small amount of) anonymous pages.

The above actually confuses me mightily. I _really_ expected the SHM pages 
to show up on the active/inactive lists if it was actually SHM, and they 
don't seem to. What am I missing?

Louis, exactly how do you allocate that big 1.6GB shared area? 

			Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-05  4:46                 ` Linus Torvalds
@ 2006-12-05  6:41                   ` Aucoin
  2006-12-05  7:01                     ` Nick Piggin
  2006-12-05 16:17                     ` Linus Torvalds
  0 siblings, 2 replies; 54+ messages in thread
From: Aucoin @ 2006-12-05  6:41 UTC (permalink / raw)
  To: 'Linus Torvalds'
  Cc: 'Nick Piggin', 'Tim Schmielau',
	'Andrew Morton', linux-kernel, clameter

> From: Linus Torvalds [mailto:torvalds@osdl.org]
> I actually suspect you should be _fairly_ close to such a situation

We run with min_free_kbytes set around 4k to answer your earlier question.

> Louis, exactly how do you allocate that big 1.6GB shared area?

Ummm, shm_open, ftruncate, mmap ? Is it a trick question ? The process
responsible for initially setting up the shared area doesn't stay resident.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05  6:41                   ` Aucoin
@ 2006-12-05  7:01                     ` Nick Piggin
  2006-12-05  7:26                       ` Rene Herman
  2006-12-05 13:25                       ` Aucoin
  2006-12-05 16:17                     ` Linus Torvalds
  1 sibling, 2 replies; 54+ messages in thread
From: Nick Piggin @ 2006-12-05  7:01 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Linus Torvalds', 'Tim Schmielau',
	'Andrew Morton', linux-kernel, clameter

Aucoin wrote:
>>From: Linus Torvalds [mailto:torvalds@osdl.org]
>>I actually suspect you should be _fairly_ close to such a situation
> 
> 
> We run with min_free_kbytes set around 4k to answer your earlier question.
> 
> 
>>Louis, exactly how do you allocate that big 1.6GB shared area?
> 
> 
> Ummm, shm_open, ftruncate, mmap ? Is it a trick question ? The process
> responsible for initially setting up the shared area doesn't stay resident.

The issue is that the shm pages should show up in the active and
inactive lists. But they aren't, and you seem to have about 1542524K
unaccounted for. Weird.

Can you try getting the output of /proc/vmstat as well?
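
[Editorial sketch] The message does not say how the 1542524K figure was derived. One combination that reproduces it exactly from the meminfo Linus quoted is MemTotal minus MemFree, Cached and Slab; treat that as a guess at the arithmetic, not a statement of the method used. The parser below is a minimal helper written for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def unaccounted_kb(info):
    """Memory not explained by free pages, page cache or slab (assumed formula)."""
    return info["MemTotal"] - info["MemFree"] - info["Cached"] - info["Slab"]


# Subset of the meminfo quoted earlier in the thread.
SAMPLE = """\
MemTotal:      2075152 kB
MemFree:        169848 kB
Buffers:          4360 kB
Cached:         334824 kB
Active:         178692 kB
Inactive:       271452 kB
Slab:            27956 kB
"""

if __name__ == "__main__":
    print(unaccounted_kb(parse_meminfo(SAMPLE)))
```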

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05  7:01                     ` Nick Piggin
@ 2006-12-05  7:26                       ` Rene Herman
  2006-12-05 13:27                         ` Aucoin
  2006-12-05 13:25                       ` Aucoin
  1 sibling, 1 reply; 54+ messages in thread
From: Rene Herman @ 2006-12-05  7:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Aucoin, 'Linus Torvalds', 'Tim Schmielau',
	'Andrew Morton', linux-kernel, clameter

Nick Piggin wrote:

> Aucoin wrote:

>> Ummm, shm_open, ftruncate, mmap ? Is it a trick question ? The process
>> responsible for initially setting up the shared area doesn't stay 
>> resident.
> 
> The issue is that the shm pages should show up in the active and
> inactive lists. But they aren't, and you seem to have about 1542524K
> unaccounted for. Weird.
> 
> Can you try getting the output of /proc/vmstat as well?

Haven't followed along on this thread, but couldn't help notice the 
ftruncate there and some similarity to a problem I once experienced 
myself. Is ext3 involved? If so, maybe:

http://mail.nl.linux.org/linux-mm/2002-11/msg00110.html

is still or again being annoying?

Rene.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-05  7:01                     ` Nick Piggin
  2006-12-05  7:26                       ` Rene Herman
@ 2006-12-05 13:25                       ` Aucoin
  1 sibling, 0 replies; 54+ messages in thread
From: Aucoin @ 2006-12-05 13:25 UTC (permalink / raw)
  To: 'Nick Piggin'
  Cc: 'Linus Torvalds', 'Tim Schmielau',
	'Andrew Morton', linux-kernel, clameter

> From: Nick Piggin [mailto:nickpiggin@yahoo.com.au]
> Can you try getting the output of /proc/vmstat as well?

Output from vmstat, meminfo and bloatmon below.

vmstat
nr_dirty 0
nr_writeback 0
nr_unstable 0
nr_page_table_pages 361
nr_mapped 33077
nr_slab 8107
pgpgin 1433195947
pgpgout 148795046
pswpin 0
pswpout 1
pgalloc_high 19333815
pgalloc_normal 38376025
pgalloc_dma32 0
pgalloc_dma 1043219
pgfree 58768398
pgactivate 99313
pgdeactivate 61910
pgfault 248450153
pgmajfault 1009
pgrefill_high 18587
pgrefill_normal 129658
pgrefill_dma32 0
pgrefill_dma 6299
pgsteal_high 11954
pgsteal_normal 197484
pgsteal_dma32 0
pgsteal_dma 6176
pgscan_kswapd_high 13035
pgscan_kswapd_normal 205326
pgscan_kswapd_dma32 0
pgscan_kswapd_dma 6369
pgscan_direct_high 0
pgscan_direct_normal 0
pgscan_direct_dma32 0
pgscan_direct_dma 0
pginodesteal 0
slabs_scanned 24576
kswapd_steal 215614
kswapd_inodesteal 0
pageoutrun 3315
allocstall 0
pgrotated 1
nr_bounce 0

meminfo 
MemTotal:      2075152 kB
MemFree:         59052 kB
Buffers:         45088 kB
Cached:         401128 kB
SwapCached:          0 kB
Active:         246424 kB
Inactive:       313332 kB
HighTotal:     1179392 kB
HighFree:         1696 kB
LowTotal:       895760 kB
LowFree:         57356 kB
SwapTotal:      524276 kB
SwapFree:       524272 kB
Dirty:               4 kB
Writeback:           0 kB
Mapped:         132252 kB
Slab:            32432 kB
CommitLimit:    855292 kB
Committed_AS:   980948 kB
PageTables:       1432 kB
VmallocTotal:   114680 kB
VmallocUsed:      1000 kB
VmallocChunk:   113584 kB
HugePages_Total:   345
HugePages_Free:      0
Hugepagesize:     4096 kB

bloatmon
skbuff_fclone_cache:       22KB       22KB  100.0
reiser_inode_cache:        0KB        0KB  100.0
posix_timers_cache:        0KB        0KB  100.0
mqueue_inode_cache:       60KB       63KB   95.9
inotify_watch_cache:        0KB        3KB   14.85
inotify_event_cache:        0KB        0KB  100.0
hugetlbfs_inode_cache:        1KB        3KB   27.27
 skbuff_head_cache:     2082KB     2100KB   99.14
 shmem_inode_cache:        5KB       11KB   48.14
 isofs_inode_cache:        0KB        0KB  100.0
  sock_inode_cache:       21KB       26KB   82.85
  size-131072(DMA):        0KB        0KB  100.0
  request_sock_TCP:        0KB        0KB  100.0
  proc_inode_cache:       18KB       38KB   48.18
  ext3_inode_cache:      314KB      375KB   83.85
  ext2_inode_cache:       11KB       30KB   37.50
   tcp_bind_bucket:        0KB        3KB    3.94
   sysfs_dir_cache:       85KB       86KB  100.0
   size-65536(DMA):        0KB        0KB  100.0
   size-32768(DMA):        0KB        0KB  100.0
   size-16384(DMA):        0KB        0KB  100.0
   scsi_io_context:        0KB        0KB  100.0
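
For readers following the numbers, here is a minimal sketch of how the
reclaim counters in the vmstat dump above can be summarised. The helper
names (parse_counters, kswapd_reclaim_summary) are made up for this
illustration, and only the counters needed for the sum are copied into the
sample text. It adds up the per-zone pgscan_kswapd_*, pgscan_direct_* and
pgsteal_* values and reports how many scanned pages were actually reclaimed.

```python
def parse_counters(text):
    """Parse 'name value' lines (the /proc/vmstat layout) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def kswapd_reclaim_summary(counters):
    """Sum the per-zone scan/steal counters and compute a steal ratio."""
    zones = ("high", "normal", "dma32", "dma")
    scanned_kswapd = sum(counters.get(f"pgscan_kswapd_{z}", 0) for z in zones)
    scanned_direct = sum(counters.get(f"pgscan_direct_{z}", 0) for z in zones)
    stolen = sum(counters.get(f"pgsteal_{z}", 0) for z in zones)
    ratio = stolen / scanned_kswapd if scanned_kswapd else 0.0
    return {
        "scanned_kswapd": scanned_kswapd,
        "scanned_direct": scanned_direct,
        "stolen": stolen,
        "steal_ratio": ratio,
    }


# Values copied from the vmstat output posted above.
SAMPLE = """\
pgsteal_high 11954
pgsteal_normal 197484
pgsteal_dma32 0
pgsteal_dma 6176
pgscan_kswapd_high 13035
pgscan_kswapd_normal 205326
pgscan_kswapd_dma32 0
pgscan_kswapd_dma 6369
pgscan_direct_high 0
pgscan_direct_normal 0
pgscan_direct_dma32 0
pgscan_direct_dma 0
kswapd_steal 215614
allocstall 0
"""

summary = kswapd_reclaim_summary(parse_counters(SAMPLE))
print(summary)
```

On this snapshot the per-zone pgsteal values add up to the kswapd_steal
counter, and the direct-scan counters and allocstall are all zero, so at the
time the dump was taken reclaim appears to have been done by kswapd alone.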



^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-05  7:26                       ` Rene Herman
@ 2006-12-05 13:27                         ` Aucoin
  2006-12-05 13:49                           ` Rene Herman
  0 siblings, 1 reply; 54+ messages in thread
From: Aucoin @ 2006-12-05 13:27 UTC (permalink / raw)
  To: 'Rene Herman', 'Nick Piggin'
  Cc: 'Linus Torvalds', 'Tim Schmielau',
	'Andrew Morton', linux-kernel, clameter

> From: Rene Herman [mailto:rene.herman@gmail.com]
> ftruncate there and some similarity to a problem I once experienced

I can't honestly say I completely grasp the fundamentals of the issue you
experienced but we are using ext3 with data=journal



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 13:27                         ` Aucoin
@ 2006-12-05 13:49                           ` Rene Herman
  0 siblings, 0 replies; 54+ messages in thread
From: Rene Herman @ 2006-12-05 13:49 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Nick Piggin', 'Linus Torvalds',
	'Tim Schmielau', 'Andrew Morton', linux-kernel,
	clameter

Aucoin wrote:

>> From: Rene Herman [mailto:rene.herman@gmail.com] ftruncate there
>> and some similarity to a problem I once experienced
> 
> I can't honestly say I completely grasp the fundamentals of the issue
> you experienced but we are using ext3 with data=journal

Rereading I see ext3 isn't involved at all but perhaps the ftruncate 
does something similar here as it did on ext3? Andrew? It's probably 
best to ignore me, I also never quite understood what the problem on 
ext3 was. Just thought I'd share the hunch anyway...

Rene.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: la la la la ... swappiness
  2006-12-05  6:41                   ` Aucoin
  2006-12-05  7:01                     ` Nick Piggin
@ 2006-12-05 16:17                     ` Linus Torvalds
  2006-12-05 16:59                       ` Andrew Morton
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2006-12-05 16:17 UTC (permalink / raw)
  To: Aucoin
  Cc: 'Nick Piggin', 'Tim Schmielau',
	'Andrew Morton', clameter, Linux Memory Management List

On Tue, 5 Dec 2006, Aucoin wrote:
>
> > Louis, exactly how do you allocate that big 1.6GB shared area?
> 
> Ummm, shm_open, ftruncate, mmap ? Is it a trick question ? The process
> responsible for initially setting up the shared area doesn't stay resident.

Not a trick question, I just suddenly realized that I really should have 
expected the SHM pages to show up in the LRU lists (either inactive or 
active) and shown up as "cached" pages too. Afaik, the SHM routines all 
end up using the page cache and the LRU for the backing store.

But your 1.6GB thing doesn't show up anywhere.

(I'm sure it's intentional, and I've just forgotten some detail. We 
probably remove pages from the LRU lists when they are locked. Anyway, my 
original point was that since the pages _aren't_ on the LRU lists, the VM 
really should basically act as if they didn't exist at all, but there are 
probably things that still base their decisions on the _total_ amount of 
memory)
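
For reference, the "shm_open, ftruncate, mmap" sequence quoted above can be
sketched roughly as follows. This is only an illustration, not the
application's actual setup code: the function name create_shared_area and
its shm_dir parameter are hypothetical, and a plain open() on a file under a
directory stands in for shm_open() (on Linux, POSIX shared memory objects
normally appear as files under /dev/shm).

```python
import mmap
import os
import tempfile


def create_shared_area(name, size, shm_dir="/dev/shm"):
    """Create, size and map a shared area; returns (path, mapping)."""
    path = os.path.join(shm_dir, name)
    # Step 1: create/open the backing object (stand-in for shm_open()).
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Step 2: give the object its size.
        os.ftruncate(fd, size)
        # Step 3: map it shared, so other processes mapping the same
        # object see the same pages.
        area = mmap.mmap(fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # The mapping stays valid after the descriptor is closed.
        os.close(fd)
    return path, area


if __name__ == "__main__":
    # A temporary directory keeps the demo from touching /dev/shm.
    with tempfile.TemporaryDirectory() as demo_dir:
        path, area = create_shared_area("demo_area", 4 * 4096, shm_dir=demo_dir)
        area[:5] = b"hello"
        print(path, len(area), bytes(area[:5]))
        area.close()
```

Because the object is named and lives outside any one process, the process
that creates it does not need to stay resident; later processes can open and
map the same name.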

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 16:17                     ` Linus Torvalds
@ 2006-12-05 16:59                       ` Andrew Morton
  2006-12-05 17:41                         ` aucoin, Andrew Morton
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-05 16:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Aucoin, 'Nick Piggin', 'Tim Schmielau', clameter,
	Linux Memory Management List

On Tue, 5 Dec 2006 08:17:51 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Tue, 5 Dec 2006, Aucoin wrote:
> >
> > > Louis, exactly how do you allocate that big 1.6GB shared area?
> > 
> > Ummm, shm_open, ftruncate, mmap ? Is it a trick question ? The process
> > responsible for initially setting up the shared area doesn't stay resident.
> 
> Not a trick question, I just suddenly realized that I really should have 
> expected the SHM pages to show up in the LRU lists (either inactive or 
> active) and shown up as "cached" pages too. Afaik, the SHM routines all 
> end up using the page cache and the LRU for the backing store.
> 
> But your 1.6GB thing doesn't show up anywhere.
> 
> (I'm sure it's intentional, and I've just forgotten some detail. We 
> probably remove pages from the LRU lists when they are locked. Anyway, my 
> original point was that since the pages _aren't_ on the LRU lists, the VM 
> really should basically act as if they didn't exist at all, but there are 
> probably things that still base their decisions on the _total_ amount of 
> memory)
> 

Yes, those pages should be on the LRU.  I suspect they never got paged in
or something.  But that would mean they weren't mlocked.  Is a mystery.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 16:59                       ` Andrew Morton
@ 2006-12-05 17:41                         ` aucoin, Andrew Morton
  2006-12-05 18:31                           ` Christoph Lameter
  0 siblings, 1 reply; 54+ messages in thread
From: aucoin, Andrew Morton @ 2006-12-05 17:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
	'Tim Schmielau', clameter, Linux Memory Management List

> Yes, those pages should be on the LRU.  I suspect they never got 

Oops, details, details.

These are huge pages .... apologies for leaving that out.
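
A rough cross-check against the meminfo output posted earlier in the thread,
as a sketch only: the helper names below are invented for illustration, and
the choice of which fields count as "visible" memory is an assumption, not
the kernel's own accounting. It multiplies HugePages_Total by Hugepagesize
and compares the result with the memory that the other fields leave
unexplained.

```python
def meminfo_kb(text):
    """Parse /proc/meminfo-style 'Key: value [kB]' lines into a dict of ints."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        tokens = rest.split()
        if tokens and tokens[0].isdigit():
            fields[key.strip()] = int(tokens[0])
    return fields


def hugepage_accounting(fields):
    """Compare the hugepage reservation with otherwise unexplained memory."""
    huge_kb = fields["HugePages_Total"] * fields["Hugepagesize"]
    # Assumed set of "visible" consumers; other users of memory are ignored.
    visible_kb = (fields["MemFree"] + fields["Active"] + fields["Inactive"]
                  + fields["Slab"] + fields["PageTables"])
    unexplained_kb = fields["MemTotal"] - visible_kb
    return {
        "hugepages_kb": huge_kb,
        "unexplained_kb": unexplained_kb,
        "gap_kb": unexplained_kb - huge_kb,
        "left_for_everything_else_kb": fields["MemTotal"] - huge_kb,
    }


# Values copied from the meminfo output posted earlier in the thread.
SAMPLE = """\
MemTotal:      2075152 kB
MemFree:         59052 kB
Active:         246424 kB
Inactive:       313332 kB
Slab:            32432 kB
PageTables:       1432 kB
HugePages_Total:   345
HugePages_Free:      0
Hugepagesize:     4096 kB
"""

result = hugepage_accounting(meminfo_kb(SAMPLE))
print(result)
```

With those figures the 345 huge pages of 4096 kB account for nearly all of
the memory that does not show up in MemFree, the LRU lists, slab or page
tables.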


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 17:41                         ` aucoin, Andrew Morton
@ 2006-12-05 18:31                           ` Christoph Lameter
  2006-12-05 18:44                             ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-05 18:31 UTC (permalink / raw)
  To: Aucoin
  Cc: Andrew Morton, Linus Torvalds, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

On Tue, 5 Dec 2006, aucoin@houston.rr.com wrote:

> From: Andrew Morton <akpm@osdl.org>
> > Yes, those pages should be on the LRU.  I suspect they never got 
> Oops, details, details.
> These are huge pages .... apologies for leaving that out.

We do not support swapping / reclaim for huge pages.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 18:31                           ` Christoph Lameter
@ 2006-12-05 18:44                             ` Linus Torvalds
  2006-12-05 19:32                               ` Christoph Lameter
  2006-12-05 20:39                               ` aucoin, Linus Torvalds
  0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2006-12-05 18:44 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aucoin, Andrew Morton, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

On Tue, 5 Dec 2006, Christoph Lameter wrote:
> 
> We do not support swapping / reclaim for huge pages.

Well, Louis doesn't actually _want_ swapping or reclaim on them. He just 
wants the system to run well with the remaining 400MB of memory in his 
machine.

Which it doesn't. It just OOM's for some reason.

We still haven't seen the oom debug output though, I think. It should talk 
about some of the state (it calls "show_mem()", which should call 
"show_free_areas()"), which should tell a lot about why the heck it 
thought it was out of memory.

But maybe Louis posted it and I just missed it.

Anyway, if it's hugepages, then I don't see why Louis even _wants_ to turn 
down swappiness. The hugepages won't be swapped out regardless.

		Linus


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 18:44                             ` Linus Torvalds
@ 2006-12-05 19:32                               ` Christoph Lameter
  2006-12-05 20:02                                 ` Andrew Morton
  2006-12-05 20:39                               ` aucoin, Linus Torvalds
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-05 19:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Aucoin, Andrew Morton, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

On Tue, 5 Dec 2006, Linus Torvalds wrote:
> On Tue, 5 Dec 2006, Christoph Lameter wrote:
> > We do not support swapping / reclaim for huge pages.
> 
> Well, Louis doesn't actually _want_ swapping or reclaim on them. He just 
> wants the system to run well with the remaining 400MB of memory in his 
> machine.
> 
> Which it doesn't. It just OOM's for some reason.

If you take huge chunks of memory out of a zone then the dirty limits as 
well as the min free kbytes etc are all off. As a result the VM may 
behave strangely.  F.e. too many dirty pages may cause an OOM since we do 
not enter synchronous writeout during reclaim.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 19:32                               ` Christoph Lameter
@ 2006-12-05 20:02                                 ` Andrew Morton
  2006-12-05 20:15                                   ` Christoph Lameter
  2006-12-05 20:52                                   ` Andrew Morton
  0 siblings, 2 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-05 20:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

On Tue, 5 Dec 2006 11:32:21 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 5 Dec 2006, Linus Torvalds wrote:
> > On Tue, 5 Dec 2006, Christoph Lameter wrote:
> > > We do not support swapping / reclaim for huge pages.
> > 
> > Well, Louis doesn't actually _want_ swapping or reclaim on them. He just 
> > wants the system to run well with the remaining 400MB of memory in his 
> > machine.
> > 
> > Which it doesn't. It just OOM's for some reason.
> 
> If you take huge chunks of memory out of a zone then the dirty limits as 
> well as the min free kbytes etc are all off. As a result the VM may 
> behave strangely.  F.e. too many dirty pages may cause an OOM since we do 
> not enter synchronous writeout during reclaim.

yes, it's quite possible that this setup would cause the page reclaim
arithmetic to go wrong.

But otoh, it's a very common scenario, and nobody has observed it before. 
For example:

akpm2:/home/akpm# echo 4000 > /proc/sys/vm/nr_hugepages 

Free memory on this box instantly fell from 7G down to ~250MB.  It's now
happily chugging its way through a `dbench 512' run.

But this is a 64-bit machine.  Could be that there are problems on 32-bit.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 20:02                                 ` Andrew Morton
@ 2006-12-05 20:15                                   ` Christoph Lameter
  2006-12-05 20:48                                     ` Andrew Morton
  2006-12-05 20:52                                   ` Andrew Morton
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-05 20:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

On Tue, 5 Dec 2006, Andrew Morton wrote:

> But otoh, it's a very common scenario, and nobody has observed it before. 

This is the same scenario as mlocked memory. Kame-san has recently posted 
an occurrence in ZONE_DMA. I have 3 customers where I have seen similar VM 
behavior with a special shared memory thingy locking down lots of 
memory.

In fact in the NUMA case with cpusets the limits being off is a very 
common problem. F.e. the dirty balancing logic does not take into account 
that the application can just run on a subset of the machine. So if a 
cpuset is just 1/10th of the whole machine then we will never be able to 
reach the dirty limits, all the nodes of a cpuset may be filled up with 
dirty pages. A simple cp of a large file will bring the machine into a 
continual reclaim on all nodes.
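
To put numbers on the cpuset case, here is a deliberately simplified model.
It assumes the dirty threshold is a plain percentage of all pages in the
machine and that only this cpuset is dirtying memory; the function name, the
page counts and the 40 percent ratio are illustrative assumptions, and the
real balance_dirty_pages() logic is more involved.

```python
def dirty_limit_report(total_pages, cpuset_pages, dirty_ratio_pct):
    """Compare a machine-wide dirty threshold with the size of one cpuset."""
    # Simplified: threshold is a fixed share of every page in the machine.
    threshold = total_pages * dirty_ratio_pct // 100
    return {
        "global_threshold_pages": threshold,
        # Throttling only starts once dirty pages reach the global threshold.
        "cpuset_can_trigger_throttling": cpuset_pages >= threshold,
        # Share of the cpuset that can be dirty before any throttling starts.
        "max_dirty_share_of_cpuset": min(1.0, threshold / cpuset_pages),
    }


TOTAL_PAGES = 1_000_000  # hypothetical machine size

# A cpuset holding one tenth of the machine, with an illustrative 40% ratio.
report = dirty_limit_report(TOTAL_PAGES, TOTAL_PAGES // 10, 40)
print(report)
```

Under these assumptions the threshold is 400,000 pages while the cpuset only
has 100,000, so every page in the cpuset can be dirtied without the writer
ever being throttled.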

I am working on a solution for the dirty throttling but we have similar 
issues for the other limits. I wonder if we should not account for 
unreclaimable memory per zone and recalculate the limits if they change 
significantly. A series of huge page allocations would then retune the 
limits.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 18:44                             ` Linus Torvalds
  2006-12-05 19:32                               ` Christoph Lameter
@ 2006-12-05 20:39                               ` aucoin, Linus Torvalds
  1 sibling, 0 replies; 54+ messages in thread
From: aucoin, Linus Torvalds @ 2006-12-05 20:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Andrew Morton, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

I didn't post it yet, I don't have a recent build with oom enabled at
the moment so I was digging through old bugzillas to see what I could
find. Here are some pieces from one oom firing, they're from old runs
and based on the bugzilla context I can't swear it's exactly the same
problem, I'm looking for more. The "ae" process that's being killed is one
of the three processes attached to the 1.6G shm.

Oct 11 19:06:38 QAR2MOVDB2 kernel:  [<c01049a4>] show_trace+0xd/0xf
Oct 11 19:06:38 QAR2MOVDB2 kernel:  [<c0104a43>] dump_stack+0x17/0x19
Oct 11 19:06:38 QAR2MOVDB2 kernel:  [<c0138f44>] out_of_memory+0x27/0x12f
Oct 11 19:06:38 QAR2MOVDB2 kernel:  [<c013a617>] __alloc_pages+0x1e1/0x261
Oct 11 19:06:38 QAR2MOVDB2 kernel:  [<c013a6bf>] __get_free_pages+0x28/0x37
Oct 11 19:06:38 QAR2MOVDB2 kernel:  [<c015f066>] __pollwait+0x33/0x9e
Oct 11 19:06:38 QAR2MOVDB2 kernel:  [<c01eb25c>] mqueue_poll_file+0x27/0x57
Oct 11 19:06:38 QAR2MOVDB2 kernel:  [<c015fb9b>] do_sys_poll+0x165/0x2da
Oct 11 19:06:38 QAR2MOVDB2 kernel:  [<c015ff24>] sys_poll+0x43/0x47
Oct 11 19:06:38 QAR2MOVDB2 kernel:  [<c0103513>] sysenter_past_esp+0x54/0x75

Oct 11 19:08:19 QAR2MOVDB2 kernel: Free swap  = 421008kB
Oct 11 19:08:19 QAR2MOVDB2 kernel: Total swap = 524276kB
Oct 11 19:08:19 QAR2MOVDB2 kernel: Free swap:       421008kB
Oct 11 19:08:19 QAR2MOVDB2 kernel: 524224 pages of RAM
Oct 11 19:08:19 QAR2MOVDB2 kernel: 294848 pages of HIGHMEM
Oct 11 19:08:19 QAR2MOVDB2 kernel: 5437 reserved pages
Oct 11 19:08:19 QAR2MOVDB2 kernel: 1340645 pages shared
Oct 11 19:08:19 QAR2MOVDB2 kernel: 25817 pages swap cached
Oct 11 19:08:19 QAR2MOVDB2 kernel: 107 pages dirty
Oct 11 19:08:19 QAR2MOVDB2 kernel: 45405 pages writeback
Oct 11 19:08:19 QAR2MOVDB2 kernel: 2638 pages mapped
Oct 11 19:08:19 QAR2MOVDB2 kernel: 29632 pages slab
Oct 11 19:08:19 QAR2MOVDB2 kernel: 385 pages pagetables
Oct 11 19:08:19 QAR2MOVDB2 kernel: Out of Memory: Kill process 1636 (ae) score 556471 and children.
Oct 11 19:08:19 QAR2MOVDB2 kernel: Out of memory: Killed process 1636 (ae).


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 20:15                                   ` Christoph Lameter
@ 2006-12-05 20:48                                     ` Andrew Morton
  2006-12-05 20:59                                       ` Christoph Lameter
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-05 20:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

On Tue, 5 Dec 2006 12:15:46 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 5 Dec 2006, Andrew Morton wrote:
> 
> > But otoh, it's a very common scenario, and nobody has observed it before. 
> 
> This is the same scenario as mlocked memory.

Not quite - mlocked pages are on the page LRU and hence contribute to the
arithmetic in there.   The hugetlb pages are simply gone.

> Kame-san has recently posted 
> an occurrence in ZONE_DMA. I have 3 customers where I have seen similar VM 
> behavior with a special shared memory thingy locking down lots of 
> memory.

I expect the mechanisms are different.  The mlocked shared-memory segment
will fill the LRU with unreclaimable pages and the machine will do lots of
scanning.  That's inefficient, but it is unexpected that this will lead to
false declaration of OOM.

> In fact in the NUMA case with cpusets the limits being off is a very 
> common problem. F.e. the dirty balancing logic does not take into account 
> that the application can just run on a subset of the machine.

Yup.

> So if a 
> cpuset is just 1/10th of the whole machine then we will never be able to 
> reach the dirty limits, all the nodes of a cpuset may be filled up with 
> dirty pages. A simple cp of a large file will bring the machine into a 
> continual reclaim on all nodes.

It shouldn't be continual and it shouldn't be on all nodes.  What _should_
happen in this situation is that the dirty pages in those zones are written
back off the LRU by the vm scanner.

That's less efficient from an IO scheduling POV than writing them back via
the inodes, but it should work OK and it shouldn't affect other zones.

If the activity is really "continual" and "on all nodes" then we have some
bugs to fix.

> I am working on a solution for the dirty throttling but we have similar 
> issues for the other limits. I wonder if we should not account for 
> unreclaimable memory per zone and recalculate the limits if they change 
> significantly. A series of huge page allocations would then retune the 
> limits.

We should fix the existing code before even thinking about this sort of
thing.  Or at least, gain a full understanding of why it is failing.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 20:02                                 ` Andrew Morton
  2006-12-05 20:15                                   ` Christoph Lameter
@ 2006-12-05 20:52                                   ` Andrew Morton
  1 sibling, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-05 20:52 UTC (permalink / raw)
  To: Christoph Lameter, Linus Torvalds, Aucoin, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

On Tue, 5 Dec 2006 12:02:56 -0800
Andrew Morton <akpm@osdl.org> wrote:

> But otoh, it's a very common scenario, and nobody has observed it before. 
> For example:
> 
> akpm2:/home/akpm# echo 4000 > /proc/sys/vm/nr_hugepages 
> 
> Free memory on this box instantly fell from 7G down to ~250MB.  It's now
> happily chugging its way through a `dbench 512' run.

FS(small)VO "happily".  It's running like a complete dog (but I guess
dbench 512 in 256M is a bit mean).  But it's still running!


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 20:48                                     ` Andrew Morton
@ 2006-12-05 20:59                                       ` Christoph Lameter
  2006-12-05 21:39                                         ` Andrew Morton
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-05 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

On Tue, 5 Dec 2006, Andrew Morton wrote:

> > This is the same scenario as mlocked memory.
> 
> Not quite - mlocked pages are on the page LRU and hence contribute to the
> arithmetic in there.   The hugetlb pages are simply gone.

They cannot be swapped out and AFAICT the ratio calculations are assuming 
that pages can be evicted.

> > So if a 
> > cpuset is just 1/10th of the whole machine then we will never be able to 
> > reach the dirty limits, all the nodes of a cpuset may be filled up with 
> > dirty pages. A simple cp of a large file will bring the machine into a 
> > continual reclaim on all nodes.
> 
> It shouldn't be continual and it shouldn't be on all nodes.  What _should_

I meant all nodes of the cpuset.

> happen in this situation is that the dirty pages in those zones are written
> back off the LRU by the vm scanner.

Right, in the best case that occurs. However, since we do not recognize 
that we are in a dirty overload situation we may not do synchronous 
writes but return without having reclaimed any memory (a particular 
problem exists here in connection with NFS's well-known memory 
problems). If memory gets completely clogged then we OOM.

> That's less efficient from an IO scheduling POV than writing them back via
> the inodes, but it should work OK and it shouldn't affect other zones.

Could we get to the inode from the reclaim path and just start writing out 
all dirty pages of the inode?

> If the activity is really "continual" and "on all nodes" then we have some
> bugs to fix.

It's continual on the nodes of the cpuset. Reclaim is constantly running 
and becomes very inefficient.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 20:59                                       ` Christoph Lameter
@ 2006-12-05 21:39                                         ` Andrew Morton
  2006-12-05 23:20                                           ` Christoph Lameter
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-05 21:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

On Tue, 5 Dec 2006 12:59:14 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 5 Dec 2006, Andrew Morton wrote:
> 
> > > This is the same scenario as mlocked memory.
> > 
> > Not quite - mlocked pages are on the page LRU and hence contribute to the
> > arithmetic in there.   The hugetlb pages are simply gone.
> 
> They cannot be swapped out and AFAICT the ratio calculations are assuming 
> that pages can be evicted.

Some calculations assume that.  But a lot (most) of the reclaim code is
paced by number-of-pages-scanned.  mlocked pages on the LRU will be noted
by the scanner and will cause priority elevation, throttling, etc.  Pages
which have been gobbled by hugetlb will not.

> > > So if a 
> > > cpuset is just 1/10th of the whole machine then we will never be able to 
> > > reach the dirty limits, all the nodes of a cpuset may be filled up with 
> > > dirty pages. A simple cp of a large file will bring the machine into a 
> > > continual reclaim on all nodes.
> > 
> > It shouldn't be continual and it shouldn't be on all nodes.  What _should_
> 
> I meant all nodes of the cpuset.
> 
> > happen in this situation is that the dirty pages in those zones are written
> > back off the LRU by the vm scanner.
> 
> Right, in the best case that occurs.

We want it to work in all cases.

> However, since we do not recognize 
> that we are in a dirty overload situation we may not do synchronous 
> writes but return without having reclaimed any memory

Return from what?  try_to_free_pages() or balance_dirty_pages()?

The behaviour of page reclaim is independent of the level of dirty memory
and of the dirty-memory thresholds, as far as I recall...

> (a particular 
> problem exists here in connection with NFS's well-known memory 
> problems). If memory gets completely clogged then we OOM.

NFS causes problems because it needs to allocate memory (skbs) to be able
to write back dirty memory.  There have been fixes and things have
improved, but I wouldn't be surprised if there are still problems.

> > That's less efficient from an IO scheduling POV than writing them back via
> > the inodes, but it should work OK and it shouldn't affect other zones.
> 
> Could we get to the inode from the reclaim path and just start writing out 
> all dirty pages of the inode?

Yeah, maybe.  But of course the pages on the inode can be from any zone at
all so the problem is that in some scenarios, we could write out tremendous
numbers of pages from zones which don't need that writeout.

> > If the activity is really "continual" and "on all nodes" then we have some
> > bugs to fix.
> 
> It's continual on the nodes of the cpuset. Reclaim is constantly running 
> and becomes very inefficient.

I think what you're saying is that we're not throttling in
balance_dirty_pages().  So a large write() which is performed by a process
inside your one-tenth-of-memory cpuset will just go and dirty all of the
pages in that cpuset's nodes and things get all gummed up.

That can certainly happen, and I suppose we can make changes to
balance_dirty_pages() to fix it (although it will have the
we-wrote-lots-of-pages-we-didnt-need-to failure mode).

But right now in 2.6.19 the machine should _not_ declare oom in this
situation.  If it does, then we should fix that.  If it's only happening
with NFS then yeah, OK, mumble, NFS still needs work.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: la la la la ... swappiness
  2006-12-05 21:39                                         ` Andrew Morton
@ 2006-12-05 23:20                                           ` Christoph Lameter
  2006-12-12 15:12                                             ` Aucoin
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Lameter @ 2006-12-05 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Aucoin, 'Nick Piggin',
	'Tim Schmielau', Linux Memory Management List

On Tue, 5 Dec 2006, Andrew Morton wrote:

> > However, since we do not recognize 
> > that we are in a dirty overload situation we may not do synchronous 
> > writes but return without having reclaimed any memory
> 
> Return from what?  try_to_free_pages() or balance_dirty_pages()?

If we do not reach the dirty_ratio then we will not block but simply 
trigger writeouts.

try_to_free_pages() will trigger pdflush and we may wait 1/10th of a 
second in congestion_wait() and in throttle_vm_writeout() (well, not 
really, since we check global limits) but we will not block. I think what 
happens is that try_to_free_pages() (given sufficient slowness of the 
writeout) at some point will start to return 0 and thus we OOM.

> The behaviour of page reclaim is independent of the level of dirty memory
> and of the dirty-memory thresholds, as far as I recall...

You cannot easily free a dirty page. We can only trigger writeout.

> > Could we get to the inode from the reclaim path and just start writing out 
> > all dirty pages of the inode?
> 
> Yeah, maybe.  But of course the pages on the inode can be from any zone at
> all so the problem is that in some scenarios, we could write out tremendous
> numbers of pages from zones which don't need that writeout.

But we know that at least one page was in the correct zone. Writeout will 
be much faster if we can write a series of blocks in sequence via the inode.

> > It's continual on the nodes of the cpuset. Reclaim is constantly running
> > and becomes very inefficient.
> 
> I think what you're saying is that we're not throttling in
> balance_dirty_pages().  So a large write() which is performed by a process
> inside your one-tenth-of-memory cpuset will just go and dirty all of the
> pages in that cpuset's nodes and things get all gummed up.

Correct.

> That can certainly happen, and I suppose we can make changes to
> balance_dirty_pages() to fix it (although it will have the
> we-wrote-lots-of-pages-we-didnt-need-to failure mode).

Right. In addition to checking the limits of the nodes in the current
cpuset (which requires looping over all nodes and adding up the counters we
need) I made some modifications to pass a set of nodes in the
writeback_control structure. We can then check if there are sufficient
pages of the inode within the nodes of the cpuset. But I am a bit
concerned about performance.
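
The per-cpuset check mentioned above (loop over the nodes, add up the
counters, compare against the limit) might look roughly like the following
userspace model. It is a sketch under assumptions, not the actual patch:
struct node_stats, cpuset_over_dirty_limit() and the plain bitmask used for
the node set are all hypothetical stand-ins for the kernel's per-node
counters and nodemask.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 64  /* model limit: one bit per node in a 64-bit mask */

/* Hypothetical per-node counters (the kernel keeps these per zone/node). */
struct node_stats {
    unsigned long total_pages;
    unsigned long dirty_pages;
};

/*
 * Return 1 if the dirty pages on the nodes selected by node_mask have
 * reached dirty_ratio percent of the memory on those nodes, 0 otherwise.
 * Nodes outside the mask are ignored, so a small cpuset can be over its
 * limit while the machine as a whole is far below it.
 */
int cpuset_over_dirty_limit(const struct node_stats *nodes, size_t nr_nodes,
                            unsigned long long node_mask,
                            unsigned int dirty_ratio)
{
    unsigned long total = 0;
    unsigned long dirty = 0;
    size_t n;

    assert(nr_nodes <= MAX_NODES && dirty_ratio <= 100);

    for (n = 0; n < nr_nodes; n++) {
        if (!(node_mask & (1ULL << n)))
            continue;
        total += nodes[n].total_pages;
        dirty += nodes[n].dirty_pages;
    }

    return total > 0 && dirty * 100 >= total * dirty_ratio;
}
```

With ten equally sized nodes and 90% of node 0 dirty, the check over all
nodes stays below a 40% limit, while the same check restricted to node 0
trips. That is the one-tenth-of-memory cpuset case discussed above.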

> But right now in 2.6.19 the machine should _not_ declare oom in this
> situation.  If it does, then we should fix that.  If it's only happening
> with NFS then yeah, OK, mumble, NFS still needs work.

We OOM only in some rare cases. Mostly it seems that the
machines just become extremely slow and the LRU locks become hot.


* RE: la la la la ... swappiness
  2006-12-05 23:20                                           ` Christoph Lameter
@ 2006-12-12 15:12                                             ` Aucoin
  0 siblings, 0 replies; 54+ messages in thread
From: Aucoin @ 2006-12-12 15:12 UTC (permalink / raw)
  To: 'Christoph Lameter', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Nick Piggin',
	'Tim Schmielau', 'Linux Memory Management List'

For what it's worth, we tried a version of tar recompiled with calls to
posix_fadvise() using the POSIX_FADV_NOREUSE flag, but it had no effect on
the issue. Inactive pages still accumulated to the point where the system
started swapping instead of reclaiming inactive pages.
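
For comparison, a commonly used alternative is to flush the data and then
advise POSIX_FADV_DONTNEED rather than POSIX_FADV_NOREUSE. As far as the
Linux posix_fadvise(2) manual page documents, POSIX_FADV_NOREUSE was a no-op
in kernels of this era, which would be consistent with the result reported
above. The sketch below is not the patch that was applied to tar; the helper
name write_and_drop_cache() is hypothetical, and whether the pages are
actually dropped is up to the kernel.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from buf to fd, force the data to
 * disk, then ask the kernel to drop the file's cached pages.
 * Returns 0 on success and -1 on failure.
 */
int write_and_drop_cache(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0 && errno == EINTR)
            continue;           /* interrupted before writing: retry */
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }

    /* Dirty pages cannot simply be dropped, so write them out first. */
    if (fdatasync(fd) != 0)
        return -1;

    /* offset 0 with len 0 means "from the start to the end of the file". */
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
        return -1;

    return 0;
}
```

An extraction tool would call this once per chunk or per file. The
fdatasync() call makes each write synchronous, so the trade-off is slower
extraction in exchange for less page cache left behind.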

