From: Bill Davidsen <davidsen@tmr.com>
To: Ray Bryant <raybry@sgi.com>
Cc: Buddy Lumpkin <b.lumpkin@comcast.net>,
"'Con Kolivas'" <kernel@kolivas.org>,
"'FabF'" <fabian.frederick@skynet.be>,
"'Bernd Eckenfels'" <ecki-news2004-05@lina.inka.de>,
linux-kernel@vger.kernel.org, lse-tech@lists.sourceforge.net,
linux-mm@kvack.org
Subject: Re: why swap at all?
Date: Wed, 09 Jun 2004 15:24:13 -0400
Message-ID: <40C763DD.7090003@tmr.com>
In-Reply-To: <40C5D7FB.7020402@sgi.com>

Ray Bryant wrote:
>
> Buddy Lumpkin wrote:
>
>> <snip> One method would be to keep the
>> pagecache on its own list, and move pages to the head of the list any
>> time they are modified or referenced, and reclaim from the tail.
>> All pages on this list can be considered as "free memory", because any
>> new memory requests would just cause pages to be evicted from the tail
>> of the list.
>>
>
> We have code running on Altix that does exactly this. (Please note,
> however, that this is for our version of Linux 2.4.21 -- Yeah, it's
> old, but that is what the product runs at the moment -- we are in
> the process of switching over to Linux 2.6, at which point all of
> this will have to be re-evaluated.) The changes are in three parts:
>
> (1) We added a new page list, the reclaim list. Pages are put
> onto the reclaim list when they are inserted into the page cache.
> They are removed from the list when they are marked dirty (buffers
> from the page go on to the LRU dirty list) or when the pages are
> mmap'd into an address space, since in either of these situations,
> the pages are not reclaimable. (This list is per node in our
> NUMA system.)
>
> (2) We added code in __alloc_pages() so that if the local node
> allocation is going to fail (remember that Altix is a NUMA machine),
> we call out to a routine to scan the reclaim list on that node and
> to release enough clean buffer cache pages to make the local
> allocation succeed (plus a few pages, for efficiency). If this
> doesn't work, we most likely end up spilling the allocation over
> to another node.
>
> (3) We added code in generic_file_write() to limit the size of
> the page cache on buffered file I/O write operations. If the
> current size of the page cache is larger than the limit, we
> call the same routine as above to release some page cache pages.
> If we can't free enough pages to get below the limit, we throttle
> the write process by delaying it for a bit. This was all to
> avoid the problem of a large buffered file I/O request causing
> the page cache to grow to the point where the system would start
> to swap. (On our large memory systems, dropping into the
> swapping code can cause the system to freeze for tens of seconds,
> and that is something we would like to avoid.)
>
> (We actually don't enforce the page cache limit unless the amount
> of free memory has dropped below a certain threshold. This is to
> keep the page cache from being limited if there is lots of free
> memory -- even though we only limit the page cache on writes,
> it turns out that the kernel is constantly writing to the disk,
> so this also effectively causes the page cache to be limited
> for reads as well.)
>
> This code was also written in response to customer demand. They
> don't like the fact that the buffer cache grows and grows on our
> Altix systems, and they want old buffer cache pages to be cleared
> out when they are no longer needed. Since we almost never suffer
> memory pressure on our systems (and if we do, we are likely in
> trouble), kswapd almost never does this. Buffer cache pages can
> sit around for days with no one removing them. The above was one
> approach to solve that problem.
>
> Please note: YMMV. An Altix is not a desktop system and I make
> no claims that the above approach is appropriate for everyone.
> For us, it turns out to work better to bias storage allocation
> against unbridled growth of the page cache. Indeed, we have
> spent a lot of time trying to solve problems related to page
> cache on Altix systems. Assuming we get our OLS paper done
> in time, you can read more about this in our paper at OLS.
> (If not, we intend to post our experiences paper on the
> oss.sgi.com website.)
>
> Finally, let me reiterate that we are beginning the process of
> evaluating the 2.6 memory manager wrt the same problem as above.
> Before we propose a change such as the above for 2.6, we have
> to convince ourselves that (1) setting vm_swappiness appropriately
> doesn't solve the problem, (2) patches such as the ones that
> Nick Piggin has been proposing don't solve the problem either,
> and (3) there isn't some other mechanism to deal with this in 2.6.
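
[Editorial illustration, not part of the original message.] The three
parts Ray describes above can be pictured with a small user-space toy
model. This is only a sketch under stated assumptions: it is not the SGI
2.4.21 patch, and every name in it (Node, shrink, alloc_pages,
buffered_write, EXTRA, cache_limit, free_threshold) is hypothetical.
Pages are plain labels, "freeing" a page just bumps a counter, and the
move-to-young-end step on reference follows Buddy Lumpkin's suggestion
rather than anything confirmed about the Altix code.

```python
from collections import OrderedDict

EXTRA = 2  # assumed stand-in for "plus a few pages, for efficiency"


class Node:
    """One NUMA node: a free-page counter plus a per-node reclaim list."""

    def __init__(self, free_pages):
        self.free_pages = free_pages
        self.reclaim = OrderedDict()  # clean, unmapped cache pages, oldest first
        self.pinned = set()           # dirty or mmap'd cache pages

    def cache_pages(self):
        return len(self.reclaim) + len(self.pinned)

    def add_to_cache(self, page):
        # Part (1): a page goes on the reclaim list when it enters the cache.
        if self.free_pages == 0:
            raise MemoryError("no free page on this node")
        self.free_pages -= 1
        self.reclaim[page] = None

    def touch(self, page):
        # A referenced page moves to the young end of the list.
        if page in self.reclaim:
            self.reclaim.move_to_end(page)

    def pin(self, page):
        # Dirty or mmap'd pages are not reclaimable, so they leave the list.
        if page in self.reclaim:
            del self.reclaim[page]
            self.pinned.add(page)

    def shrink(self, count):
        # Release up to `count` pages from the old end; return how many.
        freed = 0
        while freed < count and self.reclaim:
            self.reclaim.popitem(last=False)
            self.free_pages += 1
            freed += 1
        return freed


def alloc_pages(nodes, local, count):
    """Part (2): reclaim on the local node first, then spill to other nodes.

    Returns the index of the node that satisfied the request, or None.
    """
    node = nodes[local]
    if node.free_pages < count:
        node.shrink(count - node.free_pages + EXTRA)
    order = [local] + [i for i in range(len(nodes)) if i != local]
    for i in order:
        if nodes[i].free_pages >= count:
            nodes[i].free_pages -= count
            return i
    return None


def buffered_write(node, page, cache_limit, free_threshold):
    """Part (3): limit the page cache on writes; return True if throttled.

    The limit is only enforced once free memory drops below free_threshold.
    """
    throttled = False
    if node.free_pages < free_threshold and node.cache_pages() > cache_limit:
        excess = node.cache_pages() - cache_limit
        if node.shrink(excess) < excess:
            throttled = True  # the described code delays the writer here
    node.add_to_cache(page)
    node.pin(page)  # a freshly written page is dirty
    return throttled
```

In this model a writer is only throttled when free memory is low, the
cache is over its limit, and the reclaim list has nothing clean left to
give back, which mirrors the order of checks in the description above.
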
I have to admit that the definition of "desktop machine" has changed a
lot in the last few years in terms of hardware, but since the 486 days I
have been asking the same question: "what can I build/buy for <$2k which
best fits my overall computing?" With the onset of cheap memory and
Opteron, NUMA will in all probability be a factor in the next few years,
and SMP has been one since the dual Pentium systems were new.

That said, I think that your work will be useful, even if it is used
piecemeal or as inspiration to Nick, Andrea, and others who have been
working in the area. I find Nick's work as of 2.6.7-rc1-mm1 so good I
haven't moved any of my desktop machines beyond it, but it sounds as if
your work addresses the issue I mentioned about limiting buffer usage,
and Rik's comment that the code lacks checks and balances. You seem to
have a balance; I'd love to see it.
--
-bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me