From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Neil Brown <neilb@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	netdev@vger.kernel.org, trond.myklebust@fys.uio.no,
	Pekka Enberg <penberg@cs.helsinki.fi>
Subject: Re: [PATCH 00/28] Swap over NFS -v16
Date: Mon, 10 Mar 2008 10:17:54 +0100
Message-ID: <1205140674.8514.152.camel@twins>
In-Reply-To: <18388.50188.552322.780524@notabene.brown>

On Mon, 2008-03-10 at 16:15 +1100, Neil Brown wrote:

> > On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> > > 
> > > [I don't find the above wholly satisfying.  There seems to be too much
> > >  hand-waving.  If someone can provide better text explaining why
> > >  swapout is a special case, that would be great.]
> > 
> > Anonymous pages are dirty by definition (except the zero page, but I
> > think we recently ditched it). So shrinking of the anonymous pool will
> > require swapping.
> 
> Well, there is the swap cache.  That's probably what I was thinking of
> when I said "clean anonymous pages".  I suspect they are the first to
> go!

Ah, right, we could consider those clean anonymous. Alas, they are just
part of the aging lists and do not get special priority.

> > It is indeed the last refuge for those with GFP_NOFS. Along with the
> > strict limit on the amount of dirty file pages it also ensures writing
> > those out will never deadlock the machine as there are always clean file
> > pages and/or anonymous pages to launder.
> 
> The difficulty I have is justifying exactly why page-cache writeout
> will not deadlock.  What if all the memory that is not dirty-pagecache
> is anonymous, and if swap isn't enabled?

Ah, I never considered the !SWAP case.

> Maybe the number returned by "determine_dirtyable_memory" in
> page-writeback.c excludes anonymous pages?  I wonder if the meaning of
> NR_FREE_PAGES, NR_INACTIVE, etc is documented anywhere....

I don't think they are, but it should be obvious once you know the VM,
har har har :-)

NR_FREE_PAGES are the pages in the page allocator's free lists.
NR_INACTIVE are the pages on the inactive list.
NR_ACTIVE are the pages on the active list.

NR_INACTIVE+NR_ACTIVE are the number of pages on the page reclaim lists.

So, if you consider !SWAP, we could get in a deadlock when all of memory
is anonymous except for a few (<=dirty limit) dirty file pages.

But I guess the !SWAP people know what they're doing, large anon usage
without swap is asking for trouble.
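
To make the !SWAP arithmetic concrete, here is a toy userspace model (a
sketch only, not kernel code). The struct, its field names and the
helper launderable_pages() are hypothetical; it just counts the pages
that could still be freed or laundered while dirty file pages wait for
writeout, assuming anonymous pages only count when swap is available.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the counters discussed above, in pages. */
struct vm_counters {
	unsigned long nr_free;       /* models NR_FREE_PAGES */
	unsigned long nr_inactive;   /* models NR_INACTIVE */
	unsigned long nr_active;     /* models NR_ACTIVE */
	unsigned long nr_anon;       /* anonymous pages on the reclaim lists */
	unsigned long nr_file_dirty; /* dirty file pages on the reclaim lists */
};

/*
 * Pages that can be used to make progress without first writing out
 * dirty file pages: free pages, clean file pages, and (only when swap
 * is enabled) anonymous pages, which can be laundered by swapping.
 */
static unsigned long launderable_pages(const struct vm_counters *c,
				       bool swap_enabled)
{
	unsigned long lru = c->nr_inactive + c->nr_active;
	unsigned long clean_file;

	/* Anonymous and dirty file pages are a subset of the LRU pages. */
	assert(c->nr_anon + c->nr_file_dirty <= lru);
	clean_file = lru - c->nr_anon - c->nr_file_dirty;

	return c->nr_free + clean_file + (swap_enabled ? c->nr_anon : 0);
}
```

With 1000 pages on the reclaim lists, 990 of them anonymous and 10 dirty
file pages, the model leaves nothing to launder without swap and 990
pages with swap, which is the situation described above.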
 
> > Right. I've had a long conversation on PG_emergency with Pekka. And I
> > think the conclusion was that PG_emergency will create more head-aches
> > than it solves. I probably have the conversation in my IRC logs and
> > could email it if you're interested (and Pekka doesn't object).
> 
> Maybe that depends on the exact semantic of PG_emergency ??
> I remember you being concerned that PG_emergency never changes between
> allocation and freeing, and that wouldn't work well with slub.
> My envisioned semantic has it possibly changing quite often.
> What it means is:
>    The last allocation done from this page was in a low-memory
>    condition.

Yes, that works, except that we'd need to iterate all pages and clear
PG_emergency - which would imply tracking all these pages etc..

Hence it would be better not to keep persistent state and do as we do
now; use some non-persistent state on allocation.

> You really need some way to tell if the result of kmalloc/kmemalloc
> should be treated as reserved.
> I think you had code which first tried the allocation without
> GFP_MEMALLOC and then if that failed, tried again *with*
> GFP_MEMALLOC.  If that then succeeded, it is assumed to be an
> allocation from reserves.  That seemed rather ugly, though I guess you
> could wrap it in a function to hide the ugliness:
> 
> void *kmalloc_reserve(size_t size, int *reserve, gfp_t gfp_flags)
> {
> 	void *result = kmalloc(size, gfp_flags & ~GFP_MEMALLOC);
> 	if (result) {
> 		*reserve = 0;
> 		return result;
> 	}
> 	result = kmalloc(size, gfp_flags | GFP_MEMALLOC);
> 	if (result) {
> 		*reserve = 1;
> 		return result;
> 	}
> 	return NULL;
> }
> ???

Yeah, I think this is the best we can do, just split this part out into
helper functions. I've been thinking of doing this - just haven't gotten
around to implementing it. Hope to do so this week and send out a new
series.
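
For illustration, the two-step pattern quoted above can be exercised in
userspace. This is only a sketch under stated assumptions: model_pool,
model_alloc(), model_alloc_reserve() and MODEL_GFP_MEMALLOC are
hypothetical stand-ins, not the kernel allocator or the helpers that
will be in the series. The point is that the "came from reserves" state
is returned to the caller at allocation time rather than stored on the
page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_GFP_MEMALLOC 0x1u /* hypothetical stand-in for the kernel flag */

/* Toy allocator state: a normal budget plus an emergency reserve. */
struct model_pool {
	size_t normal_left;  /* bytes available without touching reserves */
	size_t reserve_left; /* bytes left in the emergency reserve */
};

/* Dip into the reserve only when the caller passes MODEL_GFP_MEMALLOC. */
static void *model_alloc(struct model_pool *pool, size_t size,
			 unsigned int flags)
{
	assert(pool != NULL);
	if (size <= pool->normal_left) {
		pool->normal_left -= size;
		return malloc(size);
	}
	if ((flags & MODEL_GFP_MEMALLOC) && size <= pool->reserve_left) {
		pool->reserve_left -= size;
		return malloc(size);
	}
	return NULL;
}

/*
 * Try without the reserve flag first; retry with it only on failure.
 * *from_reserve tells the caller which attempt succeeded.
 */
static void *model_alloc_reserve(struct model_pool *pool, size_t size,
				 unsigned int flags, bool *from_reserve)
{
	void *p = model_alloc(pool, size, flags & ~MODEL_GFP_MEMALLOC);

	*from_reserve = false;
	if (p)
		return p;

	p = model_alloc(pool, size, flags | MODEL_GFP_MEMALLOC);
	if (p)
		*from_reserve = true;
	return p;
}
```

A caller would check *from_reserve right after the allocation and treat
the object accordingly; nothing needs to be cleared or tracked later.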

> > I've already heard interest from other people to use these hooks to
> > provide swap on other non-block filesystems such as jffs2, logfs and the
> > like.
> 
> I'm interested in the swap_in/swap_out interface for external
> write-intent bitmaps for md/raid arrays.
> You can have a write-intent bitmap which records which blocks might be
> dirty if the host crashes, so that resync is much faster.
> It can be stored in a file in a separate filesystem, but that is
> currently implemented by using bmap to enumerate the blocks and then
> reading/writing directly to the device (like swap).  Your interface
> would be much nicer for that (not that I think having a
> write-intent-bitmap on an NFS filesystem would be a clever idea ;-)

Hmm, right. But for that purpose the names swap_* are a tad misleading.
I remember hch mentioning this at some point. What would be a more
suitable naming scheme so we can both use it?

> I'll look forward to your next patch set....
> 
> One thing I had thought odd while reading the patches, but haven't
> found an opportunity to mention before, is the "IS_SWAPFILE" test in
> nfs-swapper.patch.
> This seems like a layering violation.  It would be better if the test
> was based on whether  ->swapfile had been called on the file.  That way
> my write-intent-bitmaps would get the same benefit.

I'll look into this, I didn't think using an inode test inside a
filesystem implementation was too weird.


