* How can we share page cache pages for reflinked files?
@ 2017-08-10 4:28 Dave Chinner
2017-08-10 5:57 ` Kirill A. Shutemov
2017-08-10 16:11 ` Matthew Wilcox
0 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2017-08-10 4:28 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm
Hi folks,
I've recently been looking into what is involved in sharing page
cache pages for shared extents in a filesystem. That is, create a
file, reflink it so there's two files but only one copy of the data
on disk, then read both files. Right now, we get two copies of the
data in the page cache - one in each inode mapping tree.
If we scale this up to a container host which is using reflink trees
for its shared root images, there might be hundreds of copies of the
same data held in cache (i.e. one page per container). Given that
the filesystem knows that the underlying data extent is shared when
we go to read it, it's relatively easy to add mechanisms to the
filesystem to return the same page for all attempts to read from a
shared extent, across all inodes that share it.
However, the problem I'm getting stuck on is that the page cache
itself can't handle inserting a single page into multiple page cache
mapping trees. i.e. the page has a single pointer to the mapping
address space, and the mapping has a single pointer back to the
owner inode. As such, a cached page has a 1:1 mapping to its host
inode, and this structure seems to be assumed rather widely
throughout the code.
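For reference, the fields that encode that 1:1 relationship look
roughly like this (heavily abridged - only the members relevant here
are shown):

struct page {
        unsigned long flags;
        struct address_space *mapping;  /* the single owning mapping */
        pgoff_t index;                  /* offset within that mapping */
        ...
};

struct address_space {
        struct inode *host;                     /* the single owning inode */
        struct radix_tree_root page_tree;       /* the mapping tree */
        ...
};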
The problem is somewhat limited by the fact that only clean,
read-only pages would be shared, and the attempt to write/dirty
a shared page in a mapping would trigger a COW operation in the
filesystem which would invalidate that inode's shared page and
replace it with a new, inode-private page that could be written to.
This still requires us to be able to find the right inode from the
shared page context to run the COW operation. Luckily, the IO path
already has an inode pointer, and the page fault path provides us
with the inode via file_inode(vmf->vma->vm_file) so we don't
actually need page->mapping->host in these paths.
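As an illustration only (the function and the COW helper below are
invented, not real code), a write fault handler for a shared page
never needs to go near page->mapping->host:

static int example_shared_page_mkwrite(struct vm_fault *vmf)
{
        /* The fault path hands us the file, so we can find the right
         * inode without dereferencing page->mapping->host. */
        struct inode *inode = file_inode(vmf->vma->vm_file);

        /* Hypothetical helper: run the filesystem COW machinery for
         * this inode/offset, swapping the shared page for a private,
         * writeable copy. */
        return fs_cow_shared_page(inode, vmf->pgoff);
}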
Along these lines I've thought about using a "shared mapping" that
is associated with the filesystem rather than a specific inode (like
a bdev mapping), but that's no good: if page->mapping !=
inode->i_mapping, the page is considered to have been invalidated
and must not be used.
Further - a page has a single, fixed index into the mapping tree
(i.e. page->index), so this prevents arbitrary page sharing across
inodes (the "deduplication triggered shared extent" case). And we
can't really get rid of the page index, because that's how the page
finds itself in a mapping tree.
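Every page cache lookup is keyed by that (mapping, index) pair;
stripped of the locking, refcounting and slot handling it boils down
to:

        page = radix_tree_lookup(&mapping->page_tree, index);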
This leads me to think about crazy schemes like allocating a
"referring struct page" that is allocated for every reference to a
shared cache page and chain them all to the real struct page sorta
like we do for compound pages. That would give us a unique struct
page for each mapping tree and solve many of the issues, but I'm not
sure how viable such a concept would be.
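To make that slightly more concrete, such a "referring page" might
carry little more than the per-mapping state - something like this
(name and layout invented purely for illustration):

/* purely illustrative - not a proposal for a real structure layout */
struct page_referrer {
        unsigned long flags;            /* PG_referring-style flag lives here */
        struct address_space *mapping;  /* the sharing inode's mapping */
        pgoff_t index;                  /* index in that mapping's tree */
        struct page *real_page;         /* the real page holding the data */
};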
I'm sure there are more issues than I've outlined here, but I haven't
gone deeper than this because I've got to solve the one-to-many
problem first. I don't know if anyone has looked at this in any
detail, so I don't know what ideas, patches, crazy schemes, etc
might already exist out there. Right now I'm just looking for
information to narrow down what I need to look at - finding what
rabbit holes have already been explored and what dragons are already
known about would help an awful lot right now.
Anyone?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How can we share page cache pages for reflinked files?
2017-08-10 4:28 How can we share page cache pages for reflinked files? Dave Chinner
@ 2017-08-10 5:57 ` Kirill A. Shutemov
2017-08-10 9:01 ` Dave Chinner
2017-08-10 16:11 ` Matthew Wilcox
1 sibling, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-08-10 5:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-mm
On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> Hi folks,
>
> I've recently been looking into what is involved in sharing page
> cache pages for shared extents in a filesystem. That is, create a
> file, reflink it so there's two files but only one copy of the data
> on disk, then read both files. Right now, we get two copies of the
> data in the page cache - one in each inode mapping tree.
>
> If we scale this up to a container host which is using reflink trees
> it's shared root images, there might be hundreds of copies of the
> same data held in cache (i.e. one page per container). Given that
> the filesystem knows that the underlying data extent is shared when
> we go to read it, it's relatively easy to add mechanisms to the
> filesystem to return the same page for all attempts to read the
> from a shared extent from all inodes that share it.
>
> However, the problem I'm getting stuck on is that the page cache
> itself can't handle inserting a single page into multiple page cache
> mapping trees. i.e. The page has a single pointer to the mapping
> address space, and the mapping has a single pointer back to the
> owner inode. As such, a cached page has a 1:1 mapping to it's host
> inode and this structure seems to be assumed rather widely through
> the code.
I think that to solve the problem with page->mapping we need something
similar to what we have for anon rmap[1]. That way we would be able to
keep the same page in the page cache for multiple inodes.
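For reference, the linkage anon rmap uses looks roughly like this
(abridged from include/linux/rmap.h - see [1] for the full story; for
anon pages, page->mapping points at an anon_vma with a tag bit set):

struct anon_vma {
        struct anon_vma *root;          /* root of this anon_vma tree */
        struct rw_semaphore rwsem;
        atomic_t refcount;
        struct rb_root rb_root;         /* interval tree of chains */
        ...
};

struct anon_vma_chain {
        struct vm_area_struct *vma;
        struct anon_vma *anon_vma;
        struct list_head same_vma;      /* all AVCs for this vma */
        struct rb_node rb;              /* node in anon_vma->rb_root */
        ...
};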
The long-term benefit of this is that we might be able to unify a lot
of code in the anon and file paths in mm, making anon memory a special
case of file mapping.
The downside is that anon rmap is rather complicated. I have to re-read
the article every time I deal with anon rmap to remind myself how it
works.
[1] https://lwn.net/Articles/383162/
--
Kirill A. Shutemov
* Re: How can we share page cache pages for reflinked files?
2017-08-10 5:57 ` Kirill A. Shutemov
@ 2017-08-10 9:01 ` Dave Chinner
2017-08-10 13:31 ` Kirill A. Shutemov
0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2017-08-10 9:01 UTC (permalink / raw)
To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm
On Thu, Aug 10, 2017 at 08:57:37AM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> > Hi folks,
> >
> > I've recently been looking into what is involved in sharing page
> > cache pages for shared extents in a filesystem. That is, create a
> > file, reflink it so there's two files but only one copy of the data
> > on disk, then read both files. Right now, we get two copies of the
> > data in the page cache - one in each inode mapping tree.
> >
> > If we scale this up to a container host which is using reflink trees
> > it's shared root images, there might be hundreds of copies of the
> > same data held in cache (i.e. one page per container). Given that
> > the filesystem knows that the underlying data extent is shared when
> > we go to read it, it's relatively easy to add mechanisms to the
> > filesystem to return the same page for all attempts to read the
> > from a shared extent from all inodes that share it.
> >
> > However, the problem I'm getting stuck on is that the page cache
> > itself can't handle inserting a single page into multiple page cache
> > mapping trees. i.e. The page has a single pointer to the mapping
> > address space, and the mapping has a single pointer back to the
> > owner inode. As such, a cached page has a 1:1 mapping to it's host
> > inode and this structure seems to be assumed rather widely through
> > the code.
>
> I think to solve the problem with page->mapping we need something similar
> to what we have for anon rmap[1]. In this case we would be able to keep
> the same page in page cache for multiple inodes.
Being unfamiliar with the anon rmap code, I'm struggling to see the
need for that much complexity here. The AVC abstraction solves a
scalability problem that, to me, doesn't exist for tracking multiple
mapping tree pointers for a page. i.e. I don't see where a list
traversal is necessary in the shared page -> mapping tree resolution
for page cache sharing.
I've been thinking of something simpler along the lines of dynamic
struct page objects with special page flags that allow us to keep
different mapping tree entries for the same physical page. Seems like
this would work for read-only sharing, but perhaps I'm just blind and
I'm missing something I shouldn't be?
> The long term benefit for this is that we might be able to unify a lot of
> code for anon and file code paths in mm, making anon memory a special case
> of file mapping.
>
> The downside is that anon rmap is rather complicated. I have to re-read
> the article everytime I deal with anon rmap to remind myself how it works.
Yeah, that's a problem - if you have trouble with it, I've got no
hope.... :/
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How can we share page cache pages for reflinked files?
2017-08-10 9:01 ` Dave Chinner
@ 2017-08-10 13:31 ` Kirill A. Shutemov
2017-08-11 3:59 ` Dave Chinner
0 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-08-10 13:31 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, Rik van Riel
On Thu, Aug 10, 2017 at 07:01:33PM +1000, Dave Chinner wrote:
> On Thu, Aug 10, 2017 at 08:57:37AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> > > Hi folks,
> > >
> > > I've recently been looking into what is involved in sharing page
> > > cache pages for shared extents in a filesystem. That is, create a
> > > file, reflink it so there's two files but only one copy of the data
> > > on disk, then read both files. Right now, we get two copies of the
> > > data in the page cache - one in each inode mapping tree.
> > >
> > > If we scale this up to a container host which is using reflink trees
> > > it's shared root images, there might be hundreds of copies of the
> > > same data held in cache (i.e. one page per container). Given that
> > > the filesystem knows that the underlying data extent is shared when
> > > we go to read it, it's relatively easy to add mechanisms to the
> > > filesystem to return the same page for all attempts to read the
> > > from a shared extent from all inodes that share it.
> > >
> > > However, the problem I'm getting stuck on is that the page cache
> > > itself can't handle inserting a single page into multiple page cache
> > > mapping trees. i.e. The page has a single pointer to the mapping
> > > address space, and the mapping has a single pointer back to the
> > > owner inode. As such, a cached page has a 1:1 mapping to it's host
> > > inode and this structure seems to be assumed rather widely through
> > > the code.
> >
> > I think to solve the problem with page->mapping we need something similar
> > to what we have for anon rmap[1]. In this case we would be able to keep
> > the same page in page cache for multiple inodes.
>
> Being unfamiliar with the anon rmap code, I'm struggling to see the
> need for that much complexity here. The AVC abstraction solves a
> scalability problem that, to me, doesn't exist for tracking multiple
> mapping tree pointers for a page. i.e. I don't see where a list
> traversal is necessary in the shared page -> mapping tree resolution
> for page cache sharing.
[ Cc: Rik ]
The reflink interface has the potential to construct a tree of
dependencies between reflinked files similar in complexity to the
tree of forks (and CoWed anon mappings) that led to the current anon
rmap design.
But it's harder to get there accidentally. :)
> I've been thinking of something simpler along the lines of a dynamic
> struct page objects w/ special page flags as an object that allows
> us to keep different mapping tree entries for the same physical
> page. Seems like this would work for read-only sharing, but perhaps
> I'm just blind and I'm missing something I shouldn't be?
A naive approach would be to put all mappings connected through reflink
on the same linked list; page->mapping could then point to any of them.
To check that the page actually belongs to a given mapping we would
need to look it up in that mapping's radix tree.
We had something like this before the current anon rmap design, but
with page tables rather than the radix tree as the primary reference.
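Roughly like this (a sketch only - and note it relies on the page
having the same index in every mapping, which is exactly the
limitation you pointed out):

/* Sketch: page->mapping may point at any member of the reflinked set,
 * so membership in a particular mapping has to be verified against
 * that mapping's radix tree. */
static bool page_in_mapping(struct page *page, struct address_space *mapping)
{
        void *entry;

        rcu_read_lock();
        entry = radix_tree_lookup(&mapping->page_tree, page->index);
        rcu_read_unlock();

        return entry == page;
}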
> > The long term benefit for this is that we might be able to unify a lot of
> > code for anon and file code paths in mm, making anon memory a special case
> > of file mapping.
> >
> > The downside is that anon rmap is rather complicated. I have to re-read
> > the article everytime I deal with anon rmap to remind myself how it works.
>
> Yeah, that's a problem - if you have trouble with it, I've got no
> hope.... :/
--
Kirill A. Shutemov
* Re: How can we share page cache pages for reflinked files?
2017-08-10 4:28 How can we share page cache pages for reflinked files? Dave Chinner
2017-08-10 5:57 ` Kirill A. Shutemov
@ 2017-08-10 16:11 ` Matthew Wilcox
2017-08-10 19:17 ` Vivek Goyal
2017-08-11 4:25 ` Dave Chinner
1 sibling, 2 replies; 16+ messages in thread
From: Matthew Wilcox @ 2017-08-10 16:11 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-mm
On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> I've recently been looking into what is involved in sharing page
> cache pages for shared extents in a filesystem. That is, create a
> file, reflink it so there's two files but only one copy of the data
> on disk, then read both files. Right now, we get two copies of the
> data in the page cache - one in each inode mapping tree.
Yep. We had a brief discussion of this at LSFMM (as you know, since you
commented on the discussion): https://lwn.net/Articles/717950/
> If we scale this up to a container host which is using reflink trees
> it's shared root images, there might be hundreds of copies of the
> same data held in cache (i.e. one page per container). Given that
> the filesystem knows that the underlying data extent is shared when
> we go to read it, it's relatively easy to add mechanisms to the
> filesystem to return the same page for all attempts to read the
> from a shared extent from all inodes that share it.
I agree the problem exists. Should we try to fix this problem, or
should we steer people towards solutions which don't have this problem?
The solutions I've been seeing use COW block devices instead of COW
filesystems, and DAX to share the common pages between the host and
each guest.
> This leads me to think about crazy schemes like allocating a
> "referring struct page" that is allocated for every reference to a
> shared cache page and chain them all to the real struct page sorta
> like we do for compound pages. That would give us a unique struct
> page for each mapping tree and solve many of the issues, but I'm not
> sure how viable such a concept would be.
That's the solution I'd recommend looking into more deeply. We've also
talked about creating referring struct pages to support block size >
page size.
* Re: How can we share page cache pages for reflinked files?
2017-08-10 16:11 ` Matthew Wilcox
@ 2017-08-10 19:17 ` Vivek Goyal
2017-08-10 21:20 ` Matthew Wilcox
2017-08-11 4:25 ` Dave Chinner
1 sibling, 1 reply; 16+ messages in thread
From: Vivek Goyal @ 2017-08-10 19:17 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Dave Chinner, linux-fsdevel, linux-mm
On Thu, Aug 10, 2017 at 09:11:59AM -0700, Matthew Wilcox wrote:
> On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> > I've recently been looking into what is involved in sharing page
> > cache pages for shared extents in a filesystem. That is, create a
> > file, reflink it so there's two files but only one copy of the data
> > on disk, then read both files. Right now, we get two copies of the
> > data in the page cache - one in each inode mapping tree.
>
> Yep. We had a brief discussion of this at LSFMM (as you know, since you
> commented on the discussion): https://lwn.net/Articles/717950/
>
> > If we scale this up to a container host which is using reflink trees
> > it's shared root images, there might be hundreds of copies of the
> > same data held in cache (i.e. one page per container). Given that
> > the filesystem knows that the underlying data extent is shared when
> > we go to read it, it's relatively easy to add mechanisms to the
> > filesystem to return the same page for all attempts to read the
> > from a shared extent from all inodes that share it.
>
> I agree the problem exists. Should we try to fix this problem, or
> should we steer people towards solutions which don't have this problem?
> The solutions I've been seeing use COW block devices instead of COW
> filesystems, and DAX to share the common pages between the host and
> each guest.
Hi Matthew,
Is this in the context of Clear Containers? It would be good to have
a solution for those who are not launching virt guests.
overlayfs helps mitigate this page cache sharing issue, but xfs reflink
and dm thin pools continue to face it.
Vivek
>
> > This leads me to think about crazy schemes like allocating a
> > "referring struct page" that is allocated for every reference to a
> > shared cache page and chain them all to the real struct page sorta
> > like we do for compound pages. That would give us a unique struct
> > page for each mapping tree and solve many of the issues, but I'm not
> > sure how viable such a concept would be.
>
> That's the solution I'd recommend looking into deeper. We've also talked
> about creating referring struct pages to support block size > page size.
* Re: How can we share page cache pages for reflinked files?
2017-08-10 19:17 ` Vivek Goyal
@ 2017-08-10 21:20 ` Matthew Wilcox
0 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2017-08-10 21:20 UTC (permalink / raw)
To: Vivek Goyal; +Cc: Dave Chinner, linux-fsdevel, linux-mm
On Thu, Aug 10, 2017 at 03:17:46PM -0400, Vivek Goyal wrote:
> On Thu, Aug 10, 2017 at 09:11:59AM -0700, Matthew Wilcox wrote:
> > On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> > > If we scale this up to a container host which is using reflink trees
> > > it's shared root images, there might be hundreds of copies of the
> > > same data held in cache (i.e. one page per container). Given that
> > > the filesystem knows that the underlying data extent is shared when
> > > we go to read it, it's relatively easy to add mechanisms to the
> > > filesystem to return the same page for all attempts to read the
> > > from a shared extent from all inodes that share it.
> >
> > I agree the problem exists. Should we try to fix this problem, or
> > should we steer people towards solutions which don't have this problem?
> > The solutions I've been seeing use COW block devices instead of COW
> > filesystems, and DAX to share the common pages between the host and
> > each guest.
>
> Hi Matthew,
>
> This is in the context of clear containers? It would be good to have
> a solution for those who are not launching virt guests.
>
> overlayfs helps mitigate this page cache sharing issue but xfs reflink
> and dm thin pool continue to face this issue.
Right, this is with Clear Containers. But there's no reason it couldn't
be used with other virtualisation solutions.
* Re: How can we share page cache pages for reflinked files?
2017-08-10 13:31 ` Kirill A. Shutemov
@ 2017-08-11 3:59 ` Dave Chinner
2017-08-11 12:57 ` Kirill A. Shutemov
0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2017-08-11 3:59 UTC (permalink / raw)
To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm, Rik van Riel
On Thu, Aug 10, 2017 at 04:31:18PM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 10, 2017 at 07:01:33PM +1000, Dave Chinner wrote:
> > On Thu, Aug 10, 2017 at 08:57:37AM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> > > > Hi folks,
> > > >
> > > > I've recently been looking into what is involved in sharing page
> > > > cache pages for shared extents in a filesystem. That is, create a
> > > > file, reflink it so there's two files but only one copy of the data
> > > > on disk, then read both files. Right now, we get two copies of the
> > > > data in the page cache - one in each inode mapping tree.
> > > >
> > > > If we scale this up to a container host which is using reflink trees
> > > > it's shared root images, there might be hundreds of copies of the
> > > > same data held in cache (i.e. one page per container). Given that
> > > > the filesystem knows that the underlying data extent is shared when
> > > > we go to read it, it's relatively easy to add mechanisms to the
> > > > filesystem to return the same page for all attempts to read the
> > > > from a shared extent from all inodes that share it.
> > > >
> > > > However, the problem I'm getting stuck on is that the page cache
> > > > itself can't handle inserting a single page into multiple page cache
> > > > mapping trees. i.e. The page has a single pointer to the mapping
> > > > address space, and the mapping has a single pointer back to the
> > > > owner inode. As such, a cached page has a 1:1 mapping to it's host
> > > > inode and this structure seems to be assumed rather widely through
> > > > the code.
> > >
> > > I think to solve the problem with page->mapping we need something similar
> > > to what we have for anon rmap[1]. In this case we would be able to keep
> > > the same page in page cache for multiple inodes.
> >
> > Being unfamiliar with the anon rmap code, I'm struggling to see the
> > need for that much complexity here. The AVC abstraction solves a
> > scalability problem that, to me, doesn't exist for tracking multiple
> > mapping tree pointers for a page. i.e. I don't see where a list
> > traversal is necessary in the shared page -> mapping tree resolution
> > for page cache sharing.
>
> [ Cc: Rik ]
>
> The reflink interface has potential to construct a tree of dependencies
> between reflinked files similar in complexity to tree of forks (and CoWed
> anon mappings) that lead to current anon rmap design.
I'm too stupid to see the operation that would create the tree of
dependencies you are talking about. Can you outline how we get to
that situation?
AFAICT, the dependencies just don't exist because the reflink
operations don't duplicate the page cache into the new file. And
when we are doing a cache lookup, we are looking for a page with a matching *block address*,
not a specific mapping or index in the cache. The cached page could
be anywhere in the filesystem; it could even be on a different
block device and filesystem. Nothing we have in the page cache
indexes physical block addresses, so these lookups cannot be done
via the page cache.
Physical block index lookups, of course, are what buffer caches are
for. So essentially the process of sharing the pages cached on a
shared extent is this:
Cold cache:
  1. lookup extent map
  2. find IOMAP_SHARED is set on the extent
  3. Look up iomap->blkno in buffer cache
     a. Search for cached block
     b. not found, instantiate, attach page to buffer
     c. take ref to page, return struct page
  4. insert struct page into page cache
  5. do read IO into page.
Hot cache: Only step 3 changes:
  3. Look up iomap->blkno in buffer cache
     a. Search for cached block
     b. Found, take ref to page
     c. return struct page
IOWs, we use a buffer cache to provide an inclusive global L2 cache
for pages cached over shared blocks within a filesystem LBA space.
This means there is no "needle in a haystack" search for matching
shared cached pages, nor is there a complex dependency graph between
shared pages and mappings.
The only thing that makes this not work right now is that a
struct page can't be shared across multiple mappings....
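In very rough, entirely hypothetical code, the cold cache path would
look something like this - shared_block_get_page() stands in for the
buffer cache lookup/instantiation in step 3 and doesn't exist
anywhere:

/* sketch only - invented helpers, no locking or error handling */
static struct page *read_shared_extent_page(struct inode *inode,
                                             pgoff_t index,
                                             struct iomap *iomap)
{
        struct page *page;

        /* step 3: physical block indexed lookup in the buffer cache,
         * instantiating the buffer and taking a page reference */
        page = shared_block_get_page(inode->i_sb, iomap->blkno);

        /* step 4: insert into this inode's mapping tree - the bit the
         * current 1:1 page/mapping model can't do if the page already
         * lives in another inode's tree */
        add_to_page_cache_lru(page, inode->i_mapping, index, GFP_NOFS);

        /* step 5: cold cache only */
        if (!PageUptodate(page))
                read_page_from_shared_extent(page, iomap);      /* hypothetical */
        return page;
}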
> But it's harder to get there accidentally. :)
>
> > I've been thinking of something simpler along the lines of a dynamic
> > struct page objects w/ special page flags as an object that allows
> > us to keep different mapping tree entries for the same physical
> > page. Seems like this would work for read-only sharing, but perhaps
> > I'm just blind and I'm missing something I shouldn't be?
>
> Naive approach would be just to put all connected through reflink mappings
> on the same linked list. page->mapping can point to any of them in this
> case. To check that the page actually belong to the mapping we would need
> to look into radix-tree.
There is so much code that assumes page->mapping points to the
mapping (and hence mapping tree) the page has been inserted into.
Indeed, this is how we check for racing with page
invalidation/reclaim after a lookup. i.e. once we've locked a page,
if page->mapping is different to the current mapping we have, then
the page is considered invalid and we shouldn't touch it. This
mechanism is the only thing that makes truncate work correctly in
many filesystems, so changing this is pretty much a non-starter.
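The canonical pattern after a page cache lookup is along these lines:

        lock_page(page);
        if (page->mapping != mapping) {
                /* Raced with truncate/invalidation - the page no longer
                 * belongs to this file, so drop it and retry the lookup. */
                unlock_page(page);
                put_page(page);
                goto repeat;
        }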
Put simply, I'm trying to find a solution that doesn't start with
"break all the filesystems".... :/
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How can we share page cache pages for reflinked files?
2017-08-10 16:11 ` Matthew Wilcox
2017-08-10 19:17 ` Vivek Goyal
@ 2017-08-11 4:25 ` Dave Chinner
2017-08-11 17:08 ` Matthew Wilcox
1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2017-08-11 4:25 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm
On Thu, Aug 10, 2017 at 09:11:59AM -0700, Matthew Wilcox wrote:
> On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> > I've recently been looking into what is involved in sharing page
> > cache pages for shared extents in a filesystem. That is, create a
> > file, reflink it so there's two files but only one copy of the data
> > on disk, then read both files. Right now, we get two copies of the
> > data in the page cache - one in each inode mapping tree.
>
> Yep. We had a brief discussion of this at LSFMM (as you know, since you
> commented on the discussion): https://lwn.net/Articles/717950/
*nod*
> > If we scale this up to a container host which is using reflink trees
> > it's shared root images, there might be hundreds of copies of the
> > same data held in cache (i.e. one page per container). Given that
> > the filesystem knows that the underlying data extent is shared when
> > we go to read it, it's relatively easy to add mechanisms to the
> > filesystem to return the same page for all attempts to read the
> > from a shared extent from all inodes that share it.
>
> I agree the problem exists. Should we try to fix this problem, or
> should we steer people towards solutions which don't have this problem?
> The solutions I've been seeing use COW block devices instead of COW
> filesystems, and DAX to share the common pages between the host and
> each guest.
That's one possible solution for people using hardware
virtualisation, but not everyone is doing that. It also relies on
block devices, which rules out a whole bunch of interesting stuff we
can do with filesystems...
> > This leads me to think about crazy schemes like allocating a
> > "referring struct page" that is allocated for every reference to a
> > shared cache page and chain them all to the real struct page sorta
> > like we do for compound pages. That would give us a unique struct
> > page for each mapping tree and solve many of the issues, but I'm not
> > sure how viable such a concept would be.
>
> That's the solution I'd recommend looking into deeper.
OK, I'll dig deeper into that and try to understand what happens if
we put such things on the LRUs so reclaim can act on them.
> We've also talked
> about creating referring struct pages to support block size > page size.
Yup, that's a small extension of the infrastructure we need in XFS
for caching shared blocks. :)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How can we share page cache pages for reflinked files?
2017-08-11 3:59 ` Dave Chinner
@ 2017-08-11 12:57 ` Kirill A. Shutemov
0 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-08-11 12:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, Rik van Riel
On Fri, Aug 11, 2017 at 01:59:22PM +1000, Dave Chinner wrote:
> On Thu, Aug 10, 2017 at 04:31:18PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Aug 10, 2017 at 07:01:33PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 10, 2017 at 08:57:37AM +0300, Kirill A. Shutemov wrote:
> > > > On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> > > > > Hi folks,
> > > > >
> > > > > I've recently been looking into what is involved in sharing page
> > > > > cache pages for shared extents in a filesystem. That is, create a
> > > > > file, reflink it so there's two files but only one copy of the data
> > > > > on disk, then read both files. Right now, we get two copies of the
> > > > > data in the page cache - one in each inode mapping tree.
> > > > >
> > > > > If we scale this up to a container host which is using reflink trees
> > > > > it's shared root images, there might be hundreds of copies of the
> > > > > same data held in cache (i.e. one page per container). Given that
> > > > > the filesystem knows that the underlying data extent is shared when
> > > > > we go to read it, it's relatively easy to add mechanisms to the
> > > > > filesystem to return the same page for all attempts to read the
> > > > > from a shared extent from all inodes that share it.
> > > > >
> > > > > However, the problem I'm getting stuck on is that the page cache
> > > > > itself can't handle inserting a single page into multiple page cache
> > > > > mapping trees. i.e. The page has a single pointer to the mapping
> > > > > address space, and the mapping has a single pointer back to the
> > > > > owner inode. As such, a cached page has a 1:1 mapping to it's host
> > > > > inode and this structure seems to be assumed rather widely through
> > > > > the code.
> > > >
> > > > I think to solve the problem with page->mapping we need something similar
> > > > to what we have for anon rmap[1]. In this case we would be able to keep
> > > > the same page in page cache for multiple inodes.
> > >
> > > Being unfamiliar with the anon rmap code, I'm struggling to see the
> > > need for that much complexity here. The AVC abstraction solves a
> > > scalability problem that, to me, doesn't exist for tracking multiple
> > > mapping tree pointers for a page. i.e. I don't see where a list
> > > traversal is necessary in the shared page -> mapping tree resolution
> > > for page cache sharing.
> >
> > [ Cc: Rik ]
> >
> > The reflink interface has potential to construct a tree of dependencies
> > between reflinked files similar in complexity to tree of forks (and CoWed
> > anon mappings) that lead to current anon rmap design.
>
> I'm too stupid to see the operation that would create the tree of
> dependencies you are talking about. Can you outline how we get to
> that situation?
>
> AFAICT, the dependencies just don't exist because the reflink
> operations don't duplicate the page cache into the new file. And
> when we are doing a cache lookup, we are looking for a page with a matching *block address*,
> not a specific mapping or index in the cache. The cached page could
> be anywhere in the filesystem, it could even be on a different
> block device and filesystem. Nothing we have in the page cache
> indexes physical block addresses, so these lookups cannot be done
> via the page cache.
>
> Physical block index lookups, of course, is what buffer caches are
> for. So essentially the process of sharing the pages cached on a
> shared extent is this:
>
> Cold cache:
>
> 1. lookup extent map
> 2. find IOMAP_SHARED is set on the extent
> 3. Look up iomap->blkno in buffer cache
> a. Search for cached block
> b. not found, instantiate, attach page to buffer
> c. take ref to page, return struct page
> 4. insert struct page into page cache
> 5. do read IO into page.
>
> Hot cache: Only step 3 changes:
>
> 3. Look up iomap->blkno in buffer cache
> a. Search for cached block
> b. Found, take ref to page,
> c. return struct page
>
> IOWs, we use a buffer cache to provide an inclusive global L2 cache
> for pages cached over shared blocks within a filesystem LBA space.
> This means there is no "needle in a haystack" search for matching
> shared cached pages, nor is there complex dependency graph between
> shared pages and mappings.
>
> The only thing that makes this not work right now is that a
> struct page can't be shared across mutliple mappings....
>
> > But it's harder to get there accidentally. :)
> >
> > > I've been thinking of something simpler along the lines of a dynamic
> > > struct page objects w/ special page flags as an object that allows
> > > us to keep different mapping tree entries for the same physical
> > > page. Seems like this would work for read-only sharing, but perhaps
> > > I'm just blind and I'm missing something I shouldn't be?
> >
> > Naive approach would be just to put all connected through reflink mappings
> > on the same linked list. page->mapping can point to any of them in this
> > case. To check that the page actually belong to the mapping we would need
> > to look into radix-tree.
>
> There is so much code that assumes page->mapping points to the
> mapping (and hence mapping tree) the page has been inserted into.
> Indeed, this is how we check for racing with page
> invalidation/reclaim after a lookup. i.e. once we've locked a page,
> if page->mapping is different to the current mapping we have, then
> the page is considered invalid and we shouldn't touch it. This
> mechanism is the only thing that makes truncate work correctly in
> many filesystems, so changing this is pretty much a non-starter.
>
> Put simply, I'm trying to find a solution that doesn't start with
> "break all the filesystems".... :/
I don't think there is one.
Your idea of multiple (dynamic) struct pages per physical page is not
feasible on many levels:
- We still need a consistent view of page metadata: PG_locked,
  page_count(), page_mapcount(), etc. I don't think that's solvable
  with multiple struct pages;
- We need to be able to find *all* mappings of the physical page:
  consider page reclaim, which requires removing the page from every
  radix tree that holds it.
  Basically, all these dynamic struct pages would need to be aware of
  the siblings that represent the same physical page, and we don't
  have space for that;
- It doesn't scale well: 1000 reflinked files with 1000 pages each
  would require 1M struct pages.
The only option I see is to redefine the meaning of page->mapping. It
needs to point to a data structure that can be used to find all
mappings where the page might be cached.
With the lookup process you've described above, I no longer think
linking all the mappings is an option. And a data structure per
physical page looks too wasteful to me.
I've run out of ideas for now. :-/
--
Kirill A. Shutemov
* Re: How can we share page cache pages for reflinked files?
2017-08-11 4:25 ` Dave Chinner
@ 2017-08-11 17:08 ` Matthew Wilcox
2017-08-11 18:04 ` Christoph Hellwig
2017-08-14 6:48 ` Dave Chinner
0 siblings, 2 replies; 16+ messages in thread
From: Matthew Wilcox @ 2017-08-11 17:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-mm
On Fri, Aug 11, 2017 at 02:25:19PM +1000, Dave Chinner wrote:
> On Thu, Aug 10, 2017 at 09:11:59AM -0700, Matthew Wilcox wrote:
> > On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> > > If we scale this up to a container host which is using reflink trees
> > > it's shared root images, there might be hundreds of copies of the
> > > same data held in cache (i.e. one page per container). Given that
> > > the filesystem knows that the underlying data extent is shared when
> > > we go to read it, it's relatively easy to add mechanisms to the
> > > filesystem to return the same page for all attempts to read the
> > > from a shared extent from all inodes that share it.
> >
> > I agree the problem exists. Should we try to fix this problem, or
> > should we steer people towards solutions which don't have this problem?
> > The solutions I've been seeing use COW block devices instead of COW
> > filesystems, and DAX to share the common pages between the host and
> > each guest.
>
> That's one possible solution for people using hardware
> virutalisation, but not everyone is doing that. It also relies on
> block devices, which rules out a whole bunch of interesting stuff we
> can do with filesystems...
Assuming there's something fun we can do with filesystems that's
interesting to this type of user, what do you think to this:
Create a block device (maybe it's a loop device, maybe it's dm-raid0)
which supports DAX and uses the page cache to cache the physical pages
of the block device it's fronting.
Use XFS+reflink+DAX on top of this loop device. Now there's only one
copy of each page in RAM.
We'd need to be able to shoot down all mapped pages when evicting pages
from the loop device's page cache, but we have the right data structures
in place for that; we just need to use them.
* Re: How can we share page cache pages for reflinked files?
2017-08-11 17:08 ` Matthew Wilcox
@ 2017-08-11 18:04 ` Christoph Hellwig
2017-08-14 6:48 ` Dave Chinner
1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2017-08-11 18:04 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Dave Chinner, linux-fsdevel, linux-mm
On Fri, Aug 11, 2017 at 10:08:47AM -0700, Matthew Wilcox wrote:
> Assuming there's something fun we can do with filesystems that's
> interesting to this type of user, what do you think to this:
>
> Create a block device (maybe it's a loop device, maybe it's dm-raid0)
> which supports DAX and uses the page cache to cache the physical pages
> of the block device it's fronting.
Why not make every block device just support fake DAX and avoid the
additional layer?
Basically this would be going back from our logically indexed page
cache model to a file cache indexed by physical block. And for a fs
making heavy use of reflinks that's probably the right model in the
end.
* Re: How can we share page cache pages for reflinked files?
2017-08-11 17:08 ` Matthew Wilcox
2017-08-11 18:04 ` Christoph Hellwig
@ 2017-08-14 6:48 ` Dave Chinner
2017-08-14 18:14 ` Christopher Lameter
1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2017-08-14 6:48 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm
On Fri, Aug 11, 2017 at 10:08:47AM -0700, Matthew Wilcox wrote:
> On Fri, Aug 11, 2017 at 02:25:19PM +1000, Dave Chinner wrote:
> > On Thu, Aug 10, 2017 at 09:11:59AM -0700, Matthew Wilcox wrote:
> > > On Thu, Aug 10, 2017 at 02:28:49PM +1000, Dave Chinner wrote:
> > > > If we scale this up to a container host which is using reflink trees
> > > > it's shared root images, there might be hundreds of copies of the
> > > > same data held in cache (i.e. one page per container). Given that
> > > > the filesystem knows that the underlying data extent is shared when
> > > > we go to read it, it's relatively easy to add mechanisms to the
> > > > filesystem to return the same page for all attempts to read the
> > > > from a shared extent from all inodes that share it.
> > >
> > > I agree the problem exists. Should we try to fix this problem, or
> > > should we steer people towards solutions which don't have this problem?
> > > The solutions I've been seeing use COW block devices instead of COW
> > > filesystems, and DAX to share the common pages between the host and
> > > each guest.
> >
> > That's one possible solution for people using hardware
> > virutalisation, but not everyone is doing that. It also relies on
> > block devices, which rules out a whole bunch of interesting stuff we
> > can do with filesystems...
>
> Assuming there's something fun we can do with filesystems that's
> interesting to this type of user, what do you think to this:
>
> Create a block device (maybe it's a loop device, maybe it's dm-raid0)
> which supports DAX and uses the page cache to cache the physical pages
> of the block device it's fronting.
/me shudders and runs away screaming
<puff, puff, gasp>
Ok, I'm far away enough now. :P
> Use XFS+reflink+DAX on top of this loop device. Now there's only one
> copy of each page in RAM.
Yes, I can see how that could work. Crazy, out of the box, abuses
DAX for non-DAX purposes and uses stuff we haven't enabled yet
because nobody has done the work to validate it. Full points for
creativity! :)
However, I don't think it's a viable solution.
First, now *everything* is cached in a single global mapping tree
and that's going to affect scalability and likely also the working
set tracking in the mapping tree (now global rather than per-file).
That, in turn, will affect reclaim behaviour and patterns. I'll come
back to that.
Second, direct IO is no longer direct - it would now be cached, and
concurrency is limited by the block device page cache, not the
capability and queue depth of the underlying device.
Third, I have a concern that while the filesystem might present to
userspace as a DAX filesystem, it does not present userspace with the
same semantics as direct access to CPU-addressable non-volatile
storage. That seems, to me, like a minefield we don't want to step into.
And, finally, I can't see how it would work for sharing between
cloned filesystem images and snapshots. e.g. you use reflink to
clone the filesystem images exported by loopback devices, or
dm-thinp to clone devices - there's no way to share page cache
pages for blocks that are shared across different dm-thinp devices
in the same pool. (And no, turtles is not the answer here :)
> We'd need to be able to shoot down all mapped pages when evicting pages
> from the loop device's page cache, but we have the right data structures
> in place for that; we just need to use them.
Sure. My biggest concern is whether reclaim can easily determine the
difference between a heavily shared page and a single-use page. We'd
want to make sure we don't do stupid things like reclaiming widely
shared pages from libc before we reclaim a page that has been read
only once in one context.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: How can we share page cache pages for reflinked files?
2017-08-14 6:48 ` Dave Chinner
@ 2017-08-14 18:14 ` Christopher Lameter
2017-08-14 21:09 ` Kirill A. Shutemov
0 siblings, 1 reply; 16+ messages in thread
From: Christopher Lameter @ 2017-08-14 18:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm
On Mon, 14 Aug 2017, Dave Chinner wrote:
> > Use XFS+reflink+DAX on top of this loop device. Now there's only one
> > copy of each page in RAM.
>
> Yes, I can see how that could work. Crazy, out of the box, abuses
> DAX for non-DAX purposes and uses stuff we haven't enabled yet
> because nobody has done the work to validate it. Full points for
> creativity! :)
Another not-so-crazy solution is to break the 1:1 relation between page
structs and pages. We already have issues with huge pages where one
struct page may represent 2MB of memory using 512 or so page structs.
There are also constant attempts to expand struct page.
So how about an m:n relationship? Any page (be it 4k, 2M or 1G) would
have one page struct for each mapping that it is a member of.
Maybe the page state could consist of a base struct that describes
the page state and then 1..n pieces of mapping information? In the
future, other state info could be added to the end if we allow dynamic
sizing of page structs.
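Purely as an illustration of that layout (names invented, not a real
proposal):

struct page_base {
        unsigned long flags;
        atomic_t _refcount;
        unsigned short nr_pieces;       /* mapping pieces that follow */
};

struct page_mapping_piece {
        struct address_space *mapping;
        pgoff_t index;
};

/* A dynamically sized "page struct" would then be a page_base followed
 * by nr_pieces page_mapping_piece entries. */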
This would also allow the inevitable creeping page struct bloat to get
completely out of control.
* Re: How can we share page cache pages for reflinked files?
2017-08-14 18:14 ` Christopher Lameter
@ 2017-08-14 21:09 ` Kirill A. Shutemov
2017-08-15 15:11 ` Christopher Lameter
0 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-08-14 21:09 UTC (permalink / raw)
To: Christopher Lameter; +Cc: Dave Chinner, Matthew Wilcox, linux-fsdevel, linux-mm
On Mon, Aug 14, 2017 at 01:14:57PM -0500, Christopher Lameter wrote:
> On Mon, 14 Aug 2017, Dave Chinner wrote:
>
> > > Use XFS+reflink+DAX on top of this loop device. Now there's only one
> > > copy of each page in RAM.
> >
> > Yes, I can see how that could work. Crazy, out of the box, abuses
> > DAX for non-DAX purposes and uses stuff we haven't enabled yet
> > because nobody has done the work to validate it. Full points for
> > creativity! :)
>
> Another not so crazy solution is to break the 1-1 relation between page
> structs and pages. We already have issues with huge pages where one struct
> page may represent 2m of memmory using 512 or so page struct.
>
> Therer are also constantly attempts to expand struct page.
>
> So how about an m->n relationship? Any page (may it be 4k, 2m or 1G) has
> one page struct for each mapping that it is a member of?
>
> Maybe a the page state could consist of a base struct that describes
> the page state and then 1..n pieces of mapping information? In the future
> other state info could be added to the end if we allow dynamic sizing of
> page structs.
>
> This would also allow the inevitable creeping page struct bloat to get
> completely out of control.
Nice wish list. Add pony. :)
Any attempt to replace struct page with something more complex will have
severe performance implications. I'll be glad to be proved otherwise.
--
Kirill A. Shutemov
* Re: How can we share page cache pages for reflinked files?
2017-08-14 21:09 ` Kirill A. Shutemov
@ 2017-08-15 15:11 ` Christopher Lameter
0 siblings, 0 replies; 16+ messages in thread
From: Christopher Lameter @ 2017-08-15 15:11 UTC (permalink / raw)
To: Kirill A. Shutemov; +Cc: Dave Chinner, Matthew Wilcox, linux-fsdevel, linux-mm
On Tue, 15 Aug 2017, Kirill A. Shutemov wrote:
> > This would also allow the inevitable creeping page struct bloat to get
> > completely out of control.
>
> Nice wish list. Add pony. :)
>
> Any attempt to replace struct page with something more complex will have
> severe performance implications. I'll be glad proved otherwise.
Do we care that much anymore? We already have people inserting all
sorts of runtime checks into hot paths in the name of security.....