Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found] ` <c559c2ce-50dc-d143-5741-fe3d21d0305c@nvidia.com>
@ 2019-06-06 17:11   ` Ira Weiny
  2019-06-06 19:46     ` Jason Gunthorpe
  0 siblings, 1 reply; 30+ messages in thread
From: Ira Weiny @ 2019-06-06 17:11 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dan Williams, Jan Kara, Theodore Ts'o, Jeff Layton,
	Dave Chinner, Matthew Wilcox, linux-xfs, Andrew Morton,
	Jérôme Glisse, linux-fsdevel, linux-kernel,
	linux-nvdimm, linux-ext4, linux-mm, Jason Gunthorpe, linux-rdma

On Wed, Jun 05, 2019 at 10:52:12PM -0700, John Hubbard wrote:
> On 6/5/19 6:45 PM, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > ... V1,000,000   ;-)
> > 
> > Pre-requisites:
> > 	John Hubbard's put_user_pages() patch series.[1]
> > 	Jan Kara's ext4_break_layouts() fixes[2]
> > 
> > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > decided to take a slightly different tack on this problem.
> > 
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> > 
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> > 
> > Should be an acceptable solution.
> > 
> > Because this is slightly new behavior.  And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> > 
> > The following patches implement the following solution.
> > 
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> >    (now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO for
> >    notification to the original thread that another thread has tried to delete
> >    their data.  Furthermore this indicates that if the user needs to GUP the
> >    file again they will need to retake the Layout lease before doing so.
> > 
> > 
> > NOTE: If the user releases the layout lease or if it has been broken by another
> > operation further GUP operations on the file will fail without re-taking the
> > lease.  This means that if a user would like to register pieces of a file and
> > continue to register other pieces later they would be advised to keep the
> > layout lease, get a SIGIO notification, and retake the lease.
> > 
> > NOTE2: Truncation of pages which are not actively pinned will succeed.  Similar
> > to accessing an mmap to this area GUP pins of that memory may fail.
> > 
> 
> Hi Ira,
> 
> Wow, great to see this. This looks like basically the right behavior, IMHO.
> 
> 1. We'll need man page additions, to explain it. In fact, even after a quick first
> pass through, I'm vague on two points:

Of course.  But I was not going to go through and attempt to write man pages
and other docs without some agreement on the final mechanisms.  This works
which was the basic requirement I had to send an RFC.  :-D  But yes man pages
and updates to headers etc all have to be done.

> 
> a) I'm not sure how this actually provides "opt-in to new behavior", because I 
> don't see any CONFIG_* or boot time choices, and it looks like the new behavior 
> just is there. That is, if user space doesn't set F_LAYOUT on a range, 
> GUP FOLL_LONGTERM will now fail, which is new behavior. (Did I get that right?)

The opt in is at run time.  Currently GUP FOLL_LONGTERM is _not_ _allowed_ on
the FS DAX pages at all.  So the default behavior is the same, GUP fails.  (Or
specifically ibv_reg_mr() fails.  This fails as before, not change there.

The Opt in is that if a user knows what is involved they can take the lease and
the GUP will not fail.  This comes with the price of knowing that other
processes can't truncate those pages in use.

> 
> b) Truncate and hole punch behavior, with and without user space having a SIGIO
> handler. (I'm sure this is obvious after another look through, but it might go
> nicely in a man page.)

Sorry this was not clear.  There are 2 points for this patch set which requires
the use of catching SIGIO.

1) If an application _actually_ does (somehow, somewhere, in some unforseen use
   case) want to allow a truncate to happen.  They can catch the SIGIO, finish
   their use of the pages, and release them.  As long as they can do this within
   the <sysfs>/lease-time-break time they are ok and the truncate can proceed.

2) This is a bit more subtle and something I almost delayed sending these out
   for.  Currently the implementation of a lease break actually removes the
   lease from the file.  I did not want this to happen and I was thinking of
   delaying this patch set to implement something which keeps the lease around
   but I figured I should get something out for comments.  Jan has proposed
   something along these lines and I agree with him so I'm going to ask you to
   read my response to him about the details.

   Anyway so the key here is that currently an app needs the SIGIO to retake
   the lease if they want to map the file again or in parts based on usage.
   For example, they may only want to map some of the file for when they are
   using it and then map another part later.  Without the SIGIO they would lose
   their lease or would have to just take the lease for each GUP pin (which
   adds overhead).  Like I said I did not like this but I left it to get
   something which works out.

> 
> 2. It *seems* like ext4, xfs are taken care of here, not just for the DAX case,
> but for general RDMA on them? Or is there more that must be done?

This is limited to DAX.  All the functionality is limited to *_devmap or "is
DAX" cases.  I'm still thinking that page cache backed files can have a better
solution for the user.

> 
> 3. Christophe Hellwig's unified gup patchset wreaks havoc in gup.c, and will
> conflict violently, as I'm sure you noticed. :)

Yep...  But I needed to get the conversation started on this idea.

Thanks for the feedback!
Ira

> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-06 17:11   ` [PATCH RFC 00/10] RDMA/FS DAX truncate proposal Ira Weiny
@ 2019-06-06 19:46     ` Jason Gunthorpe
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 19:46 UTC (permalink / raw)
  To: Ira Weiny
  Cc: John Hubbard, Dan Williams, Jan Kara, Theodore Ts'o,
	Jeff Layton, Dave Chinner, Matthew Wilcox, linux-xfs,
	Andrew Morton, Jérôme Glisse, linux-fsdevel,
	linux-kernel, linux-nvdimm, linux-ext4, linux-mm, linux-rdma

On Thu, Jun 06, 2019 at 10:11:58AM -0700, Ira Weiny wrote:

> 2) This is a bit more subtle and something I almost delayed sending these out
>    for.  Currently the implementation of a lease break actually removes the
>    lease from the file.  I did not want this to happen and I was thinking of
>    delaying this patch set to implement something which keeps the lease around
>    but I figured I should get something out for comments.  Jan has proposed
>    something along these lines and I agree with him so I'm going to ask you to
>    read my response to him about the details.
>
> 
>    Anyway so the key here is that currently an app needs the SIGIO to retake
>    the lease if they want to map the file again or in parts based on usage.
>    For example, they may only want to map some of the file for when they are
>    using it and then map another part later.  Without the SIGIO they would lose
>    their lease or would have to just take the lease for each GUP pin (which
>    adds overhead).  Like I said I did not like this but I left it to get
>    something which works out.

So to be clear.. 

Even though the lease is broken the GUP remains, the pages remain
pined, and truncate/etc continues to fail? 

I like Jan's take on this actually.. see other email.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found] ` <20190606104203.GF7433@quack2.suse.cz>
@ 2019-06-06 22:03   ` Ira Weiny
  2019-06-06 22:26     ` Ira Weiny
                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Ira Weiny @ 2019-06-06 22:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, Theodore Ts'o, Jeff Layton, Dave Chinner,
	Matthew Wilcox, linux-xfs, Andrew Morton, John Hubbard,
	Jérôme Glisse, linux-fsdevel, linux-kernel,
	linux-nvdimm, linux-ext4, linux-mm, Jason Gunthorpe, linux-rdma

On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > ... V1,000,000   ;-)
> > 
> > Pre-requisites:
> > 	John Hubbard's put_user_pages() patch series.[1]
> > 	Jan Kara's ext4_break_layouts() fixes[2]
> > 
> > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > decided to take a slightly different tack on this problem.
> > 
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> > 
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> > 
> > Should be an acceptable solution.
> > 
> > Because this is slightly new behavior.  And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> > 
> > The following patches implement the following solution.
> > 
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> >    (now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO for
> >    notification to the original thread that another thread has tried to delete
> >    their data.  Furthermore this indicates that if the user needs to GUP the
> >    file again they will need to retake the Layout lease before doing so.
> > 
> > 
> > NOTE: If the user releases the layout lease or if it has been broken by
> > another operation further GUP operations on the file will fail without
> > re-taking the lease.  This means that if a user would like to register
> > pieces of a file and continue to register other pieces later they would
> > be advised to keep the layout lease, get a SIGIO notification, and retake
> > the lease.
> > 
> > NOTE2: Truncation of pages which are not actively pinned will succeed.
> > Similar to accessing an mmap to this area GUP pins of that memory may
> > fail.
> 
> So after some through I'm willing accept the fact that pinned DAX pages
> will just make truncate / hole punch fail and shove it into a same bucket
> of situations like "user can open a file and unlink won't delete it" or
> "ETXTBUSY when user is executing a file being truncated".  The problem I
> have with this proposal is a lack of visibility from sysadmin POV. For
> ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
> problematic process and kill it. There's nothing like that with your
> proposal since currently once you hold page reference, you can unmap the
> file, drop layout lease, close the file, and there's no trace that you're
> responsible for the pinned page anymore.

Agreed.  For some "GUP interfaces" one may be able to figure this out but I'm
not familiar with any.  For RDMA there has been some additions for tracking
resources but I don't think any of that is useful here.  Regardless from a FS
POV this is awkward to have to understand all the independent interfaces, so I
agree.

> 
> So I'd like to actually mandate that you *must* hold the file lease until
> you unpin all pages in the given range (not just that you have an option to
> hold a lease). And I believe the kernel should actually enforce this. That
> way we maintain a sane state that if someone uses a physical location of
> logical file offset on disk, he has a layout lease. Also once this is done,
> sysadmin has a reasonably easy way to discover run-away RDMA application
> and kill it if he wishes so.

Fair enough.

I was kind of heading that direction but had not thought this far forward.  I
was exploring how to have a lease remain on the file even after a "lease
break".  But that is incompatible with the current semantics of a "layout"
lease (as currently defined in the kernel).  [In the end I wanted to get an RFC
out to see what people think of this idea so I did not look at keeping the
lease.]

Also hitch is that currently a lease is forcefully broken after
<sysfs>/lease-break-time.  To do what you suggest I think we would need a new
lease type with the semantics you describe.

Previously I had thought this would be a good idea (for other reasons).  But
what does everyone think about using a "longterm lease" similar to [1] which
has the semantics you proppose?  In [1] I was not sure "longterm" was a good
name but with your proposal I think it makes more sense.

> 
> The question is on how to exactly enforce that lease is taken until all
> pages are unpinned. I belive it could be done by tracking number of
> long-term pinned pages within a lease. Gup_longterm could easily increment
> the count when verifying the lease exists, gup_longterm users will somehow
> need to propagate corresponding 'filp' (struct file pointer) to
> put_user_pages_longterm() callsites so that they can look up appropriate
> lease to drop reference

I actually think that might be pretty easy.  I actually added a ref count to
the longterm lease before.[2]  This was done to be able to take the lease
within the GUP code.  We don't need that functionality exactly but that patch
implements some of what you propose.  With a ref count on the lease we can
refuse to release it until all GUP users have released it.

>
> - probably I'd just transition all gup_longterm()
> users to a saner API similar to the one we have in mm/frame_vector.c where
> we don't hand out page pointers but an encapsulating structure that does
> all the necessary tracking.

I'll take a look at that code.  But that seems like a pretty big change.

>
> Removing a lease would need to block until all
> pins are released - this is probably the most hairy part since we need to
> handle a case if application just closes the file descriptor which would
> release the lease but OTOH we need to make sure task exit does not deadlock.
> Maybe we could block only on explicit lease unlock and just drop the layout
> lease on file close and if there are still pinned pages, send SIGKILL to an
> application as a reminder it did something stupid...

As presented at LSFmm I'm not opposed to killing a process which does not
"follow the rules".  But I'm concerned about how to handle this across a fork.

Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins
to a child through the RDMA context.  This was the major issue Jason had with
the SIGBUS proposal.

Always sending a SIGKILL would prevent an RDMA process from doing something
like system("ls") (would kill the child unnecessarily).  Are we ok with that?

> 
> What do people think about this?

But generally I like the idea of the leases being sticky.  Not sure about the
SIGKILL.

Thanks for the review,
Ira

[1] https://patchwork.kernel.org/patch/10921171/
[2] https://patchwork.kernel.org/patch/10921177/

> 
> 								Honza
> > 
> > 
> > A general overview follows for background.
> > 
> > It should be noted that one solution for this problem is to use RDMA's On
> > Demand Paging (ODP).  There are 2 big reasons this may not work.
> > 
> > 	1) The hardware being used for RDMA may not support ODP
> > 	2) ODP may be detrimental to the over all network (cluster or cloud)
> > 	   performance
> > 
> > Therefore, in order to support RDMA to File system pages without On Demand
> > Paging (ODP) a number of things need to be done.
> > 
> > 1) GUP "longterm" users need to inform the other subsystems that they have
> >    taken a pin on a page which may remain pinned for a very "long time".[3]
> > 
> > 2) Any page which is "controlled" by a file system needs to have special
> >    handling.  The details of the handling depends on if the page is page cache
> >    fronted or not.
> > 
> >    2a) A page cache fronted page which has been pinned by GUP long term can use a
> >    bounce buffer to allow the file system to write back snap shots of the page.
> >    This is handled by the FS recognizing the GUP long term pin and making a copy
> >    of the page to be written back.
> > 	NOTE: this patch set does not address this path.
> > 
> >    2b) A FS "controlled" page which is not page cache fronted is either easier
> >    to deal with or harder depending on the operation the filesystem is trying
> >    to do.
> > 
> > 	2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
> > 	FS can no longer use the pages in question until the pin has been
> > 	removed.  This patch set presents a solution to this by introducing
> > 	some reasonable restrictions on user space applications.
> > 
> > 	2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
> > 	then there is nothing which need be done.  Data is Read or Written
> > 	directly to the page.  This is an easy case which would currently work
> > 	if not for GUP long term pins being disabled.  Therefore this patch set
> > 	need not change access to the file data but does allow for GUP pins
> > 	after 2ba above is dealt with.
> > 
> > 
> > This patch series and presents a solution for problem 2ba)
> > 
> > [1] https://github.com/johnhubbard/linux/tree/gup_dma_core
> > 
> > [2] ext4/dev branch:
> > 
> > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev
> > 
> > 	Specific patches:
> > 
> > 	[2a] ext4: wait for outstanding dio during truncate in nojournal mode
> > 
> > 	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8
> > 
> > 	[2b] ext4: do not delete unlinked inode from orphan list on failed truncate
> > 
> > 	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14
> > 
> > 	[2c] ext4: gracefully handle ext4_break_layouts() failure during truncate
> > 
> > 	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c
> > 
> > [3] The definition of long time is debatable but it has been established
> > that RDMAs use of pages, minutes or hours after the pin is the extreme case
> > which makes this problem most severe.
> > 
> > 
> > Ira Weiny (10):
> >   fs/locks: Add trace_leases_conflict
> >   fs/locks: Export F_LAYOUT lease to user space
> >   mm/gup: Pass flags down to __gup_device_huge* calls
> >   mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
> >   fs/ext4: Teach ext4 to break layout leases
> >   fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
> >   fs/ext4: Fail truncate if pages are GUP pinned
> >   fs/xfs: Teach xfs to use new dax_layout_busy_page()
> >   fs/xfs: Fail truncate if pages are GUP pinned
> >   mm/gup: Remove FOLL_LONGTERM DAX exclusion
> > 
> >  fs/Kconfig                       |   1 +
> >  fs/dax.c                         |  38 ++++++---
> >  fs/ext4/ext4.h                   |   2 +-
> >  fs/ext4/extents.c                |   6 +-
> >  fs/ext4/inode.c                  |  26 +++++--
> >  fs/locks.c                       |  97 ++++++++++++++++++++---
> >  fs/xfs/xfs_file.c                |  24 ++++--
> >  fs/xfs/xfs_inode.h               |   5 +-
> >  fs/xfs/xfs_ioctl.c               |  15 +++-
> >  fs/xfs/xfs_iops.c                |  14 +++-
> >  fs/xfs/xfs_pnfs.c                |  14 ++--
> >  include/linux/dax.h              |   9 ++-
> >  include/linux/fs.h               |   2 +-
> >  include/linux/mm.h               |   2 +
> >  include/trace/events/filelock.h  |  35 +++++++++
> >  include/uapi/asm-generic/fcntl.h |   3 +
> >  mm/gup.c                         | 129 ++++++++++++-------------------
> >  mm/huge_memory.c                 |  12 +++
> >  18 files changed, 299 insertions(+), 135 deletions(-)
> > 
> > -- 
> > 2.20.1
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-06 22:03   ` Ira Weiny
@ 2019-06-06 22:26     ` Ira Weiny
  2019-06-06 22:28     ` Dave Chinner
  2019-06-07 11:04     ` Jan Kara
  2 siblings, 0 replies; 30+ messages in thread
From: Ira Weiny @ 2019-06-06 22:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, Theodore Ts'o, Jeff Layton, Dave Chinner,
	Matthew Wilcox, linux-xfs, Andrew Morton, John Hubbard,
	Jérôme Glisse, linux-fsdevel, linux-kernel,
	linux-nvdimm, linux-ext4, linux-mm, Jason Gunthorpe, linux-rdma

On Thu, Jun 06, 2019 at 03:03:30PM -0700, 'Ira Weiny' wrote:
> On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > ... V1,000,000   ;-)
> > > 
> > > Pre-requisites:
> > > 	John Hubbard's put_user_pages() patch series.[1]
> > > 	Jan Kara's ext4_break_layouts() fixes[2]
> > > 
> > > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > > decided to take a slightly different tack on this problem.
> > > 
> > > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > > memory which is then truncated.  So really any solution we present which:
> > > 
> > > A) Prevents file system corruption or data leaks
> > > ...and...
> > > B) Informs the user that they did something wrong
> > > 
> > > Should be an acceptable solution.
> > > 
> > > Because this is slightly new behavior.  And because this is gonig to be
> > > specific to DAX (because of the lack of a page cache) we have made the user
> > > "opt in" to this behavior.
> > > 
> > > The following patches implement the following solution.
> > > 
> > > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> > >    (now made visible).
> > > 2) GUP will fail (EPERM) if a layout lease is not taken
> > > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > > 4) The user has the option of holding the layout lease to receive a SIGIO for
> > >    notification to the original thread that another thread has tried to delete
> > >    their data.  Furthermore this indicates that if the user needs to GUP the
> > >    file again they will need to retake the Layout lease before doing so.
> > > 
> > > 
> > > NOTE: If the user releases the layout lease or if it has been broken by
> > > another operation further GUP operations on the file will fail without
> > > re-taking the lease.  This means that if a user would like to register
> > > pieces of a file and continue to register other pieces later they would
> > > be advised to keep the layout lease, get a SIGIO notification, and retake
> > > the lease.
> > > 
> > > NOTE2: Truncation of pages which are not actively pinned will succeed.
> > > Similar to accessing an mmap to this area GUP pins of that memory may
> > > fail.
> > 
> > So after some through I'm willing accept the fact that pinned DAX pages
> > will just make truncate / hole punch fail and shove it into a same bucket
> > of situations like "user can open a file and unlink won't delete it" or
> > "ETXTBUSY when user is executing a file being truncated".  The problem I
> > have with this proposal is a lack of visibility from sysadmin POV. For
> > ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
> > problematic process and kill it. There's nothing like that with your
> > proposal since currently once you hold page reference, you can unmap the
> > file, drop layout lease, close the file, and there's no trace that you're
> > responsible for the pinned page anymore.
> 
> Agreed.  For some "GUP interfaces" one may be able to figure this out but I'm
> not familiar with any.  For RDMA there has been some additions for tracking
> resources but I don't think any of that is useful here.  Regardless from a FS
> POV this is awkward to have to understand all the independent interfaces, so I
> agree.
> 
> > 
> > So I'd like to actually mandate that you *must* hold the file lease until
> > you unpin all pages in the given range (not just that you have an option to
> > hold a lease). And I believe the kernel should actually enforce this. That
> > way we maintain a sane state that if someone uses a physical location of
> > logical file offset on disk, he has a layout lease. Also once this is done,
> > sysadmin has a reasonably easy way to discover run-away RDMA application
> > and kill it if he wishes so.
> 
> Fair enough.
> 
> I was kind of heading that direction but had not thought this far forward.  I
> was exploring how to have a lease remain on the file even after a "lease
> break".  But that is incompatible with the current semantics of a "layout"
> lease (as currently defined in the kernel).  [In the end I wanted to get an RFC
> out to see what people think of this idea so I did not look at keeping the
> lease.]
> 
> Also hitch is that currently a lease is forcefully broken after
> <sysfs>/lease-break-time.  To do what you suggest I think we would need a new
> lease type with the semantics you describe.
> 
> Previously I had thought this would be a good idea (for other reasons).  But
> what does everyone think about using a "longterm lease" similar to [1] which
> has the semantics you proppose?  In [1] I was not sure "longterm" was a good
> name but with your proposal I think it makes more sense.
> 
> > 
> > The question is on how to exactly enforce that lease is taken until all
> > pages are unpinned. I belive it could be done by tracking number of
> > long-term pinned pages within a lease. Gup_longterm could easily increment
> > the count when verifying the lease exists, gup_longterm users will somehow
> > need to propagate corresponding 'filp' (struct file pointer) to
> > put_user_pages_longterm() callsites so that they can look up appropriate
> > lease to drop reference
> 
> I actually think that might be pretty easy.  I actually added a ref count to
> the longterm lease before.[2]  This was done to be able to take the lease
> within the GUP code.  We don't need that functionality exactly but that patch
> implements some of what you propose.  With a ref count on the lease we can
> refuse to release it until all GUP users have released it.
> 
> >
> > - probably I'd just transition all gup_longterm()
> > users to a saner API similar to the one we have in mm/frame_vector.c where
> > we don't hand out page pointers but an encapsulating structure that does
> > all the necessary tracking.
> 
> I'll take a look at that code.  But that seems like a pretty big change.
> 
> >
> > Removing a lease would need to block until all
> > pins are released - this is probably the most hairy part since we need to
> > handle a case if application just closes the file descriptor which would
> > release the lease but OTOH we need to make sure task exit does not deadlock.
> > Maybe we could block only on explicit lease unlock and just drop the layout
> > lease on file close and if there are still pinned pages, send SIGKILL to an
> > application as a reminder it did something stupid...
> 
> As presented at LSFmm I'm not opposed to killing a process which does not
> "follow the rules".  But I'm concerned about how to handle this across a fork.
> 
> Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins
> to a child through the RDMA context.  This was the major issue Jason had with
> the SIGBUS proposal.
> 
> Always sending a SIGKILL would prevent an RDMA process from doing something
> like system("ls") (would kill the child unnecessarily).  Are we ok with that?

I might be wrong here.  My memory said it closed all fd's but I'm not finding
any documentation of that.  Perhaps we could say that the child would be
required to keep the fd open as well?

Ira

> 
> > 
> > What do people think about this?
> 
> But generally I like the idea of the leases being sticky.  Not sure about the
> SIGKILL.
> 
> Thanks for the review,
> Ira
> 
> [1] https://patchwork.kernel.org/patch/10921171/
> [2] https://patchwork.kernel.org/patch/10921177/
> 
> > 
> > 								Honza
> > > 
> > > 
> > > A general overview follows for background.
> > > 
> > > It should be noted that one solution for this problem is to use RDMA's On
> > > Demand Paging (ODP).  There are 2 big reasons this may not work.
> > > 
> > > 	1) The hardware being used for RDMA may not support ODP
> > > 	2) ODP may be detrimental to the over all network (cluster or cloud)
> > > 	   performance
> > > 
> > > Therefore, in order to support RDMA to File system pages without On Demand
> > > Paging (ODP) a number of things need to be done.
> > > 
> > > 1) GUP "longterm" users need to inform the other subsystems that they have
> > >    taken a pin on a page which may remain pinned for a very "long time".[3]
> > > 
> > > 2) Any page which is "controlled" by a file system needs to have special
> > >    handling.  The details of the handling depends on if the page is page cache
> > >    fronted or not.
> > > 
> > >    2a) A page cache fronted page which has been pinned by GUP long term can use a
> > >    bounce buffer to allow the file system to write back snap shots of the page.
> > >    This is handled by the FS recognizing the GUP long term pin and making a copy
> > >    of the page to be written back.
> > > 	NOTE: this patch set does not address this path.
> > > 
> > >    2b) A FS "controlled" page which is not page cache fronted is either easier
> > >    to deal with or harder depending on the operation the filesystem is trying
> > >    to do.
> > > 
> > > 	2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
> > > 	FS can no longer use the pages in question until the pin has been
> > > 	removed.  This patch set presents a solution to this by introducing
> > > 	some reasonable restrictions on user space applications.
> > > 
> > > 	2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
> > > 	then there is nothing which need be done.  Data is Read or Written
> > > 	directly to the page.  This is an easy case which would currently work
> > > 	if not for GUP long term pins being disabled.  Therefore this patch set
> > > 	need not change access to the file data but does allow for GUP pins
> > > 	after 2ba above is dealt with.
> > > 
> > > 
> > > This patch series and presents a solution for problem 2ba)
> > > 
> > > [1] https://github.com/johnhubbard/linux/tree/gup_dma_core
> > > 
> > > [2] ext4/dev branch:
> > > 
> > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev
> > > 
> > > 	Specific patches:
> > > 
> > > 	[2a] ext4: wait for outstanding dio during truncate in nojournal mode
> > > 
> > > 	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8
> > > 
> > > 	[2b] ext4: do not delete unlinked inode from orphan list on failed truncate
> > > 
> > > 	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14
> > > 
> > > 	[2c] ext4: gracefully handle ext4_break_layouts() failure during truncate
> > > 
> > > 	- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c
> > > 
> > > [3] The definition of long time is debatable but it has been established
> > > that RDMAs use of pages, minutes or hours after the pin is the extreme case
> > > which makes this problem most severe.
> > > 
> > > 
> > > Ira Weiny (10):
> > >   fs/locks: Add trace_leases_conflict
> > >   fs/locks: Export F_LAYOUT lease to user space
> > >   mm/gup: Pass flags down to __gup_device_huge* calls
> > >   mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
> > >   fs/ext4: Teach ext4 to break layout leases
> > >   fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
> > >   fs/ext4: Fail truncate if pages are GUP pinned
> > >   fs/xfs: Teach xfs to use new dax_layout_busy_page()
> > >   fs/xfs: Fail truncate if pages are GUP pinned
> > >   mm/gup: Remove FOLL_LONGTERM DAX exclusion
> > > 
> > >  fs/Kconfig                       |   1 +
> > >  fs/dax.c                         |  38 ++++++---
> > >  fs/ext4/ext4.h                   |   2 +-
> > >  fs/ext4/extents.c                |   6 +-
> > >  fs/ext4/inode.c                  |  26 +++++--
> > >  fs/locks.c                       |  97 ++++++++++++++++++++---
> > >  fs/xfs/xfs_file.c                |  24 ++++--
> > >  fs/xfs/xfs_inode.h               |   5 +-
> > >  fs/xfs/xfs_ioctl.c               |  15 +++-
> > >  fs/xfs/xfs_iops.c                |  14 +++-
> > >  fs/xfs/xfs_pnfs.c                |  14 ++--
> > >  include/linux/dax.h              |   9 ++-
> > >  include/linux/fs.h               |   2 +-
> > >  include/linux/mm.h               |   2 +
> > >  include/trace/events/filelock.h  |  35 +++++++++
> > >  include/uapi/asm-generic/fcntl.h |   3 +
> > >  mm/gup.c                         | 129 ++++++++++++-------------------
> > >  mm/huge_memory.c                 |  12 +++
> > >  18 files changed, 299 insertions(+), 135 deletions(-)
> > > 
> > > -- 
> > > 2.20.1
> > > 
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-06 22:03   ` Ira Weiny
  2019-06-06 22:26     ` Ira Weiny
@ 2019-06-06 22:28     ` Dave Chinner
  2019-06-07 11:04     ` Jan Kara
  2 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2019-06-06 22:28 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jan Kara, Dan Williams, Theodore Ts'o, Jeff Layton,
	Matthew Wilcox, linux-xfs, Andrew Morton, John Hubbard,
	Jérôme Glisse, linux-fsdevel, linux-kernel,
	linux-nvdimm, linux-ext4, linux-mm, Jason Gunthorpe, linux-rdma

On Thu, Jun 06, 2019 at 03:03:30PM -0700, Ira Weiny wrote:
> On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote:
> > So I'd like to actually mandate that you *must* hold the file lease until
> > you unpin all pages in the given range (not just that you have an option to
> > hold a lease). And I believe the kernel should actually enforce this. That
> > way we maintain a sane state that if someone uses a physical location of
> > logical file offset on disk, he has a layout lease. Also once this is done,
> > sysadmin has a reasonably easy way to discover run-away RDMA application
> > and kill it if he wishes so.
> 
> Fair enough.
> 
> I was kind of heading that direction but had not thought this far forward.  I
> was exploring how to have a lease remain on the file even after a "lease
> break".  But that is incompatible with the current semantics of a "layout"
> lease (as currently defined in the kernel).  [In the end I wanted to get an RFC
> out to see what people think of this idea so I did not look at keeping the
> lease.]
> 
> Also hitch is that currently a lease is forcefully broken after
> <sysfs>/lease-break-time.  To do what you suggest I think we would need a new
> lease type with the semantics you describe.

That just requires a flag when gaining the layout lease to say it is
an "unbreakable layout lease". That gives the kernel the information
needed to determine whether it should attempt to break the lease on
truncate or just return ETXTBSY....

i.e. it allows gup-pinning applications that want to behave nicely
with other users to drop their gup pins and release the lease when
something else wants to truncate/hole punch the file rather than
have truncate return an error. e.g. to allow apps to cleanly interop
with other breakable layout leases (e.g. pNFS) on the same
filesystem.

FWIW, I'd also like to see the "truncate fails when unbreakable
layout lease is held" behaviour to be common across all
filesystem/storage types, not be confined to DAX only. i.e. truncate
should return ETXTBSY when an unbreakable layout lease is held
by an application, not just when "DAX+gup-pinned" is triggered....

Whatever we decide, the behaviour of truncate et al needs to be
predictable, consistent and easily discoverable...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-06 22:03   ` Ira Weiny
  2019-06-06 22:26     ` Ira Weiny
  2019-06-06 22:28     ` Dave Chinner
@ 2019-06-07 11:04     ` Jan Kara
       [not found]       ` <20190607110426.GB12765-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
  2 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2019-06-07 11:04 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jan Kara, Dan Williams, Theodore Ts'o, Jeff Layton,
	Dave Chinner, Matthew Wilcox, linux-xfs, Andrew Morton,
	John Hubbard, Jérôme Glisse, linux-fsdevel,
	linux-kernel, linux-nvdimm, linux-ext4, linux-mm, Jason Gunthorpe,
	linux-rdma

On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > So I'd like to actually mandate that you *must* hold the file lease until
> > you unpin all pages in the given range (not just that you have an option to
> > hold a lease). And I believe the kernel should actually enforce this. That
> > way we maintain a sane state that if someone uses a physical location of
> > logical file offset on disk, he has a layout lease. Also once this is done,
> > sysadmin has a reasonably easy way to discover run-away RDMA application
> > and kill it if he wishes so.
> 
> Fair enough.
> 
> I was kind of heading that direction but had not thought this far forward.  I
> was exploring how to have a lease remain on the file even after a "lease
> break".  But that is incompatible with the current semantics of a "layout"
> lease (as currently defined in the kernel).  [In the end I wanted to get an RFC
> out to see what people think of this idea so I did not look at keeping the
> lease.]
> 
> Also hitch is that currently a lease is forcefully broken after
> <sysfs>/lease-break-time.  To do what you suggest I think we would need a new
> lease type with the semantics you describe.

I'd do what Dave suggested - add flag to mark lease as unbreakable by
truncate and teach file locking core to handle that. There actually is
support for locks that are not broken after given timeout so there
shouldn't be too many changes need.
 
> Previously I had thought this would be a good idea (for other reasons).  But
> what does everyone think about using a "longterm lease" similar to [1] which
> has the semantics you proppose?  In [1] I was not sure "longterm" was a good
> name but with your proposal I think it makes more sense.

As I wrote elsewhere in this thread I think FL_LAYOUT name still makes
sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with
truncate.

> > - probably I'd just transition all gup_longterm()
> > users to a saner API similar to the one we have in mm/frame_vector.c where
> > we don't hand out page pointers but an encapsulating structure that does
> > all the necessary tracking.
> 
> I'll take a look at that code.  But that seems like a pretty big change.

I was looking into that yesterday before proposing this and there aren't
than many gup_longterm() users and most of them anyway just stick pages
array into their tracking structure and then release them once done. So it
shouldn't be that complex to convert to a new convention (and you have to
touch all gup_longterm() users anyway to teach them track leases etc.).

> > Removing a lease would need to block until all
> > pins are released - this is probably the most hairy part since we need to
> > handle a case if application just closes the file descriptor which would
> > release the lease but OTOH we need to make sure task exit does not deadlock.
> > Maybe we could block only on explicit lease unlock and just drop the layout
> > lease on file close and if there are still pinned pages, send SIGKILL to an
> > application as a reminder it did something stupid...
> 
> As presented at LSFmm I'm not opposed to killing a process which does not
> "follow the rules".  But I'm concerned about how to handle this across a fork.
> 
> Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins
> to a child through the RDMA context.  This was the major issue Jason had with
> the SIGBUS proposal.
> 
> Always sending a SIGKILL would prevent an RDMA process from doing something
> like system("ls") (would kill the child unnecessarily).  Are we ok with that?

I answered this in another email but system("ls") won't kill anybody.
fork(2) just creates new file descriptor for the same file and possibly
then closes it but since there is still another file descriptor for the
same struct file, the "close" code won't trigger.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]       ` <20190607110426.GB12765-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2019-06-07 18:25         ` Ira Weiny
  2019-06-07 18:50           ` Jason Gunthorpe
       [not found]           ` <20190607182534.GC14559-J5EW/p2F9lUlb2qzJQmr9q2pdiUAq4bhAL8bYrjMMd8@public.gmane.org>
  0 siblings, 2 replies; 30+ messages in thread
From: Ira Weiny @ 2019-06-07 18:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jason Gunthorpe, Theodore Ts'o,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, John Hubbard,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Fri, Jun 07, 2019 at 01:04:26PM +0200, Jan Kara wrote:
> On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > > On Wed 05-06-19 18:45:33, ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> > > > From: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > > 
> > > So I'd like to actually mandate that you *must* hold the file lease until
> > > you unpin all pages in the given range (not just that you have an option to
> > > hold a lease). And I believe the kernel should actually enforce this. That
> > > way we maintain a sane state that if someone uses a physical location of
> > > logical file offset on disk, he has a layout lease. Also once this is done,
> > > sysadmin has a reasonably easy way to discover run-away RDMA application
> > > and kill it if he wishes so.
> > 
> > Fair enough.
> > 
> > I was kind of heading that direction but had not thought this far forward.  I
> > was exploring how to have a lease remain on the file even after a "lease
> > break".  But that is incompatible with the current semantics of a "layout"
> > lease (as currently defined in the kernel).  [In the end I wanted to get an RFC
> > out to see what people think of this idea so I did not look at keeping the
> > lease.]
> > 
> > Also hitch is that currently a lease is forcefully broken after
> > <sysfs>/lease-break-time.  To do what you suggest I think we would need a new
> > lease type with the semantics you describe.
> 
> I'd do what Dave suggested - add flag to mark lease as unbreakable by
> truncate and teach file locking core to handle that. There actually is
> support for locks that are not broken after given timeout so there
> shouldn't be too many changes need.
>  
> > Previously I had thought this would be a good idea (for other reasons).  But
> > what does everyone think about using a "longterm lease" similar to [1] which
> > has the semantics you proppose?  In [1] I was not sure "longterm" was a good
> > name but with your proposal I think it makes more sense.
> 
> As I wrote elsewhere in this thread I think FL_LAYOUT name still makes
> sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with
> truncate.

Ok I want to make sure I understand what you and Dave are suggesting.

Are you suggesting that we have something like this from user space?

	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);

> 
> > > - probably I'd just transition all gup_longterm()
> > > users to a saner API similar to the one we have in mm/frame_vector.c where
> > > we don't hand out page pointers but an encapsulating structure that does
> > > all the necessary tracking.
> > 
> > I'll take a look at that code.  But that seems like a pretty big change.
> 
> I was looking into that yesterday before proposing this and there aren't
> than many gup_longterm() users and most of them anyway just stick pages
> array into their tracking structure and then release them once done. So it
> shouldn't be that complex to convert to a new convention (and you have to
> touch all gup_longterm() users anyway to teach them track leases etc.).

I think in the direction we are heading this becomes more attractive for sure.
For me though it will take some time.

Should we convert the frame_vector over to this new mechanism?  (Or more
accurately perhaps, add to frame_vector and use it?)  It seems bad to have "yet
another object" returned from the pin pages interface...

And I think this is related to what Christoph Hellwig is doing with bio_vec and
dma.  Really we want drivers out of the page processing business.

So for now I'm going to move forward with the idea of handing "some object" to
the GUP callers and figure out the lsof stuff, and let bigger questions like
this play out a bit more before I try and work with that code.  Fair?

> 
> > > Removing a lease would need to block until all
> > > pins are released - this is probably the most hairy part since we need to
> > > handle a case if application just closes the file descriptor which would
> > > release the lease but OTOH we need to make sure task exit does not deadlock.
> > > Maybe we could block only on explicit lease unlock and just drop the layout
> > > lease on file close and if there are still pinned pages, send SIGKILL to an
> > > application as a reminder it did something stupid...
> > 
> > As presented at LSFmm I'm not opposed to killing a process which does not
> > "follow the rules".  But I'm concerned about how to handle this across a fork.
> > 
> > Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins
> > to a child through the RDMA context.  This was the major issue Jason had with
> > the SIGBUS proposal.
> > 
> > Always sending a SIGKILL would prevent an RDMA process from doing something
> > like system("ls") (would kill the child unnecessarily).  Are we ok with that?
> 
> I answered this in another email but system("ls") won't kill anybody.
> fork(2) just creates new file descriptor for the same file and possibly
> then closes it but since there is still another file descriptor for the
> same struct file, the "close" code won't trigger.

Agreed.  I was wrong.  Sorry.

But if we can keep track of who has the pins in lsof can we agree no process
needs to be SIGKILL'ed?  Admins can do this on their own "killing" if they
really need to stop the use of these files, right?

Ira

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-07 18:25         ` Ira Weiny
@ 2019-06-07 18:50           ` Jason Gunthorpe
       [not found]           ` <20190607182534.GC14559-J5EW/p2F9lUlb2qzJQmr9q2pdiUAq4bhAL8bYrjMMd8@public.gmane.org>
  1 sibling, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 18:50 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jan Kara, Dan Williams, Theodore Ts'o, Jeff Layton,
	Dave Chinner, Matthew Wilcox, linux-xfs, Andrew Morton,
	John Hubbard, Jérôme Glisse, linux-fsdevel,
	linux-kernel, linux-nvdimm, linux-ext4, linux-mm, linux-rdma

On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:

> And I think this is related to what Christoph Hellwig is doing with bio_vec and
> dma.  Really we want drivers out of the page processing business.

At least for RDMA, and a few other places I've noticed, I'd really
like to get totally out of the handling struct pages game.

We are DMA based and really only want DMA addresses for the target
device. I know other places need CPU pages or more complicated
things.. But I also know there are other drivers like RDMA..

So I think it would be very helpful to have a driver API something
like:

int get_user_mem_for_dma(struct device *dma_device,
                void __user *mem, size_t length,
                struct gup_handle *res,
                struct 'bio dma list' *dma_list,
                const struct dma_params *params);
void put_user_mem_for_dma(struct gup_handle *res, 
                 struct 'bio dma list' *dma_list);

And we could hope to put in there all the specialty logic we want to
have for this flow:
 - The weird HMM stuff in hmm_range_dma_map()
 - Interaction with DAX
 - Interaction with DMA BUF
 - Holding file leases
 - PCI peer 2 peer features
 - Optimizations for huge pages
 - Handling page dirtying from DMA
 - etc

I think Matthew was suggesting something like this at LS/MM, so +1
from here..

When Christoph sends his BIO dma work I was thinking of investigating
this avenue, as we already have something quite similiar in RDMA that
could perhaps be hoisted out for re-use into mm/

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]           ` <20190607182534.GC14559-J5EW/p2F9lUlb2qzJQmr9q2pdiUAq4bhAL8bYrjMMd8@public.gmane.org>
@ 2019-06-08  0:10             ` Dave Chinner
       [not found]               ` <20190608001036.GF14308-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2019-06-08  0:10 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jason Gunthorpe, Theodore Ts'o,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Jan Kara, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> On Fri, Jun 07, 2019 at 01:04:26PM +0200, Jan Kara wrote:
> > On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> > > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > > > On Wed 05-06-19 18:45:33, ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> > > > > From: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > > > 
> > > > So I'd like to actually mandate that you *must* hold the file lease until
> > > > you unpin all pages in the given range (not just that you have an option to
> > > > hold a lease). And I believe the kernel should actually enforce this. That
> > > > way we maintain a sane state that if someone uses a physical location of
> > > > logical file offset on disk, he has a layout lease. Also once this is done,
> > > > sysadmin has a reasonably easy way to discover run-away RDMA application
> > > > and kill it if he wishes so.
> > > 
> > > Fair enough.
> > > 
> > > I was kind of heading that direction but had not thought this far forward.  I
> > > was exploring how to have a lease remain on the file even after a "lease
> > > break".  But that is incompatible with the current semantics of a "layout"
> > > lease (as currently defined in the kernel).  [In the end I wanted to get an RFC
> > > out to see what people think of this idea so I did not look at keeping the
> > > lease.]
> > > 
> > > Also hitch is that currently a lease is forcefully broken after
> > > <sysfs>/lease-break-time.  To do what you suggest I think we would need a new
> > > lease type with the semantics you describe.
> > 
> > I'd do what Dave suggested - add flag to mark lease as unbreakable by
> > truncate and teach file locking core to handle that. There actually is
> > support for locks that are not broken after given timeout so there
> > shouldn't be too many changes need.
> >  
> > > Previously I had thought this would be a good idea (for other reasons).  But
> > > what does everyone think about using a "longterm lease" similar to [1] which
> > > has the semantics you proppose?  In [1] I was not sure "longterm" was a good
> > > name but with your proposal I think it makes more sense.
> > 
> > As I wrote elsewhere in this thread I think FL_LAYOUT name still makes
> > sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with
> > truncate.
> 
> Ok I want to make sure I understand what you and Dave are suggesting.
> 
> Are you suggesting that we have something like this from user space?
> 
> 	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);

Rather than "unbreakable", perhaps a clearer description of the
policy it entails is "exclusive"?

i.e. what we are talking about here is an exclusive lease that
prevents other processes from changing the layout. i.e. the
mechanism used to guarantee a lease is exclusive is that the layout
becomes "unbreakable" at the filesystem level, but the policy we are
actually presenting to uses is "exclusive access"...

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]               ` <20190608001036.GF14308-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
@ 2019-06-09  1:29                 ` Ira Weiny
  2019-06-12 12:37                 ` Matthew Wilcox
  1 sibling, 0 replies; 30+ messages in thread
From: Ira Weiny @ 2019-06-09  1:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Theodore Ts'o,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Jan Kara, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > On Fri, Jun 07, 2019 at 01:04:26PM +0200, Jan Kara wrote:
> > > On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> > > > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > > > > On Wed 05-06-19 18:45:33, ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> > > > > > From: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > > > > 
> > > > > So I'd like to actually mandate that you *must* hold the file lease until
> > > > > you unpin all pages in the given range (not just that you have an option to
> > > > > hold a lease). And I believe the kernel should actually enforce this. That
> > > > > way we maintain a sane state that if someone uses a physical location of
> > > > > logical file offset on disk, he has a layout lease. Also once this is done,
> > > > > sysadmin has a reasonably easy way to discover run-away RDMA application
> > > > > and kill it if he wishes so.
> > > > 
> > > > Fair enough.
> > > > 
> > > > I was kind of heading that direction but had not thought this far forward.  I
> > > > was exploring how to have a lease remain on the file even after a "lease
> > > > break".  But that is incompatible with the current semantics of a "layout"
> > > > lease (as currently defined in the kernel).  [In the end I wanted to get an RFC
> > > > out to see what people think of this idea so I did not look at keeping the
> > > > lease.]
> > > > 
> > > > Also hitch is that currently a lease is forcefully broken after
> > > > <sysfs>/lease-break-time.  To do what you suggest I think we would need a new
> > > > lease type with the semantics you describe.
> > > 
> > > I'd do what Dave suggested - add flag to mark lease as unbreakable by
> > > truncate and teach file locking core to handle that. There actually is
> > > support for locks that are not broken after given timeout so there
> > > shouldn't be too many changes need.
> > >  
> > > > Previously I had thought this would be a good idea (for other reasons).  But
> > > > what does everyone think about using a "longterm lease" similar to [1] which
> > > > has the semantics you proppose?  In [1] I was not sure "longterm" was a good
> > > > name but with your proposal I think it makes more sense.
> > > 
> > > As I wrote elsewhere in this thread I think FL_LAYOUT name still makes
> > > sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with
> > > truncate.
> > 
> > Ok I want to make sure I understand what you and Dave are suggesting.
> > 
> > Are you suggesting that we have something like this from user space?
> > 
> > 	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> 
> Rather than "unbreakable", perhaps a clearer description of the
> policy it entails is "exclusive"?
> 
> i.e. what we are talking about here is an exclusive lease that
> prevents other processes from changing the layout. i.e. the
> mechanism used to guarantee a lease is exclusive is that the layout
> becomes "unbreakable" at the filesystem level, but the policy we are
> actually presenting to uses is "exclusive access"...

That sounds good.

Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]               ` <20190608001036.GF14308-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
  2019-06-09  1:29                 ` Ira Weiny
@ 2019-06-12 12:37                 ` Matthew Wilcox
       [not found]                   ` <20190612123751.GD32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2019-06-12 12:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Theodore Ts'o, linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton

On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > Are you suggesting that we have something like this from user space?
> > 
> > 	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> 
> Rather than "unbreakable", perhaps a clearer description of the
> policy it entails is "exclusive"?
> 
> i.e. what we are talking about here is an exclusive lease that
> prevents other processes from changing the layout. i.e. the
> mechanism used to guarantee a lease is exclusive is that the layout
> becomes "unbreakable" at the filesystem level, but the policy we are
> actually presenting to uses is "exclusive access"...

That's rather different from the normal meaning of 'exclusive' in the
context of locks, which is "only one user can have access to this at
a time".  As I understand it, this is rather more like a 'shared' or
'read' lock.  The filesystem would be the one which wants an exclusive
lock, so it can modify the mapping of logical to physical blocks.

The complication being that by default the filesystem has an exclusive
lock on the mapping, and what we're trying to add is the ability for
readers to ask the filesystem to give up its exclusive lock.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                   ` <20190612123751.GD32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
@ 2019-06-12 23:30                     ` Ira Weiny
       [not found]                       ` <20190612233024.GD14336-J5EW/p2F9lUlb2qzJQmr9q2pdiUAq4bhAL8bYrjMMd8@public.gmane.org>
  2019-06-13  0:25                     ` Dave Chinner
  1 sibling, 1 reply; 30+ messages in thread
From: Ira Weiny @ 2019-06-12 23:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, John Hubbard,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Theodore Ts'o,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > Are you suggesting that we have something like this from user space?
> > > 
> > > 	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > 
> > Rather than "unbreakable", perhaps a clearer description of the
> > policy it entails is "exclusive"?
> > 
> > i.e. what we are talking about here is an exclusive lease that
> > prevents other processes from changing the layout. i.e. the
> > mechanism used to guarantee a lease is exclusive is that the layout
> > becomes "unbreakable" at the filesystem level, but the policy we are
> > actually presenting to uses is "exclusive access"...
> 
> That's rather different from the normal meaning of 'exclusive' in the
> context of locks, which is "only one user can have access to this at
> a time".  As I understand it, this is rather more like a 'shared' or
> 'read' lock.  The filesystem would be the one which wants an exclusive
> lock, so it can modify the mapping of logical to physical blocks.
> 
> The complication being that by default the filesystem has an exclusive
> lock on the mapping, and what we're trying to add is the ability for
> readers to ask the filesystem to give up its exclusive lock.

This is an interesting view...

And after some more thought, exclusive does not seem like a good name for this
because technically F_WRLCK _is_ an exclusive lease...

In addition, the user does not need to take the "exclusive" write lease to be
notified of (broken by) an unexpected truncate.  A "read" lease is broken by
truncate.  (And "write" leases really don't do anything different WRT the
interaction of the FS and the user app.  Write leases control "exclusive"
access between other file descriptors.)

Another thing to consider is that this patch set _allows_ a truncate/hole punch
to proceed _if_ the pages being affected are not actually pinned.  So the
unbreakable/exclusive nature of the lease is not absolute.

Personally I like this functionality.  I'm not quite sure I can make it work
with what Jan is suggesting.  But I like it.

Given the muddied water of "exclusive" and "write" lease I'm now feeling like
Jeff has a point WRT the conflation of F_RDLCK/F_WRLCK/F_UNLCK and this new
functionality.

Should we use his suggested F_SETLAYOUT/F_GETLAYOUT cmd type?[1]

Ira

[1] https://lkml.org/lkml/2019/6/9/117

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                   ` <20190612123751.GD32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
  2019-06-12 23:30                     ` Ira Weiny
@ 2019-06-13  0:25                     ` Dave Chinner
       [not found]                       ` <20190613002555.GH14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
  1 sibling, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2019-06-13  0:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Theodore Ts'o, linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton

On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > Are you suggesting that we have something like this from user space?
> > > 
> > > 	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > 
> > Rather than "unbreakable", perhaps a clearer description of the
> > policy it entails is "exclusive"?
> > 
> > i.e. what we are talking about here is an exclusive lease that
> > prevents other processes from changing the layout. i.e. the
> > mechanism used to guarantee a lease is exclusive is that the layout
> > becomes "unbreakable" at the filesystem level, but the policy we are
> > actually presenting to uses is "exclusive access"...
> 
> That's rather different from the normal meaning of 'exclusive' in the
> context of locks, which is "only one user can have access to this at
> a time".

Layout leases are not locks, they are a user access policy object.
It is the process/fd which holds the lease and it's the process/fd
that is granted exclusive access.  This is exactly the same semantic
as O_EXCL provides for granting exclusive access to a block device
via open(), yes?

> As I understand it, this is rather more like a 'shared' or
> 'read' lock.  The filesystem would be the one which wants an exclusive
> lock, so it can modify the mapping of logical to physical blocks.

ISTM that you're conflating internal filesystem implementation with
application visible semantics. Yes, the filesystem uses internal
locks to serialise the modification of the things the lease manages
access too, but that has nothing to do with the access policy the
lease provides to users.

e.g. Process A has an exclusive layout lease on file F. It does an
IO to file F. The filesystem IO path checks that Process A owns the
lease on the file and so skips straight through layout breaking
because it owns the lease and is allowed to modify the layout. It
then takes the inode metadata locks to allocate new space and write
new data.

Process B now tries to write to file F. The FS checks whether
Process B owns a layout lease on file F. It doesn't, so then it
tries to break the layout lease so the IO can proceed. The layout
breaking code sees that process A has an exclusive layout lease
granted, and so returns -ETXTBSY to process B - it is not allowed to
break the lease and so the IO fails with -ETXTBSY.

i.e. the exclusive layout lease prevents other processes from
performing operations that may need to modify the layout from
performing those operations. It does not "lock" the file/inode in
any way, it just changes how the layout lease breaking behaves.

Further, the "exclusiveness" of a layout lease is completely
irrelevant to the filesystem that is indicating that an operation
that may need to modify the layout is about to be performed. All the
filesystem has to do is handle failures to break the lease
appropriately.  Yes, XFS serialises the layout lease validation
against other IO to the same file via it's IO locks, but that's an
internal data IO coherency requirement, not anything to do with
layout lease management.

Note that I talk about /writes/ here. This is interchangable with
any other operation that may need to modify the extent layout of the
file, be it truncate, fallocate, etc: the attempt to break the
layout lease by a non-owner should fail if the lease is "exclusive"
to the owner.

> The complication being that by default the filesystem has an exclusive
> lock on the mapping, and what we're trying to add is the ability for
> readers to ask the filesystem to give up its exclusive lock.

The filesystem doesn't even lock the "mapping" until after the
layout lease has been validated or broken.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                       ` <20190612233024.GD14336-J5EW/p2F9lUlb2qzJQmr9q2pdiUAq4bhAL8bYrjMMd8@public.gmane.org>
@ 2019-06-13  0:55                         ` Dave Chinner
       [not found]                           ` <20190613005552.GI14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2019-06-13  0:55 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jason Gunthorpe, Theodore Ts'o,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Jan Kara, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote:
> On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > Are you suggesting that we have something like this from user space?
> > > > 
> > > > 	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > 
> > > Rather than "unbreakable", perhaps a clearer description of the
> > > policy it entails is "exclusive"?
> > > 
> > > i.e. what we are talking about here is an exclusive lease that
> > > prevents other processes from changing the layout. i.e. the
> > > mechanism used to guarantee a lease is exclusive is that the layout
> > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > actually presenting to uses is "exclusive access"...
> > 
> > That's rather different from the normal meaning of 'exclusive' in the
> > context of locks, which is "only one user can have access to this at
> > a time".  As I understand it, this is rather more like a 'shared' or
> > 'read' lock.  The filesystem would be the one which wants an exclusive
> > lock, so it can modify the mapping of logical to physical blocks.
> > 
> > The complication being that by default the filesystem has an exclusive
> > lock on the mapping, and what we're trying to add is the ability for
> > readers to ask the filesystem to give up its exclusive lock.
> 
> This is an interesting view...
> 
> And after some more thought, exclusive does not seem like a good name for this
> because technically F_WRLCK _is_ an exclusive lease...
> 
> In addition, the user does not need to take the "exclusive" write lease to be
> notified of (broken by) an unexpected truncate.  A "read" lease is broken by
> truncate.  (And "write" leases really don't do anything different WRT the
> interaction of the FS and the user app.  Write leases control "exclusive"
> access between other file descriptors.)

I've been assuming that there is only one type of layout lease -
there is no use case I've heard of for read/write layout leases, and
like you say there is zero difference in behaviour at the filesystem
level - they all have to be broken to allow a non-lease truncate to
proceed.

IMO, taking a "read lease" to be able to modify and write to the
underlying mapping of a file makes absolutely no sense at all.
IOWs, we're talking exaclty about a revokable layout lease vs an
exclusive layout lease here, and so read/write really doesn't match
the policy or semantics we are trying to provide.

> Another thing to consider is that this patch set _allows_ a truncate/hole punch
> to proceed _if_ the pages being affected are not actually pinned.  So the
> unbreakable/exclusive nature of the lease is not absolute.

If you're talking about the process that owns the layout lease
running the truncate, then that is fine.

However, if you are talking about a process that does not own the
layout lease being allowed to truncate a file without first breaking
the layout lease, then that is fundamentally broken.

i.e. If you don't own a layout lease, the layout leases must be
broken before the truncate can proceed. If it's an exclusive lease,
then you cannot break the lease and the truncate *must fail before
it is started*. i.e.  the layout lease state must be correctly
resolved before we start an operation that may modify a file layout.

Determining if we can actually do the truncate based on page state
occurs /after/ the lease says the truncate can proceed....

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                       ` <20190613002555.GH14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
@ 2019-06-13  3:23                         ` Matthew Wilcox
       [not found]                           ` <20190613032320.GG32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
  2019-06-13 15:29                           ` Jason Gunthorpe
  2019-06-13 15:27                         ` Matthew Wilcox
  2019-06-13 20:34                         ` Ira Weiny
  2 siblings, 2 replies; 30+ messages in thread
From: Matthew Wilcox @ 2019-06-13  3:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Theodore Ts'o, linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton

On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > That's rather different from the normal meaning of 'exclusive' in the
> > context of locks, which is "only one user can have access to this at
> > a time".
> 
> Layout leases are not locks, they are a user access policy object.
> It is the process/fd which holds the lease and it's the process/fd
> that is granted exclusive access.  This is exactly the same semantic
> as O_EXCL provides for granting exclusive access to a block device
> via open(), yes?

This isn't my understanding of how RDMA wants this to work, so we should
probably clear that up before we get too far down deciding what name to
give it.

For the RDMA usage case, it is entirely possible that both process A
and process B which don't know about each other want to perform RDMA to
file F.  So there will be two layout leases active on this file at the
same time.  It's fine for IOs to simultaneously be active to both leases.
But if the filesystem wants to move blocks around, it has to break
both leases.

If Process C tries to do a write to file F without a lease, there's no
problem, unless a side-effect of the write would be to change the block
mapping, in which case either the leases must break first, or the write
must be denied.

Jason, please correct me if I've misunderstood the RDMA needs here.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                           ` <20190613032320.GG32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
@ 2019-06-13  4:36                             ` Dave Chinner
       [not found]                               ` <20190613043649.GJ14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2019-06-13  4:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Theodore Ts'o, linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton

On Wed, Jun 12, 2019 at 08:23:20PM -0700, Matthew Wilcox wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > That's rather different from the normal meaning of 'exclusive' in the
> > > context of locks, which is "only one user can have access to this at
> > > a time".
> > 
> > Layout leases are not locks, they are a user access policy object.
> > It is the process/fd which holds the lease and it's the process/fd
> > that is granted exclusive access.  This is exactly the same semantic
> > as O_EXCL provides for granting exclusive access to a block device
> > via open(), yes?
> 
> This isn't my understanding of how RDMA wants this to work, so we should
> probably clear that up before we get too far down deciding what name to
> give it.
> 
> For the RDMA usage case, it is entirely possible that both process A
> and process B which don't know about each other want to perform RDMA to
> file F.  So there will be two layout leases active on this file at the
> same time.  It's fine for IOs to simultaneously be active to both leases.

Yes, it is.

> But if the filesystem wants to move blocks around, it has to break
> both leases.

No, the _lease layer_ needs to break both leases when the filesystem
calls break_layout().

The filesystem is /completely unaware/ of whether a lease is held,
how many leases are held, what is involved in revoking leases or
whether they are exclusive or not. The filesystem only knows that it
is about to perform an operation that may require a layout lease to
be broken, so it's _asking permission_ from the layout lease layer
whether it is OK to go ahead with the operation.

See what I mean about the layout lease being an /access arbitration/
layer? It's the layer that decides whether a modification can be
made or not, not the filesystem. The layout lease layer tells the
filesystem what it should do, the filesystem just has to ensure it
adds layout breaking callouts in places that can block safely and
are serialised to ensure operations from new layouts can't race with
the operation that broke the existing layouts.

> If Process C tries to do a write to file F without a lease, there's no
> problem, unless a side-effect of the write would be to change the block
> mapping,

That's a side effect we cannot predict ahead of time. But it's
also _completely irrelevant_ to the layout lease layer API and
implementation.(*)

> in which case either the leases must break first, or the write
> must be denied.

Which is exactly how I'm saying layout leases already interact with
the filesystem: that if the lease cannot be broken, break_layout()
returns -ETXTBSY to the filesystem, and the filesystem returns that
to the application having made no changes at all. Layout leases are
the policy engine, the filesystem just has to implement the
break_layout() callouts such that layout breaking is consistent,
correct, and robust....

Cheers,

Dave.

(*) In the case of XFS, we don't know if a layout change will be
necessary until we are deep inside the actual IO path and hold inode
metadata locks. We can't block here to break the layout because IO
completion and metadata commits need to occur to allow the
application to release it's lease and IO completion requires that
same inode metadata lock. i.e. if we block once we know a layout
change needs to occur, we will deadlock the filesystem on the inode
metadata lock.

Hence the filesystem implementation dictates when the filesystem
issues layout lease break notifications. However, these filesystem
implementation issues do not dictate how applications interact with
layout leases, how layout leases are managed, whether concurrent
leases are allowed, whether leases can be broken, etc.  That's all
managed by the layout lease layer and that's where the go/no go
decision is made and communicated to the filesystem as the return
value from the break_layout() call.

-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                               ` <20190613043649.GJ14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
@ 2019-06-13 10:47                                 ` Matthew Wilcox
  0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2019-06-13 10:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Theodore Ts'o, linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton

On Thu, Jun 13, 2019 at 02:36:49PM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 08:23:20PM -0700, Matthew Wilcox wrote:
> > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > > That's rather different from the normal meaning of 'exclusive' in the
> > > > context of locks, which is "only one user can have access to this at
> > > > a time".
> > > 
> > > Layout leases are not locks, they are a user access policy object.
> > > It is the process/fd which holds the lease and it's the process/fd
> > > that is granted exclusive access.  This is exactly the same semantic
> > > as O_EXCL provides for granting exclusive access to a block device
> > > via open(), yes?
> > 
> > This isn't my understanding of how RDMA wants this to work, so we should
> > probably clear that up before we get too far down deciding what name to
> > give it.
> > 
> > For the RDMA usage case, it is entirely possible that both process A
> > and process B which don't know about each other want to perform RDMA to
> > file F.  So there will be two layout leases active on this file at the
> > same time.  It's fine for IOs to simultaneously be active to both leases.
> 
> Yes, it is.
> 
> > But if the filesystem wants to move blocks around, it has to break
> > both leases.
> 
> No, the _lease layer_ needs to break both leases when the filesystem
> calls break_layout().

That's a distinction without a difference as far as userspace is
concerned.  If process A asks for an exclusive lease (and gets it),
then process B asks for an exclusive lease (and gets it), that lease
isn't exclusive!  It's shared.

I think the example you give of O_EXCL is more of a historical accident.
It's a relatively recent Linuxism that O_EXCL on a block device means
"this block device is not part of a filesystem", and I don't think
most userspace programmers are aware of what it means when not paired
with O_CREAT.

> > If Process C tries to do a write to file F without a lease, there's no
> > problem, unless a side-effect of the write would be to change the block
> > mapping,
> 
> That's a side effect we cannot predict ahead of time. But it's
> also _completely irrelevant_ to the layout lease layer API and
> implementation.(*)

It's irrelevant to the naming, but you brought it up as part of the
semantics.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                       ` <20190613002555.GH14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
  2019-06-13  3:23                         ` Matthew Wilcox
@ 2019-06-13 15:27                         ` Matthew Wilcox
  2019-06-13 21:13                           ` Ira Weiny
       [not found]                           ` <20190613152755.GI32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
  2019-06-13 20:34                         ` Ira Weiny
  2 siblings, 2 replies; 30+ messages in thread
From: Matthew Wilcox @ 2019-06-13 15:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Theodore Ts'o, linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton

On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> e.g. Process A has an exclusive layout lease on file F. It does an
> IO to file F. The filesystem IO path checks that Process A owns the
> lease on the file and so skips straight through layout breaking
> because it owns the lease and is allowed to modify the layout. It
> then takes the inode metadata locks to allocate new space and write
> new data.
> 
> Process B now tries to write to file F. The FS checks whether
> Process B owns a layout lease on file F. It doesn't, so then it
> tries to break the layout lease so the IO can proceed. The layout
> breaking code sees that process A has an exclusive layout lease
> granted, and so returns -ETXTBSY to process B - it is not allowed to
> break the lease and so the IO fails with -ETXTBSY.

This description doesn't match the behaviour that RDMA wants either.
Even if Process A has a lease on the file, an IO from Process A which
results in blocks being freed from the file is going to result in the
RDMA device being able to write to blocks which are now freed (and
potentially reallocated to another file).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-13  3:23                         ` Matthew Wilcox
       [not found]                           ` <20190613032320.GG32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
@ 2019-06-13 15:29                           ` Jason Gunthorpe
  1 sibling, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2019-06-13 15:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Ira Weiny, Jan Kara, Dan Williams,
	Theodore Ts'o, Jeff Layton, linux-xfs, Andrew Morton,
	John Hubbard, Jérôme Glisse, linux-fsdevel,
	linux-kernel, linux-nvdimm, linux-ext4, linux-mm, linux-rdma

On Wed, Jun 12, 2019 at 08:23:20PM -0700, Matthew Wilcox wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > That's rather different from the normal meaning of 'exclusive' in the
> > > context of locks, which is "only one user can have access to this at
> > > a time".
> > 
> > Layout leases are not locks, they are a user access policy object.
> > It is the process/fd which holds the lease and it's the process/fd
> > that is granted exclusive access.  This is exactly the same semantic
> > as O_EXCL provides for granting exclusive access to a block device
> > via open(), yes?
> 
> This isn't my understanding of how RDMA wants this to work, so we should
> probably clear that up before we get too far down deciding what name to
> give it.
> 
> For the RDMA usage case, it is entirely possible that both process A
> and process B which don't know about each other want to perform RDMA to
> file F.  So there will be two layout leases active on this file at the
> same time.  It's fine for IOs to simultaneously be active to both leases.
> But if the filesystem wants to move blocks around, it has to break
> both leases.
> 
> If Process C tries to do a write to file F without a lease, there's no
> problem, unless a side-effect of the write would be to change the block
> mapping, in which case either the leases must break first, or the write
> must be denied.
> 
> Jason, please correct me if I've misunderstood the RDMA needs here.

Yes, I think you've captured it

Based on Dave's remarks how frequently a filesystem needs to break the
lease will determine if it is usuable to be combined with RDMA or
not...

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                       ` <20190613002555.GH14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
  2019-06-13  3:23                         ` Matthew Wilcox
  2019-06-13 15:27                         ` Matthew Wilcox
@ 2019-06-13 20:34                         ` Ira Weiny
  2019-06-14  2:58                           ` Dave Chinner
  2 siblings, 1 reply; 30+ messages in thread
From: Ira Weiny @ 2019-06-13 20:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Theodore Ts'o,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Jan Kara, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > Are you suggesting that we have something like this from user space?
> > > > 
> > > > 	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > 
> > > Rather than "unbreakable", perhaps a clearer description of the
> > > policy it entails is "exclusive"?
> > > 
> > > i.e. what we are talking about here is an exclusive lease that
> > > prevents other processes from changing the layout. i.e. the
> > > mechanism used to guarantee a lease is exclusive is that the layout
> > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > actually presenting to uses is "exclusive access"...
> > 
> > That's rather different from the normal meaning of 'exclusive' in the
> > context of locks, which is "only one user can have access to this at
> > a time".
> 
> 
> Layout leases are not locks, they are a user access policy object.
> It is the process/fd which holds the lease and it's the process/fd
> that is granted exclusive access.  This is exactly the same semantic
> as O_EXCL provides for granting exclusive access to a block device
> via open(), yes?
> 
> > As I understand it, this is rather more like a 'shared' or
> > 'read' lock.  The filesystem would be the one which wants an exclusive
> > lock, so it can modify the mapping of logical to physical blocks.
> 
> ISTM that you're conflating internal filesystem implementation with
> application visible semantics. Yes, the filesystem uses internal
> locks to serialise the modification of the things the lease manages
> access too, but that has nothing to do with the access policy the
> lease provides to users.
> 
> e.g. Process A has an exclusive layout lease on file F. It does an
> IO to file F. The filesystem IO path checks that Process A owns the
> lease on the file and so skips straight through layout breaking
> because it owns the lease and is allowed to modify the layout. It
> then takes the inode metadata locks to allocate new space and write
> new data.
> 
> Process B now tries to write to file F. The FS checks whether
> Process B owns a layout lease on file F. It doesn't, so then it
> tries to break the layout lease so the IO can proceed. The layout
> breaking code sees that process A has an exclusive layout lease
> granted, and so returns -ETXTBSY to process B - it is not allowed to
> break the lease and so the IO fails with -ETXTBSY.
> 
> i.e. the exclusive layout lease prevents other processes from
> performing operations that may need to modify the layout from
> performing those operations. It does not "lock" the file/inode in
> any way, it just changes how the layout lease breaking behaves.

Question: Do we expect Process A to get notified that Process B was attempting
to change the layout?

This changes the exclusivity semantics.  While Process A has an exclusive lease
it could release it if notified to allow process B temporary exclusivity.

Question 2: Do we expect other process' (say Process C) to also be able to map
and pin the file?  I believe users will need this and for layout purposes it is
ok to do so.  But this means that Process A does not have "exclusive" access to
the lease.

So given Process C has also placed a layout lease on the file.  Indicating
that it does not want the layout to change.  Both A and C need to be "broken"
by Process B to change the layout.  If there is no Process B; A and C can run
just fine with a "locked" layout.

Ira

> 
> Further, the "exclusiveness" of a layout lease is completely
> irrelevant to the filesystem that is indicating that an operation
> that may need to modify the layout is about to be performed. All the
> filesystem has to do is handle failures to break the lease
> appropriately.  Yes, XFS serialises the layout lease validation
> against other IO to the same file via it's IO locks, but that's an
> internal data IO coherency requirement, not anything to do with
> layout lease management.
> 
> Note that I talk about /writes/ here. This is interchangable with
> any other operation that may need to modify the extent layout of the
> file, be it truncate, fallocate, etc: the attempt to break the
> layout lease by a non-owner should fail if the lease is "exclusive"
> to the owner.
> 
> > The complication being that by default the filesystem has an exclusive
> > lock on the mapping, and what we're trying to add is the ability for
> > readers to ask the filesystem to give up its exclusive lock.
> 
> The filesystem doesn't even lock the "mapping" until after the
> layout lease has been validated or broken.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                           ` <20190613005552.GI14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
@ 2019-06-13 20:34                             ` Ira Weiny
  2019-06-14  3:42                               ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Ira Weiny @ 2019-06-13 20:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Theodore Ts'o,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Jan Kara, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Thu, Jun 13, 2019 at 10:55:52AM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote:
> > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > > Are you suggesting that we have something like this from user space?
> > > > > 
> > > > > 	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > > 
> > > > Rather than "unbreakable", perhaps a clearer description of the
> > > > policy it entails is "exclusive"?
> > > > 
> > > > i.e. what we are talking about here is an exclusive lease that
> > > > prevents other processes from changing the layout. i.e. the
> > > > mechanism used to guarantee a lease is exclusive is that the layout
> > > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > > actually presenting to uses is "exclusive access"...
> > > 
> > > That's rather different from the normal meaning of 'exclusive' in the
> > > context of locks, which is "only one user can have access to this at
> > > a time".  As I understand it, this is rather more like a 'shared' or
> > > 'read' lock.  The filesystem would be the one which wants an exclusive
> > > lock, so it can modify the mapping of logical to physical blocks.
> > > 
> > > The complication being that by default the filesystem has an exclusive
> > > lock on the mapping, and what we're trying to add is the ability for
> > > readers to ask the filesystem to give up its exclusive lock.
> > 
> > This is an interesting view...
> > 
> > And after some more thought, exclusive does not seem like a good name for this
> > because technically F_WRLCK _is_ an exclusive lease...
> > 
> > In addition, the user does not need to take the "exclusive" write lease to be
> > notified of (broken by) an unexpected truncate.  A "read" lease is broken by
> > truncate.  (And "write" leases really don't do anything different WRT the
> > interaction of the FS and the user app.  Write leases control "exclusive"
> > access between other file descriptors.)
> 
> I've been assuming that there is only one type of layout lease -
> there is no use case I've heard of for read/write layout leases, and
> like you say there is zero difference in behaviour at the filesystem
> level - they all have to be broken to allow a non-lease truncate to
> proceed.
> 
> IMO, taking a "read lease" to be able to modify and write to the
> underlying mapping of a file makes absolutely no sense at all.
> IOWs, we're talking exaclty about a revokable layout lease vs an
> exclusive layout lease here, and so read/write really doesn't match
> the policy or semantics we are trying to provide.

I humbly disagree, at least depending on how you look at it...  :-D

The patches as they stand expect the user to take a "read" layout lease which
indicates they are currently using "reading" the layout as is.  They are not
changing ("writing" to) the layout.  They then pin pages which locks parts of
the layout and therefore they expect no "writers" to change the layout.

The "write" layout lease breaks the "read" layout lease indicating that the
layout is being written to.  Should the layout be pinned in such a way that the
layout can't be changed the "layout writer" (truncate) fails.

In fact, this is what NFS does right now.  The lease it puts on the file is of
"read" type.

nfs4layouts.c:
static int
nfsd4_layout_setlease(struct nfs4_layout_stateid *ls)
{
...
        fl->fl_flags = FL_LAYOUT;
        fl->fl_type = F_RDLCK;
...
}

I was not changing that much from the NFS patter which meant the break lease
code worked.

Jans proposal is solid but it means that there is no breaking of the lease.  I
tried to add an "exclusive" flag to the "write" lease but the __break_lease()
code gets weird.  I'm not saying it is not possible.  Just that I have not
seen a good way to do it.

> 
> > Another thing to consider is that this patch set _allows_ a truncate/hole punch
> > to proceed _if_ the pages being affected are not actually pinned.  So the
> > unbreakable/exclusive nature of the lease is not absolute.
> 
> If you're talking about the process that owns the layout lease
> running the truncate, then that is fine.
> 
> However, if you are talking about a process that does not own the
> layout lease being allowed to truncate a file without first breaking
> the layout lease, then that is fundamentally broken.

In both cases (local or remote process) the lease is broken prior to the
attempt to truncate.

> 
> i.e. If you don't own a layout lease, the layout leases must be
> broken before the truncate can proceed.

Agreed.

>
> If it's an exclusive lease,
> then you cannot break the lease and the truncate *must fail before
> it is started*. i.e.  the layout lease state must be correctly
> resolved before we start an operation that may modify a file layout.
> 
> Determining if we can actually do the truncate based on page state
> occurs /after/ the lease says the truncate can proceed....

That makes a lot of sense and that is the way the patch currently works.

I need to think on this some more.  Keeping the lease may not be critical.  As
discussed with Jan; dealing with close() is best dealt with by tracking the
actual pins on the file.  If that works then we could potentially keep the
lease semantics closer to what you and I are talking about here.

Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-13 15:27                         ` Matthew Wilcox
@ 2019-06-13 21:13                           ` Ira Weiny
  2019-06-13 23:45                             ` Jason Gunthorpe
       [not found]                           ` <20190613152755.GI32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
  1 sibling, 1 reply; 30+ messages in thread
From: Ira Weiny @ 2019-06-13 21:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jan Kara, Dan Williams, Theodore Ts'o,
	Jeff Layton, linux-xfs, Andrew Morton, John Hubbard,
	Jérôme Glisse, linux-fsdevel, linux-kernel,
	linux-nvdimm, linux-ext4, linux-mm, Jason Gunthorpe, linux-rdma

On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > e.g. Process A has an exclusive layout lease on file F. It does an
> > IO to file F. The filesystem IO path checks that Process A owns the
> > lease on the file and so skips straight through layout breaking
> > because it owns the lease and is allowed to modify the layout. It
> > then takes the inode metadata locks to allocate new space and write
> > new data.
> > 
> > Process B now tries to write to file F. The FS checks whether
> > Process B owns a layout lease on file F. It doesn't, so then it
> > tries to break the layout lease so the IO can proceed. The layout
> > breaking code sees that process A has an exclusive layout lease
> > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > break the lease and so the IO fails with -ETXTBSY.
> 
> This description doesn't match the behaviour that RDMA wants either.
> Even if Process A has a lease on the file, an IO from Process A which
> results in blocks being freed from the file is going to result in the
> RDMA device being able to write to blocks which are now freed (and
> potentially reallocated to another file).

I don't understand why this would not work for RDMA?  As long as the layout
does not change the page pins can remain in place.

Ira

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-13 21:13                           ` Ira Weiny
@ 2019-06-13 23:45                             ` Jason Gunthorpe
  2019-06-14  0:00                               ` Ira Weiny
       [not found]                               ` <20190613234530.GK22901-uk2M96/98Pc@public.gmane.org>
  0 siblings, 2 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2019-06-13 23:45 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Matthew Wilcox, Dave Chinner, Jan Kara, Dan Williams,
	Theodore Ts'o, Jeff Layton, linux-xfs, Andrew Morton,
	John Hubbard, Jérôme Glisse, linux-fsdevel,
	linux-kernel, linux-nvdimm, linux-ext4, linux-mm, linux-rdma

On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote:
> On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote:
> > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > > e.g. Process A has an exclusive layout lease on file F. It does an
> > > IO to file F. The filesystem IO path checks that Process A owns the
> > > lease on the file and so skips straight through layout breaking
> > > because it owns the lease and is allowed to modify the layout. It
> > > then takes the inode metadata locks to allocate new space and write
> > > new data.
> > > 
> > > Process B now tries to write to file F. The FS checks whether
> > > Process B owns a layout lease on file F. It doesn't, so then it
> > > tries to break the layout lease so the IO can proceed. The layout
> > > breaking code sees that process A has an exclusive layout lease
> > > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > > break the lease and so the IO fails with -ETXTBSY.
> > 
> > This description doesn't match the behaviour that RDMA wants either.
> > Even if Process A has a lease on the file, an IO from Process A which
> > results in blocks being freed from the file is going to result in the
> > RDMA device being able to write to blocks which are now freed (and
> > potentially reallocated to another file).
> 
> I don't understand why this would not work for RDMA?  As long as the layout
> does not change the page pins can remain in place.

Because process A had a layout lease (and presumably a MR) and the
layout was still modified in way that invalidates the RDMA MR.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-13 23:45                             ` Jason Gunthorpe
@ 2019-06-14  0:00                               ` Ira Weiny
       [not found]                               ` <20190613234530.GK22901-uk2M96/98Pc@public.gmane.org>
  1 sibling, 0 replies; 30+ messages in thread
From: Ira Weiny @ 2019-06-14  0:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, Dave Chinner, Jan Kara, Dan Williams,
	Theodore Ts'o, Jeff Layton, linux-xfs, Andrew Morton,
	John Hubbard, Jérôme Glisse, linux-fsdevel,
	linux-kernel, linux-nvdimm, linux-ext4, linux-mm, linux-rdma

On Thu, Jun 13, 2019 at 08:45:30PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote:
> > On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote:
> > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > > > e.g. Process A has an exclusive layout lease on file F. It does an
> > > > IO to file F. The filesystem IO path checks that Process A owns the
> > > > lease on the file and so skips straight through layout breaking
> > > > because it owns the lease and is allowed to modify the layout. It
> > > > then takes the inode metadata locks to allocate new space and write
> > > > new data.
> > > > 
> > > > Process B now tries to write to file F. The FS checks whether
> > > > Process B owns a layout lease on file F. It doesn't, so then it
> > > > tries to break the layout lease so the IO can proceed. The layout
> > > > breaking code sees that process A has an exclusive layout lease
> > > > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > > > break the lease and so the IO fails with -ETXTBSY.
> > > 
> > > This description doesn't match the behaviour that RDMA wants either.
> > > Even if Process A has a lease on the file, an IO from Process A which
> > > results in blocks being freed from the file is going to result in the
> > > RDMA device being able to write to blocks which are now freed (and
> > > potentially reallocated to another file).
> > 
> > I don't understand why this would not work for RDMA?  As long as the layout
> > does not change the page pins can remain in place.
> 
> Because process A had a layout lease (and presumably a MR) and the
> layout was still modified in way that invalidates the RDMA MR.

Oh sorry I miss read the above...  (got Process A and  B mixed up...)

Right, but Process A still can't free those blocks because the gup pin exists
on them...  So yea it can't _just_ be a layout lease which controls this on the
"file fd".

Ira

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                               ` <20190613234530.GK22901-uk2M96/98Pc@public.gmane.org>
@ 2019-06-14  2:09                                 ` Dave Chinner
  2019-06-14  2:31                                   ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2019-06-14  2:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Theodore Ts'o, linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton

On Thu, Jun 13, 2019 at 08:45:30PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote:
> > On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote:
> > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > > > e.g. Process A has an exclusive layout lease on file F. It does an
> > > > IO to file F. The filesystem IO path checks that Process A owns the
> > > > lease on the file and so skips straight through layout breaking
> > > > because it owns the lease and is allowed to modify the layout. It
> > > > then takes the inode metadata locks to allocate new space and write
> > > > new data.
> > > > 
> > > > Process B now tries to write to file F. The FS checks whether
> > > > Process B owns a layout lease on file F. It doesn't, so then it
> > > > tries to break the layout lease so the IO can proceed. The layout
> > > > breaking code sees that process A has an exclusive layout lease
> > > > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > > > break the lease and so the IO fails with -ETXTBSY.
> > > 
> > > This description doesn't match the behaviour that RDMA wants either.
> > > Even if Process A has a lease on the file, an IO from Process A which
> > > results in blocks being freed from the file is going to result in the
> > > RDMA device being able to write to blocks which are now freed (and
> > > potentially reallocated to another file).
> > 
> > I don't understand why this would not work for RDMA?  As long as the layout
> > does not change the page pins can remain in place.
> 
> Because process A had a layout lease (and presumably a MR) and the
> layout was still modified in way that invalidates the RDMA MR.

The lease holder is allowed to modify the mapping it has a lease
over. That's necessary so lease holders can write data into
unallocated space in the file. The lease is there to prevent third
parties from modifying the layout without the lease holder being
informed and taking appropriate action to allow that 3rd party
modification to occur.

If the lease holder modifies the mapping in a way that causes it's
own internal state to screw up, then that's a bug in the lease
holder application.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-14  2:09                                 ` Dave Chinner
@ 2019-06-14  2:31                                   ` Matthew Wilcox
       [not found]                                     ` <20190614023107.GK32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2019-06-14  2:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Ira Weiny, Jan Kara, Dan Williams,
	Theodore Ts'o, Jeff Layton, linux-xfs, Andrew Morton,
	John Hubbard, Jérôme Glisse, linux-fsdevel,
	linux-kernel, linux-nvdimm, linux-ext4, linux-mm, linux-rdma

On Fri, Jun 14, 2019 at 12:09:21PM +1000, Dave Chinner wrote:
> On Thu, Jun 13, 2019 at 08:45:30PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote:
> > > On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote:
> > > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > > > > e.g. Process A has an exclusive layout lease on file F. It does an
> > > > > IO to file F. The filesystem IO path checks that Process A owns the
> > > > > lease on the file and so skips straight through layout breaking
> > > > > because it owns the lease and is allowed to modify the layout. It
> > > > > then takes the inode metadata locks to allocate new space and write
> > > > > new data.
> > > > > 
> > > > > Process B now tries to write to file F. The FS checks whether
> > > > > Process B owns a layout lease on file F. It doesn't, so then it
> > > > > tries to break the layout lease so the IO can proceed. The layout
> > > > > breaking code sees that process A has an exclusive layout lease
> > > > > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > > > > break the lease and so the IO fails with -ETXTBSY.
> > > > 
> > > > This description doesn't match the behaviour that RDMA wants either.
> > > > Even if Process A has a lease on the file, an IO from Process A which
> > > > results in blocks being freed from the file is going to result in the
> > > > RDMA device being able to write to blocks which are now freed (and
> > > > potentially reallocated to another file).
> > > 
> > > I don't understand why this would not work for RDMA?  As long as the layout
> > > does not change the page pins can remain in place.
> > 
> > Because process A had a layout lease (and presumably a MR) and the
> > layout was still modified in way that invalidates the RDMA MR.
> 
> The lease holder is allowed to modify the mapping it has a lease
> over. That's necessary so lease holders can write data into
> unallocated space in the file. The lease is there to prevent third
> parties from modifying the layout without the lease holder being
> informed and taking appropriate action to allow that 3rd party
> modification to occur.
> 
> If the lease holder modifies the mapping in a way that causes it's
> own internal state to screw up, then that's a bug in the lease
> holder application.

Sounds like the lease semantics aren't the right ones for the longterm
GUP users then.  The point of the longterm GUP is so the pages can be
written to, and if the filesystem is going to move the pages around when
they're written to, that just won't work.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-13 20:34                         ` Ira Weiny
@ 2019-06-14  2:58                           ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2019-06-14  2:58 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Matthew Wilcox, Jan Kara, Dan Williams, Theodore Ts'o,
	Jeff Layton, linux-xfs, Andrew Morton, John Hubbard,
	Jérôme Glisse, linux-fsdevel, linux-kernel,
	linux-nvdimm, linux-ext4, linux-mm, Jason Gunthorpe, linux-rdma

On Thu, Jun 13, 2019 at 01:34:05PM -0700, Ira Weiny wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > > Are you suggesting that we have something like this from user space?
> > > > > 
> > > > > 	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > > 
> > > > Rather than "unbreakable", perhaps a clearer description of the
> > > > policy it entails is "exclusive"?
> > > > 
> > > > i.e. what we are talking about here is an exclusive lease that
> > > > prevents other processes from changing the layout. i.e. the
> > > > mechanism used to guarantee a lease is exclusive is that the layout
> > > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > > actually presenting to uses is "exclusive access"...
> > > 
> > > That's rather different from the normal meaning of 'exclusive' in the
> > > context of locks, which is "only one user can have access to this at
> > > a time".
> > 
> > 
> > Layout leases are not locks, they are a user access policy object.
> > It is the process/fd which holds the lease and it's the process/fd
> > that is granted exclusive access.  This is exactly the same semantic
> > as O_EXCL provides for granting exclusive access to a block device
> > via open(), yes?
> > 
> > > As I understand it, this is rather more like a 'shared' or
> > > 'read' lock.  The filesystem would be the one which wants an exclusive
> > > lock, so it can modify the mapping of logical to physical blocks.
> > 
> > ISTM that you're conflating internal filesystem implementation with
> > application visible semantics. Yes, the filesystem uses internal
> > locks to serialise the modification of the things the lease manages
> > access too, but that has nothing to do with the access policy the
> > lease provides to users.
> > 
> > e.g. Process A has an exclusive layout lease on file F. It does an
> > IO to file F. The filesystem IO path checks that Process A owns the
> > lease on the file and so skips straight through layout breaking
> > because it owns the lease and is allowed to modify the layout. It
> > then takes the inode metadata locks to allocate new space and write
> > new data.
> > 
> > Process B now tries to write to file F. The FS checks whether
> > Process B owns a layout lease on file F. It doesn't, so then it
> > tries to break the layout lease so the IO can proceed. The layout
> > breaking code sees that process A has an exclusive layout lease
> > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > break the lease and so the IO fails with -ETXTBSY.
> > 
> > i.e. the exclusive layout lease prevents other processes from
> > performing operations that may need to modify the layout from
> > performing those operations. It does not "lock" the file/inode in
> > any way, it just changes how the layout lease breaking behaves.
> 
> Question: Do we expect Process A to get notified that Process B was attempting
> to change the layout?

In which case?

In the non-exclusive case, yes, the lease gets
recalled and the application needs to play nice and release it's
references and drop the lease.

In the exclusive case, no. The application has said "I don't play
nice with others" and so we basically tell process B to get stuffed
and process A can continue onwards oblivious to the wreckage it
leaves behind....

> This changes the exclusivity semantics.  While Process A has an exclusive lease
> it could release it if notified to allow process B temporary exclusivity.

And then it's not an exclusive lease - it's just a normal layout
lease. Process B -does not need a lease- to write to the file.

All the layout lease does is provide notification to applications
that rely on the layout of the file being under their control that
someone else is about to modify the layout. The lease holder that
"plays nice" then releases the layout and drops it's lease, allowing
process B to begin it's operation.

Process A then immediately takes a new layout lease, and remaps the
file layout via FIEMAP or by creating a new RDMA MR for the mmap
region. THose operations get serialised by the filesystem because
the operation being run by process B is run atomically w.r.t. the
original lease being broken. Hence the new mapping that process A
gets with it's new lease reflects whatever change was made by
process B.

IOWs, the "normal" layout lease recall behaviour provides "temporary
exclusivity" for third parties. If you are able to release leases
temporarily and regain them then there is no need for an exclusive
lease.

> Question 2: Do we expect other process' (say Process C) to also be able to map
> and pin the file?  I believe users will need this and for layout purposes it is
> ok to do so.  But this means that Process A does not have "exclusive" access to
> the lease.

This is an application architecture problem, not a layout lease or
filesystem problem. :)

i.e. if you have a single process controlling all the RDMA mappings,
then you can use exclusive leases. If you have multiple processes
that are uncoordinated and all require layout access to the same
file then you can't use exclusive layout leases in the application.
i.e. your application has to play nice with others.

Indeed, this is more than a application architecture problem - it's
actually a system wide architecture problem.  e.g. the pNFS server
cannot use exclusive layout leases because it has to play nice with
anything else on the local filesystem that might require a layout
lease. An example of this woudl be an app that provides coherent
RDMA access to the same storage that pNFS is sharing (e.g. a
userspace CIFS server).

Hence I see that exclusive layout leases will end up being the
exception rather than the norm, because most applications will need
to play nice with other applications on the system that also
directly access the storage under the filesystem....

> So given Process C has also placed a layout lease on the file.  Indicating
> that it does not want the layout to change.

That is *not what layout leases provide*.

Layout leases grant the owner the ability to map the layout and
directly access the underlying storage and to do it safely because
they will get a notification of 3rd party access that will
invalidate their mapping. Layout leases do not prevent anyone from
_changing_ the layout and, in fact, pNFS _requires_ the lease holder
to be able to modify the layout.

IOWs, the layout lease _as it stands now_ is a notification
mechanism that tells the lease owner when someone else is about to
modify the layout. It does not make the file layout immutable.

The "exclusive" aspect of layout we have been discussing is a
mechanism that prevents 3rd party modification of the layout by
denying the ability to break the layout. This "exclusive" aspect
does not make the layout immutable, either, it just means the
layout is only modifiable by the exclusive lease holder. 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                                     ` <20190614023107.GK32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
@ 2019-06-14  3:07                                       ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2019-06-14  3:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Ts'o, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Hubbard, Jeff Layton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, Jason Gunthorpe,
	Jérôme Glisse, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jan Kara,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Thu, Jun 13, 2019 at 07:31:07PM -0700, Matthew Wilcox wrote:
> On Fri, Jun 14, 2019 at 12:09:21PM +1000, Dave Chinner wrote:
> > If the lease holder modifies the mapping in a way that causes it's
> > own internal state to screw up, then that's a bug in the lease
> > holder application.
> 
> Sounds like the lease semantics aren't the right ones for the longterm
> GUP users then.  The point of the longterm GUP is so the pages can be
> written to, and if the filesystem is going to move the pages around when
> they're written to, that just won't work.

And now we go full circle back to the constraints we decided on long
ago because we can't rely on demand paging RDMA hardware any time
soon to do everything we need to transparently support long-term GUP
on file-backed mappings. i.e.:

	RDMA to file backed mappings must first preallocate and
	write zeros to the range of the file they are mapping so
	that the filesystem block mapping is complete and static for
	the life of the RDMA mapping that will pin it.

IOWs, the layout lease will tell the RDMA application that the
static setup it has already done  to work correctly with a file
backed mapping may be about to be broken by a third party.....

-Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
  2019-06-13 20:34                             ` Ira Weiny
@ 2019-06-14  3:42                               ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2019-06-14  3:42 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Matthew Wilcox, Jan Kara, Dan Williams, Theodore Ts'o,
	Jeff Layton, linux-xfs, Andrew Morton, John Hubbard,
	Jérôme Glisse, linux-fsdevel, linux-kernel,
	linux-nvdimm, linux-ext4, linux-mm, Jason Gunthorpe, linux-rdma

On Thu, Jun 13, 2019 at 01:34:06PM -0700, Ira Weiny wrote:
> On Thu, Jun 13, 2019 at 10:55:52AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote:
> > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > > > Are you suggesting that we have something like this from user space?
> > > > > > 
> > > > > > 	fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > > > 
> > > > > Rather than "unbreakable", perhaps a clearer description of the
> > > > > policy it entails is "exclusive"?
> > > > > 
> > > > > i.e. what we are talking about here is an exclusive lease that
> > > > > prevents other processes from changing the layout. i.e. the
> > > > > mechanism used to guarantee a lease is exclusive is that the layout
> > > > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > > > actually presenting to uses is "exclusive access"...
> > > > 
> > > > That's rather different from the normal meaning of 'exclusive' in the
> > > > context of locks, which is "only one user can have access to this at
> > > > a time".  As I understand it, this is rather more like a 'shared' or
> > > > 'read' lock.  The filesystem would be the one which wants an exclusive
> > > > lock, so it can modify the mapping of logical to physical blocks.
> > > > 
> > > > The complication being that by default the filesystem has an exclusive
> > > > lock on the mapping, and what we're trying to add is the ability for
> > > > readers to ask the filesystem to give up its exclusive lock.
> > > 
> > > This is an interesting view...
> > > 
> > > And after some more thought, exclusive does not seem like a good name for this
> > > because technically F_WRLCK _is_ an exclusive lease...
> > > 
> > > In addition, the user does not need to take the "exclusive" write lease to be
> > > notified of (broken by) an unexpected truncate.  A "read" lease is broken by
> > > truncate.  (And "write" leases really don't do anything different WRT the
> > > interaction of the FS and the user app.  Write leases control "exclusive"
> > > access between other file descriptors.)
> > 
> > I've been assuming that there is only one type of layout lease -
> > there is no use case I've heard of for read/write layout leases, and
> > like you say there is zero difference in behaviour at the filesystem
> > level - they all have to be broken to allow a non-lease truncate to
> > proceed.
> > 
> > IMO, taking a "read lease" to be able to modify and write to the
> > underlying mapping of a file makes absolutely no sense at all.
> > IOWs, we're talking exaclty about a revokable layout lease vs an
> > exclusive layout lease here, and so read/write really doesn't match
> > the policy or semantics we are trying to provide.
> 
> I humbly disagree, at least depending on how you look at it...  :-D
> 
> The patches as they stand expect the user to take a "read" layout lease which
> indicates they are currently using "reading" the layout as is.
> They are not
> changing ("writing" to) the layout.

As I said in a another email in the thread, a layout lease does not
make the layout "read only". It just means the lease owner will be
notified when someone else is about to modify it. The lease owner
can modify the mapping themselves, and they will not get notified
about their own modifications.

> They then pin pages which locks parts of
> the layout and therefore they expect no "writers" to change the layout.

Except they can change the layout themselves. It's perfectly valid
to get a layout lease, write() from offset 0 to EOF and fsync() to
intiialise the file and allocate all the space in the file, then
mmap() it and hand to off to RMDA, all while holding the layout
lease.

> The "write" layout lease breaks the "read" layout lease indicating that the
> layout is being written to.

Layout leases do not work this way.

> In fact, this is what NFS does right now.  The lease it puts on the file is of
> "read" type.
> 
> nfs4layouts.c:
> static int
> nfsd4_layout_setlease(struct nfs4_layout_stateid *ls)
> {
> ...
>         fl->fl_flags = FL_LAYOUT;
>         fl->fl_type = F_RDLCK;
> ...
> }

Yes, the existing /implementation/ uses F_RDLCK, but that doesn't
mean the layout is "read only". Look at the pNFS mapping layout code
- the ->map_blocks export operation:

       int (*map_blocks)(struct inode *inode, loff_t offset,
                          u64 len, struct iomap *iomap,
                          bool write, u32 *device_generation);
                          ^^^^^^^^^^

Yup, it has a write variable that, when set, causes the filesystem
to _allocate_ blocks if the range to be written to falls over a hole
in the file.  IOWs, a pNFS layout lease can modify the file layout -
you're conflating use of a "read lock" API to mean that what the
lease _manages_ is "read only". That is not correct.

Layouts are /always writeable/ by the lease owner(s), the question
here is what we do with third parties attempting to modify a layout
covered by an "exclusive" layout lease. Hence, I'll repeat:

> > we're talking exaclty about a revokable layout lease vs an
> > exclusive layout lease here, and so read/write really doesn't match
> > the policy or semantics we are trying to provide.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
       [not found]                           ` <20190613152755.GI32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
@ 2019-06-20 14:52                             ` Jan Kara
  0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2019-06-20 14:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Theodore Ts'o,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jeff Layton, Dave Chinner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Jérôme Glisse, John Hubbard,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jan Kara,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Thu 13-06-19 08:27:55, Matthew Wilcox wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > e.g. Process A has an exclusive layout lease on file F. It does an
> > IO to file F. The filesystem IO path checks that Process A owns the
> > lease on the file and so skips straight through layout breaking
> > because it owns the lease and is allowed to modify the layout. It
> > then takes the inode metadata locks to allocate new space and write
> > new data.
> > 
> > Process B now tries to write to file F. The FS checks whether
> > Process B owns a layout lease on file F. It doesn't, so then it
> > tries to break the layout lease so the IO can proceed. The layout
> > breaking code sees that process A has an exclusive layout lease
> > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > break the lease and so the IO fails with -ETXTBSY.
> 
> This description doesn't match the behaviour that RDMA wants either.
> Even if Process A has a lease on the file, an IO from Process A which
> results in blocks being freed from the file is going to result in the
> RDMA device being able to write to blocks which are now freed (and
> potentially reallocated to another file).

I think you're partially wrong here. You are correct that the lease won't
stop process A from doing truncate on the file. *But* there are still page
pins in existence so truncate will block on waiting for these pins to go
away (after all this is a protection that guards all short-term page pin
users). So there is no problem with blocks being freed under the RDMA app.
Yes, the app will effectively deadlock and sysadmin has to kill it. IMO an
acceptable answer for doing something stupid and unsupportable...

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-06-20 14:52 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20190606014544.8339-1-ira.weiny@intel.com>
     [not found] ` <c559c2ce-50dc-d143-5741-fe3d21d0305c@nvidia.com>
2019-06-06 17:11   ` [PATCH RFC 00/10] RDMA/FS DAX truncate proposal Ira Weiny
2019-06-06 19:46     ` Jason Gunthorpe
     [not found] ` <20190606104203.GF7433@quack2.suse.cz>
2019-06-06 22:03   ` Ira Weiny
2019-06-06 22:26     ` Ira Weiny
2019-06-06 22:28     ` Dave Chinner
2019-06-07 11:04     ` Jan Kara
     [not found]       ` <20190607110426.GB12765-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2019-06-07 18:25         ` Ira Weiny
2019-06-07 18:50           ` Jason Gunthorpe
     [not found]           ` <20190607182534.GC14559-J5EW/p2F9lUlb2qzJQmr9q2pdiUAq4bhAL8bYrjMMd8@public.gmane.org>
2019-06-08  0:10             ` Dave Chinner
     [not found]               ` <20190608001036.GF14308-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
2019-06-09  1:29                 ` Ira Weiny
2019-06-12 12:37                 ` Matthew Wilcox
     [not found]                   ` <20190612123751.GD32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
2019-06-12 23:30                     ` Ira Weiny
     [not found]                       ` <20190612233024.GD14336-J5EW/p2F9lUlb2qzJQmr9q2pdiUAq4bhAL8bYrjMMd8@public.gmane.org>
2019-06-13  0:55                         ` Dave Chinner
     [not found]                           ` <20190613005552.GI14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
2019-06-13 20:34                             ` Ira Weiny
2019-06-14  3:42                               ` Dave Chinner
2019-06-13  0:25                     ` Dave Chinner
     [not found]                       ` <20190613002555.GH14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
2019-06-13  3:23                         ` Matthew Wilcox
     [not found]                           ` <20190613032320.GG32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
2019-06-13  4:36                             ` Dave Chinner
     [not found]                               ` <20190613043649.GJ14363-pA1nmv6sEBkOM8BvhN4Z8vybgvtCy99p@public.gmane.org>
2019-06-13 10:47                                 ` Matthew Wilcox
2019-06-13 15:29                           ` Jason Gunthorpe
2019-06-13 15:27                         ` Matthew Wilcox
2019-06-13 21:13                           ` Ira Weiny
2019-06-13 23:45                             ` Jason Gunthorpe
2019-06-14  0:00                               ` Ira Weiny
     [not found]                               ` <20190613234530.GK22901-uk2M96/98Pc@public.gmane.org>
2019-06-14  2:09                                 ` Dave Chinner
2019-06-14  2:31                                   ` Matthew Wilcox
     [not found]                                     ` <20190614023107.GK32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
2019-06-14  3:07                                       ` Dave Chinner
     [not found]                           ` <20190613152755.GI32656-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
2019-06-20 14:52                             ` Jan Kara
2019-06-13 20:34                         ` Ira Weiny
2019-06-14  2:58                           ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).