* [RFC] Dataloss on NFS-clients who modify an mmap()'d area after closing the file-descriptor
@ 2012-05-03 15:34 Niels de Vos
[not found] ` <4FA2A56B.1030208-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Niels de Vos @ 2012-05-03 15:34 UTC (permalink / raw)
To: linux-nfs-u79uwXL29TY76Z2rM5mHXA
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, David Howells,
Jeffrey Layton
When an application on an NFS-client (tested with NFSv3) executes the
following steps, data written after the close() is never flushed to the
server:
1. open()
2. mmap()
3. close()
4. <modify data in the mmap'ed area>
5. munmap()
Dropping the caches (via /proc/sys/vm/drop_caches) or unmounting does not
result in the data being sent to the server.
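A minimal reproducer for the steps above looks roughly like the sketch below (error handling omitted; the path is only an example and stands for any existing file of at least one page on an NFS mount):
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
int main(void)
{
	/* placeholder path: an existing file on an NFS mount */
	int fd = open("/mnt/nfs/testfile", O_RDWR);
	char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);                  /* 3. close() before touching the mapping */
	memcpy(map, "modified", 8); /* 4. modify the mmap()'d area */
	munmap(map, 4096);          /* 5. munmap(), no msync() anywhere */
	return 0;
}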
The man-page for mmap (man 2 mmap) does mention that closing the file-
descriptor does not munmap() the area. Using the mmap'ed area after a
close() sounds valid to me (even if it may be bad practice).
Investigation showed that the NFS-client does not handle munmap(), and
only flushes on close(). To solve this problem, at least two solutions
can be proposed:
a. f_ops->release() is called on munmap() as well as on close(),
therefore release() can be used to flush data as well.
b. Add a .close handler to the 'struct vm_operations_struct' that
   mmap()/nfs_file_mmap() installs on the 'struct vm_area_struct', and
   flush the data from that new close() function.
Solution a. currently needs very few code changes:
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -713,6 +713,8 @@ int nfs_open(struct inode *inode, struct file *filp)
int nfs_release(struct inode *inode, struct file *filp)
{
+	if (S_ISREG(inode->i_mode) && inode->i_mapping->nrpages != 0)
+		nfs_sync_mapping(inode->i_mapping);
nfs_file_clear_open_context(filp);
return 0;
}
The disadvantage is that nfs_release() is called on close() too, so the
dirty pages are flushed there, and just afterwards
nfs_file_clear_open_context() might flush again. The advantage is that
it would be possible (though not done at the moment) to return an error
in case flushing fails.
Solution b. does not provide a way to return an error, but it is not
called on every close():
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -547,9 +547,17 @@ out:
return ret;
}
+static void nfs_vm_close(struct vm_area_struct * vma)
+{
+ struct file *filp = vma->vm_file;
+
+ nfs_file_flush(filp, (fl_owner_t)filp);
+}
+
static const struct vm_operations_struct nfs_file_vm_ops = {
.fault = filemap_fault,
.page_mkwrite = nfs_vm_page_mkwrite,
+ .close = nfs_vm_close,
};
static int nfs_need_sync_write(struct file *filp, struct inode *inode)
I would like some feedback on which solution is most acceptable, or any
other suggestions.
Many thanks,
Niels
* Re: [RFC] Dataloss on NFS-clients who modify an mmap()'d area after closing the file-descriptor
[not found] ` <4FA2A56B.1030208-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-05-03 15:43 ` Myklebust, Trond
2012-05-03 17:07 ` Niels de Vos
0 siblings, 1 reply; 7+ messages in thread
From: Myklebust, Trond @ 2012-05-03 15:43 UTC (permalink / raw)
To: Niels de Vos
Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
David Howells, Jeffrey Layton
On Thu, 2012-05-03 at 17:34 +0200, Niels de Vos wrote:
> When an application on an NFS-client (tested with NFSv3) executes the
> following steps, data written after the close() is never flushed to the
> server:
>
> 1. open()
> 2. mmap()
> 3. close()
> 4. <modify data in the mmap'ed area>
> 5. munmap()
>
> Dropping the caches (via /proc/sys/vm/drop_caches) or unmounting does not
> result in the data being sent to the server.
>
> The man-page for mmap (man 2 mmap) does mention that closing the file-
> descriptor does not munmap() the area. Using the mmap'ed area after a
> close() sound valid to me (even if it may be bad practice).
>
> Investigation and checking showed that the NFS-client does not handle
> munmap(), and only flushes on close(). To solve this problem, least two
> solutions can be proposed:
>
> a. f_ops->release() is called on munmap() as well as on close(),
> therefore release() can be used to flush data as well.
> b. In the 'struct vm_operations_struct' add a .close to the
> 'struct vm_area_struct' on calling mmap()/nfs_file_mmap() and flush
> the data in the new close() function.
>
> Solution a. contains currently very few code changes:
>
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -713,6 +713,8 @@ int nfs_open(struct inode *inode, struct file *filp)
>
> int nfs_release(struct inode *inode, struct file *filp)
> {
> + if (S_ISREG(inode->i_mode) && inode->i_mapping->nrpages != 0) {
> + nfs_sync_mapping(inode->i_mapping);
> nfs_file_clear_open_context(filp);
> return 0;
> }
>
> The disadvantage is, that nfs_release() is called on close() too. That
> means this causes a flushing of dirty pages, and just after that the
> nfs_file_clear_open_context() might flush again. The advantage is that
> it is possible (though not done at the moment) to return an error in
> case flushing failed.
>
> Solution b. does not provide an option to return an error, but does not
> get called on each close():
>
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -547,9 +547,17 @@ out:
> return ret;
> }
>
> +static void nfs_vm_close(struct vm_area_struct * vma)
> +{
> + struct file *filp = vma->vm_file;
> +
> + nfs_file_flush(filp, (fl_owner_t)filp);
> +}
> +
> static const struct vm_operations_struct nfs_file_vm_ops = {
> .fault = filemap_fault,
> .page_mkwrite = nfs_vm_page_mkwrite,
> + .close = nfs_vm_close,
> };
>
> static int nfs_need_sync_write(struct file *filp, struct inode *inode)
>
> I would like some feedback on what solution is most acceptable, or any
> other suggestions.
Neither solution is acceptable. This isn't a close-to-open cache
consistency issue.
The syntax of mmap() for both block and NFS mounts is the same: writes
are not guaranteed to hit the disk until your application explicitly
calls msync().
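In other words, the portable pattern is roughly the following sketch (the helper name is just for illustration): msync(MS_SYNC) flushes the dirty pages of a modified MAP_SHARED mapping, and reports errors, before the region is torn down.
#include <stddef.h>
#include <sys/mman.h>
/* Sketch: flush a modified MAP_SHARED mapping explicitly before unmapping.
 * MS_SYNC waits until the write-out has completed. */
int flush_and_unmap(void *addr, size_t length)
{
	if (msync(addr, length, MS_SYNC) == -1)
		return -1;	/* flushing failed, errno is set */
	return munmap(addr, length);
}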
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: [RFC] Dataloss on NFS-clients who modify an mmap()'d area after closing the file-descriptor
2012-05-03 15:43 ` Myklebust, Trond
@ 2012-05-03 17:07 ` Niels de Vos
[not found] ` <4FA2BB68.7070304-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Niels de Vos @ 2012-05-03 17:07 UTC (permalink / raw)
To: Myklebust, Trond
Cc: linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
David Howells, Jeffrey Layton
On 05/03/2012 05:43 PM, Myklebust, Trond wrote:
> On Thu, 2012-05-03 at 17:34 +0200, Niels de Vos wrote:
>> When an application on an NFS-client (tested with NFSv3) executes the
>> following steps, data written after the close() is never flushed to the
>> server:
>>
>> 1. open()
>> 2. mmap()
>> 3. close()
>> 4.<modify data in the mmap'ed area>
>> 5. munmap()
>>
>> Dropping the caches (via /proc/sys/vm/drop_caches) or unmounting does not
>> result in the data being sent to the server.
>>
>> The man-page for mmap (man 2 mmap) does mention that closing the file-
>> descriptor does not munmap() the area. Using the mmap'ed area after a
>> close() sound valid to me (even if it may be bad practice).
>>
>> Investigation and checking showed that the NFS-client does not handle
>> munmap(), and only flushes on close(). To solve this problem, least two
>> solutions can be proposed:
>>
>> a. f_ops->release() is called on munmap() as well as on close(),
>> therefore release() can be used to flush data as well.
>> b. In the 'struct vm_operations_struct' add a .close to the
>> 'struct vm_area_struct' on calling mmap()/nfs_file_mmap() and flush
>> the data in the new close() function.
>>
>> Solution a. contains currently very few code changes:
>>
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -713,6 +713,8 @@ int nfs_open(struct inode *inode, struct file *filp)
>>
>> int nfs_release(struct inode *inode, struct file *filp)
>> {
>> + if (S_ISREG(inode->i_mode)&& inode->i_mapping->nrpages != 0) {
>> + nfs_sync_mapping(inode->i_mapping);
>> nfs_file_clear_open_context(filp);
>> return 0;
>> }
>>
>> The disadvantage is, that nfs_release() is called on close() too. That
>> means this causes a flushing of dirty pages, and just after that the
>> nfs_file_clear_open_context() might flush again. The advantage is that
>> it is possible (though not done at the moment) to return an error in
>> case flushing failed.
>>
>> Solution b. does not provide an option to return an error, but does not
>> get called on each close():
>>
>> --- a/fs/nfs/file.c
>> +++ b/fs/nfs/file.c
>> @@ -547,9 +547,17 @@ out:
>> return ret;
>> }
>>
>> +static void nfs_vm_close(struct vm_area_struct * vma)
>> +{
>> + struct file *filp = vma->vm_file;
>> +
>> + nfs_file_flush(filp, (fl_owner_t)filp);
>> +}
>> +
>> static const struct vm_operations_struct nfs_file_vm_ops = {
>> .fault = filemap_fault,
>> .page_mkwrite = nfs_vm_page_mkwrite,
>> + .close = nfs_vm_close,
>> };
>>
>> static int nfs_need_sync_write(struct file *filp, struct inode *inode)
>>
>> I would like some feedback on what solution is most acceptable, or any
>> other suggestions.
>
> Neither solution is acceptable. This isn't a close-to-open cache
> consistency issue.
>
> The syntax of mmap() for both block and NFS mounts is the same: writes
> are not guaranteed to hit the disk until your application explicitly
> calls msync().
>
Okay, that makes sense. But if the application never calls msync(), and
just munmap()'s the area, when should the changes be written? I did not
expect that unmounting just disregards the data.
Any suggestions on correcting this (if it should be corrected) are
welcome.
Thanks again,
Niels
* Re: [RFC] Dataloss on NFS-clients who modify an mmap()'d area after closing the file-descriptor
[not found] ` <4FA2BB68.7070304-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-05-03 17:26 ` Myklebust, Trond
[not found] ` <1336066014.5385.41.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Myklebust, Trond @ 2012-05-03 17:26 UTC (permalink / raw)
To: Niels de Vos
Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
David Howells, Jeffrey Layton
On Thu, 2012-05-03 at 19:07 +0200, Niels de Vos wrote:
> On 05/03/2012 05:43 PM, Myklebust, Trond wrote:
> > On Thu, 2012-05-03 at 17:34 +0200, Niels de Vos wrote:
> >> When an application on an NFS-client (tested with NFSv3) executes the
> >> following steps, data written after the close() is never flushed to the
> >> server:
> >>
> >> 1. open()
> >> 2. mmap()
> >> 3. close()
> >> 4.<modify data in the mmap'ed area>
> >> 5. munmap()
> >>
> >> Dropping the caches (via /proc/sys/vm/drop_caches) or unmounting does not
> >> result in the data being sent to the server.
> >>
> >> The man-page for mmap (man 2 mmap) does mention that closing the file-
> >> descriptor does not munmap() the area. Using the mmap'ed area after a
> >> close() sound valid to me (even if it may be bad practice).
> >>
> >> Investigation and checking showed that the NFS-client does not handle
> >> munmap(), and only flushes on close(). To solve this problem, least two
> >> solutions can be proposed:
> >>
> >> a. f_ops->release() is called on munmap() as well as on close(),
> >> therefore release() can be used to flush data as well.
> >> b. In the 'struct vm_operations_struct' add a .close to the
> >> 'struct vm_area_struct' on calling mmap()/nfs_file_mmap() and flush
> >> the data in the new close() function.
> >>
> >> Solution a. contains currently very few code changes:
> >>
> >> --- a/fs/nfs/inode.c
> >> +++ b/fs/nfs/inode.c
> >> @@ -713,6 +713,8 @@ int nfs_open(struct inode *inode, struct file *filp)
> >>
> >> int nfs_release(struct inode *inode, struct file *filp)
> >> {
> >> + if (S_ISREG(inode->i_mode)&& inode->i_mapping->nrpages != 0) {
> >> + nfs_sync_mapping(inode->i_mapping);
> >> nfs_file_clear_open_context(filp);
> >> return 0;
> >> }
> >>
> >> The disadvantage is, that nfs_release() is called on close() too. That
> >> means this causes a flushing of dirty pages, and just after that the
> >> nfs_file_clear_open_context() might flush again. The advantage is that
> >> it is possible (though not done at the moment) to return an error in
> >> case flushing failed.
> >>
> >> Solution b. does not provide an option to return an error, but does not
> >> get called on each close():
> >>
> >> --- a/fs/nfs/file.c
> >> +++ b/fs/nfs/file.c
> >> @@ -547,9 +547,17 @@ out:
> >> return ret;
> >> }
> >>
> >> +static void nfs_vm_close(struct vm_area_struct * vma)
> >> +{
> >> + struct file *filp = vma->vm_file;
> >> +
> >> + nfs_file_flush(filp, (fl_owner_t)filp);
> >> +}
> >> +
> >> static const struct vm_operations_struct nfs_file_vm_ops = {
> >> .fault = filemap_fault,
> >> .page_mkwrite = nfs_vm_page_mkwrite,
> >> + .close = nfs_vm_close,
> >> };
> >>
> >> static int nfs_need_sync_write(struct file *filp, struct inode *inode)
> >>
> >> I would like some feedback on what solution is most acceptable, or any
> >> other suggestions.
> >
> > Neither solution is acceptable. This isn't a close-to-open cache
> > consistency issue.
> >
> > The syntax of mmap() for both block and NFS mounts is the same: writes
> > are not guaranteed to hit the disk until your application explicitly
> > calls msync().
> >
>
> Okay, that makes sense. But if the application never calls msync(), and
> just munmap()'s the area, when should the changes be written? I did not
> expect that unmounting just disregards the data.
That suggests that the VM is failing to dirty the pages on munmap()
before releasing the vma->vm_file. If so, then that would be a VM bug...
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: [RFC] Dataloss on NFS-clients who modify an mmap()'d area after closing the file-descriptor
[not found] ` <1336066014.5385.41.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>
@ 2012-05-04 16:03 ` Niels de Vos
2012-05-04 18:29 ` Myklebust, Trond
0 siblings, 1 reply; 7+ messages in thread
From: Niels de Vos @ 2012-05-04 16:03 UTC (permalink / raw)
To: Myklebust, Trond
Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
David Howells, Jeffrey Layton
On 05/03/2012 07:26 PM, Myklebust, Trond wrote:
> On Thu, 2012-05-03 at 19:07 +0200, Niels de Vos wrote:
>> On 05/03/2012 05:43 PM, Myklebust, Trond wrote:
>> > On Thu, 2012-05-03 at 17:34 +0200, Niels de Vos wrote:
>> >> When an application on an NFS-client (tested with NFSv3) executes the
>> >> following steps, data written after the close() is never flushed to the
>> >> server:
>> >>
>> >> 1. open()
>> >> 2. mmap()
>> >> 3. close()
>> >> 4.<modify data in the mmap'ed area>
>> >> 5. munmap()
>> >>
>> >> Dropping the caches (via /proc/sys/vm/drop_caches) or unmounting does not
>> >> result in the data being sent to the server.
>> >>
>> >> The man-page for mmap (man 2 mmap) does mention that closing the file-
>> >> descriptor does not munmap() the area. Using the mmap'ed area after a
>> >> close() sound valid to me (even if it may be bad practice).
>> >>
>> >> Investigation and checking showed that the NFS-client does not handle
>> >> munmap(), and only flushes on close(). To solve this problem, least two
>> >> solutions can be proposed:
>> >>
>> >> a. f_ops->release() is called on munmap() as well as on close(),
>> >> therefore release() can be used to flush data as well.
>> >> b. In the 'struct vm_operations_struct' add a .close to the
>> >> 'struct vm_area_struct' on calling mmap()/nfs_file_mmap() and flush
>> >> the data in the new close() function.
>> >>
>> >> Solution a. contains currently very few code changes:
>> >>
>> >> --- a/fs/nfs/inode.c
>> >> +++ b/fs/nfs/inode.c
>> >> @@ -713,6 +713,8 @@ int nfs_open(struct inode *inode, struct file *filp)
>> >>
>> >> int nfs_release(struct inode *inode, struct file *filp)
>> >> {
>> >> + if (S_ISREG(inode->i_mode)&& inode->i_mapping->nrpages != 0) {
>> >> + nfs_sync_mapping(inode->i_mapping);
>> >> nfs_file_clear_open_context(filp);
>> >> return 0;
>> >> }
>> >>
>> >> The disadvantage is, that nfs_release() is called on close() too. That
>> >> means this causes a flushing of dirty pages, and just after that the
>> >> nfs_file_clear_open_context() might flush again. The advantage is that
>> >> it is possible (though not done at the moment) to return an error in
>> >> case flushing failed.
>> >>
>> >> Solution b. does not provide an option to return an error, but does not
>> >> get called on each close():
>> >>
>> >> --- a/fs/nfs/file.c
>> >> +++ b/fs/nfs/file.c
>> >> @@ -547,9 +547,17 @@ out:
>> >> return ret;
>> >> }
>> >>
>> >> +static void nfs_vm_close(struct vm_area_struct * vma)
>> >> +{
>> >> + struct file *filp = vma->vm_file;
>> >> +
>> >> + nfs_file_flush(filp, (fl_owner_t)filp);
>> >> +}
>> >> +
>> >> static const struct vm_operations_struct nfs_file_vm_ops = {
>> >> .fault = filemap_fault,
>> >> .page_mkwrite = nfs_vm_page_mkwrite,
>> >> + .close = nfs_vm_close,
>> >> };
>> >>
>> >> static int nfs_need_sync_write(struct file *filp, struct inode *inode)
>> >>
>> >> I would like some feedback on what solution is most acceptable, or any
>> >> other suggestions.
>> >
>> > Neither solution is acceptable. This isn't a close-to-open cache
>> > consistency issue.
>> >
>> > The syntax of mmap() for both block and NFS mounts is the same: writes
>> > are not guaranteed to hit the disk until your application explicitly
>> > calls msync().
>> >
>>
>> Okay, that makes sense. But if the application never calls msync(), and
>> just munmap()'s the area, when should the changes be written? I did not
>> expect that unmounting just disregards the data.
>
> That suggests that the VM is failing to dirty the pages on munmap()
> before releasing the vma->vm_file. If so, then that would be a VM bug...
>
I've checked if the VM tags the pages as dirty:
- f_ops->release() is called on munmap(). An added printk there shows
  that inode->i_state is set to I_DIRTY_PAGE.
- mapping_tagged(filp->f_mapping, PAGECACHE_TAG_DIRTY) also returns true.
From my understanding this is what the VM is expected to do, and the
pages are marked dirty correctly.
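For reference, the check amounts to something like this local debugging hack in nfs_release() (illustrative only, not a proposed change):
	/* illustrative debugging aid: confirm the VM left the mapping dirty
	 * by the time f_ops->release() runs for the munmap() */
	printk(KERN_DEBUG "nfs_release: i_state=0x%lx dirty-tagged=%d\n",
	       inode->i_state,
	       mapping_tagged(filp->f_mapping, PAGECACHE_TAG_DIRTY));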
However, nfs_inode->ndirty and nfs_inode->ncommit are both 0. It is
unclear to me how the VM is supposed to interact with the nfs_inode.
Some clarification, or a suggestion on what to look into, would be much
appreciated.
Cheers,
Niels
* Re: [RFC] Dataloss on NFS-clients who modify an mmap()'d area after closing the file-descriptor
2012-05-04 16:03 ` Niels de Vos
@ 2012-05-04 18:29 ` Myklebust, Trond
[not found] ` <1336156178.2582.15.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Myklebust, Trond @ 2012-05-04 18:29 UTC (permalink / raw)
To: Niels de Vos
Cc: linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
David Howells, Jeffrey Layton
On Fri, 2012-05-04 at 18:03 +0200, Niels de Vos wrote:
> On 05/03/2012 07:26 PM, Myklebust, Trond wrote:
> > On Thu, 2012-05-03 at 19:07 +0200, Niels de Vos wrote:
> >> On 05/03/2012 05:43 PM, Myklebust, Trond wrote:
> >> > On Thu, 2012-05-03 at 17:34 +0200, Niels de Vos wrote:
> >> >> When an application on an NFS-client (tested with NFSv3) executes the
> >> >> following steps, data written after the close() is never flushed to the
> >> >> server:
> >> >>
> >> >> 1. open()
> >> >> 2. mmap()
> >> >> 3. close()
> >> >> 4.<modify data in the mmap'ed area>
> >> >> 5. munmap()
> >> >>
> >> >> Dropping the caches (via /proc/sys/vm/drop_caches) or unmounting does not
> >> >> result in the data being sent to the server.
> >> >>
> >> >> The man-page for mmap (man 2 mmap) does mention that closing the file-
> >> >> descriptor does not munmap() the area. Using the mmap'ed area after a
> >> >> close() sound valid to me (even if it may be bad practice).
> >> >>
> >> >> Investigation and checking showed that the NFS-client does not handle
> >> >> munmap(), and only flushes on close(). To solve this problem, least two
> >> >> solutions can be proposed:
> >> >>
> >> >> a. f_ops->release() is called on munmap() as well as on close(),
> >> >> therefore release() can be used to flush data as well.
> >> >> b. In the 'struct vm_operations_struct' add a .close to the
> >> >> 'struct vm_area_struct' on calling mmap()/nfs_file_mmap() and flush
> >> >> the data in the new close() function.
> >> >>
> >> >> Solution a. contains currently very few code changes:
> >> >>
> >> >> --- a/fs/nfs/inode.c
> >> >> +++ b/fs/nfs/inode.c
> >> >> @@ -713,6 +713,8 @@ int nfs_open(struct inode *inode, struct file *filp)
> >> >>
> >> >> int nfs_release(struct inode *inode, struct file *filp)
> >> >> {
> >> >> + if (S_ISREG(inode->i_mode)&& inode->i_mapping->nrpages != 0) {
> >> >> + nfs_sync_mapping(inode->i_mapping);
> >> >> nfs_file_clear_open_context(filp);
> >> >> return 0;
> >> >> }
> >> >>
> >> >> The disadvantage is, that nfs_release() is called on close() too. That
> >> >> means this causes a flushing of dirty pages, and just after that the
> >> >> nfs_file_clear_open_context() might flush again. The advantage is that
> >> >> it is possible (though not done at the moment) to return an error in
> >> >> case flushing failed.
> >> >>
> >> >> Solution b. does not provide an option to return an error, but does not
> >> >> get called on each close():
> >> >>
> >> >> --- a/fs/nfs/file.c
> >> >> +++ b/fs/nfs/file.c
> >> >> @@ -547,9 +547,17 @@ out:
> >> >> return ret;
> >> >> }
> >> >>
> >> >> +static void nfs_vm_close(struct vm_area_struct * vma)
> >> >> +{
> >> >> + struct file *filp = vma->vm_file;
> >> >> +
> >> >> + nfs_file_flush(filp, (fl_owner_t)filp);
> >> >> +}
> >> >> +
> >> >> static const struct vm_operations_struct nfs_file_vm_ops = {
> >> >> .fault = filemap_fault,
> >> >> .page_mkwrite = nfs_vm_page_mkwrite,
> >> >> + .close = nfs_vm_close,
> >> >> };
> >> >>
> >> >> static int nfs_need_sync_write(struct file *filp, struct inode *inode)
> >> >>
> >> >> I would like some feedback on what solution is most acceptable, or any
> >> >> other suggestions.
> >> >
> >> > Neither solution is acceptable. This isn't a close-to-open cache
> >> > consistency issue.
> >> >
> >> > The syntax of mmap() for both block and NFS mounts is the same: writes
> >> > are not guaranteed to hit the disk until your application explicitly
> >> > calls msync().
> >> >
> >>
> >> Okay, that makes sense. But if the application never calls msync(), and
> >> just munmap()'s the area, when should the changes be written? I did not
> >> expect that unmounting just disregards the data.
> >
> > That suggests that the VM is failing to dirty the pages on munmap()
> > before releasing the vma->vm_file. If so, then that would be a VM bug...
> >
>
> I've checked if the VM tags the pages as dirty:
> - f_ops->release() is called on munmap(). An added printk there, shows
> that inode->i_state is set to I_DIRTY_PAGE.
> - mapping_tagged(filp->f_mapping, PAGECACHE_TAG_DIRTY) also returns true
>
> From my understanding this is what the VM is expected to do, and the
> pages are marked dirty correctly.
>
> However, nfs_inode->ndirty and nfs_inode->ncommit are both 0. It is
> unclear to me how the VM is supposed to interact with the nfs_inode.
> Some clarification or suggestion what to look into would be much
> appreciated.
The first time the page is touched, it will trigger a ->page_mkwrite(),
which in the case of NFS will set up the necessary tracking structures
to ensure that the page is written out using the correct credentials
etc. In the case of NFSv4, it will also ensure that the file doesn't get
closed on the server until the page is written out to disk.
When the page is cleaned (i.e. something calls clear_page_dirty_for_io()
as part of a write to disk), the call to page_mkclean() is supposed to
re-write-protect the pte, ensuring that any future changes will
re-trigger pg_mkwrite().
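As an illustration of that cycle, a simplified generic sketch of such a callback (not the actual nfs_vm_page_mkwrite() implementation; names are illustrative) could look like this:
/* Sketch: the first store to a clean, write-protected page faults into
 * .page_mkwrite, letting the filesystem set up its write tracking before
 * the page is dirtied. Once writeback cleans the page, page_mkclean()
 * write-protects the pte again, so the next store re-triggers this. */
static int example_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	lock_page(page);
	/* filesystem-specific: attach credentials / request tracking here */
	set_page_dirty(page);
	return VM_FAULT_LOCKED;	/* page is returned locked to the fault handler */
}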
You should be able to check if/when nfs_vm_page_mkwrite() is triggered
using 'rpcdebug -m nfs -s pagecache' to turn on the NFS page cache
debugging printks.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: [RFC] Dataloss on NFS-clients who modify an mmap()'d area after closing the file-descriptor
[not found] ` <1336156178.2582.15.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>
@ 2012-05-07 8:49 ` Niels de Vos
0 siblings, 0 replies; 7+ messages in thread
From: Niels de Vos @ 2012-05-07 8:49 UTC (permalink / raw)
To: Myklebust, Trond
Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
David Howells, Jeffrey Layton
On 05/04/2012 08:29 PM, Myklebust, Trond wrote:
> On Fri, 2012-05-04 at 18:03 +0200, Niels de Vos wrote:
>> On 05/03/2012 07:26 PM, Myklebust, Trond wrote:
>> > On Thu, 2012-05-03 at 19:07 +0200, Niels de Vos wrote:
>> >> On 05/03/2012 05:43 PM, Myklebust, Trond wrote:
>> >> > On Thu, 2012-05-03 at 17:34 +0200, Niels de Vos wrote:
>> >> >> When an application on an NFS-client (tested with NFSv3) executes the
>> >> >> following steps, data written after the close() is never flushed to the
>> >> >> server:
>> >> >>
>> >> >> 1. open()
>> >> >> 2. mmap()
>> >> >> 3. close()
>> >> >> 4.<modify data in the mmap'ed area>
>> >> >> 5. munmap()
>> >> >>
>> >> >> Dropping the caches (via /proc/sys/vm/drop_caches) or unmounting does not
>> >> >> result in the data being sent to the server.
>> >> >>
>> >> >> The man-page for mmap (man 2 mmap) does mention that closing the file-
>> >> >> descriptor does not munmap() the area. Using the mmap'ed area after a
>> >> >> close() sound valid to me (even if it may be bad practice).
>> >> >>
>> >> >> Investigation and checking showed that the NFS-client does not handle
>> >> >> munmap(), and only flushes on close(). To solve this problem, least two
>> >> >> solutions can be proposed:
>> >> >>
>> >> >> a. f_ops->release() is called on munmap() as well as on close(),
>> >> >> therefore release() can be used to flush data as well.
>> >> >> b. In the 'struct vm_operations_struct' add a .close to the
>> >> >> 'struct vm_area_struct' on calling mmap()/nfs_file_mmap() and flush
>> >> >> the data in the new close() function.
>> >> >>
>> >> >> Solution a. contains currently very few code changes:
>> >> >>
>> >> >> --- a/fs/nfs/inode.c
>> >> >> +++ b/fs/nfs/inode.c
>> >> >> @@ -713,6 +713,8 @@ int nfs_open(struct inode *inode, struct file *filp)
>> >> >>
>> >> >> int nfs_release(struct inode *inode, struct file *filp)
>> >> >> {
>> >> >> + if (S_ISREG(inode->i_mode)&& inode->i_mapping->nrpages != 0) {
>> >> >> + nfs_sync_mapping(inode->i_mapping);
>> >> >> nfs_file_clear_open_context(filp);
>> >> >> return 0;
>> >> >> }
>> >> >>
>> >> >> The disadvantage is, that nfs_release() is called on close() too. That
>> >> >> means this causes a flushing of dirty pages, and just after that the
>> >> >> nfs_file_clear_open_context() might flush again. The advantage is that
>> >> >> it is possible (though not done at the moment) to return an error in
>> >> >> case flushing failed.
>> >> >>
>> >> >> Solution b. does not provide an option to return an error, but does not
>> >> >> get called on each close():
>> >> >>
>> >> >> --- a/fs/nfs/file.c
>> >> >> +++ b/fs/nfs/file.c
>> >> >> @@ -547,9 +547,17 @@ out:
>> >> >> return ret;
>> >> >> }
>> >> >>
>> >> >> +static void nfs_vm_close(struct vm_area_struct * vma)
>> >> >> +{
>> >> >> + struct file *filp = vma->vm_file;
>> >> >> +
>> >> >> + nfs_file_flush(filp, (fl_owner_t)filp);
>> >> >> +}
>> >> >> +
>> >> >> static const struct vm_operations_struct nfs_file_vm_ops = {
>> >> >> .fault = filemap_fault,
>> >> >> .page_mkwrite = nfs_vm_page_mkwrite,
>> >> >> + .close = nfs_vm_close,
>> >> >> };
>> >> >>
>> >> >> static int nfs_need_sync_write(struct file *filp, struct inode *inode)
>> >> >>
>> >> >> I would like some feedback on what solution is most acceptable, or any
>> >> >> other suggestions.
>> >> >
>> >> > Neither solution is acceptable. This isn't a close-to-open cache
>> >> > consistency issue.
>> >> >
>> >> > The syntax of mmap() for both block and NFS mounts is the same: writes
>> >> > are not guaranteed to hit the disk until your application explicitly
>> >> > calls msync().
>> >> >
>> >>
>> >> Okay, that makes sense. But if the application never calls msync(), and
>> >> just munmap()'s the area, when should the changes be written? I did not
>> >> expect that unmounting just disregards the data.
>> >
>> > That suggests that the VM is failing to dirty the pages on munmap()
>> > before releasing the vma->vm_file. If so, then that would be a VM bug...
>> >
>>
>> I've checked if the VM tags the pages as dirty:
>> - f_ops->release() is called on munmap(). An added printk there, shows
>> that inode->i_state is set to I_DIRTY_PAGE.
>> - mapping_tagged(filp->f_mapping, PAGECACHE_TAG_DIRTY) also returns true
>>
>> From my understanding this is what the VM is expected to do, and the
>> pages are marked dirty correctly.
>>
>> However, nfs_inode->ndirty and nfs_inode->ncommit are both 0. It is
>> unclear to me how the VM is supposed to interact with the nfs_inode.
>> Some clarification or suggestion what to look into would be much
>> appreciated.
>
> The first time the page is touched, it will to trigger a ->pg_mkwrite(),
> which in the case of NFS will set up the necessary tracking structures
> to ensure that the page is written out using the correct credentials
> etc. In the case of NFSv4, it will also ensure that the file doesn't get
> closed on the server until the page is written out to disk.
>
> When the page is cleaned (i.e. something calls clear_page_dirty_for_io()
> as part of a write to disk), the call to page_mkclean() is supposed to
> re-write-protect the pte, ensuring that any future changes will
> re-trigger pg_mkwrite().
>
> You should be able to check if/when nfs_vm_page_mkwrite() is triggered
> using 'rpcdebug -m nfs -s pagecache' to turn on the NFS page cache
> debugging printks.
>
Many thanks for the explanation! At the moment I'm a little uncertain
where the problem lies, as it does not occur with more recent kernels
anymore. There was likely some invalid testing on my side :-/
I will have to hunt down what changes were made and how they affect
writing to mmap()'d files. The explanation you gave helps a lot in
understanding how NFS handles all of this.
Thanks again, and sorry for any confusion,
Niels