public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* nfs MAP_SHARED corruption fix
@ 2001-05-08 14:00 Andrea Arcangeli
  2001-05-08 15:21 ` Trond Myklebust
  0 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2001-05-08 14:00 UTC (permalink / raw)
  To: linux-kernel

This fixes corruption with MAP_SHARED on top of nfs filesystem in 2.4:

--- 2.4.5pre1aa2/fs/nfs/write.c.~1~	Tue May  1 19:35:29 2001
+++ 2.4.5pre1aa2/fs/nfs/write.c	Tue May  8 02:04:15 2001
@@ -1533,6 +1533,7 @@
 	if (!inode && file)
 		inode = file->f_dentry->d_inode;
 
+	filemap_fdatasync(inode->i_mapping);
 	do {
 		error = 0;
 		if (wait)

Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-08 14:00 nfs MAP_SHARED corruption fix Andrea Arcangeli
@ 2001-05-08 15:21 ` Trond Myklebust
  2001-05-08 15:38   ` Kurt Garloff
  2001-05-09  2:48   ` Andrea Arcangeli
  0 siblings, 2 replies; 15+ messages in thread
From: Trond Myklebust @ 2001-05-08 15:21 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

>>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:

     > This fixes corruption with MAP_SHARED on top of nfs filesystem
     > in 2.4:
     > --- 2.4.5pre1aa2/fs/nfs/write.c.~1~ Tue May 1 19:35:29 2001
     > +++ 2.4.5pre1aa2/fs/nfs/write.c Tue May 8 02:04:15 2001
     > @@ -1533,6 +1533,7 @@
     >  	if (!inode && file)
     >  		inode = file->f_dentry->d_inode;
 
     > + filemap_fdatasync(inode->i_mapping);
     >  	do {
     >  		error = 0; if (wait)

Yech! Apart from the fact that this means you do a full fdatasync()
even when you are simply trying to flush out single pages,
nfs_sync_file() gets called all over the place including in areas
where we know we're already holding a page lock.


AFAICs this fix will clearly deadlock...

Could you instead detail exactly which corruption problem you are
trying to fix?

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-08 15:21 ` Trond Myklebust
@ 2001-05-08 15:38   ` Kurt Garloff
  2001-05-09  7:30     ` Trond Myklebust
  2001-05-09  2:48   ` Andrea Arcangeli
  1 sibling, 1 reply; 15+ messages in thread
From: Kurt Garloff @ 2001-05-08 15:38 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrea Arcangeli, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 816 bytes --]

On Tue, May 08, 2001 at 05:21:02PM +0200, Trond Myklebust wrote:
> Could you instead detail exactly which corruption problem you are
> trying to fix?

int fd = open (name, O_RDWR);
char* adr = (char*) mmap (0, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
/* write to *adr through *(ard+len-1) */
/* Try adding here: msync (adr, len, MS_SYNC); */
munmap (adr, len);
close (fd);

The code works on files on local harddisks and on NFS volumes on a 2.2
kernel, but breaks on NFS drives on a 2.4.4 kernel.
msync() works around the bug.
Andrea's patch did help as well.

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, FRG                               SCSI, Security

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-08 15:21 ` Trond Myklebust
  2001-05-08 15:38   ` Kurt Garloff
@ 2001-05-09  2:48   ` Andrea Arcangeli
  2001-05-09  7:00     ` Trond Myklebust
  1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2001-05-09  2:48 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel

On Tue, May 08, 2001 at 05:21:02PM +0200, Trond Myklebust wrote:
> AFAICs this fix will clearly deadlock...

yeah, it didn't triggered because it probably needs to be the same page
writepaged and in the dirty list at the same time. I hooked it very deep
into the writeback logic to keep it generic (it wasn't going to add a
significant overhead) but it didn't need to be _that_ deep.

Even worse I think it was partly wrong because it was only in the
close(2) path but not in the fput path that is the one walked by munmap.

This looks better to me, what do you think?

diff -urN ref/fs/nfs/file.c nfs-corruption/fs/nfs/file.c
--- ref/fs/nfs/file.c	Thu Feb 22 03:45:10 2001
+++ nfs-corruption/fs/nfs/file.c	Tue May  8 19:11:57 2001
@@ -39,6 +39,7 @@
 static ssize_t nfs_file_write(struct file *, const char *, size_t, loff_t *);
 static int  nfs_file_flush(struct file *);
 static int  nfs_fsync(struct file *, struct dentry *dentry, int datasync);
+static void nfs_file_close_vma(struct vm_area_struct *);
 
 struct file_operations nfs_file_operations = {
 	read:		nfs_file_read,
@@ -57,6 +58,11 @@
 	setattr:	nfs_notify_change,
 };
 
+static struct vm_operations_struct nfs_file_vm_ops = {
+	nopage:		filemap_nopage,
+	close:		nfs_file_close_vma,
+};
+
 /* Hack for future NFS swap support */
 #ifndef IS_SWAPFILE
 # define IS_SWAPFILE(inode)	(0)
@@ -104,6 +110,20 @@
 	return result;
 }
 
+static void nfs_file_close_vma(struct vm_area_struct * vma)
+{
+	struct inode * inode;
+
+	inode = vma->vm_file->f_dentry->d_inode;
+
+	if (inode->i_state & I_DIRTY_PAGES) {
+		filemap_fdatasync(inode->i_mapping);
+		lock_kernel();
+		nfs_wb_file(inode, vma->vm_file);
+		unlock_kernel();
+	}
+}
+
 static int
 nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
 {
@@ -115,8 +135,11 @@
 		dentry->d_parent->d_name.name, dentry->d_name.name);
 
 	status = nfs_revalidate_inode(NFS_SERVER(inode), inode);
-	if (!status)
+	if (!status) {
 		status = generic_file_mmap(file, vma);
+		if (!status)
+			vma->vm_ops = &nfs_file_vm_ops;
+	}
 	return status;
 }
 

Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-09  2:48   ` Andrea Arcangeli
@ 2001-05-09  7:00     ` Trond Myklebust
  0 siblings, 0 replies; 15+ messages in thread
From: Trond Myklebust @ 2001-05-09  7:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

>>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:

     > On Tue, May 08, 2001 at 05:21:02PM +0200, Trond Myklebust
     > wrote:
    >> AFAICs this fix will clearly deadlock...

     > yeah, it didn't triggered because it probably needs to be the
     > same page writepaged and in the dirty list at the same time. I
     > hooked it very deep into the writeback logic to keep it generic
     > (it wasn't going to add a significant overhead) but it didn't
     > need to be _that_ deep.

     > Even worse I think it was partly wrong because it was only in
     > the close(2) path but not in the fput path that is the one
     > walked by munmap.

     > This looks better to me, what do you think?

Just 2 comments.

 - You should use nfs_wb_all() here rather than nfs_wb_file() since
writepage() unfortunately can't initialize the NFS writeback structure
with the correct vma->vm_file.

 - Are we allowed to protect the filemap_fdatasync() + nfs_wb_all() by
grabbing the inode->i_sem in nfs_file_close_vma()? If so, this should
be done as well.

 Otherwise, that patch looked fine to me... I'll test it out...

Cheers,
   Trond

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-08 15:38   ` Kurt Garloff
@ 2001-05-09  7:30     ` Trond Myklebust
  2001-05-09 13:25       ` Andrea Arcangeli
  2001-05-09 22:02       ` Marcelo Tosatti
  0 siblings, 2 replies; 15+ messages in thread
From: Trond Myklebust @ 2001-05-09  7:30 UTC (permalink / raw)
  To: Kurt Garloff, Andrea Arcangeli; +Cc: Linux Kernel


In addition to the two changes I proposed to Andrea's new patch, I
also realized we might want to do a fdatasync() when locking files. If
we don't, then locking won't be atomic on mmap()...

Here therefore is Andrea's patch with the changes I propose. Opinions?

Cheers,
  Trond

diff -u --recursive --new-file linux-2.4.4-fixes/fs/nfs/file.c linux-2.4.4-mmap/fs/nfs/file.c
--- linux-2.4.4/fs/nfs/file.c	Fri Feb  9 20:29:44 2001
+++ linux-2.4.4-mmap/fs/nfs/file.c	Wed May  9 09:18:45 2001
@@ -39,6 +39,7 @@
 static ssize_t nfs_file_write(struct file *, const char *, size_t, loff_t *);
 static int  nfs_file_flush(struct file *);
 static int  nfs_fsync(struct file *, struct dentry *dentry, int datasync);
+static void nfs_file_close_vma(struct vm_area_struct *);
 
 struct file_operations nfs_file_operations = {
 	read:		nfs_file_read,
@@ -57,6 +58,11 @@
 	setattr:	nfs_notify_change,
 };
 
+static struct vm_operations_struct nfs_file_vm_ops = {
+	nopage:		filemap_nopage,
+	close:		nfs_file_close_vma,
+};
+
 /* Hack for future NFS swap support */
 #ifndef IS_SWAPFILE
 # define IS_SWAPFILE(inode)	(0)
@@ -104,6 +110,20 @@
 	return result;
 }
 
+static void nfs_file_close_vma(struct vm_area_struct * vma)
+{
+	struct inode * inode;
+
+	inode = vma->vm_file->f_dentry->d_inode;
+
+	if (inode->i_state & I_DIRTY_PAGES) {
+		down(&inode->i_sem);
+		filemap_fdatasync(inode->i_mapping);
+		nfs_wb_all(inode);
+		up(&inode->i_sem);
+	}
+}
+
 static int
 nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
 {
@@ -115,8 +135,11 @@
 		dentry->d_parent->d_name.name, dentry->d_name.name);
 
 	status = nfs_revalidate_inode(NFS_SERVER(inode), inode);
-	if (!status)
+	if (!status) {
 		status = generic_file_mmap(file, vma);
+		if (!status)
+			vma->vm_ops = &nfs_file_vm_ops;
+	}
 	return status;
 }
 
@@ -283,9 +306,10 @@
 	 * Flush all pending writes before doing anything
 	 * with locks..
 	 */
-	down(&filp->f_dentry->d_inode->i_sem);
+	down(&inode->i_sem);
+	filemap_fdatasync(inode->i_mapping);
 	status = nfs_wb_all(inode);
-	up(&filp->f_dentry->d_inode->i_sem);
+	up(&inode->i_sem);
 	if (status < 0)
 		return status;
 
@@ -300,10 +324,11 @@
 	 */
  out_ok:
 	if ((cmd == F_SETLK || cmd == F_SETLKW) && fl->fl_type != F_UNLCK) {
-		down(&filp->f_dentry->d_inode->i_sem);
+		down(&inode->i_sem);
+		filemap_fdatasync(inode->i_mapping);
 		nfs_wb_all(inode);      /* we may have slept */
 		nfs_zap_caches(inode);
-		up(&filp->f_dentry->d_inode->i_sem);
+		up(&inode->i_sem);
 	}
 	return status;
 }

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
@ 2001-05-09 10:55 Kurt Garloff
  0 siblings, 0 replies; 15+ messages in thread
From: Kurt Garloff @ 2001-05-09 10:55 UTC (permalink / raw)
  To: Andrea Arcangeli, Trond Myklebust; +Cc: Linux kernel list


[-- Attachment #1.1: Type: text/plain, Size: 544 bytes --]

Hi Andrea, Trond,

the demo for the NFS SHARED_MAP corruption:

garloff@daubechies:~/C $ uname -sr
Linux 2.4.4
garloff@daubechies:~/C $ ./test_nfs_shared_map ; head -1 ./testfile; sleep 10; head -1 ./testfile
Linux NFS rocks.
Linux NFS sucks.

Sources attached. I still have to test your fix, Trond.

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, FRG                               SCSI, Security

[-- Attachment #1.2: test_nfs_shared_map.c --]
[-- Type: text/plain, Size: 1197 bytes --]

/** test_nfs_shared_map.c
 *
 * Creates a file, expands it by ftruncate, mmaps it, writes
 * to the mapped memory, unmaps it and closes the file again.
 * 
 * This triggers a bug in 2.4.4 NFS client code: The file won't
 * contain the data written to the mapped memory.
 * 
 * (c) Kurt Garloff <garloff@suse.de>, 2001-05-09, GNU GPL
 */

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <errno.h>


const char * const bad  = "Linux NFS sucks.\n";
const char * const good = "Linux NFS rocks.\n";
const char * const name = "testfile";

#define LEN 4096

int die (const char* const txt)
{
	perror (txt);
	exit (errno);
}

int main ()
{
	char* adr; int err;
	int fd = open (name, O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd <= 0) die ("create testfile");
	err = write (fd, bad, strlen (bad));
	close (fd); 
	truncate (name, LEN);
	sync ();
	fd = open (name, O_RDWR);
	if (fd <= 0) die ("open testfile");
	adr = (char*) mmap (0, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (!adr) die ("mmap failed");
	strcpy (adr, good);
	strcpy (adr+32, good);
#ifdef NEED_MSYNC
	msync (adr, LEN, MS_SYNC);
#endif
	munmap (adr, LEN);
	close (fd);
}

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-09  7:30     ` Trond Myklebust
@ 2001-05-09 13:25       ` Andrea Arcangeli
  2001-05-09 22:02       ` Marcelo Tosatti
  1 sibling, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2001-05-09 13:25 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Kurt Garloff, Linux Kernel

On Wed, May 09, 2001 at 09:30:18AM +0200, Trond Myklebust wrote:
> Here therefore is Andrea's patch with the changes I propose. Opinions?

flushing the dirty pages before locks is probably not required, a dirty
page in the dirty_pages list is no different than a mapped page not in
the dirty_pages list but only with the pte marked dirty, and we cannot flush
the pages with only the pte marked dirty before the locks, but flushing
the dirty_pgaes list won't hurt so overall it looks ok to me.

Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-09  7:30     ` Trond Myklebust
  2001-05-09 13:25       ` Andrea Arcangeli
@ 2001-05-09 22:02       ` Marcelo Tosatti
  2001-05-10  0:08         ` Andrea Arcangeli
  1 sibling, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2001-05-09 22:02 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Kurt Garloff, Andrea Arcangeli, Linux Kernel


On Wed, 9 May 2001, Trond Myklebust wrote:

> 
> In addition to the two changes I proposed to Andrea's new patch, I
> also realized we might want to do a fdatasync() when locking files. If
> we don't, then locking won't be atomic on mmap()...
> 
> Here therefore is Andrea's patch with the changes I propose. Opinions?
> 
> Cheers,
>   Trond
> 
> diff -u --recursive --new-file linux-2.4.4-fixes/fs/nfs/file.c linux-2.4.4-mmap/fs/nfs/file.c
> --- linux-2.4.4/fs/nfs/file.c	Fri Feb  9 20:29:44 2001
> +++ linux-2.4.4-mmap/fs/nfs/file.c	Wed May  9 09:18:45 2001
> @@ -39,6 +39,7 @@
>  static ssize_t nfs_file_write(struct file *, const char *, size_t, loff_t *);
>  static int  nfs_file_flush(struct file *);
>  static int  nfs_fsync(struct file *, struct dentry *dentry, int datasync);
> +static void nfs_file_close_vma(struct vm_area_struct *);
>  
>  struct file_operations nfs_file_operations = {
>  	read:		nfs_file_read,
> @@ -57,6 +58,11 @@
>  	setattr:	nfs_notify_change,
>  };
>  
> +static struct vm_operations_struct nfs_file_vm_ops = {
> +	nopage:		filemap_nopage,
> +	close:		nfs_file_close_vma,
> +};
> +
>  /* Hack for future NFS swap support */
>  #ifndef IS_SWAPFILE
>  # define IS_SWAPFILE(inode)	(0)
> @@ -104,6 +110,20 @@
>  	return result;
>  }
>  
> +static void nfs_file_close_vma(struct vm_area_struct * vma)
> +{
> +	struct inode * inode;
> +
> +	inode = vma->vm_file->f_dentry->d_inode;
> +
> +	if (inode->i_state & I_DIRTY_PAGES) {
> +		down(&inode->i_sem);
> +		filemap_fdatasync(inode->i_mapping);
> +		nfs_wb_all(inode);
> +		up(&inode->i_sem);
> +	}
> +}
> +

Why don't you clean I_DIRTY_PAGES ? 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-10  0:08         ` Andrea Arcangeli
@ 2001-05-09 22:38           ` Marcelo Tosatti
  2001-05-10  1:16             ` Andrea Arcangeli
  0 siblings, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2001-05-09 22:38 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Trond Myklebust, Kurt Garloff, Linux Kernel



On Thu, 10 May 2001, Andrea Arcangeli wrote:

> On Wed, May 09, 2001 at 07:02:16PM -0300, Marcelo Tosatti wrote:
> > Why don't you clean I_DIRTY_PAGES ? 
> 
> we don't have visibilty on the inode_lock from there, we could make a
> function in fs/inode.c or export the inode_lock to do that, but the flag
> will be collected when the inode is released anyways, and it's only an
> hint to make the common case fast (the common case is when nobody ever
> did a MAP_SHARED on the inode). Other places msync/fsync doesn't even
> check for such bit but they fdatasync/fdatawait unconditionally. 

Actually msync/fsync _can't_ rely on this bit because there is no
guarantee that data is fully synced on disk even if the bit is cleared.
(__sync_one (fs/inode.c) clears the bit _before_ starting the writeout,
and thats it).

You have the same problem with your code, so I guess its better to just
remove the I_DIRTY_PAGES check. 

> And on the same lines also sys_fsync and sys_msync could clear the
> I_DIRTY_PAGES but they don't for semplcity (it will be cleared by
> kupdate later).
> 
> So in short we can clear it but it's not required and it won't make much
> difference. If you really care you can clear it before calling fdatasync
> though.

Well, forget about that. :) 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-10  1:16             ` Andrea Arcangeli
@ 2001-05-10  0:00               ` Marcelo Tosatti
  2001-05-10 10:11                 ` Trond Myklebust
  2001-05-10 10:14                 ` Trond Myklebust
  0 siblings, 2 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2001-05-10  0:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Trond Myklebust, Kurt Garloff, Linux Kernel



On Thu, 10 May 2001, Andrea Arcangeli wrote:

> If some page wasn't yet visible in the dirty_pages list by the time
> __sync_one started, we'll find I_DIRTY_PAGES set. This is enforced by
> the locking order (sync_one first clears the I_DIRTY_PAGES and then
> it starts browsing the dirty_pages list while set_page_dirty first make the
> page visible and then marks the inode dirty).
> 
> So the I_DIRTY_PAGES check guarantees that those dirty pages cannot be
> lost in iput, that was the _only_ object of the patch and that is
> certainly enough to fix the nfs fs data corruption reported.
> 
> Now if you claim that munmap needs to be synchronous for nfs that's a
> completly different matter. I didn't even tried to make it synchronous.
> It is possible it has to be synchronous, even write(2) (in theory ;) has
> to behave like O_SYNC with nfs, but I'm not sure.

I suggested the removal of I_DIRTY_PAGES check because the current
behaviour of munmap seems to be synchronous (1), so I guess you _always_
want it to be synchronous.

1) nfs_wb_file() flushes the dirty data and then waits for completion. 

Trond? 

> Another thing (completly unrelated to the above issues) that I noticed
> while looking over this nfs code is that the __sync_one() for example
> called by generic_file_write(O_SYNC) will recall fdatasync but no nfs_wb_all
> is put before the fdatawait, and I'm not sure that the nfs_sync_page
> called by the fdatawait is enough to rapidly flush the writepaged stuff
> to the nfs server. nfs_sync_page apparently only cares about speculative
> reads, not at all about committing writebacks. It would look much saner
> to me if nfs_sync_page also does a nfs_wb_all() on the inode, so that
> the ->sync_page callback gets the same semantics it has for the real
> filesystems.

Looks sane and will probably makes things faster.

Again, Trond? :)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-09 22:02       ` Marcelo Tosatti
@ 2001-05-10  0:08         ` Andrea Arcangeli
  2001-05-09 22:38           ` Marcelo Tosatti
  0 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2001-05-10  0:08 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Trond Myklebust, Kurt Garloff, Linux Kernel

On Wed, May 09, 2001 at 07:02:16PM -0300, Marcelo Tosatti wrote:
> Why don't you clean I_DIRTY_PAGES ? 

we don't have visibilty on the inode_lock from there, we could make a
function in fs/inode.c or export the inode_lock to do that, but the flag
will be collected when the inode is released anyways, and it's only an
hint to make the common case fast (the common case is when nobody ever
did a MAP_SHARED on the inode). Other places msync/fsync doesn't even
check for such bit but they fdatasync/fdatawait unconditionally. And on
the same lines also sys_fsync and sys_msync could clear the
I_DIRTY_PAGES but they don't for semplcity (it will be cleared by
kupdate later).

So in short we can clear it but it's not required and it won't make much
difference. If you really care you can clear it before calling fdatasync
though.

Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-09 22:38           ` Marcelo Tosatti
@ 2001-05-10  1:16             ` Andrea Arcangeli
  2001-05-10  0:00               ` Marcelo Tosatti
  0 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2001-05-10  1:16 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Trond Myklebust, Kurt Garloff, Linux Kernel

On Wed, May 09, 2001 at 07:38:01PM -0300, Marcelo Tosatti wrote:
> 
> 
> On Thu, 10 May 2001, Andrea Arcangeli wrote:
> 
> > On Wed, May 09, 2001 at 07:02:16PM -0300, Marcelo Tosatti wrote:
> > > Why don't you clean I_DIRTY_PAGES ? 
> > 
> > we don't have visibilty on the inode_lock from there, we could make a
> > function in fs/inode.c or export the inode_lock to do that, but the flag
> > will be collected when the inode is released anyways, and it's only an
> > hint to make the common case fast (the common case is when nobody ever
> > did a MAP_SHARED on the inode). Other places msync/fsync doesn't even
> > check for such bit but they fdatasync/fdatawait unconditionally. 
> 
> Actually msync/fsync _can't_ rely on this bit because there is no
> guarantee that data is fully synced on disk even if the bit is cleared.
> (__sync_one (fs/inode.c) clears the bit _before_ starting the writeout,
> and thats it).

correct sorry, fsync/msync cannot check that bit of course.

> You have the same problem with your code, so I guess its better to just
> remove the I_DIRTY_PAGES check. 

The point you have to clarify before claming we should remove the check
is if the munmap flush needs to be synchronous or not. In general munmap
doesn't need to be synchronous. If you want to commit the writes an
explicit msync(MS_SYNC) or fsync on the file is required.  Otherwise the
updates will hit the platter in a rasonable amount of finite time
asynchronously. If somebody just intiated the fdatasync he will have to
finish before we can collect away the inode and in turn drop all its
cache, so those dirty pages cannot get lost in iput if somebody started
doing the flush under us either, and the guy doing the fdatasync under
us will have to wait synchronously for the stuff to be committed before
it can return.

If some page wasn't yet visible in the dirty_pages list by the time
__sync_one started, we'll find I_DIRTY_PAGES set. This is enforced by
the locking order (sync_one first clears the I_DIRTY_PAGES and then
it starts browsing the dirty_pages list while set_page_dirty first make the
page visible and then marks the inode dirty).

So the I_DIRTY_PAGES check guarantees that those dirty pages cannot be
lost in iput, that was the _only_ object of the patch and that is
certainly enough to fix the nfs fs data corruption reported.

Now if you claim that munmap needs to be synchronous for nfs that's a
completly different matter. I didn't even tried to make it synchronous.
It is possible it has to be synchronous, even write(2) (in theory ;) has
to behave like O_SYNC with nfs, but I'm not sure.


Another thing (completly unrelated to the above issues) that I noticed
while looking over this nfs code is that the __sync_one() for example
called by generic_file_write(O_SYNC) will recall fdatasync but no nfs_wb_all
is put before the fdatawait, and I'm not sure that the nfs_sync_page
called by the fdatawait is enough to rapidly flush the writepaged stuff
to the nfs server. nfs_sync_page apparently only cares about speculative
reads, not at all about committing writebacks. It would look much saner
to me if nfs_sync_page also does a nfs_wb_all() on the inode, so that
the ->sync_page callback gets the same semantics it has for the real
filesystems.

Comments?

Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-10  0:00               ` Marcelo Tosatti
@ 2001-05-10 10:11                 ` Trond Myklebust
  2001-05-10 10:14                 ` Trond Myklebust
  1 sibling, 0 replies; 15+ messages in thread
From: Trond Myklebust @ 2001-05-10 10:11 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrea Arcangeli, Kurt Garloff, Linux Kernel

>>>>> " " == Marcelo Tosatti <marcelo@conectiva.com.br> writes:

     > I suggested the removal of I_DIRTY_PAGES check because the
     > current behaviour of munmap seems to be synchronous (1), so I
     > guess you _always_ want it to be synchronous.

Yes. The NFS concept of close-to-open cache consistency requires you
to flush out all writes upon file close(). In this case, the usual
nfs_wb_file() + error reporting in nfs_release() won't catch anything
that's been put out using writepage(), because the latter can't mark
the requests using the struct file. This means we have to do something
special for mmap...

    >> Another thing (completly unrelated to the above issues) that I
    >> noticed while looking over this nfs code is that the
    >> __sync_one() for example called by generic_file_write(O_SYNC)
    >> will recall fdatasync but no nfs_wb_all is put before the
    >> fdatawait, and I'm not sure that the nfs_sync_page called by
    >> the fdatawait is enough to rapidly flush the writepaged stuff
    >> to the nfs server. nfs_sync_page apparently only cares about
    >> speculative reads, not at all about committing writebacks. It
    >> would look much saner to me if nfs_sync_page also does a
    >> nfs_wb_all() on the inode, so that the ->sync_page callback
    >> gets the same semantics it has for the real filesystems.

     > Looks sane and will probably makes things faster.

This should normally not be needed. Firstly there is logic in the
write code to send off a request as soon as we have scheduled wsize
bytes worth of data. Secondly there is the 'flushd' daemon that exists
in order to time out requests, and to flush them out if the first
logic fails to do so.

That said, the __sync_one() thing is of interest to me. You'll notice
that our lack of a write_inode() means that we don't currently support
the sync() system call. Furthermore, we do O_SYNC through the slower
method of actually making our commit_write() method make a synchronous
call.
I have a patch that implements write_inode() and removes our current
O_SYNC code on

   http://www.fys.uio.no/~trondmy/src/linux-2.4.4-write_inode.dif

It's been running for a month or two on my systems, but I've been wary
of sending it to Linus as it's not, strictly speaking, a bugfix.

Cheers,
   Trond

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: nfs MAP_SHARED corruption fix
  2001-05-10  0:00               ` Marcelo Tosatti
  2001-05-10 10:11                 ` Trond Myklebust
@ 2001-05-10 10:14                 ` Trond Myklebust
  1 sibling, 0 replies; 15+ messages in thread
From: Trond Myklebust @ 2001-05-10 10:14 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrea Arcangeli, Kurt Garloff, Linux Kernel

>>>>> " " == Marcelo Tosatti <marcelo@conectiva.com.br> writes:

     > I suggested the removal of I_DIRTY_PAGES check because the
     > current behaviour of munmap seems to be synchronous (1), so I
     > guess you _always_ want it to be synchronous.

Revised patch (+ necessary change in ksyms.c) follows.

Cheers,
  Trond

diff -u --recursive --new-file linux-2.4.4/fs/nfs/file.c linux-2.4.4-mmap/fs/nfs/file.c
--- linux-2.4.4/fs/nfs/file.c	Fri Feb  9 20:29:44 2001
+++ linux-2.4.4-mmap/fs/nfs/file.c	Thu May 10 12:12:38 2001
@@ -39,6 +39,7 @@
 static ssize_t nfs_file_write(struct file *, const char *, size_t, loff_t *);
 static int  nfs_file_flush(struct file *);
 static int  nfs_fsync(struct file *, struct dentry *dentry, int datasync);
+static void nfs_file_close_vma(struct vm_area_struct *);
 
 struct file_operations nfs_file_operations = {
 	read:		nfs_file_read,
@@ -57,6 +58,11 @@
 	setattr:	nfs_notify_change,
 };
 
+static struct vm_operations_struct nfs_file_vm_ops = {
+	nopage:		filemap_nopage,
+	close:		nfs_file_close_vma,
+};
+
 /* Hack for future NFS swap support */
 #ifndef IS_SWAPFILE
 # define IS_SWAPFILE(inode)	(0)
@@ -104,6 +110,19 @@
 	return result;
 }
 
+static void nfs_file_close_vma(struct vm_area_struct * vma)
+{
+	struct inode * inode;
+
+	inode = vma->vm_file->f_dentry->d_inode;
+
+	filemap_fdatasync(inode->i_mapping);
+	down(&inode->i_sem);
+	nfs_wb_all(inode);
+	up(&inode->i_sem);
+	filemap_fdatawait(inode->i_mapping);
+}
+
 static int
 nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
 {
@@ -115,8 +134,11 @@
 		dentry->d_parent->d_name.name, dentry->d_name.name);
 
 	status = nfs_revalidate_inode(NFS_SERVER(inode), inode);
-	if (!status)
+	if (!status) {
 		status = generic_file_mmap(file, vma);
+		if (!status)
+			vma->vm_ops = &nfs_file_vm_ops;
+	}
 	return status;
 }
 
@@ -283,9 +305,11 @@
 	 * Flush all pending writes before doing anything
 	 * with locks..
 	 */
-	down(&filp->f_dentry->d_inode->i_sem);
+	filemap_fdatasync(inode->i_mapping);
+	down(&inode->i_sem);
 	status = nfs_wb_all(inode);
-	up(&filp->f_dentry->d_inode->i_sem);
+	up(&inode->i_sem);
+	filemap_fdatawait(inode->i_mapping);
 	if (status < 0)
 		return status;
 
@@ -300,10 +324,12 @@
 	 */
  out_ok:
 	if ((cmd == F_SETLK || cmd == F_SETLKW) && fl->fl_type != F_UNLCK) {
-		down(&filp->f_dentry->d_inode->i_sem);
+		filemap_fdatasync(inode->i_mapping);
+		down(&inode->i_sem);
 		nfs_wb_all(inode);      /* we may have slept */
+		up(&inode->i_sem);
+		filemap_fdatawait(inode->i_mapping);
 		nfs_zap_caches(inode);
-		up(&filp->f_dentry->d_inode->i_sem);
 	}
 	return status;
 }
diff -u --recursive --new-file linux-2.4.4/kernel/ksyms.c linux-2.4.4-mmap/kernel/ksyms.c
--- linux-2.4.4/kernel/ksyms.c	Fri Apr 27 23:23:25 2001
+++ linux-2.4.4-mmap/kernel/ksyms.c	Wed May  9 18:05:58 2001
@@ -262,6 +262,8 @@
 EXPORT_SYMBOL(dentry_open);
 EXPORT_SYMBOL(filemap_nopage);
 EXPORT_SYMBOL(filemap_sync);
+EXPORT_SYMBOL(filemap_fdatasync);
+EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 
 /* device registration */

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2001-05-10 12:00 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2001-05-08 14:00 nfs MAP_SHARED corruption fix Andrea Arcangeli
2001-05-08 15:21 ` Trond Myklebust
2001-05-08 15:38   ` Kurt Garloff
2001-05-09  7:30     ` Trond Myklebust
2001-05-09 13:25       ` Andrea Arcangeli
2001-05-09 22:02       ` Marcelo Tosatti
2001-05-10  0:08         ` Andrea Arcangeli
2001-05-09 22:38           ` Marcelo Tosatti
2001-05-10  1:16             ` Andrea Arcangeli
2001-05-10  0:00               ` Marcelo Tosatti
2001-05-10 10:11                 ` Trond Myklebust
2001-05-10 10:14                 ` Trond Myklebust
2001-05-09  2:48   ` Andrea Arcangeli
2001-05-09  7:00     ` Trond Myklebust
  -- strict thread matches above, loose matches on Subject: below --
2001-05-09 10:55 Kurt Garloff

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox