All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Arcangeli <andrea@suse.de>
To: Andrew Morton <akpm@digeo.com>
Cc: paulmck@us.ibm.com, dmccr@us.ibm.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Fix vmtruncate race and distributed filesystem race
Date: Mon, 23 Jun 2003 09:43:53 +0200	[thread overview]
Message-ID: <20030623074353.GE19940@dualathlon.random> (raw)
In-Reply-To: <20030622233235.0924364d.akpm@digeo.com>

On Sun, Jun 22, 2003 at 11:32:35PM -0700, Andrew Morton wrote:
> "Paul E. McKenney" <paulmck@us.ibm.com> wrote:
> >
> > > but you can't trap this with a single counter increment in do_truncate:
> >  > 
> >  > 	CPU 0			CPU 1
> >  > 	----------		-----------
> >  > 				do_no_page
> >  > 	truncate
> 
>         i_size = new_i_size;
> 
> >  > 	increment counter
> >  > 				read counter
> >  > 				->nopage
> 
>                                 check i_size
> 
> >  > 	vmtruncate
> >  > 				read counter again -> different so retry
> >  > 
> >  > thanks to the second counter increment after vmtruncate in my fix, the
> >  > above race couldn't happen.
> > 
> >  The trick is that CPU 0 is expected to have updated the filesystem's
> >  idea of what pages are available before calling vmtruncate,
> >  invalidate_mmap_range() or whichever.
> 
> i_size has been updated, and filemap_nopage() will return NULL.

ok, I finally see your object now, so it's the i_size that was in theory
supposed to act as a "second" increment on the truncate side to
serialize against in do_no_page.

I was searching for any additional SMP locking, and infact there is
none.  No way the current code can serialize against the i_size.

The problem is that the last patch posted on l-k isn't guaranteeing
anything like what you wrote above.  What can happen is that the i_size
can be read still way before the counter.


 	CPU 0			CPU 1
	----------		-----------
				do_no_page
	truncate
 
                                specualtive read i_size into l2 cache
        i_size = new_i_size;

	increment counter
 				read counter
				->nopage
				check i_size from l2 cache
 

 	vmtruncate
 				read counter again -> different so retry


For the single counter increment patch to close the race using the
implicit i_size update as a second increment, you need to add an
absolutely necessary smb_rmb() here:

        sequence = atomic_read(&mapping->truncate_count);
+	smp_rmb();	
        new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);

And for non-x86 you need a smb_wmb() between the i_size = new_i_size and
the atomic_inc on the truncate side too, because advanced archs with
finegrined locking can implement a one-way barrier, where the counter
increment could pass the down(&mapping->i_shared_sem). Last but not the
least it was IMHO misleading and dirty to do the counter increment in
truncate after taking a completely unrelated semaphore just to take
advantage of the implicit barrier on the x86. So doing it with a proper
smp_wmb() besides fixing advanced archs will kind of make the code
readable, so you can understand the i_size update is the second
increment (actually decrement because it's the i_size).

Now that I see what the second increment was supposed to be, it's a cute
idea to use the i_size for that, and I can appreciate its beauty (I
mean, in purist terms, in practice IMHO the seq_lock was more obivously
correct code with all the smp reordering hidden into the lock semantics
and I'm 100% sure you couldn't measure any performance difference in
truncate due the second increment ;). But if you prefer to keep the most
finegrined approch of using the i_size with the explicit memory
barriers, I'm of course not against it after you add the needed fixed I
outlined above that will finally close the race.

thanks,

Andrea

WARNING: multiple messages have this Message-ID (diff)
From: Andrea Arcangeli <andrea@suse.de>
To: Andrew Morton <akpm@digeo.com>
Cc: paulmck@us.ibm.com, dmccr@us.ibm.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Fix vmtruncate race and distributed filesystem race
Date: Mon, 23 Jun 2003 09:43:53 +0200	[thread overview]
Message-ID: <20030623074353.GE19940@dualathlon.random> (raw)
In-Reply-To: <20030622233235.0924364d.akpm@digeo.com>

On Sun, Jun 22, 2003 at 11:32:35PM -0700, Andrew Morton wrote:
> "Paul E. McKenney" <paulmck@us.ibm.com> wrote:
> >
> > > but you can't trap this with a single counter increment in do_truncate:
> >  > 
> >  > 	CPU 0			CPU 1
> >  > 	----------		-----------
> >  > 				do_no_page
> >  > 	truncate
> 
>         i_size = new_i_size;
> 
> >  > 	increment counter
> >  > 				read counter
> >  > 				->nopage
> 
>                                 check i_size
> 
> >  > 	vmtruncate
> >  > 				read counter again -> different so retry
> >  > 
> >  > thanks to the second counter increment after vmtruncate in my fix, the
> >  > above race couldn't happen.
> > 
> >  The trick is that CPU 0 is expected to have updated the filesystem's
> >  idea of what pages are available before calling vmtruncate,
> >  invalidate_mmap_range() or whichever.
> 
> i_size has been updated, and filemap_nopage() will return NULL.

ok, I finally see your object now, so it's the i_size that was in theory
supposed to act as a "second" increment on the truncate side to
serialize against in do_no_page.

I was searching for any additional SMP locking, and infact there is
none.  No way the current code can serialize against the i_size.

The problem is that the last patch posted on l-k isn't guaranteeing
anything like what you wrote above.  What can happen is that the i_size
can be read still way before the counter.


 	CPU 0			CPU 1
	----------		-----------
				do_no_page
	truncate
 
                                specualtive read i_size into l2 cache
        i_size = new_i_size;

	increment counter
 				read counter
				->nopage
				check i_size from l2 cache
 

 	vmtruncate
 				read counter again -> different so retry


For the single counter increment patch to close the race using the
implicit i_size update as a second increment, you need to add an
absolutely necessary smb_rmb() here:

        sequence = atomic_read(&mapping->truncate_count);
+	smp_rmb();	
        new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);

And for non-x86 you need a smb_wmb() between the i_size = new_i_size and
the atomic_inc on the truncate side too, because advanced archs with
finegrined locking can implement a one-way barrier, where the counter
increment could pass the down(&mapping->i_shared_sem). Last but not the
least it was IMHO misleading and dirty to do the counter increment in
truncate after taking a completely unrelated semaphore just to take
advantage of the implicit barrier on the x86. So doing it with a proper
smp_wmb() besides fixing advanced archs will kind of make the code
readable, so you can understand the i_size update is the second
increment (actually decrement because it's the i_size).

Now that I see what the second increment was supposed to be, it's a cute
idea to use the i_size for that, and I can appreciate its beauty (I
mean, in purist terms, in practice IMHO the seq_lock was more obivously
correct code with all the smp reordering hidden into the lock semantics
and I'm 100% sure you couldn't measure any performance difference in
truncate due the second increment ;). But if you prefer to keep the most
finegrined approch of using the i_size with the explicit memory
barriers, I'm of course not against it after you add the needed fixed I
outlined above that will finally close the race.

thanks,

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

  reply	other threads:[~2003-06-23  7:30 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-06-12 20:16 [PATCH] Fix vmtruncate race and distributed filesystem race Dave McCracken
2003-06-12 20:49 ` Andrew Morton
2003-06-12 20:49   ` Andrew Morton
2003-06-12 21:00   ` Andrew Morton
2003-06-12 21:00     ` Andrew Morton
2003-06-12 21:08     ` Dave McCracken
2003-06-12 21:08       ` Dave McCracken
2003-06-12 21:44       ` Andrew Morton
2003-06-12 21:44         ` Andrew Morton
2003-06-12 22:56         ` Dave McCracken
2003-06-12 23:07           ` Andrew Morton
2003-06-12 23:07             ` Andrew Morton
2003-06-20  0:17           ` Andrea Arcangeli
2003-06-20  0:17             ` Andrea Arcangeli
2003-06-23  3:28             ` Paul E. McKenney
2003-06-23  3:28               ` Paul E. McKenney
2003-06-23  6:29               ` Andrea Arcangeli
2003-06-23  6:29                 ` Andrea Arcangeli
2003-06-23  6:32               ` Andrew Morton
2003-06-23  6:32                 ` Andrew Morton
2003-06-23  7:43                 ` Andrea Arcangeli [this message]
2003-06-23  7:43                   ` Andrea Arcangeli
2003-06-23  7:56                   ` Andrew Morton
2003-06-23  7:56                     ` Andrew Morton
2003-06-23  8:10                     ` Andrea Arcangeli
2003-06-23  8:10                       ` Andrea Arcangeli
2003-06-24  1:37                       ` Paul E. McKenney
2003-06-24  1:37                         ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20030623074353.GE19940@dualathlon.random \
    --to=andrea@suse.de \
    --cc=akpm@digeo.com \
    --cc=dmccr@us.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=paulmck@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.