* [PATCH block/for-linus] writeback: fix a subtle race condition in I_DIRTY clearing
@ 2014-10-24 19:38 Tejun Heo
2014-10-29 16:37 ` Jan Kara
2014-11-04 17:34 ` Tejun Heo
0 siblings, 2 replies; 4+ messages in thread
From: Tejun Heo @ 2014-10-24 19:38 UTC (permalink / raw)
To: Jens Axboe; +Cc: Jan Kara, Mikulas Patocka, Al Viro, linux-kernel
After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
tests inode->i_state locklessly to see whether it already has all the
necessary I_DIRTY bits set. The comment above the barrier doesn't
contain any useful information - memory barriers can't ensure "changes
are seen by all cpus" by itself.
And it sure enough was broken. Please consider the following
scenario.
CPU 0 CPU 1
-------------------------------------------------------------------------------
enters __writeback_single_inode()
grabs inode->i_lock
tests PAGECACHE_TAG_DIRTY which is clear
enters __set_page_dirty()
grabs mapping->tree_lock
sets PAGECACHE_TAG_DIRTY
releases mapping->tree_lock
leaves __set_page_dirty()
enters __mark_inode_dirty()
smp_mb()
sees I_DIRTY_PAGES set
leaves __mark_inode_dirty()
clears I_DIRTY_PAGES
releases inode->i_lock
Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
to lead to an immediately critical problem because requeue_inode()
later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
deciding whether the inode needs to be requeued for IO and there are
enough unintentional memory barriers inbetween, so while the inode
ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
IO list.
The lack of explicit barrier may also theoretically affect the other
I_DIRTY bits which deal with metadata dirtiness. There is no
guarantee that a strong enough barrier exists between
I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
inode. Filesystem inode writeout path likely has enough stuff which
can behave as full barrier but it's theoretically possible that the
writeout may not see all the updates from ->dirty_inode().
Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
that I_DIRTY_PAGES needs a special treatment as it always needs to be
cleared to be interlocked with the lockless test on
__mark_inode_dirty() side. It's cleared unconditionally and
reinstated after smp_mb() if the mapping still has dirty pages.
Also add comments explaining how and why the barriers are paired.
Lightly tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
---
fs/fs-writeback.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -479,12 +479,28 @@ __writeback_single_inode(struct inode *i
* write_inode()
*/
spin_lock(&inode->i_lock);
- /* Clear I_DIRTY_PAGES if we've written out all dirty pages */
- if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
- inode->i_state &= ~I_DIRTY_PAGES;
+
dirty = inode->i_state & I_DIRTY;
- inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+ inode->i_state &= ~I_DIRTY;
+
+ /*
+ * Paired with smp_mb() in __mark_inode_dirty(). This allows
+ * __mark_inode_dirty() to test i_state without grabbing i_lock -
+ * either they see the I_DIRTY bits cleared or we see the dirtied
+ * inode.
+ *
+ * I_DIRTY_PAGES is always cleared together above even if @mapping
+ * still has dirty pages. The flag is reinstated after smp_mb() if
+ * necessary. This guarantees that either __mark_inode_dirty()
+ * sees clear I_DIRTY_PAGES or we see PAGECACHE_TAG_DIRTY.
+ */
+ smp_mb();
+
+ if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+ inode->i_state |= I_DIRTY_PAGES;
+
spin_unlock(&inode->i_lock);
+
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
int err = write_inode(inode, wbc);
@@ -1148,12 +1164,11 @@ void __mark_inode_dirty(struct inode *in
}
/*
- * make sure that changes are seen by all cpus before we test i_state
- * -- mikulas
+ * Paired with smp_mb() in __writeback_single_inode() for the
+ * following lockless i_state test. See there for details.
*/
smp_mb();
- /* avoid the locking if we can */
if ((inode->i_state & flags) == flags)
return;
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH block/for-linus] writeback: fix a subtle race condition in I_DIRTY clearing
2014-10-24 19:38 [PATCH block/for-linus] writeback: fix a subtle race condition in I_DIRTY clearing Tejun Heo
@ 2014-10-29 16:37 ` Jan Kara
2014-11-04 17:34 ` Tejun Heo
1 sibling, 0 replies; 4+ messages in thread
From: Jan Kara @ 2014-10-29 16:37 UTC (permalink / raw)
To: Tejun Heo; +Cc: Jens Axboe, Jan Kara, Mikulas Patocka, Al Viro, linux-kernel
On Fri 24-10-14 15:38:21, Tejun Heo wrote:
> After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
> tests inode->i_state locklessly to see whether it already has all the
> necessary I_DIRTY bits set. The comment above the barrier doesn't
> contain any useful information - memory barriers can't ensure "changes
> are seen by all cpus" by itself.
>
> And it sure enough was broken. Please consider the following
> scenario.
>
> CPU 0 CPU 1
> -------------------------------------------------------------------------------
>
> enters __writeback_single_inode()
> grabs inode->i_lock
> tests PAGECACHE_TAG_DIRTY which is clear
> enters __set_page_dirty()
> grabs mapping->tree_lock
> sets PAGECACHE_TAG_DIRTY
> releases mapping->tree_lock
> leaves __set_page_dirty()
>
> enters __mark_inode_dirty()
> smp_mb()
> sees I_DIRTY_PAGES set
> leaves __mark_inode_dirty()
> clears I_DIRTY_PAGES
> releases inode->i_lock
>
> Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
> to lead to an immediately critical problem because requeue_inode()
> later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
> deciding whether the inode needs to be requeued for IO and there are
> enough unintentional memory barriers inbetween, so while the inode
> ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
> IO list.
>
> The lack of explicit barrier may also theoretically affect the other
> I_DIRTY bits which deal with metadata dirtiness. There is no
> guarantee that a strong enough barrier exists between
> I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
> inode. Filesystem inode writeout path likely has enough stuff which
> can behave as full barrier but it's theoretically possible that the
> writeout may not see all the updates from ->dirty_inode().
>
> Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
> that I_DIRTY_PAGES needs a special treatment as it always needs to be
> cleared to be interlocked with the lockless test on
> __mark_inode_dirty() side. It's cleared unconditionally and
> reinstated after smp_mb() if the mapping still has dirty pages.
>
> Also add comments explaining how and why the barriers are paired.
>
> Lightly tested.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Mikulas Patocka <mpatocka@redhat.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: stable@vger.kernel.org
Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/fs-writeback.c | 29 ++++++++++++++++++++++-------
> 1 file changed, 22 insertions(+), 7 deletions(-)
>
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -479,12 +479,28 @@ __writeback_single_inode(struct inode *i
> * write_inode()
> */
> spin_lock(&inode->i_lock);
> - /* Clear I_DIRTY_PAGES if we've written out all dirty pages */
> - if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> - inode->i_state &= ~I_DIRTY_PAGES;
> +
> dirty = inode->i_state & I_DIRTY;
> - inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> + inode->i_state &= ~I_DIRTY;
> +
> + /*
> + * Paired with smp_mb() in __mark_inode_dirty(). This allows
> + * __mark_inode_dirty() to test i_state without grabbing i_lock -
> + * either they see the I_DIRTY bits cleared or we see the dirtied
> + * inode.
> + *
> + * I_DIRTY_PAGES is always cleared together above even if @mapping
> + * still has dirty pages. The flag is reinstated after smp_mb() if
> + * necessary. This guarantees that either __mark_inode_dirty()
> + * sees clear I_DIRTY_PAGES or we see PAGECACHE_TAG_DIRTY.
> + */
> + smp_mb();
> +
> + if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> + inode->i_state |= I_DIRTY_PAGES;
> +
> spin_unlock(&inode->i_lock);
> +
> /* Don't write the inode if only I_DIRTY_PAGES was set */
> if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> int err = write_inode(inode, wbc);
> @@ -1148,12 +1164,11 @@ void __mark_inode_dirty(struct inode *in
> }
>
> /*
> - * make sure that changes are seen by all cpus before we test i_state
> - * -- mikulas
> + * Paired with smp_mb() in __writeback_single_inode() for the
> + * following lockless i_state test. See there for details.
> */
> smp_mb();
>
> - /* avoid the locking if we can */
> if ((inode->i_state & flags) == flags)
> return;
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH block/for-linus] writeback: fix a subtle race condition in I_DIRTY clearing
2014-10-24 19:38 [PATCH block/for-linus] writeback: fix a subtle race condition in I_DIRTY clearing Tejun Heo
2014-10-29 16:37 ` Jan Kara
@ 2014-11-04 17:34 ` Tejun Heo
2014-11-04 17:41 ` Jens Axboe
1 sibling, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2014-11-04 17:34 UTC (permalink / raw)
To: Jens Axboe; +Cc: Jan Kara, Mikulas Patocka, Al Viro, linux-kernel
On Fri, Oct 24, 2014 at 03:38:21PM -0400, Tejun Heo wrote:
> After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
> tests inode->i_state locklessly to see whether it already has all the
> necessary I_DIRTY bits set. The comment above the barrier doesn't
> contain any useful information - memory barriers can't ensure "changes
> are seen by all cpus" by itself.
>
> And it sure enough was broken. Please consider the following
> scenario.
>
> CPU 0 CPU 1
> -------------------------------------------------------------------------------
>
> enters __writeback_single_inode()
> grabs inode->i_lock
> tests PAGECACHE_TAG_DIRTY which is clear
> enters __set_page_dirty()
> grabs mapping->tree_lock
> sets PAGECACHE_TAG_DIRTY
> releases mapping->tree_lock
> leaves __set_page_dirty()
>
> enters __mark_inode_dirty()
> smp_mb()
> sees I_DIRTY_PAGES set
> leaves __mark_inode_dirty()
> clears I_DIRTY_PAGES
> releases inode->i_lock
>
> Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
> to lead to an immediately critical problem because requeue_inode()
> later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
> deciding whether the inode needs to be requeued for IO and there are
> enough unintentional memory barriers inbetween, so while the inode
> ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
> IO list.
>
> The lack of explicit barrier may also theoretically affect the other
> I_DIRTY bits which deal with metadata dirtiness. There is no
> guarantee that a strong enough barrier exists between
> I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
> inode. Filesystem inode writeout path likely has enough stuff which
> can behave as full barrier but it's theoretically possible that the
> writeout may not see all the updates from ->dirty_inode().
>
> Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
> that I_DIRTY_PAGES needs a special treatment as it always needs to be
> cleared to be interlocked with the lockless test on
> __mark_inode_dirty() side. It's cleared unconditionally and
> reinstated after smp_mb() if the mapping still has dirty pages.
>
> Also add comments explaining how and why the barriers are paired.
>
> Lightly tested.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Mikulas Patocka <mpatocka@redhat.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: stable@vger.kernel.org
Jens, can you please route this one?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH block/for-linus] writeback: fix a subtle race condition in I_DIRTY clearing
2014-11-04 17:34 ` Tejun Heo
@ 2014-11-04 17:41 ` Jens Axboe
0 siblings, 0 replies; 4+ messages in thread
From: Jens Axboe @ 2014-11-04 17:41 UTC (permalink / raw)
To: Tejun Heo; +Cc: Jan Kara, Mikulas Patocka, Al Viro, linux-kernel
On 2014-11-04 10:34, Tejun Heo wrote:
> On Fri, Oct 24, 2014 at 03:38:21PM -0400, Tejun Heo wrote:
>> After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
>> tests inode->i_state locklessly to see whether it already has all the
>> necessary I_DIRTY bits set. The comment above the barrier doesn't
>> contain any useful information - memory barriers can't ensure "changes
>> are seen by all cpus" by itself.
>>
>> And it sure enough was broken. Please consider the following
>> scenario.
>>
>> CPU 0 CPU 1
>> -------------------------------------------------------------------------------
>>
>> enters __writeback_single_inode()
>> grabs inode->i_lock
>> tests PAGECACHE_TAG_DIRTY which is clear
>> enters __set_page_dirty()
>> grabs mapping->tree_lock
>> sets PAGECACHE_TAG_DIRTY
>> releases mapping->tree_lock
>> leaves __set_page_dirty()
>>
>> enters __mark_inode_dirty()
>> smp_mb()
>> sees I_DIRTY_PAGES set
>> leaves __mark_inode_dirty()
>> clears I_DIRTY_PAGES
>> releases inode->i_lock
>>
>> Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
>> to lead to an immediately critical problem because requeue_inode()
>> later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
>> deciding whether the inode needs to be requeued for IO and there are
>> enough unintentional memory barriers inbetween, so while the inode
>> ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
>> IO list.
>>
>> The lack of explicit barrier may also theoretically affect the other
>> I_DIRTY bits which deal with metadata dirtiness. There is no
>> guarantee that a strong enough barrier exists between
>> I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
>> inode. Filesystem inode writeout path likely has enough stuff which
>> can behave as full barrier but it's theoretically possible that the
>> writeout may not see all the updates from ->dirty_inode().
>>
>> Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
>> that I_DIRTY_PAGES needs a special treatment as it always needs to be
>> cleared to be interlocked with the lockless test on
>> __mark_inode_dirty() side. It's cleared unconditionally and
>> reinstated after smp_mb() if the mapping still has dirty pages.
>>
>> Also add comments explaining how and why the barriers are paired.
>>
>> Lightly tested.
>>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Mikulas Patocka <mpatocka@redhat.com>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: stable@vger.kernel.org
>
> Jens, can you please route this one?
I can, was going to send an ack anyway.
--
Jens Axboe
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2014-11-04 17:41 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-10-24 19:38 [PATCH block/for-linus] writeback: fix a subtle race condition in I_DIRTY clearing Tejun Heo
2014-10-29 16:37 ` Jan Kara
2014-11-04 17:34 ` Tejun Heo
2014-11-04 17:41 ` Jens Axboe
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox