From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Tejun Heo <tj@kernel.org>,
	Jan Kara <jack@suse.cz>, Mikulas Patocka <mpatocka@redhat.com>,
	Jens Axboe <axboe@kernel.dk>, Al Viro <viro@zeniv.linux.org.uk>,
	Jens Axboe <axboe@fb.com>
Subject: [PATCH 3.10 31/44] writeback: fix a subtle race condition in I_DIRTY clearing
Date: Tue, 13 Jan 2015 23:23:52 -0800
Message-ID: <20150114072229.177197259@linuxfoundation.org>
In-Reply-To: <20150114072227.419663002@linuxfoundation.org>

3.10-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Tejun Heo <tj@kernel.org>

commit 9c6ac78eb3521c5937b2dd8a7d1b300f41092f45 upstream.

After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
tests inode->i_state locklessly to see whether it already has all the
necessary I_DIRTY bits set.  The comment above the barrier doesn't
contain any useful information - memory barriers by themselves can't
ensure that "changes are seen by all cpus".

And sure enough, it was broken.  Consider the following scenario.

 CPU 0					CPU 1
 -------------------------------------------------------------------------------

					enters __writeback_single_inode()
					grabs inode->i_lock
					tests PAGECACHE_TAG_DIRTY which is clear
 enters __set_page_dirty()
 grabs mapping->tree_lock
 sets PAGECACHE_TAG_DIRTY
 releases mapping->tree_lock
 leaves __set_page_dirty()

 enters __mark_inode_dirty()
 smp_mb()
 sees I_DIRTY_PAGES set
 leaves __mark_inode_dirty()
					clears I_DIRTY_PAGES
					releases inode->i_lock

Now @inode has dirty pages with I_DIRTY_PAGES clear.  This doesn't seem
to lead to an immediately critical problem because requeue_inode()
later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
deciding whether the inode needs to be requeued for IO and there are
enough unintentional memory barriers in between, so while the inode
ends up with an inconsistent I_DIRTY_PAGES flag, it doesn't fall off
the IO list.

The lack of explicit barrier may also theoretically affect the other
I_DIRTY bits which deal with metadata dirtiness.  There is no
guarantee that a strong enough barrier exists between
I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
inode.  The filesystem inode writeout path likely contains enough
operations that act as full barriers, but it's theoretically possible
that the writeout may not see all the updates from ->dirty_inode().

Fix it by adding an explicit smp_mb() after I_DIRTY clearing.  Note
that I_DIRTY_PAGES needs special treatment: it always has to be
cleared so that the clearing is interlocked with the lockless test on
the __mark_inode_dirty() side.  It's cleared unconditionally and
reinstated after smp_mb() if the mapping still has dirty pages.
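
The scenario and the fix described above can be checked with a small
toy model.  The sketch below is not kernel code: struct toy_inode,
run_schedule() and count_inconsistent() are hypothetical names, and it
assumes that the barriers make each CPU's two steps take effect in
program order, so it only enumerates the six sequentially consistent
interleavings.  It counts the schedules that end with the dirty tag
set while the flag is clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not kernel code. */
struct toy_inode {
	bool dirty_pages_flag;	/* stands in for I_DIRTY_PAGES in i_state */
	bool dirty_tag;		/* stands in for PAGECACHE_TAG_DIRTY */
};

/*
 * Run one interleaving.  Bit n of @mask set means slot n belongs to
 * CPU 0 (the dirtying side), otherwise to CPU 1 (writeback, holding
 * i_lock).  @fixed selects the new clearing order.  Returns true if
 * the inode ends up with the tag set but the flag clear.
 */
static bool run_schedule(unsigned mask, bool fixed)
{
	/* Start state: inode marked dirty, all pages already written. */
	struct toy_inode ino = { .dirty_pages_flag = true, .dirty_tag = false };
	bool tag_seen = false;		/* old order: result of the tag test */
	bool need_slow_path = false;	/* CPU 0 saw the flag clear */
	int cpu0_step = 0, cpu1_step = 0;

	assert(mask < 16u);

	for (int slot = 0; slot < 4; slot++) {
		if (mask & (1u << slot)) {
			if (cpu0_step++ == 0)	/* __set_page_dirty() */
				ino.dirty_tag = true;
			else			/* lockless i_state test */
				need_slow_path = !ino.dirty_pages_flag;
		} else if (cpu1_step++ == 0) {
			if (fixed)		/* clear unconditionally */
				ino.dirty_pages_flag = false;
			else			/* test the tag first */
				tag_seen = ino.dirty_tag;
		} else {
			if (fixed) {		/* reinstate if still dirty */
				if (ino.dirty_tag)
					ino.dirty_pages_flag = true;
			} else if (!tag_seen) {	/* clear based on stale test */
				ino.dirty_pages_flag = false;
			}
		}
	}

	/* CPU 1 dropped i_lock; CPU 0's locked slow path sets the flag. */
	if (need_slow_path)
		ino.dirty_pages_flag = true;

	return ino.dirty_tag && !ino.dirty_pages_flag;
}

/* Count inconsistent outcomes over all schedules with two CPU 0 slots. */
static int count_inconsistent(bool fixed)
{
	int bad = 0;

	for (unsigned mask = 0; mask < 16u; mask++) {
		int bits = 0;

		for (int i = 0; i < 4; i++)
			bits += (mask >> i) & 1u;
		if (bits == 2 && run_schedule(mask, fixed))
			bad++;
	}
	return bad;
}
```

In this model count_inconsistent(false) finds exactly one bad
schedule, the one shown in the scenario, and count_inconsistent(true)
finds none.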

Also add comments explaining how and why the barriers are paired.
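
For readers more familiar with userspace, the barrier pairing can be
approximated with C11 atomics.  This is only a rough sketch under
stated assumptions: atomic_thread_fence(memory_order_seq_cst) plays
the role of smp_mb(), struct toy_inode, TOY_DIRTY_PAGES and both
functions are hypothetical, i_lock and the locked slow path are left
out, and the fields are atomics only because C11 requires that for
concurrent access.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DIRTY_PAGES 0x1u	/* hypothetical stand-in for I_DIRTY_PAGES */

struct toy_inode {
	atomic_uint state;	/* stand-in for inode->i_state */
	atomic_bool tag_dirty;	/* stand-in for PAGECACHE_TAG_DIRTY */
};

/*
 * Dirtying side: publish the tag, full fence, then test the flag.
 * Returns true if the flag is already set (fast path), false if the
 * caller would have to take the lock and set the flag itself.
 */
static bool toy_mark_dirty_fast(struct toy_inode *ino)
{
	assert(ino != NULL);
	atomic_store_explicit(&ino->tag_dirty, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* like smp_mb() */
	return (atomic_load_explicit(&ino->state, memory_order_relaxed) &
		TOY_DIRTY_PAGES) != 0;
}

/*
 * Writeback side: clear the flag unconditionally, full fence, then
 * reinstate the flag if the tag is still set.
 */
static void toy_writeback_clear(struct toy_inode *ino)
{
	assert(ino != NULL);
	atomic_fetch_and_explicit(&ino->state, ~TOY_DIRTY_PAGES,
				  memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* like smp_mb() */
	if (atomic_load_explicit(&ino->tag_dirty, memory_order_relaxed))
		atomic_fetch_or_explicit(&ino->state, TOY_DIRTY_PAGES,
					 memory_order_relaxed);
}
```

Because both sides put a full fence between their store and their
load, at least one of them observes the other's store: either the
dirtying side sees the flag clear, or the writeback side sees the tag
and puts the flag back.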

Lightly tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/fs-writeback.c |   29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -470,12 +470,28 @@ __writeback_single_inode(struct inode *i
 	 * write_inode()
 	 */
 	spin_lock(&inode->i_lock);
-	/* Clear I_DIRTY_PAGES if we've written out all dirty pages */
-	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
-		inode->i_state &= ~I_DIRTY_PAGES;
+
 	dirty = inode->i_state & I_DIRTY;
-	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	inode->i_state &= ~I_DIRTY;
+
+	/*
+	 * Paired with smp_mb() in __mark_inode_dirty().  This allows
+	 * __mark_inode_dirty() to test i_state without grabbing i_lock -
+	 * either they see the I_DIRTY bits cleared or we see the dirtied
+	 * inode.
+	 *
+	 * I_DIRTY_PAGES is always cleared together above even if @mapping
+	 * still has dirty pages.  The flag is reinstated after smp_mb() if
+	 * necessary.  This guarantees that either __mark_inode_dirty()
+	 * sees clear I_DIRTY_PAGES or we see PAGECACHE_TAG_DIRTY.
+	 */
+	smp_mb();
+
+	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+		inode->i_state |= I_DIRTY_PAGES;
+
 	spin_unlock(&inode->i_lock);
+
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
@@ -1146,12 +1162,11 @@ void __mark_inode_dirty(struct inode *in
 	}
 
 	/*
-	 * make sure that changes are seen by all cpus before we test i_state
-	 * -- mikulas
+	 * Paired with smp_mb() in __writeback_single_inode() for the
+	 * following lockless i_state test.  See there for details.
 	 */
 	smp_mb();
 
-	/* avoid the locking if we can */
 	if ((inode->i_state & flags) == flags)
 		return;
 


