linux-fsdevel.vger.kernel.org archive mirror
From: Mikulas Patocka <mpatocka@redhat.com>
To: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	Jens Axboe <axboe@kernel.dk>,
	"Alasdair G. Kergon" <agk@redhat.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@vger.kernel.org, dm-devel@redhat.com
Subject: Re: Crash when IO is being submitted and block size is changed
Date: Sun, 15 Jul 2012 20:55:02 -0400 (EDT)
Message-ID: <Pine.LNX.4.64.1207152051490.4240@file.rdu.redhat.com>
In-Reply-To: <20120628111541.GB17515@quack.suse.cz>



On Thu, 28 Jun 2012, Jan Kara wrote:

> On Wed 27-06-12 23:04:09, Mikulas Patocka wrote:
> > The kernel crashes when IO is being submitted to a block device and block 
> > size of that device is changed simultaneously.
>   Nasty ;-)
> 
> > To reproduce the crash, apply this patch:
> > 
> > --- linux-3.4.3-fast.orig/fs/block_dev.c 2012-06-27 20:24:07.000000000 +0200
> > +++ linux-3.4.3-fast/fs/block_dev.c 2012-06-27 20:28:34.000000000 +0200
> > @@ -28,6 +28,7 @@
> >  #include <linux/log2.h>
> >  #include <linux/cleancache.h>
> >  #include <asm/uaccess.h> 
> > +#include <linux/delay.h>
> >  #include "internal.h"
> >  struct bdev_inode {
> > @@ -203,6 +204,7 @@ blkdev_get_blocks(struct inode *inode, s
> >  
> >  	bh->b_bdev = I_BDEV(inode);
> >  	bh->b_blocknr = iblock;
> > +	msleep(1000);
> >  	bh->b_size = max_blocks << inode->i_blkbits;
> >  	if (max_blocks)
> >  		set_buffer_mapped(bh);
> > 
> > Use some device with 4k blocksize, for example a ramdisk.
> > Run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
> > While it is sleeping in the msleep function, run "blockdev --setbsz 2048 
> > /dev/ram0" on the other console.
> > You get a BUG at fs/direct-io.c:1013 - BUG_ON(this_chunk_bytes == 0);
> > 
> > 
> > One may ask "why would anyone do this - submit I/O and change block size 
> > simultaneously?" - the problem is that udev and lvm can scan and read all 
> > block devices anytime - so anytime you change block device size, there may 
> > be some i/o to that device in flight and the crash may happen. That BUG 
> > actually happened in production environment because of lvm scanning block 
> > devices and some other software changing block size at the same time.
> > 
>   Yeah, it's nasty and neither solution looks particularly appealing. One
> idea that came to my mind is: I'm trying to solve some races between direct
> IO, buffered IO, hole punching etc. by a new mapping interval lock. I'm not
> sure if it will go anywhere yet but if it does, we can fix the above race
> by taking the mapping lock for the whole block device around setting block
> size thus effectivelly disallowing any IO to it.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> 

Hi

This is the patch that fixes this crash: it takes an rw-semaphore around 
the whole direct-IO path.

(Note that if someone is concerned about performance, the rw-semaphore 
could be made per-cpu: take it for read on the current CPU and take it 
for write on all CPUs.)

Mikulas

---

blockdev: fix a crash when block size is changed and I/O is issued simultaneously

The kernel may crash when block size is changed and I/O is issued
simultaneously.

Because some subsystems (udev or lvm) may read any block device at any
time, the bug puts any code that changes the block size of a device in
jeopardy.

The crash can be reproduced if you place "msleep(1000)" in
blkdev_get_blocks just before "bh->b_size = max_blocks <<
inode->i_blkbits;".
Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct".
While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0".
You get a BUG in fs/direct-io.c: BUG_ON(this_chunk_bytes == 0).

Both the direct and the non-direct I/O code are written with the
assumption that block size does not change. It doesn't seem practical to
fix these crashes one by one: there may be many crash possibilities when
block size changes at a certain place, and it is impossible to find them
all and verify the code.

This patch introduces a new rw-lock bd_block_size_semaphore. The lock is
taken for read during I/O. It is taken for write when changing block
size. Consequently, block size can't be changed while I/O is being
submitted.

For asynchronous I/O, the patch only prevents block size change while
the I/O is being submitted. The block size can change when the I/O is in
progress or when the I/O is being finished. This is acceptable because
there are no accesses to block size when asynchronous I/O is being
finished.

The patch also prevents the block size from changing while the device is
mapped with mmap: set_blocksize returns -EBUSY in that case.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/char/raw.c |    2 -
 fs/block_dev.c     |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/fs.h |    4 +++
 3 files changed, 61 insertions(+), 3 deletions(-)

Index: linux-3.5-rc6-devel/include/linux/fs.h
===================================================================
--- linux-3.5-rc6-devel.orig/include/linux/fs.h	2012-07-16 01:18:45.000000000 +0200
+++ linux-3.5-rc6-devel/include/linux/fs.h	2012-07-16 01:29:21.000000000 +0200
@@ -713,6 +713,8 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+	/* A semaphore that prevents I/O while block size is being changed */
+	struct rw_semaphore	bd_block_size_semaphore;
 };
 
 /*
@@ -2414,6 +2416,8 @@ extern int generic_segment_checks(const 
 		unsigned long *nr_segs, size_t *count, int access_flags);
 
 /* fs/block_dev.c */
+extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			       unsigned long nr_segs, loff_t pos);
 extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
Index: linux-3.5-rc6-devel/fs/block_dev.c
===================================================================
--- linux-3.5-rc6-devel.orig/fs/block_dev.c	2012-07-16 01:14:33.000000000 +0200
+++ linux-3.5-rc6-devel/fs/block_dev.c	2012-07-16 01:37:28.000000000 +0200
@@ -124,6 +124,20 @@ int set_blocksize(struct block_device *b
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
+	/* Prevent starting I/O or mapping the device */
+	down_write(&bdev->bd_block_size_semaphore);
+
+	/* Check that the block device is not memory mapped */
+	mapping = bdev->bd_inode->i_mapping;
+	mutex_lock(&mapping->i_mmap_mutex);
+	if (!prio_tree_empty(&mapping->i_mmap) ||
+	    !list_empty(&mapping->i_mmap_nonlinear)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		up_write(&bdev->bd_block_size_semaphore);
+		return -EBUSY;
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
 		sync_blockdev(bdev);
@@ -131,6 +145,9 @@ int set_blocksize(struct block_device *b
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
 	}
+
+	up_write(&bdev->bd_block_size_semaphore);
+
 	return 0;
 }
 
@@ -472,6 +489,7 @@ static void init_once(void *foo)
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
+	init_rwsem(&bdev->bd_block_size_semaphore);
 }
 
 static inline void __bd_forget(struct inode *inode)
@@ -1567,6 +1585,22 @@ static long block_ioctl(struct file *fil
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			unsigned long nr_segs, loff_t pos)
+{
+	ssize_t ret;
+	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_aio_read);
+
 /*
  * Write data to the block device.  Only intended for the block device itself
  * and the raw driver which basically is a fake block device.
@@ -1578,10 +1612,13 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			 unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
+	down_read(&bdev->bd_block_size_semaphore);
+
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
 		ssize_t err;
@@ -1590,10 +1627,27 @@ ssize_t blkdev_aio_write(struct kiocb *i
 		if (err < 0 && ret > 0)
 			ret = err;
 	}
+
+	up_read(&bdev->bd_block_size_semaphore);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
+int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int ret;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_mmap(file, vma);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+
 /*
  * Try to release a page associated with block device when the system
  * is under memory pressure.
@@ -1624,9 +1678,9 @@ const struct file_operations def_blk_fop
 	.llseek		= block_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-  	.aio_read	= generic_file_aio_read,
+  	.aio_read	= blkdev_aio_read,
 	.aio_write	= blkdev_aio_write,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
Index: linux-3.5-rc6-devel/drivers/char/raw.c
===================================================================
--- linux-3.5-rc6-devel.orig/drivers/char/raw.c	2012-07-16 01:29:27.000000000 +0200
+++ linux-3.5-rc6-devel/drivers/char/raw.c	2012-07-16 01:30:04.000000000 +0200
@@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct 
 
 static const struct file_operations raw_fops = {
 	.read		= do_sync_read,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.write		= do_sync_write,
 	.aio_write	= blkdev_aio_write,
 	.fsync		= blkdev_fsync,
