public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Kevin Corry <corryk@us.ibm.com>
To: Joe Thornber <joe@fib011235813.fsnet.co.uk>
Cc: Linus Torvalds <torvalds@transmeta.com>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>,
	lvm-devel@sistina.com
Subject: Re: [lvm-devel] Re: [PATCH] dm.c - device-mapper I/O path fixes
Date: Wed, 11 Dec 2002 06:52:36 -0600	[thread overview]
Message-ID: <02121106523600.29515@boiler> (raw)
In-Reply-To: <20021211121749.GA20782@reti>

On Wednesday 11 December 2002 06:17, Joe Thornber wrote:
> Kevin,
>
> On Tue, Dec 10, 2002 at 04:03:47PM -0600, Kevin Corry wrote:
> > Joe, Linus,
> >
> > This patch fixes problems with the device-mapper I/O path in 2.5.51. The
> > existing code does not properly split requests when necessary, and can
> > cause segfaults and/or data corruption. This can easily manifest itself
> > when running XFS on striped LVM volumes.
>
> Many thanks for doing this work, but _please_ split your patches up more.
> There are several changes rolled in here.

Sorry if I included too many changes in one post. I guess I didn't think the 
patch was really that big.

> I've split the patch up, and will post the ones I'm accepting as
> replies to this current mail.
>
> The full set of changes for 2.5.51 is available here:
> http://people.sistina.com/~thornber/patches/2.5-stable/2.5.51/
>
> This works for me with xfs and stripes (limited testing).
>
>
> These are the bits of your patch that I have queries about:
>
> --- linux-2.5.51a/drivers/md/dm.c	Tue Dec 10 11:01:13 2002
> +++ linux-2.5.51b/drivers/md/dm.c	Tue Dec 10 11:03:55 2002
> @@ -242,4 +242,4 @@
> -		bio_endio(io->bio, io->error ? 0 : io->bio->bi_size, io->error);
> +		bio_endio(io->bio, io->bio->bi_size, io->error);
>
> You seem to be assuming that io->bio->bi_size will always be zero if
> an error occurs.  I was not aware that this was the case.

I'm simply going by the convention used by the other kernel drivers. Do a
grep for bio_endio and bio_io_error in drivers/. Most drivers (e.g. md, loop, 
rd, umem) call bio_endio() and bio_io_error() with the current bi_size of the 
bio they're completing.


> @@ -261,15 +262,15 @@
>  {
>  	struct dm_io *io = bio->bi_private;
>
> -	/*
> -	 * Only call dec_pending if the clone has completely
> -	 * finished.  If a partial io errors I'm assuming it won't
> -	 * be requeued.  FIXME: check this.
> -	 */
> -	if (error || !bio->bi_size) {
> -		dec_pending(io, error);
> -		bio_put(bio);
> +	if (bio->bi_size)
> +		return 1;
> +
> +	if (error) {
> +		struct gendisk *disk = dm_disk(io->md);
> +		DMWARN("I/O error (%d) on device %s\n", error, disk->disk_name);
>  	}
> +	dec_pending(io, error);
> +	bio_put(bio);
>
>  	return 0;
>  }
>
>
> All you're doing here is adding a warning (which is nice), and making
> the same assumption about bio->bi_size in the case of an error.

Again, I changed this based on conventions used by other drivers. Take a look 
at loop_end_io_transfer() in drivers/block/loop.c, or end_request() and 
end_sync_write() in drivers/md/raid1.c. If a driver doesn't want to bother 
with partion bio completions (and DM shouldn't), it should do a
  if (bio->bi_size) return 1;
statement at the top of its callback. Check out the original comments 
regarding this in the BK tree at:
http://linux.bkbits.net:8080/linux-2.5/cset@1.536.40.4?nav=index.html|ChangeSet@-1y

> @@ -457,13 +483,13 @@
>  		up_read(&md->lock);
>
>  		if (bio_rw(bio) == READA) {
> -			bio_io_error(bio, 0);
> +			bio_io_error(bio, bio->bi_size);
>  			return 0;
>  		}
>
>  		r = queue_io(md, bio);
>  		if (r < 0) {
> -			bio_io_error(bio, 0);
> +			bio_io_error(bio, bio->bi_size);
>  			return 0;
>
>  		} else if (r == 0)
>
> Why is it better to say that all the io was 'done' rather than none?
> It did fail after all.

See comments above.

> @@ -369,24 +369,48 @@
>  {
>  	struct bio *clone, *bio = ci->bio;
>  	struct dm_target *ti = dm_table_find_target(ci->md->map, ci->sector);
> -	sector_t len = max_io_len(ci->md, bio->bi_sector, ti);
> +	sector_t bv_len, len = max_io_len(ci->md, ci->sector, ti);
> +	struct bio_vec *bv;
> +	int i, vcnt = bio->bi_vcnt - ci->idx;
>
>  	/* shorter than current target ? */
>  	if (ci->sector_count < len)
>  		len = ci->sector_count;
>
>  	/* create the clone */
> -	clone = bio_clone(ci->bio, GFP_NOIO);
> +	clone = bio_alloc(GFP_NOIO, vcnt);
> +	if (!clone) {
> +		dec_pending(ci->io, -ENOMEM);
> +		return;
> +	}
>  	clone->bi_sector = ci->sector;
> -	clone->bi_idx = ci->idx;
> +	clone->bi_bdev = bio->bi_bdev;
> +	clone->bi_rw = bio->bi_rw;
> +	clone->bi_vcnt = vcnt;
>  	clone->bi_size = len << SECTOR_SHIFT;
>  	clone->bi_end_io = clone_endio;
>  	clone->bi_private = ci->io;
>
> +	/* copy the original vector and adjust if necessary. */
> +	memcpy(clone->bi_io_vec, bio->bi_io_vec + ci->idx,
> +	       vcnt * sizeof(*clone->bi_io_vec));
> +	bv_len = len << SECTOR_SHIFT;
> +	bio_for_each_segment(bv, clone, i) {
> +		if (bv_len >= bv->bv_len) {
> +			bv_len -= bv->bv_len;
> +		} else {
> +			bv->bv_len = bv_len;
> +			clone->bi_vcnt = i + 1;
> +			break;
> +		}
> +	}
> +
> +	/* submit this io */
> +	__map_bio(ti, clone);
> +
>  	/* adjust the remaining io */
>  	ci->sector += len;
>  	ci->sector_count -= len;
> -	__map_bio(ti, clone);
>
>  	/*
>  	 * If we are not performing all remaining io in this
> @@ -395,9 +419,9 @@
>  	 */
>  	if (ci->sector_count) {
>  		while (len) {
> -			struct bio_vec *bv = clone->bi_io_vec + ci->idx;
> -			sector_t bv_len = bv->bv_len >> SECTOR_SHIFT;
> +			bv = bio->bi_io_vec + ci->idx;
> +			bv_len = bv->bv_len >> SECTOR_SHIFT;
>  			if (bv_len <= len)
>  				len -= bv_len;
>
>
> There is no need to use bio_alloc in preference to bio_clone, we're
> not changing the bvec in any way.  All of the above code is redundant.

Yes, we *are* going to have to change the bvec (as I implied in my original 
post). If you split a bio, and that split occurs in the middle of one of the 
bvecs (i.e. in the middle of a page), then the bv_len field in that bvec 
*must* be adjusted accordingly (which is what the bio_for_each_segment loop 
above does). Otherwise when blk_rq_map_sg() in drivers/block/ll_rw_block.c 
processes that bio, it will think that bvec contains a full page. Since that 
page obviously spans a stripe or PE boundary, this is going to cause data 
corruption.

-- 
Kevin Corry
corryk@us.ibm.com
http://evms.sourceforge.net/

  parent reply	other threads:[~2002-12-11 13:31 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2002-12-10 22:03 [PATCH] dm.c - device-mapper I/O path fixes Kevin Corry
2002-12-11 12:17 ` Joe Thornber
2002-12-11 12:19   ` Joe Thornber
2002-12-11 18:19     ` Denis Vlasenko
2002-12-11 13:16       ` Kevin Corry
2002-12-11 14:18         ` Joe Thornber
2002-12-11 19:24           ` Denis Vlasenko
2002-12-11 14:06             ` Kevin Corry
2002-12-11 21:12               ` Paul Mackerras
2002-12-12 12:30                 ` Kevin Corry
2002-12-11 19:19         ` Denis Vlasenko
2002-12-11 14:02           ` Kevin Corry
2002-12-11 15:12             ` [lvm-devel] " Joe Thornber
2002-12-11 14:58           ` Joe Thornber
2002-12-11 12:19   ` Joe Thornber
2002-12-11 12:20   ` Joe Thornber
2002-12-11 12:21   ` Joe Thornber
2002-12-11 12:52   ` Kevin Corry [this message]
2002-12-16  0:50   ` Linus Torvalds
2002-12-16 10:04     ` Joe Thornber
2002-12-16 10:06       ` 1/19 Joe Thornber
2002-12-16 17:07         ` 1/19 Linus Torvalds
2002-12-16 10:06       ` 2/19 Joe Thornber
2002-12-16 10:07       ` 3/19 Joe Thornber
2002-12-16 10:08       ` 4/19 Joe Thornber
2002-12-16 10:09       ` 5/19 Joe Thornber
2002-12-16 10:09       ` 6/19 Joe Thornber
2002-12-16 10:35         ` 6/19 Tomas Szepe
2002-12-16 10:38           ` 6/19 Tomas Szepe
2002-12-16 10:10       ` 7/19 Joe Thornber
2002-12-16 10:11       ` 8/19 Joe Thornber
2002-12-16 10:11       ` 9/19 Joe Thornber
2002-12-16 10:12       ` 10/19 Joe Thornber
2002-12-16 10:13       ` 11/19 Joe Thornber
2002-12-16 10:14       ` 12/19 Joe Thornber
2002-12-16 10:14       ` 13/19 Joe Thornber
2002-12-16 10:15       ` 14/19 Joe Thornber
2002-12-16 10:16       ` 15/19 Joe Thornber
2002-12-16 10:16       ` 16/19 Joe Thornber
2002-12-16 10:17       ` 17/19 Joe Thornber
2002-12-16 10:18       ` 18/19 Joe Thornber
2002-12-16 10:19       ` 19/19 Joe Thornber

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=02121106523600.29515@boiler \
    --to=corryk@us.ibm.com \
    --cc=joe@fib011235813.fsnet.co.uk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lvm-devel@sistina.com \
    --cc=torvalds@transmeta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox