From: Kevin Corry <corryk@us.ibm.com>
To: Joe Thornber <joe@fib011235813.fsnet.co.uk>
Cc: Linus Torvalds <torvalds@transmeta.com>,
Kernel Mailing List <linux-kernel@vger.kernel.org>,
lvm-devel@sistina.com
Subject: Re: [lvm-devel] Re: [PATCH] dm.c - device-mapper I/O path fixes
Date: Wed, 11 Dec 2002 06:52:36 -0600 [thread overview]
Message-ID: <02121106523600.29515@boiler> (raw)
In-Reply-To: <20021211121749.GA20782@reti>
On Wednesday 11 December 2002 06:17, Joe Thornber wrote:
> Kevin,
>
> On Tue, Dec 10, 2002 at 04:03:47PM -0600, Kevin Corry wrote:
> > Joe, Linus,
> >
> > This patch fixes problems with the device-mapper I/O path in 2.5.51. The
> > existing code does not properly split requests when necessary, and can
> > cause segfaults and/or data corruption. This can easily manifest itself
> > when running XFS on striped LVM volumes.
>
> Many thanks for doing this work, but _please_ split your patches up more.
> There are several changes rolled in here.
Sorry if I included too many changes in one post. I guess I didn't think the
patch was really that big.
> I've split the patch up, and will post the ones I'm accepting as
> replies to this current mail.
>
> The full set of changes for 2.5.51 is available here:
> http://people.sistina.com/~thornber/patches/2.5-stable/2.5.51/
>
> This works for me with xfs and stripes (limited testing).
>
>
> These are the bits of your patch that I have queries about:
>
> --- linux-2.5.51a/drivers/md/dm.c Tue Dec 10 11:01:13 2002
> +++ linux-2.5.51b/drivers/md/dm.c Tue Dec 10 11:03:55 2002
> @@ -242,4 +242,4 @@
> - bio_endio(io->bio, io->error ? 0 : io->bio->bi_size, io->error);
> + bio_endio(io->bio, io->bio->bi_size, io->error);
>
> You seem to be assuming that io->bio->bi_size will always be zero if
> an error occurs. I was not aware that this was the case.
I'm simply going by the convention used by the other kernel drivers. Do a
grep for bio_endio and bio_io_error in drivers/. Most drivers (e.g. md, loop,
rd, umem) call bio_endio() and bio_io_error() with the current bi_size of the
bio they're completing.
> @@ -261,15 +262,15 @@
> {
> struct dm_io *io = bio->bi_private;
>
> - /*
> - * Only call dec_pending if the clone has completely
> - * finished. If a partial io errors I'm assuming it won't
> - * be requeued. FIXME: check this.
> - */
> - if (error || !bio->bi_size) {
> - dec_pending(io, error);
> - bio_put(bio);
> + if (bio->bi_size)
> + return 1;
> +
> + if (error) {
> + struct gendisk *disk = dm_disk(io->md);
> + DMWARN("I/O error (%d) on device %s\n", error, disk->disk_name);
> }
> + dec_pending(io, error);
> + bio_put(bio);
>
> return 0;
> }
>
>
> All you're doing here is adding a warning (which is nice), and making
> the same assumption about bio->bi_size in the case of an error.
Again, I changed this based on conventions used by other drivers. Take a look
at loop_end_io_transfer() in drivers/block/loop.c, or end_request() and
end_sync_write() in drivers/md/raid1.c. If a driver doesn't want to bother
with partion bio completions (and DM shouldn't), it should do a
if (bio->bi_size) return 1;
statement at the top of its callback. Check out the original comments
regarding this in the BK tree at:
http://linux.bkbits.net:8080/linux-2.5/cset@1.536.40.4?nav=index.html|ChangeSet@-1y
> @@ -457,13 +483,13 @@
> up_read(&md->lock);
>
> if (bio_rw(bio) == READA) {
> - bio_io_error(bio, 0);
> + bio_io_error(bio, bio->bi_size);
> return 0;
> }
>
> r = queue_io(md, bio);
> if (r < 0) {
> - bio_io_error(bio, 0);
> + bio_io_error(bio, bio->bi_size);
> return 0;
>
> } else if (r == 0)
>
> Why is it better to say that all the io was 'done' rather than none?
> It did fail after all.
See comments above.
> @@ -369,24 +369,48 @@
> {
> struct bio *clone, *bio = ci->bio;
> struct dm_target *ti = dm_table_find_target(ci->md->map, ci->sector);
> - sector_t len = max_io_len(ci->md, bio->bi_sector, ti);
> + sector_t bv_len, len = max_io_len(ci->md, ci->sector, ti);
> + struct bio_vec *bv;
> + int i, vcnt = bio->bi_vcnt - ci->idx;
>
> /* shorter than current target ? */
> if (ci->sector_count < len)
> len = ci->sector_count;
>
> /* create the clone */
> - clone = bio_clone(ci->bio, GFP_NOIO);
> + clone = bio_alloc(GFP_NOIO, vcnt);
> + if (!clone) {
> + dec_pending(ci->io, -ENOMEM);
> + return;
> + }
> clone->bi_sector = ci->sector;
> - clone->bi_idx = ci->idx;
> + clone->bi_bdev = bio->bi_bdev;
> + clone->bi_rw = bio->bi_rw;
> + clone->bi_vcnt = vcnt;
> clone->bi_size = len << SECTOR_SHIFT;
> clone->bi_end_io = clone_endio;
> clone->bi_private = ci->io;
>
> + /* copy the original vector and adjust if necessary. */
> + memcpy(clone->bi_io_vec, bio->bi_io_vec + ci->idx,
> + vcnt * sizeof(*clone->bi_io_vec));
> + bv_len = len << SECTOR_SHIFT;
> + bio_for_each_segment(bv, clone, i) {
> + if (bv_len >= bv->bv_len) {
> + bv_len -= bv->bv_len;
> + } else {
> + bv->bv_len = bv_len;
> + clone->bi_vcnt = i + 1;
> + break;
> + }
> + }
> +
> + /* submit this io */
> + __map_bio(ti, clone);
> +
> /* adjust the remaining io */
> ci->sector += len;
> ci->sector_count -= len;
> - __map_bio(ti, clone);
>
> /*
> * If we are not performing all remaining io in this
> @@ -395,9 +419,9 @@
> */
> if (ci->sector_count) {
> while (len) {
> - struct bio_vec *bv = clone->bi_io_vec + ci->idx;
> - sector_t bv_len = bv->bv_len >> SECTOR_SHIFT;
> + bv = bio->bi_io_vec + ci->idx;
> + bv_len = bv->bv_len >> SECTOR_SHIFT;
> if (bv_len <= len)
> len -= bv_len;
>
>
> There is no need to use bio_alloc in preference to bio_clone, we're
> not changing the bvec in any way. All of the above code is redundant.
Yes, we *are* going to have to change the bvec (as I implied in my original
post). If you split a bio, and that split occurs in the middle of one of the
bvecs (i.e. in the middle of a page), then the bv_len field in that bvec
*must* be adjusted accordingly (which is what the bio_for_each_segment loop
above does). Otherwise when blk_rq_map_sg() in drivers/block/ll_rw_block.c
processes that bio, it will think that bvec contains a full page. Since that
page obviously spans a stripe or PE boundary, this is going to cause data
corruption.
--
Kevin Corry
corryk@us.ibm.com
http://evms.sourceforge.net/
next prev parent reply other threads:[~2002-12-11 13:31 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2002-12-10 22:03 [PATCH] dm.c - device-mapper I/O path fixes Kevin Corry
2002-12-11 12:17 ` Joe Thornber
2002-12-11 12:19 ` Joe Thornber
2002-12-11 18:19 ` Denis Vlasenko
2002-12-11 13:16 ` Kevin Corry
2002-12-11 14:18 ` Joe Thornber
2002-12-11 19:24 ` Denis Vlasenko
2002-12-11 14:06 ` Kevin Corry
2002-12-11 21:12 ` Paul Mackerras
2002-12-12 12:30 ` Kevin Corry
2002-12-11 19:19 ` Denis Vlasenko
2002-12-11 14:02 ` Kevin Corry
2002-12-11 15:12 ` [lvm-devel] " Joe Thornber
2002-12-11 14:58 ` Joe Thornber
2002-12-11 12:19 ` Joe Thornber
2002-12-11 12:20 ` Joe Thornber
2002-12-11 12:21 ` Joe Thornber
2002-12-11 12:52 ` Kevin Corry [this message]
2002-12-16 0:50 ` Linus Torvalds
2002-12-16 10:04 ` Joe Thornber
2002-12-16 10:06 ` 1/19 Joe Thornber
2002-12-16 17:07 ` 1/19 Linus Torvalds
2002-12-16 10:06 ` 2/19 Joe Thornber
2002-12-16 10:07 ` 3/19 Joe Thornber
2002-12-16 10:08 ` 4/19 Joe Thornber
2002-12-16 10:09 ` 5/19 Joe Thornber
2002-12-16 10:09 ` 6/19 Joe Thornber
2002-12-16 10:35 ` 6/19 Tomas Szepe
2002-12-16 10:38 ` 6/19 Tomas Szepe
2002-12-16 10:10 ` 7/19 Joe Thornber
2002-12-16 10:11 ` 8/19 Joe Thornber
2002-12-16 10:11 ` 9/19 Joe Thornber
2002-12-16 10:12 ` 10/19 Joe Thornber
2002-12-16 10:13 ` 11/19 Joe Thornber
2002-12-16 10:14 ` 12/19 Joe Thornber
2002-12-16 10:14 ` 13/19 Joe Thornber
2002-12-16 10:15 ` 14/19 Joe Thornber
2002-12-16 10:16 ` 15/19 Joe Thornber
2002-12-16 10:16 ` 16/19 Joe Thornber
2002-12-16 10:17 ` 17/19 Joe Thornber
2002-12-16 10:18 ` 18/19 Joe Thornber
2002-12-16 10:19 ` 19/19 Joe Thornber
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=02121106523600.29515@boiler \
--to=corryk@us.ibm.com \
--cc=joe@fib011235813.fsnet.co.uk \
--cc=linux-kernel@vger.kernel.org \
--cc=lvm-devel@sistina.com \
--cc=torvalds@transmeta.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.