Subject: Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
From: Laurence Oberman
To: Mike Snitzer, Hannes Reinecke
Cc: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, dm-devel@redhat.com
Date: Tue, 24 Jul 2018 09:22:20 -0400
Message-ID: <1532438540.9819.2.camel@redhat.com>
In-Reply-To: <20180724130703.GA30804@redhat.com>
References: <20180723163357.GA29658@redhat.com> <20180724130703.GA30804@redhat.com>

On Tue, 2018-07-24 at 09:07 -0400, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at 2:00am -0400,
> Hannes Reinecke wrote:
> 
> > On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> > > Hi,
> > > 
> > > I've opened the following public BZ:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> > > 
> > > Feel free to add comments to that BZ if you have a Red Hat bugzilla
> > > account.
> > > 
> > > But otherwise, happy to get as much feedback and discussion going purely
> > > on the relevant lists.  I've taken ~1.5 weeks to categorize and isolate
> > > this issue, but I've reached a point where I'm getting diminishing
> > > returns and could _really_ use the collective eyeballs and expertise of
> > > the community.  This is by far one of the nastiest cases of corruption
> > > I've seen in a while.  Not sure where the ultimate cause of the
> > > corruption lies (that's the money question), but it _feels_ rooted in
> > > NVMe and is unique to this particular workload, which I stumbled onto
> > > via a customer escalation and then trying to replicate an rbd device
> > > using a more approachable one (request-based DM multipath in this case).
> > 
> > I might be stating the obvious, but so far we have only considered
> > request-based multipath as being active for the _entire_ device.
> > To my knowledge we've never tested it when running on a partition.
> 
> True.  We only ever support mapping the partitions on top of
> request-based multipath (via dm-linear volumes created by kpartx).
> 
> > So, have you tested that request-based multipathing works on a
> > partition _at all_?  I'm not sure partition mapping is done correctly
> > here; we never remap the start of the request (nor the bio, come to
> > think of it), so it looks as if we would be doing the wrong thing here.
> > 
> > Have you checked that partition remapping is done correctly?
> 
> It clearly doesn't work.  Not quite following why, but...
> 
> After running the test, the partition table at the start of the whole
> NVMe device is overwritten by XFS.  So likely the IO destined for the
> dm-cache's "slow" device (a dm-mpath device on an NVMe partition) was
> issued to the whole NVMe device:
> 
> # pvcreate /dev/nvme1n1
> WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it? [y/n]
> 
> # vgcreate test /dev/nvme1n1
> # lvcreate -n slow -L 512G test
> WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe it? [y/n]: y
>   Wiping xfs signature on /dev/test/slow.
>   Logical volume "slow" created.
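
That symptom is consistent with cloned requests never being offset by the
partition start: a write to sector 0 of the stacked device lands on sector 0
of the whole NVMe disk and clobbers the partition table.  For bios, block core
applies the partition offset at submit time; what follows is only a simplified
sketch of that step (loosely modeled on blk_partition_remap() in
block/blk-core.c around v4.18, with error handling, zone checks and accounting
dropped), not the exact code.  Request clones inserted via
blk_insert_cloned_request(), as request-based DM does, never pass through this
path.

/*
 * Simplified sketch of the bio-level partition remap (loosely based on
 * blk_partition_remap() in block/blk-core.c, ~v4.18).
 */
#include <linux/bio.h>
#include <linux/genhd.h>

static void partition_remap_sketch(struct bio *bio)
{
	struct hd_struct *p;

	if (!bio->bi_partno)
		return;			/* already relative to the whole disk */

	p = disk_get_part(bio->bi_disk, bio->bi_partno);
	if (!p)
		return;			/* the real code fails the bio here */

	/* Shift the I/O so it lands inside the partition on the whole disk. */
	bio->bi_iter.bi_sector += p->start_sect;
	bio->bi_partno = 0;
	disk_put_part(p);
}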
> 
> Isn't this a failing of block core's partitioning?  Why should a target
> that is given the entire partition of a device need to be concerned with
> remapping IO?  Shouldn't block core handle that mapping?
> 
> Anyway, yesterday I went so far as to hack together request-based
> support for DM linear (because request-based DM cannot stack on
> bio-based DM).  With this, using request-based linear devices instead of
> conventional partitioning, I no longer see the XFS corruption when
> running the test:
> 
>  drivers/md/dm-linear.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index d10964d41fd7..d4a65dd20c6e 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -12,6 +12,7 @@
>  #include
>  #include
>  #include
> +#include
>  
>  #define DM_MSG_PREFIX "linear"
>  
> @@ -24,7 +25,7 @@ struct linear_c {
>  };
>  
>  /*
> - * Construct a linear mapping:
> + * Construct a linear mapping: [<# optional params> ]
>   */
>  static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>  {
> @@ -57,6 +58,11 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>  		goto bad;
>  	}
>  
> +	// FIXME: need to parse optional args
> +	// FIXME: model alloc_multipath_stage2()?
> +	// Call: dm_table_set_type()
> +	dm_table_set_type(ti->table, DM_TYPE_MQ_REQUEST_BASED);
> +
>  	ti->num_flush_bios = 1;
>  	ti->num_discard_bios = 1;
>  	ti->num_secure_erase_bios = 1;
> @@ -113,6 +119,37 @@ static int linear_end_io(struct dm_target *ti, struct bio *bio,
>  	return DM_ENDIO_DONE;
>  }
>  
> +static int linear_clone_and_map(struct dm_target *ti, struct request *rq,
> +				union map_info *map_context,
> +				struct request **__clone)
> +{
> +	struct linear_c *lc = ti->private;
> +	struct block_device *bdev = lc->dev->bdev;
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	struct request *clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE,
> +						BLK_MQ_REQ_NOWAIT);
> +	if (IS_ERR(clone)) {
> +		if (blk_queue_dying(q) || !q->mq_ops)
> +			return DM_MAPIO_DELAY_REQUEUE;
> +
> +		return DM_MAPIO_REQUEUE;
> +	}
> +
> +	clone->__sector = linear_map_sector(ti, rq->__sector);
> +	clone->bio = clone->biotail = NULL;
> +	clone->rq_disk = bdev->bd_disk;
> +	clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
> +	*__clone = clone;
> +
> +	return DM_MAPIO_REMAPPED;
> +}
> +
> +static void linear_release_clone(struct request *clone)
> +{
> +	blk_put_request(clone);
> +}
> +
>  static void linear_status(struct dm_target *ti, status_type_t type,
>  			  unsigned status_flags, char *result, unsigned maxlen)
>  {
> @@ -207,13 +244,15 @@ static size_t linear_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
>  
>  static struct target_type linear_target = {
>  	.name   = "linear",
> -	.version = {1, 4, 0},
> -	.features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
> +	.version = {1, 5, 0},
> +	.features = DM_TARGET_IMMUTABLE | DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
>  	.module = THIS_MODULE,
>  	.ctr    = linear_ctr,
>  	.dtr    = linear_dtr,
>  	.map    = linear_map,
>  	.end_io = linear_end_io,
> +	.clone_and_map_rq = linear_clone_and_map,
> +	.release_clone_rq = linear_release_clone,
>  	.status = linear_status,
>  	.prepare_ioctl = linear_prepare_ioctl,
>  	.iterate_devices = linear_iterate_devices,
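
For anyone reading the diff: the offset math in the new .clone_and_map_rq hook
is done by linear_map_sector(), a helper that already exists in
drivers/md/dm-linear.c (reproduced from memory below, so treat it as
approximate).  It adds the target's configured start offset to the
target-relative sector, which is presumably why the corruption disappears once
the partition offset is applied explicitly per request:

/* Existing helper in drivers/md/dm-linear.c (approximate) */
static sector_t linear_map_sector(struct dm_target *ti, sector_t bi_sector)
{
	struct linear_c *lc = ti->private;

	/*
	 * dm_target_offset() subtracts the target's start within the dm
	 * table; lc->start is the offset into the underlying device that
	 * was given at table-load time.
	 */
	return lc->start + dm_target_offset(ti, bi_sector);
}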

With Oracle setups and multipath, we have plenty of customers using non-NVMe
LUNs (i.e. FC) with a single partition on top of request-based multipath, with
no issues.  The same goes for file systems on top of multipath devices with a
single partition.  Sharing a disk between multiple partitions and multipath is
very uncommon, so it has to be the multiple partitions that trigger this.
Still, I guess we should test multiple partitions on non-NVMe in the lab setup
to make sure.