[Drbd-dev] [PATCH] Supporting barriers in DRBD, part 1

Distributed Replicated Block Device (DRBD) development
 help / color / mirror / Atom feed

* [Drbd-dev] [PATCH] Supporting barriers in DRBD, part 1
@ 2007-11-25 18:16 Graham, Simon
  2007-11-25 19:20 ` Lars Ellenberg
  2007-11-25 20:30 ` Graham, Simon
  0 siblings, 2 replies; 4+ messages in thread
From: Graham, Simon @ 2007-11-25 18:16 UTC (permalink / raw)
  To: drbd-dev


[-- Attachment #1.1: Type: text/plain, Size: 2373 bytes --]

As part of the work to properly handle on-disk caches, I have enabled
barrier support in DRBD for requests from above - the attached proposed
patch includes the following changes (patch is against 8.0.6):

 

1.       Stop rejecting barrier requests in drbd_make_request_common()
unless we know they are not supported (see point 1 in the 'things to do'
below though).

2.       Fixed a few places where the code assumed
BIO_RW_BARRIER/BIO_RW_SYNC were masks rather than bit numbers

3.       Added barriers to AL/MD writes, including detecting if the
underlying device does not implement barriers and backing off in that
case.

4.       Forced a meta data write when a disk is attached so that we
determine early on whether or not barriers are supported.

5.       Extended the tracing of BIO's to include internally generated
BIOs as well as the ones from above

6.       I reformatted about_to_complete_local_write() - not necessary
but I was trying to read the code...

 

Things to do (potentially):

1.       RIght now, the code assumes that either both systems support
barriers or neither do - should probably detect the mixed case and if
either side does not support for any given device, reject barrier
requests from above - I'm already setting the flag in the mdev when the
disk is attached - we could pass this flag between the two systems and
set the barrier-not-supported flag as the union of the two systems'
values. Off hand, I can't see an easy place to add code to pass this
capability between the systems.

2.       Should complete bitmap writes be issued with a barrier? If so,
then should this be just the first or last or all bitmap I/Os? I think
the last but I'm not sure.

3.       I think we can remove the #ifdef BIO_RW_XXX - certainly they
are not present everywhere these macros are referenced...

 

I've tested this on a system that does support barriers (2.6.18 based
with DRBD on top of LVM volumes) - it's a little hard for me to test in
a case that does not support barriers - clearly that needs to be tested
before this can be applied... I know that there are several flavours of
md device that do not support barriers (linear, raid0, multipath for
example), so that might be a somewhat easy way to setup a test - create
a suitable md device and run drbd on top of it

 

/simgr


[-- Attachment #1.2: Type: text/html, Size: 7428 bytes --]

[-- Attachment #2: drbd-barrier.patch --]
[-- Type: application/octet-stream, Size: 14848 bytes --]

Index: drbd-8.0.6/drbd/drbd_receiver.c
===================================================================
--- drbd-8.0.6/drbd/drbd_receiver.c	(revision 21229)
+++ drbd-8.0.6/drbd/drbd_receiver.c	(working copy)
@@ -1094,6 +1094,8 @@
 	       INFO("submit EE (RS)WRITE sec=%llus size=%u ee=%p\n",
 		    (unsigned long long)e->sector,e->size,e);
 	       );
+
+	dump_internal_bio("Sec", mdev, WRITE, e->private_bio, 0);
 	drbd_generic_make_request(mdev,WRITE,DRBD_FAULT_RS_WR,e->private_bio);
 	/* accounting done in endio */
 
@@ -1316,6 +1318,7 @@
 	struct Tl_epoch_entry *e;
 	Drbd_Data_Packet *p = (Drbd_Data_Packet*)h;
 	int header_size, data_size;
+	int rw = WRITE;
 	unsigned int barrier_nr = 0;
 	unsigned int epoch_size = 0;
 	u32 dp_flags;
@@ -1359,10 +1362,10 @@
 
 	dp_flags = be32_to_cpu(p->dp_flags);
 	if ( dp_flags & DP_HARDBARRIER ) {
-		e->private_bio->bi_rw |= BIO_RW_BARRIER;
+	    rw |= (1<<BIO_RW_BARRIER);
 	}
 	if ( dp_flags & DP_RW_SYNC ) {
-		e->private_bio->bi_rw |= BIO_RW_SYNC;
+	    rw |= (1<<BIO_RW_SYNC);
 	}
 	if ( dp_flags & DP_MAY_SET_IN_SYNC ) {
 		e->flags |= EE_MAY_SET_IN_SYNC;
@@ -1550,7 +1553,7 @@
 		} else {
 			e->barrier_nr = mdev->next_barrier_nr;
 		}
-		e->private_bio->bi_rw |= BIO_RW_BARRIER;
+		rw |= (1<<BIO_RW_BARRIER);
 		mdev->next_barrier_nr = 0;
 	}
 	list_add(&e->w.list,&mdev->active_ee);
@@ -1592,7 +1595,8 @@
 		    (unsigned long long)e->sector,e->size,e);
 	       );
 	/* FIXME drbd_al_begin_io in case we have two primaries... */
-	drbd_generic_make_request(mdev,WRITE,DRBD_FAULT_DT_WR,e->private_bio);
+	dump_internal_bio("Sec", mdev, rw, e->private_bio, 0);
+	drbd_generic_make_request(mdev,rw,DRBD_FAULT_DT_WR,e->private_bio);
 	/* accounting done in endio */
 
 	maybe_kick_lo(mdev);
@@ -1688,6 +1692,7 @@
 		    (unsigned long long)e->sector,e->size,e);
 	       );
 	/* FIXME actually, it could be a READA originating from the peer ... */
+	dump_internal_bio("Sec",mdev,READ,e->private_bio,0);
 	drbd_generic_make_request(mdev,READ,fault_type,e->private_bio);
 	maybe_kick_lo(mdev);
 
Index: drbd-8.0.6/drbd/drbd_nl.c
===================================================================
--- drbd-8.0.6/drbd/drbd_nl.c	(revision 21229)
+++ drbd-8.0.6/drbd/drbd_nl.c	(working copy)
@@ -1003,6 +1003,9 @@
 		dec_local(mdev);
 	}
 
+	/* Force meta data to be written to ensure we determine if barriers are supported */
+	drbd_md_mark_dirty(mdev);
+
 	drbd_md_sync(mdev);
 
 	reply->ret_code = retcode;
Index: drbd-8.0.6/drbd/drbd_actlog.c
===================================================================
--- drbd-8.0.6/drbd/drbd_actlog.c	(revision 21229)
+++ drbd-8.0.6/drbd/drbd_actlog.c	(working copy)
@@ -39,32 +39,57 @@
 				 struct page *page, sector_t sector,
 				 int rw, int size)
 {
-	struct bio *bio = bio_alloc(GFP_NOIO, 1);
-	struct completion event;
+	struct bio *bio;
+	struct drbd_md_io md_io;
 	int ok;
 
+	md_io.mdev = mdev;
+	init_completion(&md_io.event);
+	md_io.error = 0;
+
+#ifdef BIO_RW_BARRIER
+	if (rw == WRITE && !(mdev->flags & NO_BARRIER_SUPP))
+	    rw |= (1<<BIO_RW_BARRIER);
+#endif
+#ifdef BIO_RW_SYNC
+	rw |= (1 << BIO_RW_SYNC);
+#endif
+
+ retry:
+	bio = bio_alloc(GFP_NOIO, 1);
 	bio->bi_bdev = bdev->md_bdev;
 	bio->bi_sector = sector;
 	ok = (bio_add_page(bio, page, size, 0) == size);
 	if(!ok) goto out;
-	init_completion(&event);
-	bio->bi_private = &event;
+	bio->bi_private = &md_io;
 	bio->bi_end_io = drbd_md_io_complete;
 
+	dump_internal_bio("Md",mdev,rw,bio,0);
+
 	if (FAULT_ACTIVE(mdev, (rw & WRITE)? DRBD_FAULT_MD_WR:DRBD_FAULT_MD_RD)) {
 		bio->bi_rw |= rw;
 		bio_endio(bio,bio->bi_size,-EIO);
 	}
 	else {
-#ifdef BIO_RW_SYNC
-		submit_bio(rw | (1 << BIO_RW_SYNC), bio);
-#else
 		submit_bio(rw, bio);
+#ifndef BIO_RW_SYNC
 		drbd_blk_run_queue(bdev_get_queue(bdev->md_bdev));
 #endif
 	}
-	wait_for_completion(&event);
+	wait_for_completion(&md_io.event);
 	ok = test_bit(BIO_UPTODATE, &bio->bi_flags);
+
+#ifdef BIO_RW_BARRIER
+	/* check for unsupported barrier op */
+	if (unlikely(md_io.error == -EOPNOTSUPP && (rw & BIO_RW_BARRIER))) {
+		/* Try again with no barrier */
+		WARN("Barriers not supported - disabling");
+		mdev->flags |= NO_BARRIER_SUPP;
+		rw &= ~BIO_RW_BARRIER;
+		bio_put(bio);
+		goto retry;
+	}
+#endif
  out:
 	bio_put(bio);
 	return ok;
Index: drbd-8.0.6/drbd/drbd_worker.c
===================================================================
--- drbd-8.0.6/drbd/drbd_worker.c	(revision 21229)
+++ drbd-8.0.6/drbd/drbd_worker.c	(working copy)
@@ -64,11 +64,17 @@
  */
 int drbd_md_io_complete(struct bio *bio, unsigned int bytes_done, int error)
 {
+	struct drbd_md_io *md_io;
+
 	if (bio->bi_size) return 1;
-	/* error parameter ignored:
-	 * drbd_md_sync_page_io explicitly tests bio_uptodate(bio); */
 
-	complete((struct completion*)bio->bi_private);
+	md_io = (struct drbd_md_io *)bio->bi_private;
+
+	md_io->error = error;
+
+	dump_internal_bio("Md", md_io->mdev, 0, bio, 1);
+
+	complete(&md_io->event);
 	return 0;
 }
 
@@ -99,6 +105,8 @@
 
 	D_ASSERT(e->block_id != ID_VACANT);
 
+	dump_internal_bio("Sec", mdev, 0, bio, 1);
+
 	spin_lock_irqsave(&mdev->req_lock,flags);
 	mdev->read_cnt += e->size >> 9;
 	list_del(&e->w.list);
@@ -145,6 +153,8 @@
 
 	D_ASSERT(e->block_id != ID_VACANT);
 
+	dump_internal_bio("Sec", mdev, 0, bio, 1);
+
 	spin_lock_irqsave(&mdev->req_lock,flags);
 	mdev->writ_cnt += e->size >> 9;
 	is_syncer_req = is_syncer_block_id(e->block_id);
@@ -210,6 +220,8 @@
 		error = -EIO;
 	}
 
+	dump_internal_bio("Pri", mdev, 0, bio, 1);
+
 	/* to avoid recursion in _req_mod */
 	what = error
 	       ? (bio_data_dir(bio) == WRITE)
Index: drbd-8.0.6/drbd/drbd_main.c
===================================================================
--- drbd-8.0.6/drbd/drbd_main.c	(revision 21229)
+++ drbd-8.0.6/drbd/drbd_main.c	(working copy)
@@ -1710,14 +1710,18 @@
 	p.seq_num  = cpu_to_be32( req->seq_num =
 				  atomic_add_return(1,&mdev->packet_seq) );
 	dp_flags = 0;
-	if(req->master_bio->bi_rw & BIO_RW_BARRIER) {
+
+	/* NOTE: no need to check if barriers supported here as we would
+	 *       not pass the test in make_request_common in that case
+	 */
+	if (bio_barrier(req->master_bio)) {
 		dp_flags |= DP_HARDBARRIER;
 	}
-	if(req->master_bio->bi_rw & BIO_RW_SYNC) {
+	if (bio_sync(req->master_bio)) {
 		dp_flags |= DP_RW_SYNC;
 	}
-	if(mdev->state.conn >= SyncSource &&
-	   mdev->state.conn <= PausedSyncT) {
+	if (mdev->state.conn >= SyncSource &&
+	    mdev->state.conn <= PausedSyncT) {
 		dp_flags |= DP_MAY_SET_IN_SYNC;
 	}
 
@@ -3229,7 +3233,7 @@
 
 // Debug routine to dump info about bio
 
-void _dump_bio(drbd_dev *mdev, struct bio *bio, int complete)
+void _dump_bio(const char *pfx, drbd_dev *mdev, int rw, struct bio *bio, int complete)
 {
 #ifdef CONFIG_LBD
 #define SECTOR_FORMAT "%Lx"
@@ -3242,16 +3246,27 @@
 	char *faddr = (char *)(lowaddr);
 	struct bio_vec *bvec;
 	int segno;
+	int biorw, biobarrier, biosync;
 
-	INFO("%s %s Bio:%p - %soffset " SECTOR_FORMAT ", size %x\n",
+	rw |= bio->bi_rw;
+
+	biorw      = (rw & (RW_MASK|RWA_MASK));
+	biobarrier = (rw & (1<<BIO_RW_BARRIER));
+	biosync    = (rw & (1<<BIO_RW_SYNC));
+
+	INFO("%s %s:%s%s%s Bio:%p - %soffset " SECTOR_FORMAT ", size %x\n",
 	     complete? "<<<":">>>",
-	     bio_rw(bio)==WRITE?"Write":"Read",bio,
+	     pfx,
+	     biorw==WRITE?"Write":"Read",
+	     biobarrier?":B":"",
+	     biosync?":S":"",
+	     bio,
 	     complete? (drbd_bio_uptodate(bio)? "Success, ":"Failed, ") : "",
 	     bio->bi_sector << SECTOR_SHIFT,
 	     bio->bi_size);
 
 	if (trace_level >= TraceLvlMetrics &&
-	    ((bio_rw(bio) == WRITE) ^ complete) ) {
+	    ((biorw == WRITE) ^ complete) ) {
 		printk(KERN_DEBUG "  ind     page   offset   length\n");
 		__bio_for_each_segment(bvec, bio, segno, 0) {
 			printk(KERN_DEBUG "  [%d] %p %8.8x %8.8x\n",segno,
Index: drbd-8.0.6/drbd/drbd_req.c
===================================================================
--- drbd-8.0.6/drbd/drbd_req.c	(revision 21229)
+++ drbd-8.0.6/drbd/drbd_req.c	(working copy)
@@ -185,58 +185,58 @@
 static void _about_to_complete_local_write(drbd_dev *mdev, drbd_request_t *req)
 {
 	const unsigned long s = req->rq_state;
-			drbd_request_t *i;
-			struct Tl_epoch_entry *e;
-			struct hlist_node *n;
-			struct hlist_head *slot;
+	drbd_request_t *i;
+	struct Tl_epoch_entry *e;
+	struct hlist_node *n;
+	struct hlist_head *slot;
 
-			/* before we can signal completion to the upper layers,
-			 * we may need to close the current epoch */
-			if (req->epoch == mdev->newest_barrier->br_number)
-				set_bit(ISSUE_BARRIER,&mdev->flags);
+	/* before we can signal completion to the upper layers,
+	 * we may need to close the current epoch */
+	if (req->epoch == mdev->newest_barrier->br_number)
+		set_bit(ISSUE_BARRIER,&mdev->flags);
 
-			/* we need to do the conflict detection stuff,
-			 * if we have the ee_hash (two_primaries) and
-			 * this has been on the network */
-			if ((s & RQ_NET_DONE) && mdev->ee_hash != NULL) {
-				const sector_t sector = req->sector;
-				const int size = req->size;
+	/* we need to do the conflict detection stuff,
+	 * if we have the ee_hash (two_primaries) and
+	 * this has been on the network */
+	if ((s & RQ_NET_DONE) && mdev->ee_hash != NULL) {
+		const sector_t sector = req->sector;
+		const int size = req->size;
 
-				/* ASSERT:
-				 * there must be no conflicting requests, since
-				 * they must have been failed on the spot */
+		/* ASSERT:
+		 * there must be no conflicting requests, since
+		 * they must have been failed on the spot */
 #define OVERLAPS overlaps(sector, size, i->sector, i->size)
-				slot = tl_hash_slot(mdev,sector);
-				hlist_for_each_entry(i, n, slot, colision) {
-					if (OVERLAPS) {
-						ALERT("LOGIC BUG: completed: %p %llus +%u; other: %p %llus +%u\n",
-						      req, (unsigned long long)sector, size,
-						      i,   (unsigned long long)i->sector, i->size);
-					}
-				}
+		slot = tl_hash_slot(mdev,sector);
+		hlist_for_each_entry(i, n, slot, colision) {
+			if (OVERLAPS) {
+				ALERT("LOGIC BUG: completed: %p %llus +%u; other: %p %llus +%u\n",
+				      req, (unsigned long long)sector, size,
+				      i,   (unsigned long long)i->sector, i->size);
+			}
+		}
 
-				/* maybe "wake" those conflicting epoch entries
-				 * that wait for this request to finish.
-				 *
-				 * currently, there can be only _one_ such ee
-				 * (well, or some more, which would be pending
-				 * DiscardAck not yet sent by the asender...),
-				 * since we block the receiver thread upon the
-				 * first conflict detection, which will wait on
-				 * misc_wait.  maybe we want to assert that?
-				 *
-				 * anyways, if we found one,
-				 * we just have to do a wake_up.  */
+		/* maybe "wake" those conflicting epoch entries
+		 * that wait for this request to finish.
+		 *
+		 * currently, there can be only _one_ such ee
+		 * (well, or some more, which would be pending
+		 * DiscardAck not yet sent by the asender...),
+		 * since we block the receiver thread upon the
+		 * first conflict detection, which will wait on
+		 * misc_wait.  maybe we want to assert that?
+		 *
+		 * anyways, if we found one,
+		 * we just have to do a wake_up.  */
 #undef OVERLAPS
 #define OVERLAPS overlaps(sector, size, e->sector, e->size)
-				slot = ee_hash_slot(mdev,req->sector);
-				hlist_for_each_entry(e, n, slot, colision) {
-					if (OVERLAPS) {
-						wake_up(&mdev->misc_wait);
-						break;
-					}
-				}
+		slot = ee_hash_slot(mdev,req->sector);
+		hlist_for_each_entry(e, n, slot, colision) {
+			if (OVERLAPS) {
+				wake_up(&mdev->misc_wait);
+				break;
 			}
+		}
+	}
 #undef OVERLAPS
 }
 
@@ -973,7 +973,6 @@
 			local = 0;
 		}
 		if (remote) dec_ap_pending(mdev);
-		dump_bio(mdev,req->master_bio,1);
 		/* THINK: do we want to fail it (-EIO), or pretend success? */
 		bio_endio(req->master_bio, req->master_bio->bi_size, 0);
 		req->master_bio = NULL;
@@ -1000,6 +999,8 @@
 		 * was not detached below us? */
 		req->private_bio->bi_bdev = mdev->bc->backing_bdev;
 
+		dump_internal_bio("Pri",mdev,rw,req->private_bio,0);
+
 		if (FAULT_ACTIVE(mdev, rw==WRITE ? DRBD_FAULT_DT_WR :
 				       ( rw==READ ? DRBD_FAULT_DT_RD :
   				                   DRBD_FAULT_DT_RA ) ))
@@ -1075,8 +1076,13 @@
 		return 0;
 	}
 
-	/* Currently our BARRIER code is disabled. */
-	if(unlikely(bio_barrier(bio))) {
+	/* Reject barrier requests if we know the underlying device does
+	 * not support them.
+	 * XXX: Need to get this info from peer as well some how so we
+	 * XXX: reject if EITHER side does not support them,,,
+	*/
+	if(unlikely(bio_barrier(bio) && (mdev->flags & NO_BARRIER_SUPP))) {
+		WARN("Rejecting barrier request as underlying device does not support\n");
 		bio_endio(bio, bio->bi_size, -EOPNOTSUPP);
 		return 0;
 	}
Index: drbd-8.0.6/drbd/drbd_int.h
===================================================================
--- drbd-8.0.6/drbd/drbd_int.h	(revision 21229)
+++ drbd-8.0.6/drbd/drbd_int.h	(working copy)
@@ -698,7 +698,8 @@
 	CRASHED_PRIMARY,	// This node was a crashed primary. Gets
 	                        // cleared when the state.conn  goes into
 	                        // Connected state.
-	WRITE_BM_AFTER_RESYNC	// A kmalloc() during resync failed
+	WRITE_BM_AFTER_RESYNC,	// A kmalloc() during resync failed
+	NO_BARRIER_SUPP,        // underlying block device doesn't implement barriers
 };
 
 struct drbd_bitmap; // opaque for Drbd_Conf
@@ -767,6 +768,12 @@
 	struct disk_conf dc; /* The user provided config... */
 };
 
+struct drbd_md_io {
+	struct Drbd_Conf *mdev;
+	struct completion event;
+	int error;
+};
+
 struct Drbd_Conf {
 #ifdef PARANOIA
 	long magic;
@@ -1204,6 +1211,7 @@
 	TraceTypeUnplug = 0x00000020,
 	TraceTypeNl     = 0x00000040,
 	TraceTypeALExts = 0x00000080,
+	TraceTypeIntRq  = 0x00000100,
 };
 
 static inline int
@@ -1247,14 +1255,20 @@
 			      unsigned int length);
 
 // Bio printing support
-extern void _dump_bio(drbd_dev *mdev, struct bio *bio, int complete);
+extern void _dump_bio(const char *pfx, drbd_dev *mdev, int rw, struct bio *bio, int complete);
 
 static inline void dump_bio(drbd_dev *mdev, struct bio *bio, int complete) {
 	MTRACE(TraceTypeRq,TraceLvlSummary,
-	       _dump_bio(mdev, bio, complete);
+	       _dump_bio("Rq", mdev, 0, bio, complete);
 		);
 }
 
+static inline void dump_internal_bio(const char *pfx, drbd_dev *mdev, int rw, struct bio *bio, int complete) {
+	MTRACE(TraceTypeIntRq,TraceLvlSummary,
+	       _dump_bio(pfx, mdev, rw, bio, complete);
+		);
+}
+
 // Packet dumping support
 extern void _dump_packet(drbd_dev *mdev, struct socket *sock,
 			 int recv, Drbd_Polymorph_Packet *p, char* file, int line);
@@ -1274,6 +1288,7 @@
 #define TRACE(ignored...) ((void)0)
 
 #define dump_bio(ignored...) ((void)0)
+#define dump_internal_bio(ignored...) ((void)0)
 #define dump_packet(ignored...) ((void)0)
 #endif
 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Drbd-dev] [PATCH] Supporting barriers in DRBD, part 1
  2007-11-25 18:16 [Drbd-dev] [PATCH] Supporting barriers in DRBD, part 1 Graham, Simon
@ 2007-11-25 19:20 ` Lars Ellenberg
  2007-11-25 20:30 ` Graham, Simon
  1 sibling, 0 replies; 4+ messages in thread
From: Lars Ellenberg @ 2007-11-25 19:20 UTC (permalink / raw)
  To: drbd-dev

On Sun, Nov 25, 2007 at 01:16:15PM -0500, Graham, Simon wrote:
> As part of the work to properly handle on-disk caches, I have enabled
> barrier support in DRBD for requests from above - the attached proposed
> patch includes the following changes (patch is against 8.0.6):

thank you very much, just integrated into current drbd-8.0 git.
will discuss with phil tomorrow, do some tests with and without
barrier support, and figure out the best way to map differing support on
both nodes.

since you say "part 1", what will the next part be?
defer "_req_is_done" until the corresponding barrier ack,
even for protocol C?

> 1.       Stop rejecting barrier requests in drbd_make_request_common()
> unless we know they are not supported (see point 1 in the 'things to do'
> below though).

I think this could be done by either
	offloading to the "trusted" admin
	via a configuration setting like use-bmbv

	and/or setting a flag in the "feature bits" area
	during handshake.

	or, preferably, we finally allocate local and remote error flag
	members in struct drbd_request, and properly deal with it :)
	
	meaning we could (at least for protocol C) notice, distinguish,
	and recover from both local and remote ENOTSUPP, for a barrier
	request.

> 2.       Fixed a few places where the code assumed
> BIO_RW_BARRIER/BIO_RW_SYNC were masks rather than bit numbers

oops.

> 3.       Added barriers to AL/MD writes, including detecting if the
> underlying device does not implement barriers and backing off in that
> case.

looks good.

> 4.       Forced a meta data write when a disk is attached so that we
> determine early on whether or not barriers are supported.

remains the problem of storage area device != meta data area device.

> 5.       Extended the tracing of BIO's to include internally generated
> BIOs as well as the ones from above

would you agree if we changed that to no longer use printk,
but to netlink-broadcast to userspace,
so you'd be able to see them with "drbdsetup events"?

I did that already for some other reasons,
so the code is basically there.

> 6.       I reformatted about_to_complete_local_write() - not necessary
> but I was trying to read the code...

I already noticed and fixed the two additional tabs :)

> Things to do (potentially):
> 
> 1.       RIght now, the code assumes that either both systems support
> barriers or neither do - should probably detect the mixed case and if
> either side does not support for any given device, reject barrier
> requests from above - I'm already setting the flag in the mdev when the
> disk is attached - we could pass this flag between the two systems and
> set the barrier-not-supported flag as the union of the two systems'
> values. Off hand, I can't see an easy place to add code to pass this
> capability between the systems.

during handshake, using one of the as yet unused reserved feature bits.

> 2.       Should complete bitmap writes be issued with a barrier? If so,
> then should this be just the first or last or all bitmap I/Os? I think
> the last but I'm not sure.

if the guarantee should be that the bitmap is on disk when we think it
is, I'd agree it has to be the last.

> 3.       I think we can remove the #ifdef BIO_RW_XXX - certainly they
> are not present everywhere these macros are referenced...

I'll have to check that. we still try to support even 2.6.5-something.

> I've tested this on a system that does support barriers (2.6.18 based
> with DRBD on top of LVM volumes) - it's a little hard for me to test in
> a case that does not support barriers - clearly that needs to be tested
> before this can be applied... I know that there are several flavours of
> md device that do not support barriers (linear, raid0, multipath for
> example), so that might be a somewhat easy way to setup a test - create
> a suitable md device and run drbd on top of it

will do.

thank you for that.

-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: [Drbd-dev] [PATCH] Supporting barriers in DRBD, part 1
  2007-11-25 18:16 [Drbd-dev] [PATCH] Supporting barriers in DRBD, part 1 Graham, Simon
  2007-11-25 19:20 ` Lars Ellenberg
@ 2007-11-25 20:30 ` Graham, Simon
  2007-11-25 21:22   ` Lars Ellenberg
  1 sibling, 1 reply; 4+ messages in thread
From: Graham, Simon @ 2007-11-25 20:30 UTC (permalink / raw)
  To: Lars Ellenberg, drbd-dev

Thanks for the comments Lars,

To answer questions/comments:

> since you say "part 1", what will the next part be?
> defer "_req_is_done" until the corresponding barrier ack,
> even for protocol C?
> 

Well, Part 2 will be integrating the TL into recovery when we lose
contact with the secondary - not sure I want to add this feature as
well.

> 
> 	or, preferably, we finally allocate local and remote error flag
> 	members in struct drbd_request, and properly deal with it :)
> 
> 	meaning we could (at least for protocol C) notice, distinguish,
> 	and recover from both local and remote ENOTSUPP, for a barrier
> 	request.

I thought about something like this but it gets (more) complicated --
what we should really do in the case where a barrier request results in
-EOPNOTSUPP from either side is return -EOPNOTSUPP to the master bio (so
that we tell the DRBD user that barriers don't work) - I don't think,
for example, that we should retry the request without the barrier bit on
behalf of our user.

As far as protocol goes - I think we need to make sure this failure is
reported in all protocols which means more protocol changes.

> > 4.       Forced a meta data write when a disk is attached so that we
> > determine early on whether or not barriers are supported.
> 
> remains the problem of storage area device != meta data area device.
> 

Rats! I always forget that -- I guess that means we really have to
implement per-I/O checking until we know barriers are not supported on
one or both sides _and_ save a separate bit for metadata and storage
area devices in case they are different...

> > 5.       Extended the tracing of BIO's to include internally
> generated
> > BIOs as well as the ones from above
> 
> would you agree if we changed that to no longer use printk,
> but to netlink-broadcast to userspace,
> so you'd be able to see them with "drbdsetup events"?
> 
> I did that already for some other reasons,
> so the code is basically there.
> 

I think that would be fine.

> > 3.       I think we can remove the #ifdef BIO_RW_XXX - certainly
they
> > are not present everywhere these macros are referenced...
> 
> I'll have to check that. we still try to support even 2.6.5-something.
> 

So, there are some places where you unconditionally checked
BIO_RW_BARRIER previously (I either fixed these or used bio_barrier(bio)
instead, but that shouldn't change whether or not this thing builds on
older releases.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Drbd-dev] [PATCH] Supporting barriers in DRBD, part 1
  2007-11-25 20:30 ` Graham, Simon
@ 2007-11-25 21:22   ` Lars Ellenberg
  0 siblings, 0 replies; 4+ messages in thread
From: Lars Ellenberg @ 2007-11-25 21:22 UTC (permalink / raw)
  To: drbd-dev

On Sun, Nov 25, 2007 at 03:30:42PM -0500, Graham, Simon wrote:
> Thanks for the comments Lars,
> 
> To answer questions/comments:
> 
> > since you say "part 1", what will the next part be?
> > defer "_req_is_done" until the corresponding barrier ack,
> > even for protocol C?
> > 
> 
> Well, Part 2 will be integrating the TL into recovery when we lose
> contact with the secondary - not sure I want to add this feature as
> well.

I'll do that part, then.

> > 	or, preferably, we finally allocate local and remote error flag
> > 	members in struct drbd_request, and properly deal with it :)
> > 
> > 	meaning we could (at least for protocol C) notice, distinguish,
> > 	and recover from both local and remote ENOTSUPP, for a barrier
> > 	request.
> 
> I thought about something like this but it gets (more) complicated --
> what we should really do in the case where a barrier request results in
> -EOPNOTSUPP from either side is return -EOPNOTSUPP to the master bio (so
> that we tell the DRBD user that barriers don't work) - I don't think,
> for example, that we should retry the request without the barrier bit on
> behalf of our user.

right.
we should fail with EOPNOTSUPP if one of the nodes fails with EOPNOTSUPP.
which makes it even easier, I think.

for protocol != C, and EOPNOTSUPP on the receiving side:
since we cannot take back a successful completion event,
we have to retry on the receiving side,
report both success (data written) and failure (barrier not supported)
at the same time, so the sending node can fail any new barrier request
early with EOPNOTSUPP.

> As far as protocol goes - I think we need to make sure this failure is
> reported in all protocols which means more protocol changes.

I wanted to have a NegAck with error code for ages :)
this is only possible in 8.2, however.

> > > 4.       Forced a meta data write when a disk is attached so that we
> > > determine early on whether or not barriers are supported.
> > 
> > remains the problem of storage area device != meta data area device.
> > 
> 
> Rats! I always forget that -- I guess that means we really have to
> implement per-I/O checking until we know barriers are not supported on
> one or both sides _and_ save a separate bit for metadata and storage
> area devices in case they are different...

interessingly whether or not barriers are supported could change over
time, since the backing store could change.

e.g. when on top of lvm, and the underlying pvs are not behaving the
same, or when on top of barrier supporting md raid1,
and the newly hot-added disk is different then the failed one it replaces.
there is some fun ahead here.

> > > 5.       Extended the tracing of BIO's to include internally
> > > generated BIOs as well as the ones from above
> > 
> > would you agree if we changed that to no longer use printk,
> > but to netlink-broadcast to userspace,
> > so you'd be able to see them with "drbdsetup events"?
> > 
> > I did that already for some other reasons,
> > so the code is basically there.
> > 
> 
> I think that would be fine.
> 
> > > 3.       I think we can remove the #ifdef BIO_RW_XXX - certainly
> > > they are not present everywhere these macros are referenced...
> > 
> > I'll have to check that. we still try to support even 2.6.5-something.
> 
> So, there are some places where you unconditionally checked
> BIO_RW_BARRIER previously (I either fixed these or used bio_barrier(bio)
> instead, but that shouldn't change whether or not this thing builds on
> older releases.

afaics, BIO_RW_BARRIER is present since 2.6.0,
so, yes, cleanup is due.

BIO_RW_SYNC is present only since 2.6.6,
so we need to keep those ifdefs.

-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2007-11-25 21:22 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-11-25 18:16 [Drbd-dev] [PATCH] Supporting barriers in DRBD, part 1 Graham, Simon
2007-11-25 19:20 ` Lars Ellenberg
2007-11-25 20:30 ` Graham, Simon
2007-11-25 21:22   ` Lars Ellenberg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox