[Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add_page()
@ 2007-01-05 14:06 ` Philipp Reisner
  0 siblings, 0 replies; 10+ messages in thread
From: Philipp Reisner @ 2007-01-05  6:07 UTC (permalink / raw)
  To: ocfs2-devel

Hi There,

As was already pointed out Mathieu Avila on Thu, 07 Sep 2006 03:15:25 -0700
that OCFS2 is expecting bio_add_page() to add pages to BIOs in an easily
predictable manner.

That is not true, especially for devices with own merge_bvec_fn().

Therefore OCFS2's heartbeat code is very likely to fail on such devices.

The attached patch corrects this, and reduces the file cluster/heartbeat.c
by 90 lines.

  The whole patch builds around the change that the bio_put() call is
  moved to the bio's bi_end_io() function.
  This makes the whole idea of trying to predict the behaviour of 
  bio_add_page() unnecessary, therefore we can remove compute_max_sectors()
  and o2hb_compute_request_limits().

-Phil

PS1: With that applied I can happily run OCFS2 on a DRBD-8.0 device which
     sits on top of software raid (both have interesting merge_bvec_fn()
     functions.

PS2: I created the ocfs2-fix-bio_add_page.diff on 2.6.17 .
     And later refitted the patch to 2.6.20-rc2.
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ocfs2-fix-bio_add_page.diff
Type: text/x-diff
Size: 9363 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20070105/3ebcad56/ocfs2-fix-bio_add_page.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ocfs2-fix-bio_add_page-for-2.6.20-rc2.diff
Type: text/x-diff
Size: 8766 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20070105/3ebcad56/ocfs2-fix-bio_add_page-for-2.6.20-rc2.bin

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Drbd-dev] Fixing cluster/heartbeat.c's use of bio_add_page()
@ 2007-01-05 14:06 ` Philipp Reisner
  0 siblings, 0 replies; 10+ messages in thread
From: Philipp Reisner @ 2007-01-05 14:06 UTC (permalink / raw)
  To: ocfs2-devel; +Cc: drbd-dev

[-- Attachment #1: Type: text/plain, Size: 1202 bytes --]

Hi There,

As was already pointed out Mathieu Avila on Thu, 07 Sep 2006 03:15:25 -0700
that OCFS2 is expecting bio_add_page() to add pages to BIOs in an easily
predictable manner.

That is not true, especially for devices with own merge_bvec_fn().

Therefore OCFS2's heartbeat code is very likely to fail on such devices.

The attached patch corrects this, and reduces the file cluster/heartbeat.c
by 90 lines.

  The whole patch builds around the change that the bio_put() call is
  moved to the bio's bi_end_io() function.
  This makes the whole idea of trying to predict the behaviour of 
  bio_add_page() unnecessary, therefore we can remove compute_max_sectors()
  and o2hb_compute_request_limits().

-Phil

PS1: With that applied I can happily run OCFS2 on a DRBD-8.0 device which
     sits on top of software raid (both have interesting merge_bvec_fn()
     functions.

PS2: I created the ocfs2-fix-bio_add_page.diff on 2.6.17 .
     And later refitted the patch to 2.6.20-rc2.
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :

[-- Attachment #2: ocfs2-fix-bio_add_page.diff --]
[-- Type: text/x-diff, Size: 9363 bytes --]

diff -rup linux-2.6.17-kdb_orig/fs/ocfs2/cluster/heartbeat.c linux-2.6.17-kdb/fs/ocfs2/cluster/heartbeat.c
--- linux-2.6.17-kdb_orig/fs/ocfs2/cluster/heartbeat.c	2007-01-05 11:29:06.696018258 +0100
+++ linux-2.6.17-kdb/fs/ocfs2/cluster/heartbeat.c	2007-01-05 14:08:00.133625148 +0100
@@ -3,6 +3,8 @@
  *
  * Copyright (C) 2004, 2005 Oracle.  All rights reserved.
  *
+ * 5. Jan 2007 Philipp Reisner, fixed the use of bio_add_page().
+ *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public
  * License as published by the Free Software Foundation; either
@@ -182,10 +184,9 @@ static void o2hb_disarm_write_timeout(st
 	flush_scheduled_work();
 }
 
-static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc,
-				      unsigned int num_ios)
+static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc)
 {
-	atomic_set(&wc->wc_num_reqs, num_ios);
+	atomic_set(&wc->wc_num_reqs, 0);
 	init_completion(&wc->wc_io_complete);
 	wc->wc_error = 0;
 }
@@ -229,6 +230,7 @@ static int o2hb_bio_end_io(struct bio *b
 		return 1;
 
 	o2hb_bio_wait_dec(wc, 1);
+	bio_put(bio);
 	return 0;
 }
 
@@ -236,23 +238,22 @@ static int o2hb_bio_end_io(struct bio *b
  * start_slot. */
 static struct bio *o2hb_setup_one_bio(struct o2hb_region *reg,
 				      struct o2hb_bio_wait_ctxt *wc,
-				      unsigned int start_slot,
-				      unsigned int num_slots)
+				      unsigned int *current_slot,
+				      unsigned int max_slots)
 {
-	int i, nr_vecs, len, first_page, last_page;
+	int len, current_page;
 	unsigned int vec_len, vec_start;
 	unsigned int bits = reg->hr_block_bits;
 	unsigned int spp = reg->hr_slots_per_page;
+	unsigned int cs = *current_slot;
 	struct bio *bio;
 	struct page *page;
 
-	nr_vecs = (num_slots + spp - 1) / spp;
-
 	/* Testing has shown this allocation to take long enough under
 	 * GFP_KERNEL that the local node can get fenced. It would be
 	 * nicest if we could pre-allocate these bios and avoid this
 	 * all together. */
-	bio = bio_alloc(GFP_ATOMIC, nr_vecs);
+	bio = bio_alloc(GFP_ATOMIC, 16);
 	if (!bio) {
 		mlog(ML_ERROR, "Could not alloc slots BIO!\n");
 		bio = ERR_PTR(-ENOMEM);
@@ -260,133 +261,53 @@ static struct bio *o2hb_setup_one_bio(st
 	}
 
 	/* Must put everything in 512 byte sectors for the bio... */
-	bio->bi_sector = (reg->hr_start_block + start_slot) << (bits - 9);
+	bio->bi_sector = (reg->hr_start_block + cs) << (bits - 9);
 	bio->bi_bdev = reg->hr_bdev;
 	bio->bi_private = wc;
 	bio->bi_end_io = o2hb_bio_end_io;
 
-	first_page = start_slot / spp;
-	last_page = first_page + nr_vecs;
-	vec_start = (start_slot << bits) % PAGE_CACHE_SIZE;
-	for(i = first_page; i < last_page; i++) {
-		page = reg->hr_slot_data[i];
-
-		vec_len = PAGE_CACHE_SIZE;
-		/* last page might be short */
-		if (((i + 1) * spp) > (start_slot + num_slots))
-			vec_len = ((num_slots + start_slot) % spp) << bits;
-		vec_len -=  vec_start;
+	vec_start = (cs << bits) % PAGE_CACHE_SIZE;
+	while(cs < max_slots) {
+		current_page = cs / spp;
+		page = reg->hr_slot_data[current_page];
+
+		vec_len = min( PAGE_CACHE_SIZE, 
+			       (max_slots-cs) * (PAGE_CACHE_SIZE/spp) ); 
 
 		mlog(ML_HB_BIO, "page %d, vec_len = %u, vec_start = %u\n",
-		     i, vec_len, vec_start);
+		     current_page, vec_len, vec_start);
 
 		len = bio_add_page(bio, page, vec_len, vec_start);
-		if (len != vec_len) {
-			bio_put(bio);
-			bio = ERR_PTR(-EIO);
-
-			mlog(ML_ERROR, "Error adding page to bio i = %d, "
-			     "vec_len = %u, len = %d\n, start = %u\n",
-			     i, vec_len, len, vec_start);
-			goto bail;
-		}
+		if (len != vec_len) break;
 
+		cs += vec_len / (PAGE_CACHE_SIZE/spp);
 		vec_start = 0;
 	}
 
 bail:
+	*current_slot = cs;
 	return bio;
 }
 
-/*
- * Compute the maximum number of sectors the bdev can handle in one bio,
- * as a power of two.
- *
- * Stolen from oracleasm, thanks Joel!
- */
-static int compute_max_sectors(struct block_device *bdev)
-{
-	int max_pages, max_sectors, pow_two_sectors;
-
-	struct request_queue *q;
-
-	q = bdev_get_queue(bdev);
-	max_pages = q->max_sectors >> (PAGE_SHIFT - 9);
-	if (max_pages > BIO_MAX_PAGES)
-		max_pages = BIO_MAX_PAGES;
-	if (max_pages > q->max_phys_segments)
-		max_pages = q->max_phys_segments;
-	if (max_pages > q->max_hw_segments)
-		max_pages = q->max_hw_segments;
-	max_pages--; /* Handle I/Os that straddle a page */
-
-	max_sectors = max_pages << (PAGE_SHIFT - 9);
-
-	/* Why is fls() 1-based???? */
-	pow_two_sectors = 1 << (fls(max_sectors) - 1);
-
-	return pow_two_sectors;
-}
-
-static inline void o2hb_compute_request_limits(struct o2hb_region *reg,
-					       unsigned int num_slots,
-					       unsigned int *num_bios,
-					       unsigned int *slots_per_bio)
-{
-	unsigned int max_sectors, io_sectors;
-
-	max_sectors = compute_max_sectors(reg->hr_bdev);
-
-	io_sectors = num_slots << (reg->hr_block_bits - 9);
-
-	*num_bios = (io_sectors + max_sectors - 1) / max_sectors;
-	*slots_per_bio = max_sectors >> (reg->hr_block_bits - 9);
-
-	mlog(ML_HB_BIO, "My io size is %u sectors for %u slots. This "
-	     "device can handle %u sectors of I/O\n", io_sectors, num_slots,
-	     max_sectors);
-	mlog(ML_HB_BIO, "Will need %u bios holding %u slots each\n",
-	     *num_bios, *slots_per_bio);
-}
-
 static int o2hb_read_slots(struct o2hb_region *reg,
 			   unsigned int max_slots)
 {
-	unsigned int num_bios, slots_per_bio, start_slot, num_slots;
-	int i, status;
+	unsigned int current_slot=0;
+	int status;
 	struct o2hb_bio_wait_ctxt wc;
-	struct bio **bios;
 	struct bio *bio;
 
-	o2hb_compute_request_limits(reg, max_slots, &num_bios, &slots_per_bio);
-
-	bios = kcalloc(num_bios, sizeof(struct bio *), GFP_KERNEL);
-	if (!bios) {
-		status = -ENOMEM;
-		mlog_errno(status);
-		return status;
-	}
-
-	o2hb_bio_wait_init(&wc, num_bios);
-
-	num_slots = slots_per_bio;
-	for(i = 0; i < num_bios; i++) {
-		start_slot = i * slots_per_bio;
+	o2hb_bio_wait_init(&wc);
 
-		/* adjust num_slots at last bio */
-		if (max_slots < (start_slot + num_slots))
-			num_slots = max_slots - start_slot;
-
-		bio = o2hb_setup_one_bio(reg, &wc, start_slot, num_slots);
+	while(current_slot < max_slots) {
+		bio = o2hb_setup_one_bio(reg, &wc, &current_slot, max_slots);
 		if (IS_ERR(bio)) {
-			o2hb_bio_wait_dec(&wc, num_bios - i);
-
 			status = PTR_ERR(bio);
 			mlog_errno(status);
 			goto bail_and_wait;
 		}
-		bios[i] = bio;
 
+		atomic_inc(&wc.wc_num_reqs);
 		submit_bio(READ, bio);
 	}
 
@@ -397,38 +318,30 @@ bail_and_wait:
 	if (wc.wc_error && !status)
 		status = wc.wc_error;
 
-	if (bios) {
-		for(i = 0; i < num_bios; i++)
-			if (bios[i])
-				bio_put(bios[i]);
-		kfree(bios);
-	}
-
 	return status;
 }
 
 static int o2hb_issue_node_write(struct o2hb_region *reg,
-				 struct bio **write_bio,
 				 struct o2hb_bio_wait_ctxt *write_wc)
 {
 	int status;
 	unsigned int slot;
 	struct bio *bio;
 
-	o2hb_bio_wait_init(write_wc, 1);
+	o2hb_bio_wait_init(write_wc);
 
 	slot = o2nm_this_node();
 
-	bio = o2hb_setup_one_bio(reg, write_wc, slot, 1);
+	bio = o2hb_setup_one_bio(reg, write_wc, &slot, slot+1);
 	if (IS_ERR(bio)) {
 		status = PTR_ERR(bio);
 		mlog_errno(status);
 		goto bail;
 	}
 
+	atomic_inc(&write_wc->wc_num_reqs);
 	submit_bio(WRITE, bio);
 
-	*write_bio = bio;
 	status = 0;
 bail:
 	return status;
@@ -800,7 +713,6 @@ static int o2hb_do_disk_heartbeat(struct
 {
 	int i, ret, highest_node, change = 0;
 	unsigned long configured_nodes[BITS_TO_LONGS(O2NM_MAX_NODES)];
-	struct bio *write_bio;
 	struct o2hb_bio_wait_ctxt write_wc;
 
 	ret = o2nm_configured_node_map(configured_nodes,
@@ -838,7 +750,7 @@ static int o2hb_do_disk_heartbeat(struct
 
 	/* And fire off the write. Note that we don't wait on this I/O
 	 * until later. */
-	ret = o2hb_issue_node_write(reg, &write_bio, &write_wc);
+	ret = o2hb_issue_node_write(reg, &write_wc);
 	if (ret < 0) {
 		mlog_errno(ret);
 		return ret;
@@ -856,7 +768,6 @@ static int o2hb_do_disk_heartbeat(struct
 	 * people we find in our steady state have seen us.
 	 */
 	o2hb_wait_on_io(reg, &write_wc);
-	bio_put(write_bio);
 	if (write_wc.wc_error) {
 		/* Do not re-arm the write timeout on I/O error - we
 		 * can't be sure that the new block ever made it to
@@ -917,7 +828,6 @@ static int o2hb_thread(void *data)
 {
 	int i, ret;
 	struct o2hb_region *reg = data;
-	struct bio *write_bio;
 	struct o2hb_bio_wait_ctxt write_wc;
 	struct timeval before_hb, after_hb;
 	unsigned int elapsed_msec;
@@ -967,10 +877,9 @@ static int o2hb_thread(void *data)
 	 *
 	 * XXX: Should we skip this on unclean_stop? */
 	o2hb_prepare_block(reg, 0);
-	ret = o2hb_issue_node_write(reg, &write_bio, &write_wc);
+	ret = o2hb_issue_node_write(reg, &write_wc);
 	if (ret == 0) {
 		o2hb_wait_on_io(reg, &write_wc);
-		bio_put(write_bio);
 	} else {
 		mlog_errno(ret);
 	}
Binary files linux-2.6.17-kdb_orig/fs/ocfs2/cluster/heartbeat.o and linux-2.6.17-kdb/fs/ocfs2/cluster/heartbeat.o differ
Binary files linux-2.6.17-kdb_orig/fs/ocfs2/cluster/ocfs2_nodemanager.ko and linux-2.6.17-kdb/fs/ocfs2/cluster/ocfs2_nodemanager.ko differ
Binary files linux-2.6.17-kdb_orig/fs/ocfs2/cluster/ocfs2_nodemanager.mod.o and linux-2.6.17-kdb/fs/ocfs2/cluster/ocfs2_nodemanager.mod.o differ
Binary files linux-2.6.17-kdb_orig/fs/ocfs2/cluster/ocfs2_nodemanager.o and linux-2.6.17-kdb/fs/ocfs2/cluster/ocfs2_nodemanager.o differ

[-- Attachment #3: ocfs2-fix-bio_add_page-for-2.6.20-rc2.diff --]
[-- Type: text/x-diff, Size: 8766 bytes --]

--- heartbeat-2.6.20-rc2.c	2007-01-05 14:56:13.000000000 +0100
+++ heartbeat.c	2007-01-05 15:02:08.000000000 +0100
@@ -3,6 +3,8 @@
  *
  * Copyright (C) 2004, 2005 Oracle.  All rights reserved.
  *
+ * 5. Jan 2007 Philipp Reisner, fixed the use of bio_add_page().
+ *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public
  * License as published by the Free Software Foundation; either
@@ -184,10 +186,9 @@ static void o2hb_disarm_write_timeout(st
 	flush_scheduled_work();
 }
 
-static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc,
-				      unsigned int num_ios)
+static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc)
 {
-	atomic_set(&wc->wc_num_reqs, num_ios);
+	atomic_set(&wc->wc_num_reqs, 0);
 	init_completion(&wc->wc_io_complete);
 	wc->wc_error = 0;
 }
@@ -231,6 +232,7 @@ static int o2hb_bio_end_io(struct bio *b
 		return 1;
 
 	o2hb_bio_wait_dec(wc, 1);
+	bio_put(bio);
 	return 0;
 }
 
@@ -238,23 +240,22 @@ static int o2hb_bio_end_io(struct bio *b
  * start_slot. */
 static struct bio *o2hb_setup_one_bio(struct o2hb_region *reg,
 				      struct o2hb_bio_wait_ctxt *wc,
-				      unsigned int start_slot,
-				      unsigned int num_slots)
+				      unsigned int *current_slot,
+				      unsigned int max_slots)
 {
-	int i, nr_vecs, len, first_page, last_page;
+	int len, current_page;
 	unsigned int vec_len, vec_start;
 	unsigned int bits = reg->hr_block_bits;
 	unsigned int spp = reg->hr_slots_per_page;
+	unsigned int cs = *current_slot;
 	struct bio *bio;
 	struct page *page;
 
-	nr_vecs = (num_slots + spp - 1) / spp;
-
 	/* Testing has shown this allocation to take long enough under
 	 * GFP_KERNEL that the local node can get fenced. It would be
 	 * nicest if we could pre-allocate these bios and avoid this
 	 * all together. */
-	bio = bio_alloc(GFP_ATOMIC, nr_vecs);
+	bio = bio_alloc(GFP_ATOMIC, 16);
 	if (!bio) {
 		mlog(ML_ERROR, "Could not alloc slots BIO!\n");
 		bio = ERR_PTR(-ENOMEM);
@@ -262,137 +263,53 @@ static struct bio *o2hb_setup_one_bio(st
 	}
 
 	/* Must put everything in 512 byte sectors for the bio... */
-	bio->bi_sector = (reg->hr_start_block + start_slot) << (bits - 9);
+	bio->bi_sector = (reg->hr_start_block + cs) << (bits - 9);
 	bio->bi_bdev = reg->hr_bdev;
 	bio->bi_private = wc;
 	bio->bi_end_io = o2hb_bio_end_io;
 
-	first_page = start_slot / spp;
-	last_page = first_page + nr_vecs;
-	vec_start = (start_slot << bits) % PAGE_CACHE_SIZE;
-	for(i = first_page; i < last_page; i++) {
-		page = reg->hr_slot_data[i];
-
-		vec_len = PAGE_CACHE_SIZE;
-		/* last page might be short */
-		if (((i + 1) * spp) > (start_slot + num_slots))
-			vec_len = ((num_slots + start_slot) % spp) << bits;
-		vec_len -=  vec_start;
+	vec_start = (cs << bits) % PAGE_CACHE_SIZE;
+	while(cs < max_slots) {
+		current_page = cs / spp;
+		page = reg->hr_slot_data[current_page];
+
+		vec_len = min( PAGE_CACHE_SIZE, 
+			       (max_slots-cs) * (PAGE_CACHE_SIZE/spp) ); 
 
 		mlog(ML_HB_BIO, "page %d, vec_len = %u, vec_start = %u\n",
-		     i, vec_len, vec_start);
+		     current_page, vec_len, vec_start);
 
 		len = bio_add_page(bio, page, vec_len, vec_start);
-		if (len != vec_len) {
-			bio_put(bio);
-			bio = ERR_PTR(-EIO);
-
-			mlog(ML_ERROR, "Error adding page to bio i = %d, "
-			     "vec_len = %u, len = %d\n, start = %u\n",
-			     i, vec_len, len, vec_start);
-			goto bail;
-		}
+		if (len != vec_len) break;
 
+		cs += vec_len / (PAGE_CACHE_SIZE/spp);
 		vec_start = 0;
 	}
 
 bail:
+	*current_slot = cs;
 	return bio;
 }
 
-/*
- * Compute the maximum number of sectors the bdev can handle in one bio,
- * as a power of two.
- *
- * Stolen from oracleasm, thanks Joel!
- */
-static int compute_max_sectors(struct block_device *bdev)
-{
-	int max_pages, max_sectors, pow_two_sectors;
-
-	struct request_queue *q;
-
-	q = bdev_get_queue(bdev);
-	max_pages = q->max_sectors >> (PAGE_SHIFT - 9);
-	if (max_pages > BIO_MAX_PAGES)
-		max_pages = BIO_MAX_PAGES;
-	if (max_pages > q->max_phys_segments)
-		max_pages = q->max_phys_segments;
-	if (max_pages > q->max_hw_segments)
-		max_pages = q->max_hw_segments;
-	max_pages--; /* Handle I/Os that straddle a page */
-
-	if (max_pages) {
-		max_sectors = max_pages << (PAGE_SHIFT - 9);
-	} else {
-		/* If BIO contains 1 or less than 1 page. */
-		max_sectors = q->max_sectors;
-	}
-	/* Why is fls() 1-based???? */
-	pow_two_sectors = 1 << (fls(max_sectors) - 1);
-
-	return pow_two_sectors;
-}
-
-static inline void o2hb_compute_request_limits(struct o2hb_region *reg,
-					       unsigned int num_slots,
-					       unsigned int *num_bios,
-					       unsigned int *slots_per_bio)
-{
-	unsigned int max_sectors, io_sectors;
-
-	max_sectors = compute_max_sectors(reg->hr_bdev);
-
-	io_sectors = num_slots << (reg->hr_block_bits - 9);
-
-	*num_bios = (io_sectors + max_sectors - 1) / max_sectors;
-	*slots_per_bio = max_sectors >> (reg->hr_block_bits - 9);
-
-	mlog(ML_HB_BIO, "My io size is %u sectors for %u slots. This "
-	     "device can handle %u sectors of I/O\n", io_sectors, num_slots,
-	     max_sectors);
-	mlog(ML_HB_BIO, "Will need %u bios holding %u slots each\n",
-	     *num_bios, *slots_per_bio);
-}
-
 static int o2hb_read_slots(struct o2hb_region *reg,
 			   unsigned int max_slots)
 {
-	unsigned int num_bios, slots_per_bio, start_slot, num_slots;
-	int i, status;
+	unsigned int current_slot=0;
+	int status;
 	struct o2hb_bio_wait_ctxt wc;
-	struct bio **bios;
 	struct bio *bio;
 
-	o2hb_compute_request_limits(reg, max_slots, &num_bios, &slots_per_bio);
+	o2hb_bio_wait_init(&wc);
 
-	bios = kcalloc(num_bios, sizeof(struct bio *), GFP_KERNEL);
-	if (!bios) {
-		status = -ENOMEM;
-		mlog_errno(status);
-		return status;
-	}
-
-	o2hb_bio_wait_init(&wc, num_bios);
-
-	num_slots = slots_per_bio;
-	for(i = 0; i < num_bios; i++) {
-		start_slot = i * slots_per_bio;
-
-		/* adjust num_slots at last bio */
-		if (max_slots < (start_slot + num_slots))
-			num_slots = max_slots - start_slot;
-
-		bio = o2hb_setup_one_bio(reg, &wc, start_slot, num_slots);
+	while(current_slot < max_slots) {
+		bio = o2hb_setup_one_bio(reg, &wc, &current_slot, max_slots);
 		if (IS_ERR(bio)) {
-			o2hb_bio_wait_dec(&wc, num_bios - i);
-
 			status = PTR_ERR(bio);
 			mlog_errno(status);
 			goto bail_and_wait;
 		}
-		bios[i] = bio;
 
+		atomic_inc(&wc.wc_num_reqs);
 		submit_bio(READ, bio);
 	}
 
@@ -403,38 +320,30 @@ bail_and_wait:
 	if (wc.wc_error && !status)
 		status = wc.wc_error;
 
-	if (bios) {
-		for(i = 0; i < num_bios; i++)
-			if (bios[i])
-				bio_put(bios[i]);
-		kfree(bios);
-	}
-
 	return status;
 }
 
 static int o2hb_issue_node_write(struct o2hb_region *reg,
-				 struct bio **write_bio,
 				 struct o2hb_bio_wait_ctxt *write_wc)
 {
 	int status;
 	unsigned int slot;
 	struct bio *bio;
 
-	o2hb_bio_wait_init(write_wc, 1);
+	o2hb_bio_wait_init(write_wc);
 
 	slot = o2nm_this_node();
 
-	bio = o2hb_setup_one_bio(reg, write_wc, slot, 1);
+	bio = o2hb_setup_one_bio(reg, write_wc, &slot, slot+1);
 	if (IS_ERR(bio)) {
 		status = PTR_ERR(bio);
 		mlog_errno(status);
 		goto bail;
 	}
 
+	atomic_inc(&write_wc->wc_num_reqs);
 	submit_bio(WRITE, bio);
 
-	*write_bio = bio;
 	status = 0;
 bail:
 	return status;
@@ -826,7 +735,6 @@ static int o2hb_do_disk_heartbeat(struct
 {
 	int i, ret, highest_node, change = 0;
 	unsigned long configured_nodes[BITS_TO_LONGS(O2NM_MAX_NODES)];
-	struct bio *write_bio;
 	struct o2hb_bio_wait_ctxt write_wc;
 
 	ret = o2nm_configured_node_map(configured_nodes,
@@ -864,7 +772,7 @@ static int o2hb_do_disk_heartbeat(struct
 
 	/* And fire off the write. Note that we don't wait on this I/O
 	 * until later. */
-	ret = o2hb_issue_node_write(reg, &write_bio, &write_wc);
+	ret = o2hb_issue_node_write(reg, &write_wc);
 	if (ret < 0) {
 		mlog_errno(ret);
 		return ret;
@@ -882,7 +790,6 @@ static int o2hb_do_disk_heartbeat(struct
 	 * people we find in our steady state have seen us.
 	 */
 	o2hb_wait_on_io(reg, &write_wc);
-	bio_put(write_bio);
 	if (write_wc.wc_error) {
 		/* Do not re-arm the write timeout on I/O error - we
 		 * can't be sure that the new block ever made it to
@@ -943,7 +850,6 @@ static int o2hb_thread(void *data)
 {
 	int i, ret;
 	struct o2hb_region *reg = data;
-	struct bio *write_bio;
 	struct o2hb_bio_wait_ctxt write_wc;
 	struct timeval before_hb, after_hb;
 	unsigned int elapsed_msec;
@@ -993,10 +899,9 @@ static int o2hb_thread(void *data)
 	 *
 	 * XXX: Should we skip this on unclean_stop? */
 	o2hb_prepare_block(reg, 0);
-	ret = o2hb_issue_node_write(reg, &write_bio, &write_wc);
+	ret = o2hb_issue_node_write(reg, &write_wc);
 	if (ret == 0) {
 		o2hb_wait_on_io(reg, &write_wc);
-		bio_put(write_bio);
 	} else {
 		mlog_errno(ret);
 	}

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add_page()
  2007-01-05 14:06 ` [Drbd-dev] " Philipp Reisner
@ 2007-01-09 22:28   ` Mark Fasheh
  -1 siblings, 0 replies; 10+ messages in thread
From: Mark Fasheh @ 2007-01-09 14:28 UTC (permalink / raw)
  To: ocfs2-devel

Hi Philipp,

On Fri, Jan 05, 2007 at 03:06:52PM +0100, Philipp Reisner wrote:
> Hi There,
> 
> As was already pointed out Mathieu Avila on Thu, 07 Sep 2006 03:15:25 -0700
> that OCFS2 is expecting bio_add_page() to add pages to BIOs in an easily
> predictable manner.
> 
> That is not true, especially for devices with own merge_bvec_fn().
> 
> Therefore OCFS2's heartbeat code is very likely to fail on such devices.
> 
> The attached patch corrects this, and reduces the file cluster/heartbeat.c
> by 90 lines.

Thanks for taking this on - fixing up that code has been on my todo list for
a while.


>   The whole patch builds around the change that the bio_put() call is
>   moved to the bio's bi_end_io() function.
>   This makes the whole idea of trying to predict the behaviour of 
>   bio_add_page() unnecessary, therefore we can remove compute_max_sectors()
>   and o2hb_compute_request_limits().
> 
> -Phil
> 
> PS1: With that applied I can happily run OCFS2 on a DRBD-8.0 device which
>      sits on top of software raid (both have interesting merge_bvec_fn()
>      functions.
> 
> PS2: I created the ocfs2-fix-bio_add_page.diff on 2.6.17 .
>      And later refitted the patch to 2.6.20-rc2.

Ok, great. I only looked at the 2.6.20-rc2 patch as that's what I'm running
here.


>  static int o2hb_read_slots(struct o2hb_region *reg,
>  			   unsigned int max_slots)
>  {
> -	unsigned int num_bios, slots_per_bio, start_slot, num_slots;
> -	int i, status;
> +	unsigned int current_slot=0;
> +	int status;
>  	struct o2hb_bio_wait_ctxt wc;
> -	struct bio **bios;
>  	struct bio *bio;
>  
> -	o2hb_compute_request_limits(reg, max_slots, &num_bios, &slots_per_bio);
> -
> -	bios = kcalloc(num_bios, sizeof(struct bio *), GFP_KERNEL);
> -	if (!bios) {
> -		status = -ENOMEM;
> -		mlog_errno(status);
> -		return status;
> -	}
> -
> -	o2hb_bio_wait_init(&wc, num_bios);
> -
> -	num_slots = slots_per_bio;
> -	for(i = 0; i < num_bios; i++) {
> -		start_slot = i * slots_per_bio;
> +	o2hb_bio_wait_init(&wc);
>  
> -		/* adjust num_slots at last bio */
> -		if (max_slots < (start_slot + num_slots))
> -			num_slots = max_slots - start_slot;
> -
> -		bio = o2hb_setup_one_bio(reg, &wc, start_slot, num_slots);
> +	while(current_slot < max_slots) {
> +		bio = o2hb_setup_one_bio(reg, &wc, &current_slot, max_slots);
>  		if (IS_ERR(bio)) {
> -			o2hb_bio_wait_dec(&wc, num_bios - i);
> -
>  			status = PTR_ERR(bio);
>  			mlog_errno(status);
>  			goto bail_and_wait;
>  		}
> -		bios[i] = bio;
>  
> +		atomic_inc(&wc.wc_num_reqs);
>  		submit_bio(READ, bio);
>  	}

You changed o2hb_bio_wait_init() to initialize wc_num_reqs to zero, and
instead you increment the value before each submit_bio(). This makes sense
because we now don't know how many requests will ultimately be submitted.

However, A race now exists here between the piecemeal increment of
wc_num_reqs and the completion code in o2hb_bio_wait_dec():

		if (atomic_dec_and_test(&wc->wc_num_reqs)) {
			BUG_ON(num > 0);
			complete(&wc->wc_io_complete);
		}

So, if enough bios complete before you're done submitting all your read
requests, you could trigger the completion, wc_io_complete early.

This would certainly be hard to hit, but that's never stopped a race from
occuring before :)

So, NACK until we fix that race. Otherwise the patch seems to look fine. I
can test a future version here at Oracle.

Thanks,
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Drbd-dev] Re: [Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add_page()
@ 2007-01-09 22:28   ` Mark Fasheh
  0 siblings, 0 replies; 10+ messages in thread
From: Mark Fasheh @ 2007-01-09 22:28 UTC (permalink / raw)
  To: Philipp Reisner; +Cc: ocfs2-devel, drbd-dev

Hi Philipp,

On Fri, Jan 05, 2007 at 03:06:52PM +0100, Philipp Reisner wrote:
> Hi There,
> 
> As was already pointed out Mathieu Avila on Thu, 07 Sep 2006 03:15:25 -0700
> that OCFS2 is expecting bio_add_page() to add pages to BIOs in an easily
> predictable manner.
> 
> That is not true, especially for devices with own merge_bvec_fn().
> 
> Therefore OCFS2's heartbeat code is very likely to fail on such devices.
> 
> The attached patch corrects this, and reduces the file cluster/heartbeat.c
> by 90 lines.

Thanks for taking this on - fixing up that code has been on my todo list for
a while.


>   The whole patch builds around the change that the bio_put() call is
>   moved to the bio's bi_end_io() function.
>   This makes the whole idea of trying to predict the behaviour of 
>   bio_add_page() unnecessary, therefore we can remove compute_max_sectors()
>   and o2hb_compute_request_limits().
> 
> -Phil
> 
> PS1: With that applied I can happily run OCFS2 on a DRBD-8.0 device which
>      sits on top of software raid (both have interesting merge_bvec_fn()
>      functions.
> 
> PS2: I created the ocfs2-fix-bio_add_page.diff on 2.6.17 .
>      And later refitted the patch to 2.6.20-rc2.

Ok, great. I only looked at the 2.6.20-rc2 patch as that's what I'm running
here.


>  static int o2hb_read_slots(struct o2hb_region *reg,
>  			   unsigned int max_slots)
>  {
> -	unsigned int num_bios, slots_per_bio, start_slot, num_slots;
> -	int i, status;
> +	unsigned int current_slot=0;
> +	int status;
>  	struct o2hb_bio_wait_ctxt wc;
> -	struct bio **bios;
>  	struct bio *bio;
>  
> -	o2hb_compute_request_limits(reg, max_slots, &num_bios, &slots_per_bio);
> -
> -	bios = kcalloc(num_bios, sizeof(struct bio *), GFP_KERNEL);
> -	if (!bios) {
> -		status = -ENOMEM;
> -		mlog_errno(status);
> -		return status;
> -	}
> -
> -	o2hb_bio_wait_init(&wc, num_bios);
> -
> -	num_slots = slots_per_bio;
> -	for(i = 0; i < num_bios; i++) {
> -		start_slot = i * slots_per_bio;
> +	o2hb_bio_wait_init(&wc);
>  
> -		/* adjust num_slots at last bio */
> -		if (max_slots < (start_slot + num_slots))
> -			num_slots = max_slots - start_slot;
> -
> -		bio = o2hb_setup_one_bio(reg, &wc, start_slot, num_slots);
> +	while(current_slot < max_slots) {
> +		bio = o2hb_setup_one_bio(reg, &wc, &current_slot, max_slots);
>  		if (IS_ERR(bio)) {
> -			o2hb_bio_wait_dec(&wc, num_bios - i);
> -
>  			status = PTR_ERR(bio);
>  			mlog_errno(status);
>  			goto bail_and_wait;
>  		}
> -		bios[i] = bio;
>  
> +		atomic_inc(&wc.wc_num_reqs);
>  		submit_bio(READ, bio);
>  	}

You changed o2hb_bio_wait_init() to initialize wc_num_reqs to zero, and
instead you increment the value before each submit_bio(). This makes sense
because we now don't know how many requests will ultimately be submitted.

However, A race now exists here between the piecemeal increment of
wc_num_reqs and the completion code in o2hb_bio_wait_dec():

		if (atomic_dec_and_test(&wc->wc_num_reqs)) {
			BUG_ON(num > 0);
			complete(&wc->wc_io_complete);
		}

So, if enough bios complete before you're done submitting all your read
requests, you could trigger the completion, wc_io_complete early.

This would certainly be hard to hit, but that's never stopped a race from
occuring before :)

So, NACK until we fix that race. Otherwise the patch seems to look fine. I
can test a future version here at Oracle.

Thanks,
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add_page()
  2007-01-09 22:28   ` [Drbd-dev] " Mark Fasheh
@ 2007-01-11  9:58     ` Philipp Reisner
  -1 siblings, 0 replies; 10+ messages in thread
From: Philipp Reisner @ 2007-01-11  1:58 UTC (permalink / raw)
  To: ocfs2-devel

[...]
>
> You changed o2hb_bio_wait_init() to initialize wc_num_reqs to zero, and
> instead you increment the value before each submit_bio(). This makes sense
> because we now don't know how many requests will ultimately be submitted.
>
> However, A race now exists here between the piecemeal increment of
> wc_num_reqs and the completion code in o2hb_bio_wait_dec():
>
> 		if (atomic_dec_and_test(&wc->wc_num_reqs)) {
> 			BUG_ON(num > 0);
> 			complete(&wc->wc_io_complete);
> 		}
>
> So, if enough bios complete before you're done submitting all your read
> requests, you could trigger the completion, wc_io_complete early.
>
> This would certainly be hard to hit, but that's never stopped a race from
> occuring before :)
>

Right, here is a new version of the patch. Now wc_num_reqs gets initialized
to one, and the final decrement happens in o2hb_wait_on_io()

My try of a 'The perfect patch':

Moved the bio_put() call into the bio's bi_end_io() function. This makes the
whole idea of trying to predict the behaviour of bio_add_page() unnecessary.
Removed compute_max_sectors() and o2hb_compute_request_limits().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>

--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -184,10 +184,9 @@ static void o2hb_disarm_write_timeout(st
 	flush_scheduled_work();
 }
 
-static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc,
-				      unsigned int num_ios)
+static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc)
 {
-	atomic_set(&wc->wc_num_reqs, num_ios);
+	atomic_set(&wc->wc_num_reqs, 1);
 	init_completion(&wc->wc_io_complete);
 	wc->wc_error = 0;
 }
@@ -212,6 +211,7 @@ static void o2hb_wait_on_io(struct o2hb_
 	struct address_space *mapping = reg->hr_bdev->bd_inode->i_mapping;
 
 	blk_run_address_space(mapping);
+	o2hb_bio_wait_dec(wc, 1);
 
 	wait_for_completion(&wc->wc_io_complete);
 }
@@ -231,6 +231,7 @@ static int o2hb_bio_end_io(struct bio *b
 		return 1;
 
 	o2hb_bio_wait_dec(wc, 1);
+	bio_put(bio);
 	return 0;
 }
 
@@ -238,23 +239,22 @@ static int o2hb_bio_end_io(struct bio *b
  * start_slot. */
 static struct bio *o2hb_setup_one_bio(struct o2hb_region *reg,
 				      struct o2hb_bio_wait_ctxt *wc,
-				      unsigned int start_slot,
-				      unsigned int num_slots)
+				      unsigned int *current_slot,
+				      unsigned int max_slots)
 {
-	int i, nr_vecs, len, first_page, last_page;
+	int len, current_page;
 	unsigned int vec_len, vec_start;
 	unsigned int bits = reg->hr_block_bits;
 	unsigned int spp = reg->hr_slots_per_page;
+	unsigned int cs = *current_slot;
 	struct bio *bio;
 	struct page *page;
 
-	nr_vecs = (num_slots + spp - 1) / spp;
-
 	/* Testing has shown this allocation to take long enough under
 	 * GFP_KERNEL that the local node can get fenced. It would be
 	 * nicest if we could pre-allocate these bios and avoid this
 	 * all together. */
-	bio = bio_alloc(GFP_ATOMIC, nr_vecs);
+	bio = bio_alloc(GFP_ATOMIC, 16);
 	if (!bio) {
 		mlog(ML_ERROR, "Could not alloc slots BIO!\n");
 		bio = ERR_PTR(-ENOMEM);
@@ -262,137 +262,53 @@ static struct bio *o2hb_setup_one_bio(st
 	}
 
 	/* Must put everything in 512 byte sectors for the bio... */
-	bio->bi_sector = (reg->hr_start_block + start_slot) << (bits - 9);
+	bio->bi_sector = (reg->hr_start_block + cs) << (bits - 9);
 	bio->bi_bdev = reg->hr_bdev;
 	bio->bi_private = wc;
 	bio->bi_end_io = o2hb_bio_end_io;
 
-	first_page = start_slot / spp;
-	last_page = first_page + nr_vecs;
-	vec_start = (start_slot << bits) % PAGE_CACHE_SIZE;
-	for(i = first_page; i < last_page; i++) {
-		page = reg->hr_slot_data[i];
+	vec_start = (cs << bits) % PAGE_CACHE_SIZE;
+	while(cs < max_slots) {
+		current_page = cs / spp;
+		page = reg->hr_slot_data[current_page];
 
-		vec_len = PAGE_CACHE_SIZE;
-		/* last page might be short */
-		if (((i + 1) * spp) > (start_slot + num_slots))
-			vec_len = ((num_slots + start_slot) % spp) << bits;
-		vec_len -=  vec_start;
+		vec_len = min( PAGE_CACHE_SIZE, 
+			       (max_slots-cs) * (PAGE_CACHE_SIZE/spp) ); 
 
 		mlog(ML_HB_BIO, "page %d, vec_len = %u, vec_start = %u\n",
-		     i, vec_len, vec_start);
+		     current_page, vec_len, vec_start);
 
 		len = bio_add_page(bio, page, vec_len, vec_start);
-		if (len != vec_len) {
-			bio_put(bio);
-			bio = ERR_PTR(-EIO);
-
-			mlog(ML_ERROR, "Error adding page to bio i = %d, "
-			     "vec_len = %u, len = %d\n, start = %u\n",
-			     i, vec_len, len, vec_start);
-			goto bail;
-		}
+		if (len != vec_len) break;
 
+		cs += vec_len / (PAGE_CACHE_SIZE/spp);
 		vec_start = 0;
 	}
 
 bail:
+	*current_slot = cs;
 	return bio;
 }
 
-/*
- * Compute the maximum number of sectors the bdev can handle in one bio,
- * as a power of two.
- *
- * Stolen from oracleasm, thanks Joel!
- */
-static int compute_max_sectors(struct block_device *bdev)
-{
-	int max_pages, max_sectors, pow_two_sectors;
-
-	struct request_queue *q;
-
-	q = bdev_get_queue(bdev);
-	max_pages = q->max_sectors >> (PAGE_SHIFT - 9);
-	if (max_pages > BIO_MAX_PAGES)
-		max_pages = BIO_MAX_PAGES;
-	if (max_pages > q->max_phys_segments)
-		max_pages = q->max_phys_segments;
-	if (max_pages > q->max_hw_segments)
-		max_pages = q->max_hw_segments;
-	max_pages--; /* Handle I/Os that straddle a page */
-
-	if (max_pages) {
-		max_sectors = max_pages << (PAGE_SHIFT - 9);
-	} else {
-		/* If BIO contains 1 or less than 1 page. */
-		max_sectors = q->max_sectors;
-	}
-	/* Why is fls() 1-based???? */
-	pow_two_sectors = 1 << (fls(max_sectors) - 1);
-
-	return pow_two_sectors;
-}
-
-static inline void o2hb_compute_request_limits(struct o2hb_region *reg,
-					       unsigned int num_slots,
-					       unsigned int *num_bios,
-					       unsigned int *slots_per_bio)
-{
-	unsigned int max_sectors, io_sectors;
-
-	max_sectors = compute_max_sectors(reg->hr_bdev);
-
-	io_sectors = num_slots << (reg->hr_block_bits - 9);
-
-	*num_bios = (io_sectors + max_sectors - 1) / max_sectors;
-	*slots_per_bio = max_sectors >> (reg->hr_block_bits - 9);
-
-	mlog(ML_HB_BIO, "My io size is %u sectors for %u slots. This "
-	     "device can handle %u sectors of I/O\n", io_sectors, num_slots,
-	     max_sectors);
-	mlog(ML_HB_BIO, "Will need %u bios holding %u slots each\n",
-	     *num_bios, *slots_per_bio);
-}
-
 static int o2hb_read_slots(struct o2hb_region *reg,
 			   unsigned int max_slots)
 {
-	unsigned int num_bios, slots_per_bio, start_slot, num_slots;
-	int i, status;
+	unsigned int current_slot=0;
+	int status;
 	struct o2hb_bio_wait_ctxt wc;
-	struct bio **bios;
 	struct bio *bio;
 
-	o2hb_compute_request_limits(reg, max_slots, &num_bios, &slots_per_bio);
+	o2hb_bio_wait_init(&wc);
 
-	bios = kcalloc(num_bios, sizeof(struct bio *), GFP_KERNEL);
-	if (!bios) {
-		status = -ENOMEM;
-		mlog_errno(status);
-		return status;
-	}
-
-	o2hb_bio_wait_init(&wc, num_bios);
-
-	num_slots = slots_per_bio;
-	for(i = 0; i < num_bios; i++) {
-		start_slot = i * slots_per_bio;
-
-		/* adjust num_slots@last bio */
-		if (max_slots < (start_slot + num_slots))
-			num_slots = max_slots - start_slot;
-
-		bio = o2hb_setup_one_bio(reg, &wc, start_slot, num_slots);
+	while(current_slot < max_slots) {
+		bio = o2hb_setup_one_bio(reg, &wc, &current_slot, max_slots);
 		if (IS_ERR(bio)) {
-			o2hb_bio_wait_dec(&wc, num_bios - i);
-
 			status = PTR_ERR(bio);
 			mlog_errno(status);
 			goto bail_and_wait;
 		}
-		bios[i] = bio;
 
+		atomic_inc(&wc.wc_num_reqs);
 		submit_bio(READ, bio);
 	}
 
@@ -403,38 +319,30 @@ bail_and_wait:
 	if (wc.wc_error && !status)
 		status = wc.wc_error;
 
-	if (bios) {
-		for(i = 0; i < num_bios; i++)
-			if (bios[i])
-				bio_put(bios[i]);
-		kfree(bios);
-	}
-
 	return status;
 }
 
 static int o2hb_issue_node_write(struct o2hb_region *reg,
-				 struct bio **write_bio,
 				 struct o2hb_bio_wait_ctxt *write_wc)
 {
 	int status;
 	unsigned int slot;
 	struct bio *bio;
 
-	o2hb_bio_wait_init(write_wc, 1);
+	o2hb_bio_wait_init(write_wc);
 
 	slot = o2nm_this_node();
 
-	bio = o2hb_setup_one_bio(reg, write_wc, slot, 1);
+	bio = o2hb_setup_one_bio(reg, write_wc, &slot, slot+1);
 	if (IS_ERR(bio)) {
 		status = PTR_ERR(bio);
 		mlog_errno(status);
 		goto bail;
 	}
 
+	atomic_inc(&write_wc->wc_num_reqs);
 	submit_bio(WRITE, bio);
 
-	*write_bio = bio;
 	status = 0;
 bail:
 	return status;
@@ -826,7 +734,6 @@ static int o2hb_do_disk_heartbeat(struct
 {
 	int i, ret, highest_node, change = 0;
 	unsigned long configured_nodes[BITS_TO_LONGS(O2NM_MAX_NODES)];
-	struct bio *write_bio;
 	struct o2hb_bio_wait_ctxt write_wc;
 
 	ret = o2nm_configured_node_map(configured_nodes,
@@ -864,7 +771,7 @@ static int o2hb_do_disk_heartbeat(struct
 
 	/* And fire off the write. Note that we don't wait on this I/O
 	 * until later. */
-	ret = o2hb_issue_node_write(reg, &write_bio, &write_wc);
+	ret = o2hb_issue_node_write(reg, &write_wc);
 	if (ret < 0) {
 		mlog_errno(ret);
 		return ret;
@@ -882,7 +789,6 @@ static int o2hb_do_disk_heartbeat(struct
 	 * people we find in our steady state have seen us.
 	 */
 	o2hb_wait_on_io(reg, &write_wc);
-	bio_put(write_bio);
 	if (write_wc.wc_error) {
 		/* Do not re-arm the write timeout on I/O error - we
 		 * can't be sure that the new block ever made it to
@@ -943,7 +849,6 @@ static int o2hb_thread(void *data)
 {
 	int i, ret;
 	struct o2hb_region *reg = data;
-	struct bio *write_bio;
 	struct o2hb_bio_wait_ctxt write_wc;
 	struct timeval before_hb, after_hb;
 	unsigned int elapsed_msec;
@@ -993,10 +898,9 @@ static int o2hb_thread(void *data)
 	 *
 	 * XXX: Should we skip this on unclean_stop? */
 	o2hb_prepare_block(reg, 0);
-	ret = o2hb_issue_node_write(reg, &write_bio, &write_wc);
+	ret = o2hb_issue_node_write(reg, &write_wc);
 	if (ret == 0) {
 		o2hb_wait_on_io(reg, &write_wc);
-		bio_put(write_bio);
 	} else {
 		mlog_errno(ret);
 	}

-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Drbd-dev] Re: [Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add_page()
@ 2007-01-11  9:58     ` Philipp Reisner
  0 siblings, 0 replies; 10+ messages in thread
From: Philipp Reisner @ 2007-01-11  9:58 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: ocfs2-devel, drbd-dev

[...]
>
> You changed o2hb_bio_wait_init() to initialize wc_num_reqs to zero, and
> instead you increment the value before each submit_bio(). This makes sense
> because we now don't know how many requests will ultimately be submitted.
>
> However, A race now exists here between the piecemeal increment of
> wc_num_reqs and the completion code in o2hb_bio_wait_dec():
>
> 		if (atomic_dec_and_test(&wc->wc_num_reqs)) {
> 			BUG_ON(num > 0);
> 			complete(&wc->wc_io_complete);
> 		}
>
> So, if enough bios complete before you're done submitting all your read
> requests, you could trigger the completion, wc_io_complete early.
>
> This would certainly be hard to hit, but that's never stopped a race from
> occuring before :)
>

Right, here is a new version of the patch. Now wc_num_reqs gets initialized
to one, and the final decrement happens in o2hb_wait_on_io()

My try of a 'The perfect patch':

Moved the bio_put() call into the bio's bi_end_io() function. This makes the
whole idea of trying to predict the behaviour of bio_add_page() unnecessary.
Removed compute_max_sectors() and o2hb_compute_request_limits().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>

--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -184,10 +184,9 @@ static void o2hb_disarm_write_timeout(st
 	flush_scheduled_work();
 }
 
-static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc,
-				      unsigned int num_ios)
+static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc)
 {
-	atomic_set(&wc->wc_num_reqs, num_ios);
+	atomic_set(&wc->wc_num_reqs, 1);
 	init_completion(&wc->wc_io_complete);
 	wc->wc_error = 0;
 }
@@ -212,6 +211,7 @@ static void o2hb_wait_on_io(struct o2hb_
 	struct address_space *mapping = reg->hr_bdev->bd_inode->i_mapping;
 
 	blk_run_address_space(mapping);
+	o2hb_bio_wait_dec(wc, 1);
 
 	wait_for_completion(&wc->wc_io_complete);
 }
@@ -231,6 +231,7 @@ static int o2hb_bio_end_io(struct bio *b
 		return 1;
 
 	o2hb_bio_wait_dec(wc, 1);
+	bio_put(bio);
 	return 0;
 }
 
@@ -238,23 +239,22 @@ static int o2hb_bio_end_io(struct bio *b
  * start_slot. */
 static struct bio *o2hb_setup_one_bio(struct o2hb_region *reg,
 				      struct o2hb_bio_wait_ctxt *wc,
-				      unsigned int start_slot,
-				      unsigned int num_slots)
+				      unsigned int *current_slot,
+				      unsigned int max_slots)
 {
-	int i, nr_vecs, len, first_page, last_page;
+	int len, current_page;
 	unsigned int vec_len, vec_start;
 	unsigned int bits = reg->hr_block_bits;
 	unsigned int spp = reg->hr_slots_per_page;
+	unsigned int cs = *current_slot;
 	struct bio *bio;
 	struct page *page;
 
-	nr_vecs = (num_slots + spp - 1) / spp;
-
 	/* Testing has shown this allocation to take long enough under
 	 * GFP_KERNEL that the local node can get fenced. It would be
 	 * nicest if we could pre-allocate these bios and avoid this
 	 * all together. */
-	bio = bio_alloc(GFP_ATOMIC, nr_vecs);
+	bio = bio_alloc(GFP_ATOMIC, 16);
 	if (!bio) {
 		mlog(ML_ERROR, "Could not alloc slots BIO!\n");
 		bio = ERR_PTR(-ENOMEM);
@@ -262,137 +262,53 @@ static struct bio *o2hb_setup_one_bio(st
 	}
 
 	/* Must put everything in 512 byte sectors for the bio... */
-	bio->bi_sector = (reg->hr_start_block + start_slot) << (bits - 9);
+	bio->bi_sector = (reg->hr_start_block + cs) << (bits - 9);
 	bio->bi_bdev = reg->hr_bdev;
 	bio->bi_private = wc;
 	bio->bi_end_io = o2hb_bio_end_io;
 
-	first_page = start_slot / spp;
-	last_page = first_page + nr_vecs;
-	vec_start = (start_slot << bits) % PAGE_CACHE_SIZE;
-	for(i = first_page; i < last_page; i++) {
-		page = reg->hr_slot_data[i];
+	vec_start = (cs << bits) % PAGE_CACHE_SIZE;
+	while(cs < max_slots) {
+		current_page = cs / spp;
+		page = reg->hr_slot_data[current_page];
 
-		vec_len = PAGE_CACHE_SIZE;
-		/* last page might be short */
-		if (((i + 1) * spp) > (start_slot + num_slots))
-			vec_len = ((num_slots + start_slot) % spp) << bits;
-		vec_len -=  vec_start;
+		vec_len = min( PAGE_CACHE_SIZE, 
+			       (max_slots-cs) * (PAGE_CACHE_SIZE/spp) ); 
 
 		mlog(ML_HB_BIO, "page %d, vec_len = %u, vec_start = %u\n",
-		     i, vec_len, vec_start);
+		     current_page, vec_len, vec_start);
 
 		len = bio_add_page(bio, page, vec_len, vec_start);
-		if (len != vec_len) {
-			bio_put(bio);
-			bio = ERR_PTR(-EIO);
-
-			mlog(ML_ERROR, "Error adding page to bio i = %d, "
-			     "vec_len = %u, len = %d\n, start = %u\n",
-			     i, vec_len, len, vec_start);
-			goto bail;
-		}
+		if (len != vec_len) break;
 
+		cs += vec_len / (PAGE_CACHE_SIZE/spp);
 		vec_start = 0;
 	}
 
 bail:
+	*current_slot = cs;
 	return bio;
 }
 
-/*
- * Compute the maximum number of sectors the bdev can handle in one bio,
- * as a power of two.
- *
- * Stolen from oracleasm, thanks Joel!
- */
-static int compute_max_sectors(struct block_device *bdev)
-{
-	int max_pages, max_sectors, pow_two_sectors;
-
-	struct request_queue *q;
-
-	q = bdev_get_queue(bdev);
-	max_pages = q->max_sectors >> (PAGE_SHIFT - 9);
-	if (max_pages > BIO_MAX_PAGES)
-		max_pages = BIO_MAX_PAGES;
-	if (max_pages > q->max_phys_segments)
-		max_pages = q->max_phys_segments;
-	if (max_pages > q->max_hw_segments)
-		max_pages = q->max_hw_segments;
-	max_pages--; /* Handle I/Os that straddle a page */
-
-	if (max_pages) {
-		max_sectors = max_pages << (PAGE_SHIFT - 9);
-	} else {
-		/* If BIO contains 1 or less than 1 page. */
-		max_sectors = q->max_sectors;
-	}
-	/* Why is fls() 1-based???? */
-	pow_two_sectors = 1 << (fls(max_sectors) - 1);
-
-	return pow_two_sectors;
-}
-
-static inline void o2hb_compute_request_limits(struct o2hb_region *reg,
-					       unsigned int num_slots,
-					       unsigned int *num_bios,
-					       unsigned int *slots_per_bio)
-{
-	unsigned int max_sectors, io_sectors;
-
-	max_sectors = compute_max_sectors(reg->hr_bdev);
-
-	io_sectors = num_slots << (reg->hr_block_bits - 9);
-
-	*num_bios = (io_sectors + max_sectors - 1) / max_sectors;
-	*slots_per_bio = max_sectors >> (reg->hr_block_bits - 9);
-
-	mlog(ML_HB_BIO, "My io size is %u sectors for %u slots. This "
-	     "device can handle %u sectors of I/O\n", io_sectors, num_slots,
-	     max_sectors);
-	mlog(ML_HB_BIO, "Will need %u bios holding %u slots each\n",
-	     *num_bios, *slots_per_bio);
-}
-
 static int o2hb_read_slots(struct o2hb_region *reg,
 			   unsigned int max_slots)
 {
-	unsigned int num_bios, slots_per_bio, start_slot, num_slots;
-	int i, status;
+	unsigned int current_slot=0;
+	int status;
 	struct o2hb_bio_wait_ctxt wc;
-	struct bio **bios;
 	struct bio *bio;
 
-	o2hb_compute_request_limits(reg, max_slots, &num_bios, &slots_per_bio);
+	o2hb_bio_wait_init(&wc);
 
-	bios = kcalloc(num_bios, sizeof(struct bio *), GFP_KERNEL);
-	if (!bios) {
-		status = -ENOMEM;
-		mlog_errno(status);
-		return status;
-	}
-
-	o2hb_bio_wait_init(&wc, num_bios);
-
-	num_slots = slots_per_bio;
-	for(i = 0; i < num_bios; i++) {
-		start_slot = i * slots_per_bio;
-
-		/* adjust num_slots at last bio */
-		if (max_slots < (start_slot + num_slots))
-			num_slots = max_slots - start_slot;
-
-		bio = o2hb_setup_one_bio(reg, &wc, start_slot, num_slots);
+	while(current_slot < max_slots) {
+		bio = o2hb_setup_one_bio(reg, &wc, &current_slot, max_slots);
 		if (IS_ERR(bio)) {
-			o2hb_bio_wait_dec(&wc, num_bios - i);
-
 			status = PTR_ERR(bio);
 			mlog_errno(status);
 			goto bail_and_wait;
 		}
-		bios[i] = bio;
 
+		atomic_inc(&wc.wc_num_reqs);
 		submit_bio(READ, bio);
 	}
 
@@ -403,38 +319,30 @@ bail_and_wait:
 	if (wc.wc_error && !status)
 		status = wc.wc_error;
 
-	if (bios) {
-		for(i = 0; i < num_bios; i++)
-			if (bios[i])
-				bio_put(bios[i]);
-		kfree(bios);
-	}
-
 	return status;
 }
 
 static int o2hb_issue_node_write(struct o2hb_region *reg,
-				 struct bio **write_bio,
 				 struct o2hb_bio_wait_ctxt *write_wc)
 {
 	int status;
 	unsigned int slot;
 	struct bio *bio;
 
-	o2hb_bio_wait_init(write_wc, 1);
+	o2hb_bio_wait_init(write_wc);
 
 	slot = o2nm_this_node();
 
-	bio = o2hb_setup_one_bio(reg, write_wc, slot, 1);
+	bio = o2hb_setup_one_bio(reg, write_wc, &slot, slot+1);
 	if (IS_ERR(bio)) {
 		status = PTR_ERR(bio);
 		mlog_errno(status);
 		goto bail;
 	}
 
+	atomic_inc(&write_wc->wc_num_reqs);
 	submit_bio(WRITE, bio);
 
-	*write_bio = bio;
 	status = 0;
 bail:
 	return status;
@@ -826,7 +734,6 @@ static int o2hb_do_disk_heartbeat(struct
 {
 	int i, ret, highest_node, change = 0;
 	unsigned long configured_nodes[BITS_TO_LONGS(O2NM_MAX_NODES)];
-	struct bio *write_bio;
 	struct o2hb_bio_wait_ctxt write_wc;
 
 	ret = o2nm_configured_node_map(configured_nodes,
@@ -864,7 +771,7 @@ static int o2hb_do_disk_heartbeat(struct
 
 	/* And fire off the write. Note that we don't wait on this I/O
 	 * until later. */
-	ret = o2hb_issue_node_write(reg, &write_bio, &write_wc);
+	ret = o2hb_issue_node_write(reg, &write_wc);
 	if (ret < 0) {
 		mlog_errno(ret);
 		return ret;
@@ -882,7 +789,6 @@ static int o2hb_do_disk_heartbeat(struct
 	 * people we find in our steady state have seen us.
 	 */
 	o2hb_wait_on_io(reg, &write_wc);
-	bio_put(write_bio);
 	if (write_wc.wc_error) {
 		/* Do not re-arm the write timeout on I/O error - we
 		 * can't be sure that the new block ever made it to
@@ -943,7 +849,6 @@ static int o2hb_thread(void *data)
 {
 	int i, ret;
 	struct o2hb_region *reg = data;
-	struct bio *write_bio;
 	struct o2hb_bio_wait_ctxt write_wc;
 	struct timeval before_hb, after_hb;
 	unsigned int elapsed_msec;
@@ -993,10 +898,9 @@ static int o2hb_thread(void *data)
 	 *
 	 * XXX: Should we skip this on unclean_stop? */
 	o2hb_prepare_block(reg, 0);
-	ret = o2hb_issue_node_write(reg, &write_bio, &write_wc);
+	ret = o2hb_issue_node_write(reg, &write_wc);
 	if (ret == 0) {
 		o2hb_wait_on_io(reg, &write_wc);
-		bio_put(write_bio);
 	} else {
 		mlog_errno(ret);
 	}

-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add_page()
  2007-01-11  9:58     ` [Drbd-dev] " Philipp Reisner
@ 2007-01-16 19:57       ` Mark Fasheh
  -1 siblings, 0 replies; 10+ messages in thread
From: Mark Fasheh @ 2007-01-16 11:58 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Jan 11, 2007 at 10:58:10AM +0100, Philipp Reisner wrote:
> My try of a 'The perfect patch':
> 
> Moved the bio_put() call into the bio's bi_end_io() function. This makes the
> whole idea of trying to predict the behaviour of bio_add_page() unnecessary.
> Removed compute_max_sectors() and o2hb_compute_request_limits().

Ok, this is uploading to ocfs2.git right now. It needs some testing in -mm
(and internally) for a while before I push it upstream. Have you tried this
in a cluster of more than one node?

Thanks,
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add_page()
  2007-01-16 19:57       ` [Drbd-dev] " Mark Fasheh
@ 2007-01-16 20:20         ` Philipp Reisner
  -1 siblings, 0 replies; 10+ messages in thread
From: Philipp Reisner @ 2007-01-16 12:20 UTC (permalink / raw)
  To: ocfs2-devel

Am Dienstag, 16. Januar 2007 20:57 schrieb Mark Fasheh:
> On Thu, Jan 11, 2007 at 10:58:10AM +0100, Philipp Reisner wrote:
> > My try of a 'The perfect patch':
> >
> > Moved the bio_put() call into the bio's bi_end_io() function. This makes
> > the whole idea of trying to predict the behaviour of bio_add_page()
> > unnecessary. Removed compute_max_sectors() and
> > o2hb_compute_request_limits().
>
> Ok, this is uploading to ocfs2.git right now. It needs some testing in -mm
> (and internally) for a while before I push it upstream. Have you tried this
> in a cluster of more than one node?
>

Yes in my 2 node cluster. Running OCFS2 on top of DRBD8.

Here is an excerpt from one of the machines kernel log:

Jan 16 21:10:26 wurzel kernel: [15760.632055] drbd5: role( Secondary -> Primary )
Jan 16 21:10:26 wurzel kernel: [15760.632079] drbd5: Writing meta data super block now.
Jan 16 21:10:29 wurzel kernel: [15764.170075] drbd5: peer( Secondary -> Primary )
Jan 16 21:10:53 wurzel kernel: [15787.729397] OCFS2 1.3.3
Jan 16 21:11:08 wurzel kernel: [15802.715225] ocfs2_dlm: Nodes in domain ("B0E1160C780B4AA2A5E763036EE3910E"): 0
Jan 16 21:11:08 wurzel kernel: [15802.734416] JBD: Ignoring recovery information on journal
Jan 16 21:11:08 wurzel kernel: [15802.748171] kjournald starting.  Commit interval 5 seconds
Jan 16 21:11:08 wurzel kernel: [15802.749319] ocfs2: Mounting device (147,5) on (node 0, slot 0) with ordered data mode.
Jan 16 21:11:16 wurzel kernel: [15810.429171] o2net: accepted connection from node sepp (num 1) at 192.168.243.2:7777
Jan 16 21:11:20 wurzel kernel: [15814.585687] ocfs2_dlm: Node 1 joins domain B0E1160C780B4AA2A5E763036EE3910E
Jan 16 21:11:20 wurzel kernel: [15814.585692] ocfs2_dlm: Nodes in domain ("B0E1160C780B4AA2A5E763036EE3910E"): 0 1

-phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Drbd-dev] Re: [Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add_page()
@ 2007-01-16 19:57       ` Mark Fasheh
  0 siblings, 0 replies; 10+ messages in thread
From: Mark Fasheh @ 2007-01-16 19:57 UTC (permalink / raw)
  To: Philipp Reisner; +Cc: ocfs2-devel, drbd-dev

On Thu, Jan 11, 2007 at 10:58:10AM +0100, Philipp Reisner wrote:
> My try of a 'The perfect patch':
> 
> Moved the bio_put() call into the bio's bi_end_io() function. This makes the
> whole idea of trying to predict the behaviour of bio_add_page() unnecessary.
> Removed compute_max_sectors() and o2hb_compute_request_limits().

Ok, this is uploading to ocfs2.git right now. It needs some testing in -mm
(and internally) for a while before I push it upstream. Have you tried this
in a cluster of more than one node?

Thanks,
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Drbd-dev] Re: [Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add_page()
@ 2007-01-16 20:20         ` Philipp Reisner
  0 siblings, 0 replies; 10+ messages in thread
From: Philipp Reisner @ 2007-01-16 20:20 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: ocfs2-devel, drbd-dev

Am Dienstag, 16. Januar 2007 20:57 schrieb Mark Fasheh:
> On Thu, Jan 11, 2007 at 10:58:10AM +0100, Philipp Reisner wrote:
> > My try of a 'The perfect patch':
> >
> > Moved the bio_put() call into the bio's bi_end_io() function. This makes
> > the whole idea of trying to predict the behaviour of bio_add_page()
> > unnecessary. Removed compute_max_sectors() and
> > o2hb_compute_request_limits().
>
> Ok, this is uploading to ocfs2.git right now. It needs some testing in -mm
> (and internally) for a while before I push it upstream. Have you tried this
> in a cluster of more than one node?
>

Yes in my 2 node cluster. Running OCFS2 on top of DRBD8.

Here is an excerpt from one of the machines kernel log:

Jan 16 21:10:26 wurzel kernel: [15760.632055] drbd5: role( Secondary -> Primary )
Jan 16 21:10:26 wurzel kernel: [15760.632079] drbd5: Writing meta data super block now.
Jan 16 21:10:29 wurzel kernel: [15764.170075] drbd5: peer( Secondary -> Primary )
Jan 16 21:10:53 wurzel kernel: [15787.729397] OCFS2 1.3.3
Jan 16 21:11:08 wurzel kernel: [15802.715225] ocfs2_dlm: Nodes in domain ("B0E1160C780B4AA2A5E763036EE3910E"): 0
Jan 16 21:11:08 wurzel kernel: [15802.734416] JBD: Ignoring recovery information on journal
Jan 16 21:11:08 wurzel kernel: [15802.748171] kjournald starting.  Commit interval 5 seconds
Jan 16 21:11:08 wurzel kernel: [15802.749319] ocfs2: Mounting device (147,5) on (node 0, slot 0) with ordered data mode.
Jan 16 21:11:16 wurzel kernel: [15810.429171] o2net: accepted connection from node sepp (num 1) at 192.168.243.2:7777
Jan 16 21:11:20 wurzel kernel: [15814.585687] ocfs2_dlm: Node 1 joins domain B0E1160C780B4AA2A5E763036EE3910E
Jan 16 21:11:20 wurzel kernel: [15814.585692] ocfs2_dlm: Nodes in domain ("B0E1160C780B4AA2A5E763036EE3910E"): 0 1

-phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2007-01-17 12:01 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-01-05  6:07 [Ocfs2-devel] Fixing cluster/heartbeat.c's use of bio_add_page() Philipp Reisner
2007-01-05 14:06 ` [Drbd-dev] " Philipp Reisner
2007-01-09 14:28 ` [Ocfs2-devel] " Mark Fasheh
2007-01-09 22:28   ` [Drbd-dev] " Mark Fasheh
2007-01-11  1:58   ` Philipp Reisner
2007-01-11  9:58     ` [Drbd-dev] " Philipp Reisner
2007-01-16 11:58     ` Mark Fasheh
2007-01-16 19:57       ` [Drbd-dev] " Mark Fasheh
2007-01-16 12:20       ` Philipp Reisner
2007-01-16 20:20         ` [Drbd-dev] " Philipp Reisner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.