[PATCH 0] rbd: layered writes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0]  rbd: layered writes
@ 2013-04-19 22:46 Alex Elder
  2013-04-19 22:49 ` [PATCH] libceph: fix two messenger bugs Alex Elder
                   ` (8 more replies)
  0 siblings, 9 replies; 19+ messages in thread
From: Alex Elder @ 2013-04-19 22:46 UTC (permalink / raw)
  To: ceph-devel

This set of patches culminates in providing layered
write functionality for rbd clone images.  Until the
feature bits get enabled, most of this code is not
yet activated.  But it has been tested and it's
ready for review.  The feature will be enabled soon,
after a few more things get cleaned up to prepare
for it.

					-Alex

[PATCH] libceph: fix two messenger bugs
    Two bugs, two improvements.

[PATCH] libceph: support pages for class request data
    This just defines a function that allows a page array
    to be specified for the data to be written by an
    object class method.

[PATCH 1/4] rbd: define separate read and write format funcs
[PATCH 2/4] rbd: encapsulate submission of image object requests
[PATCH 3/4] rbd: define zero_pages()
[PATCH 4/4] rbd: support page array image requests
    The first two of these are just cleanups.  The second two
    add some functionality to allow image requests to use
    page arrays for their object data.

[PATCH 1/2] rbd: implement full object parent reads
[PATCH 2/2] rbd: issue a copyup for layered writes
    These last two finally implement the layered copyup
    functionality.  The first is sort of a half step, to
    break up the review into smaller pieces.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] libceph: fix two messenger bugs
  2013-04-19 22:46 [PATCH 0] rbd: layered writes Alex Elder
@ 2013-04-19 22:49 ` Alex Elder
  2013-04-22  7:14   ` Josh Durgin
  2013-04-19 22:49 ` [PATCH] libceph: support pages for class request data Alex Elder
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Alex Elder @ 2013-04-19 22:49 UTC (permalink / raw)
  To: ceph-devel

This patch makes four small changes in the ceph messenger.

While getting copyup functionality working I found two bugs in the
messenger.  Existing paths through the code did not trigger these
problems, but they're fixed here:
    - In ceph_msg_data_pagelist_cursor_init(), the cursor's
      last_piece field was being checked against the length
      supplied.  This was OK until this commit: ccba6d98 libceph:
      implement multiple data items in a message That commit changed
      the cursor init routines to allow lengths to be supplied that
      exceeded the size of the current data item. Because of this,
      we have to use the assigned cursor resid field rather than the
      provided length in determining whether the cursor points to
      the last piece of a data item.
    - In ceph_msg_data_add_pages(), a BUG_ON() was erroneously
      catching attempts to add page data to a message if the message
      already had data assigned to it. That was OK until that same
      commit, at which point it was fine for messages to have
      multiple data items. It slipped through because that BUG_ON()
      call was present twice in that function. (You can never be too
      careful.)

In addition two other minor things are changed:
    - In ceph_msg_data_cursor_init(), the local variable "data" was
      getting assigned twice.
    - In ceph_msg_data_advance(), it was assumed that the
      type-specific advance routine would set new_piece to true
      after it advanced past the last piece. That may have been
      fine, but since we check for that case we might as well set it
      explicitly in ceph_msg_data_advance().

This resolves:
    http://tracker.ceph.com/issues/4762

Signed-off-by: Alex Elder <elder@inktank.com>
---
 net/ceph/messenger.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index a36d98d..91dd451 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -913,7 +913,7 @@ ceph_msg_data_pagelist_cursor_init(struct
ceph_msg_data_cursor *cursor,
 	cursor->resid = min(length, pagelist->length);
 	cursor->page = page;
 	cursor->offset = 0;
-	cursor->last_piece = length <= PAGE_SIZE;
+	cursor->last_piece = cursor->resid <= PAGE_SIZE;
 }

 static struct page *
@@ -1013,8 +1013,6 @@ static void ceph_msg_data_cursor_init(struct
ceph_msg *msg, size_t length)
 	BUG_ON(length > msg->data_length);
 	BUG_ON(list_empty(&msg->data));

-	data = list_first_entry(&msg->data, struct ceph_msg_data, links);
-
 	cursor->data_head = &msg->data;
 	cursor->total_resid = length;
 	data = list_first_entry(&msg->data, struct ceph_msg_data, links);
@@ -1088,14 +1086,15 @@ static bool ceph_msg_data_advance(struct
ceph_msg_data_cursor *cursor,
 		break;
 	}
 	cursor->total_resid -= bytes;
-	cursor->need_crc = new_piece;

 	if (!cursor->resid && cursor->total_resid) {
 		WARN_ON(!cursor->last_piece);
 		BUG_ON(list_is_last(&cursor->data->links, cursor->data_head));
 		cursor->data = list_entry_next(cursor->data, links);
 		__ceph_msg_data_cursor_init(cursor);
+		new_piece = true;
 	}
+	cursor->need_crc = new_piece;

 	return new_piece;
 }
@@ -3019,7 +3018,6 @@ void ceph_msg_data_add_pages(struct ceph_msg *msg,
struct page **pages,
 	data->length = length;
 	data->alignment = alignment & ~PAGE_MASK;

-	BUG_ON(!list_empty(&msg->data));
 	list_add_tail(&data->links, &msg->data);
 	msg->data_length += length;
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH] libceph: support pages for class request data
  2013-04-19 22:46 [PATCH 0] rbd: layered writes Alex Elder
  2013-04-19 22:49 ` [PATCH] libceph: fix two messenger bugs Alex Elder
@ 2013-04-19 22:49 ` Alex Elder
  2013-04-22  7:15   ` Josh Durgin
  2013-04-19 22:49 ` [PATCH 1/4] rbd: define separate read and write format funcs Alex Elder
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Alex Elder @ 2013-04-19 22:49 UTC (permalink / raw)
  To: ceph-devel

Add the ability to provide an array of pages as outbound request
data for object class method calls.

Signed-off-by: Alex Elder <elder@inktank.com>
---
 include/linux/ceph/osd_client.h |    5 +++++
 net/ceph/osd_client.c           |   12 ++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/ceph/osd_client.h
b/include/linux/ceph/osd_client.h
index 4d84a2b..4191cd2 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -273,6 +273,11 @@ extern void osd_req_op_extent_osd_data_bio(struct
ceph_osd_request *,
 extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
 					unsigned int which,
 					struct ceph_pagelist *pagelist);
+extern void osd_req_op_cls_request_data_pages(struct ceph_osd_request *,
+					unsigned int which,
+					struct page **pages, u64 length,
+					u32 alignment, bool pages_from_pool,
+					bool own_pages);
 extern void osd_req_op_cls_response_data_pages(struct ceph_osd_request *,
 					unsigned int which,
 					struct page **pages, u64 length,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index c842e87..467020c 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -214,6 +214,18 @@ void osd_req_op_cls_request_data_pagelist(
 }
 EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist);

+void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
+			unsigned int which, struct page **pages, u64 length,
+			u32 alignment, bool pages_from_pool, bool own_pages)
+{
+	struct ceph_osd_data *osd_data;
+
+	osd_data = osd_req_op_data(osd_req, which, cls, request_data);
+	ceph_osd_data_pages_init(osd_data, pages, length, alignment,
+				pages_from_pool, own_pages);
+}
+EXPORT_SYMBOL(osd_req_op_cls_request_data_pages);
+
 void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req,
 			unsigned int which, struct page **pages, u64 length,
 			u32 alignment, bool pages_from_pool, bool own_pages)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 1/4] rbd: define separate read and write format funcs
  2013-04-19 22:46 [PATCH 0] rbd: layered writes Alex Elder
  2013-04-19 22:49 ` [PATCH] libceph: fix two messenger bugs Alex Elder
  2013-04-19 22:49 ` [PATCH] libceph: support pages for class request data Alex Elder
@ 2013-04-19 22:49 ` Alex Elder
  2013-04-22  7:23   ` Josh Durgin
  2013-04-19 22:50 ` [PATCH 2/4] rbd: encapsulate submission of image object requests Alex Elder
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Alex Elder @ 2013-04-19 22:49 UTC (permalink / raw)
  To: ceph-devel

Separate rbd_osd_req_format() into two functions, one for read
requests and the other for write requests.

Signed-off-by: Alex Elder <elder@inktank.com>
---
 drivers/block/rbd.c |   49
++++++++++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index ce2fb3a..a185239 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1489,28 +1489,31 @@ static void rbd_osd_req_callback(struct
ceph_osd_request *osd_req,
 		rbd_obj_request_complete(obj_request);
 }

-static void rbd_osd_req_format(struct rbd_obj_request *obj_request,
-					bool write_request)
+static void rbd_osd_req_format_read(struct rbd_obj_request *obj_request)
 {
 	struct rbd_img_request *img_request = obj_request->img_request;
 	struct ceph_osd_request *osd_req = obj_request->osd_req;
-	struct ceph_snap_context *snapc = NULL;
-	u64 snap_id = CEPH_NOSNAP;
-	struct timespec *mtime = NULL;
-	struct timespec now;
+	u64 snap_id;

 	rbd_assert(osd_req != NULL);

-	if (write_request) {
-		now = CURRENT_TIME;
-		mtime = &now;
-		if (img_request)
-			snapc = img_request->snapc;
-	} else if (img_request) {
-		snap_id = img_request->snap_id;
-	}
+	snap_id = img_request ? img_request->snap_id : CEPH_NOSNAP;
+	ceph_osdc_build_request(osd_req, obj_request->offset,
+			NULL, snap_id, NULL);
+}
+
+static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
+{
+	struct rbd_img_request *img_request = obj_request->img_request;
+	struct ceph_osd_request *osd_req = obj_request->osd_req;
+	struct ceph_snap_context *snapc;
+	struct timespec mtime = CURRENT_TIME;
+
+	rbd_assert(osd_req != NULL);
+
+	snapc = img_request ? img_request->snapc : NULL;
 	ceph_osdc_build_request(osd_req, obj_request->offset,
-			snapc, snap_id, mtime);
+			snapc, CEPH_NOSNAP, &mtime);
 }

 static struct ceph_osd_request *rbd_osd_req_create(
@@ -1845,7 +1848,11 @@ static int rbd_img_request_fill_bio(struct
rbd_img_request *img_request,
 						0, 0);
 		osd_req_op_extent_osd_data_bio(osd_req, 0,
 				obj_request->bio_list, obj_request->length);
-		rbd_osd_req_format(obj_request, write_request);
+
+		if (write_request)
+			rbd_osd_req_format_write(obj_request);
+		else
+			rbd_osd_req_format_read(obj_request);

 		obj_request->img_offset = img_offset;
 		rbd_img_obj_request_add(img_request, obj_request);
@@ -1969,7 +1976,7 @@ static int rbd_img_obj_exists_submit(struct
rbd_obj_request *obj_request)
 	osd_req_op_init(stat_request->osd_req, 0, CEPH_OSD_OP_STAT);
 	osd_req_op_raw_data_in_pages(stat_request->osd_req, 0, pages, size, 0,
 					false, false);
-	rbd_osd_req_format(stat_request, false);
+	rbd_osd_req_format_read(stat_request);

 	osdc = &rbd_dev->rbd_client->client->osdc;
 	ret = rbd_obj_request_submit(osdc, stat_request);
@@ -2091,7 +2098,7 @@ static int rbd_obj_notify_ack(struct rbd_device
*rbd_dev,

 	osd_req_op_watch_init(obj_request->osd_req, 0, CEPH_OSD_OP_NOTIFY_ACK,
 					notify_id, ver, 0);
-	rbd_osd_req_format(obj_request, false);
+	rbd_osd_req_format_read(obj_request);

 	ret = rbd_obj_request_submit(osdc, obj_request);
 out:
@@ -2161,7 +2168,7 @@ static int rbd_dev_header_watch_sync(struct
rbd_device *rbd_dev, int start)
 	osd_req_op_watch_init(obj_request->osd_req, 0, CEPH_OSD_OP_WATCH,
 				rbd_dev->watch_event->cookie,
 				rbd_dev->header.obj_version, start);
-	rbd_osd_req_format(obj_request, true);
+	rbd_osd_req_format_write(obj_request);

 	ret = rbd_obj_request_submit(osdc, obj_request);
 	if (ret)
@@ -2262,7 +2269,7 @@ static int rbd_obj_method_sync(struct rbd_device
*rbd_dev,
 	osd_req_op_cls_response_data_pages(obj_request->osd_req, 0,
 					obj_request->pages, inbound_size,
 					0, false, false);
-	rbd_osd_req_format(obj_request, false);
+	rbd_osd_req_format_read(obj_request);

 	ret = rbd_obj_request_submit(osdc, obj_request);
 	if (ret)
@@ -2473,7 +2480,7 @@ static int rbd_obj_read_sync(struct rbd_device
*rbd_dev,
 					obj_request->length,
 					obj_request->offset & ~PAGE_MASK,
 					false, false);
-	rbd_osd_req_format(obj_request, false);
+	rbd_osd_req_format_read(obj_request);

 	ret = rbd_obj_request_submit(osdc, obj_request);
 	if (ret)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/4] rbd: encapsulate submission of image object requests
  2013-04-19 22:46 [PATCH 0] rbd: layered writes Alex Elder
                   ` (2 preceding siblings ...)
  2013-04-19 22:49 ` [PATCH 1/4] rbd: define separate read and write format funcs Alex Elder
@ 2013-04-19 22:50 ` Alex Elder
  2013-04-22  7:35   ` Josh Durgin
  2013-04-19 22:50 ` [PATCH 3/4] rbd: define zero_pages() Alex Elder
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Alex Elder @ 2013-04-19 22:50 UTC (permalink / raw)
  To: ceph-devel

Object requests that are part of an image request are subject to
some additional handling.  Define rbd_img_obj_request_submit() to
encapsulate that, and use it when initially submitting an image
object request, and when re-submitting it during callback of
an object existence check.

Signed-off-by: Alex Elder <elder@inktank.com>
---
 drivers/block/rbd.c |   65
++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 43 insertions(+), 22 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index a185239..894af4f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -423,6 +423,7 @@ void rbd_warn(struct rbd_device *rbd_dev, const char
*fmt, ...)
 #endif /* !RBD_DEBUG */

 static void rbd_img_parent_read(struct rbd_obj_request *obj_request);
+static int rbd_img_obj_request_submit(struct rbd_obj_request *obj_request);

 static int rbd_dev_refresh(struct rbd_device *rbd_dev, u64 *hver);
 static int rbd_dev_v2_refresh(struct rbd_device *rbd_dev, u64 *hver);
@@ -1874,8 +1875,6 @@ out_unwind:

 static void rbd_img_obj_exists_callback(struct rbd_obj_request
*obj_request)
 {
-	struct rbd_device *rbd_dev;
-	struct ceph_osd_client *osdc;
 	struct rbd_obj_request *orig_request;
 	int result;

@@ -1901,8 +1900,6 @@ static void rbd_img_obj_exists_callback(struct
rbd_obj_request *obj_request)

 	rbd_assert(orig_request);
 	rbd_assert(orig_request->img_request);
-	rbd_dev = orig_request->img_request->rbd_dev;
-	osdc = &rbd_dev->rbd_client->client->osdc;

 	/*
 	 * Our only purpose here is to determine whether the object
@@ -1923,7 +1920,7 @@ static void rbd_img_obj_exists_callback(struct
rbd_obj_request *obj_request)
 	 * Resubmit the original request now that we have recorded
 	 * whether the target object exists.
 	 */
-	orig_request->result = rbd_obj_request_submit(osdc, orig_request);
+	orig_request->result = rbd_img_obj_request_submit(orig_request);
 out_err:
 	if (orig_request->result)
 		rbd_obj_request_complete(orig_request);
@@ -1987,32 +1984,56 @@ out:
 	return ret;
 }

+static int rbd_img_obj_request_submit(struct rbd_obj_request *obj_request)
+{
+	struct rbd_img_request *img_request;
+
+	rbd_assert(obj_request_img_data_test(obj_request));
+
+	img_request = obj_request->img_request;
+	rbd_assert(img_request);
+
+	/* (At the moment we don't care whether it exists or not...) */
+	(void) obj_request_exists_test;
+
+	/*
+	 * Only layered writes need special handling.  If it's not a
+	 * layered write, or it is a layered write but we know the
+	 * target object exists, it's no different from any other
+	 * object request.
+	 */
+	if (!img_request_write_test(img_request) ||
+		!img_request_layered_test(img_request) ||
+		obj_request_known_test(obj_request)) {
+
+		struct rbd_device *rbd_dev;
+		struct ceph_osd_client *osdc;
+
+		rbd_dev = obj_request->img_request->rbd_dev;
+		osdc = &rbd_dev->rbd_client->client->osdc;
+
+		return rbd_obj_request_submit(osdc, obj_request);
+	}
+
+	/*
+	 * It's a layered write and we don't know whether the target
+	 * exists.  Issue existence check; once that completes the
+	 * original request will be submitted again.
+	 */
+
+	return rbd_img_obj_exists_submit(obj_request);
+}
+
 static int rbd_img_request_submit(struct rbd_img_request *img_request)
 {
-	struct rbd_device *rbd_dev = img_request->rbd_dev;
-	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
 	struct rbd_obj_request *obj_request;
 	struct rbd_obj_request *next_obj_request;
-	bool write_request = img_request_write_test(img_request);
-	bool layered = img_request_layered_test(img_request);

 	dout("%s: img %p\n", __func__, img_request);
 	for_each_obj_request_safe(img_request, obj_request, next_obj_request) {
-		bool known;
-		bool object_exists;
 		int ret;

-		/*
-		 * We need to know whether the target object exists
-		 * for a layered write.  Issue an existence check
-		 * first if we need to.
-		 */
-		known = obj_request_known_test(obj_request);
-		object_exists = known && obj_request_exists_test(obj_request);
-		if (!write_request || !layered || object_exists)
-			ret = rbd_obj_request_submit(osdc, obj_request);
-		else
-			ret = rbd_img_obj_exists_submit(obj_request);
+		ret = rbd_img_obj_request_submit(obj_request);
 		if (ret)
 			return ret;
 	}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/4] rbd: define zero_pages()
  2013-04-19 22:46 [PATCH 0] rbd: layered writes Alex Elder
                   ` (3 preceding siblings ...)
  2013-04-19 22:50 ` [PATCH 2/4] rbd: encapsulate submission of image object requests Alex Elder
@ 2013-04-19 22:50 ` Alex Elder
  2013-04-22  8:05   ` Josh Durgin
  2013-04-19 22:50 ` [PATCH 4/4] rbd: support page array image requests Alex Elder
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Alex Elder @ 2013-04-19 22:50 UTC (permalink / raw)
  To: ceph-devel

Define a new function zero_pages() that zeroes a range of memory
defined by a page array, along the lines of zero_bio_chain().  It
saves and the irq flags like bvec_kmap_irq() does, though I'm not
sure at this point that it's necessary.

Update rbd_img_obj_request_read_callback() to use the new function
if the object request contains page rather than bio data.

For the moment, only bio data is used for osd READ ops.

Signed-off-by: Alex Elder <elder@inktank.com>
---
 drivers/block/rbd.c |   55
+++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 47 insertions(+), 8 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 894af4f..ac9abab 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -971,6 +971,37 @@ static void zero_bio_chain(struct bio *chain, int
start_ofs)
 }

 /*
+ * similar to zero_bio_chain(), zeros data defined by a page array,
+ * starting at the given byte offset from the start of the array and
+ * continuing up to the given end offset.  The pages array is
+ * assumed to be big enough to hold all bytes up to the end.
+ */
+static void zero_pages(struct page **pages, u64 offset, u64 end)
+{
+	struct page **page = &pages[offset >> PAGE_SHIFT];
+
+	rbd_assert(end > offset);
+	rbd_assert(end - offset <= (u64)SIZE_MAX);
+	while (offset < end) {
+		size_t page_offset;
+		size_t length;
+		unsigned long flags;
+		void *kaddr;
+
+		page_offset = (size_t)(offset & ~PAGE_MASK);
+		length = min(PAGE_SIZE - page_offset, (size_t)(end - offset));
+		local_irq_save(flags);
+		kaddr = kmap_atomic(*page);
+		memset(kaddr + page_offset, 0, length);
+		kunmap_atomic(kaddr);
+		local_irq_restore(flags);
+
+		offset += length;
+		page++;
+	}
+}
+
+/*
  * Clone a portion of a bio, starting at the given byte offset
  * and continuing for the number of bytes indicated.
  */
@@ -1352,9 +1383,12 @@ static bool img_request_layered_test(struct
rbd_img_request *img_request)
 static void
 rbd_img_obj_request_read_callback(struct rbd_obj_request *obj_request)
 {
+	u64 xferred = obj_request->xferred;
+	u64 length = obj_request->length;
+
 	dout("%s: obj %p img %p result %d %llu/%llu\n", __func__,
 		obj_request, obj_request->img_request, obj_request->result,
-		obj_request->xferred, obj_request->length);
+		xferred, length);
 	/*
 	 * ENOENT means a hole in the image.  We zero-fill the
 	 * entire length of the request.  A short read also implies
@@ -1362,15 +1396,20 @@ rbd_img_obj_request_read_callback(struct
rbd_obj_request *obj_request)
 	 * update the xferred count to indicate the whole request
 	 * was satisfied.
 	 */
-	BUG_ON(obj_request->type != OBJ_REQUEST_BIO);
+	rbd_assert(obj_request->type != OBJ_REQUEST_NODATA);
 	if (obj_request->result == -ENOENT) {
-		zero_bio_chain(obj_request->bio_list, 0);
+		if (obj_request->type == OBJ_REQUEST_BIO)
+			zero_bio_chain(obj_request->bio_list, 0);
+		else
+			zero_pages(obj_request->pages, 0, length);
 		obj_request->result = 0;
-		obj_request->xferred = obj_request->length;
-	} else if (obj_request->xferred < obj_request->length &&
-			!obj_request->result) {
-		zero_bio_chain(obj_request->bio_list, obj_request->xferred);
-		obj_request->xferred = obj_request->length;
+		obj_request->xferred = length;
+	} else if (xferred < length && !obj_request->result) {
+		if (obj_request->type == OBJ_REQUEST_BIO)
+			zero_bio_chain(obj_request->bio_list, xferred);
+		else
+			zero_pages(obj_request->pages, xferred, length);
+		obj_request->xferred = length;
 	}
 	obj_request_done_set(obj_request);
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/4] rbd: support page array image requests
  2013-04-19 22:46 [PATCH 0] rbd: layered writes Alex Elder
                   ` (4 preceding siblings ...)
  2013-04-19 22:50 ` [PATCH 3/4] rbd: define zero_pages() Alex Elder
@ 2013-04-19 22:50 ` Alex Elder
  2013-04-22  8:13   ` Josh Durgin
  2013-04-19 22:50 ` [PATCH 1/2] rbd: implement full object parent reads Alex Elder
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Alex Elder @ 2013-04-19 22:50 UTC (permalink / raw)
  To: ceph-devel

This patch adds the ability to build an image request whose data
will be written from or read into memory described by a page array.
(Previously only bio lists were supported.)

Originally this was going to define a new function for this purpose
but it was largely identical to the rbd_img_request_fill_bio().  So
instead, rbd_img_request_fill_bio() has been generalized to handle
both types of image request.

For the moment we still only fill image requests with bio data.

Signed-off-by: Alex Elder <elder@inktank.com>
---
 drivers/block/rbd.c |   86
+++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 66 insertions(+), 20 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index ac9abab..91fcf36 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1780,6 +1780,13 @@ static bool rbd_img_obj_end_request(struct
rbd_obj_request *obj_request)
 			img_request->result = result;
 	}

+	/* Image object requests don't own their page array */
+
+	if (obj_request->type == OBJ_REQUEST_PAGES) {
+		obj_request->pages = NULL;
+		obj_request->page_count = 0;
+	}
+
 	if (img_request_child_test(img_request)) {
 		rbd_assert(img_request->obj_request != NULL);
 		more = obj_request->which < img_request->obj_request_count - 1;
@@ -1830,30 +1837,48 @@ out:
 		rbd_img_request_complete(img_request);
 }

-static int rbd_img_request_fill_bio(struct rbd_img_request *img_request,
-					struct bio *bio_list)
+/*
+ * Split up an image request into one or more object requests, each
+ * to a different object.  The the "type" parameter indicates
+ * whether "data_desc" is the pointer to the head of a list of bio
+ * structures, or the base of a page array.  In either case this
+ * function assumes data_desc describes memory sufficient to hold
+ * all data described by the image request.
+ */
+static int rbd_img_request_fill(struct rbd_img_request *img_request,
+					enum obj_request_type type,
+					void *data_desc)
 {
 	struct rbd_device *rbd_dev = img_request->rbd_dev;
 	struct rbd_obj_request *obj_request = NULL;
 	struct rbd_obj_request *next_obj_request;
 	bool write_request = img_request_write_test(img_request);
-	unsigned int bio_offset;
+	struct bio *bio_list;
+	unsigned int bio_offset = 0;
+	struct page **pages;
 	u64 img_offset;
 	u64 resid;
 	u16 opcode;

-	dout("%s: img %p bio %p\n", __func__, img_request, bio_list);
+	dout("%s: img %p type %d data_desc %p\n", __func__, img_request,
+		(int)type, data_desc);

 	opcode = write_request ? CEPH_OSD_OP_WRITE : CEPH_OSD_OP_READ;
-	bio_offset = 0;
 	img_offset = img_request->offset;
-	rbd_assert(img_offset == bio_list->bi_sector << SECTOR_SHIFT);
 	resid = img_request->length;
 	rbd_assert(resid > 0);
+
+	if (type == OBJ_REQUEST_BIO) {
+		bio_list = data_desc;
+		rbd_assert(img_offset == bio_list->bi_sector << SECTOR_SHIFT);
+	} else {
+		rbd_assert(type == OBJ_REQUEST_PAGES);
+		pages = data_desc;
+	}
+
 	while (resid) {
 		struct ceph_osd_request *osd_req;
 		const char *object_name;
-		unsigned int clone_size;
 		u64 offset;
 		u64 length;

@@ -1863,19 +1888,33 @@ static int rbd_img_request_fill_bio(struct
rbd_img_request *img_request,
 		offset = rbd_segment_offset(rbd_dev, img_offset);
 		length = rbd_segment_length(rbd_dev, img_offset, resid);
 		obj_request = rbd_obj_request_create(object_name,
-						offset, length,
-						OBJ_REQUEST_BIO);
+						offset, length, type);
 		kfree(object_name);	/* object request has its own copy */
 		if (!obj_request)
 			goto out_unwind;

-		rbd_assert(length <= (u64) UINT_MAX);
-		clone_size = (unsigned int) length;
-		obj_request->bio_list = bio_chain_clone_range(&bio_list,
-						&bio_offset, clone_size,
-						GFP_ATOMIC);
-		if (!obj_request->bio_list)
-			goto out_partial;
+		if (type == OBJ_REQUEST_BIO) {
+			unsigned int clone_size;
+
+			rbd_assert(length <= (u64)UINT_MAX);
+			clone_size = (unsigned int)length;
+			obj_request->bio_list =
+					bio_chain_clone_range(&bio_list,
+								&bio_offset,
+								clone_size,
+								GFP_ATOMIC);
+			if (!obj_request->bio_list)
+				goto out_partial;
+		} else {
+			unsigned int page_count;
+
+			obj_request->pages = pages;
+			page_count = (u32)calc_pages_for(offset, length);
+			obj_request->page_count = page_count;
+			if ((offset + length) & ~PAGE_MASK)
+				page_count--;	/* more on last page */
+			pages += page_count;
+		}

 		osd_req = rbd_osd_req_create(rbd_dev, write_request,
 						obj_request);
@@ -1886,8 +1925,13 @@ static int rbd_img_request_fill_bio(struct
rbd_img_request *img_request,

 		osd_req_op_extent_init(osd_req, 0, opcode, offset, length,
 						0, 0);
-		osd_req_op_extent_osd_data_bio(osd_req, 0,
-				obj_request->bio_list, obj_request->length);
+		if (type == OBJ_REQUEST_BIO)
+			osd_req_op_extent_osd_data_bio(osd_req, 0,
+					obj_request->bio_list, length);
+		else
+			osd_req_op_extent_osd_data_pages(osd_req, 0,
+					obj_request->pages, length,
+					offset & ~PAGE_MASK, false, false);

 		if (write_request)
 			rbd_osd_req_format_write(obj_request);
@@ -2120,7 +2164,8 @@ static void rbd_img_parent_read(struct
rbd_obj_request *obj_request)
 	rbd_obj_request_get(obj_request);
 	img_request->obj_request = obj_request;

-	result = rbd_img_request_fill_bio(img_request, obj_request->bio_list);
+	result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO,
+					obj_request->bio_list);
 	if (result)
 		goto out_err;

@@ -2425,7 +2470,8 @@ static void rbd_request_fn(struct request_queue *q)

 		img_request->rq = rq;

-		result = rbd_img_request_fill_bio(img_request, rq->bio);
+		result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO,
+						rq->bio);
 		if (!result)
 			result = rbd_img_request_submit(img_request);
 		if (result)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 1/2] rbd: implement full object parent reads
  2013-04-19 22:46 [PATCH 0] rbd: layered writes Alex Elder
                   ` (5 preceding siblings ...)
  2013-04-19 22:50 ` [PATCH 4/4] rbd: support page array image requests Alex Elder
@ 2013-04-19 22:50 ` Alex Elder
  2013-04-22 18:13   ` Josh Durgin
  2013-04-19 22:50 ` [PATCH 2/2] rbd: issue a copyup for layered writes Alex Elder
  2013-04-19 22:52 ` [PATCH 0] rbd: " Alex Elder
  8 siblings, 1 reply; 19+ messages in thread
From: Alex Elder @ 2013-04-19 22:50 UTC (permalink / raw)
  To: ceph-devel

As a step toward implementing layered writes, implement reading the
data for a target object from the parent image for a write request
whose target object is known to not exist.  Add a copyup_pages field
to an image request to track the page array used (only) for such a
request.

Signed-off-by: Alex Elder <elder@inktank.com>
---
 drivers/block/rbd.c |  152
++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 143 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 91fcf36..c5d0619 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -250,6 +250,7 @@ struct rbd_img_request {
 		struct request		*rq;		/* block request */
 		struct rbd_obj_request	*obj_request;	/* obj req initiator */
 	};
+	struct page		**copyup_pages;
 	spinlock_t		completion_lock;/* protects next_completion */
 	u32			next_completion;
 	rbd_img_callback_t	callback;
@@ -350,6 +351,8 @@ static DEFINE_SPINLOCK(rbd_dev_list_lock);
 static LIST_HEAD(rbd_client_list);		/* clients */
 static DEFINE_SPINLOCK(rbd_client_list_lock);

+static int rbd_img_request_submit(struct rbd_img_request *img_request);
+
 static int rbd_dev_snaps_update(struct rbd_device *rbd_dev);
 static int rbd_dev_snaps_register(struct rbd_device *rbd_dev);

@@ -1956,6 +1959,133 @@ out_unwind:
 	return -ENOMEM;
 }

+static void
+rbd_img_obj_parent_read_full_callback(struct rbd_img_request *img_request)
+{
+	struct rbd_obj_request *orig_request;
+	struct page **pages;
+	u32 page_count;
+	int result;
+	u64 obj_size;
+	u64 xferred;
+
+	rbd_assert(img_request_child_test(img_request));
+
+	/* First get what we need from the image request */
+
+	pages = img_request->copyup_pages;
+	rbd_assert(pages != NULL);
+	img_request->copyup_pages = NULL;
+
+	orig_request = img_request->obj_request;
+	rbd_assert(orig_request != NULL);
+
+	result = img_request->result;
+	obj_size = img_request->length;
+	xferred = img_request->xferred;
+
+	rbd_img_request_put(img_request);
+
+	obj_request_existence_set(orig_request, true);
+
+	page_count = (u32)calc_pages_for(0, obj_size);
+	ceph_release_page_vector(pages, page_count);
+
+	/* Resubmit the original request (for now). */
+
+	orig_request->result = rbd_img_obj_request_submit(orig_request);
+	if (orig_request->result) {
+		obj_request_done_set(orig_request);
+		rbd_obj_request_complete(orig_request);
+	}
+}
+
+/*
+ * Read from the parent image the range of data that covers the
+ * entire target of the given object request.  This is used for
+ * satisfying a layered image write request when the target of an
+ * object request from the image request does not exist.
+ *
+ * A page array big enough to hold the returned data is allocated
+ * and supplied to rbd_img_request_fill() as the "data descriptor."
+ * When the read completes, this page array will be transferred to
+ * the original object request for the copyup operation.
+ *
+ * If an error occurs, record it as the result of the original
+ * object request and mark it done so it gets completed.
+ */
+static int rbd_img_obj_parent_read_full(struct rbd_obj_request
*obj_request)
+{
+	struct rbd_img_request *img_request = NULL;
+	struct rbd_img_request *parent_request = NULL;
+	struct rbd_device *rbd_dev;
+	u64 img_offset;
+	u64 length;
+	struct page **pages = NULL;
+	u32 page_count;
+	int result;
+
+	rbd_assert(obj_request_img_data_test(obj_request));
+	rbd_assert(obj_request->type == OBJ_REQUEST_BIO);
+
+	img_request = obj_request->img_request;
+	rbd_assert(img_request != NULL);
+	rbd_dev = img_request->rbd_dev;
+	rbd_assert(rbd_dev->parent != NULL);
+
+	/*
+	 * Determine the byte range covered by the object in the
+	 * child image to which the original request was to be sent.
+	 */
+	img_offset = obj_request->img_offset - obj_request->offset;
+	length = (u64)1 << rbd_dev->header.obj_order;
+
+	/*
+	 * Allocate a page array big enough to receive the data read
+	 * from the parent.
+	 */
+	page_count = (u32)calc_pages_for(0, length);
+	pages = ceph_alloc_page_vector(page_count, GFP_KERNEL);
+	if (IS_ERR(pages)) {
+		result = PTR_ERR(pages);
+		pages = NULL;
+		goto out_err;
+	}
+
+	result = -ENOMEM;
+	parent_request = rbd_img_request_create(rbd_dev->parent,
+						img_offset, length,
+						false, true);
+	if (!parent_request)
+		goto out_err;
+	rbd_obj_request_get(obj_request);
+	parent_request->obj_request = obj_request;
+
+	result = rbd_img_request_fill(parent_request, OBJ_REQUEST_PAGES, pages);
+	if (result)
+		goto out_err;
+	parent_request->copyup_pages = pages;
+
+	parent_request->callback = rbd_img_obj_parent_read_full_callback;
+	result = rbd_img_request_submit(parent_request);
+	if (!result)
+		return 0;
+
+	parent_request->copyup_pages = NULL;
+	parent_request->obj_request = NULL;
+	rbd_obj_request_put(obj_request);
+out_err:
+	if (pages)
+		ceph_release_page_vector(pages, page_count);
+	if (parent_request)
+		rbd_img_request_put(parent_request);
+	obj_request->result = result;
+	obj_request->xferred = 0;
+	obj_request_done_set(obj_request);
+
+	return result;
+}
+
 static void rbd_img_obj_exists_callback(struct rbd_obj_request
*obj_request)
 {
 	struct rbd_obj_request *orig_request;
@@ -1996,7 +2126,7 @@ static void rbd_img_obj_exists_callback(struct
rbd_obj_request *obj_request)
 		obj_request_existence_set(orig_request, false);
 	} else if (result) {
 		orig_request->result = result;
-		goto out_err;
+		goto out;
 	}

 	/*
@@ -2004,7 +2134,7 @@ static void rbd_img_obj_exists_callback(struct
rbd_obj_request *obj_request)
 	 * whether the target object exists.
 	 */
 	orig_request->result = rbd_img_obj_request_submit(orig_request);
-out_err:
+out:
 	if (orig_request->result)
 		rbd_obj_request_complete(orig_request);
 	rbd_obj_request_put(orig_request);
@@ -2070,15 +2200,13 @@ out:
 static int rbd_img_obj_request_submit(struct rbd_obj_request *obj_request)
 {
 	struct rbd_img_request *img_request;
+	bool known;

 	rbd_assert(obj_request_img_data_test(obj_request));

 	img_request = obj_request->img_request;
 	rbd_assert(img_request);

-	/* (At the moment we don't care whether it exists or not...) */
-	(void) obj_request_exists_test;
-
 	/*
 	 * Only layered writes need special handling.  If it's not a
 	 * layered write, or it is a layered write but we know the
@@ -2087,7 +2215,8 @@ static int rbd_img_obj_request_submit(struct
rbd_obj_request *obj_request)
 	 */
 	if (!img_request_write_test(img_request) ||
 		!img_request_layered_test(img_request) ||
-		obj_request_known_test(obj_request)) {
+		((known = obj_request_known_test(obj_request)) &&
+			obj_request_exists_test(obj_request))) {

 		struct rbd_device *rbd_dev;
 		struct ceph_osd_client *osdc;
@@ -2099,10 +2228,15 @@ static int rbd_img_obj_request_submit(struct
rbd_obj_request *obj_request)
 	}

 	/*
-	 * It's a layered write and we don't know whether the target
-	 * exists.  Issue existence check; once that completes the
-	 * original request will be submitted again.
+	 * It's a layered write.  The target object might exist but
+	 * we may not know that yet.  If we know it doesn't exist,
+	 * start by reading the data for the full target object from
+	 * the parent so we can use it for a copyup to the target.
 	 */
+	if (known)
+		return rbd_img_obj_parent_read_full(obj_request);
+
+	/* We don't know whether the target exists.  Go find out. */

 	return rbd_img_obj_exists_submit(obj_request);
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/2] rbd: issue a copyup for layered writes
  2013-04-19 22:46 [PATCH 0] rbd: layered writes Alex Elder
                   ` (6 preceding siblings ...)
  2013-04-19 22:50 ` [PATCH 1/2] rbd: implement full object parent reads Alex Elder
@ 2013-04-19 22:50 ` Alex Elder
  2013-04-22 18:16   ` Josh Durgin
  2013-04-19 22:52 ` [PATCH 0] rbd: " Alex Elder
  8 siblings, 1 reply; 19+ messages in thread
From: Alex Elder @ 2013-04-19 22:50 UTC (permalink / raw)
  To: ceph-devel

This implements the main copyup functionality for layered writes.

Here we add a copyup_pages field to the object request, which is
used only for copyup requests to keep track of the page array
containing data read from the parent image.

A copyup request is the only request rbd has that requires two osd
operations.  Because of this we handle copyup specially.  All image
object requests get an osd request allocated when they are created.
For a write request, if a copyup is copyup is required, the osd
request origainlly allocated is released, and a new one (with room
for two osd ops) is allocated to replace it.  A new function
rbd_osd_req_create_copyup() allocates an osd request suitable for
a copyup request.

The first op is then filled with a copyup object class method call,
supplying the array of pages containing data read from the parent.
The second op is filled in with the original write request.

The original request otherwise remains intact, and it describes the
original write request (found in the second osd op).  The presence
of the copyup op is sort of implicit; a non-null copyup_pages field
could be used to distinguish between a "normal" write request and a
request containing both a copyup call and a write.

This resolves:
    http://tracker.ceph.com/issues/3419

Signed-off-by: Alex Elder <elder@inktank.com>
---
 drivers/block/rbd.c |  149
++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 137 insertions(+), 12 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index c5d0619..12f7a5f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -218,6 +218,7 @@ struct rbd_obj_request {
 			u32		page_count;
 		};
 	};
+	struct page		**copyup_pages;

 	struct ceph_osd_request	*osd_req;

@@ -1498,7 +1499,7 @@ static void rbd_osd_req_callback(struct
ceph_osd_request *osd_req,
 		obj_request->result = osd_req->r_result;
 	obj_request->version = le64_to_cpu(osd_req->r_reassert_version.version);

-	WARN_ON(osd_req->r_num_ops != 1);	/* For now */
+	BUG_ON(osd_req->r_num_ops > 2);

 	/*
 	 * We support a 64-bit length, but ultimately it has to be
@@ -1601,6 +1602,48 @@ static struct ceph_osd_request *rbd_osd_req_create(
 	return osd_req;
 }

+/*
+ * Create a copyup osd request based on the information in the
+ * object request supplied.  A copyup request has two osd ops,
+ * a copyup method call, and a "normal" write request.
+ */
+static struct ceph_osd_request *
+rbd_osd_req_create_copyup(struct rbd_obj_request *obj_request)
+{
+	struct rbd_img_request *img_request;
+	struct ceph_snap_context *snapc;
+	struct rbd_device *rbd_dev;
+	struct ceph_osd_client *osdc;
+	struct ceph_osd_request *osd_req;
+
+	rbd_assert(obj_request_img_data_test(obj_request));
+	img_request = obj_request->img_request;
+	rbd_assert(img_request);
+	rbd_assert(img_request_write_test(img_request));
+
+	/* Allocate and initialize the request, for the two ops */
+
+	snapc = img_request->snapc;
+	rbd_dev = img_request->rbd_dev;
+	osdc = &rbd_dev->rbd_client->client->osdc;
+	osd_req = ceph_osdc_alloc_request(osdc, snapc, 2, false, GFP_ATOMIC);
+	if (!osd_req)
+		return NULL;	/* ENOMEM */
+
+	osd_req->r_flags = CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK;
+	osd_req->r_callback = rbd_osd_req_callback;
+	osd_req->r_priv = obj_request;
+
+	osd_req->r_oid_len = strlen(obj_request->object_name);
+	rbd_assert(osd_req->r_oid_len < sizeof (osd_req->r_oid));
+	memcpy(osd_req->r_oid, obj_request->object_name, osd_req->r_oid_len);
+
+	osd_req->r_file_layout = rbd_dev->layout;	/* struct */
+
+	return osd_req;
+}
+
+
 static void rbd_osd_req_destroy(struct ceph_osd_request *osd_req)
 {
 	ceph_osdc_put_request(osd_req);
@@ -1960,11 +2003,49 @@ out_unwind:
 }

 static void
+rbd_img_obj_copyup_callback(struct rbd_obj_request *obj_request)
+{
+	struct rbd_img_request *img_request;
+	struct rbd_device *rbd_dev;
+	u64 length;
+	u32 page_count;
+
+	rbd_assert(obj_request->type == OBJ_REQUEST_BIO);
+	rbd_assert(obj_request_img_data_test(obj_request));
+	img_request = obj_request->img_request;
+	rbd_assert(img_request);
+
+	rbd_dev = img_request->rbd_dev;
+	rbd_assert(rbd_dev);
+	length = (u64)1 << rbd_dev->header.obj_order;
+	page_count = (u32)calc_pages_for(0, length);
+
+	rbd_assert(obj_request->copyup_pages);
+	ceph_release_page_vector(obj_request->copyup_pages, page_count);
+	obj_request->copyup_pages = NULL;
+
+	/*
+	 * We want the transfer count to reflect the size of the
+	 * original write request.  There is no such thing as a
+	 * successful short write, so if the request was successful
+	 * we can just set it to the originally-requested length.
+	 */
+	if (!obj_request->result)
+		obj_request->xferred = obj_request->length;
+
+	/* Finish up with the normal image object callback */
+
+	rbd_img_obj_callback(obj_request);
+}
+
+static void
 rbd_img_obj_parent_read_full_callback(struct rbd_img_request *img_request)
 {
 	struct rbd_obj_request *orig_request;
+	struct ceph_osd_request *osd_req;
+	struct ceph_osd_client *osdc;
+	struct rbd_device *rbd_dev;
 	struct page **pages;
-	u32 page_count;
 	int result;
 	u64 obj_size;
 	u64 xferred;
@@ -1979,25 +2060,60 @@ rbd_img_obj_parent_read_full_callback(struct
rbd_img_request *img_request)

 	orig_request = img_request->obj_request;
 	rbd_assert(orig_request != NULL);
-
+	rbd_assert(orig_request->type == OBJ_REQUEST_BIO);
 	result = img_request->result;
 	obj_size = img_request->length;
 	xferred = img_request->xferred;

+	rbd_dev = img_request->rbd_dev;
+	rbd_assert(rbd_dev);
+	rbd_assert(obj_size == (u64)1 << rbd_dev->header.obj_order);
+
 	rbd_img_request_put(img_request);

-	obj_request_existence_set(orig_request, true);
+	if (result)
+		goto out_err;
+
+	/* Allocate the new copyup osd request for the original request */

-	page_count = (u32)calc_pages_for(0, obj_size);
-	ceph_release_page_vector(pages, page_count);
+	result = -ENOMEM;
+	rbd_assert(!orig_request->osd_req);
+	osd_req = rbd_osd_req_create_copyup(orig_request);
+	if (!osd_req)
+		goto out_err;
+	orig_request->osd_req = osd_req;
+	orig_request->copyup_pages = pages;

-	/* Resubmit the original request (for now). */
+	/* Initialize the copyup op */

-	orig_request->result = rbd_img_obj_request_submit(orig_request);
-	if (orig_request->result) {
-		obj_request_done_set(orig_request);
-		rbd_obj_request_complete(orig_request);
-	}
+	osd_req_op_cls_init(osd_req, 0, CEPH_OSD_OP_CALL, "rbd", "copyup");
+	osd_req_op_cls_request_data_pages(osd_req, 0, pages, obj_size, 0,
+						false, false);
+
+	/* Then the original write request op */
+
+	osd_req_op_extent_init(osd_req, 1, CEPH_OSD_OP_WRITE,
+					orig_request->offset,
+					orig_request->length, 0, 0);
+	osd_req_op_extent_osd_data_bio(osd_req, 1, orig_request->bio_list,
+					orig_request->length);
+
+	rbd_osd_req_format_write(orig_request);
+
+	/* All set, send it off. */
+
+	orig_request->callback = rbd_img_obj_copyup_callback;
+	osdc = &rbd_dev->rbd_client->client->osdc;
+	result = rbd_obj_request_submit(osdc, orig_request);
+	if (!result)
+		return;
+out_err:
+	/* Record the error code and complete the request */
+
+	orig_request->result = result;
+	orig_request->xferred = 0;
+	obj_request_done_set(orig_request);
+	rbd_obj_request_complete(orig_request);
 }

 /*
@@ -2034,6 +2150,15 @@ static int rbd_img_obj_parent_read_full(struct
rbd_obj_request *obj_request)
 	rbd_assert(rbd_dev->parent != NULL);

 	/*
+	 * First things first.  The original osd request is of no
+	 * use to use any more, we'll need a new one that can hold
+	 * the two ops in a copyup request.  We'll get that later,
+	 * but for now we can release the old one.
+	 */
+	rbd_osd_req_destroy(obj_request->osd_req);
+	obj_request->osd_req = NULL;
+
+	/*
 	 * Determine the byte range covered by the object in the
 	 * child image to which the original request was to be sent.
 	 */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 0]  rbd: layered writes
  2013-04-19 22:46 [PATCH 0] rbd: layered writes Alex Elder
                   ` (7 preceding siblings ...)
  2013-04-19 22:50 ` [PATCH 2/2] rbd: issue a copyup for layered writes Alex Elder
@ 2013-04-19 22:52 ` Alex Elder
  8 siblings, 0 replies; 19+ messages in thread
From: Alex Elder @ 2013-04-19 22:52 UTC (permalink / raw)
  To: ceph-devel

On 04/19/2013 05:46 PM, Alex Elder wrote:
> This set of patches culminates in providing layered
> write functionality for rbd clone images.  Until the
> feature bits get enabled, most of this code is not
> yet activated.  But it has been tested and it's
> ready for review.  The feature will be enabled soon,
> after a few more things get cleaned up to prepare
> for it.

These patches are available in the "review/wip-copyup"
of the ceph-client git repository.  They are based on
the branch "review/wip-4679-3" found there.

					-Alex
> 
> 					-Alex
> 
> [PATCH] libceph: fix two messenger bugs
>     Two bugs, two improvements.
> 
> [PATCH] libceph: support pages for class request data
>     This just defines a function that allows a page array
>     to be specified for the data to be written by an
>     object class method.
> 
> [PATCH 1/4] rbd: define separate read and write format funcs
> [PATCH 2/4] rbd: encapsulate submission of image object requests
> [PATCH 3/4] rbd: define zero_pages()
> [PATCH 4/4] rbd: support page array image requests
>     The first two of these are just cleanups.  The second two
>     add some functionality to allow image requests to use
>     page arrays for their object data.
> 
> [PATCH 1/2] rbd: implement full object parent reads
> [PATCH 2/2] rbd: issue a copyup for layered writes
>     These last two finally implement the layered copyup
>     functionality.  The first is sort of a half step, to
>     break up the review into smaller pieces.
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] libceph: fix two messenger bugs
  2013-04-19 22:49 ` [PATCH] libceph: fix two messenger bugs Alex Elder
@ 2013-04-22  7:14   ` Josh Durgin
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Durgin @ 2013-04-22  7:14 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

On 04/19/2013 03:49 PM, Alex Elder wrote:
> This patch makes four small changes in the ceph messenger.
>
> While getting copyup functionality working I found two bugs in the
> messenger.  Existing paths through the code did not trigger these
> problems, but they're fixed here:
>      - In ceph_msg_data_pagelist_cursor_init(), the cursor's
>        last_piece field was being checked against the length
>        supplied.  This was OK until this commit: ccba6d98 libceph:
>        implement multiple data items in a message That commit changed
>        the cursor init routines to allow lengths to be supplied that
>        exceeded the size of the current data item. Because of this,
>        we have to use the assigned cursor resid field rather than the
>        provided length in determining whether the cursor points to
>        the last piece of a data item.
>      - In ceph_msg_data_add_pages(), a BUG_ON() was erroneously
>        catching attempts to add page data to a message if the message
>        already had data assigned to it. That was OK until that same
>        commit, at which point it was fine for messages to have
>        multiple data items. It slipped through because that BUG_ON()
>        call was present twice in that function. (You can never be too
>        careful.)
>
> In addition two other minor things are changed:
>      - In ceph_msg_data_cursor_init(), the local variable "data" was
>        getting assigned twice.
>      - In ceph_msg_data_advance(), it was assumed that the
>        type-specific advance routine would set new_piece to true
>        after it advanced past the last piece. That may have been
>        fine, but since we check for that case we might as well set it
>        explicitly in ceph_msg_data_advance().
>
> This resolves:
>      http://tracker.ceph.com/issues/4762
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   net/ceph/messenger.c |    8 +++-----
>   1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index a36d98d..91dd451 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -913,7 +913,7 @@ ceph_msg_data_pagelist_cursor_init(struct
> ceph_msg_data_cursor *cursor,
>   	cursor->resid = min(length, pagelist->length);
>   	cursor->page = page;
>   	cursor->offset = 0;
> -	cursor->last_piece = length <= PAGE_SIZE;
> +	cursor->last_piece = cursor->resid <= PAGE_SIZE;
>   }
>
>   static struct page *
> @@ -1013,8 +1013,6 @@ static void ceph_msg_data_cursor_init(struct
> ceph_msg *msg, size_t length)
>   	BUG_ON(length > msg->data_length);
>   	BUG_ON(list_empty(&msg->data));
>
> -	data = list_first_entry(&msg->data, struct ceph_msg_data, links);
> -
>   	cursor->data_head = &msg->data;
>   	cursor->total_resid = length;
>   	data = list_first_entry(&msg->data, struct ceph_msg_data, links);
> @@ -1088,14 +1086,15 @@ static bool ceph_msg_data_advance(struct
> ceph_msg_data_cursor *cursor,
>   		break;
>   	}
>   	cursor->total_resid -= bytes;
> -	cursor->need_crc = new_piece;
>
>   	if (!cursor->resid && cursor->total_resid) {
>   		WARN_ON(!cursor->last_piece);
>   		BUG_ON(list_is_last(&cursor->data->links, cursor->data_head));
>   		cursor->data = list_entry_next(cursor->data, links);
>   		__ceph_msg_data_cursor_init(cursor);
> +		new_piece = true;
>   	}
> +	cursor->need_crc = new_piece;
>
>   	return new_piece;
>   }
> @@ -3019,7 +3018,6 @@ void ceph_msg_data_add_pages(struct ceph_msg *msg,
> struct page **pages,
>   	data->length = length;
>   	data->alignment = alignment & ~PAGE_MASK;
>
> -	BUG_ON(!list_empty(&msg->data));
>   	list_add_tail(&data->links, &msg->data);
>   	msg->data_length += length;
>   }
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] libceph: support pages for class request data
  2013-04-19 22:49 ` [PATCH] libceph: support pages for class request data Alex Elder
@ 2013-04-22  7:15   ` Josh Durgin
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Durgin @ 2013-04-22  7:15 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

On 04/19/2013 03:49 PM, Alex Elder wrote:
> Add the ability to provide an array of pages as outbound request
> data for object class method calls.
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   include/linux/ceph/osd_client.h |    5 +++++
>   net/ceph/osd_client.c           |   12 ++++++++++++
>   2 files changed, 17 insertions(+)
>
> diff --git a/include/linux/ceph/osd_client.h
> b/include/linux/ceph/osd_client.h
> index 4d84a2b..4191cd2 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -273,6 +273,11 @@ extern void osd_req_op_extent_osd_data_bio(struct
> ceph_osd_request *,
>   extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
>   					unsigned int which,
>   					struct ceph_pagelist *pagelist);
> +extern void osd_req_op_cls_request_data_pages(struct ceph_osd_request *,
> +					unsigned int which,
> +					struct page **pages, u64 length,
> +					u32 alignment, bool pages_from_pool,
> +					bool own_pages);
>   extern void osd_req_op_cls_response_data_pages(struct ceph_osd_request *,
>   					unsigned int which,
>   					struct page **pages, u64 length,
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index c842e87..467020c 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -214,6 +214,18 @@ void osd_req_op_cls_request_data_pagelist(
>   }
>   EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist);
>
> +void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
> +			unsigned int which, struct page **pages, u64 length,
> +			u32 alignment, bool pages_from_pool, bool own_pages)
> +{
> +	struct ceph_osd_data *osd_data;
> +
> +	osd_data = osd_req_op_data(osd_req, which, cls, request_data);
> +	ceph_osd_data_pages_init(osd_data, pages, length, alignment,
> +				pages_from_pool, own_pages);
> +}
> +EXPORT_SYMBOL(osd_req_op_cls_request_data_pages);
> +
>   void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req,
>   			unsigned int which, struct page **pages, u64 length,
>   			u32 alignment, bool pages_from_pool, bool own_pages)
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] rbd: define separate read and write format funcs
  2013-04-19 22:49 ` [PATCH 1/4] rbd: define separate read and write format funcs Alex Elder
@ 2013-04-22  7:23   ` Josh Durgin
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Durgin @ 2013-04-22  7:23 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

On 04/19/2013 03:49 PM, Alex Elder wrote:
> Separate rbd_osd_req_format() into two functions, one for read
> requests and the other for write requests.
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   drivers/block/rbd.c |   49
> ++++++++++++++++++++++++++++---------------------
>   1 file changed, 28 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index ce2fb3a..a185239 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1489,28 +1489,31 @@ static void rbd_osd_req_callback(struct
> ceph_osd_request *osd_req,
>   		rbd_obj_request_complete(obj_request);
>   }
>
> -static void rbd_osd_req_format(struct rbd_obj_request *obj_request,
> -					bool write_request)
> +static void rbd_osd_req_format_read(struct rbd_obj_request *obj_request)
>   {
>   	struct rbd_img_request *img_request = obj_request->img_request;
>   	struct ceph_osd_request *osd_req = obj_request->osd_req;
> -	struct ceph_snap_context *snapc = NULL;
> -	u64 snap_id = CEPH_NOSNAP;
> -	struct timespec *mtime = NULL;
> -	struct timespec now;
> +	u64 snap_id;
>
>   	rbd_assert(osd_req != NULL);
>
> -	if (write_request) {
> -		now = CURRENT_TIME;
> -		mtime = &now;
> -		if (img_request)
> -			snapc = img_request->snapc;
> -	} else if (img_request) {
> -		snap_id = img_request->snap_id;
> -	}
> +	snap_id = img_request ? img_request->snap_id : CEPH_NOSNAP;
> +	ceph_osdc_build_request(osd_req, obj_request->offset,
> +			NULL, snap_id, NULL);
> +}
> +
> +static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
> +{
> +	struct rbd_img_request *img_request = obj_request->img_request;
> +	struct ceph_osd_request *osd_req = obj_request->osd_req;
> +	struct ceph_snap_context *snapc;
> +	struct timespec mtime = CURRENT_TIME;
> +
> +	rbd_assert(osd_req != NULL);
> +
> +	snapc = img_request ? img_request->snapc : NULL;
>   	ceph_osdc_build_request(osd_req, obj_request->offset,
> -			snapc, snap_id, mtime);
> +			snapc, CEPH_NOSNAP, &mtime);
>   }
>
>   static struct ceph_osd_request *rbd_osd_req_create(
> @@ -1845,7 +1848,11 @@ static int rbd_img_request_fill_bio(struct
> rbd_img_request *img_request,
>   						0, 0);
>   		osd_req_op_extent_osd_data_bio(osd_req, 0,
>   				obj_request->bio_list, obj_request->length);
> -		rbd_osd_req_format(obj_request, write_request);
> +
> +		if (write_request)
> +			rbd_osd_req_format_write(obj_request);
> +		else
> +			rbd_osd_req_format_read(obj_request);
>
>   		obj_request->img_offset = img_offset;
>   		rbd_img_obj_request_add(img_request, obj_request);
> @@ -1969,7 +1976,7 @@ static int rbd_img_obj_exists_submit(struct
> rbd_obj_request *obj_request)
>   	osd_req_op_init(stat_request->osd_req, 0, CEPH_OSD_OP_STAT);
>   	osd_req_op_raw_data_in_pages(stat_request->osd_req, 0, pages, size, 0,
>   					false, false);
> -	rbd_osd_req_format(stat_request, false);
> +	rbd_osd_req_format_read(stat_request);
>
>   	osdc = &rbd_dev->rbd_client->client->osdc;
>   	ret = rbd_obj_request_submit(osdc, stat_request);
> @@ -2091,7 +2098,7 @@ static int rbd_obj_notify_ack(struct rbd_device
> *rbd_dev,
>
>   	osd_req_op_watch_init(obj_request->osd_req, 0, CEPH_OSD_OP_NOTIFY_ACK,
>   					notify_id, ver, 0);
> -	rbd_osd_req_format(obj_request, false);
> +	rbd_osd_req_format_read(obj_request);
>
>   	ret = rbd_obj_request_submit(osdc, obj_request);
>   out:
> @@ -2161,7 +2168,7 @@ static int rbd_dev_header_watch_sync(struct
> rbd_device *rbd_dev, int start)
>   	osd_req_op_watch_init(obj_request->osd_req, 0, CEPH_OSD_OP_WATCH,
>   				rbd_dev->watch_event->cookie,
>   				rbd_dev->header.obj_version, start);
> -	rbd_osd_req_format(obj_request, true);
> +	rbd_osd_req_format_write(obj_request);
>
>   	ret = rbd_obj_request_submit(osdc, obj_request);
>   	if (ret)
> @@ -2262,7 +2269,7 @@ static int rbd_obj_method_sync(struct rbd_device
> *rbd_dev,
>   	osd_req_op_cls_response_data_pages(obj_request->osd_req, 0,
>   					obj_request->pages, inbound_size,
>   					0, false, false);
> -	rbd_osd_req_format(obj_request, false);
> +	rbd_osd_req_format_read(obj_request);
>
>   	ret = rbd_obj_request_submit(osdc, obj_request);
>   	if (ret)
> @@ -2473,7 +2480,7 @@ static int rbd_obj_read_sync(struct rbd_device
> *rbd_dev,
>   					obj_request->length,
>   					obj_request->offset & ~PAGE_MASK,
>   					false, false);
> -	rbd_osd_req_format(obj_request, false);
> +	rbd_osd_req_format_read(obj_request);
>
>   	ret = rbd_obj_request_submit(osdc, obj_request);
>   	if (ret)
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] rbd: encapsulate submission of image object requests
  2013-04-19 22:50 ` [PATCH 2/4] rbd: encapsulate submission of image object requests Alex Elder
@ 2013-04-22  7:35   ` Josh Durgin
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Durgin @ 2013-04-22  7:35 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

On 04/19/2013 03:50 PM, Alex Elder wrote:
> Object requests that are part of an image request are subject to
> some additional handling.  Define rbd_img_obj_request_submit() to
> encapsulate that, and use it when initially submitting an image
> object request, and when re-submitting it during callback of
> an object existence check.
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   drivers/block/rbd.c |   65
> ++++++++++++++++++++++++++++++++++-----------------
>   1 file changed, 43 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index a185239..894af4f 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -423,6 +423,7 @@ void rbd_warn(struct rbd_device *rbd_dev, const char
> *fmt, ...)
>   #endif /* !RBD_DEBUG */
>
>   static void rbd_img_parent_read(struct rbd_obj_request *obj_request);
> +static int rbd_img_obj_request_submit(struct rbd_obj_request *obj_request);
>
>   static int rbd_dev_refresh(struct rbd_device *rbd_dev, u64 *hver);
>   static int rbd_dev_v2_refresh(struct rbd_device *rbd_dev, u64 *hver);
> @@ -1874,8 +1875,6 @@ out_unwind:
>
>   static void rbd_img_obj_exists_callback(struct rbd_obj_request
> *obj_request)
>   {
> -	struct rbd_device *rbd_dev;
> -	struct ceph_osd_client *osdc;
>   	struct rbd_obj_request *orig_request;
>   	int result;
>
> @@ -1901,8 +1900,6 @@ static void rbd_img_obj_exists_callback(struct
> rbd_obj_request *obj_request)
>
>   	rbd_assert(orig_request);
>   	rbd_assert(orig_request->img_request);
> -	rbd_dev = orig_request->img_request->rbd_dev;
> -	osdc = &rbd_dev->rbd_client->client->osdc;
>
>   	/*
>   	 * Our only purpose here is to determine whether the object
> @@ -1923,7 +1920,7 @@ static void rbd_img_obj_exists_callback(struct
> rbd_obj_request *obj_request)
>   	 * Resubmit the original request now that we have recorded
>   	 * whether the target object exists.
>   	 */
> -	orig_request->result = rbd_obj_request_submit(osdc, orig_request);
> +	orig_request->result = rbd_img_obj_request_submit(orig_request);
>   out_err:
>   	if (orig_request->result)
>   		rbd_obj_request_complete(orig_request);
> @@ -1987,32 +1984,56 @@ out:
>   	return ret;
>   }
>
> +static int rbd_img_obj_request_submit(struct rbd_obj_request *obj_request)
> +{
> +	struct rbd_img_request *img_request;
> +
> +	rbd_assert(obj_request_img_data_test(obj_request));
> +
> +	img_request = obj_request->img_request;
> +	rbd_assert(img_request);
> +
> +	/* (At the moment we don't care whether it exists or not...) */
> +	(void) obj_request_exists_test;
> +
> +	/*
> +	 * Only layered writes need special handling.  If it's not a
> +	 * layered write, or it is a layered write but we know the
> +	 * target object exists, it's no different from any other
> +	 * object request.
> +	 */
> +	if (!img_request_write_test(img_request) ||
> +		!img_request_layered_test(img_request) ||
> +		obj_request_known_test(obj_request)) {
> +
> +		struct rbd_device *rbd_dev;
> +		struct ceph_osd_client *osdc;
> +
> +		rbd_dev = obj_request->img_request->rbd_dev;
> +		osdc = &rbd_dev->rbd_client->client->osdc;
> +
> +		return rbd_obj_request_submit(osdc, obj_request);
> +	}
> +
> +	/*
> +	 * It's a layered write and we don't know whether the target
> +	 * exists.  Issue existence check; once that completes the
> +	 * original request will be submitted again.
> +	 */
> +
> +	return rbd_img_obj_exists_submit(obj_request);
> +}
> +
>   static int rbd_img_request_submit(struct rbd_img_request *img_request)
>   {
> -	struct rbd_device *rbd_dev = img_request->rbd_dev;
> -	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
>   	struct rbd_obj_request *obj_request;
>   	struct rbd_obj_request *next_obj_request;
> -	bool write_request = img_request_write_test(img_request);
> -	bool layered = img_request_layered_test(img_request);
>
>   	dout("%s: img %p\n", __func__, img_request);
>   	for_each_obj_request_safe(img_request, obj_request, next_obj_request) {
> -		bool known;
> -		bool object_exists;
>   		int ret;
>
> -		/*
> -		 * We need to know whether the target object exists
> -		 * for a layered write.  Issue an existence check
> -		 * first if we need to.
> -		 */
> -		known = obj_request_known_test(obj_request);
> -		object_exists = known && obj_request_exists_test(obj_request);
> -		if (!write_request || !layered || object_exists)
> -			ret = rbd_obj_request_submit(osdc, obj_request);
> -		else
> -			ret = rbd_img_obj_exists_submit(obj_request);
> +		ret = rbd_img_obj_request_submit(obj_request);
>   		if (ret)
>   			return ret;
>   	}
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] rbd: define zero_pages()
  2013-04-19 22:50 ` [PATCH 3/4] rbd: define zero_pages() Alex Elder
@ 2013-04-22  8:05   ` Josh Durgin
  2013-04-22 12:35     ` Alex Elder
  0 siblings, 1 reply; 19+ messages in thread
From: Josh Durgin @ 2013-04-22  8:05 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

On 04/19/2013 03:50 PM, Alex Elder wrote:
> Define a new function zero_pages() that zeroes a range of memory
> defined by a page array, along the lines of zero_bio_chain().  It
> saves and the irq flags like bvec_kmap_irq() does, though I'm not
> sure at this point that it's necessary.

It doesn't seem necessary to me. I don't see anything else doing
an irq save+restore around a k(un)map_atomic.

Other than that, looks good.
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

> Update rbd_img_obj_request_read_callback() to use the new function
> if the object request contains page rather than bio data.
>
> For the moment, only bio data is used for osd READ ops.
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   drivers/block/rbd.c |   55
> +++++++++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 47 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 894af4f..ac9abab 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -971,6 +971,37 @@ static void zero_bio_chain(struct bio *chain, int
> start_ofs)
>   }
>
>   /*
> + * similar to zero_bio_chain(), zeros data defined by a page array,
> + * starting at the given byte offset from the start of the array and
> + * continuing up to the given end offset.  The pages array is
> + * assumed to be big enough to hold all bytes up to the end.
> + */
> +static void zero_pages(struct page **pages, u64 offset, u64 end)
> +{
> +	struct page **page = &pages[offset >> PAGE_SHIFT];
> +
> +	rbd_assert(end > offset);
> +	rbd_assert(end - offset <= (u64)SIZE_MAX);
> +	while (offset < end) {
> +		size_t page_offset;
> +		size_t length;
> +		unsigned long flags;
> +		void *kaddr;
> +
> +		page_offset = (size_t)(offset & ~PAGE_MASK);
> +		length = min(PAGE_SIZE - page_offset, (size_t)(end - offset));
> +		local_irq_save(flags);
> +		kaddr = kmap_atomic(*page);
> +		memset(kaddr + page_offset, 0, length);
> +		kunmap_atomic(kaddr);
> +		local_irq_restore(flags);
> +
> +		offset += length;
> +		page++;
> +	}
> +}
> +
> +/*
>    * Clone a portion of a bio, starting at the given byte offset
>    * and continuing for the number of bytes indicated.
>    */
> @@ -1352,9 +1383,12 @@ static bool img_request_layered_test(struct
> rbd_img_request *img_request)
>   static void
>   rbd_img_obj_request_read_callback(struct rbd_obj_request *obj_request)
>   {
> +	u64 xferred = obj_request->xferred;
> +	u64 length = obj_request->length;
> +
>   	dout("%s: obj %p img %p result %d %llu/%llu\n", __func__,
>   		obj_request, obj_request->img_request, obj_request->result,
> -		obj_request->xferred, obj_request->length);
> +		xferred, length);
>   	/*
>   	 * ENOENT means a hole in the image.  We zero-fill the
>   	 * entire length of the request.  A short read also implies
> @@ -1362,15 +1396,20 @@ rbd_img_obj_request_read_callback(struct
> rbd_obj_request *obj_request)
>   	 * update the xferred count to indicate the whole request
>   	 * was satisfied.
>   	 */
> -	BUG_ON(obj_request->type != OBJ_REQUEST_BIO);
> +	rbd_assert(obj_request->type != OBJ_REQUEST_NODATA);
>   	if (obj_request->result == -ENOENT) {
> -		zero_bio_chain(obj_request->bio_list, 0);
> +		if (obj_request->type == OBJ_REQUEST_BIO)
> +			zero_bio_chain(obj_request->bio_list, 0);
> +		else
> +			zero_pages(obj_request->pages, 0, length);
>   		obj_request->result = 0;
> -		obj_request->xferred = obj_request->length;
> -	} else if (obj_request->xferred < obj_request->length &&
> -			!obj_request->result) {
> -		zero_bio_chain(obj_request->bio_list, obj_request->xferred);
> -		obj_request->xferred = obj_request->length;
> +		obj_request->xferred = length;
> +	} else if (xferred < length && !obj_request->result) {
> +		if (obj_request->type == OBJ_REQUEST_BIO)
> +			zero_bio_chain(obj_request->bio_list, xferred);
> +		else
> +			zero_pages(obj_request->pages, xferred, length);
> +		obj_request->xferred = length;
>   	}
>   	obj_request_done_set(obj_request);
>   }
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] rbd: support page array image requests
  2013-04-19 22:50 ` [PATCH 4/4] rbd: support page array image requests Alex Elder
@ 2013-04-22  8:13   ` Josh Durgin
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Durgin @ 2013-04-22  8:13 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

One typo in a comment. Looks good otherwise.

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

On 04/19/2013 03:50 PM, Alex Elder wrote:
> This patch adds the ability to build an image request whose data
> will be written from or read into memory described by a page array.
> (Previously only bio lists were supported.)
>
> Originally this was going to define a new function for this purpose
> but it was largely identical to the rbd_img_request_fill_bio().  So
> instead, rbd_img_request_fill_bio() has been generalized to handle
> both types of image request.
>
> For the moment we still only fill image requests with bio data.
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   drivers/block/rbd.c |   86
> +++++++++++++++++++++++++++++++++++++++------------
>   1 file changed, 66 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index ac9abab..91fcf36 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1780,6 +1780,13 @@ static bool rbd_img_obj_end_request(struct
> rbd_obj_request *obj_request)
>   			img_request->result = result;
>   	}
>
> +	/* Image object requests don't own their page array */
> +
> +	if (obj_request->type == OBJ_REQUEST_PAGES) {
> +		obj_request->pages = NULL;
> +		obj_request->page_count = 0;
> +	}
> +
>   	if (img_request_child_test(img_request)) {
>   		rbd_assert(img_request->obj_request != NULL);
>   		more = obj_request->which < img_request->obj_request_count - 1;
> @@ -1830,30 +1837,48 @@ out:
>   		rbd_img_request_complete(img_request);
>   }
>
> -static int rbd_img_request_fill_bio(struct rbd_img_request *img_request,
> -					struct bio *bio_list)
> +/*
> + * Split up an image request into one or more object requests, each
> + * to a different object.  The the "type" parameter indicates

repeated "the"

> + * whether "data_desc" is the pointer to the head of a list of bio
> + * structures, or the base of a page array.  In either case this
> + * function assumes data_desc describes memory sufficient to hold
> + * all data described by the image request.
> + */
> +static int rbd_img_request_fill(struct rbd_img_request *img_request,
> +					enum obj_request_type type,
> +					void *data_desc)
>   {
>   	struct rbd_device *rbd_dev = img_request->rbd_dev;
>   	struct rbd_obj_request *obj_request = NULL;
>   	struct rbd_obj_request *next_obj_request;
>   	bool write_request = img_request_write_test(img_request);
> -	unsigned int bio_offset;
> +	struct bio *bio_list;
> +	unsigned int bio_offset = 0;
> +	struct page **pages;
>   	u64 img_offset;
>   	u64 resid;
>   	u16 opcode;
>
> -	dout("%s: img %p bio %p\n", __func__, img_request, bio_list);
> +	dout("%s: img %p type %d data_desc %p\n", __func__, img_request,
> +		(int)type, data_desc);
>
>   	opcode = write_request ? CEPH_OSD_OP_WRITE : CEPH_OSD_OP_READ;
> -	bio_offset = 0;
>   	img_offset = img_request->offset;
> -	rbd_assert(img_offset == bio_list->bi_sector << SECTOR_SHIFT);
>   	resid = img_request->length;
>   	rbd_assert(resid > 0);
> +
> +	if (type == OBJ_REQUEST_BIO) {
> +		bio_list = data_desc;
> +		rbd_assert(img_offset == bio_list->bi_sector << SECTOR_SHIFT);
> +	} else {
> +		rbd_assert(type == OBJ_REQUEST_PAGES);
> +		pages = data_desc;
> +	}
> +
>   	while (resid) {
>   		struct ceph_osd_request *osd_req;
>   		const char *object_name;
> -		unsigned int clone_size;
>   		u64 offset;
>   		u64 length;
>
> @@ -1863,19 +1888,33 @@ static int rbd_img_request_fill_bio(struct
> rbd_img_request *img_request,
>   		offset = rbd_segment_offset(rbd_dev, img_offset);
>   		length = rbd_segment_length(rbd_dev, img_offset, resid);
>   		obj_request = rbd_obj_request_create(object_name,
> -						offset, length,
> -						OBJ_REQUEST_BIO);
> +						offset, length, type);
>   		kfree(object_name);	/* object request has its own copy */
>   		if (!obj_request)
>   			goto out_unwind;
>
> -		rbd_assert(length <= (u64) UINT_MAX);
> -		clone_size = (unsigned int) length;
> -		obj_request->bio_list = bio_chain_clone_range(&bio_list,
> -						&bio_offset, clone_size,
> -						GFP_ATOMIC);
> -		if (!obj_request->bio_list)
> -			goto out_partial;
> +		if (type == OBJ_REQUEST_BIO) {
> +			unsigned int clone_size;
> +
> +			rbd_assert(length <= (u64)UINT_MAX);
> +			clone_size = (unsigned int)length;
> +			obj_request->bio_list =
> +					bio_chain_clone_range(&bio_list,
> +								&bio_offset,
> +								clone_size,
> +								GFP_ATOMIC);
> +			if (!obj_request->bio_list)
> +				goto out_partial;
> +		} else {
> +			unsigned int page_count;
> +
> +			obj_request->pages = pages;
> +			page_count = (u32)calc_pages_for(offset, length);
> +			obj_request->page_count = page_count;
> +			if ((offset + length) & ~PAGE_MASK)
> +				page_count--;	/* more on last page */
> +			pages += page_count;
> +		}
>
>   		osd_req = rbd_osd_req_create(rbd_dev, write_request,
>   						obj_request);
> @@ -1886,8 +1925,13 @@ static int rbd_img_request_fill_bio(struct
> rbd_img_request *img_request,
>
>   		osd_req_op_extent_init(osd_req, 0, opcode, offset, length,
>   						0, 0);
> -		osd_req_op_extent_osd_data_bio(osd_req, 0,
> -				obj_request->bio_list, obj_request->length);
> +		if (type == OBJ_REQUEST_BIO)
> +			osd_req_op_extent_osd_data_bio(osd_req, 0,
> +					obj_request->bio_list, length);
> +		else
> +			osd_req_op_extent_osd_data_pages(osd_req, 0,
> +					obj_request->pages, length,
> +					offset & ~PAGE_MASK, false, false);
>
>   		if (write_request)
>   			rbd_osd_req_format_write(obj_request);
> @@ -2120,7 +2164,8 @@ static void rbd_img_parent_read(struct
> rbd_obj_request *obj_request)
>   	rbd_obj_request_get(obj_request);
>   	img_request->obj_request = obj_request;
>
> -	result = rbd_img_request_fill_bio(img_request, obj_request->bio_list);
> +	result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO,
> +					obj_request->bio_list);
>   	if (result)
>   		goto out_err;
>
> @@ -2425,7 +2470,8 @@ static void rbd_request_fn(struct request_queue *q)
>
>   		img_request->rq = rq;
>
> -		result = rbd_img_request_fill_bio(img_request, rq->bio);
> +		result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO,
> +						rq->bio);
>   		if (!result)
>   			result = rbd_img_request_submit(img_request);
>   		if (result)
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] rbd: define zero_pages()
  2013-04-22  8:05   ` Josh Durgin
@ 2013-04-22 12:35     ` Alex Elder
  0 siblings, 0 replies; 19+ messages in thread
From: Alex Elder @ 2013-04-22 12:35 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 04/22/2013 03:05 AM, Josh Durgin wrote:
> On 04/19/2013 03:50 PM, Alex Elder wrote:
>> Define a new function zero_pages() that zeroes a range of memory
>> defined by a page array, along the lines of zero_bio_chain().  It
>> saves and the irq flags like bvec_kmap_irq() does, though I'm not
>> sure at this point that it's necessary.
> 
> It doesn't seem necessary to me. I don't see anything else doing
> an irq save+restore around a k(un)map_atomic.

I'm going to leave it in for now.  I also wonder whether
I need I need a flush_dcache_page() in there before the
unmap.  For x86 CPUs it's moot but for portability I'd
like to do it right while the code is fresh in mind.

http://tracker.ceph.com/issues/4777

					-Alex
> Other than that, looks good.
> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
> 
>> Update rbd_img_obj_request_read_callback() to use the new function
>> if the object request contains page rather than bio data.
>>
>> For the moment, only bio data is used for osd READ ops.
>>
>> Signed-off-by: Alex Elder <elder@inktank.com>
>> ---
>>   drivers/block/rbd.c |   55
>> +++++++++++++++++++++++++++++++++++++++++++--------
>>   1 file changed, 47 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 894af4f..ac9abab 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -971,6 +971,37 @@ static void zero_bio_chain(struct bio *chain, int
>> start_ofs)
>>   }
>>
>>   /*
>> + * similar to zero_bio_chain(), zeros data defined by a page array,
>> + * starting at the given byte offset from the start of the array and
>> + * continuing up to the given end offset.  The pages array is
>> + * assumed to be big enough to hold all bytes up to the end.
>> + */
>> +static void zero_pages(struct page **pages, u64 offset, u64 end)
>> +{
>> +    struct page **page = &pages[offset >> PAGE_SHIFT];
>> +
>> +    rbd_assert(end > offset);
>> +    rbd_assert(end - offset <= (u64)SIZE_MAX);
>> +    while (offset < end) {
>> +        size_t page_offset;
>> +        size_t length;
>> +        unsigned long flags;
>> +        void *kaddr;
>> +
>> +        page_offset = (size_t)(offset & ~PAGE_MASK);
>> +        length = min(PAGE_SIZE - page_offset, (size_t)(end - offset));
>> +        local_irq_save(flags);
>> +        kaddr = kmap_atomic(*page);
>> +        memset(kaddr + page_offset, 0, length);
>> +        kunmap_atomic(kaddr);
>> +        local_irq_restore(flags);
>> +
>> +        offset += length;
>> +        page++;
>> +    }
>> +}
>> +
>> +/*
>>    * Clone a portion of a bio, starting at the given byte offset
>>    * and continuing for the number of bytes indicated.
>>    */
>> @@ -1352,9 +1383,12 @@ static bool img_request_layered_test(struct
>> rbd_img_request *img_request)
>>   static void
>>   rbd_img_obj_request_read_callback(struct rbd_obj_request *obj_request)
>>   {
>> +    u64 xferred = obj_request->xferred;
>> +    u64 length = obj_request->length;
>> +
>>       dout("%s: obj %p img %p result %d %llu/%llu\n", __func__,
>>           obj_request, obj_request->img_request, obj_request->result,
>> -        obj_request->xferred, obj_request->length);
>> +        xferred, length);
>>       /*
>>        * ENOENT means a hole in the image.  We zero-fill the
>>        * entire length of the request.  A short read also implies
>> @@ -1362,15 +1396,20 @@ rbd_img_obj_request_read_callback(struct
>> rbd_obj_request *obj_request)
>>        * update the xferred count to indicate the whole request
>>        * was satisfied.
>>        */
>> -    BUG_ON(obj_request->type != OBJ_REQUEST_BIO);
>> +    rbd_assert(obj_request->type != OBJ_REQUEST_NODATA);
>>       if (obj_request->result == -ENOENT) {
>> -        zero_bio_chain(obj_request->bio_list, 0);
>> +        if (obj_request->type == OBJ_REQUEST_BIO)
>> +            zero_bio_chain(obj_request->bio_list, 0);
>> +        else
>> +            zero_pages(obj_request->pages, 0, length);
>>           obj_request->result = 0;
>> -        obj_request->xferred = obj_request->length;
>> -    } else if (obj_request->xferred < obj_request->length &&
>> -            !obj_request->result) {
>> -        zero_bio_chain(obj_request->bio_list, obj_request->xferred);
>> -        obj_request->xferred = obj_request->length;
>> +        obj_request->xferred = length;
>> +    } else if (xferred < length && !obj_request->result) {
>> +        if (obj_request->type == OBJ_REQUEST_BIO)
>> +            zero_bio_chain(obj_request->bio_list, xferred);
>> +        else
>> +            zero_pages(obj_request->pages, xferred, length);
>> +        obj_request->xferred = length;
>>       }
>>       obj_request_done_set(obj_request);
>>   }
>>
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] rbd: implement full object parent reads
  2013-04-19 22:50 ` [PATCH 1/2] rbd: implement full object parent reads Alex Elder
@ 2013-04-22 18:13   ` Josh Durgin
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Durgin @ 2013-04-22 18:13 UTC (permalink / raw)
  To: Alex Elder, ceph-devel

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

Alex Elder <elder@inktank.com> wrote:
>As a step toward implementing layered writes, implement reading the
>data for a target object from the parent image for a write request
>whose target object is known to not exist.  Add a copyup_pages field
>to an image request to track the page array used (only) for such a
>request.
>
>Signed-off-by: Alex Elder <elder@inktank.com>
>---
> drivers/block/rbd.c |  152
>++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 143 insertions(+), 9 deletions(-)
>
>diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>index 91fcf36..c5d0619 100644
>--- a/drivers/block/rbd.c
>+++ b/drivers/block/rbd.c
>@@ -250,6 +250,7 @@ struct rbd_img_request {
> 		struct request		*rq;		/* block request */
> 		struct rbd_obj_request	*obj_request;	/* obj req initiator */
> 	};
>+	struct page		**copyup_pages;
> 	spinlock_t		completion_lock;/* protects next_completion */
> 	u32			next_completion;
> 	rbd_img_callback_t	callback;
>@@ -350,6 +351,8 @@ static DEFINE_SPINLOCK(rbd_dev_list_lock);
> static LIST_HEAD(rbd_client_list);		/* clients */
> static DEFINE_SPINLOCK(rbd_client_list_lock);
>
>+static int rbd_img_request_submit(struct rbd_img_request
>*img_request);
>+
> static int rbd_dev_snaps_update(struct rbd_device *rbd_dev);
> static int rbd_dev_snaps_register(struct rbd_device *rbd_dev);
>
>@@ -1956,6 +1959,133 @@ out_unwind:
> 	return -ENOMEM;
> }
>
>+static void
>+rbd_img_obj_parent_read_full_callback(struct rbd_img_request
>*img_request)
>+{
>+	struct rbd_obj_request *orig_request;
>+	struct page **pages;
>+	u32 page_count;
>+	int result;
>+	u64 obj_size;
>+	u64 xferred;
>+
>+	rbd_assert(img_request_child_test(img_request));
>+
>+	/* First get what we need from the image request */
>+
>+	pages = img_request->copyup_pages;
>+	rbd_assert(pages != NULL);
>+	img_request->copyup_pages = NULL;
>+
>+	orig_request = img_request->obj_request;
>+	rbd_assert(orig_request != NULL);
>+
>+	result = img_request->result;
>+	obj_size = img_request->length;
>+	xferred = img_request->xferred;
>+
>+	rbd_img_request_put(img_request);
>+
>+	obj_request_existence_set(orig_request, true);
>+
>+	page_count = (u32)calc_pages_for(0, obj_size);
>+	ceph_release_page_vector(pages, page_count);
>+
>+	/* Resubmit the original request (for now). */
>+
>+	orig_request->result = rbd_img_obj_request_submit(orig_request);
>+	if (orig_request->result) {
>+		obj_request_done_set(orig_request);
>+		rbd_obj_request_complete(orig_request);
>+	}
>+}
>+
>+/*
>+ * Read from the parent image the range of data that covers the
>+ * entire target of the given object request.  This is used for
>+ * satisfying a layered image write request when the target of an
>+ * object request from the image request does not exist.
>+ *
>+ * A page array big enough to hold the returned data is allocated
>+ * and supplied to rbd_img_request_fill() as the "data descriptor."
>+ * When the read completes, this page array will be transferred to
>+ * the original object request for the copyup operation.
>+ *
>+ * If an error occurs, record it as the result of the original
>+ * object request and mark it done so it gets completed.
>+ */
>+static int rbd_img_obj_parent_read_full(struct rbd_obj_request
>*obj_request)
>+{
>+	struct rbd_img_request *img_request = NULL;
>+	struct rbd_img_request *parent_request = NULL;
>+	struct rbd_device *rbd_dev;
>+	u64 img_offset;
>+	u64 length;
>+	struct page **pages = NULL;
>+	u32 page_count;
>+	int result;
>+
>+	rbd_assert(obj_request_img_data_test(obj_request));
>+	rbd_assert(obj_request->type == OBJ_REQUEST_BIO);
>+
>+	img_request = obj_request->img_request;
>+	rbd_assert(img_request != NULL);
>+	rbd_dev = img_request->rbd_dev;
>+	rbd_assert(rbd_dev->parent != NULL);
>+
>+	/*
>+	 * Determine the byte range covered by the object in the
>+	 * child image to which the original request was to be sent.
>+	 */
>+	img_offset = obj_request->img_offset - obj_request->offset;
>+	length = (u64)1 << rbd_dev->header.obj_order;
>+
>+	/*
>+	 * Allocate a page array big enough to receive the data read
>+	 * from the parent.
>+	 */
>+	page_count = (u32)calc_pages_for(0, length);
>+	pages = ceph_alloc_page_vector(page_count, GFP_KERNEL);
>+	if (IS_ERR(pages)) {
>+		result = PTR_ERR(pages);
>+		pages = NULL;
>+		goto out_err;
>+	}
>+
>+	result = -ENOMEM;
>+	parent_request = rbd_img_request_create(rbd_dev->parent,
>+						img_offset, length,
>+						false, true);
>+	if (!parent_request)
>+		goto out_err;
>+	rbd_obj_request_get(obj_request);
>+	parent_request->obj_request = obj_request;
>+
>+	result = rbd_img_request_fill(parent_request, OBJ_REQUEST_PAGES,
>pages);
>+	if (result)
>+		goto out_err;
>+	parent_request->copyup_pages = pages;
>+
>+	parent_request->callback = rbd_img_obj_parent_read_full_callback;
>+	result = rbd_img_request_submit(parent_request);
>+	if (!result)
>+		return 0;
>+
>+	parent_request->copyup_pages = NULL;
>+	parent_request->obj_request = NULL;
>+	rbd_obj_request_put(obj_request);
>+out_err:
>+	if (pages)
>+		ceph_release_page_vector(pages, page_count);
>+	if (parent_request)
>+		rbd_img_request_put(parent_request);
>+	obj_request->result = result;
>+	obj_request->xferred = 0;
>+	obj_request_done_set(obj_request);
>+
>+	return result;
>+}
>+
> static void rbd_img_obj_exists_callback(struct rbd_obj_request
>*obj_request)
> {
> 	struct rbd_obj_request *orig_request;
>@@ -1996,7 +2126,7 @@ static void rbd_img_obj_exists_callback(struct
>rbd_obj_request *obj_request)
> 		obj_request_existence_set(orig_request, false);
> 	} else if (result) {
> 		orig_request->result = result;
>-		goto out_err;
>+		goto out;
> 	}
>
> 	/*
>@@ -2004,7 +2134,7 @@ static void rbd_img_obj_exists_callback(struct
>rbd_obj_request *obj_request)
> 	 * whether the target object exists.
> 	 */
> 	orig_request->result = rbd_img_obj_request_submit(orig_request);
>-out_err:
>+out:
> 	if (orig_request->result)
> 		rbd_obj_request_complete(orig_request);
> 	rbd_obj_request_put(orig_request);
>@@ -2070,15 +2200,13 @@ out:
>static int rbd_img_obj_request_submit(struct rbd_obj_request
>*obj_request)
> {
> 	struct rbd_img_request *img_request;
>+	bool known;
>
> 	rbd_assert(obj_request_img_data_test(obj_request));
>
> 	img_request = obj_request->img_request;
> 	rbd_assert(img_request);
>
>-	/* (At the moment we don't care whether it exists or not...) */
>-	(void) obj_request_exists_test;
>-
> 	/*
> 	 * Only layered writes need special handling.  If it's not a
> 	 * layered write, or it is a layered write but we know the
>@@ -2087,7 +2215,8 @@ static int rbd_img_obj_request_submit(struct
>rbd_obj_request *obj_request)
> 	 */
> 	if (!img_request_write_test(img_request) ||
> 		!img_request_layered_test(img_request) ||
>-		obj_request_known_test(obj_request)) {
>+		((known = obj_request_known_test(obj_request)) &&
>+			obj_request_exists_test(obj_request))) {
>
> 		struct rbd_device *rbd_dev;
> 		struct ceph_osd_client *osdc;
>@@ -2099,10 +2228,15 @@ static int rbd_img_obj_request_submit(struct
>rbd_obj_request *obj_request)
> 	}
>
> 	/*
>-	 * It's a layered write and we don't know whether the target
>-	 * exists.  Issue existence check; once that completes the
>-	 * original request will be submitted again.
>+	 * It's a layered write.  The target object might exist but
>+	 * we may not know that yet.  If we know it doesn't exist,
>+	 * start by reading the data for the full target object from
>+	 * the parent so we can use it for a copyup to the target.
> 	 */
>+	if (known)
>+		return rbd_img_obj_parent_read_full(obj_request);
>+
>+	/* We don't know whether the target exists.  Go find out. */
>
> 	return rbd_img_obj_exists_submit(obj_request);
> }


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] rbd: issue a copyup for layered writes
  2013-04-19 22:50 ` [PATCH 2/2] rbd: issue a copyup for layered writes Alex Elder
@ 2013-04-22 18:16   ` Josh Durgin
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Durgin @ 2013-04-22 18:16 UTC (permalink / raw)
  To: Alex Elder, ceph-devel

Alex Elder <elder@inktank.com> wrote:

>This implements the main copyup functionality for layered writes.
>
>Here we add a copyup_pages field to the object request, which is
>used only for copyup requests to keep track of the page array
>containing data read from the parent image.
>
>A copyup request is the only request rbd has that requires two osd
>operations.  Because of this we handle copyup specially.  All image
>object requests get an osd request allocated when they are created.
>For a write request, if a copyup is copyup is required, the osd
>request origainlly allocated is released, and a new one (with room

A couple typos in this sentence, but looks good.
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

>for two osd ops) is allocated to replace it.  A new function
>rbd_osd_req_create_copyup() allocates an osd request suitable for
>a copyup request.
>
>The first op is then filled with a copyup object class method call,
>supplying the array of pages containing data read from the parent.
>The second op is filled in with the original write request.
>
>The original request otherwise remains intact, and it describes the
>original write request (found in the second osd op).  The presence
>of the copyup op is sort of implicit; a non-null copyup_pages field
>could be used to distinguish between a "normal" write request and a
>request containing both a copyup call and a write.
>
>This resolves:
>    http://tracker.ceph.com/issues/3419
>
>Signed-off-by: Alex Elder <elder@inktank.com>
>---
> drivers/block/rbd.c |  149
>++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 137 insertions(+), 12 deletions(-)
>
>diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>index c5d0619..12f7a5f 100644
>--- a/drivers/block/rbd.c
>+++ b/drivers/block/rbd.c
>@@ -218,6 +218,7 @@ struct rbd_obj_request {
> 			u32		page_count;
> 		};
> 	};
>+	struct page		**copyup_pages;
>
> 	struct ceph_osd_request	*osd_req;
>
>@@ -1498,7 +1499,7 @@ static void rbd_osd_req_callback(struct
>ceph_osd_request *osd_req,
> 		obj_request->result = osd_req->r_result;
>	obj_request->version =
>le64_to_cpu(osd_req->r_reassert_version.version);
>
>-	WARN_ON(osd_req->r_num_ops != 1);	/* For now */
>+	BUG_ON(osd_req->r_num_ops > 2);
>
> 	/*
> 	 * We support a 64-bit length, but ultimately it has to be
>@@ -1601,6 +1602,48 @@ static struct ceph_osd_request
>*rbd_osd_req_create(
> 	return osd_req;
> }
>
>+/*
>+ * Create a copyup osd request based on the information in the
>+ * object request supplied.  A copyup request has two osd ops,
>+ * a copyup method call, and a "normal" write request.
>+ */
>+static struct ceph_osd_request *
>+rbd_osd_req_create_copyup(struct rbd_obj_request *obj_request)
>+{
>+	struct rbd_img_request *img_request;
>+	struct ceph_snap_context *snapc;
>+	struct rbd_device *rbd_dev;
>+	struct ceph_osd_client *osdc;
>+	struct ceph_osd_request *osd_req;
>+
>+	rbd_assert(obj_request_img_data_test(obj_request));
>+	img_request = obj_request->img_request;
>+	rbd_assert(img_request);
>+	rbd_assert(img_request_write_test(img_request));
>+
>+	/* Allocate and initialize the request, for the two ops */
>+
>+	snapc = img_request->snapc;
>+	rbd_dev = img_request->rbd_dev;
>+	osdc = &rbd_dev->rbd_client->client->osdc;
>+	osd_req = ceph_osdc_alloc_request(osdc, snapc, 2, false, GFP_ATOMIC);
>+	if (!osd_req)
>+		return NULL;	/* ENOMEM */
>+
>+	osd_req->r_flags = CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK;
>+	osd_req->r_callback = rbd_osd_req_callback;
>+	osd_req->r_priv = obj_request;
>+
>+	osd_req->r_oid_len = strlen(obj_request->object_name);
>+	rbd_assert(osd_req->r_oid_len < sizeof (osd_req->r_oid));
>+	memcpy(osd_req->r_oid, obj_request->object_name, osd_req->r_oid_len);
>+
>+	osd_req->r_file_layout = rbd_dev->layout;	/* struct */
>+
>+	return osd_req;
>+}
>+
>+
> static void rbd_osd_req_destroy(struct ceph_osd_request *osd_req)
> {
> 	ceph_osdc_put_request(osd_req);
>@@ -1960,11 +2003,49 @@ out_unwind:
> }
>
> static void
>+rbd_img_obj_copyup_callback(struct rbd_obj_request *obj_request)
>+{
>+	struct rbd_img_request *img_request;
>+	struct rbd_device *rbd_dev;
>+	u64 length;
>+	u32 page_count;
>+
>+	rbd_assert(obj_request->type == OBJ_REQUEST_BIO);
>+	rbd_assert(obj_request_img_data_test(obj_request));
>+	img_request = obj_request->img_request;
>+	rbd_assert(img_request);
>+
>+	rbd_dev = img_request->rbd_dev;
>+	rbd_assert(rbd_dev);
>+	length = (u64)1 << rbd_dev->header.obj_order;
>+	page_count = (u32)calc_pages_for(0, length);
>+
>+	rbd_assert(obj_request->copyup_pages);
>+	ceph_release_page_vector(obj_request->copyup_pages, page_count);
>+	obj_request->copyup_pages = NULL;
>+
>+	/*
>+	 * We want the transfer count to reflect the size of the
>+	 * original write request.  There is no such thing as a
>+	 * successful short write, so if the request was successful
>+	 * we can just set it to the originally-requested length.
>+	 */
>+	if (!obj_request->result)
>+		obj_request->xferred = obj_request->length;
>+
>+	/* Finish up with the normal image object callback */
>+
>+	rbd_img_obj_callback(obj_request);
>+}
>+
>+static void
>rbd_img_obj_parent_read_full_callback(struct rbd_img_request
>*img_request)
> {
> 	struct rbd_obj_request *orig_request;
>+	struct ceph_osd_request *osd_req;
>+	struct ceph_osd_client *osdc;
>+	struct rbd_device *rbd_dev;
> 	struct page **pages;
>-	u32 page_count;
> 	int result;
> 	u64 obj_size;
> 	u64 xferred;
>@@ -1979,25 +2060,60 @@ rbd_img_obj_parent_read_full_callback(struct
>rbd_img_request *img_request)
>
> 	orig_request = img_request->obj_request;
> 	rbd_assert(orig_request != NULL);
>-
>+	rbd_assert(orig_request->type == OBJ_REQUEST_BIO);
> 	result = img_request->result;
> 	obj_size = img_request->length;
> 	xferred = img_request->xferred;
>
>+	rbd_dev = img_request->rbd_dev;
>+	rbd_assert(rbd_dev);
>+	rbd_assert(obj_size == (u64)1 << rbd_dev->header.obj_order);
>+
> 	rbd_img_request_put(img_request);
>
>-	obj_request_existence_set(orig_request, true);
>+	if (result)
>+		goto out_err;
>+
>+	/* Allocate the new copyup osd request for the original request */
>
>-	page_count = (u32)calc_pages_for(0, obj_size);
>-	ceph_release_page_vector(pages, page_count);
>+	result = -ENOMEM;
>+	rbd_assert(!orig_request->osd_req);
>+	osd_req = rbd_osd_req_create_copyup(orig_request);
>+	if (!osd_req)
>+		goto out_err;
>+	orig_request->osd_req = osd_req;
>+	orig_request->copyup_pages = pages;
>
>-	/* Resubmit the original request (for now). */
>+	/* Initialize the copyup op */
>
>-	orig_request->result = rbd_img_obj_request_submit(orig_request);
>-	if (orig_request->result) {
>-		obj_request_done_set(orig_request);
>-		rbd_obj_request_complete(orig_request);
>-	}
>+	osd_req_op_cls_init(osd_req, 0, CEPH_OSD_OP_CALL, "rbd", "copyup");
>+	osd_req_op_cls_request_data_pages(osd_req, 0, pages, obj_size, 0,
>+						false, false);
>+
>+	/* Then the original write request op */
>+
>+	osd_req_op_extent_init(osd_req, 1, CEPH_OSD_OP_WRITE,
>+					orig_request->offset,
>+					orig_request->length, 0, 0);
>+	osd_req_op_extent_osd_data_bio(osd_req, 1, orig_request->bio_list,
>+					orig_request->length);
>+
>+	rbd_osd_req_format_write(orig_request);
>+
>+	/* All set, send it off. */
>+
>+	orig_request->callback = rbd_img_obj_copyup_callback;
>+	osdc = &rbd_dev->rbd_client->client->osdc;
>+	result = rbd_obj_request_submit(osdc, orig_request);
>+	if (!result)
>+		return;
>+out_err:
>+	/* Record the error code and complete the request */
>+
>+	orig_request->result = result;
>+	orig_request->xferred = 0;
>+	obj_request_done_set(orig_request);
>+	rbd_obj_request_complete(orig_request);
> }
>
> /*
>@@ -2034,6 +2150,15 @@ static int rbd_img_obj_parent_read_full(struct
>rbd_obj_request *obj_request)
> 	rbd_assert(rbd_dev->parent != NULL);
>
> 	/*
>+	 * First things first.  The original osd request is of no
>+	 * use to use any more, we'll need a new one that can hold
>+	 * the two ops in a copyup request.  We'll get that later,
>+	 * but for now we can release the old one.
>+	 */
>+	rbd_osd_req_destroy(obj_request->osd_req);
>+	obj_request->osd_req = NULL;
>+
>+	/*
> 	 * Determine the byte range covered by the object in the
> 	 * child image to which the original request was to be sent.
> 	 */


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-04-22 18:17 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-04-19 22:46 [PATCH 0] rbd: layered writes Alex Elder
2013-04-19 22:49 ` [PATCH] libceph: fix two messenger bugs Alex Elder
2013-04-22  7:14   ` Josh Durgin
2013-04-19 22:49 ` [PATCH] libceph: support pages for class request data Alex Elder
2013-04-22  7:15   ` Josh Durgin
2013-04-19 22:49 ` [PATCH 1/4] rbd: define separate read and write format funcs Alex Elder
2013-04-22  7:23   ` Josh Durgin
2013-04-19 22:50 ` [PATCH 2/4] rbd: encapsulate submission of image object requests Alex Elder
2013-04-22  7:35   ` Josh Durgin
2013-04-19 22:50 ` [PATCH 3/4] rbd: define zero_pages() Alex Elder
2013-04-22  8:05   ` Josh Durgin
2013-04-22 12:35     ` Alex Elder
2013-04-19 22:50 ` [PATCH 4/4] rbd: support page array image requests Alex Elder
2013-04-22  8:13   ` Josh Durgin
2013-04-19 22:50 ` [PATCH 1/2] rbd: implement full object parent reads Alex Elder
2013-04-22 18:13   ` Josh Durgin
2013-04-19 22:50 ` [PATCH 2/2] rbd: issue a copyup for layered writes Alex Elder
2013-04-22 18:16   ` Josh Durgin
2013-04-19 22:52 ` [PATCH 0] rbd: " Alex Elder

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.