krbd blk-mq support ?

* krbd blk-mq support ?
       [not found] <9894b2d1-b7c6-4a17-9747-d8a41ec208a4@mailpro>
@ 2014-10-24  7:54 ` Alexandre DERUMIER
  2014-10-24  8:41   ` Ilya Dryomov
  2014-10-24 10:55   ` Christoph Hellwig
  0 siblings, 2 replies; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-24  7:54 UTC (permalink / raw)
  To: Ceph Devel

Hi,

I would like to know whether there are plans to add blk-mq (block multiqueue, from kernel 3.17) support to krbd.

I think it could help single-threaded workloads (including QEMU) reach more IOPS.

I found a brief discussion about it here:
http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/20584

But there has been no news since then.

Regards,

Alexandre



* Re: krbd blk-mq support ?
  2014-10-24  7:54 ` krbd blk-mq support ? Alexandre DERUMIER
@ 2014-10-24  8:41   ` Ilya Dryomov
  2014-10-24 10:55   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Ilya Dryomov @ 2014-10-24  8:41 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Ceph Devel

On Fri, Oct 24, 2014 at 11:54 AM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
> Hi,
>
> I would like to know whether there are plans to add blk-mq (block multiqueue, from kernel 3.17) support to krbd.
>
> I think it could help single-threaded workloads (including QEMU) reach more IOPS.
>
> I found a brief discussion about it here:
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/20584
>
> But there has been no news since then.

There are no concrete plans as of now.  For 3.19 and 3.20 the main goal
is to get fancy striping (support for custom striping modes) in and
then get rid of the "kernel layering is EXPERIMENTAL!" warning.

krbd is a network block device, so I don't think we will gain anything
significant in the performance department.  blk-mq was mentioned
because it lifts some of the implementation restrictions the current
infrastructure imposes on drivers.

Thanks,

                Ilya


* Re: krbd blk-mq support ?
  2014-10-24  7:54 ` krbd blk-mq support ? Alexandre DERUMIER
  2014-10-24  8:41   ` Ilya Dryomov
@ 2014-10-24 10:55   ` Christoph Hellwig
  2014-10-24 12:27     ` Alexandre DERUMIER
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2014-10-24 10:55 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Ceph Devel

[-- Attachment #1: Type: text/plain, Size: 150 bytes --]

If you're willing to experiment, give the patches below a try. Note that
I don't have a ceph test cluster available, so the conversion is
untested.


[-- Attachment #2: 0001-blk-mq-handle-single-queue-case-in-blk_mq_hctx_next_.patch --]
[-- Type: text/plain, Size: 2672 bytes --]

From 00668f00afc6f0cfbce05d1186116469c1f3f9b3 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 24 Oct 2014 11:53:36 +0200
Subject: blk-mq: handle single queue case in blk_mq_hctx_next_cpu

Don't duplicate the code to handle the not cpu bounce case in the
caller, do it inside blk_mq_hctx_next_cpu instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 68929ba..eaaedea 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -760,10 +760,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
-	int cpu = hctx->next_cpu;
+	if (hctx->queue->nr_hw_queues == 1)
+		return WORK_CPU_UNBOUND;
 
 	if (--hctx->next_cpu_batch <= 0) {
-		int next_cpu;
+		int cpu = hctx->next_cpu, next_cpu;
 
 		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
 		if (next_cpu >= nr_cpu_ids)
@@ -771,9 +772,11 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 
 		hctx->next_cpu = next_cpu;
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
+
+		return cpu;
 	}
 
-	return cpu;
+	return hctx->next_cpu;
 }
 
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
@@ -781,16 +784,13 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
 		return;
 
-	if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask))
+	if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask)) {
 		__blk_mq_run_hw_queue(hctx);
-	else if (hctx->queue->nr_hw_queues == 1)
-		kblockd_schedule_delayed_work(&hctx->run_work, 0);
-	else {
-		unsigned int cpu;
-
-		cpu = blk_mq_hctx_next_cpu(hctx);
-		kblockd_schedule_delayed_work_on(cpu, &hctx->run_work, 0);
+		return;
 	}
+
+	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+			&hctx->run_work, 0);
 }
 
 void blk_mq_run_queues(struct request_queue *q, bool async)
@@ -888,16 +888,8 @@ static void blk_mq_delay_work_fn(struct work_struct *work)
 
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
 {
-	unsigned long tmo = msecs_to_jiffies(msecs);
-
-	if (hctx->queue->nr_hw_queues == 1)
-		kblockd_schedule_delayed_work(&hctx->delay_work, tmo);
-	else {
-		unsigned int cpu;
-
-		cpu = blk_mq_hctx_next_cpu(hctx);
-		kblockd_schedule_delayed_work_on(cpu, &hctx->delay_work, tmo);
-	}
+	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+			&hctx->delay_work, msecs_to_jiffies(msecs));
 }
 EXPORT_SYMBOL(blk_mq_delay_queue);
 
-- 
1.9.1
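
For readers who want to follow the control flow of the patched blk_mq_hctx_next_cpu() without a kernel tree, here is a minimal user-space sketch. It is not kernel code: the rr_* names, the fixed 8-CPU boolean mask, and the -1 stand-in for WORK_CPU_UNBOUND are all assumptions made for illustration, and the wrap-around to the first CPU in the mask is assumed because that line falls between the two hunks shown above.

```c
/* User-space model of the patched blk_mq_hctx_next_cpu(); not kernel code. */
#include <assert.h>
#include <stdbool.h>

#define RR_NR_CPUS     8     /* hypothetical stand-in for nr_cpu_ids */
#define RR_CPU_UNBOUND (-1)  /* hypothetical stand-in for WORK_CPU_UNBOUND */
#define RR_WORK_BATCH  8     /* hypothetical stand-in for BLK_MQ_CPU_WORK_BATCH */

struct rr_hctx {
	unsigned int nr_hw_queues;  /* models hctx->queue->nr_hw_queues */
	bool cpumask[RR_NR_CPUS];   /* CPUs mapped to this hardware queue */
	int next_cpu;
	int next_cpu_batch;
};

/* First set CPU strictly after 'cpu', or RR_NR_CPUS if there is none. */
static int rr_cpumask_next(int cpu, const bool *mask)
{
	for (int i = cpu + 1; i < RR_NR_CPUS; i++)
		if (mask[i])
			return i;
	return RR_NR_CPUS;
}

int rr_next_cpu(struct rr_hctx *h)
{
	/* Single hardware queue: do not bounce between CPUs at all. */
	if (h->nr_hw_queues == 1)
		return RR_CPU_UNBOUND;

	if (--h->next_cpu_batch <= 0) {
		int cpu = h->next_cpu;
		int next = rr_cpumask_next(h->next_cpu, h->cpumask);

		/* Assumed wrap-around: restart from the first CPU in the mask. */
		if (next >= RR_NR_CPUS)
			next = rr_cpumask_next(-1, h->cpumask);

		assert(next < RR_NR_CPUS);  /* the mask must not be empty */
		h->next_cpu = next;
		h->next_cpu_batch = RR_WORK_BATCH;

		/* The batch just expired: hand out the old CPU one last time. */
		return cpu;
	}

	return h->next_cpu;
}
```

The point of the model is the two return paths: while a batch is running the current CPU is returned unchanged, and when the batch expires the old CPU is returned once more while the context advances to the next CPU in the mask.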


[-- Attachment #3: 0002-blk-mq-allow-direct-dispatch-to-a-driver-specific-wo.patch --]
[-- Type: text/plain, Size: 3902 bytes --]

From 6002e20c4d2b150fcbe82a7bc45c90d30cb61b78 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 24 Oct 2014 12:04:07 +0200
Subject: blk-mq: allow direct dispatch to a driver specific workqueue

We have various block drivers that need to execute long term blocking
operations during I/O submission like file system or network I/O.

Currently these drivers just queue up work to an internal workqueue
from their request_fn.  With blk-mq we can make sure they always get
called on their own workqueue directly for I/O submission by:

 1) adding a flag to prevent inline submission of I/O, and
 2) allowing the driver to pass in a workqueue in the tag_set that
    will be used instead of kblockd.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c       |  2 +-
 block/blk-mq.c         | 12 +++++++++---
 block/blk.h            |  1 +
 include/linux/blk-mq.h |  4 ++++
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0421b53..7f7249f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -61,7 +61,7 @@ struct kmem_cache *blk_requestq_cachep;
 /*
  * Controlling structure to kblockd
  */
-static struct workqueue_struct *kblockd_workqueue;
+struct workqueue_struct *kblockd_workqueue;
 
 void blk_queue_congestion_threshold(struct request_queue *q)
 {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index eaaedea..cea2f96 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -784,12 +784,13 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
 		return;
 
-	if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask)) {
+	if (!async && !(hctx->flags & BLK_MQ_F_WORKQUEUE) &&
+	    cpumask_test_cpu(smp_processor_id(), hctx->cpumask)) {
 		__blk_mq_run_hw_queue(hctx);
 		return;
 	}
 
-	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+	queue_delayed_work_on(blk_mq_hctx_next_cpu(hctx), hctx->wq,
 			&hctx->run_work, 0);
 }
 
@@ -888,7 +889,7 @@ static void blk_mq_delay_work_fn(struct work_struct *work)
 
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
 {
-	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+	queue_delayed_work_on(blk_mq_hctx_next_cpu(hctx), hctx->wq,
 			&hctx->delay_work, msecs_to_jiffies(msecs));
 }
 EXPORT_SYMBOL(blk_mq_delay_queue);
@@ -1551,6 +1552,11 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->flags = set->flags;
 	hctx->cmd_size = set->cmd_size;
 
+	if (set->wq)
+		hctx->wq = set->wq;
+	else
+		hctx->wq = kblockd_workqueue;
+
 	blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
 					blk_mq_hctx_notify, hctx);
 	blk_mq_register_cpu_notifier(&hctx->cpu_notifier);
diff --git a/block/blk.h b/block/blk.h
index 43b0361..fb46ad0 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -25,6 +25,7 @@ struct blk_flush_queue {
 	spinlock_t		mq_flush_lock;
 };
 
+extern struct workqueue_struct *kblockd_workqueue;
 extern struct kmem_cache *blk_requestq_cachep;
 extern struct kmem_cache *request_cachep;
 extern struct kobj_type blk_queue_ktype;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index c9be158..d61ecfe 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -37,6 +37,8 @@ struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 	struct blk_flush_queue	*fq;
 
+	struct workqueue_struct	*wq;
+
 	void			*driver_data;
 
 	struct blk_mq_ctxmap	ctx_map;
@@ -64,6 +66,7 @@ struct blk_mq_hw_ctx {
 
 struct blk_mq_tag_set {
 	struct blk_mq_ops	*ops;
+	struct workqueue_struct	*wq;
 	unsigned int		nr_hw_queues;
 	unsigned int		queue_depth;	/* max hw supported */
 	unsigned int		reserved_tags;
@@ -140,6 +143,7 @@ enum {
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_SYSFS_UP	= 1 << 3,
+	BLK_MQ_F_WORKQUEUE	= 1 << 4,
 
 	BLK_MQ_S_STOPPED	= 0,
 	BLK_MQ_S_TAG_ACTIVE	= 1,
-- 
1.9.1
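
The two changes listed in the commit message above (a flag that suppresses inline submission, and an optional per-tag-set workqueue that replaces kblockd) can be sketched as a small user-space model. This is an illustration under stated assumptions, not the kernel implementation: the wq_* types are hypothetical, a workqueue is reduced to a named struct, and the BLK_MQ_S_STOPPED check is left out.

```c
/* User-space model of the dispatch decision after this patch; not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WQ_F_WORKQUEUE (1u << 4)  /* mirrors BLK_MQ_F_WORKQUEUE from the patch */

struct wq_workqueue {
	const char *name;
};

struct wq_tag_set {
	unsigned int flags;
	struct wq_workqueue *wq;  /* optional driver workqueue, NULL if unset */
};

struct wq_hctx {
	unsigned int flags;
	struct wq_workqueue *wq;
};

/* Hypothetical stand-in for the global kblockd workqueue. */
static struct wq_workqueue wq_kblockd = { "kblockd" };

/* Mirrors the blk_mq_init_hctx() hunk: fall back to kblockd if no wq is set. */
void wq_init_hctx(struct wq_hctx *h, const struct wq_tag_set *set)
{
	h->flags = set->flags;
	h->wq = set->wq ? set->wq : &wq_kblockd;
}

/*
 * Mirrors the blk_mq_run_hw_queue() hunk. Returns NULL when the queue would
 * be run inline in the caller's context, otherwise the workqueue that would
 * receive the delayed work.
 */
struct wq_workqueue *wq_run_hw_queue(const struct wq_hctx *h, bool async,
				     bool cpu_in_mask)
{
	if (!async && !(h->flags & WQ_F_WORKQUEUE) && cpu_in_mask)
		return NULL;  /* inline submission */

	return h->wq;
}
```

In this model a tag set that sets both the flag and a workqueue never submits inline, which is the property a driver doing blocking network or file system I/O at submission time would rely on.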


[-- Attachment #4: 0003-rbd-WIP-conversion-to-blk-mq.patch --]
[-- Type: text/plain, Size: 6710 bytes --]

From 135c8e415d3800f33142debd93d64af246ccaa57 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 24 Oct 2014 12:46:40 +0200
Subject: rbd: WIP conversion to blk-mq

---
 drivers/block/rbd.c | 106 ++++++++++++++++++++++++----------------------------
 1 file changed, 49 insertions(+), 57 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 0a54c58..9321f35 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -38,6 +38,7 @@
 #include <linux/kernel.h>
 #include <linux/device.h>
 #include <linux/module.h>
+#include <linux/blk-mq.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
@@ -343,7 +344,6 @@ struct rbd_device {
 	struct list_head	rq_queue;	/* incoming rq queue */
 	spinlock_t		lock;		/* queue, flags, open_count */
 	struct workqueue_struct	*rq_wq;
-	struct work_struct	rq_work;
 
 	struct rbd_image_header	header;
 	unsigned long		flags;		/* possibly lock protected */
@@ -361,6 +361,9 @@ struct rbd_device {
 	atomic_t		parent_ref;
 	struct rbd_device	*parent;
 
+	/* Block layer tags. */
+	struct blk_mq_tag_set	tag_set;
+
 	/* protects updating the header */
 	struct rw_semaphore     header_rwsem;
 
@@ -1816,7 +1819,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
 
 	/*
 	 * We support a 64-bit length, but ultimately it has to be
-	 * passed to blk_end_request(), which takes an unsigned int.
+	 * passed to the block layer, which just supports a 32-bit
+	 * length field.
 	 */
 	obj_request->xferred = osd_req->r_reply_op_len[0];
 	rbd_assert(obj_request->xferred < (u64)UINT_MAX);
@@ -2280,7 +2284,10 @@ static bool rbd_img_obj_end_request(struct rbd_obj_request *obj_request)
 		more = obj_request->which < img_request->obj_request_count - 1;
 	} else {
 		rbd_assert(img_request->rq != NULL);
-		more = blk_end_request(img_request->rq, result, xferred);
+
+		more = blk_update_request(img_request->rq, result, xferred);
+		if (!more)
+			__blk_mq_end_request(img_request->rq, result);
 	}
 
 	return more;
@@ -3305,8 +3312,10 @@ out:
 	return ret;
 }
 
-static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
+static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq,
+		bool last)
 {
+	struct rbd_device *rbd_dev = rq->q->queuedata;
 	struct rbd_img_request *img_request;
 	struct ceph_snap_context *snapc = NULL;
 	u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT;
@@ -3314,6 +3323,12 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 	enum obj_operation_type op_type;
 	u64 mapping_size;
 	int result;
+
+	if (rq->cmd_type != REQ_TYPE_FS) {
+		dout("%s: non-fs request type %d\n", __func__,
+			(int) rq->cmd_type);
+		return BLK_MQ_RQ_QUEUE_ERROR;
+	}
 
 	if (rq->cmd_flags & REQ_DISCARD)
 		op_type = OBJ_OP_DISCARD;
@@ -3353,6 +3368,8 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 		goto err_rq;
 	}
 
+	blk_mq_start_request(rq);
+
 	if (offset && length > U64_MAX - offset + 1) {
 		rbd_warn(rbd_dev, "bad request range (%llu~%llu)", offset,
 			 length);
@@ -3396,7 +3413,7 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 	if (result)
 		goto err_img_request;
 
-	return;
+	return 0;
 
 err_img_request:
 	rbd_img_request_put(img_request);
@@ -3406,53 +3423,8 @@ err_rq:
 			 obj_op_name(op_type), length, offset, result);
 	if (snapc)
 		ceph_put_snap_context(snapc);
-	blk_end_request_all(rq, result);
-}
-
-static void rbd_request_workfn(struct work_struct *work)
-{
-	struct rbd_device *rbd_dev =
-	    container_of(work, struct rbd_device, rq_work);
-	struct request *rq, *next;
-	LIST_HEAD(requests);
-
-	spin_lock_irq(&rbd_dev->lock); /* rq->q->queue_lock */
-	list_splice_init(&rbd_dev->rq_queue, &requests);
-	spin_unlock_irq(&rbd_dev->lock);
-
-	list_for_each_entry_safe(rq, next, &requests, queuelist) {
-		list_del_init(&rq->queuelist);
-		rbd_handle_request(rbd_dev, rq);
-	}
-}
-
-/*
- * Called with q->queue_lock held and interrupts disabled, possibly on
- * the way to schedule().  Do not sleep here!
- */
-static void rbd_request_fn(struct request_queue *q)
-{
-	struct rbd_device *rbd_dev = q->queuedata;
-	struct request *rq;
-	int queued = 0;
-
-	rbd_assert(rbd_dev);
-
-	while ((rq = blk_fetch_request(q))) {
-		/* Ignore any non-FS requests that filter through. */
-		if (rq->cmd_type != REQ_TYPE_FS) {
-			dout("%s: non-fs request type %d\n", __func__,
-				(int) rq->cmd_type);
-			__blk_end_request_all(rq, 0);
-			continue;
-		}
-
-		list_add_tail(&rq->queuelist, &rbd_dev->rq_queue);
-		queued++;
-	}
-
-	if (queued)
-		queue_work(rbd_dev->rq_wq, &rbd_dev->rq_work);
+	blk_mq_end_request(rq, result);
+	return 0;
 }
 
 /*
@@ -3513,6 +3485,7 @@ static void rbd_free_disk(struct rbd_device *rbd_dev)
 		del_gendisk(disk);
 		if (disk->queue)
 			blk_cleanup_queue(disk->queue);
+		blk_mq_free_tag_set(&rbd_dev->tag_set);
 	}
 	put_disk(disk);
 }
@@ -3724,11 +3697,17 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 	return 0;
 }
 
+static struct blk_mq_ops rbd_mq_ops = {
+	.queue_rq	= rbd_queue_rq,
+	.map_queue	= blk_mq_map_queue,
+};
+
 static int rbd_init_disk(struct rbd_device *rbd_dev)
 {
 	struct gendisk *disk;
 	struct request_queue *q;
 	u64 segment_size;
+	int err;
 
 	/* create gendisk info */
 	disk = alloc_disk(single_major ?
@@ -3746,10 +3725,23 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 	disk->fops = &rbd_bd_ops;
 	disk->private_data = rbd_dev;
 
-	q = blk_init_queue(rbd_request_fn, &rbd_dev->lock);
-	if (!q)
+	memset(&rbd_dev->tag_set, 0, sizeof(rbd_dev->tag_set));
+	rbd_dev->tag_set.ops = &rbd_mq_ops;
+	rbd_dev->tag_set.queue_depth = 128;
+	rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
+	rbd_dev->tag_set.flags =
+		BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	rbd_dev->tag_set.nr_hw_queues = 1;
+
+	err = blk_mq_alloc_tag_set(&rbd_dev->tag_set);
+	if (err)
 		goto out_disk;
 
+	err = -ENOMEM;
+	q = blk_mq_init_queue(&rbd_dev->tag_set);
+	if (!q)
+		goto out_tag_set;
+
 	/* We use the default size, but let's be explicit about it. */
 	blk_queue_physical_block_size(q, SECTOR_SIZE);
 
@@ -3775,10 +3767,11 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 	rbd_dev->disk = disk;
 
 	return 0;
+out_tag_set:
+	blk_mq_free_tag_set(&rbd_dev->tag_set);
 out_disk:
 	put_disk(disk);
-
-	return -ENOMEM;
+	return err;
 }
 
 /*
@@ -4036,7 +4029,6 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 
 	spin_lock_init(&rbd_dev->lock);
 	INIT_LIST_HEAD(&rbd_dev->rq_queue);
-	INIT_WORK(&rbd_dev->rq_work, rbd_request_workfn);
 	rbd_dev->flags = 0;
 	atomic_set(&rbd_dev->parent_ref, 0);
 	INIT_LIST_HEAD(&rbd_dev->node);
-- 
1.9.1
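
The completion change in rbd_img_obj_end_request() above replaces a single blk_end_request() call with a two-step pattern: account for the transferred bytes first, and end the request only when nothing is left. A minimal user-space sketch of that pattern follows. The rq_* names are hypothetical, the request is reduced to a byte counter, and error handling is omitted; it models the control flow only and is not the block layer implementation.

```c
/* User-space model of the partial-completion pattern; not kernel code. */
#include <assert.h>
#include <stdbool.h>

struct rq_model {
	unsigned int remaining;  /* bytes not yet completed */
	bool ended;              /* set once the request is finally completed */
};

/* Models blk_update_request(): returns true while bytes are still pending. */
static bool rq_update(struct rq_model *rq, unsigned int nr_bytes)
{
	if (nr_bytes > rq->remaining)
		nr_bytes = rq->remaining;  /* simplification: clamp overshoot */
	rq->remaining -= nr_bytes;
	return rq->remaining > 0;
}

/* Models __blk_mq_end_request(): final completion, must happen only once. */
static void rq_end(struct rq_model *rq)
{
	assert(!rq->ended);
	rq->ended = true;
}

/* Models the patched branch: end the request only when nothing is pending. */
bool rq_obj_end_request(struct rq_model *rq, unsigned int xferred)
{
	bool more = rq_update(rq, xferred);

	if (!more)
		rq_end(rq);
	return more;
}
```

With an 8192-byte request completed by two 4096-byte object requests, the first call reports that more data is pending and the second one ends the request.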



* Re: krbd blk-mq support ?
  2014-10-24 10:55   ` Christoph Hellwig
@ 2014-10-24 12:27     ` Alexandre DERUMIER
  2014-10-26 13:46       ` Alexandre DERUMIER
  0 siblings, 1 reply; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-24 12:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ceph Devel

>>If you're willing to experiment, give the patches below a try. Note that
>>I don't have a ceph test cluster available, so the conversion is
>>untested.

OK, thanks! I'll try them and see if I can improve QEMU performance on a single drive with multiqueue.


* Re: krbd blk-mq support ?
  2014-10-24 12:27     ` Alexandre DERUMIER
@ 2014-10-26 13:46       ` Alexandre DERUMIER
  2014-10-26 19:08         ` Somnath Roy
  2014-10-27  9:45         ` Christoph Hellwig
  0 siblings, 2 replies; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-26 13:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ceph Devel

Hi,

Some news:

I have applied the patches successfully on top of a 3.18-rc1 kernel.

But they don't seem to help in my case.
(I think blk-mq is working, because I don't see any I/O schedulers on the rbd devices; blk-mq doesn't support them currently.)

My main problem is that I can't reach more than about 50000 IOPS on one machine,
and the problem seems to be a kworker process stuck at 100% of one core.

I have tried multiple fio processes on different rbd devices at the same time, and I'm always limited to 50000 IOPS.

I'm sure that the ceph cluster is not the bottleneck, because if I launch another fio on another node at the same time,
I can reach 50000 IOPS on each node, and both are limited by the kworker process.

That's why I thought blk-mq could help, but that doesn't seem to be the case.

Is this kworker CPU limitation a known bug?

Regards,

Alexandre


* RE: krbd blk-mq support ?
  2014-10-26 13:46       ` Alexandre DERUMIER
@ 2014-10-26 19:08         ` Somnath Roy
  2014-10-27  7:53           ` Alexandre DERUMIER
  2014-10-27 10:26           ` Alexandre DERUMIER
  2014-10-27  9:45         ` Christoph Hellwig
  1 sibling, 2 replies; 27+ messages in thread
From: Somnath Roy @ 2014-10-26 19:08 UTC (permalink / raw)
  To: Alexandre DERUMIER, Christoph Hellwig; +Cc: Ceph Devel

Alexandre,
Have you tried mapping different images on the same machine with the 'noshare' map option?
If not, it will not scale with an increasing number of images (and thus mapped rbds) on a single machine, as they will share the same connection to the cluster.

Thanks & Regards
Somnath


* Re: krbd blk-mq support ?
  2014-10-26 19:08         ` Somnath Roy
@ 2014-10-27  7:53           ` Alexandre DERUMIER
  2014-10-27 10:26           ` Alexandre DERUMIER
  1 sibling, 0 replies; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-27  7:53 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Ceph Devel, Christoph Hellwig

>>Have you tried mapping different images on the same machine with the 'noshare' map option?

Oh, I didn't know about this option.

I found one reference here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-September/034213.html

"With noshare each mapped image will appear as a separate client instance,
which means it will have its own session with the monitors and own TCP
connections to the OSDs.  It may be a viable workaround for now but in
general I would not recommend it."

So it should help with multiple rbd devices.
Do you know why Sage doesn't recommend it in that mail?


* Re: krbd blk-mq support ?
  2014-10-26 13:46       ` Alexandre DERUMIER
  2014-10-26 19:08         ` Somnath Roy
@ 2014-10-27  9:45         ` Christoph Hellwig
  2014-10-27 10:00           ` Alexandre DERUMIER
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2014-10-27  9:45 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Christoph Hellwig, Ceph Devel

On Sun, Oct 26, 2014 at 02:46:03PM +0100, Alexandre DERUMIER wrote:
> Hi,
>
> Some news:
>
> I have applied the patches successfully on top of a 3.18-rc1 kernel.
>
> But they don't seem to help in my case.
> (I think blk-mq is working, because I don't see any I/O schedulers on the rbd devices; blk-mq doesn't support them currently.)
>
> My main problem is that I can't reach more than about 50000 IOPS on one machine,
>
> and the problem seems to be a kworker process stuck at 100% of one core.

Can you do a perf record -ag and then a perf report to see where these
cycles are spent?



* Re: krbd blk-mq support ?
  2014-10-27  9:45         ` Christoph Hellwig
@ 2014-10-27 10:00           ` Alexandre DERUMIER
  2014-10-28 18:07             ` Christoph Hellwig
  2014-11-03 11:08             ` Christoph Hellwig
  0 siblings, 2 replies; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-27 10:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ceph Devel

[-- Attachment #1: Type: text/plain, Size: 1103 bytes --]

>>Can you do a perf record -ag and then a perf report to see where these
>>cycles are spent?

Yes, sure.

I have attached the perf report to this mail.
(This is with kernel 3.14; I don't have access to my 3.18 host for now.)


[-- Attachment #2: report.txt.gz --]
[-- Type: application/x-gzip, Size: 30413 bytes --]


* Re: krbd blk-mq support ?
  2014-10-26 19:08         ` Somnath Roy
  2014-10-27  7:53           ` Alexandre DERUMIER
@ 2014-10-27 10:26           ` Alexandre DERUMIER
  1 sibling, 0 replies; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-27 10:26 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Ceph Devel, Christoph Hellwig

Hi Somnath,

I have just tried with two rbd volumes mapped with noshare (rbd map -o noshare rbdvolume -p pool) on kernel 3.14,
then ran a fio benchmark on both volumes at the same time,
but it doesn't seem to help.

The kworker process is still at 100%, and I get 25000 IOPS on each rbd volume.

----- Original message -----

From: "Somnath Roy" <Somnath.Roy@sandisk.com>
To: "Alexandre DERUMIER" <aderumier@odiso.com>, "Christoph Hellwig" <hch@infradead.org>
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org>
Sent: Sunday, 26 October 2014 20:08:42
Subject: RE: krbd blk-mq support ?

Alexandre, 
Have you tried mapping different images on the same machine with the 'noshare' map option?
If not, it will not scale with an increasing number of images (and thus mapped rbds) on a single machine, as they will share the same connection to the cluster.

Thanks & Regards 
Somnath 

-----Original Message----- 
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER 
Sent: Sunday, October 26, 2014 6:46 AM 
To: Christoph Hellwig 
Cc: Ceph Devel 
Subject: Re: krbd blk-mq support ? 

Hi, 

some news: 

I have successfully applied the patches on top of the 3.18-rc1 kernel.

But they don't seem to help in my case.
(I think that blk-mq is working because I don't see any I/O schedulers on the rbd devices, as blk-mq doesn't support them currently.)

My main problem is that I can't reach more than around 50000 iops on 1 machine,

and the problem seems to be the kworker process stuck at 100% of 1 core.

I tried multiple fio processes on different rbd devices at the same time, and I'm always limited to 50000 iops.

I'm sure that the ceph cluster is not the bottleneck, because if I launch another fio on another node at the same time,

I can reach 50000 iops on each node, and both are limited by the kworker process.

That's why I thought that blk-mq could help, but it doesn't seem to be the case.

Is this kworker cpu limitation a known bug?

Regards, 

Alexandre 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: krbd blk-mq support ?
  2014-10-27 10:00           ` Alexandre DERUMIER
@ 2014-10-28 18:07             ` Christoph Hellwig
  2014-10-28 22:31               ` Alex Elder
                                 ` (2 more replies)
  2014-11-03 11:08             ` Christoph Hellwig
  1 sibling, 3 replies; 27+ messages in thread
From: Christoph Hellwig @ 2014-10-28 18:07 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Ceph Devel

On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:
> >>Can you do a perf record -ag and then a perf report to see where these
> >>cycles are spent?
> 
> Yes, sure.
> 
> I have attached the perf report to this mail.
> (This is with kernel 3.14, don't have access to my 3.18  host for now)

Oh, that's without the blk-mq patch?

Either way the profile doesn't really sum up to a fully used up
cpu.  Sage, Alex - are there any ordering constraints in the rbd client?
If not we could probably aim for per-cpu queues using blk-mq and a
socket per cpu or similar.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: krbd blk-mq support ?
  2014-10-28 18:07             ` Christoph Hellwig
@ 2014-10-28 22:31               ` Alex Elder
  2014-10-28 23:11               ` Alex Elder
  2014-10-29  9:09               ` Alexandre DERUMIER
  2 siblings, 0 replies; 27+ messages in thread
From: Alex Elder @ 2014-10-28 22:31 UTC (permalink / raw)
  To: Christoph Hellwig, Alexandre DERUMIER; +Cc: Ceph Devel

On 10/28/2014 01:07 PM, Christoph Hellwig wrote:
> On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:
>>>> Can you do a perf record -ag and then a perf report to see where these
>>>> cycles are spent?
>>
>> Yes, sure.
>>
>> I have attached the perf report to this mail.
>> (This is with kernel 3.14, don't have access to my 3.18  host for now)
>
> Oh, that's without the blk-mq patch?
>
> Either way the profile doesn't really sum up to a fully used up
> cpu.  Sage, Alex - are there any ordering constraints in the rbd client?

I don't remember off hand.

In libceph I recall going to great lengths to retain the original
order of requests when they got re-sent after a connection reset.

I'll go look at the code a bit and see if I can refresh my memory
(though Sage may answer before I do).

					-Alex

> If not we could probably aim for per-cpu queues using blk-mq and a
> socket per cpu or similar.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: krbd blk-mq support ?
  2014-10-28 18:07             ` Christoph Hellwig
  2014-10-28 22:31               ` Alex Elder
@ 2014-10-28 23:11               ` Alex Elder
  2014-10-29  9:09               ` Alexandre DERUMIER
  2 siblings, 0 replies; 27+ messages in thread
From: Alex Elder @ 2014-10-28 23:11 UTC (permalink / raw)
  To: Christoph Hellwig, Alexandre DERUMIER; +Cc: Ceph Devel

On 10/28/2014 01:07 PM, Christoph Hellwig wrote:
> On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:
>>>> Can you do a perf record -ag and then a perf report to see where these
>>>> cycles are spent?
>>
>> Yes, sure.
>>
>> I have attached the perf report to this mail.
>> (This is with kernel 3.14, don't have access to my 3.18  host for now)
>
> Oh, that's without the blk-mq patch?
>
> Either way the profile doesn't really sum up to a fully used up
> cpu.  Sage, Alex - are there any ordering constraints in the rbd client?
> If not we could probably aim for per-cpu queues using blk-mq and a
> socket per cpu or similar.

First, a disclaimer--I haven't really been following this discussion
very closely.

For an rbd image request (which is what gets created from requests
from the block queue), the order of completion doesn't matter, and
although the object requests are submitted in order that shouldn't
be required either.

The image request is broken into one or more object requests (usually
just one) and they are treated as a unit.  When the last object request
of a set for an image request has completed, the image request is
treated as completed.
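
As a rough illustration of the rule described above (a toy model, not the kernel's actual code): an image request finishes when its last object request finishes, whatever order they complete in. The class and method names are hypothetical.

```python
class ImageRequest:
    """Toy model of an rbd image request (hypothetical names, not the kernel structs)."""

    def __init__(self, object_names):
        if not object_names:
            raise ValueError("an image request needs at least one object request")
        # Object requests that have not completed yet.
        self.pending = set(object_names)

    @property
    def completed(self):
        # The image request is complete once no object request is pending.
        return not self.pending

    def object_done(self, name):
        """Mark one object request as completed; return True once all are done."""
        self.pending.discard(name)
        return self.completed


# Completion order does not matter: finish the object requests out of order.
req = ImageRequest(["obj.0", "obj.1", "obj.2"])
results = [req.object_done(n) for n in ["obj.2", "obj.0", "obj.1"]]
print(results)
```

Only the last call reports completion, which is why per-request ordering would not be needed for correctness in this simplified picture.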

I hope that helps.  If not, ask again a different way...

					-Alex


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: krbd blk-mq support ?
  2014-10-28 18:07             ` Christoph Hellwig
  2014-10-28 22:31               ` Alex Elder
  2014-10-28 23:11               ` Alex Elder
@ 2014-10-29  9:09               ` Alexandre DERUMIER
  2014-10-29 15:00                 ` Sage Weil
  2 siblings, 1 reply; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-29  9:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ceph Devel

>>Oh, that's without the blk-mq patch?

Yes, sorry, I don't know how to use perf with a custom-compiled kernel.
(Usually I use perf from Debian, with the linux-tools package provided with the Debian kernel package.)

>>Either way the profile doesn't really sum up to a fully used up cpu.

But I see mostly the same behaviour with or without the blk-mq patch: I always have 1 kworker at around 97-100% cpu (1 core) for 50000 iops.

I also tried mapping the rbd volume with nocrc; it goes to 60000 iops with the same kworker at around 97-100% cpu.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: krbd blk-mq support ?
  2014-10-29  9:09               ` Alexandre DERUMIER
@ 2014-10-29 15:00                 ` Sage Weil
  2014-10-30  8:11                   ` Alexandre DERUMIER
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2014-10-29 15:00 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Christoph Hellwig, Ceph Devel

On Wed, 29 Oct 2014, Alexandre DERUMIER wrote:
> >>Oh, that's without the blk-mq patch?
> 
> Yes, sorry, I don't know how to use perf with a custom-compiled kernel.
> (Usually I use perf from Debian, with the linux-tools package provided with the Debian kernel package.)
> 
> >>Either way the profile doesn't really sum up to a fully used up cpu.
> 
> But I see mostly the same behaviour with or without the blk-mq patch: I always have 1 kworker at around 97-100% cpu (1 core) for 50000 iops.
> 
> I also tried mapping the rbd volume with nocrc; it goes to 60000 iops with the same kworker at around 97-100% cpu.

Hmm, this is probably the messenger.c worker then that is feeding messages 
to the network.  How many OSDs do you have?  It should be able to scale 
with the number of OSDs.

sage



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: krbd blk-mq support ?
  2014-10-29 15:00                 ` Sage Weil
@ 2014-10-30  8:11                   ` Alexandre DERUMIER
  2014-10-30 16:01                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-30  8:11 UTC (permalink / raw)
  To: Sage Weil; +Cc: Christoph Hellwig, Ceph Devel

>>Hmm, this is probably the messenger.c worker then that is feeding messages 
>>to the network. How many OSDs do you have? It should be able to scale 
>>with the number of OSDs. 

Thanks Sage for your reply.

Currently 6 OSDs (SSD) on the test platform.

But I can reach 2x 50000 iops on the same rbd volume with 2 clients on 2 different hosts.
Do you think the messenger.c worker can be the bottleneck in this case?

I'll try to add more OSDs next week; if it scales, that's very good news!


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: krbd blk-mq support ?
  2014-10-30  8:11                   ` Alexandre DERUMIER
@ 2014-10-30 16:01                     ` Alexandre DERUMIER
  2014-10-30 17:05                       ` Haomai Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-30 16:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: Christoph Hellwig, Ceph Devel

>>I'll try to add more OSD next week, if it's scale it's a very good news !

I just tried adding 2 more OSDs.

I can now reach 2x 70000 iops on 2 client nodes (vs 2x 50000 previously),

and kworker cpu usage is also lower (84% vs 97%).
(I don't understand exactly why.)

So, thanks for the help, everybody!
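
A quick arithmetic check of the numbers reported in this thread against the suggestion that client throughput should scale with the number of OSDs. This is only a sketch over two data points (6 and 8 OSDs); the helper name is made up for illustration.

```python
def per_osd_iops(client_iops, osd_count):
    """Average IOPS that one client drives through each OSD."""
    return client_iops / osd_count


before = per_osd_iops(50000, 6)   # 6 OSDs, 50000 iops per client
after = per_osd_iops(70000, 8)    # 8 OSDs, 70000 iops per client

osd_growth = 8 / 6                # growth in OSD count
iops_growth = 70000 / 50000       # growth in per-client iops

print(f"per-OSD iops per client: {before:.0f} -> {after:.0f}")
print(f"OSD count x{osd_growth:.2f}, client iops x{iops_growth:.2f}")
```

Per-client throughput grew slightly faster than the OSD count here, which is at least consistent with roughly linear scaling, though two measurements cannot establish a trend.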






^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: krbd blk-mq support ?
  2014-10-30 16:01                     ` Alexandre DERUMIER
@ 2014-10-30 17:05                       ` Haomai Wang
  2014-10-31  5:04                         ` Alexandre DERUMIER
  0 siblings, 1 reply; 27+ messages in thread
From: Haomai Wang @ 2014-10-30 17:05 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Sage Weil, Christoph Hellwig, Ceph Devel

Could you say more about the 2x 70000 iops?
Do you mean that 8 OSDs, each backed by an SSD, can achieve 140000 iops?
Is it read or write? Could you share your fio options?




-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: krbd blk-mq support ?
  2014-10-30 17:05                       ` Haomai Wang
@ 2014-10-31  5:04                         ` Alexandre DERUMIER
  0 siblings, 0 replies; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-31  5:04 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, Christoph Hellwig, Ceph Devel

>>Could you say more about the 2x 70000 iops?
>>Do you mean that 8 OSDs, each backed by an SSD, can achieve 140000 iops?

It's a small rbd (10G), so most reads hit the buffer cache.
But yes, it's able to deliver 140000 iops with 8 OSDs. (I also checked the stats in the ceph cluster to be sure.)
(And I'm not cpu bound on the OSD nodes.)
>> 2014-10-31 05:58:34.231037 mon.0 [INF] pgmap v7109: 1264 pgs: 1264 active+clean; 165 GB data, 109 GB used, 6226 GB / 6335 GB avail; 560 MB/s rd, 140 kop/s


Here is the ceph.conf of the OSD nodes:

[global]
fsid = c29f4643-9577-4671-ae25-59ad14550aba
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true

        debug lockdep = 0/0
        debug context = 0/0
        debug crush = 0/0
        debug buffer = 0/0
        debug timer = 0/0
        debug journaler = 0/0
        debug osd = 0/0
        debug optracker = 0/0
        debug objclass = 0/0
        debug filestore = 0/0
        debug journal = 0/0
        debug ms = 0/0
        debug monc = 0/0
        debug tp = 0/0
        debug auth = 0/0
        debug finisher = 0/0
        debug heartbeatmap = 0/0
        debug perfcounter = 0/0
        debug asok = 0/0
        debug throttle = 0/0

        osd_op_threads = 5
        filestore_op_threads = 4


        osd_op_num_threads_per_shard = 1
        osd_op_num_shards = 25
        filestore_fd_cache_size = 64
        filestore_fd_cache_shards = 32
        osd_enable_op_tracker = false

 

>>is it read or write? could you give fio options?
Random read, 4K.

Here is the fio config:

[global]
ioengine=aio
invalidate=1    
rw=randread
bs=4K
direct=1
numjobs=1
group_reporting=1
size=10G

[test1]
iodepth=64
filename=/dev/rbd/test/test


On 1 client node, I can't reach more than 50000 iops with 6 OSDs or 70000 iops with 8 OSDs.
(I tried increasing numjobs to have more fio processes, and also running against 2 different rbd volumes at the same time,
 but performance is the same.)

>> 2014-10-31 05:57:30.078348 mon.0 [INF] pgmap v7070: 1264 pgs: 1264 active+clean; 165 GB data, 109 GB used, 6226 GB / 6335 GB avail; 290 MB/s rd, 74572 op/s
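
A back-of-the-envelope reading of these per-client numbers, assuming the fio queue stays full (iodepth=64 and numjobs=1, as in the job file above) and applying Little's law: mean latency is roughly in-flight requests divided by throughput. The function name is hypothetical, and the result is only an estimate, not a measured latency.

```python
def mean_latency_ms(iodepth, numjobs, iops):
    """Little's law estimate: in-flight requests / throughput, in milliseconds."""
    in_flight = iodepth * numjobs
    return in_flight / iops * 1000.0


lat_6_osd = mean_latency_ms(iodepth=64, numjobs=1, iops=50000)  # about 1.28 ms
lat_8_osd = mean_latency_ms(iodepth=64, numjobs=1, iops=70000)  # about 0.91 ms

print(f"6 OSDs: ~{lat_6_osd:.2f} ms per request")
print(f"8 OSDs: ~{lat_8_osd:.2f} ms per request")
```

If throughput stays flat when numjobs is raised, as reported, the same formula implies that per-request latency rises in proportion, which fits a single saturated worker rather than a lack of queued I/O.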


But if I launch the same fio test on another client node at the same time, I can reach the same 70000 iops on it.


>> 2014-10-31 05:58:34.231037 mon.0 [INF] pgmap v7109: 1264 pgs: 1264 active+clean; 165 GB data, 109 GB used, 6226 GB / 6335 GB avail; 560 MB/s rd, 140 kop/s
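
A small sanity check that the two pgmap lines above agree with a 4K random-read workload, by dividing the read bandwidth by the op rate. The unit interpretation is an assumption: this sketch treats "MB" as 2^20 bytes and "k" as 1024, which may not match how this Ceph version formats its output.

```python
import re

# Matches e.g. "560 MB/s rd, 140 kop/s" or "290 MB/s rd, 74572 op/s".
PGMAP_RATE = re.compile(r"(\d+) MB/s rd, (\d+) (k?)op/s")


def avg_read_size(line, mb=1024 * 1024, k=1024):
    """Average bytes per read op implied by a pgmap status line (units assumed)."""
    m = PGMAP_RATE.search(line)
    if m is None:
        raise ValueError("no read rate found in pgmap line")
    bytes_per_s = int(m.group(1)) * mb
    ops_per_s = int(m.group(2)) * (k if m.group(3) else 1)
    return bytes_per_s / ops_per_s


one_client = ("2014-10-31 05:57:30.078348 mon.0 [INF] pgmap v7070: 1264 pgs: "
              "1264 active+clean; 165 GB data, 109 GB used, 6226 GB / 6335 GB avail; "
              "290 MB/s rd, 74572 op/s")
two_clients = ("2014-10-31 05:58:34.231037 mon.0 [INF] pgmap v7109: 1264 pgs: "
               "1264 active+clean; 165 GB data, 109 GB used, 6226 GB / 6335 GB avail; "
               "560 MB/s rd, 140 kop/s")

size_one = avg_read_size(one_client)
size_two = avg_read_size(two_clients)
print(f"one client:  ~{size_one:.0f} bytes/op")
print(f"two clients: ~{size_two:.0f} bytes/op")
```

Both come out close to 4096 bytes under these assumptions, matching bs=4K in the fio job.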



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: krbd blk-mq support ?
       [not found] <90C9DE11-CACE-4533-83EF-6F1F887E6A8F@profihost.ag>
@ 2014-10-31 20:12 ` Alexandre DERUMIER
  0 siblings, 0 replies; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-10-31 20:12 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Haomai Wang, Sage Weil, Christoph Hellwig, Ceph Devel

>>filestore_xattr_use_omap ? 

Oh, I think that's a mistake (I just copy-pasted some configs from Somnath).
But I'm using XFS, so I don't think I need it.

----- Original Message -----

From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: "Haomai Wang" <haomaiwang@gmail.com>, "Sage Weil" <sage@newdream.net>, "Christoph Hellwig" <hch@infradead.org>, "Ceph Devel" <ceph-devel@vger.kernel.org>
Sent: Friday, October 31, 2014 06:39:22
Subject: Re: krbd blk-mq support ?

Hi, 

why do you use 

filestore_xattr_use_omap ? 

Stefan 

Excuse my typos, sent from my mobile phone.

On 31.10.2014 at 06:04, Alexandre DERUMIER <aderumier@odiso.com> wrote:

filestore_xattr_use_omap 

* Re: krbd blk-mq support ?
  2014-10-27 10:00           ` Alexandre DERUMIER
  2014-10-28 18:07             ` Christoph Hellwig
@ 2014-11-03 11:08             ` Christoph Hellwig
  2014-11-03 13:08               ` Alexandre DERUMIER
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2014-11-03 11:08 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Christoph Hellwig, Ceph Devel

Hi Alexandre,

can you try the patch below instead of the previous three patches?
This one uses a per-request work struct to allow for more concurrency.

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 0a54c58..b981096 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -38,6 +38,7 @@
 #include <linux/kernel.h>
 #include <linux/device.h>
 #include <linux/module.h>
+#include <linux/blk-mq.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
@@ -343,7 +344,6 @@ struct rbd_device {
 	struct list_head	rq_queue;	/* incoming rq queue */
 	spinlock_t		lock;		/* queue, flags, open_count */
 	struct workqueue_struct	*rq_wq;
-	struct work_struct	rq_work;
 
 	struct rbd_image_header	header;
 	unsigned long		flags;		/* possibly lock protected */
@@ -361,6 +361,9 @@ struct rbd_device {
 	atomic_t		parent_ref;
 	struct rbd_device	*parent;
 
+	/* Block layer tags. */
+	struct blk_mq_tag_set	tag_set;
+
 	/* protects updating the header */
 	struct rw_semaphore     header_rwsem;
 
@@ -1816,7 +1819,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
 
 	/*
 	 * We support a 64-bit length, but ultimately it has to be
-	 * passed to blk_end_request(), which takes an unsigned int.
+	 * passed to the block layer, which just supports a 32-bit
+	 * length field.
 	 */
 	obj_request->xferred = osd_req->r_reply_op_len[0];
 	rbd_assert(obj_request->xferred < (u64)UINT_MAX);
@@ -2280,7 +2284,10 @@ static bool rbd_img_obj_end_request(struct rbd_obj_request *obj_request)
 		more = obj_request->which < img_request->obj_request_count - 1;
 	} else {
 		rbd_assert(img_request->rq != NULL);
-		more = blk_end_request(img_request->rq, result, xferred);
+	
+		more = blk_update_request(img_request->rq, result, xferred);
+		if (!more)
+			__blk_mq_end_request(img_request->rq, result);
 	}
 
 	return more;
@@ -3305,8 +3312,10 @@ out:
 	return ret;
 }
 
-static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
+static void rbd_queue_workfn(struct work_struct *work)
 {
+	struct request *rq = blk_mq_rq_from_pdu(work);
+	struct rbd_device *rbd_dev = rq->q->queuedata;
 	struct rbd_img_request *img_request;
 	struct ceph_snap_context *snapc = NULL;
 	u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT;
@@ -3314,6 +3323,13 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 	enum obj_operation_type op_type;
 	u64 mapping_size;
 	int result;
+		
+	if (rq->cmd_type != REQ_TYPE_FS) {
+		dout("%s: non-fs request type %d\n", __func__,
+			(int) rq->cmd_type);
+		result = -EIO;
+		goto err;
+	}
 
 	if (rq->cmd_flags & REQ_DISCARD)
 		op_type = OBJ_OP_DISCARD;
@@ -3353,6 +3369,8 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 		goto err_rq;
 	}
 
+	blk_mq_start_request(rq);
+
 	if (offset && length > U64_MAX - offset + 1) {
 		rbd_warn(rbd_dev, "bad request range (%llu~%llu)", offset,
 			 length);
@@ -3406,53 +3424,18 @@ err_rq:
 			 obj_op_name(op_type), length, offset, result);
 	if (snapc)
 		ceph_put_snap_context(snapc);
-	blk_end_request_all(rq, result);
+err:
+	blk_mq_end_request(rq, result);
 }
 
-static void rbd_request_workfn(struct work_struct *work)
+static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq,
+		bool last)
 {
-	struct rbd_device *rbd_dev =
-	    container_of(work, struct rbd_device, rq_work);
-	struct request *rq, *next;
-	LIST_HEAD(requests);
-
-	spin_lock_irq(&rbd_dev->lock); /* rq->q->queue_lock */
-	list_splice_init(&rbd_dev->rq_queue, &requests);
-	spin_unlock_irq(&rbd_dev->lock);
-
-	list_for_each_entry_safe(rq, next, &requests, queuelist) {
-		list_del_init(&rq->queuelist);
-		rbd_handle_request(rbd_dev, rq);
-	}
-}
+	struct rbd_device *rbd_dev = rq->q->queuedata;
+	struct work_struct *work = blk_mq_rq_to_pdu(rq);
 
-/*
- * Called with q->queue_lock held and interrupts disabled, possibly on
- * the way to schedule().  Do not sleep here!
- */
-static void rbd_request_fn(struct request_queue *q)
-{
-	struct rbd_device *rbd_dev = q->queuedata;
-	struct request *rq;
-	int queued = 0;
-
-	rbd_assert(rbd_dev);
-
-	while ((rq = blk_fetch_request(q))) {
-		/* Ignore any non-FS requests that filter through. */
-		if (rq->cmd_type != REQ_TYPE_FS) {
-			dout("%s: non-fs request type %d\n", __func__,
-				(int) rq->cmd_type);
-			__blk_end_request_all(rq, 0);
-			continue;
-		}
-
-		list_add_tail(&rq->queuelist, &rbd_dev->rq_queue);
-		queued++;
-	}
-
-	if (queued)
-		queue_work(rbd_dev->rq_wq, &rbd_dev->rq_work);
+	queue_work(rbd_dev->rq_wq, work);
+	return 0;
 }
 
 /*
@@ -3513,6 +3496,7 @@ static void rbd_free_disk(struct rbd_device *rbd_dev)
 		del_gendisk(disk);
 		if (disk->queue)
 			blk_cleanup_queue(disk->queue);
+		blk_mq_free_tag_set(&rbd_dev->tag_set);
 	}
 	put_disk(disk);
 }
@@ -3724,11 +3708,28 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 	return 0;
 }
 
+static int rbd_init_request(void *data, struct request *rq,
+		unsigned int hctx_idx, unsigned int request_idx,
+		unsigned int numa_node)
+{
+	struct work_struct *work = blk_mq_rq_to_pdu(rq);
+
+	INIT_WORK(work, rbd_queue_workfn);
+	return 0;
+}
+
+static struct blk_mq_ops rbd_mq_ops = {
+	.queue_rq	= rbd_queue_rq,
+	.map_queue	= blk_mq_map_queue,
+	.init_request	= rbd_init_request,
+};
+
 static int rbd_init_disk(struct rbd_device *rbd_dev)
 {
 	struct gendisk *disk;
 	struct request_queue *q;
 	u64 segment_size;
+	int err;
 
 	/* create gendisk info */
 	disk = alloc_disk(single_major ?
@@ -3746,10 +3747,24 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 	disk->fops = &rbd_bd_ops;
 	disk->private_data = rbd_dev;
 
-	q = blk_init_queue(rbd_request_fn, &rbd_dev->lock);
-	if (!q)
+	memset(&rbd_dev->tag_set, 0, sizeof(rbd_dev->tag_set));
+	rbd_dev->tag_set.ops = &rbd_mq_ops;
+	rbd_dev->tag_set.queue_depth = 128;
+	rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
+	rbd_dev->tag_set.flags =
+		BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	rbd_dev->tag_set.nr_hw_queues = 1;
+	rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
+
+	err = blk_mq_alloc_tag_set(&rbd_dev->tag_set);
+	if (err)
 		goto out_disk;
 
+	err = -ENOMEM;
+	q = blk_mq_init_queue(&rbd_dev->tag_set);
+	if (!q)
+		goto out_tag_set;
+
 	/* We use the default size, but let's be explicit about it. */
 	blk_queue_physical_block_size(q, SECTOR_SIZE);
 
@@ -3775,10 +3790,11 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 	rbd_dev->disk = disk;
 
 	return 0;
+out_tag_set:
+	blk_mq_free_tag_set(&rbd_dev->tag_set);
 out_disk:
 	put_disk(disk);
-
-	return -ENOMEM;
+	return err;
 }
 
 /*
@@ -4036,7 +4052,6 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 
 	spin_lock_init(&rbd_dev->lock);
 	INIT_LIST_HEAD(&rbd_dev->rq_queue);
-	INIT_WORK(&rbd_dev->rq_work, rbd_request_workfn);
 	rbd_dev->flags = 0;
 	atomic_set(&rbd_dev->parent_ref, 0);
 	INIT_LIST_HEAD(&rbd_dev->node);
-- 
1.9.1
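
For readers unfamiliar with the per-request data ("PDU") used above: the patch sets tag_set.cmd_size to sizeof(struct work_struct), so blk-mq reserves room for one work_struct directly behind every struct request, and blk_mq_rq_to_pdu() / blk_mq_rq_from_pdu() convert between the two by pointer arithmetic. The following is a user-space sketch of that layout only, not kernel code; struct fake_request, struct fake_work and all fake_* helpers are hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the block layer's struct request. */
struct fake_request {
	unsigned long sector;
	unsigned int len;
};

/* Hypothetical stand-in for the per-request work item (the PDU). */
struct fake_work {
	void (*fn)(struct fake_work *work);
	int done;
};

/* Allocate a request with cmd_size bytes of driver data right behind it. */
static struct fake_request *fake_alloc_request(size_t cmd_size)
{
	return calloc(1, sizeof(struct fake_request) + cmd_size);
}

/* Request -> driver data: the payload starts right after the request. */
static void *fake_rq_to_pdu(struct fake_request *rq)
{
	return rq + 1;
}

/* Driver data -> request: step back over the request header. */
static struct fake_request *fake_rq_from_pdu(void *pdu)
{
	return (struct fake_request *)((char *)pdu - sizeof(struct fake_request));
}

/*
 * Work function: it only receives the work item, and recovers the
 * request it belongs to from the memory layout alone.
 */
static void fake_workfn(struct fake_work *work)
{
	struct fake_request *rq = fake_rq_from_pdu(work);

	rq->len = 0;		/* pretend the whole request completed */
	work->done = 1;
}
```

Because every request carries its own work item, queue_rq can hand each request to the workqueue independently instead of funnelling all of them through the single per-device rq_work that the old code used, which is the extra concurrency the patch description refers to.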



* Re: krbd blk-mq support ?
  2014-11-03 11:08             ` Christoph Hellwig
@ 2014-11-03 13:08               ` Alexandre DERUMIER
  0 siblings, 0 replies; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-11-03 13:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ceph Devel

>>can you try the patch below instead of the previous three patches? 

Sure, I'll try tomorrow.



* Re: krbd blk-mq support ?
       [not found] <100ed8b3-2427-46d0-9a0c-e5e1a92031af@mailpro>
@ 2014-11-04  7:19 ` Alexandre DERUMIER
  2014-11-13  7:18   ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-11-04  7:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ceph Devel

Hi again,

I have good news: I finally solved my problem, simply by installing irqbalance:

# apt-get install irqbalance

So maybe the problem was at the NIC/network level.
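
A quick way to see whether interrupts are concentrated on one core (the situation irqbalance is meant to spread out) is to look at the per-CPU columns of /proc/interrupts. The sketch below parses one such row; it is an illustration rather than something used in this thread, the helper names parse_irq_row and busiest_cpu are hypothetical, and the sample row in the comment is made up.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one /proc/interrupts-style row such as
 *   " 24:   1200   0   0   0   PCI-MSI 524288-edge   eth0-rx-0"
 * into per-CPU counts.  Returns the number of columns stored (at most
 * max_cpus), or -1 if the row has no "label:" prefix.
 */
static int parse_irq_row(const char *row, unsigned long long *counts,
			 int max_cpus)
{
	const char *p = strchr(row, ':');
	int n = 0;

	if (!p)
		return -1;
	p++;
	while (n < max_cpus) {
		char *end;
		unsigned long long v;

		while (isspace((unsigned char)*p))
			p++;
		if (!isdigit((unsigned char)*p))
			break;
		v = strtoull(p, &end, 10);
		/* A count must be a whole token, not the start of a name. */
		if (*end != '\0' && !isspace((unsigned char)*end))
			break;
		counts[n++] = v;
		p = end;
	}
	return n;
}

/* Index of the CPU with the most interrupts, or -1 if n <= 0. */
static int busiest_cpu(const unsigned long long *counts, int n)
{
	int best = -1;

	for (int i = 0; i < n; i++)
		if (best < 0 || counts[i] > counts[best])
			best = i;
	return best;
}
```

If the NIC's rows show nearly all of their counts in a single column, that core is handling almost every network interrupt, which would be consistent with one core sitting near 100%.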



Now: 3.18 kernel + your patch: 120000 iops
     3.10 kernel: 80000 iops


I'll try the 3.18 kernel without your patch to compare.




      

----- Original Message -----

From: "Alexandre DERUMIER" <aderumier@odiso.com>
To: "Christoph Hellwig" <hch@infradead.org>
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org>
Sent: Tuesday, November 4, 2014 07:57:19
Subject: Re: krbd blk-mq support ?

Hi Christoph,

I tried your patch, but it brought no improvement for my problem.

I still have a kworker at nearly 100% on one core.

I was finally able to run perf on the 3.18 kernel + your patch; I have attached the report to this mail.




* Re: krbd blk-mq support ?
  2014-11-04  7:19 ` Alexandre DERUMIER
@ 2014-11-13  7:18   ` Christoph Hellwig
  2014-11-13  9:44     ` Alexandre DERUMIER
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2014-11-13  7:18 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Ceph Devel

On Tue, Nov 04, 2014 at 08:19:32AM +0100, Alexandre DERUMIER wrote:
> Now : 3.18 kernel + your patch : 120000 iops
>       3.10 kernel : 80000iops
> 
> 
> I'll try 3.18 kernel without your patch to compare.

Did you manage to get those numbers?


* Re: krbd blk-mq support ?
  2014-11-13  7:18   ` Christoph Hellwig
@ 2014-11-13  9:44     ` Alexandre DERUMIER
  2014-12-10 14:05       ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-11-13  9:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ceph Devel

>>Did you manage to get those numbers? 

Not yet, I'll try next week.


----- Original Message -----

From: "Christoph Hellwig" <hch@infradead.org>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: "Ceph Devel" <ceph-devel@vger.kernel.org>
Sent: Thursday, November 13, 2014 08:18:15
Subject: Re: krbd blk-mq support ?

On Tue, Nov 04, 2014 at 08:19:32AM +0100, Alexandre DERUMIER wrote: 
> Now : 3.18 kernel + your patch : 120000 iops 
> 3.10 kernel : 80000iops 
> 
> 
> I'll try 3.18 kernel without your patch to compare. 

Did you manage to get those numbers? 

* Re: krbd blk-mq support ?
  2014-11-13  9:44     ` Alexandre DERUMIER
@ 2014-12-10 14:05       ` Christoph Hellwig
       [not found]         ` <1394621476.865737.1418231736651.JavaMail.zimbra@oxygem.tv>
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2014-12-10 14:05 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Ceph Devel

On Thu, Nov 13, 2014 at 10:44:18AM +0100, Alexandre DERUMIER wrote:
> >>Did you manage to get those numbers?
> 
> Not yet, I'll try next week.

What's the result?  I'd really like to get rid of old request drivers
as much as possible.


* Re: krbd blk-mq support ?
       [not found]         ` <1394621476.865737.1418231736651.JavaMail.zimbra@oxygem.tv>
@ 2014-12-10 17:15           ` Alexandre DERUMIER
  0 siblings, 0 replies; 27+ messages in thread
From: Alexandre DERUMIER @ 2014-12-10 17:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: ceph-devel

Hi Christoph,

I have redone the benchmark, but I think I don't have enough ios/osd.

I'm stuck at around 120000 iops (4k randread) with or without your patch.

(But I don't see any speed regression.)

I'm going to have a bigger full-SSD production cluster in the coming months,
so I'll redo the tests when it is ready.

Regards,

Alexandre
----- Original Message -----
From: "Christoph Hellwig" <hch@infradead.org>
To: "aderumier" <aderumier@odiso.com>
Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Wednesday, December 10, 2014 15:05:18
Subject: Re: krbd blk-mq support ?

On Thu, Nov 13, 2014 at 10:44:18AM +0100, Alexandre DERUMIER wrote: 
> >>Did you manage to get those numbers? 
> 
> Not yet, I'll try next week. 

What's the result? I'd really like to get rid of old request drivers 
as much as possible. 


