* [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code
@ 2018-10-25 21:13 Bart Van Assche
2018-10-25 21:13 ` [PATCH v3 1/2] nvmet-rdma: Rework DoS attack mitigation Bart Van Assche
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Bart Van Assche @ 2018-10-25 21:13 UTC (permalink / raw)
Hi Christoph,
This patch series reworks the DoS mitigation code in the NVMeOF
target code and also addresses a lockdep complaint that has been
discussed recently on the NVMe mailing list. Please consider this
patch series for the upstream kernel.
Thanks,
Bart.
Changes compared to v2:
- Avoid moving code introduced in patch 1/2 around in patch 2/2.
The end result of patches 1 and 2 combined remains the same.
Changes compared to v1:
- Reworked the approach of patch 1/2.
Bart Van Assche (2):
nvmet-rdma: Rework DoS attack mitigation
nvmet-rdma: Use the system workqueues again for deletion work
drivers/nvme/target/rdma.c | 36 +++++++++++++++++-------------------
1 file changed, 17 insertions(+), 19 deletions(-)
--
2.19.1.568.g152ad8e336-goog
* [PATCH v3 1/2] nvmet-rdma: Rework DoS attack mitigation
2018-10-25 21:13 [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code Bart Van Assche
@ 2018-10-25 21:13 ` Bart Van Assche
2018-10-25 21:13 ` [PATCH v3 2/2] nvmet-rdma: Use the system workqueues again for deletion work Bart Van Assche
2018-10-26 1:00 ` [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code Sagi Grimberg
2 siblings, 0 replies; 7+ messages in thread
From: Bart Van Assche @ 2018-10-25 21:13 UTC (permalink / raw)
When the rate of connect / disconnect requests is high, temporarily
refuse new connections instead of waiting until previous connections have
been cleaned up. This patch prevents the lockdep complaint shown below
from being reported. See also the following commits:
* 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing").
* 777dc82395de ("nvmet-rdma: occasionally flush ongoing controller teardown").
======================================================
WARNING: possible circular locking dependency detected
4.19.0-dbg+ #1 Not tainted
------------------------------------------------------
kworker/u12:0/7 is trying to acquire lock:
00000000c03a91d1 (&id_priv->handler_mutex){+.+.}, at: rdma_destroy_id+0x6f/0x440 [rdma_cm]
but task is already holding lock:
0000000090531fcd ((work_completion)(&queue->release_work)){+.+.}, at: process_one_work+0x3c9/0x9f0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 ((work_completion)(&queue->release_work)){+.+.}:
process_one_work+0x447/0x9f0
worker_thread+0x63/0x5a0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
-> #2 ((wq_completion)"nvmet-rdma-delete-wq"){+.+.}:
flush_workqueue+0xf3/0x970
nvmet_rdma_cm_handler+0x1320/0x170f [nvmet_rdma]
cma_ib_req_handler+0x72f/0xf90 [rdma_cm]
cm_process_work+0x2e/0x110 [ib_cm]
cm_work_handler+0x431e/0x50ba [ib_cm]
process_one_work+0x481/0x9f0
worker_thread+0x63/0x5a0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
-> #1 (&id_priv->handler_mutex/1){+.+.}:
__mutex_lock+0xfe/0xbe0
mutex_lock_nested+0x1b/0x20
cma_ib_req_handler+0x6aa/0xf90 [rdma_cm]
cm_process_work+0x2e/0x110 [ib_cm]
cm_work_handler+0x431e/0x50ba [ib_cm]
process_one_work+0x481/0x9f0
worker_thread+0x63/0x5a0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
-> #0 (&id_priv->handler_mutex){+.+.}:
lock_acquire+0xc5/0x200
__mutex_lock+0xfe/0xbe0
mutex_lock_nested+0x1b/0x20
rdma_destroy_id+0x6f/0x440 [rdma_cm]
nvmet_rdma_release_queue_work+0x8e/0x1b0 [nvmet_rdma]
process_one_work+0x481/0x9f0
worker_thread+0x63/0x5a0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
other info that might help us debug this:
Chain exists of:
&id_priv->handler_mutex --> (wq_completion)"nvmet-rdma-delete-wq" --> (work_completion)(&queue->release_work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&queue->release_work));
lock((wq_completion)"nvmet-rdma-delete-wq");
lock((work_completion)(&queue->release_work));
lock(&id_priv->handler_mutex);
*** DEADLOCK ***
2 locks held by kworker/u12:0/7:
#0: 00000000272134f2 ((wq_completion)"nvmet-rdma-delete-wq"){+.+.}, at: process_one_work+0x3c9/0x9f0
#1: 0000000090531fcd ((work_completion)(&queue->release_work)){+.+.}, at: process_one_work+0x3c9/0x9f0
stack backtrace:
CPU: 1 PID: 7 Comm: kworker/u12:0 Not tainted 4.19.0-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Workqueue: nvmet-rdma-delete-wq nvmet_rdma_release_queue_work [nvmet_rdma]
Call Trace:
dump_stack+0x86/0xc5
print_circular_bug.isra.32+0x20a/0x218
__lock_acquire+0x1c68/0x1cf0
lock_acquire+0xc5/0x200
__mutex_lock+0xfe/0xbe0
mutex_lock_nested+0x1b/0x20
rdma_destroy_id+0x6f/0x440 [rdma_cm]
nvmet_rdma_release_queue_work+0x8e/0x1b0 [nvmet_rdma]
process_one_work+0x481/0x9f0
worker_thread+0x63/0x5a0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
Cc: Sagi Grimberg <sagi at grimberg.me>
Cc: Johannes Berg <johannes.berg at intel.com>
Signed-off-by: Bart Van Assche <bvanassche at acm.org>
---
drivers/nvme/target/rdma.c | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index bd265aceb90c..e85445585884 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -127,6 +127,8 @@ static bool nvmet_rdma_use_srq;
module_param_named(use_srq, nvmet_rdma_use_srq, bool, 0444);
MODULE_PARM_DESC(use_srq, "Use shared receive queue.");
+#define MAX_QUEUES_BEING_RELEASED 16
+static atomic_t queues_being_released;
static DEFINE_IDA(nvmet_rdma_queue_ida);
static LIST_HEAD(nvmet_rdma_queue_list);
static DEFINE_MUTEX(nvmet_rdma_queue_mutex);
@@ -1068,6 +1070,8 @@ static void nvmet_rdma_release_queue_work(struct work_struct *w)
nvmet_rdma_free_queue(queue);
+ atomic_dec(&queues_being_released);
+
kref_put(&dev->ref, nvmet_rdma_free_dev);
}
@@ -1253,6 +1257,13 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
struct nvmet_rdma_queue *queue;
int ret = -EINVAL;
+ if (atomic_read(&queues_being_released) >
+ MAX_QUEUES_BEING_RELEASED) {
+ pr_warn_ratelimited("The rate at which initiator systems connect and disconnect seems to be high. New connections will be refused temporarily.\n");
+ nvmet_rdma_cm_reject(cm_id, NVME_RDMA_CM_NO_RSC);
+ return -ECONNREFUSED;
+ }
+
ndev = nvmet_rdma_find_get_device(cm_id);
if (!ndev) {
nvmet_rdma_cm_reject(cm_id, NVME_RDMA_CM_NO_RSC);
@@ -1266,14 +1277,11 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
}
queue->port = cm_id->context;
- if (queue->host_qid == 0) {
- /* Let inflight controller teardown complete */
- flush_workqueue(nvmet_rdma_delete_wq);
- }
-
ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
if (ret) {
- queue_work(nvmet_rdma_delete_wq, &queue->release_work);
+ atomic_inc(&queues_being_released);
+ WARN_ON_ONCE(!queue_work(nvmet_rdma_delete_wq,
+ &queue->release_work));
/* Destroying rdma_cm id is not needed here */
return 0;
}
@@ -1338,7 +1346,9 @@ static void __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
if (disconnect) {
rdma_disconnect(queue->cm_id);
- queue_work(nvmet_rdma_delete_wq, &queue->release_work);
+ atomic_inc(&queues_being_released);
+ WARN_ON_ONCE(!queue_work(nvmet_rdma_delete_wq,
+ &queue->release_work));
}
}
@@ -1368,7 +1378,8 @@ static void nvmet_rdma_queue_connect_fail(struct rdma_cm_id *cm_id,
mutex_unlock(&nvmet_rdma_queue_mutex);
pr_err("failed to connect queue %d\n", queue->idx);
- queue_work(nvmet_rdma_delete_wq, &queue->release_work);
+ atomic_inc(&queues_being_released);
+ WARN_ON_ONCE(!queue_work(nvmet_rdma_delete_wq, &queue->release_work));
}
/**
--
2.19.1.568.g152ad8e336-goog
* [PATCH v3 2/2] nvmet-rdma: Use the system workqueues again for deletion work
2018-10-25 21:13 [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code Bart Van Assche
2018-10-25 21:13 ` [PATCH v3 1/2] nvmet-rdma: Rework DoS attack mitigation Bart Van Assche
@ 2018-10-25 21:13 ` Bart Van Assche
2018-10-26 1:00 ` [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code Sagi Grimberg
2 siblings, 0 replies; 7+ messages in thread
From: Bart Van Assche @ 2018-10-25 21:13 UTC (permalink / raw)
Minimize system resource usage by switching back to the system
workqueues for deletion work. This patch reverts commit 2acf70ade79d
("nvmet-rdma: use a private workqueue for delete").
Cc: Sagi Grimberg <sagi at grimberg.me>
Cc: Johannes Berg <johannes.berg at intel.com>
Signed-off-by: Bart Van Assche <bvanassche at acm.org>
---
drivers/nvme/target/rdma.c | 19 +++----------------
1 file changed, 3 insertions(+), 16 deletions(-)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index e85445585884..a87bb8006d1a 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -122,7 +122,6 @@ struct nvmet_rdma_device {
int inline_page_count;
};
-static struct workqueue_struct *nvmet_rdma_delete_wq;
static bool nvmet_rdma_use_srq;
module_param_named(use_srq, nvmet_rdma_use_srq, bool, 0444);
MODULE_PARM_DESC(use_srq, "Use shared receive queue.");
@@ -1280,8 +1279,7 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
if (ret) {
atomic_inc(&queues_being_released);
- WARN_ON_ONCE(!queue_work(nvmet_rdma_delete_wq,
- &queue->release_work));
+ WARN_ON_ONCE(!schedule_work(&queue->release_work));
/* Destroying rdma_cm id is not needed here */
return 0;
}
@@ -1347,8 +1345,7 @@ static void __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
if (disconnect) {
rdma_disconnect(queue->cm_id);
atomic_inc(&queues_being_released);
- WARN_ON_ONCE(!queue_work(nvmet_rdma_delete_wq,
- &queue->release_work));
+ WARN_ON_ONCE(!schedule_work(&queue->release_work));
}
}
@@ -1379,7 +1376,7 @@ static void nvmet_rdma_queue_connect_fail(struct rdma_cm_id *cm_id,
pr_err("failed to connect queue %d\n", queue->idx);
atomic_inc(&queues_being_released);
- WARN_ON_ONCE(!queue_work(nvmet_rdma_delete_wq, &queue->release_work));
+ WARN_ON_ONCE(!schedule_work(&queue->release_work));
}
/**
@@ -1661,17 +1658,8 @@ static int __init nvmet_rdma_init(void)
if (ret)
goto err_ib_client;
- nvmet_rdma_delete_wq = alloc_workqueue("nvmet-rdma-delete-wq",
- WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
- if (!nvmet_rdma_delete_wq) {
- ret = -ENOMEM;
- goto err_unreg_transport;
- }
-
return 0;
-err_unreg_transport:
- nvmet_unregister_transport(&nvmet_rdma_ops);
err_ib_client:
ib_unregister_client(&nvmet_rdma_ib_client);
return ret;
@@ -1679,7 +1667,6 @@ static int __init nvmet_rdma_init(void)
static void __exit nvmet_rdma_exit(void)
{
- destroy_workqueue(nvmet_rdma_delete_wq);
nvmet_unregister_transport(&nvmet_rdma_ops);
ib_unregister_client(&nvmet_rdma_ib_client);
WARN_ON_ONCE(!list_empty(&nvmet_rdma_queue_list));
--
2.19.1.568.g152ad8e336-goog
* [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code
2018-10-25 21:13 [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code Bart Van Assche
2018-10-25 21:13 ` [PATCH v3 1/2] nvmet-rdma: Rework DoS attack mitigation Bart Van Assche
2018-10-25 21:13 ` [PATCH v3 2/2] nvmet-rdma: Use the system workqueues again for deletion work Bart Van Assche
@ 2018-10-26 1:00 ` Sagi Grimberg
2018-10-26 18:45 ` Bart Van Assche
2 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2018-10-26 1:00 UTC (permalink / raw)
> Hi Christoph,
Hi Bart,
> This patch series reworks the DoS mitigation code in the NVMeOF
> target code and also addresses a lockdep complaint that has been
> discussed recently on the NVMe mailing list. Please consider this
> patch series for the upstream kernel.
I can't say I agree with rejecting random connect requests if we have
more than some made-up, predefined number of resources queued for
deletion, given that there is no real locking incorrectness in waiting
a bit so that they all drain. The issue is no more than a false
complaint from lockdep.
Given that there are a number of unknown factors that affect this such
as the number of hosts a target is servicing, the number of inflight I/O
that needs quiescing at the time of the removal, and the system resource
limits, any approach that has a hard-coded limit cannot be correct.
If we do indeed want to get around the false report and not annotate it,
perhaps we should invoke rdma_destroy_id outside the work execution
context (such that it is destroyed before the work runs) and then the
dependency no longer exists?
* [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code
2018-10-26 1:00 ` [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code Sagi Grimberg
@ 2018-10-26 18:45 ` Bart Van Assche
2018-10-30 18:30 ` Sagi Grimberg
0 siblings, 1 reply; 7+ messages in thread
From: Bart Van Assche @ 2018-10-26 18:45 UTC (permalink / raw)
On Thu, 2018-10-25 at 18:00 -0700, Sagi Grimberg wrote:
> > This patch series reworks the DoS mitigation code in the NVMeOF
> > target code and also addresses a lockdep complaint that has been
> > discussed recently on the NVMe mailing list. Please consider this
> > patch series for the upstream kernel.
>
> I can't say I agree with rejecting random connect requests if we have
> more than some made-up, predefined number of resources queued for
> deletion, given that there is no real locking incorrectness in waiting
> a bit so that they all drain. The issue is no more than a false
> complaint from lockdep.
>
> Given that there are a number of unknown factors that affect this such
> as the number of hosts a target is servicing, the number of inflight I/O
> that needs quiescing at the time of the removal, and the system resource
> limits, any approach that has a hard-coded limit cannot be correct.
>
> If we do indeed want to get around the false report and not annotate it,
> perhaps we should invoke rdma_destroy_id outside the work execution
> context (such that it is destroyed before the work runs) and then the
> dependency no longer exists?
Hi Sagi,
Thanks for having taken a look. However, I'm not sure that refusing some
connection requests is really new behavior. From drivers/nvme/target/rdma.c:
ret = rdma_listen(cm_id, 128);
In other words, the maximum backlog for the CM ID through which connections
are accepted is 128. What happens if a connection request is received and the
backlog count is exceeded?
Regarding invoking rdma_destroy_id() from another context: which context
should that be? rdma_destroy_id() locks and unlocks the handler mutex.
Maintaining a list of CM IDs to be destroyed and iterating over that list
from inside a CM ID handler function will cause nested locking because,
as you know, CM ID handlers are called with the handler_mutex held. If we
choose this approach, we will have to be careful not to trigger lock
inversion.
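For illustration, a minimal sketch of what such a deferred-destroy list
could look like is shown below. All names in it (nvmet_rdma_zombie,
nvmet_rdma_reap_zombies, ...) are made up; nothing below is part of this
series or of the existing driver.

/* Sketch only: a list of CM IDs whose destruction is deferred to a work
 * item that runs without any handler_mutex held. */
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>
#include <rdma/rdma_cm.h>

struct nvmet_rdma_zombie {
        struct list_head        entry;
        struct rdma_cm_id       *cm_id;
};

static LIST_HEAD(nvmet_rdma_zombie_list);
static DEFINE_SPINLOCK(nvmet_rdma_zombie_lock);

/* Runs in workqueue context with no handler_mutex held, so calling
 * rdma_destroy_id() here does not nest CM ID handler mutexes. */
static void nvmet_rdma_reap_zombies(struct work_struct *w)
{
        struct nvmet_rdma_zombie *z, *tmp;
        LIST_HEAD(reap);

        spin_lock(&nvmet_rdma_zombie_lock);
        list_splice_init(&nvmet_rdma_zombie_list, &reap);
        spin_unlock(&nvmet_rdma_zombie_lock);

        list_for_each_entry_safe(z, tmp, &reap, entry) {
                rdma_destroy_id(z->cm_id); /* takes that CM ID's handler_mutex */
                kfree(z);
        }
}

static DECLARE_WORK(nvmet_rdma_zombie_work, nvmet_rdma_reap_zombies);

/* Called instead of rdma_destroy_id() from contexts that may hold another
 * CM ID's handler_mutex. Draining the list directly from a CM handler,
 * instead of from the work item above, would take two handler_mutex
 * instances in one call chain, which is the nesting mentioned above. */
static void nvmet_rdma_defer_destroy_id(struct rdma_cm_id *cm_id)
{
        struct nvmet_rdma_zombie *z = kmalloc(sizeof(*z), GFP_KERNEL);

        if (!z)
                return; /* error handling omitted in this sketch */
        z->cm_id = cm_id;
        spin_lock(&nvmet_rdma_zombie_lock);
        list_add_tail(&z->entry, &nvmet_rdma_zombie_list);
        spin_unlock(&nvmet_rdma_zombie_lock);
        schedule_work(&nvmet_rdma_zombie_work);
}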
Bart.
* [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code
2018-10-26 18:45 ` Bart Van Assche
@ 2018-10-30 18:30 ` Sagi Grimberg
2018-10-30 19:43 ` Bart Van Assche
0 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2018-10-30 18:30 UTC (permalink / raw)
>> I can't say I agree with rejecting random connect requests if we have
>> more than some made-up, predefined number of resources queued for
>> deletion, given that there is no real locking incorrectness in waiting
>> a bit so that they all drain. The issue is no more than a false
>> complaint from lockdep.
>>
>> Given that there are a number of unknown factors that affect this such
>> as the number of hosts a target is servicing, the number of inflight I/O
>> that needs quiescing at the time of the removal, and the system resource
>> limits, any approach that has a hard-coded limit cannot be correct.
>>
>> If we do indeed want to get around the false report and not annotate it,
>> perhaps we should invoke rdma_destroy_id outside the work execution
>> context (such that it is destroyed before the work runs) and then the
>> dependency no longer exists?
>
> Hi Sagi,
>
> Thanks for having taken a look. However, I'm not sure that refusing some
> connection requests is really new behavior. From drivers/nvme/target/rdma.c:
>
> ret = rdma_listen(cm_id, 128);
>
> In other words, the maximum backlog for the CM ID through which connections
> are accepted is 128. What happens if a connection request is received and the
> backlog count is exceeded?
That is correct in theory, but I've never seen this limit being hit in
practice (maybe because it's only relevant for iWARP). The case where we
quiesce I/O during queue shutdown is much more common.
> Regarding invoking rdma_destroy_id() from another context: which context
> should that be? rdma_destroy_id() locks and unlocks the handler mutex.
> Maintaining a list of CM IDs to be destroyed and iterating over that list
> from inside a CM ID handler function will cause nested locking because as
> you know CM ID handlers are called with the handler_mutex held. If we choose
> this approach we will have to be careful not to trigger lock inversion.
I thought maybe we should destroy the cm_id before queueing the work
(unless we are invoked from the cm_id handler, in which case we can
return a non-zero code from the handler, which will implicitly destroy
the cm_id).
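For illustration, here is an untested and heavily simplified sketch of
that idea. example_cm_handler and the queue->cm_id = NULL convention are
made up; this is not the driver's actual handler, and the release work is
assumed (change not shown) to skip rdma_destroy_id() when queue->cm_id is
NULL.

/* Sketch only: let the RDMA CM core destroy the cm_id by returning a
 * non-zero value from the event handler, instead of calling
 * rdma_destroy_id() from the release work. Relies on the driver-internal
 * struct nvmet_rdma_queue definition. */
static int example_cm_handler(struct rdma_cm_id *cm_id,
                              struct rdma_cm_event *event)
{
        struct nvmet_rdma_queue *queue =
                cm_id->qp ? cm_id->qp->qp_context : NULL;

        if (event->event == RDMA_CM_EVENT_DISCONNECTED && queue) {
                queue->cm_id = NULL; /* tell the release work not to destroy it */
                schedule_work(&queue->release_work);
                return 1; /* non-zero: the CM core destroys cm_id after we return */
        }
        return 0;
}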
But are we sure we can't simply resolve the false lockdep complaint?
* [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code
2018-10-30 18:30 ` Sagi Grimberg
@ 2018-10-30 19:43 ` Bart Van Assche
0 siblings, 0 replies; 7+ messages in thread
From: Bart Van Assche @ 2018-10-30 19:43 UTC (permalink / raw)
On Tue, 2018-10-30 at 11:30 -0700, Sagi Grimberg wrote:
> But are we sure we can't simply resolve the false lockdep complaint?
Hi Sagi,
Although I agree that the code that is currently upstream is fine, it is
hard to teach this to lockdep. The flush_workqueue(nvmet_rdma_delete_wq)
call waits for all work that has been queued on nvmet_rdma_delete_wq and is
called with a CM ID handler mutex held. Work items queued on that work queue
lock and unlock the CM ID handler mutex. This is fine because each work
queue flush call waits for deletion of previously created CM IDs to finish.
I think that teaching this to lockdep would require modeling the handler
mutex of later created CM IDs as outer locks and the handler mutexes of
earlier created CM IDs as inner locks. The 'subclass' argument of the
functions in <linux/lockdep.h> represents the locking nesting level. A
subclass could be associated with the handler mutex by setting it at CM ID
creation time. Since CM IDs associated with logins are created from inside
the RDMA CM, I'm not sure how to implement this. Additionally, the maximum
number of subclasses supported by lockdep is not that high:
#define MAX_LOCKDEP_SUBCLASSES 8UL
It's not clear to me how to model CM ID cleanup as nested locking after more
than MAX_LOCKDEP_SUBCLASSES logins.
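For reference, a toy example of what subclass-based annotation looks like
is shown below; nothing in it is rdma_cm code and the 'level' field is
made up.

#include <linux/mutex.h>

/* Toy illustration: lockdep subclasses let two instances of the same lock
 * class be taken in a fixed outer -> inner order without a false positive,
 * but only MAX_LOCKDEP_SUBCLASSES (8) nesting levels are available. */
struct toy_id {
        struct mutex    handler_mutex;
        int             level;  /* hypothetical nesting level, 0..7 */
};

static void toy_destroy_inner(struct toy_id *outer, struct toy_id *inner)
{
        mutex_lock_nested(&outer->handler_mutex, outer->level);
        /* inner->level must be greater than outer->level and stay below
         * MAX_LOCKDEP_SUBCLASSES, which is why this does not scale to an
         * arbitrary number of concurrently created CM IDs. */
        mutex_lock_nested(&inner->handler_mutex, inner->level);
        /* ... tear down 'inner' ... */
        mutex_unlock(&inner->handler_mutex);
        mutex_unlock(&outer->handler_mutex);
}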
Since I am not aware of an elegant way to suppress this lockdep complaint
without also suppressing useful reports, I came up with this patch series.
This patch series brings two additional benefits:
- Lower latency for login due to the removal of a flush_workqueue() call from
the login code.
- Removal of a global workqueue.
If you would prefer to implement another solution to get rid of this lockdep
complaint, alternative proposals are welcome.
Bart.
Thread overview: 7 messages
2018-10-25 21:13 [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code Bart Van Assche
2018-10-25 21:13 ` [PATCH v3 1/2] nvmet-rdma: Rework DoS attack mitigation Bart Van Assche
2018-10-25 21:13 ` [PATCH v3 2/2] nvmet-rdma: Use the system workqueues again for deletion work Bart Van Assche
2018-10-26 1:00 ` [PATCH v3 0/2] nvmet-rdma: Rework the DoS mitigation code Sagi Grimberg
2018-10-26 18:45 ` Bart Van Assche
2018-10-30 18:30 ` Sagi Grimberg
2018-10-30 19:43 ` Bart Van Assche