* [PATCH] nvmet-rdma: fix possible bogus dereference under heavy load
@ 2018-09-01 1:14 Sagi Grimberg
2018-09-01 7:44 ` Christoph Hellwig
From: Sagi Grimberg @ 2018-09-01 1:14 UTC
Currently we always repost the recv buffer before we send a response
capsule back to the host. Since ordering is not guaranteed for send
and recv completions, it is possible that we will receive a new request
from the host before we get a send completion for the response capsule.

Today, we pre-allocate 2x the queue length of rsps, but in reality, under
heavy load there is nothing that really prevents the gap from expanding
until we exhaust all our rsps.

To fix this, if we don't have any pre-allocated rsps left, we dynamically
allocate an rsp and make sure to free it when we are done.
Reported-by: Steve Wise <swise at opengridcomputing.com>
Tested-by: Steve Wise <swise at opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
drivers/nvme/target/rdma.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index e7f43d1e1779..1515b8a76ea5 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -66,6 +66,7 @@ struct nvmet_rdma_rsp {
 
 	struct nvmet_req	req;
 
+	bool			allocated;
 	u8			n_rdma;
 	u32			flags;
 	u32			invalidate_rkey;
@@ -174,10 +175,16 @@ nvmet_rdma_get_rsp(struct nvmet_rdma_queue *queue)
 	unsigned long flags;
 
 	spin_lock_irqsave(&queue->rsps_lock, flags);
-	rsp = list_first_entry(&queue->free_rsps,
+	rsp = list_first_entry_or_null(&queue->free_rsps,
 				struct nvmet_rdma_rsp, free_list);
-	list_del(&rsp->free_list);
 	spin_unlock_irqrestore(&queue->rsps_lock, flags);
+	if (rsp) {
+		list_del(&rsp->free_list);
+		rsp->allocated = false;
+	} else {
+		rsp = kmalloc(sizeof(*rsp), GFP_KERNEL);
+		rsp->allocated = true;
+	}
 
 	return rsp;
 }
@@ -187,6 +194,11 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
 {
 	unsigned long flags;
 
+	if (rsp->allocated) {
+		kfree(rsp);
+		return;
+	}
+
 	spin_lock_irqsave(&rsp->queue->rsps_lock, flags);
 	list_add_tail(&rsp->free_list, &rsp->queue->free_rsps);
 	spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
--
2.17.1
* [PATCH] nvmet-rdma: fix possible bogus dereference under heavy load
2018-09-01 1:14 [PATCH] nvmet-rdma: fix possible bogus dereference under heavy load Sagi Grimberg
@ 2018-09-01 7:44 ` Christoph Hellwig
2018-09-01 12:40 ` Steve Wise
2018-09-03 10:35 ` Sagi Grimberg
From: Christoph Hellwig @ 2018-09-01 7:44 UTC
On Fri, Aug 31, 2018 at 06:14:47PM -0700, Sagi Grimberg wrote:
> Currently we always repost the recv buffer before we send a response
> capsule back to the host. Since ordering is not guaranteed for send
> and recv completions, it is possible that we will receive a new request
> from the host before we get a send completion for the response capsule.
>
> Today, we pre-allocate 2x the queue length of rsps, but in reality, under
> heavy load there is nothing that really prevents the gap from expanding
> until we exhaust all our rsps.
This is a little scary and means we have the same issue in all other
protocol drivers. It also means that there probably is no point in
allocating twice the queue length as it will be wasted most of the
time, but still not enough.
>  	spin_lock_irqsave(&queue->rsps_lock, flags);
> -	rsp = list_first_entry(&queue->free_rsps,
> +	rsp = list_first_entry_or_null(&queue->free_rsps,
>  				struct nvmet_rdma_rsp, free_list);
> -	list_del(&rsp->free_list);
>  	spin_unlock_irqrestore(&queue->rsps_lock, flags);
> +	if (rsp) {
> +		list_del(&rsp->free_list);
The list_del needs to be under the lock.
> +		rsp->allocated = false;
> +	} else {
> +		rsp = kmalloc(sizeof(*rsp), GFP_KERNEL);
> +		rsp->allocated = true;
This needs to handle a NULL return from kmalloc.
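I.e., something like this, perhaps; an untested sketch only, which keeps
the list_del under rsps_lock and makes the callers deal with allocation
failure:

static struct nvmet_rdma_rsp *
nvmet_rdma_get_rsp(struct nvmet_rdma_queue *queue)
{
	struct nvmet_rdma_rsp *rsp;
	unsigned long flags;

	spin_lock_irqsave(&queue->rsps_lock, flags);
	rsp = list_first_entry_or_null(&queue->free_rsps,
				struct nvmet_rdma_rsp, free_list);
	if (rsp)
		list_del(&rsp->free_list);	/* still under rsps_lock */
	spin_unlock_irqrestore(&queue->rsps_lock, flags);

	if (rsp) {
		rsp->allocated = false;
		return rsp;
	}

	/* pool exhausted: fall back to a dynamic allocation */
	rsp = kmalloc(sizeof(*rsp), GFP_KERNEL);
	if (!rsp)
		return NULL;	/* callers must now handle a NULL rsp */
	rsp->allocated = true;
	return rsp;
}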
* [PATCH] nvmet-rdma: fix possible bogus dereference under heavy load
2018-09-01 7:44 ` Christoph Hellwig
@ 2018-09-01 12:40 ` Steve Wise
2018-09-02 16:24 ` Max Gurtovoy
2018-09-03 10:35 ` Sagi Grimberg
1 sibling, 1 reply; 8+ messages in thread
From: Steve Wise @ 2018-09-01 12:40 UTC (permalink / raw)
> -----Original Message-----
> From: Christoph Hellwig <hch at lst.de>
> Sent: Saturday, September 1, 2018 2:44 AM
> To: Sagi Grimberg <sagi at grimberg.me>
> Cc: linux-nvme at lists.infradead.org; Steve Wise
> <swise at opengridcomputing.com>; Christoph Hellwig <hch at lst.de>
> Subject: Re: [PATCH] nvmet-rdma: fix possible bogus dereference under
> heavy load
>
> On Fri, Aug 31, 2018 at 06:14:47PM -0700, Sagi Grimberg wrote:
> > Currently we always repost the recv buffer before we send a response
> > capsule back to the host. Since ordering is not guaranteed for send
> > and recv completions, it is possible that we will receive a new request
> > from the host before we get a send completion for the response capsule.
> >
> > Today, we pre-allocate 2x the queue length of rsps, but in reality, under
> > heavy load there is nothing that really prevents the gap from expanding
> > until we exhaust all our rsps.
>
> This is a little scary and means we have the same issue in all other
> protocol drivers. It also means that there probably is no point in
> allocating twice the queue length as it will be wasted most of the
> time, but still not enough.
I'm not sure the other protocols have this issue. Where is the similar
issue in ib_isert?

If we made the rsp struct part of the nvmet_rdma_cmd, then we could remove
this issue and the list management involved with the nvmet_rdma_rsp struct.
Then we'd be assured that when we get an nvme command from the host via a
recv_done completion, we have an rsp.
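Something like this, roughly; just a sketch of the idea, with the
existing fields elided and the embedded rsp field invented here for
illustration:

struct nvmet_rdma_cmd {
	/* ... existing fields (sge, cqe, recv wr, ...) ... */
	struct nvmet_rdma_rsp	rsp;	/* one rsp per recv buffer */
};

That would also let us drop the free_rsps list and rsps_lock entirely.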
Steve
* [PATCH] nvmet-rdma: fix possible bogus dereference under heavy load
2018-09-01 12:40 ` Steve Wise
@ 2018-09-02 16:24 ` Max Gurtovoy
2018-09-03 10:42 ` Sagi Grimberg
From: Max Gurtovoy @ 2018-09-02 16:24 UTC
On 9/1/2018 3:40 PM, Steve Wise wrote:
>
>
>> -----Original Message-----
>> From: Christoph Hellwig <hch at lst.de>
>> Sent: Saturday, September 1, 2018 2:44 AM
>> To: Sagi Grimberg <sagi at grimberg.me>
>> Cc: linux-nvme at lists.infradead.org; Steve Wise
>> <swise at opengridcomputing.com>; Christoph Hellwig <hch at lst.de>
>> Subject: Re: [PATCH] nvmet-rdma: fix possible bogus dereference under
>> heavy load
>>
>> On Fri, Aug 31, 2018 at 06:14:47PM -0700, Sagi Grimberg wrote:
>>> Currently we always repost the recv buffer before we send a response
>>> capsule back to the host. Since ordering is not guaranteed for send
>>> and recv completions, it is possible that we will receive a new request
>>> from the host before we get a send completion for the response capsule.
>>>
>>> Today, we pre-allocate 2x the queue length of rsps, but in reality, under
>>> heavy load there is nothing that really prevents the gap from expanding
>>> until we exhaust all our rsps.
>>
>> This is a little scary and means we have the same issue in all other
>> protocol drivers. It also means that there probably is no point in
>> allocating twice the queue length as it will be wasted most of the
>> time, but still not enough.
>
> I'm not sure the other protocols have this issue. Where is the similar
> issue in ib_isert?
>
> If we made the rsp struct part of the nvmet_rdma_cmd, then we could remove
> this issue and the list management involved with the nvmet_rdma_rsp struct.
> Then we'd be assured that when we get an nvme command from the host via a
> recv_done completion, we have an rsp.
I'm not sure this is so simple, since we destroy the rdma_rw_ctx during
the send completion (which tells us that the RDMA op finished), and that
happens after we post_recv the nvmet_rdma_cmd.

We also can't do this check during send completion:

	if (rsp->req.sg != &rsp->cmd->inline_sg)
		sgl_free(rsp->req.sg);

since the cmd was already post_recv'ed (Sagi, can you handle this in v2?
We can simply add a pointer to this in the rsp.) We need to make sure we
don't count on the rsp-cmd association after the post_recv call (maybe
remove it?).

For now the proposed solution looks fine (with Christoph's and the above
comments). But we must also check the return value of nvmet_rdma_get_rsp(),
and in case we get NULL we should add the cmd to a new cmd_wait_list.
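Something along these lines in the recv completion path, perhaps; a
sketch only, where cmd_wait_list, cmd_wait_lock and cmd->wait_list are
made-up names for illustration:

	rsp = nvmet_rdma_get_rsp(queue);
	if (unlikely(!rsp)) {
		/*
		 * No pre-allocated rsp left and the dynamic allocation
		 * failed: park the cmd on a wait list and re-process it
		 * once an rsp is returned to the pool.
		 */
		spin_lock(&queue->cmd_wait_lock);
		list_add_tail(&cmd->wait_list, &queue->cmd_wait_list);
		spin_unlock(&queue->cmd_wait_lock);
		return;
	}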
Thoughts?
>
> Steve
* [PATCH] nvmet-rdma: fix possible bogus dereference under heavy load
2018-09-01 7:44 ` Christoph Hellwig
2018-09-01 12:40 ` Steve Wise
@ 2018-09-03 10:35 ` Sagi Grimberg
From: Sagi Grimberg @ 2018-09-03 10:35 UTC
> This is a little scary and means we have the same issue in all other
> protocol drivers. It also means that there probably is no point in
> allocating twice the queue length as it will be wasted most of the
> time, but still not enough.
Not exactly. The 2x, I guess, was designed to sustain the overhead of
the send completion until more commands are sent by the host. It just
doesn't seem to be enough. But without the extra factor I guess we'd
find ourselves dynamically allocating more than we'd want to. It's
cheap enough to keep, in my mind.
>>  	spin_lock_irqsave(&queue->rsps_lock, flags);
>> -	rsp = list_first_entry(&queue->free_rsps,
>> +	rsp = list_first_entry_or_null(&queue->free_rsps,
>>  				struct nvmet_rdma_rsp, free_list);
>> -	list_del(&rsp->free_list);
>>  	spin_unlock_irqrestore(&queue->rsps_lock, flags);
>> +	if (rsp) {
>> +		list_del(&rsp->free_list);
>
> The list_del needs to be under the lock.
Oops... too fast on the trigger...
>> +		rsp->allocated = false;
>> +	} else {
>> +		rsp = kmalloc(sizeof(*rsp), GFP_KERNEL);
>> +		rsp->allocated = true;
>
> This needs to handle a NULL return from kmalloc.
Right, thanks!
* [PATCH] nvmet-rdma: fix possible bogus dereference under heavy load
2018-09-02 16:24 ` Max Gurtovoy
@ 2018-09-03 10:42 ` Sagi Grimberg
2018-09-03 10:45 ` Sagi Grimberg
From: Sagi Grimberg @ 2018-09-03 10:42 UTC
> We also can't do this check during send completion:
>
> 	if (rsp->req.sg != &rsp->cmd->inline_sg)
> 		sgl_free(rsp->req.sg);
>
> since the cmd was already post_recv'ed (Sagi, can you handle this in v2?
> We can simply add a pointer to this in the rsp.) We need to make sure we
> don't count on the rsp-cmd association after the post_recv call (maybe
> remove it?).
Right, will add this as an incremental patch.
* [PATCH] nvmet-rdma: fix possible bogus dereference under heavy load
2018-09-03 10:42 ` Sagi Grimberg
@ 2018-09-03 10:45 ` Sagi Grimberg
2018-09-03 14:54 ` Max Gurtovoy
From: Sagi Grimberg @ 2018-09-03 10:45 UTC
>> We also can't do this check during send completion:
>>
>> 	if (rsp->req.sg != &rsp->cmd->inline_sg)
>> 		sgl_free(rsp->req.sg);
>>
>> since the cmd was already post_recv'ed (Sagi, can you handle this in v2?
>> We can simply add a pointer to this in the rsp.) We need to make sure we
>> don't count on the rsp-cmd association after the post_recv call (maybe
>> remove it?).
>
> Right, will add this as an incremental patch.
Wait, cmd->inline_sg never changes, and we only deallocate if we
allocated a different sgl, which means we cannot free something that
shouldn't be freed. I agree it's not clean, but isn't it harmless?
* [PATCH] nvmet-rdma: fix possible bogus dereference under heavy load
2018-09-03 10:45 ` Sagi Grimberg
@ 2018-09-03 14:54 ` Max Gurtovoy
From: Max Gurtovoy @ 2018-09-03 14:54 UTC
On 9/3/2018 1:45 PM, Sagi Grimberg wrote:
>>> We also can't do this check during send completion:
>>>
>>> 	if (rsp->req.sg != &rsp->cmd->inline_sg)
>>> 		sgl_free(rsp->req.sg);
>>>
>>> since the cmd was already post_recv'ed (Sagi, can you handle this in v2?
>>> We can simply add a pointer to this in the rsp.) We need to make sure we
>>> don't count on the rsp-cmd association after the post_recv call (maybe
>>> remove it?).
>>
>> Right, will add this as an incremental patch.
>
> Wait, cmd->inline_sg never changes, and we only deallocate if we
> allocated a different sgl, which means we cannot free something that
> shouldn't be freed. I agree it's not clean, but isn't it harmless?
Yes, I was worried because we free the cmds before the rsps, but we also
destroy the QP before that, so we're OK here.