[PATCH 0/1] Fix for nvme-rdma host crash in nvmf-all.3

From: sagi@grimberg.me (Sagi Grimberg)
Subject: [PATCH 0/1] Fix for nvme-rdma host crash in nvmf-all.3
Date: Thu, 23 Jun 2016 18:50:31 +0300	[thread overview]
Message-ID: <576C0547.4060403@grimberg.me> (raw)
In-Reply-To: <004f01d1cd57$85fc0530$91f40f90$@opengridcomputing.com>


>>> This patch fixes a touch-after-free bug I discovered.  It is against
>>> nvmf-all.3 branch of git://git.infradead.org/nvme-fabrics.git.  The patch
>>> is kind of ugly, so any ideas on a cleaner solution are welcome.
>>
>> Hey Steve, I don't see how this bug fixes the root-cause. Not exactly
>> sure we understand the root-cause. Is it possible that this is a chelsio
>> specific issue with send completion signaling (like we saw before)? Did
>> this happen with a non-chelsio device?
>
> Due to the stack trace, I believe this is a similar issue we saw before.  It is
> probably chelsio-specific.  I don't see it on mlx4.
>
> The fix for the previous occurrence of this crash was to signal all FLUSH
> commands.  Do you recall why that fixed it?  Perhaps this failure path needs
> some other signaled command to force the pending unsignaled WRs to be marked
> "complete" by the driver?

OK, so as discussed off-list, signaling connect sends resolves the issue.
My recollection is that when the Chelsio queue pair transitions to the
error/drain state, the cxgb4 driver does not know which unsignaled sends
have already completed, so it completes them again, and by then the
wr_cqe might already have been freed. I assume the same is going on
here, since we free the tag set before draining the qp...
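
To make the suspected sequence concrete, here is a toy userspace model of
it. This is only a sketch of the reasoning above, not cxgb4 or nvme-rdma
code: the class, field, and function names are all hypothetical, and the
"driver" is assumed to learn that a send finished only when a signaled
completion is reaped.

```python
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int
    signaled: bool
    hw_done: bool = False     # adapter finished executing the send
    reaped: bool = False      # driver has accounted for its completion
    cqe_freed: bool = False   # ULP already released the wr_cqe behind it


class SendQueue:
    """Hypothetical model of a send queue with unsignaled work requests."""

    def __init__(self):
        self.wrs = []

    def post(self, wr_id, signaled):
        wr = WorkRequest(wr_id, signaled)
        self.wrs.append(wr)
        return wr

    def hw_complete_all(self):
        # The adapter executes every posted send.
        for wr in self.wrs:
            wr.hw_done = True
        # Assumption: only a signaled completion tells the driver that this
        # WR, and everything posted before it, is finished.
        last = max((i for i, wr in enumerate(self.wrs) if wr.signaled),
                   default=-1)
        for wr in self.wrs[:last + 1]:
            wr.reaped = True

    def flush(self):
        # QP moves to error: the driver completes every WR it still believes
        # is outstanding. Return the ids whose completion would touch a
        # wr_cqe that has already been freed.
        stale = []
        for wr in self.wrs:
            if not wr.reaped:
                wr.reaped = True
                if wr.cqe_freed:
                    stale.append(wr.wr_id)
        return stale


def run_scenario(signal_connect):
    sq = SendQueue()
    connect = sq.post(wr_id=1, signaled=signal_connect)
    sq.hw_complete_all()
    connect.cqe_freed = True  # tag set freed before the QP is drained
    return sq.flush()
```

In this model, run_scenario(False) reports WR 1 as completed a second time
against freed memory, while run_scenario(True) reports nothing, which
matches the observation that signaling connect sends makes the crash go
away.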
