[PATCH] nvme: avoid cqe corruption when update at the same time as read

Linux-NVME Archive on lore.kernel.org

* [PATCH] nvme: avoid cqe corruption when update at the same time as read
       [not found] <1963234885.1121305.1457109031908.JavaMail.zimbra@kalray.eu>
@ 2016-03-05  7:23 ` Marta Rybczynska
  2016-03-09 17:08   ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Marta Rybczynska @ 2016-03-05  7:23 UTC (permalink / raw)


The cqe structure is normally read from lower to upper addresses, and
the validity bit (in the status field) is at the highest address. If the
PCI device updates the memory while the cqe is being read by multiple
non-atomic loads, the copy may mix old and new fields and the structure
may be corrupted. Avoid this by reading the status first and then the
whole structure.

Signed-off-by: Marta Rybczynska <marta.rybczynska at kalray.eu>
---
 drivers/nvme/host/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a128672..cf74ffe 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -729,12 +729,13 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 	phase = nvmeq->cq_phase;
 
 	for (;;) {
-		struct nvme_completion cqe = nvmeq->cqes[head];
-		u16 status = le16_to_cpu(cqe.status);
+		struct nvme_completion cqe;
+		u16 status = le16_to_cpu(nvmeq->cqes[head].status);
 		struct request *req;
 
 		if ((status & 1) != phase)
 			break;
+		cqe = nvmeq->cqes[head];
 		nvmeq->sq_head = le16_to_cpu(cqe.sq_head);
 		if (++head == nvmeq->q_depth) {
 			head = 0;
-- 
2.1.4
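
Editorial illustration, not part of the patch: the ordering described in the commit message can be sketched in userspace C. The names `cqe_sketch` and `cqe_sketch_fetch` are hypothetical, the field layout is an assumption modelled on the driver's 16-byte completion entry, and the host is assumed to be little-endian so no byte swapping is shown. The sketch does not model any memory barrier a real driver might need between the two loads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace mirror of a 16-byte completion entry. */
struct cqe_sketch {
	uint32_t result;
	uint32_t rsvd;
	uint16_t sq_head;
	uint16_t sq_id;
	uint16_t command_id;
	uint16_t status;	/* bit 0 is the phase tag */
};

/*
 * Copy the entry at 'head' into 'out' only if its phase tag matches
 * 'phase'. Returns true when a valid entry was copied.
 */
static bool cqe_sketch_fetch(const volatile struct cqe_sketch *ring,
			     uint16_t head, uint16_t phase,
			     struct cqe_sketch *out)
{
	/* Step 1: load only the status word and test the phase tag. */
	uint16_t status = ring[head].status;

	if ((status & 1) != phase)
		return false;

	/* Step 2: the entry is valid, so copy the whole structure now. */
	*out = ring[head];
	return true;
}
```

Because the full copy happens only after the phase tag has been seen, a copy that starts before the device finishes writing the entry is never consumed.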


* [PATCH] nvme: avoid cqe corruption when update at the same time as read
  2016-03-05  7:23 ` [PATCH] nvme: avoid cqe corruption when update at the same time as read Marta Rybczynska
@ 2016-03-09 17:08   ` Christoph Hellwig
  2016-03-10  8:31     ` Johannes Thumshirn
  2016-03-10 11:08     ` Marta Rybczynska
  0 siblings, 2 replies; 8+ messages in thread
From: Christoph Hellwig @ 2016-03-09 17:08 UTC (permalink / raw)


On Sat, Mar 05, 2016 at 08:23:15AM +0100, Marta Rybczynska wrote:
> The cqe structure read normally happens from lower to upper addresses
> and the validity bit (status) is at the highest address. If the PCI
> updates the memory when the cqe is read by multiple non-atomic loads,
> the structure may be corrupted. Avoid this by reading the status
> first and then the whole structure.

Doing the phase check separately sounds sensible to me, but how about
doing something like the version below, which ensures we only read the
whole cqe after the check, and cleans things up a bit by using a common
helper:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1d1c6d1..f6ea5d3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -704,6 +704,11 @@ static void nvme_complete_rq(struct request *req)
 	blk_mq_end_request(req, error);
 }
 
+static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head)
+{
+	return (le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status) & 1) == head;
+}
+
 static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 {
 	u16 head, phase;
@@ -711,13 +716,10 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 	head = nvmeq->cq_head;
 	phase = nvmeq->cq_phase;
 
-	for (;;) {
+	while (nvme_cqe_valid(nvmeq, head)) {
 		struct nvme_completion cqe = nvmeq->cqes[head];
-		u16 status = le16_to_cpu(cqe.status);
 		struct request *req;
 
-		if ((status & 1) != phase)
-			break;
 		if (++head == nvmeq->q_depth) {
 			head = 0;
 			phase = !phase;
@@ -748,7 +750,7 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 		req = blk_mq_tag_to_rq(*nvmeq->tags, cqe.command_id);
 		if (req->cmd_type == REQ_TYPE_DRV_PRIV && req->special)
 			memcpy(req->special, &cqe, sizeof(cqe));
-		blk_mq_complete_request(req, status >> 1);
+		blk_mq_complete_request(req, le16_to_cpu(cqe.status) >> 1);
 
 	}
 
@@ -789,18 +791,17 @@ static irqreturn_t nvme_irq(int irq, void *data)
 static irqreturn_t nvme_irq_check(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
-	struct nvme_completion cqe = nvmeq->cqes[nvmeq->cq_head];
-	if ((le16_to_cpu(cqe.status) & 1) != nvmeq->cq_phase)
-		return IRQ_NONE;
-	return IRQ_WAKE_THREAD;
+
+	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head))
+		return IRQ_WAKE_THREAD;
+	return IRQ_NONE;
 }
 
 static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 {
 	struct nvme_queue *nvmeq = hctx->driver_data;
 
-	if ((le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status) & 1) ==
-	    nvmeq->cq_phase) {
+	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head)) {
 		spin_lock_irq(&nvmeq->q_lock);
 		__nvme_process_cq(nvmeq, &tag);
 		spin_unlock_irq(&nvmeq->q_lock);


* [PATCH] nvme: avoid cqe corruption when update at the same time as read
  2016-03-09 17:08   ` Christoph Hellwig
@ 2016-03-10  8:31     ` Johannes Thumshirn
  2016-03-10 11:08     ` Marta Rybczynska
  1 sibling, 0 replies; 8+ messages in thread
From: Johannes Thumshirn @ 2016-03-10  8:31 UTC (permalink / raw)


On Wed, Mar 09, 2016 at 09:08:46AM -0800, Christoph Hellwig wrote:
> On Sat, Mar 05, 2016 at 08:23:15AM +0100, Marta Rybczynska wrote:
> > The cqe structure read normally happens from lower to upper addresses
> > and the validity bit (status) is at the highest address. If the PCI
> > updates the memory when the cqe is read by multiple non-atomic loads,
> > the structure may be corrupted. Avoid this by reading the status
> > first and then the whole structure.
> 
> Doing the phase check separately sounds sensible to me, but how about
> doing something like the version below, which ensures we only read the
> whole cqe after the check, and cleans things up a bit by using a common
> helper:
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1d1c6d1..f6ea5d3 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -704,6 +704,11 @@ static void nvme_complete_rq(struct request *req)
>  	blk_mq_end_request(req, error);
>  }
>  
> +static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head)
> +{
> +	return (le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status) & 1) == head;
> +}
> +
>  static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
>  {
>  	u16 head, phase;
> @@ -711,13 +716,10 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
>  	head = nvmeq->cq_head;
>  	phase = nvmeq->cq_phase;
>  
> -	for (;;) {
> +	while (nvme_cqe_valid(nvmeq, head)) {
>  		struct nvme_completion cqe = nvmeq->cqes[head];
> -		u16 status = le16_to_cpu(cqe.status);
>  		struct request *req;
>  
> -		if ((status & 1) != phase)
> -			break;
>  		if (++head == nvmeq->q_depth) {
>  			head = 0;
>  			phase = !phase;
> @@ -748,7 +750,7 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
>  		req = blk_mq_tag_to_rq(*nvmeq->tags, cqe.command_id);
>  		if (req->cmd_type == REQ_TYPE_DRV_PRIV && req->special)
>  			memcpy(req->special, &cqe, sizeof(cqe));
> -		blk_mq_complete_request(req, status >> 1);
> +		blk_mq_complete_request(req, le16_to_cpu(cqe.status) >> 1);
>  
>  	}
>  
> @@ -789,18 +791,17 @@ static irqreturn_t nvme_irq(int irq, void *data)
>  static irqreturn_t nvme_irq_check(int irq, void *data)
>  {
>  	struct nvme_queue *nvmeq = data;
> -	struct nvme_completion cqe = nvmeq->cqes[nvmeq->cq_head];
> -	if ((le16_to_cpu(cqe.status) & 1) != nvmeq->cq_phase)
> -		return IRQ_NONE;
> -	return IRQ_WAKE_THREAD;
> +
> +	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head))
> +		return IRQ_WAKE_THREAD;
> +	return IRQ_NONE;
>  }
>  
>  static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
>  {
>  	struct nvme_queue *nvmeq = hctx->driver_data;
>  
> -	if ((le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status) & 1) ==
> -	    nvmeq->cq_phase) {
> +	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head)) {
>  		spin_lock_irq(&nvmeq->q_lock);
>  		__nvme_process_cq(nvmeq, &tag);
>  		spin_unlock_irq(&nvmeq->q_lock);
> 

This looks sensible,
Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH] nvme: avoid cqe corruption when update at the same time as read
  2016-03-09 17:08   ` Christoph Hellwig
  2016-03-10  8:31     ` Johannes Thumshirn
@ 2016-03-10 11:08     ` Marta Rybczynska
  2016-03-10 14:36       ` Christoph Hellwig
  1 sibling, 1 reply; 8+ messages in thread
From: Marta Rybczynska @ 2016-03-10 11:08 UTC (permalink / raw)



----- On 9 Mar 16, at 18:08, Christoph Hellwig hch at infradead.org wrote:

> On Sat, Mar 05, 2016 at 08:23:15AM +0100, Marta Rybczynska wrote:
>> The cqe structure read normally happens from lower to upper addresses
>> and the validity bit (status) is at the highest address. If the PCI
>> updates the memory when the cqe is read by multiple non-atomic loads,
>> the structure may be corrupted. Avoid this by reading the status
>> first and then the whole structure.
> 
> Doing the phase check separately sounds sensible to me, but how about
> doing something like the version below, which ensures we only read the
> whole cqe after the check, and cleans things up a bit by using a common
> helper:
> 

Looks like a good refactoring to me. However, it seems to me that in
nvme_cqe_valid we should be checking for phase, not for head.

I will test it and re-submit.

Regards,
Marta Rybczynska


* [PATCH] nvme: avoid cqe corruption when update at the same time as read
  2016-03-10 11:08     ` Marta Rybczynska
@ 2016-03-10 14:36       ` Christoph Hellwig
  2016-03-10 15:07         ` Marta Rybczynska
  2016-03-10 15:07         ` Keith Busch
  0 siblings, 2 replies; 8+ messages in thread
From: Christoph Hellwig @ 2016-03-10 14:36 UTC (permalink / raw)


On Thu, Mar 10, 2016 at 12:08:48PM +0100, Marta Rybczynska wrote:
> Looks like a good refactoring to me. However, it seems to me that in
> nvme_cqe_valid we should be checking for phase, not for head.

Yeah, this doesn't work as-is.  I still think that non-obvious
check should be split out into something.  Maybe just:

static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head, u16 phase)
{
	return (le16_to_cpu(nvmeq->cqes[head].status) & 1) == phase;
}
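
Editorial illustration, not part of the message: the reason the helper has to compare against the expected phase rather than the head index shows up once the queue wraps. The sketch below models only the status words of a small ring; `cq_sketch`, `cq_sketch_valid`, `cq_sketch_drain` and `SKETCH_DEPTH` are hypothetical names, and the consume loop mirrors the head/phase handling shown in the diffs in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_DEPTH 4

/* Only the status words of the completion queue are modelled here. */
struct cq_sketch {
	uint16_t status[SKETCH_DEPTH];
	uint16_t head;
	uint16_t phase;	/* phase tag expected for the current pass */
};

static bool cq_sketch_valid(const struct cq_sketch *cq, uint16_t head,
			    uint16_t phase)
{
	return (cq->status[head] & 1) == phase;
}

/* Consume every valid entry and return how many were consumed. */
static unsigned int cq_sketch_drain(struct cq_sketch *cq)
{
	unsigned int n = 0;
	uint16_t head = cq->head, phase = cq->phase;

	while (cq_sketch_valid(cq, head, phase)) {
		/* On wrap-around the expected phase tag flips. */
		if (++head == SKETCH_DEPTH) {
			head = 0;
			phase = !phase;
		}
		n++;
	}
	cq->head = head;
	cq->phase = phase;
	return n;
}
```

The head index ranges over the whole queue depth while the phase tag is a single bit, so the two are unrelated values; only the phase tells the consumer whether an entry belongs to the current pass over the ring.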


* [PATCH] nvme: avoid cqe corruption when update at the same time as read
  2016-03-10 14:36       ` Christoph Hellwig
@ 2016-03-10 15:07         ` Marta Rybczynska
  2016-03-15  8:24           ` Christoph Hellwig
  2016-03-10 15:07         ` Keith Busch
  1 sibling, 1 reply; 8+ messages in thread
From: Marta Rybczynska @ 2016-03-10 15:07 UTC (permalink / raw)


----- On 10 Mar 16, at 15:36, Christoph Hellwig hch at infradead.org wrote:
> 
> Yeah, this doesn't work as-is.  I still think that non-obvious
> check should be split out into something.  Maybe just:
> 
> static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head, u16 phase)
> {
>	return (le16_to_cpu(nvmeq->cqes[head].status) & 1) == phase;
> }

I did the same in my version. Here's what I have now:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e9f18e1..6d4a616 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -704,6 +704,13 @@ static void nvme_complete_rq(struct request *req)
         blk_mq_end_request(req, error);
 }
 
+/* We read the CQE phase first to check if the rest of the entry is valid */
+static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head,
+        u16 phase)
+{
+        return (le16_to_cpu(nvmeq->cqes[head].status) & 1) == phase;
+}
+
 static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 {
         u16 head, phase;
@@ -711,13 +718,10 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
         head = nvmeq->cq_head;
         phase = nvmeq->cq_phase;
 
-        for (;;) {
+        while (nvme_cqe_valid(nvmeq, head, phase)) {
                 struct nvme_completion cqe = nvmeq->cqes[head];
-                u16 status = le16_to_cpu(cqe.status);
                 struct request *req;
 
-                if ((status & 1) != phase)
-                        break;
                 if (++head == nvmeq->q_depth) {
                         head = 0;
                         phase = !phase;
@@ -748,7 +752,7 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
                 req = blk_mq_tag_to_rq(*nvmeq->tags, cqe.command_id);
                 if (req->cmd_type == REQ_TYPE_DRV_PRIV && req->special)
                         memcpy(req->special, &cqe, sizeof(cqe));
-                blk_mq_complete_request(req, status >> 1);
+                blk_mq_complete_request(req, le16_to_cpu(cqe.status) >> 1);
 
         }
 
@@ -789,18 +793,16 @@ static irqreturn_t nvme_irq(int irq, void *data)
 static irqreturn_t nvme_irq_check(int irq, void *data)
 {
         struct nvme_queue *nvmeq = data;
-        struct nvme_completion cqe = nvmeq->cqes[nvmeq->cq_head];
-        if ((le16_to_cpu(cqe.status) & 1) != nvmeq->cq_phase)
-                return IRQ_NONE;
-        return IRQ_WAKE_THREAD;
+        if (nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase))
+                return IRQ_WAKE_THREAD;
+        return IRQ_NONE;
 }
 
 static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 {
         struct nvme_queue *nvmeq = hctx->driver_data;
 
-        if ((le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status) & 1) ==
-            nvmeq->cq_phase) {
+        if (nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase)) {
                 spin_lock_irq(&nvmeq->q_lock);
                 __nvme_process_cq(nvmeq, &tag);
                 spin_unlock_irq(&nvmeq->q_lock);
-- 
2.1.0


* [PATCH] nvme: avoid cqe corruption when update at the same time as read
  2016-03-10 15:07         ` Marta Rybczynska
@ 2016-03-15  8:24           ` Christoph Hellwig
  0 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2016-03-15  8:24 UTC (permalink / raw)


Can you resend the patch you've been testing with?  No need to attribute
me, please keep your signoff as the original and final version will be
from you.


* [PATCH] nvme: avoid cqe corruption when update at the same time as read
  2016-03-10 14:36       ` Christoph Hellwig
  2016-03-10 15:07         ` Marta Rybczynska
@ 2016-03-10 15:07         ` Keith Busch
  1 sibling, 0 replies; 8+ messages in thread
From: Keith Busch @ 2016-03-10 15:07 UTC (permalink / raw)


On Thu, Mar 10, 2016 at 06:36:50AM -0800, Christoph Hellwig wrote:
> On Thu, Mar 10, 2016 at 12:08:48PM +0100, Marta Rybczynska wrote:
> > Looks like a good refactoring to me. However, it seems to me that in
> > nvme_cqe_valid we should be checking for phase, not for head.
> 
> Yeah, this doesn't work as-is.  I still think that non-obvious
> check should be split out into something.  Maybe just:
> 
> static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head, u16 phase)
> {
> 	return (le16_to_cpu(nvmeq->cqes[head].status) & 1) == phase;
> }

Looks good.

The code had copied the CQE since the initial commit, so just want to
mention a torn CQE read could cause real trouble only if the Phase is
out of sync with the Command ID. These are in the same DWORD and many
(most?) archs compile to read those 4-bytes atomically.

We certainly can't rely on that, so this is a good change.
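
Editorial illustration, not part of the message: the point about the Phase and the Command ID sharing a DWORD can be checked against the entry layout. The struct below is an assumption, written from memory of the driver's 16-byte `struct nvme_completion` of that period; `cqe_layout_sketch`, `dw3_fields` and both functions are hypothetical names. The decode helper takes the last DWORD already converted to host byte order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the 16-byte completion entry. */
struct cqe_layout_sketch {
	uint32_t result;
	uint32_t rsvd;
	uint16_t sq_head;
	uint16_t sq_id;
	uint16_t command_id;
	uint16_t status;	/* bit 0 is the phase tag */
};

_Static_assert(sizeof(struct cqe_layout_sketch) == 16, "entry is 16 bytes");

/* True when command_id and status fall in the same aligned 4-byte word. */
static bool cqe_layout_same_dword(void)
{
	return offsetof(struct cqe_layout_sketch, command_id) / 4 ==
	       offsetof(struct cqe_layout_sketch, status) / 4;
}

struct dw3_fields {
	uint16_t command_id;
	uint16_t phase;
	uint16_t status_field;
};

/* Split the last DWORD (host byte order) into its three fields. */
static struct dw3_fields cqe_dw3_decode(uint32_t dw3)
{
	struct dw3_fields f;

	f.command_id = dw3 & 0xffff;		/* bits 15:0 */
	f.phase = (dw3 >> 16) & 1;		/* bit 16 */
	f.status_field = dw3 >> 17;		/* bits 31:17 */
	return f;
}
```

Whether a compiler actually emits one 4-byte load for those two fields is not guaranteed by the language, which is the reason given above for not relying on it.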


Thread overview: 8+ messages
     [not found] <1963234885.1121305.1457109031908.JavaMail.zimbra@kalray.eu>
2016-03-05  7:23 ` [PATCH] nvme: avoid cqe corruption when update at the same time as read Marta Rybczynska
2016-03-09 17:08   ` Christoph Hellwig
2016-03-10  8:31     ` Johannes Thumshirn
2016-03-10 11:08     ` Marta Rybczynska
2016-03-10 14:36       ` Christoph Hellwig
2016-03-10 15:07         ` Marta Rybczynska
2016-03-15  8:24           ` Christoph Hellwig
2016-03-10 15:07         ` Keith Busch
