[PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update

linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
@ 2018-03-07 17:56 Sagi Grimberg
  2018-03-08  7:56 ` Christoph Hellwig
  2018-03-08 15:51 ` Keith Busch
  0 siblings, 2 replies; 12+ messages in thread
From: Sagi Grimberg @ 2018-03-07 17:56 UTC (permalink / raw)


A wmb() is always required before operations that trigger DMA
(e.g. mmio write to a doorbell record). Not having it might result
in write reordering causing the device to fetch partly updated
submission queue entries.

Reported-by: Jason Gunthorpe <jgg at ziepe.ca>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
Note that this is stable material, We could add stable
tag when applying it to the nvme tree.

 drivers/nvme/host/pci.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index db4dedb1cc8f..6c2dcef1ff91 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -437,8 +437,14 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
 	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
-					      nvmeq->dbbuf_sq_ei))
+					      nvmeq->dbbuf_sq_ei)) {
+		/*
+		 * Make sure that descriptors are written before
+		 * doorbell record.
+		 */
+		wmb();
 		writel(tail, nvmeq->q_db);
+	}
 	nvmeq->sq_tail = tail;
 }
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-03-07 17:56 [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update Sagi Grimberg
@ 2018-03-08  7:56 ` Christoph Hellwig
  2018-03-08 15:51 ` Keith Busch
  1 sibling, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2018-03-08  7:56 UTC (permalink / raw)


Yes, looks like we'd need this.

On Wed, Mar 07, 2018@07:56:26PM +0200, Sagi Grimberg wrote:
> A wmb() is always required before operations that trigger DMA
> (e.g. mmio write to a doorbell record). Not having it might result
> in write reordering causing the device to fetch partly updated
> submission queue entries.
> 
> Reported-by: Jason Gunthorpe <jgg at ziepe.ca>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
> Note that this is stable material, We could add stable
> tag when applying it to the nvme tree.
> 
>  drivers/nvme/host/pci.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index db4dedb1cc8f..6c2dcef1ff91 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -437,8 +437,14 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>  	if (++tail == nvmeq->q_depth)
>  		tail = 0;
>  	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
> -					      nvmeq->dbbuf_sq_ei))
> +					      nvmeq->dbbuf_sq_ei)) {
> +		/*
> +		 * Make sure that descriptors are written before
> +		 * doorbell record.
> +		 */
> +		wmb();
>  		writel(tail, nvmeq->q_db);
> +	}
>  	nvmeq->sq_tail = tail;
>  }
>  
> -- 
> 2.14.1
---end quoted text---

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-03-07 17:56 [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update Sagi Grimberg
  2018-03-08  7:56 ` Christoph Hellwig
@ 2018-03-08 15:51 ` Keith Busch
  2018-03-08 17:20   ` Jason Gunthorpe
  1 sibling, 1 reply; 12+ messages in thread
From: Keith Busch @ 2018-03-08 15:51 UTC (permalink / raw)


On Wed, Mar 07, 2018@07:56:26PM +0200, Sagi Grimberg wrote:
> @@ -437,8 +437,14 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>  	if (++tail == nvmeq->q_depth)
>  		tail = 0;
>  	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
> -					      nvmeq->dbbuf_sq_ei))
> +					      nvmeq->dbbuf_sq_ei)) {
> +		/*
> +		 * Make sure that descriptors are written before
> +		 * doorbell record.
> +		 */
> +		wmb();
>  		writel(tail, nvmeq->q_db);
> +	}
>  	nvmeq->sq_tail = tail;
>  }

If this really is necessary, we'd need this before updating the event
shadow registers too.

I'd like to understand this a bit more as we haven't done this in eight
years and I can't recall any issues around this section. Have we just
been fortunate that the problem this fixes is extraordinarily unlikely,
or is something else implicitly ordering within this critical section?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-03-08 15:51 ` Keith Busch
@ 2018-03-08 17:20   ` Jason Gunthorpe
  2018-03-08 17:37     ` Keith Busch
  2018-03-08 17:56     ` Sagi Grimberg
  0 siblings, 2 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2018-03-08 17:20 UTC (permalink / raw)


On Thu, Mar 08, 2018@08:51:47AM -0700, Keith Busch wrote:
> On Wed, Mar 07, 2018@07:56:26PM +0200, Sagi Grimberg wrote:
> > @@ -437,8 +437,14 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> >  	if (++tail == nvmeq->q_depth)
> >  		tail = 0;
> >  	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
> > -					      nvmeq->dbbuf_sq_ei))
> > +					      nvmeq->dbbuf_sq_ei)) {
> > +		/*
> > +		 * Make sure that descriptors are written before
> > +		 * doorbell record.
> > +		 */
> > +		wmb();
> >  		writel(tail, nvmeq->q_db);
> > +	}
> >  	nvmeq->sq_tail = tail;
> >  }
> 
> If this really is necessary, we'd need this before updating the event
> shadow registers too.
> 
> I'd like to understand this a bit more as we haven't done this in eight
> years and I can't recall any issues around this section. Have we just
> been fortunate that the problem this fixes is extraordinarily unlikely,
> or is something else implicitly ordering within this critical section?

Well, there is a wmb() already inside
nvme_dbbuf_update_and_check_event so any failure would only
be related to dbbuf_sq_db being wrong when tail is written to q_db.

Guessing that might be a basically undetectable situation, maybe some
temporary higher latency or something?

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-03-08 17:20   ` Jason Gunthorpe
@ 2018-03-08 17:37     ` Keith Busch
  2018-03-08 17:56     ` Sagi Grimberg
  1 sibling, 0 replies; 12+ messages in thread
From: Keith Busch @ 2018-03-08 17:37 UTC (permalink / raw)


On Thu, Mar 08, 2018@10:20:18AM -0700, Jason Gunthorpe wrote:
> On Thu, Mar 08, 2018@08:51:47AM -0700, Keith Busch wrote:
> > On Wed, Mar 07, 2018@07:56:26PM +0200, Sagi Grimberg wrote:
> > > @@ -437,8 +437,14 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> > >  	if (++tail == nvmeq->q_depth)
> > >  		tail = 0;
> > >  	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
> > > -					      nvmeq->dbbuf_sq_ei))
> > > +					      nvmeq->dbbuf_sq_ei)) {
> > > +		/*
> > > +		 * Make sure that descriptors are written before
> > > +		 * doorbell record.
> > > +		 */
> > > +		wmb();
> > >  		writel(tail, nvmeq->q_db);
> > > +	}
> > >  	nvmeq->sq_tail = tail;
> > >  }
> > 
> > If this really is necessary, we'd need this before updating the event
> > shadow registers too.
> > 
> > I'd like to understand this a bit more as we haven't done this in eight
> > years and I can't recall any issues around this section. Have we just
> > been fortunate that the problem this fixes is extraordinarily unlikely,
> > or is something else implicitly ordering within this critical section?
> 
> Well, there is a wmb() already inside
> nvme_dbbuf_update_and_check_event so any failure would only
> be related to dbbuf_sq_db being wrong when tail is written to q_db.

Ah, nvme_dbbuf_update_and_check_event is used for CQ's too, which
has no need for such a barrier. We just need it on the SQ side.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-03-08 17:20   ` Jason Gunthorpe
  2018-03-08 17:37     ` Keith Busch
@ 2018-03-08 17:56     ` Sagi Grimberg
  2018-03-08 18:36       ` Jason Gunthorpe
  2018-05-30 23:34       ` Sagi Grimberg
  1 sibling, 2 replies; 12+ messages in thread
From: Sagi Grimberg @ 2018-03-08 17:56 UTC (permalink / raw)



>>> @@ -437,8 +437,14 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>>>   	if (++tail == nvmeq->q_depth)
>>>   		tail = 0;
>>>   	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
>>> -					      nvmeq->dbbuf_sq_ei))
>>> +					      nvmeq->dbbuf_sq_ei)) {
>>> +		/*
>>> +		 * Make sure that descriptors are written before
>>> +		 * doorbell record.
>>> +		 */
>>> +		wmb();
>>>   		writel(tail, nvmeq->q_db);
>>> +	}
>>>   	nvmeq->sq_tail = tail;
>>>   }
>>
>> If this really is necessary, we'd need this before updating the event
>> shadow registers too.
>>
>> I'd like to understand this a bit more as we haven't done this in eight
>> years and I can't recall any issues around this section. Have we just
>> been fortunate that the problem this fixes is extraordinarily unlikely,
>> or is something else implicitly ordering within this critical section?
> 
> Well, there is a wmb() already inside
> nvme_dbbuf_update_and_check_event so any failure would only
> be related to dbbuf_sq_db being wrong when tail is written to q_db.

Right, we have that covered.

> Guessing that might be a basically undetectable situation, maybe some
> temporary higher latency or something?

If the SQE and DB update have been reordered, and the device was able
to fetch the old copy of the SQE, it would be either silent data
corruption or dma access to unmapped memory or unexpected behavior (if
the SQE wasn't initialized). I just think its a *very* rare race that
can happen...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-03-08 17:56     ` Sagi Grimberg
@ 2018-03-08 18:36       ` Jason Gunthorpe
  2018-03-08 18:42         ` Keith Busch
  2018-05-30 23:34       ` Sagi Grimberg
  1 sibling, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2018-03-08 18:36 UTC (permalink / raw)


On Thu, Mar 08, 2018@07:56:37PM +0200, Sagi Grimberg wrote:
> 
> >>>@@ -437,8 +437,14 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> >>>  	if (++tail == nvmeq->q_depth)
> >>>  		tail = 0;
> >>>  	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
> >>>-					      nvmeq->dbbuf_sq_ei))
> >>>+					      nvmeq->dbbuf_sq_ei)) {
> >>>+		/*
> >>>+		 * Make sure that descriptors are written before
> >>>+		 * doorbell record.
> >>>+		 */
> >>>+		wmb();
> >>>  		writel(tail, nvmeq->q_db);
> >>>+	}
> >>>  	nvmeq->sq_tail = tail;
> >>>  }
> >>
> >>If this really is necessary, we'd need this before updating the event
> >>shadow registers too.
> >>
> >>I'd like to understand this a bit more as we haven't done this in eight
> >>years and I can't recall any issues around this section. Have we just
> >>been fortunate that the problem this fixes is extraordinarily unlikely,
> >>or is something else implicitly ordering within this critical section?
> >
> >Well, there is a wmb() already inside
> >nvme_dbbuf_update_and_check_event so any failure would only
> >be related to dbbuf_sq_db being wrong when tail is written to q_db.
> 
> Right, we have that covered.
> 
> >Guessing that might be a basically undetectable situation, maybe some
> >temporary higher latency or something?
> 
> If the SQE and DB update have been reordered,

But that can't happen, the SQE is written before
nvme_dbbuf_update_and_check_event(), and that function does wmb.

The only reordering is related to this:

		wmb();

		old_value = *dbbuf_db;
		*dbbuf_db = value;
[..]
	writel(tail, nvmeq->q_db);

So q_db and dbbuf_db could be swapped at the worst.

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-03-08 18:36       ` Jason Gunthorpe
@ 2018-03-08 18:42         ` Keith Busch
  0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2018-03-08 18:42 UTC (permalink / raw)


On Thu, Mar 08, 2018@11:36:23AM -0700, Jason Gunthorpe wrote:
> On Thu, Mar 08, 2018@07:56:37PM +0200, Sagi Grimberg wrote:
> > If the SQE and DB update have been reordered,
> 
> But that can't happen, the SQE is written before
> nvme_dbbuf_update_and_check_event(), and that function does wmb.

I believe Sagi is talking about the typical controller that doesn't
have shadow db registers (no real nvme device has those), which doesn't
currently go through a path calling wmb().

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-03-08 17:56     ` Sagi Grimberg
  2018-03-08 18:36       ` Jason Gunthorpe
@ 2018-05-30 23:34       ` Sagi Grimberg
  2018-05-31  5:47         ` Christoph Hellwig
  1 sibling, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2018-05-30 23:34 UTC (permalink / raw)


resurrecting this one...

Can anyone say why this is _not_ needed?

On 03/08/2018 07:56 PM, Sagi Grimberg wrote:
> 
>>>> @@ -437,8 +437,14 @@ static void __nvme_submit_cmd(struct nvme_queue 
>>>> *nvmeq,
>>>> ????? if (++tail == nvmeq->q_depth)
>>>> ????????? tail = 0;
>>>> ????? if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
>>>> -????????????????????????? nvmeq->dbbuf_sq_ei))
>>>> +????????????????????????? nvmeq->dbbuf_sq_ei)) {
>>>> +??????? /*
>>>> +???????? * Make sure that descriptors are written before
>>>> +???????? * doorbell record.
>>>> +???????? */
>>>> +??????? wmb();
>>>> ????????? writel(tail, nvmeq->q_db);
>>>> +??? }
>>>> ????? nvmeq->sq_tail = tail;
>>>> ? }
>>>
>>> If this really is necessary, we'd need this before updating the event
>>> shadow registers too.
>>>
>>> I'd like to understand this a bit more as we haven't done this in eight
>>> years and I can't recall any issues around this section. Have we just
>>> been fortunate that the problem this fixes is extraordinarily unlikely,
>>> or is something else implicitly ordering within this critical section?
>>
>> Well, there is a wmb() already inside
>> nvme_dbbuf_update_and_check_event so any failure would only
>> be related to dbbuf_sq_db being wrong when tail is written to q_db.
> 
> Right, we have that covered.
> 
>> Guessing that might be a basically undetectable situation, maybe some
>> temporary higher latency or something?
> 
> If the SQE and DB update have been reordered, and the device was able
> to fetch the old copy of the SQE, it would be either silent data
> corruption or dma access to unmapped memory or unexpected behavior (if
> the SQE wasn't initialized). I just think its a *very* rare race that
> can happen...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-05-30 23:34       ` Sagi Grimberg
@ 2018-05-31  5:47         ` Christoph Hellwig
  2018-05-31  7:53           ` Sagi Grimberg
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2018-05-31  5:47 UTC (permalink / raw)


On Thu, May 31, 2018@02:34:33AM +0300, Sagi Grimberg wrote:
> resurrecting this one...
>
> Can anyone say why this is _not_ needed?

>From Documentation/memory-barriers.txt in linux-next, where this was
discussed once more not too log ago:

  Note that, when using writel(), a prior wmb() is not needed to guarantee
  that the cache coherent memory writes have completed before writing to
  the MMIO region.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-05-31  5:47         ` Christoph Hellwig
@ 2018-05-31  7:53           ` Sagi Grimberg
  2018-05-31 15:20             ` Jason Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2018-05-31  7:53 UTC (permalink / raw)



>> resurrecting this one...
>>
>> Can anyone say why this is _not_ needed?
> 
>  From Documentation/memory-barriers.txt in linux-next, where this was
> discussed once more not too log ago:
> 
>    Note that, when using writel(), a prior wmb() is not needed to guarantee
>    that the cache coherent memory writes have completed before writing to
>    the MMIO region.

Thanks for clarifying, didn't know it was being discussed. I assume a
bunch of other drivers need to clean it up now...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update
  2018-05-31  7:53           ` Sagi Grimberg
@ 2018-05-31 15:20             ` Jason Gunthorpe
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2018-05-31 15:20 UTC (permalink / raw)


On Thu, May 31, 2018@10:53:22AM +0300, Sagi Grimberg wrote:
> 
> >>resurrecting this one...
> >>
> >>Can anyone say why this is _not_ needed?
> >
> > From Documentation/memory-barriers.txt in linux-next, where this was
> >discussed once more not too log ago:
> >
> >   Note that, when using writel(), a prior wmb() is not needed to guarantee
> >   that the cache coherent memory writes have completed before writing to
> >   the MMIO region.
> 
> Thanks for clarifying, didn't know it was being discussed. I assume a
> bunch of other drivers need to clean it up now...

Yes, that was the conclusion. It is tricky though because the above
statement is maybe only true for un-cached memory.

It is less clear what the semantics are for WC memory...

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-05-31 15:20 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-03-07 17:56 [PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update Sagi Grimberg
2018-03-08  7:56 ` Christoph Hellwig
2018-03-08 15:51 ` Keith Busch
2018-03-08 17:20   ` Jason Gunthorpe
2018-03-08 17:37     ` Keith Busch
2018-03-08 17:56     ` Sagi Grimberg
2018-03-08 18:36       ` Jason Gunthorpe
2018-03-08 18:42         ` Keith Busch
2018-05-30 23:34       ` Sagi Grimberg
2018-05-31  5:47         ` Christoph Hellwig
2018-05-31  7:53           ` Sagi Grimberg
2018-05-31 15:20             ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).