Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes

From: Walker, Benjamin <benjamin.walker at intel.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes
Date: Thu, 07 Jul 2016 21:03:55 +0000	[thread overview]
Message-ID: <1467925432.5999.254.camel@intel.com> (raw)
In-Reply-To: 770C7090-3709-42B1-AD8F-D0F7D20CA49D@xeograph.com

[-- Attachment #1: Type: text/plain, Size: 20095 bytes --]

Glad you found it. Here's an explanation of what was wrong in the NVMe driver:

The queue pairs are implemented as two arrays - one of 64 byte commands and the other of 16 byte completions - and two sets of head/tail indices. The driver submits commands by copying commands into the submission array and then doing an MMIO write to the "doorbell" register with the new value of the driver's submission queue tail index. The SSD completions commands by copying a completion into the completion array and toggling its phase bit, like I explained before. When the completion has been consumed, the NVMe driver writes an updated value for its completion tail index to a doorbell register. This final step is important - it is what signals which entries in the completion array are open for new completions.

MMIO is expensive (writes less so than reads, but still expensive), so we try very hard to avoid doing any. Instead of writing a new completion queue tail index after each completion, we recently added a patch that writes out the tail index only after we're done processing a whole batch of completions. Unfortunately, we didn't consider the case where the entire queue depth worth of completions was processed in one call to check for completions. We didn't originally think this was possible because the queue is 256 deep but we only allow 128 actual commands to be outstanding. It turns out, it is possible under very specific circumstances though, which is what you hit.

Here's a concrete example of one way to hit the problem. Let's say that the current completion queue tail index is 0 and there is an active completion there. When the function to poll for completions checks that and finds the phase bit toggled, it will immediately call the user callback. If that callback submits I/O and then does something computationally expensive, it's possible that by the time the user callback returns the I/O that was submitted is already complete. That means the polling function will find the next completion and repeat. It's possible that this goes on until the completion queue is entirely full (256 entries), at which point no new completions come and we exit the loop. When we exit the loop, we write the location where we expect the next completion to be placed - but that's 0 (circular queue) which is the old value. The device can't detect that we wrote the same value as was previously there, so it acts as if there was no update. This is responsible for the hang you were seeing. At high queue depths, the user callback doesn't even have to be particular expensive to make this happen on low latency devices.

We talked about different ways to solve this and the one we've agreed on is to limit the maximum number of completions in one batch of polling to the queue depth - 1 (255). I'm submitting a patch for this now.

Thanks for your help digging through this.

On Thu, 2016-07-07 at 18:45 +0000, Will Del Genio wrote:
Ben,

I was able to track down the bug I mentioned in the previous email to an issue in my code.  I believe the spdk is working correctly now thanks to change you suggested.

-Will

From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Will Del Genio <wdelgenio(a)xeograph.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Thursday, July 7, 2016 at 10:36 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes

Ben,

I have run more tests and experienced another failure (the same as before).  I wasn’t able to attach gdb to debug so I’ve been trying to replicate and haven’t had luck yet.  I will continue to try to replicate.

-Will

From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Will Del Genio <wdelgenio(a)xeograph.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Thursday, July 7, 2016 at 10:09 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes

Ben,

I manually reverted the change from that commit and it looks like that resolved the issue.  At qd 256 I have not seen any failures.

For reference here is the relevant section of my nvme_qpair.c file now:

                if (++qpair->cq_head == qpair->num_entries) {

                        qpair->cq_head = 0;

                        qpair->phase = !qpair->phase;

                }

                spdk_mmio_write_4(qpair->cq_hdbl, qpair->cq_head);

                if (++num_completions == max_completions) {

                        break;

                }

        }

        return num_completions;

}

Let me know if you need any more help/info/testing.  Thank you very much for your help.

-Will

On 7/6/16, 6:23 PM, "SPDK on behalf of Walker, Benjamin" <spdk-bounces(a)lists.01.org on behalf of benjamin.walker(a)intel.com> wrote:

On Wed, 2016-07-06 at 22:56 +0000, Will Del Genio wrote:

> Ben,

>

> I tried capping my queue depth to 128 and it significantly improved the problem (I’m not entirely

> sure if it was completely eliminated).

Can you try reverting commit ff7e2122c74b09e5961cbcb2622fda9c0087f48f  and see if that fixes the

problem? I believe we see the problem and it will only occur with 256 or greater queue depth. If

reverting that commit solves the problem, we'll submit a patch to fix it the right way and I'll

explain what happened.

>

> To get the vars I had to put the qd back to 256:

>

> (gdb) p *qpair

> $1 = {sq_tdbl = 0x7ffff7ff2008, cq_hdbl = 0x7ffff7ff200c, cmd = 0x7fffdaae0000, cpl =

> 0x7fffdaade000, free_tr = {lh_first = 0x0}, outstanding_tr = {lh_first = 0x7fffdaabf000}, tr =

> 0x7fffdaa5d000,

>   queued_req = {stqh_first = 0x7fffea91a500, stqh_last = 0x7fffea8efec0}, id = 1, num_entries =

> 256, sq_tail = 121, cq_head = 249, phase = 0 '\000', is_enabled = true, sq_in_cmb = false,

>   qprio = 0 '\000', ctrlr = 0x7fffddca6740, tailq = {tqe_next = 0x0, tqe_prev = 0x7fffddca7b70},

> cmd_bus_addr = 35430203392, cpl_bus_addr = 35430195200}

> (gdb) p qpair->cpl[qpair->cq_head -1]

> $2 = {cdw0 = 0, rsvd1 = 0, sqhd = 250, sqid = 1, cid = 97, status = {p = 0, sc = 0, sct = 0, rsvd2

> = 0, m = 0, dnr = 0}}

>  (gdb) p *(qpair->sq_tdbl)

> $4 = 121

>  (gdb) p *(qpair->cq_hdbl)

> $5 = 249

> (gdb)

>

> On 7/6/16, 4:40 PM, "SPDK on behalf of Walker, Benjamin" <spdk-bounces(a)lists.01.org on behalf of b

> enjamin.walker(a)intel.com> wrote:

>

> On Wed, 2016-07-06 at 21:00 +0000, Will Del Genio wrote:

> > Ben,

> >

> > Thanks, you explained that very well.  I’m working with a random 4k read only workload of queue

> > depth 256.

>

> Can you try capping your queue depth at 128? That's the maximum I/O we allow outstanding at the

> hardware. The NVMe driver should be doing software queueing beyond that automatically, but this

> data

> point will help narrow down the problem.

>

> >  I’m using 4 drives and one thread per drive.  If it’s true that there are just no more

> > completions to handle, then I will recheck the code I wrote to keep track of the number of

> > outstanding read requests.

> >

> > Here is the qpair:

> > (gdb) p *qpair

> > $1 = {sq_tdbl = 0x7ffff7ff2008, cq_hdbl = 0x7ffff7ff200c, cmd = 0x7fffdaae0000, cpl =

> > 0x7fffdaade000, free_tr = {lh_first = 0x0}, outstanding_tr = {lh_first = 0x7fffdaad8000}, tr =

> > 0x7fffdaa5d000,

> >   queued_req = {stqh_first = 0x7fffea9a1780, stqh_last = 0x7fffea9cdf40}, id = 1, num_entries =

> > 256, sq_tail = 249, cq_head = 121, phase = 0 '\000', is_enabled = true, sq_in_cmb = false,

>

> Note how sq_tail - cq_head is 128, meaning the driver believes there to be 128 commands

> outstanding.

> The driver's view of the world (commands outstanding) doesn't line up with us not getting any NVMe

> completions - there is definitely a problem here.

>

> >   qprio = 0 '\000', ctrlr = 0x7fffddca6740, tailq = {tqe_next = 0x0, tqe_prev = 0x7fffddca7b70},

> > cmd_bus_addr = 35430203392, cpl_bus_addr = 35430195200}

> > (gdb) p qpair->phase

> > $2 = 0 '\000'

> > (gdb) p qpair->cpl[qpair->cq_head]

> > $3 = {cdw0 = 0, rsvd1 = 0, sqhd = 132, sqid = 1, cid = 112, status = {p = 1, sc = 0, sct = 0,

> > rsvd2 = 0, m = 0, dnr = 0}}

>

> Can you print out the following 3 things:

> - qpair->cpl[qpair->cq_head - 1]

> - qpair->sq_tdbl

> - qpair->cq_hdbl

>

> >

> > -Will

> >

> > On 7/6/16, 3:50 PM, "SPDK on behalf of Walker, Benjamin" <spdk-bounces(a)lists.01.org on behalf of

> > b

> > enjamin.walker(a)intel.com> wrote:

> >

> > On Wed, 2016-07-06 at 20:33 +0000, Will Del Genio wrote:

> > > Andrey,

> > >

> > > I was able to step into the spdk_nvme_qpair_process_completions() function with gdb and found

> > > the

> > > reason is isn’t returning any completions is because this check is failing at line 469: if

> > > (cpl-

> > > > status.p != qpair->phase)

> > >

> > > Relevant gdb info here:

> > > Thread 4 "xg:nvmeIo:9" hit Breakpoint 5, spdk_nvme_qpair_process_completions

> > > (qpair=0x7ffff0a5baa0, max_completions=0) at nvme_qpair.c:463

> > > 463   in nvme_qpair.c

> > > (gdb) p qpair->cpl[qpair->cq_head]

> > > $11 = {cdw0 = 0, rsvd1 = 0, sqhd = 51, sqid = 1, cid = 27, status = {p = 0, sc = 0, sct = 0,

> > > rsvd2

> > > = 0, m = 0, dnr = 0}}

> > > (gdb) p qpair->phase

> > > $12 = 1 '\001'

> > >

> > > What does this mean?  Does this information help at all?

> >

> > The NVMe hardware queue pairs consist of two arrays - one of commands and the other of responses

> > -

> > and a set of head and tail indices. The arrays are circular, so you can loop back around to the

> > beginning. Each entry in the array contains a phase bit, which is either 1 or 0. On the first

> > pass

> > through the array, new entries in the queue are marked by setting their phase to 1. On the next

> > pass

> > through the array, new elements are marked by setting their phase bit to 0, etc. The current

> > iteration's expected phase value is stored in qpair->phase. So the code you are looking at on

> > lines

> > 467-470 is basically saying:

> >

> > 1) Grab the completion entry at the head of the queue

> > 2) Check its phase bit. If it hasn't toggled, there is no new completion, so exit

> >

> > All that means is that there are no completions outstanding for any commands according to the

> > SSD,

> > which doesn't narrow much down. At the point where you are broken in, can you just dump out the

> > whole qpair structure? Something like "p *qpair" should do it. That way I can see if there are

> > any

> > commands actually pending at the device and what state the device is in. My expectation is that

> > there aren't any commands outstanding.

> >

> > Can you also provide some background as to what kind of I/O you're submitting (read or write,

> > size,

> > queue depth, etc.) when this occurs?

> >

> > > -Will

> > >

> > > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>

> > > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>

> > > Date: Wednesday, July 6, 2016 at 2:35 PM

> > > To: Storage Performance Development Kit <spdk(a)lists.01.org>

> > > Subject: Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes

> > >

> > >

> > >

> > >

> > >

> > > On Wed, Jul 6, 2016, 20:56 Will Del Genio <wdelgenio(a)xeograph.com> wrote:

> > >

> > > Andrey,

> > > That sounds exactly like what we are experiencing, however we’re working off the spdk codebase

> > > that was current as of last week and are still experiencing the issue.  Do you know what the

> > > resource allocation fault was and how we might be able to determine if that is still

> > > occurring?

> > >

> > >

> > >

> > > I'll take a look at commit log, both SPDK and mine, and will get back to you.

> > >

> > >

> > >

> > > Regards,

> > >

> > > Andrey

> > >

> > > Ben,

> > > We’re ASSERTing that the result of spdk_nvme_ns_cmd_read() == 0.  If I set our queue depth

> > > high

> > > enough it will fail that assertion, as would be expected.  Whatever other failure we’re

> > > experiencing does not seem to be causing spdk_nvme_ns_cmd_read() to return an error code.

> > >

> > > Also I performed some tests with the spdk perf tool and was not able to replicate our

> > > problem.  It

> > > ran fine at various queue depths and core masks.  When the qd was set too high, it failed

> > > gracefully with an error message.  This is all as expected.

> > >

> > > I’d like to continue down the path of investigating if some resource allocation or something

> > > else

> > > is failing silently for us.  Any specific ideas?

> > >

> > > Thanks!

> > > Will

> > >

> > > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>

> > > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>

> > > Date: Wednesday, July 6, 2016 at 12:01 PM

> > > To: Storage Performance Development Kit <spdk(a)lists.01.org>

> > > Subject: Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes

> > >

> > >

> > >

> > >

> > >

> > > On Wed, Jul 6, 2016 at 6:35 PM, Walker, Benjamin <benjamin.walker(a)intel.com> wrote:

> > >

> > > Hi Will,

> > >

> > >

> > >

> > > Since I can't see the code for your application I'd like to try and reproduce the problem with

> > > code that I have some visibility into. Are you able to reproduce the problem using our perf

> > > tool

> > > (examples/nvme/perf)? If you aren't, this is likely a problem with your test application and

> > > not

> > > SPDK.

> > >

> > >

> > >

> > >

> > > I had been witnessing a similar issue with an earlier SPDK release, back around Feb, where the

> > > submit call was failing due to the resource allocation fault and neither returning an error

> > > nor

> > > invoking the callback, but my issue has been fixed in the recent release (I can't recall the

> > > actual commit, but there definitely was one dealing exactly with the cause).

> > >

> > >

> > >

> > >

> > > Based on the symptoms, my best guess is that your memory pool ran out of request objects. The

> > > first thing to check is whether spdk_nvme_ns_cmd_read failed. If it fails, it won't call the

> > > callback. You can check for failure by looking at the return value - see the documentation

> > > here.

> > > Your application allocates this memory pool up front - all of our examples allocate 8k

> > > requests

> > > (see line 1097 in examples/nvme/perf/perf.c) You need to allocate a large enough pool to

> > > handle

> > > the maximum number of outstanding requests you plan to have. We recently added a "hello_world"

> > > style example for the NVMe driver at https://github.com/spdk/spdk/tree/master/examples/nvme/he

> > > ll

> > > o_

> > > world with tons of comments. One of the comments explains this memory pool in detail.

> > >

> > >

> > >

> > > That memory pool allocation is a bit of a wart on our otherwise clean API. We're looking at

> > > different strategies to clean that up. Let me know what the result of the debugging is and

> > > I'll

> > > shoot you some more ideas to try if necessary.

> > >

> > >

> > >

> > >

> > > Are there any plans regarding the global request pool rework?

> > >

> > >

> > >

> > > Regards,

> > >

> > > Andrey

> > >

> > >

> > >

> > >

> > >

> > > Thanks,

> > >

> > > Ben

> > >

> > >

> > >

> > > On Tue, 2016-07-05 at 21:03 +0000, Will Del Genio wrote:

> > >

> > > Hello,

> > > We have written a test application that is utilizing the spdk library to benchmark a set of 3

> > > Intel P3700 drives and a single 750 drive (concurrently).  We’ve done some testing using fio

> > > and

> > > the kernel nvme drivers and have had no problem achieving the claimed IOPs (4k random read) of

> > > all

> > > drives on our system.

> > >

> > > What we have found during our testing is that spdk will sometimes start to silently fail to

> > > call

> > > the callback passed to spdk_nvme_ns_cmd_read in the following situations:

> > > 1.       Testing a single drive and passing in 0 for max_completions to

> > > spdk_nvme_qpair_process_completions().  We haven’t seen any issues with single drive testing

> > > when

> > > max_completions was > 0.

> > > 2.       Testing all four drives at once will result in one drive failing to receive

> > > callbacks,

> > > seemingly regardless of what number we pass for max_completions (1 through 128).

> > >

> > > Here are other observations we’ve made

> > > -When the callbacks fail to be called for a drive, they fail to be called for the remaining

> > > duration of the test.

> > > -The drive that ‘fails’ when testing 4 drives concurrently varies from test to test.

> > > -‘failure’ of a drive seems to be correlated with the number of outstanding read operations,

> > > though it is not a strict correlation.

> > >

> > > Our system is a dual socket  E5-2630 v3.  One drive is on a PCI slot for CPU 0 and the other 3

> > > are

> > > on PCI slots on CPU 1.  The master/slave threads are on the the same cpu socket as the nvme

> > > device

> > > they are talking to.

> > >

> > > We’d like to know what is causing this issue and what we can do to help investigate the

> > > problem.

> > > What other information can we provide?  Is there some part of the spdk code that we can look

> > > at

> > > to

> > > help determine the cause?

> > >

> > > Thanks,

> > > Will

> > >

> > >

> > > _______________________________________________

> > > SPDK mailing list

> > > SPDK(a)lists.01.org

> > > https://lists.01.org/mailman/listinfo/spdk

> > >

> > >

> > > _______________________________________________

> > > SPDK mailing list

> > > SPDK(a)lists.01.org

> > > https://lists.01.org/mailman/listinfo/spdk

> > >

> > >

> > >

> > >

> > >

> > >

> > >

> > >

> > >

> > > _______________________________________________

> > > SPDK mailing list

> > > SPDK(a)lists.01.org

> > > https://lists.01.org/mailman/listinfo/spdk

> > >

> > > --

> > >

> > > Regards,

> > > Andrey

> > >

> > >

> > >

> > >

> > > _______________________________________________

> > > SPDK mailing list

> > > SPDK(a)lists.01.org

> > > https://lists.01.org/mailman/listinfo/spdk

> > _______________________________________________

> > SPDK mailing list

> > SPDK(a)lists.01.org

> > https://lists.01.org/mailman/listinfo/spdk

> >

> >

> > _______________________________________________

> > SPDK mailing list

> > SPDK(a)lists.01.org

> > https://lists.01.org/mailman/listinfo/spdk

> _______________________________________________

> SPDK mailing list

> SPDK(a)lists.01.org

> https://lists.01.org/mailman/listinfo/spdk

>

>

> _______________________________________________

> SPDK mailing list

> SPDK(a)lists.01.org

> https://lists.01.org/mailman/listinfo/spdk

_______________________________________________

SPDK mailing list

SPDK(a)lists.01.org

https://lists.01.org/mailman/listinfo/spdk

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org<mailto:SPDK(a)lists.01.org>
https://lists.01.org/mailman/listinfo/spdk

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 43592 bytes --]