From: Walker, Benjamin <benjamin.walker at intel.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] A issue about maximums of write latency when we access the same block consecutively.
Date: Thu, 03 Aug 2017 23:36:07 +0000 [thread overview]
Message-ID: <1501803363.67512.3.camel@intel.com> (raw)
In-Reply-To: 1f6ed155.e7db.15da2a6fb37.Coremail.cjj25233@163.com
On Wed, 2017-08-02 at 19:13 +0800, 储 wrote:
> Answers:
> (1) "access" = write. We ran separate read and write experiments, but
> only saw the strange behavior in the write experiments.
> The comparison of the experiments is in the attachments.
> (2) We use a NAND based SSD, an Intel P3608.
> (3) The results in the attachments were produced with no delay.
> We tried adding "sleep(1)" between the two operations, but it does not
> seem to help.
>
> At 2017-08-02 07:49:13, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
> > Hi Jiajia,
> >
> > I have a bunch of questions that will help me figure out what you are
> > seeing.
> >
> > 1) When you say "access", do you mean read or write? The behavior of these
> > two operations is quite different.
> > 2) Are you using a NAND based or 3D XPoint based SSD? These again work
> > entirely differently.
> > 3) When you access the same block repeatedly, what's the delay between each
> > access? None?

I was able to verify the behavior you are seeing. I'm afraid I'm not going to be
able to give you an exact answer for your particular device - I don't have
insight into the specifics of how each SSD is implemented. I did brainstorm with
a few of my colleagues, though, so I can give you some idea of what happens
inside the device and why writing to the same block over and over may cause
performance problems.

A good mental model for an SSD is a log of (LBA, data) pairs. When you write to
any LBA, the device just appends to the end of the log and updates an internal
map of the location of that LBA. It does this appending by buffering several
writes in RAM located on the SSD, then sending that batch of data to the NAND
all at once. The other important point is that the SSD is composed of a large
number of physical NAND dies, with some number of entirely parallel NAND
channels that can handle writes. Writing to the log sends the batched data to
each channel more or less round-robin. The final thing to remember is that this
whole process is implemented in hardware, not software, so adding things like
coordination between parallel operations is not as simple as just adding a lock.
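
To make that mental model concrete, here is a minimal sketch in Python. It is
only a toy: the class name ToyLogSSD, the batch size, and the channel count are
all made up for illustration, each batch goes to a single channel, and real
controllers do this in hardware and differ in many details.

```python
class ToyLogSSD:
    """Toy model only: an append-only log of (LBA, data) pairs plus an LBA map."""

    def __init__(self, batch_size=4, num_channels=2):
        self.batch_size = batch_size
        # Each channel is modeled as its own append-only list of (lba, data).
        self.channels = [[] for _ in range(num_channels)]
        self.next_channel = 0
        self.buffer = []    # writes held in on-device RAM, not yet on "NAND"
        self.lba_map = {}   # LBA -> (channel, offset) of the newest flushed copy

    def write(self, lba, data):
        # Every write is appended, even if the same LBA is already buffered.
        self.buffer.append((lba, data))
        if len(self.buffer) == self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Send the whole batch to the next channel, round-robin.
        ch = self.next_channel
        self.next_channel = (ch + 1) % len(self.channels)
        for lba, data in self.buffer:
            self.channels[ch].append((lba, data))
            self.lba_map[lba] = (ch, len(self.channels[ch]) - 1)
        self.buffer = []

    def read(self, lba):
        # The newest buffered copy wins over anything already flushed.
        for buffered_lba, data in reversed(self.buffer):
            if buffered_lba == lba:
                return data
        ch, offset = self.lba_map[lba]
        return self.channels[ch][offset][1]


ssd = ToyLogSSD(batch_size=2, num_channels=2)
for payload in ("a", "b", "c"):
    ssd.write(7, payload)   # same LBA every time
print(ssd.read(7))          # newest copy, still in the buffer
print(ssd.channels)         # older copies of LBA 7 remain in the log
```

Note that overwriting LBA 7 three times leaves three log entries; only the map
(or the buffer) says which one is current.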

When you write the same LBA over and over, a few things could happen inside the
SSD (I don't know how your SSD specifically works).

One possibility is that the SSD could see that the LBA is already buffered in
memory from a previous write and it could just update that memory. However, that
doesn't actually work in general. The data in that memory buffer may be
currently in use as part of a write to actual NAND, or may even be currently
being read. So the only option is to append to the end of the log for each new
write to the LBA. This could probably be coordinated with locking in software,
but remember that the SSD controller is implemented in hardware. If handling
this case makes the design far more complex, it may not be possible given power,
latency, and other budgets.

Another possibility is that the data is appended to the log for each write just
like any other I/O. However, it is still more complicated than the case where
random LBAs are being written to. Once one buffer is filled up, a write to NAND
is issued. When that write completes, it has to update the map for the location
of the LBA. If, while that write is outstanding, another buffer fills up with
new writes to the same LBA, the device has to figure out what to do. If it
submits the second NAND write to a new channel, it's then effectively racing
against the first write. If they complete out of order, the user will end up
with stale data. This case could also probably be handled by better coordination
on the completion side, but again there is a complexity trade-off when
implementing this in actual hardware.
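
The race is easy to see in a small sketch. This is an illustration of the
argument only, not of any real controller: the dictionaries, the "seq" field,
and both function names are hypothetical. The first function updates the map on
every completion, so the last completion wins; the second shows one possible
form of completion-side coordination, which ignores a completion that is older
than what the map already holds.

```python
def apply_completions(completions):
    """Naive map update: whichever NAND write completes last wins."""
    lba_map = {}
    for write in completions:
        lba_map[write["lba"]] = write["data"]
    return lba_map


def apply_completions_ordered(completions):
    """Hypothetical fix: only accept a completion newer than the mapped one."""
    lba_map = {}
    newest_seq = {}
    for write in completions:
        lba = write["lba"]
        if write["seq"] > newest_seq.get(lba, -1):
            newest_seq[lba] = write["seq"]
            lba_map[lba] = write["data"]
    return lba_map


# Two NAND writes for the same LBA, submitted in this order on two channels.
first = {"seq": 1, "lba": 7, "data": "old"}
second = {"seq": 2, "lba": 7, "data": "new"}

print(apply_completions([first, second]))          # in order: map holds "new"
print(apply_completions([second, first]))          # out of order: stale "old"
print(apply_completions_ordered([second, first]))  # stale completion ignored
```

In software the second version costs a few lines. The point above is that the
equivalent tracking in hardware has to fit within the design's complexity,
power, and latency budgets.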

The easiest solution is probably to just detect if a NAND write is active for an
LBA in a given buffer, and then just queue up the next write until the one
before it finishes. That potentially adds a lot of latency, but it simplifies
the hardware design considerably.
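
As a rough sketch of why that queueing shows up as latency, assume every NAND
write takes the same fixed time, all writes are submitted at time zero, and
writes are spread round-robin over the channels. The function name, the 100
microsecond figure, and the four channels are arbitrary values picked for
illustration, not measurements of any device.

```python
def completion_times(lbas, program_time_us, num_channels):
    """Completion time of each NAND write when same-LBA writes are serialized."""
    channel_free_at = [0.0] * num_channels   # when each channel is next idle
    lba_busy_until = {}                      # when the last write to an LBA ends
    finished = []
    for i, lba in enumerate(lbas):
        ch = i % num_channels                # round-robin channel choice
        # A write waits for its channel AND for any earlier write to its LBA.
        start = max(channel_free_at[ch], lba_busy_until.get(lba, 0.0))
        end = start + program_time_us
        channel_free_at[ch] = end
        lba_busy_until[lba] = end
        finished.append(end)
    return finished


# Four different LBAs: all four channels work in parallel.
print(completion_times([1, 2, 3, 4], 100.0, 4))
# The same LBA four times: each write queues behind the previous one.
print(completion_times([7, 7, 7, 7], 100.0, 4))
```

With distinct LBAs every write finishes after one program time. With a single
LBA the last write finishes after four, even though three channels sit idle,
which is the kind of latency tail the queueing approach would produce.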

Ultimately, I have no idea what that SSD is actually doing, but you can see that
it's fairly complex to handle this case. It is certainly more complex than
handling random I/O.

I hope that helps,
Ben