From: Daniel Verkamp <daniel@drv.nu>
To: Shimi Gersner <gersner@gmail.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
Keith Busch <keith.busch@intel.com>,
Kevin Wolf <kwolf@redhat.com>, David Sariel <davidsa@openu.ac.il>,
Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification
Date: Thu, 30 Aug 2018 08:45:05 -0700
Message-ID: <CAGFXeQKjnCvY7hRw_1MxZOROhHbsdHM9YPL-QzzWeJZcu0A8Jg@mail.gmail.com>
In-Reply-To: <CALKrh6mPkirNjwkCMLWU4d+eQnnkYDmjPVAWccP4K4tQDxxLaA@mail.gmail.com>
Hi Shimi,
On Sun, Aug 26, 2018 at 2:50 PM Gersner <gersner@gmail.com> wrote:
>
> Hi Daniel,
> Thanks for taking a look. Comments are inline.
>
> Gersner.
>
> On Sun, Jul 15, 2018 at 9:21 AM Daniel Verkamp <daniel@drv.nu> wrote:
>>
>> On Fri, Jun 22, 2018 at 4:22 AM, Shimi Gersner <gersner@gmail.com> wrote:
>> > PCI/e configuration currently does not meet the specification.
>> >
>> > The patch includes various configuration changes to comply with the
>> > specification:
>> > - BAR2 to return zero when read and CMD.IOSE is not set.
>> > - Expose NVMe configuration through IO space (optional).
>> > - PCI Power Management v1.2.
>> > - PCIe Function Level Reset.
>> > - Disable QEMU's default use of PCIe Link Status (DLLLA).
>> > - Add the missing PCIe AOC compliance flag.
>> > - Mask PCIe End-to-End TLP as RO (unspecified by the specification).
>> [...]
>> > n->num_namespaces = 1;
>> > n->num_queues = 64;
>> > - n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
>> > + n->reg_size = pow2ceil(0x2000 + 2 * (n->num_queues + 1) * 4);
>>
>> The existing math looks OK to me (maybe already 4 bytes larger than
>> necessary?). The controller registers should be 0x1000 bytes for the
>> fixed register set, then 2 * 4 bytes for each doorbell pair (+ 1 is
>> for the admin queue, and CAP.DSTRD is set to 0, so there is no extra
>> padding between queues). The result is also rounded up to a power of
>> two, so the result will typically be 8 KB. What is the rationale for
>> this change?
>
> You are correct, this change shouldn't be here. It was made while testing against the
> NVMe compliance suite, which failed at this point.
If the NVMe compliance test requires it, that is probably sufficient
reason to change it - although it would be interesting to hear their
rationale.
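For reference, here is the math spelled out (my own sanity check, not
taken from the patch; pow2ceil() is the helper QEMU already uses here):

    /* 64 I/O queues + 1 admin queue, one SQ/CQ doorbell pair each,
     * 4 bytes per doorbell; CAP.DSTRD = 0 means no extra stride. */
    size = 0x1000 + 2 * (64 + 1) * 4;   /* fixed regs + doorbells = 0x1208 */
    size = pow2ceil(size);              /* rounded up to 0x2000 = 8 KB */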
> The specification states that bits 0 to 13 are RO, which implicitly suggests
> that the minimal size of this BAR should be at least 16K, as this is the standard
> way BAR sizes are determined on PCI (AFAIK). On the other hand, it doesn't
> fit very well with the memory map laid out in section 3.1 (Register Definition)
> of the NVMe 1.3b spec. Any idea?
You're right, the MLBAR section of the NVMe spec does seem to indicate
that up to bit 13 should be reserved/RO. This is also probably a good
enough reason to make the change, as long as this reason is provided.
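For context, the sizing mechanism works like this (a sketch of the
standard PCI probe done by host software; the pci_cfg_read32/write32
helper names are made up for illustration):

    /* Write all 1s to the BAR, read it back; bits that stay 0 encode
     * the size, then restore the original value. */
    uint32_t saved = pci_cfg_read32(dev, PCI_BASE_ADDRESS_0);
    pci_cfg_write32(dev, PCI_BASE_ADDRESS_0, 0xFFFFFFFF);
    uint32_t val = pci_cfg_read32(dev, PCI_BASE_ADDRESS_0);
    pci_cfg_write32(dev, PCI_BASE_ADDRESS_0, saved);

    /* With bits 0-13 RO (and the low type bits masked off), the read
     * returns 0xFFFFC000, so size = ~0xFFFFC000 + 1 = 0x4000 = 16 KB. */
    uint32_t size = ~(val & ~0xFu) + 1;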
>
> Additionally, it does look like it has an extra 4 bytes.
>
>>
>>
>> > n->ns_size = bs_size / (uint64_t)n->num_namespaces;
>> >
>> > n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
>> > @@ -1245,6 +1295,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>> > pci_register_bar(&n->parent_obj, 0,
>> > PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
>> > &n->iomem);
>> > +
>> > + // Expose the NVME memory through Address Space IO (Optional by spec)
>> > + pci_register_bar(&n->parent_obj, 2, PCI_BASE_ADDRESS_SPACE_IO, &n->iomem);
>>
>> This looks like it will register the whole 4KB+ NVMe register set
>> (n->iomem) as an I/O space BAR, but this doesn't match the spec (see
>> NVMe 1.3c section 3.2, "Index/Data Pair registers"). The I/O BAR (if
>> implemented) should just contain two 4-byte registers, Index and Data,
>> that can be used to indirectly access the NVMe register set. (Just
>> for curiosity, do you know of any software that uses this feature? It
>> could be useful for testing the implementation.)
>
> Very nice catch. We indeed totally misinterpreted the specification here.
>
> We use the compliance suite for sanity checks, but it doesn't actually use this
> feature; it just verifies the configuration of the registers.
>
> We will roll back this feature, as it is optional.
This is probably the right move; I don't know of any real hardware
devices that implement it (and therefore I doubt any software will
require it). However, it should not be too difficult to add an
implementation of this feature; if you are interested, take a look at
the ahci_idp_read/ahci_idp_write functions - the NVMe index/data pair
should be very similar.
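A minimal sketch of what that might look like (my rough guess at the
shape, modeled on the AHCI IDP handlers; nvme_idp_read/nvme_idp_write
and the idp_index field are hypothetical, not existing QEMU code):

    /* Offset 0 is the Index register, offset 4 the Data register
     * (NVMe 1.3 section 3.2); Data accesses are forwarded to the
     * regular MMIO handlers at the currently selected offset. */
    static uint64_t nvme_idp_read(void *opaque, hwaddr addr, unsigned size)
    {
        NvmeCtrl *n = opaque;

        if (addr == 0) {
            return n->idp_index;
        }
        return nvme_mmio_read(n, n->idp_index, size);
    }

    static void nvme_idp_write(void *opaque, hwaddr addr, uint64_t val,
                               unsigned size)
    {
        NvmeCtrl *n = opaque;

        if (addr == 0) {
            n->idp_index = val & ~3;    /* keep the index dword-aligned */
        } else {
            nvme_mmio_write(n, n->idp_index, val, size);
        }
    }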
> Question - I'm having a hard time determining this from the specification - as
> this is optional, how do device drivers determine whether it is available? Is it
> simply the CMD.IOSE flag from PCI config space? And if so, what is
> the standard way in QEMU to disable the flag completely?
Based on the wording in NVMe 1.3 section 3.2, it seems like the only
indication of support for Index/Data Pair is whether BAR2 is filled
out. I believe PCI defines unused BARs to be all 0s, so software
should be able to use this to determine whether the feature is
provided by a particular NVMe PCIe device, and removing the
pci_register_bar() call should be enough to accomplish this from the
QEMU side.
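In other words, a driver-side check could be as simple as this
(hypothetical pseudo-driver code, not from any real driver; the
pci_cfg_read32 helper is made up, as above):

    /* An unimplemented BAR is hardwired to zero; an implemented I/O BAR
     * always has the RO I/O space indicator bit (bit 0) set. */
    uint32_t bar2 = pci_cfg_read32(dev, PCI_BASE_ADDRESS_2);
    bool has_idp = (bar2 & PCI_BASE_ADDRESS_SPACE_IO) != 0;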
Thanks,
-- Daniel