* Partitions and end-to-end protection
@ 2015-07-13 18:50 Paul Grabinar
2015-07-13 19:16 ` Keith Busch
0 siblings, 1 reply; 15+ messages in thread
From: Paul Grabinar @ 2015-07-13 18:50 UTC (permalink / raw)
Hi All,
I've had a problem reported against the NVMe driver on the v4.1 kernel.
The user has formatted a namespace with 512 byte sectors, separate
buffer 8 byte meta-data and protection type 1.
That should all be okay, and indeed reads and writes work fine.
However, if a partition table is created in the namespace and then the
machine is rebooted, or the driver is unloaded and loaded, the
partitions don't appear again unless they are manually probed.
What seems to be happening is that the driver starts off by setting the
capacity to zero whilst it validates the meta-data configuration and
then sets the real size after this validation. However, setting the
capacity does not seem to be enough to tell the kernel to re-read the
partition table.
Anyone else seeing this?
Thanks,
Paul.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-13 18:50 Partitions and end-to-end protection Paul Grabinar
@ 2015-07-13 19:16 ` Keith Busch
2015-07-13 19:22 ` Paul Grabinar
2015-07-14 18:06 ` Christoph Hellwig
0 siblings, 2 replies; 15+ messages in thread
From: Keith Busch @ 2015-07-13 19:16 UTC (permalink / raw)
On Mon, 13 Jul 2015, Paul Grabinar wrote:
> What seems to be happening is that the driver starts off by setting the
> capacity to zero whilst it validates the meta-data configuration and
> then sets the real size after this validation. However, setting the
> capacity does not seem to be enough to tell the kernel to re-read the
> partition table.
The driver did it this way because we need a disk allocated before we can
subscribe to block integrity extensions. I think we can fix the partition
issue by making a call to blkdev_reread_part() right after the second
revalidate_disk if the namespace is a meta-data formatted. I'll test it
out and post a patch if it works out.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-13 19:16 ` Keith Busch
@ 2015-07-13 19:22 ` Paul Grabinar
2015-07-14 18:06 ` Christoph Hellwig
1 sibling, 0 replies; 15+ messages in thread
From: Paul Grabinar @ 2015-07-13 19:22 UTC (permalink / raw)
On 13/07/15 20:16, Keith Busch wrote:
> On Mon, 13 Jul 2015, Paul Grabinar wrote:
>> What seems to be happening is that the driver starts off by setting the
>> capacity to zero whilst it validates the meta-data configuration and
>> then sets the real size after this validation. However, setting the
>> capacity does not seem to be enough to tell the kernel to re-read the
>> partition table.
>
> The driver did it this way because we need a disk allocated before we can
> subscribe to block integrity extensions. I think we can fix the partition
> issue by making a call to blkdev_reread_part() right after the second
> revalidate_disk if the namespace is a meta-data formatted. I'll test it
> out and post a patch if it works out.
>
Thanks Keith.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-13 19:16 ` Keith Busch
2015-07-13 19:22 ` Paul Grabinar
@ 2015-07-14 18:06 ` Christoph Hellwig
2015-07-14 18:13 ` Dan Williams
2015-07-14 18:41 ` Martin K. Petersen
1 sibling, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2015-07-14 18:06 UTC (permalink / raw)
On Mon, Jul 13, 2015@07:16:25PM +0000, Keith Busch wrote:
> On Mon, 13 Jul 2015, Paul Grabinar wrote:
> >What seems to be happening is that the driver starts off by setting the
> >capacity to zero whilst it validates the meta-data configuration and
> >then sets the real size after this validation. However, setting the
> >capacity does not seem to be enough to tell the kernel to re-read the
> >partition table.
>
> The driver did it this way because we need a disk allocated before we can
> subscribe to block integrity extensions. I think we can fix the partition
> issue by making a call to blkdev_reread_part() right after the second
> revalidate_disk if the namespace is a meta-data formatted. I'll test it
> out and post a patch if it works out.
I dont think requiring this crazy scheme is something that should spread
in drivers. I'd suggest you work with Martin (on Cc) so that we can
set up the integrity profile at the right time.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 18:06 ` Christoph Hellwig
@ 2015-07-14 18:13 ` Dan Williams
2015-07-14 18:15 ` Christoph Hellwig
2015-07-14 18:41 ` Martin K. Petersen
1 sibling, 1 reply; 15+ messages in thread
From: Dan Williams @ 2015-07-14 18:13 UTC (permalink / raw)
On Tue, Jul 14, 2015@11:06 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Jul 13, 2015@07:16:25PM +0000, Keith Busch wrote:
>> On Mon, 13 Jul 2015, Paul Grabinar wrote:
>> >What seems to be happening is that the driver starts off by setting the
>> >capacity to zero whilst it validates the meta-data configuration and
>> >then sets the real size after this validation. However, setting the
>> >capacity does not seem to be enough to tell the kernel to re-read the
>> >partition table.
>>
>> The driver did it this way because we need a disk allocated before we can
>> subscribe to block integrity extensions. I think we can fix the partition
>> issue by making a call to blkdev_reread_part() right after the second
>> revalidate_disk if the namespace is a meta-data formatted. I'll test it
>> out and post a patch if it works out.
>
> I dont think requiring this crazy scheme is something that should spread
> in drivers. I'd suggest you work with Martin (on Cc) so that we can
> set up the integrity profile at the right time.
>
We got bit by this in the nvdimm enabling as well, it is mostly hidden
by the fact that sd triggers revalidate_disk() by default.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 18:13 ` Dan Williams
@ 2015-07-14 18:15 ` Christoph Hellwig
2015-07-14 18:25 ` Dan Williams
0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2015-07-14 18:15 UTC (permalink / raw)
On Tue, Jul 14, 2015@11:13:18AM -0700, Dan Williams wrote:
> We got bit by this in the nvdimm enabling as well, it is mostly hidden
> by the fact that sd triggers revalidate_disk() by default.
But that's just revalidate_disk, which shouldn't be require but isn't
too mad. blkdev_reread_part and the setup work for it is a totally
different league.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 18:15 ` Christoph Hellwig
@ 2015-07-14 18:25 ` Dan Williams
0 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2015-07-14 18:25 UTC (permalink / raw)
On Tue, Jul 14, 2015@11:15 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jul 14, 2015@11:13:18AM -0700, Dan Williams wrote:
>> We got bit by this in the nvdimm enabling as well, it is mostly hidden
>> by the fact that sd triggers revalidate_disk() by default.
>
> But that's just revalidate_disk, which shouldn't be require but isn't
> too mad. blkdev_reread_part and the setup work for it is a totally
> different league.
Agreed, something is wrong if we need to do bdget() in the driver.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 18:06 ` Christoph Hellwig
2015-07-14 18:13 ` Dan Williams
@ 2015-07-14 18:41 ` Martin K. Petersen
2015-07-14 18:43 ` Christoph Hellwig
1 sibling, 1 reply; 15+ messages in thread
From: Martin K. Petersen @ 2015-07-14 18:41 UTC (permalink / raw)
>>>>> "Christoph" == Christoph Hellwig <hch at infradead.org> writes:
>> The driver did it this way because we need a disk allocated before we
>> can subscribe to block integrity extensions.
The gendisk is a prerequisite by virtue of the integrity profile hanging
off of it. What I don't understand is why you do this capacity
zeroing/revalidate business in the first place?
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 18:41 ` Martin K. Petersen
@ 2015-07-14 18:43 ` Christoph Hellwig
2015-07-14 18:58 ` Martin K. Petersen
0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2015-07-14 18:43 UTC (permalink / raw)
On Tue, Jul 14, 2015@02:41:43PM -0400, Martin K. Petersen wrote:
> >>>>> "Christoph" == Christoph Hellwig <hch at infradead.org> writes:
>
> >> The driver did it this way because we need a disk allocated before we
> >> can subscribe to block integrity extensions.
>
> The gendisk is a prerequisite by virtue of the integrity profile hanging
> off of it.
Having one allocated is. But why do we need to have it added (at which
point it is live)?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 18:43 ` Christoph Hellwig
@ 2015-07-14 18:58 ` Martin K. Petersen
2015-07-14 19:25 ` Keith Busch
0 siblings, 1 reply; 15+ messages in thread
From: Martin K. Petersen @ 2015-07-14 18:58 UTC (permalink / raw)
>>>>> "Christoph" == Christoph Hellwig <hch at infradead.org> writes:
>> The gendisk is a prerequisite by virtue of the integrity profile
>> hanging off of it.
Christoph> Having one allocated is. But why do we need to have it added
Christoph> (at which point it is live)?
Well, we're poking at the queue (for block size) and the bdi. I guess we
could split things up and have another function to twiddle these
parameters separately. During revalidate, perhaps.
But the fundamental question remains: Why is it a prerequisite for the
NVMe integrity profile to be added before the disk is live? Nobody else
requires this (sd, md, dm).
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 18:58 ` Martin K. Petersen
@ 2015-07-14 19:25 ` Keith Busch
2015-07-14 19:38 ` Martin K. Petersen
0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2015-07-14 19:25 UTC (permalink / raw)
On Tue, 14 Jul 2015, Martin K. Petersen wrote:
> But the fundamental question remains: Why is it a prerequisite for the
> NVMe integrity profile to be added before the disk is live? Nobody else
> requires this (sd, md, dm).
NVMe doesn't provide a way to disable transferring metadata
per-io. Capacity zero prevents IO when it's added because we don't have
a metadata buffer to assign CMD.MPTR. The drive will unconditionally
attempt DMA to/from MPTR, so it's potential memory and data corruption
if we let it happen.
An alternative to requiring blk integrity provide the metadata buffer,
we could have the driver allocate a large scratch buffer just so we have
something valid to set MPTR even. I didn't do that initially because we
can't use the metadata for anything at that point, and it seems odd to
have the driver implement this special case just to satisfy add_disk().
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 19:25 ` Keith Busch
@ 2015-07-14 19:38 ` Martin K. Petersen
2015-07-14 19:45 ` Keith Busch
0 siblings, 1 reply; 15+ messages in thread
From: Martin K. Petersen @ 2015-07-14 19:38 UTC (permalink / raw)
>>>>> "Keith" == Keith Busch <keith.busch at intel.com> writes:
Keith> NVMe doesn't provide a way to disable transferring metadata
Keith> per-io.
If you do a READ with PRACT=1 the PI should get stripped at the
controller. That was the intent, anyway.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 19:38 ` Martin K. Petersen
@ 2015-07-14 19:45 ` Keith Busch
2015-07-14 20:07 ` Martin K. Petersen
0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2015-07-14 19:45 UTC (permalink / raw)
On Tue, 14 Jul 2015, Martin K. Petersen wrote:
>>>>>> "Keith" == Keith Busch <keith.busch at intel.com> writes:
>
> Keith> NVMe doesn't provide a way to disable transferring metadata
> Keith> per-io.
>
> If you do a READ with PRACT=1 the PI should get stripped at the
> controller. That was the intent, anyway.
PRACT=1 strips/generates PI only for a subset of valid NVMe meta-data
formats, so we need something for the others.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 19:45 ` Keith Busch
@ 2015-07-14 20:07 ` Martin K. Petersen
2015-07-14 20:14 ` Keith Busch
0 siblings, 1 reply; 15+ messages in thread
From: Martin K. Petersen @ 2015-07-14 20:07 UTC (permalink / raw)
>>>>> "Keith" == Keith Busch <keith.busch at intel.com> writes:
Keith> PRACT=1 strips/generates PI only for a subset of valid NVMe
Keith> meta-data formats, so we need something for the others.
*sigh* I hate that non-PI metadata stuff so much.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 15+ messages in thread
* Partitions and end-to-end protection
2015-07-14 20:07 ` Martin K. Petersen
@ 2015-07-14 20:14 ` Keith Busch
0 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2015-07-14 20:14 UTC (permalink / raw)
On Tue, 14 Jul 2015, Martin K. Petersen wrote:
>>>>>> "Keith" == Keith Busch <keith.busch at intel.com> writes:
> Keith> PRACT=1 strips/generates PI only for a subset of valid NVMe
> Keith> meta-data formats, so we need something for the others.
>
> *sigh* I hate that non-PI metadata stuff so much.
And it gets even worse! PI formatted drives with metadata size > 8 bytes
don't use PRACT either.
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2015-07-14 20:14 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-07-13 18:50 Partitions and end-to-end protection Paul Grabinar
2015-07-13 19:16 ` Keith Busch
2015-07-13 19:22 ` Paul Grabinar
2015-07-14 18:06 ` Christoph Hellwig
2015-07-14 18:13 ` Dan Williams
2015-07-14 18:15 ` Christoph Hellwig
2015-07-14 18:25 ` Dan Williams
2015-07-14 18:41 ` Martin K. Petersen
2015-07-14 18:43 ` Christoph Hellwig
2015-07-14 18:58 ` Martin K. Petersen
2015-07-14 19:25 ` Keith Busch
2015-07-14 19:38 ` Martin K. Petersen
2015-07-14 19:45 ` Keith Busch
2015-07-14 20:07 ` Martin K. Petersen
2015-07-14 20:14 ` Keith Busch
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.