* libata .sg_tablesize: why always dividing by 2 ?
@ 2008-02-26 0:02 Mark Lord
2008-02-26 0:15 ` Jeff Garzik
2008-02-26 0:22 ` Jeff Garzik
0 siblings, 2 replies; 27+ messages in thread
From: Mark Lord @ 2008-02-26 0:02 UTC (permalink / raw)
To: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list <linux-ide>
Jeff,
We had a discussion here today about IOMMUs,
and they *never* split sg list entries -- they only ever *merge*.
And this happens only after the block layer has
already done merging while respecting q->seg_boundary_mask.
So worst case, the IOMMU may merge everything, and then in
libata we unmerge them again. But the end result can never
exceed the max_sg_entries limit enforced by the block layer.
So.. why are we still specifying .sg_tablesize as half of
what the LLD can really handle?
This can cost a lot of memory, as using NCQ effectively multiplies
everything by 32..
Based on this information, I should be able to do this in sata_mv,
for example:
- .sg_tablesize = MV_MAX_SG_CT / 2,
+ .sg_tablesize = MV_MAX_SG_CT,
???
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 0:02 libata .sg_tablesize: why always dividing by 2 ? Mark Lord
@ 2008-02-26 0:15 ` Jeff Garzik
2008-02-26 0:27 ` Mark Lord
2008-02-26 0:54 ` Benjamin Herrenschmidt
2008-02-26 0:22 ` Jeff Garzik
1 sibling, 2 replies; 27+ messages in thread
From: Jeff Garzik @ 2008-02-26 0:15 UTC (permalink / raw)
To: Mark Lord
Cc: Tejun Heo, Alan Cox, James Bottomley, IDE/ATA development list,
Benjamin Herrenschmidt
Mark Lord wrote:
> Jeff,
>
> We had a discussion here today about IOMMUs,
> and they *never* split sg list entries -- they only ever *merge*.
>
> And this happens only after the block layer has
> already done merging while respecting q->seg_boundary_mask.
>
> So worst case, the IOMMU may merge everything, and then in
> libata we unmerge them again. But the end result can never
> exceed the max_sg_entries limit enforced by the block layer.
<shrug> Early experience said otherwise. The split in foo_fill_sg()
and resulting sg_tablesize reduction were both needed to successfully
transfer data, when Ben H originally did the work.
If Ben H and everyone on the arch side agrees with the above analysis, I
would be quite happy to remove all those "/ 2".
> This can cost a lot of memory, as using NCQ effectively multiplies
> everything by 32..
I recommend dialing down the hyperbole a bit :)
"a lot" in this case is... maybe another page or two per table, if
that. Compared with everything else in the system going on, with
16-byte S/G entries, S/G table size is really the least of our worries.
If you were truly concerned about memory usage in sata_mv, a more
effective route is simply reducing MV_MAX_SG_CT to a number closer to
the average s/g table size -- which is far, far lower than 256
(currently MV_MAX_SG_CT), or even 128 (MV_MAX_SG_CT/2).
Or moving to a scheme where you allocate (for example) S/G tables with
32 entries... then allocate on the fly for the rare case where the S/G
table must be larger.
Memory usage is simply not an effective argument in this case. Safety
and correctness are far more paramount.
Jeff
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 0:15 ` Jeff Garzik
@ 2008-02-26 0:27 ` Mark Lord
2008-02-26 0:54 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 27+ messages in thread
From: Mark Lord @ 2008-02-26 0:27 UTC (permalink / raw)
To: Jeff Garzik
Cc: Tejun Heo, Alan Cox, James Bottomley, IDE/ATA development list,
Benjamin Herrenschmidt
Jeff Garzik wrote:
> Mark Lord wrote:
>> Jeff,
>>
>> We had a discussion here today about IOMMUs,
>> and they *never* split sg list entries -- they only ever *merge*.
>>
>> And this happens only after the block layer has
>> already done merging while respecting q->seg_boundary_mask.
>>
>> So worst case, the IOMMU may merge everything, and then in
>> libata we unmerge them again. But the end result can never
>> exceed the max_sg_entries limit enforced by the block layer.
>
> <shrug> Early experience said otherwise. The split in foo_fill_sg()
> and resulting sg_tablesize reduction were both needed to successfully
> transfer data, when Ben H originally did the work.
>
> If Ben H and everyone on the arch side agrees with the above analysis, I
> would be quite happy to remove all those "/ 2".
>
>
>> This can cost a lot of memory, as using NCQ effectively multiplies
>> everything by 32..
>
> I recommend dialing down the hyperbole a bit :)
>
> "a lot" in this case is... maybe another page or two per table, if
> that. Compared with everything else in the system going on, with
> 16-byte S/G entries, S/G table size is really the least of our worries.
..
Well, today each sg table is about a page in size,
and sata_mv has 32 of them per port.
So cutting them in half would save 16 pages per port,
or 64 pages per host controller.
That's a lot for a small system, but maybe not for my 4GB test boxes.
> If you were truly concerned about memory usage in sata_mv, a more
> effective route is simply reducing MV_MAX_SG_CT to a number closer to
> the average s/g table size -- which is far, far lower than 256
> (currently MV_MAX_SG_CT), or even 128 (MV_MAX_SG_CT/2).
>
> Or moving to a scheme where you allocate (for example) S/G tables with
> 32 entries... then allocate on the fly for the rare case where the S/G
> table must be larger...
..
Oh, absolutely.. that's on my "clean up" list once the rest of
the driver becomes stable and mostly done. But for now, "safety and correctness is far more paramount" in sata_mv. :)
Cheers
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 0:15 ` Jeff Garzik
2008-02-26 0:27 ` Mark Lord
@ 2008-02-26 0:54 ` Benjamin Herrenschmidt
2008-02-26 1:37 ` Mark Lord
1 sibling, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 0:54 UTC (permalink / raw)
To: Jeff Garzik
Cc: Mark Lord, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list
On Mon, 2008-02-25 at 19:15 -0500, Jeff Garzik wrote:
> Mark Lord wrote:
> > Jeff,
> >
> > We had a discussion here today about IOMMUs,
> > and they *never* split sg list entries -- they only ever *merge*.
> >
> > And this happens only after the block layer has
> > already done merging while respecting q->seg_boundary_mask.
> >
> > So worst case, the IOMMU may merge everything, and then in
> > libata we unmerge them again. But the end result can never
> > exceed the max_sg_entries limit enforced by the block layer.
>
> <shrug> Early experience said otherwise. The split in foo_fill_sg()
> and resulting sg_tablesize reduction were both needed to successfully
> transfer data, when Ben H originally did the work.
>
> If Ben H and everyone on the arch side agrees with the above analysis, I
> would be quite happy to remove all those "/ 2".
The split wasn't done by the iommu. The split was done by the IDE code
itself to handle the stupid 64k crossing thingy. If it's done
differently now, it might be possible to remove it, I haven't looked.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 0:54 ` Benjamin Herrenschmidt
@ 2008-02-26 1:37 ` Mark Lord
2008-02-26 1:43 ` Mark Lord
2008-02-26 2:52 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 27+ messages in thread
From: Mark Lord @ 2008-02-26 1:37 UTC (permalink / raw)
To: benh
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list
Benjamin Herrenschmidt wrote:
> On Mon, 2008-02-25 at 19:15 -0500, Jeff Garzik wrote:
>> Mark Lord wrote:
>>> Jeff,
>>>
>>> We had a discussion here today about IOMMUs,
>>> and they *never* split sg list entries -- they only ever *merge*.
>>>
>>> And this happens only after the block layer has
>>> already done merging while respecting q->seg_boundary_mask.
>>>
>>> So worst case, the IOMMU may merge everything, and then in
>>> libata we unmerge them again. But the end result can never
>>> exceed the max_sg_entries limit enforced by the block layer.
>> <shrug> Early experience said otherwise. The split in foo_fill_sg()
>> and resulting sg_tablesize reduction were both needed to successfully
>> transfer data, when Ben H originally did the work.
>>
>> If Ben H and everyone on the arch side agrees with the above analysis, I
>> would be quite happy to remove all those "/ 2".
>
> The split wasn't done by the iommu. The split was done by the IDE code
> itself to handle the stupid 64k crossing thingy. If it's done
> differently now, it might be possible to remove it, I haven't looked.
..
The block layer uses seg_boundary_mask to ensure that we never have
to split them again in the LLD.
A very long time ago, when I wrote the IDE DMA code, this was not the case.
Cheers
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 1:37 ` Mark Lord
@ 2008-02-26 1:43 ` Mark Lord
2008-02-26 2:54 ` Benjamin Herrenschmidt
2008-02-26 2:52 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 27+ messages in thread
From: Mark Lord @ 2008-02-26 1:43 UTC (permalink / raw)
To: benh
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list
Mark Lord wrote:
> Benjamin Herrenschmidt wrote:
..
>> The split wasn't done by the iommu. The split was done by the IDE code
>> itself to handle the stupid 64k crossing thingy. If it's done
>> differently now, it might be possible to remove it, I haven't looked.
> ..
>
> The block layer uses seg_boundary_mask to ensure that we never have
> to split them again in the LLD.
..
James B. suggests that we stick a WARN_ON() into libata to let us
know if that precondition is violated. Sounds like an easy thing to do
for a couple of -rc cycles someday.
Cheers
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 1:43 ` Mark Lord
@ 2008-02-26 2:54 ` Benjamin Herrenschmidt
2008-02-26 4:38 ` Mark Lord
0 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 2:54 UTC (permalink / raw)
To: Mark Lord
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list
> James B. suggests that we stick a WARN_ON() into libata to let us
> know if that precondition is violated. Sounds like an easy thing to do
> for a couple of -rc cycles someday.
If the block layer gives us a 32k block aligned on a 32k boundary
(aligned), we have no guarantee that the iommu will not turn that into
something unaligned crossing a 32k (and thus possibly a 64k) boundary.
On powerpc, the iommu operates on 4k pages and only provides that level
of alignment to dma_map_sg() (dma_alloc_coherent are naturally aligned,
but not map_sg, that would put way too much pressure on the allocator on
machines that have pinhole-sized iommu space).
Cheers,
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 2:54 ` Benjamin Herrenschmidt
@ 2008-02-26 4:38 ` Mark Lord
2008-02-26 5:30 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 27+ messages in thread
From: Mark Lord @ 2008-02-26 4:38 UTC (permalink / raw)
To: benh
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list
Benjamin Herrenschmidt wrote:
>> James B. suggests that we stick a WARN_ON() into libata to let us
>> know if that precondition is violated. Sounds like an easy thing to do
>> for a couple of -rc cycles someday.
>
> If the block layer gives us a 32k block aligned on a 32k boundary
> (aligned), we have no guarantee that the iommu will not turn that into
> something unaligned crossing a 32k (and thus possibly a 64k) boundary.
..
Certainly, but never any worse than what the block layer gave originally.
The important note being: IOMMU only ever *merges*, it never *splits*.
Which means that, by the time we split up any mis-merges again for 64K crossings,
we can never have more SG segments than what the block layer originally
fed to the IOMMU stuff.
Or so the IOMMU and SCSI experts here at LSF'08 have assured me,
even after my own skeptical questioning.
Cheers
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 4:38 ` Mark Lord
@ 2008-02-26 5:30 ` Benjamin Herrenschmidt
2008-02-26 5:43 ` Mark Lord
0 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 5:30 UTC (permalink / raw)
To: Mark Lord
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list
On Mon, 2008-02-25 at 23:38 -0500, Mark Lord wrote:
> Benjamin Herrenschmidt wrote:
> >> James B. suggests that we stick a WARN_ON() into libata to let us
> >> know if that precondition is violated. Sounds like an easy thing to do
> >> for a couple of -rc cycles someday.
> >
> > If the block layer gives us a 32k block aligned on a 32k boundary
> > (aligned), we have no guarantee that the iommu will not turn that into
> > something unaligned crossing a 32k (and thus possibly a 64k) boundary.
> ..
>
> Certainly, but never any worse than what the block layer gave originally.
>
> The important note being: IOMMU only ever *merges*, it never *splits*.
Yes, but it will also change the address and doesn't guarantee the
alignment.
> Which means that, by the time we split up any mis-merges again for 64K crossings,
> we can never have more SG segments than what the block layer originally
> fed to the IOMMU stuff.
>
> Or so the IOMMU and SCSI experts here at LSF'08 have assured me,
> even after my own skeptical questioning.
I suppose so. I don't remember all of the details, but iirc, it has to
do with crossing 64K boundaries. Some controllers can't handle it.
It's not only the _size_ of the segments, it's their alignment.
The iommu will not keep alignement beyond the page size (and even
then... on powerpc with a 64k base page size, you may still end up with
a 4k aligned result, but let's not go there now).
So that means that even if your block layer gives you nice aligned less
than 64k segments that don't cross 64k boundaries, and even if your
iommu isn't doing any merging at all, it may still give you back things
that do not respect that 64k alignment boundary, might cross them, and
thus might need to be split.
Now, it would make sense (if we don't have it already) to have a flag
provided by the host controller that tells us whether it suffers from
that limitation, and if not, we get avoid the whole thing.
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 5:30 ` Benjamin Herrenschmidt
@ 2008-02-26 5:43 ` Mark Lord
2008-02-26 5:47 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 27+ messages in thread
From: Mark Lord @ 2008-02-26 5:43 UTC (permalink / raw)
To: benh
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list
Benjamin Herrenschmidt wrote:
> On Mon, 2008-02-25 at 23:38 -0500, Mark Lord wrote:
>> Benjamin Herrenschmidt wrote:
>>>> James B. suggests that we stick a WARN_ON() into libata to let us
>>>> know if that precondition is violated. Sounds like an easy thing to do
>>>> for a couple of -rc cycles someday.
>>> If the block layer gives us a 32k block aligned on a 32k boundary
>>> (aligned), we have no guarantee that the iommu will not turn that into
>>> something unaligned crossing a 32k (and thus possibly a 64k) boundary.
>> ..
>>
>> Certainly, but never any worse than what the block layer gave originally.
>>
>> The important note being: IOMMU only ever *merges*, it never *splits*.
>
> Yes, but it will also change the address and doesn't guarantee the
> alignment.
>
>> Which means that, by the time we split up any mis-merges again for 64K crossings,
>> we can never have more SG segments than what the block layer originally
>> fed to the IOMMU stuff.
>>
>> Or so the IOMMU and SCSI experts here at LSF'08 have assured me,
>> even after my own skeptical questioning.
>
> I suppose so. I don't remember all of the details, but iirc, it has to
> do with crossing 64K boundaries. Some controllers can't handle it.
>
> It's not only the _size_ of the segments, it's their alignment.
>
> The iommu will not keep alignement beyond the page size (and even
> then... on powerpc with a 64k base page size, you may still end up with
> a 4k aligned result, but let's not go there now).
..
That's just not possible, unless the IOMMU *splits* segments.
And the IOMMU experts here say that it never does that.
-ml
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 5:43 ` Mark Lord
@ 2008-02-26 5:47 ` Benjamin Herrenschmidt
2008-02-26 16:09 ` James Bottomley
2008-02-26 16:25 ` Mark Lord
0 siblings, 2 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 5:47 UTC (permalink / raw)
To: Mark Lord
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list
On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:
> > I suppose so. I don't remember all of the details, but iirc, it has to
> > do with crossing 64K boundaries. Some controllers can't handle it.
> >
> > It's not only the _size_ of the segments, it's their alignment.
> >
> > The iommu will not keep alignement beyond the page size (and even
> > then... on powerpc with a 64k base page size, you may still end up with
> > a 4k aligned result, but let's not go there now).
> ..
>
> That's just not possible, unless the IOMMU *splits* segments.
> And the IOMMU experts here say that it never does that.
It is totally possible, and I know as wrote part of the powerpc iommu
code :-)
The iommu code makes no guarantee vs. preserving the alignment of a
segment, at least not below PAGE_SIZE.
Thus if you pass to dma_map_sg() a 64K aligned 64K segment, you may well
get back a 4K aligned 64K segment.
Enforcing natural alignment in the iommu code only happens for
dma_alloc_coherent (it uses order-N allocations anyway), it doesn't
happen for map_sg. If we were to do that, we would make it very likely
for iommu allocations to fail on machine with small DMA windows.
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 5:47 ` Benjamin Herrenschmidt
@ 2008-02-26 16:09 ` James Bottomley
2008-02-26 21:43 ` Benjamin Herrenschmidt
2008-02-26 16:25 ` Mark Lord
1 sibling, 1 reply; 27+ messages in thread
From: James Bottomley @ 2008-02-26 16:09 UTC (permalink / raw)
To: benh; +Cc: Mark Lord, Jeff Garzik, Tejun Heo, Alan Cox,
IDE/ATA development list
On Tue, 2008-02-26 at 16:47 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:
> > > I suppose so. I don't remember all of the details, but iirc, it has to
> > > do with crossing 64K boundaries. Some controllers can't handle it.
> > >
> > > It's not only the _size_ of the segments, it's their alignment.
> > >
> > > The iommu will not keep alignement beyond the page size (and even
> > > then... on powerpc with a 64k base page size, you may still end up with
> > > a 4k aligned result, but let's not go there now).
> > ..
> >
> > That's just not possible, unless the IOMMU *splits* segments.
> > And the IOMMU experts here say that it never does that.
>
> It is totally possible, and I know as wrote part of the powerpc iommu
> code :-)
>
> The iommu code makes no guarantee vs. preserving the alignment of a
> segment, at least not below PAGE_SIZE.
It's supposed to, precisely to forestall this case. The alignment
guarantees of the parisc iommu code are sg length aligned up to a fixed
maximum (128k on 32 bit and 256k on 64 bit because of the way the
allocator works). However, tomo's code is fixing this, so it shouldn't
be a problem much longer.
James
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 16:09 ` James Bottomley
@ 2008-02-26 21:43 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 21:43 UTC (permalink / raw)
To: James Bottomley
Cc: Mark Lord, Jeff Garzik, Tejun Heo, Alan Cox,
IDE/ATA development list
On Tue, 2008-02-26 at 08:09 -0800, James Bottomley wrote:
> > The iommu code makes no guarantee vs. preserving the alignment of a
> > segment, at least not below PAGE_SIZE.
>
> It's supposed to, precisely to forestall this case. The alignment
> guarantees of the parisc iommu code are sg length aligned up to a fixed
> maximum (128k on 32 bit and 256k on 64 bit because of the way the
> allocator works). However, tomo's code is fixing this, so it shouldn't
> be a problem much longer.
It will be. If we start enforcing that alignment, pSeries machines will
continuously run out of iommu space when the BIO starts handing out
large chunks.
Not acceptable for us. I suspect x86_64 will have a similar problem as,
afaik, the IOMMU space is very scarce there.
Can you guys stop designing the whole IO layer around PA-RISC ?
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 5:47 ` Benjamin Herrenschmidt
2008-02-26 16:09 ` James Bottomley
@ 2008-02-26 16:25 ` Mark Lord
2008-02-26 16:51 ` Mark Lord
2008-02-26 21:43 ` Benjamin Herrenschmidt
1 sibling, 2 replies; 27+ messages in thread
From: Mark Lord @ 2008-02-26 16:25 UTC (permalink / raw)
To: benh
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list, FUJITA Tomonori
Benjamin Herrenschmidt wrote:
> On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:
>>> I suppose so. I don't remember all of the details, but iirc, it has to
>>> do with crossing 64K boundaries. Some controllers can't handle it.
>>>
>>> It's not only the _size_ of the segments, it's their alignment.
>>>
>>> The iommu will not keep alignement beyond the page size (and even
>>> then... on powerpc with a 64k base page size, you may still end up with
>>> a 4k aligned result, but let's not go there now).
>> ..
>>
>> That's just not possible, unless the IOMMU *splits* segments.
>> And the IOMMU experts here say that it never does that.
>
> It is totally possible, and I know as wrote part of the powerpc iommu
> code :-)
>
> The iommu code makes no guarantee vs. preserving the alignment of a
> segment, at least not below PAGE_SIZE.
>
> Thus if you pass to dma_map_sg() a 64K aligned 64K segment, you may well
> get back a 4K aligned 64K segment.
>
> Enforcing natural alignment in the iommu code only happens for
> dma_alloc_coherent (it uses order-N allocations anyway), it doesn't
> happen for map_sg. If we were to do that, we would make it very likely
> for iommu allocations to fail on machine with small DMA windows.
>
> Ben.
..
That's interesting. Can you point us to the exact file::lines where
this happens? It would be good to ensure that this gets fixed.
I'm copying Fujita Tomonori on this thread now -- he's the dude who's
trying to sort out the IOMMU mess.
Thanks
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 16:25 ` Mark Lord
@ 2008-02-26 16:51 ` Mark Lord
2008-02-26 21:50 ` Benjamin Herrenschmidt
2008-02-26 21:43 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 27+ messages in thread
From: Mark Lord @ 2008-02-26 16:51 UTC (permalink / raw)
To: benh
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list, FUJITA Tomonori
Mark Lord wrote:
> Benjamin Herrenschmidt wrote:
>> On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:
>>>> I suppose so. I don't remember all of the details, but iirc, it has to
>>>> do with crossing 64K boundaries. Some controllers can't handle it.
>>>>
>>>> It's not only the _size_ of the segments, it's their alignment.
>>>>
>>>> The iommu will not keep alignement beyond the page size (and even
>>>> then... on powerpc with a 64k base page size, you may still end up with
>>>> a 4k aligned result, but let's not go there now).
>>> ..
>>>
>>> That's just not possible, unless the IOMMU *splits* segments.
>>> And the IOMMU experts here say that it never does that.
>>
>> It is totally possible, and I know as wrote part of the powerpc iommu
>> code :-)
>>
>> The iommu code makes no guarantee vs. preserving the alignment of a
>> segment, at least not below PAGE_SIZE.
>>
>> Thus if you pass to dma_map_sg() a 64K aligned 64K segment, you may well
>> get back a 4K aligned 64K segment.
>>
>> Enforcing natural alignment in the iommu code only happens for
>> dma_alloc_coherent (it uses order-N allocations anyway), it doesn't
>> happen for map_sg. If we were to do that, we would make it very likely
>> for iommu allocations to fail on machine with small DMA windows.
>>
>> Ben.
> ..
>
> That's interesting. Can you point us to the exact file::lines where
> this happens? It would be good to ensure that this gets fixed.
>
> I'm copying Fujita Tomonori on this thread now -- he's the dude who's
> trying to sort out the IOMMU mess.
..
Mmm.. looks like ppc is already fixed in mainline,
commit 740c3ce66700640a6e6136ff679b067e92125794
Ben?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 16:51 ` Mark Lord
@ 2008-02-26 21:50 ` Benjamin Herrenschmidt
2008-02-26 21:56 ` James Bottomley
0 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 21:50 UTC (permalink / raw)
To: Mark Lord
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list, FUJITA Tomonori
On Tue, 2008-02-26 at 11:51 -0500, Mark Lord wrote:
> > That's interesting. Can you point us to the exact file::lines where
> > this happens? It would be good to ensure that this gets fixed.
> >
> > I'm copying Fujita Tomonori on this thread now -- he's the dude
> who's
> > trying to sort out the IOMMU mess.
> ..
>
> Mmm.. looks like ppc is already fixed in mainline,
> commit 740c3ce66700640a6e6136ff679b067e92125794
No, I only fixed alignent up to PAGE_SIZE, not above.
Our iommu can use smaller than PAGE_SIZE allocations when PAGE_SIZE is
64k (it can use 4k), and that helps a lot with networking to avoid
filling up pSeries small iommu's too fast. Unfortunately, that meant we
used to not even get PAGE_SIZE alignment for some cases. The workaround
fixes that but does not provide natural size alignment.
If we were to provide size based alignment of sg requests, fragmentation
would become so bad we would basically end up failing a large amount of
iommu allocations on servers.
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 21:50 ` Benjamin Herrenschmidt
@ 2008-02-26 21:56 ` James Bottomley
2008-02-26 22:30 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 27+ messages in thread
From: James Bottomley @ 2008-02-26 21:56 UTC (permalink / raw)
To: benh
Cc: Mark Lord, Jeff Garzik, Tejun Heo, Alan Cox,
IDE/ATA development list, FUJITA Tomonori
On Wed, 2008-02-27 at 08:50 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2008-02-26 at 11:51 -0500, Mark Lord wrote:
> > > That's interesting. Can you point us to the exact file::lines where
> > > this happens? It would be good to ensure that this gets fixed.
> > >
> > > I'm copying Fujita Tomonori on this thread now -- he's the dude
> > who's
> > > trying to sort out the IOMMU mess.
> > ..
> >
> > Mmm.. looks like ppc is already fixed in mainline,
> > commit 740c3ce66700640a6e6136ff679b067e92125794
>
> No, I only fixed alignent up to PAGE_SIZE, not above.
>
> Our iommu can use smaller than PAGE_SIZE allocations when PAGE_SIZE is
> 64k (it can use 4k), and that helps a lot with networking to avoid
> filling up pSeries small iommu's too fast. Unfortunately, that meant we
> used to not even get PAGE_SIZE alignment for some cases. The workaround
> fixes that but does not provide natural size alignment.
>
> If we were to provide size based alignment of sg requests, fragmentation
> would become so bad we would basically end up failing a large amount of
> iommu allocations on servers.
Ben, he just quoted the wrong patch ... that was the segment length
limit patch. The boundary alignment patch is this one:
fb3475e9b6bfa666107512fbd6006c26014f04b8
And it does alter the ppc iommu to respect the dma_boundary, so all of
the mechanics for removing the ppc special casing in libata/ide is
upstream.
James
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 21:56 ` James Bottomley
@ 2008-02-26 22:30 ` Benjamin Herrenschmidt
2008-02-26 23:16 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 22:30 UTC (permalink / raw)
To: James Bottomley
Cc: Mark Lord, Jeff Garzik, Tejun Heo, Alan Cox,
IDE/ATA development list, FUJITA Tomonori
On Tue, 2008-02-26 at 13:56 -0800, James Bottomley wrote:
> > Our iommu can use smaller than PAGE_SIZE allocations when PAGE_SIZE is
> > 64k (it can use 4k), and that helps a lot with networking to avoid
> > filling up pSeries small iommu's too fast. Unfortunately, that meant we
> > used to not even get PAGE_SIZE alignment for some cases. The workaround
> > fixes that but does not provide natural size alignment.
> >
> > If we were to provide size based alignment of sg requests, fragmentation
> > would become so bad we would basically end up failing a large amount of
> > iommu allocations on servers.
>
> Ben, he just quoted the wrong patch ... that was the segment length
> limit patch. The boundary alignment patch is this one:
>
> fb3475e9b6bfa666107512fbd6006c26014f04b8
>
> And it does alter the ppc iommu to respect the dma_boundary, so all of
> the mechanics for removing the ppc special casing in libata/ide is
> upstream.
And the fact that this will totally blow fragmentation through the roof
and cause half of the dma-map requests to fail on machines with small
iommu has been totally ignored I suppose ?
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 22:30 ` Benjamin Herrenschmidt
@ 2008-02-26 23:16 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 23:16 UTC (permalink / raw)
To: James Bottomley
Cc: Mark Lord, Jeff Garzik, Tejun Heo, Alan Cox,
IDE/ATA development list, FUJITA Tomonori
> And the fact that this will totally blow fragmentation through the roof
> and cause half of the dma-map requests to fail on machines with small
> iommu has been totally ignored I suppose ?
Ok, so after checking those patches more closely, it appears that
alignment will only be enforced for devices that impose segment
boundaries, which means that normal devices can still pack. That
sounds good.
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 16:25 ` Mark Lord
2008-02-26 16:51 ` Mark Lord
@ 2008-02-26 21:43 ` Benjamin Herrenschmidt
2008-02-26 23:07 ` Alan Cox
1 sibling, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 21:43 UTC (permalink / raw)
To: Mark Lord
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list, FUJITA Tomonori
On Tue, 2008-02-26 at 11:25 -0500, Mark Lord wrote:
> That's interesting. Can you point us to the exact file::lines where
> this happens? It would be good to ensure that this gets fixed.
>
> I'm copying Fujita Tomonori on this thread now -- he's the dude who's
> trying to sort out the IOMMU mess.
As I said before, it should not be fixed.
If it's "fixed", we'll run out of iommu space all the time.
We should stop having stupid requirements instead.
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 21:43 ` Benjamin Herrenschmidt
@ 2008-02-26 23:07 ` Alan Cox
2008-02-26 23:19 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 27+ messages in thread
From: Alan Cox @ 2008-02-26 23:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Mark Lord, Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list, FUJITA Tomonori
On Wed, Feb 27, 2008 at 08:43:53AM +1100, Benjamin Herrenschmidt wrote:
> As I said before, it should not be fixed.
> If it's "fixed", we'll run out of iommu space all the time.
> We should stop having stupid requirements instead.
Hints would be extremely good. Teaching the I/O layers to do
if (blk_queue_align_64k(queue) == -EGOSTICKYOURHEADINABUCKET)
size /= 2;
is not very hard - or even an arch helper for
n = blk_queue_worst_case(65536, 256);
Alan
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 23:07 ` Alan Cox
@ 2008-02-26 23:19 ` Benjamin Herrenschmidt
2008-02-28 7:36 ` FUJITA Tomonori
0 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 23:19 UTC (permalink / raw)
To: Alan Cox
Cc: Mark Lord, Jeff Garzik, Tejun Heo, James Bottomley,
IDE/ATA development list, FUJITA Tomonori
On Tue, 2008-02-26 at 18:07 -0500, Alan Cox wrote:
> On Wed, Feb 27, 2008 at 08:43:53AM +1100, Benjamin Herrenschmidt wrote:
> > As I said before, it should not be fixed.
> > If it's "fixed", we'll run out of iommu space all the time.
> > We should stop having stupid requirements instead.
>
> Hints would be extremely good. Teaching the I/O layers to do
>
> if (blk_queue_align_64k(queue) == -EGOSTICKYOURHEADINABUCKET)
> size /= 2;
>
> is not very hard - or even an arch helper for
>
> n = blk_queue_worst_case(65536, 256);
Looks like the new iommu stuff will deal with it fine by passing an
optional boundary not-to-cross requirement down. Normal devices don't
set that and the iommu can still pack. IDE can pass 64K and the iommu
will align things that may cross that boundary.
That should work out fine.
BTW. If we -still- run out of iommu space, do we have sane mechanisms
nowadays to break things up and retry ?
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 23:19 ` Benjamin Herrenschmidt
@ 2008-02-28 7:36 ` FUJITA Tomonori
2008-02-28 7:44 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 27+ messages in thread
From: FUJITA Tomonori @ 2008-02-28 7:36 UTC (permalink / raw)
To: benh
Cc: alan, liml, jgarzik, htejun, James.Bottomley, linux-ide,
fujita.tomonori
On Wed, 27 Feb 2008 10:19:36 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> On Tue, 2008-02-26 at 18:07 -0500, Alan Cox wrote:
> > On Wed, Feb 27, 2008 at 08:43:53AM +1100, Benjamin Herrenschmidt wrote:
> > > As I said before, it should not be fixed.
> > > If it's "fixed", we'll run out of iommu space all the time.
> > > We should stop having stupid requirements instead.
> >
> > Hints would be extremely good. Teaching the I/O layers to do
> >
> > if (blk_queue_align_64k(queue) == -EGOSTICKYOURHEADINABUCKET)
> > size /= 2;
> >
> > is not very hard - or even an arch helper for
> >
> > n = blk_queue_worst_case(65536, 256);
>
> Looks like the new iommu stuff will deal with it fine by passing an
> optional boundary not-to-cross requirement down. Normal devices don't
> set that and the iommu can still pack. IDE can pass 64K and the iommu
> will align things that may cross that boundary.
>
> That should work out fine.
Yeah, nothing changes for drivers that don't set the boundary limit.
Even for drivers setting the boundary limit, the IOMMUs just allocate
a memory area that doesn't cross the boundary. It is NOT the
size-alignment allocation that POWER IOMMU's dma_alloc_coherent does
(as discussed in this thread). So the IOMMU space doesn't run out
quickly.
I have not fixed some IOMMUs yet (as I explained at LSF). If I finish
the job, libata can remove the workaround for splitting sg lists.
> BTW. If we -still- run out of iommu space, do we have sane mechanisms
> nowadays to break things up and retry ?
About SCSI, some drivers do the right thing (retrying), some assume
that IOMMU mapping failure never happens, some just crash. I'll take
care about it after fixing the IOMMUs.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-28 7:36 ` FUJITA Tomonori
@ 2008-02-28 7:44 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-28 7:44 UTC (permalink / raw)
To: FUJITA Tomonori
Cc: alan, liml, jgarzik, htejun, James.Bottomley, linux-ide,
fujita.tomonori
On Thu, 2008-02-28 at 16:36 +0900, FUJITA Tomonori wrote:
> > BTW. If we -still- run out of iommu space, do we have sane
> mechanisms
> > nowadays to break things up and retry ?
>
> About SCSI, some drivers do the right thing (retrying), some assume
> that IOMMU mapping failure never happens, some just crash. I'll take
> care about it after fixing the IOMMUs.
Thanks for doing this.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 1:37 ` Mark Lord
2008-02-26 1:43 ` Mark Lord
@ 2008-02-26 2:52 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-26 2:52 UTC (permalink / raw)
To: Mark Lord
Cc: Jeff Garzik, Tejun Heo, Alan Cox, James Bottomley,
IDE/ATA development list
> The block layer uses seg_boundary_mask to ensure that we never have
> to split them again in the LLD.
>
> A very long time ago, when I wrote the IDE DMA code, this was not the case.
Not good enough, still, because the boundaries can change due to the
iommu merging, thus the split must be re-done.
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: libata .sg_tablesize: why always dividing by 2 ?
2008-02-26 0:02 libata .sg_tablesize: why always dividing by 2 ? Mark Lord
2008-02-26 0:15 ` Jeff Garzik
@ 2008-02-26 0:22 ` Jeff Garzik
2008-02-26 0:28 ` Mark Lord
1 sibling, 1 reply; 27+ messages in thread
From: Jeff Garzik @ 2008-02-26 0:22 UTC (permalink / raw)
To: Mark Lord; +Cc: Tejun Heo, Alan Cox, James Bottomley, IDE/ATA development list
As an aside, ISTR tomo-san was working on eliminating the need for the
"/2" by tackling the details on the IOMMU side...
Jeff
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2008-02-28 7:46 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-02-26 0:02 libata .sg_tablesize: why always dividing by 2 ? Mark Lord
2008-02-26 0:15 ` Jeff Garzik
2008-02-26 0:27 ` Mark Lord
2008-02-26 0:54 ` Benjamin Herrenschmidt
2008-02-26 1:37 ` Mark Lord
2008-02-26 1:43 ` Mark Lord
2008-02-26 2:54 ` Benjamin Herrenschmidt
2008-02-26 4:38 ` Mark Lord
2008-02-26 5:30 ` Benjamin Herrenschmidt
2008-02-26 5:43 ` Mark Lord
2008-02-26 5:47 ` Benjamin Herrenschmidt
2008-02-26 16:09 ` James Bottomley
2008-02-26 21:43 ` Benjamin Herrenschmidt
2008-02-26 16:25 ` Mark Lord
2008-02-26 16:51 ` Mark Lord
2008-02-26 21:50 ` Benjamin Herrenschmidt
2008-02-26 21:56 ` James Bottomley
2008-02-26 22:30 ` Benjamin Herrenschmidt
2008-02-26 23:16 ` Benjamin Herrenschmidt
2008-02-26 21:43 ` Benjamin Herrenschmidt
2008-02-26 23:07 ` Alan Cox
2008-02-26 23:19 ` Benjamin Herrenschmidt
2008-02-28 7:36 ` FUJITA Tomonori
2008-02-28 7:44 ` Benjamin Herrenschmidt
2008-02-26 2:52 ` Benjamin Herrenschmidt
2008-02-26 0:22 ` Jeff Garzik
2008-02-26 0:28 ` Mark Lord
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).