* AArch64 memory
@ 2018-05-17 15:58 Tim Harvey
From: Tim Harvey @ 2018-05-17 15:58 UTC (permalink / raw)
To: linux-arm-kernel
Greetings,
I'm trying to understand some details of the AArch64 memory
configuration in the kernel.
I've looked at Documentation/arm64/memory.txt which describes the
virtual memory layout used in terms of translation levels. This
relates to CONFIG_ARM64_{4K,16K,64K}_PAGES and CONFIG_ARM64_VA_BITS*.
My first question has to do with virtual memory layout: What are the
advantages and disadvantages for a system with a fixed 2GB of DRAM
when using 4KB pages + 3 levels (CONFIG_ARM64_4K_PAGES=y
CONFIG_ARM64_VA_BITS=39) resulting in 512GB user / 512GB kernel vs
using 64KB pages + 3 levels (CONFIG_ARM64_64K_PAGES=y
CONFIG_ARM64_VA_BITS=48)? The physical memory is far less than what
any of these combinations can address, but I'm not clear whether the
number of translation levels affects performance or how fragmentation
could play into it.
My second question has to do with CMA and coherent_pool. I have
understood CMA as being a chunk of physical memory carved out by the
kernel for allocations from dma_alloc_coherent by drivers that need
chunks of contiguous memory for DMA buffers. I believe that before CMA
was introduced we had to do this by defining memory holes. I'm not
understanding the difference between CMA and the coherent pool. I've
noticed that if CONFIG_DMA_CMA=y then the coherent pool allocates from
CMA. Is there some disadvantage of CONFIG_DMA_CMA=y other than if
defined you need to make sure your CMA is larger than coherent_pool?
What drivers/calls use coherent_pool vs cma?
Best Regards,
Tim
* AArch64 memory
From: Robin Murphy @ 2018-05-18 11:59 UTC (permalink / raw)
To: linux-arm-kernel

Hi Tim,

On 17/05/18 16:58, Tim Harvey wrote:
> Greetings,
>
> I'm trying to understand some details of the AArch64 memory configuration in the kernel.
>
> I've looked at Documentation/arm64/memory.txt which describes the virtual memory layout used in terms of translation levels. This relates to CONFIG_ARM64_{4K,16K,64K}_PAGES and CONFIG_ARM64_VA_BITS*.
>
> My first question has to do with virtual memory layout: What are the advantages and disadvantages for a system with a fixed 2GB of DRAM when using 4KB pages + 3 levels (CONFIG_ARM64_4K_PAGES=y CONFIG_ARM64_VA_BITS=39) resulting in 512GB user / 512GB kernel vs using 64KB pages + 3 levels (CONFIG_ARM64_64K_PAGES=y CONFIG_ARM64_VA_BITS=48)? The physical memory is far less than what any of these combinations can address, but I'm not clear whether the number of translation levels affects performance or how fragmentation could play into it.

There have been a number of discussions on the lists about the general topic in the contexts of several architectures, and I'm sure the last one I saw regarding arm64 actually had some measurements in it, although it's proving remarkably tricky to actually dig up again this morning :/

I think the rough executive summary remains that for certain memory-intensive workloads on AArch64, 64K pages *can* give a notable performance benefit in terms of reduced TLB pressure (and potentially also some for TLB miss overhead with 42-bit VA and 2-level tables). The (major) tradeoff is that for most other workloads, including much of the kernel itself, the increased allocation granularity leads to a significant increase in wasted RAM.

My gut feeling is that if you have relatively limited RAM and don't know otherwise, then 39-bit VA is probably the way to go - notably, there are also still drivers/filesystems/etc. which don't play too well with PAGE_SIZE != 4096 - but I'm by no means an expert in this area. If you're targeting a particular application area (e.g. networking) and can benchmark some representative workloads to look at performance vs. RAM usage for different configs, that would probably help inform your decision the most.

> My second question has to do with CMA and coherent_pool. I have understood CMA as being a chunk of physical memory carved out by the kernel for allocations from dma_alloc_coherent by drivers that need chunks of contiguous memory for DMA buffers. I believe that before CMA was introduced we had to do this by defining memory holes. I'm not understanding the difference between CMA and the coherent pool. I've noticed that if CONFIG_DMA_CMA=y then the coherent pool allocates from CMA. Is there some disadvantage of CONFIG_DMA_CMA=y other than if defined you need to make sure your CMA is larger than coherent_pool? What drivers/calls use coherent_pool vs cma?

coherent_pool is a special thing which exists for the sake of non-hardware-coherent devices - normally for those we satisfy DMA-coherent allocations by setting up a non-cacheable remap of the allocated buffer - see dma_common_contiguous_remap().
However, drivers may call dma_alloc_coherent(..., GFP_ATOMIC) from interrupt handlers, at which point we can't call get_vm_area() to remap on demand, since that might sleep, so we reserve a pool pre-mapped as non-cacheable to satisfy such atomic allocations from. I'm not sure why its user-visible name is "coherent pool" rather than the more descriptive "atomic pool" which it's named internally, but there's probably some history there. If you're lucky enough not to have any non-coherent DMA masters then you can safely ignore the whole thing; otherwise it's still generally rare that it should need adjusting.

CMA is, as you surmise, a much more general thing for providing large physically-contiguous areas, which the arch code correspondingly uses to get DMA-contiguous buffers. Unless all your DMA masters are behind IOMMUs (such that we can make any motley collection of pages look DMA-contiguous), you probably don't want to turn it off. None of these details should be relevant as far as drivers are concerned, since from their viewpoint it's all abstracted behind dma_alloc_coherent().

Robin.
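For illustration, here is a minimal driver-style sketch - hypothetical function names and buffer sizes, not taken from any driver discussed in this thread - contrasting the two allocation contexts described above: a process-context allocation, which the DMA layer is free to remap on demand, and an interrupt-context allocation, which cannot sleep and therefore has to be satisfied from the pre-mapped pool on a non-coherent device:

    #include <linux/dma-mapping.h>
    #include <linux/interrupt.h>
    #include <linux/sizes.h>

    /* Process context: GFP_KERNEL may sleep, so a non-cacheable remap can
     * be set up on demand for a non-coherent device. */
    static int example_alloc_ring(struct device *dev, void **cpu, dma_addr_t *dma)
    {
            *cpu = dma_alloc_coherent(dev, SZ_64K, dma, GFP_KERNEL);
            return *cpu ? 0 : -ENOMEM;
    }

    /* Interrupt context: GFP_ATOMIC must not sleep, so for a non-coherent
     * device this allocation comes from the pre-mapped coherent_pool. */
    static irqreturn_t example_irq(int irq, void *data)
    {
            struct device *dev = data;
            dma_addr_t dma;
            void *buf = dma_alloc_coherent(dev, SZ_4K, &dma, GFP_ATOMIC);

            if (!buf)
                    return IRQ_HANDLED;     /* pool exhausted - drop this event */
            /* ... hand 'dma' to the hardware, then free when done ... */
            dma_free_coherent(dev, SZ_4K, buf, dma);
            return IRQ_HANDLED;
    }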
* AArch64 memory
From: Tim Harvey @ 2018-05-18 16:43 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, May 18, 2018 at 4:59 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> Hi Tim,
>
> On 17/05/18 16:58, Tim Harvey wrote:
>> Greetings,
>>
>> I'm trying to understand some details of the AArch64 memory configuration in the kernel.
>>
>> I've looked at Documentation/arm64/memory.txt which describes the virtual memory layout used in terms of translation levels. This relates to CONFIG_ARM64_{4K,16K,64K}_PAGES and CONFIG_ARM64_VA_BITS*.
>>
>> My first question has to do with virtual memory layout: What are the advantages and disadvantages for a system with a fixed 2GB of DRAM when using 4KB pages + 3 levels (CONFIG_ARM64_4K_PAGES=y CONFIG_ARM64_VA_BITS=39) resulting in 512GB user / 512GB kernel vs using 64KB pages + 3 levels (CONFIG_ARM64_64K_PAGES=y CONFIG_ARM64_VA_BITS=48)? The physical memory is far less than what any of these combinations can address, but I'm not clear whether the number of translation levels affects performance or how fragmentation could play into it.
>
> There have been a number of discussions on the lists about the general topic in the contexts of several architectures, and I'm sure the last one I saw regarding arm64 actually had some measurements in it, although it's proving remarkably tricky to actually dig up again this morning :/
>
> I think the rough executive summary remains that for certain memory-intensive workloads on AArch64, 64K pages *can* give a notable performance benefit in terms of reduced TLB pressure (and potentially also some for TLB miss overhead with 42-bit VA and 2-level tables). The (major) tradeoff is that for most other workloads, including much of the kernel itself, the increased allocation granularity leads to a significant increase in wasted RAM.
>
> My gut feeling is that if you have relatively limited RAM and don't know otherwise, then 39-bit VA is probably the way to go - notably, there are also still drivers/filesystems/etc. which don't play too well with PAGE_SIZE != 4096 - but I'm by no means an expert in this area. If you're targeting a particular application area (e.g. networking) and can benchmark some representative workloads to look at performance vs. RAM usage for different configs, that would probably help inform your decision the most.

Robin,

Thanks for the explanation - this makes sense and I understand that it's not easy to determine what is best. I'll do some tests with the boards I'm working with (which are Cavium Octeon-TX CN80XX quad-core 1.5GHz boards with 1MB L2 cache and 2GB 32bit DDR4 with up to 5x GbE).

>> My second question has to do with CMA and coherent_pool. I have understood CMA as being a chunk of physical memory carved out by the kernel for allocations from dma_alloc_coherent by drivers that need chunks of contiguous memory for DMA buffers. I believe that before CMA was introduced we had to do this by defining memory holes. I'm not understanding the difference between CMA and the coherent pool. I've noticed that if CONFIG_DMA_CMA=y then the coherent pool allocates from CMA. Is there some disadvantage of CONFIG_DMA_CMA=y other than if defined you need to make sure your CMA is larger than coherent_pool? What drivers/calls use coherent_pool vs cma?
> coherent_pool is a special thing which exists for the sake of non-hardware-coherent devices - normally for those we satisfy DMA-coherent allocations by setting up a non-cacheable remap of the allocated buffer - see dma_common_contiguous_remap(). However, drivers may call dma_alloc_coherent(..., GFP_ATOMIC) from interrupt handlers, at which point we can't call get_vm_area() to remap on demand, since that might sleep, so we reserve a pool pre-mapped as non-cacheable to satisfy such atomic allocations from. I'm not sure why its user-visible name is "coherent pool" rather than the more descriptive "atomic pool" which it's named internally, but there's probably some history there. If you're lucky enough not to have any non-coherent DMA masters then you can safely ignore the whole thing; otherwise it's still generally rare that it should need adjusting.

Is there an easy way to tell if I have non-coherent DMA masters? The Cavium SDK uses a kernel cmdline param of coherent_pool=16M so I'm guessing something in the CN80XX/CN81XX (BGX NICs or CPT perhaps) needs atomic pool mem.

> CMA is, as you surmise, a much more general thing for providing large physically-contiguous areas, which the arch code correspondingly uses to get DMA-contiguous buffers. Unless all your DMA masters are behind IOMMUs (such that we can make any motley collection of pages look DMA-contiguous), you probably don't want to turn it off. None of these details should be relevant as far as drivers are concerned, since from their viewpoint it's all abstracted behind dma_alloc_coherent().

I don't want to turn off CONFIG_CMA but I'm still not clear if I should turn off CONFIG_DMA_CMA. I noticed the Cavium SDK 4.9 kernel has CONFIG_CMA=y but does not enable CONFIG_DMA_CMA, which I believe means that the atomic pool does not pull its chunks from the CMA pool.

Thanks,

Tim
* AArch64 memory
From: Robin Murphy @ 2018-05-18 18:15 UTC (permalink / raw)
To: linux-arm-kernel

On 18/05/18 17:43, Tim Harvey wrote:
[...]
>>> My second question has to do with CMA and coherent_pool. I have understood CMA as being a chunk of physical memory carved out by the kernel for allocations from dma_alloc_coherent by drivers that need chunks of contiguous memory for DMA buffers. I believe that before CMA was introduced we had to do this by defining memory holes. I'm not understanding the difference between CMA and the coherent pool. I've noticed that if CONFIG_DMA_CMA=y then the coherent pool allocates from CMA. Is there some disadvantage of CONFIG_DMA_CMA=y other than if defined you need to make sure your CMA is larger than coherent_pool? What drivers/calls use coherent_pool vs cma?
>>
>> coherent_pool is a special thing which exists for the sake of non-hardware-coherent devices - normally for those we satisfy DMA-coherent allocations by setting up a non-cacheable remap of the allocated buffer - see dma_common_contiguous_remap(). However, drivers may call dma_alloc_coherent(..., GFP_ATOMIC) from interrupt handlers, at which point we can't call get_vm_area() to remap on demand, since that might sleep, so we reserve a pool pre-mapped as non-cacheable to satisfy such atomic allocations from. I'm not sure why its user-visible name is "coherent pool" rather than the more descriptive "atomic pool" which it's named internally, but there's probably some history there. If you're lucky enough not to have any non-coherent DMA masters then you can safely ignore the whole thing; otherwise it's still generally rare that it should need adjusting.
>
> Is there an easy way to tell if I have non-coherent DMA masters? The Cavium SDK uses a kernel cmdline param of coherent_pool=16M so I'm guessing something in the CN80XX/CN81XX (BGX NICs or CPT perhaps) needs atomic pool mem.

AFAIK the big-boy CN88xx is fully coherent everywhere, but whether the peripherals and interconnect in the littler Octeon TX variants are different I have no idea. If the contents of your dts-newport repo on GitHub are the right thing to be looking at, then you do have the "dma-coherent" property on the PCI nodes, which should cover everything beneath (I'd expect that in reality the SMMU may actually be coherent as well, but fortunately that's irrelevant here). Thus everything which matters *should* be being picked up as coherent already, and if not it would be a Linux problem. I can't imagine what the SDK is up to there, but 16MB of coherent pool does sound like something being done wrong, like incorrectly compensating for bad firmware failing to describe the hardware properly in the first place.

>> CMA is, as you surmise, a much more general thing for providing large physically-contiguous areas, which the arch code correspondingly uses to get DMA-contiguous buffers. Unless all your DMA masters are behind IOMMUs (such that we can make any motley collection of pages look DMA-contiguous), you probably don't want to turn it off. None of these details should be relevant as far as drivers are concerned, since from their viewpoint it's all abstracted behind dma_alloc_coherent().
>
> I don't want to turn off CONFIG_CMA but I'm still not clear if I should turn off CONFIG_DMA_CMA.
> I noticed the Cavium SDK 4.9 kernel has CONFIG_CMA=y but does not enable CONFIG_DMA_CMA, which I believe means that the atomic pool does not pull its chunks from the CMA pool.

I wouldn't think there's much good reason to turn DMA_CMA off either, even if nothing actually needs huge DMA buffers. Where the atomic pool comes from shouldn't really matter, as it's a very early one-off allocation. To speculate wildly I suppose there *might* possibly be some performance difference between cma_alloc() and falling back to the regular page allocator - if that were the case it ought to be measurable by profiling something which calls dma_alloc_coherent() in process context a lot, under both configurations. Even then I'd imagine it's something that would matter most on the 2-socket 96-core systems, and not so much on the diddy ones.

Robin.
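As a rough aid to the "how can I tell" question above, here is a hedged sketch (hypothetical probe function, not from any driver in this thread) of how the DT "dma-coherent" property can be inspected from kernel code; on DT-based systems this is essentially the information the DMA core uses when deciding whether a device is hardware-coherent:

    #include <linux/of.h>
    #include <linux/of_address.h>
    #include <linux/platform_device.h>

    /* Hypothetical probe for a DT-described platform device. Note that
     * of_dma_is_coherent() checks the given node and its parents, which is
     * why a "dma-coherent" property on the PCI host controller node covers
     * every master behind it; for PCI devices the DMA core performs the
     * equivalent walk via the host bridge's node. */
    static int example_probe(struct platform_device *pdev)
    {
            struct device *dev = &pdev->dev;

            if (of_dma_is_coherent(dev->of_node))
                    dev_info(dev, "DT marks this master as hardware-coherent\n");
            else
                    dev_info(dev, "no dma-coherent property; non-cacheable remaps and coherent_pool apply\n");

            return 0;
    }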
* AArch64 memory
From: Tim Harvey @ 2018-05-18 18:49 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, May 18, 2018 at 11:15 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> On 18/05/18 17:43, Tim Harvey wrote:
> [...]
>
>>>> My second question has to do with CMA and coherent_pool. I have understood CMA as being a chunk of physical memory carved out by the kernel for allocations from dma_alloc_coherent by drivers that need chunks of contiguous memory for DMA buffers. I believe that before CMA was introduced we had to do this by defining memory holes. I'm not understanding the difference between CMA and the coherent pool. I've noticed that if CONFIG_DMA_CMA=y then the coherent pool allocates from CMA. Is there some disadvantage of CONFIG_DMA_CMA=y other than if defined you need to make sure your CMA is larger than coherent_pool? What drivers/calls use coherent_pool vs cma?
>>>
>>> coherent_pool is a special thing which exists for the sake of non-hardware-coherent devices - normally for those we satisfy DMA-coherent allocations by setting up a non-cacheable remap of the allocated buffer - see dma_common_contiguous_remap(). However, drivers may call dma_alloc_coherent(..., GFP_ATOMIC) from interrupt handlers, at which point we can't call get_vm_area() to remap on demand, since that might sleep, so we reserve a pool pre-mapped as non-cacheable to satisfy such atomic allocations from. I'm not sure why its user-visible name is "coherent pool" rather than the more descriptive "atomic pool" which it's named internally, but there's probably some history there. If you're lucky enough not to have any non-coherent DMA masters then you can safely ignore the whole thing; otherwise it's still generally rare that it should need adjusting.
>>
>> Is there an easy way to tell if I have non-coherent DMA masters? The Cavium SDK uses a kernel cmdline param of coherent_pool=16M so I'm guessing something in the CN80XX/CN81XX (BGX NICs or CPT perhaps) needs atomic pool mem.
>
> AFAIK the big-boy CN88xx is fully coherent everywhere, but whether the peripherals and interconnect in the littler Octeon TX variants are different I have no idea. If the contents of your dts-newport repo on GitHub are the right thing to be looking at, then you do have the "dma-coherent" property on the PCI nodes, which should cover everything beneath (I'd expect that in reality the SMMU may actually be coherent as well, but fortunately that's irrelevant here). Thus everything which matters *should* be being picked up as coherent already, and if not it would be a Linux problem. I can't imagine what the SDK is up to there, but 16MB of coherent pool does sound like something being done wrong, like incorrectly compensating for bad firmware failing to describe the hardware properly in the first place.

Yes, https://github.com/Gateworks/dts-newport/ is the board that I'm working with :)

Ok, I think I understand now that the dma-coherent property on the PCI host controller is saying that all allocations by PCI device drivers will come from the atomic pool defined by coherent_pool=.

Why does coherent_pool=16M seem wrong to you?
>>> CMA is, as you surmise, a much more general thing for providing large physically-contiguous areas, which the arch code correspondingly uses to get DMA-contiguous buffers. Unless all your DMA masters are behind IOMMUs (such that we can make any motley collection of pages look DMA-contiguous), you probably don't want to turn it off. None of these details should be relevant as far as drivers are concerned, since from their viewpoint it's all abstracted behind dma_alloc_coherent().
>>
>> I don't want to turn off CONFIG_CMA but I'm still not clear if I should turn off CONFIG_DMA_CMA. I noticed the Cavium SDK 4.9 kernel has CONFIG_CMA=y but does not enable CONFIG_DMA_CMA, which I believe means that the atomic pool does not pull its chunks from the CMA pool.
>
> I wouldn't think there's much good reason to turn DMA_CMA off either, even if nothing actually needs huge DMA buffers. Where the atomic pool comes from shouldn't really matter, as it's a very early one-off allocation. To speculate wildly I suppose there *might* possibly be some performance difference between cma_alloc() and falling back to the regular page allocator - if that were the case it ought to be measurable by profiling something which calls dma_alloc_coherent() in process context a lot, under both configurations. Even then I'd imagine it's something that would matter most on the 2-socket 96-core systems, and not so much on the diddy ones.

If you enable DMA_CMA then you have to make sure to size CMA large enough to handle coherent_pool (and any additional CMA you will need). I made the mistake of setting CONFIG_CMA_SIZE_MBYTES=16 then passing in coherent_pool=64M, which causes the coherent pool DMA allocation to fail, and I'm not clear whether that even has an impact on the system. It seems to me that the kernel should perhaps catch the case where CMA < coherent_pool when CONFIG_DMA_CMA=y and either warn about that condition or bump CMA to at least coherent_pool to resolve it.

Tim
* AArch64 memory
From: Robin Murphy @ 2018-05-18 20:59 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, 18 May 2018 11:49:05 -0700 Tim Harvey <tharvey@gateworks.com> wrote:
> On Fri, May 18, 2018 at 11:15 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> > On 18/05/18 17:43, Tim Harvey wrote:
> > [...]
> >
> >>>> My second question has to do with CMA and coherent_pool. I have understood CMA as being a chunk of physical memory carved out by the kernel for allocations from dma_alloc_coherent by drivers that need chunks of contiguous memory for DMA buffers. I believe that before CMA was introduced we had to do this by defining memory holes. I'm not understanding the difference between CMA and the coherent pool. I've noticed that if CONFIG_DMA_CMA=y then the coherent pool allocates from CMA. Is there some disadvantage of CONFIG_DMA_CMA=y other than if defined you need to make sure your CMA is larger than coherent_pool? What drivers/calls use coherent_pool vs cma?
> >>>
> >>> coherent_pool is a special thing which exists for the sake of non-hardware-coherent devices - normally for those we satisfy DMA-coherent allocations by setting up a non-cacheable remap of the allocated buffer - see dma_common_contiguous_remap(). However, drivers may call dma_alloc_coherent(..., GFP_ATOMIC) from interrupt handlers, at which point we can't call get_vm_area() to remap on demand, since that might sleep, so we reserve a pool pre-mapped as non-cacheable to satisfy such atomic allocations from. I'm not sure why its user-visible name is "coherent pool" rather than the more descriptive "atomic pool" which it's named internally, but there's probably some history there. If you're lucky enough not to have any non-coherent DMA masters then you can safely ignore the whole thing; otherwise it's still generally rare that it should need adjusting.
> >>
> >> Is there an easy way to tell if I have non-coherent DMA masters? The Cavium SDK uses a kernel cmdline param of coherent_pool=16M so I'm guessing something in the CN80XX/CN81XX (BGX NICs or CPT perhaps) needs atomic pool mem.
> >
> > AFAIK the big-boy CN88xx is fully coherent everywhere, but whether the peripherals and interconnect in the littler Octeon TX variants are different I have no idea. If the contents of your dts-newport repo on GitHub are the right thing to be looking at, then you do have the "dma-coherent" property on the PCI nodes, which should cover everything beneath (I'd expect that in reality the SMMU may actually be coherent as well, but fortunately that's irrelevant here). Thus everything which matters *should* be being picked up as coherent already, and if not it would be a Linux problem. I can't imagine what the SDK is up to there, but 16MB of coherent pool does sound like something being done wrong, like incorrectly compensating for bad firmware failing to describe the hardware properly in the first place.
> Yes, https://github.com/Gateworks/dts-newport/ is the board that I'm working with :)
>
> Ok, I think I understand now that the dma-coherent property on the PCI host controller is saying that all allocations by PCI device drivers will come from the atomic pool defined by coherent_pool=.

No no, quite the opposite! With that property present, all the devices should be treated as hardware-coherent, meaning that CPU accesses to DMA buffers can be via the regular (cacheable) kernel address, and the non-cacheable remaps aren't necessary. Thus *nothing* will be touching the atomic pool at all.

> Why does coherent_pool=16M seem wrong to you?

Because it's two hundred and fifty-six times the default value, and atomic allocations should be very rare to begin with. IOW it stinks of badly-written drivers.

> >>> CMA is, as you surmise, a much more general thing for providing large physically-contiguous areas, which the arch code correspondingly uses to get DMA-contiguous buffers. Unless all your DMA masters are behind IOMMUs (such that we can make any motley collection of pages look DMA-contiguous), you probably don't want to turn it off. None of these details should be relevant as far as drivers are concerned, since from their viewpoint it's all abstracted behind dma_alloc_coherent().
> >>
> >> I don't want to turn off CONFIG_CMA but I'm still not clear if I should turn off CONFIG_DMA_CMA. I noticed the Cavium SDK 4.9 kernel has CONFIG_CMA=y but does not enable CONFIG_DMA_CMA, which I believe means that the atomic pool does not pull its chunks from the CMA pool.
> >
> > I wouldn't think there's much good reason to turn DMA_CMA off either, even if nothing actually needs huge DMA buffers. Where the atomic pool comes from shouldn't really matter, as it's a very early one-off allocation. To speculate wildly I suppose there *might* possibly be some performance difference between cma_alloc() and falling back to the regular page allocator - if that were the case it ought to be measurable by profiling something which calls dma_alloc_coherent() in process context a lot, under both configurations. Even then I'd imagine it's something that would matter most on the 2-socket 96-core systems, and not so much on the diddy ones.
>
> If you enable DMA_CMA then you have to make sure to size CMA large enough to handle coherent_pool (and any additional CMA you will need). I made the mistake of setting CONFIG_CMA_SIZE_MBYTES=16 then passing in coherent_pool=64M, which causes the coherent pool DMA allocation to fail, and I'm not clear whether that even has an impact on the system. It seems to me that the kernel should perhaps catch the case where CMA < coherent_pool when CONFIG_DMA_CMA=y and either warn about that condition or bump CMA to at least coherent_pool to resolve it.

Unfortunately that's not really practical - the default DMA_CMA region is pulled out of memblock way early by generic code, while the atomic pool is an Arm-specific thing which only comes into the picture much later. Users already get a warning when creating the atomic pool fails, so if they really want to go to crazy town with command-line values they can always just reboot with "cma=<bigger>" as well (and without CMA you're way beyond MAX_ORDER with those kind of sizes anyway).

Robin.
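To make that sizing relationship concrete - the figures below are purely illustrative, not recommendations from this thread - with CONFIG_DMA_CMA=y the atomic pool is carved out of the CMA region, so the two command-line values need to stay consistent:

    cma=64M coherent_pool=2M     (pool fits comfortably inside the CMA region)
    cma=16M coherent_pool=64M    (pool cannot be carved out of CMA; the kernel
                                  logs a warning at boot, and atomic allocations
                                  for non-coherent devices may later fail)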