xen-devel.lists.xenproject.org archive mirror
* Dom0 physical networking/swiotlb/something issue in 3.7-rc1
@ 2012-10-12 10:28 Ian Campbell
  2012-10-12 11:59 ` Konrad Rzeszutek Wilk
  2012-11-09  9:03 ` Jan Beulich
  0 siblings, 2 replies; 14+ messages in thread
From: Ian Campbell @ 2012-10-12 10:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel

Hi Konrad,

The following patch causes fairly large packet loss when transmitting
from dom0 to the physical network, at least with my tg3 hardware, but I
assume it can impact anything which uses this interface.

I suspect that the issue is that the compound pages allocated in this
way are not backed by contiguous mfns and so things fall apart when the
driver tries to do DMA.

However I don't understand why the swiotlb is not fixing this up
successfully? The tg3 driver seems to use pci_map_single on this data.
Any thoughts? Perhaps the swiotlb (either generically or in the Xen
backend) doesn't correctly handle compound pages?

Ideally we would also fix this at the point of allocation to avoid the
bouncing -- I suppose that would involve using the DMA API in
netdev_alloc_frag?

We have a, sort of, similar situation in the block layer which is solved
via BIOVEC_PHYS_MERGEABLE. Sadly I don't think anything similar can
easily be retrofitted to the net drivers without changing every single
one.

Ian.

commit 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Sep 26 06:46:57 2012 +0000

    net: use bigger pages in __netdev_alloc_frag
    
    We currently use percpu order-0 pages in __netdev_alloc_frag
    to deliver fragments used by __netdev_alloc_skb()
    
    Depending on NIC driver and arch being 32 or 64 bit, it allows a page to
    be split in several fragments (between 1 and 8), assuming PAGE_SIZE=4096
    
    Switching to bigger pages (32768 bytes for PAGE_SIZE=4096 case) allows :
    
    - Better filling of space (the ending hole overhead is less an issue)
    
    - Fewer calls to page allocator or accesses to page->_count
    
    - Could allow future struct skb_shared_info changes without major
      performance impact.
    
    This patch implements a transparent fallback to smaller
    pages in case of memory pressure.
    
    It also uses a standard "struct page_frag" instead of a custom one.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Alexander Duyck <alexander.h.duyck@intel.com>
    Cc: Benjamin LaHaise <bcrl@kvack.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-10-12 10:28 Dom0 physical networking/swiotlb/something issue in 3.7-rc1 Ian Campbell
@ 2012-10-12 11:59 ` Konrad Rzeszutek Wilk
  2012-10-12 12:09   ` Ian Campbell
  2012-10-12 12:10   ` Konrad Rzeszutek Wilk
  2012-11-09  9:03 ` Jan Beulich
  1 sibling, 2 replies; 14+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-12 11:59 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel

On Fri, Oct 12, 2012 at 11:28:08AM +0100, Ian Campbell wrote:
> Hi Konrad,
> 
> The following patch causes fairly large packet loss when transmitting
> from dom0 to the physical network, at least with my tg3 hardware, but I
> assume it can impact anything which uses this interface.

Ah, that would explain why one of my machines suddenly started
developing checksum errors (and had a tg3 card). I hadn't gotten
deep into it.
> 
> I suspect that the issue is that the compound pages allocated in this
> way are not backed by contiguous mfns and so things fall apart when the
> driver tries to do DMA.

So this should also be easily reproduced on bare metal with 'iommu=soft' then.
> 
> However I don't understand why the swiotlb is not fixing this up
> successfully? The tg3 driver seems to use pci_map_single on this data.
> Any thoughts? Perhaps the swiotlb (either generically or in the Xen
> backend) doesn't correctly handle compound pages?

The assumption is that it is just a page. I am surprised that the other
IOMMUs aren't hitting this as well - ah, that is b/c they do handle
a virtual address of more than one PAGE_SIZE..
> 
> Ideally we would also fix this at the point of allocation to avoid the
> bouncing -- I suppose that would involve using the DMA API in
> netdev_alloc_frag?

Using pci_alloc_coherent would do it.. but
> 
> We have a, sort of, similar situation in the block layer which is solved
> via BIOVEC_PHYS_MERGEABLE. Sadly I don't think anything similar can
> easily be retrofitted to the net drivers without changing every single
> one.

.. I think the right way would be to fix the SWIOTLB. And since I am now
officially the maintainer of said subsystem you have come to the right
person!

What is the easiest way of reproducing this? Just doing a large amount
of netperf/netserver traffic both ways?
> 
> Ian.
> 
> [...]


* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-10-12 11:59 ` Konrad Rzeszutek Wilk
@ 2012-10-12 12:09   ` Ian Campbell
  2012-10-12 12:10   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 14+ messages in thread
From: Ian Campbell @ 2012-10-12 12:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel

On Fri, 2012-10-12 at 12:59 +0100, Konrad Rzeszutek Wilk wrote:
> > I suspect that the issue is that the compound pages allocated in this
> > way are not backed by contiguous mfns and so things fall apart when the
> > driver tries to do DMA.
> 
> So this should also be easily reproduced on bare metal with 'iommu=soft' then.

Does that cause the frames backing the page to become non-contiguous in
"machine" memory?

> > However I don't understand why the swiotlb is not fixing this up
> > successfully? The tg3 driver seems to use pci_map_single on this data.
> > Any thoughts? Perhaps the swiotlb (either generically or in the Xen
> > backend) doesn't correctly handle compound pages?
> 
> The assumption is that it is just a page. I am surprised that the other
> IOMMUs aren't hitting this as well - ah, that is b/c they do handle
> a virtual address of more than one PAGE_SIZE..
> > 
> > Ideally we would also fix this at the point of allocation to avoid the
> > bouncing -- I suppose that would involve using the DMA API in
> > netdev_alloc_frag?
> 
> Using pci_alloc_coherent would do it.. but
> > 
> > We have a, sort of, similar situation in the block layer which is solved
> > via BIOVEC_PHYS_MERGEABLE. Sadly I don't think anything similar can
> > easily be retrofitted to the net drivers without changing every single
> > one.
> 
> .. I think the right way would be to fix the SWIOTLB. And since I am now
> officially the maintainer of said subsystem you have come to the right
> person!
> 
> What is the easiest way of reproducing this? Just doing large amount
> of netperf/netserver traffic both ways?

I'm seeing ~75% loss from ping, plus an scp from the dom0 to another host
was measuring hundreds of KB/s instead of a few MB/s.

Fixing this only at the swiotlb layer is going to cause lots of bouncing
though?

Ian.


* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-10-12 11:59 ` Konrad Rzeszutek Wilk
  2012-10-12 12:09   ` Ian Campbell
@ 2012-10-12 12:10   ` Konrad Rzeszutek Wilk
  2012-10-12 12:18     ` Ian Campbell
  1 sibling, 1 reply; 14+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-12 12:10 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel

On Fri, Oct 12, 2012 at 07:59:49AM -0400, Konrad Rzeszutek Wilk wrote:
> [...]
> 
> The assumption is that it is just a page. I am surprised that the other
> IOMMUs aren't hitting this as well - ah, that is b/c they do handle
> a virtual address of more than one PAGE_SIZE..

So.. the GART (AMD's poor man's IOTLB - it was used for AGP card
translation, but can still be used as an IOMMU - and is still present on
some AMD machines) looks to suffer the same problem.

But perhaps not - can you explain to me whether a compound page
is virtually contiguous? One of the things the GART does for
pci_map_single is call page_to_phys(p) and feed the CPU physical address
(and size) into the GART engine to set up the mapping.

If compound pages are virtually (and physically on bare metal) contiguous
- this ought to work. But if they are not, then this should also break on
AMD machines with tg3 and an AMD GART enabled.

(and I should be able to find such machine in my lab).


* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-10-12 12:10   ` Konrad Rzeszutek Wilk
@ 2012-10-12 12:18     ` Ian Campbell
  2012-10-12 13:17       ` Konrad Rzeszutek Wilk
  2012-10-29 15:55       ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 14+ messages in thread
From: Ian Campbell @ 2012-10-12 12:18 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel

On Fri, 2012-10-12 at 13:10 +0100, Konrad Rzeszutek Wilk wrote:
> [...]
> 
> So.. the GART (AMD's poor man's IOTLB - it was used for AGP card
> translation, but can still be used as an IOMMU - and is still present on
> some AMD machines) looks to suffer the same problem.
> 
> But perhaps not - can you explain to me whether a compound page
> is virtually contiguous? One of the things the GART does for
> pci_map_single is call page_to_phys(p) and feed the CPU physical address
> (and size) into the GART engine to set up the mapping.
> 
> If compound pages are virtually (and physically on bare metal) contiguous
> - this ought to work. But if they are not, then this should also break on
> AMD machines with tg3 and an AMD GART enabled.

AFAIK compound pages are always physically contiguous. i.e. given a
"struct page *page" which is the head of a compound page you can do
"page++" to walk through its constituent frames.

I'm not sure about virtually contiguous. Obviously if they are in lowmem
then the 1-1 map combined with the fact that they are physically
contiguous makes them virtually contiguous too. I'm not sure what
happens if they are highmem -- since kmap (or whatever) would need to do
some extra work in this case. I've not looked but I don't recall
noticing this in the past...

Ian.


* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-10-12 12:18     ` Ian Campbell
@ 2012-10-12 13:17       ` Konrad Rzeszutek Wilk
  2012-10-29 15:55       ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 14+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-12 13:17 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel

On Fri, Oct 12, 2012 at 01:18:16PM +0100, Ian Campbell wrote:
> [...]
> 
> AFAIK compound pages are always physically contiguous. i.e. given a
> "struct page *page" which is the head of a compound page you can do
> "page++" to walk through its constituent frames.
> 
> I'm not sure about virtually contiguous. Obviously if they are in lowmem
> then the 1-1 map combined with the fact that they are physically
> contiguous makes them virtually contiguous too. I'm not sure what
> happens if they are highmem -- since kmap (or whatever) would need to do
> some extra work in this case. I've not looked but I don't recall
> noticing this in the past...

I think it also depends on how they are allocated - if you use GFP_DMA32
they will be in lowmem. And since the networking is using that by default
they would be in the 1-1 map.


* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-10-12 12:18     ` Ian Campbell
  2012-10-12 13:17       ` Konrad Rzeszutek Wilk
@ 2012-10-29 15:55       ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 14+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-29 15:55 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 6854 bytes --]

On Fri, Oct 12, 2012 at 01:18:16PM +0100, Ian Campbell wrote:
> [...]
> 
> AFAIK compound pages are always physically contiguous. i.e. given a
> "struct page *page" which is the head of a compound page you can do
> "page++" to walk through its constituent frames.
> 
> I'm not sure about virtually contiguous. Obviously if they are in lowmem
> then the 1-1 map combined with the fact that they are physically
> contiguous makes them virtually contiguous too. I'm not sure what
> happens if they are highmem -- since kmap (or whatever) would need to do
> some extra work in this case. I've not looked but I don't recall
> noticing this in the past...

So to double-check this, I wrote this nice little module (attached)
that allocates these types of pages and does 'DMA' on them.

From the tests it seems to work OK - in some cases it uses a bounce
buffer and in some it does not. And the resulting buffers do contain
the data we expected.

# modprobe dma_test
modprobe dma_test
calling  dma_test_init+0x0/0x1000 [dma_test] @ 2875
initcall dma_test_init+0x0/0x1000 [dma_test] returned 0 after 309 usecs
fallback_bus: to_cpu: va: ffff8800642dd000 (pfn:642dd, mfn:53706) w.r.t prev mfn: 53707!
fallback_bus: to_cpu: va: ffff8800642de000 (pfn:642de, mfn:53705) w.r.t prev mfn: 53706!
fallback_bus: to_cpu: va: ffff8800642df000 (pfn:642df, mfn:53704) w.r.t prev mfn: 53705!
fallback_bus: to_cpu: ffff8800642dc000 (pfn:642dc, bus frame: 53707) <= ffff880070046000 (addr: 70046000, frame: 186)
fallback_bus: to_cpu: ffff8800642dd000 (pfn:642dd, bus frame: 53706) <= ffff880070047000 (addr: 70047000, frame: 187)
fallback_bus: to_cpu: ffff8800642de000 (pfn:642de, bus frame: 53705) <= ffff880070048000 (addr: 70048000, frame: 188)
fallback_bus: to_cpu: ffff8800642df000 (pfn:642df, bus frame: 53704) <= ffff880070049000 (addr: 70049000, frame: 189)
fallback_bus: to_dev: va: ffff880059521000 (pfn:59521, mfn:488c2) w.r.t prev mfn: 488c3!
fallback_bus: to_dev: va: ffff880059522000 (pfn:59522, mfn:488c1) w.r.t prev mfn: 488c2!
fallback_bus: to_dev: va: ffff880059523000 (pfn:59523, mfn:488c0) w.r.t prev mfn: 488c1!
fallback_bus: to_dev: va: ffff880059524000 (pfn:59524, mfn:488bf) w.r.t prev mfn: 488c0!
fallback_bus: to_dev: va: ffff880059525000 (pfn:59525, mfn:488be) w.r.t prev mfn: 488bf!
fallback_bus: to_dev: va: ffff880059526000 (pfn:59526, mfn:488bd) w.r.t prev mfn: 488be!
fallback_bus: to_dev: va: ffff880059527000 (pfn:59527, mfn:488bc) w.r.t prev mfn: 488bd!
fallback_bus: to_dev: 0xffff88007004a000(bounce)  <=  0xffff880059520000 (sz: 32768)
fallback_bus: to_dev: ffff880059520000 (pfn:59520, bus frame: 488c3) => ffff88007004a000 (addr: 7004a000, frame: 18a)
fallback_bus: to_dev: ffff880059521000 (pfn:59521, bus frame: 488c2) => ffff88007004b000 (addr: 7004b000, frame: 18b)
fallback_bus: to_dev: ffff880059522000 (pfn:59522, bus frame: 488c1) => ffff88007004c000 (addr: 7004c000, frame: 18c)
fallback_bus: to_dev: ffff880059523000 (pfn:59523, bus frame: 488c0) => ffff88007004d000 (addr: 7004d000, frame: 18d)
fallback_bus: to_dev: ffff880059524000 (pfn:59524, bus frame: 488bf) => ffff88007004e000 (addr: 7004e000, frame: 18e)
fallback_bus: to_dev: ffff880059525000 (pfn:59525, bus frame: 488be) => ffff88007004f000 (addr: 7004f000, frame: 18f)
fallback_bus: to_dev: ffff880059526000 (pfn:59526, bus frame: 488bd) => ffff880070050000 (addr: 70050000, frame: 190)
fallback_bus: to_dev: ffff880059527000 (pfn:59527, bus frame: 488bc) => ffff880070051000 (addr: 70051000, frame: 191)


fallback_bus: to_dev: ffff880059520000 with DMA (18a000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059521000 with DMA (18b000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059522000 with DMA (18c000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059523000 with DMA (18d000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059524000 with DMA (18e000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059525000 with DMA (18f000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059526000 with DMA (190000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059527000 with DMA (191000) has ffffffcc (expected ffffffcc)
fallback_bus: to_cpu: 0xffff880070046000(bounce)  =>  0xffff8800642dc000 (sz: 16384)
fallback_bus: to_cpu: ffff8800642dc000 with DMA (186000) has ffffffdd (expected ffffffdd)
fallback_bus: to_cpu: ffff8800642dd000 with DMA (187000) has ffffffdd (expected ffffffdd)
fallback_bus: to_cpu: ffff8800642de000 with DMA (188000) has ffffffdd (expected ffffffdd)
fallback_bus: to_cpu: ffff8800642df000 with DMA (189000) has ffffffdd (expected ffffffdd)
fallback_bus: to_cpu: 0xffff880070046000(bounce)  =>  0xffff8800642dc000 (sz: 16384)

> 
> Ian.

[-- Attachment #2: dma_test.c --]
[-- Type: text/plain, Size: 5659 bytes --]

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/pagemap.h>
#include <linux/init.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <xen/page.h>

#define DMA_TEST  "0.1"

MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
MODULE_DESCRIPTION("dma_test");
MODULE_LICENSE("GPL");
MODULE_VERSION(DMA_TEST);

static struct bus_type fallback_bus_type = {
	.name = "fallback_bus:",
};

static void fake_release(struct device *dev)
{
	/* No kfree as the device was allocated on stack. */
}

struct args {
	int len;
	enum dma_data_direction dir;
};

#define MAGIC_DEVICE 0xffffffdd
#define MAGIC_CPU 0xffffffcc

static int dma_test_thread(void *arg)
{
	struct page *page;
	dma_addr_t dma_addr = 0;
	struct device fake = {
		.coherent_dma_mask = DMA_BIT_MASK(32),
		.bus = &fallback_bus_type,
		.release = fake_release,
	};
	gfp_t gfp = __GFP_COMP | __GFP_NOWARN | GFP_ATOMIC;
	int ret;
	int i;
	void *addr;
	struct page *p;

	struct args *args = (struct args *)arg;
	int dir = args->dir;
	int len = args->len;

	dev_set_name(&fake, "%s", dir == DMA_TO_DEVICE ? "to_dev" : "to_cpu");
 	fake.dma_mask = &fake.coherent_dma_mask;
 	ret = device_register(&fake);
	if (ret)
		goto out;

	do {
		unsigned long prev_mfn = 0;
		bool bus_and_dma_same;

		page = alloc_pages(gfp, get_order(len));
		p = page;
		/* Check that the bus addresses are contiguous. */
		for (i = 0; i < len / PAGE_SIZE; i++, p++) {
			unsigned long pfn, mfn;

			addr = page_address(p);
			pfn = PFN_DOWN(virt_to_phys(addr));
			if (xen_domain())
				mfn = pfn_to_mfn(pfn);
			else
				mfn = pfn;
			if (i != 0) {
				if (prev_mfn + 1 != mfn)
					dev_warn(&fake, "va: %lx (pfn:%lx, mfn:%lx) w.r.t prev mfn: %lx!\n",
						 (unsigned long)addr, pfn, mfn, prev_mfn);
			}
			prev_mfn = mfn;
		}
		dma_addr = dma_map_page(&fake, page, 0 /* no offset */, len, dir);
		/* Note, dma_addr is the physical address ! */
		if (dma_mapping_error(&fake, dma_addr)) {
			dev_warn(&fake, "DMA %lx for %lx is not right\n", (unsigned long)dma_addr,
				(unsigned long)page_address(page));
			__free_pages(page, get_order(len));
			page = NULL;
		}
		bus_and_dma_same = false;
		if (page) {
			unsigned long phys;
			unsigned long pfn, mfn, bus_addr_mfn;
			unsigned long bus_addr = 0;

			p = page;
			for (i = 0; i < len / PAGE_SIZE; i++, p++) {
				void *bus_va;

				addr = page_address(p);
				phys = virt_to_phys(addr);
				pfn = PFN_DOWN(phys);

				bus_va = (void *)(dma_addr + (i * PAGE_SIZE));

				if (xen_domain()) {
					void * tmp;

					/* Find the bus frame for the physical frame*/
					mfn = pfn_to_mfn(pfn);
					/* and .. voodoo time! */
					bus_addr_mfn = PFN_DOWN(dma_addr + (i * PAGE_SIZE));
					bus_addr = PFN_PHYS(mfn_to_pfn(bus_addr_mfn));
					tmp = __va(bus_addr);
					bus_va = mfn_to_virt(bus_addr_mfn);
					WARN(bus_va != tmp, "Expected %lx (%lx+%d*PAGE_SIZE), got: %lx (pfn: %lx, mfn: %lx)!\n",
						(unsigned long)bus_va, (unsigned long)dma_addr, i, (unsigned long)tmp, PFN_DOWN(bus_addr), bus_addr_mfn);
			} else {
					mfn = pfn;
					/* Assume DMA addr == physical addr */
					bus_addr = dma_addr + (i * PAGE_SIZE);
					bus_addr_mfn = PFN_DOWN(bus_addr);
					bus_va = __va(PFN_PHYS(bus_addr_mfn));
				}

				dev_info(&fake, "%lx (pfn:%lx, bus frame: %lx) %s %lx (addr: %lx, frame: %lx)\n",
					(unsigned long)addr, pfn, mfn,
					dir == DMA_TO_DEVICE ? "=>" : "<=",
					(unsigned long)bus_va, bus_addr, bus_addr_mfn);

				if (!virt_addr_valid(bus_va))
					break;
				if (!virt_addr_valid(addr))
					break;

				/* CPU */
				memset(addr, 0xCC, PAGE_SIZE);

				/* Device */
				memset(bus_va, 0xDD, PAGE_SIZE);
				if (addr == bus_va)
					bus_and_dma_same = true;
			}
		}
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout_interruptible(5*HZ);

		if (!page)
			continue;
		p = page;

		for (i = 0; i < len / PAGE_SIZE; i++, p++) {
			if (bus_and_dma_same)
				continue;
			addr = page_address(p);
			if (((char *)addr)[0] != MAGIC_CPU)
				dev_warn(&fake, "%lx with DMA (%lx) has %x (expected %lx)\n",
					(unsigned long)addr, (unsigned long)(dma_addr + (i * PAGE_SIZE)),
					((char *)addr)[0], (unsigned long)MAGIC_CPU);
		}
		/* sync the page */
		dma_sync_single_for_cpu(&fake, dma_addr, len, dir);
		p = page;
		for (i = 0; i < len / PAGE_SIZE; i++, p++) {
			unsigned long check_val = MAGIC_DEVICE;

			addr = page_address(p);
			if (dir == DMA_TO_DEVICE)
				check_val = MAGIC_CPU;
			if (dir == DMA_FROM_DEVICE)
				check_val = MAGIC_DEVICE;

			dev_info(&fake, "%lx with DMA (%lx) has %x (expected %lx)\n",
				(unsigned long)addr, (unsigned long)(dma_addr + (i * PAGE_SIZE)),
				((char *)addr)[0], check_val);
		}
		dma_unmap_page(&fake, dma_addr, len, dir);
		dma_addr = 0;
		__free_pages(page, get_order(len));
		page = NULL;
	} while (!kthread_should_stop());

	if (dma_addr)
		dma_unmap_page(&fake, dma_addr, len, dir);
	if (page)
		__free_pages(page, get_order(len));
	device_unregister(&fake);
out:
	return 0;
}
static struct task_struct *t[2];
static struct args a[2];

static int __init dma_test_init(void)
{
	int ret;

	/* No point doing this without SWIOTLB */
	if (!swiotlb_nr_tbl())
		return -ENODEV;

	ret = bus_register(&fallback_bus_type);
	if (ret)
		return ret;

	a[0].dir = DMA_TO_DEVICE;
	a[0].len = 32768;
	t[0] = kthread_run(dma_test_thread, &a[0], "dma_test_dev");

	a[1].len = 16384;
	a[1].dir = DMA_FROM_DEVICE;
	t[1] = kthread_run(dma_test_thread, &a[1], "dma_test_cpu");
	return 0;
}
static void __exit dma_test_exit(void)
{
	if (t[0])
		kthread_stop(t[0]);
	if (t[1])
		kthread_stop(t[1]);

 	bus_unregister(&fallback_bus_type);
}
module_init(dma_test_init);
module_exit(dma_test_exit);

[-- Attachment #3: 0001-swiotlb-Add-debugging.patch --]
[-- Type: text/plain, Size: 3496 bytes --]

From db63c863456f0914ad96db38544c364eaad787ab Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Mon, 29 Oct 2012 09:35:55 -0400
Subject: [PATCH] swiotlb: Add debugging.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 include/linux/swiotlb.h |    2 +-
 lib/swiotlb.c           |   25 ++++++++++++++++++++-----
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8d08b3e..0a17d79 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -46,7 +46,7 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
 				    enum dma_sync_target target);
 
 /* Accessory functions. */
-extern void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
+extern void swiotlb_bounce(struct device *dev, phys_addr_t phys, char *dma_addr, size_t size,
 			   enum dma_data_direction dir);
 
 extern void
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index f114bf6..e5d37a3 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -346,7 +346,7 @@ static int is_swiotlb_buffer(phys_addr_t paddr)
 /*
  * Bounce: copy the swiotlb buffer back to the original dma location
  */
-void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
+void swiotlb_bounce(struct device *dev, phys_addr_t phys, char *dma_addr, size_t size,
 		    enum dma_data_direction dir)
 {
 	unsigned long pfn = PFN_DOWN(phys);
@@ -376,6 +376,21 @@ void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
 			offset = 0;
 		}
 	} else {
+		if (size >= PAGE_SIZE) {
+			const char *type;
+
+			if (dir == DMA_TO_DEVICE)
+				type = " <= ";
+			else if (dir == DMA_FROM_DEVICE)
+				type = " => ";
+			else
+				type = " <=> ";
+			dev_info(dev, "0x%lx%s %s 0x%lx%s (sz: %ld)\n", (unsigned long)dma_addr,
+				is_swiotlb_buffer(virt_to_phys(dma_addr)) ? "(bounce)" : "",
+				type, (unsigned long)phys_to_virt(phys),
+				is_swiotlb_buffer(phys) ? "(bounce)" : "",
+				size);
+		}
 		if (dir == DMA_TO_DEVICE)
 			memcpy(dma_addr, phys_to_virt(phys), size);
 		else
@@ -483,7 +498,7 @@ found:
 	for (i = 0; i < nslots; i++)
 		io_tlb_orig_addr[index+i] = phys + (i << IO_TLB_SHIFT);
 	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
-		swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
+		swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_TO_DEVICE);
 
 	return dma_addr;
 }
@@ -518,7 +533,7 @@ swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr, size_t size,
 	 * First, sync the memory before unmapping the entry
 	 */
 	if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
-		swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+		swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_FROM_DEVICE);
 
 	/*
 	 * Return the buffer to the free list by setting the corresponding
@@ -560,13 +575,13 @@ swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size,
 	switch (target) {
 	case SYNC_FOR_CPU:
 		if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-			swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+			swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_FROM_DEVICE);
 		else
 			BUG_ON(dir != DMA_TO_DEVICE);
 		break;
 	case SYNC_FOR_DEVICE:
 		if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-			swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
+			swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_TO_DEVICE);
 		else
 			BUG_ON(dir != DMA_FROM_DEVICE);
 		break;
-- 
1.7.7.6


[-- Attachment #4: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-10-12 10:28 Dom0 physical networking/swiotlb/something issue in 3.7-rc1 Ian Campbell
  2012-10-12 11:59 ` Konrad Rzeszutek Wilk
@ 2012-11-09  9:03 ` Jan Beulich
  2012-11-09  9:16   ` Ian Campbell
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2012-11-09  9:03 UTC (permalink / raw)
  To: Ian Campbell, Konrad Rzeszutek Wilk; +Cc: xen-devel

>>> On 12.10.12 at 12:28, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> The following patch causes fairly large packet loss when transmitting
> from dom0 to the physical network, at least with my tg3 hardware, but I
> assume it can impact anything which uses this interface.
> 
> I suspect that the issue is that the compound pages allocated in this
> way are not backed by contiguous mfns and so things fall apart when the
> driver tries to do DMA.

Has this seen any sort of resolution yet? Despite having forced
NETDEV_FRAG_PAGE_MAX_ORDER to zero for Xen (following
your suggested patch, Ian), and with a different NIC (e1000e
driven) I'm seeing similar packet loss/corruption on transmits,
and only if running a debug hypervisor (in a non-debug one,
MFNs are largely contiguous, so this issue should be observable
there only very rarely).

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-11-09  9:03 ` Jan Beulich
@ 2012-11-09  9:16   ` Ian Campbell
  2012-11-09  9:40     ` Jan Beulich
  0 siblings, 1 reply; 14+ messages in thread
From: Ian Campbell @ 2012-11-09  9:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Konrad Rzeszutek Wilk

On Fri, 2012-11-09 at 09:03 +0000, Jan Beulich wrote:
> >>> On 12.10.12 at 12:28, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > The following patch causes fairly large packet loss when transmitting
> > from dom0 to the physical network, at least with my tg3 hardware, but I
> > assume it can impact anything which uses this interface.
> > 
> > I suspect that the issue is that the compound pages allocated in this
> > way are not backed by contiguous mfns and so things fall apart when the
> > driver tries to do DMA.
> 
> Has this seen any sort of resolution yet? Despite having forced
> NETDEV_FRAG_PAGE_MAX_ORDER to zero for Xen (following
> your suggested patch, Ian), and with a different NIC (e1000e
> driven) I'm seeing similar packet loss/corruption on transmits,
> and only if running a debug hypervisor (in a non-debug one,
> MFNs are largely contiguous, so this issue should be observable
> there only very rarely).

I think Konrad is still looking into the underlying swiotlb issue.

If you want to go with the workaround then there is another order>0 to
frob in net/core/sock.c
#define SKB_FRAG_PAGE_ORDER	get_order(32768)
which might help.

Dave Miller unequivocally rejected this approach so I haven't been
pursuing it any further. Perhaps once the swiotlb fix is made it will be
possible to make an argument for it based on actual performance numbers.

Ian.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-11-09  9:16   ` Ian Campbell
@ 2012-11-09  9:40     ` Jan Beulich
  2012-11-09 10:36       ` Jan Beulich
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2012-11-09  9:40 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Konrad Rzeszutek Wilk, xen-devel

>>> On 09.11.12 at 10:16, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2012-11-09 at 09:03 +0000, Jan Beulich wrote:
>> >>> On 12.10.12 at 12:28, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> > The following patch causes fairly large packet loss when transmitting
>> > from dom0 to the physical network, at least with my tg3 hardware, but I
>> > assume it can impact anything which uses this interface.
>> > 
>> > I suspect that the issue is that the compound pages allocated in this
>> > way are not backed by contiguous mfns and so things fall apart when the
>> > driver tries to do DMA.
>> 
>> Has this seen any sort of resolution yet? Despite having forced
>> NETDEV_FRAG_PAGE_MAX_ORDER to zero for Xen (following
>> your suggested patch, Ian), and with a different NIC (e1000e
>> driven) I'm seeing similar packet loss/corruption on transmits,
>> and only if running a debug hypervisor (in a non-debug one,
>> MFNs are largely contiguous, so this issue should be observable
>> there only very rarely).
> 
> I think Konrad is still looking into the underlying swiotlb issue.
> 
> If you want to go with the workaround then there is another order>0 to
> frob in net/core/sock.c
> #define SKB_FRAG_PAGE_ORDER	get_order(32768)
> which might help.

Thanks, that indeed helped. And gives me food for thought as
well, since this should - if anything at all - be a performance
improvement, not something affecting correctness.

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-11-09  9:40     ` Jan Beulich
@ 2012-11-09 10:36       ` Jan Beulich
  2012-11-09 11:43         ` Jan Beulich
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2012-11-09 10:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Ian Campbell, xen-devel

>>> On 09.11.12 at 10:40, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 09.11.12 at 10:16, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> On Fri, 2012-11-09 at 09:03 +0000, Jan Beulich wrote:
>>> >>> On 12.10.12 at 12:28, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>> > The following patch causes fairly large packet loss when transmitting
>>> > from dom0 to the physical network, at least with my tg3 hardware, but I
>>> > assume it can impact anything which uses this interface.
>>> > 
>>> > I suspect that the issue is that the compound pages allocated in this
>>> > way are not backed by contiguous mfns and so things fall apart when the
>>> > driver tries to do DMA.
>>> 
>>> Has this seen any sort of resolution yet? Despite having forced
>>> NETDEV_FRAG_PAGE_MAX_ORDER to zero for Xen (following
>>> your suggested patch, Ian), and with a different NIC (e1000e
>>> driven) I'm seeing similar packet loss/corruption on transmits,
>>> and only if running a debug hypervisor (in a non-debug one,
>>> MFNs are largely contiguous, so this issue should be observable
>>> there only very rarely).
>> 
>> I think Konrad is still looking into the underlying swiotlb issue.
>> 
>> If you want to go with the workaround then there is another order>0 to
>> frob in net/core/sock.c
>> #define SKB_FRAG_PAGE_ORDER	get_order(32768)
>> which might help.
> 
> Thanks, that indeed helped. And gives me food for thought as
> well, since this should - if anything at all - be a performance
> improvement, not something affecting correctness.

Okay, one problem in the pv-ops case seems pretty obvious:
dma_capable() does (potentially cross-page) arithmetic on a
dma_addr_t value, ignoring dis-contiguity altogether. Specifically
its first use in swiotlb_map_page() and its only use in
swiotlb_map_sg_attrs() are problematic.

In the forward ported kernels, those two checks are however
accompanied by range_needs_mapping() (aka
range_straddles_page_boundary()) checks, which ought to
take care of this. There is brokenness there with the invocations
of gnttab_dma_map_page(), but only if the initial offset is at
least PAGE_SIZE - will have to check whether that occurs.

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-11-09 10:36       ` Jan Beulich
@ 2012-11-09 11:43         ` Jan Beulich
  2012-11-09 13:48           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2012-11-09 11:43 UTC (permalink / raw)
  To: Ian Campbell, Konrad Rzeszutek Wilk; +Cc: xen-devel

>>> On 09.11.12 at 11:36, "Jan Beulich" <JBeulich@suse.com> wrote:
> In the forward ported kernels, those two checks are however
> accompanied by range_needs_mapping() (aka
> range_straddles_page_boundary()) checks, which ought to
> take care of this. There is brokenness there with the invocations
> of gnttab_dma_map_page(), but only if the initial offset is at
> least PAGE_SIZE - will have to check whether that occurs.

And indeed, fixing this also makes the problem go away when
the allocation order doesn't get forced to zero. So presumably
there's also only that one problem I had pointed out in pv-ops.

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-11-09 11:43         ` Jan Beulich
@ 2012-11-09 13:48           ` Konrad Rzeszutek Wilk
  2012-11-09 17:34             ` Jan Beulich
  0 siblings, 1 reply; 14+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-11-09 13:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Campbell, xen-devel

[-- Attachment #1: Type: text/plain, Size: 1800 bytes --]

On Fri, Nov 09, 2012 at 11:43:39AM +0000, Jan Beulich wrote:
> >>> On 09.11.12 at 11:36, "Jan Beulich" <JBeulich@suse.com> wrote:
> > In the forward ported kernels, those two checks are however
> > accompanied by range_needs_mapping() (aka
> > range_straddles_page_boundary()) checks, which ought to
> > take care of this. There is brokenness there with the invocations
> > of gnttab_dma_map_page(), but only if the initial offset is at
> > least PAGE_SIZE - will have to check whether that occurs.
> 
> And indeed, fixing this also makes the problem go away when
> the allocation order doesn't get forced to zero. So presumably
> there's also only that one problem I had pointed out in pv-ops.

The pvops one has this in the map-page variant (xen_swiotlb_map_page):

351         if (dma_capable(dev, dev_addr, size) &&
352             !range_straddles_page_boundary(phys, size) && !swiotlb_force)
353                 return dev_addr;

and in the sg variant:

494                 if (swiotlb_force ||
495                     !dma_capable(hwdev, dev_addr, sg->length) ||
496                     range_straddles_page_boundary(paddr, sg->length)) {
497                         void *map = swiotlb_tbl_map_single(hwdev,

So I think that check is OK. There is no gnttab_dma_map_page call - so that
can't be the issue.

I did play with this a bit and wrote this little driver (see attached) that
forces allocation of large pages and it worked as expected on Xen-SWIOTLB.

But while doing this I found that the 'skge' driver is busted - it does not
even work on baremetal if you do 'iommu=soft swiotlb=force'. Since
Xen-SWIOTLB would occasionally use the bounce-buffer - and with greater than
0-page order - the bug in skge became more obvious. I haven't narrowed down
where the issue is with skge yet.
> 
> Jan

[-- Attachment #2: dma_test.c --]
[-- Type: text/plain, Size: 5698 bytes --]

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/pagemap.h>
#include <linux/init.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <xen/page.h>

#define DMA_TEST  "0.1"

MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
MODULE_DESCRIPTION("dma_test");
MODULE_LICENSE("GPL");
MODULE_VERSION(DMA_TEST);

static struct bus_type fallback_bus_type = {
	.name = "fallback_bus:",
};

static void fake_release(struct device *dev)
{
	/* No kfree as the device was allocated on stack. */
}

struct args {
	int len;
	enum dma_data_direction dir;
};

#define MAGIC_DEVICE 0xffffffdd
#define MAGIC_CPU 0xffffffcc

static int dma_test_thread(void *arg)
{
	struct page *page;
	dma_addr_t dma_addr = 0;
	struct device fake = {
		.coherent_dma_mask = DMA_BIT_MASK(32),
		.bus = &fallback_bus_type,
		.release = fake_release,
	};
	gfp_t gfp = __GFP_COMP | __GFP_NOWARN | GFP_ATOMIC;
	int ret;
	int i;
	void *addr;
	struct page *p;

	struct args *args = (struct args *)arg;
	int dir = args->dir;
	int len = args->len;

	dev_set_name(&fake, "%s", dir == DMA_TO_DEVICE ? "to_dev" : "to_cpu");
 	fake.dma_mask = &fake.coherent_dma_mask;
 	ret = device_register(&fake);
	if (ret)
		goto out;

	do {
		unsigned long prev_mfn = 0;
		bool bus_and_dma_same;

		page = alloc_pages(gfp, get_order(len));
		p = page;
		/* Check that the bus addresses are contiguous. */
		for (i = 0; i < len / PAGE_SIZE; i++, p++) {
			unsigned long pfn, mfn;

			addr = page_address(p);
			pfn = PFN_DOWN(virt_to_phys(addr));
			if (xen_domain())
				mfn = pfn_to_mfn(pfn);
			else
				mfn = pfn;
			if (i != 0) {
				if (prev_mfn + 1 != mfn)
					dev_warn(&fake, "va: %lx (pfn:%lx, mfn:%lx) w.r.t prev mfn: %lx!\n",
						 (unsigned long)addr, pfn, mfn, prev_mfn);
			}
			prev_mfn = mfn;
		}
		dma_addr = dma_map_page(&fake, page, 0 /* no offset */, len, dir);
		/* Note, dma_addr is the physical address ! */
		if (dma_mapping_error(&fake, dma_addr)) {
			dev_warn(&fake, "DMA %lx for %lx is not right\n", (unsigned long)dma_addr,
				(unsigned long)page_address(page));
			__free_pages(page, get_order(len));
			page = NULL;
		}
		bus_and_dma_same = false;
		if (page) {
			unsigned long phys;
			unsigned long pfn, mfn, bus_addr_mfn;
			unsigned long bus_addr = 0;

			p = page;
			for (i = 0; i < len / PAGE_SIZE; i++, p++) {
				void *bus_va;

				addr = page_address(p);
				phys = virt_to_phys(addr);
				pfn = PFN_DOWN(phys);

				bus_va = (void *)(dma_addr + (i * PAGE_SIZE));

				if (xen_domain()) {
					void * tmp;

					/* Find the bus frame for the physical frame*/
					mfn = pfn_to_mfn(pfn);
					/* and .. voodoo time! */
					bus_addr_mfn = PFN_DOWN(dma_addr + (i * PAGE_SIZE));
					bus_addr = PFN_PHYS(mfn_to_pfn(bus_addr_mfn));
					tmp = __va(bus_addr);
					bus_va = mfn_to_virt(bus_addr_mfn);
					WARN(bus_va != tmp, "Expected %lx (%lx+%d*PAGE_SIZE), got: %lx (pfn: %lx, mfn: %lx)!\n",
						(unsigned long)bus_va, (unsigned long)dma_addr, i, (unsigned long)tmp, PFN_DOWN(bus_addr), bus_addr_mfn);
				} else {
					mfn = pfn;
					bus_addr = (unsigned long)bus_va;
					/* Assume DMA addr == physical addr */
					bus_addr_mfn = PFN_DOWN(bus_addr);
					bus_va = __va(PFN_PHYS(bus_addr_mfn));
				}

				dev_info(&fake, "%lx (pfn:%lx, bus frame: %lx) %s %lx (addr: %lx, frame: %lx)\n",
					(unsigned long)addr, pfn, mfn,
					dir == DMA_TO_DEVICE ? "=>" : "<=",
					(unsigned long)bus_va, bus_addr, bus_addr_mfn);

				if (!virt_addr_valid(bus_va))
					break;
				if (!virt_addr_valid(addr))
					break;

				/* CPU */
				memset(addr, 0xCC, PAGE_SIZE);

				/* Device */
				memset(bus_va, 0xDD, PAGE_SIZE);
				if (addr == bus_va)
					bus_and_dma_same = true;
			}
		}
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout_interruptible(5*HZ);

		if (!page)
			continue;
		p = page;

		for (i = 0; i < len / PAGE_SIZE; i++, p++) {
			if (bus_and_dma_same)
				continue;
			addr = page_address(p);
			if (((char *)addr)[0] != MAGIC_CPU)
				dev_warn(&fake, "%lx with DMA (%lx) has %x (expected %lx)\n",
					(unsigned long)addr, (unsigned long)(dma_addr + (i * PAGE_SIZE)),
					((char *)addr)[0], (unsigned long)MAGIC_CPU);
		}
		/* sync the page */
		dma_sync_single_for_cpu(&fake, dma_addr, len, dir);
		p = page;
		for (i = 0; i < len / PAGE_SIZE; i++, p++) {
			unsigned long check_val = MAGIC_DEVICE;

			addr = page_address(p);
			if (dir == DMA_TO_DEVICE)
				check_val = MAGIC_CPU;
			if (dir == DMA_FROM_DEVICE)
				check_val = MAGIC_DEVICE;

			dev_info(&fake, "%lx with DMA (%lx) has %x (expected %lx)\n",
				(unsigned long)addr, (unsigned long)(dma_addr + (i * PAGE_SIZE)),
				((char *)addr)[0], check_val);
		}
		dma_unmap_page(&fake, dma_addr, len, dir);
		dma_addr = 0;
		__free_pages(page, get_order(len));
		page = NULL;
	} while (!kthread_should_stop());

	if (dma_addr)
		dma_unmap_page(&fake, dma_addr, len, dir);
	if (page)
		__free_pages(page, get_order(len));
	/* device_unregister() drops the final reference itself. */
	device_unregister(&fake);
out:
	return 0;
}
static struct task_struct *t[2];
static struct args a[2];

static int __init dma_test_init(void)
{
	int ret;

	/* No point doing this without SWIOTLB */
	if (!swiotlb_nr_tbl())
		return -ENODEV;

	ret = bus_register(&fallback_bus_type);
	if (ret)
		return ret;

	a[0].dir = DMA_TO_DEVICE;
	a[0].len = 32768;
	t[0] = kthread_run(dma_test_thread, &a[0], "dma_test_dev");

	a[1].len = 16384;
	a[1].dir = DMA_FROM_DEVICE;
	t[1] = kthread_run(dma_test_thread, &a[1], "dma_test_cpu");
	return 0;
}
static void __exit dma_test_exit(void)
{
	if (t[0])
		kthread_stop(t[0]);
	if (t[1])
		kthread_stop(t[1]);

 	bus_unregister(&fallback_bus_type);
}
module_init(dma_test_init);
module_exit(dma_test_exit);


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
  2012-11-09 13:48           ` Konrad Rzeszutek Wilk
@ 2012-11-09 17:34             ` Jan Beulich
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Beulich @ 2012-11-09 17:34 UTC (permalink / raw)
  To: konrad.wilk; +Cc: Ian.Campbell, xen-devel

>>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 11/09/12 2:48 PM >>>
>On Fri, Nov 09, 2012 at 11:43:39AM +0000, Jan Beulich wrote:
>> >>> On 09.11.12 at 11:36, "Jan Beulich" <JBeulich@suse.com> wrote:
>> > In the forward ported kernels, those two checks are however
>> > accompanied by range_needs_mapping() (aka
>> > range_straddles_page_boundary()) checks, which ought to
>> > take care of this. There is brokenness there with the invocations
>> > of gnttab_dma_map_page(), but only if the initial offset is at
>> > least PAGE_SIZE - will have to check whether that occurs.
>> 
>> And indeed, fixing this also makes the problem go away when
>> the allocation order doesn't get forced to zero. So presumably
>> there's also only that one problem I had pointed out in pv-ops.
>
>The pvops one has this in the map-page variant (xen_swiotlb_map_page):
>
>351         if (dma_capable(dev, dev_addr, size) &&
>352             !range_straddles_page_boundary(phys, size) && !swiotlb_force)
>353                 return dev_addr;
>
>and in the sg variant:
>
>494                 if (swiotlb_force ||
>495                     !dma_capable(hwdev, dev_addr, sg->length) ||
>496                     range_straddles_page_boundary(paddr, sg->length)) {
>497                         void *map = swiotlb_tbl_map_single(hwdev,

Oh, right, I forgot that there's yet another clone of that code under drivers/xen/.

>So I think that check is OK. There is no gnttab_dma_map_page call - so that
>can't be the issue.

Indeed.

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-11-09 17:34 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-12 10:28 Dom0 physical networking/swiotlb/something issue in 3.7-rc1 Ian Campbell
2012-10-12 11:59 ` Konrad Rzeszutek Wilk
2012-10-12 12:09   ` Ian Campbell
2012-10-12 12:10   ` Konrad Rzeszutek Wilk
2012-10-12 12:18     ` Ian Campbell
2012-10-12 13:17       ` Konrad Rzeszutek Wilk
2012-10-29 15:55       ` Konrad Rzeszutek Wilk
2012-11-09  9:03 ` Jan Beulich
2012-11-09  9:16   ` Ian Campbell
2012-11-09  9:40     ` Jan Beulich
2012-11-09 10:36       ` Jan Beulich
2012-11-09 11:43         ` Jan Beulich
2012-11-09 13:48           ` Konrad Rzeszutek Wilk
2012-11-09 17:34             ` Jan Beulich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).