From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: xen-devel <xen-devel@lists.xen.org>
Subject: Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1
Date: Mon, 29 Oct 2012 11:55:35 -0400 [thread overview]
Message-ID: <20121029155535.GA25691@phenom.dumpdata.com> (raw)
In-Reply-To: <1350044296.14806.100.camel@zakaz.uk.xensource.com>
[-- Attachment #1: Type: text/plain, Size: 6854 bytes --]
On Fri, Oct 12, 2012 at 01:18:16PM +0100, Ian Campbell wrote:
> On Fri, 2012-10-12 at 13:10 +0100, Konrad Rzeszutek Wilk wrote:
> > On Fri, Oct 12, 2012 at 07:59:49AM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Oct 12, 2012 at 11:28:08AM +0100, Ian Campbell wrote:
> > > > Hi Konrad,
> > > >
> > > > The following patch causes fairly large packet loss when transmitting
> > > > from dom0 to the physical network, at least with my tg3 hardware, but I
> > > > assume it can impact anything which uses this interface.
> > >
> > > Ah, that would explain why one of my machines suddenly started
> > > developing checksum errors (and had a tg3 card). I hadn't gotten
> > > deep into it.
> > > >
> > > > I suspect that the issue is that the compound pages allocated in this
> > > > way are not backed by contiguous mfns and so things fall apart when the
> > > > driver tries to do DMA.
> > >
> > > So this should also be easily reproduced on barmetal with 'iommu=soft' then.
> > > >
> > > > However I don't understand why the swiotlb is not fixing this up
> > > > successfully? The tg3 driver seems to use pci_map_single on this data.
> > > > Any thoughts? Perhaps the swiotlb (either generically or in the Xen
> > > > backend) doesn't correctly handle compound pages?
> > >
> > > The assumption is that it is just a page. I am surprsed that the other
> > > IOMMUs aren't hitting this as well - ah, that is b/c they do handle
> > > a virtual address of more than one PAGE_SIZE..
> >
> > So.. the GART one (AMD poor man IOTLB - was used for AGP card
> > translation, but can still be used as an IOMMU - and is still present on
> > some AMD machines), looks to suffer the same problem.
> >
> > But perhaps not - can you explain to me if a compound page
> > is virtually contingous? One of the things the GART does for
> > pci_map_single is call page_to_phys(p), feeds the CPU physical address
> > (and size) into the GART engine to setup the mapping.
> >
> > If compound pages are virtually (and physically on barmetal) contingous
> > - this ought to work. But if they are not, then this should also break on
> > AMD machines with tg3 and a AMD GART enabled.
>
> AFAIK compound pages are always physically contiguous. i.e. given a
> "struct page *page" which is the head of a compound page you can do
> "page++" to walk through its constituent frames.
>
> I'm not sure about virtually contiguous. Obviously if they are in lowmem
> then the 1-1 map combined with the fact that they are physically
> contiguous makes them virtually contiguous too. I'm not sure what
> happens if they are highmem -- since kmap (or whatever) would need to do
> some extra work in this case. I've not looked but I don't recall
> noticing this in the past...
So to double check this, I wrote this nice little module (attached)
that would allocate these type of pages and do 'DMA' on them.
>From the tests it seems to work OK - in some cases it uses a bounce
buffer and in some it does not. And the resulting buffers do contain
the data we expected.
# modprobe dma_test
modprobe dma_test
calling dma_test_init+0x0/0x1000 [dma_test] @ 2875
initcall dma_test_init+0x0/0x1000 [dma_test] returned 0 after 309 usecs
fallback_bus: to_cpu: va: ffff8800642dd000 (pfn:642dd, mfn:53706) w.r.t prev mfn: 53707!
fallback_bus: to_cpu: va: ffff8800642de000 (pfn:642de, mfn:53705) w.r.t prev mfn: 53706!
fallback_bus: to_cpu: va: ffff8800642df000 (pfn:642df, mfn:53704) w.r.t prev mfn: 53705!
fallback_bus: to_cpu: ffff8800642dc000 (pfn:642dc, bus frame: 53707) <= ffff880070046000 (addr: 70046000, frame: 186)
fallback_bus: to_cpu: ffff8800642dd000 (pfn:642dd, bus frame: 53706) <= ffff880070047000 (addr: 70047000, frame: 187)
fallback_bus: to_cpu: ffff8800642de000 (pfn:642de, bus frame: 53705) <= ffff880070048000 (addr: 70048000, frame: 188)
fallback_bus: to_cpu: ffff8800642df000 (pfn:642df, bus frame: 53704) <= ffff880070049000 (addr: 70049000, frame: 189)
fallback_bus: to_dev: va: ffff880059521000 (pfn:59521, mfn:488c2) w.r.t prev mfn: 488c3!
fallback_bus: to_dev: va: ffff880059522000 (pfn:59522, mfn:488c1) w.r.t prev mfn: 488c2!
fallback_bus: to_dev: va: ffff880059523000 (pfn:59523, mfn:488c0) w.r.t prev mfn: 488c1!
fallback_bus: to_dev: va: ffff880059524000 (pfn:59524, mfn:488bf) w.r.t prev mfn: 488c0!
fallback_bus: to_dev: va: ffff880059525000 (pfn:59525, mfn:488be) w.r.t prev mfn: 488bf!
fallback_bus: to_dev: va: ffff880059526000 (pfn:59526, mfn:488bd) w.r.t prev mfn: 488be!
fallback_bus: to_dev: va: ffff880059527000 (pfn:59527, mfn:488bc) w.r.t prev mfn: 488bd!
fallback_bus: to_dev: 0xffff88007004a000(bounce) <= 0xffff880059520000 (sz: 32768)
fallback_bus: to_dev: ffff880059520000 (pfn:59520, bus frame: 488c3) => ffff88007004a000 (addr: 7004a000, frame: 18a)
fallback_bus: to_dev: ffff880059521000 (pfn:59521, bus frame: 488c2) => ffff88007004b000 (addr: 7004b000, frame: 18b)
fallback_bus: to_dev: ffff880059522000 (pfn:59522, bus frame: 488c1) => ffff88007004c000 (addr: 7004c000, frame: 18c)
fallback_bus: to_dev: ffff880059523000 (pfn:59523, bus frame: 488c0) => ffff88007004d000 (addr: 7004d000, frame: 18d)
fallback_bus: to_dev: ffff880059524000 (pfn:59524, bus frame: 488bf) => ffff88007004e000 (addr: 7004e000, frame: 18e)
fallback_bus: to_dev: ffff880059525000 (pfn:59525, bus frame: 488be) => ffff88007004f000 (addr: 7004f000, frame: 18f)
fallback_bus: to_dev: ffff880059526000 (pfn:59526, bus frame: 488bd) => ffff880070050000 (addr: 70050000, frame: 190)
fallback_bus: to_dev: ffff880059527000 (pfn:59527, bus frame: 488bc) => ffff880070051000 (addr: 70051000, frame: 191)
fallback_bus: to_dev: ffff880059520000 with DMA (18a000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059521000 with DMA (18b000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059522000 with DMA (18c000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059523000 with DMA (18d000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059524000 with DMA (18e000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059525000 with DMA (18f000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059526000 with DMA (190000) has ffffffcc (expected ffffffcc)
fallback_bus: to_dev: ffff880059527000 with DMA (191000) has ffffffcc (expected ffffffcc)
fallback_bus: to_cpu: 0xffff880070046000(bounce) => 0xffff8800642dc000 (sz: 16384)
fallback_bus: to_cpu: ffff8800642dc000 with DMA (186000) has ffffffdd (expected ffffffdd)
fallback_bus: to_cpu: ffff8800642dd000 with DMA (187000) has ffffffdd (expected ffffffdd)
fallback_bus: to_cpu: ffff8800642de000 with DMA (188000) has ffffffdd (expected ffffffdd)
fallback_bus: to_cpu: ffff8800642df000 with DMA (189000) has ffffffdd (expected ffffffdd)
fallback_bus: to_cpu: 0xffff880070046000(bounce) => 0xffff8800642dc000 (sz: 16384)
>
> Ian.
[-- Attachment #2: dma_test.c --]
[-- Type: text/plain, Size: 5659 bytes --]
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/pagemap.h>
#include <linux/init.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <xen/page.h>
#define DMA_TEST "0.1"
MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
MODULE_DESCRIPTION("dma_test");
MODULE_LICENSE("GPL");
MODULE_VERSION(DMA_TEST);
static struct bus_type fallback_bus_type = {
.name = "fallback_bus:",
};
static void fake_release(struct device *dev)
{
/* No kfree as the device was allocated on stack. */
}
struct args {
int len;
enum dma_data_direction dir;
};
#define MAGIC_DEVICE 0xffffffdd
#define MAGIC_CPU 0xffffffcc
static int dma_test_thread(void *arg)
{
struct page *page;
dma_addr_t dma_addr = 0;
struct device fake = {
.coherent_dma_mask = DMA_BIT_MASK(32),
.bus = &fallback_bus_type,
.release = fake_release,
};
gfp_t gfp = __GFP_COMP | __GFP_NOWARN | GFP_ATOMIC;
int ret;
int i;
void *addr;
struct page *p;
struct args *args = (struct args *)arg;
int dir = args->dir;
int len = args->len;
dev_set_name(&fake, "%s", dir == DMA_TO_DEVICE ? "to_dev" : "to_cpu");
fake.dma_mask = &fake.coherent_dma_mask;
ret = device_register(&fake);
if (ret)
goto out;
do {
unsigned long prev_mfn = 0;
bool bus_and_dma_same;
page = alloc_pages(gfp, get_order(len));
p = page;
/* Check that the bus addresses are contingous. */
for (i = 0; i < len / PAGE_SIZE; i++, p++) {
unsigned long pfn, mfn;
addr = page_address(p);
pfn = PFN_DOWN(virt_to_phys(addr));
if (xen_domain())
mfn = pfn_to_mfn(pfn);
else
mfn = pfn;
if (i != 0) {
if (prev_mfn + 1 != mfn)
dev_warn(&fake, "va: %lx (pfn:%lx, mfn:%lx) w.r.t prev mfn: %lx!\n",
(unsigned long)addr, pfn, mfn, prev_mfn);
}
prev_mfn = mfn;
}
dma_addr = dma_map_page(&fake, page, 0 /* no offset */, len, dir);
/* Note, dma_addr is the physical address ! */
if (dma_mapping_error(&fake, dma_addr)) {
dev_warn(&fake, "DMA %lx for %lx is not right\n", (unsigned long)dma_addr,
(unsigned long)page_address(page));
__free_pages(page, get_order(len));
page = NULL;
}
bus_and_dma_same = false;
if (page) {
unsigned long phys;
unsigned long pfn, mfn, bus_addr_mfn;
unsigned long bus_addr = 0;
p = page;
for (i = 0; i < len / PAGE_SIZE; i++, p++) {
void *bus_va;
addr = page_address(p);
phys = virt_to_phys(addr);
pfn = PFN_DOWN(phys);
bus_va = (void *)(dma_addr + (i * PAGE_SIZE));
if (xen_domain()) {
void * tmp;
/* Find the bus frame for the physical frame*/
mfn = pfn_to_mfn(pfn);
/* and .. voodoo time! */
bus_addr_mfn = PFN_DOWN(dma_addr + (i * PAGE_SIZE));
bus_addr = PFN_PHYS(mfn_to_pfn(bus_addr_mfn));
tmp = __va(bus_addr);
bus_va = mfn_to_virt(bus_addr_mfn);
WARN(bus_va != tmp, "Expected %lx (%lx+%d*PAGE_SIZE), got: %lx (pfn: %lx, mfn: %lx)!\n",
(unsigned long)bus_va, (unsigned long)dma_addr, i, (unsigned long)tmp, PFN_DOWN(bus_addr), bus_addr_mfn);
} else {
mfn = pfn;
/* Assume DMA addr == physical addr */
bus_addr_mfn = PFN_DOWN(bus_addr);
bus_va = __va(PFN_PHYS(bus_addr_mfn));
}
dev_info(&fake, "%lx (pfn:%lx, bus frame: %lx) %s %lx (addr: %lx, frame: %lx)\n",
(unsigned long)addr, pfn, mfn,
dir == DMA_TO_DEVICE ? "=>" : "<=",
(unsigned long)bus_va, bus_addr, bus_addr_mfn);
if (!virt_addr_valid(bus_va))
break;
if (!virt_addr_valid(addr))
break;
/* CPU */
memset(addr, 0xCC, PAGE_SIZE);
/* Device */
memset(bus_va, 0xDD, PAGE_SIZE);
if (addr == bus_va)
bus_and_dma_same = true;
}
}
set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout_interruptible(5*HZ);
if (!page)
continue;
p = page;
for (i = 0; i < len / PAGE_SIZE; i++, p++) {
if (bus_and_dma_same)
continue;
addr = page_address(p);
if (((char *)addr)[0] != MAGIC_CPU)
dev_warn(&fake, "%lx with DMA (%lx) has %x (expected %lx)\n",
(unsigned long)addr, (unsigned long)(dma_addr + (i * PAGE_SIZE)),
((char *)addr)[0], (unsigned long)MAGIC_CPU);
}
/* sync the page */
dma_sync_single_for_cpu(&fake, dma_addr, len, dir);
p = page;
for (i = 0; i < len / PAGE_SIZE; i++, p++) {
unsigned long check_val = MAGIC_DEVICE;
addr = page_address(p);
if (dir == DMA_TO_DEVICE)
check_val = MAGIC_CPU;
if (dir == DMA_FROM_DEVICE)
check_val = MAGIC_DEVICE;
dev_info(&fake, "%lx with DMA (%lx) has %x (expected %lx)\n",
(unsigned long)addr, (unsigned long)(dma_addr + (i * PAGE_SIZE)),
((char *)addr)[0], check_val);
}
dma_unmap_page(&fake, dma_addr, len, dir);
dma_addr = 0;
__free_pages(page, get_order(len));
page = NULL;
} while (!kthread_should_stop());
if (dma_addr)
dma_unmap_page(&fake, dma_addr, len, dir);
if (page)
__free_pages(page, get_order(len));
put_device(&fake);
device_unregister(&fake);
out:
return 0;
}
static struct task_struct *t[2];
static struct args a[2];
static int __init dma_test_init(void)
{
int ret;
/* No point doing this without SWIOTLB */
if (!swiotlb_nr_tbl())
return -ENODEV;
ret = bus_register(&fallback_bus_type);
if (ret)
return ret;
a[0].dir = DMA_TO_DEVICE;
a[0].len = 32768;
t[0] = kthread_run(dma_test_thread, &a[0], "dma_test_dev");
a[1].len = 16384;
a[1].dir = DMA_FROM_DEVICE;
t[1] = kthread_run(dma_test_thread, &a[1], "dma_test_cpu");
return 0;
}
static void __exit dma_test_exit(void)
{
if (t[0])
kthread_stop(t[0]);
if (t[1])
kthread_stop(t[1]);
bus_unregister(&fallback_bus_type);
}
module_init(dma_test_init);
module_exit(dma_test_exit);
[-- Attachment #3: 0001-swiotlb-Add-debugging.patch --]
[-- Type: text/plain, Size: 3496 bytes --]
>From db63c863456f0914ad96db38544c364eaad787ab Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Mon, 29 Oct 2012 09:35:55 -0400
Subject: [PATCH] swiotlb: Add debugging.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
include/linux/swiotlb.h | 2 +-
lib/swiotlb.c | 25 ++++++++++++++++++++-----
2 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8d08b3e..0a17d79 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -46,7 +46,7 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
enum dma_sync_target target);
/* Accessory functions. */
-extern void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
+extern void swiotlb_bounce(struct device *dev, phys_addr_t phys, char *dma_addr, size_t size,
enum dma_data_direction dir);
extern void
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index f114bf6..e5d37a3 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -346,7 +346,7 @@ static int is_swiotlb_buffer(phys_addr_t paddr)
/*
* Bounce: copy the swiotlb buffer back to the original dma location
*/
-void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
+void swiotlb_bounce(struct device *dev, phys_addr_t phys, char *dma_addr, size_t size,
enum dma_data_direction dir)
{
unsigned long pfn = PFN_DOWN(phys);
@@ -376,6 +376,21 @@ void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
offset = 0;
}
} else {
+ if (size >= PAGE_SIZE) {
+ const char *type;
+
+ if (dir == DMA_TO_DEVICE)
+ type = " <= ";
+ else if (dir == DMA_FROM_DEVICE)
+ type = " => ";
+ else
+ type = " <=> ";
+ dev_info(dev, "0x%lx%s %s 0x%lx%s (sz: %ld)\n", (unsigned long)dma_addr,
+ is_swiotlb_buffer(virt_to_phys(dma_addr)) ? "(bounce)" : "",
+ type, (unsigned long)phys_to_virt(phys),
+ is_swiotlb_buffer(phys) ? "(bounce)" : "",
+ size);
+ }
if (dir == DMA_TO_DEVICE)
memcpy(dma_addr, phys_to_virt(phys), size);
else
@@ -483,7 +498,7 @@ found:
for (i = 0; i < nslots; i++)
io_tlb_orig_addr[index+i] = phys + (i << IO_TLB_SHIFT);
if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
- swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
+ swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_TO_DEVICE);
return dma_addr;
}
@@ -518,7 +533,7 @@ swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr, size_t size,
* First, sync the memory before unmapping the entry
*/
if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
- swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+ swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_FROM_DEVICE);
/*
* Return the buffer to the free list by setting the corresponding
@@ -560,13 +575,13 @@ swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size,
switch (target) {
case SYNC_FOR_CPU:
if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
- swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+ swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_FROM_DEVICE);
else
BUG_ON(dir != DMA_TO_DEVICE);
break;
case SYNC_FOR_DEVICE:
if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
- swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
+ swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_TO_DEVICE);
else
BUG_ON(dir != DMA_FROM_DEVICE);
break;
--
1.7.7.6
[-- Attachment #4: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
next prev parent reply other threads:[~2012-10-29 15:55 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-10-12 10:28 Dom0 physical networking/swiotlb/something issue in 3.7-rc1 Ian Campbell
2012-10-12 11:59 ` Konrad Rzeszutek Wilk
2012-10-12 12:09 ` Ian Campbell
2012-10-12 12:10 ` Konrad Rzeszutek Wilk
2012-10-12 12:18 ` Ian Campbell
2012-10-12 13:17 ` Konrad Rzeszutek Wilk
2012-10-29 15:55 ` Konrad Rzeszutek Wilk [this message]
2012-11-09 9:03 ` Jan Beulich
2012-11-09 9:16 ` Ian Campbell
2012-11-09 9:40 ` Jan Beulich
2012-11-09 10:36 ` Jan Beulich
2012-11-09 11:43 ` Jan Beulich
2012-11-09 13:48 ` Konrad Rzeszutek Wilk
2012-11-09 17:34 ` Jan Beulich
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20121029155535.GA25691@phenom.dumpdata.com \
--to=konrad.wilk@oracle.com \
--cc=Ian.Campbell@citrix.com \
--cc=xen-devel@lists.xen.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.