From mboxrd@z Thu Jan 1 00:00:00 1970 From: Konrad Rzeszutek Wilk Subject: Re: Dom0 physical networking/swiotlb/something issue in 3.7-rc1 Date: Mon, 29 Oct 2012 11:55:35 -0400 Message-ID: <20121029155535.GA25691@phenom.dumpdata.com> References: <1350037688.14806.93.camel@zakaz.uk.xensource.com> <20121012115949.GB4028@localhost.localdomain> <20121012121049.GB4155@localhost.localdomain> <1350044296.14806.100.camel@zakaz.uk.xensource.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="J/dobhs11T7y2rNN" Return-path: Content-Disposition: inline In-Reply-To: <1350044296.14806.100.camel@zakaz.uk.xensource.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Ian Campbell Cc: xen-devel List-Id: xen-devel@lists.xenproject.org --J/dobhs11T7y2rNN Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Fri, Oct 12, 2012 at 01:18:16PM +0100, Ian Campbell wrote: > On Fri, 2012-10-12 at 13:10 +0100, Konrad Rzeszutek Wilk wrote: > > On Fri, Oct 12, 2012 at 07:59:49AM -0400, Konrad Rzeszutek Wilk wrote: > > > On Fri, Oct 12, 2012 at 11:28:08AM +0100, Ian Campbell wrote: > > > > Hi Konrad, > > > > > > > > The following patch causes fairly large packet loss when transmitting > > > > from dom0 to the physical network, at least with my tg3 hardware, but I > > > > assume it can impact anything which uses this interface. > > > > > > Ah, that would explain why one of my machines suddenly started > > > developing checksum errors (and had a tg3 card). I hadn't gotten > > > deep into it. > > > > > > > > I suspect that the issue is that the compound pages allocated in this > > > > way are not backed by contiguous mfns and so things fall apart when the > > > > driver tries to do DMA. > > > > > > So this should also be easily reproduced on barmetal with 'iommu=soft' then. > > > > > > > > However I don't understand why the swiotlb is not fixing this up > > > > successfully? The tg3 driver seems to use pci_map_single on this data. > > > > Any thoughts? Perhaps the swiotlb (either generically or in the Xen > > > > backend) doesn't correctly handle compound pages? > > > > > > The assumption is that it is just a page. I am surprsed that the other > > > IOMMUs aren't hitting this as well - ah, that is b/c they do handle > > > a virtual address of more than one PAGE_SIZE.. > > > > So.. the GART one (AMD poor man IOTLB - was used for AGP card > > translation, but can still be used as an IOMMU - and is still present on > > some AMD machines), looks to suffer the same problem. > > > > But perhaps not - can you explain to me if a compound page > > is virtually contingous? One of the things the GART does for > > pci_map_single is call page_to_phys(p), feeds the CPU physical address > > (and size) into the GART engine to setup the mapping. > > > > If compound pages are virtually (and physically on barmetal) contingous > > - this ought to work. But if they are not, then this should also break on > > AMD machines with tg3 and a AMD GART enabled. > > AFAIK compound pages are always physically contiguous. i.e. given a > "struct page *page" which is the head of a compound page you can do > "page++" to walk through its constituent frames. > > I'm not sure about virtually contiguous. Obviously if they are in lowmem > then the 1-1 map combined with the fact that they are physically > contiguous makes them virtually contiguous too. I'm not sure what > happens if they are highmem -- since kmap (or whatever) would need to do > some extra work in this case. I've not looked but I don't recall > noticing this in the past... So to double check this, I wrote this nice little module (attached) that would allocate these type of pages and do 'DMA' on them. >>From the tests it seems to work OK - in some cases it uses a bounce buffer and in some it does not. And the resulting buffers do contain the data we expected. # modprobe dma_test modprobe dma_test calling dma_test_init+0x0/0x1000 [dma_test] @ 2875 initcall dma_test_init+0x0/0x1000 [dma_test] returned 0 after 309 usecs fallback_bus: to_cpu: va: ffff8800642dd000 (pfn:642dd, mfn:53706) w.r.t prev mfn: 53707! fallback_bus: to_cpu: va: ffff8800642de000 (pfn:642de, mfn:53705) w.r.t prev mfn: 53706! fallback_bus: to_cpu: va: ffff8800642df000 (pfn:642df, mfn:53704) w.r.t prev mfn: 53705! fallback_bus: to_cpu: ffff8800642dc000 (pfn:642dc, bus frame: 53707) <= ffff880070046000 (addr: 70046000, frame: 186) fallback_bus: to_cpu: ffff8800642dd000 (pfn:642dd, bus frame: 53706) <= ffff880070047000 (addr: 70047000, frame: 187) fallback_bus: to_cpu: ffff8800642de000 (pfn:642de, bus frame: 53705) <= ffff880070048000 (addr: 70048000, frame: 188) fallback_bus: to_cpu: ffff8800642df000 (pfn:642df, bus frame: 53704) <= ffff880070049000 (addr: 70049000, frame: 189) fallback_bus: to_dev: va: ffff880059521000 (pfn:59521, mfn:488c2) w.r.t prev mfn: 488c3! fallback_bus: to_dev: va: ffff880059522000 (pfn:59522, mfn:488c1) w.r.t prev mfn: 488c2! fallback_bus: to_dev: va: ffff880059523000 (pfn:59523, mfn:488c0) w.r.t prev mfn: 488c1! fallback_bus: to_dev: va: ffff880059524000 (pfn:59524, mfn:488bf) w.r.t prev mfn: 488c0! fallback_bus: to_dev: va: ffff880059525000 (pfn:59525, mfn:488be) w.r.t prev mfn: 488bf! fallback_bus: to_dev: va: ffff880059526000 (pfn:59526, mfn:488bd) w.r.t prev mfn: 488be! fallback_bus: to_dev: va: ffff880059527000 (pfn:59527, mfn:488bc) w.r.t prev mfn: 488bd! fallback_bus: to_dev: 0xffff88007004a000(bounce) <= 0xffff880059520000 (sz: 32768) fallback_bus: to_dev: ffff880059520000 (pfn:59520, bus frame: 488c3) => ffff88007004a000 (addr: 7004a000, frame: 18a) fallback_bus: to_dev: ffff880059521000 (pfn:59521, bus frame: 488c2) => ffff88007004b000 (addr: 7004b000, frame: 18b) fallback_bus: to_dev: ffff880059522000 (pfn:59522, bus frame: 488c1) => ffff88007004c000 (addr: 7004c000, frame: 18c) fallback_bus: to_dev: ffff880059523000 (pfn:59523, bus frame: 488c0) => ffff88007004d000 (addr: 7004d000, frame: 18d) fallback_bus: to_dev: ffff880059524000 (pfn:59524, bus frame: 488bf) => ffff88007004e000 (addr: 7004e000, frame: 18e) fallback_bus: to_dev: ffff880059525000 (pfn:59525, bus frame: 488be) => ffff88007004f000 (addr: 7004f000, frame: 18f) fallback_bus: to_dev: ffff880059526000 (pfn:59526, bus frame: 488bd) => ffff880070050000 (addr: 70050000, frame: 190) fallback_bus: to_dev: ffff880059527000 (pfn:59527, bus frame: 488bc) => ffff880070051000 (addr: 70051000, frame: 191) fallback_bus: to_dev: ffff880059520000 with DMA (18a000) has ffffffcc (expected ffffffcc) fallback_bus: to_dev: ffff880059521000 with DMA (18b000) has ffffffcc (expected ffffffcc) fallback_bus: to_dev: ffff880059522000 with DMA (18c000) has ffffffcc (expected ffffffcc) fallback_bus: to_dev: ffff880059523000 with DMA (18d000) has ffffffcc (expected ffffffcc) fallback_bus: to_dev: ffff880059524000 with DMA (18e000) has ffffffcc (expected ffffffcc) fallback_bus: to_dev: ffff880059525000 with DMA (18f000) has ffffffcc (expected ffffffcc) fallback_bus: to_dev: ffff880059526000 with DMA (190000) has ffffffcc (expected ffffffcc) fallback_bus: to_dev: ffff880059527000 with DMA (191000) has ffffffcc (expected ffffffcc) fallback_bus: to_cpu: 0xffff880070046000(bounce) => 0xffff8800642dc000 (sz: 16384) fallback_bus: to_cpu: ffff8800642dc000 with DMA (186000) has ffffffdd (expected ffffffdd) fallback_bus: to_cpu: ffff8800642dd000 with DMA (187000) has ffffffdd (expected ffffffdd) fallback_bus: to_cpu: ffff8800642de000 with DMA (188000) has ffffffdd (expected ffffffdd) fallback_bus: to_cpu: ffff8800642df000 with DMA (189000) has ffffffdd (expected ffffffdd) fallback_bus: to_cpu: 0xffff880070046000(bounce) => 0xffff8800642dc000 (sz: 16384) > > Ian. --J/dobhs11T7y2rNN Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="dma_test.c" #include #include #include #include #include #include #include #define DMA_TEST "0.1" MODULE_AUTHOR("Konrad Rzeszutek Wilk "); MODULE_DESCRIPTION("dma_test"); MODULE_LICENSE("GPL"); MODULE_VERSION(DMA_TEST); static struct bus_type fallback_bus_type = { .name = "fallback_bus:", }; static void fake_release(struct device *dev) { /* No kfree as the device was allocated on stack. */ } struct args { int len; enum dma_data_direction dir; }; #define MAGIC_DEVICE 0xffffffdd #define MAGIC_CPU 0xffffffcc static int dma_test_thread(void *arg) { struct page *page; dma_addr_t dma_addr = 0; struct device fake = { .coherent_dma_mask = DMA_BIT_MASK(32), .bus = &fallback_bus_type, .release = fake_release, }; gfp_t gfp = __GFP_COMP | __GFP_NOWARN | GFP_ATOMIC; int ret; int i; void *addr; struct page *p; struct args *args = (struct args *)arg; int dir = args->dir; int len = args->len; dev_set_name(&fake, "%s", dir == DMA_TO_DEVICE ? "to_dev" : "to_cpu"); fake.dma_mask = &fake.coherent_dma_mask; ret = device_register(&fake); if (ret) goto out; do { unsigned long prev_mfn = 0; bool bus_and_dma_same; page = alloc_pages(gfp, get_order(len)); p = page; /* Check that the bus addresses are contingous. */ for (i = 0; i < len / PAGE_SIZE; i++, p++) { unsigned long pfn, mfn; addr = page_address(p); pfn = PFN_DOWN(virt_to_phys(addr)); if (xen_domain()) mfn = pfn_to_mfn(pfn); else mfn = pfn; if (i != 0) { if (prev_mfn + 1 != mfn) dev_warn(&fake, "va: %lx (pfn:%lx, mfn:%lx) w.r.t prev mfn: %lx!\n", (unsigned long)addr, pfn, mfn, prev_mfn); } prev_mfn = mfn; } dma_addr = dma_map_page(&fake, page, 0 /* no offset */, len, dir); /* Note, dma_addr is the physical address ! */ if (dma_mapping_error(&fake, dma_addr)) { dev_warn(&fake, "DMA %lx for %lx is not right\n", (unsigned long)dma_addr, (unsigned long)page_address(page)); __free_pages(page, get_order(len)); page = NULL; } bus_and_dma_same = false; if (page) { unsigned long phys; unsigned long pfn, mfn, bus_addr_mfn; unsigned long bus_addr = 0; p = page; for (i = 0; i < len / PAGE_SIZE; i++, p++) { void *bus_va; addr = page_address(p); phys = virt_to_phys(addr); pfn = PFN_DOWN(phys); bus_va = (void *)(dma_addr + (i * PAGE_SIZE)); if (xen_domain()) { void * tmp; /* Find the bus frame for the physical frame*/ mfn = pfn_to_mfn(pfn); /* and .. voodoo time! */ bus_addr_mfn = PFN_DOWN(dma_addr + (i * PAGE_SIZE)); bus_addr = PFN_PHYS(mfn_to_pfn(bus_addr_mfn)); tmp = __va(bus_addr); bus_va = mfn_to_virt(bus_addr_mfn); WARN(bus_va != tmp, "Expected %lx (%lx+%d*PAGE_SIZE), got: %lx (pfn: %lx, mfn: %lx)!\n", (unsigned long)bus_va, (unsigned long)dma_addr, i, (unsigned long)tmp, PFN_DOWN(bus_addr), bus_addr_mfn); } else { mfn = pfn; /* Assume DMA addr == physical addr */ bus_addr_mfn = PFN_DOWN(bus_addr); bus_va = __va(PFN_PHYS(bus_addr_mfn)); } dev_info(&fake, "%lx (pfn:%lx, bus frame: %lx) %s %lx (addr: %lx, frame: %lx)\n", (unsigned long)addr, pfn, mfn, dir == DMA_TO_DEVICE ? "=>" : "<=", (unsigned long)bus_va, bus_addr, bus_addr_mfn); if (!virt_addr_valid(bus_va)) break; if (!virt_addr_valid(addr)) break; /* CPU */ memset(addr, 0xCC, PAGE_SIZE); /* Device */ memset(bus_va, 0xDD, PAGE_SIZE); if (addr == bus_va) bus_and_dma_same = true; } } set_current_state(TASK_INTERRUPTIBLE); schedule_timeout_interruptible(5*HZ); if (!page) continue; p = page; for (i = 0; i < len / PAGE_SIZE; i++, p++) { if (bus_and_dma_same) continue; addr = page_address(p); if (((char *)addr)[0] != MAGIC_CPU) dev_warn(&fake, "%lx with DMA (%lx) has %x (expected %lx)\n", (unsigned long)addr, (unsigned long)(dma_addr + (i * PAGE_SIZE)), ((char *)addr)[0], (unsigned long)MAGIC_CPU); } /* sync the page */ dma_sync_single_for_cpu(&fake, dma_addr, len, dir); p = page; for (i = 0; i < len / PAGE_SIZE; i++, p++) { unsigned long check_val = MAGIC_DEVICE; addr = page_address(p); if (dir == DMA_TO_DEVICE) check_val = MAGIC_CPU; if (dir == DMA_FROM_DEVICE) check_val = MAGIC_DEVICE; dev_info(&fake, "%lx with DMA (%lx) has %x (expected %lx)\n", (unsigned long)addr, (unsigned long)(dma_addr + (i * PAGE_SIZE)), ((char *)addr)[0], check_val); } dma_unmap_page(&fake, dma_addr, len, dir); dma_addr = 0; __free_pages(page, get_order(len)); page = NULL; } while (!kthread_should_stop()); if (dma_addr) dma_unmap_page(&fake, dma_addr, len, dir); if (page) __free_pages(page, get_order(len)); put_device(&fake); device_unregister(&fake); out: return 0; } static struct task_struct *t[2]; static struct args a[2]; static int __init dma_test_init(void) { int ret; /* No point doing this without SWIOTLB */ if (!swiotlb_nr_tbl()) return -ENODEV; ret = bus_register(&fallback_bus_type); if (ret) return ret; a[0].dir = DMA_TO_DEVICE; a[0].len = 32768; t[0] = kthread_run(dma_test_thread, &a[0], "dma_test_dev"); a[1].len = 16384; a[1].dir = DMA_FROM_DEVICE; t[1] = kthread_run(dma_test_thread, &a[1], "dma_test_cpu"); return 0; } static void __exit dma_test_exit(void) { if (t[0]) kthread_stop(t[0]); if (t[1]) kthread_stop(t[1]); bus_unregister(&fallback_bus_type); } module_init(dma_test_init); module_exit(dma_test_exit); --J/dobhs11T7y2rNN Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="0001-swiotlb-Add-debugging.patch" >>From db63c863456f0914ad96db38544c364eaad787ab Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk Date: Mon, 29 Oct 2012 09:35:55 -0400 Subject: [PATCH] swiotlb: Add debugging. Signed-off-by: Konrad Rzeszutek Wilk --- include/linux/swiotlb.h | 2 +- lib/swiotlb.c | 25 ++++++++++++++++++++----- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 8d08b3e..0a17d79 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -46,7 +46,7 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, enum dma_sync_target target); /* Accessory functions. */ -extern void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size, +extern void swiotlb_bounce(struct device *dev, phys_addr_t phys, char *dma_addr, size_t size, enum dma_data_direction dir); extern void diff --git a/lib/swiotlb.c b/lib/swiotlb.c index f114bf6..e5d37a3 100644 --- a/lib/swiotlb.c +++ b/lib/swiotlb.c @@ -346,7 +346,7 @@ static int is_swiotlb_buffer(phys_addr_t paddr) /* * Bounce: copy the swiotlb buffer back to the original dma location */ -void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size, +void swiotlb_bounce(struct device *dev, phys_addr_t phys, char *dma_addr, size_t size, enum dma_data_direction dir) { unsigned long pfn = PFN_DOWN(phys); @@ -376,6 +376,21 @@ void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size, offset = 0; } } else { + if (size >= PAGE_SIZE) { + const char *type; + + if (dir == DMA_TO_DEVICE) + type = " <= "; + else if (dir == DMA_FROM_DEVICE) + type = " => "; + else + type = " <=> "; + dev_info(dev, "0x%lx%s %s 0x%lx%s (sz: %ld)\n", (unsigned long)dma_addr, + is_swiotlb_buffer(virt_to_phys(dma_addr)) ? "(bounce)" : "", + type, (unsigned long)phys_to_virt(phys), + is_swiotlb_buffer(phys) ? "(bounce)" : "", + size); + } if (dir == DMA_TO_DEVICE) memcpy(dma_addr, phys_to_virt(phys), size); else @@ -483,7 +498,7 @@ found: for (i = 0; i < nslots; i++) io_tlb_orig_addr[index+i] = phys + (i << IO_TLB_SHIFT); if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL) - swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE); + swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_TO_DEVICE); return dma_addr; } @@ -518,7 +533,7 @@ swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr, size_t size, * First, sync the memory before unmapping the entry */ if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL))) - swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE); + swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_FROM_DEVICE); /* * Return the buffer to the free list by setting the corresponding @@ -560,13 +575,13 @@ swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size, switch (target) { case SYNC_FOR_CPU: if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)) - swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE); + swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_FROM_DEVICE); else BUG_ON(dir != DMA_TO_DEVICE); break; case SYNC_FOR_DEVICE: if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)) - swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE); + swiotlb_bounce(hwdev, phys, dma_addr, size, DMA_TO_DEVICE); else BUG_ON(dir != DMA_FROM_DEVICE); break; -- 1.7.7.6 --J/dobhs11T7y2rNN Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --J/dobhs11T7y2rNN--