From mboxrd@z Thu Jan 1 00:00:00 1970
From: Joel Soete
Subject: How DMA_CHUNK_SIZE is related to BIO_VMERGE_MAX_SIZE??? [Was: ccio-dma: is issue could be related to too much io_tlb entries?]
Date: Thu, 04 Sep 2008 17:36:39 +0000
Message-ID: <48C01CA7.70501@scarlet.be>
References: 
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
Cc: kyle, linux-parisc
To: grundler
Return-path: 
In-Reply-To: 
List-ID: 
List-Id: linux-parisc.vger.kernel.org

Hello Grant et al.,

Looking at how DMA_CHUNK_SIZE is used elsewhere:

root@etchome:/Develop/parisc-linux/Kyle/linux-2.6-snaps.git# find . -type f | xargs grep DMA_CHUNK_SIZE
./arch/parisc/kernel/pci-dma.c:#include <asm/dma.h>   /* for DMA_CHUNK_SIZE */
./arch/ia64/hp/common/sba_iommu.c:** DMA_CHUNK_SIZE is used by the SCSI mid-layer to break up
./arch/ia64/hp/common/sba_iommu.c:#define DMA_CHUNK_SIZE (BITS_PER_LONG*iovp_size)
./arch/ia64/hp/common/sba_iommu.c: ASSERT(size <= DMA_CHUNK_SIZE);
./arch/ia64/hp/common/sba_iommu.c: ASSERT(startsg->length <= DMA_CHUNK_SIZE);
./arch/ia64/hp/common/sba_iommu.c: ** exceed DMA_CHUNK_SIZE if we coalesce the
./arch/ia64/hp/common/sba_iommu.c: > DMA_CHUNK_SIZE)
./arch/ia64/hp/common/sba_iommu.c: ASSERT(dma_len <= DMA_CHUNK_SIZE);
./drivers/parisc/ccio-dma.c: BUG_ON((pages_needed * IOVP_SIZE) > DMA_CHUNK_SIZE);
./drivers/parisc/ccio-dma.c: BUG_ON((pages_mapped * IOVP_SIZE) > DMA_CHUNK_SIZE);
./drivers/parisc/iommu-helpers.h: ** exceed DMA_CHUNK_SIZE if we coalesce the
./drivers/parisc/iommu-helpers.h: IOVP_SIZE) > DMA_CHUNK_SIZE))
./drivers/parisc/sba_iommu.c:#include <asm/dma.h>   /* for DMA_CHUNK_SIZE */
./include/asm-parisc/dma.h:** DMA_CHUNK_SIZE is used by the SCSI mid-layer to break up
./include/asm-parisc/dma.h:#define DMA_CHUNK_SIZE (BITS_PER_LONG*PAGE_SIZE)

This last comment, "DMA_CHUNK_SIZE is used by the SCSI mid-layer ...", was
certainly meaningful in the 2.4 tree with scsi_merge.c, but in the 2.6 tree
that code was removed, and even with google it is hard to track down what
DMA_CHUNK_SIZE has become.

Nevertheless, in the parisc ccio-dma and sba_iommu drivers there are:
	parisc_vmerge_boundary = IOVP_SIZE;
	parisc_vmerge_max_size = IOVP_SIZE * BITS_PER_LONG;
which would mean parisc_vmerge_max_size == DMA_CHUNK_SIZE if it were written
like the ia64 sba_iommu, i.e. #define DMA_CHUNK_SIZE (BITS_PER_LONG*iovp_size).

Here PAGE_SIZE == IOVP_SIZE numerically, so replacing DMA_CHUNK_SIZE with
BIO_VMERGE_MAX_SIZE (include/asm-parisc/io.h: #define BIO_VMERGE_MAX_SIZE
parisc_vmerge_max_size) would not exhibit any issue today (a small sketch at
the very bottom of this mail, below the quoted thread, spells out the
equivalence).

Obviously that would help me make the link between these pieces, but what
happens if somebody ever tries to make parisc IOVP_SIZE > PAGE_SIZE?

Thanks in advance for your additional advice,
	J.

PS: maybe it would also help me to figure out whether (for ccio-dma)
DMA_CHUNK_SIZE wouldn't better be chain_size?

Joel Soete wrote:
>> On Tue, Aug 05, 2008 at 03:21:32PM +0100, Joel Soete wrote:
>>>> On Thu, Jul 24, 2008 at 02:13:55PM +0100, Joel Soete wrote:
>>>>> Hello Grant, Kyle, et al.,
>>>>>
>>>>> Iirc the number of io_tlb entries on this u2/uturn ioa is 256?
>>>> ISTR that u2 and uturn have different numbers of IO TLB entries.
>>>> But I don't recall how many exactly. Need the ERSs to look that up.
>>>>
>>> Well, I haven't yet found the right way to get access, sorry.
>>>
>>>>> Because the issue occurs only when I do a lot of I/O on a scsi disk
>>>>> (sometimes a mapping request reaches 128 pages), the idea was that it
>>>>> could cause the iotlb entries to be exceeded.
>>>>>
>>> [snip]
>>>> The number of "used" entries includes "in flight" DMA and pci_consistent
>>>> allocations. This generally isn't that many pages of RAM.
>>> Ok,
>>> but the idea was that if so many pdir entries are mapped in such a short
>>> time (1s), the device probably also tries to use them on the fly (just a
>>> hypothesis).
>>> And as far as I can observe, the problem occurs when the OS operates on
>>> numerous huge data blocks (e.g. a tar -xvf of a linux tree into a single
>>> fs); so in this case the I/O device would trigger many I/O TLB misses and
>>> maybe need more I/O TLB entries than can be freed?
>> Yes, that's certainly possible.
>> But it's not the only behavior triggered by lots of in-flight IO traffic.
>>
> Ok
> (that's just the simplest way I found to reproduce the day-to-day issue I
> encounter when I update my system: it happens not during the download of
> packages but during the 'Unpacking' step, and it has already broken my fs ;_()
>
>>> What I also observe is that the problem becomes worse either on a system
>>> with little ram (like my c110 with 64M) or when I resurrect CCIO_MEM_RATIO
>>> (e.g. 2 or 4) on a system with 256Mb of RAM. In those last 2 cases the
>>> effect is the same:
>>> a/ it makes the pdir_size and the number of pdir entries smaller
>> Yes.
>>
>>> b/ likewise for chainid_shift.
>> I've forgotten exactly the role of the chainid... I'd have to study
>> the code again.
>>
> no problem
>
>>> This last point (b/) made me think that it would also reduce the number of
>>> 4k pages per chainid, so that the same DMA block size would require more
>>> iotlb entries.
>> No. The number of IO TLB entries (192 or something like that) and IO MMU
>> page size (4k) are both fixed.
>> Both are also completely unrelated to the size of the IO Pdir.
>>
> Totally agree.
> But I explained my idea badly: my understanding was that chainid_shift is
> used to compute a chainid_mask to set up the U2 (in my case) iommu.
> After reading the HP paper "Hardware Cache Coherent Input/Output", I
> supposed (that's certainly where I am wrong) that this chainid_mask is a
> hint telling the iommu the max size of an I/O data block (e.g. for the d380
> with 256Mb I got chainid_shift = 19 [18 with ccio_mem_ratio = 2], and so
> chain_size = 2^19 = 128 * 4k pages (at least that's what clear_io_tlb()
> does)). So for a big data block of 128*4k pages (I really do see such
> mapping requests) the scsi device would need just 1 io_tlb entry, while it
> would need 2 (with ccio_mem_ratio = 2) and even 4 (on a c110 with only 64Mb).
> That's obviously my own reading (without any coaching ;-), sorry in advance
> if it's more confusing.
>
>>> Obviously just speculation ;<).
>>>
>>> Even so, 3 things are sure:
>>> - the issue occurs for huge I/O
>>> - it becomes worse with a reduced iov_space_size (physical or logical)
>>> - the sba backport helps a bit but doesn't fix the issue
>> Yeah, those suggest IO TLB flushing is failing or the IO Pdir isn't coherent.
>> There might be other things broken too.
>>
> Yes
> (with relayfs I tried to trace as much as I could, but it has the drawback
> of not capturing all messages, so it only gives me an overview of the
> execution path.)
> Next step in my investigation: coalesce_chunks();
> but I am still looking at the sg_list details, so here is the kind of sg
> dump I could grab (after coalesce_chunks()):
>
> this one is easy to understand:
> [0]- page_link: 0x10692980 (275327360), offset:0x0, length: 4096, iova(dma_address): 0xad0000, iova_length(dma_length): 40960.
> [1]- page_link: 0x10692960 (275327328), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [2]- page_link: 0x10692940 (275327296), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [3]- page_link: 0x10692920 (275327264), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [4]- page_link: 0x10692900 (275327232), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [5]- page_link: 0x106928e0 (275327200), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [6]- page_link: 0x106928c0 (275327168), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [7]- page_link: 0x10692a80 (275327616), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [8]- page_link: 0x10692c40 (275328064), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [9]- page_link: 0x10692c22 (275328034), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
>
> i.e. 10 * 4k pages fused (coalesced?) into one dma data block of 40K using
> iova 0xad0000 (ok?)
>
> but I don't yet understand the following ones:
> [0]- page_link: 0x10681b40 (275258176), offset:0x0, length: 4096, iova(dma_address): 0x198000, iova_length(dma_length): 12288.
> [1]- page_link: 0x10681b20 (275258144), offset:0x0, length: 4096, iova(dma_address): 0x19bc00, iova_length(dma_length): 1024.
> [2]- page_link: 0x10681b00 (275258112), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [3]- page_link: 0x10681a82 (275257986), offset:0xc00, length: 1024, iova(dma_address): 0x8019bc00, iova_length(dma_length): 0.
>
> why not fuse them into only one block?
>
> or this one:
> [0]- page_link: 0x10692f00 (275328768), offset:0x0, length: 12288, iova(dma_address): 0x1a30000, iova_length(dma_length): 49152.
> [1]- page_link: 0x10693060 (275329120), offset:0x0, length: 4096, iova(dma_address): 0x1a40000, iova_length(dma_length): 40960.
> [2]- page_link: 0x106930a0 (275329184), offset:0x0, length: 8192, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [3]- page_link: 0x10693240 (275329600), offset:0x0, length: 24576, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [4]- page_link: 0x106935e0 (275330528), offset:0x0, length: 20480, iova(dma_address): 0x81a40000, iova_length(dma_length): 0.
> [5]- page_link: 0x106937a0 (275330976), offset:0x0, length: 4096, iova(dma_address): 0x0, iova_length(dma_length): 0.
> [6]- page_link: 0x106937e2 (275331042), offset:0x0, length: 16384, iova(dma_address): 0x0, iova_length(dma_length): 0.
>
> as the chainid_size is 128*4k pages (=512k), why not coalesce everything
> into one data block?
> Or is this not the place where scatterlist blocks are put together to form
> one contiguous block for dma access?
> (well, my understanding of early sg list management was that it puts
> scattered blocks together at contiguous _physical_ addresses for dma access.
> But with this U2 we now work with _virtual_ addresses and indexes, so I am a
> bit lost ;-)
>
> But this next one totally puzzles me:
> [0]- page_link: 0x10667600 (275150336), length: 1024, iova(dma_address): 0x800ae000, iova_length(dma_length): 1024.
> [1]- page_link: 0x1072e2e0 (275964640), length: 1024, iova(dma_address): 0x800afc00, iova_length(dma_length): 1024.
> [2]- page_link: 0x10676180 (275210624), length: 1024, iova(dma_address): 0x800b0800, iova_length(dma_length): 1024.
> [3]- page_link: 0x10541d00 (273947904), length: 1024, iova(dma_address): 0x800b1c00, iova_length(dma_length): 1024.
> [4]- page_link: 0x1072dd00 (275963136), length: 1024, iova(dma_address): 0x800b2800, iova_length(dma_length): 1024.
> [5]- page_link: 0x1072dd20 (275963168), length: 1024, iova(dma_address): 0x800b3800, iova_length(dma_length): 1024.
> [6]- page_link: 0x107284c0 (275940544), length: 1024, iova(dma_address): 0x800b4c00, iova_length(dma_length): 1024.
>
> (sorry, here I don't have the offsets, but I doubt they would help me
> understand why no gathering occurs here)
>
>
>>>>> Anyway, the difference between those last 2 samples is (718 - 444) = 274
>>>>> more io_pdir entries.
>>>> That's about right for a SCSI device since it can't have that much
>>>> IO in flight for one or two disks.
>>>>
>>> [snip]
>>>> Of course. The number of "used" entries in the IO Pdir has no direct
>>>> correlation to the number of "in use" IO TLB entries. IO TLB is fixed
>>>> size while the IO Pdir size can vary between boots.
>>>>
>>>>> Well, as scatterlists still puzzle me, I may still be confusing iommu
>>>>> and mmu page mappings, so sorry in advance if this is yet another
>>>>> annoying comment.
>>>> IOMMU is an MMU for IO devices. MMU is the same thing for CPU.
>>>> Differences exist between those two. DMA is generally to larger
>>>> chunks/regions of RAM (256-2K bytes) while CPUs need to enforce
>>>> access rights (X/R/W) to memory and deal with cachelines or less.
>>>>
>>> (well, I still have difficulties with the relationship between all those
>>> buffers, caches and TLBs, and on top of that I/O DMA with its own set of
>>> cache and iotlb. Fortunately there is now good documentation freely
>>> available and good search engines to find it, but it's still not so easy
>>> for me)
>> Agreed - it's not easy.
>>
> Tx (when a master says 'it's not easy', that sincerely encourages me to
> continue learning)
>
> Again thanks a lot for your kind attention,
> J.
>
>> grant
>>
>>> Tx again for the advice,
>>> J.
>>>
>>>> hth,
>>>> grant
>>>> --
>> --
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
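
For reference, here is the quick sanity check I mentioned above (a minimal
userspace sketch, not kernel code and not a patch; the 4k page size and the
32-bit long are just assumptions matching my boxes, and IOVP_SIZE == PAGE_SIZE
is what ccio-dma/sba_iommu effectively use today):

/* sketch: compare DMA_CHUNK_SIZE with what BIO_VMERGE_MAX_SIZE resolves to */
#include <stdio.h>

#define PAGE_SIZE      4096UL   /* assumption: 4k base pages */
#define BITS_PER_LONG  32       /* assumption: 32-bit kernel */

/* include/asm-parisc/dma.h */
#define DMA_CHUNK_SIZE (BITS_PER_LONG * PAGE_SIZE)

/* what ccio-dma.c/sba_iommu.c store in parisc_vmerge_max_size, i.e. what
 * include/asm-parisc/io.h makes BIO_VMERGE_MAX_SIZE resolve to */
static unsigned long vmerge_max_size(unsigned long iovp_size)
{
	return iovp_size * BITS_PER_LONG;
}

int main(void)
{
	printf("DMA_CHUNK_SIZE                   = %lu\n", DMA_CHUNK_SIZE);
	printf("BIO_VMERGE_MAX_SIZE (IOVP = 4k)  = %lu\n",
	       vmerge_max_size(PAGE_SIZE));
	/* hypothetical: what if someone bumped IOVP_SIZE to 16k one day? */
	printf("BIO_VMERGE_MAX_SIZE (IOVP = 16k) = %lu\n",
	       vmerge_max_size(4 * PAGE_SIZE));
	return 0;
}

As long as the first two numbers match (both come out to 131072 here),
swapping DMA_CHUNK_SIZE for BIO_VMERGE_MAX_SIZE in the ccio-dma.c BUG_ON()s
cannot change behaviour; the last line just shows how the two would drift
apart if IOVP_SIZE ever grew beyond PAGE_SIZE, which is exactly the case I am
asking about.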