* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Kyle McMartin @ 2009-12-17 13:22 UTC
To: kyle
Cc: torvalds, linux-parisc, linux-kernel, James.Bottomley, hch, linux-arch

Linus, any word on these? I'd really love to get jejb and hch off
my back. :)

cheers, Kyle

On Wed, Dec 16, 2009 at 04:36:18AM +0000, Kyle McMartin wrote:
> As discussed on linux-arch@ recently, these fixes are necessary for
> XFS to work on virtually indexed architectures since otherwise vmalloc'd
> pages are not properly flushed from the cache.
>
> Thanks!
>   Kyle
>
> The following changes since commit 8bea8672edfca7ec5f661cafb218f1205863b343:
>
>   Stephen Rothwell (1):
>         mfd: compile fix for twl4030 renaming
>
> are available in the git repository at:
>
>   hera.kernel.org:/pub/scm/linux/kernel/git/kyle/parisc-2.6.git coherence
>
> James Bottomley (6):
>       mm: add coherence API for DMA to vmalloc/vmap areas
>       parisc: add mm API for DMA to vmalloc/vmap areas
>       arm: add mm API for DMA to vmalloc/vmap areas
>       sh: add mm API for DMA to vmalloc/vmap areas
>       block: permit I/O to vmalloc/vmap kernel pages
>       xfs: fix xfs to work with Virtually Indexed architectures
>
>  Documentation/cachetlb.txt           |   24 ++++++++++++++++++++++++
>  arch/arm/include/asm/cacheflush.h    |   10 ++++++++++
>  arch/parisc/include/asm/cacheflush.h |    8 ++++++++
>  arch/sh/include/asm/cacheflush.h     |    8 ++++++++
>  fs/bio.c                             |   20 ++++++++++++++++++--
>  fs/xfs/linux-2.6/xfs_buf.c           |   20 ++++++++++++++++++++
>  include/linux/highmem.h              |    6 ++++++
>  7 files changed, 94 insertions(+), 2 deletions(-)
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Christoph Hellwig @ 2009-12-17 13:25 UTC
To: Kyle McMartin
Cc: torvalds, linux-parisc, linux-kernel, James.Bottomley, hch, linux-arch

On Thu, Dec 17, 2009 at 08:22:56AM -0500, Kyle McMartin wrote:
> Linus, any word on these? I'd really love to get jejb and hch off
> my back. :)

Yeah, users have been pretty impatiently waiting for this.
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Linus Torvalds @ 2009-12-17 16:16 UTC
To: Kyle McMartin
Cc: linux-parisc, Linux Kernel Mailing List, James.Bottomley, hch, linux-arch, Jens Axboe

On Thu, 17 Dec 2009, Kyle McMartin wrote:
>
> Linus, any word on these? I'd really love to get jejb and hch off
> my back. :)

I hate them.

I don't see what the point of allowing kernel virtual addresses in bio's
is. It's wrong. The fact that XFS does that sh*t is an XFS issue. Handle
it there.

Fix XFS. Or convince me with some really good arguments, and make sure
that Jens signs off on the cr*p too.

In other words, the thing I object to isn't even the new flushing thing,
it's this idiocy:

 - save off virtual address:

	..
	bio->bi_private = data;
	..

 - do vmalloc_to_page:

	+	if (is_vmalloc_addr(data)) {
	+		flush_kernel_dcache_addr(data);
	+		page = vmalloc_to_page(data);
	+	} else
	+		page = virt_to_page(data);

WTF? Why the hell should the block layer support this kind of absolute
crap?

When has "use random kernel virtual address" ever been a valid thing to
do for IO? Why aren't you doing this before you submit the vmalloc range
for IO?

So no. Not a way in hell do I pull this for 33. And preferably never.

		Linus
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: tytso @ 2009-12-17 16:30 UTC
To: Linus Torvalds
Cc: Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, hch, linux-arch, Jens Axboe

On Thu, Dec 17, 2009 at 08:16:12AM -0800, Linus Torvalds wrote:
>
> I hate them.
>
> I don't see what the point of allowing kernel virtual addresses in bio's
> is. It's wrong. The fact that XFS does that sh*t is an XFS issue. Handle
> it there.
>
> Fix XFS. Or convince me with some really good arguments, and make sure
> that Jens signs off on the cr*p too.

I have a somewhat similar issue that comes up for ext4; at the moment we
occasionally need to clone a buffer head's buffer, either because the
first four bytes match the magic JBD "escape sequence" and we need to
modify the block and escape it before writing it to the journal, or
because we need to make a copy of an allocation bitmap block so we can
write a fixed copy to disk while we modify the "real" block during a
commit.

At the moment we allocate a full page, even if that means allocating a
16k PPC page when the file system block size is 4k, or allocating a 4k
x86 page when the file system block size is 1k. That's because
apparently the iSCSI and DMA blocks assume that they have Real Pages
(tm) passed to block I/O requests, and apparently XFS ran into problems
when sending vmalloc'ed pages. I don't know if this is a problem if we
pass the bio layer addresses coming from the SLAB allocator, but oral
tradition seems to indicate this is problematic, although no one has
given me the full chapter and verse explanation about why this is so.

Now that I see Linus's complaint, I'm wondering if the issue is really
about kernel virtual addresses (i.e., coming from vmalloc), and not a
requirement for Real Pages (i.e., coming from the SLAB allocator as
opposed to get_free_page). And can this be documented someplace? I
tried looking at the bio documentation, and couldn't find anything
definitive on the subject.

Thanks,

						- Ted
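[For reference, the JBD "escape sequence" case Ted mentions is a simple
magic-number collision test. A minimal sketch using the jbd2 names; the
helper name is made up for illustration, and the real commit-path code
does more:]

#include <linux/jbd2.h>

/*
 * A journaled data block whose first four bytes happen to equal the
 * journal magic must be copied and "escaped" (first word zeroed, and
 * JBD2_FLAG_ESCAPE set in the descriptor tag) before being written to
 * the journal, or replay would mistake it for a descriptor block.
 */
static int jbd_block_needs_escape(const void *data)
{
	return *((const __be32 *)data) == cpu_to_be32(JBD2_MAGIC_NUMBER);
}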
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Linus Torvalds @ 2009-12-17 16:46 UTC
To: tytso
Cc: Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, hch, linux-arch, Jens Axboe

On Thu, 17 Dec 2009, tytso@mit.edu wrote:
>
> That's because apparently the iSCSI and DMA blocks assume that they
> have Real Pages (tm) passed to block I/O requests, and apparently XFS
> ran into problems when sending vmalloc'ed pages. I don't know if this
> is a problem if we pass the bio layer addresses coming from the SLAB
> allocator, but oral tradition seems to indicate this is problematic,
> although no one has given me the full chapter and verse explanation
> about why this is so.

kmalloc() memory should be ok. It's backed by "real pages". Doing the DMA
translations for such pages is trivial and fundamental.

In contrast, vmalloc is pure and utter unadulterated CRAP. The pages may
be contiguous virtually, but it makes no difference for the block layer,
that has to be able to do IO by DMA anyway, so it has to look up the page
translations in the page tables etc crazy sh*t.

So passing vmalloc'ed page addresses around to something that will
eventually do a non-CPU-virtual thing on them is fundamentally insane.
The vmalloc space is about CPU virtual addresses. Such concepts simply do
not -exist- for some random block device.

> Now that I see Linus's complaint, I'm wondering if the issue is really
> about kernel virtual addresses (i.e., coming from vmalloc), and not a
> requirement for Real Pages (i.e., coming from the SLAB allocator as
> opposed to get_free_page). And can this be documented someplace? I
> tried looking at the bio documentation, and couldn't find anything
> definitive on the subject.

The whole "vmalloc is special" has always been true. If you want to
treat vmalloc as normal memory, you need to look up the pages yourself.
We have helpers for that (including helpers that populate vmalloc space
from a page array to begin with - so you can _start_ from some array of
pages and then lay them out virtually if you want to have a convenient
CPU access to the array).

And this whole "vmalloc is about CPU virtual addresses" is so obviously
and fundamentally true that I don't understand how anybody can ever be
confused about it. The "v" in vmalloc is for "virtual" as in virtual
memory. Think of it like virtual user addresses. Does anybody really
expect to be able to pass a random user address to the BIO layer? And if
you do, I would suggest that you get out of kernel programming pronto.
You're a danger to society, and have a lukewarm IQ. I don't want you
touching kernel code.

And no, I do _not_ want the BIO layer having to walk page tables. Not for
vmalloc space, not for user virtual addresses.

(And don't tell me it already does. Maybe somebody sneaked it in past me,
without me ever noticing. That wouldn't be an excuse, that would be just
sad. Jesus wept)

		Linus
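[What "look up the pages yourself" amounts to in practice: translate
each page of a vmalloc/vmap range to its struct page and feed those to
the bio, so the block layer never sees a virtual address. A sketch
against 2009-era APIs; the helper name is made up for illustration, and
it assumes the bio was allocated with enough iovecs:]

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/bio.h>

/* Hypothetical caller-side helper: add a vmalloc'd range to a bio
 * page by page, instead of handing the vaddr to the block layer. */
static void bio_add_vmalloc_range(struct bio *bio, void *buf,
				  unsigned int len)
{
	while (len) {
		unsigned int off = offset_in_page(buf);
		unsigned int bytes = min_t(unsigned int, len,
					   PAGE_SIZE - off);

		bio_add_page(bio, vmalloc_to_page(buf), bytes, off);
		buf += bytes;
		len -= bytes;
	}
}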
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Christoph Hellwig @ 2009-12-17 17:07 UTC
To: Linus Torvalds
Cc: tytso, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, hch, linux-arch, Jens Axboe

On Thu, Dec 17, 2009 at 08:46:33AM -0800, Linus Torvalds wrote:
> The whole "vmalloc is special" has always been true. If you want to
> treat vmalloc as normal memory, you need to look up the pages yourself.
> We have helpers for that (including helpers that populate vmalloc space
> from a page array to begin with - so you can _start_ from some array of
> pages and then lay them out virtually if you want to have a convenient
> CPU access to the array).

Which is exactly what the XFS code does. Pages are allocated manually
and we store pointers to the page struct that later get added to the
bio. But we access them through vmap (which I added exactly for this
reason back in 2002) for kernel accesses. On all architectures with
sane caches things just work, but for parisc, arm and friends that have
virtually indexed caches we need to make sure to flush caches for this
different access. The vmalloc linear address is not used for I/O
anywhere.
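[The shape of that scheme, reduced to a sketch: the vmap() mapping
exists only for CPU access, while I/O is built from the underlying
struct pages. Simplified well beyond fs/xfs/linux-2.6/xfs_buf.c; error
and completion handling are omitted, and 2009-era bio fields are
assumed:]

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/bio.h>

/* Allocate discontiguous pages and give the CPU one contiguous view. */
static void *buf_map_pages(struct page **pages, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		pages[i] = alloc_page(GFP_KERNEL);

	/* CPU-visible contiguous mapping, for kernel access only. */
	return vmap(pages, nr, VM_MAP, PAGE_KERNEL);
}

/* I/O path: the bio only ever sees the struct pages, not the vmap
 * address (bi_end_io setup omitted for brevity). */
static void buf_submit_write(struct block_device *bdev, sector_t sector,
			     struct page **pages, int nr)
{
	struct bio *bio = bio_alloc(GFP_NOIO, nr);
	int i;

	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	for (i = 0; i < nr; i++)
		bio_add_page(bio, pages[i], PAGE_SIZE, 0);
	submit_bio(WRITE, bio);
}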
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Linus Torvalds @ 2009-12-17 17:42 UTC
To: Christoph Hellwig
Cc: tytso, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, linux-arch, Jens Axboe

On Thu, 17 Dec 2009, Christoph Hellwig wrote:
>
> Which is exactly what the XFS code does. Pages are allocated manually
> and we store pointers to the page struct that later get added to the
> bio.

Hmm. The BIO interface that the patch-series changes (bio_map_kern)
doesn't work that way. It takes a "buf, len" kind of thing. That's what
I'm complaining about.

> But we access them through vmap (which I added exactly for this
> reason back in 2002) for kernel accesses. On all architectures with
> sane caches things just work, but for parisc, arm and friends that have
> virtually indexed caches we need to make sure to flush caches for this
> different access. The vmalloc linear address is not used for I/O
> anywhere.

Well, they clearly are _after_ this series, since that's what all those
changes to __bio_map_kernel() and bio_map_kern_endio() are all about.

So I believe you when you say that XFS perhaps does everything right - I
just think that the patch series in question actually makes things worse,
exactly because it is starting to use virtual addresses.

I also think that the changes to bio_map_kernel() and bio_map_kern_endio()
are not just "fundamentally ugly", I think they are made worse by the fact
that it's not even done "right". You both flush the virtual caches before
the IO and invalidate after - when the real pattern should be that you
flush it before a write, and invalidate it after a read.

And I really think that would be all much more properly done at the
_caller_ level, not by the BIO layer.

You must have some locking and allocation etc logic at the caller anyway,
why doesn't _that_ level just do the flushing or invalidation?

I get the feeling that somebody decided that the whole "do DMA to/from
vmalloc space" was somehow a common generic pattern that should be
supported in general, and I violently disagree. Maybe XFS has good
reasons for doing it, but that does emphatically _not_ make it a good
idea in general, and that does _not_ mean that the BIO layer should make
it easy to do for other users and have a general interface for that kind
of craziness.

IOW, I'm perfectly happy with the patch to fs/xfs/linux-2.6/xfs_buf.c.
That one still seems to use 'bio_add_page()' with a regular 'struct page'.

But the fs/bio.c patch looks like just total and utter crap to me, and is
the reason I refuse to pull this series.

		Linus
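[The caller-level discipline Linus describes - flush before a write,
invalidate after a read - sketched with the vmap-range helper names
that this line of work eventually produced (flush_kernel_vmap_range() /
invalidate_kernel_vmap_range(), per Documentation/cachetlb.txt); treat
the pairing as the point and the names as an assumption here:]

#include <linux/fs.h>		/* READ/WRITE */
#include <linux/highmem.h>

/* Before submitting I/O on a vmap'd buffer: push dirty lines from the
 * CPU's virtual alias so the device sees current data, and so no later
 * writeback can land on top of an in-flight read. */
static void vmap_buf_pre_io(void *vaddr, int size)
{
	flush_kernel_vmap_range(vaddr, size);
}

/* After a device-to-memory transfer completes: drop stale lines so the
 * CPU re-reads the DMA'd data through the alias. */
static void vmap_buf_post_io(void *vaddr, int size, int rw)
{
	if (rw == READ)
		invalidate_kernel_vmap_range(vaddr, size);
}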
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Christoph Hellwig @ 2009-12-17 17:51 UTC
To: Linus Torvalds
Cc: Christoph Hellwig, tytso, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, linux-arch, Jens Axboe

On Thu, Dec 17, 2009 at 09:42:15AM -0800, Linus Torvalds wrote:
> > Which is exactly what the XFS code does. Pages are allocated manually
> > and we store pointers to the page struct that later get added to the
> > bio.
>
> Hmm. The BIO interface that the patch-series changes (bio_map_kern)
> doesn't work that way. It takes a "buf, len" kind of thing. That's what
> I'm complaining about.

Indeed, the "block: permit I/O to vmalloc/vmap kernel pages" patch does
what you complain about. But the series doesn't actually add a user for
that. What it does in XFS is quite a bit of black magic, too - but only
with the new cache coherence calls that are no-ops on architectures with
physically indexed caches.

> Well, they clearly are _after_ this series, since that's what all those
> changes to __bio_map_kernel() and bio_map_kern_endio() are all about.
>
> So I believe you when you say that XFS perhaps does everything right - I
> just think that the patch series in question actually makes things worse,
> exactly because it is starting to use virtual addresses.

I'm not entirely sure why James added those, but XFS doesn't actually
use bio_map_kern.

> And I really think that would be all much more properly done at the
> _caller_ level, not by the BIO layer.
>
> You must have some locking and allocation etc logic at the caller anyway,
> why doesn't _that_ level just do the flushing or invalidation?

http://git.kernel.org/?p=linux/kernel/git/kyle/parisc-2.6.git;a=commitdiff;h=56c8214b842324e94aa88012010b0f1f9847daec

does it at the caller level. Not exactly in a beautiful way, but who am
I to complain, as I'm already lost in our mess of cache coherency APIs.

> IOW, I'm perfectly happy with the patch to fs/xfs/linux-2.6/xfs_buf.c.
> That one still seems to use 'bio_add_page()' with a regular 'struct page'.
>
> But the fs/bio.c patch looks like just total and utter crap to me, and is
> the reason I refuse to pull this series.

Kyle/James, can you regenerate the tree without that patch included?
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Russell King @ 2009-12-17 18:08 UTC
To: Linus Torvalds
Cc: Christoph Hellwig, tytso, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, linux-arch, Jens Axboe

On Thu, Dec 17, 2009 at 09:42:15AM -0800, Linus Torvalds wrote:
> You both flush the virtual caches before
> the IO and invalidate after - when the real pattern should be that you
> flush it before a write, and invalidate it after a read.

That's not entirely true. If you have write back caches which are not
DMA coherent, you need to as a minimum:

- on write, clean the cache to ensure that the page is up to date with
  data held in cache.

- on read, you must ensure that there are no potential write-backs before
  the read commences, and invalidate at some point.

The point at which you invalidate depends on whether the CPU
speculatively prefetches:

- If it doesn't, you can invalidate the cache before the read, thereby
  destroying any potential writebacks, and the cache will remain
  unallocated for that address range until explicitly accessed.

- If you do have a CPU which does prefetch speculatively, then you do
  need to clean the cache before DMA starts, and then you must invalidate
  after the DMA completes.

Invalidating after DMA completes in the non-speculative-prefetch case
just wastes performance, especially if you have to do so line by line
over a region.

With ARM architecture version 7, we now have ARM CPUs which fall into
both categories.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
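[Russell's two cases, reduced to a sketch. clean_cache_range() and
invalidate_cache_range() are hypothetical stand-ins for the per-arch
primitives; the real logic lives in arch code:]

/* Maintenance around a device-to-memory (DMA read into RAM) transfer. */
static void dma_map_for_read(void *vaddr, size_t size, bool speculates)
{
	if (!speculates) {
		/* One shot: kill dirty lines now; nothing can refill
		 * them until the CPU touches the range again. */
		invalidate_cache_range(vaddr, size);	/* hypothetical */
		return;
	}
	/* Speculating CPU: clean first so no pending writeback can
	 * clobber the transfer... */
	clean_cache_range(vaddr, size);			/* hypothetical */
}

static void dma_unmap_for_read(void *vaddr, size_t size, bool speculates)
{
	/* ...and invalidate after completion, because speculation may
	 * have refilled lines while the DMA was in flight. */
	if (speculates)
		invalidate_cache_range(vaddr, size);	/* hypothetical */
}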
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Linus Torvalds @ 2009-12-17 18:17 UTC
To: Russell King
Cc: Christoph Hellwig, tytso, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, linux-arch, Jens Axboe

On Thu, 17 Dec 2009, Russell King wrote:
>
> That's not entirely true. If you have write back caches which are not
> DMA coherent, you need to as a minimum:
>
> - on write, clean the cache to ensure that the page is up to date with
>   data held in cache.
>
> - on read, you must ensure that there are no potential write-backs before
>   the read commences, and invalidate at some point.

Right you are.

		Linus
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Ralf Baechle @ 2009-12-19 18:33 UTC
To: Linus Torvalds
Cc: Christoph Hellwig, tytso, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, linux-arch, Jens Axboe

On Thu, Dec 17, 2009 at 09:42:15AM -0800, Linus Torvalds wrote:
>
> I also think that the changes to bio_map_kernel() and bio_map_kern_endio()
> are not just "fundamentally ugly", I think they are made worse by the fact
> that it's not even done "right". You both flush the virtual caches before
> the IO and invalidate after - when the real pattern should be that you
> flush it before a write, and invalidate it after a read.
>
> And I really think that would be all much more properly done at the
> _caller_ level, not by the BIO layer.
>
> You must have some locking and allocation etc logic at the caller anyway,
> why doesn't _that_ level just do the flushing or invalidation?

And then there are certain types of caches that need invalidation before
_and_ after a DMA transaction as a workaround for a processor being
grossly abused in a system that it should not be used in. Basically the
issue is that falsely speculated stores may dirty caches.

  Ralf
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: James Bottomley @ 2009-12-21 17:14 UTC
To: Ralf Baechle
Cc: Linus Torvalds, Christoph Hellwig, tytso, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, linux-arch, Jens Axboe

On Sat, 2009-12-19 at 18:33 +0000, Ralf Baechle wrote:
> On Thu, Dec 17, 2009 at 09:42:15AM -0800, Linus Torvalds wrote:
> >
> > I also think that the changes to bio_map_kernel() and bio_map_kern_endio()
> > are not just "fundamentally ugly", I think they are made worse by the fact
> > that it's not even done "right". You both flush the virtual caches before
> > the IO and invalidate after - when the real pattern should be that you
> > flush it before a write, and invalidate it after a read.
> >
> > And I really think that would be all much more properly done at the
> > _caller_ level, not by the BIO layer.
> >
> > You must have some locking and allocation etc logic at the caller anyway,
> > why doesn't _that_ level just do the flushing or invalidation?
>
> And then there are certain types of caches that need invalidation before
> _and_ after a DMA transaction as a workaround for a processor being
> grossly abused in a system that it should not be used in. Basically the
> issue is that falsely speculated stores may dirty caches.

Um, so that's just so wrong it doesn't even seem possible. It would mean
that a speculation could make a line dirty while DMA was in progress to
the underlying memory. There's always going to be a window between the
DMA completion and the software invalidation where the dirty line could
be flushed, thus corrupting the DMA transfer.

Hopefully steps have been taken to see that whoever thought this was a
good idea isn't still contributing to the gene pool?

James
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: tytso @ 2009-12-17 17:39 UTC
To: Linus Torvalds
Cc: Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, hch, linux-arch, Jens Axboe

On Thu, Dec 17, 2009 at 08:46:33AM -0800, Linus Torvalds wrote:
> kmalloc() memory should be ok. It's backed by "real pages". Doing the DMA
> translations for such pages is trivial and fundamental.

Sure, but there's some rumors/oral traditions going around that some
block devices want bio addresses which are page aligned, because they
want to play some kind of refcounting game, and if you pass them
kmalloc() memory, they will explode in some interesting and
entertaining way.

And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
don't have access to, so just because it works Just Fine on SATA doesn't
mean anything.

And none of this is documented anywhere, which is frustrating as hell.
Just rumors that "if you do this, AoE/iSCSI will corrupt your file
systems".

						- Ted
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Linus Torvalds @ 2009-12-17 17:51 UTC
To: tytso
Cc: Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, hch, linux-arch, Jens Axboe

On Thu, 17 Dec 2009, tytso@mit.edu wrote:
>
> Sure, but there's some rumors/oral traditions going around that some
> block devices want bio addresses which are page aligned, because they
> want to play some kind of refcounting game,

Yeah, you might be right at that.

> And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> don't have access to, so just because it works Just Fine on SATA doesn't
> mean anything.
>
> And none of this is documented anywhere, which is frustrating as hell.
> Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> systems".

ACK. Jens?

		Linus
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Jens Axboe @ 2009-12-17 19:36 UTC
To: Linus Torvalds
Cc: tytso, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, hch, linux-arch

On Thu, Dec 17 2009, Linus Torvalds wrote:
>
> On Thu, 17 Dec 2009, tytso@mit.edu wrote:
> >
> > Sure, but there's some rumors/oral traditions going around that some
> > block devices want bio addresses which are page aligned, because they
> > want to play some kind of refcounting game,
>
> Yeah, you might be right at that.
>
> > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> > don't have access to, so just because it works Just Fine on SATA doesn't
> > mean anything.
> >
> > And none of this is documented anywhere, which is frustrating as hell.
> > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > systems".
>
> ACK. Jens?

I've heard those rumours too, and I don't even know if they are true.
Who has a pointer to such a bug report and/or issue? The block layer
itself doesn't have any such requirements, and the only places where
we play page games are for bio's that were explicitly mapped with pages
by itself (like mapping user data).

We fix driver crap like that, we don't work around it. It's a BUG.

-- 
Jens Axboe
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: James Bottomley @ 2009-12-17 23:57 UTC
To: Jens Axboe
Cc: Linus Torvalds, tytso, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, hch, linux-arch

On Thu, 2009-12-17 at 20:36 +0100, Jens Axboe wrote:
> On Thu, Dec 17 2009, Linus Torvalds wrote:
> >
> > On Thu, 17 Dec 2009, tytso@mit.edu wrote:
> > >
> > > Sure, but there's some rumors/oral traditions going around that some
> > > block devices want bio addresses which are page aligned, because they
> > > want to play some kind of refcounting game,
> >
> > Yeah, you might be right at that.
> >
> > > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> > > don't have access to, so just because it works Just Fine on SATA doesn't
> > > mean anything.
> > >
> > > And none of this is documented anywhere, which is frustrating as hell.
> > > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > > systems".
> >
> > ACK. Jens?
>
> I've heard those rumours too, and I don't even know if they are true.
> Who has a pointer to such a bug report and/or issue? The block layer
> itself doesn't have any such requirements, and the only places where
> we play page games are for bio's that were explicitly mapped with pages
> by itself (like mapping user data).

OK, so what happened is that prior to the map single fix

commit df46b9a44ceb5af2ea2351ce8e28ae7bd840b00f
Author: Mike Christie <michaelc@cs.wisc.edu>
Date:   Mon Jun 20 14:04:44 2005 +0200

    [PATCH] Add blk_rq_map_kern()

bio could only accept user space buffers, so we had a special path for
kernel allocated buffers. That commit unified the path (with a separate
block API) so we could now submit kmalloc'd buffers via block APIs.

So the rule now is we can accept any user mapped area via
blk_rq_map_user() and any kmalloc'd area via blk_rq_map_kern(). We might
not be able to do a stack area (depending on how the arch maps the
stack) and we definitely cannot do a vmalloc'd area.

So it sounds like we only need a blk_rq_map_vmalloc() using the same
techniques as the patch set and we're good to go.

> We fix driver crap like that, we don't work around it. It's a BUG.

James
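[A sketch of the rule as James states it, using the blk_rq_map_kern()
path of that era. The surrounding request setup and function name are
illustrative, and error handling is minimal:]

#include <linux/blkdev.h>

/* Submit a kmalloc'd buffer for a device read. blk_rq_map_kern()
 * bounces misaligned buffers itself, but it cannot cope with a
 * vmalloc'd one - that is the gap being debated in this thread. */
static int read_into_kern_buf(struct request_queue *q,
			      struct gendisk *disk,
			      void *kbuf, unsigned int len)
{
	struct request *rq;
	int err;

	rq = blk_get_request(q, READ, GFP_KERNEL);
	if (!rq)
		return -ENOMEM;

	err = blk_rq_map_kern(q, rq, kbuf, len, GFP_KERNEL);
	if (err) {
		blk_put_request(rq);
		return err;
	}
	return blk_execute_rq(q, disk, rq, 0);
}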
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: FUJITA Tomonori @ 2009-12-18 1:00 UTC
To: James.Bottomley
Cc: jens.axboe, torvalds, tytso, kyle, linux-parisc, linux-kernel, hch, linux-arch

On Fri, 18 Dec 2009 00:57:00 +0100, James Bottomley
<James.Bottomley@suse.de> wrote:
> On Thu, 2009-12-17 at 20:36 +0100, Jens Axboe wrote:
> > On Thu, Dec 17 2009, Linus Torvalds wrote:
> > > On Thu, 17 Dec 2009, tytso@mit.edu wrote:
> > > >
> > > > Sure, but there's some rumors/oral traditions going around that some
> > > > block devices want bio addresses which are page aligned, because they
> > > > want to play some kind of refcounting game,
> > >
> > > Yeah, you might be right at that.
> > >
> > > > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> > > > don't have access to, so just because it works Just Fine on SATA doesn't
> > > > mean anything.
> > > >
> > > > And none of this is documented anywhere, which is frustrating as hell.
> > > > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > > > systems".
> > >
> > > ACK. Jens?
> >
> > I've heard those rumours too, and I don't even know if they are true.
> > Who has a pointer to such a bug report and/or issue? The block layer
> > itself doesn't have any such requirements, and the only places where
> > we play page games are for bio's that were explicitly mapped with pages
> > by itself (like mapping user data).
>
> OK, so what happened is that prior to the map single fix
>
> commit df46b9a44ceb5af2ea2351ce8e28ae7bd840b00f
> Author: Mike Christie <michaelc@cs.wisc.edu>
> Date:   Mon Jun 20 14:04:44 2005 +0200
>
>     [PATCH] Add blk_rq_map_kern()
>
> bio could only accept user space buffers, so we had a special path for
> kernel allocated buffers. That commit unified the path (with a separate
> block API) so we could now submit kmalloc'd buffers via block APIs.
>
> So the rule now is we can accept any user mapped area via
> blk_rq_map_user() and any kmalloc'd area via blk_rq_map_kern(). We might
> not be able to do a stack area (depending on how the arch maps the
> stack) and we definitely cannot do a vmalloc'd area.
>
> So it sounds like we only need a blk_rq_map_vmalloc() using the same
> techniques as the patch set and we're good to go.

I'm not sure about it.

As I said before (when I was against this 'adding vmalloc support to
the block layer' stuff), are there potential users of this except for
XFS? Is there anyone who does such a thing now?

This API might be useful only for journaling file systems using log
formats that need a large contiguous buffer. Sounds like only XFS?

Even if we have some potential users, I'm not sure that supporting
vmalloc in the block layer officially is a good idea. IMO, it needs too
many tricks for generic code.
* Re: [git patches] xfs and block fixes for virtually indexed arches

From: Dave Chinner @ 2009-12-18 2:44 UTC
To: FUJITA Tomonori
Cc: James.Bottomley, jens.axboe, torvalds, tytso, kyle, linux-parisc, linux-kernel, hch, linux-arch

On Fri, Dec 18, 2009 at 10:00:21AM +0900, FUJITA Tomonori wrote:
> On Fri, 18 Dec 2009 00:57:00 +0100, James Bottomley
> <James.Bottomley@suse.de> wrote:
> > On Thu, 2009-12-17 at 20:36 +0100, Jens Axboe wrote:
> > > On Thu, Dec 17 2009, Linus Torvalds wrote:
> > > > On Thu, 17 Dec 2009, tytso@mit.edu wrote:
> > > > >
> > > > > Sure, but there's some rumors/oral traditions going around that some
> > > > > block devices want bio addresses which are page aligned, because they
> > > > > want to play some kind of refcounting game,
> > > >
> > > > Yeah, you might be right at that.
> > > >
> > > > > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> > > > > don't have access to, so just because it works Just Fine on SATA doesn't
> > > > > mean anything.
> > > > >
> > > > > And none of this is documented anywhere, which is frustrating as hell.
> > > > > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > > > > systems".
> > > >
> > > > ACK. Jens?
> > >
> > > I've heard those rumours too, and I don't even know if they are true.
> > > Who has a pointer to such a bug report and/or issue? The block layer
> > > itself doesn't have any such requirements, and the only places where
> > > we play page games are for bio's that were explicitly mapped with pages
> > > by itself (like mapping user data).
> >
> > OK, so what happened is that prior to the map single fix
> >
> > commit df46b9a44ceb5af2ea2351ce8e28ae7bd840b00f
> > Author: Mike Christie <michaelc@cs.wisc.edu>
> > Date:   Mon Jun 20 14:04:44 2005 +0200
> >
> >     [PATCH] Add blk_rq_map_kern()
> >
> > bio could only accept user space buffers, so we had a special path for
> > kernel allocated buffers. That commit unified the path (with a separate
> > block API) so we could now submit kmalloc'd buffers via block APIs.
> >
> > So the rule now is we can accept any user mapped area via
> > blk_rq_map_user() and any kmalloc'd area via blk_rq_map_kern(). We might
> > not be able to do a stack area (depending on how the arch maps the
> > stack) and we definitely cannot do a vmalloc'd area.
> >
> > So it sounds like we only need a blk_rq_map_vmalloc() using the same
> > techniques as the patch set and we're good to go.
>
> I'm not sure about it.
>
> As I said before (when I was against this 'adding vmalloc support to
> the block layer' stuff), are there potential users of this except for
> XFS? Is there anyone who does such a thing now?

As Christoph already mentioned, XFS is not passing the vmalloc'd
range to the block layer - it passes the underlying pages to the
block layer. Hence I'm not sure there actually is anyone who is
passing vmalloc'd addresses to the block layer. Perhaps we should
put a WARN_ON() in the block layer to catch anyone doing such a
thing before considering supporting vmalloc'd addresses in the block
layer?

> This API might be useful only for journaling file systems using log
> formats that need a large contiguous buffer. Sounds like only XFS?

FWIW, mapped buffers larger than PAGE_SIZE are used for more than just
log recovery in XFS. e.g. filesystems with directory block size larger
than page size use mapped buffers.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
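[Dave's tripwire suggestion, as a one-function sketch; exactly where it
would sit in the block layer is left open in the thread, and the helper
name is made up for illustration:]

#include <linux/kernel.h>
#include <linux/mm.h>	/* is_vmalloc_addr() */

/* A vmalloc/vmap address has no meaning to a block device; warn loudly
 * at the boundary rather than silently corrupting I/O on virtually
 * indexed caches. */
static inline void warn_on_vmalloc_buf(const void *buf)
{
	WARN_ON_ONCE(is_vmalloc_addr(buf));
}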
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-18 2:44 ` Dave Chinner 2009-12-18 2:44 ` Dave Chinner @ 2009-12-18 3:51 ` FUJITA Tomonori 2009-12-18 3:51 ` FUJITA Tomonori 2009-12-18 7:10 ` James Bottomley 2 siblings, 1 reply; 48+ messages in thread From: FUJITA Tomonori @ 2009-12-18 3:51 UTC (permalink / raw) To: david Cc: fujita.tomonori, James.Bottomley, jens.axboe, torvalds, tytso, kyle, linux-parisc, linux-kernel, hch, linux-arch On Fri, 18 Dec 2009 13:44:40 +1100 Dave Chinner <david@fromorbit.com> wrote: > > > So it sounds like we only need a blk_rq_map_vmalloc() using the same > > > techniques as the patch set and we're good to go. > > > > I'm not sure about it. > > > > As I said before (when I was against this 'adding vmalloc support to > > the block layer' stuff), are there potential users of this except for > > XFS? Are there anyone who does such a thing now? > > As Christoph already mentioned, XFS is not passing the vmalloc'd > range to the block layer Oops, I should have said a vmalloc/vmap area. > > This API might be useful for only journaling file systems using log > formats that need large contiguous buffer. Sound like only XFS? > > FWIW, mapped buffers larger than PAGE_SIZE are used for more than just log > recovery in XFS. e.g. filesystems with directory block size larger > than page size use mapped buffers. I see, thanks. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-18 2:44 ` Dave Chinner 2009-12-18 2:44 ` Dave Chinner 2009-12-18 3:51 ` FUJITA Tomonori @ 2009-12-18 7:10 ` James Bottomley 2 siblings, 0 replies; 48+ messages in thread From: James Bottomley @ 2009-12-18 7:10 UTC (permalink / raw) To: Dave Chinner Cc: FUJITA Tomonori, jens.axboe, torvalds, tytso, kyle, linux-parisc, linux-kernel, hch, linux-arch On Fri, 2009-12-18 at 13:44 +1100, Dave Chinner wrote: > On Fri, Dec 18, 2009 at 10:00:21AM +0900, FUJITA Tomonori wrote: > > On Fri, 18 Dec 2009 00:57:00 +0100 > > James Bottomley <James.Bottomley@suse.de> wrote: > > > > > On Thu, 2009-12-17 at 20:36 +0100, Jens Axboe wrote: > > > > On Thu, Dec 17 2009, Linus Torvalds wrote: > > > > > > > > > > > > > > > On Thu, 17 Dec 2009, tytso@mit.edu wrote: > > > > > > > > > > > > Sure, but there's some rumors/oral traditions going around that some > > > > > > block devices want bio address which are page aligned, because they > > > > > > want to play some kind of refcounting game, > > > > > > > > > > Yeah, you might be right at that. > > > > > > > > > > > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us > > > > > > don't have access to, so just because it works Just Fine on SATA doesn't > > > > > > mean anything. > > > > > > > > > > > > And none of this is documented anywhere, which is frustrating as hell. > > > > > > Just rumors that "if you do this, AoE/iSCSI will corrupt your file > > > > > > systems". > > > > > > > > > > ACK. Jens? > > > > > > > > I've heard those rumours too, and I don't even know if they are true. > > > > Who has a pointer to such a bug report and/or issue? The block layer > > > > itself doesn't not have any such requirements, and the only places where > > > > we play page games is for bio's that were explicitly mapped with pages > > > > by itself (like mapping user data).o > > > > > > OK, so what happened is that prior to the map single fix > > > > > > commit df46b9a44ceb5af2ea2351ce8e28ae7bd840b00f > > > Author: Mike Christie <michaelc@cs.wisc.edu> > > > Date: Mon Jun 20 14:04:44 2005 +0200 > > > > > > [PATCH] Add blk_rq_map_kern() > > > > > > > > > bio could only accept user space buffers, so we had a special path for > > > kernel allocated buffers. That commit unified the path (with a separate > > > block API) so we could now submit kmalloc'd buffers via block APIs. > > > > > > So the rule now is we can accept any user mapped area via > > > blk_rq_map_user and any kmalloc'd area via blk_rq_map_kern(). We might > > > not be able to do a stack area (depending on how the arch maps the > > > stack) and we definitely cannot do a vmalloc'd area. > > > > > > So it sounds like we only need a blk_rq_map_vmalloc() using the same > > > techniques as the patch set and we're good to go. > > > > I'm not sure about it. > > > > As I said before (when I was against this 'adding vmalloc support to > > the block layer' stuff), are there potential users of this except for > > XFS? Are there anyone who does such a thing now? > > As Christoph already mentioned, XFS is not passing the vmalloc'd > range to the block layer - it passes the underlying pages to the > block layer. Hence I'm not sure there actually is anyone who is > passing vmalloc'd addresses to the block layer. Perhaps we should > put a WARN_ON() in the block layer to catch anyone doing such a > thing before considering supporting vmalloc'd addresses in the block > layer? 
In the above statements, vmalloc is just shorthand for vmap/vmalloc (basically anything with an additional kernel virtual mapping which causes aliases). If we support vmap, we naturally support vmalloc as well. > > This API might be useful for only journaling file systems using log > formats that need large contiguous buffer. Sound like only XFS? > > FWIW, mapped buffers larger than PAGE_SIZE are used for more than just log > recovery in XFS. e.g. filesystems with directory block size larger > than page size use mapped buffers. However, XFS is the only fs that actually uses kernel virtual mapping to solve this problem. James ^ permalink raw reply [flat|nested] 48+ messages in thread
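The aliasing James refers to is easiest to see with one physical page and two kernel virtual addresses. A small illustration (not from the patches; harmless on a physically indexed cache, but on a virtually indexed one the two mappings can occupy different cache lines):

        struct page *page = alloc_page(GFP_KERNEL);
        void *linear = page_address(page);                  /* offset-mapped alias */
        void *mapped = vmap(&page, 1, VM_MAP, PAGE_KERNEL); /* second alias */

        memset(mapped, 0xff, PAGE_SIZE);
        /*
         * On a virtually indexed cache the dirty lines live under the
         * 'mapped' alias, so a DMA from the page (or a read through
         * 'linear') can still see stale data until that alias is flushed.
         */
        vunmap(mapped);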
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-18 1:00 ` FUJITA Tomonori 2009-12-18 2:44 ` Dave Chinner @ 2009-12-18 7:08 ` James Bottomley 2009-12-18 9:34 ` FUJITA Tomonori 2009-12-18 12:00 ` Dave Chinner 1 sibling, 2 replies; 48+ messages in thread From: James Bottomley @ 2009-12-18 7:08 UTC (permalink / raw) To: FUJITA Tomonori Cc: jens.axboe, torvalds, tytso, kyle, linux-parisc, linux-kernel, hch, linux-arch On Fri, 2009-12-18 at 10:00 +0900, FUJITA Tomonori wrote: > On Fri, 18 Dec 2009 00:57:00 +0100 > James Bottomley <James.Bottomley@suse.de> wrote: > > > On Thu, 2009-12-17 at 20:36 +0100, Jens Axboe wrote: > > > On Thu, Dec 17 2009, Linus Torvalds wrote: > > > > > > > > > > > > On Thu, 17 Dec 2009, tytso@mit.edu wrote: > > > > > > > > > > Sure, but there's some rumors/oral traditions going around that some > > > > > block devices want bio address which are page aligned, because they > > > > > want to play some kind of refcounting game, > > > > > > > > Yeah, you might be right at that. > > > > > > > > > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us > > > > > don't have access to, so just because it works Just Fine on SATA doesn't > > > > > mean anything. > > > > > > > > > > And none of this is documented anywhere, which is frustrating as hell. > > > > > Just rumors that "if you do this, AoE/iSCSI will corrupt your file > > > > > systems". > > > > > > > > ACK. Jens? > > > > > > I've heard those rumours too, and I don't even know if they are true. > > > Who has a pointer to such a bug report and/or issue? The block layer > > > itself doesn't not have any such requirements, and the only places where > > > we play page games is for bio's that were explicitly mapped with pages > > > by itself (like mapping user data).o > > > > OK, so what happened is that prior to the map single fix > > > > commit df46b9a44ceb5af2ea2351ce8e28ae7bd840b00f > > Author: Mike Christie <michaelc@cs.wisc.edu> > > Date: Mon Jun 20 14:04:44 2005 +0200 > > > > [PATCH] Add blk_rq_map_kern() > > > > > > bio could only accept user space buffers, so we had a special path for > > kernel allocated buffers. That commit unified the path (with a separate > > block API) so we could now submit kmalloc'd buffers via block APIs. > > > > So the rule now is we can accept any user mapped area via > > blk_rq_map_user and any kmalloc'd area via blk_rq_map_kern(). We might > > not be able to do a stack area (depending on how the arch maps the > > stack) and we definitely cannot do a vmalloc'd area. > > > > So it sounds like we only need a blk_rq_map_vmalloc() using the same > > techniques as the patch set and we're good to go. > > I'm not sure about it. > > As I said before (when I was against this 'adding vmalloc support to > the block layer' stuff), are there potential users of this except for > XFS? Are there anyone who does such a thing now? > > This API might be useful for only journaling file systems using log > formats that need large contiguous buffer. Sound like only XFS? > > Even if we have some potential users, I'm not sure that supporting > vmalloc in the block layer officially is a good idea. IMO, it needs > too many tricks for generic code. So far, there's only xfs that I know of. Given the way journalling works, it's not an unusual requirement to use a large buffer for operations. 
It's a bit of a waste of kernel resources to require this to be physically contiguous (and for buffers over our kmalloc max, it would even have to be allocated at start of day), so I think large kernel virtual areas (like vmap/vmalloc) have a part to play in fs operations. As to the API, the specific problem is that the block and lower arch layers are specifically programmed to think any kernel address has only a single alias and that it's offset mapped, which is why we get the failure. An alternative proposal to modifying the block layer to do coherency might be simply to have the fs layer do a vunmap before doing I/O and re-vmap when it's completed. That would ensure the architecturally correct flushing of the aliases, and would satisfy the expectations of blk_rq_map_kern(). The downside is that vmap/vmalloc set up and clear page tables, which isn't necessary and might impact performance (xfs people?). If the performance impact of the above is too great, then we might introduce a vmalloc sync API to do the flush before and the invalidate after (it would have to be called twice, once before I/O and once after). This is sort of a violation of our architectural knowledge layering, since the user of a vmap/vmalloc area has to know intrinsically how to handle I/O instead of having it just work(tm), but since the users are few and specialised, it's not going to lead to too many coding problems. Any opinions? James ^ permalink raw reply [flat|nested] 48+ messages in thread
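The coherence API in the posted series is exactly such a flush-before/invalidate-after pair. A usage sketch, assuming the flush_kernel_vmap_range()/invalidate_kernel_vmap_range() names from the series and a hypothetical page-based submission helper:

        void *buf = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);
        /* ... fill 'buf' through the vmap alias ... */

        /* write: push dirty lines under the alias out before the device reads */
        flush_kernel_vmap_range(buf, len);
        submit_pages_and_wait(pages, len, WRITE);       /* hypothetical helper */

        /* read: after the device has DMA'd into the pages, drop any stale
         * lines still cached under the alias before the CPU looks at them */
        submit_pages_and_wait(pages, len, READ);        /* hypothetical helper */
        invalidate_kernel_vmap_range(buf, len);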
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-18 7:08 ` James Bottomley @ 2009-12-18 9:34 ` FUJITA Tomonori 2009-12-18 10:01 ` James Bottomley 2009-12-18 12:00 ` Dave Chinner 1 sibling, 1 reply; 48+ messages in thread From: FUJITA Tomonori @ 2009-12-18 9:34 UTC (permalink / raw) To: James.Bottomley Cc: fujita.tomonori, jens.axboe, torvalds, tytso, kyle, linux-parisc, linux-kernel, hch, linux-arch On Fri, 18 Dec 2009 08:08:48 +0100 James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > Even if we have some potential users, I'm not sure that supporting > > vmalloc in the block layer officially is a good idea. IMO, it needs > > too many tricks for generic code. > > So far, there's only xfs that I know of. > > Given the way journalling works, it's not an unusual requirement to use > a large buffer for operations. It's a bit of a waste of kernel > resources to require this to be physically contiguous (and for buffers > over our kmalloc max, it would even have to be allocated at start of > day), so I think large kernel virtual areas (like vmap/vmalloc) have a > part to play in fs operations. Yeah, but now only XFS passes vmap'ed pages to the block layer. Isn't it better to wait until we have real users of the API? > As to the API, the specific problem is that the block and lower arch > layers are specifically programmed to think any kernel address has only > a single alias and that it's offset mapped, which is why we get the > failure. Yeah, however we can make a rule that you can't pass a vmap area (including vmap'ed pages) to the block layer. We can't make the rule effective for the past, so XFS is the only exception. > An alternative proposal to modifying the block layer to do coherency > might be simply to have the fs layer do a vunmap before doing I/O and > re-vmap when it's completed. I'm not sure it's worth making the whole block layer compatible with vmap (adding complexity and possibly a performance penalty). If we really need to support this, I like helper APIs that the callers must use before and after I/O. > That would ensure the architecturally > correct flushing of the aliases, and would satisfy the expectations of > blk_rq_map_kern(). The downside is that vmap/vmalloc set up and clear > page tables, which isn't necessary and might impact performance (xfs > people?) btw, I'm not sure that the existing blk_rq_map_* API fits file systems well, since blk_rq_map_user and blk_rq_map_kern take a request structure. ^ permalink raw reply [flat|nested] 48+ messages in thread
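One cheap way to make such a rule enforceable, and to catch existing offenders as Dave suggested upthread, would be a guard at the kernel-buffer mapping entry point. A hypothetical sketch, not part of the posted series:

        int blk_rq_map_kern(struct request_queue *q, struct request *rq,
                            void *kbuf, unsigned int len, gfp_t gfp_mask)
        {
                /*
                 * vmap/vmalloc addresses can have cache aliases the block
                 * layer knows nothing about: refuse them loudly (this
                 * check is illustrative, not in the actual function).
                 */
                if (WARN_ON_ONCE(is_vmalloc_addr(kbuf)))
                        return -EINVAL;
                /* ... existing mapping logic ... */
        }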
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-18 9:34 ` FUJITA Tomonori @ 2009-12-18 10:01 ` James Bottomley 2009-12-18 10:01 ` James Bottomley 2009-12-18 10:24 ` FUJITA Tomonori 0 siblings, 2 replies; 48+ messages in thread From: James Bottomley @ 2009-12-18 10:01 UTC (permalink / raw) To: FUJITA Tomonori Cc: jens.axboe, torvalds, tytso, kyle, linux-parisc, linux-kernel, hch, linux-arch On Fri, 2009-12-18 at 18:34 +0900, FUJITA Tomonori wrote: > On Fri, 18 Dec 2009 08:08:48 +0100 > James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > > > Even if we have some potential users, I'm not sure that supporting > > > vmalloc in the block layer officially is a good idea. IMO, it needs > > > too many tricks for generic code. > > > > So far, there's only xfs that I know of. > > > > Given the way journalling works, it's not an unusual requirement to use > > a large buffer for operations. It's a bit of a waste of kernel > > resources to require this to be physically contiguous (and for buffers > > over our kmalloc max, it would even have to be allocated at start of > > day), so I think large kernel virtual areas > > (like vmap/vmalloc) have a part to play in fs operations. > > Yeah, but now only XFS passes vmap'ed pages to the block layer. Isn't > it better to wait until we have real users of the API? XFS is a real user ... the XFS filesystem is our most trusted code base that can break the 8TB limit, which hard disks are already at. Ext4 may be ready, but it's not universally present in enterprise distros like XFS. > > As to the API, the specific problem is that the block and lower arch > > layers are specifically programmed to think any kernel address has only > > a single alias and that it's offset mapped, which is why we get the > > failure. > > Yeah, however we can make a rule that you can't pass a vmap area > (including vmap'ed pages) to the block layer. We can't make the rule > effective for the past, so XFS is the only exception. We need something that works for XFS. The vunmap proposal works for the current block API because the vunmap makes the xfs pages look like standard kernel pages, which blk_rq_map_kern() will process correctly. But, in principle, I think whatever fix is chosen, we shouldn't necessarily discourage others from using it. > > An alternative proposal to modifying the block layer to do coherency > > might be simply to have the fs layer do a vunmap before doing I/O and > > re-vmap when it's completed. > > I'm not sure it's worth making the whole block layer compatible with > vmap (adding complexity and possibly a performance penalty). This proposal has no block layer changes. It just makes the XFS vmap area look like a standard set of kernel pages ... with the overhead of the page table manipulations on unmap and remap. > If we really need to support this, I like helper APIs that the callers > must use before and after I/O. If it's just this route, they already exist ... they're vmap and vunmap. > > That would ensure the architecturally > > correct flushing of the aliases, and would satisfy the expectations of > > blk_rq_map_kern(). The downside is that vmap/vmalloc set up and clear > > page tables, which isn't necessary and might impact performance (xfs > > people?) > > btw, I'm not sure that the existing blk_rq_map_* API fits file systems > well, since blk_rq_map_user and blk_rq_map_kern take a request > structure. OK, so that was illustrative. 
The meat of the change is at the bio layer anyway (filesystems tend to speak bios). But the point of *this* particular proposal is that it requires no changes either in the blk_ or bio_ routines. James ^ permalink raw reply [flat|nested] 48+ messages in thread
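For context, these are the 2.6.32-era signatures of the request-based mapping API being discussed; filesystems build struct bio directly and never hold a struct request, which is the mismatch FUJITA points out:

        /* existing kernel- and user-buffer mapping entry points */
        int blk_rq_map_kern(struct request_queue *q, struct request *rq,
                            void *kbuf, unsigned int len, gfp_t gfp_mask);

        int blk_rq_map_user(struct request_queue *q, struct request *rq,
                            struct rq_map_data *map_data, void __user *ubuf,
                            unsigned long len, gfp_t gfp_mask);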
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-18 10:01 ` James Bottomley 2009-12-18 10:01 ` James Bottomley @ 2009-12-18 10:24 ` FUJITA Tomonori 2009-12-18 10:30 ` James Bottomley 1 sibling, 1 reply; 48+ messages in thread From: FUJITA Tomonori @ 2009-12-18 10:24 UTC (permalink / raw) To: James.Bottomley Cc: fujita.tomonori, jens.axboe, torvalds, tytso, kyle, linux-parisc, linux-kernel, hch, linux-arch On Fri, 18 Dec 2009 11:01:29 +0100 James Bottomley <James.Bottomley@suse.de> wrote: > > Yeah, but now only XFS passes vmap'ed pages to the block layer. Isn't > > it better to wait until we have real users of the API? > > XFS is a real user ... the XFS filesystem is our most trusted code base > that can break the 8TB limit, which hard disks are already at. Ext4 may > be ready, but it's not universally present in enterprise distros like > XFS. XFS already has its own code to handle that, which works fine (with your patchset except for 5/6 for the block layer). Not much motivation for XFS to move to the generic API? > > > That would ensure the architecturally > > > correct flushing of the aliases, and would satisfy the expectations of > > > blk_rq_map_kern(). The downside is that vmap/vmalloc set up and clear > > > page tables, which isn't necessary and might impact performance (xfs > > > people?) > > > > btw, I'm not sure that the existing blk_rq_map_* API fits file systems > > well, since blk_rq_map_user and blk_rq_map_kern take a request > > structure. > > OK, so that was illustrative. The meat of the change is at the bio > layer anyway (filesystems tend to speak bios). Yeah, I think so, it's up to Jens to add new APIs for vmap there. ^ permalink raw reply [flat|nested] 48+ messages in thread
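If a bio-level entry point were added, it might take a shape like the following; the name and signature are purely hypothetical, sketched here only to show where such an API would sit:

        /*
         * Hypothetical API: build a bio over the pages backing a vmap or
         * vmalloc area, doing whatever alias flush/invalidate the arch
         * needs. Not an existing kernel function.
         */
        struct bio *bio_map_vmap(struct request_queue *q, void *vbuf,
                                 unsigned int len, gfp_t gfp_mask);

        /* usage sketch */
        bio = bio_map_vmap(q, log_buf, log_len, GFP_NOFS);
        if (IS_ERR(bio))
                return PTR_ERR(bio);
        submit_bio(WRITE, bio);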
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-18 10:24 ` FUJITA Tomonori @ 2009-12-18 10:30 ` James Bottomley 0 siblings, 0 replies; 48+ messages in thread From: James Bottomley @ 2009-12-18 10:30 UTC (permalink / raw) To: FUJITA Tomonori Cc: jens.axboe, torvalds, tytso, kyle, linux-parisc, linux-kernel, hch, linux-arch On Fri, 2009-12-18 at 19:24 +0900, FUJITA Tomonori wrote: > On Fri, 18 Dec 2009 11:01:29 +0100 > James Bottomley <James.Bottomley@suse.de> wrote: > > > > Yeah, but now only XFS passes vmap'ed pages to the block layer. Isn't > > > it better to wait until we have real users of the API? > > > > XFS is a real user ... the XFS filesystem is our most trusted code base > > that can break the 8TB limit, which hard disks are already at. Ext4 may > > be ready, but it's not universally present in enterprise distros like > > XFS. > > XFS already has its own code to handle that, which works fine (with > your patchset except for 5/6 for the block layer). Not much motivation > for XFS to move to the generic API? Right, but it's for completeness. If we decide to allow vmap buffers, then only supporting them on certain paths is a recipe for confusion in a year's time when someone assumes we support vmap buffers on all block paths; a bit like the current confusion over what we support ... > > > > That would ensure the architecturally > > > > correct flushing of the aliases, and would satisfy the expectations of > > > > blk_rq_map_kern(). The downside is that vmap/vmalloc set up and clear > > > > page tables, which isn't necessary and might impact performance (xfs > > > > people?) > > > > > > btw, I'm not sure that the existing blk_rq_map_* API fits file systems > > > well, since blk_rq_map_user and blk_rq_map_kern take a request > > > structure. > > > > OK, so that was illustrative. The meat of the change is at the bio > > layer anyway (filesystems tend to speak bios). > > Yeah, I think so, it's up to Jens to add new APIs for vmap there. Agreed. James ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-18 7:08 ` James Bottomley 2009-12-18 9:34 ` FUJITA Tomonori @ 2009-12-18 12:00 ` Dave Chinner 2009-12-18 12:00 ` Dave Chinner 1 sibling, 1 reply; 48+ messages in thread From: Dave Chinner @ 2009-12-18 12:00 UTC (permalink / raw) To: James Bottomley Cc: FUJITA Tomonori, jens.axboe, torvalds, tytso, kyle, linux-parisc, linux-kernel, hch, linux-arch On Fri, Dec 18, 2009 at 08:08:48AM +0100, James Bottomley wrote: > On Fri, 2009-12-18 at 10:00 +0900, FUJITA Tomonori wrote: > An alternative proposal to modifying the block layer to do coherency > might be simply to have the fs layer do a vunmap before doing I/O and > re-vmap when it's completed. I don't think that works for XFS as it stands. Unmapping across IO would have to guarantee we remap the buffer at the same address after IO because we currently assume that the mapped address of the buffer can't change while references are held on the buffer. That seems like an awful lot of complexity compared to the few lines of code in XFS and the arches needed to support this. > That would ensure the architecturally > correct flushing of the aliases, and would satisfy the expectations of > blk_rq_map_kern(). The downside is that vmap/vmalloc set up and clear > page tables, which isn't necessary and might impact performance (xfs > people?) We play lots of tricks in XFS to avoid mapping buffers when we can because of the performance impact it has. Nick Piggin's recent work getting vmap to scale helped a lot, but it is still best to avoid mapped buffers where possible. > If the performance impact of the above is too great, then we might > introduce a vmalloc sync API to do the flush before and the invalidate > after (it would have to be called twice, once before I/O and once after). > This is sort of a violation of our architectural knowledge layering, > since the user of a vmap/vmalloc area has to know intrinsically how to > handle I/O instead of having it just work(tm), but since the users are > few and specialised, it's not going to lead to too many coding problems. > > Any opinions? Personally I see nothing wrong with the original patch series. If the block layer mods are contentious, then just drop that patch until a real need is brought to life. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 48+ messages in thread
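Dave's objection, sketched. Field names are in the style of an xfs_buf but illustrative, as are the submission helpers; the problem is that the address returned by the re-vmap need not equal the old one:

        vunmap(bp->b_addr);             /* tears down (and flushes) the alias */
        submit_buffer_pages(bp);        /* hypothetical page-based submission */
        wait_for_buffer(bp);
        bp->b_addr = vmap(bp->b_pages, bp->b_page_count, VM_MAP, PAGE_KERNEL);
        /*
         * bp->b_addr may now differ from the address other holders of a
         * buffer reference are still using, which breaks the assumption
         * that a mapped buffer's address is stable for its lifetime.
         */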
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-17 17:39 ` tytso 2009-12-17 17:39 ` tytso 2009-12-17 17:51 ` Linus Torvalds @ 2009-12-18 0:21 ` FUJITA Tomonori 2009-12-18 14:17 ` tytso 2 siblings, 1 reply; 48+ messages in thread From: FUJITA Tomonori @ 2009-12-18 0:21 UTC (permalink / raw) To: tytso Cc: torvalds, kyle, linux-parisc, linux-kernel, James.Bottomley, hch, linux-arch, jens.axboe On Thu, 17 Dec 2009 12:39:57 -0500 tytso@mit.edu wrote: > On Thu, Dec 17, 2009 at 08:46:33AM -0800, Linus Torvalds wrote: > > kmalloc() memory should be ok. It's backed by "real pages". Doing the DMA > > translations for such pages is trivial and fundamental. > > Sure, but there's some rumors/oral traditions going around that some > block devices want bio address which are page aligned, because they > want to play some kind of refcounting game, and if you pass them a > kmalloc() memory, they will explode in some interesting and > entertaining way. And it's Weird Shit(tm) (aka iSCSI, AoE) type > drivers, that most of us don't have access to, so just because it > works Just Fine on SATA doesn't mean anything. The iSCSI initiator driver should work with kmalloc'ed memory. The reason iSCSI didn't work with kmalloc'ed memory is that it uses sendpage (which needs refcountable pages). We added a workaround to not use sendpage with kmalloc'ed memory (it would be great if we could remove the workaround, though). ^ permalink raw reply [flat|nested] 48+ messages in thread
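The workaround described here amounts to detecting pages whose refcounts must not be touched and avoiding the zero-copy path for them. A sketch of the idea; the function is hypothetical and the real iscsi_tcp logic differs in detail:

        /* illustrative: pick a transmit path based on where the page came from */
        static ssize_t xmit_page(struct socket *sock, struct page *page,
                                 int offset, size_t size, int flags)
        {
                if (PageSlab(page))
                        /*
                         * kmalloc'ed memory: sendpage would take page
                         * references that slab pages cannot tolerate, so
                         * fall back to a copying send.
                         */
                        return sock_no_sendpage(sock, page, offset, size, flags);

                return sock->ops->sendpage(sock, page, offset, size, flags);
        }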
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-18 0:21 ` FUJITA Tomonori @ 2009-12-18 14:17 ` tytso 2009-12-18 14:17 ` tytso 2009-12-21 8:53 ` FUJITA Tomonori 0 siblings, 2 replies; 48+ messages in thread From: tytso @ 2009-12-18 14:17 UTC (permalink / raw) To: FUJITA Tomonori Cc: torvalds, kyle, linux-parisc, linux-kernel, James.Bottomley, hch, linux-arch, jens.axboe On Fri, Dec 18, 2009 at 09:21:30AM +0900, FUJITA Tomonori wrote: > > The iSCSI initiator driver should work with kmalloc'ed memory. > > The reason iSCSI didn't work with kmalloc'ed memory is that it > uses sendpage (which needs refcountable pages). We added a workaround > to not use sendpage with kmalloc'ed memory (it would be great if we > could remove the workaround, though). Well, the patch that I plan to be pushing, once we have general agreement that it is a block device driver BUG not to accept kmalloc'ed/SLAB-allocated memory, is one where ext4 will use kmalloc'ed/slab-allocated memory on occasion, when it needs to make a shadow copy of buffers for journalling purposes AND when the fs block size is smaller than the page size (i.e., no more allocating a 16k page when the fs block size is 4k). So this won't happen all the time; even in the case of a 16k Itanium system with 4k blocks, the bulk of the data won't be sent via kmalloc'ed memory --- just some critical metadata blocks and some data blocks that need to be escaped when being written into the journal. I do think we need to document that block device drivers are _expected_ to be able to handle kmalloc'ed memory, and if they can't, #1 they should do a BUG_ON instead of corrupting users' data, and #2, if they do corrupt data, we should send the angry users with corrupted file systems to bang at the doors of the block device authors. - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
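Ted's point #1 in code form: an illustrative check a page-refcounting driver could run at submission time rather than corrupting data later (the helper is hypothetical):

        static void reject_slab_pages(struct bio *bio)
        {
                struct bio_vec *bvec;
                int i;

                /*
                 * A driver that plays refcounting games with the pages it
                 * is handed cannot cope with slab-backed memory; per the
                 * proposed rule it should fail loudly instead.
                 */
                bio_for_each_segment(bvec, bio, i)
                        BUG_ON(PageSlab(bvec->bv_page));
        }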
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-18 14:17 ` tytso 2009-12-18 14:17 ` tytso @ 2009-12-21 8:53 ` FUJITA Tomonori 1 sibling, 0 replies; 48+ messages in thread From: FUJITA Tomonori @ 2009-12-21 8:53 UTC (permalink / raw) To: tytso Cc: fujita.tomonori, torvalds, kyle, linux-parisc, linux-kernel, James.Bottomley, hch, linux-arch, jens.axboe On Fri, 18 Dec 2009 09:17:32 -0500 tytso@mit.edu wrote: > On Fri, Dec 18, 2009 at 09:21:30AM +0900, FUJITA Tomonori wrote: > > > > The iSCSI initiator driver should work with kmalloc'ed memory. > > > > The reason iSCSI didn't work with kmalloc'ed memory is that it > > uses sendpage (which needs refcountable pages). We added a workaround > > to not use sendpage with kmalloc'ed memory (it would be great if we > > could remove the workaround, though). > > Well, the patch that I plan to be pushing, once we have general > agreement that it is a block device driver BUG not to accept > kmalloc'ed/SLAB-allocated memory, is one where ext4 will use > kmalloc'ed/slab-allocated memory on occasion, when it needs to make > a shadow copy of buffers for journalling purposes AND when the fs block > size is smaller than the page size (i.e., no more allocating a 16k > page when the fs block size is 4k). So this won't happen all the > time; even in the case of a 16k Itanium system with 4k blocks, the > bulk of the data won't be sent via kmalloc'ed memory --- just some > critical metadata blocks and some data blocks that need to be escaped > when being written into the journal. Actually, ext3 (jbd) sent kmalloc'ed buffers to the block layer for frozen data. xfs also used kmalloc'ed buffers. Neither does now (so, as you said above, jbd wastes some memory when the block size is not equal to the page size, I think). > I do think we need to document that block device drivers are > _expected_ to be able to handle kmalloc'ed memory, Agreed. Note that network block drivers (iSCSI, drbd, something else?) don't play with page ref-counting. They want to use sendpage. The network layer (sendpage) can't handle non-ref-counted pages. The best solution for fs and network block drivers might be modifying sendpage to handle such pages. ^ permalink raw reply [flat|nested] 48+ messages in thread
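The reason sendpage needs refcountable pages, in miniature; an illustrative fragment, not actual network-stack code:

        /* zero-copy attach: the stack takes its own page reference ... */
        get_page(page);
        skb_fill_page_desc(skb, i, page, offset, size);
        /*
         * ... and drops it when the skb is freed after transmission. For
         * a slab page the refcount belongs to the slab allocator, so this
         * get/put pair is exactly what must not happen.
         */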
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-17 16:30 ` tytso 2009-12-17 16:46 ` Linus Torvalds 2009-12-17 17:10 ` Christoph Hellwig @ 2009-12-17 17:10 ` Christoph Hellwig 2009-12-17 17:33 ` tytso 2 siblings, 1 reply; 48+ messages in thread From: Christoph Hellwig @ 2009-12-17 17:10 UTC (permalink / raw) To: tytso, Linus Torvalds, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, hch, linux-arch, Jens Axboe On Thu, Dec 17, 2009 at 11:30:36AM -0500, tytso@mit.edu wrote: > That's because apparently the iSCSI and DMA blocks assume that they > have Real Pages (tm) passed to block I/O requests, and apparently XFS > ran into problems when sending vmalloc'ed pages. I don't know if this > is a problem if we pass the bio layer addresses coming from the SLAB > allocator, but oral tradition seems to indicate this is problematic, > although no one has given me the full chapter and verse explanation > about why this is so. Actually at least iscsi now has a workaround for that by checking for PageSlab. Back when we dealt with the XFS issue, that check was only available with debug options enabled. I tried to sort it out by agreeing with the block and iscsi folks that either a) we need to send down refcountable pages b) block drivers need to accept kmalloced pages I could not get any agreement, and thus we stopped using the kmalloced pages in XFS, which was easy enough. A bit later people fixed iscsi, but we still don't have formal rules about what is acceptable to the block layer. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [git patches] xfs and block fixes for virtually indexed arches 2009-12-17 17:10 ` Christoph Hellwig @ 2009-12-17 17:33 ` tytso 2009-12-17 17:33 ` tytso 0 siblings, 1 reply; 48+ messages in thread From: tytso @ 2009-12-17 17:33 UTC (permalink / raw) To: Christoph Hellwig Cc: Linus Torvalds, Kyle McMartin, linux-parisc, Linux Kernel Mailing List, James.Bottomley, linux-arch, Jens Axboe On Thu, Dec 17, 2009 at 12:10:25PM -0500, Christoph Hellwig wrote: > On Thu, Dec 17, 2009 at 11:30:36AM -0500, tytso@mit.edu wrote: > > That's because apparently the iSCSI and DMA blocks assume that they > > have Real Pages (tm) passed to block I/O requests, and apparently XFS > > ran into problems when sending vmalloc'ed pages. I don't know if this > > is a problem if we pass the bio layer addresses coming from the SLAB > > allocator, but oral tradition seems to indicate this is problematic, > > although no one has given me the full chapter and verse explanation > > about why this is so. > > Actually at least iscsi now has a workaround for that by checking for > PageSlab. Back when we dealt with the XFS issue, that check was only > available with debug options enabled. I tried to sort it out by > agreeing with the block and iscsi folks that either > > a) we need to send down refcountable pages > b) block drivers need to accept kmalloced pages > > I could not get any agreement, and thus we stopped using the kmalloced > pages in XFS, which was easy enough. A bit later people fixed iscsi, > but we still don't have formal rules about what is acceptable to the > block layer. It would be good to get some formal rules articulated. Someone has asserted that the AoE (ATA over Ethernet) driver will barf on SLAB/kmalloc allocated memory. True, false? - Ted ^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread
Thread overview: 48+ messages
[not found] <20091216043618.GB9104@hera.kernel.org>
2009-12-17 13:22 ` [git patches] xfs and block fixes for virtually indexed arches Kyle McMartin
2009-12-17 13:22 ` Kyle McMartin
2009-12-17 13:25 ` Christoph Hellwig
2009-12-17 16:16 ` Linus Torvalds
2009-12-17 16:30 ` tytso
2009-12-17 16:46 ` Linus Torvalds
2009-12-17 16:46 ` Linus Torvalds
2009-12-17 17:07 ` Christoph Hellwig
2009-12-17 17:07 ` Christoph Hellwig
2009-12-17 17:42 ` Linus Torvalds
2009-12-17 17:51 ` Christoph Hellwig
2009-12-17 17:51 ` Christoph Hellwig
2009-12-17 18:08 ` Russell King
2009-12-17 18:08 ` Russell King
2009-12-17 18:17 ` Linus Torvalds
2009-12-17 18:17 ` Linus Torvalds
2009-12-19 18:33 ` Ralf Baechle
2009-12-19 18:33 ` Ralf Baechle
2009-12-21 17:14 ` James Bottomley
2009-12-17 17:39 ` tytso
2009-12-17 17:39 ` tytso
2009-12-17 17:51 ` Linus Torvalds
2009-12-17 19:36 ` Jens Axboe
2009-12-17 19:36 ` Jens Axboe
2009-12-17 23:57 ` James Bottomley
2009-12-17 23:57 ` James Bottomley
2009-12-18 1:00 ` FUJITA Tomonori
2009-12-18 2:44 ` Dave Chinner
2009-12-18 3:51 ` FUJITA Tomonori
2009-12-18 7:10 ` James Bottomley
2009-12-18 7:08 ` James Bottomley
2009-12-18 9:34 ` FUJITA Tomonori
2009-12-18 10:01 ` James Bottomley
2009-12-18 10:24 ` FUJITA Tomonori
2009-12-18 10:30 ` James Bottomley
2009-12-18 12:00 ` Dave Chinner
2009-12-18 0:21 ` FUJITA Tomonori
2009-12-18 14:17 ` tytso
2009-12-21 8:53 ` FUJITA Tomonori
2009-12-17 17:10 ` Christoph Hellwig
2009-12-17 17:33 ` tytso