* Re: still nfs problems [Was: Linux 2.6.37-rc8]
@ 2011-01-05 19:05 James Bottomley
2011-01-05 19:18 ` Linus Torvalds
0 siblings, 1 reply; 71+ messages in thread
From: James Bottomley @ 2011-01-05 19:05 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde,
Uwe Kleine-König, Linus Torvalds, Marc Kleine-Budde,
linux-arm-kernel, Parisc List, linux-arch
[sorry for the unthreaded insertion. We're seeing this on parisc too]
> On Wed, Jan 05, 2011 at 10:14:17AM -0500, Trond Myklebust wrote:
> > OK. So,the new behaviour in 2.6.37 is that we're writing to a series of
> > pages via the usual kmap_atomic()/kunmap_atomic() and kmap()/kunmap()
> > interfaces, but we can end up reading them via a virtual address range
> > that gets set up via vm_map_ram() (that range gets set up before the
> > write occurs).
>
> kmap of lowmem pages will always reuses the existing kernel direct
> mapping, so there won't be a problem there.
>
> > Do we perhaps need an invalidate_kernel_vmap_range() before we can read
> > the data on ARM in this kind of scenario?
>
> Firstly, vm_map_ram() does no cache maintainence of any sort, nor does
> it take care of page colouring - so any architecture where cache aliasing
> can occur will see this problem. It should not limited to ARM.
>
> Secondly, no, invalidate_kernel_vmap_range() probably isn't sufficient.
> There's two problems here:
>
> addr = kmap(lowmem_page);
> *addr = stuff;
> kunmap(lowmem_page);
>
> Such lowmem pages are accessed through their kernel direct mapping.
>
> ptr = vm_map_ram(lowmem_page);
> read = *ptr;
>
> This creates a new mapping which can alias with the kernel direct mapping.
> Now, as this is a new mapping, there should be no cache lines associated
> with it. (Looking at vm_unmap_ram(), it calls free_unmap_vmap_area_addr(),
> free_unmap_vmap_area(), which then calls flush_cache_vunmap() on the
> region. vb_free() also calls flush_cache_vunmap() too.)
>
> If the write after kmap() hits an already present cache line, the cache
> line will be updated, but it won't be written back to memory. So, on
> a subsequent vm_map_ram(), with any kind of aliasing cache, there's
> no guarantee that you'll hit that cache line and read the data just
> written there.
>
> The kernel direct mapping would need to be flushed.
>
> I'm really getting to the point of hating the poliferation of RAM
> remapping interfaces - it's going to (and is) causing nothing but lots
> of pain on virtual cache architectures, needing more and more cache
> flushing interfaces to be created.
>
> Is there any other solution to this?
I think the solution for the kernel direct mapping problem is to take
the expected flushes and invalidates into kmap/kunmap[_atomic]. I think
the original reason for not doing this was efficiency: the user should
know what they did with the data (i.e. if they're only reading it, it
doesn't need to be flushed on unmap). However, the difficulty of
getting all this right seems to outweigh the efficiency of only using
the necessary flushing. At least on some architectures, we can look at
the TLB flags to see if the page was dirtied (and only flush if it was).
James
^ permalink raw reply [flat|nested] 71+ messages in thread* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 19:05 still nfs problems [Was: Linux 2.6.37-rc8] James Bottomley @ 2011-01-05 19:18 ` Linus Torvalds [not found] ` <AANLkTi=VZUxNFd7n-qwf5aiOeK5rkk8qBmo+kOpgg7up-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2011-01-05 19:18 UTC (permalink / raw) To: James Bottomley Cc: Russell King - ARM Linux, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > I think the solution for the kernel direct mapping problem is to take > the expected flushes and invalidates into kmap/kunmap[_atomic]. No, we really can't do that. Most of the time, the kmap() is the only way we access the page anyway, so flushing things would just be stupid. Why waste time and energy on doing something pointless? In fact, kmap() here is a total non-issue. It's not the kmap() that introduces any virtual aliases, and never has been. It's the "vm_map_ram()" that is the problem. Unlike the kmap(), that really _does_ introduce a virtual alias, and is a problem for any virtual cache. So don't blame kmap(). It's innocent and irrelevant - the bug could happen entirely without it (think a 64-bit address space that doesn't even _have_ kmap, but has software that mixes vm_map_ram() with non-mapped accesses). Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <AANLkTi=VZUxNFd7n-qwf5aiOeK5rkk8qBmo+kOpgg7up-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <AANLkTi=VZUxNFd7n-qwf5aiOeK5rkk8qBmo+kOpgg7up-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-01-05 19:36 ` James Bottomley 2011-01-05 19:36 ` James Bottomley ` (2 more replies) 0 siblings, 3 replies; 71+ messages in thread From: James Bottomley @ 2011-01-05 19:36 UTC (permalink / raw) To: Linus Torvalds Cc: Russell King - ARM Linux, Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley > <James.Bottomley-JuX6DAaQMKPCXq6kfMZ53/egYHeGw8Jk@public.gmane.org> wrote: > > > > I think the solution for the kernel direct mapping problem is to take > > the expected flushes and invalidates into kmap/kunmap[_atomic]. > > No, we really can't do that. Most of the time, the kmap() is the only > way we access the page anyway, so flushing things would just be > stupid. Why waste time and energy on doing something pointless? It's hardly pointless. The kmap sets up an inequivalent alias in the cache. When you write to the kmap region, you dirty the CPU caches for that alias. If you tear down the mapping without flushing, the CPU will write out the cache lines at its leisure. If you access the line via the other mapping *before* the CPU does writeout, you see stale data. When the kernel dirties a kmap region, it always has to flush somehow before kunmap. One of the problems here is that that flush isn't in the NFS code. > In fact, kmap() here is a total non-issue. It's not the kmap() that > introduces any virtual aliases, and never has been. It's the > "vm_map_ram()" that is the problem. Unlike the kmap(), that really > _does_ introduce a virtual alias, and is a problem for any virtual > cache. > > So don't blame kmap(). It's innocent and irrelevant - the bug could > happen entirely without it (think a 64-bit address space that doesn't > even _have_ kmap, but has software that mixes vm_map_ram() with > non-mapped accesses). I didn't say it was kmap's entire problem ... I just said, can't we simplify some of this by consolidating the flushing into the interfaces. James -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 19:36 ` James Bottomley @ 2011-01-05 19:36 ` James Bottomley 2011-01-05 19:49 ` Linus Torvalds [not found] ` <1294256169.16957.18.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> 2 siblings, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-05 19:36 UTC (permalink / raw) To: Linus Torvalds Cc: Russell King - ARM Linux, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > I think the solution for the kernel direct mapping problem is to take > > the expected flushes and invalidates into kmap/kunmap[_atomic]. > > No, we really can't do that. Most of the time, the kmap() is the only > way we access the page anyway, so flushing things would just be > stupid. Why waste time and energy on doing something pointless? It's hardly pointless. The kmap sets up an inequivalent alias in the cache. When you write to the kmap region, you dirty the CPU caches for that alias. If you tear down the mapping without flushing, the CPU will write out the cache lines at its leisure. If you access the line via the other mapping *before* the CPU does writeout, you see stale data. When the kernel dirties a kmap region, it always has to flush somehow before kunmap. One of the problems here is that that flush isn't in the NFS code. > In fact, kmap() here is a total non-issue. It's not the kmap() that > introduces any virtual aliases, and never has been. It's the > "vm_map_ram()" that is the problem. Unlike the kmap(), that really > _does_ introduce a virtual alias, and is a problem for any virtual > cache. > > So don't blame kmap(). It's innocent and irrelevant - the bug could > happen entirely without it (think a 64-bit address space that doesn't > even _have_ kmap, but has software that mixes vm_map_ram() with > non-mapped accesses). I didn't say it was kmap's entire problem ... I just said, can't we simplify some of this by consolidating the flushing into the interfaces. James ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 19:36 ` James Bottomley 2011-01-05 19:36 ` James Bottomley @ 2011-01-05 19:49 ` Linus Torvalds 2011-01-05 20:35 ` James Bottomley [not found] ` <1294256169.16957.18.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> 2 siblings, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2011-01-05 19:49 UTC (permalink / raw) To: James Bottomley Cc: Russell King - ARM Linux, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch 2011/1/5 James Bottomley <James.Bottomley@hansenpartnership.com>: >> >> No, we really can't do that. Most of the time, the kmap() is the only >> way we access the page anyway, so flushing things would just be >> stupid. Why waste time and energy on doing something pointless? > > It's hardly pointless. The kmap sets up an inequivalent alias in the > cache. NO IT DOES NOT. Stop arguing, when you are so wrong. kmap() does not create any aliases. For low-memory, it just returns the physical address. No alias. And for high memory, there is no equivalent low memory address to alias _with_. Now, when you actually mix multiple kmap's and you have a virtually based cache, then the kmap's obviously need to flush that particular page when switching between each other. But that has nothing to do with the actual page being kmap'ed, it's entirely an internal issue about the particular virtual memory area being re-used. And ARM (and any other virtually based CPU) already does that in __kunmap_atomic(). But notice the case of the low-mem. And understand that you are WRONG about the "inequivalent alias" thing. So I repeat: this has absolutely *NOTHING* to do with kmap(). Stop blathering. It's _purely_ an issue of vm_map_ram(). Nothing else. Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 19:49 ` Linus Torvalds @ 2011-01-05 20:35 ` James Bottomley 0 siblings, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-05 20:35 UTC (permalink / raw) To: Linus Torvalds Cc: Russell King - ARM Linux, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 11:49 -0800, Linus Torvalds wrote: > 2011/1/5 James Bottomley <James.Bottomley@hansenpartnership.com>: > >> > >> No, we really can't do that. Most of the time, the kmap() is the only > >> way we access the page anyway, so flushing things would just be > >> stupid. Why waste time and energy on doing something pointless? > > > > It's hardly pointless. The kmap sets up an inequivalent alias in the > > cache. > > NO IT DOES NOT. Well, it does ... but not in this case because the page is freshly allocated (which I missed before) so it has no use cache colour yet. James > Stop arguing, when you are so wrong. > > kmap() does not create any aliases. For low-memory, it just returns > the physical address. No alias. And for high memory, there is no > equivalent low memory address to alias _with_. > > Now, when you actually mix multiple kmap's and you have a virtually > based cache, then the kmap's obviously need to flush that particular > page when switching between each other. But that has nothing to do > with the actual page being kmap'ed, it's entirely an internal issue > about the particular virtual memory area being re-used. And ARM (and > any other virtually based CPU) already does that in __kunmap_atomic(). > > But notice the case of the low-mem. And understand that you are WRONG > about the "inequivalent alias" thing. > > So I repeat: this has absolutely *NOTHING* to do with kmap(). Stop blathering. > > It's _purely_ an issue of vm_map_ram(). Nothing else. > > Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1294256169.16957.18.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <1294256169.16957.18.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> @ 2011-01-05 20:00 ` Russell King - ARM Linux 2011-01-05 20:00 ` Russell King - ARM Linux 2011-01-05 20:33 ` James Bottomley 0 siblings, 2 replies; 71+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 20:00 UTC (permalink / raw) To: James Bottomley Cc: Linus Torvalds, Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA On Wed, Jan 05, 2011 at 01:36:09PM -0600, James Bottomley wrote: > On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote: > > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley > > <James.Bottomley-JuX6DAaQMKPCXq6kfMZ53/egYHeGw8Jk@public.gmane.org> wrote: > > > > > > I think the solution for the kernel direct mapping problem is to take > > > the expected flushes and invalidates into kmap/kunmap[_atomic]. > > > > No, we really can't do that. Most of the time, the kmap() is the only > > way we access the page anyway, so flushing things would just be > > stupid. Why waste time and energy on doing something pointless? > > It's hardly pointless. The kmap sets up an inequivalent alias in the > cache. No it doesn't. For pages which are inaccessible, it sets up a mapping for those pages. On aliasing cache architectures, when you tear down such a mapping, you have to flush the cache before you do so (otherwise you can end up with cache lines existing in the cache for inaccessible mappings.) For lowmem pages, kmap() (should always) bypass the 'setup mapping' stage because all lowmem pages are already accessible. So kunmap() doesn't do anything - just like the !HIGHMEM implementation for these macros. So, for highmem-enabled systems: low_addr = kmap_atomic(lowmem_page); high_addr = kmap_atomic(highmem_page); results in low_addr in the kernel direct-mapped region, and high_addr in the kmap_atomic region. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 20:00 ` Russell King - ARM Linux @ 2011-01-05 20:00 ` Russell King - ARM Linux 2011-01-05 20:33 ` James Bottomley 1 sibling, 0 replies; 71+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 20:00 UTC (permalink / raw) To: James Bottomley Cc: Linus Torvalds, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, Jan 05, 2011 at 01:36:09PM -0600, James Bottomley wrote: > On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote: > > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > I think the solution for the kernel direct mapping problem is to take > > > the expected flushes and invalidates into kmap/kunmap[_atomic]. > > > > No, we really can't do that. Most of the time, the kmap() is the only > > way we access the page anyway, so flushing things would just be > > stupid. Why waste time and energy on doing something pointless? > > It's hardly pointless. The kmap sets up an inequivalent alias in the > cache. No it doesn't. For pages which are inaccessible, it sets up a mapping for those pages. On aliasing cache architectures, when you tear down such a mapping, you have to flush the cache before you do so (otherwise you can end up with cache lines existing in the cache for inaccessible mappings.) For lowmem pages, kmap() (should always) bypass the 'setup mapping' stage because all lowmem pages are already accessible. So kunmap() doesn't do anything - just like the !HIGHMEM implementation for these macros. So, for highmem-enabled systems: low_addr = kmap_atomic(lowmem_page); high_addr = kmap_atomic(highmem_page); results in low_addr in the kernel direct-mapped region, and high_addr in the kmap_atomic region. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 20:00 ` Russell King - ARM Linux 2011-01-05 20:00 ` Russell King - ARM Linux @ 2011-01-05 20:33 ` James Bottomley 2011-01-05 20:33 ` James Bottomley 2011-01-05 20:48 ` Linus Torvalds 1 sibling, 2 replies; 71+ messages in thread From: James Bottomley @ 2011-01-05 20:33 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Linus Torvalds, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 20:00 +0000, Russell King - ARM Linux wrote: > On Wed, Jan 05, 2011 at 01:36:09PM -0600, James Bottomley wrote: > > On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote: > > > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley > > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > > > I think the solution for the kernel direct mapping problem is to take > > > > the expected flushes and invalidates into kmap/kunmap[_atomic]. > > > > > > No, we really can't do that. Most of the time, the kmap() is the only > > > way we access the page anyway, so flushing things would just be > > > stupid. Why waste time and energy on doing something pointless? > > > > It's hardly pointless. The kmap sets up an inequivalent alias in the > > cache. > > No it doesn't. For pages which are inaccessible, it sets up a mapping > for those pages. On aliasing cache architectures, when you tear down > such a mapping, you have to flush the cache before you do so (otherwise > you can end up with cache lines existing in the cache for inaccessible > mappings.) > > For lowmem pages, kmap() (should always) bypass the 'setup mapping' stage > because all lowmem pages are already accessible. So kunmap() doesn't > do anything - just like the !HIGHMEM implementation for these macros. well, that depends. For us on parisc, kmap of a user page in !HIGHMEM sets up an inequivalent aliase still ... because the cache colour of the user and kernel virtual addresses are different. Depending on the return path to userspace, we still usually have to flush to get the user to see the changes the kernel has made. James > So, for highmem-enabled systems: > > low_addr = kmap_atomic(lowmem_page); > high_addr = kmap_atomic(highmem_page); > > results in low_addr in the kernel direct-mapped region, and high_addr > in the kmap_atomic region. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 20:33 ` James Bottomley @ 2011-01-05 20:33 ` James Bottomley 2011-01-05 20:48 ` Linus Torvalds 1 sibling, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-05 20:33 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Linus Torvalds, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 20:00 +0000, Russell King - ARM Linux wrote: > On Wed, Jan 05, 2011 at 01:36:09PM -0600, James Bottomley wrote: > > On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote: > > > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley > > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > > > I think the solution for the kernel direct mapping problem is to take > > > > the expected flushes and invalidates into kmap/kunmap[_atomic]. > > > > > > No, we really can't do that. Most of the time, the kmap() is the only > > > way we access the page anyway, so flushing things would just be > > > stupid. Why waste time and energy on doing something pointless? > > > > It's hardly pointless. The kmap sets up an inequivalent alias in the > > cache. > > No it doesn't. For pages which are inaccessible, it sets up a mapping > for those pages. On aliasing cache architectures, when you tear down > such a mapping, you have to flush the cache before you do so (otherwise > you can end up with cache lines existing in the cache for inaccessible > mappings.) > > For lowmem pages, kmap() (should always) bypass the 'setup mapping' stage > because all lowmem pages are already accessible. So kunmap() doesn't > do anything - just like the !HIGHMEM implementation for these macros. well, that depends. For us on parisc, kmap of a user page in !HIGHMEM sets up an inequivalent aliase still ... because the cache colour of the user and kernel virtual addresses are different. Depending on the return path to userspace, we still usually have to flush to get the user to see the changes the kernel has made. James > So, for highmem-enabled systems: > > low_addr = kmap_atomic(lowmem_page); > high_addr = kmap_atomic(highmem_page); > > results in low_addr in the kernel direct-mapped region, and high_addr > in the kmap_atomic region. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 20:33 ` James Bottomley 2011-01-05 20:33 ` James Bottomley @ 2011-01-05 20:48 ` Linus Torvalds [not found] ` <AANLkTimzzBsdtWcZtP5E_CH1hUZugGMoaHOiMdQJf764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2011-01-05 21:16 ` James Bottomley 1 sibling, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-05 20:48 UTC (permalink / raw) To: James Bottomley Cc: Russell King - ARM Linux, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, Jan 5, 2011 at 12:33 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > well, that depends. For us on parisc, kmap of a user page in !HIGHMEM > sets up an inequivalent aliase still ... because the cache colour of the > user and kernel virtual addresses are different. Depending on the > return path to userspace, we still usually have to flush to get the user > to see the changes the kernel has made. Umm. Again, that has nothing to do with kmap(). This time it's about the user space mapping. Repeat after me: even without the kmap(), the kernel access to that mapping would have caused cache aliases. See? Once more, the kmap() is entirely innocent. You can have a non-highmem mapping that you never use kmap for, and that you map into user space, and you'd see exactly the same aliases. Notice? Look ma, no kmap(). So clearly kmap() is not the issue. The issue continues to be a totally separate virtual mapping. Whether it's a user mapping or vm_map_ram() is obviously immaterial - as far as the CPU is concerned, there is no difference between the two (apart from the trivial differences of virtual location and permissions). (You can also force the problem with vmalloc() an then following the kernel page tables, but I hope nobody does that any more. I suspect I'm wrong, though, there's probably code that mixes vmalloc and physical page accesses in various drivers) Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <AANLkTimzzBsdtWcZtP5E_CH1hUZugGMoaHOiMdQJf764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <AANLkTimzzBsdtWcZtP5E_CH1hUZugGMoaHOiMdQJf764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-01-05 21:04 ` Russell King - ARM Linux 2011-01-05 21:04 ` Russell King - ARM Linux 2011-01-05 21:08 ` Linus Torvalds 0 siblings, 2 replies; 71+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 21:04 UTC (permalink / raw) To: Linus Torvalds Cc: James Bottomley, Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote: > (You can also force the problem with vmalloc() an then following the > kernel page tables, but I hope nobody does that any more. I suspect > I'm wrong, though, there's probably code that mixes vmalloc and > physical page accesses in various drivers) Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be deprecated then? ;) However, we seem have new ways of doing this - rather than asking vmalloc() for some memory, and then getting at the pages by following the page tables, we now have ways to create mappings using arrays of struct pages and access them via their already known mappings: - vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot) - map_kernel_range_noflush(unsigned long addr, unsigned long size, pgprot_t prot, struct page **pages) - map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages) - vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot) So really it's the same problem, just created by some other easier to use methods. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:04 ` Russell King - ARM Linux @ 2011-01-05 21:04 ` Russell King - ARM Linux 2011-01-05 21:08 ` Linus Torvalds 1 sibling, 0 replies; 71+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 21:04 UTC (permalink / raw) To: Linus Torvalds Cc: James Bottomley, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote: > (You can also force the problem with vmalloc() an then following the > kernel page tables, but I hope nobody does that any more. I suspect > I'm wrong, though, there's probably code that mixes vmalloc and > physical page accesses in various drivers) Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be deprecated then? ;) However, we seem have new ways of doing this - rather than asking vmalloc() for some memory, and then getting at the pages by following the page tables, we now have ways to create mappings using arrays of struct pages and access them via their already known mappings: - vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot) - map_kernel_range_noflush(unsigned long addr, unsigned long size, pgprot_t prot, struct page **pages) - map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages) - vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot) So really it's the same problem, just created by some other easier to use methods. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:04 ` Russell King - ARM Linux 2011-01-05 21:04 ` Russell King - ARM Linux @ 2011-01-05 21:08 ` Linus Torvalds 2011-01-05 21:08 ` Linus Torvalds [not found] ` <AANLkTi=EXXBTW7oWHq3D+PHsx=thF1CpkRjn0ax2p5rm-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-05 21:08 UTC (permalink / raw) To: Russell King - ARM Linux Cc: James Bottomley, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, Jan 5, 2011 at 1:04 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote: >> (You can also force the problem with vmalloc() an then following the >> kernel page tables, but I hope nobody does that any more. I suspect >> I'm wrong, though, there's probably code that mixes vmalloc and >> physical page accesses in various drivers) > > Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be > deprecated then? ;) I do think that the "modern" way of doing it is "vmap()"/"vm_map_ram()" and friends, and it should be preferred over using vmalloc() and then looking up the pages. But in the end, the two approaches really are equivalent, so it's not like it really matters. So I don't think we need to deprecate things officially, but obviously we should make people more aware of the whole virtual alias thing that crops up whenever you use any of these approaches. Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:08 ` Linus Torvalds @ 2011-01-05 21:08 ` Linus Torvalds [not found] ` <AANLkTi=EXXBTW7oWHq3D+PHsx=thF1CpkRjn0ax2p5rm-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 0 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-05 21:08 UTC (permalink / raw) To: Russell King - ARM Linux Cc: James Bottomley, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, Jan 5, 2011 at 1:04 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote: >> (You can also force the problem with vmalloc() an then following the >> kernel page tables, but I hope nobody does that any more. I suspect >> I'm wrong, though, there's probably code that mixes vmalloc and >> physical page accesses in various drivers) > > Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be > deprecated then? ;) I do think that the "modern" way of doing it is "vmap()"/"vm_map_ram()" and friends, and it should be preferred over using vmalloc() and then looking up the pages. But in the end, the two approaches really are equivalent, so it's not like it really matters. So I don't think we need to deprecate things officially, but obviously we should make people more aware of the whole virtual alias thing that crops up whenever you use any of these approaches. Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <AANLkTi=EXXBTW7oWHq3D+PHsx=thF1CpkRjn0ax2p5rm-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <AANLkTi=EXXBTW7oWHq3D+PHsx=thF1CpkRjn0ax2p5rm-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-01-05 21:16 ` Trond Myklebust 2011-01-05 21:16 ` Trond Myklebust [not found] ` <1294262208.2952.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 0 siblings, 2 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-05 21:16 UTC (permalink / raw) To: Linus Torvalds Cc: Russell King - ARM Linux, James Bottomley, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA On Wed, 2011-01-05 at 13:08 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 1:04 PM, Russell King - ARM Linux > <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org> wrote: > > On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote: > >> (You can also force the problem with vmalloc() an then following the > >> kernel page tables, but I hope nobody does that any more. I suspect > >> I'm wrong, though, there's probably code that mixes vmalloc and > >> physical page accesses in various drivers) > > > > Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be > > deprecated then? ;) > > I do think that the "modern" way of doing it is > "vmap()"/"vm_map_ram()" and friends, and it should be preferred over > using vmalloc() and then looking up the pages. > > But in the end, the two approaches really are equivalent, so it's not > like it really matters. So I don't think we need to deprecate things > officially, but obviously we should make people more aware of the > whole virtual alias thing that crops up whenever you use any of these > approaches. So what should be the preferred way to ensure data gets flushed when you've written directly to a page, and then want to read through the vm_map_ram() virtual range? Should we be adding new semantics to flush_kernel_dcache_page()? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org www.netapp.com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:16 ` Trond Myklebust @ 2011-01-05 21:16 ` Trond Myklebust [not found] ` <1294262208.2952.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 1 sibling, 0 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-05 21:16 UTC (permalink / raw) To: Linus Torvalds Cc: Russell King - ARM Linux, James Bottomley, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 13:08 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 1:04 PM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > > On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote: > >> (You can also force the problem with vmalloc() an then following the > >> kernel page tables, but I hope nobody does that any more. I suspect > >> I'm wrong, though, there's probably code that mixes vmalloc and > >> physical page accesses in various drivers) > > > > Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be > > deprecated then? ;) > > I do think that the "modern" way of doing it is > "vmap()"/"vm_map_ram()" and friends, and it should be preferred over > using vmalloc() and then looking up the pages. > > But in the end, the two approaches really are equivalent, so it's not > like it really matters. So I don't think we need to deprecate things > officially, but obviously we should make people more aware of the > whole virtual alias thing that crops up whenever you use any of these > approaches. So what should be the preferred way to ensure data gets flushed when you've written directly to a page, and then want to read through the vm_map_ram() virtual range? Should we be adding new semantics to flush_kernel_dcache_page()? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1294262208.2952.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <1294262208.2952.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2011-01-05 21:30 ` Linus Torvalds 2011-01-05 21:30 ` Linus Torvalds 2011-01-05 23:06 ` Trond Myklebust 0 siblings, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-05 21:30 UTC (permalink / raw) To: Trond Myklebust Cc: Russell King - ARM Linux, James Bottomley, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> wrote: > > So what should be the preferred way to ensure data gets flushed when > you've written directly to a page, and then want to read through the > vm_map_ram() virtual range? Should we be adding new semantics to > flush_kernel_dcache_page()? The "preferred way" is actually simple: "don't do that". IOW, if some page is accessed through a virtual mapping you've set up, then _always_ access it through that virtual mapping. Now, when that is impossible (and yes, it sometimes is), then you should flush after doing all writes. And if you do the write through the regular kernel mapping, you should use flush_dcache_page(). And if you did it through the virtual mapping, you should use "flush_kernel_vmap_range()" or whatever. NOTE! I really didn't look those up very closely, and if the accesses can happen concurrently you are basically screwed, so you do need to do locking or something else to guarantee that there is some nice sequential order. And maybe I forgot something. Which is why I do suggest "don't do that" as a primary approach to the problem if at all possible. Oh, and you may need to flush before reading too (and many writes do end up being "read-modify-write" cycles) in case it's possible that you have stale data from a previous read that was then invalidated by a write to the aliasing address. Even if that write was flushed out, the stale read data may exist at the virtual address. I forget what all we required - in the end the only sane model is "virtual caches suck so bad that anybody who does them should be laughed at for being a retard". Linus -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:30 ` Linus Torvalds @ 2011-01-05 21:30 ` Linus Torvalds 2011-01-05 23:06 ` Trond Myklebust 1 sibling, 0 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-05 21:30 UTC (permalink / raw) To: Trond Myklebust Cc: Russell King - ARM Linux, James Bottomley, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > So what should be the preferred way to ensure data gets flushed when > you've written directly to a page, and then want to read through the > vm_map_ram() virtual range? Should we be adding new semantics to > flush_kernel_dcache_page()? The "preferred way" is actually simple: "don't do that". IOW, if some page is accessed through a virtual mapping you've set up, then _always_ access it through that virtual mapping. Now, when that is impossible (and yes, it sometimes is), then you should flush after doing all writes. And if you do the write through the regular kernel mapping, you should use flush_dcache_page(). And if you did it through the virtual mapping, you should use "flush_kernel_vmap_range()" or whatever. NOTE! I really didn't look those up very closely, and if the accesses can happen concurrently you are basically screwed, so you do need to do locking or something else to guarantee that there is some nice sequential order. And maybe I forgot something. Which is why I do suggest "don't do that" as a primary approach to the problem if at all possible. Oh, and you may need to flush before reading too (and many writes do end up being "read-modify-write" cycles) in case it's possible that you have stale data from a previous read that was then invalidated by a write to the aliasing address. Even if that write was flushed out, the stale read data may exist at the virtual address. I forget what all we required - in the end the only sane model is "virtual caches suck so bad that anybody who does them should be laughed at for being a retard". Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:30 ` Linus Torvalds 2011-01-05 21:30 ` Linus Torvalds @ 2011-01-05 23:06 ` Trond Myklebust 2011-01-05 23:06 ` Trond Myklebust ` (2 more replies) 1 sibling, 3 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-05 23:06 UTC (permalink / raw) To: Linus Torvalds Cc: Russell King - ARM Linux, James Bottomley, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 13:30 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > So what should be the preferred way to ensure data gets flushed when > > you've written directly to a page, and then want to read through the > > vm_map_ram() virtual range? Should we be adding new semantics to > > flush_kernel_dcache_page()? > > The "preferred way" is actually simple: "don't do that". IOW, if some > page is accessed through a virtual mapping you've set up, then > _always_ access it through that virtual mapping. > > Now, when that is impossible (and yes, it sometimes is), then you > should flush after doing all writes. And if you do the write through > the regular kernel mapping, you should use flush_dcache_page(). And if > you did it through the virtual mapping, you should use > "flush_kernel_vmap_range()" or whatever. > > NOTE! I really didn't look those up very closely, and if the accesses > can happen concurrently you are basically screwed, so you do need to > do locking or something else to guarantee that there is some nice > sequential order. And maybe I forgot something. Which is why I do > suggest "don't do that" as a primary approach to the problem if at all > possible. > > Oh, and you may need to flush before reading too (and many writes do > end up being "read-modify-write" cycles) in case it's possible that > you have stale data from a previous read that was then invalidated by > a write to the aliasing address. Even if that write was flushed out, > the stale read data may exist at the virtual address. I forget what > all we required - in the end the only sane model is "virtual caches > suck so bad that anybody who does them should be laughed at for being > a retard". Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), which takes care of invalidating the cache prior to a virtual address read. My question was specifically about the write through the regular kernel mapping: according to Russell and my reading of the cachetlb.txt documentation, flush_dcache_page() is only guaranteed to have an effect on page cache pages. flush_kernel_dcache_page() (not to be confused with flush_dcache_page) would appear to be the closest fit according to my reading of the documentation, however the ARM implementation appears to be a no-op... -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 23:06 ` Trond Myklebust @ 2011-01-05 23:06 ` Trond Myklebust 2011-01-05 23:28 ` James Bottomley 2011-01-05 23:28 ` Linus Torvalds 2 siblings, 0 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-05 23:06 UTC (permalink / raw) To: Linus Torvalds Cc: Russell King - ARM Linux, James Bottomley, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 13:30 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > So what should be the preferred way to ensure data gets flushed when > > you've written directly to a page, and then want to read through the > > vm_map_ram() virtual range? Should we be adding new semantics to > > flush_kernel_dcache_page()? > > The "preferred way" is actually simple: "don't do that". IOW, if some > page is accessed through a virtual mapping you've set up, then > _always_ access it through that virtual mapping. > > Now, when that is impossible (and yes, it sometimes is), then you > should flush after doing all writes. And if you do the write through > the regular kernel mapping, you should use flush_dcache_page(). And if > you did it through the virtual mapping, you should use > "flush_kernel_vmap_range()" or whatever. > > NOTE! I really didn't look those up very closely, and if the accesses > can happen concurrently you are basically screwed, so you do need to > do locking or something else to guarantee that there is some nice > sequential order. And maybe I forgot something. Which is why I do > suggest "don't do that" as a primary approach to the problem if at all > possible. > > Oh, and you may need to flush before reading too (and many writes do > end up being "read-modify-write" cycles) in case it's possible that > you have stale data from a previous read that was then invalidated by > a write to the aliasing address. Even if that write was flushed out, > the stale read data may exist at the virtual address. I forget what > all we required - in the end the only sane model is "virtual caches > suck so bad that anybody who does them should be laughed at for being > a retard". Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), which takes care of invalidating the cache prior to a virtual address read. My question was specifically about the write through the regular kernel mapping: according to Russell and my reading of the cachetlb.txt documentation, flush_dcache_page() is only guaranteed to have an effect on page cache pages. flush_kernel_dcache_page() (not to be confused with flush_dcache_page) would appear to be the closest fit according to my reading of the documentation, however the ARM implementation appears to be a no-op... -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 23:06 ` Trond Myklebust 2011-01-05 23:06 ` Trond Myklebust @ 2011-01-05 23:28 ` James Bottomley 2011-01-06 17:40 ` James Bottomley 2011-01-05 23:28 ` Linus Torvalds 2 siblings, 1 reply; 71+ messages in thread From: James Bottomley @ 2011-01-05 23:28 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Russell King - ARM Linux, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 18:06 -0500, Trond Myklebust wrote: > On Wed, 2011-01-05 at 13:30 -0800, Linus Torvalds wrote: > > On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust > > <Trond.Myklebust@netapp.com> wrote: > > > > > > So what should be the preferred way to ensure data gets flushed when > > > you've written directly to a page, and then want to read through the > > > vm_map_ram() virtual range? Should we be adding new semantics to > > > flush_kernel_dcache_page()? > > > > The "preferred way" is actually simple: "don't do that". IOW, if some > > page is accessed through a virtual mapping you've set up, then > > _always_ access it through that virtual mapping. > > > > Now, when that is impossible (and yes, it sometimes is), then you > > should flush after doing all writes. And if you do the write through > > the regular kernel mapping, you should use flush_dcache_page(). And if > > you did it through the virtual mapping, you should use > > "flush_kernel_vmap_range()" or whatever. > > > > NOTE! I really didn't look those up very closely, and if the accesses > > can happen concurrently you are basically screwed, so you do need to > > do locking or something else to guarantee that there is some nice > > sequential order. And maybe I forgot something. Which is why I do > > suggest "don't do that" as a primary approach to the problem if at all > > possible. > > > > Oh, and you may need to flush before reading too (and many writes do > > end up being "read-modify-write" cycles) in case it's possible that > > you have stale data from a previous read that was then invalidated by > > a write to the aliasing address. Even if that write was flushed out, > > the stale read data may exist at the virtual address. I forget what > > all we required - in the end the only sane model is "virtual caches > > suck so bad that anybody who does them should be laughed at for being > > a retard". > > Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), > which takes care of invalidating the cache prior to a virtual address > read. > > My question was specifically about the write through the regular kernel > mapping: according to Russell and my reading of the cachetlb.txt > documentation, flush_dcache_page() is only guaranteed to have an effect > on page cache pages. > flush_kernel_dcache_page() (not to be confused with flush_dcache_page) > would appear to be the closest fit according to my reading of the > documentation, however the ARM implementation appears to be a no-op... It depends on exactly what you're doing. In the worst case, (ping pong reads and writes through both aliases) you have to flush and invalidate both alias 1 alias 2 each time you access on one and then another. Can you explain how the code works? it looks to me like you read the xdr stuff through the vmap region then write it out directly to the pages? *if* this is just a conversion, *and* you never need to read the new data through the vmap alias, you might be able to get away with a flush_dcache_page in nfs_readdir_release_array(). If the access pattern is more complex, you'll need more stuff splashed through the loop (including vmap invalidation/flushing). Is there any way you could just rewrite nfs_readdir_add_to_array() to use the vmap address instead of doing a kmap? That way everything will go through a single alias and not end up with this incoherency. James ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 23:28 ` James Bottomley @ 2011-01-06 17:40 ` James Bottomley 2011-01-06 17:47 ` Trond Myklebust ` (2 more replies) 0 siblings, 3 replies; 71+ messages in thread From: James Bottomley @ 2011-01-06 17:40 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Russell King - ARM Linux, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > Can you explain how the code works? it looks to me like you read the xdr > stuff through the vmap region then write it out directly to the pages? OK, I think I see how this is supposed to work: It's a sequential loop of reading in via the pages (i.e. through the kernel mapping) and then updating those pages via the vmap. In which case, I think this patch is what you need. The theory of operation is that the readdir on pages actually uses the network DMA operations to perform, so when it's finished, the underlying page is up to date. After this you invalidate the vmap range, so we have no cache lines above it (so it picks up the values from the uptodate page). Finally, after the operation on the vmap region has finished, you flush it so that any updated contents go back to the pages themselves before the next iteration begins. Does this look right to people? I've verified it fixes the issues on parisc. James --- diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 996dd89..bde1911 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, if (status < 0) break; pglen = status; + + invalidate_kernel_vmap_range(pages_ptr, pglen); + status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); if (status < 0) { if (status == -ENOSPC) status = 0; break; } + flush_kernel_vmap_range(pages_ptr, pglen); } while (array->eof_index < 0); nfs_readdir_free_large_page(pages_ptr, pages, array_size); ^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:40 ` James Bottomley @ 2011-01-06 17:47 ` Trond Myklebust 2011-01-06 17:51 ` James Bottomley 2011-01-06 17:55 ` Linus Torvalds [not found] ` <1294335614.22825.154.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> 2011-01-06 20:19 ` John Stoffel 2 siblings, 2 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-06 17:47 UTC (permalink / raw) To: James Bottomley Cc: Linus Torvalds, Russell King - ARM Linux, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Thu, 2011-01-06 at 11:40 -0600, James Bottomley wrote: > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > > Can you explain how the code works? it looks to me like you read the xdr > > stuff through the vmap region then write it out directly to the pages? > > OK, I think I see how this is supposed to work: It's a sequential loop > of reading in via the pages (i.e. through the kernel mapping) and then > updating those pages via the vmap. In which case, I think this patch is > what you need. > > The theory of operation is that the readdir on pages actually uses the > network DMA operations to perform, so when it's finished, the underlying > page is up to date. After this you invalidate the vmap range, so we > have no cache lines above it (so it picks up the values from the > uptodate page). Finally, after the operation on the vmap region has > finished, you flush it so that any updated contents go back to the pages > themselves before the next iteration begins. > > Does this look right to people? I've verified it fixes the issues on > parisc. > > James > > --- > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index 996dd89..bde1911 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > if (status < 0) > break; > pglen = status; > + > + invalidate_kernel_vmap_range(pages_ptr, pglen); > + > status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); > if (status < 0) { > if (status == -ENOSPC) > status = 0; > break; > } > + flush_kernel_vmap_range(pages_ptr, pglen); Why is this line needed? We're not writing through the virtual mapping. We checked using just the invalidate_kernel_vmap_range(), and that appeared to suffice to fix the problem on ARM. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:47 ` Trond Myklebust @ 2011-01-06 17:51 ` James Bottomley 2011-01-06 17:55 ` Linus Torvalds 1 sibling, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-06 17:51 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Russell King - ARM Linux, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Thu, 2011-01-06 at 12:47 -0500, Trond Myklebust wrote: > On Thu, 2011-01-06 at 11:40 -0600, James Bottomley wrote: > > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > > > Can you explain how the code works? it looks to me like you read the xdr > > > stuff through the vmap region then write it out directly to the pages? > > > > OK, I think I see how this is supposed to work: It's a sequential loop > > of reading in via the pages (i.e. through the kernel mapping) and then > > updating those pages via the vmap. In which case, I think this patch is > > what you need. > > > > The theory of operation is that the readdir on pages actually uses the > > network DMA operations to perform, so when it's finished, the underlying > > page is up to date. After this you invalidate the vmap range, so we > > have no cache lines above it (so it picks up the values from the > > uptodate page). Finally, after the operation on the vmap region has > > finished, you flush it so that any updated contents go back to the pages > > themselves before the next iteration begins. > > > > Does this look right to people? I've verified it fixes the issues on > > parisc. > > > > James > > > > --- > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > > index 996dd89..bde1911 100644 > > --- a/fs/nfs/dir.c > > +++ b/fs/nfs/dir.c > > @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > > if (status < 0) > > break; > > pglen = status; > > + > > + invalidate_kernel_vmap_range(pages_ptr, pglen); > > + > > status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); > > if (status < 0) { > > if (status == -ENOSPC) > > status = 0; > > break; > > } > > + flush_kernel_vmap_range(pages_ptr, pglen); > > Why is this line needed? We're not writing through the virtual mapping. If you're not altering it, it isn't ... the problem on parisc is that invalidate is a nop for us because flush does it all, but I can fix that. James > We checked using just the invalidate_kernel_vmap_range(), and that > appeared to suffice to fix the problem on ARM. > > Cheers > Trond ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:47 ` Trond Myklebust 2011-01-06 17:51 ` James Bottomley @ 2011-01-06 17:55 ` Linus Torvalds 2011-01-06 17:55 ` Linus Torvalds 2011-01-07 18:53 ` Trond Myklebust 1 sibling, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-06 17:55 UTC (permalink / raw) To: Trond Myklebust Cc: linux-arch, linux-nfs, Russell King - ARM Linux, Parisc List, linux-kernel, James Bottomley, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel On Thu, Jan 6, 2011 at 9:47 AM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > Why is this line needed? We're not writing through the virtual mapping. I haven't looked at the sequence of accesses, but you need to be _very_ aware that "write-through" is absolutely NOT sufficient for cache coherency. In cache coherency, you have three options: - true coherency (eg physically indexed/tagged caches) - exclusion (eg virtual caches, but with an exclusion guarantee that guarantees that aliases cannot happen: either by using physical tagging or by not allowing cases that could cause virtual aliases) - write-through AND non-cached reads (ie "no caching at all"). You seem to be forgetting the "no cached reads" part. It's not sufficient to flush after a write - you need to make sure that you also don't have a cached copy of the alias for the read. So "We're not writing through the virtual mapping" is NOT a sufficient excuse. If you're reading through the virtual mapping, you need to make sure that the virtual mapping is flushed _after_ any writes through any other mapping and _before_ any reads through the virtual one. This is why you really really really generally don't want to have aliasing. Purely virtual caches are pure crap. Really. Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:55 ` Linus Torvalds @ 2011-01-06 17:55 ` Linus Torvalds 2011-01-07 18:53 ` Trond Myklebust 1 sibling, 0 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-06 17:55 UTC (permalink / raw) To: Trond Myklebust Cc: James Bottomley, Russell King - ARM Linux, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Thu, Jan 6, 2011 at 9:47 AM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > Why is this line needed? We're not writing through the virtual mapping. I haven't looked at the sequence of accesses, but you need to be _very_ aware that "write-through" is absolutely NOT sufficient for cache coherency. In cache coherency, you have three options: - true coherency (eg physically indexed/tagged caches) - exclusion (eg virtual caches, but with an exclusion guarantee that guarantees that aliases cannot happen: either by using physical tagging or by not allowing cases that could cause virtual aliases) - write-through AND non-cached reads (ie "no caching at all"). You seem to be forgetting the "no cached reads" part. It's not sufficient to flush after a write - you need to make sure that you also don't have a cached copy of the alias for the read. So "We're not writing through the virtual mapping" is NOT a sufficient excuse. If you're reading through the virtual mapping, you need to make sure that the virtual mapping is flushed _after_ any writes through any other mapping and _before_ any reads through the virtual one. This is why you really really really generally don't want to have aliasing. Purely virtual caches are pure crap. Really. Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:55 ` Linus Torvalds 2011-01-06 17:55 ` Linus Torvalds @ 2011-01-07 18:53 ` Trond Myklebust [not found] ` <1294426405.2929.23.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2011-01-07 19:05 ` James Bottomley 1 sibling, 2 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-07 18:53 UTC (permalink / raw) To: Linus Torvalds Cc: James Bottomley, Russell King - ARM Linux, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Thu, 2011-01-06 at 09:55 -0800, Linus Torvalds wrote: > On Thu, Jan 6, 2011 at 9:47 AM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > Why is this line needed? We're not writing through the virtual mapping. > > I haven't looked at the sequence of accesses, but you need to be > _very_ aware that "write-through" is absolutely NOT sufficient for > cache coherency. > > In cache coherency, you have three options: > > - true coherency (eg physically indexed/tagged caches) > > - exclusion (eg virtual caches, but with an exclusion guarantee that > guarantees that aliases cannot happen: either by using physical > tagging or by not allowing cases that could cause virtual aliases) > > - write-through AND non-cached reads (ie "no caching at all"). > > You seem to be forgetting the "no cached reads" part. It's not > sufficient to flush after a write - you need to make sure that you > also don't have a cached copy of the alias for the read. > > So "We're not writing through the virtual mapping" is NOT a sufficient > excuse. If you're reading through the virtual mapping, you need to > make sure that the virtual mapping is flushed _after_ any writes > through any other mapping and _before_ any reads through the virtual > one. I'm aware of that. That part should be taken care of by the call to invalidate_kernel_vmap_range() which was in both James and my patch. There is already code in the SUNRPC layer that calls flush_dcache_page() after writing (although as Russell pointed out earlier, that is apparently a no-op for non-page cache pages such as these). > This is why you really really really generally don't want to have > aliasing. Purely virtual caches are pure crap. Really. Well, it looks as if NOMMU is giving us problems due to the lack of a vm_map_ram() (see https://bugzilla.kernel.org/show_bug.cgi?id=26262). I'd still like to keep the existing code for those architectures that don't have problems, since that allows us to send 32k READDIR requests instead of being limited to 4k. For large directories, that is a clear win. For the NOMMU case we will just go back to using a single page for storage (and 4k READDIR requests only). Should I just do the same for architectures like ARM and PARISC? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1294426405.2929.23.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <1294426405.2929.23.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2011-01-07 19:02 ` Russell King - ARM Linux 2011-01-07 19:02 ` Russell King - ARM Linux ` (2 more replies) 0 siblings, 3 replies; 71+ messages in thread From: Russell King - ARM Linux @ 2011-01-07 19:02 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, James Bottomley, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote: > I'd still like to keep the existing code for those architectures that > don't have problems, since that allows us to send 32k READDIR requests > instead of being limited to 4k. For large directories, that is a clear > win. > For the NOMMU case we will just go back to using a single page for > storage (and 4k READDIR requests only). Should I just do the same for > architectures like ARM and PARISC? I think you said that readdir reads via the vmalloc mapping of the group of pages, but XDR writes to the individual pages. As I understand NFS, you receive a packet, you then have to use XDR to unpack the data, which you presumably write into the set of struct page *'s using kmap? Isn't a solution to have XDR write directly into the vmalloc mapping rather than using struct page * and kmap? -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-07 19:02 ` Russell King - ARM Linux @ 2011-01-07 19:02 ` Russell King - ARM Linux 2011-01-07 19:11 ` James Bottomley 2011-01-07 19:13 ` Trond Myklebust 2 siblings, 0 replies; 71+ messages in thread From: Russell King - ARM Linux @ 2011-01-07 19:02 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, James Bottomley, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote: > I'd still like to keep the existing code for those architectures that > don't have problems, since that allows us to send 32k READDIR requests > instead of being limited to 4k. For large directories, that is a clear > win. > For the NOMMU case we will just go back to using a single page for > storage (and 4k READDIR requests only). Should I just do the same for > architectures like ARM and PARISC? I think you said that readdir reads via the vmalloc mapping of the group of pages, but XDR writes to the individual pages. As I understand NFS, you receive a packet, you then have to use XDR to unpack the data, which you presumably write into the set of struct page *'s using kmap? Isn't a solution to have XDR write directly into the vmalloc mapping rather than using struct page * and kmap? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-07 19:02 ` Russell King - ARM Linux 2011-01-07 19:02 ` Russell King - ARM Linux @ 2011-01-07 19:11 ` James Bottomley 2011-01-07 19:11 ` James Bottomley [not found] ` <1294427467.4895.66.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> 2011-01-07 19:13 ` Trond Myklebust 2 siblings, 2 replies; 71+ messages in thread From: James Bottomley @ 2011-01-07 19:11 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Trond Myklebust, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Fri, 2011-01-07 at 19:02 +0000, Russell King - ARM Linux wrote: > On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote: > > I'd still like to keep the existing code for those architectures that > > don't have problems, since that allows us to send 32k READDIR requests > > instead of being limited to 4k. For large directories, that is a clear > > win. > > For the NOMMU case we will just go back to using a single page for > > storage (and 4k READDIR requests only). Should I just do the same for > > architectures like ARM and PARISC? > > I think you said that readdir reads via the vmalloc mapping of the > group of pages, but XDR writes to the individual pages. Actually it's the other way around, but the point still stands. > As I understand NFS, you receive a packet, you then have to use XDR > to unpack the data, which you presumably write into the set of > struct page *'s using kmap? > > Isn't a solution to have XDR write directly into the vmalloc mapping > rather than using struct page * and kmap? So, unfortuantely, I looked at doing this and we can't. the ->readdir() call takes an array of pages, not a kernel virtual address of the pages, so there's no way to tell it to use a different mapping from the usual kernel one on them. On the other hand, the xdr routines, since they take the pages anyway, could use a scatterlist approach to writing through the kernel mapping instead of using vmap ... we have all the machinery for this in lib/scatterlist.c ... it's not designed for this case, since it's designed to allow arbitrary linear reads and writes on a block scatterlist, but the principle is the same ... it looks like it would be rather a big patch, though ... James ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-07 19:11 ` James Bottomley @ 2011-01-07 19:11 ` James Bottomley [not found] ` <1294427467.4895.66.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> 1 sibling, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-07 19:11 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Trond Myklebust, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Fri, 2011-01-07 at 19:02 +0000, Russell King - ARM Linux wrote: > On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote: > > I'd still like to keep the existing code for those architectures that > > don't have problems, since that allows us to send 32k READDIR requests > > instead of being limited to 4k. For large directories, that is a clear > > win. > > For the NOMMU case we will just go back to using a single page for > > storage (and 4k READDIR requests only). Should I just do the same for > > architectures like ARM and PARISC? > > I think you said that readdir reads via the vmalloc mapping of the > group of pages, but XDR writes to the individual pages. Actually it's the other way around, but the point still stands. > As I understand NFS, you receive a packet, you then have to use XDR > to unpack the data, which you presumably write into the set of > struct page *'s using kmap? > > Isn't a solution to have XDR write directly into the vmalloc mapping > rather than using struct page * and kmap? So, unfortuantely, I looked at doing this and we can't. the ->readdir() call takes an array of pages, not a kernel virtual address of the pages, so there's no way to tell it to use a different mapping from the usual kernel one on them. On the other hand, the xdr routines, since they take the pages anyway, could use a scatterlist approach to writing through the kernel mapping instead of using vmap ... we have all the machinery for this in lib/scatterlist.c ... it's not designed for this case, since it's designed to allow arbitrary linear reads and writes on a block scatterlist, but the principle is the same ... it looks like it would be rather a big patch, though ... James ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1294427467.4895.66.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <1294427467.4895.66.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> @ 2011-01-08 16:49 ` Trond Myklebust 2011-01-08 16:49 ` Trond Myklebust 2011-01-08 23:15 ` Trond Myklebust 0 siblings, 2 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-08 16:49 UTC (permalink / raw) To: James Bottomley Cc: Russell King - ARM Linux, Linus Torvalds, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA On Fri, 2011-01-07 at 13:11 -0600, James Bottomley wrote: > On the other hand, the xdr routines, since they take the pages anyway, > could use a scatterlist approach to writing through the kernel mapping > instead of using vmap ... we have all the machinery for this in > lib/scatterlist.c ... it's not designed for this case, since it's > designed to allow arbitrary linear reads and writes on a block > scatterlist, but the principle is the same ... it looks like it would be > rather a big patch, though ... The following alternative seems to work for me, but has only been lightly tested so far. It's a bit large for a stable patch, but not too ungainly. It modifies xdr_stream, adding the ability to iterate through page data. To avoid kmap()/kunmap(), it does require that pages be allocated in lowmem, but since the only use case here is when using page arrays as temporary buffers, that seems like an acceptable compromise. Cheers Trond --------------------------------------------------------------------------------- From f87f13e3198ec536c1b9cfe19ad47df4fb922382 Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Date: Fri, 7 Jan 2011 18:51:33 -0500 Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir vm_map_ram() is not available on NOMMU platforms, and causes trouble on incoherrent architectures such as ARM when we access the page data through both the direct and the virtual mapping. The alternative is to use the direct mapping to access page data for the case when we are not crossing a page boundary, but to copy the data into a linear scratch buffer when we are accessing data that spans page boundaries. Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> --- fs/nfs/dir.c | 45 ++++++++--------- fs/nfs/nfs2xdr.c | 6 -- fs/nfs/nfs3xdr.c | 6 -- fs/nfs/nfs4xdr.c | 6 -- include/linux/sunrpc/xdr.h | 4 +- net/sunrpc/xdr.c | 115 ++++++++++++++++++++++++++++++++++++-------- 6 files changed, 120 insertions(+), 62 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 996dd89..ad9e5e0 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -33,7 +33,6 @@ #include <linux/namei.h> #include <linux/mount.h> #include <linux/sched.h> -#include <linux/vmalloc.h> #include <linux/kmemleak.h> #include "delegation.h" @@ -459,25 +458,27 @@ out: /* Perform conversion from xdr to cache array */ static int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, - void *xdr_page, struct page *page, unsigned int buflen) + struct page **xdr_pages, struct page *page, unsigned int buflen) { struct xdr_stream stream; - struct xdr_buf buf; - __be32 *ptr = xdr_page; + struct xdr_buf buf = { + .pages = xdr_pages, + .page_len = buflen, + .buflen = buflen, + .len = buflen, + }; struct nfs_cache_array *array; unsigned int count = 0; + struct page *scratch; int status; - buf.head->iov_base = xdr_page; - buf.head->iov_len = buflen; - buf.tail->iov_len = 0; - buf.page_base = 0; - buf.page_len = 0; - buf.buflen = buf.head->iov_len; - buf.len = buf.head->iov_len; - - xdr_init_decode(&stream, &buf, ptr); + scratch = alloc_page(GFP_KERNEL); + if (scratch == NULL) + return -ENOMEM; + xdr_init_decode(&stream, &buf, NULL); + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE); + xdr_enter_page(&stream, buflen); do { status = xdr_decode(desc, entry, &stream); @@ -506,6 +507,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en } else status = PTR_ERR(array); } + + put_page(scratch); return status; } @@ -521,7 +524,6 @@ static void nfs_readdir_free_large_page(void *ptr, struct page **pages, unsigned int npages) { - vm_unmap_ram(ptr, npages); nfs_readdir_free_pagearray(pages, npages); } @@ -530,9 +532,8 @@ void nfs_readdir_free_large_page(void *ptr, struct page **pages, * to nfs_readdir_free_large_page */ static -void *nfs_readdir_large_page(struct page **pages, unsigned int npages) +int nfs_readdir_large_page(struct page **pages, unsigned int npages) { - void *ptr; unsigned int i; for (i = 0; i < npages; i++) { @@ -541,13 +542,11 @@ void *nfs_readdir_large_page(struct page **pages, unsigned int npages) goto out_freepages; pages[i] = page; } + return 0; - ptr = vm_map_ram(pages, npages, 0, PAGE_KERNEL); - if (!IS_ERR_OR_NULL(ptr)) - return ptr; out_freepages: nfs_readdir_free_pagearray(pages, i); - return NULL; + return -ENOMEM; } static @@ -577,8 +576,8 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, memset(array, 0, sizeof(struct nfs_cache_array)); array->eof_index = -1; - pages_ptr = nfs_readdir_large_page(pages, array_size); - if (!pages_ptr) + status = nfs_readdir_large_page(pages, array_size); + if (status < 0) goto out_release_array; do { unsigned int pglen; @@ -587,7 +586,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, if (status < 0) break; pglen = status; - status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); + status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen); if (status < 0) { if (status == -ENOSPC) status = 0; diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c index 5914a19..b382a1b 100644 --- a/fs/nfs/nfs2xdr.c +++ b/fs/nfs/nfs2xdr.c @@ -487,12 +487,6 @@ nfs_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_se entry->d_type = DT_UNKNOWN; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c index f6cc60f..ba91236 100644 --- a/fs/nfs/nfs3xdr.c +++ b/fs/nfs/nfs3xdr.c @@ -647,12 +647,6 @@ nfs3_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_s memset((u8*)(entry->fh), 0, sizeof(*entry->fh)); } - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c index 9f1826b..0662a98 100644 --- a/fs/nfs/nfs4xdr.c +++ b/fs/nfs/nfs4xdr.c @@ -6215,12 +6215,6 @@ __be32 *nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, if (verify_attr_len(xdr, p, len) < 0) goto out_overflow; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h index 498ab93..7783c68 100644 --- a/include/linux/sunrpc/xdr.h +++ b/include/linux/sunrpc/xdr.h @@ -201,6 +201,8 @@ struct xdr_stream { __be32 *end; /* end of available buffer space */ struct kvec *iov; /* pointer to the current kvec */ + struct kvec scratch; /* Scratch buffer */ + struct page **page_ptr; /* pointer to the current page */ }; extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); @@ -208,7 +210,7 @@ extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes); extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int base, unsigned int len); extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); -extern __be32 *xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes); +extern void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen); extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes); extern void xdr_read_pages(struct xdr_stream *xdr, unsigned int len); extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len); diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index cd9e841..cff0974 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -569,29 +569,112 @@ void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p) xdr->iov = iov; xdr->p = p; xdr->end = (__be32 *)((char *)iov->iov_base + len); + xdr->page_ptr = NULL; + xdr->scratch.iov_base = NULL; + xdr->scratch.iov_len = 0; } EXPORT_SYMBOL_GPL(xdr_init_decode); /** - * xdr_inline_peek - Allow read-ahead in the XDR data stream + * xdr_set_scratch_buffer - Attach a scratch buffer for decoding data. * @xdr: pointer to xdr_stream struct - * @nbytes: number of bytes of data to decode + * @buf: pointer to an empty buffer + * @buflen: size of 'buf' * - * Check if the input buffer is long enough to enable us to decode - * 'nbytes' more bytes of data starting at the current position. - * If so return the current pointer without updating the current - * pointer position. + * The scratch buffer is used when decoding from an array of pages. + * If an xdr_inline_decode() call spans across page boundaries, then + * we copy the data into the scratch buffer in order to allow linear + * access. */ -__be32 * xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes) +void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen) +{ + xdr->scratch.iov_base = buf; + xdr->scratch.iov_len = buflen; +} +EXPORT_SYMBOL_GPL(xdr_set_scratch_buffer); + +static int xdr_set_page_base(struct xdr_stream *xdr, + unsigned int base, unsigned int len) +{ + unsigned int pgnr; + unsigned int maxlen; + unsigned int pgoff; + unsigned int pgend; + void *kaddr; + + maxlen = xdr->buf->page_len; + if (base >= maxlen) + return -EINVAL; + maxlen -= base; + + if (len > maxlen) + len = maxlen; + + base += xdr->buf->page_base; + + pgnr = base >> PAGE_SHIFT; + xdr->page_ptr = &xdr->buf->pages[pgnr]; + kaddr = page_address(*xdr->page_ptr); + + pgoff = base & ~PAGE_MASK; + xdr->p = (__be32*)(kaddr + pgoff); + + pgend = pgoff + len; + if (pgend > PAGE_SIZE) + pgend = PAGE_SIZE; + xdr->end = (__be32*)(kaddr + pgend); + return 0; +} + +static int xdr_set_next_page(struct xdr_stream *xdr) +{ + unsigned int newbase; + + newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT; + newbase -= xdr->buf->page_base; + + return xdr_set_page_base(xdr, newbase, PAGE_SIZE); +} + +static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { __be32 *p = xdr->p; __be32 *q = p + XDR_QUADLEN(nbytes); if (unlikely(q > xdr->end || q < p)) return NULL; + xdr->p = q; return p; } -EXPORT_SYMBOL_GPL(xdr_inline_peek); + +static __be32 *xdr_page_inline_decode(struct xdr_stream *xdr, size_t nbytes) +{ + size_t cplen; + void *cpdest; + __be32 *p; + + if (xdr->p == xdr->end && xdr_set_next_page(xdr) < 0) + return NULL; + + p = __xdr_inline_decode(xdr, nbytes); + if (p != NULL) + return p; + + if (nbytes > xdr->scratch.iov_len) + return NULL; + cplen = (char *)xdr->end - (char *)xdr->p; + cpdest = xdr->scratch.iov_base; + memcpy(cpdest, xdr->p, cplen); + cpdest += cplen; + nbytes -= cplen; + if (xdr_set_next_page(xdr) < 0) + return NULL; + p = __xdr_inline_decode(xdr, nbytes); + if (p == NULL) + return NULL; + memcpy(cpdest, p, nbytes); + return xdr->scratch.iov_base; +} /** * xdr_inline_decode - Retrieve non-page XDR data to decode @@ -605,13 +688,9 @@ EXPORT_SYMBOL_GPL(xdr_inline_peek); */ __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { - __be32 *p = xdr->p; - __be32 *q = p + XDR_QUADLEN(nbytes); - - if (unlikely(q > xdr->end || q < p)) - return NULL; - xdr->p = q; - return p; + if (xdr->page_ptr == NULL) + return __xdr_inline_decode(xdr, nbytes); + return xdr_page_inline_decode(xdr, nbytes); } EXPORT_SYMBOL_GPL(xdr_inline_decode); @@ -671,16 +750,12 @@ EXPORT_SYMBOL_GPL(xdr_read_pages); */ void xdr_enter_page(struct xdr_stream *xdr, unsigned int len) { - char * kaddr = page_address(xdr->buf->pages[0]); xdr_read_pages(xdr, len); /* * Position current pointer at beginning of tail, and * set remaining message length. */ - if (len > PAGE_CACHE_SIZE - xdr->buf->page_base) - len = PAGE_CACHE_SIZE - xdr->buf->page_base; - xdr->p = (__be32 *)(kaddr + xdr->buf->page_base); - xdr->end = (__be32 *)((char *)xdr->p + len); + xdr_set_page_base(xdr, 0, len); } EXPORT_SYMBOL_GPL(xdr_enter_page); -- 1.7.3.4 -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org www.netapp.com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-08 16:49 ` Trond Myklebust @ 2011-01-08 16:49 ` Trond Myklebust 2011-01-08 23:15 ` Trond Myklebust 1 sibling, 0 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-08 16:49 UTC (permalink / raw) To: James Bottomley Cc: Russell King - ARM Linux, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Fri, 2011-01-07 at 13:11 -0600, James Bottomley wrote: > On the other hand, the xdr routines, since they take the pages anyway, > could use a scatterlist approach to writing through the kernel mapping > instead of using vmap ... we have all the machinery for this in > lib/scatterlist.c ... it's not designed for this case, since it's > designed to allow arbitrary linear reads and writes on a block > scatterlist, but the principle is the same ... it looks like it would be > rather a big patch, though ... The following alternative seems to work for me, but has only been lightly tested so far. It's a bit large for a stable patch, but not too ungainly. It modifies xdr_stream, adding the ability to iterate through page data. To avoid kmap()/kunmap(), it does require that pages be allocated in lowmem, but since the only use case here is when using page arrays as temporary buffers, that seems like an acceptable compromise. Cheers Trond --------------------------------------------------------------------------------- From f87f13e3198ec536c1b9cfe19ad47df4fb922382 Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Fri, 7 Jan 2011 18:51:33 -0500 Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir vm_map_ram() is not available on NOMMU platforms, and causes trouble on incoherrent architectures such as ARM when we access the page data through both the direct and the virtual mapping. The alternative is to use the direct mapping to access page data for the case when we are not crossing a page boundary, but to copy the data into a linear scratch buffer when we are accessing data that spans page boundaries. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 45 ++++++++--------- fs/nfs/nfs2xdr.c | 6 -- fs/nfs/nfs3xdr.c | 6 -- fs/nfs/nfs4xdr.c | 6 -- include/linux/sunrpc/xdr.h | 4 +- net/sunrpc/xdr.c | 115 ++++++++++++++++++++++++++++++++++++-------- 6 files changed, 120 insertions(+), 62 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 996dd89..ad9e5e0 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -33,7 +33,6 @@ #include <linux/namei.h> #include <linux/mount.h> #include <linux/sched.h> -#include <linux/vmalloc.h> #include <linux/kmemleak.h> #include "delegation.h" @@ -459,25 +458,27 @@ out: /* Perform conversion from xdr to cache array */ static int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, - void *xdr_page, struct page *page, unsigned int buflen) + struct page **xdr_pages, struct page *page, unsigned int buflen) { struct xdr_stream stream; - struct xdr_buf buf; - __be32 *ptr = xdr_page; + struct xdr_buf buf = { + .pages = xdr_pages, + .page_len = buflen, + .buflen = buflen, + .len = buflen, + }; struct nfs_cache_array *array; unsigned int count = 0; + struct page *scratch; int status; - buf.head->iov_base = xdr_page; - buf.head->iov_len = buflen; - buf.tail->iov_len = 0; - buf.page_base = 0; - buf.page_len = 0; - buf.buflen = buf.head->iov_len; - buf.len = buf.head->iov_len; - - xdr_init_decode(&stream, &buf, ptr); + scratch = alloc_page(GFP_KERNEL); + if (scratch == NULL) + return -ENOMEM; + xdr_init_decode(&stream, &buf, NULL); + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE); + xdr_enter_page(&stream, buflen); do { status = xdr_decode(desc, entry, &stream); @@ -506,6 +507,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en } else status = PTR_ERR(array); } + + put_page(scratch); return status; } @@ -521,7 +524,6 @@ static void nfs_readdir_free_large_page(void *ptr, struct page **pages, unsigned int npages) { - vm_unmap_ram(ptr, npages); nfs_readdir_free_pagearray(pages, npages); } @@ -530,9 +532,8 @@ void nfs_readdir_free_large_page(void *ptr, struct page **pages, * to nfs_readdir_free_large_page */ static -void *nfs_readdir_large_page(struct page **pages, unsigned int npages) +int nfs_readdir_large_page(struct page **pages, unsigned int npages) { - void *ptr; unsigned int i; for (i = 0; i < npages; i++) { @@ -541,13 +542,11 @@ void *nfs_readdir_large_page(struct page **pages, unsigned int npages) goto out_freepages; pages[i] = page; } + return 0; - ptr = vm_map_ram(pages, npages, 0, PAGE_KERNEL); - if (!IS_ERR_OR_NULL(ptr)) - return ptr; out_freepages: nfs_readdir_free_pagearray(pages, i); - return NULL; + return -ENOMEM; } static @@ -577,8 +576,8 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, memset(array, 0, sizeof(struct nfs_cache_array)); array->eof_index = -1; - pages_ptr = nfs_readdir_large_page(pages, array_size); - if (!pages_ptr) + status = nfs_readdir_large_page(pages, array_size); + if (status < 0) goto out_release_array; do { unsigned int pglen; @@ -587,7 +586,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, if (status < 0) break; pglen = status; - status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); + status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen); if (status < 0) { if (status == -ENOSPC) status = 0; diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c index 5914a19..b382a1b 100644 --- a/fs/nfs/nfs2xdr.c +++ b/fs/nfs/nfs2xdr.c @@ -487,12 +487,6 @@ nfs_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_se entry->d_type = DT_UNKNOWN; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c index f6cc60f..ba91236 100644 --- a/fs/nfs/nfs3xdr.c +++ b/fs/nfs/nfs3xdr.c @@ -647,12 +647,6 @@ nfs3_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_s memset((u8*)(entry->fh), 0, sizeof(*entry->fh)); } - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c index 9f1826b..0662a98 100644 --- a/fs/nfs/nfs4xdr.c +++ b/fs/nfs/nfs4xdr.c @@ -6215,12 +6215,6 @@ __be32 *nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, if (verify_attr_len(xdr, p, len) < 0) goto out_overflow; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h index 498ab93..7783c68 100644 --- a/include/linux/sunrpc/xdr.h +++ b/include/linux/sunrpc/xdr.h @@ -201,6 +201,8 @@ struct xdr_stream { __be32 *end; /* end of available buffer space */ struct kvec *iov; /* pointer to the current kvec */ + struct kvec scratch; /* Scratch buffer */ + struct page **page_ptr; /* pointer to the current page */ }; extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); @@ -208,7 +210,7 @@ extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes); extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int base, unsigned int len); extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); -extern __be32 *xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes); +extern void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen); extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes); extern void xdr_read_pages(struct xdr_stream *xdr, unsigned int len); extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len); diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index cd9e841..cff0974 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -569,29 +569,112 @@ void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p) xdr->iov = iov; xdr->p = p; xdr->end = (__be32 *)((char *)iov->iov_base + len); + xdr->page_ptr = NULL; + xdr->scratch.iov_base = NULL; + xdr->scratch.iov_len = 0; } EXPORT_SYMBOL_GPL(xdr_init_decode); /** - * xdr_inline_peek - Allow read-ahead in the XDR data stream + * xdr_set_scratch_buffer - Attach a scratch buffer for decoding data. * @xdr: pointer to xdr_stream struct - * @nbytes: number of bytes of data to decode + * @buf: pointer to an empty buffer + * @buflen: size of 'buf' * - * Check if the input buffer is long enough to enable us to decode - * 'nbytes' more bytes of data starting at the current position. - * If so return the current pointer without updating the current - * pointer position. + * The scratch buffer is used when decoding from an array of pages. + * If an xdr_inline_decode() call spans across page boundaries, then + * we copy the data into the scratch buffer in order to allow linear + * access. */ -__be32 * xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes) +void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen) +{ + xdr->scratch.iov_base = buf; + xdr->scratch.iov_len = buflen; +} +EXPORT_SYMBOL_GPL(xdr_set_scratch_buffer); + +static int xdr_set_page_base(struct xdr_stream *xdr, + unsigned int base, unsigned int len) +{ + unsigned int pgnr; + unsigned int maxlen; + unsigned int pgoff; + unsigned int pgend; + void *kaddr; + + maxlen = xdr->buf->page_len; + if (base >= maxlen) + return -EINVAL; + maxlen -= base; + + if (len > maxlen) + len = maxlen; + + base += xdr->buf->page_base; + + pgnr = base >> PAGE_SHIFT; + xdr->page_ptr = &xdr->buf->pages[pgnr]; + kaddr = page_address(*xdr->page_ptr); + + pgoff = base & ~PAGE_MASK; + xdr->p = (__be32*)(kaddr + pgoff); + + pgend = pgoff + len; + if (pgend > PAGE_SIZE) + pgend = PAGE_SIZE; + xdr->end = (__be32*)(kaddr + pgend); + return 0; +} + +static int xdr_set_next_page(struct xdr_stream *xdr) +{ + unsigned int newbase; + + newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT; + newbase -= xdr->buf->page_base; + + return xdr_set_page_base(xdr, newbase, PAGE_SIZE); +} + +static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { __be32 *p = xdr->p; __be32 *q = p + XDR_QUADLEN(nbytes); if (unlikely(q > xdr->end || q < p)) return NULL; + xdr->p = q; return p; } -EXPORT_SYMBOL_GPL(xdr_inline_peek); + +static __be32 *xdr_page_inline_decode(struct xdr_stream *xdr, size_t nbytes) +{ + size_t cplen; + void *cpdest; + __be32 *p; + + if (xdr->p == xdr->end && xdr_set_next_page(xdr) < 0) + return NULL; + + p = __xdr_inline_decode(xdr, nbytes); + if (p != NULL) + return p; + + if (nbytes > xdr->scratch.iov_len) + return NULL; + cplen = (char *)xdr->end - (char *)xdr->p; + cpdest = xdr->scratch.iov_base; + memcpy(cpdest, xdr->p, cplen); + cpdest += cplen; + nbytes -= cplen; + if (xdr_set_next_page(xdr) < 0) + return NULL; + p = __xdr_inline_decode(xdr, nbytes); + if (p == NULL) + return NULL; + memcpy(cpdest, p, nbytes); + return xdr->scratch.iov_base; +} /** * xdr_inline_decode - Retrieve non-page XDR data to decode @@ -605,13 +688,9 @@ EXPORT_SYMBOL_GPL(xdr_inline_peek); */ __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { - __be32 *p = xdr->p; - __be32 *q = p + XDR_QUADLEN(nbytes); - - if (unlikely(q > xdr->end || q < p)) - return NULL; - xdr->p = q; - return p; + if (xdr->page_ptr == NULL) + return __xdr_inline_decode(xdr, nbytes); + return xdr_page_inline_decode(xdr, nbytes); } EXPORT_SYMBOL_GPL(xdr_inline_decode); @@ -671,16 +750,12 @@ EXPORT_SYMBOL_GPL(xdr_read_pages); */ void xdr_enter_page(struct xdr_stream *xdr, unsigned int len) { - char * kaddr = page_address(xdr->buf->pages[0]); xdr_read_pages(xdr, len); /* * Position current pointer at beginning of tail, and * set remaining message length. */ - if (len > PAGE_CACHE_SIZE - xdr->buf->page_base) - len = PAGE_CACHE_SIZE - xdr->buf->page_base; - xdr->p = (__be32 *)(kaddr + xdr->buf->page_base); - xdr->end = (__be32 *)((char *)xdr->p + len); + xdr_set_page_base(xdr, 0, len); } EXPORT_SYMBOL_GPL(xdr_enter_page); -- 1.7.3.4 -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-08 16:49 ` Trond Myklebust 2011-01-08 16:49 ` Trond Myklebust @ 2011-01-08 23:15 ` Trond Myklebust 2011-01-08 23:15 ` Trond Myklebust [not found] ` <1294528551.4181.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 1 sibling, 2 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-08 23:15 UTC (permalink / raw) To: James Bottomley Cc: Russell King - ARM Linux, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Sat, 2011-01-08 at 11:49 -0500, Trond Myklebust wrote: > On Fri, 2011-01-07 at 13:11 -0600, James Bottomley wrote: > > On the other hand, the xdr routines, since they take the pages anyway, > > could use a scatterlist approach to writing through the kernel mapping > > instead of using vmap ... we have all the machinery for this in > > lib/scatterlist.c ... it's not designed for this case, since it's > > designed to allow arbitrary linear reads and writes on a block > > scatterlist, but the principle is the same ... it looks like it would be > > rather a big patch, though ... > > The following alternative seems to work for me, but has only been > lightly tested so far. It's a bit large for a stable patch, but not too > ungainly. > > It modifies xdr_stream, adding the ability to iterate through page data. > To avoid kmap()/kunmap(), it does require that pages be allocated in > lowmem, but since the only use case here is when using page arrays as > temporary buffers, that seems like an acceptable compromise. ...and here is an update which makes the whole process transparent to the decoder. It basically teaches xdr_inline_decode() how to switch buffers when it hits the end of the current iovec and/or page. Cheers Trond ----------------------------------------------------------------------------------- From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Sat, 8 Jan 2011 17:45:38 -0500 Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir vm_map_ram() is not available on NOMMU platforms, and causes trouble on incoherrent architectures such as ARM when we access the page data through both the direct and the virtual mapping. The alternative is to use the direct mapping to access page data for the case when we are not crossing a page boundary, but to copy the data into a linear scratch buffer when we are accessing data that spans page boundaries. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 44 ++++++------- fs/nfs/nfs2xdr.c | 6 -- fs/nfs/nfs3xdr.c | 6 -- fs/nfs/nfs4xdr.c | 6 -- include/linux/sunrpc/xdr.h | 4 +- net/sunrpc/xdr.c | 155 +++++++++++++++++++++++++++++++++++--------- 6 files changed, 148 insertions(+), 73 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 996dd89..0108cf4 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -33,7 +33,6 @@ #include <linux/namei.h> #include <linux/mount.h> #include <linux/sched.h> -#include <linux/vmalloc.h> #include <linux/kmemleak.h> #include "delegation.h" @@ -459,25 +458,26 @@ out: /* Perform conversion from xdr to cache array */ static int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, - void *xdr_page, struct page *page, unsigned int buflen) + struct page **xdr_pages, struct page *page, unsigned int buflen) { struct xdr_stream stream; - struct xdr_buf buf; - __be32 *ptr = xdr_page; + struct xdr_buf buf = { + .pages = xdr_pages, + .page_len = buflen, + .buflen = buflen, + .len = buflen, + }; + struct page *scratch; struct nfs_cache_array *array; unsigned int count = 0; int status; - buf.head->iov_base = xdr_page; - buf.head->iov_len = buflen; - buf.tail->iov_len = 0; - buf.page_base = 0; - buf.page_len = 0; - buf.buflen = buf.head->iov_len; - buf.len = buf.head->iov_len; - - xdr_init_decode(&stream, &buf, ptr); + scratch = alloc_page(GFP_KERNEL); + if (scratch == NULL) + return -ENOMEM; + xdr_init_decode(&stream, &buf, NULL); + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE); do { status = xdr_decode(desc, entry, &stream); @@ -506,6 +506,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en } else status = PTR_ERR(array); } + + put_page(scratch); return status; } @@ -521,7 +523,6 @@ static void nfs_readdir_free_large_page(void *ptr, struct page **pages, unsigned int npages) { - vm_unmap_ram(ptr, npages); nfs_readdir_free_pagearray(pages, npages); } @@ -530,9 +531,8 @@ void nfs_readdir_free_large_page(void *ptr, struct page **pages, * to nfs_readdir_free_large_page */ static -void *nfs_readdir_large_page(struct page **pages, unsigned int npages) +int nfs_readdir_large_page(struct page **pages, unsigned int npages) { - void *ptr; unsigned int i; for (i = 0; i < npages; i++) { @@ -541,13 +541,11 @@ void *nfs_readdir_large_page(struct page **pages, unsigned int npages) goto out_freepages; pages[i] = page; } + return 0; - ptr = vm_map_ram(pages, npages, 0, PAGE_KERNEL); - if (!IS_ERR_OR_NULL(ptr)) - return ptr; out_freepages: nfs_readdir_free_pagearray(pages, i); - return NULL; + return -ENOMEM; } static @@ -577,8 +575,8 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, memset(array, 0, sizeof(struct nfs_cache_array)); array->eof_index = -1; - pages_ptr = nfs_readdir_large_page(pages, array_size); - if (!pages_ptr) + status = nfs_readdir_large_page(pages, array_size); + if (status < 0) goto out_release_array; do { unsigned int pglen; @@ -587,7 +585,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, if (status < 0) break; pglen = status; - status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); + status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen); if (status < 0) { if (status == -ENOSPC) status = 0; diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c index 5914a19..b382a1b 100644 --- a/fs/nfs/nfs2xdr.c +++ b/fs/nfs/nfs2xdr.c @@ -487,12 +487,6 @@ nfs_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_se entry->d_type = DT_UNKNOWN; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c index f6cc60f..ba91236 100644 --- a/fs/nfs/nfs3xdr.c +++ b/fs/nfs/nfs3xdr.c @@ -647,12 +647,6 @@ nfs3_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_s memset((u8*)(entry->fh), 0, sizeof(*entry->fh)); } - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c index 9f1826b..0662a98 100644 --- a/fs/nfs/nfs4xdr.c +++ b/fs/nfs/nfs4xdr.c @@ -6215,12 +6215,6 @@ __be32 *nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, if (verify_attr_len(xdr, p, len) < 0) goto out_overflow; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h index 498ab93..7783c68 100644 --- a/include/linux/sunrpc/xdr.h +++ b/include/linux/sunrpc/xdr.h @@ -201,6 +201,8 @@ struct xdr_stream { __be32 *end; /* end of available buffer space */ struct kvec *iov; /* pointer to the current kvec */ + struct kvec scratch; /* Scratch buffer */ + struct page **page_ptr; /* pointer to the current page */ }; extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); @@ -208,7 +210,7 @@ extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes); extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int base, unsigned int len); extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); -extern __be32 *xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes); +extern void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen); extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes); extern void xdr_read_pages(struct xdr_stream *xdr, unsigned int len); extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len); diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index cd9e841..679cd67 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -552,6 +552,74 @@ void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int b } EXPORT_SYMBOL_GPL(xdr_write_pages); +static void xdr_set_iov(struct xdr_stream *xdr, struct kvec *iov, + __be32 *p, unsigned int len) +{ + if (len > iov->iov_len) + len = iov->iov_len; + if (p == NULL) + p = (__be32*)iov->iov_base; + xdr->p = p; + xdr->end = (__be32*)(iov->iov_base + len); + xdr->iov = iov; + xdr->page_ptr = NULL; +} + +static int xdr_set_page_base(struct xdr_stream *xdr, + unsigned int base, unsigned int len) +{ + unsigned int pgnr; + unsigned int maxlen; + unsigned int pgoff; + unsigned int pgend; + void *kaddr; + + maxlen = xdr->buf->page_len; + if (base >= maxlen) + return -EINVAL; + maxlen -= base; + if (len > maxlen) + len = maxlen; + + base += xdr->buf->page_base; + + pgnr = base >> PAGE_SHIFT; + xdr->page_ptr = &xdr->buf->pages[pgnr]; + kaddr = page_address(*xdr->page_ptr); + + pgoff = base & ~PAGE_MASK; + xdr->p = (__be32*)(kaddr + pgoff); + + pgend = pgoff + len; + if (pgend > PAGE_SIZE) + pgend = PAGE_SIZE; + xdr->end = (__be32*)(kaddr + pgend); + xdr->iov = NULL; + return 0; +} + +static void xdr_set_next_page(struct xdr_stream *xdr) +{ + unsigned int newbase; + + newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT; + newbase -= xdr->buf->page_base; + + if (xdr_set_page_base(xdr, newbase, PAGE_SIZE) < 0) + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); +} + +static bool xdr_set_next_buffer(struct xdr_stream *xdr) +{ + if (xdr->page_ptr != NULL) + xdr_set_next_page(xdr); + else if (xdr->iov == xdr->buf->head) { + if (xdr_set_page_base(xdr, 0, PAGE_SIZE) < 0) + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); + } + return xdr->p != xdr->end; +} + /** * xdr_init_decode - Initialize an xdr_stream for decoding data. * @xdr: pointer to xdr_stream struct @@ -560,41 +628,67 @@ EXPORT_SYMBOL_GPL(xdr_write_pages); */ void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p) { - struct kvec *iov = buf->head; - unsigned int len = iov->iov_len; - - if (len > buf->len) - len = buf->len; xdr->buf = buf; - xdr->iov = iov; - xdr->p = p; - xdr->end = (__be32 *)((char *)iov->iov_base + len); + xdr->scratch.iov_base = NULL; + xdr->scratch.iov_len = 0; + if (buf->head[0].iov_len != 0) + xdr_set_iov(xdr, buf->head, p, buf->len); + else if (buf->page_len != 0) + xdr_set_page_base(xdr, 0, buf->len); } EXPORT_SYMBOL_GPL(xdr_init_decode); -/** - * xdr_inline_peek - Allow read-ahead in the XDR data stream - * @xdr: pointer to xdr_stream struct - * @nbytes: number of bytes of data to decode - * - * Check if the input buffer is long enough to enable us to decode - * 'nbytes' more bytes of data starting at the current position. - * If so return the current pointer without updating the current - * pointer position. - */ -__be32 * xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes) +static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { __be32 *p = xdr->p; __be32 *q = p + XDR_QUADLEN(nbytes); if (unlikely(q > xdr->end || q < p)) return NULL; + xdr->p = q; return p; } -EXPORT_SYMBOL_GPL(xdr_inline_peek); /** - * xdr_inline_decode - Retrieve non-page XDR data to decode + * xdr_set_scratch_buffer - Attach a scratch buffer for decoding data. + * @xdr: pointer to xdr_stream struct + * @buf: pointer to an empty buffer + * @buflen: size of 'buf' + * + * The scratch buffer is used when decoding from an array of pages. + * If an xdr_inline_decode() call spans across page boundaries, then + * we copy the data into the scratch buffer in order to allow linear + * access. + */ +void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen) +{ + xdr->scratch.iov_base = buf; + xdr->scratch.iov_len = buflen; +} +EXPORT_SYMBOL_GPL(xdr_set_scratch_buffer); + +static __be32 *xdr_copy_to_scratch(struct xdr_stream *xdr, size_t nbytes) +{ + __be32 *p; + void *cpdest = xdr->scratch.iov_base; + size_t cplen = (char *)xdr->end - (char *)xdr->p; + + if (nbytes > xdr->scratch.iov_len) + return NULL; + memcpy(cpdest, xdr->p, cplen); + cpdest += cplen; + nbytes -= cplen; + if (!xdr_set_next_buffer(xdr)) + return NULL; + p = __xdr_inline_decode(xdr, nbytes); + if (p == NULL) + return NULL; + memcpy(cpdest, p, nbytes); + return xdr->scratch.iov_base; +} + +/** + * xdr_inline_decode - Retrieve XDR data to decode * @xdr: pointer to xdr_stream struct * @nbytes: number of bytes of data to decode * @@ -605,13 +699,16 @@ EXPORT_SYMBOL_GPL(xdr_inline_peek); */ __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { - __be32 *p = xdr->p; - __be32 *q = p + XDR_QUADLEN(nbytes); + __be32 *p; - if (unlikely(q > xdr->end || q < p)) + if (nbytes == 0) + return xdr->p; + if (xdr->p == xdr->end && !xdr_set_next_buffer(xdr)) return NULL; - xdr->p = q; - return p; + p = __xdr_inline_decode(xdr, nbytes); + if (p != NULL) + return p; + return xdr_copy_to_scratch(xdr, nbytes); } EXPORT_SYMBOL_GPL(xdr_inline_decode); @@ -671,16 +768,12 @@ EXPORT_SYMBOL_GPL(xdr_read_pages); */ void xdr_enter_page(struct xdr_stream *xdr, unsigned int len) { - char * kaddr = page_address(xdr->buf->pages[0]); xdr_read_pages(xdr, len); /* * Position current pointer at beginning of tail, and * set remaining message length. */ - if (len > PAGE_CACHE_SIZE - xdr->buf->page_base) - len = PAGE_CACHE_SIZE - xdr->buf->page_base; - xdr->p = (__be32 *)(kaddr + xdr->buf->page_base); - xdr->end = (__be32 *)((char *)xdr->p + len); + xdr_set_page_base(xdr, 0, len); } EXPORT_SYMBOL_GPL(xdr_enter_page); -- 1.7.3.4 -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-08 23:15 ` Trond Myklebust @ 2011-01-08 23:15 ` Trond Myklebust [not found] ` <1294528551.4181.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 1 sibling, 0 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-08 23:15 UTC (permalink / raw) To: James Bottomley Cc: Russell King - ARM Linux, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Sat, 2011-01-08 at 11:49 -0500, Trond Myklebust wrote: > On Fri, 2011-01-07 at 13:11 -0600, James Bottomley wrote: > > On the other hand, the xdr routines, since they take the pages anyway, > > could use a scatterlist approach to writing through the kernel mapping > > instead of using vmap ... we have all the machinery for this in > > lib/scatterlist.c ... it's not designed for this case, since it's > > designed to allow arbitrary linear reads and writes on a block > > scatterlist, but the principle is the same ... it looks like it would be > > rather a big patch, though ... > > The following alternative seems to work for me, but has only been > lightly tested so far. It's a bit large for a stable patch, but not too > ungainly. > > It modifies xdr_stream, adding the ability to iterate through page data. > To avoid kmap()/kunmap(), it does require that pages be allocated in > lowmem, but since the only use case here is when using page arrays as > temporary buffers, that seems like an acceptable compromise. ...and here is an update which makes the whole process transparent to the decoder. It basically teaches xdr_inline_decode() how to switch buffers when it hits the end of the current iovec and/or page. Cheers Trond ----------------------------------------------------------------------------------- From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Sat, 8 Jan 2011 17:45:38 -0500 Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir vm_map_ram() is not available on NOMMU platforms, and causes trouble on incoherrent architectures such as ARM when we access the page data through both the direct and the virtual mapping. The alternative is to use the direct mapping to access page data for the case when we are not crossing a page boundary, but to copy the data into a linear scratch buffer when we are accessing data that spans page boundaries. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 44 ++++++------- fs/nfs/nfs2xdr.c | 6 -- fs/nfs/nfs3xdr.c | 6 -- fs/nfs/nfs4xdr.c | 6 -- include/linux/sunrpc/xdr.h | 4 +- net/sunrpc/xdr.c | 155 +++++++++++++++++++++++++++++++++++--------- 6 files changed, 148 insertions(+), 73 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 996dd89..0108cf4 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -33,7 +33,6 @@ #include <linux/namei.h> #include <linux/mount.h> #include <linux/sched.h> -#include <linux/vmalloc.h> #include <linux/kmemleak.h> #include "delegation.h" @@ -459,25 +458,26 @@ out: /* Perform conversion from xdr to cache array */ static int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, - void *xdr_page, struct page *page, unsigned int buflen) + struct page **xdr_pages, struct page *page, unsigned int buflen) { struct xdr_stream stream; - struct xdr_buf buf; - __be32 *ptr = xdr_page; + struct xdr_buf buf = { + .pages = xdr_pages, + .page_len = buflen, + .buflen = buflen, + .len = buflen, + }; + struct page *scratch; struct nfs_cache_array *array; unsigned int count = 0; int status; - buf.head->iov_base = xdr_page; - buf.head->iov_len = buflen; - buf.tail->iov_len = 0; - buf.page_base = 0; - buf.page_len = 0; - buf.buflen = buf.head->iov_len; - buf.len = buf.head->iov_len; - - xdr_init_decode(&stream, &buf, ptr); + scratch = alloc_page(GFP_KERNEL); + if (scratch == NULL) + return -ENOMEM; + xdr_init_decode(&stream, &buf, NULL); + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE); do { status = xdr_decode(desc, entry, &stream); @@ -506,6 +506,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en } else status = PTR_ERR(array); } + + put_page(scratch); return status; } @@ -521,7 +523,6 @@ static void nfs_readdir_free_large_page(void *ptr, struct page **pages, unsigned int npages) { - vm_unmap_ram(ptr, npages); nfs_readdir_free_pagearray(pages, npages); } @@ -530,9 +531,8 @@ void nfs_readdir_free_large_page(void *ptr, struct page **pages, * to nfs_readdir_free_large_page */ static -void *nfs_readdir_large_page(struct page **pages, unsigned int npages) +int nfs_readdir_large_page(struct page **pages, unsigned int npages) { - void *ptr; unsigned int i; for (i = 0; i < npages; i++) { @@ -541,13 +541,11 @@ void *nfs_readdir_large_page(struct page **pages, unsigned int npages) goto out_freepages; pages[i] = page; } + return 0; - ptr = vm_map_ram(pages, npages, 0, PAGE_KERNEL); - if (!IS_ERR_OR_NULL(ptr)) - return ptr; out_freepages: nfs_readdir_free_pagearray(pages, i); - return NULL; + return -ENOMEM; } static @@ -577,8 +575,8 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, memset(array, 0, sizeof(struct nfs_cache_array)); array->eof_index = -1; - pages_ptr = nfs_readdir_large_page(pages, array_size); - if (!pages_ptr) + status = nfs_readdir_large_page(pages, array_size); + if (status < 0) goto out_release_array; do { unsigned int pglen; @@ -587,7 +585,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, if (status < 0) break; pglen = status; - status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); + status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen); if (status < 0) { if (status == -ENOSPC) status = 0; diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c index 5914a19..b382a1b 100644 --- a/fs/nfs/nfs2xdr.c +++ b/fs/nfs/nfs2xdr.c @@ -487,12 +487,6 @@ nfs_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_se entry->d_type = DT_UNKNOWN; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c index f6cc60f..ba91236 100644 --- a/fs/nfs/nfs3xdr.c +++ b/fs/nfs/nfs3xdr.c @@ -647,12 +647,6 @@ nfs3_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_s memset((u8*)(entry->fh), 0, sizeof(*entry->fh)); } - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c index 9f1826b..0662a98 100644 --- a/fs/nfs/nfs4xdr.c +++ b/fs/nfs/nfs4xdr.c @@ -6215,12 +6215,6 @@ __be32 *nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, if (verify_attr_len(xdr, p, len) < 0) goto out_overflow; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h index 498ab93..7783c68 100644 --- a/include/linux/sunrpc/xdr.h +++ b/include/linux/sunrpc/xdr.h @@ -201,6 +201,8 @@ struct xdr_stream { __be32 *end; /* end of available buffer space */ struct kvec *iov; /* pointer to the current kvec */ + struct kvec scratch; /* Scratch buffer */ + struct page **page_ptr; /* pointer to the current page */ }; extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); @@ -208,7 +210,7 @@ extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes); extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int base, unsigned int len); extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); -extern __be32 *xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes); +extern void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen); extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes); extern void xdr_read_pages(struct xdr_stream *xdr, unsigned int len); extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len); diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index cd9e841..679cd67 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -552,6 +552,74 @@ void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int b } EXPORT_SYMBOL_GPL(xdr_write_pages); +static void xdr_set_iov(struct xdr_stream *xdr, struct kvec *iov, + __be32 *p, unsigned int len) +{ + if (len > iov->iov_len) + len = iov->iov_len; + if (p == NULL) + p = (__be32*)iov->iov_base; + xdr->p = p; + xdr->end = (__be32*)(iov->iov_base + len); + xdr->iov = iov; + xdr->page_ptr = NULL; +} + +static int xdr_set_page_base(struct xdr_stream *xdr, + unsigned int base, unsigned int len) +{ + unsigned int pgnr; + unsigned int maxlen; + unsigned int pgoff; + unsigned int pgend; + void *kaddr; + + maxlen = xdr->buf->page_len; + if (base >= maxlen) + return -EINVAL; + maxlen -= base; + if (len > maxlen) + len = maxlen; + + base += xdr->buf->page_base; + + pgnr = base >> PAGE_SHIFT; + xdr->page_ptr = &xdr->buf->pages[pgnr]; + kaddr = page_address(*xdr->page_ptr); + + pgoff = base & ~PAGE_MASK; + xdr->p = (__be32*)(kaddr + pgoff); + + pgend = pgoff + len; + if (pgend > PAGE_SIZE) + pgend = PAGE_SIZE; + xdr->end = (__be32*)(kaddr + pgend); + xdr->iov = NULL; + return 0; +} + +static void xdr_set_next_page(struct xdr_stream *xdr) +{ + unsigned int newbase; + + newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT; + newbase -= xdr->buf->page_base; + + if (xdr_set_page_base(xdr, newbase, PAGE_SIZE) < 0) + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); +} + +static bool xdr_set_next_buffer(struct xdr_stream *xdr) +{ + if (xdr->page_ptr != NULL) + xdr_set_next_page(xdr); + else if (xdr->iov == xdr->buf->head) { + if (xdr_set_page_base(xdr, 0, PAGE_SIZE) < 0) + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); + } + return xdr->p != xdr->end; +} + /** * xdr_init_decode - Initialize an xdr_stream for decoding data. * @xdr: pointer to xdr_stream struct @@ -560,41 +628,67 @@ EXPORT_SYMBOL_GPL(xdr_write_pages); */ void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p) { - struct kvec *iov = buf->head; - unsigned int len = iov->iov_len; - - if (len > buf->len) - len = buf->len; xdr->buf = buf; - xdr->iov = iov; - xdr->p = p; - xdr->end = (__be32 *)((char *)iov->iov_base + len); + xdr->scratch.iov_base = NULL; + xdr->scratch.iov_len = 0; + if (buf->head[0].iov_len != 0) + xdr_set_iov(xdr, buf->head, p, buf->len); + else if (buf->page_len != 0) + xdr_set_page_base(xdr, 0, buf->len); } EXPORT_SYMBOL_GPL(xdr_init_decode); -/** - * xdr_inline_peek - Allow read-ahead in the XDR data stream - * @xdr: pointer to xdr_stream struct - * @nbytes: number of bytes of data to decode - * - * Check if the input buffer is long enough to enable us to decode - * 'nbytes' more bytes of data starting at the current position. - * If so return the current pointer without updating the current - * pointer position. - */ -__be32 * xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes) +static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { __be32 *p = xdr->p; __be32 *q = p + XDR_QUADLEN(nbytes); if (unlikely(q > xdr->end || q < p)) return NULL; + xdr->p = q; return p; } -EXPORT_SYMBOL_GPL(xdr_inline_peek); /** - * xdr_inline_decode - Retrieve non-page XDR data to decode + * xdr_set_scratch_buffer - Attach a scratch buffer for decoding data. + * @xdr: pointer to xdr_stream struct + * @buf: pointer to an empty buffer + * @buflen: size of 'buf' + * + * The scratch buffer is used when decoding from an array of pages. + * If an xdr_inline_decode() call spans across page boundaries, then + * we copy the data into the scratch buffer in order to allow linear + * access. + */ +void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen) +{ + xdr->scratch.iov_base = buf; + xdr->scratch.iov_len = buflen; +} +EXPORT_SYMBOL_GPL(xdr_set_scratch_buffer); + +static __be32 *xdr_copy_to_scratch(struct xdr_stream *xdr, size_t nbytes) +{ + __be32 *p; + void *cpdest = xdr->scratch.iov_base; + size_t cplen = (char *)xdr->end - (char *)xdr->p; + + if (nbytes > xdr->scratch.iov_len) + return NULL; + memcpy(cpdest, xdr->p, cplen); + cpdest += cplen; + nbytes -= cplen; + if (!xdr_set_next_buffer(xdr)) + return NULL; + p = __xdr_inline_decode(xdr, nbytes); + if (p == NULL) + return NULL; + memcpy(cpdest, p, nbytes); + return xdr->scratch.iov_base; +} + +/** + * xdr_inline_decode - Retrieve XDR data to decode * @xdr: pointer to xdr_stream struct * @nbytes: number of bytes of data to decode * @@ -605,13 +699,16 @@ EXPORT_SYMBOL_GPL(xdr_inline_peek); */ __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { - __be32 *p = xdr->p; - __be32 *q = p + XDR_QUADLEN(nbytes); + __be32 *p; - if (unlikely(q > xdr->end || q < p)) + if (nbytes == 0) + return xdr->p; + if (xdr->p == xdr->end && !xdr_set_next_buffer(xdr)) return NULL; - xdr->p = q; - return p; + p = __xdr_inline_decode(xdr, nbytes); + if (p != NULL) + return p; + return xdr_copy_to_scratch(xdr, nbytes); } EXPORT_SYMBOL_GPL(xdr_inline_decode); @@ -671,16 +768,12 @@ EXPORT_SYMBOL_GPL(xdr_read_pages); */ void xdr_enter_page(struct xdr_stream *xdr, unsigned int len) { - char * kaddr = page_address(xdr->buf->pages[0]); xdr_read_pages(xdr, len); /* * Position current pointer at beginning of tail, and * set remaining message length. */ - if (len > PAGE_CACHE_SIZE - xdr->buf->page_base) - len = PAGE_CACHE_SIZE - xdr->buf->page_base; - xdr->p = (__be32 *)(kaddr + xdr->buf->page_base); - xdr->end = (__be32 *)((char *)xdr->p + len); + xdr_set_page_base(xdr, 0, len); } EXPORT_SYMBOL_GPL(xdr_enter_page); -- 1.7.3.4 -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply related [flat|nested] 71+ messages in thread
[parent not found: <1294528551.4181.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <1294528551.4181.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2011-01-10 10:50 ` Uwe Kleine-König 2011-01-10 10:50 ` Uwe Kleine-König 2011-01-10 16:25 ` Trond Myklebust 2011-01-10 12:44 ` Marc Kleine-Budde 1 sibling, 2 replies; 71+ messages in thread From: Uwe Kleine-König @ 2011-01-10 10:50 UTC (permalink / raw) To: Trond Myklebust Cc: James Bottomley, Russell King - ARM Linux, Linus Torvalds, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA Hi Trond, On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > ----------------------------------------------------------------------------------- It would be great if you could add a "8<" in the line above next time. Then git-am -c does the right thing (at least I think so). > From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001 > From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > Date: Sat, 8 Jan 2011 17:45:38 -0500 > Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir > > vm_map_ram() is not available on NOMMU platforms, and causes trouble > on incoherrent architectures such as ARM when we access the page data > through both the direct and the virtual mapping. > > The alternative is to use the direct mapping to access page data > for the case when we are not crossing a page boundary, but to copy > the data into a linear scratch buffer when we are accessing data > that spans page boundaries. > > Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Tested-by: Uwe Kleine-König <u.kleine-koenig-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org> on tx28. Thanks Uwe > --- > fs/nfs/dir.c | 44 ++++++------- > fs/nfs/nfs2xdr.c | 6 -- > fs/nfs/nfs3xdr.c | 6 -- > fs/nfs/nfs4xdr.c | 6 -- > include/linux/sunrpc/xdr.h | 4 +- > net/sunrpc/xdr.c | 155 +++++++++++++++++++++++++++++++++++--------- > 6 files changed, 148 insertions(+), 73 deletions(-) > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index 996dd89..0108cf4 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -33,7 +33,6 @@ > #include <linux/namei.h> > #include <linux/mount.h> > #include <linux/sched.h> > -#include <linux/vmalloc.h> > #include <linux/kmemleak.h> > > #include "delegation.h" > @@ -459,25 +458,26 @@ out: > /* Perform conversion from xdr to cache array */ > static > int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, > - void *xdr_page, struct page *page, unsigned int buflen) > + struct page **xdr_pages, struct page *page, unsigned int buflen) > { > struct xdr_stream stream; > - struct xdr_buf buf; > - __be32 *ptr = xdr_page; > + struct xdr_buf buf = { > + .pages = xdr_pages, > + .page_len = buflen, > + .buflen = buflen, > + .len = buflen, > + }; > + struct page *scratch; > struct nfs_cache_array *array; > unsigned int count = 0; > int status; > > - buf.head->iov_base = xdr_page; > - buf.head->iov_len = buflen; > - buf.tail->iov_len = 0; > - buf.page_base = 0; > - buf.page_len = 0; > - buf.buflen = buf.head->iov_len; > - buf.len = buf.head->iov_len; > - > - xdr_init_decode(&stream, &buf, ptr); > + scratch = alloc_page(GFP_KERNEL); > + if (scratch == NULL) > + return -ENOMEM; > > + xdr_init_decode(&stream, &buf, NULL); > + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE); > > do { > status = xdr_decode(desc, entry, &stream); > @@ -506,6 +506,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en > } else > status = PTR_ERR(array); > } > + > + put_page(scratch); > return status; > } > > @@ -521,7 +523,6 @@ static > void nfs_readdir_free_large_page(void *ptr, struct page **pages, > unsigned int npages) > { > - vm_unmap_ram(ptr, npages); > nfs_readdir_free_pagearray(pages, npages); > } > > @@ -530,9 +531,8 @@ void nfs_readdir_free_large_page(void *ptr, struct page **pages, > * to nfs_readdir_free_large_page > */ > static > -void *nfs_readdir_large_page(struct page **pages, unsigned int npages) > +int nfs_readdir_large_page(struct page **pages, unsigned int npages) > { > - void *ptr; > unsigned int i; > > for (i = 0; i < npages; i++) { > @@ -541,13 +541,11 @@ void *nfs_readdir_large_page(struct page **pages, unsigned int npages) > goto out_freepages; > pages[i] = page; > } > + return 0; > > - ptr = vm_map_ram(pages, npages, 0, PAGE_KERNEL); > - if (!IS_ERR_OR_NULL(ptr)) > - return ptr; > out_freepages: > nfs_readdir_free_pagearray(pages, i); > - return NULL; > + return -ENOMEM; > } > > static > @@ -577,8 +575,8 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > memset(array, 0, sizeof(struct nfs_cache_array)); > array->eof_index = -1; > > - pages_ptr = nfs_readdir_large_page(pages, array_size); > - if (!pages_ptr) > + status = nfs_readdir_large_page(pages, array_size); > + if (status < 0) > goto out_release_array; > do { > unsigned int pglen; > @@ -587,7 +585,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > if (status < 0) > break; > pglen = status; > - status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); > + status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen); > if (status < 0) { > if (status == -ENOSPC) > status = 0; > diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c > index 5914a19..b382a1b 100644 > --- a/fs/nfs/nfs2xdr.c > +++ b/fs/nfs/nfs2xdr.c > @@ -487,12 +487,6 @@ nfs_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_se > > entry->d_type = DT_UNKNOWN; > > - p = xdr_inline_peek(xdr, 8); > - if (p != NULL) > - entry->eof = !p[0] && p[1]; > - else > - entry->eof = 0; > - > return p; > > out_overflow: > diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c > index f6cc60f..ba91236 100644 > --- a/fs/nfs/nfs3xdr.c > +++ b/fs/nfs/nfs3xdr.c > @@ -647,12 +647,6 @@ nfs3_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_s > memset((u8*)(entry->fh), 0, sizeof(*entry->fh)); > } > > - p = xdr_inline_peek(xdr, 8); > - if (p != NULL) > - entry->eof = !p[0] && p[1]; > - else > - entry->eof = 0; > - > return p; > > out_overflow: > diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c > index 9f1826b..0662a98 100644 > --- a/fs/nfs/nfs4xdr.c > +++ b/fs/nfs/nfs4xdr.c > @@ -6215,12 +6215,6 @@ __be32 *nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, > if (verify_attr_len(xdr, p, len) < 0) > goto out_overflow; > > - p = xdr_inline_peek(xdr, 8); > - if (p != NULL) > - entry->eof = !p[0] && p[1]; > - else > - entry->eof = 0; > - > return p; > > out_overflow: > diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h > index 498ab93..7783c68 100644 > --- a/include/linux/sunrpc/xdr.h > +++ b/include/linux/sunrpc/xdr.h > @@ -201,6 +201,8 @@ struct xdr_stream { > > __be32 *end; /* end of available buffer space */ > struct kvec *iov; /* pointer to the current kvec */ > + struct kvec scratch; /* Scratch buffer */ > + struct page **page_ptr; /* pointer to the current page */ > }; > > extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); > @@ -208,7 +210,7 @@ extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes); > extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, > unsigned int base, unsigned int len); > extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); > -extern __be32 *xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes); > +extern void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen); > extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes); > extern void xdr_read_pages(struct xdr_stream *xdr, unsigned int len); > extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len); > diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c > index cd9e841..679cd67 100644 > --- a/net/sunrpc/xdr.c > +++ b/net/sunrpc/xdr.c > @@ -552,6 +552,74 @@ void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int b > } > EXPORT_SYMBOL_GPL(xdr_write_pages); > > +static void xdr_set_iov(struct xdr_stream *xdr, struct kvec *iov, > + __be32 *p, unsigned int len) > +{ > + if (len > iov->iov_len) > + len = iov->iov_len; > + if (p == NULL) > + p = (__be32*)iov->iov_base; > + xdr->p = p; > + xdr->end = (__be32*)(iov->iov_base + len); > + xdr->iov = iov; > + xdr->page_ptr = NULL; > +} > + > +static int xdr_set_page_base(struct xdr_stream *xdr, > + unsigned int base, unsigned int len) > +{ > + unsigned int pgnr; > + unsigned int maxlen; > + unsigned int pgoff; > + unsigned int pgend; > + void *kaddr; > + > + maxlen = xdr->buf->page_len; > + if (base >= maxlen) > + return -EINVAL; > + maxlen -= base; > + if (len > maxlen) > + len = maxlen; > + > + base += xdr->buf->page_base; > + > + pgnr = base >> PAGE_SHIFT; > + xdr->page_ptr = &xdr->buf->pages[pgnr]; > + kaddr = page_address(*xdr->page_ptr); > + > + pgoff = base & ~PAGE_MASK; > + xdr->p = (__be32*)(kaddr + pgoff); > + > + pgend = pgoff + len; > + if (pgend > PAGE_SIZE) > + pgend = PAGE_SIZE; > + xdr->end = (__be32*)(kaddr + pgend); > + xdr->iov = NULL; > + return 0; > +} > + > +static void xdr_set_next_page(struct xdr_stream *xdr) > +{ > + unsigned int newbase; > + > + newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT; > + newbase -= xdr->buf->page_base; > + > + if (xdr_set_page_base(xdr, newbase, PAGE_SIZE) < 0) > + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); > +} > + > +static bool xdr_set_next_buffer(struct xdr_stream *xdr) > +{ > + if (xdr->page_ptr != NULL) > + xdr_set_next_page(xdr); > + else if (xdr->iov == xdr->buf->head) { > + if (xdr_set_page_base(xdr, 0, PAGE_SIZE) < 0) > + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); > + } > + return xdr->p != xdr->end; > +} > + > /** > * xdr_init_decode - Initialize an xdr_stream for decoding data. > * @xdr: pointer to xdr_stream struct > @@ -560,41 +628,67 @@ EXPORT_SYMBOL_GPL(xdr_write_pages); > */ > void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p) > { > - struct kvec *iov = buf->head; > - unsigned int len = iov->iov_len; > - > - if (len > buf->len) > - len = buf->len; > xdr->buf = buf; > - xdr->iov = iov; > - xdr->p = p; > - xdr->end = (__be32 *)((char *)iov->iov_base + len); > + xdr->scratch.iov_base = NULL; > + xdr->scratch.iov_len = 0; > + if (buf->head[0].iov_len != 0) > + xdr_set_iov(xdr, buf->head, p, buf->len); > + else if (buf->page_len != 0) > + xdr_set_page_base(xdr, 0, buf->len); > } > EXPORT_SYMBOL_GPL(xdr_init_decode); > > -/** > - * xdr_inline_peek - Allow read-ahead in the XDR data stream > - * @xdr: pointer to xdr_stream struct > - * @nbytes: number of bytes of data to decode > - * > - * Check if the input buffer is long enough to enable us to decode > - * 'nbytes' more bytes of data starting at the current position. > - * If so return the current pointer without updating the current > - * pointer position. > - */ > -__be32 * xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes) > +static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) > { > __be32 *p = xdr->p; > __be32 *q = p + XDR_QUADLEN(nbytes); > > if (unlikely(q > xdr->end || q < p)) > return NULL; > + xdr->p = q; > return p; > } > -EXPORT_SYMBOL_GPL(xdr_inline_peek); > > /** > - * xdr_inline_decode - Retrieve non-page XDR data to decode > + * xdr_set_scratch_buffer - Attach a scratch buffer for decoding data. > + * @xdr: pointer to xdr_stream struct > + * @buf: pointer to an empty buffer > + * @buflen: size of 'buf' > + * > + * The scratch buffer is used when decoding from an array of pages. > + * If an xdr_inline_decode() call spans across page boundaries, then > + * we copy the data into the scratch buffer in order to allow linear > + * access. > + */ > +void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen) > +{ > + xdr->scratch.iov_base = buf; > + xdr->scratch.iov_len = buflen; > +} > +EXPORT_SYMBOL_GPL(xdr_set_scratch_buffer); > + > +static __be32 *xdr_copy_to_scratch(struct xdr_stream *xdr, size_t nbytes) > +{ > + __be32 *p; > + void *cpdest = xdr->scratch.iov_base; > + size_t cplen = (char *)xdr->end - (char *)xdr->p; > + > + if (nbytes > xdr->scratch.iov_len) > + return NULL; > + memcpy(cpdest, xdr->p, cplen); > + cpdest += cplen; > + nbytes -= cplen; > + if (!xdr_set_next_buffer(xdr)) > + return NULL; > + p = __xdr_inline_decode(xdr, nbytes); > + if (p == NULL) > + return NULL; > + memcpy(cpdest, p, nbytes); > + return xdr->scratch.iov_base; > +} > + > +/** > + * xdr_inline_decode - Retrieve XDR data to decode > * @xdr: pointer to xdr_stream struct > * @nbytes: number of bytes of data to decode > * > @@ -605,13 +699,16 @@ EXPORT_SYMBOL_GPL(xdr_inline_peek); > */ > __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) > { > - __be32 *p = xdr->p; > - __be32 *q = p + XDR_QUADLEN(nbytes); > + __be32 *p; > > - if (unlikely(q > xdr->end || q < p)) > + if (nbytes == 0) > + return xdr->p; > + if (xdr->p == xdr->end && !xdr_set_next_buffer(xdr)) > return NULL; > - xdr->p = q; > - return p; > + p = __xdr_inline_decode(xdr, nbytes); > + if (p != NULL) > + return p; > + return xdr_copy_to_scratch(xdr, nbytes); > } > EXPORT_SYMBOL_GPL(xdr_inline_decode); > > @@ -671,16 +768,12 @@ EXPORT_SYMBOL_GPL(xdr_read_pages); > */ > void xdr_enter_page(struct xdr_stream *xdr, unsigned int len) > { > - char * kaddr = page_address(xdr->buf->pages[0]); > xdr_read_pages(xdr, len); > /* > * Position current pointer at beginning of tail, and > * set remaining message length. > */ > - if (len > PAGE_CACHE_SIZE - xdr->buf->page_base) > - len = PAGE_CACHE_SIZE - xdr->buf->page_base; > - xdr->p = (__be32 *)(kaddr + xdr->buf->page_base); > - xdr->end = (__be32 *)((char *)xdr->p + len); > + xdr_set_page_base(xdr, 0, len); > } > EXPORT_SYMBOL_GPL(xdr_enter_page); > > -- > 1.7.3.4 > > > > -- > Trond Myklebust > Linux NFS client maintainer > > NetApp > Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org > www.netapp.com > > -- Pengutronix e.K. | Uwe Kleine-König | Industrial Linux Solutions | http://www.pengutronix.de/ | -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 10:50 ` Uwe Kleine-König @ 2011-01-10 10:50 ` Uwe Kleine-König 2011-01-10 16:25 ` Trond Myklebust 1 sibling, 0 replies; 71+ messages in thread From: Uwe Kleine-König @ 2011-01-10 10:50 UTC (permalink / raw) To: Trond Myklebust Cc: James Bottomley, Russell King - ARM Linux, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch Hi Trond, On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > ----------------------------------------------------------------------------------- It would be great if you could add a "8<" in the line above next time. Then git-am -c does the right thing (at least I think so). > From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001 > From: Trond Myklebust <Trond.Myklebust@netapp.com> > Date: Sat, 8 Jan 2011 17:45:38 -0500 > Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir > > vm_map_ram() is not available on NOMMU platforms, and causes trouble > on incoherrent architectures such as ARM when we access the page data > through both the direct and the virtual mapping. > > The alternative is to use the direct mapping to access page data > for the case when we are not crossing a page boundary, but to copy > the data into a linear scratch buffer when we are accessing data > that spans page boundaries. > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> on tx28. Thanks Uwe > --- > fs/nfs/dir.c | 44 ++++++------- > fs/nfs/nfs2xdr.c | 6 -- > fs/nfs/nfs3xdr.c | 6 -- > fs/nfs/nfs4xdr.c | 6 -- > include/linux/sunrpc/xdr.h | 4 +- > net/sunrpc/xdr.c | 155 +++++++++++++++++++++++++++++++++++--------- > 6 files changed, 148 insertions(+), 73 deletions(-) > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index 996dd89..0108cf4 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -33,7 +33,6 @@ > #include <linux/namei.h> > #include <linux/mount.h> > #include <linux/sched.h> > -#include <linux/vmalloc.h> > #include <linux/kmemleak.h> > > #include "delegation.h" > @@ -459,25 +458,26 @@ out: > /* Perform conversion from xdr to cache array */ > static > int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, > - void *xdr_page, struct page *page, unsigned int buflen) > + struct page **xdr_pages, struct page *page, unsigned int buflen) > { > struct xdr_stream stream; > - struct xdr_buf buf; > - __be32 *ptr = xdr_page; > + struct xdr_buf buf = { > + .pages = xdr_pages, > + .page_len = buflen, > + .buflen = buflen, > + .len = buflen, > + }; > + struct page *scratch; > struct nfs_cache_array *array; > unsigned int count = 0; > int status; > > - buf.head->iov_base = xdr_page; > - buf.head->iov_len = buflen; > - buf.tail->iov_len = 0; > - buf.page_base = 0; > - buf.page_len = 0; > - buf.buflen = buf.head->iov_len; > - buf.len = buf.head->iov_len; > - > - xdr_init_decode(&stream, &buf, ptr); > + scratch = alloc_page(GFP_KERNEL); > + if (scratch == NULL) > + return -ENOMEM; > > + xdr_init_decode(&stream, &buf, NULL); > + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE); > > do { > status = xdr_decode(desc, entry, &stream); > @@ -506,6 +506,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en > } else > status = PTR_ERR(array); > } > + > + put_page(scratch); > return status; > } > > @@ -521,7 +523,6 @@ static > void nfs_readdir_free_large_page(void *ptr, struct page **pages, > unsigned int npages) > { > - vm_unmap_ram(ptr, npages); > nfs_readdir_free_pagearray(pages, npages); > } > > @@ -530,9 +531,8 @@ void nfs_readdir_free_large_page(void *ptr, struct page **pages, > * to nfs_readdir_free_large_page > */ > static > -void *nfs_readdir_large_page(struct page **pages, unsigned int npages) > +int nfs_readdir_large_page(struct page **pages, unsigned int npages) > { > - void *ptr; > unsigned int i; > > for (i = 0; i < npages; i++) { > @@ -541,13 +541,11 @@ void *nfs_readdir_large_page(struct page **pages, unsigned int npages) > goto out_freepages; > pages[i] = page; > } > + return 0; > > - ptr = vm_map_ram(pages, npages, 0, PAGE_KERNEL); > - if (!IS_ERR_OR_NULL(ptr)) > - return ptr; > out_freepages: > nfs_readdir_free_pagearray(pages, i); > - return NULL; > + return -ENOMEM; > } > > static > @@ -577,8 +575,8 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > memset(array, 0, sizeof(struct nfs_cache_array)); > array->eof_index = -1; > > - pages_ptr = nfs_readdir_large_page(pages, array_size); > - if (!pages_ptr) > + status = nfs_readdir_large_page(pages, array_size); > + if (status < 0) > goto out_release_array; > do { > unsigned int pglen; > @@ -587,7 +585,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > if (status < 0) > break; > pglen = status; > - status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); > + status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen); > if (status < 0) { > if (status == -ENOSPC) > status = 0; > diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c > index 5914a19..b382a1b 100644 > --- a/fs/nfs/nfs2xdr.c > +++ b/fs/nfs/nfs2xdr.c > @@ -487,12 +487,6 @@ nfs_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_se > > entry->d_type = DT_UNKNOWN; > > - p = xdr_inline_peek(xdr, 8); > - if (p != NULL) > - entry->eof = !p[0] && p[1]; > - else > - entry->eof = 0; > - > return p; > > out_overflow: > diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c > index f6cc60f..ba91236 100644 > --- a/fs/nfs/nfs3xdr.c > +++ b/fs/nfs/nfs3xdr.c > @@ -647,12 +647,6 @@ nfs3_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_s > memset((u8*)(entry->fh), 0, sizeof(*entry->fh)); > } > > - p = xdr_inline_peek(xdr, 8); > - if (p != NULL) > - entry->eof = !p[0] && p[1]; > - else > - entry->eof = 0; > - > return p; > > out_overflow: > diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c > index 9f1826b..0662a98 100644 > --- a/fs/nfs/nfs4xdr.c > +++ b/fs/nfs/nfs4xdr.c > @@ -6215,12 +6215,6 @@ __be32 *nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, > if (verify_attr_len(xdr, p, len) < 0) > goto out_overflow; > > - p = xdr_inline_peek(xdr, 8); > - if (p != NULL) > - entry->eof = !p[0] && p[1]; > - else > - entry->eof = 0; > - > return p; > > out_overflow: > diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h > index 498ab93..7783c68 100644 > --- a/include/linux/sunrpc/xdr.h > +++ b/include/linux/sunrpc/xdr.h > @@ -201,6 +201,8 @@ struct xdr_stream { > > __be32 *end; /* end of available buffer space */ > struct kvec *iov; /* pointer to the current kvec */ > + struct kvec scratch; /* Scratch buffer */ > + struct page **page_ptr; /* pointer to the current page */ > }; > > extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); > @@ -208,7 +210,7 @@ extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes); > extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, > unsigned int base, unsigned int len); > extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); > -extern __be32 *xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes); > +extern void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen); > extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes); > extern void xdr_read_pages(struct xdr_stream *xdr, unsigned int len); > extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len); > diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c > index cd9e841..679cd67 100644 > --- a/net/sunrpc/xdr.c > +++ b/net/sunrpc/xdr.c > @@ -552,6 +552,74 @@ void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int b > } > EXPORT_SYMBOL_GPL(xdr_write_pages); > > +static void xdr_set_iov(struct xdr_stream *xdr, struct kvec *iov, > + __be32 *p, unsigned int len) > +{ > + if (len > iov->iov_len) > + len = iov->iov_len; > + if (p == NULL) > + p = (__be32*)iov->iov_base; > + xdr->p = p; > + xdr->end = (__be32*)(iov->iov_base + len); > + xdr->iov = iov; > + xdr->page_ptr = NULL; > +} > + > +static int xdr_set_page_base(struct xdr_stream *xdr, > + unsigned int base, unsigned int len) > +{ > + unsigned int pgnr; > + unsigned int maxlen; > + unsigned int pgoff; > + unsigned int pgend; > + void *kaddr; > + > + maxlen = xdr->buf->page_len; > + if (base >= maxlen) > + return -EINVAL; > + maxlen -= base; > + if (len > maxlen) > + len = maxlen; > + > + base += xdr->buf->page_base; > + > + pgnr = base >> PAGE_SHIFT; > + xdr->page_ptr = &xdr->buf->pages[pgnr]; > + kaddr = page_address(*xdr->page_ptr); > + > + pgoff = base & ~PAGE_MASK; > + xdr->p = (__be32*)(kaddr + pgoff); > + > + pgend = pgoff + len; > + if (pgend > PAGE_SIZE) > + pgend = PAGE_SIZE; > + xdr->end = (__be32*)(kaddr + pgend); > + xdr->iov = NULL; > + return 0; > +} > + > +static void xdr_set_next_page(struct xdr_stream *xdr) > +{ > + unsigned int newbase; > + > + newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT; > + newbase -= xdr->buf->page_base; > + > + if (xdr_set_page_base(xdr, newbase, PAGE_SIZE) < 0) > + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); > +} > + > +static bool xdr_set_next_buffer(struct xdr_stream *xdr) > +{ > + if (xdr->page_ptr != NULL) > + xdr_set_next_page(xdr); > + else if (xdr->iov == xdr->buf->head) { > + if (xdr_set_page_base(xdr, 0, PAGE_SIZE) < 0) > + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); > + } > + return xdr->p != xdr->end; > +} > + > /** > * xdr_init_decode - Initialize an xdr_stream for decoding data. > * @xdr: pointer to xdr_stream struct > @@ -560,41 +628,67 @@ EXPORT_SYMBOL_GPL(xdr_write_pages); > */ > void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p) > { > - struct kvec *iov = buf->head; > - unsigned int len = iov->iov_len; > - > - if (len > buf->len) > - len = buf->len; > xdr->buf = buf; > - xdr->iov = iov; > - xdr->p = p; > - xdr->end = (__be32 *)((char *)iov->iov_base + len); > + xdr->scratch.iov_base = NULL; > + xdr->scratch.iov_len = 0; > + if (buf->head[0].iov_len != 0) > + xdr_set_iov(xdr, buf->head, p, buf->len); > + else if (buf->page_len != 0) > + xdr_set_page_base(xdr, 0, buf->len); > } > EXPORT_SYMBOL_GPL(xdr_init_decode); > > -/** > - * xdr_inline_peek - Allow read-ahead in the XDR data stream > - * @xdr: pointer to xdr_stream struct > - * @nbytes: number of bytes of data to decode > - * > - * Check if the input buffer is long enough to enable us to decode > - * 'nbytes' more bytes of data starting at the current position. > - * If so return the current pointer without updating the current > - * pointer position. > - */ > -__be32 * xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes) > +static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) > { > __be32 *p = xdr->p; > __be32 *q = p + XDR_QUADLEN(nbytes); > > if (unlikely(q > xdr->end || q < p)) > return NULL; > + xdr->p = q; > return p; > } > -EXPORT_SYMBOL_GPL(xdr_inline_peek); > > /** > - * xdr_inline_decode - Retrieve non-page XDR data to decode > + * xdr_set_scratch_buffer - Attach a scratch buffer for decoding data. > + * @xdr: pointer to xdr_stream struct > + * @buf: pointer to an empty buffer > + * @buflen: size of 'buf' > + * > + * The scratch buffer is used when decoding from an array of pages. > + * If an xdr_inline_decode() call spans across page boundaries, then > + * we copy the data into the scratch buffer in order to allow linear > + * access. > + */ > +void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen) > +{ > + xdr->scratch.iov_base = buf; > + xdr->scratch.iov_len = buflen; > +} > +EXPORT_SYMBOL_GPL(xdr_set_scratch_buffer); > + > +static __be32 *xdr_copy_to_scratch(struct xdr_stream *xdr, size_t nbytes) > +{ > + __be32 *p; > + void *cpdest = xdr->scratch.iov_base; > + size_t cplen = (char *)xdr->end - (char *)xdr->p; > + > + if (nbytes > xdr->scratch.iov_len) > + return NULL; > + memcpy(cpdest, xdr->p, cplen); > + cpdest += cplen; > + nbytes -= cplen; > + if (!xdr_set_next_buffer(xdr)) > + return NULL; > + p = __xdr_inline_decode(xdr, nbytes); > + if (p == NULL) > + return NULL; > + memcpy(cpdest, p, nbytes); > + return xdr->scratch.iov_base; > +} > + > +/** > + * xdr_inline_decode - Retrieve XDR data to decode > * @xdr: pointer to xdr_stream struct > * @nbytes: number of bytes of data to decode > * > @@ -605,13 +699,16 @@ EXPORT_SYMBOL_GPL(xdr_inline_peek); > */ > __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) > { > - __be32 *p = xdr->p; > - __be32 *q = p + XDR_QUADLEN(nbytes); > + __be32 *p; > > - if (unlikely(q > xdr->end || q < p)) > + if (nbytes == 0) > + return xdr->p; > + if (xdr->p == xdr->end && !xdr_set_next_buffer(xdr)) > return NULL; > - xdr->p = q; > - return p; > + p = __xdr_inline_decode(xdr, nbytes); > + if (p != NULL) > + return p; > + return xdr_copy_to_scratch(xdr, nbytes); > } > EXPORT_SYMBOL_GPL(xdr_inline_decode); > > @@ -671,16 +768,12 @@ EXPORT_SYMBOL_GPL(xdr_read_pages); > */ > void xdr_enter_page(struct xdr_stream *xdr, unsigned int len) > { > - char * kaddr = page_address(xdr->buf->pages[0]); > xdr_read_pages(xdr, len); > /* > * Position current pointer at beginning of tail, and > * set remaining message length. > */ > - if (len > PAGE_CACHE_SIZE - xdr->buf->page_base) > - len = PAGE_CACHE_SIZE - xdr->buf->page_base; > - xdr->p = (__be32 *)(kaddr + xdr->buf->page_base); > - xdr->end = (__be32 *)((char *)xdr->p + len); > + xdr_set_page_base(xdr, 0, len); > } > EXPORT_SYMBOL_GPL(xdr_enter_page); > > -- > 1.7.3.4 > > > > -- > Trond Myklebust > Linux NFS client maintainer > > NetApp > Trond.Myklebust@netapp.com > www.netapp.com > > -- Pengutronix e.K. | Uwe Kleine-König | Industrial Linux Solutions | http://www.pengutronix.de/ | ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 10:50 ` Uwe Kleine-König 2011-01-10 10:50 ` Uwe Kleine-König @ 2011-01-10 16:25 ` Trond Myklebust [not found] ` <1294676734.3349.10.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 1 sibling, 1 reply; 71+ messages in thread From: Trond Myklebust @ 2011-01-10 16:25 UTC (permalink / raw) To: Uwe Kleine-König Cc: James Bottomley, Russell King - ARM Linux, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote: > Hi Trond, > > On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > > ----------------------------------------------------------------------------------- > It would be great if you could add a "8<" in the line above next time. > Then git-am -c does the right thing (at least I think so). Sorry. I wasn't aware of that particular idiom. So something like 8<------------------------------------------------------------ From: Trond Myklebust <Trond.Myklebust@netapp.com> Subject: ..... -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1294676734.3349.10.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <1294676734.3349.10.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2011-01-10 17:08 ` Marc Kleine-Budde 2011-01-10 17:08 ` Marc Kleine-Budde 2011-01-10 17:20 ` Trond Myklebust 0 siblings, 2 replies; 71+ messages in thread From: Marc Kleine-Budde @ 2011-01-10 17:08 UTC (permalink / raw) To: Trond Myklebust Cc: Uwe Kleine-König, James Bottomley, Russell King - ARM Linux, Linus Torvalds, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 1141 bytes --] On 01/10/2011 05:25 PM, Trond Myklebust wrote: > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote: >> Hi Trond, >> >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: >>> ----------------------------------------------------------------------------------- >> It would be great if you could add a "8<" in the line above next time. >> Then git-am -c does the right thing (at least I think so). > > Sorry. I wasn't aware of that particular idiom. So something like > > 8<------------------------------------------------------------ > From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > Subject: ..... Yes. From "man git-mailinfo": "A line that mainly consists of scissors (either ">8" or "8<") and perforation (dash "-")" BTW: Is this patch a candidate for stable? regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 17:08 ` Marc Kleine-Budde @ 2011-01-10 17:08 ` Marc Kleine-Budde 2011-01-10 17:20 ` Trond Myklebust 1 sibling, 0 replies; 71+ messages in thread From: Marc Kleine-Budde @ 2011-01-10 17:08 UTC (permalink / raw) To: Trond Myklebust Cc: Uwe Kleine-König, James Bottomley, Russell King - ARM Linux, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch [-- Attachment #1: Type: text/plain, Size: 1112 bytes --] On 01/10/2011 05:25 PM, Trond Myklebust wrote: > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote: >> Hi Trond, >> >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: >>> ----------------------------------------------------------------------------------- >> It would be great if you could add a "8<" in the line above next time. >> Then git-am -c does the right thing (at least I think so). > > Sorry. I wasn't aware of that particular idiom. So something like > > 8<------------------------------------------------------------ > From: Trond Myklebust <Trond.Myklebust@netapp.com> > Subject: ..... Yes. From "man git-mailinfo": "A line that mainly consists of scissors (either ">8" or "8<") and perforation (dash "-")" BTW: Is this patch a candidate for stable? regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 17:08 ` Marc Kleine-Budde 2011-01-10 17:08 ` Marc Kleine-Budde @ 2011-01-10 17:20 ` Trond Myklebust [not found] ` <1294680035.3349.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2011-01-10 19:25 ` Uwe Kleine-König 1 sibling, 2 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-10 17:20 UTC (permalink / raw) To: Marc Kleine-Budde Cc: Uwe Kleine-König, James Bottomley, Russell King - ARM Linux, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Mon, 2011-01-10 at 18:08 +0100, Marc Kleine-Budde wrote: > On 01/10/2011 05:25 PM, Trond Myklebust wrote: > > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote: > >> Hi Trond, > >> > >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > >>> ----------------------------------------------------------------------------------- > >> It would be great if you could add a "8<" in the line above next time. > >> Then git-am -c does the right thing (at least I think so). > > > > Sorry. I wasn't aware of that particular idiom. So something like > > > > 8<------------------------------------------------------------ > > From: Trond Myklebust <Trond.Myklebust@netapp.com> > > Subject: ..... > > Yes. > > From "man git-mailinfo": > "A line that mainly consists of scissors (either ">8" or "8<") and > perforation (dash "-")" > > BTW: > Is this patch a candidate for stable? Yes. I'm planning on sending it to the stable list after Linus merges it into mainline. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1294680035.3349.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <1294680035.3349.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2011-01-10 17:26 ` Marc Kleine-Budde 2011-01-10 17:26 ` Marc Kleine-Budde 0 siblings, 1 reply; 71+ messages in thread From: Marc Kleine-Budde @ 2011-01-10 17:26 UTC (permalink / raw) To: Trond Myklebust Cc: Uwe Kleine-König, James Bottomley, Russell King - ARM Linux, Linus Torvalds, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 491 bytes --] On 01/10/2011 06:20 PM, Trond Myklebust wrote: >> Is this patch a candidate for stable? > > Yes. I'm planning on sending it to the stable list after Linus merges it > into mainline. Fine! regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 17:26 ` Marc Kleine-Budde @ 2011-01-10 17:26 ` Marc Kleine-Budde 0 siblings, 0 replies; 71+ messages in thread From: Marc Kleine-Budde @ 2011-01-10 17:26 UTC (permalink / raw) To: Trond Myklebust Cc: Uwe Kleine-König, James Bottomley, Russell King - ARM Linux, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch [-- Attachment #1: Type: text/plain, Size: 491 bytes --] On 01/10/2011 06:20 PM, Trond Myklebust wrote: >> Is this patch a candidate for stable? > > Yes. I'm planning on sending it to the stable list after Linus merges it > into mainline. Fine! regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 17:20 ` Trond Myklebust [not found] ` <1294680035.3349.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2011-01-10 19:25 ` Uwe Kleine-König [not found] ` <20110110192552.GG24920-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org> 1 sibling, 1 reply; 71+ messages in thread From: Uwe Kleine-König @ 2011-01-10 19:25 UTC (permalink / raw) To: Trond Myklebust Cc: Marc Kleine-Budde, linux-arch, linux-nfs, Russell King - ARM Linux, Parisc List, linux-kernel, James Bottomley, Linus Torvalds, Marc Kleine-Budde, linux-arm-kernel Hello Trond, On Mon, Jan 10, 2011 at 12:20:35PM -0500, Trond Myklebust wrote: > On Mon, 2011-01-10 at 18:08 +0100, Marc Kleine-Budde wrote: > > On 01/10/2011 05:25 PM, Trond Myklebust wrote: > > > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote: > > >> Hi Trond, > > >> > > >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > > >>> ----------------------------------------------------------------------------------- > > >> It would be great if you could add a "8<" in the line above next time. > > >> Then git-am -c does the right thing (at least I think so). > > > > > > Sorry. I wasn't aware of that particular idiom. So something like > > > > > > 8<------------------------------------------------------------ > > > From: Trond Myklebust <Trond.Myklebust@netapp.com> > > > Subject: ..... > > > > Yes. > > > > From "man git-mailinfo": > > "A line that mainly consists of scissors (either ">8" or "8<") and > > perforation (dash "-")" > > > > BTW: > > Is this patch a candidate for stable? > > Yes. I'm planning on sending it to the stable list after Linus merges it > into mainline. So there is another idiom for you: just put Cc: stable@kernel.org in the S-o-b block and Greg will pick it off "automatically". (Just in case you don't know, and if you do, maybe someone else learned something.) Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-König | Industrial Linux Solutions | http://www.pengutronix.de/ | ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <20110110192552.GG24920-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <20110110192552.GG24920-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org> @ 2011-01-10 19:29 ` Trond Myklebust 2011-01-10 19:29 ` Trond Myklebust ` (2 more replies) 0 siblings, 3 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-10 19:29 UTC (permalink / raw) To: Uwe Kleine-König Cc: Marc Kleine-Budde, linux-arch-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, Russell King - ARM Linux, Parisc List, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Linus Torvalds, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Mon, 2011-01-10 at 20:25 +0100, Uwe Kleine-König wrote: > Hello Trond, > > On Mon, Jan 10, 2011 at 12:20:35PM -0500, Trond Myklebust wrote: > > On Mon, 2011-01-10 at 18:08 +0100, Marc Kleine-Budde wrote: > > > On 01/10/2011 05:25 PM, Trond Myklebust wrote: > > > > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote: > > > >> Hi Trond, > > > >> > > > >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > > > >>> ----------------------------------------------------------------------------------- > > > >> It would be great if you could add a "8<" in the line above next time. > > > >> Then git-am -c does the right thing (at least I think so). > > > > > > > > Sorry. I wasn't aware of that particular idiom. So something like > > > > > > > > 8<------------------------------------------------------------ > > > > From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > > > > Subject: ..... > > > > > > Yes. > > > > > > From "man git-mailinfo": > > > "A line that mainly consists of scissors (either ">8" or "8<") and > > > perforation (dash "-")" > > > > > > BTW: > > > Is this patch a candidate for stable? > > > > Yes. I'm planning on sending it to the stable list after Linus merges it > > into mainline. > So there is another idiom for you: just put > > Cc: stable-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org > > in the S-o-b block and Greg will pick it off "automatically". (Just in > case you don't know, and if you do, maybe someone else learned > something.) I usually do this, but there is a slight problem with that approach: Greg gets to do all the work of figuring out to which stable kernels this particular patch applies. In this case, since we're only talking about the 2.6.37 kernel, I prefer to use the mailing lists. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org www.netapp.com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 19:29 ` Trond Myklebust @ 2011-01-10 19:29 ` Trond Myklebust 2011-01-10 19:31 ` James Bottomley 2011-01-10 19:34 ` Linus Torvalds 2 siblings, 0 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-10 19:29 UTC (permalink / raw) To: Uwe Kleine-König Cc: Marc Kleine-Budde, linux-arch, linux-nfs, Russell King - ARM Linux, Parisc List, linux-kernel, James Bottomley, Linus Torvalds, Marc Kleine-Budde, linux-arm-kernel On Mon, 2011-01-10 at 20:25 +0100, Uwe Kleine-König wrote: > Hello Trond, > > On Mon, Jan 10, 2011 at 12:20:35PM -0500, Trond Myklebust wrote: > > On Mon, 2011-01-10 at 18:08 +0100, Marc Kleine-Budde wrote: > > > On 01/10/2011 05:25 PM, Trond Myklebust wrote: > > > > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote: > > > >> Hi Trond, > > > >> > > > >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > > > >>> ----------------------------------------------------------------------------------- > > > >> It would be great if you could add a "8<" in the line above next time. > > > >> Then git-am -c does the right thing (at least I think so). > > > > > > > > Sorry. I wasn't aware of that particular idiom. So something like > > > > > > > > 8<------------------------------------------------------------ > > > > From: Trond Myklebust <Trond.Myklebust@netapp.com> > > > > Subject: ..... > > > > > > Yes. > > > > > > From "man git-mailinfo": > > > "A line that mainly consists of scissors (either ">8" or "8<") and > > > perforation (dash "-")" > > > > > > BTW: > > > Is this patch a candidate for stable? > > > > Yes. I'm planning on sending it to the stable list after Linus merges it > > into mainline. > So there is another idiom for you: just put > > Cc: stable@kernel.org > > in the S-o-b block and Greg will pick it off "automatically". (Just in > case you don't know, and if you do, maybe someone else learned > something.) I usually do this, but there is a slight problem with that approach: Greg gets to do all the work of figuring out to which stable kernels this particular patch applies. In this case, since we're only talking about the 2.6.37 kernel, I prefer to use the mailing lists. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 19:29 ` Trond Myklebust 2011-01-10 19:29 ` Trond Myklebust @ 2011-01-10 19:31 ` James Bottomley 2011-01-10 19:34 ` Linus Torvalds 2 siblings, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-10 19:31 UTC (permalink / raw) To: Trond Myklebust Cc: Uwe Kleine-König, Marc Kleine-Budde, linux-arch, linux-nfs, Russell King - ARM Linux, Parisc List, linux-kernel, Linus Torvalds, Marc Kleine-Budde, linux-arm-kernel On Mon, 2011-01-10 at 14:29 -0500, Trond Myklebust wrote: > I usually do this, but there is a slight problem with that approach: > Greg gets to do all the work of figuring out to which stable kernels > this particular patch applies. In this case, since we're only talking > about the 2.6.37 kernel, I prefer to use the mailing lists. So the non-standard, but accepted way of doing this is Cc: Stable Tree <stable@kernel.org> [2.6.37] James ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 19:29 ` Trond Myklebust 2011-01-10 19:29 ` Trond Myklebust 2011-01-10 19:31 ` James Bottomley @ 2011-01-10 19:34 ` Linus Torvalds 2011-01-10 19:34 ` Linus Torvalds 2011-01-10 20:15 ` Trond Myklebust 2 siblings, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-10 19:34 UTC (permalink / raw) To: Trond Myklebust Cc: Uwe Kleine-König, Marc Kleine-Budde, linux-arch, linux-nfs, Russell King - ARM Linux, Parisc List, linux-kernel, James Bottomley, Marc Kleine-Budde, linux-arm-kernel 2011/1/10 Trond Myklebust <Trond.Myklebust@netapp.com>: > > I usually do this, but there is a slight problem with that approach: > Greg gets to do all the work of figuring out to which stable kernels > this particular patch applies. In this case, since we're only talking > about the 2.6.37 kernel, I prefer to use the mailing lists. Just do Cc: stable@kernel.org [2.6.37] or similar. It's quite common. So EVEN IF you want to email people around about the patch separately, do add the "Cc: stable" marker. It's worthwhile information about the patch for everybody involved. Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 19:34 ` Linus Torvalds @ 2011-01-10 19:34 ` Linus Torvalds 2011-01-10 20:15 ` Trond Myklebust 1 sibling, 0 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-10 19:34 UTC (permalink / raw) To: Trond Myklebust Cc: Uwe Kleine-König, Marc Kleine-Budde, linux-arch, linux-nfs, Russell King - ARM Linux, Parisc List, linux-kernel, James Bottomley, Marc Kleine-Budde, linux-arm-kernel 2011/1/10 Trond Myklebust <Trond.Myklebust@netapp.com>: > > I usually do this, but there is a slight problem with that approach: > Greg gets to do all the work of figuring out to which stable kernels > this particular patch applies. In this case, since we're only talking > about the 2.6.37 kernel, I prefer to use the mailing lists. Just do Cc: stable@kernel.org [2.6.37] or similar. It's quite common. So EVEN IF you want to email people around about the patch separately, do add the "Cc: stable" marker. It's worthwhile information about the patch for everybody involved. Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 19:34 ` Linus Torvalds 2011-01-10 19:34 ` Linus Torvalds @ 2011-01-10 20:15 ` Trond Myklebust 1 sibling, 0 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-10 20:15 UTC (permalink / raw) To: Linus Torvalds Cc: Uwe Kleine-König, Marc Kleine-Budde, linux-arch, linux-nfs, Russell King - ARM Linux, Parisc List, linux-kernel, James Bottomley, Marc Kleine-Budde, linux-arm-kernel On Mon, 2011-01-10 at 11:34 -0800, Linus Torvalds wrote: > 2011/1/10 Trond Myklebust <Trond.Myklebust@netapp.com>: > > > > I usually do this, but there is a slight problem with that approach: > > Greg gets to do all the work of figuring out to which stable kernels > > this particular patch applies. In this case, since we're only talking > > about the 2.6.37 kernel, I prefer to use the mailing lists. > > Just do > > Cc: stable@kernel.org [2.6.37] > > or similar. It's quite common. > > So EVEN IF you want to email people around about the patch separately, > do add the "Cc: stable" marker. It's worthwhile information about the > patch for everybody involved. OK. Patch description amended and recommitted in git. Thanks to all for the tips... -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <1294528551.4181.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> 2011-01-10 10:50 ` Uwe Kleine-König @ 2011-01-10 12:44 ` Marc Kleine-Budde 2011-01-10 12:44 ` Marc Kleine-Budde 1 sibling, 1 reply; 71+ messages in thread From: Marc Kleine-Budde @ 2011-01-10 12:44 UTC (permalink / raw) To: Trond Myklebust Cc: James Bottomley, Russell King - ARM Linux, Linus Torvalds, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 1233 bytes --] On 01/09/2011 12:15 AM, Trond Myklebust wrote: > From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001 > From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > Date: Sat, 8 Jan 2011 17:45:38 -0500 > Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir > > vm_map_ram() is not available on NOMMU platforms, and causes trouble > on incoherrent architectures such as ARM when we access the page data > through both the direct and the virtual mapping. > > The alternative is to use the direct mapping to access page data > for the case when we are not crossing a page boundary, but to copy > the data into a linear scratch buffer when we are accessing data > that spans page boundaries. > > Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Tested-by: Marc Kleine-Budde <mkl-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org> ..on AT91 (armv5) regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 12:44 ` Marc Kleine-Budde @ 2011-01-10 12:44 ` Marc Kleine-Budde 0 siblings, 0 replies; 71+ messages in thread From: Marc Kleine-Budde @ 2011-01-10 12:44 UTC (permalink / raw) To: Trond Myklebust Cc: James Bottomley, Russell King - ARM Linux, Linus Torvalds, linux-nfs, linux-kernel, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch [-- Attachment #1: Type: text/plain, Size: 1150 bytes --] On 01/09/2011 12:15 AM, Trond Myklebust wrote: > From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001 > From: Trond Myklebust <Trond.Myklebust@netapp.com> > Date: Sat, 8 Jan 2011 17:45:38 -0500 > Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir > > vm_map_ram() is not available on NOMMU platforms, and causes trouble > on incoherrent architectures such as ARM when we access the page data > through both the direct and the virtual mapping. > > The alternative is to use the direct mapping to access page data > for the case when we are not crossing a page boundary, but to copy > the data into a linear scratch buffer when we are accessing data > that spans page boundaries. > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Marc Kleine-Budde <mkl@pengutronix.de> ..on AT91 (armv5) regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-07 19:02 ` Russell King - ARM Linux 2011-01-07 19:02 ` Russell King - ARM Linux 2011-01-07 19:11 ` James Bottomley @ 2011-01-07 19:13 ` Trond Myklebust 2 siblings, 0 replies; 71+ messages in thread From: Trond Myklebust @ 2011-01-07 19:13 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Linus Torvalds, James Bottomley, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Fri, 2011-01-07 at 19:02 +0000, Russell King - ARM Linux wrote: > On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote: > > I'd still like to keep the existing code for those architectures that > > don't have problems, since that allows us to send 32k READDIR requests > > instead of being limited to 4k. For large directories, that is a clear > > win. > > For the NOMMU case we will just go back to using a single page for > > storage (and 4k READDIR requests only). Should I just do the same for > > architectures like ARM and PARISC? > > I think you said that readdir reads via the vmalloc mapping of the > group of pages, but XDR writes to the individual pages. > > As I understand NFS, you receive a packet, you then have to use XDR > to unpack the data, which you presumably write into the set of > struct page *'s using kmap? No. The socket or RDMA layers place the data directly into the struct pages. We then unpack them in the XDR layer using the vmalloc mapping and place the resulting processed readdir data into the page cache. > Isn't a solution to have XDR write directly into the vmalloc mapping > rather than using struct page * and kmap? Unfortunately that isn't possible. :-( -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-07 18:53 ` Trond Myklebust [not found] ` <1294426405.2929.23.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2011-01-07 19:05 ` James Bottomley 1 sibling, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-07 19:05 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Russell King - ARM Linux, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Fri, 2011-01-07 at 13:53 -0500, Trond Myklebust wrote: > There is already code in the SUNRPC layer that calls flush_dcache_page() > after writing (although as Russell pointed out earlier, that is > apparently a no-op for non-page cache pages such as these). Actually (and possibly fortunately) none of our flush_dcache_page() implementations do this (check for an actual non page cache page and nop if they find one). Although, they may according to the docs which say that flush_dcache_page() is only called on page cache pages. But it's definitely using the API outside its documented scope. We have lots of places in the VFS where we don't call flush_dcache_page() even after altering a kernel page (even in the page cache) if we know the page will never be mapped to userspace. The assumption here is that the kernel never sets up non-user aliases of these pages, so not doing the flushing is an optimisation since we only access them through the kernel address space. Of course, setting up vmap areas of these pages within the kernel violates this assumption. > > This is why you really really really generally don't want to have > > aliasing. Purely virtual caches are pure crap. Really. > > Well, it looks as if NOMMU is giving us problems due to the lack of a > vm_map_ram() (see https://bugzilla.kernel.org/show_bug.cgi?id=26262). > > I'd still like to keep the existing code for those architectures that > don't have problems, since that allows us to send 32k READDIR requests > instead of being limited to 4k. For large directories, that is a clear > win. > For the NOMMU case we will just go back to using a single page for > storage (and 4k READDIR requests only). Should I just do the same for > architectures like ARM and PARISC? Well, that would include any VI architecture (like SPARC and others) as well. However, I think we can just make the invalidate_kernel_vmap_range() work. James ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1294335614.22825.154.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <1294335614.22825.154.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> @ 2011-01-06 18:05 ` Russell King - ARM Linux 2011-01-06 18:05 ` Russell King - ARM Linux 2011-01-06 18:14 ` James Bottomley 0 siblings, 2 replies; 71+ messages in thread From: Russell King - ARM Linux @ 2011-01-06 18:05 UTC (permalink / raw) To: James Bottomley Cc: Trond Myklebust, Linus Torvalds, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA On Thu, Jan 06, 2011 at 11:40:13AM -0600, James Bottomley wrote: > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > > Can you explain how the code works? it looks to me like you read the xdr > > stuff through the vmap region then write it out directly to the pages? > > OK, I think I see how this is supposed to work: It's a sequential loop > of reading in via the pages (i.e. through the kernel mapping) and then > updating those pages via the vmap. In which case, I think this patch is > what you need. > > The theory of operation is that the readdir on pages actually uses the > network DMA operations to perform, so when it's finished, the underlying What network DMA operations - what if your NIC doesn't do DMA because it's an SMSC device? -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 18:05 ` Russell King - ARM Linux @ 2011-01-06 18:05 ` Russell King - ARM Linux 2011-01-06 18:14 ` James Bottomley 1 sibling, 0 replies; 71+ messages in thread From: Russell King - ARM Linux @ 2011-01-06 18:05 UTC (permalink / raw) To: James Bottomley Cc: Trond Myklebust, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Thu, Jan 06, 2011 at 11:40:13AM -0600, James Bottomley wrote: > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > > Can you explain how the code works? it looks to me like you read the xdr > > stuff through the vmap region then write it out directly to the pages? > > OK, I think I see how this is supposed to work: It's a sequential loop > of reading in via the pages (i.e. through the kernel mapping) and then > updating those pages via the vmap. In which case, I think this patch is > what you need. > > The theory of operation is that the readdir on pages actually uses the > network DMA operations to perform, so when it's finished, the underlying What network DMA operations - what if your NIC doesn't do DMA because it's an SMSC device? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 18:05 ` Russell King - ARM Linux 2011-01-06 18:05 ` Russell King - ARM Linux @ 2011-01-06 18:14 ` James Bottomley 2011-01-06 18:14 ` James Bottomley [not found] ` <1294337670.22825.199.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> 1 sibling, 2 replies; 71+ messages in thread From: James Bottomley @ 2011-01-06 18:14 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Trond Myklebust, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Thu, 2011-01-06 at 18:05 +0000, Russell King - ARM Linux wrote: > On Thu, Jan 06, 2011 at 11:40:13AM -0600, James Bottomley wrote: > > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > > > Can you explain how the code works? it looks to me like you read the xdr > > > stuff through the vmap region then write it out directly to the pages? > > > > OK, I think I see how this is supposed to work: It's a sequential loop > > of reading in via the pages (i.e. through the kernel mapping) and then > > updating those pages via the vmap. In which case, I think this patch is > > what you need. > > > > The theory of operation is that the readdir on pages actually uses the > > network DMA operations to perform, so when it's finished, the underlying > > What network DMA operations - what if your NIC doesn't do DMA because > it's an SMSC device? So this is the danger area ... we might be caught by our own flushing tricks. I can't test this on parisc since all my network drivers use DMA (which automatically coheres the kernel mapping by flush/invalidate). What should happen is that the kernel mapping pages go through the ->readdir() path. Any return from this has to be ready to map the pages back to user space, so the kernel alias has to be flushed to make the underlying page up to date. The exception is pages we haven't yet mapped to userspace. Here we set the PG_dcache_dirty bit (sparc trick) but don't flush the page, since we expect the addition of a userspace mapping will detect this case and do the flush and clear the bit before the mapping goes live. I assume you're thinking that because this page is allocated and freed internally to NFS, it never gets a userspace mapping and therefore, we can return from ->readdir() with a dirty kernel cache (and the corresponding flag set)? I think that is a possible hypothesis in certain cases. James ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 18:14 ` James Bottomley @ 2011-01-06 18:14 ` James Bottomley [not found] ` <1294337670.22825.199.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> 1 sibling, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-06 18:14 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Trond Myklebust, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Thu, 2011-01-06 at 18:05 +0000, Russell King - ARM Linux wrote: > On Thu, Jan 06, 2011 at 11:40:13AM -0600, James Bottomley wrote: > > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > > > Can you explain how the code works? it looks to me like you read the xdr > > > stuff through the vmap region then write it out directly to the pages? > > > > OK, I think I see how this is supposed to work: It's a sequential loop > > of reading in via the pages (i.e. through the kernel mapping) and then > > updating those pages via the vmap. In which case, I think this patch is > > what you need. > > > > The theory of operation is that the readdir on pages actually uses the > > network DMA operations to perform, so when it's finished, the underlying > > What network DMA operations - what if your NIC doesn't do DMA because > it's an SMSC device? So this is the danger area ... we might be caught by our own flushing tricks. I can't test this on parisc since all my network drivers use DMA (which automatically coheres the kernel mapping by flush/invalidate). What should happen is that the kernel mapping pages go through the ->readdir() path. Any return from this has to be ready to map the pages back to user space, so the kernel alias has to be flushed to make the underlying page up to date. The exception is pages we haven't yet mapped to userspace. Here we set the PG_dcache_dirty bit (sparc trick) but don't flush the page, since we expect the addition of a userspace mapping will detect this case and do the flush and clear the bit before the mapping goes live. I assume you're thinking that because this page is allocated and freed internally to NFS, it never gets a userspace mapping and therefore, we can return from ->readdir() with a dirty kernel cache (and the corresponding flag set)? I think that is a possible hypothesis in certain cases. James ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1294337670.22825.199.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <1294337670.22825.199.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> @ 2011-01-06 18:25 ` James Bottomley 2011-01-06 18:25 ` James Bottomley 2011-01-06 21:07 ` James Bottomley 0 siblings, 2 replies; 71+ messages in thread From: James Bottomley @ 2011-01-06 18:25 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Trond Myklebust, Linus Torvalds, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA On Thu, 2011-01-06 at 12:14 -0600, James Bottomley wrote: > On Thu, 2011-01-06 at 18:05 +0000, Russell King - ARM Linux wrote: > > What network DMA operations - what if your NIC doesn't do DMA because > > it's an SMSC device? > > So this is the danger area ... we might be caught by our own flushing > tricks. I can't test this on parisc since all my network drivers use > DMA (which automatically coheres the kernel mapping by > flush/invalidate). > > What should happen is that the kernel mapping pages go through the > ->readdir() path. Any return from this has to be ready to map the pages > back to user space, so the kernel alias has to be flushed to make the > underlying page up to date. > > The exception is pages we haven't yet mapped to userspace. Here we set > the PG_dcache_dirty bit (sparc trick) but don't flush the page, since we > expect the addition of a userspace mapping will detect this case and do > the flush and clear the bit before the mapping goes live. I assume > you're thinking that because this page is allocated and freed internally > to NFS, it never gets a userspace mapping and therefore, we can return > from ->readdir() with a dirty kernel cache (and the corresponding flag > set)? I think that is a possible hypothesis in certain cases. OK, so thinking about this, it seems that the only danger is actually what NFS is doing: reading cache pages via a vmap. In that case, since the requirement is to invalidate the vmap range to prepare for read, we could have invalidate_kernel_vmap_range loop over the underlying pages and flush them through the kernel alias if the architecture specific flag indicates their contents might be dirty. The loop adds expense that is probably largely unnecessary to invalidate_kernel_vmap_range() but the alternative is adding to the API proliferation with something that only flushes the kernel pages if the arch specific flag says they're dirty. James -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 18:25 ` James Bottomley @ 2011-01-06 18:25 ` James Bottomley 2011-01-06 21:07 ` James Bottomley 1 sibling, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-06 18:25 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Trond Myklebust, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Thu, 2011-01-06 at 12:14 -0600, James Bottomley wrote: > On Thu, 2011-01-06 at 18:05 +0000, Russell King - ARM Linux wrote: > > What network DMA operations - what if your NIC doesn't do DMA because > > it's an SMSC device? > > So this is the danger area ... we might be caught by our own flushing > tricks. I can't test this on parisc since all my network drivers use > DMA (which automatically coheres the kernel mapping by > flush/invalidate). > > What should happen is that the kernel mapping pages go through the > ->readdir() path. Any return from this has to be ready to map the pages > back to user space, so the kernel alias has to be flushed to make the > underlying page up to date. > > The exception is pages we haven't yet mapped to userspace. Here we set > the PG_dcache_dirty bit (sparc trick) but don't flush the page, since we > expect the addition of a userspace mapping will detect this case and do > the flush and clear the bit before the mapping goes live. I assume > you're thinking that because this page is allocated and freed internally > to NFS, it never gets a userspace mapping and therefore, we can return > from ->readdir() with a dirty kernel cache (and the corresponding flag > set)? I think that is a possible hypothesis in certain cases. OK, so thinking about this, it seems that the only danger is actually what NFS is doing: reading cache pages via a vmap. In that case, since the requirement is to invalidate the vmap range to prepare for read, we could have invalidate_kernel_vmap_range loop over the underlying pages and flush them through the kernel alias if the architecture specific flag indicates their contents might be dirty. The loop adds expense that is probably largely unnecessary to invalidate_kernel_vmap_range() but the alternative is adding to the API proliferation with something that only flushes the kernel pages if the arch specific flag says they're dirty. James ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 18:25 ` James Bottomley 2011-01-06 18:25 ` James Bottomley @ 2011-01-06 21:07 ` James Bottomley 2011-01-06 21:07 ` James Bottomley 1 sibling, 1 reply; 71+ messages in thread From: James Bottomley @ 2011-01-06 21:07 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Trond Myklebust, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Thu, 2011-01-06 at 12:25 -0600, James Bottomley wrote: > OK, so thinking about this, it seems that the only danger is actually > what NFS is doing: reading cache pages via a vmap. In that case, since > the requirement is to invalidate the vmap range to prepare for read, we > could have invalidate_kernel_vmap_range loop over the underlying pages > and flush them through the kernel alias if the architecture specific > flag indicates their contents might be dirty. > > The loop adds expense that is probably largely unnecessary to > invalidate_kernel_vmap_range() but the alternative is adding to the API > proliferation with something that only flushes the kernel pages if the > arch specific flag says they're dirty. This is what I think the arm patch would look like (example only: I can't compile it). Is something like this too expensive? the loop can't be optimised away because of the need to check the pages (and vmalloc_to_page is a three level page table lookup). James --- diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h index 3acd8fa..34469ca 100644 --- a/arch/arm/include/asm/cacheflush.h +++ b/arch/arm/include/asm/cacheflush.h @@ -414,8 +414,17 @@ static inline void flush_kernel_vmap_range(void *addr, int size) } static inline void invalidate_kernel_vmap_range(void *addr, int size) { - if ((cache_is_vivt() || cache_is_vipt_aliasing())) - __cpuc_flush_dcache_area(addr, (size_t)size); + if ((cache_is_vivt() || cache_is_vipt_aliasing())) { + void *cursor = addr; + + for ( ; cursor < addr + size; cursor += PAGE_SIZE) { + struct page *page = vmalloc_to_page(cursor); + + if (!test_and_set_bit(PG_dcache_clean, &page->flags)) + __flush_dcache_page(page_mapping(page), page); + } + __cpuc_flush_dcache_area(addr, (size_t)size); + } } #define ARCH_HAS_FLUSH_ANON_PAGE ^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 21:07 ` James Bottomley @ 2011-01-06 21:07 ` James Bottomley 0 siblings, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-06 21:07 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Trond Myklebust, Linus Torvalds, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Thu, 2011-01-06 at 12:25 -0600, James Bottomley wrote: > OK, so thinking about this, it seems that the only danger is actually > what NFS is doing: reading cache pages via a vmap. In that case, since > the requirement is to invalidate the vmap range to prepare for read, we > could have invalidate_kernel_vmap_range loop over the underlying pages > and flush them through the kernel alias if the architecture specific > flag indicates their contents might be dirty. > > The loop adds expense that is probably largely unnecessary to > invalidate_kernel_vmap_range() but the alternative is adding to the API > proliferation with something that only flushes the kernel pages if the > arch specific flag says they're dirty. This is what I think the arm patch would look like (example only: I can't compile it). Is something like this too expensive? the loop can't be optimised away because of the need to check the pages (and vmalloc_to_page is a three level page table lookup). James --- diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h index 3acd8fa..34469ca 100644 --- a/arch/arm/include/asm/cacheflush.h +++ b/arch/arm/include/asm/cacheflush.h @@ -414,8 +414,17 @@ static inline void flush_kernel_vmap_range(void *addr, int size) } static inline void invalidate_kernel_vmap_range(void *addr, int size) { - if ((cache_is_vivt() || cache_is_vipt_aliasing())) - __cpuc_flush_dcache_area(addr, (size_t)size); + if ((cache_is_vivt() || cache_is_vipt_aliasing())) { + void *cursor = addr; + + for ( ; cursor < addr + size; cursor += PAGE_SIZE) { + struct page *page = vmalloc_to_page(cursor); + + if (!test_and_set_bit(PG_dcache_clean, &page->flags)) + __flush_dcache_page(page_mapping(page), page); + } + __cpuc_flush_dcache_area(addr, (size_t)size); + } } #define ARCH_HAS_FLUSH_ANON_PAGE ^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:40 ` James Bottomley 2011-01-06 17:47 ` Trond Myklebust [not found] ` <1294335614.22825.154.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org> @ 2011-01-06 20:19 ` John Stoffel 2011-01-06 20:19 ` John Stoffel 2 siblings, 1 reply; 71+ messages in thread From: John Stoffel @ 2011-01-06 20:19 UTC (permalink / raw) To: James Bottomley Cc: Trond Myklebust, Linus Torvalds, Russell King - ARM Linux, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch >>>>> "James" == James Bottomley <James.Bottomley@HansenPartnership.com> writes: James> On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: >> Can you explain how the code works? it looks to me like you read the xdr >> stuff through the vmap region then write it out directly to the pages? James> OK, I think I see how this is supposed to work: It's a James> sequential loop of reading in via the pages (i.e. through the James> kernel mapping) and then updating those pages via the vmap. In James> which case, I think this patch is what you need. James> The theory of operation is that the readdir on pages actually James> uses the network DMA operations to perform, so when it's James> finished, the underlying page is up to date. After this you James> invalidate the vmap range, so we have no cache lines above it James> (so it picks up the values from the uptodate page). Finally, James> after the operation on the vmap region has finished, you flush James> it so that any updated contents go back to the pages themselves James> before the next iteration begins. You need to re-spin this patch to include the above description into the magic steps your taking here, or at least document it more clearly somewhere why you need to make these funky steps. John James> James James> --- James> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c James> index 996dd89..bde1911 100644 James> --- a/fs/nfs/dir.c James> +++ b/fs/nfs/dir.c James> @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, James> if (status < 0) James> break; James> pglen = status; James> + James> + invalidate_kernel_vmap_range(pages_ptr, pglen); James> + James> status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); James> if (status < 0) { James> if (status == -ENOSPC) James> status = 0; James> break; James> } James> + flush_kernel_vmap_range(pages_ptr, pglen); James> } while (array->eof_index < 0); James> nfs_readdir_free_large_page(pages_ptr, pages, array_size); James> -- James> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in James> the body of a message to majordomo@vger.kernel.org James> More majordomo info at http://vger.kernel.org/majordomo-info.html James> Please read the FAQ at http://www.tux.org/lkml/ -- ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 20:19 ` John Stoffel @ 2011-01-06 20:19 ` John Stoffel 0 siblings, 0 replies; 71+ messages in thread From: John Stoffel @ 2011-01-06 20:19 UTC (permalink / raw) To: James Bottomley Cc: Trond Myklebust, Linus Torvalds, Russell King - ARM Linux, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch >>>>> "James" == James Bottomley <James.Bottomley@HansenPartnership.com> writes: James> On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: >> Can you explain how the code works? it looks to me like you read the xdr >> stuff through the vmap region then write it out directly to the pages? James> OK, I think I see how this is supposed to work: It's a James> sequential loop of reading in via the pages (i.e. through the James> kernel mapping) and then updating those pages via the vmap. In James> which case, I think this patch is what you need. James> The theory of operation is that the readdir on pages actually James> uses the network DMA operations to perform, so when it's James> finished, the underlying page is up to date. After this you James> invalidate the vmap range, so we have no cache lines above it James> (so it picks up the values from the uptodate page). Finally, James> after the operation on the vmap region has finished, you flush James> it so that any updated contents go back to the pages themselves James> before the next iteration begins. You need to re-spin this patch to include the above description into the magic steps your taking here, or at least document it more clearly somewhere why you need to make these funky steps. John James> James James> --- James> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c James> index 996dd89..bde1911 100644 James> --- a/fs/nfs/dir.c James> +++ b/fs/nfs/dir.c James> @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, James> if (status < 0) James> break; James> pglen = status; James> + James> + invalidate_kernel_vmap_range(pages_ptr, pglen); James> + James> status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); James> if (status < 0) { James> if (status == -ENOSPC) James> status = 0; James> break; James> } James> + flush_kernel_vmap_range(pages_ptr, pglen); James> } while (array->eof_index < 0); James> nfs_readdir_free_large_page(pages_ptr, pages, array_size); James> -- James> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in James> the body of a message to majordomo@vger.kernel.org James> More majordomo info at http://vger.kernel.org/majordomo-info.html James> Please read the FAQ at http://www.tux.org/lkml/ -- ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 23:06 ` Trond Myklebust 2011-01-05 23:06 ` Trond Myklebust 2011-01-05 23:28 ` James Bottomley @ 2011-01-05 23:28 ` Linus Torvalds 2011-01-05 23:28 ` Linus Torvalds [not found] ` <AANLkTi=SjMinMp+m726GS1iehj6cQgNy1RqSoUqKhjtv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2 siblings, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-05 23:28 UTC (permalink / raw) To: Trond Myklebust Cc: Russell King - ARM Linux, James Bottomley, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, Jan 5, 2011 at 3:06 PM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), > which takes care of invalidating the cache prior to a virtual address > read. > > My question was specifically about the write through the regular kernel > mapping: according to Russell and my reading of the cachetlb.txt > documentation, flush_dcache_page() is only guaranteed to have an effect > on page cache pages. I don't think that should ever matter. It's not like the hardware can know whether it's a dcache page or not. And if the sw implementation cares, it's doing something really odd. But who knows - there's a lot of crap out there, and people sometimes do really odd things to work around the brokenness of a VIVT cache with aliases. Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 23:28 ` Linus Torvalds @ 2011-01-05 23:28 ` Linus Torvalds [not found] ` <AANLkTi=SjMinMp+m726GS1iehj6cQgNy1RqSoUqKhjtv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 0 replies; 71+ messages in thread From: Linus Torvalds @ 2011-01-05 23:28 UTC (permalink / raw) To: Trond Myklebust Cc: Russell King - ARM Linux, James Bottomley, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, Jan 5, 2011 at 3:06 PM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), > which takes care of invalidating the cache prior to a virtual address > read. > > My question was specifically about the write through the regular kernel > mapping: according to Russell and my reading of the cachetlb.txt > documentation, flush_dcache_page() is only guaranteed to have an effect > on page cache pages. I don't think that should ever matter. It's not like the hardware can know whether it's a dcache page or not. And if the sw implementation cares, it's doing something really odd. But who knows - there's a lot of crap out there, and people sometimes do really odd things to work around the brokenness of a VIVT cache with aliases. Linus ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <AANLkTi=SjMinMp+m726GS1iehj6cQgNy1RqSoUqKhjtv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: still nfs problems [Was: Linux 2.6.37-rc8] [not found] ` <AANLkTi=SjMinMp+m726GS1iehj6cQgNy1RqSoUqKhjtv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-01-05 23:59 ` Russell King - ARM Linux 2011-01-05 23:59 ` Russell King - ARM Linux 0 siblings, 1 reply; 71+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 23:59 UTC (permalink / raw) To: Linus Torvalds Cc: Trond Myklebust, James Bottomley, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Parisc List, linux-arch-u79uwXL29TY76Z2rM5mHXA On Wed, Jan 05, 2011 at 03:28:53PM -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 3:06 PM, Trond Myklebust > <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> wrote: > > > > Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), > > which takes care of invalidating the cache prior to a virtual address > > read. > > > > My question was specifically about the write through the regular kernel > > mapping: according to Russell and my reading of the cachetlb.txt > > documentation, flush_dcache_page() is only guaranteed to have an effect > > on page cache pages. > > I don't think that should ever matter. It's not like the hardware can > know whether it's a dcache page or not. > > And if the sw implementation cares, it's doing something really odd. From the hardware perspective you're correct that it doesn't. However, from the efficient implementation perspective it does matter. Take for example the read-ahead done on block devices. We don't want to flush all those pages that were read in when we don't know that they're ever going to end up in a user mapping. So what's commonly done (as suggested by DaveM) is that flush_dcache_page() detects that it's a dcache page, ensures that there's no user mappings, and sets a 'dirty' flag. This flag is guaranteed to be clear when new, clean, unread pages enter the page cache. When the page eventually ends up in a user mapping, that dirty flag is checked and the necessary cache flushing done at that point. Note that when there are user mappings, flush_dcache_page() has to flush those mappings too, otherwise mmap() <-> read()/write() coherency breaks. I believe this was what flush_dcache_page() was created to resolve. flush_kernel_dcache_page() was to solve the problem of PIO drivers writing to dcache pages, so that data written into the kernel mapping would be visible to subsequent user mappings. We chose a different overall approach - which had already been adopted by PPC - where we invert the meaning of this 'dirty' bit to mean that it's clean. So every new page cache page starts out life as being marked dirty and so nothing needs to be done at flush_kernel_dcache_page(). We continue to use davem's optimization but with the changed meaning of the bit, but as we now support SMP we do the flushing at set_pte_at() time. This also means that we don't have to rely on the (endlessly) buggy PIO drivers remembering to add flush_kernel_dcache_page() calls - something which has been a source of constant never-ending pain for us. The final piece of the jigsaw is flush_anon_page() which deals with kernel<->user coherency for anonymous pages by flushing both the user and kernel sides of the mapping. This was to solve direct-io coherency problems. As the users of flush_anon_page() always do: flush_anon_page(vma, page, addr); flush_dcache_page(page); and documentation doesn't appear to imply that this will always be the case, we restrict flush_dcache_page() to only work on page cache pages, otherwise we end up flushing the kernel-side mapping multiple time in succession. Maybe we should make flush_anon_page() only flush the user mapping, stipulate that it shall always be followed by flush_dcache_page(), which shall flush the kernel side mapping even for anonymous pages? That sounds to me like a recipe for missing flush_dcache_page() calls causing bugs. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 23:59 ` Russell King - ARM Linux @ 2011-01-05 23:59 ` Russell King - ARM Linux 0 siblings, 0 replies; 71+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 23:59 UTC (permalink / raw) To: Linus Torvalds Cc: Trond Myklebust, James Bottomley, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, Jan 05, 2011 at 03:28:53PM -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 3:06 PM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), > > which takes care of invalidating the cache prior to a virtual address > > read. > > > > My question was specifically about the write through the regular kernel > > mapping: according to Russell and my reading of the cachetlb.txt > > documentation, flush_dcache_page() is only guaranteed to have an effect > > on page cache pages. > > I don't think that should ever matter. It's not like the hardware can > know whether it's a dcache page or not. > > And if the sw implementation cares, it's doing something really odd. From the hardware perspective you're correct that it doesn't. However, from the efficient implementation perspective it does matter. Take for example the read-ahead done on block devices. We don't want to flush all those pages that were read in when we don't know that they're ever going to end up in a user mapping. So what's commonly done (as suggested by DaveM) is that flush_dcache_page() detects that it's a dcache page, ensures that there's no user mappings, and sets a 'dirty' flag. This flag is guaranteed to be clear when new, clean, unread pages enter the page cache. When the page eventually ends up in a user mapping, that dirty flag is checked and the necessary cache flushing done at that point. Note that when there are user mappings, flush_dcache_page() has to flush those mappings too, otherwise mmap() <-> read()/write() coherency breaks. I believe this was what flush_dcache_page() was created to resolve. flush_kernel_dcache_page() was to solve the problem of PIO drivers writing to dcache pages, so that data written into the kernel mapping would be visible to subsequent user mappings. We chose a different overall approach - which had already been adopted by PPC - where we invert the meaning of this 'dirty' bit to mean that it's clean. So every new page cache page starts out life as being marked dirty and so nothing needs to be done at flush_kernel_dcache_page(). We continue to use davem's optimization but with the changed meaning of the bit, but as we now support SMP we do the flushing at set_pte_at() time. This also means that we don't have to rely on the (endlessly) buggy PIO drivers remembering to add flush_kernel_dcache_page() calls - something which has been a source of constant never-ending pain for us. The final piece of the jigsaw is flush_anon_page() which deals with kernel<->user coherency for anonymous pages by flushing both the user and kernel sides of the mapping. This was to solve direct-io coherency problems. As the users of flush_anon_page() always do: flush_anon_page(vma, page, addr); flush_dcache_page(page); and documentation doesn't appear to imply that this will always be the case, we restrict flush_dcache_page() to only work on page cache pages, otherwise we end up flushing the kernel-side mapping multiple time in succession. Maybe we should make flush_anon_page() only flush the user mapping, stipulate that it shall always be followed by flush_dcache_page(), which shall flush the kernel side mapping even for anonymous pages? That sounds to me like a recipe for missing flush_dcache_page() calls causing bugs. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 20:48 ` Linus Torvalds [not found] ` <AANLkTimzzBsdtWcZtP5E_CH1hUZugGMoaHOiMdQJf764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-01-05 21:16 ` James Bottomley 2011-01-05 21:16 ` James Bottomley 1 sibling, 1 reply; 71+ messages in thread From: James Bottomley @ 2011-01-05 21:16 UTC (permalink / raw) To: Linus Torvalds Cc: Russell King - ARM Linux, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 12:48 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 12:33 PM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > well, that depends. For us on parisc, kmap of a user page in !HIGHMEM > > sets up an inequivalent aliase still ... because the cache colour of the > > user and kernel virtual addresses are different. Depending on the > > return path to userspace, we still usually have to flush to get the user > > to see the changes the kernel has made. > > Umm. Again, that has nothing to do with kmap(). > > This time it's about the user space mapping. > > Repeat after me: even without the kmap(), the kernel access to that > mapping would have caused cache aliases. > > See? Once more, the kmap() is entirely innocent. You can have a > non-highmem mapping that you never use kmap for, and that you map into > user space, and you'd see exactly the same aliases. Notice? Look ma, > no kmap(). Yes, I understand that (we have no highmem on parisc, so kmap is a nop). The problem (at least as I see it) is that once something within the kernel (well, OK, mostly within drivers) touches a user page via its kernel mapping, the flush often gets forgotten (mainly because it always works on x86). What I was thinking about is that every time the kernel touches a user space page, it has to be within a kmap/kunmap pair (because the page might be highmem) ... so it's possible to make kmap/kunmap do the flushing for this case so the driver writer can't ever forget it. I think the problem case is only really touching scatter/gather elements outside of the DMA API (i.e. the driver pio case), so this may be overkill. Russell also pointed out that a lot of the PIO iterators do excessive kmap_atomic/kunmap_atomic on the same page, so adding a flush could damage performance to the point where the flash root devices on arm might not work. Plus the pio iterators already contain the appropriate flush, so perhaps just using them in every case fixes the problem. > So clearly kmap() is not the issue. The issue continues to be a > totally separate virtual mapping. Whether it's a user mapping or > vm_map_ram() is obviously immaterial - as far as the CPU is concerned, > there is no difference between the two (apart from the trivial > differences of virtual location and permissions). > > (You can also force the problem with vmalloc() an then following the > kernel page tables, but I hope nobody does that any more. I suspect > I'm wrong, though, there's probably code that mixes vmalloc and > physical page accesses in various drivers) Yes, unfortunately, we have seen this quite a bit; mainly to get large buffers. Its not just confined to drivers: xfs used to fail on both arm and parisc because it used a vmalloc region for its log buffer which it then had to do I/O on. James ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:16 ` James Bottomley @ 2011-01-05 21:16 ` James Bottomley 0 siblings, 0 replies; 71+ messages in thread From: James Bottomley @ 2011-01-05 21:16 UTC (permalink / raw) To: Linus Torvalds Cc: Russell King - ARM Linux, Trond Myklebust, linux-nfs, linux-kernel, Marc Kleine-Budde, Uwe Kleine-König, Marc Kleine-Budde, linux-arm-kernel, Parisc List, linux-arch On Wed, 2011-01-05 at 12:48 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 12:33 PM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > well, that depends. For us on parisc, kmap of a user page in !HIGHMEM > > sets up an inequivalent aliase still ... because the cache colour of the > > user and kernel virtual addresses are different. Depending on the > > return path to userspace, we still usually have to flush to get the user > > to see the changes the kernel has made. > > Umm. Again, that has nothing to do with kmap(). > > This time it's about the user space mapping. > > Repeat after me: even without the kmap(), the kernel access to that > mapping would have caused cache aliases. > > See? Once more, the kmap() is entirely innocent. You can have a > non-highmem mapping that you never use kmap for, and that you map into > user space, and you'd see exactly the same aliases. Notice? Look ma, > no kmap(). Yes, I understand that (we have no highmem on parisc, so kmap is a nop). The problem (at least as I see it) is that once something within the kernel (well, OK, mostly within drivers) touches a user page via its kernel mapping, the flush often gets forgotten (mainly because it always works on x86). What I was thinking about is that every time the kernel touches a user space page, it has to be within a kmap/kunmap pair (because the page might be highmem) ... so it's possible to make kmap/kunmap do the flushing for this case so the driver writer can't ever forget it. I think the problem case is only really touching scatter/gather elements outside of the DMA API (i.e. the driver pio case), so this may be overkill. Russell also pointed out that a lot of the PIO iterators do excessive kmap_atomic/kunmap_atomic on the same page, so adding a flush could damage performance to the point where the flash root devices on arm might not work. Plus the pio iterators already contain the appropriate flush, so perhaps just using them in every case fixes the problem. > So clearly kmap() is not the issue. The issue continues to be a > totally separate virtual mapping. Whether it's a user mapping or > vm_map_ram() is obviously immaterial - as far as the CPU is concerned, > there is no difference between the two (apart from the trivial > differences of virtual location and permissions). > > (You can also force the problem with vmalloc() an then following the > kernel page tables, but I hope nobody does that any more. I suspect > I'm wrong, though, there's probably code that mixes vmalloc and > physical page accesses in various drivers) Yes, unfortunately, we have seen this quite a bit; mainly to get large buffers. Its not just confined to drivers: xfs used to fail on both arm and parisc because it used a vmalloc region for its log buffer which it then had to do I/O on. James ^ permalink raw reply [flat|nested] 71+ messages in thread
end of thread, other threads:[~2011-01-10 20:15 UTC | newest]
Thread overview: 71+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-01-05 19:05 still nfs problems [Was: Linux 2.6.37-rc8] James Bottomley
2011-01-05 19:18 ` Linus Torvalds
[not found] ` <AANLkTi=VZUxNFd7n-qwf5aiOeK5rkk8qBmo+kOpgg7up-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-01-05 19:36 ` James Bottomley
2011-01-05 19:36 ` James Bottomley
2011-01-05 19:49 ` Linus Torvalds
2011-01-05 20:35 ` James Bottomley
[not found] ` <1294256169.16957.18.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org>
2011-01-05 20:00 ` Russell King - ARM Linux
2011-01-05 20:00 ` Russell King - ARM Linux
2011-01-05 20:33 ` James Bottomley
2011-01-05 20:33 ` James Bottomley
2011-01-05 20:48 ` Linus Torvalds
[not found] ` <AANLkTimzzBsdtWcZtP5E_CH1hUZugGMoaHOiMdQJf764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-01-05 21:04 ` Russell King - ARM Linux
2011-01-05 21:04 ` Russell King - ARM Linux
2011-01-05 21:08 ` Linus Torvalds
2011-01-05 21:08 ` Linus Torvalds
[not found] ` <AANLkTi=EXXBTW7oWHq3D+PHsx=thF1CpkRjn0ax2p5rm-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-01-05 21:16 ` Trond Myklebust
2011-01-05 21:16 ` Trond Myklebust
[not found] ` <1294262208.2952.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2011-01-05 21:30 ` Linus Torvalds
2011-01-05 21:30 ` Linus Torvalds
2011-01-05 23:06 ` Trond Myklebust
2011-01-05 23:06 ` Trond Myklebust
2011-01-05 23:28 ` James Bottomley
2011-01-06 17:40 ` James Bottomley
2011-01-06 17:47 ` Trond Myklebust
2011-01-06 17:51 ` James Bottomley
2011-01-06 17:55 ` Linus Torvalds
2011-01-06 17:55 ` Linus Torvalds
2011-01-07 18:53 ` Trond Myklebust
[not found] ` <1294426405.2929.23.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2011-01-07 19:02 ` Russell King - ARM Linux
2011-01-07 19:02 ` Russell King - ARM Linux
2011-01-07 19:11 ` James Bottomley
2011-01-07 19:11 ` James Bottomley
[not found] ` <1294427467.4895.66.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org>
2011-01-08 16:49 ` Trond Myklebust
2011-01-08 16:49 ` Trond Myklebust
2011-01-08 23:15 ` Trond Myklebust
2011-01-08 23:15 ` Trond Myklebust
[not found] ` <1294528551.4181.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2011-01-10 10:50 ` Uwe Kleine-König
2011-01-10 10:50 ` Uwe Kleine-König
2011-01-10 16:25 ` Trond Myklebust
[not found] ` <1294676734.3349.10.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2011-01-10 17:08 ` Marc Kleine-Budde
2011-01-10 17:08 ` Marc Kleine-Budde
2011-01-10 17:20 ` Trond Myklebust
[not found] ` <1294680035.3349.19.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2011-01-10 17:26 ` Marc Kleine-Budde
2011-01-10 17:26 ` Marc Kleine-Budde
2011-01-10 19:25 ` Uwe Kleine-König
[not found] ` <20110110192552.GG24920-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
2011-01-10 19:29 ` Trond Myklebust
2011-01-10 19:29 ` Trond Myklebust
2011-01-10 19:31 ` James Bottomley
2011-01-10 19:34 ` Linus Torvalds
2011-01-10 19:34 ` Linus Torvalds
2011-01-10 20:15 ` Trond Myklebust
2011-01-10 12:44 ` Marc Kleine-Budde
2011-01-10 12:44 ` Marc Kleine-Budde
2011-01-07 19:13 ` Trond Myklebust
2011-01-07 19:05 ` James Bottomley
[not found] ` <1294335614.22825.154.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org>
2011-01-06 18:05 ` Russell King - ARM Linux
2011-01-06 18:05 ` Russell King - ARM Linux
2011-01-06 18:14 ` James Bottomley
2011-01-06 18:14 ` James Bottomley
[not found] ` <1294337670.22825.199.camel-0iu6Cu4xQGLYCGPCin2YbQ@public.gmane.org>
2011-01-06 18:25 ` James Bottomley
2011-01-06 18:25 ` James Bottomley
2011-01-06 21:07 ` James Bottomley
2011-01-06 21:07 ` James Bottomley
2011-01-06 20:19 ` John Stoffel
2011-01-06 20:19 ` John Stoffel
2011-01-05 23:28 ` Linus Torvalds
2011-01-05 23:28 ` Linus Torvalds
[not found] ` <AANLkTi=SjMinMp+m726GS1iehj6cQgNy1RqSoUqKhjtv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-01-05 23:59 ` Russell King - ARM Linux
2011-01-05 23:59 ` Russell King - ARM Linux
2011-01-05 21:16 ` James Bottomley
2011-01-05 21:16 ` James Bottomley
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).