* still nfs problems [Was: Linux 2.6.37-rc8]
@ 2011-01-05 19:05 James Bottomley
2011-01-05 19:18 ` Linus Torvalds
0 siblings, 1 reply; 79+ messages in thread
From: James Bottomley @ 2011-01-05 19:05 UTC (permalink / raw)
To: linux-arm-kernel
[sorry for the unthreaded insertion. We're seeing this on parisc too]
> On Wed, Jan 05, 2011 at 10:14:17AM -0500, Trond Myklebust wrote:
> > OK. So,the new behaviour in 2.6.37 is that we're writing to a series of
> > pages via the usual kmap_atomic()/kunmap_atomic() and kmap()/kunmap()
> > interfaces, but we can end up reading them via a virtual address range
> > that gets set up via vm_map_ram() (that range gets set up before the
> > write occurs).
>
> kmap of lowmem pages will always reuses the existing kernel direct
> mapping, so there won't be a problem there.
>
> > Do we perhaps need an invalidate_kernel_vmap_range() before we can read
> > the data on ARM in this kind of scenario?
>
> Firstly, vm_map_ram() does no cache maintainence of any sort, nor does
> it take care of page colouring - so any architecture where cache aliasing
> can occur will see this problem. It should not limited to ARM.
>
> Secondly, no, invalidate_kernel_vmap_range() probably isn't sufficient.
> There's two problems here:
>
> addr = kmap(lowmem_page);
> *addr = stuff;
> kunmap(lowmem_page);
>
> Such lowmem pages are accessed through their kernel direct mapping.
>
> ptr = vm_map_ram(lowmem_page);
> read = *ptr;
>
> This creates a new mapping which can alias with the kernel direct mapping.
> Now, as this is a new mapping, there should be no cache lines associated
> with it. (Looking at vm_unmap_ram(), it calls free_unmap_vmap_area_addr(),
> free_unmap_vmap_area(), which then calls flush_cache_vunmap() on the
> region. vb_free() also calls flush_cache_vunmap() too.)
>
> If the write after kmap() hits an already present cache line, the cache
> line will be updated, but it won't be written back to memory. So, on
> a subsequent vm_map_ram(), with any kind of aliasing cache, there's
> no guarantee that you'll hit that cache line and read the data just
> written there.
>
> The kernel direct mapping would need to be flushed.
>
> I'm really getting to the point of hating the poliferation of RAM
> remapping interfaces - it's going to (and is) causing nothing but lots
> of pain on virtual cache architectures, needing more and more cache
> flushing interfaces to be created.
>
> Is there any other solution to this?
I think the solution for the kernel direct mapping problem is to take
the expected flushes and invalidates into kmap/kunmap[_atomic]. I think
the original reason for not doing this was efficiency: the user should
know what they did with the data (i.e. if they're only reading it, it
doesn't need to be flushed on unmap). However, the difficulty of
getting all this right seems to outweigh the efficiency of only using
the necessary flushing. At least on some architectures, we can look at
the TLB flags to see if the page was dirtied (and only flush if it was).
James
^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 19:05 still nfs problems [Was: Linux 2.6.37-rc8] James Bottomley @ 2011-01-05 19:18 ` Linus Torvalds 2011-01-05 19:36 ` James Bottomley 0 siblings, 1 reply; 79+ messages in thread From: Linus Torvalds @ 2011-01-05 19:18 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > I think the solution for the kernel direct mapping problem is to take > the expected flushes and invalidates into kmap/kunmap[_atomic]. No, we really can't do that. Most of the time, the kmap() is the only way we access the page anyway, so flushing things would just be stupid. Why waste time and energy on doing something pointless? In fact, kmap() here is a total non-issue. It's not the kmap() that introduces any virtual aliases, and never has been. It's the "vm_map_ram()" that is the problem. Unlike the kmap(), that really _does_ introduce a virtual alias, and is a problem for any virtual cache. So don't blame kmap(). It's innocent and irrelevant - the bug could happen entirely without it (think a 64-bit address space that doesn't even _have_ kmap, but has software that mixes vm_map_ram() with non-mapped accesses). Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 19:18 ` Linus Torvalds @ 2011-01-05 19:36 ` James Bottomley 2011-01-05 19:49 ` Linus Torvalds 2011-01-05 20:00 ` Russell King - ARM Linux 0 siblings, 2 replies; 79+ messages in thread From: James Bottomley @ 2011-01-05 19:36 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > I think the solution for the kernel direct mapping problem is to take > > the expected flushes and invalidates into kmap/kunmap[_atomic]. > > No, we really can't do that. Most of the time, the kmap() is the only > way we access the page anyway, so flushing things would just be > stupid. Why waste time and energy on doing something pointless? It's hardly pointless. The kmap sets up an inequivalent alias in the cache. When you write to the kmap region, you dirty the CPU caches for that alias. If you tear down the mapping without flushing, the CPU will write out the cache lines at its leisure. If you access the line via the other mapping *before* the CPU does writeout, you see stale data. When the kernel dirties a kmap region, it always has to flush somehow before kunmap. One of the problems here is that that flush isn't in the NFS code. > In fact, kmap() here is a total non-issue. It's not the kmap() that > introduces any virtual aliases, and never has been. It's the > "vm_map_ram()" that is the problem. Unlike the kmap(), that really > _does_ introduce a virtual alias, and is a problem for any virtual > cache. > > So don't blame kmap(). It's innocent and irrelevant - the bug could > happen entirely without it (think a 64-bit address space that doesn't > even _have_ kmap, but has software that mixes vm_map_ram() with > non-mapped accesses). I didn't say it was kmap's entire problem ... I just said, can't we simplify some of this by consolidating the flushing into the interfaces. James ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 19:36 ` James Bottomley @ 2011-01-05 19:49 ` Linus Torvalds 2011-01-05 20:35 ` James Bottomley 2011-01-05 20:00 ` Russell King - ARM Linux 1 sibling, 1 reply; 79+ messages in thread From: Linus Torvalds @ 2011-01-05 19:49 UTC (permalink / raw) To: linux-arm-kernel 2011/1/5 James Bottomley <James.Bottomley@hansenpartnership.com>: >> >> No, we really can't do that. Most of the time, the kmap() is the only >> way we access the page anyway, so flushing things would just be >> stupid. Why waste time and energy on doing something pointless? > > It's hardly pointless. ?The kmap sets up an inequivalent alias in the > cache. NO IT DOES NOT. Stop arguing, when you are so wrong. kmap() does not create any aliases. For low-memory, it just returns the physical address. No alias. And for high memory, there is no equivalent low memory address to alias _with_. Now, when you actually mix multiple kmap's and you have a virtually based cache, then the kmap's obviously need to flush that particular page when switching between each other. But that has nothing to do with the actual page being kmap'ed, it's entirely an internal issue about the particular virtual memory area being re-used. And ARM (and any other virtually based CPU) already does that in __kunmap_atomic(). But notice the case of the low-mem. And understand that you are WRONG about the "inequivalent alias" thing. So I repeat: this has absolutely *NOTHING* to do with kmap(). Stop blathering. It's _purely_ an issue of vm_map_ram(). Nothing else. Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 19:49 ` Linus Torvalds @ 2011-01-05 20:35 ` James Bottomley 0 siblings, 0 replies; 79+ messages in thread From: James Bottomley @ 2011-01-05 20:35 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 11:49 -0800, Linus Torvalds wrote: > 2011/1/5 James Bottomley <James.Bottomley@hansenpartnership.com>: > >> > >> No, we really can't do that. Most of the time, the kmap() is the only > >> way we access the page anyway, so flushing things would just be > >> stupid. Why waste time and energy on doing something pointless? > > > > It's hardly pointless. The kmap sets up an inequivalent alias in the > > cache. > > NO IT DOES NOT. Well, it does ... but not in this case because the page is freshly allocated (which I missed before) so it has no use cache colour yet. James > Stop arguing, when you are so wrong. > > kmap() does not create any aliases. For low-memory, it just returns > the physical address. No alias. And for high memory, there is no > equivalent low memory address to alias _with_. > > Now, when you actually mix multiple kmap's and you have a virtually > based cache, then the kmap's obviously need to flush that particular > page when switching between each other. But that has nothing to do > with the actual page being kmap'ed, it's entirely an internal issue > about the particular virtual memory area being re-used. And ARM (and > any other virtually based CPU) already does that in __kunmap_atomic(). > > But notice the case of the low-mem. And understand that you are WRONG > about the "inequivalent alias" thing. > > So I repeat: this has absolutely *NOTHING* to do with kmap(). Stop blathering. > > It's _purely_ an issue of vm_map_ram(). Nothing else. > > Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 19:36 ` James Bottomley 2011-01-05 19:49 ` Linus Torvalds @ 2011-01-05 20:00 ` Russell King - ARM Linux 2011-01-05 20:33 ` James Bottomley 1 sibling, 1 reply; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 20:00 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 01:36:09PM -0600, James Bottomley wrote: > On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote: > > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > I think the solution for the kernel direct mapping problem is to take > > > the expected flushes and invalidates into kmap/kunmap[_atomic]. > > > > No, we really can't do that. Most of the time, the kmap() is the only > > way we access the page anyway, so flushing things would just be > > stupid. Why waste time and energy on doing something pointless? > > It's hardly pointless. The kmap sets up an inequivalent alias in the > cache. No it doesn't. For pages which are inaccessible, it sets up a mapping for those pages. On aliasing cache architectures, when you tear down such a mapping, you have to flush the cache before you do so (otherwise you can end up with cache lines existing in the cache for inaccessible mappings.) For lowmem pages, kmap() (should always) bypass the 'setup mapping' stage because all lowmem pages are already accessible. So kunmap() doesn't do anything - just like the !HIGHMEM implementation for these macros. So, for highmem-enabled systems: low_addr = kmap_atomic(lowmem_page); high_addr = kmap_atomic(highmem_page); results in low_addr in the kernel direct-mapped region, and high_addr in the kmap_atomic region. ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 20:00 ` Russell King - ARM Linux @ 2011-01-05 20:33 ` James Bottomley 2011-01-05 20:48 ` Linus Torvalds 0 siblings, 1 reply; 79+ messages in thread From: James Bottomley @ 2011-01-05 20:33 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 20:00 +0000, Russell King - ARM Linux wrote: > On Wed, Jan 05, 2011 at 01:36:09PM -0600, James Bottomley wrote: > > On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote: > > > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley > > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > > > I think the solution for the kernel direct mapping problem is to take > > > > the expected flushes and invalidates into kmap/kunmap[_atomic]. > > > > > > No, we really can't do that. Most of the time, the kmap() is the only > > > way we access the page anyway, so flushing things would just be > > > stupid. Why waste time and energy on doing something pointless? > > > > It's hardly pointless. The kmap sets up an inequivalent alias in the > > cache. > > No it doesn't. For pages which are inaccessible, it sets up a mapping > for those pages. On aliasing cache architectures, when you tear down > such a mapping, you have to flush the cache before you do so (otherwise > you can end up with cache lines existing in the cache for inaccessible > mappings.) > > For lowmem pages, kmap() (should always) bypass the 'setup mapping' stage > because all lowmem pages are already accessible. So kunmap() doesn't > do anything - just like the !HIGHMEM implementation for these macros. well, that depends. For us on parisc, kmap of a user page in !HIGHMEM sets up an inequivalent aliase still ... because the cache colour of the user and kernel virtual addresses are different. Depending on the return path to userspace, we still usually have to flush to get the user to see the changes the kernel has made. James > So, for highmem-enabled systems: > > low_addr = kmap_atomic(lowmem_page); > high_addr = kmap_atomic(highmem_page); > > results in low_addr in the kernel direct-mapped region, and high_addr > in the kmap_atomic region. ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 20:33 ` James Bottomley @ 2011-01-05 20:48 ` Linus Torvalds 2011-01-05 21:04 ` Russell King - ARM Linux 2011-01-05 21:16 ` James Bottomley 0 siblings, 2 replies; 79+ messages in thread From: Linus Torvalds @ 2011-01-05 20:48 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 5, 2011 at 12:33 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > well, that depends. ?For us on parisc, kmap of a user page in !HIGHMEM > sets up an inequivalent aliase still ... because the cache colour of the > user and kernel virtual addresses are different. ?Depending on the > return path to userspace, we still usually have to flush to get the user > to see the changes the kernel has made. Umm. Again, that has nothing to do with kmap(). This time it's about the user space mapping. Repeat after me: even without the kmap(), the kernel access to that mapping would have caused cache aliases. See? Once more, the kmap() is entirely innocent. You can have a non-highmem mapping that you never use kmap for, and that you map into user space, and you'd see exactly the same aliases. Notice? Look ma, no kmap(). So clearly kmap() is not the issue. The issue continues to be a totally separate virtual mapping. Whether it's a user mapping or vm_map_ram() is obviously immaterial - as far as the CPU is concerned, there is no difference between the two (apart from the trivial differences of virtual location and permissions). (You can also force the problem with vmalloc() an then following the kernel page tables, but I hope nobody does that any more. I suspect I'm wrong, though, there's probably code that mixes vmalloc and physical page accesses in various drivers) Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 20:48 ` Linus Torvalds @ 2011-01-05 21:04 ` Russell King - ARM Linux 2011-01-05 21:08 ` Linus Torvalds 2011-01-05 21:16 ` James Bottomley 1 sibling, 1 reply; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 21:04 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote: > (You can also force the problem with vmalloc() an then following the > kernel page tables, but I hope nobody does that any more. I suspect > I'm wrong, though, there's probably code that mixes vmalloc and > physical page accesses in various drivers) Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be deprecated then? ;) However, we seem have new ways of doing this - rather than asking vmalloc() for some memory, and then getting at the pages by following the page tables, we now have ways to create mappings using arrays of struct pages and access them via their already known mappings: - vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot) - map_kernel_range_noflush(unsigned long addr, unsigned long size, pgprot_t prot, struct page **pages) - map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages) - vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot) So really it's the same problem, just created by some other easier to use methods. ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:04 ` Russell King - ARM Linux @ 2011-01-05 21:08 ` Linus Torvalds 2011-01-05 21:16 ` Trond Myklebust 0 siblings, 1 reply; 79+ messages in thread From: Linus Torvalds @ 2011-01-05 21:08 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 5, 2011 at 1:04 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote: >> (You can also force the problem with vmalloc() an then following the >> kernel page tables, but I hope nobody does that any more. I suspect >> I'm wrong, though, there's probably code that mixes vmalloc and >> physical page accesses in various drivers) > > Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be > deprecated then? ;) I do think that the "modern" way of doing it is "vmap()"/"vm_map_ram()" and friends, and it should be preferred over using vmalloc() and then looking up the pages. But in the end, the two approaches really are equivalent, so it's not like it really matters. So I don't think we need to deprecate things officially, but obviously we should make people more aware of the whole virtual alias thing that crops up whenever you use any of these approaches. Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:08 ` Linus Torvalds @ 2011-01-05 21:16 ` Trond Myklebust 2011-01-05 21:30 ` Linus Torvalds 0 siblings, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2011-01-05 21:16 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 13:08 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 1:04 PM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > > On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote: > >> (You can also force the problem with vmalloc() an then following the > >> kernel page tables, but I hope nobody does that any more. I suspect > >> I'm wrong, though, there's probably code that mixes vmalloc and > >> physical page accesses in various drivers) > > > > Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be > > deprecated then? ;) > > I do think that the "modern" way of doing it is > "vmap()"/"vm_map_ram()" and friends, and it should be preferred over > using vmalloc() and then looking up the pages. > > But in the end, the two approaches really are equivalent, so it's not > like it really matters. So I don't think we need to deprecate things > officially, but obviously we should make people more aware of the > whole virtual alias thing that crops up whenever you use any of these > approaches. So what should be the preferred way to ensure data gets flushed when you've written directly to a page, and then want to read through the vm_map_ram() virtual range? Should we be adding new semantics to flush_kernel_dcache_page()? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:16 ` Trond Myklebust @ 2011-01-05 21:30 ` Linus Torvalds 2011-01-05 23:06 ` Trond Myklebust 0 siblings, 1 reply; 79+ messages in thread From: Linus Torvalds @ 2011-01-05 21:30 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > So what should be the preferred way to ensure data gets flushed when > you've written directly to a page, and then want to read through the > vm_map_ram() virtual range? Should we be adding new semantics to > flush_kernel_dcache_page()? The "preferred way" is actually simple: "don't do that". IOW, if some page is accessed through a virtual mapping you've set up, then _always_ access it through that virtual mapping. Now, when that is impossible (and yes, it sometimes is), then you should flush after doing all writes. And if you do the write through the regular kernel mapping, you should use flush_dcache_page(). And if you did it through the virtual mapping, you should use "flush_kernel_vmap_range()" or whatever. NOTE! I really didn't look those up very closely, and if the accesses can happen concurrently you are basically screwed, so you do need to do locking or something else to guarantee that there is some nice sequential order. And maybe I forgot something. Which is why I do suggest "don't do that" as a primary approach to the problem if at all possible. Oh, and you may need to flush before reading too (and many writes do end up being "read-modify-write" cycles) in case it's possible that you have stale data from a previous read that was then invalidated by a write to the aliasing address. Even if that write was flushed out, the stale read data may exist at the virtual address. I forget what all we required - in the end the only sane model is "virtual caches suck so bad that anybody who does them should be laughed at for being a retard". Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 21:30 ` Linus Torvalds @ 2011-01-05 23:06 ` Trond Myklebust 2011-01-05 23:28 ` James Bottomley 2011-01-05 23:28 ` Linus Torvalds 0 siblings, 2 replies; 79+ messages in thread From: Trond Myklebust @ 2011-01-05 23:06 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 13:30 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > So what should be the preferred way to ensure data gets flushed when > > you've written directly to a page, and then want to read through the > > vm_map_ram() virtual range? Should we be adding new semantics to > > flush_kernel_dcache_page()? > > The "preferred way" is actually simple: "don't do that". IOW, if some > page is accessed through a virtual mapping you've set up, then > _always_ access it through that virtual mapping. > > Now, when that is impossible (and yes, it sometimes is), then you > should flush after doing all writes. And if you do the write through > the regular kernel mapping, you should use flush_dcache_page(). And if > you did it through the virtual mapping, you should use > "flush_kernel_vmap_range()" or whatever. > > NOTE! I really didn't look those up very closely, and if the accesses > can happen concurrently you are basically screwed, so you do need to > do locking or something else to guarantee that there is some nice > sequential order. And maybe I forgot something. Which is why I do > suggest "don't do that" as a primary approach to the problem if at all > possible. > > Oh, and you may need to flush before reading too (and many writes do > end up being "read-modify-write" cycles) in case it's possible that > you have stale data from a previous read that was then invalidated by > a write to the aliasing address. Even if that write was flushed out, > the stale read data may exist at the virtual address. I forget what > all we required - in the end the only sane model is "virtual caches > suck so bad that anybody who does them should be laughed at for being > a retard". Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), which takes care of invalidating the cache prior to a virtual address read. My question was specifically about the write through the regular kernel mapping: according to Russell and my reading of the cachetlb.txt documentation, flush_dcache_page() is only guaranteed to have an effect on page cache pages. flush_kernel_dcache_page() (not to be confused with flush_dcache_page) would appear to be the closest fit according to my reading of the documentation, however the ARM implementation appears to be a no-op... -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 23:06 ` Trond Myklebust @ 2011-01-05 23:28 ` James Bottomley 2011-01-06 17:40 ` James Bottomley 2011-01-05 23:28 ` Linus Torvalds 1 sibling, 1 reply; 79+ messages in thread From: James Bottomley @ 2011-01-05 23:28 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 18:06 -0500, Trond Myklebust wrote: > On Wed, 2011-01-05 at 13:30 -0800, Linus Torvalds wrote: > > On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust > > <Trond.Myklebust@netapp.com> wrote: > > > > > > So what should be the preferred way to ensure data gets flushed when > > > you've written directly to a page, and then want to read through the > > > vm_map_ram() virtual range? Should we be adding new semantics to > > > flush_kernel_dcache_page()? > > > > The "preferred way" is actually simple: "don't do that". IOW, if some > > page is accessed through a virtual mapping you've set up, then > > _always_ access it through that virtual mapping. > > > > Now, when that is impossible (and yes, it sometimes is), then you > > should flush after doing all writes. And if you do the write through > > the regular kernel mapping, you should use flush_dcache_page(). And if > > you did it through the virtual mapping, you should use > > "flush_kernel_vmap_range()" or whatever. > > > > NOTE! I really didn't look those up very closely, and if the accesses > > can happen concurrently you are basically screwed, so you do need to > > do locking or something else to guarantee that there is some nice > > sequential order. And maybe I forgot something. Which is why I do > > suggest "don't do that" as a primary approach to the problem if at all > > possible. > > > > Oh, and you may need to flush before reading too (and many writes do > > end up being "read-modify-write" cycles) in case it's possible that > > you have stale data from a previous read that was then invalidated by > > a write to the aliasing address. Even if that write was flushed out, > > the stale read data may exist at the virtual address. I forget what > > all we required - in the end the only sane model is "virtual caches > > suck so bad that anybody who does them should be laughed at for being > > a retard". > > Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), > which takes care of invalidating the cache prior to a virtual address > read. > > My question was specifically about the write through the regular kernel > mapping: according to Russell and my reading of the cachetlb.txt > documentation, flush_dcache_page() is only guaranteed to have an effect > on page cache pages. > flush_kernel_dcache_page() (not to be confused with flush_dcache_page) > would appear to be the closest fit according to my reading of the > documentation, however the ARM implementation appears to be a no-op... It depends on exactly what you're doing. In the worst case, (ping pong reads and writes through both aliases) you have to flush and invalidate both alias 1 alias 2 each time you access on one and then another. Can you explain how the code works? it looks to me like you read the xdr stuff through the vmap region then write it out directly to the pages? *if* this is just a conversion, *and* you never need to read the new data through the vmap alias, you might be able to get away with a flush_dcache_page in nfs_readdir_release_array(). If the access pattern is more complex, you'll need more stuff splashed through the loop (including vmap invalidation/flushing). Is there any way you could just rewrite nfs_readdir_add_to_array() to use the vmap address instead of doing a kmap? That way everything will go through a single alias and not end up with this incoherency. James ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 23:28 ` James Bottomley @ 2011-01-06 17:40 ` James Bottomley 2011-01-06 17:47 ` Trond Myklebust ` (2 more replies) 0 siblings, 3 replies; 79+ messages in thread From: James Bottomley @ 2011-01-06 17:40 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > Can you explain how the code works? it looks to me like you read the xdr > stuff through the vmap region then write it out directly to the pages? OK, I think I see how this is supposed to work: It's a sequential loop of reading in via the pages (i.e. through the kernel mapping) and then updating those pages via the vmap. In which case, I think this patch is what you need. The theory of operation is that the readdir on pages actually uses the network DMA operations to perform, so when it's finished, the underlying page is up to date. After this you invalidate the vmap range, so we have no cache lines above it (so it picks up the values from the uptodate page). Finally, after the operation on the vmap region has finished, you flush it so that any updated contents go back to the pages themselves before the next iteration begins. Does this look right to people? I've verified it fixes the issues on parisc. James --- diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 996dd89..bde1911 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, if (status < 0) break; pglen = status; + + invalidate_kernel_vmap_range(pages_ptr, pglen); + status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); if (status < 0) { if (status == -ENOSPC) status = 0; break; } + flush_kernel_vmap_range(pages_ptr, pglen); } while (array->eof_index < 0); nfs_readdir_free_large_page(pages_ptr, pages, array_size); ^ permalink raw reply related [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:40 ` James Bottomley @ 2011-01-06 17:47 ` Trond Myklebust 2011-01-06 17:51 ` James Bottomley 2011-01-06 17:55 ` Linus Torvalds 2011-01-06 18:05 ` Russell King - ARM Linux 2011-01-06 20:19 ` John Stoffel 2 siblings, 2 replies; 79+ messages in thread From: Trond Myklebust @ 2011-01-06 17:47 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2011-01-06 at 11:40 -0600, James Bottomley wrote: > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > > Can you explain how the code works? it looks to me like you read the xdr > > stuff through the vmap region then write it out directly to the pages? > > OK, I think I see how this is supposed to work: It's a sequential loop > of reading in via the pages (i.e. through the kernel mapping) and then > updating those pages via the vmap. In which case, I think this patch is > what you need. > > The theory of operation is that the readdir on pages actually uses the > network DMA operations to perform, so when it's finished, the underlying > page is up to date. After this you invalidate the vmap range, so we > have no cache lines above it (so it picks up the values from the > uptodate page). Finally, after the operation on the vmap region has > finished, you flush it so that any updated contents go back to the pages > themselves before the next iteration begins. > > Does this look right to people? I've verified it fixes the issues on > parisc. > > James > > --- > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index 996dd89..bde1911 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > if (status < 0) > break; > pglen = status; > + > + invalidate_kernel_vmap_range(pages_ptr, pglen); > + > status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); > if (status < 0) { > if (status == -ENOSPC) > status = 0; > break; > } > + flush_kernel_vmap_range(pages_ptr, pglen); Why is this line needed? We're not writing through the virtual mapping. We checked using just the invalidate_kernel_vmap_range(), and that appeared to suffice to fix the problem on ARM. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:47 ` Trond Myklebust @ 2011-01-06 17:51 ` James Bottomley 2011-01-06 17:55 ` Linus Torvalds 1 sibling, 0 replies; 79+ messages in thread From: James Bottomley @ 2011-01-06 17:51 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2011-01-06 at 12:47 -0500, Trond Myklebust wrote: > On Thu, 2011-01-06 at 11:40 -0600, James Bottomley wrote: > > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > > > Can you explain how the code works? it looks to me like you read the xdr > > > stuff through the vmap region then write it out directly to the pages? > > > > OK, I think I see how this is supposed to work: It's a sequential loop > > of reading in via the pages (i.e. through the kernel mapping) and then > > updating those pages via the vmap. In which case, I think this patch is > > what you need. > > > > The theory of operation is that the readdir on pages actually uses the > > network DMA operations to perform, so when it's finished, the underlying > > page is up to date. After this you invalidate the vmap range, so we > > have no cache lines above it (so it picks up the values from the > > uptodate page). Finally, after the operation on the vmap region has > > finished, you flush it so that any updated contents go back to the pages > > themselves before the next iteration begins. > > > > Does this look right to people? I've verified it fixes the issues on > > parisc. > > > > James > > > > --- > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > > index 996dd89..bde1911 100644 > > --- a/fs/nfs/dir.c > > +++ b/fs/nfs/dir.c > > @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > > if (status < 0) > > break; > > pglen = status; > > + > > + invalidate_kernel_vmap_range(pages_ptr, pglen); > > + > > status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); > > if (status < 0) { > > if (status == -ENOSPC) > > status = 0; > > break; > > } > > + flush_kernel_vmap_range(pages_ptr, pglen); > > Why is this line needed? We're not writing through the virtual mapping. If you're not altering it, it isn't ... the problem on parisc is that invalidate is a nop for us because flush does it all, but I can fix that. James > We checked using just the invalidate_kernel_vmap_range(), and that > appeared to suffice to fix the problem on ARM. > > Cheers > Trond ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:47 ` Trond Myklebust 2011-01-06 17:51 ` James Bottomley @ 2011-01-06 17:55 ` Linus Torvalds 2011-01-07 18:53 ` Trond Myklebust 1 sibling, 1 reply; 79+ messages in thread From: Linus Torvalds @ 2011-01-06 17:55 UTC (permalink / raw) To: linux-arm-kernel On Thu, Jan 6, 2011 at 9:47 AM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > Why is this line needed? We're not writing through the virtual mapping. I haven't looked at the sequence of accesses, but you need to be _very_ aware that "write-through" is absolutely NOT sufficient for cache coherency. In cache coherency, you have three options: - true coherency (eg physically indexed/tagged caches) - exclusion (eg virtual caches, but with an exclusion guarantee that guarantees that aliases cannot happen: either by using physical tagging or by not allowing cases that could cause virtual aliases) - write-through AND non-cached reads (ie "no caching at all"). You seem to be forgetting the "no cached reads" part. It's not sufficient to flush after a write - you need to make sure that you also don't have a cached copy of the alias for the read. So "We're not writing through the virtual mapping" is NOT a sufficient excuse. If you're reading through the virtual mapping, you need to make sure that the virtual mapping is flushed _after_ any writes through any other mapping and _before_ any reads through the virtual one. This is why you really really really generally don't want to have aliasing. Purely virtual caches are pure crap. Really. Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:55 ` Linus Torvalds @ 2011-01-07 18:53 ` Trond Myklebust 2011-01-07 19:02 ` Russell King - ARM Linux 2011-01-07 19:05 ` James Bottomley 0 siblings, 2 replies; 79+ messages in thread From: Trond Myklebust @ 2011-01-07 18:53 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2011-01-06 at 09:55 -0800, Linus Torvalds wrote: > On Thu, Jan 6, 2011 at 9:47 AM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > Why is this line needed? We're not writing through the virtual mapping. > > I haven't looked at the sequence of accesses, but you need to be > _very_ aware that "write-through" is absolutely NOT sufficient for > cache coherency. > > In cache coherency, you have three options: > > - true coherency (eg physically indexed/tagged caches) > > - exclusion (eg virtual caches, but with an exclusion guarantee that > guarantees that aliases cannot happen: either by using physical > tagging or by not allowing cases that could cause virtual aliases) > > - write-through AND non-cached reads (ie "no caching at all"). > > You seem to be forgetting the "no cached reads" part. It's not > sufficient to flush after a write - you need to make sure that you > also don't have a cached copy of the alias for the read. > > So "We're not writing through the virtual mapping" is NOT a sufficient > excuse. If you're reading through the virtual mapping, you need to > make sure that the virtual mapping is flushed _after_ any writes > through any other mapping and _before_ any reads through the virtual > one. I'm aware of that. That part should be taken care of by the call to invalidate_kernel_vmap_range() which was in both James and my patch. There is already code in the SUNRPC layer that calls flush_dcache_page() after writing (although as Russell pointed out earlier, that is apparently a no-op for non-page cache pages such as these). > This is why you really really really generally don't want to have > aliasing. Purely virtual caches are pure crap. Really. Well, it looks as if NOMMU is giving us problems due to the lack of a vm_map_ram() (see https://bugzilla.kernel.org/show_bug.cgi?id=26262). I'd still like to keep the existing code for those architectures that don't have problems, since that allows us to send 32k READDIR requests instead of being limited to 4k. For large directories, that is a clear win. For the NOMMU case we will just go back to using a single page for storage (and 4k READDIR requests only). Should I just do the same for architectures like ARM and PARISC? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-07 18:53 ` Trond Myklebust @ 2011-01-07 19:02 ` Russell King - ARM Linux 2011-01-07 19:11 ` James Bottomley 2011-01-07 19:13 ` Trond Myklebust 2011-01-07 19:05 ` James Bottomley 1 sibling, 2 replies; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-07 19:02 UTC (permalink / raw) To: linux-arm-kernel On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote: > I'd still like to keep the existing code for those architectures that > don't have problems, since that allows us to send 32k READDIR requests > instead of being limited to 4k. For large directories, that is a clear > win. > For the NOMMU case we will just go back to using a single page for > storage (and 4k READDIR requests only). Should I just do the same for > architectures like ARM and PARISC? I think you said that readdir reads via the vmalloc mapping of the group of pages, but XDR writes to the individual pages. As I understand NFS, you receive a packet, you then have to use XDR to unpack the data, which you presumably write into the set of struct page *'s using kmap? Isn't a solution to have XDR write directly into the vmalloc mapping rather than using struct page * and kmap? ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-07 19:02 ` Russell King - ARM Linux @ 2011-01-07 19:11 ` James Bottomley 2011-01-08 16:49 ` Trond Myklebust 2011-01-07 19:13 ` Trond Myklebust 1 sibling, 1 reply; 79+ messages in thread From: James Bottomley @ 2011-01-07 19:11 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2011-01-07 at 19:02 +0000, Russell King - ARM Linux wrote: > On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote: > > I'd still like to keep the existing code for those architectures that > > don't have problems, since that allows us to send 32k READDIR requests > > instead of being limited to 4k. For large directories, that is a clear > > win. > > For the NOMMU case we will just go back to using a single page for > > storage (and 4k READDIR requests only). Should I just do the same for > > architectures like ARM and PARISC? > > I think you said that readdir reads via the vmalloc mapping of the > group of pages, but XDR writes to the individual pages. Actually it's the other way around, but the point still stands. > As I understand NFS, you receive a packet, you then have to use XDR > to unpack the data, which you presumably write into the set of > struct page *'s using kmap? > > Isn't a solution to have XDR write directly into the vmalloc mapping > rather than using struct page * and kmap? So, unfortuantely, I looked at doing this and we can't. the ->readdir() call takes an array of pages, not a kernel virtual address of the pages, so there's no way to tell it to use a different mapping from the usual kernel one on them. On the other hand, the xdr routines, since they take the pages anyway, could use a scatterlist approach to writing through the kernel mapping instead of using vmap ... we have all the machinery for this in lib/scatterlist.c ... it's not designed for this case, since it's designed to allow arbitrary linear reads and writes on a block scatterlist, but the principle is the same ... it looks like it would be rather a big patch, though ... James ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-07 19:11 ` James Bottomley @ 2011-01-08 16:49 ` Trond Myklebust 2011-01-08 23:15 ` Trond Myklebust 0 siblings, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2011-01-08 16:49 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2011-01-07 at 13:11 -0600, James Bottomley wrote: > On the other hand, the xdr routines, since they take the pages anyway, > could use a scatterlist approach to writing through the kernel mapping > instead of using vmap ... we have all the machinery for this in > lib/scatterlist.c ... it's not designed for this case, since it's > designed to allow arbitrary linear reads and writes on a block > scatterlist, but the principle is the same ... it looks like it would be > rather a big patch, though ... The following alternative seems to work for me, but has only been lightly tested so far. It's a bit large for a stable patch, but not too ungainly. It modifies xdr_stream, adding the ability to iterate through page data. To avoid kmap()/kunmap(), it does require that pages be allocated in lowmem, but since the only use case here is when using page arrays as temporary buffers, that seems like an acceptable compromise. Cheers Trond --------------------------------------------------------------------------------- >From f87f13e3198ec536c1b9cfe19ad47df4fb922382 Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Fri, 7 Jan 2011 18:51:33 -0500 Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir vm_map_ram() is not available on NOMMU platforms, and causes trouble on incoherrent architectures such as ARM when we access the page data through both the direct and the virtual mapping. The alternative is to use the direct mapping to access page data for the case when we are not crossing a page boundary, but to copy the data into a linear scratch buffer when we are accessing data that spans page boundaries. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 45 ++++++++--------- fs/nfs/nfs2xdr.c | 6 -- fs/nfs/nfs3xdr.c | 6 -- fs/nfs/nfs4xdr.c | 6 -- include/linux/sunrpc/xdr.h | 4 +- net/sunrpc/xdr.c | 115 ++++++++++++++++++++++++++++++++++++-------- 6 files changed, 120 insertions(+), 62 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 996dd89..ad9e5e0 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -33,7 +33,6 @@ #include <linux/namei.h> #include <linux/mount.h> #include <linux/sched.h> -#include <linux/vmalloc.h> #include <linux/kmemleak.h> #include "delegation.h" @@ -459,25 +458,27 @@ out: /* Perform conversion from xdr to cache array */ static int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, - void *xdr_page, struct page *page, unsigned int buflen) + struct page **xdr_pages, struct page *page, unsigned int buflen) { struct xdr_stream stream; - struct xdr_buf buf; - __be32 *ptr = xdr_page; + struct xdr_buf buf = { + .pages = xdr_pages, + .page_len = buflen, + .buflen = buflen, + .len = buflen, + }; struct nfs_cache_array *array; unsigned int count = 0; + struct page *scratch; int status; - buf.head->iov_base = xdr_page; - buf.head->iov_len = buflen; - buf.tail->iov_len = 0; - buf.page_base = 0; - buf.page_len = 0; - buf.buflen = buf.head->iov_len; - buf.len = buf.head->iov_len; - - xdr_init_decode(&stream, &buf, ptr); + scratch = alloc_page(GFP_KERNEL); + if (scratch == NULL) + return -ENOMEM; + xdr_init_decode(&stream, &buf, NULL); + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE); + xdr_enter_page(&stream, buflen); do { status = xdr_decode(desc, entry, &stream); @@ -506,6 +507,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en } else status = PTR_ERR(array); } + + put_page(scratch); return status; } @@ -521,7 +524,6 @@ static void nfs_readdir_free_large_page(void *ptr, struct page **pages, unsigned int npages) { - vm_unmap_ram(ptr, npages); nfs_readdir_free_pagearray(pages, npages); } @@ -530,9 +532,8 @@ void nfs_readdir_free_large_page(void *ptr, struct page **pages, * to nfs_readdir_free_large_page */ static -void *nfs_readdir_large_page(struct page **pages, unsigned int npages) +int nfs_readdir_large_page(struct page **pages, unsigned int npages) { - void *ptr; unsigned int i; for (i = 0; i < npages; i++) { @@ -541,13 +542,11 @@ void *nfs_readdir_large_page(struct page **pages, unsigned int npages) goto out_freepages; pages[i] = page; } + return 0; - ptr = vm_map_ram(pages, npages, 0, PAGE_KERNEL); - if (!IS_ERR_OR_NULL(ptr)) - return ptr; out_freepages: nfs_readdir_free_pagearray(pages, i); - return NULL; + return -ENOMEM; } static @@ -577,8 +576,8 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, memset(array, 0, sizeof(struct nfs_cache_array)); array->eof_index = -1; - pages_ptr = nfs_readdir_large_page(pages, array_size); - if (!pages_ptr) + status = nfs_readdir_large_page(pages, array_size); + if (status < 0) goto out_release_array; do { unsigned int pglen; @@ -587,7 +586,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, if (status < 0) break; pglen = status; - status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); + status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen); if (status < 0) { if (status == -ENOSPC) status = 0; diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c index 5914a19..b382a1b 100644 --- a/fs/nfs/nfs2xdr.c +++ b/fs/nfs/nfs2xdr.c @@ -487,12 +487,6 @@ nfs_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_se entry->d_type = DT_UNKNOWN; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c index f6cc60f..ba91236 100644 --- a/fs/nfs/nfs3xdr.c +++ b/fs/nfs/nfs3xdr.c @@ -647,12 +647,6 @@ nfs3_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_s memset((u8*)(entry->fh), 0, sizeof(*entry->fh)); } - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c index 9f1826b..0662a98 100644 --- a/fs/nfs/nfs4xdr.c +++ b/fs/nfs/nfs4xdr.c @@ -6215,12 +6215,6 @@ __be32 *nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, if (verify_attr_len(xdr, p, len) < 0) goto out_overflow; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h index 498ab93..7783c68 100644 --- a/include/linux/sunrpc/xdr.h +++ b/include/linux/sunrpc/xdr.h @@ -201,6 +201,8 @@ struct xdr_stream { __be32 *end; /* end of available buffer space */ struct kvec *iov; /* pointer to the current kvec */ + struct kvec scratch; /* Scratch buffer */ + struct page **page_ptr; /* pointer to the current page */ }; extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); @@ -208,7 +210,7 @@ extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes); extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int base, unsigned int len); extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); -extern __be32 *xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes); +extern void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen); extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes); extern void xdr_read_pages(struct xdr_stream *xdr, unsigned int len); extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len); diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index cd9e841..cff0974 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -569,29 +569,112 @@ void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p) xdr->iov = iov; xdr->p = p; xdr->end = (__be32 *)((char *)iov->iov_base + len); + xdr->page_ptr = NULL; + xdr->scratch.iov_base = NULL; + xdr->scratch.iov_len = 0; } EXPORT_SYMBOL_GPL(xdr_init_decode); /** - * xdr_inline_peek - Allow read-ahead in the XDR data stream + * xdr_set_scratch_buffer - Attach a scratch buffer for decoding data. * @xdr: pointer to xdr_stream struct - * @nbytes: number of bytes of data to decode + * @buf: pointer to an empty buffer + * @buflen: size of 'buf' * - * Check if the input buffer is long enough to enable us to decode - * 'nbytes' more bytes of data starting at the current position. - * If so return the current pointer without updating the current - * pointer position. + * The scratch buffer is used when decoding from an array of pages. + * If an xdr_inline_decode() call spans across page boundaries, then + * we copy the data into the scratch buffer in order to allow linear + * access. */ -__be32 * xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes) +void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen) +{ + xdr->scratch.iov_base = buf; + xdr->scratch.iov_len = buflen; +} +EXPORT_SYMBOL_GPL(xdr_set_scratch_buffer); + +static int xdr_set_page_base(struct xdr_stream *xdr, + unsigned int base, unsigned int len) +{ + unsigned int pgnr; + unsigned int maxlen; + unsigned int pgoff; + unsigned int pgend; + void *kaddr; + + maxlen = xdr->buf->page_len; + if (base >= maxlen) + return -EINVAL; + maxlen -= base; + + if (len > maxlen) + len = maxlen; + + base += xdr->buf->page_base; + + pgnr = base >> PAGE_SHIFT; + xdr->page_ptr = &xdr->buf->pages[pgnr]; + kaddr = page_address(*xdr->page_ptr); + + pgoff = base & ~PAGE_MASK; + xdr->p = (__be32*)(kaddr + pgoff); + + pgend = pgoff + len; + if (pgend > PAGE_SIZE) + pgend = PAGE_SIZE; + xdr->end = (__be32*)(kaddr + pgend); + return 0; +} + +static int xdr_set_next_page(struct xdr_stream *xdr) +{ + unsigned int newbase; + + newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT; + newbase -= xdr->buf->page_base; + + return xdr_set_page_base(xdr, newbase, PAGE_SIZE); +} + +static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { __be32 *p = xdr->p; __be32 *q = p + XDR_QUADLEN(nbytes); if (unlikely(q > xdr->end || q < p)) return NULL; + xdr->p = q; return p; } -EXPORT_SYMBOL_GPL(xdr_inline_peek); + +static __be32 *xdr_page_inline_decode(struct xdr_stream *xdr, size_t nbytes) +{ + size_t cplen; + void *cpdest; + __be32 *p; + + if (xdr->p == xdr->end && xdr_set_next_page(xdr) < 0) + return NULL; + + p = __xdr_inline_decode(xdr, nbytes); + if (p != NULL) + return p; + + if (nbytes > xdr->scratch.iov_len) + return NULL; + cplen = (char *)xdr->end - (char *)xdr->p; + cpdest = xdr->scratch.iov_base; + memcpy(cpdest, xdr->p, cplen); + cpdest += cplen; + nbytes -= cplen; + if (xdr_set_next_page(xdr) < 0) + return NULL; + p = __xdr_inline_decode(xdr, nbytes); + if (p == NULL) + return NULL; + memcpy(cpdest, p, nbytes); + return xdr->scratch.iov_base; +} /** * xdr_inline_decode - Retrieve non-page XDR data to decode @@ -605,13 +688,9 @@ EXPORT_SYMBOL_GPL(xdr_inline_peek); */ __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { - __be32 *p = xdr->p; - __be32 *q = p + XDR_QUADLEN(nbytes); - - if (unlikely(q > xdr->end || q < p)) - return NULL; - xdr->p = q; - return p; + if (xdr->page_ptr == NULL) + return __xdr_inline_decode(xdr, nbytes); + return xdr_page_inline_decode(xdr, nbytes); } EXPORT_SYMBOL_GPL(xdr_inline_decode); @@ -671,16 +750,12 @@ EXPORT_SYMBOL_GPL(xdr_read_pages); */ void xdr_enter_page(struct xdr_stream *xdr, unsigned int len) { - char * kaddr = page_address(xdr->buf->pages[0]); xdr_read_pages(xdr, len); /* * Position current pointer at beginning of tail, and * set remaining message length. */ - if (len > PAGE_CACHE_SIZE - xdr->buf->page_base) - len = PAGE_CACHE_SIZE - xdr->buf->page_base; - xdr->p = (__be32 *)(kaddr + xdr->buf->page_base); - xdr->end = (__be32 *)((char *)xdr->p + len); + xdr_set_page_base(xdr, 0, len); } EXPORT_SYMBOL_GPL(xdr_enter_page); -- 1.7.3.4 -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply related [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-08 16:49 ` Trond Myklebust @ 2011-01-08 23:15 ` Trond Myklebust 2011-01-10 10:50 ` Uwe Kleine-König 2011-01-10 12:44 ` Marc Kleine-Budde 0 siblings, 2 replies; 79+ messages in thread From: Trond Myklebust @ 2011-01-08 23:15 UTC (permalink / raw) To: linux-arm-kernel On Sat, 2011-01-08 at 11:49 -0500, Trond Myklebust wrote: > On Fri, 2011-01-07 at 13:11 -0600, James Bottomley wrote: > > On the other hand, the xdr routines, since they take the pages anyway, > > could use a scatterlist approach to writing through the kernel mapping > > instead of using vmap ... we have all the machinery for this in > > lib/scatterlist.c ... it's not designed for this case, since it's > > designed to allow arbitrary linear reads and writes on a block > > scatterlist, but the principle is the same ... it looks like it would be > > rather a big patch, though ... > > The following alternative seems to work for me, but has only been > lightly tested so far. It's a bit large for a stable patch, but not too > ungainly. > > It modifies xdr_stream, adding the ability to iterate through page data. > To avoid kmap()/kunmap(), it does require that pages be allocated in > lowmem, but since the only use case here is when using page arrays as > temporary buffers, that seems like an acceptable compromise. ...and here is an update which makes the whole process transparent to the decoder. It basically teaches xdr_inline_decode() how to switch buffers when it hits the end of the current iovec and/or page. Cheers Trond ----------------------------------------------------------------------------------- >From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Sat, 8 Jan 2011 17:45:38 -0500 Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir vm_map_ram() is not available on NOMMU platforms, and causes trouble on incoherrent architectures such as ARM when we access the page data through both the direct and the virtual mapping. The alternative is to use the direct mapping to access page data for the case when we are not crossing a page boundary, but to copy the data into a linear scratch buffer when we are accessing data that spans page boundaries. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/dir.c | 44 ++++++------- fs/nfs/nfs2xdr.c | 6 -- fs/nfs/nfs3xdr.c | 6 -- fs/nfs/nfs4xdr.c | 6 -- include/linux/sunrpc/xdr.h | 4 +- net/sunrpc/xdr.c | 155 +++++++++++++++++++++++++++++++++++--------- 6 files changed, 148 insertions(+), 73 deletions(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 996dd89..0108cf4 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -33,7 +33,6 @@ #include <linux/namei.h> #include <linux/mount.h> #include <linux/sched.h> -#include <linux/vmalloc.h> #include <linux/kmemleak.h> #include "delegation.h" @@ -459,25 +458,26 @@ out: /* Perform conversion from xdr to cache array */ static int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, - void *xdr_page, struct page *page, unsigned int buflen) + struct page **xdr_pages, struct page *page, unsigned int buflen) { struct xdr_stream stream; - struct xdr_buf buf; - __be32 *ptr = xdr_page; + struct xdr_buf buf = { + .pages = xdr_pages, + .page_len = buflen, + .buflen = buflen, + .len = buflen, + }; + struct page *scratch; struct nfs_cache_array *array; unsigned int count = 0; int status; - buf.head->iov_base = xdr_page; - buf.head->iov_len = buflen; - buf.tail->iov_len = 0; - buf.page_base = 0; - buf.page_len = 0; - buf.buflen = buf.head->iov_len; - buf.len = buf.head->iov_len; - - xdr_init_decode(&stream, &buf, ptr); + scratch = alloc_page(GFP_KERNEL); + if (scratch == NULL) + return -ENOMEM; + xdr_init_decode(&stream, &buf, NULL); + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE); do { status = xdr_decode(desc, entry, &stream); @@ -506,6 +506,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en } else status = PTR_ERR(array); } + + put_page(scratch); return status; } @@ -521,7 +523,6 @@ static void nfs_readdir_free_large_page(void *ptr, struct page **pages, unsigned int npages) { - vm_unmap_ram(ptr, npages); nfs_readdir_free_pagearray(pages, npages); } @@ -530,9 +531,8 @@ void nfs_readdir_free_large_page(void *ptr, struct page **pages, * to nfs_readdir_free_large_page */ static -void *nfs_readdir_large_page(struct page **pages, unsigned int npages) +int nfs_readdir_large_page(struct page **pages, unsigned int npages) { - void *ptr; unsigned int i; for (i = 0; i < npages; i++) { @@ -541,13 +541,11 @@ void *nfs_readdir_large_page(struct page **pages, unsigned int npages) goto out_freepages; pages[i] = page; } + return 0; - ptr = vm_map_ram(pages, npages, 0, PAGE_KERNEL); - if (!IS_ERR_OR_NULL(ptr)) - return ptr; out_freepages: nfs_readdir_free_pagearray(pages, i); - return NULL; + return -ENOMEM; } static @@ -577,8 +575,8 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, memset(array, 0, sizeof(struct nfs_cache_array)); array->eof_index = -1; - pages_ptr = nfs_readdir_large_page(pages, array_size); - if (!pages_ptr) + status = nfs_readdir_large_page(pages, array_size); + if (status < 0) goto out_release_array; do { unsigned int pglen; @@ -587,7 +585,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, if (status < 0) break; pglen = status; - status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); + status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen); if (status < 0) { if (status == -ENOSPC) status = 0; diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c index 5914a19..b382a1b 100644 --- a/fs/nfs/nfs2xdr.c +++ b/fs/nfs/nfs2xdr.c @@ -487,12 +487,6 @@ nfs_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_se entry->d_type = DT_UNKNOWN; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c index f6cc60f..ba91236 100644 --- a/fs/nfs/nfs3xdr.c +++ b/fs/nfs/nfs3xdr.c @@ -647,12 +647,6 @@ nfs3_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_s memset((u8*)(entry->fh), 0, sizeof(*entry->fh)); } - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c index 9f1826b..0662a98 100644 --- a/fs/nfs/nfs4xdr.c +++ b/fs/nfs/nfs4xdr.c @@ -6215,12 +6215,6 @@ __be32 *nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, if (verify_attr_len(xdr, p, len) < 0) goto out_overflow; - p = xdr_inline_peek(xdr, 8); - if (p != NULL) - entry->eof = !p[0] && p[1]; - else - entry->eof = 0; - return p; out_overflow: diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h index 498ab93..7783c68 100644 --- a/include/linux/sunrpc/xdr.h +++ b/include/linux/sunrpc/xdr.h @@ -201,6 +201,8 @@ struct xdr_stream { __be32 *end; /* end of available buffer space */ struct kvec *iov; /* pointer to the current kvec */ + struct kvec scratch; /* Scratch buffer */ + struct page **page_ptr; /* pointer to the current page */ }; extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); @@ -208,7 +210,7 @@ extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes); extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int base, unsigned int len); extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); -extern __be32 *xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes); +extern void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen); extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes); extern void xdr_read_pages(struct xdr_stream *xdr, unsigned int len); extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len); diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index cd9e841..679cd67 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -552,6 +552,74 @@ void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int b } EXPORT_SYMBOL_GPL(xdr_write_pages); +static void xdr_set_iov(struct xdr_stream *xdr, struct kvec *iov, + __be32 *p, unsigned int len) +{ + if (len > iov->iov_len) + len = iov->iov_len; + if (p == NULL) + p = (__be32*)iov->iov_base; + xdr->p = p; + xdr->end = (__be32*)(iov->iov_base + len); + xdr->iov = iov; + xdr->page_ptr = NULL; +} + +static int xdr_set_page_base(struct xdr_stream *xdr, + unsigned int base, unsigned int len) +{ + unsigned int pgnr; + unsigned int maxlen; + unsigned int pgoff; + unsigned int pgend; + void *kaddr; + + maxlen = xdr->buf->page_len; + if (base >= maxlen) + return -EINVAL; + maxlen -= base; + if (len > maxlen) + len = maxlen; + + base += xdr->buf->page_base; + + pgnr = base >> PAGE_SHIFT; + xdr->page_ptr = &xdr->buf->pages[pgnr]; + kaddr = page_address(*xdr->page_ptr); + + pgoff = base & ~PAGE_MASK; + xdr->p = (__be32*)(kaddr + pgoff); + + pgend = pgoff + len; + if (pgend > PAGE_SIZE) + pgend = PAGE_SIZE; + xdr->end = (__be32*)(kaddr + pgend); + xdr->iov = NULL; + return 0; +} + +static void xdr_set_next_page(struct xdr_stream *xdr) +{ + unsigned int newbase; + + newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT; + newbase -= xdr->buf->page_base; + + if (xdr_set_page_base(xdr, newbase, PAGE_SIZE) < 0) + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); +} + +static bool xdr_set_next_buffer(struct xdr_stream *xdr) +{ + if (xdr->page_ptr != NULL) + xdr_set_next_page(xdr); + else if (xdr->iov == xdr->buf->head) { + if (xdr_set_page_base(xdr, 0, PAGE_SIZE) < 0) + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); + } + return xdr->p != xdr->end; +} + /** * xdr_init_decode - Initialize an xdr_stream for decoding data. * @xdr: pointer to xdr_stream struct @@ -560,41 +628,67 @@ EXPORT_SYMBOL_GPL(xdr_write_pages); */ void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p) { - struct kvec *iov = buf->head; - unsigned int len = iov->iov_len; - - if (len > buf->len) - len = buf->len; xdr->buf = buf; - xdr->iov = iov; - xdr->p = p; - xdr->end = (__be32 *)((char *)iov->iov_base + len); + xdr->scratch.iov_base = NULL; + xdr->scratch.iov_len = 0; + if (buf->head[0].iov_len != 0) + xdr_set_iov(xdr, buf->head, p, buf->len); + else if (buf->page_len != 0) + xdr_set_page_base(xdr, 0, buf->len); } EXPORT_SYMBOL_GPL(xdr_init_decode); -/** - * xdr_inline_peek - Allow read-ahead in the XDR data stream - * @xdr: pointer to xdr_stream struct - * @nbytes: number of bytes of data to decode - * - * Check if the input buffer is long enough to enable us to decode - * 'nbytes' more bytes of data starting at the current position. - * If so return the current pointer without updating the current - * pointer position. - */ -__be32 * xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes) +static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { __be32 *p = xdr->p; __be32 *q = p + XDR_QUADLEN(nbytes); if (unlikely(q > xdr->end || q < p)) return NULL; + xdr->p = q; return p; } -EXPORT_SYMBOL_GPL(xdr_inline_peek); /** - * xdr_inline_decode - Retrieve non-page XDR data to decode + * xdr_set_scratch_buffer - Attach a scratch buffer for decoding data. + * @xdr: pointer to xdr_stream struct + * @buf: pointer to an empty buffer + * @buflen: size of 'buf' + * + * The scratch buffer is used when decoding from an array of pages. + * If an xdr_inline_decode() call spans across page boundaries, then + * we copy the data into the scratch buffer in order to allow linear + * access. + */ +void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen) +{ + xdr->scratch.iov_base = buf; + xdr->scratch.iov_len = buflen; +} +EXPORT_SYMBOL_GPL(xdr_set_scratch_buffer); + +static __be32 *xdr_copy_to_scratch(struct xdr_stream *xdr, size_t nbytes) +{ + __be32 *p; + void *cpdest = xdr->scratch.iov_base; + size_t cplen = (char *)xdr->end - (char *)xdr->p; + + if (nbytes > xdr->scratch.iov_len) + return NULL; + memcpy(cpdest, xdr->p, cplen); + cpdest += cplen; + nbytes -= cplen; + if (!xdr_set_next_buffer(xdr)) + return NULL; + p = __xdr_inline_decode(xdr, nbytes); + if (p == NULL) + return NULL; + memcpy(cpdest, p, nbytes); + return xdr->scratch.iov_base; +} + +/** + * xdr_inline_decode - Retrieve XDR data to decode * @xdr: pointer to xdr_stream struct * @nbytes: number of bytes of data to decode * @@ -605,13 +699,16 @@ EXPORT_SYMBOL_GPL(xdr_inline_peek); */ __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { - __be32 *p = xdr->p; - __be32 *q = p + XDR_QUADLEN(nbytes); + __be32 *p; - if (unlikely(q > xdr->end || q < p)) + if (nbytes == 0) + return xdr->p; + if (xdr->p == xdr->end && !xdr_set_next_buffer(xdr)) return NULL; - xdr->p = q; - return p; + p = __xdr_inline_decode(xdr, nbytes); + if (p != NULL) + return p; + return xdr_copy_to_scratch(xdr, nbytes); } EXPORT_SYMBOL_GPL(xdr_inline_decode); @@ -671,16 +768,12 @@ EXPORT_SYMBOL_GPL(xdr_read_pages); */ void xdr_enter_page(struct xdr_stream *xdr, unsigned int len) { - char * kaddr = page_address(xdr->buf->pages[0]); xdr_read_pages(xdr, len); /* * Position current pointer at beginning of tail, and * set remaining message length. */ - if (len > PAGE_CACHE_SIZE - xdr->buf->page_base) - len = PAGE_CACHE_SIZE - xdr->buf->page_base; - xdr->p = (__be32 *)(kaddr + xdr->buf->page_base); - xdr->end = (__be32 *)((char *)xdr->p + len); + xdr_set_page_base(xdr, 0, len); } EXPORT_SYMBOL_GPL(xdr_enter_page); -- 1.7.3.4 -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply related [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-08 23:15 ` Trond Myklebust @ 2011-01-10 10:50 ` Uwe Kleine-König 2011-01-10 16:25 ` Trond Myklebust 2011-01-10 12:44 ` Marc Kleine-Budde 1 sibling, 1 reply; 79+ messages in thread From: Uwe Kleine-König @ 2011-01-10 10:50 UTC (permalink / raw) To: linux-arm-kernel Hi Trond, On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > ----------------------------------------------------------------------------------- It would be great if you could add a "8<" in the line above next time. Then git-am -c does the right thing (at least I think so). > From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001 > From: Trond Myklebust <Trond.Myklebust@netapp.com> > Date: Sat, 8 Jan 2011 17:45:38 -0500 > Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir > > vm_map_ram() is not available on NOMMU platforms, and causes trouble > on incoherrent architectures such as ARM when we access the page data > through both the direct and the virtual mapping. > > The alternative is to use the direct mapping to access page data > for the case when we are not crossing a page boundary, but to copy > the data into a linear scratch buffer when we are accessing data > that spans page boundaries. > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Uwe Kleine-K?nig <u.kleine-koenig@pengutronix.de> on tx28. Thanks Uwe > --- > fs/nfs/dir.c | 44 ++++++------- > fs/nfs/nfs2xdr.c | 6 -- > fs/nfs/nfs3xdr.c | 6 -- > fs/nfs/nfs4xdr.c | 6 -- > include/linux/sunrpc/xdr.h | 4 +- > net/sunrpc/xdr.c | 155 +++++++++++++++++++++++++++++++++++--------- > 6 files changed, 148 insertions(+), 73 deletions(-) > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index 996dd89..0108cf4 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -33,7 +33,6 @@ > #include <linux/namei.h> > #include <linux/mount.h> > #include <linux/sched.h> > -#include <linux/vmalloc.h> > #include <linux/kmemleak.h> > > #include "delegation.h" > @@ -459,25 +458,26 @@ out: > /* Perform conversion from xdr to cache array */ > static > int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, > - void *xdr_page, struct page *page, unsigned int buflen) > + struct page **xdr_pages, struct page *page, unsigned int buflen) > { > struct xdr_stream stream; > - struct xdr_buf buf; > - __be32 *ptr = xdr_page; > + struct xdr_buf buf = { > + .pages = xdr_pages, > + .page_len = buflen, > + .buflen = buflen, > + .len = buflen, > + }; > + struct page *scratch; > struct nfs_cache_array *array; > unsigned int count = 0; > int status; > > - buf.head->iov_base = xdr_page; > - buf.head->iov_len = buflen; > - buf.tail->iov_len = 0; > - buf.page_base = 0; > - buf.page_len = 0; > - buf.buflen = buf.head->iov_len; > - buf.len = buf.head->iov_len; > - > - xdr_init_decode(&stream, &buf, ptr); > + scratch = alloc_page(GFP_KERNEL); > + if (scratch == NULL) > + return -ENOMEM; > > + xdr_init_decode(&stream, &buf, NULL); > + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE); > > do { > status = xdr_decode(desc, entry, &stream); > @@ -506,6 +506,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en > } else > status = PTR_ERR(array); > } > + > + put_page(scratch); > return status; > } > > @@ -521,7 +523,6 @@ static > void nfs_readdir_free_large_page(void *ptr, struct page **pages, > unsigned int npages) > { > - vm_unmap_ram(ptr, npages); > nfs_readdir_free_pagearray(pages, npages); > } > > @@ -530,9 +531,8 @@ void nfs_readdir_free_large_page(void *ptr, struct page **pages, > * to nfs_readdir_free_large_page > */ > static > -void *nfs_readdir_large_page(struct page **pages, unsigned int npages) > +int nfs_readdir_large_page(struct page **pages, unsigned int npages) > { > - void *ptr; > unsigned int i; > > for (i = 0; i < npages; i++) { > @@ -541,13 +541,11 @@ void *nfs_readdir_large_page(struct page **pages, unsigned int npages) > goto out_freepages; > pages[i] = page; > } > + return 0; > > - ptr = vm_map_ram(pages, npages, 0, PAGE_KERNEL); > - if (!IS_ERR_OR_NULL(ptr)) > - return ptr; > out_freepages: > nfs_readdir_free_pagearray(pages, i); > - return NULL; > + return -ENOMEM; > } > > static > @@ -577,8 +575,8 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > memset(array, 0, sizeof(struct nfs_cache_array)); > array->eof_index = -1; > > - pages_ptr = nfs_readdir_large_page(pages, array_size); > - if (!pages_ptr) > + status = nfs_readdir_large_page(pages, array_size); > + if (status < 0) > goto out_release_array; > do { > unsigned int pglen; > @@ -587,7 +585,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > if (status < 0) > break; > pglen = status; > - status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); > + status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen); > if (status < 0) { > if (status == -ENOSPC) > status = 0; > diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c > index 5914a19..b382a1b 100644 > --- a/fs/nfs/nfs2xdr.c > +++ b/fs/nfs/nfs2xdr.c > @@ -487,12 +487,6 @@ nfs_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_se > > entry->d_type = DT_UNKNOWN; > > - p = xdr_inline_peek(xdr, 8); > - if (p != NULL) > - entry->eof = !p[0] && p[1]; > - else > - entry->eof = 0; > - > return p; > > out_overflow: > diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c > index f6cc60f..ba91236 100644 > --- a/fs/nfs/nfs3xdr.c > +++ b/fs/nfs/nfs3xdr.c > @@ -647,12 +647,6 @@ nfs3_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_s > memset((u8*)(entry->fh), 0, sizeof(*entry->fh)); > } > > - p = xdr_inline_peek(xdr, 8); > - if (p != NULL) > - entry->eof = !p[0] && p[1]; > - else > - entry->eof = 0; > - > return p; > > out_overflow: > diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c > index 9f1826b..0662a98 100644 > --- a/fs/nfs/nfs4xdr.c > +++ b/fs/nfs/nfs4xdr.c > @@ -6215,12 +6215,6 @@ __be32 *nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, > if (verify_attr_len(xdr, p, len) < 0) > goto out_overflow; > > - p = xdr_inline_peek(xdr, 8); > - if (p != NULL) > - entry->eof = !p[0] && p[1]; > - else > - entry->eof = 0; > - > return p; > > out_overflow: > diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h > index 498ab93..7783c68 100644 > --- a/include/linux/sunrpc/xdr.h > +++ b/include/linux/sunrpc/xdr.h > @@ -201,6 +201,8 @@ struct xdr_stream { > > __be32 *end; /* end of available buffer space */ > struct kvec *iov; /* pointer to the current kvec */ > + struct kvec scratch; /* Scratch buffer */ > + struct page **page_ptr; /* pointer to the current page */ > }; > > extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); > @@ -208,7 +210,7 @@ extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes); > extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, > unsigned int base, unsigned int len); > extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p); > -extern __be32 *xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes); > +extern void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen); > extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes); > extern void xdr_read_pages(struct xdr_stream *xdr, unsigned int len); > extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len); > diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c > index cd9e841..679cd67 100644 > --- a/net/sunrpc/xdr.c > +++ b/net/sunrpc/xdr.c > @@ -552,6 +552,74 @@ void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int b > } > EXPORT_SYMBOL_GPL(xdr_write_pages); > > +static void xdr_set_iov(struct xdr_stream *xdr, struct kvec *iov, > + __be32 *p, unsigned int len) > +{ > + if (len > iov->iov_len) > + len = iov->iov_len; > + if (p == NULL) > + p = (__be32*)iov->iov_base; > + xdr->p = p; > + xdr->end = (__be32*)(iov->iov_base + len); > + xdr->iov = iov; > + xdr->page_ptr = NULL; > +} > + > +static int xdr_set_page_base(struct xdr_stream *xdr, > + unsigned int base, unsigned int len) > +{ > + unsigned int pgnr; > + unsigned int maxlen; > + unsigned int pgoff; > + unsigned int pgend; > + void *kaddr; > + > + maxlen = xdr->buf->page_len; > + if (base >= maxlen) > + return -EINVAL; > + maxlen -= base; > + if (len > maxlen) > + len = maxlen; > + > + base += xdr->buf->page_base; > + > + pgnr = base >> PAGE_SHIFT; > + xdr->page_ptr = &xdr->buf->pages[pgnr]; > + kaddr = page_address(*xdr->page_ptr); > + > + pgoff = base & ~PAGE_MASK; > + xdr->p = (__be32*)(kaddr + pgoff); > + > + pgend = pgoff + len; > + if (pgend > PAGE_SIZE) > + pgend = PAGE_SIZE; > + xdr->end = (__be32*)(kaddr + pgend); > + xdr->iov = NULL; > + return 0; > +} > + > +static void xdr_set_next_page(struct xdr_stream *xdr) > +{ > + unsigned int newbase; > + > + newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT; > + newbase -= xdr->buf->page_base; > + > + if (xdr_set_page_base(xdr, newbase, PAGE_SIZE) < 0) > + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); > +} > + > +static bool xdr_set_next_buffer(struct xdr_stream *xdr) > +{ > + if (xdr->page_ptr != NULL) > + xdr_set_next_page(xdr); > + else if (xdr->iov == xdr->buf->head) { > + if (xdr_set_page_base(xdr, 0, PAGE_SIZE) < 0) > + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len); > + } > + return xdr->p != xdr->end; > +} > + > /** > * xdr_init_decode - Initialize an xdr_stream for decoding data. > * @xdr: pointer to xdr_stream struct > @@ -560,41 +628,67 @@ EXPORT_SYMBOL_GPL(xdr_write_pages); > */ > void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p) > { > - struct kvec *iov = buf->head; > - unsigned int len = iov->iov_len; > - > - if (len > buf->len) > - len = buf->len; > xdr->buf = buf; > - xdr->iov = iov; > - xdr->p = p; > - xdr->end = (__be32 *)((char *)iov->iov_base + len); > + xdr->scratch.iov_base = NULL; > + xdr->scratch.iov_len = 0; > + if (buf->head[0].iov_len != 0) > + xdr_set_iov(xdr, buf->head, p, buf->len); > + else if (buf->page_len != 0) > + xdr_set_page_base(xdr, 0, buf->len); > } > EXPORT_SYMBOL_GPL(xdr_init_decode); > > -/** > - * xdr_inline_peek - Allow read-ahead in the XDR data stream > - * @xdr: pointer to xdr_stream struct > - * @nbytes: number of bytes of data to decode > - * > - * Check if the input buffer is long enough to enable us to decode > - * 'nbytes' more bytes of data starting at the current position. > - * If so return the current pointer without updating the current > - * pointer position. > - */ > -__be32 * xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes) > +static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) > { > __be32 *p = xdr->p; > __be32 *q = p + XDR_QUADLEN(nbytes); > > if (unlikely(q > xdr->end || q < p)) > return NULL; > + xdr->p = q; > return p; > } > -EXPORT_SYMBOL_GPL(xdr_inline_peek); > > /** > - * xdr_inline_decode - Retrieve non-page XDR data to decode > + * xdr_set_scratch_buffer - Attach a scratch buffer for decoding data. > + * @xdr: pointer to xdr_stream struct > + * @buf: pointer to an empty buffer > + * @buflen: size of 'buf' > + * > + * The scratch buffer is used when decoding from an array of pages. > + * If an xdr_inline_decode() call spans across page boundaries, then > + * we copy the data into the scratch buffer in order to allow linear > + * access. > + */ > +void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen) > +{ > + xdr->scratch.iov_base = buf; > + xdr->scratch.iov_len = buflen; > +} > +EXPORT_SYMBOL_GPL(xdr_set_scratch_buffer); > + > +static __be32 *xdr_copy_to_scratch(struct xdr_stream *xdr, size_t nbytes) > +{ > + __be32 *p; > + void *cpdest = xdr->scratch.iov_base; > + size_t cplen = (char *)xdr->end - (char *)xdr->p; > + > + if (nbytes > xdr->scratch.iov_len) > + return NULL; > + memcpy(cpdest, xdr->p, cplen); > + cpdest += cplen; > + nbytes -= cplen; > + if (!xdr_set_next_buffer(xdr)) > + return NULL; > + p = __xdr_inline_decode(xdr, nbytes); > + if (p == NULL) > + return NULL; > + memcpy(cpdest, p, nbytes); > + return xdr->scratch.iov_base; > +} > + > +/** > + * xdr_inline_decode - Retrieve XDR data to decode > * @xdr: pointer to xdr_stream struct > * @nbytes: number of bytes of data to decode > * > @@ -605,13 +699,16 @@ EXPORT_SYMBOL_GPL(xdr_inline_peek); > */ > __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) > { > - __be32 *p = xdr->p; > - __be32 *q = p + XDR_QUADLEN(nbytes); > + __be32 *p; > > - if (unlikely(q > xdr->end || q < p)) > + if (nbytes == 0) > + return xdr->p; > + if (xdr->p == xdr->end && !xdr_set_next_buffer(xdr)) > return NULL; > - xdr->p = q; > - return p; > + p = __xdr_inline_decode(xdr, nbytes); > + if (p != NULL) > + return p; > + return xdr_copy_to_scratch(xdr, nbytes); > } > EXPORT_SYMBOL_GPL(xdr_inline_decode); > > @@ -671,16 +768,12 @@ EXPORT_SYMBOL_GPL(xdr_read_pages); > */ > void xdr_enter_page(struct xdr_stream *xdr, unsigned int len) > { > - char * kaddr = page_address(xdr->buf->pages[0]); > xdr_read_pages(xdr, len); > /* > * Position current pointer at beginning of tail, and > * set remaining message length. > */ > - if (len > PAGE_CACHE_SIZE - xdr->buf->page_base) > - len = PAGE_CACHE_SIZE - xdr->buf->page_base; > - xdr->p = (__be32 *)(kaddr + xdr->buf->page_base); > - xdr->end = (__be32 *)((char *)xdr->p + len); > + xdr_set_page_base(xdr, 0, len); > } > EXPORT_SYMBOL_GPL(xdr_enter_page); > > -- > 1.7.3.4 > > > > -- > Trond Myklebust > Linux NFS client maintainer > > NetApp > Trond.Myklebust at netapp.com > www.netapp.com > > -- Pengutronix e.K. | Uwe Kleine-K?nig | Industrial Linux Solutions | http://www.pengutronix.de/ | ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 10:50 ` Uwe Kleine-König @ 2011-01-10 16:25 ` Trond Myklebust 2011-01-10 17:08 ` Marc Kleine-Budde 0 siblings, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2011-01-10 16:25 UTC (permalink / raw) To: linux-arm-kernel On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-K?nig wrote: > Hi Trond, > > On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > > ----------------------------------------------------------------------------------- > It would be great if you could add a "8<" in the line above next time. > Then git-am -c does the right thing (at least I think so). Sorry. I wasn't aware of that particular idiom. So something like 8<------------------------------------------------------------ From: Trond Myklebust <Trond.Myklebust@netapp.com> Subject: ..... -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 16:25 ` Trond Myklebust @ 2011-01-10 17:08 ` Marc Kleine-Budde 2011-01-10 17:20 ` Trond Myklebust 0 siblings, 1 reply; 79+ messages in thread From: Marc Kleine-Budde @ 2011-01-10 17:08 UTC (permalink / raw) To: linux-arm-kernel On 01/10/2011 05:25 PM, Trond Myklebust wrote: > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-K?nig wrote: >> Hi Trond, >> >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: >>> ----------------------------------------------------------------------------------- >> It would be great if you could add a "8<" in the line above next time. >> Then git-am -c does the right thing (at least I think so). > > Sorry. I wasn't aware of that particular idiom. So something like > > 8<------------------------------------------------------------ > From: Trond Myklebust <Trond.Myklebust@netapp.com> > Subject: ..... Yes. >From "man git-mailinfo": "A line that mainly consists of scissors (either ">8" or "8<") and perforation (dash "-")" BTW: Is this patch a candidate for stable? regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 262 bytes Desc: OpenPGP digital signature URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110110/db4d5898/attachment.sig> ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 17:08 ` Marc Kleine-Budde @ 2011-01-10 17:20 ` Trond Myklebust 2011-01-10 17:26 ` Marc Kleine-Budde 2011-01-10 19:25 ` Uwe Kleine-König 0 siblings, 2 replies; 79+ messages in thread From: Trond Myklebust @ 2011-01-10 17:20 UTC (permalink / raw) To: linux-arm-kernel On Mon, 2011-01-10 at 18:08 +0100, Marc Kleine-Budde wrote: > On 01/10/2011 05:25 PM, Trond Myklebust wrote: > > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-K?nig wrote: > >> Hi Trond, > >> > >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > >>> ----------------------------------------------------------------------------------- > >> It would be great if you could add a "8<" in the line above next time. > >> Then git-am -c does the right thing (at least I think so). > > > > Sorry. I wasn't aware of that particular idiom. So something like > > > > 8<------------------------------------------------------------ > > From: Trond Myklebust <Trond.Myklebust@netapp.com> > > Subject: ..... > > Yes. > > From "man git-mailinfo": > "A line that mainly consists of scissors (either ">8" or "8<") and > perforation (dash "-")" > > BTW: > Is this patch a candidate for stable? Yes. I'm planning on sending it to the stable list after Linus merges it into mainline. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 17:20 ` Trond Myklebust @ 2011-01-10 17:26 ` Marc Kleine-Budde 2011-01-10 19:25 ` Uwe Kleine-König 1 sibling, 0 replies; 79+ messages in thread From: Marc Kleine-Budde @ 2011-01-10 17:26 UTC (permalink / raw) To: linux-arm-kernel On 01/10/2011 06:20 PM, Trond Myklebust wrote: >> Is this patch a candidate for stable? > > Yes. I'm planning on sending it to the stable list after Linus merges it > into mainline. Fine! regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 262 bytes Desc: OpenPGP digital signature URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110110/110762ce/attachment.sig> ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 17:20 ` Trond Myklebust 2011-01-10 17:26 ` Marc Kleine-Budde @ 2011-01-10 19:25 ` Uwe Kleine-König 2011-01-10 19:29 ` Trond Myklebust 1 sibling, 1 reply; 79+ messages in thread From: Uwe Kleine-König @ 2011-01-10 19:25 UTC (permalink / raw) To: linux-arm-kernel Hello Trond, On Mon, Jan 10, 2011 at 12:20:35PM -0500, Trond Myklebust wrote: > On Mon, 2011-01-10 at 18:08 +0100, Marc Kleine-Budde wrote: > > On 01/10/2011 05:25 PM, Trond Myklebust wrote: > > > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-K?nig wrote: > > >> Hi Trond, > > >> > > >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > > >>> ----------------------------------------------------------------------------------- > > >> It would be great if you could add a "8<" in the line above next time. > > >> Then git-am -c does the right thing (at least I think so). > > > > > > Sorry. I wasn't aware of that particular idiom. So something like > > > > > > 8<------------------------------------------------------------ > > > From: Trond Myklebust <Trond.Myklebust@netapp.com> > > > Subject: ..... > > > > Yes. > > > > From "man git-mailinfo": > > "A line that mainly consists of scissors (either ">8" or "8<") and > > perforation (dash "-")" > > > > BTW: > > Is this patch a candidate for stable? > > Yes. I'm planning on sending it to the stable list after Linus merges it > into mainline. So there is another idiom for you: just put Cc: stable at kernel.org in the S-o-b block and Greg will pick it off "automatically". (Just in case you don't know, and if you do, maybe someone else learned something.) Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-K?nig | Industrial Linux Solutions | http://www.pengutronix.de/ | ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 19:25 ` Uwe Kleine-König @ 2011-01-10 19:29 ` Trond Myklebust 2011-01-10 19:31 ` James Bottomley 2011-01-10 19:34 ` Linus Torvalds 0 siblings, 2 replies; 79+ messages in thread From: Trond Myklebust @ 2011-01-10 19:29 UTC (permalink / raw) To: linux-arm-kernel On Mon, 2011-01-10 at 20:25 +0100, Uwe Kleine-K?nig wrote: > Hello Trond, > > On Mon, Jan 10, 2011 at 12:20:35PM -0500, Trond Myklebust wrote: > > On Mon, 2011-01-10 at 18:08 +0100, Marc Kleine-Budde wrote: > > > On 01/10/2011 05:25 PM, Trond Myklebust wrote: > > > > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-K?nig wrote: > > > >> Hi Trond, > > > >> > > > >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote: > > > >>> ----------------------------------------------------------------------------------- > > > >> It would be great if you could add a "8<" in the line above next time. > > > >> Then git-am -c does the right thing (at least I think so). > > > > > > > > Sorry. I wasn't aware of that particular idiom. So something like > > > > > > > > 8<------------------------------------------------------------ > > > > From: Trond Myklebust <Trond.Myklebust@netapp.com> > > > > Subject: ..... > > > > > > Yes. > > > > > > From "man git-mailinfo": > > > "A line that mainly consists of scissors (either ">8" or "8<") and > > > perforation (dash "-")" > > > > > > BTW: > > > Is this patch a candidate for stable? > > > > Yes. I'm planning on sending it to the stable list after Linus merges it > > into mainline. > So there is another idiom for you: just put > > Cc: stable at kernel.org > > in the S-o-b block and Greg will pick it off "automatically". (Just in > case you don't know, and if you do, maybe someone else learned > something.) I usually do this, but there is a slight problem with that approach: Greg gets to do all the work of figuring out to which stable kernels this particular patch applies. In this case, since we're only talking about the 2.6.37 kernel, I prefer to use the mailing lists. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 19:29 ` Trond Myklebust @ 2011-01-10 19:31 ` James Bottomley 2011-01-10 19:34 ` Linus Torvalds 1 sibling, 0 replies; 79+ messages in thread From: James Bottomley @ 2011-01-10 19:31 UTC (permalink / raw) To: linux-arm-kernel On Mon, 2011-01-10 at 14:29 -0500, Trond Myklebust wrote: > I usually do this, but there is a slight problem with that approach: > Greg gets to do all the work of figuring out to which stable kernels > this particular patch applies. In this case, since we're only talking > about the 2.6.37 kernel, I prefer to use the mailing lists. So the non-standard, but accepted way of doing this is Cc: Stable Tree <stable@kernel.org> [2.6.37] James ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 19:29 ` Trond Myklebust 2011-01-10 19:31 ` James Bottomley @ 2011-01-10 19:34 ` Linus Torvalds 2011-01-10 20:15 ` Trond Myklebust 1 sibling, 1 reply; 79+ messages in thread From: Linus Torvalds @ 2011-01-10 19:34 UTC (permalink / raw) To: linux-arm-kernel 2011/1/10 Trond Myklebust <Trond.Myklebust@netapp.com>: > > I usually do this, but there is a slight problem with that approach: > Greg gets to do all the work of figuring out to which stable kernels > this particular patch applies. In this case, since we're only talking > about the 2.6.37 kernel, I prefer to use the mailing lists. Just do Cc: stable at kernel.org [2.6.37] or similar. It's quite common. So EVEN IF you want to email people around about the patch separately, do add the "Cc: stable" marker. It's worthwhile information about the patch for everybody involved. Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-10 19:34 ` Linus Torvalds @ 2011-01-10 20:15 ` Trond Myklebust 0 siblings, 0 replies; 79+ messages in thread From: Trond Myklebust @ 2011-01-10 20:15 UTC (permalink / raw) To: linux-arm-kernel On Mon, 2011-01-10 at 11:34 -0800, Linus Torvalds wrote: > 2011/1/10 Trond Myklebust <Trond.Myklebust@netapp.com>: > > > > I usually do this, but there is a slight problem with that approach: > > Greg gets to do all the work of figuring out to which stable kernels > > this particular patch applies. In this case, since we're only talking > > about the 2.6.37 kernel, I prefer to use the mailing lists. > > Just do > > Cc: stable at kernel.org [2.6.37] > > or similar. It's quite common. > > So EVEN IF you want to email people around about the patch separately, > do add the "Cc: stable" marker. It's worthwhile information about the > patch for everybody involved. OK. Patch description amended and recommitted in git. Thanks to all for the tips... -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-08 23:15 ` Trond Myklebust 2011-01-10 10:50 ` Uwe Kleine-König @ 2011-01-10 12:44 ` Marc Kleine-Budde 1 sibling, 0 replies; 79+ messages in thread From: Marc Kleine-Budde @ 2011-01-10 12:44 UTC (permalink / raw) To: linux-arm-kernel On 01/09/2011 12:15 AM, Trond Myklebust wrote: > From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001 > From: Trond Myklebust <Trond.Myklebust@netapp.com> > Date: Sat, 8 Jan 2011 17:45:38 -0500 > Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir > > vm_map_ram() is not available on NOMMU platforms, and causes trouble > on incoherrent architectures such as ARM when we access the page data > through both the direct and the virtual mapping. > > The alternative is to use the direct mapping to access page data > for the case when we are not crossing a page boundary, but to copy > the data into a linear scratch buffer when we are accessing data > that spans page boundaries. > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Marc Kleine-Budde <mkl@pengutronix.de> ..on AT91 (armv5) regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 262 bytes Desc: OpenPGP digital signature URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110110/f1fdce67/attachment.sig> ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-07 19:02 ` Russell King - ARM Linux 2011-01-07 19:11 ` James Bottomley @ 2011-01-07 19:13 ` Trond Myklebust 1 sibling, 0 replies; 79+ messages in thread From: Trond Myklebust @ 2011-01-07 19:13 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2011-01-07 at 19:02 +0000, Russell King - ARM Linux wrote: > On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote: > > I'd still like to keep the existing code for those architectures that > > don't have problems, since that allows us to send 32k READDIR requests > > instead of being limited to 4k. For large directories, that is a clear > > win. > > For the NOMMU case we will just go back to using a single page for > > storage (and 4k READDIR requests only). Should I just do the same for > > architectures like ARM and PARISC? > > I think you said that readdir reads via the vmalloc mapping of the > group of pages, but XDR writes to the individual pages. > > As I understand NFS, you receive a packet, you then have to use XDR > to unpack the data, which you presumably write into the set of > struct page *'s using kmap? No. The socket or RDMA layers place the data directly into the struct pages. We then unpack them in the XDR layer using the vmalloc mapping and place the resulting processed readdir data into the page cache. > Isn't a solution to have XDR write directly into the vmalloc mapping > rather than using struct page * and kmap? Unfortunately that isn't possible. :-( -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-07 18:53 ` Trond Myklebust 2011-01-07 19:02 ` Russell King - ARM Linux @ 2011-01-07 19:05 ` James Bottomley 1 sibling, 0 replies; 79+ messages in thread From: James Bottomley @ 2011-01-07 19:05 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2011-01-07 at 13:53 -0500, Trond Myklebust wrote: > There is already code in the SUNRPC layer that calls flush_dcache_page() > after writing (although as Russell pointed out earlier, that is > apparently a no-op for non-page cache pages such as these). Actually (and possibly fortunately) none of our flush_dcache_page() implementations do this (check for an actual non page cache page and nop if they find one). Although, they may according to the docs which say that flush_dcache_page() is only called on page cache pages. But it's definitely using the API outside its documented scope. We have lots of places in the VFS where we don't call flush_dcache_page() even after altering a kernel page (even in the page cache) if we know the page will never be mapped to userspace. The assumption here is that the kernel never sets up non-user aliases of these pages, so not doing the flushing is an optimisation since we only access them through the kernel address space. Of course, setting up vmap areas of these pages within the kernel violates this assumption. > > This is why you really really really generally don't want to have > > aliasing. Purely virtual caches are pure crap. Really. > > Well, it looks as if NOMMU is giving us problems due to the lack of a > vm_map_ram() (see https://bugzilla.kernel.org/show_bug.cgi?id=26262). > > I'd still like to keep the existing code for those architectures that > don't have problems, since that allows us to send 32k READDIR requests > instead of being limited to 4k. For large directories, that is a clear > win. > For the NOMMU case we will just go back to using a single page for > storage (and 4k READDIR requests only). Should I just do the same for > architectures like ARM and PARISC? Well, that would include any VI architecture (like SPARC and others) as well. However, I think we can just make the invalidate_kernel_vmap_range() work. James ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:40 ` James Bottomley 2011-01-06 17:47 ` Trond Myklebust @ 2011-01-06 18:05 ` Russell King - ARM Linux 2011-01-06 18:14 ` James Bottomley 2011-01-06 20:19 ` John Stoffel 2 siblings, 1 reply; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-06 18:05 UTC (permalink / raw) To: linux-arm-kernel On Thu, Jan 06, 2011 at 11:40:13AM -0600, James Bottomley wrote: > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > > Can you explain how the code works? it looks to me like you read the xdr > > stuff through the vmap region then write it out directly to the pages? > > OK, I think I see how this is supposed to work: It's a sequential loop > of reading in via the pages (i.e. through the kernel mapping) and then > updating those pages via the vmap. In which case, I think this patch is > what you need. > > The theory of operation is that the readdir on pages actually uses the > network DMA operations to perform, so when it's finished, the underlying What network DMA operations - what if your NIC doesn't do DMA because it's an SMSC device? ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 18:05 ` Russell King - ARM Linux @ 2011-01-06 18:14 ` James Bottomley 2011-01-06 18:25 ` James Bottomley 0 siblings, 1 reply; 79+ messages in thread From: James Bottomley @ 2011-01-06 18:14 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2011-01-06 at 18:05 +0000, Russell King - ARM Linux wrote: > On Thu, Jan 06, 2011 at 11:40:13AM -0600, James Bottomley wrote: > > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: > > > Can you explain how the code works? it looks to me like you read the xdr > > > stuff through the vmap region then write it out directly to the pages? > > > > OK, I think I see how this is supposed to work: It's a sequential loop > > of reading in via the pages (i.e. through the kernel mapping) and then > > updating those pages via the vmap. In which case, I think this patch is > > what you need. > > > > The theory of operation is that the readdir on pages actually uses the > > network DMA operations to perform, so when it's finished, the underlying > > What network DMA operations - what if your NIC doesn't do DMA because > it's an SMSC device? So this is the danger area ... we might be caught by our own flushing tricks. I can't test this on parisc since all my network drivers use DMA (which automatically coheres the kernel mapping by flush/invalidate). What should happen is that the kernel mapping pages go through the ->readdir() path. Any return from this has to be ready to map the pages back to user space, so the kernel alias has to be flushed to make the underlying page up to date. The exception is pages we haven't yet mapped to userspace. Here we set the PG_dcache_dirty bit (sparc trick) but don't flush the page, since we expect the addition of a userspace mapping will detect this case and do the flush and clear the bit before the mapping goes live. I assume you're thinking that because this page is allocated and freed internally to NFS, it never gets a userspace mapping and therefore, we can return from ->readdir() with a dirty kernel cache (and the corresponding flag set)? I think that is a possible hypothesis in certain cases. James ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 18:14 ` James Bottomley @ 2011-01-06 18:25 ` James Bottomley 2011-01-06 21:07 ` James Bottomley 0 siblings, 1 reply; 79+ messages in thread From: James Bottomley @ 2011-01-06 18:25 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2011-01-06 at 12:14 -0600, James Bottomley wrote: > On Thu, 2011-01-06 at 18:05 +0000, Russell King - ARM Linux wrote: > > What network DMA operations - what if your NIC doesn't do DMA because > > it's an SMSC device? > > So this is the danger area ... we might be caught by our own flushing > tricks. I can't test this on parisc since all my network drivers use > DMA (which automatically coheres the kernel mapping by > flush/invalidate). > > What should happen is that the kernel mapping pages go through the > ->readdir() path. Any return from this has to be ready to map the pages > back to user space, so the kernel alias has to be flushed to make the > underlying page up to date. > > The exception is pages we haven't yet mapped to userspace. Here we set > the PG_dcache_dirty bit (sparc trick) but don't flush the page, since we > expect the addition of a userspace mapping will detect this case and do > the flush and clear the bit before the mapping goes live. I assume > you're thinking that because this page is allocated and freed internally > to NFS, it never gets a userspace mapping and therefore, we can return > from ->readdir() with a dirty kernel cache (and the corresponding flag > set)? I think that is a possible hypothesis in certain cases. OK, so thinking about this, it seems that the only danger is actually what NFS is doing: reading cache pages via a vmap. In that case, since the requirement is to invalidate the vmap range to prepare for read, we could have invalidate_kernel_vmap_range loop over the underlying pages and flush them through the kernel alias if the architecture specific flag indicates their contents might be dirty. The loop adds expense that is probably largely unnecessary to invalidate_kernel_vmap_range() but the alternative is adding to the API proliferation with something that only flushes the kernel pages if the arch specific flag says they're dirty. James ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 18:25 ` James Bottomley @ 2011-01-06 21:07 ` James Bottomley 0 siblings, 0 replies; 79+ messages in thread From: James Bottomley @ 2011-01-06 21:07 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2011-01-06 at 12:25 -0600, James Bottomley wrote: > OK, so thinking about this, it seems that the only danger is actually > what NFS is doing: reading cache pages via a vmap. In that case, since > the requirement is to invalidate the vmap range to prepare for read, we > could have invalidate_kernel_vmap_range loop over the underlying pages > and flush them through the kernel alias if the architecture specific > flag indicates their contents might be dirty. > > The loop adds expense that is probably largely unnecessary to > invalidate_kernel_vmap_range() but the alternative is adding to the API > proliferation with something that only flushes the kernel pages if the > arch specific flag says they're dirty. This is what I think the arm patch would look like (example only: I can't compile it). Is something like this too expensive? the loop can't be optimised away because of the need to check the pages (and vmalloc_to_page is a three level page table lookup). James --- diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h index 3acd8fa..34469ca 100644 --- a/arch/arm/include/asm/cacheflush.h +++ b/arch/arm/include/asm/cacheflush.h @@ -414,8 +414,17 @@ static inline void flush_kernel_vmap_range(void *addr, int size) } static inline void invalidate_kernel_vmap_range(void *addr, int size) { - if ((cache_is_vivt() || cache_is_vipt_aliasing())) - __cpuc_flush_dcache_area(addr, (size_t)size); + if ((cache_is_vivt() || cache_is_vipt_aliasing())) { + void *cursor = addr; + + for ( ; cursor < addr + size; cursor += PAGE_SIZE) { + struct page *page = vmalloc_to_page(cursor); + + if (!test_and_set_bit(PG_dcache_clean, &page->flags)) + __flush_dcache_page(page_mapping(page), page); + } + __cpuc_flush_dcache_area(addr, (size_t)size); + } } #define ARCH_HAS_FLUSH_ANON_PAGE ^ permalink raw reply related [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-06 17:40 ` James Bottomley 2011-01-06 17:47 ` Trond Myklebust 2011-01-06 18:05 ` Russell King - ARM Linux @ 2011-01-06 20:19 ` John Stoffel 2 siblings, 0 replies; 79+ messages in thread From: John Stoffel @ 2011-01-06 20:19 UTC (permalink / raw) To: linux-arm-kernel >>>>> "James" == James Bottomley <James.Bottomley@HansenPartnership.com> writes: James> On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote: >> Can you explain how the code works? it looks to me like you read the xdr >> stuff through the vmap region then write it out directly to the pages? James> OK, I think I see how this is supposed to work: It's a James> sequential loop of reading in via the pages (i.e. through the James> kernel mapping) and then updating those pages via the vmap. In James> which case, I think this patch is what you need. James> The theory of operation is that the readdir on pages actually James> uses the network DMA operations to perform, so when it's James> finished, the underlying page is up to date. After this you James> invalidate the vmap range, so we have no cache lines above it James> (so it picks up the values from the uptodate page). Finally, James> after the operation on the vmap region has finished, you flush James> it so that any updated contents go back to the pages themselves James> before the next iteration begins. You need to re-spin this patch to include the above description into the magic steps your taking here, or at least document it more clearly somewhere why you need to make these funky steps. John James> James James> --- James> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c James> index 996dd89..bde1911 100644 James> --- a/fs/nfs/dir.c James> +++ b/fs/nfs/dir.c James> @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, James> if (status < 0) James> break; James> pglen = status; James> + James> + invalidate_kernel_vmap_range(pages_ptr, pglen); James> + James> status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); James> if (status < 0) { James> if (status == -ENOSPC) James> status = 0; James> break; James> } James> + flush_kernel_vmap_range(pages_ptr, pglen); James> } while (array->eof_index < 0); James> nfs_readdir_free_large_page(pages_ptr, pages, array_size); James> -- James> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in James> the body of a message to majordomo at vger.kernel.org James> More majordomo info at http://vger.kernel.org/majordomo-info.html James> Please read the FAQ at http://www.tux.org/lkml/ -- ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 23:06 ` Trond Myklebust 2011-01-05 23:28 ` James Bottomley @ 2011-01-05 23:28 ` Linus Torvalds 2011-01-05 23:59 ` Russell King - ARM Linux 1 sibling, 1 reply; 79+ messages in thread From: Linus Torvalds @ 2011-01-05 23:28 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 5, 2011 at 3:06 PM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), > which takes care of invalidating the cache prior to a virtual address > read. > > My question was specifically about the write through the regular kernel > mapping: according to Russell and my reading of the cachetlb.txt > documentation, flush_dcache_page() is only guaranteed to have an effect > on page cache pages. I don't think that should ever matter. It's not like the hardware can know whether it's a dcache page or not. And if the sw implementation cares, it's doing something really odd. But who knows - there's a lot of crap out there, and people sometimes do really odd things to work around the brokenness of a VIVT cache with aliases. Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 23:28 ` Linus Torvalds @ 2011-01-05 23:59 ` Russell King - ARM Linux 0 siblings, 0 replies; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 23:59 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 03:28:53PM -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 3:06 PM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(), > > which takes care of invalidating the cache prior to a virtual address > > read. > > > > My question was specifically about the write through the regular kernel > > mapping: according to Russell and my reading of the cachetlb.txt > > documentation, flush_dcache_page() is only guaranteed to have an effect > > on page cache pages. > > I don't think that should ever matter. It's not like the hardware can > know whether it's a dcache page or not. > > And if the sw implementation cares, it's doing something really odd. ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 20:48 ` Linus Torvalds 2011-01-05 21:04 ` Russell King - ARM Linux @ 2011-01-05 21:16 ` James Bottomley 1 sibling, 0 replies; 79+ messages in thread From: James Bottomley @ 2011-01-05 21:16 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 12:48 -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2011 at 12:33 PM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > well, that depends. For us on parisc, kmap of a user page in !HIGHMEM > > sets up an inequivalent aliase still ... because the cache colour of the > > user and kernel virtual addresses are different. Depending on the > > return path to userspace, we still usually have to flush to get the user > > to see the changes the kernel has made. > > Umm. Again, that has nothing to do with kmap(). > > This time it's about the user space mapping. > > Repeat after me: even without the kmap(), the kernel access to that > mapping would have caused cache aliases. > > See? Once more, the kmap() is entirely innocent. You can have a > non-highmem mapping that you never use kmap for, and that you map into > user space, and you'd see exactly the same aliases. Notice? Look ma, > no kmap(). Yes, I understand that (we have no highmem on parisc, so kmap is a nop). The problem (at least as I see it) is that once something within the kernel (well, OK, mostly within drivers) touches a user page via its kernel mapping, the flush often gets forgotten (mainly because it always works on x86). What I was thinking about is that every time the kernel touches a user space page, it has to be within a kmap/kunmap pair (because the page might be highmem) ... so it's possible to make kmap/kunmap do the flushing for this case so the driver writer can't ever forget it. I think the problem case is only really touching scatter/gather elements outside of the DMA API (i.e. the driver pio case), so this may be overkill. Russell also pointed out that a lot of the PIO iterators do excessive kmap_atomic/kunmap_atomic on the same page, so adding a flush could damage performance to the point where the flash root devices on arm might not work. Plus the pio iterators already contain the appropriate flush, so perhaps just using them in every case fixes the problem. > So clearly kmap() is not the issue. The issue continues to be a > totally separate virtual mapping. Whether it's a user mapping or > vm_map_ram() is obviously immaterial - as far as the CPU is concerned, > there is no difference between the two (apart from the trivial > differences of virtual location and permissions). > > (You can also force the problem with vmalloc() an then following the > kernel page tables, but I hope nobody does that any more. I suspect > I'm wrong, though, there's probably code that mixes vmalloc and > physical page accesses in various drivers) Yes, unfortunately, we have seen this quite a bit; mainly to get large buffers. Its not just confined to drivers: xfs used to fail on both arm and parisc because it used a vmalloc region for its log buffer which it then had to do I/O on. James ^ permalink raw reply [flat|nested] 79+ messages in thread
[parent not found: <AANLkTi=-dNeeDjcSoznKtwcaNyw1mMXSqepFY89R2i+2@mail.gmail.com>]
* still nfs problems [Was: Linux 2.6.37-rc8] [not found] <AANLkTi=-dNeeDjcSoznKtwcaNyw1mMXSqepFY89R2i+2@mail.gmail.com> @ 2010-12-30 17:14 ` Uwe Kleine-König 2010-12-30 17:57 ` Linus Torvalds 2010-12-30 17:59 ` Trond Myklebust 0 siblings, 2 replies; 79+ messages in thread From: Uwe Kleine-König @ 2010-12-30 17:14 UTC (permalink / raw) To: linux-arm-kernel Hello, I wonder if the nfs-stuff is considered to be solved, because I still see strange things. During boot my machine sometimes (approx one out of two times) hangs with the output pasted below on Sysctl-l. The irq I'm not 100% sure it's related, but at least it seems to hang in nfs_readdir. (When the serial irq happend that triggered the sysrq the program counter was at 0xc014601c, which is fs/nfs/dir.c:647 for me.) This is on 2.6.37-rc8 plus some patches for machine support on an ARM machine. Best regards Uwe [ 2700.100000] SysRq : Show State [ 2700.100000] task PC stack pid father [ 2700.100000] init S c0285d80 0 1 0 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c004f268>] (do_wait+0x1a4/0x20c) [ 2700.100000] [<c004f0c4>] (do_wait+0x0/0x20c) from [<c004f378>] (sys_wait4+0xa8/0xc0) [ 2700.100000] [<c004f2d0>] (sys_wait4+0x0/0xc0) from [<c0033e80>] (ret_fast_syscall+0x0/0x38) [ 2700.100000] r8:c0034088 r7:00000072 r6:00000001 r5:0000001b r4:0140b228 [ 2700.100000] kthreadd S c0285d80 0 2 0 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006a30c>] (kthreadd+0x70/0xfc) [ 2700.100000] [<c006a29c>] (kthreadd+0x0/0xfc) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] ksoftirqd/0 S c0285d80 0 3 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c0052714>] (run_ksoftirqd+0x5c/0x110) [ 2700.100000] [<c00526b8>] (run_ksoftirqd+0x0/0x110) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] r8:00000000 r7:c00526b8 r6:00000000 r5:c7843f1c r4:c7859fac [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f1c [ 2700.100000] kworker/0:0 S c0285d80 0 4 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006480c>] (worker_thread+0x41c/0x444) [ 2700.100000] [<c00643f0>] (worker_thread+0x0/0x444) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843edc [ 2700.100000] kworker/u:0 S c0285d80 0 5 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006480c>] (worker_thread+0x41c/0x444) [ 2700.100000] [<c00643f0>] (worker_thread+0x0/0x444) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843edc [ 2700.100000] watchdog/0 S c0285d80 0 6 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c008b418>] (watchdog+0xc0/0x110) [ 2700.100000] [<c008b358>] (watchdog+0x0/0x110) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] r6:00000000 r5:c7843efc r4:c785ffac [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843efc [ 2700.100000] khelper S c0285d80 0 7 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f1c [ 2700.100000] sync_supers S c0285d80 0 8 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00cd114>] (bdi_sync_supers+0x38/0x50) [ 2700.100000] [<c00cd0dc>] (bdi_sync_supers+0x0/0x50) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] r5:c7843f2c r4:c7895fac [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f2c [ 2700.100000] bdi-default S c0285d80 0 9 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c02864b4>] (schedule_timeout+0x22c/0x27c) [ 2700.100000] [<c0286288>] (schedule_timeout+0x0/0x27c) from [<c00ce014>] (bdi_forker_thread+0x3a8/0x41c) [ 2700.100000] r8:c0363f80 r7:00000000 r6:00000000 r5:c03641e8 r4:00000000 [ 2700.100000] [<c00cdc6c>] (bdi_forker_thread+0x0/0x41c) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843efc [ 2700.100000] kintegrityd S c0285d80 0 10 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843e9c [ 2700.100000] kblockd S c0285d80 0 11 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843ec4 [ 2700.100000] rpciod S c0285d80 0 12 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843ebc [ 2700.100000] kworker/0:1 S c0285d80 0 13 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006480c>] (worker_thread+0x41c/0x444) [ 2700.100000] [<c00643f0>] (worker_thread+0x0/0x444) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c785be94 [ 2700.100000] khungtaskd S c0285d80 0 14 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c02864b4>] (schedule_timeout+0x22c/0x27c) [ 2700.100000] [<c0286288>] (schedule_timeout+0x0/0x27c) from [<c0286584>] (schedule_timeout_interruptible+0x28/0x2c) [ 2700.100000] r8:00000078 r7:00007fe9 r6:000003e9 r5:c034eef0 r4:00000064 [ 2700.100000] [<c028655c>] (schedule_timeout_interruptible+0x0/0x2c) from [<c008ada8>] (watchdog+0x54/0x2e8) [ 2700.100000] [<c008ad54>] (watchdog+0x0/0x2e8) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f2c [ 2700.100000] kswapd0 S c0285d80 0 15 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00c5ea4>] (kswapd+0x210/0x74c) [ 2700.100000] [<c00c5c94>] (kswapd+0x0/0x74c) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f1c [ 2700.100000] fsnotify_mark S c0285d80 0 16 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c011f884>] (fsnotify_mark_destroy+0x11c/0x144) [ 2700.100000] [<c011f768>] (fsnotify_mark_destroy+0x0/0x144) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f34 [ 2700.100000] aio S c0285d80 0 17 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843edc [ 2700.100000] nfsiod S c0285d80 0 18 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843edc [ 2700.100000] crypto S c0285d80 0 19 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843ee4 [ 2700.100000] kworker/u:1 S c0285d80 0 24 2 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006480c>] (worker_thread+0x41c/0x444) [ 2700.100000] [<c00643f0>] (worker_thread+0x0/0x444) from [<c006a294>] (kthread+0x8c/0x94) [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) [ 2700.100000] r7:00000013 r6:c004f4d8 r5:c006a208 r4:c785de94 [ 2700.100000] rcS S c0285d80 0 27 1 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c004f268>] (do_wait+0x1a4/0x20c) [ 2700.100000] [<c004f0c4>] (do_wait+0x0/0x20c) from [<c004f378>] (sys_wait4+0xa8/0xc0) [ 2700.100000] [<c004f2d0>] (sys_wait4+0x0/0xc0) from [<c0033e80>] (ret_fast_syscall+0x0/0x38) [ 2700.100000] r8:c0034088 r7:00000072 r6:ffffffff r5:bee7880c r4:00000000 [ 2700.100000] run-parts S c0285d80 0 35 27 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c004f268>] (do_wait+0x1a4/0x20c) [ 2700.100000] [<c004f0c4>] (do_wait+0x0/0x20c) from [<c004f378>] (sys_wait4+0xa8/0xc0) [ 2700.100000] [<c004f2d0>] (sys_wait4+0x0/0xc0) from [<c0033e80>] (ret_fast_syscall+0x0/0x38) [ 2700.100000] r8:c0034088 r7:00000072 r6:00000024 r5:bef7dcc4 r4:00000000 [ 2700.100000] S00splashutil R running 0 36 35 0x00000000 [ 2700.100000] Backtrace: [ 2700.100000] [<c0037b54>] (dump_backtrace+0x0/0x110) from [<c0037c80>] (show_stack+0x1c/0x20) [ 2700.100000] r7:c79cfd64 r6:00000000 r5:c7954600 r4:00000000 [ 2700.100000] [<c0037c64>] (show_stack+0x0/0x20) from [<c0046b78>] (sched_show_task+0xb0/0xcc) [ 2700.100000] [<c0046ac8>] (sched_show_task+0x0/0xcc) from [<c0046bf0>] (show_state_filter+0x5c/0xc8) [ 2700.100000] r5:c7954600 r4:c7954600 [ 2700.100000] [<c0046b94>] (show_state_filter+0x0/0xc8) from [<c01c5c40>] (sysrq_handle_showstate+0x18/0x1c) [ 2700.100000] r8:20000093 r7:00000007 r6:00000001 r5:00000074 r4:c036ec5c [ 2700.100000] [<c01c5c28>] (sysrq_handle_showstate+0x0/0x1c) from [<c01c6040>] (__handle_sysrq+0xe0/0x190) [ 2700.100000] [<c01c5f60>] (__handle_sysrq+0x0/0x190) from [<c01c62d8>] (handle_sysrq+0x38/0x44) [ 2700.100000] r8:c7999000 r7:00000100 r6:c7973640 r5:00010074 r4:c7864300 [ 2700.100000] [<c01c62a0>] (handle_sysrq+0x0/0x44) from [<c01da100>] (pl011_int+0x18c/0x5a4) [ 2700.100000] [<c01d9f74>] (pl011_int+0x0/0x5a4) from [<c008b8b0>] (handle_IRQ_event+0x7c/0x1a8) [ 2700.100000] [<c008b834>] (handle_IRQ_event+0x0/0x1a8) from [<c008de5c>] (handle_level_irq+0xc8/0x148) [ 2700.100000] [<c008dd94>] (handle_level_irq+0x0/0x148) from [<c002d080>] (asm_do_IRQ+0x80/0xa4) [ 2700.100000] r7:c74a05a4 r6:c74a0508 r5:00000000 r4:0000002f [ 2700.100000] [<c002d000>] (asm_do_IRQ+0x0/0xa4) from [<c0033ab8>] (__irq_svc+0x38/0x80) [ 2700.100000] Exception stack(0xc79cfe88 to 0xc79cfed0) [ 2700.100000] fe80: c74a0508 00000000 c0145d24 c7487e60 00000000 c79cfee8 [ 2700.100000] fea0: c74a0508 c74a05a4 c7487e60 c79cfee8 c74a0508 c79cff4c c79ce000 c79cfed0 [ 2700.100000] fec0: c016ff10 c014601c 60000013 ffffffff [ 2700.100000] r5:f5000000 r4:ffffffff [ 2700.100000] [<c0145f0c>] (nfs_readdir+0x0/0x458) from [<c00fa298>] (vfs_readdir+0x7c/0xb0) [ 2700.100000] [<c00fa21c>] (vfs_readdir+0x0/0xb0) from [<c00fa3fc>] (sys_getdents+0x70/0xb8) [ 2700.100000] [<c00fa38c>] (sys_getdents+0x0/0xb8) from [<c0033e80>] (ret_fast_syscall+0x0/0x38) [ 2700.100000] r7:0000008d r6:00000000 r5:402ed00c r4:402ed020 [ 2700.100000] Sched Debug Version: v0.09, 2.6.37-rc8-00065-g1cd48e3-dirty #35 [ 2700.100000] now at 2701202.749966 msecs [ 2700.100000] .jiffies : 240010 [ 2700.100000] .sysctl_sched_latency : 6.000000 [ 2700.100000] .sysctl_sched_min_granularity : 0.750000 [ 2700.100000] .sysctl_sched_wakeup_granularity : 1.000000 [ 2700.100000] .sysctl_sched_child_runs_first : 0 [ 2700.100000] .sysctl_sched_features : 31855 [ 2700.100000] .sysctl_sched_tunable_scaling : 1 (logaritmic) [ 2700.100000] [ 2700.100000] cpu#0 [ 2700.100000] .nr_running : 1 [ 2700.100000] .load : 1024 [ 2700.100000] .nr_switches : 11875 [ 2700.100000] .nr_load_updates : 269696 [ 2700.100000] .nr_uninterruptible : 0 [ 2700.100000] .next_balance : 0.000000 [ 2700.100000] .curr->pid : 36 [ 2700.100000] .clock : 2700100.000000 [ 2700.100000] .cpu_load[0] : 1024 [ 2700.100000] .cpu_load[1] : 1024 [ 2700.100000] .cpu_load[2] : 1024 [ 2700.100000] .cpu_load[3] : 1024 [ 2700.100000] .cpu_load[4] : 1024 [ 2700.100000] [ 2700.100000] cfs_rq[0]: [ 2700.100000] .exec_clock : 0.000000 [ 2700.100000] .MIN_vruntime : 0.000001 [ 2700.100000] .min_vruntime : 2695651.938408 [ 2700.100000] .max_vruntime : 0.000001 [ 2700.100000] .spread : 0.000000 [ 2700.100000] .spread0 : 0.000000 [ 2700.100000] .nr_running : 1 [ 2700.100000] .load : 1024 [ 2700.100000] .nr_spread_over : 0 [ 2700.100000] [ 2700.100000] rt_rq[0]: [ 2700.100000] .rt_nr_running : 0 [ 2700.100000] .rt_throttled : 0 [ 2700.100000] .rt_time : 0.000000 [ 2700.100000] .rt_runtime : 950.000000 [ 2700.100000] [ 2700.100000] runnable tasks: [ 2700.100000] task PID tree-key switches prio exec-runtime sum-exec sum-sleep [ 2700.100000] ---------------------------------------------------------------------------------------------------------- [ 2700.100000] R S00splashutils 36 2695651.938408 5397 120 0 0 0.000000 0.000000 0.000000 [ 2700.100000] [ 2700.100000] [ 2700.100000] Showing all locks held in the system: [ 2700.100000] 4 locks held by S00splashutils/36: [ 2700.100000] #0: (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<c00fa268>] vfs_readdir+0x4c/0xb0 [ 2700.100000] #1: (&port_lock_key){-.-...}, at: [<c01d9f94>] pl011_int+0x20/0x5a4 [ 2700.100000] #2: (sysrq_key_table_lock){-.....}, at: [<c01c5f84>] __handle_sysrq+0x24/0x190 [ 2700.100000] #3: (tasklist_lock){.?.+..}, at: [<c007c404>] debug_show_all_locks+0x40/0x1a4 [ 2700.100000] [ 2700.100000] ============================================= [ 2700.100000] -- Pengutronix e.K. | Uwe Kleine-K?nig | Industrial Linux Solutions | http://www.pengutronix.de/ | ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2010-12-30 17:14 ` Uwe Kleine-König @ 2010-12-30 17:57 ` Linus Torvalds 2010-12-30 18:24 ` Trond Myklebust 2010-12-30 17:59 ` Trond Myklebust 1 sibling, 1 reply; 79+ messages in thread From: Linus Torvalds @ 2010-12-30 17:57 UTC (permalink / raw) To: linux-arm-kernel Please cc the poor hapless NFS people too, who probably otherwise wouldn't see it. And Arnd just in case it might be locking-related. Trond, any ideas? The sysrq thing does imply that it's stuck in some busy-loop in fs/nfs/dir.c, and line 647 is get_cache_page(), which in turn implies that the endless loop is either the loop in readdir_search_pagecache() _or_ in a caller. In particular, the EBADCOOKIE case in the caller (nfs_readdir) looks suspicious. What protects us from endless streams of EBADCOOKIE and a successful uncached_readdir? Linus 2010/12/30 Uwe Kleine-K?nig <u.kleine-koenig@pengutronix.de>: > Hello, > > I wonder if the nfs-stuff is considered to be solved, because I still > see strange things. > > During boot my machine sometimes (approx one out of two times) hangs with > the output pasted below on Sysctl-l. ?The irq > > I'm not 100% sure it's related, but at least it seems to hang in > nfs_readdir. ?(When the serial irq happend that triggered the sysrq the > program counter was at 0xc014601c, which is fs/nfs/dir.c:647 for me.) > > This is on 2.6.37-rc8 plus some patches for machine support on an ARM > machine. > > Best regards > Uwe > > [ 2700.100000] SysRq : Show State > [ 2700.100000] ? task ? ? ? ? ? ? ? ?PC stack ? pid father > [ 2700.100000] init ? ? ? ? ?S c0285d80 ? ? 0 ? ? 1 ? ? ?0 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c004f268>] (do_wait+0x1a4/0x20c) > [ 2700.100000] [<c004f0c4>] (do_wait+0x0/0x20c) from [<c004f378>] (sys_wait4+0xa8/0xc0) > [ 2700.100000] [<c004f2d0>] (sys_wait4+0x0/0xc0) from [<c0033e80>] (ret_fast_syscall+0x0/0x38) > [ 2700.100000] ?r8:c0034088 r7:00000072 r6:00000001 r5:0000001b r4:0140b228 > [ 2700.100000] kthreadd ? ? ?S c0285d80 ? ? 0 ? ? 2 ? ? ?0 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006a30c>] (kthreadd+0x70/0xfc) > [ 2700.100000] [<c006a29c>] (kthreadd+0x0/0xfc) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ksoftirqd/0 ? S c0285d80 ? ? 0 ? ? 3 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c0052714>] (run_ksoftirqd+0x5c/0x110) > [ 2700.100000] [<c00526b8>] (run_ksoftirqd+0x0/0x110) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] ?r8:00000000 r7:c00526b8 r6:00000000 r5:c7843f1c r4:c7859fac > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f1c > [ 2700.100000] kworker/0:0 ? S c0285d80 ? ? 0 ? ? 4 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006480c>] (worker_thread+0x41c/0x444) > [ 2700.100000] [<c00643f0>] (worker_thread+0x0/0x444) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843edc > [ 2700.100000] kworker/u:0 ? S c0285d80 ? ? 0 ? ? 5 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006480c>] (worker_thread+0x41c/0x444) > [ 2700.100000] [<c00643f0>] (worker_thread+0x0/0x444) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843edc > [ 2700.100000] watchdog/0 ? ?S c0285d80 ? ? 0 ? ? 6 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c008b418>] (watchdog+0xc0/0x110) > [ 2700.100000] [<c008b358>] (watchdog+0x0/0x110) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] ?r6:00000000 r5:c7843efc r4:c785ffac > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843efc > [ 2700.100000] khelper ? ? ? S c0285d80 ? ? 0 ? ? 7 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) > [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f1c > [ 2700.100000] sync_supers ? S c0285d80 ? ? 0 ? ? 8 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00cd114>] (bdi_sync_supers+0x38/0x50) > [ 2700.100000] [<c00cd0dc>] (bdi_sync_supers+0x0/0x50) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] ?r5:c7843f2c r4:c7895fac > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f2c > [ 2700.100000] bdi-default ? S c0285d80 ? ? 0 ? ? 9 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c02864b4>] (schedule_timeout+0x22c/0x27c) > [ 2700.100000] [<c0286288>] (schedule_timeout+0x0/0x27c) from [<c00ce014>] (bdi_forker_thread+0x3a8/0x41c) > [ 2700.100000] ?r8:c0363f80 r7:00000000 r6:00000000 r5:c03641e8 r4:00000000 > [ 2700.100000] [<c00cdc6c>] (bdi_forker_thread+0x0/0x41c) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843efc > [ 2700.100000] kintegrityd ? S c0285d80 ? ? 0 ? ?10 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) > [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843e9c > [ 2700.100000] kblockd ? ? ? S c0285d80 ? ? 0 ? ?11 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) > [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843ec4 > [ 2700.100000] rpciod ? ? ? ?S c0285d80 ? ? 0 ? ?12 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) > [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843ebc > [ 2700.100000] kworker/0:1 ? S c0285d80 ? ? 0 ? ?13 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006480c>] (worker_thread+0x41c/0x444) > [ 2700.100000] [<c00643f0>] (worker_thread+0x0/0x444) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c785be94 > [ 2700.100000] khungtaskd ? ?S c0285d80 ? ? 0 ? ?14 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c02864b4>] (schedule_timeout+0x22c/0x27c) > [ 2700.100000] [<c0286288>] (schedule_timeout+0x0/0x27c) from [<c0286584>] (schedule_timeout_interruptible+0x28/0x2c) > [ 2700.100000] ?r8:00000078 r7:00007fe9 r6:000003e9 r5:c034eef0 r4:00000064 > [ 2700.100000] [<c028655c>] (schedule_timeout_interruptible+0x0/0x2c) from [<c008ada8>] (watchdog+0x54/0x2e8) > [ 2700.100000] [<c008ad54>] (watchdog+0x0/0x2e8) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f2c > [ 2700.100000] kswapd0 ? ? ? S c0285d80 ? ? 0 ? ?15 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00c5ea4>] (kswapd+0x210/0x74c) > [ 2700.100000] [<c00c5c94>] (kswapd+0x0/0x74c) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f1c > [ 2700.100000] fsnotify_mark S c0285d80 ? ? 0 ? ?16 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c011f884>] (fsnotify_mark_destroy+0x11c/0x144) > [ 2700.100000] [<c011f768>] (fsnotify_mark_destroy+0x0/0x144) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f34 > [ 2700.100000] aio ? ? ? ? ? S c0285d80 ? ? 0 ? ?17 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) > [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843edc > [ 2700.100000] nfsiod ? ? ? ?S c0285d80 ? ? 0 ? ?18 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) > [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843edc > [ 2700.100000] crypto ? ? ? ?S c0285d80 ? ? 0 ? ?19 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c00657d4>] (rescuer_thread+0x1b8/0x1c4) > [ 2700.100000] [<c006561c>] (rescuer_thread+0x0/0x1c4) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843ee4 > [ 2700.100000] kworker/u:1 ? S c0285d80 ? ? 0 ? ?24 ? ? ?2 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006480c>] (worker_thread+0x41c/0x444) > [ 2700.100000] [<c00643f0>] (worker_thread+0x0/0x444) from [<c006a294>] (kthread+0x8c/0x94) > [ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658) > [ 2700.100000] ?r7:00000013 r6:c004f4d8 r5:c006a208 r4:c785de94 > [ 2700.100000] rcS ? ? ? ? ? S c0285d80 ? ? 0 ? ?27 ? ? ?1 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c004f268>] (do_wait+0x1a4/0x20c) > [ 2700.100000] [<c004f0c4>] (do_wait+0x0/0x20c) from [<c004f378>] (sys_wait4+0xa8/0xc0) > [ 2700.100000] [<c004f2d0>] (sys_wait4+0x0/0xc0) from [<c0033e80>] (ret_fast_syscall+0x0/0x38) > [ 2700.100000] ?r8:c0034088 r7:00000072 r6:ffffffff r5:bee7880c r4:00000000 > [ 2700.100000] run-parts ? ? S c0285d80 ? ? 0 ? ?35 ? ? 27 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c004f268>] (do_wait+0x1a4/0x20c) > [ 2700.100000] [<c004f0c4>] (do_wait+0x0/0x20c) from [<c004f378>] (sys_wait4+0xa8/0xc0) > [ 2700.100000] [<c004f2d0>] (sys_wait4+0x0/0xc0) from [<c0033e80>] (ret_fast_syscall+0x0/0x38) > [ 2700.100000] ?r8:c0034088 r7:00000072 r6:00000024 r5:bef7dcc4 r4:00000000 > [ 2700.100000] S00splashutil R running ? ? ?0 ? ?36 ? ? 35 0x00000000 > [ 2700.100000] Backtrace: > [ 2700.100000] [<c0037b54>] (dump_backtrace+0x0/0x110) from [<c0037c80>] (show_stack+0x1c/0x20) > [ 2700.100000] ?r7:c79cfd64 r6:00000000 r5:c7954600 r4:00000000 > [ 2700.100000] [<c0037c64>] (show_stack+0x0/0x20) from [<c0046b78>] (sched_show_task+0xb0/0xcc) > [ 2700.100000] [<c0046ac8>] (sched_show_task+0x0/0xcc) from [<c0046bf0>] (show_state_filter+0x5c/0xc8) > [ 2700.100000] ?r5:c7954600 r4:c7954600 > [ 2700.100000] [<c0046b94>] (show_state_filter+0x0/0xc8) from [<c01c5c40>] (sysrq_handle_showstate+0x18/0x1c) > [ 2700.100000] ?r8:20000093 r7:00000007 r6:00000001 r5:00000074 r4:c036ec5c > [ 2700.100000] [<c01c5c28>] (sysrq_handle_showstate+0x0/0x1c) from [<c01c6040>] (__handle_sysrq+0xe0/0x190) > [ 2700.100000] [<c01c5f60>] (__handle_sysrq+0x0/0x190) from [<c01c62d8>] (handle_sysrq+0x38/0x44) > [ 2700.100000] ?r8:c7999000 r7:00000100 r6:c7973640 r5:00010074 r4:c7864300 > [ 2700.100000] [<c01c62a0>] (handle_sysrq+0x0/0x44) from [<c01da100>] (pl011_int+0x18c/0x5a4) > [ 2700.100000] [<c01d9f74>] (pl011_int+0x0/0x5a4) from [<c008b8b0>] (handle_IRQ_event+0x7c/0x1a8) > [ 2700.100000] [<c008b834>] (handle_IRQ_event+0x0/0x1a8) from [<c008de5c>] (handle_level_irq+0xc8/0x148) > [ 2700.100000] [<c008dd94>] (handle_level_irq+0x0/0x148) from [<c002d080>] (asm_do_IRQ+0x80/0xa4) > [ 2700.100000] ?r7:c74a05a4 r6:c74a0508 r5:00000000 r4:0000002f > [ 2700.100000] [<c002d000>] (asm_do_IRQ+0x0/0xa4) from [<c0033ab8>] (__irq_svc+0x38/0x80) > [ 2700.100000] Exception stack(0xc79cfe88 to 0xc79cfed0) > [ 2700.100000] fe80: ? ? ? ? ? ? ? ? ? c74a0508 00000000 c0145d24 c7487e60 00000000 c79cfee8 > [ 2700.100000] fea0: c74a0508 c74a05a4 c7487e60 c79cfee8 c74a0508 c79cff4c c79ce000 c79cfed0 > [ 2700.100000] fec0: c016ff10 c014601c 60000013 ffffffff > [ 2700.100000] ?r5:f5000000 r4:ffffffff > [ 2700.100000] [<c0145f0c>] (nfs_readdir+0x0/0x458) from [<c00fa298>] (vfs_readdir+0x7c/0xb0) > [ 2700.100000] [<c00fa21c>] (vfs_readdir+0x0/0xb0) from [<c00fa3fc>] (sys_getdents+0x70/0xb8) > [ 2700.100000] [<c00fa38c>] (sys_getdents+0x0/0xb8) from [<c0033e80>] (ret_fast_syscall+0x0/0x38) > [ 2700.100000] ?r7:0000008d r6:00000000 r5:402ed00c r4:402ed020 > [ 2700.100000] Sched Debug Version: v0.09, 2.6.37-rc8-00065-g1cd48e3-dirty #35 > [ 2700.100000] now at 2701202.749966 msecs > [ 2700.100000] ? .jiffies ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? : 240010 > [ 2700.100000] ? .sysctl_sched_latency ? ? ? ? ? ? ? ? ? ?: 6.000000 > [ 2700.100000] ? .sysctl_sched_min_granularity ? ? ? ? ? ?: 0.750000 > [ 2700.100000] ? .sysctl_sched_wakeup_granularity ? ? ? ? : 1.000000 > [ 2700.100000] ? .sysctl_sched_child_runs_first ? ? ? ? ? : 0 > [ 2700.100000] ? .sysctl_sched_features ? ? ? ? ? ? ? ? ? : 31855 > [ 2700.100000] ? .sysctl_sched_tunable_scaling ? ? ? ? ? ?: 1 (logaritmic) > [ 2700.100000] > [ 2700.100000] cpu#0 > [ 2700.100000] ? .nr_running ? ? ? ? ? ? ? ? ? ?: 1 > [ 2700.100000] ? .load ? ? ? ? ? ? ? ? ? ? ? ? ?: 1024 > [ 2700.100000] ? .nr_switches ? ? ? ? ? ? ? ? ? : 11875 > [ 2700.100000] ? .nr_load_updates ? ? ? ? ? ? ? : 269696 > [ 2700.100000] ? .nr_uninterruptible ? ? ? ? ? ?: 0 > [ 2700.100000] ? .next_balance ? ? ? ? ? ? ? ? ?: 0.000000 > [ 2700.100000] ? .curr->pid ? ? ? ? ? ? ? ? ? ? : 36 > [ 2700.100000] ? .clock ? ? ? ? ? ? ? ? ? ? ? ? : 2700100.000000 > [ 2700.100000] ? .cpu_load[0] ? ? ? ? ? ? ? ? ? : 1024 > [ 2700.100000] ? .cpu_load[1] ? ? ? ? ? ? ? ? ? : 1024 > [ 2700.100000] ? .cpu_load[2] ? ? ? ? ? ? ? ? ? : 1024 > [ 2700.100000] ? .cpu_load[3] ? ? ? ? ? ? ? ? ? : 1024 > [ 2700.100000] ? .cpu_load[4] ? ? ? ? ? ? ? ? ? : 1024 > [ 2700.100000] > [ 2700.100000] cfs_rq[0]: > [ 2700.100000] ? .exec_clock ? ? ? ? ? ? ? ? ? ?: 0.000000 > [ 2700.100000] ? .MIN_vruntime ? ? ? ? ? ? ? ? ?: 0.000001 > [ 2700.100000] ? .min_vruntime ? ? ? ? ? ? ? ? ?: 2695651.938408 > [ 2700.100000] ? .max_vruntime ? ? ? ? ? ? ? ? ?: 0.000001 > [ 2700.100000] ? .spread ? ? ? ? ? ? ? ? ? ? ? ?: 0.000000 > [ 2700.100000] ? .spread0 ? ? ? ? ? ? ? ? ? ? ? : 0.000000 > [ 2700.100000] ? .nr_running ? ? ? ? ? ? ? ? ? ?: 1 > [ 2700.100000] ? .load ? ? ? ? ? ? ? ? ? ? ? ? ?: 1024 > [ 2700.100000] ? .nr_spread_over ? ? ? ? ? ? ? ?: 0 > [ 2700.100000] > [ 2700.100000] rt_rq[0]: > [ 2700.100000] ? .rt_nr_running ? ? ? ? ? ? ? ? : 0 > [ 2700.100000] ? .rt_throttled ? ? ? ? ? ? ? ? ?: 0 > [ 2700.100000] ? .rt_time ? ? ? ? ? ? ? ? ? ? ? : 0.000000 > [ 2700.100000] ? .rt_runtime ? ? ? ? ? ? ? ? ? ?: 950.000000 > [ 2700.100000] > [ 2700.100000] runnable tasks: > [ 2700.100000] ? ? ? ? ? ? task ? PID ? ? ? ? tree-key ?switches ?prio ? ? exec-runtime ? ? ? ? sum-exec ? ? ? ?sum-sleep > [ 2700.100000] ---------------------------------------------------------------------------------------------------------- > [ 2700.100000] R S00splashutils ? ?36 ? 2695651.938408 ? ? ?5397 ? 120 ? ? ? ? ? ? ? 0 ? ? ? ? ? ? ? 0 ? ? ? ? ? ? ? 0.000000 ? ? ? ? ? ? ? 0.000000 ? ? ? ? ? ? ? 0.000000 > [ 2700.100000] > [ 2700.100000] > [ 2700.100000] Showing all locks held in the system: > [ 2700.100000] 4 locks held by S00splashutils/36: > [ 2700.100000] ?#0: ?(&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<c00fa268>] vfs_readdir+0x4c/0xb0 > [ 2700.100000] ?#1: ?(&port_lock_key){-.-...}, at: [<c01d9f94>] pl011_int+0x20/0x5a4 > [ 2700.100000] ?#2: ?(sysrq_key_table_lock){-.....}, at: [<c01c5f84>] __handle_sysrq+0x24/0x190 > [ 2700.100000] ?#3: ?(tasklist_lock){.?.+..}, at: [<c007c404>] debug_show_all_locks+0x40/0x1a4 > [ 2700.100000] > [ 2700.100000] ============================================= > [ 2700.100000] > > > -- > Pengutronix e.K. ? ? ? ? ? ? ? ? ? ? ? ? ? | Uwe Kleine-K?nig ? ? ? ? ? ?| > Industrial Linux Solutions ? ? ? ? ? ? ? ? | http://www.pengutronix.de/ ?| > ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2010-12-30 17:57 ` Linus Torvalds @ 2010-12-30 18:24 ` Trond Myklebust 2010-12-30 18:50 ` Linus Torvalds 0 siblings, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2010-12-30 18:24 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2010-12-30 at 09:57 -0800, Linus Torvalds wrote: > Please cc the poor hapless NFS people too, who probably otherwise > wouldn't see it. And Arnd just in case it might be locking-related. > > Trond, any ideas? The sysrq thing does imply that it's stuck in some > busy-loop in fs/nfs/dir.c, and line 647 is get_cache_page(), which in > turn implies that the endless loop is either the loop in > readdir_search_pagecache() _or_ in a caller. In particular, the > EBADCOOKIE case in the caller (nfs_readdir) looks suspicious. What > protects us from endless streams of EBADCOOKIE and a successful > uncached_readdir? There is nothing we can do to protect ourselves against an infinite loop if the server (or underlying filesystem) is breaking the rules w.r.t. cookie generation. It should be possible to recover from all other situations. IOW: if the server generates non-unique cookies, then we're screwed. Fixing that particular problem is impossible since it is basically a variant of the halting problem. That was why I asked which filesystem is being exported in my previous reply. The point of 'uncached_readdir' is to resolve a cookie that was previously valid, but has since been invalidated; usually that is due to the file having been unlinked. If it succeeds, it should result in a new set of valid entries being posted to the 'filldir' callback, and a new cookie being set in the filp->private (i.e. we should have made progress). If it fails, we exit, as you can see. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2010-12-30 18:24 ` Trond Myklebust @ 2010-12-30 18:50 ` Linus Torvalds 2010-12-30 19:25 ` Trond Myklebust 0 siblings, 1 reply; 79+ messages in thread From: Linus Torvalds @ 2010-12-30 18:50 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 30, 2010 at 10:24 AM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > There is nothing we can do to protect ourselves against an infinite loop > if the server (or underlying filesystem) is breaking the rules w.r.t. > cookie generation. It should be possible to recover from all other > situations. Umm. Sure there is. Just make sure that you return the uncached entry to user space, rather than loop forever. Looping forever in kernel space is a bad idea. How about just changing the "continue" into a "break" for the "uncached readdir returned success". No halting problems, no excuses. There is absolutely _no_ excuse for an endless loop in kernel mode. Certainly not "the other end is incompetent". EVERYBODY is incompetent sometimes. That just means that you must never trust the other end too much. You can't say "we require the server to be sane in order not to lock up". Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2010-12-30 18:50 ` Linus Torvalds @ 2010-12-30 19:25 ` Trond Myklebust 2010-12-30 20:02 ` Linus Torvalds 0 siblings, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2010-12-30 19:25 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2010-12-30 at 10:50 -0800, Linus Torvalds wrote: > On Thu, Dec 30, 2010 at 10:24 AM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > > > There is nothing we can do to protect ourselves against an infinite loop > > if the server (or underlying filesystem) is breaking the rules w.r.t. > > cookie generation. It should be possible to recover from all other > > situations. > > Umm. Sure there is. Just make sure that you return the uncached entry > to user space, rather than loop forever. > > Looping forever in kernel space is a bad idea. How about just changing > the "continue" into a "break" for the "uncached readdir returned > success". uncached_readdir is not really a problem. The real problem is filesystems that generate "infinite directories" by producing looping combinations of cookies. IOW: I've seen servers that generate cookies in a sequence of a form vaguely resembling 1 2 3 4 5 6 7 8 9 10 11 12 3... (with possibly a thousand or so entries between the first and second copy of '3') The kernel won't loop forever with something like that (because eventually filldir() will declare it is out of buffer space), but userland has a halting problem: it needs to detect that every sys_getdents() call it is making is generating another copy of the sequence associated with '4 5 6 7 8 9 10 11 12 3'... > No halting problems, no excuses. There is absolutely _no_ excuse for > an endless loop in kernel mode. Certainly not "the other end is > incompetent". We should never get an endless loop in _kernel mode_ with the current scheme, and I can't see that anyone has demonstrated that yet. > EVERYBODY is incompetent sometimes. That just means that you must > never trust the other end too much. You can't say "we require the > server to be sane in order not to lock up". Unfortunately we must. Call it an NFS protocol failure, but it really boils down to the fact that POSIX readdir() generates a data stream with no well-defined concept of an offset. As a result, each and every filesystem has their own interesting ways of generating cookies to represent that 'offset'. Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2010-12-30 19:25 ` Trond Myklebust @ 2010-12-30 20:02 ` Linus Torvalds 0 siblings, 0 replies; 79+ messages in thread From: Linus Torvalds @ 2010-12-30 20:02 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 30, 2010 at 11:25 AM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > > uncached_readdir is not really a problem. The real problem is > filesystems that generate "infinite directories" by producing looping > combinations of cookies. But if we don't have any lseek's, the readdir cache should trivially take care of this by just incrementing the page_index, and we should return to user space the (eventually ending) sequence, even if there are duplicate numbers. (Also, I suspect that "page_index" should not be a page index, but a position, and then you the "search_for_pos()" should use that instead of the file_pos/current_index thing, but that's a detail that would show up only when you have duplicate cookies within one page worth of directory caches) And if the server really sends us an infinite stream of entries, then that's fine - at least we give to user space the infinite entries that were given to us, instead of _generating_ an infinite stream from what was a finite - but broken - stream). So it seems wrong that the directory caching code resets page_index to the start when it then does an uncached readdir. That seems wrong. I'm sure there's some reason for it, but wouldn't it be nice if the rule for page_index was that it starts off at zero, and only gets reset by lseek? > IOW: I've seen servers that generate cookies in a sequence of a form > vaguely resembling > > 1 2 3 4 5 6 7 8 9 10 11 12 3... > > (with possibly a thousand or so entries between the first and second > copy of '3') Ok, so that' obviously broken, but it's then _doubly_ broken to turn that long broken sequence into an _endless_ broken sequence. And I agree that when user space sees such an endless broken sequence, it's a real stopping problem for user space. But in the absense of lseek, it should _never_ be a problem for the kernel itself, afaik. The kernel should happily return just the broken sequence. No? So then perhaps the solution is to just remove the resetting of page_index in the uncached_readdir() function? Make sure that the page_index is monotonically increasing for any readdir(), and you protect against turning a bad sequence into an endless sequence. Of course, lseek() will have to reset page_index to zero, and if somebody does an lseek() on the directory, then the duplicate '3" entry in the cookie sequence will inevitably be ambiguous, but that really is unavoidable. And rare. People who use lseek() on directories are odd and we know they'll break on other filesystems too under certain circumstances. Linus ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2010-12-30 17:14 ` Uwe Kleine-König 2010-12-30 17:57 ` Linus Torvalds @ 2010-12-30 17:59 ` Trond Myklebust 2010-12-30 19:18 ` Uwe Kleine-König 1 sibling, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2010-12-30 17:59 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2010-12-30 at 18:14 +0100, Uwe Kleine-K?nig wrote: > Hello, > > I wonder if the nfs-stuff is considered to be solved, because I still > see strange things. > > During boot my machine sometimes (approx one out of two times) hangs with > the output pasted below on Sysctl-l. The irq > > I'm not 100% sure it's related, but at least it seems to hang in > nfs_readdir. (When the serial irq happend that triggered the sysrq the > program counter was at 0xc014601c, which is fs/nfs/dir.c:647 for me.) > > This is on 2.6.37-rc8 plus some patches for machine support on an ARM > machine. Ccing linux-nfs at vger.kernel.org What filesystem are you exporting on the server? What is the NFS version? Is this nfsroot, autofs or an ordinary nfs mount? In short, how can we reproduce this? Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2010-12-30 17:59 ` Trond Myklebust @ 2010-12-30 19:18 ` Uwe Kleine-König 2011-01-03 21:38 ` Uwe Kleine-König 0 siblings, 1 reply; 79+ messages in thread From: Uwe Kleine-König @ 2010-12-30 19:18 UTC (permalink / raw) To: linux-arm-kernel Hi Trond, On Thu, Dec 30, 2010 at 12:59:52PM -0500, Trond Myklebust wrote: > On Thu, 2010-12-30 at 18:14 +0100, Uwe Kleine-K?nig wrote: > > I wonder if the nfs-stuff is considered to be solved, because I still > > see strange things. > > > > During boot my machine sometimes (approx one out of two times) hangs with > > the output pasted below on Sysctl-l. The irq > > > > I'm not 100% sure it's related, but at least it seems to hang in > > nfs_readdir. (When the serial irq happend that triggered the sysrq the > > program counter was at 0xc014601c, which is fs/nfs/dir.c:647 for me.) > > > > This is on 2.6.37-rc8 plus some patches for machine support on an ARM > > machine. > > Ccing linux-nfs at vger.kernel.org Yeah, good idea. I had that ~2min after sending my report during dinner, sorry :-\ > What filesystem are you exporting on the server? What is the NFS > version? Is this nfsroot, autofs or an ordinary nfs mount? This is an nfsroot of /home/ukl/nfsroot/tx28 which is a symlink to a directory on a different partition. I don't know the filesystem of my homedir as it resides on a server I have no access to, but I asked the admin, so I can follow up with this info later (I'd suspect ext3, too). The real root directory is on ext3 (rw,noatime). The serving nfs-server is Debian's nfs-kernel-server 1:1.2.2-1. nfs-related kernel parameters are ip=dhcp root=/dev/nfs nfsroot=192.168.23.2:/home/ukl/nfsroot/tx28,v3,tcp I hope this answers your questions. If not, please ask. I tried without the symlink and saw some different errors, e.g. starting splashutils daemon.../etc/rc.d/S00splashutils: line 50: //sbin/fbsplashd.static: Unknown error 521 (this is the init script that hung before) and [ 6.160000] NFS: server 192.168.23.2 error: fileid changed [ 6.160000] fsid 0:c: expected fileid 0x33590a4, got 0x4d11bedc but no hang as before. So maybe it's related to the symlink? I don't know if testing that further would help or just waste of my time, so please let me know if I can help you and how. Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-K?nig | Industrial Linux Solutions | http://www.pengutronix.de/ | ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2010-12-30 19:18 ` Uwe Kleine-König @ 2011-01-03 21:38 ` Uwe Kleine-König 2011-01-04 0:22 ` Trond Myklebust 0 siblings, 1 reply; 79+ messages in thread From: Uwe Kleine-König @ 2011-01-03 21:38 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 30, 2010 at 08:18:46PM +0100, Uwe Kleine-K?nig wrote: > On Thu, Dec 30, 2010 at 12:59:52PM -0500, Trond Myklebust wrote: > > What filesystem are you exporting on the server? What is the NFS > > version? Is this nfsroot, autofs or an ordinary nfs mount? > This is an nfsroot of /home/ukl/nfsroot/tx28 which is a symlink to a > directory on a different partition. I don't know the filesystem of my > homedir as it resides on a server I have no access to, but I asked the > admin, so I can follow up with this info later (I'd suspect ext3, too). Yes, it is ext3. > The real root directory is on ext3 (rw,noatime). > > The serving nfs-server is Debian's nfs-kernel-server 1:1.2.2-1. If that matters, kernel is linux-image-2.6.32-5-amd64 (2.6.32-29) provided by Debian. > I don't > know if testing that further would help or just waste of my time, so > please let me know if I can help you and how. This still applies Uwe -- Pengutronix e.K. | Uwe Kleine-K?nig | Industrial Linux Solutions | http://www.pengutronix.de/ | ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-03 21:38 ` Uwe Kleine-König @ 2011-01-04 0:22 ` Trond Myklebust 2011-01-05 8:40 ` Uwe Kleine-König 0 siblings, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2011-01-04 0:22 UTC (permalink / raw) To: linux-arm-kernel On Mon, 2011-01-03 at 22:38 +0100, Uwe Kleine-K?nig wrote: > On Thu, Dec 30, 2010 at 08:18:46PM +0100, Uwe Kleine-K?nig wrote: > > On Thu, Dec 30, 2010 at 12:59:52PM -0500, Trond Myklebust wrote: > > > What filesystem are you exporting on the server? What is the NFS > > > version? Is this nfsroot, autofs or an ordinary nfs mount? > > This is an nfsroot of /home/ukl/nfsroot/tx28 which is a symlink to a > > directory on a different partition. I don't know the filesystem of my > > homedir as it resides on a server I have no access to, but I asked the > > admin, so I can follow up with this info later (I'd suspect ext3, too). > Yes, it is ext3. > > > The real root directory is on ext3 (rw,noatime). > > > > The serving nfs-server is Debian's nfs-kernel-server 1:1.2.2-1. > If that matters, kernel is linux-image-2.6.32-5-amd64 (2.6.32-29) > provided by Debian. > > > I don't > > know if testing that further would help or just waste of my time, so > > please let me know if I can help you and how. > This still applies I'm having trouble reproducing this with my own nfsroot setup (which is just a 'fedora 13 live' disk with NetworkManager turned firmly off). However looking back at your report, you said that when you remove the symlink, you get an error message of the form: "starting splashutils daemon.../etc/rc.d/S00splashutils: line 50: //sbin/fbsplashd.static: Unknown error 521" Error 521 is EBADHANDLE, which basically means your client got a corrupted filehandle. The 'fileid changed' thing also indicates some form of corruption. The question is whether this is something happening on the server or the client. Does an older client kernel boot without any trouble? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-04 0:22 ` Trond Myklebust @ 2011-01-05 8:40 ` Uwe Kleine-König 2011-01-05 11:05 ` Uwe Kleine-König 0 siblings, 1 reply; 79+ messages in thread From: Uwe Kleine-König @ 2011-01-05 8:40 UTC (permalink / raw) To: linux-arm-kernel Hello Trond, On Mon, Jan 03, 2011 at 07:22:38PM -0500, Trond Myklebust wrote: > The question is whether this is something happening on the server or the > client. Does an older client kernel boot without any trouble? I will set up a boot test with 2.6.37 (for statistics) and 2.6.36 to compare with. If you don't consider .36 to be old enough let me now. Once the setup is done it should be easy to test .35 (say), too. Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-K?nig | Industrial Linux Solutions | http://www.pengutronix.de/ | ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 8:40 ` Uwe Kleine-König @ 2011-01-05 11:05 ` Uwe Kleine-König 2011-01-05 11:27 ` Russell King - ARM Linux 0 siblings, 1 reply; 79+ messages in thread From: Uwe Kleine-König @ 2011-01-05 11:05 UTC (permalink / raw) To: linux-arm-kernel Hello Trond, On Wed, Jan 05, 2011 at 09:40:14AM +0100, Uwe Kleine-K?nig wrote: > On Mon, Jan 03, 2011 at 07:22:38PM -0500, Trond Myklebust wrote: > > The question is whether this is something happening on the server or the > > client. Does an older client kernel boot without any trouble? > I will set up a boot test with 2.6.37 (for statistics) and 2.6.36 to > compare with. If you don't consider .36 to be old enough let me now. > Once the setup is done it should be easy to test .35 (say), too. > Marc (cc'd) saw similar[1] problems with .37, when using .36.2 the problems didn't occur. This was more reliable to trigger and he was so kind to bisect the problem. When testing v2.6.36-rc3-51-gafa8ccc init hanged. (babddc72a9468884ce1a23db3c3d54b0afa299f0 is the first bad commit with this hang.) Commit 56e4ebf877b6043c289bda32a5a7385b80c17dee makes the "init hangs" problem the "fileid changed on tab" problem. I could only reproduce that on armv5 machines (imx27, imx28 and at91) but not on armv6 (imx35). Best regards Uwe [1] similar means: not during boot, but when hitting tab to get command completion in the shell. -- Pengutronix e.K. | Uwe Kleine-K?nig | Industrial Linux Solutions | http://www.pengutronix.de/ | ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 11:05 ` Uwe Kleine-König @ 2011-01-05 11:27 ` Russell King - ARM Linux 2011-01-05 12:14 ` Marc Kleine-Budde 2011-01-05 13:40 ` Uwe Kleine-König 0 siblings, 2 replies; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 11:27 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 12:05:17PM +0100, Uwe Kleine-K?nig wrote: > Hello Trond, > > On Wed, Jan 05, 2011 at 09:40:14AM +0100, Uwe Kleine-K?nig wrote: > > On Mon, Jan 03, 2011 at 07:22:38PM -0500, Trond Myklebust wrote: > > > The question is whether this is something happening on the server or the > > > client. Does an older client kernel boot without any trouble? > > I will set up a boot test with 2.6.37 (for statistics) and 2.6.36 to > > compare with. If you don't consider .36 to be old enough let me now. > > Once the setup is done it should be easy to test .35 (say), too. > > > Marc (cc'd) saw similar[1] problems with .37, when using .36.2 the > problems didn't occur. This was more reliable to trigger and he was so > kind to bisect the problem. > > When testing v2.6.36-rc3-51-gafa8ccc init hanged. > (babddc72a9468884ce1a23db3c3d54b0afa299f0 is the first bad commit with > this hang.) Commit 56e4ebf877b6043c289bda32a5a7385b80c17dee makes the > "init hangs" problem the "fileid changed on tab" problem. > > I could only reproduce that on armv5 machines (imx27, imx28 and at91) > but not on armv6 (imx35). FYI, I've seen the "fileid changed" problem, and it looked like a 32-bit truncation of the fileid. It occurred several times on successive reboots, so I tried to capture a tcpdump trace off the server (Linux 2.6.23-rc8-ga64314e6 - its ancient because I've had issues with buggy IDE drivers trying to move it forward.) However, for the last couple of weeks I've been unable to reproduce it. The client was based on 2.6.37-rc6. The "fileid changed" messages popped up after mounting an export with 'nolock,intr,rsize=4096,soft', and then trying to use bash completion and 'ls' in a few subdirectories - and entries were missing from the directory lists without 'ls' reporting any errors (which I think is bad behaviour in itself.) I don't know why it's stopped producing the errors, although once it went I never investigated it any further (was far too busy trying to get AMBA DMA support working.) ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 11:27 ` Russell King - ARM Linux @ 2011-01-05 12:14 ` Marc Kleine-Budde 2011-01-05 13:02 ` Nori, Sekhar 2011-01-05 13:40 ` Uwe Kleine-König 1 sibling, 1 reply; 79+ messages in thread From: Marc Kleine-Budde @ 2011-01-05 12:14 UTC (permalink / raw) To: linux-arm-kernel On 01/05/2011 12:27 PM, Russell King - ARM Linux wrote: > On Wed, Jan 05, 2011 at 12:05:17PM +0100, Uwe Kleine-K?nig wrote: >> Hello Trond, >> >> On Wed, Jan 05, 2011 at 09:40:14AM +0100, Uwe Kleine-K?nig wrote: >>> On Mon, Jan 03, 2011 at 07:22:38PM -0500, Trond Myklebust wrote: >>>> The question is whether this is something happening on the server or the >>>> client. Does an older client kernel boot without any trouble? >>> I will set up a boot test with 2.6.37 (for statistics) and 2.6.36 to >>> compare with. If you don't consider .36 to be old enough let me now. >>> Once the setup is done it should be easy to test .35 (say), too. >>> >> Marc (cc'd) saw similar[1] problems with .37, when using .36.2 the >> problems didn't occur. This was more reliable to trigger and he was so >> kind to bisect the problem. >> >> When testing v2.6.36-rc3-51-gafa8ccc init hanged. >> (babddc72a9468884ce1a23db3c3d54b0afa299f0 is the first bad commit with >> this hang.) Commit 56e4ebf877b6043c289bda32a5a7385b80c17dee makes the >> "init hangs" problem the "fileid changed on tab" problem. >> >> I could only reproduce that on armv5 machines (imx27, imx28 and at91) >> but not on armv6 (imx35). > > FYI, I've seen the "fileid changed" problem, and it looked like a 32-bit > truncation of the fileid. It occurred several times on successive > reboots, so I tried to capture a tcpdump trace off the server (Linux > 2.6.23-rc8-ga64314e6 - its ancient because I've had issues with buggy > IDE drivers trying to move it forward.) However, for the last couple > of weeks I've been unable to reproduce it. We have the problem with nfs-root. From the kernel command line: root=/dev/nfs nfsroot=192.168.23.2:/home/mkl/pengutronix/xxx/bsp/OSELAS.BSP-xxx-Grabowski-trunk/platform-Ronetix-PM9263/root,v3,tcp /home/mkl/pengutronix is a link which points to a link /ptx/work/octopus/mkl (which is a ext3-based) which points to WORK_1/mkl which is also ext3-based. The server is 2.6.32 and has been rebooted yesterday :), nfs-utils are 1.2.2. I make a tcpdump if needed. Cheers, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 262 bytes Desc: OpenPGP digital signature URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110105/eba9dcec/attachment.sig> ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 12:14 ` Marc Kleine-Budde @ 2011-01-05 13:02 ` Nori, Sekhar 2011-01-05 15:34 ` Russell King - ARM Linux 0 siblings, 1 reply; 79+ messages in thread From: Nori, Sekhar @ 2011-01-05 13:02 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 17:44:16, Marc Kleine-Budde wrote: > On 01/05/2011 12:27 PM, Russell King - ARM Linux wrote: > > On Wed, Jan 05, 2011 at 12:05:17PM +0100, Uwe Kleine-K?nig wrote: > >> Hello Trond, > >> > >> On Wed, Jan 05, 2011 at 09:40:14AM +0100, Uwe Kleine-K?nig wrote: > >>> On Mon, Jan 03, 2011 at 07:22:38PM -0500, Trond Myklebust wrote: > >>>> The question is whether this is something happening on the server or the > >>>> client. Does an older client kernel boot without any trouble? > >>> I will set up a boot test with 2.6.37 (for statistics) and 2.6.36 to > >>> compare with. If you don't consider .36 to be old enough let me now. > >>> Once the setup is done it should be easy to test .35 (say), too. > >>> > >> Marc (cc'd) saw similar[1] problems with .37, when using .36.2 the > >> problems didn't occur. This was more reliable to trigger and he was so > >> kind to bisect the problem. > >> > >> When testing v2.6.36-rc3-51-gafa8ccc init hanged. > >> (babddc72a9468884ce1a23db3c3d54b0afa299f0 is the first bad commit with > >> this hang.) Commit 56e4ebf877b6043c289bda32a5a7385b80c17dee makes the > >> "init hangs" problem the "fileid changed on tab" problem. > >> > >> I could only reproduce that on armv5 machines (imx27, imx28 and at91) > >> but not on armv6 (imx35). > > > > FYI, I've seen the "fileid changed" problem, and it looked like a 32-bit > > truncation of the fileid. It occurred several times on successive > > reboots, so I tried to capture a tcpdump trace off the server (Linux > > 2.6.23-rc8-ga64314e6 - its ancient because I've had issues with buggy > > IDE drivers trying to move it forward.) However, for the last couple > > of weeks I've been unable to reproduce it. > > We have the problem with nfs-root. From the kernel command line: > > root=/dev/nfs > nfsroot=192.168.23.2:/home/mkl/pengutronix/xxx/bsp/OSELAS.BSP-xxx-Grabowski-trunk/platform-Ronetix-PM9263/root,v3,tcp > > /home/mkl/pengutronix is a link which points to a link > /ptx/work/octopus/mkl (which is a ext3-based) which points to > WORK_1/mkl which is also ext3-based. > > The server is 2.6.32 and has been rebooted yesterday :), nfs-utils are > 1.2.2. I make a tcpdump if needed. I see the issue too with an ARMv5 based DM355 board with the just released v2.6.37 tag (nfs client). I too see the issue when using bash tab completion. Here are some logs: fileid changed fsid 0:c: expected fileid 0x2db61d, got 0x2dad20 fileid changed fsid 0:c: expected fileid 0x100000000000, got 0x7070000000000000 I am using Fedora 8 (2.6.25 kernel) on the server side. I will try the latest Ubuntu release on the server side and test. Thanks, Sekhar ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 13:02 ` Nori, Sekhar @ 2011-01-05 15:34 ` Russell King - ARM Linux 0 siblings, 0 replies; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 15:34 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 06:32:58PM +0530, Nori, Sekhar wrote: > Here are some logs: > > fileid changed fsid 0:c: expected fileid 0x2db61d, got 0x2dad20 > > fileid changed fsid 0:c: expected fileid 0x100000000000, got 0x7070000000000000 Just to be clear, what I was seeing was things like: expected fileid <32-bit number> got <64-bit number with same 32-bit LS bits> so it looked like something was truncating a 64-bit fileid down to 32-bits. ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 11:27 ` Russell King - ARM Linux 2011-01-05 12:14 ` Marc Kleine-Budde @ 2011-01-05 13:40 ` Uwe Kleine-König 2011-01-05 14:29 ` Jim Rees 2011-01-05 14:53 ` Trond Myklebust 1 sibling, 2 replies; 79+ messages in thread From: Uwe Kleine-König @ 2011-01-05 13:40 UTC (permalink / raw) To: linux-arm-kernel Hi Russell, On Wed, Jan 05, 2011 at 11:27:01AM +0000, Russell King - ARM Linux wrote: > On Wed, Jan 05, 2011 at 12:05:17PM +0100, Uwe Kleine-K?nig wrote: > > Hello Trond, > > > > On Wed, Jan 05, 2011 at 09:40:14AM +0100, Uwe Kleine-K?nig wrote: > > > On Mon, Jan 03, 2011 at 07:22:38PM -0500, Trond Myklebust wrote: > > > > The question is whether this is something happening on the server or the > > > > client. Does an older client kernel boot without any trouble? > > > I will set up a boot test with 2.6.37 (for statistics) and 2.6.36 to > > > compare with. If you don't consider .36 to be old enough let me now. > > > Once the setup is done it should be easy to test .35 (say), too. > > > > > Marc (cc'd) saw similar[1] problems with .37, when using .36.2 the > > problems didn't occur. This was more reliable to trigger and he was so > > kind to bisect the problem. > > > > When testing v2.6.36-rc3-51-gafa8ccc init hanged. > > (babddc72a9468884ce1a23db3c3d54b0afa299f0 is the first bad commit with > > this hang.) Commit 56e4ebf877b6043c289bda32a5a7385b80c17dee makes the > > "init hangs" problem the "fileid changed on tab" problem. > > > > I could only reproduce that on armv5 machines (imx27, imx28 and at91) > > but not on armv6 (imx35). > > FYI, I've seen the "fileid changed" problem, and it looked like a 32-bit > truncation of the fileid. It occurred several times on successive > reboots, so I tried to capture a tcpdump trace off the server (Linux > 2.6.23-rc8-ga64314e6 - its ancient because I've had issues with buggy > IDE drivers trying to move it forward.) However, for the last couple > of weeks I've been unable to reproduce it. > > The client was based on 2.6.37-rc6. > > The "fileid changed" messages popped up after mounting an export with > 'nolock,intr,rsize=4096,soft', and then trying to use bash completion > and 'ls' in a few subdirectories - and entries were missing from the > directory lists without 'ls' reporting any errors (which I think is bad > behaviour in itself.) There was a bug in at least -rc5[1] that was considered already fixed in -rc4[2]. The later announcements didn't mention it anymore. > I don't know why it's stopped producing the errors, although once it > went I never investigated it any further (was far too busy trying to > get AMBA DMA support working.) It seems it was fixed for most users though. Trond? Uwe [1] http://lwn.net/Articles/418963/ [2] http://lwn.net/Articles/417704/ -- Pengutronix e.K. | Uwe Kleine-K?nig | Industrial Linux Solutions | http://www.pengutronix.de/ | ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 13:40 ` Uwe Kleine-König @ 2011-01-05 14:29 ` Jim Rees 2011-01-05 14:42 ` Marc Kleine-Budde 2011-01-05 14:53 ` Trond Myklebust 1 sibling, 1 reply; 79+ messages in thread From: Jim Rees @ 2011-01-05 14:29 UTC (permalink / raw) To: linux-arm-kernel Uwe Kleine-K?nig wrote: > The "fileid changed" messages popped up after mounting an export with > 'nolock,intr,rsize=4096,soft', and then trying to use bash completion > and 'ls' in a few subdirectories - and entries were missing from the > directory lists without 'ls' reporting any errors (which I think is bad > behaviour in itself.) There was a bug in at least -rc5[1] that was considered already fixed in -rc4[2]. The later announcements didn't mention it anymore. > I don't know why it's stopped producing the errors, although once it > went I never investigated it any further (was far too busy trying to > get AMBA DMA support working.) It seems it was fixed for most users though. Trond? Trond sent a fix to the nfs list on 27 Nov for "fileid changed" but I don't know if this is the same bug you're seeing. The patch was to nfs_same_file() and I can send it if you want. As far as I know the patch made it upstream. ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 14:29 ` Jim Rees @ 2011-01-05 14:42 ` Marc Kleine-Budde 2011-01-05 15:38 ` Jim Rees 0 siblings, 1 reply; 79+ messages in thread From: Marc Kleine-Budde @ 2011-01-05 14:42 UTC (permalink / raw) To: linux-arm-kernel On 01/05/2011 03:29 PM, Jim Rees wrote: > Uwe Kleine-K?nig wrote: > > > The "fileid changed" messages popped up after mounting an export with > > 'nolock,intr,rsize=4096,soft', and then trying to use bash completion > > and 'ls' in a few subdirectories - and entries were missing from the > > directory lists without 'ls' reporting any errors (which I think is bad > > behaviour in itself.) > There was a bug in at least -rc5[1] that was considered already fixed in > -rc4[2]. The later announcements didn't mention it anymore. > > > I don't know why it's stopped producing the errors, although once it > > went I never investigated it any further (was far too busy trying to > > get AMBA DMA support working.) > It seems it was fixed for most users though. Trond? > > Trond sent a fix to the nfs list on 27 Nov for "fileid changed" but I don't > know if this is the same bug you're seeing. The patch was to > nfs_same_file() and I can send it if you want. As far as I know the patch > made it upstream. Are you sure it's in .37? The pick-axe just found one commit so far (although it's still searching): $ git log -Snfs_same_file commit d39ab9de3b80da5835049b1c3b49da4e84e01c07 Author: Bryan Schumaker <bjschuma@netapp.com> Date: Fri Sep 24 18:50:01 2010 -0400 NFS: re-add readdir plus This patch adds readdir plus support to the cache array. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Would you please be so kind and send the patch to this thread? cheers, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 262 bytes Desc: OpenPGP digital signature URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110105/64f1ba04/attachment.sig> ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 14:42 ` Marc Kleine-Budde @ 2011-01-05 15:38 ` Jim Rees 0 siblings, 0 replies; 79+ messages in thread From: Jim Rees @ 2011-01-05 15:38 UTC (permalink / raw) To: linux-arm-kernel Marc Kleine-Budde wrote: > Trond sent a fix to the nfs list on 27 Nov for "fileid changed" but I don't > know if this is the same bug you're seeing. The patch was to > nfs_same_file() and I can send it if you want. As far as I know the patch > made it upstream. Are you sure it's in .37? ... Would you please be so kind and send the patch to this thread? This is the commit I'm thinking of: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=37a09f07459753e7c98d4e21f1c61e8756923f81 NFS: Fix a readdirplus bug When comparing filehandles in the helper nfs_same_file(), we should not be using 'strncmp()': filehandles are not null terminated strings. Instead, we should just use the existing helper nfs_compare_fh(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 8ea4a41..f0a384e 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -395,13 +395,9 @@ int xdr_decode(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, struct x static int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry) { - struct nfs_inode *node; if (dentry->d_inode == NULL) goto different; - node = NFS_I(dentry->d_inode); - if (node->fh.size != entry->fh->size) - goto different; - if (strncmp(node->fh.data, entry->fh->data, node->fh.size) != 0) + if (nfs_compare_fh(entry->fh, NFS_FH(dentry->d_inode)) != 0) goto different; return 1; different: ^ permalink raw reply related [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 13:40 ` Uwe Kleine-König 2011-01-05 14:29 ` Jim Rees @ 2011-01-05 14:53 ` Trond Myklebust 2011-01-05 15:01 ` Marc Kleine-Budde 2011-01-14 2:25 ` Andy Isaacson 1 sibling, 2 replies; 79+ messages in thread From: Trond Myklebust @ 2011-01-05 14:53 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 14:40 +0100, Uwe Kleine-K?nig wrote: > Hi Russell, > > On Wed, Jan 05, 2011 at 11:27:01AM +0000, Russell King - ARM Linux wrote: > > On Wed, Jan 05, 2011 at 12:05:17PM +0100, Uwe Kleine-K?nig wrote: > > > Hello Trond, > > > > > > On Wed, Jan 05, 2011 at 09:40:14AM +0100, Uwe Kleine-K?nig wrote: > > > > On Mon, Jan 03, 2011 at 07:22:38PM -0500, Trond Myklebust wrote: > > > > > The question is whether this is something happening on the server or the > > > > > client. Does an older client kernel boot without any trouble? > > > > I will set up a boot test with 2.6.37 (for statistics) and 2.6.36 to > > > > compare with. If you don't consider .36 to be old enough let me now. > > > > Once the setup is done it should be easy to test .35 (say), too. > > > > > > > Marc (cc'd) saw similar[1] problems with .37, when using .36.2 the > > > problems didn't occur. This was more reliable to trigger and he was so > > > kind to bisect the problem. > > > > > > When testing v2.6.36-rc3-51-gafa8ccc init hanged. > > > (babddc72a9468884ce1a23db3c3d54b0afa299f0 is the first bad commit with > > > this hang.) Commit 56e4ebf877b6043c289bda32a5a7385b80c17dee makes the > > > "init hangs" problem the "fileid changed on tab" problem. > > > > > > I could only reproduce that on armv5 machines (imx27, imx28 and at91) > > > but not on armv6 (imx35). > > > > FYI, I've seen the "fileid changed" problem, and it looked like a 32-bit > > truncation of the fileid. It occurred several times on successive > > reboots, so I tried to capture a tcpdump trace off the server (Linux > > 2.6.23-rc8-ga64314e6 - its ancient because I've had issues with buggy > > IDE drivers trying to move it forward.) However, for the last couple > > of weeks I've been unable to reproduce it. > > > > The client was based on 2.6.37-rc6. > > > > The "fileid changed" messages popped up after mounting an export with > > 'nolock,intr,rsize=4096,soft', and then trying to use bash completion > > and 'ls' in a few subdirectories - and entries were missing from the > > directory lists without 'ls' reporting any errors (which I think is bad > > behaviour in itself.) > There was a bug in at least -rc5[1] that was considered already fixed in > -rc4[2]. The later announcements didn't mention it anymore. > > > I don't know why it's stopped producing the errors, although once it > > went I never investigated it any further (was far too busy trying to > > get AMBA DMA support working.) > It seems it was fixed for most users though. Trond? As I said, I can't reproduce it. I'm seeing a lot of mention of ARM above. Is anyone seeing this bug on x86, or does it appear to be architecture-specific? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 14:53 ` Trond Myklebust @ 2011-01-05 15:01 ` Marc Kleine-Budde 2011-01-05 15:14 ` Trond Myklebust 2011-01-14 2:25 ` Andy Isaacson 1 sibling, 1 reply; 79+ messages in thread From: Marc Kleine-Budde @ 2011-01-05 15:01 UTC (permalink / raw) To: linux-arm-kernel On 01/05/2011 03:53 PM, Trond Myklebust wrote: > On Wed, 2011-01-05 at 14:40 +0100, Uwe Kleine-K?nig wrote: >> Hi Russell, >> >> On Wed, Jan 05, 2011 at 11:27:01AM +0000, Russell King - ARM Linux wrote: >>> On Wed, Jan 05, 2011 at 12:05:17PM +0100, Uwe Kleine-K?nig wrote: >>>> Hello Trond, >>>> >>>> On Wed, Jan 05, 2011 at 09:40:14AM +0100, Uwe Kleine-K?nig wrote: >>>>> On Mon, Jan 03, 2011 at 07:22:38PM -0500, Trond Myklebust wrote: >>>>>> The question is whether this is something happening on the server or the >>>>>> client. Does an older client kernel boot without any trouble? >>>>> I will set up a boot test with 2.6.37 (for statistics) and 2.6.36 to >>>>> compare with. If you don't consider .36 to be old enough let me now. >>>>> Once the setup is done it should be easy to test .35 (say), too. >>>>> >>>> Marc (cc'd) saw similar[1] problems with .37, when using .36.2 the >>>> problems didn't occur. This was more reliable to trigger and he was so >>>> kind to bisect the problem. >>>> >>>> When testing v2.6.36-rc3-51-gafa8ccc init hanged. >>>> (babddc72a9468884ce1a23db3c3d54b0afa299f0 is the first bad commit with >>>> this hang.) Commit 56e4ebf877b6043c289bda32a5a7385b80c17dee makes the >>>> "init hangs" problem the "fileid changed on tab" problem. >>>> >>>> I could only reproduce that on armv5 machines (imx27, imx28 and at91) >>>> but not on armv6 (imx35). >>> >>> FYI, I've seen the "fileid changed" problem, and it looked like a 32-bit >>> truncation of the fileid. It occurred several times on successive >>> reboots, so I tried to capture a tcpdump trace off the server (Linux >>> 2.6.23-rc8-ga64314e6 - its ancient because I've had issues with buggy >>> IDE drivers trying to move it forward.) However, for the last couple >>> of weeks I've been unable to reproduce it. >>> >>> The client was based on 2.6.37-rc6. >>> >>> The "fileid changed" messages popped up after mounting an export with >>> 'nolock,intr,rsize=4096,soft', and then trying to use bash completion >>> and 'ls' in a few subdirectories - and entries were missing from the >>> directory lists without 'ls' reporting any errors (which I think is bad >>> behaviour in itself.) >> There was a bug in at least -rc5[1] that was considered already fixed in >> -rc4[2]. The later announcements didn't mention it anymore. >> >>> I don't know why it's stopped producing the errors, although once it >>> went I never investigated it any further (was far too busy trying to >>> get AMBA DMA support working.) >> It seems it was fixed for most users though. Trond? > > As I said, I can't reproduce it. > > I'm seeing a lot of mention of ARM above. Is anyone seeing this bug on > x86, or does it appear to be architecture-specific? It _seems_ to be ARMv5 specific[1]. Uwe did some tests and figured out that disabling dcache on ARMv5 "fixes" the problem, but CONFIG_CPU_DCACHE_WRITETHROUGH isn't enough. [1] Uwe fails to reproduce it on ARMv6. The ARMv6 has a L2 cache and uses IIRC different instructions to flush the L1 caches. (please correct me, if I'm wrong, ARM guys :) cheers, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 262 bytes Desc: OpenPGP digital signature URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110105/51700fc7/attachment-0001.sig> ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 15:01 ` Marc Kleine-Budde @ 2011-01-05 15:14 ` Trond Myklebust 2011-01-05 15:29 ` Trond Myklebust 2011-01-05 15:52 ` Russell King - ARM Linux 0 siblings, 2 replies; 79+ messages in thread From: Trond Myklebust @ 2011-01-05 15:14 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 16:01 +0100, Marc Kleine-Budde wrote: > On 01/05/2011 03:53 PM, Trond Myklebust wrote: > > On Wed, 2011-01-05 at 14:40 +0100, Uwe Kleine-K?nig wrote: > >> Hi Russell, > >> > >> On Wed, Jan 05, 2011 at 11:27:01AM +0000, Russell King - ARM Linux wrote: > >>> On Wed, Jan 05, 2011 at 12:05:17PM +0100, Uwe Kleine-K?nig wrote: > >>>> Hello Trond, > >>>> > >>>> On Wed, Jan 05, 2011 at 09:40:14AM +0100, Uwe Kleine-K?nig wrote: > >>>>> On Mon, Jan 03, 2011 at 07:22:38PM -0500, Trond Myklebust wrote: > >>>>>> The question is whether this is something happening on the server or the > >>>>>> client. Does an older client kernel boot without any trouble? > >>>>> I will set up a boot test with 2.6.37 (for statistics) and 2.6.36 to > >>>>> compare with. If you don't consider .36 to be old enough let me now. > >>>>> Once the setup is done it should be easy to test .35 (say), too. > >>>>> > >>>> Marc (cc'd) saw similar[1] problems with .37, when using .36.2 the > >>>> problems didn't occur. This was more reliable to trigger and he was so > >>>> kind to bisect the problem. > >>>> > >>>> When testing v2.6.36-rc3-51-gafa8ccc init hanged. > >>>> (babddc72a9468884ce1a23db3c3d54b0afa299f0 is the first bad commit with > >>>> this hang.) Commit 56e4ebf877b6043c289bda32a5a7385b80c17dee makes the > >>>> "init hangs" problem the "fileid changed on tab" problem. > >>>> > >>>> I could only reproduce that on armv5 machines (imx27, imx28 and at91) > >>>> but not on armv6 (imx35). > >>> > >>> FYI, I've seen the "fileid changed" problem, and it looked like a 32-bit > >>> truncation of the fileid. It occurred several times on successive > >>> reboots, so I tried to capture a tcpdump trace off the server (Linux > >>> 2.6.23-rc8-ga64314e6 - its ancient because I've had issues with buggy > >>> IDE drivers trying to move it forward.) However, for the last couple > >>> of weeks I've been unable to reproduce it. > >>> > >>> The client was based on 2.6.37-rc6. > >>> > >>> The "fileid changed" messages popped up after mounting an export with > >>> 'nolock,intr,rsize=4096,soft', and then trying to use bash completion > >>> and 'ls' in a few subdirectories - and entries were missing from the > >>> directory lists without 'ls' reporting any errors (which I think is bad > >>> behaviour in itself.) > >> There was a bug in at least -rc5[1] that was considered already fixed in > >> -rc4[2]. The later announcements didn't mention it anymore. > >> > >>> I don't know why it's stopped producing the errors, although once it > >>> went I never investigated it any further (was far too busy trying to > >>> get AMBA DMA support working.) > >> It seems it was fixed for most users though. Trond? > > > > As I said, I can't reproduce it. > > > > I'm seeing a lot of mention of ARM above. Is anyone seeing this bug on > > x86, or does it appear to be architecture-specific? > > It _seems_ to be ARMv5 specific[1]. Uwe did some tests and figured out > that disabling dcache on ARMv5 "fixes" the problem, but > CONFIG_CPU_DCACHE_WRITETHROUGH isn't enough. > > [1] Uwe fails to reproduce it on ARMv6. The ARMv6 has a L2 cache and > uses IIRC different instructions to flush the L1 caches. (please correct > me, if I'm wrong, ARM guys :) > > cheers, Marc OK. So,the new behaviour in 2.6.37 is that we're writing to a series of pages via the usual kmap_atomic()/kunmap_atomic() and kmap()/kunmap() interfaces, but we can end up reading them via a virtual address range that gets set up via vm_map_ram() (that range gets set up before the write occurs). Do we perhaps need an invalidate_kernel_vmap_range() before we can read the data on ARM in this kind of scenario? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 15:14 ` Trond Myklebust @ 2011-01-05 15:29 ` Trond Myklebust 2011-01-05 15:39 ` Marc Kleine-Budde 2011-01-05 15:52 ` Russell King - ARM Linux 1 sibling, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2011-01-05 15:29 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 10:14 -0500, Trond Myklebust wrote: > OK. So,the new behaviour in 2.6.37 is that we're writing to a series of > pages via the usual kmap_atomic()/kunmap_atomic() and kmap()/kunmap() > interfaces, but we can end up reading them via a virtual address range > that gets set up via vm_map_ram() (that range gets set up before the > write occurs). > > Do we perhaps need an invalidate_kernel_vmap_range() before we can read > the data on ARM in this kind of scenario? IOW: Does something like the following patch fix the problem? ------------------------------------------------------------------------------- From: Trond Myklebust <Trond.Myklebust@netapp.com> NFS: Ensure we clean the TLB cache in nfs_readdir_xdr_to_array After calling nfs_readdir_xdr_filler(), we need a call to invalidate_kernel_vmap_range() before we can proceed to read the data back through the virtual address range. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 996dd89..4640470 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -587,6 +587,9 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, if (status < 0) break; pglen = status; + + invalidate_kernel_vmap_range(pages_ptr, pglen); + status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); if (status < 0) { if (status == -ENOSPC) -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply related [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 15:29 ` Trond Myklebust @ 2011-01-05 15:39 ` Marc Kleine-Budde 0 siblings, 0 replies; 79+ messages in thread From: Marc Kleine-Budde @ 2011-01-05 15:39 UTC (permalink / raw) To: linux-arm-kernel On 01/05/2011 04:29 PM, Trond Myklebust wrote: > On Wed, 2011-01-05 at 10:14 -0500, Trond Myklebust wrote: >> OK. So,the new behaviour in 2.6.37 is that we're writing to a series of >> pages via the usual kmap_atomic()/kunmap_atomic() and kmap()/kunmap() >> interfaces, but we can end up reading them via a virtual address range >> that gets set up via vm_map_ram() (that range gets set up before the >> write occurs). >> >> Do we perhaps need an invalidate_kernel_vmap_range() before we can read >> the data on ARM in this kind of scenario? > > IOW: Does something like the following patch fix the problem? > > ------------------------------------------------------------------------------- > From: Trond Myklebust <Trond.Myklebust@netapp.com> > NFS: Ensure we clean the TLB cache in nfs_readdir_xdr_to_array > > After calling nfs_readdir_xdr_filler(), we need a call to > invalidate_kernel_vmap_range() before we can proceed to read > the data back through the virtual address range. > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> > --- > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index 996dd89..4640470 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -587,6 +587,9 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, > if (status < 0) > break; > pglen = status; > + > + invalidate_kernel_vmap_range(pages_ptr, pglen); > + > status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen); > if (status < 0) { > if (status == -ENOSPC) \o/ - Works for me (at91, armv5) Tested-by: Marc Kleine-Budde <mkl@pengutronix.de> This is a candidate for stable (Cc'd). Regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 262 bytes Desc: OpenPGP digital signature URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110105/2804eb32/attachment.sig> ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 15:14 ` Trond Myklebust 2011-01-05 15:29 ` Trond Myklebust @ 2011-01-05 15:52 ` Russell King - ARM Linux 2011-01-05 17:17 ` Trond Myklebust 1 sibling, 1 reply; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 15:52 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 10:14:17AM -0500, Trond Myklebust wrote: > OK. So,the new behaviour in 2.6.37 is that we're writing to a series of > pages via the usual kmap_atomic()/kunmap_atomic() and kmap()/kunmap() > interfaces, but we can end up reading them via a virtual address range > that gets set up via vm_map_ram() (that range gets set up before the > write occurs). kmap of lowmem pages will always reuses the existing kernel direct mapping, so there won't be a problem there. > Do we perhaps need an invalidate_kernel_vmap_range() before we can read > the data on ARM in this kind of scenario? Firstly, vm_map_ram() does no cache maintainence of any sort, nor does it take care of page colouring - so any architecture where cache aliasing can occur will see this problem. It should not limited to ARM. Secondly, no, invalidate_kernel_vmap_range() probably isn't sufficient. There's two problems here: addr = kmap(lowmem_page); *addr = stuff; kunmap(lowmem_page); Such lowmem pages are accessed through their kernel direct mapping. ptr = vm_map_ram(lowmem_page); read = *ptr; This creates a new mapping which can alias with the kernel direct mapping. Now, as this is a new mapping, there should be no cache lines associated with it. (Looking at vm_unmap_ram(), it calls free_unmap_vmap_area_addr(), free_unmap_vmap_area(), which then calls flush_cache_vunmap() on the region. vb_free() also calls flush_cache_vunmap() too.) If the write after kmap() hits an already present cache line, the cache line will be updated, but it won't be written back to memory. So, on a subsequent vm_map_ram(), with any kind of aliasing cache, there's no guarantee that you'll hit that cache line and read the data just written there. The kernel direct mapping would need to be flushed. I'm really getting to the point of hating the poliferation of RAM remapping interfaces - it's going to (and is) causing nothing but lots of pain on virtual cache architectures, needing more and more cache flushing interfaces to be created. Is there any other solution to this? ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 15:52 ` Russell King - ARM Linux @ 2011-01-05 17:17 ` Trond Myklebust 2011-01-05 17:26 ` Russell King - ARM Linux 0 siblings, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2011-01-05 17:17 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 15:52 +0000, Russell King - ARM Linux wrote: > On Wed, Jan 05, 2011 at 10:14:17AM -0500, Trond Myklebust wrote: > > OK. So,the new behaviour in 2.6.37 is that we're writing to a series of > > pages via the usual kmap_atomic()/kunmap_atomic() and kmap()/kunmap() > > interfaces, but we can end up reading them via a virtual address range > > that gets set up via vm_map_ram() (that range gets set up before the > > write occurs). > > kmap of lowmem pages will always reuses the existing kernel direct > mapping, so there won't be a problem there. > > > Do we perhaps need an invalidate_kernel_vmap_range() before we can read > > the data on ARM in this kind of scenario? > > Firstly, vm_map_ram() does no cache maintainence of any sort, nor does > it take care of page colouring - so any architecture where cache aliasing > can occur will see this problem. It should not limited to ARM. > > Secondly, no, invalidate_kernel_vmap_range() probably isn't sufficient. > There's two problems here: > > addr = kmap(lowmem_page); > *addr = stuff; > kunmap(lowmem_page); > > Such lowmem pages are accessed through their kernel direct mapping. > > ptr = vm_map_ram(lowmem_page); > read = *ptr; > > This creates a new mapping which can alias with the kernel direct mapping. > Now, as this is a new mapping, there should be no cache lines associated > with it. (Looking at vm_unmap_ram(), it calls free_unmap_vmap_area_addr(), > free_unmap_vmap_area(), which then calls flush_cache_vunmap() on the > region. vb_free() also calls flush_cache_vunmap() too.) > > If the write after kmap() hits an already present cache line, the cache > line will be updated, but it won't be written back to memory. So, on > a subsequent vm_map_ram(), with any kind of aliasing cache, there's > no guarantee that you'll hit that cache line and read the data just > written there. > > The kernel direct mapping would need to be flushed. We should already be flushing the kernel direct mapping after writing by means of the calls to flush_dcache_page() in xdr_partial_copy_from_skb() and all the helpers in net/sunrpc/xdr.c. The only new thing is the read access through the virtual address mapping. That mapping is created outside the loop in nfs_readdir_xdr_to_array(), which is why I'm thinking we do need the invalidate_kernel_vmap_range(): we're essentially doing a series of writes through the kernel direct mapping (i.e. readdir RPC calls), then reading the results through the virtual mapping. i.e. we're doing ptr = vm_map_ram(lowmem_pages); while (need_more_data) { for (i = 0; i < npages; i++) { addr = kmap_atomic(lowmem_page[i]); *addr = rpc_stuff; flush_dcache_page(lowmem_page[i]); kunmap_atomic(lowmem_page[i]); } invalidate_kernel_vmap_range(ptr); // Needed here? read = *ptr; } vm_unmap_ram(lowmem_pages) > I'm really getting to the point of hating the poliferation of RAM > remapping interfaces - it's going to (and is) causing nothing but lots > of pain on virtual cache architectures, needing more and more cache > flushing interfaces to be created. > > Is there any other solution to this? Arbitrary sized pages. :-) The problem here is that we want to read variable sized records (i.e. readdir() records) from a multi-page buffer. We could do that by copying those particular records that overlap with page boundaries, but that would make for a fairly intrusive rewrite too. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 17:17 ` Trond Myklebust @ 2011-01-05 17:26 ` Russell King - ARM Linux 2011-01-05 18:12 ` Trond Myklebust 0 siblings, 1 reply; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 17:26 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 12:17:27PM -0500, Trond Myklebust wrote: > We should already be flushing the kernel direct mapping after writing by > means of the calls to flush_dcache_page() in xdr_partial_copy_from_skb() > and all the helpers in net/sunrpc/xdr.c. Hmm, we're getting into the realms of what flush_dcache_page() is supposed to do and what it's not supposed to do. Is this page an associated with a mapping (iow, page_mapping(page) is non- NULL)? If not, flush_dcache_page() won't do anything, and from my understanding, its flush_anon_page() which you want to be using there instead. > The only new thing is the read access through the virtual address > mapping. That mapping is created outside the loop in > nfs_readdir_xdr_to_array(), which is why I'm thinking we do need the > invalidate_kernel_vmap_range(): we're essentially doing a series of > writes through the kernel direct mapping (i.e. readdir RPC calls), then > reading the results through the virtual mapping. > > i.e. we're doing > > ptr = vm_map_ram(lowmem_pages); > while (need_more_data) { > > for (i = 0; i < npages; i++) { > addr = kmap_atomic(lowmem_page[i]); > *addr = rpc_stuff; > flush_dcache_page(lowmem_page[i]); > kunmap_atomic(lowmem_page[i]); > } > > invalidate_kernel_vmap_range(ptr); // Needed here? Yes, you're going to need some cache maintainence in there to make it work, because accessing 'ptr' will load that data into the cache, and that won't be updated by the writes via kmap_atomic(). Provided you don't write to ptr, then using invalidate_kernel_vmap_range() will be safe. ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 17:26 ` Russell King - ARM Linux @ 2011-01-05 18:12 ` Trond Myklebust 2011-01-05 18:27 ` Russell King - ARM Linux 0 siblings, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2011-01-05 18:12 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 17:26 +0000, Russell King - ARM Linux wrote: > On Wed, Jan 05, 2011 at 12:17:27PM -0500, Trond Myklebust wrote: > > We should already be flushing the kernel direct mapping after writing by > > means of the calls to flush_dcache_page() in xdr_partial_copy_from_skb() > > and all the helpers in net/sunrpc/xdr.c. > > Hmm, we're getting into the realms of what flush_dcache_page() is supposed > to do and what it's not supposed to do. > > Is this page an associated with a mapping (iow, page_mapping(page) is non- > NULL)? If not, flush_dcache_page() won't do anything, and from my > understanding, its flush_anon_page() which you want to be using there > instead. Actually, none of these pages are ever mapped into userspace, nor are they mapped into the page cache. They are allocated directly using alloc_page() by the thread that called the readdir() syscall, so afaics there should be no incoherent mappings other than the kernel direct mapping and the one created by vm_map_ram(). So, yes, you are right that we don't need the flush_dcache_page() here. > > The only new thing is the read access through the virtual address > > mapping. That mapping is created outside the loop in > > nfs_readdir_xdr_to_array(), which is why I'm thinking we do need the > > invalidate_kernel_vmap_range(): we're essentially doing a series of > > writes through the kernel direct mapping (i.e. readdir RPC calls), then > > reading the results through the virtual mapping. > > > > i.e. we're doing > > > > ptr = vm_map_ram(lowmem_pages); > > while (need_more_data) { > > > > for (i = 0; i < npages; i++) { > > addr = kmap_atomic(lowmem_page[i]); > > *addr = rpc_stuff; > > flush_dcache_page(lowmem_page[i]); > > kunmap_atomic(lowmem_page[i]); > > } > > > > invalidate_kernel_vmap_range(ptr); // Needed here? > > Yes, you're going to need some cache maintainence in there to make it work, > because accessing 'ptr' will load that data into the cache, and that won't > be updated by the writes via kmap_atomic(). > > Provided you don't write to ptr, then using invalidate_kernel_vmap_range() > will be safe. Thanks! That is what Marc's testing appears to confirm. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 18:12 ` Trond Myklebust @ 2011-01-05 18:27 ` Russell King - ARM Linux 2011-01-05 18:55 ` Trond Myklebust 0 siblings, 1 reply; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 18:27 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 01:12:25PM -0500, Trond Myklebust wrote: > On Wed, 2011-01-05 at 17:26 +0000, Russell King - ARM Linux wrote: > > On Wed, Jan 05, 2011 at 12:17:27PM -0500, Trond Myklebust wrote: > > > We should already be flushing the kernel direct mapping after writing by > > > means of the calls to flush_dcache_page() in xdr_partial_copy_from_skb() > > > and all the helpers in net/sunrpc/xdr.c. > > > > Hmm, we're getting into the realms of what flush_dcache_page() is supposed > > to do and what it's not supposed to do. > > > > Is this page an associated with a mapping (iow, page_mapping(page) is non- > > NULL)? If not, flush_dcache_page() won't do anything, and from my > > understanding, its flush_anon_page() which you want to be using there > > instead. > > Actually, none of these pages are ever mapped into userspace, nor are > they mapped into the page cache. > > They are allocated directly using alloc_page() by the thread that called > the readdir() syscall, so afaics there should be no incoherent mappings > other than the kernel direct mapping and the one created by > vm_map_ram(). > > So, yes, you are right that we don't need the flush_dcache_page() here. I do still think you need _something_ there, otherwise data can remain in the direct map alias and not be visible via the vmap alias. I don't see that we have anything in place to handle this at present though. jejb mentioned something about making kunmap_atomic() always flush the cache, even for lowmem pages, but I think that's going to be exceedingly painful, to the extent that I believe it will prevent our PIO-only MMC drivers working - or we need a scatterlist API that will let drivers iterate over the scatterlist without needing to continually kmap_atomic and kunmap_atomic each page. ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 18:27 ` Russell King - ARM Linux @ 2011-01-05 18:55 ` Trond Myklebust 2011-01-05 19:07 ` Russell King - ARM Linux 0 siblings, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2011-01-05 18:55 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2011-01-05 at 18:27 +0000, Russell King - ARM Linux wrote: > On Wed, Jan 05, 2011 at 01:12:25PM -0500, Trond Myklebust wrote: > > On Wed, 2011-01-05 at 17:26 +0000, Russell King - ARM Linux wrote: > > > On Wed, Jan 05, 2011 at 12:17:27PM -0500, Trond Myklebust wrote: > > > > We should already be flushing the kernel direct mapping after writing by > > > > means of the calls to flush_dcache_page() in xdr_partial_copy_from_skb() > > > > and all the helpers in net/sunrpc/xdr.c. > > > > > > Hmm, we're getting into the realms of what flush_dcache_page() is supposed > > > to do and what it's not supposed to do. > > > > > > Is this page an associated with a mapping (iow, page_mapping(page) is non- > > > NULL)? If not, flush_dcache_page() won't do anything, and from my > > > understanding, its flush_anon_page() which you want to be using there > > > instead. > > > > Actually, none of these pages are ever mapped into userspace, nor are > > they mapped into the page cache. > > > > They are allocated directly using alloc_page() by the thread that called > > the readdir() syscall, so afaics there should be no incoherent mappings > > other than the kernel direct mapping and the one created by > > vm_map_ram(). > > > > So, yes, you are right that we don't need the flush_dcache_page() here. > > I do still think you need _something_ there, otherwise data can remain > in the direct map alias and not be visible via the vmap alias. I don't > see that we have anything in place to handle this at present though. Is that perhaps what flush_kernel_dcache_page() is supposed to do? > jejb mentioned something about making kunmap_atomic() always flush the > cache, even for lowmem pages, but I think that's going to be exceedingly > painful, to the extent that I believe it will prevent our PIO-only MMC > drivers working - or we need a scatterlist API that will let drivers > iterate over the scatterlist without needing to continually kmap_atomic > and kunmap_atomic each page. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 18:55 ` Trond Myklebust @ 2011-01-05 19:07 ` Russell King - ARM Linux 0 siblings, 0 replies; 79+ messages in thread From: Russell King - ARM Linux @ 2011-01-05 19:07 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 01:55:05PM -0500, Trond Myklebust wrote: > On Wed, 2011-01-05 at 18:27 +0000, Russell King - ARM Linux wrote: > > I do still think you need _something_ there, otherwise data can remain > > in the direct map alias and not be visible via the vmap alias. I don't > > see that we have anything in place to handle this at present though. > > Is that perhaps what flush_kernel_dcache_page() is supposed to do? Well, given how we have things currently setup on ARM, this ends up being a no-op - as new page cache pages are marked dirty and their flushing done at the point when they're mapped into userspace. I guess we could do the flushing there and mark the page clean, but it'd need some careful examination of various code paths to confirm that it's safe - we may be avoiding this because some ARM arch versions need to manually IPI cache flushes to other cores (which can only be done with IRQs enabled.) So, I don't think it'll do at the present time. ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-05 14:53 ` Trond Myklebust 2011-01-05 15:01 ` Marc Kleine-Budde @ 2011-01-14 2:25 ` Andy Isaacson 2011-01-14 2:40 ` Trond Myklebust 1 sibling, 1 reply; 79+ messages in thread From: Andy Isaacson @ 2011-01-14 2:25 UTC (permalink / raw) To: linux-arm-kernel On Wed, Jan 05, 2011 at 09:53:13AM -0500, Trond Myklebust wrote: > > There was a bug in at least -rc5[1] that was considered already fixed in > > -rc4[2]. The later announcements didn't mention it anymore. > > > > > I don't know why it's stopped producing the errors, although once it > > > went I never investigated it any further (was far too busy trying to > > > get AMBA DMA support working.) > > It seems it was fixed for most users though. Trond? > > As I said, I can't reproduce it. > > I'm seeing a lot of mention of ARM above. Is anyone seeing this bug on > x86, or does it appear to be architecture-specific? I'm seeing processes stuck in D with "fileid changed" in dmesg, on x86_64 (both server and client). The repro testcase is to run an executable off of NFS, recompile it on the server, and then try to tab complete the executable name. The client prints NFS: server <hostname> error: fileid changed fsid 0:18: expected fileid 0x107aa4a, got 0x107ad3e and /bin/zsh hangs in D. My server is running 2.6.36.1, filesystem is ext3 on sda3 on AHCI, client is currently running 2.6.37-rc1. I'm assuming that 37a09f will fix it. -andy ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-14 2:25 ` Andy Isaacson @ 2011-01-14 2:40 ` Trond Myklebust 2011-01-14 4:22 ` Andy Isaacson 0 siblings, 1 reply; 79+ messages in thread From: Trond Myklebust @ 2011-01-14 2:40 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2011-01-13 at 18:25 -0800, Andy Isaacson wrote: > On Wed, Jan 05, 2011 at 09:53:13AM -0500, Trond Myklebust wrote: > > > There was a bug in at least -rc5[1] that was considered already fixed in > > > -rc4[2]. The later announcements didn't mention it anymore. > > > > > > > I don't know why it's stopped producing the errors, although once it > > > > went I never investigated it any further (was far too busy trying to > > > > get AMBA DMA support working.) > > > It seems it was fixed for most users though. Trond? > > > > As I said, I can't reproduce it. > > > > I'm seeing a lot of mention of ARM above. Is anyone seeing this bug on > > x86, or does it appear to be architecture-specific? > > I'm seeing processes stuck in D with "fileid changed" in dmesg, on > x86_64 (both server and client). The repro testcase is to run an > executable off of NFS, recompile it on the server, and then try to tab > complete the executable name. The client prints > > NFS: server <hostname> error: fileid changed > fsid 0:18: expected fileid 0x107aa4a, got 0x107ad3e > > and /bin/zsh hangs in D. > > My server is running 2.6.36.1, filesystem is ext3 on sda3 on AHCI, > client is currently running 2.6.37-rc1. I'm assuming that 37a09f will > fix it. Why are you sticking to 2.6.37-rc1 when the final 2.6.37 is out? There have been several readdir bugfixes merged in the months since -rc1 came out. Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust at netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 79+ messages in thread
* still nfs problems [Was: Linux 2.6.37-rc8] 2011-01-14 2:40 ` Trond Myklebust @ 2011-01-14 4:22 ` Andy Isaacson 0 siblings, 0 replies; 79+ messages in thread From: Andy Isaacson @ 2011-01-14 4:22 UTC (permalink / raw) To: linux-arm-kernel On Thu, Jan 13, 2011 at 09:40:23PM -0500, Trond Myklebust wrote: > > My server is running 2.6.36.1, filesystem is ext3 on sda3 on AHCI, > > client is currently running 2.6.37-rc1. I'm assuming that 37a09f will > > fix it. > > Why are you sticking to 2.6.37-rc1 when the final 2.6.37 is out? There > have been several readdir bugfixes merged in the months since -rc1 came > out. No good reason, just hadn't run into any reasons to update; switching kernels is nonzero cost since the machine in question runs some out of tree modules ergo updating is slightly more involved than "make && make install". Above and beyond the costs of rebooting and losing state. Actual bug is obviously a good reason to update though. -andy ^ permalink raw reply [flat|nested] 79+ messages in thread
end of thread, other threads:[~2011-01-14 4:22 UTC | newest] Thread overview: 79+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-01-05 19:05 still nfs problems [Was: Linux 2.6.37-rc8] James Bottomley 2011-01-05 19:18 ` Linus Torvalds 2011-01-05 19:36 ` James Bottomley 2011-01-05 19:49 ` Linus Torvalds 2011-01-05 20:35 ` James Bottomley 2011-01-05 20:00 ` Russell King - ARM Linux 2011-01-05 20:33 ` James Bottomley 2011-01-05 20:48 ` Linus Torvalds 2011-01-05 21:04 ` Russell King - ARM Linux 2011-01-05 21:08 ` Linus Torvalds 2011-01-05 21:16 ` Trond Myklebust 2011-01-05 21:30 ` Linus Torvalds 2011-01-05 23:06 ` Trond Myklebust 2011-01-05 23:28 ` James Bottomley 2011-01-06 17:40 ` James Bottomley 2011-01-06 17:47 ` Trond Myklebust 2011-01-06 17:51 ` James Bottomley 2011-01-06 17:55 ` Linus Torvalds 2011-01-07 18:53 ` Trond Myklebust 2011-01-07 19:02 ` Russell King - ARM Linux 2011-01-07 19:11 ` James Bottomley 2011-01-08 16:49 ` Trond Myklebust 2011-01-08 23:15 ` Trond Myklebust 2011-01-10 10:50 ` Uwe Kleine-König 2011-01-10 16:25 ` Trond Myklebust 2011-01-10 17:08 ` Marc Kleine-Budde 2011-01-10 17:20 ` Trond Myklebust 2011-01-10 17:26 ` Marc Kleine-Budde 2011-01-10 19:25 ` Uwe Kleine-König 2011-01-10 19:29 ` Trond Myklebust 2011-01-10 19:31 ` James Bottomley 2011-01-10 19:34 ` Linus Torvalds 2011-01-10 20:15 ` Trond Myklebust 2011-01-10 12:44 ` Marc Kleine-Budde 2011-01-07 19:13 ` Trond Myklebust 2011-01-07 19:05 ` James Bottomley 2011-01-06 18:05 ` Russell King - ARM Linux 2011-01-06 18:14 ` James Bottomley 2011-01-06 18:25 ` James Bottomley 2011-01-06 21:07 ` James Bottomley 2011-01-06 20:19 ` John Stoffel 2011-01-05 23:28 ` Linus Torvalds 2011-01-05 23:59 ` Russell King - ARM Linux 2011-01-05 21:16 ` James Bottomley [not found] <AANLkTi=-dNeeDjcSoznKtwcaNyw1mMXSqepFY89R2i+2@mail.gmail.com> 2010-12-30 17:14 ` Uwe Kleine-König 2010-12-30 17:57 ` Linus Torvalds 2010-12-30 18:24 ` Trond Myklebust 2010-12-30 18:50 ` Linus Torvalds 2010-12-30 19:25 ` Trond Myklebust 2010-12-30 20:02 ` Linus Torvalds 2010-12-30 17:59 ` Trond Myklebust 2010-12-30 19:18 ` Uwe Kleine-König 2011-01-03 21:38 ` Uwe Kleine-König 2011-01-04 0:22 ` Trond Myklebust 2011-01-05 8:40 ` Uwe Kleine-König 2011-01-05 11:05 ` Uwe Kleine-König 2011-01-05 11:27 ` Russell King - ARM Linux 2011-01-05 12:14 ` Marc Kleine-Budde 2011-01-05 13:02 ` Nori, Sekhar 2011-01-05 15:34 ` Russell King - ARM Linux 2011-01-05 13:40 ` Uwe Kleine-König 2011-01-05 14:29 ` Jim Rees 2011-01-05 14:42 ` Marc Kleine-Budde 2011-01-05 15:38 ` Jim Rees 2011-01-05 14:53 ` Trond Myklebust 2011-01-05 15:01 ` Marc Kleine-Budde 2011-01-05 15:14 ` Trond Myklebust 2011-01-05 15:29 ` Trond Myklebust 2011-01-05 15:39 ` Marc Kleine-Budde 2011-01-05 15:52 ` Russell King - ARM Linux 2011-01-05 17:17 ` Trond Myklebust 2011-01-05 17:26 ` Russell King - ARM Linux 2011-01-05 18:12 ` Trond Myklebust 2011-01-05 18:27 ` Russell King - ARM Linux 2011-01-05 18:55 ` Trond Myklebust 2011-01-05 19:07 ` Russell King - ARM Linux 2011-01-14 2:25 ` Andy Isaacson 2011-01-14 2:40 ` Trond Myklebust 2011-01-14 4:22 ` Andy Isaacson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).