* 2.4: NFS client kmapping across I/O
@ 2002-01-29  2:25 Xeno
  2002-01-29  9:05 ` Trond Myklebust
  2002-01-29 17:42 ` Hugh Dickins
  0 siblings, 2 replies; 5+ messages in thread

From: Xeno @ 2002-01-29  2:25 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org

Trond, thanks for the excellent fattr race fix. I'm sorry I haven't been
able to give feedback until now; things got busy for a while. I have not
yet had a chance to run your fixes, but after studying them I believe
they will resolve the race nicely, especially with the use of
nfs_inode_lock in the recent NFS_ALL experimental patches. FWIW.

Now I also have time to mention the other NFS client issue we ran into
recently; I have not found mention of it on the mailing lists. The NFS
client is kmapping pages for the duration of reads from and writes to
the server. This creates a scaling limitation, especially under
CONFIG_HIGHMEM64G on i386, where there are only 512 entries in the
highmem kmap table. Under I/O load it is easy to fill up the table,
hanging all processes that need to map highmem pages a substantial
fraction of the time.

Before 2.4.15 it is particularly bad: nfs_flushd locks up the kernel
under I/O load. My testcase was to copy 12 100M files from one NFS
server to another; it was very reliable at producing the lockup right
away. nfs_flushd fills up the kmap table as it sends out requests, then
blocks waiting for another kmap entry to free up. But it is running in
rpciod, so rpciod is blocked and cannot process any responses to the
requests, and no kmap entries are ever freed once the table fills up.
In this state the machine pings and responds to SysRq on the serial
port, but just about everything else hangs.

It looks like nfs_flushd was turned off in 2.4.15; that is the
workaround I have applied to our machines. I have also limited the
number of requests across all NFS servers to LAST_PKMAP-64, to leave
some kmap entries available for non-NFS use. It is not an ideal
workaround, though: it artificially limits I/O to multiple servers.

I've thought about bumping up LAST_PKMAP to increase the size of the
highmem kmap table, but the table looks like it was designed to be
small. I've also thought about pushing the kmaps and kunmaps down into
the RPC layer, so the pages are only mapped while data is copied to or
from them, not while waiting for the network. That would be more work,
but it looks doable, so I wanted to run the problem and the approach by
you knowledgeable folks while I'm waiting for hardware to free up for
kernel hacking.

Thanks,
Xeno

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: 2.4: NFS client kmapping across I/O
  2002-01-29  2:25 2.4: NFS client kmapping across I/O Xeno
@ 2002-01-29  9:05 ` Trond Myklebust
  2002-01-29 17:42 ` Hugh Dickins
  1 sibling, 0 replies; 5+ messages in thread

From: Trond Myklebust @ 2002-01-29  9:05 UTC (permalink / raw)
To: Xeno; +Cc: linux-kernel@vger.kernel.org

>>>>> " " == xeno <xeno@overture.com> writes:

     > Trond, thanks for the excellent fattr race fix. I'm sorry I
     > haven't been able to give feedback until now, things got busy
     > for a while. I have not yet had a chance to run your fixes,
     > but after studying them I believe that they will resolve the
     > race nicely, especially with the use of nfs_inode_lock in the
     > recent NFS_ALL experimental patches. FWIW.

Note: the nfs_inode_lock is not in itself part of any 'fix'. It is
merely a substitute for the current BKL protection of attribute updates
on NFS. The BKL gets dropped from NFS + rpciod in the patch
linux-2.4.x-rpc_bkl.dif which is, of course, included in NFS_ALL.

     > I've also thought about pushing the kmaps and kunmaps down into
     > the RPC layer, so the pages are only mapped while data is
     > copied to or from them, not while waiting for the network.
     > That would be more work, but it looks doable, so I wanted to
     > run the problem and the approach by you knowledgeable folks
     > while I'm waiting for hardware to free up for kernel hacking.

This is the long term solution. We will in any case want to make the
RPC layer 'page aware' in order to be able to make use of the zero-copy
socket stuff (a.k.a. sendpage()). I'm still not ready to detail exactly
how it should be done though. I'll have to get back to you on this
one...

Cheers,
  Trond
* Re: 2.4: NFS client kmapping across I/O
  2002-01-29  2:25 2.4: NFS client kmapping across I/O Xeno
  2002-01-29  9:05 ` Trond Myklebust
@ 2002-01-29 17:42 ` Hugh Dickins
  2002-01-29 18:34   ` Rik van Riel
  2002-01-30  0:56   ` Xeno
  1 sibling, 2 replies; 5+ messages in thread

From: Hugh Dickins @ 2002-01-29 17:42 UTC (permalink / raw)
To: Xeno; +Cc: linux-kernel@vger.kernel.org

On Mon, 28 Jan 2002, Xeno wrote:
>
> Now I also have time to mention the other NFS client issue we ran into
> recently, I have not found mention of it on the mailing lists. The NFS
> client is kmapping pages for the duration of reads from and writes to
> the server. This creates a scaling limitation, especially under
> CONFIG_HIGHMEM64G and i386 where there are only 512 entries in the
> highmem kmap table. Under I/O load, it is easy to fill up the table,
> hanging all processes that need to map highmem pages a substantial
> fraction of the time.
> ...
> I've thought about bumping up LAST_PKMAP to increase the size of the
> highmem kmap table, but the table looks like it was designed to be
> small.

You're right that kmap users should avoid holding them for very long,
and I'd certainly not discourage anyone from pursuing that effort. But
in case it helps in the short term, below is the 2.4.18-pre7 version of
a patch I prepared earlier for 2.4.18-pre4aa1, which now defaults to
1800 kmaps for both HIGHMEM4G and HIGHMEM64G. As a side-effect, it
raises the overly strict limit on the i386 vmalloc area, like -aa. But
I've not touched PPC or Sparc; their limits are unchanged.

I don't think the kmap area was really designed to be so small; it's
just that nobody ever got around to allowing more than one page table
for it. It was surely absurd that HIGHMEM64G (doubly-sized ptes, so
half as many per page table) should be limited to fewer kmaps than
HIGHMEM4G.
Hugh

diff -urN 2.4.18-pre7/arch/i386/mm/init.c linux/arch/i386/mm/init.c
--- 2.4.18-pre7/arch/i386/mm/init.c	Mon Jan  7 13:02:56 2002
+++ linux/arch/i386/mm/init.c	Tue Jan 29 16:02:23 2002
@@ -167,44 +167,6 @@
 	set_pte_phys(address, phys, flags);
 }
 
-static void __init fixrange_init (unsigned long start, unsigned long end, pgd_t *pgd_base)
-{
-	pgd_t *pgd;
-	pmd_t *pmd;
-	pte_t *pte;
-	int i, j;
-	unsigned long vaddr;
-
-	vaddr = start;
-	i = __pgd_offset(vaddr);
-	j = __pmd_offset(vaddr);
-	pgd = pgd_base + i;
-
-	for ( ; (i < PTRS_PER_PGD) && (vaddr != end); pgd++, i++) {
-#if CONFIG_X86_PAE
-		if (pgd_none(*pgd)) {
-			pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
-			set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
-			if (pmd != pmd_offset(pgd, 0))
-				printk("PAE BUG #02!\n");
-		}
-		pmd = pmd_offset(pgd, vaddr);
-#else
-		pmd = (pmd_t *)pgd;
-#endif
-		for (; (j < PTRS_PER_PMD) && (vaddr != end); pmd++, j++) {
-			if (pmd_none(*pmd)) {
-				pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
-				set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
-				if (pte != pte_offset(pmd, 0))
-					BUG();
-			}
-			vaddr += PMD_SIZE;
-		}
-		j = 0;
-	}
-}
-
 static void __init pagetable_init (void)
 {
 	unsigned long vaddr, end;
@@ -221,8 +183,24 @@
 	pgd_base = swapper_pg_dir;
 
 #if CONFIG_X86_PAE
-	for (i = 0; i < PTRS_PER_PGD; i++)
+	/*
+	 * First set all four entries of the pgd.
+	 * Usually only one page is needed here: if PAGE_OFFSET lowered,
+	 * maybe three pages: need not be contiguous, but might as well.
+	 */
+	pmd = (pmd_t *)alloc_bootmem_low_pages(KERNEL_PGD_PTRS*PAGE_SIZE);
+	for (i = 1; i < USER_PGD_PTRS; i++)
 		set_pgd(pgd_base + i, __pgd(1 + __pa(empty_zero_page)));
+	for (; i < PTRS_PER_PGD; i++, pmd += PTRS_PER_PMD)
+		set_pgd(pgd_base + i, __pgd(1 + __pa(pmd)));
+	/*
+	 * Add low memory identity-mappings - SMP needs it when
+	 * starting up on an AP from real-mode. In the non-PAE
+	 * case we already have these mappings through head.S.
+	 * All user-space mappings are explicitly cleared after
+	 * SMP startup.
+	 */
+	pgd_base[0] = pgd_base[USER_PGD_PTRS];
 #endif
 	i = __pgd_offset(PAGE_OFFSET);
 	pgd = pgd_base + i;
@@ -231,14 +209,7 @@
 		vaddr = i*PGDIR_SIZE;
 		if (end && (vaddr >= end))
 			break;
-#if CONFIG_X86_PAE
-		pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
-		set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
-#else
-		pmd = (pmd_t *)pgd;
-#endif
-		if (pmd != pmd_offset(pgd, 0))
-			BUG();
+		pmd = pmd_offset(pgd, 0);
 		for (j = 0; j < PTRS_PER_PMD; pmd++, j++) {
 			vaddr = i*PGDIR_SIZE + j*PMD_SIZE;
 			if (end && (vaddr >= end))
@@ -274,34 +245,27 @@
 	}
 
 	/*
-	 * Fixed mappings, only the page table structure has to be
-	 * created - mappings will be set by set_fixmap():
+	 * Leave vmalloc() to create its own page tables as needed,
+	 * but create the page tables at top of virtual memory, to be
+	 * populated by kmap_high(), kmap_atomic(), and set_fixmap().
+	 * kmap_high() assumes pkmap_page_table contiguous throughout.
 	 */
-	vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
-	fixrange_init(vaddr, 0, pgd_base);
-
 #if CONFIG_HIGHMEM
-	/*
-	 * Permanent kmaps:
-	 */
 	vaddr = PKMAP_BASE;
-	fixrange_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base);
+#else
+	vaddr = FIXADDR_START;
+#endif
+	pmd = pmd_offset(pgd_offset_k(vaddr), vaddr);
+	i = (0UL - (vaddr & PMD_MASK)) >> PMD_SHIFT;
+	pte = (pte_t *)alloc_bootmem_low_pages(i*PAGE_SIZE);
+	for (; --i >= 0; pmd++, pte += PTRS_PER_PTE)
+		set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
+#if CONFIG_HIGHMEM
 
 	pgd = swapper_pg_dir + __pgd_offset(vaddr);
 	pmd = pmd_offset(pgd, vaddr);
 	pte = pte_offset(pmd, vaddr);
 	pkmap_page_table = pte;
-#endif
-
-#if CONFIG_X86_PAE
-	/*
-	 * Add low memory identity-mappings - SMP needs it when
-	 * starting up on an AP from real-mode. In the non-PAE
-	 * case we already have these mappings through head.S.
-	 * All user-space mappings are explicitly cleared after
-	 * SMP startup.
-	 */
-	pgd_base[0] = pgd_base[USER_PTRS_PER_PGD];
 #endif
 }
diff -urN 2.4.18-pre7/include/asm-i386/highmem.h linux/include/asm-i386/highmem.h
--- 2.4.18-pre7/include/asm-i386/highmem.h	Mon Jan 28 18:42:17 2002
+++ linux/include/asm-i386/highmem.h	Tue Jan 29 16:02:23 2002
@@ -42,17 +42,13 @@
 extern void kmap_init(void) __init;
 
 /*
- * Right now we initialize only a single pte table. It can be extended
- * easily, subsequent pte tables have to be allocated in one physical
- * chunk of RAM.
+ * LAST_PKMAP is the maximum number of persistent (non-atomic) kmappings
+ * outstanding at one time.  1800 was chosen as more than the previous
+ * 1024, not necessarily a power of 2, and 2000 would need one more
+ * page table (since last shared with fixmaps): redefine it if you need.
 */
-#define PKMAP_BASE (0xfe000000UL)
-#ifdef CONFIG_X86_PAE
-#define LAST_PKMAP 512
-#else
-#define LAST_PKMAP 1024
-#endif
-#define LAST_PKMAP_MASK (LAST_PKMAP-1)
+#define LAST_PKMAP 1800
+#define PKMAP_BASE (FIXADDR_START - (LAST_PKMAP << PAGE_SHIFT))
 
 #define PKMAP_NR(virt)  ((virt-PKMAP_BASE) >> PAGE_SHIFT)
 #define PKMAP_ADDR(nr)  (PKMAP_BASE + ((nr) << PAGE_SHIFT))
diff -urN 2.4.18-pre7/mm/highmem.c linux/mm/highmem.c
--- 2.4.18-pre7/mm/highmem.c	Mon Jan  7 13:03:04 2002
+++ linux/mm/highmem.c	Tue Jan 29 16:02:23 2002
@@ -85,8 +85,8 @@
 	count = LAST_PKMAP;
 	/* Find an empty entry */
 	for (;;) {
-		last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
-		if (!last_pkmap_nr) {
+		if (++last_pkmap_nr >= LAST_PKMAP) {
+			last_pkmap_nr = 0;
 			flush_all_zero_pkmaps();
 			count = LAST_PKMAP;
 		}
* Re: 2.4: NFS client kmapping across I/O
  2002-01-29 17:42 ` Hugh Dickins
@ 2002-01-29 18:34   ` Rik van Riel
  2002-01-30  0:56   ` Xeno
  1 sibling, 0 replies; 5+ messages in thread

From: Rik van Riel @ 2002-01-29 18:34 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Xeno, linux-kernel@vger.kernel.org

On Tue, 29 Jan 2002, Hugh Dickins wrote:
> On Mon, 28 Jan 2002, Xeno wrote:
> >
> > Now I also have time to mention the other NFS client issue we ran into
> > recently, I have not found mention of it on the mailing lists. The NFS
> > client is kmapping pages for the duration of reads from and writes to
> > the server. This creates a scaling limitation, especially under
>
> You're right that kmap users should avoid holding them for very long,
> I'd certainly not discourage anyone from pursuing that effort.

Things like this would be fixed by having a kmap variant which maps the
pages into process-private space, kind of like kmap_atomic(), but into
a (4 MB sized?) part of address space which is only visible within this
process.

This would mean that each process can have a few MB of kmap()d pages
... should be nice for things like having copy_{to,from}_user "cache"
the hot pages while keeping longer term mappings from filling the
global pool.

regards,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/
* Re: 2.4: NFS client kmapping across I/O
  2002-01-29 17:42 ` Hugh Dickins
  2002-01-29 18:34   ` Rik van Riel
@ 2002-01-30  0:56   ` Xeno
  1 sibling, 0 replies; 5+ messages in thread

From: Xeno @ 2002-01-30  0:56 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org; +Cc: Trond Myklebust, Hugh Dickins

Trond Myklebust wrote:
>
> We will in any case want to make the RPC layer 'page aware' in
> order to be able to make use of the zero-copy socket stuff
> (a.k.a. sendpage()). I'm still not ready to detail exactly how
> it should be done though. I'll have to get back to you on this one...

Oh, yeah, zero-copy is definitely the way to go. No hurry, the
workaround is good enough for now. Thanks, Trond!

Hugh Dickins wrote:
>
> I don't think the kmap area was really designed to be so small, it's
> just that nobody ever got around to allowing more than one page table
> for it. It was surely absurd that HIGHMEM64G (doubly-sized ptes so half
> as many per page table) should be limited to fewer kmaps than HIGHMEM4G.

Thanks for the patch, Hugh, I may try it out when I get some free
hardware next week. It looks like there is already a limit of 256 pages
kmapped per mount built into the NFS client, so a LAST_PKMAP of 1800
would allow heavy I/O to 7 mounts before filling up the kmap table, up
from 2 mounts.

I wasn't sure about the kmap table size because of the loops in
flush_all_zero_pkmaps and map_new_virtual. In the worst case, when the
table is nearly full, as when NFS is maxing it out, each of those
operations will tend to scan through the table without finding much
that can be freed or used. I don't have a machine where I can tinker
with it at the moment, though, so I'm not sure how much it matters.

Xeno
end of thread, other threads:[~2002-01-30  0:56 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-01-29  2:25 2.4: NFS client kmapping across I/O Xeno
2002-01-29  9:05 ` Trond Myklebust
2002-01-29 17:42 ` Hugh Dickins
2002-01-29 18:34   ` Rik van Riel
2002-01-30  0:56   ` Xeno