* Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
@ 2002-04-26 18:27 Russell King
2002-04-26 22:46 ` Andrea Arcangeli
2002-04-27 22:10 ` Daniel Phillips
0 siblings, 2 replies; 152+ messages in thread
From: Russell King @ 2002-04-26 18:27 UTC (permalink / raw)
To: linux-kernel
Hi,
I've been looking at some of the ARM discontigmem implementations, and
have come across a nasty bug. To illustrate this, I'm going to take
part of the generic kernel, and use the Alpha implementation to
illustrate the problem we're facing on ARM.
I'm going to argue here that virt_to_page() can, in the discontigmem
case, produce rather a nasty bug when used with non-direct mapped
kernel memory arguments.
In mm/memory.c:remap_pte_range() we have the following code:
page = virt_to_page(__va(phys_addr));
if ((!VALID_PAGE(page)) || PageReserved(page))
set_pte(pte, mk_pte_phys(phys_addr, prot));
Let's look closely at the first line:
page = virt_to_page(__va(phys_addr));
Essentially, what we're trying to do here is convert a physical address
to a struct page pointer.
__va() is defined, on Alpha, to be:
#define __va(x) ((void *)((unsigned long) (x) + PAGE_OFFSET))
so we produce a unique "va" for any physical address that is passed. No
problem so far. Now, let's look at virt_to_page() for the Alpha:
#define virt_to_page(kaddr) \
(ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))
Looks innocuous enough. However, look closer at ADDR_TO_MAPBASE:
#define ADDR_TO_MAPBASE(kaddr) \
NODE_MEM_MAP(KVADDR_TO_NID((unsigned long)(kaddr)))
#define NODE_MEM_MAP(nid) (NODE_DATA(nid)->node_mem_map)
#define NODE_DATA(n) (&((PLAT_NODE_DATA(n))->gendata))
#define PLAT_NODE_DATA(n) (plat_node_data[(n)])
Ok, so here we get the map base via:
plat_node_data[KVADDR_TO_NID((unsigned long)(kaddr))]->
gendata.node_mem_map
plat_node_data is declared as:
plat_pg_data_t *plat_node_data[MAX_NUMNODES];
Let's look closer at KVADDR_TO_NID() and MAX_NUMNODES:
#define KVADDR_TO_NID(kaddr) PHYSADDR_TO_NID(__pa(kaddr))
#define __pa(x) ((unsigned long) (x) - PAGE_OFFSET)
#define PHYSADDR_TO_NID(pa) ALPHA_PA_TO_NID(pa)
#define ALPHA_PA_TO_NID(pa) ((pa) >> 36) /* 16 nodes max due 43bit kseg */
#define MAX_NUMNODES WILDFIRE_MAX_QBB
#define WILDFIRE_MAX_QBB 8 /* more than 8 requires other mods */
So, we have a maximum of 8 nodes total, and therefore the plat_node_data
array is 8 entries large.
Now, what happens if 'kaddr' is below PAGE_OFFSET (because the user has
opened /dev/mem and mapped some random bit of physical memory space)?
__pa() returns a large positive number. We shift this large positive
number right by 36 bits, leaving 28 bits of large positive number, which
is far larger than our total of 8 nodes.
We use this 28-bit number to index plat_node_data. Whoops.
And now, for the icing on the cake, take a look at Alpha's pte_page()
implementation:
unsigned long kvirt; \
struct page * __xx; \
\
kvirt = (unsigned long)__va(pte_val(x) >> (32-PAGE_SHIFT)); \
__xx = virt_to_page(kvirt); \
\
__xx; \
Someone *please* tell me where I'm wrong. I really want to be wrong,
because I can see the same thing happening (in theory, and one report
in practice from a developer) on a certain ARM platform.
On ARM, however, we have a cherry to add here: __va() may alias certain
physical memory addresses to the same virtual memory address, which
makes:
VALID_PAGE(virt_to_page(__va(phys)))
completely nonsensical.
I'll try kicking myself 3 times to see if I wake up from this horrible
dream now. 8)
--
Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
^ permalink raw reply [flat|nested] 152+ messages in thread* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-04-26 18:27 Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Russell King @ 2002-04-26 22:46 ` Andrea Arcangeli 2002-04-29 17:50 ` Martin J. Bligh 2002-04-29 22:00 ` Roman Zippel 2002-04-27 22:10 ` Daniel Phillips 1 sibling, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-04-26 22:46 UTC (permalink / raw) To: Russell King; +Cc: linux-kernel On Fri, Apr 26, 2002 at 07:27:11PM +0100, Russell King wrote: > Hi, > > I've been looking at some of the ARM discontigmem implementations, and > have come across a nasty bug. To illustrate this, I'm going to take > part of the generic kernel, and use the Alpha implementation to > illustrate the problem we're facing on ARM. > > I'm going to argue here that virt_to_page() can, in the discontigmem > case, produce rather a nasty bug when used with non-direct mapped > kernel memory arguments. > > In mm/memory.c:remap_pte_range() we have the following code: > > page = virt_to_page(__va(phys_addr)); > if ((!VALID_PAGE(page)) || PageReserved(page)) > set_pte(pte, mk_pte_phys(phys_addr, prot)); > > Let's look closely at the first line: > > page = virt_to_page(__va(phys_addr)); > > Essentially, what we're trying to do here is convert a physical address > to a struct page pointer. > > __va() is defined, on Alpha, to be: > > #define __va(x) ((void *)((unsigned long) (x) + PAGE_OFFSET)) > > so we produce a unique "va" for any physical address that is passed. No > problem so far. Now, lets look at virt_to_page() for the Alpha: > > #define virt_to_page(kaddr) \ > (ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr)) > > Looks inoccuous enough. 
However, look closer at ADDR_TO_MAPBASE: > > #define ADDR_TO_MAPBASE(kaddr) \ > NODE_MEM_MAP(KVADDR_TO_NID((unsigned long)(kaddr))) > #define NODE_MEM_MAP(nid) (NODE_DATA(nid)->node_mem_map) > #define NODE_DATA(n) (&((PLAT_NODE_DATA(n))->gendata)) > #define PLAT_NODE_DATA(n) (plat_node_data[(n)]) > > Ok, so here we get the map base via: > > plat_node_data[KVADDR_TO_NID((unsigned long)(kaddr))]-> > gendata.node_mem_map > > plat_node_data is declared as: > > plat_pg_data_t *plat_node_data[MAX_NUMNODES]; > > Lets look closer at KVADDR_TO_NID() and MAX_NUMNODES: > > #define KVADDR_TO_NID(kaddr) PHYSADDR_TO_NID(__pa(kaddr)) > #define __pa(x) ((unsigned long) (x) - PAGE_OFFSET) > #define PHYSADDR_TO_NID(pa) ALPHA_PA_TO_NID(pa) > #define ALPHA_PA_TO_NID(pa) ((pa) >> 36) /* 16 nodes max due 43bit kseg */ > > #define MAX_NUMNODES WILDFIRE_MAX_QBB > #define WILDFIRE_MAX_QBB 8 /* more than 8 requires other mods */ > > So, we have a maximum of 8 nodes total, and therefore the plat_node_data > array is 8 entries large. > > Now, what happens if 'kaddr' is below PAGE_OFFSET (because the user has > opened /dev/mem and mapped some random bit of physical memory space)? > > __pa returns a large positive number. We shift this large positive > number left by 36 bits, leaving 28 bits of large positive number, which > is larger than our total 8 nodes. > > We use this 28-bit number to index plat_node_data. Whoops. > > And now, for the icing on the cake, take a look at Alpha's pte_page() > implementation: > > unsigned long kvirt; \ > struct page * __xx; \ > \ > kvirt = (unsigned long)__va(pte_val(x) >> (32-PAGE_SHIFT)); \ > __xx = virt_to_page(kvirt); \ > \ > __xx; \ > > > Someone *please* tell me where I'm wrong. I really want to be wrong, > because I can see the same thing happening (in theory, and one report > in practice from a developer) on a certain ARM platform. > > On ARM, however, we have cherry to add here. 
__va() may alias certain > physical memory addresses to the same virtual memory address, which > makes: > > VALID_PAGE(virt_to_page(__va(phys))) > > completely nonsensical. correct. This should fix it: --- 2.4.19pre7aa2/include/asm-alpha/mmzone.h.~1~ Fri Apr 26 10:28:28 2002 +++ 2.4.19pre7aa2/include/asm-alpha/mmzone.h Sat Apr 27 00:30:02 2002 @@ -106,8 +106,8 @@ #define kern_addr_valid(kaddr) test_bit(LOCAL_MAP_NR(kaddr), \ NODE_DATA(KVADDR_TO_NID(kaddr))->valid_addr_bitmap) -#define virt_to_page(kaddr) (ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr)) -#define VALID_PAGE(page) (((page) - mem_map) < max_mapnr) +#define virt_to_page(kaddr) (KVADDR_TO_NID((unsigned long) kaddr) < MAX_NUMNODES ? ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr) : 0) +#define VALID_PAGE(page) ((page) != NULL) #ifdef CONFIG_NUMA #ifdef CONFIG_NUMA_SCHED It still doesn't cover the ram between the end of a node and the start of the next node, but at least on alpha-wildfire there can be nothing mapped there (it's reserved for "more dimm ram" slots) and it would be even more costly to check if the address is in those intra-node holes. The invalid pages now will start at phys addr 64G*8 that is the maximum ram that linux can handle the the wildfire. if you mmap the intra-node ram via /dev/mem you risk for troubles anyways because there's no dimm there and probably the effect is undefined or unpredictable, it's just a "mustn't do that", /dev/mem is a "root" thing so the above approch looks fine to me. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-04-26 22:46 ` Andrea Arcangeli @ 2002-04-29 17:50 ` Martin J. Bligh 2002-04-29 22:00 ` Roman Zippel 1 sibling, 0 replies; 152+ messages in thread From: Martin J. Bligh @ 2002-04-29 17:50 UTC (permalink / raw) To: Andrea Arcangeli, Russell King; +Cc: linux-kernel, Daniel Phillips >> page = virt_to_page(__va(phys_addr)); >> >> ... >> >> __va() is defined, on Alpha, to be: >> >> # define __va(x) ((void *)((unsigned long) (x) + PAGE_OFFSET)) >> >> ... >> >> Now, what happens if 'kaddr' is below PAGE_OFFSET (because the user has >> opened /dev/mem and mapped some random bit of physical memory space)? But we generated kaddr by using __va, as above? If the user mapped /dev/mem and created a second possible answer for a P->V mapping, that seems irrelevant, as long as __va always returns the "primary" mapping into kernel virtual address space. I'd agree we're lacking some error checking here (maybe virt_to_page should be an inline that checks that kaddr really is a kernel virtual address), but I can't see a real practical problem in the scenario you describe. As other people seem to be able to, maybe I'm missing something ;-) I'm not sure if your arch is a 32-bit or 64-bit arch, but I see more of a problem in this code if we do "page = virt_to_page(__va(phys_addr));" on a physaddr that's in HIGHMEM on a 32 bit arch, in which we get garbage from the wrapping, and Daniel's "page = phys_to_page(phys_addr);" makes infintely more sense. Martin. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-04-26 22:46 ` Andrea Arcangeli 2002-04-29 17:50 ` Martin J. Bligh @ 2002-04-29 22:00 ` Roman Zippel 2002-04-30 0:43 ` Andrea Arcangeli 1 sibling, 1 reply; 152+ messages in thread From: Roman Zippel @ 2002-04-29 22:00 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Russell King, linux-kernel Hi, On Sat, 27 Apr 2002, Andrea Arcangeli wrote: > correct. This should fix it: > > --- 2.4.19pre7aa2/include/asm-alpha/mmzone.h.~1~ Fri Apr 26 10:28:28 2002 > +++ 2.4.19pre7aa2/include/asm-alpha/mmzone.h Sat Apr 27 00:30:02 2002 > @@ -106,8 +106,8 @@ > #define kern_addr_valid(kaddr) test_bit(LOCAL_MAP_NR(kaddr), \ > NODE_DATA(KVADDR_TO_NID(kaddr))->valid_addr_bitmap) > > -#define virt_to_page(kaddr) (ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr)) > -#define VALID_PAGE(page) (((page) - mem_map) < max_mapnr) > +#define virt_to_page(kaddr) (KVADDR_TO_NID((unsigned long) kaddr) < MAX_NUMNODES ? ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr) : 0) > +#define VALID_PAGE(page) ((page) != NULL) > > #ifdef CONFIG_NUMA > #ifdef CONFIG_NUMA_SCHED I'd prefer if VALID_PAGE would go away completely, that test was almost always to late. What about the patch below, it even reduces the code size by 1072 bytes (but it's otherwise untested). It introduces virt_to_valid_page and pte_valid_page, which include a check, whether the input is valid. 
bye, Roman Index: arch/arm/mach-arc/small_page.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/arm/mach-arc/small_page.c,v retrieving revision 1.1.1.1 diff -u -p -r1.1.1.1 small_page.c --- arch/arm/mach-arc/small_page.c 15 Jan 2002 18:12:17 -0000 1.1.1.1 +++ arch/arm/mach-arc/small_page.c 29 Apr 2002 20:38:49 -0000 @@ -150,8 +150,8 @@ static void __free_small_page(unsigned l unsigned long flags; struct page *page; - page = virt_to_page(spage); - if (VALID_PAGE(page)) { + page = virt_to_valid_page(spage); + if (page) { /* * The container-page must be marked Reserved Index: arch/arm/mm/fault-armv.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/arm/mm/fault-armv.c,v retrieving revision 1.1.1.5 diff -u -p -r1.1.1.5 fault-armv.c --- arch/arm/mm/fault-armv.c 14 Apr 2002 20:06:12 -0000 1.1.1.5 +++ arch/arm/mm/fault-armv.c 29 Apr 2002 19:18:37 -0000 @@ -240,9 +240,9 @@ make_coherent(struct vm_area_struct *vma */ void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr, pte_t pte) { - struct page *page = pte_page(pte); + struct page *page = pte_valid_page(pte); - if (VALID_PAGE(page) && page->mapping) { + if (page && page->mapping) { if (test_and_clear_bit(PG_dcache_dirty, &page->flags)) __flush_dcache_page(page); Index: arch/ia64/mm/init.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/ia64/mm/init.c,v retrieving revision 1.1.1.3 diff -u -p -r1.1.1.3 init.c --- arch/ia64/mm/init.c 24 Apr 2002 19:35:43 -0000 1.1.1.3 +++ arch/ia64/mm/init.c 29 Apr 2002 20:39:05 -0000 @@ -147,7 +147,7 @@ free_initrd_mem (unsigned long start, un printk(KERN_INFO "Freeing initrd memory: %ldkB freed\n", (end - start) >> 10); for (; start < end; start += PAGE_SIZE) { - if (!VALID_PAGE(virt_to_page(start))) + if (!virt_to_valid_page(start)) continue; clear_bit(PG_reserved, 
&virt_to_page(start)->flags); set_page_count(virt_to_page(start), 1); Index: arch/mips/mm/umap.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/mips/mm/umap.c,v retrieving revision 1.1.1.2 diff -u -p -r1.1.1.2 umap.c --- arch/mips/mm/umap.c 31 Jan 2002 22:19:02 -0000 1.1.1.2 +++ arch/mips/mm/umap.c 29 Apr 2002 19:17:45 -0000 @@ -116,8 +116,8 @@ void *vmalloc_uncached (unsigned long si static inline void free_pte(pte_t page) { if (pte_present(page)) { - struct page *ptpage = pte_page(page); - if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage)) + struct page *ptpage = pte_valid_page(page); + if (!ptpage || PageReserved(ptpage)) return; __free_page(ptpage); if (current->mm->rss <= 0) Index: arch/mips64/mm/umap.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/mips64/mm/umap.c,v retrieving revision 1.1.1.2 diff -u -p -r1.1.1.2 umap.c --- arch/mips64/mm/umap.c 31 Jan 2002 22:19:51 -0000 1.1.1.2 +++ arch/mips64/mm/umap.c 29 Apr 2002 19:17:29 -0000 @@ -115,8 +115,8 @@ void *vmalloc_uncached (unsigned long si static inline void free_pte(pte_t page) { if (pte_present(page)) { - struct page *ptpage = pte_page(page); - if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage)) + struct page *ptpage = pte_valid_page(page); + if (!ptpage || PageReserved(ptpage)) return; __free_page(ptpage); if (current->mm->rss <= 0) Index: arch/sh/mm/fault.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/sh/mm/fault.c,v retrieving revision 1.1.1.2 diff -u -p -r1.1.1.2 fault.c --- arch/sh/mm/fault.c 31 Jan 2002 22:19:42 -0000 1.1.1.2 +++ arch/sh/mm/fault.c 29 Apr 2002 19:17:11 -0000 @@ -298,8 +298,8 @@ void update_mmu_cache(struct vm_area_str return; #if defined(__SH4__) - page = pte_page(pte); - if (VALID_PAGE(page) && !test_bit(PG_mapped, &page->flags)) { + page = pte_valid_page(pte); + if (page && 
!test_bit(PG_mapped, &page->flags)) { unsigned long phys = pte_val(pte) & PTE_PHYS_MASK; __flush_wback_region((void *)P1SEGADDR(phys), PAGE_SIZE); __set_bit(PG_mapped, &page->flags); Index: arch/sparc/mm/generic.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc/mm/generic.c,v retrieving revision 1.1.1.2 diff -u -p -r1.1.1.2 generic.c --- arch/sparc/mm/generic.c 31 Jan 2002 22:19:00 -0000 1.1.1.2 +++ arch/sparc/mm/generic.c 29 Apr 2002 19:16:58 -0000 @@ -19,8 +19,8 @@ static inline void forget_pte(pte_t page if (pte_none(page)) return; if (pte_present(page)) { - struct page *ptpage = pte_page(page); - if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage)) + struct page *ptpage = pte_valid_page(page); + if (!ptpage || PageReserved(ptpage)) return; page_cache_release(ptpage); return; Index: arch/sparc/mm/sun4c.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc/mm/sun4c.c,v retrieving revision 1.1.1.3 diff -u -p -r1.1.1.3 sun4c.c --- arch/sparc/mm/sun4c.c 14 Apr 2002 20:05:32 -0000 1.1.1.3 +++ arch/sparc/mm/sun4c.c 29 Apr 2002 20:39:35 -0000 @@ -1327,7 +1327,7 @@ static __u32 sun4c_get_scsi_one(char *bu unsigned long page; page = ((unsigned long)bufptr) & PAGE_MASK; - if (!VALID_PAGE(virt_to_page(page))) { + if (!virt_to_valid_page(page)) { sun4c_flush_page(page); return (__u32)bufptr; /* already locked */ } @@ -2106,7 +2106,7 @@ static void sun4c_pte_clear(pte_t *ptep) static int sun4c_pmd_bad(pmd_t pmd) { return (((pmd_val(pmd) & ~PAGE_MASK) != PGD_TABLE) || - (!VALID_PAGE(virt_to_page(pmd_val(pmd))))); + (!virt_to_valid_page(pmd_val(pmd)))); } static int sun4c_pmd_present(pmd_t pmd) Index: arch/sparc64/kernel/traps.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc64/kernel/traps.c,v retrieving revision 1.1.1.3 diff -u -p -r1.1.1.3 traps.c --- arch/sparc64/kernel/traps.c 
11 Feb 2002 18:49:01 -0000 1.1.1.3 +++ arch/sparc64/kernel/traps.c 29 Apr 2002 20:39:53 -0000 @@ -1284,9 +1284,9 @@ void cheetah_deferred_handler(struct pt_ } if (recoverable) { - struct page *page = virt_to_page(__va(afar)); + struct page *page = virt_to_valid_page(__va(afar)); - if (VALID_PAGE(page)) + if (page) get_page(page); else recoverable = 0; Index: arch/sparc64/mm/generic.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc64/mm/generic.c,v retrieving revision 1.1.1.3 diff -u -p -r1.1.1.3 generic.c --- arch/sparc64/mm/generic.c 19 Mar 2002 01:27:51 -0000 1.1.1.3 +++ arch/sparc64/mm/generic.c 29 Apr 2002 19:14:35 -0000 @@ -19,8 +19,8 @@ static inline void forget_pte(pte_t page if (pte_none(page)) return; if (pte_present(page)) { - struct page *ptpage = pte_page(page); - if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage)) + struct page *ptpage = pte_valid_page(page); + if (!ptpage || PageReserved(ptpage)) return; page_cache_release(ptpage); return; Index: arch/sparc64/mm/init.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc64/mm/init.c,v retrieving revision 1.1.1.6 diff -u -p -r1.1.1.6 init.c --- arch/sparc64/mm/init.c 14 Apr 2002 20:06:08 -0000 1.1.1.6 +++ arch/sparc64/mm/init.c 29 Apr 2002 19:14:15 -0000 @@ -187,11 +187,10 @@ extern void __update_mmu_cache(unsigned void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t pte) { - struct page *page = pte_page(pte); + struct page *page = pte_valid_page(pte); unsigned long pg_flags; - if (VALID_PAGE(page) && - page->mapping && + if (page && page->mapping && ((pg_flags = page->flags) & (1UL << PG_dcache_dirty))) { int cpu = ((pg_flags >> 24) & (NR_CPUS - 1UL)); @@ -260,10 +259,10 @@ static inline void flush_cache_pte_range continue; if (pte_present(pte) && pte_dirty(pte)) { - struct page *page = pte_page(pte); + struct page *page = pte_valid_page(pte); 
unsigned long pgaddr, uaddr; - if (!VALID_PAGE(page) || PageReserved(page) || !page->mapping) + if (!page || PageReserved(page) || !page->mapping) continue; pgaddr = (unsigned long) page_address(page); uaddr = address + offset; Index: fs/proc/array.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/fs/proc/array.c,v retrieving revision 1.1.1.7 diff -u -p -r1.1.1.7 array.c --- fs/proc/array.c 14 Apr 2002 20:01:10 -0000 1.1.1.7 +++ fs/proc/array.c 29 Apr 2002 19:12:38 -0000 @@ -424,8 +424,8 @@ static inline void statm_pte_range(pmd_t ++*total; if (!pte_present(page)) continue; - ptpage = pte_page(page); - if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage)) + ptpage = pte_valid_page(page); + if (!ptpage || PageReserved(ptpage)) continue; ++*pages; if (pte_dirty(page)) Index: include/asm-cris/processor.h =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/include/asm-cris/processor.h,v retrieving revision 1.1.1.2 diff -u -p -r1.1.1.2 processor.h --- include/asm-cris/processor.h 31 Jan 2002 22:16:02 -0000 1.1.1.2 +++ include/asm-cris/processor.h 29 Apr 2002 20:40:17 -0000 @@ -101,8 +101,7 @@ unsigned long get_wchan(struct task_stru ({ \ unsigned long eip = 0; \ unsigned long regs = (unsigned long)user_regs(tsk); \ - if (regs > PAGE_SIZE && \ - VALID_PAGE(virt_to_page(regs))) \ + if (regs > PAGE_SIZE && virt_to_valid_page(regs)) \ eip = ((struct pt_regs *)regs)->irp; \ eip; }) Index: include/asm-i386/page.h =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/page.h,v retrieving revision 1.1.1.3 diff -u -p -r1.1.1.3 page.h --- include/asm-i386/page.h 24 Feb 2002 23:11:41 -0000 1.1.1.3 +++ include/asm-i386/page.h 29 Apr 2002 21:09:09 -0000 @@ -132,7 +132,10 @@ static __inline__ int get_order(unsigned #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET) #define __va(x) ((void *)((unsigned 
long)(x)+PAGE_OFFSET)) #define virt_to_page(kaddr) (mem_map + (__pa(kaddr) >> PAGE_SHIFT)) -#define VALID_PAGE(page) ((page - mem_map) < max_mapnr) +#define virt_to_valid_page(kaddr) ({ \ + unsigned long __paddr = __pa(kaddr); \ + __paddr < max_mapnr ? mem_map + (__paddr >> PAGE_SHIFT) : NULL; \ +}) #define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \ VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) Index: include/asm-i386/pgtable-2level.h =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-2level.h,v retrieving revision 1.1.1.1 diff -u -p -r1.1.1.1 pgtable-2level.h --- include/asm-i386/pgtable-2level.h 26 Nov 2001 19:29:55 -0000 1.1.1.1 +++ include/asm-i386/pgtable-2level.h 29 Apr 2002 21:13:29 -0000 @@ -57,6 +57,7 @@ static inline pmd_t * pmd_offset(pgd_t * #define ptep_get_and_clear(xp) __pte(xchg(&(xp)->pte_low, 0)) #define pte_same(a, b) ((a).pte_low == (b).pte_low) #define pte_page(x) (mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT)))) +#define pte_valid_page(x) (pte_val(x) < max_mapnr ? pte_page(x) : NULL) #define pte_none(x) (!(x).pte_low) #define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot)) Index: include/asm-i386/pgtable-3level.h =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-3level.h,v retrieving revision 1.1.1.1 diff -u -p -r1.1.1.1 pgtable-3level.h --- include/asm-i386/pgtable-3level.h 26 Nov 2001 19:29:55 -0000 1.1.1.1 +++ include/asm-i386/pgtable-3level.h 29 Apr 2002 21:13:08 -0000 @@ -87,6 +87,7 @@ static inline int pte_same(pte_t a, pte_ } #define pte_page(x) (mem_map+(((x).pte_low >> PAGE_SHIFT) | ((x).pte_high << (32 - PAGE_SHIFT)))) +#define pte_valid_page(x) (pte_val(x) < max_mapnr ? 
pte_page(x) : NULL) #define pte_none(x) (!(x).pte_low && !(x).pte_high) static inline pte_t __mk_pte(unsigned long page_nr, pgprot_t pgprot) Index: include/asm-m68k/processor.h =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/include/asm-m68k/processor.h,v retrieving revision 1.1.1.1 diff -u -p -r1.1.1.1 processor.h --- include/asm-m68k/processor.h 26 Nov 2001 19:29:57 -0000 1.1.1.1 +++ include/asm-m68k/processor.h 29 Apr 2002 20:40:37 -0000 @@ -139,7 +139,7 @@ unsigned long get_wchan(struct task_stru ({ \ unsigned long eip = 0; \ if ((tsk)->thread.esp0 > PAGE_SIZE && \ - (VALID_PAGE(virt_to_page((tsk)->thread.esp0)))) \ + (virt_to_valid_page((tsk)->thread.esp0))) \ eip = ((struct pt_regs *) (tsk)->thread.esp0)->pc; \ eip; }) #define KSTK_ESP(tsk) ((tsk) == current ? rdusp() : (tsk)->thread.usp) Index: include/asm-sh/pgalloc.h =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/include/asm-sh/pgalloc.h,v retrieving revision 1.1.1.2 diff -u -p -r1.1.1.2 pgalloc.h --- include/asm-sh/pgalloc.h 31 Jan 2002 22:15:51 -0000 1.1.1.2 +++ include/asm-sh/pgalloc.h 29 Apr 2002 19:11:43 -0000 @@ -105,9 +105,8 @@ static inline pte_t ptep_get_and_clear(p pte_clear(ptep); if (!pte_not_present(pte)) { - struct page *page = pte_page(pte); - if (VALID_PAGE(page)&& - (!page->mapping || !(page->mapping->i_mmap_shared))) + struct page *page = pte_valid_page(pte); + if (page && (!page->mapping || !(page->mapping->i_mmap_shared))) __clear_bit(PG_mapped, &page->flags); } return pte; Index: mm/memory.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/mm/memory.c,v retrieving revision 1.1.1.9 diff -u -p -r1.1.1.9 memory.c --- mm/memory.c 29 Apr 2002 17:30:38 -0000 1.1.1.9 +++ mm/memory.c 29 Apr 2002 20:38:17 -0000 @@ -76,8 +76,8 @@ mem_map_t * mem_map; */ void __free_pte(pte_t pte) { - struct page *page = pte_page(pte); - if 
((!VALID_PAGE(page)) || PageReserved(page)) + struct page *page = pte_valid_page(pte); + if (!page || PageReserved(page)) return; if (pte_dirty(pte)) set_page_dirty(page); @@ -278,9 +278,8 @@ skip_copy_pte_range: address = (address swap_duplicate(pte_to_swp_entry(pte)); goto cont_copy_pte_range; } - ptepage = pte_page(pte); - if ((!VALID_PAGE(ptepage)) || - PageReserved(ptepage)) + ptepage = pte_valid_page(pte); + if (!ptepage || PageReserved(ptepage)) goto cont_copy_pte_range; /* If it's a COW mapping, write protect it both in the parent and the child */ @@ -356,8 +355,8 @@ static inline int zap_pte_range(mmu_gath if (pte_none(pte)) continue; if (pte_present(pte)) { - struct page *page = pte_page(pte); - if (VALID_PAGE(page) && !PageReserved(page)) + struct page *page = pte_valid_page(pte); + if (page && !PageReserved(page)) freed ++; /* This will eventually call __free_pte on the pte. */ tlb_remove_page(tlb, ptep, address + offset); @@ -473,7 +472,7 @@ static struct page * follow_page(struct if (pte_present(pte)) { if (!write || (pte_write(pte) && pte_dirty(pte))) - return pte_page(pte); + return pte_valid_page(pte); } out: @@ -488,8 +487,6 @@ out: static inline struct page * get_page_map(struct page *page) { - if (!VALID_PAGE(page)) - return 0; return page; } @@ -860,12 +857,12 @@ static inline void remap_pte_range(pte_t end = PMD_SIZE; do { struct page *page; - pte_t oldpage; + pte_t oldpage, newpage; oldpage = ptep_get_and_clear(pte); - - page = virt_to_page(__va(phys_addr)); - if ((!VALID_PAGE(page)) || PageReserved(page)) - set_pte(pte, mk_pte_phys(phys_addr, prot)); + newpage = mk_pte_phys(phys_addr, prot); + page = pte_valid_page(newpage); + if (!page || PageReserved(page)) + set_pte(pte, newpage); forget_pte(oldpage); address += PAGE_SIZE; phys_addr += PAGE_SIZE; @@ -978,8 +975,8 @@ static int do_wp_page(struct mm_struct * { struct page *old_page, *new_page; - old_page = pte_page(pte); - if (!VALID_PAGE(old_page)) + old_page = pte_valid_page(pte); + if 
(!old_page) goto bad_wp_page; if (!TryLockPage(old_page)) { Index: mm/msync.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/mm/msync.c,v retrieving revision 1.1.1.2 diff -u -p -r1.1.1.2 msync.c --- mm/msync.c 14 Apr 2002 20:01:38 -0000 1.1.1.2 +++ mm/msync.c 29 Apr 2002 19:04:34 -0000 @@ -26,8 +26,8 @@ static int filemap_sync_pte(pte_t *ptep, pte_t pte = *ptep; if (pte_present(pte) && pte_dirty(pte)) { - struct page *page = pte_page(pte); - if (VALID_PAGE(page) && !PageReserved(page) && ptep_test_and_clear_dirty(ptep)) { + struct page *page = pte_valid_page(pte); + if (page && !PageReserved(page) && ptep_test_and_clear_dirty(ptep)) { flush_tlb_page(vma, address); set_page_dirty(page); } Index: mm/page_alloc.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/mm/page_alloc.c,v retrieving revision 1.1.1.8 diff -u -p -r1.1.1.8 page_alloc.c --- mm/page_alloc.c 24 Apr 2002 19:31:04 -0000 1.1.1.8 +++ mm/page_alloc.c 29 Apr 2002 20:42:30 -0000 @@ -101,8 +101,6 @@ static void __free_pages_ok (struct page BUG(); if (page->mapping) BUG(); - if (!VALID_PAGE(page)) - BUG(); if (PageSwapCache(page)) BUG(); if (PageLocked(page)) @@ -294,8 +292,6 @@ static struct page * balance_classzone(z BUG(); if (page->mapping) BUG(); - if (!VALID_PAGE(page)) - BUG(); if (PageSwapCache(page)) BUG(); if (PageLocked(page)) @@ -474,7 +470,7 @@ void __free_pages(struct page *page, uns void free_pages(unsigned long addr, unsigned int order) { if (addr != 0) - __free_pages(virt_to_page(addr), order); + __free_pages(virt_to_valid_page(addr), order); } /* Index: mm/slab.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/mm/slab.c,v retrieving revision 1.1.1.5 diff -u -p -r1.1.1.5 slab.c --- mm/slab.c 13 Mar 2002 21:16:16 -0000 1.1.1.5 +++ mm/slab.c 29 Apr 2002 20:44:21 -0000 @@ -1415,7 +1415,7 @@ alloc_new_slab_nolock: #if DEBUG # 
define CHECK_NR(pg) \ do { \ - if (!VALID_PAGE(pg)) { \ + if (!pg) { \ printk(KERN_ERR "kfree: out of range ptr %lxh.\n", \ (unsigned long)objp); \ BUG(); \ @@ -1439,7 +1439,7 @@ static inline void kmem_cache_free_one(k { slab_t* slabp; - CHECK_PAGE(virt_to_page(objp)); + CHECK_PAGE(virt_to_valid_page(objp)); /* reduces memory footprint * if (OPTIMIZE(cachep)) @@ -1519,7 +1519,7 @@ static inline void __kmem_cache_free (km #ifdef CONFIG_SMP cpucache_t *cc = cc_data(cachep); - CHECK_PAGE(virt_to_page(objp)); + CHECK_PAGE(virt_to_valid_page(objp)); if (cc) { int batchcount; if (cc->avail < cc->limit) { @@ -1601,7 +1601,7 @@ void kmem_cache_free (kmem_cache_t *cach { unsigned long flags; #if DEBUG - CHECK_PAGE(virt_to_page(objp)); + CHECK_PAGE(virt_to_valid_page(objp)); if (cachep != GET_PAGE_CACHE(virt_to_page(objp))) BUG(); #endif @@ -1626,7 +1626,7 @@ void kfree (const void *objp) if (!objp) return; local_irq_save(flags); - CHECK_PAGE(virt_to_page(objp)); + CHECK_PAGE(virt_to_valid_page(objp)); c = GET_PAGE_CACHE(virt_to_page(objp)); __kmem_cache_free(c, (void*)objp); local_irq_restore(flags); Index: mm/vmalloc.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/mm/vmalloc.c,v retrieving revision 1.1.1.5 diff -u -p -r1.1.1.5 vmalloc.c --- mm/vmalloc.c 24 Apr 2002 19:31:04 -0000 1.1.1.5 +++ mm/vmalloc.c 29 Apr 2002 18:59:39 -0000 @@ -45,8 +45,8 @@ static inline void free_area_pte(pmd_t * if (pte_none(page)) continue; if (pte_present(page)) { - struct page *ptpage = pte_page(page); - if (VALID_PAGE(ptpage) && (!PageReserved(ptpage))) + struct page *ptpage = pte_valid_page(page); + if (ptpage && (!PageReserved(ptpage))) __free_page(ptpage); continue; } Index: mm/vmscan.c =================================================================== RCS file: /usr/src/cvsroot/linux-2.5/mm/vmscan.c,v retrieving revision 1.1.1.7 diff -u -p -r1.1.1.7 vmscan.c --- mm/vmscan.c 24 Apr 2002 19:31:04 -0000 1.1.1.7 +++ mm/vmscan.c 29 
Apr 2002 18:57:37 -0000 @@ -206,9 +206,9 @@ static inline int swap_out_pmd(struct mm do { if (pte_present(*pte)) { - struct page *page = pte_page(*pte); + struct page *page = pte_valid_page(*pte); - if (VALID_PAGE(page) && !PageReserved(page)) { + if (page && !PageReserved(page)) { count -= try_to_swap_out(mm, vma, address, pte, page, classzone); if (!count) { address += PAGE_SIZE; ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-04-29 22:00 ` Roman Zippel @ 2002-04-30 0:43 ` Andrea Arcangeli 0 siblings, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-04-30 0:43 UTC (permalink / raw) To: Roman Zippel; +Cc: Russell King, linux-kernel On Tue, Apr 30, 2002 at 12:00:50AM +0200, Roman Zippel wrote: > Hi, > > On Sat, 27 Apr 2002, Andrea Arcangeli wrote: > > > correct. This should fix it: > > > > --- 2.4.19pre7aa2/include/asm-alpha/mmzone.h.~1~ Fri Apr 26 10:28:28 2002 > > +++ 2.4.19pre7aa2/include/asm-alpha/mmzone.h Sat Apr 27 00:30:02 2002 > > @@ -106,8 +106,8 @@ > > #define kern_addr_valid(kaddr) test_bit(LOCAL_MAP_NR(kaddr), \ > > NODE_DATA(KVADDR_TO_NID(kaddr))->valid_addr_bitmap) > > > > -#define virt_to_page(kaddr) (ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr)) > > -#define VALID_PAGE(page) (((page) - mem_map) < max_mapnr) > > +#define virt_to_page(kaddr) (KVADDR_TO_NID((unsigned long) kaddr) < MAX_NUMNODES ? ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr) : 0) > > +#define VALID_PAGE(page) ((page) != NULL) > > > > #ifdef CONFIG_NUMA > > #ifdef CONFIG_NUMA_SCHED > > I'd prefer if VALID_PAGE would go away completely, that test was almost > always to late. What about the patch below, it even reduces the code size it is _always_ too late indeed, I definitely agree with your proposal to change the common code API, yours is a much saner API. But that's a common code change call, my object was to fix the arch part without changing the common code, and after all my patch will work exactly the same as yours, it's just that you put the page != NULL check explicit and I still use VALID_PAGE instead. You can skip the overflow-check when we know the vaddr or the pte to match with a valid ram page, so it's a bit faster than my fix with discontigmem enabled. I'm not sure if for 2.4 it worth to change that given that my two liner arch-contained patch will also work flawlessy. 
I've got quite a lot of stuff pending in 2.4 that makes a huge difference to users, so I tend to prefer to leave the stuff that doesn't make a difference to users for 2.5 only (it's a cleanup plus a minor discontigmem optimization after all). So I recommend you push it to Linus after fixing the bugs below. > --- include/asm-i386/page.h 24 Feb 2002 23:11:41 -0000 1.1.1.3 > +++ include/asm-i386/page.h 29 Apr 2002 21:09:09 -0000 > @@ -132,7 +132,10 @@ static __inline__ int get_order(unsigned > #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET) > #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) > #define virt_to_page(kaddr) (mem_map + (__pa(kaddr) >> PAGE_SHIFT)) > -#define VALID_PAGE(page) ((page - mem_map) < max_mapnr) > +#define virt_to_valid_page(kaddr) ({ \ > + unsigned long __paddr = __pa(kaddr); \ > + __paddr < max_mapnr ? mem_map + (__paddr >> PAGE_SHIFT) : NULL; \ > +}) > > #define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \ > VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) > Index: include/asm-i386/pgtable-2level.h > =================================================================== > RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-2level.h,v > retrieving revision 1.1.1.1 > diff -u -p -r1.1.1.1 pgtable-2level.h > --- include/asm-i386/pgtable-2level.h 26 Nov 2001 19:29:55 -0000 1.1.1.1 > +++ include/asm-i386/pgtable-2level.h 29 Apr 2002 21:13:29 -0000 > @@ -57,6 +57,7 @@ static inline pmd_t * pmd_offset(pgd_t * > #define ptep_get_and_clear(xp) __pte(xchg(&(xp)->pte_low, 0)) > #define pte_same(a, b) ((a).pte_low == (b).pte_low) > #define pte_page(x) (mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT)))) > +#define pte_valid_page(x) (pte_val(x) < max_mapnr ? 
pte_page(x) : NULL) > #define pte_none(x) (!(x).pte_low) > #define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot)) > > Index: include/asm-i386/pgtable-3level.h > =================================================================== > RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-3level.h,v > retrieving revision 1.1.1.1 > diff -u -p -r1.1.1.1 pgtable-3level.h > --- include/asm-i386/pgtable-3level.h 26 Nov 2001 19:29:55 -0000 1.1.1.1 > +++ include/asm-i386/pgtable-3level.h 29 Apr 2002 21:13:08 -0000 > @@ -87,6 +87,7 @@ static inline int pte_same(pte_t a, pte_ > } > > #define pte_page(x) (mem_map+(((x).pte_low >> PAGE_SHIFT) | ((x).pte_high << (32 - PAGE_SHIFT)))) > +#define pte_valid_page(x) (pte_val(x) < max_mapnr ? pte_page(x) : NULL) > #define pte_none(x) (!(x).pte_low && !(x).pte_high) > max_mapnr is a pfn, not a physaddr, so you're off by a factor of 2^PAGE_SHIFT; the fix is trivial of course. Andrea
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-04-26 18:27 Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Russell King 2002-04-26 22:46 ` Andrea Arcangeli @ 2002-04-27 22:10 ` Daniel Phillips 2002-04-29 13:35 ` Andrea Arcangeli 1 sibling, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-04-27 22:10 UTC (permalink / raw) To: Russell King, linux-kernel On Friday 26 April 2002 20:27, Russell King wrote: > Hi, > > I've been looking at some of the ARM discontigmem implementations, and > have come across a nasty bug. To illustrate this, I'm going to take > part of the generic kernel, and use the Alpha implementation to > illustrate the problem we're facing on ARM. > > I'm going to argue here that virt_to_page() can, in the discontigmem > case, produce rather a nasty bug when used with non-direct mapped > kernel memory arguments. It's tough to follow, even when you know the code. While cooking my config_nonlinear patch I noticed the line you're concerned about and regarded it with deep suspicion. My patch does this: - page = virt_to_page(__va(phys_addr)); + page = phys_to_page(phys_addr); And of course took care that phys_to_page does the right thing in all cases. <plug> The new config_nonlinear was designed as a cleaner, more powerful replacement for all non-numa uses of config_discontigmem. </plug> -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-04-27 22:10 ` Daniel Phillips @ 2002-04-29 13:35 ` Andrea Arcangeli 2002-04-29 23:02 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-04-29 13:35 UTC (permalink / raw) To: Daniel Phillips; +Cc: Russell King, linux-kernel On Sun, Apr 28, 2002 at 12:10:20AM +0200, Daniel Phillips wrote: > On Friday 26 April 2002 20:27, Russell King wrote: > > Hi, > > > > I've been looking at some of the ARM discontigmem implementations, and > > have come across a nasty bug. To illustrate this, I'm going to take > > part of the generic kernel, and use the Alpha implementation to > > illustrate the problem we're facing on ARM. > > > > I'm going to argue here that virt_to_page() can, in the discontigmem > > case, produce rather a nasty bug when used with non-direct mapped > > kernel memory arguments. > > It's tough to follow, even when you know the code. While cooking my > config_nonlinear patch I noticed the line you're concerned about and > regarded it with deep suspicion. My patch does this: > > - page = virt_to_page(__va(phys_addr)); > + page = phys_to_page(phys_addr); > > And of course took care that phys_to_page does the right thing in all > cases. The problem remains the same also going from phys to page: the problem is that the mem_map isn't contiguous, and it choked when the phys addr was above the max ram physaddr. The patch I posted a few days ago will fix it (modulo unused ram space, but attempting to map unused ram space into the address space is a bug in the first place). > > <plug> > The new config_nonlinear was designed as a cleaner, more powerful > replacement for all non-numa uses of config_discontigmem. > </plug> I may be wrong because I've only had a short look at it so far, but the "non-numa" part is what I noticed too, and that's what renders it not a very interesting option IMHO. Most discontigmem needs numa too. 
If it cannot handle numa it isn't worth adding the complexity there; with numa we must view those chunks differently, not linearly. Also, there's nothing magic that says mem_map must have a special meaning; it isn't worth preserving the mem_map thing, and virt_to_page is a much cleaner abstraction than doing mem_map + pfn by hand. Andrea
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-04-29 13:35 ` Andrea Arcangeli @ 2002-04-29 23:02 ` Daniel Phillips 2002-05-01 2:23 ` Andrea Arcangeli 0 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-04-29 23:02 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Russell King, linux-kernel On Monday 29 April 2002 15:35, Andrea Arcangeli wrote: > On Sun, Apr 28, 2002 at 12:10:20AM +0200, Daniel Phillips wrote: > > On Friday 26 April 2002 20:27, Russell King wrote: > > > Hi, > > > > > > I've been looking at some of the ARM discontigmem implementations, and > > > have come across a nasty bug. To illustrate this, I'm going to take > > > part of the generic kernel, and use the Alpha implementation to > > > illustrate the problem we're facing on ARM. > > > > > > I'm going to argue here that virt_to_page() can, in the discontigmem > > > case, produce rather a nasty bug when used with non-direct mapped > > > kernel memory arguments. > > > > It's tough to follow, even when you know the code. While cooking my > > config_nonlinear patch I noticed the line you're concerned about and > > regarded it with deep suspicion. My patch does this: > > > > - page = virt_to_page(__va(phys_addr)); > > + page = phys_to_page(phys_addr); > > > > And of course took care that phys_to_page does the right thing in all > > cases. > > The problem remains the same also going from phys to page, the problem > is that it's not a contigous mem_map and it choked when the phys addr > was above the max ram physaddr. The patch I posted a few days ago will > fix it (modulo for ununused ram space, but attempting to map into the > address space unused ram space is a bug in the first place). My config_nonlinear patch does not suffer from the above problem. 
Here's the code: unsigned long vsection[MAX_SECTIONS]; static inline unsigned long phys_to_ordinal(phys_t p) { return vsection[p >> SECTION_SHIFT] + ((p & SECTION_MASK) >> PAGE_SHIFT); } static inline struct page *phys_to_page(unsigned long p) { return mem_map + phys_to_ordinal(p); } Nothing can go out of range. Sensible, no? > > <plug> > > The new config_nonlinear was designed as a cleaner, more powerful > > replacement for all non-numa uses of config_discontigmem. > > </plug> > > I maybe wrong because I only had a short look at it so far, but the > > "non-numa" is what I noticed too and that's what renders it not a very > > interesting option IMHO. Most discontigmem needs numa too. I am, first and foremost, presenting config_nonlinear as a replacement for config_discontig for *non-numa* uses of config_discontig. (Sorry if I'm repeating myself here.) There are also applications in numa. Please see the lse-tech archives for details. I expect that, by taking a fresh look at the numa code in the light of new work, the numa code can be cleaned up and simplified considerably. But that's "further work". Config_nonlinear stands on its own quite nicely. > If it cannot > handle numa it doesn't worth to add the complexity there, It does not add complexity, it removes complexity. Please read the patch more closely. It's very simple. It's also more powerful than config_discontig. > with numa we must view those chunks differently, not linearly. Correct. Now, if you want to extend my patch to handle multiple mem_map vectors, you do it by defining an ordinal_to_page and page_to_ordinal pair of mappings.[1] Don't you think this is a nicer way to organize things? > Also there's nothing > magic that says mem_map must have a magical meaning, doesn't worth to > preserve the mem_map thing, virt_to_page is a much cleaner abstraction > than doing mem_map + pfn by hand. True. 
The upcoming iteration of config_nonlinear moves all uses of mem_map inside the per-arch page.h headers, so that mem_map need not exist at all in configurations where there is no single mem_map. [1] Since allocation needs to be aware of the separate zones, _alloc_pages stays much as it is, but if we change all non-numa users of config_discontig over to config_nonlinear then we can get rid of the #ifdef CONFIG_NUMA's, by way of cleanup. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-04-29 23:02 ` Daniel Phillips @ 2002-05-01 2:23 ` Andrea Arcangeli 2002-04-30 23:12 ` Daniel Phillips 2002-05-01 18:05 ` Jesse Barnes 0 siblings, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-01 2:23 UTC (permalink / raw) To: Daniel Phillips; +Cc: Russell King, linux-kernel On Tue, Apr 30, 2002 at 01:02:05AM +0200, Daniel Phillips wrote: > My config_nonlinear patch does not suffer from the above problem. Here's the > code: > > unsigned long vsection[MAX_SECTIONS]; > > static inline unsigned long phys_to_ordinal(phys_t p) > { > return vsection[p >> SECTION_SHIFT] + ((p & SECTION_MASK) >> PAGE_SHIFT); > } > > static inline struct page *phys_to_page(unsigned long p) > { > return mem_map + phys_to_ordinal(p); > } > > Nothing can go out of range. Sensible, no? Really the above vsection[p >> SECTION_SHIFT] will overflow in the very same case I fixed a few days ago for numa-alpha. The whole point is that p isn't a ram page, and you assumed it is (the alpha code was also assuming that, and that's why it overflowed the same way as yours). Either that, or you're wasting a huge amount of ram on vsection on a 64bit arch. After the above out-of-range bug is fixed, in practice it is the same as the current discontigmem, except that with the current way you can take the page structures in the right node with numa. And again I cannot see any advantage in having a contiguous mem_map even for archs with only discontigmem and non-numa (I think only ARM falls in such a category, btw). > > > <plug> > > > The new config_nonlinear was designed as a cleaner, more powerful > > > replacement for all non-numa uses of config_discontigmem. > > > </plug> > > > > I maybe wrong because I only had a short look at it so far, but the > > "non-numa" is what I noticed too and that's what renders it not a very > > interesting option IMHO. Most discontigmem needs numa too. 
> > I am, first and foremost, presenting config_nonlinear as a replacement for > config_discontig for *non-numa* uses of config_discontig. (Sorry if I'm > repeating myself here.) > > There are also applications in numa. Please see the lse-tech archives for > details. I expect that, by taking a fresh look at numa code in the light > of new work, that the numa code can be cleaned up and simplififed > considerably. But that's "further work". Config_nonlinear stands on its > own quite nicely. Tell me how an ARM machine will run faster with nonlinear; it is doing nearly the same thing, except it's a lesser abstraction that forces a contiguous mem_map. Current code is much more powerful and it carries more information (the pgdat describes the whole memory topology to the common code), and it's not going to be slower, so I don't see why we should complicate the code with nonlinear. Personally I hate more than one way of doing the same thing if there's no need of it: the fewer the ways, the less you have to keep in mind, the simpler it is to understand, the better (partly offtopic, but for the very same reason when I work in userspace I much prefer coding in python to perl). > > If it cannot > > handle numa it doesn't worth to add the complexity there, > > It does not add complexity, it removes complexity. Please read the patch > more closely. It's very simple. It's also more powerful than > config_discontig. How? I may be overlooking something, but I would say it's anything but more powerful. I don't see any "power" point in trying to keep the mem_map contiguous. Please don't tell me it's more powerful, just tell me why. > > with numa we must view those chunks differently, not linearly. > > Correct. Now, if you want to extend my patch to handle multiple mem_map > vectors, you do it by defining an ordinal_to_page and page_to_ordinal pair > of mappings.[1] Don't you think this is a nicer way to organize things? What's the advantage? 
And once you can have more than one mem_map, once you've added this "vector", each mem_map will match a discontigmem pgdat. Tell me a numa machine where there's a hole in the middle of a node. The holes are always inter-node, never within the nodes themselves. So nonlinear-numa should fall back to the straight mem_map array pointed to by the pgdat all the time, like it does right now. The only advantage of nonlinear I can see would be a machine with a huge hole in a node: then with nonlinear you could avoid wasting mem_map on this hole without having to add another pgdat that would otherwise break numa assumptions on the pgdat. But I'm not aware of any machine with huge holes on the order of gigabytes in the middle of a node; at the very least, if that happens it means the hardware of the machine is misconfigured. The very same problem would happen right now on x86 if there were a huge hole in the physical ram: say you have 128M of ram, then a hole of 63G, and then the other physical 900M at offset 63G+128M. It will never happen; that's broken hardware if you see anything like that. At the very least I would wait for somebody with hardware so weird that it intentionally does the above to ask, instead of overdesigning common code abstractions, and there would also be other ways to deal with such a situation without requiring a contiguous mem_map. > > Also there's nothing > > magic that says mem_map must have a magical meaning, doesn't worth to > > preserve the mem_map thing, virt_to_page is a much cleaner abstraction > > than doing mem_map + pfn by hand. > > True. The upcoming iteration of config_nonlinear moves all uses of > mem_map inside the per-arch page.h headers, so that mem_map need not > exist at all in configurations where there is no single mem_map. That's fine, and all correct kernel code already does that correctly; nobody is allowed to use mem_map in any common code anywhere (besides the mm proper internals when discontigmem is disabled). 
Andrea
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-01 2:23 ` Andrea Arcangeli @ 2002-04-30 23:12 ` Daniel Phillips 2002-05-01 1:05 ` Daniel Phillips 2002-05-02 0:47 ` Andrea Arcangeli 2002-05-01 18:05 ` Jesse Barnes 1 sibling, 2 replies; 152+ messages in thread From: Daniel Phillips @ 2002-04-30 23:12 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Russell King, linux-kernel On Wednesday 01 May 2002 04:23, Andrea Arcangeli wrote: > On Tue, Apr 30, 2002 at 01:02:05AM +0200, Daniel Phillips wrote: > > My config_nonlinear patch does not suffer from the above problem. Here's the > > code: > > > > unsigned long vsection[MAX_SECTIONS]; > > > > static inline unsigned long phys_to_ordinal(phys_t p) > > { > > return vsection[p >> SECTION_SHIFT] + ((p & SECTION_MASK) >> PAGE_SHIFT); > > } > > > > static inline struct page *phys_to_page(unsigned long p) > > { > > return mem_map + phys_to_ordinal(p); > > } > > > > Nothing can go out of range. Sensible, no? > > Really the above vsection[p >> SECTION_SHIFT] will overflow in the very > same case I fixed a few days ago for numa-alpha. No it will not. Note however that you do want to choose SECTION_SHIFT so that the vsection table is not too large. > The whole point is that > p isn't a ram page and you assumed that (the alpha code was also assuming > that and that's why it overflowed the same way as yours). Either that > or you're wasting some huge tons of ram with vsection on a 64bit arch. No and no. In fact, the vsection table for my current project is only 32 elements long. > After the above out of range bug is fixed in practice it is the same as > the current discontigmem, except that with the current way you can take > the page structures in the right node with numa. And again I cannot see > any advantage in having a contiguous mem_map even for archs with only > discontigmem and non-numa > (I think only ARM falls in such category, btw). You would be wrong about that. 
It's clear that you have not looked at the config_nonlinear patch closely, and are not familiar with it. I'll try to provide some help, by enumerating some similarities and differences below. I'll apologize in advance for not replying to your email point by point. Sorry, there were just too many points ;-) Config_discontigmem ------------------- Has exactly one purpose: to eliminate memory wastage due to struct pages that refer to unpopulated regions of memory. It does this by dividing memory regions up into 'nodes', and each node is handled separately by the memory manager, which attempts to balance allocations between them. Config_discontig replicates however many zones there are across however many discontiguous regions there are, so for purposes of allocation, we end up with a two-dimensional array of zones, (MAX_NR_ZONES * MAX_NR_NODES). Config_discontigmem uses a table mapping in one direction: given an address, find a struct page in one of several ->mem_map arrays indexed by the address, or compute a physical memory address by finding a base physical address in an array indexed by the virtual address. Conversion from physical address to struct page also requires a table lookup, to locate the desired ->mem_map array. Config_nonlinear ---------------- Config_nonlinear introduces a new, logical address space, and uses a pair of tables, indexed by a few high bits of the address, one to map sections of logical address space to sections of physical address space, and the other to perform the reverse mapping. This pair of tables is used to define the usual set of address translation functions used to maintain the process page tables, including the kernel virtual page tables. The real work of doing this translation is, of course, performed by the address translation hardware - otherwise the bookkeeping cost of config_nonlinear is comparable to or slightly better than config_discontigmem. 
Such things as bootmem allocations and VALID_PAGE checks are carried out in the logical address space, which constitutes a considerable simplification vs config_discontigmem. Config_nonlinear was not designed as a replacement for numa address management, however, it is compatible with it and there are numa applications where config_nonlinear can create efficiencies that config_discontigmem cannot. That said, the rest of this discussion is concerned with non-numa applications of config_nonlinear. In the non-numa case, config_nonlinear does what config_discontigmem does, that is, eliminates struct page memory wastage due to unpopulated regions of memory, and in addition: - Can map a large range of physical memory into a small range of kernel virtual memory. This becomes important when physical memory is installed at widely separated intervals. - Does not artificially divide memory into nodes, instead joining it together in one homogeneous pool, which the memory manager divides into zones as *necessary* (for example, for highmem). - Sharply reduces the number of zones needing balancing vs config_discontigmem. Please take a look at the non-numa code in _alloc_pages that attempts to balance between the 'artificial' nodes created by config_discontigmem. It just round-robins between the nodes on each allocation, ignoring the relative availability of memory in the nodes. This obvious deficiency could be fixed by adding more (finicky) code, or the problem can be eliminated completely, using config_nonlinear. - Has better locality of reference in the mapping tables, because the tables are compact (and could easily be made yet more compact than in the posted patch). That is, each table entry in a config_discontig node array is 796 bytes, as opposed to 4 (or possibly 2 or 1) with config_nonlinear. - Eliminates two levels of procedure calls from the alloc_pages call chain. 
- Provides a simple model that is easily implemented across *all* architectures. Please look at the config_discontigmem option and see how many architectures support it. Hint: it is not enough just to add the option to config.in. - Leads forward to interesting possibilities such as hot plug memory. (Because pinned kernel memory can be remapped to an alternative region of physical memory if desired) - Cleans things up considerably. It eliminates the unholy marriage of config_discontig-for-the-purpose of working around physical memory holes and config_discontig-for-the-purpose of numa allocation. For example, eliminates a number of #ifdefs from the numa code, allowing the numa code to be developed in the way that is best for numa, instead of being hobbled by the need to serve a dual purpose. It's easy to wave your hands at the idea that code should ever be cleaned up. As an example of just how much the config_nonlinear patch cleans things up, let's look at the ARM definition of VALID_PAGE, with config_discontigmem: #define KVADDR_TO_NID(addr) \ (((unsigned long)(addr) - 0xc0000000) >> 27) #define NODE_DATA(nid) (&discontig_node_data[nid]) #define NODE_MEM_MAP(nid) (NODE_DATA(nid)->node_mem_map) #define VALID_PAGE(page) \ ({ unsigned int node = KVADDR_TO_NID(page); \ ( (node < NR_NODES) && \ ((unsigned)((page) - NODE_MEM_MAP(node)) < NODE_DATA(node)->node_size) ); \ }) With config_nonlinear (which does the same job as config_discontigmem in this case) we get: static inline int VALID_PAGE(struct page *page) { return page - mem_map < max_mapnr; } Isn't that nice? -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-04-30 23:12 ` Daniel Phillips @ 2002-05-01 1:05 ` Daniel Phillips 2002-05-02 0:47 ` Andrea Arcangeli 1 sibling, 0 replies; 152+ messages in thread From: Daniel Phillips @ 2002-05-01 1:05 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Russell King, linux-kernel On Wednesday 01 May 2002 01:12, I wrote: > Config_discontigmem > ------------------- > > Has exactly one purpose: to eliminate memory wastage due to struct pages > that refer to unpopulated regions of memory... That is, when not used together with config_numa. When used with config_numa, it has a second purpose: to allow ->mem_map arrays to exist on the same numa node as the referenced pages. The config_nonlinear patch could be extended to handle this as well (by elaborating the definitions of virt_to_page and phys_to_page) but I don't have plans to do that at this time. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-04-30 23:12 ` Daniel Phillips @ 2002-05-02 0:47 ` Andrea Arcangeli 2002-05-01 1:26 ` Daniel Phillips 2002-05-02 2:37 ` William Lee Irwin III 1 sibling, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 0:47 UTC (permalink / raw) To: Daniel Phillips; +Cc: Russell King, linux-kernel On Wed, May 01, 2002 at 01:12:48AM +0200, Daniel Phillips wrote: > On Wednesday 01 May 2002 04:23, Andrea Arcangeli wrote: > > On Tue, Apr 30, 2002 at 01:02:05AM +0200, Daniel Phillips wrote: > > > My config_nonlinear patch does not suffer from the above problem. Here's the > > > code: > > > > > > unsigned long vsection[MAX_SECTIONS]; > > > > > > static inline unsigned long phys_to_ordinal(phys_t p) > > > { > > > return vsection[p >> SECTION_SHIFT] + ((p & SECTION_MASK) >> PAGE_SHIFT); > > > } > > > > > > static inline struct page *phys_to_page(unsigned long p) > > > { > > > return mem_map + phys_to_ordinal(p); > > > } > > > > > > Nothing can go out of range. Sensible, no? > > > > Really the above vsection[p >> SECTION_SHIFT] will overflow in the very > > same case I fixed a few days ago for numa-alpha. > > No it will not. Note however that you do want to choose SECTION_SHIFT so > that the vsection table is not too large. You cannot choose SECTION_SHIFT, the hardware will define it for you. A 64bit arch will get discontiguous for example in 64G chunks (real world example actually), so your SECTION_SHIFT will be equal to 36 and you will overflow as I said in the previous email (just like discontigmem in mainline did), unless you want to waste a huge amount of ram on the table (with 2^64/64G entries i.e. sizeof(long) * 2^(64-36) bytes). > > The whole point is that > > p isn't a ram page and you assumed that (the alpha code was also assuming > > that and that's why it overflowed the same way as yours). 
Either that > > or you're wasting some huge tons of ram with vsection on a 64bit arch. > > No and no. In fact, the vsection table for my current project is only 32 > elements long. See above. > created by config_discontigmem. It just cycles between round robin > between the nodes on each allocation, ignoring the relative Forget mainline. Look at 2.4.19pre7aa3 _only_ when you look into numa; there are a huge number of fixes in that area, also from Samuel Ortiz and others. Before I even consider pushing those fixes into mainline (btw, they are cleanly separated in orthogonal patches, not mixed with the other stuff), I will need to see the other vm updates that everybody deals with included (only a limited number of users is affected by numa issues). > - Has better locality of reference in the mapping tables, because the > tables are compact (and could easily be made yet more compact than in > the posted patch). That is, each table entry in a config_discontig > node array is 796 bytes, as opposed to 4 (or possibly 2 or 1) with > config_nonlinear. Oh yeah, you save 1 microsecond every 10 years of uptime by taking advantage of the potentially coalesced cacheline between the last page in a node and the first page of the next node. Before you can care about these optimizations you should remove from x86 the pgdat loops that are not needed with discontigmem disabled like on x86 (this has nothing to do with discontigmem/nonlinear). That wouldn't be measurable either, but at least it would be more worthwhile. > - Eliminates two levels of procedure calls from the alloc_pages call > chain. Again, look at -aa, not mainline. > - Provides a simple model that is easily implemented across *all* I don't see much simplicity, it's only weaker I think. > architectures. Please look at the config_discontigmem option and see > how many architectures support it. Hint: it is not enough just to > add the option to config.in. > > - Leads forward to interesting possibilities such as hot plug memory. 
> (Because pinned kernel memory can be remapped to an alternative > region of physical memory if desired) You cannot handle hot plug with nonlinear: you cannot keep the mem_map contiguous when somebody plugs in new memory; you have to allocate the mem_map in the new node, which discontigmem allows and nonlinear doesn't. At the very least you would have to waste tons of memory on unused mem_map for all the potential memory that you're going to plug in, if you want to handle hot-plug with nonlinear. Breaking up the limitation of the contiguous mem_map has been one of the goals achieved with 2.4; there is no significant advantage (but only the old limitations) in trying to coalesce it again. > - Cleans things up considerably. It eliminates the unholy marriage of It clobbers things considerably, because it overlaps another, more powerful functionality that is needed anyway for hotplug of ram and numa. > config_discontig-for-the-purpose of working around physical memory > holes and config_discontig-for-the-purpose of numa allocation. For > example, eliminates a number of #ifdefs from the numa code, allowing > the numa code to be developed in the way that is best for numa, > instead of being hobbled by the need to serve a dual purpose. > > It's easy to wave your hands at the idea that code should ever be cleaned up. 
> As an example of just how much the config_nonlinear patch cleans things up, > let's look at the ARM definition of VALID_PAGE, with config_discontigmem: > > #define KVADDR_TO_NID(addr) \ > (((unsigned long)(addr) - 0xc0000000) >> 27) > > #define NODE_DATA(nid) (&discontig_node_data[nid]) > > #define NODE_MEM_MAP(nid) (NODE_DATA(nid)->node_mem_map) > > #define VALID_PAGE(page) \ > ({ unsigned int node = KVADDR_TO_NID(page); \ > ( (node < NR_NODES) && \ > ((unsigned)((page) - NODE_MEM_MAP(node)) < NODE_DATA(node)->node_size) ); \ > }) > > With config_nonlinear (which does the same job as config_discontigmem in this > case) we get: > > static inline int VALID_PAGE(struct page *page) > { > return page - mem_map < max_mapnr; > } > > Isn't that nice? It isn't nicer to my eyes. It cannot handle a non-contiguous mem_map, a showstopper for hotplug ram and numa, and secondly VALID_PAGE must go away in the first place. The rest of the NODE_MEM_MAP(node) is completely equivalent to your phys_to_ordinal, just in a different place and capable of handling a discontiguous mem_map too. For my tree I'm not going to include it for now. From my current understanding of the thing, the only ones that could ask for it are the ia64 folks with huge holes in the middle of the ram of the nodes, who may prefer to hide their intra-node discontigmem stuff behind the numa layer to avoid confusing the numa heuristics (still, I don't know how big those holes are, so whether they really need it depends on that too), so I will wait for them to ask for it before considering an inclusion. If we need to complicate the MM code a lot, I need somebody to ask for it with some valid reason; your "cleanup" argument doesn't apply here IMHO (this is not like a change to the IDE code that reshapes the code while remaining completely functionally equivalent), so until somebody asks for it with valid arguments this remains an overlapping, unneeded and complex functionality to my eyes. 
(btw, the 640k-1M hole is due to backwards compatibility, so at least it had a good reason, and it's a very small hole after all, one we don't even care to skip in the mem_map array on a 4M box) Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 0:47 ` Andrea Arcangeli @ 2002-05-01 1:26 ` Daniel Phillips 2002-05-02 1:43 ` Andrea Arcangeli 2002-05-02 2:37 ` William Lee Irwin III 1 sibling, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-01 1:26 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Russell King, linux-kernel On Thursday 02 May 2002 02:47, Andrea Arcangeli wrote: > > - Leads forward to interesting possibilities such as hot plug memory. > > (Because pinned kernel memory can be remapped to an alternative > > region of physical memory if desired) > > You cannot handle hot plug with nonlinear, you cannot take the mem_map > contigous when somebody plugins new memory, you've to allocate the > mem_map in the new node, discontigmem allows that, nonlinear doesn't. You have not read and understood the patch, which this comment demonstrates. For your information, the mem_map lives in *virtual* memory, it does not need to change location, only the kernel page tables need to be updated, to allow a section of kernel memory to be moved to a different physical location. For user memory, this was always possible, now it is possible for kernel memory as well. Naturally, it's not all you have to do to get hotplug memory, but it's a big step in that direction. > At the very least you should waste some tons of memory of unused mem_map > for all the potential memory that you're going to plugin, if you want to > handle hot-plug with nonlinear. Eh. No. It's not useful for me to keep correcting you on your misunderstanding of what config_nonlinear actually does. Please read Jonathan Corbet's excellent writeup in lwn, it's written in a very understandable fashion. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-01 1:26 ` Daniel Phillips @ 2002-05-02 1:43 ` Andrea Arcangeli 2002-05-01 2:41 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 1:43 UTC (permalink / raw) To: Daniel Phillips; +Cc: Russell King, linux-kernel On Wed, May 01, 2002 at 03:26:22AM +0200, Daniel Phillips wrote: > On Thursday 02 May 2002 02:47, Andrea Arcangeli wrote: > > > - Leads forward to interesting possibilities such as hot plug memory. > > > (Because pinned kernel memory can be remapped to an alternative > > > region of physical memory if desired) > > > > You cannot handle hot plug with nonlinear, you cannot take the mem_map > > contigous when somebody plugins new memory, you've to allocate the > > mem_map in the new node, discontigmem allows that, nonlinear doesn't. > > You have not read and understood the patch, which this comment demonstrates. > > For your information, the mem_map lives in *virtual* memory, it does not > need to change location, only the kernel page tables need to be updated, > to allow a section of kernel memory to be moved to a different physical > location. For user memory, this was always possible, now it is possible > for kernel memory as well. Naturally, it's not all you have to do to get > hotplug memory, but it's a big step in that direction. what kernel pagetables? pagetables for space that you left free for what? You waste virtual space for that, at the very least on x86, where it is very tight; at this point kernel virtual space is more costly than physical space these days. And nevertheless most archs don't have pagetables at all for reading and writing the page structures. Yes, it's virtual memory, but it's a direct mapping. DaveM even rewrote the palcode of sparc to skip the pagetable walking for the kernel direct mapping. Alpha and MIPS have a kseg that maps to physical addresses directly. 
there are _no_ pagetables for the mem_map on most archs; the PALcode resolves it directly without using the TLB. So if you move the mem_map into pagetables, like modules, to handle hotplug, you automatically make the whole kernel slower due to the additional PTE walking and TLB thrashing. > > At the very least you should waste some tons of memory of unused mem_map > > for all the potential memory that you're going to plugin, if you want to > > handle hot-plug with nonlinear. > > Eh. No. > > It's not useful for me to keep correcting you on your misunderstanding of > what config_nonlinear actually does. Please read Jonathan Corbet's > excellent writeup in lwn, it's written in a very understandable fashion. > > -- > Daniel Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 1:43 ` Andrea Arcangeli @ 2002-05-01 2:41 ` Daniel Phillips 2002-05-02 13:34 ` Andrea Arcangeli 0 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-01 2:41 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Russell King, linux-kernel On Thursday 02 May 2002 03:43, Andrea Arcangeli wrote: > On Wed, May 01, 2002 at 03:26:22AM +0200, Daniel Phillips wrote: > > For your information, the mem_map lives in *virtual* memory, it does not > > need to change location, only the kernel page tables need to be updated, > > to allow a section of kernel memory to be moved to a different physical > > location. For user memory, this was always possible, now it is possible > > for kernel memory as well. Naturally, it's not all you have to do to get > > hotplug memory, but it's a big step in that direction. > > what kernel pagetables? The normal page tables that are used to map kernel memory. > pagetables for space that you left free for what? These page tables have not been left free for anything. The nice thing about page tables is that you can change the page table entries to point wherever you want. (I know you know this.) This is what config_nonlinear supports, and that is why it's called config_nonlinear. When we want to remap part of the kernel memory to a different piece of physical memory, we just need to fill in some pte's. The tricky part is knowing how to fill in the ptes, and config_nonlinear takes care of that. > You waste virtual space for that at the very least on x86 that is > just very tigh, at this point kernel virtual space is more costly than > physical space these days. And nevertheless most archs doesn't have > pagetables at all to read and write the page structures. yes it's > virtual memory but it's a direct mapping. Most architectures? That's quite possibly an exaggeration. Some architectures - MIPS32 for example - make this difficult or impossible, and so what? 
Those can't do software hotplug memory, sorry. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-01 2:41 ` Daniel Phillips @ 2002-05-02 13:34 ` Andrea Arcangeli 2002-05-02 15:18 ` Martin J. Bligh 2002-05-02 16:00 ` William Lee Irwin III 0 siblings, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 13:34 UTC (permalink / raw) To: Daniel Phillips; +Cc: Russell King, linux-kernel On Wed, May 01, 2002 at 04:41:17AM +0200, Daniel Phillips wrote: > On Thursday 02 May 2002 03:43, Andrea Arcangeli wrote: > > On Wed, May 01, 2002 at 03:26:22AM +0200, Daniel Phillips wrote: > > > For your information, the mem_map lives in *virtual* memory, it does not > > > need to change location, only the kernel page tables need to be updated, > > > to allow a section of kernel memory to be moved to a different physical > > > location. For user memory, this was always possible, now it is possible > > > for kernel memory as well. Naturally, it's not all you have to do to get > > > hotplug memory, but it's a big step in that direction. > > > > what kernel pagetables? > > The normal page tables that are used to map kernel memory. > > > pagetables for space that you left free for what? > > These page tables have not been left free for anything. The nice thing about > page tables is that you can change the page table entries to point wherever > you want. (I know you know this.) This is what config_nonlinear supports, > and that is why it's called config_nonlinear. When we want to remap part of > the kernel memory to a different piece of physical memory, we just need to > fill in some pte's. The tricky part is knowing how to fill in the ptes, and > config_nonlinear takes care of that. > > > You waste virtual space for that at the very least on x86 that is > > just very tigh, at this point kernel virtual space is more costly than > > physical space these days. And nevertheless most archs doesn't have > > pagetables at all to read and write the page structures. 
yes it's > > virtual memory but it's a direct mapping. > > Most architectures? That's quite possibly an exaggeration. Some > architectures - MIPS32 for example - make this difficult or impossible, > and so what? > Those can't do software hotplug memory, sorry. Alpha is the same as MIPS, I think. Sparc would be the same too, if there's any discontigmem sparc. Dunno about ARM. We're talking about architectures needing discontigmem; 99% of users don't need discontigmem in the first place. You never need discontigmem on x86, and even in new-numa you don't need discontigmem: you want to go through discontigmem only to get the NUMA topology description that the current discontigmem provides via the pgdat. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 13:34 ` Andrea Arcangeli @ 2002-05-02 15:18 ` Martin J. Bligh 2002-05-02 15:35 ` Andrea Arcangeli 2002-05-02 16:00 ` William Lee Irwin III 1 sibling, 1 reply; 152+ messages in thread From: Martin J. Bligh @ 2002-05-02 15:18 UTC (permalink / raw) To: Andrea Arcangeli, Daniel Phillips; +Cc: Russell King, linux-kernel > alpha is the same as mips I think. sparc would be the same too if > there's any discontigmem sparc. Dunno of arm. We're talking about > architectures needing discontigmem, 99% percent of users doesn't need > discontigmem in the first place, you never need discontigmem in x86 and That's not true. We use discontigmem on the NUMA-Q boxes for NUMA support. In some memory models, they're also really discontiguous memory machines. At the moment I use the contig memory model (so we only use discontig for NUMA support) but I may need to change that in the future. M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 15:18 ` Martin J. Bligh @ 2002-05-02 15:35 ` Andrea Arcangeli 2002-05-01 15:42 ` Daniel Phillips 2002-05-02 16:07 ` Martin J. Bligh 0 siblings, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 15:35 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote: > > alpha is the same as mips I think. sparc would be the same too if > > there's any discontigmem sparc. Dunno of arm. We're talking about > > architectures needing discontigmem, 99% percent of users doesn't need > > discontigmem in the first place, you never need discontigmem in x86 and > > That's not true. We use discontigmem on the NUMA-Q boxes for NUMA support. > In some memory models, they're also really discontigous memory machines. With NUMA-Q there's a 512M hole in each node IIRC; that's a fine configuration, similar to the wildfire btw. > At the moment I use the contig memory model (so we only use discontig for > NUMA support) but I may need to change that in the future. I wasn't thinking of NUMA-Q, but regardless NUMA-Q fits perfectly into the current discontigmem-numa model too, as far as I can see. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 15:35 ` Andrea Arcangeli @ 2002-05-01 15:42 ` Daniel Phillips 2002-05-02 16:06 ` Andrea Arcangeli 2002-05-02 16:07 ` Martin J. Bligh 1 sibling, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-01 15:42 UTC (permalink / raw) To: Andrea Arcangeli, Martin J. Bligh; +Cc: Russell King, linux-kernel On Thursday 02 May 2002 17:35, Andrea Arcangeli wrote: > On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote: > > At the moment I use the contig memory model (so we only use discontig for > > NUMA support) but I may need to change that in the future. > > I wasn't thinking at numa-q, but regardless numa-Q fits perfectly into > the current discontigmem-numa model too as far I can see. No it doesn't. The config_discontigmem model forces all zone_normal memory to be on node zero, so all the remaining nodes can only have highmem locally. Even with good cache hardware, this has to hurt. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-01 15:42 ` Daniel Phillips @ 2002-05-02 16:06 ` Andrea Arcangeli 2002-05-02 16:10 ` Martin J. Bligh 2002-05-02 23:42 ` Daniel Phillips 0 siblings, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 16:06 UTC (permalink / raw) To: Daniel Phillips; +Cc: Martin J. Bligh, Russell King, linux-kernel On Wed, May 01, 2002 at 05:42:40PM +0200, Daniel Phillips wrote: > On Thursday 02 May 2002 17:35, Andrea Arcangeli wrote: > > On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote: > > > At the moment I use the contig memory model (so we only use discontig for > > > NUMA support) but I may need to change that in the future. > > > > I wasn't thinking at numa-q, but regardless numa-Q fits perfectly into > > the current discontigmem-numa model too as far I can see. > > No it doesn't. The config_discontigmem model forces all zone_normal memory > to be on node zero, so all the remaining nodes can only have highmem locally. You can trivially map the phys mem between 1G and 1G+256M to be in a direct mapping between 3G+256M and 3G+512M, then you can put such 256M at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too. The constraints you have on the normal memory are only two: 1) direct mapping 2) DMA So as long as the RAM is capable of 32-bit DMA with pci32 and it's mapped in the direct mapping, you can put it into the normal zone. There is no difference at all between discontigmem and nonlinear in this sense. > Even with good cache hardware, this has to hurt. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 16:06 ` Andrea Arcangeli @ 2002-05-02 16:10 ` Martin J. Bligh 2002-05-02 16:40 ` Andrea Arcangeli 2002-05-02 23:42 ` Daniel Phillips 1 sibling, 1 reply; 152+ messages in thread From: Martin J. Bligh @ 2002-05-02 16:10 UTC (permalink / raw) To: Andrea Arcangeli, Daniel Phillips; +Cc: Russell King, linux-kernel > You can trivially map the phys mem between 1G and 1G+256M to be in a > direct mapping between 3G+256M and 3G+512M, then you can put such 256M > at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too. > > The constraints you have on the normal memory are only two: > > 1) direct mapping > 2) DMA > > so as far as the ram is capable of 32bit DMA with pci32 and it's mapped > in the direct mapping you can put it into the normal zone. There is no > difference at all between discontimem or nonlinear in this sense. Now imagine an 8 node system, with 4Gb of memory in each node. First 4Gb is in node 0, second 4Gb is in node 1, etc. Even with 64 bit DMA, the real problem is breaking the assumption that mem between 0 and 896Mb phys maps 1-1 onto kernel space. That's 90% of the difficulty of what Dan's doing anyway, as I see it. M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 16:10 ` Martin J. Bligh @ 2002-05-02 16:40 ` Andrea Arcangeli 2002-05-02 17:16 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 16:40 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote: > > You can trivially map the phys mem between 1G and 1G+256M to be in a > > direct mapping between 3G+256M and 3G+512M, then you can put such 256M > > at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too. > > > > The constraints you have on the normal memory are only two: > > > > 1) direct mapping > > 2) DMA > > > > so as far as the ram is capable of 32bit DMA with pci32 and it's mapped > > in the direct mapping you can put it into the normal zone. There is no > > difference at all between discontimem or nonlinear in this sense. > > Now imagine an 8 node system, with 4Gb of memory in each node. > First 4Gb is in node 0, second 4Gb is in node 1, etc. > > Even with 64 bit DMA, the real problem is breaking the assumption > that mem between 0 and 896Mb phys maps 1-1 onto kernel space. > That's 90% of the difficulty of what Dan's doing anyway, as I > see it. You don't need any additional common code abstraction to make virtual address 3G+256M point to physical address 1G as in my example above; after that you're free to put the physical RAM between 1G and 1G+256M into the ZONE_NORMAL of node 1, and the stuff should keep working, but with ZONE_NORMAL spread over more than one node. You just have full control of virt_to_page, pci_map_single, __va. Actually it may well be cleaner to just let the arch define page_address() when discontigmem is enabled (instead of hacking on top of __va); that's a few-liner. 
(the only true limit you have is on the phys RAM above 4G, which definitely cannot go into ZONE_NORMAL, regardless of whether it belongs to a direct mapping or not, because of the pci32 API) Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 16:40 ` Andrea Arcangeli @ 2002-05-02 17:16 ` William Lee Irwin III 2002-05-02 18:41 ` Andrea Arcangeli 2002-05-02 18:25 ` Daniel Phillips 2002-05-02 19:31 ` Martin J. Bligh 2 siblings, 1 reply; 152+ messages in thread From: William Lee Irwin III @ 2002-05-02 17:16 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote: >> Even with 64 bit DMA, the real problem is breaking the assumption >> that mem between 0 and 896Mb phys maps 1-1 onto kernel space. >> That's 90% of the difficulty of what Dan's doing anyway, as I >> see it. On Thu, May 02, 2002 at 06:40:37PM +0200, Andrea Arcangeli wrote: > control on virt_to_page, pci_map_single, __va. Actually it may be as > well cleaner to just let the arch define page_address() when > discontigmem is enabled (instead of hacking on top of __va), that's a > few liner. (the only true limit you have is on the phys ram above 4G, > that cannot definitely go into zone-normal regardless if it belongs to a > direct mapping or not because of pci32 API) > Andrea Being unable to have any ZONE_NORMAL above 4GB allows no change at all. 32-bit PCI is not used on NUMA-Q AFAIK. So long as zones are physically contiguous and __va() does what its name implies, page_address() should operate properly aside from the sizeof(phys_addr) > sizeof(unsigned long) overflow issue (which I believe was recently resolved; if not I will do so myself shortly). With SGI's discontigmem, one would need an UNMAP_NR_DENSE() as the position in the mem_map array does not describe the offset into the region of physical memory occupied by the zone. UNMAP_NR_DENSE() may be expensive enough that architectures using MAP_NR_DENSE() may be better off using ARCH_WANT_VIRTUAL, as page_address() is a common operation. 
If space conservation is as important a consideration for stability as it is on architectures with severely limited kernel virtual address spaces, it may be preferable to implement such despite the computational expense. iSeries will likely have physically discontiguous zones and so it won't be able to use an address calculation based page_address() either. Cheers, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 17:16 ` William Lee Irwin III @ 2002-05-02 18:41 ` Andrea Arcangeli 2002-05-02 19:19 ` William Lee Irwin III 2002-05-02 19:22 ` Daniel Phillips 0 siblings, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 18:41 UTC (permalink / raw) To: William Lee Irwin III, Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote: > On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote: > >> Even with 64 bit DMA, the real problem is breaking the assumption > >> that mem between 0 and 896Mb phys maps 1-1 onto kernel space. > >> That's 90% of the difficulty of what Dan's doing anyway, as I > >> see it. > > On Thu, May 02, 2002 at 06:40:37PM +0200, Andrea Arcangeli wrote: > > control on virt_to_page, pci_map_single, __va. Actually it may be as > > well cleaner to just let the arch define page_address() when > > discontigmem is enabled (instead of hacking on top of __va), that's a > > few liner. (the only true limit you have is on the phys ram above 4G, > > that cannot definitely go into zone-normal regardless if it belongs to a > > direct mapping or not because of pci32 API) > > Andrea > > Being unable to have any ZONE_NORMAL above 4GB allows no change at all. No change if your first node maps the whole first 4G of physical address space, but in that case nonlinear cannot help you in any way anyway. The fact that you can make no change at all has only to do with the fact that GFP_KERNEL must return memory accessible from a pci32 device. I think most configurations have more than one node mapped into the first 4G, and so in those configurations you can do changes and spread the direct mapping across all the nodes mapped in the first 4G phys. Whether or not you can change something with discontigmem or nonlinear is all about pci32. > 32-bit PCI is not used on NUMA-Q AFAIK. 
but you can plug in 32-bit PCI hardware into your 64-bit PCI slots, right? If not, and if you're also sure the Linux drivers for your hardware are all 64-bit-PCI capable, then you can do the changes regardless of the 4G limit; in that case you can spread the direct mapping over the whole 64G physical RAM, wherever you want, with no 4G constraint anymore. > > So long as zones are physically contiguous and __va() does what its Zones remain physically contiguous; it's the virtual address returned by page_address that changes. Also the kmap header will need some modification: you should always check for PageHIGHMEM in all places to know if you must kmap or not, that's a few-liner. > name implies, page_address() should operate properly aside from the > sizeof(phys_addr) > sizeof(unsigned long) overflow issue (which I > believe was recently resolved; if not I will do so myself shortly). > With SGI's discontigmem, one would need an UNMAP_NR_DENSE() as the > position in mem_map array does not describe the offset into the region > of physical memory occupied by the zone. UNMAP_NR_DENSE() may be > expensive enough architectures using MAP_NR_DENSE() may be better off > using ARCH_WANT_VIRTUAL, as page_address() is a common operation. If Yes, as an alternative to moving page_address into the arch code, you can set WANT_PAGE_VIRTUAL, since as you say such a function is going to be more expensive (if it's only a few instructions you can instead consider moving page_address into the arch code, as said in the previous email, instead of hacking on __va). > space conservation is as important a consideration for stability as it > is on architectures with severely limited kernel virtual address spaces, > it may be preferable to implement such despite the computational expense. > iSeries will likely have physically discontiguous zones and so it won't > be able to use an address calculation based page_address() either. 
If you need to support a huge number of discontiguous zones then I'm the first to agree you want nonlinear instead of discontigmem. I wasn't aware that hardware exists that normally needs to support hundreds or thousands of discontigmem zones; for that, discontigmem is prohibitive due to the O(N) complexity of some code paths. That's not the case for NUMA-Q, though, which also needs the different pgdat structures for the NUMA optimizations anyway (and to me a physical memory partitioned into hundreds of discontiguous zones still looks like a hard disk partitioned into hundreds of different block devices). BTW, about the pgdat loop optimizations, you misunderstood what I meant in some previous email: with "removing them" I didn't mean to remove them in the discontigmem case (that would have to be done case by case); I meant to remove them only for mainline 2.4.19-pre7 when the kernel is compiled for the x86 target, as 99% of the userbase uses it. A discontigmem using nonlinear also doesn't need to loop. It's a one-branch-removal optimization (it doesn't decrease the complexity of the algorithm when discontigmem is enabled). It's all a function of #ifndef CONFIG_DISCONTIGMEM. Dropping the loop when discontigmem is enabled is a much more interesting optimization, of course. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 18:41 ` Andrea Arcangeli @ 2002-05-02 19:19 ` William Lee Irwin III 2002-05-02 19:27 ` Daniel Phillips ` (2 more replies) 2002-05-02 19:22 ` Daniel Phillips 1 sibling, 3 replies; 152+ messages in thread From: William Lee Irwin III @ 2002-05-02 19:19 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote: >> Being unable to have any ZONE_NORMAL above 4GB allows no change at all. On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > No change if your first node maps the whole first 4G of physical address > space, but in such case nonlinear cannot help you in any way anyways. > The fact you can make no change at all has only to do with the fact > GFP_KERNEL must return memory accessible from a pci32 device. Without relaxing this invariant for this architecture there is no hope that NUMA-Q can ever be efficiently operated by this kernel. On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > I think most configurations have more than one node mapped into the > first 4G, and so in those configurations you can do changes and spread > the direct mapping across all the nodes mapped in the first 4G phys. These would be partially-populated nodes. There may be up to 16 nodes. Some unusual management of the interrupt controllers is required to get the last 4 cpus. Those who know how tend to disavow the knowledge. =) On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > the fact you can or you can't have something to change with discontigmem > or nonlinear, it's all about pci32. Artificially tying together the device-addressibility of memory and virtual addressibility of memory is a fundamental design decision which seems to behave poorly for NUMA-Q, though general it seems to work okay. 
On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote: >> 32-bit PCI is not used on NUMA-Q AFAIK. On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > but can you plugin 32bit pci hardware into your 64bit-pci slots, right? > If not, and if you're also sure the linux drivers for your hardware are all > 64bit-pci capable then you can do the changes regardless of the 4G > limit, in such case you can spread the direct mapping all over the whole > 64G physical ram, whereever you want, no 4G constraint anymore. I believe 64-bit PCI is pretty much taken to be a requirement; if it weren't the 4GB limit would once again apply and we'd be in much trouble, or we'd have to implement a different method of accommodating limited device addressing capabilities and would be in trouble again. On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote: >> So long as zones are physically contiguous and __va() does what its On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > zones remains physically contigous, it's the virtual address returned by > page_address that changes. Also the kmap header will need some > modification, you should always check for PageHIGHMEM in all places to > know if you must kmap or not, that's a few liner. I've not been using the generic page_address() in conjunction with highmem, but this sounds like a very natural thing to do when the need to do so arises; arranging for storage of the virtual address sounds trickier, though doable. I'm not sure if mainline would want it, and I don't feel a pressing need to implement it yet, but then again, I've not yet been parked in front of a 64GB x86 machine yet... 
On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > BTW, about the pgdat loops optimizations, you misunderstood what I meant > in some previous email, with "removing them" I didn't meant to remove > them in the discontigmem case, that would had to be done case by case, > with removing them I meant to remove them only for mainline 2.4.19-pre7 > when kernel is compiled for x86 target like 99% of userbase uses it. A > discontigmem using nonlinear also doesn't need to loop. It's a 1 branch > removal optimization (doesn't decrease the complexity of the algorithm > for discontigmem enabled). It's all in function of > #ifndef CONFIG_DISCONTIGMEM. From my point of view this would be totally uncontroversial. Some arch maintainers might want a different #ifdef condition but it's fairly trivial to adjust that to their needs when they speak up. On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > Dropping the loop when discontigmem is enabled is much more interesting > optimization of course. > Andrea Absolutely; I'd be very supportive of improvements for this case as well. Many of the systems with the need for discontiguous memory support will also benefit from parallelizations or other methods of avoiding references to remote nodes/zones or iterations over all nodes/zones. Cheers, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:19 ` William Lee Irwin III @ 2002-05-02 19:27 ` Daniel Phillips 2002-05-02 19:38 ` William Lee Irwin III 2002-05-03 6:10 ` Andrea Arcangeli 2002-05-02 22:20 ` Martin J. Bligh 2002-05-03 6:04 ` Andrea Arcangeli 2 siblings, 2 replies; 152+ messages in thread From: Daniel Phillips @ 2002-05-02 19:27 UTC (permalink / raw) To: William Lee Irwin III, Andrea Arcangeli Cc: Martin J. Bligh, Russell King, linux-kernel On Thursday 02 May 2002 21:19, William Lee Irwin III wrote: > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > > Dropping the loop when discontigmem is enabled is much more interesting > > optimization of course. > > Andrea > > Absolutely; I'd be very supportive of improvements for this case as well. > Many of the systems with the need for discontiguous memory support will > also benefit from parallelizations or other methods of avoiding references > to remote nodes/zones or iterations over all nodes/zones. Which loop in which function are we talking about? -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:27 ` Daniel Phillips @ 2002-05-02 19:38 ` William Lee Irwin III 2002-05-02 19:58 ` Daniel Phillips 2002-05-03 6:28 ` Andrea Arcangeli 2002-05-03 6:10 ` Andrea Arcangeli 1 sibling, 2 replies; 152+ messages in thread From: William Lee Irwin III @ 2002-05-02 19:38 UTC (permalink / raw) To: Daniel Phillips Cc: Andrea Arcangeli, Martin J. Bligh, Russell King, linux-kernel On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: >>> Dropping the loop when discontigmem is enabled is much more interesting >>> optimization of course. On Thursday 02 May 2002 21:19, William Lee Irwin III wrote: >> Absolutely; I'd be very supportive of improvements for this case as well. >> Many of the systems with the need for discontiguous memory support will >> also benefit from parallelizations or other methods of avoiding references >> to remote nodes/zones or iterations over all nodes/zones. On Thu, May 02, 2002 at 09:27:00PM +0200, Daniel Phillips wrote: > Which loop in which function are we talking about? I believe it's just for_each_zone() and for_each_pgdat(), and their usage in general. I brewed them up to keep things clean (and by and large they produced largely equivalent code to what preceded it), but there's no harm in conditionally defining them. I think it's even beneficial, since things can use them without concerning themselves about "will this be inefficient for the common case of UP single-node x86?" and might also have the potential to remove some other #ifdefs. In the more general case, avoiding an O(fragments) (or sometimes even O(mem)) iteration in favor of, say, O(lg(fragments)) or O(cpus) iteration when fragments is very large would be an excellent optimization. Andrea, if the definitions of these helpers start getting large, I think it would help to move them to a separate header. 
akpm has already done so with page->flags manipulations in 2.5, and it seems like it wouldn't do any harm to do something similar in 2.4 either. Does that sound good? Cheers, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:38 ` William Lee Irwin III @ 2002-05-02 19:58 ` Daniel Phillips 2002-05-03 6:28 ` Andrea Arcangeli 1 sibling, 0 replies; 152+ messages in thread From: Daniel Phillips @ 2002-05-02 19:58 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel On Thursday 02 May 2002 21:38, William Lee Irwin III wrote: > In the more general case, avoiding an O(fragments) (or sometimes even > O(mem)) iteration in favor of, say, O(lg(fragments)) or O(cpus) > iteration when fragments is very large would be an excellent optimization. In general, config_nonlinear gets it down to O(NR_ZONES), i.e., O(1), by eliminating the loops across nodes in the non-numa case. Yes, teaching for_each_* about the 'list length equals one' case would be worthwhile. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:38 ` William Lee Irwin III 2002-05-02 19:58 ` Daniel Phillips @ 2002-05-03 6:28 ` Andrea Arcangeli 1 sibling, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 6:28 UTC (permalink / raw) To: William Lee Irwin III, Daniel Phillips, Martin J. Bligh, Russell King, linux-kernel On Thu, May 02, 2002 at 12:38:47PM -0700, William Lee Irwin III wrote: > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > >>> Dropping the loop when discontigmem is enabled is much more interesting > >>> optimization of course. > > On Thursday 02 May 2002 21:19, William Lee Irwin III wrote: > >> Absolutely; I'd be very supportive of improvements for this case as well. > >> Many of the systems with the need for discontiguous memory support will > >> also benefit from parallelizations or other methods of avoiding references > >> to remote nodes/zones or iterations over all nodes/zones. > > On Thu, May 02, 2002 at 09:27:00PM +0200, Daniel Phillips wrote: > > Which loop in which function are we talking about? > > I believe it's just for_each_zone() and for_each_pgdat(), and their > usage in general. I brewed them up to keep things clean (and by and > large they produced largely equivalent code to what preceded it), but > there's no harm in conditionally defining them. I think it's even > beneficial, since things can use them without concerning themselves > about "will this be inefficient for the common case of UP single-node > x86?" and might also have the potential to remove some other #ifdefs. > > In the more general case, avoiding an O(fragments) (or sometimes even > O(mem)) iteration in favor of, say, O(lg(fragments)) or O(cpus) > iteration when fragments is very large would be an excellent optimization. > > Andrea, if the definitions of these helpers start getting large, I think > it would help to move them to a separate header. akpm has already done so sure for 2.5. 
However, for 2.4 I'm still not very excited about those optimizations getting in now, at least until some of the other more important pending patches are included. I don't care if those optimizations are obvious or not, it's just more work for Marcelo and I prefer him to spend all his cpu time on things that matter, not on unnecessary cleanups (at least while there are pending things that matter; if there aren't, it's fine to work on microoptimizations/cleanups). Besides, it would generate rejects and more work for me too, but I have ways to reduce my overhead, so my reject-solving work would really be my last concern. > with page->flags manipulations in 2.5, and it seems like it wouldn't > do any harm to do something similar in 2.4 either. Does that sound good? > > Cheers, > Bill Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:27 ` Daniel Phillips 2002-05-02 19:38 ` William Lee Irwin III @ 2002-05-03 6:10 ` Andrea Arcangeli 1 sibling, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 6:10 UTC (permalink / raw) To: Daniel Phillips Cc: William Lee Irwin III, Martin J. Bligh, Russell King, linux-kernel On Thu, May 02, 2002 at 09:27:00PM +0200, Daniel Phillips wrote: > On Thursday 02 May 2002 21:19, William Lee Irwin III wrote: > > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > > > Dropping the loop when discontigmem is enabled is much more interesting > > > optimization of course. > > > Andrea > > > > Absolutely; I'd be very supportive of improvements for this case as well. > > Many of the systems with the need for discontiguous memory support will > > also benefit from parallelizations or other methods of avoiding references > > to remote nodes/zones or iterations over all nodes/zones. > > Which loop in which function are we talking about? The pgdat loops. For example, this one could be optimized for the 99% of the userbase to: do { zonelist_t *zonelist = pgdat->node_zonelists + (GFP_USER & GFP_ZONEMASK); zone_t **zonep = zonelist->zones; zone_t *zone; for (zone = *zonep++; zone; zone = *zonep++) { unsigned long size = zone->size; unsigned long high = zone->pages_high; if (size > high) sum += size - high; } #ifdef CONFIG_DISCONTIGMEM pgdat = pgdat->node_next; } while (pgdat); #else } while (0); #endif thus allowing the compiler to remove a branch and a few instructions from the asm. But it would be a microoptimization not visible in benchmarks; I'm not actually suggesting it, mostly for code clarity reasons. Branch prediction should also get it right if it starts being executed frequently (hopefully the asm is large enough that the predictor doesn't get confused by the inner loop that is quite near). Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:19 ` William Lee Irwin III 2002-05-02 19:27 ` Daniel Phillips @ 2002-05-02 22:20 ` Martin J. Bligh 2002-05-02 21:28 ` William Lee Irwin III 2002-05-03 6:38 ` Andrea Arcangeli 2002-05-03 6:04 ` Andrea Arcangeli 2 siblings, 2 replies; 152+ messages in thread From: Martin J. Bligh @ 2002-05-02 22:20 UTC (permalink / raw) To: William Lee Irwin III, Andrea Arcangeli Cc: Daniel Phillips, Russell King, linux-kernel > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: >> but can you plugin 32bit pci hardware into your 64bit-pci slots, right? >> If not, and if you're also sure the linux drivers for your hardware are all >> 64bit-pci capable then you can do the changes regardless of the 4G >> limit, in such case you can spread the direct mapping all over the whole >> 64G physical ram, whereever you want, no 4G constraint anymore. > > I believe 64-bit PCI is pretty much taken to be a requirement; if it > weren't the 4GB limit would once again apply and we'd be in much > trouble, or we'd have to implement a different method of accommodating > limited device addressing capabilities and would be in trouble again. IIRC, there are some funny games you can play with 32bit PCI DMA. You're not necessarily restricted to the bottom 4Gb of phys addr space, you're restricted to a 4Gb window, which you can shift by programming a register on the card. Fixing that register to point to a window for the node in question allows you to allocate from a node's pg_data_t and assure DMAable RAM is returned. M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 22:20 ` Martin J. Bligh @ 2002-05-02 21:28 ` William Lee Irwin III 2002-05-02 21:52 ` Kurt Ferreira 2002-05-03 6:38 ` Andrea Arcangeli 1 sibling, 1 reply; 152+ messages in thread From: William Lee Irwin III @ 2002-05-02 21:28 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrea Arcangeli, Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: >> I believe 64-bit PCI is pretty much taken to be a requirement; if it >> weren't the 4GB limit would once again apply and we'd be in much >> trouble, or we'd have to implement a different method of accommodating >> limited device addressing capabilities and would be in trouble again. On Thu, May 02, 2002 at 03:20:39PM -0700, Martin J. Bligh wrote: > IIRC, there are some funny games you can play with 32bit PCI DMA. > You're not necessarily restricted to the bottom 4Gb of phys addr space, > you're restricted to a 4Gb window, which you can shift by programming > a register on the card. Fixing that register to point to a window for the > node in question allows you to allocate from a node's pg_data_t and > assure DMAable RAM is returned. > M. Woops, I forgot about the BAR, thanks. Heck, IIRC you were even the one who told me about this trick. Thanks, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 21:28 ` William Lee Irwin III @ 2002-05-02 21:52 ` Kurt Ferreira 2002-05-02 21:55 ` William Lee Irwin III 0 siblings, 1 reply; 152+ messages in thread From: Kurt Ferreira @ 2002-05-02 21:52 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Martin J. Bligh, linux-kernel > On Thu, May 02, 2002 at 03:20:39PM -0700, Martin J. Bligh wrote: > > IIRC, there are some funny games you can play with 32bit PCI DMA. > > You're not necessarily restricted to the bottom 4Gb of phys addr space, > > you're restricted to a 4Gb window, which you can shift by programming > > a register on the card. Fixing that register to point to a window for the > > node in question allows you to allocate from a node's pg_data_t and > > assure DMAable RAM is returned. > > M. > > > Woops, I forgot about the BAR, thanks. Heck, IIRC you were even the one > who told me about this trick. > By this do you mean setting bits BAR[2:1]=b'10? Just making sure I get it. Thanks Kurt ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 21:52 ` Kurt Ferreira @ 2002-05-02 21:55 ` William Lee Irwin III 0 siblings, 0 replies; 152+ messages in thread From: William Lee Irwin III @ 2002-05-02 21:55 UTC (permalink / raw) To: Kurt Ferreira; +Cc: Martin J. Bligh, linux-kernel On Thu, May 02, 2002 at 03:20:39PM -0700, Martin J. Bligh wrote: >> Woops, I forgot about the BAR, thanks. Heck, IIRC you were even the one >> who told me about this trick. On Thu, May 02, 2002 at 03:52:53PM -0600, Kurt Ferreira wrote: > By this do you mean setting bits BAR[2:1]=b'10? Just making sure I get > it. > Thanks > Kurt I'm not that well-versed in PCI programming; I don't believe I was told in any greater level of detail than has already crossed this list. Cheers, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 22:20 ` Martin J. Bligh 2002-05-02 21:28 ` William Lee Irwin III @ 2002-05-03 6:38 ` Andrea Arcangeli 2002-05-03 6:58 ` Martin J. Bligh 1 sibling, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 6:38 UTC (permalink / raw) To: Martin J. Bligh Cc: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 03:20:39PM -0700, Martin J. Bligh wrote: > > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > >> but can you plugin 32bit pci hardware into your 64bit-pci slots, right? > >> If not, and if you're also sure the linux drivers for your hardware are all > >> 64bit-pci capable then you can do the changes regardless of the 4G > >> limit, in such case you can spread the direct mapping all over the whole > >> 64G physical ram, whereever you want, no 4G constraint anymore. > > > > I believe 64-bit PCI is pretty much taken to be a requirement; if it > > weren't the 4GB limit would once again apply and we'd be in much > > trouble, or we'd have to implement a different method of accommodating > > limited device addressing capabilities and would be in trouble again. > > IIRC, there are some funny games you can play with 32bit PCI DMA. > You're not necessarily restricted to the bottom 4Gb of phys addr space, > you're restricted to a 4Gb window, which you can shift by programming > a register on the card. Fixing that register to point to a window for the > node in question allows you to allocate from a node's pg_data_t and > assure DMAable RAM is returned. if you have as many windows as the number of nodes then you're just fine in all cases. You only need to teach pci_map_single and friends to return the right bus address, which won't be an identity with the phys addr anymore; then you can forget the >4G phys constraint on the pages returned by zone_normal :). Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 6:38 ` Andrea Arcangeli @ 2002-05-03 6:58 ` Martin J. Bligh 0 siblings, 0 replies; 152+ messages in thread From: Martin J. Bligh @ 2002-05-03 6:58 UTC (permalink / raw) To: Andrea Arcangeli Cc: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel >> IIRC, there are some funny games you can play with 32bit PCI DMA. >> You're not necessarily restricted to the bottom 4Gb of phys addr space, >> you're restricted to a 4Gb window, which you can shift by programming >> a register on the card. Fixing that register to point to a window for the >> node in question allows you to allocate from a node's pg_data_t and >> assure DMAable RAM is returned. > > if you've as many windows as the number of nodes than you're just fine > in all cases. you only need to teach pci_map_single and friends to > return the right bus address that won't be an identity anymore with the > phys addr, then you can forget the >4G phys constraint on the pages > returned by zone_normal :). I only have third-hand information, rather than real experience in this particular area, but I believe the general idea was to map the window for any given card onto its own node's physaddr space. For a general dirty kludge, we could allocate DMAable memory by simply doing an alloc_pages_node from node 0 (assuming a max of 4Gb in the first node ... if we really want a bounce buffer *and* we have more than 4Gb in the first node *and* we have a 32 bit DMA card, we can always alloc from ZONE_NORMAL on node 0 ... yes, that's pretty disgusting ... but 99% of things will have 64 bit DMA). M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:19 ` William Lee Irwin III 2002-05-02 19:27 ` Daniel Phillips 2002-05-02 22:20 ` Martin J. Bligh @ 2002-05-03 6:04 ` Andrea Arcangeli 2002-05-03 6:33 ` Martin J. Bligh 2002-05-03 9:24 ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] William Lee Irwin III 2 siblings, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 6:04 UTC (permalink / raw) To: William Lee Irwin III, Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote: > On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote: > >> Being unable to have any ZONE_NORMAL above 4GB allows no change at all. > > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > > No change if your first node maps the whole first 4G of physical address > > space, but in such case nonlinear cannot help you in any way anyways. > > The fact you can make no change at all has only to do with the fact > > GFP_KERNEL must return memory accessible from a pci32 device. > > Without relaxing this invariant for this architecture there is no hope > that NUMA-Q can ever be efficiently operated by this kernel. I don't think it makes sense to attempt breaking GFP_KERNEL semantics in 2.4, but for 2.5 we can change stuff so that all non-DMA users can ask for ZONE_NORMAL that will be backed by physical memory over 4G (that's fine for all inodes, dcache, files, buffer heads, kiobufs, vmas and many other in-core data structures never accessed by hardware via DMA; it's ok even for the buffer cache because the lowlevel layer has the bounce buffer layer that is smart enough to understand when bounce buffers are needed on top of the physical address space pagecache). 
> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > > I think most configurations have more than one node mapped into the > > first 4G, and so in those configurations you can do changes and spread > > the direct mapping across all the nodes mapped in the first 4G phys. > > These would be partially-populated nodes. There may be up to 16 nodes. > Some unusual management of the interrupt controllers is required to get > the last 4 cpus. Those who know how tend to disavow the knowledge. =) > > > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > > the fact you can or you can't have something to change with discontigmem > > or nonlinear, it's all about pci32. > > Artificially tying together the device-addressability of memory and > virtual addressability of memory is a fundamental design decision which > seems to behave poorly for NUMA-Q, though in general it seems to work okay. Yes, you know, until a few months ago we weren't even capable of skipping the bounce buffers for the memory between 1G and 4G and for the memory above 4G with pci-64; now we can, and in the future we can be more fine-grained if the need arises. Again note that nonlinear can do nothing to help you there: whether you can or you can't has nothing to do with discontigmem or nonlinear, it's all about pci32. We basically changed topic from here. 
> > I believe 64-bit PCI is pretty much taken to be a requirement; if it > weren't the 4GB limit would once again apply and we'd be in much > trouble, or we'd have to implement a different method of accommodating > limited device addressing capabilities and would be in trouble again. If you're sure all the hw devices are pci64 and the device drivers are using DAC to submit the bus addresses, then you're just fine and you can use pages over 4G for the ZONE_NORMAL too. And yes, if you add an IOMMU unit like the GART then you can fill the zone_normal with phys pages over 4G too, because then the bus address won't be an identity with the phys addr anymore. I just assumed it wasn't the case because most x86 don't have that capability besides the GART, which isn't currently used by the kernel as an iommu but is left to be used to build contiguous ram for the AGP cards (and also not all x86 have an AGP, so we couldn't use it by default on x86 even assuming the graphics card doesn't need it). > On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote: > >> So long as zones are physically contiguous and __va() does what its > > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > > zones remains physically contigous, it's the virtual address returned by > > page_address that changes. Also the kmap header will need some > > modification, you should always check for PageHIGHMEM in all places to > > know if you must kmap or not, that's a few liner. > > I've not been using the generic page_address() in conjunction with > highmem, but this sounds like a very natural thing to do when the need > to do so arises; arranging for storage of the virtual address sounds > trickier, though doable. I'm not sure if mainline would want it, and > I don't feel a pressing need to implement it yet, but then again, I've > not yet been parked in front of a 64GB x86 machine yet... Personally I always had the hope to never need to see a 64G 32bit machine 8). 
I mean, even if you manage to solve the pci32bit problem with GFP_KERNEL, then you still have to share 800M across 16 nodes with 4G each. So striping zone_normal over all the nodes to have numa-local data structures with fast slab allocations will get you at most 50mbyte per node, of which around 90% will be eaten by the mem_map array for those 50M plus the other 4G-50M. So at the end you'll be left with only, say, 5/10M per node of zone_normal that will be filled immediately as soon as you start reading some directory from disk. A few hundred mbyte of vfs cache is the minimum for those machines, and this doesn't even take into account bh headers for the pagecache, physical address space pagecache for the buffercache, kiobufs, vmas, etc... Even ignoring the fact it's NUMA, a 64G machine will boot fine (thanks to your 2.4.19pre work that shaves some bytes off each page structure), but whether it will work well depends on what you're doing: for example it's fine for number crunching but it will be bad for most other important workloads. And this is only because of the 32bit address space, it doesn't have anything to do with nonlinear/numa/discontigmem or pci32. It's just that 1G of virtual address space reserved for the kernel is too low to handle 64G of physical ram efficiently; this is a fact and you can't work around it. Every workaround will add a penalty here or there. The workaround you will mostly be forced to take is CONFIG_2G, after which the userspace will be limited to less than 1G per task returned by malloc (from over 1G to below 2G), and that will be a showstopper again for most userspace apps that want to run on a 64G box, like a DBMS that wants almost 2G of SGA. I'm glad we're finally going to migrate all to 64bit, just in time not to see a relevant number of 32bit 64G boxes. 
And of course, I don't mean a 64G 32bit machine doesn't make sense, it can make perfect sense for a certain number of users with specific needs for lots of ram and with very few kernel data structures. If you do that, that's because you know what you're doing and you know you can tweak linux for your own workload, and that's fine as far as it's not supposed to be a general purpose machine (with general purpose I mean pretending to run a DBMS with a 1.7G SGA requiring CONFIG_3G, plus a cvs [or bk if you're a bk fan] server dealing with huge vfs metadata at the same time; for instance the cvs workload would run faster booting with mem=32G :) > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > > BTW, about the pgdat loops optimizations, you misunderstood what I meant > > in some previous email, with "removing them" I didn't meant to remove > > them in the discontigmem case, that would had to be done case by case, > > with removing them I meant to remove them only for mainline 2.4.19-pre7 > > when kernel is compiled for x86 target like 99% of userbase uses it. A > > discontigmem using nonlinear also doesn't need to loop. It's a 1 branch > > removal optimization (doesn't decrease the complexity of the algorithm > > for discontigmem enabled). It's all in function of > > #ifndef CONFIG_DISCONTIGMEM. > > From my point of view this would be totally uncontroversial. Some arch > maintainers might want a different #ifdef condition but it's fairly > trivial to adjust that to their needs when they speak up. Yep. Nobody did it, probably just to leave the code clean and because it would only remove a branch that wouldn't be measurable in any benchmark. 
In fact I'm not doing it either; I raised it just as a more worthwhile improvement compared to sharing a cacheline between the last page of one mem_map and the first page of the mem_map in the next pgdat (again assuming a sane number of discontig chunks, say 16 with 32/64G of ram globally, not hundreds of chunks) > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote: > > Dropping the loop when discontigmem is enabled is much more interesting > > optimization of course. > > Andrea > > Absolutely; I'd be very supportive of improvements for this case as well. > Many of the systems with the need for discontiguous memory support will > also benefit from parallelizations or other methods of avoiding references > to remote nodes/zones or iterations over all nodes/zones. I would suggest starting on a case-by-case basis, looking at the profiling, so we only make more complex what is worth optimizing. For example, I guess nr_free_buffer_pages() will show up because it is used quite frequently. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 6:04 ` Andrea Arcangeli @ 2002-05-03 6:33 ` Martin J. Bligh 2002-05-03 8:38 ` Andrea Arcangeli 2002-05-03 9:24 ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] William Lee Irwin III 1 sibling, 1 reply; 152+ messages in thread From: Martin J. Bligh @ 2002-05-03 6:33 UTC (permalink / raw) To: Andrea Arcangeli, William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel FYI, whilst we've mentioned NUMA-Q in these arguments, much of this is generic to any 32 bit NUMA machine, the new x440 for example. > I don't think it make sense to attempt breaking GFP_KERNEL semantics in > 2.4 but for 2.5 we can change stuff so that all non-DMA users can ask > for ZONE_NORMAL that will be backed by physical memory over 4G (that's > fine for all inodes,dcache,files,bufferheader,kiobuf,vma and many other > in-core data structures never accessed by hardware via DMA, it's ok even > for the buffer cache because the lowlevel layer has the bounce buffer > layer that is smart enough to understand when bounce buffers are needed > on top of the physical address space pagecache). Sounds good. Hopefully we can kill off ZONE_DMA for the old ISA stuff at the same time except as a backwards compatibility config option that you'd have to explicitly enable ... > Again note that nonlinear can do nothing to help you there, the > limitation you deal with is pci32 and the GFP API, not at all about > discontigmem or nonlinear. we basically changed topic from here. There are several different problems we seem to be discussing here: 1. Cleaning up discontig mem alloc for UMA machines. 2. Having a non-contiguous ZONE_NORMAL across NUMA nodes. 3. DMA addressibility of memory. (and probably others I've missed). Nonlinear is more about the first two, and not the third, at least to my mind. > Personally I always had the hope to never need to see a 64G 32bit > machine 8). 
> I mean, even if you manage to solve the pci32bit problem > with GFP_KERNEL, then you still have to share 800M across 16 nodes with > 4G each. So by striping zone_normal over all the nodes to have numa-local > data structures with fast slab allocations will get at most 50mbyte per > node of which around 90% of this 50M will be eat by the mem_map array > for those 50M plus the other 4G-50M. You're assuming we're always going to globally map every struct page into kernel address space for ever. That's a fundamental scalability problem for a 32 bit machine, and I think we need to fix it. If we map only the pages the process is using into the user-kernel address space area, rather than the global KVA, we get rid of some of these problems. Not that that plan doesn't have its own problems, but ... ;-) Bear in mind that we've successfully used 64Gb of ram in a 32 bit virtual addr space a long time ago with Dynix/PTX. > So at the end you'll be left with > only say 5/10M per node of zone_normal that will be filled immediatly as > soon as you start reading some directory from disk. a few hundred mbyte > of vfs cache is the minimum for those machines, this doesn't even take > into account bh headers for the pagecache, physical address space > pagecache for the buffercache, kiobufs, vma, etc... Bufferheads are another huge problem right now. For a P4 machine, they round off to 128 bytes per data structure. I was just looking at a 16Gb machine that had completely wedged itself by filling ZONE_NORMAL with unfreeable overhead - 440Mb of bufferheads alone. Globally mapping the bufferheads is probably another thing that'll have to go. > It's just that 1G of > virtual address space reserved for kernel is too low to handle > efficiently 64G of physical ram, this is a fact and you can't > workaround it. Death to global mappings! ;-) I'd agree that a 64 bit vaddr space makes much more sense, but we're stuck with the chips we've got for a little while yet. 
AMD were a few years too late for the bleeding edge Intel arch people amongst us. M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 6:33 ` Martin J. Bligh @ 2002-05-03 8:38 ` Andrea Arcangeli 2002-05-03 9:26 ` William Lee Irwin III 2002-05-03 15:17 ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh 0 siblings, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 8:38 UTC (permalink / raw) To: Martin J. Bligh Cc: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 11:33:43PM -0700, Martin J. Bligh wrote: > into kernel address space for ever. That's a fundamental scalability > problem for a 32 bit machine, and I think we need to fix it. If we > map only the pages the process is using into the user-kernel address > space area, rather than the global KVA, we get rid of some of these > problems. Not that that plan doesn't have its own problems, but ... ;-) :) As said, every workaround has a significant drawback at this point. Flooding the tlb with invlpg and pagetable walking every time we need to do a set_bit, clear_bit, test_bit or an unlock_page is both overkill at runtime and overcomplex on the software side, to manage those kernel pools in user memory. Just assume we do that and that you're ok to pay for the hit in general purpose usage: then next year how do you plan to work around the limitation of 64G of physical ram, are you going to multiplex another 64G of ram via a pci register so you can handle 128G of ram on x86, just not simultaneously? 
(but that's ok in theory, the cpu won't notice you're swapping the ram under it, and you cannot keep more than 4G mapped in virtual mem simultaneously anyways, so it doesn't matter if some ram isn't visible on the physical side either) I mean, in theory there's no limit, but in practice there's a limit: 64G is just over the limit for general purpose x86 IMHO, it's at a point where every workaround for something has a significant performance (or memory) drawback. It's still very fine for custom apps that need that much ram, but 32G is the practical limit of general purpose x86 IMHO. Ah, and of course you could also use 2M pagetables by default to make it more usable, but still you would run into some huge ram wastage in certain usages with small files, huge pageins and reads, swapouts and swapins, plus it wouldn't be guaranteed to be transparent to the userspace binaries (for instance mmap offset fields would break backwards compatibility on the required alignment, that's probably the last problem though). Despite its own significant drawbacks and the complexity of the change, the 4M pagetables would probably be the saner approach to manage 64G more efficiently with only an 800M kernel window. > Bear in mind that we've successfully used 64Gb of ram in a 32 bit > virtual addr space a long time ago with Dynix/PTX. You can use 64G "successfully" just now with 2.4.19pre8 too; as said in the earlier email there are many applications that don't care if there's only a few meg of zone_normal, and for them 2.4.19pre8 is just fine (actually -aa is much better for the bounce buffers and other vm fixes in that area). If all the load is in userspace current 2.4 is just optimal and you'll take advantage of all the ram without problems (let's assume it's not a numa machine; with numa you'd be better off with the fixes I included in my tree). But if you need the kernel to do some amount of work, like vfs caching, blkdev cache, lots of bh on pagecache, lots of vma, lots of kiobufs, skb etc.. 
then you'd probably be faster if you boot with mem=32G, or at least you should take actions like recompiling the kernel as CONFIG_2G, which would then break a large SGA (1.7G) etc... > > So at the end you'll be left with > > only say 5/10M per node of zone_normal that will be filled immediately as > > soon as you start reading some directory from disk. a few hundred mbyte > > of vfs cache is the minimum for those machines, this doesn't even take > > into account bh headers for the pagecache, physical address space > > pagecache for the buffercache, kiobufs, vma, etc... > > Bufferheads are another huge problem right now. For a P4 machine, they > round off to 128 bytes per data structure. I was just looking at a 16Gb > machine that had completely wedged itself by filling ZONE_NORMAL with Go ahead, use -aa or the vm-33 update, I fixed that problem a few days after hearing about it the first time (with due credit to Rik in a comment for showing me such a problem btw, I never noticed it before). > unfreeable overhead - 440Mb of bufferheads alone. Globally mapping the > bufferheads is probably another thing that'll have to go. > > > It's just that 1G of > > virtual address space reserved for kernel is too low to handle > > efficiently 64G of physical ram, this is a fact and you can't > > work around it. > > Death to global mappings! ;-) > > I'd agree that a 64 bit vaddr space makes much more sense, but we're This is my whole point yes :) > stuck with the chips we've got for a little while yet. AMD were a few > years too late for the bleeding edge Intel arch people amongst us. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
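Martin's 440Mb bufferhead figure is easy to sanity-check. A quick sketch (Python just for the arithmetic; the 128-byte rounded buffer_head size is from Martin's mail, while the assumption of roughly one buffer_head per 4K page-cache page is mine):

```python
PAGE_SIZE = 4 << 10        # 4K ia32 pages
BH_SIZE = 128              # per Martin: buffer heads round off to 128 bytes on a P4

ram = 16 << 30             # the 16Gb machine in question
pages = ram // PAGE_SIZE   # assume one buffer_head per page-cache page (4K blocksize)

bh_overhead = pages * BH_SIZE
print(bh_overhead >> 20)   # -> 512 (MB) if every page carried a bh
```

So 440Mb of unfreeable bufferheads suggests most of the machine's 16Gb was sitting in the page cache, with the bh's pinned in ZONE_NORMAL.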
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 8:38 ` Andrea Arcangeli @ 2002-05-03 9:26 ` William Lee Irwin III 2002-05-03 15:38 ` Martin J. Bligh 2002-05-03 15:17 ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh 1 sibling, 1 reply; 152+ messages in thread From: William Lee Irwin III @ 2002-05-03 9:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel On Fri, May 03, 2002 at 10:38:13AM +0200, Andrea Arcangeli wrote: > You can use 64G "successfully" just now too with 2.4.19pre8 too, as said > in the earlier email there are many applications that don't care if > there's only a few meg of zone_normal and for them 2.4.19pre8 is just > fine (actually -aa is much better for the bounce buffers and other vm > fixes in that area). If all the load is in userspace current 2.4 is just Have you done testing with 64GB? What sort of failure modes are you seeing with it? I've been hearing about more severe failure modes in practice on 32GB, Martin, could you comment on this? Cheers, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 9:26 ` William Lee Irwin III @ 2002-05-03 15:38 ` Martin J. Bligh 0 siblings, 0 replies; 152+ messages in thread From: Martin J. Bligh @ 2002-05-03 15:38 UTC (permalink / raw) To: William Lee Irwin III, Andrea Arcangeli Cc: Daniel Phillips, Russell King, linux-kernel > Have you done testing with 64GB? What sort of failure modes are you > seeing with it? I've been hearing about more severe failure modes in > practice on 32GB, Martin, could you comment on this? I've never gone above 32Gb (yet ;-)). We don't have an SMP platform that I know of that'll support 64Gb, only the NUMA platforms. 32Gb will boot and work with 1GB KVA, but if you actually want to use the memory for something, a 2GB KVA seems imperative. It depends on the workload you're using, but the things we tend to see are: 1. struct page. 2. buffer heads (will look at -aa tree) 3. user page tables (need highpte) 4. LDTs for threads filling the vmalloc space (seems to be fixed in 2.5) I think the whole struct page issue needs some (complex, hard) work, but in general, we're getting there. Fast ;-) M. PS. BTW, Andrea, your latest highpte looks like you obliterated the kmap problem I was complaining of, but I've been having massive problems with other things which are blocking much of the real testing ... sorry about the time lag ;-) ^ permalink raw reply [flat|nested] 152+ messages in thread
* Virtual address space exhaustion (was Discontigmem virt_to_page() ) 2002-05-03 8:38 ` Andrea Arcangeli 2002-05-03 9:26 ` William Lee Irwin III @ 2002-05-03 15:17 ` Martin J. Bligh 2002-05-03 15:58 ` Andrea Arcangeli 2002-05-03 16:02 ` Daniel Phillips 1 sibling, 2 replies; 152+ messages in thread From: Martin J. Bligh @ 2002-05-03 15:17 UTC (permalink / raw) To: Andrea Arcangeli Cc: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel > On Thu, May 02, 2002 at 11:33:43PM -0700, Martin J. Bligh wrote: >> into kernel address space for ever. That's a fundamental scalability >> problem for a 32 bit machine, and I think we need to fix it. If we >> map only the pages the process is using into the user-kernel address >> space area, rather than the global KVA, we get rid of some of these >> problems. Not that that plan doesn't have its own problems, but ... ;-) > > :) As said every workaround has a significant drawback at this point. > Starting flooding the tlb with invlpg and pagetable walking every time > we need to do a set_bit or clear_bit test_bit or an unlock_page is both > overkill at runtime and overcomplex on the software side too to manage > those kernel pools in user memory. Whilst I take your point in principle, and acknowledge that there is some cost to pay, I don't believe that the working set of one task is all that dynamic (see also second para below). Some stuff really is global data, that's used by a lot of processes, but lots of other things really are per task. If only one process has a given file open, that's the only process that needs to see the pagecache control structures for that file. We don't have to tlb flush every time we map something in, only when we delete it. 
For the sake of illustration, imagine a huge kmap pool for each task, we just map things in as we need them (say some pagecache structures when we open a file that's already partly in cache), and use lazy TLB flushing to tear down those structures for free when we context switch. If we run out of virtual space, yes, we'll have to flush, but I suspect that won't be too bad (for most workloads) if we're careful how we flush. > just assume we do that and that you're ok to pay for the hit in general > purpose usage, then the next year how will you plan to work around the > limitation of 64G of physical ram, ;-) No, I agree we're pushing the limits here, and I don't want to be fighting this too much longer. The next generation of machines will all have larger virtual address spaces, and I'll be happy when they arrive. For now, we have to deal with what we have, and support the machines that are in the marketplace, and ia32 is (to my mind) still faster than ia64. I'm really looking forward to AMD's Hammer architecture, but it's simply not here right now, and even when it is, there will be these older 32 bit machines in the field for a few years yet to come, and we have to cope with them as best we can. > Ah, and of course you could also use 2M pagetables by default to make it > more usable but still you would run into huge ram wastage in certain > usages with small files, huge pageins and reads swapout and swapins, > plus it wouldn't be guaranteed to be transparent to the userspace > binaries (for instance mmap offset fields would break backwards > compatibility on the required alignment, that's probably the last > problem though). Despite its own significant drawbacks and the > complexity of the change, probably the 4M pagetables would be the saner > approach to manage more efficiently 64G with only a 800M kernel window.
Though that'd reduce the size of some of the structures, I'd still have other concerns (such as tlb size, which is something stupid like 4 pages, IIRC), and the space wastage you mentioned. Page clustering is probably a more useful technique - letting the existing control structures control groups of pages. For example, one struct page could control aligned groups of 4 4K pages, giving us an effective page size of 16K from the management overhead point of view (swap in and out in 4 page chunks, etc). >> Bear in mind that we've successfully used 64Gb of ram in a 32 bit >> virtual addr space a long time ago with Dynix/PTX. > > You can use 64G "successfully" just now too with 2.4.19pre8 too, as said I said *used*, not *booted* ;-) There's a whole host of problems we still have to fix yet, and some tradeoffs to be made - we just have to make those without affecting the people that don't need them. It won't be easy, but I don't think it'll be impossible either. >> Bufferheads are another huge problem right now. For a P4 machine, they >> round off to 128 bytes per data structure. I was just looking at a 16Gb >> machine that had completely wedged itself by filling ZONE_NORMAL with > > Go ahead, use -aa or the vm-33 update, I fixed that problem a few days > after hearing about it the first time (with due credit to Rik in a > comment for showing me such a problem btw, I never noticed it before). Thanks - I'll have a close look at that ... I didn't know you'd already fixed that one. M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() ) 2002-05-03 15:17 ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh @ 2002-05-03 15:58 ` Andrea Arcangeli 2002-05-03 16:10 ` Martin J. Bligh 2002-05-03 16:02 ` Daniel Phillips 1 sibling, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 15:58 UTC (permalink / raw) To: Martin J. Bligh Cc: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel On Fri, May 03, 2002 at 08:17:23AM -0700, Martin J. Bligh wrote: > We don't have to tlb flush every time we map something in, only when > we delete it. For the sake of illustration, imagine a huge kmap pool > for each task, we just map things in as we need them (say some pagecache yes, the pool will "cache" the mem_map virtual window for a while, but the complexity of the pool management isn't trivial, in the page structure you won't find the associated per-task cached virtual address, you will need something like a lookup on a data structure associated with the task struct to find if you just have it in cache or not in the per-process userspace kmap pool. The current kmap pool is an order of magnitude simpler thanks to page->virtual but you cannot have a page->virtual[nr_tasks] array. Another interesting problem is that 'struct page *' will be at best a cookie, not a valid pointer anymore, not sure what's the best way to handle that. Working with pfn would be cleaner rather than working with a cookie (somebody could dereference the cookie by mistake thinking it's a page structure old style), but if __alloc_pages returns a pfn a whole lot of kernel code will break. > older 32 bit machines in the field for a few years yet to come, and > we have to cope with them as best we can. Sure. > Though that'd reduce the size of some of the structures, I'd still > have other concerns (such as tlb size, which is something stupid > like 4 pages, IIRC).
it has 8 pages for data and 2 for instructions, that's 16M of data and 4M of instructions with PAE. 4k pages can be cached with at most 64 slots for data and 32 entries for instructions, that means 256K of data and 128k of instructions. The main disadvantage is that we basically would waste the 4k tlb slots, and we'd share the same slots with the kernel. It mostly depends on the workload but in theory the 8 pages for data could reduce the pte walking (not to mention that one less layer of pte would make the pte walking faster too). So I think 2M pages could speed up some applications, but the main advantage remains that you wouldn't need to change the page structure handling. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
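The bookkeeping problem Andrea describes, that a per-task kmap pool needs its own pfn-to-virtual lookup because a global page->virtual[nr_tasks] array is impossible, can be sketched as a toy model (Python purely for illustration; all names are invented, and the real cost difference between invlpg and a full flush is ignored):

```python
class PerTaskKmapPool:
    """Toy model of the per-task kmap pool Martin proposes (not kernel
    code).  Each task caches its own pfn -> virtual-slot mappings in a
    per-task lookup table; the (expensive) TLB flush is deferred until
    the task's virtual window is exhausted."""

    def __init__(self, slots):
        self.slots = slots
        self.map = {}        # pfn -> virtual slot index, per task
        self.flushes = 0     # counts full flushes of the window

    def kmap(self, pfn):
        if pfn in self.map:              # cache hit: no mapping work at all
            return self.map[pfn]
        if len(self.map) == self.slots:  # window full: lazy global flush
            self.map.clear()
            self.flushes += 1
        self.map[pfn] = len(self.map)    # hand out the next free slot
        return self.map[pfn]

pool = PerTaskKmapPool(slots=4)
for pfn in [1, 2, 3, 1, 2, 4, 5]:   # the 5th distinct pfn forces a flush
    pool.kmap(pfn)
print(pool.flushes)   # -> 1
```

The hit path costs only a lookup, which is the attraction; the per-task lookup structure itself is exactly the extra complexity Andrea objects to.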
* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() ) 2002-05-03 15:58 ` Andrea Arcangeli @ 2002-05-03 16:10 ` Martin J. Bligh 2002-05-03 16:25 ` Andrea Arcangeli 0 siblings, 1 reply; 152+ messages in thread From: Martin J. Bligh @ 2002-05-03 16:10 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: William Lee Irwin III, Daniel Phillips, linux-kernel > Another interesting problem is that 'struct page *' will be at best a > cookie, not a valid pointer anymore, not sure what's the best way to > handle that. Working with pfn would be cleaner rather than working with > a cookie (somebody could dereference the cookie by mistake thinking it's > a page structure old style), but if __alloc_pages returns a pfn a whole > lot of kernel code will break. Yup, a physical address pfn would probably be best. (such as tlb size, which is something stupid like 4 pages, IIRC) > it has 8 pages for data and 2 for instructions, that's 16M data and 4M > of instructions with PAE What is "it", a P4? I think the sizes are dependent on which chip you're using. The x440 has the P4 chips, but the NUMA-Q is P2 or P3 (even PPro for the oldest ones, but those don't work at the moment with Linux on multiquad). M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() ) 2002-05-03 16:10 ` Martin J. Bligh @ 2002-05-03 16:25 ` Andrea Arcangeli 0 siblings, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 16:25 UTC (permalink / raw) To: Martin J. Bligh; +Cc: William Lee Irwin III, Daniel Phillips, linux-kernel On Fri, May 03, 2002 at 09:10:46AM -0700, Martin J. Bligh wrote: > > Another interesting problem is that 'struct page *' will be at best a > > cookie, not a valid pointer anymore, not sure what's the best way to > > handle that. Working with pfn would be cleaner rather than working with > > a cookie (somebody could dereference the cookie by mistake thinking it's > > a page structure old style), but if __alloc_pages returns a pfn a whole > > lot of kernel code will break. > > Yup, a physical address pfn would probably be best. > > (such as tlb size, which is something stupid like 4 pages, IIRC) you recall the mean correctly :), it's 8 for data and 2 for instructions. But I don't think the tlb is the problem, potentially it's a big win for the big apps like database, more ram addressed via tlb and faster pagetable lookups, it's the I/O granularity for the pageins that is probably the most annoying part. Even if you've a fast disk, 2M instead of kbytes is going to make a difference, as well as the fact that 4M per page and the bh on the pagecache would waste quite a lot of ram with small files. > > it has 8 pages for data and 2 for instructions, that's 16M data and 4M > > of instructions with PAE > > What is "it", a P4? I think the sizes are dependent on which chip you're I didn't read if P4 changes that, nor have I checked the athlon yet, I read it in the usual and a bit old system programming manual 3. > using. The x440 has the P4 chips, but the NUMA-Q is P2 or P3 (even > PPro for the oldest ones, but those don't work at the moment with Linux > on multiquad). that's the P6 family, so the PPro P2 P3 all included (only P5 excluded).
Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
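Andrea's TLB-reach numbers for the P6 family check out arithmetically. A quick sketch, taking the slot counts from his mails at face value (8 large-page data slots, 2 large-page instruction slots, 64 and 32 slots respectively for 4K pages):

```python
# TLB reach = number of slots x page size, for 2M (PAE) and 4K pages
PAGE_4K, PAGE_2M = 4 << 10, 2 << 20

large_data = 8 * PAGE_2M     # 8 large-page data slots
large_insn = 2 * PAGE_2M     # 2 large-page instruction slots
small_data = 64 * PAGE_4K    # 64 4K data slots
small_insn = 32 * PAGE_4K    # 32 4K instruction slots

print(large_data >> 20, large_insn >> 20)   # -> 16 4   (megabytes)
print(small_data >> 10, small_insn >> 10)   # -> 256 128 (kilobytes)
```

The two-orders-of-magnitude gap in reach (16M vs 256K of data) is why 2M pages look attractive for large working sets despite the I/O granularity problems discussed above.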
* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() ) 2002-05-03 15:17 ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh 2002-05-03 15:58 ` Andrea Arcangeli @ 2002-05-03 16:02 ` Daniel Phillips 2002-05-03 16:20 ` Andrea Arcangeli 1 sibling, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-03 16:02 UTC (permalink / raw) To: Martin J. Bligh, Andrea Arcangeli; +Cc: William Lee Irwin III, linux-kernel On Friday 03 May 2002 17:17, Martin J. Bligh wrote: > Andrea apparently wrote: > > Ah, and of course you could also use 2M pagetables by default to make it > > more usable but still you would run in some huge ram wastage in certain > > usages with small files, huge pageins and reads swapout and swapins, > > plus it wouldn't be guaranteed to be transparent to the userspace > > binaries (for istance mmap offset fields would break backwards > > compatibility on the required alignment, that's probably the last > > problem though). Despite its also significant drawbacks and the > > complexity of the change, probably the 4M pagetables would be the saner > > approch to manage more efficiently 64G with only a 800M kernel window. > > Though that'd reduce the size of some of the structures, I'd still > have other concerns (such as tlb size, which is something stupid > like 4 pages, IIRC), and the space wastage you mentioned. Page > clustering is probably a more useful technique - letting the existing > control structures control groups of pages. For example, one struct > page could control aligned groups of 4 4K pages, giving us an > effective page size of 16K from the management overhead point of > view (swap in and out in 4 page chunks, etc). IMHO, this will be a much easier change than storing mem_map in highmem, and solves 75% of the problem. It's not just ia32 numa that will benefit from it. For example, MIPS supports 16K pages in software, which will take a lot of load off the tlb. 
According to Ralf, there are benefits re virtual aliasing as well. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() ) 2002-05-03 16:02 ` Daniel Phillips @ 2002-05-03 16:20 ` Andrea Arcangeli 2002-05-03 16:41 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 16:20 UTC (permalink / raw) To: Daniel Phillips; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote: > and solves 75% of the problem. It's not just ia32 numa that will benefit > from it. For example, MIPS supports 16K pages in software, which will the whole change would be specific to ia32, I don't see the connection with mips. There would be nothing to share between ia32 2M pages and mips 16K pages. You can do mips 16K just now independently from the page_size of ia32. 16K should work without surprises because other archs have pages of this size and even bigger. Nobody has pages as large as 2M yet, that's an order of magnitude bigger. 16K for example is just fine for the read()/pagein/pageout I/O, DMA is usually done in larger chunks anyways with readahead and async-flushing to be faster (but never as big as 2M, the highest limit is 512k per scsi command). Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() ) 2002-05-03 16:20 ` Andrea Arcangeli @ 2002-05-03 16:41 ` Daniel Phillips 2002-05-03 16:58 ` Andrea Arcangeli 0 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-03 16:41 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel On Friday 03 May 2002 18:20, Andrea Arcangeli wrote: > On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote: > > and solves 75% of the problem. It's not just ia32 numa that will benefit > > from it. For example, MIPS supports 16K pages in software, which will > > the whole change would be specific to ia32, I don't see the connection > with mips. There would be nothing to share between ia32 2M pages and > mips 16K pages. The topic here is 'page clustering'. The idea is to use one struct page for every four 4K page frames on ia32. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() ) 2002-05-03 16:41 ` Daniel Phillips @ 2002-05-03 16:58 ` Andrea Arcangeli 2002-05-03 18:08 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 16:58 UTC (permalink / raw) To: Daniel Phillips; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel On Fri, May 03, 2002 at 06:41:15PM +0200, Daniel Phillips wrote: > On Friday 03 May 2002 18:20, Andrea Arcangeli wrote: > > On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote: > > > and solves 75% of the problem. It's not just ia32 numa that will benefit > > > from it. For example, MIPS supports 16K pages in software, which will > > > > the whole change would be specific to ia32, I don't see the connection > > with mips. There would be nothing to share between ia32 2M pages and > > mips 16K pages. > > The topic here is 'page clustering'. The idea is to use one struct page for > every four 4K page frames on ia32. ah ok, I meant physical hardware pages. physical hardware pages should be doable without common code changes, a software PAGE_SIZE or the PAGE_CACHE_SIZE raises non trivial problems instead. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() ) 2002-05-03 16:58 ` Andrea Arcangeli @ 2002-05-03 18:08 ` Daniel Phillips 0 siblings, 0 replies; 152+ messages in thread From: Daniel Phillips @ 2002-05-03 18:08 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel On Friday 03 May 2002 18:58, Andrea Arcangeli wrote: > On Fri, May 03, 2002 at 06:41:15PM +0200, Daniel Phillips wrote: > > On Friday 03 May 2002 18:20, Andrea Arcangeli wrote: > > > On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote: > > > > and solves 75% of the problem. It's not just ia32 numa that will benefit > > > > from it. For example, MIPS supports 16K pages in software, which will > > > > > > the whole change would be specific to ia32, I don't see the connection > > > with mips. There would be nothing to share between ia32 2M pages and > > > mips 16K pages. > > > > The topic here is 'page clustering'. The idea is to use one struct page for > > every four 4K page frames on ia32. > > ah ok, I meant physical hardware pages. physical hardware pages should > be doable without common code changes, a software PAGE_SIZE or the > PAGE_CACHE_SIZE raises non trivial problems instead. Yes, it's not too bad though. In the swap-in path, the locking would be against mem_map + (pfn >> 2). The four pages don't have to be read in and valid all at the same time - it's ok to take multiple faults on the cluster, not recommended, but ok. In the swap-out path, all four page frames have to be swapped out and invalidated at the same time. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
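The cluster arithmetic Daniel refers to, one struct page per four aligned 4K frames with locking against mem_map + (pfn >> 2), can be illustrated like this (Python for illustration only; the real thing would be C indexing into mem_map):

```python
CLUSTER_SHIFT = 2                    # 4 x 4K frames per struct page

def pfn_to_cluster(pfn):
    """Index of the single struct page covering this pfn's cluster,
    i.e. the mem_map + (pfn >> 2) computation from Daniel's mail."""
    return pfn >> CLUSTER_SHIFT

def cluster_frames(cluster):
    """The four aligned pfns that one cluster's struct page controls."""
    base = cluster << CLUSTER_SHIFT
    return list(range(base, base + (1 << CLUSTER_SHIFT)))

# pfns 8..11 share one struct page, so a swap-in fault on any of them
# locks the same mem_map entry; swap-out must evict all four at once.
assert {pfn_to_cluster(p) for p in (8, 9, 10, 11)} == {2}
assert cluster_frames(2) == [8, 9, 10, 11]
```

This is what cuts mem_map to a quarter of its size: the effective management page size becomes 16K while the hardware keeps faulting on 4K pages.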
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 6:04 ` Andrea Arcangeli 2002-05-03 6:33 ` Martin J. Bligh @ 2002-05-03 9:24 ` William Lee Irwin III 2002-05-03 10:30 ` Andrea Arcangeli 2002-05-03 15:32 ` Martin J. Bligh 1 sibling, 2 replies; 152+ messages in thread From: William Lee Irwin III @ 2002-05-03 9:24 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel I apologize in advance for the untimeliness of this response; I took perhaps more time than necessary to consider the contents thereof. On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote: >> Without relaxing this invariant for this architecture there is no hope >> that NUMA-Q can ever be efficiently operated by this kernel. On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote: > I don't think it makes sense to attempt breaking GFP_KERNEL semantics in > 2.4 but for 2.5 we can change stuff so that all non-DMA users can ask > for ZONE_NORMAL that will be backed by physical memory over 4G (that's > fine for all inodes,dcache,files,bufferheader,kiobuf,vma and many other > in-core data structures never accessed by hardware via DMA, it's ok even > for the buffer cache because the lowlevel layer has the bounce buffer > layer that is smart enough to understand when bounce buffers are needed > on top of the physical address space pagecache). Well, in a sense, they're already facing some problems from the progressively stranger hardware people have been porting Linux to. For instance, suppose there were a machine whose buses were only capable of addressing memory on nodes local to them... The assumption that membership within a single address region suffices to ensure that devices are capable of addressing it then breaks down. (The workaround was to IPI and issue the command from another node.)
On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote: >> Artificially tying together the device-addressability of memory and >> virtual addressability of memory is a fundamental design decision which >> seems to behave poorly for NUMA-Q, though in general it seems to work okay. On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote: > Yes, you know since a few months ago we weren't even capable of skipping > the bounce buffers for the memory between 1G and 4G and for the memory > above 4G with pci-64, now we can, in the future we can be more > fine-grained if there's the need to. > Again note that nonlinear can do nothing to help you there, the > limitation you deal with is pci32 and the GFP API, not at all about > discontigmem or nonlinear. we basically changed topic from here. Given the amount of traffic that's already happened for that thread, I'd be glad to change subjects. =) While I don't have a particular plan to address what changes to the GFP API might be required to make these scenarios work, a quick thought is to pass in indices into a table of zones corresponding to regions of memory addressable by some devices and not others. It'd give rise to a partition like what is already present with foreknowledge of ISA DMA and 32-bit PCI, but there would be strange corner cases, for instance, devices claiming to be 32-bit PCI that don't wire all the address lines. I'm not entirely sure how smoothly these cases are now handled anyway.
On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote: > If you're sure all the hw devices are pci64 and the device drivers are > using DAC to submit the bus addresses, then you're just fine and you can > use pages over 4G for the ZONE_NORMAL too. and yes, if you add an IOMMU > unit like the GART then you can fill the zone_normal with phys pages > over 4G too because then the bus address won't be an identity anymore > with the phys addr, I just assumed it wasn't the case because most x86 > machines don't have that capability besides the GART, which isn't currently used > by the kernel as an iommu but is left to be used to build contiguous > ram for the AGP cards (and also not all x86 have an AGP so we couldn't > use it by default on x86 even assuming the graphics card doesn't need > it). That sounds a bit painful; digging through drivers to check if any are missing DAC support is not my idea of a good time. On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote: >> I've not been using the generic page_address() in conjunction with >> highmem, but this sounds like a very natural thing to do when the need >> to do so arises; arranging for storage of the virtual address sounds >> trickier, though doable. I'm not sure if mainline would want it, and >> I don't feel a pressing need to implement it yet, but then again, I've >> not yet been parked in front of a 64GB x86 machine yet... On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote: > Personally I always had the hope to never need to see a 64G 32bit > machine 8). I mean, even if you manage to solve the pci32bit problem > with GFP_KERNEL, then you still have to share 800M across 16 nodes with > 4G each. So by striping zone_normal over all the nodes to have numa-local > data structures with fast slab allocations you will get at most 50mbyte per > node of which around 90% of this 50M will be eaten by the mem_map array > for those 50M plus the other 4G-50M.
So at the end you'll be left with > only say 5/10M per node of zone_normal that will be filled immediately as > soon as you start reading some directory from disk. a few hundred mbyte > of vfs cache is the minimum for those machines, this doesn't even take > into account bh headers for the pagecache, physical address space > pagecache for the buffercache, kiobufs, vma, etc... Even ignoring the fact > it's NUMA a 64G machine will boot fine (thanks to your 2.4.19pre work > that shrinks each page structure by some bytes) but still it will work well > only depending on what you're doing, for example it's fine for number > crunching but it will be bad for most other important workloads. And > this is only because of the 32bit address space, it doesn't have anything > to do with nonlinear/numa/discontigmem or pci32. It's just that 1G of > virtual address space reserved for kernel is too low to handle > efficiently 64G of physical ram, this is a fact and you can't work around > it. Every workaround will add a penalty here or there. The workaround > you will be mostly forced to take is CONFIG_2G, after that the userspace > will be limited to less than 1G per task returned by malloc (from over > 1G to below 2G) and that will be a showstopper again for most userspace > apps that want to run on a 64G box like a DBMS that wants almost 2G of > SGA. I'm glad we're finally going to migrate all to 64bit, just in time > not to see a relevant number of 32bit 64G boxes. 64GB machines are not new. NUMA-Q's original OS (DYNIX/ptx) must have been doing something radically different, for it appeared to run well there, and it did so years ago. The amount of data actually required to be globally mapped should in principle be no larger than the kernel's loaded image, and everything else can be dynamically mapped by mapping pages as pointers into them are followed.
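The per-node striping arithmetic Andrea quotes can be reproduced directly. A sketch (the 48-byte struct page size is my assumption; Andrea only says the 2.4.19pre work shrank the structure by some bytes):

```python
NODES = 16
KERNEL_WINDOW = 800 << 20     # ~800M of kernel virtual space for lowmem
NODE_RAM = 4 << 30            # 4G of physical ram per node
PAGE_SIZE = 4 << 10
STRUCT_PAGE = 48              # assumed size of struct page in 2.4

per_node = KERNEL_WINDOW // NODES                   # lowmem budget per node
mem_map = (NODE_RAM // PAGE_SIZE) * STRUCT_PAGE     # struct pages for 4G

print(per_node >> 20)   # -> 50 (MB of lowmem per node)
print(mem_map >> 20)    # -> 48 (MB of mem_map per node)
```

With roughly 48M of each node's 50M lowmem budget consumed by mem_map alone, the "around 90%" figure in the quoted mail follows immediately, leaving only a few megabytes per node for slab, vfs cache and everything else.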
The practical reality of getting Linux to do this for a significant fraction of its globally-mapped structures (or anyone accepting a patch to make it do so) is another matter entirely. Optional facilities for the worst offenders might be more practical, for instance: (1) Given per-zone kswapd's, i.e. separate process contexts for each large fragment of mem_map, it should be possible to reserve a large portion of the process' address space for mapping in its local mem_map. Algorithms allowing sufficient locality of reference (e.g. reverse-mappings) would be required for this to be effective. (2) Various large boot-time allocated structures (think big hash tables) could be changed so that either the algorithm only requires a small root to be globally mapped in the kernel virtual address space (trees), localized on a per-object basis if there is an object to hang them off of (e.g. ratcache), or highmem allocate the table with a globally-mapped physical address available for mapping in the needed portions on-demand (like the above mem_map suggestion but without any way to give process contexts the ability to restrict themselves to orthogonal subsets of the structure). (3) In order to accommodate the sheer number of dynamic mappings going on a large process/mmu-context-local cache of virtual address space for mapping them in would be needed for efficiency, changing the memory map of Linux/i386 as well as adding another kind of (address-space local) kmapping. (4) The bootstrap sequence would need to be altered so that dynamic mappings of boot-time allocated structures residing outside the direct-mapped portion of the kernel virtual address space are possible, as well as the usual sprinkling of small chunks of ZONE_NORMAL across nodes so that something is possible. Almost anything could exhaust of the kernel virtual address space if left permanently mapped. And worse yet, there are some DBMS's that want 3.5GB, not just 3GB. 
These potentially very time-consuming changes basically kmap everything, including the larger portions of mem_map. The answer I seem to hear most often is "get a 64-bit CPU". But I believe it's fully possible to get the larger highmem systems to what is very near a sane working state and feed back to mainline a good portion of the less invasive patches required to address fundamental stability issues associated with highmem, and welcome any assistance toward that end. What is likely the more widely beneficial aspect of this work is that it can expose the fundamental stability issues of the highmem implementation very readily and so provide users of more common 32-bit highmem systems a greater degree of stability than they have previously enjoyed, owing to kva exhaustion issues. On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote: > And of course, I don't mean a 64G 32bit machine doesn't make sense, it > can make perfect sense for a certain number of users with specific needs > of lots of ram and with very few kernel data structures, if you do that > that's because you know what you're doing and you know you can tweak > linux for your own workload and that's fine as far it's not supposed to > be a general purpose machine (with general purpose I mean pretending to > run a DBMS with a 1.7G SGA requiring CONFIG_3G, plus a cvs [or bk if > you're a bk fan] server dealing with huge vfs metadata at the same time, > for instance the cvs workload would run faster booting with mem=32G :) Well, this is certainly not the case with other OS's. The design limitations of Linux' i386 memory layout, while they now severely hamper performance on NUMA-Q, are a tradeoff that has proved advantageous on other platforms, and should be approached with some degree of caution even while Martin Bligh (truly above all others), myself, and others attempt to address the issues raised by it on NUMA-Q. 
But I believe it is possible to achieve a good degree of virtual address space conservation without compromising the general design, and if I may be so bold as to speak on behalf of my friends, I believe we are willing to, capable of, and now exercising that caution. On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote: >> Absolutely; I'd be very supportive of improvements for this case as well. >> Many of the systems with the need for discontiguous memory support will >> also benefit from parallelizations or other methods of avoiding references >> to remote nodes/zones or iterations over all nodes/zones. On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote: > I would suggest to start on case-by-case basis looking at the profiling, > so we make more complex only what is worth to optimize. For example > nr_free_buffer_pages() I guess it will showup because it is used quite > frequently. I think I see nr_free_pages(), but nr_free_buffer_pages() sounds very likely as well. Both of these would likely benefit from per-cpu counters. Cheers, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 9:24 ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] William Lee Irwin III @ 2002-05-03 10:30 ` Andrea Arcangeli 2002-05-03 11:09 ` William Lee Irwin III 2002-05-03 15:42 ` Martin J. Bligh 2002-05-03 15:32 ` Martin J. Bligh 1 sibling, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 10:30 UTC (permalink / raw) To: William Lee Irwin III, Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel On Fri, May 03, 2002 at 02:24:26AM -0700, William Lee Irwin III wrote: > 64GB machines are not new. NUMA-Q's original OS (DYNIX/ptx) must have > been doing something radically different, for it appeared to run well > there, and it did so years ago. The amount of data actually required to Did you ever benchmark DYNIX/ptx against Linux on a 64bit machine or on a 4G x86 machine? Special changes to deal with the small KVA as said are possible but they will have to affect performance somehow. One way to reduce the regression on the normal 32bit machines could be to take the special actions like putting the mem_map in highmem only dependent on the amount of ram (there would still be the branches for every access of a page structure, at least unless you take the messy self-modifying code way). > The answer I seem to hear most often is "get a 64-bit CPU". > > But I believe it's fully possible to get the larger highmem systems to > what is very near a sane working state and feed back to mainline a good > portion of the less invasive patches required to address fundamental > stability issues associated with highmem, and welcome any assistance > toward that end. The stability should be just complete in current -aa, it's just the performance that won't be ok. If you want more cache, larger hashes, more skb etc... you'll need to pay with something else that would then only hurt on a 64bit arch or on a smaller box then. 
> What is likely the more widely beneficial aspect of this work is that > it can expose the fundamental stability issues of the highmem > implementation very readily and so provide users of more common 32-bit > highmem systems a greater degree of stability than they have previously > enjoyed owing to kva exhaustion issues. Agreed, in fact if somebody can test current -aa on a 64G x86 box I'd be glad to hear the results. It should just work stable, at least as far as the VM is concerned (mainline should have some problem instead), except it will probably return -ENOMEM on mmap/open/etc.. after you finish normal_zone, and there can be packet loss too, but that's expected (CONFIG_2G will make it almost completely usable on the kernel side, but reducing userspace). The important thing is that it never deadlocks or malfunctions with CONFIG_3G. > Well, this is certainly not the case with other OS's. The design > limitations of Linux' i386 memory layout, while they now severely I see it's limited for your needs on a 64G box, but "limited" looks like "weak", while it's really the optimal design for 64bit archs and normal 32bit machines. > hamper performance on NUMA-Q, are a tradeoff that has proved > advantageous on other platforms, and should be approached with some > degree of caution even while Martin Bligh (truly above all others), > myself, and others attempt to address the issues raised by it on NUMA-Q. > But I believe it is possible to achieve a good degree of virtual > address space conservation without compromising the general design, > and if I may be so bold as to speak on behalf of my friends, I believe > we are willing to, capable of, and now exercising that caution. 
Putting the mem_map in highmem would be the first step, after that you should be just at about 90% of the work done to make it general purpose, you should wrap most actions on the page struct with wrappers and it will be quite an invasive change (much more invasive than pte-highmem), but it could be done. For this one (unlike pte-highmem) you definitely need a config option to select it, most people don't need this feature enabled because they've less than 8G of ram and also considering it will have a significant runtime cost. > On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote: > >> Absolutely; I'd be very supportive of improvements for this case as well. > >> Many of the systems with the need for discontiguous memory support will > >> also benefit from parallelizations or other methods of avoiding references > >> to remote nodes/zones or iterations over all nodes/zones. > > On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote: > > I would suggest to start on case-by-case basis looking at the profiling, > > so we make more complex only what is worth to optimize. For example > > nr_free_buffer_pages() I guess it will showup because it is used quite > > frequently. > > I think I see nr_free_pages(), but nr_free_buffer_pages() sounds very > likely as well. Both of these would likely benefit from per-cpu > counters. nr_free_pages() actually could be mostly optimized out by setting overcommit to 1 :), for the rest it is used basically only for /proc stats, but yes, with overcommit to 0 (default) every mmap will take the hit in nr_free_pages() so in most workloads it would be even more frequent than nr_free_buffer_pages() (with the difference that nr_free_buffer_pages cannot be avoided). Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 10:30 ` Andrea Arcangeli @ 2002-05-03 11:09 ` William Lee Irwin III 2002-05-03 11:27 ` Andrea Arcangeli 2002-05-03 15:42 ` Martin J. Bligh 1 sibling, 1 reply; 152+ messages in thread From: William Lee Irwin III @ 2002-05-03 11:09 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, Daniel Phillips, linux-kernel On Fri, May 03, 2002 at 12:30:09PM +0200, Andrea Arcangeli wrote: > Putting the mem_map in highmem would be the first step, after that you > should be just at at the 90% of work done to make it general purpose, > you should wrap most actions on the page struct with wrappers and it > will be quite an invasive change (much more invasive than pte-highmem), > but it could be done. For this one (unlike pte-highmem) you definitely > need a config option to select it, most people doesn't need this feature > enabled because they've less than 8G of ram and also considering it will > have a significant runtime cost. Invasive or not, if running is impossible without it, it must be done. This is a probable first order of business given that it is the single largest consumer of KVA with only really enough mitigation for bootability provided by my prior efforts at reducing the size of struct page. A clean, perhaps even mergeable design for this would be a great boon to all users of larger highmem systems. IIRC buffer_heads were the specific reported problem, and though they themselves consume excessive KVA only under some circumstances, they present a much greater danger in combination with the excessively large boot-time KVA allocation. Martin, can you take over? I've got plenty of ideas about what to code up, but you've actually got your hands on the machine and are knee-deep in the issues. I'm getting hit up for specifics I can't answer. Andrea, it might also be helpful to hear your input during the LSE conference call tomorrow. 
The topic is KVA exhaustion scenarios, which seem to be of interest to you as well. Cheers, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 11:09 ` William Lee Irwin III @ 2002-05-03 11:27 ` Andrea Arcangeli 0 siblings, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 11:27 UTC (permalink / raw) To: William Lee Irwin III, Martin J. Bligh, Daniel Phillips, linux-kernel On Fri, May 03, 2002 at 04:09:51AM -0700, William Lee Irwin III wrote: > page. A clean, perhaps even mergeable design for this would be a great > boon to all users of larger highmem systems. IIRC buffer_heads were the > specific reported problem, and though they themselves consume excessive bh problems should be fixed with my latest vm updates, while it's nice to cache the bh across multiple writes, it's not a big problem having to ask the fs again to translate from logical to physical so dropping bh aggressively when needed is ok and the right thing to do. > Andrea, it might also be helpful to hear your input during the LSE > conference call tomorrow. The topic is KVA exhaustion scenarios, which > seem to be of interest to you as well. ok. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 10:30 ` Andrea Arcangeli 2002-05-03 11:09 ` William Lee Irwin III @ 2002-05-03 15:42 ` Martin J. Bligh 1 sibling, 0 replies; 152+ messages in thread From: Martin J. Bligh @ 2002-05-03 15:42 UTC (permalink / raw) To: Andrea Arcangeli, William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel > Putting the mem_map in highmem would be the first step, after that you > should be just at at the 90% of work done to make it general purpose, > you should wrap most actions on the page struct with wrappers and it > will be quite an invasive change (much more invasive than pte-highmem), > but it could be done. For this one (unlike pte-highmem) you definitely > need a config option to select it, most people doesn't need this feature > enabled because they've less than 8G of ram and also considering it will > have a significant runtime cost. Absolutely agree making it an option - other people with smaller memory configs may also find this useful for enlarging the user address space to 3.5Gb for databases et al. with a 8Gb or 16Gb machine. M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 9:24 ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] William Lee Irwin III 2002-05-03 10:30 ` Andrea Arcangeli @ 2002-05-03 15:32 ` Martin J. Bligh 1 sibling, 0 replies; 152+ messages in thread From: Martin J. Bligh @ 2002-05-03 15:32 UTC (permalink / raw) To: William Lee Irwin III, Andrea Arcangeli Cc: Daniel Phillips, Russell King, linux-kernel >> Again note that nonlinear can do nothing to help you there, the >> limitation you deal with is pci32 and the GFP API, not at all about >> discontigmem or nonlinear. > > While I don't have a particular plan to address what changes to the > GFP API might be required to make these scenarios work, a quick thought > is to pass in indices into a table of zones corresponding to regions of > memory addressable by some devices and not others. It'd give rise to a > partition like what is already present with foreknowledge of ISA DMA > and 32-bit PCI, but there would be strange corner cases, for instance, > devices claiming to be 32-bit PCI that don't wire all the address lines. > I'm not entirely sure how smoothly these cases are now handled anyway. In my mind, one possibility for a powerful API would be to specify a mask of acceptable physical addresses, and a "state" for what kind of mapping you wanted - global kernel permanently mapped address, unmapped address, per-task kernel mapped address, per-address space kernel mapped address, etc. Without thinking about it too much (aka I'm sticking my neck out and am going to get shot down ;-)) it would seem possible to do the phys mask idea inside the current buddy system without too much problem if the mask was aligned on 2^MAX_ORDER * sizeof(struct page) boundaries? I need to think about that one some more, but I thought I'd throw it out to see what people think ... M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 18:41 ` Andrea Arcangeli 2002-05-02 19:19 ` William Lee Irwin III @ 2002-05-02 19:22 ` Daniel Phillips 2002-05-03 6:06 ` Andrea Arcangeli 1 sibling, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-02 19:22 UTC (permalink / raw) To: Andrea Arcangeli, William Lee Irwin III, Martin J. Bligh, Russell King, linux-kernel On Thursday 02 May 2002 20:41, Andrea Arcangeli wrote: > On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote: > > On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote: > > >> Even with 64 bit DMA, the real problem is breaking the assumption > > >> that mem between 0 and 896Mb phys maps 1-1 onto kernel space. > > >> That's 90% of the difficulty of what Dan's doing anyway, as I > > >> see it. > > > > On Thu, May 02, 2002 at 06:40:37PM +0200, Andrea Arcangeli wrote: > > > control on virt_to_page, pci_map_single, __va. Actually it may be as > > > well cleaner to just let the arch define page_address() when > > > discontigmem is enabled (instead of hacking on top of __va), that's a > > > few liner. (the only true limit you have is on the phys ram above 4G, > > > that cannot definitely go into zone-normal regardless if it belongs to a > > > direct mapping or not because of pci32 API) > > > Andrea > > > > Being unable to have any ZONE_NORMAL above 4GB allows no change at all. > > No change if your first node maps the whole first 4G of physical address > space, but in such case nonlinear cannot help you in any way anyways. You *still don't have a clue what config_nonlinear does*. It doesn't matter if the first 4G of physical memory belongs to node zero. Config_nonlinear allows you to map only part of that to the kernel virtual space, and the rest would be mapped to highmem. 
The next node will map part of its local memory (perhaps the next 4 gig of physical memory) to a different part of the kernel virtual space, and so on, so that in the end, all nodes have at least *some* zone_normal memory. Do you now see why config_nonlinear is needed in this case? Are you willing to recognize the possibility that you might have missed some other cases where config_nonlinear is needed, and config_discontigmem won't do the job? -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:22 ` Daniel Phillips @ 2002-05-03 6:06 ` Andrea Arcangeli 0 siblings, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 6:06 UTC (permalink / raw) To: Daniel Phillips Cc: William Lee Irwin III, Martin J. Bligh, Russell King, linux-kernel On Thu, May 02, 2002 at 09:22:07PM +0200, Daniel Phillips wrote: > On Thursday 02 May 2002 20:41, Andrea Arcangeli wrote: > > On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote: > > > On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote: > > > >> Even with 64 bit DMA, the real problem is breaking the assumption > > > >> that mem between 0 and 896Mb phys maps 1-1 onto kernel space. > > > >> That's 90% of the difficulty of what Dan's doing anyway, as I > > > >> see it. > > > > > > On Thu, May 02, 2002 at 06:40:37PM +0200, Andrea Arcangeli wrote: > > > > control on virt_to_page, pci_map_single, __va. Actually it may be as > > > > well cleaner to just let the arch define page_address() when > > > > discontigmem is enabled (instead of hacking on top of __va), that's a > > > > few liner. (the only true limit you have is on the phys ram above 4G, > > > > that cannot definitely go into zone-normal regardless if it belongs to a > > > > direct mapping or not because of pci32 API) > > > > Andrea > > > > > > Being unable to have any ZONE_NORMAL above 4GB allows no change at all. > > > > No change if your first node maps the whole first 4G of physical address > > space, but in such case nonlinear cannot help you in any way anyways. > > You *still don't have a clue what config_nonlinear does*. > > It doesn't matter if the first 4G of physical memory belongs to node zero. > Config_nonlinear allows you to map only part of that to the kernel virtual > space, and the rest would be mapped to highmem. 
The next node will map part > of its local memory (perhaps the next 4 gig of physical memory) to a different > part of the kernel virtual space, and so on, so that in the end, all nodes > have at least *some* zone_normal memory. You are the one that has no clue of what I'm talking about. Go ahead, do that and you'll see the corruption you get after the first vmalloc32 or similar. This has nothing to do with nonlinear or anything discontigmem/numa. This is all about the GFP kernel API with pci32. > > Do you now see why config_nonlinear is needed in this case? Are you > willing to recognize the possibility that you might have missed some other > cases where config_nonlinear is needed, and config_discontigmem won't do > the job? > > -- > Daniel Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 16:40 ` Andrea Arcangeli 2002-05-02 17:16 ` William Lee Irwin III @ 2002-05-02 18:25 ` Daniel Phillips 2002-05-02 18:44 ` Andrea Arcangeli 2002-05-02 19:31 ` Martin J. Bligh 2 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-02 18:25 UTC (permalink / raw) To: Andrea Arcangeli, Martin J. Bligh; +Cc: Russell King, linux-kernel On Thursday 02 May 2002 18:40, Andrea Arcangeli wrote: > On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote: > > > You can trivially map the phys mem between 1G and 1G+256M to be in a > > > direct mapping between 3G+256M and 3G+512M, then you can put such 256M > > > at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too. > > > > > > The constraints you have on the normal memory are only two: > > > > > > 1) direct mapping > > > 2) DMA > > > > > > so as far as the ram is capable of 32bit DMA with pci32 and it's mapped > > > in the direct mapping you can put it into the normal zone. There is no > > > difference at all between discontimem or nonlinear in this sense. > > > > Now imagine an 8 node system, with 4Gb of memory in each node. > > First 4Gb is in node 0, second 4Gb is in node 1, etc. > > > > Even with 64 bit DMA, the real problem is breaking the assumption > > that mem between 0 and 896Mb phys maps 1-1 onto kernel space. > > That's 90% of the difficulty of what Dan's doing anyway, as I > > see it. > > You don't need any additional common code abstraction to make virtual > address 3G+256G to point to physical address 1G as in my example above, M ----^ > after that you're free to put the physical ram between 1G and 1G+256M > into the zone normal of node 1 and the stuff should keep working but > with zone-normal spread in more than one node. I don't see that you accomplished that at all, with config_discontig. How can you address the memory at 3G+256M? That looks like highmem to me. 
No good at all for kmem caches, buffers, struct pages, etc. Without config_nonlinear, those structures will all have to be off-node for most nodes. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 18:25 ` Daniel Phillips @ 2002-05-02 18:44 ` Andrea Arcangeli 0 siblings, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 18:44 UTC (permalink / raw) To: Daniel Phillips; +Cc: Martin J. Bligh, Russell King, linux-kernel On Thu, May 02, 2002 at 08:25:35PM +0200, Daniel Phillips wrote: > On Thursday 02 May 2002 18:40, Andrea Arcangeli wrote: > > On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote: > > > > You can trivially map the phys mem between 1G and 1G+256M to be in a > > > > direct mapping between 3G+256M and 3G+512M, then you can put such 256M > > > > at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too. > > > > > > > > The constraints you have on the normal memory are only two: > > > > > > > > 1) direct mapping > > > > 2) DMA > > > > > > > > so as far as the ram is capable of 32bit DMA with pci32 and it's mapped > > > > in the direct mapping you can put it into the normal zone. There is no > > > > difference at all between discontimem or nonlinear in this sense. > > > > > > Now imagine an 8 node system, with 4Gb of memory in each node. > > > First 4Gb is in node 0, second 4Gb is in node 1, etc. > > > > > > Even with 64 bit DMA, the real problem is breaking the assumption > > > that mem between 0 and 896Mb phys maps 1-1 onto kernel space. > > > That's 90% of the difficulty of what Dan's doing anyway, as I > > > see it. > > > > You don't need any additional common code abstraction to make virtual ^^^^^^^ > > address 3G+256G to point to physical address 1G as in my example above, > M ----^ indeed > > after that you're free to put the physical ram between 1G and 1G+256M > > into the zone normal of node 1 and the stuff should keep working but > > with zone-normal spread in more than one node. > > I don't see that you accomplished that at all, with config_discontig. > How can you address the memory at 3G+256M? 
That looks like highmem to that's virtual memory; to access it you only need to dereference the address. To get the page * you can simply use virt_to_page(3G+256M) and it will return the page at phys address 1G. > me. No good at all for kmem caches, buffers, struct pages, etc. It is good for kmem buffers, struct pages, pci32; it's ZONE_NORMAL memory. > Without config_nonlinear, those structures will all have to be off-node > for most nodes. > > -- > Daniel Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 16:40 ` Andrea Arcangeli 2002-05-02 17:16 ` William Lee Irwin III 2002-05-02 18:25 ` Daniel Phillips @ 2002-05-02 19:31 ` Martin J. Bligh 2002-05-02 18:57 ` Andrea Arcangeli 2 siblings, 1 reply; 152+ messages in thread From: Martin J. Bligh @ 2002-05-02 19:31 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel > You don't need any additional common code abstraction to make virtual > address 3G+256G to point to physical address 1G as in my example above, > after that you're free to put the physical ram between 1G and 1G+256M > into the zone normal of node 1 and the stuff should keep working but > with zone-normal spread in more than one node. You just have full > control on virt_to_page, pci_map_single, __va. Actually it may be as > well cleaner to just let the arch define page_address() when > discontigmem is enabled (instead of hacking on top of __va), that's a > few liner. (the only true limit you have is on the phys ram above 4G, > that cannot definitely go into zone-normal regardless if it belongs to a > direct mapping or not because of pci32 API) The thing that's special about ZONE_NORMAL is that it's permanently mapped into kernel virtual address space, so you *cannot* put memory in other nodes into ZONE_NORMAL without changing the mapping between physical to virtual memory to a non 1-1 mapping. No, you don't need to call changing that mapping "CONFIG_NONLINEAR", but that's basically what the bulk of Dan's patch does, so I think we should steal it with impunity ;-) M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:31 ` Martin J. Bligh @ 2002-05-02 18:57 ` Andrea Arcangeli 2002-05-02 19:08 ` Daniel Phillips 2002-05-02 22:39 ` Martin J. Bligh 0 siblings, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 18:57 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 12:31:52PM -0700, Martin J. Bligh wrote: > between physical to virtual memory to a non 1-1 mapping. correct. The direct mapping is nothing magic, it's like a big static kmap area. Everybody is required to use virt_to_page/page_address/pci_map_single/... to switch between virtual address and mem_map anyways (thanks to the discontigous mem_map), so you can use this property by making discontigous the virtual space as well, not only the mem_map. discontigmem basically just allows that. > No, you don't need to call changing that mapping "CONFIG_NONLINEAR", > but that's basically what the bulk of Dan's patch does, so I think we should > steal it with impunity ;-) The difference is that if you use discontigmem you don't clobber the common code in any way, there is no "logical/ordinal" abstraction, there is no special table, it's all hidden in the arch section, and the pgdat you need them anyways to allocate from affine memory with numa. Actually the same mmu technique can be used to coalesce in virtual memory the discontigous chunks of iSeries, then you left the lookup in the tree to resolve from mem_map to the right virtual address and from the right virtual address back to mem_map. (and you left DISCONTIGMEM disabled) I think it should be possible. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 18:57 ` Andrea Arcangeli @ 2002-05-02 19:08 ` Daniel Phillips 2002-05-03 5:15 ` Andrea Arcangeli 2002-05-02 22:39 ` Martin J. Bligh 1 sibling, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-02 19:08 UTC (permalink / raw) To: Andrea Arcangeli, Martin J. Bligh; +Cc: Russell King, linux-kernel On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote: > On Thu, May 02, 2002 at 12:31:52PM -0700, Martin J. Bligh wrote: > > between physical to virtual memory to a non 1-1 mapping. > > correct. The direct mapping is nothing magic, it's like a big static > kmap area. Everybody is required to use > virt_to_page/page_address/pci_map_single/... to switch between virtual > address and mem_map anyways (thanks to the discontigous mem_map), so you > can use this property by making discontigous the virtual space as well, > not only the mem_map. discontigmem basically just allows that. And what if you don't have enough virtual space to fit all the memory you need, plus the holes? Config_nonlinear handles that, config_discontig doesn't. > > No, you don't need to call changing that mapping "CONFIG_NONLINEAR", > > but that's basically what the bulk of Dan's patch does, so I think we should > > steal it with impunity ;-) > > The difference is that if you use discontigmem you don't clobber the > common code in any way, First that's wrong. Look at _alloc_pages and tell me that config_discontig doesn't impact the common code (in fact, it adds two extra subroutine calls, including two loops, to every alloc_pages call). Secondly, config_nonlinear does not clobber the common code. If it does, please show me where. When config_nonlinear is not enabled, suitable stubs are provided to make it transparent. 
> Actually the same mmu technique can be used to coalesce in virtual > memory the discontigous chunks of iSeries, then you left the lookup in > the tree to resolve from mem_map to the right virtual address and from > the right virtual address back to mem_map. (and you left DISCONTIGMEM > disabled) I think it should be possible. So you're proposing a new patch? Have you chosen a name for it? How about 'config_nonlinear'? ;-) -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:08 ` Daniel Phillips @ 2002-05-03 5:15 ` Andrea Arcangeli 2002-05-05 23:54 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 5:15 UTC (permalink / raw) To: Daniel Phillips; +Cc: Martin J. Bligh, Russell King, linux-kernel On Thu, May 02, 2002 at 09:08:18PM +0200, Daniel Phillips wrote: > On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote: > > On Thu, May 02, 2002 at 12:31:52PM -0700, Martin J. Bligh wrote: > > > between physical to virtual memory to a non 1-1 mapping. > > > > correct. The direct mapping is nothing magic, it's like a big static > > kmap area. Everybody is required to use > > virt_to_page/page_address/pci_map_single/... to switch between virtual > > address and mem_map anyways (thanks to the discontigous mem_map), so you > > can use this property by making discontigous the virtual space as well, > > not only the mem_map. discontigmem basically just allows that. > > And what if you don't have enough virtual space to fit all the memory you ZONE_NORMAL is by definition limited by the direct mapping size, so if you don't have enough virtual space you cannot enlarge the zone_normal anyways. If need more virtual space you can only do things like CONFIG_2G. > need, plus the holes? Config_nonlinear handles that, config_discontig > doesn't. > > > > No, you don't need to call changing that mapping "CONFIG_NONLINEAR", > > > but that's basically what the bulk of Dan's patch does, so I think we should > > > steal it with impunity ;-) > > > > The difference is that if you use discontigmem you don't clobber the > > common code in any way, > > First that's wrong. Look at _alloc_pages and tell me that config_discontig > doesn't impact the common code (in fact, it adds two extra subroutine > calls, including two loops, to every alloc_pages call). there are no two subroutines, check -aa. 
And the whole point is that we need a topology description of the machine for numa, nonlinear or not. What you're talking about is the whole numa concept in 2.4, which is anything but superfluous, while the nonlinear implications in common code are superfluous, there just to provide ZONE_NORMAL in more than one node on numa-q. > > Secondly, config_nonlinear does not clobber the common code. If it does, > please show me where. > > When config_nonlinear is not enabled, suitable stubs are provided to make it > transparent. It's the stubs that are visible to the common code and that are superfluous. > > Actually the same mmu technique can be used to coalesce in virtual > > memory the discontigous chunks of iSeries, then you left the lookup in > > the tree to resolve from mem_map to the right virtual address and from > > the right virtual address back to mem_map. (and you left DISCONTIGMEM > > disabled) I think it should be possible. > > So you're proposing a new patch? Have you chosen a name for it? How > about 'config_nonlinear'? ;-) They're called CONFIG_MULTIQUAD and CONFIG_MSCHUNKS. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 5:15 ` Andrea Arcangeli @ 2002-05-05 23:54 ` Daniel Phillips 2002-05-06 0:28 ` Andrea Arcangeli ` (2 more replies) 0 siblings, 3 replies; 152+ messages in thread From: Daniel Phillips @ 2002-05-05 23:54 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, linux-kernel On Friday 03 May 2002 07:15, Andrea Arcangeli wrote: > On Thu, May 02, 2002 at 09:08:18PM +0200, Daniel Phillips wrote: > > On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote: > > > > > > correct. The direct mapping is nothing magic, it's like a big static > > > kmap area. Everybody is required to use > > > virt_to_page/page_address/pci_map_single/... to switch between virtual > > > address and mem_map anyways (thanks to the discontigous mem_map), so you > > > can use this property by making discontigous the virtual space as well, > > > not only the mem_map. discontigmem basically just allows that. > > > > And what if you don't have enough virtual space to fit all the memory you > > ZONE_NORMAL is by definition limited by the direct mapping size, so if > you don't have enough virtual space you cannot enlarge the zone_normal > anyways. If need more virtual space you can only do things like > CONFIG_2G. I must be guilty of not explaining clearly. Suppose you have the following physical memory map: 0: 128 MB 8000,0000: 128 MB 1,0000,0000: 128 MB 1,8000,0000: 128 MB 2,0000,0000: 128 MB 2,8000,0000: 128 MB 3,0000,0000: 128 MB 3,8000,0000: 128 MB The total is 1 GB of installed ram. Yet the kernel's 1G virtual space, can only handle 128 MB of it. The rest falls out of the addressable range and has to be handled as highmem, that is if you preserve the linear relationship between kernel virtual memory and physical memory, as config_discontigmem does. Even if you go to 2G of kernel memory (restricting user space to 2G of virtual) you can only handle 256 MB. 
By using config_nonlinear, the kernel can directly address all of that memory, giving you the full 800MB or so to work with (leaving out the kmap regions etc) as zone_normal. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-05 23:54 ` Daniel Phillips @ 2002-05-06 0:28 ` Andrea Arcangeli 2002-05-06 0:34 ` Daniel Phillips 2002-05-06 0:55 ` Russell King 2002-05-06 8:54 ` Roman Zippel 2 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-06 0:28 UTC (permalink / raw) To: Daniel Phillips; +Cc: Martin J. Bligh, linux-kernel On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote: > On Friday 03 May 2002 07:15, Andrea Arcangeli wrote: > > On Thu, May 02, 2002 at 09:08:18PM +0200, Daniel Phillips wrote: > > > On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote: > > > > > > > > correct. The direct mapping is nothing magic, it's like a big static > > > > kmap area. Everybody is required to use > > > > virt_to_page/page_address/pci_map_single/... to switch between virtual > > > > address and mem_map anyways (thanks to the discontigous mem_map), so you > > > > can use this property by making discontigous the virtual space as well, > > > > not only the mem_map. discontigmem basically just allows that. > > > > > > And what if you don't have enough virtual space to fit all the memory you > > > > ZONE_NORMAL is by definition limited by the direct mapping size, so if > > you don't have enough virtual space you cannot enlarge the zone_normal > > anyways. If need more virtual space you can only do things like > > CONFIG_2G. > > I must be guilty of not explaining clearly. Suppose you have the following > physical memory map: > > 0: 128 MB > 8000,0000: 128 MB > 1,0000,0000: 128 MB > 1,8000,0000: 128 MB > 2,0000,0000: 128 MB > 2,8000,0000: 128 MB > 3,0000,0000: 128 MB > 3,8000,0000: 128 MB > > The total is 1 GB of installed ram. Yet the kernel's 1G virtual space, > can only handle 128 MB of it. The rest falls out of the addressable range and > has to be handled as highmem, that is if you preserve the linear relationship > between kernel virtual memory and physical memory, as config_discontigmem does. 
> Even if you go to 2G of kernel memory (restricting user space to 2G of virtual) > you can only handle 256 MB. > > By using config_nonlinear, the kernel can directly address all of that memory, > giving you the full 800MB or so to work with (leaving out the kmap regions etc) > as zone_normal. If those different 128M chunks aren't in different numa nodes, that's broken hardware that can be worked around just fine with discontigmem. If, as expected, they are placed on different numa nodes (indeed similar to numa-q), then they must go into pgdats regardless, so nonlinear or not cannot make a difference with numa. Either way (whether it's broken hardware workable around with discontigmem, or a proper numa architecture) there will be no problem at all in coalescing the blocks below 4G into ZONE_NORMAL (and for the blocks above 4G nonlinear can do nothing). nonlinear is only needed with origin2k (and possibly iseries if the partitioning is extremely inefficient), where discontigmem with hundreds or thousands of pgdats would not be capable of working around the hardware's weird physical memory layout because it would perform too poorly. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-06 0:28 ` Andrea Arcangeli @ 2002-05-06 0:34 ` Daniel Phillips 2002-05-06 1:01 ` Andrea Arcangeli 0 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-06 0:34 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, linux-kernel On Monday 06 May 2002 02:28, Andrea Arcangeli wrote: > On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote: > > On Friday 03 May 2002 07:15, Andrea Arcangeli wrote: > > > On Thu, May 02, 2002 at 09:08:18PM +0200, Daniel Phillips wrote: > > > > On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote: > > > > > > > > > > correct. The direct mapping is nothing magic, it's like a big static > > > > > kmap area. Everybody is required to use > > > > > virt_to_page/page_address/pci_map_single/... to switch between virtual > > > > > address and mem_map anyways (thanks to the discontigous mem_map), so you > > > > > can use this property by making discontigous the virtual space as well, > > > > > not only the mem_map. discontigmem basically just allows that. > > > > > > > > And what if you don't have enough virtual space to fit all the memory you > > > > > > ZONE_NORMAL is by definition limited by the direct mapping size, so if > > > you don't have enough virtual space you cannot enlarge the zone_normal > > > anyways. If need more virtual space you can only do things like > > > CONFIG_2G. > > > > I must be guilty of not explaining clearly. Suppose you have the following > > physical memory map: > > > > 0: 128 MB > > 8000,0000: 128 MB > > 1,0000,0000: 128 MB > > 1,8000,0000: 128 MB > > 2,0000,0000: 128 MB > > 2,8000,0000: 128 MB > > 3,0000,0000: 128 MB > > 3,8000,0000: 128 MB > > > > The total is 1 GB of installed ram. Yet the kernel's 1G virtual space, > > can only handle 128 MB of it. 
The rest falls out of the addressable range and > > has to be handled as highmem, that is if you preserve the linear relationship > > between kernel virtual memory and physical memory, as config_discontigmem does. > > Even if you go to 2G of kernel memory (restricting user space to 2G of virtual) > > you can only handle 256 MB. > > > > By using config_nonlinear, the kernel can directly address all of that memory, > > giving you the full 800MB or so to work with (leaving out the kmap regions etc) > > as zone_normal. > > If those different 128M chunks aren't in different numa nodes that's > broken hardware that can be workarounded just fine with discontigmem. It's real hardware - broken operating system. And no, it's not numa. Could you please explain how to work around it with discontigmem? > If > as expected they are (indeed similar to numa-q) placed on different numa > nodes, then they must go into pgdat regardless, so nonlinear or not > cannot make difference with numa. Either ways (both if it's broken > hardware workaroundable with discontigmem, or proper numa architecture) > there will be no problem at all in coalescing the blocks below 4G into > ZONE_NORMAL (and for the blocks above 4G nonlinaer can do nothing). Why can config_nonlinear do nothing with blocks above 4G physical? -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-06 0:34 ` Daniel Phillips @ 2002-05-06 1:01 ` Andrea Arcangeli 0 siblings, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-06 1:01 UTC (permalink / raw) To: Daniel Phillips; +Cc: Martin J. Bligh, linux-kernel On Mon, May 06, 2002 at 02:34:49AM +0200, Daniel Phillips wrote: > On Monday 06 May 2002 02:28, Andrea Arcangeli wrote: > > On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote: > > > On Friday 03 May 2002 07:15, Andrea Arcangeli wrote: > > > > On Thu, May 02, 2002 at 09:08:18PM +0200, Daniel Phillips wrote: > > > > > On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote: > > > > > > > > > > > > correct. The direct mapping is nothing magic, it's like a big static > > > > > > kmap area. Everybody is required to use > > > > > > virt_to_page/page_address/pci_map_single/... to switch between virtual > > > > > > address and mem_map anyways (thanks to the discontigous mem_map), so you > > > > > > can use this property by making discontigous the virtual space as well, > > > > > > not only the mem_map. discontigmem basically just allows that. > > > > > > > > > > And what if you don't have enough virtual space to fit all the memory you > > > > > > > > ZONE_NORMAL is by definition limited by the direct mapping size, so if > > > > you don't have enough virtual space you cannot enlarge the zone_normal > > > > anyways. If need more virtual space you can only do things like > > > > CONFIG_2G. > > > > > > I must be guilty of not explaining clearly. Suppose you have the following > > > physical memory map: > > > > > > 0: 128 MB > > > 8000,0000: 128 MB > > > 1,0000,0000: 128 MB > > > 1,8000,0000: 128 MB > > > 2,0000,0000: 128 MB > > > 2,8000,0000: 128 MB > > > 3,0000,0000: 128 MB > > > 3,8000,0000: 128 MB > > > > > > The total is 1 GB of installed ram. Yet the kernel's 1G virtual space, > > > can only handle 128 MB of it. 
The rest falls out of the addressable range and > > > has to be handled as highmem, that is if you preserve the linear relationship > > > between kernel virtual memory and physical memory, as config_discontigmem does. > > > Even if you go to 2G of kernel memory (restricting user space to 2G of virtual) > > > you can only handle 256 MB. > > > > > > By using config_nonlinear, the kernel can directly address all of that memory, > > > giving you the full 800MB or so to work with (leaving out the kmap regions etc) > > > as zone_normal. > > > > If those different 128M chunks aren't in different numa nodes that's > > broken hardware that can be workarounded just fine with discontigmem. > > It's real hardware - broken operating system. And no, it's not numa. The operating system can work around such a weird memory layout just fine with discontigmem; there is no problem making such hardware work. > Could you please explain how to work around it with discontigmem? Are you serious? That's what ARM has been doing for ages in 2.4; I think this part was obvious from the whole previous discussion. Just put each discontiguous chunk into a separate pgdat and it will work flawlessly (also make sure to apply all pending vm/numa fixes in -aa first, which are needed for numa anyway). They will all be normal zones provided you implement a static view of them in the kernel virtual address space, and you also cover page_address/virt_to_page/pci_map* of course. Yes, nonlinear would be just a bit faster than discontigmem in the above scenario (it's non-numa, so you are not forced to describe the discontigmem topology to common code, and that would save a bit of runtime), but avoiding nonlinear also leaves the common code quite a bit simpler without adding further mm abstractions. With hundreds of pgdats the "discontigmem workaround" becomes prohibitive, and so nonlinear becomes mandatory in a scenario like origin2k. 
But in the above scenario, "nonlinear" would be just a minor optimization that also leads to additional common code complexity. > > If > > as expected they are (indeed similar to numa-q) placed on different numa > > nodes, then they must go into pgdat regardless, so nonlinear or not > > cannot make a difference with numa. Either way (whether it's broken > > hardware workable around with discontigmem, or a proper numa architecture) > > there will be no problem at all in coalescing the blocks below 4G into > > ZONE_NORMAL (and for the blocks above 4G nonlinear can do nothing). > > Why can config_nonlinear do nothing with blocks above 4G physical? Just to be sure it's clear, with "do nothing", I mean it cannot put them into zone_normal anyway. Putting the whole thing into zone_normal was the whole point of your previous email: "By using config_nonlinear, the kernel can directly address all of that memory...giving you the full 800MB...as zone_normal". I think I just told you why: grep for vmalloc32 and see why it doesn't pass the __GFP_HIGHMEM flag to GFP. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-05 23:54 ` Daniel Phillips 2002-05-06 0:28 ` Andrea Arcangeli @ 2002-05-06 0:55 ` Russell King 2002-05-06 1:07 ` Daniel Phillips ` (3 more replies) 2002-05-06 8:54 ` Roman Zippel 2 siblings, 4 replies; 152+ messages in thread From: Russell King @ 2002-05-06 0:55 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote: > I must be guilty of not explaining clearly. Suppose you have the following > physical memory map: > > 0: 128 MB > 8000,0000: 128 MB > 1,0000,0000: 128 MB > 1,8000,0000: 128 MB > 2,0000,0000: 128 MB > 2,8000,0000: 128 MB > 3,0000,0000: 128 MB > 3,8000,0000: 128 MB > > The total is 1 GB of installed ram. Yet the kernel's 1G virtual space, > can only handle 128 MB of it. I see no problem with the above with the existing discontigmem stuff. discontigmem does *not* require a linear relationship between kernel virtual and physical memory. I've been running kernels for a while on such systems. Which was the reason for my comment at the start of this thread: | On ARM, however, we have cherry to add here. __va() may alias certain | physical memory addresses to the same virtual memory address, which | makes: -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-06 0:55 ` Russell King @ 2002-05-06 1:07 ` Daniel Phillips 2002-05-06 1:20 ` Andrea Arcangeli 2002-05-06 1:09 ` Andrea Arcangeli ` (2 subsequent siblings) 3 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-06 1:07 UTC (permalink / raw) To: Russell King; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel On Monday 06 May 2002 02:55, Russell King wrote: > On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote: > > I must be guilty of not explaining clearly. Suppose you have the following > > physical memory map: > > > > 0: 128 MB > > 8000,0000: 128 MB > > 1,0000,0000: 128 MB > > 1,8000,0000: 128 MB > > 2,0000,0000: 128 MB > > 2,8000,0000: 128 MB > > 3,0000,0000: 128 MB > > 3,8000,0000: 128 MB > > > > The total is 1 GB of installed ram. Yet the kernel's 1G virtual space, > > can only handle 128 MB of it. > > I see no problem with the above with the existing discontigmem stuff. > discontigmem does *not* require a linear relationship between kernel > virtual and physical memory. I've been running kernels for a while > on such systems. Look, you've got this: #define __phys_to_virt(ppage) ((unsigned long)(ppage) + PAGE_OFFSET - PHYS_OFFSET) So, since __phys_to_virt (and hence phys_to_virt and __va) is clearly linear, the relation __pa(__va(kva)) == kva cannot hold. Perhaps that doesn't bother you? -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-06 1:07 ` Daniel Phillips @ 2002-05-06 1:20 ` Andrea Arcangeli 2002-05-06 1:24 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-06 1:20 UTC (permalink / raw) To: Daniel Phillips; +Cc: Russell King, Martin J. Bligh, linux-kernel On Mon, May 06, 2002 at 03:07:07AM +0200, Daniel Phillips wrote: > On Monday 06 May 2002 02:55, Russell King wrote: > > On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote: > > > I must be guilty of not explaining clearly. Suppose you have the following > > > physical memory map: > > > > > > 0: 128 MB > > > 8000,0000: 128 MB > > > 1,0000,0000: 128 MB > > > 1,8000,0000: 128 MB > > > 2,0000,0000: 128 MB > > > 2,8000,0000: 128 MB > > > 3,0000,0000: 128 MB > > > 3,8000,0000: 128 MB > > > > > > The total is 1 GB of installed ram. Yet the kernel's 1G virtual space, > > > can only handle 128 MB of it. > > > > I see no problem with the above with the existing discontigmem stuff. > > discontigmem does *not* require a linear relationship between kernel > > virtual and physical memory. I've been running kernels for a while > > on such systems. > > Look, you've got this: > > #define __phys_to_virt(ppage) ((unsigned long)(ppage) + PAGE_OFFSET - PHYS_OFFSET) > > So, since __phys_to_virt (and hence phys_to_virt and __va) is clearly linear, the > relation __pa(__va(kva)) == kva cannot hold. Perhaps that doesn't bother you? Check my previous email: [..] They will all be normal zones provided you implement a static view of them in the kernel virtual address space, and you also cover page_address/virt_to_page [..] Depending on the kind of coalescing of those chunks in the direct mapping virt_to_page/page_address will vary. virt_to_page and page_address will have all the necessary internal knowledge in order to make it all zone_normal. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-06 1:20 ` Andrea Arcangeli @ 2002-05-06 1:24 ` Daniel Phillips 2002-05-06 1:42 ` Andrea Arcangeli 0 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-06 1:24 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Russell King, Martin J. Bligh, linux-kernel On Monday 06 May 2002 03:20, Andrea Arcangeli wrote: > On Mon, May 06, 2002 at 03:07:07AM +0200, Daniel Phillips wrote: > > On Monday 06 May 2002 02:55, Russell King wrote: > > So, since __phys_to_virt (and hence phys_to_virt and __va) is clearly linear, the > > relation __pa(__va(kva)) == kva cannot hold. Perhaps that doesn't bother you? > > Check my previous email: > > [..] They will all be normal zones provided you implement a static > view of them in the kernel virtual address space, and you also > cover page_address/virt_to_page [..] > > Depending on the kind of coalescing of those chunks in the direct > mapping virt_to_page/page_address will vary. virt_to_page and > page_address will have all the necessary internal knowledge in order to > make it all zone_normal. What do you mean by 'implement a static view of them'? -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-06 1:24 ` Daniel Phillips @ 2002-05-06 1:42 ` Andrea Arcangeli 2002-05-06 1:48 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-06 1:42 UTC (permalink / raw) To: Daniel Phillips; +Cc: Russell King, Martin J. Bligh, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1071 bytes --] On Mon, May 06, 2002 at 03:24:58AM +0200, Daniel Phillips wrote: > On Monday 06 May 2002 03:20, Andrea Arcangeli wrote: > > On Mon, May 06, 2002 at 03:07:07AM +0200, Daniel Phillips wrote: > > > On Monday 06 May 2002 02:55, Russell King wrote: > > > So, since __phys_to_virt (and hence phys_to_virt and __va) is clearly linear, the > > > relation __pa(__va(kva)) == kva cannot hold. Perhaps that doesn't bother you? > > > > Check my previous email: > > > > [..] They will all be normal zones provided you implement a static > > view of them in the kernel virtual address space, and you also > > cover page_address/virt_to_page [..] > > > > Depending on the kind of coalescing of those chunks in the direct > > mapping virt_to_page/page_address will vary. virt_to_page and > > page_address will have all the necessary internal knowledge in order to > > make it all zone_normal. > > What do you mean by 'implement a static view of them'? See the attached email. assuming chunks of 256M ram every 1G, 1G phys goes at 3G+256M virt, 2G goes at 3G+512M etc... Andrea [-- Attachment #2: Type: message/rfc822, Size: 1991 bytes --] From: Andrea Arcangeli <andrea@suse.de> To: Daniel Phillips <phillips@bonn-fries.net> Cc: "Martin J. Bligh" <Martin.Bligh@us.ibm.com>, Russell King <rmk@arm.linux.org.uk>, linux-kernel@vger.kernel.org Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 
Date: Thu, 2 May 2002 18:06:32 +0200 Message-ID: <20020502180632.I11414@dualathlon.random> On Wed, May 01, 2002 at 05:42:40PM +0200, Daniel Phillips wrote: > On Thursday 02 May 2002 17:35, Andrea Arcangeli wrote: > > On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote: > > > At the moment I use the contig memory model (so we only use discontig for > > > NUMA support) but I may need to change that in the future. > > > > I wasn't thinking at numa-q, but regardless numa-Q fits perfectly into > > the current discontigmem-numa model too as far I can see. > > No it doesn't. The config_discontigmem model forces all zone_normal memory > to be on node zero, so all the remaining nodes can only have highmem locally. You can trivially map the phys mem between 1G and 1G+256M to be in a direct mapping between 3G+256M and 3G+512M, then you can put such 256M at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too. The constraints you have on the normal memory are only two: 1) direct mapping 2) DMA so as far as the ram is capable of 32bit DMA with pci32 and it's mapped in the direct mapping you can put it into the normal zone. There is no difference at all between discontimem or nonlinear in this sense. > Even with good cache hardware, this has to hurt. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-06 1:42 ` Andrea Arcangeli @ 2002-05-06 1:48 ` Daniel Phillips 2002-05-06 2:06 ` Andrea Arcangeli 0 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-06 1:48 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Russell King, Martin J. Bligh, linux-kernel On Monday 06 May 2002 03:42, Andrea Arcangeli wrote: > On Mon, May 06, 2002 at 03:24:58AM +0200, Daniel Phillips wrote: > > What do you mean by 'implement a static view of them'? > > See the attached email. assuming chunks of 256M ram every 1G, 1G phys > goes at 3G+256M virt, 2G goes at 3G+512M etc... So, __va(0x40000000) = 0xc0000000, and __va(0x80000000) = 0, i.e., not a kernel address at all, because with config_discontigmem __va is a simple linear relation. What do you do about that? -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-06 1:48 ` Daniel Phillips @ 2002-05-06 2:06 ` Andrea Arcangeli 2002-05-06 17:40 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-06 2:06 UTC (permalink / raw) To: Daniel Phillips; +Cc: Russell King, Martin J. Bligh, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1333 bytes --] On Mon, May 06, 2002 at 03:48:30AM +0200, Daniel Phillips wrote: > On Monday 06 May 2002 03:42, Andrea Arcangeli wrote: > > On Mon, May 06, 2002 at 03:24:58AM +0200, Daniel Phillips wrote: > > > What do you mean by 'implement a static view of them'? > > > > See the attached email. assuming chunks of 256M ram every 1G, 1G phys > > goes at 3G+256M virt, 2G goes at 3G+512M etc... > > So, __va(0x40000000) = 0xc0000000, and __va(0x80000000) = 0, i.e., not a kernel I said page_address(), not necessarily __va. Assume the arch specifies WANT_PAGE_VIRTUAL, because such a page_address wouldn't be that cheap anyway; see my discussion with William for reference. > address at all, because with config_discontigmem __va is a simple linear > relation. What do you do about that? You can implement __va as you want, it doesn't need to be a simple linear relation (see also the attached email from Roman), but regardless, what really matters is page_address and virt_to_page, not only __va. Just initialize page->virtual to the static kernel window at boot time using the proper virtual address and you won't run into __va (or let the arch code specify page_address if CONFIG_DISCONTIGMEM is defined; this would require a two-liner in mm.h). This was also discussed just a few days ago with William; see the other attached email. Andrea [-- Attachment #2: Type: message/rfc822, Size: 5614 bytes --] From: Andrea Arcangeli <andrea@suse.de> To: William Lee Irwin III <wli@holomorphy.com>, "Martin J. 
Bligh" <Martin.Bligh@us.ibm.com>, Daniel Phillips <phillips@bonn-fries.net>, Russell King <rmk@arm.linux.org.uk>, linux-kernel@vger.kernel.org Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Date: Thu, 2 May 2002 20:41:36 +0200 Message-ID: <20020502204136.M11414@dualathlon.random> On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote: > On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote: > >> Even with 64 bit DMA, the real problem is breaking the assumption > >> that mem between 0 and 896Mb phys maps 1-1 onto kernel space. > >> That's 90% of the difficulty of what Dan's doing anyway, as I > >> see it. > > On Thu, May 02, 2002 at 06:40:37PM +0200, Andrea Arcangeli wrote: > > control on virt_to_page, pci_map_single, __va. Actually it may be as > > well cleaner to just let the arch define page_address() when > > discontigmem is enabled (instead of hacking on top of __va), that's a > > few liner. (the only true limit you have is on the phys ram above 4G, > > that cannot definitely go into zone-normal regardless if it belongs to a > > direct mapping or not because of pci32 API) > > Andrea > > Being unable to have any ZONE_NORMAL above 4GB allows no change at all. No change if your first node maps the whole first 4G of physical address space, but in that case nonlinear cannot help you in any way anyway. The fact that you can make no change at all is only due to the fact that GFP_KERNEL must return memory accessible from a pci32 device. I think most configurations have more than one node mapped into the first 4G, and so in those configurations you can do changes and spread the direct mapping across all the nodes mapped in the first 4G phys. Whether you can or can't change something has nothing to do with discontigmem or nonlinear; it's all about pci32. > 32-bit PCI is not used on NUMA-Q AFAIK. But you can plug 32-bit pci hardware into your 64bit-pci slots, right? 
If not, and if you're also sure the linux drivers for your hardware are all 64bit-pci capable, then you can do the changes regardless of the 4G limit; in that case you can spread the direct mapping over the whole 64G of physical ram, wherever you want, with no 4G constraint anymore. > > So long as zones are physically contiguous and __va() does what its Zones remain physically contiguous; it's the virtual address returned by page_address that changes. Also the kmap header will need some modification: you should always check for PageHIGHMEM in all places to know if you must kmap or not; that's a few-liner. > name implies, page_address() should operate properly aside from the > sizeof(phys_addr) > sizeof(unsigned long) overflow issue (which I > believe was recently resolved; if not I will do so myself shortly). > With SGI's discontigmem, one would need an UNMAP_NR_DENSE() as the > position in mem_map array does not describe the offset into the region > of physical memory occupied by the zone. UNMAP_NR_DENSE() may be > expensive enough architectures using MAP_NR_DENSE() may be better off > using ARCH_WANT_VIRTUAL, as page_address() is a common operation. If Yes, as an alternative to moving page_address to the arch code, you can set WANT_PAGE_VIRTUAL since, as you say, such a function is going to be more expensive (if it's only a few instructions you can instead consider moving page_address into the arch code as said in the previous email, instead of hacking on __va). > space conservation is as important a consideration for stability as it > is on architectures with severely limited kernel virtual address spaces, > it may be preferable to implement such despite the computational expense. > iSeries will likely have physically discontiguous zones and so it won't > be able to use an address calculation based page_address() either. 
If you need to support a huge number of discontiguous zones then I'm the first to agree you want nonlinear instead of discontigmem. I wasn't aware that hardware which normally needs to support hundreds or thousands of discontiguous zones exists; for it discontigmem is prohibitive due to the O(N) complexity of some code paths. That's not the case for NUMA-Q though, which also needs the different pgdat structures for the numa optimizations anyway (and to me a physical memory partitioned into hundreds of discontiguous zones still looks like a hard disk partitioned into hundreds of different blkdevs). BTW, about the pgdat loop optimizations, you misunderstood what I meant in a previous email: with "removing them" I didn't mean to remove them in the discontigmem case, that would have to be done case by case; with removing them I meant to remove them only for mainline 2.4.19-pre7 when the kernel is compiled for the x86 target, as 99% of the userbase uses it. A discontigmem using nonlinear also doesn't need to loop. It's a one-branch-removal optimization (it doesn't decrease the complexity of the algorithm with discontigmem enabled). It's all a function of #ifndef CONFIG_DISCONTIGMEM. Dropping the loop when discontigmem is enabled is a much more interesting optimization of course. Andrea [-- Attachment #3: Type: message/rfc822, Size: 2453 bytes --] From: Roman Zippel <zippel@linux-m68k.org> To: Daniel Phillips <phillips@bonn-fries.net> Cc: Andrea Arcangeli <andrea@suse.de>, Ralf Baechle <ralf@uni-koblenz.de>, Russell King <rmk@arm.linux.org.uk>, linux-kernel@vger.kernel.org Subject: Re: discontiguous memory platforms Date: Thu, 02 May 2002 21:40:48 +0200 Message-ID: <3CD19640.3B85BF76@linux-m68k.org> Daniel Phillips wrote: > Patching the kernel how, and where? Check for example in asm-ppc/page.h the __va/__pa functions. > > Anyway, I agree with Andrea, that another mapping isn't really needed. 
> > We *are* making clever use of the mmu in config_nonlinear, it is doing
> > the nonlinear kernel virtual mapping for us. Did you have something
> > more clever in mind?

I mean to map the memory where you need it. The physical<->virtual
mapping won't be one to one, but you won't need another abstraction and
the current vm is already basically able to handle it.

bye, Roman

^ permalink raw reply	[flat|nested] 152+ messages in thread
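Roman's idea can be sketched in userland C: pack discontiguous physical
banks into one contiguous kernel virtual window through a small bank
table. The bank layout, bank size, and function names below are invented
for illustration; a real kernel would do this with macros in the arch's
page.h.

```c
#include <assert.h>

#define PAGE_OFFSET 0xC0000000UL
#define BANK_SIZE   0x08000000UL           /* 128 MB per bank (invented) */
#define NR_BANKS    4

/* Physical base of each bank, in the order it is mapped virtually.
 * Note the bases need not fit any linear formula. */
static const unsigned long long bank_phys[NR_BANKS] = {
    0x000000000ULL, 0x080000000ULL, 0x100000000ULL, 0x180000000ULL
};

/* virt -> phys: index the bank table by position in the virtual window */
static unsigned long long nl_virt_to_phys(unsigned long v)
{
    unsigned long off = v - PAGE_OFFSET;
    return bank_phys[off / BANK_SIZE] + off % BANK_SIZE;
}

/* phys -> virt: linear search over the (small) bank table */
static unsigned long nl_phys_to_virt(unsigned long long p)
{
    int i;
    for (i = 0; i < NR_BANKS; i++)
        if (p >= bank_phys[i] && p < bank_phys[i] + BANK_SIZE)
            return PAGE_OFFSET + i * BANK_SIZE +
                   (unsigned long)(p - bank_phys[i]);
    return 0;  /* hole: not RAM */
}
```

The phys->virt direction is a search rather than arithmetic, which is
exactly the cost discussed later in the thread.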
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  2:06 ` Andrea Arcangeli
@ 2002-05-06 17:40 ` Daniel Phillips
  2002-05-06 19:09 ` Martin J. Bligh
  0 siblings, 1 reply; 152+ messages in thread
From: Daniel Phillips @ 2002-05-06 17:40 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Martin J. Bligh, linux-kernel

This thread is already long enough, I propose that after your response to
this we take it private. The executive summary of this post is: "show me
the code".

On Monday 06 May 2002 04:06, Andrea Arcangeli wrote:
> You can implement __va as you want, it doesn't need to be a simple
> linear relation (see also the attached email from Roman),

Here's the relevant comment from Roman:

> I mean to map the memory where you need it. The physical<->virtual
> mapping won't be one to one, but you won't need another abstraction and
> the current vm is already basically able to handle it.
>
> bye, Roman

Roman is talking about an implementation idea that so far hasn't been
presented in the form of working code. I have already implemented __va as
I want; it works, it's efficient, it's simple, clean, powerful and
extensible. If Roman has an alternative, I'd be interested in looking at
the patch.

> but regardless
> what matters really is page_address and virt_to_page, not only __va,
> just initialize page->virtual to the static kernel window at boot time

OK, so you want to tie things to page->virtual. It's an interesting
proposition, I'd like to see your code. Keep in mind that your new use of
page->virtual conflicts with the current move to get rid of it from
mainline, except for highmem use.

I also have doubts about the efficiency and cleanliness of your proposal.
Your __pa and __va are going to get more expensive because they now have
to work through the struct page, requiring multiplies as well as lookups.
I think you'll end up with something more complex and less efficient than
config_nonlinear - please prove me wrong by showing me the code.
You also need some sort of structure that tells you how to set up your
static mapping in the kernel. I already have that; you still need to
describe it. In fact, config_nonlinear's way of doing the mem_map
initialization required no changes at all to the mem_map initialization
code. Such results tend to suggest a particular design approach is indeed
correct.

Now, it would be interesting to see exactly what changes are required to
config_nonlinear to allow it to cover numa usage as well as non-numa
usage. As far as I can see, I simply have to elaborate my mapping between
pagenum and struct page, i.e., I have to do what's necessary to put the
mem_map structure into the local node. I believe that's possible without
requiring any double table lookups.

Note that for NUMA-Q, the ->lmem_map arrays are currently off-node for
all but node zero, so the per-node ->lmem_map is doing nothing for NUMA-Q
at the moment. In order for this to make sense for NUMA-Q, I really do
have to provide a local mapping of a portion of zone_numa, otherwise we
might as well just use config_nonlinear in its current form.

-- 
Daniel

^ permalink raw reply	[flat|nested] 152+ messages in thread
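The cost Daniel ascribes to routing __va through the struct page can be
modelled in a few lines of userland C. The struct layout, names, and the
single-bank assumption are all invented; the point is only that the
translation becomes a table lookup plus offset arithmetic instead of one
addition.

```c
#include <assert.h>

#define PAGE_SHIFT   12
#define PAGE_SIZE    (1UL << PAGE_SHIFT)
#define PAGE_OFFSET  0xC0000000UL
#define NR_PAGES     16

/* Toy "struct page" carrying the cached virtual address Andrea refers
 * to (page->virtual); everything else about the layout is invented. */
struct page {
    void *virtual;
};

static struct page mem_map[NR_PAGES];
static unsigned long node_phys_base;   /* physical base of this bank */

/* Boot-time step: point each page at its slot in the static window. */
static void init_page_virtual(void)
{
    int i;
    for (i = 0; i < NR_PAGES; i++)
        mem_map[i].virtual =
            (void *)(PAGE_OFFSET + ((unsigned long)i << PAGE_SHIFT));
}

/* __va routed through the struct page: an array index (a lookup) plus
 * offset arithmetic, rather than a single add. */
static void *model_va(unsigned long phys)
{
    unsigned long idx = (phys - node_phys_base) >> PAGE_SHIFT;
    return (char *)mem_map[idx].virtual + (phys & (PAGE_SIZE - 1));
}
```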
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-06 17:40 ` Daniel Phillips @ 2002-05-06 19:09 ` Martin J. Bligh 0 siblings, 0 replies; 152+ messages in thread From: Martin J. Bligh @ 2002-05-06 19:09 UTC (permalink / raw) To: Daniel Phillips, Andrea Arcangeli; +Cc: linux-kernel > Note that for NUMA-Q, the ->lmem_map arrays are currently off-node for > all but node zero, so the per-node ->lmem_map is doing nothing for > NUMA-Q at the moment. In order for this to make sense for NUMA-Q, I > really do have to provide a local mapping of a portion of zone_numa, > otherwise we might as well just use config_nonlinear in its current > form. To split hairs, they're not currently off node - as they have to reside in ZONE_NORMAL, I can't make them so until I have the nonlinear stuff (or equivalent). But they ought to be on their home node, so your point is pretty much the same ;-) AFAIK, all other NUMA arches use the local lmem_map already. Is zone_numa a typo for zone_normal, or did I lose track of the conversation at some point? I'm not sure I grok the last sentence of yours .... M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-06 0:55 ` Russell King 2002-05-06 1:07 ` Daniel Phillips @ 2002-05-06 1:09 ` Andrea Arcangeli 2002-05-06 1:13 ` Daniel Phillips 2002-05-06 2:03 ` Daniel Phillips 3 siblings, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-06 1:09 UTC (permalink / raw) To: Russell King; +Cc: Daniel Phillips, Martin J. Bligh, linux-kernel On Mon, May 06, 2002 at 01:55:05AM +0100, Russell King wrote: > I see no problem with the above with the existing discontigmem stuff. > discontigmem does *not* require a linear relationship between kernel > virtual and physical memory. I've been running kernels for a while > on such systems. Indeed. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  0:55 ` Russell King
  2002-05-06  1:07 ` Daniel Phillips
  2002-05-06  1:09 ` Andrea Arcangeli
@ 2002-05-06  1:13 ` Daniel Phillips
  2002-05-06  2:03 ` Daniel Phillips
  3 siblings, 0 replies; 152+ messages in thread
From: Daniel Phillips @ 2002-05-06 1:13 UTC (permalink / raw)
To: Russell King; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 02:55, Russell King wrote:
> On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> > I must be guilty of not explaining clearly. Suppose you have the
> > following physical memory map:
> >
> > 0: 128 MB
> > 8000,0000: 128 MB
> > 1,0000,0000: 128 MB
> > 1,8000,0000: 128 MB
> > 2,0000,0000: 128 MB
> > 2,8000,0000: 128 MB
> > 3,0000,0000: 128 MB
> > 3,8000,0000: 128 MB
> >
> > The total is 1 GB of installed ram. Yet the kernel's 1G virtual space,
> > can only handle 128 MB of it.
> >...
> I've been running kernels for a while on such systems.

Could you provide me with an example memory map, please?

-- 
Daniel

^ permalink raw reply	[flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  0:55 ` Russell King
  ` (2 preceding siblings ...)
  2002-05-06  1:13 ` Daniel Phillips
@ 2002-05-06  2:03 ` Daniel Phillips
  2002-05-06  2:31 ` Andrea Arcangeli
  2002-05-06  8:57 ` Russell King
  3 siblings, 2 replies; 152+ messages in thread
From: Daniel Phillips @ 2002-05-06 2:03 UTC (permalink / raw)
To: Russell King; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 02:55, Russell King wrote:
> On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> > I must be guilty of not explaining clearly. Suppose you have the
> > following physical memory map:
> >
> > 0: 128 MB
> > 8000,0000: 128 MB
> > 1,0000,0000: 128 MB
> > 1,8000,0000: 128 MB
> > 2,0000,0000: 128 MB
> > 2,8000,0000: 128 MB
> > 3,0000,0000: 128 MB
> > 3,8000,0000: 128 MB
> >
> > The total is 1 GB of installed ram. Yet the kernel's 1G virtual space,
> > can only handle 128 MB of it.
>
> I see no problem with the above with the existing discontigmem stuff.
> discontigmem does *not* require a linear relationship between kernel
> virtual and physical memory. I've been running kernels for a while
> on such systems.

I just went through every variant of arm in the kernel tree, and I found
that *all* of them implement a simple linear relationship between kernel
virtual and physical memory, of the form:

	#define __virt_to_phys(vpage) ((vpage) - PAGE_OFFSET + PHYS_OFFSET)
	#define __phys_to_virt(ppage) ((ppage) + PAGE_OFFSET - PHYS_OFFSET)

With such a linear mapping you *cannot* map physical memory distributed
across more than one gig into one gig of kernel virtual memory. Are you
talking about code that isn't in the tree?

-- 
Daniel

^ permalink raw reply	[flat|nested] 152+ messages in thread
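Daniel's objection can be checked numerically. Modelling the quoted
linear macros with 32-bit arithmetic (as on the hardware) and assuming a
PHYS_OFFSET of zero, the second 128 MB bank of his example map lands at a
virtual address below PAGE_OFFSET, i.e. in user space, where the kernel
direct mapping cannot reach it:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u
#define PHYS_OFFSET 0x00000000u

/* The linear mapping every in-tree ARM variant used, forced to 32-bit
 * wraparound arithmetic with uint32_t. */
static uint32_t virt_to_phys32(uint32_t v)
{
    return v - PAGE_OFFSET + PHYS_OFFSET;
}

static uint32_t phys_to_virt32(uint32_t p)
{
    return p + PAGE_OFFSET - PHYS_OFFSET;
}
```

Only physical addresses in [0, 0x40000000) map into the kernel window at
all; anything above wraps out of it, which is the 128 MB-per-gig limit
being described.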
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  2:03 ` Daniel Phillips
@ 2002-05-06  2:31 ` Andrea Arcangeli
  2002-05-06  8:57 ` Russell King
  1 sibling, 0 replies; 152+ messages in thread
From: Andrea Arcangeli @ 2002-05-06 2:31 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Russell King, Martin J. Bligh, linux-kernel

On Mon, May 06, 2002 at 04:03:15AM +0200, Daniel Phillips wrote:
> On Monday 06 May 2002 02:55, Russell King wrote:
> > On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> > > I must be guilty of not explaining clearly. Suppose you have the
> > > following physical memory map:
> > >
> > > 0: 128 MB
> > > 8000,0000: 128 MB
> > > 1,0000,0000: 128 MB
> > > 1,8000,0000: 128 MB
> > > 2,0000,0000: 128 MB
> > > 2,8000,0000: 128 MB
> > > 3,0000,0000: 128 MB
> > > 3,8000,0000: 128 MB
> > >
> > > The total is 1 GB of installed ram. Yet the kernel's 1G virtual space,
> > > can only handle 128 MB of it.
> >
> > I see no problem with the above with the existing discontigmem stuff.
> > discontigmem does *not* require a linear relationship between kernel
> > virtual and physical memory. I've been running kernels for a while
> > on such systems.
>
> I just went through every variant of arm in the kernel tree, and I found
> that *all* of them implement a simple linear relationship between kernel
> virtual and physical memory, of the form:
>
> #define __virt_to_phys(vpage) ((vpage) - PAGE_OFFSET + PHYS_OFFSET)
> #define __phys_to_virt(ppage) ((ppage) + PAGE_OFFSET - PHYS_OFFSET)

ARM is an example that the pgdat way is fine. As an example of the other
part about the zone_normal coalescing (page_address/__va/virt_to_page),
check ppc and m68k. ARM doesn't have highmem, so it's clearly not
strict in the address space in the first place (remember, it's not a
high end cpu; it pays off big time in other areas), and it couldn't take
advantage of making the kernel virtual address space not a linear
mapping of the physical address space.
Did you actually read Roman's email of a few days ago that shows you __va
being used exactly as nonlinear?

> Are you talking about code that isn't in the tree?

First of all it doesn't matter if there weren't a nonlinear __va in the
ppc and m68k trees; whether something can be done doesn't depend on
whether somebody did it before. But somebody just did it in practice too
in this case.

I have the feeling you reply too fast, ignoring previous emails, so
please try to ask strict questions about non-obvious stuff that you
disagree with in the past emails, or you'll waste resources. If you ask
stuff that has just been discussed and ignore the previous discussions
completely, I will probably not have time to answer next time (like I
have no time for IRC for similar reasons), sorry.

Andrea

^ permalink raw reply	[flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  2:03 ` Daniel Phillips
  2002-05-06  2:31 ` Andrea Arcangeli
@ 2002-05-06  8:57 ` Russell King
  1 sibling, 0 replies; 152+ messages in thread
From: Russell King @ 2002-05-06 8:57 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Mon, May 06, 2002 at 04:03:15AM +0200, Daniel Phillips wrote:
> On Monday 06 May 2002 02:55, Russell King wrote:
> > I see no problem with the above with the existing discontigmem stuff.
> > discontigmem does *not* require a linear relationship between kernel
> > virtual and physical memory. I've been running kernels for a while
> > on such systems.
>
> I just went through every variant of arm in the kernel tree, and I found
> that *all* of them implement a simple linear relationship between kernel
> virtual and physical memory, of the form:

Whoops. I didn't say _current_ kernels, did I? 8)  (Don't write mails at
2am...)

We got rid of it later as we cleaned up the kernel mappings to use
ioremap instead of static device mappings. Hence 2.4/2.5 don't contain
them any more. However, from 2.3.35:

diff -urN linux-orig/include/asm-arm/arch-sa1100/memory.h linux/include/asm-arm/arch-sa1100/memory.h
...
 /*
  * The following gives a maximum memory size of 128MB (32MB in each bank).
- *
- * Does this still need to be optimised for one bank machines?
  */
-#define __virt_to_phys(x)	(((x) & 0xe0ffffff) | ((x) & 0x06000000) << 2)
-#define __phys_to_virt(x)	(((x) & 0xe7ffffff) | ((x) & 0x30000000) >> 2)
+#define __virt_to_phys(x)	(((x) & 0xf9ffffff) | ((x) & 0x06000000) << 2)
+#define __phys_to_virt(x)	(((x) & 0xe7ffffff) | ((x) & 0x18000000) >> 2)

This type of mapping went away in 2.4.0-test9, which is after this
particular platform got discontig mem support in 2.3.99-pre2-rmk1.
An example that is right up to date, and was the subject of the first
mail, is:

+#define __virt_to_phys(vpage)	(((vpage) + ((vpage) & 0x18000000)) & \
+				 ~0x40000000)
+
+#define __phys_to_virt(ppage)	(((ppage) & ~0x30000000) | \
+				 (((ppage) & 0x30000000) >> 1) | \
+				 0x40000000)

You won't find this one in my patches nor Linus' kernel tree though.

-- 
Russell King (rmk@arm.linux.org.uk)
The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

^ permalink raw reply	[flat|nested] 152+ messages in thread
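Russell's up-to-date example round-trips cleanly, which is worth checking
since the masks are easy to misread. The bank layout implied by the
macros (four 128 MB banks at physical 0x00000000, 0x10000000, 0x20000000
and 0x30000000 folding into a contiguous virtual window starting at
0x40000000) is inferred here and may not match the real board exactly:

```c
#include <assert.h>
#include <stdint.h>

/* Russell's nonlinear mapping, verbatim from the mail, as functions. */
static uint32_t v2p(uint32_t v)
{
    return (v + (v & 0x18000000u)) & ~0x40000000u;
}

static uint32_t p2v(uint32_t p)
{
    return (p & ~0x30000000u) | ((p & 0x30000000u) >> 1) | 0x40000000u;
}
```

The >>1 in p2v squeezes the bank-select bits 28-29 down into bits 27-28,
so the four sparse banks become adjacent in the virtual window; v2p
undoes it by adding the bank bits back in.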
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-05 23:54 ` Daniel Phillips
  2002-05-06  0:28 ` Andrea Arcangeli
  2002-05-06  0:55 ` Russell King
@ 2002-05-06  8:54 ` Roman Zippel
  2002-05-06 15:26 ` Daniel Phillips
  2 siblings, 1 reply; 152+ messages in thread
From: Roman Zippel @ 2002-05-06 8:54 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Hi,

On Mon, 6 May 2002, Daniel Phillips wrote:

> I must be guilty of not explaining clearly. Suppose you have the
> following physical memory map:
>
> 0: 128 MB
> 8000,0000: 128 MB
> 1,0000,0000: 128 MB
> 1,8000,0000: 128 MB
> 2,0000,0000: 128 MB
> 2,8000,0000: 128 MB
> 3,0000,0000: 128 MB
> 3,8000,0000: 128 MB
>
> The total is 1 GB of installed ram. Yet the kernel's 1G virtual space,
> can only handle 128 MB of it. The rest falls out of the addressable
> range and has to be handled as highmem, that is if you preserve the
> linear relationship between kernel virtual memory and physical memory,
> as config_discontigmem does. Even if you go to 2G of kernel memory
> (restricting user space to 2G of virtual) you can only handle 256 MB.

Why do you want to preserve the linear relationship between virtual and
physical memory? There is little common code (and only during
initialization), which assumes a direct mapping. I can send you the
patches to fix this. Then you can map as much physical memory as you
want into a single virtual area and you only need a single pgdat.

bye, Roman

^ permalink raw reply	[flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  8:54 ` Roman Zippel
@ 2002-05-06 15:26 ` Daniel Phillips
  2002-05-06 19:07 ` Roman Zippel
  0 siblings, 1 reply; 152+ messages in thread
From: Daniel Phillips @ 2002-05-06 15:26 UTC (permalink / raw)
To: Roman Zippel; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 10:54, Roman Zippel wrote:
> Hi,
>
> On Mon, 6 May 2002, Daniel Phillips wrote:
>
> > I must be guilty of not explaining clearly. Suppose you have the
> > following physical memory map:
> >
> > 0: 128 MB
> > 8000,0000: 128 MB
> > 1,0000,0000: 128 MB
> > 1,8000,0000: 128 MB
> > 2,0000,0000: 128 MB
> > 2,8000,0000: 128 MB
> > 3,0000,0000: 128 MB
> > 3,8000,0000: 128 MB
> >
> > The total is 1 GB of installed ram. Yet the kernel's 1G virtual space,
> > can only handle 128 MB of it. The rest falls out of the addressable
> > range and has to be handled as highmem, that is if you preserve the
> > linear relationship between kernel virtual memory and physical memory,
> > as config_discontigmem does. Even if you go to 2G of kernel memory
> > (restricting user space to 2G of virtual) you can only handle 256 MB.
>
> Why do you want to preserve the linear relationship between virtual and
> physical memory?

I don't, I observed that in all known instances of config_discontigmem,
that linear relationship is preserved. Now, you and Andrea are suggesting
that no such linear relation is strictly necessary, and I believe it's
worth investigating further, to see how it would work and how it compares
to config_nonlinear.

> There is little common code (and only during
> initialization), which assumes a direct mapping. I can send you the
> patches to fix this.

I already have patches to do that, that is, config_nonlinear. I'm
interested in looking at your patches though, because we might as well
give all the different approaches a fair examination.
> Then you can map as much physical memory as you want > into a single virtual area and you only need a single pgdat. You're talking about your 68K solution with the loops that search through memory regions? If so, I've already looked at it and understand it. Or, if it's a new approach, then naturally I'd be interested. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06 15:26 ` Daniel Phillips
@ 2002-05-06 19:07 ` Roman Zippel
  2002-05-08 15:57 ` Daniel Phillips
  0 siblings, 1 reply; 152+ messages in thread
From: Roman Zippel @ 2002-05-06 19:07 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Hi,

On Mon, 6 May 2002, Daniel Phillips wrote:

> I don't, I observed that in all known instances of config_discontigmem,
> that linear relationship is preserved.

That's true, but m68k isn't using config_discontigmem. :)

> > There is little common code (and only during
> > initialization), which assumes a direct mapping. I can send you the
> > patches to fix this.
>
> I already have patches to do that, that is, config_nonlinear. I'm
> interested in looking at your patches though, because we might as well
> give all the different approaches a fair examination.

See below, the patch is almost complete:
- only the other free_area_init_core() needs to be updated
- the virt_to_page(phys_to_virt()) sequence could be replaced now with
  pfn_page()

> You're talking about your 68K solution with the loops that search
> through memory regions? If so, I've already looked at it and
> understand it.

That's just how the virtual<->physical conversion is implemented.

> Or, if it's a new approach, then naturally I'd be interested.

It's not really new, you only have to take care that you don't iterate
with the physical address over a pgdat; this is what the patch below
fixes. The rest can be hidden in the arch macros and no special config
option is needed.
bye, Roman

Index: mm/bootmem.c
===================================================================
RCS file: /home/linux-m68k/cvsroot/linux/mm/bootmem.c,v
retrieving revision 1.1.1.4
retrieving revision 1.5
diff -u -p -r1.1.1.4 -r1.5
--- mm/bootmem.c	11 Feb 2002 17:51:47 -0000	1.1.1.4
+++ mm/bootmem.c	11 Feb 2002 18:34:49 -0000	1.5
@@ -243,7 +243,7 @@ found:
 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat)
 {
-	struct page *page = pgdat->node_mem_map;
+	struct page *page;
 	bootmem_data_t *bdata = pgdat->bdata;
 	unsigned long i, count, total = 0;
 	unsigned long idx;
@@ -256,21 +256,22 @@ static unsigned long __init free_all_boo
 	map = bdata->node_bootmem_map;
 	for (i = 0; i < idx; ) {
 		unsigned long v = ~map[i / BITS_PER_LONG];
-		if (v) {
-			unsigned long m;
-			for (m = 1; m && i < idx; m<<=1, page++, i++) {
-				if (v & m) {
+		unsigned long m;
+		if (!v) {
+			i+=BITS_PER_LONG;
+			continue;
+		}
+		for (m = 1; m && i < idx; m<<=1, i++) {
+			if (!(v & m))
+				continue;
+			page = virt_to_page(phys_to_virt((i << PAGE_SHIFT) +
+							 bdata->node_boot_start));
 			count++;
 			ClearPageReserved(page);
 			set_page_count(page, 1);
 			__free_page(page);
 		}
 	}
-		} else {
-			i+=BITS_PER_LONG;
-			page+=BITS_PER_LONG;
-		}
-	}
 	total += count;
 	/*
Index: mm/page_alloc.c
===================================================================
RCS file: /home/linux-m68k/cvsroot/linux/mm/page_alloc.c,v
retrieving revision 1.1.1.14
retrieving revision 1.17
diff -u -p -r1.1.1.14 -r1.17
--- mm/page_alloc.c	6 May 2002 08:52:16 -0000	1.1.1.14
+++ mm/page_alloc.c	6 May 2002 09:11:36 -0000	1.17
@@ -796,7 +796,7 @@ static inline unsigned long wait_table_b
  *   - clear the memory bitmaps
  */
 void __init free_area_init_core(int nid, pg_data_t *pgdat, struct page **gmap,
-	unsigned long *zones_size, unsigned long zone_start_paddr,
+	unsigned long *zones_size, unsigned long zone_start_vaddr,
 	unsigned long *zholes_size, struct page *lmem_map)
 {
 	unsigned long i, j;
@@ -804,7 +804,7 @@ void __init free_area_init_core(int nid,
 	unsigned long totalpages, offset, realtotalpages;
 	const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1);
 
-	if (zone_start_paddr & ~PAGE_MASK)
+	if (zone_start_vaddr & ~PAGE_MASK)
 		BUG();
 
 	totalpages = 0;
@@ -837,7 +837,7 @@ void __init free_area_init_core(int nid,
 	}
 	*gmap = pgdat->node_mem_map = lmem_map;
 	pgdat->node_size = totalpages;
-	pgdat->node_start_paddr = zone_start_paddr;
+	pgdat->node_start_paddr = __pa(zone_start_vaddr);
 	pgdat->node_start_mapnr = (lmem_map - mem_map);
 	pgdat->nr_zones = 0;
@@ -889,9 +889,9 @@ void __init free_area_init_core(int nid,
 		zone->zone_mem_map = mem_map + offset;
 		zone->zone_start_mapnr = offset;
-		zone->zone_start_paddr = zone_start_paddr;
+		zone->zone_start_paddr = __pa(zone_start_vaddr);
 
-		if ((zone_start_paddr >> PAGE_SHIFT) & (zone_required_alignment-1))
+		if ((zone_start_vaddr >> PAGE_SHIFT) & (zone_required_alignment-1))
 			printk("BUG: wrong zone alignment, it will crash\n");
 
 		/*
@@ -906,8 +906,8 @@ void __init free_area_init_core(int nid,
 			SetPageReserved(page);
 			memlist_init(&page->list);
 			if (j != ZONE_HIGHMEM)
-				set_page_address(page, __va(zone_start_paddr));
-			zone_start_paddr += PAGE_SIZE;
+				set_page_address(page, zone_start_vaddr);
+			zone_start_vaddr += PAGE_SIZE;
 		}
 
 		offset += size;
@@ -954,7 +954,7 @@ void __init free_area_init_core(int nid,
 
 void __init free_area_init(unsigned long *zones_size)
 {
-	free_area_init_core(0, &contig_page_data, &mem_map, zones_size, 0, 0, 0);
+	free_area_init_core(0, &contig_page_data, &mem_map, zones_size, PAGE_OFFSET, 0, 0);
 }
 
 static int __init setup_mem_frac(char *str)

^ permalink raw reply	[flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06 19:07 ` Roman Zippel
@ 2002-05-08 15:57 ` Daniel Phillips
  2002-05-08 23:11 ` Roman Zippel
  0 siblings, 1 reply; 152+ messages in thread
From: Daniel Phillips @ 2002-05-08 15:57 UTC (permalink / raw)
To: Roman Zippel; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 21:07, Roman Zippel wrote:
> Hi,
>
> On Mon, 6 May 2002, Daniel Phillips wrote:
>
> > I don't, I observed that in all known instances of config_discontigmem,
> > that linear relationship is preserved.
>
> That's true, but m68k isn't using config_discontigmem. :)

Right. In fact, your two-way phys_to_virt and virt_to_phys mapping makes
it more like config_nonlinear. You don't define the contiguous logical
memory space though, and perhaps that's the reason you need the
free_area_init changes in the patch below.

Your patch preserves a linear relationship between physical and virtual
memory, because you do both the ptov and vtop lookup in the same array.
As such, you don't provide the functionality I provide of being able to
fit a large amount of physical memory into a small amount of virtual
memory, and you can't join all your separate pgdats into one, as I do.
(The latter is desirable because it allows the memory manager to allocate
from one homogeneous space, reducing the likelihood of zone balancing
problems.)

We could, if we want, implement your variable sized memory chunk system
with config_nonlinear. You'd just have to replace the
ulong psection[MAX_SECTIONS] with:

	struct {
		ulong base;
		ulong size;
	} pchunk[MAX_CHUNKS];

and replace the four direct table lookups with loops.

Highmem does not need to be a special case, by the way. Another by the
way: you've accidentally repeated the last four lines of mm_vtop.
Finally, it looks like your ZTWO_VADDR hack in mm_ptov would also cease
to be a special case; at least, the special case part would move to the
initialization instead of every __va operation.
So you would end up with *zero* special cases in the page translation functions of page.h. > ...you only have to take care, that you don't iterate > with the physical address over a pgdat, this is what the patch below > fixes, the rest can be hidden in the arch macros and no special config > options is needed. You do have a config option, it's CONFIG_SINGLE_MEMORY_CHUNK. You just didn't attempt to create the contiguous logical address space as I did, so you didn't need to go outside your arch. The generic part of config_nonlinear is tiny anyway - only 200 lines, and might grow to 400 by the time all device driver usage of __pa is reclassified as either virt_to_phys or virt_to_logical - the latter being a rather nice distinction to make, even if the mapping is the same don't you think? I.e, it's like the distinction between pointer and integer: if it's a physical address you can pass it to dma hardware, for example and if it's logical you're just using it for accounting. Whenever it's possible to elevate a per-arch feature to the generic level without compromising functionality, it should be done, modulo programmer time, and of course, assuming functionality isn't compromised. At the generic level, it's easier to document, we get cross-pollination from improvements developed on different arches, and it's easier to build on. Going the other way and allowing design features to fray across architectures takes us in the direction of unmaintainable bloat. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
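The pchunk[] variant Daniel proposes can be sketched in a few lines of
userland C. The chunk layout, the MAX_CHUNKS value, and the function
names are invented here, and a zero-based logical space stands in for the
kernel virtual window:

```c
#include <assert.h>

#define MAX_CHUNKS 4

/* Variable-sized physical chunks, mapped back-to-back into a zero-based
 * logical space.  Unused trailing entries have size 0. */
static struct {
    unsigned long base;   /* physical base of the chunk */
    unsigned long size;   /* length in bytes */
} pchunk[MAX_CHUNKS] = {
    { 0x00000000UL, 0x02000000UL },   /* 32 MB */
    { 0x08000000UL, 0x01000000UL },   /* 16 MB */
    { 0x10000000UL, 0x04000000UL },   /* 64 MB */
};

/* physical -> logical: O(number of chunks) loop instead of a table */
static long phys_to_logical(unsigned long phys)
{
    unsigned long log = 0;
    int i;
    for (i = 0; i < MAX_CHUNKS && pchunk[i].size; i++) {
        if (phys >= pchunk[i].base && phys < pchunk[i].base + pchunk[i].size)
            return (long)(log + (phys - pchunk[i].base));
        log += pchunk[i].size;
    }
    return -1;  /* hole: not RAM */
}

/* logical -> physical: walk the same table the other way */
static long logical_to_phys(unsigned long log)
{
    int i;
    for (i = 0; i < MAX_CHUNKS && pchunk[i].size; i++) {
        if (log < pchunk[i].size)
            return (long)(pchunk[i].base + log);
        log -= pchunk[i].size;
    }
    return -1;
}
```

The loops are the trade-off discussed in the thread: they handle
arbitrary chunk sizes, but the cost grows with the number of chunks,
unlike the fixed-size-section table lookup of config_nonlinear.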
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-08 15:57 ` Daniel Phillips @ 2002-05-08 23:11 ` Roman Zippel 2002-05-09 16:08 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Roman Zippel @ 2002-05-08 23:11 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel Hi, Daniel Phillips wrote: > Your patch preserves a linear relationship between physical and virtual > memory, because you do both the ptov and vtop lookup in the same array. As > such, you don't provide the functionality I provide of being able to fit a > large amount of physical memory into a small amount of virtual memory, and > you can't join all your separate pgdat's into one, as I do. Read the source again. arch/m68k/mm/motorola.c:map_chunk() maps all memory areas into single virtual area (virtaddr is static there and only increased starting from PAGE_OFFSET). In paging_init() there is only a single call to free_area_init(). > and replace the four direct table lookup with loops. The loops are only an implementation detail and can be replaced with another algorithm. > you've accidently > repeated the last four lines of mm_vtop. Finally, it looks like your > ZTWO_VADDR hack in mm_ptov would also cease to be a special case, at least, That stuff is obsolete since ages, it should be replaced with BUG(). > You do have a config option, it's CONFIG_SINGLE_MEMORY_CHUNK. That was our cheap answer to avoid the loops. > You just didn't > attempt to create the contiguous logical address space as I did, so you > didn't need to go outside your arch. I don't need that, because I create a contiguous _virtual_ address space. bye, Roman ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-08 23:11 ` Roman Zippel @ 2002-05-09 16:08 ` Daniel Phillips 2002-05-09 22:06 ` Roman Zippel 0 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-09 16:08 UTC (permalink / raw) To: Roman Zippel; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel On Thursday 09 May 2002 01:11, Roman Zippel wrote: > Hi, > > Daniel Phillips wrote: > > > Your patch preserves a linear relationship between physical and virtual > > memory, because you do both the ptov and vtop lookup in the same array. As > > such, you don't provide the functionality I provide of being able to fit a > > large amount of physical memory into a small amount of virtual memory, and > > you can't join all your separate pgdat's into one, as I do. > > Read the source again. arch/m68k/mm/motorola.c:map_chunk() maps all > memory areas into single virtual area (virtaddr is static there and only > increased starting from PAGE_OFFSET). In paging_init() there is only a > single call to free_area_init(). Oops, yes, I see how it works, it relies on your O(N) search for the inverse. (Obligatory snipe: there are almost no comments for this opaque code, I hope you share my feeling that needs fixing.) Searching the table instead of doing a direct lookup allows you to eliminate one of my two tables. This is not a property you'd want to tie yourself to though, since the cost for any large number of chunks will be excessive, and will show up in the page table manipulation overhead. Now it seems our strategies are a lot more similar than different. So what were we arguing about again? I've just gone further with the generalization of this, and cast it into a more general form suitable for use across more than one arch. 
Where you ignore the distinction between logical and physical, it costs
you execution time. For example, you wrote:

	page = virt_to_page(phys_to_virt((i << PAGE_SHIFT) +
					 bdata->node_boot_start))

where formerly we just had page++. This is in generic code too. Unless
you have an #ifdef CONFIG_SOMETHING there, I recommend this code *not*
be merged, because it penalizes the common case for the sake of your
arch. And it's unnecessary even for your arch, as I've demonstrated.

Incidentally, the reason I came up with the virtual/logical distinction
in the first place is that I found myself writing such awkward constructs
as the one above, and thought there must be a better way. Indeed there
is.

> > You do have a config option, it's CONFIG_SINGLE_MEMORY_CHUNK.
>
> That was our cheap answer to avoid the loops.

My cheap answer is to turn the option off. So why don't I need a config
option again?

> > You just didn't
> > attempt to create the contiguous logical address space as I did, so you
> > didn't need to go outside your arch.
>
> I don't need that, because I create a contiguous _virtual_ address
> space.

Again, we're arguing about what? So do I. The relationship between
virtual and logical, for me, is just logical = virtual - PAGE_OFFSET, a
meme you'll find in many places in the kernel source already, often
obscured by the impression that physical addresses are really being
manipulated when in fact nothing of the kind is going on. The simple
truth is, the arithmetic gets easier when you work zero-based instead of
PAGE_OFFSET based.

So now that we know we're both doing the same thing, could we please stop
doing the catholics vs the protestants thing and maybe cooperate?

-- 
Daniel

^ permalink raw reply	[flat|nested] 152+ messages in thread
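The two bootmem loop styles being argued over can be reduced to a
userland toy (all names invented). With a single linear mem_map they
visit exactly the same pages; the difference is purely the per-iteration
cost, which is Daniel's objection, versus the lookup style also working
for a nonlinear map, which is Roman's motivation:

```c
#include <assert.h>

#define NR_PAGES 8

struct page { int freed; };
static struct page mem_map[NR_PAGES];

static struct page *pfn_to_page(unsigned long pfn)
{
    return &mem_map[pfn];
}

/* Style 1 (original generic code): walk the map with page++. */
static int free_by_increment(void)
{
    struct page *page = mem_map;
    int i, count = 0;
    for (i = 0; i < NR_PAGES; i++, page++) {
        page->freed = 1;
        count++;
    }
    return count;
}

/* Style 2 (Roman's patch, simplified): recompute the page from the pfn
 * each iteration, which also works when the map is nonlinear. */
static int free_by_lookup(void)
{
    int i, count = 0;
    for (i = 0; i < NR_PAGES; i++) {
        pfn_to_page((unsigned long)i)->freed = 1;
        count++;
    }
    return count;
}
```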
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-09 16:08 ` Daniel Phillips @ 2002-05-09 22:06 ` Roman Zippel 2002-05-09 22:22 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Roman Zippel @ 2002-05-09 22:06 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel Hi, Daniel Phillips wrote: > Where you ignore the distinction between logical and physical, it costs you > execution time, as where you wrote page = virt_to_page(phys_to_virt((i << > PAGE_SHIFT) + bdata->node_boot_start)) where formerly we just had page++. > This in generic code too. Unless you have an #ifdef CONFIG_SOMETHING there I > recommend this code *not* be merged because it penalizes the common case for > the sake of your arch. And it's unnecessary even for your arch, as I've > demonstrated. 1. My patch only modifies init code, I don't think it's really a problem if it's slightly slower. 2. Above can now be written as "page = pfn_to_page(i + (bdata->node_boot_start >> PAGE_SHIFT))". Nice, isn't it? :) > > > You do have a config option, it's CONFIG_SINGLE_MEMORY_CHUNK. > > > > That was our cheap answer to avoid the loops. > > My cheap answer is to turn the option off. So why don't I need a config > option again? You know, what that option does? > > I don't need that, because I create a contiguous _virtual_ address > > space. > > Again, we're arguing about what? So do I. The relationship between virtual > and logical, for me, is just logical = virtual - PAGE_OFFSET, a meme you'll > find in many places in the kernel source already, often obscured by the > impression that physical addresses are really being manipulated when in fact > nothing of the kind is going on - the simple truth is, the arithmetic gets > easier then you work zero-based instead of PAGE_OFFSET based. Why do you want to introduce another abstraction? If the logical address is basically the same as the virtual address, just use the virtual address. 
What difference should that offset make? Could you show me please one single example? > So now that we know we're both doing the same thing, could we please stop > doing the catholics vs the protestants thing and maybe cooperate? I'm an atheist. >:-) bye, Roman ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-09 22:06 ` Roman Zippel @ 2002-05-09 22:22 ` Daniel Phillips 2002-05-09 23:00 ` Roman Zippel 0 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-09 22:22 UTC (permalink / raw) To: Roman Zippel; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel On Friday 10 May 2002 00:06, Roman Zippel wrote: > 1. My patch only modifies init code, I don't think it's really a problem > if it's slightly slower. But why be slower when we don't have to? And why slow down *all* architectures? > 2. Above can now be written as "page = pfn_to_page(i + > (bdata->node_boot_start >> PAGE_SHIFT))". Nice, isn't it? :) page++ is nicer yet. > > > I don't need that, because I create a contiguous _virtual_ address > > > space. > > > > Again, we're arguing about what? So do I. The relationship between virtual > > and logical, for me, is just logical = virtual - PAGE_OFFSET, a meme you'll > > find in many places in the kernel source already, often obscured by the > > impression that physical addresses are really being manipulated when in fact > > nothing of the kind is going on - the simple truth is, the arithmetic gets > > easier when you work zero-based instead of PAGE_OFFSET based. > > Why do you want to introduce another abstraction? The abstraction is already there. I didn't create the logical space, I identified it. There are places where the code is really manipulating logical addresses, not physical addresses, and these are not explicitly identified. Identifying them explicitly makes the code cleaner and easier to read. Your question is really 'why introduce any abstraction', or maybe you're asking 'is this an abstraction worth introducing'? Clearly it is, since it makes bootmem run faster, with nothing but name changes. > If the logical address > is basically the same as the virtual address, just use the virtual > address. But kernel coders have already done that in lots of places. Why? 
Because it's a pain to do arithmetic where everything is at an offset, and difficult to read. Not to mention, bulkier. > What difference should that offset make? Could you show me > please one single example? Look at drivers/char/mem.c, read_mem. Clearly, the code is not dealing with physical addresses. Yet it starts off with virt_to_phys, and thereafter works in zero-offset addresses. Why? Because it's clearer and more efficient to do that. The generic part of my nonlinear patch clarifies this usage by rewriting it as virt_to_logical, which is really what's happening. That's really what's happening in bootmem too. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-09 22:22 ` Daniel Phillips @ 2002-05-09 23:00 ` Roman Zippel 2002-05-09 23:22 ` Daniel Phillips 0 siblings, 1 reply; 152+ messages in thread From: Roman Zippel @ 2002-05-09 23:00 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel Hi, Daniel Phillips wrote: > On Friday 10 May 2002 00:06, Roman Zippel wrote: > > 1. My patch only modifies init code, I don't think it's really a problem > > if it's slightly slower. > > But why be slower when we don't have to. And why slow down *all* architectures? > > > 2. Above can now be written as "page = pfn_to_page(i + > > (bdata->node_boot_start >> PAGE_SHIFT))". Nice, isn't it? :) > > page++ is nicer yet. Is memmap[i++] so much worse? Let me repeat, this is only executed once at boot! > > Why do you want to introduce another abstraction? > > The abstraction is already there. I didn't create the logical space, I identified > it. And it's called virtual address space. > There are places where the code is really manipulating logical addresses, not > physical addresses, and these are not explicitly identified. This makes the code > cleaner and easier to read. _Please_ show me an example. > Look at drivers/char/mem.c, read_mem. Clearly, the code is not dealing with > physical addresses. Yet it starts off with virt_to_phys, and thereafter works > in zero-offset addresses. Why? Because it's clearer and more efficient to do > that. The generic part of my nonlinear patch clarifies this usage by rewriting > it as virt_to_logical, which is really what's happening. Are we looking at the same code??? Where is that zero-offset thingie? It just works with virtual and physical addresses and needs to convert between them. > That's really what's happening in bootmem too. That also works with just physical and virtual addresses. What are you talking about??? bye, Roman ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-09 23:00 ` Roman Zippel @ 2002-05-09 23:22 ` Daniel Phillips 2002-05-10 0:13 ` Roman Zippel 0 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-09 23:22 UTC (permalink / raw) To: Roman Zippel; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel On Friday 10 May 2002 01:00, Roman Zippel wrote: > > Look at drivers/char/mem.c, read_mem. Clearly, the code is not dealing with > > physical addresses. Yet it starts off with virt_to_phys, and thereafter works > > in zero-offset addresses. Why? Because it's clearer and more efficient to do > > that. The generic part of my nonlinear patch clarifies this usage by rewriting > > it as virt_to_logical, which is really what's happening. > > Are we looking at the same code??? Where is that zero-offset thingie? It > just works with virtual and physical addresses and needs to convert > between them. Show me where the 'physical' address is actually treated as a physical address. You can't, because it isn't. The 'physical' address is merely a zero-based logical address, and the code *relies* on it being contiguous. Your code is going to do __pa there, and you are going to go walking into places you don't expect. Even you need my logical address space abstraction, or else you want to go making global changes to the common kernel code that just add cruft. I enjoy the feeling of removing cruft, even when it's an uphill battle. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-09 23:22 ` Daniel Phillips @ 2002-05-10 0:13 ` Roman Zippel 0 siblings, 0 replies; 152+ messages in thread From: Roman Zippel @ 2002-05-10 0:13 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel Hi, Daniel Phillips wrote: > Show me where the 'physical' address is actually treated as a physical address. > You can't, because it isn't. The 'physical' address is merely a zero-based > logical address, and the code *relies* on it being contiguous. Most of the code doesn't care about physical addresses, because it either works with virtual memory or with the page structure. Physical addresses are only interesting for passing to the hardware or putting into the page table. > Your code is going to do __pa there, and you are going to go walking into places > you don't expect. Even you need my logical address space abstraction, or else you > want to go making global changes to the common kernel code that just add cruft. So far I've only seen a virtual address with some offset. You can maybe move that offset around, but you can't remove it. In the end it's the same. > I enjoy the feeling of removing cruft, even when it's an uphill battle. I'm happy to see patches. bye, Roman ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 18:57 ` Andrea Arcangeli 2002-05-02 19:08 ` Daniel Phillips @ 2002-05-02 22:39 ` Martin J. Bligh 2002-05-03 7:04 ` Andrea Arcangeli 1 sibling, 1 reply; 152+ messages in thread From: Martin J. Bligh @ 2002-05-02 22:39 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel > The difference is that if you use discontigmem you don't clobber the > common code in any way, there is no "logical/ordinal" abstraction, > there is no special table, it's all hidden in the arch section, and the > pgdat you need them anyways to allocate from affine memory with numa. I *want* the logical / ordinal abstraction. That's not a negative thing - it reduces the number of complicated things I have to think about, allowing me to think more clearly, and write correct code ;-) Not having a multitude of zones to balance in the normal discontigmem case also seems like a powerful argument to me ... M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 22:39 ` Martin J. Bligh @ 2002-05-03 7:04 ` Andrea Arcangeli 0 siblings, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 7:04 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 03:39:54PM -0700, Martin J. Bligh wrote: > > The difference is that if you use discontigmem you don't clobber the > > common code in any way, there is no "logical/ordinal" abstraction, > > there is no special table, it's all hidden in the arch section, and the > > pgdat you need them anyways to allocate from affine memory with numa. > > I *want* the logical / ordinal abstraction. That's not a negative thing - > it reduces the number of complicated things I have to think about, > allowing me to think more clearly, and write correct code ;-) That's just overhead. You don't need an additional table for the ordinal/logical things. The only case nonlinear will pay off is when you have to deal with a single pgdat with huge physical holes in the middle of its per-node mem_map. You don't have those holes in the middle of the mem_map of each node, so it's cleaner and faster to avoid nonlinear for you, it's just overhead. Nonlinear instead definitely pays off with the origin 2k layout shown by Ralf, or with the iseries machine if the partitioning mandates a huge number of discontiguous chunks. > > Not having a multitude of zones to balance in the normal discontigmem > case also seems like a powerful argument to me ... > > M. Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 16:06 ` Andrea Arcangeli 2002-05-02 16:10 ` Martin J. Bligh @ 2002-05-02 23:42 ` Daniel Phillips 2002-05-03 7:45 ` Andrea Arcangeli 1 sibling, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-02 23:42 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, linux-kernel On Thursday 02 May 2002 18:06, Andrea Arcangeli wrote: > On Wed, May 01, 2002 at 05:42:40PM +0200, Daniel Phillips wrote: > > On Thursday 02 May 2002 17:35, Andrea Arcangeli wrote: > > > On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote: > > > > At the moment I use the contig memory model (so we only use discontig for > > > > NUMA support) but I may need to change that in the future. > > > > > > I wasn't thinking at numa-q, but regardless numa-Q fits perfectly into > > > the current discontigmem-numa model too as far I can see. > > > > No it doesn't. The config_discontigmem model forces all zone_normal memory > > to be on node zero, so all the remaining nodes can only have highmem locally. > > You can trivially map the phys mem between 1G and 1G+256M to be in a > direct mapping between 3G+256M and 3G+512M, then you can put such 256M > at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too. Andrea, I'm re-reading this and I'm guilty of misreading your 3G+512M, what you meant is PAGE_OFFSET+512M. Yes, in fact this is exactly what config_nonlinear does. Config_discontigmem does not do this, not without your 'trivial map', and that's all config_nonlinear is: a trivial map done in an organized way. This same trivial mapping is capable of replacing all known non-numa uses of config_discontigmem. -- Daniel ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 23:42 ` Daniel Phillips @ 2002-05-03 7:45 ` Andrea Arcangeli 0 siblings, 0 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 7:45 UTC (permalink / raw) To: Daniel Phillips; +Cc: Martin J. Bligh, linux-kernel On Fri, May 03, 2002 at 01:42:56AM +0200, Daniel Phillips wrote: > On Thursday 02 May 2002 18:06, Andrea Arcangeli wrote: > > On Wed, May 01, 2002 at 05:42:40PM +0200, Daniel Phillips wrote: > > > On Thursday 02 May 2002 17:35, Andrea Arcangeli wrote: > > > > On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote: > > > > > At the moment I use the contig memory model (so we only use discontig for > > > > > NUMA support) but I may need to change that in the future. > > > > > > > > I wasn't thinking at numa-q, but regardless numa-Q fits perfectly into > > > > the current discontigmem-numa model too as far I can see. > > > > > > No it doesn't. The config_discontigmem model forces all zone_normal memory > > > to be on node zero, so all the remaining nodes can only have highmem locally. > > > > You can trivially map the phys mem between 1G and 1G+256M to be in a > > direct mapping between 3G+256M and 3G+512M, then you can put such 256M > > at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too. > > Andrea, I'm re-reading this and I'm guilty of misreading your 3G+512M, what > you meant is PAGE_OFFSET+512M. Yes, in fact this is exactly what yes, I was short in explaining it, 3G == PAGE_OFFSET for 99% of userbase but it wasn't obvious. > config_nonlinear does. Config_discontigmem does not do this, not without > your 'trivial map', and that's all config_nonlinear is: a trivial map done > in an organized way. This same trivial mapping is capable of replacing all > known non-numa uses of config_discontigmem. You add a table lookup, the lookup on such table or data structure is pure overhead if your ram is contiguous. 
NUMA-Q has contiguous ram within the node so it doesn't make sense to add the nonlinear overhead, to provide normal memory from the other nodes they only need to change virt_to_page and page_address, plus of course the initialization of the direct mapping (and the window initialization of the pci32 BAR windows/pci_map_single, but this latter pci part is independent of the discontigmem/nonlinear issue). Nonlinear makes sense and it's not a pure overhead _only_ when the mem_map has holes, so instead of wasting ram with the mem_map you pay a CPU hit with your nonlinear lookup, and so it can pay off, if there's no hole in the per-node mem_map pointed to by the pgdat then nonlinear cannot pay off. At the start of the thread I had never heard of configurations with huge ram holes in the middle of the nodes, I thought it had to be misconfigured hardware, origin 2k and iseries fall in this category so for them nonlinear can pay off (but if I had an iSeries I would know how to partition it efficiently and I would be fine with discontigmem be sure, the other is a fascinating but slow dinosaur anyways). Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 15:35 ` Andrea Arcangeli 2002-05-01 15:42 ` Daniel Phillips @ 2002-05-02 16:07 ` Martin J. Bligh 2002-05-02 16:58 ` Gerrit Huizenga 1 sibling, 1 reply; 152+ messages in thread From: Martin J. Bligh @ 2002-05-02 16:07 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel > With numa-q there's a 512M hole in each node IIRC. that's fine > configuration, similar to the wildfire btw. There's 2 different memory models - the NT mode we use currently is contiguous, the PTX mode is discontiguous. I don't think it's as simple as a 512Mb fixed size hole, though I'd have to look it up to be sure. >> At the moment I use the contig memory model (so we only use discontig for >> NUMA support) but I may need to change that in the future. > > I wasn't thinking at numa-q, but regardless numa-Q fits perfectly into > the current discontigmem-numa model too as far I can see. As Dan already mentioned, we need CONFIG_NONLINEAR to spread around ZONE_NORMAL. M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 16:07 ` Martin J. Bligh @ 2002-05-02 16:58 ` Gerrit Huizenga 2002-05-02 18:10 ` Andrea Arcangeli 0 siblings, 1 reply; 152+ messages in thread From: Gerrit Huizenga @ 2002-05-02 16:58 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrea Arcangeli, Daniel Phillips, Russell King, linux-kernel In message <3971861785.1020330424@[10.10.2.3]>, > : "Martin J. Bligh" writes: > > With numa-q there's a 512M hole in each node IIRC. that's fine > > configuration, similar to the wildfire btw. > > There's 2 different memory models - the NT mode we use currently > is contiguous, the PTX mode is discontiguous. I don't think it's > as simple as a 512Mb fixed size hole, though I'd have to look it > up to be sure. No - it definitely isn't as simple as a 512 MB hole. Depends on how much memory is in each node, holes could be all kinds of sizes. You could, in theory, have had 128 MB in one node and 8 GB in another node. I don't think we had holes within the node from the software side - I think the requirement was that all DIMMS were added in low to high memory slots. Not sure what forced that requirement - could have been PTX, BIOS, cache controllers, etc. gerrit ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 16:58 ` Gerrit Huizenga @ 2002-05-02 18:10 ` Andrea Arcangeli 2002-05-02 19:28 ` Gerrit Huizenga 0 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 18:10 UTC (permalink / raw) To: Gerrit Huizenga Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 09:58:02AM -0700, Gerrit Huizenga wrote: > In message <3971861785.1020330424@[10.10.2.3]>, > : "Martin J. Bligh" writes: > > > With numa-q there's a 512M hole in each node IIRC. that's fine > > > configuration, similar to the wildfire btw. > > > > There's 2 different memory models - the NT mode we use currently > > is contiguous, the PTX mode is discontiguous. I don't think it's > > as simple as a 512Mb fixed size hole, though I'd have to look it > > up to be sure. > > No - it definitely isn't as simple as a 512 MB hole. Depends on how much I meant that as an example, I recall that was a valid config, 512M of ram and 512M hole, then next node 512M ram and 512M hole etc... Of course it must be possible to vary the mem size if you want more or less ram in each node but still it doesn't generate a problematic layout for discontigmem (i.e. not 256 discontiguous chunks or something of that order). > memory is in each node, holes could be all kinds of sizes. You could, > in theory, have had 128 MB in one node and 8 GB in another node. I don't > think we had holes within the node from the software side - I think the > requirement was that all DIMMS were added in low to high memory slots. > Not sure what forced that requirement - could have been PTX, BIOS, > cache controllers, etc. > > gerrit Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 18:10 ` Andrea Arcangeli @ 2002-05-02 19:28 ` Gerrit Huizenga 2002-05-02 22:23 ` Martin J. Bligh 2002-05-03 6:20 ` Andrea Arcangeli 0 siblings, 2 replies; 152+ messages in thread From: Gerrit Huizenga @ 2002-05-02 19:28 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel In message <20020502201043.L11414@dualathlon.random>, > : Andrea Arcangeli writes: > On Thu, May 02, 2002 at 09:58:02AM -0700, Gerrit Huizenga wrote: > > In message <3971861785.1020330424@[10.10.2.3]>, > : "Martin J. Bligh" writes: > > > > With numa-q there's a 512M hole in each node IIRC. that's fine > > > > configuration, similar to the wildfire btw. > > > > > > There's 2 different memory models - the NT mode we use currently > > > is contiguous, the PTX mode is discontiguous. I don't think it's > > > as simple as a 512Mb fixed size hole, though I'd have to look it > > > up to be sure. > > > > No - it definitely isn't as simple as a 512 MB hole. Depends on how much > > I meant that as an example, I recall that was a valid config, 512M of ram > and 512M hole, then next node 512M ram and 512M hole etc... Of course it > must be possible to vary the mem size if you want more or less ram in > each node but still it doesn't generate a problematic layout for > discontigmem (i.e. not 256 discontiguous chunks or something of that > order). I *think* the ranges were typically aligned to 4 GB, although with 8 GB in a single node, I don't remember what the mapping layout looked like. Which made everything but node 0 into HIGHMEM. With the "flat" addressing mode that Martin has been using (the dummied down for NT version) everything is squished together. That makes it a bit harder to do node local data structures, although he may have enough data from the MPS table to split memory appropriately. gerrit ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:28 ` Gerrit Huizenga @ 2002-05-02 22:23 ` Martin J. Bligh 2002-05-03 6:20 ` Andrea Arcangeli 1 sibling, 0 replies; 152+ messages in thread From: Martin J. Bligh @ 2002-05-02 22:23 UTC (permalink / raw) To: Gerrit Huizenga, Andrea Arcangeli Cc: Daniel Phillips, Russell King, linux-kernel > I *think* the ranges were typically aligned to 4 GB, although with 8 GB > in a single node, I don't remember what the mapping layout looked like. > > Which made everything but node 0 into HIGHMEM. > > With the "flat" addressing mode that Martin has been using (the > dummied down for NT version) everything is squished together. That > makes it a bit harder to do node local data structures, although he > may have enough data from the MPS table to split memory appropriately. I have enough data, I know which phys mem ranges are in each node, but I still need to change the virtual <-> physical mapping in order to spread ZONE_NORMAL around. Pat has already spread the high memory around into specific pg_data_t's per node. M. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 19:28 ` Gerrit Huizenga 2002-05-02 22:23 ` Martin J. Bligh @ 2002-05-03 6:20 ` Andrea Arcangeli 2002-05-03 6:39 ` Martin J. Bligh 1 sibling, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-03 6:20 UTC (permalink / raw) To: Gerrit Huizenga Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 12:28:52PM -0700, Gerrit Huizenga wrote: > In message <20020502201043.L11414@dualathlon.random>, > : Andrea Arcangeli writes: > > On Thu, May 02, 2002 at 09:58:02AM -0700, Gerrit Huizenga wrote: > > > In message <3971861785.1020330424@[10.10.2.3]>, > : "Martin J. Bligh" writes: > > > > > With numa-q there's a 512M hole in each node IIRC. that's fine > > > > > configuration, similar to the wildfire btw. > > > > > > > > There's 2 different memory models - the NT mode we use currently > > > > is contiguous, the PTX mode is discontiguous. I don't think it's > > > > as simple as a 512Mb fixed size hole, though I'd have to look it > > > > up to be sure. > > > > > > No - it definitely isn't as simple as a 512 MB hole. Depends on how much > > > > I meant that as an example, I recall that was a valid config, 512M of ram > > and 512M hole, then next node 512M ram and 512M hole etc... Of course it > > must be possible to vary the mem size if you want more or less ram in > > each node but still it doesn't generate a problematic layout for > > discontigmem (i.e. not 256 discontiguous chunks or something of that > > order). > > I *think* the ranges were typically aligned to 4 GB, although with 8 GB > in a single node, I don't remember what the mapping layout looked like. > > Which made everything but node 0 into HIGHMEM. ok. > > With the "flat" addressing mode that Martin has been using (the > dummied down for NT version) everything is squished together. 
That > makes it a bit harder to do node local data structures, although he > may have enough data from the MPS table to split memory appropriately. sure, the only issue is the API that the hardware provides to advertise the start/end of the memory for each node. It doesn't matter if it's squashed or not as long as you still know the start/end of the phys ram per node. It also won't make any difference with nonlinear or discontigmem because you need to fill the pgdat anyways to enable the numa heuristics (node-affine-allocations being the most sensible etc..). Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 6:20 ` Andrea Arcangeli @ 2002-05-03 6:39 ` Martin J. Bligh 0 siblings, 0 replies; 152+ messages in thread From: Martin J. Bligh @ 2002-05-03 6:39 UTC (permalink / raw) To: Andrea Arcangeli, Gerrit Huizenga Cc: Daniel Phillips, Russell King, linux-kernel >> With the "flat" addressing mode that Martin has been using (the >> dummied down for NT version) everything is squished together. That >> makes it a bit harder to do node local data structures, although he >> may have enough data from the MPS table to split memory appropriately. > > sure, the only issue is the API that the hardware provides to advertise > the start/end of the memory for each node. It doesn't matter if it's > squashed or not as long as you still know the start/end of the phys ram > per node. It also won't make any difference with nonlinear or > discontigmem because you need to fill the pgdat anyways to enable the > numa heuristics (node-affine-allocations being the most sensible etc..). Yup, we can grab that info from the BIOS generated tables - see Pat Gaughen's patch posted here a few days ago that parses those tables and feeds the pgdats if you want the gory details. Martin. ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 13:34 ` Andrea Arcangeli 2002-05-02 15:18 ` Martin J. Bligh @ 2002-05-02 16:00 ` William Lee Irwin III 1 sibling, 0 replies; 152+ messages in thread From: William Lee Irwin III @ 2002-05-02 16:00 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 03:34:02PM +0200, Andrea Arcangeli wrote: > alpha is the same as mips I think. sparc would be the same too if > there's any discontigmem sparc. Dunno of arm. We're talking about > architectures needing discontigmem, 99% percent of users doesn't need > discontigmem in the first place, you never need discontigmem in x86 and > even in new-numa you don't need discontigmem, you want to pass through > discontigmem only to get the numa topology description that the current > discontigmem provides via the pgdat. Any chance you could name a few of these mysterious new NUMA machines? Thanks, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 0:47 ` Andrea Arcangeli 2002-05-01 1:26 ` Daniel Phillips @ 2002-05-02 2:37 ` William Lee Irwin III 2002-05-02 15:59 ` Andrea Arcangeli 1 sibling, 1 reply; 152+ messages in thread From: William Lee Irwin III @ 2002-05-02 2:37 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel On Thu, May 02, 2002 at 02:47:40AM +0200, Andrea Arcangeli wrote: > Oh yeah, you save 1 microsecond every 10 years of uptime by taking > advantage of the potentially coalesced cacheline between the last page > in a node and the first page of the next node. Before you can care about > this optimization you should remove from x86 the pgdat loops that are > not needed with discontigmem disabled like in x86 (this has nothing to > do with discontigmem/nonlinear). That wouldn't be measurable either but at > least it would be more worthwhile. Which ones did you have in mind? I did poke around this area a bit, and already have my eye on one... Cheers, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 2:37 ` William Lee Irwin III @ 2002-05-02 15:59 ` Andrea Arcangeli 2002-05-02 16:06 ` William Lee Irwin III 0 siblings, 1 reply; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-02 15:59 UTC (permalink / raw) To: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel On Wed, May 01, 2002 at 07:37:11PM -0700, William Lee Irwin III wrote: > On Thu, May 02, 2002 at 02:47:40AM +0200, Andrea Arcangeli wrote: > > Oh yeah, you save 1 microsecond every 10 years of uptime by taking > > advantage of the potentially coalesced cacheline between the last page > > in a node and the first page of the next node. Before you can care about > > this optimization you should remove from x86 the pgdat loops that are > > not needed with discontigmem disabled like in x86 (this has nothing to > > do with discontigmem/nonlinear). That wouldn't be measurable either but at > > least it would be more worthwhile. > > Which ones did you have in mind? I did poke around this area a bit, and all of them, if you implement a mechanism to skip one of the pgdat loops, you could skip them all then. > already have my eye on one... > > > Cheers, > Bill Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-02 15:59 ` Andrea Arcangeli @ 2002-05-02 16:06 ` William Lee Irwin III 0 siblings, 0 replies; 152+ messages in thread From: William Lee Irwin III @ 2002-05-02 16:06 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel On Wed, May 01, 2002 at 07:37:11PM -0700, William Lee Irwin III wrote: >> Which ones did you have in mind? I did poke around this area a bit, and On Thu, May 02, 2002 at 05:59:46PM +0200, Andrea Arcangeli wrote: > all of them, if you implement a mechanism to skip one of the pgdat > loops, you could skip them all then. Not quite; I only had in mind a per-cpu free pages counter for nr_free_pages. Cheers, Bill ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-01 2:23 ` Andrea Arcangeli 2002-04-30 23:12 ` Daniel Phillips @ 2002-05-01 18:05 ` Jesse Barnes 2002-05-01 23:17 ` Andrea Arcangeli 1 sibling, 1 reply; 152+ messages in thread From: Jesse Barnes @ 2002-05-01 18:05 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel On Wed, May 01, 2002 at 04:23:41AM +0200, Andrea Arcangeli wrote: > What's the advantage? And after you can have more than one mem_map, > after you added this "vector", then each mem_map will match a > discontigmem pgdat. Tell me a numa machine where there's a hole in the > middle of a node. The holes are always inter-node, never within the > nodes themselves. So the nonlinear-numa should fall back to the straight Just FYI, there _are_ many NUMA machines with memory holes in the middle of a node. Check out the discontig patch at http://sf.net/projects/discontig for more info. Jesse ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-01 18:05 ` Jesse Barnes @ 2002-05-01 23:17 ` Andrea Arcangeli 2002-05-01 23:23 ` discontiguous memory platforms Jesse Barnes 2002-05-02 0:20 ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard 0 siblings, 2 replies; 152+ messages in thread From: Andrea Arcangeli @ 2002-05-01 23:17 UTC (permalink / raw) To: Daniel Phillips, Russell King, linux-kernel; +Cc: Jesse Barnes On Wed, May 01, 2002 at 11:05:47AM -0700, Jesse Barnes wrote: > On Wed, May 01, 2002 at 04:23:41AM +0200, Andrea Arcangeli wrote: > > What's the advantage? And after you can have more than one mem_map, > > after you added this "vector", then each mem_map will match a > > discontigmem pgdat. Tell me a numa machine where there's a hole in the > > middle of a node. The holes are always inter-node, never within the > > nodes themselves. So the nonlinear-numa should fall back to the straight > > Just FYI, there _are_ many NUMA machines with memory holes in the > middle of a node. Check out the discontig patch at > http://sf.net/projects/discontig for more info. so ia64 is one of those archs with a ram layout with huge holes in the middle of the ram of the nodes? I'd be curious to know what's the hardware advantage of designing the ram layout in such a way, compared to all other numa archs that I deal with. Also if you know other archs with huge holes in the middle of the ram of the nodes I'd be curious to know about them too. thanks for the interesting info! Andrea ^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: discontiguous memory platforms
2002-05-01 23:17         ` Andrea Arcangeli
@ 2002-05-01 23:23         ` Jesse Barnes
2002-05-02  0:51           ` Ralf Baechle
2002-05-02  0:20         ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
1 sibling, 1 reply; 152+ messages in thread
From: Jesse Barnes @ 2002-05-01 23:23 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 01:17:50AM +0200, Andrea Arcangeli wrote:
> so ia64 is one of those archs with a ram layout with huge holes in the
> middle of the ram of the nodes? I'd be curious to know what's the

Well, our ia64 platform is at least, but I think there are others.

> hardware advantage of designing the ram layout in such a way, compared
> to all other numa archs that I deal with. Also if you know other archs
> with huge holes in the middle of the ram of the nodes I'd be curious to
> know about them too. thanks for the interesting info!

AFAIK, some MIPS platforms (both NUMA and non-NUMA) have memory
layouts like this too. I've never done hardware design before, so I'm
not sure if there's a good reason for such layouts. Ralf or Daniel
might be able to shed some more light on that...

Jesse
* Re: discontiguous memory platforms
2002-05-01 23:23         ` discontiguous memory platforms Jesse Barnes
@ 2002-05-02  0:51         ` Ralf Baechle
2002-05-02  1:27           ` Andrea Arcangeli
0 siblings, 1 reply; 152+ messages in thread
From: Ralf Baechle @ 2002-05-02 0:51 UTC (permalink / raw)
To: Andrea Arcangeli, Daniel Phillips, Russell King, linux-kernel

On Wed, May 01, 2002 at 04:23:43PM -0700, Jesse Barnes wrote:
> On Thu, May 02, 2002 at 01:17:50AM +0200, Andrea Arcangeli wrote:
> > so ia64 is one of those archs with a ram layout with huge holes in the
> > middle of the ram of the nodes? I'd be curious to know what's the
>
> Well, our ia64 platform is at least, but I think there are others.
>
> > hardware advantage of designing the ram layout in such a way, compared
> > to all other numa archs that I deal with. Also if you know other archs
> > with huge holes in the middle of the ram of the nodes I'd be curious to
> > know about them too. thanks for the interesting info!
>
> AFAIK, some MIPS platforms (both NUMA and non-NUMA) have memory
> layouts like this too. I've never done hardware design before, so I'm
> not sure if there's a good reason for such layouts. Ralf or Daniel
> might be able to shed some more light on that...

Just to give a few examples of memory layouts on MIPS systems. Sibyte 1250
is as follows:

- 256MB at physical address 0
- 512MB at physical address 0x80000000
- 256MB at physical address 0xc0000000
- The entire rest of the memory is mapped contiguously from physical
  address 0x1:00000000 up.

All available memory is mapped from the lowest address up.

Origin 200/2000. Each node has an address space of 2GB, and each node has
4 memory banks, that is each bank takes 512MB of address space. Even
unpopulated or partially populated banks take the full 512MB address
space. Memory in partially populated banks is mapped at the beginning of
the bank's address space; each node must have at least one bank with
memory in it, that is something like

- 32MB @ physical address 0x00:00000000
- 32MB @ physical address 0x00:80000000
- 32MB @ physical address 0x01:00000000
...
- 32MB @ physical address 0x7f:00000000

would be a valid configuration. That's 8GB of RAM scattered in tiny
chunks of just 32MB throughout 512GB of address space. In theory nodes
might not even have to exist, so

- 32MB @ physical address 0x00:00000000
- 32MB @ physical address 0x7f:00000000

would be a valid configuration as well.

There are more examples, but #1 is becoming a widespread chip and #2 is
a rather extreme example just to show how far discontiguity may go.

Ralf
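[Editorial note: the node/bank arithmetic Ralf describes can be sketched in a few lines of C. The macro and function names below are illustrative, not the real SGI/MIPS kernel code; the only assumptions taken from the mail are the 2GB-per-node, 512MB-per-bank layout.]

```c
#include <stdint.h>

/* Hypothetical decode for the Origin 200/2000 layout described above:
 * every node owns a 2GB window of physical address space, divided into
 * four 512MB banks (populated or not). */
#define NODE_SHIFT	31		/* 2GB per node */
#define BANK_SHIFT	29		/* 512MB per bank */

static inline unsigned int paddr_to_nid(uint64_t paddr)
{
	return (unsigned int)(paddr >> NODE_SHIFT);
}

static inline unsigned int paddr_to_bank(uint64_t paddr)
{
	/* bank index within the node: address bits 29..30 */
	return (unsigned int)((paddr >> BANK_SHIFT) & 3);
}
```

Under this scheme the chunk at 0x7f:00000000 lands in node 254, which is where Andrea's "256 nodes" figure in the follow-up comes from.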
* Re: discontiguous memory platforms
2002-05-02  0:51         ` Ralf Baechle
@ 2002-05-02  1:27         ` Andrea Arcangeli
2002-05-02  1:32           ` Ralf Baechle
2002-05-02  8:50           ` Roman Zippel
0 siblings, 2 replies; 152+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 1:27 UTC (permalink / raw)
To: Ralf Baechle; +Cc: Daniel Phillips, Russell King, linux-kernel

On Wed, May 01, 2002 at 05:51:33PM -0700, Ralf Baechle wrote:
> Just to give a few examples of memory layouts on MIPS systems. Sibyte 1250
> is as follows:
>
> - 256MB at physical address 0
> - 512MB at physical address 0x80000000
> - 256MB at physical address 0xc0000000
> - The entire rest of the memory is mapped contiguously from physical
>   address 0x1:00000000 up.
>
> All available memory is mapped from the lowest address up.

Is this a numa? If not then you should be just perfectly fine with
discontigmem with this chip.

> Origin 200/2000. Each node has an address space of 2GB, each node has 4
> memory banks, that is each bank takes 512MB of address space. Even
> unpopulated or partially populated banks take the full 512MB address
> space. Memory in partially populated banks is mapped at the beginning
> of the bank's address space; each node must have at least one bank
> with memory in it, that is something like
>
> - 32MB @ physical address 0x00:00000000
> - 32MB @ physical address 0x00:80000000
> - 32MB @ physical address 0x01:00000000
> ...
> - 32MB @ physical address 0x7f:00000000
>
> would be a valid configuration.

this means 256 nodes. for example that many different discontigmem nodes
would give you a measurable slowdown in the nr_free_pages O(N) loops
over the pgdat list, so nonlinear on the above hardware is a win. I
wasn't aware that such a memory layout actually existed.

So if you want to support the above efficiently we must make it possible
for you to do nonlinear transparently to the common code kernel
abstraction. What are actually the common code changes involved with
nonlinear? What I care about is not to clobber the common code with
additional overlapping common code abstractions. We should try to make
it possible to do nonlinear under mips completely transparently to the
current common code; if we do that then you can use nonlinear to handle
the above extreme origin 200/2k scenario without the common code
noticing that.

Then there's no point to argue about nonlinear or discontigmem,
nonlinear becomes the mips way of handling virt_to_page and that's all,
no config_nonlinear at all, just select ARCH=mips instead of ARCH=x86.
Then I'll be very fine with it of course, it would become an obviously
right implementation of virt_to_page/pte_page for a certain arch.

> There are other examples more but #1 is becoming a widespread chip and #2
> is a rather extreme example just to show how far discontiguity may go.
>
> Ralf

Andrea
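[Editorial note: the O(N) behaviour Andrea objects to is easy to see in miniature. This is a pared-down sketch of a 2.4-style nr_free_pages() walking the pgdat list, not the actual kernel structures; only the fields needed for the loop are kept.]

```c
/* Minimal model of the discontigmem pgdat chain: one pglist_data per
 * node, linked together, each carrying a free-page count. */
struct pglist_data {
	unsigned long node_free_pages;
	struct pglist_data *node_next;
};

static struct pglist_data node2 = { 30, 0 };
static struct pglist_data node1 = { 20, &node2 };
static struct pglist_data node0 = { 10, &node1 };	/* head of list */

static unsigned long nr_free_pages(struct pglist_data *pgdat_list)
{
	unsigned long sum = 0;
	struct pglist_data *pgdat;

	/* O(N) in the node count: with 256 tiny nodes this is 256
	 * iterations for what is conceptually one counter read. */
	for (pgdat = pgdat_list; pgdat; pgdat = pgdat->node_next)
		sum += pgdat->node_free_pages;
	return sum;
}
```

With a handful of nodes the walk is noise; with one pgdat per 32MB bank it becomes a measurable cost on every caller, which is the argument for not modelling intra-node holes as extra nodes.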
* Re: discontiguous memory platforms
2002-05-02  1:27         ` Andrea Arcangeli
@ 2002-05-02  1:32         ` Ralf Baechle
2002-05-02  8:50         ` Roman Zippel
1 sibling, 0 replies; 152+ messages in thread
From: Ralf Baechle @ 2002-05-02 1:32 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 03:27:25AM +0200, Andrea Arcangeli wrote:
> > - 256MB at physical address 0
> > - 512MB at physical address 0x80000000
> > - 256MB at physical address 0xc0000000
> > - The entire rest of the memory is mapped contiguously from physical
> >   address 0x1:00000000 up.
> > All available memory is mapped from the lowest address up.
>
> Is this a numa? If not then you should be just perfectly fine with
> discontigmem with this chip.

This is a system on a chip with memory controllers on die. In theory
multiple of it can be combined to brew some crude ccNUMA system but I
don't know if people are actually doing that.

Ralf
* Re: discontiguous memory platforms
2002-05-02  1:27         ` Andrea Arcangeli
2002-05-02  1:32         ` Ralf Baechle
@ 2002-05-02  8:50         ` Roman Zippel
2002-05-01 13:21           ` Daniel Phillips
2002-05-02 18:35           ` Geert Uytterhoeven
1 sibling, 2 replies; 152+ messages in thread
From: Roman Zippel @ 2002-05-02 8:50 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Ralf Baechle, Daniel Phillips, Russell King, linux-kernel

Hi,

On Thu, 2 May 2002, Andrea Arcangeli wrote:

> What I
> care about is not to clobber the common code with additional overlapping
> common code abstractions.

Just to throw in an alternative: On m68k we currently map everything
together into a single virtual area. This means the virtual<->physical
conversion is a bit more expensive, and mem_map is simply indexed by the
virtual address.
It works nicely, it just needs two small patches in the initialization
code, which aren't integrated yet. I think it's very close to what Daniel
wants, only that the logical and virtual address are identical.

bye, Roman
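[Editorial note: the m68k scheme Roman outlines (physical chunks packed back-to-back into one virtual area, with a small table walked on each conversion) can be sketched as below. The table mirrors the role of m68k's chunk list, but the names, chunk layout, and code are illustrative, not the actual arch/m68k sources.]

```c
/* Hypothetical chunk table: two 32MB physical fragments mapped
 * contiguously into the kernel's single virtual area. */
struct mem_chunk {
	unsigned long phys;	/* physical base of the chunk */
	unsigned long size;	/* bytes in the chunk */
};

#define NUM_CHUNKS 2
static const struct mem_chunk chunks[NUM_CHUNKS] = {
	{ 0x00000000UL, 0x02000000UL },		/* 32MB at 0 */
	{ 0x80000000UL, 0x02000000UL },		/* 32MB at 2GB */
};

/* Offset into the packed virtual area -> physical address.  This is
 * the loop that makes the conversion "a bit more expensive": every
 * call scans the chunk table until the offset falls inside a chunk. */
static unsigned long virt_offset_to_phys(unsigned long voff)
{
	int i;

	for (i = 0; i < NUM_CHUNKS; i++) {
		if (voff < chunks[i].size)
			return chunks[i].phys + voff;
		voff -= chunks[i].size;
	}
	return ~0UL;	/* past the end: not mapped */
}
```

For the two or three chunks typical of these machines the scan is cheap; the later discussion is about what to do when the fragment count stops being small.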
* Re: discontiguous memory platforms
2002-05-02  8:50         ` Roman Zippel
@ 2002-05-01 13:21         ` Daniel Phillips
2002-05-02 14:00           ` Roman Zippel
2002-05-02 18:35         ` Geert Uytterhoeven
1 sibling, 1 reply; 152+ messages in thread
From: Daniel Phillips @ 2002-05-01 13:21 UTC (permalink / raw)
To: Roman Zippel, Andrea Arcangeli; +Cc: Ralf Baechle, Russell King, linux-kernel

On Thursday 02 May 2002 10:50, Roman Zippel wrote:
> Hi,
>
> On Thu, 2 May 2002, Andrea Arcangeli wrote:
>
> > What I
> > care about is not to clobber the common code with additional overlapping
> > common code abstractions.
>
> Just to throw in an alternative: On m68k we map currently everything
> together into a single virtual area. This means the virtual<->physical
> conversion is a bit more expensive and mem_map is simply indexed by the
> the virtual address.

Are you talking about mm_ptov and friends here? What are the loops for?
Could you please describe the most extreme case of physical discontiguity
you're handling?

> It works nicely, it just needs two small patches in the initializition
> code, which aren't integrated yet. I think it's very close to what Daniel
> wants, only that the logical and virtual address are identical.

Yes, since logical and virtual kernel addresses in config_nonlinear differ
only by a constant (PAGE_OFFSET) then setting the constant to zero gives me
your case.

--
Daniel
* Re: discontiguous memory platforms
2002-05-01 13:21         ` Daniel Phillips
@ 2002-05-02 14:00         ` Roman Zippel
2002-05-01 14:08           ` Daniel Phillips
0 siblings, 1 reply; 152+ messages in thread
From: Roman Zippel @ 2002-05-02 14:00 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

Hi,

On Wed, 1 May 2002, Daniel Phillips wrote:

> > Just to throw in an alternative: On m68k we map currently everything
> > together into a single virtual area. This means the virtual<->physical
> > conversion is a bit more expensive and mem_map is simply indexed by the
> > the virtual address.
>
> Are you talking about mm_ptov and friends here? What are the loops for?

It simply searches through all memory nodes, it's not really efficient.

> Could you please describe the most extreme case of physical discontiguity
> you're handling?

I can't assume anything. I'm thinking about calculating the table
dynamically and patching the kernel at bootup, we are already doing
something similar in the Amiga/ppc kernel.

bye, Roman
* Re: discontiguous memory platforms
2002-05-02 14:00         ` Roman Zippel
@ 2002-05-01 14:08         ` Daniel Phillips
2002-05-02 17:56           ` Roman Zippel
0 siblings, 1 reply; 152+ messages in thread
From: Daniel Phillips @ 2002-05-01 14:08 UTC (permalink / raw)
To: Roman Zippel; +Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

On Thursday 02 May 2002 16:00, Roman Zippel wrote:
> Hi,
>
> On Wed, 1 May 2002, Daniel Phillips wrote:
>
> > > Just to throw in an alternative: On m68k we map currently everything
> > > together into a single virtual area. This means the virtual<->physical
> > > conversion is a bit more expensive and mem_map is simply indexed by the
> > > the virtual address.
> >
> > Are you talking about mm_ptov and friends here? What are the loops for?
>
> It simply searches through all memory nodes, it's not really efficient.
>
> > Could you please describe the most extreme case of physical discontiguity
> > you're handling?
>
> I can't assume anything. I'm thinking about calculating the table
> dynamically and patching the kernel at bootup, we are already doing
> something similiar in the Amiga/ppc kernel.

Maybe this is a good place to try out a hash table variant of
config_nonlinear. It's got to be more efficient than searching all the
nodes, don't you think?

--
Daniel
* Re: discontiguous memory platforms
2002-05-01 14:08         ` Daniel Phillips
@ 2002-05-02 17:56         ` Roman Zippel
2002-05-01 17:59           ` Daniel Phillips
0 siblings, 1 reply; 152+ messages in thread
From: Roman Zippel @ 2002-05-02 17:56 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

Daniel Phillips wrote:

> Maybe this is a good place to try out a hash table variant of
> config_nonlinear. It's got to be more efficient than searching all the
> nodes, don't you think?

Most of the time there are only a few nodes, I just don't know where and
how big they are, so I don't think a hash based approach will be a lot
faster. When I'm going to change this, I'd rather try the dynamic table
approach.

bye, Roman
* Re: discontiguous memory platforms
2002-05-02 17:56         ` Roman Zippel
@ 2002-05-01 17:59         ` Daniel Phillips
2002-05-02 18:26           ` Roman Zippel
0 siblings, 1 reply; 152+ messages in thread
From: Daniel Phillips @ 2002-05-01 17:59 UTC (permalink / raw)
To: Roman Zippel; +Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

On Thursday 02 May 2002 19:56, Roman Zippel wrote:
> Daniel Phillips wrote:
>
> > Maybe this is a good place to try out a hash table variant of
> > config_nonlinear. It's got to be more efficient than searching all the
> > nodes, don't you think?
>
> Most of the time there are only a few nodes, I just don't know where and
> how big they are, so I don't think a hash based approach will be a lot
> faster. When I'm going to change this, I'd rather try the dynamic table
> approach.

Which dynamic table approach is that?

--
Daniel
* Re: discontiguous memory platforms
2002-05-01 17:59         ` Daniel Phillips
@ 2002-05-02 18:26         ` Roman Zippel
2002-05-02 18:32           ` Daniel Phillips
0 siblings, 1 reply; 152+ messages in thread
From: Roman Zippel @ 2002-05-02 18:26 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

Hi,

Daniel Phillips wrote:

> > Most of the time there are only a few nodes, I just don't know where and
> > how big they are, so I don't think a hash based approach will be a lot
> > faster. When I'm going to change this, I'd rather try the dynamic table
> > approach.
>
> Which dynamic table approach is that?

I mean calculating the lookup table and patching the kernel at startup.
Anyway, I agree with Andrea, that another mapping isn't really needed.
Clever use of the mmu should give you almost the same result.

bye, Roman
* Re: discontiguous memory platforms
2002-05-02 18:26         ` Roman Zippel
@ 2002-05-02 18:32         ` Daniel Phillips
2002-05-02 19:40           ` Roman Zippel
0 siblings, 1 reply; 152+ messages in thread
From: Daniel Phillips @ 2002-05-02 18:32 UTC (permalink / raw)
To: Roman Zippel; +Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

On Thursday 02 May 2002 20:26, Roman Zippel wrote:
> Hi,
>
> Daniel Phillips wrote:
>
> > > Most of the time there are only a few nodes, I just don't know where and
> > > how big they are, so I don't think a hash based approach will be a lot
> > > faster. When I'm going to change this, I'd rather try the dynamic table
> > > approach.
> >
> > Which dynamic table approach is that?
>
> I mean calculating the lookup table and patching the kernel at startup.

Patching the kernel how, and where?

Calculating the lookup table automatically at startup is definitely planned,
and yes, essential to avoid an unmanageable proliferation of configuration
files. It's also possible to pass the configuration as a list of
mem=size@physaddr kernel command line entries, which is a pragmatic solution
for configurations with unusual memory mappings, but not too many of them.

> Anyway, I agree with Andrea, that another mapping isn't really needed.
> Clever use of the mmu should give you almost the same result.

We *are* making clever use of the mmu in config_nonlinear, it is doing the
nonlinear kernel virtual mapping for us. Did you have something more clever
in mind?

--
Daniel
* Re: discontiguous memory platforms
2002-05-02 18:32         ` Daniel Phillips
@ 2002-05-02 19:40         ` Roman Zippel
2002-05-02 20:14           ` Daniel Phillips
2002-05-03  6:30           ` Andrea Arcangeli
0 siblings, 2 replies; 152+ messages in thread
From: Roman Zippel @ 2002-05-02 19:40 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

Daniel Phillips wrote:

> Patching the kernel how, and where?

Check for example in asm-ppc/page.h the __va/__pa functions.

> > Anyway, I agree with Andrea, that another mapping isn't really needed.
> > Clever use of the mmu should give you almost the same result.
>
> We *are* making clever use of the mmu in config_nonlinear, it is doing the
> nonlinear kernel virtual mapping for us. Did you have something more clever
> in mind?

I mean to map the memory where you need it. The physical<->virtual
mapping won't be one to one, but you won't need another abstraction and
the current vm is already basically able to handle it.

bye, Roman
* Re: discontiguous memory platforms
2002-05-02 19:40         ` Roman Zippel
@ 2002-05-02 20:14         ` Daniel Phillips
2002-05-03  6:34           ` Andrea Arcangeli
2002-05-03  9:33           ` Roman Zippel
1 sibling, 2 replies; 152+ messages in thread
From: Daniel Phillips @ 2002-05-02 20:14 UTC (permalink / raw)
To: Roman Zippel; +Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

On Thursday 02 May 2002 21:40, Roman Zippel wrote:
> Daniel Phillips wrote:
>
> > Patching the kernel how, and where?
>
> Check for example in asm-ppc/page.h the __va/__pa functions.

OK, by 'patching the kernel' you must mean 'initialize the m68k_memory
array'. The loop you use does have one advantage: it can handle size
variation vs a shift-lookup strategy. It's a lot more expensive though, and
these are heavily used operations.

> > > Anyway, I agree with Andrea, that another mapping isn't really needed.
> > > Clever use of the mmu should give you almost the same result.
> >
> > We *are* making clever use of the mmu in config_nonlinear, it is doing the
> > nonlinear kernel virtual mapping for us. Did you have something more clever
> > in mind?
>
> I mean to map the memory where you need it. The physical<->virtual
> mapping won't be one to one, but you won't need another abstraction and
> the current vm is already basically able to handle it.

I'll accept 'not needed for 68K', though I guess config_nonlinear will work
perfectly well for you and be faster than the loops. However, some of the
problems that config_nonlinear solves cannot be solved by any existing kernel
mechanism. We've been over the NUMA-Q and mips32 cases in detail, so I won't
reiterate.

Thanks for your input.

--
Daniel
* Re: discontiguous memory platforms
2002-05-02 20:14         ` Daniel Phillips
@ 2002-05-03  6:34         ` Andrea Arcangeli
0 siblings, 0 replies; 152+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 6:34 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Roman Zippel, Ralf Baechle, Russell King, linux-kernel

On Thu, May 02, 2002 at 10:14:02PM +0200, Daniel Phillips wrote:
> mechanism. We've been over the NUMA-Q and mips32 cases in detail, so I won't

I didn't hear the mips32 argument, but for NUMA-Q nonlinear is
definitely the last thing you want, there is no discontinuity in the ram
in each node. nonlinear can make sense _only_ when there are ram holes
in the middle of the per-numa-node-mem_map.

NUMA-Q has the need of spreading the zone_normal over the different
nodes, and nonlinear is definitely not needed and won't help in
achieving that objective. NUMA-Q in fact needs the discontigmem topology
description to allow the numa optimizations, so it cannot use nonlinear
anyways to handle the holes between the numa-nodes.

Andrea
* Re: discontiguous memory platforms
2002-05-02 20:14         ` Daniel Phillips
2002-05-03  6:34         ` Andrea Arcangeli
@ 2002-05-03  9:33         ` Roman Zippel
1 sibling, 0 replies; 152+ messages in thread
From: Roman Zippel @ 2002-05-03 9:33 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

Hi,

On Thu, 2 May 2002, Daniel Phillips wrote:

> I'll accept 'not needed for 68K', though I guess config_nonlinear will work
> perfectly well for you and be faster than the loops. However, some of the
> problems that config_nonlinear solves cannot be solved by any existing kernel
> mechanism. We've been over the NUMA-Q and mips32 cases in detail, so I won't
> reiterate.

Maybe I missed that, but could you give me an example of a memory
configuration, which would be difficult to handle with the current vm?
Could you describe, how in your model the physical address space would be
mapped to the logical and virtual address space and how they are mapped
into the pgdat nodes? Some real numbers would help me a lot to understand,
what you have in mind. I have a rough idea of it, but I want to be sure we
are talking about the same thing.

bye, Roman
* Re: discontiguous memory platforms
2002-05-02 19:40         ` Roman Zippel
2002-05-02 20:14         ` Daniel Phillips
@ 2002-05-03  6:30         ` Andrea Arcangeli
1 sibling, 0 replies; 152+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 6:30 UTC (permalink / raw)
To: Roman Zippel; +Cc: Daniel Phillips, Ralf Baechle, Russell King, linux-kernel

On Thu, May 02, 2002 at 09:40:48PM +0200, Roman Zippel wrote:
> mapping won't be one to one, but you won't need another abstraction and
> the current vm is already basically able to handle it.

this was basically my whole point, agreed.

Andrea
* Re: discontiguous memory platforms
2002-05-02  8:50         ` Roman Zippel
2002-05-01 13:21         ` Daniel Phillips
@ 2002-05-02 18:35         ` Geert Uytterhoeven
2002-05-02 18:39           ` Daniel Phillips
1 sibling, 1 reply; 152+ messages in thread
From: Geert Uytterhoeven @ 2002-05-02 18:35 UTC (permalink / raw)
To: Roman Zippel
Cc: Andrea Arcangeli, Ralf Baechle, Daniel Phillips, Russell King,
	Linux Kernel Development

On Thu, 2 May 2002, Roman Zippel wrote:
> On Thu, 2 May 2002, Andrea Arcangeli wrote:
> > What I
> > care about is not to clobber the common code with additional overlapping
> > common code abstractions.
>
> Just to throw in an alternative: On m68k we map currently everything
> together into a single virtual area. This means the virtual<->physical
> conversion is a bit more expensive and mem_map is simply indexed by the
> the virtual address.
> It works nicely, it just needs two small patches in the initializition
> code, which aren't integrated yet. I think it's very close to what Daniel
> wants, only that the logical and virtual address are identical.

I also want to add that the order (by address) of the virtual chunks is not
necessarily the same as the order (by address) of the physical chunks.

So it's perfectly possible to put the kernel in the second physical chunk,
in which case the first physical chunk (with a lower physical address) ends
up in the virtual list behind the second physical chunk.

IIRC (/me no Linux mm whizard), the above reason was the main reason why the
current zone system doesn't work well for m68k boxes (mainly talking about
Amiga).

Gr{oetje,eeting}s,

	Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like
that. -- Linus Torvalds
* Re: discontiguous memory platforms
2002-05-02 18:35         ` Geert Uytterhoeven
@ 2002-05-02 18:39         ` Daniel Phillips
0 siblings, 0 replies; 152+ messages in thread
From: Daniel Phillips @ 2002-05-02 18:39 UTC (permalink / raw)
To: Geert Uytterhoeven, Roman Zippel
Cc: Andrea Arcangeli, Ralf Baechle, Russell King, Linux Kernel Development

On Thursday 02 May 2002 20:35, Geert Uytterhoeven wrote:
> On Thu, 2 May 2002, Roman Zippel wrote:
> > On Thu, 2 May 2002, Andrea Arcangeli wrote:
> > > What I
> > > care about is not to clobber the common code with additional overlapping
> > > common code abstractions.
> >
> > Just to throw in an alternative: On m68k we map currently everything
> > together into a single virtual area. This means the virtual<->physical
> > conversion is a bit more expensive and mem_map is simply indexed by the
> > the virtual address.
> > It works nicely, it just needs two small patches in the initializition
> > code, which aren't integrated yet. I think it's very close to what Daniel
> > wants, only that the logical and virtual address are identical.
>
> I also want to add that the order (by address) of the virtual chunk is not
> necessarily the same as the order (by address) of the physical chunks.
>
> So it's perfect possible to put the kernel in the second physical chunk, in
> which case the first physical chunk (with a lower physical address) ends up in
> the virtual list behind the first physical chunk.
>
> IIRC (/me no Linux mm whizard), the above reason was the main reason why the
> current zone system doesn't work well for m68k boxes (mainly talking about
> Amiga).

Config_nonlinear will handle this situation without difficulty.

--
Daniel
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-01 23:17         ` Andrea Arcangeli
2002-05-01 23:23           ` discontiguous memory platforms Jesse Barnes
@ 2002-05-02  0:20         ` Anton Blanchard
2002-05-01  1:35           ` Daniel Phillips
  ` (3 more replies)
1 sibling, 4 replies; 152+ messages in thread
From: Anton Blanchard @ 2002-05-02 0:20 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Daniel Phillips, Russell King, linux-kernel, Jesse Barnes

> so ia64 is one of those archs with a ram layout with huge holes in the
> middle of the ram of the nodes? I'd be curious to know what's the
> hardware advantage of designing the ram layout in such a way, compared
> to all other numa archs that I deal with. Also if you know other archs
> with huge holes in the middle of the ram of the nodes I'd be curious to
> know about them too. thanks for the interesting info!

From arch/ppc64/kernel/iSeries_setup.c:

 * The iSeries may have very large memories ( > 128 GB ) and a partition
 * may get memory in "chunks" that may be anywhere in the 2**52 real
 * address space. The chunks are 256K in size.

Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
solution to this problem.

Anton
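[Editorial note: the MSCHUNKS idea Anton points at can be sketched as a per-chunk translation table mapping "absolute" 256KB chunk numbers to densely packed ones. The flat array below only works because the demo address space is tiny; a real 2**52 space cannot use a flat table, which is exactly why a generic hash or extent scheme is attractive. All names and the chunk map are hypothetical.]

```c
#include <stdint.h>

#define CHUNK_SHIFT	18			/* 256KB chunks */
#define CHUNK_MASK	((1u << CHUNK_SHIFT) - 1)
#define DEMO_CHUNKS	16			/* toy real address space */

/* absolute chunk number -> packed chunk number; -1 marks a hole */
static const int chunk_map[DEMO_CHUNKS] = {
	0, 1, -1, -1, 2, 3, -1, -1, 4, 5, -1, -1, -1, -1, 6, 7,
};

/* Translate a scattered "real" address into a densely packed one, so
 * mem_map can stay a single contiguous array over the packed space. */
static uint64_t abs_to_packed(uint64_t paddr)
{
	int packed = chunk_map[paddr >> CHUNK_SHIFT];

	if (packed < 0)
		return ~0ULL;	/* address falls in a hole */
	return ((uint64_t)packed << CHUNK_SHIFT) | (paddr & CHUNK_MASK);
}
```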
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02  0:20         ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
@ 2002-05-01  1:35         ` Daniel Phillips
2002-05-02  1:45           ` William Lee Irwin III
2002-05-02  1:46           ` Andrea Arcangeli
2002-05-02  1:01         ` Andrea Arcangeli
  ` (2 subsequent siblings)
3 siblings, 2 replies; 152+ messages in thread
From: Daniel Phillips @ 2002-05-01 1:35 UTC (permalink / raw)
To: Anton Blanchard, Andrea Arcangeli
Cc: Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 02:20, Anton Blanchard wrote:
> > so ia64 is one of those archs with a ram layout with huge holes in the
> > middle of the ram of the nodes? I'd be curious to know what's the
> > hardware advantage of designing the ram layout in such a way, compared
> > to all other numa archs that I deal with. Also if you know other archs
> > with huge holes in the middle of the ram of the nodes I'd be curious to
> > know about them too. thanks for the interesting info!
>
> From arch/ppc64/kernel/iSeries_setup.c:
>
>  * The iSeries may have very large memories ( > 128 GB ) and a partition
>  * may get memory in "chunks" that may be anywhere in the 2**52 real
>  * address space. The chunks are 256K in size.
>
> Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
> solution to this problem.

Using the config_nonlinear model, you'd change the four mapping functions:

	logical_to_phys
	phys_to_logical
	pagenum_to_phys
	phys_to_pagenum

to use a hash table instead of a table lookup. Bill Irwin suggested a btree
would work here as well.

(Note I'm trying out the term 'pagenum' instead of 'ordinal' here, following
comments on lse-tech.)

--
Daniel
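[Editorial note: one way to back the phys_to_pagenum() Daniel names with a hash, sketched below. An extent records where a physical fragment's pages begin in the packed page array, and the high bits of the physical address pick a bucket. Everything here (bucket count, granule size, the sample extent) is an illustrative assumption, not the config_nonlinear patch; note also that an extent spanning more than one 256MB granule would have to be linked into every bucket it touches.]

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT	12
#define HASH_SIZE	16

struct extent {
	uint64_t phys_base, size;	/* physical range covered */
	long pagenum_base;		/* first page number of the range */
	struct extent *next;		/* bucket chain */
};

/* hash on 256MB granules of physical address */
static unsigned int hash_phys(uint64_t paddr)
{
	return (unsigned int)((paddr >> 28) % HASH_SIZE);
}

/* one demo extent: 256MB at 2GB, pages numbered from 1000 */
static struct extent ext0 = { 0x80000000ULL, 0x10000000ULL, 1000, NULL };
static struct extent *hash_tab[HASH_SIZE] = {
	[8] = &ext0,	/* 8 == hash_phys(0x80000000) */
};

static long phys_to_pagenum(uint64_t paddr)
{
	struct extent *e;

	for (e = hash_tab[hash_phys(paddr)]; e; e = e->next)
		if (paddr - e->phys_base < e->size)
			return e->pagenum_base +
			       (long)((paddr - e->phys_base) >> PAGE_SHIFT);
	return -1;	/* hole: no struct page backs this address */
}
```

With a sensible table size each lookup touches one bucket with a chain of length one or two, regardless of how many fragments the machine has.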
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-01  1:35         ` Daniel Phillips
@ 2002-05-02  1:45         ` William Lee Irwin III
2002-05-01  2:02           ` Daniel Phillips
0 siblings, 1 reply; 152+ messages in thread
From: William Lee Irwin III @ 2002-05-02 1:45 UTC (permalink / raw)
To: Daniel Phillips
Cc: Anton Blanchard, Andrea Arcangeli, Russell King, linux-kernel,
	Jesse Barnes

On Wed, May 01, 2002 at 03:35:20AM +0200, Daniel Phillips wrote:
> to use a hash table instead of a table lookup. Bill Irwin suggested a btree
> would work here as well.

I remember suggesting a sorted array of extents on which binary
search could be performed. A B-tree seems unlikely but perhaps if
it were contiguously allocated and some other tricks done it might
do, maybe I don't remember the special sauce used for the occasion.

Cheers,
Bill
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02  1:45         ` William Lee Irwin III
@ 2002-05-01  2:02         ` Daniel Phillips
2002-05-02  2:33           ` William Lee Irwin III
0 siblings, 1 reply; 152+ messages in thread
From: Daniel Phillips @ 2002-05-01 2:02 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Anton Blanchard, Andrea Arcangeli, Russell King, linux-kernel,
	Jesse Barnes

On Thursday 02 May 2002 03:45, William Lee Irwin III wrote:
> On Wed, May 01, 2002 at 03:35:20AM +0200, Daniel Phillips wrote:
> > to use a hash table instead of a table lookup. Bill Irwin suggested a btree
> > would work here as well.
>
> I remember suggesting a sorted array of extents on which binary
> search could be performed. A B-tree seems unlikely but perhaps if
> it were contiguously allocated and some other tricks done it might
> do, maybe I don't remember the special sauce used for the occasion.

Thanks for the correction. When you said 'extents' I automatically thought
'btree of extents'. I'd tend to go for the hash table anyway - your binary
search is going to take quite a few more steps to terminate than the bucket
search, given some reasonable choice of hash table size.

--
Daniel
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-01 2:02 ` Daniel Phillips
@ 2002-05-02 2:33 ` William Lee Irwin III
2002-05-01 2:44 ` Daniel Phillips
0 siblings, 1 reply; 152+ messages in thread
From: William Lee Irwin III @ 2002-05-02 2:33 UTC (permalink / raw)
To: Daniel Phillips
Cc: Anton Blanchard, Andrea Arcangeli, Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 03:45, William Lee Irwin III wrote:
>> I remember suggesting a sorted array of extents on which binary
>> search could be performed. A B-tree seems unlikely but perhaps if
>> it were contiguously allocated and some other tricks done it might
>> do, maybe I don't remember the special sauce used for the occasion.

On Wed, May 01, 2002 at 04:02:33AM +0200, Daniel Phillips wrote:
> Thanks for the correction. When you said 'extents' I automatically thought
> 'btree of extents'. I'd tend to go for the hash table anyway - your binary
> search is going to take quite a few more steps to terminate than the bucket
> search, given some reasonable choice of hash table size.

It's probably motivated more by sheer terror of another huge hash table
sized proportional to memory eating the kernel virtual address space
alive than anything else. I probably should have used reverse psychology
instead.

I should note that the size of the array I suggested is not proportional
to memory, only to the number of fragments. It would probably only have
a distinct advantage in a situation where both the fragment sizes and
distributions are irregular; when the number of fragments is in fact
proportional to memory it gains little aside from a small factor of
compactness and/or in-core contiguity.

The hashing techniques that seem obvious to me effectively require some
sort of objects to back a direct mapping, which translates to per-page
overhead, which I'm very very picky about. I also like things to behave
gracefully about space and time when faced with irregular or "hostile"
layouts.

Actually, now that I think about it, a contiguously-allocated B-tree of
extents doesn't sound bad at all, even without additional dressing. Do
you think it's worth a try?

Cheers,
Bill

^ permalink raw reply [flat|nested] 152+ messages in thread
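The sorted-array-of-extents lookup Bill describes can be sketched roughly as follows; the struct layout, field names, and find_extent() are hypothetical illustrations for this discussion, not code from any posted patch.

```c
#include <assert.h>
#include <stddef.h>

/* One physically contiguous memory extent (hypothetical layout). */
struct extent {
    unsigned long phys_start;    /* first physical address in the extent */
    unsigned long phys_end;      /* one past the last physical address */
    unsigned long first_pagenum; /* page number of phys_start in the mem_map */
};

/* Binary search a sorted, non-overlapping extent array for the extent
 * containing paddr; returns NULL for addresses that fall in a hole. */
static const struct extent *find_extent(const struct extent *tab, size_t n,
                                        unsigned long paddr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (paddr < tab[mid].phys_start)
            hi = mid;
        else if (paddr >= tab[mid].phys_end)
            lo = mid + 1;
        else
            return &tab[mid];
    }
    return NULL;
}
```

The table only grows with the number of fragments, which is Bill's point: for a few irregular extents this is tiny, whereas a mem_map-style structure scales with total memory.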
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 2:33 ` William Lee Irwin III
@ 2002-05-01 2:44 ` Daniel Phillips
0 siblings, 0 replies; 152+ messages in thread
From: Daniel Phillips @ 2002-05-01 2:44 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Anton Blanchard, Andrea Arcangeli, Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 04:33, William Lee Irwin III wrote:
> Actually, now that I think about it, a contiguously-allocated B-tree of
> extents doesn't sound bad at all, even without additional dressing. Do
> you think it's worth a try?

If it solves a problem on a real machine, certainly.

--
Daniel

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-01 1:35 ` Daniel Phillips
2002-05-02 1:45 ` William Lee Irwin III
@ 2002-05-02 1:46 ` Andrea Arcangeli
2002-05-01 1:56 ` Daniel Phillips
1 sibling, 1 reply; 152+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 1:46 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Anton Blanchard, Russell King, linux-kernel, Jesse Barnes

On Wed, May 01, 2002 at 03:35:20AM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 02:20, Anton Blanchard wrote:
> > > so ia64 is one of those archs with a ram layout with huge holes in the
> > > middle of the ram of the nodes? I'd be curious to know what's the
> > > hardware advantage of designing the ram layout in such a way, compared
> > > to all other numa archs that I deal with. Also if you know other archs
> > > with huge holes in the middle of the ram of the nodes I'd be curious to
> > > know about them too. thanks for the interesting info!
> >
> > From arch/ppc64/kernel/iSeries_setup.c:
> >
> > * The iSeries may have very large memories ( > 128 GB ) and a partition
> > * may get memory in "chunks" that may be anywhere in the 2**52 real
> > * address space. The chunks are 256K in size.
> >
> > Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
> > solution to this problem.
>
> Using the config_nonlinear model, you'd change the four mapping functions:
>
> 	logical_to_phys
> 	phys_to_logical
> 	pagenum_to_phys
> 	phys_to_pagenum
>
> to use a hash table instead of a table lookup. Bill Irwin suggested a btree
> would work here as well.

btree? btree is not an interesting in core data structure. Anyways you
can use a btree with discontigmem too for the lookup. nonlinear will pay
off if you've something of the order of 256 discontigmem chunks with
significant holes in between like origin 2k, and I think it should be
resolved internally to the arch without exposing it to the common code.

>
> (Note I'm trying out the term 'pagenum' instead of 'ordinal' here, following
> comments on lse-tech.)
>
> --
> Daniel

Andrea

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 1:46 ` Andrea Arcangeli
@ 2002-05-01 1:56 ` Daniel Phillips
0 siblings, 0 replies; 152+ messages in thread
From: Daniel Phillips @ 2002-05-01 1:56 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Anton Blanchard, Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 03:46, Andrea Arcangeli wrote:
> On Wed, May 01, 2002 at 03:35:20AM +0200, Daniel Phillips wrote:
> > On Thursday 02 May 2002 02:20, Anton Blanchard wrote:
> > > > so ia64 is one of those archs with a ram layout with huge holes in the
> > > > middle of the ram of the nodes? I'd be curious to know what's the
> > > > hardware advantage of designing the ram layout in such a way, compared
> > > > to all other numa archs that I deal with. Also if you know other archs
> > > > with huge holes in the middle of the ram of the nodes I'd be curious to
> > > > know about them too. thanks for the interesting info!
> > >
> > > From arch/ppc64/kernel/iSeries_setup.c:
> > >
> > > * The iSeries may have very large memories ( > 128 GB ) and a partition
> > > * may get memory in "chunks" that may be anywhere in the 2**52 real
> > > * address space. The chunks are 256K in size.
> > >
> > > Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
> > > solution to this problem.
> >
> > Using the config_nonlinear model, you'd change the four mapping functions:
> >
> > 	logical_to_phys
> > 	phys_to_logical
> > 	pagenum_to_phys
> > 	phys_to_pagenum
> >
> > to use a hash table instead of a table lookup. Bill Irwin suggested a btree
> > would work here as well.
>
> btree? btree is not an interesting in core data structure.

Well, I didn't really like the btree for this application either, but I
see his point.

> Anyways you
> can use a btree with discontigmem too for the lookup. nonlinear will pay
> off if you've something of the order of 256 discontigmem chunks with
> significant holes in between like origin 2k, and I think it should be
> resolved internally to the arch without exposing it to the common code.

Those mapping functions are all defined per-arch, in page.h. The only
part of this patch that affects the common code is the new distinction
between logical and physical address spaces (which are the same when the
option isn't enabled).

--
Daniel

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 0:20 ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
2002-05-01 1:35 ` Daniel Phillips
@ 2002-05-02 1:01 ` Andrea Arcangeli
2002-05-02 15:28 ` Anton Blanchard
2002-05-02 23:05 ` Daniel Phillips
2002-05-03 23:52 ` David Mosberger
3 siblings, 1 reply; 152+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 1:01 UTC (permalink / raw)
To: Anton Blanchard; +Cc: Daniel Phillips, Russell King, linux-kernel, Jesse Barnes

On Thu, May 02, 2002 at 10:20:11AM +1000, Anton Blanchard wrote:
>
> > so ia64 is one of those archs with a ram layout with huge holes in the
> > middle of the ram of the nodes? I'd be curious to know what's the
> > hardware advantage of designing the ram layout in such a way, compared
> > to all other numa archs that I deal with. Also if you know other archs
> > with huge holes in the middle of the ram of the nodes I'd be curious to
> > know about them too. thanks for the interesting info!
>
> >From arch/ppc64/kernel/iSeries_setup.c:
>
> * The iSeries may have very large memories ( > 128 GB ) and a partition
> * may get memory in "chunks" that may be anywhere in the 2**52 real
> * address space. The chunks are 256K in size.
>
> Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
> solution to this problem.

is this machine a numa machine? If not then discontigmem will work just
fine. also it's a matter of administration, even if it's a numa machine
you can use it just optimally with discontigmem+numa. Regardless of what
we do if the partitioning is bad the kernel will do bad. If you create a
zillion discontiguous nodes of 256K each, you'd need to waste memory to
handle them regardless of nonlinear or discontigmem (with discontigmem
you will waste more memory than nonlinear yes, exactly because it's more
powerful, but I think a machine with a huge lot of non-contiguous 256K
chunks is misconfigured; it's like if you pretend to install linux on a
machine after you partitioned the HD with a thousand logical volumes of
256K each [for the sake of this example let's assume there are more than
256 LVs available in LVM], a sane partitioning requires you to have at
least a partition for /usr large 1 giga, depends what you're doing of
course, but requiring sane partitioning is an admin problem not a kernel
problem IMHO).

Andrea

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 1:01 ` Andrea Arcangeli
@ 2002-05-02 15:28 ` Anton Blanchard
2002-05-01 16:10 ` Daniel Phillips
` (2 more replies)
0 siblings, 3 replies; 152+ messages in thread
From: Anton Blanchard @ 2002-05-02 15:28 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Daniel Phillips, Russell King, linux-kernel, Jesse Barnes

> is this machine a numa machine? If not then discontigmem will work just
> fine. also it's a matter of administration, even if it's a numa machine
> you can use it just optimally with discontigmem+numa. Regardless of what
> we do if the partitioning is bad the kernel will do bad. If you create a
> zillion discontiguous nodes of 256K each, you'd need to waste memory to
> handle them regardless of nonlinear or discontigmem (with discontigmem
> you will waste more memory than nonlinear yes, exactly because it's more
> powerful, but I think a machine with a huge lot of non-contiguous 256K
> chunks is misconfigured; it's like if you pretend to install linux on a
> machine after you partitioned the HD with a thousand logical volumes of
> 256K each [for the sake of this example let's assume there are more than
> 256 LVs available in LVM], a sane partitioning requires you to have at
> least a partition for /usr large 1 giga, depends what you're doing of
> course, but requiring sane partitioning is an admin problem not a kernel
> problem IMHO).

It's not a NUMA machine, it's one that allows shared processor logical
partitions. While I would prefer the hypervisor to give each partition
a nice memory map (and internally maintain a mapping to real memory)
it does not. I can imagine if the machine has been up for many months
memory could become very fragmented.

Also when we do hotplug memory support will discontigmem be able to
efficiently handle memory turning up all over the place in the memory
map?

Anton

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 15:28 ` Anton Blanchard
@ 2002-05-01 16:10 ` Daniel Phillips
2002-05-02 15:59 ` Dave Engebretsen
2002-05-02 16:31 ` William Lee Irwin III
2 siblings, 0 replies; 152+ messages in thread
From: Daniel Phillips @ 2002-05-01 16:10 UTC (permalink / raw)
To: Anton Blanchard, Andrea Arcangeli
Cc: Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 17:28, Anton Blanchard wrote:
> > is this machine a numa machine? If not then discontigmem will work just
> > fine. also it's a matter of administration, even if it's a numa machine
> > you can use it just optimally with discontigmem+numa. Regardless of what
> > we do if the partitioning is bad the kernel will do bad. If you create a
> > zillion discontiguous nodes of 256K each, you'd need to waste memory to
> > handle them regardless of nonlinear or discontigmem (with discontigmem
> > you will waste more memory than nonlinear yes, exactly because it's more
> > powerful, but I think a machine with a huge lot of non-contiguous 256K
> > chunks is misconfigured; it's like if you pretend to install linux on a
> > machine after you partitioned the HD with a thousand logical volumes of
> > 256K each [for the sake of this example let's assume there are more than
> > 256 LVs available in LVM], a sane partitioning requires you to have at
> > least a partition for /usr large 1 giga, depends what you're doing of
> > course, but requiring sane partitioning is an admin problem not a kernel
> > problem IMHO).
>
> It's not a NUMA machine, it's one that allows shared processor logical
> partitions. While I would prefer the hypervisor to give each partition
> a nice memory map (and internally maintain a mapping to real memory)
> it does not. I can imagine if the machine has been up for many months
> memory could become very fragmented.
>
> Also when we do hotplug memory support will discontigmem be able to
> efficiently handle memory turning up all over the place in the memory
> map?

My proposal for support of extremely fragmented physical memory maps is
to use a hash table instead of a direct table lookup in the following
four functions:

	logical_to_phys
	phys_to_logical
	pagenum_to_phys
	phys_to_pagenum

With the page.h organization:

#ifdef CONFIG_NONLINEAR
#ifdef CONFIG_NONLINEAR_HASH
	<the hash table versions of above>
#else
	<the direct table mappings>
#endif
#else
/* Stub definitions */
#define logical_to_phys(p) (p)
#define phys_to_logical(p) (p)
#define ordinal_to_phys(n) ((n) << PAGE_SHIFT)
#define phys_to_ordinal(p) ((p) >> PAGE_SHIFT)
#endif

The hash tables need only be updated when the memory configuration
changes. In fact, we will likely only need the hash table in one
direction, in the case that the virtual memory map is less fragmented
than physical memory. Then we can use a direct table to go the other
direction, and we might want:

#ifdef CONFIG_NONLINEAR
#ifdef CONFIG_NONLINEAR_PHASH
	<the hash table versions of above>
#else
	<the direct table mappings>
#endif
#else
#define phys_to_logical(p) (p)
#define phys_to_ordinal(p) ((p) >> PAGE_SHIFT)
#endif

#ifdef CONFIG_NONLINEAR
#ifdef CONFIG_NONLINEAR_VHASH
	<the hash table versions of above>
#else
	<the direct table mappings>
#endif
#else
#define logical_to_phys(p) (p)
#define ordinal_to_phys(n) ((n) << PAGE_SHIFT)
#endif

These are all per-arch, though one of my goals is to reduce the
difference between arches, where it doesn't involve any compromise.

--
Daniel

^ permalink raw reply [flat|nested] 152+ messages in thread
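A minimal sketch of what a hash-table-backed phys_to_pagenum() might look like under this scheme - the open-addressed table, hash function, chunk size, and all names here are assumptions for illustration, not the actual config_nonlinear code:

```c
#include <assert.h>

#define CHUNK_SHIFT 18                  /* assume 256 KB chunks */
#define HASH_BITS   10
#define HASH_SIZE   (1UL << HASH_BITS)
#define EMPTY       (~0UL)

/* One slot maps a physical chunk number to the page number of the
 * chunk's first page; open addressing with linear probing. */
static unsigned long hash_key[HASH_SIZE]; /* physical chunk number */
static unsigned long hash_val[HASH_SIZE]; /* pagenum of chunk's first page */

static unsigned long hash_fn(unsigned long chunk)
{
    return (chunk * 2654435761UL) & (HASH_SIZE - 1);
}

static void hash_init(void)
{
    for (unsigned long i = 0; i < HASH_SIZE; i++)
        hash_key[i] = EMPTY;
}

/* Called only when the memory configuration changes. */
static void hash_insert(unsigned long chunk, unsigned long pagenum)
{
    unsigned long i = hash_fn(chunk);

    while (hash_key[i] != EMPTY)
        i = (i + 1) & (HASH_SIZE - 1);
    hash_key[i] = chunk;
    hash_val[i] = pagenum;
}

/* phys_to_pagenum(): physical address -> page number in the mem_map.
 * Assumes the chunk containing paddr was inserted at init time. */
static unsigned long phys_to_pagenum(unsigned long paddr)
{
    unsigned long chunk = paddr >> CHUNK_SHIFT;
    unsigned long i = hash_fn(chunk);

    while (hash_key[i] != chunk)
        i = (i + 1) & (HASH_SIZE - 1);
    return hash_val[i] + ((paddr >> 12) & ((1UL << (CHUNK_SHIFT - 12)) - 1));
}
```

Since the table is rebuilt only on configuration changes, the lookup path is a hash plus a short probe, which is the "bucket search" Daniel contrasts with the binary search above.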
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 15:28 ` Anton Blanchard
2002-05-01 16:10 ` Daniel Phillips
@ 2002-05-02 15:59 ` Dave Engebretsen
2002-05-01 17:24 ` Daniel Phillips
2002-05-02 16:31 ` William Lee Irwin III
2 siblings, 1 reply; 152+ messages in thread
From: Dave Engebretsen @ 2002-05-02 15:59 UTC (permalink / raw)
To: linux-kernel; +Cc: Daniel Phillips

Anton Blanchard wrote:
>
> > more than 256 LVs available in LVM], a sane partitioning requires you to
> > have at least a partition for /usr large 1 giga, depends what you're
> > doing of course, but requiring sane partitioning is an admin problem
> > not a kernel problem IMHO).
>
> It's not a NUMA machine, it's one that allows shared processor logical
> partitions. While I would prefer the hypervisor to give each partition
> a nice memory map (and internally maintain a mapping to real memory)
> it does not. I can imagine if the machine has been up for many months
> memory could become very fragmented.
>
> Also when we do hotplug memory support will discontigmem be able to
> efficiently handle memory turning up all over the place in the memory
> map?
>
> Anton

On this type of partitioned system where ppc64 runs, there is not much
administration that could be done to help the problem. As Anton
mentioned, when the system has been up for a long time, and memory has
been moving between partitions which support dynamic memory movement, it
is assured that memory will become very fragmented. As more partitions
on these systems become available, and resources migrate more freely,
the problem will get worse.

Whether this management from kernel to hardware addresses is done in the
hypervisor layer or the OS, the same overhead exists, given today's
hardware structure for PowerPC servers anyway.

In today's ppc64 implementation, we just use an array to map from what
the kernel sees as its address space to what is put in the hardware page
table and I/O translation tables, thus not requiring any changes in
independent code. This does consume some storage, but the highly
fragmented nature of our platform memory drives this decision. I would
like to see that data structure decision left to the archs, as different
platform design points may lead to different mapping decisions.

Dave.

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 15:59 ` Dave Engebretsen
@ 2002-05-01 17:24 ` Daniel Phillips
2002-05-02 16:44 ` Dave Engebretsen
0 siblings, 1 reply; 152+ messages in thread
From: Daniel Phillips @ 2002-05-01 17:24 UTC (permalink / raw)
To: Dave Engebretsen, linux-kernel

On Thursday 02 May 2002 17:59, Dave Engebretsen wrote:
> Anton Blanchard wrote:
> > Also when we do hotplug memory support will discontigmem be able to
> > efficiently handle memory turning up all over the place in the memory
> > map?
>
> On this type of partitioned system where ppc64 runs, there is not much
> administration that could be done to help the problem. As Anton
> mentioned, when the system has been up for a long time, and memory has
> been moving between partitions which support dynamic memory movement, it
> is assured that memory will become very fragmented. As more partitions
> on these systems become available, and resources migrate more freely,
> the problem will get worse.
>
> Whether this management from kernel to hardware addresses is done in the
> hypervisor layer or the OS, the same overhead exists, given today's
> hardware structure for PowerPC servers anyway. In today's ppc64
> implementation, we just use an array to map from what the kernel sees as
> its address space to what is put in the hardware page table and I/O
> translation tables, thus not requiring any changes in independent code.
> This does consume some storage, but the highly fragmented nature of our
> platform memory drives this decision. I would like to see that data
> structure decision left to the archs as different platform design points
> may lead to different mapping decisions.

And it is left up to the arch in my patch; I've simply imposed a little
more order on what, up till now, has been a pretty chaotic corner of the
kernel, and provided a template that satisfies a wider variety of needs
than the old one.

It sounds like the table translation you're doing in the hypervisor is
exactly what I've implemented in the kernel. One advantage of going with
the kernel's implementation is that you get the benefit of improvements
made to it, for example, the proposed hashing scheme to handle extremely
fragmented physical memory maps.

--
Daniel

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-01 17:24 ` Daniel Phillips
@ 2002-05-02 16:44 ` Dave Engebretsen
0 siblings, 0 replies; 152+ messages in thread
From: Dave Engebretsen @ 2002-05-02 16:44 UTC (permalink / raw)
To: Daniel Phillips; +Cc: linux-kernel

> And it is left up to the arch in my patch, I've simply imposed a little more
> order on what, up till now, has been a pretty chaotic corner of the kernel,
> and provided a template that satisfies a wider variety of needs than the old
> one.

Yep, got that - just reinforcing the point.

> It sounds like the table translation you're doing in the hypervisor is
> exactly what I've implemented in the kernel. One advantage of going with
> the kernel's implementation is that you get the benefit of improvements
> made to it, for example, the proposed hashing scheme to handle extremely
> fragmented physical memory maps.

I should clarify a bit -- we run on two different hypervisor interfaces.
The iSeries interface leaves this translation work to the OS. In that
case Linux has an array translation lookup which is analogous to your
patch. We just managed to hide everything in arch/ppc64 by doing this
lookup when inserting hashed page table and I/O table mappings. Other
than at that low level, the remappings are transparent to Linux -- it
just sees a nice big flat physical address space.

On pSeries, the hypervisor does the translation work under the covers,
but as you point out, Linux doesn't get the chance to play with
different mapping schemes. Then again, that does simplify my life ...

Dave.

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 15:28 ` Anton Blanchard
2002-05-01 16:10 ` Daniel Phillips
2002-05-02 15:59 ` Dave Engebretsen
@ 2002-05-02 16:31 ` William Lee Irwin III
2002-05-02 16:21 ` Dave Engebretsen
2 siblings, 1 reply; 152+ messages in thread
From: William Lee Irwin III @ 2002-05-02 16:31 UTC (permalink / raw)
To: Anton Blanchard
Cc: Andrea Arcangeli, Daniel Phillips, Russell King, linux-kernel, Jesse Barnes

On Fri, May 03, 2002 at 01:28:25AM +1000, Anton Blanchard wrote:
> Also when we do hotplug memory support will discontigmem be able to
> efficiently handle memory turning up all over the place in the memory
> map?

Would the flip side of that coin perhaps be implementing a way to be a
good logically partitioned citizen and cooperatively offline memory?

Cheers,
Bill

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 16:31 ` William Lee Irwin III
@ 2002-05-02 16:21 ` Dave Engebretsen
2002-05-02 17:28 ` William Lee Irwin III
0 siblings, 1 reply; 152+ messages in thread
From: Dave Engebretsen @ 2002-05-02 16:21 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel

William Lee Irwin III wrote:
>
> On Fri, May 03, 2002 at 01:28:25AM +1000, Anton Blanchard wrote:
> > Also when we do hotplug memory support will discontigmem be able to
> > efficiently handle memory turning up all over the place in the memory
> > map?
>
> Would the flip side of that coin perhaps be implementing a way to be a
> good logically partitioned citizen and cooperatively offline memory?
>
> Cheers,
> Bill

Yes, both add and remove are needed to be a good citizen. One could
spend all kinds of time coming up with good heuristics to do that
automatically :)

At a minimum, manual offlining of memory would be nice.

Dave.

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 16:21 ` Dave Engebretsen
@ 2002-05-02 17:28 ` William Lee Irwin III
0 siblings, 0 replies; 152+ messages in thread
From: William Lee Irwin III @ 2002-05-02 17:28 UTC (permalink / raw)
To: Dave Engebretsen; +Cc: linux-kernel

William Lee Irwin III wrote:
>> Would the flip side of that coin perhaps be implementing a way to be a
>> good logically partitioned citizen and cooperatively offline memory?
>> Cheers,
>> Bill

On Thu, May 02, 2002 at 11:21:59AM -0500, Dave Engebretsen wrote:
> Yes, both add and remove are needed to be a good citizen. One could
> spend all kinds of time coming up with good heuristics to do that
> automatically :)
> At a minimum, manual offlining of memory would be nice.
> Dave.

I have a particular interest in the implementation of at least one
mechanism in the kernel (rmap) which could be exploited to assist in
this. If there are other efforts in progress toward this end I'd be
happy to investigate methods of using the additional machinery provided
by rmap to assist in this.

Cheers,
Bill

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 0:20 ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
2002-05-01 1:35 ` Daniel Phillips
2002-05-02 1:01 ` Andrea Arcangeli
@ 2002-05-02 23:05 ` Daniel Phillips
2002-05-03 0:05 ` William Lee Irwin III
2002-05-03 23:52 ` David Mosberger
3 siblings, 1 reply; 152+ messages in thread
From: Daniel Phillips @ 2002-05-02 23:05 UTC (permalink / raw)
To: Anton Blanchard, Andrea Arcangeli
Cc: Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 02:20, Anton Blanchard wrote:
> > so ia64 is one of those archs with a ram layout with huge holes in the
> > middle of the ram of the nodes? I'd be curious to know what's the
> > hardware advantage of designing the ram layout in such a way, compared
> > to all other numa archs that I deal with. Also if you know other archs
> > with huge holes in the middle of the ram of the nodes I'd be curious to
> > know about them too. thanks for the interesting info!
>
> From arch/ppc64/kernel/iSeries_setup.c:
>
> * The iSeries may have very large memories ( > 128 GB ) and a partition
> * may get memory in "chunks" that may be anywhere in the 2**52 real
> * address space. The chunks are 256K in size.
>
> Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
> solution to this problem.

Hmm, I just re-read your numbers above. Supposing you have 256 GB of
'installed' memory, divided into 256K chunks at random places in the 52
bit address space, a hash table with 1M entries could map all that
physical memory. You'd need 16 bytes or so per hash table entry, making
the table 16MB in size. This would be about 0.006% of memory.

More-or-less equivalently, a tree could be used, with the tradeoff being
a little better locality of reference vs more search steps. The hash
structure can also be tweaked to improve locality by making each hash
entry map several adjacent memory chunks, and hoping that the chunks
tend to occur in groups, which they most probably do.

I'm offering the hash table, combined with config_nonlinear, as a
generic solution.

--
Daniel

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
2002-05-02 23:05 ` Daniel Phillips
@ 2002-05-03 0:05 ` William Lee Irwin III
2002-05-03 1:19 ` Daniel Phillips
0 siblings, 1 reply; 152+ messages in thread
From: William Lee Irwin III @ 2002-05-03 0:05 UTC (permalink / raw)
To: Daniel Phillips
Cc: Anton Blanchard, Andrea Arcangeli, Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 02:20, Anton Blanchard wrote:
>> From arch/ppc64/kernel/iSeries_setup.c:
>> * The iSeries may have very large memories ( > 128 GB ) and a partition
>> * may get memory in "chunks" that may be anywhere in the 2**52 real
>> * address space. The chunks are 256K in size.
>> Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
>> solution to this problem.

On Fri, May 03, 2002 at 01:05:45AM +0200, Daniel Phillips wrote:
> Hmm, I just re-read your numbers above. Supposing you have 256 GB of
> 'installed' memory, divided into 256K chunks at random places in the 52
> bit address space, a hash table with 1M entries could map all that
> physical memory. You'd need 16 bytes or so per hash table entry, making
> the table 16MB in size. This would be about 0.006% of memory.

Doh! I made all that noise about "contiguously allocated" and the
relaxation of the contiguous allocation requirement on the aggregate was
the whole reason I liked trees so much! Regardless, if there's virtual
contiguity the table can work, and what can I say, it's not my patch,
and there probably isn't a real difference given that your ratio to
memory size is probably small enough to cope.

On Fri, May 03, 2002 at 01:05:45AM +0200, Daniel Phillips wrote:
> More-or-less equivalently, a tree could be used, with the tradeoff being
> a little better locality of reference vs more search steps. The hash
> structure can also be tweaked to improve locality by making each hash
> entry map several adjacent memory chunks, and hoping that the chunks tend
> to occur in groups, which they most probably do.
> I'm offering the hash table, combined with config_nonlinear as a generic
> solution.

Is the virtual remapping for virtual contiguity available at the time
this remapping table is set up? A 1M-entry table is larger than the
largest available fragment of physically contiguous memory even at
1B/entry. If it's used to direct the virtual remapping you might need
to perform some arch-specific bootstrapping phases.

Also, what is the recourse of a boot-time allocated table when it
overflows due to the onlining of sufficient physical memory? Or are
there pointer links within the table entries so as to provide collision
chains? If so, then the memory requirements are even larger... If you
limit the size of the table to consume an entire hypervisor-allocated
memory fragment would that not require boot-time allocation of a fresh
chunk from the hypervisor and virtually mapping the new chunk? How do
you know what the size of the table should be if the number of chunks
varies dramatically?

Cheers,
Bill

^ permalink raw reply [flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] 2002-05-03 0:05 ` William Lee Irwin III @ 2002-05-03 1:19 ` Daniel Phillips 2002-05-03 19:47 ` Dave Engebretsen 0 siblings, 1 reply; 152+ messages in thread From: Daniel Phillips @ 2002-05-03 1:19 UTC (permalink / raw) To: William Lee Irwin III Cc: Anton Blanchard, Andrea Arcangeli, linux-kernel, Jesse Barnes On Friday 03 May 2002 02:05, William Lee Irwin III wrote: > On Fri, May 03, 2002 at 01:05:45AM +0200, Daniel Phillips wrote: > > More-or-less equivalently, a tree could be used, with the tradeoff being > > a little better locality of reference vs more search steps. The hash > > structure can also be tweaked to improve locality by making each hash > > entry map several adjacent memory chunks, and hoping that the chunks tend > > to occur in groups, which they most probably do. > > I'm offering the hash table, combined with config_nonlinear as a generic > > solution. > > Is the virtual remapping for virtual contiguity available at the time > this remapping table is set up? A 1M-entry table is larger than the > largest available fragment of physically contiguous memory even at > 1B/entry. If it's used to direct the virtual remapping you might need > to perform some arch-specific bootstrapping phases. Interesting point. Fortunately, the logical_to_phys table doesn't have to be a hash, making it considerably smaller. Then we get to the interesting part: allocating the phys_to_logical hash table. The boot loader must have provided at least some contiguous physical memory in order to load the kernel, the compressed disk image and give us a little working memory. (For practical purposes, we're most likely to have been provided with a full gig, or whatever is appropriate according to the mem= command line setting, but lets pretend it's a lot less than that.) Now, the first thing we need to do is fill in enough of the vsection table to allocate the table itself. 
Fortunately, the bottom part of the table is the part we need to fill in, and we surely have enough memory to do that. We just have to be sure that the process of filling it in doesn't require any bootmem allocations, which is not so hard - we the existing memory initialization code already has to obey that requirement. Naturally, during initialization of the hash table, we want to be sure not to perform and phys_to_logical translations, as would be required to read values from the page tables during swap-out for example. Probably there's already no possibility of that, but it needs a comment at least. I can't provide any more details than that, because I'm not familiar with the way the iseries boots. Anton is the man there. > Also, what is the > recourse of a boot-time allocated table when it overflows due to the > onlining of sufficient physical memory? We ought to have some clue about the maximum number of physical memory chunks available to us. I doubt *every* partition is going to be provided 256 GB of memory. In fact, the real amount we need will be considerably less, and the phys_to_logical table will be smaller than 16 MB, say, 1 MB. Just allocate the whole thing and be done with it. > Or are there pointer links > within the table entries so as to provide collision chains? For this one I'd think a classic, nonlist, hash table is the way to go. At 16 bytes, I overestimated the per-entry size, really 8 bytes is more realistic. We need 34 bits for the key field (52 bit physical range, less 18 bits chunk size) and considerably less than 32 bits for the value field (a logical section) so it works out nicely. > If so, > then the memory requirements are even larger... If you limit the size > of the table to consume an entire hypervisor-allocated memory fragment > would that not require boot-time allocation of a fresh chunk from the > hypervisor and virtually mapping the new chunk? 
I think that the bootstrapping method described above is sufficiently
simple and robust to obviate this requirement.

> How do you know what
> the size of the table should be if the number of chunks varies
> dramatically?

The most obvious and practical approach is to have the boot loader tell
us; we allocate the maximum size needed, and won't worry about that
again.

--
Daniel

^ permalink raw reply	[flat|nested] 152+ messages in thread
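The "classic, nonlist" 8-byte-entry hash table Daniel describes could look roughly like the sketch below. This is only an illustration of the idea: every identifier here (chunk_hash, chunk_insert, phys_to_logical as a C function, and so on) is invented for this example, not taken from any kernel tree, and the table order is picked to match the 1 MB estimate.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the hash table discussed above: a 52-bit physical range
 * less an 18-bit (256K) chunk size leaves a 34-bit key; the value (a
 * logical section number) fits in the remaining 30 bits of an 8-byte
 * entry.  Open addressing with linear probing -- no collision chains,
 * hence no pointer links inside the entries. */
#define CHUNK_SHIFT	18			/* 256K chunks */
#define KEY_BITS	34
#define TABLE_ORDER	17			/* 128K slots * 8B = 1MB */
#define TABLE_SIZE	(1UL << TABLE_ORDER)
#define EMPTY_KEY	((1ULL << KEY_BITS) - 1)  /* reserved "free" marker */

struct chunk_ent {
	uint64_t key   : KEY_BITS;		/* physical chunk number */
	uint64_t value : 64 - KEY_BITS;		/* logical section */
};

static struct chunk_ent chunk_hash[TABLE_SIZE];

static size_t hash_slot(uint64_t key)
{
	/* Fibonacci hashing: multiply and keep the top TABLE_ORDER bits. */
	return (size_t)((key * 0x9E3779B97F4A7C15ULL) >> (64 - TABLE_ORDER));
}

void chunk_hash_init(void)
{
	for (size_t i = 0; i < TABLE_SIZE; i++)
		chunk_hash[i].key = EMPTY_KEY;
}

void chunk_insert(uint64_t phys, uint64_t logical_section)
{
	uint64_t key = phys >> CHUNK_SHIFT;
	size_t slot = hash_slot(key);

	while (chunk_hash[slot].key != EMPTY_KEY)	/* linear probe */
		slot = (slot + 1) & (TABLE_SIZE - 1);
	chunk_hash[slot].key = key;
	chunk_hash[slot].value = logical_section;
}

uint64_t phys_to_logical(uint64_t phys)
{
	uint64_t key = phys >> CHUNK_SHIFT;
	size_t slot = hash_slot(key);

	while (chunk_hash[slot].key != key)		/* assumes key present */
		slot = (slot + 1) & (TABLE_SIZE - 1);
	return chunk_hash[slot].value;
}
```

With 2^17 slots of 8 bytes each the table occupies the 1 MB Daniel estimates, and since the table is sized for the worst case at boot, the overflow question wli raises never arises at run time.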
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  1:19 ` Daniel Phillips
@ 2002-05-03 19:47 ` Dave Engebretsen
  2002-05-03 22:06 ` Daniel Phillips
  0 siblings, 1 reply; 152+ messages in thread
From: Dave Engebretsen @ 2002-05-03 19:47 UTC (permalink / raw)
To: Daniel Phillips; +Cc: William Lee Irwin III, Andrea Arcangeli, linux-kernel

Daniel Phillips wrote:
>
> The boot loader must have provided at least some contiguous physical
> memory in order to load the kernel, the compressed disk image and give
> us a little working memory. (For practical purposes, we're most likely to
> have been provided with a full gig, or whatever is appropriate according
> to the mem= command line setting, but let's pretend it's a lot less than
> that.) Now, the first thing we need to do is fill in enough of the
...
>
> Naturally, during initialization of the hash table, we want to be sure
> not to perform any phys_to_logical translations, as would be required to
> read values from the page tables during swap-out for example. Probably
> there's already no possibility of that, but it needs a comment at least.
>
> I can't provide any more details than that, because I'm not familiar
> with the way the iSeries boots. Anton is the man there.

The way it works on iSeries is the HV provides a 64MB physically
contiguous load area. The kernel & working storage, including the
logical->physical (or absolute, in our terms) translation, must fit in
this space. Even with a 256KB chunk size and a simple array for
translation, the memory consumption is not excessive. Each array entry
is 32 bits, allowing 32+12 (page offset) = 44 bits of physical
addressability, and a 1MB array allows 32GB of translations.

We don't need the reverse translation on iSeries as the kernel never
knows about the actual hardware address, other than when putting an
entry in the hardware page tables (processor and I/O).
One other thing to note is that Linux _always_ runs with relocation
enabled on iSeries, so there is never a point, other than the case I
mention above, when the hardware address matters.

> We ought to have some clue about the maximum number of physical memory
> chunks available to us. I doubt *every* partition is going to be
> provided 256 GB of memory. In fact, the real amount we need will be
> considerably less, and the phys_to_logical table will be smaller than
> 16 MB, say, 1 MB. Just allocate the whole thing and be done with it.
...
>
> > How do you know what
> > the size of the table should be if the number of chunks varies
> > dramatically?
>
> The most obvious and practical approach is to have the boot loader tell
> us, we allocate the maximum size needed, and won't worry about that
> again.
>

Yes, we do know the maximum memory possible, both system wide and, more
importantly, within a partition. In fact, the way it works today, a
partition is defined to the hypervisor with a current memory size to use
and a max memory size. The max is required because the hardware page
table for PowerPC is allocated to the max size before the partition
boots. Because the page table must be physically contiguous, it is
allocated for the partition when the system boots. The size of the Linux
translation tables is a similar issue, where the worst case should just
be considered and allocated at Linux boot time.

Dave Engebretsen

^ permalink raw reply	[flat|nested] 152+ messages in thread
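The flat 32-bit-per-chunk array Dave describes can be sketched as below. This is a hypothetical illustration of the scheme, not the real iSeries/ppc64 CONFIG_MSCHUNKS code; chunk_map and logical_to_absolute are invented names, and the array is sized arbitrarily for the example.

```c
#include <stdint.h>

/* Sketch of a flat translation array in the style described above:
 * one 32-bit entry per 256KB chunk, mapping a zero-based logical chunk
 * number (what Linux sees) to an absolute, hypervisor-assigned chunk
 * number somewhere in the large real address space.  All names here
 * are illustrative. */
#define CHUNK_SHIFT	18				/* 256KB chunks */
#define CHUNK_MASK	((1UL << CHUNK_SHIFT) - 1)

static uint32_t chunk_map[1 << 12];			/* sized for the sketch */

static inline uint64_t logical_to_absolute(uint64_t laddr)
{
	uint64_t chunk = laddr >> CHUNK_SHIFT;		/* logical chunk number */

	/* Swap in the absolute chunk number, keep the offset. */
	return ((uint64_t)chunk_map[chunk] << CHUNK_SHIFT) |
	       (laddr & CHUNK_MASK);
}
```

Since the table is a plain array indexed by logical chunk number, sizing it for the partition's maximum memory at boot, as Dave suggests, is cheap: one 4-byte entry per 256KB chunk.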
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03 19:47 ` Dave Engebretsen
@ 2002-05-03 22:06 ` Daniel Phillips
  0 siblings, 0 replies; 152+ messages in thread
From: Daniel Phillips @ 2002-05-03 22:06 UTC (permalink / raw)
To: Dave Engebretsen; +Cc: William Lee Irwin III, Andrea Arcangeli, linux-kernel

On Friday 03 May 2002 21:47, Dave Engebretsen wrote:
> We don't need the reverse translation on iSeries as the kernel never
> knows about the actual hardware address, other than when putting an
> entry in the hardware page tables (processor and I/O).

So the kernel page tables are carrying what I'd call a logical address,
that is, zero-based, indexing your logical-to-physical table (physical
taken in a non-literal sense). This would suggest that your current
arrangement is a strict subset of my current config_nonlinear design,
flat table and all, but with phys_to_pagenum defined as a compile-time
error.

--
Daniel

^ permalink raw reply	[flat|nested] 152+ messages in thread
* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  0:20 ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
  ` (2 preceding siblings ...)
  2002-05-02 23:05 ` Daniel Phillips
@ 2002-05-03 23:52 ` David Mosberger
  3 siblings, 0 replies; 152+ messages in thread
From: David Mosberger @ 2002-05-03 23:52 UTC (permalink / raw)
To: Anton Blanchard
Cc: Andrea Arcangeli, Daniel Phillips, Russell King, linux-kernel, Jesse Barnes

[Looks like this buffer was lying dormant in my Emacs and never sent.
Hence the delay... ;-) ]

>>>>> On Thu, 2 May 2002 10:20:11 +1000, Anton Blanchard <anton@samba.org> said:

>> so ia64 is one of those archs with a ram layout with huge holes
>> in the middle of the ram of the nodes? I'd be curious to know
>> what's the hardware advantage of designing the ram layout in such
>> a way, compared to all other numa archs that I deal with. Also if
>> you know other archs with huge holes in the middle of the ram of
>> the nodes I'd be curious to know about them too. thanks for the
>> interesting info!

Anton> From arch/ppc64/kernel/iSeries_setup.c:

Anton>  * The iSeries may have very large memories ( > 128 GB ) and a partition
Anton>  * may get memory in "chunks" that may be anywhere in the 2**52 real
Anton>  * address space. The chunks are 256K in size.

Anton> Also check out CONFIG_MSCHUNKS code and see why I'd love to
Anton> see a generic solution to this problem.

Me too.

HP's zx1 platform also has a rather giant hole above the 1GB boundary.
I don't know the exact reasons for this hole, but it's related to the
fact that (many) PCI devices need <4GB memory.

The current solution for zx1 is to place the mem_map in virtual memory.
This obviously increases TLB pressure when touching lots of mem_map[]
entries randomly, but I haven't really seen any benchmarks so far (real
or artificial) where this has a significant performance effect.
The nice part of this approach is that it is a rather general solution,
provided the kernel's page-table mapped address space is sufficiently
big.

	--david

^ permalink raw reply	[flat|nested] 152+ messages in thread
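The virtually mapped mem_map David describes restores simple pointer arithmetic for pfn-to-page conversion, because holes in physical memory become merely unbacked stretches of the virtual array. A schematic sketch follows; all constants and identifiers here are chosen for illustration and are not the real ia64/zx1 definitions.

```c
/* Schematic sketch of a virtually contiguous mem_map: the struct page
 * array lives in a dedicated region of kernel virtual address space,
 * one entry per possible pfn.  Physical holes correspond to parts of
 * the array that are never backed by real pages, so pfn -> struct page
 * remains a plain array index (at the cost of extra TLB entries when
 * mem_map[] is touched randomly).  All values below are illustrative. */
struct page { unsigned long flags; /* ... */ };

#define PAGE_SHIFT	12
#define PAGE_OFFSET	0x40000000UL			/* illustrative */
#define VMEMMAP_BASE	((struct page *)0x90000000UL)	/* illustrative */

#define __pa(x)		((unsigned long)(x) - PAGE_OFFSET)
#define pfn_to_page(pfn)	(VMEMMAP_BASE + (pfn))
#define page_to_pfn(pg)		((unsigned long)((pg) - VMEMMAP_BASE))
#define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
```

Unlike the discontigmem virt_to_page() discussed at the start of this thread, there is no per-node lookup here at all; the MMU absorbs the discontiguity, which is why the approach generalizes so well when virtual address space is plentiful.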
end of thread, other threads:[~2002-05-10 0:13 UTC | newest] Thread overview: 152+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2002-04-26 18:27 Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Russell King 2002-04-26 22:46 ` Andrea Arcangeli 2002-04-29 17:50 ` Martin J. Bligh 2002-04-29 22:00 ` Roman Zippel 2002-04-30 0:43 ` Andrea Arcangeli 2002-04-27 22:10 ` Daniel Phillips 2002-04-29 13:35 ` Andrea Arcangeli 2002-04-29 23:02 ` Daniel Phillips 2002-05-01 2:23 ` Andrea Arcangeli 2002-04-30 23:12 ` Daniel Phillips 2002-05-01 1:05 ` Daniel Phillips 2002-05-02 0:47 ` Andrea Arcangeli 2002-05-01 1:26 ` Daniel Phillips 2002-05-02 1:43 ` Andrea Arcangeli 2002-05-01 2:41 ` Daniel Phillips 2002-05-02 13:34 ` Andrea Arcangeli 2002-05-02 15:18 ` Martin J. Bligh 2002-05-02 15:35 ` Andrea Arcangeli 2002-05-01 15:42 ` Daniel Phillips 2002-05-02 16:06 ` Andrea Arcangeli 2002-05-02 16:10 ` Martin J. Bligh 2002-05-02 16:40 ` Andrea Arcangeli 2002-05-02 17:16 ` William Lee Irwin III 2002-05-02 18:41 ` Andrea Arcangeli 2002-05-02 19:19 ` William Lee Irwin III 2002-05-02 19:27 ` Daniel Phillips 2002-05-02 19:38 ` William Lee Irwin III 2002-05-02 19:58 ` Daniel Phillips 2002-05-03 6:28 ` Andrea Arcangeli 2002-05-03 6:10 ` Andrea Arcangeli 2002-05-02 22:20 ` Martin J. Bligh 2002-05-02 21:28 ` William Lee Irwin III 2002-05-02 21:52 ` Kurt Ferreira 2002-05-02 21:55 ` William Lee Irwin III 2002-05-03 6:38 ` Andrea Arcangeli 2002-05-03 6:58 ` Martin J. Bligh 2002-05-03 6:04 ` Andrea Arcangeli 2002-05-03 6:33 ` Martin J. Bligh 2002-05-03 8:38 ` Andrea Arcangeli 2002-05-03 9:26 ` William Lee Irwin III 2002-05-03 15:38 ` Martin J. Bligh 2002-05-03 15:17 ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh 2002-05-03 15:58 ` Andrea Arcangeli 2002-05-03 16:10 ` Martin J. 
Bligh 2002-05-03 16:25 ` Andrea Arcangeli 2002-05-03 16:02 ` Daniel Phillips 2002-05-03 16:20 ` Andrea Arcangeli 2002-05-03 16:41 ` Daniel Phillips 2002-05-03 16:58 ` Andrea Arcangeli 2002-05-03 18:08 ` Daniel Phillips 2002-05-03 9:24 ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] William Lee Irwin III 2002-05-03 10:30 ` Andrea Arcangeli 2002-05-03 11:09 ` William Lee Irwin III 2002-05-03 11:27 ` Andrea Arcangeli 2002-05-03 15:42 ` Martin J. Bligh 2002-05-03 15:32 ` Martin J. Bligh 2002-05-02 19:22 ` Daniel Phillips 2002-05-03 6:06 ` Andrea Arcangeli 2002-05-02 18:25 ` Daniel Phillips 2002-05-02 18:44 ` Andrea Arcangeli 2002-05-02 19:31 ` Martin J. Bligh 2002-05-02 18:57 ` Andrea Arcangeli 2002-05-02 19:08 ` Daniel Phillips 2002-05-03 5:15 ` Andrea Arcangeli 2002-05-05 23:54 ` Daniel Phillips 2002-05-06 0:28 ` Andrea Arcangeli 2002-05-06 0:34 ` Daniel Phillips 2002-05-06 1:01 ` Andrea Arcangeli 2002-05-06 0:55 ` Russell King 2002-05-06 1:07 ` Daniel Phillips 2002-05-06 1:20 ` Andrea Arcangeli 2002-05-06 1:24 ` Daniel Phillips 2002-05-06 1:42 ` Andrea Arcangeli 2002-05-06 1:48 ` Daniel Phillips 2002-05-06 2:06 ` Andrea Arcangeli 2002-05-06 17:40 ` Daniel Phillips 2002-05-06 19:09 ` Martin J. Bligh 2002-05-06 1:09 ` Andrea Arcangeli 2002-05-06 1:13 ` Daniel Phillips 2002-05-06 2:03 ` Daniel Phillips 2002-05-06 2:31 ` Andrea Arcangeli 2002-05-06 8:57 ` Russell King 2002-05-06 8:54 ` Roman Zippel 2002-05-06 15:26 ` Daniel Phillips 2002-05-06 19:07 ` Roman Zippel 2002-05-08 15:57 ` Daniel Phillips 2002-05-08 23:11 ` Roman Zippel 2002-05-09 16:08 ` Daniel Phillips 2002-05-09 22:06 ` Roman Zippel 2002-05-09 22:22 ` Daniel Phillips 2002-05-09 23:00 ` Roman Zippel 2002-05-09 23:22 ` Daniel Phillips 2002-05-10 0:13 ` Roman Zippel 2002-05-02 22:39 ` Martin J. Bligh 2002-05-03 7:04 ` Andrea Arcangeli 2002-05-02 23:42 ` Daniel Phillips 2002-05-03 7:45 ` Andrea Arcangeli 2002-05-02 16:07 ` Martin J. 
Bligh 2002-05-02 16:58 ` Gerrit Huizenga 2002-05-02 18:10 ` Andrea Arcangeli 2002-05-02 19:28 ` Gerrit Huizenga 2002-05-02 22:23 ` Martin J. Bligh 2002-05-03 6:20 ` Andrea Arcangeli 2002-05-03 6:39 ` Martin J. Bligh 2002-05-02 16:00 ` William Lee Irwin III 2002-05-02 2:37 ` William Lee Irwin III 2002-05-02 15:59 ` Andrea Arcangeli 2002-05-02 16:06 ` William Lee Irwin III 2002-05-01 18:05 ` Jesse Barnes 2002-05-01 23:17 ` Andrea Arcangeli 2002-05-01 23:23 ` discontiguous memory platforms Jesse Barnes 2002-05-02 0:51 ` Ralf Baechle 2002-05-02 1:27 ` Andrea Arcangeli 2002-05-02 1:32 ` Ralf Baechle 2002-05-02 8:50 ` Roman Zippel 2002-05-01 13:21 ` Daniel Phillips 2002-05-02 14:00 ` Roman Zippel 2002-05-01 14:08 ` Daniel Phillips 2002-05-02 17:56 ` Roman Zippel 2002-05-01 17:59 ` Daniel Phillips 2002-05-02 18:26 ` Roman Zippel 2002-05-02 18:32 ` Daniel Phillips 2002-05-02 19:40 ` Roman Zippel 2002-05-02 20:14 ` Daniel Phillips 2002-05-03 6:34 ` Andrea Arcangeli 2002-05-03 9:33 ` Roman Zippel 2002-05-03 6:30 ` Andrea Arcangeli 2002-05-02 18:35 ` Geert Uytterhoeven 2002-05-02 18:39 ` Daniel Phillips 2002-05-02 0:20 ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard 2002-05-01 1:35 ` Daniel Phillips 2002-05-02 1:45 ` William Lee Irwin III 2002-05-01 2:02 ` Daniel Phillips 2002-05-02 2:33 ` William Lee Irwin III 2002-05-01 2:44 ` Daniel Phillips 2002-05-02 1:46 ` Andrea Arcangeli 2002-05-01 1:56 ` Daniel Phillips 2002-05-02 1:01 ` Andrea Arcangeli 2002-05-02 15:28 ` Anton Blanchard 2002-05-01 16:10 ` Daniel Phillips 2002-05-02 15:59 ` Dave Engebretsen 2002-05-01 17:24 ` Daniel Phillips 2002-05-02 16:44 ` Dave Engebretsen 2002-05-02 16:31 ` William Lee Irwin III 2002-05-02 16:21 ` Dave Engebretsen 2002-05-02 17:28 ` William Lee Irwin III 2002-05-02 23:05 ` Daniel Phillips 2002-05-03 0:05 ` William Lee Irwin III 2002-05-03 1:19 ` Daniel Phillips 2002-05-03 19:47 ` Dave Engebretsen 2002-05-03 22:06 ` Daniel Phillips 2002-05-03 23:52 ` David Mosberger