* [PATCH 0/7] Fragmentation Avoidance V19
@ 2005-10-30 18:33 Mel Gorman
From: Mel Gorman @ 2005-10-30 18:33 UTC
To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel
Hi Andrew,
This is the latest release of the fragmentation avoidance patches with no
code changes since v18. If it is possible, I would like to get this into -mm,
so this patch is generated against the latest -mm tree 2.6.14-rc5-mm1 and
is known to apply cleanly. If there is another tree that should be diffed
against instead, just say so and I'll send another version.
Here are a few brief reasons why this set of patches is useful;
o Reduced fragmentation improves the chance a large order allocation succeeds
o General-purpose memory hotplug needs the page/memory groupings provided
o Reduces the number of badly-placed pages that the page migration mechanism
must deal with. This also applies to any active page defragmentation mechanism.
o This patch is a pre-requisite for a linear scanning mechanism that could
be used to guarantee large-page allocations
Built and tested successfully on a single-processor AMD machine, a quad
processor Xeon machine and PPC64. Benchmarks were generated on the Xeon machine.
Changelog since v18
o Resync against 2.6.14-rc5-mm1
o 004_markfree dropped
o Documentation note added on the behavior of free_area.nr_free
Changelog since v17
o Update to 2.6.14-rc4-mm1
o Remove explicit casts where implicit casts were in place
o Change __GFP_USER to __GFP_EASYRCLM, RCLM_USER to RCLM_EASY and PCPU_USER to
PCPU_EASY
o Print a warning and return NULL if both RCLM flags are set in the GFP flags
o Reduce size of fallback_allocs
o Change magic number 64 to FREE_AREA_USEMAP_SIZE
o CodingStyle regressions cleanup
o Move sparsemem setup_usemap() out of header
o Changed fallback_balance to a mechanism that depended on zone->present_pages
to avoid hotplug problems later
o Many superfluous parentheses removed
Changelog since v16
o Variables used in bit operations are now unsigned long. Note that when used
as indices, they remain integers and are cast to unsigned long where
necessary, because aim9 shows a regression (~10% slowdown) when unsigned
longs are used throughout
o 004_showfree added to provide more debugging information
o 008_stats dropped. Even with CONFIG_ALLOCSTATS disabled, it caused
severe performance regressions; no explanation has been found yet
o for_each_rclmtype_order moved to header
o More coding style cleanups
Changelog since V14 (V15 not released)
o Update against 2.6.14-rc3
o Resync with Joel's work. All suggestions made on fix-ups to his last
set of patches should also be in here. e.g. __GFP_USER is still __GFP_USER
but is better commented.
o Large amount of CodingStyle, readability cleanups and corrections pointed
out by Dave Hansen.
o Fix CONFIG_NUMA error that corrupted per-cpu lists
o Patches broken out to have one-feature-per-patch rather than
more-code-per-patch
o Fix fallback bug where pages for RCLM_NORCLM end up on random other
free lists.
Changelog since V13
o Patches are now broken out
o Added per-cpu draining of userrclm pages
o Brought the patch more in line with memory hotplug work
o Fine-grained use of the __GFP_USER and __GFP_KERNRCLM flags
o Many coding-style corrections
o Many whitespace-damage corrections
Changelog since V12
o Minor whitespace damage fixed as pointed by Joel Schopp
Changelog since V11
o Mainly a rediff against 2.6.12-rc5
o Use #defines for indexing into pcpu lists
o Fix rounding error in the size of usemap
Changelog since V10
o All allocation types now use per-cpu caches like the standard allocator
o Removed all the additional buddy allocator statistic code
o Eliminated three zone fields that could be lived without
o Simplified some loops
o Removed many unnecessary calculations
Changelog since V9
o Tightened what pools are used for fallbacks, less likely to fragment
o Many micro-optimisations to have the same performance as the standard
allocator. Modified allocator now faster than standard allocator using
gcc 3.3.5
o Add counter for splits/coalescing
Changelog since V8
o rmqueue_bulk() allocates pages in large blocks and breaks them up into the
requested size. Reduces the number of calls to __rmqueue()
o Beancounters are now a configurable option under "Kernel Hacking"
o Broke out some code into inline functions to be more Hotplug-friendly
o Increased the size of reserve for fallbacks from 10% to 12.5%.
Changelog since V7
o Updated to 2.6.11-rc4
o Lots of cleanups, mainly related to beancounters
o Fixed up a miscalculation in the bitmap size as pointed out by Mike Kravetz
(thanks Mike)
o Introduced a 10% reserve for fallbacks. Drastically reduces the number of
kernnorclm allocations that go to the wrong places
o Don't trigger OOM when large allocations are involved
Changelog since V6
o Updated to 2.6.11-rc2
o Minor change to allow prezeroing to be a cleaner looking patch
Changelog since V5
o Fixed up gcc-2.95 errors
o Fixed up whitespace damage
Changelog since V4
o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies
with offsets to 2.6.11-rc1-mm1
Changelog since V3
o inlined get_pageblock_type() and set_pageblock_type()
o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone()
o When taking from the global pool, do not scan all the low-order lists
Changelog since V2
o Do not interfere with the "min" decay
o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably
anything to do with asynchronous IO
Changelog since V1
o Update patch to 2.6.11-rc1
o Cleaned up bug where memory was wasted on a large bitmap
o Remove code that needed the binary buddy bitmaps
o Update flags to avoid colliding with __GFP_ZERO changes
o Extended fallback_count bean counters to show the fallback count for each
allocation type
o In-code documentation
Version 1
o Initial release against 2.6.9
This patch is designed to reduce fragmentation in the standard buddy allocator
without impairing the performance of the allocator. High fragmentation in
the standard binary buddy allocator means that high-order allocations can
rarely be serviced. This patch works by dividing allocations into three
different types (example call sites follow the list);
EasyReclaimable - These are userspace pages that are easily reclaimable. This
flag is set when it is known that the pages will be trivially reclaimed
by writing the page out to swap or syncing with backing storage
KernelReclaimable - These are pages allocated by the kernel that are easily
reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
These types of pages could potentially be reclaimed by dumping the
caches and reaping the slabs
KernelNonReclaimable - These are pages that are allocated by the kernel that
are not trivially reclaimed. For example, the memory allocated for a
loaded module would be in this category. By default, allocations are
considered to be of this type
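To make the three types concrete, here is how call sites are tagged by the
first patch in this series. The first two lines are taken from its changes to
mm/memory.c and fs/dcache.c; the last is an untouched caller shown for
contrast:

	/* Userspace page, trivially reclaimed by paging it out */
	page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM, vma, address);

	/* dcache object, reclaimable by pruning the dcache and reaping slabs */
	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_KERNRCLM);

	/* Neither flag set, treated as kernel non-reclaimable by default */
	page = alloc_pages(GFP_KERNEL, order);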
Instead of having one global MAX_ORDER-sized array of free lists, there
are four: one for each type of allocation and another reserved for
fallbacks.
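In code terms, the single free_area array in struct zone is replaced by a
two-dimensional array indexed by allocation type; sketching the end state
after patches 2-4 below (RCLM_FALLBACK is the fallback reserve added by
patch 4):

	#define RCLM_NORCLM	0
	#define RCLM_EASY	1
	#define RCLM_KERN	2
	#define RCLM_FALLBACK	3
	#define RCLM_TYPES	4

	struct zone {
		/* ... other fields unchanged ... */

		/* Replaces: struct free_area free_area[MAX_ORDER]; */
		struct free_area free_area_lists[RCLM_TYPES][MAX_ORDER];
	};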
Once a 2^MAX_ORDER block of pages is split for a type of allocation, it is
added to the free-lists for that type, in effect reserving it. Hence, over
time, pages of the different types can be clustered together. This means that
if a 2^MAX_ORDER block of pages were required, the system could linearly scan
a block of pages allocated for EasyReclaimable and page each of them out.
Fallback is used when there are no 2^MAX_ORDER pages available and there
are no free pages of the desired type. The fallback lists were chosen in a
way that keeps the most easily reclaimable pages together.
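The orderings chosen are encoded in the fallback_allocs table of patch 4
below. Each row lists the free lists searched for one allocation type,
terminated by RCLM_TYPES; every type tries the RCLM_FALLBACK reserve before
polluting another type's pool:

	int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
		{RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN,   RCLM_EASY, RCLM_TYPES},
		{RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
		{RCLM_KERN,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
	};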
Three benchmark results are included, all based on a 2.6.14-rc5-mm1 kernel
compiled with gcc 3.4 (it is known that gcc 2.95 produces different results).
The first is the output of portions of AIM9 for the vanilla allocator and
the modified one;
(Tests run with bench-aim9.sh from VMRegress 0.17)
2.6.14-rc5-mm1-clean
------------------------------------------------------------------------------------------------------------
Test Test Elapsed Iteration Iteration Operation
Number Name Time (sec) Count Rate (loops/sec) Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
1 creat-clo 60.04 961 16.00600 16006.00 File Creations and Closes/second
2 page_test 60.02 4149 69.12696 117515.83 System Allocations & Pages/second
3 brk_test 60.04 1555 25.89940 440289.81 System Memory Allocations/second
4 jmp_test 60.00 250768 4179.46667 4179466.67 Non-local gotos/second
5 signal_test 60.01 4849 80.80320 80803.20 Signal Traps/second
6 exec_test 60.00 741 12.35000 61.75 Program Loads/second
7 fork_test 60.06 797 13.27006 1327.01 Task Creations/second
8 link_test 60.01 5269 87.80203 5531.53 Link/Unlink Pairs/second
2.6.14-rc5-mm1-mbuddy-v19
------------------------------------------------------------------------------------------------------------
Test Test Elapsed Iteration Iteration Operation
Number Name Time (sec) Count Rate (loops/sec) Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
1 creat-clo 60.04 954 15.88941 15889.41 File Creations and Closes/second
2 page_test 60.01 4133 68.87185 117082.15 System Allocations & Pages/second
3 brk_test 60.02 1546 25.75808 437887.37 System Memory Allocations/second
4 jmp_test 60.00 250797 4179.95000 4179950.00 Non-local gotos/second
5 signal_test 60.01 5121 85.33578 85335.78 Signal Traps/second
6 exec_test 60.00 743 12.38333 61.92 Program Loads/second
7 fork_test 60.05 806 13.42215 1342.21 Task Creations/second
8 link_test 60.00 5291 88.18333 5555.55 Link/Unlink Pairs/second
Difference in performance operations report generated by diff-aim9.sh
Clean mbuddy-v19
---------- ----------
1 creat-clo 16006.00 15889.41 -116.59 -0.73% File Creations and Closes/second
2 page_test 117515.83 117082.15 -433.68 -0.37% System Allocations & Pages/second
3 brk_test 440289.81 437887.37 -2402.44 -0.55% System Memory Allocations/second
4 jmp_test 4179466.67 4179950.00 483.33 0.01% Non-local gotos/second
5 signal_test 80803.20 85335.78 4532.58 5.61% Signal Traps/second
6 exec_test 61.75 61.92 0.17 0.28% Program Loads/second
7 fork_test 1327.01 1342.21 15.20 1.15% Task Creations/second
8 link_test 5531.53 5555.55 24.02 0.43% Link/Unlink Pairs/second
In this test, there were small regressions in creat-clo, page_test and
brk_test. However, it is known that different kernel configurations, compilers
and even different runs show similar variances of +/- 3%.
The second benchmark tested the CPU cache usage to make sure it was not
getting clobbered. The test was to render a large postscript file 10 times
and take the average. The results are;
2.6.14-rc5-mm1-clean: Average: 43.254 real, 38.89 user, 0.042 sys
2.6.14-rc5-mm1-mbuddy-v19: Average: 43.212 real, 40.494 user, 0.044 sys
So there are no adverse cache effects. The last test is to show that the
allocator can satisfy more high-order allocations, especially under load,
than the standard allocator. The test performs the following;
1. Start updatedb running in the background
2. Load a kernel module that tries to allocate high-order blocks on demand
3. Clean a kernel tree
4. Make 6 copies of the tree. As each copy finishes, a compile starts at -j2
5. Start compiling the primary tree
6. Sleep 1 minute while the 7 trees are being compiled
7. Use the kernel module to attempt 160 times to allocate a 2^10 block of pages
- note, it only attempts 160 times, no matter how often it succeeds
- An allocation is attempted every 1/10th of a second
- Performance will suffer badly as this forces considerable amounts of
pageout
The results of the allocations under load (load averaging 18) were;
2.6.14-rc5-mm1 Clean
Order: 10
Allocation type: HighMem
Attempted allocations: 160
Success allocs: 30
Failed allocs: 130
DMA zone allocs: 0
Normal zone allocs: 7
HighMem zone allocs: 23
% Success: 18
2.6.14-rc5-mm1 MBuddy V19
Order: 10
Allocation type: HighMem
Attempted allocations: 160
Success allocs: 76
Failed allocs: 84
DMA zone allocs: 1
Normal zone allocs: 30
HighMem zone allocs: 45
% Success: 47
One thing that had to be changed for the 2.6.14-rc5-mm1 clean test was to
disable the OOM killer. During one test run, the clean kernel achieved better
results but invoked the OOM killer a very large number of times to do so. The
patched kernel with the placement policy never invoked the OOM killer.
The above results are not very dramatic, but the effect is very noticeable
when the system is at rest after the test completes. After the test, the
standard allocator was able to allocate 45 order-10 pages while the modified
allocator allocated 159. The ability to allocate large pages under load
depends heavily on the decisions of kswapd, so there can be large variances
in results, but that is a separate problem. It is also known that the success
of large allocations is dependent on the location of per-cpu pages, but
fixing that is a separate issue.
The results show that the modified allocator has comparable speed and no
adverse cache effects, but leaves memory far less fragmented and in a better
position to satisfy high-order allocations.
--
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab
* [PATCH 1/7] Fragmentation Avoidance V19: 001_antidefrag_flags
From: Mel Gorman @ 2005-10-30 18:34 UTC
To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman

This patch adds two flags, __GFP_EASYRCLM and __GFP_KERNRCLM, that are used
to track the type of allocation the caller is making. Allocations using the
__GFP_EASYRCLM flag are expected to be easily reclaimed by syncing with
backing storage (be it a file or swap) or by cleaning the buffers and
discarding. Allocations using the __GFP_KERNRCLM flag belong to slab caches
that can be shrunk by the kernel.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/buffer.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/buffer.c
--- linux-2.6.14-rc5-mm1-clean/fs/buffer.c      2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/buffer.c      2005-10-30 13:34:50.000000000 +0000
@@ -1119,7 +1119,8 @@ grow_dev_page(struct block_device *bdev,
        struct page *page;
        struct buffer_head *bh;

-       page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+       page = find_or_create_page(inode->i_mapping, index,
+                               GFP_NOFS|__GFP_EASYRCLM);
        if (!page)
                return NULL;
@@ -3058,7 +3059,8 @@ static void recalc_bh_state(void)

 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 {
-       struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+       struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+                                               gfp_flags|__GFP_KERNRCLM);
        if (ret) {
                get_cpu_var(bh_accounting).nr++;
                recalc_bh_state();
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/compat.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/compat.c
--- linux-2.6.14-rc5-mm1-clean/fs/compat.c      2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/compat.c       2005-10-30 13:34:50.000000000 +0000
@@ -1363,7 +1363,7 @@ static int compat_copy_strings(int argc,
                page = bprm->page[i];
                new = 0;
                if (!page) {
-                       page = alloc_page(GFP_HIGHUSER);
+                       page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
                        bprm->page[i] = page;
                        if (!page) {
                                ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/dcache.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/dcache.c
--- linux-2.6.14-rc5-mm1-clean/fs/dcache.c      2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/dcache.c       2005-10-30 13:34:50.000000000 +0000
@@ -878,7 +878,7 @@ struct dentry *d_alloc(struct dentry * p
        struct dentry *dentry;
        char *dname;

-       dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+       dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_KERNRCLM);
        if (!dentry)
                return NULL;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/exec.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/exec.c
--- linux-2.6.14-rc5-mm1-clean/fs/exec.c        2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/exec.c 2005-10-30 13:34:50.000000000 +0000
@@ -237,7 +237,7 @@ static int copy_strings(int argc, char _
                page = bprm->page[i];
                new = 0;
                if (!page) {
-                       page = alloc_page(GFP_HIGHUSER);
+                       page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
                        bprm->page[i] = page;
                        if (!page) {
                                ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ext2/super.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext2/super.c
--- linux-2.6.14-rc5-mm1-clean/fs/ext2/super.c  2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext2/super.c   2005-10-30 13:34:50.000000000 +0000
@@ -141,7 +141,8 @@ static kmem_cache_t * ext2_inode_cachep;
 static struct inode *ext2_alloc_inode(struct super_block *sb)
 {
        struct ext2_inode_info *ei;
-       ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+       ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
+                                               SLAB_KERNEL|__GFP_KERNRCLM);
        if (!ei)
                return NULL;
 #ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ext3/super.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext3/super.c
--- linux-2.6.14-rc5-mm1-clean/fs/ext3/super.c  2005-10-30 13:20:00.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext3/super.c   2005-10-30 13:34:50.000000000 +0000
@@ -444,7 +444,7 @@ static struct inode *ext3_alloc_inode(st
 {
        struct ext3_inode_info *ei;

-       ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+       ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
        if (!ei)
                return NULL;
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/inode.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/inode.c
--- linux-2.6.14-rc5-mm1-clean/fs/inode.c       2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/inode.c        2005-10-30 13:34:50.000000000 +0000
@@ -146,7 +146,7 @@ static struct inode *alloc_inode(struct
                mapping->a_ops = &empty_aops;
                mapping->host = inode;
                mapping->flags = 0;
-               mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+               mapping_set_gfp_mask(mapping, GFP_HIGHUSER|__GFP_EASYRCLM);
                mapping->assoc_mapping = NULL;
                mapping->backing_dev_info = &default_backing_dev_info;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ntfs/inode.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ntfs/inode.c
--- linux-2.6.14-rc5-mm1-clean/fs/ntfs/inode.c  2005-10-30 13:20:01.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ntfs/inode.c   2005-10-30 13:34:50.000000000 +0000
@@ -318,7 +318,7 @@ struct inode *ntfs_alloc_big_inode(struc
        ntfs_inode *ni;

        ntfs_debug("Entering.");
-       ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS);
+       ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
        if (likely(ni != NULL)) {
                ni->state = 0;
                return VFS_I(ni);
@@ -343,7 +343,7 @@ static inline ntfs_inode *ntfs_alloc_ext
        ntfs_inode *ni;

        ntfs_debug("Entering.");
-       ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+       ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
        if (likely(ni != NULL)) {
                ni->state = 0;
                return ni;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/asm-i386/page.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/asm-i386/page.h
--- linux-2.6.14-rc5-mm1-clean/include/asm-i386/page.h  2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/asm-i386/page.h   2005-10-30 13:34:50.000000000 +0000
@@ -36,7 +36,8 @@
 #define clear_user_page(page, vaddr, pg)       clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)    copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+       alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | __GFP_EASYRCLM, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

 /*
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/linux/gfp.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/gfp.h
--- linux-2.6.14-rc5-mm1-clean/include/linux/gfp.h      2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/gfp.h       2005-10-30 13:34:50.000000000 +0000
@@ -50,14 +50,27 @@ struct vm_area_struct;
 #define __GFP_HARDWALL   0x40000u /* Enforce hardwall cpuset memory allocs */
 #define __GFP_VALID    0x80000000u /* valid GFP flags */

-#define __GFP_BITS_SHIFT 20    /* Room for 20 __GFP_FOO bits */
+/*
+ * Allocation type modifiers, these are required to be adjacent
+ * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
+ * __GFP_KERNRCLM: Short-lived or reclaimable kernel allocation
+ * Both bits off: Kernel non-reclaimable or very hard to reclaim
+ * __GFP_EASYRCLM and __GFP_KERNRCLM should not be specified at the same time
+ * RCLM_SHIFT (defined elsewhere) depends on the location of these bits
+ */
+#define __GFP_EASYRCLM 0x80000u  /* User and other easily reclaimed pages */
+#define __GFP_KERNRCLM 0x100000u /* Kernel page that is reclaimable */
+#define __GFP_RCLM_BITS (__GFP_EASYRCLM|__GFP_KERNRCLM)
+
+#define __GFP_BITS_SHIFT 21    /* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)

 /* if you forget to add the bitmask here kernel will crash, period */
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
                        __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
                        __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-                       __GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL)
+                       __GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL| \
+                       __GFP_EASYRCLM|__GFP_KERNRCLM)

 #define GFP_ATOMIC     (__GFP_VALID | __GFP_HIGH)
 #define GFP_NOIO       (__GFP_VALID | __GFP_WAIT)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/linux/highmem.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/highmem.h
--- linux-2.6.14-rc5-mm1-clean/include/linux/highmem.h  2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/highmem.h   2005-10-30 13:34:50.000000000 +0000
@@ -47,7 +47,8 @@ static inline void clear_user_highpage(s
 static inline struct page *
 alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
 {
-       struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+       struct page *page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+                                               vma, vaddr);

        if (page)
                clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/memory.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/memory.c
--- linux-2.6.14-rc5-mm1-clean/mm/memory.c      2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/memory.c       2005-10-30 13:34:50.000000000 +0000
@@ -1295,7 +1295,8 @@ static int do_wp_page(struct mm_struct *
                if (!new_page)
                        goto oom;
        } else {
-               new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+               new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+                                               vma, address);
                if (!new_page)
                        goto oom;
                copy_user_highpage(new_page, old_page, address);
@@ -1858,7 +1859,8 @@ retry:

                if (unlikely(anon_vma_prepare(vma)))
                        goto oom;
-               page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+               page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+                                               vma, address);
                if (!page)
                        goto oom;
                copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/shmem.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/shmem.c
--- linux-2.6.14-rc5-mm1-clean/mm/shmem.c       2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/shmem.c        2005-10-30 13:34:50.000000000 +0000
@@ -906,7 +906,7 @@ shmem_alloc_page(unsigned long gfp, stru
        pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
        pvma.vm_pgoff = idx;
        pvma.vm_end = PAGE_SIZE;
-       page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
+       page = alloc_page_vma(gfp | __GFP_ZERO | __GFP_EASYRCLM, &pvma, 0);
        mpol_free(pvma.vm_policy);
        return page;
 }
@@ -921,7 +921,7 @@ shmem_swapin(struct shmem_inode_info *in
 static inline struct page *
 shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
 {
-       return alloc_page(gfp | __GFP_ZERO);
+       return alloc_page(gfp | __GFP_ZERO | __GFP_EASYRCLM);
 }
 #endif
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/swap_state.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/swap_state.c
--- linux-2.6.14-rc5-mm1-clean/mm/swap_state.c  2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/swap_state.c   2005-10-30 13:34:50.000000000 +0000
@@ -341,7 +341,8 @@ struct page *read_swap_cache_async(swp_e
                 * Get a new page to read into from swap.
                 */
                if (!new_page) {
-                       new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+                       new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+                                                       vma, addr);
                        if (!new_page)
                                break;          /* Out of memory */
                }
* [PATCH 2/7] Fragmentation Avoidance V19: 002_usemap
From: Mel Gorman @ 2005-10-30 18:34 UTC
To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel

This patch adds a "usemap" to the allocator. When a PAGES_PER_MAXORDER block
of pages (i.e. 2^(MAX_ORDER-1) pages) is split, the usemap records what type
of allocation the block was split for. This information is used in an
anti-fragmentation patch to group related allocation types together.

The __GFP_EASYRCLM and __GFP_KERNRCLM bits are used to enumerate three
allocation types;

RCLM_NORCLM: These are kernel allocations that cannot be reclaimed on demand.
RCLM_EASY: These are pages allocated with the __GFP_EASYRCLM flag set. They
        are considered to be user and other easily reclaimed pages such as
        buffers.
RCLM_KERN: Allocated for the kernel, but for caches that can be reclaimed on
        demand.

gfpflags_to_rclmtype() converts gfp_flags to their corresponding RCLM_TYPE
by masking out irrelevant bits and shifting the result right by RCLM_SHIFT.
Compile-time checks are made on RCLM_SHIFT to ensure gfpflags_to_rclmtype()
keeps working. ffz() could be used to avoid the static checks, but it would
add runtime overhead for what is a compile-time constant.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mm.h linux-2.6.14-rc5-mm1-002_usemap/include/linux/mm.h
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mm.h        2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/include/linux/mm.h  2005-10-30 13:35:31.000000000 +0000
@@ -529,6 +529,12 @@ static inline void set_page_links(struct
 extern struct page *mem_map;
 #endif

+/*
+ * Return what type of page this 2^(MAX_ORDER-1) block of pages is being
+ * used for. Return value is one of the RCLM_X types
+ */
+extern int get_pageblock_type(struct zone *zone, struct page *page);
+
 static inline void *lowmem_page_address(struct page *page)
 {
        return __va(page_to_pfn(page) << PAGE_SHIFT);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mmzone.h linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mmzone.h    2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h      2005-10-30 13:35:31.000000000 +0000
@@ -21,6 +21,17 @@
 #else
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
+#define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))
+
+/*
+ * The two bit field __GFP_RECLAIMBITS enumerates the following types of
+ * page reclaimability.
+ */
+#define RCLM_NORCLM 0
+#define RCLM_EASY 1
+#define RCLM_KERN 2
+#define RCLM_TYPES 3
+#define BITS_PER_RCLM_TYPE 2

 struct free_area {
        struct list_head free_list;
@@ -146,6 +157,13 @@ struct zone {
 #endif
        struct free_area free_area[MAX_ORDER];

+#ifndef CONFIG_SPARSEMEM
+       /*
+        * The map tracks what each 2^MAX_ORDER-1 sized block is being used for.
+        * Each PAGES_PER_MAXORDER block of pages use BITS_PER_RCLM_TYPE bits
+        */
+       unsigned long *free_area_usemap;
+#endif

        ZONE_PADDING(_pad1_)

@@ -501,9 +519,14 @@ extern struct pglist_data contig_page_da
 #define PAGES_PER_SECTION       (1UL << PFN_SECTION_SHIFT)
 #define PAGE_SECTION_MASK      (~(PAGES_PER_SECTION-1))

+#define FREE_AREA_BITS 64
+
 #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
 #endif
+#if ((SECTION_SIZE_BITS - MAX_ORDER) * BITS_PER_RCLM_TYPE) > FREE_AREA_BITS
+#error free_area_usemap is not big enough
+#endif

 struct page;
 struct mem_section {
@@ -516,6 +539,7 @@ struct mem_section {
         * before using it wrong.
         */
        unsigned long section_mem_map;
+       DECLARE_BITMAP(free_area_usemap, FREE_AREA_BITS);
 };

 #ifdef CONFIG_SPARSEMEM_EXTREME
@@ -584,6 +608,18 @@ static inline struct mem_section *__pfn_
        return __nr_to_section(pfn_to_section_nr(pfn));
 }

+static inline unsigned long *pfn_to_usemap(struct zone *zone,
+                                               unsigned long pfn)
+{
+       return &__pfn_to_section(pfn)->free_area_usemap[0];
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+       pfn &= (PAGES_PER_SECTION-1);
+       return (pfn >> (MAX_ORDER-1)) * BITS_PER_RCLM_TYPE;
+}
+
 #define pfn_to_page(pfn)       \
 ({     \
        unsigned long __pfn = (pfn);    \
@@ -621,6 +657,17 @@ void sparse_init(void);
 #else
 #define sparse_init()  do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
+static inline unsigned long *pfn_to_usemap(struct zone *zone,
+                                               unsigned long pfn)
+{
+       return zone->free_area_usemap;
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+       pfn = pfn - zone->zone_start_pfn;
+       return (pfn >> (MAX_ORDER-1)) * BITS_PER_RCLM_TYPE;
+}
 #endif /* CONFIG_SPARSEMEM */

 #ifdef CONFIG_NODES_SPAN_OTHER_NODES
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/page_alloc.c linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/page_alloc.c   2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c     2005-10-30 13:35:31.000000000 +0000
@@ -69,6 +69,99 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
 EXPORT_SYMBOL(totalram_pages);

 /*
+ * RCLM_SHIFT is the number of bits that a gfp_mask has to be shifted right
+ * to have just the __GFP_EASYRCLM and __GFP_KERNRCLM bits. The static check
+ * is made afterwards in case the GFP flags are not updated without updating
+ * this number
+ */
+#define RCLM_SHIFT 19
+#if (__GFP_EASYRCLM >> RCLM_SHIFT) != RCLM_EASY
+#error __GFP_EASYRCLM not mapping to RCLM_EASY
+#endif
+#if (__GFP_KERNRCLM >> RCLM_SHIFT) != RCLM_KERN
+#error __GFP_KERNRCLM not mapping to RCLM_KERN
+#endif
+
+/*
+ * This function maps gfpflags to their RCLM_TYPE. It makes assumptions
+ * on the location of the GFP flags.
+ */
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+       unsigned long rclmbits = gfp_flags & __GFP_RCLM_BITS;
+
+       /* Specifying both RCLM flags makes no sense */
+       if (unlikely(rclmbits == __GFP_RCLM_BITS)) {
+               printk(KERN_WARNING "Multiple RCLM GFP flags specified\n");
+               dump_stack();
+               return RCLM_TYPES;
+       }
+
+       return rclmbits >> RCLM_SHIFT;
+}
+
+/*
+ * copy_bits - Copy bits between bitmaps
+ * @dstaddr: The destination bitmap to copy to
+ * @srcaddr: The source bitmap to copy from
+ * @sindex_dst: The start bit index within the destination map to copy to
+ * @sindex_src: The start bit index within the source map to copy from
+ * @nr: The number of bits to copy
+ *
+ * Note that this method is slow and makes no guarantees for atomicity.
+ * It depends on being called with the zone spinlock held to ensure data
+ * safety
+ */
+static inline void copy_bits(unsigned long *dstaddr,
+               unsigned long *srcaddr,
+               int sindex_dst,
+               int sindex_src,
+               int nr)
+{
+       /*
+        * Written like this to take advantage of arch-specific
+        * set_bit() and clear_bit() functions
+        */
+       for (nr = nr - 1; nr >= 0; nr--) {
+               int bit = test_bit(sindex_src + nr, srcaddr);
+               if (bit)
+                       set_bit(sindex_dst + nr, dstaddr);
+               else
+                       clear_bit(sindex_dst + nr, dstaddr);
+       }
+}
+
+int get_pageblock_type(struct zone *zone, struct page *page)
+{
+       unsigned long pfn = page_to_pfn(page);
+       unsigned long type = 0;
+       unsigned long *usemap;
+       int bitidx;
+
+       bitidx = pfn_to_bitidx(zone, pfn);
+       usemap = pfn_to_usemap(zone, pfn);
+
+       copy_bits(&type, usemap, 0, bitidx, BITS_PER_RCLM_TYPE);
+
+       return type;
+}
+
+/* Reserve a block of pages for an allocation type */
+static inline void set_pageblock_type(struct zone *zone, struct page *page,
+                                       int type)
+{
+       unsigned long pfn = page_to_pfn(page);
+       unsigned long *usemap;
+       unsigned long ltype = type;
+       int bitidx;
+
+       bitidx = pfn_to_bitidx(zone, pfn);
+       usemap = pfn_to_usemap(zone, pfn);
+
+       copy_bits(usemap, &ltype, bitidx, 0, BITS_PER_RCLM_TYPE);
+}
+
+/*
  * Used by page_zone() to look up the address of the struct zone whose
  * id is encoded in the upper bits of page->flags
@@ -498,7 +591,8 @@ static void prep_new_page(struct page *p
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order,
+               int alloctype)
 {
        struct free_area * area;
        unsigned int current_order;
@@ -514,6 +608,14 @@ static struct page *__rmqueue(struct zon
                rmv_page_order(page);
                area->nr_free--;
                zone->free_pages -= 1UL << order;
+
+               /*
+                * If splitting a large block, record what the block is being
+                * used for in the usemap
+                */
+               if (current_order == MAX_ORDER-1)
+                       set_pageblock_type(zone, page, alloctype);
+
                return expand(zone, page, order, current_order, area);
        }

@@ -526,7 +628,8 @@ static struct page *__rmqueue(struct zon
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-                       unsigned long count, struct list_head *list)
+                       unsigned long count, struct list_head *list,
+                       int alloctype)
 {
        unsigned long flags;
        int i;
@@ -535,7 +638,7 @@ static int rmqueue_bulk(struct zon
        spin_lock_irqsave(&zone->lock, flags);
        for (i = 0; i < count; ++i) {
-               page = __rmqueue(zone, order);
+               page = __rmqueue(zone, order, alloctype);
                if (page == NULL)
                        break;
                allocated++;
@@ -719,6 +822,11 @@ buffered_rmqueue(struct zone *zone, int
        unsigned long flags;
        struct page *page = NULL;
        int cold = !!(gfp_flags & __GFP_COLD);
+       int alloctype = gfpflags_to_rclmtype(gfp_flags);
+
+       /* If the alloctype is RCLM_TYPES, the gfp_flags make no sense */
+       if (alloctype == RCLM_TYPES)
+               return NULL;

        if (order == 0) {
                struct per_cpu_pages *pcp;
@@ -727,7 +835,8 @@ buffered_rmqueue(struct zone *zone, int
                local_irq_save(flags);
                if (pcp->count <= pcp->low)
                        pcp->count += rmqueue_bulk(zone, 0,
-                                               pcp->batch, &pcp->list);
+                                               pcp->batch, &pcp->list,
+                                               alloctype);
                if (pcp->count) {
                        page = list_entry(pcp->list.next, struct page, lru);
                        list_del(&page->lru);
@@ -739,7 +848,7 @@ buffered_rmqueue(struct zone *zone, int

        if (page == NULL) {
                spin_lock_irqsave(&zone->lock, flags);
-               page = __rmqueue(zone, order);
+               page = __rmqueue(zone, order, alloctype);
                spin_unlock_irqrestore(&zone->lock, flags);
        }

@@ -1866,6 +1975,38 @@ inline void setup_pageset(struct per_cpu
        INIT_LIST_HEAD(&pcp->list);
 }

+#ifndef CONFIG_SPARSEMEM
+#define roundup(x, y) ((((x)+((y)-1))/(y))*(y))
+/*
+ * Calculate the size of the zone->usemap in bytes rounded to an unsigned long
+ * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
+ * Then figure 1 RCLM_TYPE worth of bits per MAX_ORDER-1, finally round up
+ * what is now in bits to nearest long in bits, then return it in bytes.
+ */
+static unsigned long __init usemap_size(unsigned long zonesize)
+{
+       unsigned long usemapsize;
+
+       usemapsize = roundup(zonesize, PAGES_PER_MAXORDER);
+       usemapsize = usemapsize >> (MAX_ORDER-1);
+       usemapsize *= BITS_PER_RCLM_TYPE;
+       usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
+
+       return usemapsize / 8;
+}
+
+static void __init setup_usemap(struct pglist_data *pgdat,
+                               struct zone *zone, unsigned long zonesize)
+{
+       unsigned long usemapsize = usemap_size(zonesize);
+       zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+       memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+}
+#else
+static void inline setup_usemap(struct pglist_data *pgdat,
+                               struct zone *zone, unsigned long zonesize) {}
+#endif /* CONFIG_SPARSEMEM */
+
 #ifdef CONFIG_NUMA
 /*
  * Boot pageset table. One per cpu which is going to be used for all
@@ -2079,6 +2220,7 @@ static void __init free_area_init_core(s
                zonetable_add(zone, nid, j, zone_start_pfn, size);
                init_currently_empty_zone(zone, zone_start_pfn, size);
                zone_start_pfn += size;
+               setup_usemap(pgdat, zone, size);
        }
 }
* [PATCH 3/7] Fragmentation Avoidance V19: 003_fragcore
From: Mel Gorman @ 2005-10-30 18:34 UTC
To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman

This patch adds the core of the anti-fragmentation strategy. It works by
grouping related allocation types together. The idea is that large groups of
pages that may be reclaimed are placed near each other. The zone->free_area
list is broken into three free lists, one for each RCLM_TYPE.

The following section of the patch looks superfluous, but it is needed to
suppress a compiler warning. Suggestions to make this better looking are
welcome.

-       struct free_area * area;
+       struct free_area * area = NULL;

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h      2005-10-30 13:35:31.000000000 +0000
+++ linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h    2005-10-30 13:36:16.000000000 +0000
@@ -33,6 +33,10 @@
 #define RCLM_TYPES 3
 #define BITS_PER_RCLM_TYPE 2

+#define for_each_rclmtype_order(type, order) \
+       for (order = 0; order < MAX_ORDER; order++) \
+               for (type = 0; type < RCLM_TYPES; type++)
+
 struct free_area {
        struct list_head free_list;
        unsigned long nr_free;
@@ -155,7 +159,6 @@ struct zone {
        /* see spanned/present_pages for more description */
        seqlock_t span_seqlock;
 #endif
-       struct free_area free_area[MAX_ORDER];

 #ifndef CONFIG_SPARSEMEM
        /*
@@ -165,6 +168,8 @@ struct zone {
        unsigned long *free_area_usemap;
 #endif

+       struct free_area free_area_lists[RCLM_TYPES][MAX_ORDER];
+
        ZONE_PADDING(_pad1_)

        /* Fields commonly accessed by the page reclaim scanner */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c     2005-10-30 13:35:31.000000000 +0000
+++ linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c   2005-10-30 13:36:16.000000000 +0000
@@ -352,6 +352,15 @@ __find_combined_index(unsigned long page
 }

 /*
+ * Return the free list for a given page within a zone
+ */
+static inline struct free_area *__page_find_freelist(struct zone *zone,
+                                                       struct page *page)
+{
+       return zone->free_area_lists[get_pageblock_type(zone, page)];
+}
+
+/*
  * This function checks whether a page is free && is the buddy
  * we can do coalesce a page and its buddy if
  * (a) the buddy is free &&
@@ -398,6 +407,8 @@ static inline void __free_pages_bulk (st
 {
        unsigned long page_idx;
        int order_size = 1 << order;
+       struct free_area *area;
+       struct free_area *freelist;

        if (unlikely(order))
                destroy_compound_page(page, order);
@@ -407,10 +418,11 @@ static inline void __free_pages_bulk (st
        BUG_ON(page_idx & (order_size - 1));
        BUG_ON(bad_range(zone, page));

+       freelist = __page_find_freelist(zone, page);
+
        zone->free_pages += order_size;
        while (order < MAX_ORDER-1) {
                unsigned long combined_idx;
-               struct free_area *area;
                struct page *buddy;

                combined_idx = __find_combined_index(page_idx, order);
@@ -421,7 +433,7 @@ static inline void __free_pages_bulk (st
                if (!page_is_buddy(buddy, order))
                        break;          /* Move the buddy up one level. */
                list_del(&buddy->lru);
-               area = zone->free_area + order;
+               area = &freelist[order];
                area->nr_free--;
                rmv_page_order(buddy);
                page = page + (combined_idx - page_idx);
@@ -429,8 +441,8 @@ static inline void __free_pages_bulk (st
                order++;
        }
        set_page_order(page, order);
-       list_add(&page->lru, &zone->free_area[order].free_list);
-       zone->free_area[order].nr_free++;
+       list_add_tail(&page->lru, &freelist[order].free_list);
+       freelist[order].nr_free++;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -587,6 +599,45 @@ static void prep_new_page(struct page *p
        kernel_map_pages(page, 1 << order, 1);
 }

+/*
+ * Find a list that has a 2^MAX_ORDER-1 block of pages available and
+ * return it
+ */
+struct page *steal_maxorder_block(struct zone *zone, int alloctype)
+{
+       struct page *page;
+       struct free_area *area = NULL;
+       int i;
+
+       for(i = 0; i < RCLM_TYPES; i++) {
+               if (i == alloctype)
+                       continue;
+
+               area = &zone->free_area_lists[i][MAX_ORDER-1];
+               if (!list_empty(&area->free_list))
+                       break;
+       }
+       if (i == RCLM_TYPES)
+               return NULL;
+
+       page = list_entry(area->free_list.next, struct page, lru);
+       area->nr_free--;
+
+       set_pageblock_type(zone, page, alloctype);
+
+       return page;
+}
+
+static inline struct page *
+remove_page(struct zone *zone, struct page *page, unsigned int order,
+               unsigned int current_order, struct free_area *area)
+{
+       list_del(&page->lru);
+       rmv_page_order(page);
+       zone->free_pages -= 1UL << order;
+       return expand(zone, page, order, current_order, area);
+}
+
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
@@ -594,31 +645,25 @@ static void prep_new_page(struct page *p
 static struct page *__rmqueue(struct zone *zone, unsigned int order,
                int alloctype)
 {
-       struct free_area * area;
+       struct free_area * area = NULL;
        unsigned int current_order;
        struct page *page;

        for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-               area = zone->free_area + current_order;
+               area = &zone->free_area_lists[alloctype][current_order];
                if (list_empty(&area->free_list))
                        continue;

                page = list_entry(area->free_list.next, struct page, lru);
-               list_del(&page->lru);
-               rmv_page_order(page);
                area->nr_free--;
-               zone->free_pages -= 1UL << order;
-
-               /*
-                * If splitting a large block, record what the block is being
-                * used for in the usemap
-                */
-               if (current_order == MAX_ORDER-1)
-                       set_pageblock_type(zone, page, alloctype);
-
-               return expand(zone, page, order, current_order, area);
+               return remove_page(zone, page, order, current_order, area);
        }

+       /* Allocate a MAX_ORDER block */
+       page = steal_maxorder_block(zone, alloctype);
+       if (page != NULL)
+               return remove_page(zone, page, order, MAX_ORDER-1, area);
+
        return NULL;
 }

@@ -704,9 +749,9 @@ static void __drain_pages(unsigned int c
 void mark_free_pages(struct zone *zone)
 {
        unsigned long zone_pfn, flags;
-       int order;
+       int order, t;
+       unsigned long start_pfn, i;
        struct list_head *curr;
-
        if (!zone->spanned_pages)
                return;

@@ -714,14 +759,12 @@ void mark_free_pages(struct zone *zone)
        for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
                ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));

-       for (order = MAX_ORDER - 1; order >= 0; --order)
-               list_for_each(curr, &zone->free_area[order].free_list) {
-                       unsigned long start_pfn, i;
-
+       for_each_rclmtype_order(t, order) {
+               list_for_each(curr,&zone->free_area_lists[t][order].free_list) {
                        start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
-
                        for (i=0; i < (1<<order); i++)
                                SetPageNosaveFree(pfn_to_page(start_pfn+i));
+               }
        }
        spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -876,6 +919,7 @@ int zone_watermark_ok(struct zone *z, in
        /* free_pages my go negative - that's OK */
        long min = mark, free_pages = z->free_pages - (1 << order) + 1;
        int o;
+       struct free_area *kernnorclm, *kernrclm, *easyrclm;

        if (gfp_high)
                min -= min / 2;
@@ -884,15 +928,22 @@ int zone_watermark_ok(struct zone *z, in

        if (free_pages <= min + z->lowmem_reserve[classzone_idx])
                goto out_failed;
+       kernnorclm = z->free_area_lists[RCLM_NORCLM];
+       easyrclm = z->free_area_lists[RCLM_EASY];
+       kernrclm = z->free_area_lists[RCLM_KERN];
        for (o = 0; o < order; o++) {
                /* At the next order, this order's pages become unavailable */
-               free_pages -= z->free_area[o].nr_free << o;
+               free_pages -= (kernnorclm->nr_free + kernrclm->nr_free +
+                               easyrclm->nr_free) << o;

                /* Require fewer higher order pages to be free */
                min >>= 1;

                if (free_pages <= min)
                        goto out_failed;
+               kernnorclm++;
+               easyrclm++;
+               kernrclm++;
        }

        return 1;
@@ -1496,6 +1547,7 @@ void show_free_areas(void)
        unsigned long inactive;
        unsigned long free;
        struct zone *zone;
+       int type;

        for_each_zone(zone) {
                show_node(zone);
@@ -1575,7 +1627,9 @@ void show_free_areas(void)
        }

        for_each_zone(zone) {
-               unsigned long nr, flags, order, total = 0;
+               unsigned long nr = 0;
+               unsigned long total = 0;
+               unsigned long flags,order;

                show_node(zone);
                printk("%s: ", zone->name);
@@ -1585,10 +1639,18 @@ void show_free_areas(void)
                }

                spin_lock_irqsave(&zone->lock, flags);
-               for (order = 0; order < MAX_ORDER; order++) {
-                       nr = zone->free_area[order].nr_free;
+               for_each_rclmtype_order(type, order) {
+                       nr += zone->free_area_lists[type][order].nr_free;
                        total += nr << order;
-                       printk("%lu*%lukB ", nr, K(1UL) << order);
+
+                       /*
+                        * If type had reached RCLM_TYPE, the free pages
+                        * for this order have been summed up
+                        */
+                       if (type == RCLM_TYPES-1) {
+                               printk("%lu*%lukB ", nr, K(1UL) << order);
+                               nr = 0;
+                       }
                }
                spin_unlock_irqrestore(&zone->lock, flags);
                printk("= %lukB\n", K(total));
@@ -1899,9 +1961,14 @@ void zone_init_free_lists(struct pglist_
                                unsigned long size)
 {
        int order;
-       for (order = 0; order < MAX_ORDER ; order++) {
-               INIT_LIST_HEAD(&zone->free_area[order].free_list);
-               zone->free_area[order].nr_free = 0;
+       int type;
+       struct free_area *area;
+
+       /* Initialse the three size ordered lists of free_areas */
+       for_each_rclmtype_order(type, order) {
+               area = &(zone->free_area_lists[type][order]);
+               INIT_LIST_HEAD(&area->free_list);
+               area->nr_free = 0;
        }
 }

@@ -2314,16 +2381,26 @@ static int frag_show(struct seq_file *m,
        struct zone *zone;
        struct zone *node_zones = pgdat->node_zones;
        unsigned long flags;
-       int order;
+       int order, t;
+       struct free_area *area;
+       unsigned long nr_bufs = 0;

        for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
                if (!zone->present_pages)
                        continue;

                spin_lock_irqsave(&zone->lock, flags);
-               seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
-               for (order = 0; order < MAX_ORDER; ++order)
-                       seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+               seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+               for_each_rclmtype_order(t, order) {
+                       area = &(zone->free_area_lists[t][order]);
+                       nr_bufs += area->nr_free;
+
+                       if (t == RCLM_TYPES-1) {
+                               seq_printf(m, "%6lu ", nr_bufs);
+                               nr_bufs = 0;
+                       }
+               }
+
                spin_unlock_irqrestore(&zone->lock, flags);
                seq_putc(m, '\n');
        }
* [PATCH 4/7] Fragmentation Avoidance V19: 004_fallback
From: Mel Gorman @ 2005-10-30 18:34 UTC
To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel

This patch implements fallback logic. In the event there are no
2^(MAX_ORDER-1) blocks of pages left, it helps the system decide what list
to use. The highlights of the patch are;

o Define a RCLM_FALLBACK type for fallbacks
o Use a percentage of each zone for fallbacks. When a reserved pool of pages
  is depleted, it will try and use RCLM_FALLBACK before using anything else.
  This greatly reduces the amount of fallbacks causing fragmentation without
  needing complex balancing algorithms
o Add a fallback_reserve that records how much of the zone is currently used
  for allocations falling back to RCLM_FALLBACK
o Add a fallback_allocs[] array that determines the order in which free lists
  are used for each allocation type

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h linux-2.6.14-rc5-mm1-004_fallback/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h    2005-10-30 13:36:16.000000000 +0000
+++ linux-2.6.14-rc5-mm1-004_fallback/include/linux/mmzone.h    2005-10-30 13:36:56.000000000 +0000
@@ -30,7 +30,8 @@
 #define RCLM_NORCLM 0
 #define RCLM_EASY 1
 #define RCLM_KERN 2
-#define RCLM_TYPES 3
+#define RCLM_FALLBACK 3
+#define RCLM_TYPES 4
 #define BITS_PER_RCLM_TYPE 2

 #define for_each_rclmtype_order(type, order) \
@@ -168,8 +169,17 @@ struct zone {
        unsigned long *free_area_usemap;
 #endif

+       /*
+        * With allocation fallbacks, the nr_free count for each RCLM_TYPE must
+        * be added together to get the correct count of free pages for a given
+        * order. Individually, the nr_free count in a free_area may not match
+        * the number of pages in the free_list.
+        */
        struct free_area free_area_lists[RCLM_TYPES][MAX_ORDER];

+       /* Number of pages currently used for RCLM_FALLBACK */
+       unsigned long fallback_reserve;
+
        ZONE_PADDING(_pad1_)

        /* Fields commonly accessed by the page reclaim scanner */
@@ -292,6 +302,17 @@ struct zonelist {
        struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
 };

+static inline void inc_reserve_count(struct zone *zone, int type)
+{
+       if (type == RCLM_FALLBACK)
+               zone->fallback_reserve += PAGES_PER_MAXORDER;
+}
+
+static inline void dec_reserve_count(struct zone *zone, int type)
+{
+       if (type == RCLM_FALLBACK && zone->fallback_reserve)
+               zone->fallback_reserve -= PAGES_PER_MAXORDER;
+}

 /*
  * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c   2005-10-30 13:36:16.000000000 +0000
+++ linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c   2005-10-30 13:36:56.000000000 +0000
@@ -54,6 +54,22 @@ unsigned long totalhigh_pages __read_mos
 long nr_swap_pages;

 /*
+ * fallback_allocs contains the fallback types for low memory conditions
+ * where the preferred alloction type if not available.
+ */
+int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
+       {RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN, RCLM_EASY, RCLM_TYPES},
+       {RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
+       {RCLM_KERN, RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
+};
+
+/* Returns 1 if the needed percentage of the zone is reserved for fallbacks */
+static inline int min_fallback_reserved(struct zone *zone)
+{
+       return zone->fallback_reserve >= zone->present_pages >> 3;
+}
+
+/*
  * results with 256, 32 in the lowmem_reserve sysctl:
  * 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
  * 1G machine -> (16M dma, 784M normal, 224M high)
@@ -623,7 +639,12 @@ struct page *steal_maxorder_block(struct
        page = list_entry(area->free_list.next, struct page, lru);
        area->nr_free--;

+       if (!min_fallback_reserved(zone))
+               alloctype = RCLM_FALLBACK;
+
        set_pageblock_type(zone, page, alloctype);
+       dec_reserve_count(zone, i);
+       inc_reserve_count(zone, alloctype);

        return page;
 }
@@ -638,6 +659,78 @@ remove_page(struct zone *zone, struct pa
        return expand(zone, page, order, current_order, area);
 }

+/*
+ * If we are falling back, and the allocation is KERNNORCLM,
+ * then reserve any buddies for the KERNNORCLM pool. These
+ * allocations fragment the worst so this helps keep them
+ * in the one place
+ */
+static inline struct free_area *
+fallback_buddy_reserve(int start_alloctype, struct zone *zone,
+                       unsigned int current_order, struct page *page,
+                       struct free_area *area)
+{
+       if (start_alloctype != RCLM_NORCLM)
+               return area;
+
+       area = &zone->free_area_lists[RCLM_NORCLM][current_order];
+
+       /* Reserve the whole block if this is a large split */
+       if (current_order >= MAX_ORDER / 2) {
+               int reserve_type = RCLM_NORCLM;
+
+               if (!min_fallback_reserved(zone))
+                       reserve_type = RCLM_FALLBACK;
+
+               dec_reserve_count(zone, get_pageblock_type(zone,page));
+               set_pageblock_type(zone, page, reserve_type);
+               inc_reserve_count(zone, reserve_type);
+       }
+       return area;
+}
+
+static struct page *
+fallback_alloc(int alloctype, struct zone *zone, unsigned int order)
+{
+       int *fallback_list;
+       int start_alloctype = alloctype;
+       struct free_area *area;
+       unsigned int current_order;
+       struct page *page;
+       int i;
+
+       /* Ok, pick the fallback order based on the type */
+       BUG_ON(alloctype >= RCLM_TYPES);
+       fallback_list = fallback_allocs[alloctype];
+
+       /*
+        * Here, the alloc type lists has been depleted as well as the global
+        * pool, so fallback. When falling back, the largest possible block
+        * will be taken to keep the fallbacks clustered if possible
+        */
+       for (i = 0; fallback_list[i] != RCLM_TYPES; i++) {
+               alloctype = fallback_list[i];
+
+               /* Find a block to allocate */
+               area = &zone->free_area_lists[alloctype][MAX_ORDER-1];
+               for (current_order = MAX_ORDER - 1; current_order > order;
+                               current_order--, area--) {
+                       if (list_empty(&area->free_list))
+                               continue;
+
+                       page = list_entry(area->free_list.next,
+                                       struct page, lru);
+                       area->nr_free--;
+                       area = fallback_buddy_reserve(start_alloctype, zone,
+                                       current_order, page, area);
+                       return remove_page(zone, page, order,
+                                       current_order, area);
+
+               }
+       }
+
+       return NULL;
+}
+
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
@@ -664,7 +757,8 @@ static struct page *__rmqueue(struct zon
        if (page != NULL)
                return remove_page(zone, page, order, MAX_ORDER-1, area);

-       return NULL;
+       /* Try falling back */
+       return fallback_alloc(alloctype, zone, order);
 }

 /*
@@ -2270,6 +2364,7 @@ static void __init free_area_init_core(s
                zone_seqlock_init(zone);
                zone->zone_pgdat = pgdat;
                zone->free_pages = 0;
+               zone->fallback_reserve = 0;

                zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
* [PATCH 5/7] Fragmentation Avoidance V19: 005_largealloc_tryharder
From: Mel Gorman @ 2005-10-30 18:34 UTC
To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman

Fragmentation avoidance patches increase our chances of satisfying high
order allocations, so this patch takes more than one iteration at trying to
fulfill those allocations because, unlike before, the extra iterations are
often useful.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c   2005-10-30 13:36:56.000000000 +0000
+++ linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c       2005-10-30 13:37:34.000000000 +0000
@@ -1127,6 +1127,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
        int do_retry;
        int can_try_harder;
        int did_some_progress;
+       int highorder_retry = 3;

        might_sleep_if(wait);

@@ -1275,7 +1276,17 @@ rebalance:
                        goto got_pg;
                }

-               out_of_memory(gfp_mask, order);
+               if (order < MAX_ORDER / 2)
+                       out_of_memory(gfp_mask, order);
+
+               /*
+                * Due to low fragmentation efforts, we try a little
+                * harder to satisfy high order allocations and only
+                * go OOM for low-order allocations
+                */
+               if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+                       goto rebalance;
+
                goto restart;
        }

@@ -1292,6 +1303,8 @@ rebalance:
                        do_retry = 1;
                if (gfp_mask & __GFP_NOFAIL)
                        do_retry = 1;
+               if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+                       do_retry = 1;
        }
        if (do_retry) {
                blk_congestion_wait(WRITE, HZ/50);
* [PATCH 6/7] Fragmentation Avoidance V19: 006_percpu 2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman ` (4 preceding siblings ...) 2005-10-30 18:34 ` [PATCH 5/7] Fragmentation Avoidance V19: 005_largealloc_tryharder Mel Gorman @ 2005-10-30 18:34 ` Mel Gorman 2005-10-30 18:34 ` [PATCH 7/7] Fragmentation Avoidance V19: 007_stats Mel Gorman 2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz 7 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw) To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel The freelists for each allocation type can slowly become corrupted due to the per-cpu list. Consider what happens when the following happens 1. A 2^(MAX_ORDER-1) list is reserved for __GFP_EASYRCLM pages 2. An order-0 page is allocated from the newly reserved block 3. The page is freed and placed on the per-cpu list 4. alloc_page() is called with GFP_KERNEL as the gfp_mask 5. The per-cpu list is used to satisfy the allocation Now, a kernel page is in the middle of a __GFP_EASYRCLM page. This means that over long periods of the time, the anti-fragmentation scheme slowly degrades to the standard allocator. This patch divides the per-cpu lists into Kernel and User lists. RCLM_NORCLM and RCLM_KERN use the Kernel list and RCLM_EASY uses the user list. Strictly speaking, there should be three lists but as little effort is made to reclaim RCLM_KERN pages, it is not worth the overhead *yet*. Signed-off-by: Mel Gorman <mel@csn.ul.ie> diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-005_largealloc_tryharder/include/linux/mmzone.h linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h --- linux-2.6.14-rc5-mm1-005_largealloc_tryharder/include/linux/mmzone.h 2005-10-30 13:36:56.000000000 +0000 +++ linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h 2005-10-30 13:38:14.000000000 +0000 @@ -60,12 +60,21 @@ struct zone_padding { #define ZONE_PADDING(name) #endif +/* + * Indices into pcpu_list + * PCPU_KERNEL: For RCLM_NORCLM and RCLM_KERN allocations + * PCPU_EASY: For RCLM_EASY allocations + */ +#define PCPU_KERNEL 0 +#define PCPU_EASY 1 +#define PCPU_TYPES 2 + struct per_cpu_pages { - int count; /* number of pages in the list */ + int count[PCPU_TYPES]; /* Number of pages on each list */ int low; /* low watermark, refill needed */ int high; /* high watermark, emptying needed */ int batch; /* chunk size for buddy add/remove */ - struct list_head list; /* the list of pages */ + struct list_head list[PCPU_TYPES]; /* the lists of pages */ }; struct per_cpu_pageset { @@ -80,6 +89,10 @@ struct per_cpu_pageset { #endif } ____cacheline_aligned_in_smp; +/* Helpers for per_cpu_pages */ +#define pset_count(pset) (pset.count[PCPU_KERNEL] + pset.count[PCPU_EASY]) +#define for_each_pcputype(pindex) \ + for (pindex = 0; pindex < PCPU_TYPES; pindex++) #ifdef CONFIG_NUMA #define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)]) #else diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c --- linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c 2005-10-30 13:37:34.000000000 +0000 +++ linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c 2005-10-30 13:38:14.000000000 +0000 @@ -792,7 +792,7 @@ static int rmqueue_bulk(struct zone *zon void drain_remote_pages(void) { struct zone *zone; - int i; + int i, pindex; unsigned long flags; local_irq_save(flags); @@ -808,9 +808,16 @@ void 
drain_remote_pages(void) struct per_cpu_pages *pcp; pcp = &pset->pcp[i]; - if (pcp->count) - pcp->count -= free_pages_bulk(zone, pcp->count, - &pcp->list, 0); + for_each_pcputype(pindex) { + if (!pcp->count[pindex]) + continue; + + /* Try remove all pages from the pcpu list */ + pcp->count[pindex] -= + free_pages_bulk(zone, + pcp->count[pindex], + &pcp->list[pindex], 0); + } } } local_irq_restore(flags); @@ -821,7 +828,7 @@ void drain_remote_pages(void) static void __drain_pages(unsigned int cpu) { struct zone *zone; - int i; + int i, pindex; for_each_zone(zone) { struct per_cpu_pageset *pset; @@ -831,8 +838,16 @@ static void __drain_pages(unsigned int c struct per_cpu_pages *pcp; pcp = &pset->pcp[i]; - pcp->count -= free_pages_bulk(zone, pcp->count, - &pcp->list, 0); + for_each_pcputype(pindex) { + if (!pcp->count[pindex]) + continue; + + /* Try remove all pages from the pcpu list */ + pcp->count[pindex] -= + free_pages_bulk(zone, + pcp->count[pindex], + &pcp->list[pindex], 0); + } } } } @@ -911,6 +926,7 @@ static void fastcall free_hot_cold_page( struct zone *zone = page_zone(page); struct per_cpu_pages *pcp; unsigned long flags; + int pindex; arch_free_page(page, 0); @@ -920,11 +936,21 @@ static void fastcall free_hot_cold_page( page->mapping = NULL; free_pages_check(__FUNCTION__, page); pcp = &zone_pcp(zone, get_cpu())->pcp[cold]; + + /* + * Strictly speaking, we should not be accessing the zone information + * here. In this case, it does not matter if the read is incorrect + */ + if (get_pageblock_type(zone, page) == RCLM_EASY) + pindex = PCPU_EASY; + else + pindex = PCPU_KERNEL; local_irq_save(flags); - list_add(&page->lru, &pcp->list); - pcp->count++; - if (pcp->count >= pcp->high) - pcp->count -= free_pages_bulk(zone, pcp->batch, &pcp->list, 0); + list_add(&page->lru, &pcp->list[pindex]); + pcp->count[pindex]++; + if (pcp->count[pindex] >= pcp->high) + pcp->count[pindex] -= free_pages_bulk(zone, pcp->batch, + &pcp->list[pindex], 0); local_irq_restore(flags); put_cpu(); } @@ -967,17 +993,23 @@ buffered_rmqueue(struct zone *zone, int if (order == 0) { struct per_cpu_pages *pcp; + int pindex = PCPU_KERNEL; + if (alloctype == RCLM_EASY) + pindex = PCPU_EASY; pcp = &zone_pcp(zone, get_cpu())->pcp[cold]; local_irq_save(flags); - if (pcp->count <= pcp->low) - pcp->count += rmqueue_bulk(zone, 0, - pcp->batch, &pcp->list, - alloctype); - if (pcp->count) { - page = list_entry(pcp->list.next, struct page, lru); + if (pcp->count[pindex] <= pcp->low) + pcp->count[pindex] += rmqueue_bulk(zone, + 0, pcp->batch, + &(pcp->list[pindex]), + alloctype); + + if (pcp->count[pindex]) { + page = list_entry(pcp->list[pindex].next, + struct page, lru); list_del(&page->lru); - pcp->count--; + pcp->count[pindex]--; } local_irq_restore(flags); put_cpu(); @@ -1678,7 +1710,7 @@ void show_free_areas(void) pageset->pcp[temperature].low, pageset->pcp[temperature].high, pageset->pcp[temperature].batch, - pageset->pcp[temperature].count); + pset_count(pageset->pcp[temperature])); } } @@ -2135,18 +2167,22 @@ inline void setup_pageset(struct per_cpu struct per_cpu_pages *pcp; pcp = &p->pcp[0]; /* hot */ - pcp->count = 0; + pcp->count[PCPU_KERNEL] = 0; + pcp->count[PCPU_EASY] = 0; pcp->low = 0; - pcp->high = 6 * batch; + pcp->high = 3 * batch; pcp->batch = max(1UL, 1 * batch); - INIT_LIST_HEAD(&pcp->list); + INIT_LIST_HEAD(&pcp->list[PCPU_KERNEL]); + INIT_LIST_HEAD(&pcp->list[PCPU_EASY]); pcp = &p->pcp[1]; /* cold*/ - pcp->count = 0; + pcp->count[PCPU_KERNEL] = 0; + pcp->count[PCPU_EASY] = 0; pcp->low = 0; - pcp->high = 
2 * batch; + pcp->high = batch; pcp->batch = max(1UL, batch/2); - INIT_LIST_HEAD(&pcp->list); + INIT_LIST_HEAD(&pcp->list[PCPU_KERNEL]); + INIT_LIST_HEAD(&pcp->list[PCPU_EASY]); } #ifndef CONFIG_SPARSEMEM @@ -2574,7 +2610,7 @@ static int zoneinfo_show(struct seq_file pageset = zone_pcp(zone, i); for (j = 0; j < ARRAY_SIZE(pageset->pcp); j++) { - if (pageset->pcp[j].count) + if (pset_count(pageset->pcp[j])) break; } if (j == ARRAY_SIZE(pageset->pcp)) @@ -2587,7 +2623,7 @@ static int zoneinfo_show(struct seq_file "\n high: %i" "\n batch: %i", i, j, - pageset->pcp[j].count, + pset_count(pageset->pcp[j]), pageset->pcp[j].low, pageset->pcp[j].high, pageset->pcp[j].batch); ^ permalink raw reply [flat|nested] 241+ messages in thread
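The heart of 006_percpu is the routing test repeated in free_hot_cold_page() and buffered_rmqueue() above; pulled out as a hypothetical helper (the patch open-codes it), the free-side rule is just:

/* Hypothetical helper, assuming the series' get_pageblock_type() and
 * PCPU_* definitions: pages from RCLM_EASY blocks go on the PCPU_EASY
 * per-cpu list, everything else shares PCPU_KERNEL. Freeing keys off
 * the pageblock type as below; allocation keys off the caller's
 * requested alloctype instead. */
static inline int pcpu_list_index(struct zone *zone, struct page *page)
{
	return get_pageblock_type(zone, page) == RCLM_EASY ?
						PCPU_EASY : PCPU_KERNEL;
}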
* [PATCH 7/7] Fragmentation Avoidance V19: 007_stats 2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman ` (5 preceding siblings ...) 2005-10-30 18:34 ` [PATCH 6/7] Fragmentation Avoidance V19: 006_percpu Mel Gorman @ 2005-10-30 18:34 ` Mel Gorman 2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz 7 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw) To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman It is not necessary to apply this patch to get all the anti-fragmentation code. This patch adds a new config option called CONFIG_ALLOCSTATS. If set, a number of new bean counters are added that are related to the anti-fragmentation code. The information is exported via /proc/buddyinfo. This is very useful when debugging why high-order pages are not available for allocation. Signed-off-by: Mel Gorman <mel@csn.ul.ie> diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h linux-2.6.14-rc5-mm1-007_stats/include/linux/mmzone.h --- linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h 2005-10-30 13:38:14.000000000 +0000 +++ linux-2.6.14-rc5-mm1-007_stats/include/linux/mmzone.h 2005-10-30 13:38:56.000000000 +0000 @@ -193,6 +193,17 @@ struct zone { /* Number of pages currently used for RCLM_FALLBACK */ unsigned long fallback_reserve; +#ifdef CONFIG_ALLOCSTATS + /* + * These are beancounters that track how the placement policy + * of the buddy allocator is performing + */ + unsigned long fallback_count[RCLM_TYPES]; + unsigned long alloc_count[RCLM_TYPES]; + unsigned long reserve_count[RCLM_TYPES]; + unsigned long kernnorclm_full_steal; + unsigned long kernnorclm_partial_steal; +#endif ZONE_PADDING(_pad1_) /* Fields commonly accessed by the page reclaim scanner */ @@ -292,6 +303,17 @@ struct zone { char *name; } ____cacheline_maxaligned_in_smp; +#ifdef CONFIG_ALLOCSTATS +#define inc_fallback_count(zone, type) zone->fallback_count[type]++ +#define inc_alloc_count(zone, type) zone->alloc_count[type]++ +#define inc_kernnorclm_partial_steal(zone) zone->kernnorclm_partial_steal++ +#define inc_kernnorclm_full_steal(zone) zone->kernnorclm_full_steal++ +#else +#define inc_fallback_count(zone, type) do {} while (0) +#define inc_alloc_count(zone, type) do {} while (0) +#define inc_kernnorclm_partial_steal(zone) do {} while (0) +#define inc_kernnorclm_full_steal(zone) do {} while (0) +#endif /* * The "priority" of VM scanning is how much of the queues we will scan in one @@ -319,12 +341,19 @@ static inline void inc_reserve_count(str { if (type == RCLM_FALLBACK) zone->fallback_reserve += PAGES_PER_MAXORDER; +#ifdef CONFIG_ALLOCSTATS + zone->reserve_count[type]++; +#endif } static inline void dec_reserve_count(struct zone *zone, int type) { if (type == RCLM_FALLBACK && zone->fallback_reserve) zone->fallback_reserve -= PAGES_PER_MAXORDER; +#ifdef CONFIG_ALLOCSTATS + if (zone->reserve_count[type] > 0) + zone->reserve_count[type]--; +#endif } /* diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/lib/Kconfig.debug linux-2.6.14-rc5-mm1-007_stats/lib/Kconfig.debug --- linux-2.6.14-rc5-mm1-006_percpu/lib/Kconfig.debug 2005-10-30 13:20:06.000000000 +0000 +++ linux-2.6.14-rc5-mm1-007_stats/lib/Kconfig.debug 2005-10-30 13:38:56.000000000 +0000 @@ -77,6 +77,17 @@ config SCHEDSTATS application, you can say N to avoid the very slight overhead this adds. 
+config ALLOCSTATS + bool "Collection buddy allocator statistics" + depends on DEBUG_KERNEL && PROC_FS + help + If you say Y here, additional code will be inserted into the + page allocator routines to collect statistics on the allocator + behavior and provide them in /proc/buddyinfo. These stats are + useful for measuring fragmentation in the buddy allocator. If + you are not debugging or measuring the allocator, you can say N + to avoid the slight overhead this adds. + config DEBUG_SLAB bool "Debug memory allocations" depends on DEBUG_KERNEL diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c linux-2.6.14-rc5-mm1-007_stats/mm/page_alloc.c --- linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c 2005-10-30 13:38:14.000000000 +0000 +++ linux-2.6.14-rc5-mm1-007_stats/mm/page_alloc.c 2005-10-30 13:38:56.000000000 +0000 @@ -187,6 +187,11 @@ EXPORT_SYMBOL(zone_table); static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" }; int min_free_kbytes = 1024; +#ifdef CONFIG_ALLOCSTATS +static char *type_names[RCLM_TYPES] = { "KernNoRclm", "EasyRclm", + "KernRclm", "Fallback"}; +#endif /* CONFIG_ALLOCSTATS */ + unsigned long __initdata nr_kernel_pages; unsigned long __initdata nr_all_pages; @@ -684,6 +689,9 @@ fallback_buddy_reserve(int start_allocty dec_reserve_count(zone, get_pageblock_type(zone,page)); set_pageblock_type(zone, page, reserve_type); inc_reserve_count(zone, reserve_type); + inc_kernnorclm_full_steal(zone); + } else { + inc_kernnorclm_partial_steal(zone); } return area; } @@ -726,6 +734,15 @@ fallback_alloc(int alloctype, struct zon current_order, area); } + + /* + * If the current alloctype is RCLM_FALLBACK, it means + * that the requested pool and fallback pool are both + * depleted and we are falling back to other pools. 
+ * At this point, pools are starting to get fragmented + */ + if (alloctype == RCLM_FALLBACK) + inc_fallback_count(zone, start_alloctype); } return NULL; @@ -742,6 +759,8 @@ static struct page *__rmqueue(struct zon unsigned int current_order; struct page *page; + inc_alloc_count(zone, alloctype); + for (current_order = order; current_order < MAX_ORDER; ++current_order) { area = &zone->free_area_lists[alloctype][current_order]; if (list_empty(&area->free_list)) @@ -2373,6 +2392,9 @@ static __devinit void init_currently_emp memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn); zone_init_free_lists(pgdat, zone, zone->spanned_pages); +#ifdef CONFIG_ALLOCSTATS + zone->reserve_count[RCLM_NORCLM] = zone->present_pages >> (MAX_ORDER-1); +#endif /* CONFIG_ALLOCSTATS */ } /* @@ -2528,6 +2550,18 @@ static int frag_show(struct seq_file *m, int order, t; struct free_area *area; unsigned long nr_bufs = 0; +#ifdef CONFIG_ALLOCSTATS + int i; + unsigned long kernnorclm_full_steal = 0; + unsigned long kernnorclm_partial_steal = 0; + unsigned long reserve_count[RCLM_TYPES]; + unsigned long fallback_count[RCLM_TYPES]; + unsigned long alloc_count[RCLM_TYPES]; + + memset(reserve_count, 0, sizeof(reserve_count)); + memset(fallback_count, 0, sizeof(fallback_count)); + memset(alloc_count, 0, sizeof(alloc_count)); +#endif for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) { if (!zone->present_pages) @@ -2548,6 +2582,86 @@ static int frag_show(struct seq_file *m, spin_unlock_irqrestore(&zone->lock, flags); seq_putc(m, '\n'); } + +#ifdef CONFIG_ALLOCSTATS + /* Show statistics for each allocation type */ + seq_printf(m, "\nPer-allocation-type statistics"); + for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) { + if (!zone->present_pages) + continue; + + spin_lock_irqsave(&zone->lock, flags); + for (t = 0; t < RCLM_TYPES; t++) { + struct list_head *elem; + seq_printf(m, "\nNode %d, zone %8s, type %10s ", + pgdat->node_id, zone->name, + type_names[t]); + for (order = 0; order < MAX_ORDER; ++order) { + nr_bufs = 0; + + list_for_each(elem, &zone->free_area_lists[t][order].free_list) + ++nr_bufs; + seq_printf(m, "%6lu ", nr_bufs); + } + } + + /* Scan global list */ + seq_printf(m, "\n"); + seq_printf(m, "Node %d, zone %8s, type %10s", + pgdat->node_id, zone->name, + "MAX_ORDER"); + nr_bufs = 0; + for (t = 0; t < RCLM_TYPES; t++) { + nr_bufs += + zone->free_area_lists[t][MAX_ORDER-1].nr_free; + } + seq_printf(m, "%6lu ", nr_bufs); + seq_printf(m, "\n"); + + seq_printf(m, "%s Zone beancounters\n", zone->name); + seq_printf(m, "Fallback reserve: %lu (%lu blocks)\n", + zone->fallback_reserve, + zone->fallback_reserve >> (MAX_ORDER-1)); + seq_printf(m, "Fallback needed: %lu (%lu blocks)\n", + zone->present_pages >> 3, + (zone->present_pages >> 3) >> (MAX_ORDER-1)); + seq_printf(m, "Partial steal: %lu\n", + zone->kernnorclm_partial_steal); + seq_printf(m, "Full steal: %lu\n", + zone->kernnorclm_full_steal); + + kernnorclm_partial_steal += zone->kernnorclm_partial_steal; + kernnorclm_full_steal += zone->kernnorclm_full_steal; + seq_putc(m, '\n'); + + for (i = 0; i< RCLM_TYPES; i++) { + seq_printf(m, "%-10s Allocs: %-10lu Reserve: %-10lu Fallbacks: %-10lu\n", + type_names[i], + zone->alloc_count[i], + zone->reserve_count[i], + zone->fallback_count[i]); + alloc_count[i] += zone->alloc_count[i]; + reserve_count[i] += zone->reserve_count[i]; + fallback_count[i] += zone->fallback_count[i]; + } + + spin_unlock_irqrestore(&zone->lock, flags); + } + + + /* Show bean counters */ + seq_printf(m, 
"\nGlobal beancounters\n"); + seq_printf(m, "Partial steal: %lu\n", kernnorclm_partial_steal); + seq_printf(m, "Full steal: %lu\n", kernnorclm_full_steal); + + for (i = 0; i< RCLM_TYPES; i++) { + seq_printf(m, "%-10s Allocs: %-10lu Reserve: %-10lu Fallbacks: %-10lu\n", + type_names[i], + alloc_count[i], + reserve_count[i], + fallback_count[i]); + } +#endif /* CONFIG_ALLOCSTATS */ return 0; } ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman ` (6 preceding siblings ...) 2005-10-30 18:34 ` [PATCH 7/7] Fragmentation Avoidance V19: 007_stats Mel Gorman @ 2005-10-31 5:57 ` Mike Kravetz 2005-10-31 6:37 ` Nick Piggin 7 siblings, 1 reply; 241+ messages in thread From: Mike Kravetz @ 2005-10-31 5:57 UTC (permalink / raw) To: Mel Gorman; +Cc: akpm, linux-mm, linux-kernel, lhms-devel On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote: > Here are a few brief reasons why this set of patches is useful; > > o Reduced fragmentation improves the chance a large order allocation succeeds > o General-purpose memory hotplug needs the page/memory groupings provided > o Reduces the number of badly-placed pages that page migration mechanism must > deal with. This also applies to any active page defragmentation mechanism. I can say that this patch set makes hotplug memory removal valuable on ppc64. My system has 6GB of memory and I would 'load it up' to the point where it would just start to swap and let it run for an hour. Without these patches, it was almost impossible to find a section that could be offlined. With the patches, I can consistently reduce memory to somewhere between 512MB and 1GB. Of course, results will vary based on workload. Also, this is most advantageous for memory hotplug on ppc64 due to the relatively small section size (16MB) as compared to the page grouping size (8MB). A more general-purpose solution is needed for memory hotplug support on architectures with larger section sizes. Just another data point, -- Mike ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz @ 2005-10-31 6:37 ` Nick Piggin 2005-10-31 7:54 ` Andrew Morton 0 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-10-31 6:37 UTC (permalink / raw) To: Mike Kravetz; +Cc: Mel Gorman, akpm, linux-mm, linux-kernel, lhms-devel Mike Kravetz wrote: > On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote: > >>Here are a few brief reasons why this set of patches is useful; >> >>o Reduced fragmentation improves the chance a large order allocation succeeds >>o General-purpose memory hotplug needs the page/memory groupings provided >>o Reduces the number of badly-placed pages that page migration mechanism must >> deal with. This also applies to any active page defragmentation mechanism. > > > I can say that this patch set makes hotplug memory remove be of > value on ppc64. My system has 6GB of memory and I would 'load > it up' to the point where it would just start to swap and let it > run for an hour. Without these patches, it was almost impossible > to find a section that could be offlined. With the patches, I > can consistently reduce memory to somewhere between 512MB and 1GB. > Of course, results will vary based on workload. Also, this is > most advantageous for memory hotlug on ppc64 due to relatively > small section size (16MB) as compared to the page grouping size > (8MB). A more general purpose solution is needed for memory hotplug > support on architectures with larger section sizes. > > Just another data point, Despite what people were trying to tell me at Ottawa, this patch set really does add quite a lot of complexity to the page allocator, and it seems to be increasingly only of benefit to dynamically allocating hugepages and memory hot unplug. If that is the case, do we really want to make such sacrifices for the huge machines that want these things? What about just making an extra zone for easy-to-reclaim things to live in? This could possibly even be resized at runtime according to demand with the memory hotplug stuff (though I haven't been following that). Don't take this as criticism of the actual implementation or its effectiveness. Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 6:37 ` Nick Piggin @ 2005-10-31 7:54 ` Andrew Morton 2005-10-31 7:11 ` Nick Piggin [not found] ` <27700000.1130769270@[10.10.2.4]> 0 siblings, 2 replies; 241+ messages in thread From: Andrew Morton @ 2005-10-31 7:54 UTC (permalink / raw) To: Nick Piggin; +Cc: kravetz, mel, linux-mm, linux-kernel, lhms-devel Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Mike Kravetz wrote: > > On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote: > > > >>Here are a few brief reasons why this set of patches is useful; > >> > >>o Reduced fragmentation improves the chance a large order allocation succeeds > >>o General-purpose memory hotplug needs the page/memory groupings provided > >>o Reduces the number of badly-placed pages that page migration mechanism must > >> deal with. This also applies to any active page defragmentation mechanism. > > > > > > I can say that this patch set makes hotplug memory remove be of > > value on ppc64. My system has 6GB of memory and I would 'load > > it up' to the point where it would just start to swap and let it > > run for an hour. Without these patches, it was almost impossible > > to find a section that could be offlined. With the patches, I > > can consistently reduce memory to somewhere between 512MB and 1GB. > > Of course, results will vary based on workload. Also, this is > > most advantageous for memory hotlug on ppc64 due to relatively > > small section size (16MB) as compared to the page grouping size > > (8MB). A more general purpose solution is needed for memory hotplug > > support on architectures with larger section sizes. > > > > Just another data point, > > Despite what people were trying to tell me at Ottawa, this patch > set really does add quite a lot of complexity to the page > allocator, and it seems to be increasingly only of benefit to > dynamically allocating hugepages and memory hot unplug. Remember that Rohit is seeing ~10% variation between runs of scientific software, and that his patch to use higher-order pages to preload the percpu-pages magazines fixed that up. I assume this means that it provided up to 10% speedup, which is a lot. But the patch caused page allocator fragmentation and several reports of gigE Tx buffer allocation failures, so I dropped it. We think that Mel's patches will allow us to reintroduce Rohit's optimisation. > If that is the case, do we really want to make such sacrifices > for the huge machines that want these things? What about just > making an extra zone for easy-to-reclaim things to live in? > > This could possibly even be resized at runtime according to > demand with the memory hotplug stuff (though I haven't been > following that). > > Don't take this as criticism of the actual implementation or its > effectiveness. > But yes, adding additional complexity is a black mark, and these patches add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw). ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 7:54 ` Andrew Morton @ 2005-10-31 7:11 ` Nick Piggin 2005-10-31 16:19 ` Mel Gorman [not found] ` <27700000.1130769270@[10.10.2.4]> 1 sibling, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-10-31 7:11 UTC (permalink / raw) To: Andrew Morton; +Cc: kravetz, mel, linux-mm, linux-kernel, lhms-devel Andrew Morton wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: >>Despite what people were trying to tell me at Ottawa, this patch >>set really does add quite a lot of complexity to the page >>allocator, and it seems to be increasingly only of benefit to >>dynamically allocating hugepages and memory hot unplug. > > > Remember that Rohit is seeing ~10% variation between runs of scientific > software, and that his patch to use higher-order pages to preload the > percpu-pages magazines fixed that up. I assume this means that it provided > up to 10% speedup, which is a lot. > OK, I wasn't aware of this. I wonder what other approaches we could try to add a bit of colour to our pages? I bet something simple like trying to hand out alternate odd/even pages per task might help. > But the patch caused page allocator fragmentation and several reports of > gigE Tx buffer allocation failures, so I dropped it. > > We think that Mel's patches will allow us to reintroduce Rohit's > optimisation. > > >>If that is the case, do we really want to make such sacrifices >>for the huge machines that want these things? What about just >>making an extra zone for easy-to-reclaim things to live in? >> >>This could possibly even be resized at runtime according to >>demand with the memory hotplug stuff (though I haven't been >>following that). >> >>Don't take this as criticism of the actual implementation or its >>effectiveness. >> > > > But yes, adding additional complexity is a black mark, and these patches > add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw). > They do look quite fine. They seem to get their claws pretty deep into page reclaim, but I guess that is to be expected if we want to increase readahead smarts much more. However, I'm hoping bits of that can be merged at a time, and interfaces and page reclaim stuff can be discussed and the best option taken. No such luck with these patches AFAIKS - simply adding another level of page groups, and another level of heuristics to the page allocator is going to hurt. By definition. I do wonder why zones can't be used... though I'm sure there are good reasons. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 7:11 ` Nick Piggin @ 2005-10-31 16:19 ` Mel Gorman 2005-10-31 23:54 ` Nick Piggin 0 siblings, 1 reply; 241+ messages in thread From: Mel Gorman @ 2005-10-31 16:19 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel On Mon, 31 Oct 2005, Nick Piggin wrote: > Andrew Morton wrote: > > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > Despite what people were trying to tell me at Ottawa, this patch > > > set really does add quite a lot of complexity to the page > > > allocator, and it seems to be increasingly only of benefit to > > > dynamically allocating hugepages and memory hot unplug. > > > > > > Remember that Rohit is seeing ~10% variation between runs of scientific > > software, and that his patch to use higher-order pages to preload the > > percpu-pages magazines fixed that up. I assume this means that it provided > > up to 10% speedup, which is a lot. > > > > OK, I wasn't aware of this. I wonder what other approaches we could > try to add a bit of colour to our pages? I bet something simple like > trying to hand out alternate odd/even pages per task might help. > Reading through the kernel archives, it appears that any page colouring scheme was getting rejected because it slowed up workloads like kernel compilers that were not very cache sensitive. Where an approach didn't suffer from that problem, there was disagreement over whether there was a general performance improvement or not. I recall Rohit's patch from an earlier -mm. Without knowing anything about his test, I am guessing he is getting cheap page colouring by preloading the per-cpu cache with contiguous pages and his workload is faulting in the batch of pages immediately by doing something like linearly reading a large array. Hence, the mappings of his workload are getting the right colour pages. This makes his workload a "lucky" workload. The general benefit of preloading the percpu magazines is that there is a chance the allocator only has to be called once, not pcp->batch times. An odd/even allocation scheme could be provided by having two free_lists in a free_area. One list for the "left buddy" and the other list for the "right buddy". However, at best, that would provide two colours. I'm not sure how much benefit it would give for the cost of more linked lists. > > gigE Tx buffer allocation failures, so I dropped it. > > > > We think that Mel's patches will allow us to reintroduce Rohit's > > optimisation. > > > > > > > If that is the case, do we really want to make such sacrifices > > > for the huge machines that want these things? What about just > > > making an extra zone for easy-to-reclaim things to live in? > > > > > > This could possibly even be resized at runtime according to > > > demand with the memory hotplug stuff (though I haven't been > > > following that). > > > > > > Don't take this as criticism of the actual implementation or its > > > effectiveness. > > > > > > > > > But yes, adding additional complexity is a black mark, and these patches > > add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw). > > > > They do look quite fine. They seem to get their claws pretty deep > into page reclaim, but I guess that is to be expected if we want > to increase readahead smarts much more. > > However, I'm hoping bits of that can be merged at a time, and > interfaces and page reclaim stuff can be discussed and the best > option taken. 
No such luck with these patches AFAIKS - simply > adding another level of page groups, and another level of > heuristics to the page allocator is going to hurt. By definition. > I do wonder why zones can't be used... though I'm sure there are > good reasons. > Granted, the patch set does add complexity even though I tried to keep it as simple as possible. Benchmarks were posted with each patchset to show that it was not suffering in real performance even if the code is a bit less approachable. Doing something similar with zones is an old idea and was brought up specifically for memory hotplug. In implementations, the zone was called ZONE_HOTREMOVABLE or something similar. In my opinion, replicating the effect of this set of patches with zones introduces its own set of headaches and ends up being far more complicated. Hopefully, someone will point out if I am missing historical context here, am rehashing old arguments or am just plain wrong :) To replicate the functionality of these patches with zones would require two additional zones for NormalEasy and HighmemEasy (I suck at naming things). The plus side is that once the zone fallback lists are updated, the page allocator remains more or less the same as it is today. Then the headaches start. Problem 1: Zone fallback lists are "one-way" and per-node. Let's assume a fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming we are allocating PTEs from high memory, we could fall back to the Normal zone even if highmem pages are available because the HighMem zone was out of pages. It will require very different fallback logic to say that HighMem allocations can also use HighMemEasy rather than falling back to Normal. Problem 2: Setting the zone size will be a very difficult tunable to get right. Right off, we are introducing a tunable which will make foreheads furrow. If the tunable is set wrong, system performance will suffer and we could see situations where kernel allocations fail because their zone got depleted. Problem 3: To get rid of the tunable, we could try resizing the zones dynamically but that will be hard. Obviously, the zones are going to be physically adjacent to each other. To resize the zone, the pages at one end of the zone will need to be free. Shrinking the NormalEasy zone would be easy enough, but shrinking the Normal zone with kernel pages in it would be considerably harder, if not outright impossible. One page in the wrong place will mean the zone cannot be resized. Problem 4: Page reclaim would have two new zones to deal with, bringing with it a new set of zone balancing problems. That brings its own special brand of fun. There may be more problems but these 4 are fairly important. This patchset does not suffer from the same problems. Problem 1: This patchset has a fallback list for each allocation type. So EasyRclm allocations can just as easily use an area reserved for kernel allocations and vice versa. Obviously we don't like it when this happens, but when it does, things start fragmenting rather than breaking. Problem 2: The number of pages that get reserved for each type grows and shrinks on demand. There is no tunable and no need for one. Problem 3: Problem doesn't exist for this patchset. Problem 4: Problem doesn't exist for this patchset. Bottom line, using zones will be more complex than this set of patches and bring a lot of tricky issues with it.
-- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
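For reference, the two-colour scheme Mel describes in words above (two free lists per free_area, split by buddy parity) could take roughly this shape; every name below is hypothetical, and as he notes it buys at most two colours:

#include <linux/list.h>

/* Hypothetical two-list free_area for the odd/even idea: buddies are
 * segregated by parity so an order-0 request can prefer the colour
 * matching the faulting vaddr. */
struct twocolour_free_area {
	struct list_head free_list[2];	/* [0] even buddy, [1] odd buddy */
	unsigned long nr_free;
};

/* Parity of the buddy containing pfn at the given order. */
static inline int buddy_colour(unsigned long pfn, unsigned int order)
{
	return (pfn >> order) & 1;
}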
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 16:19 ` Mel Gorman @ 2005-10-31 23:54 ` Nick Piggin 2005-11-01 1:28 ` Mel Gorman 0 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-10-31 23:54 UTC (permalink / raw) To: Mel Gorman; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel Mel Gorman wrote: > I recall Rohit's patch from an earlier -mm. Without knowing anything about > his test, I am guessing he is getting cheap page colouring by preloading > the per-cpu cache with contiguous pages and his workload is faulting in > the batch of pages immediately by doing something like linearly reading a > large array. Hence, the mappings of his workload are getting the right > colour pages. This makes his workload a "lucky" workload. The general > benefit of preloading the percpu magazines is that there is a chance the > allocator only has to be called once, not pcp->batch times. > Or we could introduce a new allocation mechanism for anon pages that passes the vaddr to the allocator, and tries to get an odd/even page according to the vaddr. > An odd/even allocation scheme could be provided by having two free_lists > in a free_area. One list for the "left buddy" and the other list for the > "right buddy". However, at best, that would provide two colours. I'm not > sure how much benefit it would give for the cost of more linked lists. > 2 colours should be a good first order improvement because you will no longer have adjacent pages of the same colour. It would definitely be cheaper than fragmentation avoidance + higher order batch loading. > To replicate the functionality of these patches with zones would require > two additional zones for NormalEasy and HighmemEasy (I suck at naming > things). The plus side is that once the zone fallback lists are updated, > the page allocator remains more or less the same as it is today. Then the > headaches start. > > Problem 1: Zone fallback lists are "one-way" and per-node. Lets assume a > fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming > we are allocating PTEs from high memory, we could fallback to the Normal > zone even if highmem pages are available because the HighMem zone was out > of pages. It will require very different fallback logic to say that > HighMem allocations can also use HighMemEasy rather than falling back to > Normal. > Just be a different set of GFP flags. Your patches obviously also have some ordering imposed.... pagecache would want HighMemEasy, HighMem, NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA. Note that if you do need to make some changes to the zone allocator, then IMO that is far preferable to add a new layer of things-that-are-blocks-of- -memory-but-not-zones, complete with their own balancing and other heuristics. > Problem 2: Setting the zone size will be a very difficult tunable to get > right. Right off, we are are introducing a tunable which will make > foreheads furrow. If the tunable is set wrong, system performance will > suffer and we could see situations where kernel allocations fail because > it's zone got depleted. > But even so, when you do automatic resizing, you seem to be adding a fundamental weak point in fragmentation avoidance. > Problem 3: To get rid of the tunable, we could try resizing the zones > dynamically but that will be hard. Obviously, the zones are going to be > physically adjacent to each other. To resize the zone, the pages at one > end of the zone will need to be free. 
Shrinking the NormalEasy zone would > be easy enough, but shrinking the Normal zone with kernel pages in it > would be considerably harder, if not outright impossible. One page in the > wrong place will mean the zone cannot be resized > OK, maybe it is hard ;) Do they really need to be resized, then? Isn't the big memory hotunplug push aimed at virtual machines and hypervisors anyway? In which case one would presumably have some memory that "must" be reclaimable, in which case we can't expand non-Easy zones into that memory anyway. > Problem 4: Page reclaim would have two new zones to deal with bringing > with it a new set of zone balancing problems. That brings it's own special > brand of fun. > > There may be more problems but these 4 are fairly important. This patchset > does not suffer from the same problems. > If page reclaim can't deal with 5 zones then it is going to have problems somewhere at 3 and needs to be fixed. I don't see how your patches get around this fun by simply introducing their own balancing and fallback heuristics. > Problem 1: This patchset has a fallback list for each allocation type. So > EasyRclm allocations can just as easily use an area reserved for kernel > allocations and vice versa. Obviously we don't like when this happens, but > when it does, things start fragmenting rather than breaking. > > Problem 2: The number of pages that get reserved for each type grows and > shrinks on demand. There is no tunable and no need for one. > > Problem 3: Problem doesn't exist for this patchset > > Problem 4: Problem doesn't exist for this patchset. > > Bottom line, using zones will be more complex than this set of patches and > bring a lot of tricky issues with it. > Maybe zones don't do exactly what you need, but I think they're better than you think ;) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
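The zone-based alternative Nick sketches would hang this ordering off per-class zone fallback lists rather than per-type free lists. Spelled out with entirely hypothetical zone names (no *Easy zones exist in any tree):

/* Hypothetical zones and fallback orderings from the discussion,
 * illustrating only the shape of the zone-based alternative. */
enum {
	ZONE_DMA_H, ZONE_NORMAL_H, ZONE_NORMAL_EASY_H,
	ZONE_HIGHMEM_H, ZONE_HIGHMEM_EASY_H, ZONE_SENTINEL_H
};

/* Pagecache: prefer easily reclaimed highmem, fall through to DMA. */
static const int pagecache_zonelist[] = {
	ZONE_HIGHMEM_EASY_H, ZONE_HIGHMEM_H, ZONE_NORMAL_EASY_H,
	ZONE_NORMAL_H, ZONE_DMA_H, ZONE_SENTINEL_H
};

/* PTE pages: skip the Easy zones entirely. */
static const int pte_zonelist[] = {
	ZONE_HIGHMEM_H, ZONE_NORMAL_H, ZONE_DMA_H, ZONE_SENTINEL_H
};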
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 23:54 ` Nick Piggin @ 2005-11-01 1:28 ` Mel Gorman 2005-11-01 1:42 ` Nick Piggin 0 siblings, 1 reply; 241+ messages in thread From: Mel Gorman @ 2005-11-01 1:28 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel On Tue, 1 Nov 2005, Nick Piggin wrote: > Mel Gorman wrote: > > > I recall Rohit's patch from an earlier -mm. Without knowing anything about > > his test, I am guessing he is getting cheap page colouring by preloading > > the per-cpu cache with contiguous pages and his workload is faulting in > > the batch of pages immediately by doing something like linearly reading a > > large array. Hence, the mappings of his workload are getting the right > > colour pages. This makes his workload a "lucky" workload. The general > > benefit of preloading the percpu magazines is that there is a chance the > > allocator only has to be called once, not pcp->batch times. > > > > Or we could introduce a new allocation mechanism for anon pages that > passes the vaddr to the allocator, and tries to get an odd/even page > according to the vaddr. > We could, but it is a different problem than what this set of patches is trying to address. I'll add page colouring to the end of the todo list in case I get stuck for something to do. > > An odd/even allocation scheme could be provided by having two free_lists > > in a free_area. One list for the "left buddy" and the other list for the > > "right buddy". However, at best, that would provide two colours. I'm not > > sure how much benefit it would give for the cost of more linked lists. > > > > 2 colours should be a good first order improvement because you will > no longer have adjacent pages of the same colour. > > It would definitely be cheaper than fragmentation avoidance + higher > order batch loading. > Ok, but the page colours would also need to be in the per-cpu lists; otherwise, this new API that supplies vaddrs always takes the spinlock for the free lists. I don't believe it would be cheaper, and any benefit would only show up on benchmarks that are cache sensitive. Judging by previous discussions on page colouring in the mail archives, Linus will happily kick the approach full of holes. As for current performance, the Aim9 benchmarks show that the fragmentation avoidance does not have a major performance penalty. A run of the patches in the -mm tree should find out if there are performance regressions on other machine types. > > > To replicate the functionality of these patches with zones would require > > two additional zones for NormalEasy and HighmemEasy (I suck at naming > > things). The plus side is that once the zone fallback lists are updated, > > the page allocator remains more or less the same as it is today. Then the > > headaches start. > > > > Problem 1: Zone fallback lists are "one-way" and per-node. Lets assume a > > fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming > > we are allocating PTEs from high memory, we could fallback to the Normal > > zone even if highmem pages are available because the HighMem zone was out > > of pages. It will require very different fallback logic to say that > > HighMem allocations can also use HighMemEasy rather than falling back to > > Normal. > > > > Just be a different set of GFP flags. Your patches obviously also have > some ordering imposed.... pagecache would want HighMemEasy, HighMem, > NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA.
> As well as a different set of GFP flags, we would also need new zone fallback logic which will hit the __alloc_pages() path. It will be adding more complexity to the allocator and we're replacing one type of complexity with another. > Note that if you do need to make some changes to the zone allocator, then > IMO that is far preferable to add a new layer of things-that-are-blocks-of- > -memory-but-not-zones, complete with their own balancing and other heuristics. > Thing is, with my approach, the very worst that happens is that it fragments just as bad as the normal allocator. With a zone-based approach, the worst that happens is that the kernel zone is too small, kernel caches do not grow to a suitable size and overall system performance degrades. > > Problem 2: Setting the zone size will be a very difficult tunable to get > > right. Right off, we are are introducing a tunable which will make > > foreheads furrow. If the tunable is set wrong, system performance will > > suffer and we could see situations where kernel allocations fail because > > it's zone got depleted. > > > > But even so, when you do automatic resizing, you seem to be adding a > fundamental weak point in fragmentation avoidance. > The sizing I do is when a large block is split. Then the region is just marked for a particular allocation type. This is very simple. The second resizing that occurs is when a kernel allocation "steal" easyrclm pages. I do not like the fact that we steal in this fashion but the alternative is to teach kswapd how to reclaim easyrclm pages from other areas. I view this as "future work" but if it was done, the "steal" mechanism would go away. > > Problem 3: To get rid of the tunable, we could try resizing the zones > > dynamically but that will be hard. Obviously, the zones are going to be > > physically adjacent to each other. To resize the zone, the pages at one > > end of the zone will need to be free. Shrinking the NormalEasy zone would > > be easy enough, but shrinking the Normal zone with kernel pages in it > > would be considerably harder, if not outright impossible. One page in the > > wrong place will mean the zone cannot be resized > > > > OK, maybe it is hard ;) Do they really need to be resized, then? > I think we would need to, yes. If the size of the region is wrong, bad things are likely to happen. If the kernel page zone is too small, it'll be under pressure even though there is memory available elsewhere. If it's too large, then it will get fragmented and high order allocations will fail. > Isn't the big memory hotunplug push aimed at virtual machines and > hypervisors anyway? In which case one would presumably have some > memory that "must" be reclaimable, in which case we can't expand > non-Easy zones into that memory anyway. > I believe that is the case for hotplug all right, but not the case where we just want to satisfy high order allocations in a reasonably reliable fashion. In that case, it would be nice to reclaim an easyrclm region. It has already been reported by Mike Kravetz that memory remove works a whole lot better on PPC64 with this patch than without it. Memory hotplug remove was not the problem I was trying to solve, but I consider the fact that it is helped to be a big plus. So, even though it is possible that this approach still gets fragmented under some workloads, we know that, in general, it does a pretty good job. > > Problem 4: Page reclaim would have two new zones to deal with bringing > > with it a new set of zone balancing problems. 
That brings it's own special > > brand of fun. > > > > There may be more problems but these 4 are fairly important. This patchset > > does not suffer from the same problems. > > > > If page reclaim can't deal with 5 zones then it is going to have problems > somewhere at 3 and needs to be fixed. I don't see how your patches get > around this fun by simply introducing their own balancing and fallback > heuristics. > If my approach gets the sizes of areas all wrong, it will fragment. If the zone-based approach gets the sizes of areas wrong, system performance degrades. I prefer the failure scenario of my approach :). > > Problem 1: This patchset has a fallback list for each allocation type. So > > EasyRclm allocations can just as easily use an area reserved for kernel > > allocations and vice versa. Obviously we don't like when this happens, but > > when it does, things start fragmenting rather than breaking. > > > > Problem 2: The number of pages that get reserved for each type grows and > > shrinks on demand. There is no tunable and no need for one. > > > > Problem 3: Problem doesn't exist for this patchset > > > > Problem 4: Problem doesn't exist for this patchset. > > > > Bottom line, using zones will be more complex than this set of patches and > > bring a lot of tricky issues with it. > > > > Maybe zones don't do exactly what you need, but I think they're better > than you think ;) > You may be right, but I still think that my approach is simpler and less likely to introduce horrible balancing problems. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 1:28 ` Mel Gorman @ 2005-11-01 1:42 ` Nick Piggin 0 siblings, 0 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-01 1:42 UTC (permalink / raw) To: Mel Gorman; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel Mel Gorman wrote: > On Tue, 1 Nov 2005, Nick Piggin wrote: > Ok, but the page colours would also need to be in the per-cpu lists; otherwise, this > new API that supplies vaddrs always takes the spinlock for the free lists. > I don't believe it would be cheaper and any benefit would only show up on > benchmarks that are cache sensitive. Judging by previous discussions on > page colouring in the mail archives, Linus will happily kick the approach > full of holes. > OK, but I'm just pointing out that improving page colouring doesn't require contiguous pages. > As for current performance, the Aim9 benchmarks show that the > fragmentation avoidance does not have a major performance penalty. A run > of the patches in the -mm tree should find out if there are performance > regressions on other machine types. > But I can see that there will be penalties. Cache misses, branches, etc. Obviously any new feature or more sophisticated behaviour is going to incur some of that, but it obviously needs good justification. >>Just be a different set of GFP flags. Your patches obviously also have >>some ordering imposed.... pagecache would want HighMemEasy, HighMem, >>NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA. >> > > > As well as a different set of GFP flags, we would also need new zone > fallback logic which will hit the __alloc_pages() path. It will be adding > more complexity to the allocator and we're replacing one type of > complexity with another. > It is complexity that is mostly already handled for us with the zones logic. Picking out a couple of small points that zones don't get exactly right isn't a good basis to come up with a completely new zoning layer. > >>Note that if you do need to make some changes to the zone allocator, then >>IMO that is far preferable to add a new layer of things-that-are-blocks-of- >>-memory-but-not-zones, complete with their own balancing and other heuristics. >> > > > Thing is, with my approach, the very worst that happens is that it > fragments just as bad as the normal allocator. With a zone-based approach, > the worst that happens is that the kernel zone is too small, kernel caches > do not grow to a suitable size and overall system performance degrades. > If you don't need to guarantee higher order allocations, then there is no problem with our current approach. If you do then you simply need to make a sacrifice. > >>>Problem 2: Setting the zone size will be a very difficult tunable to get >>>right. Right off, we are are introducing a tunable which will make >>>foreheads furrow. If the tunable is set wrong, system performance will >>>suffer and we could see situations where kernel allocations fail because >>>it's zone got depleted. >>> >> >>But even so, when you do automatic resizing, you seem to be adding a >>fundamental weak point in fragmentation avoidance. >> > > > The sizing I do is when a large block is split. Then the region is just > marked for a particular allocation type. This is very simple. The second > resizing that occurs is when a kernel allocation "steal" easyrclm pages. I > do not like the fact that we steal in this fashion but the alternative is > to teach kswapd how to reclaim easyrclm pages from other areas.
I view > this as "future work" but if it was done, the "steal" mechanism would go > away. > Weak point, as in: gets fragmented. > >>>Problem 3: To get rid of the tunable, we could try resizing the zones >>>dynamically but that will be hard. Obviously, the zones are going to be >>>physically adjacent to each other. To resize the zone, the pages at one >>>end of the zone will need to be free. Shrinking the NormalEasy zone would >>>be easy enough, but shrinking the Normal zone with kernel pages in it >>>would be considerably harder, if not outright impossible. One page in the >>>wrong place will mean the zone cannot be resized >>> >> >>OK, maybe it is hard ;) Do they really need to be resized, then? >> > > > I think we would need to, yes. If the size of the region is wrong, bad > things are likely to happen. If the kernel page zone is too small, it'll > be under pressure even though there is memory available elsewhere. If it's > too large, then it will get fragmented and high order allocations will > fail. > But people will just have to get it right then. If they want to be able to hot unplug 10G of memory, or allocate 4G of hugepages on demand, then they simply need to specify their requirements. Not too difficult? It is really nice to be able to place some burden on huge servers and mainframes, because they have people administering and tuning them full-time. It allows us to not penalise small servers and desktops. > >>Isn't the big memory hotunplug push aimed at virtual machines and >>hypervisors anyway? In which case one would presumably have some >>memory that "must" be reclaimable, in which case we can't expand >>non-Easy zones into that memory anyway. >> > > > I believe that is the case for hotplug all right, but not the case where > we just want to satisfy high order allocations in a reasonably reliable > fashion. In that case, it would be nice to reclaim an easyrclm region. > As I've said before, I think this is a false hope and we need to move away from higher order allocations. > It has already been reported by Mike Kravetz that memory remove works a > whole lot better on PPC64 with this patch than without it. Memory hotplug > remove was not the problem I was trying to solve, but I consider the fact > that it is helped to be a big plus. So, even though it is possible that > this approach still gets fragmented under some workloads, we know that, in > general, it does a pretty good job. > Sure, but using zones would work too, and on the plus side you would be able to specify exactly how much removable memory there should be. >> >>Maybe zones don't do exactly what you need, but I think they're better >>than you think ;) >> > > > You may be right, but I still think that my approach is simpler and less > likely to introduce horrible balancing problems. > Simpler? We already have zones though. They are a complexity we need to deal with already. I really can't see how you can use the simpler argument in favour of your patches ;) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 [not found] ` <3660000.1130787652@flay> @ 2005-10-31 23:59 ` Nick Piggin 2005-11-01 1:36 ` Mel Gorman 0 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-10-31 23:59 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrew Morton, kravetz, mel, linux-mm, linux-kernel, lhms-devel Martin J. Bligh wrote: > --On Monday, October 31, 2005 11:24:09 -0800 Andrew Morton <akpm@osdl.org> wrote: >>I suspect this would all be a non-issue if the net drivers were using >>__GFP_NOWARN ;) > > > We still need to allocate them, even if it's GFP_KERNEL. As memory gets > larger and larger, and we have no targetted reclaim, we'll have to blow > away more and more stuff at random before we happen to get contiguous > free areas. Just statistics aren't in your favour ... Getting 4 contig > pages on a 1GB desktop is much harder than on a 128MB machine. > However, these allocations are not of the "easy to reclaim" type, in which case they just use the regular fragmented-to-shit areas. If no contiguous pages are available from there, then an easy-reclaim area needs to be stolen, right? In which case I don't see why these patches don't have similar long term failure cases if there is strong demand for higher order allocations. Prolong things a bit, perhaps, but... > Is not going to get better as time goes on ;-) Yeah, yeah, I know, you > want recreates, numbers, etc. Not the easiest thing to reproduce in a > short-term consistent manner though. > Regardless, I think we need to continue our steady move away from higher order allocation requirements. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 23:59 ` Nick Piggin @ 2005-11-01 1:36 ` Mel Gorman 0 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-01 1:36 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel On Tue, 1 Nov 2005, Nick Piggin wrote: > Martin J. Bligh wrote: > > --On Monday, October 31, 2005 11:24:09 -0800 Andrew Morton <akpm@osdl.org> > > wrote: > > > > I suspect this would all be a non-issue if the net drivers were using > > > __GFP_NOWARN ;) > > > > > > We still need to allocate them, even if it's GFP_KERNEL. As memory gets > > larger and larger, and we have no targetted reclaim, we'll have to blow > > away more and more stuff at random before we happen to get contiguous > > free areas. Just statistics aren't in your favour ... Getting 4 contig > > pages on a 1GB desktop is much harder than on a 128MB machine. > > However, these allocations are not of the "easy to reclaim" type, in > which case they just use the regular fragmented-to-shit areas. If no > contiguous pages are available from there, then an easy-reclaim area > needs to be stolen, right? > Right. > In which case I don't see why these patches don't have similar long > term failure cases if there is strong demand for higher order > allocations. Prolong things a bit, perhaps, but... > It all hinges on how long-lived the high order kernel allocation is. If it's short-lived, it will get freed back to the easyrclm free lists and we don't fragment. If it turns out to be long lived, then we are in trouble. If this turns out to be the case, a possibility would be to use the __GFP_KERNRCLM flag for high-order, short-lived allocations. This would tend to group large free areas in the same place. It would only be worth investigating if we found that memory still got fragmented over very long periods of time. > > Is not going to get better as time goes on ;-) Yeah, yeah, I know, you > want recreates, numbers, etc. Not the easiest thing to reproduce in a > short-term consistent manner though. > > Regardless, I think we need to continue our steady move away from > higher order allocation requirements. > No argument with you there. My actual aim is to guarantee HugeTLB allocations for userspace which we currently have to reserve at boot time. Stuff like memory hotplug remove and high order kernel allocations are benefits that would be nice to pick up on the way. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
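Mel's __GFP_KERNRCLM suggestion, shown as a hypothetical caller: tag a high-order but short-lived allocation with the series' flag (it is not a mainline flag) so such buffers cluster in kernel-reclaimable blocks instead of splitting an easyrclm area. The function itself is invented for illustration:

#include <linux/gfp.h>

/* Hypothetical example only: an order-2 (4-page, 16KB) buffer that is
 * freed again quickly, tagged with the series' __GFP_KERNRCLM so
 * repeated allocations group with reclaimable kernel pages. */
static struct page *alloc_short_lived_buffer(void)
{
	return alloc_pages(GFP_KERNEL | __GFP_KERNRCLM, 2);
}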
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
       [not found] ` <4366C559.5090504@yahoo.com.au>
@ 2005-11-01 15:25   ` Martin J. Bligh
  2005-11-01 15:33     ` Dave Hansen
       [not found]     ` <Pine.LNX.4.58.0511010137020.29390@skynet>
  1 sibling, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-01 15:25 UTC (permalink / raw)
  To: Nick Piggin, Mel Gorman
  Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel,
      Ingo Molnar

> I really don't think we *want* to say we support higher order allocations
> absolutely robustly, nor do we want people using them if possible. Because
> we don't. Even with your patches.
>
> Ingo also brought up this point at Ottawa.

Some of the driver issues can be fixed by scatter-gather DMA *if* the
h/w supports it. But what exactly do you propose to do about kernel
stacks, etc? By the time you've fixed all the individual usages of it,
frankly, it would be easier to provide a generic mechanism to fix the
problem ...

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 15:25 ` Martin J. Bligh
@ 2005-11-01 15:33   ` Dave Hansen
  2005-11-01 16:57     ` Mel Gorman
  2005-11-01 18:58     ` Rob Landley
  0 siblings, 2 replies; 241+ messages in thread
From: Dave Hansen @ 2005-11-01 15:33 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Nick Piggin, Mel Gorman, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Ingo Molnar

On Tue, 2005-11-01 at 07:25 -0800, Martin J. Bligh wrote:
> > I really don't think we *want* to say we support higher order allocations
> > absolutely robustly, nor do we want people using them if possible. Because
> > we don't. Even with your patches.
> >
> > Ingo also brought up this point at Ottawa.
>
> Some of the driver issues can be fixed by scatter-gather DMA *if* the
> h/w supports it. But what exactly do you propose to do about kernel
> stacks, etc? By the time you've fixed all the individual usages of it,
> frankly, it would be easier to provide a generic mechanism to fix the
> problem ...

That generic mechanism is the kernel virtual remapping. However, it has
a runtime performance cost, which is increased TLB footprint inside the
kernel, and a more costly implementation of __pa() and __va().

I'll admit, I'm biased toward partial solutions without runtime cost
before we start incurring constant cost across the entire kernel,
especially when those partial solutions have other potential in-kernel
users.

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 15:33 ` Dave Hansen
@ 2005-11-01 16:57   ` Mel Gorman
  2005-11-01 17:00     ` Mel Gorman
  0 siblings, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-01 16:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Martin J. Bligh, Nick Piggin, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Ingo Molnar

On Tue, 1 Nov 2005, Dave Hansen wrote:

> On Tue, 2005-11-01 at 07:25 -0800, Martin J. Bligh wrote:
> > > I really don't think we *want* to say we support higher order allocations
> > > absolutely robustly, nor do we want people using them if possible. Because
> > > we don't. Even with your patches.
> > >
> > > Ingo also brought up this point at Ottawa.
> >
> > Some of the driver issues can be fixed by scatter-gather DMA *if* the
> > h/w supports it. But what exactly do you propose to do about kernel
> > stacks, etc? By the time you've fixed all the individual usages of it,
> > frankly, it would be easier to provide a generic mechanism to fix the
> > problem ...
>
> That generic mechanism is the kernel virtual remapping. However, it has
> a runtime performance cost, which is increased TLB footprint inside the
> kernel, and a more costly implementation of __pa() and __va().
>
> I'll admit, I'm biased toward partial solutions without runtime cost
> before we start incurring constant cost across the entire kernel,
> especially when those partial solutions have other potential in-kernel
> users.

To give an idea of the increased TLB footprint, I ran an aim9 test with
cpu_has_pse disabled in include/asm-i386/cpufeature.h to force the use
of small pages for the physical memory mappings.

These are the -clean results:

                          clean       clean-nopse
 1 creat-clo            16006.00      15294.90     -711.10  -4.44% File Creations and Closes/second
 2 page_test           117515.83     118677.11     1161.28   0.99% System Allocations & Pages/second
 3 brk_test            440289.81     436042.64    -4247.17  -0.96% System Memory Allocations/second
 4 jmp_test           4179466.67    4173266.67    -6200.00  -0.15% Non-local gotos/second
 5 signal_test          80803.20      78286.95    -2516.25  -3.11% Signal Traps/second
 6 exec_test               61.75         60.45       -1.30  -2.11% Program Loads/second
 7 fork_test             1327.01       1318.11       -8.90  -0.67% Task Creations/second
 8 link_test             5531.53       5406.60     -124.93  -2.26% Link/Unlink Pairs/second

This is what mbuddy-v19 with and without pse looks like:

                      mbuddy-v19  mbuddy-v19-nopse
 1 creat-clo            15889.41      15328.22     -561.19  -3.53% File Creations and Closes/second
 2 page_test           117082.15     116892.70     -189.45  -0.16% System Allocations & Pages/second
 3 brk_test            437887.37     432716.97    -5170.40  -1.18% System Memory Allocations/second
 4 jmp_test           4179950.00    4176087.32    -3862.68  -0.09% Non-local gotos/second
 5 signal_test          85335.78      78553.57    -6782.21  -7.95% Signal Traps/second
 6 exec_test               61.92         60.61       -1.31  -2.12% Program Loads/second
 7 fork_test             1342.21       1292.26      -49.95  -3.72% Task Creations/second
 8 link_test             5555.55       5412.90     -142.65  -2.57% Link/Unlink Pairs/second

--
Mel Gorman
Part-time PhD Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 16:57 ` Mel Gorman
@ 2005-11-01 17:00   ` Mel Gorman
  0 siblings, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-01 17:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Martin J. Bligh, Nick Piggin, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Ingo Molnar

On Tue, 1 Nov 2005, Mel Gorman wrote:

> On Tue, 1 Nov 2005, Dave Hansen wrote:
>
> > That generic mechanism is the kernel virtual remapping. However, it has
> > a runtime performance cost, which is increased TLB footprint inside the
> > kernel, and a more costly implementation of __pa() and __va().
> >
> > I'll admit, I'm biased toward partial solutions without runtime cost
> > before we start incurring constant cost across the entire kernel,
> > especially when those partial solutions have other potential in-kernel
> > users.
>
> To give an idea of the increased TLB footprint, I ran an aim9 test with
> cpu_has_pse disabled in include/asm-i386/cpufeature.h to force the use
> of small pages for the physical memory mappings.
>
> These are the -clean results:
>
>                           clean       clean-nopse
>  1 creat-clo            16006.00      15294.90     -711.10  -4.44% File Creations and Closes/second
>  2 page_test           117515.83     118677.11     1161.28   0.99% System Allocations & Pages/second
>  3 brk_test            440289.81     436042.64    -4247.17  -0.96% System Memory Allocations/second
>  4 jmp_test           4179466.67    4173266.67    -6200.00  -0.15% Non-local gotos/second
>  5 signal_test          80803.20      78286.95    -2516.25  -3.11% Signal Traps/second
>  6 exec_test               61.75         60.45       -1.30  -2.11% Program Loads/second
>  7 fork_test             1327.01       1318.11       -8.90  -0.67% Task Creations/second
>  8 link_test             5531.53       5406.60     -124.93  -2.26% Link/Unlink Pairs/second
>
> This is what mbuddy-v19 with and without pse looks like:
>
>                       mbuddy-v19  mbuddy-v19-nopse
>  1 creat-clo            15889.41      15328.22     -561.19  -3.53% File Creations and Closes/second
>  2 page_test           117082.15     116892.70     -189.45  -0.16% System Allocations & Pages/second
>  3 brk_test            437887.37     432716.97    -5170.40  -1.18% System Memory Allocations/second
>  4 jmp_test           4179950.00    4176087.32    -3862.68  -0.09% Non-local gotos/second
>  5 signal_test          85335.78      78553.57    -6782.21  -7.95% Signal Traps/second
>  6 exec_test               61.92         60.61       -1.31  -2.12% Program Loads/second
>  7 fork_test             1342.21       1292.26      -49.95  -3.72% Task Creations/second
>  8 link_test             5555.55       5412.90     -142.65  -2.57% Link/Unlink Pairs/second
>

I forgot to include the comparison between -clean and -mbuddy-v19-nopse:

                          clean   mbuddy-v19-nopse
 1 creat-clo            16006.00      15328.22     -677.78  -4.23% File Creations and Closes/second
 2 page_test           117515.83     116892.70     -623.13  -0.53% System Allocations & Pages/second
 3 brk_test            440289.81     432716.97    -7572.84  -1.72% System Memory Allocations/second
 4 jmp_test           4179466.67    4176087.32    -3379.35  -0.08% Non-local gotos/second
 5 signal_test          80803.20      78553.57    -2249.63  -2.78% Signal Traps/second
 6 exec_test               61.75         60.61       -1.14  -1.85% Program Loads/second
 7 fork_test             1327.01       1292.26      -34.75  -2.62% Task Creations/second
 8 link_test             5531.53       5412.90     -118.63  -2.14% Link/Unlink Pairs/second

--
Mel Gorman
Part-time PhD Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
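For anyone following the arithmetic in these tables: the last two numeric
columns are plain absolute and relative deltas against the first column. A
trivial standalone check (my restatement, not whatever script generated these
reports), using the creat-clo row from the table above:

	#include <stdio.h>

	int main(void)
	{
		double base = 16006.00, test = 15328.22;	/* creat-clo row */

		/* prints "-677.78 -4.23%", matching the table */
		printf("%.2f %.2f%%\n", test - base,
		       (test - base) / base * 100.0);
		return 0;
	}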
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 15:33 ` Dave Hansen
  2005-11-01 16:57   ` Mel Gorman
@ 2005-11-01 18:58   ` Rob Landley
  1 sibling, 0 replies; 241+ messages in thread
From: Rob Landley @ 2005-11-01 18:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Martin J. Bligh, Nick Piggin, Mel Gorman, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms, Ingo Molnar

On Tuesday 01 November 2005 09:33, Dave Hansen wrote:
> On Tue, 2005-11-01 at 07:25 -0800, Martin J. Bligh wrote:
> > > I really don't think we *want* to say we support higher order
> > > allocations absolutely robustly, nor do we want people using them if
> > > possible. Because we don't. Even with your patches.
> > >
> > > Ingo also brought up this point at Ottawa.
> >
> > Some of the driver issues can be fixed by scatter-gather DMA *if* the
> > h/w supports it. But what exactly do you propose to do about kernel
> > stacks, etc? By the time you've fixed all the individual usages of it,
> > frankly, it would be easier to provide a generic mechanism to fix the
> > problem ...
>
> That generic mechanism is the kernel virtual remapping. However, it has
> a runtime performance cost, which is increased TLB footprint inside the
> kernel, and a more costly implementation of __pa() and __va().

Ok, right now the kernel _has_ a virtual mapping, it's just a 1:1 with
the physical mapping, right?

In theory, if you restrict all kernel unmovable mappings to a physically
contiguous address range (something like ZONE_DMA) that's at the start
of the physical address space, then what you could do is have a
two-kernel-monte like situation where if you _NEED_ to move the kernel
you quiesce the system (as if you're going to swsusp), figure out where
the new start of physical memory will be when this bank goes bye-bye,
memcpy the whole mess to the new location, adjust your one VMA, and then
call the swsusp unfreeze stuff.

This is ugly, and a huge latency spike, but why wouldn't it work? The
problem now becomes finding some NEW physically contiguous range to
shoehorn the kernel into, and that's a problem that Mel's already
addressing...

Rob

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
       [not found] ` <Pine.LNX.4.58.0511011014060.14884@skynet>
@ 2005-11-01 13:56   ` Ingo Molnar
  2005-11-01 14:10     ` Dave Hansen
                       ` (2 more replies)
  0 siblings, 3 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-01 13:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
      linux-kernel, lhms-devel

* Mel Gorman <mel@csn.ul.ie> wrote:

> The set of patches do fix a lot and make a strong start at addressing
> the fragmentation problem, just not 100% of the way. [...]

do you have an expectation to be able to solve the 'fragmentation
problem', all the time, in a 100% way, now or in the future?

> So, with this set of patches, how fragmented you get is dependent on
> the workload and it may still break down and high order allocations
> will fail. But the current situation is that it will definitely break
> down. The fact is that it has been reported that memory hotplug remove
> works with these patches and doesn't without them. Granted, this is
> just one feature on a high-end machine, but it is one solid operation
> we can perform with the patches and cannot without them. [...]

can you always, under any circumstance, hot unplug RAM with these
patches applied? If not, do you have any expectation to reach 100%?

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 13:56 ` Ingo Molnar
@ 2005-11-01 14:10   ` Dave Hansen
  2005-11-01 14:29     ` Ingo Molnar
  2005-11-01 14:41   ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
  2005-11-01 18:23   ` Rob Landley
  2 siblings, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2005-11-01 14:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms

On Tue, 2005-11-01 at 14:56 +0100, Ingo Molnar wrote:
> * Mel Gorman <mel@csn.ul.ie> wrote:
>
> > The set of patches do fix a lot and make a strong start at addressing
> > the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation
> problem', all the time, in a 100% way, now or in the future?

In a word, yes. The current allocator has no design for measuring or
reducing fragmentation. These patches provide the framework for at
least measuring fragmentation.

The patches cannot do anything magical, and there will be a point where
the system has to make a choice: fragment, or fail an allocation when
there _is_ free memory. These patches take us in a direction where we
are capable of making such a decision.

> > So, with this set of patches, how fragmented you get is dependent on
> > the workload and it may still break down and high order allocations
> > will fail. But the current situation is that it will definitely break
> > down. The fact is that it has been reported that memory hotplug remove
> > works with these patches and doesn't without them. Granted, this is
> > just one feature on a high-end machine, but it is one solid operation
> > we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance, hot unplug RAM with these
> patches applied? If not, do you have any expectation to reach 100%?

With these patches, no. There are currently some very nice,
pathological workloads which will still cause fragmentation. But, in
the interest of incremental feature introduction, I think they're a
fine first step. We can effectively reach toward a more comprehensive
solution on top of these patches.

Reaching truly 100% will require some other changes, such as being able
to virtually remap things like kernel text.

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 14:10 ` Dave Hansen
@ 2005-11-01 14:29   ` Ingo Molnar
  2005-11-01 14:49     ` Dave Hansen
  0 siblings, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-01 14:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms

* Dave Hansen <haveblue@us.ibm.com> wrote:

> > can you always, under any circumstance, hot unplug RAM with these
> > patches applied? If not, do you have any expectation to reach 100%?
>
> With these patches, no. There are currently some very nice,
> pathological workloads which will still cause fragmentation. But, in
> the interest of incremental feature introduction, I think they're a
> fine first step. We can effectively reach toward a more comprehensive
> solution on top of these patches.
>
> Reaching truly 100% will require some other changes, such as being able
> to virtually remap things like kernel text.

then we need to see that 100% solution first - at least in terms of
conceptual steps. Not being able to hot-unplug RAM in a 100% way won't
satisfy customers. Whatever solution we choose, it must work 100%.

Just to give a comparison: would you be content with your computer
failing to start up apps 1 time out of 100, saying that 99% is good
enough? Or would you call it what it is: buggy and unreliable?

to stress it: hot unplug is a _feature_ that must work 100%, _not_ some
optimization where 99% is good enough. This is a feature that people
will be depending on if we promise it, and a 1% failure rate is not
acceptable. Your 'pathological workload' might be customer X's daily
workload. Unless there is a clear definition of what is possible and
what is not (which definition can be relied upon by users), having a
99% solution is much worse than the current 0% solution!

worse than that, this is a known _hard_ problem to solve in a 100% way,
and saying 'this patch is a good first step' just lures us (and
customers) into believing that we are only 1% away from the desired
100% solution, while nothing could be further from the truth. They will
demand the remaining 1%, but can we offer it? Unless you can provide a
clear, agreed-upon path towards the 100% solution, we have nothing
right now.

I have no problems with using higher-order pages for performance
purposes [*], as long as 'failed' allocation (and freeing) actions are
user-invisible. But the moment you make it user-visible, it _must_ work
in a deterministic way!

	Ingo

[*] in which case any slowdown in the page allocator must be offset by
    the gains.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 14:29 ` Ingo Molnar
@ 2005-11-01 14:49   ` Dave Hansen
  2005-11-01 15:01     ` Ingo Molnar
  2005-11-02  0:51     ` Nick Piggin
  0 siblings, 2 replies; 241+ messages in thread
From: Dave Hansen @ 2005-11-01 14:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms

On Tue, 2005-11-01 at 15:29 +0100, Ingo Molnar wrote:
> * Dave Hansen <haveblue@us.ibm.com> wrote:
> > > can you always, under any circumstance, hot unplug RAM with these
> > > patches applied? If not, do you have any expectation to reach 100%?
> >
> > With these patches, no. There are currently some very nice,
> > pathological workloads which will still cause fragmentation. But, in
> > the interest of incremental feature introduction, I think they're a
> > fine first step. We can effectively reach toward a more comprehensive
> > solution on top of these patches.
> >
> > Reaching truly 100% will require some other changes, such as being able
> > to virtually remap things like kernel text.
>
> then we need to see that 100% solution first - at least in terms of
> conceptual steps.

I don't think saying "truly 100%" really even makes sense. There will
always be restrictions of some kind. For instance, with a 10MB kernel
image, should you be able to shrink the memory in the system below
10MB? ;)

There is also no precedent in existing UNIXes for a 100% solution. From
http://publib.boulder.ibm.com/infocenter/pseries/index.jsp?topic=/com.ibm.aix.doc/aixbman/prftungd/dlpar.htm ,
a seemingly arbitrary restriction:

	A memory region that contains a large page cannot be removed.

What the fragmentation patches _can_ give us is the ability to have
100% success in removing certain areas: the "user-reclaimable" areas
referenced in the patch. This gives a customer at least the ability to
plan for how dynamically reconfigurable a system should be.

After these patches, the next logical steps are to increase the
knowledge that the slabs have about fragmentation, and to teach some of
the shrinkers about fragmentation.

After that, we'll need some kind of virtual remapping, breaking the 1:1
kernel virtual mapping, so that the most problematic pages can be
remapped. These pages would retain their virtual addresses, but get new
physical ones. However, this is quite far down the road and will
require some serious evaluation because it impacts how normal devices
are able to do DMA. The ppc64 proprietary hypervisor has features to
work around these issues, and any new hypervisors wishing to support
partition memory hotplug would likely have to follow suit.

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 14:49 ` Dave Hansen
@ 2005-11-01 15:01   ` Ingo Molnar
  2005-11-01 15:22     ` Dave Hansen
  2005-11-01 16:48     ` Kamezawa Hiroyuki
  0 siblings, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-01 15:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms

* Dave Hansen <haveblue@us.ibm.com> wrote:

> > then we need to see that 100% solution first - at least in terms of
> > conceptual steps.
>
> I don't think saying "truly 100%" really even makes sense. There will
> always be restrictions of some kind. For instance, with a 10MB kernel
> image, should you be able to shrink the memory in the system below
> 10MB? ;)

think of it in terms of filesystem shrinking: yes, obviously you cannot
shrink to below the allocated size, but no user expects to be able to
do it. But users would not accept filesystem shrinking failing for
certain file layouts. In that case we are better off with no ability to
shrink: it makes it clear that we have not solved the problem, yet.

so it's all about expectations: _could_ you reasonably remove a piece
of RAM? Customer will say: "I have stopped all nonessential services,
and free RAM is at 90%, still I cannot remove that piece of faulty RAM,
fix the kernel!". No reasonable customer will say: "True, I have all
RAM used up in mlock()ed sections, but i want to remove some RAM
nevertheless".

> There is also no precedent in existing UNIXes for a 100% solution.

does this have any relevance to the point, other than to prove that
it's a hard problem that we should not pretend to be able to solve,
without seeing a clear path towards a solution?

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 15:01 ` Ingo Molnar
@ 2005-11-01 15:22   ` Dave Hansen
       [not found]     ` <20051102084946.GA3930@elte.hu>
  0 siblings, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2005-11-01 15:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms

On Tue, 2005-11-01 at 16:01 +0100, Ingo Molnar wrote:
> so it's all about expectations: _could_ you reasonably remove a piece
> of RAM? Customer will say: "I have stopped all nonessential services,
> and free RAM is at 90%, still I cannot remove that piece of faulty RAM,
> fix the kernel!".

That's an excellent example. Until we have some kind of kernel
remapping, breaking the 1:1 kernel virtual mapping, these pages will
always exist. The easiest example of this kind of memory is kernel
text.

Another example might be a somewhat errant device driver which has
allocated some large buffers and is doing DMA to or from them. In this
case, we need to have APIs to require devices to give up and reacquire
any dynamically allocated structures (a sketch of what such an
interface might look like follows this mail). If the device driver does
not implement these APIs, it is not compatible with memory hotplug.

> > There is also no precedent in existing UNIXes for a 100% solution.
>
> does this have any relevance to the point, other than to prove that
> it's a hard problem that we should not pretend to be able to solve,
> without seeing a clear path towards a solution?

Agreed. It is a hard problem. One that some other UNIXes have not fully
solved.

Here are the steps that I think we need to take. Do you see any holes
in their coverage? Anything that seems infeasible?

1. Fragmentation avoidance
   * by itself, increases the likelihood of having an area of memory
     which might be easily removed
   * very small (if any) performance overhead
   * other potential in-kernel users
   * creates infrastructure to enforce the "hotpluggability" of any
     particular area of memory

2. Driver APIs
   * Require that drivers specifically request areas which must retain
     constant physical addresses
   * Driver must relinquish control of such areas upon request
   * Can be worked around by hypervisors

3. Break 1:1 Kernel Virtual/Physical Mapping
   * In any large area of physical memory we wish to remove, there will
     likely be very, very few straggler pages which can not easily be
     freed.
   * Kernel will transparently move the contents of these physical
     pages to new pages, keeping constant virtual addresses.
   * Increased TLB overhead, as in-kernel large page mappings are
     broken down into smaller pages.
   * __{p,v}a() become more expensive, likely a table lookup (also
     sketched below)

I've already done (3) on a limited basis, in the early days of memory
hotplug. Not the remapping, just breaking the 1:1 assumptions. It
wasn't too horribly painful.

We'll also need to make some decisions along the way about what to do
about things like large pages. Is it better to just punt like AIX and
refuse to remove their areas? Break them down into small pages and
degrade performance?

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
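A rough sketch of the shape the driver API in step 2 might take, assuming it
is built as a callback pair; nothing like this exists in the kernel at this
point, and every name below (memhotplug_ops, quiesce, reacquire) is
hypothetical:

	/* Hypothetical sketch only: the callback pair described above. A
	 * driver holding long-lived buffers at fixed physical addresses
	 * would have to implement these, or be declared incompatible with
	 * memory hotplug. */
	#include <linux/device.h>

	struct memhotplug_ops {
		/* tear down DMA buffers pinned in the region being removed */
		int (*quiesce)(struct device *dev);
		/* reallocate them (possibly elsewhere) once the remove is done */
		int (*reacquire)(struct device *dev);
	};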
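And to illustrate the cost in step 3: today __pa()/__va() are a single add or
subtract of PAGE_OFFSET, while a remapped kernel would have to consult a
table for the few straggler pages first. This is my sketch of the idea with
invented names (remap_table, slow_pa) and i386-style constants, not code from
any posted patch:

	/* Sketch only: what a table-based __pa() might look like once the
	 * 1:1 kernel mapping is broken for a handful of remapped
	 * "straggler" pages. All names here are hypothetical. */
	#define PAGE_SHIFT	12
	#define PAGE_OFFSET	0xc0000000UL
	#define PAGE_MASK	(~((1UL << PAGE_SHIFT) - 1))

	struct remap_entry {
		unsigned long vpfn;	/* kernel virtual page frame number */
		unsigned long ppfn;	/* physical frame it currently maps to */
	};

	extern struct remap_entry remap_table[];
	extern int nr_remapped;

	static unsigned long slow_pa(unsigned long vaddr)
	{
		unsigned long vpfn = (vaddr - PAGE_OFFSET) >> PAGE_SHIFT;
		int i;

		/* slow path: the few pages whose physical backing has moved */
		for (i = 0; i < nr_remapped; i++)
			if (remap_table[i].vpfn == vpfn)
				return (remap_table[i].ppfn << PAGE_SHIFT) |
				       (vaddr & ~PAGE_MASK);

		/* fast path: still identity-mapped at PAGE_OFFSET */
		return vaddr - PAGE_OFFSET;
	}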
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
       [not found] ` <436880B8.1050207@yahoo.com.au>
@ 2005-11-02  9:32   ` Dave Hansen
  2005-11-02  9:48     ` Nick Piggin
  0 siblings, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2005-11-02 9:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

On Wed, 2005-11-02 at 20:02 +1100, Nick Piggin wrote:
> I agree. Especially considering that all this memory hotplug usage for
> hypervisors etc. is a relatively new thing with few of our userbase
> actually using it. I think a simple zones solution is the right way to
> go for now.

I agree enough on concept that I think we can go implement at least a
demonstration of how easy it is to perform.

There are a couple of implementation details that will require some
changes to the current zone model, however. Perhaps you have some
suggestions on those.

In which zone do we place hot-added RAM? I don't think the answer can
simply be the HOTPLUGGABLE zone. If you start with a sufficiently small
machine, you'll degrade into the same horrible HIGHMEM behavior that a
64GB ia32 machine has today, regardless of your architecture. Think of
a machine that starts out with a size of 256MB and grows to 1TB.

So, if you have to add to NORMAL/DMA on the fly, how do you handle a
case where the new NORMAL/DMA RAM is physically above
HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
required to be able to span other zones, and be noncontiguous? Would
that represent too much of a change to the current model?

From where do we perform reclaim when we run out of a particular zone?
Getting reclaim rates of the HIGHMEM and NORMAL zones balanced has been
hard, and I worry that we never quite got it right. Introducing yet
another zone makes this harder.

Should we allow allocations for NORMAL to fall back into HOTPLUGGABLE
in any case?

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-02  9:32 ` Dave Hansen
@ 2005-11-02  9:48   ` Nick Piggin
  2005-11-02 10:54     ` Dave Hansen
  2005-11-02 15:02     ` Martin J. Bligh
  0 siblings, 2 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 9:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

Dave Hansen wrote:
> On Wed, 2005-11-02 at 20:02 +1100, Nick Piggin wrote:
>
>> I agree. Especially considering that all this memory hotplug usage for
>> hypervisors etc. is a relatively new thing with few of our userbase
>> actually using it. I think a simple zones solution is the right way to
>> go for now.
>
> I agree enough on concept that I think we can go implement at least a
> demonstration of how easy it is to perform.
>
> There are a couple of implementation details that will require some
> changes to the current zone model, however. Perhaps you have some
> suggestions on those.
>
> In which zone do we place hot-added RAM? I don't think the answer can
> simply be the HOTPLUGGABLE zone. If you start with a sufficiently small
> machine, you'll degrade into the same horrible HIGHMEM behavior that a
> 64GB ia32 machine has today, regardless of your architecture. Think of
> a machine that starts out with a size of 256MB and grows to 1TB.
>

What can we do reasonably sanely? I think we can drive about 16GB of
highmem per 1GB of normal fairly well. So on your 1TB system, you
should be able to unplug 960GB RAM.

Lower the ratio to taste if you happen to be doing something
particularly zone normal intensive - remember in that case the frag
patches won't buy you anything more because a zone normal intensive
workload is going to cause unreclaimable regions by definition.

> So, if you have to add to NORMAL/DMA on the fly, how do you handle a
> case where the new NORMAL/DMA RAM is physically above
> HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
> required to be able to span other zones, and be noncontiguous? Would
> that represent too much of a change to the current model?
>

Perhaps. Perhaps it wouldn't be required to get a solution that is
"good enough" though.

But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
it all and expand your normal zones into it, bottom up.

> From where do we perform reclaim when we run out of a particular zone?
> Getting reclaim rates of the HIGHMEM and NORMAL zones balanced has been
> hard, and I worry that we never quite got it right. Introducing yet
> another zone makes this harder.
>

We didn't get it right, but there are fairly simple things we can do
(http://marc.theaimsgroup.com/?l=linux-kernel&m=113082256231168&w=2)
to improve things remarkably, and having yet more users should result
in even more improvements.

We still have ZONE_DMA and ZONE_DMA32, so we can't afford to just
abandon zones because they're crap ;)

> Should we allow allocations for NORMAL to fall back into HOTPLUGGABLE
> in any case?
>

I think this would defeat the purpose if we really want to set limits,
but we could have a sysctl perhaps to turn it on or off, or say, only
allow it if the alternative is going OOM.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-02  9:48 ` Nick Piggin
@ 2005-11-02 10:54   ` Dave Hansen
  0 siblings, 0 replies; 241+ messages in thread
From: Dave Hansen @ 2005-11-02 10:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

On Wed, 2005-11-02 at 20:48 +1100, Nick Piggin wrote:
> > So, if you have to add to NORMAL/DMA on the fly, how do you handle a
> > case where the new NORMAL/DMA RAM is physically above
> > HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
> > required to be able to span other zones, and be noncontiguous? Would
> > that represent too much of a change to the current model?
>
> Perhaps. Perhaps it wouldn't be required to get a solution that is
> "good enough" though.
>
> But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
> it all and expand your normal zones into it, bottom up.

That's a good point. It would be slow, because you have to wait on page
reclaim, but it would work. I do worry a bit that this might make
adding memory too slow an operation to be useful for short periods, but
we'll see how it actually behaves.

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-02  9:48 ` Nick Piggin
  2005-11-02 10:54   ` Dave Hansen
@ 2005-11-02 15:02   ` Martin J. Bligh
  2005-11-03  3:21     ` Nick Piggin
  1 sibling, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-02 15:02 UTC (permalink / raw)
  To: Nick Piggin, Dave Hansen
  Cc: Ingo Molnar, Mel Gorman, Andrew Morton, Linus Torvalds, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

>> I agree enough on concept that I think we can go implement at least a
>> demonstration of how easy it is to perform.
>>
>> There are a couple of implementation details that will require some
>> changes to the current zone model, however. Perhaps you have some
>> suggestions on those.
>>
>> In which zone do we place hot-added RAM? I don't think the answer can
>> simply be the HOTPLUGGABLE zone. If you start with a sufficiently small
>> machine, you'll degrade into the same horrible HIGHMEM behavior that a
>> 64GB ia32 machine has today, regardless of your architecture. Think of
>> a machine that starts out with a size of 256MB and grows to 1TB.
>
> What can we do reasonably sanely? I think we can drive about 16GB of
> highmem per 1GB of normal fairly well. So on your 1TB system, you
> should be able to unplug 960GB RAM.

I think you need to talk to some more users trying to run 16GB ia32
systems. Feel the pain.

> Lower the ratio to taste if you happen to be doing something
> particularly zone normal intensive - remember in that case the frag
> patches won't buy you anything more because a zone normal intensive
> workload is going to cause unreclaimable regions by definition.
>
>> So, if you have to add to NORMAL/DMA on the fly, how do you handle a
>> case where the new NORMAL/DMA RAM is physically above
>> HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
>> required to be able to span other zones, and be noncontiguous? Would
>> that represent too much of a change to the current model?
>
> Perhaps. Perhaps it wouldn't be required to get a solution that is
> "good enough" though.
>
> But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
> it all and expand your normal zones into it, bottom up.

Can we quit coming up with specialist hacks for hotplug, and try to
solve the generic problem please? hotplug is NOT the only issue here.
Fragmentation in general is.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-02 15:02 ` Martin J. Bligh
@ 2005-11-03  3:21   ` Nick Piggin
  2005-11-03 15:36     ` Martin J. Bligh
  0 siblings, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-03 3:21 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

Martin J. Bligh wrote:

>> What can we do reasonably sanely? I think we can drive about 16GB of
>> highmem per 1GB of normal fairly well. So on your 1TB system, you
>> should be able to unplug 960GB RAM.
>
> I think you need to talk to some more users trying to run 16GB ia32
> systems. Feel the pain.
>

OK, make it 8GB then. And as a bonus we get all you IBM guys back on
the case again to finish the job that was started on highmem :)

And as another bonus, you actually *have* the capability to unplug
memory or use hugepages of exactly the size you require, which is not
the case with the frag patches.

>> But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
>> it all and expand your normal zones into it, bottom up.
>
> Can we quit coming up with specialist hacks for hotplug, and try to
> solve the generic problem please? hotplug is NOT the only issue here.
> Fragmentation in general is.
>

Not really it isn't. There have been a few cases (e1000 being the main
one, and it is fixed upstream) where fragmentation in general is a
problem. But mostly it is not.

Anyone who thinks they can start using higher order allocations willy
nilly after Mel's patch, I'm fairly sure they're wrong, because they
are just going to be using up the contiguous regions.

Trust me, if the frag patches were a general solution that solved the
generic fragmentation problem I would be a lot less concerned about
the complexity they introduce. But even then it only seems to be a
problem that a very small number of users care about.

Anyway I keep saying the same things (sorry) so I'll stop now.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03  3:21 ` Nick Piggin
@ 2005-11-03 15:36   ` Martin J. Bligh
  2005-11-03 15:40     ` Arjan van de Ven
  0 siblings, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

>> Can we quit coming up with specialist hacks for hotplug, and try to
>> solve the generic problem please? hotplug is NOT the only issue here.
>> Fragmentation in general is.
>
> Not really it isn't. There have been a few cases (e1000 being the main
> one, and it is fixed upstream) where fragmentation in general is a
> problem. But mostly it is not.

Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
And devices that don't support scatter-gather DMA.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 15:36 ` Martin J. Bligh
@ 2005-11-03 15:40   ` Arjan van de Ven
  2005-11-03 15:51     ` Linus Torvalds
  2005-11-03 15:53     ` Martin J. Bligh
  1 sibling, 2 replies; 241+ messages in thread
From: Arjan van de Ven @ 2005-11-03 15:40 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> >> Can we quit coming up with specialist hacks for hotplug, and try to
> >> solve the generic problem please? hotplug is NOT the only issue here.
> >> Fragmentation in general is.
> >
> > Not really it isn't. There have been a few cases (e1000 being the main
> > one, and it is fixed upstream) where fragmentation in general is a
> > problem. But mostly it is not.
>
> Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.

with CONFIG_4KSTACKS :)

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 15:40 ` Arjan van de Ven
@ 2005-11-03 15:51   ` Linus Torvalds
  2005-11-03 15:57     ` Martin J. Bligh
                       ` (2 more replies)
  2005-11-03 15:53   ` Martin J. Bligh
  1 sibling, 3 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 15:51 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Martin J. Bligh, Nick Piggin, Dave Hansen, Ingo Molnar,
      Mel Gorman, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Arjan van de Ven wrote:

> On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> > >> Can we quit coming up with specialist hacks for hotplug, and try to
> > >> solve the generic problem please? hotplug is NOT the only issue here.
> > >> Fragmentation in general is.
> > >
> > > Not really it isn't. There have been a few cases (e1000 being the main
> > > one, and it is fixed upstream) where fragmentation in general is a
> > > problem. But mostly it is not.
> >
> > Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
>
> with CONFIG_4KSTACKS :)

2-page allocations are _not_ a problem.

Especially not for fork()/clone(). If you don't even have 2-page
contiguous areas, you are doing something _wrong_, or you're so low on
memory that there's no point in forking any more.

Don't confuse "fragmentation" with "perfectly spread out page
allocations".

Fragmentation means that it gets _exponentially_ more unlikely that you
can allocate big contiguous areas. But contiguous areas of order 1 are
very very likely indeed. It's only the _big_ areas that aren't going to
happen.

This is why fragmentation avoidance has always been totally useless. It is
 - only useful for big areas
 - very hard for big areas

(Corollary: when it's easy and possible, it's not useful).

Don't do it. We've never done it, and we've been fine. Claiming that
fork() is a reason to do fragmentation avoidance is invalid.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 15:51 ` Linus Torvalds
@ 2005-11-03 15:57   ` Martin J. Bligh
  0 siblings, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:57 UTC (permalink / raw)
  To: Linus Torvalds, Arjan van de Ven
  Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
      kravetz, linux-mm, Linux Kernel Mailing List, lhms,
      Arjan van de Ven

>> with CONFIG_4KSTACKS :)
>
> 2-page allocations are _not_ a problem.
>
> Especially not for fork()/clone(). If you don't even have 2-page
> contiguous areas, you are doing something _wrong_, or you're so low on
> memory that there's no point in forking any more.

64 bit platforms need kernel stacks > 8K, it seems.

> Don't confuse "fragmentation" with "perfectly spread out page
> allocations".
>
> Fragmentation means that it gets _exponentially_ more unlikely that you
> can allocate big contiguous areas. But contiguous areas of order 1 are
> very very likely indeed. It's only the _big_ areas that aren't going to
> happen.
>
> This is why fragmentation avoidance has always been totally useless. It is
>  - only useful for big areas
>  - very hard for big areas
>
> (Corollary: when it's easy and possible, it's not useful).
>
> Don't do it. We've never done it, and we've been fine. Claiming that
> fork() is a reason to do fragmentation avoidance is invalid.

With respect, we have not been fine. We see problems fairly regularly
with higher order allocations, even with no large page or hotplug
issues involved. Drivers, CIFS, kernel stacks, etc, etc, etc. The
larger memory gets, the worse the problem is, just because the
statistics make it less likely to free up multiple contiguous pages.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 15:51 ` Linus Torvalds
  2005-11-03 15:57   ` Martin J. Bligh
@ 2005-11-03 16:20   ` Arjan van de Ven
  2005-11-03 16:27   ` Mel Gorman
  2 siblings, 0 replies; 241+ messages in thread
From: Arjan van de Ven @ 2005-11-03 16:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin J. Bligh, Nick Piggin, Dave Hansen, Ingo Molnar,
      Mel Gorman, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 2005-11-03 at 07:51 -0800, Linus Torvalds wrote:
>
> On Thu, 3 Nov 2005, Arjan van de Ven wrote:
>
> > On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> > > Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
> >
> > with CONFIG_4KSTACKS :)
>
> 2-page allocations are _not_ a problem.

agreed for the general case. There are some corner cases that you can
trigger deliberately in an artificial setting with lots of java threads
(esp on x86 on a 32GB box; the lowmem zone works as a lever here,
leading to "hyperfragmentation"; otoh on x86 you can do 4k stacks and
then it's mostly gone)

> Fragmentation means that it gets _exponentially_ more unlikely that you
> can allocate big contiguous areas. But contiguous areas of order 1 are
> very very likely indeed. It's only the _big_ areas that aren't going to
> happen.

yup. only possible exception is the leveraged scenario .. thank god for
64 bit x86-64. (and in the leveraged scenario I don't think active
defragmentation will buy you much over the long term at all)

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 15:51 ` Linus Torvalds
  2005-11-03 15:57   ` Martin J. Bligh
  2005-11-03 16:20   ` Arjan van de Ven
@ 2005-11-03 16:27   ` Mel Gorman
  2005-11-03 16:46     ` Linus Torvalds
  2 siblings, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-03 16:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Martin J. Bligh, Nick Piggin, Dave Hansen,
      Ingo Molnar, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Linus Torvalds wrote:

> 2-page allocations are _not_ a problem.
>
> Especially not for fork()/clone(). If you don't even have 2-page
> contiguous areas, you are doing something _wrong_, or you're so low on
> memory that there's no point in forking any more.
>
> Don't confuse "fragmentation" with "perfectly spread out page
> allocations".
>
> Fragmentation means that it gets _exponentially_ more unlikely that you
> can allocate big contiguous areas. But contiguous areas of order 1 are
> very very likely indeed. It's only the _big_ areas that aren't going to
> happen.
>

For me, it's the big areas that I am interested in, especially if we
want to give HugeTLB pages to a user when they are asking for them. The
obvious ones here are database and HPC loads, particularly the HPC
loads, which may not have had a chance to reserve what they needed at
boot time. These loads need 1024 contiguous pages on the x86 at least,
not 2. We can free all we want on today's kernels and you're not going
to get more than one or two blocks this large unless you are very
lucky.

Hotplug is, for me, an additional benefit. For others, it is the main
benefit. Others, of course, don't care, but they don't care about
scalability to 64 processors either, and we provide that anyway at a
low cost to smaller machines.

> This is why fragmentation avoidance has always been totally useless. It is
>  - only useful for big areas
>  - very hard for big areas
>
> (Corollary: when it's easy and possible, it's not useful).
>

Unless you are a user that wants a large area, when it suddenly is
useful.

> Don't do it. We've never done it, and we've been fine. Claiming that
> fork() is a reason to do fragmentation avoidance is invalid.
>

We've never done it, but then we've also only ever supported HugeTLB
pages reserved at boot time and nothing else.

I'm going to set up a kbuild environment, hopefully this evening, and
see whether these patches adversely impact a load that kernel
developers care about. If I am impacting it, oops, I'm in some trouble.
If I'm not, then why not try and help out the people who care about
the big areas.

--
Mel Gorman
Part-time PhD Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 16:27 ` Mel Gorman
@ 2005-11-03 16:46   ` Linus Torvalds
  2005-11-03 16:52     ` Martin J. Bligh
  0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 16:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Arjan van de Ven, Martin J. Bligh, Nick Piggin, Dave Hansen,
      Ingo Molnar, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Mel Gorman wrote:
> On Thu, 3 Nov 2005, Linus Torvalds wrote:
>
> > This is why fragmentation avoidance has always been totally useless. It is
> >  - only useful for big areas
> >  - very hard for big areas
> >
> > (Corollary: when it's easy and possible, it's not useful).
>
> Unless you are a user that wants a large area, when it suddenly is
> useful.

No. It's _not_ suddenly useful. It might be something you _want_, but
that's a totally different issue.

My point is that regardless of what you _want_, defragmentation is
_useless_. It's useless simply because for big areas it is so expensive
as to be impractical.

Put another way: you may _want_ the moon to be made of cheese, but a
moon made out of cheese is _useless_ because it is impractical.

The only way to support big areas is to have special zones for them.

(Then, we may be able to use the special zones for small things too,
but under special rules, like "only used for anonymous mappings" where
we can just always remove them by paging them out. But it would still
be a special area meant for big pages, just temporarily "on loan").

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 16:46 ` Linus Torvalds
@ 2005-11-03 16:52   ` Martin J. Bligh
  2005-11-03 17:19     ` Linus Torvalds
  0 siblings, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 16:52 UTC (permalink / raw)
  To: Linus Torvalds, Mel Gorman
  Cc: Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar,
      Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

> The only way to support big areas is to have special zones for them.
>
> (Then, we may be able to use the special zones for small things too,
> but under special rules, like "only used for anonymous mappings" where
> we can just always remove them by paging them out. But it would still
> be a special area meant for big pages, just temporarily "on loan").

The problem is how these zones get resized. Can we hotplug memory
between them, with some sparsemem-like indirection layer? Real
customers have shown us that their workloads shift, and they have
different needs at different parts of the day. We can't just pick one
size and call it good.

It's the same argument as the traditional VM balancing act between
pagecache, user pages, and kernel pages (which, incidentally, we don't
use zones for). We want the system to be able to use memory wherever
it's most needed.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 16:52 ` Martin J. Bligh
@ 2005-11-03 17:19   ` Linus Torvalds
  2005-11-03 17:48     ` Dave Hansen
  2005-11-03 17:51     ` Martin J. Bligh
  0 siblings, 2 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 17:19 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
      Ingo Molnar, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> The problem is how these zones get resized. Can we hotplug memory
> between them, with some sparsemem-like indirection layer?

I think you should be able to add them. You can remove them. But you
can't resize them.

And I suspect that by default, there should be zero of them. Ie you'd
have to set them up the same way you now set up a hugetlb area.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 17:19 ` Linus Torvalds
@ 2005-11-03 17:48   ` Dave Hansen
  0 siblings, 0 replies; 241+ messages in thread
From: Dave Hansen @ 2005-11-03 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin J. Bligh, Mel Gorman, Arjan van de Ven, Nick Piggin,
      Ingo Molnar, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 2005-11-03 at 09:19 -0800, Linus Torvalds wrote:
> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> >
> > The problem is how these zones get resized. Can we hotplug memory
> > between them, with some sparsemem-like indirection layer?
>
> I think you should be able to add them. You can remove them. But you
> can't resize them.

Any particular reasons you think we can't resize them? I know shrinking
the non-reclaim (DMA, NORMAL) zones will be practically impossible, but
it should be quite possible to shrink the reclaim zone, and grow DMA or
NORMAL into it.

This will likely be necessary as memory is added to a system, and the
ratio of reclaim to non-reclaim zones gets out of whack and away from
the magic 16:1 or 8:1 highmem:normal ratio that seems popular.

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 17:19 ` Linus Torvalds
@ 2005-11-03 17:51   ` Martin J. Bligh
  2005-11-03 17:59     ` Arjan van de Ven
                       ` (2 more replies)
  1 sibling, 3 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 17:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
      Ingo Molnar, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

--Linus Torvalds <torvalds@osdl.org> wrote (on Thursday, November 03, 2005 09:19:35 -0800):

> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>>
>> The problem is how these zones get resized. Can we hotplug memory
>> between them, with some sparsemem-like indirection layer?
>
> I think you should be able to add them. You can remove them. But you
> can't resize them.
>
> And I suspect that by default, there should be zero of them. Ie you'd
> have to set them up the same way you now set up a hugetlb area.

So ... if there are 0 by default, and I run for a while and dirty up
memory, how do I free any pages up to put into them? Not sure how that
works.

Going back to finding contig pages for a sec ... I don't disagree with
your assertion that order 1 is doable (however, we do need to make one
fix ... see below). It's > 1 that's a problem.

For amusement, let me put in some tritely oversimplified math. For the
sake of argument, assume the free watermarks are 8MB or so. Let's
assume a clean 64-bit system with no zone issues, etc (ie all one
zone). 4K pages. I'm going to assume random distribution of free pages,
which is oversimplified, but I'm trying to demonstrate a general
premise, not get accurate numbers.

8MB = 2048 pages. On a 64MB system, we have 16384 pages, 2048 free.
Very roughly speaking, for each free page, the chance of its buddy
being free is 2048/16384. So in grossly-oversimplified stats-land, if I
can remember anything at all, the chance of finding one page with a
free buddy is 1-(1-2048/16384)^2048, which is, for all intents and
purposes ... 1.

1GB system, 262144 pages: 1-(1-2048/262144)^2048 = 0.9999999

128GB system, 33554432 pages: 1-(1-2048/33554432)^2048 = 0.1175
probability

yes, yes, my math sucks and I'm a simpleton. The point is that as
memory gets bigger, the odds suck for getting contiguous pages. Which
would also explain why you think there's no problem, and I do ;-)

And bear in mind that's just for order 1 allocs. For bigger stuff, it
REALLY sucks - I'll spare you more wild attempts at foully-approximated
math.

Hmmm. If we keep 128MB free, that totally kills off the above
calculation. I think I'll just tweak it so the limit is not so hard on
really big systems. Will send you a patch. However ... larger allocs
will still suck ... I guess I'd better gross you out with more
incorrect math after all ...

^ permalink raw reply	[flat|nested] 241+ messages in thread
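Martin's back-of-envelope model above, as a runnable program (my
transcription of his stated math, nothing more): it assumes the ~2048 free
pages land uniformly at random, which, as the replies below point out, a
buddy allocator does much better than.

	#include <stdio.h>
	#include <math.h>

	int main(void)
	{
		double sizes_mb[] = { 64, 1024, 131072 };	/* 64MB, 1GB, 128GB */
		double nfree = 2048;				/* ~8MB watermark */
		int i;

		for (i = 0; i < 3; i++) {
			double npages = sizes_mb[i] * 1024 / 4;	/* 4K pages */
			double p = 1 - pow(1 - nfree / npages, nfree);

			printf("%8.0fMB: P(some free order-1 buddy pair) ~= %.7f\n",
			       sizes_mb[i], p);
		}
		return 0;	/* build with -lm */
	}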
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 17:51 ` Martin J. Bligh
@ 2005-11-03 17:59 ` Arjan van de Ven
  2005-11-03 18:08 ` Linus Torvalds
  0 siblings, 1 reply; 241+ messages in thread
From: Arjan van de Ven @ 2005-11-03 17:59 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Linus Torvalds, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 2005-11-03 at 09:51 -0800, Martin J. Bligh wrote:

> For amusement, let me put in some tritely oversimplified math. For the
> sake of argument, assume the free watermarks are 8MB or so. Let's assume
> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
> I'm going to assume random distribution of free pages, which is
> oversimplified, but I'm trying to demonstrate a general premise, not get
> accurate numbers.

that is VERY oversimplified though, given the anti-fragmentation property of the buddy algorithm

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 17:59 ` Arjan van de Ven
@ 2005-11-03 18:08 ` Linus Torvalds
  2005-11-03 18:17 ` Martin J. Bligh
  2005-11-03 21:11 ` Mel Gorman
  0 siblings, 2 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 18:08 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Martin J. Bligh, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Arjan van de Ven wrote:
> On Thu, 2005-11-03 at 09:51 -0800, Martin J. Bligh wrote:
>
>> For amusement, let me put in some tritely oversimplified math. For the
>> sake of argument, assume the free watermarks are 8MB or so. Let's assume
>> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
>> I'm going to assume random distribution of free pages, which is
>> oversimplified, but I'm trying to demonstrate a general premise, not get
>> accurate numbers.
>
> that is VERY oversimplified though, given the anti-fragmentation
> property of the buddy algorithm

Indeed. I wrote a program at one point doing random allocation and de-allocation and looked at what the output was, and buddy is very good at avoiding fragmentation.

These days we have things like per-cpu lists in front of the buddy allocator that will make fragmentation somewhat higher, but it's still absolutely true that the page allocation layout is _not_ random.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
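For anyone who wants to repeat that experiment, here is a toy reconstruction of the kind of program described: a self-contained userspace buddy allocator driven with random order-0 churn. It is a sketch from first principles, not Linus' program and not kernel code; the parameters (1GB of 4K pages, 1/8 of memory free) are picked to line up with the numbers in this thread:

#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDER  11
#define PAGES      (1 << 18)		/* 1GB worth of 4K pages */
#define FREE_FRAC  8			/* keep 1/8 of memory free */
#define CHURN      (PAGES * 4)		/* random free/alloc pairs */

static int nxt[PAGES], prv[PAGES];	/* per-order doubly linked free lists */
static int head[MAX_ORDER];
static signed char ord[PAGES];		/* order if a free block starts here, else -1 */

static void push(int b, int o)
{
	ord[b] = o;
	nxt[b] = head[o];
	prv[b] = -1;
	if (head[o] != -1)
		prv[head[o]] = b;
	head[o] = b;
}

static void unlink_blk(int b, int o)
{
	if (prv[b] != -1)
		nxt[prv[b]] = nxt[b];
	else
		head[o] = nxt[b];
	if (nxt[b] != -1)
		prv[nxt[b]] = prv[b];
	ord[b] = -1;
}

static int alloc0(void)			/* allocate one order-0 page */
{
	int o, b;

	for (o = 0; o < MAX_ORDER && head[o] == -1; o++)
		;
	if (o == MAX_ORDER)
		return -1;
	b = head[o];
	unlink_blk(b, o);
	while (o--)
		push(b + (1 << o), o);	/* split: keep lower half, free upper */
	return b;
}

static void free0(int b)		/* free one page, coalescing with buddies */
{
	int o = 0, buddy;

	while (o < MAX_ORDER - 1) {
		buddy = b ^ (1 << o);
		if (ord[buddy] != o)
			break;		/* buddy not free at this order: stop */
		unlink_blk(buddy, o);
		b &= ~(1 << o);		/* merged block starts at the lower half */
		o++;
	}
	push(b, o);
}

int main(void)
{
	static int held[PAGES];
	int i, o, n = 0, nheld = PAGES - PAGES / FREE_FRAC;

	for (o = 0; o < MAX_ORDER; o++)
		head[o] = -1;
	for (i = 0; i < PAGES; i++)
		ord[i] = -1;
	for (i = 0; i < PAGES; i += 1 << (MAX_ORDER - 1))
		push(i, MAX_ORDER - 1);

	while (n < nheld)
		held[n++] = alloc0();
	for (i = 0; i < CHURN; i++) {	/* random allocation/de-allocation churn */
		int victim = rand() % n;
		free0(held[victim]);
		held[victim] = alloc0();
	}
	for (o = 0; o < MAX_ORDER; o++) {
		int c = 0, b;
		for (b = head[o]; b != -1; b = nxt[b])
			c++;
		printf("order %2d: %d free blocks\n", o, c);
	}
	return 0;
}

Running it shows most of the free memory sitting in higher-order blocks after the churn, which is the non-randomness being argued about; replace the free lists with a plain stack of single pages and the high orders vanish.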
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:08 ` Linus Torvalds
@ 2005-11-03 18:17 ` Martin J. Bligh
  2005-11-03 18:44 ` Linus Torvalds
  2005-11-04  0:58 ` Nick Piggin
  1 sibling, 2 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 18:17 UTC (permalink / raw)
To: Linus Torvalds, Arjan van de Ven
Cc: Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

>>> For amusement, let me put in some tritely oversimplified math. For the
>>> sake of argument, assume the free watermarks are 8MB or so. Let's assume
>>> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
>>> I'm going to assume random distribution of free pages, which is
>>> oversimplified, but I'm trying to demonstrate a general premise, not get
>>> accurate numbers.
>>
>> that is VERY oversimplified though, given the anti-fragmentation
>> property of the buddy algorithm
>
> Indeed. I wrote a program at one point doing random allocation and
> de-allocation and looked at what the output was, and buddy is very good
> at avoiding fragmentation.
>
> These days we have things like per-cpu lists in front of the buddy
> allocator that will make fragmentation somewhat higher, but it's still
> absolutely true that the page allocation layout is _not_ random.

OK, well I'll quit torturing you with incorrect math if you'll concede that the situation gets much much worse as memory sizes get larger ;-)

For order 1 allocs, I think it's fixable. For order > 1, I think we basically don't have a prayer on a largish system under pressure.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:17 ` Martin J. Bligh
@ 2005-11-03 18:44 ` Linus Torvalds
  2005-11-03 18:51 ` Martin J. Bligh
  1 sibling, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 18:44 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
>> These days we have things like per-cpu lists in front of the buddy
>> allocator that will make fragmentation somewhat higher, but it's still
>> absolutely true that the page allocation layout is _not_ random.
>
> OK, well I'll quit torturing you with incorrect math if you'll concede
> that the situation gets much much worse as memory sizes get larger ;-)

I don't remember the specifics (I did the stats several years ago), but if I recall correctly, the low-order allocations actually got _better_ with more memory, assuming you kept a fixed percentage of memory free. So you actually needed _less_ memory free (in percentages) to get low-order allocations reliably.

But the higher orders didn't much matter. Basically, it gets exponentially more difficult to keep higher-order allocations, and it doesn't help one whit if there's a linear improvement from having more memory available or something like that.

So it doesn't get _harder_ with lots of memory, but

 - you need to keep the "minimum free" watermarks growing at the same rate the memory sizes grow (and on x86, I don't think we do: at least at some point, the HIGHMEM zone had a much lower low-water-mark because it made the balancing behaviour much nicer. But I didn't check that).

 - with lots of memory, you tend to want to get higher-order pages, and that gets harder much much faster than your memory size grows. So _effectively_, the kinds of allocations you care about are much harder to get.

If you look at get_free_pages(), you will note that we actually _guarantee_ memory allocations up to order-3:

	...
	if (!(gfp_mask & __GFP_NORETRY)) {
		if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
			do_retry = 1;
	...

and nobody has ever even noticed. In other words, low-order allocations really _are_ dependable. It's just that the kinds of orders you want for memory hotplug or hugetlb (ie not orders <=3, but >=10) are not, and never will be.

(Btw, my statistics did depend on the fact that the _usage_ was an even higher exponential, ie you had many many more order-0 allocations than you had order-1. You can always run out of order-n (n != 0) pages if you just allocate enough of them. The buddy thing works well statistically, but it obviously can't do wonders.)

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
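To make that order <= 3 guarantee concrete: the classic consumer is the order-1 kernel stack, which every fork() quietly leans on. The fragment below is modelled loosely on the 8K-stack configurations of the era; the exact macro spelling is illustrative, not quoted from any tree:

#include <linux/gfp.h>
#include <linux/sched.h>

#define THREAD_ORDER 1	/* 8K stacks on a 4K-page machine */

/*
 * GFP_KERNEL has __GFP_WAIT and not __GFP_NORETRY, and the order is
 * <= 3, so per the snippet above this retries rather than fail.
 */
static unsigned long alloc_task_stack(void)
{
	return __get_free_pages(GFP_KERNEL, THREAD_ORDER);
}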
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 18:44 ` Linus Torvalds @ 2005-11-03 18:51 ` Martin J. Bligh 2005-11-03 19:35 ` Linus Torvalds 0 siblings, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-03 18:51 UTC (permalink / raw) To: Linus Torvalds Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven --Linus Torvalds <torvalds@osdl.org> wrote (on Thursday, November 03, 2005 10:44:14 -0800): > > > On Thu, 3 Nov 2005, Martin J. Bligh wrote: >> > >> > These days we have things like per-cpu lists in front of the buddy >> > allocator that will make fragmentation somewhat higher, but it's still >> > absolutely true that the page allocation layout is _not_ random. >> >> OK, well I'll quit torturing you with incorrect math if you'll concede >> that the situation gets much much worse as memory sizes get larger ;-) > > I don't remember the specifics (I did the stats several years ago), but if > I recall correctly, the low-order allocations actually got _better_ with > more memory, assuming you kept a fixed percentage of memory free. So you > actually needed _less_ memory free (in percentages) to get low-order > allocations reliably. Possibly, I can redo the calculations easily enough (have to go for now, but I just sent the other ones). But we don't keep a fixed percentage of memory free - we cap it ... perhaps we should though? M. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:51 ` Martin J. Bligh
@ 2005-11-03 19:35 ` Linus Torvalds
  2005-11-03 22:40 ` Martin J. Bligh
  0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 19:35 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> Possibly, I can redo the calculations easily enough (have to go for now,
> but I just sent the other ones). But we don't keep a fixed percentage of
> memory free - we cap it ... perhaps we should though?

I suspect the capping may well be from some old HIGHMEM interaction on x86 (ie "don't keep half a gig free in the normal zone just because we have 16GB in the high-zone"). We used to have serious balancing issues, and I wouldn't be surprised at all if there are remnants from that. Stuff that simply hasn't been visible, because not a lot of people had many many GB of memory even on machines that didn't need HIGHMEM.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 19:35 ` Linus Torvalds @ 2005-11-03 22:40 ` Martin J. Bligh 2005-11-03 22:56 ` Linus Torvalds 0 siblings, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-03 22:40 UTC (permalink / raw) To: Linus Torvalds Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven --On Thursday, November 03, 2005 11:35:28 -0800 Linus Torvalds <torvalds@osdl.org> wrote: > > > On Thu, 3 Nov 2005, Martin J. Bligh wrote: >> >> Possibly, I can redo the calculations easily enough (have to go for now, >> but I just sent the other ones). But we don't keep a fixed percentage of >> memory free - we cap it ... perhaps we should though? > > I suspect the capping may well be from some old HIGHMEM interaction on x86 > (ie "don't keep half a gig free in the normal zone just because we have > 16GB in the high-zone". We used to have serious balancing issues, and I > wouldn't be surprised at all if there are remnants from that. Stuff that > simply hasn't been visible, because not a lot of people had many many GB > of memory even on machines that didn't need HIGHMEM. But pages_min is based on the zone size, not the system size. And we still cap it. Maybe that's just a mistake? M. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 22:40 ` Martin J. Bligh @ 2005-11-03 22:56 ` Linus Torvalds 2005-11-03 23:01 ` Martin J. Bligh 0 siblings, 1 reply; 241+ messages in thread From: Linus Torvalds @ 2005-11-03 22:56 UTC (permalink / raw) To: Martin J. Bligh Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven On Thu, 3 Nov 2005, Martin J. Bligh wrote: > > But pages_min is based on the zone size, not the system size. And we > still cap it. Maybe that's just a mistake? The per-zone watermarking is actually the "modern" and "working" approach. We didn't always do it that way. I would not be at all surprised if the capping was from the global watermarking days. Of course, I would _also_ not be at all surprised if it wasn't just out of habit. Most of the things where we try to scale things up by memory size, we cap for various reasons. Ie we tend to try to scale things like hash sizes for core data structures by memory size, but then we tend to cap them to "sane" versions. So quite frankly, it's entirely possible that the capping is there not because it _ever_ was a good idea, but just because it's what we almost always do ;) Mental inertia is definitely alive and well. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 22:56 ` Linus Torvalds
@ 2005-11-03 23:01 ` Martin J. Bligh
  0 siblings, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 23:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

>> But pages_min is based on the zone size, not the system size. And we
>> still cap it. Maybe that's just a mistake?
>
> The per-zone watermarking is actually the "modern" and "working" approach.
>
> We didn't always do it that way. I would not be at all surprised if the
> capping was from the global watermarking days.
>
> Of course, I would _also_ not be at all surprised if it wasn't just out of
> habit. Most of the things where we try to scale things up by memory size,
> we cap for various reasons. Ie we tend to try to scale things like hash
> sizes for core data structures by memory size, but then we tend to cap
> them to "sane" versions.
>
> So quite frankly, it's entirely possible that the capping is there not
> because it _ever_ was a good idea, but just because it's what we almost
> always do ;)
>
> Mental inertia is definitely alive and well.

Ha ;-) Well thanks for the explanation. I would suggest the patch I sent you makes some semblance of sense then ...

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:17 ` Martin J. Bligh
  2005-11-03 18:44 ` Linus Torvalds
@ 2005-11-04  0:58 ` Nick Piggin
  2005-11-04  1:06 ` Linus Torvalds
  1 sibling, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-04 0:58 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Linus Torvalds, Arjan van de Ven, Mel Gorman, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

Martin J. Bligh wrote:

>> These days we have things like per-cpu lists in front of the buddy
>> allocator that will make fragmentation somewhat higher, but it's still
>> absolutely true that the page allocation layout is _not_ random.
>
> OK, well I'll quit torturing you with incorrect math if you'll concede
> that the situation gets much much worse as memory sizes get larger ;-)

Let me add that as memory sizes get larger, people are also looking for more TLB coverage and less per-page overhead.

Looks like ppc64 is getting 64K page support, at which point higher order allocations (eg. for stacks) basically disappear, don't they?

x86-64 I thought was also getting 64K page support but I can't find a reference to it right now - at the very least I know Andi wants to support larger soft pages for it. ia64 is obviously already well covered.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

^ permalink raw reply	[flat|nested] 241+ messages in thread
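The TLB-coverage point is just multiplication, but it is the whole motivation. A throwaway sketch (the 1024-entry TLB is an assumed, purely illustrative figure, not any particular CPU):

#include <stdio.h>

int main(void)
{
	int entries = 1024;	/* assumed data-TLB entry count */

	/* coverage = entries * page size */
	printf("4K pages:  %5d KB covered\n", entries * 4);
	printf("64K pages: %5d KB covered\n", entries * 64);
	return 0;
}

With those numbers, coverage goes from 4MB to 64MB for the same TLB, which is why the HPC and database crowds keep asking for bigger pages.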
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 0:58 ` Nick Piggin @ 2005-11-04 1:06 ` Linus Torvalds 2005-11-04 1:20 ` Paul Mackerras ` (2 more replies) 0 siblings, 3 replies; 241+ messages in thread From: Linus Torvalds @ 2005-11-04 1:06 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Arjan van de Ven, Mel Gorman, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven On Fri, 4 Nov 2005, Nick Piggin wrote: > > Looks like ppc64 is getting 64K page support, at which point higher > order allocations (eg. for stacks) basically disappear don't they? Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a general-purpose machine. 64kB pages are _only_ usable for databases, nothing else. Why? Do the math. Try to cache the whole kernel source tree in 4kB pages vs 64kB pages. See how the memory usage goes up by a factor of _four_. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
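That math is worth doing concretely. The sketch below assumes the cache footprint is simply each file rounded up to the page size (ignoring tail packing, sparse files and so on); feed it file sizes on stdin, e.g. from find linux-2.6.14 -type f -printf '%s\n':

#include <stdio.h>

/* round a file size up to a whole number of pages */
static unsigned long long rounded(unsigned long long size, unsigned long page)
{
	if (size == 0)
		return 0;
	return (size + page - 1) / page * page;
}

int main(void)
{
	unsigned long long size, t4 = 0, t64 = 0;

	while (scanf("%llu", &size) == 1) {
		t4 += rounded(size, 4096);
		t64 += rounded(size, 65536);
	}
	printf("4K pages:  %llu MB\n", t4 >> 20);
	printf("64K pages: %llu MB\n", t64 >> 20);
	printf("bloat: %.2fx\n", t4 ? (double)t64 / t4 : 0.0);
	return 0;
}

A kernel tree is dominated by files well under 16K, which is where a factor in the region of four comes from; the office-file distribution Paul describes below sits much closer to 1x.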
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 1:06 ` Linus Torvalds @ 2005-11-04 1:20 ` Paul Mackerras 2005-11-04 1:22 ` Nick Piggin 2005-11-04 1:26 ` Mel Gorman 2 siblings, 0 replies; 241+ messages in thread From: Paul Mackerras @ 2005-11-04 1:20 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Martin J. Bligh, Arjan van de Ven, Mel Gorman, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven Linus Torvalds writes: > 64kB pages are _only_ usable for databases, nothing else. Actually people running HPC apps also like 64kB pages since their TLB misses go down significantly, and their data files tend to be large. Fileserving for windows boxes should also benefit, since both the executables and the data files that typical office applications on windows use are largish. I got a distribution of file sizes for a government department office and concluded that 64k pages would only bloat the page cache by a few percent for that case. Paul. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 1:06 ` Linus Torvalds 2005-11-04 1:20 ` Paul Mackerras @ 2005-11-04 1:22 ` Nick Piggin 2005-11-04 1:48 ` Mel Gorman 2005-11-04 1:26 ` Mel Gorman 2 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-11-04 1:22 UTC (permalink / raw) To: Linus Torvalds Cc: Martin J. Bligh, Arjan van de Ven, Mel Gorman, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven Linus Torvalds wrote: > > On Fri, 4 Nov 2005, Nick Piggin wrote: > >>Looks like ppc64 is getting 64K page support, at which point higher >>order allocations (eg. for stacks) basically disappear don't they? > > > Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a > general-purpose machine. > > 64kB pages are _only_ usable for databases, nothing else. > > Why? Do the math. Try to cache the whole kernel source tree in 4kB pages > vs 64kB pages. See how the memory usage goes up by a factor of _four_. > Yeah that's true. But Martin's worried about future machines with massive memories - so maybe it is safe to assume those will be using big pages, I don't know. Maybe the solution is to bloat the kernel sources enough to make 64KB pages worthwhile? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  1:22 ` Nick Piggin
@ 2005-11-04  1:48 ` Mel Gorman
  2005-11-04  1:59 ` Nick Piggin
  0 siblings, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-04 1:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Martin J. Bligh, Arjan van de Ven, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Fri, 4 Nov 2005, Nick Piggin wrote:

> Linus Torvalds wrote:
>>
>> On Fri, 4 Nov 2005, Nick Piggin wrote:
>>
>>> Looks like ppc64 is getting 64K page support, at which point higher
>>> order allocations (eg. for stacks) basically disappear, don't they?
>>
>> Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a
>> general-purpose machine.
>>
>> 64kB pages are _only_ usable for databases, nothing else.
>>
>> Why? Do the math. Try to cache the whole kernel source tree in 4kB pages vs
>> 64kB pages. See how the memory usage goes up by a factor of _four_.
>
> Yeah that's true. But Martin's worried about future machines
> with massive memories - so maybe it is safe to assume those will
> be using big pages, I don't know.

Today's massive machines are tomorrow's desktop. Weak comment, I know, but it's happened before.

> Maybe the solution is to bloat the kernel sources enough to make
> 64KB pages worthwhile?

root@monocle:/boot# ls -l vmlinuz-2.6.14-rc5-mm1-clean
-rw-r--r-- 1 root root 1718063 2005-11-01 16:17 vmlinuz-2.6.14-rc5-mm1-clean
root@monocle:/boot# ls -l vmlinuz-2.6.14-rc5-mm1-mbuddy-v19
-rw-r--r-- 1 root root 1722102 2005-11-02 14:56 vmlinuz-2.6.14-rc5-mm1-mbuddy-v19
root@monocle:/boot# dc
1722102 1718063 - p
4039
root@monocle:/boot# ls -l vmlinux-2.6.14-rc5-mm1-clean
-rwxr-xr-x 1 root root 31518866 2005-11-01 16:17 vmlinux-2.6.14-rc5-mm1-clean
root@monocle:/boot# ls -l vmlinux-2.6.14-rc5-mm1-mbuddy-v19
-rwxr-xr-x 1 root root 31585714 2005-11-02 14:56 vmlinux-2.6.14-rc5-mm1-mbuddy-v19

mel@joshua:/usr/src/patchset-0.5/kernels/linux-2.6.14-rc5-mm1-nooom$ wc -l mm/page_alloc.c
2689 mm/page_alloc.c
mel@joshua:/usr/src/patchset-0.5/kernels/linux-2.6.14-rc5-mm1-mbuddy-v19-withdefrag$ wc -l mm/page_alloc.c
3188 mm/page_alloc.c

That is a 0.23% increase in the size of bzImage, a 0.21% increase in the size of vmlinux, and the major increase in code size is in one file, *one* file, all of which does its best not to impact the flow of the well-understood code. We're seeing bigger differences in performance than we are in the size of the kernel. I'd understand if I was the first person to ever introduce complexity to the VM.

If the size of the image for really small systems is the issue, what if I say I'll add in another patch that optionally compiles away as much of anti-defrag as possible without making the code a mess of #defines? Are we still going to hear "no, I don't like looking at this"? The current patch to compile it away deliberately chose the smallest part to take away to restore the allocator to today's behavior.

--
Mel Gorman
Part-time Phd Student					Java Applications Developer
University of Limerick					IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 1:48 ` Mel Gorman @ 2005-11-04 1:59 ` Nick Piggin 2005-11-04 2:35 ` Mel Gorman 0 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-11-04 1:59 UTC (permalink / raw) To: Mel Gorman Cc: Linus Torvalds, Martin J. Bligh, Arjan van de Ven, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven Mel Gorman wrote: > On Fri, 4 Nov 2005, Nick Piggin wrote: > > Todays massive machines are tomorrows desktop. Weak comment, I know, but > it's happened before. > Oh I wouldn't bet against it. And if desktops of the future are using 100s of GB then they probably would be happy to use 64K pages as well. > >>Maybe the solution is to bloat the kernel sources enough to make >>64KB pages worthwhile? >> > Sorry this wasn't meant to be a dig at your patches - I guess it turned out that way though :\ But yes, if anybody is adding complexity or size to core code it obviously does need to be justified -- and by no means does this only apply to you. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  1:59 ` Nick Piggin
@ 2005-11-04  2:35 ` Mel Gorman
  0 siblings, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-04 2:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Martin J. Bligh, Arjan van de Ven, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Fri, 4 Nov 2005, Nick Piggin wrote:

> Mel Gorman wrote:
>> On Fri, 4 Nov 2005, Nick Piggin wrote:
>>
>> Today's massive machines are tomorrow's desktop. Weak comment, I know, but
>> it's happened before.
>
> Oh I wouldn't bet against it. And if desktops of the future are using
> 100s of GB then they probably would be happy to use 64K pages as well.

And would it not be nice to be ready when it happens, before it happens even?

>>> Maybe the solution is to bloat the kernel sources enough to make
>>> 64KB pages worthwhile?
>
> Sorry this wasn't meant to be a dig at your patches - I guess it turned
> out that way though :\

Oh, I'll live. If I was going to take it personally and go into a big sulk, I wouldn't be here. This is linux-kernel, not the super-friends club.

> But yes, if anybody is adding complexity or size to core code it
> obviously does need to be justified -- and by no means does this only
> apply to you.

I've tried to justify it with benchmarks that came with each release, and code reviews, particularly by Dave Hansen, showed that earlier versions had significant problems that needed to be ironed out. I don't want to hurt the normal case, because the fact of the matter is, my desktop machine (which runs with these patches to see if there are any bugs) runs the normal case, and it will until we get much further because I'm not configuring my machine for HugeTLB when it boots. If I'm hurting the normal case, that's more time switching windows to see if the next test kernel has built yet.

If we can do this and not regress in the standard case, then what is wrong? I'm still waiting for figures that say this approach is slow, and I can only assume someone is trying, considering the length of this thread. If and when those figures show up, I'll put on the thinking hat and see where I went wrong, because regressing performance is wrong. There is a win-win solution somewhere, how hard could it possibly be :) ?

I'm looking at the zone approach. I want to see if it can work in a nice fashion, not in a "if the sysadm can see the future and configure correctly, it'll work just fine" fashion. I'm not confident, but it might be bias.

--
Mel Gorman
Part-time Phd Student					Java Applications Developer
University of Limerick					IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  1:06 ` Linus Torvalds
  2005-11-04  1:20 ` Paul Mackerras
  2005-11-04  1:22 ` Nick Piggin
@ 2005-11-04  1:26 ` Mel Gorman
  2 siblings, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-04 1:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Martin J. Bligh, Arjan van de Ven, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Linus Torvalds wrote:
>
> On Fri, 4 Nov 2005, Nick Piggin wrote:
>
>> Looks like ppc64 is getting 64K page support, at which point higher
>> order allocations (eg. for stacks) basically disappear, don't they?
>
> Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a
> general-purpose machine.
>
> 64kB pages are _only_ usable for databases, nothing else.

Very well, but if the infrastructure required to help get 64kB pages performs the same as, or better than, the current infrastructure that gives 4kB pages, then why not? I am biased obviously and probably optimistic, but I am hoping we have a case here where we get our cake and eat it twice.

> Why? Do the math. Try to cache the whole kernel source tree in 4kB pages
> vs 64kB pages. See how the memory usage goes up by a factor of _four_.

I don't know, but I doubt they would use 64kB pages as the default size unless it is a specialised machine. I could be wrong, I don't have a ppc64 machine, I don't work on a ppc64 machine, I haven't read the architecture's documentation and I didn't write this code for a ppc64 machine. If the machine in question is a specialised machine, its users go into the 0.01% category of people, but it's a group that we can still help without introducing static zones they have to configure.

I'm still waiting on figures that say the approach proposed here is actually really slow, rather than "makes people unhappy" slow. If this is proved to be slow, then I'll admit there is a problem and put more effort into the plans to use zones instead. I just haven't found a problem on the machines I have available to me, be it aim9, bench-stresshighalloc or building kernels (which I think is important considering how often I build test kernels). If it's a documentation problem with these patches, I'll write up VM docs on the allocator and submit it as a patch, complete with downsides and caveats to be fair.

--
Mel Gorman
Part-time Phd Student					Java Applications Developer
University of Limerick					IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:08 ` Linus Torvalds
  2005-11-03 18:17 ` Martin J. Bligh
@ 2005-11-03 21:11 ` Mel Gorman
  1 sibling, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-03 21:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Martin J. Bligh, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Linus Torvalds wrote:

> On Thu, 3 Nov 2005, Arjan van de Ven wrote:
>> On Thu, 2005-11-03 at 09:51 -0800, Martin J. Bligh wrote:
>>
>>> For amusement, let me put in some tritely oversimplified math. For the
>>> sake of argument, assume the free watermarks are 8MB or so. Let's assume
>>> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
>>> I'm going to assume random distribution of free pages, which is
>>> oversimplified, but I'm trying to demonstrate a general premise, not get
>>> accurate numbers.
>>
>> that is VERY oversimplified though, given the anti-fragmentation
>> property of the buddy algorithm

The statistical properties of the buddy system are a nightmare. There is a paper called "Statistical Properties of the Buddy System" which is a whole pile of no fun to read. It's because of the difficulty of analysing fragmentation offline that bench-stresshighalloc was written, to see how well anti-defrag would do.

> Indeed. I wrote a program at one point doing random allocation and
> de-allocation and looked at what the output was, and buddy is very good
> at avoiding fragmentation.

The worst cause of fragmentation I found was kernel caches that were long lived. How fragmenting the workload is depended heavily on whether things like updatedb happened, which is why bench-stresshighalloc deliberately ran it. It's also why anti-defrag tries to group inodes and buffer_heads into the same areas in memory, separate from other presumed-to-be-even-longer-lived kernel allocations. The assumption is that if the buffer, inode and dcaches are all shrunk, contiguous blocks will appear.

You're also right on the size of the watermarks for zones and how it affects fragmentation. A serious problem I had with anti-defrag was when 87.5% of memory is in use. At this point, a "fallback" area is used by any allocation type that has no pages of its own. When it is depleted, real fragmentation starts happening, and it's also about here that the high watermark for reclaiming starts. I wanted to increase the watermarks to start reclaiming pages when the "fallback" area started getting used, but didn't think I would get away with adjusting those figures. I could have cheated and set it via /proc before benchmarks, but didn't, to avoid "magic test system" syndrome.

> These days we have things like per-cpu lists in front of the buddy
> allocator that will make fragmentation somewhat higher, but it's still
> absolutely true that the page allocation layout is _not_ random.

It's worse than somewhat higher for the per-cpu pages. Using another set of patches on top of an earlier version of anti-defrag, I was able to allocate about 75% of physical memory in pinned 4MiB chunks of memory under loads of 15-20 (kernel builds). To get there, per-cpu pages had to be drained using an IPI call because, for some perverse reason, there were always 2 or 3 free per-cpu pages in the middle of a 1024-page block.

Basically, I don't think we have to live with fragmentation in the page allocator.
I think it can be pushed down a whole lot without taking a performance hit for the 99.99% of users that don't care about this sort of thing. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
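The IPI drain Mel refers to might look roughly like the sketch below. To be clear, this is not his patch and the helper names are invented; the elided per-cpu body would, in a 2.6.14 tree, look much like the existing __drain_pages() logic, and the four-argument smp_call_function() is the calling convention of that era:

#include <linux/smp.h>
#include <linux/interrupt.h>
#include <linux/preempt.h>

/* hypothetical helper: give this CPU's hot/cold per-cpu pages back
 * to the buddy free lists so they can coalesce into larger blocks */
static void drain_pcp_local(void *unused)
{
	unsigned long flags;

	local_irq_save(flags);
	/* walk each zone's per_cpu_pageset for this CPU and free the
	 * pages back to the buddy lists, much as __drain_pages() does */
	local_irq_restore(flags);
}

/* hypothetical helper: drain every CPU's per-cpu lists */
static void drain_pcp_all(void)
{
	preempt_disable();
	/* runs on all *other* CPUs, so handle the local one ourselves */
	smp_call_function(drain_pcp_local, NULL, 0, 1);
	drain_pcp_local(NULL);
	preempt_enable();
}

The point of the drain is exactly the "2 or 3 free per-cpu pages in the middle of a 1024-page block" problem above: pages parked on per-cpu lists are free but invisible to the buddy coalescing.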
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 17:51 ` Martin J. Bligh 2005-11-03 17:59 ` Arjan van de Ven @ 2005-11-03 18:03 ` Linus Torvalds 2005-11-03 20:00 ` Paul Jackson 2005-11-03 20:46 ` Mel Gorman 2005-11-03 18:48 ` Martin J. Bligh 2 siblings, 2 replies; 241+ messages in thread From: Linus Torvalds @ 2005-11-03 18:03 UTC (permalink / raw) To: Martin J. Bligh Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven On Thu, 3 Nov 2005, Martin J. Bligh wrote: > > And I suspect that by default, there should be zero of them. Ie you'd have > > to set them up the same way you now set up a hugetlb area. > > So ... if there are 0 by default, and I run for a while and dirty up > memory, how do I free any pages up to put into them? Not sure how that > works. You don't. Just face it - people who want memory hotplug had better know that beforehand (and let's be honest - in practice it's only going to work in virtualized environments or in environments where you can insert the new bank of memory and copy it over and remove the old one with hw support). Same as hugetlb. Nobody sane _cares_. Nobody sane is asking for these things. Only people with special needs are asking for it, and they know their needs. You have to realize that the first rule of engineering is to work out the balances. The undeniable fact is, that 99.99% of all users will never care one whit, and memory management is complex and fragile. End result: the 0.01% of users will have to do some manual configuration to keep things simpler for the cases that really matter. Because the case that really matters is the sane case. The one where we - don't change memory (normal) - only add memory (easy) - only switch out memory with hardware support (ie the _hardware_ supports parallel memory, and you can switch out a DIMM without software ever really even noticing) - have system maintainers that do strange things, but _know_ that. We simply DO NOT CARE about some theoretical "general case", because the general case is (a) insane and (b) impossible to cater to without excessive complexity. Guys, a kernel developer needs to know when to say NO. And we say NO, HELL NO!! to generic software-only memory hotplug. If you are running a DB that needs to benchmark well, you damn well KNOW IT IN ADVANCE, AND YOU TUNE FOR IT. Nobody takes a random machine and says "ok, we'll now put our most performance-critical database on this machine, and oh, btw, you can't reboot it and tune for it beforehand". And if you have such a person, you need to learn to IGNORE THE CRAZY PEOPLE. When you hear voices in your head that tell you to shoot the pope, do you do what they say? Same thing goes for customers and managers. They are the crazy voices in your head, and you need to set them right, not just blindly do what they ask for. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 18:03 ` Linus Torvalds @ 2005-11-03 20:00 ` Paul Jackson 2005-11-03 20:46 ` Mel Gorman 1 sibling, 0 replies; 241+ messages in thread From: Paul Jackson @ 2005-11-03 20:00 UTC (permalink / raw) To: Linus Torvalds Cc: mbligh, mel, arjan, nickpiggin, haveblue, mingo, akpm, kravetz, linux-mm, linux-kernel, lhms-devel, arjanv > We simply DO NOT CARE about some theoretical "general case", because the > general case is (a) insane and (b) impossible to cater to without > excessive complexity. The lawyers have a phrase for this: Hard cases make bad law. For us, that's bad code. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:03 ` Linus Torvalds
  2005-11-03 20:00 ` Paul Jackson
@ 2005-11-03 20:46 ` Mel Gorman
  1 sibling, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-03 20:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin J. Bligh, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Linus Torvalds wrote:

> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>>> And I suspect that by default, there should be zero of them. Ie you'd have
>>> to set them up the same way you now set up a hugetlb area.
>>
>> So ... if there are 0 by default, and I run for a while and dirty up
>> memory, how do I free any pages up to put into them? Not sure how that
>> works.
>
> You don't.
>
> Just face it - people who want memory hotplug had better know that
> beforehand (and let's be honest - in practice it's only going to work in
> virtualized environments or in environments where you can insert the new
> bank of memory and copy it over and remove the old one with hw support).
>
> Same as hugetlb.

For HugeTLB, there are cases where the sysadmin won't configure the server because it's a tunable that can badly affect the machine if they get it wrong. In those cases, the users just get small pages and the performance penalty, and are told to like it.

> Nobody sane _cares_. Nobody sane is asking for these things. Only people
> with special needs are asking for it, and they know their needs.
>
> You have to realize that the first rule of engineering is to work out the
> balances. The undeniable fact is, that 99.99% of all users will never care
> one whit, and memory management is complex and fragile. End result: the
> 0.01% of users will have to do some manual configuration to keep things
> simpler for the cases that really matter.

Ok, so let's consider the 99.99% of users then. On two machines, aim9 benchmarks posted during this thread show some improvements on page_test, fork_test and brk_test, the paths you would expect to be hit by these patches. They are very minor improvements, but 99.99% of users benefit from this.

Aim9 might be considered artificial, so somewhere in that 99.99% of users are kernel developers who care about kbuild. Here are the timings of "kernel untar ; make defconfig ; make":

2.6.14-rc5-mm1:                          1093 seconds
2.6.14-rc5-mm1-mbuddy-v19-withoutdefrag: 1089 seconds
2.6.14-rc5-mm1-mbuddy-v19-withdefrag:    1086 seconds

The withoutdefrag mark is with the core of anti-defrag disabled via a configure option. The option to disable was a separate patch produced during this thread. To be really honest, I don't think a configurable page allocator is a great idea.

Building kernels is faster with this set of patches, which a few people on this list care about. aim9 shows very minor improvements which benefit a very large number of people, and the 0.01% of people who care about fragmentation get lower fragmentation. Of course, maybe there is something magic with my test machines (or maybe I am willing it faster), so figures from other people wouldn't hurt, whether they show gains or regressions. On my machine at least, 99.99% of people are still benefiting. I am going to wait to see if people post figures that show regressions before asking "are you still saying no?" to this set of patches.

> Because the case that really matters is the sane case.
The one where we > - don't change memory (normal) > - only add memory (easy) > - only switch out memory with hardware support (ie the _hardware_ > supports parallel memory, and you can switch out a DIMM without > software ever really even noticing) > - have system maintainers that do strange things, but _know_ that. > > We simply DO NOT CARE about some theoretical "general case", because the > general case is (a) insane and (b) impossible to cater to without > excessive complexity. > > Guys, a kernel developer needs to know when to say NO. > > And we say NO, HELL NO!! to generic software-only memory hotplug. > > If you are running a DB that needs to benchmark well, you damn well KNOW > IT IN ADVANCE, AND YOU TUNE FOR IT. > > Nobody takes a random machine and says "ok, we'll now put our most > performance-critical database on this machine, and oh, btw, you can't > reboot it and tune for it beforehand". And if you have such a person, you > need to learn to IGNORE THE CRAZY PEOPLE. > > When you hear voices in your head that tell you to shoot the pope, do you > do what they say? Same thing goes for customers and managers. They are the > crazy voices in your head, and you need to set them right, not just > blindly do what they ask for. > > Linus > -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 17:51 ` Martin J. Bligh
  2005-11-03 17:59 ` Arjan van de Ven
  2005-11-03 18:03 ` Linus Torvalds
@ 2005-11-03 18:48 ` Martin J. Bligh
  2005-11-03 19:08 ` Linus Torvalds
  2 siblings, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 18:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

> For amusement, let me put in some tritely oversimplified math. For the
> sake of argument, assume the free watermarks are 8MB or so. Let's assume
> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
> I'm going to assume random distribution of free pages, which is
> oversimplified, but I'm trying to demonstrate a general premise, not get
> accurate numbers.
>
> 8MB = 2048 pages.
>
> On a 64MB system, we have 16384 pages, 2048 free. Very roughly speaking,
> for each free page, the chance of its buddy being free is 2048/16384. So in
> grossly-oversimplified stats-land, if I can remember anything at all, the
> chance of finding one page with a free buddy is 1-(1-2048/16384)^2048,
> which is, for all intents and purposes ... 1.
>
> 1GB system, 262144 pages. 1-(1-2048/262144)^2048 = 0.9999999
>
> 128GB system, 33554432 pages. 0.1175 probability
>
> yes, yes, my math sucks and I'm a simpleton. The point is that as memory
> gets bigger, the odds suck for getting contiguous pages. And would also
> explain why you think there's no problem, and I do ;-) And bear in mind
> that's just for order 1 allocs. For bigger stuff, it REALLY sucks - I'll
> spare you more wild attempts at foully-approximated math.
>
> Hmmm. If we keep 128MB free, that totally kills off the above calculation.
> I think I'll just tweak it so the limit is not so hard on really big
> systems. Will send you a patch. However ... larger allocs will still
> suck ... I guess I'd better gross you out with more incorrect math after
> all ...

Ha. Just because I don't think I made you puke hard enough already with foul approximations ... for order 2, I think it's 1-(1-(free_pool/total)^3)^free_pool because all 3 of its buddies have to be free as well. (and generically, 2^order - 1)

ORDER: 1
1024MB system, 8MB pool = 1.000000
131072MB system, 8MB pool = 0.117506
1024MB system, 128MB pool = 1.000000
131072MB system, 128MB pool = 1.000000

ORDER: 2
1024MB system, 8MB pool = 0.000976
131072MB system, 8MB pool = 0.000000
1024MB system, 128MB pool = 1.000000
131072MB system, 128MB pool = 0.000031

ORDER: 3
1024MB system, 8MB pool = 0.000000
131072MB system, 8MB pool = 0.000000
1024MB system, 128MB pool = 0.015504
131072MB system, 128MB pool = 0.000000

ORDER: 4
1024MB system, 8MB pool = 0.000000
131072MB system, 8MB pool = 0.000000
1024MB system, 128MB pool = 0.000000
131072MB system, 128MB pool = 0.000000

------------------------

I really should learn not to post my rusty math in such public places ... but I still think the point is correct. Anyway, I'm sure somewhere in the resultant flamewar, someone will come up with some better approx ;-)

And yes, I appreciate the random distribution thing is wrong. But it's still not going to work for bigger allocs. Fixing the free watermarks will help us a bit though.

^ permalink raw reply	[flat|nested] 241+ messages in thread
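Martin's generalised formula regenerates his whole table with a few lines of C. Again this is only the random-placement model, evaluated exactly as stated (an order-n candidate needs its 2^n - 1 companion pages free); build with -lm:

#include <stdio.h>
#include <math.h>

/* 1 - (1 - (free/total)^(2^order - 1))^free */
static double p_order(double total, double freep, int order)
{
	double one = pow(freep / total, (double)((1 << order) - 1));
	return 1.0 - pow(1.0 - one, freep);
}

int main(void)
{
	static const double mem_mb[]  = { 1024, 131072 };
	static const double pool_mb[] = { 8, 128 };
	int order, m, p;

	for (order = 1; order <= 4; order++) {
		printf("ORDER: %d\n", order);
		for (p = 0; p < 2; p++)
			for (m = 0; m < 2; m++)
				printf("%.0fMB system, %.0fMB pool = %f\n",
				       mem_mb[m], pool_mb[p],
				       p_order(mem_mb[m] * 256,	/* 4K pages/MB */
					       pool_mb[p] * 256, order));
		printf("\n");
	}
	return 0;
}

The output matches the table above to the printed precision, and makes the underflow rows (0.000000) visible for what they are: the model predicting essentially no chance at all for order >= 3 with an 8MB pool.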
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 18:48 ` Martin J. Bligh @ 2005-11-03 19:08 ` Linus Torvalds 2005-11-03 22:37 ` Martin J. Bligh 2005-11-04 16:22 ` Mel Gorman 0 siblings, 2 replies; 241+ messages in thread From: Linus Torvalds @ 2005-11-03 19:08 UTC (permalink / raw) To: Martin J. Bligh Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven On Thu, 3 Nov 2005, Martin J. Bligh wrote: > > Ha. Just because I don't think I made you puke hard enough already with > foul approximations ... for order 2, I think it's Your basic fault is in believing that the free watermark would stay constant. That's insane. Would you keep 8MB free on a 64MB system? Would you keep 8MB free on a 8GB system? The point being, that if you start with insane assumptions, you'll get insane answers. The _correct_ assumption is that you aim to keep some fixed percentage of memory free. With that assumption and your math, finding higher-order pages is equally hard regardless of amount of memory. Now, your math then doesn't allow for the fact that buddy automatically coalesces for you, so in fact things get _easier_ with more memory, but hey, that needs more math than I can come up with (I never did it as math, only as simulations with allocation patterns - "smart people use math, plodding people just try to simulate an estimate" ;) Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 19:08 ` Linus Torvalds
@ 2005-11-03 22:37 ` Martin J. Bligh
  2005-11-03 23:16 ` Linus Torvalds
  0 siblings, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 22:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

>> Ha. Just because I don't think I made you puke hard enough already with
>> foul approximations ... for order 2, I think it's
>
> Your basic fault is in believing that the free watermark would stay
> constant.
>
> That's insane.
>
> Would you keep 8MB free on a 64MB system?
>
> Would you keep 8MB free on a 8GB system?
>
> The point being, that if you start with insane assumptions, you'll get
> insane answers.

Ummm. I was basing it on what we actually do now in the code, unless I misread it, which is perfectly possible. Do you want this patch?

diff -purN -X /home/mbligh/.diff.exclude linux-2.6.14/mm/page_alloc.c 2.6.14-no_water_cap/mm/page_alloc.c
--- linux-2.6.14/mm/page_alloc.c	2005-10-27 18:52:20.000000000 -0700
+++ 2.6.14-no_water_cap/mm/page_alloc.c	2005-11-03 14:36:06.000000000 -0800
@@ -2387,8 +2387,6 @@ static void setup_per_zone_pages_min(voi
 		min_pages = zone->present_pages / 1024;
 		if (min_pages < SWAP_CLUSTER_MAX)
 			min_pages = SWAP_CLUSTER_MAX;
-		if (min_pages > 128)
-			min_pages = 128;
 		zone->pages_min = min_pages;
 	} else {
 		/* if it's a lowmem zone, reserve a number of pages

> The _correct_ assumption is that you aim to keep some fixed percentage of
> memory free. With that assumption and your math, finding higher-order
> pages is equally hard regardless of amount of memory.

That would, indeed, make more sense.

> Now, your math then doesn't allow for the fact that buddy automatically
> coalesces for you, so in fact things get _easier_ with more memory, but
> hey, that needs more math than I can come up with (I never did it as math,
> only as simulations with allocation patterns - "smart people use math,
> plodding people just try to simulate an estimate" ;)

Not sure what people who do math, but wrongly, are called, but I'm sure it's not polite, and I'm sure I'm one of those ;-)

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
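In isolation, what that hunk changes is easy to tabulate. Below is a userspace re-statement of the same min_pages calculation, with and without the 128-page cap (SWAP_CLUSTER_MAX is 32, as in kernels of this vintage):

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32

/* the setup_per_zone_pages_min() highmem branch, capped or not */
static unsigned long pages_min(unsigned long present, int capped)
{
	unsigned long min_pages = present / 1024;

	if (min_pages < SWAP_CLUSTER_MAX)
		min_pages = SWAP_CLUSTER_MAX;
	if (capped && min_pages > 128)
		min_pages = 128;
	return min_pages;
}

int main(void)
{
	unsigned long mb;

	for (mb = 512; mb <= 131072; mb *= 4) {
		unsigned long pages = mb * 256;	/* 4K pages per MB */

		printf("%6luMB highmem: capped %4lu pages (%6lu kB), "
		       "uncapped %6lu pages (%8lu kB)\n", mb,
		       pages_min(pages, 1), pages_min(pages, 1) * 4,
		       pages_min(pages, 0), pages_min(pages, 0) * 4);
	}
	return 0;
}

The cap only starts to bite above 512MB of highmem; at 128GB the capped watermark is still 512kB while the uncapped one is 128MB, which is the disproportion Martin's probability table was really about.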
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 22:37 ` Martin J. Bligh @ 2005-11-03 23:16 ` Linus Torvalds 2005-11-03 23:39 ` Martin J. Bligh 2005-11-04 4:39 ` Andrew Morton 0 siblings, 2 replies; 241+ messages in thread From: Linus Torvalds @ 2005-11-03 23:16 UTC (permalink / raw) To: Martin J. Bligh Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven On Thu, 3 Nov 2005, Martin J. Bligh wrote: > > Ummm. I was basing it on what we actually do now in the code, unless I > misread it, which is perfectly possible. Do you want this patch? > > diff -purN -X /home/mbligh/.diff.exclude linux-2.6.14/mm/page_alloc.c 2.6.14-no_water_cap/mm/page_alloc.c > --- linux-2.6.14/mm/page_alloc.c 2005-10-27 18:52:20.000000000 -0700 > +++ 2.6.14-no_water_cap/mm/page_alloc.c 2005-11-03 14:36:06.000000000 -0800 > @@ -2387,8 +2387,6 @@ static void setup_per_zone_pages_min(voi > min_pages = zone->present_pages / 1024; > if (min_pages < SWAP_CLUSTER_MAX) > min_pages = SWAP_CLUSTER_MAX; > - if (min_pages > 128) > - min_pages = 128; > zone->pages_min = min_pages; > } else { > /* if it's a lowmem zone, reserve a number of pages Ahh, you're right, there's a totally separate watermark for highmem. I think I even remember this. I may even be responsible. I know some of our less successful highmem balancing efforts in the 2.4.x timeframe had serious trouble when they ran out of highmem, and started pruning lowmem very very aggressively. Limiting the highmem water marks meant that it wouldn't do that very often. I think your patch may in fact be fine, but quite frankly, it needs testing under real load with highmem. In general, I don't _think_ we should do anything different for highmem at all, and we should just in general try to keep a percentage of pages available. Now, the percentage probably does depend on the zone: we should be more aggressive about more "limited" zones, ie the old 16MB DMA zone should probably try to keep a higher percentage of free pages around than the normal zone, and that in turn should probably keep a higher percentage of pages around than the highmem zones. And that's not because of fragmentation so much, but simply because the lower zones tend to have more "desperate" users. Running out of the normal zone is thus a "worse" situation than running out of highmem. And we effectively never want to allocate from the 16MB DMA zone at all, unless it is our only choice. We actually do try to do that with that "lowmem_reserve[]" logic, which reserves more pages in the lower zones the bigger the upper zones are (ie if we _only_ have memory in the low 16MB, then we don't reserve any of it, but if we have _tons_ of memory in the high zones, then we reserve more memory for the low zones and thus make the watermarks higher for them). So the watermarking interacts with that lowmem_reserve logic, and I think that on HIGHMEM, you'd be screwed _twice_: first because the "pages_min" is limited, and second because HIGHMEM has no lowmem_reserve. Does that make sense? Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
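For readers who have not looked at that code, the lowmem_reserve arithmetic goes roughly as sketched below: a lower zone reserves (pages in the zones above it) / ratio against allocations that could have been satisfied higher up. The zone sizes and the ratio values (256 for DMA, 32 for Normal) are illustrative assumptions, not quoted from any tree:

#include <stdio.h>

#define ZONES 3

int main(void)
{
	static const char *name[ZONES] = { "DMA", "Normal", "HighMem" };
	/* assumed per-lower-zone divisors, for illustration only */
	static const unsigned long ratio[ZONES] = { 256, 32, 32 };
	/* 16MB DMA, 880MB Normal, 15GB HighMem, in 4K pages */
	unsigned long present[ZONES] = { 4096, 225280, 3932160 };
	int i, j;

	for (i = 0; i < ZONES - 1; i++) {
		unsigned long higher = 0;

		for (j = i + 1; j < ZONES; j++) {
			/* reserve = (sum of higher zones) / ratio[lower] */
			higher += present[j];
			printf("%-7s reserves %6lu pages (%4lu MB) against "
			       "%s allocations\n", name[i],
			       higher / ratio[i], higher / ratio[i] / 256,
			       name[j]);
		}
	}
	return 0;
}

With these numbers, the 16MB DMA zone's reserve against highmem allocations comes out larger than the DMA zone itself, which is exactly the "effectively never allocate from it unless it is our only choice" behaviour described above.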
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 23:16 ` Linus Torvalds
@ 2005-11-03 23:39 ` Martin J. Bligh
  2005-11-04  0:42 ` Nick Piggin
  1 sibling, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 23:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

> Ahh, you're right, there's a totally separate watermark for highmem.
>
> I think I even remember this. I may even be responsible. I know some of
> our less successful highmem balancing efforts in the 2.4.x timeframe had
> serious trouble when they ran out of highmem, and started pruning lowmem
> very very aggressively. Limiting the highmem water marks meant that it
> wouldn't do that very often.
>
> I think your patch may in fact be fine, but quite frankly, it needs
> testing under real load with highmem.
>
> In general, I don't _think_ we should do anything different for highmem at
> all, and we should just in general try to keep a percentage of pages
> available. Now, the percentage probably does depend on the zone: we should
> be more aggressive about more "limited" zones, ie the old 16MB DMA zone
> should probably try to keep a higher percentage of free pages around than
> the normal zone, and that in turn should probably keep a higher percentage
> of pages around than the highmem zones.

Hmm, it strikes me that there will be few (if any?) allocations out of highmem. PPC64 et al dump everything into ZONE_DMA though - so those should be uncapped already.

> And that's not because of fragmentation so much, but simply because the
> lower zones tend to have more "desperate" users. Running out of the normal
> zone is thus a "worse" situation than running out of highmem. And we
> effectively never want to allocate from the 16MB DMA zone at all, unless
> it is our only choice.

Well it's not 16MB on the other platforms, but ...

> We actually do try to do that with that "lowmem_reserve[]" logic, which
> reserves more pages in the lower zones the bigger the upper zones are (ie
> if we _only_ have memory in the low 16MB, then we don't reserve any of it,
> but if we have _tons_ of memory in the high zones, then we reserve more
> memory for the low zones and thus make the watermarks higher for them).
>
> So the watermarking interacts with that lowmem_reserve logic, and I think
> that on HIGHMEM, you'd be screwed _twice_: first because the "pages_min"
> is limited, and second because HIGHMEM has no lowmem_reserve.
>
> Does that make sense?

Yes. So we were only capping highmem before, now that I squint at it closer. I was going off a simplification I'd written for a paper, which is not generally correct.

I doubt frag is a problem in highmem, so maybe the code is correct as-is. We only want contig allocs for virtual when it's mapped 1-1 to physical (ie the kernel mapping) or real physical things. I suppose I could write something to trawl the source tree to check that assumption, but it feels right ...

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 23:39 ` Martin J. Bligh @ 2005-11-04 0:42 ` Nick Piggin 0 siblings, 0 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-04 0:42 UTC (permalink / raw) To: Martin J. Bligh Cc: Linus Torvalds, Mel Gorman, Arjan van de Ven, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven Martin J. Bligh wrote: >>Ahh, you're right, there's a totally separate watermark for highmem. >> >>I think I even remember this. I may even be responsible. I know some of >>our less successful highmem balancing efforts in the 2.4.x timeframe had >>serious trouble when they ran out of highmem, and started pruning lowmem >>very very aggressively. Limiting the highmem water marks meant that it >>wouldn't do that very often. >> >>I think your patch may in fact be fine, but quite frankly, it needs >>testing under real load with highmem. >> I'd prefer not. The reason is that it increases the "min" watermark, which only gets used basically by GFP_ATOMIC and PF_MEMALLOC allocators - neither of which are likely to want highmem. Also, I don't think anybody cares about higher order highmem allocations. At least the patches in this thread: http://marc.theaimsgroup.com/?l=linux-kernel&m=113082256231168&w=2 Should be applied before this. However they also need more testing so I'll be sending them to Andrew first. Patch 2 does basically the same thing as your patch, without increasing the min watermark. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 23:16 ` Linus Torvalds 2005-11-03 23:39 ` Martin J. Bligh @ 2005-11-04 4:39 ` Andrew Morton 1 sibling, 0 replies; 241+ messages in thread From: Andrew Morton @ 2005-11-04 4:39 UTC (permalink / raw) To: Linus Torvalds Cc: mbligh, mel, arjan, nickpiggin, haveblue, mingo, kravetz, linux-mm, linux-kernel, lhms-devel, arjanv Linus Torvalds <torvalds@osdl.org> wrote: > > On Thu, 3 Nov 2005, Martin J. Bligh wrote: > > > > Ummm. I was basing it on what we actually do now in the code, unless I > > misread it, which is perfectly possible. Do you want this patch? > > > > diff -purN -X /home/mbligh/.diff.exclude linux-2.6.14/mm/page_alloc.c 2.6.14-no_water_cap/mm/page_alloc.c > > --- linux-2.6.14/mm/page_alloc.c 2005-10-27 18:52:20.000000000 -0700 > > +++ 2.6.14-no_water_cap/mm/page_alloc.c 2005-11-03 14:36:06.000000000 -0800 > > @@ -2387,8 +2387,6 @@ static void setup_per_zone_pages_min(voi > > min_pages = zone->present_pages / 1024; > > if (min_pages < SWAP_CLUSTER_MAX) > > min_pages = SWAP_CLUSTER_MAX; > > - if (min_pages > 128) > > - min_pages = 128; > > zone->pages_min = min_pages; > > } else { > > /* if it's a lowmem zone, reserve a number of pages > > Ahh, you're right, there's a totally separate watermark for highmem. > > I think I even remember this. I may even be responsible. I know some of > our less successful highmem balancing efforts in the 2.4.x timeframe had > serious trouble when they ran out of highmem, and started pruning lowmem > very very aggressively. Limiting the highmem water marks meant that it > wouldn't do that very often. No, that was me and Matthew Dobson, circa 2.5.71. The thinking was that highmem is just for userspace pages and we don't need to keep the free memory pool around for things like atomic allocations. Especially as a proportionally-sized highmem emergency pool would be potentially hundreds of (wasted) megabytes. iirc, things worked ok with a highmem min_pages threshold of zero pages. Back in 2.5.70, before everyone else broke it ;) ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 19:08 ` Linus Torvalds
2005-11-03 22:37 ` Martin J. Bligh
@ 2005-11-04 16:22 ` Mel Gorman
1 sibling, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-04 16:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin J. Bligh, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Linus Torvalds wrote:

> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> >
> > Ha. Just because I don't think I made you puke hard enough already with
> > foul approximations ... for order 2, I think it's
>
> Your basic fault is in believing that the free watermark would stay
> constant.
>
> That's insane.
>
> Would you keep 8MB free on a 64MB system?
>
> Would you keep 8MB free on a 8GB system?
>
> The point being, that if you start with insane assumptions, you'll get
> insane answers.
>
> The _correct_ assumption is that you aim to keep some fixed percentage of
> memory free. With that assumption and your math, finding higher-order
> pages is equally hard regardless of amount of memory.
>
> Now, your math then doesn't allow for the fact that buddy automatically
> coalesces for you, so in fact things get _easier_ with more memory, but
> hey, that needs more math than I can come up with (I never did it as
> math, only as simulations with allocation patterns - "smart people use
> math, plodding people just try to simulate an estimate" ;)
>

My math is not that great either, so here is a simulation.

Setup: Reboot the machine, which is a quad Xeon xSeries 350 with 1.5GiB of
RAM. Configure /proc/sys/vm/min_free_kbytes to try and keep 1/8th of
physical memory free. This is to keep in line with your suggestion that
fragmentation is low when there is a higher percentage of memory free.

Load: Run a load - 7 kernels compiling simultaneously at -j2, which gives
loads between 10 and 14. Try to get 50% of physical memory in 4MiB pages
(1024 contiguous pages) while compiling. When the test ends and the system
is quiet, try again. 4MiB in this case is a single HugeTLB page.

Here are the results;

2.6.14-rc5-mm1-clean (OOM killer disabled)
Allocating under load
Order:                 10
Allocation type:       HighMem
Attempted allocations: 160
Success allocs:        24
Failed allocs:         136
DMA zone allocs:       0
Normal zone allocs:    16
HighMem zone allocs:   8
% Success:             15

2.6.14-rc5-mm1-mbuddy-v19
Allocating under load
Order:                 10
Allocation type:       HighMem
Attempted allocations: 160
Success allocs:        24
Failed allocs:         136
DMA zone allocs:       0
Normal zone allocs:    11
HighMem zone allocs:   13
% Success:             15

Not a lot of difference there, and the success rate is not great.
mbuddy-v19 is a bit better at the normal zone and that's about it. These
results are not surprising, as kswapd is making no effort to get contiguous
pages. Under a load of 7 kernel compiles, kswapd will not free pages fast
enough.

When the test ends and the system is quiet, try to get 80% of physical
memory in large pages. 4 attempts are made to satisfy the requests, to give
kswapd lots of time.

2.6.14-rc5-mm1-clean (OOM killer disabled)
Allocating while rested
Order:                 10
Allocation type:       HighMem
Attempted allocations: 300
Success allocs:        159
Failed allocs:         141
DMA zone allocs:       0
Normal zone allocs:    46
HighMem zone allocs:   113
% Success:             53

Mainly highmem there.

2.6.14-rc5-mm1-mbuddy-v19
Allocating while rested
Order:                 10
Allocation type:       HighMem
Attempted allocations: 300
Success allocs:        212
Failed allocs:         88
DMA zone allocs:       0
Normal zone allocs:    102
HighMem zone allocs:   110
% Success:             70

Look at the big difference in the number of successful allocations in
ZONE_NORMAL, because the kernel allocations were kept together. Experience
has shown me that failure to get higher success rates depends on per-cpu
pages and on the number of kernel pages that leaked to other areas (56 over
the course of this test). Kernel-page leaking was helped a lot by setting
min_free_kbytes higher than the default.

I then ported forward the linear scanner and ran the tests again. The
linear scanner does two things - it finds linearly reclaimable pages using
information provided by anti-defrag, and it drains the per-cpu caches. I'll
post the linear scanner code if people want to look at it, but it's really
crap. Being slow, working too hard, and not keeping the reclaimed pages for
the process doing the reclaiming are just some of its problems. I need to
rewrite it almost from scratch and avoid all the mistakes, but it's a path
that is hit *only* if you are allocating high orders.

2.6.14-rc5-mm1-mbuddy-v19-lnscan
Allocating under load
Order:                 10
Allocation type:       HighMem
Attempted allocations: 160
Success allocs:        155
Failed allocs:         0
DMA zone allocs:       0
Normal zone allocs:    12
HighMem zone allocs:   143
% Success:             96

It mainly got its pages back from highmem, which is always easier as long
as PTE pages are not in the way.

2.6.14-rc5-mm1-mbuddy-v19-lnscan
Allocating while rested
Order:                 10
Allocation type:       HighMem
Attempted allocations: 300
Success allocs:        275
Failed allocs:         0
DMA zone allocs:       0
Normal zone allocs:    133
HighMem zone allocs:   142
% Success:             91

That is 71% of physical memory available in contiguous blocks with the
linear scanner, but that code is not ready. Anti-defrag on its own, as it
is today, was able to get 55% of physical memory in 4MiB chunks. This is
provided without performance regressions in the normal case everyone cares
about. In my tests, there are minor improvements on aim9, which is
artificial, and kernel build tests, which people do care about, gained a
few seconds.

Do these patches still make no sense to you? Lower fragmentation that does
not impact the cases everyone cares about? If so, why?

To get the best possible results, a zone approach could still be built on
top of this, and it seems as if it's worth developing. At the cost of some
configuration, the zone would give *hard* guarantees on the available
number of large pages, and anti-defrag would give best effort everywhere
else. By default, without configuration, you would get best-effort.

--
Mel Gorman
Part-time Phd Student                 Java Applications Developer
University of Limerick                IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
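[Illustration, not part of the original thread: the benchmark driver Mel
used is not shown here. The sketch below is a guess at the shape of such a
test in a kernel-module context - the GFP flags, the function name, and the
immediate free are illustrative simplifications; the real harness holds the
pages so it can count how much memory is simultaneously available.]

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/kernel.h>

/*
 * Attempt a fixed number of order-10 (4MiB with 4KiB pages) highmem
 * allocations and report the success rate. Purely illustrative.
 */
static void try_highorder_allocs(int attempts, unsigned int order)
{
	int success = 0, i;

	for (i = 0; i < attempts; i++) {
		struct page *page = alloc_pages(GFP_HIGHUSER, order);

		if (page) {
			success++;
			/* simplification: the real test holds the pages */
			__free_pages(page, order);
		}
	}
	printk(KERN_INFO "order-%u: %d/%d (%d%%) succeeded\n",
	       order, success, attempts, (success * 100) / attempts);
}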
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 15:40 ` Arjan van de Ven
2005-11-03 15:51 ` Linus Torvalds
@ 2005-11-03 15:53 ` Martin J. Bligh
1 sibling, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:53 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List, lhms,
Arjan van de Ven

--Arjan van de Ven <arjan@infradead.org> wrote (on Thursday, November 03, 2005 16:40:21 +0100):

> On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
>> >> Can we quit coming up with specialist hacks for hotplug, and try to
>> >> solve the generic problem please? hotplug is NOT the only issue here.
>> >> Fragmentation in general is.
>> >
>> > Not really it isn't. There have been a few cases (e1000 being the main
>> > one, and is fixed upstream) where fragmentation in general is a
>> > problem. But mostly it is not.
>>
>> Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
>
> with CONFIG_4KSTACKS :)

I've been told previously that doesn't work for x86_64 and other 64-bit
platforms. Is that incorrect?

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 15:01 ` Ingo Molnar
2005-11-01 15:22 ` Dave Hansen
@ 2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
` (3 more replies)
1 sibling, 4 replies; 241+ messages in thread
From: Kamezawa Hiroyuki @ 2005-11-01 16:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms

Ingo Molnar wrote:
> so it's all about expectations: _could_ you reasonably remove a piece of
> RAM? Customer will say: "I have stopped all nonessential services, and
> free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
> the kernel!". No reasonable customer will say: "True, I have all RAM
> used up in mlock()ed sections, but i want to remove some RAM
> nevertheless".
>

Hi, I'm one of the people in -lhms.

In my understanding...

- Memory hot-remove with IBM's LPAR(?) approach is
  [remove some amount of memory from somewhere].
  For this approach, Mel's patch will work well. But it will not guarantee
  that a user can remove a specified range of memory at any time, because
  how a memory range is used is defined not by an admin but by the kernel,
  automatically. But to extract some amount of memory, Mel's patch is very
  important and they need this.

My own target is NUMA node hotplug, and what NUMA node hotplug wants is

- [remove this range of memory].
  For this approach, an admin should define *core* nodes and removable
  nodes. Memory on a removable node is removable. Dividing areas into
  removable and not-removable is needed, because we cannot allocate any
  kernel objects in a removable area. A removable area should be 100%
  removable. The customer can know the limitation before using the system.

What I'm considering now is this:
- a removable area is a hot-added area
- a not-removable area is memory which is visible to the kernel at boot
  time. (I'd like to achieve this by the limitation that a hot-added node
  goes only into ZONE_HIGHMEM.)

A customer can hot-add their extra memory after boot. This is very easy to
understand. The performance problem is a trade-off. (I'm afraid of this ;)

If a customer wants to guarantee that some memory areas are hot-removable,
he will hot-add them. I don't think adding memory for the kernel by
hot-add is wanted by a customer.

-- Kame

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
@ 2005-11-01 16:59 ` Kamezawa Hiroyuki
2005-11-01 17:19 ` Mel Gorman
` (2 subsequent siblings)
3 siblings, 0 replies; 241+ messages in thread
From: Kamezawa Hiroyuki @ 2005-11-01 16:59 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

Kamezawa Hiroyuki wrote:
> Ingo Molnar wrote:
>
>> so it's all about expectations: _could_ you reasonably remove a piece
>> of RAM? Customer will say: "I have stopped all nonessential services,
>> and free RAM is at 90%, still I cannot remove that piece of faulty
>> RAM, fix the kernel!". No reasonable customer will say: "True, I have
>> all RAM used up in mlock()ed sections, but i want to remove some RAM
>> nevertheless".
>>
> Hi, I'm one of the people in -lhms.
>
> In my understanding...
> - Memory hot-remove with IBM's LPAR(?) approach is
>   [remove some amount of memory from somewhere].
>   For this approach, Mel's patch will work well. But it will not
>   guarantee that a user can remove a specified range of memory at any
>   time, because how a memory range is used is defined not by an admin
>   but by the kernel, automatically. But to extract some amount of
>   memory, Mel's patch is very important and they need this.
>

One more consideration...

Some CPUs which support virtualization will be shipped by some vendors in
the near future. If someone uses a virtualized OS, the only problem is
*resizing*. A hypervisor will be able to remap semi-physical pages
anywhere with hardware assistance, but system resizing needs operating
system assistance. In this direction, [remove some amount of memory from
somewhere] is an important approach.

-- Kame

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
@ 2005-11-01 17:19 ` Mel Gorman
2005-11-02 0:32 ` KAMEZAWA Hiroyuki
2005-11-01 18:06 ` linux-os (Dick Johnson)
2005-11-02 7:19 ` Ingo Molnar
3 siblings, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-01 17:19 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Wed, 2 Nov 2005, Kamezawa Hiroyuki wrote:

> Ingo Molnar wrote:
> > so it's all about expectations: _could_ you reasonably remove a piece of
> > RAM? Customer will say: "I have stopped all nonessential services, and
> > free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
> > the kernel!". No reasonable customer will say: "True, I have all RAM
> > used up in mlock()ed sections, but i want to remove some RAM
> > nevertheless".
> >
> Hi, I'm one of the people in -lhms.
>
> In my understanding...
> - Memory hot-remove with IBM's LPAR(?) approach is
>   [remove some amount of memory from somewhere].
>   For this approach, Mel's patch will work well. But it will not guarantee
>   that a user can remove a specified range of memory at any time, because
>   how a memory range is used is defined not by an admin but by the kernel,
>   automatically. But to extract some amount of memory, Mel's patch is very
>   important and they need this.
>
> My own target is NUMA node hotplug, and what NUMA node hotplug wants is
> - [remove this range of memory].
>   For this approach, an admin should define *core* nodes and removable
>   nodes. Memory on a removable node is removable. Dividing areas into
>   removable and not-removable is needed, because we cannot allocate any
>   kernel objects in a removable area. A removable area should be 100%
>   removable. The customer can know the limitation before using the system.
>

In this case, we would want some mechanism that says "don't put awkward
pages in this NUMA node" in a clear way. One way we could do this is;

1. Move fallback_allocs to be per-node. fallback_allocs is currently
   defined as

int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
	{RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN,   RCLM_EASY, RCLM_TYPES},
	{RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
	{RCLM_KERN,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
};

   The effect is that a RCLM_NORCLM allocation falls back to
   RCLM_FALLBACK, RCLM_KERN, RCLM_EASY and then gives up.

2. Architectures would need to provide a function that allocates and
   populates a fallback_allocs[][] array. If they do not provide one, a
   generic function uses an array like the one above.

3. When adding a node that must be removable, make the array look like
   this

int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
	{RCLM_NORCLM, RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
	{RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN,  RCLM_TYPES},
	{RCLM_KERN,   RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
};

   The effect of this is that only allocations that are easily reclaimable
   will end up in this node.

This would be a straightforward addition to build upon this set of patches.
The difference would only be visible to architectures that cared.

> What I'm considering now is this:
> - a removable area is a hot-added area
> - a not-removable area is memory which is visible to the kernel at boot
>   time. (I'd like to achieve this by the limitation that a hot-added node
>   goes only into ZONE_HIGHMEM.)

ZONE_HIGHMEM can still end up with PTE pages if allocating PTE pages from
highmem is configured. This is bad. With the above approach, nodes that are
not hot-added that have a ZONE_HIGHMEM will be able to use it for PTEs as
well. But when a node is hot-added, it will have a ZONE_HIGHMEM that is not
used for PTE allocations, because they are not RCLM_EASY allocations.

--
Mel Gorman
Part-time Phd Student                 Java Applications Developer
University of Limerick                IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
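[Illustration, not part of the original thread: to make the role of the
array concrete, here is a minimal sketch of how an allocator might walk one
row of fallback_allocs[] until the RCLM_TYPES sentinel. rmqueue_type() is a
hypothetical helper standing in for the patches' actual per-type free-list
removal code.]

/*
 * Each row of fallback_allocs[] lists, in order, the reclaim types an
 * allocation of a given type may steal from, terminated by RCLM_TYPES.
 */
static struct page *alloc_with_fallback(struct zone *zone, int alloctype,
					unsigned int order)
{
	int *fallback = fallback_allocs[alloctype];
	int i;

	for (i = 0; fallback[i] != RCLM_TYPES; i++) {
		/* hypothetical helper: pop a block from this type's lists */
		struct page *page = rmqueue_type(zone, fallback[i], order);

		if (page)
			return page;
	}
	return NULL;	/* every permitted type's free lists were empty */
}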
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 17:19 ` Mel Gorman
@ 2005-11-02 0:32 ` KAMEZAWA Hiroyuki
2005-11-02 11:22 ` Mel Gorman
0 siblings, 1 reply; 241+ messages in thread
From: KAMEZAWA Hiroyuki @ 2005-11-02 0:32 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Dave Hansen, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms

Mel Gorman wrote:
> 3. When adding a node that must be removable, make the array look like
>    this
>
> int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
> 	{RCLM_NORCLM, RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
> 	{RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN,  RCLM_TYPES},
> 	{RCLM_KERN,   RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
> };
>
>    The effect of this is that only allocations that are easily reclaimable
>    will end up in this node. This would be a straightforward addition to
>    build upon this set of patches. The difference would only be visible to
>    architectures that cared.
>

Thank you for the illustration.
Maybe a fallback list per pgdat/zone is what I need with your patch, right?

-- Kame

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 0:32 ` KAMEZAWA Hiroyuki
@ 2005-11-02 11:22 ` Mel Gorman
0 siblings, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-02 11:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Wed, 2 Nov 2005, KAMEZAWA Hiroyuki wrote:

> Mel Gorman wrote:
> > 3. When adding a node that must be removable, make the array look like
> >    this
> >
> > int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
> > 	{RCLM_NORCLM, RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
> > 	{RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN,  RCLM_TYPES},
> > 	{RCLM_KERN,   RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
> > };
> >
> >    The effect of this is that only allocations that are easily
> >    reclaimable will end up in this node. This would be a straightforward
> >    addition to build upon this set of patches. The difference would only
> >    be visible to architectures that cared.
> >
> Thank you for the illustration.
> Maybe a fallback list per pgdat/zone is what I need with your patch, right?
>

With my patch, yes. With zones, you need to change how zonelists are built
for each node.

--
Mel Gorman
Part-time Phd Student                 Java Applications Developer
University of Limerick                IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
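[Illustration, not part of the original thread: "how zonelists are built"
refers to build_zonelists() in mm/page_alloc.c. Below is a rough sketch of
the kind of change Mel means, using ZONE_REMOVABLE - the hypothetical zone
from the hotplug patches discussed elsewhere in this thread - and it is not
taken from any posted patch.]

/*
 * When building the zonelist used for kernel allocations, skip a
 * hypothetical ZONE_REMOVABLE so pinned kernel objects never land
 * there; only easily-reclaimable allocations would get a zonelist
 * that includes it.
 */
static void build_kernel_zonelist(pg_data_t *pgdat, struct zonelist *zl)
{
	int i, j = 0;

	for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
		struct zone *zone = pgdat->node_zones + i;

		if (i == ZONE_REMOVABLE)	/* hypothetical zone */
			continue;
		if (zone->present_pages)	/* zone actually has memory */
			zl->zones[j++] = zone;
	}
	zl->zones[j] = NULL;	/* 2.6-era zonelists are NULL-terminated */
}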
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
2005-11-01 17:19 ` Mel Gorman
@ 2005-11-01 18:06 ` linux-os (Dick Johnson)
2005-11-02 7:19 ` Ingo Molnar
3 siblings, 0 replies; 241+ messages in thread
From: linux-os (Dick Johnson) @ 2005-11-01 18:06 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Tue, 1 Nov 2005, Kamezawa Hiroyuki wrote:

> Ingo Molnar wrote:
>> so it's all about expectations: _could_ you reasonably remove a piece of
>> RAM? Customer will say: "I have stopped all nonessential services, and
>> free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
>> the kernel!". No reasonable customer will say: "True, I have all RAM
>> used up in mlock()ed sections, but i want to remove some RAM
>> nevertheless".
>>
> Hi, I'm one of the people in -lhms.
>
> In my understanding...
> - Memory hot-remove with IBM's LPAR(?) approach is
>   [remove some amount of memory from somewhere].
>   For this approach, Mel's patch will work well. But it will not guarantee
>   that a user can remove a specified range of memory at any time, because
>   how a memory range is used is defined not by an admin but by the kernel,
>   automatically. But to extract some amount of memory, Mel's patch is very
>   important and they need this.
>
> My own target is NUMA node hotplug, and what NUMA node hotplug wants is
> - [remove this range of memory].
>   For this approach, an admin should define *core* nodes and removable
>   nodes. Memory on a removable node is removable. Dividing areas into
>   removable and not-removable is needed, because we cannot allocate any
>   kernel objects in a removable area. A removable area should be 100%
>   removable. The customer can know the limitation before using the system.
>
> What I'm considering now is this:
> - a removable area is a hot-added area
> - a not-removable area is memory which is visible to the kernel at boot
>   time. (I'd like to achieve this by the limitation that a hot-added node
>   goes only into ZONE_HIGHMEM.)
> A customer can hot-add their extra memory after boot. This is very easy to
> understand. The performance problem is a trade-off. (I'm afraid of this ;)
>
> If a customer wants to guarantee that some memory areas are hot-removable,
> he will hot-add them. I don't think adding memory for the kernel by
> hot-add is wanted by a customer.
>
> -- Kame

With ix86 machines, the page directory pointed to by CR3 always needs to
be present in physical memory. This means that there must always be some
RAM that can't be hot-swapped (you can't put back the contents of the page
directory without using the CPU, which needs the page directory). This is
explained on page 5-21 of the i486 reference manual. It happens because
there is no "present" bit in CR3 as there is in the page tables themselves.

This means that "surprise" swaps are impossible. However, given
forewarning, it is possible to build a new table somewhere in existing RAM
within the physical constraints required, call some code there (it needs
to be a 1:1 translation), disable paging, then proceed. The problem is
that of writing the contents of the RAM to be replaced out to storage
media, so the new page table needs to be loaded from the new location.
This may not work if the LDT and the GDT are not accessible from their
current locations. If they are in the RAM to be replaced, you are in a
world of hurt, taking the "world" apart and putting it back together
again.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.55 BogoMips).
Warning : 98.36% of all statistics are fiction.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
` (2 preceding siblings ...)
2005-11-01 18:06 ` linux-os (Dick Johnson)
@ 2005-11-02 7:19 ` Ingo Molnar
2005-11-02 7:46 ` Gerrit Huizenga
2005-11-02 7:57 ` Nick Piggin
3 siblings, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-02 7:19 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms

* Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> My own target is NUMA node hotplug, and what NUMA node hotplug wants is
> - [remove this range of memory].
>   For this approach, an admin should define *core* nodes and removable
>   nodes. Memory on a removable node is removable. Dividing areas into
>   removable and not-removable is needed, because we cannot allocate any
>   kernel objects in a removable area. A removable area should be 100%
>   removable. The customer can know the limitation before using the
>   system.

that's a perfectly fine method, and is quite similar to the 'separate
zone' approach Nick mentioned too. It is also easily understandable for
users/customers.

under such an approach, things become easier as well: if you have zones
you can restrict (no kernel pinned-down allocations, no mlock-ed pages,
etc.), there's no need for any 'fragmentation avoidance' patches!
Basically all of that RAM becomes instantly removable (with some small
complications). That's the beauty of the separate-zones approach. It is
also a limitation: no kernel allocations, so all the highmem-alike
restrictions apply to it too.

but what is a dangerous fallacy is that we will be able to support hot
memory unplug of generic kernel RAM in any reliable way!

you really have to look at this from the conceptual angle: 'can an
approach ever lead to a satisfactory result?' If the answer is 'no', then
we _must not_ add a 90% solution that we _know_ will never be a 100%
solution.

for the separate-removable-zones approach we see the end of the tunnel.
Separate zones are well-understood. generic unpluggable kernel RAM _will
not work_.

	Ingo

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:19 ` Ingo Molnar
@ 2005-11-02 7:46 ` Gerrit Huizenga
2005-11-02 8:50 ` Nick Piggin
2005-11-02 10:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar
1 sibling, 2 replies; 241+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 7:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wed, 02 Nov 2005 08:19:43 +0100, Ingo Molnar wrote:
>
> * Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > My own target is NUMA node hotplug, and what NUMA node hotplug wants is
> > - [remove this range of memory].
> >   For this approach, an admin should define *core* nodes and removable
> >   nodes. Memory on a removable node is removable. Dividing areas into
> >   removable and not-removable is needed, because we cannot allocate
> >   any kernel objects in a removable area. A removable area should be
> >   100% removable. The customer can know the limitation before using
> >   the system.
>
> that's a perfectly fine method, and is quite similar to the 'separate
> zone' approach Nick mentioned too. It is also easily understandable for
> users/customers.
>
> under such an approach, things become easier as well: if you have zones
> you can restrict (no kernel pinned-down allocations, no mlock-ed
> pages, etc.), there's no need for any 'fragmentation avoidance' patches!
> Basically all of that RAM becomes instantly removable (with some small
> complications). That's the beauty of the separate-zones approach. It is
> also a limitation: no kernel allocations, so all the highmem-alike
> restrictions apply to it too.
>
> but what is a dangerous fallacy is that we will be able to support hot
> memory unplug of generic kernel RAM in any reliable way!
>
> you really have to look at this from the conceptual angle: 'can an
> approach ever lead to a satisfactory result?' If the answer is 'no',
> then we _must not_ add a 90% solution that we _know_ will never be a
> 100% solution.
>
> for the separate-removable-zones approach we see the end of the tunnel.
> Separate zones are well-understood.
>
> generic unpluggable kernel RAM _will not work_.

Actually, it will. Well, depending on terminology.

There are two usage models here - those which intend to remove physical
elements, and those where the kernel returns management of its virtualized
"physical" memory to a hypervisor. In the latter case, a hypervisor
already maintains a virtual map of the memory, and the OS needs to release
virtualized "physical" memory.

I think you are referring to RAM here as the physical component; however,
these same defrag patches help where a hypervisor is maintaining the real
physical memory below the operating system and the OS is managing a
virtualized "physical" memory.

On pSeries hardware or with Xen, a client OS can return chunks of memory
to the hypervisor. That memory needs to be returned in chunks of the size
that the hypervisor normally manages/maintains. But long ranges of
physical contiguity are not required. Just shorter ranges, depending on
what the hypervisor maintains, need to be returned from the OS to the
hypervisor.

In other words, if we can return 1 MB chunks, the hypervisor can hand out
those 1 MB chunks to other domains/partitions. So, if we can return 500
1 MB chunks from a 2 GB OS instance, we can add 500 MB dynamically to
another OS image.

This happens to be a *very* satisfactory answer for virtualized
environments. The other answer, which is harder, is to return (free)
entire large physical chunks, e.g. the size of the full memory of a node,
allowing a node to be dynamically removed (or a DIMM/SIMM/etc.).

So, people are working towards two distinct solutions, both of which
require us to do a better job of defragmenting memory (or avoiding
fragmentation in the first place).

gerrit

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:46 ` Gerrit Huizenga
@ 2005-11-02 8:50 ` Nick Piggin
2005-11-02 9:12 ` Gerrit Huizenga
1 sibling, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 8:50 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

Gerrit Huizenga wrote:

> So, people are working towards two distinct solutions, both of which
> require us to do a better job of defragmenting memory (or avoiding
> fragmentation in the first place).
>

This is just going around in circles. Even with your fragmentation
avoidance and memory defragmentation, there are still going to be cases
where memory does get fragmented and can't be defragmented. This is Ingo's
point, I believe.

Isn't the solution for your hypervisor problem to dish out pages of the
same size that are used by the virtual machines? Doesn't this provide you
with a nice, 100% solution that doesn't add complexity where it isn't
needed?

--
SUSE Labs, Novell Inc.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:50 ` Nick Piggin
@ 2005-11-02 9:12 ` Gerrit Huizenga
2005-11-02 9:37 ` Nick Piggin
0 siblings, 1 reply; 241+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 9:12 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:
> Gerrit Huizenga wrote:
>
> > So, people are working towards two distinct solutions, both of which
> > require us to do a better job of defragmenting memory (or avoiding
> > fragmentation in the first place).
> >
>
> This is just going around in circles. Even with your fragmentation
> avoidance and memory defragmentation, there are still going to be
> cases where memory does get fragmented and can't be defragmented.
> This is Ingo's point, I believe.
>
> Isn't the solution for your hypervisor problem to dish out pages of
> the same size that are used by the virtual machines? Doesn't this
> provide you with a nice, 100% solution that doesn't add complexity
> where it isn't needed?

So do you see the problem with fragmentation if the hypervisor is handing
out, say, 1 MB pages? Or, more likely, something like 64 MB pages? What
are the chances that an entire 64 MB page can be freed on a large system
that has been up a while?

And, if you create zones, you run into all of the zone rebalancing
problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on any
long running system, ZONE_HOTPLUGGABLE has been overwhelmed with random
allocations, making almost none of it available.

However, with reasonable defragmentation or fragmentation avoidance, we
have some potential to make large chunks available for return to the
hypervisor. And, that same capability continues to help those who want
to remove fixed ranges of physical memory.

gerrit

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:12 ` Gerrit Huizenga
@ 2005-11-02 9:37 ` Nick Piggin
2005-11-02 10:17 ` Gerrit Huizenga
2005-11-02 23:47 ` Rob Landley
0 siblings, 2 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 9:37 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

Gerrit Huizenga wrote:
> On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:

>> Isn't the solution for your hypervisor problem to dish out pages of
>> the same size that are used by the virtual machines? Doesn't this
>> provide you with a nice, 100% solution that doesn't add complexity
>> where it isn't needed?
>
> So do you see the problem with fragmentation if the hypervisor is
> handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> pages? What are the chances that an entire 64 MB page can be freed
> on a large system that has been up a while?
>

I see the problem, but if you want to be able to shrink memory to a given
size, then you must either introduce a hard limit somewhere, or have the
hypervisor hand out guest sized pages. Use zones, or Xen?

> And, if you create zones, you run into all of the zone rebalancing
> problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on
> any long running system, ZONE_HOTPLUGGABLE has been overwhelmed with
> random allocations, making almost none of it available.
>

If there are zone rebalancing problems[*], then it would be great to have
more users of zones because then they will be more likely to get fixed.

[*] and there are, sadly enough - see the recent patches I posted to
lkml for example. But I'm fairly confident that once the particularly
silly ones have been fixed, zone balancing will no longer be a derogatory
term as has been thrown around (maybe rightly) in this thread!

--
SUSE Labs, Novell Inc.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:37 ` Nick Piggin
@ 2005-11-02 10:17 ` Gerrit Huizenga
1 sibling, 0 replies; 241+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 10:17 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wed, 02 Nov 2005 20:37:43 +1100, Nick Piggin wrote:
> Gerrit Huizenga wrote:
> > On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:
>
> >> Isn't the solution for your hypervisor problem to dish out pages of
> >> the same size that are used by the virtual machines? Doesn't this
> >> provide you with a nice, 100% solution that doesn't add complexity
> >> where it isn't needed?
> >
> > So do you see the problem with fragmentation if the hypervisor is
> > handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> > pages? What are the chances that an entire 64 MB page can be freed
> > on a large system that has been up a while?
>
> I see the problem, but if you want to be able to shrink memory to a
> given size, then you must either introduce a hard limit somewhere, or
> have the hypervisor hand out guest sized pages. Use zones, or Xen?

So why do you believe there must be a hard limit? Any reduction in memory
usage is going to be workload related. If the workload is consuming less
memory than is available, memory reclaim is easy (e.g. handle
fragmentation, find nice sized chunks). The workload determines how much
the administrator can free.

If the workload is using all of the resources available (e.g. lots of
associated kernel memory locked down, locked user pages, etc.) then the
administrator will logically be able to remove less memory from the
machine. The amount of memory to be freed up is not determined by some
pre-defined machine constraints but based on the actual workload's use of
the machine.

In other words, who really cares if there is some hard limit? The only
limit should be the number of pages not currently needed by a given
workload, not some arbitrary zone size.

> > And, if you create zones, you run into all of the zone rebalancing
> > problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on
> > any long running system, ZONE_HOTPLUGGABLE has been overwhelmed with
> > random allocations, making almost none of it available.
>
> If there are zone rebalancing problems[*], then it would be great to
> have more users of zones because then they will be more likely to get
> fixed.
>
> [*] and there are, sadly enough - see the recent patches I posted to
> lkml for example. But I'm fairly confident that once the particularly
> silly ones have been fixed, zone balancing will no longer be a
> derogatory term as has been thrown around (maybe rightly) in this
> thread!

You are more optimistic here than I. You might have improved the problem,
but I think that any zone rebalancing problem is intrinsically hard given
the way those zones are used and the fact that we sort of want them to be
dynamic and yet physically contiguous. Those two core constraints seem to
be relatively at odds with each other.

I'm not a huge fan of dividing memory up into different types which are
all special purposed. Everything that becomes special purposed over time
limits its use and brings up questions on what special-purpose bucket each
allocation should use (e.g. ZONE_NORMAL or ZONE_HIGHMEM or ZONE_DMA or
ZONE_HOTPLUGGABLE).

And then, when you run out of ZONE_HIGHMEM and have to reach into
ZONE_HOTPLUGGABLE for some pinned memory allocation, it seems the whole
concept leads to a messy train wreck.

gerrit

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:37 ` Nick Piggin
2005-11-02 10:17 ` Gerrit Huizenga
@ 2005-11-02 23:47 ` Rob Landley
2005-11-03 4:43 ` Nick Piggin
1 sibling, 1 reply; 241+ messages in thread
From: Rob Landley @ 2005-11-02 23:47 UTC (permalink / raw)
To: Nick Piggin
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wednesday 02 November 2005 03:37, Nick Piggin wrote:

> > So do you see the problem with fragmentation if the hypervisor is
> > handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> > pages? What are the chances that an entire 64 MB page can be freed
> > on a large system that has been up a while?
>
> I see the problem, but if you want to be able to shrink memory to a
> given size, then you must either introduce a hard limit somewhere, or
> have the hypervisor hand out guest sized pages. Use zones, or Xen?

In the UML case, I want the system to automatically be able to hand back
any sufficiently large chunks of memory it currently isn't using.

What does this have to do with specifying hard limits of anything? What's
to specify? Workloads vary. Deal with it.

> If there are zone rebalancing problems[*], then it would be great to
> have more users of zones because then they will be more likely to get
> fixed.

Ok, so you want to artificially turn this into a zone balancing issue in
hopes of giving that area of the code more testing when, if zones weren't
involved, there would be no need for balancing at all?

How does that make sense?

> [*] and there are, sadly enough - see the recent patches I posted to
> lkml for example.

I was under the impression that zone balancing is, conceptually speaking,
a difficult problem.

> But I'm fairly confident that once the particularly
> silly ones have been fixed,

Great, you're advocating migrating the fragmentation patches to an area of
code that has known problems you yourself describe as "particularly
silly". A ringing endorsement, that.

The fact that the migrated version wouldn't even address fragmentation
avoidance at all (the topic of this thread!) is apparently a side issue.

> zone balancing will no longer be a
> derogatory term as has been thrown around (maybe rightly) in this
> thread!

If I'm not mistaken, you introduced zones into this thread; you are the
primary (possibly only) proponent of them.

Yes, zones are a way of categorizing memory. They're not a way of
defragmenting it.

Rob

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 23:47 ` Rob Landley
@ 2005-11-03 4:43 ` Nick Piggin
2005-11-03 6:07 ` Rob Landley
0 siblings, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-03 4:43 UTC (permalink / raw)
To: Rob Landley
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

Rob Landley wrote:

> In the UML case, I want the system to automatically be able to hand back
> any sufficiently large chunks of memory it currently isn't using.
>

I'd just be happy with UML handing back page sized chunks of memory that
it isn't currently using. How does contiguous memory (in either the host
or the guest) help this?

> What does this have to do with specifying hard limits of anything? What's
> to specify? Workloads vary. Deal with it.
>

Umm, if you hadn't bothered to read the thread then I won't go through it
all again. The short of it is that if you want guaranteed unfragmented
memory you have to specify a limit.

>
>> If there are zone rebalancing problems[*], then it would be great to
>> have more users of zones because then they will be more likely to get
>> fixed.
>
> Ok, so you want to artificially turn this into a zone balancing issue in
> hopes of giving that area of the code more testing when, if zones weren't
> involved, there would be no need for balancing at all?
>
> How does that make sense?
>

Have you looked at the frag patches? Do you realise that they have to
balance between the different types of memory blocks? Duplicating the same
or similar infrastructure (in this case, a memory zoning facility) is a
bad thing in general.

>
>> [*] and there are, sadly enough - see the recent patches I posted to
>> lkml for example.
>
> I was under the impression that zone balancing is, conceptually speaking,
> a difficult problem.
>

I am under the impression that you think proper fragmentation avoidance
is easier.

>
>> But I'm fairly confident that once the particularly
>> silly ones have been fixed,
>
> Great, you're advocating migrating the fragmentation patches to an area
> of code that has known problems you yourself describe as "particularly
> silly". A ringing endorsement, that.
>

Err, the point is so we don't now have 2 layers doing very similar things,
at least one of which has "particularly silly" bugs in it.

> The fact that the migrated version wouldn't even address fragmentation
> avoidance at all (the topic of this thread!) is apparently a side issue.
>

Zones can be used to guarantee physically contiguous regions with exactly
the same effectiveness as the frag patches.

>
>> zone balancing will no longer be a
>> derogatory term as has been thrown around (maybe rightly) in this
>> thread!
>
> If I'm not mistaken, you introduced zones into this thread; you are the
> primary (possibly only) proponent of them.

So you didn't look at Yasunori Goto's patch from last year that implements
exactly what I described, then?

> Yes, zones are a way of categorizing memory.

Yes, have you read Mel's patches? Guess what they do?

> They're not a way of defragmenting it.

Guess what they don't?

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 4:43 ` Nick Piggin
@ 2005-11-03 6:07 ` Rob Landley
2005-11-03 7:34 ` Nick Piggin
2005-11-03 16:35 ` Jeff Dike
0 siblings, 2 replies; 241+ messages in thread
From: Rob Landley @ 2005-11-03 6:07 UTC (permalink / raw)
To: Nick Piggin
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wednesday 02 November 2005 22:43, Nick Piggin wrote:
> Rob Landley wrote:
> > In the UML case, I want the system to automatically be able to hand back
> > any sufficiently large chunks of memory it currently isn't using.
>
> I'd just be happy with UML handing back page sized chunks of memory that
> it isn't currently using. How does contiguous memory (in either the host
> or the guest) help this?

Smaller chunks of memory are likely to be reclaimed really soon, and adding
in the syscall overhead of working with individual pages of memory is
almost guaranteed to slow us down. Plus with punch, we'd be fragmenting the
heck out of the underlying file.

> > What does this have to do with specifying hard limits of anything?
> > What's to specify? Workloads vary. Deal with it.
>
> Umm, if you hadn't bothered to read the thread then I won't go through
> it all again. The short of it is that if you want guaranteed unfragmented
> memory you have to specify a limit.

I read it. It just didn't contain an answer to the question. I want UML to
be able to hand back however much memory it's not using, but handing back
individual pages as we free them and inserting a syscall overhead for every
page freed and allocated is just nuts. (Plus, at page size, the OS isn't
likely to zero them much faster than we can ourselves even without the
syscall overhead.) Defragmentation means we can batch this into a
granularity that makes it worth it.

This has nothing to do with hard limits on anything.

> Have you looked at the frag patches?

I've read Mel's various descriptions, and tried to stay more or less up to
date ever since LWN brought it to my attention. But I can't say I'm a Linux
VM system expert. (The last time I felt I had a really firm grasp on it was
before Andrea and Rik started arguing circa 2.4, and Andrea spent six
months just assuming everybody already knew what a classzone was. I've had
other things to do since then...)

> Do you realise that they have to
> balance between the different types of memory blocks?

I realise they merge them back together into larger chunks as they free up
space, and split larger chunks when they haven't got a smaller one.

> Duplicating the
> same or similar infrastructure (in this case, a memory zoning facility)
> is a bad thing in general.

Even when they keep track of very different things? The memory zoning thing
is about where stuff is in physical memory, and it exists because various
hardware that wants to access memory (24-bit DMA, 32-bit DMA, and PAE) is
evil and crippled and we have to humor it by not asking it to do stuff it
can't.

The fragmentation stuff is about what long contiguous runs of free memory
we can arrange, and it's also nice to be able to categorize them as
"zeroed" or "not zeroed" to make new allocations faster. Where they
actually are in memory is not at issue here. You can have prezeroed memory
in 32-bit DMA space, and prezeroed memory in highmem, but there's memory in
both that isn't prezeroed.

I thought there was a hierarchy of zones. You want overlapping, interlaced,
randomly laid out zones.

> >> [*] and there are, sadly enough - see the recent patches I posted to
> >> lkml for example.
> >
> > I was under the impression that zone balancing is, conceptually
> > speaking, a difficult problem.
>
> I am under the impression that you think proper fragmentation avoidance
> is easier.

I was under the impression it was orthogonal to figuring out whether or not
a given bank of physical memory is accessible to your sound blaster without
an IOMMU.

> >> But I'm fairly confident that once the particularly
> >> silly ones have been fixed,
> >
> > Great, you're advocating migrating the fragmentation patches to an area
> > of code that has known problems you yourself describe as "particularly
> > silly". A ringing endorsement, that.
>
> Err, the point is so we don't now have 2 layers doing very similar
> things, at least one of which has "particularly silly" bugs in it.

Similar is not identical. You seem to be implying that the IO elevator and
the network stack queueing should be merged because they do similar things.

> > The fact that the migrated version wouldn't even address fragmentation
> > avoidance at all (the topic of this thread!) is apparently a side issue.
>
> Zones can be used to guarantee physically contiguous regions with exactly
> the same effectiveness as the frag patches.

If you'd like to write a counter-patch to Mel's to prove it...

> >> zone balancing will no longer be a
> >> derogatory term as has been thrown around (maybe rightly) in this
> >> thread!
> >
> > If I'm not mistaken, you introduced zones into this thread; you are the
> > primary (possibly only) proponent of them.
>
> So you didn't look at Yasunori Goto's patch from last year that implements
> exactly what I described, then?

I saw the patch he just posted, if that's what you mean. By his own
admission, it doesn't address fragmentation at all.

> > Yes, zones are a way of categorizing memory.
>
> Yes, have you read Mel's patches? Guess what they do?

The swap file is a way of storing data on disk. So is ext3. Obviously, one
is a trivial extension of the other and there's no reason to have both.

> > They're not a way of defragmenting it.
>
> Guess what they don't?

I have no idea what you intended to mean by that. Mel posted a set of
patches in a thread titled "fragmentation avoidance", and you've been
arguing about hotplug, and pointing to a set of patches from Goto that do
not address fragmentation at all. This confuses me.

Rob

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 6:07 ` Rob Landley
@ 2005-11-03 7:34 ` Nick Piggin
2005-11-03 17:54 ` Rob Landley
0 siblings, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-03 7:34 UTC (permalink / raw)
To: Rob Landley
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

Rob Landley wrote:
> On Wednesday 02 November 2005 22:43, Nick Piggin wrote:
>
>> I'd just be happy with UML handing back page sized chunks of memory that
>> it isn't currently using. How does contiguous memory (in either the host
>> or the guest) help this?
>
> Smaller chunks of memory are likely to be reclaimed really soon, and
> adding in the syscall overhead of working with individual pages of memory
> is almost guaranteed to slow us down.

Because UML doesn't already make a syscall per individual page of memory
freed? (If I read correctly)

> Plus with punch, we'd be fragmenting the heck
> out of the underlying file.
>

Why? No you wouldn't.

>
>> > What does this have to do with specifying hard limits of anything?
>> > What's to specify? Workloads vary. Deal with it.
>>
>> Umm, if you hadn't bothered to read the thread then I won't go through
>> it all again. The short of it is that if you want guaranteed unfragmented
>> memory you have to specify a limit.
>
> I read it. It just didn't contain an answer to the question. I want UML
> to be able to hand back however much memory it's not using, but handing
> back individual pages as we free them and inserting a syscall overhead
> for every page freed and allocated is just nuts. (Plus, at page size, the
> OS isn't likely to zero them much faster than we can ourselves even
> without the syscall overhead.) Defragmentation means we can batch this
> into a granularity that makes it worth it.
>

Oh, you have measured it and found out that "defragmentation" makes it
worthwhile?

> This has nothing to do with hard limits on anything.
>

You said:

"What does this have to do with specifying hard limits of anything?
What's to specify? Workloads vary. Deal with it."

And I was answering your very polite questions.

>
>> Have you looked at the frag patches?
>
> I've read Mel's various descriptions, and tried to stay more or less up
> to date ever since LWN brought it to my attention. But I can't say I'm a
> Linux VM system expert. (The last time I felt I had a really firm grasp
> on it was before Andrea and Rik started arguing circa 2.4, and Andrea
> spent six months just assuming everybody already knew what a classzone
> was. I've had other things to do since then...)
>

Maybe you have better things to do now as well?

>> Duplicating the
>> same or similar infrastructure (in this case, a memory zoning facility)
>> is a bad thing in general.
>
> Even when they keep track of very different things? The memory zoning
> thing is about where stuff is in physical memory, and it exists because
> various hardware that wants to access memory (24-bit DMA, 32-bit DMA, and
> PAE) is evil and crippled and we have to humor it by not asking it to do
> stuff it can't.
>

No, the buddy allocator is and always has been what tracks the "long
contiguous runs of free memory". Both zones and Mel's patches classify
blocks of memory according to some criteria. They're not exactly the same
obviously, but they're equivalent in terms of capability to guarantee
contiguous freeable regions.

>
> I was under the impression it was orthogonal to figuring out whether or
> not a given bank of physical memory is accessible to your sound blaster
> without an IOMMU.
>

Huh?

>> Err, the point is so we don't now have 2 layers doing very similar
>> things, at least one of which has "particularly silly" bugs in it.
>
> Similar is not identical. You seem to be implying that the IO elevator
> and the network stack queueing should be merged because they do similar
> things.
>

No I don't.

>
> If you'd like to write a counter-patch to Mel's to prove it...
>

It has already been written, as you have been told numerous times. Now if
you'd like to actually learn about what you're commenting on, that would
be really good too.

>> So you didn't look at Yasunori Goto's patch from last year that
>> implements exactly what I described, then?
>
> I saw the patch he just posted, if that's what you mean. By his own
> admission, it doesn't address fragmentation at all.
>

It seems to me that it provides exactly the same (actually stronger)
guarantees as the current frag patches do. Or were you going to point out
a bug in the implementation?

>
>> > Yes, zones are a way of categorizing memory.
>>
>> Yes, have you read Mel's patches? Guess what they do?
>
> The swap file is a way of storing data on disk. So is ext3. Obviously,
> one is a trivial extension of the other and there's no reason to have
> both.
>

Don't try to bullshit your way around with stupid analogies please, it is
an utter waste of time.

>
>> > They're not a way of defragmenting it.
>>
>> Guess what they don't?
>
> I have no idea what you intended to mean by that. Mel posted a set of
> patches

What I mean is that Mel's patches aren't a way of defragmenting memory
either. They fit exactly the description you gave for zones (ie. a way of
categorizing, not defragmenting).

> in a thread titled "fragmentation avoidance", and you've been arguing
> about hotplug, and pointing to a set of patches from Goto that do not
> address fragmentation at all. This confuses me.
>

Yeah, it does seem like you are confused. Now let's finish up this
subthread and try to keep the S/N ratio up, please? I'm sure Jeff or
someone knowledgeable in the area can chime in if there are concerns about
UML.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 7:34 ` Nick Piggin @ 2005-11-03 17:54 ` Rob Landley 2005-11-03 20:13 ` Jeff Dike 0 siblings, 1 reply; 241+ messages in thread From: Rob Landley @ 2005-11-03 17:54 UTC (permalink / raw) To: Nick Piggin Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Thursday 03 November 2005 01:34, Nick Piggin wrote: > Rob Landley wrote: > > On Wednesday 02 November 2005 22:43, Nick Piggin wrote: > >>I'd just be happy with UML handing back page sized chunks of memory that > >>it isn't currently using. How does contiguous memory (in either the host > >>or the guest) help this? > > > > Smaller chunks of memory are likely to be reclaimed really soon, and > > adding in the syscall overhead working with individual pages of memory is > > almost guaranteed to slow us down. > > Because UML doesn't already make a syscall per individual page of > memory freed? (If I read correctly) UML does a big mmap to get "physical" memory, and then manages itself using the normal Linux kernel mechanisms for doing so. We even have page tables, although I'm still somewhat unclear on quite how that works. > > Plus with punch, we'd be fragmenting the heck > > out of the underlying file. > > Why? No you wouldn't. Creating holes in the file and freeing up the underlying blocks on disk? 4k at a time? Randomly scattered? > > I read it. It just didn't contain an answer the the question. I want > > UML to be able to hand back however much memory it's not using, but > > handing back individual pages as we free them and inserting a syscall > > overhead for every page freed and allocated is just nuts. (Plus, at page > > size, the OS isn't likely to zero them much faster than we can ourselves > > even without the syscall overhead.) Defragmentation means we can batch > > this into a granularity that makes it worth it. > > Oh you have measured it and found out that "defragmentation" makes > it worthwhile? Lots of work has gone into batching up syscalls and making as few of them as possible because they are a performance bottleneck. You want to introduce a syscall for every single individual page of memory allocated or freed. That's stupid. > > This has nothing to do with hard limits on anything. > > You said: > > "What does this have to do with specifying hard limits of > anything? What's to specify? Workloads vary. Deal with it." > > And I was answering your very polite questions. You didn't answer. You keep saying you've already answered, but there continues to be no answer. Maybe you think you've answered, but I haven't seen it yet. You brought up hard limits, I asked what that had to do with anything, and in response you quote my question back at me. > >>Have you looked at the frag patches? > > > > I've read Mel's various descriptions, and tried to stay more or less up > > to date ever since LWN brought it to my attention. But I can't say I'm a > > linux VM system expert. (The last time I felt I had a really firm grasp > > on it was before Andrea and Rik started arguing circa 2.4 and Andrea > > spent six months just assuming everybody already knew what a classzone > > was. I've had other things to do since then...) > > Maybe you have better things to do now as well? Yeah, thanks for reminding me. I need to test Mel's newest round of fragmentation avoidance patches in my UML build system... 
> >> Duplicating the
> >> same or similar infrastructure (in this case, a memory zoning facility)
> >> is a bad thing in general.
> >
> > Even when they keep track of very different things? The memory zoning
> > thing is about where stuff is in physical memory, and it exists because
> > various hardware that wants to access memory (24 bit DMA, 32 bit DMA, and
> > PAE) is evil and crippled and we have to humor it by not asking it to do
> > stuff it can't.
>
> No, the buddy allocator is and always has been what tracks the "long
> contiguous runs of free memory".

We are still discussing fragmentation avoidance, right? (I know _I'm_ trying
to...)

> Both zones and Mel's patches classify blocks of memory according to some
> criteria. They're not exactly the same obviously, but they're equivalent in
> terms of capability to guarantee contiguous freeable regions.

Back up. I don't care _where_ the freeable regions are. I just want them
coalesced. Zones are all about _where_ the memory is. I'm pretty sure we're
arguing past each other.

> > I was under the impression it was orthogonal to figuring out whether or
> > not a given bank of physical memory is accessable to your sound blaster
> > without an IOMMU.
>
> Huh?

Fragmentation avoidance is what is orthogonal to...

> >> Err, the point is so we don't now have 2 layers doing very similar
> >> things, at least one of which has "particularly silly" bugs in it.
> >
> > Similar is not identical. You seem to be implying that the IO elevator
> > and the network stack queueing should be merged because they do similar
> > things.
>
> No I don't.

They're similar though, aren't they? Why should we have different code in
there to do both? (I know why, but that's what your argument sounds like to
me.)

> > If you'd like to write a counter-patch to Mel's to prove it...
>
> It has already been written, as you have been told numerous times.

Quoting Yasunori Goto, yesterday at 2:33 pm,
Message-Id: <20051102172729.9E7C.Y-GOTO@jp.fujitsu.com>

> Hmmm. I don't see at this point.
> Why do you think ZONE_REMOVABLE can satisfy for hugepage.
> At leaset, my ZONE_REMOVABLE patch doesn't any concern about
> fragmentation.

He's NOT ADDRESSING FRAGMENTATION. So unless you're talking about some OTHER
patch, we're talking past each other again.

> Now if you'd like to actually learn about what you're commenting on,
> that would be really good too.

The feeling is mutual.

> >> So you didn't look at Yasunori Goto's patch from last year that
> >> implements exactly what I described, then?
> >
> > I saw the patch he just posted, if that's what you mean. By his own
> > admission, it doesn't address fragmentation at all.
>
> It seems to me that it provides exactly the same (actually stronger)
> guarantees as the current frag patches do. Or were you going to point
> out a bug in the implementation?

No, I'm going to point out that the author of the patch contradicts you.

> >>> Yes, zones are a way of categorizing memory.
> >>
> >> Yes, have you read Mel's patches? Guess what they do?
> >
> > The swap file is a way of storing data on disk. So is ext3. Obviously,
> > one is a trivial extension of the other and there's no reason to have
> > both.
>
> Don't try to bullshit your way around with stupid analogies please, it
> is an utter waste of time.

I agree that this conversation is a waste of time, and will stop trying to
reason with you now.

Rob

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 17:54 ` Rob Landley @ 2005-11-03 20:13 ` Jeff Dike 0 siblings, 0 replies; 241+ messages in thread From: Jeff Dike @ 2005-11-03 20:13 UTC (permalink / raw) To: Rob Landley Cc: Nick Piggin, Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Thu, Nov 03, 2005 at 11:54:10AM -0600, Rob Landley wrote: > Lots of work has gone into batching up syscalls and making as few of them as > possible because they are a performance bottleneck. You want to introduce a > syscall for every single individual page of memory allocated or freed. > > That's stupid. I think what I'm optimizing is TLB flushes, not system calls. With mmap et al, they are effectively the same thing though. Jeff ^ permalink raw reply [flat|nested] 241+ messages in thread
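Jeff's point - that with mmap et al the syscall and the TLB flush amount to the same thing - is also why the batching Rob argues for wins. A minimal userspace sketch, with made-up sizes (an illustration of the idea, not UML code): returning a defragmented 4MB run to the host is one madvise(MADV_DONTNEED) call over the whole range, versus one call (and one flush) per page if pages go back as they are freed.

#include <stdio.h>
#include <sys/mman.h>

#define PAGE_SZ   4096UL
#define NR_PAGES  (16 * 1024)	/* 64MB of guest "physical" RAM, made up */

int main(void)
{
	/* Stand-in for UML's big anonymous mapping of guest memory. */
	char *physmem = mmap(NULL, NR_PAGES * PAGE_SZ, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (physmem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Unbatched: one syscall per freed page. */
	for (unsigned long i = 0; i < 1024; i++)
		madvise(physmem + i * PAGE_SZ, PAGE_SZ, MADV_DONTNEED);

	/* Batched: the same 4MB, coalesced, goes back in a single call. */
	madvise(physmem, 1024 * PAGE_SZ, MADV_DONTNEED);
	return 0;
}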
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 6:07 ` Rob Landley 2005-11-03 7:34 ` Nick Piggin @ 2005-11-03 16:35 ` Jeff Dike 2005-11-03 16:23 ` Badari Pulavarty 1 sibling, 1 reply; 241+ messages in thread From: Jeff Dike @ 2005-11-03 16:35 UTC (permalink / raw) To: Rob Landley Cc: Nick Piggin, Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Thu, Nov 03, 2005 at 12:07:33AM -0600, Rob Landley wrote: > I want UML to > be able to hand back however much memory it's not using, but handing back > individual pages as we free them and inserting a syscall overhead for every > page freed and allocated is just nuts. (Plus, at page size, the OS isn't > likely to zero them much faster than we can ourselves even without the > syscall overhead.) Defragmentation means we can batch this into a > granularity that makes it worth it. I don't think that freeing pages back to the host in free_pages is the way to go. The normal behavior for a Linux system, virtual or physical, is to use all the memory it has. So, any memory that's freed is pretty likely to be reused for something else, wasting any effort that's made to free pages back to the host. The one counter-example I can think of is when a large process with a lot of data exits. Then its data pages will be freed and they may stay free for a while until the system finds other data to fill them with. Also, it's not the virtual machine's job to know how to make the host perform optimally. It doesn't have the information to do it. It's perfectly OK for a UML to hang on to memory if the host has plenty free. So, it's the host's job to make sure that its memory pressure is reflected to the UMLs. My current thinking is that you'll have a daemon on the host keeping track of memory pressure on the host and the UMLs, plugging and unplugging memory in order to keep the busy machines, including the host, supplied with memory, and periodically pushing down the memory of idle UMLs in order to force them to GC their page caches. With Badari's patch and UML memory hotplug, the infrastructure is there to make this work. The one thing I'm puzzling over right now is how to measure memory pressure. Jeff ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 16:35 ` Jeff Dike
@ 2005-11-03 16:23   ` Badari Pulavarty
  2005-11-03 18:27     ` Jeff Dike
                       ` (2 more replies)
  0 siblings, 3 replies; 241+ messages in thread
From: Badari Pulavarty @ 2005-11-03 16:23 UTC (permalink / raw)
To: Jeff Dike
Cc: Rob Landley, Nick Piggin, Gerrit Huizenga, Ingo Molnar,
    Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh,
    Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Thu, 2005-11-03 at 11:35 -0500, Jeff Dike wrote:
> On Thu, Nov 03, 2005 at 12:07:33AM -0600, Rob Landley wrote:
> > I want UML to
> > be able to hand back however much memory it's not using, but handing back
> > individual pages as we free them and inserting a syscall overhead for every
> > page freed and allocated is just nuts.  (Plus, at page size, the OS isn't
> > likely to zero them much faster than we can ourselves even without the
> > syscall overhead.)  Defragmentation means we can batch this into a
> > granularity that makes it worth it.
>
> I don't think that freeing pages back to the host in free_pages is the
> way to go.  The normal behavior for a Linux system, virtual or
> physical, is to use all the memory it has.  So, any memory that's
> freed is pretty likely to be reused for something else, wasting any
> effort that's made to free pages back to the host.
>
> The one counter-example I can think of is when a large process with a
> lot of data exits.  Then its data pages will be freed and they may
> stay free for a while until the system finds other data to fill them
> with.
>
> Also, it's not the virtual machine's job to know how to make the host
> perform optimally.  It doesn't have the information to do it.  It's
> perfectly OK for a UML to hang on to memory if the host has plenty
> free.  So, it's the host's job to make sure that its memory pressure
> is reflected to the UMLs.
>
> My current thinking is that you'll have a daemon on the host keeping
> track of memory pressure on the host and the UMLs, plugging and
> unplugging memory in order to keep the busy machines, including the
> host, supplied with memory, and periodically pushing down the memory
> of idle UMLs in order to force them to GC their page caches.
>
> With Badari's patch and UML memory hotplug, the infrastructure is
> there to make this work.  The one thing I'm puzzling over right now is
> how to measure memory pressure.

Yep.  This is exactly the issue other product groups normally raise
on Linux: how do we measure memory pressure in Linux?  Some of our
software products want to grow or shrink their memory usage depending
on the memory pressure in the system.  Since most memory is used for
cache, "free" really doesn't indicate anything - they are monitoring
info in /proc/meminfo and swapping rates to "guess" at the memory
pressure.  They want a clear way of finding out "how badly" the
system is under memory pressure.  (As a starting point, they want to
find out how much of "cached" memory is really easily "reclaimable"
under memory pressure, without swapping.)  I know this is kind of
crazy, but interesting to think about :)

Thanks,
Badari

^ permalink raw reply [flat|nested] 241+ messages in thread
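A sketch of the kind of guessing Badari describes, using only what /proc/meminfo exports. The heuristic here (free + buffers + page cache) is deliberately crude - dirty or heavily mapped cache is not freeable without I/O - which is precisely the complaint: userspace cannot do much better with the information currently available.

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char line[128];
	unsigned long v, memfree = 0, buffers = 0, cached = 0;

	if (!f) {
		perror("/proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "MemFree: %lu", &v) == 1)
			memfree = v;
		else if (sscanf(line, "Buffers: %lu", &v) == 1)
			buffers = v;
		else if (sscanf(line, "Cached: %lu", &v) == 1)
			cached = v;
	}
	fclose(f);

	/* Overcounts what is reclaimable; a real monitor also watches
	 * swap rates to correct its guess, as described above. */
	printf("guessed reclaimable without swapping: ~%lu kB\n",
	       memfree + buffers + cached);
	return 0;
}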
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 16:23   ` Badari Pulavarty
@ 2005-11-03 18:27     ` Jeff Dike
  0 siblings, 0 replies; 241+ messages in thread
From: Jeff Dike @ 2005-11-03 18:27 UTC (permalink / raw)
To: Badari Pulavarty
Cc: Rob Landley, Nick Piggin, Gerrit Huizenga, Ingo Molnar,
    Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh,
    Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Thu, Nov 03, 2005 at 08:23:20AM -0800, Badari Pulavarty wrote:
> Yep.  This is exactly the issue other product groups normally raise
> on Linux: how do we measure memory pressure in Linux?  Some of our
> software products want to grow or shrink their memory usage depending
> on the memory pressure in the system.

I think this is wrong.  Applications shouldn't be measuring host
memory pressure and trying to react to it.

This gives you no way to implement a global memory use policy - you
can't say "App X is the most important thing on the system and must
have all the memory it needs in order to run as quickly as possible".
You can't establish any sort of priority between apps when it comes
to memory use, or change those priorities.  And how does this work
when the system can change the amount of memory that it has, such as
when the app is inside a UML?

I think the right way to go is for willing apps to have an interface
through which they can be told "change your memory consumption by
+-X" and have a single daemon on the host tracking memory use and
memory pressure, and shuffling memory between the apps.

This allows the admin to set memory use priorities between the apps
and to exempt important ones from having memory pulled.

Measuring at the bottom and pushing memory pressure upwards also
works naturally for virtual machines and the apps running inside
them.  The host will push memory pressure at the virtual machines,
which in turn will push that pressure at their apps.

With UML, I have an interface where a daemon on the host can add or
remove memory from an instance.  I think the apps that are willing to
adjust should implement something similar.

Jeff

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 16:23 ` Badari Pulavarty 2005-11-03 18:27 ` Jeff Dike @ 2005-11-03 18:49 ` Rob Landley 2005-11-04 4:52 ` Andrew Morton 2 siblings, 0 replies; 241+ messages in thread From: Rob Landley @ 2005-11-03 18:49 UTC (permalink / raw) To: Badari Pulavarty Cc: Jeff Dike, Nick Piggin, Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Thursday 03 November 2005 10:23, Badari Pulavarty wrote: > Yep. This is the exactly the issue other product groups normally raise > on Linux. How do we measure memory pressure in linux ? Some of our > software products want to grow or shrink their memory usage depending > on the memory pressure in the system. Since most memory is used for > cache, "free" really doesn't indicate anything -they are monitoring > info in /proc/meminfo and swapping rates to "guess" on the memory > pressure. They want a clear way of finding out "how badly" system > is under memory pressure. (As a starting point, they want to find out > out of "cached" memory - how much is really easily "reclaimable" > under memory pressure - without swapping). I know this is kind of > crazy, but interesting to think about :) If we do ever get prezeroing, we'd want a tuneable to say how much memory should be spent on random page cache and how much should be prezeroed. And large chunks of prezeroed memory lying around are what you'd think about handing back to the host OS... Rob ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 16:23 ` Badari Pulavarty 2005-11-03 18:27 ` Jeff Dike 2005-11-03 18:49 ` Rob Landley @ 2005-11-04 4:52 ` Andrew Morton 2005-11-04 5:35 ` Paul Jackson 2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar 2 siblings, 2 replies; 241+ messages in thread From: Andrew Morton @ 2005-11-04 4:52 UTC (permalink / raw) To: Badari Pulavarty Cc: jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Badari Pulavarty <pbadari@gmail.com> wrote: > > > With Badari's patch and UML memory hotplug, the infrastructure is > > there to make this work. The one thing I'm puzzling over right now is > > how to measure memory pressure. > > Yep. This is the exactly the issue other product groups normally raise > on Linux. How do we measure memory pressure in linux ? Some of our > software products want to grow or shrink their memory usage depending > on the memory pressure in the system. Since most memory is used for > cache, "free" really doesn't indicate anything -they are monitoring > info in /proc/meminfo and swapping rates to "guess" on the memory > pressure. They want a clear way of finding out "how badly" system > is under memory pressure. (As a starting point, they want to find out > out of "cached" memory - how much is really easily "reclaimable" > under memory pressure - without swapping). I know this is kind of > crazy, but interesting to think about :) Similarly, that SGI patch which was rejected 6-12 months ago to kill off processes once they started swapping. We thought that it could be done from userspace, but we need a way for userspace to detect when a task is being swapped on a per-task basis. I'm thinking a few numbers in the mm_struct, incremented in the pageout code, reported via /proc/stat. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 4:52 ` Andrew Morton @ 2005-11-04 5:35 ` Paul Jackson 2005-11-04 5:48 ` Andrew Morton 2005-11-04 6:16 ` Bron Nelson 2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar 1 sibling, 2 replies; 241+ messages in thread From: Paul Jackson @ 2005-11-04 5:35 UTC (permalink / raw) To: Andrew Morton Cc: pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel > Similarly, that SGI patch which was rejected 6-12 months ago to kill off > processes once they started swapping. We thought that it could be done > from userspace, but we need a way for userspace to detect when a task is > being swapped on a per-task basis. > > I'm thinking a few numbers in the mm_struct, incremented in the pageout > code, reported via /proc/stat. I just sent in a proposed patch for this - one more per-cpuset number, tracking the recent rate of calls into the synchronous (direct) page reclaim by tasks in the cpuset. See the message sent a few minutes ago, with subject: [PATCH 5/5] cpuset: memory reclaim rate meter -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 5:35 ` Paul Jackson @ 2005-11-04 5:48 ` Andrew Morton 2005-11-04 6:42 ` Paul Jackson 2005-11-04 6:16 ` Bron Nelson 1 sibling, 1 reply; 241+ messages in thread From: Andrew Morton @ 2005-11-04 5:48 UTC (permalink / raw) To: Paul Jackson, Bron Nelson Cc: pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Paul Jackson <pj@sgi.com> wrote: > > > Similarly, that SGI patch which was rejected 6-12 months ago to kill off > > processes once they started swapping. We thought that it could be done > > from userspace, but we need a way for userspace to detect when a task is > > being swapped on a per-task basis. > > > > I'm thinking a few numbers in the mm_struct, incremented in the pageout > > code, reported via /proc/stat. > > I just sent in a proposed patch for this - one more per-cpuset > number, tracking the recent rate of calls into the synchronous > (direct) page reclaim by tasks in the cpuset. > > See the message sent a few minutes ago, with subject: > > [PATCH 5/5] cpuset: memory reclaim rate meter > uh, OK. If that patch is merged, does that make Bron happy, so I don't have to reply to his plaintive email? I was kind of thinking that the stats should be per-process (actually per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  5:48 ` Andrew Morton
@ 2005-11-04  6:42   ` Paul Jackson
  2005-11-04  7:10     ` Andrew Morton
  0 siblings, 1 reply; 241+ messages in thread
From: Paul Jackson @ 2005-11-04 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu,
    haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel

Andrew wrote:
> uh, OK.  If that patch is merged, does that make Bron happy, so I don't
> have to reply to his plaintive email?

In theory yes, that should do it.  I will ack again, by early next
week, after I have verified this further.

And it should also handle some other folks who have plaintive emails
in my inbox, that haven't gotten bold enough to pester you, yet.

For the users who know my email address (*), it really is job-based
memory pressure, not task-based, that matters.  Sticking it in a
cpuset, which is the natural job container, is easier, more natural,
and more efficient for all concerned.

It's jobs that are being run in cpusets with dedicated (not shared)
CPUs and Memory Nodes that care about this, so far as I know.

When running a system in a more typical sharing mode, with multiple
jobs and applications competing for the same resources, then the
kernel needs to be master of processor scheduling and memory
allocation.

When running jobs in cpusets with dedicated CPUs and Memory Nodes,
then less is being asked of the kernel, and some per-job controls
from userspace make more sense.  This is where a simple hook like
this reclaim rate meter comes into play - passing up to user space
another clue to help it do its job.

> I was kind of thinking that the stats should be per-process (actually
> per-mm) rather than bound to cpusets.  /proc/<pid>/pageout-stats or something.

There may well be a market for these too.  But such stats sound like
more work, and the market isn't one that's paying my salary.  So I
will leave that challenge on the table for someone else.

(*) Of course, there is some self selection going on here.  Folks not
    doing cpuset-based jobs are far less likely to know my email
    address ;).

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  6:42   ` Paul Jackson
@ 2005-11-04  7:10     ` Andrew Morton
  2005-11-04  7:45       ` Paul Jackson
  2005-11-04 15:19       ` Martin J. Bligh
  0 siblings, 2 replies; 241+ messages in thread
From: Andrew Morton @ 2005-11-04 7:10 UTC (permalink / raw)
To: Paul Jackson
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu,
    haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel

Paul Jackson <pj@sgi.com> wrote:
>
> > I was kind of thinking that the stats should be per-process (actually
> > per-mm) rather than bound to cpusets.  /proc/<pid>/pageout-stats or something.
>
> There may well be a market for these too.  But such stats sound like
> more work, and the market isn't one that's paying my salary.

But I have to care for all users.

> So I will leave that challenge on the table for someone else.

And I won't merge your patch ;)

Seriously, it does appear that doing it per-task is adequate for your
needs, and it is certainly more general.

I cannot understand why you decided to count only the number of
direct-reclaim events, via a "digitally filtered, constant time based,
event frequency meter".

a) It loses information.  If we were to export the number of pages
   reclaimed from the mm, filtering can be done in userspace.

b) It omits reclaim performed by kswapd and by other tasks (ok, it's
   very cpuset-specific).

c) It only counts synchronous try_to_free_pages() attempts.  What if
   an attempt only freed pagecache, or didn't manage to free anything?

d) It doesn't notice if kswapd is swapping the heck out of your
   not-allocating-any-memory-now process.

I think all the above can be addressed by exporting per-task (actually
per-mm) reclaim info.  (I haven't put much thought into what info that
should be - page reclaim attempts, mmapped reclaims, swapcache
reclaims, etc.)

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 7:10 ` Andrew Morton @ 2005-11-04 7:45 ` Paul Jackson 2005-11-04 8:02 ` Andrew Morton 2005-11-04 15:19 ` Martin J. Bligh 1 sibling, 1 reply; 241+ messages in thread From: Paul Jackson @ 2005-11-04 7:45 UTC (permalink / raw) To: Andrew Morton Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Andrew wrote: > > So I will leave that challenge on the table for someone else. > > And I won't merge your patch ;) Be that way ;). > Seriously, it does appear that doing it per-task is adequate for your > needs, and it is certainly more general. My motivations for the per-cpuset, digitally filtered rate, as opposed to the per-task raw counter mostly have to do with minimizing total cost (user + kernel) of collecting this information. I have this phobia, perhaps not well founded, that moving critical scheduling/allocation decisions like this into user space will fail in some cases because the cost of gathering the critical information will be too intrusive on system performance and scalability. A per-task stat requires walking the tasklist, to build a list of the tasks to query. A raw counter requires repeated polling to determine the recent rate of activity. The filtered per-cpuset rate avoids any need to repeatedly access global resources such as the tasklist, and minimizes the total cpu cycles required to get the interesting stat. > But I have to care for all users. Well you should, and well you do. If you have good reason, or just good instincts, to think that there are uses for per-task raw counters, then your choice is clear. As indeed it was clear. I don't recall hearing of any desire for per-task memory pressure data, until tonight. I will miss this patch. It had provided exactly what I thought was needed, with an extremely small impact on system (kern+user) performance. Oh well. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 7:45 ` Paul Jackson @ 2005-11-04 8:02 ` Andrew Morton 2005-11-04 9:52 ` Paul Jackson 0 siblings, 1 reply; 241+ messages in thread From: Andrew Morton @ 2005-11-04 8:02 UTC (permalink / raw) To: Paul Jackson Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Paul Jackson <pj@sgi.com> wrote: > > A per-task stat requires walking the tasklist, to build a list of the > tasks to query. Nope, just task->mm->whatever. > A raw counter requires repeated polling to determine the recent rate of > activity. True. > The filtered per-cpuset rate avoids any need to repeatedly access > global resources such as the tasklist, and minimizes the total cpu > cycles required to get the interesting stat. > Well no. Because the filtered-whatsit takes two spinlocks and does a bunch of arith for each and every task, each time it calls try_to_free_pages(). The frequency of that could be very high indeed, even when nobody is interested in the metric which is being maintained(!). And I'd suggest that only a minority of workloads would be interested in this metric? ergo, polling the thing once per five seconds in those situations where we actually want to poll the thing may well be cheaper, in global terms? ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  8:02 ` Andrew Morton
@ 2005-11-04  9:52   ` Paul Jackson
  2005-11-04 15:27     ` Martin J. Bligh
  0 siblings, 1 reply; 241+ messages in thread
From: Paul Jackson @ 2005-11-04 9:52 UTC (permalink / raw)
To: Andrew Morton
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu,
    haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel

> > A per-task stat requires walking the tasklist, to build a list of the
> > tasks to query.
>
> Nope, just task->mm->whatever.

Nope.  Agreed - once you have the task, then sure, that's enough.

However - a batch scheduler will end up having to figure out what
tasks there are to inquire about, by either listing the tasks in a
cpuset, or by listing /proc.  Either way, that's a tasklist scan.
And it will have to do that pretty much every iteration of polling,
since it has no a priori knowledge of what tasks a job is firing up.

> Well no.  Because the filtered-whatsit takes two spinlocks and does a bunch
> of arith for each and every task, each time it calls try_to_free_pages().

Neither spinlock is global - the task and a lock in its cpuset.  I
see a fair number of existing locks and semaphores, some global and
some in loops, that look to be in the code invoked by
try_to_free_pages().  And far more arithmetic than in that little
filter.

Granted, its cost is seen by all, for the benefit of few.  But other
sorts of per-task or per-mm stats are not going to be free either.  I
would have figured that doing something per-page, even the most
trivial "counter++" (better have that mm locked), will likely cost
more than doing something per try_to_free_pages() call.

> The frequency of that could be very high indeed, even when nobody is
> interested in the metric which is being maintained(!)

When I have a task start allocating memory as fast as it can, it is
only able to call try_to_free_pages() about 10 times a second on an
idle ia64 SN2 system with a single thread, or about 20 times a second
running several threads at once allocating memory.  That's not "very
high" in my book.  What sort of load would hit this much more often?

If more folks need these detailed stats, then that's how it should
be.  But I am no fan of exposing more than the minimum kernel vm
details for use by production software.

We agree that my per-cpuset memory_reclaim_rate meter certainly hides
more detail than the sorts of stats you are suggesting.  I thought
that was good, so long as what was needed was still present.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply [flat|nested] 241+ messages in thread
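For reference, a "digitally filtered, constant time based, event frequency meter" is, generically, an exponentially decaying events-per-second estimate. A standalone sketch of the idea - not the cpuset patch itself, and the time constant is arbitrary:

#include <stdio.h>

#define TAU 10.0	/* filter memory, in seconds (arbitrary) */

struct freq_meter {
	double rate;	/* filtered events/sec */
	double last;	/* timestamp of previous event */
};

/* Each event blends the instantaneous rate 1/dt into the running
 * estimate, weighted by how much of TAU has elapsed since last time. */
static void meter_event(struct freq_meter *m, double now)
{
	double dt = now - m->last;
	double w;

	if (dt <= 0)
		dt = 1e-6;	/* clamp near-simultaneous events */
	w = dt / TAU;
	if (w > 1.0)
		w = 1.0;
	m->rate = (1.0 - w) * m->rate + w * (1.0 / dt);
	m->last = now;
}

int main(void)
{
	struct freq_meter m = { 0.0, 0.0 };

	/* Simulate one reclaim event every 100ms for 5 seconds; the
	 * estimate climbs toward 10 events/sec. */
	for (int i = 1; i <= 50; i++)
		meter_event(&m, i * 0.1);
	printf("filtered rate: %.1f events/sec\n", m.rate);
	return 0;
}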
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 9:52 ` Paul Jackson @ 2005-11-04 15:27 ` Martin J. Bligh 0 siblings, 0 replies; 241+ messages in thread From: Martin J. Bligh @ 2005-11-04 15:27 UTC (permalink / raw) To: Paul Jackson, Andrew Morton Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel, lhms-devel > We agree that my per-cpuset memory_reclaim_rate meter certainly hides > more detail than the sorts of stats you are suggesting. I thought that > was good, so long as what was needed was still present. But it's horribly specific to cpusets. If you want something multi-task, would be better if it worked by more generic task groupings. M. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 7:10 ` Andrew Morton 2005-11-04 7:45 ` Paul Jackson @ 2005-11-04 15:19 ` Martin J. Bligh 2005-11-04 17:38 ` Andrew Morton 1 sibling, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-04 15:19 UTC (permalink / raw) To: Andrew Morton, Paul Jackson Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel, lhms-devel > Seriously, it does appear that doing it per-task is adequate for your > needs, and it is certainly more general. > > > > I cannot understand why you decided to count only the number of > direct-reclaim events, via a "digitally filtered, constant time based, > event frequency meter". > > a) It loses information. If we were to export the number of pages > reclaimed from the mm, filtering can be done in userspace. > > b) It omits reclaim performed by kswapd and by other tasks (ok, it's > very cpuset-specific). > > c) It only counts synchronous try_to_free_pages() attempts. What if an > attempt only freed pagecache, or didbn't manage to free anything? > > d) It doesn't notice if kswapd is swapping the heck out of your > not-allocating-any-memory-now process. > > > I think all the above can be addressed by exporting per-task (actually > per-mm) reclaim info. (I haven't put much though into what info that > should be - page reclaim attempts, mmapped reclaims, swapcache reclaims, > etc) I've been looking at similar things. When we page out / free something from a shared library that 10 tasks have mapped, who does that count against for pressure? M. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 15:19 ` Martin J. Bligh @ 2005-11-04 17:38 ` Andrew Morton 0 siblings, 0 replies; 241+ messages in thread From: Andrew Morton @ 2005-11-04 17:38 UTC (permalink / raw) To: Martin J. Bligh Cc: pj, bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel, lhms-devel "Martin J. Bligh" <mbligh@mbligh.org> wrote: > > > Seriously, it does appear that doing it per-task is adequate for your > > needs, and it is certainly more general. > > > > > > > > I cannot understand why you decided to count only the number of > > direct-reclaim events, via a "digitally filtered, constant time based, > > event frequency meter". > > > > a) It loses information. If we were to export the number of pages > > reclaimed from the mm, filtering can be done in userspace. > > > > b) It omits reclaim performed by kswapd and by other tasks (ok, it's > > very cpuset-specific). > > > > c) It only counts synchronous try_to_free_pages() attempts. What if an > > attempt only freed pagecache, or didbn't manage to free anything? > > > > d) It doesn't notice if kswapd is swapping the heck out of your > > not-allocating-any-memory-now process. > > > > > > I think all the above can be addressed by exporting per-task (actually > > per-mm) reclaim info. (I haven't put much though into what info that > > should be - page reclaim attempts, mmapped reclaims, swapcache reclaims, > > etc) > > I've been looking at similar things. When we page out / free something from > a shared library that 10 tasks have mapped, who does that count against > for pressure? Count pte unmappings and minor faults and account them against the mm_struct, I guess. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 5:35 ` Paul Jackson 2005-11-04 5:48 ` Andrew Morton @ 2005-11-04 6:16 ` Bron Nelson 1 sibling, 0 replies; 241+ messages in thread From: Bron Nelson @ 2005-11-04 6:16 UTC (permalink / raw) To: Paul Jackson, Andrew Morton Cc: lhms-devel, linux-kernel, linux-mm, kravetz, mbligh, mel, haveblue, kamezawa.hiroyu, mingo, gh, nickpiggin, rob, jdike, pbadari > I was kind of thinking that the stats should be per-process (actually > per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something. The particular people that I deal with care about constraining things on a per-cpuset basis, so that is the information that I personally am looking for. But it is simple enough to map tasks to cpusets and vice-versa, so this is not really a serious consideration. I would generically be in favor of the per-process stats (even though the application at hand is actually interested in the cpuset aggregate stats), because we can always produce an aggregate from the detailed, but not vice-versa. And no doubt some future as-yet-unimagined application will want per-process info. -- Bron Campbell Nelson bron@sgi.com These statements are my own, not those of Silicon Graphics. ^ permalink raw reply [flat|nested] 241+ messages in thread
* [patch] swapin rlimit
  2005-11-04  4:52 ` Andrew Morton
  2005-11-04  5:35   ` Paul Jackson
@ 2005-11-04  7:26   ` Ingo Molnar
  2005-11-04  7:36     ` Andrew Morton
  2005-11-04 10:14     ` Bernd Petrovitsch
  1 sibling, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 7:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Badari Pulavarty, Linus Torvalds, jdike, rob, nickpiggin, gh,
    kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
    linux-kernel, lhms-devel

* Andrew Morton <akpm@osdl.org> wrote:

> Similarly, that SGI patch which was rejected 6-12 months ago to kill
> off processes once they started swapping.  We thought that it could be
> done from userspace, but we need a way for userspace to detect when a
> task is being swapped on a per-task basis.

wouldn't the clean solution here be a "swap ulimit"?  I.e. something
like the 2-minute quick-hack below (against Linus-curr).

	Ingo

---
Implement a swap ulimit: RLIMIT_SWAP.  Setting the ulimit to 0 causes
any swapin activity to kill the task.

Setting the rlimit to 0 is allowed for unprivileged users too, since
it is a decrease of the default RLIM_INFINITY value.  I.e. users could
run known-memory-intense jobs with such an ulimit set, and get a
guarantee that they won't put the system into a swap-storm.

Note: it's just swapin that causes the SIGKILL, because at swapout
time it's hard to identify the originating task.  Pure swapouts and a
buildup in the swap-cache are not punished, only actual hard swapins.

I didn't try too hard to make the rlimit particularly fine-grained -
i.e. right now we only know 'zero' and 'infinity' ...

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 include/asm-generic/resource.h |    4 +++-
 mm/memory.c                    |   13 +++++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

Index: linux/include/asm-generic/resource.h
===================================================================
--- linux.orig/include/asm-generic/resource.h
+++ linux/include/asm-generic/resource.h
@@ -44,8 +44,9 @@
 #define RLIMIT_NICE		13	/* max nice prio allowed to raise to
					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
+#define RLIMIT_SWAP		15	/* maximum swapspace for task */
 
-#define RLIM_NLIMITS		15
+#define RLIM_NLIMITS		16
 
 /*
  * SuS says limits have to be unsigned.
@@ -86,6 +87,7 @@
 	[RLIMIT_MSGQUEUE]	= { MQ_BYTES_MAX, MQ_BYTES_MAX },	\
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
+	[RLIMIT_SWAP]		= { RLIM_INFINITY, RLIM_INFINITY },	\
 }
 
 #endif /* __KERNEL__ */
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -1647,6 +1647,18 @@ void swapin_readahead(swp_entry_t entry,
 }
 
 /*
+ * Crude first-approximation swapin-avoidance: if there is a zero swap
+ * rlimit then kill the task.
+ */
+static inline void check_swap_rlimit(void)
+{
+	unsigned long limit = current->signal->rlim[RLIMIT_SWAP].rlim_cur;
+
+	if (limit != RLIM_INFINITY)
+		force_sig(SIGKILL, current);
+}
+
+/*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -1667,6 +1679,7 @@ static int do_swap_page(struct mm_struct
 	entry = pte_to_swp_entry(orig_pte);
 	page = lookup_swap_cache(entry);
 	if (!page) {
+		check_swap_rlimit();
 		swapin_readahead(entry, address, vma);
 		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {

^ permalink raw reply [flat|nested] 241+ messages in thread
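Assuming the patch above were applied, using it would look like any other rlimit. A sketch of a wrapper that runs a command with swapins forbidden; RLIMIT_SWAP is the value from the patch, and on an unpatched kernel setrlimit() simply fails with EINVAL:

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

#ifndef RLIMIT_SWAP
#define RLIMIT_SWAP 15			/* from the proposed patch above */
#endif

int main(int argc, char *argv[])
{
	struct rlimit rl = { 0, 0 };	/* zero swap: first swapin kills */

	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}
	if (setrlimit(RLIMIT_SWAP, &rl) < 0) {
		perror("setrlimit(RLIMIT_SWAP)");
		return 1;
	}
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}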
* Re: [patch] swapin rlimit 2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar @ 2005-11-04 7:36 ` Andrew Morton 2005-11-04 8:07 ` Ingo Molnar ` (2 more replies) 2005-11-04 10:14 ` Bernd Petrovitsch 1 sibling, 3 replies; 241+ messages in thread From: Andrew Morton @ 2005-11-04 7:36 UTC (permalink / raw) To: Ingo Molnar Cc: pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Ingo Molnar <mingo@elte.hu> wrote: > > * Andrew Morton <akpm@osdl.org> wrote: > > > Similarly, that SGI patch which was rejected 6-12 months ago to kill > > off processes once they started swapping. We thought that it could be > > done from userspace, but we need a way for userspace to detect when a > > task is being swapped on a per-task basis. > > wouldnt the clean solution here be a "swap ulimit"? Well it's _a_ solution, but it's terribly specific. How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's non-zero, kill <pid>? ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit
  2005-11-04  7:36 ` Andrew Morton
@ 2005-11-04  8:07   ` Ingo Molnar
  2005-11-04 10:06     ` Paul Jackson
  2005-11-04 15:24     ` Martin J. Bligh
  0 siblings, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 8:07 UTC (permalink / raw)
To: Andrew Morton
Cc: pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu,
    haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel

* Andrew Morton <akpm@osdl.org> wrote:

> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Andrew Morton <akpm@osdl.org> wrote:
> >
> > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > off processes once they started swapping.  We thought that it could be
> > > done from userspace, but we need a way for userspace to detect when a
> > > task is being swapped on a per-task basis.
> >
> > wouldn't the clean solution here be a "swap ulimit"?
>
> Well it's _a_ solution, but it's terribly specific.
>
> How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
> non-zero, kill <pid>?

on a system with possibly thousands of tasks, over /proc, on a
high-performance node where for a 0.5% improvement they are willing
to sacrifice maidens? :)

Seriously, while nr_swapped_in_pages ought to be OK, I think there is
a generic problem with /proc based stats.

System instrumentation people are already complaining about how
costly /proc parsing is.  If you have to get some nontrivial stat
from all threads in the system, and if Linux doesn't offer that
counter or summary by default, it gets pretty expensive.

One solution I can think of would be to make a binary representation
of /proc/<pid>/stats readonly-mmap-able.  This would add a 4K page to
every task tracked that way, and stats updates would have to update
this page too - but it would make instrumentation of running apps
really unintrusive and scalable.

Another addition would be some mechanism for a monitoring app to
capture events in the PID space: so that it can mmap() new tasks [if
it is interested] on a non-polling basis, i.e. not like readdir on
/proc.  This capability probably has to be a system call though, as
/proc seems too quirky for it.

The system does not wait on the monitoring app(s) to catch up - if an
app is too slow in reacting and the event buffer overflows then tough
luck - monitoring apps will have no impact on the runtime
characteristics of other tasks.  In theory this is somewhat similar
to auditing, but the purpose would be quite different, and it only
cares about PID-space events like 'fork/clone', 'exec' and 'exit'.

	Ingo

^ permalink raw reply [flat|nested] 241+ messages in thread
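What the monitoring side of that mmap idea might look like. Everything here - the file name, the page layout, the counters - is invented for illustration; no such interface exists:

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical kernel-updated, per-task stats page. */
struct task_stats_page {
	uint64_t seq;			/* bumped around each update */
	uint64_t nr_swapped_in_pages;
	uint64_t nr_reclaimed_pages;
};

int main(void)
{
	/* Hypothetical file; nothing like it exists in /proc today. */
	int fd = open("/proc/1234/stats_page", O_RDONLY);
	struct task_stats_page *s;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	s = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (s == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Sampling is now a plain memory read - no syscall per sample. */
	printf("swapins: %llu\n", (unsigned long long)s->nr_swapped_in_pages);
	return 0;
}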
* Re: [patch] swapin rlimit 2005-11-04 8:07 ` Ingo Molnar @ 2005-11-04 10:06 ` Paul Jackson 2005-11-04 15:24 ` Martin J. Bligh 1 sibling, 0 replies; 241+ messages in thread From: Paul Jackson @ 2005-11-04 10:06 UTC (permalink / raw) To: Ingo Molnar Cc: akpm, pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Ingo wrote: > Seriously, while nr_swapped_in_pages ought to be OK, i think there is a > generic problem with /proc based stats. > > System instrumentation people are already complaining about how costly > /proc parsing is. If you have to get some nontrivial stat from all > threads in the system, and if Linux doesnt offer that counter or summary > by default, it gets pretty expensive. Agreed. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit 2005-11-04 8:07 ` Ingo Molnar 2005-11-04 10:06 ` Paul Jackson @ 2005-11-04 15:24 ` Martin J. Bligh 1 sibling, 0 replies; 241+ messages in thread From: Martin J. Bligh @ 2005-11-04 15:24 UTC (permalink / raw) To: Ingo Molnar, Andrew Morton Cc: pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel, lhms-devel > System instrumentation people are already complaining about how costly > /proc parsing is. If you have to get some nontrivial stat from all > threads in the system, and if Linux doesnt offer that counter or summary > by default, it gets pretty expensive. > > One solution i can think of would be to make a binary representation of > /proc/<pid>/stats readonly-mmap-able. This would add a 4K page to every > task tracked that way, and stats updates would have to update this page > too - but it would make instrumentation of running apps really > unintrusive and scalable. That would be awesome - the current methods we have are mostly crap. There are some atomicity issues though. Plus when I suggested this 2 years ago, everyone told me to piss off, but I'm not bitter ;-) Seriously, we do need a fast communication mechanism. M. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit
  2005-11-04  7:36 ` Andrew Morton
  2005-11-04  8:07   ` Ingo Molnar
@ 2005-11-04  8:18   ` Arjan van de Ven
  2005-11-04 10:04     ` Paul Jackson
  2 siblings, 1 reply; 241+ messages in thread
From: Arjan van de Ven @ 2005-11-04 8:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, pbadari, torvalds, jdike, rob, nickpiggin, gh,
    kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
    linux-kernel, lhms-devel

On Thu, 2005-11-03 at 23:36 -0800, Andrew Morton wrote:
> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Andrew Morton <akpm@osdl.org> wrote:
> >
> > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > off processes once they started swapping.  We thought that it could be
> > > done from userspace, but we need a way for userspace to detect when a
> > > task is being swapped on a per-task basis.
> >
> > wouldn't the clean solution here be a "swap ulimit"?
>
> Well it's _a_ solution, but it's terribly specific.
>
> How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
> non-zero, kill <pid>?

Well, or do it the other way around: write a counter to such a thing
and kill when it hits zero (similar to the CPU perf counter stuff on
x86).

Doing this from userspace is tricky; what if the task dies of natural
causes and the pid gets reused between the time the userspace app
reads the value and the time it decides the time is up and it's time
for a kill?  (And on a busy server that can be quite a bit of time.)

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit 2005-11-04 8:18 ` Arjan van de Ven @ 2005-11-04 10:04 ` Paul Jackson 0 siblings, 0 replies; 241+ messages in thread From: Paul Jackson @ 2005-11-04 10:04 UTC (permalink / raw) To: Arjan van de Ven Cc: akpm, mingo, pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Arjan wrote: > doing this from userspace is tricky; what if the task dies of natural > causes and the pid gets reused, between the time the userspace app reads > the value and the time it decides the time is up and time for a kill.... > (and on a busy server that can be quite a bit of time) If pids are being reused within seconds of their being freed up, then the batch managers running on the big HPC systems I care about are so screwed it isn't even funny. They depend heavily on being able to identify the task pids in a job and then doing something to those tasks (suspend, kill, gather stats, ...). -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit 2005-11-04 7:36 ` Andrew Morton 2005-11-04 8:07 ` Ingo Molnar 2005-11-04 8:18 ` Arjan van de Ven @ 2005-11-04 15:14 ` Rob Landley 2 siblings, 0 replies; 241+ messages in thread From: Rob Landley @ 2005-11-04 15:14 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, pbadari, torvalds, jdike, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel On Friday 04 November 2005 01:36, Andrew Morton wrote: > > wouldnt the clean solution here be a "swap ulimit"? > > Well it's _a_ solution, but it's terribly specific. > > How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's > non-zero, kill <pid>? Things like make fork lots of short-lived child processes, and some of those can be quite memory intensive. (The gcc 4.0.2 build causes an outright swap storm for me about halfway through, doing genattrtab and then again compiling the result). Is there any way for parents to collect their child process's statistics when the children exit? Or by the time the actual swapper exits, do we not care anymore? Rob ^ permalink raw reply [flat|nested] 241+ messages in thread
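For what it's worth, part of this exists already: rusage accounting is folded into the parent when a child is reaped, so after wait() the parent can read its children's accumulated fault counts via getrusage(RUSAGE_CHILDREN) (or wait4()). ru_majflt - faults that required I/O, which includes swapins - is the closest thing to a per-job indicator here. A minimal example:

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	struct rusage ru;
	pid_t pid = fork();

	if (pid == 0) {			/* child: do some work, then exit */
		system("gcc --version > /dev/null 2>&1");
		_exit(0);
	}
	wait(NULL);			/* stats are added when reaped */

	if (getrusage(RUSAGE_CHILDREN, &ru) == 0)
		printf("children: %ld major faults, %ld minor faults\n",
		       ru.ru_majflt, ru.ru_minflt);
	return 0;
}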
* Re: [patch] swapin rlimit
  2005-11-04  7:26 ` [patch] swapin rlimit Ingo Molnar
  2005-11-04  7:36   ` Andrew Morton
@ 2005-11-04 10:14   ` Bernd Petrovitsch
  2005-11-04 10:21     ` Ingo Molnar
  1 sibling, 1 reply; 241+ messages in thread
From: Bernd Petrovitsch @ 2005-11-04 10:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Badari Pulavarty, Linus Torvalds, jdike, rob,
    nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz,
    linux-mm, linux-kernel, lhms-devel

On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote:
> * Andrew Morton <akpm@osdl.org> wrote:
>
> > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > off processes once they started swapping.  We thought that it could be
> > done from userspace, but we need a way for userspace to detect when a
> > task is being swapped on a per-task basis.
>
> wouldn't the clean solution here be a "swap ulimit"?

Hmm, how is that different from "mlockall(MCL_CURRENT|MCL_FUTURE);"?
OK, mlockall() can only be done by root (processes).

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit 2005-11-04 10:14 ` Bernd Petrovitsch @ 2005-11-04 10:21 ` Ingo Molnar 2005-11-04 11:17 ` Bernd Petrovitsch 0 siblings, 1 reply; 241+ messages in thread From: Ingo Molnar @ 2005-11-04 10:21 UTC (permalink / raw) To: Bernd Petrovitsch Cc: Andrew Morton, Badari Pulavarty, Linus Torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel * Bernd Petrovitsch <bernd@firmix.at> wrote: > On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote: > > * Andrew Morton <akpm@osdl.org> wrote: > > > > > Similarly, that SGI patch which was rejected 6-12 months ago to kill > > > off processes once they started swapping. We thought that it could be > > > done from userspace, but we need a way for userspace to detect when a > > > task is being swapped on a per-task basis. > > > > wouldnt the clean solution here be a "swap ulimit"? > > Hmm, where is the difference to "mlockall(MCL_CURRENT|MCL_FUTURE);"? > OK, mlockall() can only be done by root (processes). what do you mean? mlockall pins down all pages. swapin ulimit kills the task (and thus frees all the RAM it had) when it touches swap for the first time. These two solutions almost oppose each other! Ingo ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit
  2005-11-04 10:21 ` Ingo Molnar
@ 2005-11-04 11:17   ` Bernd Petrovitsch
  0 siblings, 0 replies; 241+ messages in thread
From: Bernd Petrovitsch @ 2005-11-04 11:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Badari Pulavarty, Linus Torvalds, jdike, rob,
    nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz,
    linux-mm, linux-kernel, lhms-devel

On Fri, 2005-11-04 at 11:21 +0100, Ingo Molnar wrote:
> * Bernd Petrovitsch <bernd@firmix.at> wrote:
> > On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote:
> > > * Andrew Morton <akpm@osdl.org> wrote:
> > > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > > off processes once they started swapping.  We thought that it could be
> > > > done from userspace, but we need a way for userspace to detect when a
> > > > task is being swapped on a per-task basis.
> > >
> > > wouldn't the clean solution here be a "swap ulimit"?
> >
> > Hmm, how is that different from "mlockall(MCL_CURRENT|MCL_FUTURE);"?
> > OK, mlockall() can only be done by root (processes).
>
> what do you mean? mlockall pins down all pages. swapin ulimit kills the

in memory.

> task (and thus frees all the RAM it had) when it touches swap for the
> first time. These two solutions almost oppose each other!

Almost, IMHO, since locked pages in RAM avoid swapping entirely.
Probably "complement each other" is more correct.  Given the limit
for "max locked memory", it should behave pretty much the same when
the process hits its limits.  OK, the difference may be loaded
executable and lib pages.

Hmm, delivering a signal on the first swapped-out page might be
another simple solution, and the process could do something to avoid
it.

The nice thing about the "swap ulimit" is that it is easy to
understand what it is (which is always a good thing).  Generating a
similar effect with the combination of two other features is probably
somewhat more arcane.

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services

^ permalink raw reply [flat|nested] 241+ messages in thread
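For comparison, the mlockall() approach Bernd mentions pins everything so the process never touches swap at all - the opposite failure mode of the swapin ulimit, which lets the task fault once and then kills it:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* Lock all current and future pages into RAM.  Fails with EPERM
	 * without privilege (or a sufficient locked-memory rlimit). */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) {
		perror("mlockall");
		return 1;
	}

	/* ... run the memory-intensive job here ... */
	return 0;
}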
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-02  7:46 ` Gerrit Huizenga
  2005-11-02  8:50   ` Nick Piggin
@ 2005-11-02 10:41   ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar
  2005-11-02 11:04     ` Gerrit Huizenga
  1 sibling, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-02 10:41 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
    Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
    Linux Kernel Mailing List, lhms

* Gerrit Huizenga <gh@us.ibm.com> wrote:

> > generic unpluggable kernel RAM _will not work_.
>
> Actually, it will.  Well, depending on terminology.

'generic unpluggable kernel RAM' means what it says: any RAM seen by
the kernel can be unplugged, always (as long as the unplug request is
reasonable and there is enough free space to migrate in-use pages to).

> There are two usage models here - those which intend to remove
> physical elements and those where the kernel returns management of
> its virtualized "physical" memory to a hypervisor.  In the latter
> case, a hypervisor already maintains a virtual map of the memory and
> the OS needs to release virtualized "physical" memory.  I think you
> are referring to RAM here as the physical component; however these
> same defrag patches help where a hypervisor is maintaining the real
> physical memory below the operating system and the OS is managing a
> virtualized "physical" memory.

Reliable unmapping of "generic kernel RAM" is not possible even in a
virtualized environment.  Think of the 'live pointers' problem I
outlined in an earlier mail in this thread today.

	Ingo

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 10:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar @ 2005-11-02 11:04 ` Gerrit Huizenga 2005-11-02 12:00 ` Ingo Molnar 0 siblings, 1 reply; 241+ messages in thread From: Gerrit Huizenga @ 2005-11-02 11:04 UTC (permalink / raw) To: Ingo Molnar Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote: > > * Gerrit Huizenga <gh@us.ibm.com> wrote: > > > > generic unpluggable kernel RAM _will not work_. > > > > Actually, it will. Well, depending on terminology. > > 'generic unpluggable kernel RAM' means what it says: any RAM seen by the > kernel can be unplugged, always. (as long as the unplug request is > reasonable and there is enough free space to migrate in-use pages to). Okay, I understand your terminology. Yes, I can not point to any particular piece of memory and say "I want *that* one" and have that request succeed. However, I can say "find me 50 chunks of memory of your choosing" and have a very good chance of finding enough memory to satisfy my request. > > There are two usage models here - those which intend to remove > > physical elements and those where the kernel returnss management of > > its virtualized "physical" memory to a hypervisor. In the latter > > case, a hypervisor already maintains a virtual map of the memory and > > the OS needs to release virtualized "physical" memory. I think you > > are referring to RAM here as the physical component; however these > > same defrag patches help where a hypervisor is maintaining the real > > physical memory below the operating system and the OS is managing a > > virtualized "physical" memory. > > reliable unmapping of "generic kernel RAM" is not possible even in a > virtualized environment. Think of the 'live pointers' problem i outlined > in an earlier mail in this thread today. Yeah - and that isn't what is being proposed here. The goal is to ask the kernel to identify some memory which can be legitimately freed and hasten the freeing of that memory. gerrit ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 11:04 ` Gerrit Huizenga @ 2005-11-02 12:00 ` Ingo Molnar 2005-11-02 12:42 ` Dave Hansen 2005-11-02 15:02 ` Gerrit Huizenga 0 siblings, 2 replies; 241+ messages in thread From: Ingo Molnar @ 2005-11-02 12:00 UTC (permalink / raw) To: Gerrit Huizenga Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms * Gerrit Huizenga <gh@us.ibm.com> wrote: > > On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote: > > > > * Gerrit Huizenga <gh@us.ibm.com> wrote: > > > > > > generic unpluggable kernel RAM _will not work_. > > > > > > Actually, it will. Well, depending on terminology. > > > > 'generic unpluggable kernel RAM' means what it says: any RAM seen by the > > kernel can be unplugged, always. (as long as the unplug request is > > reasonable and there is enough free space to migrate in-use pages to). > > Okay, I understand your terminology. Yes, I can not point to any > particular piece of memory and say "I want *that* one" and have that > request succeed. However, I can say "find me 50 chunks of memory > of your choosing" and have a very good chance of finding enough > memory to satisfy my request. but that's obviously not 'generic unpluggable kernel RAM'. It's very special RAM: RAM that is free or easily freeable. I never argued that such RAM is not returnable to the hypervisor. > > reliable unmapping of "generic kernel RAM" is not possible even in a > > virtualized environment. Think of the 'live pointers' problem i outlined > > in an earlier mail in this thread today. > > Yeah - and that isn't what is being proposed here. The goal is to > ask the kernel to identify some memory which can be legitimately > freed and hasten the freeing of that memory. but that's very easy to identify: check the free list or the clean list(s). No defragmentation necessary. [unless the unit of RAM mapping between hypervisor and guest is too coarse (i.e. not 4K pages).] Ingo ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 12:00 ` Ingo Molnar @ 2005-11-02 12:42 ` Dave Hansen 2005-11-02 15:02 ` Gerrit Huizenga 1 sibling, 0 replies; 241+ messages in thread From: Dave Hansen @ 2005-11-02 12:42 UTC (permalink / raw) To: Ingo Molnar Cc: Gerrit Huizenga, KAMEZAWA Hiroyuki, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Wed, 2005-11-02 at 13:00 +0100, Ingo Molnar wrote: > > > Yeah - and that isn't what is being proposed here. The goal is to > > ask the kernel to identify some memory which can be legitimately > > freed and hasten the freeing of that memory. > > but that's very easy to identify: check the free list or the clean > list(s). No defragmentation necessary. [unless the unit of RAM mapping > between hypervisor and guest is too coarse (i.e. not 4K pages).] It needs to be that coarse in cases where HugeTLB is desired for use. I'm not sure I could convince the DB guys to give up large pages, they're pretty hooked on them. ;) -- Dave ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 12:00 ` Ingo Molnar
2005-11-02 12:42 ` Dave Hansen
@ 2005-11-02 15:02 ` Gerrit Huizenga
2005-11-03 0:10 ` Rob Landley
1 sibling, 1 reply; 241+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 15:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wed, 02 Nov 2005 13:00:48 +0100, Ingo Molnar wrote:
>
> * Gerrit Huizenga <gh@us.ibm.com> wrote:
> >
> > > On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote:
> > >
> > > * Gerrit Huizenga <gh@us.ibm.com> wrote:
> > > >
> > > > > generic unpluggable kernel RAM _will not work_.
> > > >
> > > > Actually, it will. Well, depending on terminology.
> > >
> > > 'generic unpluggable kernel RAM' means what it says: any RAM seen by the
> > > kernel can be unplugged, always. (as long as the unplug request is
> > > reasonable and there is enough free space to migrate in-use pages to).
> >
> > Okay, I understand your terminology. Yes, I can not point to any
> > particular piece of memory and say "I want *that* one" and have that
> > request succeed. However, I can say "find me 50 chunks of memory
> > of your choosing" and have a very good chance of finding enough
> > memory to satisfy my request.
>
> but that's obviously not 'generic unpluggable kernel RAM'. It's very
> special RAM: RAM that is free or easily freeable. I never argued that
> such RAM is not returnable to the hypervisor.

Okay - and 'generic unpluggable kernel RAM' has not been a goal for
the hypervisor based environments. I believe it is closer to being
a goal for those machines which want to hot-remove DIMMs or physical
memory, e.g. those with IA64 machines wishing to remove entire nodes.

> > > reliable unmapping of "generic kernel RAM" is not possible even in a
> > > virtualized environment. Think of the 'live pointers' problem i outlined
> > > in an earlier mail in this thread today.
> >
> > Yeah - and that isn't what is being proposed here. The goal is to
> > ask the kernel to identify some memory which can be legitimately
> > freed and hasten the freeing of that memory.
>
> but that's very easy to identify: check the free list or the clean
> list(s). No defragmentation necessary. [unless the unit of RAM mapping
> between hypervisor and guest is too coarse (i.e. not 4K pages).]

Ah, but the hypervisor often manages large page sizes, e.g. 64 MB.
It doesn't manage page rights for each guest OS at the 4 K granularity.
Hypervisors are theoretically light in terms of memory needs and general
footprint. Picture the overhead of tracking rights/permissions of each
page of memory and its assignment to any of, say, 256 different guest
operating systems. For a machine of any size, that would be a huge
amount of state for a hypervisor to maintain. Would you really want a
hypervisor to keep that much state? Or is it more reasonable for a
hypervisor to track, say, 64 MB chunks and the rights of that memory for
a number of guest operating systems? Even if the number of guests is
small, the data structures for fast memory management would grow quickly.

gerrit
^ permalink raw reply [flat|nested] 241+ messages in thread
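To make the bookkeeping argument above concrete, here is a back-of-the-envelope sketch comparing per-page against per-chunk tracking. The specific numbers (1 TiB of machine RAM, 16 bytes of state per tracked unit) are illustrative assumptions, not figures from the thread; only the 4 KiB and 64 MiB granularities come from the discussion.

#include <stdio.h>

int main(void)
{
	unsigned long long ram      = 1ULL << 40;  /* 1 TiB of machine RAM (assumed) */
	unsigned long long page     = 4ULL << 10;  /* 4 KiB tracking unit            */
	unsigned long long chunk    = 64ULL << 20; /* 64 MiB tracking unit           */
	unsigned long long per_unit = 16;          /* bytes of state per unit (assumed) */

	/* 268 million entries, ~4 GiB of hypervisor state */
	printf("4KiB units : %llu entries, ~%llu MiB of state\n",
	       ram / page, ram / page * per_unit >> 20);
	/* 16384 entries, ~256 KiB of hypervisor state */
	printf("64MiB units: %llu entries, ~%llu KiB of state\n",
	       ram / chunk, ram / chunk * per_unit >> 10);
	return 0;
}

Even with a modest per-unit cost, 4 KiB granularity puts the hypervisor's tracking state in the gigabytes for a terabyte machine, which is the scaling problem Gerrit is pointing at.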
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 15:02 ` Gerrit Huizenga
@ 2005-11-03 0:10 ` Rob Landley
0 siblings, 0 replies; 241+ messages in thread
From: Rob Landley @ 2005-11-03 0:10 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wednesday 02 November 2005 09:02, Gerrit Huizenga wrote:
> > but that's obviously not 'generic unpluggable kernel RAM'. It's very
> > special RAM: RAM that is free or easily freeable. I never argued that
> > such RAM is not returnable to the hypervisor.
>
> Okay - and 'generic unpluggable kernel RAM' has not been a goal for
> the hypervisor based environments. I believe it is closer to being
> a goal for those machines which want to hot-remove DIMMs or physical
> memory, e.g. those with IA64 machines wishing to remove entire nodes

Keep in mind that just about any virtualized environment might benefit from
being able to tell the parent system "we're not using this ram". I mentioned
UML, and I can also imagine a Linux driver that signals qemu (or even vmware)
to say "this chunk of physical memory isn't currently in use", and even if
they don't actually _free_ it they can call madvise() on it.

Heck, if we have prezeroing of large blocks, telling your emulator to
madvise(MADV_DONTNEED) the pages for you should just plug right in to that
infrastructure...

Rob
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:19 ` Ingo Molnar
2005-11-02 7:46 ` Gerrit Huizenga
@ 2005-11-02 7:57 ` Nick Piggin
1 sibling, 0 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 7:57 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

Ingo Molnar wrote:
> * Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>>My own target is NUMA node hotplug, what NUMA node hotplug want is
>>- [remove the range of memory] For this approach, admin should define
>> *core* node and removable node. Memory on removable node is removable.
>> Dividing area into removable and not-removable is needed, because
>> we cannot allocate any kernel's object on removable area.
>> Removable area should be 100% removable. Customer can know the limitation
>> before using.
>
> that's a perfectly fine method, and is quite similar to the 'separate
> zone' approach Nick mentioned too. It is also easily understandable for
> users/customers.
>

I agree - and I think it should be easy to configure out of the kernel
for those that don't want the functionality, and should add very little
complexity to core code (all without looking at the patches so I could
be very wrong!).

> but what is a dangerous fallacy is that we will be able to support hot
> memory unplug of generic kernel RAM in any reliable way!
>

Very true.

> you really have to look at this from the conceptual angle: 'can an
> approach ever lead to a satisfactory result'? If the answer is 'no',
> then we _must not_ add a 90% solution that we _know_ will never be a
> 100% solution.
>
> for the separate-removable-zones approach we see the end of the tunnel.
> Separate zones are well-understood.
>

Yep, I don't see why this doesn't cover all the needs that the frag
patches attempt (hot unplug, hugepage dynamic reserves).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:49 ` Dave Hansen
2005-11-01 15:01 ` Ingo Molnar
@ 2005-11-02 0:51 ` Nick Piggin
2005-11-02 7:42 ` Dave Hansen
2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman
1 sibling, 2 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 0:51 UTC (permalink / raw)
To: Dave Hansen
Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

Dave Hansen wrote:

> What the fragmentation patches _can_ give us is the ability to have 100%
> success in removing certain areas: the "user-reclaimable" areas
> referenced in the patch. This gives a customer at least the ability to
> plan for how dynamically reconfigurable a system should be.
>

But the "user-reclaimable" areas can still be taken over by other
areas which become fragmented.

That's like saying we can already guarantee 100% success in removing
areas that are unfragmented and free, or freeable.

> After these patches, the next logical steps are to increase the
> knowledge that the slabs have about fragmentation, and to teach some of
> the shrinkers about fragmentation.
>

I don't like all this work and complexity and overheads going into a
partial solution.

Look: if you have to guarantee memory can be shrunk, set aside a zone
for it (that only fills with user reclaimable areas). This is better
than the current frag patches because it will give you the 100%
guarantee that you need (provided we have page migration to move mlocked
pages).

If you don't need a guarantee, then our current, simple system does the
job perfectly.

> After that, we'll need some kind of virtual remapping, breaking the 1:1
> kernel virtual mapping, so that the most problematic pages can be
> remapped. These pages would retain their virtual address, but get a
> new physical address. However, this is quite far down the road and will
> require some serious evaluation because it impacts how normal devices
> are able to DMA. The ppc64 proprietary hypervisor has features to work
> around these issues, and any new hypervisors wishing to support partition
> memory hotplug would likely have to follow suit.
>

I would more like to see something like this happen (provided it was
nicely abstracted away and could be CONFIGed out for the 99.999% of
users who don't need the overhead or complexity).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 0:51 ` Nick Piggin @ 2005-11-02 7:42 ` Dave Hansen 2005-11-02 8:24 ` Nick Piggin 2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman 1 sibling, 1 reply; 241+ messages in thread From: Dave Hansen @ 2005-11-02 7:42 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Wed, 2005-11-02 at 11:51 +1100, Nick Piggin wrote: > Look: if you have to guarantee memory can be shrunk, set aside a zone > for it (that only fills with user reclaimable areas). This is better > than the current frag patches because it will give you the 100% > guarantee that you need (provided we have page migration to move mlocked > pages). With Mel's patches, you can easily add the same guarantee. Look at the code in fallback_alloc() (patch 5/8). It would be quite easy to modify the fallback lists to disallow fallbacks into areas from which we would like to remove memory. That was left out for simplicity. As you say, they're quite complex as it is. Would you be interested in seeing a patch to provide those kinds of guarantees? We've had a bit of experience with a hotpluggable zone approach before. Just like the current topic patches, you're right, that approach can also provide strong guarantees. However, the issue comes if the system ever needs to move memory between such zones, such as if a user ever decides that they'd prefer to break hotplug guarantees rather than OOM. Do you think changing what a particular area of memory is being used for would ever be needed? One other thing, if we decide to take the zones approach, it would have no other side benefits for the kernel. It would be for hotplug only and I don't think even the large page users would get much benefit. -- Dave ^ permalink raw reply [flat|nested] 241+ messages in thread
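A minimal sketch of the fallback idea Dave describes, loosely modelled on the fallback_allocs[] table his mail refers to. The RCLM_* names come from the thread; the exact table layout here is an assumption for illustration, not the code in the posted patches. The guarantee he mentions amounts to never listing an unplug-earmarked type in the kernel types' fallback rows.

/* For each allocation type, the order in which other types' reserved
 * areas may be raided once the type's own areas are exhausted. */
#define RCLM_NORCLM        0    /* kernel, not reclaimable          */
#define RCLM_KERN          1    /* kernel, reclaimable (e.g. slab)  */
#define RCLM_EASY          2    /* user pages, easily reclaimable   */
#define RCLM_TYPES         3
#define RCLM_FALLBACK_END  (-1)

static int fallback_allocs[RCLM_TYPES][RCLM_TYPES + 1] = {
	[RCLM_NORCLM] = { RCLM_NORCLM, RCLM_KERN,   RCLM_EASY,   RCLM_FALLBACK_END },
	[RCLM_KERN]   = { RCLM_KERN,   RCLM_NORCLM, RCLM_EASY,   RCLM_FALLBACK_END },
	/* To turn the heuristic into a removal guarantee, RCLM_EASY would
	 * simply never appear in the two kernel rows above. */
	[RCLM_EASY]   = { RCLM_EASY,   RCLM_KERN,   RCLM_NORCLM, RCLM_FALLBACK_END },
};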
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:42 ` Dave Hansen
@ 2005-11-02 8:24 ` Nick Piggin
2005-11-02 8:33 ` Yasunori Goto
0 siblings, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 8:24 UTC (permalink / raw)
To: Dave Hansen
Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

Dave Hansen wrote:
> On Wed, 2005-11-02 at 11:51 +1100, Nick Piggin wrote:
>
>>Look: if you have to guarantee memory can be shrunk, set aside a zone
>>for it (that only fills with user reclaimable areas). This is better
>>than the current frag patches because it will give you the 100%
>>guarantee that you need (provided we have page migration to move mlocked
>>pages).
>
> With Mel's patches, you can easily add the same guarantee. Look at the
> code in fallback_alloc() (patch 5/8). It would be quite easy to modify
> the fallback lists to disallow fallbacks into areas from which we would
> like to remove memory. That was left out for simplicity. As you say,
> they're quite complex as it is. Would you be interested in seeing a
> patch to provide those kinds of guarantees?
>

On top of Mel's patch? I think this is essential for any guarantees that
you might be interested in... but it would just mean that now you have a
redundant extra zoning layer.

I think ZONE_REMOVABLE is something that really needs to be looked at
again if you need a hotunplug solution in the kernel.

> We've had a bit of experience with a hotpluggable zone approach before.
> Just like the current topic patches, you're right, that approach can
> also provide strong guarantees. However, the issue comes if the system
> ever needs to move memory between such zones, such as if a user ever
> decides that they'd prefer to break hotplug guarantees rather than OOM.
>

I can imagine one could have a sysctl to allow/disallow non-easy-reclaim
allocations from ZONE_REMOVABLE.

As Ingo says, neither way is going to give a 100% solution - I wouldn't
like to see so much complexity added to bring us from a ZONE_REMOVABLE
80% solution to a 90% solution. I believe this is where Linus' "perfect
is the enemy of good" quote applies.

> Do you think changing what a particular area of memory is being used for
> would ever be needed?
>

Perhaps, but Mel's patch only guarantees you can change once, same as
ZONE_REMOVABLE. Once you eat up those easy-to-reclaim areas, you can't
get them back.

> One other thing, if we decide to take the zones approach, it would have
> no other side benefits for the kernel. It would be for hotplug only and
> I don't think even the large page users would get much benefit.
>

Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
talking about other higher-order users, I still think we can't guarantee
past about order 1 or 2 with Mel's patch and they simply need to have
some other ways to do things.

But I think using zones would have advantages in that they would help
give zones and zone balancing more scrutiny and test coverage in the
kernel, which is sorely needed since everyone threw out their highmem
systems :P

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 241+ messages in thread
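A minimal sketch of the zone-based alternative Nick outlines above: a removable zone that only easy-reclaim allocations may use, plus a switch for trading the unplug guarantee against OOM. The zone name matches Yasunori Goto's patch mentioned in the thread; the __GFP_EASYRCLM test, the sysctl and this helper are illustrative assumptions, not code from any posted patch.

#define ZONE_REMOVABLE   3          /* assumed zone index            */
#define __GFP_EASYRCLM   0x80000u   /* illustrative flag bit         */

static int sysctl_hotremove_strict = 1;  /* 1: keep the unplug guarantee */

static int zone_allowed(int zone_idx, unsigned int gfp_mask)
{
	if (zone_idx != ZONE_REMOVABLE)
		return 1;                /* other zones: no restriction */
	if (gfp_mask & __GFP_EASYRCLM)
		return 1;                /* user pages are always fine  */
	/* Kernel allocations spill into the removable zone only if the
	 * admin prefers breaking the hotplug guarantee to going OOM. */
	return !sysctl_hotremove_strict;
}

The design point being debated is exactly this: the check is trivial and lives in zone selection rather than inside the buddy allocator's free lists.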
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:24 ` Nick Piggin
@ 2005-11-02 8:33 ` Yasunori Goto
2005-11-02 8:43 ` Nick Piggin
0 siblings, 1 reply; 241+ messages in thread
From: Yasunori Goto @ 2005-11-02 8:33 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

> > One other thing, if we decide to take the zones approach, it would have
> > no other side benefits for the kernel. It would be for hotplug only and
> > I don't think even the large page users would get much benefit.
> >
>
> Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
> talking about other higher-order users, I still think we can't guarantee
> past about order 1 or 2 with Mel's patch and they simply need to have
> some other ways to do things.

Hmmm. I don't see it at this point.
Why do you think ZONE_REMOVABLE can satisfy hugepage users?
At least, my ZONE_REMOVABLE patch doesn't have any concern about
fragmentation.

Bye.

--
Yasunori Goto
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:33 ` Yasunori Goto
@ 2005-11-02 8:43 ` Nick Piggin
2005-11-02 14:51 ` Martin J. Bligh
2005-11-02 23:28 ` Rob Landley
0 siblings, 2 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 8:43 UTC (permalink / raw)
To: Yasunori Goto
Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

Yasunori Goto wrote:

>>>One other thing, if we decide to take the zones approach, it would have
>>>no other side benefits for the kernel. It would be for hotplug only and
>>>I don't think even the large page users would get much benefit.
>>>
>>
>>Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
>>talking about other higher-order users, I still think we can't guarantee
>>past about order 1 or 2 with Mel's patch and they simply need to have
>>some other ways to do things.
>
> Hmmm. I don't see it at this point.
> Why do you think ZONE_REMOVABLE can satisfy hugepage users?
> At least, my ZONE_REMOVABLE patch doesn't have any concern about
> fragmentation.
>

Well I think it can satisfy hugepage allocations simply because
we can be reasonably sure of being able to free contiguous regions.
Of course it will be memory no longer easily reclaimable, same as
the case for the frag patches. Nor would the name ZONE_REMOVABLE any
longer be the most appropriate!

But my point is, the basic mechanism is there and is workable.
Hugepages and memory unplug are the two main reasons for IBM to be
pushing this AFAIKS.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 8:43 ` Nick Piggin @ 2005-11-02 14:51 ` Martin J. Bligh 2005-11-02 23:28 ` Rob Landley 1 sibling, 0 replies; 241+ messages in thread From: Martin J. Bligh @ 2005-11-02 14:51 UTC (permalink / raw) To: Nick Piggin, Yasunori Goto Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms > Well I think it can satisfy hugepage allocations simply because > we can be reasonably sure of being able to free contiguous regions. > Of course it will be memory no longer easily reclaimable, same as > the case for the frag patches. Nor would be name ZONE_REMOVABLE any > longer be the most appropriate! > > But my point is, the basic mechanism is there and is workable. > Hugepages and memory unplug are the two main reasons for IBM to be > pushing this AFAIKS. No, that's not true - those are just the "exciting" features that go on the back of it. Look back in this email thread - there's lots of other reasons to fix fragmentation. I don't believe you can eliminate all the order > 0 allocations in the kernel. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:43 ` Nick Piggin
2005-11-02 14:51 ` Martin J. Bligh
@ 2005-11-02 23:28 ` Rob Landley
2005-11-03 5:26 ` Jeff Dike
1 sibling, 1 reply; 241+ messages in thread
From: Rob Landley @ 2005-11-02 23:28 UTC (permalink / raw)
To: Nick Piggin, user-mode-linux-devel
Cc: Yasunori Goto, Dave Hansen, Ingo Molnar, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wednesday 02 November 2005 02:43, Nick Piggin wrote:
> > Hmmm. I don't see it at this point.
> > Why do you think ZONE_REMOVABLE can satisfy hugepage users?
> > At least, my ZONE_REMOVABLE patch doesn't have any concern about
> > fragmentation.
>
> Well I think it can satisfy hugepage allocations simply because
> we can be reasonably sure of being able to free contiguous regions.
> Of course it will be memory no longer easily reclaimable, same as
> the case for the frag patches. Nor would the name ZONE_REMOVABLE any
> longer be the most appropriate!
>
> But my point is, the basic mechanism is there and is workable.
> Hugepages and memory unplug are the two main reasons for IBM to be
> pushing this AFAIKS.

Who cares what IBM is pushing? I'm interested in fragmentation avoidance for
User Mode Linux.

I use User Mode Linux to virtualize a system build, and one problem I
currently have is that some workloads temporarily use a lot of memory. For
example, I can run a complete system build in about 48 megs of ram: except
for building GCC. That spikes to a couple hundred megabytes.

If I allocate 256 megabytes of memory to UML, that's half the memory on my
laptop and UML will just use it for redundant caching and such while desktop
performance gets a bit unhappy with the build going.

UML gets an instance's "physical memory" by allocating a temporary file,
mmapping it, and deleting it (which signals to the vfs that flushing this
data to backing store should only be done under memory pressure from the
rest of the OS, because the file's going away when it's closed so there's no
need to ever write the data back.)

With fragmentation reduction and prezeroing, UML suddenly gains the option of
calling madvise(DONT_NEED) on sufficiently large blocks as A) a fast way of
prezeroing, B) a way of giving memory back to the host OS when it's not in
use.

This has _nothing_ to do with IBM. Or large systems. This is some random
developer trying to run a virtualized system build on his laptop.

(The reason I need to use UML is that I build uClibc with the newest 2.6
kernel headers I can, link apps against it, and then run many of those apps
during later stages of the build. If the kernel headers used to build libc
are sufficiently newer than the kernel the build is running under, I get
segfaults because the new libc tries to use kernel features that aren't
there on the host system, but will be in the final system. I also get the
ability to mknod/chown/chroot without needing root access on the host system
for free...)

Rob
^ permalink raw reply [flat|nested] 241+ messages in thread
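A minimal userspace sketch of the "deleted temporary file" trick Rob describes, with the madvise call he proposes. File name, size and the trimmed error handling are illustrative; as Jeff's reply below points out, MADV_DONTNEED alone is not enough to free dirty file-backed pages on the host.

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 64UL << 20;               /* 64MiB of guest "RAM" (assumed) */
	char path[] = "/tmp/vm-mem-XXXXXX";

	int fd = mkstemp(path);                 /* create the temporary file      */
	unlink(path);                           /* delete it: gone when closed    */
	ftruncate(fd, size);

	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
	                 MAP_SHARED, fd, 0);

	memset(mem, 0xaa, size);                /* the guest dirties its memory   */

	/* Drop the mapped pages for a page-aligned range.  For a MAP_SHARED
	 * file mapping this detaches the page tables, but the dirty pages
	 * backing the deleted file stay pinned in the page cache - which is
	 * exactly why the thread moves on to MADV_REMOVE. */
	madvise(mem, size, MADV_DONTNEED);

	munmap(mem, size);
	close(fd);
	return 0;
}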
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 23:28 ` Rob Landley @ 2005-11-03 5:26 ` Jeff Dike 2005-11-03 5:41 ` Rob Landley 0 siblings, 1 reply; 241+ messages in thread From: Jeff Dike @ 2005-11-03 5:26 UTC (permalink / raw) To: Rob Landley Cc: Nick Piggin, user-mode-linux-devel, Yasunori Goto, Dave Hansen, Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote: > With fragmentation reduction and prezeroing, UML suddenly gains the option of > calling madvise(DONT_NEED) on sufficiently large blocks as A) a fast way of > prezeroing, B) a way of giving memory back to the host OS when it's not in > use. DONT_NEED is insufficient. It doesn't discard the data in dirty file-backed pages. Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE)) which does do the trick, and I have a UML patch which adds memory hotplug. This combination does free memory back to the host. Jeff ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 5:26 ` Jeff Dike @ 2005-11-03 5:41 ` Rob Landley 2005-11-04 3:26 ` [uml-devel] " Blaisorblade 0 siblings, 1 reply; 241+ messages in thread From: Rob Landley @ 2005-11-03 5:41 UTC (permalink / raw) To: Jeff Dike Cc: Nick Piggin, user-mode-linux-devel, Yasunori Goto, Dave Hansen, Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Wednesday 02 November 2005 23:26, Jeff Dike wrote: > On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote: > > With fragmentation reduction and prezeroing, UML suddenly gains the > > option of calling madvise(DONT_NEED) on sufficiently large blocks as A) a > > fast way of prezeroing, B) a way of giving memory back to the host OS > > when it's not in use. > > DONT_NEED is insufficient. It doesn't discard the data in dirty > file-backed pages. I thought DONT_NEED would discard the page cache, and punch was only needed to free up the disk space. I was hoping that since the file was deleted from disk and is already getting _some_ special treatment (since it's a longstanding "poor man's shared memory" hack), that madvise wouldn't flush the data to disk, but would just zero it out. A bit optimistic on my part, I know. :) > Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE)) > which does do the trick, and I have a UML patch which adds memory > hotplug. This combination does free memory back to the host. I saw it wander by, and am all for it. If it goes in, it's obviously the right thing to use. You may remember I asked about this two years ago: http://seclists.org/lists/linux-kernel/2003/Dec/0919.html And a reply indicated that SVr4 had it, but we don't. I assume the "naming discussion" mentioned in the recent thread already scrubbed through this old thread to determine that the SVr4 API was icky. http://seclists.org/lists/linux-kernel/2003/Dec/0955.html > Jeff Rob ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 5:41 ` Rob Landley
@ 2005-11-04 3:26 ` Blaisorblade
2005-11-04 15:50 ` Rob Landley
0 siblings, 1 reply; 241+ messages in thread
From: Blaisorblade @ 2005-11-04 3:26 UTC (permalink / raw)
To: user-mode-linux-devel
Cc: Rob Landley, Jeff Dike, Nick Piggin, Yasunori Goto, Dave Hansen,
Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

On Thursday 03 November 2005 06:41, Rob Landley wrote:
> On Wednesday 02 November 2005 23:26, Jeff Dike wrote:
> > On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote:
> > > With fragmentation reduction and prezeroing, UML suddenly gains the
> > > option of calling madvise(DONT_NEED) on sufficiently large blocks as A)
> > > a fast way of prezeroing, B) a way of giving memory back to the host OS
> > > when it's not in use.

> > DONT_NEED is insufficient. It doesn't discard the data in dirty
> > file-backed pages.

> I thought DONT_NEED would discard the page cache, and punch was only needed
> to free up the disk space.

This is correct, but...

> I was hoping that since the file was deleted from disk and is already
> getting _some_ special treatment (since it's a longstanding "poor man's
> shared memory" hack), that madvise wouldn't flush the data to disk, but
> would just zero it out. A bit optimistic on my part, I know. :)

I read some time ago that this optimization existed but was deemed obsolete
and removed.

Why obsolete? Because... we have tmpfs! And that's the point. With DONTNEED,
we detach references from page tables, but the content is still pinned: it
_is_ the "disk"! (And you have TMPDIR on tmpfs, right?)

> > Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE))
> > which does do the trick, and I have a UML patch which adds memory
> > hotplug. This combination does free memory back to the host.

> I saw it wander by, and am all for it. If it goes in, it's obviously the
> right thing to use.

Btw, on this side of the picture, I think fragmentation avoidance is not
needed for that. I guess you refer to using frag. avoidance on the guest (if
it matters for the host, let me know). When it is present, using it will be
nice, but currently we'd do madvise() on a page-per-page basis, and we'd do
it on non-consecutive pages (basically, free pages we either find or free on
purpose).

> You may remember I asked about this two years ago:
> http://seclists.org/lists/linux-kernel/2003/Dec/0919.html
> And a reply indicated that SVr4 had it, but we don't. I assume the "naming
> discussion" mentioned in the recent thread already scrubbed through this
> old thread to determine that the SVr4 API was icky.
> http://seclists.org/lists/linux-kernel/2003/Dec/0955.html

I assume not everybody did (even if somebody pointed out the existence of
the SVr4 API), but there was the need, in at least one usage, for a virtual
address-based API rather than a file offset based one, like the SVr4 one -
that user would need to implement backward mapping in userspace only for
this purpose, while we already have it in the kernel.

Anyway, the sys_punch() API will follow later - customers need mainly
madvise() for now.

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 3:26 ` [uml-devel] " Blaisorblade
@ 2005-11-04 15:50 ` Rob Landley
2005-11-04 17:18 ` Blaisorblade
0 siblings, 1 reply; 241+ messages in thread
From: Rob Landley @ 2005-11-04 15:50 UTC (permalink / raw)
To: user-mode-linux-devel
Cc: Blaisorblade, Jeff Dike, Nick Piggin, Yasunori Goto, Dave Hansen,
Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

On Thursday 03 November 2005 21:26, Blaisorblade wrote:
> > I was hoping that since the file was deleted from disk and is already
> > getting _some_ special treatment (since it's a longstanding "poor man's
> > shared memory" hack), that madvise wouldn't flush the data to disk, but
> > would just zero it out. A bit optimistic on my part, I know. :)
>
> I read some time ago that this optimization existed but was deemed obsolete
> and removed.
>
> Why obsolete? Because... we have tmpfs! And that's the point. With
> DONTNEED, we detach references from page tables, but the content is still
> pinned: it _is_ the "disk"! (And you have TMPDIR on tmpfs, right?)

If I had that kind of control over the environment my build would always be
deployed in (including root access), I wouldn't need UML. :)

(P.S. The default for Ubuntu "Hoary Hedgehog" is no. The only tmpfs mount is
/dev/shm, and /tmp is on / which is ext3. Yeah, I need to upgrade my
laptop...)

> I guess you refer to using frag. avoidance on the guest

Yes. Moot point since Linus doesn't want it.

> (if it matters for
> the host, let me know). When it is present, using it will be nice, but
> currently we'd do madvise() on a page-per-page basis, and we'd do it on
> non-consecutive pages (basically, free pages we either find or free on
> purpose).

Might be a performance issue if that gets introduced with per-page
granularity, and how do you avoid giving back pages we're about to re-use?
Oh well, bench it when it happens. (And in any case, it needs a tunable to
beat the page cache into submission or there's no free memory to give back.
If there's already such a tunable, I haven't found it yet.)

Rob
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:50 ` Rob Landley
@ 2005-11-04 17:18 ` Blaisorblade
2005-11-04 17:44 ` Rob Landley
0 siblings, 1 reply; 241+ messages in thread
From: Blaisorblade @ 2005-11-04 17:18 UTC (permalink / raw)
To: Rob Landley
Cc: user-mode-linux-devel, Jeff Dike, Nick Piggin, Yasunori Goto,
Dave Hansen, linux-mm, Linux Kernel Mailing List, lhms

(Note - I've removed a few CC's since there are too many of us, sorry for
any inconvenience).

On Friday 04 November 2005 16:50, Rob Landley wrote:
> On Thursday 03 November 2005 21:26, Blaisorblade wrote:
> > > I was hoping that since the file was deleted from disk and is already
> > > getting _some_ special treatment (since it's a longstanding "poor man's
> > > shared memory" hack), that madvise wouldn't flush the data to disk, but
> > > would just zero it out. A bit optimistic on my part, I know. :)
> >
> > I read some time ago that this optimization existed but was deemed
> > obsolete and removed.
> >
> > Why obsolete? Because... we have tmpfs! And that's the point. With
> > DONTNEED, we detach references from page tables, but the content is still
> > pinned: it _is_ the "disk"! (And you have TMPDIR on tmpfs, right?)
>
> If I had that kind of control over the environment my build would always be
> deployed in (including root access), I wouldn't need UML. :)

Yep, right for your case... however currently the majority of users use
tmpfs (I hope for them)...

> > I guess you refer to using frag. avoidance on the guest
>
> Yes. Moot point since Linus doesn't want it.

See lwn.net last issue (when it becomes available) on this issue. In short,
however, the real point is that we need this kind of support.

> Might be a performance issue if that gets introduced with per-page
> granularity,

I'm aware of this possibility, and I've said in fact "Frag. avoidance will
be nice to use". However I'm not sure that the system call overhead is so
big, compared to flushing the TLB entries...

But for now we don't have the issue - you don't do hotunplug frequently.
When somebody writes the auto-hotunplug management daemon we could have a
problem on this...

> and how do you avoid giving back pages we're about to re-use?

Jeff's trick is to call the buddy allocator (__get_free_pages()) to get a
full page (and it will do any needed work to free memory), so nobody else
will use it, and then madvise() it. If a better API exists, that will be
used.

> Oh well, bench it when it happens. (And in any case, it needs a tunable to
> beat the page cache into submission or there's no free memory to give back.

I couldn't parse your sentence. The allocation will free memory like when
memory is needed. However look at /proc/sys/vm/swappiness or use Con
Kolivas's patches to find new tunables and policies.

> If there's already such a tunable, I haven't found it yet.)

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 241+ messages in thread
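A guest-kernel-side sketch of the trick Blaisorblade describes: grab pages from the buddy allocator so nothing else can use them, then ask the host to discard their backing. os_discard_memory() is a hypothetical stand-in for UML's host-side madvise() call, and the missing hot-add bookkeeping is noted in a comment; this is not Jeff's actual patch.

#include <linux/gfp.h>
#include <linux/errno.h>

/* Hypothetical UML host glue: madvise() the backing of this range away. */
extern void os_discard_memory(void *addr, unsigned long len);

static int give_back_pages(unsigned int order)
{
	/* The allocator does any reclaim needed to produce the block,
	 * and once it returns, nobody else can touch these pages. */
	unsigned long addr = __get_free_pages(GFP_KERNEL, order);

	if (!addr)
		return -ENOMEM;

	/* Tell the host it may reclaim the physical memory behind the
	 * block (madvise() on the mmapped backing file). */
	os_discard_memory((void *)addr, PAGE_SIZE << order);

	/* A real implementation keeps these pages allocated until a
	 * matching hot-add; freeing them here would let the guest dirty
	 * them again immediately. */
	return 0;
}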
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:18 ` Blaisorblade
@ 2005-11-04 17:44 ` Rob Landley
0 siblings, 0 replies; 241+ messages in thread
From: Rob Landley @ 2005-11-04 17:44 UTC (permalink / raw)
To: Blaisorblade
Cc: user-mode-linux-devel, Jeff Dike, Nick Piggin, Yasunori Goto,
Dave Hansen, linux-mm, Linux Kernel Mailing List, lhms

On Friday 04 November 2005 11:18, Blaisorblade wrote:
> > Oh well, bench it when it happens. (And in any case, it needs a tunable
> > to beat the page cache into submission or there's no free memory to give
> > back.
>
> I couldn't parse your sentence. The allocation will free memory like when
> memory is needed.

If you've got a daemon running in the virtual system to hand back memory to
the host, then you don't need a tunable.

What I was thinking is that if we get prezeroing infrastructure that can use
various prezeroing accelerators (as has been discussed but I don't believe
merged), then a logical prezeroing accelerator for UML would be calling
madvise on the host system. This has the advantage of automatically giving
back to the host system any memory that's not in use, but would require some
way to tell kswapd or some such that keeping around lots of prezeroed memory
is preferable to keeping around lots of page cache.

In my case, I have a workload that can mostly work with 32-48 megs of ram,
but it spikes up to 256 at one point. Right now, I'm telling UML mem=64 megs
and then feeding it a 256 meg swap file on ubd, but this is hideously
inefficient when it actually tries to use this swap file. (And since the
host system is running a 2.6.10 kernel, there's a five minute period during
each build where things on my desktop actually freeze for 15-30 seconds at a
time. And this is on a laptop with 512 megs of ram. I think it's because the
disk is so overwhelmed, and some things (like vim's .swp file, and something
similar in kmail's composer) do a gratuitous fsync...)

> However look at /proc/sys/vm/swappiness

Setting swappiness to 0 triggers the OOM killer on 2.6.14 for a load that
completes with swappiness at 60. I mentioned this on the list a little while
ago and some people asked for copies of my test script...

> or use Con Kolivas's patches to find new tunables and policies.

The daemon you mentioned is an alternative, but I'm not quite sure how rapid
the daemon's reaction is going to be to potential OOM situations when
something suddenly wants an extra 200 megs...

Rob
^ permalink raw reply [flat|nested] 241+ messages in thread
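A tiny demonstration of the "madvise as prezeroing" idea Rob floats above: on an anonymous private mapping, MADV_DONTNEED discards the pages, and the next touch faults in fresh zero pages from the host kernel. This is standard madvise behaviour, shown here as a self-contained sketch rather than anything from a UML patch; error handling is trimmed.

#include <assert.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 1UL << 20;
	char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
	                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(mem, 0xff, size);           /* dirty the pages             */
	madvise(mem, size, MADV_DONTNEED); /* give them back to the host  */

	/* The range refaults as zero-filled pages: the host has, in
	 * effect, prezeroed it for us. */
	assert(mem[0] == 0 && mem[size - 1] == 0);

	munmap(mem, size);
	return 0;
}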
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-02 0:51 ` Nick Piggin
2005-11-02 7:42 ` Dave Hansen
@ 2005-11-02 12:38 ` Mel Gorman
2005-11-03 3:14 ` Nick Piggin
1 sibling, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-02 12:38 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

On Wed, 2 Nov 2005, Nick Piggin wrote:

> Dave Hansen wrote:
>
> > What the fragmentation patches _can_ give us is the ability to have 100%
> > success in removing certain areas: the "user-reclaimable" areas
> > referenced in the patch. This gives a customer at least the ability to
> > plan for how dynamically reconfigurable a system should be.
> >
> But the "user-reclaimable" areas can still be taken over by other
> areas which become fragmented.
>

This is true, we have worst case scenarios. With our patches though, our
assertion is that it takes a lot longer to degrade and in good scenarios
like where the workload is not using all of physical memory, we don't
degrade at all. Assuming we get page migration or active defragmentation
in the future, it will be a lot longer before they have to do any work.
As we only fragment when there is nothing else to do, page migration will
also have less work to do.

> That's like saying we can already guarantee 100% success in removing
> areas that are unfragmented and free, or freeable.
>
> > After these patches, the next logical steps are to increase the
> > knowledge that the slabs have about fragmentation, and to teach some of
> > the shrinkers about fragmentation.
> >
> I don't like all this work and complexity and overheads going into a
> partial solution.
>
> Look: if you have to guarantee memory can be shrunk, set aside a zone
> for it (that only fills with user reclaimable areas). This is better
> than the current frag patches because it will give you the 100%
> guarantee that you need (provided we have page migration to move mlocked
> pages).
>
> If you don't need a guarantee, then our current, simple system does the
> job perfectly.
>

Ok. To me, the rest of the thread are beating around the same points and
no one is giving ground. The points are made so lets summarise. Apologies
if anything is missing.

Problem
=======
Memory gets fragmented meaning that contiguous blocks of memory are not
free and not freeable no matter how much kswapd works

Impact
======
A number of different users are hit, in different ways

Physical Hotplug remove: Hotplug remove needs to be able to free a large
region of memory that is then unplugged. Different architectures have
different ways of doing this

Virtualization hotplug remove: The requirements are lighter here.
Contiguous regions from 1MiB to 64MiB (figure taken from thread) must
be freed to move the memory between virtual machines

High order allocations: With fragmentation, high order allocations fail.
Depending on the workload, kswapd could work forever and not free up a
4MiB chunk

Who cares
=========
Physical hotplug remove: Vendors of the hardware that support this -
Fujitsu, HP (I think), IBM etc

Virtualization hotplug remove: Sellers of virtualization software, some
hardware like any IBM machine that lists LPAR in its list of
features. Probably software solutions like Xen are also affected
if they want to be able to grow and shrink the virtual machines on
demand

High order allocations: Ultimately, hugepage users. Today, that is a
feature only big server users like Oracle care about.
In the future I reckon applications will be able to use them for things
like backing the heap by huge pages. Other users like GigE,
loopback devices with large MTUs, some filesystems like CIFS are
all interested although they have also been told to use smaller
pages.

Solutions
=========
Anti-defrag: This solution defines three groups of pages: KernNoRclm,
KernRclm and EasyRclm. Small sub-zone regions of size 2^(MAX_ORDER-1)
are reserved for each allocation type. If there are no large blocks
available and no reserved pages available, it falls back and begins to
fragment. This tries to delay fragmentation for as long as possible.

New Zone: Add a new zone for easyrclm only allocations. This means that
all kernel pages go in one place and all easyrclm go in another. This
solution would allow us to reclaim contiguous blocks of memory
(Note: This is basically what Solaris Kernel Cages are)

Note that I am leaving out Growing/Shrinking zone code for the moment.
While zones are currently able to get new pages with something like
memory hotadd, there is no mechanism available to move existing pages
from one zone into another. This will need planning and code. Code exists
for page migration so we can reasonably speculate about what it brings to
the table for both anti-defrag and New Zone approaches.

Pros/Cons of Solutions
======================

Anti-defrag Pros
o Aim9 shows no significant regressions (.37% on page_test). On some
tests, it shows performance gains (> 5% on fork_test)
o Stress tests show that it manages to keep fragmentation down to a far
lower level even without teaching kswapd how to linear reclaim
o Stress tests with a linear reclaim experimental patch show that it
can successfully find large contiguous chunks of memory
o It is known to help hotplug on PPC64
o No tunables. The approach tries to manage itself as much as possible
o It exists, heavily tested, and synced against the latest -mm1
o Can be compiled away by redefining the RCLM_* macros and the
__GFP_*RCLM flags

Anti-defrag Cons
o More complexity within the page allocator
o Adds a new layer onto the allocator that effectively creates subzones
o Adding a new concept that maintainers have to work with
o Depending on the workload, it fragments anyway

New Zone Pros
o Zones are a well known and understood concept
o For people that do not care about hotplug, they can easily get rid of it
o Provides reliable areas of contiguous groups that can be freed for
HugeTLB pages going to userspace
o Uses existing zone infrastructure for balancing

New Zone Cons
o Zones historically have introduced balancing problems
o Been tried for hotplug and dropped because of being awkward to work with
o It only helps hotplug and potentially HugeTLB pages for userspace
o Tunable required. If you get it wrong, the system suffers a lot
o Needs to be planned for and developed

Scenarios
=========

Let's outline some situations or workloads that can occur

1. Heavy job running that consumes 75% of physical memory. Like a kernel
build

Anti-defrag: It will not fragment as it will never have to fall back.
High order allocations will be possible in the remaining 25%.
Zone-based: After being tuned to a kernel build load, it will not
fragment. Get the tuning wrong, performance suffers or workload
fails. High order allocations will be possible in the remaining 25%.

Future work for scenario 1
Anti-defrag: No problem.
Zone-based: Tune some more if problems occur.

2. Heavy job running that needs 110% of physical memory, swap is used.
Example would be too many simultaneous kernel builds

Anti-defrag: UserRclm regions are stolen to prevent too many fallbacks.
KernNoRclm starts stealing UserRclm regions to avoid excessive
fragmentation but some fragmentation occurs. Extent depends on
the duration and heaviness of the load. High order allocations
will work if kswapd runs for long enough as it will reclaim the
UserRclm reserved areas. Your chances depend on the intensity of
KernNoRclm allocations
Zone-based: After being tuned to the new kernel build load, it will not
fragment. Get it wrong and performance suffers. High order
allocations will work if you're lucky enough to have enough
reclaimable pages together. Your chances are not good

Future work for scenario 2
Anti-defrag: kswapd would need to know how to reclaim EasyRclm pages
from the KernNoRclm, KernRclm and Fallback areas.
Zone-based: Keep tuning

3. HighMem intensive workload with CONFIG_HIGHPTE set. Example would be
a scientific job that was performing a very large calculation on an
anonymous region of memory. Possible that some desktop workloads are
like this - i.e. use large amounts of anonymous memory

Anti-defrag: For ZONE_HIGHMEM, PTEs are grouped into one area, everything
else into another, no fragmentation. HugeTLB allocations in
ZONE_HIGHMEM will work if kswapd works long enough
Zone-based: PTEs go to anywhere in ZONE_HIGHMEM. Easy-reclaimed pages go
to ZONE_HIGHMEM and ZONE_HOTREMOVABLE. ZONE_HIGHMEM fragments,
ZONE_HOTREMOVABLE does not. HugeTLB pages will be available in
ZONE_HOTREMOVABLE, but probably not in ZONE_HIGHMEM.

Future work for scenario 3
Anti-defrag: No problem. On-demand HugeTLB allocation for userspace is
possible. Would work better with linear page reclaim.
Zone-based: Depends if we care that ZONE_HIGHMEM gets fragmented. We
would only care if trying to allocate HugeTLB pages on demand
from ZONE_HIGHMEM. ZONE_HOTREMOVABLE, depending on its size,
would be possible. Linear reclaim will help ZONE_HOTREMOVABLE,
but not ZONE_HIGHMEM

4. KBuild. The main concern here is performance

Anti-defrag: May cause problems because of the .37% drop on page_test.
May cause improvements because of the 5% increase on fork_test.
No figures on kbuild available
Zone-based: No figures available. Depends heavily on being configured
correctly

Future work for scenario 4
Anti-defrag: Try and optimise the paths affected. Alternatively make
anti-defrag a configurable option by altering the values of
RCLM_* and __GFP_*RCLM. (Note, would people be interested in a
compile-time option for anti-defrag or would it make the
complexity worse for people?)
Zone-based: Tune for performance or compile away the zone

5. Physically unplug 25% of physical memory

Anti-defrag: Memory in the region gets reclaimed if it's EasyRclm.
Possibly will encounter awkward pages. Known that PPC64 has some
success. Fujitsu use nodes for hotplug; they would need to adjust
the fallbacks to be fully reliable
Zone-based: If we are unplugging the right zone, reclaim the pages.
Possibly will encounter awkward pages (only mlock in this case)

Future work for scenario 5
Anti-defrag: fallback_allocs would be needed for each node for Fujitsu's
approach to be in any way reliable. Ability to move awkward pages
around. For 100% success, ability to move kernel pages
Zone-based: Ability to move awkward pages around. There is no 100%
success scenario here. You remove the ZONE_HOTREMOVABLE area or
you turn the machine off.

6.
Fsck a large filesystem (known to be a KernNoRclm heavy workload)

Anti-defrag: Memory fragments, but fsck is a short-lived kernel heavy
workload. It is also known that free blocks reappear through the
address space when it finishes. Contiguous blocks may appear in
the middle of the zone rather than either end.
Zone-based: If misconfigured, performance degrades. As a machine will
not be tuned for fsck, chances of degrading are pretty high. On
the other hand, fsck is something people can wait for

Future work for scenario 6
Anti-defrag: Ideally, in case of fallbacks, page migration would move
awkward pages out of UserRclm areas
Zone-based: Keep tuning if you run into problems

Let's say we agree on a way that ZONE_HOTREMOVABLE can be shrunk in such
a way as to give pages to ZONE_NORMAL and ZONE_HIGHMEM as necessary (and
we have to be able to handle both), Situations 2 and 6 change. Note that
this changing of zone sizes brings all the problems from the anti-defrag
approach to the zone-based approach.

2a. Heavy job running that needs 110% of physical memory, swap is used.

Anti-defrag: UserRclm regions are stolen to prevent too many fallbacks.
KernNoRclm starts stealing UserRclm regions to avoid excessive
fragmentation but some fragmentation occurs. Extent depends on
the duration and heaviness of the load.
Zone-based: ZONE_NORMAL grows into ZONE_HOTREMOVABLE. The zone cannot be
shrunk so ZONE_NORMAL fragments as normal.

Future work for scenario 2a
Anti-defrag: kswapd would need to know how to clean EasyRclm pages from
the KernNoRclm, KernRclm and Fallback reserved areas. When load
drops off, regions will get reserved again for EasyRclm.
Contiguous blocks will become available whenever possible, be it
at the beginning, middle or end of the zone. Page migration would
help fix up single kernel pages left in EasyRclm areas.
Zone-based: Page migration would need to move pages from the end of the
zone so it could be shrunk.

6a. Fsck

Anti-defrag: Memory fragments, but fsck is a short-lived kernel heavy
workload. It is also known that free blocks reappear through the
address space when it finishes. Once the free blocks appear, they
get reserved for the different allocation types on demand and
business continues as usual
Zone-based: ZONE_NORMAL grows into ZONE_HOTREMOVABLE. No mechanism to
shrink it so it doesn't recover

Future work for scenario 6a
Anti-defrag: Same as for Situation 2. kswapd would need to know how to
clean UserRclm pages from the KernNoRclm, KernRclm and Fallback
reserved areas.
Zone-based: Same as for 2a. Page migration would need to move pages from
the end of the zone so it could be shrunk

I've tried to be as objective as possible with the summary.

From the points above though, I think that anti-defrag gets us a lot of
the way, with the complexity isolated in one place. Its downside is that
it can still break down and future work is needed to stop it degrading
(kswapd cleaning UserRclm areas and page migration when we get really
stuck). Zone-based is more reliable but only addresses a limited
situation, principally hotplug and it does not even go 100% of the way for
hotplug. It also depends on a tunable which is not cool and it is static.
If we make the zones growable+shrinkable, we run into all the same
problems that anti-defrag has today.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 241+ messages in thread
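A minimal sketch of how callers' GFP flags could map to the three page groups the summary names, using the flag and type names that appear in the thread (__GFP_EASYRCLM, __GFP_KERNRCLM, RCLM_*). The bit values and the helper itself are illustrative assumptions, not code lifted from the patches; in the described design the returned type then selects which reserved 2^(MAX_ORDER-1) region the allocation is placed in, falling back only when the type's own regions are exhausted.

#define __GFP_EASYRCLM  0x80000u    /* illustrative bit values */
#define __GFP_KERNRCLM  0x100000u

enum rclm_type { RCLM_NORCLM, RCLM_KERN, RCLM_EASY };

static inline enum rclm_type gfp_to_rclm(unsigned int gfp_mask)
{
	if (gfp_mask & __GFP_EASYRCLM)
		return RCLM_EASY;    /* user pages: easily reclaimed or migrated */
	if (gfp_mask & __GFP_KERNRCLM)
		return RCLM_KERN;    /* reclaimable kernel allocations, e.g. caches */
	return RCLM_NORCLM;          /* everything else: effectively pinned */
}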
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman
@ 2005-11-03 3:14 ` Nick Piggin
2005-11-03 12:19 ` Mel Gorman
2005-11-03 15:34 ` Martin J. Bligh
0 siblings, 2 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-03 3:14 UTC (permalink / raw)
To: Mel Gorman
Cc: Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

Mel Gorman wrote:

>
> Ok. To me, the rest of the thread are beating around the same points and
> no one is giving ground. The points are made so lets summarise. Apologies
> if anything is missing.
>

Thanks for attempting a summary of a difficult topic. I have a couple
of suggestions.

> Who cares
> =========
> Physical hotplug remove: Vendors of the hardware that support this -
> Fujitsu, HP (I think), IBM etc
>
> Virtualization hotplug remove: Sellers of virtualization software, some
> hardware like any IBM machine that lists LPAR in its list of
> features. Probably software solutions like Xen are also affected
> if they want to be able to grow and shrink the virtual machines on
> demand
>

Ingo said that Xen is fine with per page granular freeing - this covers
embedded, desktop and small server users of VMs into the future I'd say.

> High order allocations: Ultimately, hugepage users. Today, that is a
> feature only big server users like Oracle care about. In the
> future I reckon applications will be able to use them for things
> like backing the heap by huge pages. Other users like GigE,
> loopback devices with large MTUs, some filesystems like CIFS are
> all interested although they have also been told to use smaller
> pages.
>

I think that saying it's now OK to use higher order allocations is wrong
because as I said even with your patches they are going to run into
problems.

Actually I think one reason your patches may perform so well is because
there aren't actually a lot of higher order allocations in the kernel.

I think that probably leaves us realistically with demand hugepages,
hot unplug memory, and IBM lpars?

> Pros/Cons of Solutions
> ======================
>
> Anti-defrag Pros
> o Aim9 shows no significant regressions (.37% on page_test). On some
> tests, it shows performance gains (> 5% on fork_test)
> o Stress tests show that it manages to keep fragmentation down to a far
> lower level even without teaching kswapd how to linear reclaim

This sounds like a kind of funny test to me if nobody is actually
using higher order allocations.

When a higher order allocation is attempted, either you will satisfy
it from the kernel region, in which case the vanilla kernel would
have done the same. Or you satisfy it from an easy-reclaim contiguous
region, in which case it is no longer an easy-reclaim contiguous
region.

> o Stress tests with a linear reclaim experimental patch show that it
> can successfully find large contiguous chunks of memory
> o It is known to help hotplug on PPC64
> o No tunables.
> The approach tries to manage itself as much as possible

But it has more dreaded heuristics :P

> o It exists, heavily tested, and synced against the latest -mm1
> o Can be compiled away by redefining the RCLM_* macros and the
> __GFP_*RCLM flags
>
> Anti-defrag Cons
> o More complexity within the page allocator
> o Adds a new layer onto the allocator that effectively creates subzones
> o Adding a new concept that maintainers have to work with
> o Depending on the workload, it fragments anyway
>
> New Zone Pros
> o Zones are a well known and understood concept
> o For people that do not care about hotplug, they can easily get rid of it
> o Provides reliable areas of contiguous groups that can be freed for
> HugeTLB pages going to userspace
> o Uses existing zone infrastructure for balancing
>
> New Zone Cons
> o Zones historically have introduced balancing problems
> o Been tried for hotplug and dropped because of being awkward to work with
> o It only helps hotplug and potentially HugeTLB pages for userspace
> o Tunable required. If you get it wrong, the system suffers a lot

Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let them
get it right.

> o Needs to be planned for and developed
>

Yasunori Goto had patches around from last year. Not sure what sort of
shape they're in now but I'd think most of the hard work is done.

> Scenarios
> =========
>
> Let's outline some situations or workloads that can occur
>
> 1. Heavy job running that consumes 75% of physical memory. Like a kernel
> build
>
> Anti-defrag: It will not fragment as it will never have to fall back.
> High order allocations will be possible in the remaining 25%.
> Zone-based: After being tuned to a kernel build load, it will not
> fragment. Get the tuning wrong, performance suffers or workload
> fails. High order allocations will be possible in the remaining 25%.
>

You don't need to continually tune things for each and every possible
workload under the sun. It is like how we currently drive 16GB highmem
systems quite nicely under most workloads with 1GB of normal memory.

Make that an 8:1 ratio if you're worried.

[snip]

>
> I've tried to be as objective as possible with the summary.
>
> From the points above though, I think that anti-defrag gets us a lot of
> the way, with the complexity isolated in one place. Its downside is that
> it can still break down and future work is needed to stop it degrading
> (kswapd cleaning UserRclm areas and page migration when we get really
> stuck). Zone-based is more reliable but only addresses a limited
> situation, principally hotplug and it does not even go 100% of the way for
> hotplug.

To me it seems like it solves the hotplug, lpar hotplug, and hugepages
problems which seem to be the main ones.

> It also depends on a tunable which is not cool and it is static.

I think it is very cool because it means the tiny minority of Linux users
who want this can do so without impacting the rest of the code or users.

This is how Linux has been traditionally run and I still have a tiny bit
of faith left :)

> If we make the zones growable+shrinkable, we run into all the same
> problems that anti-defrag has today.
>

But we don't have the extra zones layer that anti-defrag has today.

And anti-defrag needs limits if it is to be reliable anyway.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-03 3:14 ` Nick Piggin
@ 2005-11-03 12:19 ` Mel Gorman
2005-11-10 18:47 ` Steve Lord
1 sibling, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-03 12:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Thu, 3 Nov 2005, Nick Piggin wrote:

> Mel Gorman wrote:
> >
> > Ok. To me, the rest of the thread is beating around the same points and no one is giving ground. The points are made, so let's summarise. Apologies if anything is missing.
> >
>
> Thanks for attempting a summary of a difficult topic. I have a couple of suggestions.
>
> > Who cares
> > =========
> > Physical hotplug remove: Vendors of the hardware that support this - Fujitsu, HP (I think), IBM etc
> >
> > Virtualization hotplug remove: Sellers of virtualization software, some hardware like any IBM machine that lists LPAR in its list of features. Probably software solutions like Xen are also affected if they want to be able to grow and shrink the virtual machines on demand
> >
>
> Ingo said that Xen is fine with per-page granular freeing - this covers embedded, desktop and small server users of VMs into the future, I'd say.
>

Ok, hard to argue with that.

> > High order allocations: Ultimately, hugepage users. Today, that is a feature only big server users like Oracle care about. In the future I reckon applications will be able to use them for things like backing the heap by huge pages. Other users like GigE, loopback devices with large MTUs, some filesystems like CIFS are all interested, although they are also being told to use smaller pages.
> >
>
> I think that saying it's now OK to use higher order allocations is wrong because, as I said, even with your patches they are going to run into problems.
>

Ok, I have not denied that they will run into problems. I have asserted that, with more work built upon these patches, we can grant large pages with a good degree of reliability. Subsystems should still use small orders whenever possible and, at the very least, large orders should be short-lived.

For userspace users, I would like to move towards better availability of huge pages without requiring the boot-time tunables that are required today. Do we agree that this would be useful at least for a few different users?

HugeTLB user 1: Today's users of hugetlbfs like big databases etc
HugeTLB user 2: HPC jobs that run with sparse data sets
HugeTLB user 3: Desktop applications that use large amounts of address space.

I got a mail from a user of category 2. He said I can quote his email, but he didn't say I could quote his name, which is inconvenient, but I'm sure he has good reasons.

To him, low fragmentation is "critical, at least in HPC environments". Here is the core of his issue:

--- excerpt ---
Take the scenario that you have a large machine that is used by multiple users, and the usage is regulated by a batch scheduler. Loadleveler on ibm's for example. PBS on many others. Both appear to be available in linux environments.

In the case of my codes, I find that having large pages is extremely beneficial to my run times. As in factors of several, modulo things that I've coded in by hand to try and avoid the issues. I don't think my code is in any way unusual in this magnitude of improvement.
--- excerpt ---

Ok, so we have two potential solutions, anti-defrag and zones. We don't need to rehash the pros and cons. With zones, we just say "just reclaim the easy reclaim zone, alloc your pages and away we go".

Now, his problem is that the server is not restarted between jobs, and jobs take days and weeks to complete. The system administrators will not restart the machine, so getting it to a pristine state is a difficulty. The state he gets the system in is the state he works with, and with fragmentation, he doesn't get large pages unless he is lucky enough to be the first user of the machine.

With the zone approach, we would just be saying "tune it". Here is what he says about that:

--- excerpt ---
I specifically *don't* want things that I have to beg sysadmins to tune correctly. They won't get it right because there is no `right' that is right for everyone. They won't want to change it and it won't work besides. Been there, done that. My experience is that with linux so far, and some other non-linux machines too, they always turn all the page stuff off because it breaks the machine.
--- excerpt ---

This is an example of a real user for whom "tune the size of your zone correctly" is just not good enough. He makes a novel suggestion on how anti-defrag + hotplug could be used.

--- excerpt ---
In the context of hotplug stuff and fragmentation avoidance, this sort of reset would be implemented by performing the first step in the hot unplug, to migrate everything off of that memory, including whatever kernel pages that exist there, but not the second step. Just leave that memory plugged in and reset the memory to a sane initial state. Essentially this would be some sort of pseudo hotunplug followed by a pseudo hotplug of that memory.
--- excerpt ---

I'm pretty sure this is not what hotplug was aimed at, but it would get him what he wants: large pages via echo BigNumber > nr_hugepages at the least. It also needs hotplug remove to be working for some banks and regions of memory, although not the 100% case.

Ok, this is one example of a user of scientific workloads for whom "tune the size of the zone" just is not good enough. The admins won't do it for him because it'll just break for the next scheduled job.

> Actually, I think one reason your patches may perform so well is that there aren't actually a lot of higher order allocations in the kernel.
>
> I think that probably leaves us realistically with demand hugepages, hot unplug memory, and IBM lpars?
>

> > Pros/Cons of Solutions
> > ======================
> >
> > Anti-defrag Pros
> > o Aim9 shows no significant regressions (.37% on page_test). On some tests, it shows performance gains (> 5% on fork_test)
> > o Stress tests show that it manages to keep fragmentation down to a far lower level even without teaching kswapd how to linear reclaim
>
> This sounds like a kind of funny test to me if nobody is actually using higher order allocations.
>

No one uses them because they always fail. This is a chicken and egg problem.

> When a higher order allocation is attempted, either you will satisfy it from the kernel region, in which case the vanilla kernel would have done the same. Or you satisfy it from an easy-reclaim contiguous region, in which case it is no longer an easy-reclaim contiguous region.
>

Right, but right now, we say "don't use high order allocations ever". With work, we'll be saying "ok, use high order allocations, but they should be short-lived or you won't be allocating them for long".

> > o Stress tests with a linear reclaim experimental patch show that it can successfully find large contiguous chunks of memory
> > o It is known to help hotplug on PPC64
> > o No tunables. The approach tries to manage itself as much as possible
>
> But it has more dreaded heuristics :P
>

Yeah, but if it gets them wrong, the system chugs along anyway, just fragmented like it is today. If the zone-based approach gets it wrong, the system goes down the tubes.

At very worst, the patches give a kernel allocator that is as good as today's. At very worst, the zone-based approach makes an unusable system. The performance of the patches is another story. I've been posting aim9 figures based on my test machine. I'm trying to kick an ancient PowerPC 43P Model 150 machine into working order. This machine is a different architecture and ancient (I found it on the way to a skip), so it should give different figures.

> > o It exists, is heavily tested, and synced against the latest -mm1
> > o Can be compiled away by redefining the RCLM_* macros and the __GFP_*RCLM flags
> >
> > Anti-defrag Cons
> > o More complexity within the page allocator
> > o Adds a new layer onto the allocator that effectively creates subzones
> > o Adds a new concept that maintainers have to work with
> > o Depending on the workload, it fragments anyway
> >
> > New Zone Pros
> > o Zones are a well known and understood concept
> > o People that do not care about hotplug can easily get rid of it
> > o Provides reliable areas of contiguous groups that can be freed for HugeTLB pages going to userspace
> > o Uses existing zone infrastructure for balancing
> >
> > New Zone Cons
> > o Zones historically have introduced balancing problems
> > o Been tried for hotplug and dropped because it was awkward to work with
> > o It only helps hotplug and potentially HugeTLB pages for userspace
> > o Tunable required. If you get it wrong, the system suffers a lot
>
> Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let them get it right.
>

Unless you work in a place where the sysadmins will tell you to go away, such as the HPC user above. I'm not a sysadmin, but I'm pretty sure they have better things to do than twiddle a tunable all day.

> > o Needs to be planned for and developed
> >
>
> Yasunori Goto had patches around from last year. Not sure what sort of shape they're in now but I'd think most of the hard work is done.
>

But Yasunori (thanks for sending the links) himself said when he posted:

--- excerpt ---
Another one was a bit similar to Mel-san's one. One of the motivations of this patch was to create an orthogonal relationship between Removable and DMA/Normal/Highmem. I thought it was desirable, because ppc64 can treat all of memory as the same (DMA) zone. I thought that a new zone spoiled its good feature.
--- excerpt ---

He thought that the new zone removed the ability of some architectures to treat all memory the same. My patches give some of the benefits of using another zone while still preserving an architecture's ability to treat all memory the same.

> > Scenarios
> > =========
> >
> > Let's outline some situations or workloads that can occur
> >
> > 1. Heavy job running that consumes 75% of physical memory. Like a kernel build
> >
> > Anti-defrag: It will not fragment as it will never have to fall back. High order allocations will be possible in the remaining 25%.
> > Zone-based: After being tuned to a kernel build load, it will not fragment. Get the tuning wrong, and performance suffers or the workload fails. High order allocations will be possible in the remaining 25%.
> >
>
> You don't need to continually tune things for each and every possible workload under the sun. It is like how we currently drive 16GB highmem systems quite nicely under most workloads with 1GB of normal memory. Make that an 8:1 ratio if you're worried.
>
> [snip]
>
> > I've tried to be as objective as possible with the summary.
> >
> > From the points above though, I think that anti-defrag gets us a lot of the way, with the complexity isolated in one place. Its downside is that it can still break down, and future work is needed to stop it degrading (kswapd cleaning UserRclm areas and page migration when we get really stuck). Zone-based is more reliable but only addresses a limited situation, principally hotplug, and it does not even go 100% of the way for hotplug.
>
> To me it seems like it solves the hotplug, lpar hotplug, and hugepages problems, which seem to be the main ones.
>
> > It also depends on a tunable, which is not cool, and it is static.
>
> I think it is very cool because it means the tiny minority of Linux users who want this can do so without impacting the rest of the code or users. This is how Linux has been traditionally run and I still have a tiny bit of faith left :)
>

The impact of the code on users will depend on benchmarks. I've posted benchmarks that show there are either very small regressions or else there are performance gains. As I write this, some of the aim9 benchmarks have completed on the PowerPC.

This is a comparison between 2.6.14-rc5-mm1 and 2.6.14-rc5-mm1-mbuddy-v19-defragDisabledViaConfig

1 creat-clo     73500.00   72504.58    -995.42  -1.35% File Creations and Closes/second
2 page_test     30806.13   31076.49     270.36   0.88% System Allocations & Pages/second
3 brk_test     335299.02  341926.35    6627.33   1.98% System Memory Allocations/second
4 jmp_test    1641733.33 1644566.67    2833.34   0.17% Non-local gotos/second
5 signal_test  100883.19   98900.18   -1983.01  -1.97% Signal Traps/second
6 exec_test       116.53     118.44       1.91   1.64% Program Loads/second
7 fork_test       751.70     746.84      -4.86  -0.65% Task Creations/second
8 link_test     30217.11   30463.82     246.71   0.82% Link/Unlink Pairs/second

Performance gains on page_test, brk_test and exec_test. Even with variances between tests, we are looking at "more or less the same", not regressions. No user impact there.

This is a comparison between 2.6.14-rc5-mm1 and 2.6.14-rc5-mm1-mbuddy-v19-withantidefrag

1 creat-clo     73500.00   71188.14   -2311.86  -3.15% File Creations and Closes/second
2 page_test     30806.13   31060.96     254.83   0.83% System Allocations & Pages/second
3 brk_test     335299.02  344361.15    9062.13   2.70% System Memory Allocations/second
4 jmp_test    1641733.33 1627228.80  -14504.53  -0.88% Non-local gotos/second
5 signal_test  100883.19  100233.33    -649.86  -0.64% Signal Traps/second
6 exec_test       116.53     117.63       1.10   0.94% Program Loads/second
7 fork_test       751.70     763.73      12.03   1.60% Task Creations/second
8 link_test     30217.11   30322.10     104.99   0.35% Link/Unlink Pairs/second

Performance gains on page_test, brk_test, exec_test and fork_test. Not bad going for complex overhead. creat-clo took a beating, but what workload opens and closes files at that rate?

This is an old, small machine. If I hotplug this, I'll be lucky if it ever turns on again. The aim9 benchmarks on two machines show that there is similar and, in some cases, better performance with these patches. If a workload does suffer badly, an additional patch has been supplied that disables anti-defrag. A run in -mm will tell us whether this is the general case for machines or whether my two test boxes are running on magic beans.

So, the small number of users that want this get this. The rest of the users, who just run the code, should not notice or care. This brings us back to the main stickler, code complexity. I think that the code has been very well isolated from the core allocator code, and people looking at the allocator could avoid it if they really wanted while still knowing what the buddy allocator was doing.

> > If we make the zones growable+shrinkable, we run into all the same problems that anti-defrag has today.
> >
>
> But we don't have the extra zones layer that anti defrag has today.
>

So, we just have an extra layer on the side that has to be configured. All of the problems, with all of the configuration.

> And anti defrag needs limits if it is to be reliable anyway.
>

I'm confident, given time, that I can make this manage itself with a very good degree of reliability.

--
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
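The "echo BigNumber > nr_hugepages" step Mel mentions uses the existing /proc/sys/vm/nr_hugepages tunable; a minimal userspace sketch of the same operation (the helper name is invented for illustration):

  #include <stdio.h>

  /* Ask the kernel to grow its static hugepage pool to n pages. */
  int set_nr_hugepages(long n)
  {
          FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

          if (!f)
                  return -1;
          fprintf(f, "%ld\n", n);
          return fclose(f);
  }

Whether the kernel can actually grow the pool to n is exactly the fragmentation question of this thread; the write succeeds even if fewer contiguous pages could be reserved, so a caller has to read the value back to know what it got.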
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-03 12:19 ` Mel Gorman
@ 2005-11-10 18:47 ` Steve Lord
0 siblings, 0 replies; 241+ messages in thread
From: Steve Lord @ 2005-11-10 18:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

Flogging a dead horse here maybe, I missed this whole thread when it was live, and someone may already have covered this.

Another reason for avoiding memory fragmentation, which may have been lost in the discussion, is avoiding scatter/gather in I/O. The block layer now has the smarts to join physically contiguous pages into a single scatter/gather element. It always had the smarts to deal with I/O from lots of small chunks of memory and let the hardware do the work of reassembling it. This does not come for free though.

I have come across situations where a raid controller gets cpu bound dealing with I/O from Linux, but not from Windows. The reason being that Windows seems to manage to present the same amount of memory in fewer scatter/gather entries. Because the number of DMA elements is another limiting factor, Windows also managed to submit larger individual requests. Once Linux reaches steady state, it ends up submitting one page per scatter/gather entry.

OK, if you are going via the page cache, then this is not going to mean anything unless the idea of having PAGE_CACHE_SIZE > PAGE_SIZE gets dusted off. However, for direct userspace <-> disk I/O, having the address space of a process be more physically contiguous could help here. Specifically allocated huge pages are another way to achieve this, but that does require special coding in an app to do it.

I'll go back to my day job now ;-)

Steve

Mel Gorman wrote:
> [snip]
^ permalink raw reply [flat|nested] 241+ messages in thread
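The merge decision Steve describes boils down to physical adjacency. A simplified sketch, not the actual block layer code, which also enforces segment-size and DMA boundary limits:

  /*
   * Two pages can share one scatter/gather element only when they
   * are physically adjacent.
   */
  static inline int sg_can_merge(struct page *a, struct page *b)
  {
          return page_to_phys(a) + PAGE_SIZE == page_to_phys(b);
  }

With a fragmented address space this test almost always fails, which is how Linux ends up at one page per scatter/gather entry in steady state.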
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-03 3:14 ` Nick Piggin
2005-11-03 12:19 ` Mel Gorman
@ 2005-11-03 15:34 ` Martin J. Bligh
1 sibling, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:34 UTC (permalink / raw)
To: Nick Piggin, Mel Gorman
Cc: Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

>> Physical hotplug remove: Vendors of the hardware that support this - Fujitsu, HP (I think), IBM etc
>>
>> Virtualization hotplug remove: Sellers of virtualization software, some hardware like any IBM machine that lists LPAR in its list of features. Probably software solutions like Xen are also affected if they want to be able to grow and shrink the virtual machines on demand
>
> Ingo said that Xen is fine with per-page granular freeing - this covers embedded, desktop and small server users of VMs into the future, I'd say.

Not using large page mappings for the kernel area will be a substantial performance hit. It's a less efficient approach inside the hypervisor, and not all VMs / hardware can support it.

>> High order allocations: Ultimately, hugepage users. Today, that is a feature only big server users like Oracle care about. In the future I reckon applications will be able to use them for things like backing the heap by huge pages. Other users like GigE, loopback devices with large MTUs, some filesystems like CIFS are all interested, although they are also being told to use smaller pages.
>
> I think that saying it's now OK to use higher order allocations is wrong because, as I said, even with your patches they are going to run into problems.
>
> Actually, I think one reason your patches may perform so well is that there aren't actually a lot of higher order allocations in the kernel.
>
> I think that probably leaves us realistically with demand hugepages, hot unplug memory, and IBM lpars?

Sigh. You seem obsessed with this. There are various critical places in the kernel that use higher order allocations. Yes, they're normally smaller ones rather than larger ones, but .... please try re-reading the earlier portions of this thread. You are NOT going to be able to get rid of all higher-order allocations - please quit pretending you can - living in denial is not going to help us.

If you really, really believe you can do that, please go ahead and prove it. Until that point, please let go of the "it's only for a few specialized users" argument, and acknowledge we DO actually use higher order allocs in the kernel right now.

>> o Aim9 shows no significant regressions (.37% on page_test). On some tests, it shows performance gains (> 5% on fork_test)
>> o Stress tests show that it manages to keep fragmentation down to a far lower level even without teaching kswapd how to linear reclaim
>
> This sounds like a kind of funny test to me if nobody is actually using higher order allocations.

It's a regression test. To, like, test for regressions in the normal case ;-)

>> New Zone Cons
>> o Zones historically have introduced balancing problems
>> o Been tried for hotplug and dropped because it was awkward to work with
>> o It only helps hotplug and potentially HugeTLB pages for userspace
>> o Tunable required. If you get it wrong, the system suffers a lot
>
> Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let them get it right.

Having met some of them ... that's not a pro ;-) We have quite enough meaningless tunables already. And to be honest, the bigger problem is that it's a problem with no correct answer - workloads shift day vs. night, etc.

> You don't need to continually tune things for each and every possible workload under the sun. It is like how we currently drive 16GB highmem systems quite nicely under most workloads with 1GB of normal memory. Make that an 8:1 ratio if you're worried.

Thanks for turning my 64-bit system back into a 32-bit one. Really appreciate that. Note the last 5 years of endless whining about all the problems with large 32-bit systems, and how they're unfixable and we should all move to 64-bit, please.

> To me it seems like it solves the hotplug, lpar hotplug, and hugepages problems, which seem to be the main ones.

That's because you're not listening; you're going on your own preconceived notions ...

> I think it is very cool because it means the tiny minority of Linux users who want this can do so without impacting the rest of the code or users.

Ditto.

M.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 13:56 ` Ingo Molnar
2005-11-01 14:10 ` Dave Hansen
@ 2005-11-01 14:41 ` Mel Gorman
2005-11-01 14:46 ` Ingo Molnar
` (2 more replies)
2005-11-01 18:23 ` Rob Landley
2 siblings, 3 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-01 14:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel

On Tue, 1 Nov 2005, Ingo Molnar wrote:

> * Mel Gorman <mel@csn.ul.ie> wrote:
>
> > The set of patches do fix a lot and make a strong start at addressing the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation problem', all the time, in a 100% way, now or in the future?
>

Not now, but I expect to reach 100% on demand in the future for all but GFP_ATOMIC and GFP_NOFS allocations. As GFP_ATOMIC and GFP_NOFS cannot do any reclaim work themselves, they will still be required to use smaller orders or private pools that are refilled using GFP_KERNEL if necessary. The high order pages would have to be reclaimed by another process like kswapd, just like what happens for order-0 pages today.

> > So, with this set of patches, how fragmented you get is dependent on the workload and it may still break down and high order allocations will fail. But the current situation is that it will definitely break down. The fact is that it has been reported that memory hotplug remove works with these patches and doesn't without them. Granted, this is just one feature on a high-end machine, but it is one solid operation we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance, hot unplug RAM with these patches applied? If not, do you have any expectation to reach 100%?
>

No, you cannot guarantee hot unplug of RAM with these patches applied. Anecdotal evidence suggests your chances are better on PPC64, which is a start, but we have to start somewhere. The full 100% solution would be a large set of far-reaching patches that would touch a lot of the memory manager. That would get rejected, because the patches should have arrived piecemeal. These patches are one piece. To reach 100%, other mechanisms are also needed, such as;

o Page migration to move unreclaimable pages like mlock()ed pages or kernel pages that have fallen back into easy-reclaim areas. A mechanism would also be needed to move things like kernel text. I think the memory hotplug tree has done a lot of work here
o A mechanism for taking regions of memory offline. Again, I think the memory hotplug crowd have something for this. If they don't, one of them will chime in.
o Linear page reclaim that linearly scans a region of memory and reclaims or moves all the pages in it. I have a proof-of-concept patch that does the linear scan and reclaim, but it's currently ugly and depends on this set of patches being applied.

These patches are the *starting* point that other things like linear page reclaim can be based on.

--
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
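As a rough outline of the linear page reclaim idea in the last bullet: walk a physical range and push every in-use page out. In the sketch below, try_to_reclaim_or_migrate() is a hypothetical stand-in for the real reclaim/migration machinery, and the proof-of-concept patch mentioned above certainly differs:

  /*
   * Outline only: linear reclaim over [start_pfn, start_pfn + nr_pages).
   * try_to_reclaim_or_migrate() is a hypothetical helper.
   */
  static int linear_reclaim(unsigned long start_pfn, unsigned long nr_pages)
  {
          unsigned long pfn;

          for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
                  struct page *page = pfn_to_page(pfn);

                  if (!page_count(page))
                          continue;       /* already free */
                  if (try_to_reclaim_or_migrate(page))
                          return -EAGAIN; /* pinned page; range not freeable */
          }
          return 0;
  }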
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
@ 2005-11-01 14:46 ` Ingo Molnar
2005-11-01 15:23 ` Mel Gorman
2005-11-01 18:33 ` Rob Landley
2005-11-01 14:50 ` Dave Hansen
2005-11-02 5:11 ` Andrew Morton
2 siblings, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-01 14:46 UTC (permalink / raw)
To: Mel Gorman
Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel

* Mel Gorman <mel@csn.ul.ie> wrote:

> [...] The full 100% solution would be a large set of far-reaching patches that would touch a lot of the memory manager. That would get rejected, because the patches should have arrived piecemeal. These patches are one piece. To reach 100%, other mechanisms are also needed, such as;
>
> o Page migration to move unreclaimable pages like mlock()ed pages or kernel pages that have fallen back into easy-reclaim areas. A mechanism would also be needed to move things like kernel text. I think the memory hotplug tree has done a lot of work here
> o A mechanism for taking regions of memory offline. Again, I think the memory hotplug crowd have something for this. If they don't, one of them will chime in.
> o Linear page reclaim that linearly scans a region of memory and reclaims or moves all the pages in it. I have a proof-of-concept patch that does the linear scan and reclaim, but it's currently ugly and depends on this set of patches being applied.

how will the 100% solution handle a simple kmalloc()-ed kernel buffer that is pinned down, and to/from which live pointers may exist? That alone can prevent RAM from being removable.

Ingo

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:46 ` Ingo Molnar
@ 2005-11-01 15:23 ` Mel Gorman
2005-11-01 18:33 ` Rob Landley
1 sibling, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-01 15:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel

On Tue, 1 Nov 2005, Ingo Molnar wrote:

> * Mel Gorman <mel@csn.ul.ie> wrote:
>
> > [...] The full 100% solution would be a large set of far-reaching patches that would touch a lot of the memory manager. That would get rejected, because the patches should have arrived piecemeal. These patches are one piece. To reach 100%, other mechanisms are also needed, such as;
> >
> > o Page migration to move unreclaimable pages like mlock()ed pages or kernel pages that have fallen back into easy-reclaim areas. A mechanism would also be needed to move things like kernel text. I think the memory hotplug tree has done a lot of work here
> > o A mechanism for taking regions of memory offline. Again, I think the memory hotplug crowd have something for this. If they don't, one of them will chime in.
> > o Linear page reclaim that linearly scans a region of memory and reclaims or moves all the pages in it. I have a proof-of-concept patch that does the linear scan and reclaim, but it's currently ugly and depends on this set of patches being applied.
>
> how will the 100% solution handle a simple kmalloc()-ed kernel buffer that is pinned down, and to/from which live pointers may exist? That alone can prevent RAM from being removable.
>

It would require the page to have its virtual->physical mapping changed in the pagetables of each running process and in the master page table. That would be another step on the road to 100% support.

--
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
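In outline, the operation Mel describes might look like the following; every helper here is hypothetical, and the hard part, rewriting all mappings atomically while keeping TLBs coherent, is hidden inside them:

  /* Outline only: relocate a pinned kernel page. */
  static int move_pinned_kernel_page(struct page *old, struct page *new)
  {
          copy_page(page_address(new), page_address(old));
          if (rewrite_all_mappings(old, new))     /* hypothetical helper */
                  return -EBUSY;
          flush_tlb_all();                        /* no stale translations */
          return 0;
  }

Even this outline sidesteps Ingo's real objection: live pointers hold the old physical address through the kernel's direct mapping, so the copy is only safe if every such mapping can be found and rewritten.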
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 14:46 ` Ingo Molnar 2005-11-01 15:23 ` Mel Gorman @ 2005-11-01 18:33 ` Rob Landley 2005-11-01 19:02 ` Ingo Molnar 1 sibling, 1 reply; 241+ messages in thread From: Rob Landley @ 2005-11-01 18:33 UTC (permalink / raw) To: Ingo Molnar Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel On Tuesday 01 November 2005 08:46, Ingo Molnar wrote: > how will the 100% solution handle a simple kmalloc()-ed kernel buffer, > that is pinned down, and to/from which live pointers may exist? That > alone can prevent RAM from being removable. Would you like to apply your "100% or nothing" argument to the virtual memory management subsystem and see how it sounds in that context? (As an argument that we shouldn't _have_ one?) > Ingo Rob ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 18:33 ` Rob Landley @ 2005-11-01 19:02 ` Ingo Molnar 0 siblings, 0 replies; 241+ messages in thread From: Ingo Molnar @ 2005-11-01 19:02 UTC (permalink / raw) To: Rob Landley Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel * Rob Landley <rob@landley.net> wrote: > On Tuesday 01 November 2005 08:46, Ingo Molnar wrote: > > how will the 100% solution handle a simple kmalloc()-ed kernel buffer, > > that is pinned down, and to/from which live pointers may exist? That > > alone can prevent RAM from being removable. > > Would you like to apply your "100% or nothing" argument to the virtual > memory management subsystem and see how it sounds in that context? > (As an argument that we shouldn't _have_ one?) that would be comparing apples to oranges. There is a big difference between "VM failures under high load", and "failure of VM functionality for no user-visible reason". The fragmentation problem here has nothing to do with pathological workloads. It has to do with 'unlucky' allocation patterns that pin down RAM areas which thus become non-removable. The RAM module will be non-removable for no user-visible reason. Possible under zero load, and with lots of free RAM otherwise. Ingo ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman 2005-11-01 14:46 ` Ingo Molnar @ 2005-11-01 14:50 ` Dave Hansen 2005-11-01 15:24 ` Mel Gorman 2005-11-02 5:11 ` Andrew Morton 2 siblings, 1 reply; 241+ messages in thread From: Dave Hansen @ 2005-11-01 14:50 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Tue, 2005-11-01 at 14:41 +0000, Mel Gorman wrote: > o Mechanism for taking regions of memory offline. Again, I think the > memory hotplug crowd have something for this. If they don't, one of them > will chime in. I'm not sure what you're asking for here. Right now, you can offline based on NUMA node, or physical address. It's all revealed in sysfs. Sounds like "regions" to me. :) -- Dave ^ permalink raw reply [flat|nested] 241+ messages in thread
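For reference, the sysfs interface Dave refers to exposes one directory per memory section; a userspace sketch of offlining one (path and ABI as in the memory hotplug tree; details may vary):

  #include <stdio.h>

  /* Write "offline" to a memory section's state file. */
  int offline_memory_section(int section)
  {
          char path[64];
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/devices/system/memory/memory%d/state", section);
          f = fopen(path, "w");
          if (!f)
                  return -1;
          fputs("offline", f);
          return fclose(f);
  }

The write fails if any page in the section cannot be reclaimed or migrated, which is where the fragmentation avoidance work comes in.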
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 14:50 ` Dave Hansen @ 2005-11-01 15:24 ` Mel Gorman 0 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-01 15:24 UTC (permalink / raw) To: Dave Hansen Cc: Ingo Molnar, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Tue, 1 Nov 2005, Dave Hansen wrote: > On Tue, 2005-11-01 at 14:41 +0000, Mel Gorman wrote: > > o Mechanism for taking regions of memory offline. Again, I think the > > memory hotplug crowd have something for this. If they don't, one of them > > will chime in. > > I'm not sure what you're asking for here. > > Right now, you can offline based on NUMA node, or physical address. > It's all revealed in sysfs. Sounds like "regions" to me. :) > Ah yes, that would do the job all right. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman 2005-11-01 14:46 ` Ingo Molnar 2005-11-01 14:50 ` Dave Hansen @ 2005-11-02 5:11 ` Andrew Morton 2 siblings, 0 replies; 241+ messages in thread From: Andrew Morton @ 2005-11-02 5:11 UTC (permalink / raw) To: Mel Gorman Cc: mingo, nickpiggin, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Mel Gorman <mel@csn.ul.ie> wrote: > > As GFP_ATOMIC and GFP_NOFS cannot do > any reclaim work themselves Both GFP_NOFS and GFP_NOIO can indeed perform direct reclaim. All we require is __GFP_WAIT. ^ permalink raw reply [flat|nested] 241+ messages in thread
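Andrew's correction, in code form: direct reclaim is gated on __GFP_WAIT, which GFP_NOFS and GFP_NOIO both carry. Simplified from the gfp.h definitions of that era:

  #define GFP_ATOMIC      (__GFP_HIGH)    /* no __GFP_WAIT: cannot reclaim */
  #define GFP_NOIO        (__GFP_WAIT)
  #define GFP_NOFS        (__GFP_WAIT | __GFP_IO)
  #define GFP_KERNEL      (__GFP_WAIT | __GFP_IO | __GFP_FS)

  static inline int may_direct_reclaim(unsigned int gfp_mask)
  {
          return gfp_mask & __GFP_WAIT;
  }

GFP_NOFS and GFP_NOIO merely restrict what reclaim may do (no filesystem or no I/O activity); only GFP_ATOMIC and other !__GFP_WAIT masks cannot reclaim at all.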
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 13:56 ` Ingo Molnar
2005-11-01 14:10 ` Dave Hansen
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
@ 2005-11-01 18:23 ` Rob Landley
2005-11-01 20:31 ` Joel Schopp
2 siblings, 1 reply; 241+ messages in thread
From: Rob Landley @ 2005-11-01 18:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel

On Tuesday 01 November 2005 07:56, Ingo Molnar wrote:
> * Mel Gorman <mel@csn.ul.ie> wrote:
> > The set of patches do fix a lot and make a strong start at addressing the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation problem', all the time, in a 100% way, now or in the future?

Considering anybody can allocate memory and never release it, _any_ 100% solution is going to require migrating existing pages, regardless of allocation strategy.

> > So, with this set of patches, how fragmented you get is dependent on the workload and it may still break down and high order allocations will fail. But the current situation is that it will definitely break down. The fact is that it has been reported that memory hotplug remove works with these patches and doesn't without them. Granted, this is just one feature on a high-end machine, but it is one solid operation we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance, hot unplug RAM with these patches applied? If not, do you have any expectation to reach 100%?

You're asking intentionally leading questions, aren't you? Without on-demand page migration, a given area of physical memory would only ever be free by sheer coincidence. Less fragmented page allocation doesn't address _where_ the free areas are; it just tries to make them contiguous.

A page migration strategy would have to do less work if there's less fragmentation, and it also allows you to cluster the "difficult" cases (such as kernel structures that just ain't moving) so you can much more easily hot-unplug everything else. It also makes larger order allocations easier to do, so drivers needing them can load as modules after boot, and it also means hugetlb comes a lot closer to general purpose infrastructure rather than a funky boot-time reservation thing. Plus page prezeroing approaches get to work on larger chunks, and so on.

But any strategy to demand that "this physical memory range must be freed up now" will by definition require moving pages...

> Ingo

Rob

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 18:23 ` Rob Landley
@ 2005-11-01 20:31 ` Joel Schopp
0 siblings, 0 replies; 241+ messages in thread
From: Joel Schopp @ 2005-11-01 20:31 UTC (permalink / raw)
To: Rob Landley
Cc: Ingo Molnar, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel

>>> The set of patches do fix a lot and make a strong start at addressing the fragmentation problem, just not 100% of the way. [...]
>>
>> do you have an expectation to be able to solve the 'fragmentation problem', all the time, in a 100% way, now or in the future?
>
> Considering anybody can allocate memory and never release it, _any_ 100% solution is going to require migrating existing pages, regardless of allocation strategy.

Three issues here: fragmentation of memory in general, fragmentation of usage, and being able to have a 100% success rate at removing memory.

We will never be able to have 100% contiguous memory with no fragmentation. Ever. Certainly not while we have non-movable pieces of memory. Even if we could move every piece of memory, it would be impractical. What these patches do for general fragmentation is to keep the allocations that will never get freed away from the rest of memory, so that memory has a chance to form larger contiguous ranges when it is freed.

By separating memory based on usage there is another side effect. It also makes possible some more active defragmentation methods on the easier memory, because it doesn't have annoying hard-to-move memory scattered throughout. Suddenly we can talk about being able to do memory hotplug remove on significant portions of memory. Or allocating these hugepages after boot. Or doing active defragmentation. Or modules being able to be modules because they don't have to preallocate big pieces of contiguous memory.

Some people will argue that we need 100% separation of usage or no separation at all. Well, change the fallback array to not allow kernel non-reclaimable to fall back and we are done: a 4-line change, 100% separation. But the tradeoff is that under memory pressure we might fail allocations when we still have free memory. There are other options for fallback of course; the fallback_alloc() function is easily replaceable if somebody wants to. Many of these options get easier once memory migration is in. The way fallback is done in the current patches is to maintain current behavior as much as possible, satisfy allocations, and not affect performance.

As to the 100% success at removing memory, this set of patches doesn't solve that. But it solves the 80% problem quite nicely (when combined with the memory migration patches). 80% is great for virtualized systems where the OS has some choice over which memory to remove, but not the quantity to remove. It is also a good start towards 100%, because we can separate and identify the easy memory from the hard memory. Dave Hansen has outlined in separate posts how we can get to 100%, including the hard memory.

>> can you always, under any circumstance, hot unplug RAM with these patches applied? If not, do you have any expectation to reach 100%?
>
> You're asking intentionally leading questions, aren't you? Without on-demand page migration, a given area of physical memory would only ever be free by sheer coincidence. Less fragmented page allocation doesn't address _where_ the free areas are; it just tries to make them contiguous.
>
> A page migration strategy would have to do less work if there's less fragmentation, and it also allows you to cluster the "difficult" cases (such as kernel structures that just ain't moving) so you can much more easily hot-unplug everything else. It also makes larger order allocations easier to do, so drivers needing them can load as modules after boot, and it also means hugetlb comes a lot closer to general purpose infrastructure rather than a funky boot-time reservation thing. Plus page prezeroing approaches get to work on larger chunks, and so on.
>
> But any strategy to demand that "this physical memory range must be freed up now" will by definition require moving pages...

Perfectly stated.

^ permalink raw reply [flat|nested] 241+ messages in thread
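To make Joel's "4-line change" concrete: the fallback behaviour amounts to a per-type search order. The table below is an illustrative guess at its shape (RCLM_TYPES and the type names are assumed from the thread), not the actual fallback data in the patches; strict 100% separation would simply end the kernel non-reclaimable row after its own type:

  /* Illustrative fallback ordering per allocation type. */
  static const int fallback_allocs[RCLM_TYPES][RCLM_TYPES] = {
          [RCLM_NORCLM] = { RCLM_NORCLM, RCLM_KERN,   RCLM_EASY },
          [RCLM_KERN]   = { RCLM_KERN,   RCLM_NORCLM, RCLM_EASY },
          [RCLM_EASY]   = { RCLM_EASY,   RCLM_KERN,   RCLM_NORCLM },
  };

The tradeoff Joel names lives entirely in that last column: keep it and allocations succeed under pressure at the cost of polluting easy-reclaim areas; drop it and separation is perfect but allocations can fail with memory still free.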
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
[not found] ` <4366D469.2010202@yahoo.com.au>
[not found] ` <Pine.LNX.4.58.0511011014060.14884@skynet>
@ 2005-11-01 20:59 ` Joel Schopp
2005-11-02 1:06 ` Nick Piggin
1 sibling, 1 reply; 241+ messages in thread
From: Joel Schopp @ 2005-11-01 20:59 UTC (permalink / raw)
To: Nick Piggin
Cc: Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar

>> The patches have gone through a large number of revisions, have been heavily tested and reviewed by a few people. The memory footprint of this approach is smaller than introducing new zones. If the cache footprint, increased branches and instructions were a problem, I would expect them to show up in the aim9 benchmark or the benchmark that ran ghostscript multiple times on a large file.
>>
>
> I appreciate that a lot of work has gone into them. You must appreciate that they add a reasonable amount of complexity and a non-zero performance cost to the page allocator.

The patches do add a reasonable amount of complexity to the page allocator. In my opinion that is the only downside of these patches, even though it is a big one. What we need to decide as a community is if there is a less complex way to do this, and if there isn't a less complex way, then whether the benefit is worth the increased complexity.

As to the non-zero performance cost, I think hard numbers should carry more weight than they have been given in this area. Mel has posted hard numbers that say the patches are a wash with respect to performance. I don't see any evidence to contradict those results.

>> They will need high order allocations if we want to provide HugeTLB pages to userspace on-demand rather than reserving at boot-time. This is a future problem, but it's one that is not worth tackling until the fragmentation problem is fixed first.
>>
>
> Sure. In what form, we haven't agreed. I vote zones! :)

I'd like to hear more details of how zones would be less complex while still solving the problem. I just don't get it.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 20:59 ` Joel Schopp @ 2005-11-02 1:06 ` Nick Piggin 2005-11-02 1:41 ` Martin J. Bligh ` (2 more replies) 0 siblings, 3 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-02 1:06 UTC (permalink / raw) To: Joel Schopp Cc: Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar Joel Schopp wrote: > The patches do add a reasonable amount of complexity to the page > allocator. In my opinion that is the only downside of these patches, > even though it is a big one. What we need to decide as a community is > if there is a less complex way to do this, and if there isn't a less > complex way then is the benefit worth the increased complexity. > > As to the non-zero performance cost, I think hard numbers should carry > more weight than they have been given in this area. Mel has posted hard > numbers that say the patches are a wash with respect to performance. I > don't see any evidence to contradict those results. > The numbers I have seen show that performance is decreased. People like Ken Chen spend months trying to find a 0.05% improvement in performance. Not long ago I just spent days getting our cached kbuild performance back to where 2.4 is on my build system. I can simply see they will cost more icache, more dcache, more branches, etc. in what is the hottest part of the kernel in some workloads (kernel compiles, for one). I'm sorry if I sound like a wet blanket. I just don't look at a patch and think "wow all those 3 guys with Linux on IBM mainframes and using lpars are going to be so much happier now, this is something we need". >>> They will need high order allocations if we want to provide HugeTLB pages >>> to userspace on-demand rather than reserving at boot-time. This is a >>> future problem, but it's one that is not worth tackling until the >>> fragmentation problem is fixed first. >>> >> >> Sure. In what form, we haven't agreed. I vote zones! :) > > > I'd like to hear more details of how zones would be less complex while > still solving the problem. I just don't get it. > You have an extra zone. You size that zone at boot according to the amount of memory you need to be able to free. Only easy-reclaim stuff goes in that zone. It is less complex because zones are a complexity we already have to live with. 99% of the infrastructure is already there to do this. If you want to hot unplug memory or guarantee hugepage allocation, this is the way to do it. Nobody has told me why this *doesn't* work. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply	[flat|nested] 241+ messages in thread
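To flesh out what Nick is proposing, here is a hypothetical sketch of the zone approach. Everything in it — the ZONE_EASYRCLM name, the flag test, the boot-time size — is invented for illustration; no such zone existed in mainline at the time (Goto-san's real ZONE_REMOVABLE patches are linked later in the thread):

/* Hypothetical: one extra zone, sized at boot (say, easyrclm=2G),
 * into which only easily reclaimed allocations are ever placed. */
enum { ZONE_NORMAL, ZONE_EASYRCLM };

static int zone_for_alloc(unsigned int gfp_flags, unsigned int easyrclm_flag)
{
	/* User/pagecache pages may use the reclaimable zone (and can
	 * still fall back to ZONE_NORMAL when it fills up)... */
	if (gfp_flags & easyrclm_flag)
		return ZONE_EASYRCLM;

	/* ...but pinned kernel memory never enters it, so the whole
	 * zone stays reclaimable for hot-unplug or hugepages. */
	return ZONE_NORMAL;
}

The strength and the weakness are the same thing: the boundary is a hard, boot-time number, which is exactly the sizing problem Martin raises next.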
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 1:06 ` Nick Piggin @ 2005-11-02 1:41 ` Martin J. Bligh 2005-11-02 2:03 ` Nick Piggin 2005-11-02 11:37 ` Mel Gorman 2005-11-02 15:11 ` Mel Gorman 2 siblings, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-02 1:41 UTC (permalink / raw) To: Nick Piggin, Joel Schopp Cc: Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar > The numbers I have seen show that performance is decreased. People > like Ken Chen spend months trying to find a 0.05% improvement in > performance. Not long ago I just spent days getting our cached > kbuild performance back to where 2.4 is on my build system. Ironically, we're currently trying to chase down a 'database benchmark' regression that seems to have been caused by the last round of "let's rewrite the scheduler again" (more details later). Nick, you've added an awful lot of complexity to some of these code paths yourself ... seems ironic that you're the one complaining about it ;-) >>> Sure. In what form, we haven't agreed. I vote zones! :) >> >> >> I'd like to hear more details of how zones would be less complex while >> still solving the problem. I just don't get it. >> > > You have an extra zone. You size that zone at boot according to the > amount of memory you need to be able to free. Only easy-reclaim stuff > goes in that zone. > > It is less complex because zones are a complexity we already have to > live with. 99% of the infrastructure is already there to do this. > > If you want to hot unplug memory or guarantee hugepage allocation, > this is the way to do it. Nobody has told me why this *doesn't* work. Because the zone is statically sized, and you're back to the same crap we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM, effectively. Define how much you need for system ram, and how much for easily reclaimable memory at boot time. You can't - it doesn't work. M. ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 1:41 ` Martin J. Bligh @ 2005-11-02 2:03 ` Nick Piggin 2005-11-02 2:24 ` Martin J. Bligh 2005-11-02 11:41 ` Mel Gorman 0 siblings, 2 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-02 2:03 UTC (permalink / raw) To: Martin J. Bligh Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar Martin J. Bligh wrote: >>The numbers I have seen show that performance is decreased. People >>like Ken Chen spend months trying to find a 0.05% improvement in >>performance. Not long ago I just spent days getting our cached >>kbuild performance back to where 2.4 is on my build system. > > > Ironically, we're currently trying to chase down a 'database benchmark' > regression that seems to have been caused by the last round of "let's > rewrite the scheduler again" (more details later). Nick, you've added an > awful lot of complexity to some of these code paths yourself ... seems > ironic that you're the one complaining about it ;-) > Yeah that's unfortunate, but I think a large portion of the problem (if they are anything the same) has been narrowed down to some over-eager wakeup balancing for which there are a number of proposed patches. But in this case I was more worried about getting the groundwork done for handling the multicore systems that everyone will soon be using rather than several % performance regression on TPC-C (not to say that I don't care about that at all)... I don't see the irony. But let's move this to another thread if it is going to continue. I would be happy to discuss scheduler problems. >>You have an extra zone. You size that zone at boot according to the >>amount of memory you need to be able to free. Only easy-reclaim stuff >>goes in that zone. >> >>It is less complex because zones are a complexity we already have to >>live with. 99% of the infrastructure is already there to do this. >> >>If you want to hot unplug memory or guarantee hugepage allocation, >>this is the way to do it. Nobody has told me why this *doesn't* work. > > > Because the zone is statically sized, and you're back to the same crap > we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM, > effectively. Define how much you need for system ram, and how much > for easily reclaimable memory at boot time. You can't - it doesn't work. > You can't what? What doesn't work? If you have no hard limits set, then the frag patches can't guarantee anything either. You can't have it both ways. Either you have limits for things or you don't need any guarantees. Zones handle the former case nicely, and we currently do the latter case just fine (along with the frag patches). -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:03 ` Nick Piggin @ 2005-11-02 2:24 ` Martin J. Bligh 2005-11-02 2:49 ` Nick Piggin 2005-11-02 11:41 ` Mel Gorman 1 sibling, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-02 2:24 UTC (permalink / raw) To: Nick Piggin Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar >>> The numbers I have seen show that performance is decreased. People >>> like Ken Chen spend months trying to find a 0.05% improvement in >>> performance. Not long ago I just spent days getting our cached >>> kbuild performance back to where 2.4 is on my build system. >> >> Ironically, we're currently trying to chase down a 'database benchmark' >> regression that seems to have been caused by the last round of "let's >> rewrite the scheduler again" (more details later). Nick, you've added an >> awful lot of complexity to some of these code paths yourself ... seems >> ironic that you're the one complaining about it ;-) > > Yeah that's unfortunate, but I think a large portion of the problem > (if they are anything the same) has been narrowed down to some over-eager > wakeup balancing for which there are a number of proposed > patches. > > But in this case I was more worried about getting the groundwork done > for handling the multicore systems that everyone will soon > be using rather than several % performance regression on TPC-C (not > to say that I don't care about that at all)... I don't see the irony. > > But let's move this to another thread if it is going to continue. I > would be happy to discuss scheduler problems. My point was that most things we do add complexity to the codebase, including the things you do yourself ... I'm not saying that we're worse off for the changes you've made, by any means - I think they've been mostly beneficial. I'm just pointing out that we ALL do it, so let us not be too quick to judge when others propose adding something that does ;-) >> Because the zone is statically sized, and you're back to the same crap >> we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM, >> effectively. Define how much you need for system ram, and how much >> for easily reclaimable memory at boot time. You can't - it doesn't work. > > You can't what? What doesn't work? If you have no hard limits set, > then the frag patches can't guarantee anything either. > > You can't have it both ways. Either you have limits for things or > you don't need any guarantees. Zones handle the former case nicely, > and we currently do the latter case just fine (along with the frag > patches). I'll go look through Mel's current patchset again. I was under the impression it didn't suffer from this problem, at least not as much as zones did. Nothing is guaranteed. You can shag the whole machine and/or VM in any number of ways ... if we can significantly improve the probability of existing higher order allocs working, and new functionality has an excellent probability of success, that's as good as you're going to get. Have a free "perfect is the enemy of good" Linus quote, on me ;-) M. ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:24 ` Martin J. Bligh @ 2005-11-02 2:49 ` Nick Piggin 2005-11-02 4:39 ` Martin J. Bligh ` (2 more replies) 0 siblings, 3 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-02 2:49 UTC (permalink / raw) To: Martin J. Bligh Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar Martin J. Bligh wrote: >>But let's move this to another thread if it is going to continue. I >>would be happy to discuss scheduler problems. > > > My point was that most things we do add complexity to the codebase, > including the things you do yourself ... I'm not saying that we're worse > off for the changes you've made, by any means - I think they've been > mostly beneficial. Heh - I like the "mostly" ;) > I'm just pointing out that we ALL do it, so let us > not be too quick to judge when others propose adding something that does ;-) > What I'm getting worried about is the marked increase in the rate of features and complexity going in. I am almost certainly never going to use memory hotplug or demand paging of hugepages. I am pretty likely going to have to wade through this code at some point in the future if it is merged. It is also going to slow down my kernel by maybe 1% when doing kbuilds, but hey let's not worry about that until we've merged 10 more such slowdowns (ok that wasn't aimed at you or Mel, but my perception of the status quo). > >>You can't what? What doesn't work? If you have no hard limits set, >>then the frag patches can't guarantee anything either. >> >>You can't have it both ways. Either you have limits for things or >>you don't need any guarantees. Zones handle the former case nicely, >>and we currently do the latter case just fine (along with the frag >>patches). > > > I'll go look through Mel's current patchset again. I was under the > impression it didn't suffer from this problem, at least not as much > as zones did. > Over time, I don't think it can offer any stronger a guarantee than what we currently have. I'm not even sure that it would be any better at all for problematic workloads as time -> infinity. > Nothing is guaranteed. You can shag the whole machine and/or VM in > any number of ways ... if we can significantly improve the probability > of existing higher order allocs working, and new functionality has > an excellent probability of success, that's as good as you're going to > get. Have a free "perfect is the enemy of good" Linus quote, on me ;-) > I think it falls down if these higher order allocations actually get *used* for anything. You'll simply be going through the process of replacing your contiguous, easy-to-reclaim memory with pinned kernel memory. However, for the purpose of memory hot unplug, a new zone *will* guarantee memory can be reclaimed and unplugged. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:49 ` Nick Piggin @ 2005-11-02 4:39 ` Martin J. Bligh 2005-11-02 5:09 ` Nick Piggin 0 siblings, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-02 4:39 UTC (permalink / raw) To: Nick Piggin Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar >> I'm just pointing out that we ALL do it, so let us >> not be too quick to judge when others propose adding something that does ;-) > > What I'm getting worried about is the marked increase in the > rate of features and complexity going in. > > I am almost certainly never going to use memory hotplug or > demand paging of hugepages. I am pretty likely going to have > to wade through this code at some point in the future if it > is merged. Mmm. Though whether any one of us will personally use each feature is perhaps not the ideal criterion to judge things by ;-) > It is also going to slow down my kernel by maybe 1% when > doing kbuilds, but hey let's not worry about that until we've > merged 10 more such slowdowns (ok that wasn't aimed at you or > Mel, but my perception of the status quo). If it's really 1%, yes, that's a huge problem. And yes, I agree with you that there's a problem with the rate of change. Part of that is a lack of performance measurement and testing, and the quality sometimes scares me (though the last month has actually been significantly better, the tree mostly builds and boots now!). I've tried to do something on the testing front, but I'm acutely aware it's not sufficient by any means. >>> You can't what? What doesn't work? If you have no hard limits set, >>> then the frag patches can't guarantee anything either. >>> >>> You can't have it both ways. Either you have limits for things or >>> you don't need any guarantees. Zones handle the former case nicely, >>> and we currently do the latter case just fine (along with the frag >>> patches). >> >> I'll go look through Mel's current patchset again. I was under the >> impression it didn't suffer from this problem, at least not as much >> as zones did. > > Over time, I don't think it can offer any stronger a guarantee > than what we currently have. I'm not even sure that it would be > any better at all for problematic workloads as time -> infinity. Sounds worth discussing. We need *some* way of dealing with fragmentation issues. To me that means both an avoidance strategy, and an ability to actively defragment if we need it. Linux is evolved software, it may not be perfect at first - that's the way we work, and it's served us well up till now. To me, that's the biggest advantage we have over the proprietary model. >> Nothing is guaranteed. You can shag the whole machine and/or VM in >> any number of ways ... if we can significantly improve the probability >> of existing higher order allocs working, and new functionality has >> an excellent probability of success, that's as good as you're going to >> get. Have a free "perfect is the enemy of good" Linus quote, on me ;-) > > I think it falls down if these higher order allocations actually > get *used* for anything. You'll simply be going through the process > of replacing your contiguous, easy-to-reclaim memory with pinned > kernel memory. 
It seems inevitable that we need both physically contiguous memory sections, and virtually contiguous memory in kernel space (which equates to the same thing, unless we totally break the 1-1 P-V mapping and lose the large page mapping for kernel, which I'd hate to do.) > However, for the purpose of memory hot unplug, a new zone *will* > guarantee memory can be reclaimed and unplugged. It's not just about memory hotplug. There are, as we have discussed already, many uses for physically contiguous (and virtually contiguous) memory segments. Focusing purely on any one of them will not solve the issue at hand ... M. ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 4:39 ` Martin J. Bligh @ 2005-11-02 5:09 ` Nick Piggin 2005-11-02 5:14 ` Martin J. Bligh 0 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-11-02 5:09 UTC (permalink / raw) To: Martin J. Bligh Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar Martin J. Bligh wrote: >>I am almost certainly never going to use memory hotplug or >>demand paging of hugepages. I am pretty likely going to have >>to wade through this code at some point in the future if it >>is merged. > > > Mmm. Though whether any one of us will personally use each feature > is perhaps not the ideal criterion to judge things by ;-) > Of course, but I'd say very few people will. Then again maybe I'm just a luddite who doesn't know what's good for him ;) > >>It is also going to slow down my kernel by maybe 1% when >>doing kbuilds, but hey let's not worry about that until we've >>merged 10 more such slowdowns (ok that wasn't aimed at you or >>Mel, but my perception of the status quo). > > > If it's really 1%, yes, that's a huge problem. And yes, I agree with > you that there's a problem with the rate of change. Part of that is > a lack of performance measurement and testing, and the quality sometimes > scares me (though the last month has actually been significantly better, > the tree mostly builds and boots now!). I've tried to do something on > the testing front, but I'm acutely aware it's not sufficient by any means. > To be honest I haven't tested so this is an unfounded guess. However it is based on what I have seen of Mel's numbers, and the fact that the kernel spends nearly 1/3rd of its time in the page allocator when running a kbuild. I may get around to getting some real numbers when my current patch queues shrink. >>Over time, I don't think it can offer any stronger a guarantee >>than what we currently have. I'm not even sure that it would be >>any better at all for problematic workloads as time -> infinity. > > > Sounds worth discussing. We need *some* way of dealing with fragmentation > issues. To me that means both an avoidance strategy, and an ability > to actively defragment if we need it. Linux is evolved software, it > may not be perfect at first - that's the way we work, and it's served > us well up till now. To me, that's the biggest advantage we have over > the proprietary model. > True and I'm also annoyed that we have these issues at all. I just don't see that the avoidance strategy helps that much because as I said, you don't need to keep these lovely contiguous regions just for show (or other easy-to-reclaim user pages). The absolute priority is to move away from higher order allocs or use fallbacks IMO. And that doesn't necessarily mean order 1 or even 2 allocations because we don't seem to have a problem with those. Because I want Linux to be as robust as you do. >>I think it falls down if these higher order allocations actually >>get *used* for anything. You'll simply be going through the process >>of replacing your contiguous, easy-to-reclaim memory with pinned >>kernel memory. > > > It seems inevitable that we need both physically contiguous memory > sections, and virtually contiguous in kernel space (which equates to > the same thing, unless we totally break the 1-1 P-V mapping and > lose the large page mapping for kernel, which I'd hate to do.) > I think this isn't as bad an idea as you think. 
If it means those guys doing memory hotplug take a few % performance hit and nobody else has to bear the costs then that sounds great. > >>However, for the purpose of memory hot unplug, a new zone *will* >>guarantee memory can be reclaimed and unplugged. > > > It's not just about memory hotplug. There are, as we have discussed > already, many uses for physically contiguous (and virtually contiguous) > memory segments. Focusing purely on any one of them will not solve the > issue at hand ... > True, but we don't seem to have huge problems with other things. The main ones that have come up on lkml are e1000 which is getting fixed, and maybe XFS which I think there are also moves to improve. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 5:09 ` Nick Piggin @ 2005-11-02 5:14 ` Martin J. Bligh 2005-11-02 6:23 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-02 5:14 UTC (permalink / raw) To: Nick Piggin Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar >> It's not just about memory hotplug. There are, as we have discussed >> already, many uses for physically contiguous (and virtually contiguous) >> memory segments. Focusing purely on any one of them will not solve the >> issue at hand ... > > True, but we don't seem to have huge problems with other things. The > main ones that have come up on lkml are e1000 which is getting fixed, > and maybe XFS which I think there are also moves to improve. It should be fairly easy to trawl through the list of all allocations and pull out all the higher order ones from the whole source tree. I suspect there's a lot ... maybe I'll play with it later on. M. ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 5:14 ` Martin J. Bligh @ 2005-11-02 6:23 ` KAMEZAWA Hiroyuki 2005-11-02 10:15 ` Nick Piggin 0 siblings, 1 reply; 241+ messages in thread From: KAMEZAWA Hiroyuki @ 2005-11-02 6:23 UTC (permalink / raw) To: Martin J. Bligh Cc: Nick Piggin, Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar Martin J. Bligh wrote: >>True, but we don't seem to have huge problems with other things. The >>main ones that have come up on lkml are e1000 which is getting fixed, >>and maybe XFS which I think there are also moves to improve. > > > It should be fairly easy to trawl through the list of all allocations > and pull out all the higher order ones from the whole source tree. I > suspect there's a lot ... maybe I'll play with it later on. > Please check kmalloc(32k, 64k) too. For example, the loopback device's default MTU=16436 means order=3, and maybe there are other high-MTU devices. I suspect the skb_makewritable()/skb_copy()/skb_linearize() functions can suffer from fragmentation when the MTU is big. They allocate a large skb by gathering fragmented skbs. When these skb_* functions fail, the packet is silently discarded by netfilter. If fragmentation is heavy, packets (especially TCP) using a large MTU never reach their destination, even over loopback. Honestly, I'm not familiar with the network code; could anyone comment on this? -- Kame ^ permalink raw reply	[flat|nested] 241+ messages in thread
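Kame-san's arithmetic is easy to check: a linearized buffer for the default loopback MTU needs more than four 4KB pages, so the buddy allocator must hand out an order-3 (32KB) block. A standalone sketch of the calculation — order_for_size() here just reimplements what the kernel's get_order() helper does, and the skb overhead constant is only indicative, not the real value:

#include <stdio.h>

/* Smallest buddy order whose block of 2^order 4KB pages covers
 * 'size' bytes; this mirrors the kernel's get_order(). */
static int order_for_size(unsigned long size)
{
	unsigned long pages = (size + 4095) / 4096;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

int main(void)
{
	unsigned long mtu = 16436;	/* default loopback MTU */
	unsigned long overhead = 256;	/* skb overhead; indicative only */

	/* 16692 bytes -> 5 pages -> order 3 (8 pages, 32KB) */
	printf("order-%d\n", order_for_size(mtu + overhead));
	return 0;
}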
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 6:23 ` KAMEZAWA Hiroyuki @ 2005-11-02 10:15 ` Nick Piggin 0 siblings, 0 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-02 10:15 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Martin J. Bligh, Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar KAMEZAWA Hiroyuki wrote: > Martin J. Bligh wrote: > > Please check kmalloc(32k, 64k) too. > > For example, the loopback device's default MTU=16436 means order=3, and > maybe there are other high-MTU devices. > > I suspect the skb_makewritable()/skb_copy()/skb_linearize() functions can > suffer from fragmentation when the MTU is big. They allocate a large skb by > gathering fragmented skbs. When these skb_* functions fail, the packet > is silently discarded by netfilter. If fragmentation is heavy, packets > (especially TCP) using a large MTU never reach their destination, even over loopback. > > Honestly, I'm not familiar with the network code; could anyone comment on this? > I'd be interested to know, actually. I was hoping loopback would always use order-0 allocations, because the loopback driver is SG, FRAGLIST, and HIGHDMA capable. However I'm likewise not familiar with network code. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply	[flat|nested] 241+ messages in thread
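For reference, Nick's expectation comes from the feature flags the loopback driver sets on itself. From memory of drivers/net/loopback.c in kernels of that era (quoted approximately, not verbatim):

	/* loopback advertises scatter-gather and fragment-list support */
	dev->features = NETIF_F_SG | NETIF_F_FRAGLIST
			| NETIF_F_HIGHDMA | NETIF_F_LLTX;

With NETIF_F_SG and NETIF_F_FRAGLIST set, skb data can stay in separate page fragments instead of being linearized into one large (order-3, for a 16KB MTU) buffer, which is why loopback traffic should in principle get by on order-0 allocations.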
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:49 ` Nick Piggin 2005-11-02 4:39 ` Martin J. Bligh @ 2005-11-02 7:19 ` Yasunori Goto 2005-11-02 11:48 ` Mel Gorman 2 siblings, 0 replies; 241+ messages in thread From: Yasunori Goto @ 2005-11-02 7:19 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Joel Schopp, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar, Mel Gorman Hello, Nick-san. I posted patches that add ZONE_REMOVABLE to LHMS. I don't say they are better than Mel-san's patch; I hope they will be the basis of a good discussion. There were 2 types. One just added ZONE_REMOVABLE. This patch came from an early implementation of memory hotplug by the VA-Linux team. http://sourceforge.net/mailarchive/forum.php?thread_id=5969508&forum_id=223 ZONE_HIGHMEM was used for this purpose in the early implementation. We thought ZONE_HIGHMEM was easier to remove than the other zones, but some architectures don't use it. That is why ZONE_REMOVABLE was born. (And I remember that ZONE_DMA32 was defined after this patch, so the number of zones became 5 and one more bit was necessary in page->flags. I don't know the recent progress of ZONE_DMA32.) The other one was a bit similar to Mel-san's. One motivation of this patch was to create an orthogonal relationship between Removable and DMA/Normal/Highmem, which I thought was desirable, because ppc64 can treat all of its memory as the same (DMA) zone and a new zone would spoil that good feature. http://sourceforge.net/mailarchive/forum.php?thread_id=5345977&forum_id=223 http://sourceforge.net/mailarchive/forum.php?thread_id=5345978&forum_id=223 http://sourceforge.net/mailarchive/forum.php?thread_id=5345979&forum_id=223 http://sourceforge.net/mailarchive/forum.php?thread_id=5345980&forum_id=223 Thanks. P.S. to Mel-san: I'm sorry for writing this so late. This thread was a mail bomb for me to read with my poor English skill. :-( > Martin J. Bligh wrote: > > >>But let's move this to another thread if it is going to continue. I > >>would be happy to discuss scheduler problems. > > > > > > My point was that most things we do add complexity to the codebase, > > including the things you do yourself ... I'm not saying that we're worse > > off for the changes you've made, by any means - I think they've been > > mostly beneficial. > > Heh - I like the "mostly" ;) > > > I'm just pointing out that we ALL do it, so let us > > not be too quick to judge when others propose adding something that does ;-) > > > > What I'm getting worried about is the marked increase in the > rate of features and complexity going in. > > I am almost certainly never going to use memory hotplug or > demand paging of hugepages. I am pretty likely going to have > to wade through this code at some point in the future if it > is merged. > > It is also going to slow down my kernel by maybe 1% when > doing kbuilds, but hey let's not worry about that until we've > merged 10 more such slowdowns (ok that wasn't aimed at you or > Mel, but my perception of the status quo). > > > > >>You can't what? What doesn't work? If you have no hard limits set, > >>then the frag patches can't guarantee anything either. > >> > >>You can't have it both ways. Either you have limits for things or > >>you don't need any guarantees. Zones handle the former case nicely, > >>and we currently do the latter case just fine (along with the frag > >>patches). > > > > > > I'll go look through Mel's current patchset again. 
> > I was under the impression it didn't suffer from this problem, > > at least not as much as zones did. > > Over time, I don't think it can offer any stronger a guarantee > than what we currently have. I'm not even sure that it would be > any better at all for problematic workloads as time -> infinity. > > > Nothing is guaranteed. You can shag the whole machine and/or VM in > > any number of ways ... if we can significantly improve the probability > > of existing higher order allocs working, and new functionality has > > an excellent probability of success, that's as good as you're going to > > get. Have a free "perfect is the enemy of good" Linus quote, on me ;-) > > I think it falls down if these higher order allocations actually > get *used* for anything. You'll simply be going through the process > of replacing your contiguous, easy-to-reclaim memory with pinned > kernel memory. > > However, for the purpose of memory hot unplug, a new zone *will* > guarantee memory can be reclaimed and unplugged. > > -- > SUSE Labs, Novell Inc. > > Send instant messages to your online friends http://au.messenger.yahoo.com -- Yasunori Goto ^ permalink raw reply	[flat|nested] 241+ messages in thread
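For readers without the LHMS archives handy, the first of the two approaches Goto-san describes boils down to one more entry in the zone enumeration. A rough sketch of its shape, reconstructed for illustration from his description above rather than quoted from the actual patch:

/* The zones of that era, plus the proposed removable zone.  The extra
 * zone is what costs one more bit in page->flags, as noted above
 * (and ZONE_DMA32, added later, pushed the count to five). */
#define ZONE_DMA	0
#define ZONE_NORMAL	1
#define ZONE_HIGHMEM	2
#define ZONE_REMOVABLE	3	/* only easily reclaimed pages allowed */
#define MAX_NR_ZONES	4

His second variant instead keeps "removable" orthogonal to DMA/Normal/Highmem, precisely so that an architecture like ppc64, which can put all memory in one DMA zone, does not lose that property.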
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:49 ` Nick Piggin 2005-11-02 4:39 ` Martin J. Bligh 2005-11-02 7:19 ` Yasunori Goto @ 2005-11-02 11:48 ` Mel Gorman 2 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-02 11:48 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Joel Schopp, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar On Wed, 2 Nov 2005, Nick Piggin wrote: > Martin J. Bligh wrote: > > > But let's move this to another thread if it is going to continue. I > > > would be happy to discuss scheduler problems. > > > > > > My point was that most things we do add complexity to the codebase, > > including the things you do yourself ... I'm not saying that we're worse > > off for the changes you've made, by any means - I think they've been > > mostly beneficial. > > Heh - I like the "mostly" ;) > > > I'm just pointing out that we ALL do it, so let us > > not be too quick to judge when others propose adding something that does ;-) > > > > What I'm getting worried about is the marked increase in the > rate of features and complexity going in. > > I am almost certainly never going to use memory hotplug or > demand paging of hugepages. I am pretty likely going to have > to wade through this code at some point in the future if it > is merged. > Plenty of features in the kernel I don't use either :) . > It is also going to slow down my kernel by maybe 1% when > doing kbuilds, but hey let's not worry about that until we've > merged 10 more such slowdowns (ok that wasn't aimed at you or > Mel, but my perception of the status quo). > Ok, my patches show performance gains and losses on different parts of Aim9. page_test is slightly down but fork_test was considerably up. Both would have an effect on kbuild so more figures are needed on more machines. That will only be found from testing on a variety of machines. > > > > > You can't what? What doesn't work? If you have no hard limits set, > > > then the frag patches can't guarantee anything either. > > > > > > You can't have it both ways. Either you have limits for things or > > > you don't need any guarantees. Zones handle the former case nicely, > > > and we currently do the latter case just fine (along with the frag > > > patches). > > > > > > I'll go look through Mel's current patchset again. I was under the > > impression it didn't suffer from this problem, at least not as much > > as zones did. > > > > Over time, I don't think it can offer any stronger a guarantee > than what we currently have. I'm not even sure that it would be > any better at all for problematic workloads as time -> infinity. > Not as they currently stand, no. As I've said elsewhere, to really guarantee things, kswapd would need to know how to clear out UserRclm pages from the other reserve types. > > Nothing is guaranteed. You can shag the whole machine and/or VM in > > any number of ways ... if we can significantly improve the probability of > > existing higher order allocs working, and new functionality has > > an excellent probability of success, that's as good as you're going to get. > > Have a free "perfect is the enemy of good" Linus quote, on me ;-) > > I think it falls down if these higher order allocations actually > get *used* for anything. You'll simply be going through the process > of replacing your contiguous, easy-to-reclaim memory with pinned > kernel memory. > And a misconfigured zone-based approach just falls apart. 
Going to finish that summary mail to avoid repetition. > However, for the purpose of memory hot unplug, a new zone *will* > guarantee memory can be reclaimed and unplugged. > > -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:03 ` Nick Piggin 2005-11-02 2:24 ` Martin J. Bligh @ 2005-11-02 11:41 ` Mel Gorman 1 sibling, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-02 11:41 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Joel Schopp, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar On Wed, 2 Nov 2005, Nick Piggin wrote: > Martin J. Bligh wrote: > > > The numbers I have seen show that performance is decreased. People > > > like Ken Chen spend months trying to find a 0.05% improvement in > > > performance. Not long ago I just spent days getting our cached > > > kbuild performance back to where 2.4 is on my build system. > > > > > > Ironically, we're currently trying to chase down a 'database benchmark' > > regression that seems to have been caused by the last round of "let's > > rewrite the scheduler again" (more details later). Nick, you've added an > > awful lot of complexity to some of these code paths yourself ... seems > > ironic that you're the one complaining about it ;-) > > > > Yeah that's unfortunate, but I think a large portion of the problem > (if they are anything the same) has been narrowed down to some over-eager > wakeup balancing for which there are a number of proposed > patches. > > But in this case I was more worried about getting the groundwork done > for handling the multicore systems that everyone will soon > be using rather than several % performance regression on TPC-C (not > to say that I don't care about that at all)... I don't see the irony. > > But let's move this to another thread if it is going to continue. I > would be happy to discuss scheduler problems. > > > > You have an extra zone. You size that zone at boot according to the > > > amount of memory you need to be able to free. Only easy-reclaim stuff > > > goes in that zone. > > > > > > It is less complex because zones are a complexity we already have to > > > live with. 99% of the infrastructure is already there to do this. > > > > > > If you want to hot unplug memory or guarantee hugepage allocation, > > > this is the way to do it. Nobody has told me why this *doesn't* work. > > > > > > Because the zone is statically sized, and you're back to the same crap > > we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM, > > effectively. Define how much you need for system ram, and how much > > for easily reclaimable memory at boot time. You can't - it doesn't work. > > You can't what? What doesn't work? If you have no hard limits set, > then the frag patches can't guarantee anything either. > True, but the difference is: anti-defrag is best effort at low cost (according to Aim9) without a tunable; zones will work, but require a tunable and fall apart if tuned wrong. > You can't have it both ways. Either you have limits for things or > you don't need any guarantees. Zones handle the former case nicely, > and we currently do the latter case just fine (along with the frag > patches). > Sure, so you compromise and do best effort for as long as possible. Always try to keep fragmentation low. If the system is configured to really need low fragmentation, then after a long period of time, a page-migration mechanism kicks in to move the kernel pages out of EasyRclm areas and we continue on. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 1:06 ` Nick Piggin 2005-11-02 1:41 ` Martin J. Bligh @ 2005-11-02 11:37 ` Mel Gorman 2005-11-02 15:11 ` Mel Gorman 2 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-02 11:37 UTC (permalink / raw) To: Nick Piggin Cc: Joel Schopp, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar On Wed, 2 Nov 2005, Nick Piggin wrote: > Joel Schopp wrote: > > > The patches do add a reasonable amount of complexity to the page allocator. > > In my opinion that is the only downside of these patches, even though it is > > a big one. What we need to decide as a community is if there is a less > > complex way to do this, and if there isn't a less complex way then is the > > benefit worth the increased complexity. > > > > As to the non-zero performance cost, I think hard numbers should carry more > > weight than they have been given in this area. Mel has posted hard numbers > > that say the patches are a wash with respect to performance. I don't see > > any evidence to contradict those results. > > > > The numbers I have seen show that performance is decreased. People > like Ken Chen spend months trying to find a 0.05% improvement in > performance. Fine, that is understandable. The AIM9 benchmarks also show performance improvements in other areas like fork_test. About a 5% difference, which is also important for kernel builds. Wider testing would be needed to see if the improvements are specific to my tests or not. Every set of patches has had a performance regression test run with Aim9, so I certainly have not been ignoring performance. > Not long ago I just spent days getting our cached > kbuild performance back to where 2.4 is on my build system. > Then it would be interesting to find out how 2.6.14-rc5-mm1 compares against 2.6.14-rc5-mm1-mbuddy-v19? > I can simply see they will cost more icache, more dcache, more branches, > etc. in what is the hottest part of the kernel in some workloads (kernel > compiles, for one). > > I'm sorry if I sound like a wet blanket. I just don't look at a patch > and think "wow all those 3 guys with Linux on IBM mainframes and using > lpars are going to be so much happier now, this is something we need". > I developed this as the beginning of a long term solution for on-demand HugeTLB pages as part of a PhD. This could potentially help desktop workloads in the future. Hotplug machines are a benefit that was picked up by the work on the way. We can help hotplug to some extent today and desktop users in the future (and given time, all of the hotplug problems as well). But if we tell desktop users "Yeah, your applications will run a bit better with HugeTLB pages as long as you configure the size of the zone correctly" at any stage, we'll be told where to go. > > > > They will need high order allocations if we want to provide HugeTLB pages > > > > to userspace on-demand rather than reserving at boot-time. This is a > > > > future problem, but it's one that is not worth tackling until the > > > > fragmentation problem is fixed first. > > > > > > > > > > Sure. In what form, we haven't agreed. I vote zones! :) > > > > > > I'd like to hear more details of how zones would be less complex while still > > solving the problem. I just don't get it. > > > > You have an extra zone. You size that zone at boot according to the > amount of memory you need to be able to free. Only easy-reclaim stuff > goes in that zone. > Helps hotplug, no one else. 
Rules out HugeTLB on demand for userspace unless we are willing to tell desktop users to configure this tunable. > It is less complex because zones are a complexity we already have to > live with. 99% of the infrastructure is already there to do this. > The simplicity of zones is still in dispute. I am putting together a mail of pros, cons, situations and future work for both approaches. I hope to send it out fairly soon. > If you want to hot unplug memory or guarantee hugepage allocation, > this is the way to do it. Nobody has told me why this *doesn't* work. > Hot unplug the configured zone of memory and guarantee hugepage allocation only for userspace. There is no way for kernel allocations to get a huge page under any circumstance. Our approach allows the kernel to get the large page at the cost of fragmentation degrading slowly over time. To stop it fragmenting slowly over time, more work is needed. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 1:06 ` Nick Piggin 2005-11-02 1:41 ` Martin J. Bligh 2005-11-02 11:37 ` Mel Gorman @ 2005-11-02 15:11 ` Mel Gorman 2 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-02 15:11 UTC (permalink / raw) To: Nick Piggin Cc: Joel Schopp, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar On (02/11/05 12:06), Nick Piggin didst pronounce: > Joel Schopp wrote: > > >The patches do add a reasonable amount of complexity to the page > >allocator. In my opinion that is the only downside of these patches, > >even though it is a big one. What we need to decide as a community is > >if there is a less complex way to do this, and if there isn't a less > >complex way then is the benefit worth the increased complexity. > > > >As to the non-zero performance cost, I think hard numbers should carry > >more weight than they have been given in this area. Mel has posted hard > >numbers that say the patches are a wash with respect to performance. I > >don't see any evidence to contradict those results. > > > > The numbers I have seen show that performance is decreased. People > like Ken Chen spend months trying to find a 0.05% improvement in > performance. Not long ago I just spent days getting our cached > kbuild performance back to where 2.4 is on my build system. > One contention point is the overhead this introduces. Let's say we do discover that kbuild is slower with this patch (still unknown); then we have to get rid of mbuddy, disable it or replace it with an as-yet-to-be-written zone-based approach. I wrote a quick patch that disables anti-defrag via a config option and ran aim9 on the test machine I have been using all along. I deliberately changed as little of the anti-defrag code as possible, but maybe we could make this patch even smaller, or go the other way and conditionally take out as much anti-defrag as possible. Here are the Aim9 comparisons between -clean and -mbuddy-v19-antidefrag-disabled-with-config-option (just the one run). These are both based on 2.6.14-rc5-mm1.

                 vanilla-mm  mbuddy-disabled-via-config
1 creat-clo        16006.00    15844.72   -161.28  -1.01% File Creations and Closes/second
2 page_test       117515.83   119696.77   2180.94   1.86% System Allocations & Pages/second
3 brk_test        440289.81   439870.04   -419.77  -0.10% System Memory Allocations/second
4 jmp_test       4179466.67  4179150.00   -316.67  -0.01% Non-local gotos/second
5 signal_test      80803.20    82055.98   1252.78   1.55% Signal Traps/second
6 exec_test           61.75       61.53     -0.22  -0.36% Program Loads/second
7 fork_test         1327.01     1344.55     17.54   1.32% Task Creations/second
8 link_test         5531.53     5548.33     16.80   0.30% Link/Unlink Pairs/second

On this kernel, I forgot to disable the collection of buddy allocator statistics. Collection introduces more overhead in both CPU and memory. Here are the figures when statistic collection is also disabled via the config option. 
                 vanilla-mm  mbuddy-disabled-via-config-nostats
1 creat-clo        16006.00    15906.06    -99.94  -0.62% File Creations and Closes/second
2 page_test       117515.83   120736.54   3220.71   2.74% System Allocations & Pages/second
3 brk_test        440289.81   430311.61  -9978.20  -2.27% System Memory Allocations/second
4 jmp_test       4179466.67  4181683.33   2216.66   0.05% Non-local gotos/second
5 signal_test      80803.20    87387.54   6584.34   8.15% Signal Traps/second
6 exec_test           61.75       62.14      0.39   0.63% Program Loads/second
7 fork_test         1327.01     1345.77     18.76   1.41% Task Creations/second
8 link_test         5531.53     5556.72     25.19   0.46% Link/Unlink Pairs/second

So, now we have performance gains in a number of areas. Nice big jump in page_test, and that fork_test improvement probably won't hurt kbuild either, with exec_test giving a bit of a nudge. signal_test has a big hike for some reason; not sure who will benefit there, but hey, it can't be bad. I am annoyed with brk_test, especially as it is very similar to page_test in the aim9 source code, but there is no point hiding the result either. These figures do not tell us how kbuild really performs, of course. For that, kbuild needs to be run on both kernels and compared. This applies to any workload. This anti-defrag makes the code more complex and harder to read, no argument there. However, on at least one test machine, there is a very small difference when anti-defrag is enabled in comparison to a vanilla kernel. When the patches are applied and anti-defrag is disabled via a kernel option, we see a number of performance gains, on one machine at least, which is a good thing. Wider testing would show if these good figures are specific to my testbed or not. If other testbeds show up nothing bad, anti-defrag with this additional patch could give us the best of both worlds. If you have a hotplug machine or you care about high orders, enable this option. Otherwise, choose N and avoid the anti-defrag overhead. 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/gfp.h linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/gfp.h
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/gfp.h	2005-11-02 12:44:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/gfp.h	2005-11-02 12:49:24.000000000 +0000
@@ -50,6 +50,7 @@ struct vm_area_struct;
 #define __GFP_HARDWALL	0x40000u /* Enforce hardwall cpuset memory allocs */
 #define __GFP_VALID	0x80000000u /* valid GFP flags */
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 /*
  * Allocation type modifiers, these are required to be adjacent
  * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
@@ -61,6 +62,11 @@ struct vm_area_struct;
 #define __GFP_EASYRCLM	0x80000u /* User and other easily reclaimed pages */
 #define __GFP_KERNRCLM	0x100000u /* Kernel page that is reclaimable */
 #define __GFP_RCLM_BITS	(__GFP_EASYRCLM|__GFP_KERNRCLM)
+#else
+#define __GFP_EASYRCLM	0
+#define __GFP_KERNRCLM	0
+#define __GFP_RCLM_BITS	0
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 #define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/mmzone.h linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/mmzone.h	2005-11-02 12:44:07.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/mmzone.h	2005-11-02 13:00:56.000000000 +0000
@@ -23,6 +23,7 @@
 #endif
 #define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 /*
  * The two bit field __GFP_RECLAIMBITS enumerates the following types of
  * page reclaimability.
@@ -33,6 +34,14 @@
 #define RCLM_FALLBACK 3
 #define RCLM_TYPES 4
 #define BITS_PER_RCLM_TYPE 2
+#else
+#define RCLM_NORCLM 0
+#define RCLM_EASY 0
+#define RCLM_KERN 0
+#define RCLM_FALLBACK 0
+#define RCLM_TYPES 1
+#define BITS_PER_RCLM_TYPE 0
+#endif
 
 #define for_each_rclmtype_order(type, order) \
 	for (order = 0; order < MAX_ORDER; order++) \
@@ -60,6 +69,7 @@ struct zone_padding {
 #define ZONE_PADDING(name)
 #endif
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 /*
  * Indices into pcpu_list
  * PCPU_KERNEL: For RCLM_NORCLM and RCLM_KERN allocations
@@ -68,6 +78,11 @@ struct zone_padding {
 #define PCPU_KERNEL 0
 #define PCPU_EASY 1
 #define PCPU_TYPES 2
+#else
+#define PCPU_KERNEL 0
+#define PCPU_EASY 0
+#define PCPU_TYPES 1
+#endif
 
 struct per_cpu_pages {
 	int count[PCPU_TYPES]; /* Number of pages on each list */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/init/Kconfig linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/init/Kconfig
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/init/Kconfig	2005-11-02 12:42:20.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/init/Kconfig	2005-11-02 12:59:49.000000000 +0000
@@ -419,6 +419,17 @@ config CC_ALIGN_JUMPS
 	  no dummy operations need be executed. Zero means use compiler's
 	  default.
 
+config PAGEALLOC_ANTIDEFRAG
+	bool "Try and avoid fragmentation in the page allocator"
+	def_bool y
+	help
+	  The standard allocator will fragment memory over time which means that
+	  high order allocations will fail even if kswapd is running. If this
+	  option is set, the allocator will try and group page types into
+	  three groups KernNoRclm, KernRclm and EasyRclm. The gain is a best
+	  effort attempt at lowering fragmentation. The loss is more complexity
+
+
 endmenu		# General setup
 
 config TINY_SHMEM
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/mm/page_alloc.c linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/mm/page_alloc.c	2005-11-02 13:05:07.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/mm/page_alloc.c	2005-11-02 14:09:37.000000000 +0000
@@ -57,11 +57,17 @@ long nr_swap_pages;
  * fallback_allocs contains the fallback types for low memory conditions
  * where the preferred alloction type if not available.
  */
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
 	{RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN, RCLM_EASY, RCLM_TYPES},
 	{RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
 	{RCLM_KERN, RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
 };
+#else
+int fallback_allocs[RCLM_TYPES][RCLM_TYPES+1] = {
+	{RCLM_NORCLM, RCLM_TYPES}
+};
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 /* Returns 1 if the needed percentage of the zone is reserved for fallbacks */
 static inline int min_fallback_reserved(struct zone *zone)
@@ -98,6 +104,7 @@ EXPORT_SYMBOL(totalram_pages);
 #error __GFP_KERNRCLM not mapping to RCLM_KERN
 #endif
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 /*
  * This function maps gfpflags to their RCLM_TYPE. It makes assumptions
  * on the location of the GFP flags.
@@ -115,6 +122,12 @@ static inline int gfpflags_to_rclmtype(g
 
 	return rclmbits >> RCLM_SHIFT;
 }
+#else
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+	return RCLM_NORCLM;
+}
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 /*
  * copy_bits - Copy bits between bitmaps
@@ -134,6 +147,9 @@ static inline void copy_bits(unsigned lo
 			int sindex_src,
 			int nr)
 {
+	if (nr == 0)
+		return;
+
 	/*
 	 * Written like this to take advantage of arch-specific
 	 * set_bit() and clear_bit() functions
@@ -188,8 +204,12 @@ static char *zone_names[MAX_NR_ZONES] =
 int min_free_kbytes = 1024;
 
 #ifdef CONFIG_ALLOCSTATS
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 static char *type_names[RCLM_TYPES] = { "KernNoRclm", "EasyRclm",
 					"KernRclm", "Fallback"};
+#else
+static char *type_names[RCLM_TYPES] = { "KernNoRclm" };
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 #endif /* CONFIG_ALLOCSTATS */
 
 unsigned long __initdata nr_kernel_pages;
@@ -2228,8 +2248,10 @@ static void __init setup_usemap(struct p
 				struct zone *zone, unsigned long zonesize)
 {
 	unsigned long usemapsize = usemap_size(zonesize);
-	zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
-	memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+	if (usemapsize != 0) {
+		zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+		memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+	}
 }
 #else
 static void inline setup_usemap(struct pglist_data *pgdat,

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 @ 2005-11-04 1:00 Andy Nelson 2005-11-04 1:16 ` Martin J. Bligh 2005-11-04 5:14 ` Linus Torvalds 0 siblings, 2 replies; 241+ messages in thread From: Andy Nelson @ 2005-11-04 1:00 UTC (permalink / raw) To: mbligh, torvalds Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo, nickpiggin Linus writes: >Just face it - people who want memory hotplug had better know that >beforehand (and let's be honest - in practice it's only going to work in >virtualized environments or in environments where you can insert the new >bank of memory and copy it over and remove the old one with hw support). > >Same as hugetlb. > >Nobody sane _cares_. Nobody sane is asking for these things. Only people >with special needs are asking for it, and they know their needs. Hello, my name is Andy. I am insane. I am one of the CRAZY PEOPLE you wrote about. I am the whisperer in people's minds, causing them to conspire against sanity everywhere and make lives as insane and crazy as mine is. I love my work. I am an astrophysicist. I have lurked on various linux lists for years now, and this is my first time standing in front of all you people, hoping to make you bend your insane and crazy kernel developing minds to listen to the rantings of my insane and crazy HPC mind. I have done high performance computing in astrophysics for nearly two decades now. It gives me a perspective that kernel developers usually don't have, but sometimes need. For my part, I promise that I specifically do *not* have the perspective of a kernel developer. I don't even speak C. I don't really know what you folks do all day or night, and I actually don't much care except when it impacts my own work. I am fairly certain a lot of this hotplug/page defragmentation/page faulting/page zeroing stuff that the sgi and ibm folk are currently getting rejected from inclusion in the kernel impacts my work in very serious ways. You're right, I do know my needs. They are not being met and the people with the power to do anything about it call me insane and crazy and refuse to be interested even in making improvement possible, even when it quite likely helps them too. Today I didn't hear a voice in my head that told me to shoot the pope, but I did hear one telling me to write a note telling you about my issues, which apparently are in the 0.01% of insane crazies that should be ignored, as are about 1/2 of the people responding on this thread. I'll tell you a bit about my issues and their context now that things have gotten hot enough that even a devout lurker like me is posting. Some of it might make sense. Other parts may be internally inconsistent if only I knew enough. Still other parts may be useful to get people who don't talk to each other in contact, and think about things in ways they haven't. I run large hydrodynamic simulations using a variety of techniques whose relevance is only tangential to the current flamefest. I'll let you know the important details as they come in later. A lot of my statements will be common to a large fraction of all hpc applications, and I imagine to many large scale database applications as well, though I'm guessing a bit there. I run the same codes on many kinds of systems from workstations up to large supercomputing platforms. Mostly my experience has been in shared memory systems, but recently I've been part of things that will put me into distributed memory space as well. What does it mean to use computers like I do? 
Maybe this is surprising, but my executables are very, very small. Typically 1-2MB or less, with only a bit more needed for various external libraries like FFTW or the like. On the other hand, my memory requirements are huge. Typically many GB, and some folks run simulations with many TB. Picture a very small and very fast flea repeatedly jumping around all over the skin of a very large elephant, taking a bite at each jump, and that is a crude idea of what is happening.

This has bearing on the current discussion in the following ways, which are not theoretical in any way.

1) Some of these simulations frequently need to access data that is located very far away in memory. That means that the bigger your pages are, the fewer TLB misses you get, the smaller the thrashing, and the faster your code runs.

One example: I have a particle hydrodynamics code that uses gravity. Molecular dynamics simulations have similar issues with long range forces too. Gravity is calculated by culling acceptable nodes and atoms out of a tree structure that can be many GB in size, or for bigger jobs, many TB. You have to traverse the entire tree for every particle (or closely packed small group). During this stage, almost every node examination (a simple compare and about 5 flops) requires at least one TLB miss and, depending on how you've laid out your array, several TLB misses. Huge pages help this problem, big time. Fine with me if all I had was one single page. If I am stupid and get into swap territory, I deserve every bad thing that happens to me.

Now you have a list of a few thousand nodes and atoms with their data spread sparsely over that entire multi-GB memory volume. Grab data (about 10 double precision numbers) for one node, do 40-50 flops with it, and repeat, L1 and TLB thrashing your way through the entire list. There are some tricks that work sometimes (preload an L1-sized array of node data and use it for an entire group of particles, then discard it for another preload if there is more data; dimension arrays in the right direction, so you get multiple loads from the same cache line, etc.) but such things don't always work or aren't always useful. I can easily imagine database apps doing things not too dissimilar to this.

With my particular code, I have measured factors of several (\sim 3-4) speedup with large pages compared to small. This was measured on an Origin 3000, where 64k, 1MB and 16MB pages were used. Not a factor of several percent. A factor of several. I have also measured similar sorts of speedups on other types of machines. It is also not a factor related to NUMA. I can see other effects from that source and can distinguish between them.

Another example: Take a code that discretizes space on a grid in 3d and does something to various variables to make them evolve. You've got 3d arrays many GB in size, and for various calculations you have to sweep through them in each direction: x, y and z. Going in the z direction means that you are leaping across huge slices of memory every time you increment the grid zone by 1. In some codes only a few calculations are needed per zone. For example you want to take a derivative:

  deriv = (rho(i,j,k+1) - rho(i,j,k-1))/dz(k)

(I speak fortran, so the last index is the slow one here). Again, every calculation strides through huge distances and gets you a TLB miss or several (see the C sketch following this message). Note for the unwary: it usually does not make sense to transpose the arrays so that the fast index is the one you work with.
You don't have enough memory, for one thing, and you pay for the TLB overhead in the transpose anyway.

In both examples, with large pages the chances of getting a TLB hit are far, far higher than with small pages. That means I want truly huge pages. Assuming pages at all (various arches don't have them I think), a single one that covered my whole memory would be fine. Other codes don't seem to benefit so much from large pages, or even benefit from small pages, though my experience is minimal with such codes. Other folks run them on the same machines I do though:

2) The last paragraph above is important because of the way HPC works as an industry. We often don't just have a dedicated machine to run on, that gets booted once and one dedicated application runs on it till it dies or gets rebooted again. Many jobs run on the same machine. Some jobs run for weeks. Others run for a few hours over and over again. Some run massively parallel. Some run throughput.

How is this situation handled? With a batch scheduler. You submit a job to run and ask for X cpus, Y memory and Z time. It goes and fits you in wherever it can. cpusets were helpful infrastructure in linux for this. You may get some cpus on one side of the machine, some more on the other, and memory associated with still others. They do a pretty good job of allocating resources sanely, but there is only so much that they can do.

The important point here for page-related discussions is that someone, you don't know who, was running on those cpus and memory before you. And doing Ghu Knows What with it. This code could be running something that benefits from small pages, or it could be running with large pages. It could be dynamically allocating and freeing large or small blocks of memory, or it could be allocating everything at the beginning and running statically thereafter. Different codes do different things. That means that the memory state could be totally fubar'ed before your job ever gets any time allocated to it.

>Nobody takes a random machine and says "ok, we'll now put our most
>performance-critical database on this machine, and oh, btw, you can't
>reboot it and tune for it beforehand".

Wanna bet? What I wrote above makes tuning the machine itself totally ineffective. What do you tune for? Tuning for one person's code makes someone else's slower. Tuning for the same code on one input makes another input run horribly. You also can't be rebooting after every job. What about all the other ones that weren't done yet? You'd piss off everyone running there, and it takes too long besides.

What about a machine that is running multiple instances of some database, some bigger or smaller than others, or doing other kinds of work? Do you penalize the big ones or the small ones, this kind of work or that? You also can't establish zones that can't be changed on the fly as things on the system change. How do zones like that fit into numa? How do things work when suddenly you've got a job that wants the entire memory filled with large pages and you've only got half your system set up for large pages? What if you tune the system that way and then let that job run, and for some stupid user reason it dies 10 minutes after starting? Do you let the 30 other jobs in the queue sit idle because they want a different page distribution?

This way lies madness. Sysadmins just say no and set up the machine as stably as they can, usually with something not too different from whatever the manufacturer recommends as a default. For very good reasons.
I would bet the only kind of zone stuff that could even possibly work would be related to a cpu/memset zone arrangement. See below.

3) I have experimented quite a bit with the page merge infrastructure that exists on irix. I understand that similar large page and merge infrastructure exists on solaris, though I haven't run on such systems. I can get very good page distributions if I run immediately after reboot. I get progressively worse distributions if my job runs only a few days or weeks later. My experience is that after some days or weeks of running have gone by, there is no possible way short of a reboot to get pages merged effectively back to any pristine state with the infrastructure that exists there. Some improvement can be had, however, with a bit of pain.

What I would like to see is not a theoretical, general, all-purpose defragmentation and hotplug scheme, but one that can work effectively with the kinds of constraints that a batch scheduler imposes. I would even imagine that a more general scheduler type of situation could be effective if that scheduler was smart enough. God knows, the scheduler in linux has been rewritten often enough. What is one more time for this purpose too?

You may claim that this sort of merge stuff requires excessive time for the OS. Nothing could matter to me less. I've got those cpu's full time for the next X days, and if I want them to spend the first 5 minutes or whatever of my run making the place comfortable, so that my job gets done three days earlier, then I want to spend that time.

4) The thing is that all of this memory management at this level is not the batch scheduler's job, it's the OS's job. The thing that will make it work is that in the case of a reasonably intelligent batch scheduler (there are many), you are absolutely certain that nothing else is running on those cpus and that memory. Except whatever the kernel sprinkled in and didn't clean up afterwards. So why can't the kernel clean up after itself? Why does the kernel need to keep anything in this memory anyway? I supposedly have a guarantee that it is mine, but it goes and immediately violates that guarantee long before I even get started. I want all that kernel stuff gone from my allocation and reset to a nice, sane pristine state.

The thing that would make all of it work is good fragmentation and hotplug type stuff in the kernel. Push everything that the kernel did to my memory into the bitbucket and start over. There shouldn't be anything there that it needs to remember from before anyway. Perhaps this is what the defragmentation stuff is supposed to help with. Probably it has other uses that aren't on my agenda. Like pulling out bad ram sticks or whatever. Perhaps there are things that need to be remembered. Certainly being able to hotunplug those pieces would do it. Just do everything but unplug it from the board, and then do a hotplug to turn it back on.

5) You seem to claim that issues I wrote about above are 'theoretical general cases'. They are not, at least not to any more people than the 0.01% of people who regularly time their kernel builds, as I saw someone doing some emails ago. Using that sort of argument as a reason not to incorporate this sort of infrastructure just about made me fall out of my chair, especially in the context of keeping the sane case sane.
Since this thread has long since lost decency and meaning and descended into name calling, I suppose I'll pitch in with that too, on two fronts:

1) I'd say someone making that sort of argument is doing some very serious navel gazing.

2) Here's a cluebat: that ain't one of the sane cases you wrote about.

That said, it appears to me there are a variety of constituencies that have some serious interest in this infrastructure:

1) HPC stuff
2) big database stuff
3) people who are pushing hotplug for other reasons, like the bad memory replacement stuff I saw discussed
4) whatever else the hotplug folk want that I don't follow

Seems to me that is a bit more than 0.01%.

>When you hear voices in your head that tell you to shoot the pope, do you
>do what they say? Same thing goes for customers and managers. They are the
>crazy voices in your head, and you need to set them right, not just
>blindly do what they ask for.

I don't care if you do what I ask for, but I do start getting irate and start writing long annoyed letters if I can't do what I need to do, and find out that someone could do something about it but refuses.

That said, I'm not so hot any more so I'll just unplug now.

Andy Nelson

PS: I read these lists at an archive, so if responders want to rm me from any cc's that is fine. I'll still read what I want or need to from there.

--
Andy Nelson                      Theoretical Astrophysics Division (T-6)
andy dot nelson at lanl dot gov  Los Alamos National Laboratory
http://www.phys.lsu.edu/~andy    Los Alamos, NM 87545

^ permalink raw reply	[flat|nested] 241+ messages in thread
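To make the page-size arithmetic in Andy's grid example concrete, here is a small C rendering of the z-direction sweep he describes (an editorial sketch with illustrative sizes, not code from the thread): each step in k jumps a whole NX*NY plane of doubles, 2 MB here, so with 4 KB pages every access lands on a different page, while a single 16 MB page covers eight consecutive k-planes.

#include <stdlib.h>

enum { NX = 512, NY = 512, NZ = 64 };	/* illustrative grid sizes */

/*
 * z-direction derivative sweep over a Fortran-ordered (x fastest)
 * grid flattened into a 1-D array.  Consecutive k values are
 * NX*NY*sizeof(double) = 2 MB apart, so with 4 KB pages every access
 * below touches a different page and likely misses the TLB; with
 * 16 MB pages, eight whole k-planes share one TLB entry.
 */
static void z_derivative(const double *rho, double *deriv,
			 const double *dz, int i, int j)
{
	const size_t plane = (size_t)NX * NY;
	const size_t base = (size_t)j * NX + i;

	for (int k = 1; k < NZ - 1; k++)
		deriv[base + k * plane] =
			(rho[base + (k + 1) * plane] -
			 rho[base + (k - 1) * plane]) / dz[k];
}

The arrays would be heap-allocated in practice (they are far too large for the stack); the indexing, not the allocation, is the point of the sketch.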
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 1:00 Andy Nelson @ 2005-11-04 1:16 ` Martin J. Bligh 2005-11-04 1:27 ` Nick Piggin 2005-11-04 5:14 ` Linus Torvalds 1 sibling, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-04 1:16 UTC (permalink / raw) To: Andy Nelson, torvalds Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo, nickpiggin > Linus writes: > >> Just face it - people who want memory hotplug had better know that >> beforehand (and let's be honest - in practice it's only going to work in >> virtualized environments or in environments where you can insert the new >> bank of memory and copy it over and remove the old one with hw support). >> >> Same as hugetlb. >> >> Nobody sane _cares_. Nobody sane is asking for these things. Only people >> with special needs are asking for it, and they know their needs. > > > Hello, my name is Andy. I am insane. I am one of the CRAZY PEOPLE you wrote > about. To provide a slightly shorter version ... we had one customer running similarly large number crunching things in Fortran. Their app ran 25% faster with large pages (not a typo). Because they ran a variety of jobs in batch mode, they need large pages sometimes, and small pages at others - hence they need to dynamically resize the pool. That's the sort of thing we were trying to fix with dynamically sized hugepage pools. It does make a huge difference to real-world customers. M. ^ permalink raw reply [flat|nested] 241+ messages in thread
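For reference, the pool Martin describes resizing is driven today through a single global knob; the commands below show the existing interface, with an illustrative size:

  # reserve 512 huge pages (1 GB on x86 with 2 MB pages); this can
  # silently fall short if memory is already fragmented, which is
  # exactly the problem under discussion
  echo 512 > /proc/sys/vm/nr_hugepages
  grep HugePages /proc/meminfo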
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 1:16 ` Martin J. Bligh
@ 2005-11-04 1:27 ` Nick Piggin
0 siblings, 0 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-04 1:27 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andy Nelson, torvalds, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo

Martin J. Bligh wrote:
>
> To provide a slightly shorter version ... we had one customer running
> similarly large number crunching things in Fortran. Their app ran 25%
> faster with large pages (not a typo). Because they ran a variety of
> jobs in batch mode, they need large pages sometimes, and small pages
> at others - hence they need to dynamically resize the pool.
>
> That's the sort of thing we were trying to fix with dynamically sized
> hugepage pools. It does make a huge difference to real-world customers.
>

Aren't HPC users very easy? In fact, probably the easiest, because they are generally not very kernel intensive (apart from perhaps some batches of IO at the beginning and end of the jobs). A reclaimable zone should provide exactly what they need. I assume the sysadmin can give some reasonable upper and lower estimates of the memory requirements.

They don't need to dynamically resize the pool because it is all being allocated to pagecache anyway, so all jobs are satisfied from the reclaimable zone.

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 1:00 Andy Nelson
2005-11-04 1:16 ` Martin J. Bligh
@ 2005-11-04 5:14 ` Linus Torvalds
2005-11-04 6:10 ` Paul Jackson
2005-11-04 14:56 ` Andy Nelson
1 sibling, 2 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 5:14 UTC (permalink / raw)
To: Andy Nelson
Cc: mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo, nickpiggin

On Thu, 3 Nov 2005, Andy Nelson wrote:
>
> I have done high performance computing in astrophysics for nearly two
> decades now. It gives me a perspective that kernel developers usually
> don't have, but sometimes need. For my part, I promise that I specifically
> do *not* have the perspective of a kernel developer. I don't even speak C.

Hey, cool. You're a physicist, and you'd like to get closer to 100% efficiency out of your computer.

And that's really nice, because maybe we can strike a deal. Because I also have a problem with my computer, and a physicist might just help _me_ get closer to 100% efficiency out of _my_ computer.

Let me explain. I've got a laptop that takes about 45W, maybe 60W under load. And it has a battery that weighs about 350 grams.

Now, I know that if I were to get 100% energy efficiency out of that battery, a trivial physics calculation tells me that e=mc^2, and that my battery _should_ have a hell of a lot of energy in it. In fact, according to my simplistic calculations, it turns out that my laptop _should_ have a battery life of tens of millions of years.

It turns out that isn't really the case in practice, but I'm hoping you can help me out. I obviously don't need it to be really 100% efficient, but on the other hand, I'd also like the battery to be slightly lighter, so if you could just make sure that it's at least _slightly_ closer to the theoretical values I should be getting out of it, maybe I wouldn't need to find one of those nasty electrical outlets every few hours.

Do we have a deal? After all, you only need to improve my battery efficiency by a really _tiny_ amount, and I'll never need to recharge it again. And I'll fix your problem.

Or are you maybe willing to make a few compromises in the name of being realistic, and living with something less than the theoretical peak performance of what you're doing? I'm willing to compromise by using only the chemical energy of the processes involved, and not even a hundred percent efficiency at that. Maybe you'd be willing to compromise by using a few kernel boot-time command line options for your not-very-common load.

Ok?

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
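For anyone checking the arithmetic behind the joke, assuming the 350 g battery and the 60 W load quoted above:

  E = mc^2 = 0.35 kg * (3*10^8 m/s)^2 ~ 3.1*10^16 J
  t = E/P  = 3.1*10^16 J / 60 W ~ 5.2*10^14 s ~ 1.7*10^7 years

So full mass-energy conversion works out to tens of millions of years of battery life - absurd enough either way for the argument being made.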
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 5:14 ` Linus Torvalds
@ 2005-11-04 6:10 ` Paul Jackson
2005-11-04 6:38 ` Ingo Molnar
2005-11-04 7:44 ` Eric Dumazet
1 sibling, 2 replies; 241+ messages in thread
From: Paul Jackson @ 2005-11-04 6:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo, nickpiggin

Linus wrote:
> Maybe you'd be willing to compromise by using a few kernel boot-time
> command line options for your not-very-common load.

If we were only a few options away from running Andy's varying load mix with something close to ideal performance, we'd be in fat city, and Andy would never have been driven to write that rant.

There's more to it than that, but it is not as impossible as a battery with the efficiencies you (and the rest of us) dream of. Andy has used systems that resemble what he is seeking. So he is not asking for something clearly impossible. Though it might not yet be possible, in ways that contribute to a continuing healthy kernel code base.

It's an interesting challenge - finding ways to improve the kernel's performance on such high end loads, that are also suitable and desirable (or at least innocent enough) for inclusion in a kernel far more widely used in embeddeds, desktops and ordinary servers.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 6:10 ` Paul Jackson
@ 2005-11-04 6:38 ` Ingo Molnar
2005-11-04 7:26 ` Paul Jackson
2005-11-04 15:31 ` Linus Torvalds
1 sibling, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 6:38 UTC (permalink / raw)
To: Paul Jackson
Cc: Linus Torvalds, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

* Paul Jackson <pj@sgi.com> wrote:

> Linus wrote:
> > Maybe you'd be willing to compromise by using a few kernel boot-time
> > command line options for your not-very-common load.
>
> If we were only a few options away from running Andy's varying load
> mix with something close to ideal performance, we'd be in fat city,
> and Andy would never have been driven to write that rant.
>
> There's more to it than that, but it is not as impossible as a battery
> with the efficiencies you (and the rest of us) dream of.

Just to make sure I didn't get it wrong: wouldn't we get most of the benefits Andy is seeking by having a boot-time option which sets aside a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool - with the growing happening on a best-effort basis, without guarantees?

I implemented precisely such a scheme for 'bigpages' years ago, and it worked reasonably well. (I was lazy and didn't implement it as a resizable zone, but as a list of large pages taken straight off the buddy allocator. This made dynamic resizing really easy, and I didn't have to muck with the buddy and mem_map[] data structures that zone-resizing forces us to do. It had the disadvantage of those pages skewing the memory balance of the affected zone.)

My quick solution was good enough that on a test-system I could resize the pool across Oracle test-runs, when the box was otherwise quiet. I'd expect a well-controlled HPC system to be equally resizable.

What we cannot offer is a guarantee to be able to grow the pool. Hence the /proc mechanism would be called:

	/proc/sys/vm/try_to_grow_hugemem_pool

to clearly stress the 'might easily fail' restriction. But if userspace is well-behaved on Andy's systems (which it seems to be), then in practice it should be resizable. On a generic system, only the boot-time option is guaranteed to allocate as much RAM as possible.

And once this functionality has been clearly communicated and separated, the 'try to alloc a large page' thing could become more aggressive: it could attempt to construct large pages if it can. I don't think we object to such a capability, as long as the restrictions are clearly communicated. (And no, that doesn't mean some obscure Documentation/ entry - the restrictions have to be obvious from the primary way of usage. I.e. no /proc/sys/vm/hugemem_pool_size thing where growing could fail.)

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
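A minimal sketch of the "best effort, may fail" grow operation Ingo describes (editorial; enqueue_huge_page() is an assumed helper standing in for whatever the hugetlb pool uses to absorb a fresh page, not a quote of real code):

/*
 * Sketch of a best-effort pool grow in the spirit of
 * /proc/sys/vm/try_to_grow_hugemem_pool: pull huge-page-order blocks
 * off the buddy allocator, stop at the first failure, and report how
 * far we actually got.  No guarantees, by design.
 */
#include <linux/mm.h>
#include <linux/hugetlb.h>

static unsigned long try_to_grow_hugemem_pool(unsigned long nr_wanted)
{
	unsigned long nr_grown = 0;

	while (nr_grown < nr_wanted) {
		/* __GFP_COMP: huge pages are compound pages */
		struct page *page = alloc_pages(GFP_HIGHUSER | __GFP_COMP |
						__GFP_NOWARN,
						HUGETLB_PAGE_ORDER);
		if (!page)
			break;		/* might easily fail - that's the contract */
		enqueue_huge_page(page);	/* assumed pool helper */
		nr_grown++;
	}
	return nr_grown;
}

Returning the achieved count rather than an error is one way to make the "might easily fail" semantics obvious at the primary point of usage, as the text above insists.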
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 6:38 ` Ingo Molnar @ 2005-11-04 7:26 ` Paul Jackson 2005-11-04 7:37 ` Ingo Molnar 2005-11-04 15:31 ` Linus Torvalds 1 sibling, 1 reply; 241+ messages in thread From: Paul Jackson @ 2005-11-04 7:26 UTC (permalink / raw) To: Ingo Molnar Cc: torvalds, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin Ingo wrote: > to clearly stress the 'might easily fail' restriction. But if userspace > is well-behaved on Andy's systems (which it seems to be), then in > practice it should be resizable. At first glance, this is the sticky point that jumps out at me. Andy wrote: > My experience is that after some days or weeks of running have gone > by, there is no possible way short of a reboot to get pages merged > effectively back to any pristine state with the infrastructure that > exists there. I take it, from what Andy writes, and from my other experience with similar customers, that his workload is not "well-behaved" in the sense you hoped for. After several diverse jobs are run, we cannot, so far as I know, merge small pages back to big pages. I have not played with Mel Gorman's Fragmentation Avoidance patches, so don't know if they would provide a substantial improvement here. They well might. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 7:26 ` Paul Jackson
@ 2005-11-04 7:37 ` Ingo Molnar
0 siblings, 0 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 7:37 UTC (permalink / raw)
To: Paul Jackson
Cc: torvalds, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

* Paul Jackson <pj@sgi.com> wrote:

> At first glance, this is the sticky point that jumps out at me.
>
> Andy wrote:
> > My experience is that after some days or weeks of running have gone
> > by, there is no possible way short of a reboot to get pages merged
> > effectively back to any pristine state with the infrastructure that
> > exists there.
>
> I take it, from what Andy writes, and from my other experience with
> similar customers, that his workload is not "well-behaved" in the
> sense you hoped for.
>
> After several diverse jobs are run, we cannot, so far as I know, merge
> small pages back to big pages.

OK, so the zone solution it has to be. I.e. the moment it's a separate special zone, you can boot with most of the RAM being in that zone, and you are all set. It can be used both for hugetlb allocations, and for other PAGE_SIZE allocations as well, in a highmem fashion. These HPC setups are rarely kernel-intense. Thus the only dynamic sizing decision that has to be taken is to determine the amount of 'generic kernel RAM' that is needed in the worst case.

To give an example: say on a 256 GB box, set aside 8 GB for generic kernel needs, and have 248 GB in the hugemem zone. This leaves us with the following scenario: apps can use up to 97% of all RAM for hugemem, and they can use up to 100% of all RAM for PAGE_SIZE allocations. 3% of RAM can be used by generic kernel needs. Sounds pretty reasonable and straightforward from a system management point of view. No runtime resizing, but it wouldn't be needed, unless kernel activity needs more than 8 GB of RAM.

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 6:38 ` Ingo Molnar
2005-11-04 7:26 ` Paul Jackson
@ 2005-11-04 15:31 ` Linus Torvalds
2005-11-04 15:39 ` Martin J. Bligh
` (2 more replies)
1 sibling, 3 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 15:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

On Fri, 4 Nov 2005, Ingo Molnar wrote:
>
> Just to make sure I didn't get it wrong: wouldn't we get most of the
> benefits Andy is seeking by having a boot-time option which sets aside
> a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
> - with the growing happening on a best-effort basis, without guarantees?

Boot-time option to set the hugetlb zone, yes.

Grow-or-shrink, probably not. Not in practice after bootup on any machine that is less than idle.

The zones have to be pretty big to make any sense. You don't just grow them or shrink them - they'd be on the order of tens of megabytes to gigabytes. In other words, sized big enough that you will _not_ be able to create them on demand, except perhaps right after boot.

Growing these things later simply isn't reasonable. I can pretty much guarantee that any kernel I maintain will never have dynamic kernel pointers: when some memory has been allocated with kmalloc() (or equivalent routines - pretty much _any_ kernel allocation), it stays put. Which means that if there is a _single_ kernel alloc in such a zone, it won't then ever be usable for hugetlb stuff.

And I don't want excessive complexity. We can have things like "turn off kernel allocations from this zone", and then wait a day or two, and hope that there aren't long-term allocs. It might even work occasionally. But the fact is, a number of kernel allocations _are_ long-term (superblocks, root dentries, "struct thread_struct" for long-running user daemons), and it's simply not going to work well in practice unless you have set aside the "no kernel alloc" zone pretty early on.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:31 ` Linus Torvalds
@ 2005-11-04 15:39 ` Martin J. Bligh
2005-11-04 15:53 ` Ingo Molnar
2005-11-06 8:44 ` Kyle Moffett
2 siblings, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-04 15:39 UTC (permalink / raw)
To: Linus Torvalds, Ingo Molnar
Cc: Paul Jackson, andy, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

>> Just to make sure I didn't get it wrong: wouldn't we get most of the
>> benefits Andy is seeking by having a boot-time option which sets aside
>> a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
>> - with the growing happening on a best-effort basis, without guarantees?
>
> Boot-time option to set the hugetlb zone, yes.
>
> Grow-or-shrink, probably not. Not in practice after bootup on any machine
> that is less than idle.
>
> The zones have to be pretty big to make any sense. You don't just grow
> them or shrink them - they'd be on the order of tens of megabytes to
> gigabytes. In other words, sized big enough that you will _not_ be able to
> create them on demand, except perhaps right after boot.
>
> Growing these things later simply isn't reasonable. I can pretty much
> guarantee that any kernel I maintain will never have dynamic kernel
> pointers: when some memory has been allocated with kmalloc() (or
> equivalent routines - pretty much _any_ kernel allocation), it stays put.
> Which means that if there is a _single_ kernel alloc in such a zone, it
> won't then ever be usable for hugetlb stuff.
>
> And I don't want excessive complexity. We can have things like "turn off
> kernel allocations from this zone", and then wait a day or two, and hope
> that there aren't long-term allocs. It might even work occasionally. But
> the fact is, a number of kernel allocations _are_ long-term (superblocks,
> root dentries, "struct thread_struct" for long-running user daemons), and
> it's simply not going to work well in practice unless you have set aside
> the "no kernel alloc" zone pretty early on.

Exactly. But that's what all the anti-fragmentation stuff was about - trying to pack unfreeable stuff together.

I don't think anyone is proposing dynamic kernel pointers inside Linux, except in that we could possibly change the P-V mapping underneath from the hypervisor, so that the phys address would change, but you wouldn't see it. Trouble is, that's mostly done on a larger-than-page size granularity, so we need SOME larger chunk to switch out (preferably at least a large-page size, so we can continue to use large TLB entries for the kernel mapping).

However, the statically sized option is hugely problematic too.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:31 ` Linus Torvalds
2005-11-04 15:39 ` Martin J. Bligh
@ 2005-11-04 15:53 ` Ingo Molnar
2005-11-06 7:34 ` Paul Jackson
1 sibling, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 15:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

* Linus Torvalds <torvalds@osdl.org> wrote:

> Boot-time option to set the hugetlb zone, yes.
>
> Grow-or-shrink, probably not. Not in practice after bootup on any
> machine that is less than idle.
>
> The zones have to be pretty big to make any sense. You don't just grow
> them or shrink them - they'd be on the order of tens of megabytes to
> gigabytes. In other words, sized big enough that you will _not_ be
> able to create them on demand, except perhaps right after boot.

I think the current hugepages=<N> boot option could transparently be morphed into a 'separate zone' approach, and /proc/sys/vm/nr_hugepages would just refuse to change (or would go away altogether). Dynamically growing zones seem like a lot of trouble, without much gain.

[ OTOH the hugepages= parameter unit should be changed from the current 'number of hugepages' to plain RAM metrics - megabytes/gigabytes. ]

That would solve two problems: any 'zone VM statistics skewing effect' of the current hugetlbs (which is a preallocated list of really large pages) would go away, and the hugetlb zone could potentially be utilized for easily freeable objects.

This would already be a lot more flexible than what we have: the hugetlb area would not be 'lost' altogether, like now. Once we are at this stage we can see how usable it is in practice. I strongly suspect it will cover most of the HPC uses.

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
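Concretely, the unit change suggested above would look something like this on the boot command line; the first form is the existing option, the second is hypothetical:

  hugepages=512     (today: a count of huge pages; 1 GB on x86 with 2 MB pages)
  hugepages=1G      (proposed: a plain RAM metric sizing the hugetlb zone)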
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:53 ` Ingo Molnar
@ 2005-11-06 7:34 ` Paul Jackson
2005-11-06 15:55 ` Linus Torvalds
0 siblings, 1 reply; 241+ messages in thread
From: Paul Jackson @ 2005-11-06 7:34 UTC (permalink / raw)
To: Ingo Molnar
Cc: torvalds, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

Ingo wrote:
> I think the current hugepages=<N> boot option could transparently be
> morphed into a 'separate zone' approach, and ...
>
> This would already be a lot more flexible than what we have: the hugetlb
> area would not be 'lost' altogether, like now. Once we are at this stage
> we can see how usable it is in practice. I strongly suspect it will
> cover most of the HPC uses.

It seems to me this is making it harder than it should be. You're trying to create a zone that is 100% cleanable, whereas the HPC folks only desire 99.8% cleanable.

Unlike the hot(un)plug folks, the HPC folks don't mind a few pages of Linus's unmoveable kmalloc memory in their way. They rather expect that some modest percentage of each node will have some 'kernel stuff' on it that refuses to move. They just want to be able to free up most of the pages on a node, once one job is done there, before the next job begins. They are also quite willing (based on my experience with bootcpusets) to designate a few nodes for the 'general purpose Unix load', and reserve the remaining nodes just to run their special jobs.

On the other hand, as Eric Dumazet mentions on another subthread of this topic, requiring that their apps use the hugetlbfs interface to place the bulk of their memory would be a serious obstacle. Their apps are already fairly tightly wound around a rich variety of compiler, tool, library and runtime memory placement mechanisms, and they would be hard-pressed to make systematic changes in that.

I suspect that the answers lie in some further improvements in memory placement on various nodes. Perhaps this means a cpuset option to put the easily reclaimed (what Mel Gorman's patch would mark with __GFP_EASYRCLM) kernel pages and the user pages on the nodes of the current cpuset, but to prefer placing the less easily reclaimed pages on the bootcpuset nodes. Then, when a job on such a dedicated set of nodes completed, most of the memory would be easily reclaimable, in preparation for the next job.

The bootcpuset stuff is entirely invisible to kernel hackers, because I am doing it entirely in user space, with a pre-init program that configures the bootcpuset, moves the unpinned kernel threads into the bootcpuset, and fires up the real init in that bootcpuset.

With one more twist to the cpuset API, providing a way to state per-cpuset a separate set of nodes (on what the HPC folks would call their bootcpuset) as the preferred place to allocate not-EASYRCLM kernel memory, we might be very close to meeting these HPC needs, with no changes to or reliance on hugetlbs, with no changes to the kernel boottime code, and with no changes to the memory management mechanisms used within these HPC apps.

I am imagining yet another per-cpuset field, which I call 'kmems'. It would be a nodemask, as is the current 'mems' field. I'd pick up the __GFP_EASYRCLM flag of Mel Gorman's patch (no comment on suitability of the rest of his patch), and prefer to place __GFP_EASYRCLM pages on the 'mems' nodes, but spread other pages evenly across the 'kmems' nodes.
For compatibility with the current cpuset API, an unset 'kmems' would tell the kernel to use the 'mems' setting as a fallback.

The hardest part might be providing a mechanism, to be invoked by the batch scheduler between jobs, to flush the easily reclaimed memory off a node (free it or write it to disk). Again, unlike the hot(un)plug folks, a 98% solution is plenty good enough.

This will have to be coded and some HPC-type loads tried on it, before we know if it flies.

There is an obvious, unanswered question here. Would moving some of the kernel's pages (the not easily reclaimed pages) off the current (faulting) node into some possibly far-off node be an acceptable price to pay, to increase the percentage of the dedicated job nodes that can be freed up between jobs? Since these HPC jobs tend to be far more sensitive to their own internal data placement than they are to the kernel's internal data placement, I am hopeful that this tradeoff is a good one, for HPC apps.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 241+ messages in thread
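In terms of the existing cpuset filesystem, the proposal might be driven like this; the 'kmems' file is hypothetical (it is exactly the new twist being proposed), while everything else is the current interface:

  mount -t cpuset cpuset /dev/cpuset
  mkdir /dev/cpuset/hpcjob
  echo 4-63 > /dev/cpuset/hpcjob/mems    (job nodes: user pages and EASYRCLM kernel pages)
  echo 0-3  > /dev/cpuset/hpcjob/kmems   (hypothetical: nodes preferred for hard-to-reclaim kernel pages)

With an unset 'kmems' falling back to 'mems', existing cpuset configurations would behave exactly as today.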
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-06 7:34 ` Paul Jackson @ 2005-11-06 15:55 ` Linus Torvalds 2005-11-06 18:18 ` Paul Jackson 0 siblings, 1 reply; 241+ messages in thread From: Linus Torvalds @ 2005-11-06 15:55 UTC (permalink / raw) To: Paul Jackson Cc: Ingo Molnar, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin On Sat, 5 Nov 2005, Paul Jackson wrote: > > It seems to me this is making it harder than it should be. You're > trying to create a zone that is 100% cleanable, whereas the HPC folks > only desire 99.8% cleanable. Well, 99.8% is pretty borderline. > Unlike the hot(un)plug folks, the HPC folks don't mind a few pages of > Linus's unmoveable kmalloc memory in their way. They rather expect > that some modest percentage of each node will have some 'kernel stuff' > on it that refuses to move. The thing is, if 99.8% of memory is cleanable, the 0.2% is still enough to make pretty much _every_ hugepage in the system pinned down. Besides, right now, it's not 99.8% anyway. Not even close. It's more like 60%, and then horribly horribly ugly hacks that try to do something about the remaining 40% and usually fail (the hacks might get it closer to 99%, but they are fragile, expensive, and ugly as hell). It used to be that HIGHMEM pages were always cleanable on x86, but even that isn't true any more, since now at least pipe buffers can be there too. I agree that HPC people are usually a bit less up-tight about things than database people tend to be, and many of them won't care at all, but if you want hugetlb, you'll need big areas. Side note: the exact size of hugetlb is obviously architecture-specific, and the size matters a lot. On x86, for example, hugetlb pages are either 2MB or 4MB in size (and apparently 2GB may be coming). I assume that's where you got the 99.8% from (4kB out of 2M). Other platforms have more flexibility, but sometimes want bigger areas still. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-06 15:55 ` Linus Torvalds
@ 2005-11-06 18:18 ` Paul Jackson
0 siblings, 0 replies; 241+ messages in thread
From: Paul Jackson @ 2005-11-06 18:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: mingo, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

Linus wrote:
> The thing is, if 99.8% of memory is cleanable, the 0.2% is still enough to
> make pretty much _every_ hugepage in the system pinned down.

Agreed.

I realized after writing this that I wasn't clear on something. I wasn't focused on the subject of this thread, adding hugetlb pages after the system has been up a while. I was focusing on a related subject - freeing up most of the ordinary size pages on the dedicated application nodes between jobs on a large system using

 * a bootcpuset (for the classic Unix load) and
 * dedicated nodes (for the HPC apps).

I am looking to provide the combination of:

 1) specifying some hugetlb pages at system boot, plus
 2) the ability to clean off most of the ordinary sized pages from the application nodes between jobs.

Perhaps Andy or some of my HPC customers wish I was also looking to provide:

 3) the ability to add lots of hugetlb pages on the application nodes after the system has run a while.

But if they are, then they have some more educatin' to do on me. For now, I am sympathetic to your concerns with code and locking complexity. Freeing up great globs of hugetlb sized contiguous chunks of memory after a system has run a while would be hard.

We have to be careful which hard problems we decide to take on. We can't take on too many, and we have to pick ones that will provide a major long term advantage to Linux, over the foreseeable changes in system hardware and architecture.

Even if most of the processors that Andy has tested against would benefit from dynamically added hugetlb pages, if we can anticipate that this will not be a sustained opportunity for Linux (and looking at current x86 chips doesn't require much anticipating) then that might not be the place to invest our precious core complexity dollars.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:31 ` Linus Torvalds
2005-11-04 15:39 ` Martin J. Bligh
2005-11-04 15:53 ` Ingo Molnar
@ 2005-11-06 8:44 ` Kyle Moffett
2005-11-06 16:12 ` Linus Torvalds
2 siblings, 1 reply; 241+ messages in thread
From: Kyle Moffett @ 2005-11-06 8:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

On Nov 4, 2005, at 10:31:48, Linus Torvalds wrote:
> I can pretty much guarantee that any kernel I maintain will never
> have dynamic kernel pointers: when some memory has been allocated
> with kmalloc() (or equivalent routines - pretty much _any_ kernel
> allocation), it stays put.

Hmm, this brings up something that I haven't seen discussed on this list (maybe a long time ago, but perhaps it should be brought up again?). What are the pros/cons to having a non-physically-linear kernel virtual memory space? Would it be theoretically possible to allow some kind of dynamic kernel page swapping, such that the _same_ kernel-virtual pointer goes to a different physical memory page? That would definitely satisfy the memory hotplug people, but I don't know what the tradeoffs would be for normal boxen.

It seems like the trick would be to make sure that page accesses _during_ the swap are correctly handled. If the page-swapper included code in the kernel fault handler to notice that a page was in the process of being swapped out/in by another CPU, it could just wait for swap-in to finish and then resume from the new page. This would get messy with DMA and non-cpu memory accessors and such, which I assume is why this hasn't been implemented in the past.

From what I can see, the really dumb-obvious-slow method would be to call the first and last parts of software-suspend. As memory hotplug is a relatively rare event, this would probably work well enough given the requirements:

1) Run software suspend pre-memory-dump code
2) Move pages off the to-be-removed node, remapping the kernel space to the new locations.
3) Mark the node so that new pages don't end up on it
4) Run software suspend post-memory-reload code

<random-guessing>
Perhaps the non-contiguous memory support would be of some help here?
</random-guessing>

Cheers,
Kyle Moffett

--
Simple things should be simple and complex things should be possible
  -- Alan Kay

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-06 8:44 ` Kyle Moffett
@ 2005-11-06 16:12 ` Linus Torvalds
2005-11-06 17:00 ` Linus Torvalds
0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-06 16:12 UTC (permalink / raw)
To: Kyle Moffett
Cc: Ingo Molnar, Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

On Sun, 6 Nov 2005, Kyle Moffett wrote:
>
> Hmm, this brings up something that I haven't seen discussed on this list
> (maybe a long time ago, but perhaps it should be brought up again?). What are
> the pros/cons to having a non-physically-linear kernel virtual memory space?

Well, we _do_ actually have that, and we use it quite a bit. Both vmalloc() and HIGHMEM work that way.

The biggest problem with vmalloc() is that the virtual space is often as constrained as the physical one (ie on old x86-32, the virtual address space is the bigger problem - you may have 36 bits of physical memory, but the kernel has only 30 bits of virtual). But it's quite commonly used for stuff that wants big linear areas.

The HIGHMEM approach works fine, but the overhead of essentially doing a software TLB is quite high, and if we never ever have to do it again on any architecture, I suspect everybody will be pretty happy.

> Would it be theoretically possible to allow some kind of dynamic kernel page
> swapping, such that the _same_ kernel-virtual pointer goes to a different
> physical memory page? That would definitely satisfy the memory hotplug
> people, but I don't know what the tradeoffs would be for normal boxen.

Any virtualization will try to do that, but they _all_ prefer huge pages if they care at all about performance. If you thought the database people wanted big pages, the kernel is worse.

Unlike databases or HPC, the kernel actually wants to use the physical page address quite often, notably for IO (but also for just mapping them into some other virtual address - the user's). And no standard hardware allows you to do that in hw, so we'd end up doing a software page table walk for it (or, more likely, we'd have to make "struct page" bigger).

You could do it today, although at a pretty high cost. And you'd have to forget about supporting any hardware that really wants contiguous memory for DMA (sound cards etc). It just isn't worth it.

Real memory hotplug needs hardware support anyway (if only to buffer the memory electrically). At which point you're much better off supporting some remapping in the buffering too, I'm convinced. There's no _need_ to do these things in software.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
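The "software TLB" cost Linus describes is visible in the ordinary highmem API; a minimal sketch of the standard 2.6-era pattern:

#include <linux/highmem.h>
#include <linux/string.h>

/*
 * Copy a buffer into a page that may live in HIGHMEM: the page has no
 * permanent kernel mapping, so it must be mapped, used, and unmapped
 * around every access.  That map/unmap - and, for kmap(), the global
 * locking behind it - is the software-TLB overhead discussed above.
 */
static void fill_highmem_page(struct page *page, const char *buf, size_t len)
{
	char *vaddr = kmap(page);	/* may sleep; serializes on kmap locks */

	memcpy(vaddr, buf, min(len, (size_t)PAGE_SIZE));
	kunmap(page);
}

A lowmem page would need none of this: its kernel virtual address is a fixed offset from its physical address, which is exactly the property a fully remappable kernel would give up everywhere.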
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-06 16:12 ` Linus Torvalds @ 2005-11-06 17:00 ` Linus Torvalds 2005-11-07 8:00 ` Ingo Molnar 0 siblings, 1 reply; 241+ messages in thread From: Linus Torvalds @ 2005-11-06 17:00 UTC (permalink / raw) To: Kyle Moffett Cc: Ingo Molnar, Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin On Sun, 6 Nov 2005, Linus Torvalds wrote: > > And no standard hardware allows you to do that in hw, so we'd end up doing > a software page table walk for it (or, more likely, we'd have to make > "struct page" bigger). > > You could do it today, although at a pretty high cost. And you'd have to > forget about supporting any hardware that really wants contiguous memory > for DMA (sound cards etc). It just isn't worth it. Btw, in case it wasn't clear: the cost of these kinds of things in the kernel is usually not so much the actual "lookup" (whether with hw assist or with another field in the "struct page"). The biggest cost of almost everything in the kernel these days is the extra code-footprint of yet another abstraction, and the locking cost. For example, the real cost of the highmem mapping seems to be almost _all_ in the locking. It also makes some code-paths more complex, so it's yet another I$ fill for the kernel. So a remappable kernel tends to be different from a remappable user application. A user application _only_ ever sees the actual cost of the TLB walk (which hardware can do quite efficiently and is very amenable indeed to a lot of optimization like OoO and speculative prefetching), but on the kernel level, the remapping itself is the cheapest part. (Yes, user apps can see some of the costs indirectly: they can see the synchronization costs if they do lots of mmap/munmap's, especially if they are threaded. But they really have to work at it to see it, and I doubt the TLB synchronization issues tend to be even on the radar for any user space performance analysis). You could probably do a remappable kernel (modulo the problems with specific devices that want bigger physically contiguous areas than one page) reasonably cheaply on UP. It gets more complex on SMP and with full device access. In fact, I suspect you can ask any Xen developer what their performance problems and worries are. I suspect they much prefer UP clients over SMP ones, and _much_ prefer paravirtualization over running unmodified kernels. So remappable kernels are certainly doable, they just have more fundamental problems than remappable user space _ever_ has. Both from a performance and from a complexity angle. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-06 17:00 ` Linus Torvalds
@ 2005-11-07 8:00 ` Ingo Molnar
2005-11-07 11:00 ` Dave Hansen
0 siblings, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-07 8:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kyle Moffett, Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

* Linus Torvalds <torvalds@osdl.org> wrote:

> > You could do it today, although at a pretty high cost. And you'd have to
> > forget about supporting any hardware that really wants contiguous memory
> > for DMA (sound cards etc). It just isn't worth it.
>
> Btw, in case it wasn't clear: the cost of these kinds of things in the
> kernel is usually not so much the actual "lookup" (whether with hw
> assist or with another field in the "struct page"). [...]
> So remappable kernels are certainly doable, they just have more
> fundamental problems than remappable user space _ever_ has. Both from
> a performance and from a complexity angle.

Furthermore, it doesn't bring us any closer to removable RAM. The problem is still unsolvable (due to the 'how do you find live pointers to fix up' issue), even if the full kernel VM is 'mapped' at 4K granularity.

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-07 8:00 ` Ingo Molnar
@ 2005-11-07 11:00 ` Dave Hansen
2005-11-07 12:20 ` Ingo Molnar
0 siblings, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2005-11-07 11:00 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Kyle Moffett, Paul Jackson, andy, mbligh, Andrew Morton, arjan, arjanv, kravetz, lhms, Linux Kernel Mailing List, linux-mm, mel, Nick Piggin

On Mon, 2005-11-07 at 09:00 +0100, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@osdl.org> wrote:
> > So remappable kernels are certainly doable, they just have more
> > fundamental problems than remappable user space _ever_ has. Both from
> > a performance and from a complexity angle.
>
> Furthermore, it doesn't bring us any closer to removable RAM. The problem
> is still unsolvable (due to the 'how do you find live pointers to fix
> up' issue), even if the full kernel VM is 'mapped' at 4K granularity.

I'm not sure I understand. If you're remapping, why do you have to find and fix up live pointers? Are you talking about things that require fixed _physical_ addresses?

-- Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-07 11:00 ` Dave Hansen
@ 2005-11-07 12:20 ` Ingo Molnar
2005-11-07 19:34 ` Steven Rostedt
0 siblings, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-07 12:20 UTC (permalink / raw)
To: Dave Hansen
Cc: Linus Torvalds, Kyle Moffett, Paul Jackson, andy, mbligh, Andrew Morton, arjan, arjanv, kravetz, lhms, Linux Kernel Mailing List, linux-mm, mel, Nick Piggin

* Dave Hansen <haveblue@us.ibm.com> wrote:

> On Mon, 2005-11-07 at 09:00 +0100, Ingo Molnar wrote:
> > * Linus Torvalds <torvalds@osdl.org> wrote:
> > > So remappable kernels are certainly doable, they just have more
> > > fundamental problems than remappable user space _ever_ has. Both from
> > > a performance and from a complexity angle.
> >
> > Furthermore, it doesn't bring us any closer to removable RAM. The problem
> > is still unsolvable (due to the 'how do you find live pointers to fix
> > up' issue), even if the full kernel VM is 'mapped' at 4K granularity.
>
> I'm not sure I understand. If you're remapping, why do you have to
> find and fix up live pointers? Are you talking about things that
> require fixed _physical_ addresses?

RAM removal, not RAM replacement. I explained all the variants in an earlier email in this thread. "extending RAM" is relatively easy. "replacing RAM", while doable, is probably undesirable. "removing RAM" is impossible.

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-07 12:20 ` Ingo Molnar
@ 2005-11-07 19:34 ` Steven Rostedt
2005-11-07 23:38 ` Joel Schopp
0 siblings, 1 reply; 241+ messages in thread
From: Steven Rostedt @ 2005-11-07 19:34 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, mel, linux-mm, Linux Kernel Mailing List, lhms, kravetz, arjanv, arjan, Andrew Morton, mbligh, andy, Paul Jackson, Kyle Moffett, Linus Torvalds, Dave Hansen

On Mon, 2005-11-07 at 13:20 +0100, Ingo Molnar wrote:
>
> RAM removal, not RAM replacement. I explained all the variants in an
> earlier email in this thread. "extending RAM" is relatively easy.
> "replacing RAM", while doable, is probably undesirable. "removing RAM"
> is impossible.

Hi Ingo,

I'm usually amused when someone says something is impossible, so I'm wondering exactly "why"? If the one requirement is that there must be enough free memory available to remove, then what's the problem for a fully mapped kernel? Is it the GPT? Or is it drivers that have physical memory mapped?

I'm not sure of the best way to solve the GPT being in the RAM that is to be removed, but there might be a way. Basically stop all activities and update all the tasks->mm.

As for the drivers, one could have an accounting for all physical memory mapped, and disable the driver if it is using the memory that is to be removed.

But other than these, what exactly is the problem with removing RAM?

BTW, I'm not suggesting any of this is a good idea, I just like to understand why something _can't_ be done.

-- Steve

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-07 19:34 ` Steven Rostedt
@ 2005-11-07 23:38 ` Joel Schopp
2005-11-13 2:30 ` Rob Landley
0 siblings, 1 reply; 241+ messages in thread
From: Joel Schopp @ 2005-11-07 23:38 UTC (permalink / raw)
To: Steven Rostedt
Cc: Ingo Molnar, Nick Piggin, mel, linux-mm, Linux Kernel Mailing List, lhms, kravetz, arjanv, arjan, Andrew Morton, mbligh, andy, Paul Jackson, Kyle Moffett, Linus Torvalds, Dave Hansen

>>RAM removal, not RAM replacement. I explained all the variants in an
>>earlier email in this thread. "extending RAM" is relatively easy.
>>"replacing RAM", while doable, is probably undesirable. "removing RAM"
>>is impossible.
>
<snip>
> BTW, I'm not suggesting any of this is a good idea, I just like to
> understand why something _can't_ be done.
>

I'm also of the opinion that if we make the kernel remap, we can "remove RAM". Now, we've had enough people weigh in on this being a bad idea that I'm not going to try it. After all it is fairly complex, quite a bit more so than Mel's reasonable patches. But I think it is possible. The steps would look like this:

Method A:
1. Find some unused RAM (or free some up)
2. Reserve that RAM
3. Copy the active data from the soon to be removed RAM to the reserved RAM
4. Remap the addresses
5. Remove the RAM

This of course requires steps 3 & 4 take place under something like stop_machine_run() to keep the data from changing.

Alternately you could do it like this:

Method B:
1. Find some unused RAM (or free some up)
2. Reserve that RAM
3. Unmap the addresses on the soon to be removed RAM
4. Copy the active data from the soon to be removed RAM to the reserved RAM
5. Remap the addresses
6. Remove the RAM

That would save you the stop_machine_run(), but adds the complication of dealing with faults on pinned memory during the migration.

^ permalink raw reply	[flat|nested] 241+ messages in thread
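Method A's critical section maps naturally onto the existing stop_machine_run() interface; here is a sketch under the assumption that copy_page() does the data move and a hypothetical remap_kernel_addr() redirects the kernel mapping (no such remap primitive exists in the kernel - that is the whole point of the debate above):

#include <linux/mm.h>
#include <linux/stop_machine.h>

struct move_args {
	struct page **from;	/* pages being evacuated */
	struct page **to;	/* reserved replacement pages */
	int nr;
};

/*
 * Runs while every other CPU spins inside stop_machine_run(), so the
 * data being copied cannot change underneath us (Method A, steps 3-4).
 */
static int move_and_remap(void *data)
{
	struct move_args *args = data;
	int i;

	for (i = 0; i < args->nr; i++) {
		copy_page(page_address(args->to[i]),
			  page_address(args->from[i]));
		/* hypothetical primitive: repoint the kernel mapping */
		remap_kernel_addr(args->from[i], args->to[i]);
	}
	return 0;
}

/* caller, step 5 follows on success:
 *	err = stop_machine_run(move_and_remap, &args, NR_CPUS);
 */

The sketch also shows why Linus objects: page_address() assumes a fixed virtual-to-physical relationship for lowmem, so remap_kernel_addr() would have to rewrite or indirect exactly the mappings the rest of the kernel treats as constant.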
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-07 23:38 ` Joel Schopp
@ 2005-11-13 2:30 ` Rob Landley
2005-11-14 1:58 ` Joel Schopp
0 siblings, 1 reply; 241+ messages in thread
From: Rob Landley @ 2005-11-13 2:30 UTC (permalink / raw)
To: Joel Schopp, linux-kernel

On Monday 07 November 2005 17:38, you wrote:
> >>RAM removal, not RAM replacement. I explained all the variants in an
> >>earlier email in this thread. "extending RAM" is relatively easy.
> >>"replacing RAM", while doable, is probably undesirable. "removing RAM"
> >>is impossible.
> >
> <snip>
>
> > BTW, I'm not suggesting any of this is a good idea, I just like to
> > understand why something _can't_ be done.
>
> I'm also of the opinion that if we make the kernel remap, we can
> "remove RAM". Now, we've had enough people weigh in on this being a bad
> idea that I'm not going to try it. After all, it is fairly complex, quite
> a bit more so than Mel's reasonable patches. But I think it is possible.
> The steps would look like this:
>
> Method A:
> 1. Find some unused RAM (or free some up)
> 2. Reserve that RAM
> 3. Copy the active data from the soon to be removed RAM to the reserved RAM
> 4. Remap the addresses
> 5. Remove the RAM
>
> This of course requires that steps 3 & 4 take place under something like
> stop_machine_run() to keep the data from changing.

Actually, what I was thinking is that if you use the swsusp infrastructure to
suspend all processes, all dma, quiesce the heck out of the devices, and
_then_ try to move the kernel... Well, you at least have a much more
controlled problem. Yeah, it's pretty darn intrusive, but if you're doing
"suspend to ram" perhaps the downtime could be only 5 or 10 seconds...

I don't know how much of the problem that leaves unsolved, though.

Rob

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-13 2:30 ` Rob Landley
@ 2005-11-14 1:58 ` Joel Schopp
0 siblings, 0 replies; 241+ messages in thread
From: Joel Schopp @ 2005-11-14 1:58 UTC (permalink / raw)
To: Rob Landley; +Cc: linux-kernel

> Actually, what I was thinking is that if you use the swsusp infrastructure to
> suspend all processes, all dma, quiesce the heck out of the devices, and
> _then_ try to move the kernel... Well, you at least have a much more
> controlled problem. Yeah, it's pretty darn intrusive, but if you're doing
> "suspend to ram" perhaps the downtime could be only 5 or 10 seconds...

I don't think suspend to ram for a memory hotplug remove would be acceptable
to users. The other methods add some complexity to the kernel, but are
transparent to userspace. Downtime of 5 to 10 seconds is really quite a bit
of downtime.

> I don't know how much of the problem that leaves unsolved, though.

It would still require a remappable kernel. And seems intuitively to be
wrong to me. But if you want to try it out I won't stop you. It might even
work.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 6:10 ` Paul Jackson
2005-11-04 6:38 ` Ingo Molnar
@ 2005-11-04 7:44 ` Eric Dumazet
2005-11-07 16:42 ` Adam Litke
1 sibling, 1 reply; 241+ messages in thread
From: Eric Dumazet @ 2005-11-04 7:44 UTC (permalink / raw)
To: Paul Jackson
Cc: Linus Torvalds, andy, mbligh, akpm, arjan, arjanv, haveblue,
kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo, nickpiggin

Paul Jackson wrote:
> Linus wrote:
>
>>Maybe you'd be willing to compromise by using a few kernel boot-time
>>command line options for your not-very-common load.
>
> If we were only a few options away from running Andy's varying load
> mix with something close to ideal performance, we'd be in fat city,
> and Andy would never have been driven to write that rant.

I found hugetlb support in Linux not very practical/usable on NUMA
machines, whether via boot-time parameters or /proc/sys/vm/nr_hugepages.

With this single integer parameter, you cannot allocate 1000 4MB pages on
one specific node, leaving small pages on another node.

I'm not an astrophysicist, nor a DB admin; I'm only trying to partition a
dual node machine between one (NUMA-aware) memory intensive job and all
others (system, network, shells). At least I can reboot it if needed, but
I feel Andy's pain.

There is a /proc/buddyinfo file; maybe we need a /proc/sys/vm/node_hugepages
with a list of integers (one per node)?

Eric

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 7:44 ` Eric Dumazet
@ 2005-11-07 16:42 ` Adam Litke
0 siblings, 0 replies; 241+ messages in thread
From: Adam Litke @ 2005-11-07 16:42 UTC (permalink / raw)
To: Eric Dumazet
Cc: Paul Jackson, Linus Torvalds, andy, mbligh, akpm, arjan, arjanv,
haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo,
nickpiggin

On Fri, 2005-11-04 at 08:44 +0100, Eric Dumazet wrote:
> Paul Jackson wrote:
> > Linus wrote:
> >
> >>Maybe you'd be willing to compromise by using a few kernel boot-time
> >>command line options for your not-very-common load.
> >
> > If we were only a few options away from running Andy's varying load
> > mix with something close to ideal performance, we'd be in fat city,
> > and Andy would never have been driven to write that rant.
>
> I found hugetlb support in Linux not very practical/usable on NUMA
> machines, whether via boot-time parameters or /proc/sys/vm/nr_hugepages.
>
> With this single integer parameter, you cannot allocate 1000 4MB pages
> on one specific node, leaving small pages on another node.
>
> I'm not an astrophysicist, nor a DB admin; I'm only trying to partition
> a dual node machine between one (NUMA-aware) memory intensive job and
> all others (system, network, shells).
> At least I can reboot it if needed, but I feel Andy's pain.
>
> There is a /proc/buddyinfo file; maybe we need a
> /proc/sys/vm/node_hugepages with a list of integers (one per node)?

Or perhaps /sys/devices/system/node/nodeX/nr_hugepages triggers that work
like the current /proc trigger but on a per node basis?

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 241+ messages in thread
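Neither interface existed at the time of this exchange, but either proposal
would be trivial to drive from user space. A minimal sketch, assuming Adam's
hypothetical per-node sysfs trigger (the path below is his suggestion, not a
documented kernel ABI):

/*
 * Hypothetical illustration only: set the huge page count on one NUMA
 * node by writing the proposed per-node sysfs file. Needs root, and a
 * kernel that actually implements the interface.
 */
#include <stdio.h>
#include <stdlib.h>

static int set_node_hugepages(int node, long nr)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/nr_hugepages", node);
	f = fopen(path, "w");
	if (!f)
		return -1;	/* node absent, or interface not implemented */
	fprintf(f, "%ld\n", nr);
	return fclose(f);
}

int main(void)
{
	/* Eric's example: 1000 huge pages on node 0, none on node 1. */
	if (set_node_hugepages(0, 1000) || set_node_hugepages(1, 0)) {
		perror("set_node_hugepages");
		return EXIT_FAILURE;
	}
	return 0;
}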
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 5:14 ` Linus Torvalds
2005-11-04 6:10 ` Paul Jackson
@ 2005-11-04 14:56 ` Andy Nelson
2005-11-04 15:18 ` Ingo Molnar
2005-11-04 16:00 ` Linus Torvalds
1 sibling, 2 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 14:56 UTC (permalink / raw)
To: andy, torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

Linus,

Since my other affiliation is with X2, which also goes by the name
Thermonuclear Applications, we have a deal. I'll continue to help with the
work on getting nuclear fusion to work, and you work on getting my big
pages to work in Linux. We both have lots of funding and resources behind
us and are working with smart people. It should be easy. Beyond that, I
don't know much of anything about chemistry; you'll have to find someone
else to increase your battery efficiency that way.

Big pages don't work now, and zones do not help because the load is too
unpredictable. Sysadmins *always* turn them off, for very good reasons.
They cripple the machine.

I'll try in this post also to merge a couple of replies with other
responses:

I think it was Martin Bligh who wrote that his customer gets 25% speedups
with big pages. That is peanuts compared to my factor of 3.4 (search
comp.arch for John Mashey's and my name at the University of Edinburgh in
Jan/Feb 2003 for a conversation that includes detailed data about this),
but it proves the point that it is far more than just me that wants big
pages.

If your and other kernel developers' (<<0.01% of the universe) kernel
builds slow down by 5% and my and other people's simulations (perhaps
0.01% of the universe) speed up by a factor of up to 3 or 4, who wins?

Answer right now: you do, since you are writing the kernel to respond to
your own issues, which are no more representative of the rest of the
universe than my work is.

Answer as I think it ought to be: I do, since I'd bet that HPC takes far
more net cycles in the world than everyone else's kernel builds put
together. I can't expect much of anyone else to notice either way and
neither can you, so that is a wash.

Ingo Molnar says that zones work for him. In response I will now repeat my
previous rant about why zones don't work. I understand that my post was
very long and people probably didn't read it all. So I'll just repeat that
part:

2) The last paragraph above is important because of the way HPC works as
an industry. We often don't just have a dedicated machine to run on, that
gets booted once and one dedicated application runs on it till it dies or
gets rebooted again. Many jobs run on the same machine. Some jobs run for
weeks. Others run for a few hours over and over again. Some run massively
parallel. Some run throughput.

How is this situation handled? With a batch scheduler. You submit a job to
run and ask for X cpus, Y memory and Z time. It goes and fits you in
wherever it can. cpusets were helpful infrastructure in Linux for this.
You may get some cpus on one side of the machine, some more on the other,
and memory associated with still others. They do a pretty good job of
allocating resources sanely, but there is only so much that they can do.

The important point here for page-related discussions is that someone, you
don't know who, was running on those cpus and memory before you. And doing
Ghu Knows What with it. This code could be running something that benefits
from small pages, or it could be running with large pages.
It could be dynamically allocating and freeing large or small blocks of
memory, or it could be allocating everything at the beginning and running
statically thereafter. Different codes do different things. That means
that the memory state could be totally fubar'ed before your job ever gets
any time allocated to it.

>Nobody takes a random machine and says "ok, we'll now put our most
>performance-critical database on this machine, and oh, btw, you can't
>reboot it and tune for it beforehand".

Wanna bet? What I wrote above makes tuning the machine itself totally
ineffective. What do you tune for? Tuning for one person's code makes
someone else's slower. Tuning for the same code on one input makes another
input run horribly.

You also can't be rebooting after every job. What about all the other ones
that weren't done yet? You'd piss off everyone running there, and it takes
too long besides.

What about a machine that is running multiple instances of some database,
some bigger or smaller than others, or doing other kinds of work? Do you
penalize the big ones or the small ones, this kind of work or that?

You also can't establish zones that can't be changed on the fly as things
on the system change. How do zones like that fit into NUMA? How do things
work when suddenly you've got a job that wants the entire memory filled
with large pages and you've only got half your system set up for large
pages?

What if you tune the system that way and then let that job run, and for
some stupid user reason it dies 10 minutes after starting? Do you let the
30 other jobs in the queue sit idle because they want a different page
distribution?

This way lies madness. Sysadmins just say no and set up the machine as
stably as they can, usually with something not too different from whatever
the manufacturer recommends as a default. For very good reasons.

I would bet the only kind of zone stuff that could even possibly work
would be related to a cpu/memset zone arrangement. See below.

Andy Nelson

--
Andy Nelson                    Theoretical Astrophysics Division (T-6)
andy dot nelson at lanl dot gov          Los Alamos National Laboratory
http://www.phys.lsu.edu/~andy                     Los Alamos, NM 87545

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 14:56 ` Andy Nelson
@ 2005-11-04 15:18 ` Ingo Molnar
2005-11-04 15:39 ` Andy Nelson
2005-11-04 16:00 ` Linus Torvalds
1 sibling, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 15:18 UTC (permalink / raw)
To: Andy Nelson
Cc: torvalds, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin

* Andy Nelson <andy@thermo.lanl.gov> wrote:

> I think it was Martin Bligh who wrote that his customer gets 25%
> speedups with big pages. That is peanuts compared to my factor of 3.4
> (search comp.arch for John Mashey's and my name at the University of
> Edinburgh in Jan/Feb 2003 for a conversation that includes detailed
> data about this), but it proves the point that it is far more than just
> me that wants big pages.

ok, this posting of yours seems to be it:

http://groups.google.com/group/comp.sys.sgi.admin/browse_thread/thread/39884db861b7db15/e0332608c52a17e3?lnk=st&q=&rnum=35#e0332608c52a17e3

| Timing for the tree traversal+gravity calculation were
|
|      16MBpages   1MBpages   64kpages
|  1   *           *          2361.8s
|  8   86.4s       198.7s     298.1s
| 16   43.5s       99.2s      148.9s
| 32   22.1s       50.1s      75.0s
| 64   11.2s       25.3s      37.9s
| 96   7.5s        17.1s      25.4s
|
| (*) test not done.
|
| As near as I can tell the numbers show perfect
| linear speedup for the runs for each page size.
|
| Across different page sizes there is degradation
| as follows:
|
| 16m --> 64k decreases by a factor 3.39 in speed
| 16m --> 1m decreases by a factor 2.25 in speed
| 1m --> 64k decreases by a factor 1.49 in speed
[...]
|
| Sum over cpus of TLB miss times for each test:
|
|      16MBpages   1MBpages   64kpages
|  1                          3489s
|  8   64.3s       1539s      3237s
| 16   64.5s       1540s      3241s
| 32   64.5s       1542s      3244s
| 64   64.9s       1545s      3246s
| 96   64.7s       1545s      3251s
|
| Thus the 16MB pages rarely produced page misses,
| while the 64kB pages used up 2.5x more time than
| the floating point operations that we wanted to
| have. I have at least some feeling that the 16MB pages
| rarely caused misses because with a 128 entry
| TLB (on the R12000 cpu) that gives about 1GB of
| addressable memory before paging is required at all,
| which I think is quite comparable to the size of
| the memory actually used.

to me it seems that this slowdown is due to some inefficiency in the
R12000's TLB-miss handling - possibly very (very!) long TLB-miss
latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
pages on x86/x64?

if my assumption is correct, then hugeTLBs are more of a workaround for
bad TLB-miss properties of the CPUs you are using, not something that
will inevitably happen in the future. Hence i think the 'factor 3x'
slowdown should not be realistic anymore - or are you still running
R12000 CPUs?

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:18 ` Ingo Molnar
@ 2005-11-04 15:39 ` Andy Nelson
2005-11-04 16:05 ` Ingo Molnar
2005-11-04 16:07 ` Linus Torvalds
0 siblings, 2 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 15:39 UTC (permalink / raw)
To: andy, mingo
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, nickpiggin, pj, torvalds

Ingo wrote:
>ok, this posting of yours seems to be it:
>
<elided>
>to me it seems that this slowdown is due to some inefficiency in the
>R12000's TLB-miss handling - possibly very (very!) long TLB-miss
>latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
>visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
>pages on x86/x64?
>
>if my assumption is correct, then hugeTLBs are more of a workaround for
>bad TLB-miss properties of the CPUs you are using, not something that
>will inevitably happen in the future. Hence i think the 'factor 3x'
>slowdown should not be realistic anymore - or are you still running
>R12000 CPUs?
>
	Ingo

AFAIK, MIPS chips have a software TLB refill that takes 1000 cycles more
or less. I could be wrong. There are SGI folk on this thread; perhaps they
can correct me. What is important is that I have done similar tests on
other arches and found very similar results. Specifically with IBM
machines running both AIX and Linux. I've never had the opportunity to try
variable page size stuff on AMD or Intel chips, either Itanic or x86
variants.

The effect is not a consequence of any excessively long TLB handling times
for one single arch.

The effect is a property of the code. Which has one part that is extremely
branchy: traversing a tree, and another part that isn't branchy but grabs
stuff from all over everywhere.

The tree traversal works like this: Start from the root and stop at each
node, load a few numbers, multiply them together and compare to another
number, then open that node or go on to a sibling node. Net, this is about
5-8 flops and a compare per node. The issue is that the next time you want
to look at a tree node, you are someplace else in memory entirely. That
means a TLB miss almost always.

The tree traversal leaves me with a list of a few thousand nodes and
atoms. I use these nodes and atoms to calculate gravity on some particle
or small group of particles. How? For each node, I grab about 10 numbers
from a couple of arrays, do about 50 flops with those numbers, and store
back 4 more numbers. The store back doesn't hurt anything because it
really only happens once at the end of the list.

In the naive case, grabbing 10 numbers out of arrays that are multiple GB
in size means 10 TLB misses. The obvious solution is to stick everything
that is needed together, and get that down to one or two. I've done that.
The results you quoted in your post reflect that. In other words, the
performance difference is the minimal number of TLB misses that I can
manage to get.

Now if you have a list of thousands of nodes to cycle through, each of
which lives on a different page (ordinarily true), you thrash TLB, and you
thrash L1, as I noted in my original post. Believe me, I have worried
about this sort of stuff intensely, and recoded around it a lot. The
performance numbers you saw are what is left over.

It is true that other sorts of codes have much more regular memory access
patterns, and don't have nearly this kind of speedup. Perhaps more typical
would be the 25% number quoted by Martin Bligh.
Andy ^ permalink raw reply [flat|nested] 241+ messages in thread
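What Andy describes maps onto a fairly standard tree-gravity inner loop.
The sketch below is an invented approximation (the types, field names and
the theta-style opening criterion are assumptions, not his code), and it
compiles on its own but ships no driver; its only point is to show why
each visited node sits on a fresh page, and hence is a likely TLB miss,
on a small-page kernel.

/*
 * Invented sketch of a Barnes-Hut-style tree walk: a handful of flops
 * and one compare per node, then a jump to a node somewhere else in
 * memory entirely, via the child or sibling pointer.
 */
#include <stddef.h>

struct node {
	double size2;		/* squared cell size */
	double x, y, z;		/* center of mass */
	double mass;
	struct node *child;	/* first child, elsewhere in memory */
	struct node *sibling;	/* next sibling, elsewhere in memory */
};

/* Collect the nodes acceptable for one particle at (px,py,pz):
 * open a node if it is too close for its size, else accept it. */
static size_t traverse(const struct node *n, double px, double py, double pz,
		       double theta2, const struct node **list, size_t len)
{
	while (n) {
		double dx = n->x - px, dy = n->y - py, dz = n->z - pz;
		double r2 = dx * dx + dy * dy + dz * dz;

		if (n->size2 < theta2 * r2 || !n->child) {
			list[len++] = n;	/* ~5-8 flops + a compare */
			n = n->sibling;		/* next node: a new page */
		} else {
			n = n->child;		/* open it: also a new page */
		}
	}
	return len;
}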
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:39 ` Andy Nelson
@ 2005-11-04 16:05 ` Ingo Molnar
0 siblings, 0 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 16:05 UTC (permalink / raw)
To: Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, nickpiggin, pj, torvalds

* Andy Nelson <andy@thermo.lanl.gov> wrote:

> Ingo wrote:
> >ok, this posting of yours seems to be it:
> >
> <elided>
>
> >to me it seems that this slowdown is due to some inefficiency in the
> >R12000's TLB-miss handling - possibly very (very!) long TLB-miss
> >latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
> >visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
> >pages on x86/x64?
> >
> >if my assumption is correct, then hugeTLBs are more of a workaround for
> >bad TLB-miss properties of the CPUs you are using, not something that
> >will inevitably happen in the future. Hence i think the 'factor 3x'
> >slowdown should not be realistic anymore - or are you still running
> >R12000 CPUs?
> >
> 	Ingo
>
> AFAIK, MIPS chips have a software TLB refill that takes 1000 cycles
> more or less. [...]

x86 in comparison has a typical cost of 7 cycles per TLB miss. And a
modern x64 chip has 1024 TLBs ... If that's not enough then i believe
you'll be limited by cachemiss costs and RAM latency/throughput anyway,
and the only thing the TLB misses have to do is to be somewhat better than
those bottlenecks. TLBs are really fast in the x86/x64 world. Then there
come other features like TLB prefetch, so if you are touching pages in any
predictable fashion you ought to see better latencies than the worst-case.

> The effect is not a consequence of any excessively long TLB handling
> times for one single arch.
>
> The effect is a property of the code. Which has one part that is
> extremely branchy: traversing a tree, and another part that isn't
> branchy but grabs stuff from all over everywhere.

i don't think anyone argues against the fact that a larger 'TLB reach'
will most likely improve performance. The question is always 'by how
much', and that number very much depends on the cost of a single TLB miss.
(and on a lot of other factors)

(note that it's also possible for large TLBs to cause a slowdown: there
are CPUs [e.g. P3] where there are fewer large TLBs than 4K TLBs, so there
are workloads where you lose due to fewer TLBs. It is also possible for
large TLBs to be zero speedup: if the working set is so large that you
will always get a TLB miss with a new node accessed.)

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:39 ` Andy Nelson
2005-11-04 16:05 ` Ingo Molnar
@ 2005-11-04 16:07 ` Linus Torvalds
2005-11-04 16:40 ` Ingo Molnar
1 sibling, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 16:07 UTC (permalink / raw)
To: Andy Nelson
Cc: mingo, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin, pj

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> AFAIK, MIPS chips have a software TLB refill that takes 1000
> cycles more or less. I could be wrong.

You're not far off.

Time it on a real machine some day. On a modern x86, you will fill a TLB
entry in anything from 1-8 cycles if it's in L1, and add a couple of dozen
cycles for L2. In fact, the L1 TLB miss can often be hidden by the OoO
engine.

Now, do the math. Your "3-4 times slowdown" with several-hundred-cycle TLB
misses just GOES AWAY with real hardware.

Yes, you'll still see slowdowns, but they won't be nearly as noticeable.
And having a simpler and more efficient kernel will actually make _up_
for them in many cases. For example, you can do all your calculations on
idle workstations that don't mysteriously just crash because somebody was
also doing something else on them.

Face it. MIPS sucks. It was clean, but it didn't perform very well. SGI
doesn't sell those things very actively these days, do they?

So don't blame Linux. Don't make sweeping statements based on hardware
situations that just aren't relevant any more. If you ever see a machine
again that has a huge TLB slowdown, let the machine vendor know, and then
SWITCH VENDORS. Linux will work on sane machines too.

	Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:07 ` Linus Torvalds
@ 2005-11-04 16:40 ` Ingo Molnar
2005-11-04 17:22 ` Linus Torvalds
0 siblings, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 16:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Nelson, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin, pj

* Linus Torvalds <torvalds@osdl.org> wrote:

> Time it on a real machine some day. On a modern x86, you will fill a
> TLB entry in anything from 1-8 cycles if it's in L1, and add a couple
> of dozen cycles for L2.

below is my (x86-only) testcode that accurately measures TLB miss costs in
cycles. (Has to be run as root, because it uses 'cli' as the serializing
instruction.)

here's the output from the default 128MB (32768 4K pages) random access
pattern workload, on a 2 GHz P4 (which has 64 dTLBs):

 0 24 24 24 12 12 0 0 16 0 24 24 24 12 0 12 0 12
 32768 randomly accessed pages, 13 cycles avg, 73.751831% TLB misses.

i.e. really cheap TLB misses even in this very bad and TLB-thrashing
scenario: there are only 64 dTLBs and we have 32768 pages - so they are
outnumbered by a factor of 1:512! Still the CPU gets it right.

setting LINEAR to 1 gives an embarrassing:

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 32768 linearly accessed pages, 0 cycles avg, 0.259399% TLB misses.

showing that the pagetable got fully cached (probably in L1) and that has
_zero_ overhead. Truly remarkable.

lowering the size to 16 MB (still 1:64 TLB-to-working-set-size ratio!)
gives:

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 4096 randomly accessed pages, 0 cycles avg, 5.859375% TLB misses.

so near-zero TLB overhead. increasing BYTES to half a gigabyte gives:

 2 0 12 12 24 12 24 264 24 12 24 24 0 0 24 12 24 24 24 24 24 24 24 24
 12 12 24 24 24 36 24 24 0 24 24 0 24 24 288 24 24 0 228 24 24 0 0
 131072 randomly accessed pages, 75 cycles avg, 94.162750% TLB misses.

so an occasional ~220 cycles (~== 100 nsec - DRAM latency) cachemiss, but
still the average is 75 cycles, or 37 nsecs - which is still only ~37% of
the DRAM latency.

(NOTE: the test eliminates most data cachemisses, by using zero-mapped
anonymous memory, so only a single data page exists. So the costs seen
here are mostly TLB misses.)

	Ingo

---------------
/*
 * TLB miss measurement on PII CPUs.
 *
 * Copyright (C) 1999, Ingo Molnar <mingo@redhat.com>
 */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/mman.h>

#define BYTES (128*1024*1024)
#define PAGES (BYTES/4096)

/* This define turns on the linear mode.. */
#define LINEAR 0

#if 1
# define BARRIER "cli"
#else
# define BARRIER "lock ; addl $0,0(%%esp)"
#endif

int do_test (char * addr)
{
	unsigned long start, end;

	/*
	 * 'cli' is used as a serializing instruction to
	 * isolate the benchmarked instruction from rdtsc.
	 */
	__asm__ (
		"jmp 1f; 1: .align 128;	\
		"BARRIER";		\
		rdtsc;			\
		movl %0, %1;		\
		"BARRIER";		\
		movl (%%esi), %%eax;	\
		"BARRIER";		\
		rdtsc;			\
		"BARRIER";		\
	" :"=a" (end), "=c" (start) :"S" (addr) :"dx","memory");

	return end - start;
}

extern int iopl(int);

int main (void)
{
	unsigned long overhead, sum;
	int j, k, c, hit;
	int matrix [PAGES];
	int delta [PAGES];
	char *buffer = mmap(NULL, BYTES, PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	iopl(3);

	/*
	 * first generate a random access pattern.
	 */
	for (j = 0; j < PAGES; j++) {
		unsigned long val;
#if LINEAR
		val = ((j*8) % PAGES) * 4096;
		val = j*2048;
#else
		val = (random() % PAGES) * 4096;
#endif
		matrix[j] = val;
	}

	/*
	 * Calculate the overhead
	 */
	overhead = ~0UL;
	for (j = 0; j < 100; j++) {
		unsigned int diff = do_test(buffer);
		if (diff < overhead)
			overhead = diff;
	}
	printf("Overhead = %ld cycles\n", overhead);

	/*
	 * 10 warmup loops, the last one is printed.
	 */
	for (k = 0; k < 10; k++) {
		c = 0;
		for (j = 0; j < PAGES; j++) {
			char * addr;

			addr = buffer + matrix[j];
			delta[c++] = do_test(addr);
		}
	}
	hit = 0;
	sum = 0;
	for (j = 0; j < PAGES; j++) {
		unsigned long d = delta[j] - overhead;
		printf("%ld ", d);
		if (d <= 1)
			hit++;
		sum += d;
	}
	printf("\n");
	printf("%d %s accessed pages, %ld cycles avg, %f%% TLB misses.\n",
		PAGES,
#if LINEAR
		"linearly",
#else
		"randomly",
#endif
		sum/PAGES,
		100.0*((double)PAGES-(double)hit)/(double)PAGES);

	return 0;
}

^ permalink raw reply	[flat|nested] 241+ messages in thread
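(A practical note on the test program above, in case anyone wants to repeat
the measurement: the inline assembly uses 32-bit registers and `cli', so it
has to be built as a 32-bit x86 binary and run as root; the iopl(3) call is
what makes `cli' legal from user space. On an SMP box it would presumably
also be worth pinning the process to one CPU, since rdtsc counts are
per-CPU.)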
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:40 ` Ingo Molnar
@ 2005-11-04 17:22 ` Linus Torvalds
2005-11-04 17:43 ` Andy Nelson
0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 17:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andy Nelson, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin, pj

Andy,
let's just take Ingo's numbers, measured on modern hardware.

On Fri, 4 Nov 2005, Ingo Molnar wrote:
>
> 32768 randomly accessed pages, 13 cycles avg, 73.751831% TLB misses.
> 32768 linearly accessed pages, 0 cycles avg, 0.259399% TLB misses.
> 131072 randomly accessed pages, 75 cycles avg, 94.162750% TLB misses.

NOTE! It's hard to decide what OoO does - Ingo's load doesn't allow for a
whole lot of overlapping stuff, so Ingo's numbers are fairly close to
worst case, but on the other hand, that serialization can probably be
honestly said to hide a couple of cycles, so let's say that _real_ worst
case is five more cycles than the ones quoted. It doesn't change the math,
and quite frankly, that way we're really anal about it. In real life,
under real load (especially with FP operations going on at the same time),
OoO might make the cost a few cycles _less_, not more, but hey, let's not
count that.

So in the absolute worst case, with 95% TLB miss ratio, the TLB cost was
an average 75 cycles. Let's be _really_ nice to MIPS, and say that this is
only five times faster than the MIPS case you tested (in reality, it's
probably over ten). That's the WORST CASE.

Realize that MIPS doesn't get better: it will _always_ have a latency of
several hundred cycles when the TLB misses. It has absolutely zero OoO
activity to hide a TLB miss (a software miss totally serializes the
pipeline), and it has zero "code caching", so even with a perfect I$
(which it certainly didn't have), the cost of actually running the TLB
miss handler doesn't go down.

In contrast, the x86 hw miss gets better when there is some more locality
and the page tables are cached. Much better. Ingo's worst-case example is
not realistic (no locality at all in half a gigabyte or totally random
examples), yet even for that worst case, modern CPU's beat the MIPS by
that big factor.

So let's say that the 75% miss ratio was more likely (that's still a high
TLB miss ratio). So in the _likely_ case, a P4 did the miss in an average
of 13 cycles. The MIPS miss cost won't have come down at all - in fact, it
possibly went _up_, since the miss handler now might be getting more I$
misses since it's not called all the time (I don't know if the MIPS miss
handler used non-caching loads or not - the positive D$ effects on the
page tables from slightly denser TLB behaviour might help some to offset
this factor).

That's a likely factor of fifty speedup. But let's be pessimistic again,
and say that the P4 number beat the MIPS TLB miss by "only" a factor of
twenty.

That means that your worst case totally untuned argument (30 times
slowdown from TLB misses) on a P4 is only a 120% slowdown. Not a factor of
three.

But clearly you could tune your code too, and did. To the point that you
had a factor of 3.4 on MIPS. Now, let's say that the tuning didn't work as
well on P4 (remember, we're still being pessimistic), and you'd only get
half of that.

End result? If the slowdown was entirely due to TLB miss costs, your
likely slowdown is in the 20-40% range. Pessimistically.

Now, switching to x86 may have _other_ issues.
[ Mmwwhahahahhahaaa. I crack myself up. x86 slower than MIPS? I'm such a joker. ] Anyway. The point stands. This is something where hardware really rules, and software can't do a lot of sane stuff. 20-40% may sound like a big number, and it is, but this is all stuff where Moore's Law says that we shouldn't spend software effort. We'll likely be better off with a smaller, simpler kernel in the future. I hope. And the numbers above back me up. Software complexity for something like this just kills. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
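Linus's estimate is easy to play with numerically. The toy model below
combines a miss rate and a per-miss cost into an overall slowdown; the
25-cycles-of-work-per-reference figure is an invented assumption, chosen
only so the software-refill case lands near Andy's untuned 30x, and the
last line is the half-gigabyte worst case that Linus argues is unrealistic.

/*
 * Toy model: each reference costs 'work' cycles of useful work, plus
 * 'miss_cost' extra cycles with probability 'miss_rate'.
 */
#include <stdio.h>

static double slowdown(double work, double miss_rate, double miss_cost)
{
	return (work + miss_rate * miss_cost) / work;
}

int main(void)
{
	double work = 25.0;	/* assumed useful cycles per reference */

	/* ~700-cycle software refill, ~95% misses: MIPS-like untuned case */
	printf("sw refill, 700 cyc, 95%%: %.1fx\n", slowdown(work, 0.95, 700.0));
	/* Ingo's measured P4 averages from above */
	printf("P4, 13 cyc, 75%%:        %.2fx\n", slowdown(work, 0.75, 13.0));
	printf("P4, 75 cyc, 95%%:        %.2fx\n", slowdown(work, 0.95, 75.0));
	return 0;
}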
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:22 ` Linus Torvalds
@ 2005-11-04 17:43 ` Andy Nelson
0 siblings, 0 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 17:43 UTC (permalink / raw)
To: mingo, torvalds
Cc: akpm, andy, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin, pj

Linus,

Please stop focusing on MIPS as the bad boy. MIPS is dead. It has been for
years and everyone knows it unless they are embedded. I wrote several
times that I had tested other arches and every time you deleted those
comments. Not to mention that in the few anecdotal (read: no records were
kept) tests I've done with Intel vs MIPS on more than one code, MIPS
doesn't come out nearly as bad as you seem to believe. Maybe that is TLB
related, maybe it is other issues. The fact remains.

Later on after your posts I also posted numbers for Power 5. Haven't seen
a response to that yet. Maybe you're digesting.

> let's just take Ingo's numbers, measured on modern hardware.

Ingo's numbers calculate 95% TLB misses. I will likely have 100% TLB
misses over most of this code. Read my discussion of what it does and
you'll see why. Capsule form: Every tree node results in several thousand
nodes that are acceptable. You need to examine several times that to get
the acceptable ones. Several thousand memory reads from several thousand
different pages means 100% TLB misses. This is by no means a pathological
case. Other codes will have such effects too, as I noted in my first very
long rant.

I may have misread it, but that last bit of difference between 95% and
100% TLB misses will be a pretty big factor in speed differences. So your
20-40% goes right back up. Ok, so there is some minimal FP overlap in my
case, but a factor of 2 speed difference certainly still exists in the
Power 5 numbers I quoted.

I have a special case version of this code that does cache blocking on the
gravity calculation. As a special case version, it is not effective for
the general case. There are 0 TLB misses and 0 L1 misses for this part of
the code. The tree traversal cannot be similarly cache blocked and keeps
all the TLB and cache misses it always had. For that version, I can get
down to a 20% speedup, because overall the traversal only takes 20% or so
of the total time. That is the absolute best I can do, and I've been
tuning this code alone for close to a decade.

Andy

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 14:56 ` Andy Nelson
2005-11-04 15:18 ` Ingo Molnar
@ 2005-11-04 16:00 ` Linus Torvalds
2005-11-04 16:13 ` Martin J. Bligh
2005-11-04 16:14 ` Andy Nelson
1 sibling, 2 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 16:00 UTC (permalink / raw)
To: Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> Big pages don't work now, and zones do not help because the
> load is too unpredictable. Sysadmins *always* turn them
> off, for very good reasons. They cripple the machine.

They do. Guess why? It's complicated.

SGI used to do things like that in Irix. They had the flakiest Unix kernel
out there. There's a reason people use Linux, and it's not all price. A
lot of it is development speed, and that in turn comes very much from not
making insane decisions that aren't maintainable in the long run.

Trust me. We can make things _better_, by having zones that you can't do
kernel allocations from. But you'll never get everything you want, without
turning the kernel into an unmaintainable mess.

> I think it was Martin Bligh who wrote that his customer gets
> 25% speedups with big pages. That is peanuts compared to my
> factor of 3.4 (search comp.arch for John Mashey's and my name
> at the University of Edinburgh in Jan/Feb 2003 for a conversation
> that includes detailed data about this), but it proves the point that
> it is far more than just me that wants big pages.

I didn't find your post on google, but I assume that a large portion of
your 3.4 factor was hardware.

The fact is, there are tons of architectures that suck at TLB handling.
They have small TLB's, and they fill slowly.

x86 is actually one of the best ones out there. It has a hw TLB fill, and
the page tables are cached, with real-life TLB fill times in the single
cycles (a P4 can almost be seen as effectively having 32kB pages because
it fills its TLB entries so fast when they are next to each other in the
page tables). Even when you have lots of other cache pressure, the page
tables are at least in the L2 (or L3) caches, and you effectively have a
really huge TLB.

In contrast, a lot of other machines will use non-temporal loads to load
the TLB entries, forcing them to _always_ go to memory, and use software
fills, causing the whole machine to stall. To make matters worse, many of
them use hashed page tables, so that even if they could (or do) cache
them, the caching just doesn't work very well.

(I used to be a big proponent of software fill - it's very flexible. It's
also very slow. I've changed my mind after doing timing on x86)

Basically, any machine that gets more than twice the slowdown is _broken_.
If the memory access is cached, then so should the page table entry be
(page tables are _much_ smaller than the pages themselves), so even if you
take a TLB fault on every single access, you shouldn't see a 3.4 factor.

So without finding your post, my guess is that you were on a broken
machine. MIPS or alpha do really well when things generally fit in the
TLB, but break down completely when they don't due to their sw fill (alpha
could have fixed it, it had _architecturally_ sane page tables that it
could have walked in hw, but never got the chance. May it rest in peace).
If I remember correctly, ia64 used to suck horribly because Linux had to
use a mode where the hw page table walker didn't work well (maybe it was
just an Itanium 1 bug), but should be better now. But x86 probably kicks
its butt.

The reason x86 does pretty well is that it's got one of the few sane page
table setups out there (oh, page table trees are old-fashioned and simple,
but they are dense and cache well), and the microarchitecture is largely
optimized for TLB faults. Not having ASI's and having to work with an OS
that invalidated the TLB about every couple of thousand memory accesses
does that to you - it puts the pressure to do things right.

So I suspect Martin's 25% is a lot more accurate on modern hardware (which
means x86, possibly Power. Nothing else much matters).

> If your and other kernel developers' (<<0.01% of the universe) kernel
> builds slow down by 5% and my and other people's simulations (perhaps
> 0.01% of the universe) speed up by a factor of up to 3 or 4, who wins?

First off, you won't speed up by a factor of three or four. Not even
_close_.

Second, it's not about performance. It's about maintainability. It's about
having a system that we can use and understand 10 years down the line. And
the VM is a big part of that.

	Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:00 ` Linus Torvalds
@ 2005-11-04 16:13 ` Martin J. Bligh
2005-11-04 16:40 ` Linus Torvalds
2005-11-04 16:14 ` Andy Nelson
1 sibling, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-04 16:13 UTC (permalink / raw)
To: Linus Torvalds, Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mel, mingo, nickpiggin

> So I suspect Martin's 25% is a lot more accurate on modern hardware (which
> means x86, possibly Power. Nothing else much matters).

It was PPC64, if that helps.

>> If your and other kernel developers' (<<0.01% of the universe) kernel
>> builds slow down by 5% and my and other people's simulations (perhaps
>> 0.01% of the universe) speed up by a factor of up to 3 or 4, who wins?
>
> First off, you won't speed up by a factor of three or four. Not even
> _close_.

Well, I think it depends on the workload a lot. However fast your TLB is,
if we move from "every cacheline read is a TLB miss" to "every cacheline
read is a TLB hit", that can be a huge performance knee. Depends heavily
on the locality of reference and size of data set of the application, I
suspect.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:13 ` Martin J. Bligh
@ 2005-11-04 16:40 ` Linus Torvalds
2005-11-04 17:10 ` Martin J. Bligh
0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 16:40 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andy Nelson, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mel, mingo, nickpiggin

On Fri, 4 Nov 2005, Martin J. Bligh wrote:
>
> > So I suspect Martin's 25% is a lot more accurate on modern hardware (which
> > means x86, possibly Power. Nothing else much matters).
>
> It was PPC64, if that helps.

Ok. I bet x86 is even better, but Power (and possibly Itanium) is the only
other architecture that comes close.

I don't like the horrible POWER hash-tables, but for static workloads they
should perform almost as well as a sane page table (I say "almost",
because I bet that the high-performance x86 vendors have spent a lot more
time on TLB latency than even IBM has). My dislike for them comes from the
fact that they are really only optimized for static behaviour. (And HPC is
almost always static wrt TLB stuff - big, long-running processes).

> Well, I think it depends on the workload a lot. However fast your TLB is,
> if we move from "every cacheline read is a TLB miss" to "every cacheline
> read is a TLB hit", that can be a huge performance knee. Depends heavily
> on the locality of reference and size of data set of the application, I
> suspect.

I'm sure there are really pathological examples, but the thing is, they
won't be on reasonable code.

Some modern CPU's have TLB's that can span the whole cache. In other
words, if your data is in _any_ level of caches, the TLB will be big
enough to find it.

Yes, that's not universally true, and when it's true, the TLB is two-level
and you can have loads where it will usually miss in the first level, but
we're now talking about loads where the _data_ will then always miss in
the first level cache too. So the TLB miss cost will always be _lower_
than the data miss cost.

Right now, you should buy Opteron if you want that kind of large TLB. I
_think_ Intel still has "small" TLB's (the cpuid information only goes up
to 128 entries, I think), but at least Intel has a really good fill. And I
would bet (but have no first-hand information) that next generation
processors will only get bigger TLB's. These things don't tend to shrink.

(Itanium also has a two-level TLB, but it's absolutely pitiful in size).

NOTE! It is absolutely true that for a few years we had regular caches
growing much faster than TLB's. So there are unquestionably unbalanced
machines out there. But it seems that CPU designers started noticing, and
every indication is that TLB's are catching up.

In other words, adding lots of kernel complexity is the wrong thing in the
long run. This is not a long-term problem, and even in the short term you
can fix it by just selecting the right hardware.

In today's world, AMD leads with big TLB's (1024-entry L2 TLB), but Intel
has slightly faster fill and the AMD TLB filtering is sadly turned off on
SMP right now, so you might not always get the full effect of the large
TLB (but in HPC you probably won't have task switching blowing your TLB
away very often). PPC64 has the huge hashed page tables that work well
enough for HPC. Itanium has a pitifully small TLB, and an in-order CPU, so
it will take a noticeably bigger hit on TLB's than x86 will. But even
Itanium will be a _lot_ better than MIPS was.
Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:40 ` Linus Torvalds
@ 2005-11-04 17:10 ` Martin J. Bligh
0 siblings, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-04 17:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Nelson, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mel, mingo, nickpiggin

>> Well, I think it depends on the workload a lot. However fast your TLB is,
>> if we move from "every cacheline read is a TLB miss" to "every cacheline
>> read is a TLB hit", that can be a huge performance knee. Depends heavily
>> on the locality of reference and size of data set of the application, I
>> suspect.
>
> I'm sure there are really pathological examples, but the thing is, they
> won't be on reasonable code.
>
> Some modern CPU's have TLB's that can span the whole cache. In other
> words, if your data is in _any_ level of caches, the TLB will be big
> enough to find it.
>
> Yes, that's not universally true, and when it's true, the TLB is two-level
> and you can have loads where it will usually miss in the first level, but
> we're now talking about loads where the _data_ will then always miss in
> the first level cache too. So the TLB miss cost will always be _lower_
> than the data miss cost.
>
> Right now, you should buy Opteron if you want that kind of large TLB. I
> _think_ Intel still has "small" TLB's (the cpuid information only goes up
> to 128 entries, I think), but at least Intel has a really good fill. And I
> would bet (but have no first-hand information) that next generation
> processors will only get bigger TLB's. These things don't tend to shrink.

Well. Last time I looked they had something in the order of 512 entries
per MB of cache or so (i.e. 2MB of coverage per MB of cache). So it'll
only cover it if you're using 2K of the data in each page (50%), but not
if you're touching cachelines distributed widely over pages. With large
pages, you cover 1000 times that much.

Some apps may not be able to achieve a 50% locality of reference, just by
their nature ... not sure that's bad programming for the big number
crunching cases, or DB workloads with random access patterns to large data
sets.

Of course, this doesn't just apply to HPC/database either. dcache walks on
large fileservers, etc.

Even if we're talking data cache / icache misses, it gets even worse,
doesn't it? Several cacheline misses for pagetable walks per data
cacheline miss. Lots of the compute intensive stuff doesn't even come
close to fitting in data cache by orders of magnitude.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
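The reach arithmetic behind Martin's point is one line; here it is as a
tiny program, using the 1024-entry Opteron L2 TLB mentioned earlier and an
assumed 1MB cache as the comparison point:

/* TLB reach = entry count x page size, compared against a cache size. */
#include <stdio.h>

int main(void)
{
	long long entries  = 1024;		/* Opteron L2 dTLB, from above */
	long long cache    = 1024 * 1024;	/* assume 1MB of cache */
	long long reach4k  = entries * 4096;
	long long reach16m = entries * 16 * 1024 * 1024;

	printf("4K pages:  TLB reach = %lld KB (%.0fx a 1MB cache)\n",
	       reach4k / 1024, (double)reach4k / cache);
	printf("16M pages: TLB reach = %lld GB\n",
	       reach16m / (1024 * 1024 * 1024));
	return 0;
}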
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:00 ` Linus Torvalds
2005-11-04 16:13 ` Martin J. Bligh
@ 2005-11-04 16:14 ` Andy Nelson
2005-11-04 16:49 ` Linus Torvalds
1 sibling, 1 reply; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 16:14 UTC (permalink / raw)
To: andy, torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

Linus:
>> If your and other kernel developers' (<<0.01% of the universe) kernel
>> builds slow down by 5% and my and other people's simulations (perhaps
>> 0.01% of the universe) speed up by a factor of up to 3 or 4, who wins?
>
>First off, you won't speed up by a factor of three or four. Not even
>_close_.

My measurements of factors of 3-4 on more than one hw arch don't mean
anything then?

BTW: Ingo Molnar has a response that did find my comp.arch posts. As I
indicated to him, I've done a lot of code tuning to get better performance
even in the presence of TLB issues. This factor is what is left. Starting
from an untuned code, the factor can be up to an order of magnitude
larger. As in 30-60. Yes, I've measured that too, though these detailed
measurements were only on MIPS/Origins.

It is true that I have never had the opportunity to test these issues on
x86 and its relatives. Perhaps it would be better there. The relative
insensitivity to hw arch of the results I already have indicates
otherwise, though.

Re maintainability: Fine. I like maintainable code too. Coding standards
are great. Language standards are even better. These are motherhood
statements. Your simple rejections ("NO, HELL NO!!") even of any attempts
to make these sorts of improvements seem to make that issue pretty moot
anyway.

Andy

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:14 ` Andy Nelson
@ 2005-11-04 16:49 ` Linus Torvalds
0 siblings, 0 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 16:49 UTC (permalink / raw)
To: Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> My measurements of factors of 3-4 on more than one hw arch don't
> mean anything then?

When I _know_ that modern hardware does what you tested at least two
orders of magnitude better than the hardware you tested?

Think about it.

	Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
@ 2005-11-04 15:19 Andy Nelson
0 siblings, 0 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 15:19 UTC (permalink / raw)
To: torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

Nick Piggin wrote:
>Mel Gorman wrote:
>> On Fri, 4 Nov 2005, Nick Piggin wrote:
>>
>> Today's massive machines are tomorrow's desktop. Weak comment, I know, but
>> it's happened before.
>>
>Oh I wouldn't bet against it. And if desktops of the future are using
>100s of GB then they probably would be happy to use 64K pages as well.

Just a note. The data I referenced in my other post that can be found on
comp.arch uses 64k pages as the smallest page size in the study. Pages
sized 1M and 16M were the other two.

As I understand it, only a few arches have hw support for more than 2 page
sizes, but my response is that they will eventually need them. The larger
the memory, the larger the possible page size needs to be too. Otherwise
you are just pushing out the problem for a few years.

Andy

^ permalink raw reply	[flat|nested] 241+ messages in thread
* [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
@ 2005-11-04 17:03 Andy Nelson
2005-11-04 17:49 ` Linus Torvalds
2005-11-04 20:12 ` Ingo Molnar
0 siblings, 2 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 17:03 UTC (permalink / raw)
To: torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

>On Fri, 4 Nov 2005, Andy Nelson wrote:
>>
>> My measurements of factors of 3-4 on more than one hw arch don't
>> mean anything then?
>
>When I _know_ that modern hardware does what you tested at least two
>orders of magnitude better than the hardware you tested?

Ok. In other posts you have skeptically accepted Power as a `modern'
architecture. I have just now dug out some numbers of a slightly different
problem running on a Power 5. Specifically an IBM p575, I think.

These tests were done in June, while the others were done more than 2.5
years ago. In other words, there may be other small tuning optimizations
that have gone in since then too.

The problem is a different configuration of particles, and about 2 times
bigger (7 million) than the one in comp.arch (3 million, I think). I would
estimate that the data set in this test spans something like 2-2.5GB or
so.

Here are the results:

cpus   4k pages   16m pages
 1     4888.74s   2399.36s
 2     2447.68s   1202.71s
 4     1225.98s    617.23s
 6      790.05s    418.46s
 8      592.26s    310.03s
12      398.46s    210.62s
16      296.19s    161.96s

These numbers were on a recent Linux. I don't know which one.

Now it looks like it is down to a factor of 2 or slightly more. That is a
totally different arch, that I think you have accepted as `modern',
running the OS that you say doesn't need big page support.

Still a bit more than insignificant, I would say.

>Think about it.

Likewise.

Andy

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:03 Andy Nelson
@ 2005-11-04 17:49 ` Linus Torvalds
2005-11-04 17:51 ` Andy Nelson
0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 17:49 UTC (permalink / raw)
To: Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> Ok. In other posts you have skeptically accepted Power as a
> `modern' architecture.

Yes, sceptically.

I'd really like to hear what your numbers are on a modern x86. Any x86-64
is interesting, and I can't imagine that with a LANL address you can't
find any.

I do believe that Power is within one order of magnitude of a modern x86
when it comes to TLB fill performance. That's much better than many
others, but whether "almost as good" is within the error range, or whether
it's "only five times worse", I don't know.

The thing is, there's a reason x86 machines kick ass. They are cheap, and
they really _do_ outperform pretty much everything else out there.

Power 5 has a wonderful memory architecture, and those L3 caches kick ass.
They probably don't help you as much as they help databases, though, and
it's entirely possible that a small cheap Opteron with its integrated
memory controller will outperform them on your load if you really don't
have a lot of locality.

	Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:49 ` Linus Torvalds
@ 2005-11-04 17:51 ` Andy Nelson
0 siblings, 0 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 17:51 UTC (permalink / raw)
To: andy, torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

Finding an x86 or AMD box is not the problem. Finding one with a sysadmin
who is willing to let me experiment is. I'll ask around, but it may be a
while.

Andy

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:03 Andy Nelson
2005-11-04 17:49 ` Linus Torvalds
@ 2005-11-04 20:12 ` Ingo Molnar
2005-11-04 21:04 ` Andy Nelson
1 sibling, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 20:12 UTC (permalink / raw)
To: Andy Nelson
Cc: torvalds, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin

* Andy Nelson <andy@thermo.lanl.gov> wrote:

> The problem is a different configuration of particles, and about 2
> times bigger (7 million) than the one in comp.arch (3 million, I think).
> I would estimate that the data set in this test spans something like
> 2-2.5GB or so.
>
> Here are the results:
>
> cpus   4k pages   16m pages
>  1     4888.74s   2399.36s
>  2     2447.68s   1202.71s
>  4     1225.98s    617.23s
>  6      790.05s    418.46s
>  8      592.26s    310.03s
> 12      398.46s    210.62s
> 16      296.19s    161.96s

interesting, and thanks for the numbers. Even if hugetlbs were only
showing a 'mere' 5% improvement, a 5% _user-space improvement_ is still a
considerable improvement that we should try to achieve, if possible
cheaply.

the 'separate hugetlb zone' solution is cheap and simple, and i believe it
should cover your needs of mixed hugetlb and smallpages workloads.

it would work like this: unlike the current hugepages=<nr> boot parameter,
this zone would be useful for other (4K sized) allocations too. If an app
requests a hugepage then we have the chance to allocate it from the
hugetlb zone, in a guaranteed way [up to the point where the whole zone
consists of hugepages only].

the architectural appeal in this solution is that no additional
"fragmentation prevention" has to be done on this zone, because we only
allow content into it that is "easy" to flush - this means that there is
no complexity drag on the generic kernel VM.

can you think of any reason why the boot-time-configured hugetlb zone
would be inadequate for your needs?

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 20:12 ` Ingo Molnar
@ 2005-11-04 21:04 ` Andy Nelson
2005-11-04 21:14 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 21:04 UTC (permalink / raw)
To: andy, mingo
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, nickpiggin, torvalds

Hi,

>can you think of any reason why the boot-time-configured hugetlb zone
>would be inadequate for your needs?

I am not enough of a kernel level person or sysadmin to know for certain,
but I still have big worries about consecutive jobs that run on the same
resources, but want extremely different page behavior. If what you are
suggesting can cause all previous history on those resources to be
forgotten and then reset to whatever it is that I want when I start my
run, then yes. It would be fine for me.

In some sense, this is perhaps what I was asking for in my original
message when I was talking about using batch schedulers, cpusets and
friends to encapsulate regions of resources, that could be reset to nice
states at user specified intervals, like when the batch scheduler releases
one job and another job starts.

The issues that I can still think of that HPC people will need are (some
points here are clearly related to each other, but anyway):

1) How do zones play with NUMA? Does setting up resource management this
way mean that various kernel things that help me access my memory
(hellifino what I'm talking about here--things like tables and lists of
pages that I own and how to access them etc I suppose--whatever it is that
kernels don't get rid of when someone else's job ends and before mine
starts) actually get allocated in some other zone half way across the
machine? This is going to kill me on latency grounds. Can it be set up so
that this reserved special kernel zone is somewhere close by? If it is
bigger than the next guy to get my resources wants, can it be deleted and
reset once my job is finished, so his job can run? This is what I would
hope for and expect that something like cpuset/memsets would help to do.

2) How do zones play with merging small pages into big pages, splitting
big pages into small, or deleting whatever page environment was there in
favor of a reset of those resources to some initial state? If someone runs
a small page job right after my big page job, will they get big pages? If
I run a big page job right after their small page job, will I get small
pages? In each case, will it simply say 'no can do' and die? If this setup
just means that some jobs can't be run, or can't be run after something
else, it will not fly.

3) How does any sort of fallback scheme work? If I can't have all of my
big pages, maybe I'll settle for some small ones and some big ones. Can I
have them? If I can't have them and die instead, zones like this will not
fly.

Points 2 and 3 have mostly to do with the question: does the system
performance degrade over time for different constituencies of users, or
can it stay up stably, serving everyone equally and well for a long time?

4) How does any of this stuff play with interactive management? It is not
going to fly if sysadmins have to get involved on a daily/regular basis,
or even at much more than a cursory level of turning something on once
when the machine is purchased.

5) How does any of this stuff play with me having to rewrite my code to
use nonstandard language features?
If I can't run using standard fortran, standard C and maybe for some folks standard C++ or Java, it won't fly. 6) what about text vs data pages. I'm talking here about executable code vs whatever that code operates on. Do they get to have different sized pages? Do they get allocated from sensible places on the machine, as in reasonably separate from each other but not in some far away zone over the rainbow? 7) If OS's/HW ever get decent support for lots and lots of page sizes (like mips and sparc now) rather than a couple , will the infrastructure be able to give me whichever size I ask for, or will I only get to choose between a couple, even if perhaps settable at boot time? Extensibility like this will be a requirement long term of course. 8) What if I want 32 cpus and 64GB of memory on a machine, get it, finish using it, and then the next jobs in line request say 8 cpus and 16GB of memory, 4cpus and 16GB of memory, 20 cpus and 4GB of memory? Will the zone system be able to handle such dynamically changing things? What I would need to see is that these sorts of issues can be handled gracefully by the OS, perhaps with the help of some user land or priveleged userland hints that would come from things like the batch scheduler or an env variable to set my prefered page size or other things about memory policy. Thanks, Andy PS to Linus: I have secured access to an dual cpu dual core amd box. I have to talk to someone who is not here today to see about turning on large pages. We'll see how that goes probably some time next week. If it is possible, you'll see some benchmarks then. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Ingo Molnar @ 2005-11-04 21:14 UTC (permalink / raw)
To: Andy Nelson
Cc: pj, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mbligh, mel, nickpiggin, torvalds

* Andy Nelson <andy@thermo.lanl.gov> wrote:

> 5) How does any of this stuff play with me having to rewrite my code
>    to use nonstandard language features? If I can't run using
>    standard Fortran, standard C and maybe, for some folks, standard
>    C++ or Java, it won't fly.

it ought to be possible to get pretty much the same API as hugetlbfs
via the 'hugetlb zone' approach too. It doesn't really change the API
and FS side, it only impacts the allocator internally. So if you can
utilize hugetlbfs, you should be able to utilize a 'special zone'
approach pretty much the same way.

	Ingo
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Linus Torvalds @ 2005-11-04 21:22 UTC (permalink / raw)
To: Andy Nelson
Cc: mingo, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mbligh, mel, nickpiggin

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> I am not enough of a kernel-level person or sysadmin to know for
> certain, but I still have big worries about consecutive jobs that
> run on the same resources but want extremely different page
> behavior. If what you are suggesting can cause all previous history
> on those resources to be forgotten and then reset to whatever it is
> that I want when I start my run, then yes.

That would largely be the behaviour. When you use the hugetlb zone for
big pages, nothing else would be there. And when you don't use it,
we'd be able to use those zones for at least page cache and user
private pages - both of which are fairly easy to evict if required.

So the downside is that when the admin requests such a zone at
boot-time, the kernel will never be able to use it for its "normal"
allocations. Not for inodes, not for directory name caching, not for
page tables and not for process and file descriptors. Only a very
certain class of allocations that we know how to evict easily could
use them.

Now, for many loads, that's fine. User virtual pages and page cache
pages are often a big part (in fact, often a huge majority) of memory
use. Not always, though. Some loads really want lots of metadata
caching, and if you make too much of memory be in the largepage
zones, performance would suffer badly on such loads.

But the point is that this is easy(ish) to do, and would likely work
wonderfully well for almost all loads. It does put a small onus on
the maintainer of the machine to give a hint, but it's possible that
normal loads won't mind the limitation and that we could even have a
few hugepage zones by default (limit things to 25% of total memory or
something). In fact, we would almost have to do so initially just to
get better test coverage.

Now, if you want _most_ of memory to be available for hugepages, you
really will always require a special boot option, and a friendly
machine maintainer. Limiting things like inodes, process descriptors
etc to a smallish percentage of memory would not be acceptable in
general.

Something like 25% "big page zones" probably is fine even in normal
use, and 50% might be an acceptable compromise even for machines that
see a mixture of pretty regular use and some specialized use. But a
machine that only cares about certain loads might boot up with 75%
set aside in the large-page zones, and that almost certainly would
_not_ be a good setup for random other usage.

IOW, we want a hint up-front about how important huge pages will be,
because it's practically impossible to free pages later: they _will_
become fragmented with stuff that we definitely do not want to teach
the VM how to handle.

But the hint can be pretty friendly. Especially if it's an option to
just load a lot of memory into the boxes, and none of the loads are
expected to really be excessively close to memory limits (ie you
could just buy an extra 16GB to allow for "slop").

		Linus
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Linus Torvalds @ 2005-11-04 21:39 UTC (permalink / raw)
To: Andy Nelson
Cc: mingo, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mbligh, mel, nickpiggin

On Fri, 4 Nov 2005, Linus Torvalds wrote:
>
> But the hint can be pretty friendly. Especially if it's an option to
> just load a lot of memory into the boxes, and none of the loads are
> expected to really be excessively close to memory limits (ie you
> could just buy an extra 16GB to allow for "slop").

One of the issues _will_ be how to allocate things on NUMA. Right now
"hugetlb" only allows us to say "this much memory for hugetlb", and it
probably needs to be per-zone. Some uses might want to allocate all of
the local memory on one node to huge-page usage (and specialized
programs would then also like to run pinned to that node), others
might want to spread it out. So the maintainer would need to decide
that.

The good news is that you can boot up with almost all zones being
"big page" zones, and you could turn them into "normal zones"
dynamically. It's only going the other way that is hard.

So from a maintenance standpoint, if you manage lots of machines, you
could have them all uniformly boot up with lots of memory set aside
for large pages, and then use user-space tools to individually turn
the zones into regular allocation zones.

		Linus
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Rob Landley @ 2005-11-05 2:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Nelson, mingo, akpm, arjan, arjanv, haveblue, kravetz,
    lhms-devel, linux-kernel, linux-mm, mbligh, mel, nickpiggin

On Friday 04 November 2005 15:22, Linus Torvalds wrote:
> Now, if you want _most_ of memory to be available for hugepages, you
> really will always require a special boot option, and a friendly
> machine maintainer. Limiting things like inodes, process descriptors
> etc to a smallish percentage of memory would not be acceptable in
> general.

But it might make it a lot easier for User Mode Linux to give unused
memory back to the host system via madvise(MADV_DONTNEED). (Assuming
there's some way to beat the page cache into submission and actually
free up space. If there was an option to tell the page cache to stay
the heck out of the hugepage zone, it would be just about perfect...)

Rob
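madvise(2) with MADV_DONTNEED is a real interface; how UML might wrap
it to hand unused guest memory back to the host is sketched here as an
assumption for illustration only:

	#include <sys/mman.h>
	#include <stdio.h>

	/*
	 * Return a no-longer-needed guest region to the host.  For
	 * private anonymous mappings the host may reclaim the backing
	 * pages immediately; re-touching the region later faults in
	 * fresh zero-filled pages.
	 */
	static void uml_release_region(void *guest_region, size_t len)
	{
		if (madvise(guest_region, len, MADV_DONTNEED) == -1)
			perror("madvise");
	}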
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Paul Jackson @ 2005-11-06 10:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: andy, mingo, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mbligh, mel, nickpiggin

How would this hugetlb zone be placed - on which nodes in a NUMA
system? My understanding is that you are thinking of specifying it as
a proportion or amount of total memory, with no particular placement.

I'd rather see it as a subset of the nodes on a system being marked
for use, as much as practical, for easily reclaimed memory (page cache
and user).

My HPC customers normally try to isolate the 'classic Unix load' on a
few nodes that they call the bootcpuset, and keep the other nodes as
unused as practical, except when allocated for dedicated use by a
particular job. These other nodes need to run with a maximum amount of
easily reclaimed memory, while the bootcpuset nodes have no such need.

They don't just want easily reclaimable memory in order to get hugetlb
pages. They also want it so that the memory available for use as
ordinary-sized pages by one job will not be unduly reduced by the
hard-to-reclaim pages left over from some previous job.

This would be easy to do with cpusets, adding a second per-cpuset
nodemask that specified where not-easily-reclaimed kernel allocations
should come from. The typical HPC user would set that second mask to
their bootcpuset.

The few allocation calls in the kernel deemed to be easily reclaimable
(page cache and user space) would have a __GFP_EASYRCLM flag added,
and the cpuset hook in the __alloc_pages code path would put requests
-not- marked __GFP_EASYRCLM on this second set of nodes.

No changes to hugetlbs or to the kernel code that runs at boot, prior
to starting init, would be required at all. The bootcpuset stuff is
set up by a pre-init program (specified using the kernel's "init=..."
boot option). This makes all the configuration of this entirely a
user-space problem.

Cpuset nodes, not zone sizes, are the proper way to manage this, in my
view.

If you ask what this means for small (1 or 2 node) systems, then I
would first ask you what we are trying to do on those systems. I
suspect that that would involve other classes of users, with different
needs, than what Andy or I can speak to.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
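A minimal sketch of the second-nodemask check described above, with
simplified locking. __GFP_EASYRCLM is the flag from Mel's series; the
mems_kernel field and the function itself are hypothetical, invented
for illustration:

	/*
	 * In the __alloc_pages() cpuset hook: easily reclaimed
	 * allocations may use the job's own nodes; everything else
	 * is steered onto the bootcpuset nodes, so jobs don't
	 * inherit unreclaimable debris from their predecessors.
	 */
	static int cpuset_node_allowed_sketch(int node, gfp_t gfp_mask)
	{
		struct cpuset *cs = current->cpuset;

		if (gfp_mask & __GFP_EASYRCLM)
			return node_isset(node, cs->mems_allowed);

		/* hard-to-reclaim: only the second, kernel nodemask */
		return node_isset(node, cs->mems_kernel); /* hypothetical */
	}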
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Gregory Maxwell @ 2005-11-04 21:31 UTC (permalink / raw)
To: Andy Nelson
Cc: mingo, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mbligh, mel, nickpiggin, torvalds

On 11/4/05, Andy Nelson <andy@thermo.lanl.gov> wrote:
> I am not enough of a kernel-level person or sysadmin to know for
> certain, but I still have big worries about consecutive jobs that
> run on the same resources, but want extremely different page
> behavior.

That's the idea. The 'hugetlb zone' will only be usable for
allocations which are guaranteed reclaimable. Reclaimable includes
userspace usage (since at worst an in-use userspace page can be
swapped out and then paged back into another physical location).

For your sort of mixed use this should be a fine solution. However,
there are mixed-use cases that this will not solve; for example, if
the system usage is split between HPC uses and kernel-allocation-heavy
workloads (say, forking 10 quintillion Java processes), then the
hugetlb zone will need to be made small to keep the
kernel-allocation-heavy workload happy.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Andi Kleen @ 2005-11-04 22:43 UTC (permalink / raw)
To: Gregory Maxwell
Cc: Andy Nelson, mingo, akpm, arjan, arjanv, haveblue, kravetz,
    lhms-devel, linux-kernel, linux-mm, mbligh, mel, nickpiggin,
    torvalds

On Friday 04 November 2005 22:31, Gregory Maxwell wrote:
> Thats the idea. The 'hugetlb zone' will only be usable for
> allocations which are guaranteed reclaimable. Reclaimable includes
> userspace usage (since at worst an in-use userspace page can be
> swapped out and then paged back into another physical location).

I don't like it very much. You have two choices if a workload runs out
of the kernel-allocatable pages: either you spill into the reclaimable
zone, or you fail the allocation. The first means that the huge pages
thing is unreliable; the second would mean that all the many problems
of limited lowmem would be back.

None of this is very attractive.

-Andi
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Nick Piggin @ 2005-11-05 0:07 UTC (permalink / raw)
To: Andi Kleen
Cc: Gregory Maxwell, Andy Nelson, mingo, akpm, arjan, arjanv,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, torvalds

Andi Kleen wrote:
> On Friday 04 November 2005 22:31, Gregory Maxwell wrote:
>
>> Thats the idea. The 'hugetlb zone' will only be usable for
>> allocations which are guaranteed reclaimable. Reclaimable includes
>> userspace usage (since at worst an in-use userspace page can be
>> swapped out and then paged back into another physical location).
>
> I don't like it very much. You have two choices if a workload runs
> out of the kernel allocatable pages. Either you spill into the
> reclaimable zone or you fail the allocation. The first means that
> the huge pages thing is unreliable, the second would mean that all
> the many problems of limited lowmem would be back.

These are essentially the same problems that the frag patches face as
well.

> None of this is very attractive.

Though it is simple, and I expect it should actually do a really good
job for the non-kernel-intensive HPC group, and the highly tuned
database group.

Nick

--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Zan Lynx @ 2005-11-06 1:30 UTC (permalink / raw)
To: Andi Kleen
Cc: Gregory Maxwell, Andy Nelson, mingo, akpm, arjan, arjanv,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, nickpiggin, torvalds

Andi Kleen wrote:
> I don't like it very much. You have two choices if a workload runs
> out of the kernel allocatable pages. Either you spill into the
> reclaimable zone or you fail the allocation. The first means that
> the huge pages thing is unreliable, the second would mean that all
> the many problems of limited lowmem would be back.
>
> None of this is very attractive.

You could allow the 'hugetlb zone' to shrink, allowing more kernel
allocations. User pages at the boundary would be moved to make room.
This would at least keep the 'hugetlb zone' pure and not create holes
in it.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Rob Landley @ 2005-11-06 2:25 UTC (permalink / raw)
To: Zan Lynx
Cc: Andi Kleen, Gregory Maxwell, Andy Nelson, mingo, akpm, arjan,
    arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm,
    mbligh, mel, nickpiggin, torvalds

On Saturday 05 November 2005 19:30, Zan Lynx wrote:
> You could allow the 'hugetlb zone' to shrink, allowing more kernel
> allocations. User pages at the boundary would be moved to make room.

Please make that optional if you do. In my potential use case, an OOM
kill lets the administrator know they've got things configured wrong
so they can fix it and try again. Containing and viciously reaping
things like dentries is the behavior I want out of it.

Also, if you do shrink the hugetlb zone it might be possible to
opportunistically expand it back to its original size. There's no
guarantee that a given kernel allocation will ever go away, but if it
_does_ go away then the hugetlb zone should be able to expand to the
next blocking allocation or the maximum size, whichever comes first.

(Given that my understanding of the layout may not match reality at
all; don't ask me how the discontiguous memory stuff would work in
here...)

Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Andy Nelson @ 2005-11-04 17:56 UTC (permalink / raw)
To: torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
    linux-mm, mbligh, mel, mingo, nickpiggin

Correction:

> and you'll see why. Capsule form: Every tree node results in several read

should read:

> and you'll see why. Capsule form: Every tree traversal results in several

Andy
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Andy Nelson @ 2005-11-04 21:51 UTC (permalink / raw)
To: gmaxwell
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
    linux-mm, mbligh, mel, mingo, nickpiggin, torvalds

Hi folks,

It sounds like in principle I (`I' = generic HPC person) could be
happy with this sort of solution. The proof of the pudding is in the
eating however, and various perversions and misunderstandings can
still always crop up. Hopefully they can be solved or avoided if they
do show up though. Also, other folk might not be so satisfied. I'll
let them speak for themselves though.

One issue remaining is that I don't know how this hugetlbfs stuff that
was discussed actually works or should work, in terms of the interface
to my code. What would work for me is something to the effect of

   f90 -flag_that_turns_access_to_big_pages_on code.f

That then substitutes in allocation calls to this hugetlbfs zone
instead of `normal' allocation calls to generic memory, and perhaps
lets me fall back to normal memory up to whatever system limits may
exist if no big pages are available. Or even something more simple
like

   setenv HEY_OS_I_WANT_BIG_PAGES_FOR_MY_JOB

or alternatively, a similar request in a batch script. I don't know
that any of these things really have much to do with the OS directly
however.

Thanks all, and have a good weekend.

Andy
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Seth, Rohit @ 2005-11-05 1:37 UTC (permalink / raw)
To: Nick Piggin, Andi Kleen
Cc: Gregory Maxwell, Andy Nelson, mingo, akpm, arjan, arjanv,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, torvalds

From: Nick Piggin, Friday, November 04, 2005 4:08 PM
> These are essentially the same problems that the frag patches face
> as well.
>
>> None of this is very attractive.
>
> Though it is simple and I expect it should actually do a really good
> job for the non-kernel-intensive HPC group, and the highly tuned
> database group.

Not sure how applications can seamlessly use the proposed hugetlb zone
based on hugetlbfs. Depending on the programming language, it might
actually need changes in libs/tools etc.

As far as databases are concerned, I think they mostly already grab
vast chunks of memory to be used as hugepages (particularly for
big-mem systems), which is a separate list of pages. And they are
actually also glad that the kernel never looks at them for any other
purpose.

-rohit
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Andy Nelson @ 2005-11-07 0:34 UTC (permalink / raw)
To: ak, nickpiggin, rohit.seth
Cc: akpm, andy, arjan, arjanv, gmaxwell, haveblue, kravetz,
    lhms-devel, linux-kernel, linux-mm, mbligh, mel, mingo, torvalds

Hi folks,

> Not sure how applications can seamlessly use the proposed hugetlb
> zone based on hugetlbfs. Depending on the programming language, it
> might actually need changes in libs/tools etc.

This is my biggest worry as well. I can't recall the details right
now, but I have some memories of people telling me, for example, that
large pages on Linux were simply not available to Fortran programs,
period, due to lack of toolchain/lib stuff, just as you note. What the
reasons were/are I have no idea. I do know that the Power 5 numbers I
quoted a couple of days ago required that the sysadmin apply some
special patches to Linux, plus linking to an extra library. I don't
know what patches (they came from IBM), but for xlf95 on Power5, the
library I had to link with was this one:

   -T /usr/local/lib64/elf64ppc.lbss.x

No changes were required to my code, which is what I need, but codes
that did not link to this library would not run on a kernel that had
the patches installed, and code that did link with this library would
not run on a kernel that didn't have those patches.

I don't know what library this is or what was in it, but I can't
imagine it would have been something very standard or mainline, with
that sort of drastic behavior. Maybe the IBM folk can explain what
this was about.

I will ask some folks here who should know how it may work on
intel/amd machines about how large pages can be used this coming week,
when I attempt to do page size speed testing for my code, as I
promised before, as I promised before, as I promised before.

Andy
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Adam Litke @ 2005-11-07 18:58 UTC (permalink / raw)
To: Andy Nelson
Cc: ak, nickpiggin, rohit.seth, akpm, arjan, arjanv, gmaxwell,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, mingo, torvalds

On Sun, 2005-11-06 at 17:34 -0700, Andy Nelson wrote:
> This is my biggest worry as well. I can't recall the details right
> now, but I have some memories of people telling me, for example,
> that large pages on Linux were simply not available to Fortran
> programs, period, due to lack of toolchain/lib stuff, just as you
> note. What the reasons were/are I have no idea. I do know that the
> Power 5 numbers I quoted a couple of days ago required that the
> sysadmin apply some special patches to Linux, plus linking to an
> extra library. I don't know what patches (they came from IBM), but
> for xlf95 on Power5, the library I had to link with was this one:
>
>    -T /usr/local/lib64/elf64ppc.lbss.x
>
> No changes were required to my code, which is what I need, but codes
> that did not link to this library would not run on a kernel that had
> the patches installed, and code that did link with this library
> would not run on a kernel that didn't have those patches.
>
> I don't know what library this is or what was in it, but I can't
> imagine it would have been something very standard or mainline, with
> that sort of drastic behavior. Maybe the IBM folk can explain what
> this was about.

Wow. It's amazing how these things spread from my little corner of
the universe ;) What you speak of sounds dangerously close to what
I've been working on lately. Indeed it is not standard at all yet.

I am currently working on a new approach to what you tried. It
requires fewer changes to the kernel and implements the special large
page usage entirely in an LD_PRELOAD library. And on newer kernels,
programs linked with the .x ldscript you mention above can run using
all small pages if not enough large pages are available.

For the curious, here's how this all works:

1) Link the unmodified application source with a custom linker script
   which does the following:
   - Align elf segments to large page boundaries
   - Assert a non-standard Elf program header flag (PF_LINUX_HTLB)
     to signal something (see below) to use large pages.
2) Boot a kernel that supports copy-on-write for PRIVATE hugetlb
   pages.
3) Use an LD_PRELOAD library which reloads the PF_LINUX_HTLB segments
   into large pages and transfers control back to the application.

> I will ask some folks here who should know how it may work on
> intel/amd machines about how large pages can be used this coming
> week, when I attempt to do page size speed testing for my code, as I
> promised before, as I promised before, as I promised before.

I have used this method on ppc64, x86, and x86_64 machines
successfully. I'd love to see how my system works for a real-world
user, so if you're interested in trying it out I can send you the
current version.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
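For the even more curious, a compressed C sketch of step 3.
dl_iterate_phdr(3) is a real glibc interface; the PF_LINUX_HTLB value
and the remap_to_hugepages() helper are assumptions standing in for
the preload library's real work of rebuilding the segment on a
hugetlbfs-backed mapping:

	#include <link.h>

	#define PF_LINUX_HTLB  0x00100000   /* assumed flag value */

	/* hypothetical: copy/remap the segment onto hugetlbfs */
	extern void remap_to_hugepages(void *addr, size_t len);

	static int phdr_cb(struct dl_phdr_info *info, size_t size,
			   void *data)
	{
		int i;

		/* walk every PT_LOAD header, pick the marked ones */
		for (i = 0; i < info->dlpi_phnum; i++) {
			const ElfW(Phdr) *ph = &info->dlpi_phdr[i];

			if (ph->p_type == PT_LOAD &&
			    (ph->p_flags & PF_LINUX_HTLB))
				remap_to_hugepages((void *)(info->dlpi_addr +
							    ph->p_vaddr),
						   ph->p_memsz);
		}
		return 0;
	}

	/* called once from the preload library's constructor:
	 *   dl_iterate_phdr(phdr_cb, NULL);                    */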
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Rohit Seth @ 2005-11-07 20:51 UTC (permalink / raw)
To: Adam Litke
Cc: Andy Nelson, ak, nickpiggin, akpm, arjan, arjanv, gmaxwell,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, mingo, torvalds

On Mon, 2005-11-07 at 12:58 -0600, Adam Litke wrote:
> I am currently working on a new approach to what you tried. It
> requires fewer changes to the kernel and implements the special
> large page usage entirely in an LD_PRELOAD library. And on newer
> kernels, programs linked with the .x ldscript you mention above can
> run using all small pages if not enough large pages are available.

Isn't it true that most of the time we'll need to be worrying about
run-time allocation of memory (using malloc or such) as compared to
static?

> For the curious, here's how this all works:
> 1) Link the unmodified application source with a custom linker
>    script which does the following:
>    - Align elf segments to large page boundaries
>    - Assert a non-standard Elf program header flag (PF_LINUX_HTLB)
>      to signal something (see below) to use large pages.

We'll need a similar flag for even code pages to start using hugetlb
pages. In this case, to keep the kernel changes to a minimum, RTLD
will need to be modified.

> 2) Boot a kernel that supports copy-on-write for PRIVATE hugetlb
>    pages.
> 3) Use an LD_PRELOAD library which reloads the PF_LINUX_HTLB
>    segments into large pages and transfers control back to the
>    application.

COW, swap etc. are all very nice (little!) features that make hugetlb
get used more transparently.

-rohit
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Andy Nelson @ 2005-11-07 20:55 UTC (permalink / raw)
To: agl, rohit.seth
Cc: ak, akpm, andy, arjan, arjanv, gmaxwell, haveblue, kravetz,
    lhms-devel, linux-kernel, linux-mm, mbligh, mel, mingo,
    nickpiggin, torvalds

Hi,

> Isn't it true that most of the time we'll need to be worrying about
> run-time allocation of memory (using malloc or such) as compared to
> static?

Perhaps for C. Not necessarily true for Fortran. I don't know anything
about how memory allocations proceed there, but there are no `malloc'
calls (at least with that spelling) in the language itself, and I
don't know what it does for either static or dynamic allocations under
the hood. It could be malloc-like or whatever else. In the language
itself, there are language features for allocating and deallocating
memory and I've seen code that uses them, but I haven't played with it
myself, since my codes need pretty much all the various pieces of
memory all the time, and so are simply statically defined.

If you call something like malloc yourself, you risk portability
problems in Fortran. Fortran 2003 supposedly addresses some of this
with some C interop features, but it only got approved within the last
year, and no compilers really exist for it yet, let alone code having
been written for it.

Andy
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Martin J. Bligh @ 2005-11-07 20:58 UTC (permalink / raw)
To: Andy Nelson, agl, rohit.seth
Cc: ak, akpm, arjan, arjanv, gmaxwell, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mel, mingo, nickpiggin, torvalds

>> Isn't it true that most of the time we'll need to be worrying about
>> run-time allocation of memory (using malloc or such) as compared to
>> static?
>
> Perhaps for C. Not necessarily true for Fortran. I don't know
> anything about how memory allocations proceed there, but there are
> no `malloc' calls (at least with that spelling) in the language
> itself, and I don't know what it does for either static or dynamic
> allocations under the hood. It could be malloc-like or whatever
> else. In the language itself, there are language features for
> allocating and deallocating memory and I've seen code that uses
> them, but I haven't played with it myself, since my codes need
> pretty much all the various pieces of memory all the time, and so
> are simply statically defined.

Doesn't Fortran shove everything in BSS to make some truly monstrous
segment?

M.
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Rohit Seth @ 2005-11-07 21:20 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andy Nelson, agl, ak, akpm, arjan, arjanv, gmaxwell, haveblue,
    kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo,
    nickpiggin, torvalds

On Mon, 2005-11-07 at 12:58 -0800, Martin J. Bligh wrote:
>>> Isn't it true that most of the time we'll need to be worrying
>>> about run-time allocation of memory (using malloc or such) as
>>> compared to static?
>>
>> Perhaps for C. Not necessarily true for Fortran. [...]
>
> Doesn't Fortran shove everything in BSS to make some truly monstrous
> segment?

Hmmm... that would be strange. So if an app is using a TB of data,
then a TB of space on disk... then read in at load time (or maybe some
optimization in the RTLD knows that this is BSS and does not need to
be loaded, but then a TB of disk space is wasted).

-rohit
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Adam Litke @ 2005-11-07 21:33 UTC (permalink / raw)
To: Rohit Seth
Cc: Martin J. Bligh, Andy Nelson, ak, akpm, arjan, arjanv, gmaxwell,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel,
    mingo, nickpiggin, torvalds

On Mon, 2005-11-07 at 13:20 -0800, Rohit Seth wrote:
> On Mon, 2005-11-07 at 12:58 -0800, Martin J. Bligh wrote:
>> Doesn't Fortran shove everything in BSS to make some truly
>> monstrous segment?
>
> Hmmm... that would be strange. So if an app is using a TB of data,
> then a TB of space on disk... then read in at load time (or maybe
> some optimization in the RTLD knows that this is BSS and does not
> need to be loaded, but then a TB of disk space is wasted).

Nope, the BSS is defined as the difference between the file size (on
disk) and the memory size (as specified in the ELF program header for
the data segment). So the kernel loads the pre-initialized data from
disk and extends the mapping to include room for the BSS.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
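In code, the relationship Adam describes is just two ELF
program-header fields; a one-line illustration, assuming phdr points
at the data segment's PT_LOAD header:

	#include <elf.h>

	/*
	 * p_filesz: initialized data actually stored in the file;
	 * p_memsz:  the full in-memory size of the segment.  The
	 * kernel zero-extends the difference -- that is the BSS, so
	 * a TB-sized BSS costs no disk space at all.
	 */
	size_t bss_size = phdr->p_memsz - phdr->p_filesz;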
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: David Gibson @ 2005-11-08 2:12 UTC (permalink / raw)
To: Andy Nelson
Cc: agl, rohit.seth, ak, akpm, arjan, arjanv, gmaxwell, haveblue,
    kravetz, lhms-devel, linux-kernel, linux-mm, mbligh, mel, mingo,
    nickpiggin, torvalds

On Mon, Nov 07, 2005 at 01:55:32PM -0700, Andy Nelson wrote:
> Perhaps for C. Not necessarily true for Fortran. I don't know
> anything about how memory allocations proceed there, but there are
> no `malloc' calls (at least with that spelling) in the language
> itself, and I don't know what it does for either static or dynamic
> allocations under the hood. It could be malloc-like or whatever
> else. In the language itself, there are language features for
> allocating and deallocating memory and I've seen code that uses
> them, but I haven't played with it myself, since my codes need
> pretty much all the various pieces of memory all the time, and so
> are simply statically defined.
>
> If you call something like malloc yourself, you risk portability
> problems in Fortran. Fortran 2003 supposedly addresses some of this
> with some C interop features, but it only got approved within the
> last year, and no compilers really exist for it yet, let alone code
> having been written for it.

I believe F90 has a couple of different ways of dynamically allocating
memory. I'd expect in most implementations the FORTRAN runtime would
translate that into a malloc() call. However, as I gather, many HPC
apps are written by people who are scientists first and programmers
second, and who still think in F77, where there is no dynamic memory
allocation. Hence, gigantic arrays in the BSS are common FORTRAN
practice.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Adam Litke @ 2005-11-07 21:11 UTC (permalink / raw)
To: Rohit Seth
Cc: Andy Nelson, ak, nickpiggin, akpm, arjan, arjanv, gmaxwell,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, mingo, torvalds

On Mon, 2005-11-07 at 12:51 -0800, Rohit Seth wrote:
> Isn't it true that most of the time we'll need to be worrying about
> run-time allocation of memory (using malloc or such) as compared to
> static?

It really depends on the workload. I've run HPC apps with 10+GB data
segments. I've also worked with applications that would benefit from
a hugetlb-enabled morecore (glibc malloc/sbrk). I'd like to see one
standard hugetlb preload library that handles every different "memory
object" we care about (static and dynamic). That's what I'm working
on now.

> We'll need a similar flag for even code pages to start using hugetlb
> pages. In this case, to keep the kernel changes to a minimum, RTLD
> will need to be modified.

Yes, I foresee the functionality currently in my preload lib existing
in RTLD at some point way down the road.

> COW, swap etc. are all very nice (little!) features that make
> hugetlb get used more transparently.

Indeed. See my parallel post of a hugetlb-COW RFC :)

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
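A sketch of the "hugetlb-enabled morecore" idea. glibc's __morecore
hook is real; pointing it at a hugetlbfs-backed arena is shown here
with the hugetlbfs mmap setup and error handling reduced to
assumptions:

	#include <malloc.h>
	#include <stddef.h>

	static char *htlb_top;  /* current break inside hugetlbfs map */
	static char *htlb_end;  /* end of the mapping                 */

	/*
	 * Replacement for sbrk(): hand the malloc heap out of one
	 * large hugetlbfs mapping created at startup (not shown).
	 */
	static void *htlb_morecore(ptrdiff_t increment)
	{
		char *prev = htlb_top;

		if (htlb_top + increment > htlb_end)
			return NULL;          /* out of huge pages */
		htlb_top += increment;
		return prev;
	}

	/* installed from the preload library's constructor:
	 *   __morecore = htlb_morecore;                      */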
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Rohit Seth @ 2005-11-07 21:31 UTC (permalink / raw)
To: Adam Litke
Cc: Andy Nelson, ak, nickpiggin, akpm, arjan, arjanv, gmaxwell,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, mingo, torvalds

On Mon, 2005-11-07 at 15:11 -0600, Adam Litke wrote:
> It really depends on the workload. I've run HPC apps with 10+GB data
> segments. I've also worked with applications that would benefit from
> a hugetlb-enabled morecore (glibc malloc/sbrk). I'd like to see one
> standard hugetlb preload library that handles every different
> "memory object" we care about (static and dynamic). That's what I'm
> working on now.

As said below, we will need this functionality even for code pages. I
would rather have the changes absorbed in the run-time loader than
have a preload library. Makes it easier to manage.

malloc/sbrk are the interesting part that does pose some challenges
(as on some archs a different address space is reserved for hugetlb).
Moreover, it will also be critical that the existing semantics of
normal pages are maintained even when the application ends up using
hugepages.

> Yes, I foresee the functionality currently in my preload lib
> existing in RTLD at some point way down the road.

It will be much sooner...

-rohit
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Seth, Rohit @ 2005-11-05 1:52 UTC (permalink / raw)
To: Linus Torvalds, Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
    linux-mm, mbligh, mel, mingo, nickpiggin

From: Linus Torvalds, Friday, November 04, 2005 8:01 AM
> If I remember correctly, ia64 used to suck horribly because Linux
> had to use a mode where the hw page table walker didn't work well
> (maybe it was just an itanium 1 bug), but should be better now. But
> x86 probably kicks its butt.

I don't remember a difference of more than (roughly) 30 percentage
points even on first-generation Itaniums (using hugetlb vs normal
pages), and a few more percentage points when the walker was disabled.
Over time the page table walker on IA-64 has gotten more aggressive.

...though I believe that 30% is a lot of performance.

-rohit
Thread overview: 241+ messages
2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
2005-10-30 18:34 ` [PATCH 1/7] Fragmentation Avoidance V19: 001_antidefrag_flags Mel Gorman
2005-10-30 18:34 ` [PATCH 2/7] Fragmentation Avoidance V19: 002_usemap Mel Gorman
2005-10-30 18:34 ` [PATCH 3/7] Fragmentation Avoidance V19: 003_fragcore Mel Gorman
2005-10-30 18:34 ` [PATCH 4/7] Fragmentation Avoidance V19: 004_fallback Mel Gorman
2005-10-30 18:34 ` [PATCH 5/7] Fragmentation Avoidance V19: 005_largealloc_tryharder Mel Gorman
2005-10-30 18:34 ` [PATCH 6/7] Fragmentation Avoidance V19: 006_percpu Mel Gorman
2005-10-30 18:34 ` [PATCH 7/7] Fragmentation Avoidance V19: 007_stats Mel Gorman
2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz
2005-10-31 6:37 ` Nick Piggin
2005-10-31 7:54 ` Andrew Morton
2005-10-31 7:11 ` Nick Piggin
2005-10-31 16:19 ` Mel Gorman
2005-10-31 23:54 ` Nick Piggin
2005-11-01 1:28 ` Mel Gorman
2005-11-01 1:42 ` Nick Piggin
[not found] ` <27700000.1130769270@[10.10.2.4]>
[not found] ` <20051031112409.153e7048.akpm@osdl.org>
[not found] ` <3660000.1130787652@flay>
2005-10-31 23:59 ` Nick Piggin
2005-11-01 1:36 ` Mel Gorman
[not found] ` <4366A8D1.7020507@yahoo.com.au>
[not found] ` <Pine.LNX.4.58.0510312333240.29390@skynet>
[not found] ` <4366C559.5090504@yahoo.com.au>
2005-11-01 15:25 ` Martin J. Bligh
2005-11-01 15:33 ` Dave Hansen
2005-11-01 16:57 ` Mel Gorman
2005-11-01 17:00 ` Mel Gorman
2005-11-01 18:58 ` Rob Landley
[not found] ` <Pine.LNX.4.58.0511010137020.29390@skynet>
[not found] ` <4366D469.2010202@yahoo.com.au>
[not found] ` <Pine.LNX.4.58.0511011014060.14884@skynet>
2005-11-01 13:56 ` Ingo Molnar
2005-11-01 14:10 ` Dave Hansen
2005-11-01 14:29 ` Ingo Molnar
2005-11-01 14:49 ` Dave Hansen
2005-11-01 15:01 ` Ingo Molnar
2005-11-01 15:22 ` Dave Hansen
[not found] ` <20051102084946.GA3930@elte.hu>
[not found] ` <436880B8.1050207@yahoo.com.au>
2005-11-02 9:32 ` Dave Hansen
2005-11-02 9:48 ` Nick Piggin
2005-11-02 10:54 ` Dave Hansen
2005-11-02 15:02 ` Martin J. Bligh
2005-11-03 3:21 ` Nick Piggin
2005-11-03 15:36 ` Martin J. Bligh
2005-11-03 15:40 ` Arjan van de Ven
2005-11-03 15:51 ` Linus Torvalds
2005-11-03 15:57 ` Martin J. Bligh
2005-11-03 16:20 ` Arjan van de Ven
2005-11-03 16:27 ` Mel Gorman
2005-11-03 16:46 ` Linus Torvalds
2005-11-03 16:52 ` Martin J. Bligh
2005-11-03 17:19 ` Linus Torvalds
2005-11-03 17:48 ` Dave Hansen
2005-11-03 17:51 ` Martin J. Bligh
2005-11-03 17:59 ` Arjan van de Ven
2005-11-03 18:08 ` Linus Torvalds
2005-11-03 18:17 ` Martin J. Bligh
2005-11-03 18:44 ` Linus Torvalds
2005-11-03 18:51 ` Martin J. Bligh
2005-11-03 19:35 ` Linus Torvalds
2005-11-03 22:40 ` Martin J. Bligh
2005-11-03 22:56 ` Linus Torvalds
2005-11-03 23:01 ` Martin J. Bligh
2005-11-04 0:58 ` Nick Piggin
2005-11-04 1:06 ` Linus Torvalds
2005-11-04 1:20 ` Paul Mackerras
2005-11-04 1:22 ` Nick Piggin
2005-11-04 1:48 ` Mel Gorman
2005-11-04 1:59 ` Nick Piggin
2005-11-04 2:35 ` Mel Gorman
2005-11-04 1:26 ` Mel Gorman
2005-11-03 21:11 ` Mel Gorman
2005-11-03 18:03 ` Linus Torvalds
2005-11-03 20:00 ` Paul Jackson
2005-11-03 20:46 ` Mel Gorman
2005-11-03 18:48 ` Martin J. Bligh
2005-11-03 19:08 ` Linus Torvalds
2005-11-03 22:37 ` Martin J. Bligh
2005-11-03 23:16 ` Linus Torvalds
2005-11-03 23:39 ` Martin J. Bligh
2005-11-04 0:42 ` Nick Piggin
2005-11-04 4:39 ` Andrew Morton
2005-11-04 16:22 ` Mel Gorman
2005-11-03 15:53 ` Martin J. Bligh
2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
2005-11-01 17:19 ` Mel Gorman
2005-11-02 0:32 ` KAMEZAWA Hiroyuki
2005-11-02 11:22 ` Mel Gorman
2005-11-01 18:06 ` linux-os (Dick Johnson)
2005-11-02 7:19 ` Ingo Molnar
2005-11-02 7:46 ` Gerrit Huizenga
2005-11-02 8:50 ` Nick Piggin
2005-11-02 9:12 ` Gerrit Huizenga
2005-11-02 9:37 ` Nick Piggin
2005-11-02 10:17 ` Gerrit Huizenga
2005-11-02 23:47 ` Rob Landley
2005-11-03 4:43 ` Nick Piggin
2005-11-03 6:07 ` Rob Landley
2005-11-03 7:34 ` Nick Piggin
2005-11-03 17:54 ` Rob Landley
2005-11-03 20:13 ` Jeff Dike
2005-11-03 16:35 ` Jeff Dike
2005-11-03 16:23 ` Badari Pulavarty
2005-11-03 18:27 ` Jeff Dike
2005-11-03 18:49 ` Rob Landley
2005-11-04 4:52 ` Andrew Morton
2005-11-04 5:35 ` Paul Jackson
2005-11-04 5:48 ` Andrew Morton
2005-11-04 6:42 ` Paul Jackson
2005-11-04 7:10 ` Andrew Morton
2005-11-04 7:45 ` Paul Jackson
2005-11-04 8:02 ` Andrew Morton
2005-11-04 9:52 ` Paul Jackson
2005-11-04 15:27 ` Martin J. Bligh
2005-11-04 15:19 ` Martin J. Bligh
2005-11-04 17:38 ` Andrew Morton
2005-11-04 6:16 ` Bron Nelson
2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar
2005-11-04 7:36 ` Andrew Morton
2005-11-04 8:07 ` Ingo Molnar
2005-11-04 10:06 ` Paul Jackson
2005-11-04 15:24 ` Martin J. Bligh
2005-11-04 8:18 ` Arjan van de Ven
2005-11-04 10:04 ` Paul Jackson
2005-11-04 15:14 ` Rob Landley
2005-11-04 10:14 ` Bernd Petrovitsch
2005-11-04 10:21 ` Ingo Molnar
2005-11-04 11:17 ` Bernd Petrovitsch
2005-11-02 10:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar
2005-11-02 11:04 ` Gerrit Huizenga
2005-11-02 12:00 ` Ingo Molnar
2005-11-02 12:42 ` Dave Hansen
2005-11-02 15:02 ` Gerrit Huizenga
2005-11-03 0:10 ` Rob Landley
2005-11-02 7:57 ` Nick Piggin
2005-11-02 0:51 ` Nick Piggin
2005-11-02 7:42 ` Dave Hansen
2005-11-02 8:24 ` Nick Piggin
2005-11-02 8:33 ` Yasunori Goto
2005-11-02 8:43 ` Nick Piggin
2005-11-02 14:51 ` Martin J. Bligh
2005-11-02 23:28 ` Rob Landley
2005-11-03 5:26 ` Jeff Dike
2005-11-03 5:41 ` Rob Landley
2005-11-04 3:26 ` [uml-devel] " Blaisorblade
2005-11-04 15:50 ` Rob Landley
2005-11-04 17:18 ` Blaisorblade
2005-11-04 17:44 ` Rob Landley
2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman
2005-11-03 3:14 ` Nick Piggin
2005-11-03 12:19 ` Mel Gorman
2005-11-10 18:47 ` Steve Lord
2005-11-03 15:34 ` Martin J. Bligh
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
2005-11-01 14:46 ` Ingo Molnar
2005-11-01 15:23 ` Mel Gorman
2005-11-01 18:33 ` Rob Landley
2005-11-01 19:02 ` Ingo Molnar
2005-11-01 14:50 ` Dave Hansen
2005-11-01 15:24 ` Mel Gorman
2005-11-02 5:11 ` Andrew Morton
2005-11-01 18:23 ` Rob Landley
2005-11-01 20:31 ` Joel Schopp
2005-11-01 20:59 ` Joel Schopp
2005-11-02 1:06 ` Nick Piggin
2005-11-02 1:41 ` Martin J. Bligh
2005-11-02 2:03 ` Nick Piggin
2005-11-02 2:24 ` Martin J. Bligh
2005-11-02 2:49 ` Nick Piggin
2005-11-02 4:39 ` Martin J. Bligh
2005-11-02 5:09 ` Nick Piggin
2005-11-02 5:14 ` Martin J. Bligh
2005-11-02 6:23 ` KAMEZAWA Hiroyuki
2005-11-02 10:15 ` Nick Piggin
2005-11-02 7:19 ` Yasunori Goto
2005-11-02 11:48 ` Mel Gorman
2005-11-02 11:41 ` Mel Gorman
2005-11-02 11:37 ` Mel Gorman
2005-11-02 15:11 ` Mel Gorman
-- strict thread matches above, loose matches on Subject: below --
2005-11-04 1:00 Andy Nelson
2005-11-04 1:16 ` Martin J. Bligh
2005-11-04 1:27 ` Nick Piggin
2005-11-04 5:14 ` Linus Torvalds
2005-11-04 6:10 ` Paul Jackson
2005-11-04 6:38 ` Ingo Molnar
2005-11-04 7:26 ` Paul Jackson
2005-11-04 7:37 ` Ingo Molnar
2005-11-04 15:31 ` Linus Torvalds
2005-11-04 15:39 ` Martin J. Bligh
2005-11-04 15:53 ` Ingo Molnar
2005-11-06 7:34 ` Paul Jackson
2005-11-06 15:55 ` Linus Torvalds
2005-11-06 18:18 ` Paul Jackson
2005-11-06 8:44 ` Kyle Moffett
2005-11-06 16:12 ` Linus Torvalds
2005-11-06 17:00 ` Linus Torvalds
2005-11-07 8:00 ` Ingo Molnar
2005-11-07 11:00 ` Dave Hansen
2005-11-07 12:20 ` Ingo Molnar
2005-11-07 19:34 ` Steven Rostedt
2005-11-07 23:38 ` Joel Schopp
2005-11-13 2:30 ` Rob Landley
2005-11-14 1:58 ` Joel Schopp
2005-11-04 7:44 ` Eric Dumazet
2005-11-07 16:42 ` Adam Litke
2005-11-04 14:56 ` Andy Nelson
2005-11-04 15:18 ` Ingo Molnar
2005-11-04 15:39 ` Andy Nelson
2005-11-04 16:05 ` Ingo Molnar
2005-11-04 16:07 ` Linus Torvalds
2005-11-04 16:40 ` Ingo Molnar
2005-11-04 17:22 ` Linus Torvalds
2005-11-04 17:43 ` Andy Nelson
2005-11-04 16:00 ` Linus Torvalds
2005-11-04 16:13 ` Martin J. Bligh
2005-11-04 16:40 ` Linus Torvalds
2005-11-04 17:10 ` Martin J. Bligh
2005-11-04 16:14 ` Andy Nelson
2005-11-04 16:49 ` Linus Torvalds
2005-11-04 15:19 Andy Nelson
2005-11-04 17:03 Andy Nelson
2005-11-04 17:49 ` Linus Torvalds
2005-11-04 17:51 ` Andy Nelson
2005-11-04 20:12 ` Ingo Molnar
2005-11-04 21:04 ` Andy Nelson
2005-11-04 21:14 ` Ingo Molnar
2005-11-04 21:22 ` Linus Torvalds
2005-11-04 21:39 ` Linus Torvalds
2005-11-05 2:48 ` Rob Landley
2005-11-06 10:59 ` Paul Jackson
2005-11-04 21:31 ` Gregory Maxwell
2005-11-04 22:43 ` Andi Kleen
2005-11-05 0:07 ` Nick Piggin
2005-11-06 1:30 ` Zan Lynx
2005-11-06 2:25 ` Rob Landley
2005-11-04 17:56 Andy Nelson
2005-11-04 21:51 Andy Nelson
2005-11-05 1:37 Seth, Rohit
2005-11-07 0:34 ` Andy Nelson
2005-11-07 18:58 ` Adam Litke
2005-11-07 20:51 ` Rohit Seth
2005-11-07 20:55 ` Andy Nelson
2005-11-07 20:58 ` Martin J. Bligh
2005-11-07 21:20 ` Rohit Seth
2005-11-07 21:33 ` Adam Litke
2005-11-08 2:12 ` David Gibson
2005-11-07 21:11 ` Adam Litke
2005-11-07 21:31 ` Rohit Seth
2005-11-05 1:52 Seth, Rohit