* [PATCH 0/7] Fragmentation Avoidance V19
@ 2005-10-30 18:33 Mel Gorman
From: Mel Gorman @ 2005-10-30 18:33 UTC
To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel
Hi Andrew,
This is the latest release of the fragmentation avoidance patches with no
code changes since v18. If it is possible, I would like to get this into -mm,
so this patch is generated against the latest -mm tree 2.6.14-rc5-mm1 and
is known to apply cleanly. If there is another tree that should be diffed
against instead, just say so and I'll send another version.
Here are a few brief reasons why this set of patches is useful;
o Reduced fragmentation improves the chance a large order allocation succeeds
o General-purpose memory hotplug needs the page/memory groupings provided
o Reduces the number of badly-placed pages that the page migration mechanism
must deal with. This also applies to any active page defragmentation mechanism.
o This patch is a pre-requisite for a linear scanning mechanism that could
be used to guarantee large-page allocations
Built and tested successfully on a single-processor AMD machine, a quad
processor Xeon machine and PPC64. Benchmarks were generated on the Xeon machine.
Changelog since v18
o Resync against 2.6.14-rc5-mm1
o 004_markfree dropped
o Documentation note added on the behavior of free_area.nr_free
Changelog since v17
o Update to 2.6.14-rc4-mm1
o Remove explicit casts where implicit casts were in place
o Change __GFP_USER to __GFP_EASYRCLM, RCLM_USER to RCLM_EASY and PCPU_USER to
PCPU_EASY
o Print a warning and return NULL if both RCLM flags are set in the GFP flags
o Reduce size of fallback_allocs
o Change magic number 64 to FREE_AREA_USEMAP_SIZE
o CodingStyle regressions cleanup
o Move sparsemem setup_usemap() out of header
o Changed fallback_balance to a mechanism that depended on zone->present_pages
to avoid hotplug problems later
o Many superfluous parentheses removed
Changelog since v16
o Variables used in bit operations are now unsigned long. Note that when used
as indices, they remain integers and are cast to unsigned long where
necessary, because aim9 shows a regression (~10% slowdown) when unsigned
longs are used throughout
o 004_showfree added to provide more debugging information
o 008_stats dropped. Even with CONFIG_ALLOCSTATS disabled, it caused
severe performance regressions; no explanation has been found yet
o for_each_rclmtype_order moved to header
o More coding style cleanups
Changelog since V14 (V15 not released)
o Update against 2.6.14-rc3
o Resync with Joel's work. All suggestions made on fix-ups to his last
set of patches should also be in here. e.g. __GFP_USER is still __GFP_USER
but is better commented.
o Large amount of CodingStyle, readability cleanups and corrections pointed
out by Dave Hansen.
o Fix CONFIG_NUMA error that corrupted per-cpu lists
o Patches broken out to have one-feature-per-patch rather than
more-code-per-patch
o Fix fallback bug where pages for RCLM_NORCLM end up on random other
free lists.
Changelog since V13
o Patches are now broken out
o Added per-cpu draining of userrclm pages
o Brought the patch more in line with memory hotplug work
o Fine-grained use of the __GFP_USER and __GFP_KERNRCLM flags
o Many coding-style corrections
o Many whitespace-damage corrections
Changelog since V12
o Minor whitespace damage fixed as pointed by Joel Schopp
Changelog since V11
o Mainly a rediff against 2.6.12-rc5
o Use #defines for indexing into pcpu lists
o Fix rounding error in the size of usemap
Changelog since V10
o All allocation types now use per-cpu caches like the standard allocator
o Removed all the additional buddy allocator statistic code
o Eliminated three zone fields that could be lived without
o Simplified some loops
o Removed many unnecessary calculations
Changelog since V9
o Tightened what pools are used for fallbacks, less likely to fragment
o Many micro-optimisations to have the same performance as the standard
allocator. Modified allocator now faster than standard allocator using
gcc 3.3.5
o Add counter for splits/coalescing
Changelog since V8
o rmqueue_bulk() allocates pages in large blocks and breaks them up into the
requested size. Reduces the number of calls to __rmqueue()
o Beancounters are now a configurable option under "Kernel Hacking"
o Broke out some code into inline functions to be more Hotplug-friendly
o Increased the size of reserve for fallbacks from 10% to 12.5%.
Changelog since V7
o Updated to 2.6.11-rc4
o Lots of cleanups, mainly related to beancounters
o Fixed up a miscalculation in the bitmap size as pointed out by Mike Kravetz
(thanks Mike)
o Introduced a 10% reserve for fallbacks. Drastically reduces the number of
kernnorclm allocations that go to the wrong places
o Don't trigger OOM when large allocations are involved
Changelog since V6
o Updated to 2.6.11-rc2
o Minor change to allow prezeroing to be a cleaner looking patch
Changelog since V5
o Fixed up gcc-2.95 errors
o Fixed up whitespace damage
Changelog since V4
o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies
with offsets to 2.6.11-rc1-mm1
Changelog since V3
o inlined get_pageblock_type() and set_pageblock_type()
o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone()
o When taking from the global pool, do not scan all the low-order lists
Changelog since V2
o Do not interfere with the "min" decay
o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably
anything to do with asynchronous IO
Changelog since V1
o Update patch to 2.6.11-rc1
o Cleaned up bug where memory was wasted on a large bitmap
o Remove code that needed the binary buddy bitmaps
o Update flags to avoid colliding with __GFP_ZERO changes
o Extended fallback_count bean counters to show the fallback count for each
allocation type
o In-code documentation
Version 1
o Initial release against 2.6.9
This patch is designed to reduce fragmentation in the standard buddy allocator
without impairing the performance of the allocator. High fragmentation in
the standard binary buddy allocator means that high-order allocations can
rarely be serviced. This patch works by dividing allocations into three
different types (example call sites follow the list);
EasyReclaimable - These are userspace pages that are easily reclaimable. This
flag is set when it is known that the pages will be trivially reclaimed
by writing the page out to swap or syncing with backing storage
KernelReclaimable - These are pages allocated by the kernel that are easily
reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
These types of pages could potentially be reclaimed by dumping the
caches and reaping the slabs
KernelNonReclaimable - These are pages that are allocated by the kernel that
are not trivially reclaimed. For example, the memory allocated for a
loaded module would be in this category. By default, allocations are
considered to be of this type
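To make the three types concrete, here is how call sites are tagged by the
first patch in this series. The first two lines are taken from its changes to
mm/memory.c and fs/dcache.c; the last is an untouched caller shown for
contrast:

	/* Userspace page, trivially reclaimed by paging it out */
	page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM, vma, address);

	/* dcache object, reclaimable by pruning the dcache and reaping slabs */
	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_KERNRCLM);

	/* Neither flag set, treated as kernel non-reclaimable by default */
	page = alloc_pages(GFP_KERNEL, order);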
Instead of having one global MAX_ORDER-sized array of free lists, there
are four: one for each type of allocation and another reserved for
fallbacks.
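In code terms, the single free_area array in struct zone is replaced by a
two-dimensional array indexed by allocation type; sketching the end state
after patches 2-4 below (RCLM_FALLBACK is the fallback reserve added by
patch 4):

	#define RCLM_NORCLM	0
	#define RCLM_EASY	1
	#define RCLM_KERN	2
	#define RCLM_FALLBACK	3
	#define RCLM_TYPES	4

	struct zone {
		/* ... other fields unchanged ... */

		/* Replaces: struct free_area free_area[MAX_ORDER]; */
		struct free_area free_area_lists[RCLM_TYPES][MAX_ORDER];
	};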
Once a 2^MAX_ORDER block of pages is split for a type of allocation, it is
added to the free-lists for that type, in effect reserving it. Hence, over
time, pages of the different types can be clustered together. This means that
if a 2^MAX_ORDER block of pages were required, the system could linearly scan
a block of pages allocated for EasyReclaimable and page each of them out.
Fallback is used when there are no 2^MAX_ORDER pages available and there
are no free pages of the desired type. The fallback lists were chosen in a
way that keeps the most easily reclaimable pages together.
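The orderings chosen are encoded in the fallback_allocs table of patch 4
below. Each row lists the free lists searched for one allocation type,
terminated by RCLM_TYPES; every type tries the RCLM_FALLBACK reserve before
polluting another type's pool:

	int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
		{RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN,   RCLM_EASY, RCLM_TYPES},
		{RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
		{RCLM_KERN,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
	};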
Three benchmark results are included, all based on a 2.6.14-rc5-mm1 kernel
compiled with gcc 3.4 (it is known that gcc 2.95 produces different results).
The first is the output of portions of AIM9 for the vanilla allocator and
the modified one;
(Tests run with bench-aim9.sh from VMRegress 0.17)
2.6.14-rc5-mm1-clean
------------------------------------------------------------------------------------------------------------
Test Test Elapsed Iteration Iteration Operation
Number Name Time (sec) Count Rate (loops/sec) Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
1 creat-clo 60.04 961 16.00600 16006.00 File Creations and Closes/second
2 page_test 60.02 4149 69.12696 117515.83 System Allocations & Pages/second
3 brk_test 60.04 1555 25.89940 440289.81 System Memory Allocations/second
4 jmp_test 60.00 250768 4179.46667 4179466.67 Non-local gotos/second
5 signal_test 60.01 4849 80.80320 80803.20 Signal Traps/second
6 exec_test 60.00 741 12.35000 61.75 Program Loads/second
7 fork_test 60.06 797 13.27006 1327.01 Task Creations/second
8 link_test 60.01 5269 87.80203 5531.53 Link/Unlink Pairs/second
2.6.14-rc5-mm1-mbuddy-v19
------------------------------------------------------------------------------------------------------------
Test Test Elapsed Iteration Iteration Operation
Number Name Time (sec) Count Rate (loops/sec) Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
1 creat-clo 60.04 954 15.88941 15889.41 File Creations and Closes/second
2 page_test 60.01 4133 68.87185 117082.15 System Allocations & Pages/second
3 brk_test 60.02 1546 25.75808 437887.37 System Memory Allocations/second
4 jmp_test 60.00 250797 4179.95000 4179950.00 Non-local gotos/second
5 signal_test 60.01 5121 85.33578 85335.78 Signal Traps/second
6 exec_test 60.00 743 12.38333 61.92 Program Loads/second
7 fork_test 60.05 806 13.42215 1342.21 Task Creations/second
8 link_test 60.00 5291 88.18333 5555.55 Link/Unlink Pairs/second
Difference in performance operations report generated by diff-aim9.sh
Clean mbuddy-v19
---------- ----------
1 creat-clo 16006.00 15889.41 -116.59 -0.73% File Creations and Closes/second
2 page_test 117515.83 117082.15 -433.68 -0.37% System Allocations & Pages/second
3 brk_test 440289.81 437887.37 -2402.44 -0.55% System Memory Allocations/second
4 jmp_test 4179466.67 4179950.00 483.33 0.01% Non-local gotos/second
5 signal_test 80803.20 85335.78 4532.58 5.61% Signal Traps/second
6 exec_test 61.75 61.92 0.17 0.28% Program Loads/second
7 fork_test 1327.01 1342.21 15.20 1.15% Task Creations/second
8 link_test 5531.53 5555.55 24.02 0.43% Link/Unlink Pairs/second
In this test, there were small regressions in creat-clo, page_test and
brk_test. However, it is known that different kernel configurations, compilers
and even different runs show similar variances of +/- 3%.
The second benchmark tested the CPU cache usage to make sure it was not
getting clobbered. The test was to render a large postscript file 10 times
and take the average. The results are;
2.6.14-rc5-mm1-clean: Average: 43.254 real, 38.89 user, 0.042 sys
2.6.14-rc5-mm1-mbuddy-v19: Average: 43.212 real, 40.494 user, 0.044 sys
So there are no adverse cache effects. The last test is to show that the
allocator can satisfy more high-order allocations, especially under load,
than the standard allocator. The test performs the following;
1. Start updatedb running in the background
2. Load a kernel module that tries to allocate high-order blocks on demand
3. Clean a kernel tree
4. Make 6 copies of the tree. As each copy finishes, a compile starts at -j2
5. Start compiling the primary tree
6. Sleep 1 minute while the 7 trees are being compiled
7. Use the kernel module to attempt 160 times to allocate a 2^10 block of pages
- note, it only attempts 160 times, no matter how often it succeeds
- An allocation is attempted every 1/10th of a second
- Performance will suffer badly as this forces considerable amounts of
pageout
The results of the allocations under load (load averaging 18) were;
2.6.14-rc5-mm1 Clean
Order: 10
Allocation type: HighMem
Attempted allocations: 160
Success allocs: 30
Failed allocs: 130
DMA zone allocs: 0
Normal zone allocs: 7
HighMem zone allocs: 23
% Success: 18
2.6.14-rc5-mm1 MBuddy V19
Order: 10
Allocation type: HighMem
Attempted allocations: 160
Success allocs: 76
Failed allocs: 84
DMA zone allocs: 1
Normal zone allocs: 30
HighMem zone allocs: 45
% Success: 47
One thing that had to be changed for the 2.6.14-rc5-mm1 clean test was to
disable the OOM killer. During one test run, the clean kernel achieved better
results but invoked the OOM killer a very large number of times to do so. The
patched kernel with the placement policy never invoked the OOM killer.
The above results are not very dramatic, but the effect is very noticeable
when the system is at rest after the test completes. After the test, the
standard allocator was able to allocate 45 order-10 pages while the modified
allocator allocated 159. The ability to allocate large pages under load
depends heavily on the decisions of kswapd, so there can be large variances
in results, but that is a separate problem. It is also known that the success
of large allocations is dependent on the location of per-cpu pages, but
fixing that is a separate issue.
The results show that the modified allocator has comparable speed and no
adverse cache effects, but leaves memory far less fragmented and in a better
position to satisfy high-order allocations.
--
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab
* [PATCH 1/7] Fragmentation Avoidance V19: 001_antidefrag_flags
From: Mel Gorman @ 2005-10-30 18:34 UTC
To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman

This patch adds two flags, __GFP_EASYRCLM and __GFP_KERNRCLM, that are used
to track the type of allocation the caller is making. Allocations using the
__GFP_EASYRCLM flag are expected to be easily reclaimed by syncing with
backing storage (be it a file or swap) or by cleaning the buffers and
discarding. Allocations using the __GFP_KERNRCLM flag belong to slab caches
that can be shrunk by the kernel.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/buffer.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/buffer.c
--- linux-2.6.14-rc5-mm1-clean/fs/buffer.c      2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/buffer.c      2005-10-30 13:34:50.000000000 +0000
@@ -1119,7 +1119,8 @@ grow_dev_page(struct block_device *bdev,
        struct page *page;
        struct buffer_head *bh;

-       page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+       page = find_or_create_page(inode->i_mapping, index,
+                               GFP_NOFS|__GFP_EASYRCLM);
        if (!page)
                return NULL;
@@ -3058,7 +3059,8 @@ static void recalc_bh_state(void)

 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 {
-       struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+       struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+                                               gfp_flags|__GFP_KERNRCLM);
        if (ret) {
                get_cpu_var(bh_accounting).nr++;
                recalc_bh_state();
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/compat.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/compat.c
--- linux-2.6.14-rc5-mm1-clean/fs/compat.c      2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/compat.c       2005-10-30 13:34:50.000000000 +0000
@@ -1363,7 +1363,7 @@ static int compat_copy_strings(int argc,
                page = bprm->page[i];
                new = 0;
                if (!page) {
-                       page = alloc_page(GFP_HIGHUSER);
+                       page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
                        bprm->page[i] = page;
                        if (!page) {
                                ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/dcache.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/dcache.c
--- linux-2.6.14-rc5-mm1-clean/fs/dcache.c      2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/dcache.c       2005-10-30 13:34:50.000000000 +0000
@@ -878,7 +878,7 @@ struct dentry *d_alloc(struct dentry * p
        struct dentry *dentry;
        char *dname;

-       dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+       dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_KERNRCLM);
        if (!dentry)
                return NULL;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/exec.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/exec.c
--- linux-2.6.14-rc5-mm1-clean/fs/exec.c        2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/exec.c 2005-10-30 13:34:50.000000000 +0000
@@ -237,7 +237,7 @@ static int copy_strings(int argc, char _
                page = bprm->page[i];
                new = 0;
                if (!page) {
-                       page = alloc_page(GFP_HIGHUSER);
+                       page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
                        bprm->page[i] = page;
                        if (!page) {
                                ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ext2/super.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext2/super.c
--- linux-2.6.14-rc5-mm1-clean/fs/ext2/super.c  2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext2/super.c   2005-10-30 13:34:50.000000000 +0000
@@ -141,7 +141,8 @@ static kmem_cache_t * ext2_inode_cachep;
 static struct inode *ext2_alloc_inode(struct super_block *sb)
 {
        struct ext2_inode_info *ei;
-       ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+       ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
+                                               SLAB_KERNEL|__GFP_KERNRCLM);
        if (!ei)
                return NULL;
 #ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ext3/super.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext3/super.c
--- linux-2.6.14-rc5-mm1-clean/fs/ext3/super.c  2005-10-30 13:20:00.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext3/super.c   2005-10-30 13:34:50.000000000 +0000
@@ -444,7 +444,7 @@ static struct inode *ext3_alloc_inode(st
 {
        struct ext3_inode_info *ei;

-       ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+       ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
        if (!ei)
                return NULL;
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/inode.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/inode.c
--- linux-2.6.14-rc5-mm1-clean/fs/inode.c       2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/inode.c        2005-10-30 13:34:50.000000000 +0000
@@ -146,7 +146,7 @@ static struct inode *alloc_inode(struct
                mapping->a_ops = &empty_aops;
                mapping->host = inode;
                mapping->flags = 0;
-               mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+               mapping_set_gfp_mask(mapping, GFP_HIGHUSER|__GFP_EASYRCLM);
                mapping->assoc_mapping = NULL;
                mapping->backing_dev_info = &default_backing_dev_info;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ntfs/inode.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ntfs/inode.c
--- linux-2.6.14-rc5-mm1-clean/fs/ntfs/inode.c  2005-10-30 13:20:01.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ntfs/inode.c   2005-10-30 13:34:50.000000000 +0000
@@ -318,7 +318,7 @@ struct inode *ntfs_alloc_big_inode(struc
        ntfs_inode *ni;

        ntfs_debug("Entering.");
-       ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS);
+       ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
        if (likely(ni != NULL)) {
                ni->state = 0;
                return VFS_I(ni);
@@ -343,7 +343,7 @@ static inline ntfs_inode *ntfs_alloc_ext
        ntfs_inode *ni;

        ntfs_debug("Entering.");
-       ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+       ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
        if (likely(ni != NULL)) {
                ni->state = 0;
                return ni;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/asm-i386/page.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/asm-i386/page.h
--- linux-2.6.14-rc5-mm1-clean/include/asm-i386/page.h  2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/asm-i386/page.h   2005-10-30 13:34:50.000000000 +0000
@@ -36,7 +36,8 @@
 #define clear_user_page(page, vaddr, pg)       clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)    copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+       alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | __GFP_EASYRCLM, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

 /*
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/linux/gfp.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/gfp.h
--- linux-2.6.14-rc5-mm1-clean/include/linux/gfp.h      2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/gfp.h       2005-10-30 13:34:50.000000000 +0000
@@ -50,14 +50,27 @@ struct vm_area_struct;
 #define __GFP_HARDWALL   0x40000u /* Enforce hardwall cpuset memory allocs */
 #define __GFP_VALID    0x80000000u /* valid GFP flags */

-#define __GFP_BITS_SHIFT 20    /* Room for 20 __GFP_FOO bits */
+/*
+ * Allocation type modifiers, these are required to be adjacent
+ * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
+ * __GFP_KERNRCLM: Short-lived or reclaimable kernel allocation
+ * Both bits off: Kernel non-reclaimable or very hard to reclaim
+ * __GFP_EASYRCLM and __GFP_KERNRCLM should not be specified at the same time
+ * RCLM_SHIFT (defined elsewhere) depends on the location of these bits
+ */
+#define __GFP_EASYRCLM 0x80000u  /* User and other easily reclaimed pages */
+#define __GFP_KERNRCLM 0x100000u /* Kernel page that is reclaimable */
+#define __GFP_RCLM_BITS (__GFP_EASYRCLM|__GFP_KERNRCLM)
+
+#define __GFP_BITS_SHIFT 21    /* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)

 /* if you forget to add the bitmask here kernel will crash, period */
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
                        __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
                        __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-                       __GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL)
+                       __GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL| \
+                       __GFP_EASYRCLM|__GFP_KERNRCLM)

 #define GFP_ATOMIC     (__GFP_VALID | __GFP_HIGH)
 #define GFP_NOIO       (__GFP_VALID | __GFP_WAIT)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/linux/highmem.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/highmem.h
--- linux-2.6.14-rc5-mm1-clean/include/linux/highmem.h  2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/highmem.h   2005-10-30 13:34:50.000000000 +0000
@@ -47,7 +47,8 @@ static inline void clear_user_highpage(s
 static inline struct page *
 alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
 {
-       struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+       struct page *page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+                                               vma, vaddr);

        if (page)
                clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/memory.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/memory.c
--- linux-2.6.14-rc5-mm1-clean/mm/memory.c      2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/memory.c       2005-10-30 13:34:50.000000000 +0000
@@ -1295,7 +1295,8 @@ static int do_wp_page(struct mm_struct *
                if (!new_page)
                        goto oom;
        } else {
-               new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+               new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+                                               vma, address);
                if (!new_page)
                        goto oom;
                copy_user_highpage(new_page, old_page, address);
@@ -1858,7 +1859,8 @@ retry:

                if (unlikely(anon_vma_prepare(vma)))
                        goto oom;
-               page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+               page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+                                               vma, address);
                if (!page)
                        goto oom;
                copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/shmem.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/shmem.c
--- linux-2.6.14-rc5-mm1-clean/mm/shmem.c       2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/shmem.c        2005-10-30 13:34:50.000000000 +0000
@@ -906,7 +906,7 @@ shmem_alloc_page(unsigned long gfp, stru
        pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
        pvma.vm_pgoff = idx;
        pvma.vm_end = PAGE_SIZE;
-       page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
+       page = alloc_page_vma(gfp | __GFP_ZERO | __GFP_EASYRCLM, &pvma, 0);
        mpol_free(pvma.vm_policy);
        return page;
 }
@@ -921,7 +921,7 @@ shmem_swapin(struct shmem_inode_info *in
 static inline struct page *
 shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
 {
-       return alloc_page(gfp | __GFP_ZERO);
+       return alloc_page(gfp | __GFP_ZERO | __GFP_EASYRCLM);
 }
 #endif
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/swap_state.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/swap_state.c
--- linux-2.6.14-rc5-mm1-clean/mm/swap_state.c  2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/swap_state.c   2005-10-30 13:34:50.000000000 +0000
@@ -341,7 +341,8 @@ struct page *read_swap_cache_async(swp_e
                 * Get a new page to read into from swap.
                 */
                if (!new_page) {
-                       new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+                       new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+                                                       vma, addr);
                        if (!new_page)
                                break;          /* Out of memory */
                }
* [PATCH 2/7] Fragmentation Avoidance V19: 002_usemap
From: Mel Gorman @ 2005-10-30 18:34 UTC
To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel

This patch adds a "usemap" to the allocator. When a PAGES_PER_MAXORDER block
of pages (i.e. 2^(MAX_ORDER-1) pages) is split, the usemap records what type
of allocation the block was split for. This information is used in an
anti-fragmentation patch to group related allocation types together.

The __GFP_EASYRCLM and __GFP_KERNRCLM bits are used to enumerate three
allocation types;

RCLM_NORCLM: These are kernel allocations that cannot be reclaimed on demand.
RCLM_EASY: These are pages allocated with the __GFP_EASYRCLM flag set. They
        are considered to be user and other easily reclaimed pages such as
        buffers.
RCLM_KERN: Allocated for the kernel, but for caches that can be reclaimed on
        demand.

gfpflags_to_rclmtype() converts gfp_flags to their corresponding RCLM_TYPE
by masking out irrelevant bits and shifting the result right by RCLM_SHIFT.
Compile-time checks are made on RCLM_SHIFT to ensure gfpflags_to_rclmtype()
keeps working. ffz() could be used to avoid the static checks, but it would
add runtime overhead for what is a compile-time constant.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mm.h linux-2.6.14-rc5-mm1-002_usemap/include/linux/mm.h
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mm.h        2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/include/linux/mm.h  2005-10-30 13:35:31.000000000 +0000
@@ -529,6 +529,12 @@ static inline void set_page_links(struct
 extern struct page *mem_map;
 #endif

+/*
+ * Return what type of page this 2^(MAX_ORDER-1) block of pages is being
+ * used for. Return value is one of the RCLM_X types
+ */
+extern int get_pageblock_type(struct zone *zone, struct page *page);
+
 static inline void *lowmem_page_address(struct page *page)
 {
        return __va(page_to_pfn(page) << PAGE_SHIFT);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mmzone.h linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mmzone.h    2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h      2005-10-30 13:35:31.000000000 +0000
@@ -21,6 +21,17 @@
 #else
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
+#define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))
+
+/*
+ * The two bit field __GFP_RECLAIMBITS enumerates the following types of
+ * page reclaimability.
+ */
+#define RCLM_NORCLM 0
+#define RCLM_EASY 1
+#define RCLM_KERN 2
+#define RCLM_TYPES 3
+#define BITS_PER_RCLM_TYPE 2

 struct free_area {
        struct list_head free_list;
@@ -146,6 +157,13 @@ struct zone {
 #endif
        struct free_area free_area[MAX_ORDER];

+#ifndef CONFIG_SPARSEMEM
+       /*
+        * The map tracks what each 2^MAX_ORDER-1 sized block is being used for.
+        * Each PAGES_PER_MAXORDER block of pages use BITS_PER_RCLM_TYPE bits
+        */
+       unsigned long *free_area_usemap;
+#endif

        ZONE_PADDING(_pad1_)

@@ -501,9 +519,14 @@ extern struct pglist_data contig_page_da
 #define PAGES_PER_SECTION       (1UL << PFN_SECTION_SHIFT)
 #define PAGE_SECTION_MASK      (~(PAGES_PER_SECTION-1))

+#define FREE_AREA_BITS 64
+
 #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
 #endif
+#if ((SECTION_SIZE_BITS - MAX_ORDER) * BITS_PER_RCLM_TYPE) > FREE_AREA_BITS
+#error free_area_usemap is not big enough
+#endif

 struct page;
 struct mem_section {
@@ -516,6 +539,7 @@ struct mem_section {
         * before using it wrong.
         */
        unsigned long section_mem_map;
+       DECLARE_BITMAP(free_area_usemap, FREE_AREA_BITS);
 };

 #ifdef CONFIG_SPARSEMEM_EXTREME
@@ -584,6 +608,18 @@ static inline struct mem_section *__pfn_
        return __nr_to_section(pfn_to_section_nr(pfn));
 }

+static inline unsigned long *pfn_to_usemap(struct zone *zone,
+                                               unsigned long pfn)
+{
+       return &__pfn_to_section(pfn)->free_area_usemap[0];
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+       pfn &= (PAGES_PER_SECTION-1);
+       return (pfn >> (MAX_ORDER-1)) * BITS_PER_RCLM_TYPE;
+}
+
 #define pfn_to_page(pfn)       \
 ({     \
        unsigned long __pfn = (pfn);    \
@@ -621,6 +657,17 @@ void sparse_init(void);
 #else
 #define sparse_init()  do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
+static inline unsigned long *pfn_to_usemap(struct zone *zone,
+                                               unsigned long pfn)
+{
+       return zone->free_area_usemap;
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+       pfn = pfn - zone->zone_start_pfn;
+       return (pfn >> (MAX_ORDER-1)) * BITS_PER_RCLM_TYPE;
+}
 #endif /* CONFIG_SPARSEMEM */

 #ifdef CONFIG_NODES_SPAN_OTHER_NODES
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/page_alloc.c linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/page_alloc.c   2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c     2005-10-30 13:35:31.000000000 +0000
@@ -69,6 +69,99 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
 EXPORT_SYMBOL(totalram_pages);

 /*
+ * RCLM_SHIFT is the number of bits that a gfp_mask has to be shifted right
+ * to have just the __GFP_EASYRCLM and __GFP_KERNRCLM bits. The static check
+ * is made afterwards in case the GFP flags are not updated without updating
+ * this number
+ */
+#define RCLM_SHIFT 19
+#if (__GFP_EASYRCLM >> RCLM_SHIFT) != RCLM_EASY
+#error __GFP_EASYRCLM not mapping to RCLM_EASY
+#endif
+#if (__GFP_KERNRCLM >> RCLM_SHIFT) != RCLM_KERN
+#error __GFP_KERNRCLM not mapping to RCLM_KERN
+#endif
+
+/*
+ * This function maps gfpflags to their RCLM_TYPE. It makes assumptions
+ * on the location of the GFP flags.
+ */
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+       unsigned long rclmbits = gfp_flags & __GFP_RCLM_BITS;
+
+       /* Specifying both RCLM flags makes no sense */
+       if (unlikely(rclmbits == __GFP_RCLM_BITS)) {
+               printk(KERN_WARNING "Multiple RCLM GFP flags specified\n");
+               dump_stack();
+               return RCLM_TYPES;
+       }
+
+       return rclmbits >> RCLM_SHIFT;
+}
+
+/*
+ * copy_bits - Copy bits between bitmaps
+ * @dstaddr: The destination bitmap to copy to
+ * @srcaddr: The source bitmap to copy from
+ * @sindex_dst: The start bit index within the destination map to copy to
+ * @sindex_src: The start bit index within the source map to copy from
+ * @nr: The number of bits to copy
+ *
+ * Note that this method is slow and makes no guarantees for atomicity.
+ * It depends on being called with the zone spinlock held to ensure data
+ * safety
+ */
+static inline void copy_bits(unsigned long *dstaddr,
+               unsigned long *srcaddr,
+               int sindex_dst,
+               int sindex_src,
+               int nr)
+{
+       /*
+        * Written like this to take advantage of arch-specific
+        * set_bit() and clear_bit() functions
+        */
+       for (nr = nr - 1; nr >= 0; nr--) {
+               int bit = test_bit(sindex_src + nr, srcaddr);
+               if (bit)
+                       set_bit(sindex_dst + nr, dstaddr);
+               else
+                       clear_bit(sindex_dst + nr, dstaddr);
+       }
+}
+
+int get_pageblock_type(struct zone *zone, struct page *page)
+{
+       unsigned long pfn = page_to_pfn(page);
+       unsigned long type = 0;
+       unsigned long *usemap;
+       int bitidx;
+
+       bitidx = pfn_to_bitidx(zone, pfn);
+       usemap = pfn_to_usemap(zone, pfn);
+
+       copy_bits(&type, usemap, 0, bitidx, BITS_PER_RCLM_TYPE);
+
+       return type;
+}
+
+/* Reserve a block of pages for an allocation type */
+static inline void set_pageblock_type(struct zone *zone, struct page *page,
+                                       int type)
+{
+       unsigned long pfn = page_to_pfn(page);
+       unsigned long *usemap;
+       unsigned long ltype = type;
+       int bitidx;
+
+       bitidx = pfn_to_bitidx(zone, pfn);
+       usemap = pfn_to_usemap(zone, pfn);
+
+       copy_bits(usemap, &ltype, bitidx, 0, BITS_PER_RCLM_TYPE);
+}
+
+/*
  * Used by page_zone() to look up the address of the struct zone whose
  * id is encoded in the upper bits of page->flags
@@ -498,7 +591,8 @@ static void prep_new_page(struct page *p
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order,
+               int alloctype)
 {
        struct free_area * area;
        unsigned int current_order;
@@ -514,6 +608,14 @@ static struct page *__rmqueue(struct zon
                rmv_page_order(page);
                area->nr_free--;
                zone->free_pages -= 1UL << order;
+
+               /*
+                * If splitting a large block, record what the block is being
+                * used for in the usemap
+                */
+               if (current_order == MAX_ORDER-1)
+                       set_pageblock_type(zone, page, alloctype);
+
                return expand(zone, page, order, current_order, area);
        }

@@ -526,7 +628,8 @@ static struct page *__rmqueue(struct zon
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-                       unsigned long count, struct list_head *list)
+                       unsigned long count, struct list_head *list,
+                       int alloctype)
 {
        unsigned long flags;
        int i;
@@ -535,7 +638,7 @@ static int rmqueue_bulk(struct zon
        spin_lock_irqsave(&zone->lock, flags);
        for (i = 0; i < count; ++i) {
-               page = __rmqueue(zone, order);
+               page = __rmqueue(zone, order, alloctype);
                if (page == NULL)
                        break;
                allocated++;
@@ -719,6 +822,11 @@ buffered_rmqueue(struct zone *zone, int
        unsigned long flags;
        struct page *page = NULL;
        int cold = !!(gfp_flags & __GFP_COLD);
+       int alloctype = gfpflags_to_rclmtype(gfp_flags);
+
+       /* If the alloctype is RCLM_TYPES, the gfp_flags make no sense */
+       if (alloctype == RCLM_TYPES)
+               return NULL;

        if (order == 0) {
                struct per_cpu_pages *pcp;
@@ -727,7 +835,8 @@ buffered_rmqueue(struct zone *zone, int
                local_irq_save(flags);
                if (pcp->count <= pcp->low)
                        pcp->count += rmqueue_bulk(zone, 0,
-                                               pcp->batch, &pcp->list);
+                                               pcp->batch, &pcp->list,
+                                               alloctype);
                if (pcp->count) {
                        page = list_entry(pcp->list.next, struct page, lru);
                        list_del(&page->lru);
@@ -739,7 +848,7 @@ buffered_rmqueue(struct zone *zone, int

        if (page == NULL) {
                spin_lock_irqsave(&zone->lock, flags);
-               page = __rmqueue(zone, order);
+               page = __rmqueue(zone, order, alloctype);
                spin_unlock_irqrestore(&zone->lock, flags);
        }

@@ -1866,6 +1975,38 @@ inline void setup_pageset(struct per_cpu
        INIT_LIST_HEAD(&pcp->list);
 }

+#ifndef CONFIG_SPARSEMEM
+#define roundup(x, y) ((((x)+((y)-1))/(y))*(y))
+/*
+ * Calculate the size of the zone->usemap in bytes rounded to an unsigned long
+ * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
+ * Then figure 1 RCLM_TYPE worth of bits per MAX_ORDER-1, finally round up
+ * what is now in bits to nearest long in bits, then return it in bytes.
+ */
+static unsigned long __init usemap_size(unsigned long zonesize)
+{
+       unsigned long usemapsize;
+
+       usemapsize = roundup(zonesize, PAGES_PER_MAXORDER);
+       usemapsize = usemapsize >> (MAX_ORDER-1);
+       usemapsize *= BITS_PER_RCLM_TYPE;
+       usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
+
+       return usemapsize / 8;
+}
+
+static void __init setup_usemap(struct pglist_data *pgdat,
+                               struct zone *zone, unsigned long zonesize)
+{
+       unsigned long usemapsize = usemap_size(zonesize);
+       zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+       memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+}
+#else
+static void inline setup_usemap(struct pglist_data *pgdat,
+                               struct zone *zone, unsigned long zonesize) {}
+#endif /* CONFIG_SPARSEMEM */
+
 #ifdef CONFIG_NUMA
 /*
  * Boot pageset table. One per cpu which is going to be used for all
@@ -2079,6 +2220,7 @@ static void __init free_area_init_core(s
                zonetable_add(zone, nid, j, zone_start_pfn, size);
                init_currently_empty_zone(zone, zone_start_pfn, size);
                zone_start_pfn += size;
+               setup_usemap(pgdat, zone, size);
        }
 }
* [PATCH 3/7] Fragmentation Avoidance V19: 003_fragcore
From: Mel Gorman @ 2005-10-30 18:34 UTC
To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman

This patch adds the core of the anti-fragmentation strategy. It works by
grouping related allocation types together. The idea is that large groups of
pages that may be reclaimed are placed near each other. The zone->free_area
list is broken into three free lists, one for each RCLM_TYPE.

The following section of the patch looks superfluous, but it is needed to
suppress a compiler warning. Suggestions to make this better looking are
welcome.

-       struct free_area * area;
+       struct free_area * area = NULL;

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h      2005-10-30 13:35:31.000000000 +0000
+++ linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h    2005-10-30 13:36:16.000000000 +0000
@@ -33,6 +33,10 @@
 #define RCLM_TYPES 3
 #define BITS_PER_RCLM_TYPE 2

+#define for_each_rclmtype_order(type, order) \
+       for (order = 0; order < MAX_ORDER; order++) \
+               for (type = 0; type < RCLM_TYPES; type++)
+
 struct free_area {
        struct list_head free_list;
        unsigned long nr_free;
@@ -155,7 +159,6 @@ struct zone {
        /* see spanned/present_pages for more description */
        seqlock_t span_seqlock;
 #endif
-       struct free_area free_area[MAX_ORDER];

 #ifndef CONFIG_SPARSEMEM
        /*
@@ -165,6 +168,8 @@ struct zone {
        unsigned long *free_area_usemap;
 #endif

+       struct free_area free_area_lists[RCLM_TYPES][MAX_ORDER];
+
        ZONE_PADDING(_pad1_)

        /* Fields commonly accessed by the page reclaim scanner */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c     2005-10-30 13:35:31.000000000 +0000
+++ linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c   2005-10-30 13:36:16.000000000 +0000
@@ -352,6 +352,15 @@ __find_combined_index(unsigned long page
 }

 /*
+ * Return the free list for a given page within a zone
+ */
+static inline struct free_area *__page_find_freelist(struct zone *zone,
+                                                       struct page *page)
+{
+       return zone->free_area_lists[get_pageblock_type(zone, page)];
+}
+
+/*
  * This function checks whether a page is free && is the buddy
  * we can do coalesce a page and its buddy if
  * (a) the buddy is free &&
@@ -398,6 +407,8 @@ static inline void __free_pages_bulk (st
 {
        unsigned long page_idx;
        int order_size = 1 << order;
+       struct free_area *area;
+       struct free_area *freelist;

        if (unlikely(order))
                destroy_compound_page(page, order);
@@ -407,10 +418,11 @@ static inline void __free_pages_bulk (st
        BUG_ON(page_idx & (order_size - 1));
        BUG_ON(bad_range(zone, page));

+       freelist = __page_find_freelist(zone, page);
+
        zone->free_pages += order_size;
        while (order < MAX_ORDER-1) {
                unsigned long combined_idx;
-               struct free_area *area;
                struct page *buddy;

                combined_idx = __find_combined_index(page_idx, order);
@@ -421,7 +433,7 @@ static inline void __free_pages_bulk (st
                if (!page_is_buddy(buddy, order))
                        break;          /* Move the buddy up one level. */
                list_del(&buddy->lru);
-               area = zone->free_area + order;
+               area = &freelist[order];
                area->nr_free--;
                rmv_page_order(buddy);
                page = page + (combined_idx - page_idx);
@@ -429,8 +441,8 @@ static inline void __free_pages_bulk (st
                order++;
        }
        set_page_order(page, order);
-       list_add(&page->lru, &zone->free_area[order].free_list);
-       zone->free_area[order].nr_free++;
+       list_add_tail(&page->lru, &freelist[order].free_list);
+       freelist[order].nr_free++;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -587,6 +599,45 @@ static void prep_new_page(struct page *p
        kernel_map_pages(page, 1 << order, 1);
 }

+/*
+ * Find a list that has a 2^MAX_ORDER-1 block of pages available and
+ * return it
+ */
+struct page *steal_maxorder_block(struct zone *zone, int alloctype)
+{
+       struct page *page;
+       struct free_area *area = NULL;
+       int i;
+
+       for(i = 0; i < RCLM_TYPES; i++) {
+               if (i == alloctype)
+                       continue;
+
+               area = &zone->free_area_lists[i][MAX_ORDER-1];
+               if (!list_empty(&area->free_list))
+                       break;
+       }
+       if (i == RCLM_TYPES)
+               return NULL;
+
+       page = list_entry(area->free_list.next, struct page, lru);
+       area->nr_free--;
+
+       set_pageblock_type(zone, page, alloctype);
+
+       return page;
+}
+
+static inline struct page *
+remove_page(struct zone *zone, struct page *page, unsigned int order,
+               unsigned int current_order, struct free_area *area)
+{
+       list_del(&page->lru);
+       rmv_page_order(page);
+       zone->free_pages -= 1UL << order;
+       return expand(zone, page, order, current_order, area);
+}
+
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
@@ -594,31 +645,25 @@ static void prep_new_page(struct page *p
 static struct page *__rmqueue(struct zone *zone, unsigned int order,
                int alloctype)
 {
-       struct free_area * area;
+       struct free_area * area = NULL;
        unsigned int current_order;
        struct page *page;

        for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-               area = zone->free_area + current_order;
+               area = &zone->free_area_lists[alloctype][current_order];
                if (list_empty(&area->free_list))
                        continue;

                page = list_entry(area->free_list.next, struct page, lru);
-               list_del(&page->lru);
-               rmv_page_order(page);
                area->nr_free--;
-               zone->free_pages -= 1UL << order;
-
-               /*
-                * If splitting a large block, record what the block is being
-                * used for in the usemap
-                */
-               if (current_order == MAX_ORDER-1)
-                       set_pageblock_type(zone, page, alloctype);
-
-               return expand(zone, page, order, current_order, area);
+               return remove_page(zone, page, order, current_order, area);
        }

+       /* Allocate a MAX_ORDER block */
+       page = steal_maxorder_block(zone, alloctype);
+       if (page != NULL)
+               return remove_page(zone, page, order, MAX_ORDER-1, area);
+
        return NULL;
 }

@@ -704,9 +749,9 @@ static void __drain_pages(unsigned int c
 void mark_free_pages(struct zone *zone)
 {
        unsigned long zone_pfn, flags;
-       int order;
+       int order, t;
+       unsigned long start_pfn, i;
        struct list_head *curr;
-
        if (!zone->spanned_pages)
                return;

@@ -714,14 +759,12 @@ void mark_free_pages(struct zone *zone)
        for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
                ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));

-       for (order = MAX_ORDER - 1; order >= 0; --order)
-               list_for_each(curr, &zone->free_area[order].free_list) {
-                       unsigned long start_pfn, i;
-
+       for_each_rclmtype_order(t, order) {
+               list_for_each(curr,&zone->free_area_lists[t][order].free_list) {
                        start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
-
                        for (i=0; i < (1<<order); i++)
                                SetPageNosaveFree(pfn_to_page(start_pfn+i));
+               }
        }
        spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -876,6 +919,7 @@ int zone_watermark_ok(struct zone *z, in
        /* free_pages my go negative - that's OK */
        long min = mark, free_pages = z->free_pages - (1 << order) + 1;
        int o;
+       struct free_area *kernnorclm, *kernrclm, *easyrclm;

        if (gfp_high)
                min -= min / 2;
@@ -884,15 +928,22 @@ int zone_watermark_ok(struct zone *z, in

        if (free_pages <= min + z->lowmem_reserve[classzone_idx])
                goto out_failed;
+       kernnorclm = z->free_area_lists[RCLM_NORCLM];
+       easyrclm = z->free_area_lists[RCLM_EASY];
+       kernrclm = z->free_area_lists[RCLM_KERN];
        for (o = 0; o < order; o++) {
                /* At the next order, this order's pages become unavailable */
-               free_pages -= z->free_area[o].nr_free << o;
+               free_pages -= (kernnorclm->nr_free + kernrclm->nr_free +
+                               easyrclm->nr_free) << o;

                /* Require fewer higher order pages to be free */
                min >>= 1;

                if (free_pages <= min)
                        goto out_failed;
+               kernnorclm++;
+               easyrclm++;
+               kernrclm++;
        }

        return 1;
@@ -1496,6 +1547,7 @@ void show_free_areas(void)
        unsigned long inactive;
        unsigned long free;
        struct zone *zone;
+       int type;

        for_each_zone(zone) {
                show_node(zone);
@@ -1575,7 +1627,9 @@ void show_free_areas(void)
        }

        for_each_zone(zone) {
-               unsigned long nr, flags, order, total = 0;
+               unsigned long nr = 0;
+               unsigned long total = 0;
+               unsigned long flags,order;

                show_node(zone);
                printk("%s: ", zone->name);
@@ -1585,10 +1639,18 @@ void show_free_areas(void)
                }

                spin_lock_irqsave(&zone->lock, flags);
-               for (order = 0; order < MAX_ORDER; order++) {
-                       nr = zone->free_area[order].nr_free;
+               for_each_rclmtype_order(type, order) {
+                       nr += zone->free_area_lists[type][order].nr_free;
                        total += nr << order;
-                       printk("%lu*%lukB ", nr, K(1UL) << order);
+
+                       /*
+                        * If type had reached RCLM_TYPE, the free pages
+                        * for this order have been summed up
+                        */
+                       if (type == RCLM_TYPES-1) {
+                               printk("%lu*%lukB ", nr, K(1UL) << order);
+                               nr = 0;
+                       }
                }
                spin_unlock_irqrestore(&zone->lock, flags);
                printk("= %lukB\n", K(total));
@@ -1899,9 +1961,14 @@ void zone_init_free_lists(struct pglist_
                                unsigned long size)
 {
        int order;
-       for (order = 0; order < MAX_ORDER ; order++) {
-               INIT_LIST_HEAD(&zone->free_area[order].free_list);
-               zone->free_area[order].nr_free = 0;
+       int type;
+       struct free_area *area;
+
+       /* Initialse the three size ordered lists of free_areas */
+       for_each_rclmtype_order(type, order) {
+               area = &(zone->free_area_lists[type][order]);
+               INIT_LIST_HEAD(&area->free_list);
+               area->nr_free = 0;
        }
 }

@@ -2314,16 +2381,26 @@ static int frag_show(struct seq_file *m,
        struct zone *zone;
        struct zone *node_zones = pgdat->node_zones;
        unsigned long flags;
-       int order;
+       int order, t;
+       struct free_area *area;
+       unsigned long nr_bufs = 0;

        for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
                if (!zone->present_pages)
                        continue;

                spin_lock_irqsave(&zone->lock, flags);
-               seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
-               for (order = 0; order < MAX_ORDER; ++order)
-                       seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+               seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+               for_each_rclmtype_order(t, order) {
+                       area = &(zone->free_area_lists[t][order]);
+                       nr_bufs += area->nr_free;
+
+                       if (t == RCLM_TYPES-1) {
+                               seq_printf(m, "%6lu ", nr_bufs);
+                               nr_bufs = 0;
+                       }
+               }
+
                spin_unlock_irqrestore(&zone->lock, flags);
                seq_putc(m, '\n');
        }
* [PATCH 4/7] Fragmentation Avoidance V19: 004_fallback
From: Mel Gorman @ 2005-10-30 18:34 UTC
To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel

This patch implements fallback logic. In the event there are no
2^(MAX_ORDER-1) blocks of pages left, it helps the system decide what list
to use. The highlights of the patch are;

o Define a RCLM_FALLBACK type for fallbacks
o Use a percentage of each zone for fallbacks. When a reserved pool of pages
  is depleted, it will try and use RCLM_FALLBACK before using anything else.
  This greatly reduces the amount of fallbacks causing fragmentation without
  needing complex balancing algorithms
o Add a fallback_reserve that records how much of the zone is currently used
  for allocations falling back to RCLM_FALLBACK
o Add a fallback_allocs[] array that determines the order in which free lists
  are used for each allocation type

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h linux-2.6.14-rc5-mm1-004_fallback/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h    2005-10-30 13:36:16.000000000 +0000
+++ linux-2.6.14-rc5-mm1-004_fallback/include/linux/mmzone.h    2005-10-30 13:36:56.000000000 +0000
@@ -30,7 +30,8 @@
 #define RCLM_NORCLM 0
 #define RCLM_EASY 1
 #define RCLM_KERN 2
-#define RCLM_TYPES 3
+#define RCLM_FALLBACK 3
+#define RCLM_TYPES 4
 #define BITS_PER_RCLM_TYPE 2

 #define for_each_rclmtype_order(type, order) \
@@ -168,8 +169,17 @@ struct zone {
        unsigned long *free_area_usemap;
 #endif

+       /*
+        * With allocation fallbacks, the nr_free count for each RCLM_TYPE must
+        * be added together to get the correct count of free pages for a given
+        * order. Individually, the nr_free count in a free_area may not match
+        * the number of pages in the free_list.
+        */
        struct free_area free_area_lists[RCLM_TYPES][MAX_ORDER];

+       /* Number of pages currently used for RCLM_FALLBACK */
+       unsigned long fallback_reserve;
+
        ZONE_PADDING(_pad1_)

        /* Fields commonly accessed by the page reclaim scanner */
@@ -292,6 +302,17 @@ struct zonelist {
        struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
 };

+static inline void inc_reserve_count(struct zone *zone, int type)
+{
+       if (type == RCLM_FALLBACK)
+               zone->fallback_reserve += PAGES_PER_MAXORDER;
+}
+
+static inline void dec_reserve_count(struct zone *zone, int type)
+{
+       if (type == RCLM_FALLBACK && zone->fallback_reserve)
+               zone->fallback_reserve -= PAGES_PER_MAXORDER;
+}

 /*
  * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c   2005-10-30 13:36:16.000000000 +0000
+++ linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c   2005-10-30 13:36:56.000000000 +0000
@@ -54,6 +54,22 @@ unsigned long totalhigh_pages __read_mos
 long nr_swap_pages;

 /*
+ * fallback_allocs contains the fallback types for low memory conditions
+ * where the preferred alloction type if not available.
+ */
+int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
+       {RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN, RCLM_EASY, RCLM_TYPES},
+       {RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
+       {RCLM_KERN, RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
+};
+
+/* Returns 1 if the needed percentage of the zone is reserved for fallbacks */
+static inline int min_fallback_reserved(struct zone *zone)
+{
+       return zone->fallback_reserve >= zone->present_pages >> 3;
+}
+
+/*
  * results with 256, 32 in the lowmem_reserve sysctl:
  * 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
  * 1G machine -> (16M dma, 784M normal, 224M high)
@@ -623,7 +639,12 @@ struct page *steal_maxorder_block(struct
        page = list_entry(area->free_list.next, struct page, lru);
        area->nr_free--;

+       if (!min_fallback_reserved(zone))
+               alloctype = RCLM_FALLBACK;
+
        set_pageblock_type(zone, page, alloctype);
+       dec_reserve_count(zone, i);
+       inc_reserve_count(zone, alloctype);

        return page;
 }
@@ -638,6 +659,78 @@ remove_page(struct zone *zone, struct pa
        return expand(zone, page, order, current_order, area);
 }

+/*
+ * If we are falling back, and the allocation is KERNNORCLM,
+ * then reserve any buddies for the KERNNORCLM pool. These
+ * allocations fragment the worst so this helps keep them
+ * in the one place
+ */
+static inline struct free_area *
+fallback_buddy_reserve(int start_alloctype, struct zone *zone,
+                       unsigned int current_order, struct page *page,
+                       struct free_area *area)
+{
+       if (start_alloctype != RCLM_NORCLM)
+               return area;
+
+       area = &zone->free_area_lists[RCLM_NORCLM][current_order];
+
+       /* Reserve the whole block if this is a large split */
+       if (current_order >= MAX_ORDER / 2) {
+               int reserve_type = RCLM_NORCLM;
+
+               if (!min_fallback_reserved(zone))
+                       reserve_type = RCLM_FALLBACK;
+
+               dec_reserve_count(zone, get_pageblock_type(zone,page));
+               set_pageblock_type(zone, page, reserve_type);
+               inc_reserve_count(zone, reserve_type);
+       }
+       return area;
+}
+
+static struct page *
+fallback_alloc(int alloctype, struct zone *zone, unsigned int order)
+{
+       int *fallback_list;
+       int start_alloctype = alloctype;
+       struct free_area *area;
+       unsigned int current_order;
+       struct page *page;
+       int i;
+
+       /* Ok, pick the fallback order based on the type */
+       BUG_ON(alloctype >= RCLM_TYPES);
+       fallback_list = fallback_allocs[alloctype];
+
+       /*
+        * Here, the alloc type lists has been depleted as well as the global
+        * pool, so fallback. When falling back, the largest possible block
+        * will be taken to keep the fallbacks clustered if possible
+        */
+       for (i = 0; fallback_list[i] != RCLM_TYPES; i++) {
+               alloctype = fallback_list[i];
+
+               /* Find a block to allocate */
+               area = &zone->free_area_lists[alloctype][MAX_ORDER-1];
+               for (current_order = MAX_ORDER - 1; current_order > order;
+                               current_order--, area--) {
+                       if (list_empty(&area->free_list))
+                               continue;
+
+                       page = list_entry(area->free_list.next,
+                                       struct page, lru);
+                       area->nr_free--;
+                       area = fallback_buddy_reserve(start_alloctype, zone,
+                                       current_order, page, area);
+                       return remove_page(zone, page, order,
+                                       current_order, area);
+
+               }
+       }
+
+       return NULL;
+}
+
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
@@ -664,7 +757,8 @@ static struct page *__rmqueue(struct zon
        if (page != NULL)
                return remove_page(zone, page, order, MAX_ORDER-1, area);

-       return NULL;
+       /* Try falling back */
+       return fallback_alloc(alloctype, zone, order);
 }

 /*
@@ -2270,6 +2364,7 @@ static void __init free_area_init_core(s
                zone_seqlock_init(zone);
                zone->zone_pgdat = pgdat;
                zone->free_pages = 0;
+               zone->fallback_reserve = 0;

                zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
* [PATCH 5/7] Fragmentation Avoidance V19: 005_largealloc_tryharder
From: Mel Gorman @ 2005-10-30 18:34 UTC
To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman

Fragmentation avoidance patches increase our chances of satisfying high
order allocations, so this patch takes more than one iteration at trying to
fulfill those allocations because, unlike before, the extra iterations are
often useful.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c   2005-10-30 13:36:56.000000000 +0000
+++ linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c       2005-10-30 13:37:34.000000000 +0000
@@ -1127,6 +1127,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
        int do_retry;
        int can_try_harder;
        int did_some_progress;
+       int highorder_retry = 3;

        might_sleep_if(wait);

@@ -1275,7 +1276,17 @@ rebalance:
                        goto got_pg;
                }

-               out_of_memory(gfp_mask, order);
+               if (order < MAX_ORDER / 2)
+                       out_of_memory(gfp_mask, order);
+
+               /*
+                * Due to low fragmentation efforts, we try a little
+                * harder to satisfy high order allocations and only
+                * go OOM for low-order allocations
+                */
+               if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+                       goto rebalance;
+
                goto restart;
        }

@@ -1292,6 +1303,8 @@ rebalance:
                        do_retry = 1;
                if (gfp_mask & __GFP_NOFAIL)
                        do_retry = 1;
+               if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+                       do_retry = 1;
        }
        if (do_retry) {
                blk_congestion_wait(WRITE, HZ/50);
* [PATCH 6/7] Fragmentation Avoidance V19: 006_percpu 2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman ` (4 preceding siblings ...) 2005-10-30 18:34 ` [PATCH 5/7] Fragmentation Avoidance V19: 005_largealloc_tryharder Mel Gorman @ 2005-10-30 18:34 ` Mel Gorman 2005-10-30 18:34 ` [PATCH 7/7] Fragmentation Avoidance V19: 007_stats Mel Gorman 2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz 7 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw) To: akpm; +Cc: linux-mm, Mel Gorman, linux-kernel, lhms-devel The freelists for each allocation type can slowly become corrupted due to the per-cpu list. Consider what happens when the following happens 1. A 2^(MAX_ORDER-1) list is reserved for __GFP_EASYRCLM pages 2. An order-0 page is allocated from the newly reserved block 3. The page is freed and placed on the per-cpu list 4. alloc_page() is called with GFP_KERNEL as the gfp_mask 5. The per-cpu list is used to satisfy the allocation Now, a kernel page is in the middle of a __GFP_EASYRCLM page. This means that over long periods of the time, the anti-fragmentation scheme slowly degrades to the standard allocator. This patch divides the per-cpu lists into Kernel and User lists. RCLM_NORCLM and RCLM_KERN use the Kernel list and RCLM_EASY uses the user list. Strictly speaking, there should be three lists but as little effort is made to reclaim RCLM_KERN pages, it is not worth the overhead *yet*. Signed-off-by: Mel Gorman <mel@csn.ul.ie> diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-005_largealloc_tryharder/include/linux/mmzone.h linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h --- linux-2.6.14-rc5-mm1-005_largealloc_tryharder/include/linux/mmzone.h 2005-10-30 13:36:56.000000000 +0000 +++ linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h 2005-10-30 13:38:14.000000000 +0000 @@ -60,12 +60,21 @@ struct zone_padding { #define ZONE_PADDING(name) #endif +/* + * Indices into pcpu_list + * PCPU_KERNEL: For RCLM_NORCLM and RCLM_KERN allocations + * PCPU_EASY: For RCLM_EASY allocations + */ +#define PCPU_KERNEL 0 +#define PCPU_EASY 1 +#define PCPU_TYPES 2 + struct per_cpu_pages { - int count; /* number of pages in the list */ + int count[PCPU_TYPES]; /* Number of pages on each list */ int low; /* low watermark, refill needed */ int high; /* high watermark, emptying needed */ int batch; /* chunk size for buddy add/remove */ - struct list_head list; /* the list of pages */ + struct list_head list[PCPU_TYPES]; /* the lists of pages */ }; struct per_cpu_pageset { @@ -80,6 +89,10 @@ struct per_cpu_pageset { #endif } ____cacheline_aligned_in_smp; +/* Helpers for per_cpu_pages */ +#define pset_count(pset) (pset.count[PCPU_KERNEL] + pset.count[PCPU_EASY]) +#define for_each_pcputype(pindex) \ + for (pindex = 0; pindex < PCPU_TYPES; pindex++) #ifdef CONFIG_NUMA #define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)]) #else diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c --- linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c 2005-10-30 13:37:34.000000000 +0000 +++ linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c 2005-10-30 13:38:14.000000000 +0000 @@ -792,7 +792,7 @@ static int rmqueue_bulk(struct zone *zon void drain_remote_pages(void) { struct zone *zone; - int i; + int i, pindex; unsigned long flags; local_irq_save(flags); @@ -808,9 +808,16 @@ void 
drain_remote_pages(void) struct per_cpu_pages *pcp; pcp = &pset->pcp[i]; - if (pcp->count) - pcp->count -= free_pages_bulk(zone, pcp->count, - &pcp->list, 0); + for_each_pcputype(pindex) { + if (!pcp->count[pindex]) + continue; + + /* Try remove all pages from the pcpu list */ + pcp->count[pindex] -= + free_pages_bulk(zone, + pcp->count[pindex], + &pcp->list[pindex], 0); + } } } local_irq_restore(flags); @@ -821,7 +828,7 @@ void drain_remote_pages(void) static void __drain_pages(unsigned int cpu) { struct zone *zone; - int i; + int i, pindex; for_each_zone(zone) { struct per_cpu_pageset *pset; @@ -831,8 +838,16 @@ static void __drain_pages(unsigned int c struct per_cpu_pages *pcp; pcp = &pset->pcp[i]; - pcp->count -= free_pages_bulk(zone, pcp->count, - &pcp->list, 0); + for_each_pcputype(pindex) { + if (!pcp->count[pindex]) + continue; + + /* Try remove all pages from the pcpu list */ + pcp->count[pindex] -= + free_pages_bulk(zone, + pcp->count[pindex], + &pcp->list[pindex], 0); + } } } } @@ -911,6 +926,7 @@ static void fastcall free_hot_cold_page( struct zone *zone = page_zone(page); struct per_cpu_pages *pcp; unsigned long flags; + int pindex; arch_free_page(page, 0); @@ -920,11 +936,21 @@ static void fastcall free_hot_cold_page( page->mapping = NULL; free_pages_check(__FUNCTION__, page); pcp = &zone_pcp(zone, get_cpu())->pcp[cold]; + + /* + * Strictly speaking, we should not be accessing the zone information + * here. In this case, it does not matter if the read is incorrect + */ + if (get_pageblock_type(zone, page) == RCLM_EASY) + pindex = PCPU_EASY; + else + pindex = PCPU_KERNEL; local_irq_save(flags); - list_add(&page->lru, &pcp->list); - pcp->count++; - if (pcp->count >= pcp->high) - pcp->count -= free_pages_bulk(zone, pcp->batch, &pcp->list, 0); + list_add(&page->lru, &pcp->list[pindex]); + pcp->count[pindex]++; + if (pcp->count[pindex] >= pcp->high) + pcp->count[pindex] -= free_pages_bulk(zone, pcp->batch, + &pcp->list[pindex], 0); local_irq_restore(flags); put_cpu(); } @@ -967,17 +993,23 @@ buffered_rmqueue(struct zone *zone, int if (order == 0) { struct per_cpu_pages *pcp; + int pindex = PCPU_KERNEL; + if (alloctype == RCLM_EASY) + pindex = PCPU_EASY; pcp = &zone_pcp(zone, get_cpu())->pcp[cold]; local_irq_save(flags); - if (pcp->count <= pcp->low) - pcp->count += rmqueue_bulk(zone, 0, - pcp->batch, &pcp->list, - alloctype); - if (pcp->count) { - page = list_entry(pcp->list.next, struct page, lru); + if (pcp->count[pindex] <= pcp->low) + pcp->count[pindex] += rmqueue_bulk(zone, + 0, pcp->batch, + &(pcp->list[pindex]), + alloctype); + + if (pcp->count[pindex]) { + page = list_entry(pcp->list[pindex].next, + struct page, lru); list_del(&page->lru); - pcp->count--; + pcp->count[pindex]--; } local_irq_restore(flags); put_cpu(); @@ -1678,7 +1710,7 @@ void show_free_areas(void) pageset->pcp[temperature].low, pageset->pcp[temperature].high, pageset->pcp[temperature].batch, - pageset->pcp[temperature].count); + pset_count(pageset->pcp[temperature])); } } @@ -2135,18 +2167,22 @@ inline void setup_pageset(struct per_cpu struct per_cpu_pages *pcp; pcp = &p->pcp[0]; /* hot */ - pcp->count = 0; + pcp->count[PCPU_KERNEL] = 0; + pcp->count[PCPU_EASY] = 0; pcp->low = 0; - pcp->high = 6 * batch; + pcp->high = 3 * batch; pcp->batch = max(1UL, 1 * batch); - INIT_LIST_HEAD(&pcp->list); + INIT_LIST_HEAD(&pcp->list[PCPU_KERNEL]); + INIT_LIST_HEAD(&pcp->list[PCPU_EASY]); pcp = &p->pcp[1]; /* cold*/ - pcp->count = 0; + pcp->count[PCPU_KERNEL] = 0; + pcp->count[PCPU_EASY] = 0; pcp->low = 0; - pcp->high = 
2 * batch; + pcp->high = batch; pcp->batch = max(1UL, batch/2); - INIT_LIST_HEAD(&pcp->list); + INIT_LIST_HEAD(&pcp->list[PCPU_KERNEL]); + INIT_LIST_HEAD(&pcp->list[PCPU_EASY]); } #ifndef CONFIG_SPARSEMEM @@ -2574,7 +2610,7 @@ static int zoneinfo_show(struct seq_file pageset = zone_pcp(zone, i); for (j = 0; j < ARRAY_SIZE(pageset->pcp); j++) { - if (pageset->pcp[j].count) + if (pset_count(pageset->pcp[j])) break; } if (j == ARRAY_SIZE(pageset->pcp)) @@ -2587,7 +2623,7 @@ static int zoneinfo_show(struct seq_file "\n high: %i" "\n batch: %i", i, j, - pageset->pcp[j].count, + pset_count(pageset->pcp[j]), pageset->pcp[j].low, pageset->pcp[j].high, pageset->pcp[j].batch); ^ permalink raw reply [flat|nested] 241+ messages in thread
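The heart of 006_percpu is the routing test repeated in free_hot_cold_page() and buffered_rmqueue() above; pulled out as a hypothetical helper (the patch open-codes it), the free-side rule is just:

/* Hypothetical helper, assuming the series' get_pageblock_type() and
 * PCPU_* definitions: pages from RCLM_EASY blocks go on the PCPU_EASY
 * per-cpu list, everything else shares PCPU_KERNEL. Freeing keys off
 * the pageblock type as below; allocation keys off the caller's
 * requested alloctype instead. */
static inline int pcpu_list_index(struct zone *zone, struct page *page)
{
	return get_pageblock_type(zone, page) == RCLM_EASY ?
						PCPU_EASY : PCPU_KERNEL;
}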
* [PATCH 7/7] Fragmentation Avoidance V19: 007_stats 2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman ` (5 preceding siblings ...) 2005-10-30 18:34 ` [PATCH 6/7] Fragmentation Avoidance V19: 006_percpu Mel Gorman @ 2005-10-30 18:34 ` Mel Gorman 2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz 7 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-10-30 18:34 UTC (permalink / raw) To: akpm; +Cc: linux-mm, lhms-devel, linux-kernel, Mel Gorman It is not necessary to apply this patch to get all the anti-fragmentation code. This patch adds a new config option called CONFIG_ALLOCSTATS. If set, a number of new bean counters are added that are related to the anti-fragmentation code. The information is exported via /proc/buddyinfo. This is very useful when debugging why high-order pages are not available for allocation. Signed-off-by: Mel Gorman <mel@csn.ul.ie> diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h linux-2.6.14-rc5-mm1-007_stats/include/linux/mmzone.h --- linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h 2005-10-30 13:38:14.000000000 +0000 +++ linux-2.6.14-rc5-mm1-007_stats/include/linux/mmzone.h 2005-10-30 13:38:56.000000000 +0000 @@ -193,6 +193,17 @@ struct zone { /* Number of pages currently used for RCLM_FALLBACK */ unsigned long fallback_reserve; +#ifdef CONFIG_ALLOCSTATS + /* + * These are beancounters that track how the placement policy + * of the buddy allocator is performing + */ + unsigned long fallback_count[RCLM_TYPES]; + unsigned long alloc_count[RCLM_TYPES]; + unsigned long reserve_count[RCLM_TYPES]; + unsigned long kernnorclm_full_steal; + unsigned long kernnorclm_partial_steal; +#endif ZONE_PADDING(_pad1_) /* Fields commonly accessed by the page reclaim scanner */ @@ -292,6 +303,17 @@ struct zone { char *name; } ____cacheline_maxaligned_in_smp; +#ifdef CONFIG_ALLOCSTATS +#define inc_fallback_count(zone, type) zone->fallback_count[type]++ +#define inc_alloc_count(zone, type) zone->alloc_count[type]++ +#define inc_kernnorclm_partial_steal(zone) zone->kernnorclm_partial_steal++ +#define inc_kernnorclm_full_steal(zone) zone->kernnorclm_full_steal++ +#else +#define inc_fallback_count(zone, type) do {} while (0) +#define inc_alloc_count(zone, type) do {} while (0) +#define inc_kernnorclm_partial_steal(zone) do {} while (0) +#define inc_kernnorclm_full_steal(zone) do {} while (0) +#endif /* * The "priority" of VM scanning is how much of the queues we will scan in one @@ -319,12 +341,19 @@ static inline void inc_reserve_count(str { if (type == RCLM_FALLBACK) zone->fallback_reserve += PAGES_PER_MAXORDER; +#ifdef CONFIG_ALLOCSTATS + zone->reserve_count[type]++; +#endif } static inline void dec_reserve_count(struct zone *zone, int type) { if (type == RCLM_FALLBACK && zone->fallback_reserve) zone->fallback_reserve -= PAGES_PER_MAXORDER; +#ifdef CONFIG_ALLOCSTATS + if (zone->reserve_count[type] > 0) + zone->reserve_count[type]--; +#endif } /* diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/lib/Kconfig.debug linux-2.6.14-rc5-mm1-007_stats/lib/Kconfig.debug --- linux-2.6.14-rc5-mm1-006_percpu/lib/Kconfig.debug 2005-10-30 13:20:06.000000000 +0000 +++ linux-2.6.14-rc5-mm1-007_stats/lib/Kconfig.debug 2005-10-30 13:38:56.000000000 +0000 @@ -77,6 +77,17 @@ config SCHEDSTATS application, you can say N to avoid the very slight overhead this adds. 
+config ALLOCSTATS + bool "Collection buddy allocator statistics" + depends on DEBUG_KERNEL && PROC_FS + help + If you say Y here, additional code will be inserted into the + page allocator routines to collect statistics on the allocator + behavior and provide them in /proc/buddyinfo. These stats are + useful for measuring fragmentation in the buddy allocator. If + you are not debugging or measuring the allocator, you can say N + to avoid the slight overhead this adds. + config DEBUG_SLAB bool "Debug memory allocations" depends on DEBUG_KERNEL diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c linux-2.6.14-rc5-mm1-007_stats/mm/page_alloc.c --- linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c 2005-10-30 13:38:14.000000000 +0000 +++ linux-2.6.14-rc5-mm1-007_stats/mm/page_alloc.c 2005-10-30 13:38:56.000000000 +0000 @@ -187,6 +187,11 @@ EXPORT_SYMBOL(zone_table); static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" }; int min_free_kbytes = 1024; +#ifdef CONFIG_ALLOCSTATS +static char *type_names[RCLM_TYPES] = { "KernNoRclm", "EasyRclm", + "KernRclm", "Fallback"}; +#endif /* CONFIG_ALLOCSTATS */ + unsigned long __initdata nr_kernel_pages; unsigned long __initdata nr_all_pages; @@ -684,6 +689,9 @@ fallback_buddy_reserve(int start_allocty dec_reserve_count(zone, get_pageblock_type(zone,page)); set_pageblock_type(zone, page, reserve_type); inc_reserve_count(zone, reserve_type); + inc_kernnorclm_full_steal(zone); + } else { + inc_kernnorclm_partial_steal(zone); } return area; } @@ -726,6 +734,15 @@ fallback_alloc(int alloctype, struct zon current_order, area); } + + /* + * If the current alloctype is RCLM_FALLBACK, it means + * that the requested pool and fallback pool are both + * depleted and we are falling back to other pools. 
+ * At this point, pools are starting to get fragmented + */ + if (alloctype == RCLM_FALLBACK) + inc_fallback_count(zone, start_alloctype); } return NULL; @@ -742,6 +759,8 @@ static struct page *__rmqueue(struct zon unsigned int current_order; struct page *page; + inc_alloc_count(zone, alloctype); + for (current_order = order; current_order < MAX_ORDER; ++current_order) { area = &zone->free_area_lists[alloctype][current_order]; if (list_empty(&area->free_list)) @@ -2373,6 +2392,9 @@ static __devinit void init_currently_emp memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn); zone_init_free_lists(pgdat, zone, zone->spanned_pages); +#ifdef CONFIG_ALLOCSTATS + zone->reserve_count[RCLM_NORCLM] = zone->present_pages >> (MAX_ORDER-1); +#endif /* CONFIG_ALLOCSTATS */ } /* @@ -2528,6 +2550,18 @@ static int frag_show(struct seq_file *m, int order, t; struct free_area *area; unsigned long nr_bufs = 0; +#ifdef CONFIG_ALLOCSTATS + int i; + unsigned long kernnorclm_full_steal = 0; + unsigned long kernnorclm_partial_steal = 0; + unsigned long reserve_count[RCLM_TYPES]; + unsigned long fallback_count[RCLM_TYPES]; + unsigned long alloc_count[RCLM_TYPES]; + + memset(reserve_count, 0, sizeof(reserve_count)); + memset(fallback_count, 0, sizeof(fallback_count)); + memset(alloc_count, 0, sizeof(alloc_count)); +#endif for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) { if (!zone->present_pages) @@ -2548,6 +2582,86 @@ static int frag_show(struct seq_file *m, spin_unlock_irqrestore(&zone->lock, flags); seq_putc(m, '\n'); } + +#ifdef CONFIG_ALLOCSTATS + /* Show statistics for each allocation type */ + seq_printf(m, "\nPer-allocation-type statistics"); + for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) { + if (!zone->present_pages) + continue; + + spin_lock_irqsave(&zone->lock, flags); + for (t = 0; t < RCLM_TYPES; t++) { + struct list_head *elem; + seq_printf(m, "\nNode %d, zone %8s, type %10s ", + pgdat->node_id, zone->name, + type_names[t]); + for (order = 0; order < MAX_ORDER; ++order) { + nr_bufs = 0; + + list_for_each(elem, &zone->free_area_lists[t][order].free_list) + ++nr_bufs; + seq_printf(m, "%6lu ", nr_bufs); + } + } + + /* Scan global list */ + seq_printf(m, "\n"); + seq_printf(m, "Node %d, zone %8s, type %10s", + pgdat->node_id, zone->name, + "MAX_ORDER"); + nr_bufs = 0; + for (t = 0; t < RCLM_TYPES; t++) { + nr_bufs += + zone->free_area_lists[t][MAX_ORDER-1].nr_free; + } + seq_printf(m, "%6lu ", nr_bufs); + seq_printf(m, "\n"); + + seq_printf(m, "%s Zone beancounters\n", zone->name); + seq_printf(m, "Fallback reserve: %lu (%lu blocks)\n", + zone->fallback_reserve, + zone->fallback_reserve >> (MAX_ORDER-1)); + seq_printf(m, "Fallback needed: %lu (%lu blocks)\n", + zone->present_pages >> 3, + (zone->present_pages >> 3) >> (MAX_ORDER-1)); + seq_printf(m, "Partial steal: %lu\n", + zone->kernnorclm_partial_steal); + seq_printf(m, "Full steal: %lu\n", + zone->kernnorclm_full_steal); + + kernnorclm_partial_steal += zone->kernnorclm_partial_steal; + kernnorclm_full_steal += zone->kernnorclm_full_steal; + seq_putc(m, '\n'); + + for (i = 0; i< RCLM_TYPES; i++) { + seq_printf(m, "%-10s Allocs: %-10lu Reserve: %-10lu Fallbacks: %-10lu\n", + type_names[i], + zone->alloc_count[i], + zone->reserve_count[i], + zone->fallback_count[i]); + alloc_count[i] += zone->alloc_count[i]; + reserve_count[i] += zone->reserve_count[i]; + fallback_count[i] += zone->fallback_count[i]; + } + + spin_unlock_irqrestore(&zone->lock, flags); + } + + + /* Show bean counters */ + seq_printf(m, 
"\nGlobal beancounters\n"); + seq_printf(m, "Partial steal: %lu\n", kernnorclm_partial_steal); + seq_printf(m, "Full steal: %lu\n", kernnorclm_full_steal); + + for (i = 0; i< RCLM_TYPES; i++) { + seq_printf(m, "%-10s Allocs: %-10lu Reserve: %-10lu Fallbacks: %-10lu\n", + type_names[i], + alloc_count[i], + reserve_count[i], + fallback_count[i]); + } +#endif /* CONFIG_ALLOCSTATS */ return 0; } ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman ` (6 preceding siblings ...) 2005-10-30 18:34 ` [PATCH 7/7] Fragmentation Avoidance V19: 007_stats Mel Gorman @ 2005-10-31 5:57 ` Mike Kravetz 2005-10-31 6:37 ` Nick Piggin 7 siblings, 1 reply; 241+ messages in thread From: Mike Kravetz @ 2005-10-31 5:57 UTC (permalink / raw) To: Mel Gorman; +Cc: akpm, linux-mm, linux-kernel, lhms-devel On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote: > Here are a few brief reasons why this set of patches is useful; > > o Reduced fragmentation improves the chance a large order allocation succeeds > o General-purpose memory hotplug needs the page/memory groupings provided > o Reduces the number of badly-placed pages that page migration mechanism must > deal with. This also applies to any active page defragmentation mechanism. I can say that this patch set makes hotplug memory removal valuable on ppc64. My system has 6GB of memory and I would 'load it up' to the point where it would just start to swap and let it run for an hour. Without these patches, it was almost impossible to find a section that could be offlined. With the patches, I can consistently reduce memory to somewhere between 512MB and 1GB. Of course, results will vary based on workload. Also, this is most advantageous for memory hotplug on ppc64 due to the relatively small section size (16MB) as compared to the page grouping size (8MB). A more general-purpose solution is needed for memory hotplug support on architectures with larger section sizes. Just another data point, -- Mike ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz @ 2005-10-31 6:37 ` Nick Piggin 2005-10-31 7:54 ` Andrew Morton 0 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-10-31 6:37 UTC (permalink / raw) To: Mike Kravetz; +Cc: Mel Gorman, akpm, linux-mm, linux-kernel, lhms-devel Mike Kravetz wrote: > On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote: > >>Here are a few brief reasons why this set of patches is useful; >> >>o Reduced fragmentation improves the chance a large order allocation succeeds >>o General-purpose memory hotplug needs the page/memory groupings provided >>o Reduces the number of badly-placed pages that page migration mechanism must >> deal with. This also applies to any active page defragmentation mechanism. > > > I can say that this patch set makes hotplug memory remove be of > value on ppc64. My system has 6GB of memory and I would 'load > it up' to the point where it would just start to swap and let it > run for an hour. Without these patches, it was almost impossible > to find a section that could be offlined. With the patches, I > can consistently reduce memory to somewhere between 512MB and 1GB. > Of course, results will vary based on workload. Also, this is > most advantageous for memory hotlug on ppc64 due to relatively > small section size (16MB) as compared to the page grouping size > (8MB). A more general purpose solution is needed for memory hotplug > support on architectures with larger section sizes. > > Just another data point, Despite what people were trying to tell me at Ottawa, this patch set really does add quite a lot of complexity to the page allocator, and it seems to be increasingly only of benefit to dynamically allocating hugepages and memory hot unplug. If that is the case, do we really want to make such sacrifices for the huge machines that want these things? What about just making an extra zone for easy-to-reclaim things to live in? This could possibly even be resized at runtime according to demand with the memory hotplug stuff (though I haven't been following that). Don't take this as criticism of the actual implementation or its effectiveness. Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 6:37 ` Nick Piggin @ 2005-10-31 7:54 ` Andrew Morton 2005-10-31 7:11 ` Nick Piggin [not found] ` <27700000.1130769270@[10.10.2.4]> 0 siblings, 2 replies; 241+ messages in thread From: Andrew Morton @ 2005-10-31 7:54 UTC (permalink / raw) To: Nick Piggin; +Cc: kravetz, mel, linux-mm, linux-kernel, lhms-devel Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Mike Kravetz wrote: > > On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote: > > > >>Here are a few brief reasons why this set of patches is useful; > >> > >>o Reduced fragmentation improves the chance a large order allocation succeeds > >>o General-purpose memory hotplug needs the page/memory groupings provided > >>o Reduces the number of badly-placed pages that page migration mechanism must > >> deal with. This also applies to any active page defragmentation mechanism. > > > > > > I can say that this patch set makes hotplug memory remove be of > > value on ppc64. My system has 6GB of memory and I would 'load > > it up' to the point where it would just start to swap and let it > > run for an hour. Without these patches, it was almost impossible > > to find a section that could be offlined. With the patches, I > > can consistently reduce memory to somewhere between 512MB and 1GB. > > Of course, results will vary based on workload. Also, this is > > most advantageous for memory hotlug on ppc64 due to relatively > > small section size (16MB) as compared to the page grouping size > > (8MB). A more general purpose solution is needed for memory hotplug > > support on architectures with larger section sizes. > > > > Just another data point, > > Despite what people were trying to tell me at Ottawa, this patch > set really does add quite a lot of complexity to the page > allocator, and it seems to be increasingly only of benefit to > dynamically allocating hugepages and memory hot unplug. Remember that Rohit is seeing ~10% variation between runs of scientific software, and that his patch to use higher-order pages to preload the percpu-pages magazines fixed that up. I assume this means that it provided up to 10% speedup, which is a lot. But the patch caused page allocator fragmentation and several reports of gigE Tx buffer allocation failures, so I dropped it. We think that Mel's patches will allow us to reintroduce Rohit's optimisation. > If that is the case, do we really want to make such sacrifices > for the huge machines that want these things? What about just > making an extra zone for easy-to-reclaim things to live in? > > This could possibly even be resized at runtime according to > demand with the memory hotplug stuff (though I haven't been > following that). > > Don't take this as criticism of the actual implementation or its > effectiveness. > But yes, adding additional complexity is a black mark, and these patches add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw). ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 7:54 ` Andrew Morton @ 2005-10-31 7:11 ` Nick Piggin 2005-10-31 16:19 ` Mel Gorman [not found] ` <27700000.1130769270@[10.10.2.4]> 1 sibling, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-10-31 7:11 UTC (permalink / raw) To: Andrew Morton; +Cc: kravetz, mel, linux-mm, linux-kernel, lhms-devel Andrew Morton wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: >>Despite what people were trying to tell me at Ottawa, this patch >>set really does add quite a lot of complexity to the page >>allocator, and it seems to be increasingly only of benefit to >>dynamically allocating hugepages and memory hot unplug. > > > Remember that Rohit is seeing ~10% variation between runs of scientific > software, and that his patch to use higher-order pages to preload the > percpu-pages magazines fixed that up. I assume this means that it provided > up to 10% speedup, which is a lot. > OK, I wasn't aware of this. I wonder what other approaches we could try to add a bit of colour to our pages? I bet something simple like trying to hand out alternate odd/even pages per task might help. > But the patch caused page allocator fragmentation and several reports of > gigE Tx buffer allocation failures, so I dropped it. > > We think that Mel's patches will allow us to reintroduce Rohit's > optimisation. > > >>If that is the case, do we really want to make such sacrifices >>for the huge machines that want these things? What about just >>making an extra zone for easy-to-reclaim things to live in? >> >>This could possibly even be resized at runtime according to >>demand with the memory hotplug stuff (though I haven't been >>following that). >> >>Don't take this as criticism of the actual implementation or its >>effectiveness. >> > > > But yes, adding additional complexity is a black mark, and these patches > add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw). > They do look quite fine. They seem to get their claws pretty deep into page reclaim, but I guess that is to be expected if we want to increase readahead smarts much more. However, I'm hoping bits of that can be merged at a time, and interfaces and page reclaim stuff can be discussed and the best option taken. No such luck with these patches AFAIKS - simply adding another level of page groups, and another level of heuristics to the page allocator is going to hurt. By definition. I do wonder why zones can't be used... though I'm sure there are good reasons. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 7:11 ` Nick Piggin @ 2005-10-31 16:19 ` Mel Gorman 2005-10-31 23:54 ` Nick Piggin 0 siblings, 1 reply; 241+ messages in thread From: Mel Gorman @ 2005-10-31 16:19 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel On Mon, 31 Oct 2005, Nick Piggin wrote: > Andrew Morton wrote: > > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > Despite what people were trying to tell me at Ottawa, this patch > > > set really does add quite a lot of complexity to the page > > > allocator, and it seems to be increasingly only of benefit to > > > dynamically allocating hugepages and memory hot unplug. > > > > > > Remember that Rohit is seeing ~10% variation between runs of scientific > > software, and that his patch to use higher-order pages to preload the > > percpu-pages magazines fixed that up. I assume this means that it provided > > up to 10% speedup, which is a lot. > > > > OK, I wasn't aware of this. I wonder what other approaches we could > try to add a bit of colour to our pages? I bet something simple like > trying to hand out alternate odd/even pages per task might help. > Reading through the kernel archives, it appears that any page colouring scheme was getting rejected because it slowed up workloads like kernel compilers that were not very cache sensitive. Where an approach didn't suffer from that problem, there was disagreement over whether there was a general performance improvement or not. I recall Rohit's patch from an earlier -mm. Without knowing anything about his test, I am guessing he is getting cheap page colouring by preloading the per-cpu cache with contiguous pages and his workload is faulting in the batch of pages immediately by doing something like linearly reading a large array. Hence, the mappings of his workload are getting the right colour pages. This makes his workload a "lucky" workload. The general benefit of preloading the percpu magazines is that there is a chance the allocator only has to be called once, not pcp->batch times. An odd/even allocation scheme could be provided by having two free_lists in a free_area. One list for the "left buddy" and the other list for the "right buddy". However, at best, that would provide two colours. I'm not sure how much benefit it would give for the cost of more linked lists. > > gigE Tx buffer allocation failures, so I dropped it. > > > > We think that Mel's patches will allow us to reintroduce Rohit's > > optimisation. > > > > > > > If that is the case, do we really want to make such sacrifices > > > for the huge machines that want these things? What about just > > > making an extra zone for easy-to-reclaim things to live in? > > > > > > This could possibly even be resized at runtime according to > > > demand with the memory hotplug stuff (though I haven't been > > > following that). > > > > > > Don't take this as criticism of the actual implementation or its > > > effectiveness. > > > > > > > > > But yes, adding additional complexity is a black mark, and these patches > > add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw). > > > > They do look quite fine. They seem to get their claws pretty deep > into page reclaim, but I guess that is to be expected if we want > to increase readahead smarts much more. > > However, I'm hoping bits of that can be merged at a time, and > interfaces and page reclaim stuff can be discussed and the best > option taken. 
No such luck with these patches AFAIKS - simply > adding another level of page groups, and another level of > heuristics to the page allocator is going to hurt. By definition. > I do wonder why zones can't be used... though I'm sure there are > good reasons. > Granted, the patch set does add complexity even though I tried to keep it as simple as possible. Benchmarks were posted with each patchset to show that it was not suffering in real performance even if the code is a bit less approachable. Doing something similar with zones is an old idea and was brought up specifically for memory hotplug. In implementations, the zone was called ZONE_HOTREMOVABLE or something similar. In my opinion, replicating the effect of this set of patches with zones introduces its own set of headaches and ends up being far more complicated. Hopefully, someone will point out if I am missing historical context here, am rehashing old arguments or am just plain wrong :) To replicate the functionality of these patches with zones would require two additional zones for NormalEasy and HighmemEasy (I suck at naming things). The plus side is that once the zone fallback lists are updated, the page allocator remains more or less the same as it is today. Then the headaches start. Problem 1: Zone fallback lists are "one-way" and per-node. Let's assume a fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming we are allocating PTEs from high memory, we could fall back to the Normal zone even if highmem pages are available because the HighMem zone was out of pages. It will require very different fallback logic to say that HighMem allocations can also use HighMemEasy rather than falling back to Normal. Problem 2: Setting the zone size will be a very difficult tunable to get right. Right off, we are introducing a tunable which will make foreheads furrow. If the tunable is set wrong, system performance will suffer and we could see situations where kernel allocations fail because their zone got depleted. Problem 3: To get rid of the tunable, we could try resizing the zones dynamically but that will be hard. Obviously, the zones are going to be physically adjacent to each other. To resize the zone, the pages at one end of the zone will need to be free. Shrinking the NormalEasy zone would be easy enough, but shrinking the Normal zone with kernel pages in it would be considerably harder, if not outright impossible. One page in the wrong place will mean the zone cannot be resized. Problem 4: Page reclaim would have two new zones to deal with, bringing with it a new set of zone balancing problems. That brings its own special brand of fun. There may be more problems but these 4 are fairly important. This patchset does not suffer from the same problems. Problem 1: This patchset has a fallback list for each allocation type. So EasyRclm allocations can just as easily use an area reserved for kernel allocations and vice versa. Obviously we don't like it when this happens, but when it does, things start fragmenting rather than breaking. Problem 2: The number of pages that get reserved for each type grows and shrinks on demand. There is no tunable and no need for one. Problem 3: Problem doesn't exist for this patchset. Problem 4: Problem doesn't exist for this patchset. Bottom line, using zones will be more complex than this set of patches and bring a lot of tricky issues with it.
-- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
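For reference, the two-colour scheme Mel describes in words above (two free lists per free_area, split by buddy parity) could take roughly this shape; every name below is hypothetical, and as he notes it buys at most two colours:

#include <linux/list.h>

/* Hypothetical two-list free_area for the odd/even idea: buddies are
 * segregated by parity so an order-0 request can prefer the colour
 * matching the faulting vaddr. */
struct twocolour_free_area {
	struct list_head free_list[2];	/* [0] even buddy, [1] odd buddy */
	unsigned long nr_free;
};

/* Parity of the buddy containing pfn at the given order. */
static inline int buddy_colour(unsigned long pfn, unsigned int order)
{
	return (pfn >> order) & 1;
}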
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 16:19 ` Mel Gorman @ 2005-10-31 23:54 ` Nick Piggin 2005-11-01 1:28 ` Mel Gorman 0 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-10-31 23:54 UTC (permalink / raw) To: Mel Gorman; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel Mel Gorman wrote: > I recall Rohit's patch from an earlier -mm. Without knowing anything about > his test, I am guessing he is getting cheap page colouring by preloading > the per-cpu cache with contiguous pages and his workload is faulting in > the batch of pages immediately by doing something like linearly reading a > large array. Hence, the mappings of his workload are getting the right > colour pages. This makes his workload a "lucky" workload. The general > benefit of preloading the percpu magazines is that there is a chance the > allocator only has to be called once, not pcp->batch times. > Or we could introduce a new allocation mechanism for anon pages that passes the vaddr to the allocator, and tries to get an odd/even page according to the vaddr. > An odd/even allocation scheme could be provided by having two free_lists > in a free_area. One list for the "left buddy" and the other list for the > "right buddy". However, at best, that would provide two colours. I'm not > sure how much benefit it would give for the cost of more linked lists. > 2 colours should be a good first order improvement because you will no longer have adjacent pages of the same colour. It would definitely be cheaper than fragmentation avoidance + higher order batch loading. > To replicate the functionality of these patches with zones would require > two additional zones for NormalEasy and HighmemEasy (I suck at naming > things). The plus side is that once the zone fallback lists are updated, > the page allocator remains more or less the same as it is today. Then the > headaches start. > > Problem 1: Zone fallback lists are "one-way" and per-node. Lets assume a > fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming > we are allocating PTEs from high memory, we could fallback to the Normal > zone even if highmem pages are available because the HighMem zone was out > of pages. It will require very different fallback logic to say that > HighMem allocations can also use HighMemEasy rather than falling back to > Normal. > Just be a different set of GFP flags. Your patches obviously also have some ordering imposed.... pagecache would want HighMemEasy, HighMem, NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA. Note that if you do need to make some changes to the zone allocator, then IMO that is far preferable to add a new layer of things-that-are-blocks-of- -memory-but-not-zones, complete with their own balancing and other heuristics. > Problem 2: Setting the zone size will be a very difficult tunable to get > right. Right off, we are are introducing a tunable which will make > foreheads furrow. If the tunable is set wrong, system performance will > suffer and we could see situations where kernel allocations fail because > it's zone got depleted. > But even so, when you do automatic resizing, you seem to be adding a fundamental weak point in fragmentation avoidance. > Problem 3: To get rid of the tunable, we could try resizing the zones > dynamically but that will be hard. Obviously, the zones are going to be > physically adjacent to each other. To resize the zone, the pages at one > end of the zone will need to be free. 
Shrinking the NormalEasy zone would > be easy enough, but shrinking the Normal zone with kernel pages in it > would be considerably harder, if not outright impossible. One page in the > wrong place will mean the zone cannot be resized > OK, maybe it is hard ;) Do they really need to be resized, then? Isn't the big memory hotunplug push aimed at virtual machines and hypervisors anyway? In which case one would presumably have some memory that "must" be reclaimable, in which case we can't expand non-Easy zones into that memory anyway. > Problem 4: Page reclaim would have two new zones to deal with bringing > with it a new set of zone balancing problems. That brings it's own special > brand of fun. > > There may be more problems but these 4 are fairly important. This patchset > does not suffer from the same problems. > If page reclaim can't deal with 5 zones then it is going to have problems somewhere at 3 and needs to be fixed. I don't see how your patches get around this fun by simply introducing their own balancing and fallback heuristics. > Problem 1: This patchset has a fallback list for each allocation type. So > EasyRclm allocations can just as easily use an area reserved for kernel > allocations and vice versa. Obviously we don't like when this happens, but > when it does, things start fragmenting rather than breaking. > > Problem 2: The number of pages that get reserved for each type grows and > shrinks on demand. There is no tunable and no need for one. > > Problem 3: Problem doesn't exist for this patchset > > Problem 4: Problem doesn't exist for this patchset. > > Bottom line, using zones will be more complex than this set of patches and > bring a lot of tricky issues with it. > Maybe zones don't do exactly what you need, but I think they're better than you think ;) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
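The zone-based alternative Nick sketches would hang this ordering off per-class zone fallback lists rather than per-type free lists. Spelled out with entirely hypothetical zone names (no *Easy zones exist in any tree):

/* Hypothetical zones and fallback orderings from the discussion,
 * illustrating only the shape of the zone-based alternative. */
enum {
	ZONE_DMA_H, ZONE_NORMAL_H, ZONE_NORMAL_EASY_H,
	ZONE_HIGHMEM_H, ZONE_HIGHMEM_EASY_H, ZONE_SENTINEL_H
};

/* Pagecache: prefer easily reclaimed highmem, fall through to DMA. */
static const int pagecache_zonelist[] = {
	ZONE_HIGHMEM_EASY_H, ZONE_HIGHMEM_H, ZONE_NORMAL_EASY_H,
	ZONE_NORMAL_H, ZONE_DMA_H, ZONE_SENTINEL_H
};

/* PTE pages: skip the Easy zones entirely. */
static const int pte_zonelist[] = {
	ZONE_HIGHMEM_H, ZONE_NORMAL_H, ZONE_DMA_H, ZONE_SENTINEL_H
};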
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 23:54 ` Nick Piggin @ 2005-11-01 1:28 ` Mel Gorman 2005-11-01 1:42 ` Nick Piggin 0 siblings, 1 reply; 241+ messages in thread From: Mel Gorman @ 2005-11-01 1:28 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel On Tue, 1 Nov 2005, Nick Piggin wrote: > Mel Gorman wrote: > > > I recall Rohit's patch from an earlier -mm. Without knowing anything about > > his test, I am guessing he is getting cheap page colouring by preloading > > the per-cpu cache with contiguous pages and his workload is faulting in > > the batch of pages immediately by doing something like linearly reading a > > large array. Hence, the mappings of his workload are getting the right > > colour pages. This makes his workload a "lucky" workload. The general > > benefit of preloading the percpu magazines is that there is a chance the > > allocator only has to be called once, not pcp->batch times. > > > > Or we could introduce a new allocation mechanism for anon pages that > passes the vaddr to the allocator, and tries to get an odd/even page > according to the vaddr. > We could, but it is a different problem than what this set of patches is trying to address. I'll add page colouring to the end of the todo list in case I get stuck for something to do. > > An odd/even allocation scheme could be provided by having two free_lists > > in a free_area. One list for the "left buddy" and the other list for the > > "right buddy". However, at best, that would provide two colours. I'm not > > sure how much benefit it would give for the cost of more linked lists. > > > > 2 colours should be a good first order improvement because you will > no longer have adjacent pages of the same colour. > > It would definitely be cheaper than fragmentation avoidance + higher > order batch loading. > Ok, but the page colours would also need to be in the per-cpu lists; otherwise, this new API that supplies vaddrs always takes the spinlock for the free lists. I don't believe it would be cheaper, and any benefit would only show up on benchmarks that are cache sensitive. Judging by previous discussions on page colouring in the mail archives, Linus will happily kick the approach full of holes. As for current performance, the Aim9 benchmarks show that the fragmentation avoidance does not have a major performance penalty. A run of the patches in the -mm tree should find out if there are performance regressions on other machine types. > > > To replicate the functionality of these patches with zones would require > > two additional zones for NormalEasy and HighmemEasy (I suck at naming > > things). The plus side is that once the zone fallback lists are updated, > > the page allocator remains more or less the same as it is today. Then the > > headaches start. > > > > Problem 1: Zone fallback lists are "one-way" and per-node. Lets assume a > > fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming > > we are allocating PTEs from high memory, we could fallback to the Normal > > zone even if highmem pages are available because the HighMem zone was out > > of pages. It will require very different fallback logic to say that > > HighMem allocations can also use HighMemEasy rather than falling back to > > Normal. > > > > Just be a different set of GFP flags. Your patches obviously also have > some ordering imposed.... pagecache would want HighMemEasy, HighMem, > NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA.
> As well as a different set of GFP flags, we would also need new zone fallback logic which will hit the __alloc_pages() path. It will be adding more complexity to the allocator and we're replacing one type of complexity with another. > Note that if you do need to make some changes to the zone allocator, then > IMO that is far preferable to add a new layer of things-that-are-blocks-of- > -memory-but-not-zones, complete with their own balancing and other heuristics. > Thing is, with my approach, the very worst that happens is that it fragments just as bad as the normal allocator. With a zone-based approach, the worst that happens is that the kernel zone is too small, kernel caches do not grow to a suitable size and overall system performance degrades. > > Problem 2: Setting the zone size will be a very difficult tunable to get > > right. Right off, we are are introducing a tunable which will make > > foreheads furrow. If the tunable is set wrong, system performance will > > suffer and we could see situations where kernel allocations fail because > > it's zone got depleted. > > > > But even so, when you do automatic resizing, you seem to be adding a > fundamental weak point in fragmentation avoidance. > The sizing I do is when a large block is split. Then the region is just marked for a particular allocation type. This is very simple. The second resizing that occurs is when a kernel allocation "steal" easyrclm pages. I do not like the fact that we steal in this fashion but the alternative is to teach kswapd how to reclaim easyrclm pages from other areas. I view this as "future work" but if it was done, the "steal" mechanism would go away. > > Problem 3: To get rid of the tunable, we could try resizing the zones > > dynamically but that will be hard. Obviously, the zones are going to be > > physically adjacent to each other. To resize the zone, the pages at one > > end of the zone will need to be free. Shrinking the NormalEasy zone would > > be easy enough, but shrinking the Normal zone with kernel pages in it > > would be considerably harder, if not outright impossible. One page in the > > wrong place will mean the zone cannot be resized > > > > OK, maybe it is hard ;) Do they really need to be resized, then? > I think we would need to, yes. If the size of the region is wrong, bad things are likely to happen. If the kernel page zone is too small, it'll be under pressure even though there is memory available elsewhere. If it's too large, then it will get fragmented and high order allocations will fail. > Isn't the big memory hotunplug push aimed at virtual machines and > hypervisors anyway? In which case one would presumably have some > memory that "must" be reclaimable, in which case we can't expand > non-Easy zones into that memory anyway. > I believe that is the case for hotplug all right, but not the case where we just want to satisfy high order allocations in a reasonably reliable fashion. In that case, it would be nice to reclaim an easyrclm region. It has already been reported by Mike Kravetz that memory remove works a whole lot better on PPC64 with this patch than without it. Memory hotplug remove was not the problem I was trying to solve, but I consider the fact that it is helped to be a big plus. So, even though it is possible that this approach still gets fragmented under some workloads, we know that, in general, it does a pretty good job. > > Problem 4: Page reclaim would have two new zones to deal with bringing > > with it a new set of zone balancing problems. 
That brings it's own special > > brand of fun. > > > > There may be more problems but these 4 are fairly important. This patchset > > does not suffer from the same problems. > > > > If page reclaim can't deal with 5 zones then it is going to have problems > somewhere at 3 and needs to be fixed. I don't see how your patches get > around this fun by simply introducing their own balancing and fallback > heuristics. > If my approach gets the sizes of areas all wrong, it will fragment. If the zone-based approach gets the sizes of areas wrong, system performance degrades. I prefer the failure scenario of my approach :). > > Problem 1: This patchset has a fallback list for each allocation type. So > > EasyRclm allocations can just as easily use an area reserved for kernel > > allocations and vice versa. Obviously we don't like when this happens, but > > when it does, things start fragmenting rather than breaking. > > > > Problem 2: The number of pages that get reserved for each type grows and > > shrinks on demand. There is no tunable and no need for one. > > > > Problem 3: Problem doesn't exist for this patchset > > > > Problem 4: Problem doesn't exist for this patchset. > > > > Bottom line, using zones will be more complex than this set of patches and > > bring a lot of tricky issues with it. > > > > Maybe zones don't do exactly what you need, but I think they're better > than you think ;) > You may be right, but I still think that my approach is simpler and less likely to introduce horrible balancing problems. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 1:28 ` Mel Gorman @ 2005-11-01 1:42 ` Nick Piggin 0 siblings, 0 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-01 1:42 UTC (permalink / raw) To: Mel Gorman; +Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel Mel Gorman wrote: > On Tue, 1 Nov 2005, Nick Piggin wrote: > Ok, but the page colours would also need to be in the per-cpu lists; otherwise, this > new API that supplies vaddrs always takes the spinlock for the free lists. > I don't believe it would be cheaper and any benefit would only show up on > benchmarks that are cache sensitive. Judging by previous discussions on > page colouring in the mail archives, Linus will happily kick the approach > full of holes. > OK, but I'm just pointing out that improving page colouring doesn't require contiguous pages. > As for current performance, the Aim9 benchmarks show that the > fragmentation avoidance does not have a major performance penalty. A run > of the patches in the -mm tree should find out if there are performance > regressions on other machine types. > But I can see that there will be penalties. Cache misses, branches, etc. Obviously any new feature or more sophisticated behaviour is going to incur some of that, but it obviously needs good justification. >>Just be a different set of GFP flags. Your patches obviously also have >>some ordering imposed.... pagecache would want HighMemEasy, HighMem, >>NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA. >> > > > As well as a different set of GFP flags, we would also need new zone > fallback logic which will hit the __alloc_pages() path. It will be adding > more complexity to the allocator and we're replacing one type of > complexity with another. > It is complexity that is mostly already handled for us with the zones logic. Picking out a couple of small points that zones don't get exactly right isn't a good basis to come up with a completely new zoning layer. > >>Note that if you do need to make some changes to the zone allocator, then >>IMO that is far preferable to add a new layer of things-that-are-blocks-of- >>-memory-but-not-zones, complete with their own balancing and other heuristics. >> > > > Thing is, with my approach, the very worst that happens is that it > fragments just as bad as the normal allocator. With a zone-based approach, > the worst that happens is that the kernel zone is too small, kernel caches > do not grow to a suitable size and overall system performance degrades. > If you don't need to guarantee higher order allocations, then there is no problem with our current approach. If you do then you simply need to make a sacrifice. > >>>Problem 2: Setting the zone size will be a very difficult tunable to get >>>right. Right off, we are are introducing a tunable which will make >>>foreheads furrow. If the tunable is set wrong, system performance will >>>suffer and we could see situations where kernel allocations fail because >>>it's zone got depleted. >>> >> >>But even so, when you do automatic resizing, you seem to be adding a >>fundamental weak point in fragmentation avoidance. >> > > > The sizing I do is when a large block is split. Then the region is just > marked for a particular allocation type. This is very simple. The second > resizing that occurs is when a kernel allocation "steal" easyrclm pages. I > do not like the fact that we steal in this fashion but the alternative is > to teach kswapd how to reclaim easyrclm pages from other areas.
I view > this as "future work" but if it was done, the "steal" mechanism would go > away. > Weak point, as in: gets fragmented. > >>>Problem 3: To get rid of the tunable, we could try resizing the zones >>>dynamically but that will be hard. Obviously, the zones are going to be >>>physically adjacent to each other. To resize the zone, the pages at one >>>end of the zone will need to be free. Shrinking the NormalEasy zone would >>>be easy enough, but shrinking the Normal zone with kernel pages in it >>>would be considerably harder, if not outright impossible. One page in the >>>wrong place will mean the zone cannot be resized >>> >> >>OK, maybe it is hard ;) Do they really need to be resized, then? >> > > > I think we would need to, yes. If the size of the region is wrong, bad > things are likely to happen. If the kernel page zone is too small, it'll > be under pressure even though there is memory available elsewhere. If it's > too large, then it will get fragmented and high order allocations will > fail. > But people will just have to get it right then. If they want to be able to hot unplug 10G of memory, or allocate 4G of hugepages on demand, then they simply need to specify their requirements. Not too difficult? It is really nice to be able to place some burden on huge servers and mainframes, because they have people administering and tuning them full-time. It allows us to not penalise small servers and desktops. > >>Isn't the big memory hotunplug push aimed at virtual machines and >>hypervisors anyway? In which case one would presumably have some >>memory that "must" be reclaimable, in which case we can't expand >>non-Easy zones into that memory anyway. >> > > > I believe that is the case for hotplug all right, but not the case where > we just want to satisfy high order allocations in a reasonably reliable > fashion. In that case, it would be nice to reclaim an easyrclm region. > As I've said before, I think this is a false hope and we need to move away from higher order allocations. > It has already been reported by Mike Kravetz that memory remove works a > whole lot better on PPC64 with this patch than without it. Memory hotplug > remove was not the problem I was trying to solve, but I consider the fact > that it is helped to be a big plus. So, even though it is possible that > this approach still gets fragmented under some workloads, we know that, in > general, it does a pretty good job. > Sure, but using zones would work too, and on the plus side you would be able to specify exactly how much removable memory there should be. >> >>Maybe zones don't do exactly what you need, but I think they're better >>than you think ;) >> > > > You may be right, but I still think that my approach is simpler and less > likely to introduce horrible balancing problems. > Simpler? We already have zones though. They are a complexity we need to deal with already. I really can't see how you can use the simpler argument in favour of your patches ;) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 [not found] ` <3660000.1130787652@flay> @ 2005-10-31 23:59 ` Nick Piggin 2005-11-01 1:36 ` Mel Gorman 0 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-10-31 23:59 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrew Morton, kravetz, mel, linux-mm, linux-kernel, lhms-devel Martin J. Bligh wrote: > --On Monday, October 31, 2005 11:24:09 -0800 Andrew Morton <akpm@osdl.org> wrote: >>I suspect this would all be a non-issue if the net drivers were using >>__GFP_NOWARN ;) > > > We still need to allocate them, even if it's GFP_KERNEL. As memory gets > larger and larger, and we have no targetted reclaim, we'll have to blow > away more and more stuff at random before we happen to get contiguous > free areas. Just statistics aren't in your favour ... Getting 4 contig > pages on a 1GB desktop is much harder than on a 128MB machine. > However, these allocations are not of the "easy to reclaim" type, in which case they just use the regular fragmented-to-shit areas. If no contiguous pages are available from there, then an easy-reclaim area needs to be stolen, right? In which case I don't see why these patches don't have similar long term failure cases if there is strong demand for higher order allocations. Prolong things a bit, perhaps, but... > Is not going to get better as time goes on ;-) Yeah, yeah, I know, you > want recreates, numbers, etc. Not the easiest thing to reproduce in a > short-term consistent manner though. > Regardless, I think we need to continue our steady move away from higher order allocation requirements. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-10-31 23:59 ` Nick Piggin @ 2005-11-01 1:36 ` Mel Gorman 0 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-01 1:36 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel On Tue, 1 Nov 2005, Nick Piggin wrote: > Martin J. Bligh wrote: > > --On Monday, October 31, 2005 11:24:09 -0800 Andrew Morton <akpm@osdl.org> > > wrote: > > > > I suspect this would all be a non-issue if the net drivers were using > > > __GFP_NOWARN ;) > > > > > > We still need to allocate them, even if it's GFP_KERNEL. As memory gets > > larger and larger, and we have no targetted reclaim, we'll have to blow > > away more and more stuff at random before we happen to get contiguous > > free areas. Just statistics aren't in your favour ... Getting 4 contig > > pages on a 1GB desktop is much harder than on a 128MB machine. > > However, these allocations are not of the "easy to reclaim" type, in > which case they just use the regular fragmented-to-shit areas. If no > contiguous pages are available from there, then an easy-reclaim area > needs to be stolen, right? > Right. > In which case I don't see why these patches don't have similar long > term failure cases if there is strong demand for higher order > allocations. Prolong things a bit, perhaps, but... > It all hinges on how long-lived the high order kernel allocation is. If it's short-lived, it will get freed back to the easyrclm free lists and we don't fragment. If it turns out to be long lived, then we are in trouble. If this turns out to be the case, a possibility would be to use the __GFP_KERNRCLM flag for high-order, short-lived allocations. This would tend to group large free areas in the same place. It would only be worth investigating if we found that memory still got fragmented over very long periods of time. > > Is not going to get better as time goes on ;-) Yeah, yeah, I know, you > want recreates, numbers, etc. Not the easiest thing to reproduce in a > short-term consistent manner though. > > Regardless, I think we need to continue our steady move away from > higher order allocation requirements. > No argument with you there. My actual aim is to guarantee HugeTLB allocations for userspace which we currently have to reserve at boot time. Stuff like memory hotplug remove and high order kernel allocations are benefits that would be nice to pick up on the way. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
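Mel's __GFP_KERNRCLM suggestion, shown as a hypothetical caller: tag a high-order but short-lived allocation with the series' flag (it is not a mainline flag) so such buffers cluster in kernel-reclaimable blocks instead of splitting an easyrclm area. The function itself is invented for illustration:

#include <linux/gfp.h>

/* Hypothetical example only: an order-2 (4-page, 16KB) buffer that is
 * freed again quickly, tagged with the series' __GFP_KERNRCLM so
 * repeated allocations group with reclaimable kernel pages. */
static struct page *alloc_short_lived_buffer(void)
{
	return alloc_pages(GFP_KERNEL | __GFP_KERNRCLM, 2);
}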
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
       [not found] ` <4366C559.5090504@yahoo.com.au>
@ 2005-11-01 15:25   ` Martin J. Bligh
  2005-11-01 15:33     ` Dave Hansen
       [not found]     ` <Pine.LNX.4.58.0511010137020.29390@skynet>
  1 sibling, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-01 15:25 UTC (permalink / raw)
  To: Nick Piggin, Mel Gorman
  Cc: Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel,
      Ingo Molnar

> I really don't think we *want* to say we support higher order allocations
> absolutely robustly, nor do we want people using them if possible. Because
> we don't. Even with your patches.
>
> Ingo also brought up this point at Ottawa.

Some of the driver issues can be fixed by scatter-gather DMA *if* the
h/w supports it. But what exactly do you propose to do about kernel
stacks, etc? By the time you've fixed all the individual usages of it,
frankly, it would be easier to provide a generic mechanism to fix the
problem ...

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 15:25 ` Martin J. Bligh
@ 2005-11-01 15:33   ` Dave Hansen
  2005-11-01 16:57     ` Mel Gorman
  2005-11-01 18:58     ` Rob Landley
  0 siblings, 2 replies; 241+ messages in thread
From: Dave Hansen @ 2005-11-01 15:33 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Nick Piggin, Mel Gorman, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Ingo Molnar

On Tue, 2005-11-01 at 07:25 -0800, Martin J. Bligh wrote:
> > I really don't think we *want* to say we support higher order allocations
> > absolutely robustly, nor do we want people using them if possible. Because
> > we don't. Even with your patches.
> >
> > Ingo also brought up this point at Ottawa.
>
> Some of the driver issues can be fixed by scatter-gather DMA *if* the
> h/w supports it. But what exactly do you propose to do about kernel
> stacks, etc? By the time you've fixed all the individual usages of it,
> frankly, it would be easier to provide a generic mechanism to fix the
> problem ...

That generic mechanism is the kernel virtual remapping. However, it has
a runtime performance cost, which is increased TLB footprint inside the
kernel, and a more costly implementation of __pa() and __va().

I'll admit, I'm biased toward partial solutions without runtime cost
before we start incurring constant cost across the entire kernel,
especially when those partial solutions have other potential in-kernel
users.

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 15:33 ` Dave Hansen
@ 2005-11-01 16:57   ` Mel Gorman
  2005-11-01 17:00     ` Mel Gorman
  0 siblings, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-01 16:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Martin J. Bligh, Nick Piggin, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Ingo Molnar

On Tue, 1 Nov 2005, Dave Hansen wrote:

> On Tue, 2005-11-01 at 07:25 -0800, Martin J. Bligh wrote:
> > > I really don't think we *want* to say we support higher order allocations
> > > absolutely robustly, nor do we want people using them if possible. Because
> > > we don't. Even with your patches.
> > >
> > > Ingo also brought up this point at Ottawa.
> >
> > Some of the driver issues can be fixed by scatter-gather DMA *if* the
> > h/w supports it. But what exactly do you propose to do about kernel
> > stacks, etc? By the time you've fixed all the individual usages of it,
> > frankly, it would be easier to provide a generic mechanism to fix the
> > problem ...
>
> That generic mechanism is the kernel virtual remapping. However, it has
> a runtime performance cost, which is increased TLB footprint inside the
> kernel, and a more costly implementation of __pa() and __va().
>
> I'll admit, I'm biased toward partial solutions without runtime cost
> before we start incurring constant cost across the entire kernel,
> especially when those partial solutions have other potential in-kernel
> users.

To give an idea of the increased TLB footprint, I ran an aim9 test with
cpu_has_pse disabled in include/asm-i386/cpufeature.h to force the use
of small pages for the physical memory mappings.

These are the -clean results:

                          clean       clean-nopse
 1 creat-clo            16006.00      15294.90     -711.10  -4.44% File Creations and Closes/second
 2 page_test           117515.83     118677.11     1161.28   0.99% System Allocations & Pages/second
 3 brk_test            440289.81     436042.64    -4247.17  -0.96% System Memory Allocations/second
 4 jmp_test           4179466.67    4173266.67    -6200.00  -0.15% Non-local gotos/second
 5 signal_test          80803.20      78286.95    -2516.25  -3.11% Signal Traps/second
 6 exec_test               61.75         60.45       -1.30  -2.11% Program Loads/second
 7 fork_test             1327.01       1318.11       -8.90  -0.67% Task Creations/second
 8 link_test             5531.53       5406.60     -124.93  -2.26% Link/Unlink Pairs/second

This is what mbuddy-v19 with and without pse looks like:

                      mbuddy-v19  mbuddy-v19-nopse
 1 creat-clo            15889.41      15328.22     -561.19  -3.53% File Creations and Closes/second
 2 page_test           117082.15     116892.70     -189.45  -0.16% System Allocations & Pages/second
 3 brk_test            437887.37     432716.97    -5170.40  -1.18% System Memory Allocations/second
 4 jmp_test           4179950.00    4176087.32    -3862.68  -0.09% Non-local gotos/second
 5 signal_test          85335.78      78553.57    -6782.21  -7.95% Signal Traps/second
 6 exec_test               61.92         60.61       -1.31  -2.12% Program Loads/second
 7 fork_test             1342.21       1292.26      -49.95  -3.72% Task Creations/second
 8 link_test             5555.55       5412.90     -142.65  -2.57% Link/Unlink Pairs/second

--
Mel Gorman
Part-time PhD Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 16:57 ` Mel Gorman
@ 2005-11-01 17:00   ` Mel Gorman
  0 siblings, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-01 17:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Martin J. Bligh, Nick Piggin, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Ingo Molnar

On Tue, 1 Nov 2005, Mel Gorman wrote:

> On Tue, 1 Nov 2005, Dave Hansen wrote:
>
> > That generic mechanism is the kernel virtual remapping. However, it has
> > a runtime performance cost, which is increased TLB footprint inside the
> > kernel, and a more costly implementation of __pa() and __va().
> >
> > I'll admit, I'm biased toward partial solutions without runtime cost
> > before we start incurring constant cost across the entire kernel,
> > especially when those partial solutions have other potential in-kernel
> > users.
>
> To give an idea of the increased TLB footprint, I ran an aim9 test with
> cpu_has_pse disabled in include/asm-i386/cpufeature.h to force the use
> of small pages for the physical memory mappings.
>
> These are the -clean results:
>
>                           clean       clean-nopse
>  1 creat-clo            16006.00      15294.90     -711.10  -4.44% File Creations and Closes/second
>  2 page_test           117515.83     118677.11     1161.28   0.99% System Allocations & Pages/second
>  3 brk_test            440289.81     436042.64    -4247.17  -0.96% System Memory Allocations/second
>  4 jmp_test           4179466.67    4173266.67    -6200.00  -0.15% Non-local gotos/second
>  5 signal_test          80803.20      78286.95    -2516.25  -3.11% Signal Traps/second
>  6 exec_test               61.75         60.45       -1.30  -2.11% Program Loads/second
>  7 fork_test             1327.01       1318.11       -8.90  -0.67% Task Creations/second
>  8 link_test             5531.53       5406.60     -124.93  -2.26% Link/Unlink Pairs/second
>
> This is what mbuddy-v19 with and without pse looks like:
>
>                       mbuddy-v19  mbuddy-v19-nopse
>  1 creat-clo            15889.41      15328.22     -561.19  -3.53% File Creations and Closes/second
>  2 page_test           117082.15     116892.70     -189.45  -0.16% System Allocations & Pages/second
>  3 brk_test            437887.37     432716.97    -5170.40  -1.18% System Memory Allocations/second
>  4 jmp_test           4179950.00    4176087.32    -3862.68  -0.09% Non-local gotos/second
>  5 signal_test          85335.78      78553.57    -6782.21  -7.95% Signal Traps/second
>  6 exec_test               61.92         60.61       -1.31  -2.12% Program Loads/second
>  7 fork_test             1342.21       1292.26      -49.95  -3.72% Task Creations/second
>  8 link_test             5555.55       5412.90     -142.65  -2.57% Link/Unlink Pairs/second
>

I forgot to include the comparison between -clean and -mbuddy-v19-nopse:

                          clean   mbuddy-v19-nopse
 1 creat-clo            16006.00      15328.22     -677.78  -4.23% File Creations and Closes/second
 2 page_test           117515.83     116892.70     -623.13  -0.53% System Allocations & Pages/second
 3 brk_test            440289.81     432716.97    -7572.84  -1.72% System Memory Allocations/second
 4 jmp_test           4179466.67    4176087.32    -3379.35  -0.08% Non-local gotos/second
 5 signal_test          80803.20      78553.57    -2249.63  -2.78% Signal Traps/second
 6 exec_test               61.75         60.61       -1.14  -1.85% Program Loads/second
 7 fork_test             1327.01       1292.26      -34.75  -2.62% Task Creations/second
 8 link_test             5531.53       5412.90     -118.63  -2.14% Link/Unlink Pairs/second

--
Mel Gorman
Part-time PhD Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
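For anyone following the arithmetic in these tables: the last two numeric
columns are plain absolute and relative deltas against the first column. A
trivial standalone check (my restatement, not whatever script generated these
reports), using the creat-clo row from the table above:

	#include <stdio.h>

	int main(void)
	{
		double base = 16006.00, test = 15328.22;	/* creat-clo row */

		/* prints "-677.78 -4.23%", matching the table */
		printf("%.2f %.2f%%\n", test - base,
		       (test - base) / base * 100.0);
		return 0;
	}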
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 15:33 ` Dave Hansen
  2005-11-01 16:57   ` Mel Gorman
@ 2005-11-01 18:58   ` Rob Landley
  1 sibling, 0 replies; 241+ messages in thread
From: Rob Landley @ 2005-11-01 18:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Martin J. Bligh, Nick Piggin, Mel Gorman, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms, Ingo Molnar

On Tuesday 01 November 2005 09:33, Dave Hansen wrote:
> On Tue, 2005-11-01 at 07:25 -0800, Martin J. Bligh wrote:
> > > I really don't think we *want* to say we support higher order
> > > allocations absolutely robustly, nor do we want people using them if
> > > possible. Because we don't. Even with your patches.
> > >
> > > Ingo also brought up this point at Ottawa.
> >
> > Some of the driver issues can be fixed by scatter-gather DMA *if* the
> > h/w supports it. But what exactly do you propose to do about kernel
> > stacks, etc? By the time you've fixed all the individual usages of it,
> > frankly, it would be easier to provide a generic mechanism to fix the
> > problem ...
>
> That generic mechanism is the kernel virtual remapping. However, it has
> a runtime performance cost, which is increased TLB footprint inside the
> kernel, and a more costly implementation of __pa() and __va().

Ok, right now the kernel _has_ a virtual mapping, it's just a 1:1 with
the physical mapping, right?

In theory, if you restrict all kernel unmovable mappings to a physically
contiguous address range (something like ZONE_DMA) that's at the start
of the physical address space, then what you could do is have a
two-kernel-monte like situation where if you _NEED_ to move the kernel
you quiesce the system (as if you're going to swsusp), figure out where
the new start of physical memory will be when this bank goes bye-bye,
memcpy the whole mess to the new location, adjust your one VMA, and then
call the swsusp unfreeze stuff.

This is ugly, and a huge latency spike, but why wouldn't it work? The
problem now becomes finding some NEW physically contiguous range to
shoehorn the kernel into, and that's a problem that Mel's already
addressing...

Rob

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
       [not found] ` <Pine.LNX.4.58.0511011014060.14884@skynet>
@ 2005-11-01 13:56   ` Ingo Molnar
  2005-11-01 14:10     ` Dave Hansen
                       ` (2 more replies)
  0 siblings, 3 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-01 13:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
      linux-kernel, lhms-devel

* Mel Gorman <mel@csn.ul.ie> wrote:

> The set of patches do fix a lot and make a strong start at addressing
> the fragmentation problem, just not 100% of the way. [...]

do you have an expectation to be able to solve the 'fragmentation
problem', all the time, in a 100% way, now or in the future?

> So, with this set of patches, how fragmented you get is dependent on
> the workload and it may still break down and high order allocations
> will fail. But the current situation is that it will definitely break
> down. The fact is that it has been reported that memory hotplug remove
> works with these patches and doesn't without them. Granted, this is
> just one feature on a high-end machine, but it is one solid operation
> we can perform with the patches and cannot without them. [...]

can you always, under any circumstance, hot unplug RAM with these
patches applied? If not, do you have any expectation to reach 100%?

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 13:56 ` Ingo Molnar
@ 2005-11-01 14:10   ` Dave Hansen
  2005-11-01 14:29     ` Ingo Molnar
  2005-11-01 14:41   ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
  2005-11-01 18:23   ` Rob Landley
  2 siblings, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2005-11-01 14:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms

On Tue, 2005-11-01 at 14:56 +0100, Ingo Molnar wrote:
> * Mel Gorman <mel@csn.ul.ie> wrote:
>
> > The set of patches do fix a lot and make a strong start at addressing
> > the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation
> problem', all the time, in a 100% way, now or in the future?

In a word, yes. The current allocator has no design for measuring or
reducing fragmentation. These patches provide the framework for at
least measuring fragmentation.

The patches cannot do anything magical, and there will be a point where
the system has to make a choice: fragment, or fail an allocation when
there _is_ free memory. These patches take us in a direction where we
are capable of making such a decision.

> > So, with this set of patches, how fragmented you get is dependent on
> > the workload and it may still break down and high order allocations
> > will fail. But the current situation is that it will definitely break
> > down. The fact is that it has been reported that memory hotplug remove
> > works with these patches and doesn't without them. Granted, this is
> > just one feature on a high-end machine, but it is one solid operation
> > we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance, hot unplug RAM with these
> patches applied? If not, do you have any expectation to reach 100%?

With these patches, no. There are currently some very nice,
pathological workloads which will still cause fragmentation. But, in
the interest of incremental feature introduction, I think they're a
fine first step. We can effectively reach toward a more comprehensive
solution on top of these patches.

Reaching truly 100% will require some other changes, such as being able
to virtually remap things like kernel text.

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 14:10 ` Dave Hansen
@ 2005-11-01 14:29   ` Ingo Molnar
  2005-11-01 14:49     ` Dave Hansen
  0 siblings, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-01 14:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms

* Dave Hansen <haveblue@us.ibm.com> wrote:

> > can you always, under any circumstance, hot unplug RAM with these
> > patches applied? If not, do you have any expectation to reach 100%?
>
> With these patches, no. There are currently some very nice,
> pathological workloads which will still cause fragmentation. But, in
> the interest of incremental feature introduction, I think they're a
> fine first step. We can effectively reach toward a more comprehensive
> solution on top of these patches.
>
> Reaching truly 100% will require some other changes, such as being able
> to virtually remap things like kernel text.

then we need to see that 100% solution first - at least in terms of
conceptual steps. Not being able to hot-unplug RAM in a 100% way won't
satisfy customers. Whatever solution we choose, it must work 100%.

Just to give a comparison: would you be content with your computer
failing to start up apps 1 time out of 100, saying that 99% is good
enough? Or would you call it what it is: buggy and unreliable?

to stress it: hot unplug is a _feature_ that must work 100%, _not_ some
optimization where 99% is good enough. This is a feature that people
will be depending on if we promise it, and a 1% failure rate is not
acceptable. Your 'pathological workload' might be customer X's daily
workload. Unless there is a clear definition of what is possible and
what is not (which definition can be relied upon by users), having a
99% solution is much worse than the current 0% solution!

worse than that, this is a known _hard_ problem to solve in a 100% way,
and saying 'this patch is a good first step' just lures us (and
customers) into believing that we are only 1% away from the desired
100% solution, while nothing could be further from the truth. They will
demand the remaining 1%, but can we offer it? Unless you can provide a
clear, agreed-upon path towards the 100% solution, we have nothing
right now.

I have no problems with using higher-order pages for performance
purposes [*], as long as 'failed' allocation (and freeing) actions are
user-invisible. But the moment you make it user-visible, it _must_ work
in a deterministic way!

	Ingo

[*] in which case any slowdown in the page allocator must be offset by
    the gains.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 14:29 ` Ingo Molnar
@ 2005-11-01 14:49   ` Dave Hansen
  2005-11-01 15:01     ` Ingo Molnar
  2005-11-02  0:51     ` Nick Piggin
  0 siblings, 2 replies; 241+ messages in thread
From: Dave Hansen @ 2005-11-01 14:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms

On Tue, 2005-11-01 at 15:29 +0100, Ingo Molnar wrote:
> * Dave Hansen <haveblue@us.ibm.com> wrote:
> > > can you always, under any circumstance, hot unplug RAM with these
> > > patches applied? If not, do you have any expectation to reach 100%?
> >
> > With these patches, no. There are currently some very nice,
> > pathological workloads which will still cause fragmentation. But, in
> > the interest of incremental feature introduction, I think they're a
> > fine first step. We can effectively reach toward a more comprehensive
> > solution on top of these patches.
> >
> > Reaching truly 100% will require some other changes, such as being able
> > to virtually remap things like kernel text.
>
> then we need to see that 100% solution first - at least in terms of
> conceptual steps.

I don't think saying "truly 100%" really even makes sense. There will
always be restrictions of some kind. For instance, with a 10MB kernel
image, should you be able to shrink the memory in the system below
10MB? ;)

There is also no precedent in existing UNIXes for a 100% solution. From
http://publib.boulder.ibm.com/infocenter/pseries/index.jsp?topic=/com.ibm.aix.doc/aixbman/prftungd/dlpar.htm ,
a seemingly arbitrary restriction:

	A memory region that contains a large page cannot be removed.

What the fragmentation patches _can_ give us is the ability to have
100% success in removing certain areas: the "user-reclaimable" areas
referenced in the patch. This gives a customer at least the ability to
plan for how dynamically reconfigurable a system should be.

After these patches, the next logical steps are to increase the
knowledge that the slabs have about fragmentation, and to teach some of
the shrinkers about fragmentation.

After that, we'll need some kind of virtual remapping, breaking the 1:1
kernel virtual mapping, so that the most problematic pages can be
remapped. These pages would retain their virtual addresses, but get new
physical ones. However, this is quite far down the road and will
require some serious evaluation because it impacts how normal devices
are able to do DMA. The ppc64 proprietary hypervisor has features to
work around these issues, and any new hypervisors wishing to support
partition memory hotplug would likely have to follow suit.

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 14:49 ` Dave Hansen
@ 2005-11-01 15:01   ` Ingo Molnar
  2005-11-01 15:22     ` Dave Hansen
  2005-11-01 16:48     ` Kamezawa Hiroyuki
  0 siblings, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-01 15:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms

* Dave Hansen <haveblue@us.ibm.com> wrote:

> > then we need to see that 100% solution first - at least in terms of
> > conceptual steps.
>
> I don't think saying "truly 100%" really even makes sense. There will
> always be restrictions of some kind. For instance, with a 10MB kernel
> image, should you be able to shrink the memory in the system below
> 10MB? ;)

think of it in terms of filesystem shrinking: yes, obviously you cannot
shrink to below the allocated size, but no user expects to be able to
do it. But users would not accept filesystem shrinking failing for
certain file layouts. In that case we are better off with no ability to
shrink: it makes it clear that we have not solved the problem, yet.

so it's all about expectations: _could_ you reasonably remove a piece
of RAM? Customer will say: "I have stopped all nonessential services,
and free RAM is at 90%, still I cannot remove that piece of faulty RAM,
fix the kernel!". No reasonable customer will say: "True, I have all
RAM used up in mlock()ed sections, but i want to remove some RAM
nevertheless".

> There is also no precedent in existing UNIXes for a 100% solution.

does this have any relevance to the point, other than to prove that
it's a hard problem that we should not pretend to be able to solve,
without seeing a clear path towards a solution?

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-01 15:01 ` Ingo Molnar
@ 2005-11-01 15:22   ` Dave Hansen
       [not found]     ` <20051102084946.GA3930@elte.hu>
  0 siblings, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2005-11-01 15:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms

On Tue, 2005-11-01 at 16:01 +0100, Ingo Molnar wrote:
> so it's all about expectations: _could_ you reasonably remove a piece
> of RAM? Customer will say: "I have stopped all nonessential services,
> and free RAM is at 90%, still I cannot remove that piece of faulty RAM,
> fix the kernel!".

That's an excellent example. Until we have some kind of kernel
remapping, breaking the 1:1 kernel virtual mapping, these pages will
always exist. The easiest example of this kind of memory is kernel
text.

Another example might be a somewhat errant device driver which has
allocated some large buffers and is doing DMA to or from them. In this
case, we need to have APIs to require devices to give up and reacquire
any dynamically allocated structures (a sketch of what such an
interface might look like follows this mail). If the device driver does
not implement these APIs, it is not compatible with memory hotplug.

> > There is also no precedent in existing UNIXes for a 100% solution.
>
> does this have any relevance to the point, other than to prove that
> it's a hard problem that we should not pretend to be able to solve,
> without seeing a clear path towards a solution?

Agreed. It is a hard problem. One that some other UNIXes have not fully
solved.

Here are the steps that I think we need to take. Do you see any holes
in their coverage? Anything that seems infeasible?

1. Fragmentation avoidance
   * by itself, increases the likelihood of having an area of memory
     which might be easily removed
   * very small (if any) performance overhead
   * other potential in-kernel users
   * creates infrastructure to enforce the "hotpluggability" of any
     particular area of memory

2. Driver APIs
   * Require that drivers specifically request areas which must retain
     constant physical addresses
   * Driver must relinquish control of such areas upon request
   * Can be worked around by hypervisors

3. Break 1:1 Kernel Virtual/Physical Mapping
   * In any large area of physical memory we wish to remove, there will
     likely be very, very few straggler pages which can not easily be
     freed.
   * Kernel will transparently move the contents of these physical
     pages to new pages, keeping constant virtual addresses.
   * Increased TLB overhead, as in-kernel large page mappings are
     broken down into smaller pages.
   * __{p,v}a() become more expensive, likely a table lookup (also
     sketched below)

I've already done (3) on a limited basis, in the early days of memory
hotplug. Not the remapping, just breaking the 1:1 assumptions. It
wasn't too horribly painful.

We'll also need to make some decisions along the way about what to do
about things like large pages. Is it better to just punt like AIX and
refuse to remove their areas? Break them down into small pages and
degrade performance?

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
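A rough sketch of the shape the driver API in step 2 might take, assuming it
is built as a callback pair; nothing like this exists in the kernel at this
point, and every name below (memhotplug_ops, quiesce, reacquire) is
hypothetical:

	/* Hypothetical sketch only: the callback pair described above. A
	 * driver holding long-lived buffers at fixed physical addresses
	 * would have to implement these, or be declared incompatible with
	 * memory hotplug. */
	#include <linux/device.h>

	struct memhotplug_ops {
		/* tear down DMA buffers pinned in the region being removed */
		int (*quiesce)(struct device *dev);
		/* reallocate them (possibly elsewhere) once the remove is done */
		int (*reacquire)(struct device *dev);
	};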
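And to illustrate the cost in step 3: today __pa()/__va() are a single add or
subtract of PAGE_OFFSET, while a remapped kernel would have to consult a
table for the few straggler pages first. This is my sketch of the idea with
invented names (remap_table, slow_pa) and i386-style constants, not code from
any posted patch:

	/* Sketch only: what a table-based __pa() might look like once the
	 * 1:1 kernel mapping is broken for a handful of remapped
	 * "straggler" pages. All names here are hypothetical. */
	#define PAGE_SHIFT	12
	#define PAGE_OFFSET	0xc0000000UL
	#define PAGE_MASK	(~((1UL << PAGE_SHIFT) - 1))

	struct remap_entry {
		unsigned long vpfn;	/* kernel virtual page frame number */
		unsigned long ppfn;	/* physical frame it currently maps to */
	};

	extern struct remap_entry remap_table[];
	extern int nr_remapped;

	static unsigned long slow_pa(unsigned long vaddr)
	{
		unsigned long vpfn = (vaddr - PAGE_OFFSET) >> PAGE_SHIFT;
		int i;

		/* slow path: the few pages whose physical backing has moved */
		for (i = 0; i < nr_remapped; i++)
			if (remap_table[i].vpfn == vpfn)
				return (remap_table[i].ppfn << PAGE_SHIFT) |
				       (vaddr & ~PAGE_MASK);

		/* fast path: still identity-mapped at PAGE_OFFSET */
		return vaddr - PAGE_OFFSET;
	}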
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
       [not found] ` <436880B8.1050207@yahoo.com.au>
@ 2005-11-02  9:32   ` Dave Hansen
  2005-11-02  9:48     ` Nick Piggin
  0 siblings, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2005-11-02 9:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

On Wed, 2005-11-02 at 20:02 +1100, Nick Piggin wrote:
> I agree. Especially considering that all this memory hotplug usage for
> hypervisors etc. is a relatively new thing with few of our userbase
> actually using it. I think a simple zones solution is the right way to
> go for now.

I agree enough on concept that I think we can go implement at least a
demonstration of how easy it is to perform.

There are a couple of implementation details that will require some
changes to the current zone model, however. Perhaps you have some
suggestions on those.

In which zone do we place hot-added RAM? I don't think the answer can
simply be the HOTPLUGGABLE zone. If you start with a sufficiently small
machine, you'll degrade into the same horrible HIGHMEM behavior that a
64GB ia32 machine has today, regardless of your architecture. Think of
a machine that starts out with a size of 256MB and grows to 1TB.

So, if you have to add to NORMAL/DMA on the fly, how do you handle a
case where the new NORMAL/DMA RAM is physically above
HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
required to be able to span other zones, and be noncontiguous? Would
that represent too much of a change to the current model?

From where do we perform reclaim when we run out of a particular zone?
Getting reclaim rates of the HIGHMEM and NORMAL zones balanced has been
hard, and I worry that we never quite got it right. Introducing yet
another zone makes this harder.

Should we allow allocations for NORMAL to fall back into HOTPLUGGABLE
in any case?

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-02  9:32 ` Dave Hansen
@ 2005-11-02  9:48   ` Nick Piggin
  2005-11-02 10:54     ` Dave Hansen
  2005-11-02 15:02     ` Martin J. Bligh
  0 siblings, 2 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 9:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

Dave Hansen wrote:
> On Wed, 2005-11-02 at 20:02 +1100, Nick Piggin wrote:
>
>> I agree. Especially considering that all this memory hotplug usage for
>> hypervisors etc. is a relatively new thing with few of our userbase
>> actually using it. I think a simple zones solution is the right way to
>> go for now.
>
> I agree enough on concept that I think we can go implement at least a
> demonstration of how easy it is to perform.
>
> There are a couple of implementation details that will require some
> changes to the current zone model, however. Perhaps you have some
> suggestions on those.
>
> In which zone do we place hot-added RAM? I don't think the answer can
> simply be the HOTPLUGGABLE zone. If you start with a sufficiently small
> machine, you'll degrade into the same horrible HIGHMEM behavior that a
> 64GB ia32 machine has today, regardless of your architecture. Think of
> a machine that starts out with a size of 256MB and grows to 1TB.
>

What can we do reasonably sanely? I think we can drive about 16GB of
highmem per 1GB of normal fairly well. So on your 1TB system, you
should be able to unplug 960GB RAM.

Lower the ratio to taste if you happen to be doing something
particularly zone normal intensive - remember in that case the frag
patches won't buy you anything more because a zone normal intensive
workload is going to cause unreclaimable regions by definition.

> So, if you have to add to NORMAL/DMA on the fly, how do you handle a
> case where the new NORMAL/DMA RAM is physically above
> HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
> required to be able to span other zones, and be noncontiguous? Would
> that represent too much of a change to the current model?
>

Perhaps. Perhaps it wouldn't be required to get a solution that is
"good enough" though.

But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
it all and expand your normal zones into it, bottom up.

> From where do we perform reclaim when we run out of a particular zone?
> Getting reclaim rates of the HIGHMEM and NORMAL zones balanced has been
> hard, and I worry that we never quite got it right. Introducing yet
> another zone makes this harder.
>

We didn't get it right, but there are fairly simple things we can do
(http://marc.theaimsgroup.com/?l=linux-kernel&m=113082256231168&w=2)
to improve things remarkably, and having yet more users should result
in even more improvements.

We still have ZONE_DMA and ZONE_DMA32, so we can't afford to just
abandon zones because they're crap ;)

> Should we allow allocations for NORMAL to fall back into HOTPLUGGABLE
> in any case?
>

I think this would defeat the purpose if we really want to set limits,
but we could have a sysctl perhaps to turn it on or off, or say, only
allow it if the alternative is going OOM.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-02  9:48 ` Nick Piggin
@ 2005-11-02 10:54   ` Dave Hansen
  0 siblings, 0 replies; 241+ messages in thread
From: Dave Hansen @ 2005-11-02 10:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

On Wed, 2005-11-02 at 20:48 +1100, Nick Piggin wrote:
> > So, if you have to add to NORMAL/DMA on the fly, how do you handle a
> > case where the new NORMAL/DMA RAM is physically above
> > HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
> > required to be able to span other zones, and be noncontiguous? Would
> > that represent too much of a change to the current model?
>
> Perhaps. Perhaps it wouldn't be required to get a solution that is
> "good enough" though.
>
> But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
> it all and expand your normal zones into it, bottom up.

That's a good point. It would be slow, because you have to wait on page
reclaim, but it would work. I do worry a bit that this might make
adding memory too slow an operation to be useful for short periods, but
we'll see how it actually behaves.

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-02  9:48 ` Nick Piggin
  2005-11-02 10:54   ` Dave Hansen
@ 2005-11-02 15:02   ` Martin J. Bligh
  2005-11-03  3:21     ` Nick Piggin
  1 sibling, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-02 15:02 UTC (permalink / raw)
  To: Nick Piggin, Dave Hansen
  Cc: Ingo Molnar, Mel Gorman, Andrew Morton, Linus Torvalds, kravetz,
      linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

>> I agree enough on concept that I think we can go implement at least a
>> demonstration of how easy it is to perform.
>>
>> There are a couple of implementation details that will require some
>> changes to the current zone model, however. Perhaps you have some
>> suggestions on those.
>>
>> In which zone do we place hot-added RAM? I don't think the answer can
>> simply be the HOTPLUGGABLE zone. If you start with a sufficiently small
>> machine, you'll degrade into the same horrible HIGHMEM behavior that a
>> 64GB ia32 machine has today, regardless of your architecture. Think of
>> a machine that starts out with a size of 256MB and grows to 1TB.
>
> What can we do reasonably sanely? I think we can drive about 16GB of
> highmem per 1GB of normal fairly well. So on your 1TB system, you
> should be able to unplug 960GB RAM.

I think you need to talk to some more users trying to run 16GB ia32
systems. Feel the pain.

> Lower the ratio to taste if you happen to be doing something
> particularly zone normal intensive - remember in that case the frag
> patches won't buy you anything more because a zone normal intensive
> workload is going to cause unreclaimable regions by definition.
>
>> So, if you have to add to NORMAL/DMA on the fly, how do you handle a
>> case where the new NORMAL/DMA RAM is physically above
>> HIGHMEM/HOTPLUGGABLE? Is there any other course than to make a zone
>> required to be able to span other zones, and be noncontiguous? Would
>> that represent too much of a change to the current model?
>
> Perhaps. Perhaps it wouldn't be required to get a solution that is
> "good enough" though.
>
> But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
> it all and expand your normal zones into it, bottom up.

Can we quit coming up with specialist hacks for hotplug, and try to
solve the generic problem please? hotplug is NOT the only issue here.
Fragmentation in general is.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-02 15:02 ` Martin J. Bligh
@ 2005-11-03  3:21   ` Nick Piggin
  2005-11-03 15:36     ` Martin J. Bligh
  0 siblings, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-03 3:21 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

Martin J. Bligh wrote:

>> What can we do reasonably sanely? I think we can drive about 16GB of
>> highmem per 1GB of normal fairly well. So on your 1TB system, you
>> should be able to unplug 960GB RAM.
>
> I think you need to talk to some more users trying to run 16GB ia32
> systems. Feel the pain.
>

OK, make it 8GB then. And as a bonus we get all you IBM guys back on
the case again to finish the job that was started on highmem :)

And as another bonus, you actually *have* the capability to unplug
memory or use hugepages of exactly the size you require, which is not
the case with the frag patches.

>> But if you can reclaim your ZONE_RECLAIMABLE, then you could reclaim
>> it all and expand your normal zones into it, bottom up.
>
> Can we quit coming up with specialist hacks for hotplug, and try to
> solve the generic problem please? hotplug is NOT the only issue here.
> Fragmentation in general is.
>

Not really it isn't. There have been a few cases (e1000 being the main
one, and it is fixed upstream) where fragmentation in general is a
problem. But mostly it is not.

Anyone who thinks they can start using higher order allocations willy
nilly after Mel's patch, I'm fairly sure they're wrong, because they
are just going to be using up the contiguous regions.

Trust me, if the frag patches were a general solution that solved the
generic fragmentation problem I would be a lot less concerned about
the complexity they introduce. But even then it only seems to be a
problem that a very small number of users care about.

Anyway I keep saying the same things (sorry) so I'll stop now.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03  3:21 ` Nick Piggin
@ 2005-11-03 15:36   ` Martin J. Bligh
  2005-11-03 15:40     ` Arjan van de Ven
  0 siblings, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

>> Can we quit coming up with specialist hacks for hotplug, and try to
>> solve the generic problem please? hotplug is NOT the only issue here.
>> Fragmentation in general is.
>
> Not really it isn't. There have been a few cases (e1000 being the main
> one, and it is fixed upstream) where fragmentation in general is a
> problem. But mostly it is not.

Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
And devices that don't support scatter-gather DMA.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 15:36 ` Martin J. Bligh
@ 2005-11-03 15:40   ` Arjan van de Ven
  2005-11-03 15:51     ` Linus Torvalds
  2005-11-03 15:53     ` Martin J. Bligh
  1 sibling, 2 replies; 241+ messages in thread
From: Arjan van de Ven @ 2005-11-03 15:40 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
      Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> >> Can we quit coming up with specialist hacks for hotplug, and try to
> >> solve the generic problem please? hotplug is NOT the only issue here.
> >> Fragmentation in general is.
> >
> > Not really it isn't. There have been a few cases (e1000 being the main
> > one, and it is fixed upstream) where fragmentation in general is a
> > problem. But mostly it is not.
>
> Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.

with CONFIG_4KSTACKS :)

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 15:40 ` Arjan van de Ven
@ 2005-11-03 15:51   ` Linus Torvalds
  2005-11-03 15:57     ` Martin J. Bligh
                       ` (2 more replies)
  2005-11-03 15:53   ` Martin J. Bligh
  1 sibling, 3 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 15:51 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Martin J. Bligh, Nick Piggin, Dave Hansen, Ingo Molnar,
      Mel Gorman, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Arjan van de Ven wrote:

> On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> > >> Can we quit coming up with specialist hacks for hotplug, and try to
> > >> solve the generic problem please? hotplug is NOT the only issue here.
> > >> Fragmentation in general is.
> > >
> > > Not really it isn't. There have been a few cases (e1000 being the main
> > > one, and it is fixed upstream) where fragmentation in general is a
> > > problem. But mostly it is not.
> >
> > Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
>
> with CONFIG_4KSTACKS :)

2-page allocations are _not_ a problem.

Especially not for fork()/clone(). If you don't even have 2-page
contiguous areas, you are doing something _wrong_, or you're so low on
memory that there's no point in forking any more.

Don't confuse "fragmentation" with "perfectly spread out page
allocations".

Fragmentation means that it gets _exponentially_ more unlikely that you
can allocate big contiguous areas. But contiguous areas of order 1 are
very very likely indeed. It's only the _big_ areas that aren't going to
happen.

This is why fragmentation avoidance has always been totally useless. It is
 - only useful for big areas
 - very hard for big areas

(Corollary: when it's easy and possible, it's not useful).

Don't do it. We've never done it, and we've been fine. Claiming that
fork() is a reason to do fragmentation avoidance is invalid.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 15:51 ` Linus Torvalds
@ 2005-11-03 15:57   ` Martin J. Bligh
  0 siblings, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:57 UTC (permalink / raw)
  To: Linus Torvalds, Arjan van de Ven
  Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
      kravetz, linux-mm, Linux Kernel Mailing List, lhms,
      Arjan van de Ven

>> with CONFIG_4KSTACKS :)
>
> 2-page allocations are _not_ a problem.
>
> Especially not for fork()/clone(). If you don't even have 2-page
> contiguous areas, you are doing something _wrong_, or you're so low on
> memory that there's no point in forking any more.

64 bit platforms need kernel stacks > 8K, it seems.

> Don't confuse "fragmentation" with "perfectly spread out page
> allocations".
>
> Fragmentation means that it gets _exponentially_ more unlikely that you
> can allocate big contiguous areas. But contiguous areas of order 1 are
> very very likely indeed. It's only the _big_ areas that aren't going to
> happen.
>
> This is why fragmentation avoidance has always been totally useless. It is
>  - only useful for big areas
>  - very hard for big areas
>
> (Corollary: when it's easy and possible, it's not useful).
>
> Don't do it. We've never done it, and we've been fine. Claiming that
> fork() is a reason to do fragmentation avoidance is invalid.

With respect, we have not been fine. We see problems fairly regularly
with higher order allocations, even with no large page or hotplug
issues involved. Drivers, CIFS, kernel stacks, etc, etc, etc. The
larger memory gets, the worse the problem is, just because the
statistics make it less likely to free up multiple contiguous pages.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 15:51 ` Linus Torvalds
  2005-11-03 15:57   ` Martin J. Bligh
@ 2005-11-03 16:20   ` Arjan van de Ven
  2005-11-03 16:27   ` Mel Gorman
  2 siblings, 0 replies; 241+ messages in thread
From: Arjan van de Ven @ 2005-11-03 16:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin J. Bligh, Nick Piggin, Dave Hansen, Ingo Molnar,
      Mel Gorman, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 2005-11-03 at 07:51 -0800, Linus Torvalds wrote:
>
> On Thu, 3 Nov 2005, Arjan van de Ven wrote:
>
> > On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> > > Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
> >
> > with CONFIG_4KSTACKS :)
>
> 2-page allocations are _not_ a problem.

agreed for the general case. There are some corner cases that you can
trigger deliberately in an artificial setting with lots of java threads
(esp on x86 on a 32GB box; the lowmem zone works as a lever here,
leading to "hyperfragmentation"; otoh on x86 you can do 4k stacks and
then it's mostly gone)

> Fragmentation means that it gets _exponentially_ more unlikely that you
> can allocate big contiguous areas. But contiguous areas of order 1 are
> very very likely indeed. It's only the _big_ areas that aren't going to
> happen.

yup. only possible exception is the leveraged scenario .. thank god for
64 bit x86-64. (and in the leveraged scenario I don't think active
defragmentation will buy you much over the long term at all)

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 15:51 ` Linus Torvalds
  2005-11-03 15:57   ` Martin J. Bligh
  2005-11-03 16:20   ` Arjan van de Ven
@ 2005-11-03 16:27   ` Mel Gorman
  2005-11-03 16:46     ` Linus Torvalds
  2 siblings, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-03 16:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Martin J. Bligh, Nick Piggin, Dave Hansen,
      Ingo Molnar, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Linus Torvalds wrote:

> 2-page allocations are _not_ a problem.
>
> Especially not for fork()/clone(). If you don't even have 2-page
> contiguous areas, you are doing something _wrong_, or you're so low on
> memory that there's no point in forking any more.
>
> Don't confuse "fragmentation" with "perfectly spread out page
> allocations".
>
> Fragmentation means that it gets _exponentially_ more unlikely that you
> can allocate big contiguous areas. But contiguous areas of order 1 are
> very very likely indeed. It's only the _big_ areas that aren't going to
> happen.
>

For me, it's the big areas that I am interested in, especially if we
want to give HugeTLB pages to a user when they are asking for them. The
obvious ones here are database and HPC loads, particularly the HPC
loads, which may not have had a chance to reserve what they needed at
boot time. These loads need 1024 contiguous pages on the x86 at least,
not 2. We can free all we want on today's kernels and you're not going
to get more than one or two blocks this large unless you are very
lucky.

Hotplug is, for me, an additional benefit. For others, it is the main
benefit. Others, of course, don't care, but they don't care about
scalability to 64 processors either, and we provide that anyway at a
low cost to smaller machines.

> This is why fragmentation avoidance has always been totally useless. It is
>  - only useful for big areas
>  - very hard for big areas
>
> (Corollary: when it's easy and possible, it's not useful).
>

Unless you are a user that wants a large area, when it suddenly is
useful.

> Don't do it. We've never done it, and we've been fine. Claiming that
> fork() is a reason to do fragmentation avoidance is invalid.
>

We've never done it, but then we've also only ever supported HugeTLB
pages reserved at boot time and nothing else.

I'm going to set up a kbuild environment, hopefully this evening, and
see whether these patches adversely impact a load that kernel
developers care about. If I am impacting it, oops, I'm in some trouble.
If I'm not, then why not try and help out the people who care about
the big areas.

--
Mel Gorman
Part-time PhD Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 16:27 ` Mel Gorman
@ 2005-11-03 16:46   ` Linus Torvalds
  2005-11-03 16:52     ` Martin J. Bligh
  0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 16:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Arjan van de Ven, Martin J. Bligh, Nick Piggin, Dave Hansen,
      Ingo Molnar, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Mel Gorman wrote:
> On Thu, 3 Nov 2005, Linus Torvalds wrote:
>
> > This is why fragmentation avoidance has always been totally useless. It is
> >  - only useful for big areas
> >  - very hard for big areas
> >
> > (Corollary: when it's easy and possible, it's not useful).
>
> Unless you are a user that wants a large area, when it suddenly is
> useful.

No. It's _not_ suddenly useful. It might be something you _want_, but
that's a totally different issue.

My point is that regardless of what you _want_, defragmentation is
_useless_. It's useless simply because for big areas it is so expensive
as to be impractical.

Put another way: you may _want_ the moon to be made of cheese, but a
moon made out of cheese is _useless_ because it is impractical.

The only way to support big areas is to have special zones for them.

(Then, we may be able to use the special zones for small things too,
but under special rules, like "only used for anonymous mappings" where
we can just always remove them by paging them out. But it would still
be a special area meant for big pages, just temporarily "on loan").

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 16:46 ` Linus Torvalds
@ 2005-11-03 16:52   ` Martin J. Bligh
  2005-11-03 17:19     ` Linus Torvalds
  0 siblings, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 16:52 UTC (permalink / raw)
  To: Linus Torvalds, Mel Gorman
  Cc: Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar,
      Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List,
      lhms, Arjan van de Ven

> The only way to support big areas is to have special zones for them.
>
> (Then, we may be able to use the special zones for small things too,
> but under special rules, like "only used for anonymous mappings" where
> we can just always remove them by paging them out. But it would still
> be a special area meant for big pages, just temporarily "on loan").

The problem is how these zones get resized. Can we hotplug memory
between them, with some sparsemem-like indirection layer? Real
customers have shown us that their workloads shift, and they have
different needs at different parts of the day. We can't just pick one
size and call it good.

It's the same argument as the traditional VM balancing act between
pagecache, user pages, and kernel pages (which, incidentally, we don't
use zones for). We want the system to be able to use memory wherever
it's most needed.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 16:52 ` Martin J. Bligh
@ 2005-11-03 17:19   ` Linus Torvalds
  2005-11-03 17:48     ` Dave Hansen
  2005-11-03 17:51     ` Martin J. Bligh
  0 siblings, 2 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 17:19 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
      Ingo Molnar, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> The problem is how these zones get resized. Can we hotplug memory
> between them, with some sparsemem-like indirection layer?

I think you should be able to add them. You can remove them. But you
can't resize them.

And I suspect that by default, there should be zero of them. Ie you'd
have to set them up the same way you now set up a hugetlb area.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 17:19 ` Linus Torvalds
@ 2005-11-03 17:48   ` Dave Hansen
  0 siblings, 0 replies; 241+ messages in thread
From: Dave Hansen @ 2005-11-03 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin J. Bligh, Mel Gorman, Arjan van de Ven, Nick Piggin,
      Ingo Molnar, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 2005-11-03 at 09:19 -0800, Linus Torvalds wrote:
> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> >
> > The problem is how these zones get resized. Can we hotplug memory
> > between them, with some sparsemem-like indirection layer?
>
> I think you should be able to add them. You can remove them. But you
> can't resize them.

Any particular reasons you think we can't resize them? I know shrinking
the non-reclaim (DMA, NORMAL) zones will be practically impossible, but
it should be quite possible to shrink the reclaim zone, and grow DMA or
NORMAL into it.

This will likely be necessary as memory is added to a system, and the
ratio of reclaim to non-reclaim zones gets out of whack and away from
the magic 16:1 or 8:1 highmem:normal ratio that seems popular.

--
Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 17:19 ` Linus Torvalds
@ 2005-11-03 17:51   ` Martin J. Bligh
  2005-11-03 17:59     ` Arjan van de Ven
                       ` (2 more replies)
  1 sibling, 3 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 17:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen,
      Ingo Molnar, Andrew Morton, kravetz, linux-mm,
      Linux Kernel Mailing List, lhms, Arjan van de Ven

--Linus Torvalds <torvalds@osdl.org> wrote (on Thursday, November 03, 2005 09:19:35 -0800):

> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>>
>> The problem is how these zones get resized. Can we hotplug memory
>> between them, with some sparsemem-like indirection layer?
>
> I think you should be able to add them. You can remove them. But you
> can't resize them.
>
> And I suspect that by default, there should be zero of them. Ie you'd
> have to set them up the same way you now set up a hugetlb area.

So ... if there are 0 by default, and I run for a while and dirty up
memory, how do I free any pages up to put into them? Not sure how that
works.

Going back to finding contig pages for a sec ... I don't disagree with
your assertion that order 1 is doable (however, we do need to make one
fix ... see below). It's > 1 that's a problem.

For amusement, let me put in some tritely oversimplified math. For the
sake of argument, assume the free watermarks are 8MB or so. Let's
assume a clean 64-bit system with no zone issues, etc (ie all one
zone). 4K pages. I'm going to assume random distribution of free pages,
which is oversimplified, but I'm trying to demonstrate a general
premise, not get accurate numbers.

8MB = 2048 pages. On a 64MB system, we have 16384 pages, 2048 free.
Very roughly speaking, for each free page, the chance of its buddy
being free is 2048/16384. So in grossly-oversimplified stats-land, if I
can remember anything at all, the chance of finding one page with a
free buddy is 1-(1-2048/16384)^2048, which is, for all intents and
purposes ... 1.

1GB system, 262144 pages: 1-(1-2048/262144)^2048 = 0.9999999

128GB system, 33554432 pages: 1-(1-2048/33554432)^2048 = 0.1175
probability

yes, yes, my math sucks and I'm a simpleton. The point is that as
memory gets bigger, the odds suck for getting contiguous pages. Which
would also explain why you think there's no problem, and I do ;-)

And bear in mind that's just for order 1 allocs. For bigger stuff, it
REALLY sucks - I'll spare you more wild attempts at foully-approximated
math.

Hmmm. If we keep 128MB free, that totally kills off the above
calculation. I think I'll just tweak it so the limit is not so hard on
really big systems. Will send you a patch. However ... larger allocs
will still suck ... I guess I'd better gross you out with more
incorrect math after all ...

^ permalink raw reply	[flat|nested] 241+ messages in thread
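Martin's back-of-envelope model above, as a runnable program (my
transcription of his stated math, nothing more): it assumes the ~2048 free
pages land uniformly at random, which, as the replies below point out, a
buddy allocator does much better than.

	#include <stdio.h>
	#include <math.h>

	int main(void)
	{
		double sizes_mb[] = { 64, 1024, 131072 };	/* 64MB, 1GB, 128GB */
		double nfree = 2048;				/* ~8MB watermark */
		int i;

		for (i = 0; i < 3; i++) {
			double npages = sizes_mb[i] * 1024 / 4;	/* 4K pages */
			double p = 1 - pow(1 - nfree / npages, nfree);

			printf("%8.0fMB: P(some free order-1 buddy pair) ~= %.7f\n",
			       sizes_mb[i], p);
		}
		return 0;	/* build with -lm */
	}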
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 17:51 ` Martin J. Bligh
@ 2005-11-03 17:59 ` Arjan van de Ven
  2005-11-03 18:08 ` Linus Torvalds
  0 siblings, 1 reply; 241+ messages in thread
From: Arjan van de Ven @ 2005-11-03 17:59 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Linus Torvalds, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 2005-11-03 at 09:51 -0800, Martin J. Bligh wrote:

> For amusement, let me put in some tritely oversimplified math. For the
> sake of argument, assume the free watermarks are 8MB or so. Let's assume
> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
> I'm going to assume random distribution of free pages, which is
> oversimplified, but I'm trying to demonstrate a general premise, not get
> accurate numbers.

that is VERY oversimplified though, given the anti-fragmentation property of the buddy algorithm

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 17:59 ` Arjan van de Ven
@ 2005-11-03 18:08 ` Linus Torvalds
  2005-11-03 18:17 ` Martin J. Bligh
  2005-11-03 21:11 ` Mel Gorman
  0 siblings, 2 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 18:08 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Martin J. Bligh, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Arjan van de Ven wrote:
> On Thu, 2005-11-03 at 09:51 -0800, Martin J. Bligh wrote:
>
>> For amusement, let me put in some tritely oversimplified math. For the
>> sake of argument, assume the free watermarks are 8MB or so. Let's assume
>> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
>> I'm going to assume random distribution of free pages, which is
>> oversimplified, but I'm trying to demonstrate a general premise, not get
>> accurate numbers.
>
> that is VERY oversimplified though, given the anti-fragmentation
> property of the buddy algorithm

Indeed. I wrote a program at one point doing random allocation and de-allocation and looked at what the output was, and buddy is very good at avoiding fragmentation.

These days we have things like per-cpu lists in front of the buddy allocator that will make fragmentation somewhat higher, but it's still absolutely true that the page allocation layout is _not_ random.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
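For anyone who wants to repeat that experiment, here is a toy reconstruction of the kind of program described: a self-contained userspace buddy allocator driven with random order-0 churn. It is a sketch from first principles, not Linus' program and not kernel code; the parameters (1GB of 4K pages, 1/8 of memory free) are picked to line up with the numbers in this thread:

#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDER  11
#define PAGES      (1 << 18)		/* 1GB worth of 4K pages */
#define FREE_FRAC  8			/* keep 1/8 of memory free */
#define CHURN      (PAGES * 4)		/* random free/alloc pairs */

static int nxt[PAGES], prv[PAGES];	/* per-order doubly linked free lists */
static int head[MAX_ORDER];
static signed char ord[PAGES];		/* order if a free block starts here, else -1 */

static void push(int b, int o)
{
	ord[b] = o;
	nxt[b] = head[o];
	prv[b] = -1;
	if (head[o] != -1)
		prv[head[o]] = b;
	head[o] = b;
}

static void unlink_blk(int b, int o)
{
	if (prv[b] != -1)
		nxt[prv[b]] = nxt[b];
	else
		head[o] = nxt[b];
	if (nxt[b] != -1)
		prv[nxt[b]] = prv[b];
	ord[b] = -1;
}

static int alloc0(void)			/* allocate one order-0 page */
{
	int o, b;

	for (o = 0; o < MAX_ORDER && head[o] == -1; o++)
		;
	if (o == MAX_ORDER)
		return -1;
	b = head[o];
	unlink_blk(b, o);
	while (o--)
		push(b + (1 << o), o);	/* split: keep lower half, free upper */
	return b;
}

static void free0(int b)		/* free one page, coalescing with buddies */
{
	int o = 0, buddy;

	while (o < MAX_ORDER - 1) {
		buddy = b ^ (1 << o);
		if (ord[buddy] != o)
			break;		/* buddy not free at this order: stop */
		unlink_blk(buddy, o);
		b &= ~(1 << o);		/* merged block starts at the lower half */
		o++;
	}
	push(b, o);
}

int main(void)
{
	static int held[PAGES];
	int i, o, n = 0, nheld = PAGES - PAGES / FREE_FRAC;

	for (o = 0; o < MAX_ORDER; o++)
		head[o] = -1;
	for (i = 0; i < PAGES; i++)
		ord[i] = -1;
	for (i = 0; i < PAGES; i += 1 << (MAX_ORDER - 1))
		push(i, MAX_ORDER - 1);

	while (n < nheld)
		held[n++] = alloc0();
	for (i = 0; i < CHURN; i++) {	/* random allocation/de-allocation churn */
		int victim = rand() % n;
		free0(held[victim]);
		held[victim] = alloc0();
	}
	for (o = 0; o < MAX_ORDER; o++) {
		int c = 0, b;
		for (b = head[o]; b != -1; b = nxt[b])
			c++;
		printf("order %2d: %d free blocks\n", o, c);
	}
	return 0;
}

Running it shows most of the free memory sitting in higher-order blocks after the churn, which is the non-randomness being argued about; replace the free lists with a plain stack of single pages and the high orders vanish.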
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:08 ` Linus Torvalds
@ 2005-11-03 18:17 ` Martin J. Bligh
  2005-11-03 18:44 ` Linus Torvalds
  2005-11-04  0:58 ` Nick Piggin
  1 sibling, 2 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 18:17 UTC (permalink / raw)
To: Linus Torvalds, Arjan van de Ven
Cc: Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

>>> For amusement, let me put in some tritely oversimplified math. For the
>>> sake of argument, assume the free watermarks are 8MB or so. Let's assume
>>> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
>>> I'm going to assume random distribution of free pages, which is
>>> oversimplified, but I'm trying to demonstrate a general premise, not get
>>> accurate numbers.
>>
>> that is VERY oversimplified though, given the anti-fragmentation
>> property of the buddy algorithm
>
> Indeed. I wrote a program at one point doing random allocation and
> de-allocation and looked at what the output was, and buddy is very good
> at avoiding fragmentation.
>
> These days we have things like per-cpu lists in front of the buddy
> allocator that will make fragmentation somewhat higher, but it's still
> absolutely true that the page allocation layout is _not_ random.

OK, well I'll quit torturing you with incorrect math if you'll concede that the situation gets much much worse as memory sizes get larger ;-)

For order 1 allocs, I think it's fixable. For order > 1, I think we basically don't have a prayer on a largish system under pressure.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:17 ` Martin J. Bligh
@ 2005-11-03 18:44 ` Linus Torvalds
  2005-11-03 18:51 ` Martin J. Bligh
  1 sibling, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 18:44 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
>> These days we have things like per-cpu lists in front of the buddy
>> allocator that will make fragmentation somewhat higher, but it's still
>> absolutely true that the page allocation layout is _not_ random.
>
> OK, well I'll quit torturing you with incorrect math if you'll concede
> that the situation gets much much worse as memory sizes get larger ;-)

I don't remember the specifics (I did the stats several years ago), but if I recall correctly, the low-order allocations actually got _better_ with more memory, assuming you kept a fixed percentage of memory free. So you actually needed _less_ memory free (in percentages) to get low-order allocations reliably.

But the higher orders didn't much matter. Basically, it gets exponentially more difficult to keep higher-order allocations, and it doesn't help one whit if there's a linear improvement from having more memory available or something like that.

So it doesn't get _harder_ with lots of memory, but

 - you need to keep the "minimum free" watermarks growing at the same rate the memory sizes grow (and on x86, I don't think we do: at least at some point, the HIGHMEM zone had a much lower low-water-mark because it made the balancing behaviour much nicer. But I didn't check that).

 - with lots of memory, you tend to want to get higher-order pages, and that gets harder much much faster than your memory size grows. So _effectively_, the kinds of allocations you care about are much harder to get.

If you look at get_free_pages(), you will note that we actually _guarantee_ memory allocations up to order-3:

	...
	if (!(gfp_mask & __GFP_NORETRY)) {
		if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
			do_retry = 1;
	...

and nobody has ever even noticed. In other words, low-order allocations really _are_ dependable. It's just that the kinds of orders you want for memory hotplug or hugetlb (ie not orders <=3, but >=10) are not, and never will be.

(Btw, my statistics did depend on the fact that the _usage_ was an even higher exponential, ie you had many many more order-0 allocations than you had order-1. You can always run out of order-n (n != 0) pages if you just allocate enough of them. The buddy thing works well statistically, but it obviously can't do wonders.)

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
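To make that order <= 3 guarantee concrete: the classic consumer is the order-1 kernel stack, which every fork() quietly leans on. The fragment below is modelled loosely on the 8K-stack configurations of the era; the exact macro spelling is illustrative, not quoted from any tree:

#include <linux/gfp.h>
#include <linux/sched.h>

#define THREAD_ORDER 1	/* 8K stacks on a 4K-page machine */

/*
 * GFP_KERNEL has __GFP_WAIT and not __GFP_NORETRY, and the order is
 * <= 3, so per the snippet above this retries rather than fail.
 */
static unsigned long alloc_task_stack(void)
{
	return __get_free_pages(GFP_KERNEL, THREAD_ORDER);
}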
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 18:44 ` Linus Torvalds @ 2005-11-03 18:51 ` Martin J. Bligh 2005-11-03 19:35 ` Linus Torvalds 0 siblings, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-03 18:51 UTC (permalink / raw) To: Linus Torvalds Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven --Linus Torvalds <torvalds@osdl.org> wrote (on Thursday, November 03, 2005 10:44:14 -0800): > > > On Thu, 3 Nov 2005, Martin J. Bligh wrote: >> > >> > These days we have things like per-cpu lists in front of the buddy >> > allocator that will make fragmentation somewhat higher, but it's still >> > absolutely true that the page allocation layout is _not_ random. >> >> OK, well I'll quit torturing you with incorrect math if you'll concede >> that the situation gets much much worse as memory sizes get larger ;-) > > I don't remember the specifics (I did the stats several years ago), but if > I recall correctly, the low-order allocations actually got _better_ with > more memory, assuming you kept a fixed percentage of memory free. So you > actually needed _less_ memory free (in percentages) to get low-order > allocations reliably. Possibly, I can redo the calculations easily enough (have to go for now, but I just sent the other ones). But we don't keep a fixed percentage of memory free - we cap it ... perhaps we should though? M. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:51 ` Martin J. Bligh
@ 2005-11-03 19:35 ` Linus Torvalds
  2005-11-03 22:40 ` Martin J. Bligh
  0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-03 19:35 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> Possibly, I can redo the calculations easily enough (have to go for now,
> but I just sent the other ones). But we don't keep a fixed percentage of
> memory free - we cap it ... perhaps we should though?

I suspect the capping may well be from some old HIGHMEM interaction on x86 (ie "don't keep half a gig free in the normal zone just because we have 16GB in the high-zone"). We used to have serious balancing issues, and I wouldn't be surprised at all if there are remnants from that. Stuff that simply hasn't been visible, because not a lot of people had many many GB of memory even on machines that didn't need HIGHMEM.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 19:35 ` Linus Torvalds @ 2005-11-03 22:40 ` Martin J. Bligh 2005-11-03 22:56 ` Linus Torvalds 0 siblings, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-03 22:40 UTC (permalink / raw) To: Linus Torvalds Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven --On Thursday, November 03, 2005 11:35:28 -0800 Linus Torvalds <torvalds@osdl.org> wrote: > > > On Thu, 3 Nov 2005, Martin J. Bligh wrote: >> >> Possibly, I can redo the calculations easily enough (have to go for now, >> but I just sent the other ones). But we don't keep a fixed percentage of >> memory free - we cap it ... perhaps we should though? > > I suspect the capping may well be from some old HIGHMEM interaction on x86 > (ie "don't keep half a gig free in the normal zone just because we have > 16GB in the high-zone". We used to have serious balancing issues, and I > wouldn't be surprised at all if there are remnants from that. Stuff that > simply hasn't been visible, because not a lot of people had many many GB > of memory even on machines that didn't need HIGHMEM. But pages_min is based on the zone size, not the system size. And we still cap it. Maybe that's just a mistake? M. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 22:40 ` Martin J. Bligh @ 2005-11-03 22:56 ` Linus Torvalds 2005-11-03 23:01 ` Martin J. Bligh 0 siblings, 1 reply; 241+ messages in thread From: Linus Torvalds @ 2005-11-03 22:56 UTC (permalink / raw) To: Martin J. Bligh Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven On Thu, 3 Nov 2005, Martin J. Bligh wrote: > > But pages_min is based on the zone size, not the system size. And we > still cap it. Maybe that's just a mistake? The per-zone watermarking is actually the "modern" and "working" approach. We didn't always do it that way. I would not be at all surprised if the capping was from the global watermarking days. Of course, I would _also_ not be at all surprised if it wasn't just out of habit. Most of the things where we try to scale things up by memory size, we cap for various reasons. Ie we tend to try to scale things like hash sizes for core data structures by memory size, but then we tend to cap them to "sane" versions. So quite frankly, it's entirely possible that the capping is there not because it _ever_ was a good idea, but just because it's what we almost always do ;) Mental inertia is definitely alive and well. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 22:56 ` Linus Torvalds
@ 2005-11-03 23:01 ` Martin J. Bligh
  0 siblings, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 23:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Mel Gorman, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

>> But pages_min is based on the zone size, not the system size. And we
>> still cap it. Maybe that's just a mistake?
>
> The per-zone watermarking is actually the "modern" and "working" approach.
>
> We didn't always do it that way. I would not be at all surprised if the
> capping was from the global watermarking days.
>
> Of course, I would _also_ not be at all surprised if it wasn't just out of
> habit. Most of the things where we try to scale things up by memory size,
> we cap for various reasons. Ie we tend to try to scale things like hash
> sizes for core data structures by memory size, but then we tend to cap
> them to "sane" versions.
>
> So quite frankly, it's entirely possible that the capping is there not
> because it _ever_ was a good idea, but just because it's what we almost
> always do ;)
>
> Mental inertia is definitely alive and well.

Ha ;-) Well thanks for the explanation. I would suggest the patch I sent you makes some semblance of sense then ...

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:17 ` Martin J. Bligh
  2005-11-03 18:44 ` Linus Torvalds
@ 2005-11-04  0:58 ` Nick Piggin
  2005-11-04  1:06 ` Linus Torvalds
  1 sibling, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-04 0:58 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Linus Torvalds, Arjan van de Ven, Mel Gorman, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

Martin J. Bligh wrote:

>> These days we have things like per-cpu lists in front of the buddy
>> allocator that will make fragmentation somewhat higher, but it's still
>> absolutely true that the page allocation layout is _not_ random.
>
> OK, well I'll quit torturing you with incorrect math if you'll concede
> that the situation gets much much worse as memory sizes get larger ;-)

Let me add that as memory sizes get larger, people are also looking for more TLB coverage and less per-page overhead.

Looks like ppc64 is getting 64K page support, at which point higher order allocations (eg. for stacks) basically disappear, don't they?

x86-64 I thought was also getting 64K page support but I can't find a reference to it right now - at the very least I know Andi wants to support larger soft pages for it. ia64 is obviously already well covered.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

^ permalink raw reply	[flat|nested] 241+ messages in thread
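The TLB-coverage point is just multiplication, but it is the whole motivation. A throwaway sketch (the 1024-entry TLB is an assumed, purely illustrative figure, not any particular CPU):

#include <stdio.h>

int main(void)
{
	int entries = 1024;	/* assumed data-TLB entry count */

	/* coverage = entries * page size */
	printf("4K pages:  %5d KB covered\n", entries * 4);
	printf("64K pages: %5d KB covered\n", entries * 64);
	return 0;
}

With those numbers, coverage goes from 4MB to 64MB for the same TLB, which is why the HPC and database crowds keep asking for bigger pages.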
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 0:58 ` Nick Piggin @ 2005-11-04 1:06 ` Linus Torvalds 2005-11-04 1:20 ` Paul Mackerras ` (2 more replies) 0 siblings, 3 replies; 241+ messages in thread From: Linus Torvalds @ 2005-11-04 1:06 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Arjan van de Ven, Mel Gorman, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven On Fri, 4 Nov 2005, Nick Piggin wrote: > > Looks like ppc64 is getting 64K page support, at which point higher > order allocations (eg. for stacks) basically disappear don't they? Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a general-purpose machine. 64kB pages are _only_ usable for databases, nothing else. Why? Do the math. Try to cache the whole kernel source tree in 4kB pages vs 64kB pages. See how the memory usage goes up by a factor of _four_. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
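That math is worth doing concretely. The sketch below assumes the cache footprint is simply each file rounded up to the page size (ignoring tail packing, sparse files and so on); feed it file sizes on stdin, e.g. from find linux-2.6.14 -type f -printf '%s\n':

#include <stdio.h>

/* round a file size up to a whole number of pages */
static unsigned long long rounded(unsigned long long size, unsigned long page)
{
	if (size == 0)
		return 0;
	return (size + page - 1) / page * page;
}

int main(void)
{
	unsigned long long size, t4 = 0, t64 = 0;

	while (scanf("%llu", &size) == 1) {
		t4 += rounded(size, 4096);
		t64 += rounded(size, 65536);
	}
	printf("4K pages:  %llu MB\n", t4 >> 20);
	printf("64K pages: %llu MB\n", t64 >> 20);
	printf("bloat: %.2fx\n", t4 ? (double)t64 / t4 : 0.0);
	return 0;
}

A kernel tree is dominated by files well under 16K, which is where a factor in the region of four comes from; the office-file distribution Paul describes below sits much closer to 1x.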
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 1:06 ` Linus Torvalds @ 2005-11-04 1:20 ` Paul Mackerras 2005-11-04 1:22 ` Nick Piggin 2005-11-04 1:26 ` Mel Gorman 2 siblings, 0 replies; 241+ messages in thread From: Paul Mackerras @ 2005-11-04 1:20 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Martin J. Bligh, Arjan van de Ven, Mel Gorman, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven Linus Torvalds writes: > 64kB pages are _only_ usable for databases, nothing else. Actually people running HPC apps also like 64kB pages since their TLB misses go down significantly, and their data files tend to be large. Fileserving for windows boxes should also benefit, since both the executables and the data files that typical office applications on windows use are largish. I got a distribution of file sizes for a government department office and concluded that 64k pages would only bloat the page cache by a few percent for that case. Paul. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 1:06 ` Linus Torvalds 2005-11-04 1:20 ` Paul Mackerras @ 2005-11-04 1:22 ` Nick Piggin 2005-11-04 1:48 ` Mel Gorman 2005-11-04 1:26 ` Mel Gorman 2 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-11-04 1:22 UTC (permalink / raw) To: Linus Torvalds Cc: Martin J. Bligh, Arjan van de Ven, Mel Gorman, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven Linus Torvalds wrote: > > On Fri, 4 Nov 2005, Nick Piggin wrote: > >>Looks like ppc64 is getting 64K page support, at which point higher >>order allocations (eg. for stacks) basically disappear don't they? > > > Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a > general-purpose machine. > > 64kB pages are _only_ usable for databases, nothing else. > > Why? Do the math. Try to cache the whole kernel source tree in 4kB pages > vs 64kB pages. See how the memory usage goes up by a factor of _four_. > Yeah that's true. But Martin's worried about future machines with massive memories - so maybe it is safe to assume those will be using big pages, I don't know. Maybe the solution is to bloat the kernel sources enough to make 64KB pages worthwhile? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  1:22 ` Nick Piggin
@ 2005-11-04  1:48 ` Mel Gorman
  2005-11-04  1:59 ` Nick Piggin
  0 siblings, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-04 1:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Martin J. Bligh, Arjan van de Ven, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Fri, 4 Nov 2005, Nick Piggin wrote:

> Linus Torvalds wrote:
>>
>> On Fri, 4 Nov 2005, Nick Piggin wrote:
>>
>>> Looks like ppc64 is getting 64K page support, at which point higher
>>> order allocations (eg. for stacks) basically disappear, don't they?
>>
>> Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a
>> general-purpose machine.
>>
>> 64kB pages are _only_ usable for databases, nothing else.
>>
>> Why? Do the math. Try to cache the whole kernel source tree in 4kB pages vs
>> 64kB pages. See how the memory usage goes up by a factor of _four_.
>
> Yeah that's true. But Martin's worried about future machines
> with massive memories - so maybe it is safe to assume those will
> be using big pages, I don't know.

Today's massive machines are tomorrow's desktop. Weak comment, I know, but it's happened before.

> Maybe the solution is to bloat the kernel sources enough to make
> 64KB pages worthwhile?

root@monocle:/boot# ls -l vmlinuz-2.6.14-rc5-mm1-clean
-rw-r--r-- 1 root root 1718063 2005-11-01 16:17 vmlinuz-2.6.14-rc5-mm1-clean
root@monocle:/boot# ls -l vmlinuz-2.6.14-rc5-mm1-mbuddy-v19
-rw-r--r-- 1 root root 1722102 2005-11-02 14:56 vmlinuz-2.6.14-rc5-mm1-mbuddy-v19
root@monocle:/boot# dc
1722102 1718063 - p
4039
root@monocle:/boot# ls -l vmlinux-2.6.14-rc5-mm1-clean
-rwxr-xr-x 1 root root 31518866 2005-11-01 16:17 vmlinux-2.6.14-rc5-mm1-clean
root@monocle:/boot# ls -l vmlinux-2.6.14-rc5-mm1-mbuddy-v19
-rwxr-xr-x 1 root root 31585714 2005-11-02 14:56 vmlinux-2.6.14-rc5-mm1-mbuddy-v19

mel@joshua:/usr/src/patchset-0.5/kernels/linux-2.6.14-rc5-mm1-nooom$ wc -l mm/page_alloc.c
2689 mm/page_alloc.c
mel@joshua:/usr/src/patchset-0.5/kernels/linux-2.6.14-rc5-mm1-mbuddy-v19-withdefrag$ wc -l mm/page_alloc.c
3188 mm/page_alloc.c

That is a 0.23% increase in the size of bzImage, a 0.21% increase in the size of vmlinux, and the major increase in code size is in one file, *one* file, all of which does its best not to impact the flow of the well-understood code. We're seeing bigger differences in performance than we are in the size of the kernel. I'd understand if I was the first person to ever introduce complexity to the VM.

If the size of the image for really small systems is the issue, what if I say I'll add in another patch that optionally compiles away as much of anti-defrag as possible without making the code a mess of #defines? Are we still going to hear "no, I don't like looking at this"? The current patch to compile it away deliberately chose the smallest part to take away to restore the allocator to today's behavior.

--
Mel Gorman
Part-time Phd Student					Java Applications Developer
University of Limerick					IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 1:48 ` Mel Gorman @ 2005-11-04 1:59 ` Nick Piggin 2005-11-04 2:35 ` Mel Gorman 0 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-11-04 1:59 UTC (permalink / raw) To: Mel Gorman Cc: Linus Torvalds, Martin J. Bligh, Arjan van de Ven, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven Mel Gorman wrote: > On Fri, 4 Nov 2005, Nick Piggin wrote: > > Todays massive machines are tomorrows desktop. Weak comment, I know, but > it's happened before. > Oh I wouldn't bet against it. And if desktops of the future are using 100s of GB then they probably would be happy to use 64K pages as well. > >>Maybe the solution is to bloat the kernel sources enough to make >>64KB pages worthwhile? >> > Sorry this wasn't meant to be a dig at your patches - I guess it turned out that way though :\ But yes, if anybody is adding complexity or size to core code it obviously does need to be justified -- and by no means does this only apply to you. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  1:59 ` Nick Piggin
@ 2005-11-04  2:35 ` Mel Gorman
  0 siblings, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-04 2:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Martin J. Bligh, Arjan van de Ven, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Fri, 4 Nov 2005, Nick Piggin wrote:

> Mel Gorman wrote:
>> On Fri, 4 Nov 2005, Nick Piggin wrote:
>>
>> Today's massive machines are tomorrow's desktop. Weak comment, I know, but
>> it's happened before.
>
> Oh I wouldn't bet against it. And if desktops of the future are using
> 100s of GB then they probably would be happy to use 64K pages as well.

And would it not be nice to be ready when it happens, before it happens even?

>>> Maybe the solution is to bloat the kernel sources enough to make
>>> 64KB pages worthwhile?
>
> Sorry this wasn't meant to be a dig at your patches - I guess it turned
> out that way though :\

Oh, I'll live. If I was going to take it personally and go into a big sulk, I wouldn't be here. This is linux-kernel, not the super-friends club.

> But yes, if anybody is adding complexity or size to core code it
> obviously does need to be justified -- and by no means does this only
> apply to you.

I've tried to justify it with benchmarks that came with each release, and code reviews, particularly by Dave Hansen, showed that earlier versions had significant problems that needed to be ironed out. I don't want to hurt the normal case, because the fact of the matter is, my desktop machine (which runs with these patches to see if there are any bugs) runs the normal case, and it will until we get much further because I'm not configuring my machine for HugeTLB when it boots. If I'm hurting the normal case, that's more time switching windows to see if the next test kernel has built yet.

If we can do this and not regress in the standard case, then what is wrong? I'm still waiting for figures that say this approach is slow, and I can only assume someone is trying, considering the length of this thread. If and when those figures show up, I'll put on the thinking hat and see where I went wrong, because regressing performance is wrong. There is a win-win solution somewhere, how hard could it possibly be :) ?

I'm looking at the zone approach. I want to see if it can work in a nice fashion, not in a "if the sysadm can see the future and configure correctly, it'll work just fine" fashion. I'm not confident, but it might be bias.

--
Mel Gorman
Part-time Phd Student					Java Applications Developer
University of Limerick					IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  1:06 ` Linus Torvalds
  2005-11-04  1:20 ` Paul Mackerras
  2005-11-04  1:22 ` Nick Piggin
@ 2005-11-04  1:26 ` Mel Gorman
  2 siblings, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-04 1:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Martin J. Bligh, Arjan van de Ven, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Linus Torvalds wrote:
>
> On Fri, 4 Nov 2005, Nick Piggin wrote:
>
>> Looks like ppc64 is getting 64K page support, at which point higher
>> order allocations (eg. for stacks) basically disappear, don't they?
>
> Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a
> general-purpose machine.
>
> 64kB pages are _only_ usable for databases, nothing else.

Very well, but if the infrastructure required to help get 64kB pages performs the same as, or better than, the current infrastructure that gives 4kB pages, then why not? I am biased obviously and probably optimistic, but I am hoping we have a case here where we get our cake and eat it twice.

> Why? Do the math. Try to cache the whole kernel source tree in 4kB pages
> vs 64kB pages. See how the memory usage goes up by a factor of _four_.

I don't know, but I doubt they would use 64kB pages as the default size unless it is a specialised machine. I could be wrong, I don't have a ppc64 machine, I don't work on a ppc64 machine, I haven't read the architecture's documentation and I didn't write this code for a ppc64 machine. If the machine in question is a specialised machine, its users go into the 0.01% category of people, but it's a group that we can still help without introducing static zones they have to configure.

I'm still waiting on figures that say the approach proposed here is actually really slow, rather than "makes people unhappy" slow. If this is proved to be slow, then I'll admit there is a problem and put more effort into the plans to use zones instead. I just haven't found a problem on the machines I have available to me, be it aim9, bench-stresshighalloc or building kernels (which I think is important considering how often I build test kernels). If it's a documentation problem with these patches, I'll write up VM docs on the allocator and submit it as a patch, complete with downsides and caveats to be fair.

--
Mel Gorman
Part-time Phd Student					Java Applications Developer
University of Limerick					IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:08 ` Linus Torvalds
  2005-11-03 18:17 ` Martin J. Bligh
@ 2005-11-03 21:11 ` Mel Gorman
  1 sibling, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-03 21:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Martin J. Bligh, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Linus Torvalds wrote:

> On Thu, 3 Nov 2005, Arjan van de Ven wrote:
>> On Thu, 2005-11-03 at 09:51 -0800, Martin J. Bligh wrote:
>>
>>> For amusement, let me put in some tritely oversimplified math. For the
>>> sake of argument, assume the free watermarks are 8MB or so. Let's assume
>>> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
>>> I'm going to assume random distribution of free pages, which is
>>> oversimplified, but I'm trying to demonstrate a general premise, not get
>>> accurate numbers.
>>
>> that is VERY oversimplified though, given the anti-fragmentation
>> property of the buddy algorithm

The statistical properties of the buddy system are a nightmare. There is a paper called "Statistical Properties of the Buddy System" which is a whole pile of no fun to read. It's because of the difficulty of analysing fragmentation offline that bench-stresshighalloc was written, to see how well anti-defrag would do.

> Indeed. I wrote a program at one point doing random allocation and
> de-allocation and looked at what the output was, and buddy is very good
> at avoiding fragmentation.

The worst cause of fragmentation I found was kernel caches that were long lived. How fragmenting the workload is depended heavily on whether things like updatedb happened, which is why bench-stresshighalloc deliberately ran it. It's also why anti-defrag tries to group inodes and buffer_heads into the same areas in memory, separate from other presumed-to-be-even-longer-lived kernel allocations. The assumption is that if the buffer, inode and dcaches are all shrunk, contiguous blocks will appear.

You're also right on the size of the watermarks for zones and how it affects fragmentation. A serious problem I had with anti-defrag was when 87.5% of memory is in use. At this point, a "fallback" area is used by any allocation type that has no pages of its own. When it is depleted, real fragmentation starts happening, and it's also about here that the high watermark for reclaiming starts. I wanted to increase the watermarks to start reclaiming pages when the "fallback" area started getting used, but didn't think I would get away with adjusting those figures. I could have cheated and set it via /proc before benchmarks, but didn't, to avoid "magic test system" syndrome.

> These days we have things like per-cpu lists in front of the buddy
> allocator that will make fragmentation somewhat higher, but it's still
> absolutely true that the page allocation layout is _not_ random.

It's worse than somewhat higher for the per-cpu pages. Using another set of patches on top of an earlier version of anti-defrag, I was able to allocate about 75% of physical memory in pinned 4MiB chunks of memory under loads of 15-20 (kernel builds). To get there, per-cpu pages had to be drained using an IPI call because, for some perverse reason, there were always 2 or 3 free per-cpu pages in the middle of a 1024-page block.

Basically, I don't think we have to live with fragmentation in the page allocator.
I think it can be pushed down a whole lot without taking a performance hit for the 99.99% of users that don't care about this sort of thing. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
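The IPI drain Mel refers to might look roughly like the sketch below. To be clear, this is not his patch and the helper names are invented; the elided per-cpu body would, in a 2.6.14 tree, look much like the existing __drain_pages() logic, and the four-argument smp_call_function() is the calling convention of that era:

#include <linux/smp.h>
#include <linux/interrupt.h>
#include <linux/preempt.h>

/* hypothetical helper: give this CPU's hot/cold per-cpu pages back
 * to the buddy free lists so they can coalesce into larger blocks */
static void drain_pcp_local(void *unused)
{
	unsigned long flags;

	local_irq_save(flags);
	/* walk each zone's per_cpu_pageset for this CPU and free the
	 * pages back to the buddy lists, much as __drain_pages() does */
	local_irq_restore(flags);
}

/* hypothetical helper: drain every CPU's per-cpu lists */
static void drain_pcp_all(void)
{
	preempt_disable();
	/* runs on all *other* CPUs, so handle the local one ourselves */
	smp_call_function(drain_pcp_local, NULL, 0, 1);
	drain_pcp_local(NULL);
	preempt_enable();
}

The point of the drain is exactly the "2 or 3 free per-cpu pages in the middle of a 1024-page block" problem above: pages parked on per-cpu lists are free but invisible to the buddy coalescing.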
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 17:51 ` Martin J. Bligh 2005-11-03 17:59 ` Arjan van de Ven @ 2005-11-03 18:03 ` Linus Torvalds 2005-11-03 20:00 ` Paul Jackson 2005-11-03 20:46 ` Mel Gorman 2005-11-03 18:48 ` Martin J. Bligh 2 siblings, 2 replies; 241+ messages in thread From: Linus Torvalds @ 2005-11-03 18:03 UTC (permalink / raw) To: Martin J. Bligh Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven On Thu, 3 Nov 2005, Martin J. Bligh wrote: > > And I suspect that by default, there should be zero of them. Ie you'd have > > to set them up the same way you now set up a hugetlb area. > > So ... if there are 0 by default, and I run for a while and dirty up > memory, how do I free any pages up to put into them? Not sure how that > works. You don't. Just face it - people who want memory hotplug had better know that beforehand (and let's be honest - in practice it's only going to work in virtualized environments or in environments where you can insert the new bank of memory and copy it over and remove the old one with hw support). Same as hugetlb. Nobody sane _cares_. Nobody sane is asking for these things. Only people with special needs are asking for it, and they know their needs. You have to realize that the first rule of engineering is to work out the balances. The undeniable fact is, that 99.99% of all users will never care one whit, and memory management is complex and fragile. End result: the 0.01% of users will have to do some manual configuration to keep things simpler for the cases that really matter. Because the case that really matters is the sane case. The one where we - don't change memory (normal) - only add memory (easy) - only switch out memory with hardware support (ie the _hardware_ supports parallel memory, and you can switch out a DIMM without software ever really even noticing) - have system maintainers that do strange things, but _know_ that. We simply DO NOT CARE about some theoretical "general case", because the general case is (a) insane and (b) impossible to cater to without excessive complexity. Guys, a kernel developer needs to know when to say NO. And we say NO, HELL NO!! to generic software-only memory hotplug. If you are running a DB that needs to benchmark well, you damn well KNOW IT IN ADVANCE, AND YOU TUNE FOR IT. Nobody takes a random machine and says "ok, we'll now put our most performance-critical database on this machine, and oh, btw, you can't reboot it and tune for it beforehand". And if you have such a person, you need to learn to IGNORE THE CRAZY PEOPLE. When you hear voices in your head that tell you to shoot the pope, do you do what they say? Same thing goes for customers and managers. They are the crazy voices in your head, and you need to set them right, not just blindly do what they ask for. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 18:03 ` Linus Torvalds @ 2005-11-03 20:00 ` Paul Jackson 2005-11-03 20:46 ` Mel Gorman 1 sibling, 0 replies; 241+ messages in thread From: Paul Jackson @ 2005-11-03 20:00 UTC (permalink / raw) To: Linus Torvalds Cc: mbligh, mel, arjan, nickpiggin, haveblue, mingo, akpm, kravetz, linux-mm, linux-kernel, lhms-devel, arjanv > We simply DO NOT CARE about some theoretical "general case", because the > general case is (a) insane and (b) impossible to cater to without > excessive complexity. The lawyers have a phrase for this: Hard cases make bad law. For us, that's bad code. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 18:03 ` Linus Torvalds
  2005-11-03 20:00 ` Paul Jackson
@ 2005-11-03 20:46 ` Mel Gorman
  1 sibling, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-03 20:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin J. Bligh, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Linus Torvalds wrote:

> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>>> And I suspect that by default, there should be zero of them. Ie you'd have
>>> to set them up the same way you now set up a hugetlb area.
>>
>> So ... if there are 0 by default, and I run for a while and dirty up
>> memory, how do I free any pages up to put into them? Not sure how that
>> works.
>
> You don't.
>
> Just face it - people who want memory hotplug had better know that
> beforehand (and let's be honest - in practice it's only going to work in
> virtualized environments or in environments where you can insert the new
> bank of memory and copy it over and remove the old one with hw support).
>
> Same as hugetlb.

For HugeTLB, there are cases where the sysadmin won't configure the server because it's a tunable that can badly affect the machine if they get it wrong. In those cases, the users just get small pages and the performance penalty, and are told to like it.

> Nobody sane _cares_. Nobody sane is asking for these things. Only people
> with special needs are asking for it, and they know their needs.
>
> You have to realize that the first rule of engineering is to work out the
> balances. The undeniable fact is, that 99.99% of all users will never care
> one whit, and memory management is complex and fragile. End result: the
> 0.01% of users will have to do some manual configuration to keep things
> simpler for the cases that really matter.

Ok, so let's consider the 99.99% of users then. On two machines, aim9 benchmarks posted during this thread show some improvements on page_test, fork_test and brk_test, the paths you would expect to be hit by these patches. They are very minor improvements, but 99.99% of users benefit from this.

Aim9 might be considered artificial, so somewhere in that 99.99% of users are kernel developers who care about kbuild. Here are the timings of "kernel untar ; make defconfig ; make":

2.6.14-rc5-mm1:                          1093 seconds
2.6.14-rc5-mm1-mbuddy-v19-withoutdefrag: 1089 seconds
2.6.14-rc5-mm1-mbuddy-v19-withdefrag:    1086 seconds

The withoutdefrag mark is with the core of anti-defrag disabled via a configure option. The option to disable was a separate patch produced during this thread. To be really honest, I don't think a configurable page allocator is a great idea.

Building kernels is faster with this set of patches, which a few people on this list care about. aim9 shows very minor improvements which benefit a very large number of people, and the 0.01% of people who care about fragmentation get lower fragmentation. Of course, maybe there is something magic with my test machines (or maybe I am willing it faster), so figures from other people wouldn't hurt, whether they show gains or regressions. On my machine at least, 99.99% of people are still benefiting. I am going to wait to see if people post figures that show regressions before asking "are you still saying no?" to this set of patches.

> Because the case that really matters is the sane case.
The one where we > - don't change memory (normal) > - only add memory (easy) > - only switch out memory with hardware support (ie the _hardware_ > supports parallel memory, and you can switch out a DIMM without > software ever really even noticing) > - have system maintainers that do strange things, but _know_ that. > > We simply DO NOT CARE about some theoretical "general case", because the > general case is (a) insane and (b) impossible to cater to without > excessive complexity. > > Guys, a kernel developer needs to know when to say NO. > > And we say NO, HELL NO!! to generic software-only memory hotplug. > > If you are running a DB that needs to benchmark well, you damn well KNOW > IT IN ADVANCE, AND YOU TUNE FOR IT. > > Nobody takes a random machine and says "ok, we'll now put our most > performance-critical database on this machine, and oh, btw, you can't > reboot it and tune for it beforehand". And if you have such a person, you > need to learn to IGNORE THE CRAZY PEOPLE. > > When you hear voices in your head that tell you to shoot the pope, do you > do what they say? Same thing goes for customers and managers. They are the > crazy voices in your head, and you need to set them right, not just > blindly do what they ask for. > > Linus > -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 17:51 ` Martin J. Bligh
  2005-11-03 17:59 ` Arjan van de Ven
  2005-11-03 18:03 ` Linus Torvalds
@ 2005-11-03 18:48 ` Martin J. Bligh
  2005-11-03 19:08 ` Linus Torvalds
  2 siblings, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 18:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

> For amusement, let me put in some tritely oversimplified math. For the
> sake of argument, assume the free watermarks are 8MB or so. Let's assume
> a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
> I'm going to assume random distribution of free pages, which is
> oversimplified, but I'm trying to demonstrate a general premise, not get
> accurate numbers.
>
> 8MB = 2048 pages.
>
> On a 64MB system, we have 16384 pages, 2048 free. Very roughly speaking,
> for each free page, the chance of its buddy being free is 2048/16384. So in
> grossly-oversimplified stats-land, if I can remember anything at all, the
> chance of finding one page with a free buddy is 1-(1-2048/16384)^2048,
> which is, for all intents and purposes ... 1.
>
> 1GB system, 262144 pages. 1-(1-2048/262144)^2048 = 0.9999999
>
> 128GB system, 33554432 pages. 0.1175 probability
>
> yes, yes, my math sucks and I'm a simpleton. The point is that as memory
> gets bigger, the odds suck for getting contiguous pages. And would also
> explain why you think there's no problem, and I do ;-) And bear in mind
> that's just for order 1 allocs. For bigger stuff, it REALLY sucks - I'll
> spare you more wild attempts at foully-approximated math.
>
> Hmmm. If we keep 128MB free, that totally kills off the above calculation.
> I think I'll just tweak it so the limit is not so hard on really big
> systems. Will send you a patch. However ... larger allocs will still
> suck ... I guess I'd better gross you out with more incorrect math after
> all ...

Ha. Just because I don't think I made you puke hard enough already with foul approximations ... for order 2, I think it's 1-(1-(free_pool/total)^3)^free_pool because all 3 of its buddies have to be free as well. (and generically, 2^order - 1)

ORDER: 1
1024MB system, 8MB pool = 1.000000
131072MB system, 8MB pool = 0.117506
1024MB system, 128MB pool = 1.000000
131072MB system, 128MB pool = 1.000000

ORDER: 2
1024MB system, 8MB pool = 0.000976
131072MB system, 8MB pool = 0.000000
1024MB system, 128MB pool = 1.000000
131072MB system, 128MB pool = 0.000031

ORDER: 3
1024MB system, 8MB pool = 0.000000
131072MB system, 8MB pool = 0.000000
1024MB system, 128MB pool = 0.015504
131072MB system, 128MB pool = 0.000000

ORDER: 4
1024MB system, 8MB pool = 0.000000
131072MB system, 8MB pool = 0.000000
1024MB system, 128MB pool = 0.000000
131072MB system, 128MB pool = 0.000000

------------------------

I really should learn not to post my rusty math in such public places ... but I still think the point is correct. Anyway, I'm sure somewhere in the resultant flamewar, someone will come up with some better approx ;-)

And yes, I appreciate the random distribution thing is wrong. But it's still not going to work for bigger allocs. Fixing the free watermarks will help us a bit though.

^ permalink raw reply	[flat|nested] 241+ messages in thread
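Martin's generalised formula regenerates his whole table with a few lines of C. Again this is only the random-placement model, evaluated exactly as stated (an order-n candidate needs its 2^n - 1 companion pages free); build with -lm:

#include <stdio.h>
#include <math.h>

/* 1 - (1 - (free/total)^(2^order - 1))^free */
static double p_order(double total, double freep, int order)
{
	double one = pow(freep / total, (double)((1 << order) - 1));
	return 1.0 - pow(1.0 - one, freep);
}

int main(void)
{
	static const double mem_mb[]  = { 1024, 131072 };
	static const double pool_mb[] = { 8, 128 };
	int order, m, p;

	for (order = 1; order <= 4; order++) {
		printf("ORDER: %d\n", order);
		for (p = 0; p < 2; p++)
			for (m = 0; m < 2; m++)
				printf("%.0fMB system, %.0fMB pool = %f\n",
				       mem_mb[m], pool_mb[p],
				       p_order(mem_mb[m] * 256,	/* 4K pages/MB */
					       pool_mb[p] * 256, order));
		printf("\n");
	}
	return 0;
}

The output matches the table above to the printed precision, and makes the underflow rows (0.000000) visible for what they are: the model predicting essentially no chance at all for order >= 3 with an 8MB pool.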
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 18:48 ` Martin J. Bligh @ 2005-11-03 19:08 ` Linus Torvalds 2005-11-03 22:37 ` Martin J. Bligh 2005-11-04 16:22 ` Mel Gorman 0 siblings, 2 replies; 241+ messages in thread From: Linus Torvalds @ 2005-11-03 19:08 UTC (permalink / raw) To: Martin J. Bligh Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven On Thu, 3 Nov 2005, Martin J. Bligh wrote: > > Ha. Just because I don't think I made you puke hard enough already with > foul approximations ... for order 2, I think it's Your basic fault is in believing that the free watermark would stay constant. That's insane. Would you keep 8MB free on a 64MB system? Would you keep 8MB free on a 8GB system? The point being, that if you start with insane assumptions, you'll get insane answers. The _correct_ assumption is that you aim to keep some fixed percentage of memory free. With that assumption and your math, finding higher-order pages is equally hard regardless of amount of memory. Now, your math then doesn't allow for the fact that buddy automatically coalesces for you, so in fact things get _easier_ with more memory, but hey, that needs more math than I can come up with (I never did it as math, only as simulations with allocation patterns - "smart people use math, plodding people just try to simulate an estimate" ;) Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 19:08 ` Linus Torvalds
@ 2005-11-03 22:37 ` Martin J. Bligh
  2005-11-03 23:16 ` Linus Torvalds
  0 siblings, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 22:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

>> Ha. Just because I don't think I made you puke hard enough already with
>> foul approximations ... for order 2, I think it's
>
> Your basic fault is in believing that the free watermark would stay
> constant.
>
> That's insane.
>
> Would you keep 8MB free on a 64MB system?
>
> Would you keep 8MB free on a 8GB system?
>
> The point being, that if you start with insane assumptions, you'll get
> insane answers.

Ummm. I was basing it on what we actually do now in the code, unless I misread it, which is perfectly possible. Do you want this patch?

diff -purN -X /home/mbligh/.diff.exclude linux-2.6.14/mm/page_alloc.c 2.6.14-no_water_cap/mm/page_alloc.c
--- linux-2.6.14/mm/page_alloc.c	2005-10-27 18:52:20.000000000 -0700
+++ 2.6.14-no_water_cap/mm/page_alloc.c	2005-11-03 14:36:06.000000000 -0800
@@ -2387,8 +2387,6 @@ static void setup_per_zone_pages_min(voi
 		min_pages = zone->present_pages / 1024;
 		if (min_pages < SWAP_CLUSTER_MAX)
 			min_pages = SWAP_CLUSTER_MAX;
-		if (min_pages > 128)
-			min_pages = 128;
 		zone->pages_min = min_pages;
 	} else {
 		/* if it's a lowmem zone, reserve a number of pages

> The _correct_ assumption is that you aim to keep some fixed percentage of
> memory free. With that assumption and your math, finding higher-order
> pages is equally hard regardless of amount of memory.

That would, indeed, make more sense.

> Now, your math then doesn't allow for the fact that buddy automatically
> coalesces for you, so in fact things get _easier_ with more memory, but
> hey, that needs more math than I can come up with (I never did it as math,
> only as simulations with allocation patterns - "smart people use math,
> plodding people just try to simulate an estimate" ;)

Not sure what people who do math, but wrongly, are called, but I'm sure it's not polite, and I'm sure I'm one of those ;-)

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
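In isolation, what that hunk changes is easy to tabulate. Below is a userspace re-statement of the same min_pages calculation, with and without the 128-page cap (SWAP_CLUSTER_MAX is 32, as in kernels of this vintage):

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32

/* the setup_per_zone_pages_min() highmem branch, capped or not */
static unsigned long pages_min(unsigned long present, int capped)
{
	unsigned long min_pages = present / 1024;

	if (min_pages < SWAP_CLUSTER_MAX)
		min_pages = SWAP_CLUSTER_MAX;
	if (capped && min_pages > 128)
		min_pages = 128;
	return min_pages;
}

int main(void)
{
	unsigned long mb;

	for (mb = 512; mb <= 131072; mb *= 4) {
		unsigned long pages = mb * 256;	/* 4K pages per MB */

		printf("%6luMB highmem: capped %4lu pages (%6lu kB), "
		       "uncapped %6lu pages (%8lu kB)\n", mb,
		       pages_min(pages, 1), pages_min(pages, 1) * 4,
		       pages_min(pages, 0), pages_min(pages, 0) * 4);
	}
	return 0;
}

The cap only starts to bite above 512MB of highmem; at 128GB the capped watermark is still 512kB while the uncapped one is 128MB, which is the disproportion Martin's probability table was really about.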
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 22:37 ` Martin J. Bligh @ 2005-11-03 23:16 ` Linus Torvalds 2005-11-03 23:39 ` Martin J. Bligh 2005-11-04 4:39 ` Andrew Morton 0 siblings, 2 replies; 241+ messages in thread From: Linus Torvalds @ 2005-11-03 23:16 UTC (permalink / raw) To: Martin J. Bligh Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven On Thu, 3 Nov 2005, Martin J. Bligh wrote: > > Ummm. I was basing it on what we actually do now in the code, unless I > misread it, which is perfectly possible. Do you want this patch? > > diff -purN -X /home/mbligh/.diff.exclude linux-2.6.14/mm/page_alloc.c 2.6.14-no_water_cap/mm/page_alloc.c > --- linux-2.6.14/mm/page_alloc.c 2005-10-27 18:52:20.000000000 -0700 > +++ 2.6.14-no_water_cap/mm/page_alloc.c 2005-11-03 14:36:06.000000000 -0800 > @@ -2387,8 +2387,6 @@ static void setup_per_zone_pages_min(voi > min_pages = zone->present_pages / 1024; > if (min_pages < SWAP_CLUSTER_MAX) > min_pages = SWAP_CLUSTER_MAX; > - if (min_pages > 128) > - min_pages = 128; > zone->pages_min = min_pages; > } else { > /* if it's a lowmem zone, reserve a number of pages Ahh, you're right, there's a totally separate watermark for highmem. I think I even remember this. I may even be responsible. I know some of our less successful highmem balancing efforts in the 2.4.x timeframe had serious trouble when they ran out of highmem, and started pruning lowmem very very aggressively. Limiting the highmem water marks meant that it wouldn't do that very often. I think your patch may in fact be fine, but quite frankly, it needs testing under real load with highmem. In general, I don't _think_ we should do anything different for highmem at all, and we should just in general try to keep a percentage of pages available. Now, the percentage probably does depend on the zone: we should be more aggressive about more "limited" zones, ie the old 16MB DMA zone should probably try to keep a higher percentage of free pages around than the normal zone, and that in turn should probably keep a higher percentage of pages around than the highmem zones. And that's not because of fragmentation so much, but simply because the lower zones tend to have more "desperate" users. Running out of the normal zone is thus a "worse" situation than running out of highmem. And we effectively never want to allocate from the 16MB DMA zone at all, unless it is our only choice. We actually do try to do that with that "lowmem_reserve[]" logic, which reserves more pages in the lower zones the bigger the upper zones are (ie if we _only_ have memory in the low 16MB, then we don't reserve any of it, but if we have _tons_ of memory in the high zones, then we reserve more memory for the low zones and thus make the watermarks higher for them). So the watermarking interacts with that lowmem_reserve logic, and I think that on HIGHMEM, you'd be screwed _twice_: first because the "pages_min" is limited, and second because HIGHMEM has no lowmem_reserve. Does that make sense? Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
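For readers who have not looked at that code, the lowmem_reserve arithmetic goes roughly as sketched below: a lower zone reserves (pages in the zones above it) / ratio against allocations that could have been satisfied higher up. The zone sizes and the ratio values (256 for DMA, 32 for Normal) are illustrative assumptions, not quoted from any tree:

#include <stdio.h>

#define ZONES 3

int main(void)
{
	static const char *name[ZONES] = { "DMA", "Normal", "HighMem" };
	/* assumed per-lower-zone divisors, for illustration only */
	static const unsigned long ratio[ZONES] = { 256, 32, 32 };
	/* 16MB DMA, 880MB Normal, 15GB HighMem, in 4K pages */
	unsigned long present[ZONES] = { 4096, 225280, 3932160 };
	int i, j;

	for (i = 0; i < ZONES - 1; i++) {
		unsigned long higher = 0;

		for (j = i + 1; j < ZONES; j++) {
			/* reserve = (sum of higher zones) / ratio[lower] */
			higher += present[j];
			printf("%-7s reserves %6lu pages (%4lu MB) against "
			       "%s allocations\n", name[i],
			       higher / ratio[i], higher / ratio[i] / 256,
			       name[j]);
		}
	}
	return 0;
}

With these numbers, the 16MB DMA zone's reserve against highmem allocations comes out larger than the DMA zone itself, which is exactly the "effectively never allocate from it unless it is our only choice" behaviour described above.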
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 23:16 ` Linus Torvalds
@ 2005-11-03 23:39 ` Martin J. Bligh
  2005-11-04  0:42 ` Nick Piggin
  1 sibling, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 23:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Arjan van de Ven, Nick Piggin, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven

> Ahh, you're right, there's a totally separate watermark for highmem.
>
> I think I even remember this. I may even be responsible. I know some of
> our less successful highmem balancing efforts in the 2.4.x timeframe had
> serious trouble when they ran out of highmem, and started pruning lowmem
> very very aggressively. Limiting the highmem water marks meant that it
> wouldn't do that very often.
>
> I think your patch may in fact be fine, but quite frankly, it needs
> testing under real load with highmem.
>
> In general, I don't _think_ we should do anything different for highmem at
> all, and we should just in general try to keep a percentage of pages
> available. Now, the percentage probably does depend on the zone: we should
> be more aggressive about more "limited" zones, ie the old 16MB DMA zone
> should probably try to keep a higher percentage of free pages around than
> the normal zone, and that in turn should probably keep a higher percentage
> of pages around than the highmem zones.

Hmm, it strikes me that there will be few (if any?) allocations out of highmem. PPC64 et al dump everything into ZONE_DMA though - so those should be uncapped already.

> And that's not because of fragmentation so much, but simply because the
> lower zones tend to have more "desperate" users. Running out of the normal
> zone is thus a "worse" situation than running out of highmem. And we
> effectively never want to allocate from the 16MB DMA zone at all, unless
> it is our only choice.

Well it's not 16MB on the other platforms, but ...

> We actually do try to do that with that "lowmem_reserve[]" logic, which
> reserves more pages in the lower zones the bigger the upper zones are (ie
> if we _only_ have memory in the low 16MB, then we don't reserve any of it,
> but if we have _tons_ of memory in the high zones, then we reserve more
> memory for the low zones and thus make the watermarks higher for them).
>
> So the watermarking interacts with that lowmem_reserve logic, and I think
> that on HIGHMEM, you'd be screwed _twice_: first because the "pages_min"
> is limited, and second because HIGHMEM has no lowmem_reserve.
>
> Does that make sense?

Yes. So we were only capping highmem before, now that I squint at it closer. I was going off a simplification I'd written for a paper, which is not generally correct.

I doubt frag is a problem in highmem, so maybe the code is correct as-is. We only want contig allocs for virtual when it's mapped 1-1 to physical (ie the kernel mapping) or real physical things. I suppose I could write something to trawl the source tree to check that assumption, but it feels right ...

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 23:39 ` Martin J. Bligh @ 2005-11-04 0:42 ` Nick Piggin 0 siblings, 0 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-04 0:42 UTC (permalink / raw) To: Martin J. Bligh Cc: Linus Torvalds, Mel Gorman, Arjan van de Ven, Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms, Arjan van de Ven Martin J. Bligh wrote: >>Ahh, you're right, there's a totally separate watermark for highmem. >> >>I think I even remember this. I may even be responsible. I know some of >>our less successful highmem balancing efforts in the 2.4.x timeframe had >>serious trouble when they ran out of highmem, and started pruning lowmem >>very very aggressively. Limiting the highmem water marks meant that it >>wouldn't do that very often. >> >>I think your patch may in fact be fine, but quite frankly, it needs >>testing under real load with highmem. >> I'd prefer not. The reason is that it increases the "min" watermark, which only gets used basically by GFP_ATOMIC and PF_MEMALLOC allocators - neither of which are likely to want highmem. Also, I don't think anybody cares about higher order highmem allocations. At least the patches in this thread: http://marc.theaimsgroup.com/?l=linux-kernel&m=113082256231168&w=2 Should be applied before this. However they also need more testing so I'll be sending them to Andrew first. Patch 2 does basically the same thing as your patch, without increasing the min watermark. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 23:16 ` Linus Torvalds 2005-11-03 23:39 ` Martin J. Bligh @ 2005-11-04 4:39 ` Andrew Morton 1 sibling, 0 replies; 241+ messages in thread From: Andrew Morton @ 2005-11-04 4:39 UTC (permalink / raw) To: Linus Torvalds Cc: mbligh, mel, arjan, nickpiggin, haveblue, mingo, kravetz, linux-mm, linux-kernel, lhms-devel, arjanv Linus Torvalds <torvalds@osdl.org> wrote: > > On Thu, 3 Nov 2005, Martin J. Bligh wrote: > > > > Ummm. I was basing it on what we actually do now in the code, unless I > > misread it, which is perfectly possible. Do you want this patch? > > > > diff -purN -X /home/mbligh/.diff.exclude linux-2.6.14/mm/page_alloc.c 2.6.14-no_water_cap/mm/page_alloc.c > > --- linux-2.6.14/mm/page_alloc.c 2005-10-27 18:52:20.000000000 -0700 > > +++ 2.6.14-no_water_cap/mm/page_alloc.c 2005-11-03 14:36:06.000000000 -0800 > > @@ -2387,8 +2387,6 @@ static void setup_per_zone_pages_min(voi > > min_pages = zone->present_pages / 1024; > > if (min_pages < SWAP_CLUSTER_MAX) > > min_pages = SWAP_CLUSTER_MAX; > > - if (min_pages > 128) > > - min_pages = 128; > > zone->pages_min = min_pages; > > } else { > > /* if it's a lowmem zone, reserve a number of pages > > Ahh, you're right, there's a totally separate watermark for highmem. > > I think I even remember this. I may even be responsible. I know some of > our less successful highmem balancing efforts in the 2.4.x timeframe had > serious trouble when they ran out of highmem, and started pruning lowmem > very very aggressively. Limiting the highmem water marks meant that it > wouldn't do that very often. No, that was me and Matthew Dobson, circa 2.5.71. The thinking was that highmem is just for userspace pages and we don't need to keep the free memory pool around for things like atomic allocations. Especially as a proportionally-sized highmem emergency pool would be potentially hundreds of (wasted) megabytes. iirc, things worked ok with a highmem min_pages threshold of zero pages. Back in 2.5.70, before everyone else broke it ;) ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 19:08 ` Linus Torvalds
2005-11-03 22:37 ` Martin J. Bligh
@ 2005-11-04 16:22 ` Mel Gorman
1 sibling, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-04 16:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin J. Bligh, Arjan van de Ven, Nick Piggin, Dave Hansen,
Ingo Molnar, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms, Arjan van de Ven

On Thu, 3 Nov 2005, Linus Torvalds wrote:

> On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> >
> > Ha. Just because I don't think I made you puke hard enough already with
> > foul approximations ... for order 2, I think it's
>
> Your basic fault is in believing that the free watermark would stay
> constant.
>
> That's insane.
>
> Would you keep 8MB free on a 64MB system?
>
> Would you keep 8MB free on a 8GB system?
>
> The point being, that if you start with insane assumptions, you'll get
> insane answers.
>
> The _correct_ assumption is that you aim to keep some fixed percentage of
> memory free. With that assumption and your math, finding higher-order
> pages is equally hard regardless of amount of memory.
>
> Now, your math then doesn't allow for the fact that buddy automatically
> coalesces for you, so in fact things get _easier_ with more memory, but
> hey, that needs more math than I can come up with (I never did it as
> math, only as simulations with allocation patterns - "smart people use
> math, plodding people just try to simulate an estimate" ;)
>

My math is not that great either, so here is a simulation.

Setup: Reboot the machine, which is a quad Xeon xSeries 350 with 1.5GiB of
RAM. Configure /proc/sys/vm/min_free_kbytes to try and keep 1/8th of
physical memory free. This is to keep in line with your suggestion that
fragmentation is low when there is a higher percentage of memory free.

Load: Run a load - 7 kernels compiling simultaneously at -j2, which gives
loads between 10 and 14. Try to get 50% of physical memory in 4MiB pages
(1024 contiguous pages) while compiling. When the test ends and the system
is quiet, try again. 4MiB in this case is a single HugeTLB page.

Here are the results;

2.6.14-rc5-mm1-clean (OOM killer disabled)
Allocating under load
Order:                 10
Allocation type:       HighMem
Attempted allocations: 160
Success allocs:        24
Failed allocs:         136
DMA zone allocs:       0
Normal zone allocs:    16
HighMem zone allocs:   8
% Success:             15

2.6.14-rc5-mm1-mbuddy-v19
Allocating under load
Order:                 10
Allocation type:       HighMem
Attempted allocations: 160
Success allocs:        24
Failed allocs:         136
DMA zone allocs:       0
Normal zone allocs:    11
HighMem zone allocs:   13
% Success:             15

Not a lot of difference there, and the success rate is not great.
mbuddy-v19 is a bit better at the normal zone and that's about it. These
results are not surprising, as kswapd is making no effort to get contiguous
pages. Under a load of 7 kernel compiles, kswapd will not free pages fast
enough.

When the test ends and the system is quiet, try to get 80% of physical
memory in large pages. 4 attempts are made to satisfy the requests, to give
kswapd lots of time.

2.6.14-rc5-mm1-clean (OOM killer disabled)
Allocating while rested
Order:                 10
Allocation type:       HighMem
Attempted allocations: 300
Success allocs:        159
Failed allocs:         141
DMA zone allocs:       0
Normal zone allocs:    46
HighMem zone allocs:   113
% Success:             53

Mainly highmem there.

2.6.14-rc5-mm1-mbuddy-v19
Allocating while rested
Order:                 10
Allocation type:       HighMem
Attempted allocations: 300
Success allocs:        212
Failed allocs:         88
DMA zone allocs:       0
Normal zone allocs:    102
HighMem zone allocs:   110
% Success:             70

Look at the big difference in the number of successful allocations in
ZONE_NORMAL, because the kernel allocations were kept together. Experience
has shown me that failure to get higher success rates depends on per-cpu
pages and on the number of kernel pages that leaked to other areas (56 over
the course of this test). Kernel-page leaking was helped a lot by setting
min_free_kbytes higher than the default.

I then ported forward the linear scanner and ran the tests again. The
linear scanner does two things - it finds linearly reclaimable pages using
information provided by anti-defrag, and it drains the per-cpu caches. I'll
post the linear scanner code if people want to look at it, but it's really
crap. Being slow, working too hard, and not keeping the reclaimed pages for
the process doing the reclaiming are just some of its problems. I need to
rewrite it almost from scratch and avoid all the mistakes, but it's a path
that is hit *only* if you are allocating high orders.

2.6.14-rc5-mm1-mbuddy-v19-lnscan
Allocating under load
Order:                 10
Allocation type:       HighMem
Attempted allocations: 160
Success allocs:        155
Failed allocs:         0
DMA zone allocs:       0
Normal zone allocs:    12
HighMem zone allocs:   143
% Success:             96

It mainly got its pages back from highmem, which is always easier as long
as PTE pages are not in the way.

2.6.14-rc5-mm1-mbuddy-v19-lnscan
Allocating while rested
Order:                 10
Allocation type:       HighMem
Attempted allocations: 300
Success allocs:        275
Failed allocs:         0
DMA zone allocs:       0
Normal zone allocs:    133
HighMem zone allocs:   142
% Success:             91

That is 71% of physical memory available in contiguous blocks with the
linear scanner, but that code is not ready. Anti-defrag on its own, as it
is today, was able to get 55% of physical memory in 4MiB chunks. This is
provided without performance regressions in the normal case everyone cares
about. In my tests, there are minor improvements on aim9, which is
artificial, and kernel build tests, which people do care about, gained a
few seconds.

Do these patches still make no sense to you? Lower fragmentation that does
not impact the cases everyone cares about? If so, why?

To get the best possible results, a zone approach could still be built on
top of this, and it seems as if it's worth developing. At the cost of some
configuration, the zone would give *hard* guarantees on the available
number of large pages, and anti-defrag would give best effort everywhere
else. By default, without configuration, you would get best-effort.

--
Mel Gorman
Part-time Phd Student                 Java Applications Developer
University of Limerick                IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
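[Illustration, not part of the original thread: the benchmark driver Mel
used is not shown here. The sketch below is a guess at the shape of such a
test in a kernel-module context - the GFP flags, the function name, and the
immediate free are illustrative simplifications; the real harness holds the
pages so it can count how much memory is simultaneously available.]

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/kernel.h>

/*
 * Attempt a fixed number of order-10 (4MiB with 4KiB pages) highmem
 * allocations and report the success rate. Purely illustrative.
 */
static void try_highorder_allocs(int attempts, unsigned int order)
{
	int success = 0, i;

	for (i = 0; i < attempts; i++) {
		struct page *page = alloc_pages(GFP_HIGHUSER, order);

		if (page) {
			success++;
			/* simplification: the real test holds the pages */
			__free_pages(page, order);
		}
	}
	printk(KERN_INFO "order-%u: %d/%d (%d%%) succeeded\n",
	       order, success, attempts, (success * 100) / attempts);
}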
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 15:40 ` Arjan van de Ven
2005-11-03 15:51 ` Linus Torvalds
@ 2005-11-03 15:53 ` Martin J. Bligh
1 sibling, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:53 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
Linus Torvalds, kravetz, linux-mm, Linux Kernel Mailing List, lhms,
Arjan van de Ven

--Arjan van de Ven <arjan@infradead.org> wrote (on Thursday, November 03, 2005 16:40:21 +0100):

> On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
>> >> Can we quit coming up with specialist hacks for hotplug, and try to
>> >> solve the generic problem please? hotplug is NOT the only issue here.
>> >> Fragmentation in general is.
>> >
>> > Not really it isn't. There have been a few cases (e1000 being the main
>> > one, and is fixed upstream) where fragmentation in general is a
>> > problem. But mostly it is not.
>>
>> Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
>
> with CONFIG_4KSTACKS :)

I've been told previously that doesn't work for x86_64 and other 64-bit
platforms. Is that incorrect?

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 15:01 ` Ingo Molnar
2005-11-01 15:22 ` Dave Hansen
@ 2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
` (3 more replies)
1 sibling, 4 replies; 241+ messages in thread
From: Kamezawa Hiroyuki @ 2005-11-01 16:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms

Ingo Molnar wrote:
> so it's all about expectations: _could_ you reasonably remove a piece of
> RAM? Customer will say: "I have stopped all nonessential services, and
> free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
> the kernel!". No reasonable customer will say: "True, I have all RAM
> used up in mlock()ed sections, but i want to remove some RAM
> nevertheless".
>

Hi, I'm one of the people in -lhms.

In my understanding...

- Memory hot-remove with IBM's LPAR(?) approach is
  [remove some amount of memory from somewhere].
  For this approach, Mel's patch will work well. But it will not guarantee
  that a user can remove a specified range of memory at any time, because
  how a memory range is used is defined not by an admin but by the kernel,
  automatically. But to extract some amount of memory, Mel's patch is very
  important and they need this.

My own target is NUMA node hotplug, and what NUMA node hotplug wants is

- [remove this range of memory].
  For this approach, an admin should define *core* nodes and removable
  nodes. Memory on a removable node is removable. Dividing areas into
  removable and not-removable is needed, because we cannot allocate any
  kernel objects in a removable area. A removable area should be 100%
  removable. The customer can know the limitation before using the system.

What I'm considering now is this:
- a removable area is a hot-added area
- a not-removable area is memory which is visible to the kernel at boot
  time. (I'd like to achieve this by the limitation that a hot-added node
  goes only into ZONE_HIGHMEM.)

A customer can hot-add their extra memory after boot. This is very easy to
understand. The performance problem is a trade-off. (I'm afraid of this ;)

If a customer wants to guarantee that some memory areas are hot-removable,
he will hot-add them. I don't think adding memory for the kernel by
hot-add is wanted by a customer.

-- Kame

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
@ 2005-11-01 16:59 ` Kamezawa Hiroyuki
2005-11-01 17:19 ` Mel Gorman
` (2 subsequent siblings)
3 siblings, 0 replies; 241+ messages in thread
From: Kamezawa Hiroyuki @ 2005-11-01 16:59 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

Kamezawa Hiroyuki wrote:
> Ingo Molnar wrote:
>
>> so it's all about expectations: _could_ you reasonably remove a piece
>> of RAM? Customer will say: "I have stopped all nonessential services,
>> and free RAM is at 90%, still I cannot remove that piece of faulty
>> RAM, fix the kernel!". No reasonable customer will say: "True, I have
>> all RAM used up in mlock()ed sections, but i want to remove some RAM
>> nevertheless".
>>
> Hi, I'm one of the people in -lhms.
>
> In my understanding...
> - Memory hot-remove with IBM's LPAR(?) approach is
>   [remove some amount of memory from somewhere].
>   For this approach, Mel's patch will work well. But it will not
>   guarantee that a user can remove a specified range of memory at any
>   time, because how a memory range is used is defined not by an admin
>   but by the kernel, automatically. But to extract some amount of
>   memory, Mel's patch is very important and they need this.
>

One more consideration...

Some CPUs which support virtualization will be shipped by some vendors in
the near future. If someone uses a virtualized OS, the only problem is
*resizing*. A hypervisor will be able to remap semi-physical pages
anywhere with hardware assistance, but system resizing needs operating
system assistance. In this direction, [remove some amount of memory from
somewhere] is an important approach.

-- Kame

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
@ 2005-11-01 17:19 ` Mel Gorman
2005-11-02 0:32 ` KAMEZAWA Hiroyuki
2005-11-01 18:06 ` linux-os (Dick Johnson)
2005-11-02 7:19 ` Ingo Molnar
3 siblings, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-01 17:19 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Wed, 2 Nov 2005, Kamezawa Hiroyuki wrote:

> Ingo Molnar wrote:
> > so it's all about expectations: _could_ you reasonably remove a piece of
> > RAM? Customer will say: "I have stopped all nonessential services, and
> > free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
> > the kernel!". No reasonable customer will say: "True, I have all RAM
> > used up in mlock()ed sections, but i want to remove some RAM
> > nevertheless".
> >
> Hi, I'm one of the people in -lhms.
>
> In my understanding...
> - Memory hot-remove with IBM's LPAR(?) approach is
>   [remove some amount of memory from somewhere].
>   For this approach, Mel's patch will work well. But it will not guarantee
>   that a user can remove a specified range of memory at any time, because
>   how a memory range is used is defined not by an admin but by the kernel,
>   automatically. But to extract some amount of memory, Mel's patch is very
>   important and they need this.
>
> My own target is NUMA node hotplug, and what NUMA node hotplug wants is
> - [remove this range of memory].
>   For this approach, an admin should define *core* nodes and removable
>   nodes. Memory on a removable node is removable. Dividing areas into
>   removable and not-removable is needed, because we cannot allocate any
>   kernel objects in a removable area. A removable area should be 100%
>   removable. The customer can know the limitation before using the system.
>

In this case, we would want some mechanism that says "don't put awkward
pages in this NUMA node" in a clear way. One way we could do this is;

1. Move fallback_allocs to be per-node. fallback_allocs is currently
   defined as

int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
	{RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN,   RCLM_EASY, RCLM_TYPES},
	{RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
	{RCLM_KERN,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
};

   The effect is that a RCLM_NORCLM allocation falls back to
   RCLM_FALLBACK, RCLM_KERN, RCLM_EASY and then gives up.

2. Architectures would need to provide a function that allocates and
   populates a fallback_allocs[][] array. If they do not provide one, a
   generic function uses an array like the one above.

3. When adding a node that must be removable, make the array look like
   this

int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
	{RCLM_NORCLM, RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
	{RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN,  RCLM_TYPES},
	{RCLM_KERN,   RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
};

   The effect of this is that only allocations that are easily reclaimable
   will end up in this node.

This would be a straightforward addition to build upon this set of patches.
The difference would only be visible to architectures that cared.

> What I'm considering now is this:
> - a removable area is a hot-added area
> - a not-removable area is memory which is visible to the kernel at boot
>   time. (I'd like to achieve this by the limitation that a hot-added node
>   goes only into ZONE_HIGHMEM.)

ZONE_HIGHMEM can still end up with PTE pages if allocating PTE pages from
highmem is configured. This is bad. With the above approach, nodes that are
not hot-added that have a ZONE_HIGHMEM will be able to use it for PTEs as
well. But when a node is hot-added, it will have a ZONE_HIGHMEM that is not
used for PTE allocations, because they are not RCLM_EASY allocations.

--
Mel Gorman
Part-time Phd Student                 Java Applications Developer
University of Limerick                IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
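[Illustration, not part of the original thread: to make the role of the
array concrete, here is a minimal sketch of how an allocator might walk one
row of fallback_allocs[] until the RCLM_TYPES sentinel. rmqueue_type() is a
hypothetical helper standing in for the patches' actual per-type free-list
removal code.]

/*
 * Each row of fallback_allocs[] lists, in order, the reclaim types an
 * allocation of a given type may steal from, terminated by RCLM_TYPES.
 */
static struct page *alloc_with_fallback(struct zone *zone, int alloctype,
					unsigned int order)
{
	int *fallback = fallback_allocs[alloctype];
	int i;

	for (i = 0; fallback[i] != RCLM_TYPES; i++) {
		/* hypothetical helper: pop a block from this type's lists */
		struct page *page = rmqueue_type(zone, fallback[i], order);

		if (page)
			return page;
	}
	return NULL;	/* every permitted type's free lists were empty */
}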
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 17:19 ` Mel Gorman
@ 2005-11-02 0:32 ` KAMEZAWA Hiroyuki
2005-11-02 11:22 ` Mel Gorman
0 siblings, 1 reply; 241+ messages in thread
From: KAMEZAWA Hiroyuki @ 2005-11-02 0:32 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Dave Hansen, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms

Mel Gorman wrote:
> 3. When adding a node that must be removable, make the array look like
>    this
>
> int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
> 	{RCLM_NORCLM, RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
> 	{RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN,  RCLM_TYPES},
> 	{RCLM_KERN,   RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
> };
>
>    The effect of this is that only allocations that are easily reclaimable
>    will end up in this node. This would be a straightforward addition to
>    build upon this set of patches. The difference would only be visible to
>    architectures that cared.
>

Thank you for the illustration.
Maybe a fallback list per pgdat/zone is what I need with your patch, right?

-- Kame

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 0:32 ` KAMEZAWA Hiroyuki
@ 2005-11-02 11:22 ` Mel Gorman
0 siblings, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-02 11:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Wed, 2 Nov 2005, KAMEZAWA Hiroyuki wrote:

> Mel Gorman wrote:
> > 3. When adding a node that must be removable, make the array look like
> >    this
> >
> > int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
> > 	{RCLM_NORCLM, RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
> > 	{RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN,  RCLM_TYPES},
> > 	{RCLM_KERN,   RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
> > };
> >
> >    The effect of this is that only allocations that are easily
> >    reclaimable will end up in this node. This would be a straightforward
> >    addition to build upon this set of patches. The difference would only
> >    be visible to architectures that cared.
> >
> Thank you for the illustration.
> Maybe a fallback list per pgdat/zone is what I need with your patch, right?
>

With my patch, yes. With zones, you need to change how zonelists are built
for each node.

--
Mel Gorman
Part-time Phd Student                 Java Applications Developer
University of Limerick                IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
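[Illustration, not part of the original thread: "how zonelists are built"
refers to build_zonelists() in mm/page_alloc.c. Below is a rough sketch of
the kind of change Mel means, using ZONE_REMOVABLE - the hypothetical zone
from the hotplug patches discussed elsewhere in this thread - and it is not
taken from any posted patch.]

/*
 * When building the zonelist used for kernel allocations, skip a
 * hypothetical ZONE_REMOVABLE so pinned kernel objects never land
 * there; only easily-reclaimable allocations would get a zonelist
 * that includes it.
 */
static void build_kernel_zonelist(pg_data_t *pgdat, struct zonelist *zl)
{
	int i, j = 0;

	for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
		struct zone *zone = pgdat->node_zones + i;

		if (i == ZONE_REMOVABLE)	/* hypothetical zone */
			continue;
		if (zone->present_pages)	/* zone actually has memory */
			zl->zones[j++] = zone;
	}
	zl->zones[j] = NULL;	/* 2.6-era zonelists are NULL-terminated */
}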
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
2005-11-01 17:19 ` Mel Gorman
@ 2005-11-01 18:06 ` linux-os (Dick Johnson)
2005-11-02 7:19 ` Ingo Molnar
3 siblings, 0 replies; 241+ messages in thread
From: linux-os (Dick Johnson) @ 2005-11-01 18:06 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Ingo Molnar, Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Tue, 1 Nov 2005, Kamezawa Hiroyuki wrote:

> Ingo Molnar wrote:
>> so it's all about expectations: _could_ you reasonably remove a piece of
>> RAM? Customer will say: "I have stopped all nonessential services, and
>> free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
>> the kernel!". No reasonable customer will say: "True, I have all RAM
>> used up in mlock()ed sections, but i want to remove some RAM
>> nevertheless".
>>
> Hi, I'm one of the people in -lhms.
>
> In my understanding...
> - Memory hot-remove with IBM's LPAR(?) approach is
>   [remove some amount of memory from somewhere].
>   For this approach, Mel's patch will work well. But it will not guarantee
>   that a user can remove a specified range of memory at any time, because
>   how a memory range is used is defined not by an admin but by the kernel,
>   automatically. But to extract some amount of memory, Mel's patch is very
>   important and they need this.
>
> My own target is NUMA node hotplug, and what NUMA node hotplug wants is
> - [remove this range of memory].
>   For this approach, an admin should define *core* nodes and removable
>   nodes. Memory on a removable node is removable. Dividing areas into
>   removable and not-removable is needed, because we cannot allocate any
>   kernel objects in a removable area. A removable area should be 100%
>   removable. The customer can know the limitation before using the system.
>
> What I'm considering now is this:
> - a removable area is a hot-added area
> - a not-removable area is memory which is visible to the kernel at boot
>   time. (I'd like to achieve this by the limitation that a hot-added node
>   goes only into ZONE_HIGHMEM.)
> A customer can hot-add their extra memory after boot. This is very easy to
> understand. The performance problem is a trade-off. (I'm afraid of this ;)
>
> If a customer wants to guarantee that some memory areas are hot-removable,
> he will hot-add them. I don't think adding memory for the kernel by
> hot-add is wanted by a customer.
>
> -- Kame

With ix86 machines, the page directory pointed to by CR3 always needs to
be present in physical memory. This means that there must always be some
RAM that can't be hot-swapped (you can't put back the contents of the page
directory without using the CPU, which needs the page directory). This is
explained on page 5-21 of the i486 reference manual. It happens because
there is no "present" bit in CR3 as there is in the page tables themselves.

This means that "surprise" swaps are impossible. However, given
forewarning, it is possible to build a new table somewhere in existing RAM
within the physical constraints required, call some code there (it needs
to be a 1:1 translation), disable paging, then proceed. The problem is
that of writing the contents of the RAM to be replaced out to storage
media, so the new page table needs to be loaded from the new location.
This may not work if the LDT and the GDT are not accessible from their
current locations. If they are in the RAM to be replaced, you are in a
world of hurt, taking the "world" apart and putting it back together
again.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.55 BogoMips).
Warning : 98.36% of all statistics are fiction.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 16:48 ` Kamezawa Hiroyuki
` (2 preceding siblings ...)
2005-11-01 18:06 ` linux-os (Dick Johnson)
@ 2005-11-02 7:19 ` Ingo Molnar
2005-11-02 7:46 ` Gerrit Huizenga
2005-11-02 7:57 ` Nick Piggin
3 siblings, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-02 7:19 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton,
kravetz, linux-mm, Linux Kernel Mailing List, lhms

* Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> My own target is NUMA node hotplug, and what NUMA node hotplug wants is
> - [remove this range of memory].
>   For this approach, an admin should define *core* nodes and removable
>   nodes. Memory on a removable node is removable. Dividing areas into
>   removable and not-removable is needed, because we cannot allocate any
>   kernel objects in a removable area. A removable area should be 100%
>   removable. The customer can know the limitation before using the
>   system.

that's a perfectly fine method, and is quite similar to the 'separate
zone' approach Nick mentioned too. It is also easily understandable for
users/customers.

under such an approach, things become easier as well: if you have zones
you can restrict (no kernel pinned-down allocations, no mlock-ed pages,
etc.), there's no need for any 'fragmentation avoidance' patches!
Basically all of that RAM becomes instantly removable (with some small
complications). That's the beauty of the separate-zones approach. It is
also a limitation: no kernel allocations, so all the highmem-alike
restrictions apply to it too.

but what is a dangerous fallacy is that we will be able to support hot
memory unplug of generic kernel RAM in any reliable way!

you really have to look at this from the conceptual angle: 'can an
approach ever lead to a satisfactory result?' If the answer is 'no', then
we _must not_ add a 90% solution that we _know_ will never be a 100%
solution.

for the separate-removable-zones approach we see the end of the tunnel.
Separate zones are well-understood. generic unpluggable kernel RAM _will
not work_.

	Ingo

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:19 ` Ingo Molnar
@ 2005-11-02 7:46 ` Gerrit Huizenga
2005-11-02 8:50 ` Nick Piggin
2005-11-02 10:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar
1 sibling, 2 replies; 241+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 7:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wed, 02 Nov 2005 08:19:43 +0100, Ingo Molnar wrote:
>
> * Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > My own target is NUMA node hotplug, and what NUMA node hotplug wants is
> > - [remove this range of memory].
> >   For this approach, an admin should define *core* nodes and removable
> >   nodes. Memory on a removable node is removable. Dividing areas into
> >   removable and not-removable is needed, because we cannot allocate
> >   any kernel objects in a removable area. A removable area should be
> >   100% removable. The customer can know the limitation before using
> >   the system.
>
> that's a perfectly fine method, and is quite similar to the 'separate
> zone' approach Nick mentioned too. It is also easily understandable for
> users/customers.
>
> under such an approach, things become easier as well: if you have zones
> you can restrict (no kernel pinned-down allocations, no mlock-ed
> pages, etc.), there's no need for any 'fragmentation avoidance' patches!
> Basically all of that RAM becomes instantly removable (with some small
> complications). That's the beauty of the separate-zones approach. It is
> also a limitation: no kernel allocations, so all the highmem-alike
> restrictions apply to it too.
>
> but what is a dangerous fallacy is that we will be able to support hot
> memory unplug of generic kernel RAM in any reliable way!
>
> you really have to look at this from the conceptual angle: 'can an
> approach ever lead to a satisfactory result?' If the answer is 'no',
> then we _must not_ add a 90% solution that we _know_ will never be a
> 100% solution.
>
> for the separate-removable-zones approach we see the end of the tunnel.
> Separate zones are well-understood.
>
> generic unpluggable kernel RAM _will not work_.

Actually, it will. Well, depending on terminology.

There are two usage models here - those which intend to remove physical
elements, and those where the kernel returns management of its virtualized
"physical" memory to a hypervisor. In the latter case, a hypervisor
already maintains a virtual map of the memory, and the OS needs to release
virtualized "physical" memory.

I think you are referring to RAM here as the physical component; however,
these same defrag patches help where a hypervisor is maintaining the real
physical memory below the operating system and the OS is managing a
virtualized "physical" memory.

On pSeries hardware or with Xen, a client OS can return chunks of memory
to the hypervisor. That memory needs to be returned in chunks of the size
that the hypervisor normally manages/maintains. But long ranges of
physical contiguity are not required. Just shorter ranges, depending on
what the hypervisor maintains, need to be returned from the OS to the
hypervisor.

In other words, if we can return 1 MB chunks, the hypervisor can hand out
those 1 MB chunks to other domains/partitions. So, if we can return 500
1 MB chunks from a 2 GB OS instance, we can add 500 MB dynamically to
another OS image.

This happens to be a *very* satisfactory answer for virtualized
environments. The other answer, which is harder, is to return (free)
entire large physical chunks, e.g. the size of the full memory of a node,
allowing a node to be dynamically removed (or a DIMM/SIMM/etc.).

So, people are working towards two distinct solutions, both of which
require us to do a better job of defragmenting memory (or avoiding
fragmentation in the first place).

gerrit

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:46 ` Gerrit Huizenga
@ 2005-11-02 8:50 ` Nick Piggin
2005-11-02 9:12 ` Gerrit Huizenga
1 sibling, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 8:50 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

Gerrit Huizenga wrote:

> So, people are working towards two distinct solutions, both of which
> require us to do a better job of defragmenting memory (or avoiding
> fragmentation in the first place).
>

This is just going around in circles. Even with your fragmentation
avoidance and memory defragmentation, there are still going to be cases
where memory does get fragmented and can't be defragmented. This is Ingo's
point, I believe.

Isn't the solution for your hypervisor problem to dish out pages of the
same size that are used by the virtual machines? Doesn't this provide you
with a nice, 100% solution that doesn't add complexity where it isn't
needed?

--
SUSE Labs, Novell Inc.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:50 ` Nick Piggin
@ 2005-11-02 9:12 ` Gerrit Huizenga
2005-11-02 9:37 ` Nick Piggin
0 siblings, 1 reply; 241+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 9:12 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:
> Gerrit Huizenga wrote:
>
> > So, people are working towards two distinct solutions, both of which
> > require us to do a better job of defragmenting memory (or avoiding
> > fragmentation in the first place).
> >
>
> This is just going around in circles. Even with your fragmentation
> avoidance and memory defragmentation, there are still going to be
> cases where memory does get fragmented and can't be defragmented.
> This is Ingo's point, I believe.
>
> Isn't the solution for your hypervisor problem to dish out pages of
> the same size that are used by the virtual machines? Doesn't this
> provide you with a nice, 100% solution that doesn't add complexity
> where it isn't needed?

So do you see the problem with fragmentation if the hypervisor is handing
out, say, 1 MB pages? Or, more likely, something like 64 MB pages? What
are the chances that an entire 64 MB page can be freed on a large system
that has been up a while?

And, if you create zones, you run into all of the zone rebalancing
problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on any
long running system, ZONE_HOTPLUGGABLE has been overwhelmed with random
allocations, making almost none of it available.

However, with reasonable defragmentation or fragmentation avoidance, we
have some potential to make large chunks available for return to the
hypervisor. And, that same capability continues to help those who want
to remove fixed ranges of physical memory.

gerrit

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:12 ` Gerrit Huizenga
@ 2005-11-02 9:37 ` Nick Piggin
2005-11-02 10:17 ` Gerrit Huizenga
2005-11-02 23:47 ` Rob Landley
0 siblings, 2 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 9:37 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

Gerrit Huizenga wrote:
> On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:

>> Isn't the solution for your hypervisor problem to dish out pages of
>> the same size that are used by the virtual machines? Doesn't this
>> provide you with a nice, 100% solution that doesn't add complexity
>> where it isn't needed?
>
> So do you see the problem with fragmentation if the hypervisor is
> handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> pages? What are the chances that an entire 64 MB page can be freed
> on a large system that has been up a while?
>

I see the problem, but if you want to be able to shrink memory to a given
size, then you must either introduce a hard limit somewhere, or have the
hypervisor hand out guest sized pages. Use zones, or Xen?

> And, if you create zones, you run into all of the zone rebalancing
> problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on
> any long running system, ZONE_HOTPLUGGABLE has been overwhelmed with
> random allocations, making almost none of it available.
>

If there are zone rebalancing problems[*], then it would be great to have
more users of zones because then they will be more likely to get fixed.

[*] and there are, sadly enough - see the recent patches I posted to
lkml for example. But I'm fairly confident that once the particularly
silly ones have been fixed, zone balancing will no longer be a derogatory
term as has been thrown around (maybe rightly) in this thread!

--
SUSE Labs, Novell Inc.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:37 ` Nick Piggin
@ 2005-11-02 10:17 ` Gerrit Huizenga
1 sibling, 0 replies; 241+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 10:17 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wed, 02 Nov 2005 20:37:43 +1100, Nick Piggin wrote:
> Gerrit Huizenga wrote:
> > On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:
>
> >> Isn't the solution for your hypervisor problem to dish out pages of
> >> the same size that are used by the virtual machines? Doesn't this
> >> provide you with a nice, 100% solution that doesn't add complexity
> >> where it isn't needed?
> >
> > So do you see the problem with fragmentation if the hypervisor is
> > handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> > pages? What are the chances that an entire 64 MB page can be freed
> > on a large system that has been up a while?
>
> I see the problem, but if you want to be able to shrink memory to a
> given size, then you must either introduce a hard limit somewhere, or
> have the hypervisor hand out guest sized pages. Use zones, or Xen?

So why do you believe there must be a hard limit? Any reduction in memory
usage is going to be workload related. If the workload is consuming less
memory than is available, memory reclaim is easy (e.g. handle
fragmentation, find nice sized chunks). The workload determines how much
the administrator can free.

If the workload is using all of the resources available (e.g. lots of
associated kernel memory locked down, locked user pages, etc.) then the
administrator will logically be able to remove less memory from the
machine. The amount of memory to be freed up is not determined by some
pre-defined machine constraints but based on the actual workload's use of
the machine.

In other words, who really cares if there is some hard limit? The only
limit should be the number of pages not currently needed by a given
workload, not some arbitrary zone size.

> > And, if you create zones, you run into all of the zone rebalancing
> > problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on
> > any long running system, ZONE_HOTPLUGGABLE has been overwhelmed with
> > random allocations, making almost none of it available.
>
> If there are zone rebalancing problems[*], then it would be great to
> have more users of zones because then they will be more likely to get
> fixed.
>
> [*] and there are, sadly enough - see the recent patches I posted to
> lkml for example. But I'm fairly confident that once the particularly
> silly ones have been fixed, zone balancing will no longer be a
> derogatory term as has been thrown around (maybe rightly) in this
> thread!

You are more optimistic here than I. You might have improved the problem,
but I think that any zone rebalancing problem is intrinsically hard given
the way those zones are used and the fact that we sort of want them to be
dynamic and yet physically contiguous. Those two core constraints seem to
be relatively at odds with each other.

I'm not a huge fan of dividing memory up into different types which are
all special purposed. Everything that becomes special purposed over time
limits its use and brings up questions on what special-purpose bucket each
allocation should use (e.g. ZONE_NORMAL or ZONE_HIGHMEM or ZONE_DMA or
ZONE_HOTPLUGGABLE).

And then, when you run out of ZONE_HIGHMEM and have to reach into
ZONE_HOTPLUGGABLE for some pinned memory allocation, it seems the whole
concept leads to a messy train wreck.

gerrit

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 9:37 ` Nick Piggin
2005-11-02 10:17 ` Gerrit Huizenga
@ 2005-11-02 23:47 ` Rob Landley
2005-11-03 4:43 ` Nick Piggin
1 sibling, 1 reply; 241+ messages in thread
From: Rob Landley @ 2005-11-02 23:47 UTC (permalink / raw)
To: Nick Piggin
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wednesday 02 November 2005 03:37, Nick Piggin wrote:

> > So do you see the problem with fragmentation if the hypervisor is
> > handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> > pages? What are the chances that an entire 64 MB page can be freed
> > on a large system that has been up a while?
>
> I see the problem, but if you want to be able to shrink memory to a
> given size, then you must either introduce a hard limit somewhere, or
> have the hypervisor hand out guest sized pages. Use zones, or Xen?

In the UML case, I want the system to automatically be able to hand back
any sufficiently large chunks of memory it currently isn't using.

What does this have to do with specifying hard limits of anything? What's
to specify? Workloads vary. Deal with it.

> If there are zone rebalancing problems[*], then it would be great to
> have more users of zones because then they will be more likely to get
> fixed.

Ok, so you want to artificially turn this into a zone balancing issue in
hopes of giving that area of the code more testing when, if zones weren't
involved, there would be no need for balancing at all?

How does that make sense?

> [*] and there are, sadly enough - see the recent patches I posted to
> lkml for example.

I was under the impression that zone balancing is, conceptually speaking,
a difficult problem.

> But I'm fairly confident that once the particularly
> silly ones have been fixed,

Great, you're advocating migrating the fragmentation patches to an area of
code that has known problems you yourself describe as "particularly
silly". A ringing endorsement, that.

The fact that the migrated version wouldn't even address fragmentation
avoidance at all (the topic of this thread!) is apparently a side issue.

> zone balancing will no longer be a
> derogatory term as has been thrown around (maybe rightly) in this
> thread!

If I'm not mistaken, you introduced zones into this thread; you are the
primary (possibly only) proponent of them.

Yes, zones are a way of categorizing memory. They're not a way of
defragmenting it.

Rob

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 23:47 ` Rob Landley
@ 2005-11-03 4:43 ` Nick Piggin
2005-11-03 6:07 ` Rob Landley
0 siblings, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-03 4:43 UTC (permalink / raw)
To: Rob Landley
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

Rob Landley wrote:

> In the UML case, I want the system to automatically be able to hand back
> any sufficiently large chunks of memory it currently isn't using.
>

I'd just be happy with UML handing back page sized chunks of memory that
it isn't currently using. How does contiguous memory (in either the host
or the guest) help this?

> What does this have to do with specifying hard limits of anything? What's
> to specify? Workloads vary. Deal with it.
>

Umm, if you hadn't bothered to read the thread then I won't go through it
all again. The short of it is that if you want guaranteed unfragmented
memory you have to specify a limit.

>
>> If there are zone rebalancing problems[*], then it would be great to
>> have more users of zones because then they will be more likely to get
>> fixed.
>
> Ok, so you want to artificially turn this into a zone balancing issue in
> hopes of giving that area of the code more testing when, if zones weren't
> involved, there would be no need for balancing at all?
>
> How does that make sense?
>

Have you looked at the frag patches? Do you realise that they have to
balance between the different types of memory blocks? Duplicating the same
or similar infrastructure (in this case, a memory zoning facility) is a
bad thing in general.

>
>> [*] and there are, sadly enough - see the recent patches I posted to
>> lkml for example.
>
> I was under the impression that zone balancing is, conceptually speaking,
> a difficult problem.
>

I am under the impression that you think proper fragmentation avoidance
is easier.

>
>> But I'm fairly confident that once the particularly
>> silly ones have been fixed,
>
> Great, you're advocating migrating the fragmentation patches to an area
> of code that has known problems you yourself describe as "particularly
> silly". A ringing endorsement, that.
>

Err, the point is so we don't now have 2 layers doing very similar things,
at least one of which has "particularly silly" bugs in it.

> The fact that the migrated version wouldn't even address fragmentation
> avoidance at all (the topic of this thread!) is apparently a side issue.
>

Zones can be used to guarantee physically contiguous regions with exactly
the same effectiveness as the frag patches.

>
>> zone balancing will no longer be a
>> derogatory term as has been thrown around (maybe rightly) in this
>> thread!
>
> If I'm not mistaken, you introduced zones into this thread; you are the
> primary (possibly only) proponent of them.

So you didn't look at Yasunori Goto's patch from last year that implements
exactly what I described, then?

> Yes, zones are a way of categorizing memory.

Yes, have you read Mel's patches? Guess what they do?

> They're not a way of defragmenting it.

Guess what they don't?

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 4:43 ` Nick Piggin
@ 2005-11-03 6:07 ` Rob Landley
2005-11-03 7:34 ` Nick Piggin
2005-11-03 16:35 ` Jeff Dike
0 siblings, 2 replies; 241+ messages in thread
From: Rob Landley @ 2005-11-03 6:07 UTC (permalink / raw)
To: Nick Piggin
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wednesday 02 November 2005 22:43, Nick Piggin wrote:
> Rob Landley wrote:
> > In the UML case, I want the system to automatically be able to hand back
> > any sufficiently large chunks of memory it currently isn't using.
>
> I'd just be happy with UML handing back page sized chunks of memory that
> it isn't currently using. How does contiguous memory (in either the host
> or the guest) help this?

Smaller chunks of memory are likely to be reclaimed really soon, and adding
in the syscall overhead of working with individual pages of memory is
almost guaranteed to slow us down. Plus with punch, we'd be fragmenting the
heck out of the underlying file.

> > What does this have to do with specifying hard limits of anything?
> > What's to specify? Workloads vary. Deal with it.
>
> Umm, if you hadn't bothered to read the thread then I won't go through
> it all again. The short of it is that if you want guaranteed unfragmented
> memory you have to specify a limit.

I read it. It just didn't contain an answer to the question. I want UML to
be able to hand back however much memory it's not using, but handing back
individual pages as we free them and inserting a syscall overhead for every
page freed and allocated is just nuts. (Plus, at page size, the OS isn't
likely to zero them much faster than we can ourselves even without the
syscall overhead.) Defragmentation means we can batch this into a
granularity that makes it worth it.

This has nothing to do with hard limits on anything.

> Have you looked at the frag patches?

I've read Mel's various descriptions, and tried to stay more or less up to
date ever since LWN brought it to my attention. But I can't say I'm a Linux
VM system expert. (The last time I felt I had a really firm grasp on it was
before Andrea and Rik started arguing circa 2.4, and Andrea spent six
months just assuming everybody already knew what a classzone was. I've had
other things to do since then...)

> Do you realise that they have to
> balance between the different types of memory blocks?

I realise they merge them back together into larger chunks as they free up
space, and split larger chunks when they haven't got a smaller one.

> Duplicating the
> same or similar infrastructure (in this case, a memory zoning facility)
> is a bad thing in general.

Even when they keep track of very different things? The memory zoning thing
is about where stuff is in physical memory, and it exists because various
hardware that wants to access memory (24-bit DMA, 32-bit DMA, and PAE) is
evil and crippled and we have to humor it by not asking it to do stuff it
can't.

The fragmentation stuff is about what long contiguous runs of free memory
we can arrange, and it's also nice to be able to categorize them as
"zeroed" or "not zeroed" to make new allocations faster. Where they
actually are in memory is not at issue here. You can have prezeroed memory
in 32-bit DMA space, and prezeroed memory in highmem, but there's memory in
both that isn't prezeroed.

I thought there was a hierarchy of zones. You want overlapping, interlaced,
randomly laid out zones.

> >> [*] and there are, sadly enough - see the recent patches I posted to
> >> lkml for example.
> >
> > I was under the impression that zone balancing is, conceptually
> > speaking, a difficult problem.
>
> I am under the impression that you think proper fragmentation avoidance
> is easier.

I was under the impression it was orthogonal to figuring out whether or not
a given bank of physical memory is accessible to your sound blaster without
an IOMMU.

> >> But I'm fairly confident that once the particularly
> >> silly ones have been fixed,
> >
> > Great, you're advocating migrating the fragmentation patches to an area
> > of code that has known problems you yourself describe as "particularly
> > silly". A ringing endorsement, that.
>
> Err, the point is so we don't now have 2 layers doing very similar
> things, at least one of which has "particularly silly" bugs in it.

Similar is not identical. You seem to be implying that the IO elevator and
the network stack queueing should be merged because they do similar things.

> > The fact that the migrated version wouldn't even address fragmentation
> > avoidance at all (the topic of this thread!) is apparently a side issue.
>
> Zones can be used to guarantee physically contiguous regions with exactly
> the same effectiveness as the frag patches.

If you'd like to write a counter-patch to Mel's to prove it...

> >> zone balancing will no longer be a
> >> derogatory term as has been thrown around (maybe rightly) in this
> >> thread!
> >
> > If I'm not mistaken, you introduced zones into this thread; you are the
> > primary (possibly only) proponent of them.
>
> So you didn't look at Yasunori Goto's patch from last year that implements
> exactly what I described, then?

I saw the patch he just posted, if that's what you mean. By his own
admission, it doesn't address fragmentation at all.

> > Yes, zones are a way of categorizing memory.
>
> Yes, have you read Mel's patches? Guess what they do?

The swap file is a way of storing data on disk. So is ext3. Obviously, one
is a trivial extension of the other and there's no reason to have both.

> > They're not a way of defragmenting it.
>
> Guess what they don't?

I have no idea what you intended to mean by that. Mel posted a set of
patches in a thread titled "fragmentation avoidance", and you've been
arguing about hotplug, and pointing to a set of patches from Goto that do
not address fragmentation at all. This confuses me.

Rob

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 6:07 ` Rob Landley
@ 2005-11-03 7:34 ` Nick Piggin
2005-11-03 17:54 ` Rob Landley
0 siblings, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-03 7:34 UTC (permalink / raw)
To: Rob Landley
Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen,
Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

Rob Landley wrote:
> On Wednesday 02 November 2005 22:43, Nick Piggin wrote:
>
>> I'd just be happy with UML handing back page sized chunks of memory that
>> it isn't currently using. How does contiguous memory (in either the host
>> or the guest) help this?
>
> Smaller chunks of memory are likely to be reclaimed really soon, and
> adding in the syscall overhead of working with individual pages of memory
> is almost guaranteed to slow us down.

Because UML doesn't already make a syscall per individual page of memory
freed? (If I read correctly)

> Plus with punch, we'd be fragmenting the heck
> out of the underlying file.
>

Why? No you wouldn't.

>
>> > What does this have to do with specifying hard limits of anything?
>> > What's to specify? Workloads vary. Deal with it.
>>
>> Umm, if you hadn't bothered to read the thread then I won't go through
>> it all again. The short of it is that if you want guaranteed unfragmented
>> memory you have to specify a limit.
>
> I read it. It just didn't contain an answer to the question. I want UML
> to be able to hand back however much memory it's not using, but handing
> back individual pages as we free them and inserting a syscall overhead
> for every page freed and allocated is just nuts. (Plus, at page size, the
> OS isn't likely to zero them much faster than we can ourselves even
> without the syscall overhead.) Defragmentation means we can batch this
> into a granularity that makes it worth it.
>

Oh, you have measured it and found out that "defragmentation" makes it
worthwhile?

> This has nothing to do with hard limits on anything.
>

You said:

"What does this have to do with specifying hard limits of anything?
What's to specify? Workloads vary. Deal with it."

And I was answering your very polite questions.

>
>> Have you looked at the frag patches?
>
> I've read Mel's various descriptions, and tried to stay more or less up
> to date ever since LWN brought it to my attention. But I can't say I'm a
> Linux VM system expert. (The last time I felt I had a really firm grasp
> on it was before Andrea and Rik started arguing circa 2.4, and Andrea
> spent six months just assuming everybody already knew what a classzone
> was. I've had other things to do since then...)
>

Maybe you have better things to do now as well?

>> Duplicating the
>> same or similar infrastructure (in this case, a memory zoning facility)
>> is a bad thing in general.
>
> Even when they keep track of very different things? The memory zoning
> thing is about where stuff is in physical memory, and it exists because
> various hardware that wants to access memory (24-bit DMA, 32-bit DMA, and
> PAE) is evil and crippled and we have to humor it by not asking it to do
> stuff it can't.
>

No, the buddy allocator is and always has been what tracks the "long
contiguous runs of free memory". Both zones and Mel's patches classify
blocks of memory according to some criteria. They're not exactly the same
obviously, but they're equivalent in terms of capability to guarantee
contiguous freeable regions.

>
> I was under the impression it was orthogonal to figuring out whether or
> not a given bank of physical memory is accessible to your sound blaster
> without an IOMMU.
>

Huh?

>> Err, the point is so we don't now have 2 layers doing very similar
>> things, at least one of which has "particularly silly" bugs in it.
>
> Similar is not identical. You seem to be implying that the IO elevator
> and the network stack queueing should be merged because they do similar
> things.
>

No I don't.

>
> If you'd like to write a counter-patch to Mel's to prove it...
>

It has already been written, as you have been told numerous times. Now if
you'd like to actually learn about what you're commenting on, that would
be really good too.

>> So you didn't look at Yasunori Goto's patch from last year that
>> implements exactly what I described, then?
>
> I saw the patch he just posted, if that's what you mean. By his own
> admission, it doesn't address fragmentation at all.
>

It seems to me that it provides exactly the same (actually stronger)
guarantees as the current frag patches do. Or were you going to point out
a bug in the implementation?

>
>> > Yes, zones are a way of categorizing memory.
>>
>> Yes, have you read Mel's patches? Guess what they do?
>
> The swap file is a way of storing data on disk. So is ext3. Obviously,
> one is a trivial extension of the other and there's no reason to have
> both.
>

Don't try to bullshit your way around with stupid analogies please, it is
an utter waste of time.

>
>> > They're not a way of defragmenting it.
>>
>> Guess what they don't?
>
> I have no idea what you intended to mean by that. Mel posted a set of
> patches

What I mean is that Mel's patches aren't a way of defragmenting memory
either. They fit exactly the description you gave for zones (ie. a way of
categorizing, not defragmenting).

> in a thread titled "fragmentation avoidance", and you've been arguing
> about hotplug, and pointing to a set of patches from Goto that do not
> address fragmentation at all. This confuses me.
>

Yeah, it does seem like you are confused. Now let's finish up this
subthread and try to keep the S/N ratio up, please? I'm sure Jeff or
someone knowledgeable in the area can chime in if there are concerns about
UML.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 7:34 ` Nick Piggin @ 2005-11-03 17:54 ` Rob Landley 2005-11-03 20:13 ` Jeff Dike 0 siblings, 1 reply; 241+ messages in thread From: Rob Landley @ 2005-11-03 17:54 UTC (permalink / raw) To: Nick Piggin Cc: Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Thursday 03 November 2005 01:34, Nick Piggin wrote: > Rob Landley wrote: > > On Wednesday 02 November 2005 22:43, Nick Piggin wrote: > >>I'd just be happy with UML handing back page sized chunks of memory that > >>it isn't currently using. How does contiguous memory (in either the host > >>or the guest) help this? > > > > Smaller chunks of memory are likely to be reclaimed really soon, and > > adding in the syscall overhead working with individual pages of memory is > > almost guaranteed to slow us down. > > Because UML doesn't already make a syscall per individual page of > memory freed? (If I read correctly) UML does a big mmap to get "physical" memory, and then manages itself using the normal Linux kernel mechanisms for doing so. We even have page tables, although I'm still somewhat unclear on quite how that works. > > Plus with punch, we'd be fragmenting the heck > > out of the underlying file. > > Why? No you wouldn't. Creating holes in the file and freeing up the underlying blocks on disk? 4k at a time? Randomly scattered? > > I read it. It just didn't contain an answer the the question. I want > > UML to be able to hand back however much memory it's not using, but > > handing back individual pages as we free them and inserting a syscall > > overhead for every page freed and allocated is just nuts. (Plus, at page > > size, the OS isn't likely to zero them much faster than we can ourselves > > even without the syscall overhead.) Defragmentation means we can batch > > this into a granularity that makes it worth it. > > Oh you have measured it and found out that "defragmentation" makes > it worthwhile? Lots of work has gone into batching up syscalls and making as few of them as possible because they are a performance bottleneck. You want to introduce a syscall for every single individual page of memory allocated or freed. That's stupid. > > This has nothing to do with hard limits on anything. > > You said: > > "What does this have to do with specifying hard limits of > anything? What's to specify? Workloads vary. Deal with it." > > And I was answering your very polite questions. You didn't answer. You keep saying you've already answered, but there continues to be no answer. Maybe you think you've answered, but I haven't seen it yet. You brought up hard limits, I asked what that had to do with anything, and in response you quote my question back at me. > >>Have you looked at the frag patches? > > > > I've read Mel's various descriptions, and tried to stay more or less up > > to date ever since LWN brought it to my attention. But I can't say I'm a > > linux VM system expert. (The last time I felt I had a really firm grasp > > on it was before Andrea and Rik started arguing circa 2.4 and Andrea > > spent six months just assuming everybody already knew what a classzone > > was. I've had other things to do since then...) > > Maybe you have better things to do now as well? Yeah, thanks for reminding me. I need to test Mel's newest round of fragmentation avoidance patches in my UML build system... 
> >> Duplicating the
> >> same or similar infrastructure (in this case, a memory zoning facility)
> >> is a bad thing in general.
> >
> > Even when they keep track of very different things? The memory zoning
> > thing is about where stuff is in physical memory, and it exists because
> > various hardware that wants to access memory (24 bit DMA, 32 bit DMA, and
> > PAE) is evil and crippled and we have to humor it by not asking it to do
> > stuff it can't.
>
> No, the buddy allocator is and always has been what tracks the "long
> contiguous runs of free memory".

We are still discussing fragmentation avoidance, right? (I know _I'm_ trying
to...)

> Both zones and Mel's patches classify blocks of memory according to some
> criteria. They're not exactly the same obviously, but they're equivalent in
> terms of capability to guarantee contiguous freeable regions.

Back up. I don't care _where_ the freeable regions are. I just want them
coalesced. Zones are all about _where_ the memory is. I'm pretty sure we're
arguing past each other.

> > I was under the impression it was orthogonal to figuring out whether or
> > not a given bank of physical memory is accessable to your sound blaster
> > without an IOMMU.
>
> Huh?

Fragmentation avoidance is what is orthogonal to...

> >> Err, the point is so we don't now have 2 layers doing very similar
> >> things, at least one of which has "particularly silly" bugs in it.
> >
> > Similar is not identical. You seem to be implying that the IO elevator
> > and the network stack queueing should be merged because they do similar
> > things.
>
> No I don't.

They're similar though, aren't they? Why should we have different code in
there to do both? (I know why, but that's what your argument sounds like to
me.)

> > If you'd like to write a counter-patch to Mel's to prove it...
>
> It has already been written, as you have been told numerous times.

Quoting Yasunori Goto, yesterday at 2:33 pm,
Message-Id: <20051102172729.9E7C.Y-GOTO@jp.fujitsu.com>

> Hmmm. I don't see at this point.
> Why do you think ZONE_REMOVABLE can satisfy for hugepage.
> At leaset, my ZONE_REMOVABLE patch doesn't any concern about
> fragmentation.

He's NOT ADDRESSING FRAGMENTATION. So unless you're talking about some OTHER
patch, we're talking past each other again.

> Now if you'd like to actually learn about what you're commenting on,
> that would be really good too.

The feeling is mutual.

> >> So you didn't look at Yasunori Goto's patch from last year that
> >> implements exactly what I described, then?
> >
> > I saw the patch he just posted, if that's what you mean. By his own
> > admission, it doesn't address fragmentation at all.
>
> It seems to me that it provides exactly the same (actually stronger)
> guarantees as the current frag patches do. Or were you going to point
> out a bug in the implementation?

No, I'm going to point out that the author of the patch contradicts you.

> >>> Yes, zones are a way of categorizing memory.
> >>
> >> Yes, have you read Mel's patches? Guess what they do?
> >
> > The swap file is a way of storing data on disk. So is ext3. Obviously,
> > one is a trivial extension of the other and there's no reason to have
> > both.
>
> Don't try to bullshit your way around with stupid analogies please, it
> is an utter waste of time.

I agree that this conversation is a waste of time, and will stop trying to
reason with you now.

Rob

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 17:54 ` Rob Landley @ 2005-11-03 20:13 ` Jeff Dike 0 siblings, 0 replies; 241+ messages in thread From: Jeff Dike @ 2005-11-03 20:13 UTC (permalink / raw) To: Rob Landley Cc: Nick Piggin, Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Thu, Nov 03, 2005 at 11:54:10AM -0600, Rob Landley wrote: > Lots of work has gone into batching up syscalls and making as few of them as > possible because they are a performance bottleneck. You want to introduce a > syscall for every single individual page of memory allocated or freed. > > That's stupid. I think what I'm optimizing is TLB flushes, not system calls. With mmap et al, they are effectively the same thing though. Jeff ^ permalink raw reply [flat|nested] 241+ messages in thread
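Jeff's point - that with mmap et al the syscall and the TLB flush amount to the same thing - is also why the batching Rob argues for wins. A minimal userspace sketch, with made-up sizes (an illustration of the idea, not UML code): returning a defragmented 4MB run to the host is one madvise(MADV_DONTNEED) call over the whole range, versus one call (and one flush) per page if pages go back as they are freed.

#include <stdio.h>
#include <sys/mman.h>

#define PAGE_SZ   4096UL
#define NR_PAGES  (16 * 1024)	/* 64MB of guest "physical" RAM, made up */

int main(void)
{
	/* Stand-in for UML's big anonymous mapping of guest memory. */
	char *physmem = mmap(NULL, NR_PAGES * PAGE_SZ, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (physmem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Unbatched: one syscall per freed page. */
	for (unsigned long i = 0; i < 1024; i++)
		madvise(physmem + i * PAGE_SZ, PAGE_SZ, MADV_DONTNEED);

	/* Batched: the same 4MB, coalesced, goes back in a single call. */
	madvise(physmem, 1024 * PAGE_SZ, MADV_DONTNEED);
	return 0;
}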
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 6:07 ` Rob Landley 2005-11-03 7:34 ` Nick Piggin @ 2005-11-03 16:35 ` Jeff Dike 2005-11-03 16:23 ` Badari Pulavarty 1 sibling, 1 reply; 241+ messages in thread From: Jeff Dike @ 2005-11-03 16:35 UTC (permalink / raw) To: Rob Landley Cc: Nick Piggin, Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Thu, Nov 03, 2005 at 12:07:33AM -0600, Rob Landley wrote: > I want UML to > be able to hand back however much memory it's not using, but handing back > individual pages as we free them and inserting a syscall overhead for every > page freed and allocated is just nuts. (Plus, at page size, the OS isn't > likely to zero them much faster than we can ourselves even without the > syscall overhead.) Defragmentation means we can batch this into a > granularity that makes it worth it. I don't think that freeing pages back to the host in free_pages is the way to go. The normal behavior for a Linux system, virtual or physical, is to use all the memory it has. So, any memory that's freed is pretty likely to be reused for something else, wasting any effort that's made to free pages back to the host. The one counter-example I can think of is when a large process with a lot of data exits. Then its data pages will be freed and they may stay free for a while until the system finds other data to fill them with. Also, it's not the virtual machine's job to know how to make the host perform optimally. It doesn't have the information to do it. It's perfectly OK for a UML to hang on to memory if the host has plenty free. So, it's the host's job to make sure that its memory pressure is reflected to the UMLs. My current thinking is that you'll have a daemon on the host keeping track of memory pressure on the host and the UMLs, plugging and unplugging memory in order to keep the busy machines, including the host, supplied with memory, and periodically pushing down the memory of idle UMLs in order to force them to GC their page caches. With Badari's patch and UML memory hotplug, the infrastructure is there to make this work. The one thing I'm puzzling over right now is how to measure memory pressure. Jeff ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 16:35 ` Jeff Dike
@ 2005-11-03 16:23   ` Badari Pulavarty
  2005-11-03 18:27     ` Jeff Dike
                       ` (2 more replies)
  0 siblings, 3 replies; 241+ messages in thread
From: Badari Pulavarty @ 2005-11-03 16:23 UTC (permalink / raw)
To: Jeff Dike
Cc: Rob Landley, Nick Piggin, Gerrit Huizenga, Ingo Molnar,
    Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh,
    Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Thu, 2005-11-03 at 11:35 -0500, Jeff Dike wrote:
> On Thu, Nov 03, 2005 at 12:07:33AM -0600, Rob Landley wrote:
> > I want UML to
> > be able to hand back however much memory it's not using, but handing back
> > individual pages as we free them and inserting a syscall overhead for every
> > page freed and allocated is just nuts.  (Plus, at page size, the OS isn't
> > likely to zero them much faster than we can ourselves even without the
> > syscall overhead.)  Defragmentation means we can batch this into a
> > granularity that makes it worth it.
>
> I don't think that freeing pages back to the host in free_pages is the
> way to go.  The normal behavior for a Linux system, virtual or
> physical, is to use all the memory it has.  So, any memory that's
> freed is pretty likely to be reused for something else, wasting any
> effort that's made to free pages back to the host.
>
> The one counter-example I can think of is when a large process with a
> lot of data exits.  Then its data pages will be freed and they may
> stay free for a while until the system finds other data to fill them
> with.
>
> Also, it's not the virtual machine's job to know how to make the host
> perform optimally.  It doesn't have the information to do it.  It's
> perfectly OK for a UML to hang on to memory if the host has plenty
> free.  So, it's the host's job to make sure that its memory pressure
> is reflected to the UMLs.
>
> My current thinking is that you'll have a daemon on the host keeping
> track of memory pressure on the host and the UMLs, plugging and
> unplugging memory in order to keep the busy machines, including the
> host, supplied with memory, and periodically pushing down the memory
> of idle UMLs in order to force them to GC their page caches.
>
> With Badari's patch and UML memory hotplug, the infrastructure is
> there to make this work.  The one thing I'm puzzling over right now is
> how to measure memory pressure.

Yep.  This is exactly the issue other product groups normally raise
on Linux: how do we measure memory pressure in Linux?  Some of our
software products want to grow or shrink their memory usage depending
on the memory pressure in the system.  Since most memory is used for
cache, "free" really doesn't indicate anything - they are monitoring
info in /proc/meminfo and swapping rates to "guess" at the memory
pressure.  They want a clear way of finding out "how badly" the
system is under memory pressure.  (As a starting point, they want to
find out how much of "cached" memory is really easily "reclaimable"
under memory pressure, without swapping.)  I know this is kind of
crazy, but interesting to think about :)

Thanks,
Badari

^ permalink raw reply [flat|nested] 241+ messages in thread
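A sketch of the kind of guessing Badari describes, using only what /proc/meminfo exports. The heuristic here (free + buffers + page cache) is deliberately crude - dirty or heavily mapped cache is not freeable without I/O - which is precisely the complaint: userspace cannot do much better with the information currently available.

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char line[128];
	unsigned long v, memfree = 0, buffers = 0, cached = 0;

	if (!f) {
		perror("/proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "MemFree: %lu", &v) == 1)
			memfree = v;
		else if (sscanf(line, "Buffers: %lu", &v) == 1)
			buffers = v;
		else if (sscanf(line, "Cached: %lu", &v) == 1)
			cached = v;
	}
	fclose(f);

	/* Overcounts what is reclaimable; a real monitor also watches
	 * swap rates to correct its guess, as described above. */
	printf("guessed reclaimable without swapping: ~%lu kB\n",
	       memfree + buffers + cached);
	return 0;
}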
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-03 16:23   ` Badari Pulavarty
@ 2005-11-03 18:27     ` Jeff Dike
  0 siblings, 0 replies; 241+ messages in thread
From: Jeff Dike @ 2005-11-03 18:27 UTC (permalink / raw)
To: Badari Pulavarty
Cc: Rob Landley, Nick Piggin, Gerrit Huizenga, Ingo Molnar,
    Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh,
    Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Thu, Nov 03, 2005 at 08:23:20AM -0800, Badari Pulavarty wrote:
> Yep.  This is exactly the issue other product groups normally raise
> on Linux: how do we measure memory pressure in Linux?  Some of our
> software products want to grow or shrink their memory usage depending
> on the memory pressure in the system.

I think this is wrong.  Applications shouldn't be measuring host
memory pressure and trying to react to it.

This gives you no way to implement a global memory use policy - you
can't say "App X is the most important thing on the system and must
have all the memory it needs in order to run as quickly as possible".
You can't establish any sort of priority between apps when it comes
to memory use, or change those priorities.  And how does this work
when the system can change the amount of memory that it has, such as
when the app is inside a UML?

I think the right way to go is for willing apps to have an interface
through which they can be told "change your memory consumption by
+-X" and have a single daemon on the host tracking memory use and
memory pressure, and shuffling memory between the apps.

This allows the admin to set memory use priorities between the apps
and to exempt important ones from having memory pulled.

Measuring at the bottom and pushing memory pressure upwards also
works naturally for virtual machines and the apps running inside
them.  The host will push memory pressure at the virtual machines,
which in turn will push that pressure at their apps.

With UML, I have an interface where a daemon on the host can add or
remove memory from an instance.  I think the apps that are willing to
adjust should implement something similar.

Jeff

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 16:23 ` Badari Pulavarty 2005-11-03 18:27 ` Jeff Dike @ 2005-11-03 18:49 ` Rob Landley 2005-11-04 4:52 ` Andrew Morton 2 siblings, 0 replies; 241+ messages in thread From: Rob Landley @ 2005-11-03 18:49 UTC (permalink / raw) To: Badari Pulavarty Cc: Jeff Dike, Nick Piggin, Gerrit Huizenga, Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Thursday 03 November 2005 10:23, Badari Pulavarty wrote: > Yep. This is the exactly the issue other product groups normally raise > on Linux. How do we measure memory pressure in linux ? Some of our > software products want to grow or shrink their memory usage depending > on the memory pressure in the system. Since most memory is used for > cache, "free" really doesn't indicate anything -they are monitoring > info in /proc/meminfo and swapping rates to "guess" on the memory > pressure. They want a clear way of finding out "how badly" system > is under memory pressure. (As a starting point, they want to find out > out of "cached" memory - how much is really easily "reclaimable" > under memory pressure - without swapping). I know this is kind of > crazy, but interesting to think about :) If we do ever get prezeroing, we'd want a tuneable to say how much memory should be spent on random page cache and how much should be prezeroed. And large chunks of prezeroed memory lying around are what you'd think about handing back to the host OS... Rob ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 16:23 ` Badari Pulavarty 2005-11-03 18:27 ` Jeff Dike 2005-11-03 18:49 ` Rob Landley @ 2005-11-04 4:52 ` Andrew Morton 2005-11-04 5:35 ` Paul Jackson 2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar 2 siblings, 2 replies; 241+ messages in thread From: Andrew Morton @ 2005-11-04 4:52 UTC (permalink / raw) To: Badari Pulavarty Cc: jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Badari Pulavarty <pbadari@gmail.com> wrote: > > > With Badari's patch and UML memory hotplug, the infrastructure is > > there to make this work. The one thing I'm puzzling over right now is > > how to measure memory pressure. > > Yep. This is the exactly the issue other product groups normally raise > on Linux. How do we measure memory pressure in linux ? Some of our > software products want to grow or shrink their memory usage depending > on the memory pressure in the system. Since most memory is used for > cache, "free" really doesn't indicate anything -they are monitoring > info in /proc/meminfo and swapping rates to "guess" on the memory > pressure. They want a clear way of finding out "how badly" system > is under memory pressure. (As a starting point, they want to find out > out of "cached" memory - how much is really easily "reclaimable" > under memory pressure - without swapping). I know this is kind of > crazy, but interesting to think about :) Similarly, that SGI patch which was rejected 6-12 months ago to kill off processes once they started swapping. We thought that it could be done from userspace, but we need a way for userspace to detect when a task is being swapped on a per-task basis. I'm thinking a few numbers in the mm_struct, incremented in the pageout code, reported via /proc/stat. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 4:52 ` Andrew Morton @ 2005-11-04 5:35 ` Paul Jackson 2005-11-04 5:48 ` Andrew Morton 2005-11-04 6:16 ` Bron Nelson 2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar 1 sibling, 2 replies; 241+ messages in thread From: Paul Jackson @ 2005-11-04 5:35 UTC (permalink / raw) To: Andrew Morton Cc: pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel > Similarly, that SGI patch which was rejected 6-12 months ago to kill off > processes once they started swapping. We thought that it could be done > from userspace, but we need a way for userspace to detect when a task is > being swapped on a per-task basis. > > I'm thinking a few numbers in the mm_struct, incremented in the pageout > code, reported via /proc/stat. I just sent in a proposed patch for this - one more per-cpuset number, tracking the recent rate of calls into the synchronous (direct) page reclaim by tasks in the cpuset. See the message sent a few minutes ago, with subject: [PATCH 5/5] cpuset: memory reclaim rate meter -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 5:35 ` Paul Jackson @ 2005-11-04 5:48 ` Andrew Morton 2005-11-04 6:42 ` Paul Jackson 2005-11-04 6:16 ` Bron Nelson 1 sibling, 1 reply; 241+ messages in thread From: Andrew Morton @ 2005-11-04 5:48 UTC (permalink / raw) To: Paul Jackson, Bron Nelson Cc: pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Paul Jackson <pj@sgi.com> wrote: > > > Similarly, that SGI patch which was rejected 6-12 months ago to kill off > > processes once they started swapping. We thought that it could be done > > from userspace, but we need a way for userspace to detect when a task is > > being swapped on a per-task basis. > > > > I'm thinking a few numbers in the mm_struct, incremented in the pageout > > code, reported via /proc/stat. > > I just sent in a proposed patch for this - one more per-cpuset > number, tracking the recent rate of calls into the synchronous > (direct) page reclaim by tasks in the cpuset. > > See the message sent a few minutes ago, with subject: > > [PATCH 5/5] cpuset: memory reclaim rate meter > uh, OK. If that patch is merged, does that make Bron happy, so I don't have to reply to his plaintive email? I was kind of thinking that the stats should be per-process (actually per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  5:48 ` Andrew Morton
@ 2005-11-04  6:42   ` Paul Jackson
  2005-11-04  7:10     ` Andrew Morton
  0 siblings, 1 reply; 241+ messages in thread
From: Paul Jackson @ 2005-11-04 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu,
    haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel

Andrew wrote:
> uh, OK.  If that patch is merged, does that make Bron happy, so I don't
> have to reply to his plaintive email?

In theory yes, that should do it.  I will ack again, by early next
week, after I have verified this further.

And it should also handle some other folks who have plaintive emails
in my inbox, that haven't gotten bold enough to pester you, yet.

For the users who know my email address (*), it really is job-based
memory pressure, not task-based, that matters.  Sticking it in a
cpuset, which is the natural job container, is easier, more natural,
and more efficient for all concerned.

It's jobs that are being run in cpusets with dedicated (not shared)
CPUs and Memory Nodes that care about this, so far as I know.

When running a system in a more typical sharing mode, with multiple
jobs and applications competing for the same resources, then the
kernel needs to be master of processor scheduling and memory
allocation.

When running jobs in cpusets with dedicated CPUs and Memory Nodes,
then less is being asked of the kernel, and some per-job controls
from userspace make more sense.  This is where a simple hook like
this reclaim rate meter comes into play - passing up to user space
another clue to help it do its job.

> I was kind of thinking that the stats should be per-process (actually
> per-mm) rather than bound to cpusets.  /proc/<pid>/pageout-stats or something.

There may well be a market for these too.  But such stats sound like
more work, and the market isn't one that's paying my salary.  So I
will leave that challenge on the table for someone else.

(*) Of course, there is some self selection going on here.  Folks not
    doing cpuset-based jobs are far less likely to know my email
    address ;).

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  6:42   ` Paul Jackson
@ 2005-11-04  7:10     ` Andrew Morton
  2005-11-04  7:45       ` Paul Jackson
  2005-11-04 15:19       ` Martin J. Bligh
  0 siblings, 2 replies; 241+ messages in thread
From: Andrew Morton @ 2005-11-04 7:10 UTC (permalink / raw)
To: Paul Jackson
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu,
    haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel

Paul Jackson <pj@sgi.com> wrote:
>
> > I was kind of thinking that the stats should be per-process (actually
> > per-mm) rather than bound to cpusets.  /proc/<pid>/pageout-stats or something.
>
> There may well be a market for these too.  But such stats sound like
> more work, and the market isn't one that's paying my salary.

But I have to care for all users.

> So I will leave that challenge on the table for someone else.

And I won't merge your patch ;)

Seriously, it does appear that doing it per-task is adequate for your
needs, and it is certainly more general.

I cannot understand why you decided to count only the number of
direct-reclaim events, via a "digitally filtered, constant time based,
event frequency meter".

a) It loses information.  If we were to export the number of pages
   reclaimed from the mm, filtering can be done in userspace.

b) It omits reclaim performed by kswapd and by other tasks (ok, it's
   very cpuset-specific).

c) It only counts synchronous try_to_free_pages() attempts.  What if
   an attempt only freed pagecache, or didn't manage to free anything?

d) It doesn't notice if kswapd is swapping the heck out of your
   not-allocating-any-memory-now process.

I think all the above can be addressed by exporting per-task (actually
per-mm) reclaim info.  (I haven't put much thought into what info that
should be - page reclaim attempts, mmapped reclaims, swapcache
reclaims, etc.)

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 7:10 ` Andrew Morton @ 2005-11-04 7:45 ` Paul Jackson 2005-11-04 8:02 ` Andrew Morton 2005-11-04 15:19 ` Martin J. Bligh 1 sibling, 1 reply; 241+ messages in thread From: Paul Jackson @ 2005-11-04 7:45 UTC (permalink / raw) To: Andrew Morton Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Andrew wrote: > > So I will leave that challenge on the table for someone else. > > And I won't merge your patch ;) Be that way ;). > Seriously, it does appear that doing it per-task is adequate for your > needs, and it is certainly more general. My motivations for the per-cpuset, digitally filtered rate, as opposed to the per-task raw counter mostly have to do with minimizing total cost (user + kernel) of collecting this information. I have this phobia, perhaps not well founded, that moving critical scheduling/allocation decisions like this into user space will fail in some cases because the cost of gathering the critical information will be too intrusive on system performance and scalability. A per-task stat requires walking the tasklist, to build a list of the tasks to query. A raw counter requires repeated polling to determine the recent rate of activity. The filtered per-cpuset rate avoids any need to repeatedly access global resources such as the tasklist, and minimizes the total cpu cycles required to get the interesting stat. > But I have to care for all users. Well you should, and well you do. If you have good reason, or just good instincts, to think that there are uses for per-task raw counters, then your choice is clear. As indeed it was clear. I don't recall hearing of any desire for per-task memory pressure data, until tonight. I will miss this patch. It had provided exactly what I thought was needed, with an extremely small impact on system (kern+user) performance. Oh well. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 7:45 ` Paul Jackson @ 2005-11-04 8:02 ` Andrew Morton 2005-11-04 9:52 ` Paul Jackson 0 siblings, 1 reply; 241+ messages in thread From: Andrew Morton @ 2005-11-04 8:02 UTC (permalink / raw) To: Paul Jackson Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Paul Jackson <pj@sgi.com> wrote: > > A per-task stat requires walking the tasklist, to build a list of the > tasks to query. Nope, just task->mm->whatever. > A raw counter requires repeated polling to determine the recent rate of > activity. True. > The filtered per-cpuset rate avoids any need to repeatedly access > global resources such as the tasklist, and minimizes the total cpu > cycles required to get the interesting stat. > Well no. Because the filtered-whatsit takes two spinlocks and does a bunch of arith for each and every task, each time it calls try_to_free_pages(). The frequency of that could be very high indeed, even when nobody is interested in the metric which is being maintained(!). And I'd suggest that only a minority of workloads would be interested in this metric? ergo, polling the thing once per five seconds in those situations where we actually want to poll the thing may well be cheaper, in global terms? ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-04  8:02 ` Andrew Morton
@ 2005-11-04  9:52   ` Paul Jackson
  2005-11-04 15:27     ` Martin J. Bligh
  0 siblings, 1 reply; 241+ messages in thread
From: Paul Jackson @ 2005-11-04 9:52 UTC (permalink / raw)
To: Andrew Morton
Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu,
    haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel

> > A per-task stat requires walking the tasklist, to build a list of the
> > tasks to query.
>
> Nope, just task->mm->whatever.

Nope.  Agreed - once you have the task, then sure, that's enough.

However - a batch scheduler will end up having to figure out what
tasks there are to inquire about, by either listing the tasks in a
cpuset, or by listing /proc.  Either way, that's a tasklist scan.
And it will have to do that pretty much every iteration of polling,
since it has no a priori knowledge of what tasks a job is firing up.

> Well no.  Because the filtered-whatsit takes two spinlocks and does a bunch
> of arith for each and every task, each time it calls try_to_free_pages().

Neither spinlock is global - the task and a lock in its cpuset.  I
see a fair number of existing locks and semaphores, some global and
some in loops, that look to be in the code invoked by
try_to_free_pages().  And far more arithmetic than in that little
filter.

Granted, its cost is seen by all, for the benefit of few.  But other
sorts of per-task or per-mm stats are not going to be free either.  I
would have figured that doing something per-page, even the most
trivial "counter++" (better have that mm locked), will likely cost
more than doing something per try_to_free_pages() call.

> The frequency of that could be very high indeed, even when nobody is
> interested in the metric which is being maintained(!)

When I have a task start allocating memory as fast as it can, it is
only able to call try_to_free_pages() about 10 times a second on an
idle ia64 SN2 system with a single thread, or about 20 times a second
running several threads at once allocating memory.  That's not "very
high" in my book.  What sort of load would hit this much more often?

If more folks need these detailed stats, then that's how it should
be.  But I am no fan of exposing more than the minimum kernel vm
details for use by production software.

We agree that my per-cpuset memory_reclaim_rate meter certainly hides
more detail than the sorts of stats you are suggesting.  I thought
that was good, so long as what was needed was still present.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply [flat|nested] 241+ messages in thread
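For reference, a "digitally filtered, constant time based, event frequency meter" is, generically, an exponentially decaying events-per-second estimate. A standalone sketch of the idea - not the cpuset patch itself, and the time constant is arbitrary:

#include <stdio.h>

#define TAU 10.0	/* filter memory, in seconds (arbitrary) */

struct freq_meter {
	double rate;	/* filtered events/sec */
	double last;	/* timestamp of previous event */
};

/* Each event blends the instantaneous rate 1/dt into the running
 * estimate, weighted by how much of TAU has elapsed since last time. */
static void meter_event(struct freq_meter *m, double now)
{
	double dt = now - m->last;
	double w;

	if (dt <= 0)
		dt = 1e-6;	/* clamp near-simultaneous events */
	w = dt / TAU;
	if (w > 1.0)
		w = 1.0;
	m->rate = (1.0 - w) * m->rate + w * (1.0 / dt);
	m->last = now;
}

int main(void)
{
	struct freq_meter m = { 0.0, 0.0 };

	/* Simulate one reclaim event every 100ms for 5 seconds; the
	 * estimate climbs toward 10 events/sec. */
	for (int i = 1; i <= 50; i++)
		meter_event(&m, i * 0.1);
	printf("filtered rate: %.1f events/sec\n", m.rate);
	return 0;
}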
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 9:52 ` Paul Jackson @ 2005-11-04 15:27 ` Martin J. Bligh 0 siblings, 0 replies; 241+ messages in thread From: Martin J. Bligh @ 2005-11-04 15:27 UTC (permalink / raw) To: Paul Jackson, Andrew Morton Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel, lhms-devel > We agree that my per-cpuset memory_reclaim_rate meter certainly hides > more detail than the sorts of stats you are suggesting. I thought that > was good, so long as what was needed was still present. But it's horribly specific to cpusets. If you want something multi-task, would be better if it worked by more generic task groupings. M. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 7:10 ` Andrew Morton 2005-11-04 7:45 ` Paul Jackson @ 2005-11-04 15:19 ` Martin J. Bligh 2005-11-04 17:38 ` Andrew Morton 1 sibling, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-04 15:19 UTC (permalink / raw) To: Andrew Morton, Paul Jackson Cc: bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel, lhms-devel > Seriously, it does appear that doing it per-task is adequate for your > needs, and it is certainly more general. > > > > I cannot understand why you decided to count only the number of > direct-reclaim events, via a "digitally filtered, constant time based, > event frequency meter". > > a) It loses information. If we were to export the number of pages > reclaimed from the mm, filtering can be done in userspace. > > b) It omits reclaim performed by kswapd and by other tasks (ok, it's > very cpuset-specific). > > c) It only counts synchronous try_to_free_pages() attempts. What if an > attempt only freed pagecache, or didbn't manage to free anything? > > d) It doesn't notice if kswapd is swapping the heck out of your > not-allocating-any-memory-now process. > > > I think all the above can be addressed by exporting per-task (actually > per-mm) reclaim info. (I haven't put much though into what info that > should be - page reclaim attempts, mmapped reclaims, swapcache reclaims, > etc) I've been looking at similar things. When we page out / free something from a shared library that 10 tasks have mapped, who does that count against for pressure? M. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 15:19 ` Martin J. Bligh @ 2005-11-04 17:38 ` Andrew Morton 0 siblings, 0 replies; 241+ messages in thread From: Andrew Morton @ 2005-11-04 17:38 UTC (permalink / raw) To: Martin J. Bligh Cc: pj, bron, pbadari, jdike, rob, nickpiggin, gh, mingo, kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel, lhms-devel "Martin J. Bligh" <mbligh@mbligh.org> wrote: > > > Seriously, it does appear that doing it per-task is adequate for your > > needs, and it is certainly more general. > > > > > > > > I cannot understand why you decided to count only the number of > > direct-reclaim events, via a "digitally filtered, constant time based, > > event frequency meter". > > > > a) It loses information. If we were to export the number of pages > > reclaimed from the mm, filtering can be done in userspace. > > > > b) It omits reclaim performed by kswapd and by other tasks (ok, it's > > very cpuset-specific). > > > > c) It only counts synchronous try_to_free_pages() attempts. What if an > > attempt only freed pagecache, or didbn't manage to free anything? > > > > d) It doesn't notice if kswapd is swapping the heck out of your > > not-allocating-any-memory-now process. > > > > > > I think all the above can be addressed by exporting per-task (actually > > per-mm) reclaim info. (I haven't put much though into what info that > > should be - page reclaim attempts, mmapped reclaims, swapcache reclaims, > > etc) > > I've been looking at similar things. When we page out / free something from > a shared library that 10 tasks have mapped, who does that count against > for pressure? Count pte unmappings and minor faults and account them against the mm_struct, I guess. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 5:35 ` Paul Jackson 2005-11-04 5:48 ` Andrew Morton @ 2005-11-04 6:16 ` Bron Nelson 1 sibling, 0 replies; 241+ messages in thread From: Bron Nelson @ 2005-11-04 6:16 UTC (permalink / raw) To: Paul Jackson, Andrew Morton Cc: lhms-devel, linux-kernel, linux-mm, kravetz, mbligh, mel, haveblue, kamezawa.hiroyu, mingo, gh, nickpiggin, rob, jdike, pbadari > I was kind of thinking that the stats should be per-process (actually > per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something. The particular people that I deal with care about constraining things on a per-cpuset basis, so that is the information that I personally am looking for. But it is simple enough to map tasks to cpusets and vice-versa, so this is not really a serious consideration. I would generically be in favor of the per-process stats (even though the application at hand is actually interested in the cpuset aggregate stats), because we can always produce an aggregate from the detailed, but not vice-versa. And no doubt some future as-yet-unimagined application will want per-process info. -- Bron Campbell Nelson bron@sgi.com These statements are my own, not those of Silicon Graphics. ^ permalink raw reply [flat|nested] 241+ messages in thread
* [patch] swapin rlimit
  2005-11-04  4:52 ` Andrew Morton
  2005-11-04  5:35   ` Paul Jackson
@ 2005-11-04  7:26   ` Ingo Molnar
  2005-11-04  7:36     ` Andrew Morton
  2005-11-04 10:14     ` Bernd Petrovitsch
  1 sibling, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 7:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Badari Pulavarty, Linus Torvalds, jdike, rob, nickpiggin, gh,
    kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
    linux-kernel, lhms-devel

* Andrew Morton <akpm@osdl.org> wrote:

> Similarly, that SGI patch which was rejected 6-12 months ago to kill
> off processes once they started swapping.  We thought that it could be
> done from userspace, but we need a way for userspace to detect when a
> task is being swapped on a per-task basis.

wouldn't the clean solution here be a "swap ulimit"?  I.e. something
like the 2-minute quick-hack below (against Linus-curr).

	Ingo

---
Implement a swap ulimit: RLIMIT_SWAP.  Setting the ulimit to 0 causes
any swapin activity to kill the task.

Setting the rlimit to 0 is allowed for unprivileged users too, since
it is a decrease of the default RLIM_INFINITY value.  I.e. users could
run known-memory-intense jobs with such an ulimit set, and get a
guarantee that they won't put the system into a swap-storm.

Note: it's just swapin that causes the SIGKILL, because at swapout
time it's hard to identify the originating task.  Pure swapouts and a
buildup in the swap-cache are not punished, only actual hard swapins.

I didn't try too hard to make the rlimit particularly fine-grained -
i.e. right now we only know 'zero' and 'infinity' ...

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 include/asm-generic/resource.h |    4 +++-
 mm/memory.c                    |   13 +++++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

Index: linux/include/asm-generic/resource.h
===================================================================
--- linux.orig/include/asm-generic/resource.h
+++ linux/include/asm-generic/resource.h
@@ -44,8 +44,9 @@
 #define RLIMIT_NICE		13	/* max nice prio allowed to raise to
					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
+#define RLIMIT_SWAP		15	/* maximum swapspace for task */
 
-#define RLIM_NLIMITS		15
+#define RLIM_NLIMITS		16
 
 /*
  * SuS says limits have to be unsigned.
@@ -86,6 +87,7 @@
 	[RLIMIT_MSGQUEUE]	= { MQ_BYTES_MAX, MQ_BYTES_MAX },	\
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
+	[RLIMIT_SWAP]		= { RLIM_INFINITY, RLIM_INFINITY },	\
 }
 
 #endif /* __KERNEL__ */
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -1647,6 +1647,18 @@ void swapin_readahead(swp_entry_t entry,
 }
 
 /*
+ * Crude first-approximation swapin-avoidance: if there is a zero swap
+ * rlimit then kill the task.
+ */
+static inline void check_swap_rlimit(void)
+{
+	unsigned long limit = current->signal->rlim[RLIMIT_SWAP].rlim_cur;
+
+	if (limit != RLIM_INFINITY)
+		force_sig(SIGKILL, current);
+}
+
+/*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -1667,6 +1679,7 @@ static int do_swap_page(struct mm_struct
 	entry = pte_to_swp_entry(orig_pte);
 	page = lookup_swap_cache(entry);
 	if (!page) {
+		check_swap_rlimit();
 		swapin_readahead(entry, address, vma);
 		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {

^ permalink raw reply [flat|nested] 241+ messages in thread
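Assuming the patch above were applied, using it would look like any other rlimit. A sketch of a wrapper that runs a command with swapins forbidden; RLIMIT_SWAP is the value from the patch, and on an unpatched kernel setrlimit() simply fails with EINVAL:

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

#ifndef RLIMIT_SWAP
#define RLIMIT_SWAP 15			/* from the proposed patch above */
#endif

int main(int argc, char *argv[])
{
	struct rlimit rl = { 0, 0 };	/* zero swap: first swapin kills */

	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}
	if (setrlimit(RLIMIT_SWAP, &rl) < 0) {
		perror("setrlimit(RLIMIT_SWAP)");
		return 1;
	}
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}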
* Re: [patch] swapin rlimit 2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar @ 2005-11-04 7:36 ` Andrew Morton 2005-11-04 8:07 ` Ingo Molnar ` (2 more replies) 2005-11-04 10:14 ` Bernd Petrovitsch 1 sibling, 3 replies; 241+ messages in thread From: Andrew Morton @ 2005-11-04 7:36 UTC (permalink / raw) To: Ingo Molnar Cc: pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Ingo Molnar <mingo@elte.hu> wrote: > > * Andrew Morton <akpm@osdl.org> wrote: > > > Similarly, that SGI patch which was rejected 6-12 months ago to kill > > off processes once they started swapping. We thought that it could be > > done from userspace, but we need a way for userspace to detect when a > > task is being swapped on a per-task basis. > > wouldnt the clean solution here be a "swap ulimit"? Well it's _a_ solution, but it's terribly specific. How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's non-zero, kill <pid>? ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit
  2005-11-04  7:36 ` Andrew Morton
@ 2005-11-04  8:07   ` Ingo Molnar
  2005-11-04 10:06     ` Paul Jackson
  2005-11-04 15:24     ` Martin J. Bligh
  0 siblings, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 8:07 UTC (permalink / raw)
To: Andrew Morton
Cc: pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu,
    haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel

* Andrew Morton <akpm@osdl.org> wrote:

> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Andrew Morton <akpm@osdl.org> wrote:
> >
> > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > off processes once they started swapping.  We thought that it could be
> > > done from userspace, but we need a way for userspace to detect when a
> > > task is being swapped on a per-task basis.
> >
> > wouldn't the clean solution here be a "swap ulimit"?
>
> Well it's _a_ solution, but it's terribly specific.
>
> How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
> non-zero, kill <pid>?

on a system with possibly thousands of tasks, over /proc, on a
high-performance node where for a 0.5% improvement they are willing
to sacrifice maidens? :)

Seriously, while nr_swapped_in_pages ought to be OK, I think there is
a generic problem with /proc based stats.

System instrumentation people are already complaining about how
costly /proc parsing is.  If you have to get some nontrivial stat
from all threads in the system, and if Linux doesn't offer that
counter or summary by default, it gets pretty expensive.

One solution I can think of would be to make a binary representation
of /proc/<pid>/stats readonly-mmap-able.  This would add a 4K page to
every task tracked that way, and stats updates would have to update
this page too - but it would make instrumentation of running apps
really unintrusive and scalable.

Another addition would be some mechanism for a monitoring app to
capture events in the PID space: so that it can mmap() new tasks [if
it is interested] on a non-polling basis, i.e. not like readdir on
/proc.  This capability probably has to be a system call though, as
/proc seems too quirky for it.

The system does not wait on the monitoring app(s) to catch up - if an
app is too slow in reacting and the event buffer overflows then tough
luck - monitoring apps will have no impact on the runtime
characteristics of other tasks.  In theory this is somewhat similar
to auditing, but the purpose would be quite different, and it only
cares about PID-space events like 'fork/clone', 'exec' and 'exit'.

	Ingo

^ permalink raw reply [flat|nested] 241+ messages in thread
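What the monitoring side of that mmap idea might look like. Everything here - the file name, the page layout, the counters - is invented for illustration; no such interface exists:

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical kernel-updated, per-task stats page. */
struct task_stats_page {
	uint64_t seq;			/* bumped around each update */
	uint64_t nr_swapped_in_pages;
	uint64_t nr_reclaimed_pages;
};

int main(void)
{
	/* Hypothetical file; nothing like it exists in /proc today. */
	int fd = open("/proc/1234/stats_page", O_RDONLY);
	struct task_stats_page *s;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	s = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (s == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Sampling is now a plain memory read - no syscall per sample. */
	printf("swapins: %llu\n", (unsigned long long)s->nr_swapped_in_pages);
	return 0;
}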
* Re: [patch] swapin rlimit 2005-11-04 8:07 ` Ingo Molnar @ 2005-11-04 10:06 ` Paul Jackson 2005-11-04 15:24 ` Martin J. Bligh 1 sibling, 0 replies; 241+ messages in thread From: Paul Jackson @ 2005-11-04 10:06 UTC (permalink / raw) To: Ingo Molnar Cc: akpm, pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Ingo wrote: > Seriously, while nr_swapped_in_pages ought to be OK, i think there is a > generic problem with /proc based stats. > > System instrumentation people are already complaining about how costly > /proc parsing is. If you have to get some nontrivial stat from all > threads in the system, and if Linux doesnt offer that counter or summary > by default, it gets pretty expensive. Agreed. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit 2005-11-04 8:07 ` Ingo Molnar 2005-11-04 10:06 ` Paul Jackson @ 2005-11-04 15:24 ` Martin J. Bligh 1 sibling, 0 replies; 241+ messages in thread From: Martin J. Bligh @ 2005-11-04 15:24 UTC (permalink / raw) To: Ingo Molnar, Andrew Morton Cc: pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, kravetz, linux-mm, linux-kernel, lhms-devel > System instrumentation people are already complaining about how costly > /proc parsing is. If you have to get some nontrivial stat from all > threads in the system, and if Linux doesnt offer that counter or summary > by default, it gets pretty expensive. > > One solution i can think of would be to make a binary representation of > /proc/<pid>/stats readonly-mmap-able. This would add a 4K page to every > task tracked that way, and stats updates would have to update this page > too - but it would make instrumentation of running apps really > unintrusive and scalable. That would be awesome - the current methods we have are mostly crap. There are some atomicity issues though. Plus when I suggested this 2 years ago, everyone told me to piss off, but I'm not bitter ;-) Seriously, we do need a fast communication mechanism. M. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit
  2005-11-04  7:36 ` Andrew Morton
  2005-11-04  8:07   ` Ingo Molnar
@ 2005-11-04  8:18   ` Arjan van de Ven
  2005-11-04 10:04     ` Paul Jackson
  2 siblings, 1 reply; 241+ messages in thread
From: Arjan van de Ven @ 2005-11-04 8:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, pbadari, torvalds, jdike, rob, nickpiggin, gh,
    kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm,
    linux-kernel, lhms-devel

On Thu, 2005-11-03 at 23:36 -0800, Andrew Morton wrote:
> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Andrew Morton <akpm@osdl.org> wrote:
> >
> > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > off processes once they started swapping.  We thought that it could be
> > > done from userspace, but we need a way for userspace to detect when a
> > > task is being swapped on a per-task basis.
> >
> > wouldn't the clean solution here be a "swap ulimit"?
>
> Well it's _a_ solution, but it's terribly specific.
>
> How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
> non-zero, kill <pid>?

Well, or do it the other way around: write a counter to such a thing
and kill when it hits zero (similar to the CPU perf counter stuff on
x86).

Doing this from userspace is tricky; what if the task dies of natural
causes and the pid gets reused between the time the userspace app
reads the value and the time it decides the time is up and it's time
for a kill?  (And on a busy server that can be quite a bit of time.)

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit 2005-11-04 8:18 ` Arjan van de Ven @ 2005-11-04 10:04 ` Paul Jackson 0 siblings, 0 replies; 241+ messages in thread From: Paul Jackson @ 2005-11-04 10:04 UTC (permalink / raw) To: Arjan van de Ven Cc: akpm, mingo, pbadari, torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Arjan wrote: > doing this from userspace is tricky; what if the task dies of natural > causes and the pid gets reused, between the time the userspace app reads > the value and the time it decides the time is up and time for a kill.... > (and on a busy server that can be quite a bit of time) If pids are being reused within seconds of their being freed up, then the batch managers running on the big HPC systems I care about are so screwed it isn't even funny. They depend heavily on being able to identify the task pids in a job and then doing something to those tasks (suspend, kill, gather stats, ...). -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit 2005-11-04 7:36 ` Andrew Morton 2005-11-04 8:07 ` Ingo Molnar 2005-11-04 8:18 ` Arjan van de Ven @ 2005-11-04 15:14 ` Rob Landley 2 siblings, 0 replies; 241+ messages in thread From: Rob Landley @ 2005-11-04 15:14 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, pbadari, torvalds, jdike, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel On Friday 04 November 2005 01:36, Andrew Morton wrote: > > wouldnt the clean solution here be a "swap ulimit"? > > Well it's _a_ solution, but it's terribly specific. > > How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's > non-zero, kill <pid>? Things like make fork lots of short-lived child processes, and some of those can be quite memory intensive. (The gcc 4.0.2 build causes an outright swap storm for me about halfway through, doing genattrtab and then again compiling the result). Is there any way for parents to collect their child process's statistics when the children exit? Or by the time the actual swapper exits, do we not care anymore? Rob ^ permalink raw reply [flat|nested] 241+ messages in thread
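For what it's worth, part of this exists already: rusage accounting is folded into the parent when a child is reaped, so after wait() the parent can read its children's accumulated fault counts via getrusage(RUSAGE_CHILDREN) (or wait4()). ru_majflt - faults that required I/O, which includes swapins - is the closest thing to a per-job indicator here. A minimal example:

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	struct rusage ru;
	pid_t pid = fork();

	if (pid == 0) {			/* child: do some work, then exit */
		system("gcc --version > /dev/null 2>&1");
		_exit(0);
	}
	wait(NULL);			/* stats are added when reaped */

	if (getrusage(RUSAGE_CHILDREN, &ru) == 0)
		printf("children: %ld major faults, %ld minor faults\n",
		       ru.ru_majflt, ru.ru_minflt);
	return 0;
}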
* Re: [patch] swapin rlimit
  2005-11-04  7:26 ` [patch] swapin rlimit Ingo Molnar
  2005-11-04  7:36   ` Andrew Morton
@ 2005-11-04 10:14   ` Bernd Petrovitsch
  2005-11-04 10:21     ` Ingo Molnar
  1 sibling, 1 reply; 241+ messages in thread
From: Bernd Petrovitsch @ 2005-11-04 10:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Badari Pulavarty, Linus Torvalds, jdike, rob,
    nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz,
    linux-mm, linux-kernel, lhms-devel

On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote:
> * Andrew Morton <akpm@osdl.org> wrote:
>
> > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > off processes once they started swapping.  We thought that it could be
> > done from userspace, but we need a way for userspace to detect when a
> > task is being swapped on a per-task basis.
>
> wouldn't the clean solution here be a "swap ulimit"?

Hmm, how is that different from "mlockall(MCL_CURRENT|MCL_FUTURE);"?
OK, mlockall() can only be done by root (processes).

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit 2005-11-04 10:14 ` Bernd Petrovitsch @ 2005-11-04 10:21 ` Ingo Molnar 2005-11-04 11:17 ` Bernd Petrovitsch 0 siblings, 1 reply; 241+ messages in thread From: Ingo Molnar @ 2005-11-04 10:21 UTC (permalink / raw) To: Bernd Petrovitsch Cc: Andrew Morton, Badari Pulavarty, Linus Torvalds, jdike, rob, nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel * Bernd Petrovitsch <bernd@firmix.at> wrote: > On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote: > > * Andrew Morton <akpm@osdl.org> wrote: > > > > > Similarly, that SGI patch which was rejected 6-12 months ago to kill > > > off processes once they started swapping. We thought that it could be > > > done from userspace, but we need a way for userspace to detect when a > > > task is being swapped on a per-task basis. > > > > wouldnt the clean solution here be a "swap ulimit"? > > Hmm, where is the difference to "mlockall(MCL_CURRENT|MCL_FUTURE);"? > OK, mlockall() can only be done by root (processes). what do you mean? mlockall pins down all pages. swapin ulimit kills the task (and thus frees all the RAM it had) when it touches swap for the first time. These two solutions almost oppose each other! Ingo ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [patch] swapin rlimit
  2005-11-04 10:21 ` Ingo Molnar
@ 2005-11-04 11:17   ` Bernd Petrovitsch
  0 siblings, 0 replies; 241+ messages in thread
From: Bernd Petrovitsch @ 2005-11-04 11:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Badari Pulavarty, Linus Torvalds, jdike, rob,
    nickpiggin, gh, kamezawa.hiroyu, haveblue, mel, mbligh, kravetz,
    linux-mm, linux-kernel, lhms-devel

On Fri, 2005-11-04 at 11:21 +0100, Ingo Molnar wrote:
> * Bernd Petrovitsch <bernd@firmix.at> wrote:
> > On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote:
> > > * Andrew Morton <akpm@osdl.org> wrote:
> > > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > > off processes once they started swapping.  We thought that it could be
> > > > done from userspace, but we need a way for userspace to detect when a
> > > > task is being swapped on a per-task basis.
> > >
> > > wouldn't the clean solution here be a "swap ulimit"?
> >
> > Hmm, how is that different from "mlockall(MCL_CURRENT|MCL_FUTURE);"?
> > OK, mlockall() can only be done by root (processes).
>
> what do you mean? mlockall pins down all pages. swapin ulimit kills the

in memory.

> task (and thus frees all the RAM it had) when it touches swap for the
> first time. These two solutions almost oppose each other!

Almost, IMHO, since locked pages in RAM avoid swapping entirely.
Probably "complement each other" is more correct.  Given the limit
for "max locked memory", it should behave pretty much the same when
the process hits its limits.  OK, the difference may be loaded
executable and lib pages.

Hmm, delivering a signal on the first swapped-out page might be
another simple solution, and the process could do something to avoid
it.

The nice thing about the "swap ulimit" is that it is easy to
understand what it is (which is always a good thing).  Generating a
similar effect with the combination of two other features is probably
somewhat more arcane.

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services

^ permalink raw reply [flat|nested] 241+ messages in thread
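For comparison, the mlockall() approach Bernd mentions pins everything so the process never touches swap at all - the opposite failure mode of the swapin ulimit, which lets the task fault once and then kills it:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* Lock all current and future pages into RAM.  Fails with EPERM
	 * without privilege (or a sufficient locked-memory rlimit). */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) {
		perror("mlockall");
		return 1;
	}

	/* ... run the memory-intensive job here ... */
	return 0;
}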
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
  2005-11-02  7:46 ` Gerrit Huizenga
  2005-11-02  8:50   ` Nick Piggin
@ 2005-11-02 10:41   ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar
  2005-11-02 11:04     ` Gerrit Huizenga
  1 sibling, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-02 10:41 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
    Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
    Linux Kernel Mailing List, lhms

* Gerrit Huizenga <gh@us.ibm.com> wrote:

> > generic unpluggable kernel RAM _will not work_.
>
> Actually, it will.  Well, depending on terminology.

'generic unpluggable kernel RAM' means what it says: any RAM seen by
the kernel can be unplugged, always (as long as the unplug request is
reasonable and there is enough free space to migrate in-use pages to).

> There are two usage models here - those which intend to remove
> physical elements and those where the kernel returns management of
> its virtualized "physical" memory to a hypervisor.  In the latter
> case, a hypervisor already maintains a virtual map of the memory and
> the OS needs to release virtualized "physical" memory.  I think you
> are referring to RAM here as the physical component; however these
> same defrag patches help where a hypervisor is maintaining the real
> physical memory below the operating system and the OS is managing a
> virtualized "physical" memory.

Reliable unmapping of "generic kernel RAM" is not possible even in a
virtualized environment.  Think of the 'live pointers' problem I
outlined in an earlier mail in this thread today.

	Ingo

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 10:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar @ 2005-11-02 11:04 ` Gerrit Huizenga 2005-11-02 12:00 ` Ingo Molnar 0 siblings, 1 reply; 241+ messages in thread From: Gerrit Huizenga @ 2005-11-02 11:04 UTC (permalink / raw) To: Ingo Molnar Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote: > > * Gerrit Huizenga <gh@us.ibm.com> wrote: > > > > generic unpluggable kernel RAM _will not work_. > > > > Actually, it will. Well, depending on terminology. > > 'generic unpluggable kernel RAM' means what it says: any RAM seen by the > kernel can be unplugged, always. (as long as the unplug request is > reasonable and there is enough free space to migrate in-use pages to). Okay, I understand your terminology. Yes, I can not point to any particular piece of memory and say "I want *that* one" and have that request succeed. However, I can say "find me 50 chunks of memory of your choosing" and have a very good chance of finding enough memory to satisfy my request. > > There are two usage models here - those which intend to remove > > physical elements and those where the kernel returnss management of > > its virtualized "physical" memory to a hypervisor. In the latter > > case, a hypervisor already maintains a virtual map of the memory and > > the OS needs to release virtualized "physical" memory. I think you > > are referring to RAM here as the physical component; however these > > same defrag patches help where a hypervisor is maintaining the real > > physical memory below the operating system and the OS is managing a > > virtualized "physical" memory. > > reliable unmapping of "generic kernel RAM" is not possible even in a > virtualized environment. Think of the 'live pointers' problem i outlined > in an earlier mail in this thread today. Yeah - and that isn't what is being proposed here. The goal is to ask the kernel to identify some memory which can be legitimately freed and hasten the freeing of that memory. gerrit ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 11:04 ` Gerrit Huizenga @ 2005-11-02 12:00 ` Ingo Molnar 2005-11-02 12:42 ` Dave Hansen 2005-11-02 15:02 ` Gerrit Huizenga 0 siblings, 2 replies; 241+ messages in thread From: Ingo Molnar @ 2005-11-02 12:00 UTC (permalink / raw) To: Gerrit Huizenga Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms * Gerrit Huizenga <gh@us.ibm.com> wrote: > > On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote: > > > > * Gerrit Huizenga <gh@us.ibm.com> wrote: > > > > > > generic unpluggable kernel RAM _will not work_. > > > > > > Actually, it will. Well, depending on terminology. > > > > 'generic unpluggable kernel RAM' means what it says: any RAM seen by the > > kernel can be unplugged, always. (as long as the unplug request is > > reasonable and there is enough free space to migrate in-use pages to). > > Okay, I understand your terminology. Yes, I can not point to any > particular piece of memory and say "I want *that* one" and have that > request succeed. However, I can say "find me 50 chunks of memory > of your choosing" and have a very good chance of finding enough > memory to satisfy my request. but that's obviously not 'generic unpluggable kernel RAM'. It's very special RAM: RAM that is free or easily freeable. I never argued that such RAM is not returnable to the hypervisor. > > reliable unmapping of "generic kernel RAM" is not possible even in a > > virtualized environment. Think of the 'live pointers' problem i outlined > > in an earlier mail in this thread today. > > Yeah - and that isn't what is being proposed here. The goal is to > ask the kernel to identify some memory which can be legitimately > freed and hasten the freeing of that memory. but that's very easy to identify: check the free list or the clean list(s). No defragmentation necessary. [unless the unit of RAM mapping between hypervisor and guest is too coarse (i.e. not 4K pages).] Ingo ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 12:00 ` Ingo Molnar @ 2005-11-02 12:42 ` Dave Hansen 2005-11-02 15:02 ` Gerrit Huizenga 1 sibling, 0 replies; 241+ messages in thread From: Dave Hansen @ 2005-11-02 12:42 UTC (permalink / raw) To: Ingo Molnar Cc: Gerrit Huizenga, KAMEZAWA Hiroyuki, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Wed, 2005-11-02 at 13:00 +0100, Ingo Molnar wrote: > > > Yeah - and that isn't what is being proposed here. The goal is to > > ask the kernel to identify some memory which can be legitimately > > freed and hasten the freeing of that memory. > > but that's very easy to identify: check the free list or the clean > list(s). No defragmentation necessary. [unless the unit of RAM mapping > between hypervisor and guest is too coarse (i.e. not 4K pages).] It needs to be that coarse in cases where HugeTLB is desired for use. I'm not sure I could convince the DB guys to give up large pages, they're pretty hooked on them. ;) -- Dave ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 12:00 ` Ingo Molnar
2005-11-02 12:42 ` Dave Hansen
@ 2005-11-02 15:02 ` Gerrit Huizenga
2005-11-03 0:10 ` Rob Landley
1 sibling, 1 reply; 241+ messages in thread
From: Gerrit Huizenga @ 2005-11-02 15:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Nick Piggin,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wed, 02 Nov 2005 13:00:48 +0100, Ingo Molnar wrote:
>
> * Gerrit Huizenga <gh@us.ibm.com> wrote:
> >
> > > On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote:
> > >
> > > * Gerrit Huizenga <gh@us.ibm.com> wrote:
> > > >
> > > > > generic unpluggable kernel RAM _will not work_.
> > > >
> > > > Actually, it will. Well, depending on terminology.
> > >
> > > 'generic unpluggable kernel RAM' means what it says: any RAM seen by the
> > > kernel can be unplugged, always. (as long as the unplug request is
> > > reasonable and there is enough free space to migrate in-use pages to).
> >
> > Okay, I understand your terminology. Yes, I can not point to any
> > particular piece of memory and say "I want *that* one" and have that
> > request succeed. However, I can say "find me 50 chunks of memory
> > of your choosing" and have a very good chance of finding enough
> > memory to satisfy my request.
>
> but that's obviously not 'generic unpluggable kernel RAM'. It's very
> special RAM: RAM that is free or easily freeable. I never argued that
> such RAM is not returnable to the hypervisor.

Okay - and 'generic unpluggable kernel RAM' has not been a goal for
the hypervisor based environments. I believe it is closer to being
a goal for those machines which want to hot-remove DIMMs or physical
memory, e.g. those with IA64 machines wishing to remove entire nodes.

> > > reliable unmapping of "generic kernel RAM" is not possible even in a
> > > virtualized environment. Think of the 'live pointers' problem i outlined
> > > in an earlier mail in this thread today.
> >
> > Yeah - and that isn't what is being proposed here. The goal is to
> > ask the kernel to identify some memory which can be legitimately
> > freed and hasten the freeing of that memory.
>
> but that's very easy to identify: check the free list or the clean
> list(s). No defragmentation necessary. [unless the unit of RAM mapping
> between hypervisor and guest is too coarse (i.e. not 4K pages).]

Ah, but the hypervisor often manages large page sizes, e.g. 64 MB.
It doesn't manage page rights for each guest OS at the 4 K granularity.
Hypervisors are theoretically light in terms of memory needs and general
footprint. Picture the overhead of tracking rights/permissions of each
page of memory and its assignment to any of, say, 256 different guest
operating systems. For a machine of any size, that would be a huge
amount of state for a hypervisor to maintain. Would you really want a
hypervisor to keep that much state? Or is it more reasonable for a
hypervisor to track, say, 64 MB chunks and the rights of that memory for
a number of guest operating systems? Even if the number of guests is
small, the data structures for fast memory management would grow quickly.

gerrit
^ permalink raw reply [flat|nested] 241+ messages in thread
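To make the bookkeeping argument above concrete, here is a back-of-the-envelope sketch comparing per-page against per-chunk tracking. The specific numbers (1 TiB of machine RAM, 16 bytes of state per tracked unit) are illustrative assumptions, not figures from the thread; only the 4 KiB and 64 MiB granularities come from the discussion.

#include <stdio.h>

int main(void)
{
	unsigned long long ram      = 1ULL << 40;  /* 1 TiB of machine RAM (assumed) */
	unsigned long long page     = 4ULL << 10;  /* 4 KiB tracking unit            */
	unsigned long long chunk    = 64ULL << 20; /* 64 MiB tracking unit           */
	unsigned long long per_unit = 16;          /* bytes of state per unit (assumed) */

	/* 268 million entries, ~4 GiB of hypervisor state */
	printf("4KiB units : %llu entries, ~%llu MiB of state\n",
	       ram / page, ram / page * per_unit >> 20);
	/* 16384 entries, ~256 KiB of hypervisor state */
	printf("64MiB units: %llu entries, ~%llu KiB of state\n",
	       ram / chunk, ram / chunk * per_unit >> 10);
	return 0;
}

Even with a modest per-unit cost, 4 KiB granularity puts the hypervisor's tracking state in the gigabytes for a terabyte machine, which is the scaling problem Gerrit is pointing at.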
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 15:02 ` Gerrit Huizenga
@ 2005-11-03 0:10 ` Rob Landley
0 siblings, 0 replies; 241+ messages in thread
From: Rob Landley @ 2005-11-03 0:10 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Ingo Molnar, Kamezawa Hiroyuki, Dave Hansen, Mel Gorman,
Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wednesday 02 November 2005 09:02, Gerrit Huizenga wrote:
> > but that's obviously not 'generic unpluggable kernel RAM'. It's very
> > special RAM: RAM that is free or easily freeable. I never argued that
> > such RAM is not returnable to the hypervisor.
>
> Okay - and 'generic unpluggable kernel RAM' has not been a goal for
> the hypervisor based environments. I believe it is closer to being
> a goal for those machines which want to hot-remove DIMMs or physical
> memory, e.g. those with IA64 machines wishing to remove entire nodes

Keep in mind that just about any virtualized environment might benefit from
being able to tell the parent system "we're not using this ram". I mentioned
UML, and I can also imagine a Linux driver that signals qemu (or even vmware)
to say "this chunk of physical memory isn't currently in use", and even if
they don't actually _free_ it they can call madvise() on it.

Heck, if we have prezeroing of large blocks, telling your emulator to
madvise(MADV_DONTNEED) the pages for you should just plug right in to that
infrastructure...

Rob
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:19 ` Ingo Molnar
2005-11-02 7:46 ` Gerrit Huizenga
@ 2005-11-02 7:57 ` Nick Piggin
1 sibling, 0 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 7:57 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

Ingo Molnar wrote:
> * Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>>My own target is NUMA node hotplug, what NUMA node hotplug want is
>>- [remove the range of memory] For this approach, admin should define
>> *core* node and removable node. Memory on removable node is removable.
>> Dividing area into removable and not-removable is needed, because
>> we cannot allocate any kernel's object on removable area.
>> Removable area should be 100% removable. Customer can know the limitation
>> before using.
>
> that's a perfectly fine method, and is quite similar to the 'separate
> zone' approach Nick mentioned too. It is also easily understandable for
> users/customers.
>

I agree - and I think it should be easy to configure out of the kernel
for those that don't want the functionality, and should add very little
complexity to core code (all without looking at the patches so I could
be very wrong!).

> but what is a dangerous fallacy is that we will be able to support hot
> memory unplug of generic kernel RAM in any reliable way!
>

Very true.

> you really have to look at this from the conceptual angle: 'can an
> approach ever lead to a satisfactory result'? If the answer is 'no',
> then we _must not_ add a 90% solution that we _know_ will never be a
> 100% solution.
>
> for the separate-removable-zones approach we see the end of the tunnel.
> Separate zones are well-understood.
>

Yep, I don't see why this doesn't cover all the needs that the frag
patches attempt (hot unplug, hugepage dynamic reserves).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:49 ` Dave Hansen
2005-11-01 15:01 ` Ingo Molnar
@ 2005-11-02 0:51 ` Nick Piggin
2005-11-02 7:42 ` Dave Hansen
2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman
1 sibling, 2 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 0:51 UTC (permalink / raw)
To: Dave Hansen
Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

Dave Hansen wrote:

> What the fragmentation patches _can_ give us is the ability to have 100%
> success in removing certain areas: the "user-reclaimable" areas
> referenced in the patch. This gives a customer at least the ability to
> plan for how dynamically reconfigurable a system should be.
>

But the "user-reclaimable" areas can still be taken over by other
areas which become fragmented.

That's like saying we can already guarantee 100% success in removing
areas that are unfragmented and free, or freeable.

> After these patches, the next logical steps are to increase the
> knowledge that the slabs have about fragmentation, and to teach some of
> the shrinkers about fragmentation.
>

I don't like all this work and complexity and overheads going into a
partial solution.

Look: if you have to guarantee memory can be shrunk, set aside a zone
for it (that only fills with user reclaimable areas). This is better
than the current frag patches because it will give you the 100%
guarantee that you need (provided we have page migration to move mlocked
pages).

If you don't need a guarantee, then our current, simple system does the
job perfectly.

> After that, we'll need some kind of virtual remapping, breaking the 1:1
> kernel virtual mapping, so that the most problematic pages can be
> remapped. These pages would retain their virtual address, but get a
> new physical address. However, this is quite far down the road and will
> require some serious evaluation because it impacts how normal devices
> are able to DMA. The ppc64 proprietary hypervisor has features to work
> around these issues, and any new hypervisors wishing to support partition
> memory hotplug would likely have to follow suit.
>

I would more like to see something like this happen (provided it was
nicely abstracted away and could be CONFIGed out for the 99.999% of
users who don't need the overhead or complexity).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 0:51 ` Nick Piggin @ 2005-11-02 7:42 ` Dave Hansen 2005-11-02 8:24 ` Nick Piggin 2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman 1 sibling, 1 reply; 241+ messages in thread From: Dave Hansen @ 2005-11-02 7:42 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Wed, 2005-11-02 at 11:51 +1100, Nick Piggin wrote: > Look: if you have to guarantee memory can be shrunk, set aside a zone > for it (that only fills with user reclaimable areas). This is better > than the current frag patches because it will give you the 100% > guarantee that you need (provided we have page migration to move mlocked > pages). With Mel's patches, you can easily add the same guarantee. Look at the code in fallback_alloc() (patch 5/8). It would be quite easy to modify the fallback lists to disallow fallbacks into areas from which we would like to remove memory. That was left out for simplicity. As you say, they're quite complex as it is. Would you be interested in seeing a patch to provide those kinds of guarantees? We've had a bit of experience with a hotpluggable zone approach before. Just like the current topic patches, you're right, that approach can also provide strong guarantees. However, the issue comes if the system ever needs to move memory between such zones, such as if a user ever decides that they'd prefer to break hotplug guarantees rather than OOM. Do you think changing what a particular area of memory is being used for would ever be needed? One other thing, if we decide to take the zones approach, it would have no other side benefits for the kernel. It would be for hotplug only and I don't think even the large page users would get much benefit. -- Dave ^ permalink raw reply [flat|nested] 241+ messages in thread
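A minimal sketch of the fallback idea Dave describes, loosely modelled on the fallback_allocs[] table his mail refers to. The RCLM_* names come from the thread; the exact table layout here is an assumption for illustration, not the code in the posted patches. The guarantee he mentions amounts to never listing an unplug-earmarked type in the kernel types' fallback rows.

/* For each allocation type, the order in which other types' reserved
 * areas may be raided once the type's own areas are exhausted. */
#define RCLM_NORCLM        0    /* kernel, not reclaimable          */
#define RCLM_KERN          1    /* kernel, reclaimable (e.g. slab)  */
#define RCLM_EASY          2    /* user pages, easily reclaimable   */
#define RCLM_TYPES         3
#define RCLM_FALLBACK_END  (-1)

static int fallback_allocs[RCLM_TYPES][RCLM_TYPES + 1] = {
	[RCLM_NORCLM] = { RCLM_NORCLM, RCLM_KERN,   RCLM_EASY,   RCLM_FALLBACK_END },
	[RCLM_KERN]   = { RCLM_KERN,   RCLM_NORCLM, RCLM_EASY,   RCLM_FALLBACK_END },
	/* To turn the heuristic into a removal guarantee, RCLM_EASY would
	 * simply never appear in the two kernel rows above. */
	[RCLM_EASY]   = { RCLM_EASY,   RCLM_KERN,   RCLM_NORCLM, RCLM_FALLBACK_END },
};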
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 7:42 ` Dave Hansen
@ 2005-11-02 8:24 ` Nick Piggin
2005-11-02 8:33 ` Yasunori Goto
0 siblings, 1 reply; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 8:24 UTC (permalink / raw)
To: Dave Hansen
Cc: Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

Dave Hansen wrote:
> On Wed, 2005-11-02 at 11:51 +1100, Nick Piggin wrote:
>
>>Look: if you have to guarantee memory can be shrunk, set aside a zone
>>for it (that only fills with user reclaimable areas). This is better
>>than the current frag patches because it will give you the 100%
>>guarantee that you need (provided we have page migration to move mlocked
>>pages).
>
> With Mel's patches, you can easily add the same guarantee. Look at the
> code in fallback_alloc() (patch 5/8). It would be quite easy to modify
> the fallback lists to disallow fallbacks into areas from which we would
> like to remove memory. That was left out for simplicity. As you say,
> they're quite complex as it is. Would you be interested in seeing a
> patch to provide those kinds of guarantees?
>

On top of Mel's patch? I think this is essential for any guarantees that
you might be interested in... but it would just mean that now you have a
redundant extra zoning layer.

I think ZONE_REMOVABLE is something that really needs to be looked at
again if you need a hotunplug solution in the kernel.

> We've had a bit of experience with a hotpluggable zone approach before.
> Just like the current topic patches, you're right, that approach can
> also provide strong guarantees. However, the issue comes if the system
> ever needs to move memory between such zones, such as if a user ever
> decides that they'd prefer to break hotplug guarantees rather than OOM.
>

I can imagine one could have a sysctl to allow/disallow non-easy-reclaim
allocations from ZONE_REMOVABLE.

As Ingo says, neither way is going to give a 100% solution - I wouldn't
like to see so much complexity added to bring us from a ZONE_REMOVABLE
80% solution to a 90% solution. I believe this is where Linus' "perfect
is the enemy of good" quote applies.

> Do you think changing what a particular area of memory is being used for
> would ever be needed?
>

Perhaps, but Mel's patch only guarantees you can change once, same as
ZONE_REMOVABLE. Once you eat up those easy-to-reclaim areas, you can't
get them back.

> One other thing, if we decide to take the zones approach, it would have
> no other side benefits for the kernel. It would be for hotplug only and
> I don't think even the large page users would get much benefit.
>

Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
talking about other higher-order users, I still think we can't guarantee
past about order 1 or 2 with Mel's patch and they simply need to have
some other ways to do things.

But I think using zones would have advantages in that they would help
give zones and zone balancing more scrutiny and test coverage in the
kernel, which is sorely needed since everyone threw out their highmem
systems :P

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 241+ messages in thread
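A minimal sketch of the zone-based alternative Nick outlines above: a removable zone that only easy-reclaim allocations may use, plus a switch for trading the unplug guarantee against OOM. The zone name matches Yasunori Goto's patch mentioned in the thread; the __GFP_EASYRCLM test, the sysctl and this helper are illustrative assumptions, not code from any posted patch.

#define ZONE_REMOVABLE   3          /* assumed zone index            */
#define __GFP_EASYRCLM   0x80000u   /* illustrative flag bit         */

static int sysctl_hotremove_strict = 1;  /* 1: keep the unplug guarantee */

static int zone_allowed(int zone_idx, unsigned int gfp_mask)
{
	if (zone_idx != ZONE_REMOVABLE)
		return 1;                /* other zones: no restriction */
	if (gfp_mask & __GFP_EASYRCLM)
		return 1;                /* user pages are always fine  */
	/* Kernel allocations spill into the removable zone only if the
	 * admin prefers breaking the hotplug guarantee to going OOM. */
	return !sysctl_hotremove_strict;
}

The design point being debated is exactly this: the check is trivial and lives in zone selection rather than inside the buddy allocator's free lists.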
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:24 ` Nick Piggin
@ 2005-11-02 8:33 ` Yasunori Goto
2005-11-02 8:43 ` Nick Piggin
0 siblings, 1 reply; 241+ messages in thread
From: Yasunori Goto @ 2005-11-02 8:33 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

> > One other thing, if we decide to take the zones approach, it would have
> > no other side benefits for the kernel. It would be for hotplug only and
> > I don't think even the large page users would get much benefit.
> >
>
> Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
> talking about other higher-order users, I still think we can't guarantee
> past about order 1 or 2 with Mel's patch and they simply need to have
> some other ways to do things.

Hmmm. I don't see it at this point.
Why do you think ZONE_REMOVABLE can satisfy hugepage users?
At least, my ZONE_REMOVABLE patch doesn't have any concern about
fragmentation.

Bye.

--
Yasunori Goto
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:33 ` Yasunori Goto
@ 2005-11-02 8:43 ` Nick Piggin
2005-11-02 14:51 ` Martin J. Bligh
2005-11-02 23:28 ` Rob Landley
0 siblings, 2 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-02 8:43 UTC (permalink / raw)
To: Yasunori Goto
Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Martin J. Bligh,
Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

Yasunori Goto wrote:

>>>One other thing, if we decide to take the zones approach, it would have
>>>no other side benefits for the kernel. It would be for hotplug only and
>>>I don't think even the large page users would get much benefit.
>>>
>>
>>Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
>>talking about other higher-order users, I still think we can't guarantee
>>past about order 1 or 2 with Mel's patch and they simply need to have
>>some other ways to do things.
>
> Hmmm. I don't see it at this point.
> Why do you think ZONE_REMOVABLE can satisfy hugepage users?
> At least, my ZONE_REMOVABLE patch doesn't have any concern about
> fragmentation.
>

Well I think it can satisfy hugepage allocations simply because
we can be reasonably sure of being able to free contiguous regions.
Of course it will be memory no longer easily reclaimable, same as
the case for the frag patches. Nor would the name ZONE_REMOVABLE any
longer be the most appropriate!

But my point is, the basic mechanism is there and is workable.
Hugepages and memory unplug are the two main reasons for IBM to be
pushing this AFAIKS.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 8:43 ` Nick Piggin @ 2005-11-02 14:51 ` Martin J. Bligh 2005-11-02 23:28 ` Rob Landley 1 sibling, 0 replies; 241+ messages in thread From: Martin J. Bligh @ 2005-11-02 14:51 UTC (permalink / raw) To: Nick Piggin, Yasunori Goto Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms > Well I think it can satisfy hugepage allocations simply because > we can be reasonably sure of being able to free contiguous regions. > Of course it will be memory no longer easily reclaimable, same as > the case for the frag patches. Nor would be name ZONE_REMOVABLE any > longer be the most appropriate! > > But my point is, the basic mechanism is there and is workable. > Hugepages and memory unplug are the two main reasons for IBM to be > pushing this AFAIKS. No, that's not true - those are just the "exciting" features that go on the back of it. Look back in this email thread - there's lots of other reasons to fix fragmentation. I don't believe you can eliminate all the order > 0 allocations in the kernel. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-02 8:43 ` Nick Piggin
2005-11-02 14:51 ` Martin J. Bligh
@ 2005-11-02 23:28 ` Rob Landley
2005-11-03 5:26 ` Jeff Dike
1 sibling, 1 reply; 241+ messages in thread
From: Rob Landley @ 2005-11-02 23:28 UTC (permalink / raw)
To: Nick Piggin, user-mode-linux-devel
Cc: Yasunori Goto, Dave Hansen, Ingo Molnar, Mel Gorman,
Martin J. Bligh, Andrew Morton, kravetz, linux-mm,
Linux Kernel Mailing List, lhms

On Wednesday 02 November 2005 02:43, Nick Piggin wrote:
> > Hmmm. I don't see it at this point.
> > Why do you think ZONE_REMOVABLE can satisfy hugepage users?
> > At least, my ZONE_REMOVABLE patch doesn't have any concern about
> > fragmentation.
>
> Well I think it can satisfy hugepage allocations simply because
> we can be reasonably sure of being able to free contiguous regions.
> Of course it will be memory no longer easily reclaimable, same as
> the case for the frag patches. Nor would the name ZONE_REMOVABLE any
> longer be the most appropriate!
>
> But my point is, the basic mechanism is there and is workable.
> Hugepages and memory unplug are the two main reasons for IBM to be
> pushing this AFAIKS.

Who cares what IBM is pushing? I'm interested in fragmentation avoidance for
User Mode Linux.

I use User Mode Linux to virtualize a system build, and one problem I
currently have is that some workloads temporarily use a lot of memory. For
example, I can run a complete system build in about 48 megs of ram: except
for building GCC. That spikes to a couple hundred megabytes.

If I allocate 256 megabytes of memory to UML, that's half the memory on my
laptop and UML will just use it for redundant caching and such while desktop
performance gets a bit unhappy with the build going.

UML gets an instance's "physical memory" by allocating a temporary file,
mmapping it, and deleting it (which signals to the vfs that flushing this
data to backing store should only be done under memory pressure from the
rest of the OS, because the file's going away when it's closed so there's no
need to ever write the data back.)

With fragmentation reduction and prezeroing, UML suddenly gains the option of
calling madvise(DONT_NEED) on sufficiently large blocks as A) a fast way of
prezeroing, B) a way of giving memory back to the host OS when it's not in
use.

This has _nothing_ to do with IBM. Or large systems. This is some random
developer trying to run a virtualized system build on his laptop.

(The reason I need to use UML is that I build uClibc with the newest 2.6
kernel headers I can, link apps against it, and then run many of those apps
during later stages of the build. If the kernel headers used to build libc
are sufficiently newer than the kernel the build is running under, I get
segfaults because the new libc tries to use kernel features that aren't
there on the host system, but will be in the final system. I also get the
ability to mknod/chown/chroot without needing root access on the host system
for free...)

Rob
^ permalink raw reply [flat|nested] 241+ messages in thread
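A minimal userspace sketch of the "deleted temporary file" trick Rob describes, with the madvise call he proposes. File name, size and the trimmed error handling are illustrative; as Jeff's reply below points out, MADV_DONTNEED alone is not enough to free dirty file-backed pages on the host.

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 64UL << 20;               /* 64MiB of guest "RAM" (assumed) */
	char path[] = "/tmp/vm-mem-XXXXXX";

	int fd = mkstemp(path);                 /* create the temporary file      */
	unlink(path);                           /* delete it: gone when closed    */
	ftruncate(fd, size);

	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
	                 MAP_SHARED, fd, 0);

	memset(mem, 0xaa, size);                /* the guest dirties its memory   */

	/* Drop the mapped pages for a page-aligned range.  For a MAP_SHARED
	 * file mapping this detaches the page tables, but the dirty pages
	 * backing the deleted file stay pinned in the page cache - which is
	 * exactly why the thread moves on to MADV_REMOVE. */
	madvise(mem, size, MADV_DONTNEED);

	munmap(mem, size);
	close(fd);
	return 0;
}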
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 23:28 ` Rob Landley @ 2005-11-03 5:26 ` Jeff Dike 2005-11-03 5:41 ` Rob Landley 0 siblings, 1 reply; 241+ messages in thread From: Jeff Dike @ 2005-11-03 5:26 UTC (permalink / raw) To: Rob Landley Cc: Nick Piggin, user-mode-linux-devel, Yasunori Goto, Dave Hansen, Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote: > With fragmentation reduction and prezeroing, UML suddenly gains the option of > calling madvise(DONT_NEED) on sufficiently large blocks as A) a fast way of > prezeroing, B) a way of giving memory back to the host OS when it's not in > use. DONT_NEED is insufficient. It doesn't discard the data in dirty file-backed pages. Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE)) which does do the trick, and I have a UML patch which adds memory hotplug. This combination does free memory back to the host. Jeff ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-03 5:26 ` Jeff Dike @ 2005-11-03 5:41 ` Rob Landley 2005-11-04 3:26 ` [uml-devel] " Blaisorblade 0 siblings, 1 reply; 241+ messages in thread From: Rob Landley @ 2005-11-03 5:41 UTC (permalink / raw) To: Jeff Dike Cc: Nick Piggin, user-mode-linux-devel, Yasunori Goto, Dave Hansen, Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Wednesday 02 November 2005 23:26, Jeff Dike wrote: > On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote: > > With fragmentation reduction and prezeroing, UML suddenly gains the > > option of calling madvise(DONT_NEED) on sufficiently large blocks as A) a > > fast way of prezeroing, B) a way of giving memory back to the host OS > > when it's not in use. > > DONT_NEED is insufficient. It doesn't discard the data in dirty > file-backed pages. I thought DONT_NEED would discard the page cache, and punch was only needed to free up the disk space. I was hoping that since the file was deleted from disk and is already getting _some_ special treatment (since it's a longstanding "poor man's shared memory" hack), that madvise wouldn't flush the data to disk, but would just zero it out. A bit optimistic on my part, I know. :) > Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE)) > which does do the trick, and I have a UML patch which adds memory > hotplug. This combination does free memory back to the host. I saw it wander by, and am all for it. If it goes in, it's obviously the right thing to use. You may remember I asked about this two years ago: http://seclists.org/lists/linux-kernel/2003/Dec/0919.html And a reply indicated that SVr4 had it, but we don't. I assume the "naming discussion" mentioned in the recent thread already scrubbed through this old thread to determine that the SVr4 API was icky. http://seclists.org/lists/linux-kernel/2003/Dec/0955.html > Jeff Rob ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-03 5:41 ` Rob Landley
@ 2005-11-04 3:26 ` Blaisorblade
2005-11-04 15:50 ` Rob Landley
0 siblings, 1 reply; 241+ messages in thread
From: Blaisorblade @ 2005-11-04 3:26 UTC (permalink / raw)
To: user-mode-linux-devel
Cc: Rob Landley, Jeff Dike, Nick Piggin, Yasunori Goto, Dave Hansen,
Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

On Thursday 03 November 2005 06:41, Rob Landley wrote:
> On Wednesday 02 November 2005 23:26, Jeff Dike wrote:
> > On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote:
> > > With fragmentation reduction and prezeroing, UML suddenly gains the
> > > option of calling madvise(DONT_NEED) on sufficiently large blocks as A)
> > > a fast way of prezeroing, B) a way of giving memory back to the host OS
> > > when it's not in use.

> > DONT_NEED is insufficient. It doesn't discard the data in dirty
> > file-backed pages.

> I thought DONT_NEED would discard the page cache, and punch was only needed
> to free up the disk space.

This is correct, but...

> I was hoping that since the file was deleted from disk and is already
> getting _some_ special treatment (since it's a longstanding "poor man's
> shared memory" hack), that madvise wouldn't flush the data to disk, but
> would just zero it out. A bit optimistic on my part, I know. :)

I read some time ago that this optimization existed but was deemed obsolete
and removed.

Why obsolete? Because... we have tmpfs! And that's the point. With DONTNEED,
we detach references from page tables, but the content is still pinned: it
_is_ the "disk"! (And you have TMPDIR on tmpfs, right?)

> > Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE))
> > which does do the trick, and I have a UML patch which adds memory
> > hotplug. This combination does free memory back to the host.

> I saw it wander by, and am all for it. If it goes in, it's obviously the
> right thing to use.

Btw, on this side of the picture, I think fragmentation avoidance is not
needed for that. I guess you refer to using frag. avoidance on the guest (if
it matters for the host, let me know). When it is present, using it will be
nice, but currently we'd do madvise() on a page-per-page basis, and we'd do
it on non-consecutive pages (basically, free pages we either find or free on
purpose).

> You may remember I asked about this two years ago:
> http://seclists.org/lists/linux-kernel/2003/Dec/0919.html
> And a reply indicated that SVr4 had it, but we don't. I assume the "naming
> discussion" mentioned in the recent thread already scrubbed through this
> old thread to determine that the SVr4 API was icky.
> http://seclists.org/lists/linux-kernel/2003/Dec/0955.html

I assume not everybody did (even if somebody pointed out the existence of
the SVr4 API), but there was the need, in at least one usage, for a virtual
address-based API rather than a file offset based one, like the SVr4 one -
that user would need to implement backward mapping in userspace only for
this purpose, while we already have it in the kernel.

Anyway, the sys_punch() API will follow later - customers need mainly
madvise() for now.

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 3:26 ` [uml-devel] " Blaisorblade
@ 2005-11-04 15:50 ` Rob Landley
2005-11-04 17:18 ` Blaisorblade
0 siblings, 1 reply; 241+ messages in thread
From: Rob Landley @ 2005-11-04 15:50 UTC (permalink / raw)
To: user-mode-linux-devel
Cc: Blaisorblade, Jeff Dike, Nick Piggin, Yasunori Goto, Dave Hansen,
Ingo Molnar, Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

On Thursday 03 November 2005 21:26, Blaisorblade wrote:
> > I was hoping that since the file was deleted from disk and is already
> > getting _some_ special treatment (since it's a longstanding "poor man's
> > shared memory" hack), that madvise wouldn't flush the data to disk, but
> > would just zero it out. A bit optimistic on my part, I know. :)
>
> I read some time ago that this optimization existed but was deemed obsolete
> and removed.
>
> Why obsolete? Because... we have tmpfs! And that's the point. With
> DONTNEED, we detach references from page tables, but the content is still
> pinned: it _is_ the "disk"! (And you have TMPDIR on tmpfs, right?)

If I had that kind of control over the environment my build would always be
deployed in (including root access), I wouldn't need UML. :)

(P.S. The default for Ubuntu "Hoary Hedgehog" is no. The only tmpfs mount is
/dev/shm, and /tmp is on / which is ext3. Yeah, I need to upgrade my
laptop...)

> I guess you refer to using frag. avoidance on the guest

Yes. Moot point since Linus doesn't want it.

> (if it matters for
> the host, let me know). When it is present, using it will be nice, but
> currently we'd do madvise() on a page-per-page basis, and we'd do it on
> non-consecutive pages (basically, free pages we either find or free on
> purpose).

Might be a performance issue if that gets introduced with per-page
granularity, and how do you avoid giving back pages we're about to re-use?
Oh well, bench it when it happens. (And in any case, it needs a tunable to
beat the page cache into submission or there's no free memory to give back.
If there's already such a tunable, I haven't found it yet.)

Rob
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:50 ` Rob Landley
@ 2005-11-04 17:18 ` Blaisorblade
2005-11-04 17:44 ` Rob Landley
0 siblings, 1 reply; 241+ messages in thread
From: Blaisorblade @ 2005-11-04 17:18 UTC (permalink / raw)
To: Rob Landley
Cc: user-mode-linux-devel, Jeff Dike, Nick Piggin, Yasunori Goto,
Dave Hansen, linux-mm, Linux Kernel Mailing List, lhms

(Note - I've removed a few CC's since there are too many of us, sorry for
any inconvenience).

On Friday 04 November 2005 16:50, Rob Landley wrote:
> On Thursday 03 November 2005 21:26, Blaisorblade wrote:
> > > I was hoping that since the file was deleted from disk and is already
> > > getting _some_ special treatment (since it's a longstanding "poor man's
> > > shared memory" hack), that madvise wouldn't flush the data to disk, but
> > > would just zero it out. A bit optimistic on my part, I know. :)
> >
> > I read some time ago that this optimization existed but was deemed
> > obsolete and removed.
> >
> > Why obsolete? Because... we have tmpfs! And that's the point. With
> > DONTNEED, we detach references from page tables, but the content is still
> > pinned: it _is_ the "disk"! (And you have TMPDIR on tmpfs, right?)
>
> If I had that kind of control over the environment my build would always be
> deployed in (including root access), I wouldn't need UML. :)

Yep, right for your case... however currently the majority of users use
tmpfs (I hope for them)...

> > I guess you refer to using frag. avoidance on the guest
>
> Yes. Moot point since Linus doesn't want it.

See lwn.net last issue (when it becomes available) on this issue. In short,
however, the real point is that we need this kind of support.

> Might be a performance issue if that gets introduced with per-page
> granularity,

I'm aware of this possibility, and I've said in fact "Frag. avoidance will
be nice to use". However I'm not sure that the system call overhead is so
big, compared to flushing the TLB entries...

But for now we don't have the issue - you don't do hotunplug frequently.
When somebody writes the auto-hotunplug management daemon we could have a
problem on this...

> and how do you avoid giving back pages we're about to re-use?

Jeff's trick is to call the buddy allocator (__get_free_pages()) to get a
full page (and it will do any needed work to free memory), so nobody else
will use it, and then madvise() it. If a better API exists, that will be
used.

> Oh well, bench it when it happens. (And in any case, it needs a tunable to
> beat the page cache into submission or there's no free memory to give back.

I couldn't parse your sentence. The allocation will free memory like when
memory is needed. However look at /proc/sys/vm/swappiness or use Con
Kolivas's patches to find new tunables and policies.

> If there's already such a tunable, I haven't found it yet.)

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 241+ messages in thread
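A guest-kernel-side sketch of the trick Blaisorblade describes: grab pages from the buddy allocator so nothing else can use them, then ask the host to discard their backing. os_discard_memory() is a hypothetical stand-in for UML's host-side madvise() call, and the missing hot-add bookkeeping is noted in a comment; this is not Jeff's actual patch.

#include <linux/gfp.h>
#include <linux/errno.h>

/* Hypothetical UML host glue: madvise() the backing of this range away. */
extern void os_discard_memory(void *addr, unsigned long len);

static int give_back_pages(unsigned int order)
{
	/* The allocator does any reclaim needed to produce the block,
	 * and once it returns, nobody else can touch these pages. */
	unsigned long addr = __get_free_pages(GFP_KERNEL, order);

	if (!addr)
		return -ENOMEM;

	/* Tell the host it may reclaim the physical memory behind the
	 * block (madvise() on the mmapped backing file). */
	os_discard_memory((void *)addr, PAGE_SIZE << order);

	/* A real implementation keeps these pages allocated until a
	 * matching hot-add; freeing them here would let the guest dirty
	 * them again immediately. */
	return 0;
}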
* Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:18 ` Blaisorblade
@ 2005-11-04 17:44 ` Rob Landley
0 siblings, 0 replies; 241+ messages in thread
From: Rob Landley @ 2005-11-04 17:44 UTC (permalink / raw)
To: Blaisorblade
Cc: user-mode-linux-devel, Jeff Dike, Nick Piggin, Yasunori Goto,
Dave Hansen, linux-mm, Linux Kernel Mailing List, lhms

On Friday 04 November 2005 11:18, Blaisorblade wrote:
> > Oh well, bench it when it happens. (And in any case, it needs a tunable
> > to beat the page cache into submission or there's no free memory to give
> > back.
>
> I couldn't parse your sentence. The allocation will free memory like when
> memory is needed.

If you've got a daemon running in the virtual system to hand back memory to
the host, then you don't need a tunable.

What I was thinking is that if we get prezeroing infrastructure that can use
various prezeroing accelerators (as has been discussed but I don't believe
merged), then a logical prezeroing accelerator for UML would be calling
madvise on the host system. This has the advantage of automatically giving
back to the host system any memory that's not in use, but would require some
way to tell kswapd or some such that keeping around lots of prezeroed memory
is preferable to keeping around lots of page cache.

In my case, I have a workload that can mostly work with 32-48 megs of ram,
but it spikes up to 256 at one point. Right now, I'm telling UML mem=64 megs
and then feeding it a 256 meg swap file on ubd, but this is hideously
inefficient when it actually tries to use this swap file. (And since the
host system is running a 2.6.10 kernel, there's a five minute period during
each build where things on my desktop actually freeze for 15-30 seconds at a
time. And this is on a laptop with 512 megs of ram. I think it's because the
disk is so overwhelmed, and some things (like vim's .swp file, and something
similar in kmail's composer) do a gratuitous fsync...)

> However look at /proc/sys/vm/swappiness

Setting swappiness to 0 triggers the OOM killer on 2.6.14 for a load that
completes with swappiness at 60. I mentioned this on the list a little while
ago and some people asked for copies of my test script...

> or use Con Kolivas's patches to find new tunables and policies.

The daemon you mentioned is an alternative, but I'm not quite sure how rapid
the daemon's reaction is going to be to potential OOM situations when
something suddenly wants an extra 200 megs...

Rob
^ permalink raw reply [flat|nested] 241+ messages in thread
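A tiny demonstration of the "madvise as prezeroing" idea Rob floats above: on an anonymous private mapping, MADV_DONTNEED discards the pages, and the next touch faults in fresh zero pages from the host kernel. This is standard madvise behaviour, shown here as a self-contained sketch rather than anything from a UML patch; error handling is trimmed.

#include <assert.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 1UL << 20;
	char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
	                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(mem, 0xff, size);           /* dirty the pages             */
	madvise(mem, size, MADV_DONTNEED); /* give them back to the host  */

	/* The range refaults as zero-filled pages: the host has, in
	 * effect, prezeroed it for us. */
	assert(mem[0] == 0 && mem[size - 1] == 0);

	munmap(mem, size);
	return 0;
}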
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-02 0:51 ` Nick Piggin
2005-11-02 7:42 ` Dave Hansen
@ 2005-11-02 12:38 ` Mel Gorman
2005-11-03 3:14 ` Nick Piggin
1 sibling, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-02 12:38 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

On Wed, 2 Nov 2005, Nick Piggin wrote:

> Dave Hansen wrote:
>
> > What the fragmentation patches _can_ give us is the ability to have 100%
> > success in removing certain areas: the "user-reclaimable" areas
> > referenced in the patch. This gives a customer at least the ability to
> > plan for how dynamically reconfigurable a system should be.
> >
> But the "user-reclaimable" areas can still be taken over by other
> areas which become fragmented.
>

This is true, we have worst case scenarios. With our patches though, our
assertion is that it takes a lot longer to degrade and in good scenarios
like where the workload is not using all of physical memory, we don't
degrade at all. Assuming we get page migration or active defragmentation
in the future, it will be a lot longer before they have to do any work.
As we only fragment when there is nothing else to do, page migration will
also have less work to do.

> That's like saying we can already guarantee 100% success in removing
> areas that are unfragmented and free, or freeable.
>
> > After these patches, the next logical steps are to increase the
> > knowledge that the slabs have about fragmentation, and to teach some of
> > the shrinkers about fragmentation.
> >
> I don't like all this work and complexity and overheads going into a
> partial solution.
>
> Look: if you have to guarantee memory can be shrunk, set aside a zone
> for it (that only fills with user reclaimable areas). This is better
> than the current frag patches because it will give you the 100%
> guarantee that you need (provided we have page migration to move mlocked
> pages).
>
> If you don't need a guarantee, then our current, simple system does the
> job perfectly.
>

Ok. To me, the rest of the thread are beating around the same points and
no one is giving ground. The points are made so lets summarise. Apologies
if anything is missing.

Problem
=======
Memory gets fragmented meaning that contiguous blocks of memory are not
free and not freeable no matter how much kswapd works

Impact
======
A number of different users are hit, in different ways

Physical Hotplug remove: Hotplug remove needs to be able to free a large
region of memory that is then unplugged. Different architectures have
different ways of doing this

Virtualization hotplug remove: The requirements are lighter here.
Contiguous regions from 1MiB to 64MiB (figure taken from thread) must
be freed to move the memory between virtual machines

High order allocations: With fragmentation, high order allocations fail.
Depending on the workload, kswapd could work forever and not free up a
4MiB chunk

Who cares
=========
Physical hotplug remove: Vendors of the hardware that support this -
Fujitsu, HP (I think), IBM etc

Virtualization hotplug remove: Sellers of virtualization software, some
hardware like any IBM machine that lists LPAR in its list of
features. Probably software solutions like Xen are also affected
if they want to be able to grow and shrink the virtual machines on
demand

High order allocations: Ultimately, hugepage users. Today, that is a
feature only big server users like Oracle care about.
In the future I reckon applications will be able to use them for things
like backing the heap by huge pages. Other users like GigE,
loopback devices with large MTUs, some filesystems like CIFS are
all interested although they have also been told to use smaller
pages.

Solutions
=========
Anti-defrag: This solution defines three groups of pages: KernNoRclm,
KernRclm and EasyRclm. Small sub-zone regions of size 2^(MAX_ORDER-1)
are reserved for each allocation type. If there are no large blocks
available and no reserved pages available, it falls back and begins to
fragment. This tries to delay fragmentation for as long as possible.

New Zone: Add a new zone for easyrclm only allocations. This means that
all kernel pages go in one place and all easyrclm go in another. This
solution would allow us to reclaim contiguous blocks of memory
(Note: This is basically what Solaris Kernel Cages are)

Note that I am leaving out Growing/Shrinking zone code for the moment.
While zones are currently able to get new pages with something like
memory hotadd, there is no mechanism available to move existing pages
from one zone into another. This will need planning and code. Code exists
for page migration so we can reasonably speculate about what it brings to
the table for both anti-defrag and New Zone approaches.

Pros/Cons of Solutions
======================

Anti-defrag Pros
o Aim9 shows no significant regressions (.37% on page_test). On some
tests, it shows performance gains (> 5% on fork_test)
o Stress tests show that it manages to keep fragmentation down to a far
lower level even without teaching kswapd how to linear reclaim
o Stress tests with a linear reclaim experimental patch show that it
can successfully find large contiguous chunks of memory
o It is known to help hotplug on PPC64
o No tunables. The approach tries to manage itself as much as possible
o It exists, heavily tested, and synced against the latest -mm1
o Can be compiled away by redefining the RCLM_* macros and the
__GFP_*RCLM flags

Anti-defrag Cons
o More complexity within the page allocator
o Adds a new layer onto the allocator that effectively creates subzones
o Adding a new concept that maintainers have to work with
o Depending on the workload, it fragments anyway

New Zone Pros
o Zones are a well known and understood concept
o For people that do not care about hotplug, they can easily get rid of it
o Provides reliable areas of contiguous groups that can be freed for
HugeTLB pages going to userspace
o Uses existing zone infrastructure for balancing

New Zone Cons
o Zones historically have introduced balancing problems
o Been tried for hotplug and dropped because of being awkward to work with
o It only helps hotplug and potentially HugeTLB pages for userspace
o Tunable required. If you get it wrong, the system suffers a lot
o Needs to be planned for and developed

Scenarios
=========

Let's outline some situations or workloads that can occur

1. Heavy job running that consumes 75% of physical memory. Like a kernel
build

Anti-defrag: It will not fragment as it will never have to fall back.
High order allocations will be possible in the remaining 25%.
Zone-based: After being tuned to a kernel build load, it will not
fragment. Get the tuning wrong, performance suffers or workload
fails. High order allocations will be possible in the remaining 25%.

Future work for scenario 1
Anti-defrag: No problem.
Zone-based: Tune some more if problems occur.

2. Heavy job running that needs 110% of physical memory, swap is used.
Example would be too many simultaneous kernel builds

Anti-defrag: UserRclm regions are stolen to prevent too many fallbacks.
KernNoRclm starts stealing UserRclm regions to avoid excessive
fragmentation but some fragmentation occurs. Extent depends on
the duration and heaviness of the load. High order allocations
will work if kswapd runs for long enough as it will reclaim the
UserRclm reserved areas. Your chances depend on the intensity of
KernNoRclm allocations
Zone-based: After being tuned to the new kernel build load, it will not
fragment. Get it wrong and performance suffers. High order
allocations will work if you're lucky enough to have enough
reclaimable pages together. Your chances are not good

Future work for scenario 2
Anti-defrag: kswapd would need to know how to reclaim EasyRclm pages
from the KernNoRclm, KernRclm and Fallback areas.
Zone-based: Keep tuning

3. HighMem intensive workload with CONFIG_HIGHPTE set. Example would be
a scientific job that was performing a very large calculation on an
anonymous region of memory. Possible that some desktop workloads are
like this - i.e. use large amounts of anonymous memory

Anti-defrag: For ZONE_HIGHMEM, PTEs are grouped into one area, everything
else into another, no fragmentation. HugeTLB allocations in
ZONE_HIGHMEM will work if kswapd works long enough
Zone-based: PTEs go to anywhere in ZONE_HIGHMEM. Easy-reclaimed pages go
to ZONE_HIGHMEM and ZONE_HOTREMOVABLE. ZONE_HIGHMEM fragments,
ZONE_HOTREMOVABLE does not. HugeTLB pages will be available in
ZONE_HOTREMOVABLE, but probably not in ZONE_HIGHMEM.

Future work for scenario 3
Anti-defrag: No problem. On-demand HugeTLB allocation for userspace is
possible. Would work better with linear page reclaim.
Zone-based: Depends if we care that ZONE_HIGHMEM gets fragmented. We
would only care if trying to allocate HugeTLB pages on demand
from ZONE_HIGHMEM. ZONE_HOTREMOVABLE, depending on its size,
would be possible. Linear reclaim will help ZONE_HOTREMOVABLE,
but not ZONE_HIGHMEM

4. KBuild. The main concern here is performance

Anti-defrag: May cause problems because of the .37% drop on page_test.
May cause improvements because of the 5% increase on fork_test.
No figures on kbuild available
Zone-based: No figures available. Depends heavily on being configured
correctly

Future work for scenario 4
Anti-defrag: Try and optimise the paths affected. Alternatively make
anti-defrag a configurable option by altering the values of
RCLM_* and __GFP_*RCLM. (Note, would people be interested in a
compile-time option for anti-defrag or would it make the
complexity worse for people?)
Zone-based: Tune for performance or compile away the zone

5. Physically unplug 25% of physical memory

Anti-defrag: Memory in the region gets reclaimed if it's EasyRclm.
Possibly will encounter awkward pages. Known that PPC64 has some
success. Fujitsu use nodes for hotplug; they would need to adjust
the fallbacks to be fully reliable
Zone-based: If we are unplugging the right zone, reclaim the pages.
Possibly will encounter awkward pages (only mlock in this case)

Future work for scenario 5
Anti-defrag: fallback_allocs would be needed for each node for Fujitsu's
approach to be in any way reliable. Ability to move awkward pages
around. For 100% success, ability to move kernel pages
Zone-based: Ability to move awkward pages around. There is no 100%
success scenario here. You remove the ZONE_HOTREMOVABLE area or
you turn the machine off.

6.
Fsck a large filesystem (known to be a KernNoRclm heavy workload)

Anti-defrag: Memory fragments, but fsck is a short-lived kernel heavy
workload. It is also known that free blocks reappear through the
address space when it finishes. Contiguous blocks may appear in
the middle of the zone rather than either end.
Zone-based: If misconfigured, performance degrades. As a machine will
not be tuned for fsck, chances of degrading are pretty high. On
the other hand, fsck is something people can wait for

Future work for scenario 6
Anti-defrag: Ideally, in case of fallbacks, page migration would move
awkward pages out of UserRclm areas
Zone-based: Keep tuning if you run into problems

Let's say we agree on a way that ZONE_HOTREMOVABLE can be shrunk in such
a way as to give pages to ZONE_NORMAL and ZONE_HIGHMEM as necessary (and
we have to be able to handle both), Situations 2 and 6 change. Note that
this changing of zone sizes brings all the problems from the anti-defrag
approach to the zone-based approach.

2a. Heavy job running that needs 110% of physical memory, swap is used.

Anti-defrag: UserRclm regions are stolen to prevent too many fallbacks.
KernNoRclm starts stealing UserRclm regions to avoid excessive
fragmentation but some fragmentation occurs. Extent depends on
the duration and heaviness of the load.
Zone-based: ZONE_NORMAL grows into ZONE_HOTREMOVABLE. The zone cannot be
shrunk so ZONE_NORMAL fragments as normal.

Future work for scenario 2a
Anti-defrag: kswapd would need to know how to clean EasyRclm pages from
the KernNoRclm, KernRclm and Fallback reserved areas. When load
drops off, regions will get reserved again for EasyRclm.
Contiguous blocks will become available whenever possible, be it
at the beginning, middle or end of the zone. Page migration would
help fix up single kernel pages left in EasyRclm areas.
Zone-based: Page migration would need to move pages from the end of the
zone so it could be shrunk.

6a. Fsck

Anti-defrag: Memory fragments, but fsck is a short-lived kernel heavy
workload. It is also known that free blocks reappear through the
address space when it finishes. Once the free blocks appear, they
get reserved for the different allocation types on demand and
business continues as usual
Zone-based: ZONE_NORMAL grows into ZONE_HOTREMOVABLE. No mechanism to
shrink it so it doesn't recover

Future work for scenario 6a
Anti-defrag: Same as for Situation 2. kswapd would need to know how to
clean UserRclm pages from the KernNoRclm, KernRclm and Fallback
reserved areas.
Zone-based: Same as for 2a. Page migration would need to move pages from
the end of the zone so it could be shrunk

I've tried to be as objective as possible with the summary.

From the points above though, I think that anti-defrag gets us a lot of
the way, with the complexity isolated in one place. Its downside is that
it can still break down and future work is needed to stop it degrading
(kswapd cleaning UserRclm areas and page migration when we get really
stuck). Zone-based is more reliable but only addresses a limited
situation, principally hotplug and it does not even go 100% of the way for
hotplug. It also depends on a tunable which is not cool and it is static.
If we make the zones growable+shrinkable, we run into all the same
problems that anti-defrag has today.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 241+ messages in thread
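A minimal sketch of how callers' GFP flags could map to the three page groups the summary names, using the flag and type names that appear in the thread (__GFP_EASYRCLM, __GFP_KERNRCLM, RCLM_*). The bit values and the helper itself are illustrative assumptions, not code lifted from the patches; in the described design the returned type then selects which reserved 2^(MAX_ORDER-1) region the allocation is placed in, falling back only when the type's own regions are exhausted.

#define __GFP_EASYRCLM  0x80000u    /* illustrative bit values */
#define __GFP_KERNRCLM  0x100000u

enum rclm_type { RCLM_NORCLM, RCLM_KERN, RCLM_EASY };

static inline enum rclm_type gfp_to_rclm(unsigned int gfp_mask)
{
	if (gfp_mask & __GFP_EASYRCLM)
		return RCLM_EASY;    /* user pages: easily reclaimed or migrated */
	if (gfp_mask & __GFP_KERNRCLM)
		return RCLM_KERN;    /* reclaimable kernel allocations, e.g. caches */
	return RCLM_NORCLM;          /* everything else: effectively pinned */
}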
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman
@ 2005-11-03 3:14 ` Nick Piggin
2005-11-03 12:19 ` Mel Gorman
2005-11-03 15:34 ` Martin J. Bligh
0 siblings, 2 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-03 3:14 UTC (permalink / raw)
To: Mel Gorman
Cc: Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton, kravetz,
linux-mm, Linux Kernel Mailing List, lhms

Mel Gorman wrote:

>
> Ok. To me, the rest of the thread are beating around the same points and
> no one is giving ground. The points are made so lets summarise. Apologies
> if anything is missing.
>

Thanks for attempting a summary of a difficult topic. I have a couple
of suggestions.

> Who cares
> =========
> Physical hotplug remove: Vendors of the hardware that support this -
> Fujitsu, HP (I think), IBM etc
>
> Virtualization hotplug remove: Sellers of virtualization software, some
> hardware like any IBM machine that lists LPAR in its list of
> features. Probably software solutions like Xen are also affected
> if they want to be able to grow and shrink the virtual machines on
> demand
>

Ingo said that Xen is fine with per page granular freeing - this covers
embedded, desktop and small server users of VMs into the future I'd say.

> High order allocations: Ultimately, hugepage users. Today, that is a
> feature only big server users like Oracle care about. In the
> future I reckon applications will be able to use them for things
> like backing the heap by huge pages. Other users like GigE,
> loopback devices with large MTUs, some filesystems like CIFS are
> all interested although they have also been told to use smaller
> pages.
>

I think that saying it's now OK to use higher order allocations is wrong
because as I said even with your patches they are going to run into
problems.

Actually I think one reason your patches may perform so well is because
there aren't actually a lot of higher order allocations in the kernel.

I think that probably leaves us realistically with demand hugepages,
hot unplug memory, and IBM lpars?

> Pros/Cons of Solutions
> ======================
>
> Anti-defrag Pros
> o Aim9 shows no significant regressions (.37% on page_test). On some
> tests, it shows performance gains (> 5% on fork_test)
> o Stress tests show that it manages to keep fragmentation down to a far
> lower level even without teaching kswapd how to linear reclaim

This sounds like a kind of funny test to me if nobody is actually
using higher order allocations.

When a higher order allocation is attempted, either you will satisfy
it from the kernel region, in which case the vanilla kernel would
have done the same. Or you satisfy it from an easy-reclaim contiguous
region, in which case it is no longer an easy-reclaim contiguous
region.

> o Stress tests with a linear reclaim experimental patch show that it
> can successfully find large contiguous chunks of memory
> o It is known to help hotplug on PPC64
> o No tunables.
> The approach tries to manage itself as much as possible

But it has more dreaded heuristics :P

> o It exists, heavily tested, and synced against the latest -mm1
> o Can be compiled away by redefining the RCLM_* macros and the
> __GFP_*RCLM flags
>
> Anti-defrag Cons
> o More complexity within the page allocator
> o Adds a new layer onto the allocator that effectively creates subzones
> o Adding a new concept that maintainers have to work with
> o Depending on the workload, it fragments anyway
>
> New Zone Pros
> o Zones are a well known and understood concept
> o For people that do not care about hotplug, they can easily get rid of it
> o Provides reliable areas of contiguous groups that can be freed for
> HugeTLB pages going to userspace
> o Uses existing zone infrastructure for balancing
>
> New Zone Cons
> o Zones historically have introduced balancing problems
> o Been tried for hotplug and dropped because of being awkward to work with
> o It only helps hotplug and potentially HugeTLB pages for userspace
> o Tunable required. If you get it wrong, the system suffers a lot

Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let them
get it right.

> o Needs to be planned for and developed
>

Yasunori Goto had patches around from last year. Not sure what sort of
shape they're in now but I'd think most of the hard work is done.

> Scenarios
> =========
>
> Let's outline some situations or workloads that can occur
>
> 1. Heavy job running that consumes 75% of physical memory. Like a kernel
> build
>
> Anti-defrag: It will not fragment as it will never have to fall back.
> High order allocations will be possible in the remaining 25%.
> Zone-based: After being tuned to a kernel build load, it will not
> fragment. Get the tuning wrong, performance suffers or workload
> fails. High order allocations will be possible in the remaining 25%.
>

You don't need to continually tune things for each and every possible
workload under the sun. It is like how we currently drive 16GB highmem
systems quite nicely under most workloads with 1GB of normal memory.

Make that an 8:1 ratio if you're worried.

[snip]

>
> I've tried to be as objective as possible with the summary.
>
> From the points above though, I think that anti-defrag gets us a lot of
> the way, with the complexity isolated in one place. Its downside is that
> it can still break down and future work is needed to stop it degrading
> (kswapd cleaning UserRclm areas and page migration when we get really
> stuck). Zone-based is more reliable but only addresses a limited
> situation, principally hotplug and it does not even go 100% of the way for
> hotplug.

To me it seems like it solves the hotplug, lpar hotplug, and hugepages
problems which seem to be the main ones.

> It also depends on a tunable which is not cool and it is static.

I think it is very cool because it means the tiny minority of Linux users
who want this can do so without impacting the rest of the code or users.

This is how Linux has been traditionally run and I still have a tiny bit
of faith left :)

> If we make the zones growable+shrinkable, we run into all the same
> problems that anti-defrag has today.
>

But we don't have the extra zones layer that anti-defrag has today.

And anti-defrag needs limits if it is to be reliable anyway.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-03 3:14 ` Nick Piggin
@ 2005-11-03 12:19 ` Mel Gorman
2005-11-10 18:47 ` Steve Lord
1 sibling, 1 reply; 241+ messages in thread
From: Mel Gorman @ 2005-11-03 12:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

On Thu, 3 Nov 2005, Nick Piggin wrote:

> Mel Gorman wrote:
> >
> > Ok. To me, the rest of the thread is beating around the same points and no one is giving ground. The points are made, so let's summarise. Apologies if anything is missing.
> >
>
> Thanks for attempting a summary of a difficult topic. I have a couple of suggestions.
>
> > Who cares
> > =========
> > Physical hotplug remove: Vendors of the hardware that support this - Fujitsu, HP (I think), IBM etc
> >
> > Virtualization hotplug remove: Sellers of virtualization software, some hardware like any IBM machine that lists LPAR in its list of features. Probably software solutions like Xen are also affected if they want to be able to grow and shrink the virtual machines on demand
> >
>
> Ingo said that Xen is fine with per-page granular freeing - this covers embedded, desktop and small server users of VMs into the future, I'd say.
>

Ok, hard to argue with that.

> > High order allocations: Ultimately, hugepage users. Today, that is a feature only big server users like Oracle care about. In the future I reckon applications will be able to use them for things like backing the heap by huge pages. Other users like GigE, loopback devices with large MTUs, some filesystems like CIFS are all interested, although they are also being told to use smaller pages.
> >
>
> I think that saying it's now OK to use higher order allocations is wrong because, as I said, even with your patches they are going to run into problems.
>

Ok, I have not denied that they will run into problems. I have asserted that, with more work built upon these patches, we can grant large pages with a good degree of reliability. Subsystems should still use small orders whenever possible and, at the very least, large orders should be short-lived.

For userspace users, I would like to move towards better availability of huge pages without requiring the boot-time tunables that are required today. Do we agree that this would be useful at least for a few different users?

HugeTLB user 1: Today's users of hugetlbfs like big databases etc
HugeTLB user 2: HPC jobs that run with sparse data sets
HugeTLB user 3: Desktop applications that use large amounts of address space.

I got a mail from a user of category 2. He said I can quote his email, but he didn't say I could quote his name, which is inconvenient, but I'm sure he has good reasons.

To him, low fragmentation is "critical, at least in HPC environments". Here is the core of his issue:

--- excerpt ---
Take the scenario that you have a large machine that is used by multiple users, and the usage is regulated by a batch scheduler. Loadleveler on ibm's for example. PBS on many others. Both appear to be available in linux environments.

In the case of my codes, I find that having large pages is extremely beneficial to my run times. As in factors of several, modulo things that I've coded in by hand to try and avoid the issues. I don't think my code is in any way unusual in this magnitude of improvement.
--- excerpt ---

Ok, so we have two potential solutions, anti-defrag and zones. We don't need to rehash the pros and cons. With zones, we just say "just reclaim the easy reclaim zone, alloc your pages and away we go".

Now, his problem is that the server is not restarted between jobs, and jobs take days and weeks to complete. The system administrators will not restart the machine, so getting it to a pristine state is a difficulty. The state he gets the system in is the state he works with, and with fragmentation, he doesn't get large pages unless he is lucky enough to be the first user of the machine.

With the zone approach, we would just be saying "tune it". Here is what he says about that:

--- excerpt ---
I specifically *don't* want things that I have to beg sysadmins to tune correctly. They won't get it right because there is no `right' that is right for everyone. They won't want to change it and it won't work besides. Been there, done that. My experience is that with linux so far, and some other non-linux machines too, they always turn all the page stuff off because it breaks the machine.
--- excerpt ---

This is an example of a real user for whom "tune the size of your zone correctly" is just not good enough. He makes a novel suggestion on how anti-defrag + hotplug could be used.

--- excerpt ---
In the context of hotplug stuff and fragmentation avoidance, this sort of reset would be implemented by performing the first step in the hot unplug, to migrate everything off of that memory, including whatever kernel pages that exist there, but not the second step. Just leave that memory plugged in and reset the memory to a sane initial state. Essentially this would be some sort of pseudo hotunplug followed by a pseudo hotplug of that memory.
--- excerpt ---

I'm pretty sure this is not what hotplug was aimed at, but it would get him what he wants: large pages via echo BigNumber > nr_hugepages at the least. It also needs hotplug remove to be working for some banks and regions of memory, although not the 100% case.

Ok, this is one example of a user of scientific workloads for whom "tune the size of the zone" just is not good enough. The admins won't do it for him because it'll just break for the next scheduled job.

> Actually, I think one reason your patches may perform so well is that there aren't actually a lot of higher order allocations in the kernel.
>
> I think that probably leaves us realistically with demand hugepages, hot unplug memory, and IBM lpars?
>

> > Pros/Cons of Solutions
> > ======================
> >
> > Anti-defrag Pros
> > o Aim9 shows no significant regressions (.37% on page_test). On some tests, it shows performance gains (> 5% on fork_test)
> > o Stress tests show that it manages to keep fragmentation down to a far lower level even without teaching kswapd how to linear reclaim
>
> This sounds like a kind of funny test to me if nobody is actually using higher order allocations.
>

No one uses them because they always fail. This is a chicken and egg problem.

> When a higher order allocation is attempted, either you will satisfy it from the kernel region, in which case the vanilla kernel would have done the same. Or you satisfy it from an easy-reclaim contiguous region, in which case it is no longer an easy-reclaim contiguous region.
>

Right, but right now, we say "don't use high order allocations ever". With work, we'll be saying "ok, use high order allocations, but they should be short-lived or you won't be allocating them for long".

> > o Stress tests with a linear reclaim experimental patch show that it can successfully find large contiguous chunks of memory
> > o It is known to help hotplug on PPC64
> > o No tunables. The approach tries to manage itself as much as possible
>
> But it has more dreaded heuristics :P
>

Yeah, but if it gets them wrong, the system chugs along anyway, just fragmented like it is today. If the zone-based approach gets it wrong, the system goes down the tubes.

At very worst, the patches give a kernel allocator that is as good as today's. At very worst, the zone-based approach makes an unusable system. The performance of the patches is another story. I've been posting aim9 figures based on my test machine. I'm trying to kick an ancient PowerPC 43P Model 150 machine into working order. This machine is a different architecture and ancient (I found it on the way to a skip), so it should give different figures.

> > o It exists, is heavily tested, and synced against the latest -mm1
> > o Can be compiled away by redefining the RCLM_* macros and the __GFP_*RCLM flags
> >
> > Anti-defrag Cons
> > o More complexity within the page allocator
> > o Adds a new layer onto the allocator that effectively creates subzones
> > o Adds a new concept that maintainers have to work with
> > o Depending on the workload, it fragments anyway
> >
> > New Zone Pros
> > o Zones are a well known and understood concept
> > o People that do not care about hotplug can easily get rid of it
> > o Provides reliable areas of contiguous groups that can be freed for HugeTLB pages going to userspace
> > o Uses existing zone infrastructure for balancing
> >
> > New Zone Cons
> > o Zones historically have introduced balancing problems
> > o Been tried for hotplug and dropped because it was awkward to work with
> > o It only helps hotplug and potentially HugeTLB pages for userspace
> > o Tunable required. If you get it wrong, the system suffers a lot
>
> Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let them get it right.
>

Unless you work in a place where the sysadmins will tell you to go away, such as the HPC user above. I'm not a sysadmin, but I'm pretty sure they have better things to do than twiddle a tunable all day.

> > o Needs to be planned for and developed
> >
>
> Yasunori Goto had patches around from last year. Not sure what sort of shape they're in now but I'd think most of the hard work is done.
>

But Yasunori (thanks for sending the links) himself said when he posted:

--- excerpt ---
Another one was a bit similar to Mel-san's one. One of the motivations of this patch was to create an orthogonal relationship between Removable and DMA/Normal/Highmem. I thought it was desirable, because ppc64 can treat all of memory as the same (DMA) zone. I thought that a new zone spoiled its good feature.
--- excerpt ---

He thought that the new zone removed the ability of some architectures to treat all memory the same. My patches give some of the benefits of using another zone while still preserving an architecture's ability to treat all memory the same.

> > Scenarios
> > =========
> >
> > Let's outline some situations or workloads that can occur
> >
> > 1. Heavy job running that consumes 75% of physical memory. Like a kernel build
> >
> > Anti-defrag: It will not fragment as it will never have to fall back. High order allocations will be possible in the remaining 25%.
> > Zone-based: After being tuned to a kernel build load, it will not fragment. Get the tuning wrong, and performance suffers or the workload fails. High order allocations will be possible in the remaining 25%.
> >
>
> You don't need to continually tune things for each and every possible workload under the sun. It is like how we currently drive 16GB highmem systems quite nicely under most workloads with 1GB of normal memory. Make that an 8:1 ratio if you're worried.
>
> [snip]
>
> > I've tried to be as objective as possible with the summary.
> >
> > From the points above though, I think that anti-defrag gets us a lot of the way, with the complexity isolated in one place. Its downside is that it can still break down, and future work is needed to stop it degrading (kswapd cleaning UserRclm areas and page migration when we get really stuck). Zone-based is more reliable but only addresses a limited situation, principally hotplug, and it does not even go 100% of the way for hotplug.
>
> To me it seems like it solves the hotplug, lpar hotplug, and hugepages problems, which seem to be the main ones.
>
> > It also depends on a tunable, which is not cool, and it is static.
>
> I think it is very cool because it means the tiny minority of Linux users who want this can do so without impacting the rest of the code or users. This is how Linux has been traditionally run and I still have a tiny bit of faith left :)
>

The impact of the code on users will depend on benchmarks. I've posted benchmarks that show there are either very small regressions or else there are performance gains. As I write this, some of the aim9 benchmarks have completed on the PowerPC.

This is a comparison between 2.6.14-rc5-mm1 and 2.6.14-rc5-mm1-mbuddy-v19-defragDisabledViaConfig

1 creat-clo     73500.00   72504.58    -995.42  -1.35% File Creations and Closes/second
2 page_test     30806.13   31076.49     270.36   0.88% System Allocations & Pages/second
3 brk_test     335299.02  341926.35    6627.33   1.98% System Memory Allocations/second
4 jmp_test    1641733.33 1644566.67    2833.34   0.17% Non-local gotos/second
5 signal_test  100883.19   98900.18   -1983.01  -1.97% Signal Traps/second
6 exec_test       116.53     118.44       1.91   1.64% Program Loads/second
7 fork_test       751.70     746.84      -4.86  -0.65% Task Creations/second
8 link_test     30217.11   30463.82     246.71   0.82% Link/Unlink Pairs/second

Performance gains on page_test, brk_test and exec_test. Even with variances between tests, we are looking at "more or less the same", not regressions. No user impact there.

This is a comparison between 2.6.14-rc5-mm1 and 2.6.14-rc5-mm1-mbuddy-v19-withantidefrag

1 creat-clo     73500.00   71188.14   -2311.86  -3.15% File Creations and Closes/second
2 page_test     30806.13   31060.96     254.83   0.83% System Allocations & Pages/second
3 brk_test     335299.02  344361.15    9062.13   2.70% System Memory Allocations/second
4 jmp_test    1641733.33 1627228.80  -14504.53  -0.88% Non-local gotos/second
5 signal_test  100883.19  100233.33    -649.86  -0.64% Signal Traps/second
6 exec_test       116.53     117.63       1.10   0.94% Program Loads/second
7 fork_test       751.70     763.73      12.03   1.60% Task Creations/second
8 link_test     30217.11   30322.10     104.99   0.35% Link/Unlink Pairs/second

Performance gains on page_test, brk_test, exec_test and fork_test. Not bad going for complex overhead. creat-clo took a beating, but what workload opens and closes files at that rate?

This is an old, small machine. If I hotplug this, I'll be lucky if it ever turns on again. The aim9 benchmarks on two machines show that there is similar and, in some cases, better performance with these patches. If a workload does suffer badly, an additional patch has been supplied that disables anti-defrag. A run in -mm will tell us whether this is the general case for machines or whether my two test boxes are running on magic beans.

So, the small number of users that want this get this. The rest of the users, who just run the code, should not notice or care. This brings us back to the main stickler, code complexity. I think that the code has been very well isolated from the core allocator code, and people looking at the allocator could avoid it if they really wanted while still knowing what the buddy allocator was doing.

> > If we make the zones growable+shrinkable, we run into all the same problems that anti-defrag has today.
> >
>
> But we don't have the extra zones layer that anti defrag has today.
>

So, we just have an extra layer on the side that has to be configured. All of the problems, with all of the configuration.

> And anti defrag needs limits if it is to be reliable anyway.
>

I'm confident, given time, that I can make this manage itself with a very good degree of reliability.

--
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
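The "echo BigNumber > nr_hugepages" step Mel mentions uses the existing /proc/sys/vm/nr_hugepages tunable; a minimal userspace sketch of the same operation (the helper name is invented for illustration):

  #include <stdio.h>

  /* Ask the kernel to grow its static hugepage pool to n pages. */
  int set_nr_hugepages(long n)
  {
          FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

          if (!f)
                  return -1;
          fprintf(f, "%ld\n", n);
          return fclose(f);
  }

Whether the kernel can actually grow the pool to n is exactly the fragmentation question of this thread; the write succeeds even if fewer contiguous pages could be reserved, so a caller has to read the value back to know what it got.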
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-03 12:19 ` Mel Gorman
@ 2005-11-10 18:47 ` Steve Lord
0 siblings, 0 replies; 241+ messages in thread
From: Steve Lord @ 2005-11-10 18:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Nick Piggin, Dave Hansen, Ingo Molnar, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

Flogging a dead horse here maybe, I missed this whole thread when it was live, and someone may already have covered this.

Another reason for avoiding memory fragmentation, which may have been lost in the discussion, is avoiding scatter/gather in I/O. The block layer now has the smarts to join physically contiguous pages into a single scatter/gather element. It always had the smarts to deal with I/O from lots of small chunks of memory and let the hardware do the work of reassembling it. This does not come for free though.

I have come across situations where a raid controller gets cpu bound dealing with I/O from Linux, but not from Windows. The reason being that Windows seems to manage to present the same amount of memory in fewer scatter/gather entries. Because the number of DMA elements is another limiting factor, Windows also managed to submit larger individual requests. Once Linux reaches steady state, it ends up submitting one page per scatter/gather entry.

OK, if you are going via the page cache, then this is not going to mean anything unless the idea of having PAGE_CACHE_SIZE > PAGE_SIZE gets dusted off. However, for direct userspace <-> disk I/O, having the address space of a process be more physically contiguous could help here. Specifically allocated huge pages are another way to achieve this, but that does require special coding in an app to do it.

I'll go back to my day job now ;-)

Steve

Mel Gorman wrote:
> [snip]
^ permalink raw reply [flat|nested] 241+ messages in thread
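The merge decision Steve describes boils down to physical adjacency. A simplified sketch, not the actual block layer code, which also enforces segment-size and DMA boundary limits:

  /*
   * Two pages can share one scatter/gather element only when they
   * are physically adjacent.
   */
  static inline int sg_can_merge(struct page *a, struct page *b)
  {
          return page_to_phys(a) + PAGE_SIZE == page_to_phys(b);
  }

With a fragmented address space this test almost always fails, which is how Linux ends up at one page per scatter/gather entry in steady state.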
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary
2005-11-03 3:14 ` Nick Piggin
2005-11-03 12:19 ` Mel Gorman
@ 2005-11-03 15:34 ` Martin J. Bligh
1 sibling, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-03 15:34 UTC (permalink / raw)
To: Nick Piggin, Mel Gorman
Cc: Dave Hansen, Ingo Molnar, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms

>> Physical hotplug remove: Vendors of the hardware that support this - Fujitsu, HP (I think), IBM etc
>>
>> Virtualization hotplug remove: Sellers of virtualization software, some hardware like any IBM machine that lists LPAR in its list of features. Probably software solutions like Xen are also affected if they want to be able to grow and shrink the virtual machines on demand
>
> Ingo said that Xen is fine with per-page granular freeing - this covers embedded, desktop and small server users of VMs into the future, I'd say.

Not using large page mappings for the kernel area will be a substantial performance hit. It's a less efficient approach inside the hypervisor, and not all VMs / hardware can support it.

>> High order allocations: Ultimately, hugepage users. Today, that is a feature only big server users like Oracle care about. In the future I reckon applications will be able to use them for things like backing the heap by huge pages. Other users like GigE, loopback devices with large MTUs, some filesystems like CIFS are all interested, although they are also being told to use smaller pages.
>
> I think that saying it's now OK to use higher order allocations is wrong because, as I said, even with your patches they are going to run into problems.
>
> Actually, I think one reason your patches may perform so well is that there aren't actually a lot of higher order allocations in the kernel.
>
> I think that probably leaves us realistically with demand hugepages, hot unplug memory, and IBM lpars?

Sigh. You seem obsessed with this. There are various critical places in the kernel that use higher order allocations. Yes, they're normally smaller ones rather than larger ones, but .... please try re-reading the earlier portions of this thread. You are NOT going to be able to get rid of all higher-order allocations - please quit pretending you can - living in denial is not going to help us.

If you really, really believe you can do that, please go ahead and prove it. Until that point, please let go of the "it's only for a few specialized users" argument, and acknowledge we DO actually use higher order allocs in the kernel right now.

>> o Aim9 shows no significant regressions (.37% on page_test). On some tests, it shows performance gains (> 5% on fork_test)
>> o Stress tests show that it manages to keep fragmentation down to a far lower level even without teaching kswapd how to linear reclaim
>
> This sounds like a kind of funny test to me if nobody is actually using higher order allocations.

It's a regression test. To, like, test for regressions in the normal case ;-)

>> New Zone Cons
>> o Zones historically have introduced balancing problems
>> o Been tried for hotplug and dropped because it was awkward to work with
>> o It only helps hotplug and potentially HugeTLB pages for userspace
>> o Tunable required. If you get it wrong, the system suffers a lot
>
> Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let them get it right.

Having met some of them ... that's not a pro ;-) We have quite enough meaningless tunables already. And to be honest, the bigger problem is that it's a problem with no correct answer - workloads shift day vs. night, etc.

> You don't need to continually tune things for each and every possible workload under the sun. It is like how we currently drive 16GB highmem systems quite nicely under most workloads with 1GB of normal memory. Make that an 8:1 ratio if you're worried.

Thanks for turning my 64-bit system back into a 32-bit one. Really appreciate that. Note the last 5 years of endless whining about all the problems with large 32-bit systems, and how they're unfixable and we should all move to 64-bit, please.

> To me it seems like it solves the hotplug, lpar hotplug, and hugepages problems, which seem to be the main ones.

That's because you're not listening; you're going on your own preconceived notions ...

> I think it is very cool because it means the tiny minority of Linux users who want this can do so without impacting the rest of the code or users.

Ditto.

M.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 13:56 ` Ingo Molnar
2005-11-01 14:10 ` Dave Hansen
@ 2005-11-01 14:41 ` Mel Gorman
2005-11-01 14:46 ` Ingo Molnar
` (2 more replies)
2005-11-01 18:23 ` Rob Landley
2 siblings, 3 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-01 14:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel

On Tue, 1 Nov 2005, Ingo Molnar wrote:

> * Mel Gorman <mel@csn.ul.ie> wrote:
>
> > The set of patches do fix a lot and make a strong start at addressing the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation problem', all the time, in a 100% way, now or in the future?
>

Not now, but I expect to reach 100% on demand in the future for all but GFP_ATOMIC and GFP_NOFS allocations. As GFP_ATOMIC and GFP_NOFS cannot do any reclaim work themselves, they will still be required to use smaller orders or private pools that are refilled using GFP_KERNEL if necessary. The high order pages would have to be reclaimed by another process like kswapd, just like what happens for order-0 pages today.

> > So, with this set of patches, how fragmented you get is dependent on the workload and it may still break down and high order allocations will fail. But the current situation is that it will definitely break down. The fact is that it has been reported that memory hotplug remove works with these patches and doesn't without them. Granted, this is just one feature on a high-end machine, but it is one solid operation we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance, hot unplug RAM with these patches applied? If not, do you have any expectation to reach 100%?
>

No, you cannot guarantee hot unplug of RAM with these patches applied. Anecdotal evidence suggests your chances are better on PPC64, which is a start, but we have to start somewhere. The full 100% solution would be a large set of far-reaching patches that would touch a lot of the memory manager. That would get rejected, because the patches should have arrived piecemeal. These patches are one piece. To reach 100%, other mechanisms are also needed, such as;

o Page migration to move unreclaimable pages like mlock()ed pages or kernel pages that have fallen back into easy-reclaim areas. A mechanism would also be needed to move things like kernel text. I think the memory hotplug tree has done a lot of work here
o A mechanism for taking regions of memory offline. Again, I think the memory hotplug crowd have something for this. If they don't, one of them will chime in.
o Linear page reclaim that linearly scans a region of memory and reclaims or moves all the pages in it. I have a proof-of-concept patch that does the linear scan and reclaim, but it's currently ugly and depends on this set of patches being applied.

These patches are the *starting* point that other things like linear page reclaim can be based on.

--
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
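As a rough outline of the linear page reclaim idea in the last bullet: walk a physical range and push every in-use page out. In the sketch below, try_to_reclaim_or_migrate() is a hypothetical stand-in for the real reclaim/migration machinery, and the proof-of-concept patch mentioned above certainly differs:

  /*
   * Outline only: linear reclaim over [start_pfn, start_pfn + nr_pages).
   * try_to_reclaim_or_migrate() is a hypothetical helper.
   */
  static int linear_reclaim(unsigned long start_pfn, unsigned long nr_pages)
  {
          unsigned long pfn;

          for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
                  struct page *page = pfn_to_page(pfn);

                  if (!page_count(page))
                          continue;       /* already free */
                  if (try_to_reclaim_or_migrate(page))
                          return -EAGAIN; /* pinned page; range not freeable */
          }
          return 0;
  }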
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
@ 2005-11-01 14:46 ` Ingo Molnar
2005-11-01 15:23 ` Mel Gorman
2005-11-01 18:33 ` Rob Landley
2005-11-01 14:50 ` Dave Hansen
2005-11-02 5:11 ` Andrew Morton
2 siblings, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-01 14:46 UTC (permalink / raw)
To: Mel Gorman
Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel

* Mel Gorman <mel@csn.ul.ie> wrote:

> [...] The full 100% solution would be a large set of far-reaching patches that would touch a lot of the memory manager. That would get rejected, because the patches should have arrived piecemeal. These patches are one piece. To reach 100%, other mechanisms are also needed, such as;
>
> o Page migration to move unreclaimable pages like mlock()ed pages or kernel pages that have fallen back into easy-reclaim areas. A mechanism would also be needed to move things like kernel text. I think the memory hotplug tree has done a lot of work here
> o A mechanism for taking regions of memory offline. Again, I think the memory hotplug crowd have something for this. If they don't, one of them will chime in.
> o Linear page reclaim that linearly scans a region of memory and reclaims or moves all the pages in it. I have a proof-of-concept patch that does the linear scan and reclaim, but it's currently ugly and depends on this set of patches being applied.

how will the 100% solution handle a simple kmalloc()-ed kernel buffer that is pinned down, and to/from which live pointers may exist? That alone can prevent RAM from being removable.

Ingo

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 14:46 ` Ingo Molnar
@ 2005-11-01 15:23 ` Mel Gorman
2005-11-01 18:33 ` Rob Landley
1 sibling, 0 replies; 241+ messages in thread
From: Mel Gorman @ 2005-11-01 15:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel

On Tue, 1 Nov 2005, Ingo Molnar wrote:

> * Mel Gorman <mel@csn.ul.ie> wrote:
>
> > [...] The full 100% solution would be a large set of far-reaching patches that would touch a lot of the memory manager. That would get rejected, because the patches should have arrived piecemeal. These patches are one piece. To reach 100%, other mechanisms are also needed, such as;
> >
> > o Page migration to move unreclaimable pages like mlock()ed pages or kernel pages that have fallen back into easy-reclaim areas. A mechanism would also be needed to move things like kernel text. I think the memory hotplug tree has done a lot of work here
> > o A mechanism for taking regions of memory offline. Again, I think the memory hotplug crowd have something for this. If they don't, one of them will chime in.
> > o Linear page reclaim that linearly scans a region of memory and reclaims or moves all the pages in it. I have a proof-of-concept patch that does the linear scan and reclaim, but it's currently ugly and depends on this set of patches being applied.
>
> how will the 100% solution handle a simple kmalloc()-ed kernel buffer that is pinned down, and to/from which live pointers may exist? That alone can prevent RAM from being removable.
>

It would require the page to have its virtual->physical mapping changed in the pagetables of each running process and in the master page table. That would be another step on the road to 100% support.

--
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 241+ messages in thread
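In outline, the operation Mel describes might look like the following; every helper here is hypothetical, and the hard part, rewriting all mappings atomically while keeping TLBs coherent, is hidden inside them:

  /* Outline only: relocate a pinned kernel page. */
  static int move_pinned_kernel_page(struct page *old, struct page *new)
  {
          copy_page(page_address(new), page_address(old));
          if (rewrite_all_mappings(old, new))     /* hypothetical helper */
                  return -EBUSY;
          flush_tlb_all();                        /* no stale translations */
          return 0;
  }

Even this outline sidesteps Ingo's real objection: live pointers hold the old physical address through the kernel's direct mapping, so the copy is only safe if every such mapping can be found and rewritten.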
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 14:46 ` Ingo Molnar 2005-11-01 15:23 ` Mel Gorman @ 2005-11-01 18:33 ` Rob Landley 2005-11-01 19:02 ` Ingo Molnar 1 sibling, 1 reply; 241+ messages in thread From: Rob Landley @ 2005-11-01 18:33 UTC (permalink / raw) To: Ingo Molnar Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel On Tuesday 01 November 2005 08:46, Ingo Molnar wrote: > how will the 100% solution handle a simple kmalloc()-ed kernel buffer, > that is pinned down, and to/from which live pointers may exist? That > alone can prevent RAM from being removable. Would you like to apply your "100% or nothing" argument to the virtual memory management subsystem and see how it sounds in that context? (As an argument that we shouldn't _have_ one?) > Ingo Rob ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 18:33 ` Rob Landley @ 2005-11-01 19:02 ` Ingo Molnar 0 siblings, 0 replies; 241+ messages in thread From: Ingo Molnar @ 2005-11-01 19:02 UTC (permalink / raw) To: Rob Landley Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel * Rob Landley <rob@landley.net> wrote: > On Tuesday 01 November 2005 08:46, Ingo Molnar wrote: > > how will the 100% solution handle a simple kmalloc()-ed kernel buffer, > > that is pinned down, and to/from which live pointers may exist? That > > alone can prevent RAM from being removable. > > Would you like to apply your "100% or nothing" argument to the virtual > memory management subsystem and see how it sounds in that context? > (As an argument that we shouldn't _have_ one?) that would be comparing apples to oranges. There is a big difference between "VM failures under high load", and "failure of VM functionality for no user-visible reason". The fragmentation problem here has nothing to do with pathological workloads. It has to do with 'unlucky' allocation patterns that pin down RAM areas which thus become non-removable. The RAM module will be non-removable for no user-visible reason. Possible under zero load, and with lots of free RAM otherwise. Ingo ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman 2005-11-01 14:46 ` Ingo Molnar @ 2005-11-01 14:50 ` Dave Hansen 2005-11-01 15:24 ` Mel Gorman 2005-11-02 5:11 ` Andrew Morton 2 siblings, 1 reply; 241+ messages in thread From: Dave Hansen @ 2005-11-01 14:50 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Tue, 2005-11-01 at 14:41 +0000, Mel Gorman wrote: > o Mechanism for taking regions of memory offline. Again, I think the > memory hotplug crowd have something for this. If they don't, one of them > will chime in. I'm not sure what you're asking for here. Right now, you can offline based on NUMA node, or physical address. It's all revealed in sysfs. Sounds like "regions" to me. :) -- Dave ^ permalink raw reply [flat|nested] 241+ messages in thread
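For reference, the sysfs interface Dave refers to exposes one directory per memory section; a userspace sketch of offlining one (path and ABI as in the memory hotplug tree; details may vary):

  #include <stdio.h>

  /* Write "offline" to a memory section's state file. */
  int offline_memory_section(int section)
  {
          char path[64];
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/devices/system/memory/memory%d/state", section);
          f = fopen(path, "w");
          if (!f)
                  return -1;
          fputs("offline", f);
          return fclose(f);
  }

The write fails if any page in the section cannot be reclaimed or migrated, which is where the fragmentation avoidance work comes in.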
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 14:50 ` Dave Hansen @ 2005-11-01 15:24 ` Mel Gorman 0 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-01 15:24 UTC (permalink / raw) To: Dave Hansen Cc: Ingo Molnar, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, Linux Kernel Mailing List, lhms On Tue, 1 Nov 2005, Dave Hansen wrote: > On Tue, 2005-11-01 at 14:41 +0000, Mel Gorman wrote: > > o Mechanism for taking regions of memory offline. Again, I think the > > memory hotplug crowd have something for this. If they don't, one of them > > will chime in. > > I'm not sure what you're asking for here. > > Right now, you can offline based on NUMA node, or physical address. > It's all revealed in sysfs. Sounds like "regions" to me. :) > Ah yes, that would do the job all right. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman 2005-11-01 14:46 ` Ingo Molnar 2005-11-01 14:50 ` Dave Hansen @ 2005-11-02 5:11 ` Andrew Morton 2 siblings, 0 replies; 241+ messages in thread From: Andrew Morton @ 2005-11-02 5:11 UTC (permalink / raw) To: Mel Gorman Cc: mingo, nickpiggin, mbligh, kravetz, linux-mm, linux-kernel, lhms-devel Mel Gorman <mel@csn.ul.ie> wrote: > > As GFP_ATOMIC and GFP_NOFS cannot do > any reclaim work themselves Both GFP_NOFS and GFP_NOIO can indeed perform direct reclaim. All we require is __GFP_WAIT. ^ permalink raw reply [flat|nested] 241+ messages in thread
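Andrew's correction, in code form: direct reclaim is gated on __GFP_WAIT, which GFP_NOFS and GFP_NOIO both carry. Simplified from the gfp.h definitions of that era:

  #define GFP_ATOMIC      (__GFP_HIGH)    /* no __GFP_WAIT: cannot reclaim */
  #define GFP_NOIO        (__GFP_WAIT)
  #define GFP_NOFS        (__GFP_WAIT | __GFP_IO)
  #define GFP_KERNEL      (__GFP_WAIT | __GFP_IO | __GFP_FS)

  static inline int may_direct_reclaim(unsigned int gfp_mask)
  {
          return gfp_mask & __GFP_WAIT;
  }

GFP_NOFS and GFP_NOIO merely restrict what reclaim may do (no filesystem or no I/O activity); only GFP_ATOMIC and other !__GFP_WAIT masks cannot reclaim at all.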
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 13:56 ` Ingo Molnar
2005-11-01 14:10 ` Dave Hansen
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
@ 2005-11-01 18:23 ` Rob Landley
2005-11-01 20:31 ` Joel Schopp
2 siblings, 1 reply; 241+ messages in thread
From: Rob Landley @ 2005-11-01 18:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel

On Tuesday 01 November 2005 07:56, Ingo Molnar wrote:
> * Mel Gorman <mel@csn.ul.ie> wrote:
> > The set of patches do fix a lot and make a strong start at addressing the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation problem', all the time, in a 100% way, now or in the future?

Considering anybody can allocate memory and never release it, _any_ 100% solution is going to require migrating existing pages, regardless of allocation strategy.

> > So, with this set of patches, how fragmented you get is dependent on the workload and it may still break down and high order allocations will fail. But the current situation is that it will definitely break down. The fact is that it has been reported that memory hotplug remove works with these patches and doesn't without them. Granted, this is just one feature on a high-end machine, but it is one solid operation we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance, hot unplug RAM with these patches applied? If not, do you have any expectation to reach 100%?

You're asking intentionally leading questions, aren't you? Without on-demand page migration, a given area of physical memory would only ever be free by sheer coincidence. Less fragmented page allocation doesn't address _where_ the free areas are; it just tries to make them contiguous.

A page migration strategy would have to do less work if there's less fragmentation, and it also allows you to cluster the "difficult" cases (such as kernel structures that just ain't moving) so you can much more easily hot-unplug everything else. It also makes larger order allocations easier to do, so drivers needing them can load as modules after boot, and it also means hugetlb comes a lot closer to general purpose infrastructure rather than a funky boot-time reservation thing. Plus page prezeroing approaches get to work on larger chunks, and so on.

But any strategy to demand that "this physical memory range must be freed up now" will by definition require moving pages...

> Ingo

Rob

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-01 18:23 ` Rob Landley
@ 2005-11-01 20:31 ` Joel Schopp
0 siblings, 0 replies; 241+ messages in thread
From: Joel Schopp @ 2005-11-01 20:31 UTC (permalink / raw)
To: Rob Landley
Cc: Ingo Molnar, Mel Gorman, Nick Piggin, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel

>>> The set of patches do fix a lot and make a strong start at addressing the fragmentation problem, just not 100% of the way. [...]
>>
>> do you have an expectation to be able to solve the 'fragmentation problem', all the time, in a 100% way, now or in the future?
>
> Considering anybody can allocate memory and never release it, _any_ 100% solution is going to require migrating existing pages, regardless of allocation strategy.

Three issues here: fragmentation of memory in general, fragmentation of usage, and being able to have a 100% success rate at removing memory.

We will never be able to have 100% contiguous memory with no fragmentation. Ever. Certainly not while we have non-movable pieces of memory. Even if we could move every piece of memory, it would be impractical. What these patches do for general fragmentation is to keep the allocations that will never get freed away from the rest of memory, so that memory has a chance to form larger contiguous ranges when it is freed.

By separating memory based on usage there is another side effect. It also makes possible some more active defragmentation methods on the easier memory, because it doesn't have annoying hard-to-move memory scattered throughout. Suddenly we can talk about being able to do memory hotplug remove on significant portions of memory. Or allocating these hugepages after boot. Or doing active defragmentation. Or modules being able to be modules because they don't have to preallocate big pieces of contiguous memory.

Some people will argue that we need 100% separation of usage or no separation at all. Well, change the fallback array to not allow kernel non-reclaimable to fall back and we are done: a 4-line change, 100% separation. But the tradeoff is that under memory pressure we might fail allocations when we still have free memory. There are other options for fallback of course; the fallback_alloc() function is easily replaceable if somebody wants to. Many of these options get easier once memory migration is in. The way fallback is done in the current patches is to maintain current behavior as much as possible, satisfy allocations, and not affect performance.

As to the 100% success at removing memory, this set of patches doesn't solve that. But it solves the 80% problem quite nicely (when combined with the memory migration patches). 80% is great for virtualized systems where the OS has some choice over which memory to remove, but not the quantity to remove. It is also a good start towards 100%, because we can separate and identify the easy memory from the hard memory. Dave Hansen has outlined in separate posts how we can get to 100%, including the hard memory.

>> can you always, under any circumstance, hot unplug RAM with these patches applied? If not, do you have any expectation to reach 100%?
>
> You're asking intentionally leading questions, aren't you? Without on-demand page migration, a given area of physical memory would only ever be free by sheer coincidence. Less fragmented page allocation doesn't address _where_ the free areas are; it just tries to make them contiguous.
>
> A page migration strategy would have to do less work if there's less fragmentation, and it also allows you to cluster the "difficult" cases (such as kernel structures that just ain't moving) so you can much more easily hot-unplug everything else. It also makes larger order allocations easier to do, so drivers needing them can load as modules after boot, and it also means hugetlb comes a lot closer to general purpose infrastructure rather than a funky boot-time reservation thing. Plus page prezeroing approaches get to work on larger chunks, and so on.
>
> But any strategy to demand that "this physical memory range must be freed up now" will by definition require moving pages...

Perfectly stated.

^ permalink raw reply [flat|nested] 241+ messages in thread
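To make Joel's "4-line change" concrete: the fallback behaviour amounts to a per-type search order. The table below is an illustrative guess at its shape (RCLM_TYPES and the type names are assumed from the thread), not the actual fallback data in the patches; strict 100% separation would simply end the kernel non-reclaimable row after its own type:

  /* Illustrative fallback ordering per allocation type. */
  static const int fallback_allocs[RCLM_TYPES][RCLM_TYPES] = {
          [RCLM_NORCLM] = { RCLM_NORCLM, RCLM_KERN,   RCLM_EASY },
          [RCLM_KERN]   = { RCLM_KERN,   RCLM_NORCLM, RCLM_EASY },
          [RCLM_EASY]   = { RCLM_EASY,   RCLM_KERN,   RCLM_NORCLM },
  };

The tradeoff Joel names lives entirely in that last column: keep it and allocations succeed under pressure at the cost of polluting easy-reclaim areas; drop it and separation is perfect but allocations can fail with memory still free.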
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
[not found] ` <4366D469.2010202@yahoo.com.au>
[not found] ` <Pine.LNX.4.58.0511011014060.14884@skynet>
@ 2005-11-01 20:59 ` Joel Schopp
2005-11-02 1:06 ` Nick Piggin
1 sibling, 1 reply; 241+ messages in thread
From: Joel Schopp @ 2005-11-01 20:59 UTC (permalink / raw)
To: Nick Piggin
Cc: Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar

>> The patches have gone through a large number of revisions, have been heavily tested and reviewed by a few people. The memory footprint of this approach is smaller than introducing new zones. If the cache footprint, increased branches and instructions were a problem, I would expect them to show up in the aim9 benchmark or the benchmark that ran ghostscript multiple times on a large file.
>>
>
> I appreciate that a lot of work has gone into them. You must appreciate that they add a reasonable amount of complexity and a non-zero performance cost to the page allocator.

The patches do add a reasonable amount of complexity to the page allocator. In my opinion that is the only downside of these patches, even though it is a big one. What we need to decide as a community is if there is a less complex way to do this, and if there isn't a less complex way, then whether the benefit is worth the increased complexity.

As to the non-zero performance cost, I think hard numbers should carry more weight than they have been given in this area. Mel has posted hard numbers that say the patches are a wash with respect to performance. I don't see any evidence to contradict those results.

>> They will need high order allocations if we want to provide HugeTLB pages to userspace on-demand rather than reserving at boot-time. This is a future problem, but it's one that is not worth tackling until the fragmentation problem is fixed first.
>>
>
> Sure. In what form, we haven't agreed. I vote zones! :)

I'd like to hear more details of how zones would be less complex while still solving the problem. I just don't get it.

^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-01 20:59 ` Joel Schopp @ 2005-11-02 1:06 ` Nick Piggin 2005-11-02 1:41 ` Martin J. Bligh ` (2 more replies) 0 siblings, 3 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-02 1:06 UTC (permalink / raw) To: Joel Schopp Cc: Mel Gorman, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar Joel Schopp wrote: > The patches do add a reasonable amount of complexity to the page > allocator. In my opinion that is the only downside of these patches, > even though it is a big one. What we need to decide as a community is > if there is a less complex way to do this, and if there isn't a less > complex way then is the benefit worth the increased complexity. > > As to the non-zero performance cost, I think hard numbers should carry > more weight than they have been given in this area. Mel has posted hard > numbers that say the patches are a wash with respect to performance. I > don't see any evidence to contradict those results. > The numbers I have seen show that performance is decreased. People like Ken Chen spend months trying to find a 0.05% improvement in performance. Not long ago I just spent days getting our cached kbuild performance back to where 2.4 is on my build system. I can simply see they will cost more icache, more dcache, more branches, etc. in what is the hottest part of the kernel in some workloads (kernel compiles, for one). I'm sorry if I sound like a wet blanket. I just don't look at a patch and think "wow all those 3 guys with Linux on IBM mainframes and using lpars are going to be so much happier now, this is something we need". >>> They will need high order allocations if we want to provide HugeTLB pages >>> to userspace on-demand rather than reserving at boot-time. This is a >>> future problem, but it's one that is not worth tackling until the >>> fragmentation problem is fixed first. >>> >> >> Sure. In what form, we haven't agreed. I vote zones! :) > > > I'd like to hear more details of how zones would be less complex while > still solving the problem. I just don't get it. > You have an extra zone. You size that zone at boot according to the amount of memory you need to be able to free. Only easy-reclaim stuff goes in that zone. It is less complex because zones are a complexity we already have to live with. 99% of the infrastructure is already there to do this. If you want to hot unplug memory or guarantee hugepage allocation, this is the way to do it. Nobody has told me why this *doesn't* work. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply	[flat|nested] 241+ messages in thread
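To flesh out what Nick is proposing, here is a hypothetical sketch of the zone approach. Everything in it — the ZONE_EASYRCLM name, the flag test, the boot-time size — is invented for illustration; no such zone existed in mainline at the time (Goto-san's real ZONE_REMOVABLE patches are linked later in the thread):

/* Hypothetical: one extra zone, sized at boot (say, easyrclm=2G),
 * into which only easily reclaimed allocations are ever placed. */
enum { ZONE_NORMAL, ZONE_EASYRCLM };

static int zone_for_alloc(unsigned int gfp_flags, unsigned int easyrclm_flag)
{
	/* User/pagecache pages may use the reclaimable zone (and can
	 * still fall back to ZONE_NORMAL when it fills up)... */
	if (gfp_flags & easyrclm_flag)
		return ZONE_EASYRCLM;

	/* ...but pinned kernel memory never enters it, so the whole
	 * zone stays reclaimable for hot-unplug or hugepages. */
	return ZONE_NORMAL;
}

The strength and the weakness are the same thing: the boundary is a hard, boot-time number, which is exactly the sizing problem Martin raises next.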
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 1:06 ` Nick Piggin @ 2005-11-02 1:41 ` Martin J. Bligh 2005-11-02 2:03 ` Nick Piggin 2005-11-02 11:37 ` Mel Gorman 2005-11-02 15:11 ` Mel Gorman 2 siblings, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-02 1:41 UTC (permalink / raw) To: Nick Piggin, Joel Schopp Cc: Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar > The numbers I have seen show that performance is decreased. People > like Ken Chen spend months trying to find a 0.05% improvement in > performance. Not long ago I just spent days getting our cached > kbuild performance back to where 2.4 is on my build system. Ironically, we're currently trying to chase down a 'database benchmark' regression that seems to have been caused by the last round of "let's rewrite the scheduler again" (more details later). Nick, you've added an awful lot of complexity to some of these code paths yourself ... seems ironic that you're the one complaining about it ;-) >>> Sure. In what form, we haven't agreed. I vote zones! :) >> >> >> I'd like to hear more details of how zones would be less complex while >> still solving the problem. I just don't get it. >> > > You have an extra zone. You size that zone at boot according to the > amount of memory you need to be able to free. Only easy-reclaim stuff > goes in that zone. > > It is less complex because zones are a complexity we already have to > live with. 99% of the infrastructure is already there to do this. > > If you want to hot unplug memory or guarantee hugepage allocation, > this is the way to do it. Nobody has told me why this *doesn't* work. Because the zone is statically sized, and you're back to the same crap we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM, effectively. Define how much you need for system ram, and how much for easily reclaimable memory at boot time. You can't - it doesn't work. M. ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 1:41 ` Martin J. Bligh @ 2005-11-02 2:03 ` Nick Piggin 2005-11-02 2:24 ` Martin J. Bligh 2005-11-02 11:41 ` Mel Gorman 0 siblings, 2 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-02 2:03 UTC (permalink / raw) To: Martin J. Bligh Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar Martin J. Bligh wrote: >>The numbers I have seen show that performance is decreased. People >>like Ken Chen spend months trying to find a 0.05% improvement in >>performance. Not long ago I just spent days getting our cached >>kbuild performance back to where 2.4 is on my build system. > > > Ironically, we're currently trying to chase down a 'database benchmark' > regression that seems to have been caused by the last round of "let's > rewrite the scheduler again" (more details later). Nick, you've added an > awful lot of complexity to some of these code paths yourself ... seems > ironic that you're the one complaining about it ;-) > Yeah that's unfortunate, but I think a large portion of the problem (if they are anything the same) has been narrowed down to some over-eager wakeup balancing for which there are a number of proposed patches. But in this case I was more worried about getting the groundwork done for handling the multicore systems that everyone will soon be using rather than several % performance regression on TPC-C (not to say that I don't care about that at all)... I don't see the irony. But let's move this to another thread if it is going to continue. I would be happy to discuss scheduler problems. >>You have an extra zone. You size that zone at boot according to the >>amount of memory you need to be able to free. Only easy-reclaim stuff >>goes in that zone. >> >>It is less complex because zones are a complexity we already have to >>live with. 99% of the infrastructure is already there to do this. >> >>If you want to hot unplug memory or guarantee hugepage allocation, >>this is the way to do it. Nobody has told me why this *doesn't* work. > > > Because the zone is statically sized, and you're back to the same crap > we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM, > effectively. Define how much you need for system ram, and how much > for easily reclaimable memory at boot time. You can't - it doesn't work. > You can't what? What doesn't work? If you have no hard limits set, then the frag patches can't guarantee anything either. You can't have it both ways. Either you have limits for things or you don't need any guarantees. Zones handle the former case nicely, and we currently do the latter case just fine (along with the frag patches). -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:03 ` Nick Piggin @ 2005-11-02 2:24 ` Martin J. Bligh 2005-11-02 2:49 ` Nick Piggin 2005-11-02 11:41 ` Mel Gorman 1 sibling, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-02 2:24 UTC (permalink / raw) To: Nick Piggin Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar >>> The numbers I have seen show that performance is decreased. People >>> like Ken Chen spend months trying to find a 0.05% improvement in >>> performance. Not long ago I just spent days getting our cached >>> kbuild performance back to where 2.4 is on my build system. >> >> Ironically, we're currently trying to chase down a 'database benchmark' >> regression that seems to have been caused by the last round of "let's >> rewrite the scheduler again" (more details later). Nick, you've added an >> awful lot of complexity to some of these code paths yourself ... seems >> ironic that you're the one complaining about it ;-) > > Yeah that's unfortunate, but I think a large portion of the problem > (if they are anything the same) has been narrowed down to some over-eager > wakeup balancing for which there are a number of proposed > patches. > > But in this case I was more worried about getting the groundwork done > for handling the multicore systems that everyone will soon > be using rather than several % performance regression on TPC-C (not > to say that I don't care about that at all)... I don't see the irony. > > But let's move this to another thread if it is going to continue. I > would be happy to discuss scheduler problems. My point was that most things we do add complexity to the codebase, including the things you do yourself ... I'm not saying that we're worse off for the changes you've made, by any means - I think they've been mostly beneficial. I'm just pointing out that we ALL do it, so let us not be too quick to judge when others propose adding something that does ;-) >> Because the zone is statically sized, and you're back to the same crap >> we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM, >> effectively. Define how much you need for system ram, and how much >> for easily reclaimable memory at boot time. You can't - it doesn't work. > > You can't what? What doesn't work? If you have no hard limits set, > then the frag patches can't guarantee anything either. > > You can't have it both ways. Either you have limits for things or > you don't need any guarantees. Zones handle the former case nicely, > and we currently do the latter case just fine (along with the frag > patches). I'll go look through Mel's current patchset again. I was under the impression it didn't suffer from this problem, at least not as much as zones did. Nothing is guaranteed. You can shag the whole machine and/or VM in any number of ways ... if we can significantly improve the probability of existing higher order allocs working, and new functionality has an excellent probability of success, that's as good as you're going to get. Have a free "perfect is the enemy of good" Linus quote, on me ;-) M. ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:24 ` Martin J. Bligh @ 2005-11-02 2:49 ` Nick Piggin 2005-11-02 4:39 ` Martin J. Bligh ` (2 more replies) 0 siblings, 3 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-02 2:49 UTC (permalink / raw) To: Martin J. Bligh Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar Martin J. Bligh wrote: >>But let's move this to another thread if it is going to continue. I >>would be happy to discuss scheduler problems. > > > My point was that most things we do add complexity to the codebase, > including the things you do yourself ... I'm not saying that we're worse > off for the changes you've made, by any means - I think they've been > mostly beneficial. Heh - I like the "mostly" ;) > I'm just pointing out that we ALL do it, so let us > not be too quick to judge when others propose adding something that does ;-) > What I'm getting worried about is the marked increase in the rate of features and complexity going in. I am almost certainly never going to use memory hotplug or demand paging of hugepages. I am pretty likely going to have to wade through this code at some point in the future if it is merged. It is also going to slow down my kernel by maybe 1% when doing kbuilds, but hey let's not worry about that until we've merged 10 more such slowdowns (ok that wasn't aimed at you or Mel, but my perception of the status quo). > >>You can't what? What doesn't work? If you have no hard limits set, >>then the frag patches can't guarantee anything either. >> >>You can't have it both ways. Either you have limits for things or >>you don't need any guarantees. Zones handle the former case nicely, >>and we currently do the latter case just fine (along with the frag >>patches). > > > I'll go look through Mel's current patchset again. I was under the > impression it didn't suffer from this problem, at least not as much > as zones did. > Over time, I don't think it can offer any stronger a guarantee than what we currently have. I'm not even sure that it would be any better at all for problematic workloads as time -> infinity. > Nothing is guaranteed. You can shag the whole machine and/or VM in > any number of ways ... if we can significantly improve the probability > of existing higher order allocs working, and new functionality has > an excellent probability of success, that's as good as you're going to > get. Have a free "perfect is the enemy of good" Linus quote, on me ;-) > I think it falls down if these higher order allocations actually get *used* for anything. You'll simply be going through the process of replacing your contiguous, easy-to-reclaim memory with pinned kernel memory. However, for the purpose of memory hot unplug, a new zone *will* guarantee memory can be reclaimed and unplugged. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:49 ` Nick Piggin @ 2005-11-02 4:39 ` Martin J. Bligh 2005-11-02 5:09 ` Nick Piggin 0 siblings, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-02 4:39 UTC (permalink / raw) To: Nick Piggin Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar >> I'm just pointing out that we ALL do it, so let us >> not be too quick to judge when others propose adding something that does ;-) > > What I'm getting worried about is the marked increase in the > rate of features and complexity going in. > > I am almost certainly never going to use memory hotplug or > demand paging of hugepages. I am pretty likely going to have > to wade through this code at some point in the future if it > is merged. Mmm. Though whether any one of us will personally use each feature is perhaps not the ideal criterion to judge things by ;-) > It is also going to slow down my kernel by maybe 1% when > doing kbuilds, but hey let's not worry about that until we've > merged 10 more such slowdowns (ok that wasn't aimed at you or > Mel, but my perception of the status quo). If it's really 1%, yes, that's a huge problem. And yes, I agree with you that there's a problem with the rate of change. Part of that is a lack of performance measurement and testing, and the quality sometimes scares me (though the last month has actually been significantly better, the tree mostly builds and boots now!). I've tried to do something on the testing front, but I'm acutely aware it's not sufficient by any means. >>> You can't what? What doesn't work? If you have no hard limits set, >>> then the frag patches can't guarantee anything either. >>> >>> You can't have it both ways. Either you have limits for things or >>> you don't need any guarantees. Zones handle the former case nicely, >>> and we currently do the latter case just fine (along with the frag >>> patches). >> >> I'll go look through Mel's current patchset again. I was under the >> impression it didn't suffer from this problem, at least not as much >> as zones did. > > Over time, I don't think it can offer any stronger a guarantee > than what we currently have. I'm not even sure that it would be > any better at all for problematic workloads as time -> infinity. Sounds worth discussing. We need *some* way of dealing with fragmentation issues. To me that means both an avoidance strategy, and an ability to actively defragment if we need it. Linux is evolved software, it may not be perfect at first - that's the way we work, and it's served us well up till now. To me, that's the biggest advantage we have over the proprietary model. >> Nothing is guaranteed. You can shag the whole machine and/or VM in >> any number of ways ... if we can significantly improve the probability >> of existing higher order allocs working, and new functionality has >> an excellent probability of success, that's as good as you're going to >> get. Have a free "perfect is the enemy of good" Linus quote, on me ;-) > > I think it falls down if these higher order allocations actually > get *used* for anything. You'll simply be going through the process > of replacing your contiguous, easy-to-reclaim memory with pinned > kernel memory. 
It seems inevitable that we need both physically contiguous memory sections, and virtually contiguous memory in kernel space (which equates to the same thing, unless we totally break the 1-1 P-V mapping and lose the large page mapping for kernel, which I'd hate to do.) > However, for the purpose of memory hot unplug, a new zone *will* > guarantee memory can be reclaimed and unplugged. It's not just about memory hotplug. There are, as we have discussed already, many uses for physically contiguous (and virtually contiguous) memory segments. Focusing purely on any one of them will not solve the issue at hand ... M. ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 4:39 ` Martin J. Bligh @ 2005-11-02 5:09 ` Nick Piggin 2005-11-02 5:14 ` Martin J. Bligh 0 siblings, 1 reply; 241+ messages in thread From: Nick Piggin @ 2005-11-02 5:09 UTC (permalink / raw) To: Martin J. Bligh Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar Martin J. Bligh wrote: >>I am almost certainly never going to use memory hotplug or >>demand paging of hugepages. I am pretty likely going to have >>to wade through this code at some point in the future if it >>is merged. > > > Mmm. Though whether any one of us will personally use each feature > is perhaps not the ideal criterion to judge things by ;-) > Of course, but I'd say very few people will. Then again maybe I'm just a luddite who doesn't know what's good for him ;) > >>It is also going to slow down my kernel by maybe 1% when >>doing kbuilds, but hey let's not worry about that until we've >>merged 10 more such slowdowns (ok that wasn't aimed at you or >>Mel, but my perception of the status quo). > > > If it's really 1%, yes, that's a huge problem. And yes, I agree with > you that there's a problem with the rate of change. Part of that is > a lack of performance measurement and testing, and the quality sometimes > scares me (though the last month has actually been significantly better, > the tree mostly builds and boots now!). I've tried to do something on > the testing front, but I'm acutely aware it's not sufficient by any means. > To be honest I haven't tested so this is an unfounded guess. However it is based on what I have seen of Mel's numbers, and the fact that the kernel spends nearly 1/3rd of its time in the page allocator when running a kbuild. I may get around to getting some real numbers when my current patch queues shrink. >>Over time, I don't think it can offer any stronger a guarantee >>than what we currently have. I'm not even sure that it would be >>any better at all for problematic workloads as time -> infinity. > > > Sounds worth discussing. We need *some* way of dealing with fragmentation > issues. To me that means both an avoidance strategy, and an ability > to actively defragment if we need it. Linux is evolved software, it > may not be perfect at first - that's the way we work, and it's served > us well up till now. To me, that's the biggest advantage we have over > the proprietary model. > True and I'm also annoyed that we have these issues at all. I just don't see that the avoidance strategy helps that much because as I said, you don't need to keep these lovely contiguous regions just for show (or other easy-to-reclaim user pages). The absolute priority is to move away from higher order allocs or use fallbacks IMO. And that doesn't necessarily mean order 1 or even 2 allocations because we don't seem to have a problem with those. Because I want Linux to be as robust as you do. >>I think it falls down if these higher order allocations actually >>get *used* for anything. You'll simply be going through the process >>of replacing your contiguous, easy-to-reclaim memory with pinned >>kernel memory. > > > It seems inevitable that we need both physically contiguous memory > sections, and virtually contiguous in kernel space (which equates to > the same thing, unless we totally break the 1-1 P-V mapping and > lose the large page mapping for kernel, which I'd hate to do.) > I think this isn't as bad an idea as you think. 
If it means those guys doing memory hotplug take a few % performance hit and nobody else has to bear the costs then that sounds great. > >>However, for the purpose of memory hot unplug, a new zone *will* >>guarantee memory can be reclaimed and unplugged. > > > It's not just about memory hotplug. There are, as we have discussed > already, many uses for physically contiguous (and virtually contiguous) > memory segments. Focusing purely on any one of them will not solve the > issue at hand ... > True, but we don't seem to have huge problems with other things. The main ones that have come up on lkml are e1000 which is getting fixed, and maybe XFS which I think there are also moves to improve. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 5:09 ` Nick Piggin @ 2005-11-02 5:14 ` Martin J. Bligh 2005-11-02 6:23 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-02 5:14 UTC (permalink / raw) To: Nick Piggin Cc: Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar >> It's not just about memory hotplug. There are, as we have discussed >> already, many uses for physically contiguous (and virtually contiguous) >> memory segments. Focusing purely on any one of them will not solve the >> issue at hand ... > > True, but we don't seem to have huge problems with other things. The > main ones that have come up on lkml are e1000 which is getting fixed, > and maybe XFS which I think there are also moves to improve. It should be fairly easy to trawl through the list of all allocations and pull out all the higher order ones from the whole source tree. I suspect there's a lot ... maybe I'll play with it later on. M. ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 5:14 ` Martin J. Bligh @ 2005-11-02 6:23 ` KAMEZAWA Hiroyuki 2005-11-02 10:15 ` Nick Piggin 0 siblings, 1 reply; 241+ messages in thread From: KAMEZAWA Hiroyuki @ 2005-11-02 6:23 UTC (permalink / raw) To: Martin J. Bligh Cc: Nick Piggin, Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar Martin J. Bligh wrote: >>True, but we don't seem to have huge problems with other things. The >>main ones that have come up on lkml are e1000 which is getting fixed, >>and maybe XFS which I think there are also moves to improve. > > > It should be fairly easy to trawl through the list of all allocations > and pull out all the higher order ones from the whole source tree. I > suspect there's a lot ... maybe I'll play with it later on. > Please check kmalloc(32k, 64k) too. For example, the loopback device's default MTU=16436 means order=3, and maybe there are other high-MTU devices. I suspect the skb_makewritable()/skb_copy()/skb_linearize() functions can suffer from fragmentation when the MTU is big. They allocate a large skb by gathering fragmented skbs. When these skb_* functions fail, the packet is silently discarded by netfilter. If fragmentation is heavy, packets (especially TCP) using a large MTU never reach their destination, even over loopback. Honestly, I'm not familiar with the network code; could anyone comment on this? -- Kame ^ permalink raw reply	[flat|nested] 241+ messages in thread
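Kame-san's arithmetic is easy to check: a linearized buffer for the default loopback MTU needs more than four 4KB pages, so the buddy allocator must hand out an order-3 (32KB) block. A standalone sketch of the calculation — order_for_size() here just reimplements what the kernel's get_order() helper does, and the skb overhead constant is only indicative, not the real value:

#include <stdio.h>

/* Smallest buddy order whose block of 2^order 4KB pages covers
 * 'size' bytes; this mirrors the kernel's get_order(). */
static int order_for_size(unsigned long size)
{
	unsigned long pages = (size + 4095) / 4096;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

int main(void)
{
	unsigned long mtu = 16436;	/* default loopback MTU */
	unsigned long overhead = 256;	/* skb overhead; indicative only */

	/* 16692 bytes -> 5 pages -> order 3 (8 pages, 32KB) */
	printf("order-%d\n", order_for_size(mtu + overhead));
	return 0;
}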
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 6:23 ` KAMEZAWA Hiroyuki @ 2005-11-02 10:15 ` Nick Piggin 0 siblings, 0 replies; 241+ messages in thread From: Nick Piggin @ 2005-11-02 10:15 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Martin J. Bligh, Joel Schopp, Mel Gorman, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar KAMEZAWA Hiroyuki wrote: > Martin J. Bligh wrote: > > Please check kmalloc(32k, 64k) too. > > For example, the loopback device's default MTU=16436 means order=3, and > maybe there are other high-MTU devices. > > I suspect the skb_makewritable()/skb_copy()/skb_linearize() functions can > suffer from fragmentation when the MTU is big. They allocate a large skb by > gathering fragmented skbs. When these skb_* functions fail, the packet > is silently discarded by netfilter. If fragmentation is heavy, packets > (especially TCP) using a large MTU never reach their destination, even over loopback. > > Honestly, I'm not familiar with the network code; could anyone comment on this? > I'd be interested to know, actually. I was hoping loopback would always use order-0 allocations, because the loopback driver is SG, FRAGLIST, and HIGHDMA capable. However I'm likewise not familiar with network code. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply	[flat|nested] 241+ messages in thread
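For reference, Nick's expectation comes from the feature flags the loopback driver sets on itself. From memory of drivers/net/loopback.c in kernels of that era (quoted approximately, not verbatim):

	/* loopback advertises scatter-gather and fragment-list support */
	dev->features = NETIF_F_SG | NETIF_F_FRAGLIST
			| NETIF_F_HIGHDMA | NETIF_F_LLTX;

With NETIF_F_SG and NETIF_F_FRAGLIST set, skb data can stay in separate page fragments instead of being linearized into one large (order-3, for a 16KB MTU) buffer, which is why loopback traffic should in principle get by on order-0 allocations.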
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:49 ` Nick Piggin 2005-11-02 4:39 ` Martin J. Bligh @ 2005-11-02 7:19 ` Yasunori Goto 2005-11-02 11:48 ` Mel Gorman 2 siblings, 0 replies; 241+ messages in thread From: Yasunori Goto @ 2005-11-02 7:19 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Joel Schopp, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar, Mel Gorman Hello, Nick-san. I posted patches that add ZONE_REMOVABLE to LHMS. I don't say they are better than Mel-san's patch; I hope they will be the basis of a good discussion. There were 2 types. One just added ZONE_REMOVABLE. This patch came from an early implementation of memory hotplug by the VA-Linux team. http://sourceforge.net/mailarchive/forum.php?thread_id=5969508&forum_id=223 ZONE_HIGHMEM was used for this purpose in the early implementation. We thought ZONE_HIGHMEM was easier to remove than the other zones, but some architectures don't use it. That is why ZONE_REMOVABLE was born. (And I remember that ZONE_DMA32 was defined after this patch, so the number of zones became 5 and one more bit was necessary in page->flags. I don't know the recent progress of ZONE_DMA32.) The other one was a bit similar to Mel-san's. One motivation of this patch was to create an orthogonal relationship between Removable and DMA/Normal/Highmem, which I thought was desirable, because ppc64 can treat all of its memory as the same (DMA) zone and a new zone would spoil that good feature. http://sourceforge.net/mailarchive/forum.php?thread_id=5345977&forum_id=223 http://sourceforge.net/mailarchive/forum.php?thread_id=5345978&forum_id=223 http://sourceforge.net/mailarchive/forum.php?thread_id=5345979&forum_id=223 http://sourceforge.net/mailarchive/forum.php?thread_id=5345980&forum_id=223 Thanks. P.S. to Mel-san: I'm sorry for writing this so late. This thread was a mail bomb for me to read with my poor English skill. :-( > Martin J. Bligh wrote: > > >>But let's move this to another thread if it is going to continue. I > >>would be happy to discuss scheduler problems. > > > > > > My point was that most things we do add complexity to the codebase, > > including the things you do yourself ... I'm not saying that we're worse > > off for the changes you've made, by any means - I think they've been > > mostly beneficial. > > Heh - I like the "mostly" ;) > > > I'm just pointing out that we ALL do it, so let us > > not be too quick to judge when others propose adding something that does ;-) > > > > What I'm getting worried about is the marked increase in the > rate of features and complexity going in. > > I am almost certainly never going to use memory hotplug or > demand paging of hugepages. I am pretty likely going to have > to wade through this code at some point in the future if it > is merged. > > It is also going to slow down my kernel by maybe 1% when > doing kbuilds, but hey let's not worry about that until we've > merged 10 more such slowdowns (ok that wasn't aimed at you or > Mel, but my perception of the status quo). > > > > >>You can't what? What doesn't work? If you have no hard limits set, > >>then the frag patches can't guarantee anything either. > >> > >>You can't have it both ways. Either you have limits for things or > >>you don't need any guarantees. Zones handle the former case nicely, > >>and we currently do the latter case just fine (along with the frag > >>patches). > > > > > > I'll go look through Mel's current patchset again. 
> > I was under the impression it didn't suffer from this problem, > > at least not as much as zones did. > > Over time, I don't think it can offer any stronger a guarantee > than what we currently have. I'm not even sure that it would be > any better at all for problematic workloads as time -> infinity. > > > Nothing is guaranteed. You can shag the whole machine and/or VM in > > any number of ways ... if we can significantly improve the probability > > of existing higher order allocs working, and new functionality has > > an excellent probability of success, that's as good as you're going to > > get. Have a free "perfect is the enemy of good" Linus quote, on me ;-) > > I think it falls down if these higher order allocations actually > get *used* for anything. You'll simply be going through the process > of replacing your contiguous, easy-to-reclaim memory with pinned > kernel memory. > > However, for the purpose of memory hot unplug, a new zone *will* > guarantee memory can be reclaimed and unplugged. > > -- > SUSE Labs, Novell Inc. > > Send instant messages to your online friends http://au.messenger.yahoo.com -- Yasunori Goto ^ permalink raw reply	[flat|nested] 241+ messages in thread
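For readers without the LHMS archives handy, the first of the two approaches Goto-san describes boils down to one more entry in the zone enumeration. A rough sketch of its shape, reconstructed for illustration from his description above rather than quoted from the actual patch:

/* The zones of that era, plus the proposed removable zone.  The extra
 * zone is what costs one more bit in page->flags, as noted above
 * (and ZONE_DMA32, added later, pushed the count to five). */
#define ZONE_DMA	0
#define ZONE_NORMAL	1
#define ZONE_HIGHMEM	2
#define ZONE_REMOVABLE	3	/* only easily reclaimed pages allowed */
#define MAX_NR_ZONES	4

His second variant instead keeps "removable" orthogonal to DMA/Normal/Highmem, precisely so that an architecture like ppc64, which can put all memory in one DMA zone, does not lose that property.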
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:49 ` Nick Piggin 2005-11-02 4:39 ` Martin J. Bligh 2005-11-02 7:19 ` Yasunori Goto @ 2005-11-02 11:48 ` Mel Gorman 2 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-02 11:48 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Joel Schopp, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar On Wed, 2 Nov 2005, Nick Piggin wrote: > Martin J. Bligh wrote: > > > But let's move this to another thread if it is going to continue. I > > > would be happy to discuss scheduler problems. > > > > > > My point was that most things we do add complexity to the codebase, > > including the things you do yourself ... I'm not saying that we're worse > > off for the changes you've made, by any means - I think they've been > > mostly beneficial. > > Heh - I like the "mostly" ;) > > > I'm just pointing out that we ALL do it, so let us > > not be too quick to judge when others propose adding something that does ;-) > > > > What I'm getting worried about is the marked increase in the > rate of features and complexity going in. > > I am almost certainly never going to use memory hotplug or > demand paging of hugepages. I am pretty likely going to have > to wade through this code at some point in the future if it > is merged. > Plenty of features in the kernel I don't use either :) . > It is also going to slow down my kernel by maybe 1% when > doing kbuilds, but hey let's not worry about that until we've > merged 10 more such slowdowns (ok that wasn't aimed at you or > Mel, but my perception of the status quo). > Ok, my patches show performance gains and losses on different parts of Aim9. page_test is slightly down but fork_test was considerably up. Both would have an effect on kbuild so more figures are needed on more machines. That will only be found from testing on a variety of machines. > > > > > You can't what? What doesn't work? If you have no hard limits set, > > > then the frag patches can't guarantee anything either. > > > > > > You can't have it both ways. Either you have limits for things or > > > you don't need any guarantees. Zones handle the former case nicely, > > > and we currently do the latter case just fine (along with the frag > > > patches). > > > > > > I'll go look through Mel's current patchset again. I was under the > > impression it didn't suffer from this problem, at least not as much > > as zones did. > > > > Over time, I don't think it can offer any stronger a guarantee > than what we currently have. I'm not even sure that it would be > any better at all for problematic workloads as time -> infinity. > Not as they currently stand, no. As I've said elsewhere, to really guarantee things, kswapd would need to know how to clear out UserRclm pages from the other reserve types. > > Nothing is guaranteed. You can shag the whole machine and/or VM in > > any number of ways ... if we can significantly improve the probability of > > existing higher order allocs working, and new functionality has > > an excellent probability of success, that's as good as you're going to get. > > Have a free "perfect is the enemy of good" Linus quote, on me ;-) > > I think it falls down if these higher order allocations actually > get *used* for anything. You'll simply be going through the process > of replacing your contiguous, easy-to-reclaim memory with pinned > kernel memory. > And a misconfigured zone-based approach just falls apart. 
Going to finish that summary mail to avoid repetition. > However, for the purpose of memory hot unplug, a new zone *will* > guarantee memory can be reclaimed and unplugged. > > -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 2:03 ` Nick Piggin 2005-11-02 2:24 ` Martin J. Bligh @ 2005-11-02 11:41 ` Mel Gorman 1 sibling, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-02 11:41 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Joel Schopp, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar On Wed, 2 Nov 2005, Nick Piggin wrote: > Martin J. Bligh wrote: > > > The numbers I have seen show that performance is decreased. People > > > like Ken Chen spend months trying to find a 0.05% improvement in > > > performance. Not long ago I just spent days getting our cached > > > kbuild performance back to where 2.4 is on my build system. > > > > > > Ironically, we're currently trying to chase down a 'database benchmark' > > regression that seems to have been caused by the last round of "let's > > rewrite the scheduler again" (more details later). Nick, you've added an > > awful lot of complexity to some of these code paths yourself ... seems > > ironic that you're the one complaining about it ;-) > > > > Yeah that's unfortunate, but I think a large portion of the problem > (if they are anything the same) has been narrowed down to some over-eager > wakeup balancing for which there are a number of proposed > patches. > > But in this case I was more worried about getting the groundwork done > for handling the multicore systems that everyone will soon > be using rather than several % performance regression on TPC-C (not > to say that I don't care about that at all)... I don't see the irony. > > But let's move this to another thread if it is going to continue. I > would be happy to discuss scheduler problems. > > > > You have an extra zone. You size that zone at boot according to the > > > amount of memory you need to be able to free. Only easy-reclaim stuff > > > goes in that zone. > > > > > > It is less complex because zones are a complexity we already have to > > > live with. 99% of the infrastructure is already there to do this. > > > > > > If you want to hot unplug memory or guarantee hugepage allocation, > > > this is the way to do it. Nobody has told me why this *doesn't* work. > > > > > > Because the zone is statically sized, and you're back to the same crap > > we had with 32bit systems of splitting ZONE_NORMAL and ZONE_HIGHMEM, > > effectively. Define how much you need for system ram, and how much > > for easily reclaimable memory at boot time. You can't - it doesn't work. > > You can't what? What doesn't work? If you have no hard limits set, > then the frag patches can't guarantee anything either. > True, but the difference is: anti-defrag is best effort at low cost (according to Aim9) without a tunable; zones will work, but require a tunable and fall apart if tuned wrong. > You can't have it both ways. Either you have limits for things or > you don't need any guarantees. Zones handle the former case nicely, > and we currently do the latter case just fine (along with the frag > patches). > Sure, so you compromise and do best effort for as long as possible. Always try to keep fragmentation low. If the system is configured to really need low fragmentation, then after a long period of time, a page-migration mechanism kicks in to move the kernel pages out of EasyRclm areas and we continue on. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 1:06 ` Nick Piggin 2005-11-02 1:41 ` Martin J. Bligh @ 2005-11-02 11:37 ` Mel Gorman 2005-11-02 15:11 ` Mel Gorman 2 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-02 11:37 UTC (permalink / raw) To: Nick Piggin Cc: Joel Schopp, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar On Wed, 2 Nov 2005, Nick Piggin wrote: > Joel Schopp wrote: > > > The patches do add a reasonable amount of complexity to the page allocator. > > In my opinion that is the only downside of these patches, even though it is > > a big one. What we need to decide as a community is if there is a less > > complex way to do this, and if there isn't a less complex way then is the > > benefit worth the increased complexity. > > > > As to the non-zero performance cost, I think hard numbers should carry more > > weight than they have been given in this area. Mel has posted hard numbers > > that say the patches are a wash with respect to performance. I don't see > > any evidence to contradict those results. > > > > The numbers I have seen show that performance is decreased. People > like Ken Chen spend months trying to find a 0.05% improvement in > performance. Fine, that is understandable. The AIM9 benchmarks also show performance improvements in other areas like fork_test. About a 5% difference, which is also important for kernel builds. Wider testing would be needed to see if the improvements are specific to my tests or not. Every set of patches has had a performance regression test run with Aim9, so I certainly have not been ignoring performance. > Not long ago I just spent days getting our cached > kbuild performance back to where 2.4 is on my build system. > Then it would be interesting to find out how 2.6.14-rc5-mm1 compares against 2.6.14-rc5-mm1-mbuddy-v19? > I can simply see they will cost more icache, more dcache, more branches, > etc. in what is the hottest part of the kernel in some workloads (kernel > compiles, for one). > > I'm sorry if I sound like a wet blanket. I just don't look at a patch > and think "wow all those 3 guys with Linux on IBM mainframes and using > lpars are going to be so much happier now, this is something we need". > I developed this as the beginning of a long term solution for on-demand HugeTLB pages as part of a PhD. This could potentially help desktop workloads in the future. Hotplug machines are a benefit that was picked up by the work on the way. We can help hotplug to some extent today and desktop users in the future (and given time, all of the hotplug problems as well). But if we tell desktop users "Yeah, your applications will run a bit better with HugeTLB pages as long as you configure the size of the zone correctly" at any stage, we'll be told where to go. > > > > They will need high order allocations if we want to provide HugeTLB pages > > > > to userspace on-demand rather than reserving at boot-time. This is a > > > > future problem, but it's one that is not worth tackling until the > > > > fragmentation problem is fixed first. > > > > > > > > > > Sure. In what form, we haven't agreed. I vote zones! :) > > > > > > I'd like to hear more details of how zones would be less complex while still > > solving the problem. I just don't get it. > > > > You have an extra zone. You size that zone at boot according to the > amount of memory you need to be able to free. Only easy-reclaim stuff > goes in that zone. > Helps hotplug, no one else. 
Rules out HugeTLB on demand for userspace unless we are willing to tell desktop users to configure this tunable. > It is less complex because zones are a complexity we already have to > live with. 99% of the infrastructure is already there to do this. > The simplicity of zones is still in dispute. I am putting together a mail of pros, cons, situations and future work for both approaches. I hope to send it out fairly soon. > If you want to hot unplug memory or guarantee hugepage allocation, > this is the way to do it. Nobody has told me why this *doesn't* work. > Hot unplug the configured zone of memory and guarantee hugepage allocation only for userspace. There is no way for kernel allocations to get a huge page under any circumstance. Our approach allows the kernel to get the large page at the cost of fragmentation degrading slowly over time. To stop it fragmenting slowly over time, more work is needed. -- Mel Gorman Part-time Phd Student Java Applications Developer University of Limerick IBM Dublin Software Lab ^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-02 1:06 ` Nick Piggin 2005-11-02 1:41 ` Martin J. Bligh 2005-11-02 11:37 ` Mel Gorman @ 2005-11-02 15:11 ` Mel Gorman 2 siblings, 0 replies; 241+ messages in thread From: Mel Gorman @ 2005-11-02 15:11 UTC (permalink / raw) To: Nick Piggin Cc: Joel Schopp, Martin J. Bligh, Andrew Morton, kravetz, linux-mm, linux-kernel, lhms-devel, Ingo Molnar On (02/11/05 12:06), Nick Piggin didst pronounce: > Joel Schopp wrote: > > >The patches do add a reasonable amount of complexity to the page > >allocator. In my opinion that is the only downside of these patches, > >even though it is a big one. What we need to decide as a community is > >if there is a less complex way to do this, and if there isn't a less > >complex way then is the benefit worth the increased complexity. > > > >As to the non-zero performance cost, I think hard numbers should carry > >more weight than they have been given in this area. Mel has posted hard > >numbers that say the patches are a wash with respect to performance. I > >don't see any evidence to contradict those results. > > > > The numbers I have seen show that performance is decreased. People > like Ken Chen spend months trying to find a 0.05% improvement in > performance. Not long ago I just spent days getting our cached > kbuild performance back to where 2.4 is on my build system. > One contention point is the overhead this introduces. Let's say we do discover that kbuild is slower with this patch (still unknown); then we have to get rid of mbuddy, disable it or replace it with an as-yet-to-be-written zone-based approach. I wrote a quick patch that disables anti-defrag via a config option and ran aim9 on the test machine I have been using all along. I deliberately changed as little of the anti-defrag code as possible, but maybe we could make this patch even smaller, or go the other way and conditionally take out as much anti-defrag as possible. Here are the Aim9 comparisons between -clean and -mbuddy-v19-antidefrag-disabled-with-config-option (just the one run). These are both based on 2.6.14-rc5-mm1.

                 vanilla-mm  mbuddy-disabled-via-config
1 creat-clo        16006.00    15844.72   -161.28  -1.01% File Creations and Closes/second
2 page_test       117515.83   119696.77   2180.94   1.86% System Allocations & Pages/second
3 brk_test        440289.81   439870.04   -419.77  -0.10% System Memory Allocations/second
4 jmp_test       4179466.67  4179150.00   -316.67  -0.01% Non-local gotos/second
5 signal_test      80803.20    82055.98   1252.78   1.55% Signal Traps/second
6 exec_test           61.75       61.53     -0.22  -0.36% Program Loads/second
7 fork_test         1327.01     1344.55     17.54   1.32% Task Creations/second
8 link_test         5531.53     5548.33     16.80   0.30% Link/Unlink Pairs/second

On this kernel, I forgot to disable the collection of buddy allocator statistics. Collection introduces more overhead in both CPU and memory. Here are the figures when statistic collection is also disabled via the config option. 
                 vanilla-mm  mbuddy-disabled-via-config-nostats
1 creat-clo        16006.00    15906.06    -99.94  -0.62% File Creations and Closes/second
2 page_test       117515.83   120736.54   3220.71   2.74% System Allocations & Pages/second
3 brk_test        440289.81   430311.61  -9978.20  -2.27% System Memory Allocations/second
4 jmp_test       4179466.67  4181683.33   2216.66   0.05% Non-local gotos/second
5 signal_test      80803.20    87387.54   6584.34   8.15% Signal Traps/second
6 exec_test           61.75       62.14      0.39   0.63% Program Loads/second
7 fork_test         1327.01     1345.77     18.76   1.41% Task Creations/second
8 link_test         5531.53     5556.72     25.19   0.46% Link/Unlink Pairs/second

So, now we have performance gains in a number of areas. Nice big jump in page_test, and that fork_test improvement probably won't hurt kbuild either, with exec_test giving a bit of a nudge. signal_test has a big hike for some reason; not sure who will benefit there, but hey, it can't be bad. I am annoyed with brk_test, especially as it is very similar to page_test in the aim9 source code, but there is no point hiding the result either. These figures do not tell us how kbuild really performs, of course. For that, kbuild needs to be run on both kernels and compared. This applies to any workload. This anti-defrag makes the code more complex and harder to read, no argument there. However, on at least one test machine, there is a very small difference when anti-defrag is enabled in comparison to a vanilla kernel. When the patches are applied and anti-defrag is disabled via a kernel option, we see a number of performance gains, on one machine at least, which is a good thing. Wider testing would show if these good figures are specific to my testbed or not. If other testbeds show up nothing bad, anti-defrag with this additional patch could give us the best of both worlds. If you have a hotplug machine or you care about high orders, enable this option. Otherwise, choose N and avoid the anti-defrag overhead. 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/gfp.h linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/gfp.h
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/gfp.h	2005-11-02 12:44:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/gfp.h	2005-11-02 12:49:24.000000000 +0000
@@ -50,6 +50,7 @@ struct vm_area_struct;
 #define __GFP_HARDWALL	0x40000u /* Enforce hardwall cpuset memory allocs */
 #define __GFP_VALID	0x80000000u /* valid GFP flags */
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 /*
  * Allocation type modifiers, these are required to be adjacent
  * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
@@ -61,6 +62,11 @@ struct vm_area_struct;
 #define __GFP_EASYRCLM	0x80000u /* User and other easily reclaimed pages */
 #define __GFP_KERNRCLM	0x100000u /* Kernel page that is reclaimable */
 #define __GFP_RCLM_BITS	(__GFP_EASYRCLM|__GFP_KERNRCLM)
+#else
+#define __GFP_EASYRCLM	0
+#define __GFP_KERNRCLM	0
+#define __GFP_RCLM_BITS	0
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 #define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/mmzone.h linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/include/linux/mmzone.h	2005-11-02 12:44:07.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/include/linux/mmzone.h	2005-11-02 13:00:56.000000000 +0000
@@ -23,6 +23,7 @@
 #endif
 #define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 /*
  * The two bit field __GFP_RECLAIMBITS enumerates the following types of
  * page reclaimability.
@@ -33,6 +34,14 @@
 #define RCLM_FALLBACK 3
 #define RCLM_TYPES 4
 #define BITS_PER_RCLM_TYPE 2
+#else
+#define RCLM_NORCLM 0
+#define RCLM_EASY 0
+#define RCLM_KERN 0
+#define RCLM_FALLBACK 0
+#define RCLM_TYPES 1
+#define BITS_PER_RCLM_TYPE 0
+#endif
 
 #define for_each_rclmtype_order(type, order) \
 	for (order = 0; order < MAX_ORDER; order++) \
@@ -60,6 +69,7 @@ struct zone_padding {
 #define ZONE_PADDING(name)
 #endif
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 /*
  * Indices into pcpu_list
  * PCPU_KERNEL: For RCLM_NORCLM and RCLM_KERN allocations
@@ -68,6 +78,11 @@ struct zone_padding {
 #define PCPU_KERNEL 0
 #define PCPU_EASY 1
 #define PCPU_TYPES 2
+#else
+#define PCPU_KERNEL 0
+#define PCPU_EASY 0
+#define PCPU_TYPES 1
+#endif
 
 struct per_cpu_pages {
 	int count[PCPU_TYPES]; /* Number of pages on each list */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/init/Kconfig linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/init/Kconfig
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/init/Kconfig	2005-11-02 12:42:20.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/init/Kconfig	2005-11-02 12:59:49.000000000 +0000
@@ -419,6 +419,17 @@ config CC_ALIGN_JUMPS
 	  no dummy operations need be executed. Zero means use compiler's
 	  default.
 
+config PAGEALLOC_ANTIDEFRAG
+	bool "Try and avoid fragmentation in the page allocator"
+	def_bool y
+	help
+	  The standard allocator will fragment memory over time which means that
+	  high order allocations will fail even if kswapd is running. If this
+	  option is set, the allocator will try and group page types into
+	  three groups KernNoRclm, KernRclm and EasyRclm. The gain is a best
+	  effort attempt at lowering fragmentation. The loss is more complexity
+
+
 endmenu		# General setup
 
 config TINY_SHMEM
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/mm/page_alloc.c linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-mbuddy-v19-noconfig/mm/page_alloc.c	2005-11-02 13:05:07.000000000 +0000
+++ linux-2.6.14-rc5-mm1-mbuddy-v19-withconfig/mm/page_alloc.c	2005-11-02 14:09:37.000000000 +0000
@@ -57,11 +57,17 @@ long nr_swap_pages;
  * fallback_allocs contains the fallback types for low memory conditions
  * where the preferred alloction type if not available.
  */
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
 	{RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN, RCLM_EASY, RCLM_TYPES},
 	{RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
 	{RCLM_KERN, RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
 };
+#else
+int fallback_allocs[RCLM_TYPES][RCLM_TYPES+1] = {
+	{RCLM_NORCLM, RCLM_TYPES}
+};
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 /* Returns 1 if the needed percentage of the zone is reserved for fallbacks */
 static inline int min_fallback_reserved(struct zone *zone)
@@ -98,6 +104,7 @@ EXPORT_SYMBOL(totalram_pages);
 #error __GFP_KERNRCLM not mapping to RCLM_KERN
 #endif
 
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 /*
  * This function maps gfpflags to their RCLM_TYPE. It makes assumptions
  * on the location of the GFP flags.
@@ -115,6 +122,12 @@ static inline int gfpflags_to_rclmtype(g
 
 	return rclmbits >> RCLM_SHIFT;
 }
+#else
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+	return RCLM_NORCLM;
+}
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 
 /*
  * copy_bits - Copy bits between bitmaps
@@ -134,6 +147,9 @@ static inline void copy_bits(unsigned lo
 			int sindex_src,
 			int nr)
 {
+	if (nr == 0)
+		return;
+
 	/*
 	 * Written like this to take advantage of arch-specific
 	 * set_bit() and clear_bit() functions
@@ -188,8 +204,12 @@ static char *zone_names[MAX_NR_ZONES] =
 int min_free_kbytes = 1024;
 
 #ifdef CONFIG_ALLOCSTATS
+#ifdef CONFIG_PAGEALLOC_ANTIDEFRAG
 static char *type_names[RCLM_TYPES] = { "KernNoRclm", "EasyRclm",
 					"KernRclm", "Fallback"};
+#else
+static char *type_names[RCLM_TYPES] = { "KernNoRclm" };
+#endif /* CONFIG_PAGEALLOC_ANTIDEFRAG */
 #endif /* CONFIG_ALLOCSTATS */
 
 unsigned long __initdata nr_kernel_pages;
@@ -2228,8 +2248,10 @@ static void __init setup_usemap(struct p
 				struct zone *zone, unsigned long zonesize)
 {
 	unsigned long usemapsize = usemap_size(zonesize);
-	zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
-	memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+	if (usemapsize != 0) {
+		zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+		memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+	}
 }
 #else
 static void inline setup_usemap(struct pglist_data *pgdat,

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 @ 2005-11-04 1:00 Andy Nelson 2005-11-04 1:16 ` Martin J. Bligh 2005-11-04 5:14 ` Linus Torvalds 0 siblings, 2 replies; 241+ messages in thread From: Andy Nelson @ 2005-11-04 1:00 UTC (permalink / raw) To: mbligh, torvalds Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo, nickpiggin Linus writes: >Just face it - people who want memory hotplug had better know that >beforehand (and let's be honest - in practice it's only going to work in >virtualized environments or in environments where you can insert the new >bank of memory and copy it over and remove the old one with hw support). > >Same as hugetlb. > >Nobody sane _cares_. Nobody sane is asking for these things. Only people >with special needs are asking for it, and they know their needs. Hello, my name is Andy. I am insane. I am one of the CRAZY PEOPLE you wrote about. I am the whisperer in people's minds, causing them to conspire against sanity everywhere and make lives as insane and crazy as mine is. I love my work. I am an astrophysicist. I have lurked on various linux lists for years now, and this is my first time standing in front of all you people, hoping to make you bend your insane and crazy kernel developing minds to listen to the rantings of my insane and crazy HPC mind. I have done high performance computing in astrophysics for nearly two decades now. It gives me a perspective that kernel developers usually don't have, but sometimes need. For my part, I promise that I specifically do *not* have the perspective of a kernel developer. I don't even speak C. I don't really know what you folks do all day or night, and I actually don't much care except when it impacts my own work. I am fairly certain a lot of this hotplug/page defragmentation/page faulting/page zeroing stuff that the sgi and ibm folk are currently getting rejected from inclusion in the kernel impacts my work in very serious ways. You're right, I do know my needs. They are not being met and the people with the power to do anything about it call me insane and crazy and refuse to be interested even in making improvement possible, even when it quite likely helps them too. Today I didn't hear a voice in my head that told me to shoot the pope, but I did hear one telling me to write a note telling you about my issues, which apparently are in the 0.01% of insane crazies that should be ignored, as are about 1/2 of the people responding on this thread. I'll tell you a bit about my issues and their context now that things have gotten hot enough that even a devout lurker like me is posting. Some of it might make sense. Other parts may be internally inconsistent if only I knew enough. Still other parts may be useful to get people who don't talk to each other in contact, and think about things in ways they haven't. I run large hydrodynamic simulations using a variety of techniques whose relevance is only tangential to the current flamefest. I'll let you know the important details as they come in later. A lot of my statements will be common to a large fraction of all hpc applications, and I imagine to many large scale database applications as well, though I'm guessing a bit there. I run the same codes on many kinds of systems from workstations up to large supercomputing platforms. Mostly my experience has been in shared memory systems, but recently I've been part of things that will put me into distributed memory space as well. What does it mean to use computers like I do? 
Maybe this is surprising, but my executables are very, very small. Typically 1-2MB or less, with only a bit more needed for various external libraries like FFTW or the like. On the other hand, my memory requirements are huge. Typically many GB, and some folks run simulations with many TB. Picture a very small and very fast flea repeatedly jumping around all over the skin of a very large elephant, taking a bite at each jump, and that is a crude idea of what is happening.

This has bearing on the current discussion in the following ways, which are not theoretical in any way.

1) Some of these simulations frequently need to access data that is located very far away in memory. That means that the bigger your pages are, the fewer TLB misses you get, the smaller the thrashing, and the faster your code runs.

One example: I have a particle hydrodynamics code that uses gravity. Molecular dynamics simulations have similar issues with long range forces too. Gravity is calculated by culling acceptable nodes and atoms out of a tree structure that can be many GB in size, or for bigger jobs, many TB. You have to traverse the entire tree for every particle (or closely packed small group). During this stage, almost every node examination (a simple compare and about 5 flops) requires at least one TLB miss and, depending on how you've laid out your array, several TLB misses. Huge pages help this problem, big time. Fine with me if all I had was one single page. If I am stupid and get into swap territory, I deserve every bad thing that happens to me.

Now you have a list of a few thousand nodes and atoms with their data spread sparsely over that entire multi-GB memory volume. Grab data (about 10 double precision numbers) for one node, do 40-50 flops with it, and repeat, L1 and TLB thrashing your way through the entire list. There are some tricks that work sometimes (preload an L1-sized array of node data and use it for an entire group of particles, then discard it for another preload if there is more data; dimension arrays in the right direction, so you get multiple loads from the same cache line, etc.) but such things don't always work or aren't always useful. I can easily imagine database apps doing things not too dissimilar to this.

With my particular code, I have measured factors of several (\sim 3-4) speedup with large pages compared to small. This was measured on an Origin 3000, where 64k, 1MB and 16MB pages were used. Not a factor of several percent. A factor of several. I have also measured similar sorts of speedups on other types of machines. It is also not a factor related to NUMA. I can see other effects from that source and can distinguish between them.

Another example: Take a code that discretizes space on a grid in 3d and does something to various variables to make them evolve. You've got 3d arrays many GB in size, and for various calculations you have to sweep through them in each direction: x, y and z. Going in the z direction means that you are leaping across huge slices of memory every time you increment the grid zone by 1. In some codes only a few calculations are needed per zone. For example you want to take a derivative:

  deriv = (rho(i,j,k+1) - rho(i,j,k-1))/dz(k)

(I speak fortran, so the last index is the slow one here). Again, every calculation strides through huge distances and gets you a TLB miss or several (see the C sketch following this message). Note for the unwary: it usually does not make sense to transpose the arrays so that the fast index is the one you work with.
You don't have enough memory, for one thing, and you pay for the TLB overhead in the transpose anyway.

In both examples, with large pages the chances of getting a TLB hit are far, far higher than with small pages. That means I want truly huge pages. Assuming pages at all (various arches don't have them I think), a single one that covered my whole memory would be fine. Other codes don't seem to benefit so much from large pages, or even benefit from small pages, though my experience is minimal with such codes. Other folks run them on the same machines I do though:

2) The last paragraph above is important because of the way HPC works as an industry. We often don't just have a dedicated machine to run on, that gets booted once and one dedicated application runs on it till it dies or gets rebooted again. Many jobs run on the same machine. Some jobs run for weeks. Others run for a few hours over and over again. Some run massively parallel. Some run throughput.

How is this situation handled? With a batch scheduler. You submit a job to run and ask for X cpus, Y memory and Z time. It goes and fits you in wherever it can. cpusets were helpful infrastructure in linux for this. You may get some cpus on one side of the machine, some more on the other, and memory associated with still others. They do a pretty good job of allocating resources sanely, but there is only so much that they can do.

The important point here for page-related discussions is that someone, you don't know who, was running on those cpus and memory before you. And doing Ghu Knows What with it. This code could be running something that benefits from small pages, or it could be running with large pages. It could be dynamically allocating and freeing large or small blocks of memory, or it could be allocating everything at the beginning and running statically thereafter. Different codes do different things. That means that the memory state could be totally fubar'ed before your job ever gets any time allocated to it.

>Nobody takes a random machine and says "ok, we'll now put our most
>performance-critical database on this machine, and oh, btw, you can't
>reboot it and tune for it beforehand".

Wanna bet? What I wrote above makes tuning the machine itself totally ineffective. What do you tune for? Tuning for one person's code makes someone else's slower. Tuning for the same code on one input makes another input run horribly. You also can't be rebooting after every job. What about all the other ones that weren't done yet? You'd piss off everyone running there, and it takes too long besides.

What about a machine that is running multiple instances of some database, some bigger or smaller than others, or doing other kinds of work? Do you penalize the big ones or the small ones, this kind of work or that? You also can't establish zones that can't be changed on the fly as things on the system change. How do zones like that fit into numa? How do things work when suddenly you've got a job that wants the entire memory filled with large pages and you've only got half your system set up for large pages? What if you tune the system that way and then let that job run, and for some stupid user reason it dies 10 minutes after starting? Do you let the 30 other jobs in the queue sit idle because they want a different page distribution?

This way lies madness. Sysadmins just say no and set up the machine as stably as they can, usually with something not too different from whatever the manufacturer recommends as a default. For very good reasons.
I would bet the only kind of zone stuff that could even possibly work would be related to a cpu/memset zone arrangement. See below.

3) I have experimented quite a bit with the page merge infrastructure that exists on irix. I understand that similar large page and merge infrastructure exists on solaris, though I haven't run on such systems. I can get very good page distributions if I run immediately after reboot. I get progressively worse distributions if my job runs only a few days or weeks later. My experience is that after some days or weeks of running have gone by, there is no possible way short of a reboot to get pages merged effectively back to any pristine state with the infrastructure that exists there. Some improvement can be had, however, with a bit of pain.

What I would like to see is not a theoretical, general, all-purpose defragmentation and hotplug scheme, but one that can work effectively with the kinds of constraints that a batch scheduler imposes. I would even imagine that a more general scheduler type of situation could be effective if that scheduler was smart enough. God knows, the scheduler in linux has been rewritten often enough. What is one more time for this purpose too?

You may claim that this sort of merge stuff requires excessive time for the OS. Nothing could matter to me less. I've got those cpu's full time for the next X days, and if I want them to spend the first 5 minutes or whatever of my run making the place comfortable, so that my job gets done three days earlier, then I want to spend that time.

4) The thing is that all of this memory management at this level is not the batch scheduler's job, it's the OS's job. The thing that will make it work is that in the case of a reasonably intelligent batch scheduler (there are many), you are absolutely certain that nothing else is running on those cpus and that memory. Except whatever the kernel sprinkled in and didn't clean up afterwards. So why can't the kernel clean up after itself? Why does the kernel need to keep anything in this memory anyway? I supposedly have a guarantee that it is mine, but it goes and immediately violates that guarantee long before I even get started. I want all that kernel stuff gone from my allocation and reset to a nice, sane pristine state.

The thing that would make all of it work is good fragmentation and hotplug type stuff in the kernel. Push everything that the kernel did to my memory into the bitbucket and start over. There shouldn't be anything there that it needs to remember from before anyway. Perhaps this is what the defragmentation stuff is supposed to help with. Probably it has other uses that aren't on my agenda. Like pulling out bad ram sticks or whatever. Perhaps there are things that need to be remembered. Certainly being able to hotunplug those pieces would do it. Just do everything but unplug it from the board, and then do a hotplug to turn it back on.

5) You seem to claim that issues I wrote about above are 'theoretical general cases'. They are not, at least not to any more people than the 0.01% of people who regularly time their kernel builds, as I saw someone doing some emails ago. Using that sort of argument as a reason not to incorporate this sort of infrastructure just about made me fall out of my chair, especially in the context of keeping the sane case sane.
Since this thread has long since lost decency and meaning and descended into name calling, I suppose I'll pitch in with that too, on two fronts:

1) I'd say someone making that sort of argument is doing some very serious navel gazing.

2) Here's a cluebat: that ain't one of the sane cases you wrote about.

That said, it appears to me there are a variety of constituencies that have some serious interest in this infrastructure:

1) HPC stuff
2) big database stuff
3) people who are pushing hotplug for other reasons, like the bad memory replacement stuff I saw discussed
4) whatever else the hotplug folk want that I don't follow

Seems to me that is a bit more than 0.01%.

>When you hear voices in your head that tell you to shoot the pope, do you
>do what they say? Same thing goes for customers and managers. They are the
>crazy voices in your head, and you need to set them right, not just
>blindly do what they ask for.

I don't care if you do what I ask for, but I do start getting irate and start writing long annoyed letters if I can't do what I need to do, and find out that someone could do something about it but refuses.

That said, I'm not so hot any more so I'll just unplug now.

Andy Nelson

PS: I read these lists at an archive, so if responders want to rm me from any cc's that is fine. I'll still read what I want or need to from there.

--
Andy Nelson                      Theoretical Astrophysics Division (T-6)
andy dot nelson at lanl dot gov  Los Alamos National Laboratory
http://www.phys.lsu.edu/~andy    Los Alamos, NM 87545

^ permalink raw reply	[flat|nested] 241+ messages in thread
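To make the page-size arithmetic in Andy's grid example concrete, here is a small C rendering of the z-direction sweep he describes (an editorial sketch with illustrative sizes, not code from the thread): each step in k jumps a whole NX*NY plane of doubles, 2 MB here, so with 4 KB pages every access lands on a different page, while a single 16 MB page covers eight consecutive k-planes.

#include <stdlib.h>

enum { NX = 512, NY = 512, NZ = 64 };	/* illustrative grid sizes */

/*
 * z-direction derivative sweep over a Fortran-ordered (x fastest)
 * grid flattened into a 1-D array.  Consecutive k values are
 * NX*NY*sizeof(double) = 2 MB apart, so with 4 KB pages every access
 * below touches a different page and likely misses the TLB; with
 * 16 MB pages, eight whole k-planes share one TLB entry.
 */
static void z_derivative(const double *rho, double *deriv,
			 const double *dz, int i, int j)
{
	const size_t plane = (size_t)NX * NY;
	const size_t base = (size_t)j * NX + i;

	for (int k = 1; k < NZ - 1; k++)
		deriv[base + k * plane] =
			(rho[base + (k + 1) * plane] -
			 rho[base + (k - 1) * plane]) / dz[k];
}

The arrays would be heap-allocated in practice (they are far too large for the stack); the indexing, not the allocation, is the point of the sketch.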
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 1:00 Andy Nelson @ 2005-11-04 1:16 ` Martin J. Bligh 2005-11-04 1:27 ` Nick Piggin 2005-11-04 5:14 ` Linus Torvalds 1 sibling, 1 reply; 241+ messages in thread From: Martin J. Bligh @ 2005-11-04 1:16 UTC (permalink / raw) To: Andy Nelson, torvalds Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo, nickpiggin > Linus writes: > >> Just face it - people who want memory hotplug had better know that >> beforehand (and let's be honest - in practice it's only going to work in >> virtualized environments or in environments where you can insert the new >> bank of memory and copy it over and remove the old one with hw support). >> >> Same as hugetlb. >> >> Nobody sane _cares_. Nobody sane is asking for these things. Only people >> with special needs are asking for it, and they know their needs. > > > Hello, my name is Andy. I am insane. I am one of the CRAZY PEOPLE you wrote > about. To provide a slightly shorter version ... we had one customer running similarly large number crunching things in Fortran. Their app ran 25% faster with large pages (not a typo). Because they ran a variety of jobs in batch mode, they need large pages sometimes, and small pages at others - hence they need to dynamically resize the pool. That's the sort of thing we were trying to fix with dynamically sized hugepage pools. It does make a huge difference to real-world customers. M. ^ permalink raw reply [flat|nested] 241+ messages in thread
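For reference, the pool Martin describes resizing is driven today through a single global knob; the commands below show the existing interface, with an illustrative size:

  # reserve 512 huge pages (1 GB on x86 with 2 MB pages); this can
  # silently fall short if memory is already fragmented, which is
  # exactly the problem under discussion
  echo 512 > /proc/sys/vm/nr_hugepages
  grep HugePages /proc/meminfo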
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 1:16 ` Martin J. Bligh
@ 2005-11-04 1:27 ` Nick Piggin
0 siblings, 0 replies; 241+ messages in thread
From: Nick Piggin @ 2005-11-04 1:27 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andy Nelson, torvalds, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo

Martin J. Bligh wrote:
>
> To provide a slightly shorter version ... we had one customer running
> similarly large number crunching things in Fortran. Their app ran 25%
> faster with large pages (not a typo). Because they ran a variety of
> jobs in batch mode, they need large pages sometimes, and small pages
> at others - hence they need to dynamically resize the pool.
>
> That's the sort of thing we were trying to fix with dynamically sized
> hugepage pools. It does make a huge difference to real-world customers.
>

Aren't HPC users very easy? In fact, probably the easiest, because they are generally not very kernel intensive (apart from perhaps some batches of IO at the beginning and end of the jobs). A reclaimable zone should provide exactly what they need. I assume the sysadmin can give some reasonable upper and lower estimates of the memory requirements.

They don't need to dynamically resize the pool because it is all being allocated to pagecache anyway, so all jobs are satisfied from the reclaimable zone.

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 1:00 Andy Nelson
2005-11-04 1:16 ` Martin J. Bligh
@ 2005-11-04 5:14 ` Linus Torvalds
2005-11-04 6:10 ` Paul Jackson
2005-11-04 14:56 ` Andy Nelson
1 sibling, 2 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 5:14 UTC (permalink / raw)
To: Andy Nelson
Cc: mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo, nickpiggin

On Thu, 3 Nov 2005, Andy Nelson wrote:
>
> I have done high performance computing in astrophysics for nearly two
> decades now. It gives me a perspective that kernel developers usually
> don't have, but sometimes need. For my part, I promise that I specifically
> do *not* have the perspective of a kernel developer. I don't even speak C.

Hey, cool. You're a physicist, and you'd like to get closer to 100% efficiency out of your computer.

And that's really nice, because maybe we can strike a deal. Because I also have a problem with my computer, and a physicist might just help _me_ get closer to 100% efficiency out of _my_ computer.

Let me explain. I've got a laptop that takes about 45W, maybe 60W under load. And it has a battery that weighs about 350 grams.

Now, I know that if I were to get 100% energy efficiency out of that battery, a trivial physics calculation tells me that e=mc^2, and that my battery _should_ have a hell of a lot of energy in it. In fact, according to my simplistic calculations, it turns out that my laptop _should_ have a battery life of tens of millions of years.

It turns out that isn't really the case in practice, but I'm hoping you can help me out. I obviously don't need it to be really 100% efficient, but on the other hand, I'd also like the battery to be slightly lighter, so if you could just make sure that it's at least _slightly_ closer to the theoretical values I should be getting out of it, maybe I wouldn't need to find one of those nasty electrical outlets every few hours.

Do we have a deal? After all, you only need to improve my battery efficiency by a really _tiny_ amount, and I'll never need to recharge it again. And I'll fix your problem.

Or are you maybe willing to make a few compromises in the name of being realistic, and living with something less than the theoretical peak performance of what you're doing? I'm willing to compromise by using only the chemical energy of the processes involved, and not even a hundred percent efficiency at that. Maybe you'd be willing to compromise by using a few kernel boot-time command line options for your not-very-common load.

Ok?

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
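For anyone checking the arithmetic behind the joke, assuming the 350 g battery and the 60 W load quoted above:

  E = mc^2 = 0.35 kg * (3*10^8 m/s)^2 ~ 3.1*10^16 J
  t = E/P  = 3.1*10^16 J / 60 W ~ 5.2*10^14 s ~ 1.7*10^7 years

So full mass-energy conversion works out to tens of millions of years of battery life - absurd enough either way for the argument being made.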
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 5:14 ` Linus Torvalds
@ 2005-11-04 6:10 ` Paul Jackson
2005-11-04 6:38 ` Ingo Molnar
2005-11-04 7:44 ` Eric Dumazet
1 sibling, 2 replies; 241+ messages in thread
From: Paul Jackson @ 2005-11-04 6:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo, nickpiggin

Linus wrote:
> Maybe you'd be willing to compromise by using a few kernel boot-time
> command line options for your not-very-common load.

If we were only a few options away from running Andy's varying load mix with something close to ideal performance, we'd be in fat city, and Andy would never have been driven to write that rant.

There's more to it than that, but it is not as impossible as a battery with the efficiencies you (and the rest of us) dream of. Andy has used systems that resemble what he is seeking. So he is not asking for something clearly impossible. Though it might not yet be possible, in ways that contribute to a continuing healthy kernel code base.

It's an interesting challenge - finding ways to improve the kernel's performance on such high end loads, that are also suitable and desirable (or at least innocent enough) for inclusion in a kernel far more widely used in embeddeds, desktops and ordinary servers.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 6:10 ` Paul Jackson
@ 2005-11-04 6:38 ` Ingo Molnar
2005-11-04 7:26 ` Paul Jackson
2005-11-04 15:31 ` Linus Torvalds
1 sibling, 2 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 6:38 UTC (permalink / raw)
To: Paul Jackson
Cc: Linus Torvalds, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

* Paul Jackson <pj@sgi.com> wrote:

> Linus wrote:
> > Maybe you'd be willing to compromise by using a few kernel boot-time
> > command line options for your not-very-common load.
>
> If we were only a few options away from running Andy's varying load
> mix with something close to ideal performance, we'd be in fat city,
> and Andy would never have been driven to write that rant.
>
> There's more to it than that, but it is not as impossible as a battery
> with the efficiencies you (and the rest of us) dream of.

Just to make sure I didn't get it wrong: wouldn't we get most of the benefits Andy is seeking by having a boot-time option which sets aside a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool - with the growing happening on a best-effort basis, without guarantees?

I implemented precisely such a scheme for 'bigpages' years ago, and it worked reasonably well. (I was lazy and didn't implement it as a resizable zone, but as a list of large pages taken straight off the buddy allocator. This made dynamic resizing really easy, and I didn't have to muck with the buddy and mem_map[] data structures that zone-resizing forces us to do. It had the disadvantage of those pages skewing the memory balance of the affected zone.)

My quick solution was good enough that on a test-system I could resize the pool across Oracle test-runs, when the box was otherwise quiet. I'd expect a well-controlled HPC system to be equally resizable.

What we cannot offer is a guarantee to be able to grow the pool. Hence the /proc mechanism would be called:

	/proc/sys/vm/try_to_grow_hugemem_pool

to clearly stress the 'might easily fail' restriction. But if userspace is well-behaved on Andy's systems (which it seems to be), then in practice it should be resizable. On a generic system, only the boot-time option is guaranteed to allocate as much RAM as possible.

And once this functionality has been clearly communicated and separated, the 'try to alloc a large page' thing could become more aggressive: it could attempt to construct large pages if it can. I don't think we object to such a capability, as long as the restrictions are clearly communicated. (And no, that doesn't mean some obscure Documentation/ entry - the restrictions have to be obvious from the primary way of usage. I.e. no /proc/sys/vm/hugemem_pool_size thing where growing could fail.)

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
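A minimal sketch of the "best effort, may fail" grow operation Ingo describes (editorial; enqueue_huge_page() is an assumed helper standing in for whatever the hugetlb pool uses to absorb a fresh page, not a quote of real code):

/*
 * Sketch of a best-effort pool grow in the spirit of
 * /proc/sys/vm/try_to_grow_hugemem_pool: pull huge-page-order blocks
 * off the buddy allocator, stop at the first failure, and report how
 * far we actually got.  No guarantees, by design.
 */
#include <linux/mm.h>
#include <linux/hugetlb.h>

static unsigned long try_to_grow_hugemem_pool(unsigned long nr_wanted)
{
	unsigned long nr_grown = 0;

	while (nr_grown < nr_wanted) {
		/* __GFP_COMP: huge pages are compound pages */
		struct page *page = alloc_pages(GFP_HIGHUSER | __GFP_COMP |
						__GFP_NOWARN,
						HUGETLB_PAGE_ORDER);
		if (!page)
			break;		/* might easily fail - that's the contract */
		enqueue_huge_page(page);	/* assumed pool helper */
		nr_grown++;
	}
	return nr_grown;
}

Returning the achieved count rather than an error is one way to make the "might easily fail" semantics obvious at the primary point of usage, as the text above insists.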
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-04 6:38 ` Ingo Molnar @ 2005-11-04 7:26 ` Paul Jackson 2005-11-04 7:37 ` Ingo Molnar 2005-11-04 15:31 ` Linus Torvalds 1 sibling, 1 reply; 241+ messages in thread From: Paul Jackson @ 2005-11-04 7:26 UTC (permalink / raw) To: Ingo Molnar Cc: torvalds, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin Ingo wrote: > to clearly stress the 'might easily fail' restriction. But if userspace > is well-behaved on Andy's systems (which it seems to be), then in > practice it should be resizable. At first glance, this is the sticky point that jumps out at me. Andy wrote: > My experience is that after some days or weeks of running have gone > by, there is no possible way short of a reboot to get pages merged > effectively back to any pristine state with the infrastructure that > exists there. I take it, from what Andy writes, and from my other experience with similar customers, that his workload is not "well-behaved" in the sense you hoped for. After several diverse jobs are run, we cannot, so far as I know, merge small pages back to big pages. I have not played with Mel Gorman's Fragmentation Avoidance patches, so don't know if they would provide a substantial improvement here. They well might. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 7:26 ` Paul Jackson
@ 2005-11-04 7:37 ` Ingo Molnar
0 siblings, 0 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 7:37 UTC (permalink / raw)
To: Paul Jackson
Cc: torvalds, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

* Paul Jackson <pj@sgi.com> wrote:

> At first glance, this is the sticky point that jumps out at me.
>
> Andy wrote:
> > My experience is that after some days or weeks of running have gone
> > by, there is no possible way short of a reboot to get pages merged
> > effectively back to any pristine state with the infrastructure that
> > exists there.
>
> I take it, from what Andy writes, and from my other experience with
> similar customers, that his workload is not "well-behaved" in the
> sense you hoped for.
>
> After several diverse jobs are run, we cannot, so far as I know, merge
> small pages back to big pages.

OK, so the zone solution it has to be. I.e. the moment it's a separate special zone, you can boot with most of the RAM being in that zone, and you are all set. It can be used both for hugetlb allocations, and for other PAGE_SIZE allocations as well, in a highmem fashion. These HPC setups are rarely kernel-intense. Thus the only dynamic sizing decision that has to be taken is to determine the amount of 'generic kernel RAM' that is needed in the worst case.

To give an example: say on a 256 GB box, set aside 8 GB for generic kernel needs, and have 248 GB in the hugemem zone. This leaves us with the following scenario: apps can use up to 97% of all RAM for hugemem, and they can use up to 100% of all RAM for PAGE_SIZE allocations. 3% of RAM can be used by generic kernel needs. Sounds pretty reasonable and straightforward from a system management point of view. No runtime resizing, but it wouldn't be needed, unless kernel activity needs more than 8 GB of RAM.

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 6:38 ` Ingo Molnar
2005-11-04 7:26 ` Paul Jackson
@ 2005-11-04 15:31 ` Linus Torvalds
2005-11-04 15:39 ` Martin J. Bligh
` (2 more replies)
1 sibling, 3 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 15:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

On Fri, 4 Nov 2005, Ingo Molnar wrote:
>
> Just to make sure I didn't get it wrong: wouldn't we get most of the
> benefits Andy is seeking by having a boot-time option which sets aside
> a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
> - with the growing happening on a best-effort basis, without guarantees?

Boot-time option to set the hugetlb zone, yes.

Grow-or-shrink, probably not. Not in practice after bootup on any machine that is less than idle.

The zones have to be pretty big to make any sense. You don't just grow them or shrink them - they'd be on the order of tens of megabytes to gigabytes. In other words, sized big enough that you will _not_ be able to create them on demand, except perhaps right after boot.

Growing these things later simply isn't reasonable. I can pretty much guarantee that any kernel I maintain will never have dynamic kernel pointers: when some memory has been allocated with kmalloc() (or equivalent routines - pretty much _any_ kernel allocation), it stays put. Which means that if there is a _single_ kernel alloc in such a zone, it won't then ever be usable for hugetlb stuff.

And I don't want excessive complexity. We can have things like "turn off kernel allocations from this zone", and then wait a day or two, and hope that there aren't long-term allocs. It might even work occasionally. But the fact is, a number of kernel allocations _are_ long-term (superblocks, root dentries, "struct thread_struct" for long-running user daemons), and it's simply not going to work well in practice unless you have set aside the "no kernel alloc" zone pretty early on.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:31 ` Linus Torvalds
@ 2005-11-04 15:39 ` Martin J. Bligh
2005-11-04 15:53 ` Ingo Molnar
2005-11-06 8:44 ` Kyle Moffett
2 siblings, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-04 15:39 UTC (permalink / raw)
To: Linus Torvalds, Ingo Molnar
Cc: Paul Jackson, andy, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

>> Just to make sure I didn't get it wrong: wouldn't we get most of the
>> benefits Andy is seeking by having a boot-time option which sets aside
>> a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
>> - with the growing happening on a best-effort basis, without guarantees?
>
> Boot-time option to set the hugetlb zone, yes.
>
> Grow-or-shrink, probably not. Not in practice after bootup on any machine
> that is less than idle.
>
> The zones have to be pretty big to make any sense. You don't just grow
> them or shrink them - they'd be on the order of tens of megabytes to
> gigabytes. In other words, sized big enough that you will _not_ be able to
> create them on demand, except perhaps right after boot.
>
> Growing these things later simply isn't reasonable. I can pretty much
> guarantee that any kernel I maintain will never have dynamic kernel
> pointers: when some memory has been allocated with kmalloc() (or
> equivalent routines - pretty much _any_ kernel allocation), it stays put.
> Which means that if there is a _single_ kernel alloc in such a zone, it
> won't then ever be usable for hugetlb stuff.
>
> And I don't want excessive complexity. We can have things like "turn off
> kernel allocations from this zone", and then wait a day or two, and hope
> that there aren't long-term allocs. It might even work occasionally. But
> the fact is, a number of kernel allocations _are_ long-term (superblocks,
> root dentries, "struct thread_struct" for long-running user daemons), and
> it's simply not going to work well in practice unless you have set aside
> the "no kernel alloc" zone pretty early on.

Exactly. But that's what all the anti-fragmentation stuff was about - trying to pack unfreeable stuff together.

I don't think anyone is proposing dynamic kernel pointers inside Linux, except in that we could possibly change the P-V mapping underneath from the hypervisor, so that the phys address would change, but you wouldn't see it. Trouble is, that's mostly done on a larger-than-page size granularity, so we need SOME larger chunk to switch out (preferably at least a large-page size, so we can continue to use large TLB entries for the kernel mapping).

However, the statically sized option is hugely problematic too.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:31 ` Linus Torvalds
2005-11-04 15:39 ` Martin J. Bligh
@ 2005-11-04 15:53 ` Ingo Molnar
2005-11-06 7:34 ` Paul Jackson
1 sibling, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 15:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

* Linus Torvalds <torvalds@osdl.org> wrote:

> Boot-time option to set the hugetlb zone, yes.
>
> Grow-or-shrink, probably not. Not in practice after bootup on any
> machine that is less than idle.
>
> The zones have to be pretty big to make any sense. You don't just grow
> them or shrink them - they'd be on the order of tens of megabytes to
> gigabytes. In other words, sized big enough that you will _not_ be
> able to create them on demand, except perhaps right after boot.

I think the current hugepages=<N> boot option could transparently be morphed into a 'separate zone' approach, and /proc/sys/vm/nr_hugepages would just refuse to change (or would go away altogether). Dynamically growing zones seem like a lot of trouble, without much gain.

[ OTOH the hugepages= parameter unit should be changed from the current 'number of hugepages' to plain RAM metrics - megabytes/gigabytes. ]

That would solve two problems: any 'zone VM statistics skewing effect' of the current hugetlbs (which is a preallocated list of really large pages) would go away, and the hugetlb zone could potentially be utilized for easily freeable objects.

This would already be a lot more flexible than what we have: the hugetlb area would not be 'lost' altogether, like now. Once we are at this stage we can see how usable it is in practice. I strongly suspect it will cover most of the HPC uses.

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
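Concretely, the unit change suggested above would look something like this on the boot command line; the first form is the existing option, the second is hypothetical:

  hugepages=512     (today: a count of huge pages; 1 GB on x86 with 2 MB pages)
  hugepages=1G      (proposed: a plain RAM metric sizing the hugetlb zone)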
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:53 ` Ingo Molnar
@ 2005-11-06 7:34 ` Paul Jackson
2005-11-06 15:55 ` Linus Torvalds
0 siblings, 1 reply; 241+ messages in thread
From: Paul Jackson @ 2005-11-06 7:34 UTC (permalink / raw)
To: Ingo Molnar
Cc: torvalds, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

Ingo wrote:
> I think the current hugepages=<N> boot option could transparently be
> morphed into a 'separate zone' approach, and ...
>
> This would already be a lot more flexible than what we have: the hugetlb
> area would not be 'lost' altogether, like now. Once we are at this stage
> we can see how usable it is in practice. I strongly suspect it will
> cover most of the HPC uses.

It seems to me this is making it harder than it should be. You're trying to create a zone that is 100% cleanable, whereas the HPC folks only desire 99.8% cleanable.

Unlike the hot(un)plug folks, the HPC folks don't mind a few pages of Linus's unmoveable kmalloc memory in their way. They rather expect that some modest percentage of each node will have some 'kernel stuff' on it that refuses to move. They just want to be able to free up most of the pages on a node, once one job is done there, before the next job begins. They are also quite willing (based on my experience with bootcpusets) to designate a few nodes for the 'general purpose Unix load', and reserve the remaining nodes just to run their special jobs.

On the other hand, as Eric Dumazet mentions on another subthread of this topic, requiring that their apps use the hugetlbfs interface to place the bulk of their memory would be a serious obstacle. Their apps are already fairly tightly wound around a rich variety of compiler, tool, library and runtime memory placement mechanisms, and they would be hard-pressed to make systematic changes in that.

I suspect that the answers lie in some further improvements in memory placement on various nodes. Perhaps this means a cpuset option to put the easily reclaimed (what Mel Gorman's patch would mark with __GFP_EASYRCLM) kernel pages and the user pages on the nodes of the current cpuset, but to prefer placing the less easily reclaimed pages on the bootcpuset nodes. Then, when a job on such a dedicated set of nodes completed, most of the memory would be easily reclaimable, in preparation for the next job.

The bootcpuset stuff is entirely invisible to kernel hackers, because I am doing it entirely in user space, with a pre-init program that configures the bootcpuset, moves the unpinned kernel threads into the bootcpuset, and fires up the real init in that bootcpuset.

With one more twist to the cpuset API, providing a way to state per-cpuset a separate set of nodes (on what the HPC folks would call their bootcpuset) as the preferred place to allocate not-EASYRCLM kernel memory, we might be very close to meeting these HPC needs, with no changes to or reliance on hugetlbs, with no changes to the kernel boottime code, and with no changes to the memory management mechanisms used within these HPC apps.

I am imagining yet another per-cpuset field, which I call 'kmems'. It would be a nodemask, as is the current 'mems' field. I'd pick up the __GFP_EASYRCLM flag of Mel Gorman's patch (no comment on suitability of the rest of his patch), and prefer to place __GFP_EASYRCLM pages on the 'mems' nodes, but spread other pages evenly across the 'kmems' nodes.
For compatibility with the current cpuset API, an unset 'kmems' would tell the kernel to use the 'mems' setting as a fallback.

The hardest part might be providing a mechanism, to be invoked by the batch scheduler between jobs, to flush the easily reclaimed memory off a node (free it or write it to disk). Again, unlike the hot(un)plug folks, a 98% solution is plenty good enough.

This will have to be coded and some HPC-type loads tried on it, before we know if it flies.

There is an obvious, unanswered question here. Would moving some of the kernel's pages (the not easily reclaimed pages) off the current (faulting) node into some possibly far-off node be an acceptable price to pay, to increase the percentage of the dedicated job nodes that can be freed up between jobs? Since these HPC jobs tend to be far more sensitive to their own internal data placement than they are to the kernel's internal data placement, I am hopeful that this tradeoff is a good one, for HPC apps.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 241+ messages in thread
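In terms of the existing cpuset filesystem, the proposal might be driven like this; the 'kmems' file is hypothetical (it is exactly the new twist being proposed), while everything else is the current interface:

  mount -t cpuset cpuset /dev/cpuset
  mkdir /dev/cpuset/hpcjob
  echo 4-63 > /dev/cpuset/hpcjob/mems    (job nodes: user pages and EASYRCLM kernel pages)
  echo 0-3  > /dev/cpuset/hpcjob/kmems   (hypothetical: nodes preferred for hard-to-reclaim kernel pages)

With an unset 'kmems' falling back to 'mems', existing cpuset configurations would behave exactly as today.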
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-06 7:34 ` Paul Jackson @ 2005-11-06 15:55 ` Linus Torvalds 2005-11-06 18:18 ` Paul Jackson 0 siblings, 1 reply; 241+ messages in thread From: Linus Torvalds @ 2005-11-06 15:55 UTC (permalink / raw) To: Paul Jackson Cc: Ingo Molnar, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin On Sat, 5 Nov 2005, Paul Jackson wrote: > > It seems to me this is making it harder than it should be. You're > trying to create a zone that is 100% cleanable, whereas the HPC folks > only desire 99.8% cleanable. Well, 99.8% is pretty borderline. > Unlike the hot(un)plug folks, the HPC folks don't mind a few pages of > Linus's unmoveable kmalloc memory in their way. They rather expect > that some modest percentage of each node will have some 'kernel stuff' > on it that refuses to move. The thing is, if 99.8% of memory is cleanable, the 0.2% is still enough to make pretty much _every_ hugepage in the system pinned down. Besides, right now, it's not 99.8% anyway. Not even close. It's more like 60%, and then horribly horribly ugly hacks that try to do something about the remaining 40% and usually fail (the hacks might get it closer to 99%, but they are fragile, expensive, and ugly as hell). It used to be that HIGHMEM pages were always cleanable on x86, but even that isn't true any more, since now at least pipe buffers can be there too. I agree that HPC people are usually a bit less up-tight about things than database people tend to be, and many of them won't care at all, but if you want hugetlb, you'll need big areas. Side note: the exact size of hugetlb is obviously architecture-specific, and the size matters a lot. On x86, for example, hugetlb pages are either 2MB or 4MB in size (and apparently 2GB may be coming). I assume that's where you got the 99.8% from (4kB out of 2M). Other platforms have more flexibility, but sometimes want bigger areas still. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-06 15:55 ` Linus Torvalds
@ 2005-11-06 18:18 ` Paul Jackson
0 siblings, 0 replies; 241+ messages in thread
From: Paul Jackson @ 2005-11-06 18:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: mingo, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

Linus wrote:
> The thing is, if 99.8% of memory is cleanable, the 0.2% is still enough to
> make pretty much _every_ hugepage in the system pinned down.

Agreed.

I realized after writing this that I wasn't clear on something. I wasn't focused on the subject of this thread, adding hugetlb pages after the system has been up a while. I was focusing on a related subject - freeing up most of the ordinary size pages on the dedicated application nodes between jobs on a large system using

 * a bootcpuset (for the classic Unix load) and
 * dedicated nodes (for the HPC apps).

I am looking to provide the combination of:

 1) specifying some hugetlb pages at system boot, plus
 2) the ability to clean off most of the ordinary sized pages from the application nodes between jobs.

Perhaps Andy or some of my HPC customers wish I was also looking to provide:

 3) the ability to add lots of hugetlb pages on the application nodes after the system has run a while.

But if they are, then they have some more educatin' to do on me. For now, I am sympathetic to your concerns with code and locking complexity. Freeing up great globs of hugetlb sized contiguous chunks of memory after a system has run a while would be hard.

We have to be careful which hard problems we decide to take on. We can't take on too many, and we have to pick ones that will provide a major long term advantage to Linux, over the foreseeable changes in system hardware and architecture.

Even if most of the processors that Andy has tested against would benefit from dynamically added hugetlb pages, if we can anticipate that this will not be a sustained opportunity for Linux (and looking at current x86 chips doesn't require much anticipating) then that might not be the place to invest our precious core complexity dollars.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:31 ` Linus Torvalds
2005-11-04 15:39 ` Martin J. Bligh
2005-11-04 15:53 ` Ingo Molnar
@ 2005-11-06 8:44 ` Kyle Moffett
2005-11-06 16:12 ` Linus Torvalds
2 siblings, 1 reply; 241+ messages in thread
From: Kyle Moffett @ 2005-11-06 8:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

On Nov 4, 2005, at 10:31:48, Linus Torvalds wrote:
> I can pretty much guarantee that any kernel I maintain will never
> have dynamic kernel pointers: when some memory has been allocated
> with kmalloc() (or equivalent routines - pretty much _any_ kernel
> allocation), it stays put.

Hmm, this brings up something that I haven't seen discussed on this list (maybe a long time ago, but perhaps it should be brought up again?). What are the pros/cons to having a non-physically-linear kernel virtual memory space? Would it be theoretically possible to allow some kind of dynamic kernel page swapping, such that the _same_ kernel-virtual pointer goes to a different physical memory page? That would definitely satisfy the memory hotplug people, but I don't know what the tradeoffs would be for normal boxen.

It seems like the trick would be to make sure that page accesses _during_ the swap are correctly handled. If the page-swapper included code in the kernel fault handler to notice that a page was in the process of being swapped out/in by another CPU, it could just wait for swap-in to finish and then resume from the new page. This would get messy with DMA and non-cpu memory accessors and such, which I assume is why this hasn't been implemented in the past.

From what I can see, the really dumb-obvious-slow method would be to call the first and last parts of software-suspend. As memory hotplug is a relatively rare event, this would probably work well enough given the requirements:

1) Run software suspend pre-memory-dump code
2) Move pages off the to-be-removed node, remapping the kernel space to the new locations.
3) Mark the node so that new pages don't end up on it
4) Run software suspend post-memory-reload code

<random-guessing>
Perhaps the non-contiguous memory support would be of some help here?
</random-guessing>

Cheers,
Kyle Moffett

--
Simple things should be simple and complex things should be possible
  -- Alan Kay

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-06 8:44 ` Kyle Moffett
@ 2005-11-06 16:12 ` Linus Torvalds
2005-11-06 17:00 ` Linus Torvalds
0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-06 16:12 UTC (permalink / raw)
To: Kyle Moffett
Cc: Ingo Molnar, Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

On Sun, 6 Nov 2005, Kyle Moffett wrote:
>
> Hmm, this brings up something that I haven't seen discussed on this list
> (maybe a long time ago, but perhaps it should be brought up again?). What are
> the pros/cons to having a non-physically-linear kernel virtual memory space?

Well, we _do_ actually have that, and we use it quite a bit. Both vmalloc() and HIGHMEM work that way.

The biggest problem with vmalloc() is that the virtual space is often as constrained as the physical one (ie on old x86-32, the virtual address space is the bigger problem - you may have 36 bits of physical memory, but the kernel has only 30 bits of virtual). But it's quite commonly used for stuff that wants big linear areas.

The HIGHMEM approach works fine, but the overhead of essentially doing a software TLB is quite high, and if we never ever have to do it again on any architecture, I suspect everybody will be pretty happy.

> Would it be theoretically possible to allow some kind of dynamic kernel page
> swapping, such that the _same_ kernel-virtual pointer goes to a different
> physical memory page? That would definitely satisfy the memory hotplug
> people, but I don't know what the tradeoffs would be for normal boxen.

Any virtualization will try to do that, but they _all_ prefer huge pages if they care at all about performance. If you thought the database people wanted big pages, the kernel is worse.

Unlike databases or HPC, the kernel actually wants to use the physical page address quite often, notably for IO (but also for just mapping them into some other virtual address - the user's). And no standard hardware allows you to do that in hw, so we'd end up doing a software page table walk for it (or, more likely, we'd have to make "struct page" bigger).

You could do it today, although at a pretty high cost. And you'd have to forget about supporting any hardware that really wants contiguous memory for DMA (sound cards etc). It just isn't worth it.

Real memory hotplug needs hardware support anyway (if only to buffer the memory electrically). At which point you're much better off supporting some remapping in the buffering too, I'm convinced. There's no _need_ to do these things in software.

		Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
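The "software TLB" cost Linus describes is visible in the ordinary highmem API; a minimal sketch of the standard 2.6-era pattern:

#include <linux/highmem.h>
#include <linux/string.h>

/*
 * Copy a buffer into a page that may live in HIGHMEM: the page has no
 * permanent kernel mapping, so it must be mapped, used, and unmapped
 * around every access.  That map/unmap - and, for kmap(), the global
 * locking behind it - is the software-TLB overhead discussed above.
 */
static void fill_highmem_page(struct page *page, const char *buf, size_t len)
{
	char *vaddr = kmap(page);	/* may sleep; serializes on kmap locks */

	memcpy(vaddr, buf, min(len, (size_t)PAGE_SIZE));
	kunmap(page);
}

A lowmem page would need none of this: its kernel virtual address is a fixed offset from its physical address, which is exactly the property a fully remappable kernel would give up everywhere.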
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 2005-11-06 16:12 ` Linus Torvalds @ 2005-11-06 17:00 ` Linus Torvalds 2005-11-07 8:00 ` Ingo Molnar 0 siblings, 1 reply; 241+ messages in thread From: Linus Torvalds @ 2005-11-06 17:00 UTC (permalink / raw) To: Kyle Moffett Cc: Ingo Molnar, Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin On Sun, 6 Nov 2005, Linus Torvalds wrote: > > And no standard hardware allows you to do that in hw, so we'd end up doing > a software page table walk for it (or, more likely, we'd have to make > "struct page" bigger). > > You could do it today, although at a pretty high cost. And you'd have to > forget about supporting any hardware that really wants contiguous memory > for DMA (sound cards etc). It just isn't worth it. Btw, in case it wasn't clear: the cost of these kinds of things in the kernel is usually not so much the actual "lookup" (whether with hw assist or with another field in the "struct page"). The biggest cost of almost everything in the kernel these days is the extra code-footprint of yet another abstraction, and the locking cost. For example, the real cost of the highmem mapping seems to be almost _all_ in the locking. It also makes some code-paths more complex, so it's yet another I$ fill for the kernel. So a remappable kernel tends to be different from a remappable user application. A user application _only_ ever sees the actual cost of the TLB walk (which hardware can do quite efficiently and is very amenable indeed to a lot of optimization like OoO and speculative prefetching), but on the kernel level, the remapping itself is the cheapest part. (Yes, user apps can see some of the costs indirectly: they can see the synchronization costs if they do lots of mmap/munmap's, especially if they are threaded. But they really have to work at it to see it, and I doubt the TLB synchronization issues tend to be even on the radar for any user space performance analysis). You could probably do a remappable kernel (modulo the problems with specific devices that want bigger physically contiguous areas than one page) reasonably cheaply on UP. It gets more complex on SMP and with full device access. In fact, I suspect you can ask any Xen developer what their performance problems and worries are. I suspect they much prefer UP clients over SMP ones, and _much_ prefer paravirtualization over running unmodified kernels. So remappable kernels are certainly doable, they just have more fundamental problems than remappable user space _ever_ has. Both from a performance and from a complexity angle. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-06 17:00 ` Linus Torvalds
@ 2005-11-07 8:00 ` Ingo Molnar
2005-11-07 11:00 ` Dave Hansen
0 siblings, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-07 8:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kyle Moffett, Paul Jackson, andy, mbligh, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, nickpiggin

* Linus Torvalds <torvalds@osdl.org> wrote:

> > You could do it today, although at a pretty high cost. And you'd have to
> > forget about supporting any hardware that really wants contiguous memory
> > for DMA (sound cards etc). It just isn't worth it.
>
> Btw, in case it wasn't clear: the cost of these kinds of things in the
> kernel is usually not so much the actual "lookup" (whether with hw
> assist or with another field in the "struct page"). [...]
> So remappable kernels are certainly doable, they just have more
> fundamental problems than remappable user space _ever_ has. Both from
> a performance and from a complexity angle.

Furthermore, it doesn't bring us any closer to removable RAM. The problem is still unsolvable (due to the 'how do you find live pointers to fix up' issue), even if the full kernel VM is 'mapped' at 4K granularity.

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-07 8:00 ` Ingo Molnar
@ 2005-11-07 11:00 ` Dave Hansen
2005-11-07 12:20 ` Ingo Molnar
0 siblings, 1 reply; 241+ messages in thread
From: Dave Hansen @ 2005-11-07 11:00 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Kyle Moffett, Paul Jackson, andy, mbligh, Andrew Morton, arjan, arjanv, kravetz, lhms, Linux Kernel Mailing List, linux-mm, mel, Nick Piggin

On Mon, 2005-11-07 at 09:00 +0100, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@osdl.org> wrote:
> > So remappable kernels are certainly doable, they just have more
> > fundamental problems than remappable user space _ever_ has. Both from
> > a performance and from a complexity angle.
>
> Furthermore, it doesn't bring us any closer to removable RAM. The problem
> is still unsolvable (due to the 'how do you find live pointers to fix
> up' issue), even if the full kernel VM is 'mapped' at 4K granularity.

I'm not sure I understand. If you're remapping, why do you have to find and fix up live pointers? Are you talking about things that require fixed _physical_ addresses?

-- Dave

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-07 11:00 ` Dave Hansen
@ 2005-11-07 12:20 ` Ingo Molnar
2005-11-07 19:34 ` Steven Rostedt
0 siblings, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-07 12:20 UTC (permalink / raw)
To: Dave Hansen
Cc: Linus Torvalds, Kyle Moffett, Paul Jackson, andy, mbligh, Andrew Morton, arjan, arjanv, kravetz, lhms, Linux Kernel Mailing List, linux-mm, mel, Nick Piggin

* Dave Hansen <haveblue@us.ibm.com> wrote:

> On Mon, 2005-11-07 at 09:00 +0100, Ingo Molnar wrote:
> > * Linus Torvalds <torvalds@osdl.org> wrote:
> > > So remappable kernels are certainly doable, they just have more
> > > fundamental problems than remappable user space _ever_ has. Both from
> > > a performance and from a complexity angle.
> >
> > Furthermore, it doesn't bring us any closer to removable RAM. The problem
> > is still unsolvable (due to the 'how do you find live pointers to fix
> > up' issue), even if the full kernel VM is 'mapped' at 4K granularity.
>
> I'm not sure I understand. If you're remapping, why do you have to
> find and fix up live pointers? Are you talking about things that
> require fixed _physical_ addresses?

RAM removal, not RAM replacement. I explained all the variants in an earlier email in this thread. "extending RAM" is relatively easy. "replacing RAM", while doable, is probably undesirable. "removing RAM" is impossible.

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-07 12:20 ` Ingo Molnar
@ 2005-11-07 19:34 ` Steven Rostedt
2005-11-07 23:38 ` Joel Schopp
0 siblings, 1 reply; 241+ messages in thread
From: Steven Rostedt @ 2005-11-07 19:34 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, mel, linux-mm, Linux Kernel Mailing List, lhms, kravetz, arjanv, arjan, Andrew Morton, mbligh, andy, Paul Jackson, Kyle Moffett, Linus Torvalds, Dave Hansen

On Mon, 2005-11-07 at 13:20 +0100, Ingo Molnar wrote:
>
> RAM removal, not RAM replacement. I explained all the variants in an
> earlier email in this thread. "extending RAM" is relatively easy.
> "replacing RAM", while doable, is probably undesirable. "removing RAM"
> is impossible.

Hi Ingo,

I'm usually amused when someone says something is impossible, so I'm wondering exactly "why"? If the one requirement is that there must be enough free memory available to remove, then what's the problem for a fully mapped kernel? Is it the GPT? Or is it drivers that have physical memory mapped?

I'm not sure of the best way to solve the GPT being in the RAM that is to be removed, but there might be a way. Basically stop all activities and update all the tasks->mm.

As for the drivers, one could have an accounting for all physical memory mapped, and disable the driver if it is using the memory that is to be removed.

But other than these, what exactly is the problem with removing RAM?

BTW, I'm not suggesting any of this is a good idea, I just like to understand why something _can't_ be done.

-- Steve

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-07 19:34 ` Steven Rostedt
@ 2005-11-07 23:38 ` Joel Schopp
2005-11-13 2:30 ` Rob Landley
0 siblings, 1 reply; 241+ messages in thread
From: Joel Schopp @ 2005-11-07 23:38 UTC (permalink / raw)
To: Steven Rostedt
Cc: Ingo Molnar, Nick Piggin, mel, linux-mm, Linux Kernel Mailing List, lhms, kravetz, arjanv, arjan, Andrew Morton, mbligh, andy, Paul Jackson, Kyle Moffett, Linus Torvalds, Dave Hansen

>>RAM removal, not RAM replacement. I explained all the variants in an
>>earlier email in this thread. "extending RAM" is relatively easy.
>>"replacing RAM", while doable, is probably undesirable. "removing RAM"
>>is impossible.
>
<snip>
> BTW, I'm not suggesting any of this is a good idea, I just like to
> understand why something _can't_ be done.
>

I'm also of the opinion that if we make the kernel remap, we can "remove RAM". Now, we've had enough people weigh in on this being a bad idea that I'm not going to try it. After all it is fairly complex, quite a bit more so than Mel's reasonable patches. But I think it is possible. The steps would look like this:

Method A:
1. Find some unused RAM (or free some up)
2. Reserve that RAM
3. Copy the active data from the soon to be removed RAM to the reserved RAM
4. Remap the addresses
5. Remove the RAM

This of course requires steps 3 & 4 take place under something like stop_machine_run() to keep the data from changing.

Alternately you could do it like this:

Method B:
1. Find some unused RAM (or free some up)
2. Reserve that RAM
3. Unmap the addresses on the soon to be removed RAM
4. Copy the active data from the soon to be removed RAM to the reserved RAM
5. Remap the addresses
6. Remove the RAM

That would save you the stop_machine_run(), but adds the complication of dealing with faults on pinned memory during the migration.

^ permalink raw reply	[flat|nested] 241+ messages in thread
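Method A's critical section maps naturally onto the existing stop_machine_run() interface; here is a sketch under the assumption that copy_page() does the data move and a hypothetical remap_kernel_addr() redirects the kernel mapping (no such remap primitive exists in the kernel - that is the whole point of the debate above):

#include <linux/mm.h>
#include <linux/stop_machine.h>

struct move_args {
	struct page **from;	/* pages being evacuated */
	struct page **to;	/* reserved replacement pages */
	int nr;
};

/*
 * Runs while every other CPU spins inside stop_machine_run(), so the
 * data being copied cannot change underneath us (Method A, steps 3-4).
 */
static int move_and_remap(void *data)
{
	struct move_args *args = data;
	int i;

	for (i = 0; i < args->nr; i++) {
		copy_page(page_address(args->to[i]),
			  page_address(args->from[i]));
		/* hypothetical primitive: repoint the kernel mapping */
		remap_kernel_addr(args->from[i], args->to[i]);
	}
	return 0;
}

/* caller, step 5 follows on success:
 *	err = stop_machine_run(move_and_remap, &args, NR_CPUS);
 */

The sketch also shows why Linus objects: page_address() assumes a fixed virtual-to-physical relationship for lowmem, so remap_kernel_addr() would have to rewrite or indirect exactly the mappings the rest of the kernel treats as constant.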
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-07 23:38 ` Joel Schopp
@ 2005-11-13 2:30 ` Rob Landley
2005-11-14 1:58 ` Joel Schopp
0 siblings, 1 reply; 241+ messages in thread
From: Rob Landley @ 2005-11-13 2:30 UTC (permalink / raw)
To: Joel Schopp, linux-kernel

On Monday 07 November 2005 17:38, you wrote:
> >>RAM removal, not RAM replacement. I explained all the variants in an
> >>earlier email in this thread. "extending RAM" is relatively easy.
> >>"replacing RAM", while doable, is probably undesirable. "removing RAM"
> >>is impossible.
> >
> <snip>
>
> > BTW, I'm not suggesting any of this is a good idea, I just like to
> > understand why something _can't_ be done.
>
> I'm also of the opinion that if we make the kernel remap, we can
> "remove RAM". Now, we've had enough people weigh in on this being a bad
> idea that I'm not going to try it. After all, it is fairly complex, quite
> a bit more so than Mel's reasonable patches. But I think it is possible.
> The steps would look like this:
>
> Method A:
> 1. Find some unused RAM (or free some up)
> 2. Reserve that RAM
> 3. Copy the active data from the soon to be removed RAM to the reserved RAM
> 4. Remap the addresses
> 5. Remove the RAM
>
> This of course requires that steps 3 & 4 take place under something like
> stop_machine_run() to keep the data from changing.

Actually, what I was thinking is that if you use the swsusp infrastructure to
suspend all processes, all dma, quiesce the heck out of the devices, and
_then_ try to move the kernel... Well, you at least have a much more
controlled problem. Yeah, it's pretty darn intrusive, but if you're doing
"suspend to ram" perhaps the downtime could be only 5 or 10 seconds...

I don't know how much of the problem that leaves unsolved, though.

Rob

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-13 2:30 ` Rob Landley
@ 2005-11-14 1:58 ` Joel Schopp
0 siblings, 0 replies; 241+ messages in thread
From: Joel Schopp @ 2005-11-14 1:58 UTC (permalink / raw)
To: Rob Landley; +Cc: linux-kernel

> Actually, what I was thinking is that if you use the swsusp infrastructure to
> suspend all processes, all dma, quiesce the heck out of the devices, and
> _then_ try to move the kernel... Well, you at least have a much more
> controlled problem. Yeah, it's pretty darn intrusive, but if you're doing
> "suspend to ram" perhaps the downtime could be only 5 or 10 seconds...

I don't think suspend to ram for a memory hotplug remove would be acceptable
to users. The other methods add some complexity to the kernel, but are
transparent to userspace. Downtime of 5 to 10 seconds is really quite a bit
of downtime.

> I don't know how much of the problem that leaves unsolved, though.

It would still require a remappable kernel. And seems intuitively to be
wrong to me. But if you want to try it out I won't stop you. It might even
work.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 6:10 ` Paul Jackson
2005-11-04 6:38 ` Ingo Molnar
@ 2005-11-04 7:44 ` Eric Dumazet
2005-11-07 16:42 ` Adam Litke
1 sibling, 1 reply; 241+ messages in thread
From: Eric Dumazet @ 2005-11-04 7:44 UTC (permalink / raw)
To: Paul Jackson
Cc: Linus Torvalds, andy, mbligh, akpm, arjan, arjanv, haveblue,
kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo, nickpiggin

Paul Jackson wrote:
> Linus wrote:
>
>>Maybe you'd be willing to compromise by using a few kernel boot-time
>>command line options for your not-very-common load.
>
> If we were only a few options away from running Andy's varying load
> mix with something close to ideal performance, we'd be in fat city,
> and Andy would never have been driven to write that rant.

I found hugetlb support in Linux not very practical/usable on NUMA
machines, whether via boot-time parameters or /proc/sys/vm/nr_hugepages.

With this single integer parameter, you cannot allocate 1000 4MB pages on
one specific node, leaving small pages on another node.

I'm not an astrophysicist, nor a DB admin; I'm only trying to partition a
dual node machine between one (NUMA-aware) memory intensive job and all
others (system, network, shells). At least I can reboot it if needed, but
I feel Andy's pain.

There is a /proc/buddyinfo file; maybe we need a /proc/sys/vm/node_hugepages
with a list of integers (one per node)?

Eric

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 7:44 ` Eric Dumazet
@ 2005-11-07 16:42 ` Adam Litke
0 siblings, 0 replies; 241+ messages in thread
From: Adam Litke @ 2005-11-07 16:42 UTC (permalink / raw)
To: Eric Dumazet
Cc: Paul Jackson, Linus Torvalds, andy, mbligh, akpm, arjan, arjanv,
haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo,
nickpiggin

On Fri, 2005-11-04 at 08:44 +0100, Eric Dumazet wrote:
> Paul Jackson wrote:
> > Linus wrote:
> >
> >>Maybe you'd be willing to compromise by using a few kernel boot-time
> >>command line options for your not-very-common load.
> >
> > If we were only a few options away from running Andy's varying load
> > mix with something close to ideal performance, we'd be in fat city,
> > and Andy would never have been driven to write that rant.
>
> I found hugetlb support in Linux not very practical/usable on NUMA
> machines, whether via boot-time parameters or /proc/sys/vm/nr_hugepages.
>
> With this single integer parameter, you cannot allocate 1000 4MB pages
> on one specific node, leaving small pages on another node.
>
> I'm not an astrophysicist, nor a DB admin; I'm only trying to partition
> a dual node machine between one (NUMA-aware) memory intensive job and
> all others (system, network, shells).
> At least I can reboot it if needed, but I feel Andy's pain.
>
> There is a /proc/buddyinfo file; maybe we need a
> /proc/sys/vm/node_hugepages with a list of integers (one per node)?

Or perhaps /sys/devices/system/node/nodeX/nr_hugepages triggers that work
like the current /proc trigger but on a per node basis?

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 241+ messages in thread
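Neither interface existed at the time of this exchange, but either proposal
would be trivial to drive from user space. A minimal sketch, assuming Adam's
hypothetical per-node sysfs trigger (the path below is his suggestion, not a
documented kernel ABI):

/*
 * Hypothetical illustration only: set the huge page count on one NUMA
 * node by writing the proposed per-node sysfs file. Needs root, and a
 * kernel that actually implements the interface.
 */
#include <stdio.h>
#include <stdlib.h>

static int set_node_hugepages(int node, long nr)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/nr_hugepages", node);
	f = fopen(path, "w");
	if (!f)
		return -1;	/* node absent, or interface not implemented */
	fprintf(f, "%ld\n", nr);
	return fclose(f);
}

int main(void)
{
	/* Eric's example: 1000 huge pages on node 0, none on node 1. */
	if (set_node_hugepages(0, 1000) || set_node_hugepages(1, 0)) {
		perror("set_node_hugepages");
		return EXIT_FAILURE;
	}
	return 0;
}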
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 5:14 ` Linus Torvalds
2005-11-04 6:10 ` Paul Jackson
@ 2005-11-04 14:56 ` Andy Nelson
2005-11-04 15:18 ` Ingo Molnar
2005-11-04 16:00 ` Linus Torvalds
1 sibling, 2 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 14:56 UTC (permalink / raw)
To: andy, torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

Linus,

Since my other affiliation is with X2, which also goes by the name
Thermonuclear Applications, we have a deal. I'll continue to help with the
work on getting nuclear fusion to work, and you work on getting my big
pages to work in Linux. We both have lots of funding and resources behind
us and are working with smart people. It should be easy. Beyond that, I
don't know much of anything about chemistry; you'll have to find someone
else to increase your battery efficiency that way.

Big pages don't work now, and zones do not help because the load is too
unpredictable. Sysadmins *always* turn them off, for very good reasons.
They cripple the machine.

I'll try in this post also to merge a couple of replies with other
responses:

I think it was Martin Bligh who wrote that his customer gets 25% speedups
with big pages. That is peanuts compared to my factor of 3.4 (search
comp.arch for John Mashey's and my name at the University of Edinburgh in
Jan/Feb 2003 for a conversation that includes detailed data about this),
but it proves the point that it is far more than just me that wants big
pages.

If your and other kernel developers' (<<0.01% of the universe) kernel
builds slow down by 5% and my and other people's simulations (perhaps
0.01% of the universe) speed up by a factor of up to 3 or 4, who wins?

Answer right now: you do, since you are writing the kernel to respond to
your own issues, which are no more representative of the rest of the
universe than my work is.

Answer as I think it ought to be: I do, since I'd bet that HPC takes far
more net cycles in the world than everyone else's kernel builds put
together. I can't expect much of anyone else to notice either way and
neither can you, so that is a wash.

Ingo Molnar says that zones work for him. In response I will now repeat my
previous rant about why zones don't work. I understand that my post was
very long and people probably didn't read it all. So I'll just repeat that
part:

2) The last paragraph above is important because of the way HPC works as
an industry. We often don't just have a dedicated machine to run on, that
gets booted once and one dedicated application runs on it till it dies or
gets rebooted again. Many jobs run on the same machine. Some jobs run for
weeks. Others run for a few hours over and over again. Some run massively
parallel. Some run throughput.

How is this situation handled? With a batch scheduler. You submit a job to
run and ask for X cpus, Y memory and Z time. It goes and fits you in
wherever it can. cpusets were helpful infrastructure in Linux for this.
You may get some cpus on one side of the machine, some more on the other,
and memory associated with still others. They do a pretty good job of
allocating resources sanely, but there is only so much that they can do.

The important point here for page-related discussions is that someone, you
don't know who, was running on those cpus and memory before you. And doing
Ghu Knows What with it. This code could be running something that benefits
from small pages, or it could be running with large pages.
It could be dynamically allocating and freeing large or small blocks of
memory, or it could be allocating everything at the beginning and running
statically thereafter. Different codes do different things. That means
that the memory state could be totally fubar'ed before your job ever gets
any time allocated to it.

>Nobody takes a random machine and says "ok, we'll now put our most
>performance-critical database on this machine, and oh, btw, you can't
>reboot it and tune for it beforehand".

Wanna bet? What I wrote above makes tuning the machine itself totally
ineffective. What do you tune for? Tuning for one person's code makes
someone else's slower. Tuning for the same code on one input makes another
input run horribly.

You also can't be rebooting after every job. What about all the other ones
that weren't done yet? You'd piss off everyone running there, and it takes
too long besides.

What about a machine that is running multiple instances of some database,
some bigger or smaller than others, or doing other kinds of work? Do you
penalize the big ones or the small ones, this kind of work or that?

You also can't establish zones that can't be changed on the fly as things
on the system change. How do zones like that fit into NUMA? How do things
work when suddenly you've got a job that wants the entire memory filled
with large pages and you've only got half your system set up for large
pages?

What if you tune the system that way and then let that job run, and for
some stupid user reason it dies 10 minutes after starting? Do you let the
30 other jobs in the queue sit idle because they want a different page
distribution?

This way lies madness. Sysadmins just say no and set up the machine as
stably as they can, usually with something not too different from whatever
the manufacturer recommends as a default. For very good reasons.

I would bet the only kind of zone stuff that could even possibly work
would be related to a cpu/memset zone arrangement. See below.

Andy Nelson

--
Andy Nelson                    Theoretical Astrophysics Division (T-6)
andy dot nelson at lanl dot gov          Los Alamos National Laboratory
http://www.phys.lsu.edu/~andy                     Los Alamos, NM 87545

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 14:56 ` Andy Nelson
@ 2005-11-04 15:18 ` Ingo Molnar
2005-11-04 15:39 ` Andy Nelson
2005-11-04 16:00 ` Linus Torvalds
1 sibling, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 15:18 UTC (permalink / raw)
To: Andy Nelson
Cc: torvalds, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin

* Andy Nelson <andy@thermo.lanl.gov> wrote:

> I think it was Martin Bligh who wrote that his customer gets 25%
> speedups with big pages. That is peanuts compared to my factor of 3.4
> (search comp.arch for John Mashey's and my name at the University of
> Edinburgh in Jan/Feb 2003 for a conversation that includes detailed
> data about this), but it proves the point that it is far more than just
> me that wants big pages.

ok, this posting of yours seems to be it:

http://groups.google.com/group/comp.sys.sgi.admin/browse_thread/thread/39884db861b7db15/e0332608c52a17e3?lnk=st&q=&rnum=35#e0332608c52a17e3

| Timing for the tree traversal+gravity calculation were
|
|      16MBpages   1MBpages   64kpages
|  1   *           *          2361.8s
|  8   86.4s       198.7s     298.1s
| 16   43.5s       99.2s      148.9s
| 32   22.1s       50.1s      75.0s
| 64   11.2s       25.3s      37.9s
| 96   7.5s        17.1s      25.4s
|
| (*) test not done.
|
| As near as I can tell the numbers show perfect
| linear speedup for the runs for each page size.
|
| Across different page sizes there is degradation
| as follows:
|
| 16m --> 64k decreases by a factor 3.39 in speed
| 16m --> 1m decreases by a factor 2.25 in speed
| 1m --> 64k decreases by a factor 1.49 in speed
[...]
|
| Sum over cpus of TLB miss times for each test:
|
|      16MBpages   1MBpages   64kpages
|  1                          3489s
|  8   64.3s       1539s      3237s
| 16   64.5s       1540s      3241s
| 32   64.5s       1542s      3244s
| 64   64.9s       1545s      3246s
| 96   64.7s       1545s      3251s
|
| Thus the 16MB pages rarely produced page misses,
| while the 64kB pages used up 2.5x more time than
| the floating point operations that we wanted to
| have. I have at least some feeling that the 16MB pages
| rarely caused misses because with a 128 entry
| TLB (on the R12000 cpu) that gives about 1GB of
| addressable memory before paging is required at all,
| which I think is quite comparable to the size of
| the memory actually used.

to me it seems that this slowdown is due to some inefficiency in the
R12000's TLB-miss handling - possibly very (very!) long TLB-miss
latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
pages on x86/x64?

if my assumption is correct, then hugeTLBs are more of a workaround for
bad TLB-miss properties of the CPUs you are using, not something that
will inevitably happen in the future. Hence i think the 'factor 3x'
slowdown should not be realistic anymore - or are you still running
R12000 CPUs?

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:18 ` Ingo Molnar
@ 2005-11-04 15:39 ` Andy Nelson
2005-11-04 16:05 ` Ingo Molnar
2005-11-04 16:07 ` Linus Torvalds
0 siblings, 2 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 15:39 UTC (permalink / raw)
To: andy, mingo
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, nickpiggin, pj, torvalds

Ingo wrote:
>ok, this posting of yours seems to be it:
>
<elided>
>to me it seems that this slowdown is due to some inefficiency in the
>R12000's TLB-miss handling - possibly very (very!) long TLB-miss
>latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
>visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
>pages on x86/x64?
>
>if my assumption is correct, then hugeTLBs are more of a workaround for
>bad TLB-miss properties of the CPUs you are using, not something that
>will inevitably happen in the future. Hence i think the 'factor 3x'
>slowdown should not be realistic anymore - or are you still running
>R12000 CPUs?
>
	Ingo

AFAIK, MIPS chips have a software TLB refill that takes 1000 cycles more
or less. I could be wrong. There are SGI folk on this thread; perhaps they
can correct me. What is important is that I have done similar tests on
other arches and found very similar results. Specifically with IBM
machines running both AIX and Linux. I've never had the opportunity to try
variable page size stuff on AMD or Intel chips, either Itanic or x86
variants.

The effect is not a consequence of any excessively long TLB handling times
for one single arch.

The effect is a property of the code. Which has one part that is extremely
branchy: traversing a tree, and another part that isn't branchy but grabs
stuff from all over everywhere.

The tree traversal works like this: Start from the root and stop at each
node, load a few numbers, multiply them together and compare to another
number, then open that node or go on to a sibling node. Net, this is about
5-8 flops and a compare per node. The issue is that the next time you want
to look at a tree node, you are someplace else in memory entirely. That
means a TLB miss almost always.

The tree traversal leaves me with a list of a few thousand nodes and
atoms. I use these nodes and atoms to calculate gravity on some particle
or small group of particles. How? For each node, I grab about 10 numbers
from a couple of arrays, do about 50 flops with those numbers, and store
back 4 more numbers. The store back doesn't hurt anything because it
really only happens once at the end of the list.

In the naive case, grabbing 10 numbers out of arrays that are multiple GB
in size means 10 TLB misses. The obvious solution is to stick everything
that is needed together, and get that down to one or two. I've done that.
The results you quoted in your post reflect that. In other words, the
performance difference is the minimal number of TLB misses that I can
manage to get.

Now if you have a list of thousands of nodes to cycle through, each of
which lives on a different page (ordinarily true), you thrash TLB, and you
thrash L1, as I noted in my original post. Believe me, I have worried
about this sort of stuff intensely, and recoded around it a lot. The
performance numbers you saw are what is left over.

It is true that other sorts of codes have much more regular memory access
patterns, and don't have nearly this kind of speedup. Perhaps more typical
would be the 25% number quoted by Martin Bligh.
Andy ^ permalink raw reply [flat|nested] 241+ messages in thread
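What Andy describes maps onto a fairly standard tree-gravity inner loop.
The sketch below is an invented approximation (the types, field names and
the theta-style opening criterion are assumptions, not his code), and it
compiles on its own but ships no driver; its only point is to show why
each visited node sits on a fresh page, and hence is a likely TLB miss,
on a small-page kernel.

/*
 * Invented sketch of a Barnes-Hut-style tree walk: a handful of flops
 * and one compare per node, then a jump to a node somewhere else in
 * memory entirely, via the child or sibling pointer.
 */
#include <stddef.h>

struct node {
	double size2;		/* squared cell size */
	double x, y, z;		/* center of mass */
	double mass;
	struct node *child;	/* first child, elsewhere in memory */
	struct node *sibling;	/* next sibling, elsewhere in memory */
};

/* Collect the nodes acceptable for one particle at (px,py,pz):
 * open a node if it is too close for its size, else accept it. */
static size_t traverse(const struct node *n, double px, double py, double pz,
		       double theta2, const struct node **list, size_t len)
{
	while (n) {
		double dx = n->x - px, dy = n->y - py, dz = n->z - pz;
		double r2 = dx * dx + dy * dy + dz * dz;

		if (n->size2 < theta2 * r2 || !n->child) {
			list[len++] = n;	/* ~5-8 flops + a compare */
			n = n->sibling;		/* next node: a new page */
		} else {
			n = n->child;		/* open it: also a new page */
		}
	}
	return len;
}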
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:39 ` Andy Nelson
@ 2005-11-04 16:05 ` Ingo Molnar
0 siblings, 0 replies; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 16:05 UTC (permalink / raw)
To: Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, nickpiggin, pj, torvalds

* Andy Nelson <andy@thermo.lanl.gov> wrote:

> Ingo wrote:
> >ok, this posting of yours seems to be it:
> >
> <elided>
>
> >to me it seems that this slowdown is due to some inefficiency in the
> >R12000's TLB-miss handling - possibly very (very!) long TLB-miss
> >latencies? On modern CPUs (x86/x64) the TLB-miss latency is rarely
> >visible. Would it be possible to run some benchmarks of hugetlbs vs. 4K
> >pages on x86/x64?
> >
> >if my assumption is correct, then hugeTLBs are more of a workaround for
> >bad TLB-miss properties of the CPUs you are using, not something that
> >will inevitably happen in the future. Hence i think the 'factor 3x'
> >slowdown should not be realistic anymore - or are you still running
> >R12000 CPUs?
> >
> 	Ingo
>
> AFAIK, MIPS chips have a software TLB refill that takes 1000 cycles
> more or less. [...]

x86 in comparison has a typical cost of 7 cycles per TLB miss. And a
modern x64 chip has 1024 TLBs ... If that's not enough then i believe
you'll be limited by cachemiss costs and RAM latency/throughput anyway,
and the only thing the TLB misses have to do is to be somewhat better than
those bottlenecks. TLBs are really fast in the x86/x64 world. Then there
come other features like TLB prefetch, so if you are touching pages in any
predictable fashion you ought to see better latencies than the worst-case.

> The effect is not a consequence of any excessively long TLB handling
> times for one single arch.
>
> The effect is a property of the code. Which has one part that is
> extremely branchy: traversing a tree, and another part that isn't
> branchy but grabs stuff from all over everywhere.

i don't think anyone argues against the fact that a larger 'TLB reach'
will most likely improve performance. The question is always 'by how
much', and that number very much depends on the cost of a single TLB miss.
(and on a lot of other factors)

(note that it's also possible for large TLBs to cause a slowdown: there
are CPUs [e.g. P3] where there are fewer large TLBs than 4K TLBs, so there
are workloads where you lose due to fewer TLBs. It is also possible for
large TLBs to be zero speedup: if the working set is so large that you
will always get a TLB miss with a new node accessed.)

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 15:39 ` Andy Nelson
2005-11-04 16:05 ` Ingo Molnar
@ 2005-11-04 16:07 ` Linus Torvalds
2005-11-04 16:40 ` Ingo Molnar
1 sibling, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 16:07 UTC (permalink / raw)
To: Andy Nelson
Cc: mingo, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin, pj

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> AFAIK, MIPS chips have a software TLB refill that takes 1000
> cycles more or less. I could be wrong.

You're not far off.

Time it on a real machine some day. On a modern x86, you will fill a TLB
entry in anything from 1-8 cycles if it's in L1, and add a couple of dozen
cycles for L2. In fact, the L1 TLB miss can often be hidden by the OoO
engine.

Now, do the math. Your "3-4 times slowdown" with several-hundred-cycle TLB
misses just GOES AWAY with real hardware.

Yes, you'll still see slowdowns, but they won't be nearly as noticeable.
And having a simpler and more efficient kernel will actually make _up_
for them in many cases. For example, you can do all your calculations on
idle workstations that don't mysteriously just crash because somebody was
also doing something else on them.

Face it. MIPS sucks. It was clean, but it didn't perform very well. SGI
doesn't sell those things very actively these days, do they?

So don't blame Linux. Don't make sweeping statements based on hardware
situations that just aren't relevant any more. If you ever see a machine
again that has a huge TLB slowdown, let the machine vendor know, and then
SWITCH VENDORS. Linux will work on sane machines too.

	Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:07 ` Linus Torvalds
@ 2005-11-04 16:40 ` Ingo Molnar
2005-11-04 17:22 ` Linus Torvalds
0 siblings, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 16:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Nelson, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin, pj

* Linus Torvalds <torvalds@osdl.org> wrote:

> Time it on a real machine some day. On a modern x86, you will fill a
> TLB entry in anything from 1-8 cycles if it's in L1, and add a couple
> of dozen cycles for L2.

below is my (x86-only) testcode that accurately measures TLB miss costs in
cycles. (Has to be run as root, because it uses 'cli' as the serializing
instruction.)

here's the output from the default 128MB (32768 4K pages) random access
pattern workload, on a 2 GHz P4 (which has 64 dTLBs):

 0 24 24 24 12 12 0 0 16 0 24 24 24 12 0 12 0 12
 32768 randomly accessed pages, 13 cycles avg, 73.751831% TLB misses.

i.e. really cheap TLB misses even in this very bad and TLB-thrashing
scenario: there are only 64 dTLBs and we have 32768 pages - so they are
outnumbered by a factor of 1:512! Still the CPU gets it right.

setting LINEAR to 1 gives an embarrassing:

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 32768 linearly accessed pages, 0 cycles avg, 0.259399% TLB misses.

showing that the pagetable got fully cached (probably in L1) and that has
_zero_ overhead. Truly remarkable.

lowering the size to 16 MB (still 1:64 TLB-to-working-set-size ratio!)
gives:

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 4096 randomly accessed pages, 0 cycles avg, 5.859375% TLB misses.

so near-zero TLB overhead. increasing BYTES to half a gigabyte gives:

 2 0 12 12 24 12 24 264 24 12 24 24 0 0 24 12 24 24 24 24 24 24 24 24
 12 12 24 24 24 36 24 24 0 24 24 0 24 24 288 24 24 0 228 24 24 0 0
 131072 randomly accessed pages, 75 cycles avg, 94.162750% TLB misses.

so an occasional ~220 cycles (~== 100 nsec - DRAM latency) cachemiss, but
still the average is 75 cycles, or 37 nsecs - which is still only ~37% of
the DRAM latency.

(NOTE: the test eliminates most data cachemisses, by using zero-mapped
anonymous memory, so only a single data page exists. So the costs seen
here are mostly TLB misses.)

	Ingo

---------------
/*
 * TLB miss measurement on PII CPUs.
 *
 * Copyright (C) 1999, Ingo Molnar <mingo@redhat.com>
 */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/mman.h>

#define BYTES (128*1024*1024)
#define PAGES (BYTES/4096)

/* This define turns on the linear mode.. */
#define LINEAR 0

#if 1
# define BARRIER "cli"
#else
# define BARRIER "lock ; addl $0,0(%%esp)"
#endif

int do_test (char * addr)
{
	unsigned long start, end;

	/*
	 * 'cli' is used as a serializing instruction to
	 * isolate the benchmarked instruction from rdtsc.
	 */
	__asm__ (
		"jmp 1f; 1: .align 128;	\
		"BARRIER";		\
		rdtsc;			\
		movl %0, %1;		\
		"BARRIER";		\
		movl (%%esi), %%eax;	\
		"BARRIER";		\
		rdtsc;			\
		"BARRIER";		\
	" :"=a" (end), "=c" (start) :"S" (addr) :"dx","memory");

	return end - start;
}

extern int iopl(int);

int main (void)
{
	unsigned long overhead, sum;
	int j, k, c, hit;
	int matrix [PAGES];
	int delta [PAGES];
	char *buffer = mmap(NULL, BYTES, PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	iopl(3);

	/*
	 * first generate a random access pattern.
	 */
	for (j = 0; j < PAGES; j++) {
		unsigned long val;
#if LINEAR
		val = ((j*8) % PAGES) * 4096;
		val = j*2048;
#else
		val = (random() % PAGES) * 4096;
#endif
		matrix[j] = val;
	}

	/*
	 * Calculate the overhead
	 */
	overhead = ~0UL;
	for (j = 0; j < 100; j++) {
		unsigned int diff = do_test(buffer);
		if (diff < overhead)
			overhead = diff;
	}
	printf("Overhead = %ld cycles\n", overhead);

	/*
	 * 10 warmup loops, the last one is printed.
	 */
	for (k = 0; k < 10; k++) {
		c = 0;
		for (j = 0; j < PAGES; j++) {
			char * addr;

			addr = buffer + matrix[j];
			delta[c++] = do_test(addr);
		}
	}
	hit = 0;
	sum = 0;
	for (j = 0; j < PAGES; j++) {
		unsigned long d = delta[j] - overhead;
		printf("%ld ", d);
		if (d <= 1)
			hit++;
		sum += d;
	}
	printf("\n");
	printf("%d %s accessed pages, %ld cycles avg, %f%% TLB misses.\n",
		PAGES,
#if LINEAR
		"linearly",
#else
		"randomly",
#endif
		sum/PAGES,
		100.0*((double)PAGES-(double)hit)/(double)PAGES);

	return 0;
}

^ permalink raw reply	[flat|nested] 241+ messages in thread
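(A practical note on the test program above, in case anyone wants to repeat
the measurement: the inline assembly uses 32-bit registers and `cli', so it
has to be built as a 32-bit x86 binary and run as root; the iopl(3) call is
what makes `cli' legal from user space. On an SMP box it would presumably
also be worth pinning the process to one CPU, since rdtsc counts are
per-CPU.)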
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:40 ` Ingo Molnar
@ 2005-11-04 17:22 ` Linus Torvalds
2005-11-04 17:43 ` Andy Nelson
0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 17:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andy Nelson, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin, pj

Andy,
let's just take Ingo's numbers, measured on modern hardware.

On Fri, 4 Nov 2005, Ingo Molnar wrote:
>
> 32768 randomly accessed pages, 13 cycles avg, 73.751831% TLB misses.
> 32768 linearly accessed pages, 0 cycles avg, 0.259399% TLB misses.
> 131072 randomly accessed pages, 75 cycles avg, 94.162750% TLB misses.

NOTE! It's hard to decide what OoO does - Ingo's load doesn't allow for a
whole lot of overlapping stuff, so Ingo's numbers are fairly close to
worst case, but on the other hand, that serialization can probably be
honestly said to hide a couple of cycles, so let's say that _real_ worst
case is five more cycles than the ones quoted. It doesn't change the math,
and quite frankly, that way we're really anal about it. In real life,
under real load (especially with FP operations going on at the same time),
OoO might make the cost a few cycles _less_, not more, but hey, let's not
count that.

So in the absolute worst case, with 95% TLB miss ratio, the TLB cost was
an average 75 cycles. Let's be _really_ nice to MIPS, and say that this is
only five times faster than the MIPS case you tested (in reality, it's
probably over ten). That's the WORST CASE.

Realize that MIPS doesn't get better: it will _always_ have a latency of
several hundred cycles when the TLB misses. It has absolutely zero OoO
activity to hide a TLB miss (a software miss totally serializes the
pipeline), and it has zero "code caching", so even with a perfect I$
(which it certainly didn't have), the cost of actually running the TLB
miss handler doesn't go down.

In contrast, the x86 hw miss gets better when there is some more locality
and the page tables are cached. Much better. Ingo's worst-case example is
not realistic (no locality at all in half a gigabyte or totally random
examples), yet even for that worst case, modern CPU's beat the MIPS by
that big factor.

So let's say that the 75% miss ratio was more likely (that's still a high
TLB miss ratio). So in the _likely_ case, a P4 did the miss in an average
of 13 cycles. The MIPS miss cost won't have come down at all - in fact, it
possibly went _up_, since the miss handler now might be getting more I$
misses since it's not called all the time (I don't know if the MIPS miss
handler used non-caching loads or not - the positive D$ effects on the
page tables from slightly denser TLB behaviour might help some to offset
this factor).

That's a likely factor of fifty speedup. But let's be pessimistic again,
and say that the P4 number beat the MIPS TLB miss by "only" a factor of
twenty.

That means that your worst case totally untuned argument (30 times
slowdown from TLB misses) on a P4 is only a 120% slowdown. Not a factor of
three.

But clearly you could tune your code too, and did. To the point that you
had a factor of 3.4 on MIPS. Now, let's say that the tuning didn't work as
well on P4 (remember, we're still being pessimistic), and you'd only get
half of that.

End result? If the slowdown was entirely due to TLB miss costs, your
likely slowdown is in the 20-40% range. Pessimistically.

Now, switching to x86 may have _other_ issues.
[ Mmwwhahahahhahaaa. I crack myself up. x86 slower than MIPS? I'm such a joker. ] Anyway. The point stands. This is something where hardware really rules, and software can't do a lot of sane stuff. 20-40% may sound like a big number, and it is, but this is all stuff where Moore's Law says that we shouldn't spend software effort. We'll likely be better off with a smaller, simpler kernel in the future. I hope. And the numbers above back me up. Software complexity for something like this just kills. Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
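Linus's estimate is easy to play with numerically. The toy model below
combines a miss rate and a per-miss cost into an overall slowdown; the
25-cycles-of-work-per-reference figure is an invented assumption, chosen
only so the software-refill case lands near Andy's untuned 30x, and the
last line is the half-gigabyte worst case that Linus argues is unrealistic.

/*
 * Toy model: each reference costs 'work' cycles of useful work, plus
 * 'miss_cost' extra cycles with probability 'miss_rate'.
 */
#include <stdio.h>

static double slowdown(double work, double miss_rate, double miss_cost)
{
	return (work + miss_rate * miss_cost) / work;
}

int main(void)
{
	double work = 25.0;	/* assumed useful cycles per reference */

	/* ~700-cycle software refill, ~95% misses: MIPS-like untuned case */
	printf("sw refill, 700 cyc, 95%%: %.1fx\n", slowdown(work, 0.95, 700.0));
	/* Ingo's measured P4 averages from above */
	printf("P4, 13 cyc, 75%%:        %.2fx\n", slowdown(work, 0.75, 13.0));
	printf("P4, 75 cyc, 95%%:        %.2fx\n", slowdown(work, 0.95, 75.0));
	return 0;
}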
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:22 ` Linus Torvalds
@ 2005-11-04 17:43 ` Andy Nelson
0 siblings, 0 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 17:43 UTC (permalink / raw)
To: mingo, torvalds
Cc: akpm, andy, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin, pj

Linus,

Please stop focusing on MIPS as the bad boy. MIPS is dead. It has been for
years and everyone knows it unless they are embedded. I wrote several
times that I had tested other arches and every time you deleted those
comments. Not to mention that in the few anecdotal (read: no records were
kept) tests I've done with Intel vs MIPS on more than one code, MIPS
doesn't come out nearly as bad as you seem to believe. Maybe that is TLB
related, maybe it is other issues. The fact remains.

Later on after your posts I also posted numbers for Power 5. Haven't seen
a response to that yet. Maybe you're digesting.

> let's just take Ingo's numbers, measured on modern hardware.

Ingo's numbers calculate 95% TLB misses. I will likely have 100% TLB
misses over most of this code. Read my discussion of what it does and
you'll see why. Capsule form: Every tree node results in several thousand
nodes that are acceptable. You need to examine several times that to get
the acceptable ones. Several thousand memory reads from several thousand
different pages means 100% TLB misses. This is by no means a pathological
case. Other codes will have such effects too, as I noted in my first very
long rant.

I may have misread it, but that last bit of difference between 95% and
100% TLB misses will be a pretty big factor in speed differences. So your
20-40% goes right back up. Ok, so there is some minimal FP overlap in my
case, but a factor of 2 speed difference certainly still exists in the
Power 5 numbers I quoted.

I have a special case version of this code that does cache blocking on the
gravity calculation. As a special case version, it is not effective for
the general case. There are 0 TLB misses and 0 L1 misses for this part of
the code. The tree traversal cannot be similarly cache blocked and keeps
all the TLB and cache misses it always had. For that version, I can get
down to a 20% speedup, because overall the traversal only takes 20% or so
of the total time. That is the absolute best I can do, and I've been
tuning this code alone for close to a decade.

Andy

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 14:56 ` Andy Nelson
2005-11-04 15:18 ` Ingo Molnar
@ 2005-11-04 16:00 ` Linus Torvalds
2005-11-04 16:13 ` Martin J. Bligh
2005-11-04 16:14 ` Andy Nelson
1 sibling, 2 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 16:00 UTC (permalink / raw)
To: Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> Big pages don't work now, and zones do not help because the
> load is too unpredictable. Sysadmins *always* turn them
> off, for very good reasons. They cripple the machine.

They do. Guess why? It's complicated.

SGI used to do things like that in Irix. They had the flakiest Unix kernel
out there. There's a reason people use Linux, and it's not all price. A
lot of it is development speed, and that in turn comes very much from not
making insane decisions that aren't maintainable in the long run.

Trust me. We can make things _better_, by having zones that you can't do
kernel allocations from. But you'll never get everything you want, without
turning the kernel into an unmaintainable mess.

> I think it was Martin Bligh who wrote that his customer gets
> 25% speedups with big pages. That is peanuts compared to my
> factor of 3.4 (search comp.arch for John Mashey's and my name
> at the University of Edinburgh in Jan/Feb 2003 for a conversation
> that includes detailed data about this), but it proves the point that
> it is far more than just me that wants big pages.

I didn't find your post on google, but I assume that a large portion of
your 3.4 factor was hardware.

The fact is, there are tons of architectures that suck at TLB handling.
They have small TLB's, and they fill slowly.

x86 is actually one of the best ones out there. It has a hw TLB fill, and
the page tables are cached, with real-life TLB fill times in the single
cycles (a P4 can almost be seen as effectively having 32kB pages because
it fills its TLB entries so fast when they are next to each other in the
page tables). Even when you have lots of other cache pressure, the page
tables are at least in the L2 (or L3) caches, and you effectively have a
really huge TLB.

In contrast, a lot of other machines will use non-temporal loads to load
the TLB entries, forcing them to _always_ go to memory, and use software
fills, causing the whole machine to stall. To make matters worse, many of
them use hashed page tables, so that even if they could (or do) cache
them, the caching just doesn't work very well.

(I used to be a big proponent of software fill - it's very flexible. It's
also very slow. I've changed my mind after doing timing on x86)

Basically, any machine that gets more than twice the slowdown is _broken_.
If the memory access is cached, then so should the page table entry be
(page tables are _much_ smaller than the pages themselves), so even if you
take a TLB fault on every single access, you shouldn't see a 3.4 factor.

So without finding your post, my guess is that you were on a broken
machine. MIPS or alpha do really well when things generally fit in the
TLB, but break down completely when they don't due to their sw fill (alpha
could have fixed it, it had _architecturally_ sane page tables that it
could have walked in hw, but never got the chance. May it rest in peace).
If I remember correctly, ia64 used to suck horribly because Linux had to
use a mode where the hw page table walker didn't work well (maybe it was
just an Itanium 1 bug), but should be better now. But x86 probably kicks
its butt.

The reason x86 does pretty well is that it's got one of the few sane page
table setups out there (oh, page table trees are old-fashioned and simple,
but they are dense and cache well), and the microarchitecture is largely
optimized for TLB faults. Not having ASI's and having to work with an OS
that invalidated the TLB about every couple of thousand memory accesses
does that to you - it puts the pressure to do things right.

So I suspect Martin's 25% is a lot more accurate on modern hardware (which
means x86, possibly Power. Nothing else much matters).

> If your and other kernel developers' (<<0.01% of the universe) kernel
> builds slow down by 5% and my and other people's simulations (perhaps
> 0.01% of the universe) speed up by a factor of up to 3 or 4, who wins?

First off, you won't speed up by a factor of three or four. Not even
_close_.

Second, it's not about performance. It's about maintainability. It's about
having a system that we can use and understand 10 years down the line. And
the VM is a big part of that.

	Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:00 ` Linus Torvalds
@ 2005-11-04 16:13 ` Martin J. Bligh
2005-11-04 16:40 ` Linus Torvalds
2005-11-04 16:14 ` Andy Nelson
1 sibling, 1 reply; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-04 16:13 UTC (permalink / raw)
To: Linus Torvalds, Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mel, mingo, nickpiggin

> So I suspect Martin's 25% is a lot more accurate on modern hardware (which
> means x86, possibly Power. Nothing else much matters).

It was PPC64, if that helps.

>> If your and other kernel developers' (<<0.01% of the universe) kernel
>> builds slow down by 5% and my and other people's simulations (perhaps
>> 0.01% of the universe) speed up by a factor of up to 3 or 4, who wins?
>
> First off, you won't speed up by a factor of three or four. Not even
> _close_.

Well, I think it depends on the workload a lot. However fast your TLB is,
if we move from "every cacheline read is a TLB miss" to "every cacheline
read is a TLB hit", that can be a huge performance knee. Depends heavily
on the locality of reference and size of data set of the application, I
suspect.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:13 ` Martin J. Bligh
@ 2005-11-04 16:40 ` Linus Torvalds
2005-11-04 17:10 ` Martin J. Bligh
0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 16:40 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andy Nelson, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mel, mingo, nickpiggin

On Fri, 4 Nov 2005, Martin J. Bligh wrote:
>
> > So I suspect Martin's 25% is a lot more accurate on modern hardware (which
> > means x86, possibly Power. Nothing else much matters).
>
> It was PPC64, if that helps.

Ok. I bet x86 is even better, but Power (and possibly Itanium) is the only
other architecture that comes close.

I don't like the horrible POWER hash-tables, but for static workloads they
should perform almost as well as a sane page table (I say "almost",
because I bet that the high-performance x86 vendors have spent a lot more
time on TLB latency than even IBM has). My dislike for them comes from the
fact that they are really only optimized for static behaviour. (And HPC is
almost always static wrt TLB stuff - big, long-running processes).

> Well, I think it depends on the workload a lot. However fast your TLB is,
> if we move from "every cacheline read is a TLB miss" to "every cacheline
> read is a TLB hit", that can be a huge performance knee. Depends heavily
> on the locality of reference and size of data set of the application, I
> suspect.

I'm sure there are really pathological examples, but the thing is, they
won't be on reasonable code.

Some modern CPU's have TLB's that can span the whole cache. In other
words, if your data is in _any_ level of caches, the TLB will be big
enough to find it.

Yes, that's not universally true, and when it's true, the TLB is two-level
and you can have loads where it will usually miss in the first level, but
we're now talking about loads where the _data_ will then always miss in
the first level cache too. So the TLB miss cost will always be _lower_
than the data miss cost.

Right now, you should buy Opteron if you want that kind of large TLB. I
_think_ Intel still has "small" TLB's (the cpuid information only goes up
to 128 entries, I think), but at least Intel has a really good fill. And I
would bet (but have no first-hand information) that next generation
processors will only get bigger TLB's. These things don't tend to shrink.

(Itanium also has a two-level TLB, but it's absolutely pitiful in size).

NOTE! It is absolutely true that for a few years we had regular caches
growing much faster than TLB's. So there are unquestionably unbalanced
machines out there. But it seems that CPU designers started noticing, and
every indication is that TLB's are catching up.

In other words, adding lots of kernel complexity is the wrong thing in the
long run. This is not a long-term problem, and even in the short term you
can fix it by just selecting the right hardware.

In today's world, AMD leads with big TLB's (1024-entry L2 TLB), but Intel
has slightly faster fill and the AMD TLB filtering is sadly turned off on
SMP right now, so you might not always get the full effect of the large
TLB (but in HPC you probably won't have task switching blowing your TLB
away very often). PPC64 has the huge hashed page tables that work well
enough for HPC. Itanium has a pitifully small TLB, and an in-order CPU, so
it will take a noticeably bigger hit on TLB's than x86 will. But even
Itanium will be a _lot_ better than MIPS was.
Linus ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:40 ` Linus Torvalds
@ 2005-11-04 17:10 ` Martin J. Bligh
0 siblings, 0 replies; 241+ messages in thread
From: Martin J. Bligh @ 2005-11-04 17:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Nelson, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mel, mingo, nickpiggin

>> Well, I think it depends on the workload a lot. However fast your TLB is,
>> if we move from "every cacheline read is a TLB miss" to "every cacheline
>> read is a TLB hit", that can be a huge performance knee. Depends heavily
>> on the locality of reference and size of data set of the application, I
>> suspect.
>
> I'm sure there are really pathological examples, but the thing is, they
> won't be on reasonable code.
>
> Some modern CPU's have TLB's that can span the whole cache. In other
> words, if your data is in _any_ level of caches, the TLB will be big
> enough to find it.
>
> Yes, that's not universally true, and when it's true, the TLB is two-level
> and you can have loads where it will usually miss in the first level, but
> we're now talking about loads where the _data_ will then always miss in
> the first level cache too. So the TLB miss cost will always be _lower_
> than the data miss cost.
>
> Right now, you should buy Opteron if you want that kind of large TLB. I
> _think_ Intel still has "small" TLB's (the cpuid information only goes up
> to 128 entries, I think), but at least Intel has a really good fill. And I
> would bet (but have no first-hand information) that next generation
> processors will only get bigger TLB's. These things don't tend to shrink.

Well. Last time I looked they had something in the order of 512 entries
per MB of cache or so (i.e. 2MB of coverage per MB of cache). So it'll
only cover it if you're using 2K of the data in each page (50%), but not
if you're touching cachelines distributed widely over pages. With large
pages, you cover 1000 times that much.

Some apps may not be able to achieve a 50% locality of reference, just by
their nature ... not sure that's bad programming for the big number
crunching cases, or DB workloads with random access patterns to large data
sets.

Of course, this doesn't just apply to HPC/database either. dcache walks on
large fileservers, etc.

Even if we're talking data cache / icache misses, it gets even worse,
doesn't it? Several cacheline misses for pagetable walks per data
cacheline miss. Lots of the compute intensive stuff doesn't even come
close to fitting in data cache by orders of magnitude.

M.

^ permalink raw reply	[flat|nested] 241+ messages in thread
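The reach arithmetic behind Martin's point is one line; here it is as a
tiny program, using the 1024-entry Opteron L2 TLB mentioned earlier and an
assumed 1MB cache as the comparison point:

/* TLB reach = entry count x page size, compared against a cache size. */
#include <stdio.h>

int main(void)
{
	long long entries  = 1024;		/* Opteron L2 dTLB, from above */
	long long cache    = 1024 * 1024;	/* assume 1MB of cache */
	long long reach4k  = entries * 4096;
	long long reach16m = entries * 16 * 1024 * 1024;

	printf("4K pages:  TLB reach = %lld KB (%.0fx a 1MB cache)\n",
	       reach4k / 1024, (double)reach4k / cache);
	printf("16M pages: TLB reach = %lld GB\n",
	       reach16m / (1024 * 1024 * 1024));
	return 0;
}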
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:00 ` Linus Torvalds
2005-11-04 16:13 ` Martin J. Bligh
@ 2005-11-04 16:14 ` Andy Nelson
2005-11-04 16:49 ` Linus Torvalds
1 sibling, 1 reply; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 16:14 UTC (permalink / raw)
To: andy, torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

Linus:
>> If your and other kernel developers' (<<0.01% of the universe) kernel
>> builds slow down by 5% and my and other people's simulations (perhaps
>> 0.01% of the universe) speed up by a factor of up to 3 or 4, who wins?
>
>First off, you won't speed up by a factor of three or four. Not even
>_close_.

My measurements of factors of 3-4 on more than one hw arch don't mean
anything then?

BTW: Ingo Molnar has a response that did find my comp.arch posts. As I
indicated to him, I've done a lot of code tuning to get better performance
even in the presence of TLB issues. This factor is what is left. Starting
from an untuned code, the factor can be up to an order of magnitude
larger. As in 30-60. Yes, I've measured that too, though these detailed
measurements were only on MIPS/Origins.

It is true that I have never had the opportunity to test these issues on
x86 and its relatives. Perhaps it would be better there. The relative
insensitivity to hw arch of the results I already have indicates
otherwise, though.

Re maintainability: Fine. I like maintainable code too. Coding standards
are great. Language standards are even better. These are motherhood
statements. Your simple rejections ("NO, HELL NO!!") even of any attempts
to make these sorts of improvements seem to make that issue pretty moot
anyway.

Andy

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 16:14 ` Andy Nelson
@ 2005-11-04 16:49 ` Linus Torvalds
0 siblings, 0 replies; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 16:49 UTC (permalink / raw)
To: Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> My measurements of factors of 3-4 on more than one hw arch don't
> mean anything then?

When I _know_ that modern hardware does what you tested at least two
orders of magnitude better than the hardware you tested?

Think about it.

	Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
@ 2005-11-04 15:19 Andy Nelson
0 siblings, 0 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 15:19 UTC (permalink / raw)
To: torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

Nick Piggin wrote:
>Mel Gorman wrote:
>> On Fri, 4 Nov 2005, Nick Piggin wrote:
>>
>> Today's massive machines are tomorrow's desktop. Weak comment, I know, but
>> it's happened before.
>>
>Oh I wouldn't bet against it. And if desktops of the future are using
>100s of GB then they probably would be happy to use 64K pages as well.

Just a note. The data I referenced in my other post that can be found on
comp.arch uses 64k pages as the smallest page size in the study. Pages
sized 1M and 16M were the other two.

As I understand it, only a few arches have hw support for more than 2 page
sizes, but my response is that they will eventually need them. The larger
the memory, the larger the possible page size needs to be too. Otherwise
you are just pushing out the problem for a few years.

Andy

^ permalink raw reply	[flat|nested] 241+ messages in thread
* [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
@ 2005-11-04 17:03 Andy Nelson
2005-11-04 17:49 ` Linus Torvalds
2005-11-04 20:12 ` Ingo Molnar
0 siblings, 2 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 17:03 UTC (permalink / raw)
To: torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

>On Fri, 4 Nov 2005, Andy Nelson wrote:
>>
>> My measurements of factors of 3-4 on more than one hw arch don't
>> mean anything then?
>
>When I _know_ that modern hardware does what you tested at least two
>orders of magnitude better than the hardware you tested?

Ok. In other posts you have skeptically accepted Power as a `modern'
architecture. I have just now dug out some numbers of a slightly different
problem running on a Power 5. Specifically an IBM p575, I think.

These tests were done in June, while the others were done more than 2.5
years ago. In other words, there may be other small tuning optimizations
that have gone in since then too.

The problem is a different configuration of particles, and about 2 times
bigger (7 million) than the one in comp.arch (3 million, I think). I would
estimate that the data set in this test spans something like 2-2.5GB or
so.

Here are the results:

cpus   4k pages   16m pages
 1     4888.74s   2399.36s
 2     2447.68s   1202.71s
 4     1225.98s    617.23s
 6      790.05s    418.46s
 8      592.26s    310.03s
12      398.46s    210.62s
16      296.19s    161.96s

These numbers were on a recent Linux. I don't know which one.

Now it looks like it is down to a factor of 2 or slightly more. That is a
totally different arch, that I think you have accepted as `modern',
running the OS that you say doesn't need big page support.

Still a bit more than insignificant, I would say.

>Think about it.

Likewise.

Andy

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:03 Andy Nelson
@ 2005-11-04 17:49 ` Linus Torvalds
2005-11-04 17:51 ` Andy Nelson
0 siblings, 1 reply; 241+ messages in thread
From: Linus Torvalds @ 2005-11-04 17:49 UTC (permalink / raw)
To: Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> Ok. In other posts you have skeptically accepted Power as a
> `modern' architecture.

Yes, sceptically.

I'd really like to hear what your numbers are on a modern x86. Any x86-64
is interesting, and I can't imagine that with a LANL address you can't
find any.

I do believe that Power is within one order of magnitude of a modern x86
when it comes to TLB fill performance. That's much better than many
others, but whether "almost as good" is within the error range, or whether
it's "only five times worse", I don't know.

The thing is, there's a reason x86 machines kick ass. They are cheap, and
they really _do_ outperform pretty much everything else out there.

Power 5 has a wonderful memory architecture, and those L3 caches kick ass.
They probably don't help you as much as they help databases, though, and
it's entirely possible that a small cheap Opteron with its integrated
memory controller will outperform them on your load if you really don't
have a lot of locality.

	Linus

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:49 ` Linus Torvalds
@ 2005-11-04 17:51 ` Andy Nelson
0 siblings, 0 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 17:51 UTC (permalink / raw)
To: andy, torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, mingo, nickpiggin

Finding an x86 or AMD box is not the problem. Finding one with a sysadmin
who is willing to let me experiment is. I'll ask around, but it may be a
while.

Andy

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 17:03 Andy Nelson
2005-11-04 17:49 ` Linus Torvalds
@ 2005-11-04 20:12 ` Ingo Molnar
2005-11-04 21:04 ` Andy Nelson
1 sibling, 1 reply; 241+ messages in thread
From: Ingo Molnar @ 2005-11-04 20:12 UTC (permalink / raw)
To: Andy Nelson
Cc: torvalds, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
linux-kernel, linux-mm, mbligh, mel, nickpiggin

* Andy Nelson <andy@thermo.lanl.gov> wrote:

> The problem is a different configuration of particles, and about 2
> times bigger (7 million) than the one in comp.arch (3 million, I think).
> I would estimate that the data set in this test spans something like
> 2-2.5GB or so.
>
> Here are the results:
>
> cpus   4k pages   16m pages
>  1     4888.74s   2399.36s
>  2     2447.68s   1202.71s
>  4     1225.98s    617.23s
>  6      790.05s    418.46s
>  8      592.26s    310.03s
> 12      398.46s    210.62s
> 16      296.19s    161.96s

interesting, and thanks for the numbers. Even if hugetlbs were only
showing a 'mere' 5% improvement, a 5% _user-space improvement_ is still a
considerable improvement that we should try to achieve, if possible
cheaply.

the 'separate hugetlb zone' solution is cheap and simple, and i believe it
should cover your needs of mixed hugetlb and smallpages workloads.

it would work like this: unlike the current hugepages=<nr> boot parameter,
this zone would be useful for other (4K sized) allocations too. If an app
requests a hugepage then we have the chance to allocate it from the
hugetlb zone, in a guaranteed way [up to the point where the whole zone
consists of hugepages only].

the architectural appeal in this solution is that no additional
"fragmentation prevention" has to be done on this zone, because we only
allow content into it that is "easy" to flush - this means that there is
no complexity drag on the generic kernel VM.

can you think of any reason why the boot-time-configured hugetlb zone
would be inadequate for your needs?

	Ingo

^ permalink raw reply	[flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
2005-11-04 20:12 ` Ingo Molnar
@ 2005-11-04 21:04 ` Andy Nelson
2005-11-04 21:14 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 241+ messages in thread
From: Andy Nelson @ 2005-11-04 21:04 UTC (permalink / raw)
To: andy, mingo
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
linux-mm, mbligh, mel, nickpiggin, torvalds

Hi,

>can you think of any reason why the boot-time-configured hugetlb zone
>would be inadequate for your needs?

I am not enough of a kernel level person or sysadmin to know for certain,
but I still have big worries about consecutive jobs that run on the same
resources, but want extremely different page behavior. If what you are
suggesting can cause all previous history on those resources to be
forgotten and then reset to whatever it is that I want when I start my
run, then yes. It would be fine for me.

In some sense, this is perhaps what I was asking for in my original
message when I was talking about using batch schedulers, cpusets and
friends to encapsulate regions of resources, that could be reset to nice
states at user specified intervals, like when the batch scheduler releases
one job and another job starts.

The issues that I can still think of that HPC people will need are (some
points here are clearly related to each other, but anyway):

1) How do zones play with NUMA? Does setting up resource management this
way mean that various kernel things that help me access my memory
(hellifino what I'm talking about here--things like tables and lists of
pages that I own and how to access them etc I suppose--whatever it is that
kernels don't get rid of when someone else's job ends and before mine
starts) actually get allocated in some other zone half way across the
machine? This is going to kill me on latency grounds. Can it be set up so
that this reserved special kernel zone is somewhere close by? If it is
bigger than the next guy to get my resources wants, can it be deleted and
reset once my job is finished, so his job can run? This is what I would
hope for and expect that something like cpuset/memsets would help to do.

2) How do zones play with merging small pages into big pages, splitting
big pages into small, or deleting whatever page environment was there in
favor of a reset of those resources to some initial state? If someone runs
a small page job right after my big page job, will they get big pages? If
I run a big page job right after their small page job, will I get small
pages? In each case, will it simply say 'no can do' and die? If this setup
just means that some jobs can't be run, or can't be run after something
else, it will not fly.

3) How does any sort of fallback scheme work? If I can't have all of my
big pages, maybe I'll settle for some small ones and some big ones. Can I
have them? If I can't have them and die instead, zones like this will not
fly.

Points 2 and 3 have mostly to do with the question: does the system
performance degrade over time for different constituencies of users, or
can it stay up stably, serving everyone equally and well for a long time?

4) How does any of this stuff play with interactive management? It is not
going to fly if sysadmins have to get involved on a daily/regular basis,
or even at much more than a cursory level of turning something on once
when the machine is purchased.

5) How does any of this stuff play with me having to rewrite my code to
use nonstandard language features?
If I can't run using standard fortran, standard C and maybe for some folks standard C++ or Java, it won't fly. 6) what about text vs data pages. I'm talking here about executable code vs whatever that code operates on. Do they get to have different sized pages? Do they get allocated from sensible places on the machine, as in reasonably separate from each other but not in some far away zone over the rainbow? 7) If OS's/HW ever get decent support for lots and lots of page sizes (like mips and sparc now) rather than a couple , will the infrastructure be able to give me whichever size I ask for, or will I only get to choose between a couple, even if perhaps settable at boot time? Extensibility like this will be a requirement long term of course. 8) What if I want 32 cpus and 64GB of memory on a machine, get it, finish using it, and then the next jobs in line request say 8 cpus and 16GB of memory, 4cpus and 16GB of memory, 20 cpus and 4GB of memory? Will the zone system be able to handle such dynamically changing things? What I would need to see is that these sorts of issues can be handled gracefully by the OS, perhaps with the help of some user land or priveleged userland hints that would come from things like the batch scheduler or an env variable to set my prefered page size or other things about memory policy. Thanks, Andy PS to Linus: I have secured access to an dual cpu dual core amd box. I have to talk to someone who is not here today to see about turning on large pages. We'll see how that goes probably some time next week. If it is possible, you'll see some benchmarks then. ^ permalink raw reply [flat|nested] 241+ messages in thread
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Ingo Molnar @ 2005-11-04 21:14 UTC (permalink / raw)
To: Andy Nelson
Cc: pj, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mbligh, mel, nickpiggin, torvalds

* Andy Nelson <andy@thermo.lanl.gov> wrote:

> 5) How does any of this stuff play with me having to rewrite my code
>    to use nonstandard language features? If I can't run using
>    standard Fortran, standard C and maybe, for some folks, standard
>    C++ or Java, it won't fly.

it ought to be possible to get pretty much the same API as hugetlbfs
via the 'hugetlb zone' approach too. It doesn't really change the API
and FS side, it only impacts the allocator internally. So if you can
utilize hugetlbfs, you should be able to utilize a 'special zone'
approach pretty much the same way.

	Ingo
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Linus Torvalds @ 2005-11-04 21:22 UTC (permalink / raw)
To: Andy Nelson
Cc: mingo, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mbligh, mel, nickpiggin

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> I am not enough of a kernel-level person or sysadmin to know for
> certain, but I still have big worries about consecutive jobs that
> run on the same resources but want extremely different page
> behavior. If what you are suggesting can cause all previous history
> on those resources to be forgotten and then reset to whatever it is
> that I want when I start my run, then yes.

That would largely be the behaviour. When you use the hugetlb zone for
big pages, nothing else would be there. And when you don't use it,
we'd be able to use those zones for at least page cache and user
private pages - both of which are fairly easy to evict if required.

So the downside is that when the admin requests such a zone at
boot-time, the kernel will never be able to use it for its "normal"
allocations. Not for inodes, not for directory name caching, not for
page tables and not for process and file descriptors. Only a very
certain class of allocations that we know how to evict easily could
use them.

Now, for many loads, that's fine. User virtual pages and page cache
pages are often a big part (in fact, often a huge majority) of memory
use. Not always, though. Some loads really want lots of metadata
caching, and if you make too much of memory be in the largepage
zones, performance would suffer badly on such loads.

But the point is that this is easy(ish) to do, and would likely work
wonderfully well for almost all loads. It does put a small onus on
the maintainer of the machine to give a hint, but it's possible that
normal loads won't mind the limitation and that we could even have a
few hugepage zones by default (limit things to 25% of total memory or
something). In fact, we would almost have to do so initially just to
get better test coverage.

Now, if you want _most_ of memory to be available for hugepages, you
really will always require a special boot option, and a friendly
machine maintainer. Limiting things like inodes, process descriptors
etc to a smallish percentage of memory would not be acceptable in
general.

Something like 25% "big page zones" probably is fine even in normal
use, and 50% might be an acceptable compromise even for machines that
see a mixture of pretty regular use and some specialized use. But a
machine that only cares about certain loads might boot up with 75%
set aside in the large-page zones, and that almost certainly would
_not_ be a good setup for random other usage.

IOW, we want a hint up-front about how important huge pages will be,
because it's practically impossible to free pages later: they _will_
become fragmented with stuff that we definitely do not want to teach
the VM how to handle.

But the hint can be pretty friendly. Especially if it's an option to
just load a lot of memory into the boxes, and none of the loads are
expected to really be excessively close to memory limits (ie you
could just buy an extra 16GB to allow for "slop").

		Linus
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Linus Torvalds @ 2005-11-04 21:39 UTC (permalink / raw)
To: Andy Nelson
Cc: mingo, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mbligh, mel, nickpiggin

On Fri, 4 Nov 2005, Linus Torvalds wrote:
>
> But the hint can be pretty friendly. Especially if it's an option to
> just load a lot of memory into the boxes, and none of the loads are
> expected to really be excessively close to memory limits (ie you
> could just buy an extra 16GB to allow for "slop").

One of the issues _will_ be how to allocate things on NUMA. Right now
"hugetlb" only allows us to say "this much memory for hugetlb", and it
probably needs to be per-zone. Some uses might want to allocate all of
the local memory on one node to huge-page usage (and specialized
programs would then also like to run pinned to that node), others
might want to spread it out. So the maintainer would need to decide
that.

The good news is that you can boot up with almost all zones being
"big page" zones, and you could turn them into "normal zones"
dynamically. It's only going the other way that is hard.

So from a maintenance standpoint, if you manage lots of machines, you
could have them all uniformly boot up with lots of memory set aside
for large pages, and then use user-space tools to individually turn
the zones into regular allocation zones.

		Linus
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Rob Landley @ 2005-11-05 2:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Nelson, mingo, akpm, arjan, arjanv, haveblue, kravetz,
    lhms-devel, linux-kernel, linux-mm, mbligh, mel, nickpiggin

On Friday 04 November 2005 15:22, Linus Torvalds wrote:
> Now, if you want _most_ of memory to be available for hugepages, you
> really will always require a special boot option, and a friendly
> machine maintainer. Limiting things like inodes, process descriptors
> etc to a smallish percentage of memory would not be acceptable in
> general.

But it might make it a lot easier for User Mode Linux to give unused
memory back to the host system via madvise(MADV_DONTNEED). (Assuming
there's some way to beat the page cache into submission and actually
free up space. If there was an option to tell the page cache to stay
the heck out of the hugepage zone, it would be just about perfect...)

Rob
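madvise(2) with MADV_DONTNEED is a real interface; how UML might wrap
it to hand unused guest memory back to the host is sketched here as an
assumption for illustration only:

	#include <sys/mman.h>
	#include <stdio.h>

	/*
	 * Return a no-longer-needed guest region to the host.  For
	 * private anonymous mappings the host may reclaim the backing
	 * pages immediately; re-touching the region later faults in
	 * fresh zero-filled pages.
	 */
	static void uml_release_region(void *guest_region, size_t len)
	{
		if (madvise(guest_region, len, MADV_DONTNEED) == -1)
			perror("madvise");
	}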
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Paul Jackson @ 2005-11-06 10:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: andy, mingo, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mbligh, mel, nickpiggin

How would this hugetlb zone be placed - on which nodes in a NUMA
system? My understanding is that you are thinking of specifying it as
a proportion or amount of total memory, with no particular placement.

I'd rather see it as a subset of the nodes on a system being marked
for use, as much as practical, for easily reclaimed memory (page cache
and user).

My HPC customers normally try to isolate the 'classic Unix load' on a
few nodes that they call the bootcpuset, and keep the other nodes as
unused as practical, except when allocated for dedicated use by a
particular job. These other nodes need to run with a maximum amount of
easily reclaimed memory, while the bootcpuset nodes have no such need.

They don't just want easily reclaimable memory in order to get hugetlb
pages. They also want it so that the memory available for use as
ordinary-sized pages by one job will not be unduly reduced by the
hard-to-reclaim pages left over from some previous job.

This would be easy to do with cpusets, adding a second per-cpuset
nodemask that specified where not-easily-reclaimed kernel allocations
should come from. The typical HPC user would set that second mask to
their bootcpuset.

The few allocation calls in the kernel deemed to be easily reclaimable
(page cache and user space) would have a __GFP_EASYRCLM flag added,
and the cpuset hook in the __alloc_pages code path would put requests
-not- marked __GFP_EASYRCLM on this second set of nodes.

No changes to hugetlbs or to the kernel code that runs at boot, prior
to starting init, would be required at all. The bootcpuset stuff is
set up by a pre-init program (specified using the kernel's "init=..."
boot option). This makes all the configuration of this entirely a
user-space problem.

Cpuset nodes, not zone sizes, are the proper way to manage this, in my
view.

If you ask what this means for small (1 or 2 node) systems, then I
would first ask you what we are trying to do on those systems. I
suspect that that would involve other classes of users, with different
needs, than what Andy or I can speak to.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
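A minimal sketch of the second-nodemask check described above, with
simplified locking. __GFP_EASYRCLM is the flag from Mel's series; the
mems_kernel field and the function itself are hypothetical, invented
for illustration:

	/*
	 * In the __alloc_pages() cpuset hook: easily reclaimed
	 * allocations may use the job's own nodes; everything else
	 * is steered onto the bootcpuset nodes, so jobs don't
	 * inherit unreclaimable debris from their predecessors.
	 */
	static int cpuset_node_allowed_sketch(int node, gfp_t gfp_mask)
	{
		struct cpuset *cs = current->cpuset;

		if (gfp_mask & __GFP_EASYRCLM)
			return node_isset(node, cs->mems_allowed);

		/* hard-to-reclaim: only the second, kernel nodemask */
		return node_isset(node, cs->mems_kernel); /* hypothetical */
	}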
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Gregory Maxwell @ 2005-11-04 21:31 UTC (permalink / raw)
To: Andy Nelson
Cc: mingo, akpm, arjan, arjanv, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mbligh, mel, nickpiggin, torvalds

On 11/4/05, Andy Nelson <andy@thermo.lanl.gov> wrote:
> I am not enough of a kernel-level person or sysadmin to know for
> certain, but I still have big worries about consecutive jobs that
> run on the same resources, but want extremely different page
> behavior.

That's the idea. The 'hugetlb zone' will only be usable for
allocations which are guaranteed reclaimable. Reclaimable includes
userspace usage (since at worst an in-use userspace page can be
swapped out and then paged back into another physical location).

For your sort of mixed use this should be a fine solution. However,
there are mixed-use cases that this will not solve; for example, if
the system usage is split between HPC uses and kernel-allocation-heavy
workloads (say, forking 10 quintillion Java processes), then the
hugetlb zone will need to be made small to keep the
kernel-allocation-heavy workload happy.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Andi Kleen @ 2005-11-04 22:43 UTC (permalink / raw)
To: Gregory Maxwell
Cc: Andy Nelson, mingo, akpm, arjan, arjanv, haveblue, kravetz,
    lhms-devel, linux-kernel, linux-mm, mbligh, mel, nickpiggin,
    torvalds

On Friday 04 November 2005 22:31, Gregory Maxwell wrote:
> Thats the idea. The 'hugetlb zone' will only be usable for
> allocations which are guaranteed reclaimable. Reclaimable includes
> userspace usage (since at worst an in-use userspace page can be
> swapped out and then paged back into another physical location).

I don't like it very much. You have two choices if a workload runs out
of the kernel-allocatable pages: either you spill into the reclaimable
zone, or you fail the allocation. The first means that the huge pages
thing is unreliable; the second would mean that all the many problems
of limited lowmem would be back.

None of this is very attractive.

-Andi
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Nick Piggin @ 2005-11-05 0:07 UTC (permalink / raw)
To: Andi Kleen
Cc: Gregory Maxwell, Andy Nelson, mingo, akpm, arjan, arjanv,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, torvalds

Andi Kleen wrote:
> On Friday 04 November 2005 22:31, Gregory Maxwell wrote:
>
>> Thats the idea. The 'hugetlb zone' will only be usable for
>> allocations which are guaranteed reclaimable. Reclaimable includes
>> userspace usage (since at worst an in-use userspace page can be
>> swapped out and then paged back into another physical location).
>
> I don't like it very much. You have two choices if a workload runs
> out of the kernel allocatable pages. Either you spill into the
> reclaimable zone or you fail the allocation. The first means that
> the huge pages thing is unreliable, the second would mean that all
> the many problems of limited lowmem would be back.

These are essentially the same problems that the frag patches face as
well.

> None of this is very attractive.

Though it is simple, and I expect it should actually do a really good
job for the non-kernel-intensive HPC group, and the highly tuned
database group.

Nick

--
SUSE Labs, Novell Inc.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Zan Lynx @ 2005-11-06 1:30 UTC (permalink / raw)
To: Andi Kleen
Cc: Gregory Maxwell, Andy Nelson, mingo, akpm, arjan, arjanv,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, nickpiggin, torvalds

Andi Kleen wrote:
> I don't like it very much. You have two choices if a workload runs
> out of the kernel allocatable pages. Either you spill into the
> reclaimable zone or you fail the allocation. The first means that
> the huge pages thing is unreliable, the second would mean that all
> the many problems of limited lowmem would be back.
>
> None of this is very attractive.

You could allow the 'hugetlb zone' to shrink, allowing more kernel
allocations. User pages at the boundary would be moved to make room.
This would at least keep the 'hugetlb zone' pure and not create holes
in it.
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Rob Landley @ 2005-11-06 2:25 UTC (permalink / raw)
To: Zan Lynx
Cc: Andi Kleen, Gregory Maxwell, Andy Nelson, mingo, akpm, arjan,
    arjanv, haveblue, kravetz, lhms-devel, linux-kernel, linux-mm,
    mbligh, mel, nickpiggin, torvalds

On Saturday 05 November 2005 19:30, Zan Lynx wrote:
> You could allow the 'hugetlb zone' to shrink, allowing more kernel
> allocations. User pages at the boundary would be moved to make room.

Please make that optional if you do. In my potential use case, an OOM
kill lets the administrator know they've got things configured wrong
so they can fix it and try again. Containing and viciously reaping
things like dentries is the behavior I want out of it.

Also, if you do shrink the hugetlb zone it might be possible to
opportunistically expand it back to its original size. There's no
guarantee that a given kernel allocation will ever go away, but if it
_does_ go away then the hugetlb zone should be able to expand to the
next blocking allocation or the maximum size, whichever comes first.

(Given that my understanding of the layout may not match reality at
all; don't ask me how the discontiguous memory stuff would work in
here...)

Rob
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Andy Nelson @ 2005-11-04 17:56 UTC (permalink / raw)
To: torvalds
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
    linux-mm, mbligh, mel, mingo, nickpiggin

Correction:

> and you'll see why. Capsule form: Every tree node results in several read

should read:

> and you'll see why. Capsule form: Every tree traversal results in several

Andy
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Andy Nelson @ 2005-11-04 21:51 UTC (permalink / raw)
To: gmaxwell
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
    linux-mm, mbligh, mel, mingo, nickpiggin, torvalds

Hi folks,

It sounds like in principle I (`I' = generic HPC person) could be
happy with this sort of solution. The proof of the pudding is in the
eating however, and various perversions and misunderstandings can
still always crop up. Hopefully they can be solved or avoided if they
do show up though. Also, other folk might not be so satisfied. I'll
let them speak for themselves though.

One issue remaining is that I don't know how this hugetlbfs stuff that
was discussed actually works or should work, in terms of the interface
to my code. What would work for me is something to the effect of

   f90 -flag_that_turns_access_to_big_pages_on code.f

That then substitutes in allocation calls to this hugetlbfs zone
instead of `normal' allocation calls to generic memory, and perhaps
lets me fall back to normal memory up to whatever system limits may
exist if no big pages are available. Or even something more simple
like

   setenv HEY_OS_I_WANT_BIG_PAGES_FOR_MY_JOB

or alternatively, a similar request in a batch script. I don't know
that any of these things really have much to do with the OS directly
however.

Thanks all, and have a good weekend.

Andy
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Seth, Rohit @ 2005-11-05 1:37 UTC (permalink / raw)
To: Nick Piggin, Andi Kleen
Cc: Gregory Maxwell, Andy Nelson, mingo, akpm, arjan, arjanv,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, torvalds

From: Nick Piggin, Friday, November 04, 2005 4:08 PM
> These are essentially the same problems that the frag patches face
> as well.
>
>> None of this is very attractive.
>
> Though it is simple and I expect it should actually do a really good
> job for the non-kernel-intensive HPC group, and the highly tuned
> database group.

Not sure how applications can seamlessly use the proposed hugetlb zone
based on hugetlbfs. Depending on the programming language, it might
actually need changes in libs/tools etc.

As far as databases are concerned, I think they mostly already grab
vast chunks of memory to be used as hugepages (particularly for
big-mem systems), which is a separate list of pages. And they are
actually also glad that the kernel never looks at them for any other
purpose.

-rohit
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Andy Nelson @ 2005-11-07 0:34 UTC (permalink / raw)
To: ak, nickpiggin, rohit.seth
Cc: akpm, andy, arjan, arjanv, gmaxwell, haveblue, kravetz,
    lhms-devel, linux-kernel, linux-mm, mbligh, mel, mingo, torvalds

Hi folks,

> Not sure how applications can seamlessly use the proposed hugetlb
> zone based on hugetlbfs. Depending on the programming language, it
> might actually need changes in libs/tools etc.

This is my biggest worry as well. I can't recall the details right
now, but I have some memories of people telling me, for example, that
large pages on Linux were simply not available to Fortran programs,
period, due to lack of toolchain/lib stuff, just as you note. What the
reasons were/are I have no idea. I do know that the Power 5 numbers I
quoted a couple of days ago required that the sysadmin apply some
special patches to Linux, plus linking to an extra library. I don't
know what patches (they came from IBM), but for xlf95 on Power5, the
library I had to link with was this one:

   -T /usr/local/lib64/elf64ppc.lbss.x

No changes were required to my code, which is what I need, but codes
that did not link to this library would not run on a kernel that had
the patches installed, and code that did link with this library would
not run on a kernel that didn't have those patches.

I don't know what library this is or what was in it, but I can't
imagine it would have been something very standard or mainline, with
that sort of drastic behavior. Maybe the IBM folk can explain what
this was about.

I will ask some folks here who should know how it may work on
intel/amd machines about how large pages can be used this coming week,
when I attempt to do page size speed testing for my code, as I
promised before, as I promised before, as I promised before.

Andy
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Adam Litke @ 2005-11-07 18:58 UTC (permalink / raw)
To: Andy Nelson
Cc: ak, nickpiggin, rohit.seth, akpm, arjan, arjanv, gmaxwell,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, mingo, torvalds

On Sun, 2005-11-06 at 17:34 -0700, Andy Nelson wrote:
> This is my biggest worry as well. I can't recall the details right
> now, but I have some memories of people telling me, for example,
> that large pages on Linux were simply not available to Fortran
> programs, period, due to lack of toolchain/lib stuff, just as you
> note. What the reasons were/are I have no idea. I do know that the
> Power 5 numbers I quoted a couple of days ago required that the
> sysadmin apply some special patches to Linux, plus linking to an
> extra library. I don't know what patches (they came from IBM), but
> for xlf95 on Power5, the library I had to link with was this one:
>
>    -T /usr/local/lib64/elf64ppc.lbss.x
>
> No changes were required to my code, which is what I need, but codes
> that did not link to this library would not run on a kernel that had
> the patches installed, and code that did link with this library
> would not run on a kernel that didn't have those patches.
>
> I don't know what library this is or what was in it, but I can't
> imagine it would have been something very standard or mainline, with
> that sort of drastic behavior. Maybe the IBM folk can explain what
> this was about.

Wow. It's amazing how these things spread from my little corner of
the universe ;) What you speak of sounds dangerously close to what
I've been working on lately. Indeed it is not standard at all yet.

I am currently working on a new approach to what you tried. It
requires fewer changes to the kernel and implements the special large
page usage entirely in an LD_PRELOAD library. And on newer kernels,
programs linked with the .x ldscript you mention above can run using
all small pages if not enough large pages are available.

For the curious, here's how this all works:

1) Link the unmodified application source with a custom linker script
   which does the following:
   - Align elf segments to large page boundaries
   - Assert a non-standard Elf program header flag (PF_LINUX_HTLB)
     to signal something (see below) to use large pages.
2) Boot a kernel that supports copy-on-write for PRIVATE hugetlb
   pages.
3) Use an LD_PRELOAD library which reloads the PF_LINUX_HTLB segments
   into large pages and transfers control back to the application.

> I will ask some folks here who should know how it may work on
> intel/amd machines about how large pages can be used this coming
> week, when I attempt to do page size speed testing for my code, as I
> promised before, as I promised before, as I promised before.

I have used this method on ppc64, x86, and x86_64 machines
successfully. I'd love to see how my system works for a real-world
user, so if you're interested in trying it out I can send you the
current version.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
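For the even more curious, a compressed C sketch of step 3.
dl_iterate_phdr(3) is a real glibc interface; the PF_LINUX_HTLB value
and the remap_to_hugepages() helper are assumptions standing in for
the preload library's real work of rebuilding the segment on a
hugetlbfs-backed mapping:

	#include <link.h>

	#define PF_LINUX_HTLB  0x00100000   /* assumed flag value */

	/* hypothetical: copy/remap the segment onto hugetlbfs */
	extern void remap_to_hugepages(void *addr, size_t len);

	static int phdr_cb(struct dl_phdr_info *info, size_t size,
			   void *data)
	{
		int i;

		/* walk every PT_LOAD header, pick the marked ones */
		for (i = 0; i < info->dlpi_phnum; i++) {
			const ElfW(Phdr) *ph = &info->dlpi_phdr[i];

			if (ph->p_type == PT_LOAD &&
			    (ph->p_flags & PF_LINUX_HTLB))
				remap_to_hugepages((void *)(info->dlpi_addr +
							    ph->p_vaddr),
						   ph->p_memsz);
		}
		return 0;
	}

	/* called once from the preload library's constructor:
	 *   dl_iterate_phdr(phdr_cb, NULL);                    */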
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Rohit Seth @ 2005-11-07 20:51 UTC (permalink / raw)
To: Adam Litke
Cc: Andy Nelson, ak, nickpiggin, akpm, arjan, arjanv, gmaxwell,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, mingo, torvalds

On Mon, 2005-11-07 at 12:58 -0600, Adam Litke wrote:
> I am currently working on a new approach to what you tried. It
> requires fewer changes to the kernel and implements the special
> large page usage entirely in an LD_PRELOAD library. And on newer
> kernels, programs linked with the .x ldscript you mention above can
> run using all small pages if not enough large pages are available.

Isn't it true that most of the time we'll need to be worrying about
run-time allocation of memory (using malloc or such) as compared to
static?

> For the curious, here's how this all works:
> 1) Link the unmodified application source with a custom linker
>    script which does the following:
>    - Align elf segments to large page boundaries
>    - Assert a non-standard Elf program header flag (PF_LINUX_HTLB)
>      to signal something (see below) to use large pages.

We'll need a similar flag for even code pages to start using hugetlb
pages. In this case, to keep the kernel changes to a minimum, RTLD
will need to be modified.

> 2) Boot a kernel that supports copy-on-write for PRIVATE hugetlb
>    pages.
> 3) Use an LD_PRELOAD library which reloads the PF_LINUX_HTLB
>    segments into large pages and transfers control back to the
>    application.

COW, swap etc. are all very nice (little!) features that make hugetlb
get used more transparently.

-rohit
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Andy Nelson @ 2005-11-07 20:55 UTC (permalink / raw)
To: agl, rohit.seth
Cc: ak, akpm, andy, arjan, arjanv, gmaxwell, haveblue, kravetz,
    lhms-devel, linux-kernel, linux-mm, mbligh, mel, mingo,
    nickpiggin, torvalds

Hi,

> Isn't it true that most of the time we'll need to be worrying about
> run-time allocation of memory (using malloc or such) as compared to
> static?

Perhaps for C. Not necessarily true for Fortran. I don't know anything
about how memory allocations proceed there, but there are no `malloc'
calls (at least with that spelling) in the language itself, and I
don't know what it does for either static or dynamic allocations under
the hood. It could be malloc-like or whatever else. In the language
itself, there are language features for allocating and deallocating
memory and I've seen code that uses them, but I haven't played with it
myself, since my codes need pretty much all the various pieces of
memory all the time, and so are simply statically defined.

If you call something like malloc yourself, you risk portability
problems in Fortran. Fortran 2003 supposedly addresses some of this
with some C interop features, but it only got approved within the last
year, and no compilers really exist for it yet, let alone code having
been written for it.

Andy
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Martin J. Bligh @ 2005-11-07 20:58 UTC (permalink / raw)
To: Andy Nelson, agl, rohit.seth
Cc: ak, akpm, arjan, arjanv, gmaxwell, haveblue, kravetz, lhms-devel,
    linux-kernel, linux-mm, mel, mingo, nickpiggin, torvalds

>> Isn't it true that most of the time we'll need to be worrying about
>> run-time allocation of memory (using malloc or such) as compared to
>> static?
>
> Perhaps for C. Not necessarily true for Fortran. I don't know
> anything about how memory allocations proceed there, but there are
> no `malloc' calls (at least with that spelling) in the language
> itself, and I don't know what it does for either static or dynamic
> allocations under the hood. It could be malloc-like or whatever
> else. In the language itself, there are language features for
> allocating and deallocating memory and I've seen code that uses
> them, but I haven't played with it myself, since my codes need
> pretty much all the various pieces of memory all the time, and so
> are simply statically defined.

Doesn't Fortran shove everything in BSS to make some truly monstrous
segment?

M.
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Rohit Seth @ 2005-11-07 21:20 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andy Nelson, agl, ak, akpm, arjan, arjanv, gmaxwell, haveblue,
    kravetz, lhms-devel, linux-kernel, linux-mm, mel, mingo,
    nickpiggin, torvalds

On Mon, 2005-11-07 at 12:58 -0800, Martin J. Bligh wrote:
>>> Isn't it true that most of the time we'll need to be worrying
>>> about run-time allocation of memory (using malloc or such) as
>>> compared to static?
>>
>> Perhaps for C. Not necessarily true for Fortran. [...]
>
> Doesn't Fortran shove everything in BSS to make some truly monstrous
> segment?

Hmmm... that would be strange. So if an app is using a TB of data,
then a TB of space on disk... then read in at load time (or maybe some
optimization in the RTLD knows that this is BSS and does not need to
be loaded, but then a TB of disk space is wasted).

-rohit
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Adam Litke @ 2005-11-07 21:33 UTC (permalink / raw)
To: Rohit Seth
Cc: Martin J. Bligh, Andy Nelson, ak, akpm, arjan, arjanv, gmaxwell,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mel,
    mingo, nickpiggin, torvalds

On Mon, 2005-11-07 at 13:20 -0800, Rohit Seth wrote:
> On Mon, 2005-11-07 at 12:58 -0800, Martin J. Bligh wrote:
>> Doesn't Fortran shove everything in BSS to make some truly
>> monstrous segment?
>
> Hmmm... that would be strange. So if an app is using a TB of data,
> then a TB of space on disk... then read in at load time (or maybe
> some optimization in the RTLD knows that this is BSS and does not
> need to be loaded, but then a TB of disk space is wasted).

Nope, the BSS is defined as the difference between the file size (on
disk) and the memory size (as specified in the ELF program header for
the data segment). So the kernel loads the pre-initialized data from
disk and extends the mapping to include room for the BSS.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
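In code, the relationship Adam describes is just two ELF
program-header fields; a one-line illustration, assuming phdr points
at the data segment's PT_LOAD header:

	#include <elf.h>

	/*
	 * p_filesz: initialized data actually stored in the file;
	 * p_memsz:  the full in-memory size of the segment.  The
	 * kernel zero-extends the difference -- that is the BSS, so
	 * a TB-sized BSS costs no disk space at all.
	 */
	size_t bss_size = phdr->p_memsz - phdr->p_filesz;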
* Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: David Gibson @ 2005-11-08 2:12 UTC (permalink / raw)
To: Andy Nelson
Cc: agl, rohit.seth, ak, akpm, arjan, arjanv, gmaxwell, haveblue,
    kravetz, lhms-devel, linux-kernel, linux-mm, mbligh, mel, mingo,
    nickpiggin, torvalds

On Mon, Nov 07, 2005 at 01:55:32PM -0700, Andy Nelson wrote:
> Perhaps for C. Not necessarily true for Fortran. I don't know
> anything about how memory allocations proceed there, but there are
> no `malloc' calls (at least with that spelling) in the language
> itself, and I don't know what it does for either static or dynamic
> allocations under the hood. It could be malloc-like or whatever
> else. In the language itself, there are language features for
> allocating and deallocating memory and I've seen code that uses
> them, but I haven't played with it myself, since my codes need
> pretty much all the various pieces of memory all the time, and so
> are simply statically defined.
>
> If you call something like malloc yourself, you risk portability
> problems in Fortran. Fortran 2003 supposedly addresses some of this
> with some C interop features, but it only got approved within the
> last year, and no compilers really exist for it yet, let alone code
> having been written for it.

I believe F90 has a couple of different ways of dynamically allocating
memory. I'd expect in most implementations the FORTRAN runtime would
translate that into a malloc() call. However, as I gather, many HPC
apps are written by people who are scientists first and programmers
second, and who still think in F77, where there is no dynamic memory
allocation. Hence, gigantic arrays in the BSS are common FORTRAN
practice.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Adam Litke @ 2005-11-07 21:11 UTC (permalink / raw)
To: Rohit Seth
Cc: Andy Nelson, ak, nickpiggin, akpm, arjan, arjanv, gmaxwell,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, mingo, torvalds

On Mon, 2005-11-07 at 12:51 -0800, Rohit Seth wrote:
> Isn't it true that most of the time we'll need to be worrying about
> run-time allocation of memory (using malloc or such) as compared to
> static?

It really depends on the workload. I've run HPC apps with 10+GB data
segments. I've also worked with applications that would benefit from
a hugetlb-enabled morecore (glibc malloc/sbrk). I'd like to see one
standard hugetlb preload library that handles every different "memory
object" we care about (static and dynamic). That's what I'm working
on now.

> We'll need a similar flag for even code pages to start using hugetlb
> pages. In this case, to keep the kernel changes to a minimum, RTLD
> will need to be modified.

Yes, I foresee the functionality currently in my preload lib existing
in RTLD at some point way down the road.

> COW, swap etc. are all very nice (little!) features that make
> hugetlb get used more transparently.

Indeed. See my parallel post of a hugetlb-COW RFC :)

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
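A sketch of the "hugetlb-enabled morecore" idea. glibc's __morecore
hook is real; pointing it at a hugetlbfs-backed arena is shown here
with the hugetlbfs mmap setup and error handling reduced to
assumptions:

	#include <malloc.h>
	#include <stddef.h>

	static char *htlb_top;  /* current break inside hugetlbfs map */
	static char *htlb_end;  /* end of the mapping                 */

	/*
	 * Replacement for sbrk(): hand the malloc heap out of one
	 * large hugetlbfs mapping created at startup (not shown).
	 */
	static void *htlb_morecore(ptrdiff_t increment)
	{
		char *prev = htlb_top;

		if (htlb_top + increment > htlb_end)
			return NULL;          /* out of huge pages */
		htlb_top += increment;
		return prev;
	}

	/* installed from the preload library's constructor:
	 *   __morecore = htlb_morecore;                      */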
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Rohit Seth @ 2005-11-07 21:31 UTC (permalink / raw)
To: Adam Litke
Cc: Andy Nelson, ak, nickpiggin, akpm, arjan, arjanv, gmaxwell,
    haveblue, kravetz, lhms-devel, linux-kernel, linux-mm, mbligh,
    mel, mingo, torvalds

On Mon, 2005-11-07 at 15:11 -0600, Adam Litke wrote:
> It really depends on the workload. I've run HPC apps with 10+GB data
> segments. I've also worked with applications that would benefit from
> a hugetlb-enabled morecore (glibc malloc/sbrk). I'd like to see one
> standard hugetlb preload library that handles every different
> "memory object" we care about (static and dynamic). That's what I'm
> working on now.

As said below, we will need this functionality even for code pages. I
would rather have the changes absorbed in the run-time loader than
have a preload library. Makes it easier to manage.

malloc/sbrk are the interesting part that does pose some challenges
(as on some archs a different address space is reserved for hugetlb).
Moreover, it will also be critical that the existing semantics of
normal pages are maintained even when the application ends up using
hugepages.

> Yes, I foresee the functionality currently in my preload lib
> existing in RTLD at some point way down the road.

It will be much sooner...

-rohit
* RE: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
From: Seth, Rohit @ 2005-11-05 1:52 UTC (permalink / raw)
To: Linus Torvalds, Andy Nelson
Cc: akpm, arjan, arjanv, haveblue, kravetz, lhms-devel, linux-kernel,
    linux-mm, mbligh, mel, mingo, nickpiggin

From: Linus Torvalds, Friday, November 04, 2005 8:01 AM
> If I remember correctly, ia64 used to suck horribly because Linux
> had to use a mode where the hw page table walker didn't work well
> (maybe it was just an itanium 1 bug), but should be better now. But
> x86 probably kicks its butt.

I don't remember a difference of more than (roughly) 30 percentage
points even on first-generation Itaniums (using hugetlb vs normal
pages), and a few more percentage points when the walker was disabled.
Over time the page table walker on IA-64 has gotten more aggressive.

...though I believe that 30% is a lot of performance.

-rohit
Thread overview: 241+ messages
2005-10-30 18:33 [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
2005-10-30 18:34 ` [PATCH 1/7] Fragmentation Avoidance V19: 001_antidefrag_flags Mel Gorman
2005-10-30 18:34 ` [PATCH 2/7] Fragmentation Avoidance V19: 002_usemap Mel Gorman
2005-10-30 18:34 ` [PATCH 3/7] Fragmentation Avoidance V19: 003_fragcore Mel Gorman
2005-10-30 18:34 ` [PATCH 4/7] Fragmentation Avoidance V19: 004_fallback Mel Gorman
2005-10-30 18:34 ` [PATCH 5/7] Fragmentation Avoidance V19: 005_largealloc_tryharder Mel Gorman
2005-10-30 18:34 ` [PATCH 6/7] Fragmentation Avoidance V19: 006_percpu Mel Gorman
2005-10-30 18:34 ` [PATCH 7/7] Fragmentation Avoidance V19: 007_stats Mel Gorman
2005-10-31 5:57 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mike Kravetz
2005-10-31 6:37 ` Nick Piggin
2005-10-31 7:54 ` Andrew Morton
2005-10-31 7:11 ` Nick Piggin
2005-10-31 16:19 ` Mel Gorman
2005-10-31 23:54 ` Nick Piggin
2005-11-01 1:28 ` Mel Gorman
2005-11-01 1:42 ` Nick Piggin
[not found] ` <27700000.1130769270@[10.10.2.4]>
[not found] ` <20051031112409.153e7048.akpm@osdl.org>
[not found] ` <3660000.1130787652@flay>
2005-10-31 23:59 ` Nick Piggin
2005-11-01 1:36 ` Mel Gorman
[not found] ` <4366A8D1.7020507@yahoo.com.au>
[not found] ` <Pine.LNX.4.58.0510312333240.29390@skynet>
[not found] ` <4366C559.5090504@yahoo.com.au>
2005-11-01 15:25 ` Martin J. Bligh
2005-11-01 15:33 ` Dave Hansen
2005-11-01 16:57 ` Mel Gorman
2005-11-01 17:00 ` Mel Gorman
2005-11-01 18:58 ` Rob Landley
[not found] ` <Pine.LNX.4.58.0511010137020.29390@skynet>
[not found] ` <4366D469.2010202@yahoo.com.au>
[not found] ` <Pine.LNX.4.58.0511011014060.14884@skynet>
2005-11-01 13:56 ` Ingo Molnar
2005-11-01 14:10 ` Dave Hansen
2005-11-01 14:29 ` Ingo Molnar
2005-11-01 14:49 ` Dave Hansen
2005-11-01 15:01 ` Ingo Molnar
2005-11-01 15:22 ` Dave Hansen
[not found] ` <20051102084946.GA3930@elte.hu>
[not found] ` <436880B8.1050207@yahoo.com.au>
2005-11-02 9:32 ` Dave Hansen
2005-11-02 9:48 ` Nick Piggin
2005-11-02 10:54 ` Dave Hansen
2005-11-02 15:02 ` Martin J. Bligh
2005-11-03 3:21 ` Nick Piggin
2005-11-03 15:36 ` Martin J. Bligh
2005-11-03 15:40 ` Arjan van de Ven
2005-11-03 15:51 ` Linus Torvalds
2005-11-03 15:57 ` Martin J. Bligh
2005-11-03 16:20 ` Arjan van de Ven
2005-11-03 16:27 ` Mel Gorman
2005-11-03 16:46 ` Linus Torvalds
2005-11-03 16:52 ` Martin J. Bligh
2005-11-03 17:19 ` Linus Torvalds
2005-11-03 17:48 ` Dave Hansen
2005-11-03 17:51 ` Martin J. Bligh
2005-11-03 17:59 ` Arjan van de Ven
2005-11-03 18:08 ` Linus Torvalds
2005-11-03 18:17 ` Martin J. Bligh
2005-11-03 18:44 ` Linus Torvalds
2005-11-03 18:51 ` Martin J. Bligh
2005-11-03 19:35 ` Linus Torvalds
2005-11-03 22:40 ` Martin J. Bligh
2005-11-03 22:56 ` Linus Torvalds
2005-11-03 23:01 ` Martin J. Bligh
2005-11-04 0:58 ` Nick Piggin
2005-11-04 1:06 ` Linus Torvalds
2005-11-04 1:20 ` Paul Mackerras
2005-11-04 1:22 ` Nick Piggin
2005-11-04 1:48 ` Mel Gorman
2005-11-04 1:59 ` Nick Piggin
2005-11-04 2:35 ` Mel Gorman
2005-11-04 1:26 ` Mel Gorman
2005-11-03 21:11 ` Mel Gorman
2005-11-03 18:03 ` Linus Torvalds
2005-11-03 20:00 ` Paul Jackson
2005-11-03 20:46 ` Mel Gorman
2005-11-03 18:48 ` Martin J. Bligh
2005-11-03 19:08 ` Linus Torvalds
2005-11-03 22:37 ` Martin J. Bligh
2005-11-03 23:16 ` Linus Torvalds
2005-11-03 23:39 ` Martin J. Bligh
2005-11-04 0:42 ` Nick Piggin
2005-11-04 4:39 ` Andrew Morton
2005-11-04 16:22 ` Mel Gorman
2005-11-03 15:53 ` Martin J. Bligh
2005-11-01 16:48 ` Kamezawa Hiroyuki
2005-11-01 16:59 ` Kamezawa Hiroyuki
2005-11-01 17:19 ` Mel Gorman
2005-11-02 0:32 ` KAMEZAWA Hiroyuki
2005-11-02 11:22 ` Mel Gorman
2005-11-01 18:06 ` linux-os (Dick Johnson)
2005-11-02 7:19 ` Ingo Molnar
2005-11-02 7:46 ` Gerrit Huizenga
2005-11-02 8:50 ` Nick Piggin
2005-11-02 9:12 ` Gerrit Huizenga
2005-11-02 9:37 ` Nick Piggin
2005-11-02 10:17 ` Gerrit Huizenga
2005-11-02 23:47 ` Rob Landley
2005-11-03 4:43 ` Nick Piggin
2005-11-03 6:07 ` Rob Landley
2005-11-03 7:34 ` Nick Piggin
2005-11-03 17:54 ` Rob Landley
2005-11-03 20:13 ` Jeff Dike
2005-11-03 16:35 ` Jeff Dike
2005-11-03 16:23 ` Badari Pulavarty
2005-11-03 18:27 ` Jeff Dike
2005-11-03 18:49 ` Rob Landley
2005-11-04 4:52 ` Andrew Morton
2005-11-04 5:35 ` Paul Jackson
2005-11-04 5:48 ` Andrew Morton
2005-11-04 6:42 ` Paul Jackson
2005-11-04 7:10 ` Andrew Morton
2005-11-04 7:45 ` Paul Jackson
2005-11-04 8:02 ` Andrew Morton
2005-11-04 9:52 ` Paul Jackson
2005-11-04 15:27 ` Martin J. Bligh
2005-11-04 15:19 ` Martin J. Bligh
2005-11-04 17:38 ` Andrew Morton
2005-11-04 6:16 ` Bron Nelson
2005-11-04 7:26 ` [patch] swapin rlimit Ingo Molnar
2005-11-04 7:36 ` Andrew Morton
2005-11-04 8:07 ` Ingo Molnar
2005-11-04 10:06 ` Paul Jackson
2005-11-04 15:24 ` Martin J. Bligh
2005-11-04 8:18 ` Arjan van de Ven
2005-11-04 10:04 ` Paul Jackson
2005-11-04 15:14 ` Rob Landley
2005-11-04 10:14 ` Bernd Petrovitsch
2005-11-04 10:21 ` Ingo Molnar
2005-11-04 11:17 ` Bernd Petrovitsch
2005-11-02 10:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Ingo Molnar
2005-11-02 11:04 ` Gerrit Huizenga
2005-11-02 12:00 ` Ingo Molnar
2005-11-02 12:42 ` Dave Hansen
2005-11-02 15:02 ` Gerrit Huizenga
2005-11-03 0:10 ` Rob Landley
2005-11-02 7:57 ` Nick Piggin
2005-11-02 0:51 ` Nick Piggin
2005-11-02 7:42 ` Dave Hansen
2005-11-02 8:24 ` Nick Piggin
2005-11-02 8:33 ` Yasunori Goto
2005-11-02 8:43 ` Nick Piggin
2005-11-02 14:51 ` Martin J. Bligh
2005-11-02 23:28 ` Rob Landley
2005-11-03 5:26 ` Jeff Dike
2005-11-03 5:41 ` Rob Landley
2005-11-04 3:26 ` [uml-devel] " Blaisorblade
2005-11-04 15:50 ` Rob Landley
2005-11-04 17:18 ` Blaisorblade
2005-11-04 17:44 ` Rob Landley
2005-11-02 12:38 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary Mel Gorman
2005-11-03 3:14 ` Nick Piggin
2005-11-03 12:19 ` Mel Gorman
2005-11-10 18:47 ` Steve Lord
2005-11-03 15:34 ` Martin J. Bligh
2005-11-01 14:41 ` [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 Mel Gorman
2005-11-01 14:46 ` Ingo Molnar
2005-11-01 15:23 ` Mel Gorman
2005-11-01 18:33 ` Rob Landley
2005-11-01 19:02 ` Ingo Molnar
2005-11-01 14:50 ` Dave Hansen
2005-11-01 15:24 ` Mel Gorman
2005-11-02 5:11 ` Andrew Morton
2005-11-01 18:23 ` Rob Landley
2005-11-01 20:31 ` Joel Schopp
2005-11-01 20:59 ` Joel Schopp
2005-11-02 1:06 ` Nick Piggin
2005-11-02 1:41 ` Martin J. Bligh
2005-11-02 2:03 ` Nick Piggin
2005-11-02 2:24 ` Martin J. Bligh
2005-11-02 2:49 ` Nick Piggin
2005-11-02 4:39 ` Martin J. Bligh
2005-11-02 5:09 ` Nick Piggin
2005-11-02 5:14 ` Martin J. Bligh
2005-11-02 6:23 ` KAMEZAWA Hiroyuki
2005-11-02 10:15 ` Nick Piggin
2005-11-02 7:19 ` Yasunori Goto
2005-11-02 11:48 ` Mel Gorman
2005-11-02 11:41 ` Mel Gorman
2005-11-02 11:37 ` Mel Gorman
2005-11-02 15:11 ` Mel Gorman
-- strict thread matches above, loose matches on Subject: below --
2005-11-04 1:00 Andy Nelson
2005-11-04 1:16 ` Martin J. Bligh
2005-11-04 1:27 ` Nick Piggin
2005-11-04 5:14 ` Linus Torvalds
2005-11-04 6:10 ` Paul Jackson
2005-11-04 6:38 ` Ingo Molnar
2005-11-04 7:26 ` Paul Jackson
2005-11-04 7:37 ` Ingo Molnar
2005-11-04 15:31 ` Linus Torvalds
2005-11-04 15:39 ` Martin J. Bligh
2005-11-04 15:53 ` Ingo Molnar
2005-11-06 7:34 ` Paul Jackson
2005-11-06 15:55 ` Linus Torvalds
2005-11-06 18:18 ` Paul Jackson
2005-11-06 8:44 ` Kyle Moffett
2005-11-06 16:12 ` Linus Torvalds
2005-11-06 17:00 ` Linus Torvalds
2005-11-07 8:00 ` Ingo Molnar
2005-11-07 11:00 ` Dave Hansen
2005-11-07 12:20 ` Ingo Molnar
2005-11-07 19:34 ` Steven Rostedt
2005-11-07 23:38 ` Joel Schopp
2005-11-13 2:30 ` Rob Landley
2005-11-14 1:58 ` Joel Schopp
2005-11-04 7:44 ` Eric Dumazet
2005-11-07 16:42 ` Adam Litke
2005-11-04 14:56 ` Andy Nelson
2005-11-04 15:18 ` Ingo Molnar
2005-11-04 15:39 ` Andy Nelson
2005-11-04 16:05 ` Ingo Molnar
2005-11-04 16:07 ` Linus Torvalds
2005-11-04 16:40 ` Ingo Molnar
2005-11-04 17:22 ` Linus Torvalds
2005-11-04 17:43 ` Andy Nelson
2005-11-04 16:00 ` Linus Torvalds
2005-11-04 16:13 ` Martin J. Bligh
2005-11-04 16:40 ` Linus Torvalds
2005-11-04 17:10 ` Martin J. Bligh
2005-11-04 16:14 ` Andy Nelson
2005-11-04 16:49 ` Linus Torvalds
2005-11-04 15:19 Andy Nelson
2005-11-04 17:03 Andy Nelson
2005-11-04 17:49 ` Linus Torvalds
2005-11-04 17:51 ` Andy Nelson
2005-11-04 20:12 ` Ingo Molnar
2005-11-04 21:04 ` Andy Nelson
2005-11-04 21:14 ` Ingo Molnar
2005-11-04 21:22 ` Linus Torvalds
2005-11-04 21:39 ` Linus Torvalds
2005-11-05 2:48 ` Rob Landley
2005-11-06 10:59 ` Paul Jackson
2005-11-04 21:31 ` Gregory Maxwell
2005-11-04 22:43 ` Andi Kleen
2005-11-05 0:07 ` Nick Piggin
2005-11-06 1:30 ` Zan Lynx
2005-11-06 2:25 ` Rob Landley
2005-11-04 17:56 Andy Nelson
2005-11-04 21:51 Andy Nelson
2005-11-05 1:37 Seth, Rohit
2005-11-07 0:34 ` Andy Nelson
2005-11-07 18:58 ` Adam Litke
2005-11-07 20:51 ` Rohit Seth
2005-11-07 20:55 ` Andy Nelson
2005-11-07 20:58 ` Martin J. Bligh
2005-11-07 21:20 ` Rohit Seth
2005-11-07 21:33 ` Adam Litke
2005-11-08 2:12 ` David Gibson
2005-11-07 21:11 ` Adam Litke
2005-11-07 21:31 ` Rohit Seth
2005-11-05 1:52 Seth, Rohit