* [RFC/T/D][PATCH 0/2] KVM page cache optimization (v2) @ 2010-06-08 15:51 Balbir Singh 2010-06-08 15:51 ` [RFC][PATCH 1/2] Linux/Guest unmapped page cache control Balbir Singh 2010-06-08 15:51 ` [RFC/T/D][PATCH 2/2] Linux/Guest cooperative " Balbir Singh 0 siblings, 2 replies; 48+ messages in thread From: Balbir Singh @ 2010-06-08 15:51 UTC (permalink / raw) To: kvm; +Cc: Avi Kivity, linux-mm, Balbir Singh, linux-kernel This is version 2 of the page cache control patches for KVM. This series has two patches, the first controls the amount of unmapped page cache usage via a boot parameter and sysctl. The second patch controls page and slab cache via the balloon driver. Both the patches make heavy use of the zone_reclaim() functionality already present in the kernel. page-cache-control balloon-page-cache -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* [RFC][PATCH 1/2] Linux/Guest unmapped page cache control 2010-06-08 15:51 [RFC/T/D][PATCH 0/2] KVM page cache optimization (v2) Balbir Singh @ 2010-06-08 15:51 ` Balbir Singh 2010-06-13 18:31 ` Balbir Singh 2010-06-08 15:51 ` [RFC/T/D][PATCH 2/2] Linux/Guest cooperative " Balbir Singh 1 sibling, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-08 15:51 UTC (permalink / raw) To: kvm; +Cc: Avi Kivity, linux-mm, Balbir Singh, linux-kernel Selectively control Unmapped Page Cache (nospam version) From: Balbir Singh <balbir@linux.vnet.ibm.com> This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page cache control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from the zone_reclaim_mode logic in __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VMs and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid reaping page cache too aggressively or too frequently, while still providing control. The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim. The patch is applied against mmotm feb-11-2010.

Test data

Guest usage without boot parameter (memory in KB)
----------------------------
MemFree   Cached    Time
19900     292912    137
17540     296196    139
17900     296124    141
19356     296660    141

Host usage: (memory in KB)
RSS       Cache     mapped    swap
2788664   781884    3780      359536

Guest usage with boot parameter (memory in KB)
-------------------------
MemFree   Cached    Time
244824    74828     144
237840    81764     143
235880    83044     138
239312    80092     148

Host usage: (memory in KB)
RSS       Cache     mapped    swap
2700184   958012    334848    398412

TODOs
-----
1. Balance slab cache as well
2.
Invoke the balance routines from the balloon driver Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- include/linux/mmzone.h | 2 - include/linux/swap.h | 3 + mm/page_alloc.c | 9 ++- mm/vmscan.c | 165 ++++++++++++++++++++++++++++++++++++------------ 4 files changed, 134 insertions(+), 45 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index b4d109e..9f96b6d 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -293,12 +293,12 @@ struct zone { */ unsigned long lowmem_reserve[MAX_NR_ZONES]; + unsigned long min_unmapped_pages; #ifdef CONFIG_NUMA int node; /* * zone reclaim becomes active if more unmapped pages exist. */ - unsigned long min_unmapped_pages; unsigned long min_slab_pages; #endif struct per_cpu_pageset __percpu *pageset; diff --git a/include/linux/swap.h b/include/linux/swap.h index ff4acea..f92f1ee 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -251,10 +251,11 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +extern bool should_balance_unmapped_pages(struct zone *zone); +extern int sysctl_min_unmapped_ratio; #ifdef CONFIG_NUMA extern int zone_reclaim_mode; -extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 431214b..fee9420 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1641,6 +1641,9 @@ zonelist_scan: unsigned long mark; int ret; + if (should_balance_unmapped_pages(zone)) + wakeup_kswapd(zone, order); + mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; if (zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags)) @@ -4069,10 +4072,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, zone->spanned_pages = size; zone->present_pages = realsize; -#ifdef CONFIG_NUMA - zone->node = nid; zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) / 100; +#ifdef CONFIG_NUMA + zone->node = nid; zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; #endif zone->name = zone_names[j]; @@ -4982,7 +4985,6 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write, return 0; } -#ifdef CONFIG_NUMA int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { @@ -4999,6 +5001,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write, return 0; } +#ifdef CONFIG_NUMA int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/mm/vmscan.c b/mm/vmscan.c index 9c7e57c..27bc536 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -136,6 +136,18 @@ static DECLARE_RWSEM(shrinker_rwsem); #define scanning_global_lru(sc) (1) #endif +static int unmapped_page_control __read_mostly; + +static int __init unmapped_page_control_parm(char *str) +{ + unmapped_page_control = 1; + /* + * XXX: Should we tweak swappiness here? + */ + return 1; +} +__setup("unmapped_page_control", unmapped_page_control_parm); + static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone, struct scan_control *sc) { @@ -1986,6 +1998,103 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining) } /* + * Percentage of pages in a zone that must be unmapped for zone_reclaim to + * occur. 
+ */ +int sysctl_min_unmapped_ratio = 1; +/* + * Priority for ZONE_RECLAIM. This determines the fraction of pages + * of a node considered for each zone_reclaim. 4 scans 1/16th of + * a zone. + */ +#define ZONE_RECLAIM_PRIORITY 4 + + +#define RECLAIM_OFF 0 +#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */ +#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */ +#define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */ + +static inline unsigned long zone_unmapped_file_pages(struct zone *zone) +{ + unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED); + unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) + + zone_page_state(zone, NR_ACTIVE_FILE); + + /* + * It's possible for there to be more file mapped pages than + * accounted for by the pages on the file LRU lists because + * tmpfs pages accounted for as ANON can also be FILE_MAPPED + */ + return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0; +} + +/* + * Helper function to reclaim unmapped pages, we might add something + * similar to this for slab cache as well. Currently this function + * is shared with __zone_reclaim() + */ +static inline void +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc, + unsigned long nr_pages) +{ + int priority; + /* + * Free memory by calling shrink zone with increasing + * priorities until we have enough memory freed. + */ + priority = ZONE_RECLAIM_PRIORITY; + do { + note_zone_scanning_priority(zone, priority); + shrink_zone(priority, zone, sc); + priority--; + } while (priority >= 0 && sc->nr_reclaimed < nr_pages); +} + +/* + * Routine to balance unmapped pages, inspired from the code under + * CONFIG_NUMA that does unmapped page and slab page control by keeping + * min_unmapped_pages in the zone. We currently reclaim just unmapped + * pages, slab control will come in soon, at which point this routine + * should be called balance cached pages + */ +static unsigned long balance_unmapped_pages(int priority, struct zone *zone, + struct scan_control *sc) +{ + if (unmapped_page_control && + (zone_unmapped_file_pages(zone) > zone->min_unmapped_pages)) { + struct scan_control nsc; + unsigned long nr_pages; + + nsc = *sc; + + nsc.swappiness = 0; + nsc.may_writepage = 0; + nsc.may_unmap = 0; + nsc.nr_reclaimed = 0; + + nr_pages = zone_unmapped_file_pages(zone) - + zone->min_unmapped_pages; + /* Magically try to reclaim eighth the unmapped cache pages */ + nr_pages >>= 3; + + zone_reclaim_unmapped_pages(zone, &nsc, nr_pages); + return nsc.nr_reclaimed; + } + return 0; +} + +#define UNMAPPED_PAGE_RATIO 16 +bool should_balance_unmapped_pages(struct zone *zone) +{ + if (unmapped_page_control && + (zone_unmapped_file_pages(zone) > + UNMAPPED_PAGE_RATIO * zone->min_unmapped_pages)) + return true; + return false; +} + +/* * For kswapd, balance_pgdat() will work across all this node's zones until * they are all at high_wmark_pages(zone). * @@ -2074,6 +2183,12 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, &sc, priority, 0); + /* + * We do unmapped page balancing once here and once + * below, so that we don't lose out + */ + balance_unmapped_pages(priority, zone, &sc); + if (!zone_watermark_ok(zone, order, high_wmark_pages(zone), 0, 0)) { end_zone = i; @@ -2115,6 +2230,13 @@ loop_again: nid = pgdat->node_id; zid = zone_idx(zone); + + /* + * Balance unmapped pages upfront, this should be + * really cheap + */ + balance_unmapped_pages(priority, zone, &sc); + /* * Call soft limit reclaim before calling shrink_zone. 
* For now we ignore the return value @@ -2336,7 +2458,8 @@ void wakeup_kswapd(struct zone *zone, int order) return; pgdat = zone->zone_pgdat; - if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) + if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0) && + !should_balance_unmapped_pages(zone)) return; if (pgdat->kswapd_max_order < order) pgdat->kswapd_max_order = order; @@ -2502,44 +2625,12 @@ module_init(kswapd_init) */ int zone_reclaim_mode __read_mostly; -#define RECLAIM_OFF 0 -#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */ -#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */ -#define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */ - -/* - * Priority for ZONE_RECLAIM. This determines the fraction of pages - * of a node considered for each zone_reclaim. 4 scans 1/16th of - * a zone. - */ -#define ZONE_RECLAIM_PRIORITY 4 - -/* - * Percentage of pages in a zone that must be unmapped for zone_reclaim to - * occur. - */ -int sysctl_min_unmapped_ratio = 1; - /* * If the number of slab pages in a zone grows beyond this percentage then * slab reclaim needs to occur. */ int sysctl_min_slab_ratio = 5; -static inline unsigned long zone_unmapped_file_pages(struct zone *zone) -{ - unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED); - unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) + - zone_page_state(zone, NR_ACTIVE_FILE); - - /* - * It's possible for there to be more file mapped pages than - * accounted for by the pages on the file LRU lists because - * tmpfs pages accounted for as ANON can also be FILE_MAPPED - */ - return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0; -} - /* Work out how many page cache pages we can reclaim in this reclaim_mode */ static long zone_pagecache_reclaimable(struct zone *zone) { @@ -2577,7 +2668,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) const unsigned long nr_pages = 1 << order; struct task_struct *p = current; struct reclaim_state reclaim_state; - int priority; struct scan_control sc = { .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP), @@ -2607,12 +2697,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) * Free memory by calling shrink zone with increasing * priorities until we have enough memory freed. */ - priority = ZONE_RECLAIM_PRIORITY; - do { - note_zone_scanning_priority(zone, priority); - shrink_zone(priority, zone, &sc); - priority--; - } while (priority >= 0 && sc.nr_reclaimed < nr_pages); + zone_reclaim_unmapped_pages(zone, &sc, nr_pages); } slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE); -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 48+ messages in thread
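To put numbers on the knobs described above (a sketch, not kernel code): with the default sysctl_min_unmapped_ratio of 1 and UNMAPPED_PAGE_RATIO of 16, and assuming a 2GB zone with 4KB pages, min_unmapped_pages works out to roughly 20MB worth of pages, the allocator only wakes kswapd for this purpose once unmapped page cache exceeds about 16 times that (roughly 330MB), and each balancing pass then tries to reclaim an eighth of the pages above min_unmapped_pages. The small userspace program below just replays the patch's arithmetic for those assumed numbers.

/*
 * Userspace sketch replaying the arithmetic used by the patch above:
 * min_unmapped_pages = zone pages * sysctl_min_unmapped_ratio / 100,
 * wakeup threshold = UNMAPPED_PAGE_RATIO * min_unmapped_pages, and a
 * per-pass target of one eighth of the excess.  The 2GB zone, 4KB page
 * size and 100000 unmapped pages are assumptions for illustration.
 */
#include <stdio.h>

int main(void)
{
	unsigned long zone_pages = (2UL << 30) / 4096;       /* assumed 2GB zone */
	unsigned long min_unmapped = zone_pages * 1 / 100;   /* min_unmapped_ratio = 1 */
	unsigned long wakeup = 16 * min_unmapped;            /* UNMAPPED_PAGE_RATIO = 16 */
	unsigned long unmapped = 100000;                     /* assumed unmapped file pages */
	unsigned long target = unmapped > min_unmapped ?
			       (unmapped - min_unmapped) >> 3 : 0;

	printf("min_unmapped_pages: %lu pages (~%lu MB)\n",
	       min_unmapped, min_unmapped * 4096 >> 20);
	printf("kswapd wakeup threshold: %lu pages (~%lu MB)\n",
	       wakeup, wakeup * 4096 >> 20);
	printf("per-pass reclaim target at %lu unmapped pages: %lu (~%lu MB)\n",
	       unmapped, target, target * 4096 >> 20);
	return 0;
}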
* Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control 2010-06-08 15:51 ` [RFC][PATCH 1/2] Linux/Guest unmapped page cache control Balbir Singh @ 2010-06-13 18:31 ` Balbir Singh 2010-06-14 0:28 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-13 18:31 UTC (permalink / raw) To: kvm; +Cc: Avi Kivity, linux-mm, linux-kernel * Balbir Singh <balbir@linux.vnet.ibm.com> [2010-06-08 21:21:46]: > Selectively control Unmapped Page Cache (nospam version) > > From: Balbir Singh <balbir@linux.vnet.ibm.com> > > This patch implements unmapped page cache control via preferred > page cache reclaim. The current patch hooks into kswapd and reclaims > page cache if the user has requested for unmapped page control. > This is useful in the following scenario > > - In a virtualized environment with cache=writethrough, we see > double caching - (one in the host and one in the guest). As > we try to scale guests, cache usage across the system grows. > The goal of this patch is to reclaim page cache when Linux is running > as a guest and get the host to hold the page cache and manage it. > There might be temporary duplication, but in the long run, memory > in the guests would be used for mapped pages. > - The option is controlled via a boot option and the administrator > can selectively turn it on, on a need to use basis. > > A lot of the code is borrowed from zone_reclaim_mode logic for > __zone_reclaim(). One might argue that the with ballooning and > KSM this feature is not very useful, but even with ballooning, > we need extra logic to balloon multiple VM machines and it is hard > to figure out the correct amount of memory to balloon. With these > patches applied, each guest has a sufficient amount of free memory > available, that can be easily seen and reclaimed by the balloon driver. > The additional memory in the guest can be reused for additional > applications or used to start additional guests/balance memory in > the host. > > KSM currently does not de-duplicate host and guest page cache. The goal > of this patch is to help automatically balance unmapped page cache when > instructed to do so. > > There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO > and the number of pages to reclaim when unmapped_page_control argument > is supplied. These numbers were chosen to avoid aggressiveness in > reaping page cache ever so frequently, at the same time providing control. > > The sysctl for min_unmapped_ratio provides further control from > within the guest on the amount of unmapped pages to reclaim. > Are there any major objections to this patch? -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control 2010-06-13 18:31 ` Balbir Singh @ 2010-06-14 0:28 ` KAMEZAWA Hiroyuki 2010-06-14 6:49 ` Balbir Singh 0 siblings, 1 reply; 48+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-06-14 0:28 UTC (permalink / raw) To: balbir; +Cc: kvm, Avi Kivity, linux-mm, linux-kernel On Mon, 14 Jun 2010 00:01:45 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * Balbir Singh <balbir@linux.vnet.ibm.com> [2010-06-08 21:21:46]: > > > Selectively control Unmapped Page Cache (nospam version) > > > > From: Balbir Singh <balbir@linux.vnet.ibm.com> > > > > This patch implements unmapped page cache control via preferred > > page cache reclaim. The current patch hooks into kswapd and reclaims > > page cache if the user has requested for unmapped page control. > > This is useful in the following scenario > > > > - In a virtualized environment with cache=writethrough, we see > > double caching - (one in the host and one in the guest). As > > we try to scale guests, cache usage across the system grows. > > The goal of this patch is to reclaim page cache when Linux is running > > as a guest and get the host to hold the page cache and manage it. > > There might be temporary duplication, but in the long run, memory > > in the guests would be used for mapped pages. > > - The option is controlled via a boot option and the administrator > > can selectively turn it on, on a need to use basis. > > > > A lot of the code is borrowed from zone_reclaim_mode logic for > > __zone_reclaim(). One might argue that the with ballooning and > > KSM this feature is not very useful, but even with ballooning, > > we need extra logic to balloon multiple VM machines and it is hard > > to figure out the correct amount of memory to balloon. With these > > patches applied, each guest has a sufficient amount of free memory > > available, that can be easily seen and reclaimed by the balloon driver. > > The additional memory in the guest can be reused for additional > > applications or used to start additional guests/balance memory in > > the host. > > > > KSM currently does not de-duplicate host and guest page cache. The goal > > of this patch is to help automatically balance unmapped page cache when > > instructed to do so. > > > > There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO > > and the number of pages to reclaim when unmapped_page_control argument > > is supplied. These numbers were chosen to avoid aggressiveness in > > reaping page cache ever so frequently, at the same time providing control. > > > > The sysctl for min_unmapped_ratio provides further control from > > within the guest on the amount of unmapped pages to reclaim. > > > > Are there any major objections to this patch? > This kind of patch needs "how it works well" measurement. - How did you measure the effect of the patch ? kernbench is not enough, of course. - Why don't you believe LRU ? And if LRU doesn't work well, should it be fixed by a knob rather than generic approach ? - No side effects ? - Linux vm guys tend to say, "free memory is bad memory". ok, for what free memory created by your patch is used ? IOW, I can't see the benefit. If free memory that your patch created will be used for another page-cache, it will be dropped soon by your patch itself. If your patch just drops "duplicated, but no more necessary for other kvm", I agree your patch may increase available size of page-caches. But you just drops unmapped pages. Hmm. 
Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control 2010-06-14 0:28 ` KAMEZAWA Hiroyuki @ 2010-06-14 6:49 ` Balbir Singh 2010-06-14 7:00 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-14 6:49 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: kvm, Avi Kivity, linux-mm, linux-kernel * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-06-14 09:28:19]: > On Mon, 14 Jun 2010 00:01:45 +0530 > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > * Balbir Singh <balbir@linux.vnet.ibm.com> [2010-06-08 21:21:46]: > > > > > Selectively control Unmapped Page Cache (nospam version) > > > > > > From: Balbir Singh <balbir@linux.vnet.ibm.com> > > > > > > This patch implements unmapped page cache control via preferred > > > page cache reclaim. The current patch hooks into kswapd and reclaims > > > page cache if the user has requested for unmapped page control. > > > This is useful in the following scenario > > > > > > - In a virtualized environment with cache=writethrough, we see > > > double caching - (one in the host and one in the guest). As > > > we try to scale guests, cache usage across the system grows. > > > The goal of this patch is to reclaim page cache when Linux is running > > > as a guest and get the host to hold the page cache and manage it. > > > There might be temporary duplication, but in the long run, memory > > > in the guests would be used for mapped pages. > > > - The option is controlled via a boot option and the administrator > > > can selectively turn it on, on a need to use basis. > > > > > > A lot of the code is borrowed from zone_reclaim_mode logic for > > > __zone_reclaim(). One might argue that the with ballooning and > > > KSM this feature is not very useful, but even with ballooning, > > > we need extra logic to balloon multiple VM machines and it is hard > > > to figure out the correct amount of memory to balloon. With these > > > patches applied, each guest has a sufficient amount of free memory > > > available, that can be easily seen and reclaimed by the balloon driver. > > > The additional memory in the guest can be reused for additional > > > applications or used to start additional guests/balance memory in > > > the host. > > > > > > KSM currently does not de-duplicate host and guest page cache. The goal > > > of this patch is to help automatically balance unmapped page cache when > > > instructed to do so. > > > > > > There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO > > > and the number of pages to reclaim when unmapped_page_control argument > > > is supplied. These numbers were chosen to avoid aggressiveness in > > > reaping page cache ever so frequently, at the same time providing control. > > > > > > The sysctl for min_unmapped_ratio provides further control from > > > within the guest on the amount of unmapped pages to reclaim. > > > > > > > Are there any major objections to this patch? > > > > This kind of patch needs "how it works well" measurement. > > - How did you measure the effect of the patch ? kernbench is not enough, of course. I can run other benchmarks as well, I will do so > - Why don't you believe LRU ? And if LRU doesn't work well, should it be > fixed by a knob rather than generic approach ? > - No side effects ? 
I believe in LRU, just that the problem I am trying to solve is of using double the memory for caching the same data (consider kvm running in cache=writethrough or writeback mode, both the hypervisor and the guest OS maintain a page cache of the same data). As the VMs grow, the overhead is substantial. In my runs I found up to 60% duplication in some cases. > > - Linux vm guys tend to say, "free memory is bad memory". ok, for what > free memory created by your patch is used ? IOW, I can't see the benefit. > If free memory that your patch created will be used for another page-cache, > it will be dropped soon by your patch itself. > Free memory is good for cases when you want to do more in the same system. I agree that in a bare metal environment that might be partially true. I don't have a problem with frequently used data being cached, but I am targeting a consolidated environment at the moment. Moreover, the administrator has control via a boot option, so it is non-intrusive in many ways. > If your patch just drops "duplicated, but no more necessary for other kvm", > I agree your patch may increase available size of page-caches. But you just > drops unmapped pages. > Unmapped and unused pages are the best targets; I plan to add slab cache control later. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control 2010-06-14 6:49 ` Balbir Singh @ 2010-06-14 7:00 ` KAMEZAWA Hiroyuki 2010-06-14 7:36 ` Balbir Singh 0 siblings, 1 reply; 48+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-06-14 7:00 UTC (permalink / raw) To: balbir; +Cc: kvm, Avi Kivity, linux-mm, linux-kernel On Mon, 14 Jun 2010 12:19:55 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > - Why don't you believe LRU ? And if LRU doesn't work well, should it be > > fixed by a knob rather than generic approach ? > > - No side effects ? > > I believe in LRU, just that the problem I am trying to solve is of > using double the memory for caching the same data (consider kvm > running in cache=writethrough or writeback mode, both the hypervisor > and the guest OS maintain a page cache of the same data). As the VM's > grow the overhead is substantial. In my runs I found upto 60% > duplication in some cases. > > > - Linux vm guys tend to say, "free memory is bad memory". ok, for what > free memory created by your patch is used ? IOW, I can't see the benefit. > If free memory that your patch created will be used for another page-cache, > it will be dropped soon by your patch itself. > > Free memory is good for cases when you want to do more in the same > system. I agree that in a bare metail environment that might be > partially true. I don't have a problem with frequently used data being > cached, but I am targetting a consolidated environment at the moment. > Moreover, the administrator has control via a boot option, so it is > non-instrusive in many ways. It sounds that what you want is to improve performance etc. but to make it easy sizing the system and to help admins. Right ? >From performance perspective, I don't see any advantage to drop caches which can be dropped easily. I just use cpus for the purpose it may no be necessary. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control 2010-06-14 7:00 ` KAMEZAWA Hiroyuki @ 2010-06-14 7:36 ` Balbir Singh 2010-06-14 7:49 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-14 7:36 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: kvm, Avi Kivity, linux-mm, linux-kernel * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-06-14 16:00:21]: > On Mon, 14 Jun 2010 12:19:55 +0530 > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > - Why don't you believe LRU ? And if LRU doesn't work well, should it be > > > fixed by a knob rather than generic approach ? > > > - No side effects ? > > > > I believe in LRU, just that the problem I am trying to solve is of > > using double the memory for caching the same data (consider kvm > > running in cache=writethrough or writeback mode, both the hypervisor > > and the guest OS maintain a page cache of the same data). As the VM's > > grow the overhead is substantial. In my runs I found upto 60% > > duplication in some cases. > > > > > > - Linux vm guys tend to say, "free memory is bad memory". ok, for what > > free memory created by your patch is used ? IOW, I can't see the benefit. > > If free memory that your patch created will be used for another page-cache, > > it will be dropped soon by your patch itself. > > > > Free memory is good for cases when you want to do more in the same > > system. I agree that in a bare metail environment that might be > > partially true. I don't have a problem with frequently used data being > > cached, but I am targetting a consolidated environment at the moment. > > Moreover, the administrator has control via a boot option, so it is > > non-instrusive in many ways. > > It sounds that what you want is to improve performance etc. but to make it > easy sizing the system and to help admins. Right ? > Right, to allow freeing up of using double the memory to cache data. > From performance perspective, I don't see any advantage to drop caches > which can be dropped easily. I just use cpus for the purpose it may no > be necessary. > It is not that easy; in a virtualized environment, you do not directly reclaim, but use a mechanism like ballooning, and that too requires smart software to decide where to balloon from. This patch (optionally, if enabled) optimizes that by

1. Reducing double caching
2. Not requiring newer smarts or management software to monitor and balloon
3. Allowing better estimation of free memory by avoiding double caching
4. Allowing immediate use of free memory for other applications or startup of new guest instances.

-- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control 2010-06-14 7:36 ` Balbir Singh @ 2010-06-14 7:49 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 48+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-06-14 7:49 UTC (permalink / raw) To: balbir; +Cc: kvm, Avi Kivity, linux-mm, linux-kernel On Mon, 14 Jun 2010 13:06:46 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > It sounds that what you want is to improve performance etc. but to make it > > easy sizing the system and to help admins. Right ? > > > > Right, to allow freeing up of using double the memory to cache data. > Oh, sorry. ask again.. It sounds that what you want is _not_ to improve performance etc. but to make it ... ? -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-08 15:51 [RFC/T/D][PATCH 0/2] KVM page cache optimization (v2) Balbir Singh 2010-06-08 15:51 ` [RFC][PATCH 1/2] Linux/Guest unmapped page cache control Balbir Singh @ 2010-06-08 15:51 ` Balbir Singh 2010-06-10 9:43 ` Avi Kivity 1 sibling, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-08 15:51 UTC (permalink / raw) To: kvm; +Cc: Avi Kivity, linux-mm, Balbir Singh, linux-kernel Balloon unmapped page cache pages first From: Balbir Singh <balbir@linux.vnet.ibm.com> This patch builds on the ballooning infrastructure by ballooning unmapped page cache pages first. It looks for low-hanging fruit and tries to reclaim clean unmapped pages first. This patch brings zone_reclaim() and other dependencies out of CONFIG_NUMA and then reuses the zone_reclaim_mode logic if __GFP_FREE_CACHE is passed in the gfp_mask. The virtio balloon driver has been changed to use __GFP_FREE_CACHE.

Tests: I ran a simple filter function that kept frequently ballooning a single VM running kernbench. The VM was configured with 2GB of memory and 2 VCPUs. The filter function was a triangular wave that continuously ballooned the VM under study between 500MB and 1500MB. The run times of the VM with and without the changes are shown below; they show no significant impact from the changes.

With changes
Elapsed Time      223.86  (1.52822)
User Time         191.01  (0.65395)
System Time       199.468 (2.43616)
Percent CPU       174     (1)
Context Switches  103182  (595.05)
Sleeps            39107.6 (1505.67)

Without changes
Elapsed Time      225.526 (2.93102)
User Time         193.53  (3.53626)
System Time       199.832 (3.26281)
Percent CPU       173.6   (1.14018)
Context Switches  103744  (1311.53)
Sleeps            39383.2 (831.865)

The key advantage was that it resulted in lower RSS usage in the host and more cached usage, indicating that the caching had been pushed towards the host. The guest cached memory usage was lower and free memory in the guest was also higher. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- drivers/virtio/virtio_balloon.c | 3 ++- include/linux/gfp.h | 8 +++++++- include/linux/swap.h | 9 +++------ mm/page_alloc.c | 3 ++- mm/vmscan.c | 2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 0f1da45..609a9c2 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -104,7 +104,8 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) { struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY | - __GFP_NOMEMALLOC | __GFP_NOWARN); + __GFP_NOMEMALLOC | __GFP_NOWARN | + __GFP_FREE_CACHE); if (!page) { if (printk_ratelimit()) dev_printk(KERN_INFO, &vb->vdev->dev, diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 975609c..9048259 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -61,12 +61,18 @@ struct vm_area_struct; #endif /* + * While allocating pages, try to free cache pages first. Note the + * heavy dependency on zone_reclaim_mode logic + */ +#define __GFP_FREE_CACHE ((__force gfp_t)0x400000u) /* Free cache first */ + +/* * This may seem redundant, but it's a way of annotating false positives vs. * allocations that simply cannot be supported (e.g. page tables). 
*/ #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) -#define __GFP_BITS_SHIFT 22 /* Room for 22 __GFP_FOO bits */ +#define __GFP_BITS_SHIFT 23 /* Room for 22 __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* This equals 0, but use constants in case they ever change */ diff --git a/include/linux/swap.h b/include/linux/swap.h index f92f1ee..f77c603 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -254,16 +254,13 @@ extern long vm_total_pages; extern bool should_balance_unmapped_pages(struct zone *zone); extern int sysctl_min_unmapped_ratio; -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); + +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; #else #define zone_reclaim_mode 0 -static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) -{ - return 0; -} #endif extern int page_evictable(struct page *page, struct vm_area_struct *vma); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fee9420..d977b36 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1649,7 +1649,8 @@ zonelist_scan: classzone_idx, alloc_flags)) goto try_this_zone; - if (zone_reclaim_mode == 0) + if (zone_reclaim_mode == 0 && + !(gfp_mask & __GFP_FREE_CACHE)) goto this_zone_full; ret = zone_reclaim(zone, gfp_mask, order); diff --git a/mm/vmscan.c b/mm/vmscan.c index 27bc536..393bee5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2624,6 +2624,7 @@ module_init(kswapd_init) * the watermarks. */ int zone_reclaim_mode __read_mostly; +#endif /* * If the number of slab pages in a zone grows beyond this percentage then @@ -2780,7 +2781,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) return ret; } -#endif /* * page_evictable - test whether a page is evictable -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 48+ messages in thread
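A small aside on the gfp.h hunk above, for anyone checking the constants: 0x400000u is 1 << 22, i.e. the new flag occupies bit 22, which is why __GFP_BITS_SHIFT has to grow from 22 to 23 so that __GFP_BITS_MASK still covers every flag. A trivial userspace check, with the value copied from the patch:

/* Sanity check of the new flag's bit position; not kernel code. */
#include <assert.h>
#include <stdio.h>

#define GFP_FREE_CACHE_VALUE 0x400000u	/* value from the patch above */

int main(void)
{
	assert(GFP_FREE_CACHE_VALUE == (1u << 22));	/* flag occupies bit 22 */
	printf("bits 0..22 in use => __GFP_BITS_SHIFT must be 23\n");
	return 0;
}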
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-08 15:51 ` [RFC/T/D][PATCH 2/2] Linux/Guest cooperative " Balbir Singh @ 2010-06-10 9:43 ` Avi Kivity 2010-06-10 14:25 ` Balbir Singh 0 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-10 9:43 UTC (permalink / raw) To: Balbir Singh; +Cc: kvm, linux-mm, linux-kernel On 06/08/2010 06:51 PM, Balbir Singh wrote: > Balloon unmapped page cache pages first > > From: Balbir Singh<balbir@linux.vnet.ibm.com> > > This patch builds on the ballooning infrastructure by ballooning unmapped > page cache pages first. It looks for low hanging fruit first and tries > to reclaim clean unmapped pages first. > I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten. Many workloads have many unmapped cache pages, for example static web serving and the all-important kernel build. > The key advantage was that it resulted in lesser RSS usage in the host and > more cached usage, indicating that the caching had been pushed towards > the host. The guest cached memory usage was lower and free memory in > the guest was also higher. > Caching in the host is only helpful if the cache can be shared, otherwise it's better to cache in the guest. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-10 9:43 ` Avi Kivity @ 2010-06-10 14:25 ` Balbir Singh 2010-06-11 0:07 ` Dave Hansen 0 siblings, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-10 14:25 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm, linux-mm, linux-kernel * Avi Kivity <avi@redhat.com> [2010-06-10 12:43:11]: > On 06/08/2010 06:51 PM, Balbir Singh wrote: > >Balloon unmapped page cache pages first > > > >From: Balbir Singh<balbir@linux.vnet.ibm.com> > > > >This patch builds on the ballooning infrastructure by ballooning unmapped > >page cache pages first. It looks for low hanging fruit first and tries > >to reclaim clean unmapped pages first. > > I'm not sure victimizing unmapped cache pages is a good idea. > Shouldn't page selection use the LRU for recency information instead > of the cost of guest reclaim? Dropping a frequently used unmapped > cache page can be more expensive than dropping an unused text page > that was loaded as part of some executable's initialization and > forgotten. > We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after. > Many workloads have many unmapped cache pages, for example static > web serving and the all-important kernel build. > I've tested kernbench, you can see the results in the original posting and there is no observable overhead as a result of the patch in my run. > >The key advantage was that it resulted in lesser RSS usage in the host and > >more cached usage, indicating that the caching had been pushed towards > >the host. The guest cached memory usage was lower and free memory in > >the guest was also higher. > > Caching in the host is only helpful if the cache can be shared, > otherwise it's better to cache in the guest. > Hmm.. so we would need a ballon cache hint from the monitor, so that it is not unconditional? Overall my results show the following 1. No drastic reduction of guest unmapped cache, just sufficient to show lesser RSS in the host. More freeable memory (as in cached memory + free memory) visible on the host. 2. No significant impact on the benchmark (numbers) running in the guest. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-10 14:25 ` Balbir Singh @ 2010-06-11 0:07 ` Dave Hansen 2010-06-11 1:54 ` KAMEZAWA Hiroyuki 2010-06-11 4:56 ` Balbir Singh 0 siblings, 2 replies; 48+ messages in thread From: Dave Hansen @ 2010-06-11 0:07 UTC (permalink / raw) To: balbir; +Cc: Avi Kivity, kvm, linux-mm, linux-kernel On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote: > > I'm not sure victimizing unmapped cache pages is a good idea. > > Shouldn't page selection use the LRU for recency information instead > > of the cost of guest reclaim? Dropping a frequently used unmapped > > cache page can be more expensive than dropping an unused text page > > that was loaded as part of some executable's initialization and > > forgotten. > > We victimize the unmapped cache only if it is unused (in LRU order). > We don't force the issue too much. We also have free slab cache to go > after. Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd. Which page will be preferred for eviction with this patch set? -- Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-11 0:07 ` Dave Hansen @ 2010-06-11 1:54 ` KAMEZAWA Hiroyuki 2010-06-11 4:46 ` Balbir Singh 2010-06-11 4:56 ` Balbir Singh 1 sibling, 1 reply; 48+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-06-11 1:54 UTC (permalink / raw) To: Dave Hansen; +Cc: balbir, Avi Kivity, kvm, linux-mm, linux-kernel On Thu, 10 Jun 2010 17:07:32 -0700 Dave Hansen <dave@linux.vnet.ibm.com> wrote: > On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote: > > > I'm not sure victimizing unmapped cache pages is a good idea. > > > Shouldn't page selection use the LRU for recency information instead > > > of the cost of guest reclaim? Dropping a frequently used unmapped > > > cache page can be more expensive than dropping an unused text page > > > that was loaded as part of some executable's initialization and > > > forgotten. > > > > We victimize the unmapped cache only if it is unused (in LRU order). > > We don't force the issue too much. We also have free slab cache to go > > after. > > Just to be clear, let's say we have a mapped page (say of /sbin/init) > that's been unreferenced since _just_ after the system booted. We also > have an unmapped page cache page of a file often used at runtime, say > one from /etc/resolv.conf or /etc/passwd. > Hmm. I'm not fan of estimating working set size by calculation based on some numbers without considering history or feedback. Can't we use some kind of feedback algorithm as hi-low-watermark, random walk or GA (or somehing more smart) to detect the size ? Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-11 1:54 ` KAMEZAWA Hiroyuki @ 2010-06-11 4:46 ` Balbir Singh 2010-06-11 5:05 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-11 4:46 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: Dave Hansen, Avi Kivity, kvm, linux-mm, linux-kernel * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-06-11 10:54:41]: > On Thu, 10 Jun 2010 17:07:32 -0700 > Dave Hansen <dave@linux.vnet.ibm.com> wrote: > > > On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote: > > > > I'm not sure victimizing unmapped cache pages is a good idea. > > > > Shouldn't page selection use the LRU for recency information instead > > > > of the cost of guest reclaim? Dropping a frequently used unmapped > > > > cache page can be more expensive than dropping an unused text page > > > > that was loaded as part of some executable's initialization and > > > > forgotten. > > > > > > We victimize the unmapped cache only if it is unused (in LRU order). > > > We don't force the issue too much. We also have free slab cache to go > > > after. > > > > Just to be clear, let's say we have a mapped page (say of /sbin/init) > > that's been unreferenced since _just_ after the system booted. We also > > have an unmapped page cache page of a file often used at runtime, say > > one from /etc/resolv.conf or /etc/passwd. > > > > Hmm. I'm not fan of estimating working set size by calculation > based on some numbers without considering history or feedback. > > Can't we use some kind of feedback algorithm as hi-low-watermark, random walk > or GA (or somehing more smart) to detect the size ? > Could you please clarify at what level you are suggesting size detection? I assume it is outside the OS, right? -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-11 4:46 ` Balbir Singh @ 2010-06-11 5:05 ` KAMEZAWA Hiroyuki 2010-06-11 5:08 ` KAMEZAWA Hiroyuki 2010-06-11 6:14 ` Balbir Singh 0 siblings, 2 replies; 48+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-06-11 5:05 UTC (permalink / raw) To: balbir; +Cc: Dave Hansen, Avi Kivity, kvm, linux-mm, linux-kernel On Fri, 11 Jun 2010 10:16:32 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-06-11 10:54:41]: > > > On Thu, 10 Jun 2010 17:07:32 -0700 > > Dave Hansen <dave@linux.vnet.ibm.com> wrote: > > > > > On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote: > > > > > I'm not sure victimizing unmapped cache pages is a good idea. > > > > > Shouldn't page selection use the LRU for recency information instead > > > > > of the cost of guest reclaim? Dropping a frequently used unmapped > > > > > cache page can be more expensive than dropping an unused text page > > > > > that was loaded as part of some executable's initialization and > > > > > forgotten. > > > > > > > > We victimize the unmapped cache only if it is unused (in LRU order). > > > > We don't force the issue too much. We also have free slab cache to go > > > > after. > > > > > > Just to be clear, let's say we have a mapped page (say of /sbin/init) > > > that's been unreferenced since _just_ after the system booted. We also > > > have an unmapped page cache page of a file often used at runtime, say > > > one from /etc/resolv.conf or /etc/passwd. > > > > > > > Hmm. I'm not fan of estimating working set size by calculation > > based on some numbers without considering history or feedback. > > > > Can't we use some kind of feedback algorithm as hi-low-watermark, random walk > > or GA (or somehing more smart) to detect the size ? > > > > Could you please clarify at what level you are suggesting size > detection? I assume it is outside the OS, right? > "OS" includes kernel and system programs ;) I can think of both way in kernel and in user approarh and they should be complement to each other. An example of kernel-based approach is. 1. add a shrinker callback(A) for balloon-driver-for-guest as guest kswapd. 2. add a shrinker callback(B) for balloon-driver-for-host as host kswapd. (I guess current balloon driver is only for host. Please imagine.) (A) increases free memory in Guest. (B) increases free memory in Host. This is an example of feedback based memory resizing between host and guest. I think (B) is necessary at least before considering complecated things. To implement something clever, (A) and (B) should take into account that how frequently memory reclaim in guest (which requires some I/O) happens. If doing outside kernel, I think using memcg is better than depends on balloon driver. But co-operative balloon and memcg may show us something good. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-11 5:05 ` KAMEZAWA Hiroyuki @ 2010-06-11 5:08 ` KAMEZAWA Hiroyuki 2010-06-11 6:14 ` Balbir Singh 1 sibling, 0 replies; 48+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-06-11 5:08 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: balbir, Dave Hansen, Avi Kivity, kvm, linux-mm, linux-kernel On Fri, 11 Jun 2010 14:05:53 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > I can think of both way in kernel and in user approarh and they should be > complement to each other. > > An example of kernel-based approach is. > 1. add a shrinker callback(A) for balloon-driver-for-guest as guest kswapd. > 2. add a shrinker callback(B) for balloon-driver-for-host as host kswapd. > (I guess current balloon driver is only for host. Please imagine.) ^^^^ guest. Sorry. -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
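To make callback (A) above a little more concrete, here is a rough sketch of what a guest-side balloon shrinker might look like if it were added to the virtio balloon driver of this era; it is only an illustration of the idea, not a tested patch. The two-argument shrink signature matches the kernels this thread targets (it changed in later releases), and reusing leak_balloon() and vb->num_pages like this assumes the code sits inside drivers/virtio/virtio_balloon.c next to those internals.

/*
 * Sketch: let the guest's own memory pressure deflate the balloon, i.e.
 * Kamezawa's callback (A).  Assumes it lives in virtio_balloon.c so that
 * leak_balloon() and struct virtio_balloon are visible.
 */
static struct virtio_balloon *shrinker_vb;	/* set once at probe time */

static int balloon_shrink(int nr_to_scan, gfp_t gfp_mask)
{
	struct virtio_balloon *vb = shrinker_vb;

	if (!vb)
		return 0;

	if (!(gfp_mask & __GFP_WAIT))
		return -1;	/* deflating may sleep while talking to the host */

	if (nr_to_scan)
		/* give ballooned pages back to the guest */
		leak_balloon(vb, nr_to_scan);

	/* report how much the balloon could still give back */
	return vb->num_pages;
}

static struct shrinker balloon_shrinker = {
	.shrink	= balloon_shrink,
	.seeks	= DEFAULT_SEEKS,
};

/* in the probe path: shrinker_vb = vb; register_shrinker(&balloon_shrinker); */

Callback (B), the host-side equivalent, would live in the host's memory manager or management daemon rather than in the guest driver, so it is not sketched here.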
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-11 5:05 ` KAMEZAWA Hiroyuki 2010-06-11 5:08 ` KAMEZAWA Hiroyuki @ 2010-06-11 6:14 ` Balbir Singh 1 sibling, 0 replies; 48+ messages in thread From: Balbir Singh @ 2010-06-11 6:14 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: Dave Hansen, Avi Kivity, kvm, linux-mm, linux-kernel * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-06-11 14:05:53]: > On Fri, 11 Jun 2010 10:16:32 +0530 > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-06-11 10:54:41]: > > > > > On Thu, 10 Jun 2010 17:07:32 -0700 > > > Dave Hansen <dave@linux.vnet.ibm.com> wrote: > > > > > > > On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote: > > > > > > I'm not sure victimizing unmapped cache pages is a good idea. > > > > > > Shouldn't page selection use the LRU for recency information instead > > > > > > of the cost of guest reclaim? Dropping a frequently used unmapped > > > > > > cache page can be more expensive than dropping an unused text page > > > > > > that was loaded as part of some executable's initialization and > > > > > > forgotten. > > > > > > > > > > We victimize the unmapped cache only if it is unused (in LRU order). > > > > > We don't force the issue too much. We also have free slab cache to go > > > > > after. > > > > > > > > Just to be clear, let's say we have a mapped page (say of /sbin/init) > > > > that's been unreferenced since _just_ after the system booted. We also > > > > have an unmapped page cache page of a file often used at runtime, say > > > > one from /etc/resolv.conf or /etc/passwd. > > > > > > > > > > Hmm. I'm not fan of estimating working set size by calculation > > > based on some numbers without considering history or feedback. > > > > > > Can't we use some kind of feedback algorithm as hi-low-watermark, random walk > > > or GA (or somehing more smart) to detect the size ? > > > > > > > Could you please clarify at what level you are suggesting size > > detection? I assume it is outside the OS, right? > > > "OS" includes kernel and system programs ;) > > I can think of both way in kernel and in user approarh and they should be > complement to each other. > > An example of kernel-based approach is. > 1. add a shrinker callback(A) for balloon-driver-for-guest as guest kswapd. > 2. add a shrinker callback(B) for balloon-driver-for-host as host kswapd. > (I guess current balloon driver is only for host. Please imagine.) > > (A) increases free memory in Guest. > (B) increases free memory in Host. > > This is an example of feedback based memory resizing between host and guest. > > I think (B) is necessary at least before considering complecated things. B is left to the hypervisor and the memory policy running on it. My patches address Linux running as a guest, with a Linux hypervisor at the moment, but that can be extended to other balloon drivers as well. > > To implement something clever, (A) and (B) should take into account that > how frequently memory reclaim in guest (which requires some I/O) happens. > Yes, I think the policy in the hypervisor needs to look at those details as well. > If doing outside kernel, I think using memcg is better than depends on > balloon driver. But co-operative balloon and memcg may show us something > good. > Yes, agreed. Co-operative is better, if there is no co-operation than memcg might be used for enforcement. 
-- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-11 0:07 ` Dave Hansen 2010-06-11 1:54 ` KAMEZAWA Hiroyuki @ 2010-06-11 4:56 ` Balbir Singh 2010-06-14 8:09 ` Avi Kivity 1 sibling, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-11 4:56 UTC (permalink / raw) To: Dave Hansen; +Cc: Avi Kivity, kvm, linux-mm, linux-kernel * Dave Hansen <dave@linux.vnet.ibm.com> [2010-06-10 17:07:32]: > On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote: > > > I'm not sure victimizing unmapped cache pages is a good idea. > > > Shouldn't page selection use the LRU for recency information instead > > > of the cost of guest reclaim? Dropping a frequently used unmapped > > > cache page can be more expensive than dropping an unused text page > > > that was loaded as part of some executable's initialization and > > > forgotten. > > > > We victimize the unmapped cache only if it is unused (in LRU order). > > We don't force the issue too much. We also have free slab cache to go > > after. > > Just to be clear, let's say we have a mapped page (say of /sbin/init) > that's been unreferenced since _just_ after the system booted. We also > have an unmapped page cache page of a file often used at runtime, say > one from /etc/resolv.conf or /etc/passwd. > > Which page will be preferred for eviction with this patch set? > In this case the order is as follows 1. First we pick free pages if any 2. If we don't have free pages, we go after unmapped page cache and slab cache 3. If that fails as well, we go after regularly memory In the scenario that you describe, we'll not be able to easily free up the frequently referenced page from /etc/*. The code will move on to step 3 and do its regular reclaim. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
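Restated as a toy userspace function, the preference order described above looks like this; it is only an illustration of the three steps for readers following the thread, not the actual reclaim code, and the numbers in main() are made up.

/*
 * Toy restatement of the order described above: free pages first, then
 * unmapped page cache plus free slab pages, and only then regular reclaim
 * (which is where mapped pages such as Dave's /sbin/init example would be
 * considered).  Not kernel code.
 */
#include <stdio.h>

enum step { USE_FREE_PAGES, DROP_UNMAPPED_CACHE_AND_SLAB, REGULAR_RECLAIM };

static enum step pick_step(unsigned long free_pages,
			   unsigned long unmapped_cache,
			   unsigned long free_slab,
			   unsigned long needed)
{
	if (free_pages >= needed)
		return USE_FREE_PAGES;			/* step 1 */
	if (unmapped_cache + free_slab >= needed)
		return DROP_UNMAPPED_CACHE_AND_SLAB;	/* step 2 */
	return REGULAR_RECLAIM;				/* step 3 */
}

int main(void)
{
	/* made-up numbers: no free pages, plenty of unmapped cache */
	printf("step = %d\n", pick_step(0, 4096, 512, 1024));
	return 0;
}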
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-11 4:56 ` Balbir Singh @ 2010-06-14 8:09 ` Avi Kivity 2010-06-14 8:48 ` Balbir Singh 0 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-14 8:09 UTC (permalink / raw) To: balbir; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel On 06/11/2010 07:56 AM, Balbir Singh wrote: > >> Just to be clear, let's say we have a mapped page (say of /sbin/init) >> that's been unreferenced since _just_ after the system booted. We also >> have an unmapped page cache page of a file often used at runtime, say >> one from /etc/resolv.conf or /etc/passwd. >> >> Which page will be preferred for eviction with this patch set? >> >> > In this case the order is as follows > > 1. First we pick free pages if any > 2. If we don't have free pages, we go after unmapped page cache and > slab cache > 3. If that fails as well, we go after regularly memory > > In the scenario that you describe, we'll not be able to easily free up > the frequently referenced page from /etc/*. The code will move on to > step 3 and do its regular reclaim. > Still it seems to me you are subverting the normal order of reclaim. I don't see why an unmapped page cache or slab cache item should be evicted before a mapped page. Certainly the cost of rebuilding a dentry compared to the gain from evicting it, is much higher than that of reestablishing a mapped page. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 8:09 ` Avi Kivity @ 2010-06-14 8:48 ` Balbir Singh 2010-06-14 12:40 ` Avi Kivity 2010-06-14 15:12 ` Dave Hansen 0 siblings, 2 replies; 48+ messages in thread From: Balbir Singh @ 2010-06-14 8:48 UTC (permalink / raw) To: Avi Kivity; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel * Avi Kivity <avi@redhat.com> [2010-06-14 11:09:44]: > On 06/11/2010 07:56 AM, Balbir Singh wrote: > > > >>Just to be clear, let's say we have a mapped page (say of /sbin/init) > >>that's been unreferenced since _just_ after the system booted. We also > >>have an unmapped page cache page of a file often used at runtime, say > >>one from /etc/resolv.conf or /etc/passwd. > >> > >>Which page will be preferred for eviction with this patch set? > >> > >In this case the order is as follows > > > >1. First we pick free pages if any > >2. If we don't have free pages, we go after unmapped page cache and > >slab cache > >3. If that fails as well, we go after regularly memory > > > >In the scenario that you describe, we'll not be able to easily free up > >the frequently referenced page from /etc/*. The code will move on to > >step 3 and do its regular reclaim. > > Still it seems to me you are subverting the normal order of reclaim. > I don't see why an unmapped page cache or slab cache item should be > evicted before a mapped page. Certainly the cost of rebuilding a > dentry compared to the gain from evicting it, is much higher than > that of reestablishing a mapped page. > Subverting to aviod memory duplication, the word subverting is overloaded, let me try to reason a bit. First let me explain the problem Memory is a precious resource in a consolidated environment. We don't want to waste memory via page cache duplication (cache=writethrough and cache=writeback mode). Now here is what we are trying to do 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. 2. In the case of page cache (specifically unmapped page cache), there is duplication already, so why not go after unmapped page caches when the system is under memory pressure? In the case of 1, we don't force a dentry to be freed, but rather a freed page in the slab cache to be reclaimed ahead of forcing reclaim of mapped pages. Does the problem statement make sense? If so, do you agree with 1 and 2? Is there major concern about subverting regular reclaim? Does subverting it make sense in the duplicated scenario? -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 8:48 ` Balbir Singh @ 2010-06-14 12:40 ` Avi Kivity 2010-06-14 12:50 ` Balbir Singh 2010-06-14 15:12 ` Dave Hansen 1 sibling, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-14 12:40 UTC (permalink / raw) To: balbir; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel On 06/14/2010 11:48 AM, Balbir Singh wrote: >>> >>> In this case the order is as follows >>> >>> 1. First we pick free pages if any >>> 2. If we don't have free pages, we go after unmapped page cache and >>> slab cache >>> 3. If that fails as well, we go after regularly memory >>> >>> In the scenario that you describe, we'll not be able to easily free up >>> the frequently referenced page from /etc/*. The code will move on to >>> step 3 and do its regular reclaim. >>> >> Still it seems to me you are subverting the normal order of reclaim. >> I don't see why an unmapped page cache or slab cache item should be >> evicted before a mapped page. Certainly the cost of rebuilding a >> dentry compared to the gain from evicting it, is much higher than >> that of reestablishing a mapped page. >> >> > Subverting to aviod memory duplication, the word subverting is > overloaded, Right, should have used a different one. > let me try to reason a bit. First let me explain the > problem > > Memory is a precious resource in a consolidated environment. > We don't want to waste memory via page cache duplication > (cache=writethrough and cache=writeback mode). > > Now here is what we are trying to do > > 1. A slab page will not be freed until the entire page is free (all > slabs have been kfree'd so to speak). Normal reclaim will definitely > free this page, but a lot of it depends on how frequently we are > scanning the LRU list and when this page got added. > 2. In the case of page cache (specifically unmapped page cache), there > is duplication already, so why not go after unmapped page caches when > the system is under memory pressure? > > In the case of 1, we don't force a dentry to be freed, but rather a > freed page in the slab cache to be reclaimed ahead of forcing reclaim > of mapped pages. > Sounds like this should be done unconditionally, then. An empty slab page is worth less than an unmapped pagecache page at all times, no? > Does the problem statement make sense? If so, do you agree with 1 and > 2? Is there major concern about subverting regular reclaim? Does > subverting it make sense in the duplicated scenario? > > In the case of 2, how do you know there is duplication? You know the guest caches the page, but you have no information about the host. Since the page is cached in the guest, the host doesn't see it referenced, and is likely to drop it. If there is no duplication, then you may have dropped a recently-used page and will likely cause a major fault soon. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 12:40 ` Avi Kivity @ 2010-06-14 12:50 ` Balbir Singh 2010-06-14 13:01 ` Avi Kivity 0 siblings, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-14 12:50 UTC (permalink / raw) To: Avi Kivity; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel * Avi Kivity <avi@redhat.com> [2010-06-14 15:40:28]: > On 06/14/2010 11:48 AM, Balbir Singh wrote: > >>> > >>>In this case the order is as follows > >>> > >>>1. First we pick free pages if any > >>>2. If we don't have free pages, we go after unmapped page cache and > >>>slab cache > >>>3. If that fails as well, we go after regularly memory > >>> > >>>In the scenario that you describe, we'll not be able to easily free up > >>>the frequently referenced page from /etc/*. The code will move on to > >>>step 3 and do its regular reclaim. > >>Still it seems to me you are subverting the normal order of reclaim. > >>I don't see why an unmapped page cache or slab cache item should be > >>evicted before a mapped page. Certainly the cost of rebuilding a > >>dentry compared to the gain from evicting it, is much higher than > >>that of reestablishing a mapped page. > >> > >Subverting to aviod memory duplication, the word subverting is > >overloaded, > > Right, should have used a different one. > > >let me try to reason a bit. First let me explain the > >problem > > > >Memory is a precious resource in a consolidated environment. > >We don't want to waste memory via page cache duplication > >(cache=writethrough and cache=writeback mode). > > > >Now here is what we are trying to do > > > >1. A slab page will not be freed until the entire page is free (all > >slabs have been kfree'd so to speak). Normal reclaim will definitely > >free this page, but a lot of it depends on how frequently we are > >scanning the LRU list and when this page got added. > >2. In the case of page cache (specifically unmapped page cache), there > >is duplication already, so why not go after unmapped page caches when > >the system is under memory pressure? > > > >In the case of 1, we don't force a dentry to be freed, but rather a > >freed page in the slab cache to be reclaimed ahead of forcing reclaim > >of mapped pages. > > Sounds like this should be done unconditionally, then. An empty > slab page is worth less than an unmapped pagecache page at all > times, no? > In a consolidated environment, even at the cost of some CPU to run shrinkers, I think potentially yes. > >Does the problem statement make sense? If so, do you agree with 1 and > >2? Is there major concern about subverting regular reclaim? Does > >subverting it make sense in the duplicated scenario? > > > > In the case of 2, how do you know there is duplication? You know > the guest caches the page, but you have no information about the > host. Since the page is cached in the guest, the host doesn't see > it referenced, and is likely to drop it. True, that is why the first patch is controlled via a boot parameter that the host can pass. For the second patch, I think we'll need something like a balloon <size> <cache?> with the cache argument being optional. > > If there is no duplication, then you may have dropped a > recently-used page and will likely cause a major fault soon. > Yes, agreed. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
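The "balloon <size> <cache?>" interface mentioned above is only a proposal at this point. Purely as an illustration of its shape (this is not an existing virtio-balloon or QEMU monitor interface), it could carry an optional hint like this:

    /* Hypothetical host-to-guest request with an optional cache hint,
     * illustrating the proposed "balloon <size> <cache?>" command. */
    enum balloon_hint {
            BALLOON_HINT_NONE  = 0,   /* current behaviour: plain inflate   */
            BALLOON_HINT_CACHE = 1,   /* prefer dropping unmapped pagecache */
    };

    struct balloon_request {
            unsigned long target_pages;  /* size the guest should shrink to */
            enum balloon_hint hint;      /* optional cache preference       */
    };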
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 12:50 ` Balbir Singh @ 2010-06-14 13:01 ` Avi Kivity 2010-06-14 15:33 ` Dave Hansen 0 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-14 13:01 UTC (permalink / raw) To: balbir; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel On 06/14/2010 03:50 PM, Balbir Singh wrote: > >> >>> let me try to reason a bit. First let me explain the >>> problem >>> >>> Memory is a precious resource in a consolidated environment. >>> We don't want to waste memory via page cache duplication >>> (cache=writethrough and cache=writeback mode). >>> >>> Now here is what we are trying to do >>> >>> 1. A slab page will not be freed until the entire page is free (all >>> slabs have been kfree'd so to speak). Normal reclaim will definitely >>> free this page, but a lot of it depends on how frequently we are >>> scanning the LRU list and when this page got added. >>> 2. In the case of page cache (specifically unmapped page cache), there >>> is duplication already, so why not go after unmapped page caches when >>> the system is under memory pressure? >>> >>> In the case of 1, we don't force a dentry to be freed, but rather a >>> freed page in the slab cache to be reclaimed ahead of forcing reclaim >>> of mapped pages. >>> >> Sounds like this should be done unconditionally, then. An empty >> slab page is worth less than an unmapped pagecache page at all >> times, no? >> >> > In a consolidated environment, even at the cost of some CPU to run > shrinkers, I think potentially yes. > I don't understand. If you're running the shrinkers then you're evicting live entries, which could cost you an I/O each. That's expensive, consolidated or not. If you're not running the shrinkers, why does it matter if you're consolidated or not? Drop that age unconditionally. >>> Does the problem statement make sense? If so, do you agree with 1 and >>> 2? Is there major concern about subverting regular reclaim? Does >>> subverting it make sense in the duplicated scenario? >>> >>> >> In the case of 2, how do you know there is duplication? You know >> the guest caches the page, but you have no information about the >> host. Since the page is cached in the guest, the host doesn't see >> it referenced, and is likely to drop it. >> > True, that is why the first patch is controlled via a boot parameter > that the host can pass. For the second patch, I think we'll need > something like a balloon<size> <cache?> with the cache argument being > optional. > Whether a page is duplicated on the host or not is per-page, it cannot be a boot parameter. If we drop unmapped pagecache pages, we need to be sure they can be backed by the host, and that depends on the amount of sharing. Overall, I don't see how a user can tune this. If I were a guest admin, I'd play it safe by not assuming the host will back me, and disabling the feature. To get something like this to work, we need to reward cooperating guests somehow. >> If there is no duplication, then you may have dropped a >> recently-used page and will likely cause a major fault soon. >> > Yes, agreed. > So how do we deal with this? -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 13:01 ` Avi Kivity @ 2010-06-14 15:33 ` Dave Hansen 2010-06-14 15:44 ` Avi Kivity 0 siblings, 1 reply; 48+ messages in thread From: Dave Hansen @ 2010-06-14 15:33 UTC (permalink / raw) To: Avi Kivity; +Cc: balbir, kvm, linux-mm, linux-kernel On Mon, 2010-06-14 at 16:01 +0300, Avi Kivity wrote: > If we drop unmapped pagecache pages, we need to be sure they can be > backed by the host, and that depends on the amount of sharing. You also have to set up the host up properly, and continue to maintain it in a way that finds and eliminates duplicates. I saw some benchmarks where KSM was doing great, finding lots of duplicate pages. Then, the host filled up, and guests started reclaiming. As memory pressure got worse, so did KSM's ability to find duplicates. At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim? -- Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 15:33 ` Dave Hansen @ 2010-06-14 15:44 ` Avi Kivity 2010-06-14 15:55 ` Dave Hansen 0 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-14 15:44 UTC (permalink / raw) To: Dave Hansen; +Cc: balbir, kvm, linux-mm, linux-kernel On 06/14/2010 06:33 PM, Dave Hansen wrote: > On Mon, 2010-06-14 at 16:01 +0300, Avi Kivity wrote: > >> If we drop unmapped pagecache pages, we need to be sure they can be >> backed by the host, and that depends on the amount of sharing. >> > You also have to set up the host up properly, and continue to maintain > it in a way that finds and eliminates duplicates. > > I saw some benchmarks where KSM was doing great, finding lots of > duplicate pages. Then, the host filled up, and guests started > reclaiming. As memory pressure got worse, so did KSM's ability to find > duplicates. > Yup. KSM needs to be backed up by ballooning, swap, and live migration. > At the same time, I see what you're trying to do with this. It really > can be an alternative to ballooning if we do it right, since ballooning > would probably evict similar pages. Although it would only work in idle > guests, what about a knob that the host can turn to just get the guest > to start running reclaim? > Isn't the knob in this proposal the balloon? AFAICT, the idea here is to change how the guest reacts to being ballooned, but the trigger itself would not change. My issue is that changing the type of object being preferentially reclaimed just changes the type of workload that would prematurely suffer from reclaim. In this case, workloads that use a lot of unmapped pagecache would suffer. btw, aren't /proc/sys/vm/swapiness and vfs_cache_pressure similar knobs? -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
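For reference, the two knobs mentioned above are ordinary sysctls that a guest admin can already tune; a small user-space snippet follows (the values 200 and 10 are arbitrary examples, the defaults being 100 and 60 respectively).

    #include <stdio.h>

    /* Bias reclaim toward dropping dentry/inode caches and away from
     * swapping out anonymous memory.  Example values only. */
    int main(void)
    {
            FILE *f;

            f = fopen("/proc/sys/vm/vfs_cache_pressure", "w");
            if (f) {
                    fprintf(f, "200\n");
                    fclose(f);
            }

            f = fopen("/proc/sys/vm/swappiness", "w");
            if (f) {
                    fprintf(f, "10\n");
                    fclose(f);
            }
            return 0;
    }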
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 15:44 ` Avi Kivity @ 2010-06-14 15:55 ` Dave Hansen 2010-06-14 16:34 ` Avi Kivity 0 siblings, 1 reply; 48+ messages in thread From: Dave Hansen @ 2010-06-14 15:55 UTC (permalink / raw) To: Avi Kivity; +Cc: balbir, kvm, linux-mm, linux-kernel On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote: > On 06/14/2010 06:33 PM, Dave Hansen wrote: > > At the same time, I see what you're trying to do with this. It really > > can be an alternative to ballooning if we do it right, since ballooning > > would probably evict similar pages. Although it would only work in idle > > guests, what about a knob that the host can turn to just get the guest > > to start running reclaim? > > Isn't the knob in this proposal the balloon? AFAICT, the idea here is > to change how the guest reacts to being ballooned, but the trigger > itself would not change. I think the patch was made on the following assumptions: 1. Guests will keep filling their memory with relatively worthless page cache that they don't really need. 2. When they do this, it hurts the overall system with no real gain for anyone. In the case of a ballooned guest, they _won't_ keep filling memory. The balloon will prevent them. So, I guess I was just going down the path of considering if this would be useful without ballooning in place. To me, it's really hard to justify _with_ ballooning in place. > My issue is that changing the type of object being preferentially > reclaimed just changes the type of workload that would prematurely > suffer from reclaim. In this case, workloads that use a lot of unmapped > pagecache would suffer. > > btw, aren't /proc/sys/vm/swapiness and vfs_cache_pressure similar knobs? Those tell you how to balance going after the different classes of things that we can reclaim. Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure. Balbir, can you elaborate a bit on why you would need these patches on a guest that is being ballooned? -- Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 15:55 ` Dave Hansen @ 2010-06-14 16:34 ` Avi Kivity 2010-06-14 17:45 ` Balbir Singh 2010-06-14 17:58 ` Dave Hansen 0 siblings, 2 replies; 48+ messages in thread From: Avi Kivity @ 2010-06-14 16:34 UTC (permalink / raw) To: Dave Hansen; +Cc: balbir, kvm, linux-mm, linux-kernel On 06/14/2010 06:55 PM, Dave Hansen wrote: > On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote: > >> On 06/14/2010 06:33 PM, Dave Hansen wrote: >> >>> At the same time, I see what you're trying to do with this. It really >>> can be an alternative to ballooning if we do it right, since ballooning >>> would probably evict similar pages. Although it would only work in idle >>> guests, what about a knob that the host can turn to just get the guest >>> to start running reclaim? >>> >> Isn't the knob in this proposal the balloon? AFAICT, the idea here is >> to change how the guest reacts to being ballooned, but the trigger >> itself would not change. >> > I think the patch was made on the following assumptions: > 1. Guests will keep filling their memory with relatively worthless page > cache that they don't really need. > 2. When they do this, it hurts the overall system with no real gain for > anyone. > > In the case of a ballooned guest, they _won't_ keep filling memory. The > balloon will prevent them. So, I guess I was just going down the path > of considering if this would be useful without ballooning in place. To > me, it's really hard to justify _with_ ballooning in place. > There are two decisions that need to be made: - how much memory a guest should be given - given some guest memory, what's the best use for it The first question can perhaps be answered by looking at guest I/O rates and giving more memory to more active guests. The second question is hard, but not any different than running non-virtualized - except if we can detect sharing or duplication. In this case, dropping a duplicated page is worthwhile, while dropping a shared page provides no benefit. How the patch helps answer either question, I'm not sure. I don't think preferential dropping of unmapped page cache is the answer. >> My issue is that changing the type of object being preferentially >> reclaimed just changes the type of workload that would prematurely >> suffer from reclaim. In this case, workloads that use a lot of unmapped >> pagecache would suffer. >> >> btw, aren't /proc/sys/vm/swapiness and vfs_cache_pressure similar knobs? >> > Those tell you how to balance going after the different classes of > things that we can reclaim. > > Again, this is useless when ballooning is being used. But, I'm thinking > of a more general mechanism to force the system to both have MemFree > _and_ be acting as if it is under memory pressure. > If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure. If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 16:34 ` Avi Kivity @ 2010-06-14 17:45 ` Balbir Singh 2010-06-15 6:58 ` Avi Kivity 2010-06-14 17:58 ` Dave Hansen 1 sibling, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-14 17:45 UTC (permalink / raw) To: Avi Kivity; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel * Avi Kivity <avi@redhat.com> [2010-06-14 19:34:00]: > On 06/14/2010 06:55 PM, Dave Hansen wrote: > >On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote: > >>On 06/14/2010 06:33 PM, Dave Hansen wrote: > >>>At the same time, I see what you're trying to do with this. It really > >>>can be an alternative to ballooning if we do it right, since ballooning > >>>would probably evict similar pages. Although it would only work in idle > >>>guests, what about a knob that the host can turn to just get the guest > >>>to start running reclaim? > >>Isn't the knob in this proposal the balloon? AFAICT, the idea here is > >>to change how the guest reacts to being ballooned, but the trigger > >>itself would not change. > >I think the patch was made on the following assumptions: > >1. Guests will keep filling their memory with relatively worthless page > > cache that they don't really need. > >2. When they do this, it hurts the overall system with no real gain for > > anyone. > > > >In the case of a ballooned guest, they _won't_ keep filling memory. The > >balloon will prevent them. So, I guess I was just going down the path > >of considering if this would be useful without ballooning in place. To > >me, it's really hard to justify _with_ ballooning in place. > > There are two decisions that need to be made: > > - how much memory a guest should be given > - given some guest memory, what's the best use for it > > The first question can perhaps be answered by looking at guest I/O > rates and giving more memory to more active guests. The second > question is hard, but not any different than running non-virtualized > - except if we can detect sharing or duplication. In this case, > dropping a duplicated page is worthwhile, while dropping a shared > page provides no benefit. I think there is another way of looking at it, give some free memory 1. Can the guest run more applications or run faster 2. Can the host potentially get this memory via ballooning or some other means to start newer guest instances I think the answer to 1 and 2 is yes. > > How the patch helps answer either question, I'm not sure. I don't > think preferential dropping of unmapped page cache is the answer. > Preferential dropping as selected by the host, that knows about the setup and if there is duplication involved. While we use the term preferential dropping, remember it is still via LRU and we don't always succeed. It is a best effort (if you can and the unmapped pages are not highly referenced) scenario. > >>My issue is that changing the type of object being preferentially > >>reclaimed just changes the type of workload that would prematurely > >>suffer from reclaim. In this case, workloads that use a lot of unmapped > >>pagecache would suffer. > >> > >>btw, aren't /proc/sys/vm/swapiness and vfs_cache_pressure similar knobs? > >Those tell you how to balance going after the different classes of > >things that we can reclaim. > > > >Again, this is useless when ballooning is being used. But, I'm thinking > >of a more general mechanism to force the system to both have MemFree > >_and_ be acting as if it is under memory pressure. 
> > If there is no memory pressure on the host, there is no reason for > the guest to pretend it is under pressure. If there is memory > pressure on the host, it should share the pain among its guests by > applying the balloon. So I don't think voluntarily dropping cache > is a good direction. > There are two situations 1. Voluntarily drop cache, if it was set up to do so (the host knows that it caches that information anyway) 2. Drop the cache in response to a special balloon option, again the host knows it caches that very same information, so it prefers to free that up first. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 17:45 ` Balbir Singh @ 2010-06-15 6:58 ` Avi Kivity 2010-06-15 7:49 ` Balbir Singh 0 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-15 6:58 UTC (permalink / raw) To: balbir; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel On 06/14/2010 08:45 PM, Balbir Singh wrote: > >> There are two decisions that need to be made: >> >> - how much memory a guest should be given >> - given some guest memory, what's the best use for it >> >> The first question can perhaps be answered by looking at guest I/O >> rates and giving more memory to more active guests. The second >> question is hard, but not any different than running non-virtualized >> - except if we can detect sharing or duplication. In this case, >> dropping a duplicated page is worthwhile, while dropping a shared >> page provides no benefit. >> > I think there is another way of looking at it, give some free memory > > 1. Can the guest run more applications or run faster > That's my second question. How to best use this memory. More applications == drop the page from cache, faster == keep page in cache. All we need is to select the right page to drop. > 2. Can the host potentially get this memory via ballooning or some > other means to start newer guest instances > Well, we already have ballooning. The question is can we improve the eviction algorithm. > I think the answer to 1 and 2 is yes. > > >> How the patch helps answer either question, I'm not sure. I don't >> think preferential dropping of unmapped page cache is the answer. >> >> > Preferential dropping as selected by the host, that knows about the > setup and if there is duplication involved. While we use the term > preferential dropping, remember it is still via LRU and we don't > always succeed. It is a best effort (if you can and the unmapped pages > are not highly referenced) scenario. > How can the host tell if there is duplication? It may know it has some pagecache, but it has no idea whether or to what extent guest pagecache duplicates host pagecache. >>> Those tell you how to balance going after the different classes of >>> things that we can reclaim. >>> >>> Again, this is useless when ballooning is being used. But, I'm thinking >>> of a more general mechanism to force the system to both have MemFree >>> _and_ be acting as if it is under memory pressure. >>> >> If there is no memory pressure on the host, there is no reason for >> the guest to pretend it is under pressure. If there is memory >> pressure on the host, it should share the pain among its guests by >> applying the balloon. So I don't think voluntarily dropping cache >> is a good direction. >> >> > There are two situations > > 1. Voluntarily drop cache, if it was setup to do so (the host knows > that it caches that information anyway) > It doesn't, really. The host only has aggregate information about itself, and no information about the guest. Dropping duplicate pages would be good if we could identify them. Even then, it's better to drop the page from the host, not the guest, unless we know the same page is cached by multiple guests. But why would the guest voluntarily drop the cache? If there is no memory pressure, dropping caches increases cpu overhead and latency even if the data is still cached on the host. > 2. Drop the cache on either a special balloon option, again the host > knows it caches that very same information, so it prefers to free that > up first. > Dropping in response to pressure is good. 
I'm just not convinced the patch helps in selecting the correct page to drop. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-15 6:58 ` Avi Kivity @ 2010-06-15 7:49 ` Balbir Singh 2010-06-15 9:44 ` Avi Kivity 0 siblings, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-15 7:49 UTC (permalink / raw) To: Avi Kivity; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel * Avi Kivity <avi@redhat.com> [2010-06-15 09:58:33]: > On 06/14/2010 08:45 PM, Balbir Singh wrote: > > > >>There are two decisions that need to be made: > >> > >>- how much memory a guest should be given > >>- given some guest memory, what's the best use for it > >> > >>The first question can perhaps be answered by looking at guest I/O > >>rates and giving more memory to more active guests. The second > >>question is hard, but not any different than running non-virtualized > >>- except if we can detect sharing or duplication. In this case, > >>dropping a duplicated page is worthwhile, while dropping a shared > >>page provides no benefit. > >I think there is another way of looking at it, give some free memory > > > >1. Can the guest run more applications or run faster > > That's my second question. How to best use this memory. More > applications == drop the page from cache, faster == keep page in > cache. > > All we need is to select the right page to drop. > Do we need to drop to the granularity of the page to drop? I think figuring out the class of pages and making sure that we don't write our own reclaim logic, but work with what we have to identify the class of pages is a good start. > >2. Can the host potentially get this memory via ballooning or some > >other means to start newer guest instances > > Well, we already have ballooning. The question is can we improve > the eviction algorithm. > > >I think the answer to 1 and 2 is yes. > > > >>How the patch helps answer either question, I'm not sure. I don't > >>think preferential dropping of unmapped page cache is the answer. > >> > >Preferential dropping as selected by the host, that knows about the > >setup and if there is duplication involved. While we use the term > >preferential dropping, remember it is still via LRU and we don't > >always succeed. It is a best effort (if you can and the unmapped pages > >are not highly referenced) scenario. > > How can the host tell if there is duplication? It may know it has > some pagecache, but it has no idea whether or to what extent guest > pagecache duplicates host pagecache. > Well it is possible in host user space, I for example use memory cgroup and through the stats I have a good idea of how much is duplicated. I am ofcourse making an assumption with my setup of the cached mode, that the data in the guest page cache and page cache in the cgroup will be duplicated to a large extent. I did some trivial experiments like drop the data from the guest and look at the cost of bringing it in and dropping the data from both guest and host and look at the cost. I could see a difference. Unfortunately, I did not save the data, so I'll need to redo the experiment. > >>>Those tell you how to balance going after the different classes of > >>>things that we can reclaim. > >>> > >>>Again, this is useless when ballooning is being used. But, I'm thinking > >>>of a more general mechanism to force the system to both have MemFree > >>>_and_ be acting as if it is under memory pressure. > >>If there is no memory pressure on the host, there is no reason for > >>the guest to pretend it is under pressure. 
If there is memory > >>pressure on the host, it should share the pain among its guests by > >>applying the balloon. So I don't think voluntarily dropping cache > >>is a good direction. > >> > >There are two situations > > > >1. Voluntarily drop cache, if it was setup to do so (the host knows > >that it caches that information anyway) > > It doesn't, really. The host only has aggregate information about > itself, and no information about the guest. > > Dropping duplicate pages would be good if we could identify them. > Even then, it's better to drop the page from the host, not the > guest, unless we know the same page is cached by multiple guests. > On the exact pages to drop, please see my comments above on the class of pages to drop. There are reasons for wanting to get the host to cache the data Unless the guest is using cache = none, the data will still hit the host page cache The host can do a better job of optimizing the writeouts > But why would the guest voluntarily drop the cache? If there is no > memory pressure, dropping caches increases cpu overhead and latency > even if the data is still cached on the host. > So, there are basically two approaches 1. First patch, proactive - enabled by a boot option 2. When ballooned, we try to (please NOTE try to) reclaim cached pages first. Failing which, we go after regular pages in the alloc_page() call in the balloon driver. > >2. Drop the cache on either a special balloon option, again the host > >knows it caches that very same information, so it prefers to free that > >up first. > > Dropping in response to pressure is good. I'm just not convinced > the patch helps in selecting the correct page to drop. > That is why I've presented data on the experiments I've run and provided more arguments to backup the approach. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
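A sketch of the host-side, memory-cgroup-based estimate mentioned above: read the "cache" (and "mapped_file") counters from the cgroup that contains the guest's qemu process and compare them by hand with the Cached value reported in /proc/meminfo inside the guest. The cgroup mount point and group name below are assumptions about one particular setup, not a standard path.

    #include <stdio.h>
    #include <string.h>

    /* Assumed path; adjust to wherever the memory cgroup holding the
     * guest's qemu process is mounted on the host. */
    #define MEMCG_STAT "/cgroup/memory/kvm-guest-1/memory.stat"

    int main(void)
    {
            char line[256];
            FILE *f = fopen(MEMCG_STAT, "r");

            if (!f) {
                    perror(MEMCG_STAT);
                    return 1;
            }
            /* "cache" is host pagecache charged to this guest (in bytes);
             * comparing it with Cached: from /proc/meminfo inside the
             * guest gives a rough upper bound on double caching. */
            while (fgets(line, sizeof(line), f)) {
                    if (!strncmp(line, "cache ", 6) ||
                        !strncmp(line, "mapped_file ", 12))
                            fputs(line, stdout);
            }
            fclose(f);
            return 0;
    }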
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-15 7:49 ` Balbir Singh @ 2010-06-15 9:44 ` Avi Kivity 2010-06-15 10:18 ` Balbir Singh 0 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-15 9:44 UTC (permalink / raw) To: balbir; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel On 06/15/2010 10:49 AM, Balbir Singh wrote: > >> All we need is to select the right page to drop. >> >> > Do we need to drop to the granularity of the page to drop? I think > figuring out the class of pages and making sure that we don't write > our own reclaim logic, but work with what we have to identify the > class of pages is a good start. > Well, the class of pages are 'pages that are duplicated on the host'. Unmapped page cache pages are 'pages that might be duplicated on the host'. IMO, that's not close enough. >> How can the host tell if there is duplication? It may know it has >> some pagecache, but it has no idea whether or to what extent guest >> pagecache duplicates host pagecache. >> >> > Well it is possible in host user space, I for example use memory > cgroup and through the stats I have a good idea of how much is duplicated. > I am ofcourse making an assumption with my setup of the cached mode, > that the data in the guest page cache and page cache in the cgroup > will be duplicated to a large extent. I did some trivial experiments > like drop the data from the guest and look at the cost of bringing it > in and dropping the data from both guest and host and look at the > cost. I could see a difference. > > Unfortunately, I did not save the data, so I'll need to redo the > experiment. > I'm sure we can detect it experimentally, but how do we do it programatically at run time (without dropping all the pages). Situations change, and I don't think we can infer from a few experiments that we'll have a similar amount of sharing. The cost of an incorrect decision is too high IMO (not that I think the kernel always chooses the right pages now, but I'd like to avoid regressions from the unvirtualized state). btw, when running with a disk controller that has a very large cache, we might also see duplication between "guest" and host. So, if this is a good idea, it shouldn't be enabled just for virtualization, but for any situation where we have a sizeable cache behind us. >> It doesn't, really. The host only has aggregate information about >> itself, and no information about the guest. >> >> Dropping duplicate pages would be good if we could identify them. >> Even then, it's better to drop the page from the host, not the >> guest, unless we know the same page is cached by multiple guests. >> >> > On the exact pages to drop, please see my comments above on the class > of pages to drop. > Well, we disagree about that. There is some value in dropping duplicated pages (not always), but that's not what the patch does. It drops unmapped pagecache pages, which may or may not be duplicated. > There are reasons for wanting to get the host to cache the data > There are also reasons to get the guest to cache the data - it's more efficient to access it in the guest. > Unless the guest is using cache = none, the data will still hit the > host page cache > The host can do a better job of optimizing the writeouts > True, especially for non-raw storage. But even there we have to fsync all the time to keep the metadata right. >> But why would the guest voluntarily drop the cache? 
If there is no >> memory pressure, dropping caches increases cpu overhead and latency >> even if the data is still cached on the host. >> >> > So, there are basically two approaches > > 1. First patch, proactive - enabled by a boot option > 2. When ballooned, we try to (please NOTE try to) reclaim cached pages > first. Failing which, we go after regular pages in the alloc_page() > call in the balloon driver. > Doesn't that mean you may evict a RU mapped page ahead of an LRU unmapped page, just in the hope that it is double-cached? Maybe we need the guest and host to talk to each other about which pages to keep. >>> 2. Drop the cache on either a special balloon option, again the host >>> knows it caches that very same information, so it prefers to free that >>> up first. >>> >> Dropping in response to pressure is good. I'm just not convinced >> the patch helps in selecting the correct page to drop. >> >> > That is why I've presented data on the experiments I've run and > provided more arguments to backup the approach. > I'm still unconvinced, sorry. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-15 9:44 ` Avi Kivity @ 2010-06-15 10:18 ` Balbir Singh 0 siblings, 0 replies; 48+ messages in thread From: Balbir Singh @ 2010-06-15 10:18 UTC (permalink / raw) To: Avi Kivity; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel * Avi Kivity <avi@redhat.com> [2010-06-15 12:44:31]: > On 06/15/2010 10:49 AM, Balbir Singh wrote: > > > >>All we need is to select the right page to drop. > >> > >Do we need to drop to the granularity of the page to drop? I think > >figuring out the class of pages and making sure that we don't write > >our own reclaim logic, but work with what we have to identify the > >class of pages is a good start. > > Well, the class of pages are 'pages that are duplicated on the > host'. Unmapped page cache pages are 'pages that might be > duplicated on the host'. IMO, that's not close enough. > Agreed, but what happens in reality with the code is that it drops not-so-frequently-used cache (still reusing the reclaim mechanism), but prioritizing cached memory. > >>How can the host tell if there is duplication? It may know it has > >>some pagecache, but it has no idea whether or to what extent guest > >>pagecache duplicates host pagecache. > >> > >Well it is possible in host user space, I for example use memory > >cgroup and through the stats I have a good idea of how much is duplicated. > >I am ofcourse making an assumption with my setup of the cached mode, > >that the data in the guest page cache and page cache in the cgroup > >will be duplicated to a large extent. I did some trivial experiments > >like drop the data from the guest and look at the cost of bringing it > >in and dropping the data from both guest and host and look at the > >cost. I could see a difference. > > > >Unfortunately, I did not save the data, so I'll need to redo the > >experiment. > > I'm sure we can detect it experimentally, but how do we do it > programatically at run time (without dropping all the pages). > Situations change, and I don't think we can infer from a few > experiments that we'll have a similar amount of sharing. The cost > of an incorrect decision is too high IMO (not that I think the > kernel always chooses the right pages now, but I'd like to avoid > regressions from the unvirtualized state). > > btw, when running with a disk controller that has a very large > cache, we might also see duplication between "guest" and host. So, > if this is a good idea, it shouldn't be enabled just for > virtualization, but for any situation where we have a sizeable cache > behind us. > It depends, once the disk controller has the cache and the pages in the guest are not-so-frequently-used we can drop them. Please remember we still use the LRU to identify these pages. > >>It doesn't, really. The host only has aggregate information about > >>itself, and no information about the guest. > >> > >>Dropping duplicate pages would be good if we could identify them. > >>Even then, it's better to drop the page from the host, not the > >>guest, unless we know the same page is cached by multiple guests. > >> > >On the exact pages to drop, please see my comments above on the class > >of pages to drop. > > Well, we disagree about that. There is some value in dropping > duplicated pages (not always), but that's not what the patch does. > It drops unmapped pagecache pages, which may or may not be > duplicated. 
> > >There are reasons for wanting to get the host to cache the data > > There are also reasons to get the guest to cache the data - it's > more efficient to access it in the guest. > > >Unless the guest is using cache = none, the data will still hit the > >host page cache > >The host can do a better job of optimizing the writeouts > > True, especially for non-raw storage. But even there we have to > fsync all the time to keep the metadata right. > > >>But why would the guest voluntarily drop the cache? If there is no > >>memory pressure, dropping caches increases cpu overhead and latency > >>even if the data is still cached on the host. > >> > >So, there are basically two approaches > > > >1. First patch, proactive - enabled by a boot option > >2. When ballooned, we try to (please NOTE try to) reclaim cached pages > >first. Failing which, we go after regular pages in the alloc_page() > >call in the balloon driver. > > Doesn't that mean you may evict a RU mapped page ahead of an LRU > unmapped page, just in the hope that it is double-cached? > > Maybe we need the guest and host to talk to each other about which > pages to keep. > Yeah.. I guess that falls into the domain of CMM. > >>>2. Drop the cache on either a special balloon option, again the host > >>>knows it caches that very same information, so it prefers to free that > >>>up first. > >>Dropping in response to pressure is good. I'm just not convinced > >>the patch helps in selecting the correct page to drop. > >> > >That is why I've presented data on the experiments I've run and > >provided more arguments to backup the approach. > > I'm still unconvinced, sorry. > The reason for making this optional is to let the administrators decide how they want to use the memory in the system. In some situations it might be a big no-no to waste memory, in some cases it might be acceptable. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 16:34 ` Avi Kivity 2010-06-14 17:45 ` Balbir Singh @ 2010-06-14 17:58 ` Dave Hansen 2010-06-15 7:07 ` Avi Kivity 1 sibling, 1 reply; 48+ messages in thread From: Dave Hansen @ 2010-06-14 17:58 UTC (permalink / raw) To: Avi Kivity; +Cc: balbir, kvm, linux-mm, linux-kernel On Mon, 2010-06-14 at 19:34 +0300, Avi Kivity wrote: > > Again, this is useless when ballooning is being used. But, I'm thinking > > of a more general mechanism to force the system to both have MemFree > > _and_ be acting as if it is under memory pressure. > > > > If there is no memory pressure on the host, there is no reason for the > guest to pretend it is under pressure. I can think of quite a few places where this would be beneficial. Ballooning is dangerous. I've OOMed quite a few guests by over-ballooning them. Anything that's voluntary like this is safer than things imposed by the host, although you do trade of effectiveness. If all the guests do this, then it leaves that much more free memory on the host, which can be used flexibly for extra host page cache, new guests, etc... A system in this state where everyone is proactively keeping their footprints down is more likely to be able to handle load spikes. Reclaim is an expensive, costly activity, and this ensures that we don't have to do that when we're busy doing other things like handling load spikes. This was one of the concepts behind CMM2: reduce the overhead during peak periods. It's also handy for planning. Guests exhibiting this behavior will _act_ as if they're under pressure. That's a good thing to approximate how a guest will act when it _is_ under pressure. > If there is memory pressure on > the host, it should share the pain among its guests by applying the > balloon. So I don't think voluntarily dropping cache is a good direction. I think we're trying to consider things slightly outside of ballooning at this point. If ballooning was the end-all solution, I'm fairly sure Balbir wouldn't be looking at this stuff. Just trying to keep options open. :) -- Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 17:58 ` Dave Hansen @ 2010-06-15 7:07 ` Avi Kivity 2010-06-15 14:47 ` Dave Hansen 0 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-15 7:07 UTC (permalink / raw) To: Dave Hansen; +Cc: balbir, kvm, linux-mm, linux-kernel On 06/14/2010 08:58 PM, Dave Hansen wrote: > On Mon, 2010-06-14 at 19:34 +0300, Avi Kivity wrote: > >>> Again, this is useless when ballooning is being used. But, I'm thinking >>> of a more general mechanism to force the system to both have MemFree >>> _and_ be acting as if it is under memory pressure. >>> >>> >> If there is no memory pressure on the host, there is no reason for the >> guest to pretend it is under pressure. >> > I can think of quite a few places where this would be beneficial. > > Ballooning is dangerous. I've OOMed quite a few guests by > over-ballooning them. Anything that's voluntary like this is safer than > things imposed by the host, although you do trade of effectiveness. > That's a bug that needs to be fixed. Eventually the host will come under pressure and will balloon the guest. If that kills the guest, the ballooning is not effective as a host memory management technique. Trying to defer ballooning by voluntarily dropping cache is simply trying to defer being bitten by the bug. > If all the guests do this, then it leaves that much more free memory on > the host, which can be used flexibly for extra host page cache, new > guests, etc... If the host detects lots of pagecache misses it can balloon guests down. If pagecache is quiet, why change anything? If the host wants to start new guests, it can balloon guests down. If no new guests are wanted, why change anything? etc... > A system in this state where everyone is proactively > keeping their footprints down is more likely to be able to handle load > spikes. That is true. But from the guest's point of view, voluntarily giving up memory means dropping the guest's cushion vs load spikes. > Reclaim is an expensive, costly activity, and this ensures that > we don't have to do that when we're busy doing other things like > handling load spikes. The guest doesn't want to reclaim memory from the host when it's under a load spike either. > This was one of the concepts behind CMM2: reduce > the overhead during peak periods. > Ah, but CMM2 actually reduced work being done by sharing information between guest and host. > It's also handy for planning. Guests exhibiting this behavior will > _act_ as if they're under pressure. That's a good thing to approximate > how a guest will act when it _is_ under pressure. > If a guest acts as if it is under pressure, then it will be slower and consume more cpu. Bad for both guest and host. >> If there is memory pressure on >> the host, it should share the pain among its guests by applying the >> balloon. So I don't think voluntarily dropping cache is a good direction. >> > I think we're trying to consider things slightly outside of ballooning > at this point. If ballooning was the end-all solution, I'm fairly sure > Balbir wouldn't be looking at this stuff. Just trying to keep options > open. :) > I see this as an extension to ballooning - perhaps I'm missing the big picture. I would dearly love to have CMM2 where decisions are made on a per-page basis instead of using heuristics. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-15 7:07 ` Avi Kivity @ 2010-06-15 14:47 ` Dave Hansen 2010-06-16 11:39 ` Avi Kivity 0 siblings, 1 reply; 48+ messages in thread From: Dave Hansen @ 2010-06-15 14:47 UTC (permalink / raw) To: Avi Kivity; +Cc: balbir, kvm, linux-mm, linux-kernel On Tue, 2010-06-15 at 10:07 +0300, Avi Kivity wrote: > On 06/14/2010 08:58 PM, Dave Hansen wrote: > > On Mon, 2010-06-14 at 19:34 +0300, Avi Kivity wrote: > > > >>> Again, this is useless when ballooning is being used. But, I'm thinking > >>> of a more general mechanism to force the system to both have MemFree > >>> _and_ be acting as if it is under memory pressure. > >>> > >>> > >> If there is no memory pressure on the host, there is no reason for the > >> guest to pretend it is under pressure. > >> > > I can think of quite a few places where this would be beneficial. > > > > Ballooning is dangerous. I've OOMed quite a few guests by > > over-ballooning them. Anything that's voluntary like this is safer than > > things imposed by the host, although you do trade of effectiveness. > > That's a bug that needs to be fixed. Eventually the host will come > under pressure and will balloon the guest. If that kills the guest, the > ballooning is not effective as a host memory management technique. I'm not convinced that it's just a bug that can be fixed. Consider a case where a host sees a guest with 100MB of free memory at the exact moment that a database app sees that memory. The host tries to balloon that memory away at the same time that the app goes and allocates it. That can certainly lead to an OOM very quickly, even for very small amounts of memory (much less than 100MB). Where's the bug? I think the issues are really fundamental to ballooning. > > If all the guests do this, then it leaves that much more free memory on > > the host, which can be used flexibly for extra host page cache, new > > guests, etc... > > If the host detects lots of pagecache misses it can balloon guests > down. If pagecache is quiet, why change anything? Page cache misses alone are not really sufficient. This is the classic problem where we try to differentiate streaming I/O (which we can't effectively cache) from I/O which can be effectively cached. > If the host wants to start new guests, it can balloon guests down. If > no new guests are wanted, why change anything? We're talking about an environment which we're always trying to optimize. Imagine that we're always trying to consolidate guests on to smaller numbers of hosts. We're effectively in a state where we _always_ want new guests. -- Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-15 14:47 ` Dave Hansen @ 2010-06-16 11:39 ` Avi Kivity 2010-06-17 6:04 ` Balbir Singh 0 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-16 11:39 UTC (permalink / raw) To: Dave Hansen; +Cc: balbir, kvm, linux-mm, linux-kernel On 06/15/2010 05:47 PM, Dave Hansen wrote: > >> That's a bug that needs to be fixed. Eventually the host will come >> under pressure and will balloon the guest. If that kills the guest, the >> ballooning is not effective as a host memory management technique. >> > I'm not convinced that it's just a bug that can be fixed. Consider a > case where a host sees a guest with 100MB of free memory at the exact > moment that a database app sees that memory. The host tries to balloon > that memory away at the same time that the app goes and allocates it. > That can certainly lead to an OOM very quickly, even for very small > amounts of memory (much less than 100MB). Where's the bug? > > I think the issues are really fundamental to ballooning. > There are two issues involved. One is, can the kernel accurately determine the amount of memory it needs to work? We have resources such as RAM and swap. We have liabilities in the form of swappable userspace memory, mlocked userspace memory, kernel memory to support these, and various reclaimable and non-reclaimable kernel caches. Can we determine the minimum amount of RAM to support our workload at a point in time? If we had this, we could modify the balloon to refuse to balloon if it takes the kernel beneath the minimum amount of RAM needed. In fact, this is similar to allocating memory with overcommit_memory = 0. The difference is the balloon allocates mlocked memory, while normal allocations can be charged against swap. But fundamentally it's the same. >>> If all the guests do this, then it leaves that much more free memory on >>> the host, which can be used flexibly for extra host page cache, new >>> guests, etc... >>> >> If the host detects lots of pagecache misses it can balloon guests >> down. If pagecache is quiet, why change anything? >> > Page cache misses alone are not really sufficient. This is the classic > problem where we try to differentiate streaming I/O (which we can't > effectively cache) from I/O which can be effectively cached. > True. Random I/O across a very large dataset is also difficult to cache. >> If the host wants to start new guests, it can balloon guests down. If >> no new guests are wanted, why change anything? >> > We're talking about an environment which we're always trying to > optimize. Imagine that we're always trying to consolidate guests on to > smaller numbers of hosts. We're effectively in a state where we > _always_ want new guests. > If this came at no cost to the guests, you'd be right. But at some point guest performance will be hit by this, so the advantage gained from freeing memory will be balanced by the disadvantage. Also, memory is not the only resource. At some point you become cpu bound; at that point freeing memory doesn't help and in fact may increase your cpu load. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
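The "refuse to balloon beneath a minimum" idea can be pictured with a small sketch. This is illustrative only: a real check would live in the balloon driver, and the choice of /proc/meminfo fields and the 25% slack below are assumptions, not a worked-out formula for the liabilities listed in the mail:

/* balloon_floor.c - sketch of a guest-side sanity check before honouring
 * a balloon inflate request (illustrative only).
 */
#include <stdio.h>
#include <string.h>

static long meminfo_kb(const char *prefix)
{
    char line[256];
    long val = -1;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, prefix, strlen(prefix))) {
            sscanf(line + strlen(prefix), " %ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

/* Returns 1 if giving up request_kb to the balloon still leaves the
 * guest above a (crudely) estimated working floor. */
static int inflate_ok(long request_kb)
{
    long total     = meminfo_kb("MemTotal:");
    long mlocked   = meminfo_kb("Mlocked:");      /* absent on old kernels */
    long committed = meminfo_kb("Committed_AS:");
    long swapfree  = meminfo_kb("SwapFree:");

    if (total <= 0 || committed < 0)
        return 0;
    if (mlocked < 0)
        mlocked = 0;
    if (swapfree < 0)
        swapfree = 0;

    /* Crude floor: commitments that cannot be pushed to swap, plus locked
     * pages, plus 25% of RAM as slack for the kernel's own needs. */
    long unswappable = committed - swapfree;
    if (unswappable < 0)
        unswappable = 0;
    long floor = mlocked + unswappable + total / 4;

    return (total - request_kb) > floor;
}

int main(void)
{
    long request_kb = 256 * 1024;   /* pretend the host asked for 256 MB */

    printf("inflate by %ld kB: %s\n", request_kb,
           inflate_ok(request_kb) ? "ok" : "refuse");
    return 0;
}

Getting the floor right is exactly the hard part the mail points at; the sketch only shows where such a check would sit.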
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-16 11:39 ` Avi Kivity @ 2010-06-17 6:04 ` Balbir Singh 0 siblings, 0 replies; 48+ messages in thread From: Balbir Singh @ 2010-06-17 6:04 UTC (permalink / raw) To: Avi Kivity; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel * Avi Kivity <avi@redhat.com> [2010-06-16 14:39:02]: > >We're talking about an environment which we're always trying to > >optimize. Imagine that we're always trying to consolidate guests on to > >smaller numbers of hosts. We're effectively in a state where we > >_always_ want new guests. > > If this came at no cost to the guests, you'd be right. But at some > point guest performance will be hit by this, so the advantage gained > from freeing memory will be balanced by the disadvantage. > > Also, memory is not the only resource. At some point you become cpu > bound; at that point freeing memory doesn't help and in fact may > increase your cpu load. > We'll probably need control over other resources as well, but IMHO memory is the most precious because it is non-renewable. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 8:48 ` Balbir Singh 2010-06-14 12:40 ` Avi Kivity @ 2010-06-14 15:12 ` Dave Hansen 2010-06-14 15:34 ` Avi Kivity 2010-06-14 16:58 ` Balbir Singh 1 sibling, 2 replies; 48+ messages in thread From: Dave Hansen @ 2010-06-14 15:12 UTC (permalink / raw) To: balbir; +Cc: Avi Kivity, kvm, linux-mm, linux-kernel On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote: > 1. A slab page will not be freed until the entire page is free (all > slabs have been kfree'd so to speak). Normal reclaim will definitely > free this page, but a lot of it depends on how frequently we are > scanning the LRU list and when this page got added. You don't have to be freeing entire slab pages for the reclaim to have been useful. You could just be making space so that _future_ allocations fill in the slab holes you just created. You may not be freeing pages, but you're reducing future system pressure. If unmapped page cache is the easiest thing to evict, then it should be the first thing that goes when a balloon request comes in, which is the case this patch is trying to handle. If it isn't the easiest thing to evict, then we _shouldn't_ evict it. -- Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
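Both halves of this point, that scattered frees rarely release whole slab pages yet still reduce future pressure, can be seen in a tiny simulation (the page count, objects-per-page and 30% free fraction are arbitrary choices, not measurements):

/* slab_holes.c - toy simulation: free objects at random and see how many
 * slab pages actually become empty versus how many holes appear.
 */
#include <stdio.h>
#include <stdlib.h>

#define PAGES          1000
#define OBJS_PER_PAGE    21    /* e.g. ~192-byte objects in a 4K page */

int main(void)
{
    int used[PAGES];
    int i, freed = 0, empty_pages = 0, holes = 0;

    srand(1);
    for (i = 0; i < PAGES; i++)
        used[i] = OBJS_PER_PAGE;            /* start with full slab pages */

    /* Free objects at random until 30% of all objects are gone, roughly
     * what a shrinker pass might achieve. */
    while (freed < PAGES * OBJS_PER_PAGE * 3 / 10) {
        i = rand() % PAGES;
        if (used[i] > 0) {
            used[i]--;
            freed++;
        }
    }

    for (i = 0; i < PAGES; i++) {
        if (used[i] == 0)
            empty_pages++;                  /* only these return memory now */
        holes += OBJS_PER_PAGE - used[i];   /* but every freed slot can absorb
                                             * a future allocation */
    }

    printf("freed %d objects: %d pages fully free, %d holes for future allocations\n",
           freed, empty_pages, holes);
    return 0;
}

With frees spread at random, almost no page ends up completely empty (Balbir's point), yet every freed object is a slot a future allocation can reuse (Dave's point).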
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 15:12 ` Dave Hansen @ 2010-06-14 15:34 ` Avi Kivity 2010-06-14 17:40 ` Balbir Singh 2010-06-14 16:58 ` Balbir Singh 1 sibling, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-14 15:34 UTC (permalink / raw) To: Dave Hansen; +Cc: balbir, kvm, linux-mm, linux-kernel On 06/14/2010 06:12 PM, Dave Hansen wrote: > On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote: > >> 1. A slab page will not be freed until the entire page is free (all >> slabs have been kfree'd so to speak). Normal reclaim will definitely >> free this page, but a lot of it depends on how frequently we are >> scanning the LRU list and when this page got added. >> > You don't have to be freeing entire slab pages for the reclaim to have > been useful. You could just be making space so that _future_ > allocations fill in the slab holes you just created. You may not be > freeing pages, but you're reducing future system pressure. > Depends. If you've evicted something that will be referenced soon, you're increasing system pressure. > If unmapped page cache is the easiest thing to evict, then it should be > the first thing that goes when a balloon request comes in, which is the > case this patch is trying to handle. If it isn't the easiest thing to > evict, then we _shouldn't_ evict it. > Easy to evict is just one measure. There's benefit (size of data evicted), cost to refill (seeks, cpu), and likelihood that the cost to refill will be incurred (recency). It's all very complicated. We need better information to make these decisions. For one thing, I'd like to see age information tied to objects. We may have two pages that were referenced in wildly different times be next to each other in LRU order. We have many LRUs, but no idea of the relative recency of the tails of those LRUs. If each page or object had an age, we could scale those ages by the benefit from reclaim and cost to refill and make a better decision as to what to evict first. But of course page->age means increasing sizeof struct page, and we can only approximate its value by scanning the accessed bit, not determine it accurately (unlike the other objects managed by the cache). -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
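The benefit/cost/recency weighing can be written down as a toy scoring function. Nothing below exists in the kernel, and a per-object age is precisely the information the mail says we do not have; the sketch only illustrates how such inputs might combine:

/* evict_score.c - toy model of the benefit / refill-cost / recency
 * weighing described above. The weighting itself is made up.
 */
#include <stdio.h>

struct cache_object {
    const char *name;
    unsigned long bytes;    /* benefit: memory released if evicted */
    double refill_cost;     /* cost to rebuild (seeks, cpu), arbitrary units */
    double idle_seconds;    /* recency: time since last reference */
};

/* Higher score = better eviction candidate: large, cheap to rebuild,
 * and idle for a long time. */
static double evict_score(const struct cache_object *o)
{
    return (double)o->bytes * o->idle_seconds / (o->refill_cost + 1.0);
}

int main(void)
{
    struct cache_object objs[] = {
        { "unmapped pagecache page", 4096, 1.0, 600.0 },  /* one read to refill */
        { "dentry",                   192, 5.0,  30.0 },  /* lookup + allocation */
        { "hot mapped page",         4096, 1.0,   1.0 },  /* touched just now */
    };
    int i, best = 0;

    for (i = 0; i < 3; i++) {
        printf("%-24s score %10.1f\n", objs[i].name, evict_score(&objs[i]));
        if (evict_score(&objs[i]) > evict_score(&objs[best]))
            best = i;
    }
    printf("evict first: %s\n", objs[best].name);
    return 0;
}

A large, long-idle page that costs one read to refill scores far above a hot mapped page or a dentry that is expensive to reconstruct, which is the ordering the discussion is circling around.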
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 15:34 ` Avi Kivity @ 2010-06-14 17:40 ` Balbir Singh 2010-06-15 7:11 ` Avi Kivity 0 siblings, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-14 17:40 UTC (permalink / raw) To: Avi Kivity; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel * Avi Kivity <avi@redhat.com> [2010-06-14 18:34:58]: > On 06/14/2010 06:12 PM, Dave Hansen wrote: > >On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote: > >>1. A slab page will not be freed until the entire page is free (all > >>slabs have been kfree'd so to speak). Normal reclaim will definitely > >>free this page, but a lot of it depends on how frequently we are > >>scanning the LRU list and when this page got added. > >You don't have to be freeing entire slab pages for the reclaim to have > >been useful. You could just be making space so that _future_ > >allocations fill in the slab holes you just created. You may not be > >freeing pages, but you're reducing future system pressure. > > Depends. If you've evicted something that will be referenced soon, > you're increasing system pressure. > I don't think slab pages care about being referenced soon, they are either allocated or freed. A page is just a storage unit for the data structure. A new one can be allocated on demand. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 17:40 ` Balbir Singh @ 2010-06-15 7:11 ` Avi Kivity 0 siblings, 0 replies; 48+ messages in thread From: Avi Kivity @ 2010-06-15 7:11 UTC (permalink / raw) To: balbir; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel On 06/14/2010 08:40 PM, Balbir Singh wrote: > * Avi Kivity<avi@redhat.com> [2010-06-14 18:34:58]: > > >> On 06/14/2010 06:12 PM, Dave Hansen wrote: >> >>> On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote: >>> >>>> 1. A slab page will not be freed until the entire page is free (all >>>> slabs have been kfree'd so to speak). Normal reclaim will definitely >>>> free this page, but a lot of it depends on how frequently we are >>>> scanning the LRU list and when this page got added. >>>> >>> You don't have to be freeing entire slab pages for the reclaim to have >>> been useful. You could just be making space so that _future_ >>> allocations fill in the slab holes you just created. You may not be >>> freeing pages, but you're reducing future system pressure. >>> >> Depends. If you've evicted something that will be referenced soon, >> you're increasing system pressure. >> >> > I don't think slab pages care about being referenced soon, they are > either allocated or freed. A page is just a storage unit for the data > structure. A new one can be allocated on demand. > If we're talking just about slab pages, I agree. If we're applying pressure on the shrinkers, then you are removing live objects which can be costly to reinstantiate. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 15:12 ` Dave Hansen 2010-06-14 15:34 ` Avi Kivity @ 2010-06-14 16:58 ` Balbir Singh 2010-06-14 17:09 ` Dave Hansen 1 sibling, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-14 16:58 UTC (permalink / raw) To: Dave Hansen; +Cc: Avi Kivity, kvm, linux-mm, linux-kernel * Dave Hansen <dave@linux.vnet.ibm.com> [2010-06-14 08:12:56]: > On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote: > > 1. A slab page will not be freed until the entire page is free (all > > slabs have been kfree'd so to speak). Normal reclaim will definitely > > free this page, but a lot of it depends on how frequently we are > > scanning the LRU list and when this page got added. > > You don't have to be freeing entire slab pages for the reclaim to have > been useful. You could just be making space so that _future_ > allocations fill in the slab holes you just created. You may not be > freeing pages, but you're reducing future system pressure. > > If unmapped page cache is the easiest thing to evict, then it should be > the first thing that goes when a balloon request comes in, which is the > case this patch is trying to handle. If it isn't the easiest thing to > evict, then we _shouldn't_ evict it. > Like I said earlier, a lot of that works correctly as you said, but it is also an idealization. If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first? -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 16:58 ` Balbir Singh @ 2010-06-14 17:09 ` Dave Hansen 2010-06-14 17:16 ` Balbir Singh 0 siblings, 1 reply; 48+ messages in thread From: Dave Hansen @ 2010-06-14 17:09 UTC (permalink / raw) To: balbir; +Cc: Avi Kivity, kvm, linux-mm, linux-kernel On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote: > If you've got duplicate pages and you know > that they are duplicated and can be retrieved at a lower cost, why > wouldn't we go after them first? I agree with this in theory. But, the guest lacks the information about what is truly duplicated and what the costs are for itself and/or the host to recreate it. "Unmapped page cache" may be the best proxy that we have at the moment for "easy to recreate", but I think it's still too poor a match to make these patches useful. -- Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
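For reference, the proxy being debated here can be estimated from inside a guest with nothing more than /proc/meminfo. A rough sketch follows; the arithmetic is approximate, and the kernel-side check in the patches works on per-zone counters rather than these fields:

/* unmapped_cache.c - rough guest-side estimate of unmapped page cache. */
#include <stdio.h>
#include <string.h>

static long meminfo_kb(const char *prefix)
{
    char line[256];
    long val = -1;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, prefix, strlen(prefix))) {
            sscanf(line + strlen(prefix), " %ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

int main(void)
{
    long cached  = meminfo_kb("Cached:");
    long buffers = meminfo_kb("Buffers:");
    long mapped  = meminfo_kb("Mapped:");
    long total   = meminfo_kb("MemTotal:");

    if (cached < 0 || buffers < 0 || mapped < 0 || total <= 0)
        return 1;

    /* File pages nobody has mapped: the "cheap to recreate" candidates
     * this thread is arguing about. */
    long unmapped = cached + buffers - mapped;
    if (unmapped < 0)
        unmapped = 0;

    printf("unmapped page cache: %ld kB (%.1f%% of RAM)\n",
           unmapped, 100.0 * unmapped / total);
    return 0;
}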
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 17:09 ` Dave Hansen @ 2010-06-14 17:16 ` Balbir Singh 2010-06-15 7:12 ` Avi Kivity 0 siblings, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-14 17:16 UTC (permalink / raw) To: Dave Hansen; +Cc: Avi Kivity, kvm, linux-mm, linux-kernel * Dave Hansen <dave@linux.vnet.ibm.com> [2010-06-14 10:09:31]: > On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote: > > If you've got duplicate pages and you know > > that they are duplicated and can be retrieved at a lower cost, why > > wouldn't we go after them first? > > I agree with this in theory. But, the guest lacks the information about > what is truly duplicated and what the costs are for itself and/or the > host to recreate it. "Unmapped page cache" may be the best proxy that > we have at the moment for "easy to recreate", but I think it's still too > poor a match to make these patches useful. > That is why the policy (in the next set) will come from the host. As to whether the data is truly duplicated, my experiments show up to 60% of the page cache is duplicated. The first patch today is again enabled by the host. Both of them are expected to be useful in the cache != none case. The data I have shows more details including the performance and overhead. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-14 17:16 ` Balbir Singh @ 2010-06-15 7:12 ` Avi Kivity 2010-06-15 7:52 ` Balbir Singh 0 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-15 7:12 UTC (permalink / raw) To: balbir; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel On 06/14/2010 08:16 PM, Balbir Singh wrote: > * Dave Hansen<dave@linux.vnet.ibm.com> [2010-06-14 10:09:31]: > > >> On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote: >> >>> If you've got duplicate pages and you know >>> that they are duplicated and can be retrieved at a lower cost, why >>> wouldn't we go after them first? >>> >> I agree with this in theory. But, the guest lacks the information about >> what is truly duplicated and what the costs are for itself and/or the >> host to recreate it. "Unmapped page cache" may be the best proxy that >> we have at the moment for "easy to recreate", but I think it's still too >> poor a match to make these patches useful. >> >> > That is why the policy (in the next set) will come from the host. As > to whether the data is truly duplicated, my experiments show up to 60% > of the page cache is duplicated. Isn't that incredibly workload dependent? We can't expect the host admin to know whether duplication will occur or not. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-15 7:12 ` Avi Kivity @ 2010-06-15 7:52 ` Balbir Singh 2010-06-15 9:54 ` Avi Kivity 0 siblings, 1 reply; 48+ messages in thread From: Balbir Singh @ 2010-06-15 7:52 UTC (permalink / raw) To: Avi Kivity; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel * Avi Kivity <avi@redhat.com> [2010-06-15 10:12:44]: > On 06/14/2010 08:16 PM, Balbir Singh wrote: > >* Dave Hansen<dave@linux.vnet.ibm.com> [2010-06-14 10:09:31]: > > > >>On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote: > >>>If you've got duplicate pages and you know > >>>that they are duplicated and can be retrieved at a lower cost, why > >>>wouldn't we go after them first? > >>I agree with this in theory. But, the guest lacks the information about > >>what is truly duplicated and what the costs are for itself and/or the > >>host to recreate it. "Unmapped page cache" may be the best proxy that > >>we have at the moment for "easy to recreate", but I think it's still too > >>poor a match to make these patches useful. > >> > >That is why the policy (in the next set) will come from the host. As > >to whether the data is truly duplicated, my experiments show up to 60% > >of the page cache is duplicated. > > Isn't that incredibly workload dependent? > > We can't expect the host admin to know whether duplication will > occur or not. > I was referring to cache = (policy) we use based on the setup. I don't think the duplication is too workload specific. Moreover, we could use aggressive policies and restrict page cache usage or do it selectively on ballooning. We could also add other options to make the ballooning option truly optional, so that the system management software decides. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-15 7:52 ` Balbir Singh @ 2010-06-15 9:54 ` Avi Kivity 2010-06-15 12:49 ` Balbir Singh 0 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2010-06-15 9:54 UTC (permalink / raw) To: balbir; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel On 06/15/2010 10:52 AM, Balbir Singh wrote: >>> >>> That is why the policy (in the next set) will come from the host. As >>> to whether the data is truly duplicated, my experiments show up to 60% >>> of the page cache is duplicated. >>> >> Isn't that incredibly workload dependent? >> >> We can't expect the host admin to know whether duplication will >> occur or not. >> >> > I was referring to cache = (policy) we use based on the setup. I don't > think the duplication is too workload specific. Moreover, we could use > aggressive policies and restrict page cache usage or do it selectively > on ballooning. We could also add other options to make the ballooning > option truly optional, so that the system management software decides. > Consider a read-only workload that exactly fits in guest cache. Without trimming, the guest will keep hitting its own cache, and the host will see no access to the cache at all. So the host (assuming it is under even low pressure) will evict those pages, and the guest will happily use its own cache. If we start to trim, the guest will have to go to disk. That's the best case. Now for the worst case. A random access workload that misses the cache on both guest and host. Now every page is duplicated, and trimming guest pages allows the host to increase its cache, and potentially reduce misses. In this case trimming duplicated pages works. Real life will see a mix of this. Often used pages won't be duplicated, and less often used pages may see some duplication, especially if the host cache portion dedicated to the guest is bigger than the guest cache. I can see that trimming duplicate pages helps, but (a) I'd like to be sure they are duplicates and (b) often trimming them from the host is better than trimming them from the guest. Trimming from the guest is worthwhile if the pages are not used very often (but enough that caching them in the host is worth it) and if the host cache can serve more than one guest. If we can identify those pages, we don't risk degrading best-case workloads (as defined above). (note ksm to some extent identifies those pages, though it is a bit expensive, and doesn't share with the host pagecache). -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
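The best case and worst case Avi describes can be put into a toy cost model. Everything here is assumed: uniform random access, made-up relative costs for a guest hit, a host hit and a disk read, and the simplification that trimmed guest pages translate one-for-one into extra host cache:

/* dup_tradeoff.c - toy cost model for "keep the guest cache" versus
 * "trim it and let the host cache grow". Illustrative only.
 */
#include <stdio.h>

/* Expected cost per access when the guest keeps G cached pages and the
 * host keeps H, of which dup*G duplicate guest pages (and so never help). */
static double cost_keep(double W, double G, double H, double dup,
                        double c_guest, double c_host, double c_disk)
{
    double p_ghit = G / W;
    double useful_host = H - dup * G;
    double p_hhit;

    if (p_ghit > 1.0)
        p_ghit = 1.0;
    if (useful_host < 0.0)
        useful_host = 0.0;
    p_hhit = (W > G) ? useful_host / (W - G) : 0.0;
    if (p_hhit > 1.0)
        p_hhit = 1.0;

    return p_ghit * c_guest +
           (1.0 - p_ghit) * (p_hhit * c_host + (1.0 - p_hhit) * c_disk);
}

/* Expected cost per access when the guest cache is trimmed/ballooned away
 * and its pages are assumed to enlarge the host cache instead. */
static double cost_trim(double W, double G, double H,
                        double c_host, double c_disk)
{
    double p_hhit = (G + H) / W;

    if (p_hhit > 1.0)
        p_hhit = 1.0;
    return p_hhit * c_host + (1.0 - p_hhit) * c_disk;
}

int main(void)
{
    /* Arbitrary cost units: guest hit ~ RAM, host hit ~ exit + copy,
     * miss ~ a disk seek. */
    double c_guest = 1.0, c_host = 20.0, c_disk = 5000.0;

    /* Best case above: working set fits the guest cache, no duplication. */
    printf("fits guest cache:  keep=%7.1f  trim=%7.1f\n",
           cost_keep(1000, 1000, 1000, 0.0, c_guest, c_host, c_disk),
           cost_trim(1000, 1000, 1000, c_host, c_disk));

    /* Worst case above: huge random working set, fully duplicated cache. */
    printf("random, huge set:  keep=%7.1f  trim=%7.1f\n",
           cost_keep(100000, 1000, 1000, 1.0, c_guest, c_host, c_disk),
           cost_trim(100000, 1000, 1000, c_host, c_disk));
    return 0;
}

With the working set fitting the guest cache the "keep" column wins easily; with a huge, fully duplicated working set the "trim" column edges ahead. Real workloads land somewhere between the two, which is the mix described above.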
* Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control 2010-06-15 9:54 ` Avi Kivity @ 2010-06-15 12:49 ` Balbir Singh 0 siblings, 0 replies; 48+ messages in thread From: Balbir Singh @ 2010-06-15 12:49 UTC (permalink / raw) To: Avi Kivity; +Cc: Dave Hansen, kvm, linux-mm, linux-kernel * Avi Kivity <avi@redhat.com> [2010-06-15 12:54:31]: > On 06/15/2010 10:52 AM, Balbir Singh wrote: > >>> > >>>That is why the policy (in the next set) will come from the host. As > >>>to whether the data is truly duplicated, my experiments show up to 60% > >>>of the page cache is duplicated. > >>Isn't that incredibly workload dependent? > >> > >>We can't expect the host admin to know whether duplication will > >>occur or not. > >> > >I was referring to cache = (policy) we use based on the setup. I don't > >think the duplication is too workload specific. Moreover, we could use > >aggressive policies and restrict page cache usage or do it selectively > >on ballooning. We could also add other options to make the ballooning > >option truly optional, so that the system management software decides. > > Consider a read-only workload that exactly fits in guest cache. > Without trimming, the guest will keep hitting its own cache, and the > host will see no access to the cache at all. So the host (assuming > it is under even low pressure) will evict those pages, and the guest > will happily use its own cache. If we start to trim, the guest will > have to go to disk. That's the best case. > > Now for the worst case. A random access workload that misses the > cache on both guest and host. Now every page is duplicated, and > trimming guest pages allows the host to increase its cache, and > potentially reduce misses. In this case trimming duplicated pages > works. > > Real life will see a mix of this. Often used pages won't be > duplicated, and less often used pages may see some duplication, > especially if the host cache portion dedicated to the guest is > bigger than the guest cache. > > I can see that trimming duplicate pages helps, but (a) I'd like to > be sure they are duplicates and (b) often trimming them from the > host is better than trimming them from the guest. > Let's see the behaviour with these patches. The first patch is a proactive approach to keep more memory around. Enabling the parameter implies we are OK paying the cost of some overhead. My data shows that this leaves a significant amount of free memory with a small 5% (in my case) overhead. This brings us back to what you can do with free memory. The second patch shows no overhead and selectively tries to free cache and give memory back when there is memory pressure (as indicated by the balloon driver). We've discussed the reasons for doing this: 1. In the situations where cache is duplicated this should benefit us. Your contention is that we need to be specific about the duplication. That falls under the realm of CMM. 2. In the case of slab cache, duplication does not matter; it is a free page that should ideally be reclaimed ahead of mapped pages. If the slab grows, it will get another new page. What is the cost of (1)? In the worst case, we select a non-duplicated page; but for us to select it, it should be inactive, and in that case we do I/O to bring back the page. > Trimming from the guest is worthwhile if the pages are not used very > often (but enough that caching them in the host is worth it) and if > the host cache can serve more than one guest. If we can identify > those pages, we don't risk degrading best-case workloads (as defined > above).
> > (note ksm to some extent identifies those pages, though it is a bit expensive, and doesn't share with the host pagecache). > I see that you are hinting towards finding exact duplicates; I don't know if the cost and complexity justify it. I hope more users can try the patches with and without the boot parameter and provide additional feedback. -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread, other threads:[~2010-06-17 6:04 UTC | newest]

Thread overview: 48+ messages
2010-06-08 15:51 [RFC/T/D][PATCH 0/2] KVM page cache optimization (v2) Balbir Singh
2010-06-08 15:51 ` [RFC][PATCH 1/2] Linux/Guest unmapped page cache control Balbir Singh
2010-06-13 18:31 ` Balbir Singh
2010-06-14 0:28 ` KAMEZAWA Hiroyuki
2010-06-14 6:49 ` Balbir Singh
2010-06-14 7:00 ` KAMEZAWA Hiroyuki
2010-06-14 7:36 ` Balbir Singh
2010-06-14 7:49 ` KAMEZAWA Hiroyuki
2010-06-08 15:51 ` [RFC/T/D][PATCH 2/2] Linux/Guest cooperative " Balbir Singh
2010-06-10 9:43 ` Avi Kivity
2010-06-10 14:25 ` Balbir Singh
2010-06-11 0:07 ` Dave Hansen
2010-06-11 1:54 ` KAMEZAWA Hiroyuki
2010-06-11 4:46 ` Balbir Singh
2010-06-11 5:05 ` KAMEZAWA Hiroyuki
2010-06-11 5:08 ` KAMEZAWA Hiroyuki
2010-06-11 6:14 ` Balbir Singh
2010-06-11 4:56 ` Balbir Singh
2010-06-14 8:09 ` Avi Kivity
2010-06-14 8:48 ` Balbir Singh
2010-06-14 12:40 ` Avi Kivity
2010-06-14 12:50 ` Balbir Singh
2010-06-14 13:01 ` Avi Kivity
2010-06-14 15:33 ` Dave Hansen
2010-06-14 15:44 ` Avi Kivity
2010-06-14 15:55 ` Dave Hansen
2010-06-14 16:34 ` Avi Kivity
2010-06-14 17:45 ` Balbir Singh
2010-06-15 6:58 ` Avi Kivity
2010-06-15 7:49 ` Balbir Singh
2010-06-15 9:44 ` Avi Kivity
2010-06-15 10:18 ` Balbir Singh
2010-06-14 17:58 ` Dave Hansen
2010-06-15 7:07 ` Avi Kivity
2010-06-15 14:47 ` Dave Hansen
2010-06-16 11:39 ` Avi Kivity
2010-06-17 6:04 ` Balbir Singh
2010-06-14 15:12 ` Dave Hansen
2010-06-14 15:34 ` Avi Kivity
2010-06-14 17:40 ` Balbir Singh
2010-06-15 7:11 ` Avi Kivity
2010-06-14 16:58 ` Balbir Singh
2010-06-14 17:09 ` Dave Hansen
2010-06-14 17:16 ` Balbir Singh
2010-06-15 7:12 ` Avi Kivity
2010-06-15 7:52 ` Balbir Singh
2010-06-15 9:54 ` Avi Kivity
2010-06-15 12:49 ` Balbir Singh