From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx145.postini.com [74.125.245.145]) by kanga.kvack.org (Postfix) with SMTP id CADA96B0062 for ; Tue, 17 Jan 2012 03:14:15 -0500 (EST) Received: by vcge1 with SMTP id e1so473969vcg.14 for ; Tue, 17 Jan 2012 00:14:14 -0800 (PST) From: Minchan Kim Subject: [RFC 0/3] low memory notify Date: Tue, 17 Jan 2012 17:13:55 +0900 Message-Id: <1326788038-29141-1-git-send-email-minchan@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm Cc: LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , Minchan Kim As you can see, it's respin of mem_notify core of KOSAKI and Marcelo. (Of course, KOSAKI's original patchset includes more logics but I didn't include all things intentionally because I want to start from beginning again) Recently, there are some requirements of notification of system memory pressure. It would be very useful for various cases. For example, QEMU/JVM/Firefox like big memory hogger can release their memory when memory pressure happens. Another example in embedded side, they can close background application. For this, there are some trial but we need more general one and not-hacked alloc/free hot path. I think most big problem of system slowness is swap-in operation. Swap-in is a synchronous operation so application's latency would be big. Solution for that is prevent swap-out itself. We couldn't prevent swapout totally but could reduce it with this patch. In case of swapless system, code page is very important for system response. So we have to keep code page, too. I used very naive heuristic in this patch but welcome to any idea. I want to make kernel logic simple if possible and just notify to user space. 
Of course, there are lots of thing we have to consider but for discussion this simple patch would be a good start point. This version is totally RFC so any comments are welcome. Minchan Kim (3): [RFC 1/3] /dev/low_mem_notify [RFC 2/3] vmscan hook [RFC 3/3] test program drivers/char/mem.c | 7 ++ include/linux/low_mem_notify.h | 6 ++ mm/Kconfig | 7 ++ mm/Makefile | 1 + mm/low_mem_notify.c | 61 ++++++++++++++++++++ mm/vmscan.c | 28 +++++++++ poll.c | 121 ++++++++++++++++++++++++++++++++++++++++ 7 files changed, 231 insertions(+), 0 deletions(-) create mode 100644 include/linux/low_mem_notify.h create mode 100644 mm/low_mem_notify.c create mode 100644 poll.c -- 1.7.7.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx136.postini.com [74.125.245.136]) by kanga.kvack.org (Postfix) with SMTP id 6BECA6B0068 for ; Tue, 17 Jan 2012 03:14:20 -0500 (EST) Received: by vbbfa15 with SMTP id fa15so1866150vbb.14 for ; Tue, 17 Jan 2012 00:14:19 -0800 (PST) From: Minchan Kim Subject: [RFC 1/3] /dev/low_mem_notify Date: Tue, 17 Jan 2012 17:13:56 +0900 Message-Id: <1326788038-29141-2-git-send-email-minchan@kernel.org> In-Reply-To: <1326788038-29141-1-git-send-email-minchan@kernel.org> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm Cc: LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , Minchan Kim , KOSAKI Motohiro This patch makes new device file "/dev/low_mem_notify". 
If application polls it, it can receive event when system memory pressure happens. This patch is based on KOSAKI and Marcelo's long time ago work. http://lwn.net/Articles/268732/ Signed-off-by: Marcelo Tosatti Signed-off-by: KOSAKI Motohiro Signed-off-by: Minchan Kim --- drivers/char/mem.c | 7 ++++ include/linux/low_mem_notify.h | 6 ++++ mm/Kconfig | 7 ++++ mm/Makefile | 1 + mm/low_mem_notify.c | 61 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 82 insertions(+), 0 deletions(-) create mode 100644 include/linux/low_mem_notify.h create mode 100644 mm/low_mem_notify.c diff --git a/drivers/char/mem.c b/drivers/char/mem.c index d6e9d08..72bc12b 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -35,6 +35,10 @@ # include #endif +#ifdef CONFIG_LOW_MEM_NOTIFY +extern struct file_operations low_mem_notify_fops; +#endif + static inline unsigned long size_inside_page(unsigned long start, unsigned long size) { @@ -867,6 +871,9 @@ static const struct memdev { #ifdef CONFIG_CRASH_DUMP [12] = { "oldmem", 0, &oldmem_fops, NULL }, #endif +#ifdef CONFIG_LOW_MEM_NOTIFY + [13] = { "low_mem_notify",0666, &low_mem_notify_fops, NULL}, +#endif }; static int memory_open(struct inode *inode, struct file *filp) diff --git a/include/linux/low_mem_notify.h b/include/linux/low_mem_notify.h new file mode 100644 index 0000000..bc0fc89 --- /dev/null +++ b/include/linux/low_mem_notify.h @@ -0,0 +1,6 @@ +#ifndef _LINUX_LOW_MEM_NOTIFY_H +#define _LINUX_LOW_MEM_NOTIFY_H + +void low_memory_pressure(void); + +#endif diff --git a/mm/Kconfig b/mm/Kconfig index e338407..a2f48c6 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -379,3 +379,10 @@ config CLEANCACHE in a negligible performance hit. If unsure, say Y to enable cleancache + +config LOW_MEM_NOTIFY + bool "Enable low memory notification" + default n + help + If system suffer from low memory, kernel can notify it to user through + /dev/low_mem_notify. 
diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..7856357 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -51,3 +51,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_CLEANCACHE) += cleancache.o +obj-$(CONFIG_LOW_MEM_NOTIFY) += low_mem_notify.o diff --git a/mm/low_mem_notify.c b/mm/low_mem_notify.c new file mode 100644 index 0000000..7432307 --- /dev/null +++ b/mm/low_mem_notify.c @@ -0,0 +1,61 @@ +#include +#include +#include +#include +#include + +static DECLARE_WAIT_QUEUE_HEAD(low_mem_wait); +static atomic_t nr_low_mem = ATOMIC_INIT(0); + +struct low_mem_notify_file_info { + unsigned long last_proc_notify; +}; + +void low_memory_pressure(void) +{ + atomic_inc(&nr_low_mem); + wake_up(&low_mem_wait); +} + +static int low_mem_notify_open(struct inode *inode, struct file *file) +{ + struct low_mem_notify_file_info *info; + int err = 0; + + info = kmalloc(sizeof(*info), GFP_KERNEL); + if (!info) { + err = -ENOMEM; + goto out; + } + + file->private_data = info; +out: + return err; +} + +static int low_mem_notify_release(struct inode *inode, struct file *file) +{ + kfree(file->private_data); + return 0; +} + +static unsigned int low_mem_notify_poll(struct file *file, poll_table *wait) +{ + unsigned int ret = 0; + + poll_wait(file, &low_mem_wait, wait); + + if (atomic_read(&nr_low_mem) != 0) { + ret = POLLIN; + atomic_set(&nr_low_mem, 0); + } + + return ret; +} + +struct file_operations low_mem_notify_fops = { + .open = low_mem_notify_open, + .release = low_mem_notify_release, + .poll = low_mem_notify_poll, +}; +EXPORT_SYMBOL(low_mem_notify_fops); -- 1.7.7.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx145.postini.com [74.125.245.145]) by kanga.kvack.org (Postfix) with SMTP id 69E8C6B006C for ; Tue, 17 Jan 2012 03:14:24 -0500 (EST)
Received: by mail-vx0-f169.google.com with SMTP id e1so473969vcg.14 for ; Tue, 17 Jan 2012 00:14:24 -0800 (PST)
From: Minchan Kim
Subject: [RFC 2/3] vmscan hook
Date: Tue, 17 Jan 2012 17:13:57 +0900
Message-Id: <1326788038-29141-3-git-send-email-minchan@kernel.org>
In-Reply-To: <1326788038-29141-1-git-send-email-minchan@kernel.org>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm
Cc: LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , Minchan Kim

This patch inserts a memory pressure notification point into vmscan.c.

Most system slowness comes from swap-in. Swap-in is a synchronous
operation, so it heavily affects system response.

This patch raises the alert when the reclaimer starts to reclaim the
inactive anon list. That may seem rather early, but early is better
than too late.

The other alert point is when few page cache pages remain.
In this implementation, the memory pressure notification fires if
(cache < free pages). It needs more testing and tuning, or another
heuristic. Any suggestions are welcome.

Signed-off-by: Minchan Kim --- mm/vmscan.c | 28 ++++++++++++++++++++++++++++ 1 files changed, 28 insertions(+), 0 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 2880396..cfa2e2d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -43,6 +43,7 @@ #include #include #include +#include #include #include @@ -2082,16 +2083,43 @@ static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz, { unsigned long nr[NR_LRU_LISTS]; unsigned long nr_to_scan; + enum lru_list lru; unsigned long nr_reclaimed, nr_scanned; unsigned long nr_to_reclaim = sc->nr_to_reclaim; struct blk_plug plug; +#ifdef CONFIG_LOW_MEM_NOTIFY + bool low_mem = false; + unsigned long free, file; +#endif restart: nr_reclaimed = 0; nr_scanned = sc->nr_scanned; get_scan_count(mz, sc, nr, priority); +#ifdef CONFIG_LOW_MEM_NOTIFY + /* We want to avoid swapout */ + if (nr[LRU_INACTIVE_ANON]) + low_mem = true; + /* + * We want to avoid dropping page cache excessively + * in no swap system + */ + if (nr_swap_pages <= 0) { + free = zone_page_state(mz->zone, NR_FREE_PAGES); + file = zone_page_state(mz->zone, NR_ACTIVE_FILE) + + zone_page_state(mz->zone, NR_INACTIVE_FILE); + /* + * If we have very few page cache pages, + * notify to user + */ + if (file < free) + low_mem = true; + } + if (low_mem) + low_memory_pressure(); +#endif blk_start_plug(&plug); while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) { -- 1.7.7.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
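
The trigger condition in the vmscan hook above can be restated as a small standalone predicate, which makes it easier to see which inputs fire the notification. This is only a userspace sketch: the function name and parameter names are hypothetical, and in the patch the values come from get_scan_count(), nr_swap_pages and zone_page_state() rather than from arguments.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace restatement of the RFC's trigger condition.
 * Names are illustrative only; they do not exist in the kernel.
 */
static bool low_mem_should_notify(unsigned long inactive_anon_to_scan,
                                  long swap_pages_left,
                                  unsigned long file_pages,
                                  unsigned long free_pages)
{
    /* Reclaim is about to scan inactive anon pages: swap-out may follow. */
    if (inactive_anon_to_scan > 0)
        return true;

    /* No swap left and the page cache is smaller than free memory. */
    if (swap_pages_left <= 0 && file_pages < free_pages)
        return true;

    return false;
}
```

Written this way, the first branch fires whenever any inactive anon pages are scheduled for scanning, regardless of how much memory is free, and the second branch is only reachable on a system with no swap left.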
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx136.postini.com [74.125.245.136]) by kanga.kvack.org (Postfix) with SMTP id 070276B006C for ; Tue, 17 Jan 2012 03:14:28 -0500 (EST) Received: by mail-vw0-f41.google.com with SMTP id fa15so1866150vbb.14 for ; Tue, 17 Jan 2012 00:14:28 -0800 (PST) From: Minchan Kim Subject: [RFC 3/3] test program Date: Tue, 17 Jan 2012 17:13:58 +0900 Message-Id: <1326788038-29141-4-git-send-email-minchan@kernel.org> In-Reply-To: <1326788038-29141-1-git-send-email-minchan@kernel.org> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm Cc: LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , Minchan Kim This test program allocates 10M per second and when memory pressure notify happens, it releases 20M. I tested this patch on 512M qemu machine with 3 test program. I saw some swapout but not too many and even didn't see OOM. It obviously reduces swap out. Signed-off-by: Minchan Kim --- poll.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 121 insertions(+), 0 deletions(-) create mode 100644 poll.c diff --git a/poll.c b/poll.c new file mode 100644 index 0000000..3215f8b --- /dev/null +++ b/poll.c @@ -0,0 +1,121 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define ALLOC_UNIT 10 /* MB */ +#define FREE_UNIT 20 /* MB */ + +void alloc_memory(); +void free_memory(); + +unsigned int total_memory = 0; /* MB */ + +pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; + +/* + * If total memory is higher than 200M + */ +bool memory_full() +{ + return total_memory >= 400 ? 
true : false; +} + +struct alloc_chunk { + void *ptr; + struct alloc_chunk *next; +}; + +struct alloc_chunk head_chunk; + +void init_alloc_chunk(void) +{ + head_chunk.ptr = NULL; + head_chunk.next = NULL; +} + +void add_memory(void *ptr) +{ + struct alloc_chunk *new_chunk = malloc(sizeof(struct alloc_chunk)); + new_chunk->ptr = ptr; + + pthread_mutex_lock(&mutex); + new_chunk->next = head_chunk.next; + head_chunk.next = new_chunk; + total_memory += ALLOC_UNIT; + pthread_mutex_unlock(&mutex); + + printf("[%d] Add total memory %d(MB)\n", getpid(), total_memory); +} + +void alloc_memory(void) +{ + while(1) { + if (memory_full()) { + sleep(10); + continue; + } + + void *new = malloc(ALLOC_UNIT*1024*1024); + memset(new, 0, ALLOC_UNIT*1024*1024); + add_memory(new); + sleep(1); + } +} + +void free_memory(void) +{ + int count = FREE_UNIT / ALLOC_UNIT; + while(count--) { + struct alloc_chunk *chunk = head_chunk.next; + if (chunk == NULL) + break; + + pthread_mutex_lock(&mutex); + head_chunk.next = chunk->next; + total_memory -= ALLOC_UNIT; + pthread_mutex_unlock(&mutex); + + free(chunk->ptr); + free(chunk); + + printf("[%d] Free total memory %d(MB)\n", getpid(), total_memory); + } +} + +void *poll_thread(void *dummy) +{ + struct pollfd pfd; + int fd = open("/dev/low_mem_notify", O_RDONLY); + if (fd == -1) { + fprintf(stderr, "Fail to open\n"); + return; + } + + pfd.fd = fd; + pfd.events = POLLIN; + + while(1) { + poll(&pfd, 1, -1); + free_memory(); + } +} + +int main() +{ + pthread_t threadid; + init_alloc_chunk(); + + if (pthread_create(&threadid, NULL, poll_thread, NULL)) { + fprintf(stderr, "pthread create fail\n"); + return 1; + } + + alloc_memory(); + return 0; +} -- 1.7.7.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx110.postini.com [74.125.245.110]) by kanga.kvack.org (Postfix) with SMTP id 2EE286B0073 for ; Tue, 17 Jan 2012 03:40:56 -0500 (EST) Received: from m3.gw.fujitsu.co.jp (unknown [10.0.50.73]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id 381933EE0BD for ; Tue, 17 Jan 2012 17:40:54 +0900 (JST) Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 16C2345DEF2 for ; Tue, 17 Jan 2012 17:40:54 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id DFA5345DEDC for ; Tue, 17 Jan 2012 17:40:53 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id CD7531DB8047 for ; Tue, 17 Jan 2012 17:40:53 +0900 (JST) Received: from m107.s.css.fujitsu.com (m107.s.css.fujitsu.com [10.240.81.147]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 75A0C1DB8042 for ; Tue, 17 Jan 2012 17:40:53 +0900 (JST) Date: Tue, 17 Jan 2012 17:39:32 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [RFC 2/3] vmscan hook Message-Id: <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <1326788038-29141-3-git-send-email-minchan@kernel.org> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod On Tue, 17 Jan 2012 17:13:57 +0900 Minchan Kim wrote: > This patch insert memory pressure notify point into vmscan.c 
> Most problem in system slowness is swap-in. swap-in is a synchronous > opeartion so that it affects heavily system response. > > This patch alert it when reclaimer start to reclaim inactive anon list. > It seems rather earlier but not bad than too late. > > Other alert point is when there is few cache pages > In this implementation, if it is (cache < free pages), > memory pressure notify happens. It has to need more testing and tuning > or other hueristic. Any suggesion are welcome. > > Signed-off-by: Minchan Kim In my 1st impression, isn't this too simple ? > --- > mm/vmscan.c | 28 ++++++++++++++++++++++++++++ > 1 files changed, 28 insertions(+), 0 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2880396..cfa2e2d 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -43,6 +43,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -2082,16 +2083,43 @@ static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz, > { > unsigned long nr[NR_LRU_LISTS]; > unsigned long nr_to_scan; > + > enum lru_list lru; > unsigned long nr_reclaimed, nr_scanned; > unsigned long nr_to_reclaim = sc->nr_to_reclaim; > struct blk_plug plug; > +#ifdef CONFIG_LOW_MEM_NOTIFY > + bool low_mem = false; > + unsigned long free, file; > +#endif > > restart: > nr_reclaimed = 0; > nr_scanned = sc->nr_scanned; > get_scan_count(mz, sc, nr, priority); > +#ifdef CONFIG_LOW_MEM_NOTIFY > + /* We want to avoid swapout */ > + if (nr[LRU_INACTIVE_ANON]) > + low_mem = true; IIUC, nr[LRU_INACTIVE_ANON] can be easily > 0. And get_scan_count() now check per-memcg-lru. So, this only works when memcg is not used. 
> + /* > + * We want to avoid dropping page cache excessively > + * in no swap system > + */ > + if (nr_swap_pages <= 0) { > + free = zone_page_state(mz->zone, NR_FREE_PAGES); > + file = zone_page_state(mz->zone, NR_ACTIVE_FILE) + > + zone_page_state(mz->zone, NR_INACTIVE_FILE); > + /* > + * If we have very few page cache pages, > + * notify to user > + */ > + if (file < free) > + low_mem = true; > + } I can't understand why you think you can check lowmem condition by "file < free". And I don't think using per-zone data is good. (I'm not sure how many zones embeded guys using..) Another idea: 1. can't we use some technique like cleancache to detect the condition ? 2. can't we measure page-in/page-out distance by recording something ? 3. NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can ignore the data file cache ? 4. how about checking kswapd's busy status ? Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx127.postini.com [74.125.245.127]) by kanga.kvack.org (Postfix) with SMTP id 2FD666B006E for ; Tue, 17 Jan 2012 04:14:09 -0500 (EST) Received: by vbbfa15 with SMTP id fa15so1901977vbb.14 for ; Tue, 17 Jan 2012 01:14:08 -0800 (PST) Date: Tue, 17 Jan 2012 18:13:56 +0900 From: Minchan Kim Subject: Re: [RFC 2/3] vmscan hook Message-ID: <20120117091356.GA29736@barrios-desktop.redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: KAMEZAWA Hiroyuki Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod On Tue, Jan 17, 2012 at 05:39:32PM +0900, KAMEZAWA Hiroyuki wrote: > On Tue, 17 Jan 2012 17:13:57 +0900 > Minchan Kim wrote: > > > This patch insert memory pressure notify point into vmscan.c > > Most problem in system slowness is swap-in. swap-in is a synchronous > > opeartion so that it affects heavily system response. > > > > This patch alert it when reclaimer start to reclaim inactive anon list. > > It seems rather earlier but not bad than too late. > > > > Other alert point is when there is few cache pages > > In this implementation, if it is (cache < free pages), > > memory pressure notify happens. It has to need more testing and tuning > > or other hueristic. Any suggesion are welcome. > > > > Signed-off-by: Minchan Kim > > In my 1st impression, isn't this too simple ? I agree It's too simple. 
It would be good start point rather than unnecessary complicated things. > > > > --- > > mm/vmscan.c | 28 ++++++++++++++++++++++++++++ > > 1 files changed, 28 insertions(+), 0 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 2880396..cfa2e2d 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -43,6 +43,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -2082,16 +2083,43 @@ static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz, > > { > > unsigned long nr[NR_LRU_LISTS]; > > unsigned long nr_to_scan; > > + > > enum lru_list lru; > > unsigned long nr_reclaimed, nr_scanned; > > unsigned long nr_to_reclaim = sc->nr_to_reclaim; > > struct blk_plug plug; > > +#ifdef CONFIG_LOW_MEM_NOTIFY > > + bool low_mem = false; > > + unsigned long free, file; > > +#endif > > > > restart: > > nr_reclaimed = 0; > > nr_scanned = sc->nr_scanned; > > get_scan_count(mz, sc, nr, priority); > > +#ifdef CONFIG_LOW_MEM_NOTIFY > > + /* We want to avoid swapout */ > > + if (nr[LRU_INACTIVE_ANON]) > > + low_mem = true; > > IIUC, nr[LRU_INACTIVE_ANON] can be easily > 0. Yes. But I thought it would be better than late notification. Late notification ends up swap out which is a big concern about this patch. More proper timing suggestion helps me a lot. > And get_scan_count() now check per-memcg-lru. So, this only works when > memcg is not used. Hmm, I didn't look at recent memcg/global reclaim unify patch of Johannes. I need time to look at it. Thanks. 
> > > > + /* > > + * We want to avoid dropping page cache excessively > > + * in no swap system > > + */ > > + if (nr_swap_pages <= 0) { > > + free = zone_page_state(mz->zone, NR_FREE_PAGES); > > + file = zone_page_state(mz->zone, NR_ACTIVE_FILE) + > > + zone_page_state(mz->zone, NR_INACTIVE_FILE); > > + /* > > + * If we have very few page cache pages, > > + * notify to user > > + */ > > + if (file < free) > > + low_mem = true; > > + } > > I can't understand why you think you can check lowmem condition by "file < free". The reason I thought so is I want to maintain some page cache to some degree. But I admit It's very naive heuristic and should be improved. > And I don't think using per-zone data is good. > (I'm not sure how many zones embeded guys using..) Agree. In case of swapless system, we need another heuristic. > > Another idea: > 1. can't we use some technique like cleancache to detect the condition ? I totally forgot cleancache approach. Could you remind that? > 2. can't we measure page-in/page-out distance by recording something ? I can't understand your point. What's relation does it with swapout prevent? > 3. NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can > ignore the data file cache ? It's good but how do we define some amount? It's very vague but I guess we can get a good idea from that. Perhaps, you already has it. > 4. how about checking kswapd's busy status ? Could you elaborate on your idea? Kame, Thanks for reply, > > > > Thanks, > -Kame > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx121.postini.com [74.125.245.121]) by kanga.kvack.org (Postfix) with SMTP id 379246B0080 for ; Tue, 17 Jan 2012 04:27:35 -0500 (EST)
Received: by vcbfl11 with SMTP id fl11so32394vcb.14 for ; Tue, 17 Jan 2012 01:27:34 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <1326788038-29141-2-git-send-email-minchan@kernel.org>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org>
Date: Tue, 17 Jan 2012 11:27:34 +0200
Message-ID:
Subject: Re: [RFC 1/3] /dev/low_mem_notify
From: Pekka Enberg
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro

On Tue, Jan 17, 2012 at 10:13 AM, Minchan Kim wrote:
> +static unsigned int low_mem_notify_poll(struct file *file, poll_table *wait)
> +{
> +        unsigned int ret = 0;
> +
> +        poll_wait(file, &low_mem_wait, wait);
> +
> +        if (atomic_read(&nr_low_mem) != 0) {
> +                ret = POLLIN;
> +                atomic_set(&nr_low_mem, 0);
> +        }
> +
> +        return ret;
> +}

Doesn't this mean that only one application will receive the notification?

                        Pekka

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
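
The question above turns on the fact that the quoted poll handler clears one global counter, so whichever reader polls first consumes the event. One possible way around that, offered here as an assumption and not something proposed in the thread, is to never reset the global count and instead remember per open file which event each reader has already seen. The single-threaded userspace model below illustrates the idea; all names are hypothetical, and a kernel version would need atomic accesses and the wait queue from the patch.

```c
#include <stdbool.h>

/* Global event sequence, bumped once per pressure event and never reset. */
static unsigned long low_mem_seq;

/* Hypothetical per-open-file state: the last sequence this reader saw. */
struct reader {
    unsigned long last_seen;
};

/* A new reader starts from the current sequence, so it sees only new events. */
static void reader_open(struct reader *r)
{
    r->last_seen = low_mem_seq;
}

/* Models low_memory_pressure(): record that one more event happened. */
static void pressure_event(void)
{
    low_mem_seq++;
}

/* Returns true if events arrived since this reader last polled. */
static bool reader_poll(struct reader *r)
{
    if (r->last_seen == low_mem_seq)
        return false;
    r->last_seen = low_mem_seq;
    return true;
}
```

With this bookkeeping, two readers that both poll after a single pressure_event() each get one positive result, because neither poll touches the shared sequence.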
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx120.postini.com [74.125.245.120]) by kanga.kvack.org (Postfix) with SMTP id 031F26B0093 for ; Tue, 17 Jan 2012 04:45:08 -0500 (EST)
Received: by vbbfa15 with SMTP id fa15so1921390vbb.14 for ; Tue, 17 Jan 2012 01:45:08 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <1326788038-29141-2-git-send-email-minchan@kernel.org>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org>
Date: Tue, 17 Jan 2012 11:45:06 +0200
Message-ID:
Subject: Re: [RFC 1/3] /dev/low_mem_notify
From: Pekka Enberg
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro

On Tue, Jan 17, 2012 at 10:13 AM, Minchan Kim wrote:
> This patch makes new device file "/dev/low_mem_notify".
> If application polls it, it can receive event when system
> memory pressure happens.
>
> This patch is based on KOSAKI and Marcelo's long time ago work.
> http://lwn.net/Articles/268732/

I'm not loving the ABI. Alternative solutions:

  - SIGDANGER + signalfd() for poll

  - sys_eventfd()

  - sys_mem_notify_open() similar to sys_perf_event_open()

                        Pekka

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
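
Of the alternatives listed above, eventfd already has well-defined poll semantics that can be tried from userspace. The sketch below is Linux-specific and rests on one loud assumption: an ordinary userspace writer stands in for the kernel side, because the thread does not say how the kernel would signal the descriptor. notify_pressure() and wait_pressure() are hypothetical helper names; eventfd(), poll(), read() and write() are the standard calls.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal one pressure event on the eventfd; returns 0 on success. */
static int notify_pressure(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Wait up to timeout_ms for events. Returns the number of events
 * accumulated since the last read, 0 on timeout, -1 on error.
 */
static long wait_pressure(int efd, int timeout_ms)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    uint64_t count;
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    /* In the default (non-semaphore) mode, read returns and clears the sum. */
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;
    return (long)count;
}
```

A caller would create the descriptor with eventfd(0, EFD_NONBLOCK) and pass it to both helpers. Because the counter accumulates, events raised while the reader is busy are coalesced into one wakeup rather than lost.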
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx197.postini.com [74.125.245.197]) by kanga.kvack.org (Postfix) with SMTP id 934226B0096 for ; Tue, 17 Jan 2012 05:06:35 -0500 (EST) Received: from m3.gw.fujitsu.co.jp (unknown [10.0.50.73]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id DED353EE0C3 for ; Tue, 17 Jan 2012 19:06:33 +0900 (JST) Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id C765C45DEEA for ; Tue, 17 Jan 2012 19:06:33 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id A894145DEED for ; Tue, 17 Jan 2012 19:06:33 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 984F41DB8038 for ; Tue, 17 Jan 2012 19:06:33 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.240.81.146]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 43CA71DB803B for ; Tue, 17 Jan 2012 19:06:33 +0900 (JST) Date: Tue, 17 Jan 2012 19:05:12 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [RFC 2/3] vmscan hook Message-Id: <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20120117091356.GA29736@barrios-desktop.redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod On Tue, 17 
Jan 2012 18:13:56 +0900 Minchan Kim wrote: > On Tue, Jan 17, 2012 at 05:39:32PM +0900, KAMEZAWA Hiroyuki wrote: > > On Tue, 17 Jan 2012 17:13:57 +0900 > > Minchan Kim wrote: > > > > > > > + /* > > > + * We want to avoid dropping page cache excessively > > > + * in no swap system > > > + */ > > > + if (nr_swap_pages <= 0) { > > > + free = zone_page_state(mz->zone, NR_FREE_PAGES); > > > + file = zone_page_state(mz->zone, NR_ACTIVE_FILE) + > > > + zone_page_state(mz->zone, NR_INACTIVE_FILE); > > > + /* > > > + * If we have very few page cache pages, > > > + * notify to user > > > + */ > > > + if (file < free) > > > + low_mem = true; > > > + } > > > > I can't understand why you think you can check lowmem condition by "file < free". > > The reason I thought so is I want to maintain some page cache to some degree. > But I admit It's very naive heuristic and should be improved. > > > And I don't think using per-zone data is good. > > (I'm not sure how many zones embeded guys using..) > > Agree. In case of swapless system, we need another heuristic. > > > > > Another idea: > > 1. can't we use some technique like cleancache to detect the condition ? > > I totally forgot cleancache approach. Could you remind that? > Similar to 'victim cache'. Then, cache some clean pages somewhere when vmscan pageout it. page -> vmscan's pageout -> cleancache -> may be discarded. If a filesystem look up a page which is in a cleancache, cache-hit and bring it back to radix-tree. If not, read from disk again. And cleancache for swap(frontswap) was posted, too. > > 2. can't we measure page-in/page-out distance by recording something ? > > I can't understand your point. What's relation does it with swapout prevent? > If distance between pageout -> pagein is short, it means thrashing. For example, recoding the timestamp when the page(mapping, index) was paged-out, and check it at page-in. > > 3. 
NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can
> > ignore the data file cache ?
>
> It's good but how do we define some amount?
> It's very vague but I guess we can get a good idea from that.
> Perhaps, you already has it.
>

Hm, a rough idea is...

- we now have rss counter per mm.
- mapped anon
- mapped file
- swapents

Ok, here, add one more counter.

- paged-out file. (I think this can be recorded in pte.)
  +1 when try_to_unmap_file() unmaps it.
  -1 when a page is back or unmapped.

Then, scanning all tasks. Then,

                             mapped_anon + mapped_file
active_map_ratio = ----------------------------------------------------- * 100
                   mapped_anon + mapped_file + swapents + paged_out_file

Ok, how to use this value...

Like memcg's threshold notify interface, you can change the mem_notify
interface to use eventfd() as

This will inform you an event when active_map_ratio crosses passed threshold.

complicated ?

> > 4. how about checking kswapd's busy status ?
>
> Could you elaborate on your idea?
>

I just thought kswapd may not stop when the situation is very bad.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
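
The active_map_ratio described in the mail above can be worked through with a few lines of C. This is a sketch under stated assumptions: the struct and function names are hypothetical, paged_out_file is the proposed counter that does not exist in the kernel, and returning 100 for a task with nothing mapped is a choice made here, not something the mail specifies.

```c
/*
 * Hypothetical per-task counters. The first three mirror the per-mm
 * counters named in the mail; paged_out_file is the proposed addition.
 */
struct map_counters {
    unsigned long mapped_anon;
    unsigned long mapped_file;
    unsigned long swapents;
    unsigned long paged_out_file;
};

/* Percentage of the task's mapped pages that are still resident (0..100). */
static unsigned int active_map_ratio(const struct map_counters *c)
{
    unsigned long long resident =
        (unsigned long long)c->mapped_anon + c->mapped_file;
    unsigned long long total = resident + c->swapents + c->paged_out_file;

    if (total == 0)
        return 100; /* nothing mapped: treated as no pressure (assumption) */

    /* Integer division truncates toward zero. */
    return (unsigned int)(resident * 100 / total);
}
```

For example, a task with 600 mapped anon pages, 200 mapped file pages, 150 swap entries and 50 paged-out file pages has 800 of 1000 pages resident, a ratio of 80. A notification threshold would then be compared against this value.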
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx169.postini.com [74.125.245.169]) by kanga.kvack.org (Postfix) with SMTP id 71BD76B00BE for ; Tue, 17 Jan 2012 09:38:35 -0500 (EST) Received: from compute2.internal (compute2.nyi.mail.srv.osa [10.202.2.42]) by gateway1.nyi.mail.srv.osa (Postfix) with ESMTP id 7299B20E53 for ; Tue, 17 Jan 2012 09:38:34 -0500 (EST) Subject: Re: [RFC 0/3] low memory notify From: Colin Walters Date: Tue, 17 Jan 2012 09:38:10 -0500 In-Reply-To: <1326788038-29141-1-git-send-email-minchan@kernel.org> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit Message-ID: <1326811093.3467.41.camel@lenny> Mime-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod On Tue, 2012-01-17 at 17:13 +0900, Minchan Kim wrote: > As you can see, it's respin of mem_notify core of KOSAKI and Marcelo. > (Of course, KOSAKI's original patchset includes more logics but I didn't > include all things intentionally because I want to start from beginning > again) Recently, there are some requirements of notification of system > memory pressure. How does this relate to the existing cgroups memory notifications? See Documentation/cgroups/memory.txt under "10. OOM Control" > It would be very useful for various cases. > For example, QEMU/JVM/Firefox like big memory hogger can release their memory > when memory pressure happens. I don't know about QEMU, but the key characteristic of the JVM and Firefox is that they use garbage collection. 
Which also applies to Python, Ruby, Google Go, Haskell, OCaml... So what you really want to be investigating here is integration between a garbage collector and the system VM. Your test program looks nothing like a garbage collector. I'd expect most of the performance tradeoffs to be similar between these runtimes. The Azul people have been doing something like this: http://www.managedruntime.org/ In Firefox' case though it can also drop other caches, e.g.: http://people.gnome.org/~federico/news-2007-09.html#firefox-memory-1 As far as the desktop goes, I want to get notified if we're going to hit swap, not if we're close to exhausting the total of RAM+swap. While swap may make sense for servers that care about throughput mainly, I care a lot about latency. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 161726B0098 for ; Tue, 17 Jan 2012 10:04:30 -0500 (EST) Received: by vbbfa15 with SMTP id fa15so2189228vbb.14 for ; Tue, 17 Jan 2012 07:04:29 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <1326811093.3467.41.camel@lenny> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326811093.3467.41.camel@lenny> Date: Tue, 17 Jan 2012 17:04:28 +0200 Message-ID: Subject: Re: [RFC 0/3] low memory notify From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Colin Walters Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen 
Hod

On Tue, Jan 17, 2012 at 4:38 PM, Colin Walters wrote:
> So what you really want to be investigating here is integration between
> a garbage collector and the system VM. Your test program looks nothing
> like a garbage collector. I'd expect most of the performance tradeoffs
> to be similar between these runtimes. The Azul people have been doing
> something like this: http://www.managedruntime.org/

The interaction isn't all that complex, really. I'd expect most VMs to simply wake up the GC thread when poll() returns. GCs that are able to compact the heap can madvise(MADV_DONTNEED) or even munmap() unused parts of the heap.

Pekka

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx200.postini.com [74.125.245.200]) by kanga.kvack.org (Postfix) with SMTP id 00C986B00D7 for ; Tue, 17 Jan 2012 11:36:10 -0500 (EST) Message-ID: <4F15A34F.40808@redhat.com> Date: Tue, 17 Jan 2012 11:35:27 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro

On 01/17/2012 04:27 AM, Pekka Enberg wrote:
> On Tue, Jan 17, 2012 at 10:13 AM, Minchan Kim wrote:
>> +static unsigned int low_mem_notify_poll(struct file *file, poll_table *wait)
>> +{
>> + 
unsigned int ret = 0; >> + >> + poll_wait(file,&low_mem_wait, wait); >> + >> + if (atomic_read(&nr_low_mem) != 0) { >> + ret = POLLIN; >> + atomic_set(&nr_low_mem, 0); >> + } >> + >> + return ret; >> +} > > Doesn't this mean that only one application will receive the notification? One at a time, which could be a good thing since the last thing we want to do when the system is under memory pressure is create a thundering herd. OTOH, we do need to ensure that programs take turns getting the memory pressure notification. I do not know whether poll_wait automatically takes care of that... -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx170.postini.com [74.125.245.170]) by kanga.kvack.org (Postfix) with SMTP id 51D176B00D8 for ; Tue, 17 Jan 2012 11:44:45 -0500 (EST) Message-ID: <4F15A570.8090604@redhat.com> Date: Tue, 17 Jan 2012 11:44:32 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC 0/3] low memory notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326811093.3467.41.camel@lenny> In-Reply-To: <1326811093.3467.41.camel@lenny> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Colin Walters Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod On 01/17/2012 09:38 AM, Colin Walters wrote: > How does this relate to the existing cgroups memory notifications? See > Documentation/cgroups/memory.txt under "10. 
OOM Control" > As far as the desktop goes, I want to get notified if we're going to hit > swap, not if we're close to exhausting the total of RAM+swap. While > swap may make sense for servers that care about throughput mainly, I > care a lot about latency. You just answered your own question :) This code is indeed meant to avoid/reduce swap use and improve userspace latencies. Minchan posted a very simple example patch set, so we can get an idea in what direction people would want the code to go. This often beats working on complex code for weeks, and then having people tell you they wanted something else :) -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx107.postini.com [74.125.245.107]) by kanga.kvack.org (Postfix) with SMTP id 979586B00DD for ; Tue, 17 Jan 2012 12:16:57 -0500 (EST) Received: by yhpp34 with SMTP id p34so376926yhp.14 for ; Tue, 17 Jan 2012 09:16:56 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <1326788038-29141-1-git-send-email-minchan@kernel.org> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> Date: Tue, 17 Jan 2012 09:16:56 -0800 Message-ID: Subject: Re: [RFC 0/3] low memory notify From: Olof Johansson Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod Hi, On Tue, Jan 17, 2012 at 12:13 AM, Minchan Kim wrote: > As you can see, it's respin of mem_notify core of KOSAKI and Marcelo. 
> (Of course, KOSAKI's original patchset includes more logics but I didn't > include all things intentionally because I want to start from beginning > again) Recently, there are some requirements of notification of system > memory pressure. It would be very useful for various cases. > For example, QEMU/JVM/Firefox like big memory hogger can release their memory > when memory pressure happens. Another example in embedded side, > they can close background application. For this, there are some trial but > we need more general one and not-hacked alloc/free hot path. > > I think most big problem of system slowness is swap-in operation. > Swap-in is a synchronous operation so application's latency would be > big. Solution for that is prevent swap-out itself. We couldn't prevent > swapout totally but could reduce it with this patch. > > In case of swapless system, code page is very important for system response. > So we have to keep code page, too. I used very naive heuristic in this patch > but welcome to any idea. > > I want to make kernel logic simple if possible and just notify to user space. > Of course, there are lots of thing we have to consider but for discussion > this simple patch would be a good start point. This is almost exactly what we've been looking at doing for Chrome OS (which is swapless). In our case, the browser is by far the largest memory consumer on the system, and we have for quite a while been playing tricks with OOM scores trying to make the interaction between the VM and the application happen right such that if we're OOM, the "right" tab process gets killed, etc. But it's not enough (and it's not always accurate enough). Chrome definitely knows already what it would prefer to do to release memory, so having a simple notifier for low memory condition is preferred. We have considered doing it through cgroups but it adds a level of complexity that we don't need for this use case (we do already use cgroups for other reasons though). 
If this simpler solution is heading towards inclusion we'll probably use it instead. -Olof -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id 55EAF6B005C for ; Tue, 17 Jan 2012 13:51:22 -0500 (EST) Received: by lagw12 with SMTP id w12so779983lag.14 for ; Tue, 17 Jan 2012 10:51:20 -0800 (PST) Date: Tue, 17 Jan 2012 20:51:13 +0200 (EET) From: Pekka Enberg Subject: Re: [RFC 1/3] /dev/low_mem_notify In-Reply-To: <4F15A34F.40808@redhat.com> Message-ID: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro Hello, Ok, so here's a proof of concept patch that implements sample-base per-process free threshold VM event watching using perf-like syscall ABI. I'd really like to see something like this that's much more extensible and clean than the /dev based ABIs that people have proposed so far. 
Pekka -------------------> From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx123.postini.com [74.125.245.123]) by kanga.kvack.org (Postfix) with SMTP id 1FEB66B004F for ; Tue, 17 Jan 2012 14:31:07 -0500 (EST) Message-ID: <4F15CC56.90309@redhat.com> Date: Tue, 17 Jan 2012 14:30:30 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro On 01/17/2012 01:51 PM, Pekka Enberg wrote: > Hello, > > Ok, so here's a proof of concept patch that implements sample-base > per-process free threshold VM event watching using perf-like syscall > ABI. I'd really like to see something like this that's much more > extensible and clean than the /dev based ABIs that people have proposed > so far. Looks like a nice extensible interface to me. The only thing is, I expect we will not want to wake up processes most of the time, when there is no memory pressure, because that would just waste battery power and/or cpu time that could be used for something else. The desire to avoid such wakeups makes it harder to wake up processes at arbitrary points set by the API. Another issue is that we might be running two programs on the system, each with a different threshold for "lets free some of my cache". Say one program sets the threshold at 20% free/cache memory, the other program at 10%. 
We could end up with the first process continually throwing away its caches, while the second process never gives its unused memory back to the kernel. I am not sure what the right thing to do would be... -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id D062F6B005A for ; Tue, 17 Jan 2012 14:49:14 -0500 (EST) Received: by vcbfl11 with SMTP id fl11so642668vcb.14 for ; Tue, 17 Jan 2012 11:49:13 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <4F15CC56.90309@redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <4F15CC56.90309@redhat.com> Date: Tue, 17 Jan 2012 21:49:13 +0200 Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro On Tue, Jan 17, 2012 at 9:30 PM, Rik van Riel wrote: > Looks like a nice extensible interface to me. > > The only thing is, I expect we will not want to wake > up processes most of the time, when there is no memory > pressure, because that would just waste battery power > and/or cpu time that could be used for something else. > > The desire to avoid such wakeups makes it harder to > wake up processes at arbitrary points set by the API. Sure. 
You could either bump up the threshold or use Minchan's hooks - or both.

On Tue, Jan 17, 2012 at 9:30 PM, Rik van Riel wrote:
> Another issue is that we might be running two programs
> on the system, each with a different threshold for
> "lets free some of my cache". Say one program sets
> the threshold at 20% free/cache memory, the other
> program at 10%.
>
> We could end up with the first process continually
> throwing away its caches, while the second process
> never gives its unused memory back to the kernel.
>
> I am not sure what the right thing to do would be...

One option is to use per-process thresholds on RSS, for example, and also support system-wide thresholds. That said, I'd really like to see the N9 and Android policies supported with this ABI. It's much easier to make it generic once we support real-world use cases.

Pekka

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx156.postini.com [74.125.245.156]) by kanga.kvack.org (Postfix) with SMTP id 449696B004F for ; Tue, 17 Jan 2012 14:54:28 -0500 (EST) Received: by vbbfa15 with SMTP id fa15so2537297vbb.14 for ; Tue, 17 Jan 2012 11:54:27 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <4F15CC56.90309@redhat.com> Date: Tue, 17 Jan 2012 21:54:27 +0200 Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro On Tue, Jan 17, 2012 at 9:49 PM, Pekka Enberg wrote: > That said, I'd really like to see the N9 and Android policies > supported with this ABI. It's much easier to make it generic once we > support real-world use cases. If people are interested in hacking on the thing, I pushed the commit in 'vmnotify/core' branch of git://github.com/penberg/linux.git Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id 57A3C6B004F for ; Tue, 17 Jan 2012 14:57:54 -0500 (EST) Received: by vbbfa15 with SMTP id fa15so2541025vbb.14 for ; Tue, 17 Jan 2012 11:57:53 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <4F15CC56.90309@redhat.com> Date: Tue, 17 Jan 2012 21:57:53 +0200 Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro On Tue, Jan 17, 2012 at 9:49 PM, Pekka Enberg wrote: >> The desire to avoid such wakeups makes it harder to >> wake up processes at arbitrary points set by the API. > > Sure. You could either bump up the threshold or use Minchan's hooks - or both. s/threshold/sample period/g -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx172.postini.com [74.125.245.172]) by kanga.kvack.org (Postfix) with SMTP id 57A1C6B004D for ; Tue, 17 Jan 2012 18:08:15 -0500 (EST) Received: by vbbfa15 with SMTP id fa15so2728455vbb.14 for ; Tue, 17 Jan 2012 15:08:14 -0800 (PST) Date: Wed, 18 Jan 2012 08:08:01 +0900 From: Minchan Kim Subject: Re: [RFC 2/3] vmscan hook Message-ID: <20120117230801.GA903@barrios-desktop.redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: KAMEZAWA Hiroyuki Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod On Tue, Jan 17, 2012 at 07:05:12PM +0900, KAMEZAWA Hiroyuki wrote: > On Tue, 17 Jan 2012 18:13:56 +0900 > Minchan Kim wrote: > > > On Tue, Jan 17, 2012 at 05:39:32PM +0900, KAMEZAWA Hiroyuki wrote: > > > On Tue, 17 Jan 2012 17:13:57 +0900 > > > Minchan Kim wrote: > > > > > > > > > > + /* > > > > + * We want to avoid dropping page cache excessively > > > > + * in no swap system > > > > + */ > > > > + if (nr_swap_pages <= 0) { > > > > + free = zone_page_state(mz->zone, NR_FREE_PAGES); > > > > + file = zone_page_state(mz->zone, NR_ACTIVE_FILE) + > > > > + zone_page_state(mz->zone, NR_INACTIVE_FILE); > > > > + /* > > > > + * If we have very few page cache pages, > > > > + * notify to user > > > > + */ 
> > > > +			if (file < free)
> > > > +				low_mem = true;
> > > > +		}
> > >
> > > I can't understand why you think you can check lowmem condition by "file < free".
> >
> > The reason I thought so is I want to maintain some page cache to some degree.
> > But I admit It's very naive heuristic and should be improved.
> >
> > > And I don't think using per-zone data is good.
> > > (I'm not sure how many zones embeded guys using..)
> >
> > Agree. In case of swapless system, we need another heuristic.
> >
> > >
> > > Another idea:
> > > 1. can't we use some technique like cleancache to detect the condition ?
> >
> > I totally forgot cleancache approach. Could you remind that?
> >
>
> Similar to 'victim cache'. Then, cache some clean pages somewhere when
> vmscan pageout it.
>
> page -> vmscan's pageout -> cleancache -> may be discarded.
>
> If a filesystem look up a page which is in a cleancache, cache-hit and
> bring it back to radix-tree. If not, read from disk again.
> And cleancache for swap(frontswap) was posted, too.

I am not sure this can prevent swapout. I think it ends up evicting pages into swap devices.

> >
> > > 2. can't we measure page-in/page-out distance by recording something ?
> >
> > I can't understand your point. What's relation does it with swapout prevent?
> >
>
> If distance between pageout -> pagein is short, it means thrashing.
> For example, recoding the timestamp when the page(mapping, index) was
> paged-out, and check it at page-in.

Our goal is to prevent swapout. By the time we detect thrashing, it's too late.

> >
> > > 3. NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can
> > > ignore the data file cache ?
> >
> > It's good but how do we define some amount?
> > It's very vague but I guess we can get a good idea from that.
> > Perhaps, you already has it.
> >
>
> Hm, a rough idea is...
>
> - we now have rss counter per mm.
> - mapped anon
> - mapped file
> - swapents
>
> Ok, here, add one more counter.
>
> - paged-out file. 
(I think this can be recorded in pte.)
> +1 when try_to_unmap_file() unmaps it.
> -1 when a page is back or unmapped.
>
> Then, scanning all tasks. Then,
>
>                               mapped_anon + mapped_file
> active_map_ratio = ----------------------------------------------------- * 100
>                    mapped_anon + mapped_file + swapents + paged_out_file
>
> Ok, how to use this value...
>
> Like memcg's threshold notify interface, you can change the mem_notify interface
> to use eventfd() as
>
>
>
> This will inform you an event when active_map_ratio crosses passed threshold.
>
> complicated ?

Yes. :) I want to keep it simple if possible.

>
>
> > > 4. how about checking kswapd's busy status ?
> >
> > Could you elaborate on your idea?
> >
>
> I just thought kswapd may not stop when the situation is very bad.

As I said earlier, the goal is to prevent swap. By the time we find that kswapd is busy, many pages might already be swapped out, so it's too late.

>
> Thanks,
> -Kame

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx184.postini.com [74.125.245.184]) by kanga.kvack.org (Postfix) with SMTP id 127526B004D for ; Tue, 17 Jan 2012 18:20:36 -0500 (EST) Received: by vcbfl11 with SMTP id fl11so832301vcb.14 for ; Tue, 17 Jan 2012 15:20:35 -0800 (PST) Date: Wed, 18 Jan 2012 08:20:25 +0900 From: Minchan Kim Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120117232025.GB903@barrios-desktop.redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: Rik van Riel , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro On Tue, Jan 17, 2012 at 08:51:13PM +0200, Pekka Enberg wrote: > Hello, > > Ok, so here's a proof of concept patch that implements sample-base > per-process free threshold VM event watching using perf-like syscall > ABI. I'd really like to see something like this that's much more > extensible and clean than the /dev based ABIs that people have > proposed so far. > > Pekka > > -------------------> > > From a07f93fdca360b20daef4a5d66f2a5746f31f6a6 Mon Sep 17 00:00:00 2001 > From: Pekka Enberg > Date: Tue, 17 Jan 2012 17:51:48 +0200 > Subject: [PATCH] vmnotify: VM event notification system > > This patch implements a new sys_vmnotify_fd() system call that returns a > pollable file descriptor that can be used to watch VM events. 
>
> For example, to watch for VM event when free memory is below 99% of available
> memory using 1 second sample period, you'd do something like this:
>
>     struct vmnotify_config config;
>     struct vmnotify_event event;
>     struct pollfd pollfd;
>     int fd;
>
>     config = (struct vmnotify_config) {
>             .type = VMNOTIFY_TYPE_SAMPLE|VMNOTIFY_TYPE_FREE_THRESHOLD,
>             .sample_period_ns = 1000000000L,
>             .free_threshold = 99,
>     };
>
>     fd = sys_vmnotify_fd(&config);
>
>     pollfd.fd = fd;
>     pollfd.events = POLLIN;
>
>     if (poll(&pollfd, 1, -1) < 0) {
>             perror("poll failed");
>             exit(1);
>     }
>
>     memset(&event, 0, sizeof(event));
>
>     if (read(fd, &event, sizeof(event)) < 0) {
>             perror("read failed");
>             exit(1);
>     }

Hi Pekka,

I haven't looked into your code yet (will do), but from reading the description I'm still not convinced we really need a process-specific threshold like 99%. I think an application can already learn that by polling /proc/meminfo without this mechanism, if it really wants to.

I would like to notify when the system has trouble with memory pressure, without any process-specific threshold. Of course, an application can't predict that (i.e., an application can learn about system memory pressure from /proc/meminfo, but it can't know when swapout really happens). I think the kernel low-memory notification has to give that kind of notification to user space.
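For readers following along, the /proc/meminfo polling mentioned above as the user-space alternative could look roughly like the sketch below. It is only an illustration and not code from either patch set: the helper names (meminfo_field_kb, meminfo_free_below, read_meminfo) are made up here, and the percentage test is assumed to mirror the free * 100 / total comparison in the quoted vmnotify_match().

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the value (in kB) of a "Key:   12345 kB" line in
 * /proc/meminfo-style text, or -1 if no line starts with "key:".
 * Hypothetical helper, not part of any posted patch.
 */
long meminfo_field_kb(const char *text, const char *key)
{
	size_t keylen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, keylen) == 0 && line[keylen] == ':')
			return strtol(line + keylen + 1, NULL, 10);
		/* advance to the start of the next line, if any */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Return 1 if MemFree is at or below threshold_pct percent of MemTotal,
 * 0 if it is above, and -1 if the fields could not be parsed.
 */
int meminfo_free_below(const char *text, unsigned int threshold_pct)
{
	long total = meminfo_field_kb(text, "MemTotal");
	long free_kb = meminfo_field_kb(text, "MemFree");

	if (total <= 0 || free_kb < 0)
		return -1;
	return (unsigned long long)free_kb * 100 /
	       (unsigned long long)total <= threshold_pct;
}

/* Read /proc/meminfo into buf (NUL-terminated); return 0 on success. */
int read_meminfo(char *buf, size_t len)
{
	FILE *fp;
	size_t n;

	if (len == 0)
		return -1;
	fp = fopen("/proc/meminfo", "r");
	if (!fp)
		return -1;
	n = fread(buf, 1, len - 1, fp);
	buf[n] = '\0';
	fclose(fp);
	return n > 0 ? 0 : -1;
}
```

A caller would fill a buffer with read_meminfo() once per sample period and react when meminfo_free_below(buf, pct) returns 1. As noted above, this only reports how much memory is free at the moment of the read; it cannot tell the application when swapout actually starts, which is the gap a kernel-side notification is meant to close.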
> > Signed-off-by: Pekka Enberg > --- > arch/x86/include/asm/unistd_64.h | 2 + > include/linux/vmnotify.h | 44 ++++++ > mm/Kconfig | 6 + > mm/Makefile | 1 + > mm/vmnotify.c | 235 ++++++++++++++++++++++++++++++++ > tools/testing/vmnotify/vmnotify-test.c | 68 +++++++++ > 6 files changed, 356 insertions(+), 0 deletions(-) > create mode 100644 include/linux/vmnotify.h > create mode 100644 mm/vmnotify.c > create mode 100644 tools/testing/vmnotify/vmnotify-test.c > > diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h > index 0431f19..b0928cd 100644 > --- a/arch/x86/include/asm/unistd_64.h > +++ b/arch/x86/include/asm/unistd_64.h > @@ -686,6 +686,8 @@ __SYSCALL(__NR_getcpu, sys_getcpu) > __SYSCALL(__NR_process_vm_readv, sys_process_vm_readv) > #define __NR_process_vm_writev 311 > __SYSCALL(__NR_process_vm_writev, sys_process_vm_writev) > +#define __NR_vmnotify_fd 312 > +__SYSCALL(__NR_vmnotify_fd, sys_vmnotify_fd) > > #ifndef __NO_STUBS > #define __ARCH_WANT_OLD_READDIR > diff --git a/include/linux/vmnotify.h b/include/linux/vmnotify.h > new file mode 100644 > index 0000000..8f8642b > --- /dev/null > +++ b/include/linux/vmnotify.h > @@ -0,0 +1,44 @@ > +#ifndef _LINUX_VMNOTIFY_H > +#define _LINUX_VMNOTIFY_H > + > +#include > + > +enum { > + VMNOTIFY_TYPE_FREE_THRESHOLD = 1ULL << 0, > + VMNOTIFY_TYPE_SAMPLE = 1ULL << 1, > +}; > + > +struct vmnotify_config { > + /* > + * Size of the struct for ABI extensibility. > + */ > + __u32 size; > + > + /* > + * Notification type bitmask > + */ > + __u64 type; > + > + /* > + * Free memory threshold in percentages [1..99] > + */ > + __u32 free_threshold; > + > + /* > + * Sample period in nanoseconds > + */ > + __u64 sample_period_ns; > +}; > + > +struct vmnotify_event { > + /* Size of the struct for ABI extensibility. 
*/
> +	__u32 size;
> +
> +	__u64 nr_avail_pages;
> +
> +	__u64 nr_swap_pages;
> +
> +	__u64 nr_free_pages;
> +};
> +
> +#endif /* _LINUX_VMNOTIFY_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 011b110..6631167 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -373,3 +373,9 @@ config CLEANCACHE
>  	  in a negligible performance hit.
>
>  	  If unsure, say Y to enable cleancache
> +
> +config VMNOTIFY
> +	bool "Enable VM event notification system"
> +	default n
> +	help
> +	  If unsure, say N to disable vmnotify
> diff --git a/mm/Makefile b/mm/Makefile
> index 50ec00e..e1b5db3 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -51,3 +51,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
>  obj-$(CONFIG_CLEANCACHE) += cleancache.o
> +obj-$(CONFIG_VMNOTIFY) += vmnotify.o
> diff --git a/mm/vmnotify.c b/mm/vmnotify.c
> new file mode 100644
> index 0000000..6800450
> --- /dev/null
> +++ b/mm/vmnotify.c
> @@ -0,0 +1,235 @@
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#define VMNOTIFY_MAX_FREE_THRESHOD 100
> +
> +struct vmnotify_watch {
> +	struct vmnotify_config config;
> +
> +	struct mutex mutex;
> +	bool pending;
> +	struct vmnotify_event event;
> +
> +	/* sampling */
> +	struct hrtimer timer;
> +
> +	/* poll */
> +	wait_queue_head_t waitq;
> +};
> +
> +static bool vmnotify_match(struct vmnotify_watch *watch, struct vmnotify_event *event)
> +{
> +	if (watch->config.type & VMNOTIFY_TYPE_FREE_THRESHOLD) {
> +		u64 threshold;
> +
> +		if (!event->nr_avail_pages)
> +			return false;
> +
> +		threshold = event->nr_free_pages * 100 / event->nr_avail_pages;
> +		if (threshold > watch->config.free_threshold)
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static void vmnotify_sample(struct vmnotify_watch *watch)
> +{
> +	struct vmnotify_event event;
> +	struct sysinfo si;
> +
> +	memset(&event, 0, sizeof(event));
> +
> +	event.size = sizeof(event);
> +	event.nr_free_pages = global_page_state(NR_FREE_PAGES);
> +
> +	si_meminfo(&si);
> +	event.nr_avail_pages = si.totalram;
> +
> +#ifdef CONFIG_SWAP
> +	si_swapinfo(&si);
> +	event.nr_swap_pages = si.totalswap;
> +#endif
> +
> +	if (!vmnotify_match(watch, &event))
> +		return;
> +
> +	mutex_lock(&watch->mutex);
> +
> +	watch->pending = true;
> +
> +	memcpy(&watch->event, &event, sizeof(event));
> +
> +	mutex_unlock(&watch->mutex);
> +}
> +
> +static enum hrtimer_restart vmnotify_timer_fn(struct hrtimer *hrtimer)
> +{
> +	struct vmnotify_watch *watch = container_of(hrtimer, struct vmnotify_watch, timer);
> +	u64 sample_period = watch->config.sample_period_ns;
> +
> +	vmnotify_sample(watch);
> +
> +	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
> +
> +	wake_up(&watch->waitq);
> +
> +	return HRTIMER_RESTART;
> +}
> +
> +static void vmnotify_start_timer(struct vmnotify_watch *watch)
> +{
> +	u64 sample_period = watch->config.sample_period_ns;
> +
> +	hrtimer_init(&watch->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	watch->timer.function = vmnotify_timer_fn;
> +
> +	hrtimer_start(&watch->timer, ns_to_ktime(sample_period), HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static unsigned int vmnotify_poll(struct file *file, poll_table *wait)
> +{
> +	struct vmnotify_watch *watch = file->private_data;
> +	unsigned int events = 0;
> +
> +	poll_wait(file, &watch->waitq, wait);
> +
> +	mutex_lock(&watch->mutex);
> +
> +	if (watch->pending)
> +		events |= POLLIN;
> +
> +	mutex_unlock(&watch->mutex);
> +
> +	return events;
> +}
> +
> +static ssize_t vmnotify_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct vmnotify_watch *watch = file->private_data;
> +	int ret = 0;
> +
> +	mutex_lock(&watch->mutex);
> +
> +	if (!watch->pending)
> +		goto out_unlock;
> +
> +	if (copy_to_user(buf, &watch->event, sizeof(struct vmnotify_event))) {
> +		ret = -EFAULT;
> +		goto out_unlock;
> +	}
> +
> +	ret = watch->event.size;
> +
> +	watch->pending = false;
> +
> +out_unlock:
> +	mutex_unlock(&watch->mutex);
> +
> +	return ret;
> +}
> +
> +static int vmnotify_release(struct inode *inode, struct file *file)
> +{
> +	struct vmnotify_watch *watch = file->private_data;
> +
> +	hrtimer_cancel(&watch->timer);
> +
> +	kfree(watch);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vmnotify_fops = {
> +	.poll = vmnotify_poll,
> +	.read = vmnotify_read,
> +	.release = vmnotify_release,
> +};
> +
> +static struct vmnotify_watch *vmnotify_watch_alloc(void)
> +{
> +	struct vmnotify_watch *watch;
> +
> +	watch = kzalloc(sizeof *watch, GFP_KERNEL);
> +	if (!watch)
> +		return NULL;
> +
> +	mutex_init(&watch->mutex);
> +
> +	init_waitqueue_head(&watch->waitq);
> +
> +	return watch;
> +}
> +
> +static int vmnotify_copy_config(struct vmnotify_config __user *uconfig,
> +				struct vmnotify_config *config)
> +{
> +	int ret;
> +
> +	ret = copy_from_user(config, uconfig, sizeof(struct vmnotify_config));
> +	if (ret)
> +		return -EFAULT;
> +
> +	if (!config->type)
> +		return -EINVAL;
> +
> +	if (config->type & VMNOTIFY_TYPE_SAMPLE) {
> +		if (config->sample_period_ns < NSEC_PER_MSEC)
> +			return -EINVAL;
> +	}
> +
> +	if (config->type & VMNOTIFY_TYPE_FREE_THRESHOLD) {
> +		if (config->free_threshold > VMNOTIFY_MAX_FREE_THRESHOD)
> +			return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +SYSCALL_DEFINE1(vmnotify_fd,
> +		struct vmnotify_config __user *, uconfig)
> +{
> +	struct vmnotify_watch *watch;
> +	struct file *file;
> +	int err;
> +	int fd;
> +
> +	watch = vmnotify_watch_alloc();
> +	if (!watch)
> +		return -ENOMEM;
> +
> +	err = vmnotify_copy_config(uconfig, &watch->config);
> +	if (err)
> +		goto err_free;
> +
> +	fd = get_unused_fd_flags(O_RDONLY);
> +	if (fd < 0) {
> +		err = fd;
> +		goto err_free;
> +	}
> +
> +	file = anon_inode_getfile("[vmnotify]", &vmnotify_fops, watch, O_RDONLY);
> +	if (IS_ERR(file)) {
> +		err = PTR_ERR(file);
> +		goto err_fd;
> +	}
> +
> +	fd_install(fd, file);
> +
> +	if (watch->config.type & VMNOTIFY_TYPE_SAMPLE)
> +		vmnotify_start_timer(watch);
> +
> +	return fd;
> +
> +err_fd:
> +	put_unused_fd(fd);
> +err_free:
> +	kfree(watch);
> +	return err;
> +}
> diff --git a/tools/testing/vmnotify/vmnotify-test.c b/tools/testing/vmnotify/vmnotify-test.c
> new file mode 100644
> index 0000000..3c6b26d
> --- /dev/null
> +++ b/tools/testing/vmnotify/vmnotify-test.c
> @@ -0,0 +1,68 @@
> +#include "../../../include/linux/vmnotify.h"
> +
> +#if defined(__x86_64__)
> +#include "../../../arch/x86/include/asm/unistd.h"
> +#endif
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +static int sys_vmnotify_fd(struct vmnotify_config *config)
> +{
> +	config->size = sizeof(*config);
> +
> +	return syscall(__NR_vmnotify_fd, config);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	struct vmnotify_config config;
> +	struct vmnotify_event event;
> +	struct pollfd pollfd;
> +	int i;
> +	int fd;
> +
> +	config = (struct vmnotify_config) {
> +		.type = VMNOTIFY_TYPE_SAMPLE|VMNOTIFY_TYPE_FREE_THRESHOLD,
> +		.sample_period_ns = 1000000000L,
> +		.free_threshold = 99,
> +	};
> +
> +	fd = sys_vmnotify_fd(&config);
> +	if (fd < 0) {
> +		perror("vmnotify_fd failed");
> +		exit(1);
> +	}
> +
> +	for (i = 0; i < 10; i++) {
> +		pollfd.fd = fd;
> +		pollfd.events = POLLIN;
> +
> +		if (poll(&pollfd, 1, -1) < 0) {
> +			perror("poll failed");
> +			exit(1);
> +		}
> +
> +		memset(&event, 0, sizeof(event));
> +
> +		if (read(fd, &event, sizeof(event)) < 0) {
> +			perror("read failed");
> +			exit(1);
> +		}
> +
> +		printf("VM event:\n");
> +		printf("\tsize=%lu\n", event.size);
> +		printf("\tnr_avail_pages=%Lu\n", event.nr_avail_pages);
> +		printf("\tnr_swap_pages=%Lu\n", event.nr_swap_pages);
> +		printf("\tnr_free_pages=%Lu\n", event.nr_free_pages);
> +	}
> +	if (close(fd) < 0) {
> +		perror("close failed");
> +		exit(1);
> +	}
> +
> +	return 0;
> +}
> --
> 1.7.6.4
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id 47E456B004D for ; Tue, 17 Jan 2012 19:19:44 -0500 (EST)
Received: from m3.gw.fujitsu.co.jp (unknown [10.0.50.73]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id C25593EE0BC for ; Wed, 18 Jan 2012 09:19:42 +0900 (JST)
Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id A027F45DEF1 for ; Wed, 18 Jan 2012 09:19:42 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 82DE045DEEC for ; Wed, 18 Jan 2012 09:19:42 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 74A971DB803C for ; Wed, 18 Jan 2012 09:19:42 +0900 (JST)
Received: from m107.s.css.fujitsu.com (m107.s.css.fujitsu.com [10.240.81.147]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 2898A1DB803F for ; Wed, 18 Jan 2012 09:19:42 +0900 (JST)
Date: Wed, 18 Jan 2012 09:18:24 +0900
From: KAMEZAWA Hiroyuki
Subject: Re: [RFC 2/3] vmscan hook
Message-Id: <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20120117230801.GA903@barrios-desktop.redhat.com>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> <20120117230801.GA903@barrios-desktop.redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: linux-mm, LKML, leonid.moiseichuk@nokia.com, penberg@kernel.org, Rik van Riel, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod

On Wed, 18 Jan 2012 08:08:01 +0900
Minchan Kim wrote:

> > > >
> > > > 2. can't we measure page-in/page-out distance by recording something ?
> > >
> > > I can't understand your point. What's relation does it with swapout prevent?
> > >
> >
> > If distance between pageout -> pagein is short, it means thrashing.
> > For example, recording the timestamp when the page(mapping, index) was
> > paged-out, and check it at page-in.
>
> Our goal is prevent swapout. When we found thrashing, it's too late.
>

If you want to prevent swap-out, don't swapon any. That's all.
Then, you can check the number of FILE_CACHE and have threshold.

Thanks,
-Kame

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx107.postini.com [74.125.245.107]) by kanga.kvack.org (Postfix) with SMTP id 345CC6B004D for ; Wed, 18 Jan 2012 02:17:04 -0500 (EST)
Received: by lagw12 with SMTP id w12so1167327lag.14 for ; Tue, 17 Jan 2012 23:17:02 -0800 (PST)
Date: Wed, 18 Jan 2012 09:16:49 +0200 (EET)
From: Pekka Enberg
Subject: Re: [RFC 1/3] /dev/low_mem_notify
In-Reply-To: <20120117232025.GB903@barrios-desktop.redhat.com>
Message-ID:
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <20120117232025.GB903@barrios-desktop.redhat.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: Rik van Riel, linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Wed, 18 Jan 2012, Minchan Kim wrote:
> I didn't look into your code(will do) but as I read description,
> still I don't convince we need really some process specific threshold like 99%
> I think application can know it by polling /proc/meminfo without this mechanism
> if they really want.

I'm not sure if we need arbitrary threshold either. However, we need
to support the following cases:

   - We're about to swap

   - We're about to run out of memory

   - We're about to start OOM killing

and I don't think your patch solves that. One possibility is to implement:

   VMNOTIFY_TYPE_ABOUT_TO_SWAP
   VMNOTIFY_TYPE_ABOUT_TO_OOM
   VMNOTIFY_TYPE_ABOUT_TO_OOM_KILL

and maybe rip out support for arbitrary thresholds. Does that sound more
reasonable?

As for polling /proc/meminfo, I'd much rather deliver stats as part
of vmnotify_read() because it's easier to extend the ABI rather than
adding new fields to /proc/meminfo.

On Wed, 18 Jan 2012, Minchan Kim wrote:
> I would like to notify when system has a trouble with memory pressure without
> some process specific threshold. Of course, application can't expect it.(ie,
> application can know system memory pressure by /proc/meminfo but it can't know
> when swapout really happens). Kernel low mem notify have to give such notification
> to user space, I think.

It should be simple to add support for VMNOTIFY_TYPE_MEM_PRESSURE
that uses your hooks.

 			Pekka

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157]) by kanga.kvack.org (Postfix) with SMTP id 50E956B004D for ; Wed, 18 Jan 2012 02:49:44 -0500 (EST)
Received: by vbbfa15 with SMTP id fa15so2997285vbb.14 for ; Tue, 17 Jan 2012 23:49:43 -0800 (PST)
Date: Wed, 18 Jan 2012 16:49:30 +0900
From: Minchan Kim
Subject: Re: [RFC 1/3] /dev/low_mem_notify
Message-ID: <20120118074930.GA18621@barrios-desktop.redhat.com>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <20120117232025.GB903@barrios-desktop.redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pekka Enberg
Cc: Rik van Riel, linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Wed, Jan 18, 2012 at 09:16:49AM +0200, Pekka Enberg wrote:
> On Wed, 18 Jan 2012, Minchan Kim wrote:
> >I didn't look into your code(will do) but as I read description,
> >still I don't convince we need really some process specific threshold like 99%
> >I think application can know it by polling /proc/meminfo without this mechanism
> >if they really want.
>
> I'm not sure if we need arbitrary threshold either. However, we need
> to support the following cases:
>
>   - We're about to swap
>
>   - We're about to run out of memory
>
>   - We're about to start OOM killing
>
> and I don't think your patch solves that. One possibility is to implement:

I think my patch can extend it but your ABI looks better to me than my approach.

>
>   VMNOTIFY_TYPE_ABOUT_TO_SWAP
>   VMNOTIFY_TYPE_ABOUT_TO_OOM
>   VMNOTIFY_TYPE_ABOUT_TO_OOM_KILL

Yes. We can define some levels.

1. page cache reclaim
2. code page reclaim
3. anonymous page swap out
4. OOM kill.

Application might handle it differently by the memory pressure level.

>
> and maybe rip out support for arbitrary thresholds. Does that sound more
> reasonable?

Currently, Nokia people seem to want process specific thresholds so we might need it.

>
> As for polling /proc/meminfo, I'd much rather deliver stats as part
> of vmnotify_read() because it's easier to extend the ABI rather than
> adding new fields to /proc/meminfo.

Agree.

>
> On Wed, 18 Jan 2012, Minchan Kim wrote:
> >I would like to notify when system has a trouble with memory pressure without
> >some process specific threshold. Of course, application can't expect it.(ie,
> >application can know system memory pressure by /proc/meminfo but it can't know
> >when swapout really happens). Kernel low mem notify have to give such notification
> >to user space, I think.
>
> It should be simple to add support for VMNOTIFY_TYPE_MEM_PRESSURE
> that uses your hooks.

Indeed.

>
> 			Pekka

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx184.postini.com [74.125.245.184]) by kanga.kvack.org (Postfix) with SMTP id 4014F6B005A for ; Wed, 18 Jan 2012 04:09:12 -0500 (EST)
From:
Subject: RE: [RFC 1/3] /dev/low_mem_notify
Date: Wed, 18 Jan 2012 09:06:06 +0000
Message-ID: <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com>
In-Reply-To:
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Sender: owner-linux-mm@kvack.org
List-ID:
To: penberg@kernel.org, riel@redhat.com
Cc: minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com

Hi,

Just a couple of observations below, which may be wrong

> -----Original Message-----
> From: Pekka Enberg [mailto:penberg@gmail.com] On Behalf Of ext Pekka
> Enberg
> Sent: 17 January, 2012 20:51
....
> +struct vmnotify_config {
> +	/*
> +	 * Size of the struct for ABI extensibility.
> +	 */
> +	__u32 size;
> +
> +	/*
> +	 * Notification type bitmask
> +	 */
> +	__u64 type;
> +
> +	/*
> +	 * Free memory threshold in percentages [1..99]
> +	 */
> +	__u32 free_threshold;

Would be possible to not use percents for thresholds? Accounting in pages even
not so difficult to user-space.

Also, looking on vmnotify_match I understand that events propagated to
user-space only in case threshold trigger change state from 0 to 1 but not
back, 1-> 0 is very useful event as well.

Would be possible to use for threshold pointed value(s) e.g. according to
enum zone_stat_item, because kinds of memory to track could be different?
E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE could be
interesting, not only free.

> +
> +	/*
> +	 * Sample period in nanoseconds
> +	 */
> +	__u64 sample_period_ns;
> +};
> +
....
> +struct vmnotify_event {
> +	/* Size of the struct for ABI extensibility. */
> +	__u32 size;
> +
> +	__u64 nr_avail_pages;
> +
> +	__u64 nr_swap_pages;
> +
> +	__u64 nr_free_pages;
> +};

Two fields here most likely session-constant, (nr_avail_pages and
nr_swap_pages), seems not much sense to report them in every event.
If we have memory/swap hotplug user-space can use sysinfo() call.

> +static void vmnotify_sample(struct vmnotify_watch *watch)
> +{
...
> +	si_meminfo(&si);
> +	event.nr_avail_pages = si.totalram;
> +
> +#ifdef CONFIG_SWAP
> +	si_swapinfo(&si);
> +	event.nr_swap_pages = si.totalswap;
> +#endif
> +

Why not to use global_page_state() directly? si_meminfo() and especially
si_swapinfo are quite expensive call.

> +static void vmnotify_start_timer(struct vmnotify_watch *watch)
> +{
> +	u64 sample_period = watch->config.sample_period_ns;
> +
> +	hrtimer_init(&watch->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	watch->timer.function = vmnotify_timer_fn;
> +
> +	hrtimer_start(&watch->timer, ns_to_ktime(sample_period), HRTIMER_MODE_REL_PINNED);
> +}

Do I understand correct you allocate timer for every user-space client and
propagate events every pointed interval?
What will happen with system if we have a timer but need to turn CPU off?
The timer must not be a reason to wakeup if user-space is sleeping.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx178.postini.com [74.125.245.178]) by kanga.kvack.org (Postfix) with SMTP id D087C6B004D for ; Wed, 18 Jan 2012 04:15:42 -0500 (EST)
Received: by obbta7 with SMTP id ta7so5040430obb.14 for ; Wed, 18 Jan 2012 01:15:41 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com>
Date: Wed, 18 Jan 2012 11:15:41 +0200
Message-ID:
Subject: Re: [RFC 1/3] /dev/low_mem_notify
From: Pekka Enberg
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: leonid.moiseichuk@nokia.com
Cc: riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com

On Wed, Jan 18, 2012 at 11:06 AM, wrote:
> Would be possible to not use percents for thresholds? Accounting in pages even
> not so difficult to user-space.

How does that work with memory hotplug?

On Wed, Jan 18, 2012 at 11:06 AM, wrote:
> Also, looking on vmnotify_match I understand that events propagated to
> user-space only in case threshold trigger change state from 0 to 1 but not
> back, 1-> 0 is very useful event as well.
>
> Would be possible to use for threshold pointed value(s) e.g. according to
> enum zone_stat_item, because kinds of memory to track could be different?
> E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE could be
> interesting, not only free.

I don't think there's anything in the ABI that would prevent that.

>> +struct vmnotify_event {
>> +	/* Size of the struct for ABI extensibility. */
>> +	__u32 size;
>> +
>> +	__u64 nr_avail_pages;
>> +
>> +	__u64 nr_swap_pages;
>> +
>> +	__u64 nr_free_pages;
>> +};
>
> Two fields here most likely session-constant, (nr_avail_pages and
> nr_swap_pages), seems not much sense to report them in every event. If we
> have memory/swap hotplug user-space can use sysinfo() call.

I actually changed the ABI to look like this:

struct vmnotify_event {
	/*
	 * Size of the struct for ABI extensibility.
	 */
	__u32 size;

	__u64 attrs;

	__u64 attr_values[];
};

So userspace can decide which fields to include in notifications.

On Wed, Jan 18, 2012 at 11:06 AM, wrote:
>> +static void vmnotify_sample(struct vmnotify_watch *watch)
>> +{
> ...
>> +	si_meminfo(&si);
>> +	event.nr_avail_pages = si.totalram;
>> +
>> +#ifdef CONFIG_SWAP
>> +	si_swapinfo(&si);
>> +	event.nr_swap_pages = si.totalswap;
>> +#endif
>> +
>
> Why not to use global_page_state() directly? si_meminfo() and especially
> si_swapinfo are quite expensive call.

Sure, we can do that. Feel free to send a patch :-).

>> +static void vmnotify_start_timer(struct vmnotify_watch *watch)
>> +{
>> +	u64 sample_period = watch->config.sample_period_ns;
>> +
>> +	hrtimer_init(&watch->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +	watch->timer.function = vmnotify_timer_fn;
>> +
>> +	hrtimer_start(&watch->timer, ns_to_ktime(sample_period), HRTIMER_MODE_REL_PINNED);
>> +}
>
> Do I understand correct you allocate timer for every user-space client and
> propagate events every pointed interval? What will happen with system if
> we have a timer but need to turn CPU off? The timer must not be a reason to
> wakeup if user-space is sleeping.

No idea what happens. The sampling code is just a proof of concept thing and I
expect it to be buggy as hell. :-)

                        Pekka

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx175.postini.com [74.125.245.175]) by kanga.kvack.org (Postfix) with SMTP id C260B6B004D for ; Wed, 18 Jan 2012 04:43:47 -0500 (EST)
From:
Subject: RE: [RFC 1/3] /dev/low_mem_notify
Date: Wed, 18 Jan 2012 09:41:41 +0000
Message-ID: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com>
In-Reply-To:
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Sender: owner-linux-mm@kvack.org
List-ID:
To: penberg@kernel.org
Cc: riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com

> -----Original Message-----
> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext
> Pekka Enberg
> Sent: 18 January, 2012 11:16
...
> > Would be possible to not use percents for thresholds? Accounting in pages even
> > not so difficult to user-space.
>
> How does that work with memory hotplug?

Not worse than %%. For example you had 10% free memory threshold for 512 MB
RAM meaning 51.2 MB in absolute number.
Then hotplug turned off 256 MB, you for sure must update threshold for %%
because these 10% for 25.6 MB most likely will be not suitable for different
operating mode.
Using pages makes calculations much simpler.
>
> On Wed, Jan 18, 2012 at 11:06 AM, wrote:
> > Also, looking on vmnotify_match I understand that events propagated to
> > user-space only in case threshold trigger change state from 0 to 1 but not
> > back, 1-> 0 is very useful event as well (*)
> >
> > Would be possible to use for threshold pointed value(s) e.g. according to
> > enum zone_stat_item, because kinds of memory to track could be different?
> > E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE could be
> > interesting, not only free.
>
> I don't think there's anything in the ABI that would prevent that.

If this statement also related my question (*) I have to point need to track
attributes history, otherwise user-space will be constantly kicked with
updates.

> I actually changed the ABI to look like this:
>
> struct vmnotify_event {
>	/*
>	 * Size of the struct for ABI extensibility.
>	 */
>	__u32 size;
>
>	__u64 attrs;
>
>	__u64 attr_values[];
> };
>
> So userspace can decide which fields to include in notifications.

Good. But how you can provide current status of attributes to user-space?
Need to have read() call support to deliver all supported attr_values[] on
demand.

> >> +
> >> +#ifdef CONFIG_SWAP
> >> +	si_swapinfo(&si);
> >> +	event.nr_swap_pages = si.totalswap;
> >> +#endif
> >> +
> >
> > Why not to use global_page_state() directly? si_meminfo() and especially
> > si_swapinfo are quite expensive call.
>
> Sure, we can do that. Feel free to send a patch :-).

When I see code because from emails it is quite difficult to understand.
For short-term I need to focus on integration "memnotify" version internally
which is kind of work for me already and provides all required interfaces n9
needs.

Btw, when API starts to work with pointed thresholds logically it is not
anymore low_mem_notify, you need to invent some other name.

> No idea what happens. The sampling code is just a proof of concept thing and
> I expect it to be buggy as hell. :-)
>
>                         Pekka

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx119.postini.com [74.125.245.119]) by kanga.kvack.org (Postfix) with SMTP id 4510B6B004D for ; Wed, 18 Jan 2012 05:40:11 -0500 (EST)
Received: by obbta7 with SMTP id ta7so5170361obb.14 for ; Wed, 18 Jan 2012 02:40:10 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com>
Date: Wed, 18 Jan 2012 12:40:10 +0200
Message-ID:
Subject: Re: [RFC 1/3] /dev/low_mem_notify
From: Pekka Enberg
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: leonid.moiseichuk@nokia.com
Cc: riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com

On Wed, Jan 18, 2012 at 11:41 AM, wrote:
>> -----Original Message-----
>> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext
>> Pekka Enberg
>> Sent: 18 January, 2012 11:16
> ...
>> > Would be possible to not use percents for thresholds? Accounting in pages even
>> > not so difficult to user-space.
>>
>> How does that work with memory hotplug?
>
> Not worse than %%. For example you had 10% free memory threshold for 512 MB
> RAM meaning 51.2 MB in absolute number. Then hotplug turned off 256 MB, you
> for sure must update threshold for %% because these 10% for 25.6 MB most
> likely will be not suitable for different operating mode.
> Using pages makes calculations much simpler.

Right. Does threshold in percentages make any sense then? Is it enough to
use number of free pages?

On Wed, Jan 18, 2012 at 11:06 AM, wrote:
>> > Also, looking on vmnotify_match I understand that events propagated to
>> > user-space only in case threshold trigger change state from 0 to 1 but not
>> > back, 1-> 0 is very useful event as well
> (*)
>
>> >
>> > Would be possible to use for threshold pointed value(s) e.g. according to
>> > enum zone_stat_item, because kinds of memory to track could be different?
>> > E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE could be
>> > interesting, not only free.
>>
>> I don't think there's anything in the ABI that would prevent that.
>
> If this statement also related my question (*) I have to point need to track
> attributes history, otherwise user-space will be constantly kicked with
> updates.

Well sure, I think it makes sense to support state change to both directions.

> When I see code because from emails it is quite difficult to understand. For
> short-term I need to focus on integration "memnotify" version internally
> which is kind of work for me already and provides all required interfaces n9
> needs.

Sure. I'm only talking about mainline here.

> Btw, when API starts to work with pointed thresholds logically it is not

Definitely, it's about generic VM event notification now.

                        Pekka
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id D89676B004D for ; Wed, 18 Jan 2012 05:44:53 -0500 (EST) From: Subject: RE: [RFC 1/3] /dev/low_mem_notify Date: Wed, 18 Jan 2012 10:44:13 +0000 Message-ID: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> In-Reply-To: Content-Language: en-US Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: penberg@kernel.org Cc: riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com > -----Original Message----- > From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext > Pekka Enberg > Sent: 18 January, 2012 12:40 ... > > Not worse than %%. For example you had 10% free memory threshold for > > 512 MB RAM meaning 51.2 MB in absolute number. Then hotplug turned > > off 256 MB, you for sure must update threshold for %% because these > > 10% for 25.6 MB most likely will be not suitable for different operatin= g > mode. > > Using pages makes calculations must simpler. >=20 > Right. Does threshold in percentages make any sense then? Is it enough to > use number of free pages? 
Paul Mundt noticed that and we stopped use percentage in 2006 for n770 update. He was right. Percents are useless and do not correlate with other kernel APIs like sysinfo(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx186.postini.com [74.125.245.186]) by kanga.kvack.org (Postfix) with SMTP id DC8EA6B004D for ; Wed, 18 Jan 2012 09:17:57 -0500 (EST) Message-ID: <4F16D46D.5080000@redhat.com> Date: Wed, 18 Jan 2012 09:17:17 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC 2/3] vmscan hook References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> <20120117230801.GA903@barrios-desktop.redhat.com> <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: KAMEZAWA Hiroyuki Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote: > On Wed, 18 Jan 2012 08:08:01 +0900 > Minchan Kim wrote: > >>>>> 2. can't we measure page-in/page-out distance by recording something ? >>>> >>>> I can't understand your point. What's relation does it with swapout prevent? >>>> >>> >>> If distance between pageout -> pagein is short, it means thrashing.
>>> For example, recoding the timestamp when the page(mapping, index) was >>> paged-out, and check it at page-in. >> >> Our goal is prevent swapout. When we found thrashing, it's too late. > > If you want to prevent swap-out, don't swapon any. That's all. > Then, you can check the number of FILE_CACHE and have threshold. I think you are getting hung up on a word here. As I understand it, the goal is to push out the point where we start doing heavier swap IO, allowing us to overcommit memory more heavily before things start really slowing down. -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx118.postini.com [74.125.245.118]) by kanga.kvack.org (Postfix) with SMTP id DB5E06B004D for ; Wed, 18 Jan 2012 09:31:32 -0500 (EST) Message-ID: <4F16D79C.2020402@redhat.com> Date: Wed, 18 Jan 2012 09:30:52 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> In-Reply-To: <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: leonid.moiseichuk@nokia.com Cc: penberg@kernel.org, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com 
On 01/18/2012 04:06 AM, leonid.moiseichuk@nokia.com wrote: > Would be possible to use for threshold pointed value(s) e.g. according to enum zone_state_item, because kinds of memory to track could be different? > E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE could be interesting, not only free. That seems like a horrible idea, because there is no guarantee that the kernel will continue to use NR_ACTIVE_ANON and NR_ACTIVE_FILE internally in the future. What is exported to userspace must be somewhat independent of the specifics of how the kernel implements things internally. -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx175.postini.com [74.125.245.175]) by kanga.kvack.org (Postfix) with SMTP id 6397F6B004D for ; Wed, 18 Jan 2012 10:29:34 -0500 (EST) Received: by obbta7 with SMTP id ta7so5601392obb.14 for ; Wed, 18 Jan 2012 07:29:33 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <4F16D79C.2020402@redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <4F16D79C.2020402@redhat.com> Date: Wed, 18 Jan 2012 17:29:33 +0200 Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: leonid.moiseichuk@nokia.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, 
akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com On Wed, Jan 18, 2012 at 4:30 PM, Rik van Riel wrote: > That seems like a horrible idea, because there is no guarantee that > the kernel will continue to use NR_ACTIVE_ANON and NR_ACTIVE_FILE > internally in the future. > > What is exported to userspace must be somewhat independent of the > specifics of how the kernel implements things internally. Exactly, that's what I'm also interested in. Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx123.postini.com [74.125.245.123]) by kanga.kvack.org (Postfix) with SMTP id 65D046B004F for ; Wed, 18 Jan 2012 18:35:11 -0500 (EST) Message-ID: <4F175706.8000808@redhat.com> Date: Thu, 19 Jan 2012 01:34:30 +0200 From: Ronen Hod MIME-Version: 1.0 Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> In-Reply-To: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: leonid.moiseichuk@nokia.com Cc: penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, 
akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On 01/18/2012 12:44 PM, leonid.moiseichuk@nokia.com wrote: >> -----Original Message----- >> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext >> Pekka Enberg >> Sent: 18 January, 2012 12:40 > ... >>> Not worse than %%. For example you had 10% free memory threshold for >>> 512 MB RAM meaning 51.2 MB in absolute number. Then hotplug turned >>> off 256 MB, you for sure must update threshold for %% because these >>> 10% for 25.6 MB most likely will be not suitable for different operating >> mode. >>> Using pages makes calculations must simpler. >> Right. Does threshold in percentages make any sense then? Is it enough to >> use number of free pages? > Paul Mundt noticed that and we stopped use percentage in 2006 for n770 update. > He was right. > Percents are useless and do not correlate with other kernel APIs like sysinfo(). I believe that it will be best if the kernel publishes an ideal number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to work with since this is what applications do, they free pages. Applications will be able to refer to this number from their garbage collector, or before allocating memory also if they did not get a notification, and it is also useful if several applications free memory at the same time. Ronen. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx136.postini.com [74.125.245.136]) by kanga.kvack.org (Postfix) with SMTP id 612616B004F for ; Wed, 18 Jan 2012 21:26:47 -0500 (EST) Received: from m3.gw.fujitsu.co.jp (unknown [10.0.50.73]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id DD52E3EE081 for ; Thu, 19 Jan 2012 11:26:45 +0900 (JST) Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id C1E3345DEAD for ; Thu, 19 Jan 2012 11:26:45 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id A52D345DEA6 for ; Thu, 19 Jan 2012 11:26:45 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 973511DB803F for ; Thu, 19 Jan 2012 11:26:45 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.240.81.146]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 4B3601DB8040 for ; Thu, 19 Jan 2012 11:26:45 +0900 (JST) Date: Thu, 19 Jan 2012 11:25:28 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [RFC 2/3] vmscan hook Message-Id: <20120119112528.eda78467.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <4F16D46D.5080000@redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> <20120117230801.GA903@barrios-desktop.redhat.com> <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> <4F16D46D.5080000@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Minchan Kim , linux-mm , LKML , 
leonid.moiseichuk@nokia.com, penberg@kernel.org, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod On Wed, 18 Jan 2012 09:17:17 -0500 Rik van Riel wrote: > On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote: > > On Wed, 18 Jan 2012 08:08:01 +0900 > > Minchan Kim wrote: > > > >>>>> 2. can't we measure page-in/page-out distance by recording something ? > >>>> > >>>> I can't understand your point. What's relation does it with swapout prevent? > >>>> > >>> > >>> If distance between pageout -> pagein is short, it means thrashing. > >>> For example, recoding the timestamp when the page(mapping, index) was > >>> paged-out, and check it at page-in. > >> > >> Our goal is prevent swapout. When we found thrashing, it's too late. > > > > If you want to prevent swap-out, don't swapon any. That's all. > > Then, you can check the number of FILE_CACHE and have threshold. > > I think you are getting hung up on a word here. > > As I understand it, the goal is to push out the point where > we start doing heavier swap IO, allowing us to overcommit > memory more heavily before things start really slowing down. > Yes. Hmm, considering that the issue is slow down, time values as - 'cpu time used for memory reclaim' - 'latency of page allocation' - 'application execution speed' ? may be a better score to see rather than just seeing lru's stat. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx203.postini.com [74.125.245.203]) by kanga.kvack.org (Postfix) with SMTP id E03906B004F for ; Thu, 19 Jan 2012 02:25:18 -0500 (EST) Received: by lagw12 with SMTP id w12so2159294lag.14 for ; Wed, 18 Jan 2012 23:25:16 -0800 (PST) Date: Thu, 19 Jan 2012 09:25:03 +0200 (EET) From: Pekka Enberg Subject: Re: [RFC 1/3] /dev/low_mem_notify In-Reply-To: <4F175706.8000808@redhat.com> Message-ID: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Sender: owner-linux-mm@kvack.org List-ID: To: Ronen Hod Cc: leonid.moiseichuk@nokia.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On Thu, 19 Jan 2012, Ronen Hod wrote: > I believe that it will be best if the kernel publishes an ideal > number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to > work with since this is what applications do, they free pages. Applications > will be able to refer to this number from their garbage collector, or before > allocating memory also if they did not get a notification, and it is also > useful if several applications free memory at the same time. Isn't /proc/sys/vm/min_free_kbytes pretty much just that? 
Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx159.postini.com [74.125.245.159]) by kanga.kvack.org (Postfix) with SMTP id 058CD6B004F for ; Thu, 19 Jan 2012 02:34:40 -0500 (EST) Received: by lagw12 with SMTP id w12so2163631lag.14 for ; Wed, 18 Jan 2012 23:34:39 -0800 (PST) Date: Thu, 19 Jan 2012 09:34:34 +0200 (EET) From: Pekka Enberg Subject: RE: [RFC 1/3] /dev/low_mem_notify In-Reply-To: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> Message-ID: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Sender: owner-linux-mm@kvack.org List-ID: To: leonid.moiseichuk@nokia.com Cc: riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com On Wed, 18 Jan 2012, leonid.moiseichuk@nokia.com wrote: > Paul Mundt noticed that and we stopped use percentage in 2006 for n770 update. > He was right. > Percents are useless and do not correlate with other kernel APIs like sysinfo(). I changed the code to use number of pages. Thanks! 
Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx103.postini.com [74.125.245.103]) by kanga.kvack.org (Postfix) with SMTP id CD53C6B004F for ; Thu, 19 Jan 2012 04:06:32 -0500 (EST) Message-ID: <4F17DCED.4020908@redhat.com> Date: Thu, 19 Jan 2012 11:05:49 +0200 From: Ronen Hod MIME-Version: 1.0 Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: leonid.moiseichuk@nokia.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On 01/19/2012 09:25 AM, Pekka Enberg wrote: > On Thu, 19 Jan 2012, Ronen Hod wrote: >> I believe that it will be best if the kernel publishes an ideal number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to work with since this is what applications do, they free pages. 
Applications will be able to refer to this number from their garbage collector, or before allocating memory also if they did not get a notification, and it is also useful if several applications free memory at the same time. > > Isn't > > /proc/sys/vm/min_free_kbytes > > pretty much just that? > > Pekka Would you suggest to use min_free_kbytes as the threshold for sending low_memory_notifications to applications, and separately as a target value for the applications' memory giveaway? Thanks, Ronen. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx156.postini.com [74.125.245.156]) by kanga.kvack.org (Postfix) with SMTP id ADAEF6B004F for ; Thu, 19 Jan 2012 04:10:21 -0500 (EST) Received: by obbta7 with SMTP id ta7so6906122obb.14 for ; Thu, 19 Jan 2012 01:10:20 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <4F17DCED.4020908@redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> Date: Thu, 19 Jan 2012 11:10:20 +0200 Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Ronen Hod Cc: leonid.moiseichuk@nokia.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, 
kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On Thu, Jan 19, 2012 at 11:05 AM, Ronen Hod wrote: >>> I believe that it will be best if the kernel publishes an ideal >>> number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to >>> work with since this is what applications do, they free pages. Applications >>> will be able to refer to this number from their garbage collector, or before >>> allocating memory also if they did not get a notification, and it is also >>> useful if several applications free memory at the same time. >> >> Isn't >> >> /proc/sys/vm/min_free_kbytes >> >> pretty much just that? > > Would you suggest to use min_free_kbytes as the threshold for sending > low_memory_notifications to applications, and separately as a target value > for the applications' memory giveaway? I'm not saying that the kernel should use it directly but it seems like the kind of "ideal number of free pages" threshold you're suggesting. So userspace can read that value and use it as the "number of free pages" threshold for VM events, no? Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx140.postini.com [74.125.245.140]) by kanga.kvack.org (Postfix) with SMTP id EA7096B004F for ; Thu, 19 Jan 2012 04:21:08 -0500 (EST) Message-ID: <4F17E058.8020008@redhat.com> Date: Thu, 19 Jan 2012 11:20:24 +0200 From: Ronen Hod MIME-Version: 1.0 Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: leonid.moiseichuk@nokia.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On 01/19/2012 11:10 AM, Pekka Enberg wrote: > On Thu, Jan 19, 2012 at 11:05 AM, Ronen Hod wrote: >>>> I believe that it will be best if the kernel publishes an ideal >>>> number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to >>>> work with since this is what applications do, they free pages. Applications >>>> will be able to refer to this number from their garbage collector, or before >>>> allocating memory also if they did not get a notification, and it is also >>>> useful if several applications free memory at the same time. 
>>> Isn't >>> >>> /proc/sys/vm/min_free_kbytes >>> >>> pretty much just that? >> Would you suggest to use min_free_kbytes as the threshold for sending >> low_memory_notifications to applications, and separately as a target value >> for the applications' memory giveaway? > I'm not saying that the kernel should use it directly but it seems > like the kind of "ideal number of free pages" threshold you're > suggesting. So userspace can read that value and use it as the "number > of free pages" threshold for VM events, no? Yes, I like it. The rules of the game are simple and consistent all over, be it the alert threshold, voluntary polling by the apps, and for concurrent work by several applications. Well, as long as it provides a good indication for low_mem_pressure. Thanks, Ronen. > > Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx171.postini.com [74.125.245.171]) by kanga.kvack.org (Postfix) with SMTP id 1F75B6B004F for ; Thu, 19 Jan 2012 05:54:12 -0500 (EST) From: Subject: RE: [RFC 1/3] /dev/low_mem_notify Date: Thu, 19 Jan 2012 10:53:29 +0000 Message-ID: <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> In-Reply-To: <4F17E058.8020008@redhat.com> Content-Language: en-US
Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: rhod@redhat.com, penberg@kernel.org Cc: riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com > -----Original Message----- > From: ext Ronen Hod [mailto:rhod@redhat.com] > Sent: 19 January, 2012 11:20 > To: Pekka Enberg ... > >>> Isn't > >>> > >>> /proc/sys/vm/min_free_kbytes > >>> > >>> pretty much just that? > >> Would you suggest to use min_free_kbytes as the threshold for sending > >> low_memory_notifications to applications, and separately as a target > >> value for the applications' memory giveaway? > > I'm not saying that the kernel should use it directly but it seems > > like the kind of "ideal number of free pages" threshold you're > > suggesting. So userspace can read that value and use it as the "number > > of free pages" threshold for VM events, no? > > Yes, I like it. The rules of the game are simple and consistent all over, be it the > alert threshold, voluntary polling by the apps, and for concurrent work by > several applications. > Well, as long as it provides a good indication for low_mem_pressure. For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. From potential user point of view the proposed API has number of lacks which would be nice to have implemented: 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc.
The word "low" is misleading here 2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big. 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. 6. I do not understand how work with attributes performed ( ) but it has sense to use mask and fill requested attributes using mask and callback table i.e. if free pages requested - they are reported, otherwise not. 7. would have sense to backport couple of attributes from memnotify.c I can submit couple of patches if some of proposals looks sane for everyone. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx159.postini.com [74.125.245.159]) by kanga.kvack.org (Postfix) with SMTP id BBE426B004F for ; Thu, 19 Jan 2012 06:07:53 -0500 (EST) Received: by obbta7 with SMTP id ta7so7087459obb.14 for ; Thu, 19 Jan 2012 03:07:52 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> Date: Thu, 19 Jan 2012 13:07:52 +0200 Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: leonid.moiseichuk@nokia.com Cc: rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On Thu, Jan 19, 2012 at 12:53 PM, wrote: > From potential user point of view the proposed API has number of lacks which > would be nice to have implemented: > 1.
rename this API from low_mem_pressure to something more related to > notification and memory situation in system: memory_pressure, memnotify, > memory_level etc. The word "low" is misleading here The thing is called vmevent: http://git.kernel.org/?p=linux/kernel/git/penberg/linux.git;a=shortlog;h=refs/heads/vmevent/core I haven't used "low mem" at all in the patches. On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 2. API must use deferred timers to prevent use-time impact. Deferred timer > will be triggered only in case HW event or non-deferrable timer, so if device > sleeps timer might be skipped and that is what expected for user-space I'm currently looking at the possibility of hooking VM events to perf which also uses hrtimers. Can't we make hrtimers do the right thing? On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 3. API should be tunable for propagate changes when level is Up or Down, > maybe both ways. Agreed. On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 4. to avoid triggering too much events probably has sense to filter according > to amount of change but that is optional. If subscriber set timer to 1s the > amount of events should not be very big. Agreed. On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 5. API must provide interface to request parameters e.g. available swap or > free memory just to have some base. The current ABI already supports that.
You can specify which attributes you're interested in and they will be delivered as part of the event. On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 6. I do not understand how work with attributes performed ( ) but it has > sense to use mask and fill requested attributes using mask and callback table > i.e. if free pages requested - they are reported, otherwise not. That's how it works now in the git tree. On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 7. would have sense to backport couple of attributes from memnotify.c > > I can submit couple of patches if some of proposals looks sane for everyone. Feel free to do that. I'm currently looking at how to support Minchan's non-sampled events. It seems to me integrating with perf would be nice because we could simply use tracepoints for this. Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx123.postini.com [74.125.245.123]) by kanga.kvack.org (Postfix) with SMTP id 8C1F46B004F for ; Thu, 19 Jan 2012 06:55:46 -0500 (EST) From: Subject: RE: [RFC 1/3] /dev/low_mem_notify Date: Thu, 19 Jan 2012 11:54:58 +0000 Message-ID: <84FF21A720B0874AA94B46D76DB9826904559D9B@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> 
<84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> In-Reply-To: Content-Language: en-US Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: penberg@kernel.org Cc: rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com > -----Original Message----- > From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext > Pekka Enberg > Sent: 19 January, 2012 13:08 ... > > 1. rename this API from low_mem_pressure to something more related to > > notification and memory situation in system: memory_pressure, > > memnotify, memory_level etc. The word "low" is misleading here > > The thing is called vmevent: Yes, I see it. But I was a bit confused by vmnotify_fops and was sure it was mapped through dev. Now it is an anonymous inode. > > On Thu, Jan 19, 2012 at 12:53 PM, wrote: > > 2. API must use deferred timers to prevent use-time impact. Deferred > > timer will be triggered only in case HW event or non-deferrable timer, > > so if device sleeps timer might be skipped and that is what expected > > for user-space > > I'm currently looking at the possibility of hooking VM events to perf which > also uses hrtimers. Can't we make hrtimers do the right thing? I have no answer to this question. According to hrtimer_cpu_notify the CPU state is tracked, but a timer may set a HW event to wake up. In that case use-time will be affected because you will have too many HW events and reasons to wake up. At least powertop reports hrtimers as activity sources. > > On Thu, Jan 19, 2012 at 12:53 PM, wrote: > > 3. 
API should be tunable for propagate changes when level is Up or > > Down, maybe both ways. > > Agreed. > > On Thu, Jan 19, 2012 at 12:53 PM, wrote: > > 4. to avoid triggering too much events probably has sense to filter > > according to amount of change but that is optional. If subscriber set > > timer to 1s the amount of events should not be very big. > > Agreed. > > On Thu, Jan 19, 2012 at 12:53 PM, wrote: > > 5. API must provide interface to request parameters e.g. available > > swap or free memory just to have some base. > > The current ABI already supports that. You can specify which attributes > you're interested in and they will be delivered as part of th event. But vmnotify.h has a suspicious free_pages_threshold field. > > On Thu, Jan 19, 2012 at 12:53 PM, wrote: > > 6. I do not understand how work with attributes performed ( ) but it > > has sense to use mask and fill requested attributes using mask and > > callback table i.e. if free pages requested - they are reported, otherwise > not. > > That's how it works now in the git tree. vmnotify.c has vmnotify_watch_event, which collects a fixed set of parameters. > I'm currently looking at how to support Minchan's non-sampled events. It > seems to me integrating with perf would be nice because we could simply > use tracepoints for this. If tracepoints do not jeopardize use-time, it makes sense to do it. > > Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx114.postini.com [74.125.245.114]) by kanga.kvack.org (Postfix) with SMTP id C5DB06B004F for ; Thu, 19 Jan 2012 06:59:59 -0500 (EST) Received: by obbta7 with SMTP id ta7so7168067obb.14 for ; Thu, 19 Jan 2012 03:59:59 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <84FF21A720B0874AA94B46D76DB9826904559D9B@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB9826904559D9B@008-AM1MPN1-003.mgdnok.nokia.com> Date: Thu, 19 Jan 2012 13:59:58 +0200 Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: leonid.moiseichuk@nokia.com Cc: rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On Thu, Jan 19, 2012 at 1:54 PM, wrote: >> The current ABI already supports that. You can specify which attributes >> you're interested in and they will be delivered as part of th event. > > But you have in vmnotify.h suspicious free_pages_threshold field. 
Aah, I was actually talking about the events userspace _reads_. The free_pages_threshold field is only used if the VMEVENT_TYPE_FREE_THRESHOLD bit is set. It should be cleaned up a bit but in theory it supports watching other attributes as well. I've postponed the cleanup until I've figured out whether we can use perf, which would make the whole syscall go away. Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx165.postini.com [74.125.245.165]) by kanga.kvack.org (Postfix) with SMTP id 7DEF36B005C for ; Thu, 19 Jan 2012 07:06:15 -0500 (EST) Received: by obbta7 with SMTP id ta7so7177694obb.14 for ; Thu, 19 Jan 2012 04:06:14 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <84FF21A720B0874AA94B46D76DB9826904559D9B@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB9826904559D9B@008-AM1MPN1-003.mgdnok.nokia.com> Date: Thu, 19 Jan 2012 14:06:14 +0200 Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: leonid.moiseichuk@nokia.com Cc: rhod@redhat.com, riel@redhat.com, 
minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On Thu, Jan 19, 2012 at 1:54 PM, wrote: >> On Thu, Jan 19, 2012 at 12:53 PM, wrote: >> > 6. I do not understand how work with attributes performed ( ) but it >> > has sense to use mask and fill requested attributes using mask and >> > callback table i.e. if free pages requested - they are reported, otherwise >> not. >> >> That's how it works now in the git tree. > > Vmnotify.c has vmnotify_watch_event which collects fixed set of parameters. That would be a bug. We should check event_attrs like we do for NR_SWAP_PAGES. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx134.postini.com [74.125.245.134]) by kanga.kvack.org (Postfix) with SMTP id 3DD196B004F for ; Thu, 19 Jan 2012 09:43:37 -0500 (EST) Message-ID: <4F182BF3.7050809@redhat.com> Date: Thu, 19 Jan 2012 09:42:59 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC 2/3] vmscan hook References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> <20120117230801.GA903@barrios-desktop.redhat.com> <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> <4F16D46D.5080000@redhat.com> <20120119112528.eda78467.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20120119112528.eda78467.kamezawa.hiroyu@jp.fujitsu.com> 
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: KAMEZAWA Hiroyuki Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod On 01/18/2012 09:25 PM, KAMEZAWA Hiroyuki wrote: > On Wed, 18 Jan 2012 09:17:17 -0500 > Rik van Riel wrote: > >> On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote: >>> On Wed, 18 Jan 2012 08:08:01 +0900 >>> Minchan Kim wrote: >>> >>>>>>> 2. can't we measure page-in/page-out distance by recording something ? >>>>>> >>>>>> I can't understand your point. What's relation does it with swapout prevent? >>>>>> >>>>> >>>>> If distance between pageout -> pagein is short, it means thrashing. >>>>> For example, recoding the timestamp when the page(mapping, index) was >>>>> paged-out, and check it at page-in. >>>> >>>> Our goal is prevent swapout. When we found thrashing, it's too late. >>> >>> If you want to prevent swap-out, don't swapon any. That's all. >>> Then, you can check the number of FILE_CACHE and have threshold. >> >> I think you are getting hung up on a word here. >> >> As I understand it, the goal is to push out the point where >> we start doing heavier swap IO, allowing us to overcommit >> memory more heavily before things start really slowing down. >> > > Yes. > > Hmm, considering that the issue is slow down, > > time values as > > - 'cpu time used for memory reclaim' > - 'latency of page allocation' > - 'application execution speed' ? > > may be a better score to see rather than just seeing lru's stat. I believe those all qualify as "too late". We want to prevent things from becoming bad, for as long as we (easily) can. -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id BE22E6B004F for ; Thu, 19 Jan 2012 19:26:10 -0500 (EST) Received: from m4.gw.fujitsu.co.jp (unknown [10.0.50.74]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id 54E363EE0C0 for ; Fri, 20 Jan 2012 09:26:09 +0900 (JST) Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 3796B45DE52 for ; Fri, 20 Jan 2012 09:26:09 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 14A9A45DE50 for ; Fri, 20 Jan 2012 09:26:09 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 01DFA1DB802F for ; Fri, 20 Jan 2012 09:26:09 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.240.81.146]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id AF7511DB803E for ; Fri, 20 Jan 2012 09:26:08 +0900 (JST) Date: Fri, 20 Jan 2012 09:24:49 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [RFC 2/3] vmscan hook Message-Id: <20120120092449.4ecbec86.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <4F182BF3.7050809@redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> <20120117230801.GA903@barrios-desktop.redhat.com> <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> <4F16D46D.5080000@redhat.com> <20120119112528.eda78467.kamezawa.hiroyu@jp.fujitsu.com> <4F182BF3.7050809@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: 
owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod On Thu, 19 Jan 2012 09:42:59 -0500 Rik van Riel wrote: > On 01/18/2012 09:25 PM, KAMEZAWA Hiroyuki wrote: > > On Wed, 18 Jan 2012 09:17:17 -0500 > > Rik van Riel wrote: > > > >> On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote: > >>> On Wed, 18 Jan 2012 08:08:01 +0900 > >>> Minchan Kim wrote: > >>> > >>>>>>> 2. can't we measure page-in/page-out distance by recording something ? > >>>>>> > >>>>>> I can't understand your point. What's relation does it with swapout prevent? > >>>>>> > >>>>> > >>>>> If distance between pageout -> pagein is short, it means thrashing. > >>>>> For example, recoding the timestamp when the page(mapping, index) was > >>>>> paged-out, and check it at page-in. > >>>> > >>>> Our goal is prevent swapout. When we found thrashing, it's too late. > >>> > >>> If you want to prevent swap-out, don't swapon any. That's all. > >>> Then, you can check the number of FILE_CACHE and have threshold. > >> > >> I think you are getting hung up on a word here. > >> > >> As I understand it, the goal is to push out the point where > >> we start doing heavier swap IO, allowing us to overcommit > >> memory more heavily before things start really slowing down. > >> > > > > Yes. > > > > Hmm, considering that the issue is slow down, > > > > time values as > > > > - 'cpu time used for memory reclaim' > > - 'latency of page allocation' > > - 'application execution speed' ? > > > > may be a better score to see rather than just seeing lru's stat. > > I believe those all qualify as "too late". > > We want to prevent things from becoming bad, for as long > as we (easily) can. > Hmm, then some threshold-notifier interface will be required. Problem is how to know free + page_can_be_freed_without_risk. 
Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx141.postini.com [74.125.245.141]) by kanga.kvack.org (Postfix) with SMTP id 0804C6B004F for ; Tue, 24 Jan 2012 10:42:01 -0500 (EST) Date: Tue, 24 Jan 2012 13:38:35 -0200 From: Marcelo Tosatti Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124153835.GA10990@amt.cnet> References: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> Sender: owner-linux-mm@kvack.org List-ID: To: leonid.moiseichuk@nokia.com Cc: rhod@redhat.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: > > -----Original Message----- > > From: ext Ronen Hod [mailto:rhod@redhat.com] > > Sent: 19 January, 2012 11:20 > > To: Pekka Enberg > ... > > >>> Isn't > > >>> > > >>> /proc/sys/vm/min_free_kbytes > > >>> > > >>> pretty much just that? 
> > >> Would you suggest to use min_free_kbytes as the threshold for sending > > >> low_memory_notifications to applications, and separately as a target > > >> value for the applications' memory giveaway? > > > I'm not saying that the kernel should use it directly but it seems > > > like the kind of "ideal number of free pages" threshold you're > > > suggesting. So userspace can read that value and use it as the "number > > > of free pages" threshold for VM events, no? > > > > Yes, I like it. The rules of the game are simple and consistent all over, be it the > > alert threshold, voluntary poling by the apps, and for concurrent work by > > several applications. > > Well, as long as it provides a good indication for low_mem_pressure. > > For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount > of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. > > >From potential user point of view the proposed API has number of lacks which would be nice to have implemented: > 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here > 2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space Having userspace specify the "sample period" for low memory notification makes no sense. The frequency of notifications is a function of the memory pressure. > 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. > 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. 
If subscriber set timer to 1s the amount of events should not be very big. > 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. It would make the interface easier to use if it provided the number of pages to free, in the notification (kernel can calculate that as the delta between current_free_pages -> comfortable_free_pages relative to process RSS). > 6. I do not understand how work with attributes performed ( ) but it has sense to use mask and fill requested attributes using mask and callback table i.e. if free pages requested - they are reported, otherwise not. > 7. would have sense to backport couple of attributes from memnotify.c > > I can submit couple of patches if some of proposals looks sane for everyone. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx193.postini.com [74.125.245.193]) by kanga.kvack.org (Postfix) with SMTP id 134126B004F for ; Tue, 24 Jan 2012 10:52:28 -0500 (EST) Date: Tue, 24 Jan 2012 13:40:01 -0200 From: Marcelo Tosatti Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124154001.GB10990@amt.cnet> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: Rik van Riel , Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Andrew Morton , Ronen Hod , KOSAKI Motohiro On Tue, Jan 17, 2012 at 08:51:13PM +0200, 
Pekka Enberg wrote: > Hello, > > Ok, so here's a proof of concept patch that implements sample-base > per-process free threshold VM event watching using perf-like syscall > ABI. I'd really like to see something like this that's much more > extensible and clean than the /dev based ABIs that people have > proposed so far. > > Pekka What is the practical advantage of a syscall, again? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx111.postini.com [74.125.245.111]) by kanga.kvack.org (Postfix) with SMTP id 6D84C6B004F for ; Tue, 24 Jan 2012 11:01:24 -0500 (EST) Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg In-Reply-To: <20120124154001.GB10990@amt.cnet> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <20120124154001.GB10990@amt.cnet> Content-Type: text/plain; charset="ISO-8859-1" Date: Tue, 24 Jan 2012 18:01:20 +0200 Message-ID: <1327420880.13624.24.camel@jaguar> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Marcelo Tosatti Cc: Rik van Riel , Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Andrew Morton , Ronen Hod , KOSAKI Motohiro On Tue, 2012-01-24 at 13:40 -0200, Marcelo Tosatti wrote: > What is the practical advantage of a syscall, again? Why do you ask? The advantage for this particular case is not needing to add ioctls() for configuration and keeping the file read/write ABI simple. Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx167.postini.com [74.125.245.167]) by kanga.kvack.org (Postfix) with SMTP id A77546B004F for ; Tue, 24 Jan 2012 11:09:08 -0500 (EST) Message-ID: <4F1ED77F.4090900@redhat.com> Date: Tue, 24 Jan 2012 18:08:31 +0200 From: Ronen Hod MIME-Version: 1.0 Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> In-Reply-To: <20120124153835.GA10990@amt.cnet> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Marcelo Tosatti Cc: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On 01/24/2012 05:38 PM, Marcelo Tosatti wrote: > On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: >>> -----Original Message----- >>> From: ext Ronen Hod [mailto:rhod@redhat.com] >>> Sent: 19 January, 2012 11:20 >>> To: Pekka Enberg >> ... >>>>>> Isn't >>>>>> >>>>>> /proc/sys/vm/min_free_kbytes >>>>>> >>>>>> pretty much just that? >>>>> Would you suggest to use min_free_kbytes as the threshold for sending >>>>> low_memory_notifications to applications, and separately as a target >>>>> value for the applications' memory giveaway? 
>>>> I'm not saying that the kernel should use it directly but it seems >>>> like the kind of "ideal number of free pages" threshold you're >>>> suggesting. So userspace can read that value and use it as the "number >>>> of free pages" threshold for VM events, no? >>> Yes, I like it. The rules of the game are simple and consistent all over, be it the >>> alert threshold, voluntary poling by the apps, and for concurrent work by >>> several applications. >>> Well, as long as it provides a good indication for low_mem_pressure. >> For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount >> of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. >> >> > From potential user point of view the proposed API has number of lacks which would be nice to have implemented: >> 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here >> 2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space > Having userspace specify the "sample period" for low memory notification > makes no sense. The frequency of notifications is a function of the > memory pressure. > >> 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. > >> 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big. >> 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. 
> It would make the interface easier to use if it provided the number of > pages to free, in the notification (kernel can calculate that as the > delta between current_free_pages -> comfortable_free_pages relative to > process RSS). If you rely on the notification's argument you lose several features: - Handling of notifications by several applications in parallel - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience. - Iterative release loops, until there are enough free pages. I believe that the notification should only serve as a trigger to run the cleanup. Ronen. > >> 6. I do not understand how work with attributes performed ( ) but it has sense to use mask and fill requested attributes using mask and callback table i.e. if free pages requested - they are reported, otherwise not. >> 7. would have sense to backport couple of attributes from memnotify.c >> >> I can submit couple of patches if some of proposals looks sane for everyone. > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx148.postini.com [74.125.245.148]) by kanga.kvack.org (Postfix) with SMTP id C7AF86B004F for ; Tue, 24 Jan 2012 11:10:42 -0500 (EST) Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg In-Reply-To: <20120124153835.GA10990@amt.cnet> References: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> Content-Type: text/plain; charset="ISO-8859-1" Date: Tue, 24 Jan 2012 18:10:40 +0200 Message-ID: <1327421440.13624.30.camel@jaguar> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Marcelo Tosatti Cc: leonid.moiseichuk@nokia.com, rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On Tue, 2012-01-24 at 13:38 -0200, Marcelo Tosatti wrote: > Having userspace specify the "sample period" for low memory notification > makes no sense. The frequency of notifications is a function of the > memory pressure. Sure, it makes sense to autotune sample period. I don't see the problem with letting userspace decide it for themselves if they want to. Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 6DA686B004F for ; Tue, 24 Jan 2012 11:23:19 -0500 (EST)
From: Arnd Bergmann
Subject: Re: [RFC 1/3] /dev/low_mem_notify
Date: Tue, 24 Jan 2012 16:22:36 +0000
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com>
In-Reply-To:
MIME-Version: 1.0
Content-Type: Text/Plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201201241622.36222.arnd@arndb.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pekka Enberg
Cc: leonid.moiseichuk@nokia.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com

On Wednesday 18 January 2012, Pekka Enberg wrote:
> >> +struct vmnotify_event {
> >> +	/* Size of the struct for ABI extensibility. */
> >> +	__u32 size;
> >> +
> >> +	__u64 nr_avail_pages;
> >> +
> >> +	__u64 nr_swap_pages;
> >> +
> >> +	__u64 nr_free_pages;
> >> +};
> >
> > Two fields here most likely session-constant, (nr_avail_pages and
> > nr_swap_pages), seems not much sense to report them in every event. If we
> > have memory/swap hotplug user-space can use sysinfo() call.
>
> I actually changed the ABI to look like this:
>
> struct vmnotify_event {
> 	/*
> 	 * Size of the struct for ABI extensibility.
> 	 */
> 	__u32 size;
>
> 	__u64 attrs;
>
> 	__u64 attr_values[];
> };
>
> So userspace can decide which fields to include in notifications.

Please make the first member a __u64 instead of a __u32.
This will avoid incompatibility between 32 and 64 bit processes, which have different alignment rules on x86: x86-32 would implicitly pack the struct while x86-64 would add padding with your layout.

Arnd

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx104.postini.com [74.125.245.104]) by kanga.kvack.org (Postfix) with SMTP id D30576B004F for ; Tue, 24 Jan 2012 11:25:59 -0500 (EST)
From: Arnd Bergmann
Subject: Re: [RFC 1/3] /dev/low_mem_notify
Date: Tue, 24 Jan 2012 16:25:55 +0000
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <20120124154001.GB10990@amt.cnet> <1327420880.13624.24.camel@jaguar>
In-Reply-To: <1327420880.13624.24.camel@jaguar>
MIME-Version: 1.0
Content-Type: Text/Plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201201241625.55295.arnd@arndb.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pekka Enberg
Cc: Marcelo Tosatti, Rik van Riel, Minchan Kim, linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tuesday 24 January 2012, Pekka Enberg wrote:
> On Tue, 2012-01-24 at 13:40 -0200, Marcelo Tosatti wrote:
> > What is the practical advantage of a syscall, again?
>
> Why do you ask? The advantage for this particular case is not needing to
> add ioctls() for configuration and keeping the file read/write ABI
> simple.

The two are obviously equivalent and there is no reason to avoid ioctl in general.
However I agree that the syscall would be better in this case, because that is what we tend to use for core kernel functionality, while character devices tend to be used for I/O device drivers that need stuff like enumeration and permission management. Arnd -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id 0A4D26B004F for ; Tue, 24 Jan 2012 13:30:51 -0500 (EST) Date: Tue, 24 Jan 2012 16:29:09 -0200 From: Marcelo Tosatti Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124182909.GB19186@amt.cnet> References: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <1327421440.13624.30.camel@jaguar> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1327421440.13624.30.camel@jaguar> Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: leonid.moiseichuk@nokia.com, rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On Tue, Jan 24, 2012 at 06:10:40PM +0200, Pekka Enberg wrote: > On Tue, 2012-01-24 at 13:38 -0200, Marcelo Tosatti wrote: > > Having userspace specify the "sample period" for low memory notification > > makes no sense. 
The frequency of notifications is a function of the > > memory pressure. > > Sure, it makes sense to autotune sample period. I don't see the problem > with letting userspace decide it for themselves if they want to. > > Pekka Application polls on a file descriptor waiting for asynchronous events, particular conditions of memory reclaim upon which an action is necessary. These signalled conditions are not simply percentages of free memory, but depend on the amount of freeable cache available, etc. Otherwise applications could monitor /proc/mem_info and act on that. What is the point of sampling in the interface as you have it? Application can read() from the file descriptor to retrieve the current status, if it wishes. The objective in this argument is to make the API as simple and easy to use as possible. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx190.postini.com [74.125.245.190]) by kanga.kvack.org (Postfix) with SMTP id CB57D6B005A for ; Tue, 24 Jan 2012 13:31:12 -0500 (EST) Date: Tue, 24 Jan 2012 16:10:34 -0200 From: Marcelo Tosatti Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124181034.GA19186@amt.cnet> References: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <4F1ED77F.4090900@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F1ED77F.4090900@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ronen Hod Cc: leonid.moiseichuk@nokia.com, 
penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote: > On 01/24/2012 05:38 PM, Marcelo Tosatti wrote: > >On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: > >>>-----Original Message----- > >>>From: ext Ronen Hod [mailto:rhod@redhat.com] > >>>Sent: 19 January, 2012 11:20 > >>>To: Pekka Enberg > >>... > >>>>>>Isn't > >>>>>> > >>>>>>/proc/sys/vm/min_free_kbytes > >>>>>> > >>>>>>pretty much just that? > >>>>>Would you suggest to use min_free_kbytes as the threshold for sending > >>>>>low_memory_notifications to applications, and separately as a target > >>>>>value for the applications' memory giveaway? > >>>>I'm not saying that the kernel should use it directly but it seems > >>>>like the kind of "ideal number of free pages" threshold you're > >>>>suggesting. So userspace can read that value and use it as the "number > >>>>of free pages" threshold for VM events, no? > >>>Yes, I like it. The rules of the game are simple and consistent all over, be it the > >>>alert threshold, voluntary poling by the apps, and for concurrent work by > >>>several applications. > >>>Well, as long as it provides a good indication for low_mem_pressure. > >>For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount > >>of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. > >> > >>> From potential user point of view the proposed API has number of lacks which would be nice to have implemented: > >>1. 
rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here > >>2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space > >Having userspace specify the "sample period" for low memory notification > >makes no sense. The frequency of notifications is a function of the > >memory pressure. > > > >>3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. > > > >>4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big. > >>5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. > >It would make the interface easier to use if it provided the number of > >pages to free, in the notification (kernel can calculate that as the > >delta between current_free_pages -> comfortable_free_pages relative to > >process RSS). > > If you rely on the notification's argument you lose several features: > - Handling of notifications by several applications in parallel Each application has its argument built in a custom fashion (pages_to_free = delta between current_free_pages -> comfortable_free_pages relative to process RSS), or something to that effect. It is compatible with parallel notifications. > - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience. I am suggesting an additional field in the notification data so that the freeing routine has a goal. But it is not mandatory. > - Iterative release loops, until there are enough free pages. 
What is the advantage versus releasing the necessary amount of memory in a given moment? > I believe that the notification should only serve as a trigger to run the cleanup. Agree. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx162.postini.com [74.125.245.162]) by kanga.kvack.org (Postfix) with SMTP id 435C66B004F for ; Tue, 24 Jan 2012 13:34:49 -0500 (EST) Date: Tue, 24 Jan 2012 16:32:47 -0200 From: Marcelo Tosatti Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124183247.GA19853@amt.cnet> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <20120124154001.GB10990@amt.cnet> <1327420880.13624.24.camel@jaguar> <201201241625.55295.arnd@arndb.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201201241625.55295.arnd@arndb.de> Sender: owner-linux-mm@kvack.org List-ID: To: Arnd Bergmann Cc: Pekka Enberg , Rik van Riel , Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Andrew Morton , Ronen Hod , KOSAKI Motohiro On Tue, Jan 24, 2012 at 04:25:55PM +0000, Arnd Bergmann wrote: > On Tuesday 24 January 2012, Pekka Enberg wrote: > > On Tue, 2012-01-24 at 13:40 -0200, Marcelo Tosatti wrote: > > > What is the practical advantage of a syscall, again? > > > > Why do you ask? The advantage for this particular case is not needing to > > add ioctls() for configuration and keeping the file read/write ABI > > simple. > > The two are obviously equivalent and there is no reason to avoid > ioctl in general. 
> However I agree that the syscall would be better
> in this case, because that is what we tend to use for core kernel
> functionality, while character devices tend to be used for I/O device
> drivers that need stuff like enumeration and permission management.
>
> Arnd

Makes sense.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx201.postini.com [74.125.245.201]) by kanga.kvack.org (Postfix) with SMTP id 95D046B004D for ; Tue, 24 Jan 2012 16:57:16 -0500 (EST)
Date: Tue, 24 Jan 2012 14:57:13 -0700
From: Jonathan Corbet
Subject: Re: [RFC 1/3] /dev/low_mem_notify
Message-ID: <20120124145713.20fad866@dt>
In-Reply-To:
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pekka Enberg
Cc: Rik van Riel, Minchan Kim, linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tue, 17 Jan 2012 20:51:13 +0200 (EET) Pekka Enberg wrote:
> Ok, so here's a proof of concept patch that implements sample-base
> per-process free threshold VM event watching using perf-like syscall ABI.
> I'd really like to see something like this that's much more extensible and
> clean than the /dev based ABIs that people have proposed so far.

OK, so I'm slow, but better late than never. I plead travel.
I guess the thing that surprises me is that nobody has said this yet: this looks a lot like an event-reporting mechanism like perf. Is there a reason these can't be perf-style events integrated with all the rest?

> +struct vmnotify_config {
> +	/*
> +	 * Size of the struct for ABI extensibility.
> +	 */
> +	__u32 size;
> +
> +	/*
> +	 * Notification type bitmask
> +	 */
> +	__u64 type;
> +
> +	/*
> +	 * Free memory threshold in percentages [1..99]
> +	 */
> +	__u32 free_threshold;

Is this an upper-bound threshold or a lower-bound threshold? From your example, it looks like "free_threshold" is "the amount of memory that is not free", which seems confusing.

[...]

> new file mode 100644
> index 0000000..6800450
> --- /dev/null
> +++ b/mm/vmnotify.c
> @@ -0,0 +1,235 @@
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#define VMNOTIFY_MAX_FREE_THRESHOD 100

Did we run out of L's here? :)

> +static ssize_t vmnotify_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct vmnotify_watch *watch = file->private_data;
> +	int ret = 0;
> +
> +	mutex_lock(&watch->mutex);
> +
> +	if (!watch->pending)
> +		goto out_unlock;
> +
> +	if (copy_to_user(buf, &watch->event, sizeof(struct vmnotify_event))) {
> +		ret = -EFAULT;
> +		goto out_unlock;
> +	}
> +
> +	ret = watch->event.size;
> +
> +	watch->pending = false;
> +
> +out_unlock:
> +	mutex_unlock(&watch->mutex);
> +
> +	return ret;
> +}

So this is a nonblocking-only interface? That may surprise some developers. You already have a wait queue, why not wait on it if need be?
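For illustration only (this is not code from the posted patch): with a read() that returns 0 when no event is pending, a caller has to wait with poll() before reading. The sketch below shows that pattern from the userspace side. The helper name wait_and_read_event and its timeout parameter are made up for the example, and it assumes the descriptor is pollable, which the existing wait queue suggests but the quoted hunk does not show. Any pollable descriptor behaves the same way, so a pipe can stand in for the notification fd when trying it out.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then read one record into buf.
 * Returns the number of bytes read, 0 on timeout, or -1 with errno set.
 */
ssize_t wait_and_read_event(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ready;

	/* Restart the wait if a signal interrupts it. */
	do {
		ready = poll(&pfd, 1, timeout_ms);
	} while (ready < 0 && errno == EINTR);

	if (ready <= 0)
		return ready;	/* 0: timed out, -1: poll() failed */

	return read(fd, buf, len);
}
```

A blocking read() in the kernel would make the poll() step optional for simple callers, which seems to be the point of the question above.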
> +static int vmnotify_copy_config(struct vmnotify_config __user *uconfig,
> +				struct vmnotify_config *config)
> +{
> +	int ret;
> +
> +	ret = copy_from_user(config, uconfig, sizeof(struct vmnotify_config));
> +	if (ret)
> +		return -EFAULT;
> +
> +	if (!config->type)
> +		return -EINVAL;
> +
> +	if (config->type & VMNOTIFY_TYPE_SAMPLE) {
> +		if (config->sample_period_ns < NSEC_PER_MSEC)
> +			return -EINVAL;
> +	}

What happens if the sample period is zero?

jon

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx167.postini.com [74.125.245.167]) by kanga.kvack.org (Postfix) with SMTP id C62C06B004D for ; Wed, 25 Jan 2012 03:20:05 -0500 (EST)
From:
Subject: RE: [RFC 1/3] /dev/low_mem_notify
Date: Wed, 25 Jan 2012 08:19:11 +0000
Message-ID: <84FF21A720B0874AA94B46D76DB9826904562B60@008-AM1MPN1-003.mgdnok.nokia.com>
References: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <1327421440.13624.30.camel@jaguar>
In-Reply-To: <1327421440.13624.30.camel@jaguar>
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Sender: owner-linux-mm@kvack.org
List-ID:
To: penberg@kernel.org, mtosatti@redhat.com
Cc: rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com,
hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com

> -----Original Message-----
> From: ext Pekka Enberg [mailto:penberg@kernel.org]
> Sent: 24 January, 2012 18:11
> To: Marcelo Tosatti
....
> On Tue, 2012-01-24 at 13:38 -0200, Marcelo Tosatti wrote:
> > Having userspace specify the "sample period" for low memory
> > notification makes no sense. The frequency of notifications is a
> > function of the memory pressure.
>
> Sure, it makes sense to autotune sample period. I don't see the problem
> with letting userspace decide it for themselves if they want to.
>
> Pekka

Good point, but you must take into account that reaction time in user-space depends how SW stack is organized.
So for some components 1s is good enough update time, for another cases 10 ms.
If changes on VM happened too often they had no sense for user-space.
Thus from practical point of view having sampling period is not a bad idea.
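A side note on Arnd's alignment remark earlier in the thread: the layout difference can be reproduced on a single 64-bit machine. The sketch below is an illustration, not code from the patch. The struct names are invented, the flexible attr_values[] member is left out for brevity, and the aligned(4) typedef (a GCC/Clang extension, similar in spirit to the kernel's compat_u64) is an assumption used to mimic the 4-byte alignment that i386 gives a 64-bit integer inside a struct.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit integer with i386-style 4-byte alignment (GCC/Clang extension). */
typedef uint64_t u64_align4 __attribute__((aligned(4)));

/* Layout as posted, as a 64-bit process sees it: padding follows size. */
struct event_u32_native {
	uint32_t size;
	uint64_t attrs;
};

/* Same members as a 32-bit x86 process sees them: no padding. */
struct event_u32_i386 {
	uint32_t size;
	u64_align4 attrs;
};

/* Suggested fix: a 64-bit first member yields one layout for both. */
struct event_u64_native {
	uint64_t size;
	uint64_t attrs;
};

struct event_u64_i386 {
	u64_align4 size;
	u64_align4 attrs;
};
```

Comparing offsetof(..., attrs) and sizeof() for the first pair should show the mismatch on an x86-64 build, while the second pair should agree, which is why widening the first member avoids the need for a compat layer.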
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx201.postini.com [74.125.245.201]) by kanga.kvack.org (Postfix) with SMTP id 2EA156B004D for ; Wed, 25 Jan 2012 03:53:03 -0500 (EST)
Message-ID: <4F1FC2C8.10103@redhat.com>
Date: Wed, 25 Jan 2012 10:52:24 +0200
From: Ronen Hod
MIME-Version: 1.0
Subject: Re: [RFC 1/3] /dev/low_mem_notify
References: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <4F1ED77F.4090900@redhat.com> <20120124181034.GA19186@amt.cnet>
In-Reply-To: <20120124181034.GA19186@amt.cnet>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Marcelo Tosatti
Cc: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com

On 01/24/2012 08:10 PM, Marcelo Tosatti wrote:
> On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote:
>> On 01/24/2012 05:38 PM, Marcelo Tosatti wrote:
>>> On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote:
>>>>> -----Original Message-----
>>>>> From: ext Ronen Hod [mailto:rhod@redhat.com]
>>>>> Sent: 19 January, 2012 11:20
>>>>> To: Pekka Enberg
>>>> ...
>>>>>>>> Isn't
>>>>>>>>
>>>>>>>> /proc/sys/vm/min_free_kbytes
>>>>>>>>
>>>>>>>> pretty much just that?
>>>>>>> Would you suggest to use min_free_kbytes as the threshold for sending >>>>>>> low_memory_notifications to applications, and separately as a target >>>>>>> value for the applications' memory giveaway? >>>>>> I'm not saying that the kernel should use it directly but it seems >>>>>> like the kind of "ideal number of free pages" threshold you're >>>>>> suggesting. So userspace can read that value and use it as the "number >>>>>> of free pages" threshold for VM events, no? >>>>> Yes, I like it. The rules of the game are simple and consistent all over, be it the >>>>> alert threshold, voluntary poling by the apps, and for concurrent work by >>>>> several applications. >>>>> Well, as long as it provides a good indication for low_mem_pressure. >>>> For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount >>>> of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. >>>> >>>>> From potential user point of view the proposed API has number of lacks which would be nice to have implemented: >>>> 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here >>>> 2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space >>> Having userspace specify the "sample period" for low memory notification >>> makes no sense. The frequency of notifications is a function of the >>> memory pressure. >>> >>>> 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. >>>> 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. 
If subscriber set timer to 1s the amount of events should not be very big. >>>> 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. >>> It would make the interface easier to use if it provided the number of >>> pages to free, in the notification (kernel can calculate that as the >>> delta between current_free_pages -> comfortable_free_pages relative to >>> process RSS). >> If you rely on the notification's argument you lose several features: >> - Handling of notifications by several applications in parallel > Each application has its argument built in a custom fashion > (pages_to_free = delta between current_free_pages -> > comfortable_free_pages relative to process RSS), or something to that > effect. It is compatible with parallel notifications. Not sure that I got it. Do you suggest to ask all the applications to free say 3% of their memory?. Some may be able to free more, and some cannot free any. Isn't it more practical to just notify them, and let each app contribute its part to the global moving target? >> - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience. > I am suggesting an additional field in the notification data so that the > freeing routine has a goal. But it is not mandatory. If you do want to support voluntary (notification less) app decisions, based on the current status, then why not satisfy with this API and only use the notifications to trigger this procedure? > >> - Iterative release loops, until there are enough free pages. > What is the advantage versus releasing the necessary amount of > memory in a given moment? The cleanup logic may be unaware of the page-level effects of its alloc and free, more so when freeing complex internal data structures (such as cached web pages), and this way you let it free until things settle down. Ronen. > >> I believe that the notification should only serve as a trigger to run the cleanup. 
> Agree. > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx149.postini.com [74.125.245.149]) by kanga.kvack.org (Postfix) with SMTP id 369EF6B005A for ; Wed, 25 Jan 2012 05:14:23 -0500 (EST) Date: Wed, 25 Jan 2012 08:12:09 -0200 From: Marcelo Tosatti Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120125101209.GB29167@amt.cnet> References: <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <4F1ED77F.4090900@redhat.com> <20120124181034.GA19186@amt.cnet> <4F1FC2C8.10103@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F1FC2C8.10103@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ronen Hod Cc: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On Wed, Jan 25, 2012 at 10:52:24AM +0200, Ronen Hod wrote: > On 01/24/2012 08:10 PM, Marcelo Tosatti wrote: > >On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote: > >>On 01/24/2012 05:38 PM, Marcelo Tosatti wrote: > >>>On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: > >>>>>-----Original Message----- > >>>>>From: ext Ronen Hod [mailto:rhod@redhat.com] > >>>>>Sent: 19 January, 2012 11:20 > >>>>>To: Pekka Enberg > >>>>... 
> >>>>>>>>Isn't > >>>>>>>> > >>>>>>>>/proc/sys/vm/min_free_kbytes > >>>>>>>> > >>>>>>>>pretty much just that? > >>>>>>>Would you suggest to use min_free_kbytes as the threshold for sending > >>>>>>>low_memory_notifications to applications, and separately as a target > >>>>>>>value for the applications' memory giveaway? > >>>>>>I'm not saying that the kernel should use it directly but it seems > >>>>>>like the kind of "ideal number of free pages" threshold you're > >>>>>>suggesting. So userspace can read that value and use it as the "number > >>>>>>of free pages" threshold for VM events, no? > >>>>>Yes, I like it. The rules of the game are simple and consistent all over, be it the > >>>>>alert threshold, voluntary poling by the apps, and for concurrent work by > >>>>>several applications. > >>>>>Well, as long as it provides a good indication for low_mem_pressure. > >>>>For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount > >>>>of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. > >>>> > >>>>> From potential user point of view the proposed API has number of lacks which would be nice to have implemented: > >>>>1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here > >>>>2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space > >>>Having userspace specify the "sample period" for low memory notification > >>>makes no sense. The frequency of notifications is a function of the > >>>memory pressure. > >>> > >>>>3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. 
> >>>>4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big. > >>>>5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. > >>>It would make the interface easier to use if it provided the number of > >>>pages to free, in the notification (kernel can calculate that as the > >>>delta between current_free_pages -> comfortable_free_pages relative to > >>>process RSS). > >>If you rely on the notification's argument you lose several features: > >> - Handling of notifications by several applications in parallel > >Each application has its argument built in a custom fashion > >(pages_to_free = delta between current_free_pages -> > >comfortable_free_pages relative to process RSS), or something to that > >effect. It is compatible with parallel notifications. > > Not sure that I got it. Do you suggest to ask all the applications to free say 3% of their memory?. > Some may be able to free more, and some cannot free any. Isn't it more practical to just notify them, and let each app contribute its part to the global moving target? The problem is, how is each process supposed to know how much memory it should free for each notification received, that is, its part? Its easier if there is a goal, a hint of how many pages the process should release. > >> - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience. > >I am suggesting an additional field in the notification data so that the > >freeing routine has a goal. But it is not mandatory. > > If you do want to support voluntary (notification less) app decisions, based on the current status, then why not satisfy with this API and only use the notifications to trigger this procedure? > > > > >>- Iterative release loops, until there are enough free pages. 
> >What is the advantage versus releasing the necessary amount of > >memory in a given moment? > > The cleanup logic may be unaware of the page-level effects of its alloc and free, more so when freeing complex internal data structures (such as cached web pages), and this way you let it free until things settle down. > > Ronen. > > > > >>I believe that the notification should only serve as a trigger to run the cleanup. > >Agree. > > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx156.postini.com [74.125.245.156]) by kanga.kvack.org (Postfix) with SMTP id 124E76B004D for ; Wed, 25 Jan 2012 05:48:45 -0500 (EST) Message-ID: <4F1FDDE2.9050609@redhat.com> Date: Wed, 25 Jan 2012 12:48:02 +0200 From: Ronen Hod MIME-Version: 1.0 Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <4F1ED77F.4090900@redhat.com> <20120124181034.GA19186@amt.cnet> <4F1FC2C8.10103@redhat.com> <20120125101209.GB29167@amt.cnet> In-Reply-To: <20120125101209.GB29167@amt.cnet> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Marcelo Tosatti Cc: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com On 01/25/2012 12:12 PM, Marcelo Tosatti wrote: > On Wed, Jan 25, 
2012 at 10:52:24AM +0200, Ronen Hod wrote: >> On 01/24/2012 08:10 PM, Marcelo Tosatti wrote: >>> On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote: >>>> On 01/24/2012 05:38 PM, Marcelo Tosatti wrote: >>>>> On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: >>>>>>> -----Original Message----- >>>>>>> From: ext Ronen Hod [mailto:rhod@redhat.com] >>>>>>> Sent: 19 January, 2012 11:20 >>>>>>> To: Pekka Enberg >>>>>> ... >>>>>>>>>> Isn't >>>>>>>>>> >>>>>>>>>> /proc/sys/vm/min_free_kbytes >>>>>>>>>> >>>>>>>>>> pretty much just that? >>>>>>>>> Would you suggest to use min_free_kbytes as the threshold for sending >>>>>>>>> low_memory_notifications to applications, and separately as a target >>>>>>>>> value for the applications' memory giveaway? >>>>>>>> I'm not saying that the kernel should use it directly but it seems >>>>>>>> like the kind of "ideal number of free pages" threshold you're >>>>>>>> suggesting. So userspace can read that value and use it as the "number >>>>>>>> of free pages" threshold for VM events, no? >>>>>>> Yes, I like it. The rules of the game are simple and consistent all over, be it the >>>>>>> alert threshold, voluntary poling by the apps, and for concurrent work by >>>>>>> several applications. >>>>>>> Well, as long as it provides a good indication for low_mem_pressure. >>>>>> For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount >>>>>> of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. >>>>>> >>>>>>> From potential user point of view the proposed API has number of lacks which would be nice to have implemented: >>>>>> 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here >>>>>> 2. 
API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space >>>>> Having userspace specify the "sample period" for low memory notification >>>>> makes no sense. The frequency of notifications is a function of the >>>>> memory pressure. >>>>> >>>>>> 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. >>>>>> 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big. >>>>>> 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. >>>>> It would make the interface easier to use if it provided the number of >>>>> pages to free, in the notification (kernel can calculate that as the >>>>> delta between current_free_pages -> comfortable_free_pages relative to >>>>> process RSS). >>>> If you rely on the notification's argument you lose several features: >>>> - Handling of notifications by several applications in parallel >>> Each application has its argument built in a custom fashion >>> (pages_to_free = delta between current_free_pages -> >>> comfortable_free_pages relative to process RSS), or something to that >>> effect. It is compatible with parallel notifications. >> Not sure that I got it. Do you suggest to ask all the applications to free say 3% of their memory?. >> Some may be able to free more, and some cannot free any. Isn't it more practical to just notify them, and let each app contribute its part to the global moving target? > The problem is, how is each process supposed to know how much memory > it should free for each notification received, that is, its part? > > Its easier if there is a goal, a hint of how many pages the process > should release. I have to agree. 
Still, the amount of memory that an app should free per memory-pressure-level can be best calculated inside the application (based on comfortable_free_pages relative to process RSS, as you suggested). Fairness is also an issue. And, if in the meantime the memory pressure ended, would you recommend that the application will continue with its work? Ronen. > >>>> - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience. >>> I am suggesting an additional field in the notification data so that the >>> freeing routine has a goal. But it is not mandatory. >> If you do want to support voluntary (notification less) app decisions, based on the current status, then why not satisfy with this API and only use the notifications to trigger this procedure? >> >>>> - Iterative release loops, until there are enough free pages. >>> What is the advantage versus releasing the necessary amount of >>> memory in a given moment? >> The cleanup logic may be unaware of the page-level effects of its alloc and free, more so when freeing complex internal data structures (such as cached web pages), and this way you let it free until things settle down. >> >> Ronen. >> >>>> I believe that the notification should only serve as a trigger to run the cleanup. >>> Agree. >>> >>>
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196]) by kanga.kvack.org (Postfix) with SMTP id 8E24C6B004F for ; Thu, 26 Jan 2012 11:20:15 -0500 (EST) Date: Thu, 26 Jan 2012 14:17:58 -0200 From: Marcelo Tosatti Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120126161758.GA28367@amt.cnet> References: <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <4F1ED77F.4090900@redhat.com> <20120124181034.GA19186@amt.cnet> <4F1FC2C8.10103@redhat.com> <20120125101209.GB29167@amt.cnet> <4F1FDDE2.9050609@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F1FDDE2.9050609@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ronen Hod Cc: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com > >it should free for each notification received, that is, its part? > > > >Its easier if there is a goal, a hint of how many pages the process > >should release. > > I have to agree. > Still, the amount of memory that an app should free per memory-pressure-level can be best calculated inside the application (based on comfortable_free_pages relative to process RSS, as you suggested). It is easier if the kernel calculates the target (the application is free to ignore the hint, of course), because it depends on information not readily available in userspace. > Fairness is also an issue.
> And, if in the meantime the memory pressure ended, would you recommend that the application will continue with its work? There appears to be interest in an event to notify that higher levels of memory are available (see Leonid's email).
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752473Ab2AQIO3 (ORCPT ); Tue, 17 Jan 2012 03:14:29 -0500 Received: from mail-vx0-f174.google.com ([209.85.220.174]:52374 "EHLO mail-vx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752435Ab2AQIOY (ORCPT ); Tue, 17 Jan 2012 03:14:24 -0500 From: Minchan Kim To: linux-mm Cc: LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , Minchan Kim Subject: [RFC 2/3] vmscan hook Date: Tue, 17
Jan 2012 17:13:57 +0900 Message-Id: <1326788038-29141-3-git-send-email-minchan@kernel.org> X-Mailer: git-send-email 1.7.7.5 In-Reply-To: <1326788038-29141-1-git-send-email-minchan@kernel.org> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch inserts a memory pressure notification point into vmscan.c. The main cause of system slowness is swap-in: it is a synchronous operation, so it heavily affects system response. This patch raises the alert when the reclaimer starts to reclaim the inactive anon list. That may be rather early, but early is better than too late. The other alert point, used when no swap is available, is when there are few page cache pages. In this implementation, the notification happens if (cache < free pages). It needs more testing and tuning, or another heuristic. Any suggestions are welcome. Signed-off-by: Minchan Kim --- mm/vmscan.c | 28 ++++++++++++++++++++++++++++ 1 files changed, 28 insertions(+), 0 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 2880396..cfa2e2d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -43,6 +43,7 @@ #include #include #include +#include #include #include @@ -2082,16 +2083,43 @@ static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz, { unsigned long nr[NR_LRU_LISTS]; unsigned long nr_to_scan; + enum lru_list lru; unsigned long nr_reclaimed, nr_scanned; unsigned long nr_to_reclaim = sc->nr_to_reclaim; struct blk_plug plug; +#ifdef CONFIG_LOW_MEM_NOTIFY + bool low_mem = false; + unsigned long free, file; +#endif restart: nr_reclaimed = 0; nr_scanned = sc->nr_scanned; get_scan_count(mz, sc, nr, priority); +#ifdef CONFIG_LOW_MEM_NOTIFY + /* We want to avoid swapout */ + if (nr[LRU_INACTIVE_ANON]) + low_mem = true; + /* + * We want to avoid dropping page cache excessively + * in no swap system + */ + if (nr_swap_pages <= 0) { + free = zone_page_state(mz->zone, NR_FREE_PAGES); + file = zone_page_state(mz->zone, NR_ACTIVE_FILE) + +
zone_page_state(mz->zone, NR_INACTIVE_FILE); + /* + * If we have very few page cache pages, + * notify to user + */ + if (file < free) + low_mem = true; + } + if (low_mem) + low_memory_pressure(); +#endif blk_start_plug(&plug); while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) { -- 1.7.7.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752454Ab2AQIOY (ORCPT ); Tue, 17 Jan 2012 03:14:24 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:47117 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752391Ab2AQIOU (ORCPT ); Tue, 17 Jan 2012 03:14:20 -0500 From: Minchan Kim To: linux-mm Cc: LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , Minchan Kim , KOSAKI Motohiro Subject: [RFC 1/3] /dev/low_mem_notify Date: Tue, 17 Jan 2012 17:13:56 +0900 Message-Id: <1326788038-29141-2-git-send-email-minchan@kernel.org> X-Mailer: git-send-email 1.7.7.5 In-Reply-To: <1326788038-29141-1-git-send-email-minchan@kernel.org> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch makes new device file "/dev/low_mem_notify". If application polls it, it can receive event when system memory pressure happens. This patch is based on KOSAKI and Marcelo's long time ago work. 
http://lwn.net/Articles/268732/ Signed-off-by: Marcelo Tosatti Signed-off-by: KOSAKI Motohiro Signed-off-by: Minchan Kim --- drivers/char/mem.c | 7 ++++ include/linux/low_mem_notify.h | 6 ++++ mm/Kconfig | 7 ++++ mm/Makefile | 1 + mm/low_mem_notify.c | 61 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 82 insertions(+), 0 deletions(-) create mode 100644 include/linux/low_mem_notify.h create mode 100644 mm/low_mem_notify.c diff --git a/drivers/char/mem.c b/drivers/char/mem.c index d6e9d08..72bc12b 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -35,6 +35,10 @@ # include #endif +#ifdef CONFIG_LOW_MEM_NOTIFY +extern struct file_operations low_mem_notify_fops; +#endif + static inline unsigned long size_inside_page(unsigned long start, unsigned long size) { @@ -867,6 +871,9 @@ static const struct memdev { #ifdef CONFIG_CRASH_DUMP [12] = { "oldmem", 0, &oldmem_fops, NULL }, #endif +#ifdef CONFIG_LOW_MEM_NOTIFY + [13] = { "low_mem_notify",0666, &low_mem_notify_fops, NULL}, +#endif }; static int memory_open(struct inode *inode, struct file *filp) diff --git a/include/linux/low_mem_notify.h b/include/linux/low_mem_notify.h new file mode 100644 index 0000000..bc0fc89 --- /dev/null +++ b/include/linux/low_mem_notify.h @@ -0,0 +1,6 @@ +#ifndef _LINUX_LOW_MEM_NOTIFY_H +#define _LINUX_LOW_MEM_NOTIFY_H + +void low_memory_pressure(void); + +#endif diff --git a/mm/Kconfig b/mm/Kconfig index e338407..a2f48c6 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -379,3 +379,10 @@ config CLEANCACHE in a negligible performance hit. If unsure, say Y to enable cleancache + +config LOW_MEM_NOTIFY + bool "Enable low memory notification" + default n + help + If system suffer from low memory, kernel can notify it to user through + /dev/low_mem_notify. 
diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..7856357 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -51,3 +51,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_CLEANCACHE) += cleancache.o +obj-$(CONFIG_LOW_MEM_NOTIFY) += low_mem_notify.o diff --git a/mm/low_mem_notify.c b/mm/low_mem_notify.c new file mode 100644 index 0000000..7432307 --- /dev/null +++ b/mm/low_mem_notify.c @@ -0,0 +1,61 @@ +#include +#include +#include +#include +#include + +static DECLARE_WAIT_QUEUE_HEAD(low_mem_wait); +static atomic_t nr_low_mem = ATOMIC_INIT(0); + +struct low_mem_notify_file_info { + unsigned long last_proc_notify; +}; + +void low_memory_pressure(void) +{ + atomic_inc(&nr_low_mem); + wake_up(&low_mem_wait); +} + +static int low_mem_notify_open(struct inode *inode, struct file *file) +{ + struct low_mem_notify_file_info *info; + int err = 0; + + info = kmalloc(sizeof(*info), GFP_KERNEL); + if (!info) { + err = -ENOMEM; + goto out; + } + + file->private_data = info; +out: + return err; +} + +static int low_mem_notify_release(struct inode *inode, struct file *file) +{ + kfree(file->private_data); + return 0; +} + +static unsigned int low_mem_notify_poll(struct file *file, poll_table *wait) +{ + unsigned int ret = 0; + + poll_wait(file, &low_mem_wait, wait); + + if (atomic_read(&nr_low_mem) != 0) { + ret = POLLIN; + atomic_set(&nr_low_mem, 0); + } + + return ret; +} + +struct file_operations low_mem_notify_fops = { + .open = low_mem_notify_open, + .release = low_mem_notify_release, + .poll = low_mem_notify_poll, +}; +EXPORT_SYMBOL(low_mem_notify_fops); -- 1.7.7.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752496Ab2AQIOh (ORCPT ); Tue, 17 Jan 2012 03:14:37 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:47117 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) 
by vger.kernel.org with ESMTP id S1752460Ab2AQIO2 (ORCPT ); Tue, 17 Jan 2012 03:14:28 -0500 From: Minchan Kim To: linux-mm Cc: LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , Minchan Kim Subject: [RFC 3/3] test program Date: Tue, 17 Jan 2012 17:13:58 +0900 Message-Id: <1326788038-29141-4-git-send-email-minchan@kernel.org> X-Mailer: git-send-email 1.7.7.5 In-Reply-To: <1326788038-29141-1-git-send-email-minchan@kernel.org> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This test program allocates 10M per second and, when a memory pressure notification arrives, releases 20M. I tested this patch on a 512M qemu machine with 3 test programs running. I saw some swapout, but not much, and no OOM at all. It clearly reduces swap out. Signed-off-by: Minchan Kim --- poll.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 121 insertions(+), 0 deletions(-) create mode 100644 poll.c diff --git a/poll.c b/poll.c new file mode 100644 index 0000000..3215f8b --- /dev/null +++ b/poll.c @@ -0,0 +1,121 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define ALLOC_UNIT 10 /* MB */ +#define FREE_UNIT 20 /* MB */ + +void alloc_memory(); +void free_memory(); + +unsigned int total_memory = 0; /* MB */ + +pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; + +/* + * True once total memory reaches 400M + */ +bool memory_full() +{ + return total_memory >= 400 ?
true : false; +} + +struct alloc_chunk { + void *ptr; + struct alloc_chunk *next; +}; + +struct alloc_chunk head_chunk; + +void init_alloc_chunk(void) +{ + head_chunk.ptr = NULL; + head_chunk.next = NULL; +} + +void add_memory(void *ptr) +{ + struct alloc_chunk *new_chunk = malloc(sizeof(struct alloc_chunk)); + new_chunk->ptr = ptr; + + pthread_mutex_lock(&mutex); + new_chunk->next = head_chunk.next; + head_chunk.next = new_chunk; + total_memory += ALLOC_UNIT; + pthread_mutex_unlock(&mutex); + + printf("[%d] Add total memory %d(MB)\n", getpid(), total_memory); +} + +void alloc_memory(void) +{ + while(1) { + if (memory_full()) { + sleep(10); + continue; + } + + void *new = malloc(ALLOC_UNIT*1024*1024); + memset(new, 0, ALLOC_UNIT*1024*1024); + add_memory(new); + sleep(1); + } +} + +void free_memory(void) +{ + int count = FREE_UNIT / ALLOC_UNIT; + while(count--) { + struct alloc_chunk *chunk = head_chunk.next; + if (chunk == NULL) + break; + + pthread_mutex_lock(&mutex); + head_chunk.next = chunk->next; + total_memory -= ALLOC_UNIT; + pthread_mutex_unlock(&mutex); + + free(chunk->ptr); + free(chunk); + + printf("[%d] Free total memory %d(MB)\n", getpid(), total_memory); + } +} + +void *poll_thread(void *dummy) +{ + struct pollfd pfd; + int fd = open("/dev/low_mem_notify", O_RDONLY); + if (fd == -1) { + fprintf(stderr, "Fail to open\n"); + return NULL; + } + + pfd.fd = fd; + pfd.events = POLLIN; + + while(1) { + poll(&pfd, 1, -1); + free_memory(); + } +} + +int main() +{ + pthread_t threadid; + init_alloc_chunk(); + + if (pthread_create(&threadid, NULL, poll_thread, NULL)) { + fprintf(stderr, "pthread create fail\n"); + return 1; + } + + alloc_memory(); + return 0; +} -- 1.7.7.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752552Ab2AQIk5 (ORCPT ); Tue, 17 Jan 2012 03:40:57 -0500 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:45153 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK)
by vger.kernel.org with ESMTP id S1751822Ab2AQIkz (ORCPT ); Tue, 17 Jan 2012 03:40:55 -0500 X-SecurityPolicyCheck-FJ: OK by FujitsuOutboundMailChecker v1.3.1 Date: Tue, 17 Jan 2012 17:39:32 +0900 From: KAMEZAWA Hiroyuki To: Minchan Kim Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod Subject: Re: [RFC 2/3] vmscan hook Message-Id: <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <1326788038-29141-3-git-send-email-minchan@kernel.org> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> Organization: FUJITSU Co. LTD. X-Mailer: Sylpheed 3.1.1 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 17 Jan 2012 17:13:57 +0900 Minchan Kim wrote: > This patch insert memory pressure notify point into vmscan.c > Most problem in system slowness is swap-in. swap-in is a synchronous > opeartion so that it affects heavily system response. > > This patch alert it when reclaimer start to reclaim inactive anon list. > It seems rather earlier but not bad than too late. > > Other alert point is when there is few cache pages > In this implementation, if it is (cache < free pages), > memory pressure notify happens. It has to need more testing and tuning > or other hueristic. Any suggesion are welcome. > > Signed-off-by: Minchan Kim In my 1st impression, isn't this too simple ? 
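For readers following the thread, the two trigger conditions described in the quoted commit message can be restated as a standalone predicate. This is only a userspace sketch: `struct reclaim_snapshot`, its field names, and `low_mem_trigger()` are hypothetical names chosen for illustration, and the fields merely stand in for what the posted hook reads from `nr[LRU_INACTIVE_ANON]`, `nr_swap_pages`, and the per-zone counters.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the values the posted hook inspects. */
struct reclaim_snapshot {
	unsigned long inactive_anon_to_scan; /* stands in for nr[LRU_INACTIVE_ANON] */
	long swap_pages;                     /* stands in for nr_swap_pages */
	unsigned long free_pages;            /* stands in for the zone's NR_FREE_PAGES */
	unsigned long file_pages;            /* NR_ACTIVE_FILE + NR_INACTIVE_FILE */
};

/* Returns true when the hook, as posted, would call low_memory_pressure(). */
static bool low_mem_trigger(const struct reclaim_snapshot *s)
{
	/* Condition 1: reclaim is about to scan inactive anon pages. */
	if (s->inactive_anon_to_scan)
		return true;

	/* Condition 2: no swap available and page cache smaller than free memory. */
	if (s->swap_pages <= 0 && s->file_pages < s->free_pages)
		return true;

	return false;
}
```

Note that condition 1 fires for any nonzero scan count, independent of how much free memory remains.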
> --- > mm/vmscan.c | 28 ++++++++++++++++++++++++++++ > 1 files changed, 28 insertions(+), 0 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2880396..cfa2e2d 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -43,6 +43,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -2082,16 +2083,43 @@ static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz, > { > unsigned long nr[NR_LRU_LISTS]; > unsigned long nr_to_scan; > + > enum lru_list lru; > unsigned long nr_reclaimed, nr_scanned; > unsigned long nr_to_reclaim = sc->nr_to_reclaim; > struct blk_plug plug; > +#ifdef CONFIG_LOW_MEM_NOTIFY > + bool low_mem = false; > + unsigned long free, file; > +#endif > > restart: > nr_reclaimed = 0; > nr_scanned = sc->nr_scanned; > get_scan_count(mz, sc, nr, priority); > +#ifdef CONFIG_LOW_MEM_NOTIFY > + /* We want to avoid swapout */ > + if (nr[LRU_INACTIVE_ANON]) > + low_mem = true; IIUC, nr[LRU_INACTIVE_ANON] can be easily > 0. And get_scan_count() now check per-memcg-lru. So, this only works when memcg is not used. > + /* > + * We want to avoid dropping page cache excessively > + * in no swap system > + */ > + if (nr_swap_pages <= 0) { > + free = zone_page_state(mz->zone, NR_FREE_PAGES); > + file = zone_page_state(mz->zone, NR_ACTIVE_FILE) + > + zone_page_state(mz->zone, NR_INACTIVE_FILE); > + /* > + * If we have very few page cache pages, > + * notify to user > + */ > + if (file < free) > + low_mem = true; > + } I can't understand why you think you can check lowmem condition by "file < free". And I don't think using per-zone data is good. (I'm not sure how many zones embeded guys using..) Another idea: 1. can't we use some technique like cleancache to detect the condition ? 2. can't we measure page-in/page-out distance by recording something ? 3. NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can ignore the data file cache ? 4. how about checking kswapd's busy status ? 
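Idea 3 above can be approximated from userspace, as a rough sketch only. The code below assumes the `AnonPages:` and `Mapped:` fields of `/proc/meminfo` as stand-ins for NR_ANON and NR_FILE_MAPPED. The helper names `meminfo_field_kb()` and `core_memory_kb()` are hypothetical, and nothing here is part of the posted patches.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up "<name>: <value> kB" in a meminfo-style text buffer.
 * Returns the value in kB, or -1 if the field is not found.
 */
static long meminfo_field_kb(const char *text, const char *name)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		long val;

		/* Match the field name at the start of a line, followed by ':' */
		if (strncmp(line, name, len) == 0 && line[len] == ':' &&
		    sscanf(line + len + 1, "%ld", &val) == 1)
			return val;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Approximate "core memory" as anonymous pages plus mapped file pages. */
static long core_memory_kb(const char *meminfo_text)
{
	long anon = meminfo_field_kb(meminfo_text, "AnonPages");
	long mapped = meminfo_field_kb(meminfo_text, "Mapped");

	if (anon < 0 || mapped < 0)
		return -1;
	return anon + mapped;
}
```

A caller would read `/proc/meminfo` into a buffer and pass it to `core_memory_kb()`. How to choose a threshold for the result is left open here.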
Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752630Ab2AQJOM (ORCPT ); Tue, 17 Jan 2012 04:14:12 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:44717 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751340Ab2AQJOJ (ORCPT ); Tue, 17 Jan 2012 04:14:09 -0500 Date: Tue, 17 Jan 2012 18:13:56 +0900 From: Minchan Kim To: KAMEZAWA Hiroyuki Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod Subject: Re: [RFC 2/3] vmscan hook Message-ID: <20120117091356.GA29736@barrios-desktop.redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 05:39:32PM +0900, KAMEZAWA Hiroyuki wrote: > On Tue, 17 Jan 2012 17:13:57 +0900 > Minchan Kim wrote: > > > This patch insert memory pressure notify point into vmscan.c > > Most problem in system slowness is swap-in. swap-in is a synchronous > > opeartion so that it affects heavily system response. > > > > This patch alert it when reclaimer start to reclaim inactive anon list. > > It seems rather earlier but not bad than too late. > > > > Other alert point is when there is few cache pages > > In this implementation, if it is (cache < free pages), > > memory pressure notify happens. It has to need more testing and tuning > > or other hueristic. Any suggesion are welcome. 
> > > > Signed-off-by: Minchan Kim > > In my 1st impression, isn't this too simple ? I agree It's too simple. It would be good start point rather than unnecessary complicated things. > > > > --- > > mm/vmscan.c | 28 ++++++++++++++++++++++++++++ > > 1 files changed, 28 insertions(+), 0 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 2880396..cfa2e2d 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -43,6 +43,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -2082,16 +2083,43 @@ static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz, > > { > > unsigned long nr[NR_LRU_LISTS]; > > unsigned long nr_to_scan; > > + > > enum lru_list lru; > > unsigned long nr_reclaimed, nr_scanned; > > unsigned long nr_to_reclaim = sc->nr_to_reclaim; > > struct blk_plug plug; > > +#ifdef CONFIG_LOW_MEM_NOTIFY > > + bool low_mem = false; > > + unsigned long free, file; > > +#endif > > > > restart: > > nr_reclaimed = 0; > > nr_scanned = sc->nr_scanned; > > get_scan_count(mz, sc, nr, priority); > > +#ifdef CONFIG_LOW_MEM_NOTIFY > > + /* We want to avoid swapout */ > > + if (nr[LRU_INACTIVE_ANON]) > > + low_mem = true; > > IIUC, nr[LRU_INACTIVE_ANON] can be easily > 0. Yes. But I thought it would be better than late notification. Late notification ends up swap out which is a big concern about this patch. More proper timing suggestion helps me a lot. > And get_scan_count() now check per-memcg-lru. So, this only works when > memcg is not used. Hmm, I didn't look at recent memcg/global reclaim unify patch of Johannes. I need time to look at it. Thanks. 
> > > > + /* > > + * We want to avoid dropping page cache excessively > > + * in no swap system > > + */ > > + if (nr_swap_pages <= 0) { > > + free = zone_page_state(mz->zone, NR_FREE_PAGES); > > + file = zone_page_state(mz->zone, NR_ACTIVE_FILE) + > > + zone_page_state(mz->zone, NR_INACTIVE_FILE); > > + /* > > + * If we have very few page cache pages, > > + * notify to user > > + */ > > + if (file < free) > > + low_mem = true; > > + } > > I can't understand why you think you can check lowmem condition by "file < free". The reason I thought so is I want to maintain some page cache to some degree. But I admit It's very naive heuristic and should be improved. > And I don't think using per-zone data is good. > (I'm not sure how many zones embeded guys using..) Agree. In case of swapless system, we need another heuristic. > > Another idea: > 1. can't we use some technique like cleancache to detect the condition ? I totally forgot cleancache approach. Could you remind that? > 2. can't we measure page-in/page-out distance by recording something ? I can't understand your point. What's relation does it with swapout prevent? > 3. NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can > ignore the data file cache ? It's good but how do we define some amount? It's very vague but I guess we can get a good idea from that. Perhaps, you already has it. > 4. how about checking kswapd's busy status ? Could you elaborate on your idea? 
Kame, Thanks for reply, > > > > Thanks, > -Kame > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752683Ab2AQJ1h (ORCPT ); Tue, 17 Jan 2012 04:27:37 -0500 Received: from mail-vx0-f174.google.com ([209.85.220.174]:38708 "EHLO mail-vx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750839Ab2AQJ1e convert rfc822-to-8bit (ORCPT ); Tue, 17 Jan 2012 04:27:34 -0500 MIME-Version: 1.0 In-Reply-To: <1326788038-29141-2-git-send-email-minchan@kernel.org> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> Date: Tue, 17 Jan 2012 11:27:34 +0200 X-Google-Sender-Auth: MIE03H-NsXoVjJmb95iNqA9xgcU Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: Minchan Kim Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 10:13 AM, Minchan Kim wrote: > +static unsigned int low_mem_notify_poll(struct file *file, poll_table *wait) > +{ > +        unsigned int ret = 0; > + > +        poll_wait(file, &low_mem_wait, wait); > + > +        if (atomic_read(&nr_low_mem) != 0) { > +                ret = POLLIN; > +                atomic_set(&nr_low_mem, 0); > +        } > + > +        return ret; > +} Doesn't this mean that only one application will receive the notification? 
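The concern can be illustrated with a small userspace model. This is a sketch, not kernel code: `poll_shared()` mimics the quoted `low_mem_notify_poll()`, which clears the shared counter for everyone, while `poll_per_file()` is a hypothetical alternative in which each open file remembers the last event generation it has reported. All names below are invented for illustration. In the real driver the outcome depends on scheduling, since several woken pollers may read the counter before any of them clears it; the model shows the case being asked about.

```c
#include <stdbool.h>

/* Model of the posted design: one shared counter, cleared by whoever polls first. */
static unsigned int nr_low_mem;

static void low_memory_pressure_model(void)
{
	nr_low_mem++;
}

static bool poll_shared(void)
{
	if (nr_low_mem != 0) {
		nr_low_mem = 0; /* the event is consumed for all other pollers too */
		return true;
	}
	return false;
}

/* Hypothetical alternative: a global generation plus a per-open-file cursor. */
static unsigned long event_generation;

struct file_model {
	unsigned long last_seen; /* generation this file last reported */
};

static void low_memory_pressure_gen(void)
{
	event_generation++;
}

static bool poll_per_file(struct file_model *f)
{
	if (f->last_seen != event_generation) {
		f->last_seen = event_generation; /* only this file's view advances */
		return true;
	}
	return false;
}
```

In the shared model a second poller sees nothing after the first has polled. In the per-file model every opener observes each event once. The posted patch already allocates a per-file `low_mem_notify_file_info` with a `last_proc_notify` field that its poll handler does not use, which could carry such a cursor.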
Pekka

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Pekka Enberg
To: Minchan Kim
Cc: linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, Rik van Riel, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod, KOSAKI Motohiro
Subject: Re: [RFC 1/3] /dev/low_mem_notify
Date: Tue, 17 Jan 2012 11:45:06 +0200
In-Reply-To: <1326788038-29141-2-git-send-email-minchan@kernel.org>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org>

On Tue, Jan 17, 2012 at 10:13 AM, Minchan Kim wrote:
> This patch makes new device file "/dev/low_mem_notify".
> If application polls it, it can receive event when system
> memory pressure happens.
>
> This patch is based on KOSAKI and Marcelo's long time ago work.
> http://lwn.net/Articles/268732/

I'm not loving the ABI.
Alternative solutions:

  - SIGDANGER + signalfd() for poll
  - sys_eventfd()
  - sys_mem_notify_open() similar to sys_perf_event_open()

Pekka

From mboxrd@z Thu Jan 1 00:00:00 1970
From: KAMEZAWA Hiroyuki
To: Minchan Kim
Cc: linux-mm, LKML, leonid.moiseichuk@nokia.com, penberg@kernel.org, Rik van Riel, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod
Subject: Re: [RFC 2/3] vmscan hook
Date: Tue, 17 Jan 2012 19:05:12 +0900
Message-Id: <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20120117091356.GA29736@barrios-desktop.redhat.com>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com>
Organization: FUJITSU Co. LTD.

On Tue, 17 Jan 2012 18:13:56 +0900
Minchan Kim wrote:

> On Tue, Jan 17, 2012 at 05:39:32PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 17 Jan 2012 17:13:57 +0900
> > Minchan Kim wrote:
> >
> > >
> > > +	/*
> > > +	 * We want to avoid dropping page cache excessively
> > > +	 * in no swap system
> > > +	 */
> > > +	if (nr_swap_pages <= 0) {
> > > +		free = zone_page_state(mz->zone, NR_FREE_PAGES);
> > > +		file = zone_page_state(mz->zone, NR_ACTIVE_FILE) +
> > > +			zone_page_state(mz->zone, NR_INACTIVE_FILE);
> > > +		/*
> > > +		 * If we have very few page cache pages,
> > > +		 * notify to user
> > > +		 */
> > > +		if (file < free)
> > > +			low_mem = true;
> > > +	}
> >
> > I can't understand why you think you can check lowmem condition by "file < free".
>
> The reason I thought so is that I want to keep some amount of page cache.
> But I admit it's a very naive heuristic and should be improved.
>
> > And I don't think using per-zone data is good.
> > (I'm not sure how many zones embedded guys are using..)
>
> Agree. In case of a swapless system, we need another heuristic.
>
> >
> > Another idea:
> > 1. can't we use some technique like cleancache to detect the condition ?
>
> I totally forgot the cleancache approach. Could you remind me of it?
>

It is similar to a 'victim cache': clean pages are cached somewhere
when vmscan pages them out.

  page -> vmscan's pageout -> cleancache -> may be discarded.

If a filesystem looks up a page which is in cleancache, it is a cache hit
and the page is brought back to the radix-tree. If not, it is read from
disk again. And cleancache for swap (frontswap) was posted, too.

> > 2. can't we measure page-in/page-out distance by recording something ?
>
> I can't understand your point. How does it relate to preventing swapout?
>

If the distance between pageout -> pagein is short, it means thrashing.
For example, record the timestamp when the page (mapping, index) was
paged out, and check it at page-in.

> > 3. NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can
> >    ignore the data file cache ?
>
> It's good but how do we define some amount?
> It's very vague but I guess we can get a good idea from that.
> Perhaps you already have one.
>

Hm, a rough idea is...

 - we now have rss counters per mm:
     - mapped anon
     - mapped file
     - swapents

Ok, here, add one more counter:

     - paged-out file. (I think this can be recorded in the pte.)
       +1 when try_to_unmap_file() unmaps it.
       -1 when a page is back or unmapped.

Then, scanning all tasks. Then,

                                 mapped_anon + mapped_file
  active_map_ratio = ----------------------------------------------------- * 100
                     mapped_anon + mapped_file + swapents + paged_out_file

Ok, how to use this value...

Like memcg's threshold notify interface, you can change the mem_notify
interface to use eventfd() as

This will inform you of an event when active_map_ratio crosses the passed threshold.

complicated ?

> > 4. how about checking kswapd's busy status ?
>
> Could you elaborate on your idea?
>

I just thought kswapd may not stop when the situation is very bad.
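
The active_map_ratio arithmetic from item 3 above can be sketched in plain C. This is only an illustration under assumptions: struct map_counters and active_map_ratio() are hypothetical names, the counters are summed over all tasks by some caller that is not shown, and returning 100 when nothing is accounted is a choice made here, not something stated in the thread.

```c
#include <stdint.h>

/* Hypothetical container for the summed per-mm counters. */
struct map_counters {
	uint64_t mapped_anon;
	uint64_t mapped_file;
	uint64_t swapents;
	uint64_t paged_out_file;	/* the proposed new counter */
};

/*
 * Returns active_map_ratio in percent (integer division).
 * Assumption: an empty accounting state is reported as 100.
 */
static unsigned int active_map_ratio(const struct map_counters *c)
{
	uint64_t active = c->mapped_anon + c->mapped_file;
	uint64_t total = active + c->swapents + c->paged_out_file;

	if (total == 0)
		return 100;

	return (unsigned int)(active * 100 / total);
}
```

For example, with 600 mapped anon pages, 200 mapped file pages, 150 swap entries and 50 paged-out file pages, the ratio is 800 * 100 / 1000 = 80.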
Thanks,
-Kame

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Colin Walters
To: Minchan Kim
Cc: linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod
Subject: Re: [RFC 0/3] low memory notify
Date: Tue, 17 Jan 2012 09:38:10 -0500
Message-ID: <1326811093.3467.41.camel@lenny>
In-Reply-To: <1326788038-29141-1-git-send-email-minchan@kernel.org>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org>

On Tue, 2012-01-17 at 17:13 +0900, Minchan Kim wrote:
> As you can see, it's respin of mem_notify core of KOSAKI and Marcelo.
> (Of course, KOSAKI's original patchset includes more logics but I didn't
> include all things intentionally because I want to start from beginning
> again) Recently, there are some requirements of notification of system
> memory pressure.

How does this relate to the existing cgroups memory notifications? See
Documentation/cgroups/memory.txt under "10. OOM Control"

> It would be very useful for various cases.
> For example, QEMU/JVM/Firefox like big memory hogger can release their memory
> when memory pressure happens.

I don't know about QEMU, but the key characteristic of the JVM and
Firefox is that they use garbage collection. Which also applies to
Python, Ruby, Google Go, Haskell, OCaml...

So what you really want to be investigating here is integration between
a garbage collector and the system VM. Your test program looks nothing
like a garbage collector. I'd expect most of the performance tradeoffs
to be similar between these runtimes. The Azul people have been doing
something like this: http://www.managedruntime.org/

In Firefox' case though it can also drop other caches, e.g.:
http://people.gnome.org/~federico/news-2007-09.html#firefox-memory-1

As far as the desktop goes, I want to get notified if we're going to hit
swap, not if we're close to exhausting the total of RAM+swap. While
swap may make sense for servers that care about throughput mainly, I
care a lot about latency.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Pekka Enberg
To: Colin Walters
Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, Rik van Riel, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod
Subject: Re: [RFC 0/3] low memory notify
Date: Tue, 17 Jan 2012 17:04:28 +0200
In-Reply-To: <1326811093.3467.41.camel@lenny>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326811093.3467.41.camel@lenny>

On Tue, Jan 17, 2012 at 4:38 PM, Colin Walters wrote:
> So what you really want to be investigating here is integration between
> a garbage collector and the system VM. Your test program looks nothing
> like a garbage collector. I'd expect most of the performance tradeoffs
> to be similar between these runtimes. The Azul people have been doing
> something like this: http://www.managedruntime.org/

The interaction isn't all that complex, really. I'd expect most VMs to
simply wake up the GC thread when poll() returns. GCs that are able to
compact the heap can madvise(MADV_DONTNEED) or even munmap() unused
parts of the heap.

Pekka

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
To: Pekka Enberg
Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod, KOSAKI Motohiro
Subject: Re: [RFC 1/3] /dev/low_mem_notify
Date: Tue, 17 Jan 2012 11:35:27 -0500
Message-ID: <4F15A34F.40808@redhat.com>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org>

On 01/17/2012 04:27 AM, Pekka Enberg wrote:
> On Tue, Jan 17, 2012 at 10:13 AM, Minchan Kim wrote:
>> +static unsigned int
low_mem_notify_poll(struct file *file, poll_table *wait)
>> +{
>> +	unsigned int ret = 0;
>> +
>> +	poll_wait(file, &low_mem_wait, wait);
>> +
>> +	if (atomic_read(&nr_low_mem) != 0) {
>> +		ret = POLLIN;
>> +		atomic_set(&nr_low_mem, 0);
>> +	}
>> +
>> +	return ret;
>> +}
>
> Doesn't this mean that only one application will receive the notification?

One at a time, which could be a good thing since the last thing
we want to do when the system is under memory pressure is create
a thundering herd.

OTOH, we do need to ensure that programs take turns getting the
memory pressure notification. I do not know whether poll_wait
automatically takes care of that...

-- 
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
To: Colin Walters
Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod
Subject: Re: [RFC 0/3] low memory notify
Date: Tue, 17 Jan 2012 11:44:32 -0500
Message-ID: <4F15A570.8090604@redhat.com>
In-Reply-To: <1326811093.3467.41.camel@lenny>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326811093.3467.41.camel@lenny>

On 01/17/2012 09:38 AM, Colin Walters wrote:

> How does this relate to the existing cgroups memory notifications?
See
> Documentation/cgroups/memory.txt under "10. OOM Control"

> As far as the desktop goes, I want to get notified if we're going to hit
> swap, not if we're close to exhausting the total of RAM+swap. While
> swap may make sense for servers that care about throughput mainly, I
> care a lot about latency.

You just answered your own question :)

This code is indeed meant to avoid/reduce swap use and improve
userspace latencies.

Minchan posted a very simple example patch set, so we can get
an idea in what direction people would want the code to go.

This often beats working on complex code for weeks, and then
having people tell you they wanted something else :)

-- 
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Olof Johansson
To: Minchan Kim
Cc: linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, penberg@kernel.org, Rik van Riel, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod
Subject: Re: [RFC 0/3] low memory notify
Date: Tue, 17 Jan 2012 09:16:56 -0800
In-Reply-To: <1326788038-29141-1-git-send-email-minchan@kernel.org>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org>

Hi,

On Tue, Jan 17, 2012 at 12:13 AM, Minchan Kim wrote:
> As you can see, it's respin of mem_notify core of KOSAKI and Marcelo.
> (Of course, KOSAKI's original patchset includes more logics but I didn't
> include all things intentionally because I want to start from beginning
> again) Recently, there are some requirements of notification of system
> memory pressure. It would be very useful for various cases.
> For example, QEMU/JVM/Firefox like big memory hogger can release their memory
> when memory pressure happens. Another example in embedded side,
> they can close background application. For this, there are some trial but
> we need more general one and not-hacked alloc/free hot path.
>
> I think most big problem of system slowness is swap-in operation.
> Swap-in is a synchronous operation so application's latency would be
> big. Solution for that is prevent swap-out itself. We couldn't prevent
> swapout totally but could reduce it with this patch.
>
> In case of swapless system, code page is very important for system response.
> So we have to keep code page, too. I used very naive heuristic in this patch
> but welcome to any idea.
>
> I want to make kernel logic simple if possible and just notify to user space.
> Of course, there are lots of thing we have to consider but for discussion
> this simple patch would be a good start point.

This is almost exactly what we've been looking at doing for Chrome OS
(which is swapless). In our case, the browser is by far the largest
memory consumer on the system, and we have for quite a while been
playing tricks with OOM scores trying to make the interaction between
the VM and the application happen right such that if we're OOM, the
"right" tab process gets killed, etc. But it's not enough (and it's
not always accurate enough).

Chrome definitely knows already what it would prefer to do to release
memory, so having a simple notifier for low memory condition is
preferred.

We have considered doing it through cgroups but it adds a level of
complexity that we don't need for this use case (we do already use
cgroups for other reasons though).
If this simpler solution is heading towards inclusion we'll probably
use it instead.

-Olof

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Pekka Enberg
To: Rik van Riel
Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod, KOSAKI Motohiro
Subject: Re: [RFC 1/3] /dev/low_mem_notify
Date: Tue, 17 Jan 2012 20:51:13 +0200 (EET)
In-Reply-To: <4F15A34F.40808@redhat.com>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com>

Hello,

Ok, so here's a proof of concept patch that implements sample-based
per-process free threshold VM event watching using perf-like syscall
ABI. I'd really like to see something like this that's much more
extensible and clean than the /dev based ABIs that people have proposed
so far.

Pekka

------------------->

>>From a07f93fdca360b20daef4a5d66f2a5746f31f6a6 Mon Sep 17 00:00:00 2001
From: Pekka Enberg
Date: Tue, 17 Jan 2012 17:51:48 +0200
Subject: [PATCH] vmnotify: VM event notification system

This patch implements a new sys_vmnotify_fd() system call that returns a
pollable file descriptor that can be used to watch VM events.

For example, to watch for VM event when free memory is below 99% of available
memory using 1 second sample period, you'd do something like this:

	struct vmnotify_config config;
	struct vmnotify_event event;
	struct pollfd pollfd;
	int fd;

	config = (struct vmnotify_config) {
		.type = VMNOTIFY_TYPE_SAMPLE|VMNOTIFY_TYPE_FREE_THRESHOLD,
		.sample_period_ns = 1000000000L,
		.free_threshold = 99,
	};

	fd = sys_vmnotify_fd(&config);

	pollfd.fd = fd;
	pollfd.events = POLLIN;

	if (poll(&pollfd, 1, -1) < 0) {
		perror("poll failed");
		exit(1);
	}

	memset(&event, 0, sizeof(event));

	if (read(fd, &event, sizeof(event)) < 0) {
		perror("read failed");
		exit(1);
	}

Signed-off-by: Pekka Enberg
---
 arch/x86/include/asm/unistd_64.h       |    2 +
 include/linux/vmnotify.h               |   44 ++++++
 mm/Kconfig                             |    6 +
 mm/Makefile                            |    1 +
 mm/vmnotify.c                          |  235 ++++++++++++++++++++++++++++++++
 tools/testing/vmnotify/vmnotify-test.c |   68 +++++++++
 6 files changed, 356 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/vmnotify.h
 create mode 100644 mm/vmnotify.c
 create mode 100644 tools/testing/vmnotify/vmnotify-test.c

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 0431f19..b0928cd 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -686,6 +686,8 @@ __SYSCALL(__NR_getcpu, sys_getcpu)
 __SYSCALL(__NR_process_vm_readv, sys_process_vm_readv)
 #define __NR_process_vm_writev 311
 __SYSCALL(__NR_process_vm_writev, sys_process_vm_writev)
+#define __NR_vmnotify_fd 312
+__SYSCALL(__NR_vmnotify_fd, sys_vmnotify_fd)

 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/include/linux/vmnotify.h b/include/linux/vmnotify.h
new file mode 100644
index 0000000..8f8642b
--- /dev/null
+++ b/include/linux/vmnotify.h
@@ -0,0 +1,44 @@
+#ifndef _LINUX_VMNOTIFY_H
+#define _LINUX_VMNOTIFY_H
+
+#include
+
+enum {
+	VMNOTIFY_TYPE_FREE_THRESHOLD = 1ULL << 0,
+	VMNOTIFY_TYPE_SAMPLE = 1ULL << 1,
+};
+
+struct vmnotify_config {
+	/*
+	 * Size of the struct for ABI extensibility.
+	 */
+	__u32 size;
+
+	/*
+	 * Notification type bitmask
+	 */
+	__u64 type;
+
+	/*
+	 * Free memory threshold in percentages [1..99]
+	 */
+	__u32 free_threshold;
+
+	/*
+	 * Sample period in nanoseconds
+	 */
+	__u64 sample_period_ns;
+};
+
+struct vmnotify_event {
+	/* Size of the struct for ABI extensibility. */
+	__u32 size;
+
+	__u64 nr_avail_pages;
+
+	__u64 nr_swap_pages;
+
+	__u64 nr_free_pages;
+};
+
+#endif /* _LINUX_VMNOTIFY_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 011b110..6631167 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -373,3 +373,9 @@ config CLEANCACHE
 	  in a negligible performance hit.

 	  If unsure, say Y to enable cleancache
+
+config VMNOTIFY
+	bool "Enable VM event notification system"
+	default n
+	help
+	  If unsure, say N to disable vmnotify
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..e1b5db3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -51,3 +51,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
+obj-$(CONFIG_VMNOTIFY) += vmnotify.o
diff --git a/mm/vmnotify.c b/mm/vmnotify.c
new file mode 100644
index 0000000..6800450
--- /dev/null
+++ b/mm/vmnotify.c
@@ -0,0 +1,235 @@
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define VMNOTIFY_MAX_FREE_THRESHOD 100
+
+struct vmnotify_watch {
+	struct vmnotify_config config;
+
+	struct mutex mutex;
+	bool pending;
+	struct vmnotify_event event;
+
+	/* sampling */
+	struct hrtimer timer;
+
+	/* poll */
+	wait_queue_head_t waitq;
+};
+
+static bool vmnotify_match(struct vmnotify_watch *watch, struct vmnotify_event *event)
+{
+	if (watch->config.type & VMNOTIFY_TYPE_FREE_THRESHOLD) {
+		u64 threshold;
+
+		if (!event->nr_avail_pages)
+			return false;
+
+		threshold = event->nr_free_pages * 100 / event->nr_avail_pages;
+		if (threshold > watch->config.free_threshold)
+			return false;
+	}
+
+	return true;
+}
+
+static void
vmnotify_sample(struct vmnotify_watch *watch)
+{
+	struct vmnotify_event event;
+	struct sysinfo si;
+
+	memset(&event, 0, sizeof(event));
+
+	event.size = sizeof(event);
+	event.nr_free_pages = global_page_state(NR_FREE_PAGES);
+
+	si_meminfo(&si);
+	event.nr_avail_pages = si.totalram;
+
+#ifdef CONFIG_SWAP
+	si_swapinfo(&si);
+	event.nr_swap_pages = si.totalswap;
+#endif
+
+	if (!vmnotify_match(watch, &event))
+		return;
+
+	mutex_lock(&watch->mutex);
+
+	watch->pending = true;
+
+	memcpy(&watch->event, &event, sizeof(event));
+
+	mutex_unlock(&watch->mutex);
+}
+
+static enum hrtimer_restart vmnotify_timer_fn(struct hrtimer *hrtimer)
+{
+	struct vmnotify_watch *watch = container_of(hrtimer, struct vmnotify_watch, timer);
+	u64 sample_period = watch->config.sample_period_ns;
+
+	vmnotify_sample(watch);
+
+	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
+
+	wake_up(&watch->waitq);
+
+	return HRTIMER_RESTART;
+}
+
+static void vmnotify_start_timer(struct vmnotify_watch *watch)
+{
+	u64 sample_period = watch->config.sample_period_ns;
+
+	hrtimer_init(&watch->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	watch->timer.function = vmnotify_timer_fn;
+
+	hrtimer_start(&watch->timer, ns_to_ktime(sample_period), HRTIMER_MODE_REL_PINNED);
+}
+
+static unsigned int vmnotify_poll(struct file *file, poll_table *wait)
+{
+	struct vmnotify_watch *watch = file->private_data;
+	unsigned int events = 0;
+
+	poll_wait(file, &watch->waitq, wait);
+
+	mutex_lock(&watch->mutex);
+
+	if (watch->pending)
+		events |= POLLIN;
+
+	mutex_unlock(&watch->mutex);
+
+	return events;
+}
+
+static ssize_t vmnotify_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct vmnotify_watch *watch = file->private_data;
+	int ret = 0;
+
+	mutex_lock(&watch->mutex);
+
+	if (!watch->pending)
+		goto out_unlock;
+
+	if (copy_to_user(buf, &watch->event, sizeof(struct vmnotify_event))) {
+		ret = -EFAULT;
+		goto out_unlock;
+	}
+
+	ret = watch->event.size;
+
+	watch->pending =
false;
+
+out_unlock:
+	mutex_unlock(&watch->mutex);
+
+	return ret;
+}
+
+static int vmnotify_release(struct inode *inode, struct file *file)
+{
+	struct vmnotify_watch *watch = file->private_data;
+
+	hrtimer_cancel(&watch->timer);
+
+	kfree(watch);
+
+	return 0;
+}
+
+static const struct file_operations vmnotify_fops = {
+	.poll = vmnotify_poll,
+	.read = vmnotify_read,
+	.release = vmnotify_release,
+};
+
+static struct vmnotify_watch *vmnotify_watch_alloc(void)
+{
+	struct vmnotify_watch *watch;
+
+	watch = kzalloc(sizeof *watch, GFP_KERNEL);
+	if (!watch)
+		return NULL;
+
+	mutex_init(&watch->mutex);
+
+	init_waitqueue_head(&watch->waitq);
+
+	return watch;
+}
+
+static int vmnotify_copy_config(struct vmnotify_config __user *uconfig,
+				struct vmnotify_config *config)
+{
+	int ret;
+
+	ret = copy_from_user(config, uconfig, sizeof(struct vmnotify_config));
+	if (ret)
+		return -EFAULT;
+
+	if (!config->type)
+		return -EINVAL;
+
+	if (config->type & VMNOTIFY_TYPE_SAMPLE) {
+		if (config->sample_period_ns < NSEC_PER_MSEC)
+			return -EINVAL;
+	}
+
+	if (config->type & VMNOTIFY_TYPE_FREE_THRESHOLD) {
+		if (config->free_threshold > VMNOTIFY_MAX_FREE_THRESHOD)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+SYSCALL_DEFINE1(vmnotify_fd,
+		struct vmnotify_config __user *, uconfig)
+{
+	struct vmnotify_watch *watch;
+	struct file *file;
+	int err;
+	int fd;
+
+	watch = vmnotify_watch_alloc();
+	if (!watch)
+		return -ENOMEM;
+
+	err = vmnotify_copy_config(uconfig, &watch->config);
+	if (err)
+		goto err_free;
+
+	fd = get_unused_fd_flags(O_RDONLY);
+	if (fd < 0) {
+		err = fd;
+		goto err_free;
+	}
+
+	file = anon_inode_getfile("[vmnotify]", &vmnotify_fops, watch, O_RDONLY);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_fd;
+	}
+
+	fd_install(fd, file);
+
+	if (watch->config.type & VMNOTIFY_TYPE_SAMPLE)
+		vmnotify_start_timer(watch);
+
+	return fd;
+
+err_fd:
+	put_unused_fd(fd);
+err_free:
+	kfree(watch);
+	return err;
+}
diff --git
a/tools/testing/vmnotify/vmnotify-test.c b/tools/testing/vmnotify/vmnotify-test.c
new file mode 100644
index 0000000..3c6b26d
--- /dev/null
+++ b/tools/testing/vmnotify/vmnotify-test.c
@@ -0,0 +1,68 @@
+#include "../../../include/linux/vmnotify.h"
+
+#if defined(__x86_64__)
+#include "../../../arch/x86/include/asm/unistd.h"
+#endif
+
+#include
+#include
+#include
+#include
+#include
+
+static int sys_vmnotify_fd(struct vmnotify_config *config)
+{
+	config->size = sizeof(*config);
+
+	return syscall(__NR_vmnotify_fd, config);
+}
+
+int main(int argc, char *argv[])
+{
+	struct vmnotify_config config;
+	struct vmnotify_event event;
+	struct pollfd pollfd;
+	int i;
+	int fd;
+
+	config = (struct vmnotify_config) {
+		.type = VMNOTIFY_TYPE_SAMPLE|VMNOTIFY_TYPE_FREE_THRESHOLD,
+		.sample_period_ns = 1000000000L,
+		.free_threshold = 99,
+	};
+
+	fd = sys_vmnotify_fd(&config);
+	if (fd < 0) {
+		perror("vmnotify_fd failed");
+		exit(1);
+	}
+
+	for (i = 0; i < 10; i++) {
+		pollfd.fd = fd;
+		pollfd.events = POLLIN;
+
+		if (poll(&pollfd, 1, -1) < 0) {
+			perror("poll failed");
+			exit(1);
+		}
+
+		memset(&event, 0, sizeof(event));
+
+		if (read(fd, &event, sizeof(event)) < 0) {
+			perror("read failed");
+			exit(1);
+		}
+
+		printf("VM event:\n");
+		printf("\tsize=%lu\n", event.size);
+		printf("\tnr_avail_pages=%Lu\n", event.nr_avail_pages);
+		printf("\tnr_swap_pages=%Lu\n", event.nr_swap_pages);
+		printf("\tnr_free_pages=%Lu\n", event.nr_free_pages);
+	}
+	if (close(fd) < 0) {
+		perror("close failed");
+		exit(1);
+	}
+
+	return 0;
+}
-- 
1.7.6.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4F15CC56.90309@redhat.com>
Date: Tue, 17 Jan 2012 14:30:30
-0500
From: Rik van Riel
To: Pekka Enberg
Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod, KOSAKI Motohiro
Subject: Re: [RFC 1/3] /dev/low_mem_notify
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com>

On 01/17/2012 01:51 PM, Pekka Enberg wrote:
> Hello,
>
> Ok, so here's a proof of concept patch that implements sample-based
> per-process free threshold VM event watching using perf-like syscall
> ABI. I'd really like to see something like this that's much more
> extensible and clean than the /dev based ABIs that people have proposed
> so far.

Looks like a nice extensible interface to me.

The only thing is, I expect we will not want to wake
up processes most of the time, when there is no memory
pressure, because that would just waste battery power
and/or cpu time that could be used for something else.

The desire to avoid such wakeups makes it harder to
wake up processes at arbitrary points set by the API.

Another issue is that we might be running two programs
on the system, each with a different threshold for
"lets free some of my cache". Say one program sets
the threshold at 20% free/cache memory, the other
program at 10%.

We could end up with the first process continually
throwing away its caches, while the second process
never gives its unused memory back to the kernel.

I am not sure what the right thing to do would be...
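
The two-threshold scenario can be made concrete with a few lines of C. This is a sketch only: watcher_fires() is a hypothetical helper that repeats the percentage test from vmnotify_match() in the patch under discussion, and the page counts used below are made-up numbers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper using the same test as vmnotify_match():
 * a watcher fires when free memory, as an integer percentage of
 * available memory, is at or below its threshold.
 */
static bool watcher_fires(uint64_t nr_free, uint64_t nr_avail,
			  unsigned int threshold_pct)
{
	if (nr_avail == 0)
		return false;

	return nr_free * 100 / nr_avail <= threshold_pct;
}
```

With 15 free pages out of 100, watcher_fires(15, 100, 20) is true while watcher_fires(15, 100, 10) is false. As long as the program with the 20% threshold frees enough to keep the system above 10% free, the second program is never asked to give anything back.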
-- 
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Pekka Enberg
To: Rik van Riel
Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti, Andrew Morton, Ronen Hod, KOSAKI Motohiro
Subject: Re: [RFC 1/3] /dev/low_mem_notify
Date: Tue, 17 Jan 2012 21:49:13 +0200
In-Reply-To: <4F15CC56.90309@redhat.com>
References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <4F15CC56.90309@redhat.com>

On Tue, Jan 17, 2012 at 9:30 PM, Rik van Riel wrote:
> Looks like a nice extensible interface to me.
>
> The only thing is, I expect we will not want to wake
> up processes most of the time, when there is no memory
> pressure, because that would just waste battery power
> and/or cpu time that could be used for something else.
>
> The desire to avoid such wakeups makes it harder to
> wake up processes at arbitrary points set by the API.

Sure. You could either bump up the threshold or use Minchan's hooks - or both.

On Tue, Jan 17, 2012 at 9:30 PM, Rik van Riel wrote:
> Another issue is that we might be running two programs
> on the system, each with a different threshold for
> "lets free some of my cache".
Say one program sets > the threshold at 20% free/cache memory, the other > program at 10%. > > We could end up with the first process continually > throwing away its caches, while the second process > never gives its unused memory back to the kernel. > > I am not sure what the right thing to do would be... One option is to use per-process thresholds on RSS, for example, and also support system-wide thresholds. That said, I'd really like to see the N9 and Android policies supported with this ABI. It's much easier to make it generic once we support real-world use cases. Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754672Ab2AQTy3 (ORCPT ); Tue, 17 Jan 2012 14:54:29 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:62424 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750870Ab2AQTy2 (ORCPT ); Tue, 17 Jan 2012 14:54:28 -0500 MIME-Version: 1.0 In-Reply-To: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <4F15CC56.90309@redhat.com> Date: Tue, 17 Jan 2012 21:54:27 +0200 X-Google-Sender-Auth: WThxvw5GsEpk1af3xSld6SjCsXQ Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: Rik van Riel Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 9:49 PM, Pekka Enberg wrote: > That said, I'd really like to see the N9 and Android policies > supported with this ABI. It's much easier to make it generic once we > support real-world use cases. 
If people are interested in hacking on the thing, I pushed the commit in 'vmnotify/core' branch of git://github.com/penberg/linux.git Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755519Ab2AQT5z (ORCPT ); Tue, 17 Jan 2012 14:57:55 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:40364 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753028Ab2AQT5y (ORCPT ); Tue, 17 Jan 2012 14:57:54 -0500 MIME-Version: 1.0 In-Reply-To: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <4F15CC56.90309@redhat.com> Date: Tue, 17 Jan 2012 21:57:53 +0200 X-Google-Sender-Auth: 4RFknTZFLHJQzphAXR_6E57h39w Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: Rik van Riel Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 9:49 PM, Pekka Enberg wrote: >> The desire to avoid such wakeups makes it harder to >> wake up processes at arbitrary points set by the API. > > Sure. You could either bump up the threshold or use Minchan's hooks - or both. 
s/threshold/sample period/g From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756289Ab2AQXIQ (ORCPT ); Tue, 17 Jan 2012 18:08:16 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:58626 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756185Ab2AQXIP (ORCPT ); Tue, 17 Jan 2012 18:08:15 -0500 Date: Wed, 18 Jan 2012 08:08:01 +0900 From: Minchan Kim To: KAMEZAWA Hiroyuki Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod Subject: Re: [RFC 2/3] vmscan hook Message-ID: <20120117230801.GA903@barrios-desktop.redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 07:05:12PM +0900, KAMEZAWA Hiroyuki wrote: > On Tue, 17 Jan 2012 18:13:56 +0900 > Minchan Kim wrote: > > > On Tue, Jan 17, 2012 at 05:39:32PM +0900, KAMEZAWA Hiroyuki wrote: > > > On Tue, 17 Jan 2012 17:13:57 +0900 > > > Minchan Kim wrote: > > > > > > > > > > + /* > > > > + * We want to avoid dropping page cache excessively > > > > + * in no swap system > > > > + */ > > > > + if (nr_swap_pages <= 0) { > > > > + free = zone_page_state(mz->zone, NR_FREE_PAGES); > > > > + file = zone_page_state(mz->zone, NR_ACTIVE_FILE) + > > > > + zone_page_state(mz->zone, NR_INACTIVE_FILE); > > > > + /* > 
> > > +		 * If we have very few page cache pages,
> > > > +		 * notify to user
> > > > +		 */
> > > > +		if (file < free)
> > > > +			low_mem = true;
> > > > +	}
> > >
> > > I can't understand why you think you can check lowmem condition by "file < free".
> >
> > The reason I thought so is I want to maintain some page cache to some degree.
> > But I admit It's very naive heuristic and should be improved.
> >
> > > And I don't think using per-zone data is good.
> > > (I'm not sure how many zones embeded guys using..)
> >
> > Agree. In case of swapless system, we need another heuristic.
> >
> > >
> > > Another idea:
> > > 1. can't we use some technique like cleancache to detect the condition ?
> >
> > I totally forgot cleancache approach. Could you remind that?
> >
>
> Similar to 'victim cache'. Then, cache some clean pages somewhere when
> vmscan pageout it.
>
> page -> vmscan's pageout -> cleancache -> may be discarded.
>
> If a filesystem look up a page which is in a cleancache, cache-hit and
> bring it back to radix-tree. If not, read from disk again.
> And cleancache for swap(frontswap) was posted, too.

I am not sure this can prevent swapout. I think it ends up evicting pages to swap devices.

> >
> > > 2. can't we measure page-in/page-out distance by recording something ?
> >
> > I can't understand your point. What's relation does it with swapout prevent?
> >
>
> If distance between pageout -> pagein is short, it means thrashing.
> For example, recoding the timestamp when the page(mapping, index) was
> paged-out, and check it at page-in.

Our goal is to prevent swapout. By the time we detect thrashing, it's too late.

> >
> > > 3. NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can
> > > ignore the data file cache ?
> >
> > It's good but how do we define some amount?
> > It's very vague but I guess we can get a good idea from that.
> > Perhaps, you already has it.
> >
>
> Hm, a rough idea is...
>
> - we now have rss counter per mm.
> - mapped anon
> - mapped file
> - swapents
>
> Ok, here, add one more counter.
>
> - paged-out file. (I think this can be recorded in pte.)
> +1 when try_to_unmap_file() unmaps it.
> -1 when a page is back or unmapped.
>
> Then, scanning all tasks. Then,
>
>                     mapped_anon + mapped_file
> active_map_ratio = ----------------------------------------------------- * 100
>                     mapped_anon + mapped_file + swapents + paged_out_file
>
> Ok, how to use this value...
>
> Like memcg's threshold notify interface, you can change the mem_notify interface
> to use eventfd() as
>
>
>
> This will inform you an event when active_map_ratio crosses passed threshold.
>
> complicated ?

Yes. :) I want to keep it simple if possible.

> >
> > > 4. how about checking kswapd's busy status ?
> >
> > Could you elaborate on your idea?
> >
>
> I just thought kswapd may not stop when the situation is very bad.

As I said earlier, the goal is to prevent swap. By the time we find kswapd busy, many pages might already have been swapped out, so it's too late.

>
> Thanks,
> -Kame
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756388Ab2AQXUi (ORCPT ); Tue, 17 Jan 2012 18:20:38 -0500 Received: from mail-vx0-f174.google.com ([209.85.220.174]:51119 "EHLO mail-vx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756254Ab2AQXUg (ORCPT ); Tue, 17 Jan 2012 18:20:36 -0500 Date: Wed, 18 Jan 2012 08:20:25 +0900 From: Minchan Kim To: Pekka Enberg Cc: Rik van Riel , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120117232025.GB903@barrios-desktop.redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 08:51:13PM +0200, Pekka Enberg wrote: > Hello, > > Ok, so here's a proof of concept patch that implements sample-base > per-process free threshold VM event watching using perf-like syscall > ABI. I'd really like to see something like this that's much more > extensible and clean than the /dev based ABIs that people have > proposed so far. 
>
> Pekka
>
> ------------------->
>
> From a07f93fdca360b20daef4a5d66f2a5746f31f6a6 Mon Sep 17 00:00:00 2001
> From: Pekka Enberg
> Date: Tue, 17 Jan 2012 17:51:48 +0200
> Subject: [PATCH] vmnotify: VM event notification system
>
> This patch implements a new sys_vmnotify_fd() system call that returns a
> pollable file descriptor that can be used to watch VM events.
>
> For example, to watch for VM event when free memory is below 99% of available
> memory using 1 second sample period, you'd do something like this:
>
> 	struct vmnotify_config config;
> 	struct vmnotify_event event;
> 	struct pollfd pollfd;
> 	int fd;
>
> 	config = (struct vmnotify_config) {
> 		.type = VMNOTIFY_TYPE_SAMPLE|VMNOTIFY_TYPE_FREE_THRESHOLD,
> 		.sample_period_ns = 1000000000L,
> 		.free_threshold = 99,
> 	};
>
> 	fd = sys_vmnotify_fd(&config);
>
> 	pollfd.fd = fd;
> 	pollfd.events = POLLIN;
>
> 	if (poll(&pollfd, 1, -1) < 0) {
> 		perror("poll failed");
> 		exit(1);
> 	}
>
> 	memset(&event, 0, sizeof(event));
>
> 	if (read(fd, &event, sizeof(event)) < 0) {
> 		perror("read failed");
> 		exit(1);
> 	}

Hi Pekka,

I didn't look into your code (will do), but having read the description, I'm still not convinced that we really need a process-specific threshold like 99%. I think an application can find that out by polling /proc/meminfo without this mechanism if it really wants to.

I would like to notify when the system is in trouble with memory pressure, without any process-specific threshold. Of course, an application can't predict that (i.e., an application can learn the system's memory pressure from /proc/meminfo, but it can't know when swapout really happens). The kernel's low memory notification has to give that kind of notification to user space, I think.
> > Signed-off-by: Pekka Enberg > --- > arch/x86/include/asm/unistd_64.h | 2 + > include/linux/vmnotify.h | 44 ++++++ > mm/Kconfig | 6 + > mm/Makefile | 1 + > mm/vmnotify.c | 235 ++++++++++++++++++++++++++++++++ > tools/testing/vmnotify/vmnotify-test.c | 68 +++++++++ > 6 files changed, 356 insertions(+), 0 deletions(-) > create mode 100644 include/linux/vmnotify.h > create mode 100644 mm/vmnotify.c > create mode 100644 tools/testing/vmnotify/vmnotify-test.c > > diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h > index 0431f19..b0928cd 100644 > --- a/arch/x86/include/asm/unistd_64.h > +++ b/arch/x86/include/asm/unistd_64.h > @@ -686,6 +686,8 @@ __SYSCALL(__NR_getcpu, sys_getcpu) > __SYSCALL(__NR_process_vm_readv, sys_process_vm_readv) > #define __NR_process_vm_writev 311 > __SYSCALL(__NR_process_vm_writev, sys_process_vm_writev) > +#define __NR_vmnotify_fd 312 > +__SYSCALL(__NR_vmnotify_fd, sys_vmnotify_fd) > > #ifndef __NO_STUBS > #define __ARCH_WANT_OLD_READDIR > diff --git a/include/linux/vmnotify.h b/include/linux/vmnotify.h > new file mode 100644 > index 0000000..8f8642b > --- /dev/null > +++ b/include/linux/vmnotify.h > @@ -0,0 +1,44 @@ > +#ifndef _LINUX_VMNOTIFY_H > +#define _LINUX_VMNOTIFY_H > + > +#include > + > +enum { > + VMNOTIFY_TYPE_FREE_THRESHOLD = 1ULL << 0, > + VMNOTIFY_TYPE_SAMPLE = 1ULL << 1, > +}; > + > +struct vmnotify_config { > + /* > + * Size of the struct for ABI extensibility. > + */ > + __u32 size; > + > + /* > + * Notification type bitmask > + */ > + __u64 type; > + > + /* > + * Free memory threshold in percentages [1..99] > + */ > + __u32 free_threshold; > + > + /* > + * Sample period in nanoseconds > + */ > + __u64 sample_period_ns; > +}; > + > +struct vmnotify_event { > + /* Size of the struct for ABI extensibility. 
*/ > + __u32 size; > + > + __u64 nr_avail_pages; > + > + __u64 nr_swap_pages; > + > + __u64 nr_free_pages; > +}; > + > +#endif /* _LINUX_VMNOTIFY_H */ > diff --git a/mm/Kconfig b/mm/Kconfig > index 011b110..6631167 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -373,3 +373,9 @@ config CLEANCACHE > in a negligible performance hit. > > If unsure, say Y to enable cleancache > + > +config VMNOTIFY > + bool "Enable VM event notification system" > + default n > + help > + If unsure, say N to disable vmnotify > diff --git a/mm/Makefile b/mm/Makefile > index 50ec00e..e1b5db3 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -51,3 +51,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o > obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o > obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o > obj-$(CONFIG_CLEANCACHE) += cleancache.o > +obj-$(CONFIG_VMNOTIFY) += vmnotify.o > diff --git a/mm/vmnotify.c b/mm/vmnotify.c > new file mode 100644 > index 0000000..6800450 > --- /dev/null > +++ b/mm/vmnotify.c > @@ -0,0 +1,235 @@ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define VMNOTIFY_MAX_FREE_THRESHOD 100 > + > +struct vmnotify_watch { > + struct vmnotify_config config; > + > + struct mutex mutex; > + bool pending; > + struct vmnotify_event event; > + > + /* sampling */ > + struct hrtimer timer; > + > + /* poll */ > + wait_queue_head_t waitq; > +}; > + > +static bool vmnotify_match(struct vmnotify_watch *watch, struct vmnotify_event *event) > +{ > + if (watch->config.type & VMNOTIFY_TYPE_FREE_THRESHOLD) { > + u64 threshold; > + > + if (!event->nr_avail_pages) > + return false; > + > + threshold = event->nr_free_pages * 100 / event->nr_avail_pages; > + if (threshold > watch->config.free_threshold) > + return false; > + } > + > + return true; > +} > + > +static void vmnotify_sample(struct vmnotify_watch *watch) > +{ > + struct vmnotify_event event; > + struct sysinfo si; > + > + memset(&event, 0, sizeof(event)); > + > + 
event.size = sizeof(event); > + event.nr_free_pages = global_page_state(NR_FREE_PAGES); > + > + si_meminfo(&si); > + event.nr_avail_pages = si.totalram; > + > +#ifdef CONFIG_SWAP > + si_swapinfo(&si); > + event.nr_swap_pages = si.totalswap; > +#endif > + > + if (!vmnotify_match(watch, &event)) > + return; > + > + mutex_lock(&watch->mutex); > + > + watch->pending = true; > + > + memcpy(&watch->event, &event, sizeof(event)); > + > + mutex_unlock(&watch->mutex); > +} > + > +static enum hrtimer_restart vmnotify_timer_fn(struct hrtimer *hrtimer) > +{ > + struct vmnotify_watch *watch = container_of(hrtimer, struct vmnotify_watch, timer); > + u64 sample_period = watch->config.sample_period_ns; > + > + vmnotify_sample(watch); > + > + hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period)); > + > + wake_up(&watch->waitq); > + > + return HRTIMER_RESTART; > +} > + > +static void vmnotify_start_timer(struct vmnotify_watch *watch) > +{ > + u64 sample_period = watch->config.sample_period_ns; > + > + hrtimer_init(&watch->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); > + watch->timer.function = vmnotify_timer_fn; > + > + hrtimer_start(&watch->timer, ns_to_ktime(sample_period), HRTIMER_MODE_REL_PINNED); > +} > + > +static unsigned int vmnotify_poll(struct file *file, poll_table *wait) > +{ > + struct vmnotify_watch *watch = file->private_data; > + unsigned int events = 0; > + > + poll_wait(file, &watch->waitq, wait); > + > + mutex_lock(&watch->mutex); > + > + if (watch->pending) > + events |= POLLIN; > + > + mutex_unlock(&watch->mutex); > + > + return events; > +} > + > +static ssize_t vmnotify_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) > +{ > + struct vmnotify_watch *watch = file->private_data; > + int ret = 0; > + > + mutex_lock(&watch->mutex); > + > + if (!watch->pending) > + goto out_unlock; > + > + if (copy_to_user(buf, &watch->event, sizeof(struct vmnotify_event))) { > + ret = -EFAULT; > + goto out_unlock; > + } > + > + ret = watch->event.size; > + > 
+ watch->pending = false; > + > +out_unlock: > + mutex_unlock(&watch->mutex); > + > + return ret; > +} > + > +static int vmnotify_release(struct inode *inode, struct file *file) > +{ > + struct vmnotify_watch *watch = file->private_data; > + > + hrtimer_cancel(&watch->timer); > + > + kfree(watch); > + > + return 0; > +} > + > +static const struct file_operations vmnotify_fops = { > + .poll = vmnotify_poll, > + .read = vmnotify_read, > + .release = vmnotify_release, > +}; > + > +static struct vmnotify_watch *vmnotify_watch_alloc(void) > +{ > + struct vmnotify_watch *watch; > + > + watch = kzalloc(sizeof *watch, GFP_KERNEL); > + if (!watch) > + return NULL; > + > + mutex_init(&watch->mutex); > + > + init_waitqueue_head(&watch->waitq); > + > + return watch; > +} > + > +static int vmnotify_copy_config(struct vmnotify_config __user *uconfig, > + struct vmnotify_config *config) > +{ > + int ret; > + > + ret = copy_from_user(config, uconfig, sizeof(struct vmnotify_config)); > + if (ret) > + return -EFAULT; > + > + if (!config->type) > + return -EINVAL; > + > + if (config->type & VMNOTIFY_TYPE_SAMPLE) { > + if (config->sample_period_ns < NSEC_PER_MSEC) > + return -EINVAL; > + } > + > + if (config->type & VMNOTIFY_TYPE_FREE_THRESHOLD) { > + if (config->free_threshold > VMNOTIFY_MAX_FREE_THRESHOD) > + return -EINVAL; > + } > + > + return 0; > +} > + > +SYSCALL_DEFINE1(vmnotify_fd, > + struct vmnotify_config __user *, uconfig) > +{ > + struct vmnotify_watch *watch; > + struct file *file; > + int err; > + int fd; > + > + watch = vmnotify_watch_alloc(); > + if (!watch) > + return -ENOMEM; > + > + err = vmnotify_copy_config(uconfig, &watch->config); > + if (err) > + goto err_free; > + > + fd = get_unused_fd_flags(O_RDONLY); > + if (fd < 0) { > + err = fd; > + goto err_free; > + } > + > + file = anon_inode_getfile("[vmnotify]", &vmnotify_fops, watch, O_RDONLY); > + if (IS_ERR(file)) { > + err = PTR_ERR(file); > + goto err_fd; > + } > + > + fd_install(fd, file); > + > + if 
(watch->config.type & VMNOTIFY_TYPE_SAMPLE) > + vmnotify_start_timer(watch); > + > + return fd; > + > +err_fd: > + put_unused_fd(fd); > +err_free: > + kfree(watch); > + return err; > +} > diff --git a/tools/testing/vmnotify/vmnotify-test.c b/tools/testing/vmnotify/vmnotify-test.c > new file mode 100644 > index 0000000..3c6b26d > --- /dev/null > +++ b/tools/testing/vmnotify/vmnotify-test.c > @@ -0,0 +1,68 @@ > +#include "../../../include/linux/vmnotify.h" > + > +#if defined(__x86_64__) > +#include "../../../arch/x86/include/asm/unistd.h" > +#endif > + > +#include > +#include > +#include > +#include > +#include > + > +static int sys_vmnotify_fd(struct vmnotify_config *config) > +{ > + config->size = sizeof(*config); > + > + return syscall(__NR_vmnotify_fd, config); > +} > + > +int main(int argc, char *argv[]) > +{ > + struct vmnotify_config config; > + struct vmnotify_event event; > + struct pollfd pollfd; > + int i; > + int fd; > + > + config = (struct vmnotify_config) { > + .type = VMNOTIFY_TYPE_SAMPLE|VMNOTIFY_TYPE_FREE_THRESHOLD, > + .sample_period_ns = 1000000000L, > + .free_threshold = 99, > + }; > + > + fd = sys_vmnotify_fd(&config); > + if (fd < 0) { > + perror("vmnotify_fd failed"); > + exit(1); > + } > + > + for (i = 0; i < 10; i++) { > + pollfd.fd = fd; > + pollfd.events = POLLIN; > + > + if (poll(&pollfd, 1, -1) < 0) { > + perror("poll failed"); > + exit(1); > + } > + > + memset(&event, 0, sizeof(event)); > + > + if (read(fd, &event, sizeof(event)) < 0) { > + perror("read failed"); > + exit(1); > + } > + > + printf("VM event:\n"); > + printf("\tsize=%lu\n", event.size); > + printf("\tnr_avail_pages=%Lu\n", event.nr_avail_pages); > + printf("\tnr_swap_pages=%Lu\n", event.nr_swap_pages); > + printf("\tnr_free_pages=%Lu\n", event.nr_free_pages); > + } > + if (close(fd) < 0) { > + perror("close failed"); > + exit(1); > + } > + > + return 0; > +} > -- > 1.7.6.4 > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by 
vger.kernel.org via listexpand id S1754937Ab2ARATp (ORCPT ); Tue, 17 Jan 2012 19:19:45 -0500 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:33259 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751804Ab2ARATn (ORCPT ); Tue, 17 Jan 2012 19:19:43 -0500 X-SecurityPolicyCheck-FJ: OK by FujitsuOutboundMailChecker v1.3.1 Date: Wed, 18 Jan 2012 09:18:24 +0900 From: KAMEZAWA Hiroyuki To: Minchan Kim Cc: linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, Rik van Riel , mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod Subject: Re: [RFC 2/3] vmscan hook Message-Id: <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20120117230801.GA903@barrios-desktop.redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> <20120117230801.GA903@barrios-desktop.redhat.com> Organization: FUJITSU Co. LTD. X-Mailer: Sylpheed 3.1.1 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 18 Jan 2012 08:08:01 +0900 Minchan Kim wrote: > > > > > > > > 2. can't we measure page-in/page-out distance by recording something ? > > > > > > I can't understand your point. What's relation does it with swapout prevent? > > > > > > > If distance between pageout -> pagein is short, it means thrashing. > > For example, recoding the timestamp when the page(mapping, index) was > > paged-out, and check it at page-in. > > Our goal is prevent swapout. When we found thrashing, it's too late. > If you want to prevent swap-out, don't swapon any. That's all. 
Then, you can check the number of FILE_CACHE and have threshold. Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756056Ab2ARHRG (ORCPT ); Wed, 18 Jan 2012 02:17:06 -0500 Received: from mail-lpp01m010-f46.google.com ([209.85.215.46]:64353 "EHLO mail-lpp01m010-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755302Ab2ARHRE (ORCPT ); Wed, 18 Jan 2012 02:17:04 -0500 Date: Wed, 18 Jan 2012 09:16:49 +0200 (EET) From: Pekka Enberg X-X-Sender: penberg@tux.localdomain To: Minchan Kim cc: Rik van Riel , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro Subject: Re: [RFC 1/3] /dev/low_mem_notify In-Reply-To: <20120117232025.GB903@barrios-desktop.redhat.com> Message-ID: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <20120117232025.GB903@barrios-desktop.redhat.com> User-Agent: Alpine 2.02 (LFD 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 18 Jan 2012, Minchan Kim wrote: > I didn't look into your code(will do) but as I read description, > still I don't convince we need really some process specific threshold like 99% > I think application can know it by polling /proc/meminfo without this mechanism > if they really want. I'm not sure if we need arbitrary threshold either. However, we need to support the following cases: - We're about to swap - We're about to run out of memory - We're about to start OOM killing and I don't think your patch solves that. 
One possibility is to implement:

  VMNOTIFY_TYPE_ABOUT_TO_SWAP
  VMNOTIFY_TYPE_ABOUT_TO_OOM
  VMNOTIFY_TYPE_ABOUT_TO_OOM_KILL

and maybe rip out support for arbitrary thresholds. Does that sound more reasonable?

As for polling /proc/meminfo, I'd much rather deliver stats as part of vmnotify_read() because it's easier to extend the ABI than to add new fields to /proc/meminfo.

On Wed, 18 Jan 2012, Minchan Kim wrote:
> I would like to notify when system has a trobule with memory pressure without
> some process specific threshold. Of course, applicatoin can't expect it.(ie,
> application can know system memory pressure by /proc/meminfo but it can't know
> when swapout really happens). Kernel low mem notify have to give such notification
> to user space, I think.

It should be simple to add support for VMNOTIFY_TYPE_MEM_PRESSURE that uses your hooks.

			Pekka

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755966Ab2ARHto (ORCPT ); Wed, 18 Jan 2012 02:49:44 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:50754 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753524Ab2ARHtn (ORCPT ); Wed, 18 Jan 2012 02:49:43 -0500 Date: Wed, 18 Jan 2012 16:49:30 +0900 From: Minchan Kim To: Pekka Enberg Cc: Rik van Riel , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120118074930.GA18621@barrios-desktop.redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <20120117232025.GB903@barrios-desktop.redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21
(2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jan 18, 2012 at 09:16:49AM +0200, Pekka Enberg wrote:
> On Wed, 18 Jan 2012, Minchan Kim wrote:
> >I didn't look into your code(will do) but as I read description,
> >still I don't convince we need really some process specific threshold like 99%
> >I think application can know it by polling /proc/meminfo without this mechanism
> >if they really want.
>
> I'm not sure if we need arbitrary threshold either. However, we need
> to support the following cases:
>
>   - We're about to swap
>
>   - We're about to run out of memory
>
>   - We're about to start OOM killing
>
> and I don't think your patch solves that. One possibility is to implement:

I think my patch could be extended to cover those, but your ABI looks better to me than my approach.

>
>   VMNOTIFY_TYPE_ABOUT_TO_SWAP
>   VMNOTIFY_TYPE_ABOUT_TO_OOM
>   VMNOTIFY_TYPE_ABOUT_TO_OOM_KILL

Yes. We can define some levels.

1. page cache reclaim
2. code page reclaim
3. anonymous page swap out
4. OOM kill.

An application might handle each memory pressure level differently.

>
> and maybe rip out support for arbitrary thresholds. Does that more
> reasonable?

Currently, the Nokia people seem to want process-specific thresholds, so we might need them.

>
> As for polling /proc/meminfo, I'd much rather deliver stats as part
> of vmnotify_read() because it's easier to extend the ABI rather than
> adding new fields to /proc/meminfo.

Agree.

>
> On Wed, 18 Jan 2012, Minchan Kim wrote:
> >I would like to notify when system has a trobule with memory pressure without
> >some process specific threshold. Of course, applicatoin can't expect it.(ie,
> >application can know system memory pressure by /proc/meminfo but it can't know
> >when swapout really happens). Kernel low mem notify have to give such notification
> >to user space, I think.
>
> It should be simple to add support for VMNOTIFY_TYPE_MEM_PRESSURE
> that uses your hooks.

Indeed.
>
> Pekka

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756686Ab2ARJJY (ORCPT ); Wed, 18 Jan 2012 04:09:24 -0500 Received: from smtp.nokia.com ([147.243.128.24]:38865 "EHLO mgw-da01.nokia.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756165Ab2ARJJV convert rfc822-to-8bit (ORCPT ); Wed, 18 Jan 2012 04:09:21 -0500 From: To: , CC: , , , , , , , , , , , Subject: RE: [RFC 1/3] /dev/low_mem_notify Thread-Topic: [RFC 1/3] /dev/low_mem_notify Thread-Index: AQHM1PAILpEwsgBJQ0+ebltvcfq8MpYQOdkAgAB3jICAACXvgIAA9v4A Date: Wed, 18 Jan 2012 09:06:06 +0000 Message-ID: <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [172.21.23.171] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 X-OriginalArrivalTime: 18 Jan 2012 09:06:08.0068 (UTC) FILETIME=[6A4BA040:01CCD5C0] X-Nokia-AV: Clean Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

Just a couple of observations below, which may be wrong.

> -----Original Message-----
> From: Pekka Enberg [mailto:penberg@gmail.com] On Behalf Of ext Pekka
> Enberg
> Sent: 17 January, 2012 20:51
....
> +struct vmnotify_config {
> +	/*
> +	 * Size of the struct for ABI extensibility.
> +	 */
> +	__u32 size;
> +
> +	/*
> +	 * Notification type bitmask
> +	 */
> +	__u64 type;
> +
> +	/*
> +	 * Free memory threshold in percentages [1..99]
> +	 */
> +	__u32 free_threshold;

Would it be possible not to use percentages for thresholds? Accounting in pages is not so difficult, even for user space.
Also, looking at vmnotify_match(), I understand that events are propagated to user space only when the threshold trigger changes state from 0 to 1, but not back. The 1 -> 0 transition is a very useful event as well.

Would it be possible to apply the threshold to selected value(s), e.g. according to enum zone_stat_item, because the kinds of memory to track could be different? E.g. for tracking paging activity, NR_ACTIVE_ANON and NR_ACTIVE_FILE could be interesting, not only free memory.

> +
> +	/*
> +	 * Sample period in nanoseconds
> +	 */
> +	__u64 sample_period_ns;
> +};
> +
....
> +struct vmnotify_event {
> +	/* Size of the struct for ABI extensibility. */
> +	__u32 size;
> +
> +	__u64 nr_avail_pages;
> +
> +	__u64 nr_swap_pages;
> +
> +	__u64 nr_free_pages;
> +};

Two fields here are most likely session-constant (nr_avail_pages and nr_swap_pages), so it does not make much sense to report them in every event. If we have memory/swap hotplug, user space can use the sysinfo() call.

> +static void vmnotify_sample(struct vmnotify_watch *watch) {
...
> +	si_meminfo(&si);
> +	event.nr_avail_pages = si.totalram;
> +
> +#ifdef CONFIG_SWAP
> +	si_swapinfo(&si);
> +	event.nr_swap_pages = si.totalswap;
> +#endif
> +

Why not use global_page_state() directly? si_meminfo() and especially si_swapinfo() are quite expensive calls.

> +static void vmnotify_start_timer(struct vmnotify_watch *watch) {
> +	u64 sample_period = watch->config.sample_period_ns;
> +
> +	hrtimer_init(&watch->timer, CLOCK_MONOTONIC,
> HRTIMER_MODE_REL);
> +	watch->timer.function = vmnotify_timer_fn;
> +
> +	hrtimer_start(&watch->timer, ns_to_ktime(sample_period),
> +HRTIMER_MODE_REL_PINNED); }

Do I understand correctly that you allocate a timer for every user-space client and propagate events at every specified interval? What will happen to the system if we have a timer but need to turn the CPU off? The timer must not be a reason to wake up if user space is sleeping.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757243Ab2ARJPs (ORCPT ); Wed, 18 Jan 2012 04:15:48 -0500 Received: from mail-tul01m020-f174.google.com ([209.85.214.174]:63083 "EHLO mail-tul01m020-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756843Ab2ARJPm (ORCPT ); Wed, 18 Jan 2012 04:15:42 -0500 MIME-Version: 1.0 In-Reply-To: <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> Date: Wed, 18 Jan 2012 11:15:41 +0200 X-Google-Sender-Auth: d8xVX3HLIe9OovWpAKNHFUW2PNA Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: leonid.moiseichuk@nokia.com Cc: riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 18, 2012 at 11:06 AM, wrote: > Would be possible to not use percents for thesholds? Accounting in pages even > not so difficult to user-space. How does that work with memory hotplug? On Wed, Jan 18, 2012 at 11:06 AM, wrote: > Also, looking on vmnotify_match I understand that events propagated to > user-space only in case threshold trigger change state from 0 to 1 but not > back, 1-> 0 is very useful event as well. > > Would be possible to use for threshold pointed value(s) e.g. according to > enum zone_state_item, because kinds of memory to track could be different? > E.g. 
to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE could be > interesting, not only free. I don't think there's anything in the ABI that would prevent that. >> +struct vmnotify_event { >> + /* Size of the struct for ABI extensibility. */ >> + __u32 size; >> + >> + __u64 nr_avail_pages; >> + >> + __u64 nr_swap_pages; >> + >> + __u64 nr_free_pages; >> +}; > > Two fields here most likely session-constant, (nr_avail_pages and > nr_swap_pages), seems not much sense to report them in every event. If we > have memory/swap hotplug user-space can use sysinfo() call. I actually changed the ABI to look like this: struct vmnotify_event { /* * Size of the struct for ABI extensibility. */ __u32 size; __u64 attrs; __u64 attr_values[]; }; So userspace can decide which fields to include in notifications. On Wed, Jan 18, 2012 at 11:06 AM, wrote: >> +static void vmnotify_sample(struct vmnotify_watch *watch) { > ... >> + si_meminfo(&si); >> + event.nr_avail_pages = si.totalram; >> + >> +#ifdef CONFIG_SWAP >> + si_swapinfo(&si); >> + event.nr_swap_pages = si.totalswap; >> +#endif >> + > > Why not to use global_page_state() directly? si_meminfo() and especial > si_swapinfo are quite expensive call. Sure, we can do that. Feel free to send a patch :-). >> +static void vmnotify_start_timer(struct vmnotify_watch *watch) { >> + u64 sample_period = watch->config.sample_period_ns; >> + >> + hrtimer_init(&watch->timer, CLOCK_MONOTONIC, >> HRTIMER_MODE_REL); >> + watch->timer.function = vmnotify_timer_fn; >> + >> + hrtimer_start(&watch->timer, ns_to_ktime(sample_period), >> +HRTIMER_MODE_REL_PINNED); } > > Do I understand correct you allocate timer for every user-space client and > propagate events every pointed interval? What will happened with system if > we have a timer but need to turn CPU off? The timer must not be a reason to > wakeup if user-space is sleeping. No idea what happens. The sampling code is just a proof of concept thing and I expect it to be buggy as hell. 
:-) Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757316Ab2ARJn6 (ORCPT ); Wed, 18 Jan 2012 04:43:58 -0500 Received: from smtp.nokia.com ([147.243.1.48]:49352 "EHLO mgw-sa02.nokia.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757104Ab2ARJn4 convert rfc822-to-8bit (ORCPT ); Wed, 18 Jan 2012 04:43:56 -0500 From: To: CC: , , , , , , , , , , , , Subject: RE: [RFC 1/3] /dev/low_mem_notify Thread-Topic: [RFC 1/3] /dev/low_mem_notify Thread-Index: AQHM1PAILpEwsgBJQ0+ebltvcfq8MpYQOdkAgAB3jICAACXvgIAA9v4A///6iYCAABO3AA== Date: Wed, 18 Jan 2012 09:41:41 +0000 Message-ID: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [172.21.23.171] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 X-OriginalArrivalTime: 18 Jan 2012 09:41:42.0588 (UTC) FILETIME=[629187C0:01CCD5C5] X-Nokia-AV: Clean Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > -----Original Message----- > From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext > Pekka Enberg > Sent: 18 January, 2012 11:16 ... > > Would be possible to not use percents for thesholds? Accounting in pages > even > > not so difficult to user-space. > > How does that work with memory hotplug? Not worse than %%. For example you had 10% free memory threshold for 512 MB RAM meaning 51.2 MB in absolute number. Then hotplug turned off 256 MB, you for sure must update threshold for %% because these 10% for 25.6 MB most likely will be not suitable for different operating mode. 
Using pages makes calculations must simpler. > > On Wed, Jan 18, 2012 at 11:06 AM, wrote: > > Also, looking on vmnotify_match I understand that events propagated to > > user-space only in case threshold trigger change state from 0 to 1 but not > > back, 1-> 0 is very useful event as well (*) > > > > Would be possible to use for threshold pointed value(s) e.g. according to > > enum zone_state_item, because kinds of memory to track could be > different? > > E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE > could be > > interesting, not only free. > > I don't think there's anything in the ABI that would prevent that. If this statement also related my question (*) I have to point need to track attributes history, otherwise user-space will be constantly kicked with updates. > I actually changed the ABI to look like this: > > struct vmnotify_event { > /* > * Size of the struct for ABI extensibility. > */ > __u32 size; > > __u64 attrs; > > __u64 attr_values[]; > }; > > So userspace can decide which fields to include in notifications. Good. But how you can provide current status of attributes to user-space? Need to have read() call support to deliver all supported attr_values[] on demand. > >> + > >> +#ifdef CONFIG_SWAP > >> + si_swapinfo(&si); > >> + event.nr_swap_pages = si.totalswap; > >> +#endif > >> + > > > > Why not to use global_page_state() directly? si_meminfo() and especial > > si_swapinfo are quite expensive call. > > Sure, we can do that. Feel free to send a patch :-). When I see code because from emails it is quite difficult to understand. For short-term I need to focus on integration "memnotify" version internally which is kind of work for me already and provides all required interfaces n9 needs. Btw, when API starts to work with pointed thresholds logically it is not anymore low_mem_notify, you need to invent some other name. > No idea what happens. The sampling code is just a proof of concept thing and > I expect it to be buggy as hell. 
:-) > > Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753090Ab2ARKkM (ORCPT ); Wed, 18 Jan 2012 05:40:12 -0500 Received: from mail-tul01m020-f174.google.com ([209.85.214.174]:52640 "EHLO mail-tul01m020-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752719Ab2ARKkK (ORCPT ); Wed, 18 Jan 2012 05:40:10 -0500 MIME-Version: 1.0 In-Reply-To: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> Date: Wed, 18 Jan 2012 12:40:10 +0200 X-Google-Sender-Auth: m39lzZJc9197wyBHMq5yVhZp3Yw Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: leonid.moiseichuk@nokia.com Cc: riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 18, 2012 at 11:41 AM, wrote: >> -----Original Message----- >> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext >> Pekka Enberg >> Sent: 18 January, 2012 11:16 > ... >> > Would be possible to not use percents for thesholds? Accounting in pages >> even >> > not so difficult to user-space. >> >> How does that work with memory hotplug? > > Not worse than %%. For example you had 10% free memory threshold for 512 MB > RAM meaning 51.2 MB in absolute number. 
Then hotplug turned off 256 MB, you > for sure must update threshold for %% because these 10% for 25.6 MB most > likely will be not suitable for different operating mode. > Using pages makes calculations must simpler. Right. Does threshold in percentages make any sense then? Is it enough to use number of free pages? On Wed, Jan 18, 2012 at 11:06 AM, wrote: >> > Also, looking on vmnotify_match I understand that events propagated to >> > user-space only in case threshold trigger change state from 0 to 1 but not >> > back, 1-> 0 is very useful event as well > (*) > >> > >> > Would be possible to use for threshold pointed value(s) e.g. according to >> > enum zone_state_item, because kinds of memory to track could be >> different? >> > E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE >> could be >> > interesting, not only free. >> >> I don't think there's anything in the ABI that would prevent that. > > If this statement also related my question (*) I have to point need to track > attributes history, otherwise user-space will be constantly kicked with > updates. Well sure, I think it makes sense to support state change to both directions. > When I see code because from emails it is quite difficult to understand. For > short-term I need to focus on integration "memnotify" version internally > which is kind of work for me already and provides all required interfaces n9 > needs. Sure. I'm only talking about mainline here. > Btw, when API starts to work with pointed thresholds logically it is not Definitely, it's about generic VM event notification now. 
Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753380Ab2ARKpE (ORCPT ); Wed, 18 Jan 2012 05:45:04 -0500 Received: from smtp.nokia.com ([147.243.1.47]:28752 "EHLO mgw-sa01.nokia.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752753Ab2ARKpC convert rfc822-to-8bit (ORCPT ); Wed, 18 Jan 2012 05:45:02 -0500 From: To: CC: , , , , , , , , , , , , Subject: RE: [RFC 1/3] /dev/low_mem_notify Thread-Topic: [RFC 1/3] /dev/low_mem_notify Thread-Index: AQHM1PAILpEwsgBJQ0+ebltvcfq8MpYQOdkAgAB3jICAACXvgIAA9v4A///6iYCAABO3AIAAA+QAgAARAJA= Date: Wed, 18 Jan 2012 10:44:13 +0000 Message-ID: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [172.21.23.171] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 X-OriginalArrivalTime: 18 Jan 2012 10:44:14.0431 (UTC) FILETIME=[1ED756F0:01CCD5CE] X-Nokia-AV: Clean Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > -----Original Message----- > From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext > Pekka Enberg > Sent: 18 January, 2012 12:40 ... > > Not worse than %%. For example you had 10% free memory threshold for > > 512 MB RAM meaning 51.2 MB in absolute number. Then hotplug turned > > off 256 MB, you for sure must update threshold for %% because these > > 10% for 25.6 MB most likely will be not suitable for different operating > mode. > > Using pages makes calculations must simpler. > > Right. 
Does threshold in percentages make any sense then? Is it enough to > use number of free pages? Paul Mundt noticed that and we stopped use percentage in 2006 for n770 update. He was right. Percents are useless and do not correlate with other kernel APIs like sysinfo(). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757615Ab2AROSG (ORCPT ); Wed, 18 Jan 2012 09:18:06 -0500 Received: from mx1.redhat.com ([209.132.183.28]:10837 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757466Ab2AROSE (ORCPT ); Wed, 18 Jan 2012 09:18:04 -0500 Message-ID: <4F16D46D.5080000@redhat.com> Date: Wed, 18 Jan 2012 09:17:17 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0 MIME-Version: 1.0 To: KAMEZAWA Hiroyuki CC: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod Subject: Re: [RFC 2/3] vmscan hook References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> <20120117230801.GA903@barrios-desktop.redhat.com> <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote: > On Wed, 18 Jan 2012 08:08:01 +0900 > Minchan Kim wrote: > >>>>> 2. can't we measure page-in/page-out distance by recording something ? >>>> >>>> I can't understand your point. 
What's relation does it with swapout prevent? >>>> >>> >>> If distance between pageout -> pagein is short, it means thrashing. >>> For example, recoding the timestamp when the page(mapping, index) was >>> paged-out, and check it at page-in. >> >> Our goal is prevent swapout. When we found thrashing, it's too late. > > If you want to prevent swap-out, don't swapon any. That's all. > Then, you can check the number of FILE_CACHE and have threshold. I think you are getting hung up on a word here. As I understand it, the goal is to push out the point where we start doing heavier swap IO, allowing us to overcommit memory more heavily before things start really slowing down. -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932142Ab2ARObk (ORCPT ); Wed, 18 Jan 2012 09:31:40 -0500 Received: from mx1.redhat.com ([209.132.183.28]:46184 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757654Ab2ARObj (ORCPT ); Wed, 18 Jan 2012 09:31:39 -0500 Message-ID: <4F16D79C.2020402@redhat.com> Date: Wed, 18 Jan 2012 09:30:52 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0 MIME-Version: 1.0 To: leonid.moiseichuk@nokia.com CC: penberg@kernel.org, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> In-Reply-To: <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> Content-Type: text/plain; 
charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/18/2012 04:06 AM, leonid.moiseichuk@nokia.com wrote: > Would be possible to use for threshold pointed value(s) e.g. according to enum zone_state_item, because kinds of memory to track could be different? > E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE could be interesting, not only free. That seems like a horrible idea, because there is no guarantee that the kernel will continue to use NR_ACTIVE_ANON and NR_ACTIVE_FILE internally in the future. What is exported to userspace must be somewhat independent of the specifics of how the kernel implements things internally. -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757917Ab2ARP3f (ORCPT ); Wed, 18 Jan 2012 10:29:35 -0500 Received: from mail-tul01m020-f174.google.com ([209.85.214.174]:57282 "EHLO mail-tul01m020-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757890Ab2ARP3e (ORCPT ); Wed, 18 Jan 2012 10:29:34 -0500 MIME-Version: 1.0 In-Reply-To: <4F16D79C.2020402@redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <4F16D79C.2020402@redhat.com> Date: Wed, 18 Jan 2012 17:29:33 +0200 X-Google-Sender-Auth: bEy302ISw1EyhKqFxUXtLkei-Eo Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: Rik van Riel Cc: leonid.moiseichuk@nokia.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, 
kosaki.motohiro@jp.fujitsu.com Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 18, 2012 at 4:30 PM, Rik van Riel wrote: > That seems like a horrible idea, because there is no guarantee that > the kernel will continue to use NR_ACTIVE_ANON and NR_ACTIVE_FILE > internally in the future. > > What is exported to userspace must be somewhat independent of the > specifics of how the kernel implements things internally. Exactly, that's what I'm also interested in. Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755412Ab2ARXfT (ORCPT ); Wed, 18 Jan 2012 18:35:19 -0500 Received: from mx1.redhat.com ([209.132.183.28]:27405 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752638Ab2ARXfS (ORCPT ); Wed, 18 Jan 2012 18:35:18 -0500 Message-ID: <4F175706.8000808@redhat.com> Date: Thu, 19 Jan 2012 01:34:30 +0200 From: Ronen Hod User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0 MIME-Version: 1.0 To: leonid.moiseichuk@nokia.com CC: penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> In-Reply-To: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> 
Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On 01/18/2012 12:44 PM, leonid.moiseichuk@nokia.com wrote:
>> -----Original Message-----
>> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext
>> Pekka Enberg
>> Sent: 18 January, 2012 12:40
> ...
>>> Not worse than %%. For example you had 10% free memory threshold for
>>> 512 MB RAM meaning 51.2 MB in absolute number. Then hotplug turned
>>> off 256 MB, you for sure must update threshold for %% because these
>>> 10% for 25.6 MB most likely will be not suitable for different operating
>> mode.
>>> Using pages makes calculations must simpler.
>> Right. Does threshold in percentages make any sense then? Is it enough to
>> use number of free pages?
> Paul Mundt noticed that and we stopped use percentage in 2006 for n770 update.
> He was right.
> Percents are useless and do not correlate with other kernel APIs like sysinfo().

I believe it would be best if the kernel published an ideal number_of_free_pages (in /proc/meminfo or wherever). Such a number is easy to work with, since freeing pages is exactly what applications do. Applications could consult this number from their garbage collector, or before allocating memory, even if they did not get a notification. It is also useful when several applications free memory at the same time.

Ronen.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755825Ab2ASC0t (ORCPT ); Wed, 18 Jan 2012 21:26:49 -0500 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:38754 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752940Ab2ASC0s (ORCPT ); Wed, 18 Jan 2012 21:26:48 -0500 X-SecurityPolicyCheck-FJ: OK by FujitsuOutboundMailChecker v1.3.1 Date: Thu, 19 Jan 2012 11:25:28 +0900 From: KAMEZAWA Hiroyuki To: Rik van Riel Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod Subject: Re: [RFC 2/3] vmscan hook Message-Id: <20120119112528.eda78467.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <4F16D46D.5080000@redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> <20120117230801.GA903@barrios-desktop.redhat.com> <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> <4F16D46D.5080000@redhat.com> Organization: FUJITSU Co. LTD.
X-Mailer: Sylpheed 3.1.1 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 18 Jan 2012 09:17:17 -0500
Rik van Riel wrote:

> On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote:
> > On Wed, 18 Jan 2012 08:08:01 +0900
> > Minchan Kim wrote:
> >
> >>>>> 2. can't we measure page-in/page-out distance by recording something ?
> >>>>
> >>>> I can't understand your point. What's relation does it with swapout prevent?
> >>>>
> >>>
> >>> If distance between pageout -> pagein is short, it means thrashing.
> >>> For example, recoding the timestamp when the page(mapping, index) was
> >>> paged-out, and check it at page-in.
> >>
> >> Our goal is prevent swapout. When we found thrashing, it's too late.
> >
> > If you want to prevent swap-out, don't swapon any. That's all.
> > Then, you can check the number of FILE_CACHE and have threshold.
>
> I think you are getting hung up on a word here.
>
> As I understand it, the goal is to push out the point where
> we start doing heavier swap IO, allowing us to overcommit
> memory more heavily before things start really slowing down.
>

Yes. Hmm, considering that the issue is slowdown, time-based values such as
 - 'cpu time used for memory reclaim'
 - 'latency of page allocation'
 - 'application execution speed' ?
may be a better score to watch than just the LRU stats.
Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752993Ab2ASHZU (ORCPT ); Thu, 19 Jan 2012 02:25:20 -0500 Received: from mail-lpp01m010-f46.google.com ([209.85.215.46]:64313 "EHLO mail-lpp01m010-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752548Ab2ASHZS (ORCPT ); Thu, 19 Jan 2012 02:25:18 -0500 Date: Thu, 19 Jan 2012 09:25:03 +0200 (EET) From: Pekka Enberg X-X-Sender: penberg@tux.localdomain To: Ronen Hod cc: leonid.moiseichuk@nokia.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify In-Reply-To: <4F175706.8000808@redhat.com> Message-ID: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> User-Agent: Alpine 2.02 (LFD 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 19 Jan 2012, Ronen Hod wrote: > I believe that it will be best if the kernel publishes an ideal > number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to > work with since this is what applications do, they free pages. 
Applications > will be able to refer to this number from their garbage collector, or before > allocating memory also if they did not get a notification, and it is also > useful if several applications free memory at the same time. Isn't /proc/sys/vm/min_free_kbytes pretty much just that? Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753266Ab2ASHel (ORCPT ); Thu, 19 Jan 2012 02:34:41 -0500 Received: from mail-lpp01m010-f46.google.com ([209.85.215.46]:45077 "EHLO mail-lpp01m010-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752535Ab2ASHek (ORCPT ); Thu, 19 Jan 2012 02:34:40 -0500 Date: Thu, 19 Jan 2012 09:34:34 +0200 (EET) From: Pekka Enberg X-X-Sender: penberg@tux.localdomain To: leonid.moiseichuk@nokia.com cc: riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: RE: [RFC 1/3] /dev/low_mem_notify In-Reply-To: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> Message-ID: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> User-Agent: Alpine 2.02 (LFD 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 18 Jan 2012, leonid.moiseichuk@nokia.com wrote: > Paul Mundt noticed that and we stopped use percentage in 2006 for 
n770 update. > He was right. > Percents are useless and do not correlate with other kernel APIs like sysinfo(). I changed the code to use number of pages. Thanks! Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755849Ab2ASJGp (ORCPT ); Thu, 19 Jan 2012 04:06:45 -0500 Received: from mx1.redhat.com ([209.132.183.28]:51298 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752890Ab2ASJGj (ORCPT ); Thu, 19 Jan 2012 04:06:39 -0500 Message-ID: <4F17DCED.4020908@redhat.com> Date: Thu, 19 Jan 2012 11:05:49 +0200 From: Ronen Hod User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0 MIME-Version: 1.0 To: Pekka Enberg CC: leonid.moiseichuk@nokia.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/19/2012 09:25 AM, Pekka Enberg wrote: > On Thu, 19 Jan 2012, Ronen Hod wrote: >> I believe that it will be best if the kernel publishes an ideal number_of_free_pages (in /proc/meminfo or whatever). 
Such number is easy to work with since this is what applications do, they free pages. Applications will be able to refer to this number from their garbage collector, or before allocating memory also if they did not get a notification, and it is also useful if several applications free memory at the same time. > > Isn't > > /proc/sys/vm/min_free_kbytes > > pretty much just that? > > Pekka Would you suggest to use min_free_kbytes as the threshold for sending low_memory_notifications to applications, and separately as a target value for the applications' memory giveaway? Thanks, Ronen. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755958Ab2ASJK1 (ORCPT ); Thu, 19 Jan 2012 04:10:27 -0500 Received: from mail-tul01m020-f174.google.com ([209.85.214.174]:41336 "EHLO mail-tul01m020-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753046Ab2ASJKV (ORCPT ); Thu, 19 Jan 2012 04:10:21 -0500 MIME-Version: 1.0 In-Reply-To: <4F17DCED.4020908@redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> Date: Thu, 19 Jan 2012 11:10:20 +0200 X-Google-Sender-Auth: IhnqItYe_yt8rsHfDMu3tgTYesU Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: Ronen Hod Cc: leonid.moiseichuk@nokia.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com 
Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 19, 2012 at 11:05 AM, Ronen Hod wrote: >>> I believe that it will be best if the kernel publishes an ideal >>> number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to >>> work with since this is what applications do, they free pages. Applications >>> will be able to refer to this number from their garbage collector, or before >>> allocating memory also if they did not get a notification, and it is also >>> useful if several applications free memory at the same time. >> >> Isn't >> >> /proc/sys/vm/min_free_kbytes >> >> pretty much just that? > > Would you suggest to use min_free_kbytes as the threshold for sending > low_memory_notifications to applications, and separately as a target value > for the applications' memory giveaway? I'm not saying that the kernel should use it directly but it seems like the kind of "ideal number of free pages" threshold you're suggesting. So userspace can read that value and use it as the "number of free pages" threshold for VM events, no? 
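A minimal sketch of the userspace side of this suggestion, assuming the application compares its view of free memory against the value in /proc/sys/vm/min_free_kbytes. The helper names read_kb_value and kb_to_release are hypothetical and are not part of any posted patch. Whether min_free_kbytes is a suitable threshold at all is a separate question that the sketch does not settle.

```c
#include <stdio.h>

/* Read one integer (in kB) from a procfs-style file.
 * Returns the value, or -1 if the file cannot be opened or parsed. */
long read_kb_value(const char *path)
{
    FILE *f = fopen(path, "r");
    long val;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

/* Hypothetical policy: how many kB the application should give back
 * so that free memory climbs back to target_kb. Returns 0 when free
 * memory is already at or above the target, or on invalid input. */
long kb_to_release(long free_kb, long target_kb)
{
    if (free_kb < 0 || target_kb < 0)
        return 0;
    return free_kb < target_kb ? target_kb - free_kb : 0;
}
```

An application could obtain the target with read_kb_value("/proc/sys/vm/min_free_kbytes") and take the current free figure from sysinfo() or /proc/meminfo before deciding how much to release.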
Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757288Ab2ASJVX (ORCPT ); Thu, 19 Jan 2012 04:21:23 -0500 Received: from mx1.redhat.com ([209.132.183.28]:18239 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757039Ab2ASJVP (ORCPT ); Thu, 19 Jan 2012 04:21:15 -0500 Message-ID: <4F17E058.8020008@redhat.com> Date: Thu, 19 Jan 2012 11:20:24 +0200 From: Ronen Hod User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0 MIME-Version: 1.0 To: Pekka Enberg CC: leonid.moiseichuk@nokia.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/19/2012 11:10 AM, Pekka Enberg wrote: > On Thu, Jan 19, 2012 at 11:05 AM, Ronen Hod wrote: >>>> I believe that it will be best if the kernel publishes an ideal >>>> number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to >>>> work with since this is what applications do, they free pages. 
Applications >>>> will be able to refer to this number from their garbage collector, or before >>>> allocating memory also if they did not get a notification, and it is also >>>> useful if several applications free memory at the same time. >>> Isn't >>> >>> /proc/sys/vm/min_free_kbytes >>> >>> pretty much just that? >> Would you suggest to use min_free_kbytes as the threshold for sending >> low_memory_notifications to applications, and separately as a target value >> for the applications' memory giveaway? > I'm not saying that the kernel should use it directly but it seems > like the kind of "ideal number of free pages" threshold you're > suggesting. So userspace can read that value and use it as the "number > of free pages" threshold for VM events, no? Yes, I like it. The rules of the game are simple and consistent all over, be it for the alert threshold, voluntary polling by the apps, or concurrent work by several applications. Well, as long as it provides a good indication of low_mem_pressure. Thanks, Ronen. 
> > Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754391Ab2ASKy0 (ORCPT ); Thu, 19 Jan 2012 05:54:26 -0500 Received: from smtp.nokia.com ([147.243.1.48]:33056 "EHLO mgw-sa02.nokia.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751211Ab2ASKyW convert rfc822-to-8bit (ORCPT ); Thu, 19 Jan 2012 05:54:22 -0500 From: To: , CC: , , , , , , , , , , , Subject: RE: [RFC 1/3] /dev/low_mem_notify Thread-Topic: [RFC 1/3] /dev/low_mem_notify Thread-Index: AQHM1PAILpEwsgBJQ0+ebltvcfq8MpYQOdkAgAB3jICAACXvgIAA9v4A///6iYCAABO3AIAAA+QAgAARAJCAAMdZAIAAg3iAgAAcKICAAAFDAIAAAtAAgAASS5A= Date: Thu, 19 Jan 2012 10:53:29 +0000 Message-ID: <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> In-Reply-To: <4F17E058.8020008@redhat.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [172.21.23.171] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 X-OriginalArrivalTime: 19 Jan 2012 10:53:31.0467 (UTC) FILETIME=[954609B0:01CCD698] X-Nokia-AV: Clean Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > -----Original Message----- > From: ext Ronen Hod [mailto:rhod@redhat.com] > Sent: 19 January, 2012 11:20 > To: Pekka Enberg ... > >>> Isn't > >>> > >>> /proc/sys/vm/min_free_kbytes > >>> > >>> pretty much just that? 
> >> Would you suggest to use min_free_kbytes as the threshold for sending > >> low_memory_notifications to applications, and separately as a target > >> value for the applications' memory giveaway? > > I'm not saying that the kernel should use it directly but it seems > > like the kind of "ideal number of free pages" threshold you're > > suggesting. So userspace can read that value and use it as the "number > > of free pages" threshold for VM events, no? > > Yes, I like it. The rules of the game are simple and consistent all over, be it the > alert threshold, voluntary poling by the apps, and for concurrent work by > several applications. > Well, as long as it provides a good indication for low_mem_pressure. To me this does not make much sense. min_free_kbytes can be set from user space (or auto-tuned by the kernel) to keep some amount of memory available for GFP_ATOMIC allocations. If free memory drops below that level, the kernel will reclaim memory from e.g. caches. From a potential user's point of view, the proposed API lacks a number of things that would be nice to have: 1. Rename this API from low_mem_pressure to something more related to notification and the memory situation in the system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here. 2. The API must use deferred timers to avoid impact on use time. A deferred timer is triggered only by a HW event or a non-deferrable timer, so if the device sleeps the timer might be skipped, and that is what user space expects. 3. The API should be tunable to propagate changes when the level goes up, down, or both ways. 4. To avoid triggering too many events, it probably makes sense to filter by the amount of change, but that is optional. If a subscriber sets the timer to 1s, the number of events should not be very large. 5. The API must provide an interface to request parameters, e.g. available swap or free memory, just to have a baseline. 6.
I do not understand how attributes are handled ( ), but it makes sense to use a mask and fill in the requested attributes using the mask and a callback table, i.e. if free pages are requested they are reported, otherwise not. 7. It would make sense to backport a couple of attributes from memnotify.c. I can submit a couple of patches if some of these proposals look sane to everyone. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753678Ab2ASLHz (ORCPT ); Thu, 19 Jan 2012 06:07:55 -0500 Received: from mail-tul01m020-f174.google.com ([209.85.214.174]:47208 "EHLO mail-tul01m020-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750831Ab2ASLHx (ORCPT ); Thu, 19 Jan 2012 06:07:53 -0500 MIME-Version: 1.0 In-Reply-To: <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> Date: Thu, 19 Jan 2012 13:07:52 +0200 X-Google-Sender-Auth: YEfKbOmpZzuZCK1Ej-4hIadl7Uc Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: leonid.moiseichuk@nokia.com Cc: rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Content-Type: text/plain; charset=ISO-8859-1 Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 19, 2012 at 12:53 PM, wrote: > From potential user point of view the proposed API has number of lacks which > would be nice to have implemented: > 1. rename this API from low_mem_pressure to something more related to > notification and memory situation in system: memory_pressure, memnotify, > memory_level etc. The word "low" is misleading here The thing is called vmevent: http://git.kernel.org/?p=linux/kernel/git/penberg/linux.git;a=shortlog;h=refs/heads/vmevent/core I haven't used "low mem" at all in the patches. On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 2. API must use deferred timers to prevent use-time impact. Deferred timer > will be triggered only in case HW event or non-deferrable timer, so if device > sleeps timer might be skipped and that is what expected for user-space I'm currently looking at the possibility of hooking VM events to perf which also uses hrtimers. Can't we make hrtimers do the right thing? On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 3. API should be tunable for propagate changes when level is Up or Down, > maybe both ways. Agreed. On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 4. 
to avoid triggering too much events probably has sense to filter according > to amount of change but that is optional. If subscriber set timer to 1s the > amount of events should not be very big. Agreed. On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 5. API must provide interface to request parameters e.g. available swap or > free memory just to have some base. The current ABI already supports that. You can specify which attributes you're interested in and they will be delivered as part of the event. On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 6. I do not understand how work with attributes performed ( ) but it has > sense to use mask and fill requested attributes using mask and callback table > i.e. if free pages requested - they are reported, otherwise not. That's how it works now in the git tree. On Thu, Jan 19, 2012 at 12:53 PM, wrote: > 7. would have sense to backport couple of attributes from memnotify.c > > I can submit couple of patches if some of proposals looks sane for everyone. Feel free to do that. I'm currently looking at how to support Minchan's non-sampled events. It seems to me integrating with perf would be nice because we could simply use tracepoints for this. 
Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751696Ab2ASLzx (ORCPT ); Thu, 19 Jan 2012 06:55:53 -0500 Received: from smtp.nokia.com ([147.243.128.26]:43652 "EHLO mgw-da02.nokia.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751296Ab2ASLzw convert rfc822-to-8bit (ORCPT ); Thu, 19 Jan 2012 06:55:52 -0500 From: To: CC: , , , , , , , , , , , , Subject: RE: [RFC 1/3] /dev/low_mem_notify Thread-Topic: [RFC 1/3] /dev/low_mem_notify Thread-Index: AQHM1PAILpEwsgBJQ0+ebltvcfq8MpYQOdkAgAB3jICAACXvgIAA9v4A///6iYCAABO3AIAAA+QAgAARAJCAAMdZAIAAg3iAgAAcKICAAAFDAIAAAtAAgAASS5CAAAu7AIAAFJKQ Date: Thu, 19 Jan 2012 11:54:58 +0000 Message-ID: <84FF21A720B0874AA94B46D76DB9826904559D9B@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [172.21.23.171] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 X-OriginalArrivalTime: 19 Jan 2012 11:54:59.0475 (UTC) FILETIME=[2B7F6630:01CCD6A1] X-Nokia-AV: Clean Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > -----Original Message----- > From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext > Pekka Enberg > Sent: 19 January, 2012 13:08 ... > > 1. 
rename this API from low_mem_pressure to something more related to > > notification and memory situation in system: memory_pressure, > > memnotify, memory_level etc. The word "low" is misleading here > > The thing is called vmevent: Yes, I see it. But I was a bit confused by vmnotify_fops and was sure it was mapped through dev. Now it is an anonymous inode. > > On Thu, Jan 19, 2012 at 12:53 PM, wrote: > > 2. API must use deferred timers to prevent use-time impact. Deferred > > timer will be triggered only in case HW event or non-deferrable timer, > > so if device sleeps timer might be skipped and that is what expected > > for user-space > > I'm currently looking at the possibility of hooking VM events to perf which > also uses hrtimers. Can't we make hrtimers do the right thing? I have no answer to this question. According to hrtimer_cpu_notify the CPU state is tracked, but the timer may set a HW event to wake up. In that case use time will be affected because there will be too many HW events and reasons to wake up. At least powertop reports hrtimers as sources of activity. > > On Thu, Jan 19, 2012 at 12:53 PM, wrote: > > 3. API should be tunable for propagate changes when level is Up or > > Down, maybe both ways. > > Agreed. > > On Thu, Jan 19, 2012 at 12:53 PM, wrote: > > 4. to avoid triggering too much events probably has sense to filter > > according to amount of change but that is optional. If subscriber set > > timer to 1s the amount of events should not be very big. > > Agreed. > > On Thu, Jan 19, 2012 at 12:53 PM, wrote: > > 5. API must provide interface to request parameters e.g. available > > swap or free memory just to have some base. > > The current ABI already supports that. You can specify which attributes > you're interested in and they will be delivered as part of th event. But vmnotify.h has a suspicious free_pages_threshold field. > > On Thu, Jan 19, 2012 at 12:53 PM, wrote: > > 6. 
I do not understand how work with attributes performed ( ) but it > > has sense to use mask and fill requested attributes using mask and > > callback table i.e. if free pages requested - they are reported, otherwise > not. > > That's how it works now in the git tree. vmnotify.c has vmnotify_watch_event, which collects a fixed set of parameters. > I'm currently looking at how to support Minchan's non-sampled events. It > seems to me integrating with perf would be nice because we could simply > use tracepoints for this. If tracepoints do not jeopardize use time, it makes sense to do it. > > Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755104Ab2ASMAD (ORCPT ); Thu, 19 Jan 2012 07:00:03 -0500 Received: from mail-tul01m020-f174.google.com ([209.85.214.174]:44910 "EHLO mail-tul01m020-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751296Ab2ASL77 (ORCPT ); Thu, 19 Jan 2012 06:59:59 -0500 MIME-Version: 1.0 In-Reply-To: <84FF21A720B0874AA94B46D76DB9826904559D9B@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB9826904559D9B@008-AM1MPN1-003.mgdnok.nokia.com> Date: Thu, 19 Jan 2012 13:59:58 +0200 X-Google-Sender-Auth: OE4cbxhDDbIkGtWw3Z0uGs17qHc Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: leonid.moiseichuk@nokia.com Cc: rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, 
linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 19, 2012 at 1:54 PM, wrote: >> The current ABI already supports that. You can specify which attributes >> you're interested in and they will be delivered as part of th event. > > But you have in vmnotify.h suspicious free_pages_threshold field. Aah, I was actually talking about the events userspace _reads_. The free_pages_threshold field is only used if the VMEVENT_TYPE_FREE_THRESHOLD bit is set. It should be cleaned up a bit, but in theory it supports watching other attributes as well. I've postponed the cleanup until I've figured out whether we can use perf, which would make the whole syscall go away. Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756682Ab2ASMGQ (ORCPT ); Thu, 19 Jan 2012 07:06:16 -0500 Received: from mail-tul01m020-f174.google.com ([209.85.214.174]:61010 "EHLO mail-tul01m020-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751567Ab2ASMGP convert rfc822-to-8bit (ORCPT ); Thu, 19 Jan 2012 07:06:15 -0500 MIME-Version: 1.0 In-Reply-To: <84FF21A720B0874AA94B46D76DB9826904559D9B@008-AM1MPN1-003.mgdnok.nokia.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> 
<84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB9826904559D9B@008-AM1MPN1-003.mgdnok.nokia.com> Date: Thu, 19 Jan 2012 14:06:14 +0200 X-Google-Sender-Auth: Jmy5iVwsGI-n-qi2o9GQlKRr7e8 Message-ID: Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: leonid.moiseichuk@nokia.com Cc: rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 19, 2012 at 1:54 PM, wrote: >> On Thu, Jan 19, 2012 at 12:53 PM, wrote: >> > 6. I do not understand how work with attributes performed ( ) but it >> > has sense to use mask and fill requested attributes using mask and >> > callback table i.e. if free pages requested - they are reported, otherwise >> not. >> >> That's how it works now in the git tree. > > Vmnotify.c has vmnotify_watch_event which collects fixed set of parameters. That would be a bug. We should check event_attrs like we do for NR_SWAP_PAGES. 
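The mask-plus-callback-table idea discussed in item 6 could look roughly like the following. This is a self-contained userspace sketch under stated assumptions, not code from the vmevent tree: the ATTR_* bits, the sample_* callbacks (which return fixed placeholder values), and fill_attrs are all hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute bits; not the names used in the vmevent tree. */
#define ATTR_FREE_PAGES (1u << 0)
#define ATTR_SWAP_PAGES (1u << 1)
#define ATTR_COUNT      2

typedef uint64_t (*attr_fn)(void);

/* Placeholder samplers returning fixed values; a real implementation
 * would read the corresponding VM counters instead. */
static uint64_t sample_free_pages(void) { return 1234; }
static uint64_t sample_swap_pages(void) { return 56; }

/* Callback table indexed by bit position in the mask. */
static const attr_fn attr_table[ATTR_COUNT] = {
    sample_free_pages,
    sample_swap_pages,
};

/* Fill `out` with only the attributes whose bit is set in `mask`,
 * in bit order. Returns the number of values written (at most `cap`). */
size_t fill_attrs(uint32_t mask, uint64_t *out, size_t cap)
{
    size_t n = 0;

    for (unsigned int bit = 0; bit < ATTR_COUNT; bit++) {
        if (!(mask & (1u << bit)))
            continue;           /* attribute not requested: skip it */
        if (n == cap)
            break;              /* output buffer is full */
        out[n++] = attr_table[bit]();
    }
    return n;
}
```

With this shape, a subscriber that sets only ATTR_SWAP_PAGES gets a single value back, and adding a new attribute means adding one bit and one table entry rather than touching the reporting loop.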
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754508Ab2ASOno (ORCPT ); Thu, 19 Jan 2012 09:43:44 -0500 Received: from mx1.redhat.com ([209.132.183.28]:1027 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751437Ab2ASOnn (ORCPT ); Thu, 19 Jan 2012 09:43:43 -0500 Message-ID: <4F182BF3.7050809@redhat.com> Date: Thu, 19 Jan 2012 09:42:59 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0 MIME-Version: 1.0 To: KAMEZAWA Hiroyuki CC: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod Subject: Re: [RFC 2/3] vmscan hook References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> <20120117230801.GA903@barrios-desktop.redhat.com> <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> <4F16D46D.5080000@redhat.com> <20120119112528.eda78467.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20120119112528.eda78467.kamezawa.hiroyu@jp.fujitsu.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/18/2012 09:25 PM, KAMEZAWA Hiroyuki wrote: > On Wed, 18 Jan 2012 09:17:17 -0500 > Rik van Riel wrote: > >> On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote: >>> On Wed, 18 Jan 2012 08:08:01 +0900 >>> Minchan Kim wrote: >>> >>>>>>> 2. can't we measure page-in/page-out distance by recording something ? >>>>>> >>>>>> I can't understand your point. What's relation does it with swapout prevent? 
>>>>>> >>>>> >>>>> If distance between pageout -> pagein is short, it means thrashing. >>>>> For example, recoding the timestamp when the page(mapping, index) was >>>>> paged-out, and check it at page-in. >>>> >>>> Our goal is prevent swapout. When we found thrashing, it's too late. >>> >>> If you want to prevent swap-out, don't swapon any. That's all. >>> Then, you can check the number of FILE_CACHE and have threshold. >> >> I think you are getting hung up on a word here. >> >> As I understand it, the goal is to push out the point where >> we start doing heavier swap IO, allowing us to overcommit >> memory more heavily before things start really slowing down. >> > > Yes. > > Hmm, considering that the issue is slow down, > > time values as > > - 'cpu time used for memory reclaim' > - 'latency of page allocation' > - 'application execution speed' ? > > may be a better score to see rather than just seeing lru's stat. I believe those all qualify as "too late". We want to prevent things from becoming bad, for as long as we (easily) can. 
-- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757460Ab2ATA0N (ORCPT ); Thu, 19 Jan 2012 19:26:13 -0500 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:45116 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756222Ab2ATA0K (ORCPT ); Thu, 19 Jan 2012 19:26:10 -0500 X-SecurityPolicyCheck-FJ: OK by FujitsuOutboundMailChecker v1.3.1 Date: Fri, 20 Jan 2012 09:24:49 +0900 From: KAMEZAWA Hiroyuki To: Rik van Riel Cc: Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, penberg@kernel.org, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod Subject: Re: [RFC 2/3] vmscan hook Message-Id: <20120120092449.4ecbec86.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <4F182BF3.7050809@redhat.com> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-3-git-send-email-minchan@kernel.org> <20120117173932.1c058ba4.kamezawa.hiroyu@jp.fujitsu.com> <20120117091356.GA29736@barrios-desktop.redhat.com> <20120117190512.047d3a03.kamezawa.hiroyu@jp.fujitsu.com> <20120117230801.GA903@barrios-desktop.redhat.com> <20120118091824.0bde46f7.kamezawa.hiroyu@jp.fujitsu.com> <4F16D46D.5080000@redhat.com> <20120119112528.eda78467.kamezawa.hiroyu@jp.fujitsu.com> <4F182BF3.7050809@redhat.com> Organization: FUJITSU Co. LTD. 
X-Mailer: Sylpheed 3.1.1 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 19 Jan 2012 09:42:59 -0500 Rik van Riel wrote: > On 01/18/2012 09:25 PM, KAMEZAWA Hiroyuki wrote: > > On Wed, 18 Jan 2012 09:17:17 -0500 > > Rik van Riel wrote: > > > >> On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote: > >>> On Wed, 18 Jan 2012 08:08:01 +0900 > >>> Minchan Kim wrote: > >>> > >>>>>>> 2. can't we measure page-in/page-out distance by recording something ? > >>>>>> > >>>>>> I can't understand your point. What's relation does it with swapout prevent? > >>>>>> > >>>>> > >>>>> If distance between pageout -> pagein is short, it means thrashing. > >>>>> For example, recoding the timestamp when the page(mapping, index) was > >>>>> paged-out, and check it at page-in. > >>>> > >>>> Our goal is prevent swapout. When we found thrashing, it's too late. > >>> > >>> If you want to prevent swap-out, don't swapon any. That's all. > >>> Then, you can check the number of FILE_CACHE and have threshold. > >> > >> I think you are getting hung up on a word here. > >> > >> As I understand it, the goal is to push out the point where > >> we start doing heavier swap IO, allowing us to overcommit > >> memory more heavily before things start really slowing down. > >> > > > > Yes. > > > > Hmm, considering that the issue is slow down, > > > > time values as > > > > - 'cpu time used for memory reclaim' > > - 'latency of page allocation' > > - 'application execution speed' ? > > > > may be a better score to see rather than just seeing lru's stat. > > I believe those all qualify as "too late". > > We want to prevent things from becoming bad, for as long > as we (easily) can. > Hmm, then some threshold-notifier interface will be required. Problem is how to know free + page_can_be_freed_without_risk. 
Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754007Ab2AXPmK (ORCPT ); Tue, 24 Jan 2012 10:42:10 -0500 Received: from mx1.redhat.com ([209.132.183.28]:43595 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753393Ab2AXPmI (ORCPT ); Tue, 24 Jan 2012 10:42:08 -0500 Date: Tue, 24 Jan 2012 13:38:35 -0200 From: Marcelo Tosatti To: leonid.moiseichuk@nokia.com Cc: rhod@redhat.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124153835.GA10990@amt.cnet> References: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: > > -----Original Message----- > > From: ext Ronen Hod [mailto:rhod@redhat.com] > > Sent: 19 January, 2012 11:20 > > To: Pekka Enberg > ... > > >>> Isn't > > >>> > > >>> /proc/sys/vm/min_free_kbytes > > >>> > > >>> pretty much just that? 
> > >> Would you suggest to use min_free_kbytes as the threshold for sending > > >> low_memory_notifications to applications, and separately as a target > > >> value for the applications' memory giveaway? > > > I'm not saying that the kernel should use it directly but it seems > > > like the kind of "ideal number of free pages" threshold you're > > > suggesting. So userspace can read that value and use it as the "number > > > of free pages" threshold for VM events, no? > > > > Yes, I like it. The rules of the game are simple and consistent all over, be it the > > alert threshold, voluntary poling by the apps, and for concurrent work by > > several applications. > > Well, as long as it provides a good indication for low_mem_pressure. > > For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount > of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. > > >From potential user point of view the proposed API has number of lacks which would be nice to have implemented: > 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here > 2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space Having userspace specify the "sample period" for low memory notification makes no sense. The frequency of notifications is a function of the memory pressure. > 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. > 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. 
If subscriber set timer to 1s the amount of events should not be very big. > 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. It would make the interface easier to use if it provided the number of pages to free, in the notification (kernel can calculate that as the delta between current_free_pages -> comfortable_free_pages relative to process RSS). > 6. I do not understand how work with attributes performed ( ) but it has sense to use mask and fill requested attributes using mask and callback table i.e. if free pages requested - they are reported, otherwise not. > 7. would have sense to backport couple of attributes from memnotify.c > > I can submit couple of patches if some of proposals looks sane for everyone. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754176Ab2AXPxM (ORCPT ); Tue, 24 Jan 2012 10:53:12 -0500 Received: from mx1.redhat.com ([209.132.183.28]:40175 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752166Ab2AXPxJ (ORCPT ); Tue, 24 Jan 2012 10:53:09 -0500 Date: Tue, 24 Jan 2012 13:40:01 -0200 From: Marcelo Tosatti To: Pekka Enberg Cc: Rik van Riel , Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Andrew Morton , Ronen Hod , KOSAKI Motohiro Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124154001.GB10990@amt.cnet> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 08:51:13PM +0200, Pekka Enberg 
wrote: > Hello, > > Ok, so here's a proof of concept patch that implements sample-base > per-process free threshold VM event watching using perf-like syscall > ABI. I'd really like to see something like this that's much more > extensible and clean than the /dev based ABIs that people have > proposed so far. > > Pekka What is the practical advantage of a syscall, again? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754635Ab2AXQBZ (ORCPT ); Tue, 24 Jan 2012 11:01:25 -0500 Received: from filtteri6.pp.htv.fi ([213.243.153.189]:60481 "EHLO filtteri6.pp.htv.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754501Ab2AXQBY (ORCPT ); Tue, 24 Jan 2012 11:01:24 -0500 Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: Marcelo Tosatti Cc: Rik van Riel , Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Andrew Morton , Ronen Hod , KOSAKI Motohiro In-Reply-To: <20120124154001.GB10990@amt.cnet> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> <20120124154001.GB10990@amt.cnet> Content-Type: text/plain; charset="ISO-8859-1" Date: Tue, 24 Jan 2012 18:01:20 +0200 Message-ID: <1327420880.13624.24.camel@jaguar> Mime-Version: 1.0 X-Mailer: Evolution 2.32.2 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 2012-01-24 at 13:40 -0200, Marcelo Tosatti wrote: > What is the practical advantage of a syscall, again? Why do you ask? The advantage for this particular case is not needing to add ioctls() for configuration and keeping the file read/write ABI simple. 
Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754957Ab2AXQJR (ORCPT ); Tue, 24 Jan 2012 11:09:17 -0500 Received: from mx1.redhat.com ([209.132.183.28]:32283 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753888Ab2AXQJP (ORCPT ); Tue, 24 Jan 2012 11:09:15 -0500 Message-ID: <4F1ED77F.4090900@redhat.com> Date: Tue, 24 Jan 2012 18:08:31 +0200 From: Ronen Hod User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0 MIME-Version: 1.0 To: Marcelo Tosatti CC: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> In-Reply-To: <20120124153835.GA10990@amt.cnet> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/24/2012 05:38 PM, Marcelo Tosatti wrote: > On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: >>> -----Original Message----- >>> From: ext Ronen Hod [mailto:rhod@redhat.com] >>> Sent: 19 January, 2012 11:20 >>> To: Pekka Enberg >> ... >>>>>> Isn't >>>>>> >>>>>> /proc/sys/vm/min_free_kbytes >>>>>> >>>>>> pretty much just that? 
>>>>> Would you suggest to use min_free_kbytes as the threshold for sending >>>>> low_memory_notifications to applications, and separately as a target >>>>> value for the applications' memory giveaway? >>>> I'm not saying that the kernel should use it directly but it seems >>>> like the kind of "ideal number of free pages" threshold you're >>>> suggesting. So userspace can read that value and use it as the "number >>>> of free pages" threshold for VM events, no? >>> Yes, I like it. The rules of the game are simple and consistent all over, be it the >>> alert threshold, voluntary poling by the apps, and for concurrent work by >>> several applications. >>> Well, as long as it provides a good indication for low_mem_pressure. >> For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount >> of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. >> >> > From potential user point of view the proposed API has number of lacks which would be nice to have implemented: >> 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here >> 2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space > Having userspace specify the "sample period" for low memory notification > makes no sense. The frequency of notifications is a function of the > memory pressure. > >> 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. > >> 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. 
If subscriber set timer to 1s the amount of events should not be very big. >> 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. > It would make the interface easier to use if it provided the number of > pages to free, in the notification (kernel can calculate that as the > delta between current_free_pages -> comfortable_free_pages relative to > process RSS). If you rely on the notification's argument you lose several features: - Handling of notifications by several applications in parallel - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience. - Iterative release loops, until there are enough free pages. I believe that the notification should only serve as a trigger to run the cleanup. Ronen. > >> 6. I do not understand how work with attributes performed ( ) but it has sense to use mask and fill requested attributes using mask and callback table i.e. if free pages requested - they are reported, otherwise not. >> 7. would have sense to backport couple of attributes from memnotify.c >> >> I can submit couple of patches if some of proposals looks sane for everyone. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754891Ab2AXQKo (ORCPT ); Tue, 24 Jan 2012 11:10:44 -0500 Received: from filtteri1.pp.htv.fi ([213.243.153.184]:39324 "EHLO filtteri1.pp.htv.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754522Ab2AXQKm (ORCPT ); Tue, 24 Jan 2012 11:10:42 -0500 Subject: Re: [RFC 1/3] /dev/low_mem_notify From: Pekka Enberg To: Marcelo Tosatti Cc: leonid.moiseichuk@nokia.com, rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com In-Reply-To: <20120124153835.GA10990@amt.cnet> References: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> Content-Type: text/plain; charset="ISO-8859-1" Date: Tue, 24 Jan 2012 18:10:40 +0200 Message-ID: <1327421440.13624.30.camel@jaguar> Mime-Version: 1.0 X-Mailer: Evolution 2.32.2 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 2012-01-24 at 13:38 -0200, Marcelo Tosatti wrote: > Having userspace specify the "sample period" for low memory notification > makes no sense. The frequency of notifications is a function of the > memory pressure.
Sure, it makes sense to autotune sample period. I don't see the problem with letting userspace decide it for themselves if they want to. Pekka From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755054Ab2AXQWv (ORCPT ); Tue, 24 Jan 2012 11:22:51 -0500 Received: from moutng.kundenserver.de ([212.227.17.10]:58609 "EHLO moutng.kundenserver.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753451Ab2AXQWu (ORCPT ); Tue, 24 Jan 2012 11:22:50 -0500 From: Arnd Bergmann To: Pekka Enberg Subject: Re: [RFC 1/3] /dev/low_mem_notify Date: Tue, 24 Jan 2012 16:22:36 +0000 User-Agent: KMail/1.12.2 (Linux/3.2.0-rc7; KDE/4.3.2; x86_64; ; ) Cc: leonid.moiseichuk@nokia.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, mtosatti@redhat.com, akpm@linux-foundation.org, rhod@redhat.com, kosaki.motohiro@jp.fujitsu.com References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <84FF21A720B0874AA94B46D76DB98269045596AE@008-AM1MPN1-003.mgdnok.nokia.com> In-Reply-To: MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201201241622.36222.arnd@arndb.de> X-Provags-ID: V02:K0:ssDhLAP5zU5THHJFgPKC62lf/Sc8a7m/4xNt140HCf4 wLHbAUHA8js6r0uQVxPJOdG9QN0lHH8NGo+IxhtJ8mQhcSB7sE U67gtKhvUW4CUzsGr7ivgygCrfjqzUdnMUbWaeE2Uj6TIkje3V JrOfGEazGqoBYnFV9+M6ViuXvUF3ynbnhPNuT5vv0P7iwDxUr+ rtHY9364f20v2qrbayZNqua9XPYT3k0dQoE3JpWJLk3I0s1ncQ oT3f8zD+OI9Eed655tJYT5d56k6J3c07x6bSIU7lpYl20T8OFs 9XEOh6mJIWTCW/W49XYCPFrFHOf2tONDCNmgNaGKyrQMl34O2H 2XfXCljifMSSPQQ7cSr0= Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wednesday 18 January 2012, Pekka Enberg wrote: > >> +struct vmnotify_event { > >> + /* Size of the struct for ABI extensibility. 
*/ > >> + __u32 size; > >> + > >> + __u64 nr_avail_pages; > >> + > >> + __u64 nr_swap_pages; > >> + > >> + __u64 nr_free_pages; > >> +}; > > > > Two fields here most likely session-constant, (nr_avail_pages and > > nr_swap_pages), seems not much sense to report them in every event. If we > > have memory/swap hotplug user-space can use sysinfo() call. > > I actually changed the ABI to look like this: > > struct vmnotify_event { > /* > * Size of the struct for ABI extensibility. > */ > __u32 size; > > __u64 attrs; > > __u64 attr_values[]; > }; > > So userspace can decide which fields to include in notifications. Please make the first member a __u64 instead of a __u32. This will avoid incompatibility between 32 and 64 bit processes, which have different alignment rules on x86: x86-32 would implicitly pack the struct while x86-64 would add padding with your layout. Arnd From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755368Ab2AXQ0L (ORCPT ); Tue, 24 Jan 2012 11:26:11 -0500 Received: from moutng.kundenserver.de ([212.227.17.9]:49263 "EHLO moutng.kundenserver.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755204Ab2AXQ0I (ORCPT ); Tue, 24 Jan 2012 11:26:08 -0500 From: Arnd Bergmann To: Pekka Enberg Subject: Re: [RFC 1/3] /dev/low_mem_notify Date: Tue, 24 Jan 2012 16:25:55 +0000 User-Agent: KMail/1.12.2 (Linux/3.2.0-rc7; KDE/4.3.2; x86_64; ; ) Cc: Marcelo Tosatti , Rik van Riel , Minchan Kim , "linux-mm" , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Andrew Morton , Ronen Hod , KOSAKI Motohiro References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <20120124154001.GB10990@amt.cnet> <1327420880.13624.24.camel@jaguar> In-Reply-To: <1327420880.13624.24.camel@jaguar> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: 
<201201241625.55295.arnd@arndb.de> X-Provags-ID: V02:K0:jytNgZkXFiyyQ0+HID3nBpRdCqrSWxTR4ekwO7kDCMY OJAqY+dwEB9q7vVlJKW0XcZDO/0mHEzmIa7Ijc1MCPkXjLE8Fp ij1lLDJ2xmxhPqFmXR+ahnHk6AirCo70rQ/I1wijgDcBINn3/x S/2EM5s2gAvVfLsz5g2bgeg44vKXr+YFtarltDx398cQN169s5 xqHBKTpYps/6Q5BtcoL5fR4yWXu8BwxUVHH/SJIamyKo4m6/bn 70pkYVA74RCDqXpVYJMUbcYZceBNc+f/FiuHWjVwLscmIBn4yr jZPWsxX6ny/ybv+bA0Crxxe7R0cRJB/c6fLmaFs943j5EvcjP8 i2KldNiEa1yaC3r3BFOo= Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tuesday 24 January 2012, Pekka Enberg wrote: > On Tue, 2012-01-24 at 13:40 -0200, Marcelo Tosatti wrote: > > What is the practical advantage of a syscall, again? > > Why do you ask? The advantage for this particular case is not needing to > add ioctls() for configuration and keeping the file read/write ABI > simple. The two are obviously equivalent and there is no reason to avoid ioctl in general. However I agree that the syscall would be better in this case, because that is what we tend to use for core kernel functionality, while character devices tend to be used for I/O device drivers that need stuff like enumeration and permission management. 
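Editor's sketch: the alignment remark about struct vmnotify_event made earlier in the thread can be checked from user space. The two structs below are stand-ins, one mirroring the quoted layout (a 32-bit size first) and one mirroring the suggested fix (a 64-bit size first). The struct names with the _u32_first and _u64_first suffixes and the two head_padding helpers are hypothetical, and the fixed-width types from <stdint.h> are assumed to stand in for __u32 and __u64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the quoted layout: 32-bit size followed by 64-bit attrs. */
struct vmnotify_event_u32_first {
	uint32_t size;
	uint64_t attrs;
	uint64_t attr_values[];
};

/* Stand-in for the suggested layout: every member is 64 bits wide. */
struct vmnotify_event_u64_first {
	uint64_t size;
	uint64_t attrs;
	uint64_t attr_values[];
};

/* With a 64-bit first member there is no room for ABI-dependent padding. */
static_assert(offsetof(struct vmnotify_event_u64_first, attrs) == 8,
	      "attrs must directly follow size");

/* Bytes of padding the compiler placed between size and attrs. */
static size_t head_padding_u32_first(void)
{
	return offsetof(struct vmnotify_event_u32_first, attrs) - sizeof(uint32_t);
}

static size_t head_padding_u64_first(void)
{
	return offsetof(struct vmnotify_event_u64_first, attrs) - sizeof(uint64_t);
}
```

Building this once as a 64-bit and once as a 32-bit x86 binary and comparing head_padding_u32_first() should show the difference the remark describes, while head_padding_u64_first() stays at zero in both builds.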
Arnd From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757370Ab2AXSbA (ORCPT ); Tue, 24 Jan 2012 13:31:00 -0500 Received: from mx1.redhat.com ([209.132.183.28]:13113 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756192Ab2AXSa6 (ORCPT ); Tue, 24 Jan 2012 13:30:58 -0500 Date: Tue, 24 Jan 2012 16:29:09 -0200 From: Marcelo Tosatti To: Pekka Enberg Cc: leonid.moiseichuk@nokia.com, rhod@redhat.com, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124182909.GB19186@amt.cnet> References: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <1327421440.13624.30.camel@jaguar> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1327421440.13624.30.camel@jaguar> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 24, 2012 at 06:10:40PM +0200, Pekka Enberg wrote: > On Tue, 2012-01-24 at 13:38 -0200, Marcelo Tosatti wrote: > > Having userspace specify the "sample period" for low memory notification > > makes no sense. The frequency of notifications is a function of the > > memory pressure. > > Sure, it makes sense to autotune sample period. I don't see the problem > with letting userspace decide it for themselves if they want to. 
> > Pekka The application polls on a file descriptor, waiting for asynchronous events: particular conditions of memory reclaim upon which an action is necessary. These signalled conditions are not simply percentages of free memory, but depend on the amount of freeable cache available, etc. Otherwise applications could monitor /proc/meminfo and act on that. What is the point of sampling in the interface as you have it? The application can read() from the file descriptor to retrieve the current status, if it wishes. The objective in this argument is to make the API as simple and easy to use as possible. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757383Ab2AXSbT (ORCPT ); Tue, 24 Jan 2012 13:31:19 -0500 Received: from mx1.redhat.com ([209.132.183.28]:46760 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756192Ab2AXSbR (ORCPT ); Tue, 24 Jan 2012 13:31:17 -0500 Date: Tue, 24 Jan 2012 16:10:34 -0200 From: Marcelo Tosatti To: Ronen Hod Cc: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124181034.GA19186@amt.cnet> References: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <4F1ED77F.4090900@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F1ED77F.4090900@redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote: > On 01/24/2012 05:38 PM, Marcelo Tosatti wrote: > >On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: > >>>-----Original Message----- > >>>From: ext Ronen Hod [mailto:rhod@redhat.com] > >>>Sent: 19 January, 2012 11:20 > >>>To: Pekka Enberg > >>... > >>>>>>Isn't > >>>>>> > >>>>>>/proc/sys/vm/min_free_kbytes > >>>>>> > >>>>>>pretty much just that? > >>>>>Would you suggest to use min_free_kbytes as the threshold for sending > >>>>>low_memory_notifications to applications, and separately as a target > >>>>>value for the applications' memory giveaway? > >>>>I'm not saying that the kernel should use it directly but it seems > >>>>like the kind of "ideal number of free pages" threshold you're > >>>>suggesting. So userspace can read that value and use it as the "number > >>>>of free pages" threshold for VM events, no? > >>>Yes, I like it. The rules of the game are simple and consistent all over, be it the > >>>alert threshold, voluntary poling by the apps, and for concurrent work by > >>>several applications. > >>>Well, as long as it provides a good indication for low_mem_pressure. > >>For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount > >>of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. > >> > >>> From potential user point of view the proposed API has number of lacks which would be nice to have implemented: > >>1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here > >>2. API must use deferred timers to prevent use-time impact. 
Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space > >Having userspace specify the "sample period" for low memory notification > >makes no sense. The frequency of notifications is a function of the > >memory pressure. > > > >>3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. > > > >>4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big. > >>5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. > >It would make the interface easier to use if it provided the number of > >pages to free, in the notification (kernel can calculate that as the > >delta between current_free_pages -> comfortable_free_pages relative to > >process RSS). > > If you rely on the notification's argument you lose several features: > - Handling of notifications by several applications in parallel Each application has its argument built in a custom fashion (pages_to_free = delta between current_free_pages -> comfortable_free_pages relative to process RSS), or something to that effect. It is compatible with parallel notifications. > - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience. I am suggesting an additional field in the notification data so that the freeing routine has a goal. But it is not mandatory. > - Iterative release loops, until there are enough free pages. What is the advantage versus releasing the necessary amount of memory in a given moment? > I believe that the notification should only serve as a trigger to run the cleanup. Agree. 
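Editor's sketch: one possible reading of the "pages_to_free" goal discussed above, written as a small C helper. The thread only says "delta between current_free_pages -> comfortable_free_pages relative to process RSS, or something to that effect", so the proportional split below, the function name pages_to_free_hint, and the total_rss_pages parameter are all assumptions and are not taken from any posted patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: split the system-wide free-page deficit among
 * processes in proportion to their RSS. Returns a goal, not an order.
 */
static uint64_t pages_to_free_hint(uint64_t current_free_pages,
				   uint64_t comfortable_free_pages,
				   uint64_t process_rss_pages,
				   uint64_t total_rss_pages)
{
	uint64_t deficit, hint;

	/* No pressure, or nothing to scale against: ask for nothing. */
	if (current_free_pages >= comfortable_free_pages || total_rss_pages == 0)
		return 0;

	/*
	 * Assumption: page counts fit in 32 bits (16 TiB with 4 KiB pages),
	 * so deficit * process_rss_pages cannot overflow 64 bits.
	 */
	assert(comfortable_free_pages <= UINT32_MAX);
	assert(process_rss_pages <= UINT32_MAX);

	deficit = comfortable_free_pages - current_free_pages;
	hint = deficit * process_rss_pages / total_rss_pages;

	/* A process cannot give back more than it holds. */
	if (hint > process_rss_pages)
		hint = process_rss_pages;

	return hint;
}
```

Under this reading each process receives a different value, which is how the hint stays compatible with several applications being notified in parallel. It remains a goal only: an application that can free more, or nothing at all, is free to ignore it and treat the notification purely as a trigger.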
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757418Ab2AXSe5 (ORCPT ); Tue, 24 Jan 2012 13:34:57 -0500 Received: from mx1.redhat.com ([209.132.183.28]:28433 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755691Ab2AXSey (ORCPT ); Tue, 24 Jan 2012 13:34:54 -0500 Date: Tue, 24 Jan 2012 16:32:47 -0200 From: Marcelo Tosatti To: Arnd Bergmann Cc: Pekka Enberg , Rik van Riel , Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Andrew Morton , Ronen Hod , KOSAKI Motohiro Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124183247.GA19853@amt.cnet> References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <20120124154001.GB10990@amt.cnet> <1327420880.13624.24.camel@jaguar> <201201241625.55295.arnd@arndb.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201201241625.55295.arnd@arndb.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 24, 2012 at 04:25:55PM +0000, Arnd Bergmann wrote: > On Tuesday 24 January 2012, Pekka Enberg wrote: > > On Tue, 2012-01-24 at 13:40 -0200, Marcelo Tosatti wrote: > > > What is the practical advantage of a syscall, again? > > > > Why do you ask? The advantage for this particular case is not needing to > > add ioctls() for configuration and keeping the file read/write ABI > > simple. > > The two are obviously equivalent and there is no reason to avoid > ioctl in general. However I agree that the syscall would be better > in this case, because that is what we tend to use for core kernel > functionality, while character devices tend to be used for I/O device > drivers that need stuff like enumeration and permission management. > > Arnd Makes sense. 
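Editor's sketch: whichever way the descriptor is obtained (a character device or a dedicated syscall), the consumer side described in this thread looks the same: poll on a file descriptor, then read() the event. The helper below is a minimal illustration using only standard POSIX calls. The name wait_for_vm_event is hypothetical, and nothing here depends on the event format, which is still under discussion.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable, then read one event
 * into buf. Returns the number of bytes read, 0 on timeout (or end of
 * file), or -1 with errno set.
 */
static ssize_t wait_for_vm_event(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ready;

	assert(buf != NULL && len > 0);

	/* A signal restarts the wait with the full timeout. */
	do {
		ready = poll(&pfd, 1, timeout_ms);
	} while (ready < 0 && errno == EINTR);

	if (ready <= 0)
		return ready;	/* 0: timed out, -1: poll() failed */

	return read(fd, buf, len);
}
```

A caller would typically loop on this helper and run its cleanup routine each time it returns a positive byte count. Any pipe or socket can be used to exercise it, since it makes no assumption about the kernel side.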
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751547Ab2AXV5Q (ORCPT ); Tue, 24 Jan 2012 16:57:16 -0500 Received: from tex.lwn.net ([70.33.254.29]:60910 "EHLO vena.lwn.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750895Ab2AXV5P (ORCPT ); Tue, 24 Jan 2012 16:57:15 -0500 Date: Tue, 24 Jan 2012 14:57:13 -0700 From: Jonathan Corbet To: Pekka Enberg Cc: Rik van Riel , Minchan Kim , linux-mm , LKML , leonid.moiseichuk@nokia.com, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, KOSAKI Motohiro , Johannes Weiner , Marcelo Tosatti , Andrew Morton , Ronen Hod , KOSAKI Motohiro Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120124145713.20fad866@dt> In-Reply-To: References: <1326788038-29141-1-git-send-email-minchan@kernel.org> <1326788038-29141-2-git-send-email-minchan@kernel.org> <4F15A34F.40808@redhat.com> Organization: LWN.net X-Mailer: Claws Mail 3.8.0 (GTK+ 2.24.8; x86_64-redhat-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 17 Jan 2012 20:51:13 +0200 (EET) Pekka Enberg wrote: > Ok, so here's a proof of concept patch that implements sample-base > per-process free threshold VM event watching using perf-like syscall ABI. > I'd really like to see something like this that's much more extensible and > clean than the /dev based ABIs that people have proposed so far. OK, so I'm slow, but better late than never. I plead travel. I guess the thing that surprises me is that nobody has said this yet: this looks a lot like an event-reporting mechanism like perf. Is there a reason these can't be perf-style events integrated with all the rest? > +struct vmnotify_config { > + /* > + * Size of the struct for ABI extensibility. 
> + */ > + __u32 size; > + > + /* > + * Notification type bitmask > + */ > + __u64 type; > + > + /* > + * Free memory threshold in percentages [1..99] > + */ > + __u32 free_threshold; Is this an upper-bound threshold or a lower-bound threshold? From your example, it looks like "free_threshold" is "the amount of memory that is not free", which seems confusing. [...] > new file mode 100644 > index 0000000..6800450 > --- /dev/null > +++ b/mm/vmnotify.c > @@ -0,0 +1,235 @@ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define VMNOTIFY_MAX_FREE_THRESHOD 100 Did we run out of L's here? :) > +static ssize_t vmnotify_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) > +{ > + struct vmnotify_watch *watch = file->private_data; > + int ret = 0; > + > + mutex_lock(&watch->mutex); > + > + if (!watch->pending) > + goto out_unlock; > + > + if (copy_to_user(buf, &watch->event, sizeof(struct vmnotify_event))) { > + ret = -EFAULT; > + goto out_unlock; > + } > + > + ret = watch->event.size; > + > + watch->pending = false; > + > +out_unlock: > + mutex_unlock(&watch->mutex); > + > + return ret; > +} So this is a nonblocking-only interface? That may surprise some developers. You already have a wait queue, why not wait on it if need be? > +static int vmnotify_copy_config(struct vmnotify_config __user *uconfig, > + struct vmnotify_config *config) > +{ > + int ret; > + > + ret = copy_from_user(config, uconfig, sizeof(struct vmnotify_config)); > + if (ret) > + return -EFAULT; > + > + if (!config->type) > + return -EINVAL; > + > + if (config->type & VMNOTIFY_TYPE_SAMPLE) { > + if (config->sample_period_ns < NSEC_PER_MSEC) > + return -EINVAL; > + } What happens if the sample period is zero? 
jon From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752987Ab2AYIUQ (ORCPT ); Wed, 25 Jan 2012 03:20:16 -0500 Received: from smtp.nokia.com ([147.243.128.26]:42583 "EHLO mgw-da02.nokia.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750894Ab2AYIUP convert rfc822-to-8bit (ORCPT ); Wed, 25 Jan 2012 03:20:15 -0500 From: To: , CC: , , , , , , , , , , , Subject: RE: [RFC 1/3] /dev/low_mem_notify Thread-Topic: [RFC 1/3] /dev/low_mem_notify Thread-Index: AQHM1PAILpEwsgBJQ0+ebltvcfq8MpYQOdkAgAB3jICAACXvgIAA9v4A///6iYCAABO3AIAAA+QAgAARAJCAAMdZAIAAg3iAgAAcKICAAAFDAIAAAtAAgAASS5CACDMHgIAACPYAgAEemaA= Date: Wed, 25 Jan 2012 08:19:11 +0000 Message-ID: <84FF21A720B0874AA94B46D76DB9826904562B60@008-AM1MPN1-003.mgdnok.nokia.com> References: <84FF21A720B0874AA94B46D76DB98269045596EA@008-AM1MPN1-003.mgdnok.nokia.com> <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <1327421440.13624.30.camel@jaguar> In-Reply-To: <1327421440.13624.30.camel@jaguar> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [172.21.23.171] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 X-OriginalArrivalTime: 25 Jan 2012 08:19:12.0446 (UTC) FILETIME=[04F20DE0:01CCDB3A] X-Nokia-AV: Clean Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > -----Original Message----- > From: ext Pekka Enberg [mailto:penberg@kernel.org] > Sent: 24 January, 2012 18:11 > To: Marcelo Tosatti .... > On Tue, 2012-01-24 at 13:38 -0200, Marcelo Tosatti wrote: > > Having userspace specify the "sample period" for low memory > > notification makes no sense. 
The frequency of notifications is a > > function of the memory pressure. > > Sure, it makes sense to autotune sample period. I don't see the problem > with letting userspace decide it for themselves if they want to. > > Pekka Good point, but you must take into account that the reaction time in user-space depends on how the SW stack is organized. For some components 1s is a good enough update time, for other cases it is 10ms. If changes in the VM happen too often, they make no sense for user-space. Thus, from a practical point of view, having a sampling period is not a bad idea. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754291Ab2AYIxJ (ORCPT ); Wed, 25 Jan 2012 03:53:09 -0500 Received: from mx1.redhat.com ([209.132.183.28]:59626 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753952Ab2AYIxH (ORCPT ); Wed, 25 Jan 2012 03:53:07 -0500 Message-ID: <4F1FC2C8.10103@redhat.com> Date: Wed, 25 Jan 2012 10:52:24 +0200 From: Ronen Hod User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0 MIME-Version: 1.0 To: Marcelo Tosatti CC: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <84FF21A720B0874AA94B46D76DB982690455978C@008-AM1MPN1-003.mgdnok.nokia.com> <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <4F1ED77F.4090900@redhat.com> <20120124181034.GA19186@amt.cnet> In-Reply-To: <20120124181034.GA19186@amt.cnet> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding:
7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/24/2012 08:10 PM, Marcelo Tosatti wrote: > On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote: >> On 01/24/2012 05:38 PM, Marcelo Tosatti wrote: >>> On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: >>>>> -----Original Message----- >>>>> From: ext Ronen Hod [mailto:rhod@redhat.com] >>>>> Sent: 19 January, 2012 11:20 >>>>> To: Pekka Enberg >>>> ... >>>>>>>> Isn't >>>>>>>> >>>>>>>> /proc/sys/vm/min_free_kbytes >>>>>>>> >>>>>>>> pretty much just that? >>>>>>> Would you suggest to use min_free_kbytes as the threshold for sending >>>>>>> low_memory_notifications to applications, and separately as a target >>>>>>> value for the applications' memory giveaway? >>>>>> I'm not saying that the kernel should use it directly but it seems >>>>>> like the kind of "ideal number of free pages" threshold you're >>>>>> suggesting. So userspace can read that value and use it as the "number >>>>>> of free pages" threshold for VM events, no? >>>>> Yes, I like it. The rules of the game are simple and consistent all over, be it the >>>>> alert threshold, voluntary poling by the apps, and for concurrent work by >>>>> several applications. >>>>> Well, as long as it provides a good indication for low_mem_pressure. >>>> For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount >>>> of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. >>>> >>>>> From potential user point of view the proposed API has number of lacks which would be nice to have implemented: >>>> 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here >>>> 2. 
API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space >>> Having userspace specify the "sample period" for low memory notification >>> makes no sense. The frequency of notifications is a function of the >>> memory pressure. >>> >>>> 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. >>>> 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big. >>>> 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. >>> It would make the interface easier to use if it provided the number of >>> pages to free, in the notification (kernel can calculate that as the >>> delta between current_free_pages -> comfortable_free_pages relative to >>> process RSS). >> If you rely on the notification's argument you lose several features: >> - Handling of notifications by several applications in parallel > Each application has its argument built in a custom fashion > (pages_to_free = delta between current_free_pages -> > comfortable_free_pages relative to process RSS), or something to that > effect. It is compatible with parallel notifications. Not sure that I got it. Do you suggest to ask all the applications to free say 3% of their memory?. Some may be able to free more, and some cannot free any. Isn't it more practical to just notify them, and let each app contribute its part to the global moving target? >> - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience. > I am suggesting an additional field in the notification data so that the > freeing routine has a goal. But it is not mandatory. 
If you do want to support voluntary (notification-less) app decisions based on the current status, then why not settle for this API and use the notifications only to trigger this procedure? > >> - Iterative release loops, until there are enough free pages. > What is the advantage versus releasing the necessary amount of > memory in a given moment? The cleanup logic may be unaware of the page-level effects of its allocations and frees, especially when freeing complex internal data structures (such as cached web pages); iterating lets it keep freeing until things settle down. Ronen. > >> I believe that the notification should only serve as a trigger to run the cleanup. > Agree. > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751356Ab2AYKOc (ORCPT ); Wed, 25 Jan 2012 05:14:32 -0500 Received: from mx1.redhat.com ([209.132.183.28]:30675 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750772Ab2AYKOb (ORCPT ); Wed, 25 Jan 2012 05:14:31 -0500 Date: Wed, 25 Jan 2012 08:12:09 -0200 From: Marcelo Tosatti To: Ronen Hod Cc: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify Message-ID: <20120125101209.GB29167@amt.cnet> References: <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <4F1ED77F.4090900@redhat.com> <20120124181034.GA19186@amt.cnet> <4F1FC2C8.10103@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F1FC2C8.10103@redhat.com> User-Agent: Mutt/1.5.21
(2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 25, 2012 at 10:52:24AM +0200, Ronen Hod wrote: > On 01/24/2012 08:10 PM, Marcelo Tosatti wrote: > >On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote: > >>On 01/24/2012 05:38 PM, Marcelo Tosatti wrote: > >>>On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: > >>>>>-----Original Message----- > >>>>>From: ext Ronen Hod [mailto:rhod@redhat.com] > >>>>>Sent: 19 January, 2012 11:20 > >>>>>To: Pekka Enberg > >>>>... > >>>>>>>>Isn't > >>>>>>>> > >>>>>>>>/proc/sys/vm/min_free_kbytes > >>>>>>>> > >>>>>>>>pretty much just that? > >>>>>>>Would you suggest to use min_free_kbytes as the threshold for sending > >>>>>>>low_memory_notifications to applications, and separately as a target > >>>>>>>value for the applications' memory giveaway? > >>>>>>I'm not saying that the kernel should use it directly but it seems > >>>>>>like the kind of "ideal number of free pages" threshold you're > >>>>>>suggesting. So userspace can read that value and use it as the "number > >>>>>>of free pages" threshold for VM events, no? > >>>>>Yes, I like it. The rules of the game are simple and consistent all over, be it the > >>>>>alert threshold, voluntary poling by the apps, and for concurrent work by > >>>>>several applications. > >>>>>Well, as long as it provides a good indication for low_mem_pressure. > >>>>For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount > >>>>of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. > >>>> > >>>>> From potential user point of view the proposed API has number of lacks which would be nice to have implemented: > >>>>1. 
rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here > >>>>2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space > >>>Having userspace specify the "sample period" for low memory notification > >>>makes no sense. The frequency of notifications is a function of the > >>>memory pressure. > >>> > >>>>3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. > >>>>4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big. > >>>>5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. > >>>It would make the interface easier to use if it provided the number of > >>>pages to free, in the notification (kernel can calculate that as the > >>>delta between current_free_pages -> comfortable_free_pages relative to > >>>process RSS). > >>If you rely on the notification's argument you lose several features: > >> - Handling of notifications by several applications in parallel > >Each application has its argument built in a custom fashion > >(pages_to_free = delta between current_free_pages -> > >comfortable_free_pages relative to process RSS), or something to that > >effect. It is compatible with parallel notifications. > > Not sure that I got it. Do you suggest to ask all the applications to free say 3% of their memory?. > Some may be able to free more, and some cannot free any. Isn't it more practical to just notify them, and let each app contribute its part to the global moving target? 
The problem is, how is each process supposed to know how much memory it should free for each notification received, that is, its part? It's easier if there is a goal, a hint of how many pages the process should release. > >> - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience. > >I am suggesting an additional field in the notification data so that the > >freeing routine has a goal. But it is not mandatory. > > If you do want to support voluntary (notification less) app decisions, based on the current status, then why not satisfy with this API and only use the notifications to trigger this procedure? > > > > >>- Iterative release loops, until there are enough free pages. > >What is the advantage versus releasing the necessary amount of > >memory in a given moment? > > The cleanup logic may be unaware of the page-level effects of its alloc and free, more so when freeing complex internal data structures (such as cached web pages), and this way you let it free until things settle down. > > Ronen. > > > > >>I believe that the notification should only serve as a trigger to run the cleanup. > >Agree.
> > > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752026Ab2AYKsw (ORCPT ); Wed, 25 Jan 2012 05:48:52 -0500 Received: from mx1.redhat.com ([209.132.183.28]:2284 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751415Ab2AYKsv (ORCPT ); Wed, 25 Jan 2012 05:48:51 -0500 Message-ID: <4F1FDDE2.9050609@redhat.com> Date: Wed, 25 Jan 2012 12:48:02 +0200 From: Ronen Hod User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0 MIME-Version: 1.0 To: Marcelo Tosatti CC: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 1/3] /dev/low_mem_notify References: <4F175706.8000808@redhat.com> <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <4F1ED77F.4090900@redhat.com> <20120124181034.GA19186@amt.cnet> <4F1FC2C8.10103@redhat.com> <20120125101209.GB29167@amt.cnet> In-Reply-To: <20120125101209.GB29167@amt.cnet> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/25/2012 12:12 PM, Marcelo Tosatti wrote: > On Wed, Jan 25, 2012 at 10:52:24AM +0200, Ronen Hod wrote: >> On 01/24/2012 08:10 PM, Marcelo Tosatti wrote: >>> On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote: >>>> On 01/24/2012 05:38 PM, Marcelo Tosatti wrote: >>>>> On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote: >>>>>>> -----Original Message----- >>>>>>> From: ext Ronen Hod [mailto:rhod@redhat.com] >>>>>>> Sent: 19 
January, 2012 11:20 >>>>>>> To: Pekka Enberg >>>>>> ... >>>>>>>>>> Isn't >>>>>>>>>> >>>>>>>>>> /proc/sys/vm/min_free_kbytes >>>>>>>>>> >>>>>>>>>> pretty much just that? >>>>>>>>> Would you suggest to use min_free_kbytes as the threshold for sending >>>>>>>>> low_memory_notifications to applications, and separately as a target >>>>>>>>> value for the applications' memory giveaway? >>>>>>>> I'm not saying that the kernel should use it directly but it seems >>>>>>>> like the kind of "ideal number of free pages" threshold you're >>>>>>>> suggesting. So userspace can read that value and use it as the "number >>>>>>>> of free pages" threshold for VM events, no? >>>>>>> Yes, I like it. The rules of the game are simple and consistent all over, be it the >>>>>>> alert threshold, voluntary poling by the apps, and for concurrent work by >>>>>>> several applications. >>>>>>> Well, as long as it provides a good indication for low_mem_pressure. >>>>>> For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount >>>>>> of memory available for GFP_ATOMIC allocations. In case situation comes under pointed level kernel will reclaim memory from e.g. caches. >>>>>> >>>>>>> From potential user point of view the proposed API has number of lacks which would be nice to have implemented: >>>>>> 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here >>>>>> 2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space >>>>> Having userspace specify the "sample period" for low memory notification >>>>> makes no sense. The frequency of notifications is a function of the >>>>> memory pressure. >>>>> >>>>>> 3. 
API should be tunable for propagate changes when level is Up or Down, maybe both ways. >>>>>> 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big. >>>>>> 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base. >>>>> It would make the interface easier to use if it provided the number of >>>>> pages to free, in the notification (kernel can calculate that as the >>>>> delta between current_free_pages -> comfortable_free_pages relative to >>>>> process RSS). >>>> If you rely on the notification's argument you lose several features: >>>> - Handling of notifications by several applications in parallel >>> Each application has its argument built in a custom fashion >>> (pages_to_free = delta between current_free_pages -> >>> comfortable_free_pages relative to process RSS), or something to that >>> effect. It is compatible with parallel notifications. >> Not sure that I got it. Do you suggest to ask all the applications to free say 3% of their memory?. >> Some may be able to free more, and some cannot free any. Isn't it more practical to just notify them, and let each app contribute its part to the global moving target? > The problem is, how is each process supposed to know how much memory > it should free for each notification received, that is, its part? > > Its easier if there is a goal, a hint of how many pages the process > should release. I have to agree. Still, the amount of memory that an app should free at each memory-pressure level is best calculated inside the application (based on comfortable_free_pages relative to process RSS, as you suggested). Fairness is also an issue. And if the memory pressure ends in the meantime, would you recommend that the application continue with its work? Ronen.
> >>>> - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience. >>> I am suggesting an additional field in the notification data so that the >>> freeing routine has a goal. But it is not mandatory. >> If you do want to support voluntary (notification less) app decisions, based on the current status, then why not satisfy with this API and only use the notifications to trigger this procedure? >> >>>> - Iterative release loops, until there are enough free pages. >>> What is the advantage versus releasing the necessary amount of >>> memory in a given moment? >> The cleanup logic may be unaware of the page-level effects of its alloc and free, more so when freeing complex internal data structures (such as cached web pages), and this way you let it free until things settle down. >> >> Ronen. >> >>>> I believe that the notification should only serve as a trigger to run the cleanup. >>> Agree. >>> >>> > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753033Ab2AZQU0 (ORCPT ); Thu, 26 Jan 2012 11:20:26 -0500 Received: from mx1.redhat.com ([209.132.183.28]:46105 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752972Ab2AZQUW (ORCPT ); Thu, 26 Jan 2012 11:20:22 -0500 Date: Thu, 26 Jan 2012 14:17:58 -0200 From: Marcelo Tosatti To: Ronen Hod Cc: leonid.moiseichuk@nokia.com, penberg@kernel.org, riel@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, mel@csn.ul.ie, rientjes@google.com, kosaki.motohiro@gmail.com, hannes@cmpxchg.org, akpm@linux-foundation.org, kosaki.motohiro@jp.fujitsu.com Subject: Re: [RFC 
1/3] /dev/low_mem_notify Message-ID: <20120126161758.GA28367@amt.cnet> References: <4F17DCED.4020908@redhat.com> <4F17E058.8020008@redhat.com> <84FF21A720B0874AA94B46D76DB9826904559D46@008-AM1MPN1-003.mgdnok.nokia.com> <20120124153835.GA10990@amt.cnet> <4F1ED77F.4090900@redhat.com> <20120124181034.GA19186@amt.cnet> <4F1FC2C8.10103@redhat.com> <20120125101209.GB29167@amt.cnet> <4F1FDDE2.9050609@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F1FDDE2.9050609@redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > >it should free for each notification received, that is, its part? > > > >Its easier if there is a goal, a hint of how many pages the process > >should release. > > I have to agree. > Still, the amount of memory that an app should free per memory-pressure-level can be best calculated inside the application (based on comfortable_free_pages relative to process RSS, as you suggested). It is easier if the kernel calculates the target (the application is free to ignore the hint, of course), because it depends on information not readily available in userspace. > Fairness is also an issue. > And, if in the meantime the memory pressure ended, would you recommend that the application will continue with its work? There appears to be interest in an event to notify that higher levels of memory are available (see Leonid's email).
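The hint discussed above (pages_to_free = delta between current_free_pages and comfortable_free_pages, relative to process RSS) is never spelled out in the thread. Below is a minimal C sketch of one possible reading, assuming the system-wide shortfall is split in proportion to each process's share of total RSS. The function name pages_to_free_hint, the total_rss_pages parameter, the 32-bit page counts, and the round-up are all hypothetical choices made for illustration; none of them come from the RFC patches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the RFC patches.
 *
 * Returns a per-process hint: the system-wide free-page shortfall
 * (comfortable_free_pages - current_free_pages), scaled by the
 * process's share of total resident pages and rounded up.
 * Page counts are 32-bit so the 64-bit intermediate cannot overflow.
 */
static uint32_t pages_to_free_hint(uint32_t current_free_pages,
                                   uint32_t comfortable_free_pages,
                                   uint32_t process_rss_pages,
                                   uint32_t total_rss_pages)
{
    /* No pressure, or nothing to apportion against: no hint. */
    if (current_free_pages >= comfortable_free_pages || total_rss_pages == 0)
        return 0;

    uint64_t shortfall = (uint64_t)comfortable_free_pages - current_free_pages;

    /* A process cannot hold more than the total; clamp bad input. */
    if (process_rss_pages > total_rss_pages)
        process_rss_pages = total_rss_pages;

    /* Proportional share of the shortfall, rounded up. */
    uint64_t hint = (shortfall * process_rss_pages + total_rss_pages - 1)
                    / total_rss_pages;

    /* Sanity check: no single process is asked for more than the shortfall. */
    assert(hint <= shortfall);
    return (uint32_t)hint;
}
```

With 1000 free pages, a comfortable level of 1500, and a process holding 250 of 1000 resident pages, this sketch yields a hint of 125 pages. As noted in the thread, the application would remain free to ignore the hint.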