[RFC PATCH 0/2] mm: Add ability to monitor task's memory changes

* [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
@ 2012-11-30 17:55 Pavel Emelyanov
  2012-11-30 17:55 ` [PATCH 1/2] mm: Mark VMA with VM_TRACE bit Pavel Emelyanov
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Pavel Emelyanov @ 2012-11-30 17:55 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel

Hello,

This is an attempt to implement support for memory snapshots for the
checkpoint-restore project (http://criu.org).

To create a dump of an application (or several) we save all the information
about it to files. Unsurprisingly, the biggest part of such a dump is the
contents of the tasks' memory. However, in some usage scenarios it's not
required to get _all_ the task memory while creating a dump. For example, when
doing periodic dumps, a full memory dump is required only at the first step;
after that it's enough to take incremental changes of memory. Another example
is live migration. In the simplest form it looks like this: create a dump, copy
it to the remote node, then restore the tasks from the dump files. While all
this dump-copy-restore thing goes on, all the processes must be stopped.
However, if we can monitor how tasks change their memory, we can dump and copy
it in smaller chunks, periodically updating it, and thus freeze the tasks only
at the very end, for a very short time, to pick up the most recent changes.

That said, some help from the kernel is required to watch how processes modify
the contents of their memory. I'd like to propose one possible solution to this
task, based on page faults and trace events.

Briefly, the approach is: remap some memory regions as read-only, get the #pf
on a task's attempt to modify the memory and issue a trace event for it. Since
we're only interested in parts of the memory of some tasks, make it possible to
mark the vmas we're interested in and issue events for them only. Also, to be
aware of tasks unmapping the vmas being watched, issue an event when a marked
vma is removed (and, for symmetry, an event when a vma is marked).
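
To show the write-protect-and-fault idea in isolation, here is a minimal
userspace sketch. It is only an analogue and not what the patches do:
mprotect() stands in for the kernel remapping the vma read-only, and a SIGSEGV
handler stands in for the trace event. All names (watch_start, watch_write,
watch_is_dirty and friends) are made up for illustration, and the sketch
assumes Linux with anonymous mmap().

```c
/* Userspace analogue of write-protect-and-fault dirty tracking.
 * NOT the kernel mechanism proposed here: mprotect() and SIGSEGV stand in
 * for the read-only remap and the trace event. All names are hypothetical. */
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WATCH_MAX_PAGES 64

static char *watch_region;		/* start of the watched mapping */
static size_t watch_pages;		/* pages in the mapping */
static long watch_psize;		/* system page size */
static volatile sig_atomic_t watch_dirty[WATCH_MAX_PAGES];

/* Write fault on a protected page: log it, then let the write proceed. */
static void watch_fault(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t addr = (uintptr_t)info->si_addr;
	uintptr_t base = (uintptr_t)watch_region;
	size_t idx;

	(void)sig;
	(void)ctx;
	if (addr < base || addr >= base + watch_pages * (size_t)watch_psize)
		_exit(1);	/* not in the watched range: a real crash */

	idx = (addr - base) / (size_t)watch_psize;
	watch_dirty[idx] = 1;
	/* mprotect() is not formally async-signal-safe; fine for a sketch. */
	mprotect(watch_region + idx * (size_t)watch_psize,
		 (size_t)watch_psize, PROT_READ | PROT_WRITE);
}

/* Map npages of anonymous memory and start watching it for writes. */
int watch_start(size_t npages)
{
	struct sigaction sa;
	size_t i;

	if (npages == 0 || npages > WATCH_MAX_PAGES)
		return -1;
	watch_psize = sysconf(_SC_PAGESIZE);
	assert(watch_psize > 0);

	watch_region = mmap(NULL, npages * (size_t)watch_psize,
			    PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (watch_region == MAP_FAILED)
		return -1;
	watch_pages = npages;
	for (i = 0; i < WATCH_MAX_PAGES; i++)
		watch_dirty[i] = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = watch_fault;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, NULL) < 0)
		return -1;

	/* The "mark" step: from now on the first write to each page faults. */
	return mprotect(watch_region, npages * (size_t)watch_psize, PROT_READ);
}

/* Volatile store so the write is not reordered around the dirty check. */
void watch_write(size_t offset, char value)
{
	*(volatile char *)(watch_region + offset) = value;
}

char watch_read(size_t offset)
{
	return *(volatile char *)(watch_region + offset);
}

int watch_is_dirty(size_t page)
{
	return page < watch_pages && watch_dirty[page];
}

long watch_page_size(void)
{
	return watch_psize;
}
```

After watch_start(4), a watch_write() into the third page takes one fault,
after which watch_is_dirty(2) reports the page while the untouched pages stay
clean. Only the first write to a page faults, which is the same property the
kernel-side approach relies on.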

What do you think about this approach? Is this way of supporting mem snapshot
OK for you, or should we invent some better one?

Thanks,
Pavel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH 1/2] mm: Mark VMA with VM_TRACE bit
  2012-11-30 17:55 [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes Pavel Emelyanov
@ 2012-11-30 17:55 ` Pavel Emelyanov
  2012-11-30 17:55 ` [PATCH 2/2] mm: Generate events when tasks change their memory Pavel Emelyanov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Pavel Emelyanov @ 2012-11-30 17:55 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel

When a vma is marked, mmu events on it are reported via the
trace-events engine. For now only two events are added: one
when the mark is set (on) and one when it is unset or the marked
vma is unmapped (off). The mark is not inherited on fork().

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
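
For reviewers, the flag handling in this patch can be modelled in plain C as
a rough sketch. The VM_TRACE, MADV_DOTRACE and MADV_DONTTRACE values are the
ones proposed in the diff below (they are not mainline ABI), VM_LOCKED is the
existing value from include/linux/mm.h, and the model_* helpers and the
trace_event enum are hypothetical names that exist only in this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Values proposed by this patch; not part of any released kernel ABI. */
#define VM_TRACE	0x00000200UL
#define MADV_DOTRACE	18
#define MADV_DONTTRACE	19
/* Existing flag, cleared next to VM_TRACE in dup_mmap(). */
#define VM_LOCKED	0x00002000UL

enum trace_event { EV_NONE, EV_TRACE_ON, EV_TRACE_OFF };

/*
 * Model of the two new cases in madvise_behavior(): returns the new flags
 * and stores which event would fire. The conditions mirror the
 * TP_CONDITION() of mmu_trace_on and mmu_trace_off.
 */
unsigned long model_madvise(unsigned long flags, int behavior,
			    enum trace_event *ev)
{
	assert(ev != NULL);
	*ev = EV_NONE;

	switch (behavior) {
	case MADV_DOTRACE:
		if (!(flags & VM_TRACE))
			*ev = EV_TRACE_ON;	/* only on a 0 -> 1 change */
		return flags | VM_TRACE;
	case MADV_DONTTRACE:
		if (flags & VM_TRACE)
			*ev = EV_TRACE_OFF;	/* only if it was marked */
		return flags & ~VM_TRACE;
	default:
		return flags;
	}
}

/* Model of the dup_mmap() hunk: the child never inherits the mark. */
unsigned long model_fork(unsigned long flags)
{
	return flags & ~(VM_LOCKED | VM_TRACE);
}
```

Marking an already marked vma yields no second "on" event in this model,
which matches the conditional tracepoints. The remove_vma() hook uses the same
"off" condition, so unmapping an unmarked vma stays silent as well.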
 fs/proc/task_mmu.c                     |    1 +
 include/linux/mm.h                     |    1 +
 include/trace/events/mmu.h             |   48 ++++++++++++++++++++++++++++++++
 include/uapi/asm-generic/mman-common.h |    2 +
 kernel/fork.c                          |    2 +-
 mm/madvise.c                           |   12 ++++++++
 mm/memory.c                            |    3 ++
 mm/mmap.c                              |    3 ++
 8 files changed, 71 insertions(+), 1 deletions(-)
 create mode 100644 include/trace/events/mmu.h

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c0b4a04..3d43343 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -564,6 +564,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_HUGEPAGE)]	= "hg",
 		[ilog2(VM_NOHUGEPAGE)]	= "nh",
 		[ilog2(VM_MERGEABLE)]	= "mg",
+		[ilog2(VM_TRACE)]	= "tr",
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcaab4e..c7fad8d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -84,6 +84,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MAYSHARE	0x00000080
 
 #define VM_GROWSDOWN	0x00000100	/* general info on the segment */
+#define VM_TRACE	0x00000200	/* generate trace events */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
 
diff --git a/include/trace/events/mmu.h b/include/trace/events/mmu.h
new file mode 100644
index 0000000..71b1ba6
--- /dev/null
+++ b/include/trace/events/mmu.h
@@ -0,0 +1,48 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mmu
+
+#if !defined(_TRACE_MMU_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MMU_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT_CONDITION(mmu_trace_on,
+		TP_PROTO(struct vm_area_struct *vma),
+
+		TP_ARGS(vma),
+
+		TP_CONDITION(!(vma->vm_flags & VM_TRACE)),
+
+		TP_STRUCT__entry(
+			__field(unsigned long, start)
+		),
+
+		TP_fast_assign(
+			__entry->start = vma->vm_start;
+		),
+
+		TP_printk("start %#lx", __entry->start)
+);
+
+TRACE_EVENT_CONDITION(mmu_trace_off,
+		TP_PROTO(struct vm_area_struct *vma),
+
+		TP_ARGS(vma),
+
+		TP_CONDITION(vma->vm_flags & VM_TRACE),
+
+		TP_STRUCT__entry(
+			__field(unsigned long, start)
+		),
+
+		TP_fast_assign(
+			__entry->start = vma->vm_start;
+		),
+
+		TP_printk("start %#lx", __entry->start)
+);
+
+#endif /* _TRACE_MMU_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index d030d2c..c2b633d 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -51,6 +51,8 @@
 #define MADV_DONTDUMP   16		/* Explicity exclude from the core dump,
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
+#define MADV_DOTRACE	18		/* generate mmu: trace events */
+#define MADV_DONTTRACE	19		/* stop generating events */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..068ec0d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -408,7 +408,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		tmp->vm_mm = mm;
 		if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~VM_LOCKED;
+		tmp->vm_flags &= ~(VM_LOCKED | VM_TRACE);
 		tmp->vm_next = tmp->vm_prev = NULL;
 		file = tmp->vm_file;
 		if (file) {
diff --git a/mm/madvise.c b/mm/madvise.c
index 03dfa5c..65633e9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -17,6 +17,8 @@
 #include <linux/fs.h>
 #include <linux/file.h>
 
+#include <trace/events/mmu.h>
+
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
  * take mmap_sem for writing. Others, which simply traverse vmas, need
@@ -61,6 +63,14 @@ static long madvise_behavior(struct vm_area_struct * vma,
 	case MADV_DONTFORK:
 		new_flags |= VM_DONTCOPY;
 		break;
+	case MADV_DOTRACE:
+		trace_mmu_trace_on(vma);
+		new_flags |= VM_TRACE;
+		break;
+	case MADV_DONTTRACE:
+		trace_mmu_trace_off(vma);
+		new_flags &= ~VM_TRACE;
+		break;
 	case MADV_DOFORK:
 		if (vma->vm_flags & VM_IO) {
 			error = -EINVAL;
@@ -314,6 +324,8 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_DOTRACE:
+	case MADV_DONTTRACE:
 		return 1;
 
 	default:
diff --git a/mm/memory.c b/mm/memory.c
index 221fc9f..a6f5951 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -58,6 +58,9 @@
 #include <linux/elf.h>
 #include <linux/gfp.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/mmu.h>
+
 #include <asm/io.h>
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
diff --git a/mm/mmap.c b/mm/mmap.c
index 9a796c4..29c9e69 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -32,6 +32,8 @@
 #include <linux/khugepaged.h>
 #include <linux/uprobes.h>
 
+#include <trace/events/mmu.h>
+
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
 #include <asm/tlb.h>
@@ -227,6 +229,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
+	trace_mmu_trace_off(vma);
 	mpol_put(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
-- 
1.7.6.5


* [PATCH 2/2] mm: Generate events when tasks change their memory
  2012-11-30 17:55 [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes Pavel Emelyanov
  2012-11-30 17:55 ` [PATCH 1/2] mm: Mark VMA with VM_TRACE bit Pavel Emelyanov
@ 2012-11-30 17:55 ` Pavel Emelyanov
  2012-12-03 23:42   ` Xiao Guangrong
  2012-12-03  8:36 ` [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes Glauber Costa
  2012-12-03 22:43 ` Andrew Morton
  3 siblings, 1 reply; 17+ messages in thread
From: Pavel Emelyanov @ 2012-11-30 17:55 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel

When vma tracing is ON, the vma memory is remapped to a read-only
state. Later, the page-fault handlers send the event via the
tracing engine.

With the existing on/off events this makes it possible to monitor
how processes modify their memory contents.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
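
A consumer of these events would presumably read them from the tracing
buffer and turn each mmu_page_mod line into a page address. The sketch below
shows one way to do that. The event name and the "vaddr %#lx" format come from
the TP_printk() in this patch; the rest of the line layout, and the helper
names parse_page_mod() and page_base(), are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * If `line` carries an mmu_page_mod event, store the faulting address in
 * *vaddr and return 1. Return 0 for any other line.
 */
int parse_page_mod(const char *line, unsigned long *vaddr)
{
	static const char tag[] = "mmu_page_mod: vaddr ";
	const char *p;
	char *end;
	unsigned long v;

	assert(line != NULL && vaddr != NULL);

	p = strstr(line, tag);
	if (!p)
		return 0;
	p += sizeof(tag) - 1;

	/* Base 16 also accepts the "0x" prefix that %#lx prints. */
	v = strtoul(p, &end, 16);
	if (end == p)
		return 0;

	*vaddr = v;
	return 1;
}

/* Round a faulting address down to its page (page_size: power of two). */
unsigned long page_base(unsigned long vaddr, unsigned long page_size)
{
	return vaddr & ~(page_size - 1);
}
```

Since the event reports the faulting address rather than the page, a consumer
has to round it down itself before adding it to the set of pages to re-dump.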
 include/linux/mm.h         |    3 +++
 include/trace/events/mmu.h |   18 ++++++++++++++++++
 mm/huge_memory.c           |    4 ++++
 mm/madvise.c               |   11 +++++++++++
 mm/memory.c                |    2 ++
 mm/mprotect.c              |    2 +-
 6 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c7fad8d..7e5fe10 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1082,6 +1082,9 @@ extern unsigned long do_mremap(unsigned long addr,
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
+void change_protection(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int dirty_accountable);
 
 /*
  * doesn't attempt to fault and will return short.
diff --git a/include/trace/events/mmu.h b/include/trace/events/mmu.h
index 71b1ba6..d1bff37 100644
--- a/include/trace/events/mmu.h
+++ b/include/trace/events/mmu.h
@@ -24,6 +24,24 @@ TRACE_EVENT_CONDITION(mmu_trace_on,
 		TP_printk("start %#lx", __entry->start)
 );
 
+TRACE_EVENT_CONDITION(mmu_page_mod,
+		TP_PROTO(struct vm_area_struct *vma, unsigned long vaddr),
+
+		TP_ARGS(vma, vaddr),
+
+		TP_CONDITION(vma->vm_flags & VM_TRACE),
+
+		TP_STRUCT__entry(
+			__field(unsigned long, vaddr)
+		),
+
+		TP_fast_assign(
+			__entry->vaddr = vaddr;
+		),
+
+		TP_printk("vaddr %#lx", __entry->vaddr)
+);
+
 TRACE_EVENT_CONDITION(mmu_trace_off,
 		TP_PROTO(struct vm_area_struct *vma),
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..7a93683 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -22,6 +22,8 @@
 #include <asm/pgalloc.h>
 #include "internal.h"
 
+#include <trace/events/mmu.h>
+
 /*
  * By default transparent hugepage support is enabled for all mappings
  * and khugepaged scans all mappings. Defrag is only invoked by
@@ -888,6 +890,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
+	trace_mmu_page_mod(vma, address);
+
 	VM_BUG_ON(!vma->anon_vma);
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
diff --git a/mm/madvise.c b/mm/madvise.c
index 65633e9..05361c2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -64,6 +64,17 @@ static long madvise_behavior(struct vm_area_struct * vma,
 		new_flags |= VM_DONTCOPY;
 		break;
 	case MADV_DOTRACE:
+		/*
+		 * Protect pages to be read-only and force tasks to generate
+		 * #PFs on modification.
+		 *
+		 * It should be done before issuing trace-on event. Otherwise
+		 * we're leaving a short window after the 'on' event when tasks
+		 * can still modify pages.
+		 */
+		change_protection(vma, start, end,
+				vm_get_page_prot(vma->vm_flags & ~VM_READ),
+				vma_wants_writenotify(vma));
 		trace_mmu_trace_on(vma);
 		new_flags |= VM_TRACE;
 		break;
diff --git a/mm/memory.c b/mm/memory.c
index a6f5951..1dd30ae 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2533,6 +2533,8 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
 
+	trace_mmu_page_mod(vma, address);
+
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page) {
 		/*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..91c2266 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -119,7 +119,7 @@ static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 	} while (pud++, addr = next, addr != end);
 }
 
-static void change_protection(struct vm_area_struct *vma,
+void change_protection(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
-- 
1.7.6.5


* Re: [PATCH 2/2] mm: Generate events when tasks change their memory
  2012-11-30 17:55 ` [PATCH 2/2] mm: Generate events when tasks change their memory Pavel Emelyanov
@ 2012-12-03 23:42   ` Xiao Guangrong
  2012-12-04  5:04     ` Pavel Emelyanov
  0 siblings, 1 reply; 17+ messages in thread
From: Xiao Guangrong @ 2012-12-03 23:42 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel

On 12/01/2012 01:55 AM, Pavel Emelyanov wrote:

>  	case MADV_DOTRACE:
> +		/*
> +		 * Protect pages to be read-only and force tasks to generate
> +		 * #PFs on modification.
> +		 *
> +		 * It should be done before issuing trace-on event. Otherwise
> +		 * we're leaving a short window after the 'on' event when tasks
> +		 * can still modify pages.
> +		 */
> +		change_protection(vma, start, end,
> +				vm_get_page_prot(vma->vm_flags & ~VM_READ),
> +				vma_wants_writenotify(vma));

Should be VM_WRITE?


* Re: [PATCH 2/2] mm: Generate events when tasks change their memory
  2012-12-03 23:42   ` Xiao Guangrong
@ 2012-12-04  5:04     ` Pavel Emelyanov
  0 siblings, 0 replies; 17+ messages in thread
From: Pavel Emelyanov @ 2012-12-04  5:04 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel

On 12/04/2012 03:42 AM, Xiao Guangrong wrote:
> On 12/01/2012 01:55 AM, Pavel Emelyanov wrote:
> 
>>  	case MADV_DOTRACE:
>> +		/*
>> +		 * Protect pages to be read-only and force tasks to generate
>> +		 * #PFs on modification.
>> +		 *
>> +		 * It should be done before issuing trace-on event. Otherwise
>> +		 * we're leaving a short window after the 'on' event when tasks
>> +		 * can still modify pages.
>> +		 */
>> +		change_protection(vma, start, end,
>> +				vm_get_page_prot(vma->vm_flags & ~VM_READ),
>> +				vma_wants_writenotify(vma));
> 
> Should be VM_WRITE?

Ooops! Yes, sure. I guess I accidentally broke it while cleaning/splitting patch :(
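
To make the difference concrete, here is a small sketch of what the two masks
do to the flags handed to vm_get_page_prot(). VM_READ and VM_WRITE are the
real values from include/linux/mm.h; the prot_flags_* helpers are hypothetical
names used only here, and the "fixed" variant is simply the change agreed on
above, not a resubmitted patch.

```c
#include <assert.h>

/* Low vm_flags bits as defined in include/linux/mm.h. */
#define VM_READ		0x00000001UL
#define VM_WRITE	0x00000002UL

/* Flags as passed to vm_get_page_prot() in the posted patch. */
unsigned long prot_flags_posted(unsigned long vm_flags)
{
	return vm_flags & ~VM_READ;
}

/* Flags with VM_WRITE masked instead, as agreed on above. */
unsigned long prot_flags_fixed(unsigned long vm_flags)
{
	return vm_flags & ~VM_WRITE;
}

/* Small self-check showing the effect on a read-write mapping. */
void prot_flags_selfcheck(void)
{
	unsigned long rw = VM_READ | VM_WRITE;

	/* Posted version: the write bit survives, the read bit is lost. */
	assert(prot_flags_posted(rw) == VM_WRITE);
	/* Fixed version: readable, and writes are no longer requested. */
	assert(prot_flags_fixed(rw) == VM_READ);
}
```

With the mask as posted, the protection is computed from flags that still
contain VM_WRITE, which is not what the "protect pages to be read-only"
comment describes.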


* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-11-30 17:55 [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes Pavel Emelyanov
  2012-11-30 17:55 ` [PATCH 1/2] mm: Mark VMA with VM_TRACE bit Pavel Emelyanov
  2012-11-30 17:55 ` [PATCH 2/2] mm: Generate events when tasks change their memory Pavel Emelyanov
@ 2012-12-03  8:36 ` Glauber Costa
  2012-12-03 20:16   ` Marcelo Tosatti
  2012-12-03 22:43 ` Andrew Morton
  3 siblings, 1 reply; 17+ messages in thread
From: Glauber Costa @ 2012-12-03  8:36 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel,
	Marcelo Tosatti, Gleb Natapov, kvm

On 11/30/2012 09:55 PM, Pavel Emelyanov wrote:
> Hello,
> 
> This is an attempt to implement support for memory snapshot for the the
> checkpoint-restore project (http://criu.org).
> 
> To create a dump of an application(s) we save all the information about it
> to files. No surprise, the biggest part of such dump is the contents of tasks'
> memory. However, in some usage scenarios it's not required to get _all_ the
> task memory while creating a dump. For example, when doing periodical dumps
> it's only required to take full memory dump only at the first step and then
> take incremental changes of memory. Another example is live migration. In the
> simplest form it looks like -- create dump, copy it on the remote node then
> restore tasks from dump files. While all this dump-copy-restore thing goes all
> the process must be stopped. However, if we can monitor how tasks change their
> memory, we can dump and copy it in smaller chunks, periodically updating it 
> and thus freezing tasks only at the very end for the very short time to pick
> up the recent changes.
> 
> That said, some help from kernel to watch how processes modify the contents of
> their memory is required. I'd like to propose one possible solution of this
> task -- with the help of page-faults and trace events.
> 
> Briefly the approach is -- remap some memory regions as read-only, get the #pf
> on task's attempt to modify the memory and issue a trace event of that. Since
> we're only interested in parts of memory of some tasks, make it possible to mark
> the vmas we're interested in and issue events for them only. Also, to be aware
> of tasks unmapping the vma-s being watched, also issue an event when the marked
> vma is removed (and for symmetry -- an event when a vma is marked).
> 
> What do you think about this approach? Is this way of supporting mem snapshot
> OK for you, or should we invent some better one?
> 

The page fault mechanism is pretty obvious - anything that deals with
dirty pages will end up having to do this. So there is nothing crazy
about this.

What concerns me, however, is that should this go in, we'll have two
dirty mem loggers in the kernel: one to support CRIU, one to support
KVM. And the worst part: they have exactly the same purpose!!

So to begin with, I think one thing to consider would be to generalize
KVM's dirty memory notification so it can work on a normal process
memory region. The KVM API requires a "memory slot" to be passed, something
we are unlikely to have. But KVM can easily keep its API and use
alternate mechanics, that's trivial...

Generally speaking, KVM will do polling with this ioctl. I prefer your
tracing mechanism. The only difference is that KVM tends to
transfer large chunks of memory in some loads - in the high gigs range.
So the proposed tracing API should be able to optionally batch requests
within a time frame.
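
One possible reading of "batching", sketched in userspace C: collect events
into a per-page bitmap during the time frame, so repeated hits on the same
page collapse into one entry, and hand out the whole set when the frame ends.
This is only an illustration of the idea; struct dirty_batch, BATCH_PAGES and
the batch_* functions are hypothetical and are not part of the patches or of
KVM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BATCH_PAGES 1024	/* pages covered by one batch (arbitrary) */

struct dirty_batch {
	unsigned long base;		/* start of the tracked range */
	unsigned long page_size;
	unsigned char bits[BATCH_PAGES / 8];
};

void batch_init(struct dirty_batch *b, unsigned long base,
		unsigned long page_size)
{
	assert(page_size != 0);
	b->base = base;
	b->page_size = page_size;
	memset(b->bits, 0, sizeof(b->bits));
}

/*
 * Record one event. Returns 1 if the page is new in this batch, 0 if it
 * was already recorded, -1 if vaddr is outside the tracked range.
 */
int batch_add(struct dirty_batch *b, unsigned long vaddr)
{
	unsigned long idx;
	unsigned char mask;

	if (vaddr < b->base)
		return -1;
	idx = (vaddr - b->base) / b->page_size;
	if (idx >= BATCH_PAGES)
		return -1;

	mask = (unsigned char)(1u << (idx % 8));
	if (b->bits[idx / 8] & mask)
		return 0;
	b->bits[idx / 8] |= mask;
	return 1;
}

/* End of the time frame: report up to max page addresses and clear them. */
size_t batch_flush(struct dirty_batch *b, unsigned long *out, size_t max)
{
	size_t n = 0;
	unsigned long idx;

	for (idx = 0; idx < BATCH_PAGES && n < max; idx++) {
		unsigned char mask = (unsigned char)(1u << (idx % 8));

		if (!(b->bits[idx / 8] & mask))
			continue;
		out[n++] = b->base + idx * b->page_size;
		b->bits[idx / 8] &= (unsigned char)~mask;
	}
	return n;
}
```

The caller decides how long a frame is by choosing when to call
batch_flush(); pages that did not fit into `out` stay set and are reported by
the next flush.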

It would also be good to hear what the KVM guys think of it.




* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-12-03  8:36 ` [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes Glauber Costa
@ 2012-12-03 20:16   ` Marcelo Tosatti
  2012-12-04  7:39     ` Glauber Costa
  0 siblings, 1 reply; 17+ messages in thread
From: Marcelo Tosatti @ 2012-12-03 20:16 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Pavel Emelyanov, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	Michal Hocko, Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel,
	Gleb Natapov, kvm

On Mon, Dec 03, 2012 at 12:36:33PM +0400, Glauber Costa wrote:
> On 11/30/2012 09:55 PM, Pavel Emelyanov wrote:
> > Hello,
> > 
> > This is an attempt to implement support for memory snapshot for the the
> > checkpoint-restore project (http://criu.org).
> > 
> > To create a dump of an application(s) we save all the information about it
> > to files. No surprise, the biggest part of such dump is the contents of tasks'
> > memory. However, in some usage scenarios it's not required to get _all_ the
> > task memory while creating a dump. For example, when doing periodical dumps
> > it's only required to take full memory dump only at the first step and then
> > take incremental changes of memory. Another example is live migration. In the
> > simplest form it looks like -- create dump, copy it on the remote node then
> > restore tasks from dump files. While all this dump-copy-restore thing goes all
> > the process must be stopped. However, if we can monitor how tasks change their
> > memory, we can dump and copy it in smaller chunks, periodically updating it 
> > and thus freezing tasks only at the very end for the very short time to pick
> > up the recent changes.
> > 
> > That said, some help from kernel to watch how processes modify the contents of
> > their memory is required. I'd like to propose one possible solution of this
> > task -- with the help of page-faults and trace events.
> > 
> > Briefly the approach is -- remap some memory regions as read-only, get the #pf
> > on task's attempt to modify the memory and issue a trace event of that. Since
> > we're only interested in parts of memory of some tasks, make it possible to mark
> > the vmas we're interested in and issue events for them only. Also, to be aware
> > of tasks unmapping the vma-s being watched, also issue an event when the marked
> > vma is removed (and for symmetry -- an event when a vma is marked).
> > 
> > What do you think about this approach? Is this way of supporting mem snapshot
> > OK for you, or should we invent some better one?
> > 
> 
> The page fault mechanism is pretty obvious - anything that deals with
> dirty pages will end up having to do this. So there is nothing crazy
> about this.
> 
> What concerns me, however, is that should this go in, we'll have two
> dirty mem loggers in the kernel: one to support CRIU, one to support
> KVM. And the worst part: They have the exact the same purpose!!
> 
> So to begin with, I think one thing to consider, would be to generalize
> KVM's dirty memory notification so it can work on a normal process
> memory region. KVM api requires a "memory slot" to be passed, something
> we are unlikely to have. But KVM can easily keep its API and use an
> alternate mechanics, that's trivial...
> 
> Generally speaking, KVM will do polling with this ioctl. I prefer your
> tracing mechanism better. The only difference, is that KVM tends to
> transfer large chunks of memory in some loads - in the high gigs range.
> So the proposal tracing API should be able to optionally batch requests
> within a time frame.
> 
> It would also be good to hear what does the KVM guys think of it as well

There are significant differences. KVM's dirty logging works for
guest translations (NPT/shadow) and is optimized for specific use cases.

Above is about dirty logging of userspace memory areas.


* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-12-03 20:16   ` Marcelo Tosatti
@ 2012-12-04  7:39     ` Glauber Costa
  0 siblings, 0 replies; 17+ messages in thread
From: Glauber Costa @ 2012-12-04  7:39 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Pavel Emelyanov, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	Michal Hocko, Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel,
	Gleb Natapov, kvm

On 12/04/2012 12:16 AM, Marcelo Tosatti wrote:
> On Mon, Dec 03, 2012 at 12:36:33PM +0400, Glauber Costa wrote:
>> On 11/30/2012 09:55 PM, Pavel Emelyanov wrote:
>>> Hello,
>>>
>>> This is an attempt to implement support for memory snapshot for the the
>>> checkpoint-restore project (http://criu.org).
>>>
>>> To create a dump of an application(s) we save all the information about it
>>> to files. No surprise, the biggest part of such dump is the contents of tasks'
>>> memory. However, in some usage scenarios it's not required to get _all_ the
>>> task memory while creating a dump. For example, when doing periodical dumps
>>> it's only required to take full memory dump only at the first step and then
>>> take incremental changes of memory. Another example is live migration. In the
>>> simplest form it looks like -- create dump, copy it on the remote node then
>>> restore tasks from dump files. While all this dump-copy-restore thing goes all
>>> the process must be stopped. However, if we can monitor how tasks change their
>>> memory, we can dump and copy it in smaller chunks, periodically updating it 
>>> and thus freezing tasks only at the very end for the very short time to pick
>>> up the recent changes.
>>>
>>> That said, some help from kernel to watch how processes modify the contents of
>>> their memory is required. I'd like to propose one possible solution of this
>>> task -- with the help of page-faults and trace events.
>>>
>>> Briefly the approach is -- remap some memory regions as read-only, get the #pf
>>> on task's attempt to modify the memory and issue a trace event of that. Since
>>> we're only interested in parts of memory of some tasks, make it possible to mark
>>> the vmas we're interested in and issue events for them only. Also, to be aware
>>> of tasks unmapping the vma-s being watched, also issue an event when the marked
>>> vma is removed (and for symmetry -- an event when a vma is marked).
>>>
>>> What do you think about this approach? Is this way of supporting mem snapshot
>>> OK for you, or should we invent some better one?
>>>
>>
>> The page fault mechanism is pretty obvious - anything that deals with
>> dirty pages will end up having to do this. So there is nothing crazy
>> about this.
>>
>> What concerns me, however, is that should this go in, we'll have two
>> dirty mem loggers in the kernel: one to support CRIU, one to support
>> KVM. And the worst part: They have the exact the same purpose!!
>>
>> So to begin with, I think one thing to consider, would be to generalize
>> KVM's dirty memory notification so it can work on a normal process
>> memory region. KVM api requires a "memory slot" to be passed, something
>> we are unlikely to have. But KVM can easily keep its API and use an
>> alternate mechanics, that's trivial...
>>
>> Generally speaking, KVM will do polling with this ioctl. I prefer your
>> tracing mechanism better. The only difference, is that KVM tends to
>> transfer large chunks of memory in some loads - in the high gigs range.
>> So the proposal tracing API should be able to optionally batch requests
>> within a time frame.
>>
>> It would also be good to hear what does the KVM guys think of it as well
> 
> There are significant differences. KVM's dirty logging works for
> guest translations (NPT/shadow) and is optimized for specific use cases.
> 
> Above is about dirty logging of userspace memory areas.

This is envelope.

At the end, KVM is tracking pages, regardless of their format, and wants
to know when pages are dirty, and which pages are dirty.

What you are saying, for example, is that if Linux had dirty page
tracking for user pages' memory, KVM could not make use of it? If I
recall qemu's migration algorithm correctly, a notification of dirty
user pages - plus some extra trickery - would do just fine.




* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-11-30 17:55 [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes Pavel Emelyanov
                   ` (2 preceding siblings ...)
  2012-12-03  8:36 ` [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes Glauber Costa
@ 2012-12-03 22:43 ` Andrew Morton
  2012-12-04  5:15   ` Pavel Emelyanov
  3 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2012-12-03 22:43 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Hugh Dickins, KAMEZAWA Hiroyuki, Michal Hocko, Mel Gorman,
	Johannes Weiner, Linux MM, Rik van Riel

On Fri, 30 Nov 2012 21:55:00 +0400
Pavel Emelyanov <xemul@parallels.com> wrote:

> This is an attempt to implement support for memory snapshot for the the
> checkpoint-restore project (http://criu.org).
> 
> To create a dump of an application(s) we save all the information about it
> to files. No surprise, the biggest part of such dump is the contents of tasks'
> memory. However, in some usage scenarios it's not required to get _all_ the
> task memory while creating a dump. For example, when doing periodical dumps
> it's only required to take full memory dump only at the first step and then
> take incremental changes of memory. Another example is live migration. In the
> simplest form it looks like -- create dump, copy it on the remote node then
> restore tasks from dump files. While all this dump-copy-restore thing goes all
> the process must be stopped. However, if we can monitor how tasks change their
> memory, we can dump and copy it in smaller chunks, periodically updating it 
> and thus freezing tasks only at the very end for the very short time to pick
> up the recent changes.
> 
> That said, some help from kernel to watch how processes modify the contents of
> their memory is required. I'd like to propose one possible solution of this
> task -- with the help of page-faults and trace events.
> 
> Briefly the approach is -- remap some memory regions as read-only, get the #pf
> on task's attempt to modify the memory and issue a trace event of that. Since
> we're only interested in parts of memory of some tasks, make it possible to mark
> the vmas we're interested in and issue events for them only. Also, to be aware
> of tasks unmapping the vma-s being watched, also issue an event when the marked
> vma is removed (and for symmetry -- an event when a vma is marked).
> 
> What do you think about this approach? Is this way of supporting mem snapshot
> OK for you, or should we invent some better one?

The patches look pretty simple.

Some performance numbers would be useful.

Is it reliable?  Under what circumstances will the trace system drop
events?

Please cc Steven Rostedt on tracing stuff - he is a diligent reviewer.

The proposed interface might be useful to things other than c/r.  But
it hasn't actually been described.  Please include a full description
of the proposed kernel/userspace interface.

Two alternatives come to mind:

1)  Use /proc/pid/pagemap (Documentation/vm/pagemap.txt) in some
    fashion to determine which pages have been touched.

2)  At pagefault time, don't send an event: just mark the vma as
    "touched".  Then add a userspace interface to sweep the vma tree
    testing, clearing and reporting the touched flags.

2a) Avoid the full linear search by propagating the "touched" flag
    up the rbtree and do the sweep in a fashion similar to
    radix_tree_for_each_tagged().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-12-03 22:43 ` Andrew Morton
@ 2012-12-04  5:15   ` Pavel Emelyanov
  2012-12-04 23:21     ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Emelyanov @ 2012-12-04  5:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, KAMEZAWA Hiroyuki, Michal Hocko, Mel Gorman,
	Johannes Weiner, Linux MM, Rik van Riel

On 12/04/2012 02:43 AM, Andrew Morton wrote:
> On Fri, 30 Nov 2012 21:55:00 +0400
> Pavel Emelyanov <xemul@parallels.com> wrote:
> 
>> This is an attempt to implement support for memory snapshot for the
>> checkpoint-restore project (http://criu.org).
>>
>> To create a dump of an application(s) we save all the information about it
>> to files. No surprise, the biggest part of such dump is the contents of tasks'
>> memory. However, in some usage scenarios it's not required to get _all_ the
>> task memory while creating a dump. For example, when doing periodical dumps
>> it's only required to take full memory dump only at the first step and then
>> take incremental changes of memory. Another example is live migration. In the
>> simplest form it looks like -- create dump, copy it on the remote node then
>> restore tasks from dump files. While all this dump-copy-restore thing goes all
>> the process must be stopped. However, if we can monitor how tasks change their
>> memory, we can dump and copy it in smaller chunks, periodically updating it 
>> and thus freezing tasks only at the very end for the very short time to pick
>> up the recent changes.
>>
>> That said, some help from kernel to watch how processes modify the contents of
>> their memory is required. I'd like to propose one possible solution of this
>> task -- with the help of page-faults and trace events.
>>
>> Briefly the approach is -- remap some memory regions as read-only, get the #pf
>> on task's attempt to modify the memory and issue a trace event of that. Since
>> we're only interested in parts of memory of some tasks, make it possible to mark
>> the vmas we're interested in and issue events for them only. Also, to be aware
>> of tasks unmapping the vma-s being watched, also issue an event when the marked
>> vma is removed (and for symmetry -- an event when a vma is marked).
>>
>> What do you think about this approach? Is this way of supporting mem snapshot
>> OK for you, or should we invent some better one?
> 
> The patches look pretty simple.
> 
> Some performance numbers would be useful.
> 
> Is it reliable?  Under what circumstances will the trace system drop
> events?

AFAICS events are dropped when the event buffer overflows, but the buffer
size can be tuned. I will write a more descriptive text about it if the
tracing approach is considered to be the way to go.

> Please cc Steven Rostedt on tracing stuff - he is a diligent reviewer.

OK.

> The proposed interface might be useful to things other than c/r.  But
> it hasn't actually been described.  Please include a full description
> of the proposed kernel/userspace interface.

OK, will try to address that.

> Two alternatives come to mind:
> 
> 1)  Use /proc/pid/pagemap (Documentation/vm/pagemap.txt) in some
>     fashion to determine which pages have been touched.

I thought about this. Unfortunately there are no free bits left in the pagemap
entry. What can we do about it (other than introducing a pagemap2 file)?

> 2)  At pagefault time, don't send an event: just mark the vma as
>     "touched".  Then add a userspace interface to sweep the vma tree
>     testing, clearing and reporting the touched flags.

Per-vma granularity is not enough. In OpenVZ we've observed Oracle touching
only several pages in a hundred-megabyte anon mapping. Marking _part_ of the
vma with a "note write-faults" bit would help, but there is currently no API
that modifies a vma and reports some info back at the same time. Can you
propose what it could look like?

> 2a) Avoid the full linear search by propagating the "touched" flag
>     up the rbtree and do the sweep in a fashion similar to
>     radix_tree_for_each_tagged().
> .

Thanks,
Pavel


* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-12-04  5:15   ` Pavel Emelyanov
@ 2012-12-04 23:21     ` Andrew Morton
  2012-12-05  0:17       ` Matt Mackall
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2012-12-04 23:21 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Hugh Dickins, KAMEZAWA Hiroyuki, Michal Hocko, Mel Gorman,
	Johannes Weiner, Linux MM, Rik van Riel, Matt Mackall,
	Wu Fengguang

On Tue, 04 Dec 2012 09:15:10 +0400
Pavel Emelyanov <xemul@parallels.com> wrote:

> 
> > Two alternatives come to mind:
> > 
> > 1)  Use /proc/pid/pagemap (Documentation/vm/pagemap.txt) in some
> >     fashion to determine which pages have been touched.
> 
> I thought about this. Unfortunately there's no free bits left in the pagemap
> entry. What can we do about it (other than introducing the pagemap2 file)?

urgh, we were pretty careless in laying out the /proc/pid/pagemap
entries.

Probably the 55 bits for pfn/swap were excessive.

The page shift didn't need six bits!  Simply predividing the page shift
by 1k would have saved a few bits, and permitting expansion to a 2^63
byte page size is nuts.

Sigh.  I wonder how traumatic it would be to put the pagemap record on
a diet and make up some free space.
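
For reference, the field widths being criticised here can be written down as
a small decoder. This is only a sketch, assuming the entry layout documented
in Documentation/vm/pagemap.txt at the time of this thread (bits 0-54 PFN or
swap info, bits 55-60 page shift, bit 61 file-page/shared-anon, bit 62
swapped, bit 63 present). The names pm_fields, pm_decode and pm_free_bits are
made up for illustration and are not kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: Documentation/vm/pagemap.txt as of late 2012.
 * Later kernels reassigned some of these bits. */
#define PM_PFN_BITS     55
#define PM_PFN_MASK     ((UINT64_C(1) << PM_PFN_BITS) - 1)
#define PM_PSHIFT_SHIFT 55
#define PM_PSHIFT_BITS  6
#define PM_PSHIFT_MASK  ((UINT64_C(1) << PM_PSHIFT_BITS) - 1)
#define PM_FLAG_BITS    3
#define PM_FILE         (UINT64_C(1) << 61)
#define PM_SWAP         (UINT64_C(1) << 62)
#define PM_PRESENT      (UINT64_C(1) << 63)

/* Hypothetical unpacked view of one 64-bit pagemap entry. */
struct pm_fields {
	uint64_t pfn_or_swap;   /* PFN if present, swap type/offset if swapped */
	unsigned page_shift;    /* page size is 1 << page_shift */
	bool file_or_shared;
	bool swapped;
	bool present;
};

static struct pm_fields pm_decode(uint64_t entry)
{
	struct pm_fields f;

	f.pfn_or_swap = entry & PM_PFN_MASK;
	f.page_shift = (unsigned)((entry >> PM_PSHIFT_SHIFT) & PM_PSHIFT_MASK);
	f.file_or_shared = (entry & PM_FILE) != 0;
	f.swapped = (entry & PM_SWAP) != 0;
	f.present = (entry & PM_PRESENT) != 0;
	return f;
}

/* Bits left over once PFN, page shift and flags are accounted for. */
static unsigned pm_free_bits(void)
{
	return 64 - (PM_PFN_BITS + PM_PSHIFT_BITS + PM_FLAG_BITS);
}
```

Under these assumptions pm_free_bits() comes out as zero, which is the
"no free bits left" problem in numbers: 55 + 6 + 3 already fills the 64-bit
record.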

Anyway, do you actually need to add another bit?  /proc/pid/pagemap
gives you the pfn which can then be used to look up the page's flags in
/proc/kpageflags.  You can add a "touched" flag to /proc/kpageflags?
But that would require grabbing another bit in struct page.flags, I
assume.

And it would be very expensive.  An in-kernel loop which searches the
MM spitting out a string of touched-pages would be faster, but still
slow.

hm.

> > 2)  At pagefault time, don't send an event: just mark the vma as
> >     "touched".  Then add a userspace interface to sweep the vma tree
> >     testing, clearing and reporting the touched flags.
> 
> Per-vma granularity is not enough. In OpenVZ we've observed Oracle touching
> several pages in a hundred-megs anon mapping. Marking _part_ of the vma with
> the "node write-faults" bit would help, but there's currently no APIs that
> modifies vma and report some info back at the same time. Can you propose how
> it could look like?

I don't see a need to report the info back at the same time?  You want
to *record* that information but only report it when someone does a
query?

Dunno.  One could add a radix-tree to the vma and store 32 or 64
per-page bits in each slots[] entry.  Worst case that would consume
approx one bit of kernel memory for each 4k of instantiated user pages
- an increase of 1/32768.  Not too bad.  Use the tagged-lookup facility
to efficiently query that bitmap at query-time.
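
The 1/32768 figure follows from 4096 bytes * 8 bits = 32768 bits of user
memory per tracking bit. As a rough illustration of the record-now,
report-on-query idea, here is a userspace sketch that uses a flat bitmap in
place of the radix tree (so it skips empty words rather than using tagged
lookup). The names bitmap_bytes, touched_set and touched_sweep are
hypothetical, not existing kernel functions.

```c
#include <limits.h>
#include <stddef.h>

#define PAGE_SHIFT_4K 12
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Bitmap size needed to track a mapping at one bit per 4k page. */
static size_t bitmap_bytes(size_t mapping_bytes)
{
	size_t pages = mapping_bytes >> PAGE_SHIFT_4K;

	return (pages + CHAR_BIT - 1) / CHAR_BIT;
}

/* Fault side: record that a page was written, report nothing. */
static void touched_set(unsigned long *bm, size_t page)
{
	bm[page / BITS_PER_WORD] |= 1UL << (page % BITS_PER_WORD);
}

/* Query side: test, clear and report touched pages.
 * Returns how many page indexes were stored in out[]; bits that did not
 * fit into out[] stay set for the next sweep. */
static size_t touched_sweep(unsigned long *bm, size_t npages,
			    size_t *out, size_t max)
{
	size_t n = 0;

	for (size_t w = 0; w * BITS_PER_WORD < npages; w++) {
		if (bm[w] == 0)
			continue;       /* skip 32 or 64 clean pages at once */
		for (size_t b = 0; b < BITS_PER_WORD; b++) {
			size_t page = w * BITS_PER_WORD + b;
			unsigned long mask = 1UL << b;

			if (page >= npages)
				break;
			if (!(bm[w] & mask))
				continue;
			if (n == max)
				return n;
			bm[w] &= ~mask;
			out[n++] = page;
		}
	}
	return n;
}
```

For the hundred-megabyte anon mapping mentioned earlier in the thread,
bitmap_bytes() gives 3200 bytes of tracking state for 25600 pages.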


* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-12-04 23:21     ` Andrew Morton
@ 2012-12-05  0:17       ` Matt Mackall
  2012-12-05  0:24         ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Matt Mackall @ 2012-12-05  0:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pavel Emelyanov, Hugh Dickins, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel, Wu Fengguang

On Tue, 2012-12-04 at 15:21 -0800, Andrew Morton wrote:
> On Tue, 04 Dec 2012 09:15:10 +0400
> Pavel Emelyanov <xemul@parallels.com> wrote:
> 
> > 
> > > Two alternatives come to mind:
> > > 
> > > 1)  Use /proc/pid/pagemap (Documentation/vm/pagemap.txt) in some
> > >     fashion to determine which pages have been touched.

[momentarily coming out of kernel retirement for old man rant]

This is a popular interface anti-pattern.

You shouldn't use an interface that gives you a huge amount of STATE to
detect small amounts of CHANGE via manual differentiation. For example,
you would be foolish to try to monitor an entire filesystem by stat()ing
all files on the disk continually. It will be massively slow, only sort
of work, and you'll miss changes sometimes. Instead, use inotify.

Similarly, you shouldn't try to use an interface that gives you small
amounts of CHANGE to get a large STATE via manual integration. For
instance, you would be silly to try to get the current timestamp on a
file by tracking every change to the filesystem since boot via inotify.
It would be massively slow, only sort of work, and you'll get wrong
answers sometimes. Instead, use stat().

Pagemap is unambiguously a STATE interface for making the kinds of
measurements that such interfaces are good for. If you try to use it as
a CHANGE interface, you may find sadness.

I don't know what a good CHANGE interface here might look like, but
tracepoints have been suggested in the past. If you want to do something
UNIXy, you could teach inotify to report write iovecs and then make it
possible for /proc and /sys objects to report events through inotify.
Lots of other neat possibilities would fall out of that, of course.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-12-05  0:17       ` Matt Mackall
@ 2012-12-05  0:24         ` Andrew Morton
  2012-12-05  0:38           ` Matt Mackall
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2012-12-05  0:24 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Pavel Emelyanov, Hugh Dickins, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel, Wu Fengguang

On Tue, 04 Dec 2012 18:17:08 -0600
Matt Mackall <mpm@selenic.com> wrote:

> On Tue, 2012-12-04 at 15:21 -0800, Andrew Morton wrote:
> > On Tue, 04 Dec 2012 09:15:10 +0400
> > Pavel Emelyanov <xemul@parallels.com> wrote:
> > 
> > > 
> > > > Two alternatives come to mind:
> > > > 
> > > > 1)  Use /proc/pid/pagemap (Documentation/vm/pagemap.txt) in some
> > > >     fashion to determine which pages have been touched.
> 
> [momentarily coming out of kernel retirement for old man rant]
> 
> This is a popular interface anti-pattern.
> 
> You shouldn't use an interface that gives you huge amount of STATE to
> detect small amounts of CHANGE via manual differentiation.

I'm not sure that's what checkpoint-restart will be doing.  If we want
to determine "which pages have been touched since the last checkpoint
ten minutes ago" then that set of touched pages *is* state.  And it's
not "small"!



* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-12-05  0:24         ` Andrew Morton
@ 2012-12-05  0:38           ` Matt Mackall
  2012-12-05  9:53             ` Pavel Emelyanov
  0 siblings, 1 reply; 17+ messages in thread
From: Matt Mackall @ 2012-12-05  0:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pavel Emelyanov, Hugh Dickins, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel, Wu Fengguang

On Tue, 2012-12-04 at 16:24 -0800, Andrew Morton wrote:
> On Tue, 04 Dec 2012 18:17:08 -0600
> Matt Mackall <mpm@selenic.com> wrote:
> 
> > On Tue, 2012-12-04 at 15:21 -0800, Andrew Morton wrote:
> > > On Tue, 04 Dec 2012 09:15:10 +0400
> > > Pavel Emelyanov <xemul@parallels.com> wrote:
> > > 
> > > > 
> > > > > Two alternatives come to mind:
> > > > > 
> > > > > 1)  Use /proc/pid/pagemap (Documentation/vm/pagemap.txt) in some
> > > > >     fashion to determine which pages have been touched.
> > 
> > [momentarily coming out of kernel retirement for old man rant]
> > 
> > This is a popular interface anti-pattern.
> > 
> > You shouldn't use an interface that gives you huge amount of STATE to
> > detect small amounts of CHANGE via manual differentiation.
> 
> I'm not sure that's what checkpoint-restart will be doing.  If we want
> to determine "which pages have been touched since the last checkpoint
> ten minutes ago" then that set of touched pages *is* state.  And it's
> not "small"!

Yeah, there is definitely a middle-ground here between "I want
high-frequency updates" and "I want to see the whole picture". 
The filesystem analogy is backups: we don't have any good way to say
"find me all files changed since yesterday" short of "find all files".
The closest thing is explicit snapshotting.

-- 
Mathematics is the supreme nostalgia of our time.



* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-12-05  0:38           ` Matt Mackall
@ 2012-12-05  9:53             ` Pavel Emelyanov
  2012-12-05 22:06               ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Emelyanov @ 2012-12-05  9:53 UTC (permalink / raw)
  To: Matt Mackall, Andrew Morton
  Cc: Hugh Dickins, KAMEZAWA Hiroyuki, Michal Hocko, Mel Gorman,
	Johannes Weiner, Linux MM, Rik van Riel, Wu Fengguang

On 12/05/2012 04:38 AM, Matt Mackall wrote:
> On Tue, 2012-12-04 at 16:24 -0800, Andrew Morton wrote:
>> On Tue, 04 Dec 2012 18:17:08 -0600
>> Matt Mackall <mpm@selenic.com> wrote:
>>
>>> On Tue, 2012-12-04 at 15:21 -0800, Andrew Morton wrote:
>>>> On Tue, 04 Dec 2012 09:15:10 +0400
>>>> Pavel Emelyanov <xemul@parallels.com> wrote:
>>>>
>>>>>
>>>>>> Two alternatives come to mind:
>>>>>>
>>>>>> 1)  Use /proc/pid/pagemap (Documentation/vm/pagemap.txt) in some
>>>>>>     fashion to determine which pages have been touched.
>>>
>>> [momentarily coming out of kernel retirement for old man rant]
>>>
>>> This is a popular interface anti-pattern.
>>>
>>> You shouldn't use an interface that gives you huge amount of STATE to
>>> detect small amounts of CHANGE via manual differentiation.
>>
>> I'm not sure that's what checkpoint-restart will be doing.  If we want
>> to determine "which pages have been touched since the last checkpoint
>> ten minutes ago" then that set of touched pages *is* state.  And it's
>> not "small"!
> 
> Yeah, there is definitely a middle-ground here between "I want
> high-frequency updates" and "I want to see the whole picture". 
> The filesystem analogy is backups: we don't have any good way to say
> "find me all files changed since yesterday" short of "find all files".
> The closest thing is explicit snapshotting.

What checkpoint-restore requires is this -- we want to query the kernel
for "which pages have been written to since moment X". But this "moment X" is
a little trickier than just "mark all pages r/o". Consider that we're doing
this periodically. So when defining moment X for the 2nd time we should
query the "changed" state and remap the respective page r/o atomically. A full
snapshot is actually not required, since we don't need to keep the old copy
of a page that is written to. Just a sign that this page was modified is OK.
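
The atomicity requirement can be modelled in userspace with C11 atomics.
This is only a sketch of the idea, not of any kernel interface: a hypothetical
shared bitmap stands in for the per-page "changed" state, mark_written() plays
the role of the write fault, and snapshot_and_rearm() defines a new moment X.
The exchange fetches and resets each word in one step, so a write that races
with the query lands either in this snapshot or in the next one, never in
neither. Note that it is atomic per word, not across the whole bitmap.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DIRTY_WORDS 4                   /* tracks 4 * 64 = 256 pages */

static _Atomic uint64_t dirty[DIRTY_WORDS];

/* Writer side: what a write fault on a watched page would record. */
static void mark_written(size_t page)
{
	atomic_fetch_or(&dirty[page / 64], UINT64_C(1) << (page % 64));
}

/* Portable population count. */
static unsigned count_bits(uint64_t v)
{
	unsigned n = 0;

	while (v) {
		v &= v - 1;             /* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Reader side: collect the "changed" bits and re-arm in a single step.
 * Returns the number of pages written since the previous call. */
static size_t snapshot_and_rearm(uint64_t out[DIRTY_WORDS])
{
	size_t n = 0;

	for (size_t w = 0; w < DIRTY_WORDS; w++) {
		out[w] = atomic_exchange(&dirty[w], 0);
		n += count_bits(out[w]);
	}
	return n;
}
```

A separate "read, then clear" pair would lose any write that arrives between
the two operations, which is exactly the window described above.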


Thanks,
Pavel


* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-12-05  9:53             ` Pavel Emelyanov
@ 2012-12-05 22:06               ` Andrew Morton
  2012-12-06  6:32                 ` Pavel Emelyanov
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2012-12-05 22:06 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Matt Mackall, Hugh Dickins, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel, Wu Fengguang

On Wed, 05 Dec 2012 13:53:17 +0400
Pavel Emelyanov <xemul@parallels.com> wrote:

> On 12/05/2012 04:38 AM, Matt Mackall wrote:
> > On Tue, 2012-12-04 at 16:24 -0800, Andrew Morton wrote:
> >> On Tue, 04 Dec 2012 18:17:08 -0600
> >> Matt Mackall <mpm@selenic.com> wrote:
> >>
> >>> On Tue, 2012-12-04 at 15:21 -0800, Andrew Morton wrote:
> >>>> On Tue, 04 Dec 2012 09:15:10 +0400
> >>>> Pavel Emelyanov <xemul@parallels.com> wrote:
> >>>>
> >>>>>
> >>>>>> Two alternatives come to mind:
> >>>>>>
> >>>>>> 1)  Use /proc/pid/pagemap (Documentation/vm/pagemap.txt) in some
> >>>>>>     fashion to determine which pages have been touched.
> >>>
> >>> [momentarily coming out of kernel retirement for old man rant]
> >>>
> >>> This is a popular interface anti-pattern.
> >>>
> >>> You shouldn't use an interface that gives you huge amount of STATE to
> >>> detect small amounts of CHANGE via manual differentiation.
> >>
> >> I'm not sure that's what checkpoint-restart will be doing.  If we want
> >> to determine "which pages have been touched since the last checkpoint
> >> ten minutes ago" then that set of touched pages *is* state.  And it's
> >> not "small"!
> > 
> > Yeah, there is definitely a middle-ground here between "I want
> > high-frequency updates" and "I want to see the whole picture". 
> > The filesystem analogy is backups: we don't have any good way to say
> > "find me all files changed since yesterday" short of "find all files".
> > The closest thing is explicit snapshotting.
> 
> For what is required for checkpoint-restore is -- we want to query the kernel
> for "what pages has been written to since moment X". But this "moment X" is
> a little bit more tricky than just "mark all pages r/o". Consider we're doing
> this periodically. So when defining the moment X for the 2nd time we should
> query the "changed" state and remap the respective page r/o atomically. Full
> snapshot is actually not required, since we don't need to keep the old copy
> of a page that is written to. Just a sign, that this page was modified is OK.

How is all this going to work, btw?  What is the interface to query
page states and set them read-only?  How will dirty pagecache and dirty
swapcache be handled?  And anonymous memory?


* Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
  2012-12-05 22:06               ` Andrew Morton
@ 2012-12-06  6:32                 ` Pavel Emelyanov
  0 siblings, 0 replies; 17+ messages in thread
From: Pavel Emelyanov @ 2012-12-06  6:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matt Mackall, Hugh Dickins, KAMEZAWA Hiroyuki, Michal Hocko,
	Mel Gorman, Johannes Weiner, Linux MM, Rik van Riel, Wu Fengguang

>> For what is required for checkpoint-restore is -- we want to query the kernel
>> for "what pages has been written to since moment X". But this "moment X" is
>> a little bit more tricky than just "mark all pages r/o". Consider we're doing
>> this periodically. So when defining the moment X for the 2nd time we should
>> query the "changed" state and remap the respective page r/o atomically. Full
>> snapshot is actually not required, since we don't need to keep the old copy
>> of a page that is written to. Just a sign, that this page was modified is OK.
> 
> How is all this going to work, btw?  What is the interface to query
> page states and set them read-only?  How will dirty pagecache and dirty
> swapcache be handled?  And anonymous memory?

To begin with -- currently criu dumps lots of information about a process by
injecting parasite code into the process [1] and working on the process
state as if the process were dumping itself.

That said, the API proposed in this set is meant to be used like this:

1. A daemon is started that turns tracing on, enables the proposed mmu.*
   events and starts listening for them.
2. The parasite code gets injected into the target task. This parasite knows
   which mapping(s) we're about to take to the image.
3. The parasite first sends the needed pages [2] to the image file.
4. Then the parasite calls the proposed madvise(MADV_TRACE) on the mapping.
   When called, the respective mapping is marked with the VM_TRACE bit and
   all its pages are remapped read-only.
5. After this the parasite can be removed and the target task is resumed.

If after this the process writes to some page, a #PF occurs and the respective
event is sent via the tracing engine. Next time, when we want to take an
incremental dump, we repeat steps 2 through 5 with a small change -- in step 3
the parasite asks the daemon from step 1 which pages have been changed since
last time and dumps only those into the new image.
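
The write-protect-and-catch-the-fault cycle above can be tried out entirely
in userspace. The sketch below is an analogue only, not the proposed
interface: it swaps madvise(MADV_TRACE) for mprotect(PROT_READ) and the trace
event for a SIGSEGV handler that records the page and restores write access.
It assumes Linux, a page-aligned region, and that calling mprotect() from a
signal handler is acceptable (it is a plain system call on Linux, though not
on the POSIX async-signal-safe list). All trk_* names are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define TRK_MAX_PAGES 1024

static char *trk_base;                  /* start of the watched region */
static size_t trk_len, trk_page;        /* region length and page size */
static volatile sig_atomic_t trk_dirty[TRK_MAX_PAGES];

/* Stand-in for the "page modified" event: note the page, let the write go. */
static void trk_fault(int sig, siginfo_t *si, void *ctx)
{
	char *addr = (char *)si->si_addr;
	size_t idx;

	(void)sig;
	(void)ctx;
	if (addr < trk_base || addr >= trk_base + trk_len) {
		/* Not ours: restore the default action so the retried
		 * access crashes as usual. */
		signal(SIGSEGV, SIG_DFL);
		return;
	}
	idx = (size_t)(addr - trk_base) / trk_page;
	trk_dirty[idx] = 1;
	mprotect(trk_base + idx * trk_page, trk_page, PROT_READ | PROT_WRITE);
}

/* Stand-in for step 4: clear old state and remap the region read-only. */
static int trk_arm(void *base, size_t len)
{
	struct sigaction sa;
	size_t i;

	trk_page = (size_t)sysconf(_SC_PAGESIZE);
	if (len % trk_page != 0 || len / trk_page > TRK_MAX_PAGES)
		return -1;
	trk_base = base;
	trk_len = len;
	for (i = 0; i < TRK_MAX_PAGES; i++)
		trk_dirty[i] = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = trk_fault;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, NULL) != 0)
		return -1;
	return mprotect(base, len, PROT_READ);
}

/* Number of pages written since the last trk_arm(). */
static size_t trk_count(void)
{
	size_t i, n = 0;

	for (i = 0; i < trk_len / trk_page; i++)
		n += trk_dirty[i] ? 1 : 0;
	return n;
}
```

A caller would mmap() an anonymous region, dump it once, call trk_arm(), let
the workload run, and then dump only the pages whose trk_dirty[] entry is set
before calling trk_arm() again. Each page faults at most once per round, since
the handler re-enables writes for it.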

The state of swapcache (clean or dirty) doesn't matter in this case. If the
page is in swap and the pte contains a swap entry, we'll note this from the
pagemap file and will take the page into the image in the first pass. If later
the process writes to the page it will go through do_swap_page -> do_wp_page
and the modification event will be sent and caught by the daemon from step 1.

The pagecache is completely out of scope since criu doesn't dump the
contents of file mappings and doesn't snapshot filesystem state. It only
works with the process' state. The filesystem state that corresponds to the
process state should be created by other means, e.g. lvm snapshot or rsync
while tasks are stopped. I've tried to explain this in more detail here [3].

Thanks,
Pavel

[1] http://lwn.net/Articles/454304/
[2] Looking at the /proc/PID/pagemap file
[3] https://plus.google.com/103175467322423551911/posts/UAtVKaQcKsx


Thread overview: 17+ messages
2012-11-30 17:55 [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes Pavel Emelyanov
2012-11-30 17:55 ` [PATCH 1/2] mm: Mark VMA with VM_TRACE bit Pavel Emelyanov
2012-11-30 17:55 ` [PATCH 2/2] mm: Generate events when tasks change their memory Pavel Emelyanov
2012-12-03 23:42   ` Xiao Guangrong
2012-12-04  5:04     ` Pavel Emelyanov
2012-12-03  8:36 ` [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes Glauber Costa
2012-12-03 20:16   ` Marcelo Tosatti
2012-12-04  7:39     ` Glauber Costa
2012-12-03 22:43 ` Andrew Morton
2012-12-04  5:15   ` Pavel Emelyanov
2012-12-04 23:21     ` Andrew Morton
2012-12-05  0:17       ` Matt Mackall
2012-12-05  0:24         ` Andrew Morton
2012-12-05  0:38           ` Matt Mackall
2012-12-05  9:53             ` Pavel Emelyanov
2012-12-05 22:06               ` Andrew Morton
2012-12-06  6:32                 ` Pavel Emelyanov
