From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131])
	by kanga.kvack.org (Postfix) with SMTP id 2F5FC6B0005
	for ; Thu, 11 Apr 2013 07:28:36 -0400 (EDT)
Message-ID: <51669E5F.4000801@parallels.com>
Date: Thu, 11 Apr 2013 15:28:31 +0400
From: Pavel Emelyanov
MIME-Version: 1.0
Subject: [PATCH 0/5] mm: Ability to monitor task memory changes (v3)
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton, Linux MM, Linux Kernel Mailing List

Hello,

This is the implementation of the soft-dirty bit concept that should help
keep track of changes in user memory, which in turn is badly needed by the
checkpoint-restore project (http://criu.org). Let me briefly recap the issue.

<< EOF
To create a dump of an application(s) we save all the information about it
to files, and the biggest part of such a dump is the contents of the tasks'
memory. However, there are usage scenarios where it's not required to get
_all_ the task memory while creating a dump. For example, when doing
periodic dumps, a full memory dump is required only at the first step;
after that it is enough to take incremental changes of memory. Another
example is live migration. We copy all the memory to the destination node
without stopping the tasks, then stop them, check which pages have changed,
dump those and the rest of the state, then copy it to the destination node.
This decreases freeze time significantly.

That said, some help from the kernel to watch how processes modify the
contents of their memory is required.
EOF

The proposal is to track changes with the help of a new soft-dirty bit this way:

1. First do "echo 4 > /proc/$pid/clear_refs".
   At that point the kernel clears the soft-dirty _and_ the writable bits
   from all ptes of process $pid. From now on every write to any page will
   result in a #pf and the subsequent call to pte_mkdirty/pmd_mkdirty, which
   in turn will set the soft-dirty flag.

2. Then read /proc/$pid/pagemap2 and check the soft-dirty bit reported there
   (bit 55). If set, the respective pte was written to since the last write
   to clear_refs.

The soft-dirty bit is the _PAGE_BIT_HIDDEN one. Although it's also used by
kmemcheck, kmemcheck marks kernel pages with it, while the soft-dirty bit is
put on user pages, so they do not conflict with each other.

The set is against v3.9-rc5.
It includes preparations for /proc/pid's clear_refs file, adds the pagemap2 one
and the soft-dirty concept itself with Andrew's comments on the previous patch
(hopefully) fixed.

History of the set:

* Previous version of this patch, commented on by Andrew:
  http://lwn.net/Articles/546184/

* Pre-previous ftrace-based approach:
  http://permalink.gmane.org/gmane.linux.kernel.mm/91428

  This one was not nice, because ftrace could drop events so we might
  miss significant information about page updates.

  Another issue with it -- it was impossible to use it to watch an arbitrary
  task -- the task had to mark memory areas with madvise itself to make
  events occur.

  Also, a program that monitored the update events could interfere with
  anyone else trying to mess with ftrace.

Signed-off-by: Pavel Emelyanov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
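The two-step procedure described in the cover letter could be driven from
userspace roughly as sketched below. This is only an illustration and not
part of the posted series: the helper names (clear_soft_dirty_bits,
pagemap_entry_offset, entry_soft_dirty, page_soft_dirty) are made up for this
example, the /proc/PID/pagemap2 path and the bit-55 position are taken from
the proposal as posted, and a 64-bit long is assumed so that fseek() can
reach the entry offset.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit 55 of a pagemap2 entry is the soft-dirty flag, per the cover letter. */
#define PM2_SOFT_DIRTY_BIT 55

/* Step 1: write "4" to /proc/PID/clear_refs. Returns 0 on success, -1 on error. */
int clear_soft_dirty_bits(int pid)
{
	char path[64];
	FILE *f;
	int ok;

	snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = (fputs("4", f) != EOF);
	/* The write is flushed here, so a kernel error shows up in fclose(). */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Byte offset of the 64-bit entry that describes the page containing vaddr. */
uint64_t pagemap_entry_offset(uint64_t vaddr, uint64_t page_size)
{
	assert(page_size != 0);
	return (vaddr / page_size) * sizeof(uint64_t);
}

/* Step 2 (decode): 1 if the soft-dirty bit is set in the entry, else 0. */
int entry_soft_dirty(uint64_t entry)
{
	return (int)((entry >> PM2_SOFT_DIRTY_BIT) & 1);
}

/*
 * Step 2 (read): look up one page in /proc/PID/pagemap2.
 * Returns 1 (written since the last clear), 0 (not written) or -1 on error.
 * Assumption: long is 64 bits wide, as on 64-bit Linux.
 */
int page_soft_dirty(int pid, uint64_t vaddr, uint64_t page_size)
{
	char path[64];
	uint64_t entry;
	FILE *f;
	int ret = -1;

	snprintf(path, sizeof(path), "/proc/%d/pagemap2", pid);
	f = fopen(path, "rb");
	if (!f)
		return -1;
	if (fseek(f, (long)pagemap_entry_offset(vaddr, page_size), SEEK_SET) == 0 &&
	    fread(&entry, sizeof(entry), 1, f) == 1)
		ret = entry_soft_dirty(entry);
	fclose(f);
	return ret;
}
```

A checkpoint tool would call clear_soft_dirty_bits() once, let the task run,
and then call page_soft_dirty() for each page of interest; on a kernel
without this series the second helper simply returns -1.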

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158])
	by kanga.kvack.org (Postfix) with SMTP id F35A36B0027
	for ; Thu, 11 Apr 2013 07:28:54 -0400 (EDT)
Message-ID: <51669E73.2000301@parallels.com>
Date: Thu, 11 Apr 2013 15:28:51 +0400
From: Pavel Emelyanov
MIME-Version: 1.0
Subject: [PATCH 1/5] clear_refs: Sanitize accepted commands declaration
References: <51669E5F.4000801@parallels.com>
In-Reply-To: <51669E5F.4000801@parallels.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton, Linux MM, Linux Kernel Mailing List

A new clear-refs type will be added in the next patch, so prepare
code for that.

Signed-off-by: Pavel Emelyanov
---
 fs/proc/task_mmu.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3e636d8..67c2586 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -688,6 +688,13 @@ const struct file_operations proc_tid_smaps_operations = {
 	.release = seq_release_private,
 };
 
+enum clear_refs_types {
+	CLEAR_REFS_ALL = 1,
+	CLEAR_REFS_ANON,
+	CLEAR_REFS_MAPPED,
+	CLEAR_REFS_LAST,
+};
+
 static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 {
@@ -719,10 +726,6 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-#define CLEAR_REFS_ALL 1
-#define CLEAR_REFS_ANON 2
-#define CLEAR_REFS_MAPPED 3
-
 static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
 {
@@ -730,7 +733,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 	char buffer[PROC_NUMBUF];
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
-	int type;
+	enum clear_refs_types type;
 	int rv;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -738,10 +741,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		count = sizeof(buffer) - 1;
 	if (copy_from_user(buffer, buf, count))
 		return -EFAULT;
-	rv = kstrtoint(strstrip(buffer), 10, &type);
+	rv = kstrtoint(strstrip(buffer), 10, (int *)&type);
 	if (rv < 0)
 		return rv;
-	if (type < CLEAR_REFS_ALL || type > CLEAR_REFS_MAPPED)
+	if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
 		return -EINVAL;
 	task = get_proc_task(file_inode(file));
 	if (!task)
--
1.7.6.5

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx139.postini.com [74.125.245.139])
	by kanga.kvack.org (Postfix) with SMTP id B539B6B0027
	for ; Thu, 11 Apr 2013 07:29:12 -0400 (EDT)
Message-ID: <51669E85.1020702@parallels.com>
Date: Thu, 11 Apr 2013 15:29:09 +0400
From: Pavel Emelyanov
MIME-Version: 1.0
Subject: [PATCH 2/5] clear_refs: Introduce private struct for mm_walk
References: <51669E5F.4000801@parallels.com>
In-Reply-To: <51669E5F.4000801@parallels.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton, Linux MM, Linux Kernel Mailing List

In the next patch the clear-refs type will be required in the
clear_refs_pte_range function, so prepare walk->private to carry this info.

Signed-off-by: Pavel Emelyanov
---
 fs/proc/task_mmu.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 67c2586..c59a148 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -695,10 +695,15 @@ enum clear_refs_types {
 	CLEAR_REFS_LAST,
 };
 
+struct clear_refs_private {
+	struct vm_area_struct *vma;
+};
+
 static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 {
-	struct vm_area_struct *vma = walk->private;
+	struct clear_refs_private *cp = walk->private;
+	struct vm_area_struct *vma = cp->vma;
 	pte_t *pte, ptent;
 	spinlock_t *ptl;
 	struct page *page;
@@ -751,13 +756,16 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		return -ESRCH;
 	mm = get_task_mm(task);
 	if (mm) {
+		struct clear_refs_private cp = {
+		};
 		struct mm_walk clear_refs_walk = {
 			.pmd_entry = clear_refs_pte_range,
 			.mm = mm,
+			.private = &cp,
 		};
 		down_read(&mm->mmap_sem);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			clear_refs_walk.private = vma;
+			cp.vma = vma;
 			if (is_vm_hugetlb_page(vma))
 				continue;
 			/*
--
1.7.6.5

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx139.postini.com [74.125.245.139])
	by kanga.kvack.org (Postfix) with SMTP id 7B12D6B0036
	for ; Thu, 11 Apr 2013 07:29:29 -0400 (EDT)
Message-ID: <51669E95.8060101@parallels.com>
Date: Thu, 11 Apr 2013 15:29:25 +0400
From: Pavel Emelyanov
MIME-Version: 1.0
Subject: [PATCH 3/5] pagemap: Introduce pagemap_entry_t without pmshift bits
References: <51669E5F.4000801@parallels.com>
In-Reply-To: <51669E5F.4000801@parallels.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton, Linux MM, Linux Kernel Mailing List

These bits are always constant (== PAGE_SHIFT) and just occupy space in the
entry. Moreover, the next patch needs to report one more bit in the pagemap,
but all of its bits are already in use.

So describe a pagemap entry that has 6 more free zero bits.

Signed-off-by: Pavel Emelyanov
---
 fs/proc/task_mmu.c |   50 ++++++++++++++++++++++++++++++--------------------
 1 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c59a148..7f9b66c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -805,6 +805,7 @@ typedef struct {
 struct pagemapread {
 	int pos, len;
 	pagemap_entry_t *buffer;
+	bool v2;
 };
 
 #define PAGEMAP_WALK_SIZE (PMD_SIZE)
@@ -818,14 +819,16 @@ struct pagemapread {
 #define PM_PSHIFT_BITS 6
 #define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
 #define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
+#define __PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
 #define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
 #define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
+/* in pagemap2 pshift bits are occupied with more status bits */
+#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
 
 #define PM_PRESENT PM_STATUS(4LL)
 #define PM_SWAP PM_STATUS(2LL)
 #define PM_FILE PM_STATUS(1LL)
-#define PM_NOT_PRESENT PM_PSHIFT(PAGE_SHIFT)
+#define PM_NOT_PRESENT(v2) PM_STATUS2(v2, 0)
 #define PM_END_OF_BUFFER 1
 
 static inline pagemap_entry_t make_pme(u64 val)
@@ -848,7 +851,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	unsigned long addr;
 	int err = 0;
-	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT);
+	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
 
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
 		err = add_to_pagemap(addr, &pme, pm);
@@ -858,7 +861,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	return err;
 }
 
-static void pte_to_pagemap_entry(pagemap_entry_t *pme,
+static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
 	u64 frame, flags;
@@ -877,18 +880,18 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme,
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
 	} else {
-		*pme = make_pme(PM_NOT_PRESENT);
+		*pme = make_pme(PM_NOT_PRESENT(pm->v2));
 		return;
 	}
 
 	if (page && !PageAnon(page))
 		flags |= PM_FILE;
 
-	*pme = make_pme(PM_PFRAME(frame) | PM_PSHIFT(PAGE_SHIFT) | flags);
+	*pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, 0) | flags);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
+static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 		pmd_t pmd, int offset)
 {
 	/*
@@ -898,12 +901,12 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
 	 */
 	if (pmd_present(pmd))
 		*pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
-				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
+				| PM_STATUS2(pm->v2, 0) | PM_PRESENT);
 	else
-		*pme = make_pme(PM_NOT_PRESENT);
+		*pme = make_pme(PM_NOT_PRESENT(pm->v2));
 }
 #else
-static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
+static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 						pmd_t pmd, int offset)
 {
 }
@@ -916,7 +919,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
-	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT);
+	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
@@ -926,7 +929,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 			offset = (addr & ~PAGEMAP_WALK_MASK) >>
 					PAGE_SHIFT;
-			thp_pmd_to_pagemap_entry(&pme, *pmd, offset);
+			thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset);
 			err = add_to_pagemap(addr, &pme, pm);
 			if (err)
 				break;
@@ -943,7 +946,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		 * and need a new, higher one */
 		if (vma && (addr >= vma->vm_end)) {
 			vma = find_vma(walk->mm, addr);
-			pme = make_pme(PM_NOT_PRESENT);
+			pme = make_pme(PM_NOT_PRESENT(pm->v2));
 		}
 
 		/* check that 'vma' actually covers this address,
@@ -951,7 +954,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		if (vma && (vma->vm_start <= addr) &&
 		    !is_vm_hugetlb_page(vma)) {
 			pte = pte_offset_map(pmd, addr);
-			pte_to_pagemap_entry(&pme, vma, addr, *pte);
+			pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
 			/* unmap before userspace copy */
 			pte_unmap(pte);
 		}
@@ -966,14 +969,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme,
+static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 					pte_t pte, int offset)
 {
 	if (pte_present(pte))
 		*pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset)
-				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
+				| PM_STATUS2(pm->v2, 0) | PM_PRESENT);
 	else
-		*pme = make_pme(PM_NOT_PRESENT);
+		*pme = make_pme(PM_NOT_PRESENT(pm->v2));
 }
 
 /* This function walks within one hugetlb entry in the single call */
@@ -987,7 +990,7 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 	for (; addr != end; addr += PAGE_SIZE) {
 		int offset = (addr & ~hmask) >> PAGE_SHIFT;
-		huge_pte_to_pagemap_entry(&pme, *pte, offset);
+		huge_pte_to_pagemap_entry(&pme, pm, *pte, offset);
 		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			return err;
@@ -1023,8 +1026,8 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  */
-static ssize_t pagemap_read(struct file *file, char __user *buf,
-			    size_t count, loff_t *ppos)
+static ssize_t do_pagemap_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos, bool v2)
 {
 	struct task_struct *task = get_proc_task(file_inode(file));
 	struct mm_struct *mm;
@@ -1049,6 +1052,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 	if (!count)
 		goto out_task;
 
+	pm.v2 = v2;
 	pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
 	pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
 	ret = -ENOMEM;
@@ -1121,6 +1125,12 @@ out:
 	return ret;
 }
 
+static ssize_t pagemap_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	return do_pagemap_read(file, buf, count, ppos, false);
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek = mem_lseek, /* borrow this */
 	.read = pagemap_read,
--
1.7.6.5
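To make the layout change in patch 3/5 easier to follow, here is a small
userspace model of the entry encoding. It is a sketch under assumptions, not
kernel code: the function names (pm_status2, pm_bits_55_60, pm_pframe) are
invented for this example, and the field position (bits 55-60, i.e. a 6-bit
field starting at bit 55) is taken from the description in patch 4/5 rather
than from the kernel's PM_STATUS_OFFSET definition, which the diff does not
show.

```c
#include <stdint.h>

/* The former "pshift" field: 6 bits starting at bit 55 (bits 55-60). */
#define PM_PSHIFT_BITS   6
#define PM_PSHIFT_OFFSET 55
#define PM_PSHIFT_MASK \
	(((UINT64_C(1) << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
#define PM_PFRAME_MASK   ((UINT64_C(1) << PM_PSHIFT_OFFSET) - 1)

/*
 * Model of PM_STATUS2(v2, x) from the patch: the old pagemap file always
 * stores the page shift in bits 55-60, while pagemap2 stores extra status
 * bits there (all zero as of this patch).
 */
uint64_t pm_status2(int v2, uint64_t x, unsigned int page_shift)
{
	uint64_t val = v2 ? x : (uint64_t)page_shift;

	return (val << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK;
}

/* Extract bits 55-60 of an entry. */
uint64_t pm_bits_55_60(uint64_t entry)
{
	return (entry & PM_PSHIFT_MASK) >> PM_PSHIFT_OFFSET;
}

/* Extract the page frame number (bits 0-54) of an entry. */
uint64_t pm_pframe(uint64_t entry)
{
	return entry & PM_PFRAME_MASK;
}
```

With a page shift of 12, an old-style entry carries 12 in bits 55-60 and a
pagemap2 entry carries 0 there, which is what frees the field for new flags.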
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx192.postini.com [74.125.245.192]) by kanga.kvack.org (Postfix) with SMTP id 8E14E6B0006 for ; Thu, 11 Apr 2013 07:29:45 -0400 (EDT) Message-ID: <51669EA5.20209@parallels.com> Date: Thu, 11 Apr 2013 15:29:41 +0400 From: Pavel Emelyanov MIME-Version: 1.0 Subject: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file References: <51669E5F.4000801@parallels.com> In-Reply-To: <51669E5F.4000801@parallels.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton , Linux MM , Linux Kernel Mailing List This file is the same as the pagemap one, but shows entries with bits 55-60 being zero (reserved for future use). Next patch will occupy one of them. Signed-off-by: Pavel Emelyanov --- Documentation/filesystems/proc.txt | 2 ++ Documentation/vm/pagemap.txt | 3 +++ fs/proc/base.c | 2 ++ fs/proc/internal.h | 1 + fs/proc/task_mmu.c | 11 +++++++++++ 5 files changed, 19 insertions(+), 0 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index fd8d0d5..22c47ec 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -487,6 +487,8 @@ Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags using /proc/kpageflags and number of times a page is mapped using /proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt. +(There's also a /proc/pid/pagemap2 file which is the 2nd version of the + pagemap one). 
1.2 Kernel data --------------- diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index 7587493..4350397 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -30,6 +30,9 @@ There are three components to pagemap: determine which areas of memory are actually mapped and llseek to skip over unmapped regions. + * /proc/pid/pagemap2. This file provides the same info as the pagemap + does, but bits 55-60 are reserved for future use and thus zero + * /proc/kpagecount. This file contains a 64-bit count of the number of times each page is mapped, indexed by PFN. diff --git a/fs/proc/base.c b/fs/proc/base.c index 69078c7..34966ce 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2537,6 +2537,7 @@ static const struct pid_entry tgid_base_stuff[] = { REG("clear_refs", S_IWUSR, proc_clear_refs_operations), REG("smaps", S_IRUGO, proc_pid_smaps_operations), REG("pagemap", S_IRUGO, proc_pagemap_operations), + REG("pagemap2", S_IRUGO, proc_pagemap2_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), @@ -2882,6 +2883,7 @@ static const struct pid_entry tid_base_stuff[] = { REG("clear_refs", S_IWUSR, proc_clear_refs_operations), REG("smaps", S_IRUGO, proc_tid_smaps_operations), REG("pagemap", S_IRUGO, proc_pagemap_operations), + REG("pagemap2", S_IRUGO, proc_pagemap2_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 85ff3a4..cc12bb7 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -67,6 +67,7 @@ extern const struct file_operations proc_pid_smaps_operations; extern const struct file_operations proc_tid_smaps_operations; extern const struct file_operations proc_clear_refs_operations; extern const struct file_operations proc_pagemap_operations; +extern const struct file_operations proc_pagemap2_operations; 
extern const struct file_operations proc_net_operations; extern const struct inode_operations proc_net_inode_operations; extern const struct inode_operations proc_pid_link_inode_operations; diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 7f9b66c..3138009 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1135,6 +1135,17 @@ const struct file_operations proc_pagemap_operations = { .llseek = mem_lseek, /* borrow this */ .read = pagemap_read, }; + +static ssize_t pagemap2_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + return do_pagemap_read(file, buf, count, ppos, true); +} + +const struct file_operations proc_pagemap2_operations = { + .llseek = mem_lseek, /* borrow this */ + .read = pagemap2_read, +}; #endif /* CONFIG_PROC_PAGE_MONITOR */ #ifdef CONFIG_NUMA -- 1.7.6.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx102.postini.com [74.125.245.102]) by kanga.kvack.org (Postfix) with SMTP id 4F1046B0039 for ; Thu, 11 Apr 2013 07:30:04 -0400 (EDT) Message-ID: <51669EB8.2020102@parallels.com> Date: Thu, 11 Apr 2013 15:30:00 +0400 From: Pavel Emelyanov MIME-Version: 1.0 Subject: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking References: <51669E5F.4000801@parallels.com> In-Reply-To: <51669E5F.4000801@parallels.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton , Linux MM , Linux Kernel Mailing List The soft-dirty is a bit on a PTE which helps to track which pages a task writes to. In order to do this tracking one should 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs) 2. Wait some time. 3. 
Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries) To do this tracking, the writable bit is cleared from PTEs when the soft-dirty bit is. Thus, after this, when the task tries to modify a page at some virtual address the #PF occurs and the kernel sets the soft-dirty bit on the respective PTE. Note, that although all the task's address space is marked as r/o after the soft-dirty bits clear, the #PF-s that occur after that are processed fast. This is so, since the pages are still mapped to physical memory, and thus all the kernel does is finds this fact out and puts back writable, dirty and soft-dirty bits on the PTE. Another thing to note, is that when mremap moves PTEs they are marked with soft-dirty as well, since from the user perspective mremap modifies the virtual memory at mremap's new address. Signed-off-by: Pavel Emelyanov --- Documentation/filesystems/proc.txt | 7 +++++- Documentation/vm/pagemap.txt | 4 ++- Documentation/vm/soft-dirty.txt | 36 ++++++++++++++++++++++++++++++++++ arch/x86/include/asm/pgtable.h | 26 ++++++++++++++++++++++- arch/x86/include/asm/pgtable_types.h | 6 +++++ fs/proc/task_mmu.c | 36 +++++++++++++++++++++++++++++---- include/asm-generic/pgtable.h | 22 ++++++++++++++++++++ mm/Kconfig | 12 +++++++++++ mm/huge_memory.c | 2 +- mm/mremap.c | 2 +- 10 files changed, 142 insertions(+), 11 deletions(-) create mode 100644 Documentation/vm/soft-dirty.txt diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 22c47ec..488c094 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -473,7 +473,8 @@ This file is only present if the CONFIG_MMU kernel configuration option is enabled. The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG -bits on both physical and virtual pages associated with a process. 
+bits on both physical and virtual pages associated with a process, and the +soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details). To clear the bits for all the pages associated with the process > echo 1 > /proc/PID/clear_refs @@ -482,6 +483,10 @@ To clear the bits for the anonymous pages associated with the process To clear the bits for the file mapped pages associated with the process > echo 3 > /proc/PID/clear_refs + +To clear the soft-dirty bit + > echo 4 > /proc/PID/clear_refs + Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index 4350397..394cc03 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -31,7 +31,9 @@ There are three components to pagemap: skip over unmapped regions. * /proc/pid/pagemap2. This file provides the same info as the pagemap - does, but bits 55-60 are reserved for future use and thus zero + does, but bits 56-60 are reserved for future use and thus zero + + Bit 55 means pte is soft-dirty (see Documentation/vm/soft-dirty.txt) * /proc/kpagecount. This file contains a 64-bit count of the number of times each page is mapped, indexed by PFN. diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/vm/soft-dirty.txt new file mode 100644 index 0000000..9a12a59 --- /dev/null +++ b/Documentation/vm/soft-dirty.txt @@ -0,0 +1,36 @@ + SOFT-DIRTY PTEs + + The soft-dirty is a bit on a PTE which helps to track which pages a task +writes to. In order to do this tracking one should + + 1. Clear soft-dirty bits from the task's PTEs. + + This is done by writing "4" into the /proc/PID/clear_refs file of the + task in question. + + 2. Wait some time. + + 3. Read soft-dirty bits from the PTEs. + + This is done by reading from the /proc/PID/pagemap. The bit 55 of the + 64-bit qword is the soft-dirty one. 
If set, the respective PTE was + written to since step 1. + + + Internally, to do this tracking, the writable bit is cleared from PTEs +when the soft-dirty bit is cleared. So, after this, when the task tries to +modify a page at some virtual address the #PF occurs and the kernel sets +the soft-dirty bit on the respective PTE. + + Note, that although all the task's address space is marked as r/o after the +soft-dirty bits clear, the #PF-s that occur after that are processed fast. +This is so, since the pages are still mapped to physical memory, and thus all +the kernel does is finds this fact out and puts both writable and soft-dirty +bits on the PTE. + + + This feature is actively used by the checkpoint-restore project. You +can find more details about it on http://criu.org + + +-- Pavel Emelyanov, Apr 9, 2013 diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 1e67223..eb97470 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -207,7 +207,7 @@ static inline pte_t pte_mkexec(pte_t pte) static inline pte_t pte_mkdirty(pte_t pte) { - return pte_set_flags(pte, _PAGE_DIRTY); + return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } static inline pte_t pte_mkyoung(pte_t pte) @@ -271,7 +271,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd) static inline pmd_t pmd_mkdirty(pmd_t pmd) { - return pmd_set_flags(pmd, _PAGE_DIRTY); + return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } static inline pmd_t pmd_mkhuge(pmd_t pmd) @@ -294,6 +294,28 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd) return pmd_clear_flags(pmd, _PAGE_PRESENT); } +#define __HAVE_SOFT_DIRTY + +static inline int pte_soft_dirty(pte_t pte) +{ + return pte_flags(pte) & _PAGE_SOFT_DIRTY; +} + +static inline int pmd_soft_dirty(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_SOFT_DIRTY; +} + +static inline pte_t pte_mksoft_dirty(pte_t pte) +{ + return pte_set_flags(pte, _PAGE_SOFT_DIRTY); +} + +static inline pmd_t 
pmd_mksoft_dirty(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY); +} + /* * Mask out unsupported bits in a present pgprot. Non-present pgprots * can use those bits for other purposes, so leave them be. diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 567b5d0..dcf718c 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -55,6 +55,18 @@ #define _PAGE_HIDDEN (_AT(pteval_t, 0)) #endif +/* + * The same hidden bit is used by kmemcheck, but since kmemcheck + * works on kernel pages while soft-dirty engine on user space, + * they do not conflict with each other. + */ + +#ifdef CONFIG_MEM_SOFT_DIRTY +#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN) +#else +#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 0)) +#endif + #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #else diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 3138009..aae2474 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -692,13 +692,32 @@ enum clear_refs_types { CLEAR_REFS_ALL = 1, CLEAR_REFS_ANON, CLEAR_REFS_MAPPED, + CLEAR_REFS_SOFT_DIRTY, CLEAR_REFS_LAST, }; struct clear_refs_private { struct vm_area_struct *vma; + enum clear_refs_types type; }; +static inline void clear_soft_dirty(struct vm_area_struct *vma, + unsigned long addr, pte_t *pte) +{ +#ifdef CONFIG_MEM_SOFT_DIRTY + /* + * The soft-dirty tracker uses #PF-s to catch writes + * to pages, so write-protect the pte as well. See the + * Documentation/vm/soft-dirty.txt for full description + * of how soft-dirty works. 
+ */ + pte_t ptent = *pte; + ptent = pte_wrprotect(ptent); + ptent = pte_clear_flags(ptent, _PAGE_SOFT_DIRTY); + set_pte_at(vma->vm_mm, addr, pte, ptent); +#endif +} + static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -718,6 +731,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, if (!pte_present(ptent)) continue; + if (cp->type == CLEAR_REFS_SOFT_DIRTY) { + clear_soft_dirty(vma, addr, pte); + continue; + } + page = vm_normal_page(vma, addr, ptent); if (!page) continue; @@ -757,6 +775,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, mm = get_task_mm(task); if (mm) { struct clear_refs_private cp = { + .type = type, }; struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range, @@ -825,6 +844,7 @@ struct pagemapread { /* in pagemap2 pshift bits are occupied with more status bits */ #define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT)) +#define __PM_SOFT_DIRTY (1LL) #define PM_PRESENT PM_STATUS(4LL) #define PM_SWAP PM_STATUS(2LL) #define PM_FILE PM_STATUS(1LL) @@ -866,6 +886,7 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, { u64 frame, flags; struct page *page = NULL; + int flags2 = 0; if (pte_present(pte)) { frame = pte_pfn(pte); @@ -886,13 +907,15 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, if (page && !PageAnon(page)) flags |= PM_FILE; + if (pte_soft_dirty(pte)) + flags2 |= __PM_SOFT_DIRTY; - *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, 0) | flags); + *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, - pmd_t pmd, int offset) + pmd_t pmd, int offset, int pmd_flags2) { /* * Currently pmd for thp is always present because thp can not be @@ -901,13 +924,13 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct 
pagemapread *p */ if (pmd_present(pmd)) *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset) - | PM_STATUS2(pm->v2, 0) | PM_PRESENT); + | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT); else *pme = make_pme(PM_NOT_PRESENT(pm->v2)); } #else static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, - pmd_t pmd, int offset) + pmd_t pmd, int offset, int pmd_flags2) { } #endif @@ -924,12 +947,15 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, /* find the first VMA at or above 'addr' */ vma = find_vma(walk->mm, addr); if (vma && pmd_trans_huge_lock(pmd, vma) == 1) { + int pmd_flags2; + + pmd_flags2 = (pmd_soft_dirty(*pmd) ? __PM_SOFT_DIRTY : 0); for (; addr != end; addr += PAGE_SIZE) { unsigned long offset; offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset); + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); err = add_to_pagemap(addr, &pme, pm); if (err) break; diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index bfd8768..d74bdd2 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -386,6 +386,28 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm, #define arch_start_context_switch(prev) do {} while (0) #endif +#ifndef __HAVE_SOFT_DIRTY +static inline int pte_soft_dirty(pte_t pte) +{ + return 0; +} + +static inline int pmd_soft_dirty(pmd_t pmd) +{ + return 0; +} + +static inline pte_t pte_mksoft_dirty(pte_t pte) +{ + return pte; +} + +static inline pmd_t pmd_mksoft_dirty(pmd_t pmd) +{ + return pmd; +} +#endif + #ifndef __HAVE_PFNMAP_TRACKING /* * Interfaces that can be used by architecture code to keep track of diff --git a/mm/Kconfig b/mm/Kconfig index 3bea74f..147689e 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -471,3 +471,15 @@ config FRONTSWAP and swap data is stored as normal on the matching swap device. If unsure, say Y to enable frontswap. 
+
+config MEM_SOFT_DIRTY
+	bool "Track memory changes"
+	depends on CHECKPOINT_RESTORE && X86
+	select PROC_PAGE_MONITOR
+	help
+	  This option enables memory changes tracking by introducing a
+	  soft-dirty bit on pte-s. This bit is set when someone writes
+	  into a page just as regular dirty bit, but unlike the latter
+	  it can be cleared by hand.
+
+	  See Documentation/vm/soft-dirty.txt for more details.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..eef1606 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1431,7 +1431,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 	if (ret == 1) {
 		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
 		VM_BUG_ON(!pmd_none(*new_pmd));
-		set_pmd_at(mm, new_addr, new_pmd, pmd);
+		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
 		spin_unlock(&mm->page_table_lock);
 	}
 out:
diff --git a/mm/mremap.c b/mm/mremap.c
index 463a257..3708655 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 			continue;
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
-		set_pte_at(mm, new_addr, new_pte, pte);
+		set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
 	}
 
 	arch_leave_lazy_mmu_mode();
-- 
1.7.6.5
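For readers following the series from userspace: a pagemap2 entry is a 64-bit word, and this thread places the present flag at bit 63 and the soft-dirty flag at bit 55. A minimal C sketch of decoding such entries might look like the following. The helper names (pme_present, pme_soft_dirty, pagemap_offset, count_soft_dirty) are invented here for illustration and do not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit positions as described in this thread (assumed pagemap2 layout). */
#define PME_PRESENT    (1ULL << 63)
#define PME_SOFT_DIRTY (1ULL << 55)

/* Hypothetical helper: is the page present in RAM? */
static inline int pme_present(uint64_t pme)
{
	return (pme & PME_PRESENT) != 0;
}

/* Hypothetical helper: was the page written since the last clear_refs? */
static inline int pme_soft_dirty(uint64_t pme)
{
	return (pme & PME_SOFT_DIRTY) != 0;
}

/* Byte offset inside the pagemap file of the entry covering vaddr. */
static inline uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	return vaddr / page_size * sizeof(uint64_t);
}

/* Count entries that are both present and soft-dirty. */
static size_t count_soft_dirty(const uint64_t *pme, size_t n)
{
	size_t i, dirty = 0;

	for (i = 0; i < n; i++)
		if (pme_present(pme[i]) && pme_soft_dirty(pme[i]))
			dirty++;
	return dirty;
}
```

A caller would typically seek to pagemap_offset(addr, page_size) in the pagemap2 file, read a batch of entries into a buffer, and pass that buffer to count_soft_dirty() or test the entries one by one.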
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105]) by kanga.kvack.org (Postfix) with SMTP id CB1B36B0006 for ; Thu, 11 Apr 2013 17:17:36 -0400 (EDT) Date: Thu, 11 Apr 2013 14:17:35 -0700 From: Andrew Morton Subject: Re: [PATCH 1/5] clear_refs: Sanitize accepted commands declaration Message-Id: <20130411141735.107e583ca55e619f2e215851@linux-foundation.org> In-Reply-To: <51669E73.2000301@parallels.com> References: <51669E5F.4000801@parallels.com> <51669E73.2000301@parallels.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Pavel Emelyanov Cc: Linux MM , Linux Kernel Mailing List On Thu, 11 Apr 2013 15:28:51 +0400 Pavel Emelyanov wrote: > A new clear-refs type will be added in the next patch, so prepare > code for that. > > @@ -730,7 +733,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, > char buffer[PROC_NUMBUF]; > struct mm_struct *mm; > struct vm_area_struct *vma; > - int type; > + enum clear_refs_types type; > int rv; > > memset(buffer, 0, sizeof(buffer)); > @@ -738,10 +741,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, > count = sizeof(buffer) - 1; > if (copy_from_user(buffer, buf, count)) > return -EFAULT; > - rv = kstrtoint(strstrip(buffer), 10, &type); > + rv = kstrtoint(strstrip(buffer), 10, (int *)&type); This is naughty. The compiler is allowed to put the enum into storage which is smaller (or, I guess, larger) than sizeof(int). I've seen one compiler which puts such an enum into a 16-bit word. 
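The concern above is that `(int *)&type` stores sizeof(int) bytes through a pointer to an object whose size the compiler is free to choose. A userspace sketch of the safer pattern (parse into a plain int, range-check, and only then convert to the enum) could look like this. It uses strtol() in place of the kernel-only kstrtoint(); parse_clear_refs_type() is a made-up name, and the enum values are an assumption based on "echo 4" selecting the soft-dirty type in this series.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Assumed numbering: "echo 4 > clear_refs" selects the soft-dirty type. */
enum clear_refs_types {
	CLEAR_REFS_ALL = 1,
	CLEAR_REFS_ANON,
	CLEAR_REFS_MAPPED,
	CLEAR_REFS_SOFT_DIRTY,
	CLEAR_REFS_LAST,
};

/* Hypothetical helper: returns 0 and fills *out, or -EINVAL on bad input. */
static int parse_clear_refs_type(const char *s, enum clear_refs_types *out)
{
	char *end;
	long v;
	int itype;

	errno = 0;
	v = strtol(s, &end, 10);
	if (errno || end == s || v < INT_MIN || v > INT_MAX)
		return -EINVAL;

	/* Allow trailing whitespace such as the newline added by echo. */
	while (*end == ' ' || *end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	/* Validate as an int first, so the enum never holds a bogus value. */
	itype = (int)v;
	if (itype < CLEAR_REFS_ALL || itype >= CLEAR_REFS_LAST)
		return -EINVAL;

	*out = (enum clear_refs_types)itype;
	return 0;
}
```

Doing the range check on the int before the conversion means no out-of-range value is ever stored in the enum object, whatever storage size the compiler gave it.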
--- a/fs/proc/task_mmu.c~clear_refs-sanitize-accepted-commands-declaration-fix +++ a/fs/proc/task_mmu.c @@ -734,6 +734,7 @@ static ssize_t clear_refs_write(struct f struct mm_struct *mm; struct vm_area_struct *vma; enum clear_refs_types type; + int itype; int rv; memset(buffer, 0, sizeof(buffer)); @@ -741,9 +742,10 @@ static ssize_t clear_refs_write(struct f count = sizeof(buffer) - 1; if (copy_from_user(buffer, buf, count)) return -EFAULT; - rv = kstrtoint(strstrip(buffer), 10, (int *)&type); + rv = kstrtoint(strstrip(buffer), 10, &itype); if (rv < 0) return rv; + type = (enum clear_refs_types)itype; if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST) return -EINVAL; task = get_proc_task(file_inode(file)); _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx120.postini.com [74.125.245.120]) by kanga.kvack.org (Postfix) with SMTP id 6F1E36B0006 for ; Thu, 11 Apr 2013 17:19:46 -0400 (EDT) Date: Thu, 11 Apr 2013 14:19:44 -0700 From: Andrew Morton Subject: Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file Message-Id: <20130411141944.dc17b3b1c78132eedec06aa6@linux-foundation.org> In-Reply-To: <51669EA5.20209@parallels.com> References: <51669E5F.4000801@parallels.com> <51669EA5.20209@parallels.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Pavel Emelyanov Cc: Linux MM , Linux Kernel Mailing List On Thu, 11 Apr 2013 15:29:41 +0400 Pavel Emelyanov wrote: > This file is the same as the pagemap one, but shows entries with bits > 55-60 being zero (reserved for future use). Next patch will occupy one > of them. I'm not understanding the motivation for this. What does the current /proc/pid/pagemap have in those bit positions? 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx204.postini.com [74.125.245.204]) by kanga.kvack.org (Postfix) with SMTP id 191E76B0005 for ; Thu, 11 Apr 2013 17:24:19 -0400 (EDT) Date: Thu, 11 Apr 2013 14:24:17 -0700 From: Andrew Morton Subject: Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking Message-Id: <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org> In-Reply-To: <51669EB8.2020102@parallels.com> References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Pavel Emelyanov Cc: Linux MM , Linux Kernel Mailing List On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov wrote: > The soft-dirty is a bit on a PTE which helps to track which pages a task > writes to. In order to do this tracking one should > > 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs) > 2. Wait some time. > 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries) > > To do this tracking, the writable bit is cleared from PTEs when the > soft-dirty bit is. Thus, after this, when the task tries to modify a page > at some virtual address the #PF occurs and the kernel sets the soft-dirty > bit on the respective PTE. > > Note, that although all the task's address space is marked as r/o after the > soft-dirty bits clear, the #PF-s that occur after that are processed fast. > This is so, since the pages are still mapped to physical memory, and thus > all the kernel does is finds this fact out and puts back writable, dirty > and soft-dirty bits on the PTE. 
> > Another thing to note, is that when mremap moves PTEs they are marked with > soft-dirty as well, since from the user perspective mremap modifies the > virtual memory at mremap's new address. > > ... > > +config MEM_SOFT_DIRTY > + bool "Track memory changes" > + depends on CHECKPOINT_RESTORE && X86 I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is a general facility and I expect others will want to get their hands on it for unrelated things. >>From that perspective, the dependency on X86 is awful. What's the problem here and what do other architectures need to do to be able to support the feature? You have a test application, I assume. It would be helpful if we could get that into tools/testing/selftests. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id 0284F6B0005 for ; Fri, 12 Apr 2013 09:10:41 -0400 (EDT) Message-ID: <516807CB.6040208@parallels.com> Date: Fri, 12 Apr 2013 17:10:35 +0400 From: Pavel Emelyanov MIME-Version: 1.0 Subject: Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file References: <51669E5F.4000801@parallels.com> <51669EA5.20209@parallels.com> <20130411141944.dc17b3b1c78132eedec06aa6@linux-foundation.org> In-Reply-To: <20130411141944.dc17b3b1c78132eedec06aa6@linux-foundation.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Linux MM , Linux Kernel Mailing List On 04/12/2013 01:19 AM, Andrew Morton wrote: > On Thu, 11 Apr 2013 15:29:41 +0400 Pavel Emelyanov wrote: > >> This file is the same as the pagemap one, but shows entries with bits >> 55-60 being zero (reserved for future use). 
Next patch will occupy one >> of them. > > I'm not understanding the motivation for this. What does the current > /proc/pid/pagemap have in those bit positions? A constant PAGE_SHIFT value. > > . > Thanks, Pavel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx141.postini.com [74.125.245.141]) by kanga.kvack.org (Postfix) with SMTP id 236F86B0005 for ; Fri, 12 Apr 2013 09:14:07 -0400 (EDT) Message-ID: <5168089B.7060305@parallels.com> Date: Fri, 12 Apr 2013 17:14:03 +0400 From: Pavel Emelyanov MIME-Version: 1.0 Subject: Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com> <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org> In-Reply-To: <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Linux MM , Linux Kernel Mailing List On 04/12/2013 01:24 AM, Andrew Morton wrote: > On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov wrote: > >> The soft-dirty is a bit on a PTE which helps to track which pages a task >> writes to. In order to do this tracking one should >> >> 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs) >> 2. Wait some time. >> 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries) >> >> To do this tracking, the writable bit is cleared from PTEs when the >> soft-dirty bit is. Thus, after this, when the task tries to modify a page >> at some virtual address the #PF occurs and the kernel sets the soft-dirty >> bit on the respective PTE. 
>> >> Note, that although all the task's address space is marked as r/o after the >> soft-dirty bits clear, the #PF-s that occur after that are processed fast. >> This is so, since the pages are still mapped to physical memory, and thus >> all the kernel does is finds this fact out and puts back writable, dirty >> and soft-dirty bits on the PTE. >> >> Another thing to note, is that when mremap moves PTEs they are marked with >> soft-dirty as well, since from the user perspective mremap modifies the >> virtual memory at mremap's new address. >> >> ... >> >> +config MEM_SOFT_DIRTY >> + bool "Track memory changes" >> + depends on CHECKPOINT_RESTORE && X86 > > I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is > a general facility and I expect others will want to get their hands on > it for unrelated things. OK. Just tell me when you need the dependency removing patch. >>>From that perspective, the dependency on X86 is awful. What's the > problem here and what do other architectures need to do to be able to > support the feature? The problem here is that I don't know what free bits are available on page table entries on other architectures. I was about to resolve this for ARM very soon, but for the rest of them I need help from other people. > You have a test application, I assume. It would be helpful if we could > get that into tools/testing/selftests. If a very stupid 10-lines test is OK, then I can cook a patch with it. Other than this I test this using the whole CRIU project, which is too big for inclusion. Thanks, Pavel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx117.postini.com [74.125.245.117]) by kanga.kvack.org (Postfix) with SMTP id 3FC2F6B0027 for ; Fri, 12 Apr 2013 11:53:49 -0400 (EDT)
Message-ID: <51682E08.9050107@parallels.com>
Date: Fri, 12 Apr 2013 19:53:44 +0400
From: Pavel Emelyanov
MIME-Version: 1.0
Subject: [PATCH 6/5] selftest: Add simple test for soft-dirty bit
References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com>
In-Reply-To: <51669EB8.2020102@parallels.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton , Linux MM , Linux Kernel Mailing List

It creates a mapping of 3 pages and checks that reads, writes and
clear-refs result in present and soft-dirty bits reported from pagemap2
set as expected.

Signed-off-by: Pavel Emelyanov

---

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 575ef80..827f2c0 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -6,6 +6,7 @@ TARGETS += cpu-hotplug
 TARGETS += memory-hotplug
 TARGETS += efivarfs
 TARGETS += ptrace
+TARGETS += soft-dirty
 
 all:
 	for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/soft-dirty/Makefile b/tools/testing/selftests/soft-dirty/Makefile
new file mode 100644
index 0000000..a9cdc82
--- /dev/null
+++ b/tools/testing/selftests/soft-dirty/Makefile
@@ -0,0 +1,10 @@
+CFLAGS += -iquote../../../../include/uapi -Wall
+soft-dirty: soft-dirty.c
+
+all: soft-dirty
+
+clean:
+	rm -f soft-dirty
+
+run_tests: all
+	@./soft-dirty || echo "soft-dirty selftests: [FAIL]"
diff --git a/tools/testing/selftests/soft-dirty/soft-dirty.c b/tools/testing/selftests/soft-dirty/soft-dirty.c
new file mode 100644
index 0000000..aba4f87
--- /dev/null
+++ b/tools/testing/selftests/soft-dirty/soft-dirty.c
@@ -0,0 +1,114 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+
+typedef unsigned long long u64;
+
+#define PME_PRESENT	(1ULL << 63)
+#define PME_SOFT_DIRTY	(1ULL << 55)
+
+#define PAGES_TO_TEST	3
+#ifndef PAGE_SIZE
+#define PAGE_SIZE	4096
+#endif
+
+static void get_pagemap2(char *mem, u64 *map)
+{
+	int fd;
+
+	fd = open("/proc/self/pagemap2", O_RDONLY);
+	if (fd < 0) {
+		perror("Can't open pagemap2");
+		exit(1);
+	}
+
+	lseek(fd, (unsigned long)mem / PAGE_SIZE * sizeof(u64), SEEK_SET);
+	read(fd, map, sizeof(u64) * PAGES_TO_TEST);
+	close(fd);
+}
+
+static inline char map_p(u64 map)
+{
+	return map & PME_PRESENT ? 'p' : '-';
+}
+
+static inline char map_sd(u64 map)
+{
+	return map & PME_SOFT_DIRTY ? 'd' : '-';
+}
+
+static int check_pte(int step, int page, u64 *map, u64 want)
+{
+	if ((map[page] & want) != want) {
+		printf("Step %d Page %d has %c%c, want %c%c\n",
+				step, page,
+				map_p(map[page]), map_sd(map[page]),
+				map_p(want), map_sd(want));
+		return 1;
+	}
+
+	return 0;
+}
+
+static void clear_refs(void)
+{
+	int fd;
+	char *v = "4";
+
+	fd = open("/proc/self/clear_refs", O_WRONLY);
+	if (write(fd, v, 1) < 1) {
+		perror("Can't clear soft-dirty bit");
+		exit(1);
+	}
+	close(fd);
+}
+
+int main(void)
+{
+	char *mem, x;
+	u64 map[PAGES_TO_TEST];
+
+	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
+			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, 0, 0);
+
+	x = mem[0];
+	mem[2 * PAGE_SIZE] = 'c';
+	get_pagemap2(mem, map);
+
+	if (check_pte(1, 0, map, PME_PRESENT))
+		return 1;
+	if (check_pte(1, 1, map, 0))
+		return 1;
+	if (check_pte(1, 2, map, PME_PRESENT | PME_SOFT_DIRTY))
+		return 1;
+
+	clear_refs();
+	get_pagemap2(mem, map);
+
+	if (check_pte(2, 0, map, PME_PRESENT))
+		return 1;
+	if (check_pte(2, 1, map, 0))
+		return 1;
+	if (check_pte(2, 2, map, PME_PRESENT))
+		return 1;
+
+	mem[0] = 'a';
+	mem[PAGE_SIZE] = 'b';
+	x = mem[2 * PAGE_SIZE];
+	get_pagemap2(mem, map);
+
+	if (check_pte(3, 0, map, PME_PRESENT | PME_SOFT_DIRTY))
+		return 1;
+	if (check_pte(3, 1, map, PME_PRESENT | PME_SOFT_DIRTY))
+		return 1;
+	if (check_pte(3, 2, map, PME_PRESENT))
+		return 1;
+
+	(void)x; /* gcc warn */
+
+	printf("PASS\n");
+	return 0;
+}

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx138.postini.com [74.125.245.138]) by kanga.kvack.org (Postfix) with SMTP id 249C46B0002 for ; Mon, 15 Apr 2013 17:46:22 -0400 (EDT)
Date: Mon, 15 Apr 2013 14:46:19 -0700
From: Andrew Morton
Subject: Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
Message-Id: <20130415144619.645394d8ecdb180d7757a735@linux-foundation.org>
In-Reply-To: <5168089B.7060305@parallels.com>
References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com> <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org> <5168089B.7060305@parallels.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pavel Emelyanov
Cc: Linux MM , Linux Kernel Mailing List

On Fri, 12 Apr 2013 17:14:03 +0400 Pavel Emelyanov wrote:

> On 04/12/2013 01:24 AM, Andrew Morton wrote:
> > On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov wrote:
> >
> >> The soft-dirty is a bit on a PTE which helps to track which pages a task
> >> writes to. In order to do this tracking one should
> >>
> >> 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
> >> 2. Wait some time.
> >> 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
> >>
> >> To do this tracking, the writable bit is cleared from PTEs when the
> >> soft-dirty bit is. Thus, after this, when the task tries to modify a page
> >> at some virtual address the #PF occurs and the kernel sets the soft-dirty
> >> bit on the respective PTE.
> >> > >> Note, that although all the task's address space is marked as r/o after the > >> soft-dirty bits clear, the #PF-s that occur after that are processed fast. > >> This is so, since the pages are still mapped to physical memory, and thus > >> all the kernel does is finds this fact out and puts back writable, dirty > >> and soft-dirty bits on the PTE. > >> > >> Another thing to note, is that when mremap moves PTEs they are marked with > >> soft-dirty as well, since from the user perspective mremap modifies the > >> virtual memory at mremap's new address. > >> > >> ... > >> > >> +config MEM_SOFT_DIRTY > >> + bool "Track memory changes" > >> + depends on CHECKPOINT_RESTORE && X86 > > > > I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is > > a general facility and I expect others will want to get their hands on > > it for unrelated things. > > OK. Just tell me when you need the dependency removing patch. > > >>From that perspective, the dependency on X86 is awful. What's the > > problem here and what do other architectures need to do to be able to > > support the feature? > > The problem here is that I don't know what free bits are available on > page table entries on other architectures. I was about to resolve this > for ARM very soon, but for the rest of them I need help from other people. Well, this is also a thing arch maintainers can do when they feel a need to support the feature on their architecture. To support them at that time we should provide them with a) adequate information in an easy-to-find place (eg, a nice comment at the site of the reference x86 implementation) and b) a userspace test app. > > You have a test application, I assume. It would be helpful if we could > > get that into tools/testing/selftests. > > If a very stupid 10-lines test is OK, then I can cook a patch with it. I think that would be good. As a low-priority thing, please. 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx147.postini.com [74.125.245.147]) by kanga.kvack.org (Postfix) with SMTP id C2E5D6B0002 for ; Mon, 15 Apr 2013 19:58:04 -0400 (EDT)
Date: Tue, 16 Apr 2013 09:57:53 +1000
From: Stephen Rothwell
Subject: Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
Message-Id: <20130416095753.d94fa7d74db6c4293ec7dea9@canb.auug.org.au>
In-Reply-To: <20130415144619.645394d8ecdb180d7757a735@linux-foundation.org>
References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com> <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org> <5168089B.7060305@parallels.com> <20130415144619.645394d8ecdb180d7757a735@linux-foundation.org>
Mime-Version: 1.0
Content-Type: multipart/signed; protocol="application/pgp-signature"; micalg="PGP-SHA256"; boundary="Signature=_Tue__16_Apr_2013_09_57_53_+1000_DzV3xC3VcGJY8=O2"
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Pavel Emelyanov , Linux MM , Linux Kernel Mailing List

On Mon, 15 Apr 2013 14:46:19 -0700 Andrew Morton wrote:
>
> Well, this is also a thing arch maintainers can do when they feel a
> need to support the feature on their architecture. To support them at
> that time we should provide them with a) adequate information in an
> easy-to-find place (eg, a nice comment at the site of the reference x86
> implementation) and b) a userspace test app.

and c) a CONFIG symbol (maybe CONFIG_HAVE_MEM_SOFT_DIRTY, maybe in
arch/Kconfig) that they can select to get this feature (so that this
feature then depends on that CONFIG symbol instead of X86). That way we
don't have to go back and tidy this up when 15 or so architectures
implement it.

-- 
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx133.postini.com [74.125.245.133]) by kanga.kvack.org (Postfix) with SMTP id 13A916B0002 for ; Tue, 16 Apr 2013 15:51:54 -0400 (EDT) Message-ID: <516DABC8.1040606@parallels.com> Date: Tue, 16 Apr 2013 23:51:36 +0400 From: Pavel Emelyanov MIME-Version: 1.0 Subject: [PATCH 7/5] mem-soft-dirty: Reshuffle CONFIG_ options to be more Arch-friendly References: <51669E5F.4000801@parallels.com> In-Reply-To: <51669E5F.4000801@parallels.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton , Linux MM , Linux Kernel Mailing List Cc: Stephen Rothwell As Stephen Rothwell pointed out, config options, that depend on architecture support, are better to be wrapped into a select + depends on scheme. Do this for CONFIG_MEM_SOFT_DIRTY, as it currently works only for X86. Signed-off-by: Pavel Emelyanov Cc: Stephen Rothwell --- diff --git a/arch/Kconfig b/arch/Kconfig index 1455579..71c06ab 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -365,6 +365,9 @@ config HAVE_IRQ_TIME_ACCOUNTING config HAVE_ARCH_TRANSPARENT_HUGEPAGE bool +config HAVE_ARCH_SOFT_DIRTY + bool + config HAVE_MOD_ARCH_SPECIFIC bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 70c0f3d..81c0843 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -120,6 +120,7 @@ config X86 select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION select OLD_SIGACTION if X86_32 select COMPAT_OLD_SIGACTION if IA32_EMULATION + select HAVE_ARCH_SOFT_DIRTY config INSTRUCTION_DECODER def_bool y diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index eb97470..ebf9373 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -294,8 +294,6 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd) return pmd_clear_flags(pmd, _PAGE_PRESENT); } -#define __HAVE_SOFT_DIRTY - static inline int 
pte_soft_dirty(pte_t pte) { return pte_flags(pte) & _PAGE_SOFT_DIRTY; diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index d74bdd2..a2ca78f 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -386,7 +386,7 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm, #define arch_start_context_switch(prev) do {} while (0) #endif -#ifndef __HAVE_SOFT_DIRTY +#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY static inline int pte_soft_dirty(pte_t pte) { return 0; diff --git a/mm/Kconfig b/mm/Kconfig index 147689e..7deac66 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -474,7 +474,7 @@ config FRONTSWAP config MEM_SOFT_DIRTY bool "Track memory changes" - depends on CHECKPOINT_RESTORE && X86 + depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY select PROC_PAGE_MONITOR help This option enables memory changes tracking by introducing a -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
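As an illustration of what this reshuffle buys an arch maintainer, here is a small standalone C sketch of the pattern: generic code calls the soft-dirty helpers unconditionally, and the helpers either touch a real PTE bit or collapse to no-ops depending on one config symbol. The pte_t type below is a toy stand-in, and bit 11 for _PAGE_SOFT_DIRTY is a placeholder picked for this example only, not a statement about any architecture's page table layout. page_was_written() is a hypothetical caller.

```c
/* Toy stand-in for the kernel's pte_t; not the real definition. */
typedef struct { unsigned long pte; } pte_t;

/*
 * Pretend the architecture did "select HAVE_ARCH_SOFT_DIRTY" in Kconfig.
 * Remove this define to get the generic no-op stubs instead.
 */
#define CONFIG_HAVE_ARCH_SOFT_DIRTY 1

#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
/* Arch-provided helpers: the bit number here is a placeholder. */
#define _PAGE_SOFT_DIRTY (1UL << 11)

static inline int pte_soft_dirty(pte_t pte)
{
	return (pte.pte & _PAGE_SOFT_DIRTY) != 0;
}

static inline pte_t pte_mksoft_dirty(pte_t pte)
{
	pte.pte |= _PAGE_SOFT_DIRTY;
	return pte;
}
#else
/* Generic fallbacks: the feature silently compiles away. */
static inline int pte_soft_dirty(pte_t pte)
{
	(void)pte;
	return 0;
}

static inline pte_t pte_mksoft_dirty(pte_t pte)
{
	return pte;
}
#endif

/* Hypothetical generic caller: needs no #ifdef of its own. */
static int page_was_written(pte_t pte)
{
	return pte_soft_dirty(pte);
}
```

With the define removed, page_was_written() always returns 0, which mirrors how the stubs in asm-generic/pgtable.h behave on an architecture that has not yet picked a free PTE bit.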
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx111.postini.com [74.125.245.111]) by kanga.kvack.org (Postfix) with SMTP id AFBEC6B0002 for ; Tue, 16 Apr 2013 15:58:22 -0400 (EDT)
Message-ID: <516DAD59.2020104@parallels.com>
Date: Tue, 16 Apr 2013 23:58:17 +0400
From: Pavel Emelyanov
MIME-Version: 1.0
Subject: Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com> <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org> <5168089B.7060305@parallels.com> <20130415144619.645394d8ecdb180d7757a735@linux-foundation.org>
In-Reply-To: <20130415144619.645394d8ecdb180d7757a735@linux-foundation.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Linux MM , Linux Kernel Mailing List

>>> From that perspective, the dependency on X86 is awful. What's the
>>> problem here and what do other architectures need to do to be able to
>>> support the feature?
>>
>> The problem here is that I don't know what free bits are available on
>> page table entries on other architectures. I was about to resolve this
>> for ARM very soon, but for the rest of them I need help from other people.
>
> Well, this is also a thing arch maintainers can do when they feel a
> need to support the feature on their architecture. To support them at
> that time we should provide them with a) adequate information in an
> easy-to-find place (eg, a nice comment at the site of the reference x86
> implementation) and b) a userspace test app.

Item a) is presumably covered by two things: the required arch-specific
PTE manipulations are all collected in asm-generic/pgtable.h under
!CONFIG_HAVE_ARCH_SOFT_DIRTY, and Documentation/vm/soft-dirty.txt is
pointed to by the comment on clear_refs_soft_dirty().

Item b) was recently merged.
Item c) from Stephen is already sent.

Thank you for your time and help,
Pavel

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx145.postini.com [74.125.245.145]) by kanga.kvack.org (Postfix) with SMTP id 920826B0002 for ; Tue, 16 Apr 2013 19:25:11 -0400 (EDT)
Date: Wed, 17 Apr 2013 09:24:59 +1000
From: Stephen Rothwell
Subject: Re: [PATCH 7/5] mem-soft-dirty: Reshuffle CONFIG_ options to be more Arch-friendly
Message-Id: <20130417092459.a574ebb81a734973ff7081f9@canb.auug.org.au>
In-Reply-To: <516DABC8.1040606@parallels.com>
References: <51669E5F.4000801@parallels.com> <516DABC8.1040606@parallels.com>
Mime-Version: 1.0
Content-Type: multipart/signed; protocol="application/pgp-signature"; micalg="PGP-SHA256"; boundary="Signature=_Wed__17_Apr_2013_09_24_59_+1000_3u_Bxfn+37=1Py.I"
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pavel Emelyanov
Cc: Andrew Morton , Linux MM , Linux Kernel Mailing List

Hi Pavel,

On Tue, 16 Apr 2013 23:51:36 +0400 Pavel Emelyanov wrote:
>
> As Stephen Rothwell pointed out, config options, that depend on
> architecture support, are better to be wrapped into a select +
> depends on scheme.
>
> Do this for CONFIG_MEM_SOFT_DIRTY, as it currently works only
> for X86.
>
> Signed-off-by: Pavel Emelyanov
> Cc: Stephen Rothwell

Acked-by: Stephen Rothwell

-- 
Cheers,
Stephen Rothwell sfr@canb.auug.org.au

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx109.postini.com [74.125.245.109]) by kanga.kvack.org (Postfix) with SMTP id 0E3806B02FF for ; Fri, 3 May 2013 18:58:27 -0400 (EDT)
Received: from /spool/local by e8.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
Violators will be prosecuted for from ; Fri, 3 May 2013 18:58:26 -0400
Received: from d01relay05.pok.ibm.com (d01relay05.pok.ibm.com [9.56.227.237]) by d01dlp02.pok.ibm.com (Postfix) with ESMTP id 4EA006E803A for ; Fri, 3 May 2013 18:58:16 -0400 (EDT)
Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay05.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r43MwJAx330972 for ; Fri, 3 May 2013 18:58:19 -0400
Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r43MwIJE010229 for ; Fri, 3 May 2013 19:58:18 -0300
Date: Thu, 2 May 2013 10:08:57 -0700
From: Matt Helsley
Subject: Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file
Message-ID: <20130502170857.GB24627@us.ibm.com>
References: <51669E5F.4000801@parallels.com> <51669EA5.20209@parallels.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <51669EA5.20209@parallels.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pavel Emelyanov
Cc: Andrew Morton , Linux MM , Linux Kernel Mailing List

On Thu, Apr 11, 2013 at 03:29:41PM +0400, Pavel Emelyanov wrote:
> This file is the same as the pagemap one, but shows entries with bits
> 55-60 being zero (reserved for future use). Next patch will occupy one
> of them.

This approach doesn't scale as well as it could. As best I can see
CRIU would do:

for each vma in /proc//smaps
	for each page in /proc//pagemap2
		if soft dirty bit
			copy page

(possibly with pfn checks to avoid copying the same page mapped in
multiple locations..)

However, if soft dirty bit changes could be queued up (from say the
fault handler and page table ops that map/unmap pages) and accumulated
in something like an interval tree it could be something like:

for each range of changed pages
	for each page in range
		copy page

IOW something that scales with the number of changed pages rather
than the number of mapped pages.
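To make the "for each range of changed pages" shape concrete, here is a rough userspace sketch in C that turns a buffer of pagemap2 entries into runs of consecutive soft-dirty pages. It is only an illustration of the range-based consumer: it still scans every entry, so it does not provide the scaling benefit of the kernel-side queueing proposed above. The struct page_range type and collect_dirty_ranges() are hypothetical names, and bit 55 is the soft-dirty position used in this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Soft-dirty bit position as used in this series (assumption). */
#define PME_SOFT_DIRTY (1ULL << 55)

/* Hypothetical type: a run of 'count' pages starting at index 'first'. */
struct page_range {
	size_t first;
	size_t count;
};

/*
 * Coalesce consecutive soft-dirty entries into ranges.
 * Returns the number of ranges written; stops early once 'out' is full.
 */
static size_t collect_dirty_ranges(const uint64_t *pme, size_t npages,
				   struct page_range *out, size_t max_out)
{
	size_t i, n = 0;

	for (i = 0; i < npages; i++) {
		if (!(pme[i] & PME_SOFT_DIRTY))
			continue;

		/* Extend the previous range if this page is adjacent to it. */
		if (n > 0 && out[n - 1].first + out[n - 1].count == i) {
			out[n - 1].count++;
			continue;
		}

		if (n == max_out)
			break;

		out[n].first = i;
		out[n].count = 1;
		n++;
	}

	return n;
}
```

A dumper could then walk the returned ranges and copy each run with a single read, which is the inner loop of the second pseudocode listing.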
So I wonder if CRIU would abandon pagemap2 in the future for something like this. Cheers, -Matt Helsley From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id E815D6B030B for ; Sat, 4 May 2013 05:47:48 -0400 (EDT) Message-ID: <5184D93C.7000806@parallels.com> Date: Sat, 04 May 2013 13:47:40 +0400 From: Pavel Emelyanov MIME-Version: 1.0 Subject: Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file References: <51669E5F.4000801@parallels.com> <51669EA5.20209@parallels.com> <20130502170857.GB24627@us.ibm.com> In-Reply-To: <20130502170857.GB24627@us.ibm.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Matt Helsley Cc: Andrew Morton , Linux MM , Linux Kernel Mailing List On 05/02/2013 09:08 PM, Matt Helsley wrote: > On Thu, Apr 11, 2013 at 03:29:41PM +0400, Pavel Emelyanov wrote: >> This file is the same as the pagemap one, but shows entries with bits >> 55-60 being zero (reserved for future use). Next patch will occupy one >> of them. > > This approach doesn't scale as well as it could. As best I can see > CRIU would do: > > for each vma in /proc//smaps > for each page in /proc//pagemap2 > if soft dirty bit > copy page > > (possibly with pfn checks to avoid copying the same page mapped in > multiple locations..) Comparing pfns obtained from two subsequent pagemap reads doesn't help at all. If they are equal, this can mean either that the page is shared or (less likely, but still) that the page that was mapped at the 1st location was reclaimed and then mapped at the 2nd between the two reads.
If they differ, it can again mean either not shared (most likely) or shared (the pfns were equal, but the page got reclaimed and swapped back in). Some better API for page sharing would be nice; such an API could probably also be reused for the user-space KSM :) > However, if soft dirty bit changes could be queued up (from say the > fault handler and page table ops that map/unmap pages) and accumulated > in something like an interval tree it could be something like: > > for each range of changed pages > for each page in range > copy page > > IOW something that scales with the number of changed pages rather > than the number of mapped pages. > > So I wonder if CRIU would abandon pagemap2 in the future for something > like this. We'd surely adopt such an API if one exists. One thing to note is that we'd also appreciate it if this API were able to batch the "present" bits as well as the "swapped" and "page-file" ones. We use these three in CRIU as well, and scanning these bits can also be optimized. > Cheers, > -Matt Helsley > Thanks, Pavel
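For illustration, the per-page scan discussed in this exchange can be sketched in C roughly as below. This is only a sketch under assumptions, not CRIU's actual code: the bit positions (63 present, 62 swap, 61 file, 55 soft-dirty) follow the PM_* macros in this patch set, while struct page_range and collect_dirty_ranges() are hypothetical names made up for the example. It walks a buffer of pagemap2 entries that was already read from the file and coalesces soft-dirty pages into ranges, so the cost is still proportional to the number of mapped pages, which is exactly the scaling concern raised above.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit layout of a pagemap2 entry as proposed in this series (assumed). */
#define PME_PRESENT    (1ULL << 63)
#define PME_SWAP       (1ULL << 62)
#define PME_FILE       (1ULL << 61)
#define PME_SOFT_DIRTY (1ULL << 55)

/* Hypothetical type: a run of consecutive pages, in units of pages. */
struct page_range {
	size_t first;	/* index of the first page in the run */
	size_t count;	/* number of pages in the run */
};

/*
 * Hypothetical helper: scan npages entries and coalesce consecutive
 * soft-dirty pages into ranges. Writes at most max ranges into out
 * and returns how many were written.
 */
size_t collect_dirty_ranges(const uint64_t *pme, size_t npages,
			    struct page_range *out, size_t max)
{
	size_t nranges = 0;
	size_t i = 0;

	while (i < npages && nranges < max) {
		size_t start;

		if (!(pme[i] & PME_SOFT_DIRTY)) {
			i++;
			continue;
		}
		/* Extend the run while the soft-dirty bit stays set. */
		start = i;
		while (i < npages && (pme[i] & PME_SOFT_DIRTY))
			i++;
		out[nranges].first = start;
		out[nranges].count = i - start;
		nranges++;
	}
	return nranges;
}
```

A caller would then copy only the pages covered by the returned ranges; the PME_PRESENT, PME_SWAP and PME_FILE masks could be tested in the same pass.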
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753702Ab3DKL2q (ORCPT ); Thu, 11 Apr 2013 07:28:46 -0400 Received: from mailhub.sw.ru ([195.214.232.25]:41589 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751587Ab3DKL2p (ORCPT ); Thu, 11 Apr 2013 07:28:45 -0400 Message-ID: <51669E5F.4000801@parallels.com> Date: Thu, 11 Apr 2013 15:28:31 +0400 From: Pavel Emelyanov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0 MIME-Version: 1.0 To: Andrew Morton , Linux MM , Linux Kernel Mailing List Subject: [PATCH 0/5] mm: Ability to monitor task memory changes (v3) Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, This is the implementation of the soft-dirty bit concept that should help keep track of changes in user memory, which in turn is badly needed by the checkpoint-restore project (http://criu.org). Let me briefly recap what the issue is. << EOF To create a dump of an application(s) we save all the information about it to files, and the biggest part of such a dump is the contents of the tasks' memory. However, there are usage scenarios where it's not required to get _all_ the task memory while creating a dump. For example, when doing periodic dumps, a full memory dump is only required at the first step; after that only incremental changes of memory need to be taken. Another example is live migration. We copy all the memory to the destination node without stopping the tasks, then stop them, check which pages have changed, dump those and the rest of the state, then copy it to the destination node. This decreases freeze time significantly. That said, some help from the kernel to watch how processes modify the contents of their memory is required.
EOF The proposal is to track changes with the help of a new soft-dirty bit this way: 1. First do "echo 4 > /proc/$pid/clear_refs". At that point the kernel clears the soft-dirty _and_ the writable bits from all ptes of process $pid. From now on every write to any page will result in a #pf and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set the soft-dirty flag. 2. Then read /proc/$pid/pagemap2 and check the soft-dirty bit reported there (bit 55). If set, the respective pte was written to since the last call to clear_refs. The soft-dirty bit is the _PAGE_BIT_HIDDEN one. Although it's also used by kmemcheck, kmemcheck marks kernel pages with it, while the soft-dirty bit is put on user pages, so they do not conflict with each other. The set is against v3.9-rc5. It includes preparations to /proc/pid's clear_refs file, adds the pagemap2 one and the soft-dirty concept itself with Andrew's comments on the previous patch (hopefully) fixed. History of the set: * Previous version of this patch, commented on by Andrew: http://lwn.net/Articles/546184/ * Pre-previous ftrace-based approach: http://permalink.gmane.org/gmane.linux.kernel.mm/91428 This one was not nice, because ftrace could drop events so we might miss significant information about page updates. Another issue with it: it was impossible to use it to watch an arbitrary task, since the task had to mark memory areas with madvise itself to make events occur. Also, a program that monitored the update events could interfere with anyone else trying to use ftrace.
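The two steps of the proposal can be sketched from user space roughly as follows. This is a minimal illustration under assumptions, not part of the patch set: it assumes a kernel with this series applied (so that /proc/$pid/pagemap2 exists and bit 55 is the soft-dirty bit), and the helper names pme_is_soft_dirty(), pme_offset(), request_soft_dirty_clear() and page_was_written() are hypothetical. Each pagemap2 entry is one 64-bit word per virtual page, so the entry for an address sits at (vaddr / page_size) * 8 bytes into the file.

```c
#define _XOPEN_SOURCE 700	/* for pread() */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit 55 of a pagemap2 entry is the soft-dirty bit (per this series). */
#define PME_SOFT_DIRTY (1ULL << 55)

/* Hypothetical helper: 1 if the entry has the soft-dirty bit set, else 0. */
int pme_is_soft_dirty(uint64_t pme)
{
	return (pme & PME_SOFT_DIRTY) != 0;
}

/* Hypothetical helper: byte offset of the 64-bit entry covering vaddr. */
off_t pme_offset(uintptr_t vaddr, long page_size)
{
	return (off_t)((vaddr / (uintptr_t)page_size) * sizeof(uint64_t));
}

/* Step 1: same effect as "echo 4 > /proc/$pid/clear_refs". 0 on success. */
int request_soft_dirty_clear(pid_t pid)
{
	char path[64];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/proc/%ld/clear_refs", (long)pid);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = write(fd, "4", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}

/*
 * Step 2: read the pagemap2 entry for vaddr. Returns 1 if the page was
 * written to since the last clear, 0 if not, -1 on error.
 */
int page_was_written(pid_t pid, uintptr_t vaddr)
{
	char path[64];
	uint64_t pme;
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/proc/%ld/pagemap2", (long)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, &pme, sizeof(pme),
		  pme_offset(vaddr, sysconf(_SC_PAGESIZE)));
	close(fd);
	if (n != (ssize_t)sizeof(pme))
		return -1;
	return pme_is_soft_dirty(pme);
}
```

A monitor would call request_soft_dirty_clear(pid), let the task run for a while, and then call page_was_written() for each address of interest. On a kernel without this series the pagemap2 open fails and the helper returns -1.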
Signed-off-by: Pavel Emelyanov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754029Ab3DKL3A (ORCPT ); Thu, 11 Apr 2013 07:29:00 -0400 Received: from mailhub.sw.ru ([195.214.232.25]:27126 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752582Ab3DKL26 (ORCPT ); Thu, 11 Apr 2013 07:28:58 -0400 Message-ID: <51669E73.2000301@parallels.com> Date: Thu, 11 Apr 2013 15:28:51 +0400 From: Pavel Emelyanov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0 MIME-Version: 1.0 To: Andrew Morton , Linux MM , Linux Kernel Mailing List Subject: [PATCH 1/5] clear_refs: Sanitize accepted commands declaration References: <51669E5F.4000801@parallels.com> In-Reply-To: <51669E5F.4000801@parallels.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org A new clear-refs type will be added in the next patch, so prepare code for that. 
Signed-off-by: Pavel Emelyanov --- fs/proc/task_mmu.c | 17 ++++++++++------- 1 files changed, 10 insertions(+), 7 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 3e636d8..67c2586 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -688,6 +688,13 @@ const struct file_operations proc_tid_smaps_operations = { .release = seq_release_private, }; +enum clear_refs_types { + CLEAR_REFS_ALL = 1, + CLEAR_REFS_ANON, + CLEAR_REFS_MAPPED, + CLEAR_REFS_LAST, +}; + static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -719,10 +726,6 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, return 0; } -#define CLEAR_REFS_ALL 1 -#define CLEAR_REFS_ANON 2 -#define CLEAR_REFS_MAPPED 3 - static ssize_t clear_refs_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { @@ -730,7 +733,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, char buffer[PROC_NUMBUF]; struct mm_struct *mm; struct vm_area_struct *vma; - int type; + enum clear_refs_types type; int rv; memset(buffer, 0, sizeof(buffer)); @@ -738,10 +741,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, count = sizeof(buffer) - 1; if (copy_from_user(buffer, buf, count)) return -EFAULT; - rv = kstrtoint(strstrip(buffer), 10, &type); + rv = kstrtoint(strstrip(buffer), 10, (int *)&type); if (rv < 0) return rv; - if (type < CLEAR_REFS_ALL || type > CLEAR_REFS_MAPPED) + if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST) return -EINVAL; task = get_proc_task(file_inode(file)); if (!task) -- 1.7.6.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754352Ab3DKL3T (ORCPT ); Thu, 11 Apr 2013 07:29:19 -0400 Received: from mailhub.sw.ru ([195.214.232.25]:24511 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752582Ab3DKL3Q (ORCPT ); Thu, 11 Apr 2013 07:29:16 
-0400 Message-ID: <51669E85.1020702@parallels.com> Date: Thu, 11 Apr 2013 15:29:09 +0400 From: Pavel Emelyanov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0 MIME-Version: 1.0 To: Andrew Morton , Linux MM , Linux Kernel Mailing List Subject: [PATCH 2/5] clear_refs: Introduce private struct for mm_walk References: <51669E5F.4000801@parallels.com> In-Reply-To: <51669E5F.4000801@parallels.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org In the next patch the clear-refs type will be required in the clear_refs_pte_range function, so prepare walk->private to carry this info. Signed-off-by: Pavel Emelyanov --- fs/proc/task_mmu.c | 12 ++++++++++-- 1 files changed, 10 insertions(+), 2 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 67c2586..c59a148 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -695,10 +695,15 @@ enum clear_refs_types { CLEAR_REFS_LAST, }; +struct clear_refs_private { + struct vm_area_struct *vma; +}; + static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { - struct vm_area_struct *vma = walk->private; + struct clear_refs_private *cp = walk->private; + struct vm_area_struct *vma = cp->vma; pte_t *pte, ptent; spinlock_t *ptl; struct page *page; @@ -751,13 +756,16 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, return -ESRCH; mm = get_task_mm(task); if (mm) { + struct clear_refs_private cp = { + }; struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range, .mm = mm, + .private = &cp, }; down_read(&mm->mmap_sem); for (vma = mm->mmap; vma; vma = vma->vm_next) { - clear_refs_walk.private = vma; + cp.vma = vma; if (is_vm_hugetlb_page(vma)) continue; /* -- 1.7.6.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via
listexpand id S1765439Ab3DKLaD (ORCPT ); Thu, 11 Apr 2013 07:30:03 -0400 Received: from mailhub.sw.ru ([195.214.232.25]:48527 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754001Ab3DKL3t (ORCPT ); Thu, 11 Apr 2013 07:29:49 -0400 Message-ID: <51669EA5.20209@parallels.com> Date: Thu, 11 Apr 2013 15:29:41 +0400 From: Pavel Emelyanov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0 MIME-Version: 1.0 To: Andrew Morton , Linux MM , Linux Kernel Mailing List Subject: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file References: <51669E5F.4000801@parallels.com> In-Reply-To: <51669E5F.4000801@parallels.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This file is the same as the pagemap one, but shows entries with bits 55-60 being zero (reserved for future use). Next patch will occupy one of them. Signed-off-by: Pavel Emelyanov --- Documentation/filesystems/proc.txt | 2 ++ Documentation/vm/pagemap.txt | 3 +++ fs/proc/base.c | 2 ++ fs/proc/internal.h | 1 + fs/proc/task_mmu.c | 11 +++++++++++ 5 files changed, 19 insertions(+), 0 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index fd8d0d5..22c47ec 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -487,6 +487,8 @@ Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags using /proc/kpageflags and number of times a page is mapped using /proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt. +(There's also a /proc/pid/pagemap2 file which is the 2nd version of the + pagemap one). 
1.2 Kernel data --------------- diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index 7587493..4350397 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -30,6 +30,9 @@ There are three components to pagemap: determine which areas of memory are actually mapped and llseek to skip over unmapped regions. + * /proc/pid/pagemap2. This file provides the same info as the pagemap + does, but bits 55-60 are reserved for future use and thus zero + * /proc/kpagecount. This file contains a 64-bit count of the number of times each page is mapped, indexed by PFN. diff --git a/fs/proc/base.c b/fs/proc/base.c index 69078c7..34966ce 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2537,6 +2537,7 @@ static const struct pid_entry tgid_base_stuff[] = { REG("clear_refs", S_IWUSR, proc_clear_refs_operations), REG("smaps", S_IRUGO, proc_pid_smaps_operations), REG("pagemap", S_IRUGO, proc_pagemap_operations), + REG("pagemap2", S_IRUGO, proc_pagemap2_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), @@ -2882,6 +2883,7 @@ static const struct pid_entry tid_base_stuff[] = { REG("clear_refs", S_IWUSR, proc_clear_refs_operations), REG("smaps", S_IRUGO, proc_tid_smaps_operations), REG("pagemap", S_IRUGO, proc_pagemap_operations), + REG("pagemap2", S_IRUGO, proc_pagemap2_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 85ff3a4..cc12bb7 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -67,6 +67,7 @@ extern const struct file_operations proc_pid_smaps_operations; extern const struct file_operations proc_tid_smaps_operations; extern const struct file_operations proc_clear_refs_operations; extern const struct file_operations proc_pagemap_operations; +extern const struct file_operations proc_pagemap2_operations; 
extern const struct file_operations proc_net_operations; extern const struct inode_operations proc_net_inode_operations; extern const struct inode_operations proc_pid_link_inode_operations; diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 7f9b66c..3138009 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1135,6 +1135,17 @@ const struct file_operations proc_pagemap_operations = { .llseek = mem_lseek, /* borrow this */ .read = pagemap_read, }; + +static ssize_t pagemap2_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + return do_pagemap_read(file, buf, count, ppos, true); +} + +const struct file_operations proc_pagemap2_operations = { + .llseek = mem_lseek, /* borrow this */ + .read = pagemap2_read, +}; #endif /* CONFIG_PROC_PAGE_MONITOR */ #ifdef CONFIG_NUMA -- 1.7.6.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932397Ab3DKLaO (ORCPT ); Thu, 11 Apr 2013 07:30:14 -0400 Received: from mailhub.sw.ru ([195.214.232.25]:3465 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1765547Ab3DKLaJ (ORCPT ); Thu, 11 Apr 2013 07:30:09 -0400 Message-ID: <51669EB8.2020102@parallels.com> Date: Thu, 11 Apr 2013 15:30:00 +0400 From: Pavel Emelyanov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0 MIME-Version: 1.0 To: Andrew Morton , Linux MM , Linux Kernel Mailing List Subject: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking References: <51669E5F.4000801@parallels.com> In-Reply-To: <51669E5F.4000801@parallels.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The soft-dirty is a bit on a PTE which helps to track which pages a task writes to. In order to do this tracking one should 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs) 2. 
Wait some time. 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries) To do this tracking, the writable bit is cleared from PTEs when the soft-dirty bit is. Thus, after this, when the task tries to modify a page at some virtual address the #PF occurs and the kernel sets the soft-dirty bit on the respective PTE. Note, that although all the task's address space is marked as r/o after the soft-dirty bits clear, the #PF-s that occur after that are processed fast. This is so, since the pages are still mapped to physical memory, and thus all the kernel does is finds this fact out and puts back writable, dirty and soft-dirty bits on the PTE. Another thing to note, is that when mremap moves PTEs they are marked with soft-dirty as well, since from the user perspective mremap modifies the virtual memory at mremap's new address. Signed-off-by: Pavel Emelyanov --- Documentation/filesystems/proc.txt | 7 +++++- Documentation/vm/pagemap.txt | 4 ++- Documentation/vm/soft-dirty.txt | 36 ++++++++++++++++++++++++++++++++++ arch/x86/include/asm/pgtable.h | 26 ++++++++++++++++++++++- arch/x86/include/asm/pgtable_types.h | 6 +++++ fs/proc/task_mmu.c | 36 +++++++++++++++++++++++++++++---- include/asm-generic/pgtable.h | 22 ++++++++++++++++++++ mm/Kconfig | 12 +++++++++++ mm/huge_memory.c | 2 +- mm/mremap.c | 2 +- 10 files changed, 142 insertions(+), 11 deletions(-) create mode 100644 Documentation/vm/soft-dirty.txt diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 22c47ec..488c094 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -473,7 +473,8 @@ This file is only present if the CONFIG_MMU kernel configuration option is enabled. The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG -bits on both physical and virtual pages associated with a process. 
+bits on both physical and virtual pages associated with a process, and the +soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details). To clear the bits for all the pages associated with the process > echo 1 > /proc/PID/clear_refs @@ -482,6 +483,10 @@ To clear the bits for the anonymous pages associated with the process To clear the bits for the file mapped pages associated with the process > echo 3 > /proc/PID/clear_refs + +To clear the soft-dirty bit + > echo 4 > /proc/PID/clear_refs + Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index 4350397..394cc03 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -31,7 +31,9 @@ There are three components to pagemap: skip over unmapped regions. * /proc/pid/pagemap2. This file provides the same info as the pagemap - does, but bits 55-60 are reserved for future use and thus zero + does, but bits 56-60 are reserved for future use and thus zero + + Bit 55 means pte is soft-dirty (see Documentation/vm/soft-dirty.txt) * /proc/kpagecount. This file contains a 64-bit count of the number of times each page is mapped, indexed by PFN. diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/vm/soft-dirty.txt new file mode 100644 index 0000000..9a12a59 --- /dev/null +++ b/Documentation/vm/soft-dirty.txt @@ -0,0 +1,36 @@ + SOFT-DIRTY PTEs + + The soft-dirty is a bit on a PTE which helps to track which pages a task +writes to. In order to do this tracking one should + + 1. Clear soft-dirty bits from the task's PTEs. + + This is done by writing "4" into the /proc/PID/clear_refs file of the + task in question. + + 2. Wait some time. + + 3. Read soft-dirty bits from the PTEs. + + This is done by reading from the /proc/PID/pagemap. The bit 55 of the + 64-bit qword is the soft-dirty one. 
If set, the respective PTE was + written to since step 1. + + + Internally, to do this tracking, the writable bit is cleared from PTEs +when the soft-dirty bit is cleared. So, after this, when the task tries to +modify a page at some virtual address the #PF occurs and the kernel sets +the soft-dirty bit on the respective PTE. + + Note, that although all the task's address space is marked as r/o after the +soft-dirty bits clear, the #PF-s that occur after that are processed fast. +This is so, since the pages are still mapped to physical memory, and thus all +the kernel does is finds this fact out and puts both writable and soft-dirty +bits on the PTE. + + + This feature is actively used by the checkpoint-restore project. You +can find more details about it on http://criu.org + + +-- Pavel Emelyanov, Apr 9, 2013 diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 1e67223..eb97470 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -207,7 +207,7 @@ static inline pte_t pte_mkexec(pte_t pte) static inline pte_t pte_mkdirty(pte_t pte) { - return pte_set_flags(pte, _PAGE_DIRTY); + return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } static inline pte_t pte_mkyoung(pte_t pte) @@ -271,7 +271,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd) static inline pmd_t pmd_mkdirty(pmd_t pmd) { - return pmd_set_flags(pmd, _PAGE_DIRTY); + return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } static inline pmd_t pmd_mkhuge(pmd_t pmd) @@ -294,6 +294,28 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd) return pmd_clear_flags(pmd, _PAGE_PRESENT); } +#define __HAVE_SOFT_DIRTY + +static inline int pte_soft_dirty(pte_t pte) +{ + return pte_flags(pte) & _PAGE_SOFT_DIRTY; +} + +static inline int pmd_soft_dirty(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_SOFT_DIRTY; +} + +static inline pte_t pte_mksoft_dirty(pte_t pte) +{ + return pte_set_flags(pte, _PAGE_SOFT_DIRTY); +} + +static inline pmd_t 
pmd_mksoft_dirty(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY); +} + /* * Mask out unsupported bits in a present pgprot. Non-present pgprots * can use those bits for other purposes, so leave them be. diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 567b5d0..dcf718c 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -55,6 +55,18 @@ #define _PAGE_HIDDEN (_AT(pteval_t, 0)) #endif +/* + * The same hidden bit is used by kmemcheck, but since kmemcheck + * works on kernel pages while soft-dirty engine on user space, + * they do not conflict with each other. + */ + +#ifdef CONFIG_MEM_SOFT_DIRTY +#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN) +#else +#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 0)) +#endif + #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #else diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 3138009..aae2474 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -692,13 +692,32 @@ enum clear_refs_types { CLEAR_REFS_ALL = 1, CLEAR_REFS_ANON, CLEAR_REFS_MAPPED, + CLEAR_REFS_SOFT_DIRTY, CLEAR_REFS_LAST, }; struct clear_refs_private { struct vm_area_struct *vma; + enum clear_refs_types type; }; +static inline void clear_soft_dirty(struct vm_area_struct *vma, + unsigned long addr, pte_t *pte) +{ +#ifdef CONFIG_MEM_SOFT_DIRTY + /* + * The soft-dirty tracker uses #PF-s to catch writes + * to pages, so write-protect the pte as well. See the + * Documentation/vm/soft-dirty.txt for full description + * of how soft-dirty works. 
+ */ + pte_t ptent = *pte; + ptent = pte_wrprotect(ptent); + ptent = pte_clear_flags(ptent, _PAGE_SOFT_DIRTY); + set_pte_at(vma->vm_mm, addr, pte, ptent); +#endif +} + static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -718,6 +731,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, if (!pte_present(ptent)) continue; + if (cp->type == CLEAR_REFS_SOFT_DIRTY) { + clear_soft_dirty(vma, addr, pte); + continue; + } + page = vm_normal_page(vma, addr, ptent); if (!page) continue; @@ -757,6 +775,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, mm = get_task_mm(task); if (mm) { struct clear_refs_private cp = { + .type = type, }; struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range, @@ -825,6 +844,7 @@ struct pagemapread { /* in pagemap2 pshift bits are occupied with more status bits */ #define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT)) +#define __PM_SOFT_DIRTY (1LL) #define PM_PRESENT PM_STATUS(4LL) #define PM_SWAP PM_STATUS(2LL) #define PM_FILE PM_STATUS(1LL) @@ -866,6 +886,7 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, { u64 frame, flags; struct page *page = NULL; + int flags2 = 0; if (pte_present(pte)) { frame = pte_pfn(pte); @@ -886,13 +907,15 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, if (page && !PageAnon(page)) flags |= PM_FILE; + if (pte_soft_dirty(pte)) + flags2 |= __PM_SOFT_DIRTY; - *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, 0) | flags); + *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, - pmd_t pmd, int offset) + pmd_t pmd, int offset, int pmd_flags2) { /* * Currently pmd for thp is always present because thp can not be @@ -901,13 +924,13 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct 
pagemapread *p */ if (pmd_present(pmd)) *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset) - | PM_STATUS2(pm->v2, 0) | PM_PRESENT); + | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT); else *pme = make_pme(PM_NOT_PRESENT(pm->v2)); } #else static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, - pmd_t pmd, int offset) + pmd_t pmd, int offset, int pmd_flags2) { } #endif @@ -924,12 +947,15 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, /* find the first VMA at or above 'addr' */ vma = find_vma(walk->mm, addr); if (vma && pmd_trans_huge_lock(pmd, vma) == 1) { + int pmd_flags2; + + pmd_flags2 = (pmd_soft_dirty(*pmd) ? __PM_SOFT_DIRTY : 0); for (; addr != end; addr += PAGE_SIZE) { unsigned long offset; offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset); + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); err = add_to_pagemap(addr, &pme, pm); if (err) break; diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index bfd8768..d74bdd2 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -386,6 +386,28 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm, #define arch_start_context_switch(prev) do {} while (0) #endif +#ifndef __HAVE_SOFT_DIRTY +static inline int pte_soft_dirty(pte_t pte) +{ + return 0; +} + +static inline int pmd_soft_dirty(pmd_t pmd) +{ + return 0; +} + +static inline pte_t pte_mksoft_dirty(pte_t pte) +{ + return pte; +} + +static inline pmd_t pmd_mksoft_dirty(pmd_t pmd) +{ + return pmd; +} +#endif + #ifndef __HAVE_PFNMAP_TRACKING /* * Interfaces that can be used by architecture code to keep track of diff --git a/mm/Kconfig b/mm/Kconfig index 3bea74f..147689e 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -471,3 +471,15 @@ config FRONTSWAP and swap data is stored as normal on the matching swap device. If unsure, say Y to enable frontswap. 
+ +config MEM_SOFT_DIRTY + bool "Track memory changes" + depends on CHECKPOINT_RESTORE && X86 + select PROC_PAGE_MONITOR + help + This option enables memory changes tracking by introducing a + soft-dirty bit on pte-s. This bit it set when someone writes + into a page just as regular dirty bit, but unlike the latter + it can be cleared by hands. + + See Documentation/vm/soft-dirty.txt for more details. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e2f7f5aa..eef1606 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1431,7 +1431,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma, if (ret == 1) { pmd = pmdp_get_and_clear(mm, old_addr, old_pmd); VM_BUG_ON(!pmd_none(*new_pmd)); - set_pmd_at(mm, new_addr, new_pmd, pmd); + set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd)); spin_unlock(&mm->page_table_lock); } out: diff --git a/mm/mremap.c b/mm/mremap.c index 463a257..3708655 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, continue; pte = ptep_get_and_clear(mm, old_addr, old_pte); pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr); - set_pte_at(mm, new_addr, new_pte, pte); + set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte)); } arch_leave_lazy_mmu_mode(); -- 1.7.6.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756224Ab3DKL3h (ORCPT ); Thu, 11 Apr 2013 07:29:37 -0400 Received: from mailhub.sw.ru ([195.214.232.25]:20312 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752808Ab3DKL3d (ORCPT ); Thu, 11 Apr 2013 07:29:33 -0400 Message-ID: <51669E95.8060101@parallels.com> Date: Thu, 11 Apr 2013 15:29:25 +0400 From: Pavel Emelyanov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0 MIME-Version: 1.0 To: Andrew Morton , Linux MM , Linux Kernel Mailing List Subject: [PATCH 3/5] pagemap: 
Introduce pagemap_entry_t without pmshift bits References: <51669E5F.4000801@parallels.com> In-Reply-To: <51669E5F.4000801@parallels.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org These bits are always constant (== PAGE_SHIFT) and just occupy space in the entry. Moreover, in next patch we will need to report one more bit in the pagemap, but all bits are already busy on it. That said, describe the pagemap entry that has 6 more free zero bits. Signed-off-by: Pavel Emelyanov --- fs/proc/task_mmu.c | 50 ++++++++++++++++++++++++++++++-------------------- 1 files changed, 30 insertions(+), 20 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index c59a148..7f9b66c 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -805,6 +805,7 @@ typedef struct { struct pagemapread { int pos, len; pagemap_entry_t *buffer; + bool v2; }; #define PAGEMAP_WALK_SIZE (PMD_SIZE) @@ -818,14 +819,16 @@ struct pagemapread { #define PM_PSHIFT_BITS 6 #define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS) #define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET) -#define PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK) +#define __PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK) #define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1) #define PM_PFRAME(x) ((x) & PM_PFRAME_MASK) +/* in pagemap2 pshift bits are occupied with more status bits */ +#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? 
x : PAGE_SHIFT)) #define PM_PRESENT PM_STATUS(4LL) #define PM_SWAP PM_STATUS(2LL) #define PM_FILE PM_STATUS(1LL) -#define PM_NOT_PRESENT PM_PSHIFT(PAGE_SHIFT) +#define PM_NOT_PRESENT(v2) PM_STATUS2(v2, 0) #define PM_END_OF_BUFFER 1 static inline pagemap_entry_t make_pme(u64 val) @@ -848,7 +851,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end, struct pagemapread *pm = walk->private; unsigned long addr; int err = 0; - pagemap_entry_t pme = make_pme(PM_NOT_PRESENT); + pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); for (addr = start; addr < end; addr += PAGE_SIZE) { err = add_to_pagemap(addr, &pme, pm); @@ -858,7 +861,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end, return err; } -static void pte_to_pagemap_entry(pagemap_entry_t *pme, +static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, struct vm_area_struct *vma, unsigned long addr, pte_t pte) { u64 frame, flags; @@ -877,18 +880,18 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, if (is_migration_entry(entry)) page = migration_entry_to_page(entry); } else { - *pme = make_pme(PM_NOT_PRESENT); + *pme = make_pme(PM_NOT_PRESENT(pm->v2)); return; } if (page && !PageAnon(page)) flags |= PM_FILE; - *pme = make_pme(PM_PFRAME(frame) | PM_PSHIFT(PAGE_SHIFT) | flags); + *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, 0) | flags); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE -static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, +static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, pmd_t pmd, int offset) { /* @@ -898,12 +901,12 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, */ if (pmd_present(pmd)) *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset) - | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT); + | PM_STATUS2(pm->v2, 0) | PM_PRESENT); else - *pme = make_pme(PM_NOT_PRESENT); + *pme = make_pme(PM_NOT_PRESENT(pm->v2)); } #else -static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, +static 
inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, pmd_t pmd, int offset) { } @@ -916,7 +919,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct pagemapread *pm = walk->private; pte_t *pte; int err = 0; - pagemap_entry_t pme = make_pme(PM_NOT_PRESENT); + pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); /* find the first VMA at or above 'addr' */ vma = find_vma(walk->mm, addr); @@ -926,7 +929,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; - thp_pmd_to_pagemap_entry(&pme, *pmd, offset); + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset); err = add_to_pagemap(addr, &pme, pm); if (err) break; @@ -943,7 +946,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, * and need a new, higher one */ if (vma && (addr >= vma->vm_end)) { vma = find_vma(walk->mm, addr); - pme = make_pme(PM_NOT_PRESENT); + pme = make_pme(PM_NOT_PRESENT(pm->v2)); } /* check that 'vma' actually covers this address, @@ -951,7 +954,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, if (vma && (vma->vm_start <= addr) && !is_vm_hugetlb_page(vma)) { pte = pte_offset_map(pmd, addr); - pte_to_pagemap_entry(&pme, vma, addr, *pte); + pte_to_pagemap_entry(&pme, pm, vma, addr, *pte); /* unmap before userspace copy */ pte_unmap(pte); } @@ -966,14 +969,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, } #ifdef CONFIG_HUGETLB_PAGE -static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, +static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, pte_t pte, int offset) { if (pte_present(pte)) *pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset) - | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT); + | PM_STATUS2(pm->v2, 0) | PM_PRESENT); else - *pme = make_pme(PM_NOT_PRESENT); + *pme = make_pme(PM_NOT_PRESENT(pm->v2)); } /* This function 
walks within one hugetlb entry in the single call */ @@ -987,7 +990,7 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask, for (; addr != end; addr += PAGE_SIZE) { int offset = (addr & ~hmask) >> PAGE_SHIFT; - huge_pte_to_pagemap_entry(&pme, *pte, offset); + huge_pte_to_pagemap_entry(&pme, pm, *pte, offset); err = add_to_pagemap(addr, &pme, pm); if (err) return err; @@ -1023,8 +1026,8 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask, * determine which areas of memory are actually mapped and llseek to * skip over unmapped regions. */ -static ssize_t pagemap_read(struct file *file, char __user *buf, - size_t count, loff_t *ppos) +static ssize_t do_pagemap_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos, bool v2) { struct task_struct *task = get_proc_task(file_inode(file)); struct mm_struct *mm; @@ -1049,6 +1052,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf, if (!count) goto out_task; + pm.v2 = v2; pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); pm.buffer = kmalloc(pm.len, GFP_TEMPORARY); ret = -ENOMEM; @@ -1121,6 +1125,12 @@ out: return ret; } +static ssize_t pagemap_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + return do_pagemap_read(file, buf, count, ppos, false); +} + const struct file_operations proc_pagemap_operations = { .llseek = mem_lseek, /* borrow this */ .read = pagemap_read, -- 1.7.6.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935534Ab3DKVRi (ORCPT ); Thu, 11 Apr 2013 17:17:38 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:39043 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934464Ab3DKVRg (ORCPT ); Thu, 11 Apr 2013 17:17:36 -0400 Date: Thu, 11 Apr 2013 14:17:35 -0700 From: Andrew Morton To: Pavel Emelyanov Cc: Linux MM , Linux Kernel Mailing List Subject: Re: [PATCH 1/5] clear_refs: 
Sanitize accepted commands declaration
Message-Id: <20130411141735.107e583ca55e619f2e215851@linux-foundation.org>
In-Reply-To: <51669E73.2000301@parallels.com>
References: <51669E5F.4000801@parallels.com> <51669E73.2000301@parallels.com>
X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 11 Apr 2013 15:28:51 +0400 Pavel Emelyanov wrote:

> A new clear-refs type will be added in the next patch, so prepare
> code for that.
>
> @@ -730,7 +733,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  	char buffer[PROC_NUMBUF];
>  	struct mm_struct *mm;
>  	struct vm_area_struct *vma;
> -	int type;
> +	enum clear_refs_types type;
>  	int rv;
>
>  	memset(buffer, 0, sizeof(buffer));
> @@ -738,10 +741,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  		count = sizeof(buffer) - 1;
>  	if (copy_from_user(buffer, buf, count))
>  		return -EFAULT;
> -	rv = kstrtoint(strstrip(buffer), 10, &type);
> +	rv = kstrtoint(strstrip(buffer), 10, (int *)&type);

This is naughty.  The compiler is allowed to put the enum into storage
which is smaller (or, I guess, larger) than sizeof(int).  I've seen one
compiler which puts such an enum into a 16-bit word.
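
The concern can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the kernel code: it parses into a plain integer first, range-checks it, and only then converts to the enum, so nothing is ever written through an (int *) alias of an enum object. strtol() stands in for kstrtoint(), parse_clear_refs_type() is a hypothetical helper, and the enumerators between CLEAR_REFS_ALL and CLEAR_REFS_LAST are assumed (only those two names appear in this thread; the value 4 matches "echo 4 > clear_refs").

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Assumed layout: only CLEAR_REFS_ALL and CLEAR_REFS_LAST are named in the
 * thread; the entries in between are illustrative.
 */
enum clear_refs_types {
	CLEAR_REFS_ALL = 1,
	CLEAR_REFS_ANON,
	CLEAR_REFS_MAPPED,
	CLEAR_REFS_SOFT_DIRTY,	/* 4, as in "echo 4 > clear_refs" */
	CLEAR_REFS_LAST,
};

/*
 * Hypothetical helper: parse a decimal command into an integer, validate
 * the range, then convert. Returns 0 on success or -EINVAL on bad input.
 */
int parse_clear_refs_type(const char *s, enum clear_refs_types *out)
{
	char *end;
	long v;

	assert(s != NULL && out != NULL);

	errno = 0;
	v = strtol(s, &end, 10);
	if (errno != 0 || end == s)
		return -EINVAL;

	/* tolerate trailing blanks/newline, as "echo" would produce */
	while (*end == ' ' || *end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (v < CLEAR_REFS_ALL || v >= CLEAR_REFS_LAST)
		return -EINVAL;

	/* the enum object is only ever assigned as an enum */
	*out = (enum clear_refs_types)v;
	return 0;
}
```

The range check is done on the integer before the conversion, so an out-of-range value never reaches the enum variable regardless of how wide the compiler makes it.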
--- a/fs/proc/task_mmu.c~clear_refs-sanitize-accepted-commands-declaration-fix
+++ a/fs/proc/task_mmu.c
@@ -734,6 +734,7 @@ static ssize_t clear_refs_write(struct f
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	enum clear_refs_types type;
+	int itype;
 	int rv;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -741,9 +742,10 @@ static ssize_t clear_refs_write(struct f
 		count = sizeof(buffer) - 1;
 	if (copy_from_user(buffer, buf, count))
 		return -EFAULT;
-	rv = kstrtoint(strstrip(buffer), 10, (int *)&type);
+	rv = kstrtoint(strstrip(buffer), 10, &itype);
 	if (rv < 0)
 		return rv;
+	type = (enum clear_refs_types)itype;
 	if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
 		return -EINVAL;
 	task = get_proc_task(file_inode(file));
_

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S936394Ab3DKVTs (ORCPT ); Thu, 11 Apr 2013 17:19:48 -0400
Received: from mail.linuxfoundation.org ([140.211.169.12]:39126 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935467Ab3DKVTq (ORCPT ); Thu, 11 Apr 2013 17:19:46 -0400
Date: Thu, 11 Apr 2013 14:19:44 -0700
From: Andrew Morton
To: Pavel Emelyanov
Cc: Linux MM, Linux Kernel Mailing List
Subject: Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file
Message-Id: <20130411141944.dc17b3b1c78132eedec06aa6@linux-foundation.org>
In-Reply-To: <51669EA5.20209@parallels.com>
References: <51669E5F.4000801@parallels.com> <51669EA5.20209@parallels.com>
X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 11 Apr 2013 15:29:41 +0400 Pavel Emelyanov wrote:

> This file is the same as the pagemap one, but shows entries with bits
> 55-60 being zero (reserved for future use). Next patch will occupy one
> of them.

I'm not understanding the motivation for this.  What does the current
/proc/pid/pagemap have in those bit positions?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935767Ab3DKVYf (ORCPT ); Thu, 11 Apr 2013 17:24:35 -0400
Received: from mail.linuxfoundation.org ([140.211.169.12]:39133 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935434Ab3DKVYS (ORCPT ); Thu, 11 Apr 2013 17:24:18 -0400
Date: Thu, 11 Apr 2013 14:24:17 -0700
From: Andrew Morton
To: Pavel Emelyanov
Cc: Linux MM, Linux Kernel Mailing List
Subject: Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
Message-Id: <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org>
In-Reply-To: <51669EB8.2020102@parallels.com>
References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com>
X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov wrote:

> The soft-dirty is a bit on a PTE which helps to track which pages a task
> writes to. In order to do this tracking one should
>
>   1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs")
>   2. Wait some time.
>   3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
>
> To do this tracking, the writable bit is cleared from PTEs when the
> soft-dirty bit is. Thus, after this, when the task tries to modify a page
> at some virtual address the #PF occurs and the kernel sets the soft-dirty
> bit on the respective PTE.
>
> Note, that although all the task's address space is marked as r/o after the
> soft-dirty bits clear, the #PF-s that occur after that are processed fast.
> This is so, since the pages are still mapped to physical memory, and thus
> all the kernel does is finds this fact out and puts back writable, dirty
> and soft-dirty bits on the PTE.
>
> Another thing to note, is that when mremap moves PTEs they are marked with
> soft-dirty as well, since from the user perspective mremap modifies the
> virtual memory at mremap's new address.
>
> ...
>
> +config MEM_SOFT_DIRTY
> +	bool "Track memory changes"
> +	depends on CHECKPOINT_RESTORE && X86

I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is
a general facility and I expect others will want to get their hands on
it for unrelated things.

From that perspective, the dependency on X86 is awful.  What's the
problem here and what do other architectures need to do to be able to
support the feature?

You have a test application, I assume.  It would be helpful if we could
get that into tools/testing/selftests.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755390Ab3DLNKs (ORCPT ); Fri, 12 Apr 2013 09:10:48 -0400
Received: from mailhub.sw.ru ([195.214.232.25]:30016 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753152Ab3DLNKr (ORCPT ); Fri, 12 Apr 2013 09:10:47 -0400
Message-ID: <516807CB.6040208@parallels.com>
Date: Fri, 12 Apr 2013 17:10:35 +0400
From: Pavel Emelyanov
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0
MIME-Version: 1.0
To: Andrew Morton
CC: Linux MM, Linux Kernel Mailing List
Subject: Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file
References: <51669E5F.4000801@parallels.com> <51669EA5.20209@parallels.com> <20130411141944.dc17b3b1c78132eedec06aa6@linux-foundation.org>
In-Reply-To: <20130411141944.dc17b3b1c78132eedec06aa6@linux-foundation.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: 
linux-kernel@vger.kernel.org

On 04/12/2013 01:19 AM, Andrew Morton wrote:
> On Thu, 11 Apr 2013 15:29:41 +0400 Pavel Emelyanov wrote:
>
>> This file is the same as the pagemap one, but shows entries with bits
>> 55-60 being zero (reserved for future use). Next patch will occupy one
>> of them.
>
> I'm not understanding the motivation for this.  What does the current
> /proc/pid/pagemap have in those bit positions?

A constant PAGE_SHIFT value.

> .
>

Thanks,
Pavel

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755430Ab3DLNOM (ORCPT ); Fri, 12 Apr 2013 09:14:12 -0400
Received: from mailhub.sw.ru ([195.214.232.25]:34941 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754353Ab3DLNOK (ORCPT ); Fri, 12 Apr 2013 09:14:10 -0400
Message-ID: <5168089B.7060305@parallels.com>
Date: Fri, 12 Apr 2013 17:14:03 +0400
From: Pavel Emelyanov
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0
MIME-Version: 1.0
To: Andrew Morton
CC: Linux MM, Linux Kernel Mailing List
Subject: Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com> <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org>
In-Reply-To: <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/12/2013 01:24 AM, Andrew Morton wrote:
> On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov wrote:
>
>> The soft-dirty is a bit on a PTE which helps to track which pages a task
>> writes to. In order to do this tracking one should
>>
>>   1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs")
>>   2. Wait some time.
>>   3. 
Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
>>
>> To do this tracking, the writable bit is cleared from PTEs when the
>> soft-dirty bit is. Thus, after this, when the task tries to modify a page
>> at some virtual address the #PF occurs and the kernel sets the soft-dirty
>> bit on the respective PTE.
>>
>> Note, that although all the task's address space is marked as r/o after the
>> soft-dirty bits clear, the #PF-s that occur after that are processed fast.
>> This is so, since the pages are still mapped to physical memory, and thus
>> all the kernel does is finds this fact out and puts back writable, dirty
>> and soft-dirty bits on the PTE.
>>
>> Another thing to note, is that when mremap moves PTEs they are marked with
>> soft-dirty as well, since from the user perspective mremap modifies the
>> virtual memory at mremap's new address.
>>
>> ...
>>
>> +config MEM_SOFT_DIRTY
>> +	bool "Track memory changes"
>> +	depends on CHECKPOINT_RESTORE && X86
>
> I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is
> a general facility and I expect others will want to get their hands on
> it for unrelated things.

OK. Just tell me when you need the dependency-removing patch.

> From that perspective, the dependency on X86 is awful.  What's the
> problem here and what do other architectures need to do to be able to
> support the feature?

The problem here is that I don't know what free bits are available on
page table entries on other architectures. I was about to resolve this
for ARM very soon, but for the rest of them I need help from other people.

> You have a test application, I assume.  It would be helpful if we could
> get that into tools/testing/selftests.

If a very stupid 10-line test is OK, then I can cook a patch with it.
Other than this I test this using the whole CRIU project, which is too
big for inclusion.

Thanks,
Pavel

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756403Ab3DLPxy (ORCPT ); Fri, 12 Apr 2013 11:53:54 -0400
Received: from mailhub.sw.ru ([195.214.232.25]:21104 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753193Ab3DLPxx (ORCPT ); Fri, 12 Apr 2013 11:53:53 -0400
Message-ID: <51682E08.9050107@parallels.com>
Date: Fri, 12 Apr 2013 19:53:44 +0400
From: Pavel Emelyanov
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0
MIME-Version: 1.0
To: Andrew Morton, Linux MM, Linux Kernel Mailing List
Subject: [PATCH 6/5] selftest: Add simple test for soft-dirty bit
References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com>
In-Reply-To: <51669EB8.2020102@parallels.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

It creates a mapping of 3 pages and checks that reads, writes and
clear-refs result in present and soft-dirty bits reported from pagemap2
set as expected.

Signed-off-by: Pavel Emelyanov
---

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 575ef80..827f2c0 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -6,6 +6,7 @@ TARGETS += cpu-hotplug
 TARGETS += memory-hotplug
 TARGETS += efivarfs
 TARGETS += ptrace
+TARGETS += soft-dirty
 
 all:
 	for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/soft-dirty/Makefile b/tools/testing/selftests/soft-dirty/Makefile
new file mode 100644
index 0000000..a9cdc82
--- /dev/null
+++ b/tools/testing/selftests/soft-dirty/Makefile
@@ -0,0 +1,10 @@
+CFLAGS += -iquote../../../../include/uapi -Wall
+soft-dirty: soft-dirty.c
+
+all: soft-dirty
+
+clean:
+	rm -f soft-dirty
+
+run_tests: all
+	@./soft-dirty || echo "soft-dirty selftests: [FAIL]"
diff --git a/tools/testing/selftests/soft-dirty/soft-dirty.c b/tools/testing/selftests/soft-dirty/soft-dirty.c
new file mode 100644
index 0000000..aba4f87
--- /dev/null
+++ b/tools/testing/selftests/soft-dirty/soft-dirty.c
@@ -0,0 +1,114 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+
+typedef unsigned long long u64;
+
+#define PME_PRESENT	(1ULL << 63)
+#define PME_SOFT_DIRTY	(1Ull << 55)
+
+#define PAGES_TO_TEST	3
+#ifndef PAGE_SIZE
+#define PAGE_SIZE	4096
+#endif
+
+static void get_pagemap2(char *mem, u64 *map)
+{
+	int fd;
+
+	fd = open("/proc/self/pagemap2", O_RDONLY);
+	if (fd < 0) {
+		perror("Can't open pagemap2");
+		exit(1);
+	}
+
+	lseek(fd, (unsigned long)mem / PAGE_SIZE * sizeof(u64), SEEK_SET);
+	read(fd, map, sizeof(u64) * PAGES_TO_TEST);
+	close(fd);
+}
+
+static inline char map_p(u64 map)
+{
+	return map & PME_PRESENT ? 'p' : '-';
+}
+
+static inline char map_sd(u64 map)
+{
+	return map & PME_SOFT_DIRTY ? 
'd' : '-';
+}
+
+static int check_pte(int step, int page, u64 *map, u64 want)
+{
+	if ((map[page] & want) != want) {
+		printf("Step %d Page %d has %c%c, want %c%c\n",
+				step, page,
+				map_p(map[page]), map_sd(map[page]),
+				map_p(want), map_sd(want));
+		return 1;
+	}
+
+	return 0;
+}
+
+static void clear_refs(void)
+{
+	int fd;
+	char *v = "4";
+
+	fd = open("/proc/self/clear_refs", O_WRONLY);
+	if (write(fd, v, 3) < 3) {
+		perror("Can't clear soft-dirty bit");
+		exit(1);
+	}
+	close(fd);
+}
+
+int main(void)
+{
+	char *mem, x;
+	u64 map[PAGES_TO_TEST];
+
+	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
+			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, 0, 0);
+
+	x = mem[0];
+	mem[2 * PAGE_SIZE] = 'c';
+	get_pagemap2(mem, map);
+
+	if (check_pte(1, 0, map, PME_PRESENT))
+		return 1;
+	if (check_pte(1, 1, map, 0))
+		return 1;
+	if (check_pte(1, 2, map, PME_PRESENT | PME_SOFT_DIRTY))
+		return 1;
+
+	clear_refs();
+	get_pagemap2(mem, map);
+
+	if (check_pte(2, 0, map, PME_PRESENT))
+		return 1;
+	if (check_pte(2, 1, map, 0))
+		return 1;
+	if (check_pte(2, 2, map, PME_PRESENT))
+		return 1;
+
+	mem[0] = 'a';
+	mem[PAGE_SIZE] = 'b';
+	x = mem[2 * PAGE_SIZE];
+	get_pagemap2(mem, map);
+
+	if (check_pte(3, 0, map, PME_PRESENT | PME_SOFT_DIRTY))
+		return 1;
+	if (check_pte(3, 1, map, PME_PRESENT | PME_SOFT_DIRTY))
+		return 1;
+	if (check_pte(3, 2, map, PME_PRESENT))
+		return 1;
+
+	(void)x; /* gcc warn */
+
+	printf("PASS\n");
+	return 0;
+}

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935171Ab3DOVqW (ORCPT ); Mon, 15 Apr 2013 17:46:22 -0400
Received: from mail.linuxfoundation.org ([140.211.169.12]:47178 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934571Ab3DOVqV (ORCPT ); Mon, 15 Apr 2013 17:46:21 -0400
Date: Mon, 15 Apr 2013 14:46:19 -0700
From: Andrew Morton
To: Pavel Emelyanov
Cc: Linux MM, Linux Kernel Mailing List
Subject: Re: [PATCH 5/5] mm: 
Soft-dirty bits for user memory changes tracking
Message-Id: <20130415144619.645394d8ecdb180d7757a735@linux-foundation.org>
In-Reply-To: <5168089B.7060305@parallels.com>
References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com> <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org> <5168089B.7060305@parallels.com>
X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 12 Apr 2013 17:14:03 +0400 Pavel Emelyanov wrote:

> On 04/12/2013 01:24 AM, Andrew Morton wrote:
> > On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov wrote:
> >
> >> The soft-dirty is a bit on a PTE which helps to track which pages a task
> >> writes to. In order to do this tracking one should
> >>
> >>   1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs")
> >>   2. Wait some time.
> >>   3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
> >>
> >> To do this tracking, the writable bit is cleared from PTEs when the
> >> soft-dirty bit is. Thus, after this, when the task tries to modify a page
> >> at some virtual address the #PF occurs and the kernel sets the soft-dirty
> >> bit on the respective PTE.
> >>
> >> Note, that although all the task's address space is marked as r/o after the
> >> soft-dirty bits clear, the #PF-s that occur after that are processed fast.
> >> This is so, since the pages are still mapped to physical memory, and thus
> >> all the kernel does is finds this fact out and puts back writable, dirty
> >> and soft-dirty bits on the PTE.
> >>
> >> Another thing to note, is that when mremap moves PTEs they are marked with
> >> soft-dirty as well, since from the user perspective mremap modifies the
> >> virtual memory at mremap's new address.
> >>
> >> ...
> >>
> >> +config MEM_SOFT_DIRTY
> >> +	bool "Track memory changes"
> >> +	depends on CHECKPOINT_RESTORE && X86
> >
> > I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is
> > a general facility and I expect others will want to get their hands on
> > it for unrelated things.
>
> OK. Just tell me when you need the dependency-removing patch.
>
> > From that perspective, the dependency on X86 is awful.  What's the
> > problem here and what do other architectures need to do to be able to
> > support the feature?
>
> The problem here is that I don't know what free bits are available on
> page table entries on other architectures. I was about to resolve this
> for ARM very soon, but for the rest of them I need help from other people.

Well, this is also a thing arch maintainers can do when they feel a
need to support the feature on their architecture.  To support them at
that time we should provide them with a) adequate information in an
easy-to-find place (eg, a nice comment at the site of the reference x86
implementation) and b) a userspace test app.

> > You have a test application, I assume.  It would be helpful if we could
> > get that into tools/testing/selftests.
>
> If a very stupid 10-line test is OK, then I can cook a patch with it.

I think that would be good.  As a low-priority thing, please.
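
For anyone sizing up a port, the per-PTE behaviour described in the quoted changelog can be modelled in user space. The following is only a sketch under stated assumptions: model_pte_t and every model_* helper are hypothetical names, the bit positions are illustrative (loosely patterned on x86, where a spare software bit is used), and the write-fault step ignores VMA permissions, which the real fault path has to honour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t model_pte_t;	/* hypothetical stand-in for pte_t */

/* Illustrative bit positions only; a port must pick a free software bit. */
#define MODEL_PAGE_PRESENT	(1ULL << 0)
#define MODEL_PAGE_RW		(1ULL << 1)
#define MODEL_PAGE_DIRTY	(1ULL << 6)
#define MODEL_PAGE_SOFT_DIRTY	(1ULL << 11)

bool model_pte_soft_dirty(model_pte_t pte)
{
	return (pte & MODEL_PAGE_SOFT_DIRTY) != 0;
}

/* "echo 4 > clear_refs": drop soft-dirty together with the writable bit. */
model_pte_t model_clear_soft_dirty(model_pte_t pte)
{
	return pte & ~(MODEL_PAGE_SOFT_DIRTY | MODEL_PAGE_RW);
}

/*
 * Write fault on a page that is still mapped: put back writable, dirty
 * and soft-dirty. No page is allocated or copied, which is why the
 * changelog calls these faults fast.
 */
model_pte_t model_write_fault(model_pte_t pte)
{
	assert(pte & MODEL_PAGE_PRESENT);
	return pte | MODEL_PAGE_RW | MODEL_PAGE_DIRTY | MODEL_PAGE_SOFT_DIRTY;
}

/* mremap(): a moved PTE counts as modified at its new address. */
model_pte_t model_move_pte(model_pte_t pte)
{
	return pte | MODEL_PAGE_SOFT_DIRTY;
}
```

Note that clearing does not touch the hardware dirty bit in this model; only the soft-dirty and writable bits are dropped, so the tracking stays independent of writeback state.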

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S936009Ab3DPTwA (ORCPT ); Tue, 16 Apr 2013 15:52:00 -0400
Received: from mailhub.sw.ru ([195.214.232.25]:29556 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935589Ab3DPTv6 (ORCPT ); Tue, 16 Apr 2013 15:51:58 -0400
Message-ID: <516DABC8.1040606@parallels.com>
Date: Tue, 16 Apr 2013 23:51:36 +0400
From: Pavel Emelyanov
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0
MIME-Version: 1.0
To: Andrew Morton, Linux MM, Linux Kernel Mailing List
CC: Stephen Rothwell
Subject: [PATCH 7/5] mem-soft-dirty: Reshuffle CONFIG_ options to be more Arch-friendly
References: <51669E5F.4000801@parallels.com>
In-Reply-To: <51669E5F.4000801@parallels.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

As Stephen Rothwell pointed out, config options that depend on
architecture support are better wrapped into a "select" + "depends on"
scheme. Do this for CONFIG_MEM_SOFT_DIRTY, as it currently works only
on X86.

Signed-off-by: Pavel Emelyanov
Cc: Stephen Rothwell
---

diff --git a/arch/Kconfig b/arch/Kconfig
index 1455579..71c06ab 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -365,6 +365,9 @@ config HAVE_IRQ_TIME_ACCOUNTING
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	bool
 
+config HAVE_ARCH_SOFT_DIRTY
+	bool
+
 config HAVE_MOD_ARCH_SPECIFIC
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 70c0f3d..81c0843 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -120,6 +120,7 @@ config X86
 	select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
 	select OLD_SIGACTION if X86_32
 	select COMPAT_OLD_SIGACTION if IA32_EMULATION
+	select HAVE_ARCH_SOFT_DIRTY
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index eb97470..ebf9373 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -294,8 +294,6 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
 	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
-#define __HAVE_SOFT_DIRTY
-
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_SOFT_DIRTY;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index d74bdd2..a2ca78f 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -386,7 +386,7 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
-#ifndef __HAVE_SOFT_DIRTY
+#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 147689e..7deac66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -474,7 +474,7 @@ config FRONTSWAP
 
 config MEM_SOFT_DIRTY
 	bool "Track memory changes"
-	depends on CHECKPOINT_RESTORE && X86
+	depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY
 	select PROC_PAGE_MONITOR
 	help
 	  This option enables memory changes tracking by introducing a

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: 
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S936048Ab3DPT61 (ORCPT ); Tue, 16 Apr 2013 15:58:27 -0400
Received: from mailhub.sw.ru ([195.214.232.25]:13401 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935400Ab3DPT60 (ORCPT ); Tue, 16 Apr 2013 15:58:26 -0400
Message-ID: <516DAD59.2020104@parallels.com>
Date: Tue, 16 Apr 2013 23:58:17 +0400
From: Pavel Emelyanov
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0
MIME-Version: 1.0
To: Andrew Morton
CC: Linux MM, Linux Kernel Mailing List
Subject: Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
References: <51669E5F.4000801@parallels.com> <51669EB8.2020102@parallels.com> <20130411142417.bb58d519b860d06ab84333c2@linux-foundation.org> <5168089B.7060305@parallels.com> <20130415144619.645394d8ecdb180d7757a735@linux-foundation.org>
In-Reply-To: <20130415144619.645394d8ecdb180d7757a735@linux-foundation.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

>>> From that perspective, the dependency on X86 is awful.  What's the
>>> problem here and what do other architectures need to do to be able to
>>> support the feature?
>>
>> The problem here is that I don't know what free bits are available on
>> page table entries on other architectures. I was about to resolve this
>> for ARM very soon, but for the rest of them I need help from other people.
>
> Well, this is also a thing arch maintainers can do when they feel a
> need to support the feature on their architecture.  To support them at
> that time we should provide them with a) adequate information in an
> easy-to-find place (eg, a nice comment at the site of the reference x86
> implementation) and b) a userspace test app.

Item a) is presumably covered by two things: the required arch-specific
PTE manipulations are all collected in asm-generic/pgtable.h under
!CONFIG_HAVE_ARCH_SOFT_DIRTY, and Documentation/vm/soft-dirty.txt is
pointed to by the comment on clear_refs_soft_dirty().

Item b) was recently merged.

Item c) from Stephen is already sent.

Thank you for your time and help,
Pavel

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1763875Ab3ECW63 (ORCPT ); Fri, 3 May 2013 18:58:29 -0400
Received: from e8.ny.us.ibm.com ([32.97.182.138]:54673 "EHLO e8.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1763861Ab3ECW61 (ORCPT ); Fri, 3 May 2013 18:58:27 -0400
Date: Thu, 2 May 2013 10:08:57 -0700
From: Matt Helsley
To: Pavel Emelyanov
Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List
Subject: Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file
Message-ID: <20130502170857.GB24627@us.ibm.com>
References: <51669E5F.4000801@parallels.com> <51669EA5.20209@parallels.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <51669EA5.20209@parallels.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-MML: No
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 13050322-9360-0000-0000-00001205EAC7
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Apr 11, 2013 at 03:29:41PM +0400, Pavel Emelyanov wrote:
> This file is the same as the pagemap one, but shows entries with bits
> 55-60 being zero (reserved for future use). Next patch will occupy one
> of them.

This approach doesn't scale as well as it could. As best I can see
CRIU would do:

for each vma in /proc//smaps
	for each page in /proc//pagemap2
		if soft dirty bit
			copy page

(possibly with pfn checks to avoid copying the same page mapped in
multiple locations..)
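
The inner loop of that scan can be sketched in user space as follows. This is only an illustration of the cost being discussed, not CRIU's code: collect_soft_dirty() is a hypothetical helper that works on pagemap2 entries already read into memory, and the bit positions (63 for present, 55 for soft-dirty) are the ones used by the selftest posted earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PME_PRESENT	(1ULL << 63)	/* as in the soft-dirty selftest */
#define PME_SOFT_DIRTY	(1ULL << 55)	/* as in the soft-dirty selftest */

/*
 * Hypothetical helper: walk 'npages' pagemap2 entries and record the
 * indices of those with the soft-dirty bit set. At most 'cap' indices
 * are stored; the return value is the total number of soft-dirty pages.
 *
 * Every entry is visited, so the cost is proportional to the number of
 * mapped pages, not to the number of changed ones.
 */
size_t collect_soft_dirty(const uint64_t *map, size_t npages,
			  size_t *dirty, size_t cap)
{
	size_t i, n = 0;

	for (i = 0; i < npages; i++) {
		if (!(map[i] & PME_SOFT_DIRTY))
			continue;
		if (n < cap)
			dirty[n] = i;
		n++;
	}

	return n;
}
```

A caller would then copy only the pages whose indices were recorded, but it still has to read and test one 64-bit entry per mapped page to find them.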

However, if soft dirty bit changes could be queued up (from say the
fault handler and page table ops that map/unmap pages) and accumulated
in something like an interval tree it could be something like:

for each range of changed pages
	for each page in range
		copy page

IOW something that scales with the number of changed pages rather
than the number of mapped pages.

So I wonder if CRIU would abandon pagemap2 in the future for something
like this.

Cheers,
	-Matt Helsley

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754765Ab3EDJsE (ORCPT ); Sat, 4 May 2013 05:48:04 -0400
Received: from mailhub.sw.ru ([195.214.232.25]:2356 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752264Ab3EDJsB (ORCPT ); Sat, 4 May 2013 05:48:01 -0400
Message-ID: <5184D93C.7000806@parallels.com>
Date: Sat, 04 May 2013 13:47:40 +0400
From: Pavel Emelyanov
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0
MIME-Version: 1.0
To: Matt Helsley
CC: Andrew Morton, Linux MM, Linux Kernel Mailing List
Subject: Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file
References: <51669E5F.4000801@parallels.com>
 <51669EA5.20209@parallels.com>
 <20130502170857.GB24627@us.ibm.com>
In-Reply-To: <20130502170857.GB24627@us.ibm.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/02/2013 09:08 PM, Matt Helsley wrote:
> On Thu, Apr 11, 2013 at 03:29:41PM +0400, Pavel Emelyanov wrote:
>> This file is the same as the pagemap one, but shows entries with bits
>> 55-60 being zero (reserved for future use). Next patch will occupy one
>> of them.
>
> This approach doesn't scale as well as it could.
> As best I can see
> CRIU would do:
>
> for each vma in /proc//smaps
> 	for each page in /proc//pagemap2
> 		if soft dirty bit
> 			copy page
>
> (possibly with pfn checks to avoid copying the same page mapped in
> multiple locations..)

Comparing pfns obtained from two subsequent pagemap reads doesn't help at
all. If they are equal, this can mean either that the page is shared or
(less likely, but still) that the page that used to be mapped at the 1st
pagemap was reclaimed and mapped to the 2nd between the two reads. If they
differ, it can again mean either not shared (most likely) or shared (the
pfns were equal, but the page got reclaimed and swapped back in).

A better API for page sharing would be nice; such an API could probably
also be re-used for user-space KSM :)

> However, if soft dirty bit changes could be queued up (from say the
> fault handler and page table ops that map/unmap pages) and accumulated
> in something like an interval tree it could be something like:
>
> for each range of changed pages
> 	for each page in range
> 		copy page
>
> IOW something that scales with the number of changed pages rather
> than the number of mapped pages.
>
> So I wonder if CRIU would abandon pagemap2 in the future for something
> like this.

We'd surely adopt such an API if one exists. One thing to note is that
we'd also appreciate it if this API were able to batch the "present" bits
as well as the "swapped" and "page-file" ones. We use these three in CRIU
as well, and scanning these bits can also be optimized.

> Cheers,
> 	-Matt Helsley
>

Thanks,
Pavel
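
To make the shape of the range-based iteration and the per-page bits
discussed in this exchange concrete, here is a small userspace sketch. It
is an assumption-laden illustration, not a proposed kernel interface: it
still inspects every entry, so it does not achieve the scaling asked for
and only shows what a list of changed ranges would look like to the
consumer. The bit positions for "present" (63), "swapped" (62) and
"page-file" (61) are assumed from the layout documented for the existing
pagemap file; bit 55 is the soft-dirty bit from the cover letter. The
names PageFlags, decode() and dirty_ranges() are hypothetical.

```python
from collections import namedtuple

PageFlags = namedtuple("PageFlags", "present swapped file_page soft_dirty")


def decode(entry):
    """Split one 64-bit entry into the flags discussed in the thread."""
    return PageFlags(
        present=bool(entry >> 63 & 1),     # assumed bit 63
        swapped=bool(entry >> 62 & 1),     # assumed bit 62
        file_page=bool(entry >> 61 & 1),   # assumed bit 61
        soft_dirty=bool(entry >> 55 & 1),  # bit 55 per the cover letter
    )


def dirty_ranges(entries, start, page_size=4096):
    """Coalesce consecutive soft-dirty pages into [begin, end) ranges.

    entries -- iterable of integer entries for consecutive pages
    start   -- virtual address of the first page
    """
    ranges = []
    for index, entry in enumerate(entries):
        if not decode(entry).soft_dirty:
            continue
        addr = start + index * page_size
        if ranges and ranges[-1][1] == addr:
            # Page is adjacent to the previous range: extend it.
            ranges[-1] = (ranges[-1][0], addr + page_size)
        else:
            ranges.append((addr, addr + page_size))
    return ranges
```

The same coalescing could be applied to any of the other flags by changing
the field tested in dirty_ranges().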