From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f179.google.com (mail-pd0-f179.google.com [209.85.192.179]) by kanga.kvack.org (Postfix) with ESMTP id 677A06B0072 for ; Tue, 19 Nov 2013 15:06:40 -0500 (EST) Received: by mail-pd0-f179.google.com with SMTP id r10so5657333pdi.10 for ; Tue, 19 Nov 2013 12:06:40 -0800 (PST) Received: from psmtp.com ([74.125.245.129]) by mx.google.com with SMTP id oy2si12365185pbc.219.2013.11.19.12.06.37 for ; Tue, 19 Nov 2013 12:06:38 -0800 (PST) From: Thomas Hellstrom Subject: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces Date: Tue, 19 Nov 2013 12:06:13 -0800 Message-Id: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: linux-graphics-maintainer@vmware.com Hi! Before going any further with this I'd like to check whether this is an acceptable way to go. Background: GPU buffer objects in general and vmware svga GPU buffers in particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes the address space is backed by a set of pages, sometimes it's backed by PCI memory. In the latter case in particular, there is no way to track dirty regions using page_mkwrite() and page_mkclean(), other than allocating a bounce buffer and perform dirty tracking on it, and then copy data to the real GPU buffer. This comes with a big memory- and performance overhead. So I'd like to add the following infrastructure with a callback pfn_mkwrite() and a function mkclean_mapping_range(). Typically we will be cleaning a range of ptes rather than random ptes in a vma. This comes with the extra benefit of being usable when the backing memory of the GPU buffer is not coherent with the GPU itself, and where we either need to flush caches or move data to synchronize. So this is a RFC for 1) The API. 
Is it acceptable? Any other suggestions if not? 2) Modifying apply_to_page_range(). Better to make a standalone non-populating version? 3) tlb- mmu- and cache-flushing calls. I've looked at unmap_mapping_range() and page_mkclean_one() to try to get it right, but still unsure. Thanks, Thomas Hellström -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f175.google.com (mail-pd0-f175.google.com [209.85.192.175]) by kanga.kvack.org (Postfix) with ESMTP id EC2B26B0075 for ; Tue, 19 Nov 2013 15:06:40 -0500 (EST) Received: by mail-pd0-f175.google.com with SMTP id w10so6371765pde.6 for ; Tue, 19 Nov 2013 12:06:40 -0800 (PST) Received: from psmtp.com ([74.125.245.192]) by mx.google.com with SMTP id iy4si12388367pbb.90.2013.11.19.12.06.38 for ; Tue, 19 Nov 2013 12:06:39 -0800 (PST) From: Thomas Hellstrom Subject: [PATCH RFC 1/3] mm: Add pfn_mkwrite() Date: Tue, 19 Nov 2013 12:06:14 -0800 Message-Id: <1384891576-7851-2-git-send-email-thellstrom@vmware.com> In-Reply-To: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: linux-graphics-maintainer@vmware.com, Thomas Hellstrom A callback similar to page_mkwrite except it will be called before making ptes that don't point to normal pages writable. 
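The callback contract described above can be modelled in plain userspace C. This is only a toy sketch under stated assumptions, not the kernel code: the `toy_*` names, the result codes and the single-word "pte" are all hypothetical stand-ins (the real kernel uses VM_FAULT_* bit flags and drops the page table lock around the call). It shows the three outcomes: success makes the entry writable, an error is propagated, and a changed entry causes a retry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; the real kernel uses VM_FAULT_* bit flags. */
enum toy_fault { TOY_OK = 0, TOY_RETRY = 1, TOY_SIGBUS = 2 };

#define TOY_WRITE 0x2u /* hypothetical "writable" bit of a toy pte */

/* Hypothetical notification hook, standing in for a pfn_mkwrite() callback. */
typedef enum toy_fault (*toy_mkwrite_fn)(unsigned long addr, void *priv);

/*
 * Toy write-fault path: notify the owner, then make the entry writable.
 * If the callback fails, propagate its result. If the entry changed while
 * the callback ran, ask the caller to retry the fault.
 */
static enum toy_fault toy_write_fault(unsigned int *pte, unsigned long addr,
                                      toy_mkwrite_fn mkwrite, void *priv)
{
    unsigned int orig;

    assert(pte != NULL);
    orig = *pte; /* snapshot taken before calling out */

    if (mkwrite) {
        enum toy_fault ret = mkwrite(addr, priv);

        if (ret != TOY_OK)
            return ret;
        if (*pte != orig)
            return TOY_RETRY;
    }

    *pte |= TOY_WRITE;
    return TOY_OK;
}

/* Example callback: remember the last address that was about to be written. */
static enum toy_fault toy_record_dirty(unsigned long addr, void *priv)
{
    *(unsigned long *)priv = addr;
    return TOY_OK;
}

/* Example callback: refuse the write, as a driver with no backing might. */
static enum toy_fault toy_refuse(unsigned long addr, void *priv)
{
    (void)addr;
    (void)priv;
    return TOY_SIGBUS;
}
```

A driver-like user would pass `toy_record_dirty` to learn which address is about to be dirtied, which is the notification the series is after.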
Signed-off-by: Thomas Hellstrom --- include/linux/mm.h | 9 +++++++++ mm/memory.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 58 insertions(+), 3 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 8b6e55e..23d1791 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -212,6 +212,15 @@ struct vm_operations_struct { * writable, if an error is returned it will cause a SIGBUS */ int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* + * Notification that a previously read-only pfn map is about to become + * writable, Returning VM_FAULT_NOPAGE will cause the fault to be + * retried, + * Returning a VM_FAULT_SIGBUS or VM_FAULT_OOM will propagate the + * error. Returning 0 will make the pfn map writable. + */ + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* called by access_process_vm when get_user_pages() fails, typically * for use by special VMAs that can switch between memory and hardware */ diff --git a/mm/memory.c b/mm/memory.c index d176154..8ae9a6e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2584,6 +2584,45 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo copy_user_highpage(dst, src, va, vma); } +static int prepare_call_pfn_mkwrite(struct vm_area_struct *vma, + unsigned long address, + pte_t *pte, pmd_t *pmd, + spinlock_t *ptl, pte_t orig_pte) +{ + int ret = 0; + struct vm_fault vmf; + struct mm_struct *mm = vma->vm_mm; + + if (!vma->vm_ops || !vma->vm_ops->pfn_mkwrite) + return 0; + + /* + * In general, we can't say anything about the mapping offset + * here, so set it to 0. 
+ */ + vmf.pgoff = 0; + vmf.virtual_address = (void __user *)(address & PAGE_MASK); + vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE; + vmf.page = NULL; + pte_unmap_unlock(pte, ptl); + ret = vma->vm_ops->pfn_mkwrite(vma, &vmf); + if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) + return ret; + + pte = pte_offset_map_lock(mm, pmd, address, &ptl); + + /* + * Retry the fault if someone updated the pte while we + * dropped the lock. + */ + if (!pte_same(*pte, orig_pte)) { + pte_unmap_unlock(pte, ptl); + return VM_FAULT_NOPAGE; + } + + return 0; +} + /* * This routine handles present pages, when users try to write * to a shared page. It is done by copying the page to a new address @@ -2621,12 +2660,19 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, * VM_MIXEDMAP !pfn_valid() case * * We should not cow pages in a shared writeable mapping. - * Just mark the pages writable as we can't do any dirty - * accounting on raw pfn maps. + * Optionally call pfn_mkwrite to notify the address + * space that the pte is about to become writeable. */ if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED)) + (VM_WRITE|VM_SHARED)) { + ret = prepare_call_pfn_mkwrite(vma, address, + page_table, pmd, ptl, + orig_pte); + if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)) + return ret; + goto reuse; + } goto gotten; } -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f173.google.com (mail-pd0-f173.google.com [209.85.192.173]) by kanga.kvack.org (Postfix) with ESMTP id 8C0386B0075 for ; Tue, 19 Nov 2013 15:06:42 -0500 (EST) Received: by mail-pd0-f173.google.com with SMTP id p10so930378pdj.4 for ; Tue, 19 Nov 2013 12:06:42 -0800 (PST) Received: from psmtp.com ([74.125.245.130]) by mx.google.com with SMTP id m9si12366340pba.233.2013.11.19.12.06.40 for ; Tue, 19 Nov 2013 12:06:41 -0800 (PST) From: Thomas Hellstrom Subject: [PATCH RFC 2/3] mm: Add a non-populating version of apply_to_page_range() Date: Tue, 19 Nov 2013 12:06:15 -0800 Message-Id: <1384891576-7851-3-git-send-email-thellstrom@vmware.com> In-Reply-To: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: linux-graphics-maintainer@vmware.com, Thomas Hellstrom For some tasks, like cleaning ptes it's desirable to operate on only populated ptes. This avoids the overhead of page table memory allocation and also avoids memory allocation errors. Adds apply_to_pt_range() which, in addition to apply_to_page_range(), optionally skips the populating step. Share code with apply_to_page_range(). 
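The difference between a populating and a non-populating walk can be illustrated with a toy two-level table in userspace C. This is a sketch under stated assumptions, not the patch itself: `struct toy_table`, `toy_apply_range()` and the table sizes are hypothetical, and there is no locking. With `fill` set, missing leaf tables are allocated; with `fill` clear, holes are skipped, so no allocation happens and no allocation error is possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy geometry: NUM_DIRS directory slots of ENTRIES_PER_LEAF entries. */
#define ENTRIES_PER_LEAF 4
#define NUM_DIRS 4

typedef int (*entry_fn_t)(unsigned long *entry, unsigned long idx, void *data);

struct toy_table {
    unsigned long *leaf[NUM_DIRS]; /* NULL means "not populated" */
};

/*
 * Call fn on every entry in [start, start + count).
 * fill == true:  allocate missing leaf tables (populating walk).
 * fill == false: skip missing leaf tables (non-populating walk).
 */
static int toy_apply_range(struct toy_table *t, unsigned long start,
                           unsigned long count, entry_fn_t fn, void *data,
                           bool fill)
{
    unsigned long end = start + count;
    unsigned long idx = start;

    assert(start < end && end <= NUM_DIRS * ENTRIES_PER_LEAF);

    while (idx < end) {
        unsigned long dir = idx / ENTRIES_PER_LEAF;
        unsigned long next = (dir + 1) * ENTRIES_PER_LEAF;

        if (next > end)
            next = end;

        if (!t->leaf[dir]) {
            if (!fill) {
                idx = next; /* hole: nothing to visit here */
                continue;
            }
            t->leaf[dir] = calloc(ENTRIES_PER_LEAF, sizeof(unsigned long));
            if (!t->leaf[dir])
                return -1; /* allocation failure, populating walk only */
        }

        for (; idx < next; idx++) {
            int err = fn(&t->leaf[dir][idx % ENTRIES_PER_LEAF], idx, data);

            if (err)
                return err;
        }
    }
    return 0;
}

/* Example visitor: count how many entries were visited. */
static int count_visit(unsigned long *entry, unsigned long idx, void *data)
{
    (void)entry;
    (void)idx;
    ++*(unsigned long *)data;
    return 0;
}
```

Sharing one walker and selecting the behaviour with a flag mirrors the approach taken in the patch; a standalone non-populating walker would be the alternative raised in the cover letter.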
Signed-off-by: Thomas Hellstrom --- mm/memory.c | 73 +++++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 48 insertions(+), 25 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 8ae9a6e..79178c2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2433,19 +2433,22 @@ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long EXPORT_SYMBOL(vm_iomap_memory); static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, - unsigned long addr, unsigned long end, - pte_fn_t fn, void *data) + unsigned long addr, unsigned long end, + pte_fn_t fn, void *data, bool fill) { pte_t *pte; int err; pgtable_t token; spinlock_t *uninitialized_var(ptl); - pte = (mm == &init_mm) ? - pte_alloc_kernel(pmd, addr) : - pte_alloc_map_lock(mm, pmd, addr, &ptl); - if (!pte) - return -ENOMEM; + if (fill) { + pte = (mm == &init_mm) ? + pte_alloc_kernel(pmd, addr) : + pte_alloc_map_lock(mm, pmd, addr, &ptl); + if (!pte) + return -ENOMEM; + } else + pte = pte_offset_map_lock(mm, pmd, addr, &ptl); BUG_ON(pmd_huge(*pmd)); @@ -2461,27 +2464,32 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, arch_leave_lazy_mmu_mode(); - if (mm != &init_mm) + if (!fill || mm != &init_mm) pte_unmap_unlock(pte-1, ptl); return err; } static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud, - unsigned long addr, unsigned long end, - pte_fn_t fn, void *data) + unsigned long addr, unsigned long end, + pte_fn_t fn, void *data, bool fill) { pmd_t *pmd; unsigned long next; - int err; + int err = 0; BUG_ON(pud_huge(*pud)); - pmd = pmd_alloc(mm, pud, addr); - if (!pmd) - return -ENOMEM; + if (fill) { + pmd = pmd_alloc(mm, pud, addr); + if (!pmd) + return -ENOMEM; + } else + pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); - err = apply_to_pte_range(mm, pmd, addr, next, fn, data); + if (!fill && pmd_none_or_clear_bad(pmd)) + continue; + err = apply_to_pte_range(mm, pmd, addr, next, fn, data, fill); if (err) break; } while (pmd++, addr = 
next, addr != end); @@ -2489,19 +2497,24 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud, } static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd, - unsigned long addr, unsigned long end, - pte_fn_t fn, void *data) + unsigned long addr, unsigned long end, + pte_fn_t fn, void *data, bool fill) { pud_t *pud; unsigned long next; - int err; + int err = 0; - pud = pud_alloc(mm, pgd, addr); - if (!pud) - return -ENOMEM; + if (fill) { + pud = pud_alloc(mm, pgd, addr); + if (!pud) + return -ENOMEM; + } else + pud = pud_offset(pgd, addr); do { next = pud_addr_end(addr, end); - err = apply_to_pmd_range(mm, pud, addr, next, fn, data); + if (!fill && pud_none_or_clear_bad(pud)) + continue; + err = apply_to_pmd_range(mm, pud, addr, next, fn, data, fill); if (err) break; } while (pud++, addr = next, addr != end); @@ -2512,25 +2525,35 @@ static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd, * Scan a region of virtual memory, filling in page tables as necessary * and calling a provided function on each leaf page table. 
*/ -int apply_to_page_range(struct mm_struct *mm, unsigned long addr, - unsigned long size, pte_fn_t fn, void *data) +static int apply_to_pt_range(struct mm_struct *mm, unsigned long addr, + unsigned long size, pte_fn_t fn, void *data, + bool fill) { pgd_t *pgd; unsigned long next; unsigned long end = addr + size; int err; + BUG_ON(!fill && mm == &init_mm); BUG_ON(addr >= end); + pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); - err = apply_to_pud_range(mm, pgd, addr, next, fn, data); + err = apply_to_pud_range(mm, pgd, addr, next, fn, data, + fill); if (err) break; } while (pgd++, addr = next, addr != end); return err; } + +int apply_to_page_range(struct mm_struct *mm, unsigned long addr, + unsigned long size, pte_fn_t fn, void *data) +{ + return apply_to_pt_range(mm, addr, size, fn, data, true); +} EXPORT_SYMBOL_GPL(apply_to_page_range); /* -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f181.google.com (mail-pd0-f181.google.com [209.85.192.181]) by kanga.kvack.org (Postfix) with ESMTP id 65C056B0078 for ; Tue, 19 Nov 2013 15:06:49 -0500 (EST) Received: by mail-pd0-f181.google.com with SMTP id p10so936525pdj.12 for ; Tue, 19 Nov 2013 12:06:49 -0800 (PST) Received: from psmtp.com ([74.125.245.108]) by mx.google.com with SMTP id rz1si830158pab.101.2013.11.19.12.06.47 for ; Tue, 19 Nov 2013 12:06:48 -0800 (PST) From: Thomas Hellstrom Subject: [PATCH RFC 3/3] mm: Add mkclean_mapping_range() Date: Tue, 19 Nov 2013 12:06:16 -0800 Message-Id: <1384891576-7851-4-git-send-email-thellstrom@vmware.com> In-Reply-To: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: linux-graphics-maintainer@vmware.com, Thomas Hellstrom A general function to clean (Mark non-writeable and non-dirty) all ptes pointing to a certain range in an address space. Although it is primarily intended for PFNMAP and MIXEDMAP vmas, AFAICT it should work on address spaces backed by normal pages as well. It will not clean COW'd pages and it will not work with nonlinear VMAs. 
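What "cleaning" a range means can be sketched in userspace C on a flat array of toy entries. This is an illustration under stated assumptions only: the `TOY_*` bit values and `toy_mkclean_range()` are hypothetical, and the real function additionally has to walk page tables, take locks and flush TLBs. Entries that are not present are skipped, entries standing in for COW'ed anonymous pages are left alone, and everything else loses both its write and dirty bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for a toy pte. */
#define TOY_PRESENT 0x1u
#define TOY_WRITE   0x2u
#define TOY_DIRTY   0x4u
#define TOY_ANON    0x8u /* stands in for a COW'ed anonymous page */

/*
 * Mark entries [first, first + len) non-writable and non-dirty.
 * Returns the number of entries that were actually changed.
 */
static size_t toy_mkclean_range(unsigned int *ptes, size_t first, size_t len)
{
    size_t cleaned = 0;

    assert(ptes != NULL);

    for (size_t i = first; i < first + len; i++) {
        unsigned int pte = ptes[i];

        if (!(pte & TOY_PRESENT))
            continue; /* nothing mapped here */
        if (!(pte & (TOY_WRITE | TOY_DIRTY)))
            continue; /* already clean and write-protected */
        if (pte & TOY_ANON)
            continue; /* don't clean COW'ed pages */

        ptes[i] = pte & ~(TOY_WRITE | TOY_DIRTY);
        cleaned++;
    }
    return cleaned;
}
```

After such a pass, the next write to a cleaned entry faults again, which is what lets a write notification fire once more for that page.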
Signed-off-by: Thomas Hellstrom --- include/linux/mm.h | 3 ++ mm/memory.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 111 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 23d1791..e6bf5b3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -982,6 +982,9 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); +void mkclean_mapping_range(struct address_space *mapping, + pgoff_t pg_clean_begin, + pgoff_t pg_len); int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn); int follow_phys(struct vm_area_struct *vma, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index 79178c2..f7a48f5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4395,3 +4395,111 @@ void copy_user_huge_page(struct page *dst, struct page *src, } } #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */ + +struct mkclean_data { + struct mmu_gather tlb; + struct vm_area_struct *vma; +}; + +static int mkclean_mapping_pte(pte_t *pte, pgtable_t token, unsigned long addr, + void *data) +{ + struct mkclean_data *md = data; + struct mm_struct *mm = md->vma->vm_mm; + + pte_t ptent = *pte; + + if (pte_none(ptent) || !pte_present(ptent)) + return 0; + + if (pte_dirty(ptent) || pte_write(ptent)) { + struct page *page = vm_normal_page(md->vma, addr, ptent); + + /* + * Don't clean COW'ed pages + */ + if (page && PageAnon(page)) + return 0; + + tlb_remove_tlb_entry((&md->tlb), pte, addr); + ptent = pte_wrprotect(ptent); + ptent = pte_mkclean(ptent); + set_pte_at(mm, addr, pte, ptent); + } + + return 0; +} + +static void mkclean_mapping_range_tree(struct rb_root *root, + pgoff_t first, + pgoff_t last) +{ + struct vm_area_struct *vma; + + vma_interval_tree_foreach(vma, root, first, last) { + struct mkclean_data md; + pgoff_t vba, vea, zba, zea; + struct 
mm_struct *mm; + unsigned long addr, end; + + BUG_ON(vma->vm_flags & VM_NONLINEAR); + + if (!(vma->vm_flags & VM_SHARED)) + continue; + + mm = vma->vm_mm; + vba = vma->vm_pgoff; + vea = vba + vma_pages(vma) - 1; + zba = (first < vba) ? vba : first; + zea = (last > vea) ? vea : last; + + addr = ((zba - vba) << PAGE_SHIFT) + vma->vm_start; + end = ((zea - vba + 1) << PAGE_SHIFT) + vma->vm_start; + + tlb_gather_mmu(&md.tlb, mm, addr, end); + md.vma = vma; + + mmu_notifier_invalidate_range_start(mm, addr, end); + tlb_start_vma(&md.tlb, vma); + + (void) apply_to_pt_range(mm, addr, end - addr, + mkclean_mapping_pte, + &md, false); + + tlb_end_vma(&md.tlb, vma); + mmu_notifier_invalidate_range_end(mm, addr, end); + + tlb_finish_mmu(&md.tlb, addr, end); + } +} + +/* + * mkclean_mapping_range - Clean all PTEs pointing to a given range of an + * address space. + * + * @mapping: Pointer to the address space + * @pg_clean_begin: Page offset into the address space where cleaning should + * start + * @pg_len: Length of the range to be cleaned + * + * This function walks all vmas pointing to a given range of an address space, + * marking PTEs clean, unless they are COW'ed. This implies that we only + * touch VMAs with the flag VM_SHARED set. This interface also doesn't + * support VM_NONLINEAR vmas since there is no general way for us to + * make sure a pte is actually pointing into the given address space range + * for such VMAs. + */ +void mkclean_mapping_range(struct address_space *mapping, + pgoff_t pg_clean_begin, + pgoff_t pg_len) +{ + pgoff_t last = pg_clean_begin + pg_len - 1UL; + + mutex_lock(&mapping->i_mmap_mutex); + WARN_ON(!list_empty(&mapping->i_mmap_nonlinear)); + if (!RB_EMPTY_ROOT(&mapping->i_mmap)) + mkclean_mapping_range_tree(&mapping->i_mmap, pg_clean_begin, + last); + mutex_unlock(&mapping->i_mmap_mutex); +} +EXPORT_SYMBOL_GPL(mkclean_mapping_range); -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f42.google.com (mail-pa0-f42.google.com [209.85.220.42]) by kanga.kvack.org (Postfix) with ESMTP id AC7166B0031 for ; Tue, 19 Nov 2013 17:51:18 -0500 (EST) Received: by mail-pa0-f42.google.com with SMTP id lj1so3250856pab.29 for ; Tue, 19 Nov 2013 14:51:18 -0800 (PST) Received: from psmtp.com ([74.125.245.204]) by mx.google.com with SMTP id rw4si12634532pac.178.2013.11.19.14.51.16 for ; Tue, 19 Nov 2013 14:51:17 -0800 (PST) Received: by mail-pa0-f43.google.com with SMTP id bj1so927423pad.2 for ; Tue, 19 Nov 2013 14:51:15 -0800 (PST) Message-ID: <528BEB60.7040402@amacapital.net> Date: Tue, 19 Nov 2013 14:51:12 -0800 From: Andy Lutomirski MIME-Version: 1.0 Subject: Re: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> In-Reply-To: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Thomas Hellstrom , linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: linux-graphics-maintainer@vmware.com On 11/19/2013 12:06 PM, Thomas Hellstrom wrote: > Hi! > > Before going any further with this I'd like to check whether this is an > acceptable way to go. > Background: > GPU buffer objects in general and vmware svga GPU buffers in > particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes the > address space is backed by a set of pages, sometimes it's backed by PCI memory. > In the latter case in particular, there is no way to track dirty regions > using page_mkwrite() and page_mkclean(), other than allocating a bounce > buffer and perform dirty tracking on it, and then copy data to the real GPU > buffer. This comes with a big memory- and performance overhead. 
> > So I'd like to add the following infrastructure with a callback pfn_mkwrite() > and a function mkclean_mapping_range(). Typically we will be cleaning a range > of ptes rather than random ptes in a vma. > This comes with the extra benefit of being usable when the backing memory of > the GPU buffer is not coherent with the GPU itself, and where we either need > to flush caches or move data to synchronize. > > So this is a RFC for > 1) The API. Is it acceptable? Any other suggestions if not? > 2) Modifying apply_to_page_range(). Better to make a standalone > non-populating version? > 3) tlb- mmu- and cache-flushing calls. I've looked at unmap_mapping_range() > and page_mkclean_one() to try to get it right, but still unsure. Most (all?) architectures have real dirty tracking -- you can mark a pte as "clean" and the hardware (or arch code) will mark it dirty when written, *without* a page fault. I'm not convinced that it works completely correctly right now (I suspect that there are some TLB flushing issues on the dirty->clean transition), and it's likely prone to bit-rot, since the page cache doesn't rely on it. That being said, using hardware dirty tracking should be *much* faster and less latency-inducing than doing it in software like this. It may be worth trying to get HW dirty tracking working before adding more page fault-based tracking. (I think there's also some oddity on S/390. I don't know what that oddity is or whether you should care.) --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f42.google.com (mail-pa0-f42.google.com [209.85.220.42]) by kanga.kvack.org (Postfix) with ESMTP id 6CC006B0031 for ; Wed, 20 Nov 2013 03:12:17 -0500 (EST) Received: by mail-pa0-f42.google.com with SMTP id lj1so3826334pab.15 for ; Wed, 20 Nov 2013 00:12:17 -0800 (PST) Received: from psmtp.com ([74.125.245.121]) by mx.google.com with SMTP id yg5si13598302pbc.296.2013.11.20.00.12.14 for ; Wed, 20 Nov 2013 00:12:16 -0800 (PST) Message-ID: <528C6ED9.3070600@vmware.com> Date: Wed, 20 Nov 2013 09:12:09 +0100 From: Thomas Hellstrom MIME-Version: 1.0 Subject: Re: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> <528BEB60.7040402@amacapital.net> In-Reply-To: <528BEB60.7040402@amacapital.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-graphics-maintainer@vmware.com On 11/19/2013 11:51 PM, Andy Lutomirski wrote: > On 11/19/2013 12:06 PM, Thomas Hellstrom wrote: >> Hi! >> >> Before going any further with this I'd like to check whether this is an >> acceptable way to go. >> Background: >> GPU buffer objects in general and vmware svga GPU buffers in >> particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes the >> address space is backed by a set of pages, sometimes it's backed by PCI memory. >> In the latter case in particular, there is no way to track dirty regions >> using page_mkwrite() and page_mkclean(), other than allocating a bounce >> buffer and perform dirty tracking on it, and then copy data to the real GPU >> buffer. This comes with a big memory- and performance overhead. 
>> >> So I'd like to add the following infrastructure with a callback pfn_mkwrite() >> and a function mkclean_mapping_range(). Typically we will be cleaning a range >> of ptes rather than random ptes in a vma. >> This comes with the extra benefit of being usable when the backing memory of >> the GPU buffer is not coherent with the GPU itself, and where we either need >> to flush caches or move data to synchronize. >> >> So this is a RFC for >> 1) The API. Is it acceptable? Any other suggestions if not? >> 2) Modifying apply_to_page_range(). Better to make a standalone >> non-populating version? >> 3) tlb- mmu- and cache-flushing calls. I've looked at unmap_mapping_range() >> and page_mkclean_one() to try to get it right, but still unsure. > Most (all?) architectures have real dirty tracking -- you can mark a pte > as "clean" and the hardware (or arch code) will mark it dirty when > written, *without* a page fault. > > I'm not convinced that it works completely correctly right now (I > suspect that there are some TLB flushing issues on the dirty->clean > transition), and it's likely prone to bit-rot, since the page cache > doesn't rely on it. > > That being said, using hardware dirty tracking should be *much* faster > and less latency-inducing than doing it in software like this. It may > be worth trying to get HW dirty tracking working before adding more page > fault-based tracking. > > (I think there's also some oddity on S/390. I don't know what that > oddity is or whether you should care.) > > --Andy Andy, Thanks for the tip. It indeed sounds interesting, however there are a couple of culprits: 1) As you say, it sounds like there might be TLB flushing issues. Let's say the TLB detects a write and raises an IRQ for the arch code to set the PTE dirty bit, and before servicing that interrupt, we clear the PTE and flush that TLB. What will happen? And if the TLB hardware would write directly to the in-memory PTE I guess we'd have the same synchronization issues. 
I guess we'd then need an atomic read-modify-write against the TLB hardware? 2) Even if most hardware is capable of this stuff, I'm not sure what would happen in a virtual machine. Need to check. 3) For dirty contents that need to appear on a screen within a short interval, we need the write notification anyway, to start a delayed task that will gather the dirty data and flush it to the screen... Thanks, /Thomas From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f46.google.com (mail-pa0-f46.google.com [209.85.220.46]) by kanga.kvack.org (Postfix) with ESMTP id F036E6B0035 for ; Wed, 20 Nov 2013 11:51:01 -0500 (EST) Received: by mail-pa0-f46.google.com with SMTP id kl14so5225885pab.19 for ; Wed, 20 Nov 2013 08:51:01 -0800 (PST) Received: from psmtp.com ([74.125.245.127]) by mx.google.com with SMTP id m9si14631512pba.293.2013.11.20.08.50.59 for ; Wed, 20 Nov 2013 08:51:00 -0800 (PST) Received: by mail-ve0-f176.google.com with SMTP id oz11so1734735veb.35 for ; Wed, 20 Nov 2013 08:50:58 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <528C6ED9.3070600@vmware.com> References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> <528BEB60.7040402@amacapital.net> <528C6ED9.3070600@vmware.com> From: Andy Lutomirski Date: Wed, 20 Nov 2013 08:50:37 -0800 Message-ID: Subject: Re: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Thomas Hellstrom Cc: "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , linux-graphics-maintainer@vmware.com On Wed, Nov 20, 2013 at 12:12 AM, Thomas Hellstrom wrote: > On 11/19/2013 11:51 PM, Andy Lutomirski wrote: >> >> On 11/19/2013 12:06 PM, Thomas Hellstrom wrote: >>> >>> Hi! 
>>> >>> Before going any further with this I'd like to check whether this is an >>> acceptable way to go. >>> Background: >>> GPU buffer objects in general and vmware svga GPU buffers in >>> particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes >>> the >>> address space is backed by a set of pages, sometimes it's backed by PCI >>> memory. >>> In the latter case in particular, there is no way to track dirty regions >>> using page_mkwrite() and page_mkclean(), other than allocating a bounce >>> buffer and perform dirty tracking on it, and then copy data to the real >>> GPU >>> buffer. This comes with a big memory- and performance overhead. >>> >>> So I'd like to add the following infrastructure with a callback >>> pfn_mkwrite() >>> and a function mkclean_mapping_range(). Typically we will be cleaning a >>> range >>> of ptes rather than random ptes in a vma. >>> This comes with the extra benefit of being usable when the backing memory >>> of >>> the GPU buffer is not coherent with the GPU itself, and where we either >>> need >>> to flush caches or move data to synchronize. >>> >>> So this is a RFC for >>> 1) The API. Is it acceptable? Any other suggestions if not? >>> 2) Modifying apply_to_page_range(). Better to make a standalone >>> non-populating version? >>> 3) tlb- mmu- and cache-flushing calls. I've looked at >>> unmap_mapping_range() >>> and page_mkclean_one() to try to get it right, but still unsure. >> >> Most (all?) architectures have real dirty tracking -- you can mark a pte >> as "clean" and the hardware (or arch code) will mark it dirty when >> written, *without* a page fault. >> >> I'm not convinced that it works completely correctly right now (I >> suspect that there are some TLB flushing issues on the dirty->clean >> transition), and it's likely prone to bit-rot, since the page cache >> doesn't rely on it. 
>> >> That being said, using hardware dirty tracking should be *much* faster >> and less latency-inducing than doing it in software like this. It may >> be worth trying to get HW dirty tracking working before adding more page >> fault-based tracking. >> >> (I think there's also some oddity on S/390. I don't know what that >> oddity is or whether you should care.) >> >> --Andy > > > Andy, > > Thanks for the tip. It indeed sounds interesting, however there are a couple > of culprits: > > 1) As you say, it sounds like there might be TLB flushing issues. Let's say > the TLB detects a write and raises an IRQ for the arch code to set the PTE > dirty bit, and before servicing that interrupt, we clear the PTE and flush > that TLB. What will happen? This should be fine. I assume that all architectures that do this kind of software dirty tracking will make the write block until the fault is handled, so the write won't have happened when you clear the PTE. After the TLB flush, the PTE will become dirty again and then the page will be written. > And if the TLB hardware would write directly to > the in-memory PTE I guess we'd have the same synchronization issues. I guess > we'd then need an atomic read-modify-write against the TLB hardware? IIRC the part that looked fishy to me was the combination of hw dirty tracking and write protecting the page. If you see that the pte is clean and want to write protect it, you probably need to set the write protect bit (atomically so you don't lose a dirty bit), flush the TLB, and then check the dirty bit again. > 2) Even if most hardware is capable of this stuff, I'm not sure what would > happen in a virtual machine. Need to check. This should be fine. Any VM monitor that fails to implement dirty tracking is probably terminally broken. 
> 3) For dirty contents that need to appear on a screen within a short > interval, we need the write notification anyway, to start a delayed task > that will gather the dirty data and flush it to the screen... > So that's what you want to do :) I bet that the best approach is some kind of hybrid. If, on the first page fault per frame, you un-write-protect the entire buffer and then, near the end of the frame, check all the hw dirty bits and re-write-protect the entire buffer, you get the benefit of detecting which pages were written, but you only take one write fault per frame instead of one write fault per page. (I imagine that there are video apps out there that would slow down measurably if they started taking one write fault per page per frame.) --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f181.google.com (mail-pd0-f181.google.com [209.85.192.181]) by kanga.kvack.org (Postfix) with ESMTP id 664196B0031 for ; Wed, 20 Nov 2013 15:16:51 -0500 (EST) Received: by mail-pd0-f181.google.com with SMTP id p10so2473134pdj.12 for ; Wed, 20 Nov 2013 12:16:51 -0800 (PST) Received: from smtp-outbound-1.vmware.com (smtp-outbound-1.vmware.com. 
[208.91.2.12]) by mx.google.com with ESMTPS id n5si15012313pav.98.2013.11.20.12.16.49 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 20 Nov 2013 12:16:50 -0800 (PST) Message-ID: <528D18AB.5020009@vmware.com> Date: Wed, 20 Nov 2013 21:16:43 +0100 From: Thomas Hellstrom MIME-Version: 1.0 Subject: Re: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> <528BEB60.7040402@amacapital.net> <528C6ED9.3070600@vmware.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , linux-graphics-maintainer@vmware.com On 11/20/2013 05:50 PM, Andy Lutomirski wrote: > On Wed, Nov 20, 2013 at 12:12 AM, Thomas Hellstrom > wrote: >> On 11/19/2013 11:51 PM, Andy Lutomirski wrote: >>> On 11/19/2013 12:06 PM, Thomas Hellstrom wrote: >>>> Hi! >>>> >>>> Before going any further with this I'd like to check whether this is an >>>> acceptable way to go. >>>> Background: >>>> GPU buffer objects in general and vmware svga GPU buffers in >>>> particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes >>>> the >>>> address space is backed by a set of pages, sometimes it's backed by PCI >>>> memory. >>>> In the latter case in particular, there is no way to track dirty regions >>>> using page_mkwrite() and page_mkclean(), other than allocating a bounce >>>> buffer and perform dirty tracking on it, and then copy data to the real >>>> GPU >>>> buffer. This comes with a big memory- and performance overhead. >>>> >>>> So I'd like to add the following infrastructure with a callback >>>> pfn_mkwrite() >>>> and a function mkclean_mapping_range(). Typically we will be cleaning a >>>> range >>>> of ptes rather than random ptes in a vma. 
>>>> This comes with the extra benefit of being usable when the backing memory >>>> of >>>> the GPU buffer is not coherent with the GPU itself, and where we either >>>> need >>>> to flush caches or move data to synchronize. >>>> >>>> So this is a RFC for >>>> 1) The API. Is it acceptable? Any other suggestions if not? >>>> 2) Modifying apply_to_page_range(). Better to make a standalone >>>> non-populating version? >>>> 3) tlb- mmu- and cache-flushing calls. I've looked at >>>> unmap_mapping_range() >>>> and page_mkclean_one() to try to get it right, but still unsure. >>> Most (all?) architectures have real dirty tracking -- you can mark a pte >>> as "clean" and the hardware (or arch code) will mark it dirty when >>> written, *without* a page fault. >>> >>> I'm not convinced that it works completely correctly right now (I >>> suspect that there are some TLB flushing issues on the dirty->clean >>> transition), and it's likely prone to bit-rot, since the page cache >>> doesn't rely on it. >>> >>> That being said, using hardware dirty tracking should be *much* faster >>> and less latency-inducing than doing it in software like this. It may >>> be worth trying to get HW dirty tracking working before adding more page >>> fault-based tracking. >>> >>> (I think there's also some oddity on S/390. I don't know what that >>> oddity is or whether you should care.) >>> >>> --Andy >> >> Andy, >> >> Thanks for the tip. It indeed sounds interesting, however there are a couple >> of culprits: >> >> 1) As you say, it sounds like there might be TLB flushing issues. Let's say >> the TLB detects a write and raises an IRQ for the arch code to set the PTE >> dirty bit, and before servicing that interrupt, we clear the PTE and flush >> that TLB. What will happen? > This should be fine. I assume that all architectures that do this > kind of software dirty tracking will make the write block until the > fault is handled, so the write won't have happened when you clear the > PTE. 
After the TLB flush, the PTE will become dirty again and then > the page will be written. > >> And if the TLB hardware would write directly to >> the in-memory PTE I guess we'd have the same synchronization issues. I guess >> we'd then need an atomic read-modify-write against the TLB hardware? > IIRC the part that looked fishy to me was the combination of hw dirty > tracking and write protecting the page. If you see that the pte is > clean and want to write protect it, you probably need to set the write > protect bit (atomically so you don't lose a dirty bit), flush the TLB, > and then check the dirty bit again. > >> 2) Even if most hardware is capable of this stuff, I'm not sure what would >> happen in a virtual machine. Need to check. > This should be fine. Any VM monitor that fails to implement dirty > tracking is probably terminally broken. OK. I'll give it a try. If I understand this correctly, even if I set up a shared RW mapping, the PTEs should magically be marked dirty if written to, and everything works as it should? > >> 3) For dirty contents that need to appear on a screen within a short >> interval, we need the write notification anyway, to start a delayed task >> that will gather the dirty data and flush it to the screen... >> > So that's what you want to do :) Well this is mostly a benefit, actually. We already do this using fb_defio, but without this new interface we need a bounce-buffer covering the whole screen. Luckily this isn't a common use-case. Typically (if we use this) we'd gather dirty data when the buffer is referenced in a GPU command stream. > > I bet that the best approach is some kind of hybrid. If, on the first > page fault per frame, you un-write-protected the entire buffer and > then, near the end of the frame, check all the hw dirty bits and > re-write-protect the entire buffer, you get the benefit detecting > which pages were written, but you only take one write fault per frame > instead of one write fault per page. 
Yes, that sounds sane, particularly as un-write-protecting shouldn't need any additional tlb flushing, AFAICT. > > (I imagine that there are video apps out that there that would slow > down measurably if they started taking one write fault per page per > frame.) I actually hope to be able to avoid this stuff completely, but I need a backup plan, so that's why I threw out this RFC. > > --Andy Thanks, Thomas -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vb0-f51.google.com (mail-vb0-f51.google.com [209.85.212.51]) by kanga.kvack.org (Postfix) with ESMTP id 587E06B0031 for ; Wed, 20 Nov 2013 15:29:49 -0500 (EST) Received: by mail-vb0-f51.google.com with SMTP id m10so1916617vbh.10 for ; Wed, 20 Nov 2013 12:29:49 -0800 (PST) Received: from mail-ve0-f181.google.com (mail-ve0-f181.google.com [209.85.128.181]) by mx.google.com with ESMTPS id mq14si9899030vcb.56.2013.11.20.12.29.47 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 20 Nov 2013 12:29:48 -0800 (PST) Received: by mail-ve0-f181.google.com with SMTP id oy12so3579729veb.40 for ; Wed, 20 Nov 2013 12:29:47 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <528D18AB.5020009@vmware.com> References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> <528BEB60.7040402@amacapital.net> <528C6ED9.3070600@vmware.com> <528D18AB.5020009@vmware.com> From: Andy Lutomirski Date: Wed, 20 Nov 2013 12:29:27 -0800 Message-ID: Subject: Re: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Thomas Hellstrom Cc: "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , linux-graphics-maintainer On Wed, Nov 20, 2013 at 12:16 PM, Thomas Hellstrom wrote: > On 11/20/2013 05:50 PM, Andy Lutomirski 
wrote: >> >> On Wed, Nov 20, 2013 at 12:12 AM, Thomas Hellstrom >> wrote: >>> >>> On 11/19/2013 11:51 PM, Andy Lutomirski wrote: >>>> >>>> On 11/19/2013 12:06 PM, Thomas Hellstrom wrote: >>>>> >>>>> Hi! >>>>> >>>>> Before going any further with this I'd like to check whether this is an >>>>> acceptable way to go. >>>>> Background: >>>>> GPU buffer objects in general and vmware svga GPU buffers in >>>>> particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes >>>>> the >>>>> address space is backed by a set of pages, sometimes it's backed by PCI >>>>> memory. >>>>> In the latter case in particular, there is no way to track dirty >>>>> regions >>>>> using page_mkwrite() and page_mkclean(), other than allocating a bounce >>>>> buffer and perform dirty tracking on it, and then copy data to the real >>>>> GPU >>>>> buffer. This comes with a big memory- and performance overhead. >>>>> >>>>> So I'd like to add the following infrastructure with a callback >>>>> pfn_mkwrite() >>>>> and a function mkclean_mapping_range(). Typically we will be cleaning a >>>>> range >>>>> of ptes rather than random ptes in a vma. >>>>> This comes with the extra benefit of being usable when the backing >>>>> memory >>>>> of >>>>> the GPU buffer is not coherent with the GPU itself, and where we either >>>>> need >>>>> to flush caches or move data to synchronize. >>>>> >>>>> So this is a RFC for >>>>> 1) The API. Is it acceptable? Any other suggestions if not? >>>>> 2) Modifying apply_to_page_range(). Better to make a standalone >>>>> non-populating version? >>>>> 3) tlb- mmu- and cache-flushing calls. I've looked at >>>>> unmap_mapping_range() >>>>> and page_mkclean_one() to try to get it right, but still unsure. >>>> >>>> Most (all?) architectures have real dirty tracking -- you can mark a pte >>>> as "clean" and the hardware (or arch code) will mark it dirty when >>>> written, *without* a page fault. 
>>>> >>>> I'm not convinced that it works completely correctly right now (I >>>> suspect that there are some TLB flushing issues on the dirty->clean >>>> transition), and it's likely prone to bit-rot, since the page cache >>>> doesn't rely on it. >>>> >>>> That being said, using hardware dirty tracking should be *much* faster >>>> and less latency-inducing than doing it in software like this. It may >>>> be worth trying to get HW dirty tracking working before adding more page >>>> fault-based tracking. >>>> >>>> (I think there's also some oddity on S/390. I don't know what that >>>> oddity is or whether you should care.) >>>> >>>> --Andy >>> >>> >>> Andy, >>> >>> Thanks for the tip. It indeed sounds interesting, however there are a >>> couple >>> of culprits: >>> >>> 1) As you say, it sounds like there might be TLB flushing issues. Let's >>> say >>> the TLB detects a write and raises an IRQ for the arch code to set the >>> PTE >>> dirty bit, and before servicing that interrupt, we clear the PTE and >>> flush >>> that TLB. What will happen? >> >> This should be fine. I assume that all architectures that do this >> kind of software dirty tracking will make the write block until the >> fault is handled, so the write won't have happened when you clear the >> PTE. After the TLB flush, the PTE will become dirty again and then >> the page will be written. >> >>> And if the TLB hardware would write directly to >>> the in-memory PTE I guess we'd have the same synchronization issues. I >>> guess >>> we'd then need an atomic read-modify-write against the TLB hardware? >> >> IIRC the part that looked fishy to me was the combination of hw dirty >> tracking and write protecting the page. If you see that the pte is >> clean and want to write protect it, you probably need to set the write >> protect bit (atomically so you don't lose a dirty bit), flush the TLB, >> and then check the dirty bit again. 
>>
>>> 2) Even if most hardware is capable of this stuff, I'm not sure what
>>> would happen in a virtual machine. Need to check.
>>
>> This should be fine. Any VM monitor that fails to implement dirty
>> tracking is probably terminally broken.
>
>
> OK. I'll give it a try. If I understand this correctly, even if I set up a
> shared RW mapping, the PTEs should magically be marked dirty if written to,
> and everything works as it should?
>

I *think* so. (It's certainly worth doing a quick-and-dirty test to
make sure I'm not completely nuts before you invest too much time here
-- I've read the code, and I've looked at the Intel specs, but I've
never actually verified that pte_dirty and pte_mkclean do what I think
they do.)

(If you ever intend to run on S/390, you should ask someone who
understands what's going on. I, personally, have no clue, other than
having seen references to something weird happening.)

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752134Ab3KSUGh (ORCPT ); Tue, 19 Nov 2013 15:06:37 -0500
Received: from smtp-outbound-2.vmware.com ([208.91.2.13]:56114 "EHLO smtp-outbound-2.vmware.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751821Ab3KSUGg (ORCPT ); Tue, 19 Nov 2013 15:06:36 -0500
From: Thomas Hellstrom
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: linux-graphics-maintainer@vmware.com
Subject: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces
Date: Tue, 19 Nov 2013 12:06:13 -0800
Message-Id: <1384891576-7851-1-git-send-email-thellstrom@vmware.com>
X-Mailer: git-send-email 1.7.10.4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi!

Before going any further with this I'd like to check whether this is an
acceptable way to go.

Background:
GPU buffer objects in general and vmware svga GPU buffers in
particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes the
address space is backed by a set of pages, sometimes it's backed by PCI memory.
In the latter case in particular, there is no way to track dirty regions
using page_mkwrite() and page_mkclean(), other than allocating a bounce
buffer, performing dirty tracking on it, and then copying the data to the
real GPU buffer. This comes with a big memory and performance overhead.

So I'd like to add the following infrastructure with a callback pfn_mkwrite()
and a function mkclean_mapping_range(). Typically we will be cleaning a range
of ptes rather than random ptes in a vma.
This comes with the extra benefit of being usable when the backing memory of
the GPU buffer is not coherent with the GPU itself, and where we either need
to flush caches or move data to synchronize.

So this is an RFC for
1) The API. Is it acceptable? Any other suggestions if not?
2) Modifying apply_to_page_range(). Better to make a standalone
non-populating version?
3) tlb-, mmu- and cache-flushing calls. I've looked at unmap_mapping_range()
and page_mkclean_one() to try to get it right, but I'm still unsure.

Thanks,
Thomas Hellström

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752406Ab3KSUGs (ORCPT ); Tue, 19 Nov 2013 15:06:48 -0500
Received: from smtp-outbound-1.vmware.com ([208.91.2.12]:45012 "EHLO smtp-outbound-1.vmware.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751821Ab3KSUGi (ORCPT ); Tue, 19 Nov 2013 15:06:38 -0500
From: Thomas Hellstrom
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: linux-graphics-maintainer@vmware.com, Thomas Hellstrom
Subject: [PATCH RFC 1/3] mm: Add pfn_mkwrite()
Date: Tue, 19 Nov 2013 12:06:14 -0800
Message-Id: <1384891576-7851-2-git-send-email-thellstrom@vmware.com>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1384891576-7851-1-git-send-email-thellstrom@vmware.com>
References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

A callback similar to page_mkwrite, except that it is called before
making ptes that don't point to normal pages writable.

Signed-off-by: Thomas Hellstrom
---
 include/linux/mm.h |    9 +++++++++
 mm/memory.c        |   52 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..23d1791 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -212,6 +212,15 @@ struct vm_operations_struct {
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
+	/*
+	 * Notification that a previously read-only pfn map is about to become
+	 * writable, Returning VM_FAULT_NOPAGE will cause the fault to be
+	 * retried,
+	 * Returning a VM_FAULT_SIGBUS or VM_FAULT_OOM will propagate the
+	 * error. Returning 0 will make the pfn map writable.
+	 */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index d176154..8ae9a6e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2584,6 +2584,45 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
 		copy_user_highpage(dst, src, va, vma);
 }
 
+static int prepare_call_pfn_mkwrite(struct vm_area_struct *vma,
+				    unsigned long address,
+				    pte_t *pte, pmd_t *pmd,
+				    spinlock_t *ptl, pte_t orig_pte)
+{
+	int ret = 0;
+	struct vm_fault vmf;
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (!vma->vm_ops || !vma->vm_ops->pfn_mkwrite)
+		return 0;
+
+	/*
+	 * In general, we can't say anything about the mapping offset
+	 * here, so set it to 0.
+	 */
+	vmf.pgoff = 0;
+	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+	vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+	vmf.page = NULL;
+	pte_unmap_unlock(pte, ptl);
+	ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
+		return ret;
+
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+	/*
+	 * Retry the fault if someone updated the pte while we
+	 * dropped the lock.
+	 */
+	if (!pte_same(*pte, orig_pte)) {
+		pte_unmap_unlock(pte, ptl);
+		return VM_FAULT_NOPAGE;
+	}
+
+	return 0;
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2621,12 +2660,19 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * VM_MIXEDMAP !pfn_valid() case
 		 *
 		 * We should not cow pages in a shared writeable mapping.
-		 * Just mark the pages writable as we can't do any dirty
-		 * accounting on raw pfn maps.
+		 * Optionally call pfn_mkwrite to notify the address
+		 * space that the pte is about to become writeable.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-				     (VM_WRITE|VM_SHARED))
+				     (VM_WRITE|VM_SHARED)) {
+			ret = prepare_call_pfn_mkwrite(vma, address,
+						       page_table, pmd, ptl,
+						       orig_pte);
+			if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))
+				return ret;
+
 			goto reuse;
+		}
 		goto gotten;
 	}
 
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752668Ab3KSUHJ (ORCPT ); Tue, 19 Nov 2013 15:07:09 -0500
Received: from smtp-outbound-2.vmware.com ([208.91.2.13]:56122 "EHLO smtp-outbound-2.vmware.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752139Ab3KSUGj (ORCPT ); Tue, 19 Nov 2013 15:06:39 -0500
From: Thomas Hellstrom
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: linux-graphics-maintainer@vmware.com, Thomas Hellstrom
Subject: [PATCH RFC 2/3] mm: Add a non-populating version of apply_to_page_range()
Date: Tue, 19 Nov 2013 12:06:15 -0800
Message-Id: <1384891576-7851-3-git-send-email-thellstrom@vmware.com>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1384891576-7851-1-git-send-email-thellstrom@vmware.com>
References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

For some tasks, like cleaning ptes it's desirable to operate on only
populated ptes. This avoids the overhead of page table memory allocation
and also avoids memory allocation errors.

Adds apply_to_pt_range() which, in addition to apply_to_page_range(),
optionally skips the populating step. Share code with apply_to_page_range().

Signed-off-by: Thomas Hellstrom
---
 mm/memory.c |   73 +++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 48 insertions(+), 25 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8ae9a6e..79178c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2433,19 +2433,22 @@ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long
 EXPORT_SYMBOL(vm_iomap_memory);
 
 static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
-				     unsigned long addr, unsigned long end,
-				     pte_fn_t fn, void *data)
+			      unsigned long addr, unsigned long end,
+			      pte_fn_t fn, void *data, bool fill)
 {
 	pte_t *pte;
 	int err;
 	pgtable_t token;
 	spinlock_t *uninitialized_var(ptl);
 
-	pte = (mm == &init_mm) ?
-		pte_alloc_kernel(pmd, addr) :
-		pte_alloc_map_lock(mm, pmd, addr, &ptl);
-	if (!pte)
-		return -ENOMEM;
+	if (fill) {
+		pte = (mm == &init_mm) ?
+			pte_alloc_kernel(pmd, addr) :
+			pte_alloc_map_lock(mm, pmd, addr, &ptl);
+		if (!pte)
+			return -ENOMEM;
+	} else
+		pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 
 	BUG_ON(pmd_huge(*pmd));
 
@@ -2461,27 +2464,32 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	arch_leave_lazy_mmu_mode();
 
-	if (mm != &init_mm)
+	if (!fill || mm != &init_mm)
 		pte_unmap_unlock(pte-1, ptl);
 	return err;
 }
 
 static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
-				     unsigned long addr, unsigned long end,
-				     pte_fn_t fn, void *data)
+			      unsigned long addr, unsigned long end,
+			      pte_fn_t fn, void *data, bool fill)
 {
 	pmd_t *pmd;
 	unsigned long next;
-	int err;
+	int err = 0;
 
 	BUG_ON(pud_huge(*pud));
 
-	pmd = pmd_alloc(mm, pud, addr);
-	if (!pmd)
-		return -ENOMEM;
+	if (fill) {
+		pmd = pmd_alloc(mm, pud, addr);
+		if (!pmd)
+			return -ENOMEM;
+	} else
+		pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
+		if (!fill && pmd_none_or_clear_bad(pmd))
+			continue;
+		err = apply_to_pte_range(mm, pmd, addr, next, fn, data, fill);
 		if (err)
 			break;
 	} while (pmd++, addr = next, addr != end);
@@ -2489,19 +2497,24 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 }
 
 static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
-				     unsigned long addr, unsigned long end,
-				     pte_fn_t fn, void *data)
+			      unsigned long addr, unsigned long end,
+			      pte_fn_t fn, void *data, bool fill)
 {
 	pud_t *pud;
 	unsigned long next;
-	int err;
+	int err = 0;
 
-	pud = pud_alloc(mm, pgd, addr);
-	if (!pud)
-		return -ENOMEM;
+	if (fill) {
+		pud = pud_alloc(mm, pgd, addr);
+		if (!pud)
+			return -ENOMEM;
+	} else
+		pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		err = apply_to_pmd_range(mm, pud, addr, next, fn, data);
+		if (!fill && pud_none_or_clear_bad(pud))
+			continue;
+		err = apply_to_pmd_range(mm, pud, addr, next, fn, data, fill);
 		if (err)
 			break;
 	} while (pud++, addr = next, addr != end);
@@ -2512,25 +2525,35 @@ static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
  * Scan a region of virtual memory, filling in page tables as necessary
  * and calling a provided function on each leaf page table.
  */
-int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
-			unsigned long size, pte_fn_t fn, void *data)
+static int apply_to_pt_range(struct mm_struct *mm, unsigned long addr,
+			     unsigned long size, pte_fn_t fn, void *data,
+			     bool fill)
 {
 	pgd_t *pgd;
 	unsigned long next;
 	unsigned long end = addr + size;
 	int err;
 
+	BUG_ON(!fill && mm == &init_mm);
 	BUG_ON(addr >= end);
+
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+		err = apply_to_pud_range(mm, pgd, addr, next, fn, data,
+					 fill);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
 
 	return err;
 }
+
+int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
+			unsigned long size, pte_fn_t fn, void *data)
+{
+	return apply_to_pt_range(mm, addr, size, fn, data, true);
+}
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
 /*
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752742Ab3KSUHN (ORCPT ); Tue, 19 Nov 2013 15:07:13 -0500
Received: from smtp-outbound-1.vmware.com ([208.91.2.12]:45017 "EHLO smtp-outbound-1.vmware.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752146Ab3KSUGl (ORCPT ); Tue, 19 Nov 2013 15:06:41 -0500
From: Thomas Hellstrom
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: linux-graphics-maintainer@vmware.com, Thomas Hellstrom
Subject: [PATCH RFC 3/3] mm: Add mkclean_mapping_range()
Date: Tue, 19 Nov 2013 12:06:16 -0800
Message-Id: <1384891576-7851-4-git-send-email-thellstrom@vmware.com>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1384891576-7851-1-git-send-email-thellstrom@vmware.com>
References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

A general function to clean (Mark non-writeable and non-dirty) all ptes
pointing to a certain range in an address space.
Although it is primarily intended for PFNMAP and MIXEDMAP vmas, AFAICT
it should work on address spaces backed by normal pages as well.
It will not clean COW'd pages and it will not work with nonlinear VMAs.

Signed-off-by: Thomas Hellstrom
---
 include/linux/mm.h |    3 ++
 mm/memory.c        |  108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 111 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 23d1791..e6bf5b3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -982,6 +982,9 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
+void mkclean_mapping_range(struct address_space *mapping,
+			   pgoff_t pg_clean_begin,
+			   pgoff_t pg_len);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 79178c2..f7a48f5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4395,3 +4395,111 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 	}
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
+
+struct mkclean_data {
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+};
+
+static int mkclean_mapping_pte(pte_t *pte, pgtable_t token, unsigned long addr,
+			       void *data)
+{
+	struct mkclean_data *md = data;
+	struct mm_struct *mm = md->vma->vm_mm;
+
+	pte_t ptent = *pte;
+
+	if (pte_none(ptent) || !pte_present(ptent))
+		return 0;
+
+	if (pte_dirty(ptent) || pte_write(ptent)) {
+		struct page *page = vm_normal_page(md->vma, addr, ptent);
+
+		/*
+		 * Don't clean COW'ed pages
+		 */
+		if (page && PageAnon(page))
+			return 0;
+
+		tlb_remove_tlb_entry((&md->tlb), pte, addr);
+		ptent = pte_wrprotect(ptent);
+		ptent = pte_mkclean(ptent);
+		set_pte_at(mm, addr, pte, ptent);
+	}
+
+	return 0;
+}
+
+static void mkclean_mapping_range_tree(struct rb_root *root,
+				       pgoff_t first,
+				       pgoff_t last)
+{
+	struct vm_area_struct *vma;
+
+	vma_interval_tree_foreach(vma, root, first, last) {
+		struct mkclean_data md;
+		pgoff_t vba, vea, zba, zea;
+		struct mm_struct *mm;
+		unsigned long addr, end;
+
+		BUG_ON(vma->vm_flags & VM_NONLINEAR);
+
+		if (!(vma->vm_flags & VM_SHARED))
+			continue;
+
+		mm = vma->vm_mm;
+		vba = vma->vm_pgoff;
+		vea = vba + vma_pages(vma) - 1;
+		zba = (first < vba) ? vba : first;
+		zea = (last > vea) ? vea : last;
+
+		addr = ((zba - vba) << PAGE_SHIFT) + vma->vm_start;
+		end = ((zea - vba + 1) << PAGE_SHIFT) + vma->vm_start;
+
+		tlb_gather_mmu(&md.tlb, mm, addr, end);
+		md.vma = vma;
+
+		mmu_notifier_invalidate_range_start(mm, addr, end);
+		tlb_start_vma(&md.tlb, vma);
+
+		(void) apply_to_pt_range(mm, addr, end - addr,
+					 mkclean_mapping_pte,
+					 &md, false);
+
+		tlb_end_vma(&md.tlb, vma);
+		mmu_notifier_invalidate_range_end(mm, addr, end);
+
+		tlb_finish_mmu(&md.tlb, addr, end);
+	}
+}
+
+/*
+ * mkclean_mapping_range - Clean all PTEs pointing to a given range of an
+ * address space.
+ *
+ * @mapping: Pointer to the address space
+ * @pg_clean_begin: Page offset into the address space where cleaning should
+ * start
+ * @pg_len: Length of the range to be cleaned
+ *
+ * This function walks all vmas pointing to a given range of an address space,
+ * marking PTEs clean, unless they are COW'ed. This implies that we only
+ * touch VMAs with the flag VM_SHARED set. This interface also doesn't
+ * support VM_NONLINEAR vmas since there is no general way for us to
+ * make sure a pte is actually pointing into the given address space range
+ * for such VMAs.
+ */
+void mkclean_mapping_range(struct address_space *mapping,
+			   pgoff_t pg_clean_begin,
+			   pgoff_t pg_len)
+{
+	pgoff_t last = pg_clean_begin + pg_len - 1UL;
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	WARN_ON(!list_empty(&mapping->i_mmap_nonlinear));
+	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
+		mkclean_mapping_range_tree(&mapping->i_mmap, pg_clean_begin,
+					   last);
+	mutex_unlock(&mapping->i_mmap_mutex);
+}
+EXPORT_SYMBOL_GPL(mkclean_mapping_range);
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753341Ab3KSWvT (ORCPT ); Tue, 19 Nov 2013 17:51:19 -0500
Received: from mail-pb0-f52.google.com ([209.85.160.52]:34635 "EHLO mail-pb0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753234Ab3KSWvP (ORCPT ); Tue, 19 Nov 2013 17:51:15 -0500
Message-ID: <528BEB60.7040402@amacapital.net>
Date: Tue, 19 Nov 2013 14:51:12 -0800
From: Andy Lutomirski
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0
MIME-Version: 1.0
To: Thomas Hellstrom , linux-mm@kvack.org, linux-kernel@vger.kernel.org
CC: linux-graphics-maintainer@vmware.com
Subject: Re: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces
References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com>
In-Reply-To: <1384891576-7851-1-git-send-email-thellstrom@vmware.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/19/2013 12:06 PM, Thomas Hellstrom wrote:
> Hi!
>
> Before going any further with this I'd like to check whether this is an
> acceptable way to go.
> Background:
> GPU buffer objects in general and vmware svga GPU buffers in
> particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes the
> address space is backed by a set of pages, sometimes it's backed by PCI memory.
> In the latter case in particular, there is no way to track dirty regions
> using page_mkwrite() and page_mkclean(), other than allocating a bounce
> buffer and perform dirty tracking on it, and then copy data to the real GPU
> buffer. This comes with a big memory- and performance overhead.
>
> So I'd like to add the following infrastructure with a callback pfn_mkwrite()
> and a function mkclean_mapping_range(). Typically we will be cleaning a range
> of ptes rather than random ptes in a vma.
> This comes with the extra benefit of being usable when the backing memory of
> the GPU buffer is not coherent with the GPU itself, and where we either need
> to flush caches or move data to synchronize.
>
> So this is a RFC for
> 1) The API. Is it acceptable? Any other suggestions if not?
> 2) Modifying apply_to_page_range(). Better to make a standalone
> non-populating version?
> 3) tlb- mmu- and cache-flushing calls. I've looked at unmap_mapping_range()
> and page_mkclean_one() to try to get it right, but still unsure.

Most (all?) architectures have real dirty tracking -- you can mark a pte
as "clean" and the hardware (or arch code) will mark it dirty when
written, *without* a page fault.

I'm not convinced that it works completely correctly right now (I
suspect that there are some TLB flushing issues on the dirty->clean
transition), and it's likely prone to bit-rot, since the page cache
doesn't rely on it.

That being said, using hardware dirty tracking should be *much* faster
and less latency-inducing than doing it in software like this. It may
be worth trying to get HW dirty tracking working before adding more page
fault-based tracking.

(I think there's also some oddity on S/390. I don't know what that
oddity is or whether you should care.)

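
For illustration only, a minimal user-space sketch of what "real dirty
tracking" amounts to: the hardware sets a dirty bit on a write without
faulting, and software later test-and-clears the bits over a range. All of
it (toy_pte_t, the TOY_PTE_* bits, toy_hw_write(), toy_harvest_dirty()) is
a hypothetical toy model and not kernel API; real code would go through
pte_dirty() and pte_mkclean(), and would still need the TLB flushing that
is in question above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy PTE; not any real architecture's layout. */
#define TOY_PTE_PRESENT (UINT32_C(1) << 0)
#define TOY_PTE_WRITE   (UINT32_C(1) << 1)
#define TOY_PTE_DIRTY   (UINT32_C(1) << 2)

typedef _Atomic uint32_t toy_pte_t;

/*
 * Models the MMU side: a store through a present, writable mapping sets
 * DIRTY without any fault. Returns false where a real CPU would instead
 * raise a write fault.
 */
static bool toy_hw_write(toy_pte_t *pte)
{
	uint32_t old = atomic_load(pte);

	do {
		if (!(old & TOY_PTE_PRESENT) || !(old & TOY_PTE_WRITE))
			return false;
	} while (!atomic_compare_exchange_weak(pte, &old,
					       old | TOY_PTE_DIRTY));
	return true;
}

/*
 * Models the software side: atomically test-and-clear DIRTY on
 * ptes[first..last] (inclusive), record the result per entry in
 * dirty_out[], and return how many entries were dirty. The TLB flush a
 * real implementation would need after clearing is deliberately omitted.
 */
static size_t toy_harvest_dirty(toy_pte_t *ptes, size_t first, size_t last,
				bool *dirty_out)
{
	size_t ndirty = 0;

	assert(first <= last);
	for (size_t i = first; i <= last; i++) {
		uint32_t old = atomic_fetch_and(&ptes[i], ~TOY_PTE_DIRTY);

		dirty_out[i] = (old & TOY_PTE_DIRTY) != 0;
		if (dirty_out[i])
			ndirty++;
	}
	return ndirty;
}
```

Harvesting twice in a row without an intervening write finds nothing the
second time, which is the behaviour the page-fault-based scheme gets from
write-protecting the ptes again.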
--Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751904Ab3KTIMO (ORCPT ); Wed, 20 Nov 2013 03:12:14 -0500 Received: from smtp-outbound-2.vmware.com ([208.91.2.13]:37233 "EHLO smtp-outbound-2.vmware.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750855Ab3KTIMN (ORCPT ); Wed, 20 Nov 2013 03:12:13 -0500 Message-ID: <528C6ED9.3070600@vmware.com> Date: Wed, 20 Nov 2013 09:12:09 +0100 From: Thomas Hellstrom User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Andy Lutomirski CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-graphics-maintainer@vmware.com Subject: Re: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> <528BEB60.7040402@amacapital.net> In-Reply-To: <528BEB60.7040402@amacapital.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/19/2013 11:51 PM, Andy Lutomirski wrote: > On 11/19/2013 12:06 PM, Thomas Hellstrom wrote: >> Hi! >> >> Before going any further with this I'd like to check whether this is an >> acceptable way to go. >> Background: >> GPU buffer objects in general and vmware svga GPU buffers in >> particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes the >> address space is backed by a set of pages, sometimes it's backed by PCI memory. >> In the latter case in particular, there is no way to track dirty regions >> using page_mkwrite() and page_mkclean(), other than allocating a bounce >> buffer and perform dirty tracking on it, and then copy data to the real GPU >> buffer. This comes with a big memory- and performance overhead. 
>> >> So I'd like to add the following infrastructure with a callback pfn_mkwrite() >> and a function mkclean_mapping_range(). Typically we will be cleaning a range >> of ptes rather than random ptes in a vma. >> This comes with the extra benefit of being usable when the backing memory of >> the GPU buffer is not coherent with the GPU itself, and where we either need >> to flush caches or move data to synchronize. >> >> So this is a RFC for >> 1) The API. Is it acceptable? Any other suggestions if not? >> 2) Modifying apply_to_page_range(). Better to make a standalone >> non-populating version? >> 3) tlb- mmu- and cache-flushing calls. I've looked at unmap_mapping_range() >> and page_mkclean_one() to try to get it right, but still unsure. > Most (all?) architectures have real dirty tracking -- you can mark a pte > as "clean" and the hardware (or arch code) will mark it dirty when > written, *without* a page fault. > > I'm not convinced that it works completely correctly right now (I > suspect that there are some TLB flushing issues on the dirty->clean > transition), and it's likely prone to bit-rot, since the page cache > doesn't rely on it. > > That being said, using hardware dirty tracking should be *much* faster > and less latency-inducing than doing it in software like this. It may > be worth trying to get HW dirty tracking working before adding more page > fault-based tracking. > > (I think there's also some oddity on S/390. I don't know what that > oddity is or whether you should care.) > > --Andy Andy, Thanks for the tip. It indeed sounds interesting, however there are a couple of culprits: 1) As you say, it sounds like there might be TLB flushing issues. Let's say the TLB detects a write and raises an IRQ for the arch code to set the PTE dirty bit, and before servicing that interrupt, we clear the PTE and flush that TLB. What will happen? And if the TLB hardware would write directly to the in-memory PTE I guess we'd have the same synchronization issues. 
I guess we'd then need an atomic read-modify-write against the TLB hardware? 2) Even if most hardware is capable of this stuff, I'm not sure what would happen in a virtual machine. Need to check. 3) For dirty contents that need to appear on a screen within a short interval, we need the write notification anyway, to start a delayed task that will gather the dirty data and flush it to the screen... Thanks, /Thomas From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754317Ab3KTQvA (ORCPT ); Wed, 20 Nov 2013 11:51:00 -0500 Received: from mail-vb0-f45.google.com ([209.85.212.45]:51163 "EHLO mail-vb0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753952Ab3KTQu6 (ORCPT ); Wed, 20 Nov 2013 11:50:58 -0500 MIME-Version: 1.0 In-Reply-To: <528C6ED9.3070600@vmware.com> References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> <528BEB60.7040402@amacapital.net> <528C6ED9.3070600@vmware.com> From: Andy Lutomirski Date: Wed, 20 Nov 2013 08:50:37 -0800 Message-ID: Subject: Re: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces To: Thomas Hellstrom Cc: "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , linux-graphics-maintainer@vmware.com Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 20, 2013 at 12:12 AM, Thomas Hellstrom wrote: > On 11/19/2013 11:51 PM, Andy Lutomirski wrote: >> >> On 11/19/2013 12:06 PM, Thomas Hellstrom wrote: >>> >>> Hi! >>> >>> Before going any further with this I'd like to check whether this is an >>> acceptable way to go. >>> Background: >>> GPU buffer objects in general and vmware svga GPU buffers in >>> particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes >>> the >>> address space is backed by a set of pages, sometimes it's backed by PCI >>> memory. 
>>> In the latter case in particular, there is no way to track dirty regions >>> using page_mkwrite() and page_mkclean(), other than allocating a bounce >>> buffer and perform dirty tracking on it, and then copy data to the real >>> GPU >>> buffer. This comes with a big memory- and performance overhead. >>> >>> So I'd like to add the following infrastructure with a callback >>> pfn_mkwrite() >>> and a function mkclean_mapping_range(). Typically we will be cleaning a >>> range >>> of ptes rather than random ptes in a vma. >>> This comes with the extra benefit of being usable when the backing memory >>> of >>> the GPU buffer is not coherent with the GPU itself, and where we either >>> need >>> to flush caches or move data to synchronize. >>> >>> So this is a RFC for >>> 1) The API. Is it acceptable? Any other suggestions if not? >>> 2) Modifying apply_to_page_range(). Better to make a standalone >>> non-populating version? >>> 3) tlb- mmu- and cache-flushing calls. I've looked at >>> unmap_mapping_range() >>> and page_mkclean_one() to try to get it right, but still unsure. >> >> Most (all?) architectures have real dirty tracking -- you can mark a pte >> as "clean" and the hardware (or arch code) will mark it dirty when >> written, *without* a page fault. >> >> I'm not convinced that it works completely correctly right now (I >> suspect that there are some TLB flushing issues on the dirty->clean >> transition), and it's likely prone to bit-rot, since the page cache >> doesn't rely on it. >> >> That being said, using hardware dirty tracking should be *much* faster >> and less latency-inducing than doing it in software like this. It may >> be worth trying to get HW dirty tracking working before adding more page >> fault-based tracking. >> >> (I think there's also some oddity on S/390. I don't know what that >> oddity is or whether you should care.) >> >> --Andy > > > Andy, > > Thanks for the tip. 
It indeed sounds interesting, however there are a couple > of culprits: > > 1) As you say, it sounds like there might be TLB flushing issues. Let's say > the TLB detects a write and raises an IRQ for the arch code to set the PTE > dirty bit, and before servicing that interrupt, we clear the PTE and flush > that TLB. What will happen? This should be fine. I assume that all architectures that do this kind of software dirty tracking will make the write block until the fault is handled, so the write won't have happened when you clear the PTE. After the TLB flush, the PTE will become dirty again and then the page will be written. > And if the TLB hardware would write directly to > the in-memory PTE I guess we'd have the same synchronization issues. I guess > we'd then need an atomic read-modify-write against the TLB hardware? IIRC the part that looked fishy to me was the combination of hw dirty tracking and write protecting the page. If you see that the pte is clean and want to write protect it, you probably need to set the write protect bit (atomically so you don't lose a dirty bit), flush the TLB, and then check the dirty bit again. > 2) Even if most hardware is capable of this stuff, I'm not sure what would > happen in a virtual machine. Need to check. This should be fine. Any VM monitor that fails to implement dirty tracking is probably terminally broken. > 3) For dirty contents that need to appear on a screen within a short > interval, we need the write notification anyway, to start a delayed task > that will gather the dirty data and flush it to the screen... > So that's what you want to do :) I bet that the best approach is some kind of hybrid. If, on the first page fault per frame, you un-write-protect the entire buffer and then, near the end of the frame, check all the hw dirty bits and re-write-protect the entire buffer, you get the benefit of detecting which pages were written, but you only take one write fault per frame instead of one write fault per page.
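To make the hybrid idea concrete, here is a rough userspace simulation of it. This is only a toy model under stated assumptions, not kernel code: struct toy_pte, struct toy_buffer, NPAGES and all toy_* functions are hypothetical names made up for illustration, and the "fault" and the "hardware dirty bit" are plain booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8 /* hypothetical buffer size in pages */

/* Toy stand-in for a PTE: only the two bits the scheme cares about. */
struct toy_pte {
	bool writable; /* false: the next write takes a (simulated) fault */
	bool dirty;    /* set by the simulated hardware on every write */
};

struct toy_buffer {
	struct toy_pte pte[NPAGES];
	unsigned long write_faults; /* simulated write faults taken so far */
};

/* Start of tracking: every page is write-protected and clean. */
static void toy_buffer_init(struct toy_buffer *buf)
{
	for (size_t i = 0; i < NPAGES; i++) {
		buf->pte[i].writable = false;
		buf->pte[i].dirty = false;
	}
	buf->write_faults = 0;
}

/* Simulated CPU store to one page of the buffer. */
static void toy_write(struct toy_buffer *buf, size_t page)
{
	assert(page < NPAGES);
	if (!buf->pte[page].writable) {
		/* First write of the frame: one fault unprotects everything. */
		buf->write_faults++;
		for (size_t i = 0; i < NPAGES; i++)
			buf->pte[i].writable = true;
	}
	/* From here on the dirty bit is set without any further fault. */
	buf->pte[page].dirty = true;
}

/*
 * End of frame: report which pages were written, then clean and
 * re-write-protect the whole buffer. Returns the number of dirty pages.
 */
static size_t toy_end_frame(struct toy_buffer *buf, bool dirty_out[NPAGES])
{
	size_t ndirty = 0;

	for (size_t i = 0; i < NPAGES; i++) {
		dirty_out[i] = buf->pte[i].dirty;
		if (dirty_out[i])
			ndirty++;
		buf->pte[i].dirty = false;
		buf->pte[i].writable = false;
	}
	return ndirty;
}
```

Calling toy_write() on several pages within one frame bumps write_faults only once, while toy_end_frame() still reports each written page individually. A real implementation would also need the TLB flushing discussed above when re-protecting, which this model leaves out.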
(I imagine that there are video apps out there that would slow down measurably if they started taking one write fault per page per frame.) --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755177Ab3KTUQs (ORCPT ); Wed, 20 Nov 2013 15:16:48 -0500 Received: from smtp-outbound-1.vmware.com ([208.91.2.12]:38357 "EHLO smtp-outbound-1.vmware.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754969Ab3KTUQr (ORCPT ); Wed, 20 Nov 2013 15:16:47 -0500 Message-ID: <528D18AB.5020009@vmware.com> Date: Wed, 20 Nov 2013 21:16:43 +0100 From: Thomas Hellstrom User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Andy Lutomirski CC: "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , linux-graphics-maintainer@vmware.com Subject: Re: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> <528BEB60.7040402@amacapital.net> <528C6ED9.3070600@vmware.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/20/2013 05:50 PM, Andy Lutomirski wrote: > On Wed, Nov 20, 2013 at 12:12 AM, Thomas Hellstrom > wrote: >> On 11/19/2013 11:51 PM, Andy Lutomirski wrote: >>> On 11/19/2013 12:06 PM, Thomas Hellstrom wrote: >>>> Hi! >>>> >>>> Before going any further with this I'd like to check whether this is an >>>> acceptable way to go. >>>> Background: >>>> GPU buffer objects in general and vmware svga GPU buffers in >>>> particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes >>>> the >>>> address space is backed by a set of pages, sometimes it's backed by PCI >>>> memory.
>>>> In the latter case in particular, there is no way to track dirty regions >>>> using page_mkwrite() and page_mkclean(), other than allocating a bounce >>>> buffer and perform dirty tracking on it, and then copy data to the real >>>> GPU >>>> buffer. This comes with a big memory- and performance overhead. >>>> >>>> So I'd like to add the following infrastructure with a callback >>>> pfn_mkwrite() >>>> and a function mkclean_mapping_range(). Typically we will be cleaning a >>>> range >>>> of ptes rather than random ptes in a vma. >>>> This comes with the extra benefit of being usable when the backing memory >>>> of >>>> the GPU buffer is not coherent with the GPU itself, and where we either >>>> need >>>> to flush caches or move data to synchronize. >>>> >>>> So this is a RFC for >>>> 1) The API. Is it acceptable? Any other suggestions if not? >>>> 2) Modifying apply_to_page_range(). Better to make a standalone >>>> non-populating version? >>>> 3) tlb- mmu- and cache-flushing calls. I've looked at >>>> unmap_mapping_range() >>>> and page_mkclean_one() to try to get it right, but still unsure. >>> Most (all?) architectures have real dirty tracking -- you can mark a pte >>> as "clean" and the hardware (or arch code) will mark it dirty when >>> written, *without* a page fault. >>> >>> I'm not convinced that it works completely correctly right now (I >>> suspect that there are some TLB flushing issues on the dirty->clean >>> transition), and it's likely prone to bit-rot, since the page cache >>> doesn't rely on it. >>> >>> That being said, using hardware dirty tracking should be *much* faster >>> and less latency-inducing than doing it in software like this. It may >>> be worth trying to get HW dirty tracking working before adding more page >>> fault-based tracking. >>> >>> (I think there's also some oddity on S/390. I don't know what that >>> oddity is or whether you should care.) >>> >>> --Andy >> >> Andy, >> >> Thanks for the tip. 
It indeed sounds interesting, however there are a couple >> of culprits: >> >> 1) As you say, it sounds like there might be TLB flushing issues. Let's say >> the TLB detects a write and raises an IRQ for the arch code to set the PTE >> dirty bit, and before servicing that interrupt, we clear the PTE and flush >> that TLB. What will happen? > This should be fine. I assume that all architectures that do this > kind of software dirty tracking will make the write block until the > fault is handled, so the write won't have happened when you clear the > PTE. After the TLB flush, the PTE will become dirty again and then > the page will be written. > >> And if the TLB hardware would write directly to >> the in-memory PTE I guess we'd have the same synchronization issues. I guess >> we'd then need an atomic read-modify-write against the TLB hardware? > IIRC the part that looked fishy to me was the combination of hw dirty > tracking and write protecting the page. If you see that the pte is > clean and want to write protect it, you probably need to set the write > protect bit (atomically so you don't lose a dirty bit), flush the TLB, > and then check the dirty bit again. > >> 2) Even if most hardware is capable of this stuff, I'm not sure what would >> happen in a virtual machine. Need to check. > This should be fine. Any VM monitor that fails to implement dirty > tracking is probably terminally broken. OK. I'll give it a try. If I understand this correctly, even if I set up a shared RW mapping, the PTEs should magically be marked dirty if written to, and everything works as it should? > >> 3) For dirty contents that need to appear on a screen within a short >> interval, we need the write notification anyway, to start a delayed task >> that will gather the dirty data and flush it to the screen... >> > So that's what you want to do :) Well this is mostly a benefit, actually. 
We already do this using fb_defio, but without this new interface we need a bounce-buffer covering the whole screen. Luckily this isn't a common use-case. Typically (if we use this) we'd gather dirty data when the buffer is referenced in a GPU command stream. > > I bet that the best approach is some kind of hybrid. If, on the first > page fault per frame, you un-write-protected the entire buffer and > then, near the end of the frame, check all the hw dirty bits and > re-write-protect the entire buffer, you get the benefit detecting > which pages were written, but you only take one write fault per frame > instead of one write fault per page. Yes, that sounds sane, particularly as un-write-protecting shouldn't need any additional tlb flushing, AFAICT. > > (I imagine that there are video apps out that there that would slow > down measurably if they started taking one write fault per page per > frame.) I actually hope to be able to avoid this stuff completely, but I need a backup plan, so that's why I threw out this RFC. 
> > --Andy Thanks, Thomas From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755063Ab3KTU3v (ORCPT ); Wed, 20 Nov 2013 15:29:51 -0500 Received: from mail-ve0-f177.google.com ([209.85.128.177]:39580 "EHLO mail-ve0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754992Ab3KTU3s (ORCPT ); Wed, 20 Nov 2013 15:29:48 -0500 MIME-Version: 1.0 In-Reply-To: <528D18AB.5020009@vmware.com> References: <1384891576-7851-1-git-send-email-thellstrom@vmware.com> <528BEB60.7040402@amacapital.net> <528C6ED9.3070600@vmware.com> <528D18AB.5020009@vmware.com> From: Andy Lutomirski Date: Wed, 20 Nov 2013 12:29:27 -0800 Message-ID: Subject: Re: [PATCH RFC 0/3] Add dirty-tracking infrastructure for non-page-backed address spaces To: Thomas Hellstrom Cc: "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , linux-graphics-maintainer Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 20, 2013 at 12:16 PM, Thomas Hellstrom wrote: > On 11/20/2013 05:50 PM, Andy Lutomirski wrote: >> >> On Wed, Nov 20, 2013 at 12:12 AM, Thomas Hellstrom >> wrote: >>> >>> On 11/19/2013 11:51 PM, Andy Lutomirski wrote: >>>> >>>> On 11/19/2013 12:06 PM, Thomas Hellstrom wrote: >>>>> >>>>> Hi! >>>>> >>>>> Before going any further with this I'd like to check whether this is an >>>>> acceptable way to go. >>>>> Background: >>>>> GPU buffer objects in general and vmware svga GPU buffers in >>>>> particular are mapped by user-space using MIXEDMAP or PFNMAP. Sometimes >>>>> the >>>>> address space is backed by a set of pages, sometimes it's backed by PCI >>>>> memory. 
>>>>> In the latter case in particular, there is no way to track dirty >>>>> regions >>>>> using page_mkwrite() and page_mkclean(), other than allocating a bounce >>>>> buffer and perform dirty tracking on it, and then copy data to the real >>>>> GPU >>>>> buffer. This comes with a big memory- and performance overhead. >>>>> >>>>> So I'd like to add the following infrastructure with a callback >>>>> pfn_mkwrite() >>>>> and a function mkclean_mapping_range(). Typically we will be cleaning a >>>>> range >>>>> of ptes rather than random ptes in a vma. >>>>> This comes with the extra benefit of being usable when the backing >>>>> memory >>>>> of >>>>> the GPU buffer is not coherent with the GPU itself, and where we either >>>>> need >>>>> to flush caches or move data to synchronize. >>>>> >>>>> So this is a RFC for >>>>> 1) The API. Is it acceptable? Any other suggestions if not? >>>>> 2) Modifying apply_to_page_range(). Better to make a standalone >>>>> non-populating version? >>>>> 3) tlb- mmu- and cache-flushing calls. I've looked at >>>>> unmap_mapping_range() >>>>> and page_mkclean_one() to try to get it right, but still unsure. >>>> >>>> Most (all?) architectures have real dirty tracking -- you can mark a pte >>>> as "clean" and the hardware (or arch code) will mark it dirty when >>>> written, *without* a page fault. >>>> >>>> I'm not convinced that it works completely correctly right now (I >>>> suspect that there are some TLB flushing issues on the dirty->clean >>>> transition), and it's likely prone to bit-rot, since the page cache >>>> doesn't rely on it. >>>> >>>> That being said, using hardware dirty tracking should be *much* faster >>>> and less latency-inducing than doing it in software like this. It may >>>> be worth trying to get HW dirty tracking working before adding more page >>>> fault-based tracking. >>>> >>>> (I think there's also some oddity on S/390. I don't know what that >>>> oddity is or whether you should care.) 
>>>> >>>> --Andy >>> >>> >>> Andy, >>> >>> Thanks for the tip. It indeed sounds interesting, however there are a >>> couple >>> of culprits: >>> >>> 1) As you say, it sounds like there might be TLB flushing issues. Let's >>> say >>> the TLB detects a write and raises an IRQ for the arch code to set the >>> PTE >>> dirty bit, and before servicing that interrupt, we clear the PTE and >>> flush >>> that TLB. What will happen? >> >> This should be fine. I assume that all architectures that do this >> kind of software dirty tracking will make the write block until the >> fault is handled, so the write won't have happened when you clear the >> PTE. After the TLB flush, the PTE will become dirty again and then >> the page will be written. >> >>> And if the TLB hardware would write directly to >>> the in-memory PTE I guess we'd have the same synchronization issues. I >>> guess >>> we'd then need an atomic read-modify-write against the TLB hardware? >> >> IIRC the part that looked fishy to me was the combination of hw dirty >> tracking and write protecting the page. If you see that the pte is >> clean and want to write protect it, you probably need to set the write >> protect bit (atomically so you don't lose a dirty bit), flush the TLB, >> and then check the dirty bit again. >> >>> 2) Even if most hardware is capable of this stuff, I'm not sure what >>> would >>> happen in a virtual machine. Need to check. >> >> This should be fine. Any VM monitor that fails to implement dirty >> tracking is probably terminally broken. > > > OK. I'll give it a try. If I understand this correctly, even if I set up a > shared RW mapping, the > PTEs should magically be marked dirty if written to, and everything works as > it should? > I *think* so. 
(It's certainly worth doing a quick-and-dirty test to make sure I'm not completely nuts before you invest too much time here -- I've read the code, and I've looked at the Intel specs, but I've never actually verified that pte_dirty and pte_mkclean do what I think they do.) (If you ever intend to run on S/390, you should ask someone who understands what's going on. I, personally, have no clue, other than having seen references to something weird happening.) --Andy
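For reference, the write-protect ordering raised earlier in the thread (set the write-protect bit atomically so no dirty bit is lost, flush the TLB, then check the dirty bit again) might look roughly like the sketch below. This is a userspace model using C11 atomics under loud assumptions: the bit layout (TOY_PTE_WRITE, TOY_PTE_DIRTY), toy_flush_tlb() and toy_wrprotect_and_test_dirty() are hypothetical and do not correspond to any real architecture or kernel helper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for a toy PTE; not any real architecture's. */
#define TOY_PTE_WRITE (UINT64_C(1) << 0)
#define TOY_PTE_DIRTY (UINT64_C(1) << 1)

/* Placeholder for a TLB flush; a real one is architecture-specific. */
static void toy_flush_tlb(void)
{
}

/*
 * Write-protect a toy PTE and report whether it is dirty afterwards.
 *
 * The atomic read-modify-write clears only the WRITE bit, so a DIRTY
 * bit set concurrently by (simulated) hardware is not overwritten.
 * The dirty bit is read only after the flush, because a write going
 * through a stale writable TLB entry could still set it until then.
 */
static bool toy_wrprotect_and_test_dirty(_Atomic uint64_t *pte)
{
	atomic_fetch_and(pte, ~TOY_PTE_WRITE);
	toy_flush_tlb();
	return (atomic_load(pte) & TOY_PTE_DIRTY) != 0;
}
```

A plain load, mask and store in place of the atomic_fetch_and() would be the racy variant: a dirty bit set between the load and the store would be silently dropped.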