From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f43.google.com (mail-wm0-f43.google.com [74.125.82.43]) by kanga.kvack.org (Postfix) with ESMTP id EC7BF828F3 for ; Sun, 10 Jan 2016 08:59:26 -0500 (EST)
Received: by mail-wm0-f43.google.com with SMTP id u188so186128712wmu.1 for ; Sun, 10 Jan 2016 05:59:26 -0800 (PST)
Received: from mail-wm0-x231.google.com (mail-wm0-x231.google.com. [2a00:1450:400c:c09::231]) by mx.google.com with ESMTPS id k206si14978201wmf.37.2016.01.10.05.59.25 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 10 Jan 2016 05:59:25 -0800 (PST)
Received: by mail-wm0-x231.google.com with SMTP id f206so183475277wmf.0 for ; Sun, 10 Jan 2016 05:59:25 -0800 (PST)
Message-ID: <569263BA.5060503@plexistor.com>
Date: Sun, 10 Jan 2016 15:59:22 +0200
From: Boaz Harrosh
MIME-Version: 1.0
Subject: [PATCHSET 0/2] Allow single pagefault in write access of a VM_MIXEDMAP mapping
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: "Kirill A. Shutemov", Dan Williams, Andrew Morton, Matthew Wilcox, "linux-mm@kvack.org"
Cc: Ross Zwisler, Oleg Nesterov, Mel Gorman, Johannes Weiner

Hi

Today any VM_MIXEDMAP or VM_PFNMAP mapping that enables write access
takes a double page fault on every new write access.

This is because vma->vm_page_prot defines how a page/pfn is inserted into
the page table (see vma_wants_writenotify() in mm/mmap.c).

This means that the entry is always inserted read-only, under the
assumption that we want to be notified when a write access occurs.

But this is not always true, and it adds an unnecessary page fault on
every new mmap-write access.

This patchset tries to give the fault handler more choice by passing
a pgprot_t to vm_insert_mixed() via a new vm_insert_mixed_prot() API.
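
To make the double fault concrete, here is a minimal user-space sketch. It is
an illustration only: it is not kernel code and not part of the patches, and
every toy_* name is invented for this example. It assumes a page-table entry
can be reduced to two bits (present, writable) and counts how many faults a
single access to an unmapped page takes under each insertion policy.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page-table entry; not a kernel type. */
struct toy_pte {
	bool present;
	bool writable;
};

/* Hypothetical policies for how the fault handler installs the entry. */
enum toy_policy {
	TOY_INSERT_READ_ONLY,	/* always use the read-only default protection */
	TOY_INSERT_BY_ACCESS,	/* handler picks the protection from the access */
};

/*
 * Simulate one access to a page that is not mapped yet and return the
 * number of faults taken before the access succeeds.
 */
static int toy_faults_for_access(enum toy_policy policy, bool write)
{
	struct toy_pte pte = { .present = false, .writable = false };
	int faults = 0;

	for (;;) {
		if (pte.present && (!write || pte.writable))
			return faults;	/* the access goes through */

		faults++;
		if (!pte.present) {
			/* First fault: install the mapping. */
			pte.present = true;
			pte.writable = write && policy == TOY_INSERT_BY_ACCESS;
		} else {
			/* Write-protect fault: upgrade to read-write. */
			pte.writable = true;
		}
	}
}

/* Small self-check that doubles as a usage example. */
void toy_self_check(void)
{
	assert(toy_faults_for_access(TOY_INSERT_READ_ONLY, true) == 2);
	assert(toy_faults_for_access(TOY_INSERT_BY_ACCESS, true) == 1);
}
```

In this model a read costs one fault under either policy; only the write case
differs, which is the case the patchset targets.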

If the mm guys feel that pgprot_t and its helpers and flags are private
to mm/memory.c, I can easily add a new vm_insert_mixed_rw() instead of the
above vm_insert_mixed_prot(), which allows control over any protection bit,
not only write.

Following is a patch to DAX to optimize out the extra page fault.

TODO: I only did the 4k mapping; perhaps the 2M mapping can enjoy the same
single fault on write access. If this is interesting to anyone I can attempt
a fix.

Dan, Andrew: who needs to pick this up, please?

List of patches:
 [PATCH 1/2] mm: Allow single pagefault on mmap-write with VM_MIXEDMAP
 [PATCH 2/2] dax: Only fault once on mmap write access

Thank you
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f46.google.com (mail-wm0-f46.google.com [74.125.82.46]) by kanga.kvack.org (Postfix) with ESMTP id 22938828F3 for ; Sun, 10 Jan 2016 09:02:10 -0500 (EST)
Received: by mail-wm0-f46.google.com with SMTP id u188so186176751wmu.1 for ; Sun, 10 Jan 2016 06:02:10 -0800 (PST)
Received: from mail-wm0-x233.google.com (mail-wm0-x233.google.com. [2a00:1450:400c:c09::233]) by mx.google.com with ESMTPS id k8si14980003wmd.56.2016.01.10.06.02.08 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 10 Jan 2016 06:02:09 -0800 (PST)
Received: by mail-wm0-x233.google.com with SMTP id f206so183523269wmf.0 for ; Sun, 10 Jan 2016 06:02:08 -0800 (PST)
Message-ID: <5692645E.5080304@plexistor.com>
Date: Sun, 10 Jan 2016 16:02:06 +0200
From: Boaz Harrosh
MIME-Version: 1.0
Subject: [PATCH 1/2] mm: Allow single pagefault on mmap-write with VM_MIXEDMAP
References: <569263BA.5060503@plexistor.com>
In-Reply-To: <569263BA.5060503@plexistor.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: "Kirill A.
Shutemov", Dan Williams, Andrew Morton, Matthew Wilcox, "linux-mm@kvack.org"
Cc: Ross Zwisler, Oleg Nesterov, Mel Gorman, Johannes Weiner

Until now, vma->vm_page_prot defined how a page/pfn is inserted into
the page table (see vma_wants_writenotify() in mm/mmap.c).

This meant that the entry was always inserted read-only, under the
assumption that we want to be notified when a write access occurs.
This is not always true, and it adds an unnecessary page fault on
every new mmap-write.

This patch adds a more granular approach and lets the fault handler
decide how it wants to map the mixed-map pfn.

The old vm_insert_mixed() now receives a new pgprot_t prot argument and is
renamed to vm_insert_mixed_prot(). A new inline vm_insert_mixed() is
defined as a wrapper over vm_insert_mixed_prot() that passes the
vma->vm_page_prot default as before, so all current users keep working.

CC: Andrew Morton
CC: "Kirill A. Shutemov"
CC: Oleg Nesterov
CC: Mel Gorman
CC: Johannes Weiner
CC: Matthew Wilcox
CC: linux-mm@kvack.org (open list:MEMORY MANAGEMENT)
Reviewed-by: Yigal Korman
Signed-off-by: Boaz Harrosh
---
 include/linux/mm.h | 8 +++++++-
 mm/memory.c | 10 +++++-----
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h index 80001de..46a9a19 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2108,8 +2108,14 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr, int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); +int vm_insert_mixed_prot(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, pgprot_t prot); +static inline int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn); + unsigned long pfn) +{ + return vm_insert_mixed_prot(vma, addr, pfn, vma->vm_page_prot); +} int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
diff --git
a/mm/memory.c b/mm/memory.c index deb679c..c716913 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1589,8 +1589,8 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL(vm_insert_pfn); -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn) +int vm_insert_mixed_prot(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, pgprot_t prot) { BUG_ON(!(vma->vm_flags & VM_MIXEDMAP)); @@ -1608,11 +1608,11 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, struct page *page; page = pfn_to_page(pfn); - return insert_page(vma, addr, page, vma->vm_page_prot); + return insert_page(vma, addr, page, prot); } - return insert_pfn(vma, addr, pfn, vma->vm_page_prot); + return insert_pfn(vma, addr, pfn, prot); } -EXPORT_SYMBOL(vm_insert_mixed); +EXPORT_SYMBOL(vm_insert_mixed_prot); /* * maps a range of physical memory into the requested pages. the old -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f50.google.com (mail-wm0-f50.google.com [74.125.82.50]) by kanga.kvack.org (Postfix) with ESMTP id E715E828F3 for ; Sun, 10 Jan 2016 09:03:46 -0500 (EST) Received: by mail-wm0-f50.google.com with SMTP id f206so183552189wmf.0 for ; Sun, 10 Jan 2016 06:03:46 -0800 (PST) Received: from mail-wm0-x22f.google.com (mail-wm0-x22f.google.com. 
[2a00:1450:400c:c09::22f]) by mx.google.com with ESMTPS id o82si14931186wmg.112.2016.01.10.06.03.45 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 10 Jan 2016 06:03:46 -0800 (PST)
Received: by mail-wm0-x22f.google.com with SMTP id u188so186205088wmu.1 for ; Sun, 10 Jan 2016 06:03:45 -0800 (PST)
Message-ID: <569264BF.8010905@plexistor.com>
Date: Sun, 10 Jan 2016 16:03:43 +0200
From: Boaz Harrosh
MIME-Version: 1.0
Subject: [PATCH 2/2] dax: Only fault once on mmap write access
References: <569263BA.5060503@plexistor.com>
In-Reply-To: <569263BA.5060503@plexistor.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: "Kirill A. Shutemov", Dan Williams, Andrew Morton, Matthew Wilcox, "linux-mm@kvack.org"
Cc: Ross Zwisler, Oleg Nesterov, Mel Gorman, Johannes Weiner

In the current code, any mmap-write access takes two page faults:
one that maps the pfn into the vma (vm_insert_mixed()), and a second
one that converts the read-only mapping to read-write (via pfn_mkwrite).

But since we already know that this is a write access, we can map the
pfn read-write right away and save the extra fault.

Signed-off-by: Boaz Harrosh --- fs/dax.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index a86d3cc..3fee696 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -289,6 +289,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); unsigned long vaddr = (unsigned long)vmf->virtual_address; void __pmem *addr; + pgprot_t prot = vma->vm_page_prot; unsigned long pfn; pgoff_t size; int error; @@ -321,7 +322,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, wmb_pmem(); } - error = vm_insert_mixed(vma, vaddr, pfn); + if (vmf->flags & FAULT_FLAG_WRITE) + prot = pgprot_modify(prot, PAGE_SHARED); + + error = vm_insert_mixed_prot(vma, vaddr, pfn, prot); out: i_mmap_unlock_read(mapping); -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yk0-f181.google.com (mail-yk0-f181.google.com [209.85.160.181]) by kanga.kvack.org (Postfix) with ESMTP id 9FDAB828F3 for ; Sun, 10 Jan 2016 20:19:06 -0500 (EST) Received: by mail-yk0-f181.google.com with SMTP id x67so417167189ykd.2 for ; Sun, 10 Jan 2016 17:19:06 -0800 (PST) Received: from mail-yk0-x232.google.com (mail-yk0-x232.google.com. 
[2607:f8b0:4002:c07::232]) by mx.google.com with ESMTPS id q185si14783320ywe.315.2016.01.10.17.19.05 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 10 Jan 2016 17:19:05 -0800 (PST) Received: by mail-yk0-x232.google.com with SMTP id v14so323049291ykd.3 for ; Sun, 10 Jan 2016 17:19:05 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <569263BA.5060503@plexistor.com> References: <569263BA.5060503@plexistor.com> Date: Sun, 10 Jan 2016 17:19:05 -0800 Message-ID: Subject: Re: [PATCHSET 0/2] Allow single pagefault in write access of a VM_MIXEDMAP mapping From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: "Kirill A. Shutemov" , Andrew Morton , Matthew Wilcox , "linux-mm@kvack.org" , Ross Zwisler , Oleg Nesterov , Mel Gorman , Johannes Weiner On Sun, Jan 10, 2016 at 5:59 AM, Boaz Harrosh wrote: > Hi > > Today any VM_MIXEDMAP or VM_PFN mapping when enabling a write access > to their mapping, will have a double pagefault for every write access. > > This is because vma->vm_page_prot defines how a page/pfn is inserted into > the page table (see vma_wants_writenotify in mm/mmap.c). > > Which means that it is always inserted with read-only under the > assumption that we want to be notified when write access occurs. > > But this is not always true and adds an unnecessary page-fault on > every new mmap-write access > > This patchset is trying to give the fault handler more choice by passing > an pgprot_t to vm_insert_mixed() via a new vm_insert_mixed_prot() API. > > If the mm guys feel that the pgprot_t and its helpers and flags are private > to mm/memory.c I can easily do a new: vm_insert_mixed_rw() instead. of the > above vm_insert_mixed_prot() which enables any control not only write. > > Following is a patch to DAX to optimize out the extra page-fault. > > TODO: I only did 4k mapping perhaps 2M mapping can enjoy the same single > fault on write access. 
If interesting to anyone I can attempt a fix.
>
> Dan Andrew who needs to pick this up please?

This collides with the patches currently pending in -mm for 4.5; let's
take a look at this for 4.6.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f48.google.com (mail-wm0-f48.google.com [74.125.82.48]) by kanga.kvack.org (Postfix) with ESMTP id D80CE828F3 for ; Mon, 11 Jan 2016 04:22:18 -0500 (EST)
Received: by mail-wm0-f48.google.com with SMTP id f206so257779988wmf.0 for ; Mon, 11 Jan 2016 01:22:18 -0800 (PST)
Received: from mail-wm0-x22d.google.com (mail-wm0-x22d.google.com. [2a00:1450:400c:c09::22d]) by mx.google.com with ESMTPS id wt9si197415002wjc.42.2016.01.11.01.22.17 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 11 Jan 2016 01:22:17 -0800 (PST)
Received: by mail-wm0-x22d.google.com with SMTP id b14so257667141wmb.1 for ; Mon, 11 Jan 2016 01:22:17 -0800 (PST)
Message-ID: <56937446.1050308@plexistor.com>
Date: Mon, 11 Jan 2016 11:22:14 +0200
From: Boaz Harrosh
MIME-Version: 1.0
Subject: Re: [PATCHSET 0/2] Allow single pagefault in write access of a VM_MIXEDMAP mapping
References: <569263BA.5060503@plexistor.com>
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Dan Williams
Cc: "Kirill A. Shutemov", Andrew Morton, Matthew Wilcox, "linux-mm@kvack.org", Ross Zwisler, Oleg Nesterov, Mel Gorman, Johannes Weiner

On 01/11/2016 03:19 AM, Dan Williams wrote:
> On Sun, Jan 10, 2016 at 5:59 AM, Boaz Harrosh wrote:
>> Hi
>>
>> Today any VM_MIXEDMAP or VM_PFN mapping when enabling a write access
>> to their mapping, will have a double pagefault for every write access.
>> >> This is because vma->vm_page_prot defines how a page/pfn is inserted into >> the page table (see vma_wants_writenotify in mm/mmap.c). >> >> Which means that it is always inserted with read-only under the >> assumption that we want to be notified when write access occurs. >> >> But this is not always true and adds an unnecessary page-fault on >> every new mmap-write access >> >> This patchset is trying to give the fault handler more choice by passing >> an pgprot_t to vm_insert_mixed() via a new vm_insert_mixed_prot() API. >> >> If the mm guys feel that the pgprot_t and its helpers and flags are private >> to mm/memory.c I can easily do a new: vm_insert_mixed_rw() instead. of the >> above vm_insert_mixed_prot() which enables any control not only write. >> >> Following is a patch to DAX to optimize out the extra page-fault. >> >> TODO: I only did 4k mapping perhaps 2M mapping can enjoy the same single >> fault on write access. If interesting to anyone I can attempt a fix. >> >> Dan Andrew who needs to pick this up please? > > This collides with the patches currently pending in -mm for 4.5, lets > take a look at this for 4.6. > OK thanks, I will try to work this over current linux-next and sure we will wait for 4.5-rc1 to look at this again. Do you have any comments in general about this? Thanks Boaz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yk0-f176.google.com (mail-yk0-f176.google.com [209.85.160.176]) by kanga.kvack.org (Postfix) with ESMTP id 0CBCE828F3 for ; Mon, 11 Jan 2016 11:37:40 -0500 (EST) Received: by mail-yk0-f176.google.com with SMTP id x67so437687333ykd.2 for ; Mon, 11 Jan 2016 08:37:40 -0800 (PST) Received: from mail-yk0-x22d.google.com (mail-yk0-x22d.google.com. 
[2607:f8b0:4002:c07::22d]) by mx.google.com with ESMTPS id b189si15530900ywf.172.2016.01.11.08.37.39 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 11 Jan 2016 08:37:39 -0800 (PST) Received: by mail-yk0-x22d.google.com with SMTP id v14so343395353ykd.3 for ; Mon, 11 Jan 2016 08:37:39 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <56937446.1050308@plexistor.com> References: <569263BA.5060503@plexistor.com> <56937446.1050308@plexistor.com> Date: Mon, 11 Jan 2016 08:37:38 -0800 Message-ID: Subject: Re: [PATCHSET 0/2] Allow single pagefault in write access of a VM_MIXEDMAP mapping From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: "Kirill A. Shutemov" , Andrew Morton , Matthew Wilcox , "linux-mm@kvack.org" , Ross Zwisler , Oleg Nesterov , Mel Gorman , Johannes Weiner On Mon, Jan 11, 2016 at 1:22 AM, Boaz Harrosh wrote: > On 01/11/2016 03:19 AM, Dan Williams wrote: >> On Sun, Jan 10, 2016 at 5:59 AM, Boaz Harrosh wrote: >>> Hi >>> >>> Today any VM_MIXEDMAP or VM_PFN mapping when enabling a write access >>> to their mapping, will have a double pagefault for every write access. >>> >>> This is because vma->vm_page_prot defines how a page/pfn is inserted into >>> the page table (see vma_wants_writenotify in mm/mmap.c). >>> >>> Which means that it is always inserted with read-only under the >>> assumption that we want to be notified when write access occurs. >>> >>> But this is not always true and adds an unnecessary page-fault on >>> every new mmap-write access >>> >>> This patchset is trying to give the fault handler more choice by passing >>> an pgprot_t to vm_insert_mixed() via a new vm_insert_mixed_prot() API. >>> >>> If the mm guys feel that the pgprot_t and its helpers and flags are private >>> to mm/memory.c I can easily do a new: vm_insert_mixed_rw() instead. of the >>> above vm_insert_mixed_prot() which enables any control not only write. 
>>> >>> Following is a patch to DAX to optimize out the extra page-fault. >>> >>> TODO: I only did 4k mapping perhaps 2M mapping can enjoy the same single >>> fault on write access. If interesting to anyone I can attempt a fix. >>> >>> Dan Andrew who needs to pick this up please? >> >> This collides with the patches currently pending in -mm for 4.5, lets >> take a look at this for 4.6. >> > > OK thanks, I will try to work this over current linux-next and sure we > will wait for 4.5-rc1 to look at this again. > > Do you have any comments in general about this? Looks worthwhile at first glance, the only concern that comes to mind is integration with Ross' fsync/msync enabling. How much does this change matter in practice? If the mapping is long standing then I expect this cost gets hidden? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f173.google.com (mail-pf0-f173.google.com [209.85.192.173]) by kanga.kvack.org (Postfix) with ESMTP id BD2AC828EB for ; Mon, 11 Jan 2016 14:35:27 -0500 (EST) Received: by mail-pf0-f173.google.com with SMTP id 65so49269049pff.2 for ; Mon, 11 Jan 2016 11:35:27 -0800 (PST) Received: from mga11.intel.com (mga11.intel.com. [192.55.52.93]) by mx.google.com with ESMTP id f88si30295066pff.80.2016.01.11.11.35.26 for ; Mon, 11 Jan 2016 11:35:26 -0800 (PST) Date: Mon, 11 Jan 2016 14:35:23 -0500 From: Matthew Wilcox Subject: Re: [PATCHSET 0/2] Allow single pagefault in write access of a VM_MIXEDMAP mapping Message-ID: <20160111193523.GA8945@linux.intel.com> References: <569263BA.5060503@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <569263BA.5060503@plexistor.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: "Kirill A. 
Shutemov", Dan Williams, Andrew Morton, "linux-mm@kvack.org", Ross Zwisler, Oleg Nesterov, Mel Gorman, Johannes Weiner

On Sun, Jan 10, 2016 at 03:59:22PM +0200, Boaz Harrosh wrote:
> Today any VM_MIXEDMAP or VM_PFN mapping when enabling a write access
> to their mapping, will have a double pagefault for every write access.
>
> This is because vma->vm_page_prot defines how a page/pfn is inserted into
> the page table (see vma_wants_writenotify in mm/mmap.c).
>
> Which means that it is always inserted with read-only under the
> assumption that we want to be notified when write access occurs.
>
> But this is not always true and adds an unnecessary page-fault on
> every new mmap-write access
>
> This patchset is trying to give the fault handler more choice by passing
> an pgprot_t to vm_insert_mixed() via a new vm_insert_mixed_prot() API.
>
> If the mm guys feel that the pgprot_t and its helpers and flags are private
> to mm/memory.c I can easily do a new: vm_insert_mixed_rw() instead. of the
> above vm_insert_mixed_prot() which enables any control not only write.

We've known about this one for a while, and it's never been terribly
high on the priority list to fix it. This is the obvious way to fix it
but, as you note, it might be seen as increasing the leak between the
abstractions. I would rather see the memory.c code move in the direction
of the huge_memory.c code. How about something like this?

diff --git a/fs/dax.c b/fs/dax.c index a610cbe..09b6c8c 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -498,6 +498,7 @@ EXPORT_SYMBOL_GPL(dax_writeback_mapping_range); static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { + const bool write = vmf->flags & FAULT_FLAG_WRITE; unsigned long vaddr = (unsigned long)vmf->virtual_address; struct address_space *mapping = inode->i_mapping; struct block_device *bdev = bh->b_bdev; @@ -534,12 +535,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, } dax_unmap_atomic(bdev, &dax); - error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false, - vmf->flags & FAULT_FLAG_WRITE); + error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false, write); if (error) goto out; - error = vm_insert_mixed(vma, vaddr, dax.pfn); + error = vmf_insert_pfn(vma, vaddr, dax.pfn, write); out: i_mmap_unlock_read(mapping); @@ -559,7 +559,7 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf, unsigned blkbits = inode->i_blkbits; sector_t block; pgoff_t size; - int error; + int result, error; int major = 0; size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; @@ -660,13 +660,14 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf, * indicate what the callback should do via the uptodate variable, same * as for normal BH based IO completions. 
*/ - error = dax_insert_mapping(inode, &bh, vma, vmf); + result = dax_insert_mapping(inode, &bh, vma, vmf); if (buffer_unwritten(&bh)) { if (complete_unwritten) - complete_unwritten(&bh, !error); + complete_unwritten(&bh, !(result & VM_FAULT_ERROR)); else WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE)); } + return result | major; out: if (error == -ENOMEM) diff --git a/include/linux/mm.h b/include/linux/mm.h index 27dbd1b..a95242c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2170,8 +2170,10 @@ struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr); int remap_pfn_range(struct vm_area_struct *, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t); int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); -int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, +int vm_insert_pfn(struct vm_area_struct *, unsigned long addr, unsigned long pfn); +int vmf_insert_pfn(struct vm_area_struct *, unsigned long addr, + pfn_t pfn, bool write); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, pfn_t pfn); int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); diff --git a/mm/memory.c b/mm/memory.c index 708a0c7c..b93bcba 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1505,8 +1505,15 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL(vm_insert_page); +static pte_t maybe_pte_mkwrite(pte_t pte, struct vm_area_struct *vma) +{ + if (likely(vma->vm_flags & VM_WRITE)) + pte = pte_mkwrite(pte); + return pte; +} + static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, - pfn_t pfn, pgprot_t prot) + pfn_t pfn, pgprot_t prot, bool write) { struct mm_struct *mm = vma->vm_mm; int retval; @@ -1526,6 +1533,10 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, entry = pte_mkdevmap(pfn_t_pte(pfn, prot)); else entry = pte_mkspecial(pfn_t_pte(pfn, prot)); + if (write) { + entry = 
pte_mkyoung(pte_mkdirty(entry)); + entry = maybe_pte_mkwrite(entry, vma); + } set_pte_at(mm, addr, pte, entry); update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */ @@ -1537,26 +1548,28 @@ out: } /** - * vm_insert_pfn - insert single pfn into user vma + * vmf_insert_pfn - insert single pfn into user vma * @vma: user vma to map to * @addr: target user address of this page * @pfn: source kernel pfn + * @write: Whether to insert a writable entry * * Similar to vm_insert_page, this allows drivers to insert individual pages * they've allocated into a user vma. Same comments apply. * * This function should only be called from a vm_ops->fault handler, and - * in that case the handler should return NULL. + * the return value from this function is suitable for returning from that + * handler. * * vma cannot be a COW mapping. * * As this is called only for pages that do not currently exist, we * do not need to flush old virtual caches or the TLB. */ -int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn) +int vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr, + pfn_t pfn, bool write) { - int ret; + int error; pgprot_t pgprot = vma->vm_page_prot; /* * Technically, architectures with pte_special can avoid all these @@ -1568,16 +1581,29 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == (VM_PFNMAP|VM_MIXEDMAP)); BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)); - BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn)); + BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_t_valid(pfn)); if (addr < vma->vm_start || addr >= vma->vm_end) - return -EFAULT; - if (track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV))) - return -EINVAL; + return VM_FAULT_SIGBUS; + if (track_pfn_insert(vma, &pgprot, pfn)) + return VM_FAULT_SIGBUS; - ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot); + error = insert_pfn(vma, addr, pfn, 
pgprot, write); + if (error == -EBUSY || !error) + return VM_FAULT_NOPAGE; + return VM_FAULT_SIGBUS; +} +EXPORT_SYMBOL(vmf_insert_pfn); - return ret; +/* TODO: Convert users to vmf_insert_pfn */ +int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn) +{ + int result = vmf_insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), + false); + if (result & VM_FAULT_ERROR) + return -EFAULT; + return 0; } EXPORT_SYMBOL(vm_insert_pfn); @@ -1602,7 +1628,7 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, page = pfn_t_to_page(pfn); return insert_page(vma, addr, page, vma->vm_page_prot); } - return insert_pfn(vma, addr, pfn, vma->vm_page_prot); + return insert_pfn(vma, addr, pfn, vma->vm_page_prot, false); } EXPORT_SYMBOL(vm_insert_mixed); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f171.google.com (mail-pf0-f171.google.com [209.85.192.171]) by kanga.kvack.org (Postfix) with ESMTP id 54CEB4403D9 for ; Tue, 12 Jan 2016 08:30:01 -0500 (EST) Received: by mail-pf0-f171.google.com with SMTP id 65so63054503pff.2 for ; Tue, 12 Jan 2016 05:30:01 -0800 (PST) Received: from mga11.intel.com (mga11.intel.com. 
[192.55.52.93]) by mx.google.com with ESMTP id rd6si10355132pab.153.2016.01.12.05.30.00 for ; Tue, 12 Jan 2016 05:30:00 -0800 (PST) Date: Tue, 12 Jan 2016 08:29:57 -0500 From: Matthew Wilcox Subject: Re: [PATCHSET 0/2] Allow single pagefault in write access of a VM_MIXEDMAP mapping Message-ID: <20160112132957.GA21285@linux.intel.com> References: <569263BA.5060503@plexistor.com> <20160111193523.GA8945@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160111193523.GA8945@linux.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: "Kirill A. Shutemov" , Dan Williams , Andrew Morton , "linux-mm@kvack.org" , Ross Zwisler , Oleg Nesterov , Mel Gorman , Johannes Weiner On Mon, Jan 11, 2016 at 02:35:23PM -0500, Matthew Wilcox wrote: > I would rather see the memory.c code move in the direction of the > huge_memory.c code. How about something like this? Whoops, missed some bits in the DAX conversion where we'd return an -errno instead of VM_FAULT flags. Take two. diff --git a/fs/dax.c b/fs/dax.c index a610cbe..deff70f 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -498,6 +498,7 @@ EXPORT_SYMBOL_GPL(dax_writeback_mapping_range); static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { + const bool write = vmf->flags & FAULT_FLAG_WRITE; unsigned long vaddr = (unsigned long)vmf->virtual_address; struct address_space *mapping = inode->i_mapping; struct block_device *bdev = bh->b_bdev; @@ -506,7 +507,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, .size = bh->b_size, }; pgoff_t size; - int error; + int result; i_mmap_lock_read(mapping); @@ -518,15 +519,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, * allocated past the end of the file. 
 	 */
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (unlikely(vmf->pgoff >= size)) {
-		error = -EIO;
-		goto out;
-	}
+	if (unlikely(vmf->pgoff >= size))
+		goto sigbus;
 
-	if (dax_map_atomic(bdev, &dax) < 0) {
-		error = PTR_ERR(dax.addr);
-		goto out;
-	}
+	if (dax_map_atomic(bdev, &dax) < 0)
+		goto sigbus;
 
 	if (buffer_unwritten(bh) || buffer_new(bh)) {
 		clear_pmem(dax.addr, PAGE_SIZE);
@@ -534,17 +531,19 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
-			vmf->flags & FAULT_FLAG_WRITE);
-	if (error)
-		goto out;
+	if (dax_radix_entry(mapping, vmf->pgoff, dax.sector, false, write))
+		goto sigbus;
 
-	error = vm_insert_mixed(vma, vaddr, dax.pfn);
+	result = vmf_insert_pfn(vma, vaddr, dax.pfn, write);
 
  out:
 	i_mmap_unlock_read(mapping);
 
-	return error;
+	return result;
+
+ sigbus:
+	result = VM_FAULT_SIGBUS;
+	goto out;
 }
 
 static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
@@ -559,7 +558,7 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	unsigned blkbits = inode->i_blkbits;
 	sector_t block;
 	pgoff_t size;
-	int error;
+	int result, error;
 	int major = 0;
 
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -660,13 +659,14 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 * indicate what the callback should do via the uptodate variable, same
 	 * as for normal BH based IO completions.
 	 */
-	error = dax_insert_mapping(inode, &bh, vma, vmf);
+	result = dax_insert_mapping(inode, &bh, vma, vmf);
 	if (buffer_unwritten(&bh)) {
 		if (complete_unwritten)
-			complete_unwritten(&bh, !error);
+			complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
 		else
 			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
 	}
+	return result | major;
 
  out:
 	if (error == -ENOMEM)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27dbd1b..a95242c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2170,8 +2170,10 @@ struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
-int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
+int vm_insert_pfn(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn);
+int vmf_insert_pfn(struct vm_area_struct *, unsigned long addr,
+			pfn_t pfn, bool write);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
diff --git a/mm/memory.c b/mm/memory.c
index 708a0c7c..b93bcba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1505,8 +1505,15 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_page);
 
+static pte_t maybe_pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	return pte;
+}
+
 static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
-			pfn_t pfn, pgprot_t prot)
+			pfn_t pfn, pgprot_t prot, bool write)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int retval;
@@ -1526,6 +1533,10 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
 	else
 		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
+	if (write) {
+		entry = pte_mkyoung(pte_mkdirty(entry));
+		entry = maybe_pte_mkwrite(entry, vma);
+	}
 	set_pte_at(mm, addr, pte, entry);
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
 
@@ -1537,26 +1548,28 @@ out:
 }
 
 /**
- * vm_insert_pfn - insert single pfn into user vma
+ * vmf_insert_pfn - insert single pfn into user vma
  * @vma: user vma to map to
  * @addr: target user address of this page
  * @pfn: source kernel pfn
+ * @write: Whether to insert a writable entry
  *
  * Similar to vm_insert_page, this allows drivers to insert individual pages
  * they've allocated into a user vma. Same comments apply.
  *
  * This function should only be called from a vm_ops->fault handler, and
- * in that case the handler should return NULL.
+ * the return value from this function is suitable for returning from that
+ * handler.
  *
  * vma cannot be a COW mapping.
  *
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  */
-int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn)
+int vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn, bool write)
 {
-	int ret;
+	int error;
 	pgprot_t pgprot = vma->vm_page_prot;
 	/*
 	 * Technically, architectures with pte_special can avoid all these
@@ -1568,16 +1581,29 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_t_valid(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return -EFAULT;
-	if (track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV)))
-		return -EINVAL;
+		return VM_FAULT_SIGBUS;
+	if (track_pfn_insert(vma, &pgprot, pfn))
+		return VM_FAULT_SIGBUS;
 
-	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
+	error = insert_pfn(vma, addr, pfn, pgprot, write);
+	if (error == -EBUSY || !error)
+		return VM_FAULT_NOPAGE;
+	return VM_FAULT_SIGBUS;
+}
+EXPORT_SYMBOL(vmf_insert_pfn);
 
-	return ret;
+/* TODO: Convert users to vmf_insert_pfn */
+int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
+			unsigned long pfn)
+{
+	int result = vmf_insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV),
+			false);
+	if (result & VM_FAULT_ERROR)
+		return -EFAULT;
+	return 0;
 }
 EXPORT_SYMBOL(vm_insert_pfn);
 
@@ -1602,7 +1628,7 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 		page = pfn_t_to_page(pfn);
 		return insert_page(vma, addr, page, vma->vm_page_prot);
 	}
-	return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
+	return insert_pfn(vma, addr, pfn, vma->vm_page_prot, false);
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
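For readers following the return-value change in the patch above, here is a minimal userspace sketch of the errno to VM_FAULT_* translation. It is not kernel code. The helper names model_vmf_result() and model_legacy_result() are hypothetical, the VM_FAULT_* values are stand-ins chosen for illustration, and VM_FAULT_ERROR is reduced to two bits here.

```c
/* Userspace model only: every name and constant below is a stand-in. */
#include <assert.h>
#include <errno.h>

/* Assumed illustrative values, not taken from the patch. */
#define VM_FAULT_OOM     0x0001
#define VM_FAULT_SIGBUS  0x0002
#define VM_FAULT_NOPAGE  0x0100
/* Reduced error mask for the model; the kernel's mask has more bits. */
#define VM_FAULT_ERROR   (VM_FAULT_OOM | VM_FAULT_SIGBUS)

/*
 * Models the tail of vmf_insert_pfn(): 'error' stands for the value
 * returned by insert_pfn(). Success and -EBUSY both map to
 * VM_FAULT_NOPAGE; anything else maps to VM_FAULT_SIGBUS.
 */
static int model_vmf_result(int error)
{
	if (error == -EBUSY || !error)
		return VM_FAULT_NOPAGE;
	return VM_FAULT_SIGBUS;
}

/*
 * Models the compatibility vm_insert_pfn() wrapper: the fault code is
 * folded back into an errno for callers not yet converted.
 */
static int model_legacy_result(int error)
{
	int result = model_vmf_result(error);

	/* The NOPAGE bit must not overlap the error mask. */
	assert(!(VM_FAULT_NOPAGE & VM_FAULT_ERROR));
	if (result & VM_FAULT_ERROR)
		return -EFAULT;
	return 0;
}
```

Under this model the compatibility wrapper reports 0 for -EBUSY and -EFAULT for every other failure, whereas the removed lines returned insert_pfn()'s errno unchanged. Existing vm_insert_pfn() callers that test for -EBUSY or -ENOMEM may want to be checked against that.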