Linux-mm Archive on lore.kernel.org
* Re: [PATCH 3/4] mm: drop unused argument of zap_page_range()
From: kbuild test robot @ 2016-12-16 17:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kbuild-all, Michal Hocko, Peter Zijlstra, Rik van Riel,
	Andrew Morton, linux-mm, linux-kernel
In-Reply-To: <20161216141556.75130-3-kirill.shutemov@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 4269 bytes --]

Hi Kirill,

[auto build test WARNING on mmotm/master]
[also build test WARNING on v4.9 next-20161216]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161216-231509
base:   git://git.cmpxchg.org/linux-mmotm.git master
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   lib/crc32.c:148: warning: No description found for parameter 'tab)[256]'
   lib/crc32.c:148: warning: Excess function parameter 'tab' description in 'crc32_le_generic'
   lib/crc32.c:293: warning: No description found for parameter 'tab)[256]'
   lib/crc32.c:293: warning: Excess function parameter 'tab' description in 'crc32_be_generic'
   lib/crc32.c:1: warning: no structured comments found
   lib/idr.c:223: warning: No description found for parameter 'start'
   lib/idr.c:223: warning: No description found for parameter 'id'
   lib/idr.c:223: warning: Excess function parameter 'starting_id' description in 'ida_get_new_above'
   lib/idr.c:223: warning: Excess function parameter 'p_id' description in 'ida_get_new_above'
   lib/idr.c:1: warning: no structured comments found
       Was looking for 'IDA description'.
   lib/idr.c:223: warning: No description found for parameter 'start'
   lib/idr.c:223: warning: No description found for parameter 'id'
   lib/idr.c:223: warning: Excess function parameter 'starting_id' description in 'ida_get_new_above'
   lib/idr.c:223: warning: Excess function parameter 'p_id' description in 'ida_get_new_above'
>> mm/memory.c:1379: warning: Excess function parameter 'details' description in 'zap_page_range'
   drivers/pci/msi.c:623: warning: No description found for parameter 'affd'
   drivers/pci/msi.c:623: warning: Excess function parameter 'affinity' description in 'msi_capability_init'

vim +1379 mm/memory.c

f5cc4eef9 Al Viro            2012-03-05  1363  	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
4f74d2c8e Linus Torvalds     2012-05-06  1364  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
cddb8a5c1 Andrea Arcangeli   2008-07-28  1365  	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
^1da177e4 Linus Torvalds     2005-04-16  1366  }
^1da177e4 Linus Torvalds     2005-04-16  1367  
^1da177e4 Linus Torvalds     2005-04-16  1368  /**
^1da177e4 Linus Torvalds     2005-04-16  1369   * zap_page_range - remove user pages in a given range
^1da177e4 Linus Torvalds     2005-04-16  1370   * @vma: vm_area_struct holding the applicable pages
eb4546bbb Randy Dunlap       2012-06-20  1371   * @start: starting address of pages to zap
^1da177e4 Linus Torvalds     2005-04-16  1372   * @size: number of bytes to zap
8a5f14a23 Kirill A. Shutemov 2015-02-10  1373   * @details: details of shared cache invalidation
f5cc4eef9 Al Viro            2012-03-05  1374   *
f5cc4eef9 Al Viro            2012-03-05  1375   * Caller must protect the VMA list
^1da177e4 Linus Torvalds     2005-04-16  1376   */
7e027b14d Linus Torvalds     2012-05-06  1377  void zap_page_range(struct vm_area_struct *vma, unsigned long start,
1ddef4086 Kirill A. Shutemov 2016-12-16  1378  		unsigned long size)
^1da177e4 Linus Torvalds     2005-04-16 @1379  {
^1da177e4 Linus Torvalds     2005-04-16  1380  	struct mm_struct *mm = vma->vm_mm;
d16dfc550 Peter Zijlstra     2011-05-24  1381  	struct mmu_gather tlb;
7e027b14d Linus Torvalds     2012-05-06  1382  	unsigned long end = start + size;
^1da177e4 Linus Torvalds     2005-04-16  1383  
^1da177e4 Linus Torvalds     2005-04-16  1384  	lru_add_drain();
2b047252d Linus Torvalds     2013-08-15  1385  	tlb_gather_mmu(&tlb, mm, start, end);
365e9c87a Hugh Dickins       2005-10-29  1386  	update_hiwater_rss(mm);
7e027b14d Linus Torvalds     2012-05-06  1387  	mmu_notifier_invalidate_range_start(mm, start, end);

:::::: The code at line 1379 was first introduced by commit
:::::: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2

:::::: TO: Linus Torvalds <torvalds@ppc970.osdl.org>
:::::: CC: Linus Torvalds <torvalds@ppc970.osdl.org>
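
The new warning appears because the kernel-doc comment still documents
@details although the parameter is gone from the prototype. A minimal sketch
of the comment with the stale line dropped, assuming nothing else in the
function changes; the struct is only forward-declared and the body is stubbed
so that the fragment compiles on its own:

```c
/* Forward declaration only; the real definition lives in the kernel headers. */
struct vm_area_struct;

/**
 * zap_page_range - remove user pages in a given range
 * @vma: vm_area_struct holding the applicable pages
 * @start: starting address of pages to zap
 * @size: number of bytes to zap
 *
 * Caller must protect the VMA list
 */
void zap_page_range(struct vm_area_struct *vma, unsigned long start,
		unsigned long size)
{
	/* Body elided in this sketch; only the comment above is of interest. */
	(void)vma;
	(void)start;
	(void)size;
}
```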

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6474 bytes --]


* [PATCH v3] arm64: mm: Fix NOMAP page initialization
From: Robert Richter @ 2016-12-16 16:54 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon
  Cc: Ard Biesheuvel, David Daney, Mark Rutland, Hanjun Guo,
	James Morse, Yisheng Xie, Robert Richter, linux-arm-kernel,
	linux-kernel, linux-mm

On ThunderX systems with certain memory configurations we see the
following BUG_ON():

 kernel BUG at mm/page_alloc.c:1848!

This happens for some configs with 64k page size enabled. The BUG_ON()
checks whether the start and end pages of a memmap range belong to the
same zone.

The BUG_ON() check fails if a memory zone contains NOMAP regions. In
this case the node information of those pages is not initialized. This
causes an inconsistency in the page links, with wrong zone and node
information for those pages. NOMAP pages from node 1 still point to the
mem zone from node 0 and have the wrong nid assigned.

The reason for the mis-configuration is a change in pfn_valid() which
reports pages marked NOMAP as invalid:

 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping

As a result, pages marked NOMAP are no longer reassigned to the
new zone in memmap_init_zone() by calling __init_single_pfn().

Fix this by implementing an arm64-specific early_pfn_valid(). This
causes all pages of sections with memory, including NOMAP ranges, to be
initialized by __init_single_page() and ensures consistency of page
links to zone, node and section.

The HAVE_ARCH_PFN_VALID config option now requires an explicit
definition of early_pfn_valid() in the same way as pfn_valid(). This
allows a customized implementation of early_pfn_valid() which
redirects to valid_section() for arm64. This is the same as for the
generic pfn_valid() implementation.

v3:

 * Use valid_section() which is the same as the default pfn_valid()
   implementation to initialize
 * Added Ack for arm/ changes.

v2:

 * Use pfn_present() instead of memblock_is_memory() to support also
   non-memory NOMAP holes

Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Robert Richter <rrichter@cavium.com>
---
 arch/arm/include/asm/page.h   |  1 +
 arch/arm64/include/asm/page.h |  2 ++
 arch/arm64/mm/init.c          | 15 +++++++++++++++
 include/linux/mmzone.h        |  5 ++++-
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
index 4355f0ec44d6..79761bd55f94 100644
--- a/arch/arm/include/asm/page.h
+++ b/arch/arm/include/asm/page.h
@@ -158,6 +158,7 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
 extern int pfn_valid(unsigned long);
+#define early_pfn_valid(pfn)	pfn_valid(pfn)
 #endif
 
 #include <asm/memory.h>
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 8472c6def5ef..17ceb7435ded 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -49,6 +49,8 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
 extern int pfn_valid(unsigned long);
+extern int early_pfn_valid(unsigned long);
+#define early_pfn_valid early_pfn_valid
 #endif
 
 #include <asm/memory.h>
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 212c4d1e2f26..8ff62a7ff634 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -145,11 +145,26 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
+
 int pfn_valid(unsigned long pfn)
 {
 	return memblock_is_map_memory(pfn << PAGE_SHIFT);
 }
 EXPORT_SYMBOL(pfn_valid);
+
+/*
+ * This is the same as the generic pfn_valid() implementation. We use
+ * valid_section() here to make sure all pages of a section including
+ * NOMAP pages are initialized with __init_single_page().
+ */
+int early_pfn_valid(unsigned long pfn)
+{
+	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
+		return 0;
+	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
+}
+EXPORT_SYMBOL(early_pfn_valid);
+
 #endif
 
 #ifndef CONFIG_SPARSEMEM
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f088f3a2fed..bedcf8a95881 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1170,12 +1170,16 @@ static inline struct mem_section *__pfn_to_section(unsigned long pfn)
 }
 
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
+
 static inline int pfn_valid(unsigned long pfn)
 {
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
 	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
 }
+
+#define early_pfn_valid(pfn)	pfn_valid(pfn)
+
 #endif
 
 static inline int pfn_present(unsigned long pfn)
@@ -1200,7 +1204,6 @@ static inline int pfn_present(unsigned long pfn)
 #define pfn_to_nid(pfn)		(0)
 #endif
 
-#define early_pfn_valid(pfn)	pfn_valid(pfn)
 void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
-- 
2.11.0
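
The distinction the patch relies on can be illustrated outside the kernel.
The sketch below is a userspace model, not kernel code: the section size, the
toy memory map and the model_* names are assumptions made for illustration.
It shows a NOMAP page inside a populated section being rejected by a
memblock-style check but accepted by a section-based check, which is what
would let memmap_init_zone() initialize its struct page.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical geometry: 8 pages per section, 4 sections in total. */
#define MODEL_PAGES_PER_SECTION 8UL
#define MODEL_NR_SECTIONS       4UL

struct model_region {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
	bool nomap;		/* true models a NOMAP region */
};

/*
 * Toy memory map: section 0 is fully mapped, section 1 has a NOMAP hole,
 * sections 2 and 3 hold no memory. No region crosses a section boundary.
 */
static const struct model_region model_map[] = {
	{  0,  8, false },
	{  8, 12, false },
	{ 12, 14, true  },
	{ 14, 16, false },
};

#define MODEL_MAP_LEN (sizeof(model_map) / sizeof(model_map[0]))

static unsigned long model_pfn_to_section_nr(unsigned long pfn)
{
	return pfn / MODEL_PAGES_PER_SECTION;
}

/* Models the arm64 pfn_valid(): only memory without NOMAP counts. */
static bool model_pfn_valid(unsigned long pfn)
{
	for (size_t i = 0; i < MODEL_MAP_LEN; i++) {
		const struct model_region *r = &model_map[i];

		if (pfn >= r->start_pfn && pfn < r->end_pfn)
			return !r->nomap;
	}
	return false;
}

/* Models the section-based check: any pfn in a section that holds memory. */
static bool model_early_pfn_valid(unsigned long pfn)
{
	unsigned long nr = model_pfn_to_section_nr(pfn);

	if (nr >= MODEL_NR_SECTIONS)
		return false;
	for (size_t i = 0; i < MODEL_MAP_LEN; i++) {
		/* Regions never span sections here, so start_pfn suffices. */
		if (model_pfn_to_section_nr(model_map[i].start_pfn) == nr)
			return true;
	}
	return false;
}
```

In this model pfn 12 lies in the NOMAP hole: model_pfn_valid(12) is false
while model_early_pfn_valid(12) is true, so an init loop keyed on the latter
would still visit the page.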

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 4/4] oom-reaper: use madvise_dontneed() instead of unmap_page_range()
From: kbuild test robot @ 2016-12-16 16:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kbuild-all, Michal Hocko, Peter Zijlstra, Rik van Riel,
	Andrew Morton, linux-mm, linux-kernel
In-Reply-To: <20161216141556.75130-4-kirill.shutemov@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 1058 bytes --]

Hi Kirill,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.9 next-20161216]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161216-231509
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: parisc-allnoconfig (attached as .config)
compiler: hppa-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=parisc 

All errors (new ones prefixed by >>):

   mm/built-in.o: In function `oom_reaper':
>> mm/oom_kill.o:(.text.oom_reaper+0x114): undefined reference to `madvise_dontneed'
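
The report does not say why the symbol is missing. One plausible cause,
stated here purely as an assumption, is that the file defining
madvise_dontneed() is not built for this configuration, so a caller in
always-built code cannot link. A common kernel idiom for that situation is a
static inline fallback in a header. The sketch below shows the idiom in a
userspace-compilable form; MODEL_CONFIG_FEATURE, model_dontneed(),
model_reap_range() and the error value are hypothetical and not taken from
the patch.

```c
#include <errno.h>

/*
 * MODEL_CONFIG_FEATURE stands in for a Kconfig symbol. It is left
 * undefined here, which models an allnoconfig-style build.
 */
#ifdef MODEL_CONFIG_FEATURE
/* Real implementation, provided by an optionally built object file. */
long model_dontneed(unsigned long start, unsigned long end);
#else
/* Fallback so callers still compile and link when the feature is off. */
static inline long model_dontneed(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	return -ENOSYS;
}
#endif

/* Caller in always-built code; it needs no #ifdef of its own. */
static long model_reap_range(unsigned long start, unsigned long end)
{
	return model_dontneed(start, end);
}
```

Whether a stub, a Kconfig dependency or moving the function is the right fix
depends on what the caller needs when the feature is disabled.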

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 4737 bytes --]


* Re: [PATCH 5/9] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
From: Brian Foster @ 2016-12-16 16:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko
In-Reply-To: <20161215140715.12732-6-mhocko@kernel.org>

On Thu, Dec 15, 2016 at 03:07:11PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
> API to prevent from reclaim recursion into the fs because vmalloc can
> invoke unconditional GFP_KERNEL allocations and these functions might be
> called from the NOFS contexts. The memalloc_noio_save will enforce
> GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
> unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
> provide exactly what we need here - implicit GFP_NOFS context.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/xfs/kmem.c    | 10 +++++-----
>  fs/xfs/xfs_buf.c |  8 ++++----
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f31ae592dcae..5c6f9bd4d8be 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -441,17 +441,17 @@ _xfs_buf_map_pages(
>  		bp->b_addr = NULL;
>  	} else {
>  		int retried = 0;
> -		unsigned noio_flag;
> +		unsigned nofs_flag;
>  
>  		/*
>  		 * vm_map_ram() will allocate auxillary structures (e.g.
>  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
>  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>  		 * memory reclaim re-entering the filesystem here and
>  		 * potentially deadlocking.
>  		 */
> -		noio_flag = memalloc_noio_save();
> +		nofs_flag = memalloc_nofs_save();
>  		do {
>  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>  						-1, PAGE_KERNEL);
> @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
>  				break;
>  			vm_unmap_aliases();
>  		} while (retried++ <= 1);
> -		memalloc_noio_restore(noio_flag);
> +		memalloc_noio_restore(nofs_flag);

memalloc_nofs_restore() ?

Brian

>  
>  		if (!bp->b_addr)
>  			return -ENOMEM;
> -- 
> 2.10.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
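
To see why the restore call questioned in the review matters, here is a
userspace model of scoped save/restore helpers. It assumes the nofs helpers
mirror the existing noio ones (save returns the previous state of one bit,
restore puts that bit back); the bit values and all model_* names are
illustrative and not taken from the series.

```c
/* Illustrative bit values; the kernel's actual values may differ. */
#define MODEL_PF_MEMALLOC_NOFS 0x00020000u
#define MODEL_PF_MEMALLOC_NOIO 0x00080000u

/* Stands in for current->flags. */
static unsigned int model_task_flags;

/* Set the NOFS bit and return its previous state. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_MEMALLOC_NOFS;

	model_task_flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Put the NOFS bit back to the saved state. */
static void model_nofs_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* Put the NOIO bit back to the saved state. */
static void model_noio_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOIO) | old;
}

/* Matched pair: the flag word returns to its initial value. */
static unsigned int model_matched(unsigned int initial)
{
	unsigned int saved;

	model_task_flags = initial;
	saved = model_nofs_save();
	model_nofs_restore(saved);
	return model_task_flags;
}

/* Mismatched pair: nofs save followed by noio restore. */
static unsigned int model_mismatched(unsigned int initial)
{
	unsigned int saved;

	model_task_flags = initial;
	saved = model_nofs_save();
	model_noio_restore(saved);
	return model_task_flags;
}
```

In this model the mismatched pair leaves the NOFS bit set after the scope
ends and also clears a NOIO bit that an outer scope may have set.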


* Re: [PATCH 3/9] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
From: Brian Foster @ 2016-12-16 16:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko
In-Reply-To: <20161215140715.12732-4-mhocko@kernel.org>

On Thu, Dec 15, 2016 at 03:07:09PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more genric name PF_MEMALLOC_NOFS which is in line with an exiting

Typos: generic						     existing

> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
> 
> This patch doesn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Otherwise seems fine to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/kmem.c             |  4 ++--
>  fs/xfs/kmem.h             |  2 +-
>  fs/xfs/libxfs/xfs_btree.c |  2 +-
>  fs/xfs/xfs_aops.c         |  6 +++---
>  fs/xfs/xfs_trans.c        | 12 ++++++------
>  include/linux/sched.h     |  2 ++
>  6 files changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..a76a05dae96b 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>  	 * the filesystem here and potentially deadlocking.
>  	 */
> -	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>  		noio_flag = memalloc_noio_save();
>  
>  	lflags = kmem_flags_convert(flags);
>  	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>  
> -	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>  		memalloc_noio_restore(noio_flag);
>  
>  	return ptr;
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index ea3984091d58..e40ddd12900b 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -51,7 +51,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  		lflags = GFP_ATOMIC | __GFP_NOWARN;
>  	} else {
>  		lflags = GFP_KERNEL | __GFP_NOWARN;
> -		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>  			lflags &= ~__GFP_FS;
>  	}
>  
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 21e6a6ab6b9a..a2672ba4dc33 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
>  	struct xfs_btree_split_args	*args = container_of(work,
>  						struct xfs_btree_split_args, work);
>  	unsigned long		pflags;
> -	unsigned long		new_pflags = PF_FSTRANS;
> +	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
>  
>  	/*
>  	 * we are in a transaction context here, but may also be doing work
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0f56fcd3a5d5..61ca9f9c5a12 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
>  	 * We hand off the transaction to the completion thread now, so
>  	 * clear the flag here.
>  	 */
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	return 0;
>  }
>  
> @@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
>  	 * thus we need to mark ourselves as being in a transaction manually.
>  	 * Similarly for freeze protection.
>  	 */
> -	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
>  
>  	/* we abort the update if there was an IO error */
> @@ -1015,7 +1015,7 @@ xfs_do_writepage(
>  	 * Given that we do not allow direct reclaim to call us, we should
>  	 * never be called while in a filesystem transaction.
>  	 */
> -	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> +	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>  		goto redirty;
>  
>  	/*
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 70f42ea86dfb..f5969c8274fc 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -134,7 +134,7 @@ xfs_trans_reserve(
>  	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
>  
>  	/* Mark this thread as being in a transaction */
> -	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
>  	/*
>  	 * Attempt to reserve the needed disk blocks by decrementing
> @@ -144,7 +144,7 @@ xfs_trans_reserve(
>  	if (blocks > 0) {
>  		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
>  		if (error != 0) {
> -			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  			return -ENOSPC;
>  		}
>  		tp->t_blk_res += blocks;
> @@ -221,7 +221,7 @@ xfs_trans_reserve(
>  		tp->t_blk_res = 0;
>  	}
>  
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
>  	return error;
>  }
> @@ -914,7 +914,7 @@ __xfs_trans_commit(
>  
>  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>  
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free(tp);
>  
>  	/*
> @@ -944,7 +944,7 @@ __xfs_trans_commit(
>  		if (commit_lsn == -1 && !error)
>  			error = -EIO;
>  	}
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
>  	xfs_trans_free(tp);
>  
> @@ -998,7 +998,7 @@ xfs_trans_cancel(
>  		xfs_log_done(mp, tp->t_ticket, NULL, false);
>  
>  	/* mark this thread as no longer being in a transaction */
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
>  	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
>  	xfs_trans_free(tp);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4d1905245c7a..baffd340ea82 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
>  #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
>  #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
>  
> +#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
> +
>  /*
>   * Only the _current_ task can read/write to tsk->flags, but other
>   * tasks can access tsk->flags in readonly mode for example
> -- 
> 2.10.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
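
The quoted hunks rely on the flag being set and restored in a nested fashion
around a transaction. The following userspace sketch models that pattern. It
assumes the helpers save the whole flag word and later restore only the
requested bits; the model_* names and the bit value are hypothetical
stand-ins, not the kernel definitions.

```c
#define MODEL_PF_MEMALLOC_NOFS 0x00020000u	/* illustrative value */

/* Stands in for current->flags. */
static unsigned int model_flags;

/* Save the whole flag word, then set the requested bits. */
static void model_set_flags_nested(unsigned int *save, unsigned int f)
{
	*save = model_flags;
	model_flags |= f;
}

/* Restore only the requested bits to their saved state. */
static void model_restore_flags_nested(const unsigned int *save,
				       unsigned int f)
{
	model_flags = (model_flags & ~f) | (*save & f);
}

/*
 * Run an inner scope nested inside an outer one and report the flag word
 * seen after each scope ends.
 */
static void model_nested_demo(unsigned int *after_inner,
			      unsigned int *after_outer)
{
	unsigned int outer, inner;

	model_flags = 0;
	model_set_flags_nested(&outer, MODEL_PF_MEMALLOC_NOFS);
	model_set_flags_nested(&inner, MODEL_PF_MEMALLOC_NOFS);

	model_restore_flags_nested(&inner, MODEL_PF_MEMALLOC_NOFS);
	*after_inner = model_flags;

	model_restore_flags_nested(&outer, MODEL_PF_MEMALLOC_NOFS);
	*after_outer = model_flags;
}
```

Because each scope restores the bit to what it saw on entry, leaving the
inner scope keeps the flag set for the outer one, and only the outermost
restore clears it.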


* Re: [PATCH 2/9 v2] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives
From: Brian Foster @ 2016-12-16 16:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML
In-Reply-To: <20161216154041.GA7645@dhcp22.suse.cz>

On Fri, Dec 16, 2016 at 04:40:41PM +0100, Michal Hocko wrote:
> Updated patch after Mike noticed a BUG_ON when KM_NOLOCKDEP is used.
> ---
> From 1497e713e11639157aef21cae29052cb3dc7ab44 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 15 Dec 2016 13:06:43 +0100
> Subject: [PATCH] xfs: introduce and use KM_NOLOCKDEP to silence reclaim
>  lockdep false positives
> 
> Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> KM_NOFS tags to keep lockdep happy") and use the new flag for them
> instead. There is really no reason to make these allocations contexts
> weaker just because of the lockdep which even might not be enabled
> in most cases.
> 

Hi Michal,

I haven't gone back to fully grok b17cb364dbbb ("xfs: fix missing
KM_NOFS tags to keep lockdep happy"), so I'm not really familiar with
the original problem. FWIW, there was another KM_NOFS instance added by
that commit in xlog_cil_prepare_log_vecs() that is now in
xlog_cil_alloc_shadow_bufs(). Perhaps Dave can confirm whether the
original issue still applies..?

Brian

> Changes since v1
> - check for KM_NOLOCKDEP in kmem_flags_convert to not hit sanity BUG_ON
>   as per Mike Galbraith
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/xfs/kmem.h                | 6 +++++-
>  fs/xfs/libxfs/xfs_da_btree.c | 4 ++--
>  fs/xfs/xfs_buf.c             | 2 +-
>  fs/xfs/xfs_dir2_readdir.c    | 2 +-
>  4 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 689f746224e7..d5d634ef1f7f 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -33,6 +33,7 @@ typedef unsigned __bitwise xfs_km_flags_t;
>  #define KM_NOFS		((__force xfs_km_flags_t)0x0004u)
>  #define KM_MAYFAIL	((__force xfs_km_flags_t)0x0008u)
>  #define KM_ZERO		((__force xfs_km_flags_t)0x0010u)
> +#define KM_NOLOCKDEP	((__force xfs_km_flags_t)0x0020u)
>  
>  /*
>   * We use a special process flag to avoid recursive callbacks into
> @@ -44,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  {
>  	gfp_t	lflags;
>  
> -	BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
> +	BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
>  
>  	if (flags & KM_NOSLEEP) {
>  		lflags = GFP_ATOMIC | __GFP_NOWARN;
> @@ -57,6 +58,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  	if (flags & KM_ZERO)
>  		lflags |= __GFP_ZERO;
>  
> +	if (flags & KM_NOLOCKDEP)
> +		lflags |= __GFP_NOLOCKDEP;
> +
>  	return lflags;
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> index f2dc1a950c85..b8b5f6914863 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.c
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -2429,7 +2429,7 @@ xfs_buf_map_from_irec(
>  
>  	if (nirecs > 1) {
>  		map = kmem_zalloc(nirecs * sizeof(struct xfs_buf_map),
> -				  KM_SLEEP | KM_NOFS);
> +				  KM_SLEEP | KM_NOLOCKDEP);
>  		if (!map)
>  			return -ENOMEM;
>  		*mapp = map;
> @@ -2488,7 +2488,7 @@ xfs_dabuf_map(
>  		 */
>  		if (nfsb != 1)
>  			irecs = kmem_zalloc(sizeof(irec) * nfsb,
> -					    KM_SLEEP | KM_NOFS);
> +					    KM_SLEEP | KM_NOLOCKDEP);
>  
>  		nirecs = nfsb;
>  		error = xfs_bmapi_read(dp, (xfs_fileoff_t)bno, nfsb, irecs,
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7f0a01f7b592..f31ae592dcae 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1785,7 +1785,7 @@ xfs_alloc_buftarg(
>  {
>  	xfs_buftarg_t		*btp;
>  
> -	btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOFS);
> +	btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOLOCKDEP);
>  
>  	btp->bt_mount = mp;
>  	btp->bt_dev =  bdev->bd_dev;
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 003a99b83bd8..033ed65d7ce6 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -503,7 +503,7 @@ xfs_dir2_leaf_getdents(
>  	length = howmany(bufsize + geo->blksize, (1 << geo->fsblog));
>  	map_info = kmem_zalloc(offsetof(struct xfs_dir2_leaf_map_info, map) +
>  				(length * sizeof(struct xfs_bmbt_irec)),
> -			       KM_SLEEP | KM_NOFS);
> +			       KM_SLEEP | KM_NOLOCKDEP);
>  	map_info->map_size = length;
>  
>  	/*
> -- 
> 2.10.2
> 
> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
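
The flag translation in the quoted kmem.h hunk can be modelled in userspace
as below. This is a sketch under stated assumptions: the KM_NOFS through
KM_NOLOCKDEP values follow the hunk, the KM_SLEEP and KM_NOSLEEP values and
all MODEL_GFP_* bits are illustrative, __GFP_NOWARN is left out for brevity,
and the BUG_ON() is replaced by a boolean check.

```c
#include <stdbool.h>

#define MODEL_KM_SLEEP		0x0001u	/* assumed; not shown in the hunk */
#define MODEL_KM_NOSLEEP	0x0002u	/* assumed; not shown in the hunk */
#define MODEL_KM_NOFS		0x0004u
#define MODEL_KM_MAYFAIL	0x0008u
#define MODEL_KM_ZERO		0x0010u
#define MODEL_KM_NOLOCKDEP	0x0020u

#define MODEL_KM_ALL (MODEL_KM_SLEEP | MODEL_KM_NOSLEEP | MODEL_KM_NOFS | \
		      MODEL_KM_MAYFAIL | MODEL_KM_ZERO | MODEL_KM_NOLOCKDEP)

/* Illustrative allocator-side bits, not the kernel's gfp_t values. */
#define MODEL_GFP_FS		0x01u
#define MODEL_GFP_ZERO		0x02u
#define MODEL_GFP_NOLOCKDEP	0x04u
#define MODEL_GFP_KERNEL	(0x08u | MODEL_GFP_FS)
#define MODEL_GFP_ATOMIC	0x10u

/* Models the sanity check: every bit passed in must be a known flag. */
static bool model_km_flags_ok(unsigned int flags)
{
	return (flags & ~MODEL_KM_ALL) == 0;
}

/* Models the conversion; task_nofs stands in for the per-task NOFS flag. */
static unsigned int model_flags_convert(unsigned int flags, bool task_nofs)
{
	unsigned int lflags;

	if (flags & MODEL_KM_NOSLEEP) {
		lflags = MODEL_GFP_ATOMIC;
	} else {
		lflags = MODEL_GFP_KERNEL;
		if (task_nofs || (flags & MODEL_KM_NOFS))
			lflags &= ~MODEL_GFP_FS;
	}

	if (flags & MODEL_KM_ZERO)
		lflags |= MODEL_GFP_ZERO;

	if (flags & MODEL_KM_NOLOCKDEP)
		lflags |= MODEL_GFP_NOLOCKDEP;

	return lflags;
}
```

In the model, a KM_NOLOCKDEP allocation keeps the FS bit, unlike a KM_NOFS
one, which is the point of the conversions in the quoted patch. Leaving the
new flag out of the accepted mask would make the sanity check reject it.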


* Re: [PATCH 0/9 v2] scope GFP_NOFS api
From: Mike Galbraith @ 2016-12-16 16:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML,
	Peter Zijlstra (Intel)
In-Reply-To: <20161216153502.GP13940@dhcp22.suse.cz>

On Fri, 2016-12-16 at 16:35 +0100, Michal Hocko wrote:
> On Fri 16-12-16 16:05:58, Mike Galbraith wrote:
> > On Thu, 2016-12-15 at 15:07 +0100, Michal Hocko wrote:
> > > Hi,
> > > I have posted the previous version here [1]. Since then I have added a
> > > support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
> > > removing GFP_NOFS usage motivated by the lockdep false positives. On top
> > > of that I've tried to convert few KM_NOFS usages to use the new flag in
> > > the xfs code base. This would need a review from somebody familiar with
> > > xfs of course.
> > 
> > The wild ass guess below prevents the xfs explosion below when running
> > ltp zram tests.
> 
> Yes this looks correct. Thanks for noticing. I will fold it to the
> patch2. Thanks for testing Mike!

I had ulterior motives, was hoping you might have made the irksome RT
gripe below just _go away_, as staring at it ain't working out ;-)

[ 1441.309006] =========================================================
[ 1441.309006] [ INFO: possible irq lock inversion dependency detected ]
[ 1441.309007] 4.10.0-rt9-rt #11 Tainted: G            E  
[ 1441.309007] ---------------------------------------------------------
[ 1441.309008] kswapd0/165 just changed the state of lock:
[ 1441.309009]  (&journal->j_state_lock){+.+.-.}, at: [<ffffffffa00a6d60>] jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309017] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1441.309017]  (&tb->tb6_lock){+.+.+.}
[ 1441.309018] and interrupts could create inverse lock ordering between them.
[ 1441.309018] other info that might help us debug this:
[ 1441.309018] Chain exists of: &journal->j_state_lock --> &journal->j_list_lock --> &tb->tb6_lock
[ 1441.309019]  Possible interrupt unsafe locking scenario:
[ 1441.309019]        CPU0                    CPU1
[ 1441.309019]        ----                    ----
[ 1441.309019]   lock(&tb->tb6_lock);
[ 1441.309020]                                local_irq_disable();
[ 1441.309020]                                lock(&journal->j_state_lock);
[ 1441.309020]                                lock(&journal->j_list_lock);
[ 1441.309021]   <Interrupt>
[ 1441.309021]     lock(&journal->j_state_lock);
[ 1441.309021] *** DEADLOCK ***
[ 1441.309022] 2 locks held by kswapd0/165:
[ 1441.309022]  #0:  (shrinker_rwsem){+.+...}, at: [<ffffffff811efa2a>] shrink_slab+0x7a/0x6c0
[ 1441.309027]  #1:  (&type->s_umount_key#29){+.+.+.}, at: [<ffffffff8126f20b>] trylock_super+0x1b/0x50
[ 1441.309030] the shortest dependencies between 2nd lock and 1st lock:
[ 1441.309031]    -> (&tb->tb6_lock){+.+.+.} ops: 271 {
[ 1441.309032]       HARDIRQ-ON-W at:
[ 1441.309035] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
[ 1441.309036] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309039] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309041] [<ffffffff816e66e3>] __ip6_ins_rt+0x33/0x70
[ 1441.309043] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
[ 1441.309044] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
[ 1441.309046] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
[ 1441.309047] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309049] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309051] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
[ 1441.309053] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
[ 1441.309054] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
[ 1441.309055] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
[ 1441.309057] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
[ 1441.309058] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
[ 1441.309060] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
[ 1441.309061] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309061]       SOFTIRQ-ON-W at:
[ 1441.309063] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
[ 1441.309064] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309064] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309065] [<ffffffff816e66e3>] __ip6_ins_rt+0x33/0x70
[ 1441.309067] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
[ 1441.309067] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
[ 1441.309068] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
[ 1441.309069] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309071] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309073] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
[ 1441.309074] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
[ 1441.309075] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
[ 1441.309077] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
[ 1441.309078] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
[ 1441.309079] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
[ 1441.309080] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
[ 1441.309081] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309081]       RECLAIM_FS-ON-W at:
[ 1441.309082] [<ffffffff810e0316>] mark_held_locks+0x66/0x90
[ 1441.309084] [<ffffffff810e34a8>] lockdep_trace_alloc+0xd8/0x120
[ 1441.309085] [<ffffffff81246df6>] kmem_cache_alloc_node+0x36/0x310
[ 1441.309086] [<ffffffff815dfd4e>] __alloc_skb+0x4e/0x280
[ 1441.309088] [<ffffffff816ee6ac>] inet6_rt_notify+0x5c/0x130
[ 1441.309089] [<ffffffff816f101b>] fib6_add+0x56b/0xa30
[ 1441.309090] [<ffffffff816e66f8>] __ip6_ins_rt+0x48/0x70
[ 1441.309091] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
[ 1441.309092] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
[ 1441.309093] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
[ 1441.309094] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309096] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309097] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
[ 1441.309098] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
[ 1441.309099] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
[ 1441.309100] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
[ 1441.309102] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
[ 1441.309103] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
[ 1441.309104] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
[ 1441.309105] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309105]       INITIAL USE at:
[ 1441.309106] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
[ 1441.309107] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309108] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309109] [<ffffffff816e66e3>] __ip6_ins_rt+0x33/0x70
[ 1441.309110] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
[ 1441.309111] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
[ 1441.309112] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
[ 1441.309113] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309115] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309116] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
[ 1441.309117] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
[ 1441.309118] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
[ 1441.309119] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
[ 1441.309120] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
[ 1441.309121] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
[ 1441.309122] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
[ 1441.309123] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309123]     }
[ 1441.309125]     ... key      at: [<ffffffff82dd96e0>] __key.59908+0x0/0x8
[ 1441.309125]     ... acquired at:
[ 1441.309126] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309127] [<ffffffff8174d307>] rt_read_lock+0x47/0x60
[ 1441.309128] [<ffffffff816ea541>] ip6_pol_route+0x61/0xa60
[ 1441.309130] [<ffffffff816eaf5a>] ip6_pol_route_input+0x1a/0x20
[ 1441.309131] [<ffffffff81718f21>] fib6_rule_action+0xa1/0x1e0
[ 1441.309133] [<ffffffff81621e53>] fib_rules_lookup+0x153/0x2e0
[ 1441.309134] [<ffffffff81719219>] fib6_rule_lookup+0x59/0xc0
[ 1441.309135] [<ffffffff816e6a3e>] ip6_route_input_lookup+0x4e/0x60
[ 1441.309136] [<ffffffff816ec59d>] ip6_route_input+0xdd/0x1a0
[ 1441.309137] [<ffffffff816d87d0>] ip6_rcv_finish+0x60/0x200
[ 1441.309139] [<ffffffffa09e00b0>] ip_sabotage_in+0x30/0x40 [br_netfilter]
[ 1441.309141] [<ffffffff8163c7ac>] nf_hook_slow+0x2c/0xf0
[ 1441.309142] [<ffffffff816d998a>] ipv6_rcv+0x72a/0x980
[ 1441.309143] [<ffffffff815f81ef>] __netif_receive_skb_core+0x38f/0xd20
[ 1441.309144] [<ffffffff815f8b98>] __netif_receive_skb+0x18/0x60
[ 1441.309145] [<ffffffff815fa4c1>] netif_receive_skb_internal+0x61/0x1d0
[ 1441.309147] [<ffffffff815fa668>] netif_receive_skb+0x38/0x180
[ 1441.309151] [<ffffffffa09b77e5>] br_pass_frame_up+0xd5/0x2c0 [bridge]
[ 1441.309154] [<ffffffffa09b7d66>] br_handle_frame_finish+0x256/0x5c0 [bridge]
[ 1441.309156] [<ffffffffa09e159c>] br_nf_hook_thresh+0xac/0x220 [br_netfilter]
[ 1441.309157] [<ffffffffa09e2ee3>] br_nf_pre_routing_finish_ipv6+0x1c3/0x340 [br_netfilter]
[ 1441.309158] [<ffffffffa09e349d>] br_nf_pre_routing_ipv6+0xdd/0x27a [br_netfilter]
[ 1441.309159] [<ffffffffa09e2942>] br_nf_pre_routing+0x1b2/0x540 [br_netfilter]
[ 1441.309160] [<ffffffff8163c7ac>] nf_hook_slow+0x2c/0xf0
[ 1441.309163] [<ffffffffa09b82f7>] br_handle_frame+0x227/0x5b0 [bridge]
[ 1441.309164] [<ffffffff815f8036>] __netif_receive_skb_core+0x1d6/0xd20
[ 1441.309165] [<ffffffff815f8b98>] __netif_receive_skb+0x18/0x60
[ 1441.309166] [<ffffffff815fa4c1>] netif_receive_skb_internal+0x61/0x1d0
[ 1441.309167] [<ffffffff815fbca2>] napi_gro_receive+0x192/0x250
[ 1441.309171] [<ffffffffa03f2163>] rtl8169_poll+0x183/0x6a0 [r8169]
[ 1441.309172] [<ffffffff815faea0>] net_rx_action+0x3b0/0x700
[ 1441.309173] [<ffffffff810841a5>] do_current_softirqs+0x285/0x680
[ 1441.309174] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
[ 1441.309177] [<ffffffff810f7d81>] irq_forced_thread_fn+0x41/0x60
[ 1441.309178] [<ffffffff810f832f>] irq_thread+0x13f/0x1e0
[ 1441.309179] [<ffffffff810a5ecc>] kthread+0x10c/0x140
[ 1441.309180] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309181]   -> (&per_cpu(local_softirq_locks[i], __cpu).lock){+.+...} ops: 3582145 {
[ 1441.309182]      HARDIRQ-ON-W at:
[ 1441.309183] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
[ 1441.309184] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309185] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
[ 1441.309186] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
[ 1441.309187] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
[ 1441.309188] [<ffffffff8113c631>] cgroup_idr_alloc.constprop.41+0x61/0x80
[ 1441.309190] [<ffffffff811d4d5b>] cgroup_setup_root+0x65/0x28f
[ 1441.309191] [<ffffffff81d9ef88>] cgroup_init+0xf7/0x3e5
[ 1441.309193] [<ffffffff81d770d1>] start_kernel+0x43f/0x484
[ 1441.309194] [<ffffffff81d76599>] x86_64_start_reservations+0x2a/0x2c
[ 1441.309195] [<ffffffff81d766d8>] x86_64_start_kernel+0x13d/0x14c
[ 1441.309196] [<ffffffff810001b5>] start_cpu+0x5/0x14
[ 1441.309196]      SOFTIRQ-ON-W at:
[ 1441.309197] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
[ 1441.309198] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309199] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
[ 1441.309200] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
[ 1441.309201] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
[ 1441.309202] [<ffffffff8113c631>] cgroup_idr_alloc.constprop.41+0x61/0x80
[ 1441.309203] [<ffffffff811d4d5b>] cgroup_setup_root+0x65/0x28f
[ 1441.309204] [<ffffffff81d9ef88>] cgroup_init+0xf7/0x3e5
[ 1441.309204] [<ffffffff81d770d1>] start_kernel+0x43f/0x484
[ 1441.309205] [<ffffffff81d76599>] x86_64_start_reservations+0x2a/0x2c
[ 1441.309206] [<ffffffff81d766d8>] x86_64_start_kernel+0x13d/0x14c
[ 1441.309207] [<ffffffff810001b5>] start_cpu+0x5/0x14
[ 1441.309207]      INITIAL USE at:
[ 1441.309208] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
[ 1441.309209] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309210] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
[ 1441.309211] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
[ 1441.309212] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
[ 1441.309213] [<ffffffff8113c631>] cgroup_idr_alloc.constprop.41+0x61/0x80
[ 1441.309213] [<ffffffff811d4d5b>] cgroup_setup_root+0x65/0x28f
[ 1441.309215] [<ffffffff81d9ef88>] cgroup_init+0xf7/0x3e5
[ 1441.309216] [<ffffffff81d770d1>] start_kernel+0x43f/0x484
[ 1441.309216] [<ffffffff81d76599>] x86_64_start_reservations+0x2a/0x2c
[ 1441.309217] [<ffffffff81d766d8>] x86_64_start_kernel+0x13d/0x14c
[ 1441.309218] [<ffffffff810001b5>] start_cpu+0x5/0x14
[ 1441.309218]    }
[ 1441.309220]    ... key      at: [<ffffffff81f6c110>] __key.38555+0x0/0x8
[ 1441.309220]    ... acquired at:
[ 1441.309221] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309222] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
[ 1441.309222] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
[ 1441.309223] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
[ 1441.309225] [<ffffffff81200bc9>] wb_wakeup_delayed+0x69/0x70
[ 1441.309226] [<ffffffff812a0aab>] __mark_inode_dirty+0x60b/0x7c0
[ 1441.309227] [<ffffffff812ab235>] mark_buffer_dirty+0xb5/0x240
[ 1441.309231] [<ffffffffa009ba1d>] __jbd2_journal_temp_unlink_buffer+0xbd/0xe0 [jbd2]
[ 1441.309233] [<ffffffffa009e40a>] __jbd2_journal_refile_buffer+0xba/0xe0 [jbd2]
[ 1441.309235] [<ffffffffa009fbed>] jbd2_journal_commit_transaction+0x112d/0x2130 [jbd2]
[ 1441.309237] [<ffffffffa00a53dd>] kjournald2+0xcd/0x270 [jbd2]
[ 1441.309239] [<ffffffff810a5ecc>] kthread+0x10c/0x140
[ 1441.309239] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309240]  -> (&journal->j_list_lock){+.+...} ops: 587416 {
[ 1441.309241]     HARDIRQ-ON-W at:
[ 1441.309242] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
[ 1441.309243] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309244] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
[ 1441.309246] [<ffffffffa009d529>] do_get_write_access+0x3b9/0x5c0 [jbd2]
[ 1441.309248] [<ffffffffa009d761>] jbd2_journal_get_write_access+0x31/0x60 [jbd2]
[ 1441.309259] [<ffffffffa0105259>] __ext4_journal_get_write_access+0x49/0x90 [ext4]
[ 1441.309264] [<ffffffffa00c5cb2>] ext4_file_open+0x1c2/0x230 [ext4]
[ 1441.309265] [<ffffffff81268281>] do_dentry_open+0x231/0x360
[ 1441.309266] [<ffffffff81269642>] vfs_open+0x52/0x80
[ 1441.309268] [<ffffffff8127b206>] path_openat+0x476/0xdd0
[ 1441.309269] [<ffffffff8127d64e>] do_filp_open+0x7e/0xd0
[ 1441.309270] [<ffffffff812723b7>] do_open_execat+0x67/0x150
[ 1441.309271] [<ffffffff812739ec>] do_execveat_common.isra.34+0x25c/0x9a0
[ 1441.309272] [<ffffffff8127415c>] do_execve+0x2c/0x30
[ 1441.309274] [<ffffffff81098c46>] call_usermodehelper_exec_async+0xf6/0x130
[ 1441.309274] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309275]     SOFTIRQ-ON-W at:
[ 1441.309276] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
[ 1441.309277] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309277] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
[ 1441.309279] [<ffffffffa009d529>] do_get_write_access+0x3b9/0x5c0 [jbd2]
[ 1441.309281] [<ffffffffa009d761>] jbd2_journal_get_write_access+0x31/0x60 [jbd2]
[ 1441.309288] [<ffffffffa0105259>] __ext4_journal_get_write_access+0x49/0x90 [ext4]
[ 1441.309293] [<ffffffffa00c5cb2>] ext4_file_open+0x1c2/0x230 [ext4]
[ 1441.309294] [<ffffffff81268281>] do_dentry_open+0x231/0x360
[ 1441.309295] [<ffffffff81269642>] vfs_open+0x52/0x80
[ 1441.309296] [<ffffffff8127b206>] path_openat+0x476/0xdd0
[ 1441.309297] [<ffffffff8127d64e>] do_filp_open+0x7e/0xd0
[ 1441.309298] [<ffffffff812723b7>] do_open_execat+0x67/0x150
[ 1441.309299] [<ffffffff812739ec>] do_execveat_common.isra.34+0x25c/0x9a0
[ 1441.309300] [<ffffffff8127415c>] do_execve+0x2c/0x30
[ 1441.309301] [<ffffffff81098c46>] call_usermodehelper_exec_async+0xf6/0x130
[ 1441.309302] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309302]     INITIAL USE at:
[ 1441.309303] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
[ 1441.309304] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309305] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
[ 1441.309307] [<ffffffffa009d529>] do_get_write_access+0x3b9/0x5c0 [jbd2]
[ 1441.309308] [<ffffffffa009d761>] jbd2_journal_get_write_access+0x31/0x60 [jbd2]
[ 1441.309314] [<ffffffffa0105259>] __ext4_journal_get_write_access+0x49/0x90 [ext4]
[ 1441.309319] [<ffffffffa00c5cb2>] ext4_file_open+0x1c2/0x230 [ext4]
[ 1441.309319] [<ffffffff81268281>] do_dentry_open+0x231/0x360
[ 1441.309320] [<ffffffff81269642>] vfs_open+0x52/0x80
[ 1441.309321] [<ffffffff8127b206>] path_openat+0x476/0xdd0
[ 1441.309322] [<ffffffff8127d64e>] do_filp_open+0x7e/0xd0
[ 1441.309323] [<ffffffff812723b7>] do_open_execat+0x67/0x150
[ 1441.309324] [<ffffffff812739ec>] do_execveat_common.isra.34+0x25c/0x9a0
[ 1441.309325] [<ffffffff8127415c>] do_execve+0x2c/0x30
[ 1441.309326] [<ffffffff81098c46>] call_usermodehelper_exec_async+0xf6/0x130
[ 1441.309327] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309327]   }
[ 1441.309330]   ... key      at: [<ffffffffa00af5c0>] __key.47251+0x0/0xffffffffffff9a40 [jbd2]
[ 1441.309330]   ... acquired at:
[ 1441.309331] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309331] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
[ 1441.309333] [<ffffffffa009ee25>] jbd2_journal_commit_transaction+0x365/0x2130 [jbd2]
[ 1441.309335] [<ffffffffa00a53dd>] kjournald2+0xcd/0x270 [jbd2]
[ 1441.309337] [<ffffffff810a5ecc>] kthread+0x10c/0x140
[ 1441.309337] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309338] -> (&journal->j_state_lock){+.+.-.} ops: 5849939 {
[ 1441.309339]    HARDIRQ-ON-W at:
[ 1441.309340] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
[ 1441.309341] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309341] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309348] [<ffffffffa00e9c0d>] ext4_init_journal_params+0x4d/0xc0 [ext4]
[ 1441.309353] [<ffffffffa00f515e>] ext4_fill_super+0x1e4e/0x3770 [ext4]
[ 1441.309355] [<ffffffff8126ee4a>] mount_bdev+0x18a/0x1c0
[ 1441.309360] [<ffffffffa00e9705>] ext4_mount+0x15/0x20 [ext4]
[ 1441.309361] [<ffffffff8126fa79>] mount_fs+0x39/0x170
[ 1441.309362] [<ffffffff81290817>] vfs_kern_mount+0x67/0x130
[ 1441.309363] [<ffffffff812939fb>] do_mount+0x1bb/0xc60
[ 1441.309364] [<ffffffff81294773>] SyS_mount+0x83/0xd0
[ 1441.309365] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309365]    SOFTIRQ-ON-W at:
[ 1441.309366] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
[ 1441.309367] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309368] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309373] [<ffffffffa00e9c0d>] ext4_init_journal_params+0x4d/0xc0 [ext4]
[ 1441.309378] [<ffffffffa00f515e>] ext4_fill_super+0x1e4e/0x3770 [ext4]
[ 1441.309379] [<ffffffff8126ee4a>] mount_bdev+0x18a/0x1c0
[ 1441.309383] [<ffffffffa00e9705>] ext4_mount+0x15/0x20 [ext4]
[ 1441.309385] [<ffffffff8126fa79>] mount_fs+0x39/0x170
[ 1441.309385] [<ffffffff81290817>] vfs_kern_mount+0x67/0x130
[ 1441.309386] [<ffffffff812939fb>] do_mount+0x1bb/0xc60
[ 1441.309387] [<ffffffff81294773>] SyS_mount+0x83/0xd0
[ 1441.309388] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309388]    IN-RECLAIM_FS-W at:
[ 1441.309390] [<ffffffff810e0b36>] __lock_acquire+0x2b6/0x1770
[ 1441.309391] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309391] [<ffffffff8174d307>] rt_read_lock+0x47/0x60
[ 1441.309394] [<ffffffffa00a6d60>] jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309398] [<ffffffffa00d472e>] ext4_evict_inode+0x37e/0x700 [ext4]
[ 1441.309400] [<ffffffff8128b221>] evict+0xd1/0x1a0
[ 1441.309401] [<ffffffff8128b33d>] dispose_list+0x4d/0x70
[ 1441.309402] [<ffffffff8128c60b>] prune_icache_sb+0x4b/0x60
[ 1441.309404] [<ffffffff8126f381>] super_cache_scan+0x141/0x190
[ 1441.309405] [<ffffffff811efc27>] shrink_slab+0x277/0x6c0
[ 1441.309406] [<ffffffff811f4523>] shrink_node+0x2e3/0x2f0
[ 1441.309407] [<ffffffff811f5a7f>] kswapd+0x34f/0x980
[ 1441.309409] [<ffffffff810a5ecc>] kthread+0x10c/0x140
[ 1441.309409] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309410]    INITIAL USE at:
[ 1441.309411] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
[ 1441.309412] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309412] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309417] [<ffffffffa00e9c0d>] ext4_init_journal_params+0x4d/0xc0 [ext4]
[ 1441.309422] [<ffffffffa00f515e>] ext4_fill_super+0x1e4e/0x3770 [ext4]
[ 1441.309423] [<ffffffff8126ee4a>] mount_bdev+0x18a/0x1c0
[ 1441.309427] [<ffffffffa00e9705>] ext4_mount+0x15/0x20 [ext4]
[ 1441.309429] [<ffffffff8126fa79>] mount_fs+0x39/0x170
[ 1441.309429] [<ffffffff81290817>] vfs_kern_mount+0x67/0x130
[ 1441.309430] [<ffffffff812939fb>] do_mount+0x1bb/0xc60
[ 1441.309431] [<ffffffff81294773>] SyS_mount+0x83/0xd0
[ 1441.309432] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309432]  }
[ 1441.309435]  ... key      at: [<ffffffffa00af5b0>] __key.47253+0x0/0xffffffffffff9a50 [jbd2]
[ 1441.309435]  ... acquired at:
[ 1441.309436] [<ffffffff810df99e>] check_usage_forwards+0x11e/0x120
[ 1441.309437] [<ffffffff810e01a8>] mark_lock+0x1e8/0x2f0
[ 1441.309437] [<ffffffff810e0b36>] __lock_acquire+0x2b6/0x1770
[ 1441.309438] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309439] [<ffffffff8174d307>] rt_read_lock+0x47/0x60
[ 1441.309441] [<ffffffffa00a6d60>] jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309446] [<ffffffffa00d472e>] ext4_evict_inode+0x37e/0x700 [ext4]
[ 1441.309447] [<ffffffff8128b221>] evict+0xd1/0x1a0
[ 1441.309448] [<ffffffff8128b33d>] dispose_list+0x4d/0x70
[ 1441.309449] [<ffffffff8128c60b>] prune_icache_sb+0x4b/0x60
[ 1441.309450] [<ffffffff8126f381>] super_cache_scan+0x141/0x190
[ 1441.309451] [<ffffffff811efc27>] shrink_slab+0x277/0x6c0
[ 1441.309452] [<ffffffff811f4523>] shrink_node+0x2e3/0x2f0
[ 1441.309453] [<ffffffff811f5a7f>] kswapd+0x34f/0x980
[ 1441.309454] [<ffffffff810a5ecc>] kthread+0x10c/0x140
[ 1441.309455] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309455] stack backtrace:
[ 1441.309457] CPU: 0 PID: 165 Comm: kswapd0 Tainted: G            E   4.10.0-rt9-rt #11
[ 1441.309457] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 1441.309457] Call Trace:
[ 1441.309459]  dump_stack+0x85/0xc8
[ 1441.309461]  print_irq_inversion_bug.part.34+0x1ac/0x1b8
[ 1441.309462]  check_usage_forwards+0x11e/0x120
[ 1441.309463]  ? check_usage_backwards+0x120/0x120
[ 1441.309463]  mark_lock+0x1e8/0x2f0
[ 1441.309464]  __lock_acquire+0x2b6/0x1770
[ 1441.309465]  ? __lock_acquire+0x420/0x1770
[ 1441.309466]  lock_acquire+0xd4/0x270
[ 1441.309468]  ? jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309469]  rt_read_lock+0x47/0x60
[ 1441.309471]  ? jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309472]  jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309477]  ext4_evict_inode+0x37e/0x700 [ext4]
[ 1441.309478]  evict+0xd1/0x1a0
[ 1441.309479]  dispose_list+0x4d/0x70
[ 1441.309480]  prune_icache_sb+0x4b/0x60
[ 1441.309481]  super_cache_scan+0x141/0x190
[ 1441.309482]  shrink_slab+0x277/0x6c0
[ 1441.309483]  shrink_node+0x2e3/0x2f0
[ 1441.309485]  kswapd+0x34f/0x980
[ 1441.309487]  kthread+0x10c/0x140
[ 1441.309488]  ? mem_cgroup_shrink_node+0x390/0x390
[ 1441.309488]  ? kthread_park+0x90/0x90
[ 1441.309489]  ret_from_fork+0x2a/0x40

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v4 1/3] dax: masking off __GFP_FS in fs DAX handlers
From: Ross Zwisler @ 2016-12-16 16:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dave Jiang, akpm, jack, linux-nvdimm, hch, linux-mm, tytso,
	ross.zwisler, dan.j.williams
In-Reply-To: <20161216010730.GY4219@dastard>

On Fri, Dec 16, 2016 at 12:07:30PM +1100, Dave Chinner wrote:
> On Thu, Dec 15, 2016 at 04:40:41PM -0700, Dave Jiang wrote:
> > The caller into dax needs to clear __GFP_FS mask bit since it's
> > responsible for acquiring locks / transactions that blocks __GFP_FS
> > allocation.  The caller will restore the original mask when dax function
> > returns.
> 
> What's the allocation problem you're working around here? Can you
> please describe the call chain that is the problem?
> 
> >  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> >  
> >  	if (IS_DAX(inode)) {
> > +		gfp_t old_gfp = vmf->gfp_mask;
> > +
> > +		vmf->gfp_mask &= ~__GFP_FS;
> >  		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
> > +		vmf->gfp_mask = old_gfp;
> 
> I really have to say that I hate code that clears and restores flags
> without any explanation of why the code needs to play flag tricks. I
> take one look at the XFS fault handling code and ask myself now "why
> the hell do we need to clear those flags?" Especially as the other
> paths into generic fault handlers /don't/ require us to do this.
> What does DAX do that require us to treat memory allocation contexts
> differently to the filemap_fault() path?

This was done in response to Jan Kara's concern:

  The gfp_mask that propagates from __do_fault() or do_page_mkwrite() is fine
  because at that point it is correct. But once we grab filesystem locks which
  are not reclaim safe, we should update vmf->gfp_mask we pass further down
  into DAX code to not contain __GFP_FS (that's a bug we apparently have
  there). And inside DAX code, we definitely are not generally safe to add
  __GFP_FS to mapping_gfp_mask(). Maybe we'd be better off propagating struct
  vm_fault into this function, using passed gfp_mask there and make sure
  callers update gfp_mask as appropriate.

https://lkml.org/lkml/2016/10/4/37

IIUC I think the concern is that, for example, in xfs_filemap_page_mkwrite()
we take a read lock on the struct inode.i_rwsem before we call
dax_iomap_fault().

dax_iomap_fault() then calls find_or_create_page(), etc. with the
vmf->gfp_mask we were given.

I believe the concern is that if that memory allocation tries to do FS
operations to free memory because __GFP_FS is part of the gfp mask, then we
could end up deadlocking because we are already holding FS locks.
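
To make the pattern under discussion concrete, here is a minimal userspace
sketch of "clear __GFP_FS while FS locks are held, then restore it". It is
not the XFS or DAX code: the demo_* names and the flag values are made up
for illustration, and demo_dax_fault() merely stands in for
dax_iomap_fault().

```c
typedef unsigned int gfp_t;

/* Hypothetical flag values for this sketch; not the kernel's definitions. */
#define DEMO_GFP_IO 0x1u
#define DEMO_GFP_FS 0x2u

struct demo_vm_fault {
	gfp_t gfp_mask;
};

/* Records the mask the callee saw so that a caller can inspect it later. */
static gfp_t demo_mask_seen_by_callee;

/* Stand-in for dax_iomap_fault(): it only notes which mask it was given. */
static int demo_dax_fault(struct demo_vm_fault *vmf)
{
	demo_mask_seen_by_callee = vmf->gfp_mask;
	return 0;
}

/*
 * Caller that is assumed to hold a filesystem lock already. Allocations
 * made below must not recurse into the filesystem, so the FS bit is
 * dropped for the duration of the call and put back afterwards.
 */
static int demo_fault_with_fs_lock_held(struct demo_vm_fault *vmf)
{
	gfp_t old_gfp = vmf->gfp_mask;
	int ret;

	vmf->gfp_mask &= ~DEMO_GFP_FS;
	ret = demo_dax_fault(vmf);
	vmf->gfp_mask = old_gfp;

	return ret;
}
```

The callee only ever sees the mask without the FS bit, while the caller's
original mask is intact again once the call returns.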


^ permalink raw reply

* Re: [PATCH 4/4] oom-reaper: use madvise_dontneed() instead of unmap_page_range()
From: kbuild test robot @ 2016-12-16 16:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kbuild-all, Michal Hocko, Peter Zijlstra, Rik van Riel,
	Andrew Morton, linux-mm, linux-kernel
In-Reply-To: <20161216141556.75130-4-kirill.shutemov@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 879 bytes --]

Hi Kirill,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.9 next-20161216]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161216-231509
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-s0-201650 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   mm/built-in.o: In function `oom_reaper':
>> oom_kill.c:(.text+0x4869): undefined reference to `madvise_dontneed'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28732 bytes --]

^ permalink raw reply

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Andrea Arcangeli @ 2016-12-16 16:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Li, Liang Z, David Hildenbrand, kvm@vger.kernel.org,
	mhocko@suse.com, mst@redhat.com, linux-kernel@vger.kernel.org,
	qemu-devel@nongnu.org, linux-mm@kvack.org, dgilbert@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	kirill.shutemov@linux.intel.com
In-Reply-To: <84ac9822-880d-b998-52ca-6aa87e0f7a43@intel.com>

On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > 
> > Use 52 bits for 'pfn', 12 bits for 'length', when the 12 bits is not long enough for the 'length'
> > Set the 'length' to a special value to indicate the "actual length in next 8 bytes".
> > 
> > That will be much more simple. Right?
> 
> Sounds fine to me.
> 

Sounds fine to me too indeed.

I'm only wondering what the major point is of compressing gpfn+len into
8 bytes in the common case. You already use sg_init_table to send down
two pages; we could send three as well and avoid all the math, bit
shifts and ORs, or not?

I agree with the above because from a performance perspective I tend
to think the above proposal will run at least theoretically faster:
the other way wastes double the amount of CPU cache, and the bit
mangling in the encoding and the later decoding on the qemu side should
be faster than accessing an array of double the size. But then I'm not
sure if it's a measurable optimization. So I'd be curious to know the
exact motivation, and whether it is to reduce the CPU cache usage or if
there's some other fundamental reason to compress it.

The header already tells qemu how big the array payload is; couldn't
we just add more pages if one isn't enough?
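
For reference, the encoding proposed above could look roughly like the
sketch below. This is only an illustration under assumptions of mine: the
thread does not say which bits hold the length or which value is the
escape marker, so placing 'length' in the low 12 bits and using the
all-ones value as "actual length in next 8 bytes" is a guess, and the
demo_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_LEN_BITS 12
#define DEMO_LEN_MASK ((UINT64_C(1) << DEMO_LEN_BITS) - 1)
/* Assumed escape value: the real length is stored in the next word. */
#define DEMO_LEN_EXT  DEMO_LEN_MASK

/*
 * Encode one (pfn, len) range. pfn is assumed to fit in 52 bits.
 * Returns the number of 64-bit words written to out (1 or 2).
 */
static size_t demo_encode(uint64_t pfn, uint64_t len, uint64_t out[2])
{
	if (len < DEMO_LEN_EXT) {
		out[0] = (pfn << DEMO_LEN_BITS) | len;
		return 1;
	}

	out[0] = (pfn << DEMO_LEN_BITS) | DEMO_LEN_EXT;
	out[1] = len;
	return 2;
}

/* Decode one range. Returns the number of 64-bit words consumed. */
static size_t demo_decode(const uint64_t in[2], uint64_t *pfn, uint64_t *len)
{
	*pfn = in[0] >> DEMO_LEN_BITS;
	*len = in[0] & DEMO_LEN_MASK;
	if (*len != DEMO_LEN_EXT)
		return 1;

	*len = in[1];
	return 2;
}
```

The common case costs one word plus a shift and a mask on each side; only
ranges whose length does not fit in 12 bits take a second word.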

Thanks,
Andrea


^ permalink raw reply

* [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
From: Michal Hocko @ 2016-12-16 15:58 UTC (permalink / raw)
  To: Nils Holland
  Cc: linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs,
	Michal Hocko
In-Reply-To: <20161216155808.12809-1-mhocko@kernel.org>

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom makes sure to skip the OOM killer depending on
the allocation request. This includes lowmem requests, costly high
order requests and others. For a long time __GFP_NOFAIL acted as an
override for all those rules. This is not documented and it can be quite
surprising as well. E.g. GFP_NOFS requests do not invoke the OOM
killer but GFP_NOFS|__GFP_NOFAIL requests do. So if we try to convert
some of the existing open coded loops around the allocator to nofail
requests (and we have done that in the past) then such a change would
have a non-trivial side effect which is not obvious. Note that the
primary motivation for skipping the OOM killer is to prevent its
premature invocation.

The exception has been added by 82553a937f12 ("oom: invoke oom killer
for __GFP_NOFAIL"). The changelog points out that the OOM killer has to
be invoked, otherwise the request would loop forever. But this
argument is rather weak because the OOM killer doesn't really guarantee
any forward progress for those exceptional cases:
	- it will hardly help to form a costly order page, which in turn
	  can result in a system panic because there is no OOM killable
	  task left in the end. I believe we certainly do not want to put
	  the system down just because there is a nasty driver asking for
	  an order-9 page with GFP_NOFAIL without realizing all the
	  consequences. It is much better for this request to loop
	  forever than to cause a massive system disruption
	- lowmem is also highly unlikely to be freed by the OOM killer
	- a GFP_NOFS request could trigger the OOM killer while there is
	  still a lot of memory pinned by filesystems.

The premature OOM killer invocation is a real issue, as reported by Nils Holland:
	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
	kworker/u4:5 cpuset=/ mems_allowed=0
	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
	Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
	Workqueue: writeback wb_workfn (flush-btrfs-1)
	 eff0b604 c142bcce eff0b734 00000000 eff0b634 c1163332 00000000 00000292
	 eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
	 eff0b678 c110795f c1043895 eff0b664 c11075c7 00000007 00000000 00000000
	Call Trace:
	 [<c142bcce>] dump_stack+0x47/0x69
	 [<c1163332>] dump_header+0x60/0x178
	 [<c1431876>] ? ___ratelimit+0x86/0xe0
	 [<c110795f>] oom_kill_process+0x20f/0x3d0
	 [<c1043895>] ? has_capability_noaudit+0x15/0x20
	 [<c11075c7>] ? oom_badness.part.13+0xb7/0x130
	 [<c1107df9>] out_of_memory+0xd9/0x260
	 [<c110ba0b>] __alloc_pages_nodemask+0xbfb/0xc80
	 [<c110414d>] pagecache_get_page+0xad/0x270
	 [<c13664a6>] alloc_extent_buffer+0x116/0x3e0
	 [<c1334a2e>] btrfs_find_create_tree_block+0xe/0x10
	[...]
	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
	lowmem_reserve[]: 0 0 21292 21292
	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

This is a GFP_NOFS|__GFP_NOFAIL request which invokes the OOM killer
because there is clearly nothing reclaimable in the zone Normal, while
there is a lot of page cache which is most probably pinned by the fs and
which GFP_NOFS cannot reclaim.
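
One way to read the zone lines in the report is to compare "free" with
"min" for each zone. The sketch below does only that, with the numbers
copied from the report. It is a deliberate simplification: the real
watermark check in the page allocator also accounts for lowmem_reserve
and the allocation order, and the demo_* names are made up.

```c
#include <stdbool.h>

struct demo_zone {
	const char *name;
	unsigned long free_kb;	/* "free:" value from the report */
	unsigned long min_kb;	/* "min:" value from the report */
};

/* Values copied from the OOM report quoted above. */
static const struct demo_zone demo_normal  = { "Normal",  41332,  41368 };
static const struct demo_zone demo_highmem = { "HighMem", 781660, 512 };

/* Simplified check: is the zone below its min watermark? */
static bool demo_below_min(const struct demo_zone *zone)
{
	return zone->free_kb < zone->min_kb;
}
```

With these numbers Normal sits just below its min watermark while HighMem
is far above it, which is consistent with a lowmem-only shortage.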

This patch simply removes the __GFP_NOFAIL special case in order to have
clearer semantics without surprising side effects. Instead we do
allow nofail requests to access memory reserves to move forward in both
cases: when the OOM killer is invoked and when it should be suppressed.
In the latter case we are more careful and only allow partial access
because we do not want to risk depleting the whole reserves. There
are users doing GFP_NOFS|__GFP_NOFAIL heavily (e.g. __getblk_gfp ->
grow_dev_page).
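
As a rough userspace model of the semantic change (not the kernel code),
the two predicates below answer "would the OOM killer kill a task for
this request?" before and after the patch. Only three of the skip rules
are modelled, the flag values and demo_* names are hypothetical, and the
costly order threshold of 3 mirrors PAGE_ALLOC_COSTLY_ORDER.

```c
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Hypothetical flag values for this sketch; not the kernel's definitions. */
#define DEMO_GFP_FS       0x1u
#define DEMO_GFP_NOFAIL   0x2u
#define DEMO_COSTLY_ORDER 3u

struct demo_request {
	gfp_t gfp_mask;
	unsigned int order;
	bool lowmem_only;	/* request restricted to zones below Normal */
};

/* A subset of the reasons for which the OOM killer is normally skipped. */
static bool demo_has_skip_reason(const struct demo_request *rq)
{
	return rq->order > DEMO_COSTLY_ORDER ||
	       rq->lowmem_only ||
	       !(rq->gfp_mask & DEMO_GFP_FS);
}

/* Before the patch: __GFP_NOFAIL overrides every skip rule. */
static bool demo_may_kill_old(const struct demo_request *rq)
{
	if (rq->gfp_mask & DEMO_GFP_NOFAIL)
		return true;
	return !demo_has_skip_reason(rq);
}

/* After the patch: the skip rules apply to nofail requests as well. */
static bool demo_may_kill_new(const struct demo_request *rq)
{
	return !demo_has_skip_reason(rq);
}
```

In this model a GFP_NOFS|__GFP_NOFAIL order-0 request kills under the old
rules and does not under the new ones, while a plain GFP_KERNEL-like
request behaves the same in both.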

Introduce the __alloc_pages_cpuset_fallback helper, which allows
bypassing allocation constraints for the given gfp mask while it
enforces cpusets whenever possible.
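
The "cpusets first, then anything" order of the helper can be modelled in
userspace as below. This is a sketch under assumptions, not the page
allocator: nodes are reduced to a free page counter, and demo_node,
demo_get_page() and DEMO_ALLOC_CPUSET are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ALLOC_CPUSET 0x1u

struct demo_node {
	bool in_cpuset;			/* node allowed by the task's cpuset */
	unsigned int free_pages;
};

/* Take one page from the first suitable node, or return NULL. */
static struct demo_node *demo_get_page(struct demo_node *nodes, size_t n,
				       unsigned int alloc_flags)
{
	for (size_t i = 0; i < n; i++) {
		if ((alloc_flags & DEMO_ALLOC_CPUSET) && !nodes[i].in_cpuset)
			continue;
		if (nodes[i].free_pages > 0) {
			nodes[i].free_pages--;
			return &nodes[i];
		}
	}
	return NULL;
}

/* Honour the cpuset first; ignore it only if the allowed nodes are empty. */
static struct demo_node *demo_cpuset_fallback(struct demo_node *nodes,
					      size_t n,
					      unsigned int alloc_flags)
{
	struct demo_node *node;

	node = demo_get_page(nodes, n, alloc_flags | DEMO_ALLOC_CPUSET);
	if (!node)
		node = demo_get_page(nodes, n, alloc_flags);

	return node;
}
```

As long as an allowed node has a free page the request stays inside the
cpuset; the second attempt is only made once the first one has failed.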

Reported-by: Nils Holland <nholland@tisys.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c   |  2 +-
 mm/page_alloc.c | 97 ++++++++++++++++++++++++++++++++++++---------------------
 2 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..12a6fce85f61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
 		return true;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 095e2fa286de..d6bc3e4f1a0c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3057,6 +3057,26 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 }
 
 static inline struct page *
+__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
+			      unsigned int alloc_flags,
+			      const struct alloc_context *ac)
+{
+	struct page *page;
+
+	page = get_page_from_freelist(gfp_mask, order,
+			alloc_flags|ALLOC_CPUSET, ac);
+	/*
+	 * fallback to ignore cpuset restriction if our nodes
+	 * are depleted
+	 */
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order,
+				alloc_flags, ac);
+
+	return page;
+}
+
+static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
@@ -3091,47 +3111,42 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto out;
 
-	if (!(gfp_mask & __GFP_NOFAIL)) {
-		/* Coredumps can quickly deplete all memory reserves */
-		if (current->flags & PF_DUMPCORE)
-			goto out;
-		/* The OOM killer will not help higher order allocs */
-		if (order > PAGE_ALLOC_COSTLY_ORDER)
-			goto out;
-		/* The OOM killer does not needlessly kill tasks for lowmem */
-		if (ac->high_zoneidx < ZONE_NORMAL)
-			goto out;
-		if (pm_suspended_storage())
-			goto out;
-		/*
-		 * XXX: GFP_NOFS allocations should rather fail than rely on
-		 * other request to make a forward progress.
-		 * We are in an unfortunate situation where out_of_memory cannot
-		 * do much for this context but let's try it to at least get
-		 * access to memory reserved if the current task is killed (see
-		 * out_of_memory). Once filesystems are ready to handle allocation
-		 * failures more gracefully we should just bail out here.
-		 */
+	/* Coredumps can quickly deplete all memory reserves */
+	if (current->flags & PF_DUMPCORE)
+		goto out;
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+	/* The OOM killer does not needlessly kill tasks for lowmem */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		goto out;
+	if (pm_suspended_storage())
+		goto out;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other request to make a forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserved if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
+	/* The OOM killer may not free memory on a specific node */
+	if (gfp_mask & __GFP_THISNODE)
+		goto out;
 
-		/* The OOM killer may not free memory on a specific node */
-		if (gfp_mask & __GFP_THISNODE)
-			goto out;
-	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
 		*did_some_progress = 1;
 
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		if (gfp_mask & __GFP_NOFAIL)
+			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
 					ALLOC_NO_WATERMARKS, ac);
-		}
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves but do not use ALLOC_NO_WATERMARKS because this
+		 * could deplete whole memory reserves which would just make
+		 * the situation worse
+		 */
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
+		if (page)
+			goto got_pg;
+
 		cond_resched();
 		goto retry;
 	}
-- 
2.10.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath
From: Michal Hocko @ 2016-12-16 15:58 UTC (permalink / raw)
  To: Nils Holland
  Cc: linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs,
	Michal Hocko
In-Reply-To: <20161216155808.12809-1-mhocko@kernel.org>

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has pointed out that 0a0337e0d1d1 ("mm, oom: rework oom
detection") has subtly changed the semantics for costly high-order requests
with __GFP_NOFAIL and without __GFP_REPEAT, and those can fail right now.
My code inspection didn't reveal any such users in the tree, but it is
true that this might lead to unexpected allocation failures and
subsequent oopses.

__alloc_pages_slowpath is currently hard to follow wrt. GFP_NOFAIL.
There are a few special cases, but we are lacking a catch-all place to be
sure we will not miss any case where a non-failing allocation might
fail. This patch reorganizes the code a bit and puts all those special
cases under the nopage label, which is the generic go-to-fail path.
Non-failing allocations are retried, and those that cannot retry, such as
non-sleeping allocations, go to the failure point directly. This should
make the code flow much easier to follow and less error prone
for future changes.

While we are there, we have to move the stall check up to catch
potentially looping non-failing allocations.
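
The decision taken at the nopage label can be summarised as a small pure
function. A rough userspace sketch, assuming hypothetical model_* names;
the real code additionally emits WARN_ON_ONCE for the unusual cases and
calls cond_resched() before retrying.

```c
#include <stdbool.h>

enum model_outcome { MODEL_FAIL, MODEL_RETRY };

/*
 * Model of the nopage label after this patch:
 * - a non-failing request that may sleep is always retried;
 * - a non-failing request that cannot do direct reclaim still fails
 *   (the kernel warns once about such a caller);
 * - everything else takes the ordinary failure path.
 */
enum model_outcome model_nopage(bool nofail, bool can_direct_reclaim)
{
	if (nofail) {
		if (!can_direct_reclaim)
			return MODEL_FAIL;
		return MODEL_RETRY;
	}
	return MODEL_FAIL;
}
```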

Changes since v1
- do not skip direct reclaim for TIF_MEMDIE && GFP_NOFAIL as per Hillf
- do not skip __alloc_pages_may_oom for TIF_MEMDIE && GFP_NOFAIL as
  per Tetsuo

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 31 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f2c9e535f7f..095e2fa286de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3640,35 +3640,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Caller is not willing to reclaim, we can't balance anything */
-	if (!can_direct_reclaim) {
-		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually allow this type of allocation
-		 * to fail.
-		 */
-		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
+	if (!can_direct_reclaim)
 		goto nopage;
-	}
 
-	/* Avoid recursion of direct reclaim */
-	if (current->flags & PF_MEMALLOC) {
-		/*
-		 * __GFP_NOFAIL request from this context is rather bizarre
-		 * because we cannot reclaim anything and only can loop waiting
-		 * for somebody to do a work for us.
-		 */
-		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
-			cond_resched();
-			goto retry;
-		}
-		goto nopage;
+	/* Make sure we know about allocations which stall for too long */
+	if (time_after(jiffies, alloc_start + stall_timeout)) {
+		warn_alloc(gfp_mask,
+			"page allocation stalls for %ums, order:%u",
+			jiffies_to_msecs(jiffies-alloc_start), order);
+		stall_timeout += 10 * HZ;
 	}
 
-	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+	/* Avoid recursion of direct reclaim */
+	if (current->flags & PF_MEMALLOC)
 		goto nopage;
 
-
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
 							&did_some_progress);
@@ -3692,14 +3678,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
 		goto nopage;
 
-	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask,
-			"page allocation stalls for %ums, order:%u",
-			jiffies_to_msecs(jiffies-alloc_start), order);
-		stall_timeout += 10 * HZ;
-	}
-
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
 				 did_some_progress > 0, &no_progress_loops))
 		goto retry;
@@ -3721,6 +3699,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	/* Avoid allocations with no watermarks from looping endlessly */
+	if (test_thread_flag(TIF_MEMDIE))
+		goto nopage;
+
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
@@ -3728,6 +3710,37 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 nopage:
+	/*
+	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
+	 * we always retry
+	 */
+	if (gfp_mask & __GFP_NOFAIL) {
+		/*
+		 * All existing users of the __GFP_NOFAIL are blockable, so warn
+		 * of any new users that actually require GFP_NOWAIT
+		 */
+		if (WARN_ON_ONCE(!can_direct_reclaim))
+			goto fail;
+
+		/*
+		 * PF_MEMALLOC request from this context is rather bizarre
+		 * because we cannot reclaim anything and only can loop waiting
+		 * for somebody to do a work for us
+		 */
+		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
+
+		/*
+		 * non failing costly orders are a hard requirement which we
+		 * are not prepared for much so let's warn about these users
+		 * so that we can identify them and convert them to something
+		 * else.
+		 */
+		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
+
+		cond_resched();
+		goto retry;
+	}
+fail:
 	warn_alloc(gfp_mask,
 			"page allocation failure: order:%u", order);
 got_pg:
-- 
2.10.2

^ permalink raw reply related

* Re: OOM: Better, but still there on
From: Michal Hocko @ 2016-12-16 15:58 UTC (permalink / raw)
  To: Nils Holland
  Cc: linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs
In-Reply-To: <20161216073941.GA26976@dhcp22.suse.cz>

On Fri 16-12-16 08:39:41, Michal Hocko wrote:
[...]
> That being said, the OOM killer invocation is clearly pointless and
> premature. We normally do not invoke it for GFP_NOFS requests
> exactly for these reasons. But this is GFP_NOFS|__GFP_NOFAIL which
> behaves differently. I am about to change that but my last attempt [1]
> has to be rethought.
> 
> Now another thing is that the __GFP_NOFAIL which has this nasty side
> effect was introduced by me in d1b5c5671d01 ("btrfs: Prevent from
> early transaction abort") in 4.3, so I am quite surprised that this has
> shown up only in 4.8. Anyway, there might be some other changes in
> btrfs which could make it more subtle.
> 
> I believe the right way to go around this is to pursue what I've started
> in [1]. I will try to prepare something for testing today for you. Stay
> tuned. But I would be really happy if somebody from the btrfs camp could
> check the NOFS aspect of this allocation. We have already seen
> allocation stalls from this path quite recently

Could you try to run with the two following patches?

^ permalink raw reply

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Andrea Arcangeli @ 2016-12-16 15:40 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: Michael S. Tsirkin, Hansen, Dave, David Hildenbrand,
	kvm@vger.kernel.org, mhocko@suse.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, dgilbert@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	kirill.shutemov@linux.intel.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E3C32A899@shsmsx102.ccr.corp.intel.com>

On Fri, Dec 16, 2016 at 01:12:21AM +0000, Li, Liang Z wrote:
> There is still the case where MAX_ORDER is configured to a large value, e.g. 36 for a system
> with a huge amount of memory; then there are only 28 bits left for the pfn, which is not enough.

Not related to the balloon but how would it help to set MAX_ORDER to
36?

What the MAX_ORDER affects is that you won't be able to ask the kernel
page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1), but
that's a driver issue not relevant to the amount of RAM. Drivers won't
suddenly start to ask the kernel allocator to allocate compound pages
at orders >= 11 just because more RAM was added.
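
To put numbers on that limit, here is a small userspace sketch of the
arithmetic. The helper name is made up, and the 4 KiB page size used in
the usage note is an assumption (it depends on the architecture and
configuration).

```c
#include <stdint.h>

/*
 * Largest contiguous allocation the page allocator can hand out, in
 * bytes: 2^(max_order - 1) pages of page_size bytes each.
 * Assumes 1 <= max_order and that the result fits in 64 bits.
 */
uint64_t max_contig_bytes(unsigned int max_order, uint64_t page_size)
{
	return ((uint64_t)1 << (max_order - 1)) * page_size;
}
```

With MAX_ORDER 11 and 4 KiB pages, max_contig_bytes(11, 4096) gives
4 MiB; with MAX_ORDER 36 the same formula gives 2^47 bytes, far beyond
what any driver would ask for in one piece.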

The higher the MAX_ORDER, the slower the kernel runs, so the smaller
the MAX_ORDER the better.

> Should  we limit the MAX_ORDER? I don't think so.

We shouldn't strictly depend on MAX_ORDER value but it's mostly
limited already even if configurable at build time.

We definitely need it to reach at least the hugepage size; beyond that it's
mostly a driver issue, but drivers requiring large contiguous
allocations should rely on CMA only, or on vmalloc if they only require
virtually contiguous memory, and not on a larger MAX_ORDER that would
slow down all kernel allocations/freeing.

^ permalink raw reply

* [PATCH 2/9 v2] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives
From: Michal Hocko @ 2016-12-16 15:40 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML
In-Reply-To: <20161215140715.12732-3-mhocko@kernel.org>

Updated patch after Mike noticed a BUG_ON when KM_NOLOCKDEP is used.
---

^ permalink raw reply

* Re: [PATCH 0/9 v2] scope GFP_NOFS api
From: Michal Hocko @ 2016-12-16 15:35 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML,
	Peter Zijlstra (Intel)
In-Reply-To: <1481900758.31172.20.camel@gmail.com>

On Fri 16-12-16 16:05:58, Mike Galbraith wrote:
> On Thu, 2016-12-15 at 15:07 +0100, Michal Hocko wrote:
> > Hi,
> > I have posted the previous version here [1]. Since then I have added a
> > support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
> > removing GFP_NOFS usage motivated by the lockdep false positives. On top
> > of that I've tried to convert few KM_NOFS usages to use the new flag in
> > the xfs code base. This would need a review from somebody familiar with
> > xfs of course.
> 
> The wild ass guess below prevents the xfs explosion below when running
> ltp zram tests.

Yes this looks correct. Thanks for noticing. I will fold it into
patch 2. Thanks for testing, Mike!
> 
> ---
>  fs/xfs/kmem.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -45,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  {
>  	gfp_t	lflags;
>  
> -	BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
> +	BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
>  
>  	if (flags & KM_NOSLEEP) {
>  		lflags = GFP_ATOMIC | __GFP_NOWARN;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 0/9 v2] scope GFP_NOFS api
From: Mike Galbraith @ 2016-12-16 15:05 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML,
	Michal Hocko, Peter Zijlstra (Intel)
In-Reply-To: <20161215140715.12732-1-mhocko@kernel.org>

On Thu, 2016-12-15 at 15:07 +0100, Michal Hocko wrote:
> Hi,
> I have posted the previous version here [1]. Since then I have added a
> support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
> removing GFP_NOFS usage motivated by the lockdep false positives. On top
> of that I've tried to convert few KM_NOFS usages to use the new flag in
> the xfs code base. This would need a review from somebody familiar with
> xfs of course.

The wild ass guess below prevents the xfs explosion below when running
ltp zram tests.
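
For anyone wondering why a new flag trips the BUG_ON: the check rejects
any bit outside an allow-list, so a flag that was added without
extending the list looks like garbage. A userspace sketch of the same
pattern follows; the DEMO_* values are invented for illustration and are
not the real KM_* constants from fs/xfs/kmem.h.

```c
#include <stdbool.h>

/* Hypothetical bit values, for illustration only. */
#define DEMO_KM_SLEEP     0x01u
#define DEMO_KM_NOSLEEP   0x02u
#define DEMO_KM_NOFS      0x04u
#define DEMO_KM_MAYFAIL   0x08u
#define DEMO_KM_ZERO      0x10u
#define DEMO_KM_NOLOCKDEP 0x20u

/* Allow-list before and after the one-line fix. */
#define DEMO_OLD_MASK (DEMO_KM_SLEEP | DEMO_KM_NOSLEEP | DEMO_KM_NOFS | \
		       DEMO_KM_MAYFAIL | DEMO_KM_ZERO)
#define DEMO_NEW_MASK (DEMO_OLD_MASK | DEMO_KM_NOLOCKDEP)

/* True when flags contains no bit outside the allowed set. */
bool demo_flags_valid(unsigned int flags, unsigned int allowed)
{
	return (flags & ~allowed) == 0;
}
```

A caller passing DEMO_KM_SLEEP | DEMO_KM_NOLOCKDEP is rejected against
DEMO_OLD_MASK and accepted against DEMO_NEW_MASK.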

---
 fs/xfs/kmem.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -45,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 {
 	gfp_t	lflags;
 
-	BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
+	BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
 
 	if (flags & KM_NOSLEEP) {
 		lflags = GFP_ATOMIC | __GFP_NOWARN;

[  108.775501] ------------[ cut here ]------------
[  108.775503] kernel BUG at fs/xfs/kmem.h:48!
[  108.775504] invalid opcode: 0000 [#1] SMP
[  108.775505] Dumping ftrace buffer:
[  108.775508]    (ftrace buffer empty)
[  108.775508] Modules linked in: xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) zram(E) ebtable_filter(E) ebtables(E) fuse(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) ip6_tables(E) x_tables(E) nls_iso8859_1(E) nls_cp437(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) nfsd(E) kvm(E) auth_rpcgss(E) nfs_acl(E) lockd(E) snd_hda_codec_realtek(E) snd_hda_codec_hdmi(E) pl2303(E) grace(E) snd_hda_codec_generic(E) usbserial(E) irqbypass(E) snd_hda_intel(E) snd_hda_codec(E)
[  108.775523]  snd_hwdep(E) sunrpc(E) snd_hda_core(E) crct10dif_pclmul(E) mei_me(E) mei(E) serio_raw(E) snd_pcm(E) crc32_pclmul(E) snd_timer(E) crc32c_intel(E) aesni_intel(E) aes_x86_64(E) crypto_simd(E) iTCO_wdt(E) iTCO_vendor_support(E) lpc_ich(E) mfd_core(E) snd(E) soundcore(E) joydev(E) fan(E) shpchp(E) tpm_infineon(E) cryptd(E) battery(E) thermal(E) pcspkr(E) glue_helper(E) usblp(E) intel_smartconnect(E) i2c_i801(E) efivarfs(E) hid_logitech_hidpp(E) hid_logitech_dj(E) hid_generic(E) usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) ahci(E) libahci(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ehci_pci(E) xhci_pci(E) ttm(E) ehci_hcd(E) xhci_hcd(E) r8169(E) libata(E) mii(E) drm(E) usbcore(E) fjes(E) video(E) button(E) af_packet(E) sd_mod(E) vfat(E) fat(E) ext4(E) crc16(E)
[  108.775540]  jbd2(E) mbcache(E) dm_mod(E) loop(E) sg(E) scsi_mod(E) autofs4(E)
[  108.775544] CPU: 5 PID: 4495 Comm: mount Tainted: G            E   4.10.0-master #4
[  108.775545] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[  108.775546] task: ffff8803f9e54e00 task.stack: ffffc900018fc000
[  108.775565] RIP: 0010:kmem_flags_convert.part.0+0x4/0x6 [xfs]
[  108.775565] RSP: 0018:ffffc900018ffcd8 EFLAGS: 00010202
[  108.775566] RAX: ffff8803f630a800 RBX: ffff8803f6b20000 RCX: 0000000000001000
[  108.775567] RDX: 0000000000001000 RSI: 0000000000000031 RDI: 00000000000000b0
[  108.775568] RBP: ffffc900018ffcd8 R08: 0000000000019fe0 R09: ffff8803f6b20000
[  108.775568] R10: 0000000000000005 R11: 0000000000010641 R12: ffff88041e21ea00
[  108.775569] R13: ffff8803f6b20000 R14: 0000000000000000 R15: 00000000fffffff4
[  108.775570] FS:  00007f1cbee9e840(0000) GS:ffff88041ed40000(0000) knlGS:0000000000000000
[  108.775571] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  108.775571] CR2: 00007f5d4b6ed000 CR3: 00000003fbf13000 CR4: 00000000001406e0
[  108.775572] Call Trace:
[  108.775588]  kmem_alloc+0x100/0x100 [xfs]
[  108.775591]  ? kstrndup+0x49/0x60
[  108.775605]  xfs_alloc_buftarg+0x23/0xd0 [xfs]
[  108.775619]  xfs_open_devices+0x8c/0x170 [xfs]
[  108.775621]  ? sb_set_blocksize+0x1d/0x50
[  108.775633]  xfs_fs_fill_super+0x234/0x580 [xfs]
[  108.775635]  mount_bdev+0x184/0x1c0
[  108.775647]  ? xfs_test_remount_options.isra.15+0x60/0x60 [xfs]
[  108.775658]  xfs_fs_mount+0x15/0x20 [xfs]
[  108.775659]  mount_fs+0x15/0x90
[  108.775661]  vfs_kern_mount+0x67/0x130
[  108.775663]  do_mount+0x190/0xbd0
[  108.775664]  ? memdup_user+0x42/0x60
[  108.775665]  SyS_mount+0x83/0xd0
[  108.775668]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  108.775669] RIP: 0033:0x7f1cbe7bf78a
[  108.775669] RSP: 002b:00007ffc99215198 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[  108.775670] RAX: ffffffffffffffda RBX: 00007f1cbeabb3b8 RCX: 00007f1cbe7bf78a
[  108.775671] RDX: 000055715a611690 RSI: 000055715a60d270 RDI: 000055715a60d2d0
[  108.775671] RBP: 000055715a60d120 R08: 0000000000000000 R09: 00007f1cbea7c678
[  108.775672] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f1cbecc8e78
[  108.775672] R13: 00000000ffffffff R14: 0000000000000000 R15: 000055715a60d060
[  108.775673] Code: ff 74 05 e8 c2 17 64 e0 48 8b 3d 6b ec 03 00 48 85 ff 74 05 e8 b1 17 64 e0 48 8b 3d 32 ec 03 00 e8 25 a6 74 e0 5d c3 55 48 89 e5 <0f> 0b 55 48 89 e5 e8 f4 53 ff ff 48 c7 c7 40 78 b7 a0 e8 18 01 
[  108.775700] RIP: kmem_flags_convert.part.0+0x4/0x6 [xfs] RSP: ffffc900018ffcd8

^ permalink raw reply

* Re: crash during oom reaper
From: Vegard Nossum @ 2016-12-16 14:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Rik van Riel, Matthew Wilcox,
	Peter Zijlstra, Andrew Morton, Al Viro, Ingo Molnar,
	Linus Torvalds
In-Reply-To: <20161216143235.GO13940@dhcp22.suse.cz>

On 12/16/2016 03:32 PM, Michal Hocko wrote:
> On Fri 16-12-16 15:25:27, Vegard Nossum wrote:
>> On 12/16/2016 03:00 PM, Michal Hocko wrote:
>>> On Fri 16-12-16 14:14:17, Vegard Nossum wrote:
>>> [...]
>>>> Out of memory: Kill process 1650 (trinity-main) score 90 or sacrifice child
>>>> Killed process 1724 (trinity-c14) total-vm:37280kB, anon-rss:236kB,
>>>> file-rss:112kB, shmem-rss:112kB
>>>> BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
>>>> IP: [<ffffffff8126b1c0>] copy_process.part.41+0x2150/0x5580
>>>> PGD c001067 PUD c000067
>>>> PMD 0
>>>> Oops: 0002 [#1] PREEMPT SMP KASAN
>>>> Dumping ftrace buffer:
>>>>    (ftrace buffer empty)
>>>> CPU: 28 PID: 1650 Comm: trinity-main Not tainted 4.9.0-rc6+ #317
>>>
>>> Hmm, so this was the oom victim initially but we have decided to kill
>>> its child 1724 instead.
>>>
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>> task: ffff88000f9bc440 task.stack: ffff88000c778000
>>>> RIP: 0010:[<ffffffff8126b1c0>]  [<ffffffff8126b1c0>]
>>>> copy_process.part.41+0x2150/0x5580
>>>
>>> Could you match this to the kernel source please?
>>
>> kernel/fork.c:629 dup_mmap()
>
> Ok, so this is before the child is made visible so the oom reaper
> couldn't have seen it.
>
>> it's atomic_dec(&inode->i_writecount), it matches up with
>> file_inode(file) == NULL:
>>
>> (gdb) p &((struct inode *)0)->i_writecount
>> $1 = (atomic_t *) 0x1e8 <irq_stack_union+488>
>
> is this a p9 inode?

When I looked at this before it always crashed in this spot for the very
first VMA in the mm (which happens to be the exe, which is on a 9p root fs).

I added a trace_printk() to dup_mmap() to print inode->i_sb->s_type and
the last thing I see for a new crash in the same place is:

trinity--9280   28.... 136345090us : copy_process.part.41: ffffffff8485ec40
---------------------------------
CPU: 0 PID: 9302 Comm: trinity-c0 Not tainted 4.9.0-rc8+ #332
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff880000070000 task.stack: ffff8800099e0000
RIP: 0010:[<ffffffff8126c7c9>]  [<ffffffff8126c7c9>] 
copy_process.part.41+0x22c9/0x55b0

As you can see, the addresses match:

(gdb) p &v9fs_fs_type
$1 = (struct file_system_type *) 0xffffffff8485ec40 <v9fs_fs_type>

So I think we can safely say that yes, it's a p9 inode.
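
As a side note for anyone reading the oops: the gdb expression quoted
above works because a field access through a NULL pointer faults at the
field's offset. A minimal userspace sketch of that reasoning, using a
made-up struct whose layout is chosen to reproduce the 0x1e8 offset (it
is not the kernel's struct inode):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: padding sized so the counter sits at 0x1e8. */
struct demo_inode {
	char other_fields[0x1e8];
	int  i_writecount;
};

/* Address touched when accessing i_writecount through the given base. */
uintptr_t demo_fault_address(uintptr_t base)
{
	return base + offsetof(struct demo_inode, i_writecount);
}
```

demo_fault_address(0) evaluates to 0x1e8, matching the faulting address
in the first report.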


Vegard

^ permalink raw reply

* [PATCH 31/42] userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory
From: Andrea Arcangeli @ 2016-12-16 14:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michael Rapoport, Dr. David Alan Gilbert, Mike Kravetz,
	Pavel Emelyanov, Hillf Danton
In-Reply-To: <20161216144821.5183-1-aarcange@redhat.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

shmem_mcopy_atomic_pte implements the low-level part of the UFFDIO_COPY
operation for shared memory VMAs. It is based on mcopy_atomic_pte, with
the adjustments necessary for shared memory pages.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/userfaultfd.c | 34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 31207b4..a0817cc 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -16,6 +16,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
 #include <linux/pagemap.h>
+#include <linux/shmem_fs.h>
 #include <asm/tlbflush.h>
 #include "internal.h"
 
@@ -369,7 +370,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 */
 	err = -EINVAL;
 	dst_vma = find_vma(dst_mm, dst_start);
-	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+	if (!dst_vma)
+		goto out_unlock;
+	if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED)
 		goto out_unlock;
 	if (dst_start < dst_vma->vm_start ||
 	    dst_start + len > dst_vma->vm_end)
@@ -394,11 +397,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	if (!dst_vma->vm_userfaultfd_ctx.ctx)
 		goto out_unlock;
 
-	/*
-	 * FIXME: only allow copying on anonymous vmas, tmpfs should
-	 * be added.
-	 */
-	if (!vma_is_anonymous(dst_vma))
+	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
 
 	/*
@@ -407,7 +406,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 * dst_vma.
 	 */
 	err = -ENOMEM;
-	if (unlikely(anon_vma_prepare(dst_vma)))
+	if (vma_is_anonymous(dst_vma) && unlikely(anon_vma_prepare(dst_vma)))
 		goto out_unlock;
 
 	while (src_addr < src_start + len) {
@@ -444,12 +443,21 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		BUG_ON(pmd_none(*dst_pmd));
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
-		if (!zeropage)
-			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
-					       dst_addr, src_addr, &page);
-		else
-			err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
-						 dst_addr);
+		if (vma_is_anonymous(dst_vma)) {
+			if (!zeropage)
+				err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
+						       dst_addr, src_addr,
+						       &page);
+			else
+				err = mfill_zeropage_pte(dst_mm, dst_pmd,
+							 dst_vma, dst_addr);
+		} else {
+			err = -EINVAL; /* if zeropage is true return -EINVAL */
+			if (likely(!zeropage))
+				err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
+							     dst_vma, dst_addr,
+							     src_addr, &page);
+		}
 
 		cond_resched();
 

^ permalink raw reply related

* [PATCH 32/42] userfaultfd: shmem: add userfaultfd hook for shared memory faults
From: Andrea Arcangeli @ 2016-12-16 14:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michael Rapoport, Dr. David Alan Gilbert, Mike Kravetz,
	Pavel Emelyanov, Hillf Danton
In-Reply-To: <20161216144821.5183-1-aarcange@redhat.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

When processing a page fault in a shared memory area for a not-present
page, check the VMA to determine whether faults are to be handled by
userfaultfd. If so, delegate the page fault to handle_userfault.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/shmem.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 5cc1cb2..75866a3 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -72,6 +72,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/syscalls.h>
 #include <linux/fcntl.h>
 #include <uapi/linux/memfd.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/rmap.h>
 
 #include <asm/uaccess.h>
@@ -118,13 +119,14 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 				struct shmem_inode_info *info, pgoff_t index);
 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		struct page **pagep, enum sgp_type sgp,
-		gfp_t gfp, struct mm_struct *fault_mm, int *fault_type);
+		gfp_t gfp, struct vm_area_struct *vma,
+		struct vm_fault *vmf, int *fault_type);
 
 int shmem_getpage(struct inode *inode, pgoff_t index,
 		struct page **pagep, enum sgp_type sgp)
 {
 	return shmem_getpage_gfp(inode, index, pagep, sgp,
-		mapping_gfp_mask(inode->i_mapping), NULL, NULL);
+		mapping_gfp_mask(inode->i_mapping), NULL, NULL, NULL);
 }
 
 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
@@ -1571,7 +1573,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
  */
 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	struct page **pagep, enum sgp_type sgp, gfp_t gfp,
-	struct mm_struct *fault_mm, int *fault_type)
+	struct vm_area_struct *vma, struct vm_fault *vmf, int *fault_type)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
@@ -1625,7 +1627,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	 * bring it back from swap or allocate.
 	 */
 	sbinfo = SHMEM_SB(inode->i_sb);
-	charge_mm = fault_mm ? : current->mm;
+	charge_mm = vma ? vma->vm_mm : current->mm;
 
 	if (swap.val) {
 		/* Look it up and read it in.. */
@@ -1635,7 +1637,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 			if (fault_type) {
 				*fault_type |= VM_FAULT_MAJOR;
 				count_vm_event(PGMAJFAULT);
-				mem_cgroup_count_vm_event(fault_mm, PGMAJFAULT);
+				mem_cgroup_count_vm_event(charge_mm,
+							  PGMAJFAULT);
 			}
 			/* Here we actually start the io */
 			page = shmem_swapin(swap, gfp, info, index);
@@ -1704,6 +1707,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		swap_free(swap);
 
 	} else {
+		if (vma && userfaultfd_missing(vma)) {
+			*fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
+			return 0;
+		}
+
 		/* shmem_symlink() */
 		if (mapping->a_ops != &shmem_aops)
 			goto alloc_nohuge;
@@ -1966,7 +1974,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		sgp = SGP_NOHUGE;
 
 	error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
-				  gfp, vma->vm_mm, &ret);
+				  gfp, vma, vmf, &ret);
 	if (error)
 		return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
 	return ret;
@@ -4252,7 +4260,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 
 	BUG_ON(mapping->a_ops != &shmem_aops);
 	error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE,
-				  gfp, NULL, NULL);
+				  gfp, NULL, NULL, NULL);
 	if (error)
 		page = ERR_PTR(error);
 	else


* [PATCH 37/42] userfaultfd: hugetlbfs: UFFD_FEATURE_MISSING_SHMEM
From: Andrea Arcangeli @ 2016-12-16 14:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michael Rapoport, Dr. David Alan Gilbert, Mike Kravetz,
	Pavel Emelyanov, Hillf Danton
In-Reply-To: <20161216144821.5183-1-aarcange@redhat.com>

Userland developers asked to be notified immediately by the UFFDIO_API
ioctl if shmem missing mode is supported by userfaultfd in the running
kernel. This avoids the need to run UFFDIO_REGISTER on a shmem virtual
memory range to find out.
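
As a rough illustration of the check this enables (a minimal sketch, not
part of the patch): once UFFDIO_API has succeeded, userland can test the
feature bit in uffdio_api.features. The helper name
uffd_supports_shmem_missing() is hypothetical, and the EXAMPLE_ constant
is a local copy of the (1<<5) value this patch adds, so the sketch does
not depend on an updated uapi header.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Local mirror of UFFD_FEATURE_MISSING_SHMEM (1<<5) as defined by this
 * patch; an assumption made so the sketch builds on older headers.
 */
#define EXAMPLE_UFFD_FEATURE_MISSING_SHMEM	((uint64_t)1 << 5)

/*
 * 'features' is assumed to be the value the kernel wrote back into
 * uffdio_api.features after a successful UFFDIO_API ioctl.
 */
static bool uffd_supports_shmem_missing(uint64_t features)
{
	return (features & EXAMPLE_UFFD_FEATURE_MISSING_SHMEM) != 0;
}
```

With a check like this there is no need for a trial UFFDIO_REGISTER on a
shmem range just to probe for support.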

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 10631a4..9ac4b68 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -21,7 +21,8 @@
 #define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |		\
 			   UFFD_FEATURE_EVENT_REMAP |		\
 			   UFFD_FEATURE_EVENT_MADVDONTNEED |	\
-			   UFFD_FEATURE_MISSING_HUGETLBFS)
+			   UFFD_FEATURE_MISSING_HUGETLBFS |	\
+			   UFFD_FEATURE_MISSING_SHMEM)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -146,12 +147,17 @@ struct uffdio_api {
 	 *    it, so userland can later check if the feature flag is
 	 *    present in uffdio_api.features after UFFDIO_API
 	 *    succeeded.
+	 *
+	 * UFFD_FEATURE_MISSING_SHMEM works the same as
+	 * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
+	 * (i.e. tmpfs and other shmem based APIs).
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
 #define UFFD_FEATURE_EVENT_REMAP		(1<<2)
 #define UFFD_FEATURE_EVENT_MADVDONTNEED		(1<<3)
 #define UFFD_FEATURE_MISSING_HUGETLBFS		(1<<4)
+#define UFFD_FEATURE_MISSING_SHMEM		(1<<5)
 	__u64 features;
 
 	__u64 ioctls;


* [PATCH 11/42] userfaultfd: non-cooperative: Add mremap() event
From: Andrea Arcangeli @ 2016-12-16 14:47 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michael Rapoport, Dr. David Alan Gilbert, Mike Kravetz,
	Pavel Emelyanov, Hillf Danton
In-Reply-To: <20161216144821.5183-1-aarcange@redhat.com>

From: Pavel Emelyanov <xemul@parallels.com>

The event denotes that an area [start:end] has moved to a different
location. A length change isn't reported as "new" addresses: if
they appear on the uffd reader side they will not contain any
data, and the reader can just zeromap them.

Waiting for the event ACK is also done outside of mmap_sem, as
for the fork event.
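
For illustration only, a uffd reader might follow a move along these
lines. This is a sketch under assumptions: struct tracked_range and
tracked_range_apply_remap() are hypothetical reader-side names, and
struct example_remap is a local mirror of the from/to/len payload added
to struct uffd_msg below. It assumes the whole tracked range moved.

```c
#include <stdint.h>

/* Hypothetical reader-side record of one registered range. */
struct tracked_range {
	uint64_t start;
	uint64_t len;
};

/* Local mirror of the remap payload (from, to, len) in this patch. */
struct example_remap {
	uint64_t from;
	uint64_t to;
	uint64_t len;
};

/*
 * Rebase the tracked range if the event's 'from' matches its start.
 * The tracked length is left alone, since a length change is not
 * reported by the event. Returns 1 if the range was updated, else 0.
 */
static int tracked_range_apply_remap(struct tracked_range *r,
				     const struct example_remap *ev)
{
	if (ev->from != r->start)
		return 0;
	r->start = ev->to;
	return 1;
}
```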

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c                 | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h    | 17 +++++++++++++++++
 include/uapi/linux/userfaultfd.h | 11 ++++++++++-
 mm/mremap.c                      | 17 ++++++++++++-----
 4 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 9bb7caf..c047b6f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -563,6 +563,43 @@ void dup_userfaultfd_complete(struct list_head *fcs)
 	}
 }
 
+void mremap_userfaultfd_prep(struct vm_area_struct *vma,
+			     struct vm_userfaultfd_ctx *vm_ctx)
+{
+	struct userfaultfd_ctx *ctx;
+
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (ctx && (ctx->features & UFFD_FEATURE_EVENT_REMAP)) {
+		vm_ctx->ctx = ctx;
+		userfaultfd_ctx_get(ctx);
+	}
+}
+
+void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx,
+				 unsigned long from, unsigned long to,
+				 unsigned long len)
+{
+	struct userfaultfd_ctx *ctx = vm_ctx.ctx;
+	struct userfaultfd_wait_queue ewq;
+
+	if (!ctx)
+		return;
+
+	if (to & ~PAGE_MASK) {
+		userfaultfd_ctx_put(ctx);
+		return;
+	}
+
+	msg_init(&ewq.msg);
+
+	ewq.msg.event = UFFD_EVENT_REMAP;
+	ewq.msg.arg.remap.from = from;
+	ewq.msg.arg.remap.to = to;
+	ewq.msg.arg.remap.len = len;
+
+	userfaultfd_event_wait_completion(ctx, &ewq);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 79002bc..7f318a4 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -55,6 +55,12 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
 extern void dup_userfaultfd_complete(struct list_head *);
 
+extern void mremap_userfaultfd_prep(struct vm_area_struct *,
+				    struct vm_userfaultfd_ctx *);
+extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx,
+					unsigned long from, unsigned long to,
+					unsigned long len);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -89,6 +95,17 @@ static inline void dup_userfaultfd_complete(struct list_head *l)
 {
 }
 
+static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma,
+					   struct vm_userfaultfd_ctx *ctx)
+{
+}
+
+static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
+					       unsigned long from,
+					       unsigned long to,
+					       unsigned long len)
+{
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index c8953c8..79a85e5 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -18,7 +18,8 @@
  * means the userland is reading).
  */
 #define UFFD_API ((__u64)0xAA)
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |	    \
+			   UFFD_FEATURE_EVENT_REMAP)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -77,6 +78,12 @@ struct uffd_msg {
 		} fork;
 
 		struct {
+			__u64	from;
+			__u64	to;
+			__u64	len;
+		} remap;
+
+		struct {
 			/* unused reserved fields */
 			__u64	reserved1;
 			__u64	reserved2;
@@ -90,6 +97,7 @@ struct uffd_msg {
  */
 #define UFFD_EVENT_PAGEFAULT	0x12
 #define UFFD_EVENT_FORK		0x13
+#define UFFD_EVENT_REMAP	0x14
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
@@ -110,6 +118,7 @@ struct uffdio_api {
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
+#define UFFD_FEATURE_EVENT_REMAP		(1<<2)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/mremap.c b/mm/mremap.c
index 30d7d24..504b560 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -22,6 +22,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/uaccess.h>
 #include <linux/mm-arch-hooks.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
@@ -250,7 +251,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 static unsigned long move_vma(struct vm_area_struct *vma,
 		unsigned long old_addr, unsigned long old_len,
-		unsigned long new_len, unsigned long new_addr, bool *locked)
+		unsigned long new_len, unsigned long new_addr,
+		bool *locked, struct vm_userfaultfd_ctx *uf)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma;
@@ -309,6 +311,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		old_addr = new_addr;
 		new_addr = err;
 	} else {
+		mremap_userfaultfd_prep(new_vma, uf);
 		arch_remap(mm, old_addr, old_addr + old_len,
 			   new_addr, new_addr + new_len);
 	}
@@ -413,7 +416,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 }
 
 static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
-		unsigned long new_addr, unsigned long new_len, bool *locked)
+		unsigned long new_addr, unsigned long new_len, bool *locked,
+		struct vm_userfaultfd_ctx *uf)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
@@ -458,7 +462,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (offset_in_page(ret))
 		goto out1;
 
-	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked);
+	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, uf);
 	if (!(offset_in_page(ret)))
 		goto out;
 out1:
@@ -497,6 +501,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	unsigned long ret = -EINVAL;
 	unsigned long charged = 0;
 	bool locked = false;
+	struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
 
 	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
 		return ret;
@@ -523,7 +528,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 
 	if (flags & MREMAP_FIXED) {
 		ret = mremap_to(addr, old_len, new_addr, new_len,
-				&locked);
+				&locked, &uf);
 		goto out;
 	}
 
@@ -592,7 +597,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 			goto out;
 		}
 
-		ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
+		ret = move_vma(vma, addr, old_len, new_len, new_addr,
+			       &locked, &uf);
 	}
 out:
 	if (offset_in_page(ret)) {
@@ -602,5 +608,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	up_write(&current->mm->mmap_sem);
 	if (locked && new_len > old_len)
 		mm_populate(new_addr + old_len, new_len - old_len);
+	mremap_userfaultfd_complete(uf, addr, new_addr, old_len);
 	return ret;
 }


* [PATCH 23/42] userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges
From: Andrea Arcangeli @ 2016-12-16 14:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michael Rapoport, Dr. David Alan Gilbert, Mike Kravetz,
	Pavel Emelyanov, Hillf Danton
In-Reply-To: <20161216144821.5183-1-aarcange@redhat.com>

From: Mike Kravetz <mike.kravetz@oracle.com>

Add routine userfaultfd_huge_must_wait, which has the same functionality as
the existing userfaultfd_must_wait routine.  The only difference is that the
new routine must handle the page table structure for hugepmd vmas.
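
The wait decision itself can be modelled outside the kernel as follows.
This is only a sketch: struct example_huge_pte and
example_huge_must_wait() are hypothetical names that flatten the results
of huge_pte_offset(), huge_pte_none() and huge_pte_write() into plain
booleans, and wp_fault stands in for (reason & VM_UFFD_WP).

```c
#include <stdbool.h>

/* Hypothetical flattened view of what the routine reads. */
struct example_huge_pte {
	bool found;	/* huge_pte_offset() returned non-NULL */
	bool none;	/* huge_pte_none() is true */
	bool writable;	/* huge_pte_write() is true */
};

/* Mirrors the order of the checks in userfaultfd_huge_must_wait(). */
static bool example_huge_must_wait(const struct example_huge_pte *p,
				   bool wp_fault)
{
	if (!p->found)
		return true;	/* no page table entry yet */
	if (p->none)
		return true;	/* entry exists but is still empty */
	if (!p->writable && wp_fault)
		return true;	/* write-protect fault not resolved */
	return false;
}
```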

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1268496..92614c0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -195,6 +195,49 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
 	return msg;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Same functionality as userfaultfd_must_wait below with modifications for
+ * hugepmd ranges.
+ */
+static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
+					 unsigned long address,
+					 unsigned long flags,
+					 unsigned long reason)
+{
+	struct mm_struct *mm = ctx->mm;
+	pte_t *pte;
+	bool ret = true;
+
+	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	pte = huge_pte_offset(mm, address);
+	if (!pte)
+		goto out;
+
+	ret = false;
+
+	/*
+	 * Lockless access: we're in a wait_event so it's ok if it
+	 * changes under us.
+	 */
+	if (huge_pte_none(*pte))
+		ret = true;
+	if (!huge_pte_write(*pte) && (reason & VM_UFFD_WP))
+		ret = true;
+out:
+	return ret;
+}
+#else
+static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
+					 unsigned long address,
+					 unsigned long flags,
+					 unsigned long reason)
+{
+	return false;	/* should never get here */
+}
+#endif /* CONFIG_HUGETLB_PAGE */
+
 /*
  * Verify the pagetables are still not ok after having reigstered into
  * the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any
@@ -368,8 +411,12 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 			  TASK_KILLABLE);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
-	must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
-					  reason);
+	if (!is_vm_hugetlb_page(vmf->vma))
+		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
+						  reason);
+	else
+		must_wait = userfaultfd_huge_must_wait(ctx, vmf->address,
+						       vmf->flags, reason);
 	up_read(&mm->mmap_sem);
 
 	if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&


* [PATCH 18/42] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
From: Andrea Arcangeli @ 2016-12-16 14:47 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michael Rapoport, Dr. David Alan Gilbert, Mike Kravetz,
	Pavel Emelyanov, Hillf Danton
In-Reply-To: <20161216144821.5183-1-aarcange@redhat.com>

From: Mike Kravetz <mike.kravetz@oracle.com>

__mcopy_atomic_hugetlb performs the UFFDIO_COPY operation for huge
pages.  It is based on the existing __mcopy_atomic routine for normal
pages.  Unlike normal pages, there is no huge page support for the
UFFDIO_ZEROPAGE operation.
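
As a small sketch of the alignment rule the new routine enforces before
copying: both the destination start and the length must be multiples of
the vma's huge page size. The helper name example_hugetlb_copy_aligned()
is hypothetical, and the mask trick assumes hpagesize is a power of two.

```c
#include <stdbool.h>

/*
 * Returns true when dst_start and len are both multiples of hpagesize.
 * Assumes hpagesize is a power of two, so (hpagesize - 1) is a mask of
 * the low bits that must all be zero.
 */
static bool example_hugetlb_copy_aligned(unsigned long dst_start,
					 unsigned long len,
					 unsigned long hpagesize)
{
	return !((dst_start & (hpagesize - 1)) || (len & (hpagesize - 1)));
}
```

A request that fails this check is rejected with -EINVAL in the routine
below.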

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/userfaultfd.c | 186 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 186 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9c2ed70..ef0495b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -14,6 +14,8 @@
 #include <linux/swapops.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hugetlb.h>
+#include <linux/pagemap.h>
 #include <asm/tlbflush.h>
 #include "internal.h"
 
@@ -139,6 +141,183 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 	return pmd;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * __mcopy_atomic processing for HUGETLB vmas.  Note that this routine is
+ * called with mmap_sem held, it will release mmap_sem before returning.
+ */
+static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
+					      struct vm_area_struct *dst_vma,
+					      unsigned long dst_start,
+					      unsigned long src_start,
+					      unsigned long len,
+					      bool zeropage)
+{
+	ssize_t err;
+	pte_t *dst_pte;
+	unsigned long src_addr, dst_addr;
+	long copied;
+	struct page *page;
+	struct hstate *h;
+	unsigned long vma_hpagesize;
+	pgoff_t idx;
+	u32 hash;
+	struct address_space *mapping;
+
+	/*
+	 * There is no default zero huge page for all huge page sizes as
+	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
+	 * by THP.  Since we can not reliably insert a zero page, this
+	 * feature is not supported.
+	 */
+	if (zeropage) {
+		up_read(&dst_mm->mmap_sem);
+		return -EINVAL;
+	}
+
+	src_addr = src_start;
+	dst_addr = dst_start;
+	copied = 0;
+	page = NULL;
+	vma_hpagesize = vma_kernel_pagesize(dst_vma);
+
+	/*
+	 * Validate alignment based on huge page size
+	 */
+	err = -EINVAL;
+	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
+		goto out_unlock;
+
+retry:
+	/*
+	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
+	 * retry, dst_vma will be set to NULL and we must lookup again.
+	 */
+	if (!dst_vma) {
+		err = -EINVAL;
+		dst_vma = find_vma(dst_mm, dst_start);
+		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
+			goto out_unlock;
+
+		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
+			goto out_unlock;
+
+		/*
+		 * Make sure the vma is not shared, that the remaining dst
+		 * range is both valid and fully within a single existing vma.
+		 */
+		if (dst_vma->vm_flags & VM_SHARED)
+			goto out_unlock;
+		if (dst_start < dst_vma->vm_start ||
+		    dst_start + len > dst_vma->vm_end)
+			goto out_unlock;
+	}
+
+	if (WARN_ON(dst_addr & (vma_hpagesize - 1) ||
+		    (len - copied) & (vma_hpagesize - 1)))
+		goto out_unlock;
+
+	/*
+	 * Only allow __mcopy_atomic_hugetlb on userfaultfd registered ranges.
+	 */
+	if (!dst_vma->vm_userfaultfd_ctx.ctx)
+		goto out_unlock;
+
+	/*
+	 * Ensure the dst_vma has an anon_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out_unlock;
+
+	h = hstate_vma(dst_vma);
+
+	while (src_addr < src_start + len) {
+		pte_t dst_pteval;
+
+		BUG_ON(dst_addr >= dst_start + len);
+		VM_BUG_ON(dst_addr & ~huge_page_mask(h));
+
+		/*
+		 * Serialize via hugetlb_fault_mutex
+		 */
+		idx = linear_page_index(dst_vma, dst_addr);
+		mapping = dst_vma->vm_file->f_mapping;
+		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
+								idx, dst_addr);
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+		err = -ENOMEM;
+		dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
+		if (!dst_pte) {
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			goto out_unlock;
+		}
+
+		err = -EEXIST;
+		dst_pteval = huge_ptep_get(dst_pte);
+		if (!huge_pte_none(dst_pteval)) {
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			goto out_unlock;
+		}
+
+		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
+						dst_addr, src_addr, &page);
+
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
+		cond_resched();
+
+		if (unlikely(err == -EFAULT)) {
+			up_read(&dst_mm->mmap_sem);
+			BUG_ON(!page);
+
+			err = copy_huge_page_from_user(page,
+						(const void __user *)src_addr,
+						pages_per_huge_page(h));
+			if (unlikely(err)) {
+				err = -EFAULT;
+				goto out;
+			}
+			down_read(&dst_mm->mmap_sem);
+
+			dst_vma = NULL;
+			goto retry;
+		} else
+			BUG_ON(page);
+
+		if (!err) {
+			dst_addr += vma_hpagesize;
+			src_addr += vma_hpagesize;
+			copied += vma_hpagesize;
+
+			if (fatal_signal_pending(current))
+				err = -EINTR;
+		}
+		if (err)
+			break;
+	}
+
+out_unlock:
+	up_read(&dst_mm->mmap_sem);
+out:
+	if (page)
+		put_page(page);
+	BUG_ON(copied < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!copied && !err);
+	return copied ? copied : err;
+}
+#else /* !CONFIG_HUGETLB_PAGE */
+/* fail at build time if gcc attempts to use this */
+extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
+				      struct vm_area_struct *dst_vma,
+				      unsigned long dst_start,
+				      unsigned long src_start,
+				      unsigned long len,
+				      bool zeropage);
+#endif /* CONFIG_HUGETLB_PAGE */
+
 static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 					      unsigned long dst_start,
 					      unsigned long src_start,
@@ -182,6 +361,13 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		goto out_unlock;
 
 	/*
+	 * If this is a HUGETLB vma, pass off to appropriate routine
+	 */
+	if (is_vm_hugetlb_page(dst_vma))
+		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
+						src_start, len, zeropage);
+
+	/*
 	 * Be strict and only allow __mcopy_atomic on userfaultfd
 	 * registered ranges to prevent userland errors going
 	 * unnoticed. As far as the VM consistency is concerned, it


* [PATCH 42/42] mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock
From: Andrea Arcangeli @ 2016-12-16 14:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michael Rapoport, Dr. David Alan Gilbert, Mike Kravetz,
	Pavel Emelyanov, Hillf Danton
In-Reply-To: <20161216144821.5183-1-aarcange@redhat.com>

pmd_trans_unstable does an atomic read on the pmd so it doesn't
require the pmd_lock for the same check.

This also removes the special assumption that the mmap_sem is held for
writing if prot_numa is not set. userfaultfd will hold the mmap_sem
only for reading in change_pte_range like prot_numa, but it will not
set prot_numa.

This is always a valid micro-optimization regardless of userfaultfd.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mprotect.c | 44 +++++++++++++++-----------------------------
 1 file changed, 15 insertions(+), 29 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index cc2459c..98acf7d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -33,34 +33,6 @@
 
 #include "internal.h"
 
-/*
- * For a prot_numa update we only hold mmap_sem for read so there is a
- * potential race with faulting where a pmd was temporarily none. This
- * function checks for a transhuge pmd under the appropriate lock. It
- * returns a pte if it was successfully locked or NULL if it raced with
- * a transhuge insertion.
- */
-static pte_t *lock_pte_protection(struct vm_area_struct *vma, pmd_t *pmd,
-			unsigned long addr, int prot_numa, spinlock_t **ptl)
-{
-	pte_t *pte;
-	spinlock_t *pmdl;
-
-	/* !prot_numa is protected by mmap_sem held for write */
-	if (!prot_numa)
-		return pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
-
-	pmdl = pmd_lock(vma->vm_mm, pmd);
-	if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) {
-		spin_unlock(pmdl);
-		return NULL;
-	}
-
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
-	spin_unlock(pmdl);
-	return pte;
-}
-
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable, int prot_numa)
@@ -71,7 +43,21 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	unsigned long pages = 0;
 	int target_node = NUMA_NO_NODE;
 
-	pte = lock_pte_protection(vma, pmd, addr, prot_numa, &ptl);
+	/*
+	 * Can be called with only the mmap_sem for reading by
+	 * prot_numa so we must check the pmd isn't constantly
+	 * changing from under us from pmd_none to pmd_trans_huge
+	 * and/or the other way around.
+	 */
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	/*
+	 * The pmd points to a regular pte so the pmd can't change
+	 * from under us even if the mmap_sem is only held for
+	 * reading.
+	 */
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	if (!pte)
 		return 0;
 


